https://github.com/broadinstitute/emmaemb

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization broadinstitute has institutional domain (www.broadinstitute.org)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: broadinstitute
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 132 MB
Statistics
  • Stars: 15
  • Watchers: 4
  • Forks: 3
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

EmmaEmb

EmmaEmb is a Python library that facilitates the exploratory comparison of diverse embedding spaces for molecular biology. By incorporating user-defined feature data on the natural grouping of data points, EmmaEmb lets users compare global statistics and understand how the clustering of these natural groupings differs across embedding spaces.

Although designed for application to embeddings of molecular biology data (e.g. protein sequences), the library is general and can be applied to any kind of embedding space.

How to cite

If you use EmmaEmb, please cite the pre-print:

  • Rissom, P. F., Yanez Sarmiento, P., Safer, J., Coley, C. W., Renard, B. Y., Heyne, H. O., Iqbal, S. Decoding protein language models: insights from embedding space analysis. bioRxiv (2025), https://doi.org/10.1101/2024.06.21.600139

or, if you prefer the BibTeX format:

```
@article{Rissom2024.06.21.600139,
  author    = {Rissom, Pia Francesca and Sarmiento, Paulo Yanez and Safer, Jordan and Coley, Connor W. and Renard, Bernhard Y. and Heyne, Henrike O. and Iqbal, Sumaiya},
  title     = {Decoding protein language models: insights from embedding space analysis},
  year      = {2025},
  doi       = {10.1101/2024.06.21.600139},
  publisher = {Cold Spring Harbor Laboratory},
  journal   = {bioRxiv}
}
```

Overview

Workflow

The following figure provides an overview of the EmmaEmb workflow:

EmmaEmb workflow

EmmaEmb enables the comparative analysis of information captured in different embedding spaces. The workflow consists of the following steps:

A. Embedding Generation: Starting with a set of samples (e.g., proteins or genes), embeddings are extracted from multiple foundation models, which may differ in architecture or training.

B. Feature Integration: Sample-specific categorical data (e.g., functional annotations, protein families) is incorporated into the analysis.

C. Feature Distribution Analysis: The distribution of categorical features is assessed within local neighborhoods in each embedding space, using k-nearest neighbors to quantify class consistency and overlap.

D. Pairwise Space Comparison: Embedding spaces are compared based on pairwise distances and neighborhood similarity to identify global and local differences. Regions with high divergence can be further examined using feature data to understand variations in model representation.

Input

EmmaEmb is centered around the Emma object, which serves as the core of the library. The following input data is required:

  1. Feature Data: A pandas DataFrame containing sample-specific categorical features. Each row corresponds to a sample, and each column corresponds to a feature. The first column should contain the sample IDs.

  2. Embedding Spaces: Precomputed embeddings for each sample (scripts for generating embeddings from protein language models are provided). Embeddings should be stored in a directory with one file per sample. The file name should correspond to the sample ID, and the file should contain the embedding as a list of floats. Multiple embedding spaces can be added to the Emma object for comparison. Dimensions do not need to match across spaces.

The Emma object is initialized with feature data and embedding spaces can be added incrementally.
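The expected on-disk layout can be sketched as follows. This is an illustration of the description above, not code prescribed by the library; the directory name, sample IDs, and embedding dimension are all placeholders.

```python
import os

import numpy as np
import pandas as pd

# Feature data: first column holds sample IDs, remaining columns hold
# categorical features (one row per sample).
feature_data = pd.DataFrame(
    [["P12345", "hydrolase"], ["P67890", "transferase"]],
    columns=["sample_id", "enzyme_class"],
)

# Embedding space: one directory per space, one file per sample, with
# the file name matching the sample ID (paths here are illustrative).
os.makedirs("embeddings/prott5", exist_ok=True)
for sample_id in feature_data["sample_id"]:
    embedding = np.random.rand(1024)  # placeholder per-sample embedding
    np.save(os.path.join("embeddings/prott5", f"{sample_id}.npy"), embedding)
```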

Features

Visualisation after dimensionality reduction

EmmaEmb supports dimensionality reduction techniques such as PCA, t-SNE, and UMAP to visualize and analyze high-dimensional embeddings in lower-dimensional spaces. The plots can be colour coded by a feature of interest from the feature data.
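The reduction step itself can be illustrated generically with scikit-learn (one of the library's dependencies); this is a sketch of the technique, not EmmaEmb's plotting code:

```python
import numpy as np
from sklearn.decomposition import PCA

# Project high-dimensional embeddings (50 samples, 100 dims) to 2D,
# the form typically used for scatter plots colour coded by a feature.
emb = np.random.default_rng(3).random((50, 100))
coords = PCA(n_components=2).fit_transform(emb)  # shape (50, 2)
```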

Computation of pairwise distances

To make embedding spaces comparable, EmmaEmb analyses rely on comparing not individual embeddings, but the relationships between them. The library calculates pairwise distances between samples in each embedding space. Users can select from multiple distance metrics, including:

  • Euclidean
  • Cosine
  • Manhattan

Parts of the analysis consider only the k-nearest neighbors of each sample, which are derived from the pairwise distances. The pairwise distances are calculated only once and can be reused across multiple analyses. For large datasets, EmmaEmb supports the option to approximate nearest neighbors.
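The compute-once-then-reuse idea can be sketched in plain NumPy/SciPy (illustrative only, not EmmaEmb's internals):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
emb = rng.random((6, 32))  # 6 samples, 32-dimensional embeddings

# Full pairwise distance matrix, computed once and reusable for any
# later neighborhood-based analysis.
dist = squareform(pdist(emb, metric="cosine"))  # shape (6, 6)

# Derive k-nearest neighbors from the cached distances; column 0 of the
# argsort is each sample itself (distance 0), so it is skipped.
k = 2
knn = np.argsort(dist, axis=1)[:, 1:k + 1]
```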

Feature distribution across spaces

For a selected feature from the feature data, EmmaEmb provides two metrics to assess the alignment of features across embedding spaces:

  • KNN feature alignment scores: Quantify the alignment of features by examining the nearest neighbors of each sample in different spaces. This score reveals the extent to which samples with a shared feature are embedded close to each other in different spaces.
  • KNN class similarity matrix: Measure the consistency of class-level relationships by assessing the overlap of nearest neighbors for samples within the same class across spaces. This provides insights into the relationships between classes in different embedding spaces.
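A minimal sketch of how a kNN feature alignment score of this kind can be computed (an assumed definition for illustration; EmmaEmb's exact formula may differ):

```python
import numpy as np

def knn_alignment_score(dist, labels, k):
    """Fraction of each sample's k nearest neighbors sharing its label,
    averaged over all samples (a simple alignment-style score)."""
    order = np.argsort(dist, axis=1)[:, 1:k + 1]   # exclude self (column 0)
    same = labels[order] == labels[:, None]        # does neighbor share label?
    return same.mean()

rng = np.random.default_rng(1)
labels = np.array(["A", "A", "A", "B", "B", "B"])
# Two tight, well-separated clusters: neighbors should share labels.
emb = np.vstack([rng.normal(0, 0.1, (3, 8)), rng.normal(5, 0.1, (3, 8))])
dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
score = knn_alignment_score(dist, labels, k=2)
```

With perfectly separated clusters the score is 1.0; it drops toward chance level as classes mix in the embedding space.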

Pairwise space comparison

EmmaEmb provides two metrics to directly compare two embedding spaces:

  • Global comparison of pairwise distances: Compare the distribution of pairwise distances between samples in two embedding spaces. This metric is useful for assessing the overall similarity of the two spaces. The pairwise distances can also be visualized in a scatter plot.
  • Cross-space neighborhood similarity: Assess the similarity of local neighborhoods in two embedding spaces. This metric is useful for identifying regions where the two spaces diverge. The similarity is calculated based on the overlap of k-nearest neighbors between samples in the two spaces. The regions of divergence can be characterized using the feature data.
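As an illustration, a per-sample neighborhood overlap between two spaces can be computed as a Jaccard index over kNN sets (an assumed definition; EmmaEmb may weight or normalize this differently):

```python
import numpy as np

def neighborhood_similarity(dist_x, dist_y, k):
    """Jaccard overlap of each sample's k nearest neighbors in two
    spaces; low values flag regions where the spaces diverge."""
    nn_x = np.argsort(dist_x, axis=1)[:, 1:k + 1]  # exclude self
    nn_y = np.argsort(dist_y, axis=1)[:, 1:k + 1]
    sims = []
    for a, b in zip(nn_x, nn_y):
        sa, sb = set(a), set(b)
        sims.append(len(sa & sb) / len(sa | sb))
    return np.array(sims)  # one score per sample

rng = np.random.default_rng(2)
emb = rng.random((10, 16))
d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
sims = neighborhood_similarity(d, d, k=3)  # identical spaces
```

Comparing a space against itself yields 1.0 everywhere; samples whose scores fall well below that in a real comparison can then be characterized with the feature data.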

Installation

You can install the EmmaEmb library through pip, or access the examples locally by cloning the GitHub repository.

Installing the EmmaEmb library

```
pip install emmaemb
```

Cloning the EmmaEmb repo

```
git clone https://github.com/broadinstitute/EmmaEmb

cd EmmaEmb                    # enter project directory
pip3 install .                # install dependencies
jupyter lab colab_notebooks   # open notebook examples in jupyter for local exploration
```

Getting Started

To get started with the EmmaEmb library, load the metadata and embeddings, and initialize the Emma object. The following code snippet demonstrates how to use EmmaEmb to compare two embedding spaces:

```python
import numpy as np
import pandas as pd

from emmaemb.core import Emma
from emmaemb.vizualisation import (
    plot_emb_space,
    plot_knn_alignment_scores_across_k_and_distance_metrics,
    plot_pairwise_distance_comparison,
)

# generate sample feature data and embeddings
feature_data = [
    ["sample_1", "A", "human"],
    ["sample_2", "A", "mouse"],
    ["sample_3", "B", "human"],
    ["sample_4", "B", "mouse"],
    ["sample_5", "C", "human"],
]
feature_data = pd.DataFrame.from_records(
    feature_data * 5, columns=["sample_id", "enzyme_class", "species"]
)

prott5_embs = np.random.rand(25, 100)
esm2_embs = np.random.rand(25, 100)

np.save("prott5_embs.npy", prott5_embs)
np.save("esm2_embs.npy", esm2_embs)

# initiate Emma object and load embedding spaces
emma = Emma(feature_data=feature_data)
emma.add_emb_space(embeddings_source="prott5_embs.npy", emb_space_name="ProtT5")
emma.add_emb_space(embeddings_source="esm2_embs.npy", emb_space_name="ESM2")

# initial visual inspection
fig_1 = plot_emb_space(emma, emb_space="ProtT5", color_by="enzyme_class", method="PCA")

# calculation of pairwise distances
distance_metrics = ["cosine", "euclidean"]
for distance_metric in distance_metrics:
    emma.calculate_pairwise_distances(emb_space="ProtT5", metric=distance_metric)
    emma.calculate_pairwise_distances(emb_space="ESM2", metric=distance_metric)

# visualize feature alignment
fig_2 = plot_knn_alignment_scores_across_k_and_distance_metrics(
    emma,
    feature="enzyme_class",
    k_values=[5, 10, 15, 20],
    metrics=["euclidean", "cosine"],
)

# compare embedding spaces directly
fig_3 = plot_pairwise_distance_comparison(
    emma,
    emb_space_x="ProtT5",
    emb_space_y="ESM2",
    metric="cosine",
    group_by="species",
)

fig_1.show()
fig_2.show()
fig_3.show()
```

A more detailed example can be found in the Open In Colab notebook.

Approximate nearest neighbors with Annoy

For very large embedding spaces, calculating exact k-nearest neighbors can be computationally expensive. EmmaEmb supports the option to use Annoy to approximate nearest neighbors efficiently:

  • Set use_annoy=True when calling get_knn or related functions.
  • You can specify the annoy_metric ("euclidean", "manhattan", "cosine") and the number of trees (n_trees) to balance accuracy and performance.

Scripts for protein language model embeddings

The repository also contains a wrapper script for retrieving protein embeddings from a diverse set of pre-trained Protein Language Models.

The script includes a heuristic to chunk and aggregate long sequences to ensure compatibility with the models' input size constraints.
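A hypothetical sketch of such a chunk-and-aggregate heuristic (the repository's actual script may use a different splitting rule or pooling strategy; `embed_fn`, `max_len`, and `overlap` are illustrative names):

```python
import numpy as np

def embed_long_sequence(seq, embed_fn, max_len=1000, overlap=100):
    """Embed a sequence that may exceed a model's input limit: split it
    into overlapping chunks, embed each, and mean-pool the results."""
    if len(seq) <= max_len:
        return embed_fn(seq)
    step = max_len - overlap
    chunks = [seq[i:i + max_len] for i in range(0, len(seq), step)]
    return np.mean([embed_fn(c) for c in chunks], axis=0)

# Toy stand-in for a model: embeds a sequence as [length, G-fraction].
toy = lambda s: np.array([len(s), s.count("G") / len(s)])
vec = embed_long_sequence("MG" * 2000, toy, max_len=1000, overlap=100)
```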

The script supports the following models:

Contact

If you have any questions or suggestions, please feel free to reach out to: francesca.risom@hpi.de.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

  • Name: Broad Institute
  • Login: broadinstitute
  • Kind: organization
  • Location: Cambridge, MA

Broad Institute of MIT and Harvard

GitHub Events

Total
  • Release event: 2
  • Watch event: 11
  • Delete event: 5
  • Issue comment event: 2
  • Push event: 82
  • Pull request review event: 11
  • Pull request review comment event: 10
  • Pull request event: 14
  • Fork event: 3
  • Create event: 11
Last Year
  • Release event: 2
  • Watch event: 11
  • Delete event: 5
  • Issue comment event: 2
  • Push event: 82
  • Pull request review event: 11
  • Pull request review comment event: 10
  • Pull request event: 14
  • Fork event: 3
  • Create event: 11

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 12
  • Average time to close issues: less than a minute
  • Average time to close pull requests: 13 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.92
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 1
  • Pull requests: 12
  • Average time to close issues: less than a minute
  • Average time to close pull requests: 13 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.92
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • pia-francesca (1)
Pull Request Authors
  • jannis-baum (11)
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 669 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 16
  • Total maintainers: 2
pypi.org: emmaemb

A library for comparing embedding spaces

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 669 Last month
Rankings
Dependent packages count: 9.7%
Average: 32.1%
Dependent repos count: 54.6%
Maintainers (2)
Last synced: 6 months ago

Dependencies

.github/workflows/main.yml actions
  • actions/checkout v3 composite
  • pixta-dev/repository-mirroring-action v1 composite
pyproject.toml pypi
  • jupyter ==1.0.0
  • numpy ==1.24.4
  • pandas ==2.0.3
  • plotly ==5.21.0
  • scikit_learn ==1.3.2
  • scipy ==1.13.1
  • umap_learn ==0.5.6