https://github.com/biomedsciai/gene-benchmark

Benchmark gene representations from different model families

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

benchmarking foundation-models foundation-models-for-biology representation-learning
Last synced: 6 months ago

Repository

Benchmark gene representations from different model families

Basic Info
Statistics
  • Stars: 12
  • Watchers: 4
  • Forks: 0
  • Open Issues: 3
  • Releases: 0
Topics
benchmarking foundation-models foundation-models-for-biology representation-learning
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
  • Readme
  • License

README.MD

Gene Benchmark

A package to benchmark pre-trained models on downstream gene-related tasks.

The package adopts a framework that enables comparison of models with different input types by focusing on the learned embeddings each model provides for every entity. This makes it possible to compare models trained on text, proteins, transcriptomics, or any other modality.

To learn more about the benchmark, check out the preprint "Does your model understand genes? A benchmark of gene properties for biological and text models" at https://arxiv.org/abs/2412.04075.

The repository is divided into the following sections:

  • gene_benchmark: The package itself, containing the scripts for extracting textual descriptions from NCBI, encoding the textual descriptions, and evaluating model performance.

  • notebooks: Notebooks for creating the results figures and package usage examples.

  • scripts: Scripts for description extraction, encoding, task creation and execution.

  • tasks: The default directory that will be populated with all the tasks after running the task creation script.

An in-depth explanation of each of the package's components can be found in the gene_benchmark directory.

Environment

Using a virtual environment for all commands in this guide is strongly recommended. The package works with conda, uv and vanilla venv environments.

```sh
# create a conda environment "gene_benchmark" with Python version 3.11
conda create -n gene_benchmark python=3.11

# activate the environment before installing new packages
conda activate gene_benchmark
```
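
If you prefer not to use conda, a plain virtual environment works just as well. The sketch below is only illustrative: it assumes Python 3.11 is available as python3.11 and uses .venv as the environment directory, both arbitrary choices.

```sh
# create and activate a vanilla venv (an uv-managed environment works the same way)
python3.11 -m venv .venv
source .venv/bin/activate
```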

Installation

The package is not yet available on PyPI, but it can be installed directly from GitHub. The following command installs the repository as a Python package and also attempts to install the dependencies specified in setup.py or pyproject.toml.

```sh
# Note that the command does not clone the repository.
pip install "git+https://github.com/BiomedSciAI/gene-benchmark.git"
```

Alternatively, the repository can be cloned and installed manually.

```sh
git clone https://github.com/BiomedSciAI/gene-benchmark.git

# Change directory to the root of the cloned repository
cd gene-benchmark

# install from the local directory
pip install -e .
```

Usage

To evaluate your model on the tasks, a few basic steps are required:

Set up

  1. Create the tasks: This repo does not contain gene task data. Instead, we provide scripts to download and populate the tasks from diverse sources. To download the files containing the tasks, run the following commands in your terminal from the root directory. Note that each task dataset has its own license, which is separate from the license of this package.

```sh
python scripts/tasks_retrieval/gene2gene_task_creation.py --allow-downloads True
python scripts/tasks_retrieval/Genecorpus_tasks_creation.py --allow-downloads True
python scripts/tasks_retrieval/HLA_task_creation.py --allow-downloads True
python scripts/tasks_retrieval/HPA_tasks_creation.py --allow-downloads True
python scripts/tasks_retrieval/humantfs_task_creation.py --allow-downloads True
python scripts/tasks_retrieval/Reactome_tasks_creation.py --allow-downloads True
python scripts/tasks_retrieval/uniprot_keyword_tasks_creation.py --allow-downloads True
```

Now your tasks directory should be populated with subdirectories named after the tasks. Each subdirectory holds two .csv files, one with the gene symbols (entities.csv) and one with the labels (outcomes.csv). The shape of these CSV files depends on the task type; for example, for the multi-class tasks, the outcomes file is a 2D matrix.
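
As a quick sanity check, you can list the populated task directories and peek at the two CSV files. The sketch below is illustrative only; the task name used ("TF vs non-TF") is taken from the binary task list further down and may differ on your machine.

```sh
# list the populated task subdirectories
ls tasks/
# inspect the first rows of the entities and outcomes files for one task
head -n 3 "tasks/TF vs non-TF/entities.csv"
head -n 3 "tasks/TF vs non-TF/outcomes.csv"
```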

  2. Create your task yaml: The script for running the tasks can receive either the task names themselves or a .yaml file containing the list of task names you wish to run. If you choose to create a .yaml file with the task names, create a separate file for each task type. For example, for the binary tasks:

```yaml
- TF vs non-TF
- long vs short range TF
- bivalent vs non-methylated
- Lys4-only-methylated vs non-methylated
- dosage sensitive vs insensitive TF
- Gene2Gene
- CCD Transcript
- CCD Protein
- N1 network
- N1 targets
- HLA class I vs class II
```

The example task configs can be found in the task_configs directory.

  3. Create the model config file: This config file holds the path to your model's embeddings and the name you wish to use for your model. The structure of the file:

```yaml
encoder:
  class_name: PreComputedEncoder
  class_args:
    encoder_model_name: "/path/to/your/embeddings/my_models_embeddings.csv"
model_name: my_model_name
```

Note that the script expects the embeddings CSV file to have a 'symbol' column containing the gene symbols; this column will be set as the index.
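
For illustration, a minimal pre-computed embeddings file could look like the sketch below. The gene symbols and three-dimensional embedding values are made up; only the 'symbol' column name reflects what the script expects.

```sh
# write a toy embeddings CSV with the expected 'symbol' column (values are placeholders)
cat > my_models_embeddings.csv <<'EOF'
symbol,dim_0,dim_1,dim_2
TP53,0.12,-0.45,0.88
BRCA1,0.03,0.27,-0.19
EOF
```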

Run task

Each task type (binary, categorical or multi-label) will be run separately. For example, for running the binary tasks the command is:

```sh
python scripts/run_task.py -t /path/to/task/yaml/base_binary.yaml -tf /tasks -m /path/to/model/config/model.yaml --output-file-name binary_tasks.csv
```

Note:

  • For the other task types (categorical, regression or multi-label) you need to add -s category/regression/multi (see the example after these notes).

  • When you are running the tasks on multiple models and want the results to be comparable, you can add an excluded-symbols-file input. This needs to be a path to a yaml file containing a list of gene names you would like to exclude.

  • To avoid errors during cross-validation due to class imbalance, you can add a threshold for the classes: "-th" (for multi-label) or "-cth" (for categorical).
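
For example, a categorical run combining the -s and -cth options could look like the sketch below. The yaml file name, paths and threshold value are placeholders, not files shipped with the package.

```sh
# hypothetical categorical run: the paths and the class threshold value are placeholders
python scripts/run_task.py -t /path/to/task/yaml/base_categorical.yaml -tf /tasks \
  -m /path/to/model/config/model.yaml -s category -cth 10 \
  --output-file-name categorical_tasks.csv
```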

Citation

If you make use of this package, please cite our manuscript:

```tex
@misc{kantor2024doesmodelunderstandgenes,
      title={Does your model understand genes? A benchmark of gene properties for biological and text models},
      author={Yoav Kan-Tor and Michael Morris Danziger and Eden Zohar and Matan Ninio and Yishai Shimoni},
      year={2024},
      eprint={2412.04075},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.04075},
}
```

Owner

  • Name: BiomedSciAI
  • Login: BiomedSciAI
  • Kind: organization

GitHub Events

Total
  • Issues event: 4
  • Watch event: 7
  • Delete event: 6
  • Issue comment event: 7
  • Push event: 54
  • Pull request review event: 11
  • Pull request review comment event: 12
  • Pull request event: 17
  • Create event: 9
Last Year
  • Issues event: 4
  • Watch event: 7
  • Delete event: 6
  • Issue comment event: 7
  • Push event: 54
  • Pull request review event: 11
  • Pull request review comment event: 12
  • Pull request event: 17
  • Create event: 9

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 47
  • Average time to close issues: 27 days
  • Average time to close pull requests: 7 days
  • Total issue authors: 3
  • Total pull request authors: 5
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.21
  • Merged pull requests: 43
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 10
  • Average time to close issues: 27 days
  • Average time to close pull requests: 9 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.1
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • qwu01 (1)
  • yoavkt (1)
  • njwfish (1)
  • edenjenzohar (1)
Pull Request Authors
  • yoavkt (55)
  • edenjenzohar (21)
  • mmdanziger (7)
  • vesnarb (2)
  • matanninio (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pyproject.toml pypi
  • mygene *
  • pandas >=2.1.3
  • scikit-learn *
  • sentence_transformers *
.github/workflows/pre-commit-check.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v3 composite
.github/workflows/python-package.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v3 composite