https://github.com/chongwulab/dna_foundation_benchmark
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 5 DOI reference(s) in README
- ✓ Academic publication links: links to biorxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ChongWuLab
- Language: Python
- Default Branch: main
- Size: 237 KB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DNA Foundation Models Benchmarking
Introduction
This repository contains the code for generating the results of our DNA foundation model benchmarking study.
Please cite the following manuscript when using the models built and the association results produced by this work:
Feng, Haonan, Lang Wu, Bingxin Zhao, Chad Huff, Jianjun Zhang, Jia Wu, Lifeng Lin, Peng Wei, and Chong Wu. "Benchmarking DNA Foundation Models for Genomic Sequence Classification." bioRxiv (2024): 2024-08.
1. Environment Setup
To ensure reproducibility and avoid dependency conflicts, we strongly recommend using separate virtual environments (e.g., virtualenv) or Docker images for each DNA foundation model you intend to use. This mirrors our own development and testing process.
- Python Environment: Please refer to the official GitHub repository of each specific DNA foundation model for detailed instructions on setting up their required Python environment and installing dependencies. Each model may have unique requirements.
- Hardware: GPU is required for running these DNA foundation models and our associated code efficiently. All our experiments were conducted on GPU-accelerated hardware.
- DNA Foundation Model Loading: In our workflow, all DNA foundation models were downloaded from Huggingface and loaded with `from_pretrained()`, using the `checkpoint` argument to specify the local model path together with `local_files_only=True`. Feel free to change the loading process.
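The loading pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual per-model code: `load_dna_model` is a hypothetical helper, and `trust_remote_code=True` is an assumption (several DNA foundation models on Huggingface ship custom model code that requires it).

```python
def load_dna_model(checkpoint_dir):
    """Sketch: load a tokenizer and model from a local Huggingface snapshot.

    `checkpoint_dir` is a hypothetical path to a locally downloaded model;
    `local_files_only=True` prevents any network access at load time.
    """
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(
        checkpoint_dir, local_files_only=True, trust_remote_code=True
    )
    model = AutoModel.from_pretrained(
        checkpoint_dir, local_files_only=True, trust_remote_code=True
    )
    return tokenizer, model
```

The exact arguments differ per model; consult each model's Huggingface card for its required loading options.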
2. Sequence Classification Benchmark
This section outlines the steps to generate results for the sequence classification benchmark tasks.
Datasets
We curated and preprocessed datasets for various genomic tasks from published works. These datasets consist of DNA sequences and their corresponding labels. We further processed them to create standardized train-test splits.
The processed datasets can be downloaded from Huggingface: https://huggingface.co/datasets/hfeng3/dnafoundationbenchmarkdataset. After downloading and extracting, put all files inside the data_processed directory; each individual dataset directory will then contain train.csv and test.csv files. The resultant directory structure should be:
dna_foundation_benchmark/
│
├── data_processed/
│ ├── enhancers/
│ │ ├── enhancer/
│ │ │ ├──train.csv
│ │ │ ├──test.csv
│ │ ├── enhancer_strength/
│ │ │ ├──train.csv
│ │ │ ├──test.csv
│ │── ...
├── analysis/
├── job_scripts/
The path to the directory (containing train.csv and test.csv) will be used as the --data_path argument in the inference scripts.
Generating Zero-Shot Embeddings
To generate zero-shot embeddings for a given dataset using a specific DNA foundation model:
Navigate to the scripts directory:
bash cd job_scriptsRun the inference script: Execute the Python script corresponding to the DNA foundation model you wish to use (e.g.,
inference_cadph.pyfor Caduceus-Ph,inference_dnabert.pyfor DNABERT, etc.).Arguments:
-
--data_path: Path to the specific dataset folder (e.g.,../data_processed/iPro-WAEL/Promoter_Arabidopsis_NonTATA). Adjust this path based on where you downloaded and extracted the datasets relative to thejobscriptsdirectory. -
--data_name: The name of the dataset folder (e.g.,Promoter_Arabidopsis_NonTATA). This is often the same as the last component of--data_path. -
--max_length: The maximum sequence length to use after tokenization. Sequences longer than this will be truncated. -
--pooling: The pooling method to apply to the model's output embeddings. Common options includemean,max, orcls.
Example: To generate zero-shot embeddings using the Caduceus-Ph model for the PromoterArabidopsisNonTATA dataset, with a maximum tokenized length of 700 and using max pooling, run the following command from within the
job_scriptsdirectory:bash python inference_cadph.py \ --data_path ../data_processed/iPro-WAEL/Promoter_Arabidopsis_NonTATA \ --data_name Promoter_Arabidopsis_NonTATA \ --max_length 700 \ --pooling max(Note: The../data_processed/part of the--data_pathassumes your datasets are stored in adata_processeddirectory one level above thejob_scriptsdirectory. Please adjust this path according to your local file structure.)-
The generated embeddings will by default be saved to embeddings/[DATASET_NAME] as csv files. For the example above, the output would be in embeddings/Promoter_Arabidopsis_NonTATA.
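The `--pooling` options reduce a model's per-token embeddings to one vector per sequence. A minimal numpy sketch of the three common choices, assuming hidden states of shape `[seq_len, hidden_dim]` with a CLS-style token at position 0 (the real scripts pool the model outputs; `pool_embeddings` is an illustrative helper):

```python
import numpy as np

def pool_embeddings(hidden_states, method):
    """Reduce [seq_len, hidden_dim] token embeddings to one [hidden_dim] vector."""
    if method == "mean":
        return hidden_states.mean(axis=0)   # average over all tokens
    if method == "max":
        return hidden_states.max(axis=0)    # per-dimension maximum over tokens
    if method == "cls":
        return hidden_states[0]             # first ([CLS]-style) token only
    raise ValueError(f"unknown pooling method: {method}")

# Toy hidden states: 3 tokens, 2 hidden dimensions.
tokens = np.array([[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]])
print(pool_embeddings(tokens, "mean"))  # [2. 2.]
print(pool_embeddings(tokens, "max"))   # [3. 4.]
print(pool_embeddings(tokens, "cls"))   # [1. 4.]
```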
Classification on Embeddings
Once the zero-shot embeddings have been generated, you can train simple classifiers on them to get the final performance metrics.
1. Stay in the `job_scripts` directory.

2. Run the classification script: run the `classify_[model_short_name].py` script corresponding to the model whose embeddings you want to evaluate. The script loads the embeddings you generated in the previous step and uses them to train a classifier.

   Arguments:
   - `--data_name`: Name of the dataset.
   - `--pooling`: The pooling method used during the embedding generation step (e.g., `max`, `mean`). This is crucial for loading the correct embedding file.
   - `--multiclass`: Specify `yes` or `no` to indicate whether the dataset is a multi-class task. This is important for calculating the correct performance metrics.
   - `--classifier`: The name of the simple classifier to train. Options include `random_forest`, `naive_bayes`, or `elastic` (for Elastic Net logistic regression).

   Example: to train an Elastic Net classifier on the Caduceus-Ph embeddings for the Promoter_Arabidopsis_NonTATA dataset (generated with max pooling; a binary classification task), run:

   ```bash
   python classify_cadph.py \
       --data_name Promoter_Arabidopsis_NonTATA \
       --pooling max \
       --multiclass no \
       --classifier elastic
   ```

3. Output: the script trains the classifier, evaluates its performance, and saves a summary of the metrics. The output for this example will be generated in `results_final/cadph/`.
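The classify step amounts to fitting a simple classifier on fixed embeddings. A scikit-learn sketch of the `elastic` option (Elastic Net-penalized logistic regression); the arrays here are synthetic stand-ins, whereas the real scripts load the CSV files from `embeddings/[DATASET_NAME]`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for the embedding matrices and binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_test = rng.normal(size=(100, 16))
y_test = (X_test[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Elastic Net logistic regression: elasticnet penalty requires the saga solver.
clf = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")
```

The repository's scripts additionally compute and save a fuller set of metrics; this shows only the core fit/evaluate loop.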
Result Analysis and Comparison
To organize the generated metrics and replicate the comparison tables from our work, navigate to the analysis directory:
```bash
cd analysis
```
From here, you can run the following scripts to aggregate results:
- `compare_across_classifier.py`: compares different classifiers for the same model and pooling method.
- `compare_across_model.py`: compares the performance of different DNA foundation models.
- `compare_across_pooling.py`: compares the impact of different pooling methods.
- `compare_pretrain_vs_1k.py`: a specific comparison for HyenaDNA checkpoints. This assumes you have already pretrained a HyenaDNA model, obtained its classification results in the same format as our other experiments, and have also run `inference_hyena_1k.py` and `classify_hyena_1k.py` from the `job_scripts` directory.
To replicate the boxplots (Figure 2 and Supplementary Figure 1), run `boxplot_classifier.py` and `boxplot_pooling.py`.
Statistical Significance (DeLong's Test)
To statistically compare the Area Under the ROC Curve (AUC) between different results, we provide scripts that implement DeLong's test. From the analysis directory, run the appropriate script to get a list of winners with a significance of p < 0.01.
- `delong_across_models.py`: compares AUCs between different models.
- `delong_across_pooling.py`: compares AUCs between different pooling methods.
- `delong_pretrain_vs_1k.py`: compares AUCs for the HyenaDNA checkpoint evaluation.
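For reference, DeLong's test for two correlated AUCs (two models scored on the same test set) can be implemented compactly from its standard formulation. This is an illustrative sketch, not the repository's script:

```python
import numpy as np
from scipy.stats import norm

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for the difference of two correlated AUCs."""
    y_true = np.asarray(y_true)
    pos, neg = y_true == 1, y_true == 0
    m, n = pos.sum(), neg.sum()
    aucs, v10s, v01s = [], [], []
    for s in (np.asarray(scores_a), np.asarray(scores_b)):
        # psi(x, y) = 1 if x > y, 0.5 if tied, 0 otherwise
        psi = (s[pos][:, None] > s[neg][None, :]).astype(float)
        psi += 0.5 * (s[pos][:, None] == s[neg][None, :])
        v10s.append(psi.mean(axis=1))  # structural components over positives
        v01s.append(psi.mean(axis=0))  # structural components over negatives
        aucs.append(psi.mean())
    s10, s01 = np.cov(np.vstack(v10s)), np.cov(np.vstack(v01s))
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))

# Usage: an informative scorer vs. a random one on synthetic labels.
rng = np.random.default_rng(1)
y = np.repeat([1, 0], 50)
good = y + rng.normal(scale=0.5, size=100)
bad = rng.normal(size=100)
auc_a, auc_b, p = delong_test(y, good, bad)
```

A small p-value (the repository uses p < 0.01) indicates the two AUCs differ beyond sampling noise.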
Comparison with CNN Baseline
To compare the performance of foundation models against a baseline CNN model:
- From the `job_scripts` directory, run `classify_baseline.py` for each dataset.
- Ensure you have already generated results for the foundation models using `mean` pooling.
- Navigate to the `analysis` directory.
- Run `compare_meanpool_vs_baseline.py` to aggregate the metrics and `delong_meanpool_vs_baseline.py` to perform the statistical comparison.
3. Gene Expression Prediction Benchmark
Please note: The code for this section was adapted for a specific high-performance computing (HPC) server environment. As such, it contains server-specific arguments and code structures. This section is provided as a general reference for our methodology rather than a direct, universally runnable guide.
Data Acquisition
- DNA Sequences: Access to the required individual DNA sequences is protected and must be applied for through the GTEx Portal: https://gtexportal.org/home/protectedDataAccess
- Gene Expression: The corresponding gene expression QTL data can be downloaded from: https://www.gtexportal.org/home/downloads/adult-gtex/qtl
Experiment Pipeline
- After preprocessing the DNA sequences, generate zero-shot embeddings for each foundation model following `gtex_[model_short_name].py` in `job_scripts`.
- Run `GTEX_cov_out.py` in `analysis` to regress out covariates from the gene expression data.
- Run `GTEX_regression.py` to perform the final regression task, predicting gene expression from the processed DNA embeddings.
- Run `GTEX_summary_results.py` to summarize and compare the final performance metrics across different foundation models.
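Regressing out covariates amounts to replacing each expression vector with the residuals of a linear fit on the covariates. A minimal numpy sketch with synthetic data (the repository's script operates on the GTEx files; `regress_out` is an illustrative helper):

```python
import numpy as np

def regress_out(expression, covariates):
    """Return expression residuals after removing a linear covariate fit."""
    # Add an intercept column, then subtract the least-squares projection
    # of the expression values onto the covariate subspace.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return expression - design @ beta

# Synthetic example: expression driven by 3 covariates plus noise.
rng = np.random.default_rng(0)
cov = rng.normal(size=(120, 3))
expr = cov @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=120)
resid = regress_out(expr, cov)
```

By construction the residuals are orthogonal to each covariate column, so downstream regression on embeddings is not confounded by the linear covariate effects.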
4. Variant Effect Quantification
This section details the procedure for replicating our variant effect quantification results.
Datasets
The primary dataset for this task is located in data_processed/pathogenic. This experiment also requires the hg38 reference genome fasta file, which we provide in data_processed/TAD/hg38.ml.fa.
Experimental Pipeline
1. First, generate the required DNA sequences (i.e., with reference vs. alternate alleles) for the foundation models by running the `pathogenic_generate.py` script.

2. Navigate to the `job_scripts` directory. For each foundation model, run the corresponding script to generate embeddings, calculate the distance metric between variant pairs, and store the results:

   ```bash
   python patho_[model_short_name].py
   ```

   The results for each model will be saved to a corresponding directory, for example `results_final/[model_short_name]_meanpool/`.

3. To organize the results from all models and replicate our plots, navigate to the `analysis` directory and run:

   ```bash
   python patho_summary.py
   ```
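The per-variant distance compares the embedding of the reference sequence against that of the alternate sequence. A sketch using cosine distance as the metric (illustrative only; the metric actually used per model lives in each `patho_*` script):

```python
import numpy as np

def variant_effect_score(ref_emb, alt_emb):
    """Cosine distance between reference and alternate embeddings.

    Larger values suggest the variant perturbs the model's
    representation of the sequence more strongly."""
    ref, alt = np.asarray(ref_emb), np.asarray(alt_emb)
    cos = ref @ alt / (np.linalg.norm(ref) * np.linalg.norm(alt))
    return 1.0 - cos

ref = np.array([1.0, 0.0, 1.0])
print(variant_effect_score(ref, ref))                        # 0.0: identical
print(variant_effect_score(ref, np.array([0.0, 1.0, 0.0])))  # 1.0: orthogonal
```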
5. TAD Region Recognition
This section describes the workflow for identifying Topologically Associating Domain (TAD) boundaries.
Datasets
The data for this task is located in data_processed/TAD/. This includes data downloaded from (https://console.cloud.google.com/storage/browser/basenji_hic/insulation) and the reference genome files required for sequence generation.
Experimental Pipeline
1. Preprocessing: first, run `select_tads.py` to choose the TAD regions for analysis. Then run `generate_seqs_from_tads.py` and `generate_seqs_random.py` to generate the positive (TAD boundary) and negative (random non-boundary) DNA sequences.
2. Embedding Generation: in the `job_scripts` directory, run `tad_ntv2.py` to generate the embeddings for the sequences.
3. Analysis and Plotting: in the `analysis` directory, run `tad_interpret.py` to replicate the heatmap plot from our results.
6. Runtime Analysis
This section explains how to replicate the model runtime and throughput analysis.
1. Measure Runtimes: in the `job_scripts` directory, for each foundation model, run its corresponding runtime script:

   ```bash
   python runtime_[model_short_name].py
   ```

   You are encouraged to modify parameters within the script, such as `batch_size` and `sequence_length`, or substitute different model checkpoints to test various configurations. The results will be saved to the `runtimes/` folder.

2. Plotting: to visualize the results and replicate our comparison plot, navigate to the `analysis` directory and run the plotting script `plot_runtime.py`.
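At their core, the runtime scripts time repeated forward passes. A model-agnostic sketch of such a timing loop; `dummy_forward` is a stand-in, and the real scripts time actual checkpoints on GPU (typically with warm-up iterations and, for CUDA, device synchronization before reading the clock):

```python
import time

def measure_throughput(forward, batch_size, n_warmup=3, n_iters=10):
    """Return sequences/second for a callable `forward(batch_size)`."""
    for _ in range(n_warmup):   # warm-up: exclude one-time setup costs
        forward(batch_size)
    start = time.perf_counter()
    for _ in range(n_iters):
        forward(batch_size)
    elapsed = time.perf_counter() - start
    return n_iters * batch_size / elapsed

# Stand-in "model"; a real script would run model(**tokenized_batch) here.
def dummy_forward(batch_size):
    sum(i * i for i in range(2000 * batch_size))

throughput = measure_throughput(dummy_forward, batch_size=8)
print(f"{throughput:.1f} sequences/second")
```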
References
- Zhou, Z. et al. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. Preprint at https://doi.org/10.48550/arXiv.2306.15006 (2024).
- Dalla-Torre, H. et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods 22, 287–297 (2025).
- Nguyen, E. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. In Proceedings of the 37th International Conference on Neural Information Processing Systems (2023).
- Schiff, Y. et al. Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. Preprint at https://doi.org/10.48550/arXiv.2403.03234 (2024).
- Sanabria, M., Hirsch, J., Joubert, P. M. & Poetsch, A. R. DNA language model GROVER learns sequence context in the human genome. Nat Mach Intell 6, 911–923 (2024).
- Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 18, 1196–1203 (2021).
Owner
- Name: Chong Wu Lab
- Login: ChongWuLab
- Kind: organization
- Email: cwu3@fsu.edu
- Repositories: 3
- Profile: https://github.com/ChongWuLab
Chong Wu@Florida State University
GitHub Events
Total
- Watch event: 4
- Push event: 11
Last Year
- Watch event: 4
- Push event: 11