2025-epistasis-dl-quantgen
exploratory work with gpatlas datasets
https://github.com/arcadia-science/2025-epistasis-dl-quantgen
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.8%) to scientific vocabulary
Repository
exploratory work with gpatlas datasets
Basic Info
- Host: GitHub
- Owner: Arcadia-Science
- Language: Jupyter Notebook
- Default Branch: main
- Size: 31.4 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Empirical scaling of deep learning models in epistasis prediction
Purpose
This repository contains all scripts needed to replicate the analyses in the pub "Epistasis and deep learning in quantitative genetics". These analyses test the ability of simple deep learning models to learn epistatic interactions in a series of simulated genotype-phenotype datasets.
Our scripts and analyses are split into three experiments that correspond to three sections in the pub:
Experiment 1: "Scaling"
This set of scripts tests the ability of a simple MLP neural network to capture epistasis in a simulated genotype-to-phenotype mapping task. We generate data across a variety of genetic architectures, sample sizes, and QTL numbers to determine how much data an MLP needs to learn epistasis.
The directory for this experiment is: workflow_sims/alphasimr_scaling
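To make the learning task concrete, here is a minimal sketch (illustrative only, not the repository's simulation code; the interaction model and variable names are assumptions) of a genotype-to-phenotype mapping with an epistatic term that a purely additive linear model cannot fully explain:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_qtl = 1000, 10
# biallelic genotypes coded 0/1/2 (allele counts)
G = rng.integers(0, 3, size=(n_samples, n_qtl)).astype(float)

# additive effects plus one epistatic (pairwise interaction) term
beta = rng.normal(size=n_qtl)
additive = G @ beta
epistatic = 2.0 * G[:, 0] * G[:, 1]  # interaction between loci 0 and 1
phenotype = additive + epistatic + rng.normal(scale=0.1, size=n_samples)

# a linear fit on main effects alone leaves the interaction unexplained
X = np.column_stack([G, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
residual_var = np.var(phenotype - X @ coef)
```

An MLP with a hidden layer can in principle recover the product term; the linear fit above leaves most of its variance in the residuals.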
Experiment 2: "Dilution"
This set of scripts focuses on one base scenario from the scaling experiment (100 QTLs and 10,000 samples) and tests the ability of an MLP to learn epistasis when the causal QTLs are diluted with progressively larger numbers of neutral QTLs.
The directory for this experiment is: workflow_sims/alphasimr_dilution
Experiment 3: "Pleiotropy/Genetic correlation"
This set of scripts also focuses on one base scenario of 100 QTLs and 10,000 samples but tests how training on multiple phenotypes that are genetically correlated to varying degrees boosts MLP performance.
The directory for this experiment is: workflow_sims/alphasimr_pleio
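As a rough illustration of genetic correlation (an assumed effect-size model, not the repository's AlphaSimR setup), two phenotypes can be made genetically correlated by drawing their per-QTL effects from a bivariate normal with a chosen correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_qtl, rho = 10_000, 100, 0.8

# biallelic genotypes coded 0/1/2 (allele counts)
G = rng.integers(0, 3, size=(n_samples, n_qtl)).astype(float)

# per-QTL effects for two traits, correlated at level rho
cov = np.array([[1.0, rho], [rho, 1.0]])
effects = rng.multivariate_normal([0.0, 0.0], cov, size=n_qtl)  # (n_qtl, 2)

Y = G @ effects  # (n_samples, 2): genetic values for the two traits
genetic_corr = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
```

The realized correlation between the two traits' genetic values tracks rho, so varying rho gives a family of secondary phenotypes of graded usefulness for multi-task training.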
Hardware Requirements
We ran the three experimental pipelines on a GPU-based AWS EC2 instance (g4dn.8xlarge) with 12 vCPUs, 128 GB of RAM, a 1 TB hard drive, and a T4 Tensor Core GPU. These hardware requirements are only necessary if you wish to replicate the large-sample-size simulations (10^6 samples) of the scaling experiment. The smaller-sample-size simulations can be run with 30 GB of RAM (e.g., on a g4dn.2xlarge instance) and take up much less drive space. See the Snakemake workflow instructions below for details on how to avoid replicating the large-sample-size simulations.
A GPU greatly speeds up model fitting in PyTorch but is not strictly required. However, expect run times to be extremely slow when fitting models for simulations with more than 10^3 samples or QTLs.
Runtime for the scaling simulations is on the order of a week; the dilution and pleiotropy simulations each take around 48 hours if run without parallelization. Runtime for the scaling simulations drops significantly if you use the pre-generated data (see Uploaded data below) and avoid the largest sample sizes.
Data
All input data required to reproduce the results in the pub are generated with the following simulation scripts:
- workflow_sims/alphasimr_scaling/alphasim_generate.R
- workflow_sims/alphasimr_dilution/alphasim_generate.R
- workflow_sims/alphasimr_pleio/alphasim_generate.R
These scripts are set up to be run as part of the Snakemake pipelines described below.
Uploaded data
Alternatively, we have uploaded the genotype, phenotype, and QTL effect-size files generated in our simulations to the following Zenodo repository.
This repo includes four files:
- alphasimr_scaling_input.tar.xz: simulated data for the scaling experiment (excluding the largest sim reps of 1e06 samples)
- alphasimr_scaling_1e06_input.tar.xz: simulated data for the scaling experiment from the largest sim reps of 1e06 samples
- alphasimr_dilution_input.tar.xz: simulated data for the dilution experiment
- alphasimr_pleio_input.tar.xz: simulated data for the pleiotropy experiment
Download these files, then extract them into the correct input data directory for each experiment. The full scaling experiment extracts to ~195 GB of data; this drops to just ~10 GB if you exclude the million-sample reps. The dilution and pleiotropy experiments extract to ~2 GB and ~5 GB respectively.
For example:

```bash
tar -xJf alphasimr_pleio_input.tar.xz -C workflow_sims/alphasimr_pleio/alphasimr_output
```
The same directory structure is used for all three experiments. You will likely need to mkdir these alphasimr_output directories before extracting into them.
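The download-and-extract steps above might be scripted as follows (a sketch based on the paths in this README; it assumes the archives sit in the repository root):

```shell
# Create each experiment's alphasimr_output directory, then extract the
# matching archive if it has been downloaded into the current directory.
for exp in alphasimr_scaling alphasimr_dilution alphasimr_pleio; do
  mkdir -p "workflow_sims/${exp}/alphasimr_output"
  archive="${exp}_input.tar.xz"
  if [ -f "$archive" ]; then
    tar -xJf "$archive" -C "workflow_sims/${exp}/alphasimr_output"
  fi
done

# the optional large-sample archive also extracts into the scaling directory
if [ -f alphasimr_scaling_1e06_input.tar.xz ]; then
  tar -xJf alphasimr_scaling_1e06_input.tar.xz \
    -C workflow_sims/alphasimr_scaling/alphasimr_output
fi
```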
Installation and Setup
To directly replicate the environments used to produce the pub, first install Miniconda.
Then create and activate a new virtual environment:
```bash
conda env create -n snakemake --file workflow_sims/envs/snakemake.yml
conda activate snakemake
```
Snakemake workflow
To generate simulated data and fit all models, first install and activate the conda environment as described above, then run:

```bash
bash run_snakemake_pipeline.sh
```
As discussed above, the pipelines executed by this script require substantial computational resources and take approximately one week to run for the scaling experiment if including the large sample sizes.
This script executes all Snakemake pipelines sequentially, allowing for parallelization within each Snakemake workflow if the --cores parameter is set to more than 1. In principle, if hardware allows, the pipelines themselves can be run in parallel with a modified pipeline script.
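A modified script for running the experiments in parallel might look like the sketch below (not the repository's script; the per-experiment Snakefile list and core count are assumptions). To keep it safe to read as an example, it only prints the commands it would run:

```shell
# Hypothetical parallel variant of run_snakemake_pipeline.sh (sketch only).
run_experiment() {
  dir="workflow_sims/$1"
  for smk in Snakefile_sims.smk Snakefile_linear_mod.smk Snakefile_gpatlas.smk; do
    echo "would run: snakemake -s ${dir}/${smk} --cores 4"
    # snakemake -s "${dir}/${smk}" --cores 4   # uncomment to actually run
  done
}

# each experiment runs in its own background job
for exp in alphasimr_scaling alphasimr_dilution alphasimr_pleio; do
  run_experiment "$exp" &
done
wait
```

Note that the pipelines within one experiment must still run sequentially, since the model-fitting Snakefiles consume the simulation outputs.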
Workflow description
For each experiment you will see the following general snakemake files:
- workflow_sims/alphasimr_*/Snakefile_sims.smk this pipeline executes the job run_sims to generate and save simulated data using the R package AlphaSimR
- workflow_sims/alphasimr_*/Snakefile_linear_mod.smk this takes the output from the sims and runs run_python_rrBLUP, which fits a ridge-regression model using scikit-learn
- workflow_sims/alphasimr_*/Snakefile_gpatlas.smk this workflow also takes the output from the sims and fits models using PyTorch
- runs generate_input_data to generate hdf5 files for input to PyTorch
- runs optimize_fit_gpnet/fit_gpnet to fit an MLP predicting simulated phenotype from simulated genotypes either with or without hyperparameter optimization depending on the experiment.
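As a rough sketch of what a ridge-regression genomic prediction step looks like with scikit-learn (illustrative data and hyperparameters, not the repository's run_python_rrBLUP code):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_samples, n_qtl = 2000, 100

# simulated genotypes (0/1/2 allele counts) and an additive phenotype
G = rng.integers(0, 3, size=(n_samples, n_qtl)).astype(float)
beta = rng.normal(size=n_qtl)
y = G @ beta + rng.normal(scale=0.5, size=n_samples)

G_train, G_test, y_train, y_test = train_test_split(
    G, y, test_size=0.2, random_state=0
)

# rrBLUP is closely related to ridge regression with an appropriate penalty
model = Ridge(alpha=1.0)
model.fit(G_train, y_train)
r2 = model.score(G_test, y_test)  # held-out predictive accuracy
```

A purely additive linear baseline like this is what the MLPs are compared against: it predicts additive architectures well but cannot capture epistatic interactions.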
Additionally, the dilution experiment has another workflow, workflow_sims/alphasimr_dilution/Snakefile_feat_seln. This runs a modified version of the workflow_sims/alphasimr_*/Snakefile_gpatlas.smk workflow in which feature selection is performed using LASSO regression in the rule optimize_fit_feat_seln_gpnet.
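LASSO-based feature selection on a diluted architecture can be sketched like this (illustrative only; the repository's rule likely differs in data handling and hyperparameters):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_samples, n_causal, n_neutral = 1000, 10, 200

# causal QTLs diluted with neutral loci, as in the dilution experiment
G = rng.integers(0, 3, size=(n_samples, n_causal + n_neutral)).astype(float)
beta = np.zeros(n_causal + n_neutral)
beta[:n_causal] = rng.normal(loc=0.0, scale=2.0, size=n_causal)
y = G @ beta + rng.normal(scale=0.5, size=n_samples)

# LASSO's L1 penalty zeroes out most neutral loci;
# the surviving features are then passed to the downstream MLP
lasso = Lasso(alpha=0.1)
lasso.fit(G, y)
selected = np.flatnonzero(lasso.coef_)
```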
Changing simulation parameters
For the first 'scaling' experiment described in the pub you may wish to avoid running the 10^5 and 10^6 sample size simulations due to the hardware requirements.
To do so, you can re-run the workflow_sims/alphasimr_scaling/generate_simulation_reps.ipynb notebook, which generates the config file workflow_sims/alphasimr_scaling/Snakemake_wildcard_config.yaml capturing the simulation parameter combinations Snakemake will execute. Simply edit the sample_sizes dictionary in the notebook to contain the smaller sample sizes you would like to simulate, then run the notebook to generate an updated config file. This notebook requires you to first create a Conda environment from the file workflow_sims/envs/gpatlas.yml.
The same process can be used to change the number of replicates for the scaling experiment. All parameters for the other simulations are captured in the headers of their respective Snakemake files and can be edited as needed. Be warned that for the pipelines to work properly, all the Snakefiles generating results for one experiment must use the same parameter combinations, so their headers must be identical.
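For instance, a reduced configuration might look something like the fragment below. The sample_sizes key comes from the notebook description above, while the exact schema (and the replicates key) is an assumption; check the generated Snakemake_wildcard_config.yaml for the real structure:

```yaml
# hypothetical reduced Snakemake_wildcard_config.yaml (schema assumed)
sample_sizes:
  - 1000
  - 10000
replicates: [1, 2, 3]
```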
Visualize results
The three main figures in the pub are generated in Jupyter notebooks found in the three experiment directories:
- workflow_sims/alphasimr_scaling/Fig_1_scaling.ipynb
- workflow_sims/alphasimr_dilution/Fig_2_dilution.ipynb
- workflow_sims/alphasimr_pleio/Fig_3_pleiotropy.ipynb
If you would like to re-run these notebooks, create a Conda environment from the file workflow_sims/envs/gpatlas.yml and use it to run the notebooks.
Description of the folder structure
```
sandler_gpatlas_data
├── workflow_sims
│   ├── alphasimr_scaling                  # pipeline for recreating Experiment 1 (Scaling)
│   │   ├── Fig_1_scaling.ipynb            # notebook for recreating scaling experiment result Fig. 1
│   │   ├── Fig_supplement.ipynb           # notebook comparing analytical and SGD-fit linear models (Sup. Fig. 1)
│   │   ├── generate_simulation_reps.ipynb # notebook for writing simulation replicates config
│   │   ├── Snakemake_wildcard_config.yaml # simulation replicates config
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear and MLP models
│   ├── alphasimr_dilution                 # pipeline for recreating Experiment 2 (Dilution)
│   │   ├── Fig_2_dilution.ipynb           # notebook for recreating dilution experiment result Fig. 2
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear, MLP, and feature seln. models
│   ├── alphasimr_pleio                    # pipeline for recreating Experiment 3 (Genetic correlations + MTL)
│   │   ├── Fig_3_pleiotropy.ipynb         # notebook for recreating pleiotropy experiment result Fig. 3
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear and MLP models
│   ├── envs                               # conda environments needed for recreating results
│   └── gpatlas                            # python module with geno-pheno modeling functionality
└── run_snakemake_pipeline.sh              # master script for running all snakemake pipelines
```
Owner
- Name: Arcadia Science
- Login: Arcadia-Science
- Kind: organization
- Location: United States of America
- Website: https://www.arcadiascience.com/
- Twitter: ArcadiaScience
- Repositories: 16
- Profile: https://github.com/Arcadia-Science
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite the associated publication.
title: Epistasis and deep learning in quantitative genetics
doi: 10.57844/arcadia-25nt-guw3
authors:
  - family-names: Bell
    given-names: Audrey
    affiliation: Arcadia Science
    orcid: https://orcid.org/0009-0008-2270-1613
  - family-names: Cheveralls
    given-names: Keith
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-4157-6087
  - family-names: Sandler
    given-names: George
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0001-9420-1521
  - family-names: York
    given-names: Ryan
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-1073-1494
preferred-citation:
  title: Epistasis and deep learning in quantitative genetics
  type: article
  doi: 10.57844/arcadia-25nt-guw3
  authors:
    - family-names: Sandler
      given-names: George
      affiliation: Arcadia Science
      orcid: https://orcid.org/0000-0001-9420-1521
    - family-names: York
      given-names: Ryan
      affiliation: Arcadia Science
      orcid: https://orcid.org/0000-0002-1073-1494
  year: 2025
GitHub Events
Total
- Push event: 1
- Create event: 1
Last Year
- Push event: 1
- Create event: 1
Dependencies
- h5py >=3.12.1
- numpy >=1.26.4
- optuna >=3.5.0
- pandas >=2.2.2
- scikit-learn >=1.5.1
- torch >=2.5.1