2025-epistasis-dl-quantgen

exploratory work with gpatlas datasets

https://github.com/arcadia-science/2025-epistasis-dl-quantgen

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

exploratory work with gpatlas datasets

Basic Info
  • Host: GitHub
  • Owner: Arcadia-Science
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 31.4 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 1 year ago · Last pushed 9 months ago
Metadata Files
  • Readme
  • Citation

README.md

Empirical scaling of deep learning models in epistasis prediction

Purpose

This repository contains all scripts needed to replicate the analyses in the pub Epistasis and deep learning in quantitative genetics. These analyses test the ability of simple deep learning models to learn epistatic interactions in a series of simulated genotype-phenotype datasets.

Our scripts and analyses are split up into three experiments that correspond to three sections in the pub:

Experiment 1: "Scaling"

This set of scripts tests the ability of a simple MLP neural network to capture epistasis in a simulated genotype-to-phenotype mapping task. We generate data across a variety of genetic architectures, sample sizes, and QTL numbers to figure out how much data is needed for an MLP to learn epistasis.

The directory for this experiment is: workflow_sims/alphasimr_scaling
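As a toy illustration of the kind of genotype-to-phenotype mapping task being simulated, the sketch below builds a phenotype from additive QTL effects plus a handful of pairwise (epistatic) interaction terms. This is purely illustrative: the pub uses AlphaSimR, and everything here (effect-size distributions, number of interactions, noise model) is a simplifying assumption.

```python
# Toy epistatic genotype -> phenotype simulation (illustrative only;
# the pub's actual simulations use the R package AlphaSimR).
import random

random.seed(0)
n_samples, n_qtl = 1_000, 100

# 0/1/2 allele dosages per QTL for each individual
genotypes = [[random.choice((0, 1, 2)) for _ in range(n_qtl)]
             for _ in range(n_samples)]

# Additive effects, plus a few random pairwise (epistatic) interactions
add_eff = [random.gauss(0, 1) for _ in range(n_qtl)]
pairs = [(random.randrange(n_qtl), random.randrange(n_qtl)) for _ in range(10)]
pair_eff = [random.gauss(0, 1) for _ in pairs]

def phenotype(g):
    y = sum(a * x for a, x in zip(add_eff, g))                       # additive
    y += sum(e * g[i] * g[j] for e, (i, j) in zip(pair_eff, pairs))  # epistatic
    return y + random.gauss(0, 1)                                    # noise

phenotypes = [phenotype(g) for g in genotypes]
```

An MLP trained on `genotypes` vs. `phenotypes` can in principle capture the interaction terms, while a purely additive linear model cannot; the scaling question is how many samples that takes.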

Experiment 2: "Dilution"

This set of scripts focuses on one base scenario from the scaling experiment (100 QTLs and 10,000 samples) and tests the ability of an MLP to learn epistasis when the causal QTLs are diluted with progressively larger numbers of neutral QTLs.

The directory for this experiment is: workflow_sims/alphasimr_dilution

Experiment 3: "Pleiotropy/Genetic correlation"

This set of scripts also focuses on one base scenario of 100 QTLs and 10,000 samples but tests how training on multiple phenotypes that are genetically correlated to varying degrees boosts MLP performance.

The directory for this experiment is: workflow_sims/alphasimr_pleio

Hardware Requirements

We ran the three experimental pipelines on a GPU-based AWS EC2 instance (g4dn.8xlarge) with 12 vCPUs, 128 GB of RAM, a 1 TB hard drive, and a T4 Tensor Core GPU. These hardware requirements are only necessary if you wish to replicate the large sample-size simulations (10^6 samples) of the scaling experiment. The smaller sample-size simulations can be run with 30 GB of RAM (e.g., on a g4dn.2xlarge instance) and take up much less drive space. See the Snakemake workflow instructions below for details on how to avoid replicating the large sample-size simulations.

A GPU greatly speeds up model fitting in PyTorch but is not strictly required. However, expect run times to be exceptionally slow when fitting models for simulations with more than 10^3 samples or QTLs.

Runtime for the scaling simulations is on the order of a week; the dilution and pleiotropy simulations take around 48 hours each if run without parallelization. Runtime for the scaling simulations drops significantly if you use the pre-generated data (see Uploaded data below) and skip the largest sample-size simulations.

Data

All input data required to reproduce the results in the pub is generated with the following simulation scripts:

- workflow_sims/alphasimr_scaling/alphasim_generate.R
- workflow_sims/alphasimr_dilution/alphasim_generate.R
- workflow_sims/alphasimr_pleio/alphasim_generate.R

These scripts are set up to be run as part of the Snakemake pipelines described below.

Uploaded data

Alternatively, we have uploaded the genotype, phenotype, and QTL effect-size files generated in our simulations to the following Zenodo repository.

This repo includes four files:

- alphasimr_scaling_input.tar.xz: simulated data for the scaling experiment (except the largest sim reps of 1e06 samples)
- alphasimr_scaling_1e06_input.tar.xz: simulated data for the scaling experiment from the largest sim reps of 1e06 samples
- alphasimr_dilution_input.tar.xz: simulated data for the dilution experiment
- alphasimr_pleio_input.tar.xz: simulated data for the pleiotropy experiment

Simply download these files, then extract them into the correct input data directory for each experiment. The full scaling experiment extracts to ~195 GB of data; this drops to just ~10 GB if the million-sample reps are excluded. The dilution and pleiotropy experiments extract to ~2 GB and ~5 GB, respectively.

For example:

```bash
tar -xJf alphasimr_pleio_input.tar.xz -C workflow_sims/alphasimr_pleio/alphasimr_output
```

The same directory structure is used for all three experiments. You will likely need to mkdir the alphasimr_output directories before extracting into them.
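The mkdir-then-extract steps can be sketched as a loop over the three experiments; the archive filenames follow the Zenodo listing above, and the guard simply skips any archive you haven't downloaded:

```shell
# Create each output directory and extract the matching archive into it.
for exp in scaling dilution pleio; do
  dest="workflow_sims/alphasimr_${exp}/alphasimr_output"
  mkdir -p "$dest"
  # Only extract archives that have actually been downloaded
  if [ -f "alphasimr_${exp}_input.tar.xz" ]; then
    tar -xJf "alphasimr_${exp}_input.tar.xz" -C "$dest"
  fi
done
```

Note this skips the separate alphasimr_scaling_1e06_input.tar.xz archive, which you only need for the million-sample reps.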

Installation and Setup

To directly replicate the environments used to produce the pub, first install Miniconda.

Then create and activate a new virtual environment:

```bash
conda env create -n snakemake --file workflow_sims/envs/snakemake.yml
conda activate snakemake
```

Snakemake workflow

To generate simulated data and fit all models, first install and activate the conda environment as described above, then run:

```bash
bash run_snakemake_pipeline.sh
```

As discussed above, the pipelines executed by this script demand substantial computational resources and take approximately one week to run for the scaling experiment if including the large sample sizes.

This script executes all Snakemake pipelines sequentially, allowing for parallelization within each Snakemake workflow if the --cores parameter is set to more than 1. In principle, if hardware allows, these pipelines can be run in parallel by the user with a modified pipeline script.

Workflow description

For each experiment you will see the following general Snakemake files:

- workflow_sims/alphasimr_*/Snakefile_sims.smk: executes the job run_sims to generate and save simulated data using the R package AlphaSimR
- workflow_sims/alphasimr_*/Snakefile_linear_mod.smk: takes the output from the sims and runs run_python_rrBLUP, which fits a ridge-regression model using scikit-learn
- workflow_sims/alphasimr_*/Snakefile_gpatlas.smk: also takes the output from the sims and fits models using PyTorch:
  - runs generate_input_data to generate HDF5 files for input to PyTorch
  - runs optimize_fit_gpnet/fit_gpnet to fit an MLP predicting simulated phenotypes from simulated genotypes, either with or without hyperparameter optimization depending on the experiment
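The rrBLUP-style linear baseline amounts to ridge regression of phenotype on genotype dosages. A minimal scikit-learn sketch on toy additive data follows; the matrix dimensions, effect sizes, and alpha value are arbitrary choices for illustration, not the workflow's actual settings:

```python
# Ridge regression of phenotype on genotype dosages (rrBLUP-style baseline).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 100)).astype(float)  # 0/1/2 dosage matrix
beta = rng.normal(size=100)                            # true additive effects
y = X @ beta + rng.normal(size=500)                    # phenotype + noise

model = Ridge(alpha=1.0).fit(X, y)
r2 = model.score(X, y)  # in-sample fit; use held-out data in practice
```

Because this data is purely additive, the ridge model fits it well; on the epistatic simulations in the pub, this baseline is exactly what the MLP is expected to beat.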

Additionally, the dilution experiment has another workflow, workflow_sims/alphasimr_dilution/Snakefile_feat_seln. This runs a modified version of the workflow_sims/alphasimr_*/Snakefile_gpatlas.smk workflow in which feature selection is performed using LASSO regression in the rule optimize_fit_feat_seln_gpnet.

Changing simulation parameters

For the first 'scaling' experiment described in the pub, you may wish to avoid running the 10^5 and 10^6 sample-size simulations due to the hardware requirements. To do so, re-run the workflow_sims/alphasimr_scaling/generate_simulation_reps.ipynb notebook, which generates the config file workflow_sims/alphasimr_scaling/Snakemake_wildcard_config.yaml capturing the simulation parameter combinations Snakemake will execute. Simply edit the sample_sizes dictionary in the notebook to contain only the smaller sample sizes you would like to simulate, then run the notebook to generate an updated config file. This notebook requires you to first create a Conda environment from the file workflow_sims/envs/gpatlas.yml.
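If you prefer not to open the notebook, a config file playing the same role could be regenerated by a short script. Note the YAML keys and layout below are assumptions for illustration; mirror whatever structure the notebook actually writes before pointing Snakemake at it:

```python
# Sketch: write a wildcard config listing only the smaller sample sizes.
# Key names ("sample_sizes", "reps") are assumptions; check the notebook's
# output (workflow_sims/alphasimr_scaling/Snakemake_wildcard_config.yaml).
sample_sizes = [100, 1_000, 10_000]  # drop the 1e05 and 1e06 reps
reps = list(range(1, 11))            # hypothetical replicate IDs

lines = ["sample_sizes:"]
lines += [f"  - {n}" for n in sample_sizes]
lines += ["reps:"]
lines += [f"  - {r}" for r in reps]

with open("Snakemake_wildcard_config.yaml", "w") as fh:
    fh.write("\n".join(lines) + "\n")
```

Keeping the config as a plain generated file (rather than editing it by hand) makes it easy to restore the full parameter grid later by re-running the notebook.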

The same process can be used to change the number of replicates for the scaling experiment. All parameters for the other simulations are captured in the headers of their respective Snakemake files and can be edited as needed. Be warned that for the pipelines to work properly, all the Snakefiles generating results for one experiment must use the same parameter combinations, so their headers must be identical.

Visualize results

The three main figures in the pub are generated in Jupyter notebooks found in the three experiment directories:

- workflow_sims/alphasimr_scaling/Fig_1_scaling.ipynb
- workflow_sims/alphasimr_dilution/Fig_2_dilution.ipynb
- workflow_sims/alphasimr_pleio/Fig_3_pleiotropy.ipynb

If you would like to re-run these notebooks, create a Conda environment from the file workflow_sims/envs/gpatlas.yml and use it to run the notebooks.

Description of the folder structure

```
sandler_gpatlas_data
├── workflow_sims
│   ├── alphasimr_scaling                  # pipeline for recreating Experiment 1 (Scaling)
│   │   ├── Fig_1_scaling.ipynb            # notebook for recreating scaling experiment result Fig 1
│   │   ├── Fig_supplement.ipynb           # notebook for comparing analytical and SGD-fit linear models (Sup Fig 1)
│   │   ├── generate_simulation_reps.ipynb # notebook for writing simulation replicates config
│   │   ├── Snakemake_wildcard_config.yaml # simulation replicates config
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear and MLP models
│   ├── alphasimr_dilution                 # pipeline for recreating Experiment 2 (Dilution)
│   │   ├── Fig_2_dilution.ipynb           # notebook for recreating dilution experiment result Fig 2
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear, MLP, and feature seln. models
│   ├── alphasimr_pleio                    # pipeline for recreating Experiment 3 (Genetic correlations + MTL)
│   │   ├── Fig_3_pleiotropy.ipynb         # notebook for recreating pleiotropy experiment result Fig 3
│   │   └── Snakemake*.smk                 # pipelines for generating sims, fitting linear and MLP models
│   ├── envs                               # conda environments needed for recreating results
│   └── gpatlas                            # python module with geno-pheno modeling functionality
└── run_snakemake_pipeline.sh              # master script for running all snakemake pipelines
```

Owner

  • Name: Arcadia Science
  • Login: Arcadia-Science
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite the associated publication.
title: Epistasis and deep learning in quantitative genetics
doi: 10.57844/arcadia-25nt-guw3
authors:
- family-names: Bell
  given-names: Audrey
  affiliation: Arcadia Science
  orcid: https://orcid.org/0009-0008-2270-1613
- family-names: Cheveralls
  given-names: Keith
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-4157-6087
- family-names: Sandler
  given-names: George
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0001-9420-1521
- family-names: York
  given-names: Ryan
  affiliation: Arcadia Science
  orcid: https://orcid.org/0000-0002-1073-1494
preferred-citation:
  title: Epistasis and deep learning in quantitative genetics
  type: article
  doi: 10.57844/arcadia-25nt-guw3
  authors:
  - family-names: Sandler
    given-names: George
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0001-9420-1521
  - family-names: York
    given-names: Ryan
    affiliation: Arcadia Science
    orcid: https://orcid.org/0000-0002-1073-1494
  year: 2025

GitHub Events

Total
  • Push event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Create event: 1

Dependencies

workflow_sims/gpatlas/pyproject.toml pypi
  • h5py >=3.12.1
  • numpy >=1.26.4
  • optuna >=3.5.0
  • pandas >=2.2.2
  • scikit-learn >=1.5.1
  • torch >=2.5.1