https://github.com/arcadia-science/2024-organismal-selection

Code associated with the pub "Leveraging evolution to identify novel organismal models of human biology"

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Arcadia-Science
  • License: mit
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 19.6 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md

2024-organismal-selection

Purpose

This repository contains code for proteome curation, phylogenomic inference, molecular conservation calculations, and analyses related to the pub "Leveraging evolution to identify novel organismal models of human biology".

Installation and Setup

This repository uses conda to manage software environments and package installation. You can find operating system-specific instructions for installing miniconda here.

After installing conda and mamba, you can build the environment. Because the conservation analysis depends on several R packages that are not distributed through conda, as well as several packages that must be compiled locally from source, you must take two additional steps. First, before building the environment, edit the environment YAML file and uncomment the C/C++ compilers appropriate for your operating system. The relevant section of the file is shown below. A Unix-like environment is assumed, with the Linux compilers uncommented by default. If you are running on macOS, comment out the GCC compilers and uncomment the Clang ones.

    dependencies:
      # Comment and uncomment the relevant lines below based on your operating system.
      - gcc_linux-64      # Linux (GCC C compiler)
      - gxx_linux-64      # Linux (GCC C++ compiler)
      # - clang_osx-64    # macOS (Clang C compiler)
      # - clangxx_osx-64  # macOS (Clang C++ compiler)

Second, you must run additional build scripts after creating and activating the new conda environment with the appropriate compilers installed. Below, we provide code to carry out the whole process (after modifying the environment YAML file).

```sh
# Create the environment and activate it (after first editing the environment YAML file).
mamba env create -n aa_stats_mv_dists --file envs/aa_stats_mv_dists.yml
conda activate aa_stats_mv_dists

# Install the remaining dependencies within this conda environment:
bash install/install_pathd8.sh
bash install/install_treepl.sh
bash install/install_r_packages_for_aa_stats_mv_dists.sh
```

Data

Before proceeding with any (re)analysis, first download the NovelTree run outputs from Zenodo and decompress them:

```sh
# Download all data and results from Zenodo (note: this file is 13 GB).
wget https://zenodo.org/records/14425432/files/2024-organismal-selection-zenodo.zip

# Extract these data:
unzip 2024-organismal-selection-zenodo.zip

# Navigate into the directory and extract the NovelTree run outputs for reanalysis:
cd 2024-organismal-selection-zenodo/
tar -xzvf results-noveltree-model-euks.tar.gz
```

The data hosted on Zenodo include a directory (2024-organismal-selection-zenodo/) containing the following:

  • run_configurations/euk_preprocess_samplesheet.tsv - the samplesheet for our Snakemake preprocessing workflow, which filters and preprocesses species proteomes prior to analysis with NovelTree.
  • run_configurations/noveltree-model-euks-samplesheet.csv & run_configurations/noveltree-model-euks-parameterfile.json - the sample and parameter files used to run NovelTree.
  • preprocessed_proteomes.tar.gz - a compressed tarball containing the preprocessed proteomes used by our NovelTree run.
  • results-noveltree-model-euks.tar.gz - a compressed tarball containing all outputs generated by our NovelTree run.
  • aa-summary-stats.tar.gz - a compressed tarball containing all AA summary statistics generated by code/genefam_aa_summaries.py.
  • gf-aa-multivar-distances.tar.gz - a compressed tarball containing all result files produced by code/calc_protein_mv_distances.R.
  • organismal_selection_tool_citations.csv - source citations describing available genetic perturbations for organisms in our portfolio.
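
Before moving on, it can be worth confirming that the archive unpacked as expected. The following is a minimal sketch and not part of this repository: the helper name `missing_zenodo_files` is hypothetical, and the file list is simply copied from the bullets above, so adjust it if the archive layout differs.

```python
from pathlib import Path

# Relative paths copied from the list above (an assumption about the archive layout).
EXPECTED_FILES = [
    "run_configurations/noveltree-model-euks-samplesheet.csv",
    "run_configurations/euk_preprocess_samplesheet.tsv",
    "run_configurations/noveltree-model-euks-parameterfile.json",
    "preprocessed_proteomes.tar.gz",
    "results-noveltree-model-euks.tar.gz",
    "aa-summary-stats.tar.gz",
    "gf-aa-multivar-distances.tar.gz",
    "organismal_selection_tool_citations.csv",
]


def missing_zenodo_files(base_dir="2024-organismal-selection-zenodo"):
    """Return the expected files that are absent from base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_zenodo_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```

Run it from the directory that contains 2024-organismal-selection-zenodo/, or pass a different path to `missing_zenodo_files`.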

Usage

With the NovelTree run outputs downloaded and extracted into the base directory of this repository, we now proceed by calling the script code/genefam_aa_summaries.py. For each protein sequence within each gene family, this Python script calculates summaries of AA composition as well as AA physical properties. All code below assumes that you have downloaded and extracted the directory 2024-organismal-selection-zenodo/ from this pub's corresponding Zenodo repository.
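
To make "summaries of AA composition" concrete, the following is a minimal sketch of one such summary: per-sequence amino acid fractions, with alignment gaps ignored. It is not the code in code/genefam_aa_summaries.py, which uses ProtParam and reports more properties. The function name `aa_composition` and the toy sequence are assumptions made for illustration, and the sketch needs no input files.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(aligned_seq):
    """Return the fraction of each standard amino acid in one sequence.

    Gap characters and non-standard symbols (e.g. '-', 'X', '*') are ignored,
    so a row taken from a multiple sequence alignment can be passed directly.
    """
    residues = [aa for aa in aligned_seq.upper() if aa in STANDARD_AAS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in STANDARD_AAS}


if __name__ == "__main__":
    # Toy aligned sequence; not taken from the actual data.
    comp = aa_composition("MK--TAYIAK")
    print({aa: round(frac, 3) for aa, frac in comp.items() if frac > 0})
```

Dividing by the number of ungapped residues rather than the alignment length keeps the fractions comparable between proteins that differ in how many gaps they carry.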

```sh
# Ensure we are calling this script within the correct conda environment
conda activate aa_stats_mv_dists

# Set the MSA directory to a variable
msa_dir="2024-organismal-selection-zenodo/results-noveltree-model-euks/witch_alignments/original_alignments/"

# Now, run the script to calculate the physicochemical properties of each protein using ProtParam
python code/genefam_aa_summaries.py -t 10 "$msa_dir"
```

This will create a new directory called "aa-summary-stats/" that contains the calculated AA properties for each protein, summarized for each gene family. With these protein properties calculated, we can now proceed to calculate pairwise multivariate distances between proteins within each gene family.

    Rscript code/calc_protein_mv_distances.R

Briefly, this script:

  1. Reads in the species tree from the NovelTree run results and time-calibrates it using a species tree containing these species obtained from timetree.org.
  2. Reads in species metadata from the NovelTree samplesheet, along with copy number information.
  3. Reads in the gene family trees and the protein properties calculated by code/genefam_aa_summaries.py, retaining only those gene families that contain human proteins. Then, for each gene family, it:
    • Time-calibrates the gene family tree so that branch lengths reflect time rather than the extent of sequence divergence.
    • Uses this tree to transform the AA physical properties, correcting for phylogenetic non-independence between proteins.
    • Calculates multivariate (Mahalanobis) distances between proteins.
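
As an illustration of the final step only, the sketch below computes a Mahalanobis distance, sqrt((u - v)^T S^{-1} (u - v)), between two rows of a small property matrix. It is not the R code in code/calc_protein_mv_distances.R: it skips the time calibration and phylogenetic transformation entirely, it estimates the covariance matrix S from the toy data itself (an assumption), and all function names and values are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(row) + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(u, v, cov):
    """Mahalanobis distance between u and v under covariance matrix cov."""
    diff = [a - b for a, b in zip(u, v)]
    weighted = solve(cov, diff)  # cov^{-1} @ diff without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


if __name__ == "__main__":
    # Toy matrix: rows are proteins, columns are two made-up properties.
    props = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
    cov = covariance_matrix(props)
    print(mahalanobis(props[0], props[1], cov))
```

Unlike Euclidean distance, this measure rescales each property by its variance and accounts for correlation between properties, so no single property dominates the distance. With an identity covariance matrix the two distances coincide.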

Replicating the analyses of molecular conservation in the pub

Create the conda environment and install the remaining R packages:

```sh
mamba env create -n organismal-selection-analysis --file envs/analysis.yml

conda activate organismal-selection-analysis

Rscript install/install_r_packages_for_analysis.R
```

Next, load and organize the data:

    Rscript code/org-sel-data.R

The code to recreate the analyses and figures from the pub is in the script code/org-sel-analysis.R.

Contributing

See how we recognize feedback and contributions to our code.

Owner

  • Name: Arcadia Science
  • Login: Arcadia-Science
  • Kind: organization
  • Location: United States of America

GitHub Events

Total
  • Release event: 1
  • Delete event: 2
  • Issue comment event: 2
  • Push event: 14
  • Public event: 1
  • Pull request review comment event: 6
  • Pull request review event: 7
  • Pull request event: 4
  • Create event: 3
Last Year
  • Release event: 1
  • Delete event: 2
  • Issue comment event: 2
  • Push event: 14
  • Public event: 1
  • Pull request review comment event: 6
  • Pull request review event: 7
  • Pull request event: 4
  • Create event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • austinhpatton (1)
  • keithchev (1)