https://github.com/arcadia-science/2024-organismal-selection
Code associated with the pub "Leveraging evolution to identify novel organismal models of human biology"
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.8%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
2024-organismal-selection
Purpose
This repository contains code for proteome curation, phylogenomic inference, molecular conservation calculations, and analyses related to the pub "Leveraging evolution to identify novel organismal models of human biology".
Installation and Setup
This repository uses conda to manage software environments and package installation. You can find operating system-specific instructions for installing miniconda here.
After installing conda and mamba, you can build the environment. Because the conservation analysis depends on several R packages not distributed through conda, as well as several packages that must be compiled locally from source, you must take two additional steps. First, before building the environment, edit the environment YAML file and uncomment the C/C++ compilers appropriate for your operating system. This section of the environment YAML file is shown below. Currently, a Unix-like environment is assumed, with the Linux-specific compilers uncommented by default. If you are running on macOS, comment out the GCC compilers and uncomment those for Clang.
dependencies:
  # Comment and uncomment the relevant lines below based on your operating system.
  - gcc_linux-64 # Linux (GCC C compiler)
  - gxx_linux-64 # Linux (GCC C++ compiler)
  # - clang_osx-64 # macOS (Clang C compiler)
  # - clangxx_osx-64 # macOS (Clang C++ compiler)
Second, you must run additional build scripts after creating and activating the new conda environment with the appropriate compilers installed. Below, we provide code to carry out the whole process (after modifying the environment YAML file).
```sh
# Create the environment and activate it (after first editing the environment YAML file).
mamba env create -n aa_stats_mv_dists --file envs/aa_stats_mv_dists.yml
conda activate aa_stats_mv_dists

# Install the remaining dependencies within this conda environment:
bash install/install_pathd8.sh
bash install/install_treepl.sh
bash install/install_r_packages_for_aa_stats_mv_dists.sh
```
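If you are unsure which compiler lines to leave uncommented, the choice described above can be sketched in Python. This is only an illustration and not part of the repository: the function name `compilers_for` is hypothetical, and the package names are copied from the YAML excerpt above.

```python
import platform

# Conda compiler packages, copied from the environment YAML excerpt above.
# Keys are the values returned by platform.system().
COMPILERS = {
    "Linux": ["gcc_linux-64", "gxx_linux-64"],
    "Darwin": ["clang_osx-64", "clangxx_osx-64"],  # "Darwin" is macOS
}


def compilers_for(system):
    """Return the compiler packages to leave uncommented for `system`."""
    if system not in COMPILERS:
        raise ValueError(f"no compilers listed for {system!r}")
    return COMPILERS[system]
```

Calling `compilers_for(platform.system())` returns the pair of packages for the machine you are on, or raises `ValueError` on a platform the YAML file does not cover.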
Data
Before proceeding with any (re)analysis, first download the NovelTree run outputs from Zenodo (https://zenodo.org/records/14425432) and decompress them:
```sh
# Download all data and results from Zenodo (note: this file is 13GB).
wget https://zenodo.org/records/14425432/files/2024-organismal-selection-zenodo.zip

# Extract these data:
unzip 2024-organismal-selection-zenodo.zip

# Navigate into the directory and extract the NovelTree run outputs for reanalysis:
cd 2024-organismal-selection-zenodo/
tar -xzvf results-noveltree-model-euks.tar.gz
```
The data hosted on Zenodo include a directory (2024-organismal-selection-zenodo/) containing the following:
- run_configurations/noveltree-model-euks-samplesheet.csv: the samplesheet for our snakemake preprocessing workflow to filter and preprocess species proteomes prior to analysis with NovelTree.
- run_configurations/euk_preprocess_samplesheet.tsv & run_configurations/noveltree-model-euks-parameterfile.json: the NovelTree sample and parameter files used to run NovelTree.
- preprocessed_proteomes.tar.gz: a compressed tarball containing the preprocessed proteomes used by our NovelTree run.
- results-noveltree-model-euks.tar.gz: a compressed tarball containing all outputs generated by our NovelTree run.
- aa-summary-stats.tar.gz: a compressed tarball containing all AA summary statistics generated by code/genefam_aa_summaries.py.
- gf-aa-multivar-distances.tar.gz: a compressed tarball containing all result files produced by code/calc_protein_mv_distances.R.
- organismal_selection_tool_citations.csv: source citations describing available genetic perturbations for organisms in our portfolio.
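A quick way to confirm that the archive unpacked as expected is to check for the files listed above. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes the file names match this list exactly.

```python
from pathlib import Path

# File names as listed in this README, relative to the extracted
# 2024-organismal-selection-zenodo/ directory.
EXPECTED = [
    "run_configurations/noveltree-model-euks-samplesheet.csv",
    "run_configurations/euk_preprocess_samplesheet.tsv",
    "run_configurations/noveltree-model-euks-parameterfile.json",
    "preprocessed_proteomes.tar.gz",
    "results-noveltree-model-euks.tar.gz",
    "aa-summary-stats.tar.gz",
    "gf-aa-multivar-distances.tar.gz",
    "organismal_selection_tool_citations.csv",
]


def missing_files(base_dir, expected=EXPECTED):
    """Return the expected paths that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).exists()]
```

For example, `missing_files("2024-organismal-selection-zenodo")` returns an empty list when every listed file is present.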
Usage
With the NovelTree run outputs downloaded and extracted into the base directory of this repository, we now proceed by calling the script code/genefam_aa_summaries.py. This Python script calculates, for each protein sequence within each gene family, summaries of AA composition and AA physical properties. All code below assumes that you have downloaded and extracted the directory 2024-organismal-selection-zenodo/ from this pub's corresponding Zenodo repository.
```sh
# Ensure we are calling this script within the correct conda environment
conda activate aa_stats_mv_dists

# Set the MSA directory to variable
msa_dir="2024-organismal-selection-zenodo/results-noveltree-model-euks/witch_alignments/original_alignments/"

# Now, run the script to calculate the physicochemical properties of each protein using ProtParam
python code/genefam_aa_summaries.py -t 10 $msa_dir
```
This will create a new directory called "aa-summary-stats/" that contains the calculated AA properties for each protein, summarized for each gene family. With these protein properties curated, we can now proceed with the calculation of pairwise multivariate distances between proteins within each gene family.
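To make the kind of per-protein summary concrete, here is a minimal standard-library sketch of two such properties: amino acid composition and mean Kyte-Doolittle hydropathy (GRAVY). It is an illustration only and not the repository's script, which uses ProtParam; the function names are hypothetical, and skipping gap characters from aligned sequences is an assumption.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residues(seq):
    """Keep only standard residues, dropping gaps and ambiguous symbols."""
    kept = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not kept:
        raise ValueError("sequence contains no standard amino acids")
    return kept


def aa_composition(seq):
    """Fraction of each amino acid among the standard residues in `seq`."""
    kept = residues(seq)
    return {aa: n / len(kept) for aa, n in Counter(kept).items()}


def gravy(seq):
    """Mean Kyte-Doolittle hydropathy over the standard residues in `seq`."""
    kept = residues(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in kept) / len(kept)
```

Both functions accept a sequence taken directly from an alignment, since gap characters such as "-" are ignored before any statistic is computed.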
```sh
Rscript code/calc_protein_mv_distances.R
```
Briefly, this script:
- Reads in the species tree from the NovelTree run results and time-calibrates it using a species tree containing these species obtained from timetree.org
- Reads in species metadata from the NovelTree samplesheet and copy number information
- Reads in the gene family trees and protein properties calculated by code/genefam_aa_summaries.py, retaining only those gene families that contain human proteins, and then, for each gene family:
  - Time-calibrates the gene family tree so branch lengths reflect time, rather than the extent of sequence divergence
  - Uses this tree to transform the AA physical properties such that we correct for phylogenetic non-independence between proteins
  - Calculates multivariate (Mahalanobis) distances between proteins
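The final step can be illustrated with a small standard-library sketch of the Mahalanobis distance, d(u, v) = sqrt((u - v)^T S^-1 (u - v)), where S is a covariance matrix of the protein properties. This is not the R script's implementation: it omits the time calibration and phylogenetic transformation entirely, and all function names are hypothetical.

```python
import math


def covariance(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis(u, v, cov):
    """Mahalanobis distance between vectors u and v under covariance `cov`."""
    diff = [a - b for a, b in zip(u, v)]
    # Solving cov @ x = diff avoids forming the inverse explicitly.
    x = solve(cov, diff)
    return math.sqrt(sum(d * xi for d, xi in zip(diff, x)))
```

With an identity covariance matrix this reduces to the Euclidean distance; a property with larger variance contributes proportionally less to the distance.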
Replicating the analyses of molecular conservation in the pub
Create the conda environment and install the remaining R packages:
```sh
mamba env create -n organismal-selection-analysis --file envs/analysis.yml
conda activate organismal-selection-analysis
Rscript install/install_r_packages_for_analysis.R
```
Next, load and organize the data:
```sh
Rscript code/org-sel-data.R
```
The code to recreate the analyses and figures from the pub is in the script code/org-sel-analysis.R.
Contributing
See how we recognize feedback and contributions to our code.
Owner
- Name: Arcadia Science
- Login: Arcadia-Science
- Kind: organization
- Location: United States of America
- Website: https://www.arcadiascience.com/
- Twitter: ArcadiaScience
- Repositories: 16
- Profile: https://github.com/Arcadia-Science
GitHub Events
Total
- Release event: 1
- Delete event: 2
- Issue comment event: 2
- Push event: 14
- Public event: 1
- Pull request review comment event: 6
- Pull request review event: 7
- Pull request event: 4
- Create event: 3
Last Year
- Release event: 1
- Delete event: 2
- Issue comment event: 2
- Push event: 14
- Public event: 1
- Pull request review comment event: 6
- Pull request review event: 7
- Pull request event: 4
- Create event: 3
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 3 days
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 3 days
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- austinhpatton (1)
- keithchev (1)