hp3

Repository for Host-Pathogen Phylogeny Project. Paper DOI: 10.1038/nature22975

https://github.com/ecohealthalliance/hp3

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Keywords

eha-modeling-analytics eha-predict
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 15
  • Watchers: 12
  • Forks: 6
  • Open Issues: 4
  • Releases: 2
Created almost 10 years ago · Last pushed almost 6 years ago
Metadata Files
Readme License Zenodo

README.md

HP3 Analysis files

This repository contains the code, data, documentation, metadata, and figure source files used in Olival et al. (2017) "Host and Viral Traits Predict Zoonotic Spillover from Mammals." Nature https://dx.doi.org/10.1038/nature22975

Repo Structure

  • documents/ contains two R Markdown documents, each in both raw and readable HTML form, that describe our model-fitting and validation process in more detail than the main paper or supplemental methods: model_summaries.Rmd/html and geographic_cross_validation.Rmd/html.
  • data/ contains data used in these analyses, including
    • our primary database of host-viral associations (associations.csv)
    • databases of host (hosts.csv) and viral (viruses.csv) traits
    • two phylogenetic tree files in Newick format (*.tree). One (supertree_mammals.tree) is a version of the mammalian supertree (Bininda-Emonds et al. 2007) pruned to the subset of mammals in our database. The other (cytb-supertree.tree) is a custom-built cytochrome-B phylogeny constrained to the order-level topology of the mammalian supertree (see supplementary methods).
    • full references for all associations in our database (references.txt)
    • An intermediates/ directory with derived data (species phylogenetic distance matrices and PVR-corrected host mass)
    • A metadata.csv file that describes variables in our database and derived variables used in model-fitting
    • IUCN_taxonomy_23JUN2016.csv, data from IUCN used to harmonize our data with IUCN spatial data (see Supplementary Methods)
    • Genbank_accession_cytb.csv, two Genbank accession numbers used in constructing the Cyt-B constrained tree
    • region_names.rds, a list of zoogeographical region names used to describe cross-validation regions.
  • figures/ contains figures and tables in the paper and extended data and the scripts to generate them, including a maps/ subdirectory with individual maps that are stitched together for the main and extended figures.
  • scripts/ contains all the scripts used to fit the models and generate outputs
  • R/ contains files with functions used in other scripts.
  • misc/ contains small scripts used for other calculations
  • intermediates/ is a holding directory for intermediate data files and fitted model objects in *.rds R data form. These are re-created when the project is built
  • shapefiles/ is an empty holding directory. Large shapefiles used to generate maps and in analyses are stored separately on AWS to limit the size of this repository. They are downloaded to this folder by the scripts when needed.

Listing of files

```
README.md     | This file in .md format
README.txt    | This file in .txt format
HP3.Rproj     | RStudio project organization file
Makefile      | Makefile for building project
.zenodo.json  | Metadata file for ZENODO repository

data/
  associations.csv             | associations database
  cytbsupertree.tree           | tree file for Cyt-b constrained version of mammal supertree
  Genbank_accession_cytb.csv   | Genbank accession numbers used for calculating the Cyt-b constrained tree
  hosts.csv                    | hosts database
  IUCN_taxonomy_23JUN2016.csv  | IUCN taxonomy to harmonize IUCN spatial data with hosts database
  metadata.csv                 | listing of variables in hosts, viruses, and associations databases
  references.txt               | listing of reference sources for associations database
  region_names.rds             | R object of zoogeographical region names for cross-validation
  supertree_mammals.tree       | tree file for mammal supertree
  viruses.csv                  | viruses database
  intermediate/                | Intermediate data files calculated by scripts, primarily phylogenetic distance matrices

documents/
  model_summaries.Rmd               | R-markdown document of GAM model summaries and diagnostics
  model_summaries.html              | Compiled HTML of above
  geographic_cross_validation.Rmd   | R-markdown geospatial diagnostics of models
  geographic_cross_validation.html  | Compiled HTML of above

figures/                                     | Figures and tables for manuscript and supplements
  Figure01A-boxplots.pdf                     |
  Figure01B-boxplots.pdf                     |
  Figure02-all-gams.svg                      |
  Figure03-missing-zoo-maps.png              |
  Figure04-viral-traits.svg                  |
  ExtendedFigure03-ALL.png                   |
  ExtendedFigure04-CARNIVORA.png             |
  ExtendedFigure05-CETARTIODACTYLA.png       |
  ExtendedFigure06-CHIROPTERA.png            |
  ExtendedFigure07-PRIMATES.png              |
  ExtendedFigure08-RODENTIA.png              |
  ExtendedTable01-models.docx                |
  SuppTable1-observed-predicted-missing.csv  |
  maps/                                      | Individual maps stitched together for figures.

misc/                             | Assorted side-analyses
  calc-bat-special.R              | Calculates significance of bat order effect in GAM
  genhostspatialdata.R            | Used for generating host zoogeographies shapefile
  phylo-primates.Rmd              | Examination of phylogenetic effects specific to primates
  calc-pred-obs-correlation.R     | Alternative measures of model fit
  zoonoticdevexplainedw_offset.R  | For calculating deviance explained in models with offsets

R/                         | Functions used in scripts and R markdown documents
  avggamvis.R              | Functions for visualizing the average GAM of an ensemble
  crossvalidation.R        | Cross validation
  cvgamby.R                | Zoogeographical cross-validation
  fitgam.R                 | Fitting ensembles of gam models
  logp.R                   | Log function with offset for zeros
  modelreduction.R         | Dropping non-predictive variables from models
  relativecontributions.R  | Calculating the explained deviance from different variables in a model
  utils.R                  | Miscellaneous utility functions

scripts/                                       | Scripts to build project outputs
  01-download-shapefiles.R                     | Fetch shapefiles from storage on Amazon AWS
  02-generatephylogeneticintermediatedata.R    | Calculate phylogenetic distance matrices and PVR-adjusted body mass
  03-preprocessdata.R                          | Data cleaning and merging
  04-fit-models.R                              | Fit the GAMs in the paper
  05-make-Figure01-boxplots.R                  | Generate boxplots in Figure 1
  06-make-Figure02-all-gams.R                  | Generate Figure 2
  07-make_maps.R                               | Generate all maps
  08-make-Figure03-ExtendedFigs-stitch-maps.R  | Assemble maps together into Figure 3 and Extended Figures
  09-make-Figure04-viral-traits.R              | Generate Figure 4
  10-make-ExtendedFigure02-heatmap.R           | Generate heat map for Extended Figure 2
  11-make-ExtendedTable01-models.R             | Generate Extended Table 1 of model summaries
  12-make-SuppTable01-predictions.R            | Generate supplemental table of observed and predicted viruses and zoonoses by species

intermediates/  | Holds intermediate fitted model objects when project is built
shapefiles/     | Holds large shapefiles downloaded when project is built
packrat/        | Holds all R package dependencies
.Rprofile       | Configures R to use packrat dependencies
```

Reproducing the analysis

The Makefile in this repository holds the project workflow. Running make all in the repository directory rebuilds the project. make clean removes shapefiles, intermediate data, fitted models, and all figures and maps. If the project is opened in RStudio, the same can be done with the "Build All" and "Clean" buttons in the Build tab.
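The make targets described above can be wrapped in a small helper. This is a minimal sketch, not part of the repository: the function name build_project and its two arguments (a checkout path and a target name) are hypothetical, and it assumes make is installed and that the Makefile sits at the top of the checkout, as the file listing suggests.

```shell
#!/bin/sh
# Hypothetical helper: run one Makefile target inside a checkout of the project.
#   $1  path to the repository checkout (defaults to the current directory)
#   $2  make target to run, e.g. "all" or "clean" (defaults to "all")
build_project() {
    repo_dir=${1:-.}
    target=${2:-all}
    # A subshell keeps the caller's working directory unchanged.
    # The function's exit status is that of cd, or of make if cd succeeds.
    (
        cd "$repo_dir" && make "$target"
    )
}
```

For example, `build_project . clean` followed by `build_project .` would discard all generated outputs and rebuild them from scratch, which is the slow path given the build times quoted in this README.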

This project uses packrat to manage R package dependencies. Running packrat::restore() installs the package versions used in this project. In addition, these packages have the following system requirements: cairo, gdal, GEOS, libmagick++-, Java, libcurl, libpng, libxml2, OpenSSL, and pandoc. All analyses were performed using R 3.3.2 under Ubuntu 14.04. A complete build takes approximately 1 hour with 40 cores and 256 GB of memory, or approximately 8 hours on a 2-core MacBook Pro with 16 GB of memory.
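Because a missing system library tends to surface only partway through a long build, it can help to check for the command-line tools first. The sketch below is an illustration and not part of the repository. The function name check_tools is hypothetical, and the list of probe commands (gdal-config, geos-config, curl-config, xml2-config) is an assumption about what the development packages for the libraries above usually install; names may differ by platform, and libraries without a command-line tool are not covered.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print each named command that is not on PATH.
# Returns 0 if every command was found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed probe commands for the requirements listed in this README.
check_tools R make pandoc java gdal-config geos-config curl-config xml2-config ||
    echo "Install the tools listed above before building." >&2
```

If nothing is reported missing, the remaining steps are the ones this README describes: packrat::restore() from an R session in the project directory, then make all.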

Owner

  • Name: EcoHealth Alliance
  • Login: ecohealthalliance
  • Kind: organization
  • Email: tech@ecohealthalliance.org
  • Location: New York, NY

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 314
  • Total Committers: 8
  • Avg Commits per committer: 39.25
  • Development Distribution Score (DDS): 0.564
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Noam Ross n****s@g****m 137
Anna Willoughby w****y@e****g 106
Cale Basaraba b****a@e****g 33
Carlos Zambrana-Torrelio c****t@g****m 17
kevinolival o****l@e****g 15
Cale Basaraba b****a@e****h 3
Noam Ross r****s@e****g 2
Your Name y****u@e****m 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 14
  • Total pull requests: 10
  • Average time to close issues: 7 months
  • Average time to close pull requests: 4 days
  • Total issue authors: 3
  • Total pull request authors: 3
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.6
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • noamross (6)
  • arw36 (3)
  • jhpoelen (1)
Pull Request Authors
  • noamross (4)
  • calebasaraba (3)
  • arw36 (1)
Top Labels
Issue Labels
analysis (3) visualization (1) cleanup (1) reproducibility (1)