https://github.com/azmigueldario/hsci478_assignment

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 13 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.4%) to scientific vocabulary
Last synced: 6 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: azmigueldario
  • Language: Shell
  • Default Branch: main
  • Size: 9.56 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Instructions for Nextstrain assignment analysis

The input data is taken from GISAID (EPISET240315sp) as specified in the paper by Piccoli et al. (2024) with a few modifications to improve the runtime. The specific accession numbers are available in the Supplementary table 1 of the paper.

Environment requirements

Most bioinformatics pipelines require a Unix-like environment, which can be Linux, macOS, or the Windows Subsystem for Linux (WSL2) on Windows. We use the conda environment manager to install all dependencies and help ensure reproducibility.

We will mainly use Augur, the Nextstrain bioinformatics toolkit. The installation instructions and additional details are available in the Nextstrain documentation.

Once you have conda (or mamba) ready, you can use the environment.yml from this folder to reproduce the environment:

```sh
# with conda
conda env create --name augur -f environment.yml

# or, with mamba
mamba env create --name augur -f environment.yml
```
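Before running the pipeline, it can help to confirm that the tools are reachable from the activated environment. The following is a minimal sketch: `check_tools` is a hypothetical helper, and the executable names in the commented example (`augur`, `mafft`, `iqtree2`) are assumptions that may differ between installations.

```sh
# Report which of the given command names are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (assumed executable names; adjust to your installation):
# conda activate augur
# check_tools augur mafft iqtree2
```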

Preliminary steps

To optimize runtime and provide additional context for the samples, we run a few preliminary filtering steps. During subsampling, we group the strains by month (--group-by) and keep at most 150 strains overall (--subsample-max-sequences). A few samples were flagged as having poor quality by the authors, so we exclude those from the final set (--exclude). To make the subsampling reproducible, a random seed is set in the command (--subsample-seed).

```sh
# Time: 2 minutes each

# index assemblies information for filtering by quality
augur index \
    --sequences inputdata/studysequences.fasta \
    --output processeddata/studyindex.tsv

# filter out sequences
augur filter \
    --sequences inputdata/studysequences.fasta \
    --sequence-index processeddata/studyindex.tsv \
    --metadata inputdata/studymetadata.tsv \
    --exclude inputdata/poorqualitygenomes.txt \
    --output processeddata/filteredstudy.fasta \
    --output-metadata processeddata/metadata_filtered.tsv \
    --group-by month \
    --subsample-max-sequences 150 \
    --max-length 29900 \
    --min-date 2020 \
    --subsample-seed 455
```
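To see how the retained strains are distributed across months after filtering, the filtered metadata can be tabulated. This is a minimal sketch, not part of the original workflow: `count_by_month` is a hypothetical helper, and it assumes a tab-separated metadata file whose header contains a `date` column with values such as `2021-03-15`.

```sh
# Count metadata rows per calendar month (YYYY-MM).
# Assumes a tab-separated file with a header row that contains a "date"
# column holding values such as 2021-03-15.
count_by_month() {
    awk -F '\t' '
        NR == 1 {
            # locate the "date" column in the header
            for (i = 1; i <= NF; i++) if ($i == "date") col = i
            if (!col) { print "no date column found" > "/dev/stderr"; exit 1 }
            next
        }
        { counts[substr($col, 1, 7)]++ }
        END { for (m in counts) print m "\t" counts[m] }
    ' "$1" | sort
}

# Example, using the filtered metadata written by the command above:
# count_by_month processeddata/metadata_filtered.tsv
```

Each output line holds a month and the number of strains sampled in it, so the per-month counts should add up to no more than the subsampling limit.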

To provide additional context, a few more Brazilian strains are added after filtering. The .fasta files can be merged with a simple `cat` on the command line, but it's better to handle the metadata carefully, so we merge the contextual metadata with augur too.

```sh
# concatenate the complementary and filtered sequences
cat inputdata/complementarysequences.fasta \
    processeddata/filteredstudy.fasta > processeddata/assignmentsequences.fasta

# merge metadata tables
augur merge \
    --metadata STUDY=processeddata/metadata_filtered.tsv \
               ADDITIONAL=inputdata/complementarymetadata.tsv \
    --output-metadata processeddata/mergedmetadata.tsv
```
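After merging, a quick consistency check can confirm that every sequence has a matching metadata row. The sketch below is illustrative only: both helpers are hypothetical, and it assumes that the sequence name is the first word of each FASTA header and that the strain identifier sits in the first column of the tab-separated metadata.

```sh
# Number of records in a FASTA file (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Print FASTA record names that have no row in the metadata table.
# Assumes the sequence name is the first word of the header line and that
# the metadata identifier is in the first tab-separated column.
missing_metadata() {
    fasta=$1
    metadata=$2
    awk -F '\t' '
        # first file (metadata): remember every identifier, skipping the header
        NR == FNR { if (FNR > 1) ids[$1] = 1; next }
        # second file (FASTA): report headers whose name is not in the metadata
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*/, "", name)
            if (!(name in ids)) print name
        }
    ' "$metadata" "$fasta"
}

# Example, using the files produced above:
# count_fasta_records processeddata/assignmentsequences.fasta
# missing_metadata processeddata/assignmentsequences.fasta processeddata/mergedmetadata.tsv
```

An empty result from `missing_metadata` means every sequence name was found in the metadata table.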

Building phylogenetic tree

Create a multiple sequence alignment to identify differences among sequences; the alignment is a necessary input for the tree-building and visualization steps.

  • The reference sequence is the Wuhan-Hu-1 strain from the beginning of the pandemic https://www.ncbi.nlm.nih.gov/nuccore/MN908947
  • The new alignment is used to produce a phylogenetic tree in Newick format. The tool behind this process is IQ-TREE 2, which uses a maximum likelihood approach with a GTR model
  • Branch lengths in the output tree are expressed in substitutions per site (SNVs per site)

```sh
# Time: 1 minute each

augur align \
    --sequences processeddata/assignmentsequences.fasta \
    --reference-sequence inputdata/referenceMN908947.3.fasta \
    --output processeddata/alignmentassignment.fasta \
    --method mafft

augur tree \
    --alignment processeddata/alignmentassignment.fasta \
    --method iqtree \
    --output processeddata/treeraw.nwk
```
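As a sanity check that the tree contains the expected number of strains, the tips of a Newick file can be counted from the command line. This is a rough sketch with a hypothetical helper, `count_newick_tips`; it relies on the fact that a tree with n tips contains n - 1 commas, and it assumes that no strain name contains a comma.

```sh
# Count tips in a Newick tree: a tree with n tips contains n - 1 commas,
# provided no label contains a comma (an assumption of this sketch).
count_newick_tips() {
    echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

# Example, using the tree written by the command above:
# count_newick_tips processeddata/treeraw.nwk
```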

Basic phylodynamics

With augur, we can try to approximate the ancestral relationships among the strains using the sampling dates available in the metadata. The tool employed in this process is called TreeTime.

  • We use the metadata file we prepared previously, infer a time-scaled tree (--timetree), and keep the existing root instead of re-rooting the tree (--keep-root)
  • If tips deviate too much from the root-to-tip regression line, they will be removed from the tree; the threshold is given in interquartile distances (--clock-filter-iqd)
  • The new tree will have branch lengths that reflect the number of mutations instead of mutations per site (--divergence-units)

```sh
# Time: 3 minutes

augur refine \
    --tree processeddata/treeraw.nwk \
    --alignment processeddata/alignmentassignment.fasta \
    --metadata processeddata/mergedmetadata.tsv \
    --output-tree results/timetree.nwk \
    --output-node-data results/branchlengths.json \
    --divergence-units mutations \
    --keep-root \
    --timetree \
    --clock-filter-iqd 4
```
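To build intuition for the two divergence units, a branch length in substitutions per site can be converted to an approximate number of mutations by multiplying it by the genome length. The sketch below is a back-of-the-envelope illustration only: the helper name and the example branch length of 0.0001 are made up, 29903 is the length of the MN908947.3 reference, and the values augur reports may differ because they are derived from the alignment.

```sh
# Approximate number of mutations for a branch length given in
# substitutions per site: length * number_of_sites.
subs_per_site_to_mutations() {
    awk -v len="$1" -v sites="$2" 'BEGIN { printf "%.2f\n", len * sites }'
}

# 0.0001 substitutions per site (illustrative value) on a 29903-nt genome
subs_per_site_to_mutations 0.0001 29903   # prints 2.99
```

In other words, a branch of 0.0001 substitutions per site corresponds to roughly three mutations on a SARS-CoV-2 genome.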

Now that we have adjusted the branch lengths according to the sample date, we will recalculate possible ancestral relationships based on the geographical location described in the metadata. This will predict the most likely location at the internal nodes (most recent common ancestors).

  • In other settings, this could produce interesting data. Here, all samples come from a single region and only from humans, so no relevant changes will be shown.

```sh
# Time: 1 minute

augur traits \
    --tree results/timetree.nwk \
    --metadata processeddata/mergedmetadata.tsv \
    --columns pangolinlineage \
    --output-node-data results/trait_node.json \
    --confidence
```

Finally, we reconstruct the nucleotide sequence at each internal node to identify the mutations that separate it from its descendants. This is a nice way to visualize ancestral relationships or possible recombinants.

```sh
# optional

augur ancestral \
    --tree results/timetree.nwk \
    --alignment processeddata/alignmentassignment.fasta \
    --output-node-data results/ntmuts.json \
    --inference joint
```

Export the data

Now we have everything we need to visualize the results on the web. After exporting the files, load them into the Nextstrain Auspice web tool. Augur writes a single JSON file that contains the phylogenetic relationships and the node data we just inferred.

```sh
# Time: 10 seconds

augur export v2 \
    --tree results/timetree.nwk \
    --metadata processeddata/mergedmetadata.tsv \
    --node-data results/branchlengths.json \
                results/trait_node.json \
    --maintainers "hsci/mbb478" \
    --title "Assignment 3 - viral surveillance" \
    --output results/analysis-package.json \
    --geo-resolutions division
```

References

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. [ Note: The complete list of used sequences is available in the file samplelistgisaid.txt ]

GISAID - Elbe, S. and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID's innovative contribution to global health. Global Challenges, 1:33-46.

Nextstrain - Hadfield, J., et al. (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 34(23):4121-4123.

IQ-TREE 2 - Minh, B.Q., et al. (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534.

TreeTime - Sagulenko, P., Puller, V. and Neher, R.A. (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution, 4(1):vex042.

MAFFT - Katoh, K. and Standley, D.M. (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in performance and usability. Mol. Biol. Evol., 30(4):772-780.

Owner

  • Name: Miguel D Prieto G
  • Login: azmigueldario
  • Kind: user
  • Location: Vancouver, Canada
  • Company: Simon Fraser University, CIDGOH

M.D., MSc. Research interests include public health, infectious diseases and molecular epidemiology

GitHub Events

Total
  • Push event: 6
  • Create event: 2
Last Year
  • Push event: 6
  • Create event: 2