https://github.com/azmigueldario/hsci478_assignment

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 13 DOI reference(s) in README
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.4%) to scientific vocabulary

Last synced: 6 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: azmigueldario
Language: Shell
Default Branch: main
Size: 9.56 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme

Instructions for Nextstrain assignment analysis

The input data is taken from GISAID (EPISET240315sp) as specified in the paper by Piccoli et al. (2024) with a few modifications to improve the runtime. The specific accession numbers are available in the Supplementary table 1 of the paper.

Environment requirements

Most bioinformatics pipelines require a Unix-like environment, which can be Linux, macOS, or the Windows Subsystem for Linux (WSL2) on Windows. We use the conda environment manager to install all dependencies and guarantee reproducibility.

We will mainly use Nextstrain - Augur. The installation instructions and additional details are available at the Nextstrain documentation.

Once you have conda (or mamba) ready, you can use the environment.yml from this folder to reproduce the environment:

```sh
# with conda
conda env create --name augur --file environment.yml

# or with mamba
mamba env create --name augur --file environment.yml
```
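Once the environment exists, it may help to confirm that the tools used in the later steps are reachable. The sketch below assumes the environment was activated with `conda activate augur` and that the executables are named `augur` and `mafft`; both names are assumptions that can differ between installations, and `check_tool` is a hypothetical helper, not part of Augur.

```sh
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tool names are assumptions; adjust them to match your installation.
for tool in augur mafft; do
    check_tool "$tool" || echo "  -> is the 'augur' environment activated?"
done
```

A "missing" line usually means the environment is not active in the current shell.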

Preliminary steps

To reduce runtime, we first filter the dataset. We subsample a maximum of 150 strains (--subsample-max-sequences) after grouping by month (--group-by), and we exclude the samples that the authors flagged as poor quality (--exclude). The command also applies a length cap (--max-length) and a minimum collection date (--min-date). A fixed random seed (--subsample-seed) makes the subsampling reproducible.

```sh
# Time: 2 minutes each

# index assemblies information for filtering by quality
augur index \
    --sequences inputdata/studysequences.fasta \
    --output processeddata/studyindex.tsv

# filter out sequences
augur filter \
    --sequences inputdata/studysequences.fasta \
    --sequence-index processeddata/studyindex.tsv \
    --metadata inputdata/studymetadata.tsv \
    --exclude inputdata/poorqualitygenomes.txt \
    --output processeddata/filteredstudy.fasta \
    --output-metadata processeddata/metadata_filtered.tsv \
    --group-by month \
    --subsample-max-sequences 150 \
    --max-length 29900 \
    --min-date 2020 \
    --subsample-seed 455
```
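To confirm that filtering behaved as expected, one could compare the number of sequences before and after the step. This is a minimal sketch that assumes the file paths used above; `count_fasta_records` is a hypothetical helper that simply counts FASTA header lines.

```sh
# Hypothetical helper: count records in a FASTA file (one '>' header per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Paths follow the commands above; files that are missing are skipped.
for fasta in inputdata/studysequences.fasta processeddata/filteredstudy.fasta; do
    if [ -f "$fasta" ]; then
        echo "$fasta: $(count_fasta_records "$fasta") sequences"
    fi
done
```

Given the subsampling limit, the filtered file would be expected to hold roughly 150 sequences or fewer.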

To provide additional context, a few more Brazilian strains are added after filtering. The .fasta files can be concatenated with cat, but the metadata needs more careful handling, so we merge the contextual metadata with augur.

```sh
# concatenate the complementary and filtered sequences
cat inputdata/complementarysequences.fasta \
    processeddata/filteredstudy.fasta > processeddata/assignmentsequences.fasta

# merge metadata tables (the STUDY input is the metadata written by augur filter)
augur merge \
    --metadata STUDY=processeddata/metadata_filtered.tsv \
               ADDITIONAL=inputdata/complementarymetadata.tsv \
    --output-metadata processeddata/mergedmetadata.tsv
```
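After merging, it can be worth checking that every sequence still has a metadata row, since later steps match the two by name. The sketch below assumes that the first column of the metadata table holds the sequence name and that FASTA headers contain only that name; `missing_metadata` is a hypothetical helper.

```sh
# Hypothetical helper: list FASTA record names that have no row in a metadata TSV.
# Assumes the first TSV column holds the name and headers carry no extra description.
missing_metadata() {
    fasta_ids=$(mktemp)
    meta_ids=$(mktemp)
    grep '^>' "$1" | sed 's/^>//' | sort -u > "$fasta_ids"
    tail -n +2 "$2" | cut -f 1 | sort -u > "$meta_ids"
    # comm -23 keeps the lines found only in the first (FASTA) list
    comm -23 "$fasta_ids" "$meta_ids"
    rm -f "$fasta_ids" "$meta_ids"
}

if [ -f processeddata/assignmentsequences.fasta ] && [ -f processeddata/mergedmetadata.tsv ]; then
    missing_metadata processeddata/assignmentsequences.fasta processeddata/mergedmetadata.tsv
fi
```

No output means every sequence name was found in the metadata.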

Building phylogenetic tree

Create a multiple sequence alignment to identify differences among sequences; it is a necessary input for tree building and for the visualization step.

- The reference sequence is the Wuhan-Hu-1 strain from the beginning of the pandemic: https://www.ncbi.nlm.nih.gov/nuccore/MN908947
- The new alignment is used to produce a phylogenetic tree in Newick format. The tree is inferred with IQ-TREE 2, a maximum likelihood approach, using a GTR model.
- Branch lengths in the output tree are expressed in substitutions per site.

```sh
# Time: 1 minute each

augur align \
    --sequences processeddata/assignmentsequences.fasta \
    --reference-sequence inputdata/referenceMN908947.3.fasta \
    --output processeddata/alignmentassignment.fasta \
    --method mafft

augur tree \
    --alignment processeddata/alignmentassignment.fasta \
    --method iqtree \
    --output processeddata/treeraw.nwk
```
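A quick sanity check on the alignment is that all sequences share one length. The following sketch assumes the alignment path used above; `alignment_lengths` is a hypothetical helper that handles sequences wrapped over several lines.

```sh
# Hypothetical helper: print each distinct sequence length found in a FASTA file.
# A valid multiple sequence alignment yields exactly one length.
alignment_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (seen) print len }
    ' "$1" | sort -n | uniq
}

if [ -f processeddata/alignmentassignment.fasta ]; then
    alignment_lengths processeddata/alignmentassignment.fasta
fi
```

A single number in the output indicates that all sequences have the same length, as expected for an alignment.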

Basic phylodynamics

With augur, we can try to approximate the ancestral relationships among the strains using the sampling dates available in the metadata. The tool employed in this process is called TreeTime.

- We use the metadata file prepared previously, infer a time-scaled tree (--timetree), and keep the existing root (--keep-root).
- Tips that deviate too much from the root-to-tip regression line are removed from the tree (--clock-filter-iqd); the value sets how many interquartile distances are tolerated.
- Divergence is reported as a number of mutations instead of mutations per site (--divergence-units).

```sh
# Time: 3 minutes

augur refine \
    --tree processeddata/treeraw.nwk \
    --alignment processeddata/alignmentassignment.fasta \
    --metadata processeddata/mergedmetadata.tsv \
    --output-tree results/timetree.nwk \
    --output-node-data results/branchlengths.json \
    --divergence-units mutations \
    --keep-root \
    --timetree \
    --clock-filter-iqd 4
```

Now that the branch lengths reflect the sampling dates, we reconstruct the ancestral state of a discrete trait recorded in the metadata, such as geographical location. This predicts the most likely state at each internal node (most recent common ancestor). The trait that is reconstructed is the column passed to --columns.

In other settings, this could produce interesting data. Here, all samples come from a single region and from human hosts only, so no relevant changes are expected for those traits.

```sh
# Time: 1 minute

augur traits \
    --tree results/timetree.nwk \
    --metadata processeddata/mergedmetadata.tsv \
    --columns pangolinlineage \
    --output-node-data results/trait_node.json \
    --confidence
```
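Whether a trait is worth reconstructing depends on how much it varies across samples. As a rough check, the sketch below tallies the values of one metadata column; it assumes a tab-separated file with a header row, and `tally_column` is a hypothetical helper. The column name is the one passed to --columns above.

```sh
# Hypothetical helper: tally the values of one named column in a TSV file.
tally_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx     { counts[$idx]++ }
        END     { for (v in counts) print counts[v] "\t" v }
    ' "$1" | sort -rn
}

if [ -f processeddata/mergedmetadata.tsv ]; then
    tally_column processeddata/mergedmetadata.tsv pangolinlineage
fi
```

A column with a single value, or one dominant value, will show few or no state changes on the tree.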

Finally, we reconstruct the nucleotide sequence at each internal node and record the mutations that separate it from its descendants. This is a useful way to visualize possible recombinants or ancestral relationships.

```sh
# optional

augur ancestral \
    --tree results/timetree.nwk \
    --alignment processeddata/alignmentassignment.fasta \
    --output-node-data results/ntmuts.json \
    --inference joint
```

Export the data

Now we have everything we need to visualize the analysis on the web. Augur exports a single JSON file that contains the phylogenetic relationships and the node data we just inferred. After exporting, load the file into the Nextstrain Auspice web tool.

```sh
# Time: 10 seconds

# node data files are the ones written by augur refine and augur traits
augur export v2 \
    --tree results/timetree.nwk \
    --metadata processeddata/mergedmetadata.tsv \
    --node-data results/branchlengths.json \
                results/trait_node.json \
    --maintainers "hsci/mbb478" \
    --title "Assignment 3 - viral surveillance" \
    --output results/analysis-package.json \
    --geo-resolutions division
```
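If the export fails, checking that every input file exists and is not empty can show which earlier step needs to be rerun. The sketch below assumes the output paths written by the previous steps; `require_files` is a hypothetical helper.

```sh
# Hypothetical helper: report any expected file that is missing or empty.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Files written by the previous steps.
if require_files \
    results/timetree.nwk \
    processeddata/mergedmetadata.tsv \
    results/branchlengths.json \
    results/trait_node.json
then
    echo "all export inputs are present"
else
    echo "rerun the steps that produce the files listed above"
fi
```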

References

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. Elbe, S. and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. [ Note: The complete list of sequences used is available in the file samplelistgisaid.txt ]

Nextstrain - Hadfield et al. Nextstrain: real-time tracking of pathogen evolution, Bioinformatics (2018).

IQ-TREE 2 - B.Q. Minh, et al. (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534.

TreeTime - Pavel Sagulenko, Vadim Puller, Richard A Neher. (2018) TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution.

MAFFT - Katoh K and Standley D. (2013) MAFFT Multiple Sequence Alignment Software Version 7. Mol. Biol. Evol., 30(4):772-780.

Owner

Name: Miguel D Prieto G
Login: azmigueldario
Kind: user
Location: Vancouver, Canada
Company: Simon Fraser University, CIDGOH

Repositories: 1
Profile: https://github.com/azmigueldario

M.D., MSc. Research interests include public health, infectious diseases and molecular epidemiology
