sierra-local

sierra-local: A lightweight standalone application for drug resistance prediction - Published in JOSS (2019)

https://github.com/poonlab/sierra-local

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Computer Science - 33% confidence
Last synced: 4 months ago

Repository

Retrieve HIVdb algorithm as XML and apply locally to HIV sequences

Basic Info
  • Host: GitHub
  • Owner: PoonLab
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Size: 76 MB
Statistics
  • Stars: 10
  • Watchers: 8
  • Forks: 7
  • Open Issues: 8
  • Releases: 7
Created over 8 years ago · Last pushed 5 months ago
Metadata Files
Readme Contributing License Authors

README.md

sierra-local

sierra-local is a Python 3 implementation of the Stanford University HIV Drug Resistance Database (HIVdb) Sierra web service for generating drug resistance predictions from HIV-1 sequence data. This Python package enables laboratories to run this prediction algorithm without needing to transmit patient data over the network, and confers full control over data provenance and security.

Rationale

The Stanford HIVdb algorithm is a widely used method for predicting the drug resistance phenotype of an HIV-1 infection based on its genetic sequence, specifically the complete or partial sequence of the genomic regions encoding the primary targets of modern antiretroviral therapy. Prediction of HIV-1 drug resistance is an important component in the routine clinical management of HIV-1 infection, being faster and more cost-effective than the direct measurement of drug resistance by culturing virus isolates in the laboratory. The HIVdb algorithm is essentially a rules-based classifier that is actively maintained and released to the public domain in the ASI (Algorithm Specification Interface) exchange format, demonstrating a laudable commitment by the HIVdb developers to open-source research and clinical practice.

The HIVdb algorithm is usually accessed through a web service hosted at Stanford University (Sierra). While this is a convenient format for many clinical laboratories, it requires a network connection and the transmission of potentially sensitive patient-derived data to a remote server. Transmitting sequence data over the web may present a bottleneck for laboratories located at sites that are geographically distant from the host server, or where network traffic is prone to service disruptions. Furthermore, the use of HIV-1 sequence data in criminal cases raises significant issues around data privacy.

Our objective was to build a lightweight, open-source Python implementation of the HIVdb algorithm for processing data on a local computer without sending any data over the network. During the development of sierra-local, the maintainers of Sierra released the source code for their web service under a free software license (GPL v3.0). We were thrilled that the HIVdb developers elected to release their server code, but we remained committed to completing sierra-local so that the HIV research and clinical communities can process their own data without needing to install and maintain an Apache server, build an SQL database, or install a sizeable number of software dependencies.

Dependencies

We tried to minimize dependencies:

- Python 3 (tested on Python 3.9.0 and Python 3.10.9)
- Python modules (used by updater.py script):
  - requests
- NucAmino v0.1.3 or later (included with the package).
- Post-Align is the new alignment program and requires the following dependencies (included with the package as well):
  - Cython==0.29.32
  - more-itertools==9.1.0
  - orjson==3.9.1
  - types-setuptools==67.8.0.0
  - minimap2
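Before installing, it can be handy to confirm that the external requirements are visible to your Python environment. The following is a minimal sketch and not part of sierra-local: the function name `check_dependencies` is hypothetical, the minimum Python version of 3.9 is an assumption based on the tested versions listed above, and it assumes that minimap2 is expected as an executable on your PATH.

```python
import importlib.util
import shutil
import sys


def check_dependencies(modules=("requests",), executables=("minimap2",),
                       min_python=(3, 9)):
    """Return a list of problems; an empty list means every check passed.

    The defaults are assumptions drawn from the dependency list above,
    not an official requirements check.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or later expected" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            problems.append("missing Python module: " + name)
    for name in executables:
        # shutil.which returns None when the program is not on PATH
        if shutil.which(name) is None:
            problems.append("missing executable: " + name)
    return problems
```

Calling `check_dependencies()` and printing the returned list gives a quick report; an empty list suggests the listed requirements are available.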

Installation

Setting up Sierra-Local

On a Linux system, you can install sierra-local as follows:

    git clone http://github.com/PoonLab/sierra-local
    cd sierra-local
    sudo python3 setup.py install

Note that you need super-user privileges to install the package by this method. For more detailed instructions, please refer to the document INSTALL.md in the root directory of this Python package.

Alternatively, you can install with pip, which doesn't need sudo:

    git clone http://github.com/PoonLab/sierra-local
    cd sierra-local
    pip install --user .

Using sierra-local

Command-line interface (CLI)

Before running, we recommend using the sierralocal/updater.py script to update the data files associated with this repository to the most recent versions available from hivfacts. Note that this script requires the requests package listed under Dependencies. The script is described in more detail below.

    (sierra) will@dyn172-30-75-11 sierra-local % python3 sierralocal/updater.py
    Downloading the latest HIVDB XML File
    Updated HIVDB XML into /Users/will/projects/sierra-local/sierralocal/data/HIVDB_9.8.xml
    Downloading the latest file to determine apobec
    Updated apobecs file to /Users/will/projects/sierra-local/sierralocal/data/apobecs.csv
    Downloading the latest file to determine is unusual
    Updated is unusual file to /Users/will/projects/sierra-local/sierralocal/data/rx-all_subtype-all.csv
    Downloading the latest file to determine SDRM mutations
    Updated SDRM mutations file to /Users/will/projects/sierra-local/sierralocal/data/sdrms_hiv1.csv
    Downloading the latest file to determine mutation type
    Updated mutation type file to /Users/will/projects/sierra-local/sierralocal/data/mutation-type-pairs_hiv1.csv
    Downloading the latest APOBEC DRMS File
    Updated APOBEC DRMs into /Users/will/projects/sierra-local/sierralocal/data/apobec_drms.json
    Downloading the latest subtype genotype property File
    Updated reference fasta to /Users/will/projects/sierra-local/sierralocal/data/genotype-properties.csv
    Downloading the latest subtype reference fasta file
    Updated reference fasta to /Users/will/projects/sierra-local/sierralocal/data/genotype-references.fasta
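After several updates, the data directory may hold more than one HIVDB_*.xml file. The sketch below shows one way to find the file with the highest version number. It is an illustration only: the helper `latest_hivdb_xml` is hypothetical, it assumes the `HIVDB_<version>.xml` naming seen in the updater output, and it is not necessarily how sierra-local itself chooses a file.

```python
import re
from pathlib import Path


def latest_hivdb_xml(data_dir):
    """Return the HIVDB_<version>.xml path with the highest version, or None."""
    pattern = re.compile(r"^HIVDB_(\d+(?:\.\d+)*)\.xml$")
    best = None
    for path in Path(data_dir).glob("HIVDB*.xml"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip names that do not follow HIVDB_<version>.xml
        # compare versions numerically so that 9.10 sorts after 9.8
        version = tuple(int(part) for part in match.group(1).split("."))
        if best is None or version > best[0]:
            best = (version, path)
    return None if best is None else best[1]
```

The returned path could then be passed to the -xml option described later in this section.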

To run a quick example, use the following sequence of commands:

    art@Jesry:~/git/sierra-local$ python3 scripts/retrieve_hivdb_data.py RT RT.fa
    art@Jesry:~/git/sierra-local$ sierralocal RT.fa
    searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/HIVDB*.xml
    searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/apobec*.json
    HIVdb version 9.4
    Aligning using post-align
    Aligned RT.fax
    100 sequences found in file RT.fa.
    Writing JSON to file RT_results.json
    Time elapsed: 19.796 seconds (5.1555 it/s)

To switch between Post-Align (the default) and NucAmino, use the -alignment option; passing nuc selects NucAmino:

    will@Jesry:~/sierra-local# sierralocal RT.fa -alignment nuc
    searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/HIVDB*.xml
    searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/apobec*.json
    HIVdb version 9.4
    Found NucAmino binary /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/bin/nucamino-linux-amd64
    Aligned RT.fa
    100 sequences found in file RT.fa.
    Writing JSON to file RT_results.json
    Time elapsed: 4.1417 seconds (25.45 it/s)

retrieve_hivdb_data.py is a Python script that we provide to download small samples of HIV-1 sequence data from the Stanford HIVdb database. In this case, we have retrieved 100 reverse transcriptase (RT) sequences and processed them with the sierra-local pipeline. By default, the results are written to the file [FASTA basename]_results.json:

    art@Jesry:~/git/sierra-local$ head RT_results.json
    [
      {
        "inputSequence": {
          "header": "U54771.CM240.CRF01_AE.0"
        },
        "subtypeText": "CRF01_AE",
        "validationResults": [],
        "alignedGeneSequences": [
          {
            "firstAA": 1,
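The results file is ordinary JSON, so it can be inspected with the Python standard library. The sketch below is a minimal example that relies only on the keys visible in the excerpt above (`inputSequence`, `header`, `subtypeText`); the function name `summarize_results` is hypothetical, and the rest of the record structure is not assumed.

```python
import json


def summarize_results(path):
    """Return (header, subtype) pairs from a sierra-local results file."""
    with open(path) as handle:
        records = json.load(handle)  # the file holds a JSON list of records
    summary = []
    for record in records:
        header = record["inputSequence"]["header"]
        subtype = record.get("subtypeText")  # None if the key is absent
        summary.append((header, subtype))
    return summary
```

For example, `summarize_results('RT_results.json')` would return one pair per input sequence.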

We can also specify an ASI (XML) file representing a different version of the HIVdb algorithm to reprocess the same data, and save the output to another file:

    (sierra) will@dyn172-30-75-11 sierra-local % sierralocal -xml sierralocal/data/HIVDB_9.8.xml RT.fa -o RT-v9.8.json
    searching path /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/apobec_drms.json
    HIVdb version 9.8
    Aligning using post-align
    Aligned RT.fa
    100 sequences found in file RT.fa.
    Writing JSON to file RT-v9.8.json
    Time elapsed: 9.3831 seconds (10.709 it/s)

We find that switching versions of the algorithm from 8.5 to 8.7 results in substantial changes in resistance scores for these data with the introduction of a new drug, doravirine (DOR). In addition, two of 100 cases were scored differently:

    art@Jesry:~/git/sierra-local$ python3 scripts/json2csv.py RT_results.json RT-v8.7.json.csv
    art@Jesry:~/git/sierra-local$ python3 scripts/json2csv.py RT-v8.5.json RT-v8.5.json.csv
    art@Jesry:~/git/sierra-local$ R

```R
> v5 <- read.csv('RT-v8.5.json.csv')
> v7 <- read.csv('RT-v8.7.json.csv')
> v7 <- v7[,-which(names(v7)=='DOR')]
> temp <- sapply(1:nrow(v5), function(i) any(v5[i,] != v7[i,]))
> which(temp)
[1] 23 63
> v5[23,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
23 Y14503.BCF13.O.22 Group O  15 -10 -10  10  60  60 -10  50  45  95  65
> v7[23,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
23 Y14503.BCF13.O.22 Group O  15 -10 -10  10  60  60 -10  50  55 105  75
> v5[63,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
63 AF102332.A11.B.62       B  90 115 115  90  80  80  60   0   0   0   0
> v7[63,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
63 AF102332.A11.B.62       B  90 115 115  90  80  80  60   0  10  10  10
```
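For readers who prefer to stay in Python, the same comparison can be sketched with the standard csv module. This is an illustrative equivalent of the R session above, not code shipped with sierra-local: the function name `differing_rows` is hypothetical, and it assumes both CSV files list the same sequences in the same order.

```python
import csv


def differing_rows(path_a, path_b, ignore=("DOR",)):
    """Return 1-based indices of rows whose shared columns differ.

    Columns named in `ignore` (for example a drug present in only one
    algorithm version) are left out of the comparison.
    """
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = list(csv.DictReader(fa))
        rows_b = list(csv.DictReader(fb))
    if len(rows_a) != len(rows_b):
        raise ValueError("files have different numbers of rows")
    changed = []
    for index, (a, b) in enumerate(zip(rows_a, rows_b), start=1):
        shared = [key for key in a if key in b and key not in ignore]
        # values are compared as text, exactly as they appear in the files
        if any(a[key] != b[key] for key in shared):
            changed.append(index)
    return changed
```

Because values are compared as strings, a score written as `10` in one file and `10.0` in the other would count as a difference; convert to numbers first if the two files may be formatted differently.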

To specify your own JSON file for APOBEC DRMs, use the -json option followed by the path to your file:

    (sierra) will@dyn172-30-75-11 sierra-local % sierralocal RT.fa -json sierralocal/data/apobec_drms.c9583ac2.json
    searching path /Users/will/miniconda3/envs/sierra/lib/python3.9/site-packages/sierralocal/data/HIVDB*.xml
    HIVdb version 9.4
    Aligning using post-align
    Aligned RT.fa
    100 sequences found in file RT.fa.
    Writing JSON to file RT_results.json
    Time elapsed: 9.3442 seconds (10.751 it/s)
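When scripting many runs, it may help to assemble the command line programmatically. The sketch below builds an argument list using only the options shown in the examples above (-xml, -json, -alignment and -o); the helper name `build_command` is hypothetical, and the order of arguments simply mirrors those examples.

```python
def build_command(fasta, xml=None, json_path=None, alignment=None, outfile=None):
    """Build an argument list for the sierralocal command-line program."""
    command = ["sierralocal"]
    if xml is not None:
        command += ["-xml", str(xml)]          # ASI XML file to apply
    if json_path is not None:
        command += ["-json", str(json_path)]   # APOBEC DRMs JSON file
    if alignment is not None:
        command += ["-alignment", alignment]   # e.g. "nuc" for NucAmino
    command.append(str(fasta))
    if outfile is not None:
        command += ["-o", str(outfile)]        # output JSON path
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the standard library, which avoids shell quoting problems with unusual file names.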

As a Python module

If you have downloaded the package source to your computer, you can also run sierra-local as a Python module from the root directory of the package. In the following example, we are calling the main function of sierra-local from an interactive Python session:

```console
(sierra) will@dyn172-30-75-11 sierra-local % git clone http://github.com/PoonLab/sierra-local
(sierra) will@dyn172-30-75-11 sierra-local % cd sierra-local
(sierra) will@dyn172-30-75-11 sierra-local % python3
Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:35:41)
[Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sierralocal.main import sierralocal
>>> sierralocal('RT.fa', 'RT.json')
searching path /Users/will/projects/sierra-local/sierralocal/data/HIVDB*.xml
searching path /Users/will/projects/sierra-local/sierralocal/data/apobec_drms.json
HIVdb version 9.8
Aligning using post-align
Aligned RT.fa
100 sequences found in file RT.fa.
Writing JSON to file RT.json
(100, 0.0769047737121582)
```

Note that this doesn't require any sudo privileges.
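Calling the module from Python makes it straightforward to process several FASTA files in a loop. The sketch below is one possible wrapper, written under two assumptions: that the function accepts an input path and an output path as its first two arguments, as in the session above, and that any callable with that shape can be passed in. The name `run_batch` is hypothetical, and the output naming follows the [FASTA basename]_results.json convention of the command-line interface.

```python
from pathlib import Path


def run_batch(fasta_paths, runner, out_dir="."):
    """Apply runner(fasta, outfile) to each FASTA file and collect the return values."""
    out_dir = Path(out_dir)
    results = {}
    for fasta in fasta_paths:
        fasta = Path(fasta)
        # name each output after its input, e.g. RT.fa -> RT_results.json
        outfile = out_dir / (fasta.stem + "_results.json")
        results[str(fasta)] = runner(str(fasta), str(outfile))
    return results
```

With sierra-local installed, usage might look like `from sierralocal.main import sierralocal` followed by `run_batch(['RT.fa'], sierralocal)`. Passing the function in as an argument also lets you test the loop with a stand-in before running real data.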

Subtyping

Currently, we do not support the subtyping function present in sierrapy. A framework for this feature remains in /sierralocal/deprecated/subtyper.py, but it is neither maintained nor tested, and we do not recommend using it without modification. If you still want to enable it manually, change the do_subtype values to True in sierralocal/nucaminohook.py and import the subtyper class from the subtyper script.

Updating the algorithm and other data files

The Stanford HIVdb database regularly updates its resistance genotyping algorithm and publishes the associated ASI2 XML file in its GitHub repository, hivfacts. In previous versions of sierra-local, we used Python to automatically query the HIVdb website and download the newest version if it was not already present on the user's computer. Subsequent changes to the Stanford HIVdb website, however, meant that users would have to install several additional dependencies in order for Python to locate the required files. As a result, we decided to make the updater.py script an optional step of the pipeline.

Running the script manually retrieves the most recent versions of the ASI2 XML and other mutation data files from HIVdb; example output is shown under "Command-line interface (CLI)" above.

Of course, it may be simpler to download these files yourself from hivfacts, but in some applications there may be a benefit to automating this step.
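The "download only if not already present" behaviour described above can be sketched with the standard library alone. This is an illustration of the idea rather than the code in updater.py: the function name `fetch_if_missing` and its `opener` parameter are hypothetical, and the URL must be supplied by the caller since no specific address is assumed here.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest, opener=urllib.request.urlopen):
    """Download url to dest unless dest already exists.

    Returns True if a download happened and False if the local copy was kept.
    The opener argument exists so the function can be tried without a network.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # keep the local copy untouched
    with opener(url) as response:
        data = response.read()
    dest.write_bytes(data)
    return True
```

A check like this avoids repeated downloads, but it will not notice when the remote file has changed; comparing version numbers in the file names would be one way to handle that.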

About Us

This project was developed at the Poon lab within the Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario. Development of sierra-local was supported in part by a grant from the Canadian Institutes of Health Research (PJT-156178).

If you use sierra-local for your work, please cite the following paper:

* sierra-local: A lightweight standalone application for drug resistance prediction. Jasper C Ho, Garway T Ng, Mathias Renaud, Art FY Poon, (2019). Journal of Open Source Software, 4(33), 1186, https://doi.org/10.21105/joss.01186

If you want to reference the validation of sierra-local on HIV-1 pol data, please cite the following preprint:

* sierra-local: A lightweight standalone application for secure HIV-1 drug resistance prediction. Jasper C Ho, Garway T Ng, Mathias Renaud, Art FY Poon. bioRxiv 393207; doi: https://doi.org/10.1101/393207

Owner

  • Name: PoonLab
  • Login: PoonLab
  • Kind: organization

JOSS Publication

sierra-local: A lightweight standalone application for drug resistance prediction
Published
January 25, 2019
Volume 4, Issue 33, Page 1186
Authors
Jasper C. Ho ORCID
Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada
Garway T. Ng
Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada
Mathias Renaud
Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada
Art F. Y. Poon ORCID
Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada, Department of Microbiology and Immunology, Western University, London, ON, Canada, Department of Applied Mathematics, Western University, London, ON, Canada
Editor
Pjotr Prins ORCID
Tags
bioinformatics HIV/AIDS drug resistance sequence analysis clinical virology

GitHub Events

Total
  • Create event: 3
  • Release event: 1
  • Issues event: 9
  • Watch event: 4
  • Delete event: 1
  • Issue comment event: 30
  • Push event: 17
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 7
  • Fork event: 3
Last Year
  • Create event: 3
  • Release event: 1
  • Issues event: 9
  • Watch event: 5
  • Delete event: 1
  • Issue comment event: 32
  • Push event: 17
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 8
  • Fork event: 3

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 286
  • Total Committers: 11
  • Avg Commits per committer: 26.0
  • Development Distribution Score (DDS): 0.619
Past Year
  • Commits: 20
  • Committers: 4
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.4
Top Committers
Name Email Commits
Art Poon a****n@g****m 109
WilliamZekaiWang W****g@g****m 49
jzpero j****3@u****a 31
jzpero j****o@g****m 24
Tammy Ng t****2@u****a 22
WilliamZekaiWang 1****g 21
GopiGugan g****n@u****a 13
MathiasRenaud m****5@g****m 9
SandeepThokala i****8@g****m 5
Jasper Ho j****o@g****m 2
William Zekai Wang w****l@W****l 1
Committer Domains (Top 20 + Academic)
uwo.ca: 3

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 98
  • Total pull requests: 23
  • Average time to close issues: 6 months
  • Average time to close pull requests: 12 days
  • Total issue authors: 18
  • Total pull request authors: 4
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.43
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 8
  • Pull requests: 8
  • Average time to close issues: about 20 hours
  • Average time to close pull requests: 5 days
  • Issue authors: 6
  • Pull request authors: 2
  • Average comments per issue: 0.63
  • Average comments per pull request: 0.38
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ArtPoon (47)
  • jzpero (13)
  • Kanyerezi30 (10)
  • SandeepThokala (7)
  • schorlton-bugseq (3)
  • aguang (2)
  • erick-dorlass (2)
  • azneto (2)
  • WilliamZekaiWang (2)
  • MathiasRenaud (2)
  • marcbennedbaek (1)
  • ghost (1)
  • nedjoni (1)
  • LilyAnderssonLee (1)
  • GopiGugan (1)
Pull Request Authors
  • WilliamZekaiWang (17)
  • ArtPoon (2)
  • schorlton-bugseq (2)
  • GopiGugan (2)
Top Labels
Issue Labels
enhancement (6) bug (4) help wanted (3) low priority (2) question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 180 last-month
  • Total docker downloads: 29
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 3
pypi.org: sierralocal

Local execution of HIVdb algorithm

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 180 Last month
  • Docker Downloads: 29
Rankings
Docker downloads count: 3.8%
Dependent packages count: 7.4%
Forks count: 17.0%
Average: 20.3%
Stargazers count: 21.6%
Dependent repos count: 22.3%
Downloads: 50.0%
Maintainers (3)
Last synced: 4 months ago