https://github.com/bede/hostile

Precise host read removal

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 23 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bede
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.39 MB
Statistics
  • Stars: 102
  • Watchers: 3
  • Forks: 7
  • Open Issues: 8
  • Releases: 10
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme Changelog License

README.md

DOI:10.1101/2023.07.04.547735 PyPI version Bioconda version Downloads Tests

Hostile

Hostile removes host sequences from short and long read (meta)genomes, consuming single or paired FASTQ from files or stdin. Batteries are included – a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an order of magnitude fewer microbial reads than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For best possible retention of microbial reads, optionally use an existing index masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using --rename) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools operating on a stream. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. In typical use, Hostile requires 4GB of RAM for decontaminating short reads (Bowtie2) and 13GB for long reads (Minimap2). Further information and benchmarks can be found in the paper and blog post. Please open an issue to report problems or otherwise reach out for help and advice, and please cite the paper if you use Hostile in your work.

Simplified overview

Indexes

The default index human-t2t-hla comprises T2T-CHM13v2.0 and IPD-IMGT/HLA v3.51, and is downloaded automatically when running Hostile unless another index is specified. Higher microbial sequence retention may be possible using masked indexes, which are very easy to use. The index human-t2t-hla-argos985 is masked against 985 reference grade bacterial genomes including common human pathogens, while human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 is further masked against all known virus and phage genomes. The latter should be used when retention of viral sequences is a priority. To use a standard index, simply pass its name as the value of the --index argument, which takes care of downloading and caching the relevant index. Object storage is provided by the ModMedMicro research unit at the University of Oxford. Custom indexes are also supported (see below).

| Name | Composition | Date | Masked positions |
| :---: | :---: | --- | --- |
| human-t2t-hla (default) | T2T-CHM13v2.0 + IPD-IMGT/HLA v3.51 | 2023-07 | 0 (0%) |
| human-t2t-hla-argos985 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial genomes | 2023-07 | 317,973 (0.010%) |
| human-t2t-hla.rs-viral-202401_ml-phage-202401 | human-t2t-hla masked with 150mers for 18,719 RefSeq viral and 26,928 Millard Lab phage genomes | 2024-01 | 1,172,993 (0.037%) |
| human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial, 18,719 RefSeq viral, and 26,928 Millard Lab phage genomes | 2024-01 | 1,473,260 (0.046%) |
| human-t2t-hla-argos985-mycob140 | human-t2t-hla masked with 150mers for 985 FDA-ARGOS bacterial & 140 mycobacterial genomes | 2023-07 | 319,752 (0.010%) |
| mouse-mm39 | GRCm39 (GCF_000001635.27) | 2024-11 | 0 (0%) |

Performance of human-t2t-hla and human-t2t-hla-argos985-mycob140 was evaluated in the paper.

Install

Installation with conda/mamba or Docker is recommended due to non-Python dependencies (Bowtie2, Minimap2, Samtools and Bedtools). Hostile is tested with Ubuntu Linux 22.04, MacOS 12, and under WSL for Windows.

Conda/mamba

```bash
conda create -y -n hostile -c conda-forge -c bioconda hostile
conda activate hostile
```

Docker

```bash
git clone https://github.com/bede/hostile.git
cd hostile
docker build .
```

A Biocontainer image is also available, but beware that this often lags behind the latest released version.

Development

```bash
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -y -f environment.yml
conda activate hostile
pip install --editable '.[dev]'
pytest
pre-commit install
```

Getting started

```bash
# Long reads
hostile clean --fastq1 long.fastq.gz  # Creates long.clean.fastq.gz
hostile clean --fastq1 long.fastq.gz --index mouse-mm39  # Use mouse index
cat reads.fastq | hostile clean --fastq1 -  # Read from stdin
hostile clean --fastq1 long.fastq.gz -o - > long.clean.fastq  # Write to stdout
hostile clean --fastq1 long.fastq.gz --invert  # Keep only host reads

# Short reads
hostile clean --fastq1 short.r1.fq.gz --aligner bowtie2  # Single/unpaired
hostile clean --fastq1 short.r1.fq.gz --fastq2 short.r2.fq.gz  # Paired
cat interleaved.fastq | hostile clean --fastq1 - --fastq2 -  # Read from stdin
hostile clean --fastq1 short.r1.fq.gz --fastq2 short.r2.fq.gz -o - > clean.interleaved.fq  # Write fastq to stdout
```

Custom indexes

  • To list available standard indexes, run hostile index list.
  • To optionally download and cache the default index (human-t2t-hla) ahead of time, run hostile index fetch. Include --minimap2 or --bowtie2 to download only the respective long or short read index rather than both. To download and cache another standard index, provide its name with e.g. hostile index fetch --name human-t2t-hla-argos985 --minimap2.
  • To use a custom genome/index (made with hostile mask or otherwise), run hostile clean with --index path/to/genome.fa (for minimap2) or --index path/to/bowtie2-index-name (for Bowtie2). Note that Minimap2 mode accepts a path to a genome in fasta format or .mmi, whereas Bowtie2 mode accepts a path to a precomputed index, minus the .x.bt2 suffix. A Bowtie2 index can be built for use with Hostile using e.g. bowtie2-build genome.fa index-name.
  • To change where indexes are stored, set the environment variable HOSTILE_CACHE_DIR to a directory of your choice. Run hostile index list to verify.
  • If you wish to use your own remote repository of indexes, set the environment variable HOSTILE_REPOSITORY_URL. Hostile will then look for indexes inside {HOSTILE_REPOSITORY_URL}/manifest.json.
  • From version 2.0.0 onwards, Hostile automatically builds and reuses .mmi files to speed up long read decontamination with Minimap2. If building an MMI is interrupted, you may receive an error about index corruption. If this happens, run hostile index delete --mmi, or if using a custom index, delete the .mmi created in the same directory.
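The index-management options above can also be driven from a script. The following is a minimal sketch that only assembles the command line and environment described in the list; `index_fetch_command` and `hostile_env` are hypothetical helper names, not part of Hostile, and the cache directory is a placeholder.

```python
import os
from pathlib import Path


def index_fetch_command(name=None, aligner=None):
    """Build a `hostile index fetch` command from the options described above."""
    cmd = ["hostile", "index", "fetch"]
    if name is not None:
        cmd += ["--name", name]  # omit to fetch the default index
    if aligner is not None:
        if aligner not in ("minimap2", "bowtie2"):
            raise ValueError(f"unknown aligner: {aligner}")
        cmd.append(f"--{aligner}")  # fetch only the long or short read index
    return cmd


def hostile_env(cache_dir, base_env=None):
    """Return a copy of the environment with HOSTILE_CACHE_DIR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOSTILE_CACHE_DIR"] = str(Path(cache_dir))
    return env


cmd = index_fetch_command("human-t2t-hla-argos985", "minimap2")
env = hostile_env("/data/hostile-indexes")  # hypothetical directory
print(" ".join(cmd), "with HOSTILE_CACHE_DIR =", env["HOSTILE_CACHE_DIR"])
```

With Hostile installed, the pair can be handed to `subprocess.run(cmd, env=env, check=True)`.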

Command line usage

```bash
$ hostile clean -h
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX] [--invert] [--rename] [--reorder] [-c] [-o OUTPUT]
                     [--aligner-args ALIGNER_ARGS] [-t THREADS] [--force] [--airplane] [-d]

Remove reads aligning to an index from fastq[.gz] input files or stdin.

options:
  -h, --help            show this help message and exit
  --fastq1 FASTQ1       path to forward fastq[.gz] file (or - for stdin)
  --fastq2 FASTQ2       optional path to reverse fastq[.gz] file (or - for stdin)
                        (default: )
  --aligner {bowtie2,minimap2,auto}
                        alignment algorithm. Defaults to minimap2 (long read) given fastq1 only or bowtie2 (short read) given fastq1 and fastq2. Override with bowtie2 for single/unpaired short reads
                        (default: auto)
  --index INDEX         name of standard index or path to custom genome (Minimap2) or Bowtie2 index
                        (default: human-t2t-hla)
  --invert              keep only reads aligning to the index (and their mates if applicable)
                        (default: False)
  --rename              replace read names with incrementing integers
                        (default: False)
  --reorder             ensure deterministic output order
                        (default: False)
  -c, --casava          use Casava 1.8+ read header format
                        (default: False)
  -o, --output OUTPUT   path to output directory or - for stdout
                        (default: /Users/bede/Research/git/hostile)
  --aligner-args ALIGNER_ARGS
                        additional arguments for alignment
                        (default: )
  -t, --threads THREADS
                        number of alignment threads. A sensible default is chosen automatically
                        (default: 10)
  --force               overwrite existing output files
                        (default: False)
  --airplane            disable automatic index download (offline mode)
                        (default: False)
  -d, --debug           show debug messages
                        (default: False)
```

Long reads

Writes compressed fastq.gz files to the working directory and sends the log to stdout.

```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile v2.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla (MMI available)
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "2.0.0",
        "aligner": "minimap2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
        "reads_in": 1,
        "reads_out": 1,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0,
        "fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
        "fastq1_out_path": "/Users/bede/Research/git/hostile/tuberculosis_1_1.clean.fastq.gz"
    }
]
```

Long reads (non-default index, save log)

```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz --index human-t2t-hla-argos985-mycob140 > log.json
INFO: Hostile v2.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla (MMI available)
INFO: Cleaning…
INFO: Cleaning complete
```
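Once saved, the log can be inspected programmatically. The sketch below assumes the log is a JSON list with one record per cleaned input, carrying the field names seen in the example output earlier in this section; `summarise_log` is a hypothetical helper, and the embedded record is an abbreviated copy of that example.

```python
import json


def summarise_log(log_text):
    """Reduce each record of a Hostile JSON log to a few headline numbers."""
    records = json.loads(log_text)
    return [
        {
            "input": record["fastq1_in_name"],
            "index": record["index"],
            "reads_in": record["reads_in"],
            "reads_removed": record["reads_removed"],
            "percent_removed": 100.0 * record["reads_removed_proportion"],
        }
        for record in records
    ]


# Abbreviated copy of the long read example log shown earlier
example_log = """
[
    {
        "version": "2.0.0",
        "aligner": "minimap2",
        "index": "human-t2t-hla",
        "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
        "reads_in": 1,
        "reads_out": 1,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
"""

for row in summarise_log(example_log):
    print(row)
```

For a saved log, pass `Path("log.json").read_text()` (with `from pathlib import Path`) in place of `example_log`.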

Long reads (stdout)

Reads sent to stdout, log sent to stderr

```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz -o - > out.fastq
INFO: Hostile v2.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla (MMI available)
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "2.0.0",
        "aligner": "minimap2",
        "index": "human-t2t-hla",
        "options": [
            "stdout"
        ],
        "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
        "reads_in": 1,
        "reads_out": 1,
        "reads_removed": 0,
        "reads_removed_proportion": 0.0
    }
]
```

Short paired reads

When providing both --fastq1 and --fastq2, Hostile assumes you are providing short reads and uses Bowtie2 automatically.

```bash
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz
INFO: Hostile v2.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "2.0.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "human_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/human_1_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0,
        "fastq2_in_name": "human_1_2.fastq.gz",
        "fastq2_in_path": "/Users/bede/Research/git/hostile/tests/data/human_1_2.fastq.gz",
        "fastq1_out_name": "human_1_1.clean_1.fastq.gz",
        "fastq1_out_path": "/Users/bede/Research/git/hostile/human_1_1.clean_1.fastq.gz",
        "fastq2_out_name": "human_1_2.clean_2.fastq.gz",
        "fastq2_out_path": "/Users/bede/Research/git/hostile/human_1_2.clean_2.fastq.gz"
    }
]
```

Short single/unpaired reads (save log)

When decontaminating single/unpaired short reads, you must specify --aligner bowtie2 to override the default long read setting for single/unpaired input. Interleaved input is not supported.

```bash
$ hostile clean --fastq1 human_1_1.fastq.gz --aligner bowtie2 > log.json
INFO: Hostile v2.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla-argos985
INFO: Cleaning…
INFO: Cleaning complete
```
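The aligner selection rule described in this and the preceding section can be written down in a few lines. This is an illustration of the documented `--aligner auto` behaviour only, not Hostile's own code; `resolve_aligner` is a hypothetical name.

```python
def resolve_aligner(fastq2=None, aligner="auto"):
    """Pick the aligner the way the documentation describes `--aligner auto`."""
    if aligner not in ("auto", "bowtie2", "minimap2"):
        raise ValueError(f"unknown aligner: {aligner}")
    if aligner != "auto":
        return aligner  # an explicit choice always wins
    # auto: paired input implies short reads, single input implies long reads
    return "bowtie2" if fastq2 else "minimap2"


print(resolve_aligner())                              # single input, auto
print(resolve_aligner(fastq2="human_1_2.fastq.gz"))   # paired input, auto
print(resolve_aligner(aligner="bowtie2"))             # single short reads, explicit override
```

The third call corresponds to the single/unpaired short read case above, where the long read default has to be overridden explicitly.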

Short paired reads (stdout)

When using stdout mode with paired input, Hostile sends interleaved paired reads to stdout.

```bash
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz -o - > interleaved.fastq
INFO: Hostile v2.0.2. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
    {
        "version": "2.0.2",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "human_1_1.fastq.gz",
        "fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/human_1_1.fastq.gz",
        "reads_in": 2,
        "reads_out": 0,
        "reads_removed": 2,
        "reads_removed_proportion": 1.0,
        "fastq2_in_name": "human_1_2.fastq.gz",
        "fastq2_in_path": "/Users/bede/Research/git/hostile/tests/data/human_1_2.fastq.gz"
    }
]
```
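Downstream tools that expect separate R1 and R2 files need the interleaved stream split again. A minimal sketch, assuming plain four-line FASTQ records (no wrapped sequences, no trailing blank lines) with mates strictly alternating; `read_fastq` and `deinterleave` are hypothetical helpers, not part of Hostile.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of input
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("truncated or malformed FASTQ record")
        yield tuple(record)


def deinterleave(handle):
    """Yield (read1, read2) pairs from an interleaved FASTQ stream."""
    records = read_fastq(handle)
    for read1 in records:
        read2 = next(records, None)  # the mate directly follows read1
        if read2 is None:
            raise ValueError("odd number of records in interleaved input")
        yield read1, read2


demo = io.StringIO("@a/1\nACGT\n+\nIIII\n@a/2\nTTTT\n+\nIIII\n")
for read1, read2 in deinterleave(demo):
    print(read1[0], read2[0])
```

In practice `demo` would be replaced by an open handle on `interleaved.fastq` or by `sys.stdin`.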

Python usage

```python
from pathlib import Path
from hostile.lib import clean_fastqs, clean_paired_fastqs

# Long reads, defaults
clean_fastqs(
    fastqs=[Path("reads.fastq.gz")],
)

# Paired short reads, various options, capture log
log = clean_paired_fastqs(
    fastqs=[(Path("reads1.fastq.gz"), Path("reads2.fastq.gz"))],
    index="human-t2t-hla-argos985",
    out_dir=Path("decontaminated-reads"),
    rename=True,
    force=True,
    threads=12
)

print(log)
```
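Because `fastqs` takes a list of (R1, R2) tuples, a directory of paired files first has to be grouped by sample. The sketch below assumes a `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming convention, which is an assumption about your data rather than anything Hostile requires; `pair_fastqs` is a hypothetical helper.

```python
import re
from pathlib import Path


def pair_fastqs(paths):
    """Group <sample>_1.fastq.gz / <sample>_2.fastq.gz paths into (R1, R2) tuples."""
    pattern = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")
    mates = {}
    for path in map(Path, paths):
        match = pattern.match(path.name)
        if match:
            mates.setdefault(match["sample"], {})[match["mate"]] = path
    # Keep only samples for which both mates were found, sorted by sample name
    return [
        (found["1"], found["2"])
        for sample, found in sorted(mates.items())
        if set(found) >= {"1", "2"}
    ]


pairs = pair_fastqs(
    ["b_2.fastq.gz", "a_1.fastq.gz", "a_2.fastq.gz", "b_1.fastq.gz", "c_1.fastq.gz"]
)
print(pairs)

# With Hostile installed, the pairs can be passed on as in the example above:
# from hostile.lib import clean_paired_fastqs
# log = clean_paired_fastqs(fastqs=pairs, out_dir=Path("decontaminated-reads"))
```

Samples with only one mate (here `c`) are skipped silently; a stricter script might report them instead.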

Masking reference genomes

The mask subcommand makes it easy to create custom-masked indexes in order to achieve maximum retention of specific target organisms:

```bash
hostile mask human.fasta lots-of-bacterial-genomes.fasta --threads 8
```

You may wish to use one of the existing reference genomes as a starting point. Masking uses Minimap2 to align 150mers of the supplied target genomes with the reference genome, and bedtools to mask all aligned regions with N. Both a masked genome (for Minimap2) and a masked Bowtie2 index are created.
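The idea behind masking can be shown on toy sequences. This sketch is a conceptual illustration, not Hostile's implementation: exact substring search stands in for Minimap2 alignment, string slicing stands in for bedtools, and the k-mer length and step are shrunk for readability (the step Hostile uses is not stated here, so `step=1` is an assumption).

```python
def kmers(sequence, k=150, step=1):
    """Yield (start, kmer) for every k-mer of `sequence`, `step` bases apart."""
    for start in range(0, len(sequence) - k + 1, step):
        yield start, sequence[start:start + k]


def mask_intervals(reference, intervals):
    """Replace each half-open [start, end) interval of `reference` with N."""
    masked = list(reference)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(masked))  # clip to the sequence
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


reference = "ACGTACGTTTGACCAGTACG"  # stands in for the host genome
target = "GGTTTGACCAGG"             # stands in for a microbial genome
k = 8                                # toy value; Hostile uses 150mers

hits = []
for _, kmer in kmers(target, k=k):
    pos = reference.find(kmer)  # exact match instead of real alignment
    if pos != -1:
        hits.append((pos, pos + k))

print(mask_intervals(reference, hits))
```

Reads from the target organism that would have matched the masked region can no longer align to the host reference, which is why masking improves microbial read retention.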

Threads and compression

Hostile automatically allocates an appropriate number and ratio of available CPU cores to alignment and compression tasks. For maximum performance, consider using stdout (--output -) and compressing the fastq stream with zstandard, a faster gzip alternative, for example:

```bash
hostile clean --fastq1 reads.fq.gz -o - | zstd > reads.clean.fq.zst
```

Alignment parameters

Hostile's alignment parameters achieve leading precision as demonstrated in the manuscript, blog post and independent benchmarks.

Short reads (Bowtie2)

  • --sensitive: (default; implicit; equivalent to -D 15 -R 2 -N 0 -L 22 -i S,1,1.15)
  • -k 1: disable secondary alignments

Long reads (Minimap2)

  • -x map-ont: (default; implicit)
  • --secondary no: disable secondary alignments

Custom parameters

Should you wish to override Hostile's alignment parameters, you may do so by passing custom --aligner-args to hostile clean. For short reads, e.g. --aligner-args "--sensitive-local" increases host depletion at the expense of false positives. For long reads, e.g. --aligner-args "-m 30" has a similar effect. As of version 2.0.2, it is possible to change Minimap2's alignment preset from the default map-ont using --aligner-args.
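When scripting such runs, quoting the aligner arguments correctly is the main pitfall. A minimal sketch that assembles a `hostile clean` command as an argument list, using only flags documented above; `clean_command` is a hypothetical helper and the file name is a placeholder.

```python
import shlex


def clean_command(fastq1, fastq2=None, index=None, aligner=None, aligner_args=None):
    """Assemble a `hostile clean` argument list from documented options."""
    cmd = ["hostile", "clean", "--fastq1", str(fastq1)]
    if fastq2 is not None:
        cmd += ["--fastq2", str(fastq2)]
    if index is not None:
        cmd += ["--index", str(index)]
    if aligner is not None:
        cmd += ["--aligner", aligner]
    if aligner_args is not None:
        # Kept as ONE list element so the aligner receives it as a single string
        cmd += ["--aligner-args", aligner_args]
    return cmd


cmd = clean_command("long.fastq.gz", aligner_args="-m 30")
print(shlex.join(cmd))  # shell-safe rendering with the argument string quoted
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting altogether; `shlex.join` is only needed when the command must be shown or logged as a single string.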

Using more sensitive alignment parameters dramatically increases false positive rate for some viral and bacterial genomes. For 2x150bp simulated viral RefSeq reads, the --very-sensitive-local Bowtie2 preset recommended by Forbes et al. (2025) increases false positive rate by 42x to 0.2% overall for a complex bacterial metagenome in my testing. Proceed with caution when applying these presets.

Citation

DOI:10.1101/2023.07.04.547735

Bede Constantinides, Martin Hunt, Derrick W Crook. "Hostile: accurate decontamination of microbial host sequences" Bioinformatics, 2023; btad728, https://doi.org/10.1093/bioinformatics/btad728

```latex
@article{10.1093/bioinformatics/btad728,
    author = {Constantinides, Bede and Hunt, Martin and Crook, Derrick W},
    title = {Hostile: accurate decontamination of microbial host sequences},
    journal = {Bioinformatics},
    volume = {39},
    number = {12},
    pages = {btad728},
    year = {2023},
    month = {12},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad728},
    url = {https://doi.org/10.1093/bioinformatics/btad728},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/12/btad728/54850422/btad728.pdf},
}
```

Owner

  • Name: Bede Constantinides
  • Login: bede
  • Kind: user
  • Company: Oxford Nanopore Technologies

GitHub Events

Total
  • Create event: 4
  • Release event: 2
  • Issues event: 37
  • Watch event: 22
  • Delete event: 5
  • Issue comment event: 67
  • Push event: 45
  • Fork event: 4
Last Year
  • Create event: 4
  • Release event: 2
  • Issues event: 37
  • Watch event: 22
  • Delete event: 5
  • Issue comment event: 67
  • Push event: 45
  • Fork event: 4

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 39
  • Total pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 20
  • Total pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 21
  • Pull requests: 1
  • Average time to close issues: 23 days
  • Average time to close pull requests: 2 days
  • Issue authors: 13
  • Pull request authors: 1
  • Average comments per issue: 1.9
  • Average comments per pull request: 1.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bede (16)
  • arpit20328 (3)
  • Rohit-Satyam (2)
  • bdklahn (2)
  • xiaoli-dong (1)
  • jannikseidelQBiC (1)
  • dterumalai (1)
  • Ackia (1)
  • jfy133 (1)
  • chantisakee (1)
  • ryanjameskennedy (1)
  • naturepoker (1)
  • tnmquann (1)
  • BioWilko (1)
  • pvanheus (1)
Pull Request Authors
  • JDhillonEIT (1)
  • bede (1)
Top Labels
Issue Labels
enhancement (12) bug (4) documentation (3) low priority (1) questionable (1) packaging (1)
Pull Request Labels
enhancement (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 124 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 12
  • Total maintainers: 1
pypi.org: hostile

Accurate host read removal

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 124 Last month
Rankings
Stargazers count: 9.9%
Dependent packages count: 10.1%
Downloads: 13.1%
Average: 15.5%
Dependent repos count: 21.6%
Forks count: 22.6%
Maintainers (1)
Last synced: 6 months ago