Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.8%) to scientific vocabulary
Repository
Core metagenomic binning tools for mtsv
Basic Info
- Host: GitHub
- Owner: FofanovLab
- License: mit
- Language: Rust
- Default Branch: master
- Size: 14 MB
Statistics
- Stars: 6
- Watchers: 2
- Forks: 2
- Open Issues: 2
- Releases: 7
Metadata Files
README.md
mtsv-tools
MTSv Tools is a suite of core tools for taxonomic classification of metagenomic sequencing reads. MTSv performs a full-alignment using an FM-index assisted q-gram filter followed by SIMD accelerated Smith-Waterman alignment.
Installation
conda install mtsv-tools -c bioconda
Requirements
mtsv is built in Rust. You'll need:
rustcandcargo>= 1.29.0 (rustup.rs is the easiest installation method)- a C compiler (tested with GCC and clang)
Tests
To run tests:
$ cargo test
To generate a code coverage report, make sure kcov >= 26 is installed on your PATH, then install cargo-kcov:
$ cargo install cargo-kcov
To run coverage:
$ cargo kcov -- --exclude-pattern="/.cargo,vendor/,tests/,bench/,include/,bin/,ssw/"
This will place a code coverage report under target/kcov/index.html.
Building Package
To build the MTSv binaries:
$ cargo build --release
They'll be available under target/release/mtsv-*.
Documentation
To generate the internal documentation:
$ cargo doc [--open]
(pass the --open flag if you want to immediately open the docs in your browser)
Usage
mtsv builds several binaries:
mtsv-chunkmtsv-binnermtsv-buildmtsv-collapse
All of these accept the --help flag to print a help message on their usage. See below for specific use instructions.
Reference Sequence Data
MTSv implements a custom metagenomic index (MG-index) based on the FM-index data structure. Reference indices must be built prior to performing taxonomic classification.
Reference file format
To construct the MG-indices, you'll need a multi-FASTA file of all reference sequences, with headers in the format SEQID-TAXID. So a sequence has a unique integer ID 12345, and belongs to the NCBI taxonomic ID 987, the header for that sequence should read 12345-987. The reference sequences can be sourced from any DNA sequence collection (i.e., GenBank, RefSeq, etc.) and customized to fit your project.
Chunking reference database
Because MTSv was designed to be highly parallelizable, we recommend building multiple indices from smaller chunks of the reference sequences. This helps reduce the memory requirements and allows for faster processing for both index building and assignment.
$ mtsv-chunk -i PATH_TO_FASTA -o PATH_TO_CHUNK_FOLDER -g NUM_GBS_PER_CHUNK
This will break up the reference fasta into a series of smaller files and place them into the directory specified. See the help message for further information.
``` mtsv-chunk 2.0.0 Adam Perry adam.n.perry@gmail.com:Tara Furstenau tara.furstenau@gmail.com Split a FASTA reference database into chunks for index generation.
USAGE: mtsv-chunk [FLAGS] --input --output
FLAGS: -v Include this flag to trigger debug-level logging. -h, --help Prints help information -V, --version Prints version information
OPTIONS: -i, --input Path(s) to vedro results files to collapse -o, --output
Metagenomic index build (MG-index)
Now that you have N chunks of your FASTA database, they need to be processed into indices which MTSv can use for querying. During the index build, the sequences in the chunked FASTA file are concatenated while recording the location of sequence boundaries and the TaxID associated with each sequence. A suffix array, Burrows-Wheeler Transform (BWT), and FM-index are built from the concatenated sequences using the Rust-Bio v0.39.1 package. The FM-index and the associated sequence metadata constitutes the MG-index. One MG-index is created per FASTA file, and new indices can be added as the reference collection grows without needing to rebuild any of the existing indices.
$ mtsv-build --fasta /path/to/chunkN.fasta --index /path/to/write/chunkN.index
Using default settings, indices will be ~3.6x the size of the reference file and require about that much RAM to run the binning step. The default sampling interval is 64 for the BWT occurance array and 32 for the suffix array. This can be overridden by passing --sample-interval <FM_SAMPLE_INTERVAL> for the occurance array or --sa-sample <SA_SAMPLE_RATE> for the suffix array. Lower values will increase the size of the index and can provide a reduction in query time. Increasing the flag will decrease the size of the index up to a point while accepting a slower query time.
See the help message for other options. ``` $ mtsv-build --help mtsv-build 2.0.0 Adam Perry adam.n.perry@gmail.com:Tara Furstenau tara.furstenau@gmail.com Index construction for mtsv metagenomics binning tool.
USAGE:
mtsv-build [FLAGS] [OPTIONS] --fasta
FLAGS: -v Include this flag to trigger debug-level logging. -h, --help Prints help information -V, --version Prints version information
OPTIONS:
-f, --fasta
-i, --index <INDEX> Absolute path to mtsv index file.
--sa-sample <SA_SAMPLE_RATE>
Suffix array sampling rate. If sampling rate is k, every k-th entry will be kept. [default: 32]
```
Binning Reads
The mtsv-binner command assignes the reads to reference sequences in the provided MG-index (a separate binning command should be run for each of the desired MG-Indices). It will begin by extracting overlapping substrings (seeds) of the same size (--seed-size) with certain offsets (--seed-interval) from each query sequence and its reverse complement. It then uses the MG-index to search for exact, ungapped matches for each seed. The seed matches are sorted by location and grouped into candidate regions using specified windows. The number of hits per candidate is tallied and any candidate that does not meet the minimum number of seed hits is filtered out. The remaining candidate positions are sorted in descending order by the number of seed hits so that the most promising regions are evaluated first.
For each candidate region, MTSv extracts the corresponding range from the reference sequence and looks up the TaxID associated with the region in the MG-index. If the current query has already been sucessfully aligned to the TaxID associated with the candidate region, no additional alignment is attempted, and the next candidate region is checked. Otherwise an SIMD-accelerated Smith-Waterman alignment is performed between the extracted reference sequence and the query sequence (using a scoring of 1 for matches and -1 for mismatches, gap opening, and gap extension). If the alignment score is sufficiently high, there is one final check to determine if the edit distance is less than or equal to the user-specified edit distance cutoff (--edit-rate). If the alignment is considered successful, then no further alignments are attempted for that query against the same TaxID. Skipping all additional alignments to a TaxID avoids many expensive operations and reduces computation time.
Parameters
The candidate filtering step is based on a q-gram filtering algorithm which defines the minimum number of exact k-mer matches (from all n-k+1 overlapping k-mers that can be expected between an n-length read and a reference sequence with at most e mismatches. In the worst case where all mismatches are evenly spaced across the alignment, the minimum number of matching k-mers is: m = (n+1) - k(e+1) and m is positive when n/(e+1) > k. If only every lth overlapping k-mer is used, the minimum number of matching k-mers is expected to be m/l. The user provides the seed k-mer size (--seed-size) and the interval l (--seed-interval) which establishes the number of seeds as n_seeds = ceil((n - k + 1)/l) and because this varies based on read size, the minimum number of reads required to make an assignment (--min-seed) is provided as a percentage of these seeds floor(min-seed * n_seeds). Similarly, the edit distance threshold is calculated as the product of the --edit-rate (float between 0 and 1) and the length of the read, n.
$ mtsv-binner --edit-rate 0.13 --seed-size 18 \
--seed-interval 2 --threads 8 \
--index /path/to/chunk1.index \
--fastq /path/to/reads.fastq \
--results /path/to/write/chunk1_results.txt
See the help message for other options.
``` $ mtsv-binner --help mtsv 2.0.0 Adam Perry adam.n.perry@gmail.com:Tara Furstenau tara.furstenau@gmail.com Metagenomics binning tool.
USAGE:
mtsv-binner [FLAGS] [OPTIONS] --fasta
FLAGS: -v Include this flag to trigger debug-level logging. -h, --help Prints help information -V, --version Prints version information
OPTIONS:
-e, --edit-rate
Output
mtsv-binner writes results for a single read per line. For example, if a read with the header R1_123 maps to taxon IDs 562, 9062, and 100 with edit distances 5, 10, and 11:
R1_123:562=5,9062=10,100=11
Collapsing Results
Since each output file from the mtsv-binner command will only represent assignments to references within a single MG-index, the results from all MG-indices must be combined into a single results file for further analysis.
$ mtsv-collapse /path/to/chunk1_results.txt /path/to/chunk2_results.txt ... \
--output /path/to/collapsed_results.txt
Make sure to include all of the chunk files. While the collapser could be run in multiple phases, it's generally much faster to do them all at once. If the same TaxID was assigned to the same read in multiple files, the one with the lowest edit distance will be recorded in the final output.
See the help message for other options.
``` $ mtsv-collapse --help mtsv-collapse 2.0.0 Adam Perry adam.n.perry@gmail.com:Tara Furstenau tara.furstenau@gmail.com Tool for combining the output of multiple separate mtsv runs.
USAGE:
mtsv-collapse [FLAGS]
FLAGS: -v Include this flag to trigger debug-level logging. -h, --help Prints help information -V, --version Prints version information
OPTIONS: -o, --output
ARGS:
Owner
- Name: FofanovLab
- Login: FofanovLab
- Kind: organization
- Repositories: 3
- Profile: https://github.com/FofanovLab
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Furstenau
given-names: Tara
- family-names: Schneider
given-names: Tsosie
- family-names: Perry
given-names: Adam
- family-names: Fofanov
given-names: Viacheslav
title: mtsv_tools-v2.0.2
version: 2.0.2
date-released: 2017-7-25
GitHub Events
Total
Last Year
Dependencies
- 132 dependencies
- mktemp 0.2 development
- quickcheck 0.3 development
- rand 0.3 development
- anyhow 1.0
- bincode 1.3.3
- bio 0.39.1
- chrono 0.2
- clap 2.9
- cue 0.1
- env_logger 0.3
- flate2 0.2
- itertools 0.4
- log 0.3
- rustc-serialize 0.3.24
- serde 1.0
- stopwatch 0.0.7
- tar 0.4
- quickcheck 0.3 development
- libc 0.2