https://github.com/bacpop/sparrowhawk
Short-read assembler for bacterial genomics based on a de Bruijn graph written in Rust 🦅
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.6%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 9
- Watchers: 2
- Forks: 1
- Open Issues: 9
- Releases: 2
Topics
Metadata Files
README.md
sparrowhawk 
Short-read assembler for bacterial genomics based on a de Bruijn graph written in Rust.
Disclaimer :warning: :construction:
This project is a work in progress. In particular:
- Not all the main features we want are implemented yet.
- Code might be messy and undocumented.
- General documentation on how to install and use the tool might be short or even missing.
- Finding unexpected errors/behaviour or bugs should not be a surprise.
- Some features might be partially hardcoded.
These (and potentially other) items will be progressively fixed before version 1.0.
sparrowhawk?
Sparrowhawk was at one time the Archmage of Earthsea. Also, the sparrowhawk (Accipiter nisus) is a bird of prey native to Europe (and the island of Gont).
Description
Note: this repository is for the Rust-based genomic assembler. If you are looking for its web implementation, see sparrowhawk-web.
sparrowhawk aims to be a fast short-read assembler for bacterial genomics. It builds on, and takes inspiration from, other Rust-based tools developed by our group (such as ska.rust), as well as other tools (such as Katome or SKESA).
Current main features:
- Currently supports only Illumina paired short reads.
- A single k value, which must be odd and between 3 and 256 (going below 19/21 is not recommended, due to memory requirements).
- Designed with bacterial (i.e. "small") genomes in mind.
- Uses a node-based de Bruijn graph built upon petgraph.
- Partially parallelised with rayon.
- Can be compiled to WebAssembly (currently only the wasm32-unknown-unknown target) to run the assembler in web projects. For more info, see sparrowhawk-web.
:construction: Main planned features, in progress (not exhaustive): :construction:
- Partial GPU acceleration.
- Multi-k support.
- Improved error correction and graph collapse logic.
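To make the "node-based de Bruijn graph" and the constraints on k more concrete, here is a minimal sketch using only the Rust standard library. It is not sparrowhawk's implementation (which is built upon petgraph): the names `validate_k` and `build_graph` are hypothetical, only the forward strand is considered, and reads are assumed to be plain ASCII sequences.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Checks the constraints listed above: k must be odd and between 3 and 256.
/// Hypothetical helper, not sparrowhawk's own validation code.
fn validate_k(k: usize) -> Result<usize, String> {
    if !(3..=256).contains(&k) {
        return Err(format!("k = {k} is outside the range 3..=256"));
    }
    if k % 2 == 0 {
        return Err(format!("k = {k} must be odd"));
    }
    Ok(k)
}

/// Builds a node-based de Bruijn graph as an adjacency list.
/// Every distinct k-mer is a node; there is an edge u -> v whenever the
/// last k-1 bases of u equal the first k-1 bases of v and both k-mers
/// were observed in the reads. Assumes k >= 1 (see `validate_k`).
fn build_graph(reads: &[&str], k: usize) -> BTreeMap<String, Vec<String>> {
    // Collect the set of distinct k-mers (the nodes).
    let mut kmers: BTreeSet<String> = BTreeSet::new();
    for read in reads {
        if !read.is_ascii() {
            continue; // assumption of this sketch: skip non-ASCII input
        }
        for window in read.as_bytes().windows(k) {
            kmers.insert(String::from_utf8_lossy(window).into_owned());
        }
    }

    // For each node, try the four possible one-base extensions of its
    // (k-1)-suffix and keep those that are themselves observed k-mers.
    let mut graph = BTreeMap::new();
    for kmer in &kmers {
        let suffix = &kmer[1..];
        let successors: Vec<String> = ['A', 'C', 'G', 'T']
            .iter()
            .map(|base| format!("{suffix}{base}"))
            .filter(|candidate| kmers.contains(candidate))
            .collect();
        graph.insert(kmer.clone(), successors);
    }
    graph
}
```

In this sketch the edges are never stored while reading the input: they are derived from the (k-1)-overlaps between nodes, which is what distinguishes a node-based graph from an edge-based one. A real assembler would also have to handle reverse complements and sequencing errors.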
Installation
Currently the only option is to compile from source.
Compilation from source
Development has been done only on x86_64 GNU/Linux-based systems, and will most likely stay that way (i.e. no other systems have been tested). To compile the project from source as we did, you will need the Rust toolchain installed on your system. Then you can download the code for a given version, e.g. v0.1.1, using
git clone --branch v0.1.1 https://github.com/bacpop/sparrowhawk.git
sparrowhawk is designed to compile and run natively as an x86_64 binary, but you can also compile it to the WebAssembly target wasm32-unknown-unknown. You can see below how to do it manually (as we did for development). Check out sparrowhawk-web for an integrated project with JavaScript and wasm-pack.
Compilation to x86_64 (default)
Move into the downloaded repository and use cargo to build the project. You can add the --release argument to include some compiler optimisations.
cd sparrowhawk
cargo build --release
If using the --release flag, this should place your compiled binary inside target/release.
Compilation to wasm32-unknown-unknown
For this, you will need to activate the feature wasm with the -F argument and manually set the target. You can add the --release argument to include some compiler optimisations.
cd sparrowhawk
cargo build --release -F wasm --target wasm32-unknown-unknown
If using the --release flag, this should place your compiled binary inside target/wasm32-unknown-unknown/release as sparrowhawk.wasm.
Usage
Here we only consider the binary compiled for x86_64; refer to sparrowhawk-web for an example of how to use the wasm32-unknown-unknown compilation target.
Once compiled, sparrowhawk can be called without arguments to see the basic options and arguments:
./sparrowhawk
Currently, only the build option is available (apart from help); it assembles the genomes. You can check its arguments by running
./sparrowhawk build --help
An example execution could be the following:
./sparrowhawk build -f ./reads.tsv -k 31 --threads 1 -v --min-count 5 --output-dir ./ --output-prefix prefix
This will assemble your reads with k=31, using a single thread. A k-mer must appear at least 5 times to be considered (5 is also the default). The output contigs will be written to the current directory as a FASTA file called prefix_contigs.fasta, because the --output-prefix argument was set to "prefix". The input files in this case are provided through reads.tsv, a tab-separated file that contains an identifier for your reads and the two file paths separated by a space, i.e. a file that contains this line
IDENTIFIER /path/to/the/read_1.fastq /path/to/the/read_2.fastq
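As an illustration of the read-list format shown above, the following sketch parses one such line into an identifier and two paths. It is a hedged example and not sparrowhawk's parser: the `ReadPair` type and `parse_read_list_line` function are hypothetical, and because the description mentions both tabs and spaces, the sketch accepts any whitespace as a separator (so paths containing spaces are not supported here).

```rust
use std::path::PathBuf;

/// One entry of the read list: a sample identifier plus the two FASTQ paths.
/// Hypothetical type used only for this illustration.
#[derive(Debug, PartialEq)]
struct ReadPair {
    id: String,
    fwd: PathBuf,
    rev: PathBuf,
}

/// Parses a line such as
/// `IDENTIFIER /path/to/the/read_1.fastq /path/to/the/read_2.fastq`.
/// Fields may be separated by tabs or spaces (an assumption of this sketch).
fn parse_read_list_line(line: &str) -> Result<ReadPair, String> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    match fields.as_slice() {
        [id, fwd, rev] => Ok(ReadPair {
            id: (*id).to_string(),
            fwd: PathBuf::from(*fwd),
            rev: PathBuf::from(*rev),
        }),
        _ => Err(format!(
            "expected 3 fields (identifier and two paths), found {}",
            fields.len()
        )),
    }
}
```

Returning a `Result` rather than panicking lets the caller report which line of the list is malformed.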
Alternatively, you could have run:
./sparrowhawk build /path/to/the/read_1.fastq /path/to/the/read_2.fastq -k 31 --threads 1 -v --min-count 5 --output-dir ./ --output-prefix prefix
In the same folder as the output FASTA file, the graph before collapsing will be exported in DOT format and in GFA versions 1.1 and 2, as prefix_graph.dot, prefix_graph.gfa, and prefix_graph.gfa2 respectively. A histogram of the k-mer frequency spectrum will be saved in the same directory as prefix_kmerspectrum.png. The generation of these optional files can be disabled with the corresponding arguments.
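The role of --min-count and of the k-mer frequency spectrum can be sketched as follows. This is an illustrative example under simplifying assumptions, not the code sparrowhawk runs: the function names are hypothetical, only the forward strand is counted, and reads are assumed to be ASCII.

```rust
use std::collections::{BTreeMap, HashMap};

/// Counts how many times each k-mer occurs across all reads.
fn count_kmers(reads: &[&str], k: usize) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // `windows(0)` would panic
    }
    for read in reads {
        if !read.is_ascii() {
            continue; // assumption of this sketch: skip non-ASCII input
        }
        for window in read.as_bytes().windows(k) {
            *counts
                .entry(String::from_utf8_lossy(window).into_owned())
                .or_insert(0) += 1;
        }
    }
    counts
}

/// The k-mer frequency spectrum: for each multiplicity, the number of
/// distinct k-mers seen exactly that many times.
fn kmer_spectrum(counts: &HashMap<String, u32>) -> BTreeMap<u32, usize> {
    let mut spectrum = BTreeMap::new();
    for &count in counts.values() {
        *spectrum.entry(count).or_insert(0) += 1;
    }
    spectrum
}

/// Keeps only the k-mers seen at least `min_count` times, sorted for
/// reproducible output.
fn filter_by_min_count(counts: &HashMap<String, u32>, min_count: u32) -> Vec<&str> {
    let mut kept: Vec<&str> = counts
        .iter()
        .filter(|(_, count)| **count >= min_count)
        .map(|(kmer, _)| kmer.as_str())
        .collect();
    kept.sort_unstable();
    kept
}
```

Low-multiplicity k-mers are often the result of sequencing errors, which is why a minimum count threshold is applied before building the graph; the spectrum histogram helps to judge whether the chosen threshold is sensible for a given dataset.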
Owner
- Name: Bacterial population genetics
- Login: bacpop
- Kind: organization
- Email: contact@bacpop.org
- Location: United Kingdom
- Website: www.bacpop.org
- Repositories: 20
- Profile: https://github.com/bacpop
Pathogen Informatics and Modelling @ EMBL-EBI / Bacterial Evolutionary Epidemiology Group @ Imperial College London
GitHub Events
Total
- Create event: 5
- Release event: 1
- Issues event: 3
- Watch event: 4
- Issue comment event: 5
- Push event: 18
- Pull request event: 3
- Fork event: 1
Last Year
- Create event: 5
- Release event: 1
- Issues event: 3
- Watch event: 4
- Issue comment event: 5
- Push event: 18
- Pull request event: 3
- Fork event: 1
Dependencies
- pretty_assertions 1.4.1 development
- stacker 0.1.20 development
- ahash 0.8.11
- argmin 0.9
- argmin-math 0.3
- bnum 0.13.0
- clap 4.5.35
- console_error_panic_hook 0.1.7
- flate2 1.0
- getrandom 0.2
- indicatif 0.17.11
- js-sys 0.3.51
- json 0.12.4
- libm 0.2
- log 0.4.27
- needletail 0.6.3
- nohash-hasher 0.2.0
- num-traits 0.2.19
- num_cpus 1.16.0
- petgraph 0.8.1
- plotters 0.3.7
- project-root 0.2.2
- rayon 1.10.0
- regex 1.11.1
- seq_io 0.3.2
- serde 1.0.219
- simple-error 0.2
- simple_logger 5.0.0
- stacker 0.1.20
- wasm-bindgen 0.2
- wasm-bindgen-file-reader 1
- web-sys 0.3.77