Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 4 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 3 of 6 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.7%) to scientific vocabulary
Keywords
Repository
Yet Another Chimeric Read Detector
Basic Info
- Host: GitHub
- Owner: natir
- License: mit
- Language: Rust
- Default Branch: master
- Size: 36.6 MB
Statistics
- Stars: 81
- Watchers: 3
- Forks: 8
- Open Issues: 0
- Releases: 11
Topics
Metadata Files
Readme.md
Yet Another Chimeric Read Detector for long reads 🧬 💻
Using all-against-all read mapping, yacrd performs:
- computation of pile-up coverage for each read
- detection of chimeras
Chimera detection is done as follows:
- for each region where coverage is less than or equal to min_coverage (default 0), yacrd creates a bad region
- if there is a bad region that starts at a position strictly after the beginning of the read and ends strictly before the end of the read, the read is marked as Chimeric
- if total bad region length > 0.8 * read length, the read is marked as NotCovered
- if a read isn't Chimeric or NotCovered, it is marked as NotBad
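The rules above can be illustrated with a minimal Python sketch. It is not yacrd's implementation: the function names, the per-base coverage list, the half-open intervals, and the choice to test NotCovered before Chimeric when both rules apply are all assumptions made for illustration.

```python
def bad_regions(coverage, min_coverage=0):
    """Return half-open (begin, end) intervals where coverage <= min_coverage."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth <= min_coverage:
            if start is None:
                start = pos          # a bad region opens here
        elif start is not None:
            regions.append((start, pos))  # the bad region closes here
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


def classify(read_length, regions, not_covered_ratio=0.8):
    """Label a read from its bad regions.

    Assumption: NotCovered is tested first when both rules apply.
    """
    total_bad = sum(end - begin for begin, end in regions)
    if total_bad > not_covered_ratio * read_length:
        return "NotCovered"
    # A bad region strictly inside the read marks it as chimeric.
    if any(begin > 0 and end < read_length for begin, end in regions):
        return "Chimeric"
    return "NotBad"
```

For instance, a read whose coverage drops to zero only in its middle is labelled Chimeric, while a read with a single bad region at one end covering less than 80% of its length stays NotBad.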
WARNING:
Minimap2 v2.19 introduced changes to seed selection and to the chaining of those seeds used to generate overlaps. These changes could affect yacrd's behavior. The effect does not seem significant (thanks to Rohit-Satyam for testing), but if you use a later version, it is at your own risk.
Rationale
Long read error-correction tools usually detect and also remove chimeras. But it is difficult to isolate or retrieve information from just this step.
DAStrim (from the DASCRUBBER suite) does a similar job to yacrd but relies on a different mapping step and uses different (likely more advanced) heuristics. Yacrd is simpler and easier to use.
This repository contains a set of scripts to evaluate yacrd against other similar tools such as DASCRUBBER and miniscrub on real data sets.
Input
Any set of long reads (PacBio, Nanopore, anything that can be given to minimap2). yacrd takes as input the resulting PAF (Pairwise mApping Format) file from minimap2, or a BLASR m4 file from another long-read overlapper.
Requirements
Installation
With conda
yacrd is available in the bioconda channel.
If the bioconda channel is set up, you can run:
conda install yacrd
From source
```
git clone https://github.com/natir/yacrd.git
cd yacrd
git checkout v0.6.2

cargo build
cargo test
cargo install --path .
```
How to use Yacrd
Find chimera
minimap2 -x {corresponding preset} reads.fq reads.fq > overlap.paf
yacrd -i overlap.paf -o reads.yacrd
Post-detection operation
yacrd can perform several post-detection operations:
- filter: for a sequence or overlap file, records containing reads marked as Chimeric or NotCovered are not written to the output
- extract: for a sequence or overlap file, only records containing reads marked as Chimeric or NotCovered are written to the output
- split: for a sequence file, bad regions in the middle of reads are removed, and NotCovered reads are removed
- scrubb: for a sequence file, all bad regions are removed, and NotCovered reads are removed
minimap2 -x {corresponding preset} reads.fq reads.fq > mapping.paf
yacrd -i mapping.paf -o reads.yacrd filter -i reads.fasta -o reads.filter.fasta
yacrd -i mapping.paf -o reads.yacrd extract -i reads.fasta -o reads.extract.fasta
yacrd -i mapping.paf -o reads.yacrd split -i reads.fasta -o reads.split.fasta
yacrd -i mapping.paf -o reads.yacrd scrubb -i reads.fasta -o reads.scrubb.fasta
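To make the difference between split and scrubb concrete, here is a small Python sketch that works on one read sequence and its bad regions. It is only an interpretation of the descriptions above, not yacrd's code: the function names, the half-open (begin, end) intervals, and returning the kept pieces as a list of strings are assumptions.

```python
def scrubb(sequence, regions):
    """Keep the stretches of `sequence` that lie outside every bad region."""
    pieces, cursor = [], 0
    for begin, end in sorted(regions):
        if begin > cursor:
            pieces.append(sequence[cursor:begin])  # good stretch before this region
        cursor = max(cursor, end)
    if cursor < len(sequence):
        pieces.append(sequence[cursor:])  # good stretch after the last region
    return pieces


def split(sequence, regions):
    """Remove only bad regions that lie strictly inside the read."""
    middle = [(b, e) for b, e in regions if b > 0 and e < len(sequence)]
    return scrubb(sequence, middle)
```

With a bad region at the start of the read and another in the middle, `split` keeps the bad start attached to the first piece, whereas `scrubb` trims it away as well.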
Recommended overlapping parameters for read scrubbing
We recommend these parameters for datasets with coverage above 30x.
For nanopore data, we recommend using minimap2 with the all-vs-all nanopore preset and a maximal distance between seeds set to 500 (option -g 500) to generate overlaps. We recommend running yacrd with the minimal coverage set to 4 (option -c 4) and the minimal coverage of a read set to 0.4 (option -n 0.4).
This is an example of how to run yacrd scrubbing:
minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
For PacBio P6-C4 data, we recommend using minimap2 with the all-vs-all PacBio preset and a maximal distance between seeds set to 800 (option -g 800) to generate overlaps. We recommend running yacrd with the minimal coverage set to 4 (option -c 4) and the minimal coverage of a read set to 0.4 (option -n 0.4).
minimap2 -x ava-pb -g 800 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
For PacBio Sequel data, we recommend using minimap2 with the all-vs-all PacBio preset and a maximal distance between seeds set to 5000 (option -g 5000) to generate overlaps. We recommend running yacrd with the minimal coverage set to 3 (option -c 3) and the minimal coverage of a read set to 0.4 (option -n 0.4).
minimap2 -x ava-pb -g 5000 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 3 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
If you have parameter sets for other types of data do not hesitate to make a pull request to add them, thanks.
Important note
Extension
yacrd uses the file extension to detect the file format. If your filename contains (anywhere):
- .paf: the file is treated as a minimap file
- .m4, .mhap: the file is treated as a BLASR m4 file (mhap output)
- .fa, .fasta: the file is treated as a fasta file
- .fq, .fastq: the file is treated as a fastq file
- .yacrd: the file is treated as a yacrd output file
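A "contains anywhere" rule like the one above can be sketched in a few lines of Python. This is an illustration, not yacrd's code: the function name, the returned labels, and the order in which markers are tested are assumptions.

```python
# Marker groups are tested in order. fastq comes before fasta because the
# string ".fastq" itself contains ".fa" and would otherwise match fasta.
FORMATS = [
    ((".paf",), "paf"),
    ((".m4", ".mhap"), "m4"),
    ((".fq", ".fastq"), "fastq"),
    ((".fa", ".fasta"), "fasta"),
    ((".yacrd",), "yacrd"),
]


def detect_format(filename):
    """Return a format label if any marker occurs anywhere in the filename."""
    for markers, label in FORMATS:
        if any(marker in filename for marker in markers):
            return label
    return None  # no known marker found
```

Because the marker may appear anywhere, a compressed file such as `overlap.paf.gz` is still recognised as a PAF file in this sketch.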
Compression
yacrd automatically detects whether a file is compressed (gzip, bzip2 and lzma compression are supported). For post-detection operations, if the input is compressed, the output uses the same compression format.
Use yacrd report as input
You can use a yacrd report as input in place of an overlap file. The ondisk options are ignored if you use a yacrd report as input.
Output
type_of_read id_in_mapping_file length_of_read length_of_gap,begin_pos_of_gap,end_pos_of_gap;length_of_gap,be…
Example
NotCovered readA 4599 3782,0,3782
Here, readA doesn't have sufficient coverage, there is a zero-coverage region of length 3782bp between positions 0 and 3782.
Chimeric readB 10452 862,1260,2122;3209,4319,7528
Here, readB is chimeric with 2 zero-coverage regions: one between bases 1260 and 2122, another between 4319 and 7528.
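A report line in the layout shown above can be read back with a short Python sketch. The function name is hypothetical. Splitting fields on whitespace and accepting a missing gap field (for reads without any bad region) are assumptions, since the source only shows the two examples above.

```python
def parse_report_line(line):
    """Parse 'type id length gap;gap;...' where each gap is 'length,begin,end'."""
    fields = line.split()
    read_type, read_id, read_length = fields[0], fields[1], int(fields[2])
    gaps = []
    # Assumption: the gap field may be absent when a read has no bad region.
    if len(fields) > 3 and fields[3]:
        for gap in fields[3].split(";"):
            gap_length, begin, end = (int(value) for value in gap.split(","))
            gaps.append((gap_length, begin, end))
    return read_type, read_id, read_length, gaps
```

Applied to the readB example, the sketch returns two gaps whose lengths equal `end - begin` (2122 - 1260 = 862 and 7528 - 4319 = 3209), which is a quick consistency check on a report.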
Minimum supported Rust version
Currently the minimum supported Rust version is 1.74.
Citation
If you use yacrd in your research, please cite the following publication:
Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré, yacrd and fpa: upstream tools for long-read genome assembly, Bioinformatics, btaa262, https://doi.org/10.1093/bioinformatics/btaa262
bibtex format:
@article{Marijon_2020,
doi = {10.1093/bioinformatics/btaa262},
url = {https://doi.org/10.1093%2Fbioinformatics%2Fbtaa262},
year = 2020,
month = {apr},
publisher = {Oxford University Press ({OUP})},
author = {Pierre Marijon and Rayan Chikhi and Jean-St{\'{e}}phane Varr{\'{e}}},
editor = {Inanc Birol},
title = {yacrd and fpa: upstream tools for long-read genome assembly},
journal = {Bioinformatics}
}
Owner
- Name: Pierre Marijon
- Login: natir
- Kind: user
- Location: Paris
- Company: Seqoia
- Website: https://pierre.marijon.fr/link.html
- Twitter: pierre_marijon
- Repositories: 105
- Profile: https://github.com/natir
Citation (CITATION.cff)
# YAML 1.2
---
abstract: "Genome assembly is increasingly performed on long, uncorrected reads. Assembly quality may be degraded due to unfiltered chimeric reads; also, the storage of all read overlaps can take up to terabytes of disk space. We introduce two tools: yacrd for chimera removal and read scrubbing, and fpa for filtering out spurious overlaps. We show that yacrd results in higher-quality assemblies and is one hundred times faster than the best available alternative. https://github.com/natir/yacrd and https://github.com/natir/fpa. Supplementary data are available at Bioinformatics online."
authors:
-
affiliation: " Department of Computer Science , Inria, Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Lille F-59000, France"
family-names: Marijon
given-names: Pierre
orcid: "https://orcid.org/0000-0002-6694-6873"
-
affiliation: " Department of Computational Biology , Institut Pasteur, C3BI USR 3756 IP CNRS, Paris, France"
family-names: Chikhi
given-names: Rayan
orcid: "https://orcid.org/0000-0003-1099-8735"
-
affiliation: " Univ. Lille , CNRS, Centrale Lille, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, F-59000 Lille, France"
family-names: "Varré"
given-names: "Jean-Stéphane"
orcid: "https://orcid.org/0000-0001-6322-0519"
cff-version: "1.1.0"
date-released: 2020-04-21
doi: "10.1093/bioinformatics/btaa262"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/natir/yacrd"
title: "yacrd and fpa: upstream tools for long-read genome assembly"
...
GitHub Events
Total
- Issues event: 6
- Watch event: 8
- Issue comment event: 3
- Push event: 3
Last Year
- Issues event: 6
- Watch event: 8
- Issue comment event: 3
- Push event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Marijon Pierre | p****n@i****r | 115 |
| Pierre Marijon | p****n@m****e | 34 |
| Pierre Marijon | p****n@h****e | 8 |
| Maël Kerbiriou | m****u@i****r | 7 |
| Rayan Chikhi | r****i | 5 |
| Anicet Ebou | a****u@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 49
- Total pull requests: 9
- Average time to close issues: 5 months
- Average time to close pull requests: 3 days
- Total issue authors: 28
- Total pull request authors: 5
- Average comments per issue: 2.71
- Average comments per pull request: 0.56
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: 7 days
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- natir (16)
- FSciammarella (3)
- Rohit-Satyam (2)
- dpryan79 (2)
- tseemann (2)
- oneillkza (2)
- emiliomastriani (1)
- dominik-handler (1)
- NinaMercedes (1)
- colindaven (1)
- desmodus1984 (1)
- rchikhi (1)
- shengzizhang (1)
- dcourtine (1)
- jiajia19901101 (1)
Pull Request Authors
- rchikhi (5)
- natir (1)
- Ebedthan (1)
- dependabot[bot] (1)
- Piezoid (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - cargo: 11,677 total
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 8
- Total maintainers: 1
crates.io: yacrd
Using all-against-all read mapping, yacrd performs: computation of pile-up coverage for each read and detection of chimeras
- Homepage: https://github.com/natir/yacrd
- Documentation: https://docs.rs/yacrd/
- License: MIT
- Latest release: 0.6.2 (published over 5 years ago)
Rankings
Maintainers (1)
Dependencies
- adler 1.0.2
- aho-corasick 0.7.18
- anyhow 1.0.58
- atty 0.2.14
- autocfg 1.1.0
- bgzip 0.2.1
- bincode 1.3.3
- bitflags 1.3.2
- bstr 0.2.17
- byteorder 1.4.3
- bzip2 0.4.3
- bzip2-sys 0.1.11+1.0.8
- cc 1.0.73
- cfg-if 1.0.0
- clap 3.2.14
- clap_derive 3.2.7
- clap_lex 0.2.4
- crc32fast 1.3.2
- crossbeam-channel 0.5.6
- crossbeam-deque 0.8.2
- crossbeam-epoch 0.9.10
- crossbeam-utils 0.8.11
- csv 1.1.6
- csv-core 0.1.10
- either 1.7.0
- env_logger 0.9.0
- fastrand 1.8.0
- flate2 1.0.24
- fs2 0.4.3
- fxhash 0.2.1
- hashbrown 0.12.3
- heck 0.4.0
- hermit-abi 0.1.19
- humantime 2.1.0
- indexmap 1.9.1
- instant 0.1.12
- itoa 0.4.8
- jobserver 0.1.24
- lazy_static 1.4.0
- libc 0.2.126
- lock_api 0.4.7
- log 0.4.17
- lzma-sys 0.1.19
- memchr 2.5.0
- memoffset 0.6.5
- miniz_oxide 0.5.3
- niffler 2.4.0
- noodles 0.18.0
- noodles-bgzf 0.8.0
- noodles-core 0.3.4
- noodles-fasta 0.6.0
- noodles-fastq 0.4.0
- noodles-sam 0.11.0
- num_cpus 1.13.1
- once_cell 1.13.0
- os_str_bytes 6.2.0
- parking_lot 0.11.2
- parking_lot_core 0.8.5
- pkg-config 0.3.25
- proc-macro-error 1.0.4
- proc-macro-error-attr 1.0.4
- proc-macro2 1.0.41
- quote 1.0.20
- rayon 1.5.3
- rayon-core 1.9.3
- redox_syscall 0.2.15
- regex 1.6.0
- regex-automata 0.1.10
- regex-syntax 0.6.27
- remove_dir_all 0.5.3
- remove_dir_all 0.7.0
- rustc-hash 1.1.0
- ryu 1.0.10
- scopeguard 1.1.0
- serde 1.0.140
- serde_derive 1.0.140
- sled 0.34.7
- smallvec 1.9.0
- strsim 0.10.0
- syn 1.0.98
- tempfile 3.3.0
- termcolor 1.1.3
- textwrap 0.15.0
- thiserror 1.0.31
- thiserror-impl 1.0.31
- unicode-ident 1.0.2
- version_check 0.9.4
- winapi 0.3.9
- winapi-i686-pc-windows-gnu 0.4.0
- winapi-util 0.1.5
- winapi-x86_64-pc-windows-gnu 0.4.0
- xz2 0.1.7
- zstd 0.7.0+zstd.1.4.9
- zstd-safe 3.1.0+zstd.1.4.9
- zstd-sys 1.5.0+zstd.1.4.9
- tempfile 3 development
- anyhow 1
- bincode 1
- clap 3
- csv 1
- env_logger 0.9
- log 0.4
- niffler 2
- noodles 0.18
- rayon 1
- remove_dir_all 0.7
- rustc-hash 1
- serde 1
- sled 0.34
- thiserror 1
- actions-rs/cargo v1 composite
- actions-rs/install v0.1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v1 composite
- codecov/codecov-action v4 composite