Back to sequences
Back to sequences: Find the origin of k-mers - Published in JOSS (2024)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 9 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: pierrepeterlongo
- License: agpl-3.0
- Language: Rust
- Default Branch: main
- Size: 1.91 MB
Statistics
- Stars: 45
- Watchers: 3
- Forks: 8
- Open Issues: 0
- Releases: 6
Metadata Files
README.md
Back to sequences

Given a set $K$ of kmers (fasta / fastq [.gz] format) and a set of sequences (fasta / fastq [.gz] format), this tool extracts the sequences containing some of those kmers.

A minimal ($m$) and a maximal ($M$) threshold can be set. A sequence whose percentage of kmers shared with $K$ is in $]m, M]$ (strictly greater than $m$ and at most $M$) is output with its original header + the number of shared kmers + the ratio of shared kmers:
```
original_header 20 6.13
TGGATAAAAAGGCTGACGAAAGGTCTAGCTAAAATTGTCAGGTGCTCTCAGATAAAGCAGTAAGCGAGTTGGTGTTCGCTGAGCGTCGACTAGGCAACGTTAAAGCTATTTTAGGC...
```
In this case, 20 kmers are shared with the indexed kmers. This represents 6.13% of the kmers in the sequence.
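The threshold rule described above can be sketched in Rust as follows. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names `shared_kmers` and `keep` are hypothetical, matching is exact on the forward strand only (no reverse complement or canonical form), and every kmer position in the sequence is counted.

```rust
use std::collections::HashSet;

/// Hypothetical helper: count the kmer positions of `seq` whose kmer
/// belongs to `indexed`. Returns (shared, total).
/// Assumes ASCII sequences and exact forward-strand matching.
fn shared_kmers(seq: &str, indexed: &HashSet<String>, k: usize) -> (usize, usize) {
    if k == 0 || seq.len() < k {
        return (0, 0);
    }
    let total = seq.len() - k + 1;
    let shared = (0..total)
        .filter(|&i| indexed.contains(&seq[i..i + k]))
        .count();
    (shared, total)
}

/// Hypothetical helper: true when the percentage of shared kmers
/// lies in ]m, M], i.e. strictly above `min_pct` and at most `max_pct`.
fn keep(shared: usize, total: usize, min_pct: f64, max_pct: f64) -> bool {
    if total == 0 {
        return false;
    }
    let pct = 100.0 * shared as f64 / total as f64;
    pct > min_pct && pct <= max_pct
}

fn main() {
    let k = 3;
    let indexed: HashSet<String> = ["ACG", "GTT"].iter().map(|s| s.to_string()).collect();
    let seq = "ACGTACGT";

    let (shared, total) = shared_kmers(seq, &indexed, k);
    if keep(shared, total, 0.0, 100.0) {
        let pct = 100.0 * shared as f64 / total as f64;
        // Header, number of shared kmers, percentage of shared kmers.
        println!("read1 {} {:.2}", shared, pct); // prints "read1 2 33.33"
        println!("{}", seq);
    }
}
```

With $m = 0$ and $M = 100$, a sequence is kept as soon as it shares at least one kmer with $K$, because the lower bound is exclusive.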
Install
Please see https://b2s-doc.readthedocs.io/en/latest/usage.html#installation
Simplest usage
```bash
back_to_sequences --in-kmers kmers.fasta --in-sequences reads.fasta --out-sequences filtered_reads.fasta --out-kmers counted_kmers.txt
```
The filtered_reads.fasta file contains the original sequences (here reads) from reads.fasta that contain at least one of the kmers from kmers.fasta. The header of each read is the same as in reads.fasta, followed by the number of shared kmers and the ratio of shared kmers.
As the --out-kmers option is used, the file counted_kmers.txt contains for each kmer in kmers.fasta the number of times it was found in filtered_reads.fasta.
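The per-kmer counting behind `--out-kmers` can be pictured with the sketch below. It is a simplified illustration, not the tool's code: the function name `count_kmers` is hypothetical, matching is exact on the forward strand only, and the output layout of counted_kmers.txt is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical helper: for each indexed kmer, count how many times it
/// occurs across `sequences`. Kmers never seen keep a count of 0.
/// Assumes ASCII sequences and exact forward-strand matching.
fn count_kmers(indexed: &[&str], sequences: &[&str], k: usize) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> =
        indexed.iter().map(|s| (s.to_string(), 0)).collect();
    for &seq in sequences {
        if k == 0 || seq.len() < k {
            continue;
        }
        for i in 0..=seq.len() - k {
            // Only kmers present in the index are counted.
            if let Some(c) = counts.get_mut(&seq[i..i + k]) {
                *c += 1;
            }
        }
    }
    counts
}

fn main() {
    let indexed = ["ACG", "GTT", "TTT"];
    let reads = ["ACGTACGT", "CCGTT"];
    let counts = count_kmers(&indexed, &reads, 3);
    for kmer in indexed {
        println!("{} {}", kmer, counts[kmer]);
    }
}
```

Reads that share no kmer with the index add nothing to any count, so in this sketch counting over all reads or only over the retained reads gives the same result when every read sharing at least one kmer is retained.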
Result example
Example results obtained on:
* the GenOuest platform, on a node with 32 threads, Xeon 2.2 GHz, denoted by "genouest" in the table below.
* a MacBook, Apple M2 Pro, 16 GB RAM, with 10 threads, denoted by "mac" in the table below.
* an AMD Ryzen 7 5800X, 4.2 GHz, 64 GB RAM, with 16 threads, denoted by "AMD" in the table below.

Indexed: one million kmers, each of length 31. Queried: from 10,000 reads to 200 million reads, each of length 100.

| Number of reads | Time genouest | Time mac | Time AMD | max RAM |
|:---------------:|:-------------:|:--------:|:--------:|:-------:|
| 10,000 | 0.7s | 0.54s | 0.4s | 0.13 GB |
| 100,000 | 0.8s | 0.8s | 1.2s | 0.13 GB |
| 1,000,000 | 2.0s | 3.5s | 7.1s | 0.13 GB |
| 10,000,000 | 7.1s | 11s | 16s | 0.13 GB |
| 100,000,000 | 47s | 58s | 48s | 0.13 GB |
| 200,000,000 | 1m32s | 1m52s | 1m44s | 0.13 GB |
See this page for details
Basic usage and parameters
Please refer to the specific documentation for:
* basic usage
* a complete description of parameters
Contributions
Please check out How to contribute
Citations
Baire et al., (2024). Back to sequences: Find the origin of k-mers. Journal of Open Source Software, 9(101), 7066, https://doi.org/10.21105/joss.07066
bibtex:
```bib
@article{Baire2024,
  author = {Anthony Baire and Pierre Marijon and Francesco Andreace and Pierre Peterlongo},
  title = {Back to sequences: Find the origin of k-mers},
  journal = {Journal of Open Source Software},
  doi = {10.21105/joss.07066},
  url = {https://doi.org/10.21105/joss.07066},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {101},
  pages = {7066}
}
```
Documentation
Full documentation is available at https://b2s-doc.readthedocs.io/en/latest/
Owner
- Name: Pierre Peterlongo
- Login: pierrepeterlongo
- Kind: user
- Location: Rennes, France
- Company: @Inria
- Website: http://people.rennes.inria.fr/Pierre.Peterlongo/
- Twitter: p_peterlongo
- Repositories: 1
- Profile: https://github.com/pierrepeterlongo
I’m designing models and algorithms for sequencing (NGS & TGS) data.
JOSS Publication
Back to sequences: Find the origin of k-mers
Authors
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France
Tags
rust kmer indexing genomic sequencing data
GitHub Events
Total
- Release event: 3
- Watch event: 2
- Delete event: 1
- Push event: 45
- Pull request event: 6
- Create event: 6
Last Year
- Release event: 3
- Watch event: 2
- Delete event: 1
- Push event: 45
- Pull request event: 6
- Create event: 6
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Pierre Peterlongo | p****o@i****r | 227 |
| Pierre Marijon | p****t@a****r | 19 |
| Bryce Mecum | p****h@g****m | 12 |
| Anthony Baire | A****e@i****r | 10 |
| Francesco Andreace | f****e@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 5
- Total pull requests: 10
- Average time to close issues: 12 days
- Average time to close pull requests: about 18 hours
- Total issue authors: 5
- Total pull request authors: 4
- Average comments per issue: 2.6
- Average comments per pull request: 0.2
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: about 1 hour
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- amoeba (1)
- husamia (1)
- natir (1)
- cmonat (1)
- Anjan-Purkayastha (1)
Pull Request Authors
- natir (6)
- pierrepeterlongo (4)
- amoeba (2)
- a-ba (1)
