Back to sequences

Back to sequences: Find the origin of k-mers - Published in JOSS (2024)

https://github.com/pierrepeterlongo/back_to_sequences

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 9 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org
✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: pierrepeterlongo
License: agpl-3.0
Language: Rust
Default Branch: main
Size: 1.91 MB

Statistics

Stars: 45
Watchers: 3
Forks: 8
Open Issues: 0
Releases: 6

Created over 2 years ago · Last pushed 11 months ago

Metadata Files

Readme Contributing License

Back to sequences


Given a set $K$ of kmers (fasta / fastq [.gz] format) and a set of sequences (fasta / fastq [.gz] format), this tool will extract the sequences containing some of those kmers.

A minimal ($m$) and a maximal ($M$) threshold can be set. A sequence whose percentage of kmers shared with $K$ lies in $]m, M]$ (strictly greater than $m$ and at most $M$) is output with its original header, followed by the number of shared kmers and the ratio of shared kmers:

```
original_header 20 6.13
TGGATAAAAAGGCTGACGAAAGGTCTAGCTAAAATTGTCAGGTGCTCTCAGATAAAGCAGTAAGCGAGTTGGTGTTCGCTGAGCGTCGACTAGGCAACGTTAAAGCTATTTTAGGC...
```

In this case, 20 kmers are shared with the indexed kmers. This represents 6.13% of the kmers in the sequence.
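The filtering rule described above can be illustrated with a short Python sketch. This is not the tool's Rust implementation: the function names are hypothetical, kmers are compared as exact forward-strand strings (no reverse-complement handling), and the ratio is assumed to be the number of kmer positions found in $K$ divided by the total number of kmer positions in the sequence.

```python
def shared_kmer_stats(sequence, indexed_kmers, k):
    """Return (shared_count, shared_percent) for one sequence.

    Assumption: every kmer position of the sequence is counted once,
    and a position is "shared" when its kmer belongs to indexed_kmers.
    """
    total = len(sequence) - k + 1
    if total <= 0:
        # Sequence shorter than k: it contains no kmer at all.
        return 0, 0.0
    shared = sum(1 for i in range(total) if sequence[i:i + k] in indexed_kmers)
    return shared, 100.0 * shared / total


def keep_sequence(shared_percent, m, M):
    """True when the percentage lies in ]m, M]: m excluded, M included."""
    return m < shared_percent <= M


# Toy data (hypothetical): two indexed 3-mers and one 6-base read.
indexed = {"ACG", "CGT"}
count, percent = shared_kmer_stats("ACGTAC", indexed, k=3)

# Header layout mirrors the example above: original header, count, ratio.
header = f"read1 {count} {percent:.2f}"
```

With $m = 0$, a sequence sharing no kmer with $K$ has a percentage of exactly 0 and is excluded, because the lower bound of the interval is open.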

Install

Please see https://b2s-doc.readthedocs.io/en/latest/usage.html#installation

Simplest usage

```bash
back_to_sequences --in-kmers kmers.fasta --in-sequences reads.fasta --out-sequences filtered_reads.fasta --out-kmers counted_kmers.txt
```

The filtered_reads.fasta file contains the original sequences (here reads) from reads.fasta that contain at least one of the kmers from kmers.fasta. The header of each read is the same as in reads.fasta, plus the estimated ratio of shared kmers and the number of shared kmers.

As the --out-kmers option is used, the file counted_kmers.txt contains for each kmer in kmers.fasta the number of times it was found in filtered_reads.fasta.
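The per-kmer counting can be pictured with the following Python sketch. It only illustrates the idea and makes no claim about the tool's internals or about the exact layout of counted_kmers.txt: the function name is hypothetical, and kmers are matched as exact forward-strand strings.

```python
from collections import Counter


def count_indexed_kmers(reads, indexed_kmers, k):
    """Count how many times each indexed kmer occurs across the reads."""
    # Start every indexed kmer at zero so that absent kmers are still reported.
    counts = Counter({kmer: 0 for kmer in indexed_kmers})
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


# Toy data (hypothetical): three indexed 3-mers and two short reads.
counts = count_indexed_kmers(["ACGTACG", "TTTT"], {"ACG", "TTT", "GGG"}, k=3)
```

Here `ACG` and `TTT` are each seen twice, while `GGG` keeps a count of zero.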

Result example

Example results obtained on:

* the GenOuest platform, on a node with 32 threads, Xeon 2.2 GHz, denoted by "genouest" in the table below.
* a MacBook, Apple M2 pro, 16 GB RAM, with 10 threads, denoted by "mac" in the table below.
* an AMD Ryzen 7 4.2 GHz 5800X, 64 GB RAM, with 16 threads, denoted by "AMD" in the table below.

Indexed: one million kmers, each of length 31. Queried: from 10,000 reads to 200 million reads, each of length 100.

| Number of reads | Time genouest | Time mac | Time AMD | max RAM |
|:---------------:|:-------------:|:--------:|:--------:|:-------:|
| 10,000 | 0.7s | 0.54s | 0.4s | 0.13 GB |
| 100,000 | 0.8s | 0.8s | 1.2s | 0.13 GB |
| 1,000,000 | 2.0s | 3.5s | 7.1s | 0.13 GB |
| 10,000,000 | 7.1s | 11s | 16s | 0.13 GB |
| 100,000,000 | 47s | 58s | 48s | 0.13 GB |
| 200,000,000 | 1m32s | 1m52s | 1m44s | 0.13 GB |

See this page for details

Basic usage and parameters

Please refer to the specific documentation for:

* basic usage
* a complete description of parameters

Contributions

Please check out How to contribute

Citations

Baire et al., (2024). Back to sequences: Find the origin of k-mers. Journal of Open Source Software, 9(101), 7066, https://doi.org/10.21105/joss.07066

bibtex:

```bib
@article{Baire2024,
  author = {Anthony Baire and Pierre Marijon and Francesco Andreace and Pierre Peterlongo},
  title = {Back to sequences: Find the origin of k-mers},
  journal = {Journal of Open Source Software},
  doi = {10.21105/joss.07066},
  url = {https://doi.org/10.21105/joss.07066},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {101},
  pages = {7066}
}
```

Documentation

Full documentation is available at https://b2s-doc.readthedocs.io/en/latest/

Owner

Name: Pierre Peterlongo
Login: pierrepeterlongo
Kind: user
Location: Rennes, France
Company: @Inria

Website: http://people.rennes.inria.fr/Pierre.Peterlongo/
Twitter: p_peterlongo
Repositories: 1
Profile: https://github.com/pierrepeterlongo

I’m designing models and algorithms for sequencing (NGS & TGS) data.

JOSS Publication

Back to sequences: Find the origin of k-mers

Published

September 23, 2024

DOI

10.21105/joss.07066

Volume 9, Issue 101, Page 7066

Authors

Anthony Baire
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France

Pierre Marijon

Laboratoire de Biologie Médicale Multisites SeqOIA, Paris, France

Francesco Andreace

Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France

Pierre Peterlongo

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France

Editor

Mark A. Jensen

GitHub Events

Total

Release event: 3
Watch event: 2
Delete event: 1
Push event: 45
Pull request event: 6
Create event: 6

Last Year

Release event: 3
Watch event: 2
Delete event: 1
Push event: 45
Pull request event: 6
Create event: 6

Committers

Last synced: 10 months ago

All Time

Total Commits: 269
Total Committers: 5
Avg Commits per committer: 53.8
Development Distribution Score (DDS): 0.156

Past Year

Commits: 102
Committers: 3
Avg Commits per committer: 34.0
Development Distribution Score (DDS): 0.167

Top Committers

Name	Email	Commits
Pierre Peterlongo	p**o@i**r	227
Pierre Marijon	p**t@a**r	19
Bryce Mecum	p**h@g**m	12
Anthony Baire	A**e@i**r	10
Francesco Andreace	f**e@g**m	1
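
The all-time Development Distribution Score reported above is consistent with the commit counts in this table, assuming the usual definition of DDS as one minus the top committer's share of all commits. That definition is an assumption here, since this page does not state it. A minimal Python check:

```python
# Commit counts copied from the Top Committers table.
commits = {
    "Pierre Peterlongo": 227,
    "Pierre Marijon": 19,
    "Bryce Mecum": 12,
    "Anthony Baire": 10,
    "Francesco Andreace": 1,
}

total = sum(commits.values())

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total
```

The total comes to 269 commits, and `round(dds, 3)` gives 0.156, matching the values listed under All Time.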

Committer Domains (Top 20 + Academic)

irisa.fr: 1 aphp.fr: 1 inria.fr: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 5
Total pull requests: 10
Average time to close issues: 12 days
Average time to close pull requests: about 18 hours
Total issue authors: 5
Total pull request authors: 4
Average comments per issue: 2.6
Average comments per pull request: 0.2
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 6
Average time to close issues: N/A
Average time to close pull requests: about 1 hour
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

amoeba (1)
husamia (1)
natir (1)
cmonat (1)
Anjan-Purkayastha (1)
