Back to sequences

Back to sequences: Find the origin of k-mers - Published in JOSS (2024)

https://github.com/pierrepeterlongo/back_to_sequences

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: pierrepeterlongo
  • License: agpl-3.0
  • Language: Rust
  • Default Branch: main
  • Size: 1.91 MB
Statistics
  • Stars: 45
  • Watchers: 3
  • Forks: 8
  • Open Issues: 0
  • Releases: 6
Created over 2 years ago · Last pushed 7 months ago
Metadata Files
Readme Contributing License

README.md

Back to sequences

Given a set $K$ of kmers (fasta / fastq [.gz] format) and a set of sequences (fasta / fastq [.gz] format), this tool will extract the sequences containing some of those kmers.

Minimal ($m$) and maximal ($M$) thresholds can be specified. A sequence whose percentage of kmers shared with $K$ is in $]m, M]$ is output with its original header, followed by the number of shared kmers and the ratio of shared kmers:

```
original_header 20 6.13
TGGATAAAAAGGCTGACGAAAGGTCTAGCTAAAATTGTCAGGTGCTCTCAGATAAAGCAGTAAGCGAGTTGGTGTTCGCTGAGCGTCGACTAGGCAACGTTAAAGCTATTTTAGGC...
```

In this case, 20 kmers are shared with the indexed kmers. This represents 6.13% of the kmers in the sequence.
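The selection rule above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names are hypothetical, only the forward strand is considered, every k-mer position in the sequence is counted, and the thresholds used in the example are illustrative values rather than the tool's defaults.

```python
def shared_kmer_stats(sequence, kmers, k):
    """Return (number of shared k-mers, percentage of shared k-mers).

    Assumption: each position of the sequence contributes one k-mer,
    and k-mers are compared as-is (no reverse complement handling).
    """
    total = len(sequence) - k + 1
    if total <= 0:
        return 0, 0.0
    shared = sum(1 for i in range(total) if sequence[i:i + k] in kmers)
    return shared, 100.0 * shared / total


def is_reported(percentage, m, M):
    """True when the percentage lies in the interval ]m, M]."""
    return m < percentage <= M


kmers = {"ACG", "CGT"}       # toy indexed set K, with k = 3
sequence = "ACGTTTACGA"      # contains 8 k-mers of length 3
shared, pct = shared_kmer_stats(sequence, kmers, k=3)
if is_reported(pct, m=0.0, M=100.0):
    print(f"toy_read {shared} {pct:.2f}")  # prints "toy_read 3 37.50"
```

Because the interval is open on the left, a sequence sharing exactly $m$ percent of its kmers is not reported, while one sharing exactly $M$ percent is.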

Install

Please see https://b2s-doc.readthedocs.io/en/latest/usage.html#installation

Simplest usage

    back_to_sequences --in-kmers kmers.fasta --in-sequences reads.fasta --out-sequences filtered_reads.fasta --out-kmers counted_kmers.txt

The filtered_reads.fasta file contains the original sequences (here reads) from reads.fasta that contain at least one of the kmers from kmers.fasta. The header of each read is the same as in reads.fasta, plus the estimated ratio of shared kmers and the number of shared kmers.
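For downstream processing, the two values appended to each header can be read back with a few lines of Python. This is a sketch under assumptions: it takes the two fields to be whitespace-separated at the end of the header, in the order of the `original_header 20 6.13` example (number of shared kmers, then ratio). The function name `parse_annotated_header` is hypothetical.

```python
def parse_annotated_header(header):
    """Split an annotated header into (original header, count, ratio).

    Assumption: the last two whitespace-separated fields are the number
    of shared k-mers and the ratio, as in "original_header 20 6.13".
    """
    name, count, ratio = header.rsplit(maxsplit=2)
    return name, int(count), float(ratio)


name, count, ratio = parse_annotated_header("original_header 20 6.13")
print(name, count, ratio)  # prints "original_header 20 6.13"
```

Splitting from the right keeps any spaces in the original header intact.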

As the --out-kmers option is used, the file counted_kmers.txt contains for each kmer in kmers.fasta the number of times it was found in filtered_reads.fasta.
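The per-kmer counting described above can be sketched as follows. This is an illustration, not the tool's code: `count_indexed_kmers` is a hypothetical name, the reads are assumed to be the already filtered sequences, only the forward strand is considered, and the layout of counted_kmers.txt is deliberately not reproduced.

```python
from collections import Counter


def count_indexed_kmers(reads, kmers, k):
    """Count how many times each indexed k-mer occurs in the reads.

    Every indexed k-mer appears in the result, with count 0 when it
    was never seen.
    """
    counts = Counter({kmer: 0 for kmer in kmers})
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


counts = count_indexed_kmers(["ACGTACG", "TTTT"], {"ACG", "GGG"}, k=3)
print(counts["ACG"], counts["GGG"])  # prints "2 0"
```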

Result example

Example results obtained on:

* the GenOuest platform, on a node with 32 threads, Xeon 2.2 GHz, denoted by "genouest" in the table below.
* a MacBook, Apple M2 Pro, 16 GB RAM, with 10 threads, denoted by "mac" in the table below.
* an AMD Ryzen 7 5800X, 4.2 GHz, 64 GB RAM, with 16 threads, denoted by "AMD" in the table below.

Indexed: one million kmers, each of length 31. Queried: from 10,000 to 200 million reads, each of length 100.

| Number of reads | Time genouest | Time mac | Time AMD | max RAM |
|:---------------:|:-------------:|:--------:|:--------:|:-------:|
| 10,000          | 0.7s          | 0.54s    | 0.4s     | 0.13 GB |
| 100,000         | 0.8s          | 0.8s     | 1.2s     | 0.13 GB |
| 1,000,000       | 2.0s          | 3.5s     | 7.1s     | 0.13 GB |
| 10,000,000      | 7.1s          | 11s      | 16s      | 0.13 GB |
| 100,000,000     | 47s           | 58s      | 48s      | 0.13 GB |
| 200,000,000     | 1m32s         | 1m52s    | 1m44s    | 0.13 GB |
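To read the last row of the table as a throughput, the durations can be converted to seconds and divided into the number of reads. The helper `to_seconds` below is a hypothetical convenience function written for this illustration. The figures are simple arithmetic on the table values, not additional measurements.

```python
def to_seconds(duration):
    """Convert strings such as "0.7s", "47s" or "1m32s" to seconds."""
    minutes, _, rest = duration.rpartition("m")
    rest = rest.rstrip("s")
    seconds = float(rest) if rest else 0.0
    return 60 * (int(minutes) if minutes else 0) + seconds


reads = 200_000_000
for machine, duration in [("genouest", "1m32s"), ("mac", "1m52s"), ("AMD", "1m44s")]:
    rate = reads / to_seconds(duration)
    print(f"{machine}: {rate / 1e6:.2f} million reads/s")
# prints:
# genouest: 2.17 million reads/s
# mac: 1.79 million reads/s
# AMD: 1.92 million reads/s
```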

See this page for details

Basic usage and parameters

Please refer to the specific documentation for:

* basic usage
* a complete description of parameters

Contributions

Please check out How to contribute

Citations

Baire et al., (2024). Back to sequences: Find the origin of k-mers. Journal of Open Source Software, 9(101), 7066, https://doi.org/10.21105/joss.07066

BibTeX:

    @article{Baire2024,
      author = {Anthony Baire and Pierre Marijon and Francesco Andreace and Pierre Peterlongo},
      title = {Back to sequences: Find the origin of k-mers},
      journal = {Journal of Open Source Software},
      doi = {10.21105/joss.07066},
      url = {https://doi.org/10.21105/joss.07066},
      year = {2024},
      publisher = {The Open Journal},
      volume = {9},
      number = {101},
      pages = {7066}
    }

Documentation

Full documentation is available at https://b2s-doc.readthedocs.io/en/latest/

Owner

  • Name: Pierre Peterlongo
  • Login: pierrepeterlongo
  • Kind: user
  • Location: Rennes, France
  • Company: @Inria

I’m designing models and algorithms for sequencing (NGS & TGS) data.

JOSS Publication

Back to sequences: Find the origin of k-mers
Published
September 23, 2024
Volume 9, Issue 101, Page 7066
Authors
Anthony Baire
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France
Pierre Marijon ORCID
Laboratoire de Biologie Médicale Multisites SeqOIA, Paris, France
Francesco Andreace ORCID
Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
Pierre Peterlongo ORCID
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France
Editor
Mark A. Jensen ORCID
Tags
rust kmer indexing genomic sequencing data

GitHub Events

Total
  • Release event: 3
  • Watch event: 2
  • Delete event: 1
  • Push event: 45
  • Pull request event: 6
  • Create event: 6
Last Year
  • Release event: 3
  • Watch event: 2
  • Delete event: 1
  • Push event: 45
  • Pull request event: 6
  • Create event: 6

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 269
  • Total Committers: 5
  • Avg Commits per committer: 53.8
  • Development Distribution Score (DDS): 0.156
Past Year
  • Commits: 102
  • Committers: 3
  • Avg Commits per committer: 34.0
  • Development Distribution Score (DDS): 0.167
Top Committers
Name Email Commits
Pierre Peterlongo p****o@i****r 227
Pierre Marijon p****t@a****r 19
Bryce Mecum p****h@g****m 12
Anthony Baire A****e@i****r 10
Francesco Andreace f****e@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 10
  • Average time to close issues: 12 days
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 5
  • Total pull request authors: 4
  • Average comments per issue: 2.6
  • Average comments per pull request: 0.2
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 hour
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • amoeba (1)
  • husamia (1)
  • natir (1)
  • cmonat (1)
  • Anjan-Purkayastha (1)
Pull Request Authors
  • natir (6)
  • pierrepeterlongo (4)
  • amoeba (2)
  • a-ba (1)