seqR

fast and comprehensive k-mer counting package

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (18.2%) to scientific vocabulary

Keywords

bioinformatics bioinformatics-tool dna-processing feature-engineering feature-extraction genomics hashing hashing-algorithms k-mer k-mer-counting kmer kmer-counting kmer-frequency-count kmers ngram ngrams protein-sequences rcpp rcppparallel rpackage

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: slowikj
Language: C++
Default Branch: master
Homepage: https://slowikj.github.io/seqR/
Size: 1.72 MB

Statistics

Stars: 18
Watchers: 2
Forks: 1
Open Issues: 7
Releases: 0

Topics

Created over 6 years ago · Last pushed over 4 years ago

Metadata Files

Readme

README.Rmd

---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)
```

```{r, include = FALSE}
library(seqR)
```

# seqR - fast and comprehensive k-mer counting package


[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/seqR)](https://cran.r-project.org/package=seqR)
[![R build status](https://github.com/slowikj/seqR/workflows/R-CMD-check/badge.svg)](https://github.com/slowikj/seqR/actions)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![codecov.io](https://codecov.io/github/slowikj/seqR/coverage.svg?branch=master)](https://codecov.io/github/slowikj/seqR?branch=master)
[![Code Quality Status](https://www.code-inspector.com/project/23909/status/svg)](https://www.code-inspector.com/project/23909/status/svg)
[![Code Quality Score](https://www.code-inspector.com/project/23909/score/svg)](https://www.code-inspector.com/project/23909/score/svg)


## About

`seqR` is an R package for fast k-mer counting. Its implementation is

* **highly optimized** (the core algorithm is written in C++),
* **in-memory**,
* **probabilistic** (k-mers are stored internally as hash values of configurable dimensionality),
* **multi-threaded** (sequences are processed in batches of configurable size, `batch_size`; setting `batch_size` to 1 disables multi-threading, which may increase computation time),

and supports

* **various variants of k-mers** (contiguous, gapped, and their positional counterparts),
* **all biological sequences** (e.g., nucleic acids and proteins).

Moreover, results are returned as **sparse matrices**
(see the [Matrix package](https://CRAN.R-project.org/package=Matrix)),
which reduces memory consumption. These matrices are compatible with machine learning packages
such as [ranger](https://CRAN.R-project.org/package=ranger)
and [xgboost](https://CRAN.R-project.org/package=xgboost).

## How to...

### How to install

To install `seqR` from CRAN:

```{r, eval=FALSE}
install.packages("seqR")
```

Alternatively, if you want to use the latest development version:

```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("slowikj/seqR")
```

### How to use

The package provides two functions for k-mer counting:

* `count_kmers` (used for counting k-mers of one type)
* `count_multimers` (a wrapper of `count_kmers`, used for counting k-mers of many types in a single invocation of the function)

and one function used for custom processing of k-mer matrices:

* `rbind_columnwise` (a helper function used for merging several k-mer matrices that do not have the same sets of columns)

To learn more, see the [features overview vignette](https://slowikj.github.io/seqR/articles/features-overview.html)
and the [reference](https://slowikj.github.io/seqR/reference/index.html).
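
To picture what merging matrices with different column sets means, here is a minimal base R sketch. It is not the implementation of `rbind_columnwise`; `naive_rbind_columnwise` is a hypothetical helper that works on small dense matrices and assumes that k-mers absent from an input should be filled with zeros.

```r
# Hypothetical helper (NOT seqR's rbind_columnwise): row-bind two named
# count matrices, filling k-mers missing from either input with zeros.
naive_rbind_columnwise <- function(a, b) {
  # The merged matrix has one column per k-mer seen in either input.
  all_cols <- union(colnames(a), colnames(b))
  expand <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(all_cols),
                  dimnames = list(NULL, all_cols))
    out[, colnames(m)] <- m  # copy the counts that exist in this input
    out
  }
  rbind(expand(a), expand(b))
}

# Two toy 1-row count matrices that share only the "AC" column.
a <- matrix(c(2, 1), nrow = 1, dimnames = list(NULL, c("AA", "AC")))
b <- matrix(c(3, 4), nrow = 1, dimnames = list(NULL, c("AC", "GT")))
merged <- naive_rbind_columnwise(a, b)
merged
```

The merged matrix has the columns `AA`, `AC`, and `GT`, with one row per input row.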

#### Examples

##### counting 5-mers

```{r}
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            k=5)
```
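
As a point of comparison, contiguous k-mer counting can be sketched in a few lines of base R. This is only a naive reference for what is being counted, not how `seqR` computes it; `naive_count_kmers` is a hypothetical helper, and it returns a list of plain tables rather than a sparse matrix.

```r
# Hypothetical helper (NOT part of seqR): count contiguous k-mers
# in each sequence by sliding a window of width k.
naive_count_kmers <- function(sequences, k) {
  lapply(sequences, function(s) {
    n <- nchar(s)
    if (n < k) return(table(character(0)))  # sequence too short for any k-mer
    starts <- seq_len(n - k + 1)
    table(substring(s, starts, starts + k - 1))
  })
}

counts <- naive_count_kmers(c("AAAAAVVAVFF", "DFGSADFGSA"), k = 5)
counts[[1]]  # 7 distinct 5-mers, each occurring once
counts[[2]]  # "DFGSA" occurs twice
```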

##### counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)

```{r}
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            kmer_gaps=c(0, 1, 0, 2))
```
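
The gap notation can be unpacked with a short base R sketch. The helpers below (`gap_offsets`, `gap_pattern`, `naive_gapped_kmers`) are hypothetical and not part of `seqR`; they assume that the i-th gap is the number of skipped positions between the i-th and (i+1)-th element of the k-mer, which is consistent with the pattern `XX_XX__X` in the heading above.

```r
# Hypothetical helpers (NOT seqR functions) illustrating the gap notation.

# Offsets of the k-mer elements relative to the first element:
# each element sits (gap + 1) positions after the previous one.
gap_offsets <- function(kmer_gaps) {
  c(0L, cumsum(as.integer(kmer_gaps) + 1L))
}

# Render the gap vector as a pattern of used (X) and skipped (_) positions.
gap_pattern <- function(kmer_gaps) {
  offsets <- gap_offsets(kmer_gaps)
  pattern <- rep("_", max(offsets) + 1L)
  pattern[offsets + 1L] <- "X"
  paste(pattern, collapse = "")
}

# Naively extract all gapped k-mers from a single sequence.
naive_gapped_kmers <- function(s, kmer_gaps) {
  offsets <- gap_offsets(kmer_gaps)
  chars <- strsplit(s, "")[[1]]
  span <- max(offsets) + 1L
  if (length(chars) < span) return(character(0))
  starts <- seq_len(length(chars) - span + 1L)
  vapply(starts, function(i) paste(chars[i + offsets], collapse = ""),
         character(1))
}

gap_pattern(c(0, 1, 0, 2))
# "XX_XX__X"
naive_gapped_kmers("AAAAAVVAVFF", c(0, 1, 0, 2))
# "AAAAA" "AAAVV" "AAVVF" "AAVAF"
```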


##### counting 1-mers and 2-mers

```{r}
data(CsgA)

CsgA[1L:2]

count_multimers(sequences=CsgA,
                k_vector = c(1, 2))
```


### How to cite

To cite the package, type:

```{r, eval=FALSE}
citation("seqR")
```

or use:

Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.

## Benchmarks

The `seqR` package has been compared with other existing k-mer counting R packages:
[biogram](https://CRAN.R-project.org/package=biogram),
[kmer](https://CRAN.R-project.org/package=kmer),
[seqinr](https://CRAN.R-project.org/package=seqinr),
and [Biostrings](https://bioconductor.org/packages/Biostrings).

All benchmark experiments were run on an Intel Core i7-6700HQ (2.60 GHz, 8 cores) with the [microbenchmark](https://CRAN.R-project.org/package=microbenchmark) R package.

### Contiguous k-mers

#### Changing k



The input consists of one `DNA` sequence of length `3 000`.

#### Changing the number of sequences



Each `DNA` sequence has `3 000` elements; contiguous 5-mers are counted.

### Gapped k-mers

#### Changing the first contiguous part of a k-mer



The input consists of one `DNA` sequence of length `1 000 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.

#### Changing the first gap size



The input consists of one `DNA` sequence of length `100 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.

Owner

Name: Jadwiga Słowik
Login: slowikj
Kind: user
Location: Poland

Website: https://onlinejudge.org/index.php?option=onlinejudge&page=show_authorstats&userid=57739
Twitter: slowikj5
Repositories: 2
Profile: https://github.com/slowikj

passionate software developer & data scientist, problem solver, competitive programmer

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 457
Total Committers: 5
Avg Commits per committer: 91.4
Development Distribution Score (DDS): 0.162

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Jadwiga Słowik	j**k@g**m	383
Jadwiga Słowik	j**5@g**m	54
slowikj	s**j@s**l	10
Michał Burdukiewicz	m**z@g**m	7
piotr-ole	p**k@g**m	3

Committer Domains (Top 20 + Academic)

student.mini.pw.edu.pl: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 49
Total pull requests: 43
Average time to close issues: 3 months
Average time to close pull requests: 2 days
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 0.29
Average comments per pull request: 0.0
Merged pull requests: 40
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

slowikj (46)
michbur (1)

Pull Request Authors

slowikj (43)

Top Labels

Issue Labels

enhancement (23) performance (11) bug (8) refactor (5) hashing (4) wontfix (3) documentation (2) nice to have (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 2

cran.r-project.org: seqR

Fast and Comprehensive K-Mer Counting Package

Homepage: https://github.com/slowikj/seqR
Documentation: http://cran.r-project.org/web/packages/seqR/seqR.pdf
License: GPL (≥ 3)
Status: removed
Latest release: 1.0.1 (published over 4 years ago)

Versions: 2
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 0

Rankings

Stargazers count: 16.3%

Forks count: 21.9%

Dependent repos count: 25.5%

Dependent packages count: 29.8%

Average: 36.6%

Downloads: 89.7%

Last synced: 11 months ago

Dependencies

DESCRIPTION cran

R >= 3.6 depends
Matrix * imports
Rcpp >= 1.0.4 imports
RcppParallel >= 5.1.2 imports
rlang * imports
slam * imports
covr * suggests
ggplot2 * suggests
knitr * suggests
microbenchmark * suggests
pryr * suggests
rmarkdown * suggests
spelling * suggests
testthat * suggests
