Keywords
combinators
higher-order-functions
parser
parsing
r-package
Last synced: 6 months ago
Repository
Construct parser combinators in R
Basic Info
Statistics
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 15
Created about 2 years ago
· Last pushed 7 months ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(parcr)
```
[CRAN status](https://cran.r-project.org/package=parcr)
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[R-CMD-check](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml)
## Construct parser combinator functions for parsing character vectors
This R package contains tools to construct parser combinator functions, which are
higher-order functions that parse input. The main goal of this package is to simplify
the creation of *transparent* parsers for structured text files generated by
machines like laboratory instruments. Such files consist of lines of text
organized in higher-order structures like headers with metadata and blocks of
measured values. To read these data into R you first need to create a parser
that processes these files and creates R-objects as output. The `parcr` package
simplifies the task of creating such parsers.
This package was inspired by the package
["Ramble"](https://github.com/NoRaincheck/Ramble) by Chapman Siu and co-workers
and by the paper
["Higher-order functions for parsing"](https://doi.org/10.1017/S0956796800000411)
by [Graham Hutton](https://orcid.org/0000-0001-9584-5150) (1992).
## Installation
Install the stable version from CRAN:
```
install.packages("parcr")
```
To install the development version including its vignette, run the following
command (this requires the `remotes` package):
```
remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)
```
## Example application: a parser for *fasta* sequence files
As an example of a realistic application we write a parser for
fasta-formatted files for nucleotide and protein sequences. We use a few
simplifying assumptions about this format for the sake of the example. Real
fasta files are more complex than we pretend here.
*Please note that more background about the functions that we use here is
available in the package documentation. Here we only present a summary.*
A fasta file with mixed sequence types could look like the example below:
```{r, echo=FALSE, comment = NA}
data("fastafile")
cat(paste0(fastafile, collapse="\n"))
```
Since fasta files are text files we could read such a file using `readLines()`
into a character vector. The package provides the data set `fastafile` which
contains that character vector.
```{r, eval=FALSE}
data("fastafile")
```
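For your own files, the `readLines()` step mentioned above could look like the sketch below. The file content and the names `tmp` and `fasta_lines` are made up for illustration; only base R is used.

```r
# Write a tiny, made-up fasta file to a temporary location (illustrative data only)
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq_1", "GATTACA", "", ">seq_2", "MKV"), tmp)

# readLines() returns a character vector with one element per line of the file
fasta_lines <- readLines(tmp)
length(fasta_lines)
fasta_lines[1]

# Clean up the temporary file
unlink(tmp)
```

A character vector of this shape, one line per element, is the kind of input the parsers below operate on.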
We can distinguish the following higher-order components in a fasta file:
- A **fasta** file: consists of one or more **sequence blocks** until the
**end of the file**.
- A **sequence block**: consists of a **header** and a
**nucleotide sequence** or a **protein sequence**. A sequence block could be
preceded by zero or more **empty lines**.
- A **nucleotide sequence**: consists of one or more
**nucleotide sequence strings**.
- A **protein sequence**: consists of one or more
**protein sequence strings**.
- A **header** is a *string* that starts with a ">" immediately followed by
a **title** without spaces.
- A **nucleotide sequence string** is a *string* without spaces that consists
*entirely* of symbols from the set `{G,A,T,C}`.
- A **protein sequence string** is a *string* without spaces that consists
*entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.
It now becomes clear what we mean when we say that the package allows us
to write *transparent* parsers: the description above of the structure of fasta
files can be put straight into code for a `Fasta()` parser:
```{r}
Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then%
    Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste(x, collapse = ""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste(x, collapse = ""))
}
```
Functions like `one_or_more()`, `%then%`, `%or%`, `%using%`, `eof()` and
`MaybeEmpty()` are defined in the package and are the basic parsers and
combinators from which the package user can build complex parsers. The
`%using%` operator applies the function on its right-hand side to the output of
the parser on its left-hand side. Please see the vignette in the `parcr` package
for an explanation of why this is useful, or even necessary.
Notice that the new parser functions that we define above are higher-order
functions that take no arguments, hence the empty brackets `()` behind their
names. Calling one of them returns the actual parser.
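To make the idea of combining parsers more concrete, here is a minimal base-R sketch of how sequencing and output transformation could work. It is a toy and not the `parcr` implementation: the names `toy_match()`, `%toy_then%`, `%toy_using%` and `ToyBlock()` are hypothetical, and failure is simplified to returning `NULL`.

```r
# Toy parser: consume the first element of x if it matches a regex pattern.
# Success returns list(L = parsed part, R = remaining input); failure returns NULL.
toy_match <- function(pattern) {
  function(x) {
    if (length(x) > 0 && grepl(pattern, x[1])) {
      list(L = list(x[1]), R = x[-1])
    } else {
      NULL
    }
  }
}

# Toy sequencing: run p1, then run p2 on what p1 left over
`%toy_then%` <- function(p1, p2) {
  function(x) {
    r1 <- p1(x)
    if (is.null(r1)) return(NULL)
    r2 <- p2(r1$R)
    if (is.null(r2)) return(NULL)
    list(L = c(r1$L, r2$L), R = r2$R)
  }
}

# Toy output transformation: apply f to the parsed part of a successful parse
`%toy_using%` <- function(p, f) {
  function(x) {
    r <- p(x)
    if (is.null(r)) return(NULL)
    list(L = f(r$L), R = r$R)
  }
}

# A function without arguments that returns a parser, as in the text above
ToyBlock <- function() {
  toy_match("^>") %toy_then% toy_match("^[GATC]+$") %toy_using%
    function(l) list(title = sub("^>", "", l[[1]]), sequence = l[[2]])
}

result <- ToyBlock()(c(">seq_1", "GATTACA", "rest"))
str(result)
```

The constructor `ToyBlock()` is called first to obtain a parser, and that parser is then applied to the input vector, which is the same calling pattern as `Fasta()(fastafile)`.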
Now we need to define the parsers `Header()`, `NuclSequenceString()`
and `ProtSequenceString()` that actually recognize and process the header line
string and strings of nucleotide or protein sequences in the character vector
`fastafile`. We use the function constructor `stringparser()` from the package
to construct helper functions that recognize and capture the desired matches,
and we use `match_s()` to create `parcr` compliant parsers from these.
```{r}
Header <- function() {
  match_s(stringparser("^>(\\w+)")) %using%
    function(x) list(title = unlist(x))
}

NuclSequenceString <- function() {
  match_s(stringparser("^([GATC]+)$"))
}

ProtSequenceString <- function() {
  match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
}
```
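The regular expressions used above can be checked in isolation with base R. The sketch below does not show how `stringparser()` works internally; it only tests the same patterns against a few made-up strings.

```r
nucl_pattern   <- "^([GATC]+)$"
prot_pattern   <- "^([ARNDBCEQZGHILKMFPSTWYV]+)$"
header_pattern <- "^>(\\w+)"

# A nucleotide string matches the nucleotide pattern; a protein string does not
nucl_on_nucl <- grepl(nucl_pattern, "GATTACA")
nucl_on_prot <- grepl(nucl_pattern, "MKVLA")

# G, A, T and C are also valid protein symbols, so the protein pattern
# accepts nucleotide strings as well
prot_on_prot <- grepl(prot_pattern, "MKVLA")
prot_on_nucl <- grepl(prot_pattern, "GATTACA")

# Extract the captured title from a made-up header line
header_line <- ">sequence_A"
title <- regmatches(header_line, regexec(header_pattern, header_line))[[1]][2]

c(nucl_on_nucl, nucl_on_prot, prot_on_prot, prot_on_nucl)
title
```

Because every nucleotide string also satisfies the protein pattern, the two patterns are not mutually exclusive, so it seems sensible that `SequenceBlock()` tries `NuclSequence()` before `ProtSequence()`.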
Now we have all the elements that we need to apply the `Fasta()` parser.
```{r}
Fasta()(fastafile)
```
The output of the parser consists of two elements, `L` and `R`, where `L`
contains the parsed and processed part of the input and `R` the remaining
un-parsed part of the input. Since we explicitly demanded to parse until the
end of the file by the `eof()` function in the definition of the `Fasta()`
parser, the `R` element contains an empty list to signal that the parser was
indeed at the end of the input. Please see the package documentation for more
examples and explanation.
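The role of the `L` and `R` elements and of an end-of-input check can be illustrated with a small base-R toy. This is a sketch under simplifying assumptions and not the `parcr` implementation: `toy_line()`, `toy_one_or_more()` and `toy_eof()` are hypothetical names, and failure is represented by `NULL`.

```r
# Toy parser: consume the first element of x if it matches a regex pattern
toy_line <- function(pattern) {
  function(x) {
    if (length(x) > 0 && grepl(pattern, x[1])) {
      list(L = list(x[1]), R = x[-1])
    } else {
      NULL
    }
  }
}

# Toy repetition: apply p as often as it succeeds, requiring at least one success
toy_one_or_more <- function(p) {
  function(x) {
    out <- list()
    rest <- x
    repeat {
      r <- p(rest)
      if (is.null(r)) break
      out <- c(out, r$L)
      rest <- r$R
    }
    if (length(out) == 0) NULL else list(L = out, R = rest)
  }
}

# Toy end-of-input check: succeeds only when nothing is left to parse
toy_eof <- function() {
  function(x) {
    if (length(x) == 0) list(L = list(), R = list()) else NULL
  }
}

p <- toy_one_or_more(toy_line("^[GATC]+$"))

# Input with an unparsed remainder: R holds what is left, so the eof check fails
partial <- p(c("GATTACA", "CCGG", ">next_header"))
partial$R
eof_on_partial <- toy_eof()(partial$R)

# Fully consumed input: the eof check succeeds and returns an empty R
full <- p(c("GATTACA", "CCGG"))
eof_on_full <- toy_eof()(full$R)
```

Without the end-of-input check, a parse that stops halfway still counts as a success and simply reports the leftover lines in `R`; adding the check turns leftover input into a failure.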
Finally, let's present the result of the parse more concisely using the names
of the elements inside the `L` element:
```{r}
d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
```
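If a tabular result is more convenient, a list of this shape can be turned into a data frame with base R. The sketch below uses a hand-built list `d_example` with made-up titles and sequences, assumed to have the same element names (`title`, `type`, `sequence`) as the parser output above.

```r
# Hand-built stand-in for the parsed result (illustrative values only)
d_example <- list(
  list(title = "seq_1", type = "Nucl", sequence = "GATTACA"),
  list(title = "seq_2", type = "Prot", sequence = "MKVLA")
)

# Convert each record to a one-row data frame and stack the rows
fasta_df <- do.call(
  rbind,
  lapply(d_example, as.data.frame, stringsAsFactors = FALSE)
)
fasta_df
```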
Owner
- Name: Systems Biology Lab, Vrije Universiteit Amsterdam
- Login: SystemsBioinformatics
- Kind: organization
- Location: Amsterdam, The Netherlands
- Website: http://teusinkbruggemanlab.nl/
- Repositories: 6
- Profile: https://github.com/SystemsBioinformatics
This is the code repository of the Systems Biology Lab. Our lab studies the molecular networks inside cells that give rise to cell behaviour and fitness.
GitHub Events
Total
- Create event: 2
- Release event: 1
- Issues event: 1
- Watch event: 1
- Push event: 5
Last Year
- Create event: 2
- Release event: 1
- Issues event: 1
- Watch event: 1
- Push event: 5
Packages
- Total packages: 1
- Total downloads: cran 247 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: parcr
Construct Parsers for Structured Text Files
- Homepage: https://github.com/SystemsBioinformatics/parcr
- Documentation: http://cran.r-project.org/web/packages/parcr/parcr.pdf
- License: MIT + file LICENSE
- Latest release: 0.5.3, published 7 months ago
Rankings
Dependent packages count: 28.4%
Average: 32.4%
Dependent repos count: 36.4%
Maintainers (1)