parcr

Construct parser combinators in R

https://github.com/systemsbioinformatics/parcr

Keywords

combinators higher-order-functions parser parsing r-package

Repository

Basic Info
  • Host: GitHub
  • Owner: SystemsBioinformatics
  • License: other
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 626 KB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 15
Created about 2 years ago · Last pushed 7 months ago

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(parcr)
```


[![CRAN status](https://www.r-pkg.org/badges/version/parcr)](https://cran.r-project.org/package=parcr)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml)


## Construct parser combinator functions for parsing character vectors

This R package contains tools to construct parser combinator functions:
higher-order functions that parse input. The main goal of this package is to simplify
the creation of *transparent* parsers for structured text files generated by 
machines like laboratory instruments. Such files consist of lines of text 
organized in higher-order structures like headers with metadata and blocks of 
measured values. To read these data into R you first need to create a parser
that processes these files and creates R-objects as output. The `parcr` package
simplifies the task of creating such parsers.

This package was inspired by the package 
["Ramble"](https://github.com/NoRaincheck/Ramble) by Chapman Siu and co-workers 
and by the paper
["Higher-order functions for parsing"](https://doi.org/10.1017/S0956796800000411) 
by [Graham Hutton](https://orcid.org/0000-0001-9584-5150) (1992).

## Installation

Install the stable version from CRAN:

```
install.packages("parcr")
```

To install the development version, including its vignette, run the following
command (this requires the `remotes` package):

```
remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)
```

## Example application: a parser for *fasta* sequence files

As an example of a realistic application we write a parser for 
fasta-formatted files for nucleotide and protein sequences. We use a few 
simplifying assumptions about this format for the sake of the example. Real 
fasta files are more complex than we pretend here.

*Please note that more background about the functions that we use here is 
available in the package documentation. Here we only present a summary.*

A fasta file with mixed sequence types could look like the example below:

```{r, echo=FALSE, comment = NA}
data("fastafile")
cat(paste0(fastafile, collapse="\n"))
```

Since fasta files are text files we could read such a file using `readLines()`
into a character vector. The package provides the data set `fastafile` which 
contains that character vector.

```{r, eval=FALSE}
data("fastafile")
```
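For a file on disk, the same kind of character vector can be obtained with `readLines()`. The sketch below first writes a small, made-up fasta snippet to a temporary file so that it is self-contained; the file contents and the variable names are illustrative assumptions and are not the package's `fastafile` data.

```r
# Hypothetical miniature fasta content, written to a temporary file
# so that the example does not depend on any external data.
path <- tempfile(fileext = ".fasta")
writeLines(c(">seq_1", "GATTACA", "", ">seq_2", "MKVLA"), path)

# readLines() returns a character vector with one element per line,
# which is the kind of input a line-based parser works on.
fasta_lines <- readLines(path)
fasta_lines
```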

We can distinguish the following higher-order components in a fasta file:
 
- A **fasta** file: consists of one or more **sequence blocks** until the 
  **end of the file**.
- A **sequence block**: consists of a **header** and a 
  **nucleotide sequence** or a **protein sequence**. A sequence block could be
  preceded by zero or more **empty lines**.
- A **nucleotide sequence**: consists of one or more 
  **nucleotide sequence strings**.
- A **protein sequence**: consists of one or more 
  **protein sequence strings**.
- A **header** is a *string* that starts with a ">" immediately followed by
  a **title** without spaces.
- A **nucleotide sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{G,A,T,C}`.
- A **protein sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.

It now becomes clear what we mean when we say that the package allows us
to write *transparent* parsers: the description above of the structure of fasta
files can be put straight into code for a `Fasta()` parser:

```{r}
Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then% 
    Header() %then% 
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using% 
    function(x) list(type = "Nucl", sequence = paste(x, collapse=""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using% 
    function(x) list(type = "Prot", sequence = paste(x, collapse=""))
}
```

Functions like `one_or_more()`, `%then%`, `%or%`, `%using%`, `eof()` and
`MaybeEmpty()` are defined in the package and are the basic building blocks with
which the package user can build complex parsers. The `%using%` operator applies
the function on its right-hand side to the output of the parser on its left-hand
side. Please see the vignette in the `parcr` package for an explanation of why
this is useful, or even necessary.
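To make the role of an operator like `%using%` more concrete, here is a minimal base-R sketch of how such an operator could work. It is not the `parcr` implementation: the names `first_line()` and `%apply_to%` are hypothetical, and failure is simply modelled as an empty list.

```r
# Toy parser (hypothetical): consumes the first element of a character vector.
# Success is list(L = parsed part, R = remaining input); failure is list().
first_line <- function() {
  function(x) {
    if (length(x) == 0) return(list())
    list(L = list(x[1]), R = x[-1])
  }
}

# Hypothetical stand-in for %using%: apply f to the L part of a successful
# parse and pass a failure through unchanged.
`%apply_to%` <- function(p, f) {
  function(x) {
    r <- p(x)
    if (length(r) == 0) return(list())
    list(L = f(r$L), R = r$R)
  }
}

upper_first <- first_line() %apply_to% function(l) list(toupper(unlist(l)))
res <- upper_first(c("gattaca", "next line"))
res$L[[1]]  # "GATTACA"
res$R       # "next line"
```

The transformation only touches the parsed part, so the remaining input is handed on untouched to whatever parser comes next.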

Notice that the new parser functions that we define above are higher-order 
functions that take no arguments, hence the empty parentheses `()` after their
names.

Now we need to define the parsers `Header()`, `NuclSequenceString()`
and `ProtSequenceString()` that actually recognize and process the header line 
string and strings of nucleotide or protein sequences in the character vector 
`fastafile`. We use the function constructor `stringparser()` from the package 
to construct helper functions that recognize and capture the desired matches, 
and we use `match_s()` to create `parcr`-compliant parsers from these.

```{r}
Header <- function() {
  match_s(stringparser("^>(\\w+)")) %using% 
    function(x) list(title = unlist(x))
}

NuclSequenceString <- function() {
  match_s(stringparser("^([GATC]+)$"))
}

ProtSequenceString <- function() {
  match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
}
```
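The regular expressions used above can be checked in isolation with base R. The helper `capture_first()` below is a hypothetical stand-in and not the package's `stringparser()`; it is assumed here to return the first captured group, or an empty list when the line does not match.

```r
# Hypothetical helper: build a function that returns the first capture group
# of `pattern` for a single line, or list() if the line does not match.
capture_first <- function(pattern) {
  function(line) {
    m <- regmatches(line, regexec(pattern, line))[[1]]
    if (length(m) == 0) list() else list(m[2])
  }
}

header <- capture_first("^>(\\w+)")
nucl   <- capture_first("^([GATC]+)$")
prot   <- capture_first("^([ARNDBCEQZGHILKMFPSTWYV]+)$")

header(">sequence_A")  # captures "sequence_A"
nucl("GATTACA")        # captures "GATTACA"
nucl("MKVLA")          # no match: empty list
prot("GATTACA")        # also matches: G, A, T and C are in the protein set
```

Because every nucleotide string also satisfies the protein pattern, the order of the alternatives matters: in `SequenceBlock()` the nucleotide parser is tried before the protein parser.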

Now we have all the elements that we need to apply the `Fasta()` parser.

```{r}
Fasta()(fastafile)
```

The output of the parser consists of two elements, `L` and `R`, where `L` 
contains the parsed and processed part of the input and `R` the remaining 
unparsed part of the input. Because we explicitly required parsing until the 
end of the file with the `eof()` function in the definition of the `Fasta()` 
parser, the `R` element contains an empty list to signal that the parser was
indeed at the end of the input. Please see the package documentation for more
examples and explanation.
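The interplay between the `L` and `R` elements and an end-of-input check can be illustrated with a small base-R sketch. The functions `lit()`, `at_end()` and `seq2()` are hypothetical toy versions written for this illustration, not `parcr` functions, and failure is again modelled as an empty list.

```r
# Toy parser: succeeds when the first line equals `s`.
lit <- function(s) function(x) {
  if (length(x) > 0 && x[1] == s) list(L = list(s), R = x[-1]) else list()
}

# Toy end-of-input check: succeeds only when no input is left.
at_end <- function() function(x) {
  if (length(x) == 0) list(L = list(), R = list()) else list()
}

# Toy sequencing: run p1, then run p2 on whatever p1 left over.
seq2 <- function(p1, p2) function(x) {
  r1 <- p1(x)
  if (length(r1) == 0) return(list())
  r2 <- p2(r1$R)
  if (length(r2) == 0) return(list())
  list(L = c(r1$L, r2$L), R = r2$R)
}

input <- c("a", "b")
partial <- lit("a")(input)
partial$R                                  # "b" has not been parsed yet

too_early <- seq2(lit("a"), at_end())(input)
length(too_early)                          # 0: fails, input is not exhausted

complete <- seq2(seq2(lit("a"), lit("b")), at_end())(input)
complete$L                                 # list("a", "b")
complete$R                                 # empty list: all input was consumed
```

Requiring the end of the input in this way turns a silent partial parse into an explicit failure, which is usually what one wants when reading a whole file.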

Finally, let's present the result of the parse more concisely using the names 
of the elements inside the `L` element:

```{r}
d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
```

Owner

  • Name: Systems Biology Lab, Vrije Universiteit Amsterdam
  • Login: SystemsBioinformatics
  • Kind: organization
  • Location: Amsterdam, The Netherlands

This is the code repository of the Systems Biology Lab. Our lab studies the molecular networks inside cells that give rise to cell behaviour and fitness.

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Push event: 5
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Push event: 5

Packages

  • Total packages: 1
  • Total downloads:
    • cran 247 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: parcr

Construct Parsers for Structured Text Files

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 247 Last month
Rankings
Dependent packages count: 28.4%
Average: 32.4%
Dependent repos count: 36.4%