DFplyr

A `DataFrame` (`S4Vectors`) backend for `dplyr`

https://github.com/jonocarroll/dfplyr

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: jonocarroll
  • License: gpl-3.0
  • Language: R
  • Default Branch: master
  • Size: 146 KB
Statistics
  • Stars: 21
  • Watchers: 5
  • Forks: 1
  • Open Issues: 8
  • Releases: 0
Created over 6 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.path = "man/figures/README-",
    out.width = "100%"
)
```
# DFplyr




The goal of DFplyr is to enable `dplyr` and `ggplot2` support for
`S4Vectors::DataFrame` by providing the appropriate extension methods. As row
names are an important feature of many Bioconductor structures, these are
preserved where possible.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("jonocarroll/DFplyr")
```

Alternatively, install from [Bioconductor](https://bioconductor.org). Note that this snippet first switches the installation to the Bioconductor devel branch:

``` r
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")

BiocManager::install("DFplyr")
```

## Examples

First create an S4Vectors `DataFrame`, including S4 columns if desired:

```{r}
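library(S4Vectors)

# Start from a plain data.frame and coerce it to an S4Vectors DataFrame
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")

# Add S4 columns: two GRanges columns and one NumericList column
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
# One numeric vector per row, holding `gear` random draws
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d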
```

With `DFplyr` loaded, this object is labelled in RStudio's environment pane as

```
Formal class DataFrame (dplyr-compatible)
```

The object itself is not modified; the label only indicates that
`dplyr` compatibility is available.

A `DataFrame` can then be used in `dplyr` calls in the same way as a
`data.frame` or `tibble`. S4 columns are supported, provided the functions
applied to them have methods for the column's class. When several columns are
added in one call, the new columns are created in alphabetical order:

```{r}
library(DFplyr)

mutate(d, newvar = cyl + hp)

mutate(d, nl2 = nl * 2)

mutate(d, length_nl = lengths(nl))

mutate(d,
    chr = GenomeInfoDb::seqnames(grX),
    strand_X = BiocGenerics::strand(grX),
    end_X = BiocGenerics::end(grX)
)
```

The returned object remains a standard `DataFrame`, and further calls can be
piped with `%>%`:


```{r}
mutate(d, newvar = cyl + hp) %>%
    pull(newvar)
```

Some of the scoped variants of the `dplyr` verbs also work:

```{r}
mutate_if(d, is.numeric, ~ .^2)

mutate_if(d, ~ inherits(., "GRanges"), BiocGenerics::start)
```

Use of `tidyselect` helpers is limited to `dplyr::vars()` calls within the
`_at` variants:

```{r}
mutate_at(d, vars(starts_with("c")), ~ .^2)

select_at(d, vars(starts_with("gr")))
```

Importantly, grouped operations are supported. `DataFrame` does not natively
support groups (just as `data.frame` does not), so grouping is implemented
specifically in `DFplyr`:

```{r}
group_by(d, cyl, am)
```

Other verbs are implemented similarly and preserve row names where possible:

```{r}
select(d, am, cyl)

arrange(d, desc(hp))

filter(d, am == 0)

slice(d, 3:6)

group_by(d, gear) %>%
    slice(1:2)
```
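As a quick check of the row-name behaviour described above, the following sketch filters a small `DataFrame` and compares the surviving row names with the originals. It assumes `DFplyr` and `S4Vectors` are installed; the object names `small` and `kept` are made up for this illustration.

```r
library(S4Vectors)
library(DFplyr)

# Hypothetical example object: a DataFrame that keeps the mtcars row names
small <- as(mtcars[, c("cyl", "hp", "am")], "DataFrame")

# Keep only the rows with am == 0
kept <- filter(small, am == 0)

# If row names are preserved, every label in `kept` also occurs in `small`
head(rownames(kept))
all(rownames(kept) %in% rownames(small))
```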

`rename` is provided as `rename2` because `dplyr` and `S4Vectors` both define
a `rename` function. It follows the `dplyr` convention of taking unquoted
`new = old` pairs (non-standard evaluation):

```{r}
select(d, am, cyl) %>%
    rename2(foo = am)
```
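To make the naming conflict concrete, the sketch below renames the same column with both conventions. It assumes that `S4Vectors::rename()` takes `old = "new"` pairs with the new name as a string, which is how its documentation describes the mapping; the object names `d2`, `by_s4` and `by_dfplyr` are made up for this illustration.

```r
library(S4Vectors)
library(DFplyr)

# Hypothetical two-column example object
d2 <- as(mtcars[, c("cyl", "am")], "DataFrame")

# S4Vectors convention: old name on the left, new name as a string
by_s4 <- S4Vectors::rename(d2, am = "foo")

# DFplyr convention (dplyr-style): new name on the left, old name unquoted
by_dfplyr <- rename2(d2, foo = am)

names(by_s4)
names(by_dfplyr)
```

Calling `S4Vectors::rename()` with an explicit namespace avoids depending on which package was attached last.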

Row names are dropped when they could be duplicated or would no longer be
meaningful. Otherwise the first label is kept, as determined by the
de-duplication method in use; for `distinct`, this is
`BiocGenerics::duplicated`. This may cause complications for S4 columns.

```{r}
distinct(d)

group_by(d, cyl, am) %>%
    tally(gear)

count(d, gear, am, cyl)
```

## Coverage

Most `dplyr` functions are implemented, with the exception of the `join`s.

If you find any others that are missing, please [file an issue](https://github.com/jonocarroll/DFplyr/issues/new).
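
Since joins are not covered, one possible stand-in is the `merge()` method that `S4Vectors` provides for `DataFrame`, which accepts the same arguments as base `merge()`. This is a sketch of a workaround, not part of `DFplyr`; the objects `cars_df` and `labels_df` are made up, and only atomic columns are used because S4 columns may not survive the merge.

```r
library(S4Vectors)

# Hypothetical example tables sharing a `cyl` key
cars_df <- DataFrame(cyl = c(4, 6, 8), hp = c(93, 110, 175))
labels_df <- DataFrame(cyl = c(4, 6), size = c("small", "medium"))

# Inner join: only keys present in both tables
inner <- merge(cars_df, labels_df, by = "cyl")

# Left join: keep every row of cars_df, filling unmatched values with NA
left <- merge(cars_df, labels_df, by = "cyl", all.x = TRUE)

inner
left
```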

Owner

  • Name: Jonathan Carroll
  • Login: jonocarroll
  • Kind: user
  • Location: Adelaide, South Australia
  • Company: @IrregularlyScheduledProgramming

Recovering theoretical physicist / ongoing coffee addict / continually improving data scientist. I'm interested in open-source data projects, mainly in R.

GitHub Events

Total
  • Watch event: 1
  • Issue comment event: 16
  • Push event: 6
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Issue comment event: 16
  • Push event: 6
  • Pull request event: 1
  • Fork event: 1

Issues and Pull Requests

All Time
  • Total issues: 12
  • Total pull requests: 7
  • Average time to close issues: 7 months
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 3.33
  • Average comments per pull request: 2.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 3
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.33
  • Average comments per pull request: 4.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jonocarroll (8)
  • sa-lee (2)
  • hpages (1)
  • DarwinAwardWinner (1)
Pull Request Authors
  • jonocarroll (5)
  • ppaxisa (2)

Packages

  • Total packages: 1
  • Total downloads:
    • bioconductor 2,028 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
bioconductor.org: DFplyr

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 2,028 Total
Rankings
Dependent repos count: 0.0%
Dependent packages count: 31.5%
Average: 42.4%
Downloads: 95.6%
Maintainers (1)