vroom

Fast reading of delimited files

https://github.com/tidyverse/vroom

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails (1 of 25 committers, 4.0%, from academic institutions)
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 17.6%, to scientific vocabulary)

Keywords

csv csv-parser fixed-width-text r tsv tsv-parser

Keywords from Contributors

parsing setup tidy-data fwf geo documentation-tool excel unit-testing data-manipulation package-creation

Last synced: 10 months ago

Repository

Fast reading of delimited files

Basic Info

Host: GitHub
Owner: tidyverse
License: other
Language: C++
Default Branch: main
Homepage: https://vroom.r-lib.org
Size: 21.3 MB

Statistics

Stars: 633
Watchers: 17
Forks: 64
Open Issues: 79
Releases: 22

Topics

csv csv-parser fixed-width-text r tsv tsv-parser

Created over 7 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog License Code of conduct

README.Rmd

---
output:
  github_document:
    html_preview: false
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(tibble.print_min = 3)
```

# 🏎💨vroom 


[![R-CMD-check](https://github.com/tidyverse/vroom/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/vroom/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/vroom/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/vroom?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/vroom)](https://cran.r-project.org/package=vroom)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)


```{r echo = FALSE, message = FALSE}
tm <- vroom::vroom(system.file("bench", "taxi.tsv", package = "vroom"))
versions <- vroom::vroom(system.file("bench", "session_info.tsv", package = "vroom"))

# Use the base version number for read.delim
versions$package[versions$package == "base"] <- "read.delim"

library(dplyr)
tbl <- tm %>% filter(type == "real", op == "read", reading_package %in% c("data.table", "readr", "read.delim") | manip_package == "base") %>%
  rename(package = reading_package) %>%
  left_join(versions) %>%
  transmute(
    package = package,
    version = ondiskversion,
    "time (sec)" = time,
    speedup = max(time) / time,
    "throughput" = paste0(prettyunits::pretty_bytes(size / time), "/sec")
  ) %>%
  arrange(desc(speedup))
```

The fastest delimited reader for R, **`r filter(tbl, package == "vroom") %>% pull("throughput") %>% trimws()`**.



But that's impossible! How can it be [so fast](https://vroom.r-lib.org/articles/benchmarks.html)?

vroom doesn't stop to actually _read_ all of your data; it simply indexes where each record is located so it can be read later.
The vectors returned use the [Altrep framework](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) to lazily load the data when it is accessed, so you only pay for what you use.
This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
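
As a minimal sketch of what this looks like in practice, the snippet below writes a tiny made-up file, reads it back, and then touches a single column. The file and its contents are invented for illustration, and `altrep = TRUE` is passed explicitly to ask for Altrep vectors for every supported column type.

```r
library(vroom)

# Write a small example file to a temporary location (made-up data).
path <- tempfile(fileext = ".tsv")
vroom_write(data.frame(id = 1:5, name = letters[1:5]), path, delim = "\t")

# Reading indexes the file; with altrep = TRUE the columns are Altrep vectors
# whose values are parsed when they are first accessed.
df <- vroom(path, delim = "\t", altrep = TRUE, show_col_types = FALSE)

# Accessing a column materializes that column; ordinary R code works as usual.
df$name
```

Nothing in the calling code has to know whether a column has been materialized yet, which is why existing data-manipulation code keeps working unchanged.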

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, which further improves performance.

```{r, echo = FALSE}
knitr::kable(tbl, digits = 2, align = "lrrrr")
```

## Features

vroom has nearly all of the parsing features of
[readr](https://readr.tidyverse.org) for delimited and fixed-width files, including:

- delimiter guessing\*
- custom delimiters (including multi-byte\* and Unicode\* delimiters)
- specification of column types (including type guessing)
  - numeric types (double, integer, big integer\*, number)
  - logical types
  - datetime types (datetime, date, time)
  - categorical types (characters, factors)
- column selection, like `dplyr::select()`\*
- skipping headers, comments and blank lines
- quoted fields
- double and backslashed escapes
- whitespace trimming
- Windows newlines
- [reading from multiple files or connections\*](#reading-multiple-files)
- embedded newlines in headers and fields\*\*
- writing delimited files with as-needed quoting
- robustness to invalid inputs (vroom has been extensively tested with the
  [afl](https://lcamtuf.coredump.cx/afl/) fuzz tester)\*

\* *these are additional features not in readr.*

\*\* *requires `num_threads = 1`.*
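
Two of the features above, delimiter guessing and column selection, can be sketched as follows. The temporary file is created only for this example; `col_select` is the `vroom()` argument that accepts `dplyr::select()`-style expressions.

```r
library(vroom)

# Write mtcars as a comma-delimited file to a temporary location.
path <- tempfile(fileext = ".csv")
vroom_write(mtcars, path, delim = ",")

# `delim` is omitted, so vroom guesses the delimiter from the file contents.
# `col_select` keeps only the named columns, in the order given.
cars <- vroom(path, col_select = c(mpg, cyl, disp), show_col_types = FALSE)
names(cars)
```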

## Installation

Install vroom from CRAN with:

```r
install.packages("vroom")
```

Alternatively, if you need the development version from
[GitHub](https://github.com/), install it with:

``` r
# install.packages("pak")
pak::pak("tidyverse/vroom")
```

## Usage

See [getting started](https://vroom.r-lib.org/articles/vroom.html)
to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

```{r, include = FALSE}
tibble::rownames_to_column(mtcars, "model") %>%
  vroom::vroom_write("mtcars.tsv", delim = "\t")
```

```{r example}
vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
```

```{r, include = FALSE}
unlink("mtcars.tsv")
```
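
The compact string codes used above have longer equivalents. The sketch below expresses a similar specification with `cols()` and the `col_*()` helpers that vroom exports; the temporary file stands in for `mtcars.tsv` and is created only for this example.

```r
library(vroom)

# Recreate a tab-delimited copy of mtcars with the row names as a column.
path <- tempfile(fileext = ".tsv")
vroom_write(tibble::rownames_to_column(mtcars, "model"), path, delim = "\t")

# Columns not named in the specification are guessed.
spec <- cols(
  cyl  = col_integer(),
  gear = col_factor(),
  disp = col_skip(),     # same as "_" in the compact notation
  vs   = col_logical()
)

cars <- vroom(path, delim = "\t", col_types = spec)
```

The long form is more verbose but can be easier to read when a specification carries extra options, such as explicit factor levels.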

## Reading multiple files

vroom natively supports reading from multiple files (or even multiple
connections!).

First we generate some files to read by splitting the `flights` dataset from nycflights13 by
airline.
For the sake of the example, we'll keep only the first 2 rows for each airline.

```{r}
library(nycflights13)
# Write one small tab-delimited file per carrier, named after the carrier code.
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(head(.x, 2), glue::glue("flights_{.y}.tsv"), delim = "\t")
)
```

Then we can efficiently read them into one tibble by passing the filenames directly to vroom.
The `id` argument can be used to request a column that reveals the filename that each row originated from.

```{r}
files <- fs::dir_ls(glob = "flights*tsv")
files
vroom::vroom(files, id = "source")
```

```{r, include = FALSE}
fs::file_delete(files)
```

## Learning more

- [Getting started with vroom](https://vroom.r-lib.org/articles/vroom.html)
- [📽 vroom: Because Life is too short to read slow](https://www.youtube.com/watch?v=RA9AjqZXxMU&t=10s) - Presentation at UseR!2019 ([slides](https://speakerdeck.com/jimhester/vroom))
- [📹 vroom: Read and write rectangular data quickly](https://www.youtube.com/watch?v=ZP_y5eaAc60) - a video tour of the vroom features.

## Benchmarks

The speed quoted above is from a real `r format(fs::fs_bytes(tm$size[[1]]))` dataset with `r format(tm$rows[[1]], big.mark = ",")` rows and `r tm$cols[[1]]` columns.
See the [benchmark article](https://vroom.r-lib.org/articles/benchmarks.html)
for full details of the dataset and
[bench/](https://github.com/tidyverse/vroom/tree/main/inst/bench) for the code
used to retrieve the data and perform the benchmarks.

## Environment variables

In addition to the arguments to the `vroom()` function, you can control the
behavior of vroom with a few environment variables. Most users will not need
to set them.

- `VROOM_TEMP_PATH` - Path to the directory used to store temporary files when
  reading from an R connection. If unset, defaults to the R session's temporary
  directory (`tempdir()`).
- `VROOM_THREADS` - The number of processor threads to use when indexing and
  parsing. If unset, defaults to `parallel::detectCores()`.
- `VROOM_SHOW_PROGRESS` - Whether to show the progress bar when indexing.
  Regardless of this setting, the progress bar is disabled in non-interactive
  settings, in R notebooks, when running tests with testthat, and when knitting
  documents.
- `VROOM_CONNECTION_SIZE` - The size (in bytes) of the connection buffer when
  reading from connections (default is 128 KiB).
- `VROOM_WRITE_BUFFER_LINES` - The number of lines to use for each buffer when
  writing files (default: 1000).
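
As a rough sketch, these variables can be set from within R using `Sys.setenv()` before calling `vroom()`. The values and the temporary file below are arbitrary choices for illustration, not recommendations.

```r
library(vroom)

# Set the variables before calling vroom(); the values here are arbitrary.
Sys.setenv(
  VROOM_THREADS = "2",            # use two threads for indexing and parsing
  VROOM_SHOW_PROGRESS = "false"   # do not show the progress bar
)

# A tiny made-up file to read.
path <- tempfile(fileext = ".tsv")
vroom_write(data.frame(x = 1:3, y = c("a", "b", "c")), path, delim = "\t")
df <- vroom(path, delim = "\t", show_col_types = FALSE)

# Remove the variables again so later calls fall back to the defaults.
Sys.unsetenv(c("VROOM_THREADS", "VROOM_SHOW_PROGRESS"))
```

Setting the variables in `.Renviron` or in the shell that launches R would have the same effect for a whole session.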

There is also a family of variables to control use of the Altrep framework.
For versions of R where the Altrep framework is unavailable (R < 3.5.0), they
are automatically turned off and the variables have no effect. The variables
can take one of `true`, `false`, `TRUE`, `FALSE`, `1`, or `0`.

- `VROOM_USE_ALTREP_NUMERICS` - If set, use Altrep for _all_ numeric types
  (default `false`).

There are also individual variables for each type. Currently only
`VROOM_USE_ALTREP_CHR` defaults to `true`.

- `VROOM_USE_ALTREP_CHR`
- `VROOM_USE_ALTREP_FCT`
- `VROOM_USE_ALTREP_INT`
- `VROOM_USE_ALTREP_BIG_INT`
- `VROOM_USE_ALTREP_DBL`
- `VROOM_USE_ALTREP_NUM`
- `VROOM_USE_ALTREP_LGL`
- `VROOM_USE_ALTREP_DTTM`
- `VROOM_USE_ALTREP_DATE`
- `VROOM_USE_ALTREP_TIME`

## RStudio caveats

RStudio's environment pane calls `object.size()` when it refreshes the pane, which
for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes
([RStudio#4210](https://github.com/rstudio/rstudio/pull/4210),
[RStudio#4292](https://github.com/rstudio/rstudio/pull/4292)) for this issue,
so it is recommended you use at least that version.

## Thanks

- [Gabe Becker](https://github.com/gmbecker), [Luke
  Tierney](https://homepage.divms.uiowa.edu/~luke/) and [Tomas Kalibera](https://github.com/kalibera) for
  conceiving, implementing and maintaining the [Altrep
  framework](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html).
- [Romain François](https://github.com/romainfrancois), whose
  [Altrepisode](https://web.archive.org/web/20200315075838/https://purrple.cat/blog/2018/10/14/altrep-and-cpp/) package
  and [related blog-posts](https://web.archive.org/web/20200315075838/https://purrple.cat/blog/2018/10/14/altrep-and-cpp/) were a great guide for creating new Altrep objects in C++.
- [Matt Dowle](https://github.com/mattdowle) and the rest of the [Rdatatable](https://github.com/Rdatatable) team. `data.table::fread()` is blazing fast and was great motivation to see how fast we could go!

Owner

Name: tidyverse
Login: tidyverse
Kind: organization

Website: http://tidyverse.org
Repositories: 43
Profile: https://github.com/tidyverse

The tidyverse is a collection of R packages that share common principles and are designed to work together seamlessly

GitHub Events

Total

Issues event: 10
Watch event: 10
Issue comment event: 4
Pull request event: 1
Fork event: 3

Last Year

Issues event: 10
Watch event: 10
Issue comment event: 4
Pull request event: 1
Fork event: 3

Committers

Last synced: about 1 year ago

All Time

Total Commits: 1,260
Total Committers: 25
Avg Commits per committer: 50.4
Development Distribution Score (DDS): 0.14

Past Year

Commits: 1
Committers: 1
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Jim Hester	j**r@g**m	1,084
Jenny Bryan	j**n@g**m	104
DavisVaughan	d**s@r**m	24
Shelby Bearrows	3****s	17
Ho Bich Hai	h**i@g**m	4
Michael Chirico	c**m@g**m	2
Hadley Wickham	h**m@g**m	2
Kent Johnson	k**n@a**m	2
Lionel Henry	l**y@g**m	2
Mara Averick	m**k@g**m	2
MikeJohnPage	3****e	2
Remy	r**d@g**m	2
Andrie de Vries	a**s@g**m	1
Anirban	b**6@g**m	1
Bill Lattner	w**r@g**m	1
Criscely Luján Paredes	c**n@g**m	1
Edzer Pebesma	e**a@u**e	1
Florencia Mangini	f**i@g**m	1
James Baird	j**4@g**m	1
Jeroen Ooms	j**s@g**m	1
Luke Johnston	l**t@g**m	1
Mauro Lepore	m**e@g**m	1
Panagiotis Cheilaris	p**s@g**m	1
bart1	1****1	1
jrf1111	3****1	1

Committer Domains (Top 20 + Academic)

uni-muenster.de: 1, akoyabio.com: 1, google.com: 1, rstudio.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 121
Total pull requests: 40
Average time to close issues: 5 months
Average time to close pull requests: 29 days
Total issue authors: 74
Total pull request authors: 16
Average comments per issue: 1.45
Average comments per pull request: 1.38
Merged pull requests: 28
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 11
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 11
Pull request authors: 2
Average comments per issue: 0.18
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

jennybc (24)
khusmann (5)
jimhester (4)
sbearrows (4)
bart1 (4)
ahcyip (3)
klmr (2)
mgacc0 (2)
ramiromagno (2)
mgirlich (2)
muschellij2 (2)
nunotexbsd (1)
cboettig (1)
jochemvankempen (1)
stephpeng93 (1)

Pull Request Authors

sbearrows (12)
jennybc (8)
DavisVaughan (3)
dvg-p4 (2)
hadley (2)
khusmann (2)
jeroen (2)
ahcyip (2)
bart1 (1)
bairdj (1)
jimhester (1)
philaris (1)
MichaelChirico (1)
olivroy (1)
kevinushey (1)

Top Labels

Issue Labels

feature (21) bug (17) colspec 📁 (7) upkeep (6) documentation (5) datetime 📆 (4) readr 📖 (3) performance 🚀 (3) col_types (2) encoding (2) reprex (1) robustness 🏋️‍♀️ (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 909,851 last-month
Total docker downloads: 46,398,832

Total dependent packages: 45
Total dependent repositories: 193
Total versions: 22
Total maintainers: 1

cran.r-project.org: vroom

Read and Write Rectangular Text Data Quickly

Homepage: https://vroom.r-lib.org
Documentation: http://cran.r-project.org/web/packages/vroom/vroom.pdf
License: MIT + file LICENSE
Latest release: 1.6.5
published over 2 years ago

Versions: 22
Dependent Packages: 45
Dependent Repositories: 193
Downloads: 909,851 Last month
Docker Downloads: 46,398,832

Rankings

Downloads: 0.2%

Stargazers count: 0.6%

Forks count: 1.2%

Dependent repos count: 1.3%

Dependent packages count: 1.8%

Average: 3.7%

Docker downloads count: 17.3%

Maintainers (1)

jenny@posit.co

Last synced: 10 months ago

Dependencies

DESCRIPTION cran

R >= 3.4 depends
bit64 * imports
cli >= 3.2.0 imports
crayon * imports
glue * imports
hms * imports
lifecycle * imports
methods * imports
rlang >= 0.4.2 imports
stats * imports
tibble >= 2.0.0 imports
tidyselect * imports
tzdb >= 0.1.1 imports
vctrs >= 0.2.0 imports
withr * imports
archive * suggests
bench >= 1.1.0 suggests
covr * suggests
curl * suggests
dplyr * suggests
forcats * suggests
fs * suggests
ggplot2 * suggests
knitr * suggests
patchwork * suggests
prettyunits * suggests
purrr * suggests
rmarkdown * suggests
rstudioapi * suggests
scales * suggests
spelling * suggests
testthat >= 2.1.0 suggests
tidyr * suggests
utils * suggests
waldo * suggests
xml2 * suggests

.github/workflows/R-CMD-check.yaml actions

actions/checkout v2 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/debug.yaml actions

actions/checkout v2 composite
mxschmitt/action-tmate v3 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action 4.1.4 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pr-commands.yaml actions

actions/checkout v2 composite
r-lib/actions/pr-fetch v2 composite
r-lib/actions/pr-push v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/test-coverage.yaml actions

actions/checkout v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite
