nanoparquet

R package to read and write Parquet files

https://github.com/r-lib/nanoparquet

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

parquet r
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 70
  • Watchers: 3
  • Forks: 6
  • Open Issues: 38
  • Releases: 6
Created almost 2 years ago · Last pushed 6 months ago

README.md

nanoparquet

nanoparquet is a reader and writer for a common subset of Parquet files.

Features:

  • Read and write flat (i.e. non-nested) Parquet files.
  • Can read most Parquet data types.
  • Can read a subset of columns from a Parquet file.
  • Can write many R data types, including factors and temporal types, to Parquet.
  • Can append a data frame to a Parquet file without first reading and then rewriting the whole file.
  • Completely dependency free.
  • Supports Snappy, Gzip and Zstd compression.
  • Competitive with other tools in terms of speed, memory use and file size.

Limitations:

  • Nested Parquet types are not supported.
  • Some Parquet logical types are not supported: INTERVAL, UNKNOWN.
  • Only Snappy, Gzip and Zstd compression is supported.
  • Encryption is not supported.
  • Reading files from URLs is not supported.
  • nanoparquet always reads the data (or the selected subset of it) into memory. Unlike Apache Arrow and DuckDB, it cannot work with Parquet data that does not fit into memory.

Installation

Install the R package from CRAN:

```r
install.packages("nanoparquet")
```

Usage

Read

Call read_parquet() to read a Parquet file:

```r
df <- nanoparquet::read_parquet("example.parquet")
```

To see the columns of a Parquet file and how their types are mapped to R types by read_parquet(), call read_parquet_schema() first:

```r
nanoparquet::read_parquet_schema("example.parquet")
```
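The feature list mentions reading a subset of columns. A minimal sketch of that, assuming read_parquet() accepts a `col_select` argument that takes column names (check ?read_parquet in your installed version); the temporary file is only there to make the example self-contained:

```r
library(nanoparquet)

# Write a small example file to a temporary location.
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read only two of the columns.
# Assumption: `col_select` accepts a character vector of column names.
df <- read_parquet(path, col_select = c("mpg", "cyl"))
names(df)
```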

Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:

```r
df <- data.table::rbindlist(lapply(
  Sys.glob("some-folder/part-*.parquet"),
  nanoparquet::read_parquet
))
```
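If adding data.table as a dependency is not desirable, the same pattern can be sketched with base R. This is an illustrative alternative rather than something the README prescribes; the folder and the `part-*.parquet` file names below are made up for the example, and the files are assumed to share the same columns:

```r
library(nanoparquet)

# Create a hypothetical folder holding two similarly structured files.
dir <- tempfile("parts-")
dir.create(dir)
write_parquet(mtcars[1:16, ], file.path(dir, "part-1.parquet"))
write_parquet(mtcars[17:32, ], file.path(dir, "part-2.parquet"))

# Read every part and stack the resulting data frames with base R.
files <- Sys.glob(file.path(dir, "part-*.parquet"))
parts <- lapply(files, read_parquet)
df <- do.call(rbind, parts)
nrow(df)
```

For many or large files, data.table::rbindlist() as shown above is likely to be faster than do.call(rbind, ...).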

Write

Call write_parquet() to write a data frame to a Parquet file:

```r
nanoparquet::write_parquet(mtcars, "mtcars.parquet")
```

To see how the columns of the data frame will be mapped to Parquet types by write_parquet(), call infer_parquet_schema() first:

```r
nanoparquet::infer_parquet_schema(mtcars)
```
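The feature list also mentions compression and appending to an existing file. A hedged sketch of both, assuming write_parquet() and append_parquet() take a `compression` argument and that append_parquet() takes the data frame first and the file second (see their help pages for the exact signatures):

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")

# Write the first half of the data with Zstd compression.
# Assumption: `compression` accepts "snappy", "gzip" or "zstd".
write_parquet(mtcars[1:16, ], path, compression = "zstd")

# Append the second half without rewriting the rows already in the file.
append_parquet(mtcars[17:32, ], path, compression = "zstd")

# Read everything back to confirm both halves are present.
df <- read_parquet(path)
nrow(df)
```

The appended data frame is assumed to have the same columns and types as the data already in the file.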

Inspect

Call read_parquet_info(), read_parquet_schema(), or read_parquet_metadata() to see various kinds of metadata from a Parquet file:

  • read_parquet_info() shows a basic summary of the file.
  • read_parquet_schema() shows all columns, including non-leaf columns, and how they are mapped to R types by read_parquet().
  • read_parquet_metadata() shows the most complete metadata: the file metadata, the schema, the row groups and the column chunks of the file.

```r
nanoparquet::read_parquet_info("mtcars.parquet")
nanoparquet::read_parquet_schema("mtcars.parquet")
nanoparquet::read_parquet_metadata("mtcars.parquet")
```
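The results of these functions can also be used programmatically. The sketch below assumes that read_parquet_info() returns a data frame with a `num_rows` column, that read_parquet_schema() returns a data frame with `name` and `r_type` columns, and that read_parquet_metadata() returns a list; treat these names as assumptions and verify them against your installed version:

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Basic summary of the file (assumed to include a `num_rows` column).
info <- read_parquet_info(path)
info$num_rows

# Column names and the R types they map to (assumed column names).
schema <- read_parquet_schema(path)
schema[, c("name", "r_type")]

# The full metadata; list the components that are available.
meta <- read_parquet_metadata(path)
names(meta)
```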

If you find a file that should be supported but isn't, please open an issue at https://github.com/r-lib/nanoparquet with a link to the file.

Options

See also ?parquet_options() for further details.

  • nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
  • nanoparquet.compression_level: The compression level to use when writing. See ?parquet_options() for the defaults and the possible values for each compression method. Inf selects maximum compression for each method.
  • nanoparquet.num_rows_per_row_group: The number of rows to put into a row group by write_parquet(), if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
  • nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
  • nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve the classes of columns, e.g. factors will be read back as factors, both by nanoparquet and Arrow.
  • nanoparquet.write_data_page_version: Data page version to write by default. Possible values are 1 and 2. Default is 1.
  • nanoparquet.write_minmax_values: Whether to write minimum and maximum values per row group, for data types that support this in write_parquet().
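
As an illustration of how these options might be used, the sketch below sets the row group size through a global option and, alternatively, per call through parquet_options(). The `num_rows_per_row_group` argument name and the `num_row_groups` column of read_parquet_info() are assumptions based on the option names above; confirm them in ?parquet_options:

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")

# Set the option globally, write the file, then restore the old value.
old <- options(nanoparquet.num_rows_per_row_group = 10L)
write_parquet(mtcars, path)
options(old)

# The 32 rows of mtcars should now be split across several row groups.
info <- read_parquet_info(path)
info$num_row_groups

# Alternatively, pass the setting for a single call only.
# Assumption: parquet_options() has a `num_rows_per_row_group` argument.
path2 <- tempfile(fileext = ".parquet")
write_parquet(
  mtcars, path2,
  options = parquet_options(num_rows_per_row_group = 10L)
)
```

The per-call form avoids changing global state, which tends to be safer inside packages and scripts that other code depends on.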

License

MIT

Owner

  • Name: R infrastructure
  • Login: r-lib
  • Kind: organization

GitHub Events

Total
  • Create event: 17
  • Release event: 3
  • Issues event: 54
  • Watch event: 12
  • Delete event: 5
  • Issue comment event: 86
  • Push event: 146
  • Pull request event: 24
  • Fork event: 5
Last Year
  • Create event: 17
  • Release event: 3
  • Issues event: 54
  • Watch event: 12
  • Delete event: 5
  • Issue comment event: 86
  • Push event: 146
  • Pull request event: 24
  • Fork event: 5

Issues and Pull Requests

All Time
  • Total issues: 32
  • Total pull requests: 14
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 16
  • Total pull request authors: 4
  • Average comments per issue: 1.84
  • Average comments per pull request: 0.5
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 27
  • Pull requests: 14
  • Average time to close issues: 22 days
  • Average time to close pull requests: about 12 hours
  • Issue authors: 14
  • Pull request authors: 4
  • Average comments per issue: 1.85
  • Average comments per pull request: 0.5
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • gaborcsardi (38)
  • r2evans (2)
  • TurnaevEvgeny (2)
  • damonbayer (1)
  • vankesteren (1)
  • Upipa (1)
  • PMassicotte (1)
  • apalacio9502 (1)
  • D3SL (1)
  • thisisnic (1)
  • yutannihilation (1)
  • cboettig (1)
  • tanho63 (1)
  • hadley (1)
  • cmrnp (1)
Pull Request Authors
  • gaborcsardi (40)
  • vincentarelbundock (1)
  • hadley (1)
  • yutannihilation (1)
  • eitsupi (1)
Top Labels
Issue Labels
feature (13) upkeep (6) bug (5) documentation (1)
Pull Request Labels
upkeep (1) feature (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 5,166 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
cran.r-project.org: nanoparquet

Read and Write 'Parquet' Files

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 5,166 Last month
Rankings
Stargazers count: 26.6%
Forks count: 28.8%
Dependent packages count: 28.8%
Dependent repos count: 35.5%
Average: 41.0%
Downloads: 85.3%
Maintainers (1)