nanoparquet

R package to read and write Parquet files

https://github.com/r-lib/nanoparquet

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary

Keywords

parquet r

Last synced: 6 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: r-lib
License: other
Language: C++
Default Branch: main
Homepage: https://nanoparquet.r-lib.org/
Size: 10.5 MB

Statistics

Stars: 70
Watchers: 3
Forks: 6
Open Issues: 38
Releases: 6

Topics

parquet r

Created almost 2 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog License

README.md

nanoparquet

nanoparquet is a reader and writer for a common subset of Parquet files.

Features:

Read and write flat (i.e. non-nested) Parquet files.
Can read most Parquet data types.
Can read a subset of columns from a Parquet file.
Can write many R data types, including factors and temporal types, to Parquet.
Can append a data frame to a Parquet file without first reading and then rewriting the whole file.
Completely dependency free.
Supports Snappy, Gzip and Zstd compression.
Competitive with other tools in terms of speed, memory use and file size.
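
The column-subset and append features listed above can be sketched as follows. This is a minimal example, not taken from the README: it assumes a nanoparquet version (0.4.0 or later) in which read_parquet() accepts a col_select argument and append_parquet() is available, and the temporary file path is made up for illustration.

```r
library(nanoparquet)

# Temporary file used only for this illustration.
path <- tempfile(fileext = ".parquet")

# Write the first 20 rows of mtcars, then append the remaining 12
# without reading and rewriting the rows already in the file.
write_parquet(mtcars[1:20, ], path)
append_parquet(mtcars[21:32, ], path)

# Read back only two of the columns (assumes col_select accepts names).
subset_df <- read_parquet(path, col_select = c("mpg", "cyl"))
dim(subset_df)
```

Appending requires the new rows to have the same columns and types as the existing file, which is why both pieces here come from the same data frame.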

Limitations:

Nested Parquet types are not supported.
Some Parquet logical types are not supported: INTERVAL, UNKNOWN.
Only Snappy, Gzip and Zstd compression is supported.
Encryption is not supported.
Reading files from URLs is not supported.
nanoparquet always reads the data (or the selected subset of it) into memory. Unlike Apache Arrow and DuckDB, it does not work with out-of-memory data in Parquet files.

Installation

Install the R package from CRAN:

    install.packages("nanoparquet")

Usage

Read

Call read_parquet() to read a Parquet file:

    df <- nanoparquet::read_parquet("example.parquet")

To see the columns of a Parquet file and how their types are mapped to R types by read_parquet(), call read_parquet_schema() first:

    nanoparquet::read_parquet_schema("example.parquet")

Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:

    df <- data.table::rbindlist(lapply(
      Sys.glob("some-folder/part-*.parquet"),
      nanoparquet::read_parquet
    ))
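
A self-contained variant of the folder example, as a sketch: it first writes two part files into a temporary directory (the directory and file names are made up for illustration) and then combines them with base R's do.call(rbind, ...) instead of data.table::rbindlist(). This avoids the extra dependency, though it is likely slower when there are many files.

```r
library(nanoparquet)

# Create a hypothetical folder with two similarly structured part files.
folder <- file.path(tempdir(), "some-folder")
dir.create(folder, showWarnings = FALSE)
write_parquet(mtcars[1:16, ], file.path(folder, "part-0001.parquet"))
write_parquet(mtcars[17:32, ], file.path(folder, "part-0002.parquet"))

# Read every part file and stack the resulting data frames row-wise.
files <- sort(Sys.glob(file.path(folder, "part-*.parquet")))
parts <- lapply(files, read_parquet)
df <- do.call(rbind, parts)
dim(df)
```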

Write

Call write_parquet() to write a data frame to a Parquet file:

    nanoparquet::write_parquet(mtcars, "mtcars.parquet")

To see how the columns of the data frame will be mapped to Parquet types by write_parquet(), call infer_parquet_schema() first:

    nanoparquet::infer_parquet_schema(mtcars)
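
Since Snappy, Gzip and Zstd are all supported, the compression method can be chosen per call. The sketch below assumes the compression argument of write_parquet(); the file paths are temporary and only for illustration, and the resulting sizes depend on the data, so none are shown.

```r
library(nanoparquet)

path_snappy <- tempfile(fileext = ".parquet")
path_zstd <- tempfile(fileext = ".parquet")

# Write the same data frame with two different compression methods.
write_parquet(mtcars, path_snappy, compression = "snappy")
write_parquet(mtcars, path_zstd, compression = "zstd")

# Compare the file sizes in bytes.
file.size(c(path_snappy, path_zstd))

# The compression method is recorded in the file, so reading needs no extra argument.
roundtrip <- read_parquet(path_zstd)
```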

Inspect

Call read_parquet_info(), read_parquet_schema(), or read_parquet_metadata() to see various kinds of metadata from a Parquet file:

read_parquet_info() shows a basic summary of the file.
read_parquet_schema() shows all columns, including non-leaf columns, and how they are mapped to R types by read_parquet().
read_parquet_metadata() shows the most complete metadata: the file metadata, the schema, the row groups and the column chunks of the file.

    nanoparquet::read_parquet_info("mtcars.parquet")
    nanoparquet::read_parquet_schema("mtcars.parquet")
    nanoparquet::read_parquet_metadata("mtcars.parquet")
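
The results of these functions are ordinary R objects, so individual pieces can be extracted programmatically. The following is a sketch: the column names used below (num_rows, name, type, r_type) are assumptions based on the package documentation and may differ between versions.

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Basic summary: one row per file.
info <- read_parquet_info(path)
info$num_rows

# Schema: one row per column, including the root (non-leaf) node.
schema <- read_parquet_schema(path)
schema[, c("name", "type", "r_type")]

# Full metadata: a list whose components can be inspected one by one.
md <- read_parquet_metadata(path)
names(md)
```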

If you find a file that should be supported but isn't, please open an issue at https://github.com/r-lib/nanoparquet/issues with a link to the file.

Options

See also ?parquet_options() for further details.

nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
nanoparquet.compression_level: See ?parquet_options() for the defaults and the possible values for each compression method. Inf selects maximum compression for each method.
nanoparquet.num_rows_per_row_group: The number of rows to put into a row group by write_parquet(), if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve the classes of columns, e.g. factors will be read back as factors, both by nanoparquet and Arrow.
nanoparquet.write_data_page_version: Data page version to write by default. Possible values are 1 and 2. Default is 1.
nanoparquet.write_minmax_values: Whether write_parquet() writes minimum and maximum values per row group, for data types that support this.
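
As a sketch of how these options can be applied, the example below sets the row group size once for a single call through parquet_options() and once for the whole session through options(). It assumes that parquet_options() takes argument names matching the option names without the nanoparquet. prefix, and that read_parquet_info() reports a num_row_groups column.

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")

# Per-call setting: at most 10 rows per row group for this write only.
write_parquet(
  mtcars, path,
  options = parquet_options(num_rows_per_row_group = 10L)
)
info <- read_parquet_info(path)
info$num_row_groups

# Session-wide setting: applies to later write_parquet() calls
# that do not pass their own options.
old <- options(nanoparquet.num_rows_per_row_group = 10L)
write_parquet(mtcars, path)
options(old)  # restore the previous setting
```

Passing options per call keeps the change local, which is usually easier to reason about in scripts than a global option.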

License

MIT

Owner

Name: R infrastructure
Login: r-lib
Kind: organization

Repositories: 154
Profile: https://github.com/r-lib

GitHub Events

Total

Create event: 17
Release event: 3
Issues event: 54
Watch event: 12
Delete event: 5
Issue comment event: 86
Push event: 146
Pull request event: 24
Fork event: 5

Last Year

Create event: 17
Release event: 3
Issues event: 54
Watch event: 12
Delete event: 5
Issue comment event: 86
Push event: 146
Pull request event: 24
Fork event: 5

Issues and Pull Requests

All Time

Total issues: 32
Total pull requests: 14
Average time to close issues: 2 months
Average time to close pull requests: about 12 hours
Total issue authors: 16
Total pull request authors: 4
Average comments per issue: 1.84
Average comments per pull request: 0.5
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 27
Pull requests: 14
Average time to close issues: 22 days
Average time to close pull requests: about 12 hours
Issue authors: 14
Pull request authors: 4
Average comments per issue: 1.85
Average comments per pull request: 0.5
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

gaborcsardi (38)
r2evans (2)
TurnaevEvgeny (2)
damonbayer (1)
vankesteren (1)
Upipa (1)
PMassicotte (1)
apalacio9502 (1)
D3SL (1)
thisisnic (1)
yutannihilation (1)
cboettig (1)
tanho63 (1)
hadley (1)
cmrnp (1)

Pull Request Authors

gaborcsardi (40)
vincentarelbundock (1)
hadley (1)
yutannihilation (1)
eitsupi (1)

Top Labels

Issue Labels

feature (13) upkeep (6) bug (5) documentation (1)

Pull Request Labels

upkeep (1) feature (1)

Packages

Total packages: 1
Total downloads: 5,166 last month (CRAN)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 6
Total maintainers: 1

cran.r-project.org: nanoparquet

Read and Write 'Parquet' Files

Homepage: https://github.com/r-lib/nanoparquet
Documentation: http://cran.r-project.org/web/packages/nanoparquet/nanoparquet.pdf
License: MIT + file LICENSE
Latest release: 0.4.2
published 12 months ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 5,166 Last month

Rankings

Stargazers count: 26.6%

Forks count: 28.8%

Dependent packages count: 28.8%

Dependent repos count: 35.5%

Average: 41.0%

Downloads: 85.3%

Maintainers (1)

csardi.gabor@gmail.com