Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Keywords
Repository
R package to read and write Parquet files
Basic Info
- Host: GitHub
- Owner: r-lib
- License: other
- Language: C++
- Default Branch: main
- Homepage: https://nanoparquet.r-lib.org/
- Size: 10.5 MB
Statistics
- Stars: 70
- Watchers: 3
- Forks: 6
- Open Issues: 38
- Releases: 6
Topics
Metadata Files
README.md
nanoparquet
nanoparquet is a reader and writer for a common subset of Parquet files.
Features:
- Read and write flat (i.e. non-nested) Parquet files.
- Can read most Parquet data types.
- Can read a subset of columns from a Parquet file.
- Can write many R data types, including factors and temporal types, to Parquet.
- Can append a data frame to a Parquet file without first reading and then rewriting the whole file.
- Completely dependency free.
- Supports Snappy, Gzip and Zstd compression.
- Competitive with other tools in terms of speed, memory use and file size.
Limitations:
- Nested Parquet types are not supported.
- Some Parquet logical types are not supported: INTERVAL, UNKNOWN.
- Only Snappy, Gzip and Zstd compression is supported.
- Encryption is not supported.
- Reading files from URLs is not supported.
- nanoparquet always reads the data (or the selected subset of it) into memory. It does not work with out-of-memory data in Parquet files, as Apache Arrow and DuckDB do.
Installation
Install the R package from CRAN:
```r
install.packages("nanoparquet")
```
Usage
Read
Call read_parquet() to read a Parquet file:
```r
df <- nanoparquet::read_parquet("example.parquet")
```
To see the columns of a Parquet file and how their types are mapped to
R types by read_parquet(), call read_parquet_schema() first:
```r
nanoparquet::read_parquet_schema("example.parquet")
```
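The feature list mentions reading only a subset of columns. A minimal sketch of that, assuming the `col_select` argument of `read_parquet()` accepts a character vector of column names (check `?read_parquet` for your installed version); the temporary file is only there to make the example self-contained:

```r
library(nanoparquet)

# Write a small example file so the snippet can run on its own.
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read only two of the columns. `col_select` is assumed here to take
# column names; numeric column indices may be used instead.
df <- read_parquet(path, col_select = c("mpg", "cyl"))
names(df)
```

Selecting columns this way should reduce memory use, since nanoparquet always materializes whatever it reads.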
Folders of similarly structured Parquet files (e.g. produced by Spark) can be read like this:
```r
df <- data.table::rbindlist(lapply(
  Sys.glob("some-folder/part-*.parquet"),
  nanoparquet::read_parquet
))
```
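If you would rather not depend on data.table, the same pattern can be sketched with base R only. This is an illustration, not part of the package: the folder and the `part-*.parquet` file names below are made up so the example runs on its own, and `do.call(rbind, ...)` is likely slower than `rbindlist()` for many files.

```r
library(nanoparquet)

# Hypothetical folder with two part files that share the same schema.
dir <- tempfile("parts-")
dir.create(dir)
write_parquet(mtcars[1:16, ], file.path(dir, "part-0001.parquet"))
write_parquet(mtcars[17:32, ], file.path(dir, "part-0002.parquet"))

# Read every part file and stack the resulting data frames row-wise.
files <- sort(Sys.glob(file.path(dir, "part-*.parquet")))
combined <- do.call(rbind, lapply(files, read_parquet))
nrow(combined)
```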
Write
Call write_parquet() to write a data frame to a Parquet file:
```r
nanoparquet::write_parquet(mtcars, "mtcars.parquet")
```
To see how the columns of the data frame will be mapped to Parquet types
by write_parquet(), call infer_parquet_schema() first:
```r
nanoparquet::infer_parquet_schema(mtcars)
```
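The feature list also mentions compression and appending to an existing file. A minimal sketch of both, assuming the `compression` argument of `write_parquet()` and the `append_parquet()` function as documented in recent nanoparquet releases; the temporary path is illustrative:

```r
library(nanoparquet)

path <- tempfile(fileext = ".parquet")

# Write with Zstd instead of the default compression method.
write_parquet(mtcars, path, compression = "zstd")

# Append the same rows again without reading the file back first.
# The appended data frame must match the schema of the existing file.
append_parquet(mtcars, path)

both <- read_parquet(path)
nrow(both)
```

Appending modifies the file in place, so it may be worth keeping a copy of important files before appending to them.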
Inspect
Call read_parquet_info(), read_parquet_schema(), or
read_parquet_metadata() to see various kinds of metadata from a Parquet
file:
- read_parquet_info() shows a basic summary of the file.
- read_parquet_schema() shows all columns, including non-leaf columns, and how they are mapped to R types by read_parquet().
- read_parquet_metadata() shows the most complete metadata information: file metadata, the schema, the row groups and column chunks of the file.
```r
nanoparquet::read_parquet_info("mtcars.parquet")
nanoparquet::read_parquet_schema("mtcars.parquet")
nanoparquet::read_parquet_metadata("mtcars.parquet")
```
If you find a file that should be supported but isn't, please open an issue in the r-lib/nanoparquet GitHub repository with a link to the file.
Options
See also ?parquet_options() for further details.
- nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
- nanoparquet.compression_level: See ?parquet_options() for the defaults and the possible values for each compression method. Inf selects maximum compression for each method.
- nanoparquet.num_rows_per_row_group: The number of rows to put into a row group by write_parquet(), if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
- nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
- nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve the classes of columns, e.g. factors will be read back as factors, both by nanoparquet and Arrow.
- nanoparquet.write_data_page_version: Data page version to write by default. Possible values are 1 and 2. Default is 1.
- nanoparquet.write_minmax_values: Whether to write minimum and maximum values per row group, for data types that support this in write_parquet().
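These are ordinary R options, so they can be set with `options()`. A small sketch, assuming that `write_parquet()` picks up `nanoparquet.num_rows_per_row_group` at call time and that `read_parquet_info()` reports a `num_row_groups` column; the value of 10 rows is chosen only so the effect is visible on a small data set:

```r
library(nanoparquet)

# Use tiny row groups for illustration; the option must be an integer scalar.
old <- options(nanoparquet.num_rows_per_row_group = 10L)

path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# With 32 rows and 10 rows per group, the file should have several row groups.
info <- read_parquet_info(path)
info$num_row_groups

# Restore the previous option values.
options(old)
```

Saving the return value of `options()` and restoring it afterwards keeps the change from leaking into the rest of the session.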
License
MIT
Owner
- Name: R infrastructure
- Login: r-lib
- Kind: organization
- Repositories: 154
- Profile: https://github.com/r-lib
GitHub Events
Total
- Create event: 17
- Release event: 3
- Issues event: 54
- Watch event: 12
- Delete event: 5
- Issue comment event: 86
- Push event: 146
- Pull request event: 24
- Fork event: 5
Last Year
- Create event: 17
- Release event: 3
- Issues event: 54
- Watch event: 12
- Delete event: 5
- Issue comment event: 86
- Push event: 146
- Pull request event: 24
- Fork event: 5
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 32
- Total pull requests: 14
- Average time to close issues: 2 months
- Average time to close pull requests: about 12 hours
- Total issue authors: 16
- Total pull request authors: 4
- Average comments per issue: 1.84
- Average comments per pull request: 0.5
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 27
- Pull requests: 14
- Average time to close issues: 22 days
- Average time to close pull requests: about 12 hours
- Issue authors: 14
- Pull request authors: 4
- Average comments per issue: 1.85
- Average comments per pull request: 0.5
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- gaborcsardi (38)
- r2evans (2)
- TurnaevEvgeny (2)
- damonbayer (1)
- vankesteren (1)
- Upipa (1)
- PMassicotte (1)
- apalacio9502 (1)
- D3SL (1)
- thisisnic (1)
- yutannihilation (1)
- cboettig (1)
- tanho63 (1)
- hadley (1)
- cmrnp (1)
Pull Request Authors
- gaborcsardi (40)
- vincentarelbundock (1)
- hadley (1)
- yutannihilation (1)
- eitsupi (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - cran: 5,166 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 6
- Total maintainers: 1
cran.r-project.org: nanoparquet
Read and Write 'Parquet' Files
- Homepage: https://github.com/r-lib/nanoparquet
- Documentation: http://cran.r-project.org/web/packages/nanoparquet/nanoparquet.pdf
- License: MIT + file LICENSE
- Latest release: 0.4.2 (published 12 months ago)