joinem
CLI for fast, flexible concatenation of tabular data using polars
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 6 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary
Repository
Statistics
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
- Free software: MIT license
- Repository: https://github.com/mmore500/joinem
- Documentation: https://github.com/mmore500/joinem/blob/master/README.md
Install
python3 -m pip install joinem
Features
- Lazily streams I/O to expeditiously handle numerous large files.
- Supports CSV and parquet input files.
- Due to current polars limitations, JSON and feather files are not supported.
- Input formats may be mixed.
- Supports output to CSV, JSON, parquet, and feather file types.
- Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
- Provides a progress bar with --progress.
- Adds programmatically generated columns to the output.
Example Usage
Pass input filenames via stdin, one filename per line.
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
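When the file list is assembled in a Python script rather than in the shell, the same stdin protocol can be driven through `subprocess`. The following is a minimal sketch, not part of joinem itself: the helper names `build_stdin`, `build_command`, and `concatenate` are hypothetical, and it assumes joinem is installed in the interpreter that runs the script.

```python
import subprocess
import sys
from pathlib import Path


def build_stdin(paths):
    """Format input paths the way joinem reads them: one filename per line."""
    return "".join(f"{p}\n" for p in paths)


def build_command(output_file, how=None):
    """Assemble the argv for `python3 -m joinem`, optionally adding --how."""
    command = [sys.executable, "-m", "joinem", str(output_file)]
    if how is not None:
        command += ["--how", how]
    return command


def concatenate(input_dir, output_file, how=None):
    """Run joinem on the CSV and parquet files directly inside input_dir."""
    paths = sorted(
        p for p in Path(input_dir).iterdir() if p.suffix in {".csv", ".parquet"}
    )
    subprocess.run(
        build_command(output_file, how),
        input=build_stdin(paths),  # filenames are passed via stdin
        text=True,
        check=True,  # raise if joinem exits with a nonzero status
    )
```

Calling `concatenate("path/to", "out.parquet")` would then correspond to the `find` pipeline above, restricted to files directly inside `path/to`.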
Output file type is inferred from the extension of the output file name.
Supported output types are feather, JSON, parquet, and CSV.
find -name '*.parquet' | python3 -m joinem out.json
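To make extension-based inference concrete, here is a small sketch of how such a lookup could work. It is an illustration only: the table `EXTENSION_TO_FILETYPE` and the function `infer_filetype` are hypothetical, and joinem's actual lookup may accept different spellings or handle case differently.

```python
from pathlib import Path

# Hypothetical extension table, based on the extensions used in this README.
EXTENSION_TO_FILETYPE = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".pqt": "parquet",
    ".feather": "feather",
}


def infer_filetype(filename):
    """Map a file name to a file type using only its final extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSION_TO_FILETYPE:
        raise ValueError(f"cannot infer file type from {filename!r}")
    return EXTENSION_TO_FILETYPE[suffix]
```

A path such as `/dev/stdout` has no extension, so nothing can be inferred from it; this is the situation where the file type has to be given explicitly with --output-filetype.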
If file columns may mismatch, use --how diagonal.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
If some files may be empty, use --how diagonal_relaxed.
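The semantics of diagonal concatenation can be illustrated without polars. The sketch below is a plain-Python stand-in, not joinem's implementation: it models each file as a list of row dictionaries, takes the union of all columns, and fills missing values with `None`, which is what polars does with nulls for `how="diagonal"`. The function name `diagonal_concat` is hypothetical.

```python
def diagonal_concat(frames):
    """Stack row lists, taking the union of columns and filling gaps with None."""
    # Collect column names in order of first appearance.
    columns = []
    for frame in frames:
        for row in frame:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Emit every row with the full column set.
    return [
        {name: row.get(name) for name in columns}
        for frame in frames
        for row in frame
    ]


a = [{"x": 1, "y": "u"}]
b = [{"x": 2, "z": 0.5}]
empty = []  # an empty data file contributes no rows
result = diagonal_concat([a, empty, b])
```

Here `result` holds two rows that each have the columns `x`, `y`, and `z`. The sketch ignores data types; in polars, the `_relaxed` variants additionally coerce columns with differing types to a common supertype.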
To run via Singularity/Apptainer,
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
Advanced Usage
Add literal value column to output.
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
Cast a column to categorical in the output, shrink dtypes, and tune compression.
ls -1 *.csv | python3 -m joinem out.pqt \
--with-column 'pl.col("uuid").cast(pl.Categorical)' --string-cache \
--shrink-dtypes \
--write-kwarg 'compression_level=10' --write-kwarg 'compression="zstd"'
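Because each --write-kwarg value is evaluated as a Python expression, the quoting matters: `compression="zstd"` yields a string, while `compression_level=10` yields an integer. The sketch below shows one way such `key=value` strings could be turned into keyword arguments. It is an assumption-laden illustration: `parse_kwarg` and `parse_kwargs` are hypothetical names, and it uses `ast.literal_eval` as a restricted stand-in, so it accepts only literals rather than arbitrary expressions.

```python
import ast


def parse_kwarg(text):
    """Split 'key=value' and evaluate the value as a Python literal."""
    key, sep, value = text.partition("=")
    if not sep or not key.isidentifier():
        raise ValueError(f"expected 'key=value', got {text!r}")
    return key, ast.literal_eval(value)


def parse_kwargs(items):
    """Combine repeated 'key=value' flags into one keyword-argument dict."""
    return dict(parse_kwarg(item) for item in items)


kwargs = parse_kwargs(['compression_level=10', 'compression="zstd"'])
```

With this sketch, an unquoted `compression=zstd` is rejected, since `zstd` on its own is a name rather than a string literal; hence the inner double quotes in the command above.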
Alias an existing column in the output.
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
Apply regex on source datafile paths to create new column in output.
ls -1 path/to/*.csv | python3 -m joinem out.csv \
--with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
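To check what this pattern extracts before running it over many files, the same regular expression can be tried with Python's `re` module. This is only an approximation of the polars expression: polars uses Rust regex syntax with `${1}` for the first capture group, whereas `re` spells it `\1`, and the helper name `filename_stem` is hypothetical.

```python
import re

# Lazily skip any leading directories, capture the last path component
# up to the ".csv" extension.
PATTERN = re.compile(r".*?([^/]*)\.csv")


def filename_stem(filepath):
    """Replace the first match with its captured group, like str.replace."""
    # count=1 mirrors polars str.replace, which replaces only the first match.
    return PATTERN.sub(r"\1", filepath, count=1)


stem = filename_stem("path/to/data_01.csv")  # "data_01"
```

Paths that do not end in `.csv` contain no match and are returned unchanged.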
Read data from stdin and write data to stdout.
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
--output-filetype csv --input-filetype csv
Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
--with-column 'pl.col("myValue").cast(pl.Categorical)' \
--write-kwarg 'compression="lz4"' \
| pv > concat.pqt
API
```
usage: main.py [-h] [--version] [--progress] [--stdin] [--drop DROP]
               [--eager-read] [--eager-write] [--filter FILTERS]
               [--head HEAD] [--tail TAIL] [--sample SAMPLE] [--seed SEED]
               [--with-column WITH_COLUMNS] [--shrink-dtypes] [--string-cache]
               [--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
               [--input-filetype INPUT_FILETYPE]
               [--output-filetype OUTPUT_FILETYPE]
               [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
               output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars
                        DataFrame.filter. Example: 'pl.col("thing") == 0'
  --head HEAD           Number of rows to include in output, counting from
                        front.
  --tail TAIL           Number of rows to include in output, counting from
                        back.
  --gather-every GATHER_EVERY
                        Take every nth row.
  --sample SAMPLE       Number of rows to include in output, sampled
                        uniformly. Pass --seed for deterministic behavior.
  --shuffle             Should output be shuffled? Pass --seed for
                        deterministic behavior.
  --seed SEED           Integer seed for deterministic behavior.
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with
                        access to each datafile's filepath as `filepath` and
                        polars as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required
                        datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
                        How to concatenate frames. See
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example:
                        csv, parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' | python3 -m joinem out.csv
```
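The row-selection options (--head, --tail, --gather-every, --sample with --seed) can be pictured as operations on a list of rows. The sketch below is a plain-Python illustration under stated assumptions, not joinem's code: each function is shown in isolation, the order in which joinem applies them when several flags are combined is not specified here, and sampling is assumed to be without replacement.

```python
import random


def head(rows, n):
    """First n rows, counting from the front."""
    return rows[:n]


def tail(rows, n):
    """Last n rows, counting from the back."""
    return rows[max(len(rows) - n, 0):]


def gather_every(rows, n):
    """Every nth row, starting with the first."""
    return rows[::n]


def sample(rows, n, seed=None):
    """n rows drawn uniformly; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(rows, n)


rows = list(range(10))
first_three = head(rows, 3)
last_three = tail(rows, 3)
every_fourth = gather_every(rows, 4)
```

Passing the same `seed` to `sample` twice returns the same rows, which is the behavior the --seed flag is described as providing.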
Citing
If joinem contributes to a scholarly work, please cite it as
Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182
BibTeX:
@software{moreno2024joinem,
author = {Matthew Andres Moreno},
title = {mmore500/joinem},
month = feb,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.10701182},
url = {https://doi.org/10.5281/zenodo.10701182}
}
And don't forget to leave a star on GitHub!
Owner
- Name: Matthew Andres Moreno
- Login: mmore500
- Kind: user
- Location: East Lansing, MI
- Company: @devosoft
- Website: mmore500.github.io
- Twitter: MorenoMathewA
- Repositories: 43
- Profile: https://github.com/mmore500
doctoral student, Computer Science and Engineering at Michigan State University
Citation (CITATION.cff)
cff-version: 1.1.0
message: "If you use this software, please cite it as below."
title: 'joinem: a Python library for data collation'
abstract: "joinem provides a CLI for fast, flexible concatenation of tabular data."
authors:
- family-names: Moreno
  given-names: Matthew Andres
  orcid: 0000-0003-4726-4479
date-released: 2024-02-24
doi: 10.5281/zenodo.10701182
license: MIT
repository-code: https://github.com/mmore500/joinem
url: "https://github.com/mmore500/joinem"
GitHub Events
Total
- Issues event: 3
- Watch event: 5
- Delete event: 4
- Push event: 29
- Pull request event: 7
- Create event: 10
Last Year
- Issues event: 3
- Watch event: 5
- Delete event: 4
- Push event: 29
- Pull request event: 7
- Create event: 10
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Matthew Andres Moreno | m****g@g****m | 86 |
| vivaansinghvi07 | v****8@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 3
- Total pull requests: 8
- Average time to close issues: 24 days
- Average time to close pull requests: 8 minutes
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 8
- Average time to close issues: 24 days
- Average time to close pull requests: 8 minutes
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mmore500 (3)
Pull Request Authors
- mmore500 (13)
- vivaansinghvi07 (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads (pypi): 7,069 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 19
- Total maintainers: 1
pypi.org: joinem
CLI for fast, flexible concatenation of tabular data using Polars.
- Documentation: https://joinem.readthedocs.io/
- License: MIT license
- Latest release: 0.10.0 (published 10 months ago)