joinem

CLI for fast, flexible concatenation of tabular data using polars

https://github.com/mmore500/joinem

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

CLI for fast, flexible concatenation of tabular data using polars

Basic Info
  • Host: GitHub
  • Owner: mmore500
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 183 KB
Statistics
  • Stars: 16
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md


joinem provides a CLI for fast, flexible concatenation of tabular data using polars.

Install

```
python3 -m pip install joinem
```

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Add programmatically-generated columns to output.
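
The `--how diagonal` behavior above can be pictured with a stdlib-only sketch of its semantics (a hypothetical helper for illustration only, not joinem's implementation; joinem delegates to polars' `concat`):

```python
# Sketch of what `--how diagonal` does conceptually: rows from frames
# with mismatched columns are stacked, and columns missing from a given
# frame are filled with nulls (None here). Frames are modeled as
# dicts mapping column name -> list of values.

def concat_diagonal(frames):
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for frame in frames:
        for name in frame:
            if name not in columns:
                columns.append(name)
    # Stack rows, padding absent columns with None.
    rows = []
    for frame in frames:
        n_rows = len(next(iter(frame.values()), []))
        for i in range(n_rows):
            rows.append(
                {c: (frame[c][i] if c in frame else None) for c in columns}
            )
    return columns, rows

columns, rows = concat_diagonal([
    {"a": [1, 2], "b": [3, 4]},   # first file: columns a, b
    {"b": [5], "c": [6]},         # second file: columns b, c
])
# columns → ["a", "b", "c"]; the third row pads "a" with None
```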

Example Usage

Pass input filenames via stdin, one filename per line.

```
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
```

Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and csv.

```
find -name '*.parquet' | python3 -m joinem out.json
```
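
The extension-to-filetype mapping can be pictured with a small stdlib-only sketch (hypothetical helper and mapping; joinem's actual inference logic may differ, though the `.pqt` shorthand does appear in the examples in this README):

```python
from pathlib import Path

# Hypothetical sketch of extension-based output-type inference.
SUFFIX_TO_FILETYPE = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".pqt": "parquet",
    ".feather": "feather",
}

def infer_output_filetype(filename: str) -> str:
    # Look up the (lowercased) file extension in the mapping.
    suffix = Path(filename).suffix.lower()
    return SUFFIX_TO_FILETYPE[suffix]

print(infer_output_filetype("out.pqt"))   # → parquet
print(infer_output_filetype("out.json"))  # → json
```

When the extension is ambiguous or absent (e.g. writing to `/dev/stdout`), the real CLI falls back to the explicit `--output-filetype` flag documented below.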

If file columns may mismatch, use --how diagonal.

```
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
```

If some files may be empty, use --how diagonal_relaxed.

To run via Singularity/Apptainer:

```
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
```

Advanced Usage

Add literal value column to output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
```

Cast a column to categorical in the output, shrink dtypes, and tune compression.

```
ls -1 *.csv | python3 -m joinem out.pqt \
    --with-column 'pl.col("uuid").cast(pl.Categorical)' \
    --string-cache \
    --shrink-dtypes \
    --write-kwarg 'compression_level=10' \
    --write-kwarg 'compression="zstd"'
```

Alias an existing column in the output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
```

Apply regex on source datafile paths to create new column in output.

```
ls -1 path/to/*.csv | python3 -m joinem out.csv \
    --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
```
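
As a quick sanity check of the pattern above (independent of joinem), the same extraction can be reproduced with Python's stdlib `re`, where the capture-group backreference is written `\1` instead of polars' `${1}`:

```python
import re

# Same pattern as the --with-column example: lazily skip leading
# directories, then capture the final path component before ".csv".
pattern = r".*?([^/]*)\.csv"
stem = re.sub(pattern, r"\1", "path/to/my_data.csv")
print(stem)  # → my_data
```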

Read data from stdin and write data to stdout.

```
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
    --output-filetype csv --input-filetype csv
```

Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.

```
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
    --with-column 'pl.col("myValue").cast(pl.Categorical)' \
    --write-kwarg 'compression="lz4"' \
    | pv > concat.pqt
```

API

```
usage: main.py [-h] [--version] [--progress] [--stdin] [--drop DROP]
               [--eager-read] [--eager-write] [--filter FILTERS]
               [--head HEAD] [--tail TAIL] [--gather-every GATHER_EVERY]
               [--sample SAMPLE] [--shuffle] [--seed SEED]
               [--with-column WITH_COLUMNS] [--shrink-dtypes]
               [--string-cache]
               [--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
               [--input-filetype INPUT_FILETYPE]
               [--output-filetype OUTPUT_FILETYPE]
               [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
               output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars
                        DataFrame.filter. Example: 'pl.col("thing") == 0'
  --head HEAD           Number of rows to include in output, counting from
                        front.
  --tail TAIL           Number of rows to include in output, counting from
                        back.
  --gather-every GATHER_EVERY
                        Take every nth row.
  --sample SAMPLE       Number of rows to include in output, sampled
                        uniformly. Pass --seed for deterministic behavior.
  --shuffle             Should output be shuffled? Pass --seed for
                        deterministic behavior.
  --seed SEED           Integer seed for deterministic behavior.
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with
                        access to each datafile's filepath as `filepath` and
                        polars as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required
                        datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
                        How to concatenate frames. See the polars `concat`
                        documentation for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example:
                        csv, parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example:
find path/to/ -name '*.csv' | python3 -m joinem out.csv
```
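
The `--read-kwarg`/`--write-kwarg` flags above take `'key=value'` strings whose values are evaluated as Python expressions. A minimal stdlib sketch of how such a flag could be split and evaluated (a hypothetical helper, not joinem's actual implementation; `ast.literal_eval` is used here instead of full expression evaluation, which covers literal values like quoted strings and `None`):

```python
import ast

def parse_kwarg(arg: str):
    # Split on the first '=' only, so values may themselves contain '='.
    key, _, value = arg.partition("=")
    # Evaluate the value text as a Python literal.
    return key, ast.literal_eval(value)

print(parse_kwarg('compression="lz4"'))        # → ('compression', 'lz4')
print(parse_kwarg("infer_schema_length=None"))  # → ('infer_schema_length', None)
```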

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

bibtex:

```
@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}
```

And don't forget to leave a star on GitHub!

Owner

  • Name: Matthew Andres Moreno
  • Login: mmore500
  • Kind: user
  • Location: East Lansing, MI
  • Company: @devosoft

doctoral student, Computer Science and Engineering at Michigan State University

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
title: 'joinem: a Python library for data collation'
abstract: "joinem provides a CLI for fast, flexible concatenation of tabular data."
authors:
- family-names: Moreno
  given-names: Matthew Andres
  orcid: 0000-0003-4726-4479
date-released: 2024-02-24
doi: 10.5281/zenodo.10701182
license: MIT
repository-code: https://github.com/mmore500/joinem
url: "https://github.com/mmore500/joinem"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 5
  • Delete event: 4
  • Push event: 29
  • Pull request event: 7
  • Create event: 10
Last Year
  • Issues event: 3
  • Watch event: 5
  • Delete event: 4
  • Push event: 29
  • Pull request event: 7
  • Create event: 10

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 87
  • Total Committers: 2
  • Avg Commits per committer: 43.5
  • Development Distribution Score (DDS): 0.011
Past Year
  • Commits: 56
  • Committers: 2
  • Avg Commits per committer: 28.0
  • Development Distribution Score (DDS): 0.018
Top Committers
Name Email Commits
Matthew Andres Moreno m****g@g****m 86
vivaansinghvi07 v****8@g****m 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 3
  • Total pull requests: 8
  • Average time to close issues: 24 days
  • Average time to close pull requests: 8 minutes
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 8
  • Average time to close issues: 24 days
  • Average time to close pull requests: 8 minutes
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mmore500 (3)
Pull Request Authors
  • mmore500 (13)
  • vivaansinghvi07 (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 7,069 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 19
  • Total maintainers: 1
pypi.org: joinem

CLI for fast, flexible concatenation of tabular data using Polars.

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 7,069 Last month
Rankings
Dependent packages count: 9.8%
Average: 37.4%
Dependent repos count: 64.9%
Maintainers (1)
Last synced: 7 months ago