joinem

CLI for fast, flexible concatenation of tabular data using polars

https://github.com/mmore500/joinem

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

CLI for fast, flexible concatenation of tabular data using polars

Basic Info
  • Host: GitHub
  • Owner: mmore500
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 183 KB
Statistics
  • Stars: 16
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md


joinem provides a CLI for fast, flexible concatenation of tabular data using polars.

Install

```
python3 -m pip install joinem
```

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Add programmatically-generated columns to output.
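
The `--how diagonal` behavior above can be pictured with a stdlib-only sketch of its semantics (a hypothetical helper for illustration only, not joinem's implementation; joinem delegates to polars' `concat`):

```python
# Sketch of what `--how diagonal` does conceptually: rows from frames
# with mismatched columns are stacked, and columns missing from a given
# frame are filled with nulls (None here). Frames are modeled as
# dicts mapping column name -> list of values.

def concat_diagonal(frames):
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for frame in frames:
        for name in frame:
            if name not in columns:
                columns.append(name)
    # Stack rows, padding absent columns with None.
    rows = []
    for frame in frames:
        n_rows = len(next(iter(frame.values()), []))
        for i in range(n_rows):
            rows.append(
                {c: (frame[c][i] if c in frame else None) for c in columns}
            )
    return columns, rows

columns, rows = concat_diagonal([
    {"a": [1, 2], "b": [3, 4]},   # first file: columns a, b
    {"b": [5], "c": [6]},         # second file: columns b, c
])
# columns → ["a", "b", "c"]; the third row pads "a" with None
```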

Example Usage

Pass input filenames via stdin, one filename per line.

```
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
```

Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and csv.

```
find -name '*.parquet' | python3 -m joinem out.json
```
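
The extension-to-filetype mapping can be pictured with a small stdlib-only sketch (hypothetical helper and mapping; joinem's actual inference logic may differ, though the `.pqt` shorthand does appear in the examples in this README):

```python
from pathlib import Path

# Hypothetical sketch of extension-based output-type inference.
SUFFIX_TO_FILETYPE = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".pqt": "parquet",
    ".feather": "feather",
}

def infer_output_filetype(filename: str) -> str:
    # Look up the (lowercased) file extension in the mapping.
    suffix = Path(filename).suffix.lower()
    return SUFFIX_TO_FILETYPE[suffix]

print(infer_output_filetype("out.pqt"))   # → parquet
print(infer_output_filetype("out.json"))  # → json
```

When the extension is ambiguous or absent (e.g. writing to `/dev/stdout`), the real CLI falls back to the explicit `--output-filetype` flag documented below.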

If file columns may mismatch, use --how diagonal.

```
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
```

If some files may be empty, use --how diagonal_relaxed.

To run via Singularity/Apptainer:

```
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
```

Advanced Usage

Add literal value column to output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
```

Cast a column to categorical in the output, shrink dtypes, and tune compression.

```
ls -1 *.csv | python3 -m joinem out.pqt \
    --with-column 'pl.col("uuid").cast(pl.Categorical)' \
    --string-cache \
    --shrink-dtypes \
    --write-kwarg 'compression_level=10' \
    --write-kwarg 'compression="zstd"'
```

Alias an existing column in the output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
```

Apply regex on source datafile paths to create new column in output.

```
ls -1 path/to/*.csv | python3 -m joinem out.csv \
    --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
```
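
As a quick sanity check of the pattern above (independent of joinem), the same extraction can be reproduced with Python's stdlib `re`, where the capture-group backreference is written `\1` instead of polars' `${1}`:

```python
import re

# Same pattern as the --with-column example: lazily skip leading
# directories, then capture the final path component before ".csv".
pattern = r".*?([^/]*)\.csv"
stem = re.sub(pattern, r"\1", "path/to/my_data.csv")
print(stem)  # → my_data
```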

Read data from stdin and write data to stdout.

```
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
    --output-filetype csv --input-filetype csv
```

Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.

```
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
    --with-column 'pl.col("myValue").cast(pl.Categorical)' \
    --write-kwarg 'compression="lz4"' \
    | pv > concat.pqt
```

API

```
usage: main.py [-h] [--version] [--progress] [--stdin] [--drop DROP]
               [--eager-read] [--eager-write] [--filter FILTERS]
               [--head HEAD] [--tail TAIL] [--gather-every GATHER_EVERY]
               [--sample SAMPLE] [--shuffle] [--seed SEED]
               [--with-column WITH_COLUMNS] [--shrink-dtypes]
               [--string-cache]
               [--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
               [--input-filetype INPUT_FILETYPE]
               [--output-filetype OUTPUT_FILETYPE]
               [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
               output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars
                        DataFrame.filter. Example: 'pl.col("thing") == 0'
  --head HEAD           Number of rows to include in output, counting from
                        front.
  --tail TAIL           Number of rows to include in output, counting from
                        back.
  --gather-every GATHER_EVERY
                        Take every nth row.
  --sample SAMPLE       Number of rows to include in output, sampled
                        uniformly. Pass --seed for deterministic behavior.
  --shuffle             Should output be shuffled? Pass --seed for
                        deterministic behavior.
  --seed SEED           Integer seed for deterministic behavior.
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with
                        access to each datafile's filepath as `filepath` and
                        polars as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required
                        datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
                        How to concatenate frames. See the polars `concat`
                        documentation for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example:
                        csv, parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example:
find path/to/ -name '*.csv' | python3 -m joinem out.csv
```
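
The `--read-kwarg`/`--write-kwarg` flags above take `'key=value'` strings whose values are evaluated as Python expressions. A minimal stdlib sketch of how such a flag could be split and evaluated (a hypothetical helper, not joinem's actual implementation; `ast.literal_eval` is used here instead of full expression evaluation, which covers literal values like quoted strings and `None`):

```python
import ast

def parse_kwarg(arg: str):
    # Split on the first '=' only, so values may themselves contain '='.
    key, _, value = arg.partition("=")
    # Evaluate the value text as a Python literal.
    return key, ast.literal_eval(value)

print(parse_kwarg('compression="lz4"'))        # → ('compression', 'lz4')
print(parse_kwarg("infer_schema_length=None"))  # → ('infer_schema_length', None)
```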

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

bibtex:

```
@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}
```

And don't forget to leave a star on GitHub!

Owner

  • Name: Matthew Andres Moreno
  • Login: mmore500
  • Kind: user
  • Location: East Lansing, MI
  • Company: @devosoft

doctoral student, Computer Science and Engineering at Michigan State University

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
title: 'joinem: a Python library for data collation'
abstract: "joinem provides a CLI for fast, flexible concatenation of tabular data."
authors:
- family-names: Moreno
  given-names: Matthew Andres
  orcid: 0000-0003-4726-4479
date-released: 2024-02-24
doi: 10.5281/zenodo.10701182
license: MIT
repository-code: https://github.com/mmore500/joinem
url: "https://github.com/mmore500/joinem"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 5
  • Delete event: 4
  • Push event: 29
  • Pull request event: 7
  • Create event: 10
Last Year
  • Issues event: 3
  • Watch event: 5
  • Delete event: 4
  • Push event: 29
  • Pull request event: 7
  • Create event: 10

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 87
  • Total Committers: 2
  • Avg Commits per committer: 43.5
  • Development Distribution Score (DDS): 0.011
Past Year
  • Commits: 56
  • Committers: 2
  • Avg Commits per committer: 28.0
  • Development Distribution Score (DDS): 0.018
Top Committers
Name Email Commits
Matthew Andres Moreno m****g@g****m 86
vivaansinghvi07 v****8@g****m 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 3
  • Total pull requests: 8
  • Average time to close issues: 24 days
  • Average time to close pull requests: 8 minutes
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 8
  • Average time to close issues: 24 days
  • Average time to close pull requests: 8 minutes
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mmore500 (3)
Pull Request Authors
  • mmore500 (13)
  • vivaansinghvi07 (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 7,069 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 19
  • Total maintainers: 1
pypi.org: joinem

CLI for fast, flexible concatenation of tabular data using Polars.

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 7,069 Last month
Rankings
Dependent packages count: 9.8%
Average: 37.4%
Dependent repos count: 64.9%
Maintainers (1)
Last synced: 7 months ago