df-compress

A python package to compress pandas dataframes akin to Stata's `compress` command.

https://github.com/phchavesmaia/df-compress

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary
Last synced: 9 months ago

Repository


Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 7
Created about 1 year ago · Last pushed 11 months ago

README.md


df-compress

A Python package to compress pandas DataFrames, akin to Stata's `compress` command. It may prove particularly helpful to those dealing with large datasets.

Installation

You can install df-compress by running `pip install df_compress`.

How to use

After installing the package, import the function with `from df_compress import compress`.

Example

The following reproducible example shows how to use df-compress:
```python
from df_compress import compress
import pandas as pd
import numpy as np

size = 1000000
df = pd.DataFrame(columns=["Year", "State", "Value", "Int_value"])
df["Year"] = np.random.randint(low=2000, high=2023, size=size).astype(str)
df["State"] = np.random.choice(['RJ', 'SP', 'ES', 'MT'], size=size)
df["Value"] = np.random.rand(size)  # 1-D array, one value per row
df["Int_value"] = df["Value"] * 10 // 1

# Modifies the original DataFrame in place, so no reassignment is needed
compress(df, show_conversions=True, parallel=False)
```
This prints the transformations and the memory saved:
```
Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)

Variable type conversions:
   column     from        to  memory saved (MB)
     Year   object     int16          48.637264
    State   object  category          47.683231
    Value  float64   float32           3.814571
Int_value  float64      int8           6.675594
```
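
To give a feel for where savings like these come from, the sketch below applies similar conversions by hand with plain pandas. It is not the package's implementation: the helper `shrink_column` and its rules are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def shrink_column(s: pd.Series) -> pd.Series:
    """Hypothetical helper: return `s` stored in a smaller dtype when possible."""
    if pd.api.types.is_float_dtype(s):
        # Floats that only hold whole numbers can be stored as small integers.
        if s.notna().all() and (s % 1 == 0).all():
            return pd.to_numeric(s.astype("int64"), downcast="integer")
        return pd.to_numeric(s, downcast="float")
    if pd.api.types.is_integer_dtype(s):
        return pd.to_numeric(s, downcast="integer")
    if pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
        # Low-cardinality strings are cheaper as categoricals.
        return s.astype("category")
    return s


n = 1000
df = pd.DataFrame({
    "State": np.random.choice(["RJ", "SP", "ES", "MT"], size=n),
    "Value": np.random.rand(n),
    "Int_value": np.floor(np.random.rand(n) * 10),
})

before = df.memory_usage(deep=True).sum()
# Build a new frame column by column so each column keeps its new dtype.
small = pd.DataFrame({name: shrink_column(df[name]) for name in df.columns})
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"Memory reduced by {(before - after) / before:.1%}")
```

Unlike `compress`, this sketch returns a new DataFrame instead of modifying the original one.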

Optional Parameters

The function has four optional parameters (arguments):
- `convert_strings` (bool): whether to attempt to parse object columns as numbers. Defaults to `True`.
- `numeric_threshold` (float): the proportion of valid numeric entries needed to convert a string column to numeric. Defaults to `0.999`.
- `show_conversions` (bool): whether to report the changes made column by column. Defaults to `False`.
- `parallel` (bool): whether to compress the columns in parallel. Defaults to `False`.
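
As an illustration of what a threshold such as `numeric_threshold` expresses, the sketch below measures the share of entries in a string column that parse as numbers and converts the column only when that share is high enough. The helpers `numeric_share` and `maybe_to_numeric` are hypothetical, and measuring the share over non-missing entries is an assumption; the package may compute it differently.

```python
import pandas as pd


def numeric_share(s: pd.Series) -> float:
    """Fraction of non-missing entries that parse as numbers."""
    non_missing = s.dropna()
    if non_missing.empty:
        return 0.0
    parsed = pd.to_numeric(non_missing, errors="coerce")
    return float(parsed.notna().mean())


def maybe_to_numeric(s: pd.Series, numeric_threshold: float = 0.999) -> pd.Series:
    """Hypothetical helper: convert only if enough entries are numeric."""
    if numeric_share(s) >= numeric_threshold:
        # Entries that fail to parse become NaN.
        return pd.to_numeric(s, errors="coerce")
    return s


years = pd.Series(["2001", "2002", "n/a", "2004"])
share = numeric_share(years)                                  # 3 of 4 entries parse
kept = maybe_to_numeric(years)                                # below 0.999: left as strings
converted = maybe_to_numeric(years, numeric_threshold=0.75)   # converted, "n/a" -> NaN
```

A high default such as 0.999 keeps columns with more than a handful of non-numeric entries as strings, which avoids silently turning meaningful text into missing values.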

Parallelization Caveats

Parallelization is implemented with Dask and a local client, and the work is split across columns. Opting for parallel compression therefore does not guarantee a performance improvement and should be a conscious decision made case by case. For instance, the example above runs significantly slower with parallel compression (a speedup of 0.29x).
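
Column-level parallelism can be pictured as one task per column whose results are reassembled in the original column order. The sketch below uses the standard library's `concurrent.futures` instead of Dask, purely to keep the example self-contained; `downcast` and `compress_columns` are hypothetical stand-ins, not functions from the package.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def downcast(s: pd.Series) -> pd.Series:
    """Stand-in for per-column compression: downcast numeric columns only."""
    if pd.api.types.is_integer_dtype(s):
        return pd.to_numeric(s, downcast="integer")
    if pd.api.types.is_float_dtype(s):
        return pd.to_numeric(s, downcast="float")
    return s


def compress_columns(df: pd.DataFrame, max_workers: int = 4) -> pd.DataFrame:
    """Hypothetical helper: one task per column, results kept in column order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with df.columns.
        results = list(pool.map(downcast, (df[name] for name in df.columns)))
    return pd.DataFrame(dict(zip(df.columns, results)))


df = pd.DataFrame({
    "a": np.arange(1000, dtype="int64"),
    "b": np.random.rand(1000),
})
out = compress_columns(df)
print(out.dtypes)
```

With only two columns there are at most two tasks to distribute, which is why a DataFrame with few columns has little to gain from this kind of parallelism.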

As far as I know, parallelization does not guarantee efficiency because of overhead. Whenever you run code in parallel, the work must be "organized" before it is computed, which takes time. If the gains from parallelizing do not cover that overhead, you incur an efficiency loss. My recommendation is therefore to opt for parallel compression only when your DataFrame has many columns.
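
The overhead argument can be made concrete with a toy cost model: parallel time is a fixed overhead plus the load of the busiest worker. The per-column cost and the overhead used below are made-up illustrative numbers, not measurements of df-compress, and the function names are hypothetical.

```python
def parallel_time(column_seconds, workers, overhead_seconds):
    """Estimated wall time: fixed overhead plus the busiest worker's load."""
    loads = [0.0] * workers
    # Greedy scheduling: hand each column (longest first) to the least-loaded worker.
    for cost in sorted(column_seconds, reverse=True):
        loads[loads.index(min(loads))] += cost
    return overhead_seconds + max(loads)


def speedup(column_seconds, workers, overhead_seconds):
    """Serial time divided by estimated parallel time (below 1 means slower)."""
    serial = sum(column_seconds)
    return serial / parallel_time(column_seconds, workers, overhead_seconds)


# Assumed numbers: 0.5 s per column, 5 s of fixed overhead, 12 workers.
few = speedup([0.5] * 4, workers=12, overhead_seconds=5.0)    # 4 columns
many = speedup([0.5] * 40, workers=12, overhead_seconds=5.0)  # 40 columns
print(f"4 columns: {few:.2f}x, 40 columns: {many:.2f}x")
```

Under these assumptions the 4-column case is slower in parallel while the 40-column case is faster, because the fixed overhead is spread over more work.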

The following quick benchmark, run on a computer with 12 CPUs, gives some perspective on when to use parallel compression:
```python
import os
import sys
from time import time

import numpy as np
import pandas as pd
from df_compress import compress


class HiddenPrints:
    """Context manager that silences stdout while the benchmark runs."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout


def timereps(reps, func):
    """Return the average wall-clock time of `func` over `reps` runs."""
    start = time()
    for _ in range(reps):
        func()
    end = time()
    return (end - start) / reps


def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")

    # Each run compresses a fresh deep copy, since compress modifies its input
    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")

    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")

    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")


def generate_test_dataframe(n_rows=1000000, n_object_cols=10, n_numeric_cols=10):
    """Build a DataFrame with the requested number of string and float columns."""
    data = {}
    for i in range(n_object_cols):
        data[f"obj{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=n_rows)
    for i in range(n_numeric_cols):
        data[f"num{i}"] = np.random.randn(n_rows)
    return pd.DataFrame(data)
```
When testing a 40-column DataFrame (`benchmark_compression(generate_test_dataframe(n_object_cols=20, n_numeric_cols=20))`), I find that:
```
Running benchmark on DataFrame with shape: (1000000, 40)

Testing non-parallel compression...
Non-parallel time: 17.60 seconds

Testing parallel compression...
Parallel time: 12.06 seconds

Parallel speedup: 1.46x
```
That said, a known issue is that parallel compression breaks down when dealing with really large datasets (56M rows and 50+ columns, for example). Addressing this is a top priority on the to-do list.

Owner

  • Login: phchavesmaia
  • Kind: user

Citation (CITATION.cff)

cff-version: 0.7.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chaves Maia"
  given-names: "Pedro Henrique"
  orcid: "https://orcid.org/0009-0003-4869-3352"
title: "df-compress"
date-released: 2025-06-11
url: "https://github.com/phchavesmaia/df-compress"
language: "Python"
doi: "https://doi.org/10.5281/zenodo.15148480"

GitHub Events

Total
  • Release event: 13
  • Push event: 40
  • Create event: 8
Last Year
  • Release event: 13
  • Push event: 40
  • Create event: 8

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 49 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 12
  • Total maintainers: 1
pypi.org: df-compress

A python package to compress pandas DataFrames akin to Stata's `compress` command

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 49 Last month
Rankings
Dependent packages count: 9.4%
Average: 31.0%
Dependent repos count: 52.7%
Maintainers (1)
Last synced: 10 months ago

Dependencies

setup.py pypi
  • numpy *
  • pandas *