df-compress

A python package to compress pandas dataframes akin to Stata's `compress` command.

https://github.com/phchavesmaia/df-compress

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: phchavesmaia
License: mit
Language: Python
Default Branch: main
Homepage: https://pypi.org/project/df-compress/
Size: 59.6 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 7

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

df-compress

A Python package to compress pandas DataFrames akin to Stata's `compress` command. This function may prove particularly helpful to those dealing with large datasets.

Installation

You can install df-compress by running the following command: `pip install df_compress`

How to use

After installing the package, use the following import: `from df_compress import compress`

Example

Below is a reproducible example of df-compress usage:

```python
from df_compress import compress
import pandas as pd
import numpy as np

size = 1000000
df = pd.DataFrame(columns=["Year","State","Value","Int_value"])
df.Year = np.random.randint(low=2000,high=2023,size=size).astype(str)  # years stored as strings
df.State = np.random.choice(['RJ','SP','ES','MT'],size=size)  # few distinct string values
df.Value= np.random.rand(size,1)  # floats in [0, 1)
df.Int_value = df.Value*10 // 1  # whole numbers stored as floats

compress(df, show_conversions=True, parallel = False) # which modifies the original DataFrame without needing to reassign it
```

This prints the transformations applied and the memory saved:

```
Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)

Variable type conversions:
   column     from        to  memory saved (MB)
     Year   object     int16          48.637264
    State   object  category          47.683231
    Value  float64   float32           3.814571
Int_value  float64      int8           6.675594
```
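The conversions reported above (years to `int16`, single digits to `int8`) are consistent with picking the narrowest integer type that can hold a column's range. The following is a minimal sketch of that idea using only the standard library. The helper names `smallest_int_type` and `saved_mib` are hypothetical, and the package's actual selection rules may differ.

```python
# Illustrative sketch only; not taken from the df-compress source code.
# Value ranges of the signed integer widths, narrowest first.
INT_TYPES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

# Bytes per element for the fixed-width types used in this sketch.
BYTES = {"int8": 1, "int16": 2, "int32": 4, "int64": 8,
         "float32": 4, "float64": 8}


def smallest_int_type(values):
    """Return the narrowest signed integer type that holds every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in int64")


def saved_mib(n_rows, old_type, new_type):
    """Approximate memory saved (MiB) by narrowing one fixed-width column."""
    return n_rows * (BYTES[old_type] - BYTES[new_type]) / 2**20


print(smallest_int_type(range(2000, 2023)))                   # int16
print(smallest_int_type(range(0, 10)))                        # int8
print(round(saved_mib(1_000_000, "float64", "int8"), 2))      # 6.68
print(round(saved_mib(1_000_000, "float64", "float32"), 2))   # 3.81
```

The two savings figures are close to the values reported for `Int_value` and `Value` in the example output, which suggests the reported numbers are per-column byte differences expressed in units of 2^20 bytes.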

Optional Parameters

The function has four optional parameters (arguments):
- convert_strings (bool): whether to attempt to parse object columns as numbers - defaults to True
- numeric_threshold (float): the proportion of valid numeric entries needed to convert a string column to numeric - defaults to 0.999
- show_conversions (bool): whether to report the changes made column by column - defaults to False
- parallel (bool): whether to compress the columns in parallel - defaults to False
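To make the meaning of `numeric_threshold` concrete, here is a small standard-library sketch of a proportion check on a string column. The helpers `numeric_share` and `should_convert` are hypothetical names introduced for illustration. How df-compress actually parses entries, and whether it compares with `>=` or `>`, is an assumption here.

```python
# Illustrative sketch only; not the df-compress implementation.
def numeric_share(values):
    """Return the fraction of entries that float() can parse."""
    def is_numeric(text):
        try:
            float(text)
            return True
        except (TypeError, ValueError):
            return False

    values = list(values)
    if not values:
        return 0.0
    return sum(is_numeric(v) for v in values) / len(values)


def should_convert(values, numeric_threshold=0.999):
    """Assumed rule: convert when the numeric share reaches the threshold."""
    return numeric_share(values) >= numeric_threshold


years = ["2001", "2015", "2022", "n/a"]
print(numeric_share(years))                           # 0.75
print(should_convert(years))                          # False
print(should_convert(years, numeric_threshold=0.7))   # True
```

With the default of 0.999, a column of one million strings would tolerate at most about a thousand non-numeric entries under this reading, so a handful of placeholder values would not block the conversion.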

Parallelization Caveats

The parallelization is implemented using Dask with a local client, and the work is parallelized across columns. Thus, opting for parallel compression does not guarantee performance improvements and should be a conscious decision made on a case-by-case basis. To illustrate this point, the example above runs significantly slower with parallel compression (a 0.29x speedup, i.e. roughly 3.4 times slower).

As far as I know, parallelization does not guarantee efficiency because of overhead. Whenever you run code in parallel, the work must first be "organized" before the operation is computed, which takes time. If the gains from parallelizing the operation do not cover that overhead, you incur a net efficiency loss. Therefore, my recommendation is to opt for parallel compression only when you have a DataFrame with many columns.
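The overhead argument can be put into a toy model: parallel wall time is a fixed setup cost plus the serial time divided among workers. The sketch below assumes perfectly divisible work and uses made-up timings. It is an idealization for intuition, not a description of how Dask schedules tasks, and the function names are hypothetical.

```python
# Toy model only; all timings below are hypothetical, in seconds.
def parallel_time(serial_time, workers, overhead):
    """Idealized wall time: fixed setup cost plus evenly divided work."""
    return overhead + serial_time / workers


def speedup(serial_time, workers, overhead):
    """Serial time divided by modeled parallel time (>1 means faster)."""
    return serial_time / parallel_time(serial_time, workers, overhead)


def pays_off(serial_time, workers, overhead):
    """Parallel wins only if the overhead is smaller than the time saved."""
    return overhead < serial_time * (1 - 1 / workers)


# Small job: the 3 s setup cost dominates 1 s of actual work.
print(round(speedup(1.0, 4, 3.0), 2))    # 0.31
print(pays_off(1.0, 4, 3.0))             # False

# Large job: the same setup cost is amortized over 60 s of work.
print(round(speedup(60.0, 4, 3.0), 2))   # 3.33
print(pays_off(60.0, 4, 3.0))            # True
```

In this model the same overhead yields a slowdown for a small job and a speedup for a large one, which matches the qualitative pattern described above: a few columns lose, many columns can win.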

Below is a quick benchmark on a 12-CPU computer to give you perspective on when to use parallel compression:

```python
import pandas as pd
from df_compress import compress
import sys
import os
import numpy as np
from time import time


class HiddenPrints:
    """Context manager that silences stdout while the benchmark runs."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout


def timereps(reps, func):
    # Average wall time of `func` over `reps` runs
    start = time()
    for i in range(0, reps):
        func()
    end = time()
    return (end - start) / reps


def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")

    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")

    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")

    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")


def generate_test_dataframe(n_rows=1000000, n_object_cols=10, n_numeric_cols=10):
    # Build a frame with string columns and float columns
    data = {}
    for i in range(n_object_cols):
        data[f"obj{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=n_rows)
    for i in range(n_numeric_cols):
        data[f"num{i}"] = np.random.randn(n_rows)
    return pd.DataFrame(data)
```

When testing a 40-column DataFrame (`benchmark_compression(generate_test_dataframe(n_object_cols=20, n_numeric_cols=20))`), I find that:

```
Running benchmark on DataFrame with shape: (1000000, 40)

Testing non-parallel compression...
Non-parallel time: 17.60 seconds

Testing parallel compression...
Parallel time: 12.06 seconds

Parallel speedup: 1.46x
```

That said, a known issue is that parallel compression breaks down when dealing with really large datasets (56M rows and 50+ columns, for example). Addressing this is a top priority on the to-do list.

Owner

Login: phchavesmaia
Kind: user

Repositories: 1
Profile: https://github.com/phchavesmaia

Citation (CITATION.cff)

cff-version: 0.7.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chaves Maia"
  given-names: "Pedro Henrique"
  orcid: "https://orcid.org/0009-0003-4869-3352"
title: "df-compress"
date-released: 2025-06-11
url: "https://github.com/phchavesmaia/df-compress"
language: "Python"
doi: "https://doi.org/10.5281/zenodo.15148480"

GitHub Events

Total

Release event: 13
Push event: 40
Create event: 8

Last Year

Release event: 13
Push event: 40
Create event: 8

Packages

Total packages: 1
Total downloads:
- pypi 49 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 12
Total maintainers: 1

pypi.org: df-compress

A python package to compress pandas DataFrames akin to Stata's `compress` command

Homepage: https://github.com/phchavesmaia/df-compress
Documentation: https://df-compress.readthedocs.io/
License: MIT
Latest release: 0.7.0
published about 1 year ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 49 Last month

Rankings

Dependent packages count: 9.4%

Average: 31.0%

Dependent repos count: 52.7%

Maintainers (1)

phchavesmaia

Last synced: 11 months ago

Dependencies

setup.py pypi

numpy *
pandas *
