df-compress
A python package to compress pandas dataframes akin to Stata's `compress` command.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary
Repository
A python package to compress pandas dataframes akin to Stata's `compress` command.
Basic Info
- Host: GitHub
- Owner: phchavesmaia
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://pypi.org/project/df-compress/
- Size: 59.6 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 7
Metadata Files
README.md
df-compress
A Python package to compress pandas DataFrames akin to Stata's `compress` command. This function may prove particularly helpful to those dealing with large datasets.
Installation
You can install df-compress by running the following command:
```bash
pip install df_compress
```
How to use
After installing the package, use the following import:
```python
from df_compress import compress
```
Example
Below is a reproducible example of df-compress usage:
```python
from df_compress import compress
import pandas as pd
import numpy as np

size = 1000000
df = pd.DataFrame(columns=["Year","State","Value","Int_value"])
df.Year = np.random.randint(low=2000,high=2023,size=size).astype(str)
df.State = np.random.choice(['RJ','SP','ES','MT'],size=size)
df.Value = np.random.rand(size,1)
df.Int_value = df.Value*10 // 1

compress(df, show_conversions=True, parallel = False) # which modifies the original DataFrame without needing to reassign it
```
This prints the transformations and the memory saved:
```
Initial memory usage: 114.44 MB
Final memory usage: 7.63 MB
Memory reduced by: 106.81 MB (93.3%)

Variable type conversions:
   column    from       to  memory saved (MB)
     Year  object    int16          48.637264
    State  object category          47.683231
    Value float64  float32           3.814571
Int_value float64     int8           6.675594
```
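To get a feel for where savings of this kind come from, the sketch below applies similar conversions by hand using only pandas and NumPy. It does not call df-compress and is not the package's implementation; the `manual` DataFrame and the smaller `size` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = 100_000  # smaller than the README example, to keep the sketch quick

df = pd.DataFrame({
    "Year": rng.integers(2000, 2023, size=size).astype(str),
    "State": rng.choice(["RJ", "SP", "ES", "MT"], size=size),
    "Value": rng.random(size),
})
before = df.memory_usage(deep=True).sum()

# Hand-written equivalents of the conversions listed in the table above.
# This is NOT df-compress itself, only plain pandas calls.
manual = pd.DataFrame({
    # numeric strings -> smallest integer type that fits the values
    "Year": pd.to_numeric(df["Year"], downcast="integer"),
    # few distinct strings -> category
    "State": df["State"].astype("category"),
    # float64 -> float32 (loses precision beyond about 7 significant digits)
    "Value": df["Value"].astype("float32"),
})
after = manual.memory_usage(deep=True).sum()

print(f"before: {before / 1024**2:.2f} MB, after: {after / 1024**2:.2f} MB")
```

Passing `deep=True` to `memory_usage` matters here: without it, pandas may count only the pointers of string columns rather than the strings themselves.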
Optional Parameters
The function has four optional parameters (arguments):
- convert_strings (bool): Whether to attempt to parse object columns as numbers
- defaults to True
- numeric_threshold (float): Indicates the proportion of valid numeric entries needed to convert a string to numeric
- defaults to 0.999
- show_conversions (bool): whether to report the changes made column by column
- defaults to False
- parallel (bool): whether to compress the columns in parallel
- defaults to False
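One plausible reading of `numeric_threshold` is sketched below: compute the share of entries in a string column that parse as numbers and compare it with the threshold. The helper `numeric_share`, the use of `pd.to_numeric`, and the `>=` comparison are assumptions made for illustration; the package may count missing values or compare differently.

```python
import pandas as pd

NUMERIC_THRESHOLD = 0.999  # the documented default


def numeric_share(series: pd.Series) -> float:
    """Share of entries that parse as numbers (hypothetical helper)."""
    parsed = pd.to_numeric(series, errors="coerce")  # unparseable -> NaN
    return float(parsed.notna().mean())


mostly_numeric = pd.Series(["1", "2", "3", "n/a"])
share = numeric_share(mostly_numeric)

# Assumed decision rule: convert only if the share reaches the threshold.
would_convert = share >= NUMERIC_THRESHOLD
print(share, would_convert)  # 0.75 False
```

Under this reading, the default of 0.999 tolerates only about one non-numeric entry per thousand rows before the column is left as a string.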
Parallelization Caveats
Parallelization is implemented with Dask and a local client, and the work is split across columns. Opting for parallel compression therefore does not guarantee a performance improvement and should be a conscious, case-by-case decision. To illustrate, the example above runs significantly slower with parallel compression (a 0.29x speedup, i.e. roughly 3.4 times slower).
As far as I know, parallelization does not guarantee efficiency because of overhead. Whenever you run code in parallel, the work must first be "organized" (split up and scheduled) before it is computed, which takes time. If the gains from parallelizing the operation do not cover that overhead, you incur a net efficiency loss. Therefore, my recommendation is to opt for parallel compression only when your DataFrame has many columns.
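The overhead argument can be made concrete with a toy cost model: parallel time is a fixed overhead plus the serial time divided evenly among workers. This is an idealized sketch, not how df-compress or Dask actually schedules work, and the function names and numbers are hypothetical rather than measurements.

```python
def estimated_parallel_time(serial_time: float, workers: int, overhead: float) -> float:
    """Idealized model: fixed overhead plus perfectly divided work."""
    return overhead + serial_time / workers


def parallel_is_worthwhile(serial_time: float, workers: int, overhead: float) -> bool:
    """True when the modeled parallel run beats the serial run."""
    return estimated_parallel_time(serial_time, workers, overhead) < serial_time


# Hypothetical numbers, in seconds, chosen only to show both outcomes.
small_job = parallel_is_worthwhile(serial_time=2.0, workers=4, overhead=5.0)
large_job = parallel_is_worthwhile(serial_time=20.0, workers=4, overhead=5.0)
print(small_job, large_job)  # False True
```

In this model the overhead is paid regardless of the amount of work, so only jobs with enough total work, such as DataFrames with many columns, recover it.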
Below is a quick benchmark, run on a 12-CPU computer, to give you a sense of when to use parallel compression:
```python
import os
import sys
from time import time

import numpy as np
import pandas as pd
from df_compress import compress


class HiddenPrints:
    """Context manager that silences stdout while it is active."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout


def timereps(reps, func):
    """Return the average wall-clock time of `func` over `reps` runs."""
    start = time()
    for i in range(0, reps):
        func()
    end = time()
    return (end - start) / reps


def benchmark_compression(df):
    print("Running benchmark on DataFrame with shape:", df.shape, "\n")

    # Each run compresses a fresh deep copy, since compress modifies in place.
    print("Testing non-parallel compression...")
    with HiddenPrints():
        time_non_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=False, show_conversions=False))
    print(f"Non-parallel time: {time_non_parallel:.2f} seconds\n")

    print("Testing parallel compression...")
    with HiddenPrints():
        time_parallel = timereps(10, lambda: compress(df.copy(deep=True), parallel=True, show_conversions=False))
    print(f"Parallel time: {time_parallel:.2f} seconds\n")

    speedup = time_non_parallel / time_parallel if time_parallel > 0 else float('inf')
    print(f"Parallel speedup: {speedup:.2f}x")


def generatetestdataframe(nrows=1000000, nobjectcols=10, nnumericcols=10):
    data = {}
    for i in range(nobjectcols):
        data[f"obj{i}"] = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=nrows)
    for i in range(nnumericcols):
        data[f"num{i}"] = np.random.randn(nrows)
    return pd.DataFrame(data)
```
When testing on a 40-column DataFrame (`benchmark_compression(generatetestdataframe(nobjectcols=20, nnumericcols=20))`), I find that:
```
Running benchmark on DataFrame with shape: (1000000, 40)

Testing non-parallel compression...
Non-parallel time: 17.60 seconds

Testing parallel compression...
Parallel time: 12.06 seconds

Parallel speedup: 1.46x
```
That said, a known issue is that parallel compression breaks down on very large datasets (56M rows and 50+ columns, for example). Addressing this is a top priority on the to-do list.
Owner
- Login: phchavesmaia
- Kind: user
- Repositories: 1
- Profile: https://github.com/phchavesmaia
Citation (CITATION.cff)
cff-version: 0.7.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chaves Maia"
    given-names: "Pedro Henrique"
    orcid: "https://orcid.org/0009-0003-4869-3352"
title: "df-compress"
date-released: 2025-06-11
url: "https://github.com/phchavesmaia/df-compress"
language: "Python"
doi: "https://doi.org/10.5281/zenodo.15148480"
GitHub Events
Total
- Release event: 13
- Push event: 40
- Create event: 8
Last Year
- Release event: 13
- Push event: 40
- Create event: 8
Packages
- Total packages: 1
- Total downloads:
  - pypi: 49 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 12
- Total maintainers: 1
pypi.org: df-compress
A python package to compress pandas DataFrames akin to Stata's `compress` command
- Homepage: https://github.com/phchavesmaia/df-compress
- Documentation: https://df-compress.readthedocs.io/
- License: MIT
- Latest release: 0.7.0 (published 12 months ago)
Rankings
Maintainers (1)
Dependencies
- numpy *
- pandas *