https://github.com/brycewang-stanford/pdtab
A pandas-based library that replicates Stata's tabulate functionality
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: brycewang-stanford
- License: mit
- Language: Python
- Default Branch: main
- Size: 64.5 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
pdtab: Pandas-based Tabulation Library
pdtab is a comprehensive Python library that replicates the functionality of Stata's tabulate command using pandas as the backend. This library provides powerful one-way, two-way, and summary tabulations with statistical tests and measures of association.
Overview
Stata's tabulate command is one of the most widely used tools for creating frequency tables and cross-tabulations in statistical analysis. pdtab brings this functionality to Python, offering:
- Complete Stata compatibility: Replicates all major features of Stata's tabulate command
- Statistical tests: Chi-square tests, Fisher's exact test, likelihood-ratio tests
- Association measures: Cramér's V, Goodman and Kruskal's gamma, Kendall's τb
- Flexible output: Console tables, HTML, and visualization options
- Weighted analysis: Support for frequency, analytic, and importance weights
- Missing value handling: Comprehensive options for dealing with missing data
Integration with Broader Ecosystem
pdtab is part of a comprehensive econometric and statistical analysis ecosystem:
PyStataR
The pdtab library will be integrated into PyStataR, a comprehensive Python package that bridges Stata and R functionality in Python. PyStataR aims to provide Stata users with familiar commands and workflows while leveraging Python's powerful data science ecosystem.
StasPAI
For users interested in AI-powered econometric analysis, StasPAI is a related project that integrates statistical analysis with artificial intelligence methods. It provides econometric modeling capabilities enhanced by machine learning approaches.
These projects together form a unified toolkit for modern econometric analysis, combining the best of Stata's user-friendly interface, R's statistical capabilities, and Python's machine learning ecosystem.
Installation
```bash
pip install pdtab
```
Or install from source:
```bash
git clone https://github.com/brycewang-stanford/pdtab.git
cd pdtab
pip install -e .
```
Requirements
- Python 3.8+
- pandas >= 1.0.0
- numpy >= 1.18.0
- scipy >= 1.4.0
- matplotlib >= 3.0.0 (for plotting)
- seaborn >= 0.11.0 (for enhanced plotting)
🎯 Design Philosophy
pdtab is designed as a pure Python library focused exclusively on providing Stata's tabulate functionality through a clean, programmatic API.
Key Design Decisions:
- No Command-Line Interface: pdtab is intentionally designed as a library-only package to maintain simplicity and focus on the Python ecosystem
- Jupyter-First Approach: Optimized for data science workflows in Jupyter notebooks and Python scripts
- Programmatic Access: All functionality accessible through Python functions with comprehensive options
- Integration Ready: Designed to integrate seamlessly with pandas, matplotlib, and the broader PyData ecosystem
This design ensures pdtab remains lightweight, maintainable, and perfectly suited for modern data science workflows.
Quick Start
Basic One-way Tabulation
```python
import pandas as pd
import pdtab

# Create sample data
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'],
    'income': [35000, 45000, 75000, 40000, 55000, 80000]
})

# One-way frequency table
result = pdtab.tabulate('gender', data=data)
print(result)
```
```
gender    Freq   Percent      Cum
Male         3     50.00    50.00
Female       3     50.00   100.00
Total        6    100.00   100.00
```
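To make the three columns concrete, here is a minimal standard-library sketch of how frequency, percent, and cumulative percent relate to each other. The function name `one_way_table` is hypothetical and this is not pdtab's implementation; categories are listed in order of first appearance, which is an assumption about ordering.

```python
from collections import Counter


def one_way_table(values):
    """Return rows of (category, freq, percent, cumulative percent)."""
    counts = Counter(values)  # keeps categories in first-seen order
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category, freq in counts.items():
        percent = 100.0 * freq / total
        cumulative += percent
        rows.append((category, freq, round(percent, 2), round(cumulative, 2)))
    return rows


gender = ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']
for row in one_way_table(gender):
    print(row)
```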
Two-way Cross-tabulation with Statistics
```python
# Two-way table with chi-square test
result = pdtab.tabulate('gender', 'education', data=data, chi2=True, exact=True)
print(result)
```
Summary Tabulation
```python
# Summary statistics by group
result = pdtab.tabulate('gender', data=data, summarize='income')
print(result)
```
```
Summary of income by gender

gender       Mean   Std. Dev.   Freq
Male      55000.0     20000.0      3
Female    55000.0     21794.5      3
Total     55000.0     18708.3      6
```
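The group means and standard deviations for the sample data can be checked by hand with Python's `statistics` module. The sketch below assumes the sample (n − 1) standard deviation, which is what Stata's `summarize` reports; `summarize_by_group` is a hypothetical helper, not a pdtab function.

```python
from statistics import mean, stdev

gender = ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']
income = [35000, 45000, 75000, 40000, 55000, 80000]


def summarize_by_group(groups, values):
    """Return {group: (mean, sample std. dev., n)} plus a 'Total' entry."""
    by_group = {}
    for group, value in zip(groups, values):
        by_group.setdefault(group, []).append(value)
    summary = {g: (mean(vs), stdev(vs), len(vs)) for g, vs in by_group.items()}
    summary['Total'] = (mean(values), stdev(values), len(values))
    return summary


summary = summarize_by_group(gender, income)
for group, (avg, sd, n) in summary.items():
    print(f"{group:<8}{avg:>10.1f}{sd:>10.1f}{n:>6}")
```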
Main Functions
tabulate(varname1, varname2=None, data=None, **options)
Main tabulation function supporting:
One-way options:
- missing=True: Include missing values as a category
- sort=True: Sort by frequency (descending)
- plot=True: Create bar chart
- nolabel=True: Show numeric codes instead of labels
- generate='prefix': Create indicator variables
Two-way options:
- chi2=True: Pearson's chi-square test
- exact=True: Fisher's exact test
- lrchi2=True: Likelihood-ratio chi-square
- V=True: Cramér's V
- gamma=True: Goodman and Kruskal's gamma
- taub=True: Kendall's τb
- row=True: Row percentages
- column=True: Column percentages
- cell=True: Cell percentages
- expected=True: Expected frequencies
Summary options:
- summarize='variable': Variable to summarize
- means=False: Suppress means
- standard=False: Suppress standard deviations
- freq=False: Suppress frequencies
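The `row`, `column`, `cell`, and `expected` options listed above correspond to standard contingency-table quantities. The following plain-Python sketch shows how each is derived from a table of counts; it is independent of pdtab's internals, and the function name is hypothetical.

```python
def percentages_and_expected(table):
    """Compute row %, column %, cell %, and expected counts for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Each cell as a share of its row total, column total, or grand total
    row_pct = [[100.0 * x / rt for x in row] for row, rt in zip(table, row_totals)]
    col_pct = [[100.0 * x / ct for x, ct in zip(row, col_totals)] for row in table]
    cell_pct = [[100.0 * x / n for x in row] for row in table]

    # Expected count under independence: row total * column total / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    return row_pct, col_pct, cell_pct, expected


row_pct, col_pct, cell_pct, expected = percentages_and_expected([[30, 18], [38, 14]])
print(row_pct)
print(expected)
```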
tab1(varlist, data=None, **options)
Create one-way tables for multiple variables:
```python
results = pdtab.tab1(['gender', 'education'], data=data)
for var, result in results.items():
    print(f"\n{var}:")
    print(result)
```
tab2(varlist, data=None, **options)
Create all possible two-way tables:
```python
results = pdtab.tab2(['gender', 'education', 'region'], data=data, chi2=True)
for (var1, var2), result in results.items():
    print(f"\n{var1} × {var2}:")
    print(result)
```
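As a rough illustration of what "all possible two-way tables" means, the pairs can be enumerated with `itertools.combinations`. This assumes each unordered pair is tabulated once, as Stata's `tab2` does; whether pdtab orders the pairs the same way is not confirmed here.

```python
from itertools import combinations

variables = ['gender', 'education', 'region']

# Each unordered pair of variables appears exactly once
pairs = list(combinations(variables, 2))
for var1, var2 in pairs:
    print(f"{var1} × {var2}")
```

For k variables this yields k(k − 1)/2 tables, so three variables produce three cross-tabulations.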
tabi(table_data, **options)
Immediate tabulation from supplied data:
```python
# From string (Stata format); a raw string keeps the backslash row separator intact
result = pdtab.tabi(r"30 18 \ 38 14", exact=True)

# From list
result = pdtab.tabi([[30, 18], [38, 14]], chi2=True)
```
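The Stata-style string uses a backslash to separate rows and whitespace to separate cells. A minimal sketch of how such a string maps to the list form, assuming integer counts only (`parse_tabi` is a hypothetical helper, not part of pdtab's documented API):

```python
def parse_tabi(spec):
    """Turn a Stata-style table string into a list of integer rows."""
    # Hypothetical helper for illustration; rows are separated by a backslash
    rows = [[int(cell) for cell in row.split()] for row in spec.split("\\")]
    widths = {len(row) for row in rows}
    if len(widths) != 1 or 0 in widths:
        raise ValueError("every row must have the same, non-zero number of cells")
    return rows


table = parse_tabi(r"30 18 \ 38 14")
print(table)
```

Writing the literal as a raw string (`r"..."`) or doubling the backslash avoids Python treating `\ ` as an invalid escape sequence.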
Visualization
Create plots directly from tabulation results:
```python
# Bar chart for one-way table
result = pdtab.tabulate('gender', data=data, plot=True)

# Heatmap for two-way table
result = pdtab.tabulate('gender', 'education', data=data)
fig = pdtab.viz.create_tabulation_plots(result, plot_type='heatmap')
```
Statistical Tests
Supported Tests
- Pearson's Chi-square Test: Tests independence in contingency tables
- Likelihood-ratio Chi-square: Alternative to Pearson's chi-square
- Fisher's Exact Test: Exact test for small samples (especially 2×2 tables)
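For readers who want to see what these three tests compute, here is a standard-library sketch using the textbook formulas. It is not pdtab's implementation; the function names are hypothetical, the Fisher routine handles 2×2 tables only, and its two-sided p-value sums the probabilities of all tables no more likely than the observed one.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def pearson_chi2(table):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))


def lr_chi2(table):
    """Likelihood-ratio statistic: 2 * sum of observed * ln(observed / expected)."""
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, exp)
                     for o, e in zip(obs_row, exp_row) if o > 0)


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table (hypergeometric model)."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


table = [[30, 18], [38, 14]]
print(f"Pearson chi2: {pearson_chi2(table):.4f}")
print(f"LR chi2:      {lr_chi2(table):.4f}")
print(f"Fisher exact: {fisher_exact_2x2(table):.3f}")
```

Both chi-square statistics are compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom; the p-value lookup is omitted here to keep the sketch dependency-free.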
Association Measures
- Cramér's V: Measure of association (0-1 scale)
- Goodman and Kruskal's Gamma: For ordinal variables (-1 to 1)
- Kendall's τb: Rank correlation with tie correction (-1 to 1)
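The three measures can be written directly from their definitions. The sketch below uses only the standard library and hypothetical function names; it assumes rows and columns are already in their ordinal order and does not guard against degenerate tables (for example, a table with no concordant or discordant pairs). Cramér's V is returned unsigned, matching the 0 to 1 scale described above.

```python
import math


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))
    return math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))


def concordant_discordant(table):
    """Count concordant and discordant pairs of observations."""
    conc = disc = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # only rows below row i
                for m in range(n_cols):
                    if m > j:                   # lower-right cell: concordant
                        conc += table[i][j] * table[k][m]
                    elif m < j:                 # lower-left cell: discordant
                        disc += table[i][j] * table[k][m]
    return conc, disc


def goodman_kruskal_gamma(table):
    conc, disc = concordant_discordant(table)
    return (conc - disc) / (conc + disc)


def kendall_tau_b(table):
    conc, disc = concordant_discordant(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    all_pairs = n * (n - 1) / 2
    tied_rows = sum(t * (t - 1) / 2 for t in row_totals)  # pairs tied on the row variable
    tied_cols = sum(t * (t - 1) / 2 for t in col_totals)  # pairs tied on the column variable
    return (conc - disc) / math.sqrt((all_pairs - tied_rows) * (all_pairs - tied_cols))


table = [[30, 18], [38, 14]]
print(f"Cramér's V: {cramers_v(table):.4f}")
print(f"gamma:      {goodman_kruskal_gamma(table):.4f}")
print(f"tau-b:      {kendall_tau_b(table):.4f}")
```

For a 2×2 table the absolute value of τb equals Cramér's V, which makes a convenient sanity check.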
Weighted Analysis
Support for different weight types:
```python
# Frequency weights
result = pdtab.tabulate('gender', data=data, weights='freq_weight')

# Analytic weights
result = pdtab.tabulate('gender', data=data, weights='analytic_weight')
```
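With frequency weights, each row stands for as many observations as its weight, so the table sums weights per category instead of counting rows. A minimal sketch of that idea follows; the weight values are made up for illustration, `weighted_counts` is a hypothetical helper, and how pdtab treats analytic or importance weights is not shown.

```python
from collections import defaultdict


def weighted_counts(values, weights):
    """Return {category: (weighted frequency, percent)}."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight  # each row contributes its weight, not 1
    grand_total = sum(totals.values())
    return {v: (w, 100.0 * w / grand_total) for v, w in totals.items()}


gender = ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']
freq_weight = [2, 1, 3, 1, 1, 2]  # illustrative weights only

weighted = weighted_counts(gender, freq_weight)
for category, (freq, pct) in weighted.items():
    print(f"{category:<8}{freq:>6.0f}{pct:>8.2f}")
```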
Missing Value Handling
Flexible options for missing data:
```python
# Exclude missing values (default)
result = pdtab.tabulate('gender', data=data)

# Include missing as category
result = pdtab.tabulate('gender', data=data, missing=True)

# Subpopulation analysis
result = pdtab.tabulate('gender', data=data, subpop='analysis_sample')
```
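The difference between the default and `missing=True` can be illustrated without pdtab. In this sketch `None` and NaN are both treated as missing and grouped under a single placeholder label; the helper names and the `'<missing>'` label are assumptions for illustration, not pdtab's actual output.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_categories(values, missing=False):
    """Count categories, either dropping missing values or keeping them as one group."""
    if missing:
        labelled = ['<missing>' if is_missing(v) else v for v in values]
    else:
        labelled = [v for v in values if not is_missing(v)]
    return dict(Counter(labelled))


gender = ['Male', None, 'Female', 'Male', float('nan'), 'Female']
print(count_categories(gender))                # missing rows dropped
print(count_categories(gender, missing=True))  # missing rows kept as a category
```

Note that dropping missing rows also changes the denominator used for percentages, so the two tables differ in more than the extra row.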
Export Options
Export results in multiple formats:
```python
result = pdtab.tabulate('gender', 'education', data=data)

# Export to dictionary
data_dict = result.to_dict()

# Export to HTML
html_table = result.to_html()

# Save plot
fig = pdtab.viz.create_tabulation_plots(result)
pdtab.viz.save_plot(fig, 'crosstab.png')
```
Advanced Examples
Complex Two-way Analysis
```python
# Comprehensive two-way analysis
result = pdtab.tabulate(
    'treatment', 'outcome',
    data=clinical_data,
    chi2=True,      # Chi-square test
    exact=True,     # Fisher's exact test
    V=True,         # Cramér's V
    row=True,       # Row percentages
    expected=True,  # Expected frequencies
    missing=True    # Include missing values
)

print(result)
print(f"Chi-square: {result.statistics['chi2']['statistic']:.3f}")
print(f"p-value: {result.statistics['chi2']['pvalue']:.3f}")
print(f"Cramér's V: {result.statistics['cramersv']:.3f}")
```
Summary Analysis by Multiple Groups
```python
# Income analysis by gender and education
result = pdtab.tabulate(
    'gender', 'education',
    data=data,
    summarize='income',
    means=True,
    standard=True,
    obs=True
)
```
Immediate Analysis of Published Data
```python
# Analyze a 2×3 contingency table from literature
# (raw string so the backslash row separator is kept literally)
published_data = r"""
45 55 60 \
30 40 35
"""

result = pdtab.tabi(published_data, chi2=True, exact=True, V=True)
print("Published data analysis:")
print(result)
```
Stata Comparison
pdtab aims for 100% compatibility with Stata's tabulate command:
| Stata Command | pdtab Equivalent |
|---------------|------------------|
| tabulate gender | pdtab.tabulate('gender', data=df) |
| tabulate gender education, chi2 | pdtab.tabulate('gender', 'education', data=df, chi2=True) |
| tabulate gender, summarize(income) | pdtab.tabulate('gender', data=df, summarize='income') |
| tab1 gender education region | pdtab.tab1(['gender', 'education', 'region'], data=df) |
| tab2 gender education region | pdtab.tab2(['gender', 'education', 'region'], data=df) |
| tabi 30 18 \ 38 14, exact | pdtab.tabi(r"30 18 \ 38 14", exact=True) |
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
```bash
git clone https://github.com/brycewang-stanford/pdtab.git
cd pdtab
pip install -e ".[dev]"
pytest
```
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- Stata Corporation for the original tabulate command design
- Pandas Development Team for the excellent data manipulation library
- SciPy Community for statistical computing tools
Related Projects
pdtab is part of a broader ecosystem of econometric and statistical tools:
- PyStataR - Comprehensive Python package bridging Stata and R functionality (pdtab will be integrated into this project)
- StasPAI - AI-powered econometric analysis toolkit combining statistical methods with machine learning
Support
- Documentation: https://pdtab.readthedocs.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
pdtab - Bringing Stata's tabulation power to the Python ecosystem! 🐍
Owner
- Name: Bryce Wang
- Login: brycewang-stanford
- Kind: user
- Repositories: 1
- Profile: https://github.com/brycewang-stanford
GitHub Events
Total
- Push event: 4
- Create event: 2
Last Year
- Push event: 4
- Create event: 2
Packages
- Total packages: 1
- Total downloads: 93 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: pdtab
A pandas-based library that replicates Stata's tabulate functionality
- Homepage: https://github.com/pdtab/pdtab
- Documentation: https://pdtab.readthedocs.io/
- License: MIT
- Latest release: 0.1.1 (published 7 months ago)
Dependencies
- matplotlib >=3.0.0
- numpy >=1.18.0
- pandas >=1.0.0
- scipy >=1.4.0
- seaborn >=0.11.0
- tabulate >=0.8.0