https://github.com/sidkris/megaprofiler

A Python library for automatic data profiling and validation

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary

Keywords

anomaly-detection data-profiling data-science data-validation machine-learning pypi python

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: sidkris
License: mit
Language: Python
Default Branch: main
Homepage: https://pypi.org/project/megaprofiler/
Size: 148 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

anomaly-detection data-profiling data-science data-validation machine-learning pypi python

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme, License

MegaProfiler is an easy-to-use, highly customizable Python library designed for profiling and analyzing datasets. It provides deep insights into your data's structure, distributions, missing values, anomalies, and more. With built-in support for data validation, anomaly detection, and data drift tracking, it's the perfect tool for data scientists and engineers looking to automate exploratory data analysis (EDA) and quality checks for large datasets.

While other libraries like pandas-profiling exist, MegaProfiler stands out for its extensibility, scalability, and integration with data validation and anomaly detection, making it ideal for data preprocessing and ETL pipelines.

Key Features

Automatic Data Summaries:
- Automatically generate statistical summaries, distributions, unique values, missing values, and data types for each column.
Anomaly Detection:
- Flag unusual distributions, outliers, or inconsistent data using z-score, IQR, or machine learning techniques (e.g., Isolation Forest).
Data Validation:
- Set custom validation rules (e.g., no missing values in specific columns, data type constraints) and receive alerts for rule violations.
Custom Reports:
- Generate configurable reports in various formats (e.g., HTML, PDF), with customizable thresholds for anomalies.
Data Drift Detection:
- Track changes in data distributions over time to detect shifts in data quality or content, useful for continuous monitoring of data pipelines.
Multicollinearity and Correlation Analysis:
- Perform advanced correlation analysis and detect multicollinearity with Variance Inflation Factor (VIF).
Time Series Analysis:
- Decompose and analyze time series data to identify trends, seasonality, and residuals.
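The z-score and IQR checks named under Anomaly Detection can be illustrated with a minimal sketch. This is not MegaProfiler's API: the function names `zscore_outliers` and `iqr_outliers`, the default thresholds, and the sample data are all assumptions made for illustration, and only the Python standard library is used.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds `threshold`.

    Hypothetical helper for illustration; uses the population standard deviation.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional fence width.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up sample column with one obviously unusual reading at index 8.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(iqr_outliers(readings))
print(zscore_outliers(readings, threshold=2.5))
```

In a sample this small the extreme value inflates the standard deviation, so the z-score check with the default threshold of 3 does not flag it, while the IQR check does. That is one reason a profiler might offer both methods.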

Benefits

MegaProfiler is an invaluable tool for:
- Data Scientists and Engineers: It automates exploratory data analysis, saving valuable time and reducing manual inspection of large datasets.
- ETL Pipelines: Easily detect issues such as missing data, outliers, or data drift, and ensure the quality of data moving through your pipeline.
- Data Quality Assurance: Validate the integrity of your data before model training or analysis, minimizing the risk of poor model performance due to flawed data.
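As a rough sketch of the kind of pre-training validation described above, the following shows rule-based checks for missing values and data types over a list of rows. It does not use MegaProfiler's API: `Rule`, `not_missing`, `has_type`, and `validate` are hypothetical names, and the sample rows are made up.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class Rule:
    """A named predicate applied to each row (hypothetical structure)."""
    name: str
    check: Callable[[Row], bool]


def not_missing(column: str) -> Rule:
    # The rule passes when the column is present and not None.
    return Rule(f"{column} is not missing", lambda row: row.get(column) is not None)


def has_type(column: str, expected: type) -> Rule:
    # The rule passes when the column value is an instance of the expected type.
    return Rule(
        f"{column} is {expected.__name__}",
        lambda row: isinstance(row.get(column), expected),
    )


def validate(rows: List[Row], rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (row index, rule name) for every rule violation."""
    violations = []
    for i, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                violations.append((i, rule.name))
    return violations


# Made-up sample data: row 1 has a missing price, row 2 has a string id.
rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": None},
    {"id": "3", "price": 4.50},
]
rules = [not_missing("price"), has_type("id", int)]

for index, rule_name in validate(rows, rules):
    print(f"row {index}: violated rule '{rule_name}'")
```

Keeping each rule as a named predicate makes it straightforward to report which rule failed on which row, which is the sort of alert an ETL step would need before passing data downstream.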

Installation

You can install MegaProfiler using pip:

```bash
pip install megaprofiler
```

Owner

Name: Siddharth Krishnan
Login: sidkris
Kind: user
Location: Bengaluru

Website: https://www.sidkrishnan.com/
Repositories: 85
Profile: https://github.com/sidkris

Primary Interests: AI & DeFi
Languages (Main): Python & Rust

GitHub Events

Total

Release event: 1
Watch event: 1
Push event: 4
Create event: 1

Last Year

Release event: 1
Watch event: 1
Push event: 4
Create event: 1

Packages

Total packages: 1
Total downloads:
- pypi 44 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 5
Total maintainers: 1

pypi.org: megaprofiler

megaprofiler is a highly customizable and extensible data profiling library designed to help data scientists and engineers understand their datasets before performing analysis or building models.

Homepage: https://github.com/sidkris/megaprofiler
Documentation: https://megaprofiler.readthedocs.io/
License: MIT License
Latest release: 1.0.0 (published over 1 year ago)

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 44 Last month

Rankings

Dependent packages count: 10.4%

Average: 34.3%

Dependent repos count: 58.3%

Maintainers (1)

sid_krishnan

Last synced: 11 months ago

Dependencies

megaprofiler.egg-info/requires.txt pypi

imbalanced-learn *
matplotlib *
numpy *
pandas *
scikit-learn *
scipy *
seaborn *
statsmodels *
tabulate *

requirements.txt pypi

imbalanced-learn *
matplotlib *
numpy *
pandas *
pytest *
scikit-learn *
scipy *
seaborn *
statsmodels *
tabulate *

setup.py pypi

imbalanced-learn *
matplotlib *
numpy *
pandas *
scikit-learn *
scipy *
seaborn *
statsmodels *
tabulate *

venv/Lib/site-packages/numpy/_core/tests/examples/cython/setup.py pypi

venv/Lib/site-packages/numpy/_core/tests/examples/limited_api/setup.py pypi

venv/Lib/site-packages/pandas/pyproject.toml pypi

numpy >=1.22.4; python_version<'3.11'
numpy >=1.23.2; python_version=='3.11'
numpy >=1.26.0; python_version>='3.12'
python-dateutil >=2.8.2
pytz >=2020.1
tzdata >=2022.7
