https://github.com/aloncrack7/data-degradation-detector

Library to make estimations on data degradation

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Library to make estimations on data degradation

Basic Info
  • Host: GitHub
  • Owner: aloncrack7
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 3.44 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 7 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

data-degradation-detector

Badges: Tests · PyPI version · Python Support · License: GPL v3 · DOI

Library to make estimations on data degradation

Installation

```bash
pip install data-degradation-detector
```

Usage

Basic Example

```python
from data_degradation_detector import univariate, multivariate, report

# Univariate analysis
univariate_results = univariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Multivariate analysis
multivariate_results = multivariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Generate a report
report.generate(univariate_results, multivariate_results, output_dir='tmp/final_report')
```

CLI Usage

```bash
python -m data_degradation_detector.univariate data/WineQT.csv data/WineQT_treated.csv
python -m data_degradation_detector.multivariate data/WineQT.csv data/WineQT_treated.csv
```

API Reference

univariate module

  • get_distribution_descriptors(column: pd.Series) -> DistributionDescriptors
    Compute distribution descriptors (mean, std, min, max, quartiles) for a column.
  • get_distribution_descriptors_all_columns(df: pd.DataFrame) -> dict
    Compute descriptors for all columns in a DataFrame.
  • plot_distribution_descriptors(column: pd.Series, ...)
    Plot distribution and descriptors for a column.
  • plot_distribution_descriptors_all_columns(df: pd.DataFrame, ...)
    Plot distributions for all columns.
  • compare_distributions(original: pd.Series, new_data: pd.Series, ...) -> DistributionChanges
    Compare two distributions and visualize changes.
  • compare_distributions_all_columns(original: pd.DataFrame, new_data: pd.DataFrame, ...)
    Compare all columns between two DataFrames.
  • descriptor_evolution(dfs: list[pd.Series], ...)
    Plot evolution of descriptors for a column across multiple DataFrames.
  • descriptor_evolution_all_columns(dfs: list[pd.DataFrame], ...)
    Plot evolution for all columns.
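To illustrate what the univariate descriptors cover, here is a minimal pandas sketch computing the same statistics (mean, std, min, max, quartiles). This is not the library's implementation, and the function name `distribution_descriptors` is hypothetical:

```python
import pandas as pd

def distribution_descriptors(column: pd.Series) -> dict:
    """Mean, std, min, max and quartiles for a numeric column."""
    q = column.quantile([0.25, 0.5, 0.75])
    return {
        "mean": column.mean(),
        "std": column.std(),
        "min": column.min(),
        "max": column.max(),
        "q1": q.loc[0.25],
        "median": q.loc[0.5],
        "q3": q.loc[0.75],
    }

# Toy column in the spirit of the WineQT examples above
alcohol = pd.Series([9.4, 9.8, 10.0, 11.2, 12.5])
print(distribution_descriptors(alcohol))
```

Comparing two such descriptor dictionaries (original vs. new data) is the essence of what `compare_distributions` reports.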

multivariate module

  • get_best_clusters(X: pd.DataFrame, path: str = None, plot: bool = True) -> Cluster_statistics
    Find optimal KMeans clusters and return statistics.
  • get_cluster_defined_number(X: pd.DataFrame, num_clusters: int, ...) -> Cluster_statistics
    Run KMeans with a fixed number of clusters.
  • compare_clusters(cluster_stats1, cluster_stats2, delta=0.1) -> ClusterChanges
    Compare two clusterings and return changes.
  • clustering_evolution(dfs: list[pd.DataFrame], num_clusters: int, ...)
    Visualize clustering evolution across multiple DataFrames.
  • correlation_matrix(df: pd.DataFrame, path: str = None)
    Plot and/or save a correlation matrix heatmap.
  • get_cluster_info_from_json(json_data)
    Load cluster statistics from JSON.
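The docstring for `get_best_clusters` suggests selecting a KMeans cluster count by a quality criterion; a common approach (assumed here, not confirmed from the source) is to pick the k with the best silhouette score. A minimal scikit-learn sketch, with the hypothetical helper `best_kmeans`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X: np.ndarray, k_range=range(2, 6)):
    """Fit KMeans for each k and keep the model with the best silhouette."""
    best = None
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        score = silhouette_score(X, model.labels_)
        if best is None or score > best[0]:
            best = (score, k, model)
    return best

rng = np.random.default_rng(0)
# Two well-separated blobs: silhouette should favor k=2
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
score, k, model = best_kmeans(X)
print(k, round(score, 3))
```

Tracking how centroids, radii, and label percentages shift between two such fits is what `compare_clusters` quantifies via its `delta` threshold.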

Classes

  • Cluster_statistics
    Holds statistics for a clustering (num_clusters, silhouette, centroids, radius, label percentages).
  • ClusterChanges
    Represents and quantifies changes between two clusterings.

report module

  • get_number_of_output_classes(y: pd.Series) -> int
    Get the number of unique classes in a target column.
  • create_initial_report(df: pd.DataFrame, target: str, base_metrics: dict, path: str, number_of_output_classes: int = None)
    Generate initial visualizations and statistics for a dataset.
  • create_report(original_df, original_clusters, degraded_dfs, base_metrics, path, new_metrics=None)
    Generate a full report comparing original and degraded datasets.
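`get_number_of_output_classes` can be sketched in one line with pandas; this equivalent (not the library's code) counts distinct target labels:

```python
import pandas as pd

def number_of_output_classes(y: pd.Series) -> int:
    """Count distinct target labels; missing values are not counted."""
    return y.nunique()

y = pd.Series([3, 5, 5, 6, 7, 3, None])
print(number_of_output_classes(y))  # → 4 (labels 3, 5, 6, 7; None ignored)
```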

For more details, see the source code or the documentation.

Development

Running Tests

```bash
python -m pytest testing/ -v
```

Contributing

Contributions are welcome! Please open issues or pull requests on GitHub.

License

This project is licensed under the GPL v3 License. See the LICENSE file for details.

Contact

For questions or support, open an issue on GitHub.

Owner

  • Login: aloncrack7
  • Kind: user

GitHub Events

Total
  • Release event: 1
  • Push event: 14
  • Create event: 3
Last Year
  • Release event: 1
  • Push event: 14
  • Create event: 3

Dependencies

.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
.github/workflows/tests.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *
requirements-dev.txt pypi
  • build >=0.8.0 development
  • pytest >=7.0.0 development
  • setuptools >=61.0 development
  • twine >=4.0.0 development
requirements.txt pypi
  • build *
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *
  • twine *