https://github.com/aloncrack7/data-degradation-detector

Library to make estimations on data degradation

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Library to make estimations on data degradation

Basic Info
  • Host: GitHub
  • Owner: aloncrack7
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 3.44 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 7 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

data-degradation-detector

Badges: Tests · PyPI version · Python Support · License: GPL v3 · DOI

Library to make estimations on data degradation

Installation

```bash
pip install data-degradation-detector
```

Usage

Basic Example

```python
from data_degradation_detector import univariate, multivariate, report

# Univariate analysis
univariate_results = univariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Multivariate analysis
multivariate_results = multivariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Generate a report
report.generate(univariate_results, multivariate_results, output_dir='tmp/final_report')
```

CLI Usage

```bash
python -m data_degradation_detector.univariate data/WineQT.csv data/WineQT_treated.csv
python -m data_degradation_detector.multivariate data/WineQT.csv data/WineQT_treated.csv
```

API Reference

univariate module

  • get_distribution_descriptors(column: pd.Series) -> DistributionDescriptors
    Compute distribution descriptors (mean, std, min, max, quartiles) for a column.
  • get_distribution_descriptors_all_columns(df: pd.DataFrame) -> dict
    Compute descriptors for all columns in a DataFrame.
  • plot_distribution_descriptors(column: pd.Series, ...)
    Plot distribution and descriptors for a column.
  • plot_distribution_descriptors_all_columns(df: pd.DataFrame, ...)
    Plot distributions for all columns.
  • compare_distributions(original: pd.Series, new_data: pd.Series, ...) -> DistributionChanges
    Compare two distributions and visualize changes.
  • compare_distributions_all_columns(original: pd.DataFrame, new_data: pd.DataFrame, ...)
    Compare all columns between two DataFrames.
  • descriptor_evolution(dfs: list[pd.Series], ...)
    Plot evolution of descriptors for a column across multiple DataFrames.
  • descriptor_evolution_all_columns(dfs: list[pd.DataFrame], ...)
    Plot evolution for all columns.
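To illustrate what the univariate descriptors cover, here is a minimal pandas sketch computing the same statistics (mean, std, min, max, quartiles). This is not the library's implementation, and the function name `distribution_descriptors` is hypothetical:

```python
import pandas as pd

def distribution_descriptors(column: pd.Series) -> dict:
    """Mean, std, min, max and quartiles for a numeric column."""
    q = column.quantile([0.25, 0.5, 0.75])
    return {
        "mean": column.mean(),
        "std": column.std(),
        "min": column.min(),
        "max": column.max(),
        "q1": q.loc[0.25],
        "median": q.loc[0.5],
        "q3": q.loc[0.75],
    }

# Toy column in the spirit of the WineQT examples above
alcohol = pd.Series([9.4, 9.8, 10.0, 11.2, 12.5])
print(distribution_descriptors(alcohol))
```

Comparing two such descriptor dictionaries (original vs. new data) is the essence of what `compare_distributions` reports.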

multivariate module

  • get_best_clusters(X: pd.DataFrame, path: str = None, plot: bool = True) -> Cluster_statistics
    Find optimal KMeans clusters and return statistics.
  • get_cluster_defined_number(X: pd.DataFrame, num_clusters: int, ...) -> Cluster_statistics
    Run KMeans with a fixed number of clusters.
  • compare_clusters(cluster_stats1, cluster_stats2, delta=0.1) -> ClusterChanges
    Compare two clusterings and return changes.
  • clustering_evolution(dfs: list[pd.DataFrame], num_clusters: int, ...)
    Visualize clustering evolution across multiple DataFrames.
  • correlation_matrix(df: pd.DataFrame, path: str = None)
    Plot and/or save a correlation matrix heatmap.
  • get_cluster_info_from_json(json_data)
    Load cluster statistics from JSON.
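The docstring for `get_best_clusters` suggests selecting a KMeans cluster count by a quality criterion; a common approach (assumed here, not confirmed from the source) is to pick the k with the best silhouette score. A minimal scikit-learn sketch, with the hypothetical helper `best_kmeans`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X: np.ndarray, k_range=range(2, 6)):
    """Fit KMeans for each k and keep the model with the best silhouette."""
    best = None
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        score = silhouette_score(X, model.labels_)
        if best is None or score > best[0]:
            best = (score, k, model)
    return best

rng = np.random.default_rng(0)
# Two well-separated blobs: silhouette should favor k=2
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
score, k, model = best_kmeans(X)
print(k, round(score, 3))
```

Tracking how centroids, radii, and label percentages shift between two such fits is what `compare_clusters` quantifies via its `delta` threshold.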

Classes

  • Cluster_statistics
    Holds statistics for a clustering (num_clusters, silhouette, centroids, radius, label percentages).
  • ClusterChanges
    Represents and quantifies changes between two clusterings.

report module

  • get_number_of_output_classes(y: pd.Series) -> int
    Get the number of unique classes in a target column.
  • create_initial_report(df: pd.DataFrame, target: str, base_metrics: dict, path: str, number_of_output_classes: int = None)
    Generate initial visualizations and statistics for a dataset.
  • create_report(original_df, original_clusters, degraded_dfs, base_metrics, path, new_metrics=None)
    Generate a full report comparing original and degraded datasets.
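`get_number_of_output_classes` can be sketched in one line with pandas; this equivalent (not the library's code) counts distinct target labels:

```python
import pandas as pd

def number_of_output_classes(y: pd.Series) -> int:
    """Count distinct target labels; missing values are not counted."""
    return y.nunique()

y = pd.Series([3, 5, 5, 6, 7, 3, None])
print(number_of_output_classes(y))  # → 4 (labels 3, 5, 6, 7; None ignored)
```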

For more details, see the source code or the documentation.

Development

Running Tests

```bash
python -m pytest testing/ -v
```

Contributing

Contributions are welcome! Please open issues or pull requests on GitHub.

License

This project is licensed under the GPL v3 License. See the LICENSE file for details.

Contact

For questions or support, open an issue on GitHub.

Owner

  • Login: aloncrack7
  • Kind: user

GitHub Events

Total
  • Release event: 1
  • Push event: 14
  • Create event: 3
Last Year
  • Release event: 1
  • Push event: 14
  • Create event: 3

Dependencies

.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
.github/workflows/tests.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *
requirements-dev.txt pypi
  • build >=0.8.0 development
  • pytest >=7.0.0 development
  • setuptools >=61.0 development
  • twine >=4.0.0 development
requirements.txt pypi
  • build *
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *
  • twine *