https://github.com/aloncrack7/data-degradation-detector
Library to make estimations on data degradation
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.7%) to scientific vocabulary
Repository
Library to make estimations on data degradation
Basic Info
- Host: GitHub
- Owner: aloncrack7
- License: gpl-3.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 3.44 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
data-degradation-detector
Library to make estimations on data degradation
Installation
```bash
pip install data-degradation-detector
```
Usage
Basic Example
```python
from data_degradation_detector import univariate, multivariate, report

# Univariate analysis
univariate_results = univariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Multivariate analysis
multivariate_results = multivariate.analyze('data/WineQT.csv', 'data/WineQT_treated.csv')

# Generate a report
report.generate(univariate_results, multivariate_results, output_dir='tmp/final_report')
```
CLI Usage
```bash
python -m data_degradation_detector.univariate data/WineQT.csv data/WineQT_treated.csv
python -m data_degradation_detector.multivariate data/WineQT.csv data/WineQT_treated.csv
```
API Reference
univariate module
- get_distribution_descriptors(column: pd.Series) -> DistributionDescriptors
  Compute distribution descriptors (mean, std, min, max, quartiles) for a column.
- get_distribution_descriptors_all_columns(df: pd.DataFrame) -> dict
  Compute descriptors for all columns in a DataFrame.
- plot_distribution_descriptors(column: pd.Series, ...)
  Plot distribution and descriptors for a column.
- plot_distribution_descriptors_all_columns(df: pd.DataFrame, ...)
  Plot distributions for all columns.
- compare_distributions(original: pd.Series, new_data: pd.Series, ...) -> DistributionChanges
  Compare two distributions and visualize changes.
- compare_distribbutions_all_columns(original: pd.DataFrame, new_data: pd.DataFrame, ...)
  Compare all columns between two DataFrames.
- descriptor_evolution(dfs: list[pd.Series], ...)
  Plot evolution of descriptors for a column across multiple DataFrames.
- descriptor_evolution_all_columns(dfs: list[pd.DataFrame], ...)
  Plot evolution for all columns.
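To make the descriptor idea concrete, here is a minimal sketch that computes the same kind of summary (mean, std, min, max, quartiles) and the change between an original and a new sample. It uses only the Python standard library rather than pandas or this package. The names `describe` and `descriptor_shift` are hypothetical and are not part of the library's API, and the quartile method chosen here is an assumption.

```python
import statistics


def describe(values):
    """Return mean, std, min, max and quartiles for a sequence of numbers."""
    # "inclusive" treats the sample as the full population range; this
    # choice is an assumption and may differ from the library's method.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def descriptor_shift(original, new_data):
    """Signed change in each descriptor from `original` to `new_data`."""
    before, after = describe(original), describe(new_data)
    return {name: after[name] - before[name] for name in before}


original = [1.0, 2.0, 3.0, 4.0, 5.0]
degraded = [2.0, 3.0, 4.0, 5.0, 6.0]  # same shape, every value shifted by +1
shift = descriptor_shift(original, degraded)
```

In this toy case the location descriptors (mean, min, max, quartiles) all move by 1 while the spread stays the same, which is the kind of pattern a univariate comparison is meant to surface.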
multivariate module
- get_best_clusters(X: pd.DataFrame, path: str = None, plot: bool = True) -> Cluster_statistics
  Find optimal KMeans clusters and return statistics.
- get_cluster_defined_number(X: pd.DataFrame, num_clusters: int, ...) -> Cluster_statistics
  Run KMeans with a fixed number of clusters.
- compare_clusters(cluster_stats1, cluster_stats2, delta=0.1) -> ClusterChanges
  Compare two clusterings and return changes.
- clustering_evolution(dfs: list[pd.DataFrame], num_clusters: int, ...)
  Visualize clustering evolution across multiple DataFrames.
- correlation_matrix(df: pd.DataFrame, path: str = None)
  Plot and/or save a correlation matrix heatmap.
- get_cluster_info_from_json(json_data)
  Load cluster statistics from JSON.
Classes
- Cluster_statistics
  Holds statistics for a clustering (num_clusters, silhouette, centroids, radius, label percentages).
- ClusterChanges
  Represents and quantifies changes between two clusterings.
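As an illustration of what comparing two clusterings with a tolerance might look like, the sketch below flags clusters whose centroid moved by more than `delta` times the original cluster radius. This reading of `delta` is an assumption, not the documented behavior of `compare_clusters`. The function `changed_clusters` and the `(centroid, radius)` input layout are hypothetical, and clusters are assumed to be listed in matching order.

```python
import math


def changed_clusters(old, new, delta=0.1):
    """Return indices of clusters whose centroid moved more than delta * radius.

    `old` and `new` are lists of (centroid, radius) pairs in matching order.
    The threshold rule is an illustrative assumption, not the library's.
    """
    changed = []
    for i, ((c_old, r_old), (c_new, _)) in enumerate(zip(old, new)):
        # Euclidean distance between the old and new centroid positions.
        if math.dist(c_old, c_new) > delta * r_old:
            changed.append(i)
    return changed


baseline = [((0.0, 0.0), 1.0), ((5.0, 5.0), 2.0)]
current = [((0.05, 0.0), 1.0), ((5.0, 6.0), 2.0)]
drifted = changed_clusters(baseline, current)
```

Scaling the threshold by the radius keeps the check relative to each cluster's size, so a small shift in a tight cluster counts for more than the same shift in a wide one.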
report module
- get_number_of_output_classes(y: pd.Series) -> int
  Get the number of unique classes in a target column.
- create_initial_report(df: pd.DataFrame, target: str, base_metrics: dict, path: str, number_of_output_classes: int = None)
  Generate initial visualizations and statistics for a dataset.
- create_report(original_df, original_clusters, degraded_dfs, base_metrics, path, new_metrics=None)
  Generate a full report comparing original and degraded datasets.
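Counting the output classes of a target column can be sketched in a few lines of standard-library Python. The helper `class_summary` and the sample `quality` values are hypothetical; this only illustrates the idea behind `get_number_of_output_classes` and is not the library's implementation.

```python
from collections import Counter


def class_summary(target_values):
    """Return the number of distinct classes and each class's share of rows."""
    counts = Counter(target_values)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    return len(counts), shares


quality = [5, 5, 6, 7, 6, 5, 5, 7]  # made-up target labels
n_classes, shares = class_summary(quality)
```

Tracking the per-class shares alongside the count makes it easier to notice when a degraded dataset keeps the same classes but changes their balance.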
For more details, see the source code or the documentation.
Development
Running Tests
```bash
python -m pytest testing/ -v
```
Contributing
Contributions are welcome! Please open issues or pull requests on GitHub.
License
This project is licensed under the GPL v3 License. See the LICENSE file for details.
Contact
For questions or support, open an issue on GitHub.
Owner
- Login: aloncrack7
- Kind: user
- Repositories: 1
- Profile: https://github.com/aloncrack7
GitHub Events
Total
- Release event: 1
- Push event: 14
- Create event: 3
Last Year
- Release event: 1
- Push event: 14
- Create event: 3
Dependencies
- actions/checkout v4 composite
- actions/setup-python v4 composite
- actions/checkout v4 composite
- actions/setup-python v4 composite
- matplotlib *
- numpy *
- pandas *
- scikit-learn *
- build >=0.8.0 development
- pytest >=7.0.0 development
- setuptools >=61.0 development
- twine >=4.0.0 development
- build *
- matplotlib *
- numpy *
- pandas *
- scikit-learn *
- twine *