Recent Releases of https://github.com/aloncrack7/data-degradation-detector
https://github.com/aloncrack7/data-degradation-detector - 1.0.6
API Reference
univariate module
get_distribution_descriptors(column: pd.Series) -> DistributionDescriptors
Compute distribution descriptors (mean, std, min, max, quartiles) for a column.get_distribution_descriptors_all_columns(df: pd.DataFrame) -> dict
Compute descriptors for all columns in a DataFrame.plot_distribution_descriptors(column: pd.Series, ...)
Plot distribution and descriptors for a column.plot_distribution_descriptors_all_columns(df: pd.DataFrame, ...)
Plot distributions for all columns.compare_distributions(original: pd.Series, new_data: pd.Series, ...) -> DistributionChanges
Compare two distributions and visualize changes.compare_distribbutions_all_columns(original: pd.DataFrame, new_data: pd.DataFrame, ...)
Compare all columns between two DataFrames.descriptor_evolution(dfs: list[pd.Series], ...)
Plot evolution of descriptors for a column across multiple DataFrames.descriptor_evolution_all_columns(dfs: list[pd.DataFrame], ...)
Plot evolution for all columns.
multivariate module
get_best_clusters(X: pd.DataFrame, path: str = None, plot: bool = True) -> Cluster_statistics
Find optimal KMeans clusters and return statistics.get_cluster_defined_number(X: pd.DataFrame, num_clusters: int, ...) -> Cluster_statistics
Run KMeans with a fixed number of clusters.compare_clusters(cluster_stats1, cluster_stats2, delta=0.1) -> ClusterChanges
Compare two clusterings and return changes.clustering_evolution(dfs: list[pd.DataFrame], num_clusters: int, ...)
Visualize clustering evolution across multiple DataFrames.correlation_matrix(df: pd.DataFrame, path: str = None)
Plot and/or save a correlation matrix heatmap.get_cluster_info_from_json(json_data)
Load cluster statistics from JSON.
Classes
Cluster_statistics
Holds statistics for a clustering (num_clusters, silhouette, centroids, radius, label percentages).ClusterChanges
Represents and quantifies changes between two clusterings.
report module
get_number_of_output_classes(y: pd.Series) -> int
Get the number of unique classes in a target column.create_initial_report(df: pd.DataFrame, target: str, base_metrics: dict, path: str, number_of_output_classes: int = None)
Generate initial visualizations and statistics for a dataset.create_report(original_df, original_clusters, degraded_dfs, base_metrics, path, new_metrics=None)
Generate a full report comparing original and degraded datasets.
- Jupyter Notebook
Published by aloncrack7 6 months ago