https://github.com/andrei-vataselu/data-science-snippets

🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.


Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ✓ codemeta.json file (found)
  • ✓ .zenodo.json file (found)
  • ○ DOI references
  • ○ Academic publication links
  • ○ Academic email domains
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary

Keywords

artificial-intelligence data data-science eda feature-engineering hyperparamater-tunning library loading model-evaluation modeling preprocessing python snippets text-processing time-series visualization
Last synced: 5 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: andrei-vataselu
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 30.3 KB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 3
Topics
artificial-intelligence data data-science eda feature-engineering hyperparamater-tunning library loading model-evaluation modeling preprocessing python snippets text-processing time-series visualization
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Code of conduct Security

README.md

🧠 data-science-snippets

data-science-snippets is a modular, production-ready collection of Python snippets: curated, reusable utilities for the day-to-day workflows of senior data scientists and machine learning engineers.

It includes tools for EDA, cleaning, validation, text processing, feature engineering, visualization, model evaluation, time series, and more, organized by task to keep your work clean and efficient.


🚀 Features

✅ Covers every major step in the data science lifecycle
✅ Clean, modular structure by task
✅ Built for reusability in real-world projects
✅ Lightweight: depends only on pandas, numpy, matplotlib, and seaborn by default
✅ Compatible with Python 3.9+


📁 Folder Structure

data-science-snippets/
├── eda/
│   ├── most_frequent_values.py
│   ├── data_summary.py
│   ├── cardinality_report.py
│   └── basic_statistics.py
├── data_cleaning/
│   ├── missing_data_summary.py
│   ├── outlier_detection.py
│   └── duplicate_removal.py
├── preprocessing/
│   ├── minmax_scaling.py
│   ├── encoding.py
│   └── normalize_columns.py
├── loading/
│   ├── load_csv_with_info.py
│   ├── safe_parquet_loader.py
│   └── load_large_file_chunks.py
├── visualization/
│   ├── missing_data_heatmap.py
│   ├── distribution_plot.py
│   ├── correlation_matrix.py
│   └── color_palette_utils.py
├── feature_engineering/
│   ├── create_datetime_features.py
│   ├── binning.py
│   ├── interaction_terms.py
│   └── rare_label_encoding.py
├── automated_eda/
│   ├── quick_eda_report.py
│   └── profile_report_wrapper.py
├── model_evaluation/
│   ├── classification_report_extended.py
│   ├── confusion_matrix_plot.py
│   ├── cross_validation_metrics.py
│   └── roc_auc_plot.py
├── text_processing/
│   ├── clean_text.py
│   ├── tokenize_text.py
│   └── tfidf_features.py
├── time_series/
│   ├── lag_features.py
│   ├── rolling_statistics.py
│   └── datetime_indexing.py
├── modeling/
│   ├── model_training.py
│   ├── pipeline_builder.py
│   └── hyperparameter_tuner.py
├── data_validation/
│   ├── schema_check.py
│   ├── unique_constraints.py
│   └── value_range_check.py
├── utils/
│   ├── memory_optimization.py
│   ├── execution_timer.py
│   └── logging_setup.py
└── README.md


🔹 eda/ – Exploratory Data Analysis

  • most_frequent_values.py: Shows the most common (modal) value per column, its frequency, and its percentage of the non-null values.
  • data_summary.py: Summarizes dtypes, nulls, uniques, and memory usage for quick inspection.
  • cardinality_report.py: Reports high-cardinality columns among the categorical features.
  • basic_statistics.py: Returns mean, median, min, max, std, and other summary statistics.
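
As an illustration of the kind of helper described for most_frequent_values.py, here is a minimal sketch written from the description alone. It is not the repository's code: the function name, signature, and output column names are assumptions.

```python
import pandas as pd


def most_frequent_values(df: pd.DataFrame) -> pd.DataFrame:
    """Modal value, its count, and its share of non-null values, per column.

    Hypothetical helper: name and output layout are assumptions.
    """
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            # Column is entirely null, so there is no mode to report.
            rows.append({"column": col, "top_value": None, "count": 0, "percent": 0.0})
            continue
        counts = non_null.value_counts()  # sorted by frequency, descending
        rows.append(
            {
                "column": col,
                "top_value": counts.index[0],
                "count": int(counts.iloc[0]),
                "percent": round(100 * counts.iloc[0] / len(non_null), 2),
            }
        )
    columns = ["column", "top_value", "count", "percent"]
    return pd.DataFrame(rows, columns=columns).set_index("column")


df = pd.DataFrame({"city": ["Cluj", "Cluj", "Iasi", None], "n": [1, 1, 1, 2]})
print(most_frequent_values(df))
```

The percentage is computed against non-null rows only, matching the description above; ties are resolved by whatever order `value_counts` returns.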

🔹 data_cleaning/

  • missing_data_summary.py: Shows missing value count and percentage per column, along with data types.
  • outlier_detection.py: Detects outliers using IQR or Z-score methods.
  • duplicate_removal.py: Identifies and removes duplicate rows or records.
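
The IQR method mentioned for outlier_detection.py can be sketched as follows. This is an illustrative version under assumptions, not the repository's implementation: the function name `iqr_outlier_mask` and the default multiplier `k=1.5` (the conventional Tukey fence) are choices made here.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper. NaN values compare as False, so they are never flagged.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


values = pd.Series([10, 12, 11, 13, 12, 95])
print(values[iqr_outlier_mask(values)])  # only the value 95 is flagged
```

Returning a boolean mask rather than a filtered frame leaves the decision to drop, cap, or inspect the flagged rows to the caller.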

🔹 preprocessing/

  • minmax_scaling.py: Scales numeric values to a [0, 1] range.
  • encoding.py: Label encoding and one-hot encoding utilities.
  • normalize_columns.py: Z-score standardization and column normalization helpers.
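
Min-max scaling maps each value x to (x - min) / (max - min). A minimal pandas sketch of that formula is shown below; the function name `minmax_scale` and the handling of constant columns are assumptions, not taken from the repository.

```python
import pandas as pd


def minmax_scale(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with the given numeric columns rescaled to [0, 1].

    Hypothetical helper, written from the formula (x - min) / (max - min).
    """
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        # A constant column has no spread; map it to 0.0 instead of dividing by zero.
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    return out


df = pd.DataFrame({"a": [2, 4, 6], "b": [5, 5, 5]})
print(minmax_scale(df, ["a", "b"]))
```

When scaling feeds a model, the minimum and maximum should come from the training split only, which this one-shot sketch does not handle.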

🔹 loading/

  • load_csv_with_info.py: Loads CSVs and prints metadata like shape, dtypes, and missing values.
  • safe_parquet_loader.py: Robust parquet file loader with fallback options.
  • load_large_file_chunks.py: Loads large files in chunks with progress reporting.
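
Chunked loading with progress reporting can be sketched with the `chunksize` argument of `pandas.read_csv`. The function name `load_in_chunks` and the printed progress format are assumptions for illustration; the repository's loader may differ. The sketch assumes the file contains at least one data row.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize: int = 100_000) -> pd.DataFrame:
    """Read a CSV in fixed-size chunks, print progress, and concatenate.

    Hypothetical helper; `source` is anything pandas.read_csv accepts.
    """
    chunks = []
    rows_read = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        rows_read += len(chunk)
        print(f"read {rows_read} rows so far")
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


# Small in-memory CSV so the example runs without touching the filesystem.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
df = load_in_chunks(io.StringIO(csv_text), chunksize=4)
print(df.shape)
```

Concatenating every chunk still materializes the full table in memory; for files that do not fit, each chunk would be filtered or aggregated inside the loop instead.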

🔹 visualization/

  • missing_data_heatmap.py: Visualizes missing values with a Seaborn heatmap.
  • distribution_plot.py: Plots distributions of numeric variables.
  • correlation_matrix.py: Draws a correlation heatmap of numeric features.

🔹 feature_engineering/

  • create_datetime_features.py: Extracts features like day, month, year, weekday from datetime columns.
  • binning.py: Performs binning (equal-width or quantile) on continuous variables.
  • interaction_terms.py: Creates interaction features (e.g., feature1 * feature2).
  • rare_label_encoding.py: Groups rare categorical labels into 'Other'.
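
Grouping rare labels into 'Other', as described for rare_label_encoding.py, might look like the sketch below. The function name `group_rare_labels` and the 5% default threshold are assumptions; the sketch also assumes a plain string/object column, not a pandas categorical.

```python
import pandas as pd


def group_rare_labels(series: pd.Series, min_share: float = 0.05, other: str = "Other") -> pd.Series:
    """Replace categories whose share of non-null values is below `min_share`.

    Hypothetical helper; missing values are left untouched.
    """
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other)


labels = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])
print(group_rare_labels(labels, min_share=0.2).value_counts())
```

For modeling, the set of rare labels should be learned on training data and reused on new data, so that the mapping stays stable.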

🔹 automated_eda/

  • quick_eda_report.py: Generates a summary of shape, dtypes, nulls, basic stats.
  • profile_report_wrapper.py: Wrapper for pandas-profiling / ydata-profiling report generation.

🔹 model_evaluation/

  • classification_report_extended.py: Displays precision, recall, F1 with support for multiple averages.
  • confusion_matrix_plot.py: Plots an annotated confusion matrix.
  • cross_validation_metrics.py: Computes metrics across folds and aggregates results.
  • roc_auc_plot.py: Plots ROC curve and calculates AUC score.
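
To make "multiple averages" concrete, the dependency-free sketch below computes per-class precision, recall, and F1, then the macro average (unweighted mean over classes) and the micro average (scores from pooled counts). It is an illustration only: the function name `precision_recall_f1` is hypothetical, and the repository's version may instead rely on scikit-learn. It assumes non-empty inputs and that no class is literally named "macro" or "micro".

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred):
    """Per-class precision/recall/F1 plus macro and micro averages (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def scores(tp_, fp_, fn_):
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    report = {label: scores(tp[label], fp[label], fn[label]) for label in labels}
    # Macro: unweighted mean of per-class scores. Micro: scores from pooled counts.
    report["macro"] = {
        metric: sum(report[label][metric] for label in labels) / len(labels)
        for metric in ("precision", "recall", "f1")
    }
    report["micro"] = scores(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return report


report = precision_recall_f1(["cat", "cat", "dog", "dog"], ["cat", "dog", "dog", "dog"])
for name, row in report.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Macro averaging treats every class equally regardless of size, while micro averaging is dominated by the frequent classes, so the two can diverge sharply on imbalanced data.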

🔹 text_processing/

  • clean_text.py: Removes punctuation, stopwords, numbers, and lowercases text.
  • tokenize_text.py: Word and sentence tokenizers with NLTK or spaCy support.
  • tfidf_features.py: Builds TF-IDF matrix from text columns.
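
The cleaning steps listed for clean_text.py (lowercasing, removing numbers, punctuation, and stopwords) can be sketched with the standard library alone. The stopword set below is a tiny illustrative assumption, and the function is not the repository's code; it handles ASCII punctuation only.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); real projects usually
# take a fuller list from NLTK or spaCy.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_text(text: str) -> str:
    """Lowercase, drop digits and ASCII punctuation, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(clean_text("The 3 models, in 2024, are ready to ship!"))
# → models are ready ship
```

Stripping punctuation before tokenizing also removes apostrophes, so "don't" becomes "dont"; whether that is acceptable depends on the downstream task.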

🔹 time_series/

  • lag_features.py: Generates lagged versions of a column for time-aware modeling.
  • rolling_statistics.py: Rolling mean, median, std, and min/max features.
  • datetime_indexing.py: Time-based slicing, filtering, and resampling helpers.
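
Lag features are usually built with `Series.shift`. The sketch below assumes the rows are already sorted chronologically; the function name `add_lag_features` and the `<column>_lag_<k>` naming scheme are hypothetical, not taken from the repository.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, column: str, lags: list[int]) -> pd.DataFrame:
    """Return a copy with one `<column>_lag_<k>` column per requested lag.

    Hypothetical helper; assumes rows are in chronological order.
    """
    out = df.copy()
    for k in lags:
        out[f"{column}_lag_{k}"] = out[column].shift(k)
    return out


sales = pd.DataFrame({"sales": [10, 20, 30, 40]})
print(add_lag_features(sales, "sales", lags=[1, 2]))
```

The first k rows of each lag column are NaN because no earlier observation exists; for panel data, the shift would be applied per group (for example via `groupby(...)[column].shift(k)`).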

🔹 modeling/

  • model_training.py: Trains scikit-learn models with optional cross-validation and logging.
  • pipeline_builder.py: Builds preprocessing + modeling pipelines using Pipeline or ColumnTransformer.
  • hyperparameter_tuner.py: Wraps GridSearchCV or RandomizedSearchCV with easy setup and evaluation.

🔹 data_validation/

  • schema_check.py: Validates schema based on expected dtypes and column names.
  • unique_constraints.py: Ensures unique values for IDs or compound keys.
  • value_range_check.py: Checks for valid value ranges in numeric columns.
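
A schema check based on expected column names and dtypes could be sketched as below. The function name `check_schema`, the dict-of-dtype-strings input, and the choice to return a list of messages (instead of raising) are assumptions for illustration.

```python
import pandas as pd


def check_schema(df: pd.DataFrame, expected: dict[str, str]) -> list[str]:
    """Compare column names and dtypes against `expected`; return the problems found.

    Hypothetical helper: `expected` maps column name to dtype string, e.g. "int64".
    """
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems


orders = pd.DataFrame({"id": [1, 2], "price": [1.5, 2.0]})
print(check_schema(orders, {"id": "int64", "price": "float64"}))  # → []
print(check_schema(orders, {"id": "float64", "name": "int64"}))
```

Collecting every mismatch in one pass gives a complete report, which tends to be more useful in a data pipeline than failing on the first problem.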

🔹 utils/

  • memory_optimization.py: Downcasts numerical columns to save memory.
  • execution_timer.py: Times function execution with decorators or context managers.
  • logging_setup.py: Sets up consistent logging configuration for larger projects.
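
Timing "with decorators or context managers" can be sketched with the standard library as follows. The names `timed` and `timer` are hypothetical, and printing (instead of logging) is a simplification made here.

```python
import time
from contextlib import contextmanager
from functools import wraps


def timed(func):
    """Decorator that prints how long each call to `func` takes (hypothetical helper)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failures are timed too.
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
    return wrapper


@contextmanager
def timer(label: str):
    """Context manager that prints the elapsed time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} took {time.perf_counter() - start:.4f}s")


@timed
def slow_sum(n: int) -> int:
    return sum(range(n))


with timer("block"):
    total = slow_sum(100_000)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring intervals, and `functools.wraps` preserves the wrapped function's name and docstring.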

🛠️ Usage

Copy and paste the snippets you need into your project 📦


📚 Requirements

  • Python ≥ 3.9
  • pandas ≥ 1.5.3
  • numpy ≥ 1.24.4
  • seaborn ≥ 0.12.2
  • matplotlib ≥ 3.6.3

🔐 Security

Please see our SECURITY.md for vulnerability disclosure guidelines.


👥 Authors

  • Vataselu Andrei
  • Nicola-Diana Sincaru

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🌟 Contributions

We welcome contributions! If you have a reusable function or snippet that you think belongs in a senior data scientist's toolkit, feel free to open a pull request.

Owner

  • Name: Vataselu Andrei
  • Login: andrei-vataselu
  • Kind: user
  • Location: Romania
  • Company: Endava

work smart, not hard

Dependencies

pyproject.toml pypi
setup.py pypi
  • numpy >=1.24.4
  • pandas >=1.5.3
.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/greeting.yml actions
  • actions/github-script v7 composite
.github/workflows/publish-to-pypi.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/stale.yml actions
  • actions/stale v9 composite