https://github.com/amr-yasser226/datagovernanceworkflow

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amr-yasser226
  • Language: HTML
  • Default Branch: main
  • Size: 18.1 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 12 months ago
Metadata Files
Readme License

README.md

DataGovernanceWorkflow

Project Overview

The DataGovernanceWorkflow repository provides a comprehensive pipeline for managing, profiling, encrypting, and auditing sensitive data. It includes encryption routines, data profiling and quality control notebooks, compliance report generation (GDPR, CCPA, HIPAA), and attack simulation scripts. The workflow is organized to separate raw data, analysis notebooks, scripts, and generated reports for clarity and reproducibility.

Repository Structure

DataGovernanceWorkflow/
├── data/              # Raw and processed datasets (CSV, JSON)
├── scripts/           # Standalone Python scripts for encryption, decryption, and attack simulations
├── notebooks/         # Jupyter notebooks for interactive exploration and profiling
├── reports/           # Generated HTML and PDF reports (profiling, compliance, quality control)
├── requirements.txt   # Python package dependencies
├── LICENSE            # Project license
└── README.md          # Project overview and instructions

Data Directory (data/)

Contains raw input files and outputs from processing steps:

  • ccpa_compliant.csv: Data annotated for CCPA compliance (DoNotSell flag and can_sell_data column).
  • Cleaned_csv.csv: Preprocessed dataset used for encryption and profiling.
  • encrypted_data.csv: Sensitive columns encrypted using Fernet, Caesar, and Playfair ciphers.
  • gdpr_compliant.csv: Data anonymized for GDPR fields (IP, Username, Password, City, Country).
  • hipaa_report.json: HIPAA compliance findings in JSON format.
  • recovered_columns.csv: Columns recovered after brute-force decryption of Caesar-encrypted fields.
  • ssh_logs_processed.csv: SSH log dataset cleaned and formatted for profiling and validation.

Scripts Directory (scripts/)

  • frequency_attack.py: Implements an improved brute-force attack on Caesar-ciphered columns, diagnoses mismatches, and applies custom fixes to maximize recovery accuracy.
  • profilling_code.py: Generates programmatic, text-based profiling of numeric, datetime, and categorical columns, and visualizes login attempt patterns by hour, country, and city.
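
The brute-force idea behind frequency_attack.py can be sketched as follows. This is a minimal illustration, not the repository's script: the function names `caesar_shift` and `brute_force_column`, and the scoring rule (count how many decrypted values appear in a list of known plaintext values), are assumptions made for the example. The sample country names are made up.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, preserving case.

    Non-letter characters are copied through unchanged.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def brute_force_column(ciphertexts, known_values):
    """Try all 26 shifts and keep the one that recovers the most known values.

    `known_values` is a hypothetical vocabulary (for example, country names)
    used only to score candidate shifts.
    """
    known = {v.lower() for v in known_values}
    best_shift, best_hits = 0, -1
    for shift in range(26):
        hits = sum(caesar_shift(c, -shift).lower() in known for c in ciphertexts)
        if hits > best_hits:
            best_shift, best_hits = shift, hits
    recovered = [caesar_shift(c, -best_shift) for c in ciphertexts]
    return best_shift, recovered


# Illustrative data: encrypt with shift 3, then recover without knowing the shift.
countries = ["Egypt", "Germany", "Brazil"]
encrypted = [caesar_shift(c, 3) for c in countries]
shift, recovered = brute_force_column(
    encrypted, known_values=["egypt", "brazil", "germany", "france"]
)
print(shift, recovered)
```

Because a Caesar cipher has only 26 possible keys, exhaustive search is cheap; the hard part in practice is scoring candidates, which is where a vocabulary or letter-frequency table comes in.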

Notebooks Directory (notebooks/)

  1. Data_encryption.ipynb
  • Reads the cleaned CSV and drops index columns.
  • Encrypts Password with Fernet.
  • Applies Caesar cipher (shift=3) to Username, City, and Country.
  • Assigns usernames to random categories for role-permissions testing.
  • Integrates GDPR, CCPA, and HIPAA pseudonymization or stub routines, exporting compliance artifacts.
  2. data_profiling.ipynb
  • Uses ydata_profiling to generate an HTML profiling report of the SSH log dataset.
  3. profilling_code.ipynb
  • Programmatic profiling: computes summary statistics for each column (numeric, datetime, categorical).
  • Builds a pandas DataFrame of profiling information and displays it.
  • Converts and analyzes combined datetime fields and plots login attempts by hour, country, and city.
  4. Quality_Control.ipynb
  • Loads the SSH log data and inspects schema.
  • Cleans duplicates and missing values (median for numeric, mode for categorical).
  • Removes outliers based on 1.5 × IQR rule.
  • Validates the cleaned dataset against a Pandera schema, reporting any failures.
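
The imputation and outlier steps listed for Quality_Control.ipynb can be illustrated with the standard library alone. This is a hedged sketch, not the notebook's code: the helper names (`impute`, `iqr_bounds`, `drop_outliers`), the `attempts` column, and the sample values are invented for the example, and the quartile method chosen here may differ from what the notebook uses.

```python
from statistics import median, mode, quantiles


def impute(values, numeric=True):
    """Replace None with the median (numeric) or mode (categorical) of observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose `column` value lies inside the IQR fences."""
    lower, upper = iqr_bounds([r[column] for r in rows], k)
    return [r for r in rows if lower <= r[column] <= upper]


# Made-up login-attempt counts with one missing value and one extreme value.
attempts = [2, 3, None, 4, 3, 250]
imputed = impute(attempts)
rows = [{"attempts": a} for a in imputed]
cleaned = drop_outliers(rows, "attempts")
print([r["attempts"] for r in cleaned])
```

Imputing before computing the fences, as done here, is one possible ordering; running the outlier filter first would keep extreme values from influencing the fill value.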

Reports Directory (reports/)

  • profiling_report.html: Interactive HTML summary of data profiling.
  • profiling_report.pdf: PDF export of the profiling report.
  • profiling_data_ssh_logs_process.html: HTML rendering of the profiling steps for SSH logs.
  • Phase 1.pdf: Quality Control notebook report summarizing cleaning, outlier handling, and schema validation.

Compliance Workflows

  1. GDPR Compliance
  • Anonymizes IP addresses and pseudonymizes other sensitive fields using python_gdpr_utils if available, else a stub based on MD5 hashing.
  • Outputs gdpr_compliant.csv.
  2. CCPA Compliance
  • Adds DoNotSell flag per user with consistent random assignment.
  • Derives can_sell_data column.
  • Outputs ccpa_compliant.csv.
  3. HIPAA Compliance
  • Runs HIPAA scanners (HippoScanner, TenableIO, SecurityMonkey) if installed, else returns an empty stub.
  • Outputs hipaa_report.json.
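
The fallback paths in the GDPR and CCPA workflows can be sketched roughly as below. Everything here is an assumption made for illustration: the function names, zeroing the last IPv4 octet as the anonymization rule, deriving the DoNotSell flag from a hash of the username so that it stays consistent per user, and treating can_sell_data as the negation of DoNotSell. The repository's notebook may do each of these differently.

```python
import hashlib
import ipaddress


def pseudonymize(value: str) -> str:
    """Return the MD5 hex digest of `value` as a stand-in pseudonym.

    MD5 is used only because the README names it for the stub; unsalted
    hashes of low-entropy fields can be reversed by guessing.
    """
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def anonymize_ipv4(ip: str) -> str:
    """Zero the last octet of an IPv4 address (assumed anonymization rule)."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def do_not_sell(username: str) -> bool:
    """Pseudo-random but repeatable flag: the same username always maps to the same value."""
    return hashlib.sha256(username.encode("utf-8")).digest()[0] % 2 == 0


# Made-up record using column names from the README.
record = {"IP": "192.168.1.77", "Username": "alice", "City": "Cairo"}

gdpr_row = {
    "IP": anonymize_ipv4(record["IP"]),
    "Username": pseudonymize(record["Username"]),
    "City": pseudonymize(record["City"]),
}

flag = do_not_sell(record["Username"])
ccpa_row = {**record, "DoNotSell": flag, "can_sell_data": not flag}

print(gdpr_row["IP"])
print(ccpa_row["DoNotSell"], ccpa_row["can_sell_data"])
```

Hashing the username instead of calling a random generator is one way to get an assignment that looks random yet stays the same for a given user across rows and runs.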

Setup and Usage

  1. Environment Setup

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

  2. Run Encryption and Compliance Pipeline

    python scripts/frequency_attack.py   # Attacks and recovers encrypted fields
    # For notebooks, launch Jupyter Lab:
    jupyter lab notebooks

  3. Generate Reports
  • Open notebooks/data_profiling.ipynb to regenerate profiling HTML.
  • Run Quality_Control.ipynb to validate data schema and update the Phase 1 report.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Owner

  • Login: amr-yasser226
  • Kind: user

GitHub Events

Total
  • Watch event: 1
  • Member event: 2
  • Push event: 17
  • Fork event: 1
  • Create event: 3
Last Year
  • Watch event: 1
  • Member event: 2
  • Push event: 17
  • Fork event: 1
  • Create event: 3

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 24
  • Total Committers: 2
  • Avg Commits per committer: 12.0
  • Development Distribution Score (DDS): 0.167
Past Year
  • Commits: 24
  • Committers: 2
  • Avg Commits per committer: 12.0
  • Development Distribution Score (DDS): 0.167
Top Committers
Name Email Commits
Amr Yasser a****6@g****m 20
OxHazem o****d@h****m 4

Issues and Pull Requests

Last synced: 11 months ago


Dependencies

requirements.txt pypi
  • Brotli ==1.1.0
  • DateTime ==5.5
  • Deprecated ==1.2.18
  • Faker ==37.1.0
  • Flask ==3.1.0
  • HeapDict ==1.0.1
  • IMDbPY ==2022.7.9
  • ImageHash ==4.3.1
  • Jinja2 ==3.1.4
  • Markdown ==3.7
  • MarkupSafe ==2.1.5
  • PySocks ==1.7.1
  • PyWavelets ==1.8.0
  • PyYAML ==6.0.1
  • Pygments ==2.18.0
  • SQLAlchemy ==2.0.25
  • Send2Trash ==1.8.3
  • Werkzeug ==3.1.3
  • absl-py ==2.2.2
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.11.16
  • aiosignal ==1.3.2
  • annotated-types ==0.7.0
  • anyio ==4.4.0
  • argon2-cffi ==23.1.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.3.0
  • asttokens ==2.4.1
  • astunparse ==1.6.3
  • async-lru ==2.0.5
  • attrs ==25.3.0
  • babel ==2.17.0
  • beautifulsoup4 ==4.12.3
  • bleach ==6.2.0
  • blinker ==1.9.0
  • cachetools ==5.5.0
  • cbor ==1.0.0
  • certifi ==2023.11.17
  • cffi ==1.17.1
  • charset-normalizer ==3.3.2
  • chest ==0.2.3
  • chromedriver-autoinstaller ==0.6.4
  • cinemagoer ==2023.5.1
  • click ==8.1.7
  • colorama ==0.4.6
  • comm ==0.2.2
  • contourpy ==1.2.1
  • cssselect ==1.3.0
  • cssselect2 ==0.8.0
  • cssutils ==2.11.1
  • cycler ==0.12.1
  • dacite ==1.9.2
  • dataframe_image ==0.2.7
  • debugpy ==1.8.7
  • decorator ==5.1.1
  • defusedxml ==0.7.1
  • dill ==0.3.9
  • distlib ==0.3.8
  • dnspython ==2.6.1
  • email_validator ==2.2.0
  • executing ==2.1.0
  • fastapi ==0.112.0
  • fastapi-cli ==0.0.5
  • fastjsonschema ==2.21.1
  • filelock ==3.15.4
  • flatbuffers ==25.2.10
  • fonttools ==4.53.1
  • fqdn ==1.5.1
  • frozenlist ==1.5.0
  • fsspec ==2025.3.2
  • gast ==0.6.0
  • gdown ==5.2.0
  • git-filter-repo ==2.47.0
  • google-api-core ==2.24.0
  • google-api-python-client ==2.156.0
  • google-auth ==2.37.0
  • google-auth-httplib2 ==0.2.0
  • google-auth-oauthlib ==1.2.1
  • google-pasta ==0.2.0
  • google_search_results ==2.4.2
  • googleapis-common-protos ==1.66.0
  • greenlet ==3.1.1
  • grpcio ==1.71.0
  • h11 ==0.14.0
  • h5py ==3.13.0
  • htmlmin ==0.1.12
  • httpcore ==1.0.5
  • httplib2 ==0.22.0
  • httptools ==0.6.1
  • httpx ==0.27.0
  • idna ==3.6
  • ijson ==3.3.0
  • inscriptis ==2.5.3
  • ipykernel ==6.29.5
  • ipython ==8.28.0
  • ir_datasets ==0.5.10
  • ir_measures ==0.3.7
  • isoduration ==20.11.0
  • itsdangerous ==2.2.0
  • jedi ==0.19.1
  • joblib ==1.4.2
  • json5 ==0.12.0
  • jsonpointer ==3.0.0
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • jupyter-events ==0.12.0
  • jupyter-lsp ==2.2.5
  • jupyter_client ==8.6.3
  • jupyter_core ==5.7.2
  • jupyter_server ==2.15.0
  • jupyter_server_terminals ==0.5.3
  • jupyterlab ==4.4.0
  • jupyterlab_pygments ==0.3.0
  • jupyterlab_server ==2.27.3
  • keras ==3.9.2
  • kiwisolver ==1.4.5
  • libclang ==18.1.1
  • llvmlite ==0.44.0
  • lxml ==5.1.0
  • lz4 ==4.4.3
  • markdown-it-py ==3.0.0
  • matplotlib ==3.8.4
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • missingno ==0.5.2
  • mistune ==3.1.3
  • ml_dtypes ==0.5.1
  • more-itertools ==10.6.0
  • mpmath ==1.3.0
  • multidict ==6.4.3
  • multimethod ==1.12
  • mypy-extensions ==1.0.0
  • namex ==0.0.8
  • nbclient ==0.10.2
  • nbconvert ==7.16.6
  • nbformat ==5.10.4
  • nest-asyncio ==1.6.0
  • networkx ==3.3
  • nltk ==3.9.1
  • notebook ==7.4.0
  • notebook_shim ==0.2.4
  • numba ==0.61.0
  • numpy ==1.26.4
  • oauthlib ==3.2.2
  • opt_einsum ==3.4.0
  • optree ==0.15.0
  • overrides ==7.7.0
  • packaging ==24.1
  • pandas ==2.2.2
  • pandera ==0.23.1
  • pandocfilters ==1.5.1
  • parso ==0.8.4
  • patsy ==1.0.1
  • pdfkit ==1.0.0
  • phik ==0.12.4
  • pillow ==10.4.0
  • platformdirs ==4.2.2
  • playwright ==1.51.0
  • prometheus_client ==0.21.1
  • prompt_toolkit ==3.0.48
  • propcache ==0.3.1
  • proto-plus ==1.25.0
  • protobuf ==5.29.2
  • psutil ==6.1.0
  • pure_eval ==0.2.3
  • puremagic ==1.28
  • pyarrow ==19.0.1
  • pyasn1 ==0.6.1
  • pyasn1_modules ==0.4.1
  • pycparser ==2.22
  • pydantic ==2.8.2
  • pydantic_core ==2.20.1
  • pydyf ==0.11.0
  • pyee ==12.1.1
  • pyjnius ==1.6.1
  • pyparsing ==3.1.2
  • pyphen ==0.17.2
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • python-json-logger ==3.3.0
  • python-multipart ==0.0.9
  • python-terrier ==0.13.0
  • pytrec-eval-terrier ==0.5.6
  • pytz ==2024.2
  • pywin32 ==308
  • pywinpty ==2.0.15
  • pyzmq ==26.2.0
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • requests-oauthlib ==2.0.0
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rich ==13.7.1
  • rpds-py ==0.24.0
  • rsa ==4.9
  • scikit-learn ==1.6.1
  • scipy ==1.13.1
  • seaborn ==0.13.2
  • setuptools ==75.6.0
  • shellingham ==1.5.4
  • six ==1.16.0
  • sniffio ==1.3.1
  • soupsieve ==2.5
  • stack-data ==0.6.3
  • starlette ==0.37.2
  • statsmodels ==0.14.4
  • sympy ==1.13.1
  • tabulate ==0.9.0
  • tangled-up-in-unicode ==0.2.0
  • tensorboard ==2.19.0
  • tensorboard-data-server ==0.7.2
  • tensorflow ==2.19.0
  • termcolor ==3.0.1
  • terminado ==0.18.1
  • threadpoolctl ==3.5.0
  • tinycss2 ==1.4.0
  • tinyhtml5 ==2.0.0
  • tornado ==6.4.1
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • trec-car-tools ==2.6
  • typeguard ==4.4.2
  • typer ==0.12.3
  • types-python-dateutil ==2.9.0.20241206
  • typing-inspect ==0.9.0
  • typing_extensions ==4.12.2
  • tzdata ==2024.1
  • ujson ==5.10.0
  • unlzw3 ==0.2.3
  • uri-template ==1.3.0
  • uritemplate ==4.1.1
  • urllib3 ==2.1.0
  • uvicorn ==0.30.5
  • virtualenv ==20.26.3
  • visions ==0.7.6
  • warc3-wet ==0.2.5
  • warc3-wet-clueweb09 ==0.2.5
  • watchfiles ==0.22.0
  • wcwidth ==0.2.13
  • weasyprint ==65.1
  • webcolors ==24.11.1
  • webencodings ==0.5.1
  • websocket-client ==1.8.0
  • websockets ==12.0
  • wget ==3.2
  • wheel ==0.45.1
  • wordcloud ==1.9.4
  • wrapt ==1.17.2
  • yarl ==1.19.0
  • ydata-profiling ==4.8.3
  • zlib-state ==0.1.9
  • zope.interface ==7.2
  • zopfli ==0.2.3.post1