https://github.com/azazh/medical-data-warehouse

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Azazh
  • Default Branch: master
  • Size: 6.84 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Ethiopian Medical Businesses Data Pipeline

This project focuses on building a robust data pipeline for Ethiopian medical businesses by scraping data from Telegram channels, cleaning and transforming the data, and storing it in a data warehouse for analysis. The pipeline consists of two main tasks:

  1. Task 1: Data Scraping and Collection Pipeline - Scrapes data from Telegram channels.
  2. Task 2: Data Cleaning and Transformation - Cleans and transforms the scraped data using Python and DBT.

Table of Contents

  1. Project Overview
  2. Repository Structure
  3. Task 1: Data Scraping and Collection Pipeline
  4. Task 2: Data Cleaning and Transformation
  5. Setup and Installation
  6. Usage
  7. Challenges and Solutions
  8. Contributing
  9. License

Project Overview

The goal of this project is to build a data pipeline that:

  • Scrapes data from Telegram channels related to Ethiopian medical businesses.
  • Cleans and transforms the scraped data.
  • Stores the data in a PostgreSQL database for analysis.

The pipeline is designed to be modular, scalable, and easy to maintain.

Task 1: Data Scraping and Collection Pipeline

Objective

Scrape data from Telegram channels, including text and media, and store it in a structured format.

Implementation

  • Tools: Python (telethon, pandas, logging), Telegram API.
  • Steps:
    1. Set up Telegram API access using API_ID and API_HASH.
    2. Scrape data from specified Telegram channels (e.g., DoctorsET, Chemed).
    3. Store raw data in JSON files and media files in a structured directory.
    4. Log all activities for monitoring and debugging.
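
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual telegram_scraper.py: the helper names (message_to_record, scrape_channel, save_records), the record fields, and the default limit are assumptions. The client argument is expected to be a connected Telethon TelegramClient, whose iter_messages method yields messages asynchronously.

```python
import json
from pathlib import Path


def message_to_record(message, channel):
    """Flatten a Telethon-style message into a JSON-serializable dict."""
    return {
        "channel": channel,
        "message_id": message.id,
        "date": message.date.isoformat() if message.date else None,
        "text": message.text or "",
        "has_media": message.media is not None,
    }


async def scrape_channel(client, channel, limit=100):
    """Collect up to `limit` recent messages from one channel."""
    records = []
    async for message in client.iter_messages(channel, limit=limit):
        records.append(message_to_record(message, channel))
    return records


def save_records(records, out_dir, channel):
    """Write one JSON file per channel (hypothetical naming scheme)."""
    out_path = Path(out_dir) / f"{channel}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

With Telethon installed, the coroutine could be driven from inside `async with TelegramClient("scraper", api_id, api_hash) as client:` (the session name "scraper" is arbitrary). Media could be saved in the same loop with `await message.download_media(file=...)`, which is left out here to keep the sketch short.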

Output

  • Raw data stored in raw_data/ directory.
  • Media files stored in raw_data/media/.
  • Logs stored in scraping.log.
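
A layout like the one listed above could be prepared with the standard library alone. The sketch below is an assumption about how that might look: the function name prepare_output, the logger name, and the log format are hypothetical, and only the directory and file names come from the README.

```python
import logging
from pathlib import Path


def prepare_output(base_dir="."):
    """Create raw_data/ and raw_data/media/ and return a file logger."""
    base = Path(base_dir)
    media_dir = base / "raw_data" / "media"
    media_dir.mkdir(parents=True, exist_ok=True)  # also creates raw_data/

    logger = logging.getLogger("telegram_scraper")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(base / "scraping.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger, media_dir
```

Calling `logger.info("scraped %d messages", n)` after each channel would then append a timestamped line to scraping.log.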

Task 2: Data Cleaning and Transformation

Objective

Clean and transform the scraped data to ensure consistency, remove duplicates, and prepare it for analysis.

Implementation

  • Tools: Python (pandas, sqlalchemy), DBT (Data Build Tool).
  • Steps:
    1. Load raw data from JSON files.
    2. Clean data by removing duplicates, handling missing values, and standardizing formats.
    3. Validate data to ensure quality.
    4. Store cleaned data in a PostgreSQL database.
    5. Use DBT to transform data into analytical models.
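
Steps 2 and 3 might look like the following pandas sketch. The column names (channel, message_id, text, date) and the function name clean_messages are assumptions about the scraped records. The README only states that duplicates are removed based on message_id and channel and that text and date formats are standardized.

```python
import pandas as pd


def clean_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, fill gaps, and standardize formats in scraped messages."""
    df = df.copy()
    # A message is a duplicate if the same (channel, message_id) pair repeats.
    df = df.drop_duplicates(subset=["channel", "message_id"], keep="first")
    # Missing text becomes an empty string; whitespace runs collapse to one space.
    df["text"] = (
        df["text"]
        .fillna("")
        .astype(str)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # Unparseable dates become NaT; those rows fail validation and are dropped.
    df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
    df = df.dropna(subset=["date"])
    return df.reset_index(drop=True)
```

The cleaned frame could then be written to PostgreSQL with `DataFrame.to_sql("raw_medical_data", engine, if_exists="append", index=False)`, where engine is a SQLAlchemy engine. Whether the project appends or replaces is not stated in the README.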

Output

  • Cleaned data stored in PostgreSQL (raw_medical_data table).
  • DBT models for staging (stg_medical_data) and analytics (fact_messages).
  • Logs stored in data_cleaning.log.

Setup and Installation

Clone the Repository

    git clone https://github.com/Azazh/Medical-Data-Warehouse.git
    cd Medical-Data-Warehouse

Install Python Dependencies

    pip install -r requirements.txt

Set Up PostgreSQL Database

  1. Create a database named medical_dw.
  2. Update the connection string in data_cleaning.py and profiles.yml.
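
One way to keep the connection string out of source files is to assemble it from environment variables. The sketch below is an assumption, not what data_cleaning.py does: build_db_url is a hypothetical helper, and the variable names PGUSER, PGPASSWORD, PGHOST, and PGPORT are borrowed from PostgreSQL's client conventions. Only the database name medical_dw comes from the README.

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ, database="medical_dw"):
    """Return a SQLAlchemy-style PostgreSQL URL built from environment variables."""
    user = env.get("PGUSER", "postgres")
    # Percent-encode the password so characters like '@' or ' ' stay unambiguous.
    password = quote(env.get("PGPASSWORD", ""), safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
```

The resulting string can be passed to `sqlalchemy.create_engine`. The same host, port, user, and database values would need to match the ones in profiles.yml.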

Set Up Telegram API

  1. Obtain API_ID and API_HASH from my.telegram.org.
  2. Update the credentials in telegram_scraper.py.
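
As an alternative to editing the credentials into the script, they could be read from the environment at startup. This is a sketch under that assumption; load_telegram_credentials is a hypothetical helper, and the environment variable names simply reuse API_ID and API_HASH from the steps above.

```python
import os


def load_telegram_credentials(env=os.environ):
    """Read API_ID and API_HASH from the environment and validate them."""
    api_id = env.get("API_ID")
    api_hash = env.get("API_HASH")
    if not api_id or not api_hash:
        raise RuntimeError("Set API_ID and API_HASH before running the scraper")
    return int(api_id), api_hash  # Telethon expects api_id as an integer
```

Keeping the values in the environment (or in an untracked .env file loaded with python-dotenv, which appears in requirements.txt) avoids committing secrets to the repository.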

Install DBT

    pip install dbt-postgres

Usage

Run the Scraping Script

    python telegram_scraper.py

Run the Data Cleaning Script

    python data_cleaning.py

Run DBT Transformations

    cd medical_transform
    dbt run --models marts
    dbt test
    dbt docs generate
    dbt docs serve

Challenges and Solutions

| Challenge | Solution |
|--|--|
| Rate limits on Telegram API | Implemented rate limiting and retries in the scraping script. |
| Inconsistent data formats in Telegram messages | Standardized text and date formats during cleaning. |
| Duplicate messages in scraped data | Removed duplicates based on message_id and channel. |
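
The retry approach for Telegram rate limits could be implemented along these lines. Telethon signals rate limiting with FloodWaitError, whose seconds attribute says how long to wait. The sketch below is library-agnostic and treats any exception carrying a seconds attribute the same way. The name with_retries, the attempt count, and the backoff values are assumptions, not the script's actual code.

```python
import asyncio


async def with_retries(make_call, max_attempts=3, base_delay=1.0, sleep=asyncio.sleep):
    """Await make_call(), retrying on failure with exponential backoff.

    If the exception has a `seconds` attribute (as Telethon's FloodWaitError
    does), wait that long instead of the computed backoff.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception as exc:  # broad on purpose; narrow this in real code
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = getattr(exc, "seconds", base_delay * 2 ** (attempt - 1))
            await sleep(delay)
```

Here make_call is a zero-argument function returning a fresh awaitable on each attempt, for example a lambda wrapping one channel's scrape. The sleep parameter exists only so the wait can be replaced in tests.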

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Submit a pull request with a detailed description of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or feedback, please contact:

  • azazh w
  • azazhwuletaw@gmail.com
  • https://github.com/azazh

Owner

  • Login: Azazh
  • Kind: user

GitHub Events

Total
  • Push event: 6
  • Create event: 1
Last Year
  • Push event: 6
  • Create event: 1

Dependencies

.github/workflows/unittests.yml actions
requirements.txt pypi
  • Brlapi ==0.8.3
  • Django ==4.0.4
  • GitPython ==3.1.43
  • Mako ==1.1.3
  • MarkupSafe ==2.0.1
  • Pillow ==9.1.0
  • PyGObject ==3.42.1
  • PyJWT ==2.3.0
  • PyNaCl ==1.5.0
  • PyYAML ==5.4.1
  • Pygments ==2.18.0
  • SecretStorage ==3.3.1
  • TA-Lib ==0.5.1
  • Telethon ==1.38.1
  • aiohappyeyeballs ==2.4.4
  • aiohttp ==3.11.11
  • aiohttp-retry ==2.9.1
  • aiosignal ==1.3.2
  • amqp ==5.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • appdirs ==1.4.4
  • apturl ==0.5.2
  • asgiref ==3.5.1
  • asttokens ==3.0.0
  • async-timeout ==5.0.1
  • asyncssh ==2.19.0
  • atpublic ==5.0
  • attrs ==24.3.0
  • autopep8 ==1.6.0
  • bcrypt ==3.2.0
  • billiard ==4.2.1
  • blinker ==1.4
  • celery ==5.4.0
  • certifi ==2021.10.8
  • cffi ==1.17.1
  • chardet ==4.0.0
  • charset-normalizer ==2.0.12
  • click ==8.1.8
  • click-didyoumean ==0.3.1
  • click-plugins ==1.1.1
  • click-repl ==0.3.0
  • colorama ==0.4.4
  • comm ==0.2.2
  • command-not-found ==0.3
  • configobj ==5.0.9
  • cryptography ==44.0.0
  • cupshelpers ==1.0
  • dbus-python ==1.2.18
  • debugpy ==1.8.11
  • decorator ==5.1.1
  • defer ==1.0.6
  • defusedxml ==0.7.1
  • dictdiffer ==0.9.0
  • diskcache ==5.6.3
  • distlib ==0.3.4
  • distro ==1.7.0
  • distro-info ==1.1
  • django-allauth ==0.50.0
  • django-cors-headers ==3.11.0
  • django-rest-auth ==0.9.5
  • django-rest-authtoken ==2.1.4
  • djangorestframework ==3.13.1
  • dpath ==2.2.0
  • dulwich ==0.22.7
  • duplicity ==0.8.21
  • dvc-data ==3.16.7
  • dvc-http ==2.32.0
  • dvc-objects ==5.1.0
  • dvc-render ==1.0.2
  • dvc-studio-client ==0.21.0
  • dvc-task ==0.40.2
  • entrypoints ==0.4
  • exceptiongroup ==1.2.2
  • executing ==2.1.0
  • fasteners ==0.14.1
  • filelock ==3.6.0
  • flatten-dict ==0.4.2
  • flufl.lock ==8.1.0
  • frozenlist ==1.5.0
  • fsspec ==2024.12.0
  • funcy ==2.0
  • future ==0.18.2
  • gitdb ==4.0.11
  • grandalf ==0.8
  • gto ==1.7.2
  • gunicorn ==20.1.0
  • gyp ==0.1
  • httplib2 ==0.20.2
  • hydra-core ==1.3.2
  • idna ==3.3
  • importlib-metadata ==4.6.4
  • ipykernel ==6.29.5
  • ipython ==8.30.0
  • iterative-telemetry ==0.0.9
  • jedi ==0.19.2
  • jeepney ==0.7.1
  • jupyter_client ==8.6.3
  • jupyter_core ==5.7.2
  • keyring ==23.5.0
  • kombu ==5.4.2
  • language-selector ==0.1
  • launchpadlib ==1.10.16
  • lazr.restfulclient ==0.14.4
  • lazr.uri ==1.0.6
  • lockfile ==0.12.2
  • louis ==3.20.0
  • macaroonbakery ==1.3.1
  • markdown-it-py ==3.0.0
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • monotonic ==1.6
  • more-itertools ==8.10.0
  • multidict ==6.1.0
  • nest-asyncio ==1.6.0
  • netifaces ==0.11.0
  • networkx ==3.4.2
  • numpy ==1.26.4
  • oauthlib ==3.2.0
  • olefile ==0.46
  • omegaconf ==2.3.0
  • orjson ==3.10.12
  • packaging ==24.2
  • pandas ==2.2.3
  • paramiko ==2.9.3
  • parso ==0.8.4
  • pathspec ==0.12.1
  • pexpect ==4.8.0
  • platformdirs ==4.3.6
  • prompt_toolkit ==3.0.48
  • propcache ==0.2.1
  • protobuf ==3.12.4
  • psutil ==6.1.1
  • psycopg2 ==2.9.5
  • psycopg2-binary ==2.9.5
  • ptyprocess ==0.7.0
  • pure_eval ==0.2.3
  • pyRFC3339 ==1.1
  • pyaes ==1.6.1
  • pyasn1 ==0.6.1
  • pycairo ==1.20.1
  • pycodestyle ==2.8.0
  • pycparser ==2.21
  • pycups ==2.0.1
  • pydantic ==2.10.4
  • pydantic_core ==2.27.2
  • pydot ==3.0.3
  • pygit2 ==1.16.0
  • pygtrie ==2.5.0
  • pymacaroons ==0.13.0
  • pyparsing ==3.2.0
  • python-apt ==2.4.0
  • python-dateutil ==2.8.2
  • python-debian ==0.1.43
  • python-dotenv ==0.20.0
  • python3-openid ==3.2.0
  • pytz ==2022.1
  • pyxdg ==0.27
  • pyzmq ==26.2.0
  • reportlab ==3.6.8
  • requests ==2.27.1
  • requests-oauthlib ==1.3.1
  • rich ==13.9.4
  • rsa ==4.9
  • ruamel.yaml ==0.18.6
  • ruamel.yaml.clib ==0.2.12
  • scmrepo ==3.3.9
  • semver ==3.0.2
  • shellingham ==1.5.4
  • shortuuid ==1.0.13
  • shtab ==1.7.1
  • six ==1.16.0
  • smmap ==5.0.1
  • sqlparse ==0.4.2
  • sqltrie ==0.11.1
  • stack-data ==0.6.3
  • systemd-python ==234
  • tabulate ==0.9.0
  • toml ==0.10.2
  • tomlkit ==0.13.2
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • typer ==0.15.1
  • typing_extensions ==4.12.2
  • tzdata ==2023.4
  • ubuntu-drivers-common ==0.0.0
  • ubuntu-pro-client ==8001
  • ufw ==0.36.1
  • unattended-upgrades ==0.1
  • urllib3 ==1.26.9
  • usb-creator ==0.3.7
  • vine ==5.1.0
  • virtualenv ==20.13.0
  • voluptuous ==0.15.2
  • wadllib ==1.3.6
  • wcwidth ==0.2.13
  • xdg ==5
  • xkit ==0.0.0
  • yarl ==1.18.3
  • zc.lockfile ==3.0.post1
  • zipp ==1.0.0