https://github.com/azazh/medical-data-warehouse
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Azazh
- Default Branch: master
- Size: 6.84 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Ethiopian Medical Businesses Data Pipeline
This project focuses on building a robust data pipeline for Ethiopian medical businesses by scraping data from Telegram channels, cleaning and transforming the data, and storing it in a data warehouse for analysis. The pipeline consists of two main tasks:
- Task 1: Data Scraping and Collection Pipeline - Scrapes data from Telegram channels.
- Task 2: Data Cleaning and Transformation - Cleans and transforms the scraped data using Python and DBT.
Table of Contents
- Project Overview
- Repository Structure
- Task 1: Data Scraping and Collection Pipeline
- Task 2: Data Cleaning and Transformation
- Setup and Installation
- Usage
- Challenges and Solutions
- Contributing
- License
Project Overview
The goal of this project is to build a data pipeline that:
- Scrapes data from Telegram channels related to Ethiopian medical businesses.
- Cleans and transforms the scraped data.
- Stores the data in a PostgreSQL database for analysis.
The pipeline is designed to be modular, scalable, and easy to maintain.
Task 1: Data Scraping and Collection Pipeline
Objective
Scrape data from Telegram channels, including text and media, and store it in a structured format.
Implementation
- Tools: Python (`telethon`, `pandas`, `logging`), Telegram API.
- Steps:
  - Set up Telegram API access using `API_ID` and `API_HASH`.
  - Scrape data from specified Telegram channels (e.g., DoctorsET, Chemed).
  - Store raw data in JSON files and media files in a structured directory.
  - Log all activities for monitoring and debugging.
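The steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual `telegram_scraper.py`: the function names (`message_to_record`, `save_records`, `scrape_channel`), the record fields, the session name, and the one-JSON-file-per-channel layout are all assumptions. Telethon is imported lazily inside `scrape_channel` so the two helper functions can be used without network access.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("telegram_scraper")

RAW_DIR = Path("raw_data")      # assumed layout, mirroring the Output section
MEDIA_DIR = RAW_DIR / "media"


def message_to_record(channel, message):
    """Flatten a Telethon-style message object into a JSON-serializable dict."""
    return {
        "channel": channel,
        "message_id": message.id,
        "date": message.date.isoformat() if message.date else None,
        "text": message.text or "",
        "has_media": message.media is not None,
    }


def save_records(channel, records, raw_dir=RAW_DIR):
    """Write one JSON file per channel (hypothetical naming) and return its path."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_path = raw_dir / f"{channel}.json"
    out_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


async def scrape_channel(api_id, api_hash, channel, limit=100):
    """Fetch up to `limit` messages from one channel (needs network and a login)."""
    from telethon import TelegramClient  # third-party; imported lazily

    records = []
    MEDIA_DIR.mkdir(parents=True, exist_ok=True)
    async with TelegramClient("scraper_session", api_id, api_hash) as client:
        async for message in client.iter_messages(channel, limit=limit):
            records.append(message_to_record(channel, message))
            if message.media is not None:
                # Passing a directory lets Telethon choose the file name.
                await message.download_media(file=str(MEDIA_DIR))
    logger.info("Scraped %d messages from %s", len(records), channel)
    return save_records(channel, records)
```

A run for one channel might look like `asyncio.run(scrape_channel(API_ID, API_HASH, "DoctorsET"))`; on first use Telethon prompts interactively for a phone number and login code. Logging to `scraping.log` could be enabled with `logging.basicConfig(filename="scraping.log", level=logging.INFO)`.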
Output
- Raw data stored in the `raw_data/` directory.
- Media files stored in `raw_data/media/`.
- Logs stored in `scraping.log`.
Task 2: Data Cleaning and Transformation
Objective
Clean and transform the scraped data to ensure consistency, remove duplicates, and prepare it for analysis.
Implementation
- Tools: Python (`pandas`, `sqlalchemy`), DBT (Data Build Tool).
- Steps:
  - Load raw data from JSON files.
  - Clean data by removing duplicates, handling missing values, and standardizing formats.
  - Validate data to ensure quality.
  - Store cleaned data in a PostgreSQL database.
  - Use DBT to transform data into analytical models.
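The load, clean, and validate steps might look something like the sketch below. The project itself uses `pandas` for this; the example deliberately uses only the standard library to show the logic in isolation, and it is not the repository's `data_cleaning.py`. The function names and the record fields (`channel`, `message_id`, `date`, `text`) are assumptions. Loading into PostgreSQL and the DBT models are not shown.

```python
import json
from datetime import datetime
from pathlib import Path


def load_raw(raw_dir):
    """Load every JSON file found directly in `raw_dir` into one list of records."""
    records = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def clean_records(records):
    """Deduplicate, fill missing text, and standardize text and date formats."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("channel"), rec.get("message_id"))
        if None in key or key in seen:
            continue  # drop rows missing a key field, and repeated messages
        seen.add(key)
        text = " ".join((rec.get("text") or "").split())  # collapse whitespace
        raw_date = rec.get("date")
        date = datetime.fromisoformat(raw_date).isoformat() if raw_date else None
        cleaned.append(
            {"channel": key[0], "message_id": key[1], "date": date, "text": text}
        )
    return cleaned


def validate(records):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    for rec in records:
        if not isinstance(rec["message_id"], int):
            problems.append(f"non-integer message_id: {rec['message_id']!r}")
        if rec["date"] is None:
            problems.append(f"missing date for message {rec['message_id']}")
    return problems
```

Keying deduplication on the `(channel, message_id)` pair rather than on `message_id` alone matters because Telegram message IDs are only unique within a channel.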
Output
- Cleaned data stored in PostgreSQL (`raw_medical_data` table).
- DBT models for staging (`stg_medical_data`) and analytics (`fact_messages`).
- Logs stored in `data_cleaning.log`.
Setup and Installation
Clone the Repository
```bash
git clone https://github.com/Azazh/Medical-Data-Warehouse.git
cd Medical-Data-Warehouse
```
Install Python Dependencies
```bash
pip install -r requirements.txt
```
Set Up PostgreSQL Database
- Create a database named `medical_dw`.
- Update the connection string in `data_cleaning.py` and `profiles.yml`.
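For the Python side, a connection string in the form SQLAlchemy expects could be assembled as in the sketch below. The helper name `build_postgres_url` and the choice of reading the standard libpq environment variables (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) are assumptions; the README only says the string lives in `data_cleaning.py`. `profiles.yml` has its own format and would still need the same values entered separately.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a SQLAlchemy-style PostgreSQL URL from environment variables."""
    user = env.get("PGUSER", "postgres")
    # Escape characters such as '@' or '/' that would otherwise break the URL.
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "medical_dw")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
```

The result can be passed to `sqlalchemy.create_engine(build_postgres_url())`.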
Set Up Telegram API
- Obtain `API_ID` and `API_HASH` from my.telegram.org.
- Update the credentials in `telegram_scraper.py`.
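As an alternative to editing the values into the script directly, the credentials could be read from environment variables, which keeps them out of version control. This is only a sketch: the helper name `load_telegram_credentials` is hypothetical, and the repository's script may expect the values elsewhere.

```python
import os


def load_telegram_credentials(env=os.environ):
    """Read API_ID and API_HASH from the environment and validate them."""
    api_id = env.get("API_ID", "").strip()
    api_hash = env.get("API_HASH", "").strip()
    if not api_id.isdigit():
        raise ValueError("API_ID must be set to the numeric ID from my.telegram.org")
    if not api_hash:
        raise ValueError("API_HASH must be set")
    # Telethon expects the API ID as an integer and the hash as a string.
    return int(api_id), api_hash
```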
Install DBT
```bash
pip install dbt-postgres
```
Usage
Run the Scraping Script
```bash
python telegram_scraper.py
```
Run the Data Cleaning Script
```bash
python data_cleaning.py
```
Run DBT Transformations
```bash
cd medical_transform
dbt run --models marts
dbt test
dbt docs generate
dbt docs serve
```
Challenges and Solutions
| Challenge | Solution |
|--|-|
| Rate limits on Telegram API | Implemented rate limiting and retries in the scraping script. |
| Inconsistent data formats in Telegram messages | Standardized text and date formats during cleaning. |
| Duplicate messages in scraped data | Removed duplicates based on message_id and channel. |
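The retry handling mentioned in the first row could take a shape like the one below. This is a generic sketch under stated assumptions, not the code used in the scraping script: the helper name `call_with_retries` and its parameters are hypothetical. It relies on the fact that Telethon's `FloodWaitError` exposes the required pause in its `seconds` attribute, and falls back to exponential backoff for exceptions without that attribute.

```python
import asyncio


async def call_with_retries(
    make_call, retry_on, max_attempts=5, default_wait=1.0, sleep=asyncio.sleep
):
    """Await `make_call()` and retry on `retry_on` errors, pausing between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Use the server-requested pause when the exception carries one,
            # otherwise back off exponentially: 1s, 2s, 4s, ...
            wait = getattr(exc, "seconds", default_wait * 2 ** (attempt - 1))
            await sleep(wait)
```

With Telethon, a call might be wrapped as `await call_with_retries(lambda: client.get_messages(channel, limit=100), FloodWaitError)`, where `FloodWaitError` comes from `telethon.errors`. The `sleep` parameter exists so the pause can be replaced in tests.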
Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a detailed description of your changes.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For questions or feedback, please contact:
- azazh w
- azazhwuletaw@gmail.com
- https://github.com/azazh
Owner
- Login: Azazh
- Kind: user
- Repositories: 1
- Profile: https://github.com/Azazh
GitHub Events
Total
- Push event: 6
- Create event: 1
Last Year
- Push event: 6
- Create event: 1
Dependencies
- Brlapi ==0.8.3
- Django ==4.0.4
- GitPython ==3.1.43
- Mako ==1.1.3
- MarkupSafe ==2.0.1
- Pillow ==9.1.0
- PyGObject ==3.42.1
- PyJWT ==2.3.0
- PyNaCl ==1.5.0
- PyYAML ==5.4.1
- Pygments ==2.18.0
- SecretStorage ==3.3.1
- TA-Lib ==0.5.1
- Telethon ==1.38.1
- aiohappyeyeballs ==2.4.4
- aiohttp ==3.11.11
- aiohttp-retry ==2.9.1
- aiosignal ==1.3.2
- amqp ==5.3.1
- annotated-types ==0.7.0
- antlr4-python3-runtime ==4.9.3
- appdirs ==1.4.4
- apturl ==0.5.2
- asgiref ==3.5.1
- asttokens ==3.0.0
- async-timeout ==5.0.1
- asyncssh ==2.19.0
- atpublic ==5.0
- attrs ==24.3.0
- autopep8 ==1.6.0
- bcrypt ==3.2.0
- billiard ==4.2.1
- blinker ==1.4
- celery ==5.4.0
- certifi ==2021.10.8
- cffi ==1.17.1
- chardet ==4.0.0
- charset-normalizer ==2.0.12
- click ==8.1.8
- click-didyoumean ==0.3.1
- click-plugins ==1.1.1
- click-repl ==0.3.0
- colorama ==0.4.4
- comm ==0.2.2
- command-not-found ==0.3
- configobj ==5.0.9
- cryptography ==44.0.0
- cupshelpers ==1.0
- dbus-python ==1.2.18
- debugpy ==1.8.11
- decorator ==5.1.1
- defer ==1.0.6
- defusedxml ==0.7.1
- dictdiffer ==0.9.0
- diskcache ==5.6.3
- distlib ==0.3.4
- distro ==1.7.0
- distro-info ==1.1
- django-allauth ==0.50.0
- django-cors-headers ==3.11.0
- django-rest-auth ==0.9.5
- django-rest-authtoken ==2.1.4
- djangorestframework ==3.13.1
- dpath ==2.2.0
- dulwich ==0.22.7
- duplicity ==0.8.21
- dvc-data ==3.16.7
- dvc-http ==2.32.0
- dvc-objects ==5.1.0
- dvc-render ==1.0.2
- dvc-studio-client ==0.21.0
- dvc-task ==0.40.2
- entrypoints ==0.4
- exceptiongroup ==1.2.2
- executing ==2.1.0
- fasteners ==0.14.1
- filelock ==3.6.0
- flatten-dict ==0.4.2
- flufl.lock ==8.1.0
- frozenlist ==1.5.0
- fsspec ==2024.12.0
- funcy ==2.0
- future ==0.18.2
- gitdb ==4.0.11
- grandalf ==0.8
- gto ==1.7.2
- gunicorn ==20.1.0
- gyp ==0.1
- httplib2 ==0.20.2
- hydra-core ==1.3.2
- idna ==3.3
- importlib-metadata ==4.6.4
- ipykernel ==6.29.5
- ipython ==8.30.0
- iterative-telemetry ==0.0.9
- jedi ==0.19.2
- jeepney ==0.7.1
- jupyter_client ==8.6.3
- jupyter_core ==5.7.2
- keyring ==23.5.0
- kombu ==5.4.2
- language-selector ==0.1
- launchpadlib ==1.10.16
- lazr.restfulclient ==0.14.4
- lazr.uri ==1.0.6
- lockfile ==0.12.2
- louis ==3.20.0
- macaroonbakery ==1.3.1
- markdown-it-py ==3.0.0
- matplotlib-inline ==0.1.7
- mdurl ==0.1.2
- monotonic ==1.6
- more-itertools ==8.10.0
- multidict ==6.1.0
- nest-asyncio ==1.6.0
- netifaces ==0.11.0
- networkx ==3.4.2
- numpy ==1.26.4
- oauthlib ==3.2.0
- olefile ==0.46
- omegaconf ==2.3.0
- orjson ==3.10.12
- packaging ==24.2
- pandas ==2.2.3
- paramiko ==2.9.3
- parso ==0.8.4
- pathspec ==0.12.1
- pexpect ==4.8.0
- platformdirs ==4.3.6
- prompt_toolkit ==3.0.48
- propcache ==0.2.1
- protobuf ==3.12.4
- psutil ==6.1.1
- psycopg2 ==2.9.5
- psycopg2-binary ==2.9.5
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- pyRFC3339 ==1.1
- pyaes ==1.6.1
- pyasn1 ==0.6.1
- pycairo ==1.20.1
- pycodestyle ==2.8.0
- pycparser ==2.21
- pycups ==2.0.1
- pydantic ==2.10.4
- pydantic_core ==2.27.2
- pydot ==3.0.3
- pygit2 ==1.16.0
- pygtrie ==2.5.0
- pymacaroons ==0.13.0
- pyparsing ==3.2.0
- python-apt ==2.4.0
- python-dateutil ==2.8.2
- python-debian ==0.1.43
- python-dotenv ==0.20.0
- python3-openid ==3.2.0
- pytz ==2022.1
- pyxdg ==0.27
- pyzmq ==26.2.0
- reportlab ==3.6.8
- requests ==2.27.1
- requests-oauthlib ==1.3.1
- rich ==13.9.4
- rsa ==4.9
- ruamel.yaml ==0.18.6
- ruamel.yaml.clib ==0.2.12
- scmrepo ==3.3.9
- semver ==3.0.2
- shellingham ==1.5.4
- shortuuid ==1.0.13
- shtab ==1.7.1
- six ==1.16.0
- smmap ==5.0.1
- sqlparse ==0.4.2
- sqltrie ==0.11.1
- stack-data ==0.6.3
- systemd-python ==234
- tabulate ==0.9.0
- toml ==0.10.2
- tomlkit ==0.13.2
- tornado ==6.4.2
- tqdm ==4.67.1
- traitlets ==5.14.3
- typer ==0.15.1
- typing_extensions ==4.12.2
- tzdata ==2023.4
- ubuntu-drivers-common ==0.0.0
- ubuntu-pro-client ==8001
- ufw ==0.36.1
- unattended-upgrades ==0.1
- urllib3 ==1.26.9
- usb-creator ==0.3.7
- vine ==5.1.0
- virtualenv ==20.13.0
- voluptuous ==0.15.2
- wadllib ==1.3.6
- wcwidth ==0.2.13
- xdg ==5
- xkit ==0.0.0
- yarl ==1.18.3
- zc.lockfile ==3.0.post1
- zipp ==1.0.0