https://github.com/azazh/ethiohub
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.1%, to scientific vocabulary)
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: Azazh
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Size: 10.7 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
- Created: about 1 year ago
- Last pushed: about 1 year ago
Metadata Files
- License: https://github.com/Azazh/EthioHub/blob/master/
# **EthioMart eCommerce**
## **Overview**
EthioMart Telegram Scraper is a Python-based tool designed to extract structured data from Ethiopian e-commerce Telegram channels. The scraper retrieves historical messages, monitors new messages in real time, and stores extracted data in a structured format, including text, metadata, and media. Additionally, EthioMart employs a fine-tuned LLM for Amharic Named Entity Recognition (NER) to extract key business entities such as product names, prices, and locations from the collected data.
## **Features**
- **Historical Data Scraping**: extracts up to 10,000 past messages per channel.
- **Real-Time Monitoring**: captures new messages as soon as they are posted.
- **Media Handling**: downloads and stores images from Telegram messages.
- **Structured Data Storage**: saves data in CSV and CoNLL formats for further processing.
- **Amharic NER**: uses a fine-tuned LLM to extract product names, prices, and locations.
- **Scalable & Customizable**: easily add more channels for scraping.
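The structured-storage feature can be sketched as follows. This is an illustrative helper, not the repository's actual code; the field names follow the Data Format section of this README.

```python
import csv

# Hypothetical sketch of the structured-storage step: messages go to a
# CSV file, labeled token/tag pairs go to a CoNLL file.
FIELDS = ["Channel Title", "Channel Username", "Message ID",
          "Message Text", "Date", "Media Path"]

def save_messages_csv(messages, path):
    """Write scraped messages (a list of dicts keyed by FIELDS) to CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(messages)

def save_conll(labeled_messages, path):
    """Write labeled messages in CoNLL format: one 'token label' pair
    per line, with a blank line separating messages."""
    with open(path, "w", encoding="utf-8") as f:
        for pairs in labeled_messages:
            for token, label in pairs:
                f.write(f"{token} {label}\n")
            f.write("\n")
```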
## **Installation**
### **1. Clone the Repository**
```bash
git clone https://github.com/Azazh/EthioHub.git
cd EthioHub
```
### **2. Install Dependencies**
```bash
pip install -r requirements.txt
```
### **3. Set Up Environment Variables**
Create a `.env` file in the project root and add your Telegram API credentials:

```
TG_API_ID=your_api_id
TG_API_HASH=your_api_hash
phone=your_phone_number
```
**Get API credentials** from [my.telegram.org](https://my.telegram.org/apps).
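Loading these variables can be sketched with the standard library alone. Projects commonly use the `python-dotenv` package for this instead; `load_env` below is an illustrative helper, not part of this repo.

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: read KEY=VALUE lines into os.environ.
    Blank lines and '#' comments are skipped; existing environment
    variables are not overwritten."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```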
## **Usage**
### **1. Start Historical Data Scraping**
Extract messages from selected channels and store them in `telegram_data.csv`:
```bash
python telegram_scraper.py
```
### **2. Enable Real-Time Monitoring**
Automatically log new messages from specified channels:
```bash
python real_time_monitor.py
```
### **3. Preprocess Extracted Data**
Preprocess text data, tokenize, and normalize Amharic text:
```bash
python preprocess_data.py
```
### **4. Label Data for NER (CoNLL Format)**
A subset of the dataset is manually labeled in the CoNLL format (one token and its entity tag per line, with a blank line between messages) to train the Amharic NER model. Example:

```
B-PRICE
1000 I-PRICE
I-PRICE
O
B-LOC
B-LOC
I-LOC
O
```
Labeled data is stored in `data/labeled_amharic_data.conll`.
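A small helper to read the labeled file back into token/tag pairs might look like this. This is an illustrative sketch; the repository's own loading code may differ.

```python
def read_conll(path):
    """Parse a CoNLL file into a list of sentences, each a list of
    (token, label) tuples. Blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            # The tag is the last whitespace-separated field on the line.
            token, label = line.rsplit(maxsplit=1)
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences
```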
## **Task 3: Fine-Tune NER Model**
Fine-tune a Named Entity Recognition (NER) model to extract key entities (e.g., products, prices, and locations) from Amharic Telegram messages.
### **Steps:**
1. Use a GPU-enabled environment (Google Colab or a local setup).
2. Use a pre-trained model such as XLM-RoBERTa, bert-tiny-amharic, or AfroXLMR.
3. Load the labeled dataset in CoNLL format.
4. Tokenize the data and align labels with tokens.
5. Configure training parameters (learning rate, epochs, batch size, evaluation strategy).
6. Fine-tune using Hugging Face's `Trainer` API.
7. Evaluate on the validation set and save the final model.
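Aligning word-level labels with subword tokens is the fiddliest of the steps above. A minimal sketch, assuming the per-token word indices come from a fast Hugging Face tokenizer's `word_ids()` output (an assumption about the pipeline, not code from this repo):

```python
def align_labels(word_ids, word_labels, label2id):
    """Align word-level NER labels to subword tokens, following the
    common Hugging Face token-classification recipe: special tokens
    (word id None) and subword continuations get -100 so the loss
    function ignores them.
    word_ids example: [None, 0, 0, 1, None] for "[CLS] to ##ken word [SEP]".
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_labels[wid]])
        previous = wid
    return aligned
```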
## **Task 4: Model Comparison & Selection**
Compare multiple models and select the best-performing one for entity extraction.
### **Steps:**
1. Fine-tune the candidate models: XLM-RoBERTa, DistilBERT, and mBERT.
2. Evaluate each model on the validation set.
3. Compare performance on accuracy, speed, and robustness.
4. Select the best-performing model for production based on these evaluation metrics.
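The comparison can be grounded in an entity-level F1 score. Below is a minimal standard-library sketch (not the `seqeval` library itself, though it follows the same exact-span-matching idea):

```python
def bio_spans(labels):
    """Collect (start, end, type) entity spans from a BIO tag sequence.
    Orphan or type-mismatched I- tags are dropped for simplicity."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is not None and lab[2:] == etype:
            continue  # entity continues
        else:  # "O", orphan I-, or type mismatch: close any open span
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(labels), etype))
    return set(spans)

def entity_f1(gold, pred):
    """Entity-level F1: a predicted span only counts on an exact match."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Each candidate model can then be ranked by this score alongside latency measurements.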
## **Task 5: Model Interpretability**
Ensure transparency and trust in the system by interpreting the NER model's decision-making process.
### **Steps:**
1. Apply SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to the model's predictions.
2. Analyze edge cases involving ambiguous text or overlapping entities.
3. Generate interpretability reports that explain model predictions and identify areas for improvement.
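The perturbation idea behind LIME can be illustrated with a leave-one-token-out importance check. This is a standard-library sketch in the spirit of LIME, not the actual SHAP/LIME libraries; `score_fn` stands in for the fine-tuned model's confidence.

```python
def token_importance(tokens, score_fn):
    """Drop each token in turn and record how much the model's
    confidence score drops. score_fn maps a token list to a float.
    Note: duplicate tokens collide in the result dict; a real
    implementation would key by position instead."""
    base = score_fn(tokens)
    return {
        tokens[i]: base - score_fn(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    }
```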
## **Project Structure**
```
.
├── .github/
│   └── workflows/
│       └── unittests.yml
├── .vscode/
├── data/
│   ├── labeled_amharic_data.conll
│   ├── preprocessed_telegram_data.csv
│   ├── telegram_data.csv
│   ├── telegram_data.xlsx
│   └── tokens_labels.conll
├── notebooks/
│   ├── data_ingestion_and_data_preprocessing.ipynb
│   ├── label_dataset_conll_format.ipynb
│   ├── fine_tune_ner_model.ipynb
│   ├── model_comparison.ipynb
│   └── model_interpretability.ipynb
├── scripts/
│   ├── preprocess_data.py
│   ├── fine_tune_model.py
│   ├── evaluate_models.py
│   └── interpret_model.py
├── src/
├── tests/
├── venv/
├── .env
├── .gitignore
└── LICENSE
```
## **Data Format**
Extracted data is saved in `telegram_data.csv` with the following fields:
| Channel Title | Channel Username | Message ID | Message Text | Date | Media Path |
|--------------|-----------------|------------|--------------|------|------------|
| FashionTera | @fashiontera | 12345 | "New dresses available!" | 2024-01-01 | photos/fashiontera_12345.jpg |
Labeled data is stored in CoNLL format for NER model training.
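Reading the CSV back for downstream processing is straightforward with the standard library; the default path and column names below follow the table above.

```python
import csv

def load_messages(path="telegram_data.csv"):
    """Read scraped messages into a list of dicts keyed by the
    column names of telegram_data.csv."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```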
## **Next Steps**
- Deploy the best-performing NER model to a production environment.
- Automate entity recognition and structured data extraction.
- Optimize data retrieval and storage for large-scale monitoring.
## **Contributing**
Want to improve this project? Fork the repo and submit a pull request!
## **License**
MIT License. Feel free to use and modify this project.
Owner
- Login: Azazh
- Kind: user
- Repositories: 1
- Profile: https://github.com/Azazh
GitHub Events
Total
- Push event: 23
- Create event: 2
Last Year
- Push event: 23
- Create event: 2
Dependencies
.github/workflows/unittests.yml
actions