https://github.com/azazh/ethiohub

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Azazh
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 10.7 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
License

https://github.com/Azazh/EthioHub/blob/master/

# **EthioMart eCommerce**  

## **Overview**  
EthioMart Telegram Scraper is a Python-based tool designed to extract structured data from Ethiopian e-commerce Telegram channels. The scraper retrieves historical messages, monitors new messages in real time, and stores extracted data in a structured format, including text, metadata, and media. Additionally, EthioMart employs a fine-tuned LLM for Amharic Named Entity Recognition (NER) to extract key business entities such as product names, prices, and locations from the collected data.

## **Features**  
- **Historical Data Scraping**: extracts up to 10,000 past messages per channel.
- **Real-Time Monitoring**: captures new messages instantly as they are posted.
- **Media Handling**: downloads and stores images from Telegram messages.
- **Structured Data Storage**: saves data in CSV and CoNLL formats for further processing.
- **Amharic NER**: a fine-tuned LLM extracts product names, prices, and locations.
- **Scalable & Customizable**: easily add more channels for scraping.



## **Installation**  

### **1. Clone the Repository**  
```bash
git clone https://github.com/Azazh/EthioHub.git
cd EthioHub
```

### **2. Install Dependencies**  
```bash
pip install -r requirements.txt
```

### **3. Set Up Environment Variables**  
Create a `.env` file in the project root and add your Telegram API credentials:  

```
TG_API_ID=your_api_id
TG_API_HASH=your_api_hash
phone=your_phone_number
```

 **Get API credentials** from [my.telegram.org](https://my.telegram.org/apps).  

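A minimal sketch of how a script might load these variables at startup, using only the standard library (the `python-dotenv` package offers the same via `load_dotenv()`; this project may use either approach):

```python
import os

def load_env(path=".env"):
    """Minimal .env reader: put KEY=value lines into os.environ.

    Skips blank lines and # comments; existing environment
    variables win over file values (setdefault).
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env()`, the credentials are available as `os.environ["TG_API_ID"]` and friends for the Telegram client.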


## **Usage**  

### **1. Start Historical Data Scraping**  
Extract messages from selected channels and store them in `telegram_data.csv`:  
```bash
python telegram_scraper.py
```

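Under the hood, each scraped message becomes one CSV row. A hedged sketch of that storage step, with the column names taken from the Data Format section below; the actual Telethon retrieval calls (e.g. `client.iter_messages`) are omitted:

```python
import csv

# Column schema of telegram_data.csv.
FIELDS = ["Channel Title", "Channel Username", "Message ID",
          "Message Text", "Date", "Media Path"]

def message_row(title, username, msg_id, text, date, media_path=""):
    """Shape one scraped Telegram message into the CSV schema."""
    return {"Channel Title": title, "Channel Username": username,
            "Message ID": msg_id, "Message Text": text,
            "Date": date, "Media Path": media_path}

def write_rows(path, rows):
    """Write message rows to disk with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

In the real scraper these helpers would be fed from the Telethon message loop, once per historical message per channel.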
### **2. Enable Real-Time Monitoring**  
Automatically log new messages from specified channels:  
```bash
python real_time_monitor.py
```

### **3. Preprocess Extracted Data**  
Preprocess text data, tokenize, and normalize Amharic text:  
```bash
python preprocess_data.py
```

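Normalization for Amharic typically includes folding homophone characters (distinct Ethiopic letters with identical pronunciation) to one canonical form. A small illustrative sketch; the mapping below covers only a few base characters and is not the project's actual table:

```python
# Hedged sketch of Amharic homophone folding: several Ethiopic letters
# sound the same, so NLP pipelines often map each family to one
# canonical character before tokenization. Base characters only here.
HOMOPHONES = str.maketrans({
    "ሐ": "ሀ", "ኀ": "ሀ",   # ha variants
    "ሠ": "ሰ",              # se variant
    "ዐ": "አ",              # a variant
    "ፀ": "ጸ",              # tse variant
})

def normalize_amharic(text: str) -> str:
    """Fold homophone Ethiopic characters to a canonical form."""
    return text.translate(HOMOPHONES)
```

A full implementation would extend the table to every vowel order of each family and add whitespace/punctuation cleanup.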

### **4. Label Data for NER (CoNLL Format)**  
A subset of the dataset is manually labeled in the CoNLL format to train the Amharic NER model. One token and its label per line, with blank lines separating messages. Example (the Amharic tokens are shown as `<token>` placeholders here):

```
<token>  B-PRICE
1000     I-PRICE
<token>  I-PRICE
<token>  O
<token>  B-LOC
<token>  B-LOC
<token>  I-LOC
<token>  O
```
Labeled data is stored in `data/labeled_amharic_data.conll`.
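A sketch of reading that file back into `(tokens, labels)` sentence pairs, assuming the two-column token/label layout shown above:

```python
def read_conll(path):
    """Parse a CoNLL file ('token label' per line, blank line between
    sentences) into a list of (tokens, labels) pairs."""
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:                      # sentence boundary
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            token, label = line.rsplit(maxsplit=1)
            tokens.append(token)
            labels.append(label)
    if tokens:                                # trailing sentence
        sentences.append((tokens, labels))
    return sentences
```

Splitting from the right (`rsplit`) keeps tokens that themselves contain whitespace-adjacent punctuation intact while still isolating the final label column.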



## **Task 3: Fine-Tune NER Model**  
Fine-tune a Named Entity Recognition (NER) model to extract key entities (e.g., products, prices, and locations) from Amharic Telegram messages.

### **Steps:**  
1. Use a GPU-supported environment (Google Colab or a local setup).
2. Start from a pre-trained model such as XLM-RoBERTa, bert-tiny-amharic, or AfroXLMR.
3. Load the labeled dataset in CoNLL format.
4. Tokenize the data and align labels with tokens.
5. Configure training parameters (learning rate, epochs, batch size, evaluation strategy).
6. Fine-tune using the Hugging Face Trainer API.
7. Evaluate on the validation set and save the final model.
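The tokenize-and-align step above is the fiddly part: subword tokenizers split one word into several pieces, and each piece needs a label. A sketch of the alignment logic, written against the `word_ids()` list that Hugging Face fast tokenizers expose (passed in directly here so the function stays library-free):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level NER labels onto subword tokens.

    word_ids: per-token word index (None for special tokens such as
    [CLS]/[SEP]), as returned by a HF fast tokenizer's word_ids().
    Only the first subword of each word keeps its label; special tokens
    and continuation subwords get ignore_index so the loss skips them.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned
```

For a sentence whose two words got labels `[1, 2]` and tokenized as `[CLS] w0 w0 w1 [SEP]`, the aligned sequence is `[-100, 1, -100, 2, -100]`.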



## **Task 4: Model Comparison & Selection**  
Compare multiple models and select the best-performing one for entity extraction.

### **Steps:**  
1. Fine-tune various models: XLM-RoBERTa, DistilBERT, and mBERT.
2. Evaluate each model on the validation set.
3. Compare performance based on accuracy, speed, and robustness.
4. Select the best model for production based on evaluation metrics.
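The selection step reduces to scoring each candidate on the same validation split and keeping the top scorer. A minimal sketch; the model names and numbers below are hypothetical placeholders, not measured results:

```python
def select_best(results, metric="f1"):
    """Return the model name with the highest value for `metric`,
    given results shaped as {model_name: {metric_name: value}}."""
    return max(results, key=lambda name: results[name][metric])

# Hypothetical scores for illustration only -- real values come from
# evaluating each fine-tuned checkpoint on the validation set.
example = {
    "xlm-roberta-base": {"f1": 0.81, "latency_ms": 120},
    "distilbert-base":  {"f1": 0.74, "latency_ms": 45},
    "mbert-base":       {"f1": 0.78, "latency_ms": 95},
}

select_best(example)  # → "xlm-roberta-base"
```

Switching `metric` to `"latency_ms"` (with `min` instead of `max`) would rank by speed instead, which is how the speed/robustness trade-off in the steps above can be made explicit.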



## **Task 5: Model Interpretability**  
Ensure transparency and trust in the system by interpreting the NER model's decision-making process.

### **Steps:**  
1. Implement SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations).
2. Analyze edge cases with ambiguous text or overlapping entities.
3. Generate interpretability reports to explain model predictions and identify areas for improvement.
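The core idea behind these perturbation-based explainers can be sketched without the libraries: drop each token and see how much the model's confidence moves. This leave-one-out sketch is a simplified stand-in for LIME/SHAP, with `score_fn` playing the role of the real NER model:

```python
def token_importance(tokens, score_fn):
    """LIME-flavoured leave-one-out attribution sketch.

    score_fn: any callable mapping a token list to a scalar confidence
    (a stand-in for the real NER model's score).
    Returns (token, delta) pairs: how much the score drops when that
    token is removed -- larger delta means more influential.
    """
    base = score_fn(tokens)
    return [(tok, base - score_fn(tokens[:i] + tokens[i + 1:]))
            for i, tok in enumerate(tokens)]
```

Real LIME samples many random perturbations and fits a local linear model rather than removing one token at a time, but the per-token deltas read the same way in an interpretability report.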



## **Project Structure**  

```
.
├── .github
│   └── workflows
│       └── unittests.yml
├── .vscode
├── data
│   ├── labeled_amharic_data.conll
│   ├── preprocessed_telegram_data.csv
│   ├── telegram_data.csv
│   ├── telegram_data.xlsx
│   └── tokens_labels.conll
├── notebooks
│   ├── data_ingestion_and_data_preprocessing.ipynb
│   ├── label_dataset_conll_format.ipynb
│   ├── fine_tune_ner_model.ipynb
│   ├── model_comparison.ipynb
│   └── model_interpretability.ipynb
├── scripts
│   ├── preprocess_data.py
│   ├── fine_tune_model.py
│   ├── evaluate_models.py
│   └── interpret_model.py
├── src
├── tests
├── venv
├── .env
├── .gitignore
└── LICENSE
```




## **Data Format**  
Extracted data is saved in `telegram_data.csv` with the following fields:  
| Channel Title | Channel Username | Message ID | Message Text | Date | Media Path |
|--------------|-----------------|------------|--------------|------|------------|
| FashionTera | @fashiontera | 12345 | "New dresses available!" | 2024-01-01 | photos/fashiontera_12345.jpg |

Labeled data is stored in CoNLL format for NER model training.



## **Next Steps**  
- Deploy the best-performing NER model in a production environment.
- Automate entity recognition and structured data extraction.
- Optimize data retrieval and storage for large-scale monitoring.



## **Contributing**  
Want to improve this project? Fork the repo and submit a pull request!  



## **License**  
MIT License. Feel free to use and modify this project.




Owner

  • Login: Azazh
  • Kind: user

GitHub Events

Total
  • Push event: 23
  • Create event: 2
Last Year
  • Push event: 23
  • Create event: 2

Dependencies

.github/workflows/unittests.yml actions