https://github.com/azazh/ethiohub
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.1%, to scientific vocabulary)
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: Azazh
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Size: 10.7 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
- Created: about 1 year ago
- Last pushed: about 1 year ago
Metadata Files
- License: https://github.com/Azazh/EthioHub/blob/master/
# **EthioMart eCommerce**
## **Overview**
EthioMart Telegram Scraper is a Python-based tool designed to extract structured data from Ethiopian e-commerce Telegram channels. The scraper retrieves historical messages, monitors new messages in real time, and stores extracted data in a structured format, including text, metadata, and media. Additionally, EthioMart employs a fine-tuned LLM for Amharic Named Entity Recognition (NER) to extract key business entities such as product names, prices, and locations from the collected data.
## **Features**
- **Historical Data Scraping**: extracts up to 10,000 past messages per channel.
- **Real-Time Monitoring**: captures new messages as soon as they are posted.
- **Media Handling**: downloads and stores images from Telegram messages.
- **Structured Data Storage**: saves data in CSV and CoNLL formats for further processing.
- **Amharic NER**: uses a fine-tuned LLM to extract product names, prices, and locations.
- **Scalable & Customizable**: easily add more channels for scraping.
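The structured-storage feature can be sketched as follows. This is an illustrative helper, not the repository's actual code; the field names follow the Data Format section of this README.

```python
import csv

# Hypothetical sketch of the structured-storage step: messages go to a
# CSV file, labeled token/tag pairs go to a CoNLL file.
FIELDS = ["Channel Title", "Channel Username", "Message ID",
          "Message Text", "Date", "Media Path"]

def save_messages_csv(messages, path):
    """Write scraped messages (a list of dicts keyed by FIELDS) to CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(messages)

def save_conll(labeled_messages, path):
    """Write labeled messages in CoNLL format: one 'token label' pair
    per line, with a blank line separating messages."""
    with open(path, "w", encoding="utf-8") as f:
        for pairs in labeled_messages:
            for token, label in pairs:
                f.write(f"{token} {label}\n")
            f.write("\n")
```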
## **Installation**
### **1. Clone the Repository**
```bash
git clone https://github.com/Azazh/EthioHub.git
cd EthioHub
```
### **2. Install Dependencies**
```bash
pip install -r requirements.txt
```
### **3. Set Up Environment Variables**
Create a `.env` file in the project root and add your Telegram API credentials:

```
TG_API_ID=your_api_id
TG_API_HASH=your_api_hash
phone=your_phone_number
```
**Get API credentials** from [my.telegram.org](https://my.telegram.org/apps).
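Loading these variables can be sketched with the standard library alone. Projects commonly use the `python-dotenv` package for this instead; `load_env` below is an illustrative helper, not part of this repo.

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: read KEY=VALUE lines into os.environ.
    Blank lines and '#' comments are skipped; existing environment
    variables are not overwritten."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```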
## **Usage**
### **1. Start Historical Data Scraping**
Extract messages from selected channels and store them in `telegram_data.csv`:
```bash
python telegram_scraper.py
```
### **2. Enable Real-Time Monitoring**
Automatically log new messages from specified channels:
```bash
python real_time_monitor.py
```
### **3. Preprocess Extracted Data**
Preprocess text data, tokenize, and normalize Amharic text:
```bash
python preprocess_data.py
```
### **4. Label Data for NER (CoNLL Format)**
A subset of the dataset is manually labeled in the CoNLL format (one token and its entity tag per line, with a blank line between messages) to train the Amharic NER model. Example:

```
B-PRICE
1000 I-PRICE
I-PRICE
O
B-LOC
B-LOC
I-LOC
O
```
Labeled data is stored in `data/labeled_amharic_data.conll`.
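A small helper to read the labeled file back into token/tag pairs might look like this. This is an illustrative sketch; the repository's own loading code may differ.

```python
def read_conll(path):
    """Parse a CoNLL file into a list of sentences, each a list of
    (token, label) tuples. Blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            # The tag is the last whitespace-separated field on the line.
            token, label = line.rsplit(maxsplit=1)
            current.append((token, label))
    if current:
        sentences.append(current)
    return sentences
```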
## **Task 3: Fine-Tune NER Model**
Fine-tune a Named Entity Recognition (NER) model to extract key entities (e.g., products, prices, and locations) from Amharic Telegram messages.
### **Steps:**
1. Use a GPU-enabled environment (Google Colab or a local setup).
2. Use a pre-trained model such as XLM-RoBERTa, bert-tiny-amharic, or AfroXLMR.
3. Load the labeled dataset in CoNLL format.
4. Tokenize the data and align labels with tokens.
5. Configure training parameters (learning rate, epochs, batch size, evaluation strategy).
6. Fine-tune using Hugging Face's `Trainer` API.
7. Evaluate on the validation set and save the final model.
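Aligning word-level labels with subword tokens is the fiddliest of the steps above. A minimal sketch, assuming the per-token word indices come from a fast Hugging Face tokenizer's `word_ids()` output (an assumption about the pipeline, not code from this repo):

```python
def align_labels(word_ids, word_labels, label2id):
    """Align word-level NER labels to subword tokens, following the
    common Hugging Face token-classification recipe: special tokens
    (word id None) and subword continuations get -100 so the loss
    function ignores them.
    word_ids example: [None, 0, 0, 1, None] for "[CLS] to ##ken word [SEP]".
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_labels[wid]])
        previous = wid
    return aligned
```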
## **Task 4: Model Comparison & Selection**
Compare multiple models and select the best-performing one for entity extraction.
### **Steps:**
1. Fine-tune the candidate models: XLM-RoBERTa, DistilBERT, and mBERT.
2. Evaluate each model on the validation set.
3. Compare performance on accuracy, speed, and robustness.
4. Select the best-performing model for production based on these evaluation metrics.
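The comparison can be grounded in an entity-level F1 score. Below is a minimal standard-library sketch (not the `seqeval` library itself, though it follows the same exact-span-matching idea):

```python
def bio_spans(labels):
    """Collect (start, end, type) entity spans from a BIO tag sequence.
    Orphan or type-mismatched I- tags are dropped for simplicity."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is not None and lab[2:] == etype:
            continue  # entity continues
        else:  # "O", orphan I-, or type mismatch: close any open span
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(labels), etype))
    return set(spans)

def entity_f1(gold, pred):
    """Entity-level F1: a predicted span only counts on an exact match."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Each candidate model can then be ranked by this score alongside latency measurements.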
## **Task 5: Model Interpretability**
Ensure transparency and trust in the system by interpreting the NER model's decision-making process.
### **Steps:**
1. Apply SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to the model's predictions.
2. Analyze edge cases involving ambiguous text or overlapping entities.
3. Generate interpretability reports that explain model predictions and identify areas for improvement.
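The perturbation idea behind LIME can be illustrated with a leave-one-token-out importance check. This is a standard-library sketch in the spirit of LIME, not the actual SHAP/LIME libraries; `score_fn` stands in for the fine-tuned model's confidence.

```python
def token_importance(tokens, score_fn):
    """Drop each token in turn and record how much the model's
    confidence score drops. score_fn maps a token list to a float.
    Note: duplicate tokens collide in the result dict; a real
    implementation would key by position instead."""
    base = score_fn(tokens)
    return {
        tokens[i]: base - score_fn(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    }
```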
## **Project Structure**
```
.
├── .github/
│   └── workflows/
│       └── unittests.yml
├── .vscode/
├── data/
│   ├── labeled_amharic_data.conll
│   ├── preprocessed_telegram_data.csv
│   ├── telegram_data.csv
│   ├── telegram_data.xlsx
│   └── tokens_labels.conll
├── notebooks/
│   ├── data_ingestion_and_data_preprocessing.ipynb
│   ├── label_dataset_conll_format.ipynb
│   ├── fine_tune_ner_model.ipynb
│   ├── model_comparison.ipynb
│   └── model_interpretability.ipynb
├── scripts/
│   ├── preprocess_data.py
│   ├── fine_tune_model.py
│   ├── evaluate_models.py
│   └── interpret_model.py
├── src/
├── tests/
├── venv/
├── .env
├── .gitignore
└── LICENSE
```
## **Data Format**
Extracted data is saved in `telegram_data.csv` with the following fields:
| Channel Title | Channel Username | Message ID | Message Text | Date | Media Path |
|--------------|-----------------|------------|--------------|------|------------|
| FashionTera | @fashiontera | 12345 | "New dresses available!" | 2024-01-01 | photos/fashiontera_12345.jpg |
Labeled data is stored in CoNLL format for NER model training.
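Reading the CSV back for downstream processing is straightforward with the standard library; the default path and column names below follow the table above.

```python
import csv

def load_messages(path="telegram_data.csv"):
    """Read scraped messages into a list of dicts keyed by the
    column names of telegram_data.csv."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```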
## **Next Steps**
- Deploy the best-performing NER model to a production environment.
- Automate entity recognition and structured data extraction.
- Optimize data retrieval and storage for large-scale monitoring.
## **Contributing**
Want to improve this project? Fork the repo and submit a pull request!
## **License**
MIT License. Feel free to use and modify this project.
Owner
- Login: Azazh
- Kind: user
- Repositories: 1
- Profile: https://github.com/Azazh
GitHub Events
Total
- Push event: 23
- Create event: 2
Last Year
- Push event: 23
- Create event: 2
Dependencies
.github/workflows/unittests.yml
actions