https://github.com/c974/p-rag
Repository
Basic Info
- Host: GitHub
- Owner: C974
- Language: HTML
- Default Branch: main
- Size: 0 Bytes
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Palestine RAG System (P-RAG)
A comprehensive Retrieval-Augmented Generation (RAG) system specifically designed for Palestine-related knowledge base queries and benchmarking. This system combines document processing, semantic search, and language model generation with specialized focus on Palestinian historical, political, and cultural content.
Project Overview
The Palestine RAG System consists of three main components:
- P-RAG.py - Core RAG system with intelligent document processing and caching
- benchmark_rag.py - Comprehensive benchmarking script with Bloom taxonomy evaluation
- pandascv.py - Advanced analytics and visualization for benchmark results
Table of Contents
- Features
- Installation
- Quick Start
- Components
- Configuration
- Usage Examples
- Benchmark System
- File Structure
- Troubleshooting
- Contributing
Features
Core RAG System (P-RAG.py)
- Multi-format Document Processing: Supports PDF, DOCX, TXT, MD, and JSON files
- Intelligent Caching: File-based caching with automatic change detection
- Advanced Chunking: Context-aware text segmentation with overlap
- GPU Acceleration: CUDA support for faster embedding generation
- Batch Processing: Efficient batch embedding creation with progress tracking
- Interactive Chat Interface: Real-time question-answering system
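The overlapping segmentation mentioned above can be illustrated with a minimal sketch. It assumes simple word-based chunks; the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical and are not taken from P-RAG.py, whose actual chunking logic may differ.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only, not the P-RAG.py implementation.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=2):
    print(chunk)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.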
Benchmarking System (benchmark_rag.py)
- Bloom Taxonomy Integration: Evaluates answers across 6 cognitive levels
- Multiple Language Models: Support for various open-source and commercial models
- Comprehensive Metrics: Accuracy, retrieval quality, and generation performance
- Detailed Reporting: JSON and HTML output with statistical analysis
- Comparative Analysis: Side-by-side model performance evaluation
Analytics Dashboard (pandascv.py)
- Data Visualization: Heatmaps and performance charts
- Cross-Model Comparison: RAG vs Open Source model analysis
- Statistical Insights: Performance trends and accuracy distributions
- Export Capabilities: CSV and HTML report generation
Installation
Prerequisites
- Python 3.8 or higher
- CUDA-compatible GPU (optional, for acceleration)
- Ollama (for local model inference)
Step 1: Clone the Repository
```bash
git clone <repository-url>
cd P-Rag
```
Step 2: Install Dependencies
```bash
pip install -r requirements.txt
```
Step 3: Set Up Ollama (Optional)
If using local models via Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull required models
ollama pull deepseek-r1:8b
ollama pull phi4-mini-reasoning:latest
ollama pull qwen3:8b
ollama pull gemma3:4b-it-qat
```
Step 4: Prepare Source Documents
```bash
mkdir Sources
# Add your PDF, DOCX, TXT, MD, or JSON files to the Sources folder
```
Quick Start
Basic RAG Usage
```bash
# Run the interactive Palestine RAG system
python P-RAG.py
```
The system will:
1. Load and process documents from the Sources folder
2. Create embeddings (cached for future use)
3. Start an interactive chat interface
Example interaction:
Palestine RAG Chatbot Ready!
You: What were the main causes of the Nakba?
Assistant: [Generated response based on your source documents]
Running Benchmarks
```bash
# Evaluate RAG system performance
python benchmark_rag.py
```
Analyzing Results
```bash
# Generate comprehensive analytics
python pandascv.py
```
Components
P-RAG.py - Core RAG System
Key Functions:
- load_documents_from_folder() - Multi-format document processing
- create_embeddings_batch() - Efficient batch embedding generation
- retrieve() - Semantic similarity search
- main() - Interactive chat interface
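To make the semantic similarity search concrete, here is a minimal sketch that ranks stored chunks by cosine similarity to a query vector. It is an assumption about the general technique, not the body of `retrieve()`: the names `cosine_similarity`, `retrieve_top_n`, and `toy_db` are hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve_top_n(query_vec, vector_db, top_n=3):
    """Return the top_n (chunk, similarity, filename) tuples, best first.

    vector_db is a list of (chunk, embedding, filename) entries (hypothetical layout).
    """
    scored = [
        (chunk, cosine_similarity(query_vec, emb), filename)
        for chunk, emb, filename in vector_db
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy data: hand-written vectors instead of model embeddings
toy_db = [
    ("chunk about olive harvests", [1.0, 0.0, 0.0], "a.txt"),
    ("chunk about 1948", [0.0, 1.0, 0.0], "b.txt"),
    ("chunk about agriculture", [0.6, 0.4, 0.0], "c.txt"),
]
results = retrieve_top_n([0.9, 0.1, 0.0], toy_db, top_n=2)
for chunk, similarity, filename in results:
    print(f"{filename}: {similarity:.3f} {chunk}")
```

In a real run, the query vector would come from the same embedding model that produced the stored vectors, since similarities across different models are not comparable.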
Supported File Types:
- PDF: Automatic text extraction with error handling
- DOCX: Microsoft Word document processing
- TXT/MD: Plain text and Markdown files
- JSON: Structured data with intelligent text extraction
Caching System:
- Vector embeddings cached in vector_db_cache.pkl
- File metadata tracking in source_metadata_cache.pkl
- Automatic cache invalidation on file changes
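One common way to detect file changes is to compare a stored snapshot of file sizes and modification times against the current folder contents. The sketch below assumes that approach; the helper names `snapshot_folder`, `cache_is_stale`, and `save_snapshot` are hypothetical, and the actual layout of source_metadata_cache.pkl may differ.

```python
import os
import pickle


def snapshot_folder(folder):
    """Map each file name in `folder` to its (size, modification time)."""
    snapshot = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            stat = os.stat(path)
            snapshot[name] = (stat.st_size, stat.st_mtime)
    return snapshot


def cache_is_stale(folder, metadata_path):
    """True if no snapshot was saved yet or the folder differs from it."""
    if not os.path.exists(metadata_path):
        return True
    with open(metadata_path, "rb") as fh:
        cached = pickle.load(fh)
    return cached != snapshot_folder(folder)


def save_snapshot(folder, metadata_path):
    """Persist the current snapshot after embeddings have been rebuilt."""
    with open(metadata_path, "wb") as fh:
        pickle.dump(snapshot_folder(folder), fh)
```

The metadata file should live outside the scanned folder, otherwise writing it would itself register as a change. Size and modification time are cheap to check; hashing file contents would be stricter but slower.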
benchmark_rag.py - Evaluation Framework
Evaluation Metrics:
- Accuracy: Exact and semantic answer matching
- Bloom Taxonomy: Cognitive complexity assessment
- Retrieval Quality: Relevance and coverage metrics
- Generation Performance: Response time and coherence
Supported Models:
- Hugging Face Transformers
- Ollama local models
- Custom embedding models
Output Formats:
- Detailed JSON reports
- Statistical summaries
- Performance comparisons
pandascv.py - Analytics & Visualization
Features:
- Performance heatmaps with seaborn styling
- Model type distinction (RAG vs Open Source)
- Statistical trend analysis
- Interactive HTML reports
- CSV data export
Visualization Types:
- Accuracy heatmaps by Bloom taxonomy level
- Model performance comparisons
- Distribution analysis
- Trend charts
Configuration
Model Configuration
```python
# Language Models (in P-RAG.py)
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'
Deepseek = 'deepseek-r1:8b'
phi4 = 'phi4-mini-reasoning:latest'
qwen = 'qwen3:8b'
# Embedding Models
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'  # Fast
embedding_model_name = 'BAAI/bge-large-en-v1.5'  # High quality
```
Directory Structure
P-Rag/
Sources/ # Source documents
detailed_Benchmark_Open Source/ # Open source results
detailed_Benchmark_RAG/ # RAG system results
P-RAG.py # Core RAG system
benchmark_rag.py # Benchmarking script
pandascv.py # Analytics dashboard
bloom_utils.py # Bloom taxonomy utilities
requirements.txt # Dependencies
README.md # This file
Cache Files
- vector_db_cache.pkl - Embedding vectors
- source_metadata_cache.pkl - File metadata
- vector_db_cache_bge.pkl - BGE model embeddings
Usage Examples
Example 1: Document Processing
```python
# Load specific documents
documents = load_specific_documents("./Sources", ["document1.pdf", "document2.docx"])
# Create embeddings
create_embeddings_batch(documents, batch_size=32)
```
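For readers curious what `batch_size` controls, the following sketch shows the general batching pattern under stated assumptions. `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names; `fake_embed` stands in for a real embedding model call, and none of this is taken from P-RAG.py.

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(chunks, embed_fn, batch_size=32):
    """Call embed_fn once per batch and collect the vectors in input order."""
    embeddings = []
    for batch in iter_batches(chunks, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch):
    """Placeholder for a model call: one single-number 'vector' per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Smaller batches lower peak memory use per model call; larger batches usually improve throughput when memory allows.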
Example 2: Custom Queries
```python
# Retrieve relevant context
results = retrieve("What is the significance of olive trees in Palestinian culture?", top_n=5)
# Process results
for chunk, similarity, filename, metadata in results:
    print(f"Source: {filename}, Similarity: {similarity:.3f}")
    print(f"Content: {chunk[:200]}...")
```
Example 3: Benchmark Evaluation
```python
# Initialize benchmarker
benchmarker = RAGBenchmarker()
# Load benchmark dataset
benchmark_data = load_benchmark("converted_benchmark_with_bloom.jsonl")
# Run evaluation
results = benchmarker.evaluate_benchmark(benchmark_data)
```
Benchmark System
Bloom Taxonomy Levels
The system evaluates responses across six cognitive levels:
- Remember - Factual recall and basic information retrieval
- Understand - Comprehension and explanation
- Apply - Using knowledge in new situations
- Analyze - Breaking down information and examining relationships
- Evaluate - Making judgments and assessments
- Create - Generating new ideas and solutions
Evaluation Process
- Question Loading: Import benchmark questions with Bloom classifications
- Context Retrieval: Semantic search through document corpus
- Answer Generation: LLM-based response generation
- Similarity Analysis: Compare generated answers to gold standards
- Performance Metrics: Calculate accuracy and quality scores
- Report Generation: Create detailed analysis reports
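The metrics step of this process can be sketched as a simple aggregation of per-question outcomes by Bloom level. This is an illustrative assumption: the record fields `bloom_level` and `correct` and the function names are hypothetical, and benchmark_rag.py may store and score results differently.

```python
from collections import defaultdict


def bloom_level_stats(results):
    """Aggregate accuracy per Bloom level.

    `results` is an iterable of dicts with a 'bloom_level' string and a
    'correct' boolean (hypothetical record layout).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in results:
        level = item["bloom_level"]
        totals[level] += 1
        if item["correct"]:
            correct[level] += 1
    return {
        level: {"accuracy": correct[level] / totals[level], "total": totals[level]}
        for level in totals
    }


def overall_accuracy(results):
    """Fraction of all questions answered correctly (0.0 for an empty run)."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for item in results if item["correct"]) / len(results)


sample_results = [
    {"bloom_level": "Remember", "correct": True},
    {"bloom_level": "Remember", "correct": True},
    {"bloom_level": "Remember", "correct": False},
    {"bloom_level": "Remember", "correct": True},
    {"bloom_level": "Apply", "correct": True},
    {"bloom_level": "Apply", "correct": False},
]
stats = bloom_level_stats(sample_results)
print(stats)
print(overall_accuracy(sample_results))
```

Reporting `total` next to each accuracy matters because levels with few questions produce noisy percentages.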
Sample Benchmark Output
```json
{
  "metadata": {
    "model": "deepseek-r1:8b",
    "embedding_model": "BAAI/bge-large-en-v1.5",
    "total_questions": 50,
    "evaluation_date": "2025-07-05"
  },
  "analysis": {
    "overall_accuracy": 0.76,
    "correct_answers": 38,
    "bloom_level_stats": {
      "Remember": {"accuracy": 0.85, "total": 10},
      "Understand": {"accuracy": 0.78, "total": 12},
      "Apply": {"accuracy": 0.70, "total": 8}
    }
  }
}
```
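A report in this shape can be read back with the standard `json` module. The sketch below copies a trimmed version of the sample above into a string, checks that `overall_accuracy` agrees with `correct_answers / total_questions` (38 / 50 = 0.76), and finds the lowest-scoring Bloom level among those listed. The variable names are hypothetical, and real reports may contain additional fields.

```python
import json

# Trimmed copy of the sample report shown above
report_text = """
{
  "metadata": {"model": "deepseek-r1:8b", "total_questions": 50},
  "analysis": {
    "overall_accuracy": 0.76,
    "correct_answers": 38,
    "bloom_level_stats": {
      "Remember": {"accuracy": 0.85, "total": 10},
      "Understand": {"accuracy": 0.78, "total": 12},
      "Apply": {"accuracy": 0.70, "total": 8}
    }
  }
}
"""

report = json.loads(report_text)
analysis = report["analysis"]

# Recompute the headline number from the raw counts
recomputed = analysis["correct_answers"] / report["metadata"]["total_questions"]
is_consistent = abs(recomputed - analysis["overall_accuracy"]) < 1e-9

# Identify the weakest level among those present in the report
level_stats = analysis["bloom_level_stats"]
weakest_level = min(level_stats, key=lambda level: level_stats[level]["accuracy"])

print(f"Consistent: {is_consistent}, weakest level: {weakest_level}")
```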
File Structure
P-Rag/
Core Files
P-RAG.py # Main RAG system
benchmark_rag.py # Evaluation framework
pandascv.py # Analytics dashboard
bloom_utils.py # Bloom taxonomy utilities
Data Directories
Sources/ # Source documents
detailed_Benchmark_Open Source/
detailed_Benchmark_RAG/
Cache Files
vector_db_cache.pkl
source_metadata_cache.pkl
vector_db_cache_bge.pkl
Configuration
requirements.txt # Python dependencies
README.md # Documentation
Output Files
*.json # Benchmark results
*.csv # Analytics data
*.html # Report visualizations
Troubleshooting
Common Issues
1. CUDA Out of Memory
```python
# Reduce batch size
create_embeddings_batch(documents, batch_size=16)  # Instead of 32
```
2. Ollama Connection Issues
```bash
# Check Ollama status
ollama list
ollama serve  # If not running
```
3. File Encoding Problems
```python
# The system tries multiple encodings automatically:
encodings = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
```
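A fallback over that encoding list can be sketched as follows. The function name `read_text_with_fallback` is hypothetical and the loop is an assumption about how such a fallback is typically written, not a copy of the P-RAG.py code.

```python
def read_text_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252", "iso-8859-1")):
    """Return (text, encoding) using the first encoding that decodes the file.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

Note that Python's `latin-1` codec maps every possible byte value to a character, so it never raises a decode error. With the order shown, any file that is not valid UTF-8 is read as `latin-1`, and the later entries act only as a safeguard.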
4. Empty Vector Database
- Ensure Sources folder contains supported file types
- Check file permissions and accessibility
- Verify files are not corrupted
5. Slow Performance
- Enable GPU acceleration: Install CUDA-compatible PyTorch
- Use smaller embedding models for faster processing
- Increase batch sizes if memory allows
Performance Optimization
Memory Usage:
- Monitor GPU memory with nvidia-smi
- Adjust batch sizes based on available memory
- Use CPU fallback for large models
Speed Improvements:
- Cache embeddings for reuse
- Use quantized models when available
- Enable mixed precision training
Contributing
Development Setup
- Fork the repository
- Create a feature branch
- Install development dependencies
- Make your changes
- Run tests and benchmarks
- Submit a pull request
Code Style
- Follow PEP 8 guidelines
- Use type hints where possible
- Add docstrings for all functions
- Include error handling and logging
Testing
```bash
# Run basic functionality tests
python P-RAG.py --test
python benchmark_rag.py --dry-run
```
License
This project is part of the QCRI Summer Internship Program 2025. Please refer to your institution's guidelines for usage and distribution.
Acknowledgments
- Qatar Computing Research Institute (QCRI)
- Palestine research community
- Open-source embedding model providers
- Bloom taxonomy framework developers
For support or questions, please contact the development team or create an issue in the repository.
Owner
- Name: Anas-AM2104114
- Login: C974
- Kind: user
- Repositories: 1
- Profile: https://github.com/C974
GitHub Events
Total
- Push event: 2
- Create event: 1
Last Year
- Push event: 2
- Create event: 1
Dependencies
- PyPDF2 >=3.0.1
- beautifulsoup4 >=4.11.0
- eventregistry >=9.0.0
- jsonlines >=3.1.0
- matplotlib >=3.6.0
- numpy >=1.24.0
- ollama >=0.1.7
- pandas >=1.5.0
- pymongo >=4.3.0
- pytest >=7.2.0
- pytest-cov >=4.0.0
- python-docx >=0.8.11
- requests >=2.28.0
- seaborn >=0.12.0
- sentence-transformers >=2.2.2
- torch >=2.0.0
- tqdm >=4.64.0
- transformers >=4.35.0