https://github.com/anand-kamble/eval-comparison
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: anand-kamble
- Language: Roff
- Default Branch: main
- Size: 54.8 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
Readme.md
Ragas Evaluation
This project evaluates a query engine on five metrics: faithfulness, answer relevancy, context precision, context recall, and harmfulness. It uses the Ragas library for scoring, LlamaIndex for indexing and querying, and Ollama for embeddings and model inference.
Prerequisites
Ensure you have the following installed:
- Python 3.x
- ragas
- llama-index
- ollama
- datasets
- dotenv
Installation
- Clone the repository:
```bash
git clone https://github.com/anand-kamble/eval-comparison/
```
- Install the required packages:
```bash
pip install -r requirements.txt
```
- If you are using OpenAI, set up your environment variables in a `.env` file.
Configuration
The script uses several models and datasets that need to be configured. Here are the key settings:
- Models: The default model used for both query and evaluation is `llama3`.
- Dataset: The default dataset is `HistoryOfAlexnetDataset`.
- Embeddings: Uses `OllamaEmbedding` with the `llama3` model.
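The settings above could be gathered in one place, for example as in the sketch below. The constant names `QUERY_MODEL`, `EVALUATION_MODEL`, and `DATASET` and the two base URLs are taken from the snippets in the Usage section; the `EvalConfig` dataclass and its field names are hypothetical and not part of the script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the settings listed above."""

    query_model: str = "llama3"        # model that answers the questions
    evaluation_model: str = "llama3"   # model that judges the answers
    dataset: str = "HistoryOfAlexnetDataset"
    query_base_url: str = "http://class02:11434"  # Ollama host used for querying
    eval_base_url: str = "http://class01:11434"   # Ollama host used for evaluation

    @property
    def data_dir(self) -> str:
        # Documents are read from ./data/<DATASET>
        return f"./data/{self.dataset}"


config = EvalConfig()
QUERY_MODEL = config.query_model
EVALUATION_MODEL = config.evaluation_model
DATASET = config.dataset
```

Switching to another dataset or model pair then only means constructing a different `EvalConfig`, for example `EvalConfig(dataset="PaulGrahamEssayDataset")`, where the dataset directory name is again an assumption.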
Usage
The script performs the following tasks:
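- Embedding Setup:
- Configures `OllamaEmbedding` for use with the query model.
```python
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

# Use the query model for embeddings and register it as the global default.
embeddings = OllamaEmbedding(model_name=QUERY_MODEL, base_url="http://class02:11434")
Settings.embed_model = embeddings
```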
- Document Loading:
- Loads documents from the specified dataset directory, supporting `.pdf` and `.txt` files.
```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(f"./data/{DATASET}", required_exts=[".pdf", ".txt"], recursive=True).load_data()
```
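- Vector Index Building:
- Constructs a vector index from the loaded documents. Note that the snippet indexes only the first two documents (`documents[:2]`).
```python
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents[:2])
```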
- Query Engine Building:
- Initializes a query engine using the specified LLM model.
```python
from llama_index.llms.ollama import Ollama

generator_llm = Ollama(model=QUERY_MODEL, request_timeout=600.0, base_url="http://class02:11434", additional_kwargs={"max_length": 512})
query_engine = vector_index.as_query_engine(llm=generator_llm)
```
- Evaluation:
- Evaluates the query engine against the specified metrics: faithfulness, answer relevancy, context precision, context recall, and harmfulness.
- Uses an `Ollama` model as the evaluation LLM.
```python
evaluator_llm = Ollama(model=EVALUATION_MODEL, base_url="http://class01:11434", request_timeout=600.0)
```
- Test Set Preparation:
- Converts a JSON dataset into a `Dataset` object for evaluation.
```python
import json
from datasets import Dataset

# Load the RAG dataset stored alongside the source documents.
with open(f"data/{DATASET}/rag_dataset.json", "r") as f:
    llama_rag_dataset = json.load(f)

# Collect questions and reference answers as parallel columns.
testset = {"question": [], "ground_truth": []}
for item in llama_rag_dataset["examples"]:
    testset["question"].append(item["query"])
    testset["ground_truth"].append(item["reference_answer"])

dataset = Dataset.from_dict(testset)
```
- Save Results:
- Saves the evaluation results to a CSV file and timing details to a text file.
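A minimal sketch of the save step, assuming the scores are available as a list of per-question dictionaries and the timings as step-name/seconds pairs. The helpers `timed` and `save_results` are hypothetical, and the standard `csv` module stands in for whatever the script actually uses to write its CSV.

```python
import csv
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def save_results(rows, timings, csv_path, txt_path):
    """Write per-question scores to a CSV file and step timings to a text file."""
    csv_path, txt_path = Path(csv_path), Path(txt_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.parent.mkdir(parents=True, exist_ok=True)

    # Column names are taken from the first row; all rows share the same keys.
    fieldnames = list(rows[0].keys()) if rows else []
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    with txt_path.open("w", encoding="utf-8") as f:
        for label, seconds in timings.items():
            f.write(f"{label}: {seconds:.2f} s\n")
```

In use, each stage would be wrapped as `with timed("evaluation", timings): ...` and `save_results` called once at the end with the collected rows and timings.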
Running the Script
To run the evaluation, execute:
```bash
python ragas_evaluation.py
```
This will output the evaluation metrics and timing results in the results directory.
Metrics
The following metrics are evaluated:
- Faithfulness: How accurately the answers reflect the source content.
- Answer Relevancy: How relevant the answers are to the given queries.
- Context Precision: How precisely the context is captured in the answers.
- Context Recall: How comprehensively the context is represented.
- Harmfulness: How potentially harmful the content of the answers is.
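As a rough illustration of how two of these scores are formed: Ragas describes faithfulness as the fraction of claims in the answer that are supported by the retrieved context, and context recall as the fraction of ground-truth statements that can be attributed to that context. In Ragas the per-claim verdicts come from the evaluation LLM; in the sketch below they are hard-coded, and the function name `supported_fraction` is hypothetical.

```python
def supported_fraction(verdicts):
    """Return the share of True verdicts, or 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)


# Hard-coded verdicts standing in for the evaluation LLM's judgments.
answer_claims_supported = [True, True, False, True]  # 3 of 4 answer claims supported
ground_truth_attributed = [True, False]              # 1 of 2 reference statements found

faithfulness_score = supported_fraction(answer_claims_supported)
context_recall_score = supported_fraction(ground_truth_attributed)

print(f"faithfulness   = {faithfulness_score:.2f}")    # 0.75
print(f"context recall = {context_recall_score:.2f}")  # 0.50
```

Because the verdicts themselves are produced by a model, the same answers can receive different scores under different evaluation LLMs.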
Results
The results will be saved in CSV format with the filename pattern:
results/<DATASET>_query_<QUERY_MODEL>_eval_<EVALUATION_MODEL>.csv
Timing results will be saved in a text file:
results/<DATASET>_query_<QUERY_MODEL>_eval_<EVALUATION_MODEL>.txt
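The two filename patterns can be built from the configuration values as in the sketch below. The helper name `result_paths` and its `results_dir` parameter are hypothetical; only the naming pattern itself comes from this README.

```python
from pathlib import Path


def result_paths(dataset, query_model, evaluation_model, results_dir="results"):
    """Return (csv_path, txt_path) following the naming pattern above."""
    stem = f"{dataset}_query_{query_model}_eval_{evaluation_model}"
    base = Path(results_dir)
    return base / f"{stem}.csv", base / f"{stem}.txt"


csv_path, txt_path = result_paths("HistoryOfAlexnetDataset", "llama3", "llama3")
print(csv_path.as_posix())  # results/HistoryOfAlexnetDataset_query_llama3_eval_llama3.csv
print(txt_path.as_posix())  # results/HistoryOfAlexnetDataset_query_llama3_eval_llama3.txt
```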
Troubleshooting
- Ensure the dataset directory is correctly structured and contains the required files.
- Check that the base URLs for the Ollama models are accessible.
- Adjust `request_timeout` and `max_length` as needed for your environment.
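One way to check the base URLs before a long run is a plain HTTP request, sketched below with the standard library only. It assumes the Ollama server answers a GET on its base URL with a 2xx status; the helper name `ollama_reachable` is hypothetical.

```python
import http.client
import urllib.request


def ollama_reachable(base_url, timeout=5.0):
    """Return True if an HTTP GET on `base_url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False
```

Calling `ollama_reachable("http://class01:11434")` and `ollama_reachable("http://class02:11434")` before starting the evaluation shows quickly whether either host is down or misnamed.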
Results
Comparison of Evaluation Metrics
Evaluation on the History of Alexnet Dataset
The following plot shows the comparison of evaluation metrics using the History of Alexnet Dataset:

Evaluation on the Paul Graham Essay Dataset
The following plot shows the comparison of evaluation metrics using the Paul Graham Essay Dataset:

Evaluation on the LLM Survey Paper Dataset
The following plot shows the comparison of evaluation metrics using the LLM Survey Paper Dataset:

Evaluation on the Mini Truthful QA Dataset

Explanation of the Plots
- Faithfulness: Measures how accurately the generated answers reflect the source content.
- Answer Relevancy: Evaluates how relevant the answers are to the queries.
- Context Precision: Assesses how precisely the context is captured in the answers.
- Context Recall: Measures how comprehensively the context is represented.
- Harmfulness: Evaluates the potential harmfulness of the content.
Each plot compares the metrics across different model combinations:
- Llama 3 (self-eval): Self-evaluation using Llama 3.
- Llama 3.1 (self-eval): Self-evaluation using Llama 3.1.
- Llama 3 (eval by Llama 3.1): Evaluation of Llama 3 by Llama 3.1.
- Llama 3.1 (eval by Llama 3): Evaluation of Llama 3.1 by Llama 3.
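The four combinations are every pairing of a query model with an evaluation model, which can be enumerated as in the sketch below. The tag `llama3` is the default named in this README; the tag `llama3.1` and the helper `combination_label` are assumptions made for illustration.

```python
from itertools import product

# Maps an Ollama model tag to the display name used in the plots.
# "llama3.1" is an assumed tag for the second model.
MODELS = {"llama3": "Llama 3", "llama3.1": "Llama 3.1"}


def combination_label(query_model, evaluation_model):
    """Build the plot label for one (query, evaluation) model pair."""
    query_name = MODELS[query_model]
    if query_model == evaluation_model:
        return f"{query_name} (self-eval)"
    return f"{query_name} (eval by {MODELS[evaluation_model]})"


for query_model, evaluation_model in product(MODELS, repeat=2):
    label = combination_label(query_model, evaluation_model)
    print(f"query={query_model}, eval={evaluation_model}: {label}")
```

Each pair corresponds to one run of the script with `QUERY_MODEL` and `EVALUATION_MODEL` set accordingly.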
Owner
- Name: Anand Kamble
- Login: anand-kamble
- Kind: user
- Location: Tallahassee, FL
- Company: Florida State University
- Website: https://anand-kamble.github.io/
- Repositories: 24
- Profile: https://github.com/anand-kamble
Graduate student in FSU Scientific Computing
Dependencies
- 215 dependencies
- ipykernel ^6.29.4 develop
- litellm >=1.25.2
- llama-index ^0.10.45.post1
- llama-index-embeddings-huggingface ^0.2.1
- llama-index-embeddings-instructor ^0.1.3
- llama-index-embeddings-ollama ^0.1.2
- llama-index-llms-ollama ^0.1.5
- python >=3.11,<3.13
- ragas ^0.1.9
- trulens-eval ^0.31.0