https://github.com/anand-kamble/eval-comparison

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: anand-kamble
Language: Roff
Default Branch: main
Size: 54.8 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog

Ragas Evaluation

This project evaluates the performance of a query engine using several metrics. It uses the Ragas library together with LlamaIndex and Ollama for embeddings and model inference.

Prerequisites

Ensure you have the following installed:

Python 3.x
ragas
llama-index
ollama
datasets
dotenv

Installation

Clone the repository:

```bash
git clone https://github.com/anand-kamble/eval-comparison/
```

Install the required packages:

```bash
pip install -r requirements.txt
```

If you are using OpenAI, set up your environment variables in a .env file.

Configuration

The script uses several models and datasets that need to be configured. Here are the key settings:

Models: The default model used for both query and evaluation is llama3.
Dataset: The default dataset is HistoryOfAlexnetDataset.
Embeddings: Uses OllamaEmbedding with the llama3 model.

Usage

The script performs the following tasks:

Embedding Setup:
- Configures OllamaEmbedding for use with the query model.

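```python
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

embeddings = OllamaEmbedding(model_name=QUERY_MODEL, base_url="http://class02:11434")
Settings.embed_model = embeddings
```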

Document Loading:
- Loads documents from the specified dataset directory, supporting .pdf and .txt files.

```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    f"./data/{DATASET}", required_exts=[".pdf", ".txt"], recursive=True
).load_data()
```

Vector Index Building:
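- Constructs a vector index from the loaded documents. Note that the call below indexes only the first two documents (`documents[:2]`).

```python
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents[:2])
```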

Query Engine Building:
- Initializes a query engine using the specified LLM model.

```python
from llama_index.llms.ollama import Ollama

generator_llm = Ollama(
    model=QUERY_MODEL,
    request_timeout=600.0,
    base_url="http://class02:11434",
    additional_kwargs={"max_length": 512},
)
query_engine = vector_index.as_query_engine(llm=generator_llm)
```

Evaluation:
- Evaluates the query engine against specified metrics such as faithfulness, answer relevancy, context precision, context recall, and harmfulness.
- Uses an Ollama model as the evaluation LLM.

```python
from llama_index.llms.ollama import Ollama

evaluator_llm = Ollama(
    model=EVALUATION_MODEL, base_url="http://class01:11434", request_timeout=600.0
)
```

Test Set Preparation:
- Converts a JSON dataset into a Dataset object for evaluation.

```python
import json

from datasets import Dataset

# Load the RAG dataset JSON that ships with the chosen dataset.
with open(f"data/{DATASET}/rag_dataset.json", "r") as f:
    llama_rag_dataset = json.load(f)

testset = {
    "question": [],
    "ground_truth": [],
}

# Map each example's query and reference answer onto the test set columns.
for item in llama_rag_dataset["examples"]:
    testset["question"].append(item["query"])
    testset["ground_truth"].append(item["reference_answer"])

dataset = Dataset.from_dict(testset)
```

Save Results:
- Saves the evaluation results to a CSV file and timing details to a text file.
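A minimal sketch of this step, using only the standard library. The helper names `timed` and `save_results`, the per-question row layout, and the timing line format are illustrative assumptions; the actual script may export its results differently.

```python
import csv
import time
from pathlib import Path


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def save_results(rows, timings, csv_path, txt_path):
    """Write metric rows to a CSV file and timings to a text file.

    rows: non-empty list of dicts sharing the same keys (hypothetical layout).
    timings: dict mapping a step name to elapsed seconds.
    """
    csv_path, txt_path = Path(csv_path), Path(txt_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.parent.mkdir(parents=True, exist_ok=True)

    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    with txt_path.open("w") as f:
        for step, seconds in timings.items():
            f.write(f"{step}: {seconds:.2f} s\n")
```

Wrapping the evaluation call in `timed(...)` yields the elapsed seconds that `save_results` then writes next to the CSV.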

Running the Script

To run the evaluation, execute:

```bash
python ragas_evaluation.py
```

This will output the evaluation metrics and timing results in the results directory.

Metrics

The following metrics are evaluated:

Faithfulness: How factually consistent the generated answer is with the retrieved context.
Answer Relevancy: How pertinent the generated answer is to the given query.
Context Precision: Whether the retrieved context chunks that are relevant to the ground truth are ranked above the irrelevant ones.
Context Recall: How much of the ground-truth answer can be attributed to the retrieved context.
Harmfulness: Whether the answer has the potential to cause harm.
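To make the first metric concrete: Ragas describes faithfulness as the fraction of claims in the answer that can be inferred from the retrieved context. The sketch below computes only that final ratio from per-claim verdicts that are assumed to be already available; in Ragas the claims and verdicts come from the evaluation LLM. The function name `faithfulness_ratio` is hypothetical.

```python
def faithfulness_ratio(claim_supported):
    """Fraction of answer claims judged to be supported by the context.

    claim_supported: list of booleans, one per claim extracted from the answer.
    """
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Three claims, two of which are supported by the retrieved context.
score = faithfulness_ratio([True, True, False])
print(f"{score:.2f}")  # prints 0.67
```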

Results

The results will be saved in CSV format with the filename pattern:

results/<DATASET>_query_<QUERY_MODEL>_eval_<EVALUATION_MODEL>.csv

Timing results will be saved in a text file:

results/<DATASET>_query_<QUERY_MODEL>_eval_<EVALUATION_MODEL>.txt
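The two filename patterns can be expressed as a small helper. This is a sketch: the function name `result_paths` and its `results_dir` parameter are hypothetical, and only the pattern itself comes from the text above.

```python
from pathlib import Path


def result_paths(dataset, query_model, evaluation_model, results_dir="results"):
    """Return (csv_path, txt_path) following the naming pattern above."""
    stem = f"{dataset}_query_{query_model}_eval_{evaluation_model}"
    base = Path(results_dir)
    return base / f"{stem}.csv", base / f"{stem}.txt"


csv_path, txt_path = result_paths("HistoryOfAlexnetDataset", "llama3", "llama3")
print(csv_path.as_posix())
# prints results/HistoryOfAlexnetDataset_query_llama3_eval_llama3.csv
```

Building the name with an f-string rather than `Path.with_suffix` is deliberate: `with_suffix` would treat the `.1` in a model tag such as `llama3.1` as the file extension and replace it.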

Troubleshooting

Ensure the dataset directory is correctly structured and contains the required files.
Check that the base URLs for the Ollama models are accessible.
Adjust request_timeout and max_length as needed for your environment.
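The first two checks can be scripted with the standard library. This is a sketch under assumptions: the helper names are hypothetical, the dataset JSON file name is passed as a parameter whose default is a guess, and the reachability check treats any HTTP response, even an error status, as proof that the server answers.

```python
import urllib.error
import urllib.request
from pathlib import Path


def check_dataset_dir(path, dataset_file="rag_dataset.json"):
    """Return a list of problems found in a dataset directory (empty if none)."""
    root = Path(path)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / dataset_file).is_file():
        problems.append(f"missing {dataset_file}")
    # Documents are loaded recursively, so search subdirectories as well.
    has_docs = any(
        p.suffix in {".pdf", ".txt"} for p in root.rglob("*") if p.is_file()
    )
    if not has_docs:
        problems.append("no .pdf or .txt documents found")
    return problems


def is_reachable(base_url, timeout=5.0):
    """Return True if an HTTP server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # The server responded, just with an error status.
    except (urllib.error.URLError, OSError):
        return False
```

For example, `is_reachable("http://class02:11434")` checks the host used for the query model, and `check_dataset_dir("data/HistoryOfAlexnetDataset")` returns an empty list when the expected files are present.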

Results

Comparison of Evaluation Metrics

Evaluation on the History of Alexnet Dataset

The following plot shows the comparison of evaluation metrics using the History of Alexnet Dataset: history_of_alexnet

Evaluation on the Paul Graham Essay Dataset

The following plot shows the comparison of evaluation metrics using the Paul Graham Essay Dataset: paul_graham

Evaluation on the Llm Survey Paper Dataset

The following plot shows the comparison of evaluation metrics using the Llm Survey Paper Dataset: Llm Survey Paper Dataset

Evaluation of the Mini Truthful QA Dataset

The following plot shows the comparison of evaluation metrics using the Mini Truthful QA Dataset: Mini Truthful QA Dataset

Explanation of the Plots

Faithfulness: Measures how factually consistent the generated answers are with the retrieved context.
Answer Relevancy: Evaluates how pertinent the answers are to the queries.
Context Precision: Assesses whether the retrieved context chunks that are relevant to the ground truth are ranked above the irrelevant ones.
Context Recall: Measures how much of the ground-truth answer can be attributed to the retrieved context.
Harmfulness: Evaluates whether the answers have the potential to cause harm.

Each plot compares the metrics across different model combinations:

Llama 3 (self-eval): Self-evaluation using Llama 3.
Llama 3.1 (self-eval): Self-evaluation using Llama 3.1.
Llama 3 (eval by Llama 3.1): Evaluation of Llama 3 by Llama 3.1.
Llama 3.1 (eval by Llama 3): Evaluation of Llama 3.1 by Llama 3.
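These four combinations are the Cartesian product of the two models in the query and evaluation roles. A sketch with `itertools.product`; the model tags `llama3` and `llama3.1`, the `MODELS` mapping, and the helper `combination_label` are assumptions chosen to mirror the labels above.

```python
from itertools import product

# Display names keyed by assumed Ollama model tags.
MODELS = {"llama3": "Llama 3", "llama3.1": "Llama 3.1"}


def combination_label(query_model, evaluation_model):
    """Build the plot label for one (query, evaluation) model pair."""
    if query_model == evaluation_model:
        return f"{MODELS[query_model]} (self-eval)"
    return f"{MODELS[query_model]} (eval by {MODELS[evaluation_model]})"


# Every pairing of a query model with an evaluation model.
labels = [combination_label(q, e) for q, e in product(MODELS, repeat=2)]
for label in labels:
    print(label)
```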

Owner

Name: Anand Kamble
Login: anand-kamble
Kind: user
Location: Tallahassee, FL
Company: Florida State University

Website: https://anand-kamble.github.io/
Repositories: 24
Profile: https://github.com/anand-kamble

Graduate student in FSU Scientific Computing

Dependencies

poetry.lock pypi

215 dependencies

pyproject.toml pypi

ipykernel ^6.29.4 develop
litellm >=1.25.2
llama-index ^0.10.45.post1
llama-index-embeddings-huggingface ^0.2.1
llama-index-embeddings-instructor ^0.1.3
llama-index-embeddings-ollama ^0.1.2
llama-index-llms-ollama ^0.1.5
python >=3.11,<3.13
ragas ^0.1.9
trulens-eval ^0.31.0
