llm-exams-evaluation
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: IMIS-MIKI
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 6.48 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
LLM Exams Evaluation
This repository accompanies the paper "Pass or Fail? Lessons from Evaluating LLMs on Medical Exams". It documents a systematic evaluation of multilingual and on-premise large language models (LLMs) applied to curated German and Portuguese medical exam datasets.
We evaluate several models (e.g., LLaMA, Mistral, Qwen2.5, Gemma, and GPT-4o) and support the use of retrieval-augmented generation (RAG).
Note: this requires an instance of, or a connection to, a server running Ollama with the models under analysis installed.
⚠️ The exam datasets used in this study are not publicly released due to privacy and copyright concerns. However, the results, scripts, and evaluation framework are provided to support reproducibility and further exploration.
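Querying a model hosted on the Ollama server can be sketched with the `ollama` client pinned in the dependency list below. The prompt format, answer instruction, and helper names here are illustrative assumptions, not the repository's actual prompts:

```python
from collections.abc import Sequence

def build_prompt(question: str, options: Sequence[str]) -> str:
    """Format a multiple-choice exam question as a single prompt string."""
    lettered = "\n".join(f"{chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nAnswer with a single letter."

def ask(model: str, question: str, options: Sequence[str]) -> str:
    """Send one question to the configured Ollama server and return the reply."""
    import ollama  # third-party client from the dependency list
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, options)}],
    )
    return response["message"]["content"]
```

With a reachable server, `ask("llama3", "Which organ pumps blood?", ["Heart", "Liver"])` would return the model's single-letter answer.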
Features
- Evaluation of over 1,500 curated medical questions (German & Portuguese)
- Support for prompt language comparison
- Retrieval-Augmented Generation (RAG) integration
- Runtime performance tracking
- On-premise execution for privacy-compliant benchmarking
Installation
Clone the project and create the Python environment using the provided environment.yml file:
```bash
git clone https://github.com/your-org/llm-exams-evaluation.git
cd llm-exams-evaluation
conda env create -f environment.yml
conda activate llm-exams-evaluation
```
Create a `.env` file specifying the target server running Ollama (`OLLAMA_HOST=''`) and/or an OpenAI API key (`OPENAI_API_KEY=''`).
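A minimal `.env` might look like the following; the values are illustrative (`http://localhost:11434` is Ollama's default endpoint):

```shell
# .env -- illustrative values only
OLLAMA_HOST='http://localhost:11434'   # server running Ollama
OPENAI_API_KEY=''                      # only needed for GPT-4o via the OpenAI API
```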
Populate the `curated/` folder with the target exams, using `exam_template.json` as a template.
RAG Setup
To enable Retrieval-Augmented Generation, additional setup may be required; some packages must be installed manually.
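RAG retrieves context passages relevant to each question before prompting the model. As a toy illustration of the retrieval step only (the repository itself relies on the langchain / llama-index / faiss stack listed under Dependencies), a bag-of-words retriever might look like:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

The retrieved passages would then be prepended to the exam question before it is sent to the model; production systems replace the bag-of-words vectors with learned embeddings and a FAISS index.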
Project Structure
├── curated/           # Target exams to evaluate (user-supplied; not publicly released)
├── images/            # Images generated during the results analysis
├── results/           # Results after running models on target exams
├── utils/             # Auxiliary code
├── .env               # Environment file
├── environment.yml    # Conda environment specification
├── exam_template.json # Template with the expected format for new exam entries
├── rag_test.py        # Run the RAG technique
├── README.md          # This file
├── reporter.py        # Analyse results, create plots and tables
└── runner.py          # Main runner; runs exams using the specified models
Hardware
Experiments were run on a local server equipped with an NVIDIA GA100 GPU. GPT-4o was accessed via OpenAI’s API.
Citation
If you use this work, please cite:
Macedo M., Händel C., Bueno A., Schreweis B., Saalfeld S., Ulrich H. Pass or Fail? Lessons from Evaluating LLMs on Medical Exams, 2025.
Owner
- Name: IMIS-MIKI
- Login: IMIS-MIKI
- Kind: organization
- Repositories: 1
- Profile: https://github.com/IMIS-MIKI
GitHub Events
Total
- Release event: 1
- Member event: 1
- Push event: 3
- Create event: 2
Last Year
- Release event: 1
- Member event: 1
- Push event: 3
- Create event: 2
Dependencies
- banks ==2.1.2
- beautifulsoup4 ==4.13.3
- click ==8.1.8
- contourpy ==1.3.1
- cycler ==0.12.1
- deprecated ==1.2.18
- dirtyjson ==1.0.8
- faiss-gpu ==1.7.2
- filetype ==1.2.0
- fonttools ==4.56.0
- fsspec ==2025.3.0
- griffe ==1.7.2
- jinja2 ==3.1.6
- kiwisolver ==1.4.8
- langchain ==0.3.20
- langchain-community ==0.3.19
- langchain-core ==0.3.43
- langchain-ollama ==0.2.3
- langchain-openai ==0.3.8
- langsmith ==0.3.8
- llama-cloud ==0.1.15
- llama-cloud-services ==0.6.6
- llama-index ==0.12.32
- llama-index-agent-openai ==0.4.6
- llama-index-cli ==0.4.1
- llama-index-core ==0.12.32
- llama-index-embeddings-openai ==0.3.1
- llama-index-indices-managed-llama-cloud ==0.6.9
- llama-index-llms-openai ==0.3.26
- llama-index-multi-modal-llms-openai ==0.4.3
- llama-index-program-openai ==0.3.1
- llama-index-question-gen-openai ==0.3.0
- llama-index-readers-file ==0.4.6
- llama-index-readers-llama-parse ==0.4.0
- llama-parse ==0.6.4.post1
- markupsafe ==3.0.2
- matplotlib ==3.10.1
- nest-asyncio ==1.6.0
- networkx ==3.4.2
- nltk ==3.9.1
- ollama ==0.4.7
- pandas ==2.2.3
- pillow ==11.1.0
- platformdirs ==4.3.7
- pyparsing ==3.2.1
- pypdf ==5.4.0
- python-dateutil ==2.9.0.post0
- pytz ==2025.1
- regex ==2024.11.6
- six ==1.17.0
- soupsieve ==2.6
- striprtf ==0.0.26
- tabulate ==0.9.0
- tiktoken ==0.9.0
- tzdata ==2025.1
- wrapt ==1.17.2