llm-exams-evaluation
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: IMIS-MIKI
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 6.48 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
LLM Exams Evaluation
This repository accompanies the paper "Pass or Fail? Lessons from Evaluating LLMs on Medical Exams". It documents a systematic evaluation of multilingual and on-premise large language models (LLMs) applied to curated German and Portuguese medical exam datasets.
We evaluate several models (e.g., LLaMA, Mistral, Qwen2.5, Gemma, and GPT-4o) and support the use of retrieval-augmented generation (RAG).
Note: this requires an instance of, or a connection to, a server running Ollama with the models under analysis installed.
⚠️ The exam datasets used in this study are not publicly released due to privacy and copyright concerns. However, the results, scripts, and evaluation framework are provided to support reproducibility and further exploration.
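Querying a model hosted on the Ollama server can be sketched with the `ollama` client pinned in the dependency list below. The prompt format, answer instruction, and helper names here are illustrative assumptions, not the repository's actual prompts:

```python
from collections.abc import Sequence

def build_prompt(question: str, options: Sequence[str]) -> str:
    """Format a multiple-choice exam question as a single prompt string."""
    lettered = "\n".join(f"{chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nAnswer with a single letter."

def ask(model: str, question: str, options: Sequence[str]) -> str:
    """Send one question to the configured Ollama server and return the reply."""
    import ollama  # third-party client from the dependency list
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, options)}],
    )
    return response["message"]["content"]
```

With a reachable server, `ask("llama3", "Which organ pumps blood?", ["Heart", "Liver"])` would return the model's single-letter answer.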
Features
- Evaluation of over 1,500 curated medical questions (German & Portuguese)
- Support for prompt language comparison
- Retrieval-Augmented Generation (RAG) integration
- Runtime performance tracking
- On-premise execution for privacy-compliant benchmarking
Installation
Clone the project and create the Python environment using the provided environment.yml file:
```bash
git clone https://github.com/your-org/llm-exams-evaluation.git
cd llm-exams-evaluation
conda env create -f environment.yml
conda activate llm-exams-evaluation
```
Create a `.env` file specifying the target server running Ollama (`OLLAMA_HOST=''`) and/or an OpenAI API key (`OPENAI_API_KEY=''`).
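A minimal `.env` might look like the following; the values are illustrative (`http://localhost:11434` is Ollama's default endpoint):

```shell
# .env -- illustrative values only
OLLAMA_HOST='http://localhost:11434'   # server running Ollama
OPENAI_API_KEY=''                      # only needed for GPT-4o via the OpenAI API
```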
Populate the `curated/` folder with the target exams, using `exam_template.json` as a template.
RAG Setup
To enable Retrieval-Augmented Generation, additional setup may be required; some packages must be installed manually.
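RAG retrieves context passages relevant to each question before prompting the model. As a toy illustration of the retrieval step only (the repository itself relies on the langchain / llama-index / faiss stack listed under Dependencies), a bag-of-words retriever might look like:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

The retrieved passages would then be prepended to the exam question before it is sent to the model; production systems replace the bag-of-words vectors with learned embeddings and a FAISS index.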
Project Structure
├── curated/           # Target exams to evaluate (user-supplied; not publicly released)
├── images/            # Images generated during the results analysis
├── results/           # Results after running models on target exams
├── utils/             # Auxiliary code
├── .env               # Environment file
├── environment.yml    # Conda environment specification
├── exam_template.json # Template with the expected format for new exam entries
├── rag_test.py        # Run the RAG technique
├── README.md          # This file
├── reporter.py        # Analyse results, create plots and tables
└── runner.py          # Main runner; runs exams using the specified models
Hardware
Experiments were run on a local server equipped with an NVIDIA GA100 GPU. GPT-4o was accessed via OpenAI’s API.
Citation
If you use this work, please cite:
Macedo M., Händel C., Bueno A., Schreweis B., Saalfeld S., Ulrich H. Pass or Fail? Lessons from Evaluating LLMs on Medical Exams, 2025.
Owner
- Name: IMIS-MIKI
- Login: IMIS-MIKI
- Kind: organization
- Repositories: 1
- Profile: https://github.com/IMIS-MIKI
GitHub Events
Total
- Release event: 1
- Member event: 1
- Push event: 3
- Create event: 2
Last Year
- Release event: 1
- Member event: 1
- Push event: 3
- Create event: 2
Dependencies
- banks ==2.1.2
- beautifulsoup4 ==4.13.3
- click ==8.1.8
- contourpy ==1.3.1
- cycler ==0.12.1
- deprecated ==1.2.18
- dirtyjson ==1.0.8
- faiss-gpu ==1.7.2
- filetype ==1.2.0
- fonttools ==4.56.0
- fsspec ==2025.3.0
- griffe ==1.7.2
- jinja2 ==3.1.6
- kiwisolver ==1.4.8
- langchain ==0.3.20
- langchain-community ==0.3.19
- langchain-core ==0.3.43
- langchain-ollama ==0.2.3
- langchain-openai ==0.3.8
- langsmith ==0.3.8
- llama-cloud ==0.1.15
- llama-cloud-services ==0.6.6
- llama-index ==0.12.32
- llama-index-agent-openai ==0.4.6
- llama-index-cli ==0.4.1
- llama-index-core ==0.12.32
- llama-index-embeddings-openai ==0.3.1
- llama-index-indices-managed-llama-cloud ==0.6.9
- llama-index-llms-openai ==0.3.26
- llama-index-multi-modal-llms-openai ==0.4.3
- llama-index-program-openai ==0.3.1
- llama-index-question-gen-openai ==0.3.0
- llama-index-readers-file ==0.4.6
- llama-index-readers-llama-parse ==0.4.0
- llama-parse ==0.6.4.post1
- markupsafe ==3.0.2
- matplotlib ==3.10.1
- nest-asyncio ==1.6.0
- networkx ==3.4.2
- nltk ==3.9.1
- ollama ==0.4.7
- pandas ==2.2.3
- pillow ==11.1.0
- platformdirs ==4.3.7
- pyparsing ==3.2.1
- pypdf ==5.4.0
- python-dateutil ==2.9.0.post0
- pytz ==2025.1
- regex ==2024.11.6
- six ==1.17.0
- soupsieve ==2.6
- striprtf ==0.0.26
- tabulate ==0.9.0
- tiktoken ==0.9.0
- tzdata ==2025.1
- wrapt ==1.17.2