Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: IMIS-MIKI
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 6.48 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 11 months ago · Last pushed 11 months ago
Metadata Files
  • Readme
  • License
  • Zenodo

README.md

LLM Exams Evaluation


This repository accompanies the paper "Pass or Fail? Lessons from Evaluating LLMs on Medical Exams". It documents a systematic evaluation of multilingual and on-premise large language models (LLMs) applied to curated German and Portuguese medical exam datasets.

We evaluate several models (e.g., LLaMA, Mistral, Qwen2.5, Gemma, and GPT-4o) and facilitate the use of retrieval-augmented generation (RAG).

Note: running the evaluation requires a local instance of, or a connection to a server running, Ollama with the models under analysis installed.

⚠️ The exam datasets used in this study are not publicly released due to privacy and copyright concerns. However, the results, scripts, and evaluation framework are provided to support reproducibility and further exploration.

Features

  • Evaluation of over 1,500 curated medical questions (German & Portuguese)
  • Support for prompt language comparison
  • Retrieval-Augmented Generation (RAG) integration
  • Runtime performance tracking
  • On-premise execution for privacy-compliant benchmarking
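The runtime-tracking code in this repository is not shown in the README; one common way to record per-call wall-clock time is a decorator, sketched here with a hypothetical `ask_model` stand-in for the actual model call:

```python
import time
from functools import wraps

def timed(fn):
    """Return the function's result together with its wall-clock runtime."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper

@timed
def ask_model(question: str) -> str:
    # Stand-in for a call to an Ollama / OpenAI model (hypothetical)
    return "answer to: " + question

answer, seconds = ask_model("What does insulin do?")
print(f"took {seconds:.4f}s")
```

`time.perf_counter()` is preferred over `time.time()` for measuring durations because it is monotonic and has higher resolution.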

Installation

Clone the project and create the Python environment using the provided environment.yml file:

```bash
git clone https://github.com/your-org/llm-exams-evaluation.git
cd llm-exams-evaluation
conda env create -f environment.yml
conda activate llm-exams-evaluation
```

Create a .env file pointing at the target server running Ollama (OLLAMA_HOST='') or providing an OpenAI API key (OPENAI_API_KEY='').
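A minimal .env might look like the following; both values are placeholders, not values from this repository (11434 is Ollama's default port):

```shell
# Address of the server running Ollama (placeholder)
OLLAMA_HOST='http://localhost:11434'
# OpenAI API key, only needed for GPT-4o runs (placeholder)
OPENAI_API_KEY='sk-...'
```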

Fill the curated/ folder with the target exams, using exam_template.json as a template.
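The actual schema of exam_template.json is not documented in this README; assuming only that curated exam files should share the template's top-level structure, a small sanity check could look like this (the field names in the example are hypothetical):

```python
import json

def template_keys(template_path: str) -> set:
    """Load a JSON template and return its set of top-level keys."""
    with open(template_path, encoding="utf-8") as f:
        return set(json.load(f).keys())

def check_exam(exam: dict, expected: set) -> list:
    """Return the top-level keys missing from an exam entry, sorted."""
    return sorted(expected - exam.keys())

# Example with a stand-in key set (the real exam_template.json
# schema may differ):
expected = {"question", "options", "answer", "language"}
exam = {"question": "…", "options": ["A", "B"], "language": "de"}
print(check_exam(exam, expected))  # → ['answer']
```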

RAG Setup

To enable Retrieval-Augmented Generation, additional setup may be required: some of the required packages must be installed manually.
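The repository's own RAG setup (rag_test.py, built on the llama-index/langchain dependencies listed below) is not shown here. As an illustration of the retrieval step only, the following is a dependency-free toy sketch: a bag-of-words "embedding" stands in for a real embedding model, and the top-ranked document is prepended to the prompt as context:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list) -> str:
    """Prepend retrieved context to the question, RAG-style."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["Insulin lowers blood glucose.",
        "Aspirin inhibits platelet aggregation."]
print(build_prompt("Which drug lowers blood glucose?", docs))
```

In the real pipeline, the embedding and generation steps are performed by the models configured via Ollama or the OpenAI API, and the document store is typically a vector index (e.g., FAISS, which appears in the dependency list).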

Project Structure

```
├── images/             # Images generated during the results analysis
├── results/            # Results after running models on target exams
├── utils/              # Auxiliary code
├── .env                # Environment file
├── environment.yml     # Conda environment specification
├── exam_template.json  # Template with expected format for new exam entries
├── rag_test.py         # Run RAG technique
├── README.md           # This file
├── reporter.py         # Analyse results, create plots and tables
└── runner.py           # Main runner, runs exams using the specified models
```

Hardware

Experiments were run on a local server equipped with an NVIDIA GA100 GPU. GPT-4o was accessed via OpenAI’s API.

Citation

If you use this work, please cite:

Macedo M., Händel C., Bueno A., Schreweis B., Saalfeld S., Ulrich H. Pass or Fail? Lessons from Evaluating LLMs on Medical Exams, 2025.
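In BibTeX form, assembled only from the citation above (no venue, DOI, or page fields are given in this README, so none are invented here):

```bibtex
@misc{macedo2025passorfail,
  title  = {Pass or Fail? Lessons from Evaluating LLMs on Medical Exams},
  author = {Macedo, M. and H{\"a}ndel, C. and Bueno, A. and Schreweis, B.
            and Saalfeld, S. and Ulrich, H.},
  year   = {2025}
}
```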

Owner

  • Name: IMIS-MIKI
  • Login: IMIS-MIKI
  • Kind: organization

GitHub Events

Total
  • Release event: 1
  • Member event: 1
  • Push event: 3
  • Create event: 2
Last Year
  • Release event: 1
  • Member event: 1
  • Push event: 3
  • Create event: 2

Dependencies

environment.yml (PyPI)
  • banks ==2.1.2
  • beautifulsoup4 ==4.13.3
  • click ==8.1.8
  • contourpy ==1.3.1
  • cycler ==0.12.1
  • deprecated ==1.2.18
  • dirtyjson ==1.0.8
  • faiss-gpu ==1.7.2
  • filetype ==1.2.0
  • fonttools ==4.56.0
  • fsspec ==2025.3.0
  • griffe ==1.7.2
  • jinja2 ==3.1.6
  • kiwisolver ==1.4.8
  • langchain ==0.3.20
  • langchain-community ==0.3.19
  • langchain-core ==0.3.43
  • langchain-ollama ==0.2.3
  • langchain-openai ==0.3.8
  • langsmith ==0.3.8
  • llama-cloud ==0.1.15
  • llama-cloud-services ==0.6.6
  • llama-index ==0.12.32
  • llama-index-agent-openai ==0.4.6
  • llama-index-cli ==0.4.1
  • llama-index-core ==0.12.32
  • llama-index-embeddings-openai ==0.3.1
  • llama-index-indices-managed-llama-cloud ==0.6.9
  • llama-index-llms-openai ==0.3.26
  • llama-index-multi-modal-llms-openai ==0.4.3
  • llama-index-program-openai ==0.3.1
  • llama-index-question-gen-openai ==0.3.0
  • llama-index-readers-file ==0.4.6
  • llama-index-readers-llama-parse ==0.4.0
  • llama-parse ==0.6.4.post1
  • markupsafe ==3.0.2
  • matplotlib ==3.10.1
  • nest-asyncio ==1.6.0
  • networkx ==3.4.2
  • nltk ==3.9.1
  • ollama ==0.4.7
  • pandas ==2.2.3
  • pillow ==11.1.0
  • platformdirs ==4.3.7
  • pyparsing ==3.2.1
  • pypdf ==5.4.0
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.1
  • regex ==2024.11.6
  • six ==1.17.0
  • soupsieve ==2.6
  • striprtf ==0.0.26
  • tabulate ==0.9.0
  • tiktoken ==0.9.0
  • tzdata ==2025.1
  • wrapt ==1.17.2