Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: tranthanhtung-sgu
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 25 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

COMP6011 Task 2: Medical LLM Evaluation

This repository contains the code, configuration, and results for the evaluation of open-source medical Large Language Models as part of the COMP6011 Advanced AI Research Topics assignment.

Evaluation Overview

The evaluation consists of two main parts:

  1. Standardized Benchmarking: Using the lm-evaluation-harness framework to evaluate four selected models (BioMistral-7B, Meditron-7B, InternistAI-7B, Asclepius-13B) on key medical benchmarks (MedQA, PubMedQA, MedMCQA, MMLU subsets).
  2. Qualitative Proof-of-Concept (PoC): Running a custom script (10cases.py) to evaluate the top-performing models from the benchmarking phase (InternistAI-7B and BioMistral-7B) on 10 complex NEJM clinical case studies provided in task2data.jsonl.
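The case file is JSON Lines (one JSON object per line). A minimal sketch of loading such a file, noting that the field names below are invented for illustration since the actual schema of task2data.jsonl is not documented in this README:

```python
import json
from io import StringIO

# Hypothetical two-line sample mimicking task2data.jsonl; the real file's
# field names are assumptions for illustration.
sample = StringIO(
    '{"case_id": 1, "presentation": "A 54-year-old with fever...", "final_diagnosis": "Endocarditis"}\n'
    '{"case_id": 2, "presentation": "A 30-year-old with rash...", "final_diagnosis": "Lyme disease"}\n'
)

# Parse each non-empty line as a separate JSON object
cases = [json.loads(line) for line in sample if line.strip()]
print(len(cases))  # 2
```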

Reproducing the Results

Follow these steps to set up the environment and reproduce the evaluations performed for this report.

1. Prerequisites

  • Operating System: Ubuntu (Tested on Ubuntu 24.04 LTS)
  • GPU: NVIDIA GPU with CUDA support and >= 24GB VRAM (Tested on NVIDIA RTX 3090).
  • NVIDIA Drivers: Compatible NVIDIA drivers installed.
  • Python: Python 3.8+ (Python 3.10 recommended). Managed via conda or venv.
  • Git: Required for cloning the repository.
  • Internet Connection: Required for downloading models and datasets.
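The prerequisites above can be sanity-checked from the standard library before installing anything; this helper and its key names are illustrative, not part of the repository:

```python
import shutil
import sys

def check_prerequisites(min_python=(3, 8)):
    """Return a dict of basic environment checks (illustrative sketch only)."""
    return {
        # Python 3.8+ as stated in the prerequisites
        "python_ok": sys.version_info[:2] >= min_python,
        # git must be on PATH to clone the repository
        "git_found": shutil.which("git") is not None,
        # nvidia-smi on PATH is a rough proxy for installed NVIDIA drivers
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }

checks = check_prerequisites()
print(checks)
```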

2. Setup Environment

  1. Clone this Repository:

     ```bash
     git clone https://github.com/tranthanhtung-sgu/lm-evaluation-harness.git  # Or your specific fork URL
     cd lm-evaluation-harness  # Or your repo directory name
     ```

  2. Create and Activate Virtual Environment:

    • Using conda:

      ```bash
      conda create -n lm-eval-task2 python=3.10 -y
      conda activate lm-eval-task2
      ```
    • Using venv:

      ```bash
      python3 -m venv venv
      source venv/bin/activate
      ```
  3. Install Dependencies: Install all required packages using the provided requirements.txt file. This includes lm-evaluation-harness, transformers, torch, accelerate, bitsandbytes, etc.

     ```bash
     pip install -r requirements.txt
     ```

     (Note: the base repository is a fork of lm-evaluation-harness. If requirements.txt was not generated from the active environment after installing everything, it is safer to run `pip install -e .` first, then `pip install transformers torch accelerate bitsandbytes pandas tqdm sentencepiece huggingface_hub`.)

3. Hugging Face Authentication (Important)

Several models evaluated (Meditron-7B, InternistAI-7B, Asclepius-13B, potentially based on Llama) require accepting license terms on their Hugging Face model pages and may require authentication via a Hugging Face access token.

  1. Login via CLI:

     ```bash
     huggingface-cli login
     ```

     Follow the prompts and paste your Hugging Face access token (ensure it has at least 'read' permissions).

  2. Accept License Terms: Manually visit the Hugging Face pages for the models listed below and accept any license terms if you haven't already:

    • epfl-llm/meditron-7b
    • internistai/base-7b-v0.2
    • starmpcc/Asclepius-13B (or its base model if applicable)

4. Running Standardized Benchmarks (lm-evaluation-harness)

These commands replicate the quantitative benchmarking reported in Section 4.1 of the report. They evaluate the models using 8-bit quantization on the specified tasks. The results generated by these runs are stored in the results/ directory.

Run each model's evaluation separately:

  • BioMistral-7B (BioMistral/BioMistral-7B)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/biomistral_7b_5shot_eval.json

    # 0-shot task (Note: the provided JSON was run 5-shot; use --num_fewshot 5 to replicate)
    lm_eval --model hf \
        --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
        --tasks pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/biomistral_7b_pubmedqa_5shot_eval.json
    ```

  • Meditron-7B (epfl-llm/meditron-7b)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=epfl-llm/meditron-7b,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/meditron_7b_5shot_eval.json
    ```

  • InternistAI-7B (internistai/base-7b-v0.2)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=internistai/base-7b-v0.2,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/internistai_7b_5shot_eval.json
    ```

  • Asclepius-13B (starmpcc/Asclepius-13B)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=starmpcc/Asclepius-13B,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/asclepius_13b_5shot_eval.json
    ```

Notes:

  • Ensure the results/ directory exists before running (`mkdir results`).
  • These evaluations can take several hours per model.
  • Monitor GPU memory usage (`nvidia-smi`). If out-of-memory errors occur, try setting `--batch_size 1`.
  • The JSON files provided in the repository (results_*.json) are the outputs of these commands.
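Once a run finishes, the per-task accuracies can be pulled out of the output JSON. A minimal sketch, assuming the lm-evaluation-harness v0.4-style layout with a top-level `results` mapping (metric key names such as `acc,none` vary between harness versions, so both spellings are tried):

```python
import json

# Synthetic stand-in for a results/*.json file; the numbers are made up.
raw = '{"results": {"pubmedqa": {"acc,none": 0.724}, "medmcqa": {"acc,none": 0.391}}}'

data = json.loads(raw)
# Prefer the v0.4-style "acc,none" key, falling back to the older "acc"
accuracies = {task: metrics.get("acc,none", metrics.get("acc"))
              for task, metrics in data["results"].items()}
print(accuracies)
```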

5. Running the NEJM Qualitative Proof-of-Concept (PoC)

This replicates the qualitative evaluation reported in Section 4.2 of the report, using the custom script 10cases.py.

  1. Ensure Data File: Verify that task2data.jsonl (containing the 10 NEJM cases) is present in the root directory of the repository.
  2. Run the Script: Execute the Python script. This script will load the selected models (InternistAI-7B and BioMistral-7B), process each case from task2data.jsonl, generate diagnostic suggestions, measure latency, and save the outputs.

     ```bash
     python 10cases.py
     ```
  3. Check Outputs:
    • An aggregated CSV file named nejm_poc_raw_outputs.csv will be created/overwritten in the root directory, containing the raw text outputs and latencies for both models for all 10 cases.
    • Individual text files for each case, containing the outputs from both models, will be saved in the nejm_outputs/ directory.
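The CSV aggregation step could look roughly like the following; the column names and rows are assumptions for illustration, since 10cases.py's exact schema is not shown in this README:

```python
import csv
from io import StringIO

# Hypothetical rows mimicking what 10cases.py might record per model per case
rows = [
    {"case": 1, "model": "InternistAI-7B", "output": "Dx: ...", "latency_s": 12.3},
    {"case": 1, "model": "BioMistral-7B", "output": "Dx: ...", "latency_s": 9.8},
]

# Write an aggregated CSV (to a string buffer here; the script writes a file)
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=["case", "model", "output", "latency_s"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```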

6. Expected Outputs and Verification

After running the steps above, you can verify the results against the report:

  • Standardized Benchmarks: Compare the accuracy values in the generated JSON files within the results/ directory against those presented in Table 4 (or your updated results table) in the report. Minor variations due to package versions or hardware specifics are possible but should be small.
  • NEJM PoC Outputs: Examine the raw text in nejm_poc_raw_outputs.csv and the individual files in nejm_outputs/. These form the basis for the qualitative analysis (Top-1/Top-3 accuracy, reasoning scores) presented in Section 4.2 and Table 5 of the report. The latencies recorded in the CSV should match those reported.
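The Top-1/Top-3 metric mentioned above can be sketched as follows; exact string matching of diagnoses is a simplifying assumption (the report's qualitative scoring is presumably done by hand):

```python
def top_k_accuracy(predictions, gold, k=3):
    """Fraction of cases whose gold diagnosis appears in the model's top-k list.

    predictions: list of ranked diagnosis lists, one per case.
    gold: list of reference diagnoses, one per case.
    """
    hits = sum(1 for preds, g in zip(predictions, gold) if g in preds[:k])
    return hits / len(gold)

# Two toy cases: gold diagnosis ranked 1st in the first, 3rd in the second
preds = [["Endocarditis", "Sepsis", "Lymphoma"], ["Gout", "Cellulitis", "DVT"]]
gold = ["Endocarditis", "DVT"]
print(top_k_accuracy(preds, gold, k=1))  # 0.5
print(top_k_accuracy(preds, gold, k=3))  # 1.0
```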

This setup allows for the reproduction of the core quantitative and qualitative evaluations presented in the accompanying report.

Owner

  • Name: Trần Thanh Tùng
  • Login: tranthanhtung-sgu
  • Kind: user

Citation (CITATION.bib)

@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}

GitHub Events

Total
  • Push event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Create event: 1

Dependencies

.github/workflows/new_tasks.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • tj-actions/changed-files v46.0.5 composite
.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/static.yml actions
  • actions/checkout v4 composite
  • actions/configure-pages v5 composite
  • actions/deploy-pages v4 composite
  • actions/upload-pages-artifact v3 composite
.github/workflows/unit_tests.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pre-commit/action v3.0.1 composite
lm_eval/tasks/japanese_leaderboard/requirements.txt pypi
  • emoji ==2.14.0
  • fugashi *
  • neologdn ==0.5.3
  • rouge_score >=0.1.2
pyproject.toml pypi
  • accelerate >=0.26.0
  • datasets >=2.16.0
  • dill *
  • evaluate >=0.4.0
  • evaluate *
  • jsonlines *
  • more_itertools *
  • numexpr *
  • peft >=0.2.0
  • pybind11 >=2.6.2
  • pytablewriter *
  • rouge-score >=0.0.4
  • sacrebleu >=1.5.0
  • scikit-learn >=0.24.1
  • sqlitedict *
  • torch >=1.8
  • tqdm-multiprocess *
  • transformers >=4.1
  • word2number *
  • zstandard *
requirements.txt pypi
setup.py pypi