research-ai-topic-2
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: tranthanhtung-sgu
- License: mit
- Language: Python
- Default Branch: main
- Size: 25 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
COMP6011 Task 2: Medical LLM Evaluation
This repository contains the code, configuration, and results for the evaluation of open-source medical Large Language Models as part of the COMP6011 Advanced AI Research Topics assignment.
Evaluation Overview
The evaluation consists of two main parts:
- Standardized Benchmarking: Using the `lm-evaluation-harness` framework to evaluate four selected models (BioMistral-7B, Meditron-7B, InternistAI-7B, Asclepius-13B) on key medical benchmarks (MedQA, PubMedQA, MedMCQA, MMLU subsets).
- Qualitative Proof-of-Concept (PoC): Running a custom script (`10cases.py`) to evaluate the top-performing models from the benchmarking phase (InternistAI-7B and BioMistral-7B) on 10 complex NEJM clinical case studies provided in `task2data.jsonl` (a loading sketch follows this list).
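Each line of `task2data.jsonl` should parse as a standalone JSON object. A minimal Python sketch for inspecting the file (the field names inside each record are not documented here, so the snippet only reports whatever keys it finds):
```python
# Inspect the NEJM case file; each line is assumed to be one JSON object.
import json

with open("task2data.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(cases)} cases")  # expected: 10
print("Keys in first record:", sorted(cases[0].keys()))
```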
Reproducing the Results
Follow these steps to set up the environment and reproduce the evaluations performed for this report.
1. Prerequisites
- Operating System: Ubuntu (Tested on Ubuntu 24.04 LTS)
- GPU: NVIDIA GPU with CUDA support and >= 24GB VRAM (Tested on NVIDIA RTX 3090; a quick VRAM check is sketched after this list).
- NVIDIA Drivers: Compatible NVIDIA drivers installed.
- Python: Python 3.8+ (Python 3.10 recommended), managed via `conda` or `venv`.
- Git: Required for cloning the repository.
- Internet Connection: Required for downloading models and datasets.
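Before going further, it can help to confirm the GPU requirement programmatically; a minimal sketch, assuming a `torch` build with CUDA support is available:
```python
# Quick check that PyTorch sees a CUDA GPU with enough VRAM for these runs.
import torch

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 24:
    print("Warning: below the recommended 24 GB; the 13B model may not fit in 8-bit")
```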
2. Setup Environment
Clone this Repository:
```bash
git clone https://github.com/tranthanhtung-sgu/lm-evaluation-harness.git  # Or your specific fork URL
cd lm-evaluation-harness  # Or your repo directory name
```
Create and Activate Virtual Environment:
- Using `conda`:
  ```bash
  conda create -n lm-eval-task2 python=3.10 -y
  conda activate lm-eval-task2
  ```
- Using `venv`:
  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
Install Dependencies: Install all required packages using the provided `requirements.txt` file. This includes `lm-evaluation-harness`, `transformers`, `torch`, `accelerate`, `bitsandbytes`, etc.
```bash
pip install -r requirements.txt
```
(Note: The base repository appears to be a fork of lm-evaluation-harness. If `requirements.txt` was not generated from your active environment after installing everything, it is safer to run `pip install -e .` first, then `pip install transformers torch accelerate bitsandbytes pandas tqdm sentencepiece huggingface_hub`.)
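After installation, a quick import check catches missing or broken packages before committing to the long evaluation runs (package names as listed above; versions are printed only for the record):
```python
# Post-install sanity check: key packages import cleanly and report versions.
import torch, transformers, accelerate, bitsandbytes

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
```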
3. Hugging Face Authentication (Important)
Several of the evaluated models (Meditron-7B, InternistAI-7B, Asclepius-13B, which are potentially Llama-based) require accepting license terms on their Hugging Face model pages and may require authentication via a Hugging Face access token.
Login via CLI:
```bash
huggingface-cli login
```
Follow the prompts and paste your Hugging Face access token (ensure it has at least 'read' permissions).
Accept License Terms: Manually visit the Hugging Face pages for the models listed below and accept any license terms if you haven't already:
- `epfl-llm/meditron-7b`
- `internistai/base-7b-v0.2`
- `starmpcc/Asclepius-13B` (or its base model if applicable)
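If you prefer not to use the CLI, `huggingface_hub` also supports logging in from Python; a sketch that reads the token from an environment variable rather than hard-coding it (the `HF_TOKEN` variable name is just a convention here):
```python
# Programmatic alternative to `huggingface-cli login`.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # token needs at least 'read' scope
```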
4. Running Standardized Benchmarks (lm-evaluation-harness)
These commands replicate the quantitative benchmarking reported in Section 4.1 of the report. They evaluate the models using 8-bit quantization on the specified tasks. The results generated by these runs are stored in the `results/` directory.
Run each model's evaluation separately:
BioMistral-7B (`BioMistral/BioMistral-7B`)
```bash
# 5-shot tasks
lm_eval --model hf \
    --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
    --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size auto \
    --output_path results/biomistral_7b_5shot_eval.json

# 0-shot task (Note: ran 5-shot in provided JSON, use --num_fewshot 5 to replicate)
lm_eval --model hf \
    --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
    --tasks pubmedqa \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size auto \
    --output_path results/biomistral_7b_pubmedqa_5shot_eval.json
```
Meditron-7B (`epfl-llm/meditron-7b`)
```bash
# 5-shot tasks
lm_eval --model hf \
    --model_args pretrained=epfl-llm/meditron-7b,load_in_8bit=True,trust_remote_code=True \
    --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size auto \
    --output_path results/meditron_7b_5shot_eval.json
```
InternistAI-7B (`internistai/base-7b-v0.2`)
```bash
# 5-shot tasks
lm_eval --model hf \
    --model_args pretrained=internistai/base-7b-v0.2,load_in_8bit=True,trust_remote_code=True \
    --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size auto \
    --output_path results/internistai_7b_5shot_eval.json
```
Asclepius-13B (`starmpcc/Asclepius-13B`)
```bash
# 5-shot tasks
lm_eval --model hf \
    --model_args pretrained=starmpcc/Asclepius-13B,load_in_8bit=True,trust_remote_code=True \
    --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
    --num_fewshot 5 \
    --device cuda:0 \
    --batch_size auto \
    --output_path results/asclepius_13b_5shot_eval.json
```
Notes:
* Ensure the `results/` directory exists before running (`mkdir results`).
* These evaluations can take several hours per model.
* Monitor GPU memory usage (`nvidia-smi`). If out-of-memory errors occur, try setting `--batch_size 1`.
* The JSON files provided in the repository (`results_*.json`) are the outputs of these commands.
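To pull the headline numbers out of the generated JSON files without opening them by hand, a small sketch is shown below. It assumes the harness's usual layout of a top-level "results" mapping; the exact metric key names (e.g. `acc` vs `acc,none`) vary across harness versions, so it simply prints anything that looks like an accuracy:
```python
# Print accuracy-like metrics from a lm-evaluation-harness output JSON.
import json
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "results/biomistral_7b_5shot_eval.json"
with open(path) as f:
    data = json.load(f)

for task, metrics in data["results"].items():
    for name, value in metrics.items():
        if str(name).startswith("acc") and isinstance(value, (int, float)):
            print(f"{task:35s} {name:12s} {value:.4f}")
```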
5. Running the NEJM Qualitative Proof-of-Concept (PoC)
This replicates the qualitative evaluation reported in Section 4.2 of the report, using the custom script `10cases.py`.
- Ensure Data File: Verify that `task2data.jsonl` (containing the 10 NEJM cases) is present in the root directory of the repository.
- Run the Script: Execute the Python script. This script will load the selected models (InternistAI-7B and BioMistral-7B), process each case from `task2data.jsonl`, generate diagnostic suggestions, measure latency, and save the outputs.
  ```bash
  python 10cases.py
  ```
- Check Outputs:
  - An aggregated CSV file named `nejm_poc_raw_outputs.csv` will be created/overwritten in the root directory, containing the raw text outputs and latencies for both models for all 10 cases.
  - Individual text files for each case, containing the outputs from both models, will be saved in the `nejm_outputs/` directory.
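A quick way to sanity-check the aggregated CSV is to load it with pandas; the column names below are assumptions about what `10cases.py` writes, so inspect the real headers first:
```python
# Summarize the PoC outputs; adjust column names to match the actual CSV.
import pandas as pd

df = pd.read_csv("nejm_poc_raw_outputs.csv")
print(df.columns.tolist())  # check the real headers before relying on them

# Assuming latency columns contain "latency" somewhere in their name:
latency_cols = [c for c in df.columns if "latency" in c.lower()]
if latency_cols:
    print(df[latency_cols].describe())
```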
6. Expected Outputs and Verification
After running the steps above, you can verify the results against the report:
- Standardized Benchmarks: Compare the accuracy values in the generated JSON files within the `results/` directory against those presented in Table 4 (or your updated results table) in the report. Minor variations due to package versions or hardware specifics are possible but should be small (a comparison sketch follows this list).
- NEJM PoC Outputs: Examine the raw text in `nejm_poc_raw_outputs.csv` and the individual files in `nejm_outputs/`. These form the basis for the qualitative analysis (Top-1/Top-3 accuracy, reasoning scores) presented in Section 4.2 and Table 5 of the report. The latencies recorded in the CSV should match those reported.
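For the benchmark side, the comparison can be automated; a sketch assuming the same top-level "results" JSON layout as above (the provided filename `results_biomistral.json` is hypothetical, standing in for the `results_*.json` files mentioned in the notes):
```python
# Flag accuracy metrics that drift beyond a tolerance between two result files.
import json

TOL = 0.01  # one percentage point; tighten or loosen as appropriate

def acc_metrics(path):
    with open(path) as f:
        results = json.load(f)["results"]
    return {(task, name): value
            for task, metrics in results.items()
            for name, value in metrics.items()
            if str(name).startswith("acc") and isinstance(value, (int, float))}

mine = acc_metrics("results/biomistral_7b_5shot_eval.json")
theirs = acc_metrics("results_biomistral.json")  # hypothetical provided file

for key in sorted(mine.keys() & theirs.keys()):
    diff = abs(mine[key] - theirs[key])
    print(f"{key}: {mine[key]:.4f} vs {theirs[key]:.4f}"
          + ("  <-- differs" if diff > TOL else ""))
```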
This setup allows for the reproduction of the core quantitative and qualitative evaluations presented in the accompanying report.
Owner
- Name: Trần Thanh Tùng
- Login: tranthanhtung-sgu
- Kind: user
- Repositories: 1
- Profile: https://github.com/tranthanhtung-sgu
Citation (CITATION.bib)
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = 12,
year = 2023,
publisher = {Zenodo},
version = {v0.4.0},
doi = {10.5281/zenodo.10256836},
url = {https://zenodo.org/records/10256836}
}
GitHub Events
Total
- Push event: 1
- Create event: 1
Last Year
- Push event: 1
- Create event: 1
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- tj-actions/changed-files v46.0.5 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- pypa/gh-action-pypi-publish release/v1 composite
- actions/checkout v4 composite
- actions/configure-pages v5 composite
- actions/deploy-pages v4 composite
- actions/upload-pages-artifact v3 composite
- actions/cache v3 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- pre-commit/action v3.0.1 composite
- emoji ==2.14.0
- fugashi *
- neologdn ==0.5.3
- rouge_score >=0.1.2
- accelerate >=0.26.0
- datasets >=2.16.0
- dill *
- evaluate >=0.4.0
- evaluate *
- jsonlines *
- more_itertools *
- numexpr *
- peft >=0.2.0
- pybind11 >=2.6.2
- pytablewriter *
- rouge-score >=0.0.4
- sacrebleu >=1.5.0
- scikit-learn >=0.24.1
- sqlitedict *
- torch >=1.8
- tqdm-multiprocess *
- transformers >=4.1
- word2number *
- zstandard *