Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: tranthanhtung-sgu
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 25 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

COMP6011 Task 2: Medical LLM Evaluation

This repository contains the code, configuration, and results for the evaluation of open-source medical Large Language Models as part of the COMP6011 Advanced AI Research Topics assignment.

Evaluation Overview

The evaluation consists of two main parts:

  1. Standardized Benchmarking: Using the lm-evaluation-harness framework to evaluate four selected models (BioMistral-7B, Meditron-7B, InternistAI-7B, Asclepius-13B) on key medical benchmarks (MedQA, PubMedQA, MedMCQA, MMLU subsets).
  2. Qualitative Proof-of-Concept (PoC): Running a custom script (10cases.py) to evaluate the top-performing models from the benchmarking phase (InternistAI-7B and BioMistral-7B) on 10 complex NEJM clinical case studies provided in task2data.jsonl.
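The case file is JSON Lines (one JSON object per line). A minimal sketch of loading such a file, noting that the field names below are invented for illustration since the actual schema of task2data.jsonl is not documented in this README:

```python
import json
from io import StringIO

# Hypothetical two-line sample mimicking task2data.jsonl; the real file's
# field names are assumptions for illustration.
sample = StringIO(
    '{"case_id": 1, "presentation": "A 54-year-old with fever...", "final_diagnosis": "Endocarditis"}\n'
    '{"case_id": 2, "presentation": "A 30-year-old with rash...", "final_diagnosis": "Lyme disease"}\n'
)

# Parse each non-empty line as a separate JSON object
cases = [json.loads(line) for line in sample if line.strip()]
print(len(cases))  # 2
```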

Reproducing the Results

Follow these steps to set up the environment and reproduce the evaluations performed for this report.

1. Prerequisites

  • Operating System: Ubuntu (Tested on Ubuntu 24.04 LTS)
  • GPU: NVIDIA GPU with CUDA support and >= 24GB VRAM (Tested on NVIDIA RTX 3090).
  • NVIDIA Drivers: Compatible NVIDIA drivers installed.
  • Python: Python 3.8+ (Python 3.10 recommended). Managed via conda or venv.
  • Git: Required for cloning the repository.
  • Internet Connection: Required for downloading models and datasets.
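The prerequisites above can be sanity-checked from the standard library before installing anything; this helper and its key names are illustrative, not part of the repository:

```python
import shutil
import sys

def check_prerequisites(min_python=(3, 8)):
    """Return a dict of basic environment checks (illustrative sketch only)."""
    return {
        # Python 3.8+ as stated in the prerequisites
        "python_ok": sys.version_info[:2] >= min_python,
        # git must be on PATH to clone the repository
        "git_found": shutil.which("git") is not None,
        # nvidia-smi on PATH is a rough proxy for installed NVIDIA drivers
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }

checks = check_prerequisites()
print(checks)
```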

2. Setup Environment

  1. Clone this Repository:

     ```bash
     git clone https://github.com/tranthanhtung-sgu/lm-evaluation-harness.git  # Or your specific fork URL
     cd lm-evaluation-harness  # Or your repo directory name
     ```

  2. Create and Activate Virtual Environment:

    • Using conda:

      ```bash
      conda create -n lm-eval-task2 python=3.10 -y
      conda activate lm-eval-task2
      ```
    • Using venv:

      ```bash
      python3 -m venv venv
      source venv/bin/activate
      ```
  3. Install Dependencies: Install all required packages using the provided requirements.txt file. This includes lm-evaluation-harness, transformers, torch, accelerate, bitsandbytes, etc.

     ```bash
     pip install -r requirements.txt
     ```

     (Note: the base repository is a fork of lm-evaluation-harness. If requirements.txt was not generated from the active environment after installing everything, it is safer to run `pip install -e .` first, then `pip install transformers torch accelerate bitsandbytes pandas tqdm sentencepiece huggingface_hub`.)

3. Hugging Face Authentication (Important)

Several models evaluated (Meditron-7B, InternistAI-7B, Asclepius-13B, potentially based on Llama) require accepting license terms on their Hugging Face model pages and may require authentication via a Hugging Face access token.

  1. Login via CLI:

     ```bash
     huggingface-cli login
     ```

     Follow the prompts and paste your Hugging Face access token (ensure it has at least 'read' permissions).

  2. Accept License Terms: Manually visit the Hugging Face pages for the models listed below and accept any license terms if you haven't already:

    • epfl-llm/meditron-7b
    • internistai/base-7b-v0.2
    • starmpcc/Asclepius-13B (or its base model if applicable)

4. Running Standardized Benchmarks (lm-evaluation-harness)

These commands replicate the quantitative benchmarking reported in Section 4.1 of the report. They evaluate the models using 8-bit quantization on the specified tasks. The results generated by these runs are stored in the results/ directory.

Run each model's evaluation separately:

  • BioMistral-7B (BioMistral/BioMistral-7B)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/biomistral_7b_5shot_eval.json

    # 0-shot task (Note: the provided JSON was run 5-shot; use --num_fewshot 5 to replicate)
    lm_eval --model hf \
        --model_args pretrained=BioMistral/BioMistral-7B,load_in_8bit=True,trust_remote_code=True \
        --tasks pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/biomistral_7b_pubmedqa_5shot_eval.json
    ```

  • Meditron-7B (epfl-llm/meditron-7b)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=epfl-llm/meditron-7b,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/meditron_7b_5shot_eval.json
    ```

  • InternistAI-7B (internistai/base-7b-v0.2)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=internistai/base-7b-v0.2,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/internistai_7b_5shot_eval.json
    ```

  • Asclepius-13B (starmpcc/Asclepius-13B)

    ```bash
    # 5-shot tasks
    lm_eval --model hf \
        --model_args pretrained=starmpcc/Asclepius-13B,load_in_8bit=True,trust_remote_code=True \
        --tasks medqa_4options,medmcqa,mmlu_professional_medicine,mmlu_clinical_knowledge,mmlu_anatomy,mmlu_college_biology,mmlu_medical_genetics,pubmedqa \
        --num_fewshot 5 \
        --device cuda:0 \
        --batch_size auto \
        --output_path results/asclepius_13b_5shot_eval.json
    ```

Notes:

  • Ensure the results/ directory exists before running (`mkdir results`).
  • These evaluations can take several hours per model.
  • Monitor GPU memory usage (`nvidia-smi`). If out-of-memory errors occur, try setting `--batch_size 1`.
  • The JSON files provided in the repository (results_*.json) are the outputs of these commands.
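Once a run finishes, the per-task accuracies can be pulled out of the output JSON. A minimal sketch, assuming the lm-evaluation-harness v0.4-style layout with a top-level `results` mapping (metric key names such as `acc,none` vary between harness versions, so both spellings are tried):

```python
import json

# Synthetic stand-in for a results/*.json file; the numbers are made up.
raw = '{"results": {"pubmedqa": {"acc,none": 0.724}, "medmcqa": {"acc,none": 0.391}}}'

data = json.loads(raw)
# Prefer the v0.4-style "acc,none" key, falling back to the older "acc"
accuracies = {task: metrics.get("acc,none", metrics.get("acc"))
              for task, metrics in data["results"].items()}
print(accuracies)
```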

5. Running the NEJM Qualitative Proof-of-Concept (PoC)

This replicates the qualitative evaluation reported in Section 4.2 of the report, using the custom script 10cases.py.

  1. Ensure Data File: Verify that task2data.jsonl (containing the 10 NEJM cases) is present in the root directory of the repository.
  2. Run the Script: Execute the Python script. This script will load the selected models (InternistAI-7B and BioMistral-7B), process each case from task2data.jsonl, generate diagnostic suggestions, measure latency, and save the outputs.

     ```bash
     python 10cases.py
     ```
  3. Check Outputs:
    • An aggregated CSV file named nejm_poc_raw_outputs.csv will be created/overwritten in the root directory, containing the raw text outputs and latencies for both models for all 10 cases.
    • Individual text files for each case, containing the outputs from both models, will be saved in the nejm_outputs/ directory.
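The CSV aggregation step could look roughly like the following; the column names and rows are assumptions for illustration, since 10cases.py's exact schema is not shown in this README:

```python
import csv
from io import StringIO

# Hypothetical rows mimicking what 10cases.py might record per model per case
rows = [
    {"case": 1, "model": "InternistAI-7B", "output": "Dx: ...", "latency_s": 12.3},
    {"case": 1, "model": "BioMistral-7B", "output": "Dx: ...", "latency_s": 9.8},
]

# Write an aggregated CSV (to a string buffer here; the script writes a file)
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=["case", "model", "output", "latency_s"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```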

6. Expected Outputs and Verification

After running the steps above, you can verify the results against the report:

  • Standardized Benchmarks: Compare the accuracy values in the generated JSON files within the results/ directory against those presented in Table 4 (or your updated results table) in the report. Minor variations due to package versions or hardware specifics are possible but should be small.
  • NEJM PoC Outputs: Examine the raw text in nejm_poc_raw_outputs.csv and the individual files in nejm_outputs/. These form the basis for the qualitative analysis (Top-1/Top-3 accuracy, reasoning scores) presented in Section 4.2 and Table 5 of the report. The latencies recorded in the CSV should match those reported.
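The Top-1/Top-3 metric mentioned above can be sketched as follows; exact string matching of diagnoses is a simplifying assumption (the report's qualitative scoring is presumably done by hand):

```python
def top_k_accuracy(predictions, gold, k=3):
    """Fraction of cases whose gold diagnosis appears in the model's top-k list.

    predictions: list of ranked diagnosis lists, one per case.
    gold: list of reference diagnoses, one per case.
    """
    hits = sum(1 for preds, g in zip(predictions, gold) if g in preds[:k])
    return hits / len(gold)

# Two toy cases: gold diagnosis ranked 1st in the first, 3rd in the second
preds = [["Endocarditis", "Sepsis", "Lymphoma"], ["Gout", "Cellulitis", "DVT"]]
gold = ["Endocarditis", "DVT"]
print(top_k_accuracy(preds, gold, k=1))  # 0.5
print(top_k_accuracy(preds, gold, k=3))  # 1.0
```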

This setup allows for the reproduction of the core quantitative and qualitative evaluations presented in the accompanying report.

Owner

  • Name: Trần Thanh Tùng
  • Login: tranthanhtung-sgu
  • Kind: user

Citation (CITATION.bib)

@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}

GitHub Events

Total
  • Push event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Create event: 1

Dependencies

.github/workflows/new_tasks.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • tj-actions/changed-files v46.0.5 composite
.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/static.yml actions
  • actions/checkout v4 composite
  • actions/configure-pages v5 composite
  • actions/deploy-pages v4 composite
  • actions/upload-pages-artifact v3 composite
.github/workflows/unit_tests.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pre-commit/action v3.0.1 composite
lm_eval/tasks/japanese_leaderboard/requirements.txt pypi
  • emoji ==2.14.0
  • fugashi *
  • neologdn ==0.5.3
  • rouge_score >=0.1.2
pyproject.toml pypi
  • accelerate >=0.26.0
  • datasets >=2.16.0
  • dill *
  • evaluate >=0.4.0
  • evaluate *
  • jsonlines *
  • more_itertools *
  • numexpr *
  • peft >=0.2.0
  • pybind11 >=2.6.2
  • pytablewriter *
  • rouge-score >=0.0.4
  • sacrebleu >=1.5.0
  • scikit-learn >=0.24.1
  • sqlitedict *
  • torch >=1.8
  • tqdm-multiprocess *
  • transformers >=4.1
  • word2number *
  • zstandard *
requirements.txt pypi
setup.py pypi