Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.6%) to scientific vocabulary
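The exact metric behind the "scientific vocabulary similarity" indicator is not documented on this page. As a minimal sketch of one plausible measure, the snippet below (in Ruby, the repository's listed language) computes Jaccard overlap between a text's word set and a small reference vocabulary; the `SCIENCE_VOCAB` word list and the `vocab_similarity` helper are illustrative assumptions, not the service's actual implementation.

```ruby
# Hypothetical example vocabulary -- the real scoring service's word list
# is not published here.
SCIENCE_VOCAB = %w[benchmark dataset evaluation inference model accuracy].freeze

# Jaccard similarity: |intersection| / |union| of the unique words in the
# text and the reference vocabulary. Returns a value in [0.0, 1.0].
def vocab_similarity(text, vocab = SCIENCE_VOCAB)
  words = text.downcase.scan(/[a-z]+/).uniq
  return 0.0 if words.empty?

  (words & vocab).size.to_f / (words | vocab).size
end

puts vocab_similarity("We benchmark each model on the AMEGA dataset") # ~0.27
```

A low score like the 6.6% reported above would simply mean few of the repository's words fall inside the reference vocabulary under whatever metric the service actually uses.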
Last synced: 6 months ago

Repository

HealthBench

Basic Info
  • Host: GitHub
  • Owner: StanfordBDHG
  • License: MIT
  • Language: Ruby
  • Default Branch: main
  • Size: 104 KB
Statistics
  • Stars: 3
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Medicine on the Edge: On-Device LLM Benchmark 🏥📱

Overview

This repository contains the codebase for benchmarking on-device Large Language Models (LLMs) for clinical reasoning. The study evaluates the feasibility, accuracy, and performance of mobile LLM inference using the AMEGA medical benchmark dataset.

Abstract 📄

The deployment of Large Language Models (LLMs) on mobile devices offers significant potential for medical applications, enhancing privacy, security, and cost-efficiency by eliminating reliance on cloud-based services and keeping sensitive health data local. However, the performance and accuracy of on-device LLMs in real-world medical contexts remain underexplored. In this study, we benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating accuracy, computational efficiency, and thermal limitations across various mobile devices. Our results indicate that compact general-purpose models like Phi-3 Mini achieve a strong balance between speed and accuracy, while medically fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably, deploying LLMs on older devices remains feasible, with memory constraints posing a greater challenge than raw processing power. Our study underscores the potential of on-device LLMs for healthcare while emphasizing the need for more efficient inference and models tailored to real-world clinical reasoning.

Features

  • Benchmarking mobile LLMs on real-world medical cases
  • Supports on-device execution for Apple Silicon (iPhones & iPads)
  • Thermal and battery performance tracking
  • Evaluation using AMEGA benchmark dataset
  • SpeziLLM & MLX-based inference for efficient on-device execution

Selected Models 🏆

  • Med42 🏅 (Highest Accuracy)
  • Aloe 8B 🏅 (Top-tier medical LLM)
  • Phi-3 Mini ⚡ (Best balance of speed & accuracy)
  • DeepSeek R1 (Reasoning-focused LLM)
  • Llama & Qwen Variants (General-purpose models)
  • and more

Owner

  • Name: Stanford Biodesign Digital Health
  • Login: StanfordBDHG
  • Kind: organization
  • Location: United States of America

Citation (CITATION.cff)

#
# This source file is part of the StanfordBDHG Template Application project
#
# SPDX-FileCopyrightText: 2023 Stanford University
#
# SPDX-License-Identifier: MIT
#

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Schmiedmayer"
  given-names: "Paul"
  orcid: "https://orcid.org/0000-0002-8607-9148"
title: "TemplateApplication"
doi: 10.5281/zenodo.7633671
url: "https://github.com/StanfordBDHG/TemplateApplication"
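The CITATION.cff above is plain YAML, so its fields can be read with Ruby's standard library. The sketch below is not part of the project; it assumes only the CFF 1.2.0 field names visible in the file (`authors`, `family-names`, `given-names`, `title`, `doi`) and the hypothetical helper name `citation_from_cff`.

```ruby
require "yaml"

# Build a simple citation string from a CITATION.cff file by reading the
# authors, title, and DOI fields defined by the CFF 1.2.0 format.
def citation_from_cff(path)
  cff = YAML.safe_load(File.read(path))
  authors = cff["authors"]
            .map { |a| "#{a['family-names']}, #{a['given-names']}" }
            .join("; ")
  "#{authors}. #{cff['title']}. doi:#{cff['doi']}"
end
```

Applied to the file shown above, this would yield "Schmiedmayer, Paul. TemplateApplication. doi:10.5281/zenodo.7633671" — note the citation metadata still references the StanfordBDHG TemplateApplication rather than this repository.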

GitHub Events

Total
  • Watch event: 11
  • Member event: 1
  • Push event: 2
  • Fork event: 1
  • Create event: 2
Last Year
  • Watch event: 11
  • Member event: 1
  • Push event: 2
  • Fork event: 1
  • Create event: 2

Dependencies

.github/workflows/beta-deployment.yml actions
.github/workflows/build-and-test.yml actions