llm-fhir-eval

Benchmarking Large Language Models for FHIR

https://github.com/flexpa/llm-fhir-eval

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

evals fhir fhir-llm fhir-resources fhirpath llm llm-evaluation-framework
Last synced: 6 months ago

Repository

Benchmarking Large Language Models for FHIR

Basic Info
  • Host: GitHub
  • Owner: flexpa
  • Language: TypeScript
  • Default Branch: master
  • Homepage:
  • Size: 119 KB
Statistics
  • Stars: 30
  • Watchers: 0
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Topics
evals fhir fhir-llm fhir-resources fhirpath llm llm-evaluation-framework
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
  • Readme
  • Citation

README.md

@flexpa/llm-fhir-eval

> [!NOTE]
> Follow the development progress on FHIR Chat.

Overview

@flexpa/llm-fhir-eval is an evaluation framework for benchmarking the performance of LLMs on FHIR-specific tasks, including generation, validation, and extraction. The framework systematically tests and validates LLM capabilities across healthcare interoperability tasks, ensuring they meet the standards required for effective FHIR implementations. It implements evaluations from prior art such as FHIR-GPT.

Benchmark

@flexpa/llm-fhir-eval benchmarks FHIR-specific tasks including:

  1. FHIR Resource Generation:

    • Generate accurate FHIR resources such as Patient, Observation, MedicationStatement, etc.
    • Test the ability to create complex resource relationships and validate terminology bindings.
  2. FHIR Resource Validation:

    • Validate FHIR resources using operations like $validate (see the sketch after this list).
    • Check for schema compliance, required field presence, and value set binding verification.
  3. Data Extraction:

    • Extract structured FHIR-compliant data from clinical notes and other unstructured data.
    • Evaluate the proficiency of LLMs in extracting specific healthcare data elements.
  4. Tool Use:

    • Test models' ability to use FHIR validation tools and other healthcare-specific functions.
    • Validate proper tool calling for FHIR operations.
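
To make the validation task concrete, here is a minimal TypeScript sketch that posts a candidate resource to a FHIR server's standard $validate operation and checks the returned OperationOutcome for error-level issues. The server base URL and helper function are illustrative placeholders, not code from this repository.

```typescript
// Minimal sketch: validate an LLM-generated resource with a FHIR server's $validate
// operation. The base URL below is a placeholder, not part of this repository.
const FHIR_BASE = "https://example.org/fhir";

async function validateResource(resource: { resourceType: string }): Promise<boolean> {
  const response = await fetch(`${FHIR_BASE}/${resource.resourceType}/$validate`, {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(resource),
  });

  // $validate returns an OperationOutcome; treat error or fatal issues as failure.
  const outcome = await response.json();
  const issues: Array<{ severity: string }> = outcome.issue ?? [];
  return !issues.some((i) => i.severity === "error" || i.severity === "fatal");
}

// Example: check a generated Patient resource.
validateResource({ resourceType: "Patient" }).then((ok) =>
  console.log(ok ? "Resource passed $validate" : "Resource failed $validate")
);
```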

Available Evaluations

  1. Data Extraction (evals/extraction/)

    • Description: Comprehensive evaluation of extracting structured FHIR data from unstructured clinical text.
    • Configurations: Both minimalist and specialist approaches available.
    • Test categories: Basic demographics, conditions, explanations of benefit, medication requests, observations.
  2. FHIR Resource Generation (evals/generation/)

    • Description: Tests the ability to generate valid FHIR resources and bundles.
    • Configurations: Zero-shot bundle generation and multi-turn tool use scenarios (a config sketch follows this list).
    • Models supported: GPT-3.5-turbo, GPT-4.1, o3 (low/high reasoning effort), Claude 3.5 Haiku, Claude 3.5 Sonnet, Claude Sonnet 4, Claude Opus 4
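
Each evaluation is driven by a promptfoo configuration file, either YAML or JavaScript (e.g. config-multi-turn-tool-use.js above). As a rough illustration of the shape such a config can take, the TypeScript sketch below defines a prompt, providers, and a test case that delegates grading to a custom assertion file; the prompt text, provider IDs, and paths are assumptions, not the repository's actual configuration.

```typescript
// Illustrative promptfoo config (assumed shape; the repository's real configs differ).
// Provider IDs, prompts, and file paths below are placeholders.
const config = {
  description: "Extract a FHIR resource from a clinical note",
  prompts: ["Extract a FHIR Patient resource from the following note:\n{{note}}"],
  providers: ["openai:gpt-4.1", "anthropic:messages:claude-3-5-sonnet-latest"],
  tests: [
    {
      vars: { note: "Jane Doe, born 1980-02-14, presents with ..." },
      assert: [
        // Delegate pass/fail grading to a custom assertion such as isBundle.mjs.
        { type: "javascript", value: "file://assertions/isBundle.mjs" },
      ],
    },
  ],
};

export default config;
```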

Custom Assertions

The framework includes custom assertion functions (a sketch of the assertion interface follows this list):

  • fhirPathEquals.mjs: Validates FHIRPath expressions
  • isBundle.mjs: Checks if output is a valid FHIR Bundle
  • metaElementMissing.mjs: Validates required metadata elements
  • validateOperation.mjs: Validates FHIR operation results
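
These are ordinary modules that promptfoo loads through javascript-type assertions. The sketch below shows roughly how a bundle check in the spirit of isBundle.mjs could be written; it is an assumed implementation (rendered in TypeScript for readability, where the repository uses plain .mjs files), not the actual file.

```typescript
// Assumed sketch of a promptfoo custom assertion, in the spirit of isBundle.mjs.
// promptfoo passes the raw model output and expects a boolean or a GradingResult.
interface GradingResult {
  pass: boolean;
  score: number;
  reason: string;
}

export default function isBundle(output: string): GradingResult {
  try {
    const parsed = JSON.parse(output);
    const pass = parsed?.resourceType === "Bundle" && Array.isArray(parsed.entry);
    return {
      pass,
      score: pass ? 1 : 0,
      reason: pass ? "Output is a FHIR Bundle" : "Output is not a FHIR Bundle",
    };
  } catch {
    return { pass: false, score: 0, reason: "Output is not valid JSON" };
  }
}
```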

Tools

  • validateFhirBundle.mjs: Tool for validating FHIR Bundle resources
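
Tools like this are exposed to the model as a name, a description, and a JSON Schema for their input. The declaration below is a plausible shape in the Anthropic tools format, offered as an assumption rather than the repository's actual validateFhirBundle definition.

```typescript
// Assumed tool declaration for a FHIR Bundle validator (Anthropic tools format);
// the actual validateFhirBundle.mjs in this repository may be defined differently.
export const validateFhirBundleTool = {
  name: "validate_fhir_bundle",
  description: "Validate a FHIR Bundle and return any OperationOutcome issues.",
  input_schema: {
    type: "object" as const,
    properties: {
      bundle: { type: "object", description: "The FHIR Bundle resource to validate" },
    },
    required: ["bundle"],
  },
};
```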

Custom Providers

  • AnthropicMessagesWithRecursiveToolCallsProvider.ts: Enhanced Anthropic provider with recursive tool calling (up to 10 levels deep)
  • OpenAiResponsesWithRecursiveToolCallsProvider.ts: Enhanced OpenAI provider with recursive tool calling

These providers enable multi-turn tool interactions where models can iteratively call validation tools to improve their FHIR resource generation.
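
The recursion pattern these providers implement can be sketched with the Anthropic Messages API: if the model stops to call a tool, execute it, append a tool_result message, and recurse until the model produces a final answer or the depth limit is hit. The tool runner, model name, and depth constant here are illustrative assumptions, not the repository's actual provider code.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const MAX_DEPTH = 10; // mirrors the "up to 10 levels" described above

// Placeholder tool runner; the real providers would dispatch to tools such as
// the FHIR Bundle validator.
async function runTool(name: string, input: unknown): Promise<string> {
  return JSON.stringify({ tool: name, input, result: "ok" });
}

async function completeWithTools(
  messages: Anthropic.MessageParam[],
  tools: Anthropic.Tool[],
  depth = 0
): Promise<Anthropic.Message> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514", // example model, not necessarily what the repo uses
    max_tokens: 4096,
    messages,
    tools,
  });

  // Done when the model gives a final answer or the recursion limit is reached.
  if (response.stop_reason !== "tool_use" || depth >= MAX_DEPTH) return response;

  // Execute each requested tool call and feed the results back as a user turn.
  const toolResults: Anthropic.ToolResultBlockParam[] = [];
  for (const block of response.content) {
    if (block.type === "tool_use") {
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: await runTool(block.name, block.input),
      });
    }
  }

  return completeWithTools(
    [
      ...messages,
      { role: "assistant", content: response.content },
      { role: "user", content: toolResults },
    ],
    tools,
    depth + 1
  );
}
```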

Commands to Run Evaluations

Install dependencies and set up environment variables:

```bash
yarn install
```

Copy the .env.template file to .env and supply your API keys for the models you plan to test.

Run an evaluation:

```bash
# Example: Run the extraction evaluation with minimalist config
promptfoo eval -c evals/extraction/config-minimalist.yaml

# Example: Run the FHIR bundle generation evaluation
promptfoo eval -c evals/generation/config-zero-shot-bundle.yaml

# Example: Run multi-turn tool use evaluation
promptfoo eval -c evals/generation/config-multi-turn-tool-use.js
```

The evaluation will print its performance metrics to the console and optionally save results to files.

Owner

  • Name: Flexpa
  • Login: flexpa
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: 'If you use this software, please cite it as below.'
authors:
  - family-names: 'Kelly'
    given-names: 'Joshua'
    orcid: 'https://orcid.org/0009-0000-7191-0595'
title: 'FHIR LLM Eval'
version: 0.0.1
date-released: 2024-11-22
url: 'https://github.com/flexpa/fhir-llm-evals'

GitHub Events

Total
  • Watch event: 33
  • Push event: 13
  • Fork event: 5
  • Create event: 1
Last Year
  • Watch event: 33
  • Push event: 13
  • Fork event: 5
  • Create event: 1

Dependencies

package.json (npm)
  • @types/bun: latest (development)
  • openai: ^4.53.1