verifact

Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

https://github.com/philipchung/verifact

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (10.7%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Basic Info

Host: GitHub
Owner: philipchung
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 374 KB

Statistics

Stars: 10
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Preprint Manuscript: VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Dataset: MIMIC-III-Ext-VeriFact-BHC: Labeled Propositions From Brief Hospital Course Summaries for Long-form Clinical Text Evaluation

VeriFactis a long-form text fact-checker that verifies any text written about a patient against their own electronic health record (EHR). VeriFact decomposes the text into a set of propositions which are individually verified against the patient's EHR. VeriFact combines RAG with LLM-as-a-Judge to perform fact verification.

VeriFact-BHC is a dataset to benchmark VeriFact performance against human clinicians. This dataset is derived from MIMIC-III Clinical Database v1.4. It contains human-written Brief Hospital Course (BHC) narratives typically found in discharge summaries and also a LLM-written BHC for 100 patients. It also contains the reference EHR for each patient. All BHC narratives are decomposed into propositions which are annotated by clinicians to develop a human clinician ground truth.

Scripts

Scripts to generate the unannotated VeriFact-BHC dataset, run the VeriFact system to generate AI rater labels, and compute interrater agreement and classificaiton metrics are contained in scripts. These scripts rely on the locally-deployed services which are described below.

Environment Variables

Add your environment variables in .env.

```sh

Hugging Face Token: https://huggingface.co/docs/hub/en/security-tokens

HFTOKEN=${HUGGINGFACEREAD_TOKEN}

Hugging Face Cache

HF_HOME=${HOME}/.cache/huggingface

Local Machine URL

SERVERBASEURL=localhost

Traefik Configuration

ADMIN_EMAIL=email@domain.edu ```

If you plan to commit this code to a public repo, git ignore the .env file so you do not commit your secrets. The .env is made available in this repo for visibility to default environment variables which are used by docker containers and scripts.

Python Environment

```sh

Create python virtual environment

uv venv

Create/update lock file (only if needed, otherwise skip this step)

uv lock

Sync virtual environment with lockfile specification

uv sync --all-packages ```

Services

All models used in VeriFact are local open-source models which can be launched using the provided docker-compose.yml configuration.

Local services include:

Local Embedding Model (requires GPU): customized infinity inference engine to serve the BAAI/bge-m3 model with both dense and sparse embedding generation.
Local Rerank Model (requires GPU): customized infinity inference engine to serve the BAAI/bge-reranker-v2-m3 reranking model.
Local LLM Inference Service (requires GPU): vLLM serving for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4.
Vector Database: locally hosted qdrant vector database
Traefik: router, reverse proxy, load balancer
Redis: key-value store for redis-queue
Redis-Queue (RQ) Dashboard: monitoring rq jobs
Prometheus + Grafana: monitoring dashboard for vLLM.

These services are all containerized using docker. Docker Compose is used to coordinate launching and stopping these microservices.

```sh

Start All Services (in detached mode)

docker compose up -d

Check All Services Running

docker ps

Check Logs

docker logs

Inspect Each Container

docker exec -it /bin/sh

Stop All Services

docker compose down ```

Example Service Deployment

LLM Inference is significantly more compute intensive than Embedding or Reranking. Thus it is recommended to setup LLMs in data parallel configuration. Embedding and Reranking models can share a GPU.

On a server with 4-GPUs (using the docker-compose.yml in this project):

```sh

Launch Traefik for reverse proxy & load balancing

Traefik Dashboard: ${SERVER_URL}:8090/dashboard

docker compose up traefik -d

Launch Qdrant for vector database, Redis & RQ-Dashboard for tracking tasks in queue

Qdrant Dashboard: ${SERVER_URL}:6333/dashboard

Redis Stack Dashboard: ${SERVER_URL}:6380/redis-stack

RQ-Dashboard: ${SERVER_URL}:9181

docker compose up qdrant redis rq-dashboard -d

Launch Local LLM Inference API in Tensor Parallel Configuration (uses vLLM)

The default LLM is a quantized Llama 3.1 70B model, which requires 37GB VRAM for the model itself. This container configures the LLM inference service in tensor parallelism which splits model weights across 2 GPUs.

docker compose up llm-tp2 -d

Alternatively, launch local LLM Inference on a single GPU. Multiple docker containers can be launched and traefik will distribute API requests across the LLM containers in round-robin fashion

docker compose up llm0 llm1 llm2 -d

Launch Prometheus, Grafana dashboards for monitoring vLLM inference throughput

Prometheus Dashboard: ${SERVER_URL}:9090

Grafana Dashboard: ${SERVER_URL}:3000

docker compose up prometheus grafana -d

Launch Embedding & Rerank Inference API on GPU3 (uses Infinity Embeddings)

These containers are customized for compatibility with BGE-M3 model and to reduce VRAM use

docker compose up embed1 rerank1 -d ```

Specific configurations for ports and URLs are found in the .env file that docker-compose.yml references.

Docker services are reached via Traefik reverse proxy and load balancer. Using Traefik, multiple docker containers providing LLM inference can service the same API endpoint. Same is true for embedding and rerank inference services. Traefik will load balance the API requests equally across docker containers hosting the same service.

Parallel tasks are managed using rq which is a queue backed by redis.

The vLLM inference service metrics are monitored via Prometheus and a Grafana dashboard. Prometheus and Grafana setup is described in verifact/services/vllm/monitoring/README.md.

Performance

Performance of locally-hosted models is dependent on your GPU accelerator and local hardware. Lower latency and higher throughput may be achieved by replacing locally-hosted models with dedicated API inference services.

Citation

@article{Chung2025, title={VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records}, author={Philip Chung and Akshay Swaminathan and Alex J. Goodell and Yeasul Kim and S. Momsen Reincke and Lichy Han and Ben Deverett and Mohammad Amin Sadeghi and Abdel-Badih Ariss and Marc Ghanem and David Seong and Andrew A. Lee and Caitlin E. Coombes and Brad Bradshaw and Mahir A. Sufian and Hyo Jung Hong and Teresa P. Nguyen and Mohammad R. Rasouli and Komal Kamra and Mark A. Burbridge and James C. McAvoy and Roya Saffary and Stephen P. Ma and Dev Dash and James Xie and Ellen Y. Wang and Clifford A. Schmiesing and Nigam Shah and Nima Aghaeepour}, year={2025}, eprint={2501.16672}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2501.16672}, }

Owner

Name: Philip Chung
Login: philipchung
Kind: user

Repositories: 13
Profile: https://github.com/philipchung

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  VeriFact: Verifying Facts in LLM-Generated Clinical Text
  with Electronic Health Records
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Philip
    family-names: Chung
    orcid: "https://orcid.org/0000-0002-1194-7510"
  - given-names: Akshay
    family-names: Swaminathan
    orcid: "https://orcid.org/0000-0003-3426-9289"
  - given-names: Alex
    family-names: Goodell
    orcid: "https://orcid.org/0000-0003-0229-8843"
  - given-names: Yeasul
    family-names: Kim
    orcid: "https://orcid.org/0000-0001-8289-1297"
  - given-names: S. Momsen
    family-names: Reincke
    orcid: "https://orcid.org/0000-0002-8132-3527"
  - given-names: Lichy
    family-names: Han
    orcid: "https://orcid.org/0000-0002-5785-0968"
  - given-names: Ben
    family-names: Deverett
    orcid: "https://orcid.org/0000-0002-3119-7649"
  - given-names: Mohammad Amin
    family-names: Sadeghi
    orcid: "https://orcid.org/0000-0003-3335-1758"
  - given-names: Abdel Badih
    family-names: Ariss
    orcid: "https://orcid.org/0000-0003-0269-3130"
  - given-names: Marc
    family-names: Ghanem
    orcid: "https://orcid.org/0000-0002-7479-7994"
  - given-names: David
    family-names: Seong
    orcid: "https://orcid.org/0000-0002-8980-5731"
  - given-names: Andrew
    family-names: Lee
    orcid: "https://orcid.org/0009-0006-8964-6677"
  - given-names: Caitlin
    family-names: Coombes
    orcid: "https://orcid.org/0000-0001-8414-4279"
  - given-names: Brad
    family-names: Bradshaw
    orcid: "https://orcid.org/0000-0001-5371-9682"
  - given-names: Mahir
    family-names: Sufian
    orcid: "https://orcid.org/0000-0002-9702-4556"
  - given-names: Hyo Jung
    family-names: Hong
    orcid: "https://orcid.org/0000-0001-7674-8398"
  - given-names: Teresa
    family-names: Nguyen
    orcid: "https://orcid.org/0000-0001-9522-8937"
  - given-names: Mohammad
    family-names: Rasouli
    orcid: "https://orcid.org/0000-0001-7181-5803"
  - given-names: Komal
    family-names: Kamra
    orcid: "https://orcid.org/0000-0003-4700-583X"
  - given-names: Mark
    family-names: Burbridge
    orcid: "https://orcid.org/0000-0001-6765-5739"
  - given-names: James
    family-names: McAvoy
    orcid: "https://orcid.org/0009-0006-3838-5438"
  - given-names: Roya
    family-names: Saffary
    orcid: "https://orcid.org/0000-0001-9959-9399"
  - given-names: Stephen
    family-names: Ma
    orcid: "https://orcid.org/0000-0003-3738-9569"
  - given-names: Dev
    family-names: Dash
    orcid: "https://orcid.org/0000-0002-0223-1641"
  - given-names: James
    family-names: Xie
    orcid: "https://orcid.org/0000-0002-9511-0012"
  - given-names: Ellen
    family-names: Wang
    orcid: "https://orcid.org/0000-0002-9151-938X"
  - given-names: Clifford
    family-names: Schmiesing
    orcid: "https://orcid.org/0000-0002-8979-5959"
  - given-names: Nigam
    family-names: Shah
    orcid: "https://orcid.org/0000-0001-9385-7158"
  - given-names: Nima
    family-names: Aghaeepour
    orcid: "https://orcid.org/0000-0002-6117-8764"
identifiers:
  - type: doi
    value: 10.48550/arXiv.2501.16672
    description: arXiv Preprint
repository-code: "https://github.com/philipchung/verifact"
abstract: >-
  VeriFact: A long-form text fact-checker that verifies any
  text written about a patient against their own electronic
  health record (EHR). VeriFact decomposes the text into a
  set of propositions which are individually verified
  against the patient's EHR. VeriFact combines RAG with
  LLM-as-a-Judge to perform fact verification.
keywords:
  - Fact Checking
  - Evaluation
  - Large Language Models
  - Medicine
  - Electronic Health Records
license: MIT

GitHub Events

Total

Watch event: 16
Push event: 4
Public event: 1

Last Year

Watch event: 16
Push event: 4
Public event: 1

Dependencies

docker-compose.yml docker

cjlapao/rq-dashboard 0.7.1
grafana/grafana 11.4.0-ubuntu
infinity/embed latest
infinity/rerank latest
prom/prometheus v2.55.1
qdrant/qdrant v1.10.0
redis/redis-stack latest
traefik v3.0
vllm/vllm-openai v0.6.4

services/embed/Dockerfile docker

michaelf34/infinity 0.0.53 build

services/rerank/Dockerfile docker

michaelf34/infinity 0.0.53 build

packages/irr_metrics/pyproject.toml pypi

irrcac >=0.4.4
krippendorff >=0.8.0
numpy >=1.26.4
openpyxl >=3.1.5
pandas >=2.2.3
pingouin >=0.5.5
pydantic >=2.10.3
scipy >=1.12.0

packages/llm_judge/pyproject.toml pypi

llama-index >=0.12.4
pandas >=2.2.3
pydantic >=2.10.3
qdrant-client >=1.12.1
tqdm >=4.67.1

packages/llm_writer/pyproject.toml pypi

llama-index >=0.12.4
pandas >=2.2.3
tqdm >=4.67.1

packages/proposition_validity/pyproject.toml pypi

llama-index >=0.12.4
pydantic >=2.10.3
tqdm >=4.67.1

packages/pydantic_utils/pyproject.toml pypi

pydantic >=2.10.3

packages/rag/pyproject.toml pypi

llama-index >=0.12.4
llama-index-core >=0.12.4
llama-index-embeddings-huggingface >=0.4.0
llama-index-llms-azure-openai >=0.3.0
llama-index-llms-huggingface >=0.4.0
llama-index-llms-openai >=0.3.3
llama-index-llms-openai-like >=0.3.0
llama-index-vector-stores-qdrant >=0.4.0
pydantic >=2.10.3
qdrant-client >=1.12.1

packages/rq_utils/pyproject.toml pypi

redis >=5.2.1
rq >=1.16.2,<2.0.0
rq-dashboard >=0.8.2.2
setproctitle >=1.3.4
tqdm >=4.67.1

packages/utils/pyproject.toml pypi

python-dotenv >=1.0.1
tenacity >=8.0.0
tiktoken >=0.8.0
tqdm >=4.67.1

pyproject.toml pypi

httpx >=0.28.1
ipykernel >=6.29.0
ipywidgets >=8.1.5
jupyter >=1.1.0
matplotlib >=3.9.2
mypy >=1.13.0
nest-asyncio >=1.6.0
numpy >=1.26.4
openai >=1.57.0
pandas >=2.2.0
pandas-stubs >=2.2.3.241126
pyarrow >=18.1.0
python-dotenv >=1.0.1
ruff >=0.8.2
scipy >=1.12.0
seaborn >=0.13.0
tqdm >=4.67.1
transformers >=4.46.3
typer >=0.15.1
types-tqdm >=4.67.0.20241119