Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: almeidava93
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 19.9 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 9 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care


This repository contains the data and code for the study "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care", published as a preprint on arXiv and currently under peer review. The full article is available here.

Reproducing the study results

This study was performed in several steps and builds upon Almeida et al.'s work on ICPC-2 search engines.

  • Set up a virtual environment: Set up a virtual environment with Python 3.11 or later and install the requirements from requirements.txt.

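This step can be sketched as follows in a POSIX shell; the `.venv` directory name is a common convention, not something the repository mandates, and the `requirements.txt` check simply guards the sketch when run outside the repository:

```shell
# Create a virtual environment (requires Python 3.11+) and activate it
python3 -m venv .venv
. .venv/bin/activate

# Install the project requirements if the file is present
if [ -f requirements.txt ]; then
  pip install -r requirements.txt
fi
```
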
  • Vector database: Because building the vector database can take several hours and may incur costs from API services, a compressed file, vector_database.zip, containing the vector database is available for download at this link. Simply download the file and extract its contents into the root directory of this repository.

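To illustrate what the vector database provides once built, here is a toy, self-contained sketch of embedding-based retrieval: rank corpus entries by cosine similarity to a query vector. The vectors and ICPC-2-labeled strings below are invented for illustration; the real pipeline embeds 73,563 concepts with OpenAI's text-embedding-3-large and stores them in the vector database.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up 3-dimensional "embeddings" of ICPC-2-labeled concepts
corpus = {
    "dor de cabeça (N01)": [0.9, 0.1, 0.0],
    "dor lombar (L03)":    [0.1, 0.9, 0.1],
    "tosse (R05)":         [0.0, 0.2, 0.9],
}

# Pretend embedding of the query "cefaleia" (headache)
query_vec = [0.85, 0.15, 0.05]

# Rank the corpus by similarity to the query, best match first
ranked = sorted(corpus, key=lambda c: cosine(query_vec, corpus[c]), reverse=True)
```

The ranked list (here, the headache concept first) is what gets passed to the LLMs as candidate codes in the later steps.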
  • OpenAI embedding model: The retrieval steps require an OpenAI API key, which must be defined in the .env file.

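The `.env` file is expected to hold the key as an environment variable. `OPENAI_API_KEY` is the variable name the OpenAI and LangChain clients read by default; the placeholder value below is not a real key:

```
OPENAI_API_KEY=sk-...
```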
  • Prepare the vector database and the evaluation dataset: Run the following to prepare the vector database and retrieve the results for each query in the evaluation dataset. This command builds the vector database if it is not already present at the root directory; in that case, it may take some time to complete and may consume OpenAI credits. Activate the virtual environment and run: python prepare.py, or simply use uv: uv run prepare.py

  • Perform automatic code selection with the selected LLMs: Run the following to perform inference with all the selected LLMs. The results of this script are already provided in data/llms_results.csv to avoid additional costs. If you want to reproduce them yourself, run: python get_llms_results.py or uv run get_llms_results.py

  • Evaluation: Run the following to compute the evaluation metrics and generate the LaTeX tables. All output files are stored in the results folder: python eval.py, or simply use uv: uv run eval.py

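As a rough illustration of the kind of metric eval.py reports, a set-based F1 per query can be computed as below. This is a hypothetical sketch under stated assumptions, not the repository's actual evaluation code: the function name and the example codes are illustrative only.

```python
def f1_per_query(gold, predicted):
    """Set-based F1 for one query, where gold and predicted are sets of ICPC-2 codes."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # codes predicted correctly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An exact match scores 1.0; predicting only one of two gold codes
# keeps precision at 1.0 but halves recall.
scores = [
    f1_per_query({"A01"}, {"A01"}),
    f1_per_query({"A01", "K86"}, {"A01"}),
]
mean_f1 = sum(scores) / len(scores)
```

Averaging such per-query scores over the 437 annotated expressions yields a single figure comparable across models.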
  • Plots: Run the following to generate the plots: python plot.py, or simply use uv: uv run plot.py

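Putting the steps above together, a full reproduction run from the repository root (with the key set in .env and the vector database extracted) amounts to the following command sequence. This is a summary of the steps above, not a script shipped with the repository:

```shell
uv run prepare.py            # build/load the vector database, retrieve candidates
uv run get_llms_results.py   # LLM inference (skippable: data/llms_results.csv is provided)
uv run eval.py               # metrics and LaTeX tables, written to results/
uv run plot.py               # figures
```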
Citation

If you find this repository useful, please consider citing it.

APA-style citation: Anjos de Almeida, V. (2025). Source code for the paper: "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care" (Version 1.0.0) [Computer software]. https://github.com/almeidava93/llm-as-code-selectors-paper

BibTeX entry:

@software{Anjos_de_Almeida_Source_code_for_2025,
  author  = {Anjos de Almeida, Vinicius},
  license = {MIT},
  month   = jul,
  title   = {{Source code for the paper: "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care"}},
  url     = {https://github.com/almeidava93/llm-as-code-selectors-paper},
  version = {1.0.0},
  year    = {2025}
}

Owner

  • Login: almeidava93
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Source code for the paper: "Large Language Models as
  Medical Codes Selectors: a benchmark using the
  International Classification of Primary Care"
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Vinicius
    family-names: Anjos de Almeida
    email: vinicius.almeida@alumni.usp.br
    affiliation: University of São Paulo
    orcid: 'https://orcid.org/0009-0001-1273-586X'
identifiers:
  - type: doi
    value: 10.5281/zenodo.15998992
repository-code: 'https://github.com/almeidava93/llm-as-code-selectors-paper'
abstract: >-
  Background: Medical coding is critical for structuring
  healthcare data. It can lead to a better understanding of
  population health, guide quality improvement
  interventions, and policy making. This study investigates
  the ability of large language models (LLMs) to select
  appropriate codes from the International Classification of
  Primary Care, 2nd edition (ICPC-2), based on the results
  of a specialized search engine.


  Methods: A dataset of 437 clinical expressions in
  Brazilian Portuguese was used, each annotated with
  relevant ICPC-2 codes. A semantic search engine based on
  OpenAI’s text-embedding-3-large model retrieved candidate
  expressions from a corpus of 73,563 ICPC-2-labeled
  concepts. Thirty-three LLMs (both open-source and private)
  were prompted with each query and a ranked list of
  retrieved results, and asked to return the best-matching
  ICPC-2 code. Performance was evaluated using F1-score,
  with additional analysis of token usage, cost, response
  time, and formatting adherence.


  Results: Of the 33 models evaluated, 28 achieved a maximum
  F1-score above 0.8, and 10 exceeded 0.85. The
  top-performing models were gpt-4.5-preview, o3, and
  gemini-2.5-pro. By optimizing the retriever, performance
  can improve by up to 4 percentage points. Most models were
  able to return valid codes in the expected format and
  restrict outputs to retrieved results, reducing
  hallucination risk. Notably, smaller models (<3B
  parameters) underperformed due to format inconsistencies
  and sensitivity to input length.


  Conclusions: LLMs show strong potential for automating
  ICPC-2 code selection, with many models achieving high
  performance even without task-specific fine-tuning. This
  work establishes a benchmark for future studies and
  describes some of the challenges for achieving better
  results.
keywords:
  - International Classification of Primary Care
  - Medical coding
  - Medical coding automation
  - Large language models
  - Artificial intelligence
  - Benchmark
  - Extreme multiclass classification
license: MIT
version: 1.0.0
date-released: '2025-07-16'

GitHub Events

Total
  • Release event: 1
  • Push event: 11
  • Create event: 2
Last Year
  • Release event: 1
  • Push event: 11
  • Create event: 2

Dependencies

pyproject.toml pypi
  • accelerate >=1.9.0
  • adjusttext >=1.3.0
  • chromadb >=1.0.15
  • dotenv >=0.9.9
  • jsonlines >=4.0.0
  • langchain >=0.3.26
  • langchain-community >=0.3.27
  • langchain-deepseek >=0.1.3
  • langchain-google-genai >=2.1.8
  • langchain-huggingface >=0.3.0
  • langchain-openai >=0.3.28
  • matplotlib >=3.10.3
  • openai >=1.97.0
  • pandas >=2.3.1
  • seaborn >=0.13.2
  • toml >=0.10.2
  • torch >=2.7.1
  • torchaudio >=2.7.1
  • torchvision >=0.22.1
  • tqdm >=4.67.1
  • transformers >=4.53.2
requirements.txt pypi
uv.lock pypi
  • 169 dependencies