llm-as-code-selectors-paper
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: almeidava93
- License: mit
- Language: Python
- Default Branch: main
- Size: 19.9 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

This repository contains the data and code for the study "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care", published as a preprint on arXiv and currently under peer review. The full article is available here.
Reproducing the study results
This study was performed in several steps and builds upon the work of Almeida et al. on ICPC-2 search engines.

1. Set up a virtual environment: create a virtual environment with Python 3.11 or greater and install the requirements from `requirements.txt`.
2. Vector database: since building the vector database can take several hours and may incur costs from API services, a compressed file `vector_database.zip` containing the prebuilt vector database is available for download at this link. Download the file and extract its contents into the root directory of this repository.
3. OpenAI embedding model: to perform the retrieval steps you will need an OpenAI API key, defined in the `.env` file.
4. Prepare the vector database and the evaluation dataset: run the following to prepare the vector database and retrieve the results for each query in the evaluation dataset. This command builds the vector database if it is not present in the root directory; in that case, it may take some time to complete and may cost you OpenAI credits. Run `python prepare.py`, or simply use uv: `uv run prepare.py`.
5. Perform automatic code selection with the selected LLMs: run the following to perform inference with all the selected LLMs. The results of this script are already provided in `data/llms_results.csv` to avoid additional costs. If you are interested in reproducing this yourself, run `python get_llms_results.py` or `uv run get_llms_results.py`.
6. Evaluation: run the following to compute the evaluation metrics and create the LaTeX tables. All files will be stored in the `results` folder. Run `python eval.py` or `uv run eval.py`.
7. Plots: run the following to plot the graphs. Run `python plot.py` or `uv run plot.py`.
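The evaluation step scores each model's selected codes against the annotated gold codes using the F1-score. As a rough illustration of the metric only (this is not the repository's actual `eval.py` logic, and the function name is hypothetical), a micro-averaged F1 over per-query predicted and gold ICPC-2 code sets could be computed like this:

```python
# Illustrative sketch of micro-averaged F1 for code selection.
# Not the study's actual evaluation code; eval.py may aggregate differently.
def micro_f1(predictions: list[set[str]], gold: list[set[str]]) -> float:
    """Micro-averaged F1 over per-query predicted and gold ICPC-2 code sets."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        tp += len(pred & ref)  # codes both predicted and annotated
        fp += len(pred - ref)  # codes predicted but not annotated
        fn += len(ref - pred)  # annotated codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one exact match and one partial match
preds = [{"K86"}, {"R05"}]
refs = [{"K86"}, {"R05", "R74"}]
print(round(micro_f1(preds, refs), 3))  # → 0.8
```

Micro-averaging pools true/false positives across all queries before computing precision and recall, so queries with more annotated codes weigh proportionally more than in a per-query (macro) average.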
Citation
If you find this repository useful, please consider citing it.
APA-style citation:
Anjos de Almeida, V. (2025). Source code for the paper: "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care" (Version 1.0.0) [Computer software]. https://github.com/almeidava93/llm-as-code-selectors-paper
BibTeX entry:
@software{Anjos_de_Almeida_Source_code_for_2025,
author = {Anjos de Almeida, Vinicius},
license = {MIT},
month = jul,
title = {{Source code for the paper: "Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care"}},
url = {https://github.com/almeidava93/llm-as-code-selectors-paper},
version = {1.0.0},
year = {2025}
}
Owner
- Login: almeidava93
- Kind: user
- Repositories: 2
- Profile: https://github.com/almeidava93
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Source code for the paper: "Large Language Models as
  Medical Codes Selectors: a benchmark using the
  International Classification of Primary Care"
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Vinicius
    family-names: Anjos de Almeida
    email: vinicius.almeida@alumni.usp.br
    affiliation: University of São Paulo
    orcid: 'https://orcid.org/0009-0001-1273-586X'
identifiers:
  - type: doi
    value: 10.5281/zenodo.15998992
repository-code: 'https://github.com/almeidava93/llm-as-code-selectors-paper'
abstract: >-
  Background: Medical coding is critical for structuring
  healthcare data. It can lead to a better understanding of
  population health, guide quality improvement
  interventions, and policy making. This study investigates
  the ability of large language models (LLMs) to select
  appropriate codes from the International Classification of
  Primary Care, 2nd edition (ICPC-2), based on the results
  of a specialized search engine.
  Methods: A dataset of 437 clinical expressions in
  Brazilian Portuguese was used, each annotated with
  relevant ICPC-2 codes. A semantic search engine based on
  OpenAI’s text-embedding-3-large model retrieved candidate
  expressions from a corpus of 73,563 ICPC-2-labeled
  concepts. Thirty-three LLMs (both open-source and private)
  were prompted with each query and a ranked list of
  retrieved results, and asked to return the best-matching
  ICPC-2 code. Performance was evaluated using F1-score,
  with additional analysis of token usage, cost, response
  time, and formatting adherence.
  Results: Of the 33 models evaluated, 28 achieved a maximum
  F1-score above 0.8, and 10 exceeded 0.85. The
  top-performing models were gpt-4.5-preview, o3, and
  gemini-2.5-pro. By optimizing the retriever, performance
  can improve by up to 4 percentage points. Most models were
  able to return valid codes in the expected format and
  restrict outputs to retrieved results, reducing
  hallucination risk. Notably, smaller models (<3B
  parameters) underperformed due to format inconsistencies
  and sensitivity to input length.
  Conclusions: LLMs show strong potential for automating
  ICPC-2 code selection, with many models achieving high
  performance even without task-specific fine-tuning. This
  work establishes a benchmark for future studies and
  describes some of the challenges for achieving better
  results.
keywords:
  - International Classification of Primary Care
  - Medical coding
  - Medical coding automation
  - Large language models
  - Artificial intelligence
  - Benchmark
  - Extreme multiclass classification
license: MIT
version: 1.0.0
date-released: '2025-07-16'
GitHub Events
Total
- Release event: 1
- Push event: 11
- Create event: 2
Last Year
- Release event: 1
- Push event: 11
- Create event: 2
Dependencies
- accelerate >=1.9.0
- adjusttext >=1.3.0
- chromadb >=1.0.15
- dotenv >=0.9.9
- jsonlines >=4.0.0
- langchain >=0.3.26
- langchain-community >=0.3.27
- langchain-deepseek >=0.1.3
- langchain-google-genai >=2.1.8
- langchain-huggingface >=0.3.0
- langchain-openai >=0.3.28
- matplotlib >=3.10.3
- openai >=1.97.0
- pandas >=2.3.1
- seaborn >=0.13.2
- toml >=0.10.2
- torch >=2.7.1
- torchaudio >=2.7.1
- torchvision >=0.22.1
- tqdm >=4.67.1
- transformers >=4.53.2
- 169 dependencies