https://github.com/berenslab/llm-excess-vocab

Delving into LLM-assisted writing in biomedical publications through excess vocabulary


Repository

Basic Info
  • Host: GitHub
  • Owner: berenslab
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 42 MB
Statistics
  • Stars: 45
  • Watchers: 5
  • Forks: 7
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed 7 months ago
Metadata Files
  • Readme
  • License

README.md

Delving into LLM-assisted writing in biomedical publications through excess vocabulary

[Figure: excess words in 2024]

Analysis code for the paper Kobak et al. 2025, Delving into LLM-assisted writing in biomedical publications through excess vocabulary.

How to cite:

    @article{kobak2025delving,
      title={Delving into {LLM}-assisted writing in biomedical publications through excess vocabulary},
      author={Kobak, Dmitry and Gonz\'alez-M\'arquez, Rita and Horv\'at, Em\H{o}ke-\'Agnes and Lause, Jan},
      journal={Science Advances},
      year={2025},
      volume={11},
      number={27},
      pages={eadt3813},
    }

Preprint version: https://arxiv.org/abs/2406.07016.

[Figure: example excess words]

Update: July 2025

Updated analysis using six additional months of PubMed data compared to our published paper, computing monthly (rather than yearly) excess values:

[Figure: excess words in July 2025]

Materials

  • All 900 excess words that we identified from 2013 to 2024 are listed in results/excess_words.csv together with our annotations.
  • The 362,442 × 15 matrix of yearly word occurrences is available in results/yearly-counts.csv.gz: for each word and year, it gives the number of abstracts in that year containing that word; the additional last row contains the total number of abstracts in that year. This file allows reproducing the main parts of our analysis.
  • All figures from the paper are available in the figures/ folder.
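As a sketch of how the yearly counts matrix can be used, the snippet below builds a tiny synthetic stand-in for results/yearly-counts.csv.gz and converts raw counts into per-year relative word frequencies. The words, numbers, and the "total" row label are made up for illustration, not taken from the actual file:

```python
import pandas as pd

# Synthetic stand-in for results/yearly-counts.csv.gz: one row per word,
# one column per year, plus a final row with the total number of
# abstracts published in each year (labels and values are assumptions).
counts = pd.DataFrame(
    {"2023": [120, 15, 100_000], "2024": [340, 18, 110_000]},
    index=["delve", "osmosis", "total"],
)

totals = counts.loc["total"]        # abstracts per year
words = counts.drop(index="total")  # per-word abstract counts

# Relative frequency: fraction of that year's abstracts containing the word
freq = words / totals
print(freq.loc["delve"])  # e.g. 0.0012 for 2023
```

With the real file, `counts` would instead come from `pd.read_csv("results/yearly-counts.csv.gz", index_col=0)` (index column assumed), and `freq` would be the starting point for any excess-frequency computation.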

Reproducibility instructions

  1. All excess frequency analysis and all figures shown in the paper (and provided in the figures/ folder) are produced by the scripts/03-figures.ipynb Python notebook (apart from Figure 7, which is produced by scripts/08-figure-tsne.ipynb). This notebook takes as input the results/yearly-counts.csv.gz file with yearly counts of each word and several other files with yearly counts of word groups (yearly-counts*). The notebook only takes a minute to run.
  2. These yearly word count files are produced by the scripts/02-preprocess-and-count.py script, which takes a few hours to run and needs a lot of memory. The script takes a dataframe with abstract texts as input, cleans the abstracts via regular expressions (~1 hour), then counts word occurrences (~0.5 hours) using

         import sklearn.feature_extraction.text

         vectorizer = sklearn.feature_extraction.text.CountVectorizer(
             binary=True, min_df=1e-6
         )
         vectorizer.fit_transform(df.AbstractText.values)

     and then does yearly aggregation.
  3. The input to the scripts/02-preprocess-and-count.py script is pubmed_baseline_2025.parquet.gzip, containing PubMed data from the end-of-2024 snapshot. It is similar to the files available in the repository associated with our Patterns paper "The landscape of biomedical research", but corresponds to the newer PubMed snapshot. The file is constructed by the scripts/01-process-baseline.ipynb notebook, which takes all XML files from https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ as input. These files must first be downloaded from the link above, unzipped, and stored in a directory; the notebook then reads them, combines them, and saves them as a single dataframe (pubmed_baseline_2025.parquet.gzip).
  4. The t-SNE figure produced by scripts/08-figure-tsne.ipynb takes df_tsne_22_24.parquet.gzip as input, which contains the t-SNE coordinates of the 2022-2024 papers as well as some metadata (class labels, country, inferred gender, and whether the paper is retracted or not). The t-SNE embedding is obtained as follows: the raw texts are first processed with a transformer (PubMedBERT) to obtain a high-dimensional numerical representation of each abstract (04-obtain-BERT-embeddings.py). Then, the high-dimensional vectors are reduced to two dimensions with t-SNE (05-obtain-tsne-embeddings.py). Afterwards, the metadata (except for the retraction flags) is prepared and saved together with the 2D coordinates in df_tsne_22_24.parquet.gzip (06-generate-tsne-df.ipynb).
  5. In the notebook 07-analysis-retracted-papers.ipynb, PMIDs of retracted papers are scraped from PubMed and combined with those available in the Retraction Watch database. Retracted papers are then plotted in the 2022-2024 t-SNE embedding. Additionally, a boolean flag indicating whether a paper is retracted is computed and added to the df_tsne_22_24.parquet.gzip dataframe.
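The retraction flag in step 5 boils down to a membership test against the set of retracted PMIDs. Here is a minimal sketch; the column name "PMID", the flag name "is_retracted", and the example values are assumptions for illustration, not the repository's actual code:

```python
import pandas as pd

# Hypothetical inputs: a dataframe with one row per paper, and the PMIDs
# of retracted papers collected from PubMed and Retraction Watch
# (names and values below are made up for this sketch).
df = pd.DataFrame({"PMID": [101, 102, 103, 104]})
retracted_pmids = {102, 104}

# Boolean flag: True if the paper's PMID appears in the retracted set
df["is_retracted"] = df["PMID"].isin(retracted_pmids)
print(df["is_retracted"].tolist())  # [False, True, False, True]
```

In the actual pipeline this flag would then be saved back into df_tsne_22_24.parquet.gzip alongside the t-SNE coordinates.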

Owner

  • Name: Berens Lab @ University of Tübingen
  • Login: berenslab
  • Kind: organization
  • Email: philipp.berens@uni-tuebingen.de
  • Location: Tübingen, Germany

Department of Data Science at the Hertie Institute for AI in Brain Health, University of Tübingen

GitHub Events

Total
  • Issues event: 1
  • Watch event: 8
  • Issue comment event: 1
  • Push event: 13
  • Fork event: 5
Last Year
  • Issues event: 1
  • Watch event: 8
  • Issue comment event: 1
  • Push event: 13
  • Fork event: 5