biorecap

Retrieve and summarize bioRxiv preprints with a local LLM using ollama

https://github.com/stephenturner/biorecap

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, medrxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 70
  • Watchers: 2
  • Forks: 10
  • Open Issues: 1
  • Releases: 3
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog Contributing License Citation

README.Rmd

---
output: github_document
---



```{r, eval=FALSE, echo=FALSE}
# Run interactively
devtools::build_readme()
pkgdown::build_site()
```


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# biorecap 


[![R-CMD-check](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml)
[![arXiv](https://img.shields.io/badge/DOI-10.48550/arXiv.2408.11707-AD1429)](https://doi.org/10.48550/arXiv.2408.11707)
[![biorecap-r-universe](https://stephenturner.r-universe.dev/badges/biorecap)](https://stephenturner.r-universe.dev/biorecap)


Retrieve and summarize [bioRxiv](https://www.biorxiv.org/) and [medRxiv](https://www.medrxiv.org/) preprints using a local LLM with [Ollama](https://ollama.com/) via [ollamar](https://cran.r-project.org/package=ollamar). 

Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. _arXiv_, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707. 

## Installation

Install biorecap from GitHub (keep `dependencies=TRUE` to get Suggests packages needed to create the HTML report):

```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("stephenturner/biorecap", dependencies=TRUE)
```

## Usage

### Quick start

First, load the biorecap package.

```{r}
library(biorecap)
```

Let's make sure Ollama is running and that we can talk to it through R:

```{r, eval=FALSE}
test_connection()
```

```
#> Ollama local server running
#> 
#> GET http://localhost:11434/
#> Status: 200 OK
#> Content-Type: text/plain
#> Body: In memory (17 bytes)
```

Next we can list our available models:

```{r, eval=FALSE}
list_models()
```

```
             name   size parameter_size quantization_level            modified
1   gemma2:latest 5.4 GB           9.2B               Q4_0 2024-08-07T07:35:15
3    llama3.1:70b  40 GB          70.6B               Q4_0 2024-07-24T10:57:08
4 llama3.1:latest 4.7 GB           8.0B               Q4_0 2024-07-31T09:38:38
5 llama3.2:latest   2 GB           3.2B             Q4_K_M 2024-09-25T14:54:23
6     phi3:latest 2.2 GB           3.8B               Q4_0 2024-08-28T04:37:58      
```

Write an HTML report containing summaries of recent preprints in select subject areas to the current working directory. You can include both bioRxiv and medRxiv subjects, and biorecap will know which RSS feed to use.

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c("bioinformatics", "infectious_diseases"), 
                model="llama3.2")
```

Example HTML report generated from the bioinformatics (bioRxiv) and infectious diseases (medRxiv) subjects on September 25, 2024:

```{r, echo=FALSE}
knitr::include_graphics(here::here("man/figures/report_screenshot.jpg"))
```


### Details

The `get_preprints()` function retrieves preprints from the bioRxiv or medRxiv RSS feed, depending on the subject. Pass one or more subjects to the `subject` argument.

```{r, eval=FALSE}
pp <- get_preprints(subject=c("bioinformatics", 
                              "infectious_diseases"))
head(pp)
tail(pp)
```

```{r, echo=FALSE}
pp <- example_preprints
pp |> dplyr::select(-prompt, -summary) |> head()
pp |> dplyr::select(-prompt, -summary) |> tail()
```
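
To illustrate how a subject might be routed to the right server's RSS feed, here is a minimal sketch. It is not biorecap's actual implementation: `demo_subjects` and `feed_url()` are hypothetical names, the subject list is only an illustrative subset, and the URL pattern is an assumption based on the public bioRxiv and medRxiv RSS endpoints.

```r
# Hypothetical subset of subjects, for illustration only.
# The package ships the full list as the built-in `subjects` object.
demo_subjects <- list(
  biorxiv = c("bioinformatics", "genomics"),
  medrxiv = c("infectious_diseases", "epidemiology")
)

# Build an RSS feed URL for one subject (assumed URL pattern).
feed_url <- function(subject, subject_list = demo_subjects) {
  if (subject %in% subject_list$biorxiv) {
    server <- "biorxiv"
  } else if (subject %in% subject_list$medrxiv) {
    server <- "medrxiv"
  } else {
    stop("Unknown subject: ", subject)
  }
  sprintf("https://connect.%s.org/%s_xml.php?subject=%s", server, server, subject)
}

# One URL per requested subject, named by subject.
urls <- vapply(c("bioinformatics", "infectious_diseases"), feed_url, character(1))
urls
```

Looking each subject up in a per-server list is what would let bioRxiv and medRxiv subjects be mixed in a single call.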

The `add_prompt()` function adds a `prompt` column containing, for each preprint, the text that will be sent to the model.

```{r, eval=FALSE}
pp <- pp |> add_prompt()
pp
```

```{r, echo=FALSE}
pp |> dplyr::select(-summary)
```
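
As a rough illustration of what goes into the `prompt` column, the sketch below assembles a prompt from a title and an abstract using base R. The function `build_prompt()`, its `nsentences` argument, and the `papers` data frame are hypothetical and only mirror the layout of the example prompt in this README. They are not the package's internals.

```r
# Assemble one prompt per paper from its title and abstract.
# Vectorized: `title` and `abstract` may be character vectors of equal length.
build_prompt <- function(title, abstract, nsentences = 2L) {
  instructions <- paste(
    "I am giving you a paper's title and abstract.",
    "Summarize the paper in as many sentences as I instruct.",
    "Do not include any preamble text.",
    "Just give me the summary."
  )
  paste0(
    instructions,
    "\n\nNumber of sentences in summary: ", nsentences,
    "\n\nTitle: ", title,
    "\n\nAbstract: ", abstract
  )
}

# Toy data standing in for retrieved preprints.
papers <- data.frame(
  title = c("Paper A", "Paper B"),
  abstract = c("Abstract A.", "Abstract B."),
  stringsAsFactors = FALSE
)
papers$prompt <- build_prompt(papers$title, papers$abstract)
cat(papers$prompt[1])
```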

Let's take a look at one of these prompts:

> I am giving you a paper’s title and abstract. Summarize the paper in as many sentences as I instruct. Do not include any preamble text. Just give me the summary. 
> 
> Number of sentences in summary: 2 
> 
> Title: SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework 
> 
> Abstract: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community.

The `add_summary()` function uses a locally running LLM served by Ollama to summarize each preprint. The whole workflow can be written as a single pipeline. Note that this step can take a few minutes.

```{r, eval=FALSE}
pp <- 
  get_preprints(subject=c("bioinformatics", "infectious_diseases")) |> 
  add_prompt() |> 
  add_summary(model="llama3.2")
```
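
Conceptually, the summarization step applies a model call to each prompt and stores the result in a new column. The sketch below shows that shape with a stand-in summarizer so it runs without Ollama. The names `add_summary_sketch()` and `stub_summarizer()` are hypothetical, and this is not how the package itself is implemented.

```r
# Apply a summarizer function to every prompt and store the output in `summary`.
add_summary_sketch <- function(df, summarizer) {
  stopifnot("prompt" %in% names(df), is.function(summarizer))
  df$summary <- vapply(df$prompt, summarizer, character(1), USE.NAMES = FALSE)
  df
}

# Stand-in for a real model call: returns the first 20 characters of the prompt.
stub_summarizer <- function(prompt) substr(prompt, 1, 20)

demo <- data.frame(
  prompt = c("Title: Paper A. Abstract: ...", "Title: Paper B. Abstract: ..."),
  stringsAsFactors = FALSE
)
demo <- add_summary_sketch(demo, stub_summarizer)
demo$summary
```

With Ollama running, the stub could be swapped for a function that sends the prompt to a local model (for example through ollamar) and returns the generated text.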

Let's take a look at the results:

```{r}
pp
```

Let's look at one of those summaries. Here's the summary for the SeuratExtend paper (abstract above):

> SeuratExtend is an R package that integrates essential tools and databases for single-cell RNA sequencing (scRNA-seq) data analysis, streamlining the process through a user-friendly interface. The package offers various analyses, including functional enrichment and gene regulatory network reconstruction, and seamlessly integrates multiple databases and popular Python tools.

The `biorecap_report()` function runs this pipeline in an RMarkdown template and writes the resulting HTML report, along with a CSV file of the results, to the directory given by `output_dir` (here, the current working directory).

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c("bioinformatics", "infectious_diseases"), 
                model="llama3.2")
```

The built-in `subjects` object is a list of character vectors containing all available bioRxiv and medRxiv subjects.

```{r}
subjects$biorxiv
subjects$medrxiv
```

You could create a report for _all_ subjects like this (note, this could take some time):

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c(subjects$biorxiv, subjects$medrxiv),
                model="llama3.2")
```

Owner

  • Name: Stephen Turner
  • Login: stephenturner
  • Kind: user
  • Location: Charlottesville, VA
  • Company: @colossal-compsci

Data scientist in biotech, former academic, Principal Scientist and Head of Genomic Strategy at Colossal Biosciences

Citation (CITATION.cff)

# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------
 
cff-version: 1.2.0
message: 'To cite package "biorecap" in publications use:'
type: software
license: MIT
title: 'biorecap: Retrieve and summarize bioRxiv preprints with a local LLM using
  ollama'
version: 0.1.0
abstract: Retrieve and summarize bioRxiv preprints with a local LLM using ollama.
authors:
- family-names: Turner
  given-names: Stephen
  email: vustephen@gmail.com
  orcid: https://orcid.org/0000-0001-9140-9028
repository-code: https://github.com/stephenturner/biorecap
url: https://stephenturner.github.io/biorecap/
contact:
- family-names: Turner
  given-names: Stephen
  email: vustephen@gmail.com
  orcid: https://orcid.org/0000-0001-9140-9028
references:
- type: software
  title: 'R: A Language and Environment for Statistical Computing'
  notes: Depends
  url: https://www.R-project.org/
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
  version: '>= 4.2.0'
- type: software
  title: dplyr
  abstract: 'dplyr: A Grammar of Data Manipulation'
  notes: Imports
  url: https://dplyr.tidyverse.org
  repository: https://CRAN.R-project.org/package=dplyr
  authors:
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
    orcid: https://orcid.org/0000-0003-4757-117X
  - family-names: François
    given-names: Romain
    orcid: https://orcid.org/0000-0002-2444-4226
  - family-names: Henry
    given-names: Lionel
  - family-names: Müller
    given-names: Kirill
    orcid: https://orcid.org/0000-0002-1416-3412
  - family-names: Vaughan
    given-names: Davis
    email: davis@posit.co
    orcid: https://orcid.org/0000-0003-4777-038X
  year: '2024'
  doi: 10.32614/CRAN.package.dplyr
- type: software
  title: ollamar
  abstract: 'ollamar: ''Ollama'' Language Models'
  notes: Imports
  url: https://hauselin.github.io/ollama-r/
  repository: https://CRAN.R-project.org/package=ollamar
  authors:
  - family-names: Lin
    given-names: Hause
    email: hauselin@gmail.com
    orcid: https://orcid.org/0000-0003-4590-7039
  year: '2024'
  doi: 10.32614/CRAN.package.ollamar
- type: software
  title: rlang
  abstract: 'rlang: Functions for Base Types and Core R and ''Tidyverse'' Features'
  notes: Imports
  url: https://rlang.r-lib.org
  repository: https://CRAN.R-project.org/package=rlang
  authors:
  - family-names: Henry
    given-names: Lionel
    email: lionel@posit.co
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  year: '2024'
  doi: 10.32614/CRAN.package.rlang
- type: software
  title: rmarkdown
  abstract: 'rmarkdown: Dynamic Documents for R'
  notes: Imports
  url: https://pkgs.rstudio.com/rmarkdown/
  repository: https://CRAN.R-project.org/package=rmarkdown
  authors:
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Dervieux
    given-names: Christophe
    email: cderv@posit.co
    orcid: https://orcid.org/0000-0003-4474-2498
  - family-names: McPherson
    given-names: Jonathan
    email: jonathan@posit.co
  - family-names: Luraschi
    given-names: Javier
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Atkins
    given-names: Aron
    email: aron@posit.co
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Cheng
    given-names: Joe
    email: joe@posit.co
  - family-names: Chang
    given-names: Winston
    email: winston@posit.co
  - family-names: Iannone
    given-names: Richard
    email: rich@posit.co
    orcid: https://orcid.org/0000-0003-3925-190X
  year: '2024'
  doi: 10.32614/CRAN.package.rmarkdown
- type: software
  title: tidyRSS
  abstract: 'tidyRSS: Tidy RSS for R'
  notes: Imports
  url: https://github.com/RobertMyles/tidyrss
  repository: https://CRAN.R-project.org/package=tidyRSS
  authors:
  - family-names: McDonnell
    given-names: Robert Myles
    email: robertmylesmcdonnell@gmail.com
  year: '2024'
  doi: 10.32614/CRAN.package.tidyRSS
- type: software
  title: tinytable
  abstract: 'tinytable: Simple and Configurable Tables in ''HTML'', ''LaTeX'', ''Markdown'',
    ''Word'', ''PNG'', ''PDF'', and ''Typst'' Formats'
  notes: Imports
  url: https://vincentarelbundock.github.io/tinytable/
  repository: https://CRAN.R-project.org/package=tinytable
  authors:
  - family-names: Arel-Bundock
    given-names: Vincent
    email: vincent.arel-bundock@umontreal.ca
    orcid: https://orcid.org/0000-0003-2042-7063
  year: '2024'
  doi: 10.32614/CRAN.package.tinytable
- type: software
  title: knitr
  abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
  notes: Suggests
  url: https://yihui.org/knitr/
  repository: https://CRAN.R-project.org/package=knitr
  authors:
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  year: '2024'
  doi: 10.32614/CRAN.package.knitr
- type: software
  title: markdown
  abstract: 'markdown: Render Markdown with ''commonmark'''
  notes: Suggests
  url: https://github.com/rstudio/markdown
  repository: https://CRAN.R-project.org/package=markdown
  authors:
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Allaire
    given-names: JJ
  - family-names: Horner
    given-names: Jeffrey
  year: '2024'
  doi: 10.32614/CRAN.package.markdown

GitHub Events

Total
  • Watch event: 10
  • Fork event: 3
Last Year
  • Watch event: 10
  • Fork event: 3

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 38
  • Total Committers: 2
  • Avg Commits per committer: 19.0
  • Development Distribution Score (DDS): 0.026
Past Year
  • Commits: 38
  • Committers: 2
  • Avg Commits per committer: 19.0
  • Development Distribution Score (DDS): 0.026
Top Committers
Name Email Commits
Stephen Turner v****n@g****m 37
VP Nagraj p****j@s****m 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 5
  • Total pull requests: 5
  • Average time to close issues: 6 days
  • Average time to close pull requests: 14 minutes
  • Total issue authors: 5
  • Total pull request authors: 2
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 5
  • Average time to close issues: 6 days
  • Average time to close pull requests: 14 minutes
  • Issue authors: 5
  • Pull request authors: 2
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • vpnagraj (1)
  • Michael-Geuenich (1)
  • sunta3iouxos (1)
  • huyvuong (1)
  • danieljking8 (1)
Pull Request Authors
  • stephenturner (8)
  • vpnagraj (2)

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.5.0 composite
  • actions/checkout v4 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 4.2.0 depends
  • dplyr * imports
  • ollamar * imports
  • rlang * imports
  • rmarkdown * imports
  • tidyRSS * imports
  • tinytable * imports
  • knitr * suggests
  • markdown * suggests