cellama

Cell type annotation with local Large Language Models (LLMs) - Ensuring privacy and speed with extensive customized reports

https://github.com/celvoxes/cellama

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

celltype large-language-models llm rna-seq scanpy seurat single-cell
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CelVoxes
  • Language: R
  • Default Branch: main
  • Homepage: https://celvox.co
  • Size: 84.9 MB
Statistics
  • Stars: 147
  • Watchers: 4
  • Forks: 6
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.Rmd

---
title: "ceLLama"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

![](ceLLama_files/cellama.png)

ceLLama is a streamlined automation pipeline for cell type annotation using large language models (LLMs).

### Advantages:

- **Privacy**: Runs against a local model, so your marker data stays on your machine.
- **Comprehensive Analysis**: Considers negative (down-regulated) marker genes as well as positive ones.
- **Speed**: Efficient processing.
- **Extensive Reporting**: Generates customized reports.

ceLLama is ideal for quick and preliminary cell type checks!

> [!NOTE]\
> Check the [tutorial](ceLLama/pbmc2700.ipynb) for a Scanpy example.

## Installation

To install ceLLama, use the following command:
```{r eval=FALSE}
devtools::install_github("CelVoxes/ceLLama")
```

## Usage

#### Step 1: Install Ollama

Download [`Ollama`](https://ollama.com/).

#### Step 2: Choose Your Model

Select your preferred model. For instance, to run the Llama 3.1 model, use the following terminal command:

```{bash eval=FALSE}
ollama run llama3.1
```

This initiates a local server, which can be verified by visiting http://localhost:11434/. The page should display "Ollama is running".
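
The same check can be done from R before starting an annotation run. The sketch below is a minimal illustration, not part of ceLLama: `ollama_is_running` is a hypothetical helper name, and it assumes the default address shown above and that the root page returns the "Ollama is running" text.

```r
# Hypothetical helper (not part of ceLLama): returns TRUE if an Ollama
# server answers at `base_url` with its "Ollama is running" banner.
ollama_is_running <- function(base_url = "http://localhost:11434/") {
    body <- tryCatch(
        suppressWarnings(readLines(base_url, warn = FALSE)),
        error = function(e) character(0) # connection refused, timeout, ...
    )
    any(grepl("Ollama is running", body, fixed = TRUE))
}

if (!ollama_is_running()) {
    message("Ollama does not seem to be running; start it with `ollama run llama3.1`.")
}
```

The helper uses only base R, so it works before any of the packages below are loaded.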

#### Step 3: Annotate Cell Types

Load the required libraries and data:
```{r pbmc2700, message=FALSE, warning=FALSE}
library(Seurat)
library(tidyverse)
library(httr)

pbmc.data <- Read10X("../../Downloads/filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# note that you can chain multiple commands together with %>%
pbmc <- pbmc %>%
    SCTransform(verbose = F) %>%
    RunPCA(verbose = F) %>%
    FindNeighbors(dims = 1:10, verbose = F) %>%
    FindClusters(resolution = 0.5, verbose = F) %>%
    RunUMAP(dims = 1:10, verbose = F)

DimPlot(pbmc, label = T, label.size = 3) + theme_void() + theme(aspect.ratio = 1)
```

Identify cluster markers:
```{r find DEGs}
DefaultAssay(pbmc) <- "RNA"

# Find cluster markers
pbmc.markers <- FindAllMarkers(pbmc, verbose = F, min.pct = 0.5)

# split into a list with one data frame per cluster
pbmc.markers.list <- split(pbmc.markers, pbmc.markers$cluster)
```
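
To make the shape of this input concrete, the sketch below builds a toy table with three of the columns `FindAllMarkers` returns (`cluster`, `gene`, `avg_log2FC`) and picks the strongest up- and down-regulated genes per cluster. How ceLLama selects genes internally is not shown in this README, so `top_genes_by_sign` is a hypothetical illustration of what `n_genes` and "negative genes" could mean, not the package's implementation; the fold-change values are made up.

```r
# Toy marker table mimicking columns returned by FindAllMarkers()
# (values are invented for illustration)
toy.markers <- data.frame(
    cluster = factor(c("0", "0", "0", "1", "1", "1")),
    gene = c("CD3D", "IL7R", "LYZ", "LYZ", "S100A9", "CD3D"),
    avg_log2FC = c(2.1, 1.4, -1.8, 3.0, 2.5, -2.2),
    stringsAsFactors = FALSE
)

# Same split as above: one data frame per cluster
toy.markers.list <- split(toy.markers, toy.markers$cluster)

# Hypothetical helper: strongest positive and negative genes for one cluster
top_genes_by_sign <- function(markers, n_genes = 2) {
    ranked <- markers[order(markers$avg_log2FC, decreasing = TRUE), ]
    list(
        # highest fold changes first
        positive = head(ranked$gene[ranked$avg_log2FC > 0], n_genes),
        # most negative fold changes first
        negative = head(rev(ranked$gene[ranked$avg_log2FC < 0]), n_genes)
    )
}

top.genes <- lapply(toy.markers.list, top_genes_by_sign)
str(top.genes)
```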

Run ceLLama:
```{r run ceLLama}
# set seed, make temperature 0 for reproducible results
library(ceLLama)

res <- ceLLama(pbmc.markers.list, temperature = 0, seed = 101, n_genes = 30)
```

> [!TIP]\
> Increase `temperature` to diversify outputs.
> Set a different `base_prompt` to customize annotations.

Transfer the labels:
```{r transfer annotations}
# transfer the labels
annotations <- map_chr(res, 1)

Idents(pbmc) <- "seurat_clusters"
names(annotations) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, annotations)

DimPlot(pbmc, label = T, repel = T, label.size = 3) + theme_void() + theme(aspect.ratio = 1) & NoLegend()
```
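
The label transfer is a positional mapping from cluster ids to annotation strings. The sketch below reproduces it on toy data without Seurat, assuming (as `map_chr(res, 1)` implies) that the result is a list with one element per cluster whose first entry is the annotation. The labels and cluster ids are made up for illustration.

```r
# Toy stand-in for ceLLama output: one entry per cluster, annotation first.
toy.res <- list(
    list("CD4+ T cells"),
    list("CD14+ Monocytes"),
    list("B cells")
)

# Base-R equivalent of purrr::map_chr(toy.res, 1)
toy.annotations <- vapply(toy.res, function(x) x[[1]], character(1))

# Name each label by the cluster it belongs to
cluster.levels <- c("0", "1", "2")
names(toy.annotations) <- cluster.levels

# Look up the label for every cell from its cluster id
cell.clusters <- c("2", "0", "0", "1")
cell.labels <- unname(toy.annotations[cell.clusters])
print(cell.labels)
```

Because the names are assigned by position, the mapping is only correct when the result list is in the same order as `levels(pbmc)`. It is worth checking `length(annotations) == length(levels(pbmc))` before renaming, since a cluster without any markers would shift the labels.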


## Chain of Thought (Experimental)

Here, we can use the [thinkR](https://github.com/eonurk/thinkR) package for annotation. The goal of this approach is to leverage the model's ability to break complex reasoning down into structured steps. In principle, this stepwise decomposition should produce clearer annotations by capturing the intermediate reasoning and decisions behind each one.

```{r}
# devtools::install_github("eonurk/thinkR")
library(thinkR)
```


```{r eval=FALSE}
# use_thinkR = T
res <- ceLLama(pbmc.markers.list, temperature = 0, seed = 101, n_genes = 30, use_thinkR = T, 
               base_prompt = "This is from a PBMC dataset. Act like an expert immunologist and give me the cell type annotation for this cluster. ")
```

Thinking...

```{r echo=FALSE}
# Assuming `res` contains the results to be displayed in markdown format
res <- readRDS("thinkR_results.rds")

# Formatting the output properly for markdown
output <- paste(
    unlist(lapply(res, function(res_inner) {
        lapply(res_inner$annotation$steps, function(m) {
            if (!is.null(m$title) && !is.null(m$content) && !is.null(m$thinking_time)) {
                sprintf(
                    "### %s\n\n%s\n\n**Time:** %s s\n\n---\n",
                    m$title, m$content, m$thinking_time
                )
            }
        })
    })),
    collapse = "\n"
)

# Printing the output for markdown without c("")
cat(output)
```

```{r echo=FALSE, warning=FALSE}
# Load necessary package for parsing JSON
library(jsonlite)

# Assuming `res` contains the structured results as described
res <- readRDS("thinkR_results.rds")

# Extracting the final answers in a clean format
final_annotations <- sapply(res, function(res_inner) {
    # Retrieve all steps from the annotation
    steps <- res_inner$annotation$steps

    # Find the step with title "Final Answer" and extract its content
    final_step <- Filter(function(step) step$title == "Final Answer", steps)

    # Extract and parse the content of the "Final Answer" step
    if (length(final_step) > 0) {
        content <- trimws(final_step[[1]]$content) # Trim whitespace

        # Attempt to parse content as JSON, if it is in JSON format
        parsed <- tryCatch(fromJSON(content), error = function(e) NULL)

        # If parsed successfully, extract the relevant field
        if (!is.null(parsed)) {
            paste0(trimws(parsed$content), " (Confidence: ", parsed$confidence, ")")
        } else {
            content # Return content as is if not JSON
        }
    } else {
        NA # If no final answer is found, return NA
    }
})

# Print the final answers in a clean format
cat(paste(final_annotations, collapse = "\n"))
```

```{r}
Idents(pbmc) <- "seurat_clusters"
names(final_annotations) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, final_annotations)

DimPlot(pbmc, label = T, repel = T, label.size = 3) + theme_void() + theme(aspect.ratio = 1) & NoLegend()
```

## Using OpenAI API

> [!WARNING]\
> This will send data to OpenAI!

You can also use OpenAI to annotate your cell types. First, create a `.Renviron` file where you keep your API key.

`OPENAI_API_KEY="Best_key_ever"`

```{r}
# Default is gpt-4o-mini
res.openai <- ceLLama(pbmc.markers.list,
    temperature = 0, seed = 101, n_genes = 30,
    use_openai = T, # money brr.
    model = "gpt-4o-mini", # set the model
    openai_api_key = Sys.getenv("OPENAI_API_KEY") # or just copy/paste
)

# transfer the labels
annotations <- map_chr(res.openai, 1)

Idents(pbmc) <- "seurat_clusters"
names(annotations) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, annotations)

DimPlot(pbmc, label = T, repel = T, label.size = 3) + theme_void() + theme(aspect.ratio = 1) & NoLegend()
```

## Creating Reports

Generate detailed reports explaining the annotations:

```{r eval=FALSE}
# Get the reason for the annotation! (a bit slower)
res <- ceLLama(pbmc.markers.list, temperature = 0, seed = 101, get_reason = T)

# These create an HTML report in the current directory
generate_report_md(res)
create_html_report()
```

![](ceLLama_files/report-example.png)

View the full report [here](report.html).

## Disclaimer

> [!IMPORTANT]\
> LLMs make mistakes, please check important info.

## License

This project is licensed under the CC BY-NC 4.0 License. For more details, visit [here](https://creativecommons.org/licenses/by-nc/4.0/).

Owner

  • Name: celvox
  • Login: CelVoxes
  • Kind: organization
  • Location: Netherlands

The Voice of Cells

GitHub Events

Total
  • Release event: 1
  • Watch event: 10
  • Push event: 5
  • Fork event: 2
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 10
  • Push event: 5
  • Fork event: 2
  • Create event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 30
  • Total Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 30
  • Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
eonurk o****4@g****m 30

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: about 4 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: about 4 hours
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • maxim-h (2)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels