mllmcelltype

🏆 #1 Multi-LLM consensus framework | 550+ stars | 95% accuracy | 10+ LLM providers | Leading cell annotation tool

https://github.com/cafferychen777/mllmcelltype

Keywords

artificial-intelligence bioinformatics cell-type-annotation claude computational-biology consensus-algorithm deepseek gemini grok large-language-models llm multi-llm-consensus openai openrouter qwen scanpy scrna scrnaseq-analysis seurat single-cell

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: cafferychen777
License: mit
Language: Python
Default Branch: main
Homepage: https://www.mllmcelltype.com/
Size: 60.8 MB

Statistics

Stars: 548
Watchers: 20
Forks: 47
Open Issues: 9
Releases: 5

Created 11 months ago · Last pushed 6 months ago

Metadata Files

Readme Contributing Funding License Code of conduct Citation Security

README.md

mLLMCelltype: Multi-LLM Consensus Framework for Cell Type Annotation

mLLMCelltype is a multi-LLM consensus framework for automated cell type annotation in single-cell RNA sequencing (scRNA-seq) data. The framework integrates multiple large language models including OpenAI GPT-5/4.1, Anthropic Claude-4/3.7/3.5, Google Gemini-2.0, X.AI Grok-3, DeepSeek-V3, Alibaba Qwen2.5, Zhipu GLM-4, MiniMax, Stepfun, and OpenRouter to improve annotation accuracy through consensus-based predictions.

Key Advantages:

- Improved Accuracy: Achieves 95% annotation accuracy through multi-model consensus
- Reduced Bias: Multiple model integration minimizes individual model limitations
- Cost Efficiency: 70-80% API cost reduction through optimized consensus algorithms
- Uncertainty Quantification: Provides metrics for annotation confidence assessment

Abstract

mLLMCelltype is an open-source tool for single-cell transcriptomics analysis that uses multiple large language models to identify cell types from gene expression data. The software implements a consensus approach where multiple models analyze the same data and their predictions are combined, which helps reduce errors and provides uncertainty metrics. This methodology offers advantages over single-model approaches through integration of multiple model predictions. mLLMCelltype integrates with single-cell analysis platforms such as Scanpy and Seurat, allowing researchers to incorporate it into existing workflows. The method does not require reference datasets for annotation.

Comparison with Other Methods:

- Consensus-based approach: Multi-model consensus provides improved reliability compared to single-model systems
- Model Support: Compatible with 10+ LLM providers
- Performance: 95% accuracy in benchmark studies with uncertainty quantification
- Community Adoption: 540+ GitHub stars

News

Web Application Launch (2025-06-18)

We're excited to announce the launch of mLLMCelltype Web Application! Now you can access mLLMCelltype's powerful cell type annotation capabilities directly through your web browser without any installation required.

**Key Features:**

- Easy-to-use interface: Upload your scRNA-seq data and get annotations in minutes
- Multi-LLM consensus: Choose from various AI models including GPT-4, Claude, Gemini, and more
- Real-time processing: Monitor annotation progress with live updates
- Multiple export formats: Download results in CSV, TSV, Excel, or JSON formats
- No setup required: Start annotating immediately without installing packages

**Access the Web App**: https://mllmcelltype.com

**Beta Testing Phase**: The web application is currently in beta testing. We welcome your feedback and suggestions to help us improve the platform. Please report any issues or share your experience through our GitHub Issues or Discord community.

CRAN Release (2025-09-02)

mLLMCelltype is now available on CRAN. Install the package in R using `install.packages("mLLMCelltype")`.

CRAN page: https://CRAN.R-project.org/package=mLLMCelltype
DOI: 10.32614/CRAN.package.mLLMCelltype

**Important: Gemini Model Migration (2025-06-02)**

Google has discontinued several Gemini 1.5 models and will discontinue more on September 24, 2025:

- Already discontinued: Gemini 1.5 Pro 001, Gemini 1.5 Flash 001
- Will be discontinued on Sept 24, 2025: Gemini 1.5 Pro 002, Gemini 1.5 Flash 002, Gemini 1.5 Flash-8B-001

Recommended migration: Use gemini-2.0-flash or gemini-2.0-flash-lite for better performance and continued support. The aliases gemini-1.5-pro and gemini-1.5-flash will continue to work until September 24, 2025, as they point to the -002 versions.

**Important: Claude Model Deprecation (2025-07-21)**

Anthropic will retire the following Claude models on July 21, 2025:

- Claude 2 (all versions)
- Claude 2.1
- Claude 3 Sonnet (non-versioned)
- Claude 3 Opus (non-versioned)

Recommended migration:

- For Claude 2/2.1: use claude-sonnet-4-20250514 or claude-3-5-sonnet-20241022
- For Claude 3 Sonnet: use claude-sonnet-4-20250514 or claude-3-7-sonnet-20250219
- For Claude 3 Opus: use claude-opus-4-20250514 or claude-3-opus-20240229

Please update your code before July 21, 2025 to avoid service disruption.
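The migration table above can be encoded as a small lookup so that scripts keep running after the retirement date. This is only a sketch: the dictionary keys are illustrative labels for the retired model families (they may not match the exact identifiers your code passes to the API), and `migrate_models` is a hypothetical helper, not part of mLLMCelltype.

```python
# Recommended replacements, copied from the migration notice above.
# The keys are illustrative labels for the retired families (assumption).
CLAUDE_MIGRATION = {
    "claude-2": ["claude-sonnet-4-20250514", "claude-3-5-sonnet-20241022"],
    "claude-2.1": ["claude-sonnet-4-20250514", "claude-3-5-sonnet-20241022"],
    "claude-3-sonnet": ["claude-sonnet-4-20250514", "claude-3-7-sonnet-20250219"],
    "claude-3-opus": ["claude-opus-4-20250514", "claude-3-opus-20240229"],
}


def migrate_models(models):
    """Replace retired model names with the first recommended alternative."""
    migrated = []
    for name in models:
        replacements = CLAUDE_MIGRATION.get(name)
        if replacements:
            print(f"{name} is retired; using {replacements[0]} instead")
            migrated.append(replacements[0])
        else:
            # Unknown or still-supported names pass through unchanged.
            migrated.append(name)
    return migrated


models = migrate_models(["claude-2.1", "gpt-5", "claude-3-opus"])
print(models)
```

Picking the first entry is an arbitrary choice for the sketch; either recommended replacement is valid according to the notice.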

August 2025: mLLMCelltype has reached 540+ GitHub stars with growing community adoption. We thank all contributors and users who have supported this project.

Key Features

Multi-LLM Consensus: Integrates predictions from multiple LLMs to reduce single-model limitations and biases
Model Support: Compatible with 10+ LLM providers including OpenAI, Anthropic, Google, and others
Accuracy: 95% accuracy validated through benchmarking on multiple datasets
Cost Efficiency: 70-80% API cost reduction through consensus optimization
Iterative Discussion: LLMs evaluate evidence and refine annotations through multiple rounds of discussion
Uncertainty Quantification: Provides Consensus Proportion and Shannon Entropy metrics to identify uncertain annotations
Error Reduction: Cross-model validation reduces incorrect predictions
Noise Tolerance: Maintains accuracy with imperfect marker gene lists
Hierarchical Annotation: Supports multi-resolution analysis with consistency checks
Reference-Free: Performs annotation without pre-training or reference datasets
Documentation: Records complete reasoning process for transparency
Integration: Compatible with Scanpy/Seurat workflows and marker gene outputs
Extensibility: Supports addition of new LLMs as they become available
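
To make the two uncertainty metrics concrete, here is a minimal sketch. It assumes that Consensus Proportion is the share of models voting for the most common label and that Shannon Entropy is measured in bits over the distribution of labels. The function name and the label normalization are illustrative assumptions, not the package's implementation.

```python
import math
from collections import Counter


def uncertainty_metrics(predictions):
    """Return (majority_label, consensus_proportion, shannon_entropy_bits).

    `predictions` is a list of cell type labels, one per model.
    Hypothetical helper for illustration; not part of the mLLMCelltype API.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    # Normalize case and whitespace so "T cells" and "t cells " count as one label.
    normalized = [p.strip().lower() for p in predictions]
    counts = Counter(normalized)
    majority_label, majority_votes = counts.most_common(1)[0]
    total = len(normalized)
    consensus_proportion = majority_votes / total
    # Shannon entropy in bits: 0 when all models agree, larger when votes are split.
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return majority_label, consensus_proportion, entropy


votes = ["T cells", "T cells", "T cells", "NK cells"]
label, proportion, entropy = uncertainty_metrics(votes)
print(label, round(proportion, 2), round(entropy, 3))
```

A cluster with a low proportion or a high entropy is one where the models disagree, which makes it a natural candidate for manual review.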

Recent Updates

v1.2.3 (2025-05-10)

Bug Fixes

Fixed error handling in consensus checking when API responses are NULL or invalid
Improved error logging for OpenRouter API error responses
Added robust NULL and type checking in check_consensus function

Improvements

Enhanced error diagnostics for OpenRouter API errors
Added detailed logging of API error messages and response structures
Improved robustness when handling unexpected API response formats

v1.2.2 (2025-05-09)

Bug Fixes

Fixed the 'non-character argument' error that occurred when processing API responses
Added robust type checking for API responses across all model providers
Improved error handling for unexpected API response formats

Improvements

Added detailed error logging for API response issues
Implemented consistent error handling patterns across all API processing functions
Enhanced response validation to ensure proper structure before processing

v1.2.1 (2025-05-01)

Improvements

Added support for OpenRouter API
Added support for free models through OpenRouter
Updated documentation with examples for using OpenRouter models

v1.2.0 (2025-04-30)

Features

Added visualization functions for cell type annotation results
Added support for uncertainty metrics visualization
Implemented improved consensus building algorithm

v1.1.5 (2025-04-27)

Bug Fixes

Fixed an issue with cluster index validation that caused errors when processing certain CSV input files
Improved error handling for negative indices with clearer error messages

Improvements

Added example script for CSV-based annotation workflow (cat_heart_annotation.R)
Enhanced input validation with more detailed diagnostics
Updated documentation to clarify CSV input format requirements

See NEWS.md for a complete changelog.

Directory Structure

R/: R language interface and implementation
python/: Python interface and implementation

Installation

R Version

```r
# Install from CRAN (recommended)
install.packages("mLLMCelltype")

# Or install development version from GitHub
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")
```

Python Version

Quick Start: Try mLLMCelltype instantly in Google Colab without any installation! Click the badge above to open our interactive notebook with examples and step-by-step guidance.

```bash
# Install from PyPI
pip install mllmcelltype

# Or install from GitHub (note the subdirectory parameter)
pip install git+https://github.com/cafferychen777/mLLMCelltype.git#subdirectory=python
```

Important Note on Dependencies

mLLMCelltype uses a modular design where different LLM provider libraries are optional dependencies. Depending on which models you plan to use, you'll need to install the corresponding packages:

```bash
# For using OpenAI models (GPT-5, etc.)
pip install "mllmcelltype[openai]"

# For using Anthropic models (Claude)
pip install "mllmcelltype[anthropic]"

# For using Google models (Gemini)
pip install "mllmcelltype[gemini]"

# To install all optional dependencies at once
pip install "mllmcelltype[all]"
```

If you encounter errors like ImportError: cannot import name 'genai' from 'google', it means you need to install the corresponding provider package. For example:

```bash
# For Google Gemini models
pip install google-genai
```
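
Before starting a long annotation run, it can help to check which provider libraries are importable. The sketch below uses only the standard library. The import names in `PROVIDER_MODULES` are assumptions about each provider's client package, and both helper functions are hypothetical rather than part of mLLMCelltype.

```python
import importlib.util

# Import names assumed for each provider's client library; adjust if yours differ.
PROVIDER_MODULES = {
    "openai": "openai",
    "anthropic": "anthropic",
    "gemini": "google.genai",
}


def is_installed(module_name):
    """Return True if `module_name` can be located in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `google`) is itself missing
        # or when a module has no usable import spec.
        return False


def missing_providers(providers):
    """List the providers whose client library is not importable."""
    return [p for p in providers if not is_installed(PROVIDER_MODULES[p])]


for provider in missing_providers(PROVIDER_MODULES):
    print(f'Missing client for {provider}: try pip install "mllmcelltype[{provider}]"')
```

The extras names in the suggested command (`openai`, `anthropic`, `gemini`) are the ones listed in the installation commands above.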

Supported Models

OpenAI: GPT-5/GPT-4.1/GPT-4.5 (API Key)
Anthropic: Claude-4-Opus/Claude-4-Sonnet/Claude-3.7-Sonnet/Claude-3.5-Haiku (API Key)
Google: Gemini-2.0-Pro/Gemini-2.0-Flash (API Key)
Alibaba: Qwen2.5-Max (API Key)
DeepSeek: DeepSeek-V3/DeepSeek-R1 (API Key)
Minimax: MiniMax-Text-01 (API Key)
Stepfun: Step-2-16K (API Key)
Zhipu: GLM-4 (API Key)
X.AI: Grok-3/Grok-3-mini (API Key)
OpenRouter: Access to multiple models through a single API (API Key)
- Supports models from OpenAI, Anthropic, Meta, Google, Mistral, and more
- Format: 'provider/model-name' (e.g., 'openai/gpt-5', 'anthropic/claude-3-opus')
- Free models available with :free suffix (e.g., 'microsoft/mai-ds-r1:free', 'deepseek/deepseek-chat:free')
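
The OpenRouter identifier format described above ('provider/model-name' with an optional :free suffix) can be validated before any API call is made. The parser below is a hypothetical helper written for illustration; it is not part of mLLMCelltype or the OpenRouter client.

```python
def parse_openrouter_model(model_id):
    """Split 'provider/model-name[:free]' into its parts.

    Hypothetical helper for illustration; not part of the mLLMCelltype API.
    """
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    is_free = name.endswith(":free")
    if is_free:
        # Drop the suffix so the bare model name is available separately.
        name = name[: -len(":free")]
    return {"provider": provider, "model": name, "free": is_free}


for model_id in ["openai/gpt-5", "microsoft/mai-ds-r1:free"]:
    print(parse_openrouter_model(model_id))
```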

Usage Examples

Python

```python
# Example of using mLLMCelltype for single-cell RNA-seq cell type annotation with Scanpy
import scanpy as sc
import pandas as pd
from mllmcelltype import annotate_clusters, interactive_consensus_annotation
import os

# Note: Logging is automatically configured when importing mllmcelltype
# You can customize logging if needed using the logging module

# Load your single-cell RNA-seq dataset in AnnData format
adata = sc.read_h5ad('your_data.h5ad')  # Replace with your scRNA-seq dataset path

# Perform Leiden clustering for cell population identification if not already done
if 'leiden' not in adata.obs.columns:
    print("Computing leiden clustering for cell population identification...")
    # Preprocess single-cell data: normalize counts and log-transform for gene expression analysis
    if 'log1p' not in adata.uns:
        sc.pp.normalize_total(adata, target_sum=1e4)  # Normalize to 10,000 counts per cell
        sc.pp.log1p(adata)  # Log-transform normalized counts

    # Dimensionality reduction: calculate PCA for scRNA-seq data
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)  # Select informative genes
        sc.pp.pca(adata, use_highly_variable=True)  # Compute principal components

    # Cell clustering: compute neighborhood graph and perform Leiden community detection
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)  # Build KNN graph for clustering
    sc.tl.leiden(adata, resolution=0.8)  # Identify cell populations using Leiden algorithm
    print(f"Leiden clustering completed, identified {len(adata.obs['leiden'].cat.categories)} distinct cell populations")

# Identify marker genes for each cell cluster using differential expression analysis
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')  # Wilcoxon rank-sum test for marker detection

# Extract top marker genes for each cell cluster to use in cell type annotation
marker_genes = {}
for i in range(len(adata.obs['leiden'].cat.categories)):
    # Select top 10 differentially expressed genes as markers for each cluster
    genes = [adata.uns['rank_genes_groups']['names'][str(i)][j] for j in range(10)]
    marker_genes[str(i)] = genes

# IMPORTANT: mLLMCelltype requires gene symbols (e.g., KCNJ8, PDGFRA) not Ensembl IDs (e.g., ENSG00000176771)
# If your AnnData object uses Ensembl IDs, convert them to gene symbols for accurate annotation:
# Example conversion code:
# if 'Gene' in adata.var.columns:  # Check if gene symbols are available in the metadata
#     gene_name_dict = dict(zip(adata.var_names, adata.var['Gene']))
#     marker_genes = {cluster: [gene_name_dict.get(gene_id, gene_id) for gene_id in genes]
#                     for cluster, genes in marker_genes.items()}

# IMPORTANT: mLLMCelltype requires numeric cluster IDs
# The 'cluster' column must contain numeric values or values that can be converted to numeric.
# Non-numeric cluster IDs (e.g., "cluster_1", "T_cells", "7_0") may cause errors or unexpected behavior.
# If your data contains non-numeric cluster IDs, create a mapping between original IDs and numeric IDs:
# Example standardization code:
# original_ids = list(marker_genes.keys())
# id_mapping = {original: idx for idx, original in enumerate(original_ids)}
# marker_genes = {str(id_mapping[cluster]): genes for cluster, genes in marker_genes.items()}

# Configure API keys for the large language models used in consensus annotation
# At least one API key is required for multi-LLM consensus annotation
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"        # For GPT-5/4.1 models (recommended)
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"  # For Claude-4/3.7/3.5 models
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"        # For Google Gemini-2.5 models
os.environ["QWEN_API_KEY"] = "your-qwen-api-key"            # For Alibaba Qwen2.5 models

# Additional optional LLM providers for enhanced consensus diversity:
# os.environ["DEEPSEEK_API_KEY"] = "your-deepseek-api-key"      # For DeepSeek-V3 models
# os.environ["ZHIPU_API_KEY"] = "your-zhipu-api-key"            # For Zhipu GLM-4 models
# os.environ["STEPFUN_API_KEY"] = "your-stepfun-api-key"        # For Stepfun models
# os.environ["MINIMAX_API_KEY"] = "your-minimax-api-key"        # For MiniMax models
# os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"  # For accessing multiple models via OpenRouter

# Execute multi-LLM consensus cell type annotation with iterative deliberation
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,  # Dictionary of marker genes for each cluster
    species="human",            # Specify organism for appropriate cell type annotation
    tissue="blood",             # Specify tissue context for more accurate annotation
    models=["gpt-5", "claude-opus-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"],  # Multiple LLMs for consensus
    consensus_threshold=1,      # Minimum proportion required for consensus agreement
    max_discussion_rounds=3     # Number of deliberation rounds between models for refinement
)

# Alternatively, use OpenRouter for accessing multiple models through a single API
# This is especially useful for accessing free models with the :free suffix
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Example using free OpenRouter models (no credits required)
free_models_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="blood",
    models=[
        {"provider": "openrouter", "model": "meta-llama/llama-4-maverick:free"},              # Meta Llama 4 Maverick (free)
        {"provider": "openrouter", "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1:free"},  # NVIDIA Nemotron Ultra 253B (free)
        {"provider": "openrouter", "model": "deepseek/deepseek-chat-v3-0324:free"},           # DeepSeek Chat v3 (free)
        {"provider": "openrouter", "model": "microsoft/mai-ds-r1:free"}                       # Microsoft MAI-DS-R1 (free)
    ],
    consensus_threshold=0.7,
    max_discussion_rounds=2
)

# Retrieve final consensus cell type annotations from the multi-LLM deliberation
final_annotations = consensus_results["consensus"]

# Integrate consensus cell type annotations into the original AnnData object
adata.obs['consensus_cell_type'] = adata.obs['leiden'].astype(str).map(final_annotations)

# Add uncertainty quantification metrics to evaluate annotation confidence
adata.obs['consensus_proportion'] = adata.obs['leiden'].astype(str).map(consensus_results["consensus_proportion"])  # Agreement level
adata.obs['entropy'] = adata.obs['leiden'].astype(str).map(consensus_results["entropy"])  # Annotation uncertainty

# Prepare for visualization: compute UMAP embeddings if not already available
# UMAP provides a 2D representation of cell populations for visualization
if 'X_umap' not in adata.obsm:
    print("Computing UMAP coordinates...")
    # Make sure neighbors are computed first
    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.umap(adata)
    print("UMAP coordinates computed")

# Visualize results with enhanced aesthetics
# Basic visualization
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='right', frameon=True, title='mLLMCelltype Consensus Annotations')

# More customized visualization
import matplotlib.pyplot as plt

# Set figure size and style
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 12

# Create a more publication-ready UMAP
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='on data',
           frameon=True, title='mLLMCelltype Consensus Annotations',
           palette='tab20', size=50, legend_fontsize=12,
           legend_fontoutline=2, ax=ax)

# Visualize uncertainty metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
sc.pl.umap(adata, color='consensus_proportion', ax=ax1, title='Consensus Proportion',
           cmap='viridis', vmin=0, vmax=1, size=30)
sc.pl.umap(adata, color='entropy', ax=ax2, title='Annotation Uncertainty (Shannon Entropy)',
           cmap='magma', vmin=0, size=30)
plt.tight_layout()
```
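
The numeric cluster ID requirement noted in the example can be handled with a reversible mapping, so that results can be reported under the original labels afterwards. The sketch below is a standalone illustration: both functions are hypothetical helpers, and the annotation dictionary is a stand-in for real model output.

```python
def standardize_cluster_ids(marker_genes):
    """Re-key a marker dictionary with numeric string IDs ("0", "1", ...).

    Returns the re-keyed dictionary and a map from numeric ID back to the
    original label. Hypothetical helper, not part of the mLLMCelltype API.
    """
    numeric_markers = {}
    id_to_original = {}
    for idx, (original, genes) in enumerate(marker_genes.items()):
        numeric_id = str(idx)
        numeric_markers[numeric_id] = list(genes)
        id_to_original[numeric_id] = original
    return numeric_markers, id_to_original


def restore_cluster_ids(annotations, id_to_original):
    """Translate annotations keyed by numeric ID back to the original labels."""
    return {id_to_original[k]: v for k, v in annotations.items()}


markers = {"T_cells": ["CD3D", "CD3E"], "7_0": ["MS4A1", "CD79A"]}
numeric_markers, id_to_original = standardize_cluster_ids(markers)
# Stand-in for the annotation result; real output comes from the LLM calls.
fake_annotations = {"0": "T cells", "1": "B cells"}
restored = restore_cluster_ids(fake_annotations, id_to_original)
print(restored)
```

Keeping `id_to_original` around matters because the numeric keys only reflect dictionary order, which carries no biological meaning.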

Using a Single Free OpenRouter Model

For users who prefer a simpler approach with just one model, the Microsoft MAI-DS-R1 free model via OpenRouter provides excellent results:

```python
import os
from mllmcelltype import annotate_clusters

# Note: Logging is automatically configured

# Set your OpenRouter API key
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Define marker genes for each cluster
marker_genes = {
    "0": ["CD3D", "CD3E", "CD3G", "CD2", "IL7R", "TCF7"],           # T cells
    "1": ["CD19", "MS4A1", "CD79A", "CD79B", "HLA-DRA", "CD74"],    # B cells
    "2": ["CD14", "LYZ", "CSF1R", "ITGAM", "CD68", "FCGR3A"]        # Monocytes
}

# Annotate using Microsoft MAI-DS-R1 free model
annotations = annotate_clusters(
    marker_genes=marker_genes,
    species='human',
    tissue='peripheral blood',
    provider='openrouter',
    model='microsoft/mai-ds-r1:free'  # Free model
)

# Print annotations
for cluster, annotation in annotations.items():
    print(f"Cluster {cluster}: {annotation}")
```

This approach is fast, accurate, and doesn't require any API credits, making it ideal for quick analyses or when you have limited API access.

Extracting Marker Genes from AnnData Objects

If you're using Scanpy with AnnData objects, you can easily extract marker genes directly from the rank_genes_groups results:

```python
import os
import scanpy as sc
from mllmcelltype import annotate_clusters

# Note: Logging is automatically configured

# Set your OpenRouter API key
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Load and preprocess your data
adata = sc.read_h5ad('your_data.h5ad')

# Perform preprocessing and clustering if not already done
# sc.pp.normalize_total(adata, target_sum=1e4)
# sc.pp.log1p(adata)
# sc.pp.highly_variable_genes(adata)
# sc.pp.pca(adata)
# sc.pp.neighbors(adata)
# sc.tl.leiden(adata)

# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract top marker genes for each cluster
marker_genes = {
    cluster: adata.uns['rank_genes_groups']['names'][cluster][:10].tolist()
    for cluster in adata.obs['leiden'].cat.categories
}

# Annotate using Microsoft MAI-DS-R1 free model
annotations = annotate_clusters(
    marker_genes=marker_genes,
    species='human',
    tissue='peripheral blood',  # adjust based on your tissue type
    provider='openrouter',
    model='microsoft/mai-ds-r1:free'  # Free model
)

# Add annotations to AnnData object
adata.obs['cell_type'] = adata.obs['leiden'].astype(str).map(annotations)

# Visualize results
sc.pl.umap(adata, color='cell_type', legend_loc='on data',
           frameon=True, title='Cell Types Annotated by MAI-DS-R1')
```

This method automatically extracts the top differentially expressed genes for each cluster from the rank_genes_groups results, making it easy to integrate mLLMCelltype into your Scanpy workflow.

R

Note: For more detailed R tutorials and documentation, please visit the mLLMCelltype documentation website.

Using Seurat Object

```r
# Load required packages
library(mLLMCelltype)
library(Seurat)
library(dplyr)
library(ggplot2)
library(cowplot) # Added for plot_grid

# Load your preprocessed Seurat object
pbmc <- readRDS("your_seurat_object.rds")

# If starting with raw data, perform preprocessing steps
# pbmc <- NormalizeData(pbmc)
# pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# pbmc <- ScaleData(pbmc)
# pbmc <- RunPCA(pbmc)
# pbmc <- FindNeighbors(pbmc, dims = 1:10)
# pbmc <- FindClusters(pbmc, resolution = 0.5)
# pbmc <- RunUMAP(pbmc, dims = 1:10)

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Set up cache directory to speed up processing
cache_dir <- "./mllmcelltype_cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Choose a model from any supported provider
# Supported models include:
# - OpenAI: 'gpt-5', 'gpt-5-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
# - Anthropic: 'claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'
# - Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro', 'gemini-1.5-flash'
# - Qwen: 'qwen-max-2025-01-25'
# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'
# - Zhipu: 'glm-4-plus', 'glm-3-turbo'
# - MiniMax: 'minimax-text-01'
# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
# - OpenRouter: Access to models from multiple providers through a single API. Format: 'provider/model-name'
#   - OpenAI models: 'openai/gpt-5', 'openai/gpt-5-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
#   - Anthropic models: 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
#   - Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'

# Run LLMCelltype annotation with multiple LLM models
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC",  # provide tissue context
  models = c(
    "claude-opus-4-20250514",  # Anthropic Claude 4 (latest)
    "gpt-5",                   # OpenAI
    "gemini-2.5-pro",          # Google
    "qwen-max-2025-01-25"      # Alibaba
  ),
  api_keys = list(
    anthropic = "your-anthropic-key",
    openai = "your-openai-key",
    gemini = "your-google-key",
    qwen = "your-qwen-key"
  ),
  top_gene_count = 10,
  controversy_threshold = 1.0,
  entropy_threshold = 1.0,
  cache_dir = cache_dir
)

# Print structure of results to understand the data
print("Available fields in consensus_results:")
print(names(consensus_results))

# Add annotations to Seurat object
# Get cell type annotations from consensus_results$final_annotations
cluster_to_cell_type_map <- consensus_results$final_annotations

# Create new cell type identifier column
cell_types <- as.character(Idents(pbmc))
for (cluster_id in names(cluster_to_cell_type_map)) {
  cell_types[cell_types == cluster_id] <- cluster_to_cell_type_map[[cluster_id]]
}

# Add cell type annotations to Seurat object
pbmc$cell_type <- cell_types

# Add uncertainty metrics
# Extract detailed consensus results containing metrics
consensus_details <- consensus_results$initial_results$consensus_results

# Create a data frame with metrics for each cluster
uncertainty_metrics <- data.frame(
  cluster_id = names(consensus_details),
  consensus_proportion = sapply(consensus_details, function(res) res$consensus_proportion),
  entropy = sapply(consensus_details, function(res) res$entropy)
)

# Add uncertainty metrics for each cell
# Note: seurat_clusters is a metadata column automatically created by FindClusters() function
# It contains the cluster ID assigned to each cell during clustering
# Here we use it to map cluster-level metrics (consensus_proportion and entropy) to individual cells

# If you don't have seurat_clusters column (e.g., if you used a different clustering method),
# you can use the active identity (Idents) or any other cluster assignment in your metadata:
# Option 1: Use active identity
# current_clusters <- as.character(Idents(pbmc))
# Option 2: Use another metadata column that contains cluster IDs
# current_clusters <- pbmc$your_cluster_column

# For this example, we use the standard seurat_clusters column:
current_clusters <- pbmc$seurat_clusters  # Get cluster ID for each cell

# Match each cell's cluster ID with the corresponding metrics in uncertainty_metrics
pbmc$consensus_proportion <- uncertainty_metrics$consensus_proportion[match(current_clusters, uncertainty_metrics$cluster_id)]
pbmc$entropy <- uncertainty_metrics$entropy[match(current_clusters, uncertainty_metrics$cluster_id)]

# Save results for future use
saveRDS(consensus_results, "pbmc_mLLMCelltype_results.rds")
saveRDS(pbmc, "pbmc_annotated.rds")

# Visualize results with SCpubr for publication-ready plots
if (!requireNamespace("SCpubr", quietly = TRUE)) {
  remotes::install_github("enblacar/SCpubr")
}
library(SCpubr)
library(viridis)  # For color palettes

# Basic UMAP visualization with default settings
pdf("pbmc_basic_annotations.pdf", width = 8, height = 6)
SCpubr::do_DimPlot(sample = pbmc,
                   group.by = "cell_type",
                   label = TRUE,
                   legend.position = "right") +
  ggtitle("mLLMCelltype Consensus Annotations")
dev.off()

# More customized visualization with enhanced styling
pdf("pbmc_custom_annotations.pdf", width = 8, height = 6)
SCpubr::do_DimPlot(sample = pbmc,
                   group.by = "cell_type",
                   label = TRUE,
                   label.box = TRUE,
                   legend.position = "right",
                   pt.size = 1.0,
                   border.size = 1,
                   font.size = 12) +
  ggtitle("mLLMCelltype Consensus Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
dev.off()

# Visualize uncertainty metrics with enhanced SCpubr plots
# Get cell types and create a named color palette
cell_types <- unique(pbmc$cell_type)
color_palette <- viridis::viridis(length(cell_types))
names(color_palette) <- cell_types

# Cell type annotations with SCpubr
p1 <- SCpubr::do_DimPlot(sample = pbmc,
                         group.by = "cell_type",
                         label = TRUE,
                         legend.position = "bottom",  # Place legend at the bottom
                         pt.size = 1.0,
                         label.size = 4,              # Smaller label font size
                         label.box = TRUE,            # Add background box to labels for better readability
                         repel = TRUE,                # Make labels repel each other to avoid overlap
                         colors.use = color_palette,
                         plot.title = "Cell Type") +
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        legend.text = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Consensus proportion feature plot with SCpubr
p2 <- SCpubr::do_FeaturePlot(sample = pbmc,
                             features = "consensus_proportion",
                             order = TRUE,
                             pt.size = 1.0,
                             enforce_symmetry = FALSE,
                             legend.title = "Consensus",
                             plot.title = "Consensus Proportion",
                             sequential.palette = "YlGnBu",  # Yellow-Green-Blue gradient, following Nature Methods standards
                             sequential.direction = 1,       # Light to dark direction
                             min.cutoff = min(pbmc$consensus_proportion),  # Set minimum value
                             max.cutoff = max(pbmc$consensus_proportion),  # Set maximum value
                             na.value = "lightgrey") +       # Color for missing values
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Shannon entropy feature plot with SCpubr
p3 <- SCpubr::do_FeaturePlot(sample = pbmc,
                             features = "entropy",
                             order = TRUE,
                             pt.size = 1.0,
                             enforce_symmetry = FALSE,
                             legend.title = "Entropy",
                             plot.title = "Shannon Entropy",
                             sequential.palette = "OrRd",  # Orange-Red gradient, following Nature Methods standards
                             sequential.direction = -1,    # Dark to light direction (reversed)
                             min.cutoff = min(pbmc$entropy),  # Set minimum value
                             max.cutoff = max(pbmc$entropy),  # Set maximum value
                             na.value = "lightgrey") +     # Color for missing values
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Combine plots with equal widths
pdf("pbmc_uncertainty_metrics.pdf", width = 18, height = 7)
combined_plot <- cowplot::plot_grid(p1, p2, p3, ncol = 3, rel_widths = c(1.2, 1.2, 1.2))
print(combined_plot)
dev.off()
```

Using CSV Input

You can also use mLLMCelltype with CSV files directly without Seurat, which is useful for cases where you already have marker genes available in CSV format:

```r
# Install the latest version of mLLMCelltype
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R", force = TRUE)

# Load necessary packages
library(mLLMCelltype)

# Configure unified logging (optional - uses defaults if not specified)
configure_logger(level = "INFO", console_output = TRUE, json_format = TRUE)

# Create cache directory
cache_dir <- "path/to/your/cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Read CSV file content
markers_file <- "path/to/your/markers.csv"
file_content <- readLines(markers_file)

# Skip header row
data_lines <- file_content[-1]

# Convert data to list format, using numeric indices as keys
marker_genes_list <- list()
cluster_names <- c()

# First collect all cluster names
for (line in data_lines) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  cluster_names <- c(cluster_names, parts[1])
}

# Then create marker_genes_list with numeric indices
for (i in seq_along(data_lines)) {
  line <- data_lines[i]
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]

  # First part is the cluster name
  cluster_name <- parts[1]

  # Use index as key (0-based index, compatible with Seurat)
  cluster_id <- as.character(i - 1)

  # Remaining parts are genes
  genes <- parts[-1]

  # Filter out NA and empty strings
  genes <- genes[!is.na(genes) & genes != ""]

  # Add to marker_genes_list
  marker_genes_list[[cluster_id]] <- list(genes = genes)
}

# Set API keys
api_keys <- list(
  gemini = "YOUR_GEMINI_API_KEY",
  qwen = "YOUR_QWEN_API_KEY",
  grok = "YOUR_GROK_API_KEY",
  openai = "YOUR_OPENAI_API_KEY",
  anthropic = "YOUR_ANTHROPIC_API_KEY"
)

# Run consensus annotation with paid models
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "your tissue type",  # e.g., "human heart"
  models = c("gemini-2.5-pro", "gemini-2.5-flash", "qwen-max-2025-01-25",
             "grok-3-latest", "claude-sonnet-4-20250514", "gpt-5"),
  api_keys = api_keys,
  controversy_threshold = 0.6,
  entropy_threshold = 1.0,
  max_discussion_rounds = 3,
  cache_dir = cache_dir
)

# Alternatively, use free OpenRouter models (no credits required)
# Add OpenRouter API key to the api_keys list
api_keys$openrouter <- "your-openrouter-api-key"

# Run consensus annotation with free models
free_consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "your tissue type",  # e.g., "human heart"
  models = c(
    "meta-llama/llama-4-maverick:free",              # Meta Llama 4 Maverick (free)
    "nvidia/llama-3.1-nemotron-ultra-253b-v1:free",  # NVIDIA Nemotron Ultra 253B (free)
    "deepseek/deepseek-chat-v3-0324:free",           # DeepSeek Chat v3 (free)
    "microsoft/mai-ds-r1:free"                       # Microsoft MAI-DS-R1 (free)
  ),
  api_keys = api_keys,
  consensus_check_model = "deepseek/deepseek-chat-v3-0324:free",  # Free model for consensus checking
  controversy_threshold = 0.6,
  entropy_threshold = 1.0,
  max_discussion_rounds = 2,
  cache_dir = cache_dir
)

# Save results
saveRDS(consensus_results, "your_results.rds")

# Print results summary
cat("\nResults summary:\n")
cat("Available fields:", paste(names(consensus_results), collapse = ", "), "\n\n")

# Print final annotations
cat("Final cell type annotations:\n")
for (cluster in names(consensus_results$final_annotations)) {
  cat(sprintf("%s: %s\n", cluster, consensus_results$final_annotations[[cluster]]))
}
```

Notes on CSV format:
- The CSV file should have values in the first column that will be used as indices (these can be cluster names, numbers like 0,1,2,3 or 1,2,3,4, etc.)
- The values in the first column are only used for reference and are not passed to the LLMs
- Subsequent columns should contain marker genes for each cluster
- An example CSV file for cat heart tissue is included in the package at inst/extdata/Cat_Heart_markers.csv

Example CSV structure:

    cluster,gene
    0,Negr1,Cask,Tshz2,Ston2,Fstl1,Dse,Celf2,Hmcn2,Setbp1,Cblb
    1,Palld,Grb14,Mybpc3,Ensfcag00000044939,Dcun1d2,Acacb,Slco1c1,Ppp1r3c,Sema3c,Ppp1r14c
    2,Adgrf5,Tbx1,Slco2b1,Pi15,Adam23,Bmx,Pde8b,Pkhd1l1,Dtx1,Ensfcag00000051556
    3,Clec2d,Trat1,Rasgrp1,Card11,Cytip,Sytl3,Tmem156,Bcl11b,Lcp1,Lcp2

You can access the example data in your R script using:

    system.file("extdata", "Cat_Heart_markers.csv", package = "mLLMCelltype")
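
For Python users, the same CSV layout can be turned into a marker dictionary with the standard library alone. The following is a minimal sketch, not package code: `load_marker_csv` is a hypothetical helper name, and it mirrors the R loop above by keying clusters on their 0-based row index and dropping empty cells.

```python
import csv
import io


def load_marker_csv(text):
    """Parse marker CSV text into ({"0": [genes], ...}, [first-column values])."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    marker_genes = {}
    cluster_names = []
    # Blank lines are ignored so they do not shift the 0-based index
    for index, row in enumerate(r for r in reader if r):
        cluster_names.append(row[0].strip())  # kept for reference only
        marker_genes[str(index)] = [g.strip() for g in row[1:] if g.strip()]
    return marker_genes, cluster_names


example = """cluster,gene
0,Negr1,Cask,Tshz2
1,Palld,Grb14,,Mybpc3
"""
marker_genes, cluster_names = load_marker_csv(example)
print(marker_genes)
```

To read from disk instead, pass the file contents, for example `load_marker_csv(open(path, encoding="utf-8").read())`.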

Using a Single LLM Model

If you only want to use a single LLM model instead of the consensus approach, use the annotate_cell_types() function. This is useful when you have access to only one API key or prefer a specific model:

```r
# Load required packages
library(mLLMCelltype)
library(Seurat)
library(ggplot2)  # provides ggtitle()

# Load your preprocessed Seurat object
pbmc <- readRDS("your_seurat_object.rds")

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Choose a model from any supported provider
# Supported models include:
# - OpenAI: 'gpt-5', 'gpt-5-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
# - Anthropic: 'claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'
# - Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro', 'gemini-1.5-flash'
# - Qwen: 'qwen-max-2025-01-25'
# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'
# - Zhipu: 'glm-4-plus', 'glm-3-turbo'
# - MiniMax: 'minimax-text-01'
# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
# - OpenRouter: Access to models from multiple providers through a single API. Format: 'provider/model-name'
#   - OpenAI models: 'openai/gpt-5', 'openai/gpt-5-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
#   - Anthropic models: 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
#   - Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'

# Run cell type annotation with a single LLM model
single_model_results <- annotate_cell_types(
  input = pbmc_markers,
  tissue_name = "human PBMC",        # provide tissue context
  model = "claude-opus-4-20250514",  # specify a single model (Claude 4 Opus)
  api_key = "your-anthropic-key",    # provide the API key directly
  top_gene_count = 10
)

# Using a free OpenRouter model
free_model_results <- annotate_cell_types(
  input = pbmc_markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free",  # free model with :free suffix
  api_key = "your-openrouter-key",
  top_gene_count = 10
)

# Print the results
print(single_model_results)

# Add annotations to Seurat object
# single_model_results is a character vector with one annotation per cluster
pbmc$cell_type <- plyr::mapvalues(
  x = as.character(Idents(pbmc)),
  from = as.character(0:(length(single_model_results) - 1)),
  to = single_model_results
)

# Visualize results
DimPlot(pbmc, group.by = "cell_type", label = TRUE) +
  ggtitle("Cell Types Annotated by Single LLM Model")
```

Comparing Different Models

You can also compare annotations from different models by running annotate_cell_types() multiple times with different models:

```r
# Define models to test
models_to_test <- c(
  "claude-sonnet-4-20250514",  # Anthropic
  "gpt-5",                     # OpenAI
  "gemini-2.5-pro",            # Google
  "qwen-max-2025-01-25"        # Alibaba
)

# API keys for different providers
api_keys <- list(
  anthropic = "your-anthropic-key",
  openai = "your-openai-key",
  gemini = "your-gemini-key",
  qwen = "your-qwen-key"
)

# Test each model and store results
results <- list()
for (model in models_to_test) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]

  # Run annotation
  results[[model]] <- annotate_cell_types(
    input = pbmc_markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )

  # Add to Seurat object
  column_name <- paste0("cell_type_", gsub("[^a-zA-Z0-9]", "_", model))
  pbmc[[column_name]] <- plyr::mapvalues(
    x = as.character(Idents(pbmc)),
    from = as.character(0:(length(results[[model]]) - 1)),
    to = results[[model]]
  )
}
```
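
Once several models have been run, it can help to quantify how often they agree before inspecting clusters one by one. The sketch below is illustrative Python and not part of mLLMCelltype: `pairwise_agreement` is a hypothetical helper, the input dictionaries are made-up data, and labels are compared by case-insensitive exact match, which is much cruder than an LLM-based semantic comparison.

```python
from itertools import combinations


def pairwise_agreement(results):
    """results maps model name -> {cluster id -> label}.

    Returns {(model_a, model_b): fraction of shared clusters with the same label}.
    """
    agreement = {}
    for model_a, model_b in combinations(sorted(results), 2):
        shared = set(results[model_a]) & set(results[model_b])
        if not shared:
            continue  # nothing to compare for this pair
        same = sum(
            results[model_a][c].strip().lower() == results[model_b][c].strip().lower()
            for c in shared
        )
        agreement[(model_a, model_b)] = same / len(shared)
    return agreement


# Made-up annotations for two hypothetical models
example_results = {
    "model_a": {"0": "T cells", "1": "B cells"},
    "model_b": {"0": "t cells", "1": "Monocytes"},
}
agreement = pairwise_agreement(example_results)
print(agreement)
```

Pairs with low agreement are natural candidates for the consensus workflow, where disagreements are discussed rather than only counted.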

Advanced Consensus Configuration: Specifying the Consensus Check Model

The consensus_check_model parameter (R) / consensus_model parameter (Python) allows you to specify which LLM model to use for consensus checking and discussion moderation. This parameter is critical for the accuracy of consensus annotation because the consensus check model:

Evaluates semantic similarity between different cell type annotations
Calculates consensus metrics (proportion and entropy)
Moderates and synthesizes discussions between models for controversial clusters
Makes final decisions when models disagree

**Important: We strongly recommend using the most capable models available for consensus checking, as this directly impacts annotation quality.**
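
To make the two consensus metrics concrete, here is a minimal Python sketch. It rests on several assumptions that the package may not share: labels are matched by case-insensitive string equality (the real consensus check asks an LLM to judge semantic similarity), entropy uses log base 2, and a cluster is flagged when the proportion falls below `controversy_threshold` or the entropy exceeds `entropy_threshold`. The function names are hypothetical.

```python
import math
from collections import Counter


def consensus_metrics(labels):
    """Return (majority label, consensus proportion, Shannon entropy in bits)."""
    if not labels:
        raise ValueError("at least one label is required")
    normalized = [label.strip().lower() for label in labels]
    counts = Counter(normalized)
    total = len(normalized)
    top_label, top_count = counts.most_common(1)[0]
    proportion = top_count / total
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return top_label, proportion, entropy


def is_controversial(proportion, entropy,
                     controversy_threshold=0.7, entropy_threshold=1.0):
    """Assumed rule: weak majority or high disagreement triggers discussion."""
    return proportion < controversy_threshold or entropy > entropy_threshold


votes = ["T cell", "T cell", "t cell ", "B cell"]  # made-up model outputs
label, proportion, entropy = consensus_metrics(votes)
print(label, proportion, round(entropy, 3), is_controversial(proportion, entropy))
```

With full agreement the entropy is zero and the proportion is 1; when every model gives a different answer, the entropy reaches its maximum for that number of models.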

Recommended Models for Consensus Checking (Ranked by Performance)

Anthropic Claude Models (Highest recommendation)
- claude-opus-4-20250514 - Best overall performance (Claude 4 - latest release, June 27, 2025)
- claude-sonnet-4-20250514 - Excellent balance of performance and speed (Claude 4)
- claude-3-5-sonnet-20241022 - Good performance with faster response
OpenAI Models
- o1 / o1-pro - Advanced reasoning capabilities
- gpt-5 - Strong performance across various cell types
- gpt-4.1 - Latest GPT-4 variant
Google Gemini Models
- gemini-2.5-pro - Top-tier performance with enhanced reasoning
- gemini-2.5-flash - Excellent balance of performance and speed
- gemini-2.0-flash - Good performance with faster processing
Other High-Performance Models
- deepseek-r1 / deepseek-reasoner - Strong reasoning capabilities
- qwen-max-2025-01-25 - Excellent for scientific contexts
- grok-3-latest - Advanced language understanding

R Package Usage

```r
# Example 1: Using the best available model for consensus checking (Recommended)
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "human brain",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"),
  api_keys = api_keys,
  consensus_check_model = "claude-opus-4-20250514",  # Use the most capable model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# Example 2: Using a high-performance model when Claude Opus is not available
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "mouse liver",
  models = c("gpt-5", "gemini-2.5-pro", "qwen-max-2025-01-25"),
  api_keys = api_keys,
  consensus_check_model = "claude-sonnet-4-20250514",  # High-performance Claude 4 model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# Example 3: Using OpenAI's reasoning model for complex cases
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "human immune cells",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  consensus_check_model = "o1",  # OpenAI's advanced reasoning model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# NOT RECOMMENDED: Avoid using less capable or free models for consensus checking
# as this may significantly reduce annotation accuracy
```

Python Package Usage

```python
from mllmcelltype import interactive_consensus_annotation

# marker_genes is the {cluster_id: [gene symbols]} dictionary prepared beforehand

# Example 1: Using the best available model for consensus checking (Recommended)
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="brain",
    models=["gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"],
    consensus_model="claude-opus-4-20250514",  # Use the most capable model
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 2: Using dictionary format with a high-performance model
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="mouse",
    tissue="liver",
    models=["gpt-5", "gemini-2.5-pro", "qwen-max-2025-01-25"],
    consensus_model={"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 3: Using Google's latest model for consensus
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="heart",
    models=["gpt-5", "claude-sonnet-4-20250514", "qwen-max-2025-01-25"],
    consensus_model={"provider": "google", "model": "gemini-2.5-pro"},
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 4: Default behavior (uses Qwen with high-performance fallback)
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="blood",
    models=["gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"],
    # If not specified, defaults to qwen-max-2025-01-25 (a high-performance model)
    consensus_threshold=0.7,
    entropy_threshold=1.0
)
```

Best Practices for Consensus Model Selection

Prioritize Accuracy Over Cost: The consensus check model plays a crucial role in determining final annotations. Using a less capable model here can compromise the entire annotation process.
Model Availability: Ensure you have API access to your chosen consensus model. The system will use fallback models if the primary choice is unavailable.
Consistency: Use the same high-performance model for all consensus checks within a project to ensure consistent evaluation criteria.
Complex Tissues: For challenging tissues (e.g., brain, immune system), consider using the most advanced models like Claude Opus, O1, or Gemini 2.5 Pro.
Default Behavior:
- R: Uses the first model in the models list if not specified
- Python: Defaults to qwen-max-2025-01-25 (a high-performance model) with claude-3-5-sonnet-latest as fallback
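
The fallback idea can be pictured as walking an ordered list of candidates and taking the first one whose provider has an API key. This is only a sketch of that idea: `pick_consensus_model`, the `provider_of` mapping, and the placeholder key are hypothetical and do not describe how the package resolves fallbacks internally.

```python
def pick_consensus_model(preferred, fallbacks, api_keys, provider_of):
    """Return the first candidate model whose provider has a non-empty API key."""
    for model in [preferred, *fallbacks]:
        provider = provider_of.get(model)
        if provider and api_keys.get(provider):
            return model
    raise ValueError("no candidate consensus model has a usable API key")


# Hypothetical provider lookup and keys, for illustration only
provider_of = {
    "qwen-max-2025-01-25": "qwen",
    "claude-3-5-sonnet-latest": "anthropic",
}
api_keys = {"anthropic": "placeholder-key"}  # no Qwen key configured

chosen = pick_consensus_model(
    "qwen-max-2025-01-25", ["claude-3-5-sonnet-latest"], api_keys, provider_of
)
print(chosen)
```

Passing the consensus model explicitly, as in the usage examples above, avoids depending on whichever fallback happens to be reachable.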

Why Model Quality Matters for Consensus Checking

The consensus check model must:
- Accurately assess semantic similarity between different cell type names (e.g., recognizing that "T lymphocyte" and "T cell" refer to the same cell type)
- Understand biological context and hierarchical relationships
- Synthesize discussions from multiple models to reach accurate conclusions
- Provide reliable confidence metrics for downstream analysis

Using a less capable model for these critical tasks can lead to:
- Misidentification of controversial clusters
- Incorrect consensus calculations
- Poor resolution of disagreements between models
- Ultimately, less accurate cell type annotations

Advanced Features: Cluster Selection and Cache Control (v1.3.1)

mLLMCelltype v1.3.1 introduces two powerful parameters that give you fine-grained control over the annotation process:

1. clusters_to_analyze - Selective Cluster Analysis

This parameter allows you to specify exactly which clusters to analyze without manually filtering your input data:

```r
# Example: Focus on specific clusters for T cell subtyping
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC - T cell subtypes",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  clusters_to_analyze = c(0, 1, 7),  # Only analyze T cell clusters
  controversy_threshold = 0.7
)

# Example: Re-analyze controversial clusters with different context
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "activated immune cells",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  clusters_to_analyze = c("3", "5"),  # Focus on specific clusters
  cache_dir = "consensus_cache"
)
```

Benefits:
- No need to subset your data manually
- Maintains original cluster numbering
- Reduces API calls and costs by only analyzing relevant clusters
- Perfect for iterative refinement of specific cell populations
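
Conceptually, selecting clusters amounts to filtering the marker input while leaving the cluster IDs untouched. The Python sketch below illustrates that behaviour under stated assumptions: `select_clusters` is a hypothetical helper, not the package's implementation, and it treats numeric and string IDs as interchangeable, as the two examples above do with `c(0, 1, 7)` and `c("3", "5")`.

```python
def select_clusters(marker_genes, clusters_to_analyze=None):
    """Keep only the requested clusters; IDs are compared as strings."""
    if clusters_to_analyze is None:
        return dict(marker_genes)  # no selection: analyze everything
    wanted = {str(c) for c in clusters_to_analyze}
    missing = wanted - {str(k) for k in marker_genes}
    if missing:
        raise KeyError(f"unknown cluster IDs: {sorted(missing)}")
    # Original keys and their order are preserved; nothing is renumbered
    return {k: v for k, v in marker_genes.items() if str(k) in wanted}


# Made-up marker lists for illustration
all_markers = {
    "0": ["CD3D", "IL7R"],
    "1": ["CD8A", "GZMK"],
    "3": ["MS4A1", "CD79A"],
    "7": ["CCR7", "SELL"],
}
subset = select_clusters(all_markers, [0, 1, 7])
print(list(subset))
```

Because the keys are not renumbered, annotations for the subset can be merged back into earlier results by cluster ID.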

2. force_rerun - Bypass Cache for Fresh Analysis

This parameter forces re-analysis of controversial clusters, bypassing cached results:

```r
# Example: Initial broad analysis
initial_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human brain",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  use_cache = TRUE
)

# Example: Re-analyze with specific subtype context
subtype_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human brain - neuronal subtypes",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  clusters_to_analyze = c(2, 3, 5),  # Neuronal clusters
  force_rerun = TRUE,                # Force fresh analysis despite cache
  use_cache = TRUE                   # Still benefit from cache for non-controversial clusters
)
```

Important Notes:
- force_rerun only affects controversial clusters requiring LLM discussion
- Non-controversial clusters still use cache for performance
- Useful when changing tissue context or focusing on subtypes
- Combines well with clusters_to_analyze for targeted re-analysis
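
The interaction between caching and forced re-runs described in these notes can be sketched as follows. Everything here is hypothetical: `cache_key`, `annotate_cluster`, the in-memory dictionary cache, and the stand-in discussion function are illustrative names, and the package's real cache format and key scheme are not shown in this document.

```python
import hashlib
import json


def cache_key(cluster_id, genes, tissue, models):
    """Stable key: the same inputs always hash to the same string."""
    payload = json.dumps([str(cluster_id), sorted(genes), tissue, sorted(models)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def annotate_cluster(key, controversial, cache, run_discussion,
                     use_cache=True, force_rerun=False):
    """Return a cached result unless force_rerun applies to a controversial cluster."""
    skip_cache = force_rerun and controversial
    if use_cache and not skip_cache and key in cache:
        return cache[key]
    result = run_discussion()  # stand-in for the expensive LLM discussion
    if use_cache:
        cache[key] = result
    return result


cache = {}
calls = []


def fake_discussion():
    calls.append(1)  # count how often the expensive step runs
    return "CD8+ T cells"


key = cache_key(1, ["CD8A", "GZMK"], "human PBMC", ["model-a", "model-b"])
annotate_cluster(key, True, cache, fake_discussion)                     # runs, fills cache
annotate_cluster(key, True, cache, fake_discussion)                     # served from cache
annotate_cluster(key, True, cache, fake_discussion, force_rerun=True)   # runs again
annotate_cluster(key, False, cache, fake_discussion, force_rerun=True)  # served from cache
print(len(calls))
```

In this sketch the tissue name is part of the key, so changing the tissue context would miss the cache on its own; whether the package keys its cache the same way is an assumption.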

Common Use Cases

Iterative Subtyping Workflow:

```r
# Step 1: General cell type annotation
general_types <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human PBMC",
  models = models,
  api_keys = api_keys
)

# Step 2: Focus on T cells with subtype context
t_cell_subtypes <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human T lymphocytes",
  models = models,
  api_keys = api_keys,
  clusters_to_analyze = c(0, 1, 4, 7),  # T cell clusters from step 1
  force_rerun = TRUE                    # Fresh analysis with T cell context
)

# Step 3: Further refine CD8+ T cells
cd8_subtypes <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human CD8+ T cells - activation states",
  models = models,
  api_keys = api_keys,
  clusters_to_analyze = c(1, 4),  # CD8+ clusters
  force_rerun = TRUE
)
```

Cost-Effective Re-analysis:

```r
# Only re-analyze clusters that were controversial
controversial <- initial_results$controversial_clusters

refined_results <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human PBMC - refined",
  models = c("gpt-5", "claude-opus-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  clusters_to_analyze = controversial,  # Only controversial ones
  force_rerun = TRUE,
  consensus_check_model = "claude-opus-4-20250514"  # Use best model
)
```

These features significantly enhance the flexibility and efficiency of mLLMCelltype, making it easier to perform detailed, iterative cell type annotation workflows while managing API costs effectively.

Visualization Examples

Cell Type Annotation Visualization

Below is an example of publication-ready visualization created with mLLMCelltype and SCpubr, showing cell type annotations alongside uncertainty metrics (Consensus Proportion and Shannon Entropy):

Figure: Left panel shows cell type annotations on UMAP projection. Middle panel displays the consensus proportion using a yellow-green-blue gradient (deeper blue indicates stronger agreement among LLMs). Right panel shows Shannon entropy using an orange-red gradient (deeper red indicates lower uncertainty, lighter orange indicates higher uncertainty).

Marker Gene Visualization

mLLMCelltype now includes enhanced marker gene visualization functions that integrate seamlessly with the consensus annotation workflow:

```r
# Load required libraries
library(mLLMCelltype)
library(Seurat)
library(ggplot2)

# This example assumes pbmc_data (a clustered Seurat object), markers_df
# (its marker table) and top_markers (genes to plot) already exist

# After running consensus annotation
consensus_results <- interactive_consensus_annotation(
  input = markers_df,
  tissue_name = "human PBMC",
  models = c("anthropic/claude-3.5-sonnet", "openai/gpt-5"),
  api_keys = list(openrouter = "your_api_key")
)

# Create marker gene visualizations using Seurat
# Add consensus annotations to Seurat object
cluster_ids <- as.character(Idents(pbmc_data))
cell_type_annotations <- consensus_results$final_annotations[cluster_ids]

# Handle any missing annotations
if (any(is.na(cell_type_annotations))) {
  na_mask <- is.na(cell_type_annotations)
  cell_type_annotations[na_mask] <- paste("Cluster", cluster_ids[na_mask])
}

# Add to Seurat object
pbmc_data@meta.data$cell_type_consensus <- cell_type_annotations

# Create a dotplot of marker genes
DotPlot(pbmc_data, features = top_markers, group.by = "cell_type_consensus") +
  RotatedAxis()

# Create a heatmap of marker genes
DoHeatmap(pbmc_data, features = top_markers, group.by = "cell_type_consensus")
```

Key Features of Marker Gene Visualization:

DotPlot: Shows both percentage of cells expressing each gene (dot size) and average expression level (color intensity)
Heatmap: Displays scaled expression values with clustering of genes and cell types
Seamless Integration: Works directly with consensus annotation results added to Seurat objects
Standard Seurat Functions: Uses familiar Seurat visualization functions for consistency
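
For Python workflows, the step that attaches annotations to cells, including the fallback for clusters without an annotation, can be sketched in plain Python. `label_cells` is a hypothetical helper and the inputs are made-up; in a Scanpy workflow the returned list could be stored in a column of `adata.obs`, which is an assumption about usage and not something this document demonstrates.

```python
def label_cells(cell_clusters, final_annotations):
    """Map each cell's cluster ID to its annotation, falling back to 'Cluster <id>'."""
    labels = []
    for cluster in cell_clusters:
        key = str(cluster)  # accept both 0 and "0"
        labels.append(final_annotations.get(key) or f"Cluster {key}")
    return labels


# Made-up per-cell cluster assignments and consensus annotations
cell_clusters = ["0", "1", "2", 0]
final_annotations = {"0": "T cells", "1": "B cells"}
cell_labels = label_cells(cell_clusters, final_annotations)
print(cell_labels)
```

Keeping a visible placeholder such as "Cluster 2" makes unannotated clusters easy to spot in dot plots and heatmaps instead of silently dropping them.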

For detailed instructions and advanced customization options, see the Visualization Guide.

Citation

If you use mLLMCelltype in your research, please cite:

    @article{Yang2025.04.10.647852,
      author = {Yang, Chen and Zhang, Xianyang and Chen, Jun},
      title = {Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data},
      elocation-id = {2025.04.10.647852},
      year = {2025},
      doi = {10.1101/2025.04.10.647852},
      publisher = {Cold Spring Harbor Laboratory},
      URL = {https://www.biorxiv.org/content/early/2025/04/17/2025.04.10.647852},
      journal = {bioRxiv}
    }

You can also cite this in plain text format:

Yang, C., Zhang, X., & Chen, J. (2025). Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. bioRxiv. https://doi.org/10.1101/2025.04.10.647852

Contributing

We welcome and appreciate contributions from the community! There are many ways you can contribute to mLLMCelltype:

Reporting Issues

If you encounter any bugs, have feature requests, or have questions about using mLLMCelltype, please open an issue on our GitHub repository. When reporting bugs, please include:

A clear description of the problem
Steps to reproduce the issue
Expected vs. actual behavior
Your operating system and package version information
Any relevant code snippets or error messages

Pull Requests

We encourage you to contribute code improvements or new features through pull requests:

Fork the repository
Create a new branch for your feature (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Areas for Contribution

Here are some areas where contributions would be particularly valuable:

Adding support for new LLM models
Improving documentation and examples
Optimizing performance
Adding new visualization options
Extending functionality for specialized cell types or tissues
Translations of documentation into different languages

Code Style

Please follow the existing code style in the repository. For R code, we generally follow the tidyverse style guide. For Python code, we follow PEP 8.

Community

Join our Discord community to get real-time updates about mLLMCelltype, ask questions, share your experiences, or collaborate with other users and developers. This is a great place to connect with the team and other users working on single-cell RNA-seq analysis.

Thank you for helping improve mLLMCelltype!

Owner

Name: Caffery Yang
Login: cafferychen777
Kind: user

Website: www.cafferyyang.com
Twitter: CafferyYang
Repositories: 1
Profile: https://github.com/cafferychen777

Chen Yang is a junior at Southern Medical University majoring in biostatistics. In 2020-2021, he was awarded the National Scholarship from the Ministry of Educa

GitHub Events

Total

Fork event: 32
Create event: 6
Release event: 4
Issues event: 17
Watch event: 303
Delete event: 2
Issue comment event: 22
Public event: 1
Push event: 417
Gollum event: 4
Pull request review event: 1
Pull request review comment event: 2
Pull request event: 84

Last Year

Fork event: 32
Create event: 6
Release event: 4
Issues event: 17
Watch event: 303
Delete event: 2
Issue comment event: 22
Public event: 1
Push event: 417
Gollum event: 4
Pull request review event: 1
Pull request review comment event: 2
Pull request event: 84

Committers

Last synced: 9 months ago

All Time

Total Commits: 223
Total Committers: 1
Avg Commits per committer: 223.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 223
Committers: 1
Avg Commits per committer: 223.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Chen Yang	c**7@t**u	223

Committer Domains (Top 20 + Academic)

tamu.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 14
Total pull requests: 50
Average time to close issues: 2 days
Average time to close pull requests: less than a minute
Total issue authors: 12
Total pull request authors: 1
Average comments per issue: 2.0
Average comments per pull request: 0.0
Merged pull requests: 50
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 14
Pull requests: 50
Average time to close issues: 2 days
Average time to close pull requests: less than a minute
Issue authors: 12
Pull request authors: 1
Average comments per issue: 2.0
Average comments per pull request: 0.0
Merged pull requests: 50
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

Starlitnightly (2)
odxdld (2)
huhm123 (1)
baike687 (1)
jiadalidu (1)
Liripo (1)
hai178912522 (1)
luna2terra (1)
jiangli2941 (1)
wow-hub (1)
Jyyin333 (1)
cafferychen777 (1)

Pull Request Authors

cafferychen777 (96)

Top Labels

Issue Labels

bug (3) question (2) bug: API (2) bug: fixed (2) enhancement (1) bug: R (1) by design (1) not a bug - solution provided (1) feature: LLM integration (1) docs: API (1)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- pypi 392 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 20
Total maintainers: 1

proxy.golang.org: github.com/cafferychen777/mllmcelltype

Documentation: https://pkg.go.dev/github.com/cafferychen777/mllmcelltype#section-documentation
License: mit
Latest release: v1.2.9
published 8 months ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

proxy.golang.org: github.com/cafferychen777/mLLMCelltype

Documentation: https://pkg.go.dev/github.com/cafferychen777/mLLMCelltype#section-documentation
License: mit
Latest release: v1.2.9
published 8 months ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

pypi.org: mllmcelltype

A Python module for cell type annotation using various LLMs.

Homepage: https://github.com/cafferychen777/mLLMCelltype
Documentation: https://mllmcelltype.readthedocs.io/
License: MIT License
Latest release: 1.3.4
published 6 months ago

Versions: 10
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 392 Last month

Rankings

Dependent packages count: 9.3%

Average: 30.9%

Dependent repos count: 52.4%

Maintainers (1)

cafferychen777

Last synced: 6 months ago

mllmcelltype

Science Score: 59.0%
