mllmcelltype
🏆 #1 Multi-LLM consensus framework | 550+ stars | 95% accuracy | 10+ LLM providers | Leading cell annotation tool
Repository
Basic Info
- Host: GitHub
- Owner: cafferychen777
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://www.mllmcelltype.com/
- Size: 60.8 MB
Statistics
- Stars: 548
- Watchers: 20
- Forks: 47
- Open Issues: 9
- Releases: 5
mLLMCelltype: Multi-LLM Consensus Framework for Cell Type Annotation
mLLMCelltype is a multi-LLM consensus framework for automated cell type annotation in single-cell RNA sequencing (scRNA-seq) data. The framework integrates multiple large language models including OpenAI GPT-5/4.1, Anthropic Claude-4/3.7/3.5, Google Gemini-2.0, X.AI Grok-3, DeepSeek-V3, Alibaba Qwen2.5, Zhipu GLM-4, MiniMax, Stepfun, and OpenRouter to improve annotation accuracy through consensus-based predictions.
Key Advantages:
- Improved Accuracy: Achieves 95% annotation accuracy through multi-model consensus
- Reduced Bias: Multiple model integration minimizes individual model limitations
- Cost Efficiency: 70-80% API cost reduction through optimized consensus algorithms
- Uncertainty Quantification: Provides metrics for annotation confidence assessment
Abstract
mLLMCelltype is an open-source tool for single-cell transcriptomics analysis that uses multiple large language models to identify cell types from gene expression data. The software implements a consensus approach where multiple models analyze the same data and their predictions are combined, which helps reduce errors and provides uncertainty metrics. This methodology offers advantages over single-model approaches through integration of multiple model predictions. mLLMCelltype integrates with single-cell analysis platforms such as Scanpy and Seurat, allowing researchers to incorporate it into existing workflows. The method does not require reference datasets for annotation.
Comparison with Other Methods:
- Consensus-based approach: Multi-model consensus provides improved reliability compared to single-model systems
- Model Support: Compatible with 10+ LLM providers
- Performance: 95% accuracy in benchmark studies with uncertainty quantification
- Community Adoption: 540+ GitHub stars
Table of Contents
- News
- Key Features
- Recent Updates
- Directory Structure
- Installation
- Usage Examples
- Visualization Example
- Citation
- Contributing
News
Web Application Launch (2025-06-18)
We're excited to announce the launch of the mLLMCelltype Web Application. You can now use mLLMCelltype's cell type annotation capabilities directly in your web browser, with no installation required.
**Key Features:**
- Easy-to-use interface: Upload your scRNA-seq data and get annotations in minutes
- Multi-LLM consensus: Choose from various AI models including GPT-4, Claude, Gemini, and more
- Real-time processing: Monitor annotation progress with live updates
- Multiple export formats: Download results in CSV, TSV, Excel, or JSON formats
- No setup required: Start annotating immediately without installing packages
**Access the Web App**: https://mllmcelltype.com
**Beta Testing Phase**: The web application is currently in beta testing. We welcome your feedback and suggestions to help us improve the platform. Please report any issues or share your experience through our GitHub Issues or Discord community.
CRAN Release (2025-09-02)
mLLMCelltype is now available on CRAN. Install the package using:
R
install.packages("mLLMCelltype")
- CRAN page: https://CRAN.R-project.org/package=mLLMCelltype
- DOI: 10.32614/CRAN.package.mLLMCelltype
**Important: Gemini Model Migration (2025-06-02)**
Google has discontinued several Gemini 1.5 models and will discontinue more on September 24, 2025:
- Already discontinued: Gemini 1.5 Pro 001, Gemini 1.5 Flash 001
- Will be discontinued on Sept 24, 2025: Gemini 1.5 Pro 002, Gemini 1.5 Flash 002, Gemini 1.5 Flash-8B -001
Recommended migration: Use gemini-2.0-flash or gemini-2.0-flash-lite for better performance and continued support. The aliases gemini-1.5-pro and gemini-1.5-flash will continue to work until September 24, 2025, as they point to the -002 versions.
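As a rough illustration of the migration note, the sketch below flags the two soon-to-be-retired aliases in a model list and prints the suggested replacements. The dictionary and function names are hypothetical and not part of mLLMCelltype; only the model identifiers come from the note above.

```python
# Hypothetical helper, not part of the mLLMCelltype API.
# Only the two aliases named in the migration note are covered.
GEMINI_REPLACEMENTS = {
    "gemini-1.5-pro": ("gemini-2.0-flash", "gemini-2.0-flash-lite"),
    "gemini-1.5-flash": ("gemini-2.0-flash", "gemini-2.0-flash-lite"),
}


def find_deprecated_models(models):
    """Return {model: suggested replacements} for entries being retired."""
    return {m: GEMINI_REPLACEMENTS[m] for m in models if m in GEMINI_REPLACEMENTS}


flagged = find_deprecated_models(["gpt-5", "gemini-1.5-pro", "gemini-2.0-flash"])
for model, options in flagged.items():
    print(f"{model} is being retired; consider {' or '.join(options)}")
```

Running a check like this before an annotation job makes it easier to spot model lists that still rely on the old aliases.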
**Important: Claude Model Deprecation (2025-07-21)**
Anthropic will retire the following Claude models on July 21, 2025:
- Claude 2 (all versions)
- Claude 2.1
- Claude 3 Sonnet (non-versioned)
- Claude 3 Opus (non-versioned)
Recommended migration:
- For Claude 2/2.1: use claude-sonnet-4-20250514 or claude-3-5-sonnet-20241022
- For Claude 3 Sonnet: use claude-sonnet-4-20250514 or claude-3-7-sonnet-20250219
- For Claude 3 Opus: use claude-opus-4-20250514 or claude-3-opus-20240229
Please update your code before July 21, 2025 to avoid service disruption.
August 2025: mLLMCelltype has reached 540+ GitHub stars with growing community adoption. We thank all contributors and users who have supported this project.
Key Features
- Multi-LLM Consensus: Integrates predictions from multiple LLMs to reduce single-model limitations and biases
- Model Support: Compatible with 10+ LLM providers including OpenAI, Anthropic, Google, and others
- Accuracy: 95% accuracy validated through benchmarking on multiple datasets
- Cost Efficiency: 70-80% API cost reduction through consensus optimization
- Iterative Discussion: LLMs evaluate evidence and refine annotations through multiple rounds of discussion
- Uncertainty Quantification: Provides Consensus Proportion and Shannon Entropy metrics to identify uncertain annotations
- Error Reduction: Cross-model validation reduces incorrect predictions
- Noise Tolerance: Maintains accuracy with imperfect marker gene lists
- Hierarchical Annotation: Supports multi-resolution analysis with consistency checks
- Reference-Free: Performs annotation without pre-training or reference datasets
- Documentation: Records complete reasoning process for transparency
- Integration: Compatible with Scanpy/Seurat workflows and marker gene outputs
- Extensibility: Supports addition of new LLMs as they become available
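To make the two uncertainty metrics concrete, here is a minimal sketch of how a consensus proportion and a Shannon entropy could be computed from one cluster's per-model predictions. It assumes exact string matching between labels and a base-2 logarithm; the package's own definitions (label normalization, log base) may differ, and `consensus_metrics` is a hypothetical name.

```python
import math
from collections import Counter


def consensus_metrics(predictions):
    """Return (majority label, consensus proportion, Shannon entropy in bits).

    Assumption: labels are compared as exact strings.
    """
    counts = Counter(predictions)
    total = len(predictions)
    top_label, top_count = counts.most_common(1)[0]
    proportion = top_count / total
    # Entropy of the empirical label distribution: sum of p * log2(1 / p)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return top_label, proportion, entropy


# Four hypothetical model votes for a single cluster
label, proportion, entropy = consensus_metrics(
    ["T cell", "T cell", "T cell", "NK cell"]
)
print(label, proportion, round(entropy, 3))  # T cell 0.75 0.811
```

A proportion near 1 with entropy near 0 indicates strong agreement; low proportion and high entropy mark clusters worth manual review.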
Recent Updates
v1.2.3 (2025-05-10)
Bug Fixes
- Fixed error handling in consensus checking when API responses are NULL or invalid
- Improved error logging for OpenRouter API error responses
- Added robust NULL and type checking in check_consensus function
Improvements
- Enhanced error diagnostics for OpenRouter API errors
- Added detailed logging of API error messages and response structures
- Improved robustness when handling unexpected API response formats
v1.2.2 (2025-05-09)
Bug Fixes
- Fixed the 'non-character argument' error that occurred when processing API responses
- Added robust type checking for API responses across all model providers
- Improved error handling for unexpected API response formats
Improvements
- Added detailed error logging for API response issues
- Implemented consistent error handling patterns across all API processing functions
- Enhanced response validation to ensure proper structure before processing
v1.2.1 (2025-05-01)
Improvements
- Added support for OpenRouter API
- Added support for free models through OpenRouter
- Updated documentation with examples for using OpenRouter models
v1.2.0 (2025-04-30)
Features
- Added visualization functions for cell type annotation results
- Added support for uncertainty metrics visualization
- Implemented improved consensus building algorithm
v1.1.5 (2025-04-27)
Bug Fixes
- Fixed an issue with cluster index validation that caused errors when processing certain CSV input files
- Improved error handling for negative indices with clearer error messages
Improvements
- Added example script for CSV-based annotation workflow (cat_heart_annotation.R)
- Enhanced input validation with more detailed diagnostics
- Updated documentation to clarify CSV input format requirements
See NEWS.md for a complete changelog.
Directory Structure
- R/: R language interface and implementation
- python/: Python interface and implementation
Installation
R Version
```r
# Install from CRAN (recommended)
install.packages("mLLMCelltype")

# Or install development version from GitHub
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")
```
Python Version
Quick Start: You can try mLLMCelltype in Google Colab without installing anything; the project's interactive notebook provides examples and step-by-step guidance.
```bash
# Install from PyPI
pip install mllmcelltype

# Or install from GitHub (note the subdirectory parameter)
pip install git+https://github.com/cafferychen777/mLLMCelltype.git#subdirectory=python
```
Important Note on Dependencies
mLLMCelltype uses a modular design where different LLM provider libraries are optional dependencies. Depending on which models you plan to use, you'll need to install the corresponding packages:
```bash
# For using OpenAI models (GPT-5, etc.)
pip install "mllmcelltype[openai]"

# For using Anthropic models (Claude)
pip install "mllmcelltype[anthropic]"

# For using Google models (Gemini)
pip install "mllmcelltype[gemini]"

# To install all optional dependencies at once
pip install "mllmcelltype[all]"
```
If you encounter errors like ImportError: cannot import name 'genai' from 'google', it means you need to install the corresponding provider package. For example:
```bash
# For Google Gemini models
pip install google-genai
```
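Before starting a long annotation run, it can help to check which optional provider libraries are present. The sketch below uses only the standard library. The mapping from extras to import names is an assumption: `google.genai` is taken from the error message quoted above, while `openai` and `anthropic` are the usual import names of those SDKs. The helper names are hypothetical.

```python
import importlib.util

# Assumption: import names backing each optional extra.
PROVIDER_MODULES = {
    "openai": "openai",
    "anthropic": "anthropic",
    "gemini": "google.genai",
}


def is_installed(module_name):
    """Return True if the module can be located.

    For dotted names, find_spec imports the parent package first and raises
    ModuleNotFoundError when that parent is missing, so it is caught here.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ModuleNotFoundError, ValueError):
        return False


def missing_providers(providers):
    """Return the subset of providers whose library cannot be found."""
    return [p for p in providers if not is_installed(PROVIDER_MODULES[p])]


for provider in missing_providers(["openai", "anthropic", "gemini"]):
    print(f'Missing dependency for {provider}: pip install "mllmcelltype[{provider}]"')
```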
Supported Models
- OpenAI: GPT-5/GPT-4.1/GPT-4.5 (API Key)
- Anthropic: Claude-4-Opus/Claude-4-Sonnet/Claude-3.7-Sonnet/Claude-3.5-Haiku (API Key)
- Google: Gemini-2.0-Pro/Gemini-2.0-Flash (API Key)
- Alibaba: Qwen2.5-Max (API Key)
- DeepSeek: DeepSeek-V3/DeepSeek-R1 (API Key)
- Minimax: MiniMax-Text-01 (API Key)
- Stepfun: Step-2-16K (API Key)
- Zhipu: GLM-4 (API Key)
- X.AI: Grok-3/Grok-3-mini (API Key)
- OpenRouter: Access to multiple models through a single API (API Key)
  - Supports models from OpenAI, Anthropic, Meta, Google, Mistral, and more
  - Format: 'provider/model-name' (e.g., 'openai/gpt-5', 'anthropic/claude-3-opus')
  - Free models available with the :free suffix (e.g., 'microsoft/mai-ds-r1:free', 'deepseek/deepseek-chat:free')
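The OpenRouter identifier format described above can be illustrated with a small parser. This is only a sketch of the naming convention; `parse_openrouter_model` is a hypothetical helper and is not how mLLMCelltype itself processes model names.

```python
FREE_SUFFIX = ":free"


def parse_openrouter_model(model_id):
    """Split 'provider/model-name[:free]' into (provider, model_name, is_free)."""
    is_free = model_id.endswith(FREE_SUFFIX)
    base = model_id[: -len(FREE_SUFFIX)] if is_free else model_id
    provider, sep, name = base.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"Expected 'provider/model-name', got {model_id!r}")
    return provider, name, is_free


print(parse_openrouter_model("microsoft/mai-ds-r1:free"))  # ('microsoft', 'mai-ds-r1', True)
print(parse_openrouter_model("openai/gpt-5"))              # ('openai', 'gpt-5', False)
```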
Usage Examples
Python
```python
# Example of using mLLMCelltype for single-cell RNA-seq cell type annotation with Scanpy
import scanpy as sc
import pandas as pd
from mllmcelltype import annotate_clusters, interactive_consensus_annotation
import os

# Note: Logging is automatically configured when importing mllmcelltype
# You can customize logging if needed using the logging module

# Load your single-cell RNA-seq dataset in AnnData format
adata = sc.read_h5ad('your_data.h5ad')  # Replace with your scRNA-seq dataset path

# Perform Leiden clustering for cell population identification if not already done
if 'leiden' not in adata.obs.columns:
    print("Computing leiden clustering for cell population identification...")
    # Preprocess single-cell data: normalize counts and log-transform for gene expression analysis
    if 'log1p' not in adata.uns:
        sc.pp.normalize_total(adata, target_sum=1e4)  # Normalize to 10,000 counts per cell
        sc.pp.log1p(adata)  # Log-transform normalized counts

    # Dimensionality reduction: calculate PCA for scRNA-seq data
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)  # Select informative genes
        sc.pp.pca(adata, use_highly_variable=True)  # Compute principal components

    # Cell clustering: compute neighborhood graph and perform Leiden community detection
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)  # Build KNN graph for clustering
    sc.tl.leiden(adata, resolution=0.8)  # Identify cell populations using Leiden algorithm
    print(f"Leiden clustering completed, identified {len(adata.obs['leiden'].cat.categories)} distinct cell populations")

# Identify marker genes for each cell cluster using differential expression analysis
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')  # Wilcoxon rank-sum test for marker detection

# Extract top marker genes for each cell cluster to use in cell type annotation
marker_genes = {}
for i in range(len(adata.obs['leiden'].cat.categories)):
    # Select top 10 differentially expressed genes as markers for each cluster
    genes = [adata.uns['rank_genes_groups']['names'][str(i)][j] for j in range(10)]
    marker_genes[str(i)] = genes

# IMPORTANT: mLLMCelltype requires gene symbols (e.g., KCNJ8, PDGFRA) not Ensembl IDs (e.g., ENSG00000176771)
# If your AnnData object uses Ensembl IDs, convert them to gene symbols for accurate annotation:
# Example conversion code:
# if 'Gene' in adata.var.columns:  # Check if gene symbols are available in the metadata
#     gene_name_dict = dict(zip(adata.var_names, adata.var['Gene']))
#     marker_genes = {cluster: [gene_name_dict.get(gene_id, gene_id) for gene_id in genes]
#                     for cluster, genes in marker_genes.items()}

# IMPORTANT: mLLMCelltype requires numeric cluster IDs
# The 'cluster' column must contain numeric values or values that can be converted to numeric.
# Non-numeric cluster IDs (e.g., "cluster1", "Tcells", "7_0") may cause errors or unexpected behavior.
# If your data contains non-numeric cluster IDs, create a mapping between original IDs and numeric IDs:
# Example standardization code:
# original_ids = list(marker_genes.keys())
# id_mapping = {original: idx for idx, original in enumerate(original_ids)}
# marker_genes = {str(id_mapping[cluster]): genes for cluster, genes in marker_genes.items()}

# Configure API keys for the large language models used in consensus annotation
# At least one API key is required for multi-LLM consensus annotation
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # For GPT-5/4.1 models (recommended)
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"  # For Claude-4/3.7/3.5 models
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"  # For Google Gemini-2.5 models
os.environ["QWEN_API_KEY"] = "your-qwen-api-key"  # For Alibaba Qwen2.5 models

# Additional optional LLM providers for enhanced consensus diversity:
# os.environ["DEEPSEEK_API_KEY"] = "your-deepseek-api-key"  # For DeepSeek-V3 models
# os.environ["ZHIPU_API_KEY"] = "your-zhipu-api-key"  # For Zhipu GLM-4 models
# os.environ["STEPFUN_API_KEY"] = "your-stepfun-api-key"  # For Stepfun models
# os.environ["MINIMAX_API_KEY"] = "your-minimax-api-key"  # For MiniMax models
# os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"  # For accessing multiple models via OpenRouter

# Execute multi-LLM consensus cell type annotation with iterative deliberation
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,  # Dictionary of marker genes for each cluster
    species="human",  # Specify organism for appropriate cell type annotation
    tissue="blood",  # Specify tissue context for more accurate annotation
    models=["gpt-5", "claude-opus-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"],  # Multiple LLMs for consensus
    consensus_threshold=1,  # Minimum proportion required for consensus agreement
    max_discussion_rounds=3  # Number of deliberation rounds between models for refinement
)

# Alternatively, use OpenRouter for accessing multiple models through a single API
# This is especially useful for accessing free models with the :free suffix
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Example using free OpenRouter models (no credits required)
free_models_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="blood",
    models=[
        {"provider": "openrouter", "model": "meta-llama/llama-4-maverick:free"},  # Meta Llama 4 Maverick (free)
        {"provider": "openrouter", "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1:free"},  # NVIDIA Nemotron Ultra 253B (free)
        {"provider": "openrouter", "model": "deepseek/deepseek-chat-v3-0324:free"},  # DeepSeek Chat v3 (free)
        {"provider": "openrouter", "model": "microsoft/mai-ds-r1:free"}  # Microsoft MAI-DS-R1 (free)
    ],
    consensus_threshold=0.7,
    max_discussion_rounds=2
)

# Retrieve final consensus cell type annotations from the multi-LLM deliberation
final_annotations = consensus_results["consensus"]

# Integrate consensus cell type annotations into the original AnnData object
adata.obs['consensus_cell_type'] = adata.obs['leiden'].astype(str).map(final_annotations)

# Add uncertainty quantification metrics to evaluate annotation confidence
adata.obs['consensus_proportion'] = adata.obs['leiden'].astype(str).map(consensus_results["consensus_proportion"])  # Agreement level
adata.obs['entropy'] = adata.obs['leiden'].astype(str).map(consensus_results["entropy"])  # Annotation uncertainty

# Prepare for visualization: compute UMAP embeddings if not already available
# UMAP provides a 2D representation of cell populations for visualization
if 'X_umap' not in adata.obsm:
    print("Computing UMAP coordinates...")
    # Make sure neighbors are computed first
    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.umap(adata)
    print("UMAP coordinates computed")

# Visualize results with enhanced aesthetics
# Basic visualization
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='right', frameon=True, title='mLLMCelltype Consensus Annotations')

# More customized visualization
import matplotlib.pyplot as plt

# Set figure size and style
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 12

# Create a more publication-ready UMAP
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
sc.pl.umap(adata, color='consensus_cell_type', legend_loc='on data', frameon=True,
           title='mLLMCelltype Consensus Annotations', palette='tab20', size=50,
           legend_fontsize=12, legend_fontoutline=2, ax=ax)

# Visualize uncertainty metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
sc.pl.umap(adata, color='consensus_proportion', ax=ax1, title='Consensus Proportion',
           cmap='viridis', vmin=0, vmax=1, size=30)
sc.pl.umap(adata, color='entropy', ax=ax2, title='Annotation Uncertainty (Shannon Entropy)',
           cmap='magma', vmin=0, size=30)
plt.tight_layout()
```
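The numeric-cluster-ID requirement noted in the example above is easier to live with if you keep a reverse mapping, so annotations can be reported under the original cluster labels. The sketch below uses made-up marker lists and a placeholder `annotations` dictionary standing in for real annotation output; none of the variable names are part of the mLLMCelltype API.

```python
# Illustrative marker genes keyed by non-numeric cluster IDs
marker_genes = {
    "Tcells": ["CD3D", "CD3E", "IL7R"],
    "cluster1": ["CD19", "MS4A1", "CD79A"],
    "7_0": ["CD14", "LYZ", "CSF1R"],
}

# Forward mapping: original ID -> numeric string ID ("0", "1", ...)
id_mapping = {original: str(idx) for idx, original in enumerate(marker_genes)}
numeric_marker_genes = {id_mapping[cid]: genes for cid, genes in marker_genes.items()}

# Reverse mapping: numeric string ID -> original ID
reverse_mapping = {numeric: original for original, numeric in id_mapping.items()}

# Placeholder standing in for annotation results keyed by numeric ID
annotations = {"0": "T cells", "1": "B cells", "2": "Monocytes"}

# Report annotations under the original cluster labels
restored = {reverse_mapping[cid]: label for cid, label in annotations.items()}
print(restored)  # {'Tcells': 'T cells', 'cluster1': 'B cells', '7_0': 'Monocytes'}
```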
Using a Single Free OpenRouter Model
For users who prefer a simpler approach with just one model, the Microsoft MAI-DS-R1 free model via OpenRouter provides excellent results:
```python
import os
from mllmcelltype import annotate_clusters

# Note: Logging is automatically configured

# Set your OpenRouter API key
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Define marker genes for each cluster
marker_genes = {
    "0": ["CD3D", "CD3E", "CD3G", "CD2", "IL7R", "TCF7"],  # T cells
    "1": ["CD19", "MS4A1", "CD79A", "CD79B", "HLA-DRA", "CD74"],  # B cells
    "2": ["CD14", "LYZ", "CSF1R", "ITGAM", "CD68", "FCGR3A"]  # Monocytes
}

# Annotate using Microsoft MAI-DS-R1 free model
annotations = annotate_clusters(
    marker_genes=marker_genes,
    species='human',
    tissue='peripheral blood',
    provider='openrouter',
    model='microsoft/mai-ds-r1:free'  # Free model
)

# Print annotations
for cluster, annotation in annotations.items():
    print(f"Cluster {cluster}: {annotation}")
```
This approach is fast, accurate, and doesn't require any API credits, making it ideal for quick analyses or when you have limited API access.
Extracting Marker Genes from AnnData Objects
If you're using Scanpy with AnnData objects, you can easily extract marker genes directly from the rank_genes_groups results:
```python
import os
import scanpy as sc
from mllmcelltype import annotate_clusters

# Note: Logging is automatically configured

# Set your OpenRouter API key
os.environ["OPENROUTER_API_KEY"] = "your-openrouter-api-key"

# Load and preprocess your data
adata = sc.read_h5ad('your_data.h5ad')

# Perform preprocessing and clustering if not already done
# sc.pp.normalize_total(adata, target_sum=1e4)
# sc.pp.log1p(adata)
# sc.pp.highly_variable_genes(adata)
# sc.pp.pca(adata)
# sc.pp.neighbors(adata)
# sc.tl.leiden(adata)

# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Extract top marker genes for each cluster
marker_genes = {
    cluster: adata.uns['rank_genes_groups']['names'][cluster][:10].tolist()
    for cluster in adata.obs['leiden'].cat.categories
}

# Annotate using Microsoft MAI-DS-R1 free model
annotations = annotate_clusters(
    marker_genes=marker_genes,
    species='human',
    tissue='peripheral blood',  # adjust based on your tissue type
    provider='openrouter',
    model='microsoft/mai-ds-r1:free'  # Free model
)

# Add annotations to AnnData object
adata.obs['cell_type'] = adata.obs['leiden'].astype(str).map(annotations)

# Visualize results (requires UMAP coordinates, e.g. from sc.tl.umap(adata))
sc.pl.umap(adata, color='cell_type', legend_loc='on data', frameon=True,
           title='Cell Types Annotated by MAI-DS-R1')
```
This method automatically extracts the top differentially expressed genes for each cluster from the rank_genes_groups results, making it easy to integrate mLLMCelltype into your Scanpy workflow.
R
Note: For more detailed R tutorials and documentation, please visit the mLLMCelltype documentation website.
Using Seurat Object
```r
# Load required packages
library(mLLMCelltype)
library(Seurat)
library(dplyr)
library(ggplot2)
library(cowplot) # Added for plot_grid

# Load your preprocessed Seurat object
pbmc <- readRDS("your_seurat_object.rds")

# If starting with raw data, perform preprocessing steps
# pbmc <- NormalizeData(pbmc)
# pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# pbmc <- ScaleData(pbmc)
# pbmc <- RunPCA(pbmc)
# pbmc <- FindNeighbors(pbmc, dims = 1:10)
# pbmc <- FindClusters(pbmc, resolution = 0.5)
# pbmc <- RunUMAP(pbmc, dims = 1:10)

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Set up cache directory to speed up processing
cache_dir <- "./mllmcelltype_cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Choose a model from any supported provider
# Supported models include:
# - OpenAI: 'gpt-5', 'gpt-5-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
# - Anthropic: 'claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'
# - Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro', 'gemini-1.5-flash'
# - Qwen: 'qwen-max-2025-01-25'
# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'
# - Zhipu: 'glm-4-plus', 'glm-3-turbo'
# - MiniMax: 'minimax-text-01'
# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
# - OpenRouter: Access to models from multiple providers through a single API. Format: 'provider/model-name'
#   - OpenAI models: 'openai/gpt-5', 'openai/gpt-5-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
#   - Anthropic models: 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
#   - Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'

# Run mLLMCelltype annotation with multiple LLM models
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC", # provide tissue context
  models = c(
    "claude-opus-4-20250514", # Anthropic Claude 4 (latest)
    "gpt-5", # OpenAI
    "gemini-2.5-pro", # Google
    "qwen-max-2025-01-25" # Alibaba
  ),
  api_keys = list(
    anthropic = "your-anthropic-key",
    openai = "your-openai-key",
    gemini = "your-google-key",
    qwen = "your-qwen-key"
  ),
  top_gene_count = 10,
  controversy_threshold = 1.0,
  entropy_threshold = 1.0,
  cache_dir = cache_dir
)

# Print structure of results to understand the data
print("Available fields in consensus_results:")
print(names(consensus_results))

# Add annotations to Seurat object
# Get cell type annotations from consensus_results$final_annotations
cluster_to_cell_type_map <- consensus_results$final_annotations

# Create new cell type identifier column
cell_types <- as.character(Idents(pbmc))
for (cluster_id in names(cluster_to_cell_type_map)) {
  cell_types[cell_types == cluster_id] <- cluster_to_cell_type_map[[cluster_id]]
}

# Add cell type annotations to Seurat object
pbmc$cell_type <- cell_types

# Add uncertainty metrics
# Extract detailed consensus results containing metrics
consensus_details <- consensus_results$initial_results$consensus_results

# Create a data frame with metrics for each cluster
uncertainty_metrics <- data.frame(
  cluster_id = names(consensus_details),
  consensus_proportion = sapply(consensus_details, function(res) res$consensus_proportion),
  entropy = sapply(consensus_details, function(res) res$entropy)
)

# Add uncertainty metrics for each cell
# Note: seurat_clusters is a metadata column automatically created by FindClusters() function
# It contains the cluster ID assigned to each cell during clustering
# Here we use it to map cluster-level metrics (consensus_proportion and entropy) to individual cells

# If you don't have seurat_clusters column (e.g., if you used a different clustering method),
# you can use the active identity (Idents) or any other cluster assignment in your metadata:
# Option 1: Use active identity
# current_clusters <- as.character(Idents(pbmc))
# Option 2: Use another metadata column that contains cluster IDs
# current_clusters <- pbmc$your_cluster_column

# For this example, we use the standard seurat_clusters column:
current_clusters <- pbmc$seurat_clusters # Get cluster ID for each cell

# Match each cell's cluster ID with the corresponding metrics in uncertainty_metrics
pbmc$consensus_proportion <- uncertainty_metrics$consensus_proportion[match(current_clusters, uncertainty_metrics$cluster_id)]
pbmc$entropy <- uncertainty_metrics$entropy[match(current_clusters, uncertainty_metrics$cluster_id)]

# Save results for future use
saveRDS(consensus_results, "pbmc_mLLMCelltype_results.rds")
saveRDS(pbmc, "pbmc_annotated.rds")

# Visualize results with SCpubr for publication-ready plots
if (!requireNamespace("SCpubr", quietly = TRUE)) {
  remotes::install_github("enblacar/SCpubr")
}
library(SCpubr)
library(viridis) # For color palettes

# Basic UMAP visualization with default settings
pdf("pbmc_basic_annotations.pdf", width = 8, height = 6)
SCpubr::do_DimPlot(sample = pbmc,
                   group.by = "cell_type",
                   label = TRUE,
                   legend.position = "right") +
  ggtitle("mLLMCelltype Consensus Annotations")
dev.off()

# More customized visualization with enhanced styling
pdf("pbmc_custom_annotations.pdf", width = 8, height = 6)
SCpubr::do_DimPlot(sample = pbmc,
                   group.by = "cell_type",
                   label = TRUE,
                   label.box = TRUE,
                   legend.position = "right",
                   pt.size = 1.0,
                   border.size = 1,
                   font.size = 12) +
  ggtitle("mLLMCelltype Consensus Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
dev.off()

# Visualize uncertainty metrics with enhanced SCpubr plots
# Get cell types and create a named color palette
cell_types <- unique(pbmc$cell_type)
color_palette <- viridis::viridis(length(cell_types))
names(color_palette) <- cell_types

# Cell type annotations with SCpubr
p1 <- SCpubr::do_DimPlot(sample = pbmc,
                         group.by = "cell_type",
                         label = TRUE,
                         legend.position = "bottom", # Place legend at the bottom
                         pt.size = 1.0,
                         label.size = 4, # Smaller label font size
                         label.box = TRUE, # Add background box to labels for better readability
                         repel = TRUE, # Make labels repel each other to avoid overlap
                         colors.use = color_palette,
                         plot.title = "Cell Type") +
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        legend.text = element_text(size = 8),
        legend.key.size = unit(0.3, "cm"),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Consensus proportion feature plot with SCpubr
p2 <- SCpubr::do_FeaturePlot(sample = pbmc,
                             features = "consensus_proportion",
                             order = TRUE,
                             pt.size = 1.0,
                             enforce_symmetry = FALSE,
                             legend.title = "Consensus",
                             plot.title = "Consensus Proportion",
                             sequential.palette = "YlGnBu", # Yellow-Green-Blue gradient, following Nature Methods standards
                             sequential.direction = 1, # Light to dark direction
                             min.cutoff = min(pbmc$consensus_proportion), # Set minimum value
                             max.cutoff = max(pbmc$consensus_proportion), # Set maximum value
                             na.value = "lightgrey") + # Color for missing values
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Shannon entropy feature plot with SCpubr
p3 <- SCpubr::do_FeaturePlot(sample = pbmc,
                             features = "entropy",
                             order = TRUE,
                             pt.size = 1.0,
                             enforce_symmetry = FALSE,
                             legend.title = "Entropy",
                             plot.title = "Shannon Entropy",
                             sequential.palette = "OrRd", # Orange-Red gradient, following Nature Methods standards
                             sequential.direction = -1, # Dark to light direction (reversed)
                             min.cutoff = min(pbmc$entropy), # Set minimum value
                             max.cutoff = max(pbmc$entropy), # Set maximum value
                             na.value = "lightgrey") + # Color for missing values
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 15, t = 10)),
        plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm"))

# Combine plots with equal widths
pdf("pbmc_uncertainty_metrics.pdf", width = 18, height = 7)
combined_plot <- cowplot::plot_grid(p1, p2, p3, ncol = 3, rel_widths = c(1.2, 1.2, 1.2))
print(combined_plot)
dev.off()
```
Using CSV Input
You can also use mLLMCelltype with CSV files directly without Seurat, which is useful for cases where you already have marker genes available in CSV format:
```r
# Install the latest version of mLLMCelltype
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R", force = TRUE)

# Load necessary packages
library(mLLMCelltype)

# Configure unified logging (optional - uses defaults if not specified)
configure_logger(level = "INFO", console_output = TRUE, json_format = TRUE)

# Create cache directory
cache_dir <- "path/to/your/cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Read CSV file content
markers_file <- "path/to/your/markers.csv"
file_content <- readLines(markers_file)

# Skip header row
data_lines <- file_content[-1]

# Convert data to list format, using numeric indices as keys
marker_genes_list <- list()
cluster_names <- c()

# First collect all cluster names
for (line in data_lines) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  cluster_names <- c(cluster_names, parts[1])
}

# Then create marker_genes_list with numeric indices
for (i in seq_along(data_lines)) {
  line <- data_lines[i]
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]

  # First part is the cluster name
  cluster_name <- parts[1]

  # Use index as key (0-based index, compatible with Seurat)
  cluster_id <- as.character(i - 1)

  # Remaining parts are genes
  genes <- parts[-1]

  # Filter out NA and empty strings
  genes <- genes[!is.na(genes) & genes != ""]

  # Add to marker_genes_list
  marker_genes_list[[cluster_id]] <- list(genes = genes)
}

# Set API keys
api_keys <- list(
  gemini = "YOUR_GEMINI_API_KEY",
  qwen = "YOUR_QWEN_API_KEY",
  grok = "YOUR_GROK_API_KEY",
  openai = "YOUR_OPENAI_API_KEY",
  anthropic = "YOUR_ANTHROPIC_API_KEY"
)

# Run consensus annotation with paid models
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "your tissue type", # e.g., "human heart"
  models = c("gemini-2.5-pro",
             "gemini-2.5-flash",
             "qwen-max-2025-01-25",
             "grok-3-latest",
             "claude-sonnet-4-20250514",
             "gpt-5"),
  api_keys = api_keys,
  controversy_threshold = 0.6,
  entropy_threshold = 1.0,
  max_discussion_rounds = 3,
  cache_dir = cache_dir
)

# Alternatively, use free OpenRouter models (no credits required)
# Add OpenRouter API key to the api_keys list
api_keys$openrouter <- "your-openrouter-api-key"

# Run consensus annotation with free models
free_consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "your tissue type", # e.g., "human heart"
  models = c(
    "meta-llama/llama-4-maverick:free", # Meta Llama 4 Maverick (free)
    "nvidia/llama-3.1-nemotron-ultra-253b-v1:free", # NVIDIA Nemotron Ultra 253B (free)
    "deepseek/deepseek-chat-v3-0324:free", # DeepSeek Chat v3 (free)
    "microsoft/mai-ds-r1:free" # Microsoft MAI-DS-R1 (free)
  ),
  api_keys = api_keys,
  consensus_check_model = "deepseek/deepseek-chat-v3-0324:free", # Free model for consensus checking
  controversy_threshold = 0.6,
  entropy_threshold = 1.0,
  max_discussion_rounds = 2,
  cache_dir = cache_dir
)

# Save results
saveRDS(consensus_results, "your_results.rds")

# Print results summary
cat("\nResults summary:\n")
cat("Available fields:", paste(names(consensus_results), collapse = ", "), "\n\n")

# Print final annotations
cat("Final cell type annotations:\n")
for (cluster in names(consensus_results$final_annotations)) {
  cat(sprintf("%s: %s\n", cluster, consensus_results$final_annotations[[cluster]]))
}
```
Notes on CSV format:
- The CSV file should have values in the first column that will be used as indices (these can be cluster names, numbers like 0,1,2,3 or 1,2,3,4, etc.)
- The values in the first column are only used for reference and are not passed to the LLMs
- Subsequent columns should contain marker genes for each cluster
- An example CSV file for cat heart tissue is included in the package at inst/extdata/Cat_Heart_markers.csv
Example CSV structure:
cluster,gene
0,Negr1,Cask,Tshz2,Ston2,Fstl1,Dse,Celf2,Hmcn2,Setbp1,Cblb
1,Palld,Grb14,Mybpc3,Ensfcag00000044939,Dcun1d2,Acacb,Slco1c1,Ppp1r3c,Sema3c,Ppp1r14c
2,Adgrf5,Tbx1,Slco2b1,Pi15,Adam23,Bmx,Pde8b,Pkhd1l1,Dtx1,Ensfcag00000051556
3,Clec2d,Trat1,Rasgrp1,Card11,Cytip,Sytl3,Tmem156,Bcl11b,Lcp1,Lcp2
You can access the example data in your R script using:
```r
system.file("extdata", "Cat_Heart_markers.csv", package = "mLLMCelltype")
```
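For readers working in Python, the CSV layout described above can be parsed along the same lines as the R loop. The following is a minimal sketch, not part of the mLLMCelltype API: `parse_marker_csv` is a hypothetical helper, and it assumes a header row, the cluster label in the first column, and one marker gene per remaining column.

```python
import csv
import io


def parse_marker_csv(text):
    """Parse marker CSV text into ({"0": [genes], ...}, [first-column labels]).

    Keys are 0-based row indices as strings; the first column is kept only
    for reference, mirroring the notes on the CSV format.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    markers = {}
    names = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        names.append(row[0])
        # Drop empty cells left by ragged rows
        genes = [g.strip() for g in row[1:] if g.strip()]
        markers[str(len(markers))] = genes
    return markers, names


example = """cluster,gene
0,Negr1,Cask,Tshz2
1,Palld,Grb14,,Mybpc3
"""
markers, names = parse_marker_csv(example)
print(markers)
```

Keying by row position rather than by the first-column value keeps the cluster IDs 0-based even when the file labels clusters 1, 2, 3 or uses names.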
Using a Single LLM Model
If you only want to use a single LLM model instead of the consensus approach, use the annotate_cell_types() function. This is useful when you have access to only one API key or prefer a specific model:
```r
# Load required packages
library(mLLMCelltype)
library(Seurat)
library(ggplot2)  # provides ggtitle() for the final plot

# Load your preprocessed Seurat object
pbmc <- readRDS("your_seurat_object.rds")

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Choose a model from any supported provider
# Supported models include:
# - OpenAI: 'gpt-5', 'gpt-5-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
# - Anthropic: 'claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
# - DeepSeek: 'deepseek-chat', 'deepseek-reasoner'
# - Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro', 'gemini-1.5-flash'
# - Qwen: 'qwen-max-2025-01-25'
# - Stepfun: 'step-2-mini', 'step-2-16k', 'step-1-8k'
# - Zhipu: 'glm-4-plus', 'glm-3-turbo'
# - MiniMax: 'minimax-text-01'
# - Grok: 'grok-3', 'grok-3-latest', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
# - OpenRouter: Access to models from multiple providers through a single API. Format: 'provider/model-name'
#   - OpenAI models: 'openai/gpt-5', 'openai/gpt-5-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
#   - Anthropic models: 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
#   - Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
#   - Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
#   - Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
#   - Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'

# Run cell type annotation with a single LLM model
single_model_results <- annotate_cell_types(
  input = pbmc_markers,
  tissue_name = "human PBMC",          # provide tissue context
  model = "claude-opus-4-20250514",    # specify a single model (Claude 4 Opus)
  api_key = "your-anthropic-key",      # provide the API key directly
  top_gene_count = 10
)

# Using a free OpenRouter model
free_model_results <- annotate_cell_types(
  input = pbmc_markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free",  # free model with :free suffix
  api_key = "your-openrouter-key",
  top_gene_count = 10
)

# Print the results
print(single_model_results)

# Add annotations to Seurat object
# single_model_results is a character vector with one annotation per cluster
pbmc$cell_type <- plyr::mapvalues(
  x = as.character(Idents(pbmc)),
  from = as.character(0:(length(single_model_results) - 1)),
  to = single_model_results
)

# Visualize results
DimPlot(pbmc, group.by = "cell_type", label = TRUE) +
  ggtitle("Cell Types Annotated by Single LLM Model")
```
Comparing Different Models
You can also compare annotations from different models by running annotate_cell_types() multiple times with different models:
```r
# Define models to test
models_to_test <- c(
  "claude-sonnet-4-20250514",  # Anthropic
  "gpt-5",                     # OpenAI
  "gemini-2.5-pro",            # Google
  "qwen-max-2025-01-25"        # Alibaba
)

# API keys for different providers
api_keys <- list(
  anthropic = "your-anthropic-key",
  openai = "your-openai-key",
  gemini = "your-gemini-key",
  qwen = "your-qwen-key"
)

# Test each model and store results
results <- list()
for (model in models_to_test) {
  # Look up the provider for this model and pick the matching key
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]

  # Run annotation
  results[[model]] <- annotate_cell_types(
    input = pbmc_markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )

  # Add to Seurat object (one metadata column per model)
  column_name <- paste0("cell_type_", gsub("[^a-zA-Z0-9]", "_", model))
  pbmc[[column_name]] <- plyr::mapvalues(
    x = as.character(Idents(pbmc)),
    from = as.character(0:(length(results[[model]]) - 1)),
    to = results[[model]]
  )
}
```
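Once each model has returned its annotations, the comparison can be summarised per cluster. The sketch below is illustrative Python and not part of the mLLMCelltype API: `agreement_table` is a hypothetical helper, the model names are placeholders, and labels are matched by exact string after lower-casing, so synonyms such as "T lymphocyte" and "T cell" would count as disagreement.

```python
from collections import Counter


def agreement_table(results):
    """Summarise per-cluster agreement.

    results maps a model name to a list of annotations, one per cluster,
    with every list in the same cluster order.
    """
    lengths = {len(v) for v in results.values()}
    if len(lengths) != 1:
        raise ValueError("every model must return one annotation per cluster")
    table = []
    for i, labels in enumerate(zip(*results.values())):
        counts = Counter(label.strip().lower() for label in labels)
        top_label, top_count = counts.most_common(1)[0]
        table.append({
            "cluster": str(i),
            "majority": top_label,
            "agreement": top_count / len(labels),  # share of models backing the majority
            "unanimous": len(counts) == 1,
        })
    return table


results = {
    "model_a": ["T cell", "B cell", "Monocyte"],
    "model_b": ["T cell", "B cell", "NK cell"],
    "model_c": ["t cell", "Plasma cell", "Monocyte"],
}
table = agreement_table(results)
for row in table:
    print(row)
```

Clusters where `unanimous` is false are the natural candidates for a closer look or for the consensus workflow.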
Advanced Consensus Configuration: Specifying the Consensus Check Model
The consensus_check_model parameter (R) / consensus_model parameter (Python) allows you to specify which LLM model to use for consensus checking and discussion moderation. This parameter is critical for the accuracy of consensus annotation because the consensus check model:
- Evaluates semantic similarity between different cell type annotations
- Calculates consensus metrics (proportion and entropy)
- Moderates and synthesizes discussions between models for controversial clusters
- Makes final decisions when models disagree
**Important: We strongly recommend using the most capable models available for consensus checking, as this directly impacts annotation quality.**
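To make the two consensus metrics concrete, here is a minimal Python sketch of how a consensus proportion and a Shannon entropy could be computed from one cluster's labels. It rests on assumptions rather than on the package's internals: labels are compared by exact string match (the real consensus check relies on an LLM to judge semantic similarity), entropy uses log base 2, and a cluster is treated as controversial when the proportion falls below the controversy threshold or the entropy exceeds the entropy threshold. The function names are hypothetical.

```python
import math
from collections import Counter


def consensus_metrics(labels):
    """Return (consensus proportion, Shannon entropy in bits) for one cluster."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    proportion = max(probs)  # share of models behind the most common label
    entropy = sum(-p * math.log2(p) for p in probs)
    return proportion, entropy


def is_controversial(labels, controversy_threshold=0.7, entropy_threshold=1.0):
    """Assumed rule: low agreement OR high entropy flags the cluster."""
    proportion, entropy = consensus_metrics(labels)
    return proportion < controversy_threshold or entropy > entropy_threshold


unanimous = ["T cell", "T cell", "T cell", "T cell"]
mostly = ["T cell", "T cell", "T cell", "NK cell"]
mixed = ["T cell", "T cell", "NK cell", "B cell"]

for labels in (unanimous, mostly, mixed):
    print(consensus_metrics(labels), is_controversial(labels))
```

With four models, three matching labels give a proportion of 0.75 and stay under both thresholds, while a two-one-one split gives a proportion of 0.5 and an entropy of 1.5 bits, which would be flagged for discussion under the assumed rule.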
Recommended Models for Consensus Checking (Ranked by Performance)
Anthropic Claude Models (Highest recommendation)
- claude-opus-4-20250514 - Best overall performance (Claude 4 - latest release, June 27, 2025)
- claude-sonnet-4-20250514 - Excellent balance of performance and speed (Claude 4)
- claude-3-5-sonnet-20241022 - Good performance with faster response

OpenAI Models
- o1 / o1-pro - Advanced reasoning capabilities
- gpt-5 - Strong performance across various cell types
- gpt-4.1 - Latest GPT-4 variant

Google Gemini Models
- gemini-2.5-pro - Top-tier performance with enhanced reasoning
- gemini-2.5-flash - Excellent balance of performance and speed
- gemini-2.0-flash - Good performance with faster processing

Other High-Performance Models
- deepseek-r1 / deepseek-reasoner - Strong reasoning capabilities
- qwen-max-2025-01-25 - Excellent for scientific contexts
- grok-3-latest - Advanced language understanding
R Package Usage
```r
# Example 1: Using the best available model for consensus checking (Recommended)
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "human brain",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"),
  api_keys = api_keys,
  consensus_check_model = "claude-opus-4-20250514",  # Use the most capable model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# Example 2: Using a high-performance model when Claude Opus is not available
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "mouse liver",
  models = c("gpt-5", "gemini-2.5-pro", "qwen-max-2025-01-25"),
  api_keys = api_keys,
  consensus_check_model = "claude-sonnet-4-20250514",  # High-performance Claude 4 model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# Example 3: Using OpenAI's reasoning model for complex cases
consensus_results <- interactive_consensus_annotation(
  input = marker_genes_list,
  tissue_name = "human immune cells",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  consensus_check_model = "o1",  # OpenAI's advanced reasoning model
  controversy_threshold = 0.7,
  entropy_threshold = 1.0
)

# NOT RECOMMENDED: Avoid using less capable or free models for consensus checking
# as this may significantly reduce annotation accuracy
```
Python Package Usage
```python
from mllmcelltype import interactive_consensus_annotation

# marker_genes is assumed to be defined already (cluster ID -> list of marker genes)

# Example 1: Using the best available model for consensus checking (Recommended)
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="brain",
    models=["gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro", "qwen-max-2025-01-25"],
    consensus_model="claude-opus-4-20250514",  # Use the most capable model
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 2: Using dictionary format with a high-performance model
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="mouse",
    tissue="liver",
    models=["gpt-5", "gemini-2.5-pro", "qwen-max-2025-01-25"],
    consensus_model={"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 3: Using Google's latest model for consensus
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="heart",
    models=["gpt-5", "claude-sonnet-4-20250514", "qwen-max-2025-01-25"],
    consensus_model={"provider": "google", "model": "gemini-2.5-pro"},
    consensus_threshold=0.7,
    entropy_threshold=1.0
)

# Example 4: Default behavior (uses Qwen with high-performance fallback)
# If consensus_model is not specified, it defaults to qwen-max-2025-01-25 (a high-performance model)
consensus_results = interactive_consensus_annotation(
    marker_genes=marker_genes,
    species="human",
    tissue="blood",
    models=["gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"],
    consensus_threshold=0.7,
    entropy_threshold=1.0
)
```
Best Practices for Consensus Model Selection
Prioritize Accuracy Over Cost: The consensus check model plays a crucial role in determining final annotations. Using a less capable model here can compromise the entire annotation process.
Model Availability: Ensure you have API access to your chosen consensus model. The system will use fallback models if the primary choice is unavailable.
Consistency: Use the same high-performance model for all consensus checks within a project to ensure consistent evaluation criteria.
Complex Tissues: For challenging tissues (e.g., brain, immune system), consider using the most advanced models like Claude Opus, O1, or Gemini 2.5 Pro.
Default Behavior:
- R: Uses the first model in the models list if not specified
- Python: Defaults to qwen-max-2025-01-25 (a high-performance model) with claude-3-5-sonnet-latest as fallback
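The fallback idea can be pictured as walking an ordered list of candidates and taking the first one that is usable. The Python sketch below is only an illustration: `pick_consensus_model` is a hypothetical helper, and it assumes that "unavailable" means no API key is configured for the provider, which the text above does not spell out.

```python
def pick_consensus_model(candidates, api_keys):
    """Return the first (provider, model) pair whose provider has a non-empty key.

    candidates is an ordered list of (provider, model) tuples, preferred first.
    """
    for provider, model in candidates:
        if api_keys.get(provider):
            return provider, model
    raise RuntimeError("no candidate consensus model has an API key configured")


# Ordered as in the Python default described above: Qwen first, Claude as fallback
candidates = [
    ("qwen", "qwen-max-2025-01-25"),
    ("anthropic", "claude-3-5-sonnet-latest"),
]
api_keys = {"anthropic": "your-anthropic-key"}  # no Qwen key in this example

choice = pick_consensus_model(candidates, api_keys)
print(choice)
```

Because only an Anthropic key is present here, the sketch skips the Qwen default and selects the Claude fallback.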
Why Model Quality Matters for Consensus Checking
The consensus check model must:
- Accurately assess semantic similarity between different cell type names (e.g., recognizing that "T lymphocyte" and "T cell" refer to the same cell type)
- Understand biological context and hierarchical relationships
- Synthesize discussions from multiple models to reach accurate conclusions
- Provide reliable confidence metrics for downstream analysis

Using a less capable model for these critical tasks can lead to:
- Misidentification of controversial clusters
- Incorrect consensus calculations
- Poor resolution of disagreements between models
- Ultimately, less accurate cell type annotations
Advanced Features: Cluster Selection and Cache Control (v1.3.1)
mLLMCelltype v1.3.1 introduces two powerful parameters that give you fine-grained control over the annotation process:
1. clusters_to_analyze - Selective Cluster Analysis
This parameter allows you to specify exactly which clusters to analyze without manually filtering your input data:
```r
# Example: Focus on specific clusters for T cell subtyping
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC - T cell subtypes",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  clusters_to_analyze = c(0, 1, 7),  # Only analyze T cell clusters
  controversy_threshold = 0.7
)

# Example: Re-analyze controversial clusters with different context
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "activated immune cells",
  models = c("gpt-5", "claude-sonnet-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  clusters_to_analyze = c("3", "5"),  # Focus on specific clusters
  cache_dir = "consensus_cache"
)
```
Benefits:
- No need to subset your data manually
- Maintains original cluster numbering
- Reduces API calls and costs by only analyzing relevant clusters
- Perfect for iterative refinement of specific cell populations
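Conceptually, selecting clusters amounts to filtering the marker input while leaving the cluster IDs untouched. The Python sketch below illustrates that idea only: `select_clusters` is a hypothetical helper, not the package's implementation, and it assumes marker genes are stored in a dict keyed by cluster ID strings. Numeric and string IDs are both accepted, as in the examples above.

```python
def select_clusters(marker_genes, clusters_to_analyze=None):
    """Keep only the requested clusters, preserving their original IDs.

    marker_genes maps cluster ID strings to lists of marker genes.
    clusters_to_analyze may mix ints and strings; None keeps everything.
    """
    if clusters_to_analyze is None:
        return dict(marker_genes)
    wanted = [str(c) for c in clusters_to_analyze]
    missing = [c for c in wanted if c not in marker_genes]
    if missing:
        raise KeyError(f"unknown cluster IDs: {missing}")
    return {c: marker_genes[c] for c in wanted}


marker_genes = {
    "0": ["CD3D", "CD3E"],
    "1": ["CD8A", "CD8B"],
    "2": ["MS4A1", "CD79A"],
    "7": ["IL7R", "CCR7"],
}
subset = select_clusters(marker_genes, [0, "7"])
print(list(subset))
```

Because the keys are carried over unchanged, results for the subset can be merged back into the full annotation without renumbering.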
2. force_rerun - Bypass Cache for Fresh Analysis
This parameter forces re-analysis of controversial clusters, bypassing cached results:
```r
# Example: Initial broad analysis
initial_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human brain",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  use_cache = TRUE
)

# Example: Re-analyze with specific subtype context
subtype_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human brain - neuronal subtypes",
  models = c("gpt-5", "claude-sonnet-4-20250514"),
  api_keys = api_keys,
  clusters_to_analyze = c(2, 3, 5),  # Neuronal clusters
  force_rerun = TRUE,                # Force fresh analysis despite cache
  use_cache = TRUE                   # Still benefit from cache for non-controversial clusters
)
```
Important Notes:
- force_rerun only affects controversial clusters requiring LLM discussion
- Non-controversial clusters still use cache for performance
- Useful when changing tissue context or focusing on subtypes
- Combines well with clusters_to_analyze for targeted re-analysis
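The interaction between the cache and force_rerun described in these notes can be sketched as a small decision function. This is illustrative Python under stated assumptions, not the package's code: `annotate_cluster` and `compute` are hypothetical names, the cache is a plain dict keyed by cluster ID, and whether a cluster is controversial is taken as a given input.

```python
def annotate_cluster(cluster_id, controversial, cache, compute,
                     use_cache=True, force_rerun=False):
    """Return (annotation, source), where source is "cache" or "fresh".

    force_rerun bypasses the cache only for controversial clusters;
    non-controversial clusters keep using cached results.
    """
    bypass = force_rerun and controversial
    if use_cache and not bypass and cluster_id in cache:
        return cache[cluster_id], "cache"
    annotation = compute(cluster_id)
    if use_cache:
        cache[cluster_id] = annotation  # store the fresh result for later runs
    return annotation, "fresh"


cache = {"2": "Neuron", "3": "Astrocyte"}
controversial_ids = {"3"}


def compute(cluster_id):
    # Stand-in for a fresh round of LLM annotation or discussion
    return f"re-annotated cluster {cluster_id}"


outcomes = {
    cid: annotate_cluster(cid, cid in controversial_ids, cache, compute, force_rerun=True)
    for cid in ["2", "3"]
}
print(outcomes)
```

In this example cluster 2 is served from the cache, while the controversial cluster 3 is recomputed and its cache entry is overwritten.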
Common Use Cases
- Iterative Subtyping Workflow:

```r
# Step 1: General cell type annotation
general_types <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human PBMC",
  models = models,
  api_keys = api_keys
)

# Step 2: Focus on T cells with subtype context
t_cell_subtypes <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human T lymphocytes",
  models = models,
  api_keys = api_keys,
  clusters_to_analyze = c(0, 1, 4, 7),  # T cell clusters from step 1
  force_rerun = TRUE                    # Fresh analysis with T cell context
)

# Step 3: Further refine CD8+ T cells
cd8_subtypes <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human CD8+ T cells - activation states",
  models = models,
  api_keys = api_keys,
  clusters_to_analyze = c(1, 4),  # CD8+ clusters
  force_rerun = TRUE
)
```

- Cost-Effective Re-analysis:

```r
# Only re-analyze clusters that were controversial
controversial <- initial_results$controversial_clusters

refined_results <- interactive_consensus_annotation(
  input = data,
  tissue_name = "human PBMC - refined",
  models = c("gpt-5", "claude-opus-4-20250514", "gemini-2.5-pro"),
  api_keys = api_keys,
  clusters_to_analyze = controversial,  # Only controversial ones
  force_rerun = TRUE,
  consensus_check_model = "claude-opus-4-20250514"  # Use best model
)
```
These features significantly enhance the flexibility and efficiency of mLLMCelltype, making it easier to perform detailed, iterative cell type annotation workflows while managing API costs effectively.
Visualization Examples
Cell Type Annotation Visualization
Below is an example of a publication-ready visualization created with mLLMCelltype and SCpubr, showing cell type annotations alongside uncertainty metrics (Consensus Proportion and Shannon Entropy):
Figure: Left panel shows cell type annotations on UMAP projection. Middle panel displays the consensus proportion using a yellow-green-blue gradient (deeper blue indicates stronger agreement among LLMs). Right panel shows Shannon entropy using an orange-red gradient (deeper red indicates lower uncertainty, lighter orange indicates higher uncertainty).
Marker Gene Visualization
mLLMCelltype now includes enhanced marker gene visualization functions that integrate seamlessly with the consensus annotation workflow:
```r
# Load required libraries
library(mLLMCelltype)
library(Seurat)
library(ggplot2)

# After running consensus annotation
consensus_results <- interactive_consensus_annotation(
  input = markers_df,
  tissue_name = "human PBMC",
  models = c("anthropic/claude-3.5-sonnet", "openai/gpt-5"),
  api_keys = list(openrouter = "your_api_key")
)

# Create marker gene visualizations using Seurat
# Add consensus annotations to Seurat object
cluster_ids <- as.character(Idents(pbmc_data))
# unlist() yields a named character vector, so clusters without an annotation become NA
cell_type_annotations <- unname(unlist(consensus_results$final_annotations)[cluster_ids])

# Handle any missing annotations
if (any(is.na(cell_type_annotations))) {
  na_mask <- is.na(cell_type_annotations)
  cell_type_annotations[na_mask] <- paste("Cluster", cluster_ids[na_mask])
}

# Add to Seurat object
pbmc_data@meta.data$cell_type_consensus <- cell_type_annotations

# Create a dotplot of marker genes
# top_markers is assumed to be a character vector of marker gene names defined earlier
DotPlot(pbmc_data, features = top_markers, group.by = "cell_type_consensus") + RotatedAxis()

# Create a heatmap of marker genes
DoHeatmap(pbmc_data, features = top_markers, group.by = "cell_type_consensus")
```
Key Features of Marker Gene Visualization:
- DotPlot: Shows both percentage of cells expressing each gene (dot size) and average expression level (color intensity)
- Heatmap: Displays scaled expression values with clustering of genes and cell types
- Seamless Integration: Works directly with consensus annotation results added to Seurat objects
- Standard Seurat Functions: Uses familiar Seurat visualization functions for consistency
For detailed instructions and advanced customization options, see the Visualization Guide.
Citation
If you use mLLMCelltype in your research, please cite:
```bibtex
@article{Yang2025.04.10.647852,
  author = {Yang, Chen and Zhang, Xianyang and Chen, Jun},
  title = {Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data},
  elocation-id = {2025.04.10.647852},
  year = {2025},
  doi = {10.1101/2025.04.10.647852},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/04/17/2025.04.10.647852},
  journal = {bioRxiv}
}
```
You can also cite this in plain text format:
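Yang, C., Zhang, X., & Chen, J. (2025). Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. bioRxiv. https://doi.org/10.1101/2025.04.10.647852

Read our full research paper on bioRxiv: https://www.biorxiv.org/content/early/2025/04/17/2025.04.10.647852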
Contributing
We welcome and appreciate contributions from the community! There are many ways you can contribute to mLLMCelltype:
Reporting Issues
If you encounter any bugs, have feature requests, or have questions about using mLLMCelltype, please open an issue on our GitHub repository. When reporting bugs, please include:
- A clear description of the problem
- Steps to reproduce the issue
- Expected vs. actual behavior
- Your operating system and package version information
- Any relevant code snippets or error messages
Pull Requests
We encourage you to contribute code improvements or new features through pull requests:
- Fork the repository
- Create a new branch for your feature (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Areas for Contribution
Here are some areas where contributions would be particularly valuable:
- Adding support for new LLM models
- Improving documentation and examples
- Optimizing performance
- Adding new visualization options
- Extending functionality for specialized cell types or tissues
- Translations of documentation into different languages
Code Style
Please follow the existing code style in the repository. For R code, we generally follow the tidyverse style guide. For Python code, we follow PEP 8.
Community
Join our Discord community to get real-time updates about mLLMCelltype, ask questions, share your experiences, or collaborate with other users and developers. This is a great place to connect with the team and other users working on single-cell RNA-seq analysis.
Thank you for helping improve mLLMCelltype!
Owner
- Name: Caffery Yang
- Login: cafferychen777
- Kind: user
- Website: www.cafferyyang.com
- Twitter: CafferyYang
- Repositories: 1
- Profile: https://github.com/cafferychen777
Chen Yang is a junior at Southern Medical University majoring in biostatistics. In 2020-2021, he was awarded the National Scholarship from the Ministry of Educa
GitHub Events
Total
- Fork event: 32
- Create event: 6
- Release event: 4
- Issues event: 17
- Watch event: 303
- Delete event: 2
- Issue comment event: 22
- Public event: 1
- Push event: 417
- Gollum event: 4
- Pull request review event: 1
- Pull request review comment event: 2
- Pull request event: 84
Last Year
- Fork event: 32
- Create event: 6
- Release event: 4
- Issues event: 17
- Watch event: 303
- Delete event: 2
- Issue comment event: 22
- Public event: 1
- Push event: 417
- Gollum event: 4
- Pull request review event: 1
- Pull request review comment event: 2
- Pull request event: 84
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Chen Yang | c****7@t****u | 223 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 14
- Total pull requests: 50
- Average time to close issues: 2 days
- Average time to close pull requests: less than a minute
- Total issue authors: 12
- Total pull request authors: 1
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 50
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 14
- Pull requests: 50
- Average time to close issues: 2 days
- Average time to close pull requests: less than a minute
- Issue authors: 12
- Pull request authors: 1
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 50
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Starlitnightly (2)
- odxdld (2)
- huhm123 (1)
- baike687 (1)
- jiadalidu (1)
- Liripo (1)
- hai178912522 (1)
- luna2terra (1)
- jiangli2941 (1)
- wow-hub (1)
- Jyyin333 (1)
- cafferychen777 (1)
Pull Request Authors
- cafferychen777 (96)
Packages
- Total packages: 3
- Total downloads: pypi 392 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 20
- Total maintainers: 1
proxy.golang.org: github.com/cafferychen777/mllmcelltype
- Documentation: https://pkg.go.dev/github.com/cafferychen777/mllmcelltype#section-documentation
- License: mit
- Latest release: v1.2.9, published 8 months ago
proxy.golang.org: github.com/cafferychen777/mLLMCelltype
- Documentation: https://pkg.go.dev/github.com/cafferychen777/mLLMCelltype#section-documentation
- License: mit
- Latest release: v1.2.9, published 8 months ago
pypi.org: mllmcelltype
A Python module for cell type annotation using various LLMs.
- Homepage: https://github.com/cafferychen777/mLLMCelltype
- Documentation: https://mllmcelltype.readthedocs.io/
- License: MIT License
- Latest release: 1.3.4, published 6 months ago