https://github.com/buaadreamer/minicpm-v

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone


Repository

Basic Info

Host: GitHub
Owner: BUAADreamer
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 323 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of OpenBMB/MiniCPM-o

Created over 1 year ago · Last pushed over 1 year ago

https://github.com/BUAADreamer/MiniCPM-V/blob/main/



 

**A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone**

[Chinese](./README_zh.md) | English

WeChat | Discord

MiniCPM-o 2.6 | MiniCPM-V 2.6 | Technical Blog [English/Chinese]




**MiniCPM-o** is the latest series of end-side multimodal LLMs (MLLMs) upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. Since February 2024, we have released 6 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include:

- **MiniCPM-o 2.6**: The latest and most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. It also advances MiniCPM-V 2.6's visual capabilities such as **strong OCR capability, trustworthy behavior, multilingual support, and video understanding**. Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad.

- **MiniCPM-V 2.6**: The most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single-image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and can for the first time support real-time video understanding on iPad.



## News 

####  Pinned

* [2025.01.24]  MiniCPM-o 2.6 technical report is released! See [here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9).

* [2025.01.23]  MiniCPM-o 2.6 is now supported by [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!

* [2025.01.19]  **ATTENTION!** We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md), [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), and [vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues**.

* [2025.01.19]  MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!

* [2025.01.17] We have updated the usage of MiniCPM-o 2.6 int4 quantization version and resolved the model initialization error. Click [here](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and try it now!

* [2025.01.13]  We open-source MiniCPM-o 2.6, which matches GPT-4o-202405 on vision, speech and multimodal live streaming. It advances popular capabilities of MiniCPM-V 2.6, and supports various new fun features. Try it now!

* [2024.08.17]  MiniCPM-V 2.6 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf).

* [2024.08.06]  We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now!

* [2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See [here](https://arxiv.org/abs/2408.01800).

* [2024.05.23]  MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio's official account, is available [here](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5). Come and try it out!




 
Click to view more news.

* [2024.08.15] We now also support multi-image SFT. For more details, please refer to the [document](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune).
* [2024.08.14] MiniCPM-V 2.6 now also supports [fine-tuning](https://github.com/modelscope/ms-swift/issues/1613) with the SWIFT framework!
* [2024.08.10]  MiniCPM-Llama3-V 2.5 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf).

* [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm).

* [2024.06.03] You can now run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. For more details, check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md).
* [2024.05.28]  MiniCPM-Llama3-V 2.5 now fully supports its feature in llama.cpp and ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned!

* [2024.05.28]  We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics).

* [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)!
* [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides smooth decoding at 6-8 tokens/s on mobile phones. Try it now!

* [2024.05.23]  We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency. Click [here](./docs/compare_with_phi-3_vision.md) to view more details.

* [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability and supports 30+ languages, representing the first end-side MLLM achieving GPT-4V level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now!
* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details.
* [2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)!
* [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now!
* [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2.md) with the SWIFT framework!
* [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog.
* [2024.03.14] MiniCPM-V now supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v.md) with the SWIFT framework. Thanks to [Jintao](https://github.com/Jintao-Huang) for the contribution.
* [2024.03.01] MiniCPM-V now can be deployed on Mac!
* [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.
 


## Contents 


- [MiniCPM-o 2.6](#minicpm-o-26)
- [MiniCPM-V 2.6](#minicpm-v-26)
- [Chat with Our Demo on Gradio ](#chat-with-our-demo-on-gradio-)
- [Inference](#inference)
  - [Model Zoo](#model-zoo)
  - [Multi-turn Conversation](#multi-turn-conversation)
    - [Chat with Multiple Images](#chat-with-multiple-images)
    - [In-context Few-shot Learning](#in-context-few-shot-learning)
    - [Chat with Video](#chat-with-video)
    - [Speech Conversation](#speech-conversation)
      - [Mimick](#mimick)
      - [General Speech Conversation with Configurable Voices](#general-speech-conversation-with-configurable-voices)
      - [Speech Conversation as an AI Assistant](#speech-conversation-as-an-ai-assistant)
      - [Instruction-to-Speech](#instruction-to-speech)
      - [Voice Cloning](#voice-cloning)
      - [Addressing Various Audio Understanding Tasks](#addressing-various-audio-understanding-tasks)
    - [Multimodal Live Streaming](#multimodal-live-streaming)
  - [Inference on Multiple GPUs](#inference-on-multiple-gpus)
  - [Inference on Mac](#inference-on-mac)
  - [Efficient Inference with llama.cpp, ollama, vLLM](#efficient-inference-with-llamacpp-ollama-vllm)
- [Fine-tuning](#fine-tuning)
- [FAQs](#faqs)
- [Limitations](#limitations)


## MiniCPM-o 2.6

**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

-  **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

-  **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.

-  **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

-  **Strong OCR Capability and Others.**
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.


-  **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.

-    **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained in a fully end-to-end manner with only CE loss.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configurations at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
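
The time-division multiplexing idea described above can be illustrated with a small sketch. This is not the model's implementation: the `Chunk` type, the `tdm_interleave` function, and the one-second slice length are hypothetical names and values chosen only to show how parallel streams can be flattened into one sequence, slice by slice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for one encoded piece of a modality stream."""
    modality: str   # must match its key in the streams dict, e.g. "video"
    t_start: float  # start time in seconds
    payload: str    # placeholder for encoded features


def tdm_interleave(streams: Dict[str, List[Chunk]], slice_s: float) -> List[Chunk]:
    """Flatten parallel streams into one sequence ordered by time slice.

    Within a slice, chunks follow the modality order of the dict,
    then their own start time.
    """
    order = {name: i for i, name in enumerate(streams)}
    chunks = [c for cs in streams.values() for c in cs]
    return sorted(
        chunks,
        key=lambda c: (int(c.t_start // slice_s), order[c.modality], c.t_start),
    )


streams = {
    "video": [Chunk("video", 0.0, "v0"), Chunk("video", 1.0, "v1")],
    "audio": [Chunk("audio", 0.0, "a0"), Chunk("audio", 0.5, "a1"),
              Chunk("audio", 1.0, "a2"), Chunk("audio", 1.5, "a3")],
}
sequence = [c.payload for c in tdm_interleave(streams, slice_s=1.0)]
print(sequence)  # ['v0', 'a0', 'a1', 'v1', 'a2', 'a3']
```

Each one-second slice contributes its video chunk followed by its audio chunks, so a sequential backbone sees all modalities of a slice before moving on to the next one.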






### Evaluation  


  



Click to view visual understanding results.

**Image Understanding**



    
        
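| Model | Size | Token Density⁺ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| VITA-1.5 | 8B | 784 | 63.3 | 741 | 66.2 | - | 52.7 | 60.2 | 2328.1 | 76.8 | 79.2 | 52.6 | 44.6 | - | - | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |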
        
    


* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
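
As a rough consistency check on the OpenCompass column, the sketch below averages eight of the benchmark columns from the table. The choice of these eight benchmarks and the division of OCRBench by 10 (to put it on a 0-100 scale) are assumptions, not something stated in this README; they happen to reproduce the reported averages for both MiniCPM rows.

```python
# Scores copied from the image-understanding table.
# Assumption: the OpenCompass average is the plain mean of these eight
# benchmarks, with OCRBench (scored out of 1000) divided by 10.
ROWS = {
    "MiniCPM-o 2.6": {
        "MMB1.1 test": 80.5, "MMStar": 64.0, "MMMU val": 50.4,
        "MathVista mini": 71.9, "HallusionBench": 51.9, "AI2D": 85.8,
        "OCRBench": 897, "MMVet": 67.5,
    },
    "MiniCPM-V 2.6": {
        "MMB1.1 test": 78.0, "MMStar": 57.5, "MMMU val": 49.8,
        "MathVista mini": 60.6, "HallusionBench": 48.1, "AI2D": 82.1,
        "OCRBench": 852, "MMVet": 60.0,
    },
}


def opencompass_average(row: dict) -> float:
    """Mean over the benchmarks in `row`, rescaling OCRBench to 0-100."""
    scaled = [v / 10 if name == "OCRBench" else v for name, v in row.items()]
    return sum(scaled) / len(scaled)


for model, row in ROWS.items():
    print(f"{model}: {opencompass_average(row):.1f}")
# MiniCPM-o 2.6: 70.2
# MiniCPM-V 2.6: 65.2
```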


⁺ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
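
The token density definition can be checked by hand for the MiniCPM rows. The sketch below uses the figures quoted in this README (a 1344x1344 image encoded into 640 visual tokens); the function name `token_density` is just an illustrative helper.

```python
def token_density(width: int, height: int, visual_tokens: int) -> float:
    """Pixels encoded per visual token at the given resolution."""
    return width * height / visual_tokens


# 1344 x 1344 = 1,806,336 pixels (about 1.8M), encoded into 640 tokens.
density = token_density(1344, 1344, 640)
print(density)  # 2822.4, reported as 2822 in the table
```

A higher value means fewer visual tokens for the same image, which is why the README ties token density to inference speed, first-token latency, and memory usage.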


**Multi-image and Video Understanding**


 

    
        
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
| --- | --- | --- | --- | --- | --- |
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| VITA-1.5 | 8B | 45.0 | - | - | 56.1/58.7 |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
        
    



* We evaluate officially released checkpoints by ourselves.





Click to view audio understanding and speech conversation results.

**Audio Understanding**



    
        
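| Model | Size | AISHELL-1 (ASR zh, CER) | Fleurs zh (ASR zh, CER) | WenetSpeech test-net (ASR zh, CER) | LibriSpeech test-clean (ASR en, WER) | GigaSpeech (ASR en, WER) | TED-LIUM (ASR en, WER) | CoVoST en2zh (AST, BLEU) | CoVoST zh2en (AST, BLEU) | MELD emotion (Emotion, ACC) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| VITA-1.5 | 8B | 2.16 | - | 8.4 | 3.4 | - | - | - | - | - |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |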
        
    


* We evaluate officially released checkpoints by ourselves.
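
For readers unfamiliar with the ASR metrics in the table, the sketch below shows the usual way word error rate (WER) and character error rate (CER) are computed from an edit distance. It is a minimal illustration, assuming the table reports rates as percentages; it is not the evaluation code used for these results, and the helper names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return 100 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
print(cer("abcd", "abed"))  # 25.0
```

Lower is better for both metrics, whereas BLEU and ACC in the same table are higher-is-better.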



**Speech Generation**



    
        
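Task: SpeechQA

| Model | Size | Speech Llama Q. (ACC) | Speech Web Q. (ACC) | Speech Trivia QA (ACC) | Speech AlpacaEval (G-Eval, 10 point) | AudioArena (Semantic ELO score) | AudioArena (Acoustic ELO score) | AudioArena (Overall ELO score) | AudioArena (UTMOS) | AudioArena (ASR-WER) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| VITA-1.5 | 8B | 46.7 | 28.1 | 23.3 | 2.0 | - | - | - | - | - |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |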
        
    


All results are from AudioEvals; the evaluation methods and further details can also be found there.



**End-to-end Voice Cloning**



    
        
Task: Voice cloning

| Model | Seed-TTS test-zh (SIMO) | Seed-TTS test-en (SIMO) |
| --- | --- | --- |
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
        
    






Click to view multimodal live streaming results.
  
**Multimodal Live Streaming**: results on StreamingBench


    
        
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
| --- | --- | --- | --- | --- | --- |
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
        
    





### Examples 

We deploy MiniCPM-o 2.6 on end devices. The demo videos are raw-speed recordings from an iPad Pro and the web demo.


  






  
  
  



## MiniCPM-V 2.6


Click to view more details of MiniCPM-V 2.6

**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

-  **Leading Performance.**
  MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.

-  **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.

-  **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.

-  **Strong OCR Capability and Others.**
  MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.


-  **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.

-    **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).

### Evaluation  

    



Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. 



    
        
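| Model | Size | Token Density⁺ | OpenCompass | MME | MMVet | OCRBench | MMMU val | MathVista mini | MMB1.1 test | AI2D | TextVQA val | DocVQA test | HallusionBench | Object HalBench |
|:------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Proprietary** | | | | | | | | | | | | | | |
| GPT-4o | - | 1088 | 69.9 | 2328.7 | 69.1 | 736 | 69.2 | 61.3 | 82.2 | 84.6 | - | 92.8 | 55.0 | 17.6 |
| Claude 3.5 Sonnet | - | 750 | 67.9 | 1920.0 | 66.0 | 788 | 65.9 | 61.6 | 78.5 | 80.2 | - | 95.2 | 49.9 | 13.8 |
| Gemini 1.5 Pro | - | - | 64.4 | 2110.6 | 64.0 | 754 | 60.6 | 57.7 | 73.9 | 79.1 | 73.5 | 86.5 | 45.6 | - |
| GPT-4o mini | - | 1088 | 64.1 | 2003.4 | 66.9 | 785 | 60.0 | 52.4 | 76.0 | 77.8 | - | - | 46.1 | 12.4 |
| GPT-4V | - | 1088 | 63.5 | 2070.2 | 67.5 | 656 | 61.7 | 54.7 | 79.8 | 78.6 | 78.0 | 87.2 | 43.9 | 14.2 |
| Step-1V | - | - | 59.5 | 2206.4 | 63.3 | 625 | 49.9 | 44.8 | 78.0 | 79.2 | 71.6 | - | 48.4 | - |
| Qwen-VL-Max | - | 784 | 58.3 | 2281.7 | 61.8 | 684 | 52.0 | 43.4 | 74.6 | 75.7 | 79.5 | 93.1 | 41.2 | 13.4 |
| **Open-source** | | | | | | | | | | | | | | |
| LLaVA-NeXT-Yi-34B | 34B | 157 | 55.0 | 2006.5 | 50.7 | 574 | 48.8 | 40.4 | 77.8 | 78.9 | 69.3 | - | 34.8 | 12.6 |
| Mini-Gemini-HD-34B | 34B | 157 | - | 2141.0 | 59.3 | 518 | 48.0 | 43.3 | - | 80.5 | 74.1 | 78.9 | - | - |
| Cambrian-34B | 34B | 1820 | 58.3 | 2049.9 | 53.2 | 591 | 50.4 | 50.3 | 77.8 | 79.5 | 76.7 | 75.5 | 41.6 | 14.7 |
| GLM-4V-9B | 13B | 784 | 59.1 | 2018.8 | 58.0 | 776 | 46.9 | 51.1 | 67.9 | 71.2 | - | - | 45.0 | - |
| InternVL2-8B | 8B | 706 | 64.1 | 2215.1 | 54.3 | 794 | 51.2 | 58.3 | 79.4 | 83.6 | 77.4 | 91.6 | 45.0 | 21.3 |
| MiniCPM-Llama-V 2.5 | 8B | 1882 | 58.8 | 2024.6 | 52.8 | 725 | 45.8 | 54.3 | 72.0 | 78.4 | 76.6 | 84.8 | 42.4 | 10.3 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 2348.4\* | 60.0 | 852\* | 49.8\* | 60.6 | 78.0 | 82.1 | 80.1 | 90.8 | 48.1\* | 8.2 |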



\* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.

⁺ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
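
As a worked example of the token density definition, the short sketch below recomputes the MiniCPM-V 2.6 figure from numbers quoted in this section (a 1344x1344 image encoded into 640 visual tokens). The function name `token_density` is illustrative, and the baseline token count is only what the "75% fewer" statement implies arithmetically, not a measured value for any specific model.

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Pixels encoded per visual token: (# pixels) / (# visual tokens)."""
    return (width * height) / num_visual_tokens

# Figures quoted in this section: a 1344x1344 image encoded into 640 tokens.
pixels = 1344 * 1344
density = token_density(1344, 1344, 640)
print(pixels)          # 1806336 pixels, i.e. about 1.8M
print(round(density))  # 2822, the value listed in the table

# "75% fewer tokens" implies a baseline of 640 / (1 - 0.75) tokens for the same image.
baseline_tokens = 640 / (1 - 0.75)
print(baseline_tokens)  # 2560.0
```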





Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.

 

    
        
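| Model | Size | Mantis Eval | BLINK val | Mathverse mv | Sciverse mv | MIRB |
|:------|:----:|:----:|:----:|:----:|:----:|:----:|
| **Proprietary** | | | | | | |
| GPT-4V | - | 62.7 | 54.6 | 60.3 | 66.9 | 53.1 |
| **Open-source** | | | | | | |
| LLaVA-NeXT-Interleave-14B | 14B | 66.4 | 52.6 | 32.7 | 30.2 | - |
| Emu2-Chat | 37B | 37.8 | 36.2 | - | 27.2 | - |
| CogVLM | 17B | 45.2 | 41.1 | - | - | - |
| VPG-C | 7B | 52.4 | 43.1 | 24.3 | 23.1 | - |
| VILA 8B | 8B | 51.2 | 39.3 | - | 36.5 | - |
| InternLM-XComposer-2.5 | 8B | 53.1\* | 48.9 | 32.1\* | - | 42.5 |
| InternVL2-8B | 8B | 59.0\* | 50.9 | 30.5\* | 34.4\* | 56.9\* |
| MiniCPM-V 2.6 | 8B | 69.1 | 53.0 | 84.9 | 74.9 | 53.8 |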



\* We evaluate the officially released checkpoint by ourselves.



Click to view video results on Video-MME and Video-ChatGPT.


    
        
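| Model | Size | Video-MME (w/o subs) | Video-MME (w subs) | Video-ChatGPT Correctness | Video-ChatGPT Detail | Video-ChatGPT Context | Video-ChatGPT Temporal | Video-ChatGPT Consistency |
|:------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Proprietary** | | | | | | | | |
| Claude 3.5 Sonnet | - | 60.0 | 62.9 | - | - | - | - | - |
| GPT-4V | - | 59.9 | 63.3 | - | - | - | - | - |
| **Open-source** | | | | | | | | |
| LLaVA-NeXT-7B | 7B | - | - | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| LLaVA-NeXT-34B | 34B | - | - | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| CogVLM2-Video | 12B | - | - | 3.49 | 3.46 | 3.23 | 2.98 | 3.64 |
| LongVA | 7B | 52.4 | 54.3 | 3.05 | 3.09 | 3.77 | 2.44 | 3.64 |
| InternVL2-8B | 8B | 54.0 | 56.9 | - | - | - | - | - |
| InternLM-XComposer-2.5 | 8B | 55.8 | - | - | - | - | - | - |
| LLaVA-NeXT-Video | 32B | 60.2 | 63.0 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| MiniCPM-V 2.6 | 8B | 60.9 | 63.6 | 3.59 | 3.28 | 3.93 | 2.73 | 3.62 |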






Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.


    
        
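| Model | Size | Shot | TextVQA val | VizWiz test-dev | VQAv2 test-dev | OK-VQA val |
|:------|:----:|:----:|:----:|:----:|:----:|:----:|
| Flamingo | 80B | 0\* | 35.0 | 31.6 | 56.3 | 40.6 |
| | | 4 | 36.5 | 39.6 | 63.1 | 57.4 |
| | | 8 | 37.3 | 44.8 | 65.6 | 57.5 |
| IDEFICS | 80B | 0\* | 30.9 | 36.0 | 60.0 | 45.2 |
| | | 4 | 34.3 | 40.4 | 63.6 | 52.4 |
| | | 8 | 35.7 | 46.1 | 64.8 | 55.1 |
| OmniCorpus | 7B | 0\* | 43.0 | 49.8 | 63.2 | 45.5 |
| | | 4 | 45.4 | 51.3 | 64.5 | 46.5 |
| | | 8 | 45.6 | 52.2 | 64.7 | 46.6 |
| Emu2 | 37B | 0 | 26.4 | 40.4 | 33.5 | 26.7 |
| | | 4 | 48.2 | 54.6 | 67.0 | 53.2 |
| | | 8 | 49.3 | 54.7 | 67.8 | 54.1 |
| MM1 | 30B | 0 | 26.2 | 40.4 | 48.9 | 26.7 |
| | | 8 | 49.3 | 54.7 | 70.9 | 54.1 |
| MiniCPM-V 2.6⁺ | 8B | 0 | 43.9 | 33.8 | 45.4 | 23.9 |
| | | 4 | 63.6 | 60.5 | 65.5 | 50.1 |
| | | 8 | 64.6 | 63.4 | 68.2 | 51.4 |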




\* denotes zero image shot and two additional text shots following Flamingo.

⁺ We evaluate the pretraining ckpt without SFT.


### Examples 


  
  
  
  
  


  
    
    
  


We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw, unedited screen recording on an iPad Pro.


      
          
      
    
 
    
 


      
          
      
    
 
    
 


      
      
    

    




## Legacy Models  

| Model                | Introduction and Guidance       |
|:----------------------|:-------------------:|
| MiniCPM-Llama3-V 2.5  | [Document](./docs/minicpm_llama3_v2dot5.md)   | 
| MiniCPM-V 2.0  | [Document](./docs/minicpm_v2.md)   | 
| MiniCPM-V 1.0  | [Document](./docs/minicpm_v1.md)   | 
| OmniLMM-12B  | [Document](./docs/omnilmm_en.md)   |  


## Chat with Our Demo on Gradio 

We provide online and local demos powered by Hugging Face Gradio, a widely used framework for building model demos. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.


### Online Demo  

Click here to try out the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn/) | [MiniCPM-V 2.6](http://120.92.209.146:8887/) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).

### Local WebUI Demo  
  
You can easily build your own local WebUI demo using the following commands.

Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues.

If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, add `self.minicpmo_model.tts.float()` during model initialization.
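
Because the demos depend on a pinned `transformers` release, it can help to confirm the installed version before launching them. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are illustrative and not part of this repository.

```python
from importlib import metadata

REQUIRED_TRANSFORMERS = "4.44.2"  # version pinned in this README


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, required: str = REQUIRED_TRANSFORMERS) -> bool:
    """True only when the installed version equals the pinned one exactly."""
    return installed == required


found = installed_version("transformers")
if matches_pin(found):
    print("transformers version OK:", found)
else:
    print(f"expected transformers=={REQUIRED_TRANSFORMERS}, found {found}")
```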

**For real-time voice/video call demo:**
1. Launch the model server:
```shell
pip install -r requirements_o2.6.txt

python web_demos/minicpm-o_2.6/model_server.py
```

2. Launch the web server:

```shell
# Make sure Node and PNPM are installed.
sudo apt-get update
sudo apt-get install nodejs npm
npm install -g pnpm


cd web_demos/minicpm-o_2.6/web_server
# create ssl cert for https, https is required to request camera and microphone permissions.
bash ./make_ssl_cert.sh  # output key.pem and cert.pem

pnpm install  # install requirements
pnpm run dev  # start server
```
Open `https://localhost:8088/` in your browser and enjoy the real-time voice/video call.

**For chatbot demo:**
```shell
pip install -r requirements_o2.6.txt

python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
```
Open `http://localhost:8000/` in your browser and enjoy the vision mode chatbot.

## Inference


### Model Zoo

| Model           | Device | Memory    |          Description       | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-o 2.6| GPU | 18 GB  | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-o-2_6) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
| MiniCPM-o 2.6 gguf | CPU | 8 GB  | The gguf version, lower memory usage and faster inference.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
| MiniCPM-o 2.6 int4 | GPU | 9 GB  | The int4 quantized version, lower GPU memory usage.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
| MiniCPM-V 2.6| GPU | 17 GB  | Strong end-side multimodal performance for single image, multi-image and video understanding.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
| MiniCPM-V 2.6 gguf | CPU | 6 GB  | The gguf version, lower memory usage and faster inference.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
| MiniCPM-V 2.6 int4 | GPU | 7 GB  | The int4 quantized version, lower GPU memory usage.   |  [Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) / [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
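
To pick a variant for a given machine, the Device and Memory columns can be treated as a simple lookup. The sketch below copies those figures from the table above; treating them as a minimum requirement is an assumption (actual usage also depends on input size and generation length), and `candidates` is a hypothetical helper, not part of this repository.

```python
# (name, device, approximate memory in GB), copied from the Model Zoo table above.
MODEL_ZOO = [
    ("MiniCPM-o 2.6", "GPU", 18),
    ("MiniCPM-o 2.6 gguf", "CPU", 8),
    ("MiniCPM-o 2.6 int4", "GPU", 9),
    ("MiniCPM-V 2.6", "GPU", 17),
    ("MiniCPM-V 2.6 gguf", "CPU", 6),
    ("MiniCPM-V 2.6 int4", "GPU", 7),
]


def candidates(device: str, available_gb: float):
    """List the variants for `device` whose listed memory fits in `available_gb`."""
    return [name for name, dev, mem in MODEL_ZOO
            if dev == device and mem <= available_gb]


print(candidates("GPU", 12))  # ['MiniCPM-o 2.6 int4', 'MiniCPM-V 2.6 int4']
```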

### Multi-turn Conversation

Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues. We are investigating this issue.
```shell
pip install -r requirements_o2.6.txt
```

Run the following code for a multi-turn conversation.






```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```

You will get the following output:

```
"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."

"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
```
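
The history-passing pattern used above (append the assistant's answer, then the next user question) can be wrapped in a small helper. This is an illustrative sketch only: `add_turn` is a hypothetical name rather than part of the MiniCPM API, and plain strings stand in for the PIL image and the model's answer so that the snippet runs without loading the model.

```python
def add_turn(msgs, answer, next_question):
    """Append the assistant's last answer and the next user question in place."""
    msgs.append({"role": "assistant", "content": [answer]})
    msgs.append({"role": "user", "content": [next_question]})
    return msgs


# Stand-in values; in real use `image` is a PIL image and the answer comes from model.chat.
image = "<PIL image placeholder>"
msgs = [{"role": "user", "content": [image, "What is the landform in the picture?"]}]
add_turn(msgs, "A karst mountain range.", "What should I pay attention to when traveling here?")
print([m["role"] for m in msgs])  # ['user', 'assistant', 'user']
```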

#### Chat with Multiple Images

 Click to view Python code running MiniCPM-o 2.6 with multiple images input. 
  
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```


#### In-context Few-shot Learning

 Click to view Python code running MiniCPM-o 2.6 with few-shot input. 

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
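
The few-shot message list above follows a regular pattern: each example contributes a user turn (image plus question) and an assistant turn (answer), and the test image comes last. A minimal sketch of a builder for that pattern is shown below; `build_few_shot_msgs` is a hypothetical helper, and strings stand in for PIL images so it runs without any image files.

```python
def build_few_shot_msgs(examples, question, test_image):
    """Build the alternating user/assistant turns followed by the test query."""
    msgs = []
    for image, answer in examples:
        msgs.append({"role": "user", "content": [image, question]})
        msgs.append({"role": "assistant", "content": [answer]})
    msgs.append({"role": "user", "content": [test_image, question]})
    return msgs


# Placeholders for Image.open(...).convert('RGB') results.
examples = [("<example1.jpg>", "2023.08.04"), ("<example2.jpg>", "2007.04.24")]
msgs = build_few_shot_msgs(examples, "production date", "<test.jpg>")
print(len(msgs))  # 5
```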


#### Chat with Video

 Click to view Python code running MiniCPM-o 2.6 with video input. 

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
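
To see which frames `encode_video` ends up selecting, the index logic can be run on its own without `decord`. The sketch below reproduces the same sampling steps on plain integers; `select_frame_indices` is an illustrative wrapper name, and the 300-frame, 30 fps clip is a made-up example.

```python
def uniform_sample(items, n):
    """Pick n items spaced evenly across the list (midpoint of each segment)."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]


def select_frame_indices(num_frames, avg_fps, max_num_frames):
    """Take one frame per second, then thin out uniformly if there are too many."""
    step = round(avg_fps)
    frame_idx = list(range(0, num_frames, step))
    if len(frame_idx) > max_num_frames:
        frame_idx = uniform_sample(frame_idx, max_num_frames)
    return frame_idx


# A hypothetical 10-second clip at 30 fps (300 frames), capped at 4 frames.
print(select_frame_indices(300, 30.0, 4))   # [30, 90, 180, 240]
# With a cap of 64 the one-frame-per-second list is kept as is.
print(select_frame_indices(300, 30.0, 64))
```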



#### Speech Conversation
  Model initialization 

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```



##### Mimick

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.

 Click here to demonstrate the capability of end-to-end audio understanding and generation. 

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output.wav', # save the tts result to output_audio_path
)
```



##### General Speech Conversation with Configurable Voices

A general usage scenario of MiniCPM-o 2.6 is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, MiniCPM-o 2.6 sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.

 Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.

```python
ref_audio, _ = librosa.load('./assets/voice_01.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

# round two
# list.append modifies msgs in place and returns None, so do not reassign its result
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```



##### Speech Conversation as an AI Assistant

An enhanced feature of MiniCPM-o 2.6 is to act as an AI assistant, though only with a limited choice of voices. In this mode, MiniCPM-o 2.6 is **less human-like and more like a voice assistant**, but it follows instructions more closely.

 Click to view the Python code for enabling MiniCPM-o 2.6 to act as an AI assistant.

```python
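# `ref_audio` is the reference voice, loaded with librosa.load(..., sr=16000, mono=True) as in the previous example.
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')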
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

# round two
# list.append modifies msgs in place and returns None, so do not reassign its result
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```



##### Instruction-to-Speech

MiniCPM-o 2.6 can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.


 Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. 

```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```


##### Voice Cloning

MiniCPM-o 2.6 can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.


 Click to show Python code running MiniCPM-o 2.6 with voice cloning. 

```python
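# `ref_audio` is the reference voice, loaded with librosa.load(..., sr=16000, mono=True) as in the examples above.
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}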

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)

```


##### Addressing Various Audio Understanding Tasks

MiniCPM-o 2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.


 Click to show Python code running MiniCPM-o 2.6 with specific audioQA task. 

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): ``
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts.
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
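
Since only the prompt changes between these audio understanding tasks, the prompts listed above can be kept in a lookup table. The sketch below is a hypothetical convenience wrapper (the dictionary keys and `build_audio_msgs` are made-up names); it covers only the English prompts given in this section and uses a list of zeros as a stand-in for the array returned by `librosa.load`.

```python
AUDIO_TASK_PROMPTS = {
    "asr_en": "Please listen to the audio snippet carefully and transcribe the content.",
    "speaker_analysis": "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    "audio_caption": "Summarize the main content of the audio.",
    "scene_tagging": "Utilize one keyword to convey the audio's content or the associated scene.",
}


def build_audio_msgs(task, audio_input):
    """Return the msgs list for an audio-to-text task, in the format used above."""
    prompt = AUDIO_TASK_PROMPTS[task] + "\n"
    return [{"role": "user", "content": [prompt, audio_input]}]


audio_input = [0.0] * 16000  # stand-in for one second of 16 kHz mono audio
msgs = build_audio_msgs("audio_caption", audio_input)
print(msgs[0]["content"][0])
```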






#### Multimodal Live Streaming

 Click to view Python code running MiniCPM-o 2.6 with chat inference. 

```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf
import torch
from transformers import AutoModel, AutoTokenizer

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)
    
    # 1 frame + 1s audio chunk
    contents= []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["", image, audio])
        else:
            contents.append(["", image, audio])
    
    return contents


model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()

# If you are using an older version of PyTorch, you might encounter the error "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'. In that case, convert the TTS module to float32:
# model.tts.float()

# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4
video_path="assets/Skiing.mp4"
sys_msg = model.get_sys_prompt(mode='omni', language='en')
# if use voice clone prompt, please set ref_audio
# ref_audio_path = '/path/to/ref_audio'
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]

# please set generate_audio=True and output_audio_path to save the tts result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # please set omni_input=True when omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
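
The "1 frame + 1 s audio chunk" pairing in `get_video_chunk_content` relies on slicing the waveform into one-second windows. The sketch below isolates that slicing on a plain Python list so it runs without `moviepy` or `librosa`. Note one deliberate difference, which is an assumption of this sketch: the number of units is derived from the audio length rather than from `video.duration`. `split_into_units` is a hypothetical name.

```python
import math


def split_into_units(samples, sr=16000):
    """Split audio samples into consecutive one-second chunks; the last may be shorter."""
    num_units = math.ceil(len(samples) / sr)
    return [samples[sr * i: sr * (i + 1)] for i in range(num_units)]


samples = [0.0] * 40000  # 2.5 seconds of silence at 16 kHz
chunks = split_into_units(samples)
print([len(c) for c in chunks])  # [16000, 16000, 8000]
```

The same slicing works on NumPy arrays, which is what `librosa.load` returns in the example above.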



 Click to view Python code running MiniCPM-o 2.6 with streaming inference. 

Note: The streaming inference has a slight performance degradation because the audio encoding is not global.
```python
# A new conversation must reset the session first; this also resets the KV cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg], 
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role":"user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs, 
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt
        
    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```
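
The collection loop at the end of the streaming example can be tried without a model. The following is a minimal sketch under stated assumptions: `StreamChunk` and `fake_stream` are hypothetical stand-ins for the streamed results, the sampling rate of 24000 is a made-up value, and plain lists replace NumPy arrays so the snippet runs with the standard library only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StreamChunk:
    """Hypothetical stand-in for one streamed result: partial text plus a slice of waveform."""
    text: str
    audio_wav: List[float]
    sampling_rate: int


def collect_stream(chunks: Iterable[StreamChunk]) -> Tuple[str, List[float], int]:
    """Join partial texts and concatenate waveform slices in arrival order."""
    text_parts = []
    waveform = []
    sampling_rate = 0
    for chunk in chunks:
        text_parts.append(chunk.text)
        waveform.extend(chunk.audio_wav)
        sampling_rate = chunk.sampling_rate
    return "".join(text_parts), waveform, sampling_rate


def fake_stream():
    """Generator that mimics a model yielding results piece by piece."""
    yield StreamChunk("Hello", [0.0, 0.1], 24000)
    yield StreamChunk(", world", [0.2], 24000)


text, waveform, sampling_rate = collect_stream(fake_stream())
print(text, len(waveform), sampling_rate)
```

Consuming the generator incrementally like this is what allows audio to be written or played as soon as each piece arrives, rather than after the full response is finished.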



### Inference on Multiple GPUs
You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this [tutorial](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for detailed instructions on loading the model and running inference across multiple low-VRAM GPUs.
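
One way to picture distributing layers across GPUs is as a mapping from layer names to GPU indices. The sketch below builds such a mapping in plain Python. The helper name `build_device_map`, the `llm.model.layers` prefix, and the layer count of 32 are illustrative assumptions, not values taken from the model or from the linked tutorial, which remains the authoritative procedure.

```python
def build_device_map(num_layers, num_gpus, prefix="llm.model.layers"):
    """Assign consecutive decoder layers to GPUs in near-equal contiguous blocks."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    base, extra = divmod(num_layers, num_gpus)
    device_map = {}
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            device_map[f"{prefix}.{layer}"] = gpu
            layer += 1
    return device_map


# Hypothetical example: 32 layers spread over 3 GPUs.
device_map = build_device_map(num_layers=32, num_gpus=3)
print(device_map["llm.model.layers.0"], device_map["llm.model.layers.31"])
```

Keeping each GPU's layers contiguous means activations cross a device boundary only once per GPU. A real mapping would also need entries for the modules outside the decoder stack, which this sketch leaves out.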


### Inference on Mac

Click to view an example of running MiniCPM-Llama3-V 2.5 on a Mac with MPS (Apple silicon or AMD GPUs).

```python
# test.py (requires more than 16 GB of memory)
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)
```
Run with command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```



### Efficient Inference with llama.cpp, ollama, vLLM

See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).

See [our fork of ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).



vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. For now, you can use our fork to run MiniCPM-o 2.6. Click to see.

1. Install vLLM (>=0.7.1):
```shell
pip install vllm
```

2. Run Example:
* [Vision Language](https://docs.vllm.ai/en/latest/getting_started/examples/vision_language.html) 
* [Audio Language](https://docs.vllm.ai/en/latest/getting_started/examples/audio_language.html) 
  

## Fine-tuning

### Simple Fine-tuning 

We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.

[Reference Document](./finetune/readme.md)


### With Align-Anything 

Fine-tuning MiniCPM-o 2.6 (both vision and audio, SFT and DPO) is supported by the PKU-Alignment Team through the [Align-Anything framework](https://github.com/PKU-Alignment/align-anything). Align-Anything is a scalable framework that aims to align any-modality large models with human intentions, and it open-sources its [datasets, models and benchmarks](https://huggingface.co/datasets/PKU-Alignment/align-anything). Thanks to its concise and modular design, it supports 30+ open-source benchmarks, 40+ models, and algorithms including SFT, SimPO, and RLHF. It also provides 30+ directly runnable scripts, making it easy for beginners to get started.

Best Practices: [MiniCPM-o 2.6](https://github.com/PKU-Alignment/align-anything/tree/main/scripts).


### With LLaMA-Factory 

We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (LoRA/Full/QLoRA) of 200+ LLMs without the need for coding, through the built-in web UI LLaMABoard. It supports various training methods such as SFT/PPO/DPO/KTO and advanced algorithms such as GaLore/BAdam/LLaMA-Pro/PiSSA/LongLoRA.

Best Practices: [MiniCPM-o 2.6 | MiniCPM-V 2.6](./docs/llamafactory_train_and_infer.md). 


### With the SWIFT Framework 

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete Adapters Library including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).

## FAQs
Click here to view the [FAQs](./docs/faqs.md)

## Limitations
As an experimental trial, MiniCPM-o 2.6 has notable limitations that merit further investigation and improvement.
- **Unstable speech output.** Generated speech can be marred by background noise and meaningless sounds.
- **Repeated response.** The model tends to repeat its response when it encounters similar consecutive user queries.
- **High latency on the web demo.** Users may experience unusually high latency when using the web demo hosted on overseas servers. We recommend deploying the demo locally or using a good network connection.

## Model License 

* This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. 

* The usage of MiniCPM-o/V model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).

* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.

## Statement 

As MLLMs, MiniCPM-o/V models generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o/V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.


## Institutions  

This project is developed by the following institutions:

-  [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
-  [ModelBest](https://modelbest.cn/)

## Key Techniques and Other Multimodal Projects 

You are welcome to explore the key techniques of MiniCPM-o/V and other multimodal projects from our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)


## Citation 

If you find our model/code/paper helpful, please consider citing our papers and starring us.

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```

| Model | Size | Token Density⁺ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| VITA-1.5 | 8B | 784 | 63.3 | 741 | 66.2 | - | 52.7 | 60.2 | 2328.1 | 76.8 | 79.2 | 52.6 | 44.6 | - | - | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |

| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (w/o subs / w subs) |
|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** |
| VITA-1.5 | 8B | 45.0 | - | - | 56.1/58.7 |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |

| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |

| Model | Size | Token Density⁺ | OpenCompass | MME | MMVet | OCRBench | MMMU val | MathVista mini | MMB1.1 test | AI2D | TextVQA val | DocVQA test | HallusionBench | Object HalBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4o | - | 1088 | 69.9 | 2328.7 | 69.1 | 736 | 69.2 | 61.3 | 82.2 | 84.6 | - | 92.8 | 55.0 | 17.6 |
| Claude 3.5 Sonnet | - | 750 | 67.9 | 1920.0 | 66.0 | 788 | 65.9 | 61.6 | 78.5 | 80.2 | - | 95.2 | 49.9 | 13.8 |
| Gemini 1.5 Pro | - | - | 64.4 | 2110.6 | 64.0 | 754 | 60.6 | 57.7 | 73.9 | 79.1 | 73.5 | 86.5 | 45.6 | - |
| GPT-4o mini | - | 1088 | 64.1 | 2003.4 | 66.9 | 785 | 60.0 | 52.4 | 76.0 | 77.8 | - | - | 46.1 | 12.4 |
| GPT-4V | - | 1088 | 63.5 | 2070.2 | 67.5 | 656 | 61.7 | 54.7 | 79.8 | 78.6 | 78.0 | 87.2 | 43.9 | 14.2 |
| Step-1V | - | - | 59.5 | 2206.4 | 63.3 | 625 | 49.9 | 44.8 | 78.0 | 79.2 | 71.6 | - | 48.4 | - |
| Qwen-VL-Max | - | 784 | 58.3 | 2281.7 | 61.8 | 684 | 52.0 | 43.4 | 74.6 | 75.7 | 79.5 | 93.1 | 41.2 | 13.4 |
| **Open-source** |
| LLaVA-NeXT-Yi-34B | 34B | 157 | 55.0 | 2006.5 | 50.7 | 574 | 48.8 | 40.4 | 77.8 | 78.9 | 69.3 | - | 34.8 | 12.6 |
| Mini-Gemini-HD-34B | 34B | 157 | - | 2141.0 | 59.3 | 518 | 48.0 | 43.3 | - | 80.5 | 74.1 | 78.9 | - | - |
| Cambrian-34B | 34B | 1820 | 58.3 | 2049.9 | 53.2 | 591 | 50.4 | 50.3 | 77.8 | 79.5 | 76.7 | 75.5 | 41.6 | 14.7 |
| GLM-4V-9B | 13B | 784 | 59.1 | 2018.8 | 58.0 | 776 | 46.9 | 51.1 | 67.9 | 71.2 | - | - | 45.0 | - |
| InternVL2-8B | 8B | 706 | 64.1 | 2215.1 | 54.3 | 794 | 51.2 | 58.3 | 79.4 | 83.6 | 77.4 | 91.6 | 45.0 | 21.3 |
| MiniCPM-Llama-V 2.5 | 8B | 1882 | 58.8 | 2024.6 | 52.8 | 725 | 45.8 | 54.3 | 72.0 | 78.4 | 76.6 | 84.8 | 42.4 | 10.3 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 2348.4* | 60.0 | 852* | 49.8* | 60.6 | 78.0 | 82.1 | 80.1 | 90.8 | 48.1* | 8.2 |

| Model | Size | Mantis Eval | BLINK val | Mathverse mv | Sciverse mv | MIRB |
|---|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4V | - | 62.7 | 54.6 | 60.3 | 66.9 | 53.1 |
| LLaVA-NeXT-Interleave-14B | 14B | 66.4 | 52.6 | 32.7 | 30.2 | - |
| **Open-source** |
| Emu2-Chat | 37B | 37.8 | 36.2 | - | 27.2 | - |
| CogVLM | 17B | 45.2 | 41.1 | - | - | - |
| VPG-C | 7B | 52.4 | 43.1 | 24.3 | 23.1 | - |
| VILA 8B | 8B | 51.2 | 39.3 | - | 36.5 | - |
| InternLM-XComposer-2.5 | 8B | 53.1* | 48.9 | 32.1* | - | 42.5 |
| InternVL2-8B | 8B | 59.0* | 50.9 | 30.5* | 34.4* | 56.9* |
| MiniCPM-V 2.6 | 8B | 69.1 | 53.0 | 84.9 | 74.9 | 53.8 |

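| Model | Size | Shot | TextVQA val | VizWiz test-dev | VQAv2 test-dev | OK-VQA val |
|---|---|---|---|---|---|---|
| Flamingo | 80B | 0* | 35.0 | 31.6 | 56.3 | 40.6 |
| Flamingo | 80B | 4 | 36.5 | 39.6 | 63.1 | 57.4 |
| Flamingo | 80B | 8 | 37.3 | 44.8 | 65.6 | 57.5 |
| IDEFICS | 80B | 0* | 30.9 | 36.0 | 60.0 | 45.2 |
| IDEFICS | 80B | 4 | 34.3 | 40.4 | 63.6 | 52.4 |
| IDEFICS | 80B | 8 | 35.7 | 46.1 | 64.8 | 55.1 |
| OmniCorpus | 7B | 0* | 43.0 | 49.8 | 63.2 | 45.5 |
| OmniCorpus | 7B | 4 | 45.4 | 51.3 | 64.5 | 46.5 |
| OmniCorpus | 7B | 8 | 45.6 | 52.2 | 64.7 | 46.6 |
| Emu2 | 37B | 0 | 26.4 | 40.4 | 33.5 | 26.7 |
| Emu2 | 37B | 4 | 48.2 | 54.6 | 67.0 | 53.2 |
| Emu2 | 37B | 8 | 49.3 | 54.7 | 67.8 | 54.1 |
| MM1 | 30B | 0 | 26.2 | 40.4 | 48.9 | 26.7 |
| MM1 | 30B | 8 | 49.3 | 54.7 | 70.9 | 54.1 |
| MiniCPM-V 2.6⁺ | 8B | 0 | 43.9 | 33.8 | 45.4 | 23.9 |
| MiniCPM-V 2.6⁺ | 8B | 4 | 63.6 | 60.5 | 65.5 | 50.1 |
| MiniCPM-V 2.6⁺ | 8B | 8 | 64.6 | 63.4 | 68.2 | 51.4 |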

Owner

Login: BUAADreamer
Kind: user
Location: Beijing
Company: Beihang University

Website: https://buaadreamer.top/
Repositories: 4
Profile: https://github.com/BUAADreamer

GitHub Events

Total

Push event: 27

Last Year

Push event: 27

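| Model | Size | AISHELL-1 (ASR zh, CER) | Fleurs zh (ASR zh, CER) | WenetSpeech test-net (ASR zh, CER) | LibriSpeech test-clean (ASR en, WER) | GigaSpeech (ASR en, WER) | TED-LIUM (ASR en, WER) | CoVoST en2zh (AST, BLEU) | CoVoST zh2en (AST, BLEU) | MELD emotion (Emotion, ACC) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** |
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| VITA-1.5 | 8B | 2.16 | - | 8.4 | 3.4 | - | - | - | - | - |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |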

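| Model | Size | Speech Llama Q. (ACC) | Speech Web Q. (ACC) | Speech Trivia QA (ACC) | Speech AlpacaEval (G-Eval, 10 point) | AudioArena (Semantic ELO score) | AudioArena (Acoustic ELO score) | AudioArena (Overall ELO score) | AudioArena (UTMOS) | AudioArena (ASR-WER) |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** |
| GPT-4o-Realtime | | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| VITA-1.5 | 8B | 46.7 | 28.1 | 23.3 | 2.0 | - | - | - | - | - |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |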

| Model | Seed-TTS test-zh (voice cloning, SIMO) | Seed-TTS test-en (voice cloning, SIMO) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |

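| Model | Size | Video-MME (w/o subs) | Video-MME (w subs) | Video-ChatGPT (Correctness) | Video-ChatGPT (Detail) | Video-ChatGPT (Context) | Video-ChatGPT (Temporal) | Video-ChatGPT (Consistency) |
|---|---|---|---|---|---|---|---|---|
| **Proprietary** |
| Claude 3.5 Sonnet | - | 60.0 | 62.9 | - | - | - | - | - |
| GPT-4V | - | 59.9 | 63.3 | - | - | - | - | - |
| **Open-source** |
| LLaVA-NeXT-7B | 7B | - | - | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| LLaVA-NeXT-34B | 34B | - | - | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| CogVLM2-Video | 12B | - | - | 3.49 | 3.46 | 3.23 | 2.98 | 3.64 |
| LongVA | 7B | 52.4 | 54.3 | 3.05 | 3.09 | 3.77 | 2.44 | 3.64 |
| InternVL2-8B | 8B | 54.0 | 56.9 | - | - | - | - | - |
| InternLM-XComposer-2.5 | 8B | 55.8 | - | - | - | - | - | - |
| LLaVA-NeXT-Video | 32B | 60.2 | 63.0 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| MiniCPM-V 2.6 | 8B | 60.9 | 63.6 | 3.59 | 3.28 | 3.93 | 2.73 | 3.62 |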