Releases | Open Source Science

Added support for RWKV7, Qwen3, and MiniCPM4 models
Added support for the RV1126B platform
Enabled function calling capability
Enabled cross-attention inference
Optimize the callback function to support pausing inference
Supported multi-batch inference
Optimized KV cache clearing interface
Improved chat template parsing with support for thinking mode selection
Server demo updated to support OpenAI-compatible format
Added return of model inference performance statistics
Supported mrope multimodal position encoding
A new quantization optimization algorithm has been added to improve quantization accuracy

- Python
Published by yhcvb about 1 year ago

- Python
Published by yhcvb about 1 year ago

Supports custom model conversion.
Supports chat_template configuration.
Enables multi-turn dialogue interactions.
Implements automatic prompt cache reuse for improved inference efficiency.
Expands maximum context length to 16K.
Supports embedding flash storage to reduce memory usage.
Introduces the GRQ Int4 quantization algorithm.
Supports GPTQ-Int8 model conversion.
Compatible with the RK3562 platform.
Added support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
Supports CPU core configuration.
Added support for Gemma3
Added support for Python 3.9/3.11/3.12

- Python
Published by yhcvb about 1 year ago

Add support for converting HuggingFace GPTQ-int4 models (requires groupsize to be 32, 64, or 128, and desc_act set to false).
Add support for TeleChat/TeleChat2/MiniCPM-S models.
Support exporting llm model in Qwen2VL
Resolve issues with LoRA inference.
Fix an import error related to IPython.

- Python
Published by yhcvb over 1 year ago

- Python
Published by yhcvb over 1 year ago

- Python
Published by yhcvb over 1 year ago

Added support for grouped quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
Added gdq algorithm to improve 4-bit quantization accuracy.
Added hybrid quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
Added support for Llama3, Gemma2, and Minicpm3 models.
Added support for gguf model conversion (currently supports q4_0 and fp16 only).
Added support for LoRa models.
Added storage and loading of prompt cache
Added PC-side emulation accuracy testing and inference interface support for rkllm-toolkit.
Fixed catastrophic forgetting issue when the token count exceeds max_context.
Optimized prefill speed.
Optimized generate speed.
Optimized model initialization time
Added support for four input interfaces: prompt, embedding, token, and multimodal.

- Python
Published by yhcvb over 1 year ago

- Python
Published by yhcvb about 2 years ago

ecosyste.ms