Recent Releases of transformers
transformers - v4.56: Dino v3, X-Codec, Ovis 2, MetaCLIP 2, Florence 2, SAM 2, Kosmos 2.5, HunYuan, GLM-4.5V
New model additions
Dino v3
DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.
You can find all the original DINOv3 checkpoints under the DINOv3 collection.
- Add Dino v3 by @qubvel in #40167
X-Codec
The X-Codec model was proposed in Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model by Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue.
The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:
- Music continuation: Better modeling of musical semantics yields more coherent continuations.
- Text-to-Sound Synthesis: X-Codec captures semantic alignment between text prompts and generated audio.
- Semantic-aware audio tokenization: X-Codec is used as an audio tokenizer in the YuE lyrics-to-song generation model.
- Add X-Codec model by @Manalelaidouni in #38248
Ovis 2
Ovis2 is an updated version of the Ovis model developed by the AIDC-AI team at Alibaba International Digital Commerce Group.
Ovis2 is the latest advancement in multi-modal large language models (MLLMs), succeeding Ovis1.6. It retains the architectural design of the Ovis series, which focuses on aligning visual and textual embeddings, and introduces major improvements in data curation and training methods.

- Add Ovis2 model and processor implementation by @thisisiron in #37088
MetaCLIP 2
MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.
- Add MetaCLIP 2 by @NielsRogge in #39826
Florence 2
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
- Add support for Florence-2 by @ducviet00 in #38188
SAM 2
SAM2 (Segment Anything Model 2) was proposed in Segment Anything in Images and Videos by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
The model can be used to predict segmentation masks of any object of interest given an input image or video, and input points or bounding boxes.
- Add Segment Anything 2 (SAM2) by @SangbumChoi in #32317
Kosmos 2.5
The Kosmos-2.5 model was proposed in KOSMOS-2.5: A Multimodal Literate Model by Microsoft.
The abstract from the paper is the following:
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

- Add Kosmos-2.5 by @tic-top in #31711
HunYuan
More information at release 🤗
- HunYuan opensource by @yjc9696 in #39606
Seed OSS
More information at release 🤗
- Adding ByteDance Seed Seed-OSS by @Fazziekey in #40272
GLM-4.5V
More information at release 🤗
- GLM-4.5V Model Support by @zRzRzRzRzRzRzR in #39805
Cache
Beyond a large refactor of the caching system in Transformers that makes it much more practical and general, models using sliding window attention or chunked attention no longer waste memory when caching past states. This was made possible most notably by:
- New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039
See the following improvements on memory usage for Mistral (using only sliding layers) and GPT-OSS (1 out of 2 layers is sliding) respectively:
Beyond memory usage, it will also improve generation/forward speed by a large margin for large contexts, as only necessary states are passed to the attention computation, which is very sensitive to the sequence length.
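The idea behind the memory saving can be sketched in a few lines of plain Python. This is only a toy illustration, not the Transformers implementation: the class name `SlidingWindowKVCache` and its `update` method are hypothetical, and strings stand in for key/value tensors. It shows that a sliding layer only needs to keep the last `window` states, so both the stored cache and the states handed to attention stay bounded regardless of sequence length.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy per-layer cache that keeps only the last `window` key/value states.

    Hypothetical illustration only; this is not the Transformers
    DynamicSlidingWindowLayer implementation.
    """

    def __init__(self, window: int):
        self.window = window
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def update(self, key, value):
        """Append one new token's states and return what attention may see."""
        self.keys.append(key)
        self.values.append(value)
        return list(self.keys), list(self.values)


cache = SlidingWindowKVCache(window=4)
for position in range(10):
    keys, values = cache.update(f"k{position}", f"v{position}")

print(len(keys))  # 4
print(keys)       # ['k6', 'k7', 'k8', 'k9']
```

After ten tokens the cache still holds four entries, which is the behavior that avoids the wasted memory described above.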
Quantization
MXFP4
Since the GPT-OSS release, which introduced the MXFP4 quantization type, several improvements have been made to its support, which should now stabilize.
- Fix MXFP4 quantizer validation to allow CPU inference with dequantize option by @returnL in #39953
- Enable gpt-oss mxfp4 on older hardware (sm75+) by @matthewdouglas in #39940
- Fix typo and improve GPU kernel check error message in MXFP4 quantization by @akintunero in #40349
- Default to dequantize if cpu in device_map for mxfp4 by @MekkCyber in #39993
- Fix GPT-OSS swiglu_limit not passed in for MXFP4 by @danielhanchen in #40197
- [Mxfp4] Add a way to save with a quantization method by @ArthurZucker in #40176
New standard
Now that we have deprecated TensorFlow and JAX, we felt that torch_dtype was not only misaligned with torch, but also redundant and hard to remember. For this reason, we switched to a much more standard dtype argument!
- ⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
torch_dtype will still be a valid usage for as long as needed to ensure a smooth transition, but new code should use dtype, and we encourage you to update older code as well!
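For code bases that build loading arguments as dictionaries, the rename can be handled in one place. The helper below is a hypothetical sketch for user code (the name `migrate_dtype_kwarg` is not part of Transformers); it only renames the key and does not reflect how the library itself handles the transition.

```python
import warnings


def migrate_dtype_kwarg(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with `torch_dtype` renamed to `dtype`.

    Hypothetical helper for user code; not a Transformers API.
    """
    updated = dict(kwargs)
    if "torch_dtype" in updated:
        legacy = updated.pop("torch_dtype")
        # Refuse to guess when both spellings are present and disagree.
        if "dtype" in updated and updated["dtype"] != legacy:
            raise ValueError("Both `dtype` and `torch_dtype` were given with different values.")
        updated.setdefault("dtype", legacy)
        warnings.warn("`torch_dtype` is deprecated in favor of `dtype`.", FutureWarning, stacklevel=2)
    return updated


old_style = {"torch_dtype": "bfloat16", "device_map": "auto"}
new_style = migrate_dtype_kwarg(old_style)
print(new_style)  # {'device_map': 'auto', 'dtype': 'bfloat16'}
```

The input dictionary is left untouched, so the helper can be applied to shared configuration without side effects.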
Breaking changes
The following commits are breaking changes in workflows that were either buggy or not working as expected.
Saner hub-defaults for hybrid cache implementation
On models where the hub checkpoint specifies cache_implementation="hybrid" (static sliding window hybrid cache), this value is now unset, so the model uses the dynamic sliding window layers by default.
The previous default caused widespread, very slow first generate calls on models with hybrid caches, which should no longer be the case.
- 🚨🚨 [generate] ignore cache_implementation="hybrid" hub defaults by @gante in #40135
Sine positional embeddings for MaskFormer & LRU cache
Cache the computation of sine positional embeddings for MaskFormer; results in a 6% performance improvement.
- 🚨 Use lru_cache for sine pos embeddings MaskFormer by @yonigozlan in #40007
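The caching technique itself is easy to demonstrate with the standard library. The sketch below is a simplified 1D stand-in, not MaskFormer's actual 2D embedding code: the function name, the `temperature` default, and the use of nested tuples instead of tensors are all assumptions made for illustration. It shows that repeated calls with the same shape skip the recomputation entirely.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=8)
def sine_position_embedding(length: int, dim: int, temperature: float = 10000.0):
    """Return a (length x dim) table of sinusoidal position embeddings as nested tuples."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Even channels use sine, odd channels use cosine of the same angle.
            angle = pos / temperature ** (2 * (i // 2) / dim)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(tuple(row))
    return tuple(table)


first = sine_position_embedding(16, 8)
second = sine_position_embedding(16, 8)  # served from the cache
info = sine_position_embedding.cache_info()
print(info.hits, info.misses)  # 1 1
print(first is second)         # True
```

Because lru_cache hands back the same object on a hit, callers should treat the result as read-only, which is why the sketch returns immutable tuples.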
Explicit cache initialization
Adds explicit cache initialization to prepare for the deprecation of the from_legacy_cache utility.
- 🚨 Always return Cache objects in modelings (to align with generate) by @manueldeprada in #39765
Default compilation with fullgraph=False
Having fullgraph set to True during compilation ended up being very restrictive, especially with the arrival of widely-used MoEs.
- 🚨🚨 Switch default compilation to fullgraph=False by @Cyrilvallez in #40137
Remove decoding strategies
The DoLa decoding strategy was moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/dola
The Contrastive Search decoding strategy was moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/contrastive-search
Both have now been removed from the library as a result.
- 🚨 Remove DoLa decoding strategy by @manueldeprada in #40082
- 🚨 Remove Contrastive Search decoding strategy by @manueldeprada in #40428
Fix sliding window in flash attention
Flash attention was using sliding window sizes that were off by one. This affected generations whose initial context was larger than the sliding window size.
- 🚨 [Flash Attention] Fix sliding window size by @vasqu in #40163
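A small sketch can show how this kind of off-by-one arises and why short prompts are unaffected. It assumes a convention in which sliding_window counts the current token, while the kernel-side parameter (called `left_span` here, a hypothetical name) counts only positions to the left of the query, inclusive. The actual change is in #40163 and may differ in detail.

```python
def visible_keys(query_pos: int, left_span: int) -> list[int]:
    """Causal keys a query sees when it may look `left_span` positions back, inclusive."""
    start = max(0, query_pos - left_span)
    return list(range(start, query_pos + 1))


sliding_window = 4  # intended: 4 keys in total, current token included
query_pos = 10

off_by_one = visible_keys(query_pos, left_span=sliding_window)
corrected = visible_keys(query_pos, left_span=sliding_window - 1)

print(len(off_by_one), off_by_one)  # 5 [6, 7, 8, 9, 10]
print(len(corrected), corrected)    # 4 [7, 8, 9, 10]

# For contexts no longer than the window the two agree, so only
# prompts longer than the window are affected.
print(visible_keys(2, 4) == visible_keys(2, 3))  # True
```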
Minimum Torch version is now 2.2
Torch 2.1 support has been unreliable for some time, so we've now made it official and bumped our minimum version to 2.2.
- byebye torch 2.1 by @Rocketknight1 in #40317
Bugfixes and improvements
- [CI] post-GptOss fixes for green CI by @gante in #39929
- Avoid utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) by @ydshieh in #39918
- Fix CI: Tests failing on CPU due to torch.device('cpu').index being None by @manueldeprada in #39933
- circleci: pin torch 2.7.1 until torchcodec is updated by @ydshieh in #39951
- [docs] ko toc fix by @gante in #39927
- docs: fix typo in 'quantization-aware training' by @luckyvickyricky in #39904
- Fix grammatical error in MoE variable name: expert_hitted → expert_hit, hitted_experts → hit_experts by @Mihonarium in #39959
- fix typo by @Tialo in #39936
- [image processor] fix glm4v by @KeyKy in #39964
- remove triton_kernels dep with kernels instead by @SunMarc in #39926
- Fix fix_and_overwrite mode of utils/check_docstring.py by @manueldeprada in #39369
- [bugfix] fix flashattention2 unavailable error on Ascend NPU by @FightingZhen in #39844
- chore: update Deformable_Detr model card by @arpon-kapuria in #39902
- Modular fix: remove the model name in find_file_type by @yonigozlan in #39897
- Gemma3 fixes by @remi-or in #39960
- [superglue] Fixed the way batch mask was applied to the scores before match assignment computation by @sbucaille in #39968
- Support input_embeds in torch exportable decoders by @jackzhxng in #39836
- Various test fixes for AMD by @remi-or in #39978
- [Idefics] fix device mismatch by @zucchini-nlp in #39981
- Fix gemma3n feature extractor's incorrect squeeze by @Isotr0py in #39919
- [typing] Fix return typehint for decoder and inv_freq annotation by @qubvel in #39610
- Fix consistency by @Cyrilvallez in #39995
- Update expected output values after #39885 (part 1) by @ydshieh in #39990
- Fix int4 quantized model cannot work with cpu by @yuanwu2017 in #39724
- Fix missing video inputs for PerceptionLM. by @shuminghu in #39971
- fix: remove CHAT_TEMPLATE import in tests for deepseek-vl by @geetu040 in #40003
- Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips by @ducviet00 in #39965
- Fix default values of getenv by @cyyever in #39867
- FA2 can continue generation from cache by @zucchini-nlp in #39843
- unpin torch<2.8 on circleci by @ydshieh in #40012
- docs: fix duplication in 'en/optimizers.md' by @luckyvickyricky in #40014
- Raising error when quantizing a quantized model by @MekkCyber in #39998
- Update expected output values after #39885 (part 2) by @ydshieh in #40015
- pin torchcodec==0.5.0 for now with torch 2.7.1 on daily CI by @ydshieh in #40013
- Fix broken image inference for Fuyu model by @Isotr0py in #39915
- Higgs modules_to_not_convert standardization by @MekkCyber in #39989
- Fix an annoying flaky test by @zucchini-nlp in #40000
- Harmonize past_key_value to past_key_valueS everywhere by @Cyrilvallez in #39956
- Fix missing None default values for Gemma3n model in get_placeholder_mask by @Znerual in #39991
- [core] Refactor the Cache logic to make it simpler and more general by @Cyrilvallez in #39797
- Tie weights recursively on all submodels by @Cyrilvallez in #39996
- Bnb failling tests by @MekkCyber in #40026
- fix notification_service.py about time_spent by @ydshieh in #40037
- Revert "fix notification_service.py about time_spent" by @ydshieh in #40044
- Update HuBERT model card according to template by @reedrya in #39742
- unpin torchcodec==0.5.0 and use torch 2.8 on daily CI by @ydshieh in #40072
- fix: resolve triton version check compatibility on windows by @Tsumugii24 in #39986
- [qwen-vl] fix beam search with videos by @zucchini-nlp in #39726
- [gemma3] update conversion key mapping by @zucchini-nlp in #39778
- fix: move super().__init__ after vision_config init in Mistral3Config by @starcatmeow in #40063
- Remove deprecated cache-related objects by @Cyrilvallez in #40035
- guard on model.eval when using torch.compile + FSDP2 by @winglian in #37413
- Fix repo consistency by @zucchini-nlp in #40077
- added Textnet fast image processor by @rahzaazhar in #39884
- Fix time_spent in notification_service.py. by @ydshieh in #40081
- chore: standardize DeBERTa model card by @Shoumik-Gandre in #37409
- [GPT Big Code] Fix attention scaling by @vasqu in #40041
- feat: extract rev in attn_implementation kernels via @ by @drbh in #40009
- Update notification service MI325 by @ivarflakstad in #40078
- Fix PerceptionLM image preprocessing for non-tiled image input. by @shuminghu in #40006
- Revert FA2 kwargs construction by @zucchini-nlp in #40029
- [fix] batch inference for llava_onevision by @cyr0930 in #40021
- [docs] Zero Shot Object Detection Task by @ariG23498 in #40096
- Update Glm4V processor and add tests by @zucchini-nlp in #39988
- Add glm4.5&&glm4.5V doc by @lambertwjh in #40095
- Causal loss for ForConditionalGeneration by @qgallouedec in #39973
- Audio encodings now match conv2d weight dtype in Gemma3nAudioSSCPConvBlock by @Malav-P in #39743
- New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039
- Enable SIM rules by @cyyever in #39806
- feat: add is_fast to ImageProcessor by @MilkClouds in #39603
- Re-apply make style by @Cyrilvallez in #40106
- Replace logger.warning with logger.warning_once in GradientCheckpointingLayer by @qgallouedec in #40091
- Fix regression in mllama vision encoder by @Isotr0py in #40083
- Switch the order of args in StaticCache (for BC and future logic) by @Cyrilvallez in #40100
- Fix Qwen3 MoE GGUF architecture mismatch by @ctcanbol in #39976
- Fix error on importing unavailable torch.distributed by @m-gallus in #40038
- [Flash Attention] Fix flash attention integration by @vasqu in #40002
- [trainer] ensure special tokens in model configs are aligned with tokenizer at train time by @gante in #38441
- Fix Causality Handling in Flash Attention to Support Bidirectional Attention by @lucaswychan in #39707
- [docs] Add reference to HF-maintained custom_generate collections by @gante in #39894
- Add model card for MobileViT by @Shivamjan in #40033
- remove sequence parallel in llama4 by @3outeille in #40084
- 🌐 [i18n-KO] Translated tiny_agents.md to Korean by @AhnJoonSung in #39913
- [bugfix] Fix tensor device in Idefics2, Idefics3, and SmolVLM by @qgallouedec in #39975
- changed xLSTMRMSNorm to RMSNorm by @nikitazuevblago in #40113
- Fix QuantoQuantizedCache import issues by @manueldeprada in #40109
- [serve] allow array content inputs for LLMs by @gante in #39829
- decoding_method argument in generate by @manueldeprada in #40085
- Collated reports by @ivarflakstad in #40080
- DOCS: Add missing space in SECURITY.md by @shivaheidari in #40087
- [trainer] handle case where EOS token is None in generation_config by @gante in #40127
- Fix hidden torchvision>=0.15 dependency issue by @yonigozlan in #39928
- 🌐 [i18n-KO] Translated main_classes/processors.md to Korean by @TaskerJang in #39519
- 🌐 [i18n-KO] Translated jamba.md to Korean by @skwh54 in #39890
- 🌐 [i18n-KO] Translated main_classes/optimizer_schedules.md to Korean by @luckyvickyricky in #39713
- 🌐 [i18n-KO] Translated gpt2.md to Korean by @taemincode in #39808
- 🌐 [i18n-KO] Translated optimizers.md to Korean by @chelsseeey in #40011
- 🌐 [i18n-KO] Translated grounding-dino.md to Korean by @TaskerJang in #39861
- 🌐 [i18n-KO] Translated pipelines.md to Korean by @xhaktm00 in #39577
- gpt oss is important by @ArthurZucker in #40139
- Fix Janus by @Cyrilvallez in #40140
- [docs] Fix ko toctree by @stevhliu in #40138
- Remove an old badly designed test by @Cyrilvallez in #40142
- updated visualBERT modelcard by @Anil-Red in #40057
- 🌐 [i18n-KO] Translated gemma3.md to Korean by @seopp in #39865
- Fix quantized cache with only cache_implementation in generate by @Cyrilvallez in #40144
- Add pytest marker: torch_compile_test and torch_export_test by @ydshieh in #39950
- Update Dockerfiles to install packages inside a virtual environment by @Sai-Suraj-27 in #39098
- Create self-scheduled-amd-mi355-caller.yml by @glegendre01 in #40134
- [Cohere2Vision] remove unused arg by @zucchini-nlp in #40103
- [efficientloftr] fix bugs and follow original cross attn implementation strictly by @sbucaille in #40141
- Fix CI: Use correct import in SAM for torchvision InterpolationMode by @manueldeprada in #40160
- [Continous Batching] set head_dim when config.head_dim is None by @kashif in #40159
- Replace self.tokenizer by self.processing_class by @qgallouedec in #40119
- [FA2] Fix it finally - revert fa kwargs preparation by @Cyrilvallez in #40161
- [bugfix] fix flash-attention2 unavailable error for Ascend NPU by @FightingZhen in #40151
- build: Add fast image processor tvp by @adutchengineer in #39529
- Add GptOssForSequenceClassification for GPT-OSS models by @zyfedward in #40043
- Standardize BARTpho model card: badges, new examples, fixed broken im… by @eshwanthkartitr in #40051
- Add dates to the model docs by @MHRDYN7 in #39320
- Pin torch to 2.7.1 on CircleCI for now by @ydshieh in #40174
- Update dynamic attnt setter for multimodals by @zucchini-nlp in #39908
- [MINOR:TYPO] Update base.py by @cakiki in #40169
- make model doc device agnostic by @yao-matrix in #40143
- fix to avoid modifying a view in place by @3outeille in #40162
- Fix fsdp for generic-task models by @Cyrilvallez in #40191
- Add repr to EncoderDecoderCache by @Cyrilvallez in #40195
- Fix typos by @cyyever in #40175
- Remove prepareflashattentionfrompositionids by @cyyever in #40069
- Avoid CUDA stream sync by @cyyever in #40060
- Fix various Pylint warnings by @cyyever in #40107
- Update: add type hints to check_tokenizers.py by @ajeet214 in #40094
- Benchmarking improvements by @ahadnagy in #39768
- docs: Update LayoutLM model card according to new standardized format by @Jin-HoMLee in #40129
- Revert "Pin torch to 2.7.1 on CircleCI for now" + Final fix for too long with no output by @ydshieh in #40201
- Use correct model_input_names for PixtralImageProcessor by @rohitrango in #40226
- fix error vocabsize at Qwen25VLForConditionalGeneration lossfunction by @killight98 in #40130
- [SAM 2] Change checkpoints in docs and tests by @yonigozlan in #40213
- Fix more typos by @cyyever in #40212
- Fix ESM token_dropout crash when using inputs_embeds instead of input_ids by @notkisk in #40181
- AMD scheduled CI ref env file by @ivarflakstad in #40243
- Fix more pylint warnings by @cyyever in #40204
- remove transposeforscores call in ESM-2 by @pstjohn in #40210
- Add chat_template (jinja2) as an extra dependency by @tboerstad in #40128
- [typing] fix type annotation error in DepthPro model image processor by @MengAiDev in #40238
- [serve] guard imports by @gante in #39825
- [CI] Fix repo consistency by @vasqu in #40249
- Fixes for EncoderDecoderCache by @remi-or in #40008
- fix: Catch correct ConnectionError for additionalchattemplates by @akug in #39874
- Model card for NLLB by @sahil-kabir in #40074
- Correct typo and update notes in docs Readme by @PavloFesenko in #40234
- Fix benchmark workflow by @ahadnagy in #40254
- docs: Update OLMo model card by @rafakatri in #40233
- Skip broken tests by @zucchini-nlp in #40157
- Remove MI300 CI by @ivarflakstad in #40270
- set inputs_embeds to None while generate to avoid audio encoder forward in generation process by @BakerBunker in #40248
- [detection] fix attention mask for RT-DETR-based models by @materight in #40269
- Fix slow static cache export tests by @jackzhxng in #40261
- Fix setting attention for multimodal models by @zucchini-nlp in #39984
- [detection] fix correct k_proj weight and bias slicing in D-FINE by @notkisk in #40257
- Skipping pytree registration in case fsdp is enabled by @romitjain in #40075
- Update imageprocessingperceptionlmfast.py to allow for proper override of visioninputtype by @tyleryzhu in #40252
- fix which routing method by @ArthurZucker in #40283
- Fix chat CLI GPU loading and request_id validation issues by @robin-ede in #40230
- docs(layoutlm): add missing id=usage to <hfoptions> tag in LayoutLM model card by @Jin-HoMLee in #40273
- Standardize RAG model card by @aayush226 in #40222
- docs: Update TrOCR model card to new format by @AceHunterr in #40240
- Update model card for gpt neox japanese by @ahnjj in #39862
- SmolVLM and InternVL: Ensure pixel values are converted to the correct dtype for fp16/bf16 by @qgallouedec in #40121
- Standardize BertGeneration model card by @nemitha2005 in #40250
- Adjust ROCm test output expectations by @ahadnagy in #40279
- SmolVLM test fixes by @ahadnagy in #40275
- make model docs device agnostic (2) by @yao-matrix in #40256
- [3/3] make docs device agnostic, all en docs for existing models done by @yao-matrix in #40298
- Allow to be able to run torch.compile tests with fullgraph=True by @ydshieh in #40164
- [FA] Fix dtype in varlen with position ids by @vasqu in #40295
- [docs] delete more TF/Flax docs by @gante in #40289
- Clean up X-Codec. by @ebezzam in #40271
- Remove OTel SDK dependencies by @anuraaga in #40305
- Fix GOT-OCR2 and Cohere2Vision image processor patches caculation by @Isotr0py in #40312
- [fix] Pass adamw optimizer parameters to StableAdamW by @emapco in #40184
- chore: fix typo in find_executable_batch_size to match new 0.9 ratio by @MilkClouds in #40206
- 🚨 [Flash Attention] Fix sliding window size by @vasqu in #40163
- Remove unnecessary contiguous calls for modern torch by @Rocketknight1 in #40315
- Qwen2.5-Omni test fixes by @ahadnagy in #40307
- Add back _tp_plan attribute by @rishub-tamirisa in #39944
- byebye torch 2.1 by @Rocketknight1 in #40317
- No more natten by @ydshieh in #40287
- [GPT OSS] Refactor the tests as it was not properly checking the outputs by @ArthurZucker in #40288
- Update CI with nightly torch workflow file by @ydshieh in #40306
- Fix: Apply get_placeholder_mask in Ovis2 by @thisisiron in #40280
- Update notification service amddailyci_workflows definition by @ivarflakstad in #40314
- One cache class to rule them all by @Cyrilvallez in #40276
- Fix chunked attention mask with left-padding by @Cyrilvallez in #40324
- [docs] remove flax references from /en/model_doc by @gante in #40311
- Fix qwen-omni processor text only mode by @yuekaizhang in #40336
- Change Qwen2RMSNorm to RMSNorm from PyTorch by @cyyever in #40066
- Add DeepseekV3ForSequenceClassification for Deepseek V3 models by @abdokaseb in #40200
- Fix deprecation warning version by @Cyrilvallez in #40343
- Add missing arguments to class constructors by @cyyever in #40068
- [docs] remove TF references from /en/model_doc by @gante in #40344
- Fix: Only call Trainer.alignspecialtokens if model has "config" attribute by @tomaarsen in #40322
- add type hints by @wirthual in #40319
- Fix an infinite loop bug in recursive search of relative imports by @eladsegal in #40326
- Fix links in Glm4vMoe configuration classes to point to the correct H… by @vvvdwbvvv in #40310
- T5 test and target device fixes by @ahadnagy in #40313
- Update test_spm_converter_bytefallback_warning by @ydshieh in #40284
- (small) fix conditional for inputids and inputembeds in marian by @cyntqliu in #40045
- Fix attention vizualizer by @molbap in #40285
- [ModernBert] Prevent the attention mask from being None in ModernBertForSequenceClassification by @ashmikuz in #35991
- Clean up XCodec and other codecs by @ebezzam in #40348
- [serve] add cors warnings by @gante in #40112
- [detection] use consistent dtype for Conditional and DAB DETR positional embeddings by @agkphysics in #40300
- Remove more PyTorch 2.2 compatible code by @cyyever in #40337
- [FA] Fix some model tests by @vasqu in #40350
- Qwen2.5-VL test fixes for ROCm by @ahadnagy in #40308
- [generate] handle support for cache classes when num enc layers != num dec layers by @gante in #40277
- [4/N]more docs to device agnostic by @yao-matrix in #40355
- DOCS: Clarification on the use of label_names as an argument to TrainingArguments by @huzaifa-jawad367 in #40353
- Fix idefics3 vision embeddings indices dtype by @Isotr0py in #40360
- wav2vec2 fixes by @remi-or in #40341
- Change multimodal data links to HF hub by @zucchini-nlp in #40309
- [pipelines] add support to skip_special_tokens in the main text generation pipelines by @gante in #40356
- ⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
- [processor] move commonalities to mixin by @zucchini-nlp in #40339
- [configuration] allow to overwrite kwargs from subconfigs by @zucchini-nlp in #40241
- fix(example): align parameter names with the latest function definition for gdino by @developer0hye in #40369
- Add GptOssForTokenClassification for GPT-OSS models by @abdokaseb in #40190
- Bug Fix: Dynamically set return_lse flag in FlexAttention by @amd-lalithnc in #40352
- Chat Template Doc Fixes by @Rocketknight1 in #40173
- Rework the Cache documentation by @Cyrilvallez in #40373
- Update README_zh-hans.md by @TardC in #40380
- HF papers in doc by @qgallouedec in #40381
- Run FA2 tests in CI by @ydshieh in #40397
- Reactivate a lot of tests skipped for no reason anymore by @Cyrilvallez in #40378
- :broom: :broom: :broom: Get set decoder cleanup by @molbap in #39509
- fix to accept cumulative_seqlens from TransformersKwargs in FA by @Kurt232 in #40194
- [docs] flax/jax purge by @gante in #40372
- Fix typo: 'casual' -> 'causal' in code and documentation by @akintunero in #40371
- Fix CI (hunyuan moe does not support fullgraph) by @Cyrilvallez in #40423
- Fix typo: 'seperator' to 'separator' in variable names by @Prawal-Sharma in #40389
- Fix UnboundLocalError in WER metric computation by @prxshetty in #40402
- Gpt oss optim by @jiqing-feng in #40304
- Fix processing tests by @zucchini-nlp in #40379
- Fix label smoothing incompatibility with multi-label classification by @avchauzov in #40296
- Fix modular for modernbert-decoder by @Cyrilvallez in #40431
- Update collated reports working directory and --path by @ivarflakstad in #40433
- Add tokenizer_kwargs argument to the text generation pipeline by @Joshua-Chin in #40364
- [docs] remove last references to transformers TF classes/methods by @gante in #40429
- Remove working-dir from collated reports job by @ivarflakstad in #40435
- 🌐 [i18n-KO] Translated models.md to Korean by @Judy-Choi in #39518
- Gemma3 text fixes: Add expectations for MI325 by @ahadnagy in #40384
- Fix collated reports model directory traversal by @ivarflakstad in #40437
- Fix https://github.com/huggingface/transformers/issues/40292 by @id01 in #40439
- Fix collated reports uploading by @ivarflakstad in #40440
- InternVL MI325 test expectations by @ahadnagy in #40387
- Fix collated reports model name entry by @ivarflakstad in #40441
- Fix non FA2 tests after FA2 installed in CI docker image by @ydshieh in #40430
- Refactor ViT-like models by @qubvel in #39816
- [Trainer] accelerate contextparallel support in trainer by @kashif in #40205
- fix qwen25-vl grad acc by @iMountTai in #40333
- [video processors] decode only sampled videos -> less RAM and faster processing by @zucchini-nlp in #39600
- rename getcudawarmupfactor to getacceleratorwarmupfactor by @yao-matrix in #40363
- Make cache_config not mandatory by @remi-or in #40316
- Continuous batching refactor by @remi-or in #40426
- flashpaged: saux may not exist by @pcuenca in #40434
- Fix extra template loading by @Rocketknight1 in #40455
- deci gguf support by @ved1beta in #38669
- [fastimageprocessor] fix image normalization for resize by @audioXD in #40436
- [RoPE] explicit factor > implicit factor in YaRN by @gante in #40320
- [pipeline] Add Keypoint Matching pipeline by @sbucaille in #39970
- Update SegFormer model card by @GSNCodes in #40417
- Not to shock AMD team by the cancelled workflow run notification ❤️ 💖 by @ydshieh in #40467
- Fix nightly torch CI by @ydshieh in #40469
- CI when PR merged to main by @ydshieh in #40451
- Validate GptOssConfig rope config after it's fully initialized by @zifeitong in #40474
- [modular] Use multi-processing + fix model import issue by @Cyrilvallez in #40481
- [modular] Remove ambiguity in all calls to parent class methods + fix dependency graph by @Cyrilvallez in #40456
- [ESM] support attention API by @zucchini-nlp in #40370
- [EfficientLoFTR] dynamic image size support by @sbucaille in #40329
- Fix qwen2_moe tests by @ydshieh in #40494
- [Whisper] Add rocm expected results to certain tests by @ivarflakstad in #40482
- Collated reports: no need to upload artifact by @ivarflakstad in #40502
- Fix the CI workflow of merge to main by @ydshieh in #40503
- docs(pixtral): Update Pixtral model card to new format by @BryanBradfo in #40442
- [modular] Classes can now be defined and referenced in arbitrary order (without bringing unwanted dependencies) by @Cyrilvallez in #40507
- Include machine type in collated reports filename by @ivarflakstad in #40514
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @remi-or
- Gemma3 fixes (#39960)
- Various test fixes for AMD (#39978)
- Fixes for EncoderDecoderCache (#40008)
- wav2vec2 fixes (#40341)
- Make cache_config not mandatory (#40316)
- Continuous batching refactor (#40426)
- @sbucaille
- [superglue] Fixed the way batch mask was applied to the scores before match assignment computation (#39968)
- [efficientloftr] fix bugs and follow original cross attn implementation strictly (#40141)
- [pipeline] Add Keypoint Matching pipeline (#39970)
- [EfficientLoFTR] dynamic image size support (#40329)
- @ducviet00
- Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips (#39965)
- Add support for Florence-2 (#38188)
- @cyyever
- Fix default values of getenv (#39867)
- Enable SIM rules (#39806)
- Fix typos (#40175)
- Remove prepareflashattentionfrompositionids (#40069)
- Avoid CUDA stream sync (#40060)
- Fix various Pylint warnings (#40107)
- Fix more typos (#40212)
- Fix more pylint warnings (#40204)
- Change Qwen2RMSNorm to RMSNorm from PyTorch (#40066)
- Add missing arguments to class constructors (#40068)
- Remove more PyTorch 2.2 compatible code (#40337)
- @zRzRzRzRzRzRzR
- GLM-4.5V Model Support (#39805)
- @SangbumChoi
- Add Segment Anything 2 (SAM2) (#32317)
- @adutchengineer
- build: Add fast image processor tvp (#39529)
- @MHRDYN7
- Add dates to the model docs (#39320)
- @yao-matrix
- make model doc device agnostic (#40143)
- make model docs device agnostic (2) (#40256)
- [3/3] make docs device agnostic, all en docs for existing models done (#40298)
- [4/N]more docs to device agnostic (#40355)
- rename getcudawarmupfactor to getacceleratorwarmupfactor (#40363)
- @Manalelaidouni
- Add X-Codec model (#38248)
- @thisisiron
- Add Ovis2 model and processor implementation (#37088)
- Fix: Apply get_placeholder_mask in Ovis2 (#40280)
- @tic-top
- Add Kosmos-2.5 (#31711)
- @yjc9696
- HunYuan opensource (#39606)
- @Fazziekey
- Adding ByteDance Seed Seed-OSS (#40272)
- Python
Published by LysandreJik 9 months ago
transformers - Patch v4.55.4
Patch v4.55.4
There was a mix-up on our side when cherry-picking the commit #40197, which led to a wrong commit in the patch! Sorry everyone 😭
This patch is just the official fix for #40197!
- Python
Published by ArthurZucker 9 months ago
transformers - Patch release v4.55.3
Patch release 4.55.3
Focused on stabilizing FlashAttention-2 on Ascend NPU, improving FSDP behavior for generic-task models, and fixing MXFP4 integration for GPT-OSS.
Bug Fixes & Improvements
- FlashAttention-2 / Ascend NPU – Fix “unavailable” runtime error (#40151) by @FightingZhen
- FlashAttention kwargs – Revert FA kwargs preparation to resolve regression (#40161) by @Cyrilvallez
- FSDP (generic-task models) – Fix sharding/runtime issues (#40191) by @Cyrilvallez
- GPT-OSS / MXFP4 – Ensure swiglu_limit is correctly passed through (#40197) by @danielhanchen
- Mamba – Fix cache handling to prevent stale/incorrect state (#40203) by @manueldeprada
- Misc – Minor follow-up fix addressing #40262 by @ArthurZucker
- Python
Published by ArthurZucker 9 months ago
transformers - Patch release 4.55.2: for FA2 users!
Patch release 4.55.2!
only affects FA2 generations!
😢 Well sorry everyone, sometimes shit can happen...
4.55.1 was broken because of 🥁 git merge conflict.
I cherry-picked https://github.com/huggingface/transformers/pull/40002 without having https://github.com/huggingface/transformers/pull/40029, so `from ..modeling_flash_attention_utils import prepare_fa_kwargs_from_position_ids` is missing, and since this is only covered by a slow test, nothing caught it.
Will work to remediate and write the post-mortem when yanking the release.
- Python
Published by ArthurZucker 10 months ago
transformers - Patch release 4.55.1
Patch release 4.55.1:
Mostly focused on stabilizing MXFP4 for the GPT-OSS model!
Bug Fixes & Improvements
- Idefics2, Idefics3, SmolVLM – Fix tensor device issue (#39975) by @qgallouedec
- Merge conflicts – Fix merge conflicts from previous changes by @vasqu
- MXFP4 / CPU `device_map` – Default to dequantize when CPU is in `device_map` (#39993) by @MekkCyber
- GPT Big Code – Fix attention scaling (#40041) by @vasqu
- Windows compatibility – Resolve Triton version check compatibility (#39986) by @Tsumugii24 @MekkCyber
- Gemma3n model – Add missing None default values for `get_placeholder_mask` (#39991, #40024) by @Znerual
- Fuyu model – Fix broken image inference (#39915) by @Isotr0py
- PerceptionLM – Fix missing video inputs (#39971) by @shuminghu
- Idefics – Fix device mismatch (#39981) by @zucchini-nlp
- Triton kernels – Remove triton_kernels dependency in favor of included kernels (#39926) by @SunMarc
- GPT-OSS MXFP4 – Enable on older hardware (sm75+) (#39940) by @matthewdouglas @SunMarc
- MXFP4 quantizer – Allow CPU inference with dequantize option (#39953) by @returnL
CI & Build
- CI stability – Post-GPT-OSS fixes for green CI (#39929) by @gante @LysandreJik
- Python
Published by ArthurZucker 10 months ago
transformers - GLM-4.5V preview based on 4.55.0
GLM-4.5V preview based on 4.55.0
New model added by the Z.ai team to transformers!
GLM-4.5V is a new multimodal reasoning model based on GLM-4.5-Air, which has 106B total and 12B active parameters.
It performs well on 42 benchmarks across various categories:
- Image reasoning (scene understanding, complex multi-image analysis, spatial recognition)
- Video understanding (long video segmentation and event recognition)
- GUI tasks (screen reading, icon recognition, desktop operation assistance)
- Complex chart & long document parsing (research report analysis, information extraction)
- Grounding (precise visual element localization)
To use it, install the transformers release branch:
```bash
pip install transformers-v4.55.0-GLM-4.5V-preview
```
Then you can run:
```python
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
```
- Python
Published by ArthurZucker 10 months ago
transformers - v4.55.0: New openai GPT OSS model!
Welcome GPT OSS, the new open-source model family from OpenAI!
For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.
Overview of Capabilities and Architecture
- 21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
- 4-bit quantization scheme using mxfp4 format. Only applied on the MoE weights. As stated, the 120B fits in a single 80 GB GPU and the 20B fits in a single 16GB GPU.
- Reasoning, text-only models; with chain-of-thought and adjustable reasoning effort levels.
- Instruction following and tool use support.
- Inference implementations using transformers, vLLM, llama.cpp, and ollama.
- Responses API is recommended for inference.
- License: Apache 2.0, with a small complementary use policy.
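As a rough sanity check on the memory figures above, the following sketch estimates weight storage only. It assumes MXFP4 stores 32-element blocks of 4-bit values with one shared 8-bit scale (4.25 bits per weight on average) and, as a simplification, treats every parameter as quantized. In practice only the MoE weights are quantized, and activations and the KV cache need additional memory, so real usage is higher. The helper name `weight_memory_gb` is illustrative and not part of transformers.

```python
# Back-of-the-envelope weight memory estimate (weights only, no activations or KV cache).

BF16_BITS = 16
# Assumption: MXFP4 packs 32 four-bit values plus one shared 8-bit scale per block.
MXFP4_BITS = (32 * 4 + 8) / 32  # 4.25 bits per weight on average


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for `num_params` weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Total parameter counts taken from the release notes above.
for name, params in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
    bf16 = weight_memory_gb(params, BF16_BITS)
    mxfp4 = weight_memory_gb(params, MXFP4_BITS)
    print(f"{name}: ~{bf16:.0f} GB in bfloat16, ~{mxfp4:.0f} GB if every weight were MXFP4")
```

Under these assumptions the fully quantized lower bounds stay below 16 GB for the 20B model and below 80 GB for the 120B model, which is consistent with the single-GPU claims above.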
Architecture
- Token-choice MoE with SwiGLU activations.
- When calculating the MoE weights, a softmax is taken over selected experts (softmax-after-topk).
- Each attention layer uses RoPE with 128K context.
- Alternate attention layers: full-context, and sliding 128-token window.
- Attention layers use a learned attention sink per-head, where the denominator of the softmax has an additional additive value.
- It uses the same tokenizer as GPT-4o and other OpenAI API models.
- Some new tokens have been incorporated to enable compatibility with the Responses API.
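Three of the points above (softmax-after-topk routing, the attention sink, and the sliding window) can be illustrated with a small pure-Python sketch. It is not the transformers implementation: the function names are hypothetical, the sink is modeled as a single extra logit per head, and the window is assumed to include the query position itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_softmax_after_topk(router_logits, k):
    """Pick the k highest-scoring experts, then normalize over those k logits only."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return dict(zip(top, weights))


def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator has one extra additive term (the learned sink).

    The returned probabilities therefore sum to less than 1.
    """
    peak = max(max(scores), sink_logit)
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]


def sliding_window_mask(seq_len, window):
    """Causal mask: query q may attend to key k only if 0 <= q - k < window."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


if __name__ == "__main__":
    print(route_softmax_after_topk([0.1, 2.0, -1.0, 1.0], k=2))
    print(sum(softmax_with_sink([1.0, 2.0], sink_logit=0.0)))
    for row in sliding_window_mask(seq_len=4, window=2):
        print(row)
```

Normalizing after the top-k selection means the chosen experts' weights always sum to 1, while the sink lets a head assign part of its attention mass to no token at all.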
The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
Flash Attention 3
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to run `pip install --upgrade kernels` and add the following line to your snippet:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with `torchrun --nproc_per_node=4 generate.py` on a system with 4 GPUs:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
```
Other optimizations
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.
transformers serve
You can use `transformers serve` to experiment locally with the models, without any other dependencies. You can launch the server with just: `transformers serve`
You can then send it requests using the Responses API:
```
# responses API
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
```
You can also send requests using the standard Completions API:
```
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
```
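The same chat completions request can be built from Python with only the standard library. This is a minimal sketch: the endpoint and payload fields mirror the curl call above, while the helper names `build_chat_request` and `iter_sse_payloads` are hypothetical, and the parser assumes the server streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which these notes do not spell out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the curl examples


def build_chat_request(prompt, model="openai/gpt-oss-120b", max_tokens=1000):
    """Build a streaming POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "max_tokens": max_tokens,
        "stream": True,
        "model": model,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_payloads(lines):
    """Yield the decoded JSON body of each `data:` line (assumed SSE framing)."""
    for raw in lines:
        line = raw.decode("utf-8").strip() if isinstance(raw, bytes) else raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)


# Usage, with `transformers serve` running locally:
# with urllib.request.urlopen(build_chat_request("hello")) as response:
#     for chunk in iter_sse_payloads(response):
#         print(chunk)
```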
Command A Vision
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
- [Model] Cohere2 Vision by @zucchini-nlp in #39810
MM Grounding DINO
The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.
MM Grounding DINO improves upon Grounding DINO by improving the contrastive class head and removing the parameter sharing in the decoder, which raises zero-shot detection performance on both COCO (50.6(+2.2) AP) and LVIS (31.9(+11.8) val AP and 41.4(+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
- Add MM Grounding DINO by @rziga in #37925
Bugfixes and improvements
- More robust tied weight test by @Cyrilvallez in #39681
- fix missing `model._tp_size` from ep refactor by @winglian in #39688
- Fix missing initialization of `FastSpeech2Conformer` by @bvantuan in #39689
- fix(tokenization): check token.content for trie by @pjo256 in #39587
- xpu optimization for generation case by @sywangyi in #39573
- [processors] add tests for helper fn by @zucchini-nlp in #39629
- update ernie model card by @jzhang533 in #39657
- [configuration] remove redundant `classmethod` by @zucchini-nlp in #38812
- Add self-hosted runner scale set workflow for mi325 CI by @jitesh-gupta in #39651
- PATCH: add back n-dim device-mesh + fix tp trainer saving by @S1ro1 in #39693
- [`CI`] Add Eric to comment slow ci by @vasqu in #39601
- Remove all expired deprecation cycles by @Cyrilvallez in #39725
- mllama outputs refactor by @itazap in #39643
- Update `QAPipelineTests::test_large_model_course` after #39193 by @ydshieh in #39666
- skip `Glm4MoeModelTest::test_torch_compile_for_training` by @ydshieh in #39670
- Fix `Qwen2AudioForConditionalGeneration.forward()` and `test_flash_attn_kernels_inference_equivalence` by @ebezzam in #39503
- Fix Layer device placement in Caches by @Cyrilvallez in #39732
- Fix cache-related tests by @zucchini-nlp in #39676
- Fix AMD dockerfile for audio models by @remi-or in #39669
- Superpoint fast image processor by @arkhamHack in #37804
- Add Fast Segformer Processor by @capnmav77 in #37024
- BLIPs clean-up by @zucchini-nlp in #35560
- extend more trainer test cases to XPU, all pass by @yao-matrix in #39652
- fix cache inheritance by @ArthurZucker in #39748
- [Fix] import two missing typos in `models/__init__.py` for typo checking by @hebangwen in #39745
- Fix: add back base model plan by @S1ro1 in #39733
- update `GemmaIntegrationTest::test_model_2b_bf16_dola` again by @ydshieh in #39731
- Update IMPORTANT_MODELS list by @ivarflakstad in #39734
- Fix mamba regression by @manueldeprada in #39728
- Apply several ruff SIM rules by @cyyever in #37283
- Use `--gpus all` in workflow files by @ydshieh in #39752
- AMD disable torchcodec by @ivarflakstad in #39757
- Avoid OOM when other tests are failing by @ydshieh in #39758
- Fix GPT2 with cross attention by @zucchini-nlp in #39754
- Support loading Qwen3 MoE GGUF by @ctcanbol in #39638
- Enable xpu allocator on `caching_allocator_warmup` by @jiqing-feng in #39654
- Fix version issue in modeling_utils.py by @Cyrilvallez in #39759
- add `libcst` to `extras["testing"]` in `setup.py` by @ydshieh in #39761
- [modenbert] fix regression by @zucchini-nlp in #39750
- 🌐 [i18n-KO] Translated `main_classes/peft.md` by @luckyvickyricky in #39515
- 🌐 [i18n-KO] Translated albert.md to Korean by @ahnjj in #39524
- 🌐 [i18n-KO] Translated `tvp.md` to Korean by @Kim-Ju-won in #39578
- 🌐 [i18n-KO] Translated `tokenizer.md` to Korean by @seopp in #39532
- 🌐 [i18n-KO] Translated `pipeline_gradio.md` to Korean by @AhnJoonSung in #39520
- 🌐 [i18n-KO] Translated `perf_train_gpu_one.md` to Korean by @D15M4S in #39552
- 🌐 [i18n-KO] Translated `how_to_hack_models.md` to Korean by @skwh54 in #39536
- fix(trainer): Correct loss scaling for incomplete gradient accumulation steps by @hutaiHang in #39659
- Fix `Cache.max_cache_len` max value for Hybrid models by @manueldeprada in #39737
- [docs] Ko doc fixes after toc update by @gante in #39660
- Remove python3.7 reference from doc link by @st81 in #39706
- Fix OmDet test after arg deprecation by @Cyrilvallez in #39766
- docs: Update EfficientLoFTR documentation by @sbucaille in #39620
- Standardize CLAP model card format by @yanamis in #39738
- Don't set `run_name` when none by @qgallouedec in #39695
- Fix Evolla and xLSTM tests by @Cyrilvallez in #39769
- enable static cache on vision encoder decoder by @jiqing-feng in #39773
- [ASR pipline] fix with datasets 4.0 by @eustlb in #39504
- more info in `model_results.json` by @ydshieh in #39783
- Super tiny update by @zucchini-nlp in #39727
- fix chameleonvision UT failure by @yao-matrix in #39646
- Fix an invalid condition by @cyyever in #39762
- Simplify conditional code by @cyyever in #39781
- Fix re-compilations for cross attention cache by @zucchini-nlp in #39788
- standardized BARThez model card by @EthanV431 in #39701
- Update model card for Cohere2 (Command R7B) by @arpon-kapuria in #39604
- Update mT5 model card by @dross20 in #39702
- Add callback to monitor progress in whisper transcription by @poke1024 in #37483
- fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test by @gante in #39300
- feat(tokenization): add encode_message to tokenize messages one by one by @pco111 in #39507
- [docs] fix korean docs yet again by @gante in #39813
- Update documentation for Cohere2Vision models by @kyle-cohere in #39817
- [cohere2 vision] move doc to multimodal section by @zucchini-nlp in #39820
- Fix broken links by @oToToT in #39809
- Fix bad markdown links by @ebezzam in #39819
- Fix tp cb by @ArthurZucker in #39838
- [VLMs] split out "get placeholder mask" to helper by @zucchini-nlp in #39777
- [`attn_implementation`] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
- [typecheck] proper export of private symbols by @cyyever in #39729
- Update ux cb by @ArthurZucker in #39845
- Fix responses add tests by @LysandreJik in #39848
- Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid by @yonigozlan in #39739
- [image-processing] deprecate `plot_keypoint_matching`, make `visualize_keypoint_matching` as a standard by @sbucaille in #39830
- Allow `TrackioCallback` to work when pynvml is not installed by @qgallouedec in #39851
- remove dtensors, not explicit by @ArthurZucker in #39840
- Improve `is_wandb_available` function to verify WandB installation by @qgallouedec in #39875
- Refactor label name handling for PEFT models in Trainer class by @qgallouedec in #39265
- Use comment to build doc on PRs by @ydshieh in #39846
- Add support for including in-memory videos (not just files/urls) in `apply_chat_template` by @akibjawad in #39494
- [core] Fix `attn_implementation` setter with missing `sub_configs` by @qubvel in #39855
- Fix quant docker for fp-quant by @SunMarc in #39641
- Rework add-new-model-like with modular and make test filenames coherent by @Cyrilvallez in #39612
- Replace `Tokenizer` with `PreTrainedTokenizerFast` in `ContinuousBatchProcessor` by @qgallouedec in #39858
- Set `torch.backends.cudnn.allow_tf32 = False` for CI by @ydshieh in #39885
- [typing] better return type hint for `AutoModelForCausalLM` and `AutoModelForImageTextToText` by @qubvel in #39881
- Fix link to models in README by @qubvel in #39880
- [DOCS] : Improved mimi model card by @rohitthewanderer in #39824
- Update cohere2 vision test by @ydshieh in #39888
- send some feedback when manually building doc via comment by @ydshieh in #39889
- Add support for `ModernBertForMultipleChoice` by @netique in #39232
- chore: update DETR model card by @arpon-kapuria in #39822
- Reorder serving docs by @LysandreJik in #39634
- [`Exaone4`] Fixes the attn implementation! by @ArthurZucker in #39906
- fix `test_working_of_tp` failure of accelerate ut by @yao-matrix in #39828
- [qwen] remove unnecessary CUDA sync in qwen25vl by @cyyever in #39870
- Avoid aliasing in cond's branches for torch 2.8 by @ydwu4 in #39488
- Fix misleading WandB error when WANDB_DISABLED is set by @notkisk in #39891
- Replace video_fps with fps in tests by @cyyever in #39898
- Fix eval thread fork bomb by @JustinVanHeek in #39717
- Fix aria tests by @zucchini-nlp in #39879
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @capnmav77
- Add Fast Segformer Processor (#37024)
- @cyyever
- Apply several ruff SIM rules (#37283)
- Fix an invalid condition (#39762)
- Simplify conditional code (#39781)
- [typecheck] proper export of private symbols (#39729)
- [qwen] remove unnecessary CUDA sync in qwen25vl (#39870)
- Replace video_fps with fps in tests (#39898)
- @rziga
- Add MM Grounding DINO (#37925)
- Python
Published by LysandreJik 10 months ago
transformers - Patch release 4.54.1
Patch release 4.54.1
We had quite a lot of bugs that got through! The release was a bit rushed, sorry everyone! 🤗 Mostly cache fixes, as we now have a layered cache, and fixes to distributed.
- Fix `Cache.max_cache_len` max value for Hybrid models, @manueldeprada, @Cyrilvallez, #39737
- [modenbert] fix regression, @zucchini-nlp, #39750
- Fix version issue in modeling_utils.py, @Cyrilvallez, #39759
- Fix GPT2 with cross attention, @zucchini-nlp, #39754
- Fix mamba regression, @manueldeprada, #39728
- Fix: add back base model plan, @S1ro1, #39733
- fix cache inheritance, #39748
- Fix cache-related tests, @zucchini-nlp, #39676
- Fix Layer device placement in Caches, @Cyrilvallez, #39732
- PATCH: add back n-dim device-mesh + fix tp trainer saving, @S1ro1, @SunMarc, #39693
- fix missing `model._tp_size` from ep refactor, @winglian, #39688
- Python
Published by ArthurZucker 10 months ago
transformers - v4.54.0: Kernels, Transformers Serve, Ernie, Voxtral, LFM2, DeepSeek v2, ModernBERT Decoder...
Important news!
In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:
- `transformers` is bloated
- `transformers` is slow
Our team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."
The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).
This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
New models
Ernie 4.5
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. This specific model is the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
Other models from the family can be found at Ernie 4.5 MoE.
- [`Ernie 4.5`] Add ernie text models by @vasqu in #39228
Voxtral
Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's release blog post.
The model is available in two checkpoints:
- 3B: mistralai/Voxtral-Mini-3B-2507
- 24B: mistralai/Voxtral-Small-24B-2507
Key Features
Voxtral builds on Ministral-3B by adding audio processing capabilities:
- Transcription mode: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
- Long-form context: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
- Integrated Q&A and summarization: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
- Multilingual support: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
- Function calling via voice: Can trigger functions or workflows directly from spoken input based on detected user intent.
- Text capabilities: Maintains the strong text processing performance of its Ministral-3B foundation.
- Add voxtral by @eustlb in #39429
LFM2
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
- LFM2 by @paulpak58 in #39340
DeepSeek v2
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
- Add DeepSeek V2 Model into Transformers by @VladOS95-cyber in #36400
ModernBERT Decoder models
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
- Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! by @orionw in #38967
EoMT
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
- ✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation by @yaswanth19 in #37610
Doge
Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/modeldoc/dogearchitecture.png" alt="drawing" width="600" />
- Add Doge model by @LoserCheems in #35891
AIM v2
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
- Add Aimv2 model by @yaswanth19 in #36625
PerceptionLM
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.
- PerceptionLM by @shuminghu in #37878
Efficient LoFTR
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
- Add EfficientLoFTR model by @sbucaille in #36355
EVOLLA
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
- Add evolla rebase main by @zhoubay in #36232
DeepSeek VL
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
- Add support for DeepseekAI's DeepseekVL by @geetu040 in #36248
xLSTM
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
- Add xlstm model by @Cyrilvallez in #39665
EXAONE 4.0
EXAONE 4.0 is a language model that integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
- Add EXAONE 4.0 model by @lgai-exaone in #39129
Parallelisation
We've added Expert Parallel support for Llama4; the next release will include it for all models! Just set a `distributed_config` with `enable_expert_parallel=True`. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer can run in parallel (unlike the previous TP approach, which requires more communication), significantly improving scalability and memory efficiency.
- Add ep by @ArthurZucker in #39501
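As a rough mental model of expert parallelism, each rank can own a slice of the experts and only process the tokens routed to them. The sketch below is a toy illustration in plain Python, not the Transformers implementation; the helper names `experts_on_rank` and `tokens_for_rank` are hypothetical, and it assumes top-1 routing and an even split of experts across ranks.

```python
def experts_on_rank(num_experts: int, world_size: int, rank: int) -> range:
    """Contiguous slice of expert indices owned by `rank` (assumes an even split)."""
    if num_experts % world_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by world_size")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def tokens_for_rank(routing: list[int], num_experts: int, world_size: int, rank: int) -> list[int]:
    """Indices of the tokens whose chosen expert lives on `rank`."""
    owned = experts_on_rank(num_experts, world_size, rank)
    return [token for token, expert in enumerate(routing) if expert in owned]


# Hypothetical top-1 routing decisions for 6 tokens over 8 experts on 4 ranks.
routing = [0, 7, 3, 2, 5, 1]
for rank in range(4):
    print(rank, list(experts_on_rank(8, 4, rank)), tokens_for_rank(routing, 8, 4, rank))
```

Each rank only stores and runs its own experts, which is where the memory savings come from; the tokens still have to be exchanged between ranks before and after the expert computation.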
Quantization
FP Quant
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
```python
import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use `FPQuantConfig(pseudoquant=True)` to emulate quantization (no QuTLASS needed).
The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
- FP-Quant support by @BlackSamorez in #38696
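To make the MXFP4 format mentioned above more concrete, here is a minimal pure-Python sketch of block-wise pseudo-quantization: every block of 32 values shares one power-of-two scale, and each value is rounded to the nearest FP4 (E2M1) magnitude. It only illustrates the number format and is not the FP-Quant algorithm or the QuTLASS kernels; the function names are hypothetical, and tie-breaking during rounding is simplified.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is handled separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # MXFP4 shares one power-of-two scale per block of 32 values


def pseudo_quantize_block(block):
    """Round one block to MXFP4-representable values and return them as floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale, chosen so the largest magnitude lands near the
    # top of the E2M1 range (whose largest exponent is 2).
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - 2)
    out = []
    for x in block:
        magnitude = min(abs(x) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        out.append(math.copysign(nearest * scale, x))
    return out


def pseudo_quantize(values):
    """Apply block-wise pseudo-quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(pseudo_quantize_block(values[start:start + BLOCK_SIZE]))
    return out


print(pseudo_quantize([0.1, -0.7, 2.9, 5.5]))
```

Emulating the rounding in full precision like this shows the accuracy impact of the format without needing hardware FP4 support, which is the general idea behind a pseudo-quantization mode.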
Kernels
The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this here
Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
```python
model.set_attn_implementation("kernels-community/flash-attn3")
```
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
- Kernels flash attn by @ArthurZucker in #39474
Transformers Serve
https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e
Over the past few months, we have been putting more and more functionality into the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat into a new, separate utility called transformers serve.
This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
- /v1/chat/completions
- /v1/responses
- /v1/audio/transcriptions
- /v1/models
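Because the endpoints follow the OpenAI API shape, a chat request can be built with nothing but the standard library. The sketch below only constructs the request; the base URL `http://localhost:8000` and the model name are assumptions chosen for illustration, and `build_chat_request` is a hypothetical helper, so adjust them to match your running server.

```python
import json
import urllib.request

# Assumption: the local server started by `transformers serve` listens here.
BASE_URL = "http://localhost:8000"


def build_chat_request(model: str, user_message: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The model name is a placeholder; use any chat model available to your server.
request = build_chat_request("Qwen/Qwen3-8B", "Hello!")
print(request.get_method(), request.full_url)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```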
Relevant commits:
- Split `transformers chat` and `transformers serve` by @LysandreJik in #38443
- [serve] Cursor support, move docs into separate page, add more examples by @gante in #39133
- Fix continuous batching in `transformers serve` by @LysandreJik in #39149
- [server] add tests and fix passing a custom `generation_config` by @gante in #39230
- [serve] Model name or path should be required by @LysandreJik in #39178
- Random serve fixes by @pcuenca in #39176
- [tests] tag serve tests as slow by @gante in #39343
- Responses API in `transformers serve` by @LysandreJik in #39155
- [serve] Add speech to text (`/v1/audio/transcriptions`) by @gante in #39434
- Transformers serve VLM by @LysandreJik in #39454
Refactors
Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. One metric we track to see how the refactors affect our code is the number of lines in a given model file; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.
See the evolution here:
Some notable refactors:
KV caching
KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
- [cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106
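The per-layer idea can be pictured with a toy sketch: one cache object per layer, each free to use its own retention policy, with a thin container dispatching updates by layer index. The classes below are hypothetical and written in plain Python for illustration only; they are not the Transformers cache classes.

```python
class FullAttentionLayerCache:
    """Toy cache for a full-attention layer: keeps every key/value seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values


class SlidingWindowLayerCache(FullAttentionLayerCache):
    """Toy cache for a sliding-window layer: keeps only the last `window` entries."""

    def __init__(self, window):
        super().__init__()
        self.window = window

    def update(self, key, value):
        super().update(key, value)
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]
        return self.keys, self.values


class ToyPerLayerCache:
    """Holds one cache object per layer and dispatches updates by layer index."""

    def __init__(self, layers):
        self.layers = layers

    def update(self, layer_idx, key, value):
        return self.layers[layer_idx].update(key, value)


# A hybrid model: layer 0 uses full attention, layer 1 a sliding window of 3.
cache = ToyPerLayerCache([FullAttentionLayerCache(), SlidingWindowLayerCache(window=3)])
for step in range(5):
    for layer_idx in range(2):
        cache.update(layer_idx, f"k{step}", f"v{step}")

print(len(cache.layers[0].keys), cache.layers[1].keys)
```

Keeping the policy inside each layer object means a new attention type only needs a new layer class, while the container stays unchanged.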
Handling specific attributes like output_attentions or output_hidden_states
Such attributes require very specific handling within the forward call, while they're not important to understand how the model works. We remove that code but keep the functionality by providing a better utility to handle it.
- Refactor the way we handle outputs for new llamas and new models by @ArthurZucker in #39120
Setting the attention implementation
We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.
- [refactor] set attention implementation by @zucchini-nlp in #38974
Breaking changes
- [Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! by @eustlb in #36632
- 🚨 Don't use cache in non-generative models by @zucchini-nlp in #38751
- 🚨🚨🚨 [eomt] make EoMT compatible with pipeline by @yaswanth19 in #39122
- 🚨🚨 Fix and simplify attention implementation dispatch and subconfigs handling by @Cyrilvallez in #39423
- 🚨🚨🚨 [Trainer] Enable `average_tokens_across_devices` by default in `TrainingArguments` by @Krish0909 in #39395
- 🔴 Fix EnCodec internals and integration tests by @ebezzam in #39431
Bugfixes and improvements
- Add StableAdamW Optimizer by @SunMarc in #39446
- [`Flex Attn`] Fix torch 2.5.1 incompatibilities by @vasqu in #37406
- fix `test_compare_unprocessed_logit_scores` by @ydshieh in #39053
- fix `t5gemma` tests by @ydshieh in #39052
- Update SuperPoint model card by @sbucaille in #38896
- fix `layoutlmv3` tests by @ydshieh in #39050
- [docs] Model contribution by @stevhliu in #38995
- Update PEGASUS-X model card by @dross20 in #38971
- [docs] @auto_docstring by @stevhliu in #39011
- [docs] Tensor parallelism by @stevhliu in #38241
- [Whisper] fix shape mismatch in tests by @eustlb in #39074
- Cleanup Attention class for Siglip and dependent models by @yaswanth19 in #39040
- fix `Gemma3nProcessorTest` by @ydshieh in #39068
- Fix initialization of OneFormer by @bvantuan in #38901
- Uninstallling Flash attention from quantization docker by @MekkCyber in #39078
- fix a bunch of XPU UT failures on stock PyTorch 2.7 and 2.8 by @yao-matrix in #39069
- Pipeline: fix unnecessary warnings by @eustlb in #35753
- fix `mistral3` tests by @ydshieh in #38989
- fixed typo for docstring in prepare_inputs method by @JINO-ROHIT in #39071
- TST PEFT integration tests with pipeline generate by @BenjaminBossan in #39086
- add fast image processor nougat by @NahieliV in #37661
- Add Fast Image Processor for mobileViT by @MinJu-Ha in #37143
- guard torch distributed check by @tvukovic-amd in #39057
- fix `dots1` tests by @ydshieh in #39088
- Add Fast Image Processor for Chameleon by @farrosalferro in #37140
- Fix: unprotected import of tp plugin by @S1ro1 in #39083
- TST Fix PEFT integration test bitsandbytes config by @BenjaminBossan in #39082
- [fix] Add FastSpeech2ConformerWithHifiGan by @stevhliu in #38207
- Sandeepyadav1478/2025 06 19 deberta v2 model card update by @sandeepyadav1478 in #38895
- Fixes the failing test `test_is_split_into_words` in `test_pipelines_token_classification.py` by @st81 in #39079
- skip some `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39092
- fix UT failures on XPU w/ stock PyTorch 2.7 & 2.8 by @yao-matrix in #39116
- Fix some bug for finetune and batch infer For GLM-4.1V by @zRzRzRzRzRzRzR in #39090
- docs: Gemma 3n audio encoder by @RyanMullins in #39087
- All CI jobs with A10 by @ydshieh in #39119
- Licenses by @LysandreJik in #39127
- Fix chat by @gante in #39128
- Enable XPU doc by @jiqing-feng in #38929
- docs: correct two typos in awesome-transformers.md by @VladimirGutuev in #39102
- switch default xpu tp backend to pytorch built-in XCCL from pytorch 2.8 by @yao-matrix in #39024
- Update BigBirdPegasus model card by @dross20 in #39104
- [Whisper] update token timestamps tests by @eustlb in #39126
- Fix key mapping for VLMs by @bvantuan in #39029
- Several fixes for Gemma3n by @Cyrilvallez in #39135
- fix caching_allocator_warmup with tie weights by @jiqing-feng in #39070
- feat: support indivisible shards for TP model loading and TPlizing. by @kmehant in #37220
- [qwen2-vl] fix FA2 inference by @zucchini-nlp in #39121
- [typing] LlamaAttention return typehint by @ArkVex in #38998
- [VLMs] support passing embeds along with pixels by @zucchini-nlp in #38467
- [superglue] fix wrong concatenation which made batching results wrong by @sbucaille in #38850
- Fix missing fsdp & trainer jobs in daily CI by @ydshieh in #39153
- Fix: Ensure wandb logs config in offline mode by @DavidS2106 in #38992
- Change `@lru_cache()` to `@lru_cache` to match styles from #38883. by @rasmi in #39093
- fix: remove undefined variable by @ybkurt in #39146
- update bnb ground truth by @jiqing-feng in #39117
- Suggest jobs to use in `run-slow` by @ydshieh in #39100
- Update expected values (after switching to A10) by @ydshieh in #39157
- fix `llama` tests by @ydshieh in #39161
- Add activation sparsity reference in gemma3n doc by @ChongYou in #39160
- fix default value of config to match checkpionts in LLaVa-OV models by @ved1beta in #39163
- [smolvlm] fix video inference by @zucchini-nlp in #39147
- Fix multimodal processor get duplicate arguments when receive kwargs for initialization by @Isotr0py in #39125
- Blip2 fixes by @remi-or in #39080
- Fix missing initializations for models created in 2024 by @bvantuan in #38987
- Reduce Glm4v model test size significantly by @Cyrilvallez in #39173
- [docs] ViTPose by @stevhliu in #38630
- [generate] document non-canonical beam search default behavior by @gante in #39000
- Update expected values (after switching to A10) - part 2 by @ydshieh in #39165
- Update expected values (after switching to A10) - part 3 by @ydshieh in #39179
- Test fixes for Aria (and some Expectation for llava_next_video) by @remi-or in #39131
- [glm4v] fix video inference by @zucchini-nlp in #39174
- when delaying optimizer creation only prepare the model by @winglian in #39152
- Decouple device_map='auto' and tp_plan='auto' by @SunMarc in #38942
- Fix many HPU failures in the CI by @IlyasMoutawwakil in #39066
- [`Dia`] Change ckpt path in docs by @vasqu in #39181
- Update expected values (after switching to A10) - part 4 by @ydshieh in #39189
- [typing] better return typehints for `from_pretrained` by @qubvel in #39184
- Update expected values (after switching to A10) - part 5 by @ydshieh in #39205
- Update expected values (after switching to A10) - part 6 by @ydshieh in #39207
- Add packed tensor format support for flex/sdpa/eager through the mask! by @Cyrilvallez in #39194
- Update expected values (after switching to A10) - part 7 by @ydshieh in #39218
- Update expected values (after switching to A10) - part 8 - Final by @ydshieh in #39220
- [video processors] Support float fps for precise frame sampling by @zrohyun in #39134
- Expectations re-order and corrected FA3 skip by @remi-or in #39195
- [vjepa2] replace einsum with unsqueeze by @xenova in #39234
- Fix missing fast tokenizer/image_processor in whisper/qwen2.5-omni processor by @Isotr0py in #39244
- [modular] Follow global indexing and attribute setting, and their dependencies by @Cyrilvallez in #39180
- fix typo in Gemma3n notes by @davanstrien in #39196
- Don't send new comment if the previous one is less than 30 minutes (unless the content is changed) by @ydshieh in #39170
- fix bug using FSDP V1 will lead to model device not properly set by @kaixuanliu in #39177
- Make _compute_dynamic_ntk_parameters exportable by @xadupre in #39171
- [modular] Simplify logic and docstring handling by @Cyrilvallez in #39185
- [bugfix] fix flash attention 2 unavailable error on Ascend NPU by @FightingZhen in #39166
- fix `fastspeech2_conformer` tests by @ydshieh in #39229
- RotaryEmbeddings change `is not None` -> `isinstance(..., dict)` by @qubvel in #39145
- Fix patch helper by @Cyrilvallez in #39216
- enable xpu on kv-cache and hqq doc by @jiqing-feng in #39246
- adjust input and output texts for test_modeling_recurrent_gemma.py by @kaixuanliu in #39190
- Update tiny-agents example by @Wauplin in #39245
- Add Korean translation for glossary.md by @JoosunH in #38804
- Clarify per_device_train_batch_size scaling in TrainingArguments by @Shohail-Ismail in #38…
- Add `segmentation_maps` support to MobileNetV2ImageProcessor by @simonreise in #37312
- Simplify Mixtral and its modular children by @Cyrilvallez in #39252
- fix some flaky tests in `tests/generation/test_utils.py` by @ydshieh in #39254
- Update LED model card by @dross20 in #39233
- Glm 4 doc by @zRzRzRzRzRzRzR in #39247
- fix xpu failures on PT 2.7 and 2.8 w/o IPEX and enable hqq cases on XPU by @yao-matrix in #39187
- Fix license text, duplicate assignment, and typo in constant names by @gudwls215 in #39250
- Skip `test_eager_matches sdpa generate` and update an integration test for blip-like models by @ydshieh in #39248
- remove broken block by @molbap in #39255
- fix(generation): stop beam search per-instance when heuristic satisfied by @guang-yng in #38778
- fix recompiles due to instance key, and deepcopy issues by @ArthurZucker in #39270
- Fix errors when use verl to train GLM4.1v model by @kaln27 in #39199
- [CI] fix docs by @gante in #39273
- [pagged-attention] fix off-by-1 error in pagged attention generation by @kashif in #39258
- [smollm3] add tokenizer mapping for `smollm3` by @gante in #39271
- Refactor `PretrainedConfig.__init__` method to make it more explicit by @qubvel in #39158
- fix flaky `test_generate_compile_model_forward` by @ydshieh in #39276
- [lightglue] add support for remote code DISK keypoint detector by @sbucaille in #39253
- Add torchcodec in docstrings/tests for `datasets` 4.0 by @lhoestq in #39156
- Update T5gemma by @bzhangGo in #39210
- [Tests] Update model_id in AIMv2 Tests by @yaswanth19 in #39281
- Fix SDPA attention precision issue in Qwen2.5-VL by @JJJYmmm in #37363
- [flash attn 3] bring back flags by @zucchini-nlp in #39294
- fix `aria` tests by @ydshieh in #39277
- skip `test_torchscript_*` for now until the majority of the community ask for it by @ydshieh in #39307
- [modular] Allow method with the same name in case of @property decorator by @Cyrilvallez in #39308
- [sliding window] revert and deprecate by @zucchini-nlp in #39301
- 🌐 [i18n-KO] Translated quark.md to Korean by @maximizemaxwell in #39268
- Fix consistency and a few docstrings warnings by @Cyrilvallez in #39314
- add `stevhliu` to the list in `self-comment-ci.yml` by @ydshieh in #39315
- Updated the Model docs - for the MARIAN model by @emanrissha in #39138
- skip files in `src/` for doctest (for now) by @ydshieh in #39316
- docs: update LLaVA-NeXT model card by @Bpriya42 in #38894
- Fix typo: langauge -> language by @tomaarsen in #39317
- Granite speech speedups by @avihu111 in #39197
- Fix `max_length_q` and `max_length_k` types to `flash_attn_varlen_func` by @HollowMan6 in #37206
- enable static cache on TP model by @jiqing-feng in #39164
- Fix broken SAM after #39120 by @yonigozlan in #39289
- Delete deprecated stuff by @zucchini-nlp in #38838
- fix Glm4v batch videos forward by @Kuangdd01 in #39172
- fix `phi3` tests by @ydshieh in #39312
- Handle DAC conversion when using weight_norm with newer PyTorch versions by @edwko in #36393
- [modeling][lfm2] LFM2: Remove deprecated seen_tokens by @paulpak58 in #39342
- [Core] [Offloading] Enable saving offloaded models with multiple shared tensor groups by @kylesayrs in #39263
- Add a default value for `position_ids` in masking_utils by @Cyrilvallez in #39310
- [modular] speedup check_modular_conversion with multiprocessing by @qubvel in #37456
- Updated Switch Transformers model card with standardized format (Issue #36979) by @giuseppeCoccia in #39305
- Fix link for testpypi by @Cyrilvallez in #39360
- update cb TP by @ArthurZucker in #39361
- fix failing `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39259
- Verbose error in fix mode for utils/check_docstrings.py by @manueldeprada in #38915
- Remove device check in HQQ quantizer by @learning-chip in #39299
- Add mistral common support by @juliendenize in #38906
- Update Readme to Run Multiple Choice Script from Example Directory by @eromomon in #39323
- Updated CamemBERT model card to new standardized format by @MShaheerMalik77 in #39227
- fix gpt2 usage doc by @Xiang-cd in #39351
- Update Model Card for Encoder Decoder Model by @ParagEkbote in #39272
- update docker file to use latest `timm` (for `perception_lm`) by @ydshieh in #39380
- Fix overriding Fast Image/Video Processors instance attributes affect other instances by @yonigozlan in #39363
- [shieldgemma] fix checkpoint loading by @zucchini-nlp in #39348
- [BLIP] remove cache from Qformer by @zucchini-nlp in #39335
- [Qwen2.5-VL] Fix torch.finfo() TypeError for integer attention_mask_tensor by @dsnsabari in #39333
- Deprecate AutoModelForVision2Seq by @zucchini-nlp in #38900
- Fix Lfm2 and common tests by @Cyrilvallez in #39398
- [examples] fix do_reduce_labels argument for run_semantic_segmentation_no_trainer by @eromomon in #39322
- Totally rewrite how pipelines load preprocessors by @Rocketknight1 in #38947
- Use np.pad instead of np.lib.pad. by @rasmi in #39346
- [Docs] Fix typo in CustomTrainer compute_loss method and adjust loss reduction logic by @MilkClouds in #39391
- Update phi4_multimodal.md by @tanuj-rai in #38830
- [siglip] fix pooling comment by @sameerajashyam in #39378
- Fix typo in `/v1/models` output payload by @alvarobartt in #39414
- support loading qwen3 gguf by @44670 in #38645
- Ignore extra position embeddings weights for ESM by @Rocketknight1 in #39063
- set document_question_answering pipeline _load_tokenizer to True by @jiqing-feng in #39411
- Fix invalid property by @cyyever in #39384
- refactor: remove `set_tracer_provider` and `set_meter_provider` calls by @McPatate in #39422
- Fix bugs from pipeline preprocessor overhaul by @Rocketknight1 in #39425
- Fix bugs in pytorch example run_clm when streaming is enabled by @HRezaei in #39286
- Remove deprecated audio utils functions by @jiangwangyi in #39330
- Remove residual quantization attribute from dequantized models by @DWarez in #39373
- handle training summary when creating modelcard but offline mode is set by @winglian in #37095
- [vlm] fix loading of retrieval VLMs by @zucchini-nlp in #39242
- docs: update SuperGlue docs by @sbucaille in #39406
- docs: update LightGlue docs by @sbucaille in #39407
- CI workflow for performed test regressions by @ahadnagy in #39198
- [autodocstring] add video and audio inputs by @zucchini-nlp in #39420
- [Core] [Offloading] Fix saving offloaded submodules by @kylesayrs in #39280
- Remove double soft-max in load-balancing loss. Fixes #39055 . by @rudolfwilliam in #39056
- Fixed a bug calculating cross entropy loss in `JetMoeForCausalLM` by @Phoenix-Shen in #37830
- [chat template] add a testcase for kwargs by @zucchini-nlp in #39415
- Fix L270 - hasattr("moe_args") returning False error by @wjdghks950 in #38715
- Defaults to adamw_torch_fused for Pytorch>=2.8 by @cyyever in #37358
- Change log level from warning to info for scheduled request logging in `ContinuousBatchProcessor` by @qgallouedec in #39372
- Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer by @richardodliu in #31870
- Fix missing definition of diff_file_url in notification service by @ahadnagy in #39445
- add test scanner by @molbap in #39419
- Remove runtime conditions for type checking by @cyyever in #37340
- docs: add missing numpy import to minimal example by @IliasAarab in #39444
- [cache] make all classes cache compatible finally by @zucchini-nlp in #38635
- Fix typo in generation configuration for Janus model weight conversion by @thisisiron in #39432
- Better typing for model.config by @qubvel in #39132
- [Bugfix] [Quantization] Remove unused init arg by @kylesayrs in #39324
- Fix processor tests by @zucchini-nlp in #39450
- Remove something that should have never been there by @ArthurZucker in #38254
- make the loss context manager easier to extend by @winglian in #39321
- Fixes #39204: add fallback if get_base_model missing by @sebastianvlad1 in #39226
- [`CI`] Fix partially red CI by @vasqu in #39448
- Updated Megatron conversion script for gpt2 checkpoints by @LckyLke in #38969
- Fix indentation bug in SmolVLM image processor causing KeyError by @Krish0909 in #39452
- fix cached file error when repo type is dataset by @hiyouga in #36909
- Improve grammar and clarity in perf_hardware.md by @ridima11 in #39428
- create ijepa modelcard (ref : PR #36979 ). by @dhruvmalik007 in #39354
- Corrections to PR #38642 and enhancements to Wav2Vec2Processor call and pad docstrings by @renet10 in #38822
- fix(pipelines): QA pipeline returns fewer than top_k results in batch mode by @yushi2006 in #39193
- fix max_length calculating using cu_seq_lens by @KKZ20 in #39341
- Fix tests due to breaking change in accelerate by @SunMarc in #39451
- Use newer typing notation by @cyyever in #38934
- fix a comment typo in utils.py by @klimarissa17 in #39459
- Update `GemmaIntegrationTest::test_model_2b_bf16_dola` by @ydshieh in #39362
- Fix convert_and_export_with_cache failures for GPU models by @Stonepia in #38976
- Enable some ruff checks for performance and readability by @cyyever in #39383
- fix: ImageTextToTextPipeline handles user-defined generation_config by @peteryschneider in #39374
- Update integration_utils.py by @zhaiji0727 in #39469
- Add unified logits_to_keep support to LLMClass by @hellopahe in #39472
- Fix typing order by @Tavish9 in #39467
- [dependencies] temporary pyarrow pin by @gante in #39496
- Slack CI bot: set default result for non-existing artifacts by @ahadnagy in #39499
- [dependencies] Update `datasets` pin by @gante in #39500
- [chat template] return assistant mask in processors by @zucchini-nlp in #38545
- [gemma3] Fix do_convert_rgb in image processors. by @MohitIntel in #39438
- Fix BatchEncoding.to() for nested elements by @eginhard in #38985
- Add fast image processor SAM by @yonigozlan in #39385
- Improve @auto_docstring doc and rename `args_doc.py` to `auto_docstring.py` by @yonigozlan in #39439
- Update SAM/SAM HQ attention implementation + fix Cuda sync issues by @yonigozlan in #39386
- Fix placeholders replacement logic in auto_docstring by @yonigozlan in #39433
- [gemma3] support sequence classification task by @zucchini-nlp in #39465
- [qwen2 vl] fix packing with all attentions by @zucchini-nlp in #39447
- GLM-4 Update by @zRzRzRzRzRzRzR in #39393
- Fix bad tensor shape in failing Hubert test. by @ebezzam in #39502
- Fix the check in flex test by @Cyrilvallez in #39548
- Rename `_supports_flash_attn_2` in examples and tests by @zucchini-nlp in #39471
- Fix Qwen Omni integration test by @Cyrilvallez in #39553
- Fix pylint warnings by @cyyever in #39477
- Raise `TypeError` instead of ValueError for invalid types by @Sai-Suraj-27 in #38660
- Fix missing initializations for models created in 2023 by @bvantuan in #39239
- use the enable_gqa param in torch.nn.functional.scaled_dot_product_at… by @sywangyi in #39412
- Fix Docstring of BarkProcessor by @st81 in #39546
- Refactor `MambaCache` to `modeling_mamba.py` by @manueldeprada in #38086
- fix ndim check of device_mesh for TP by @winglian in #39538
- [Fast image processor] refactor fast image processor glm4v by @yonigozlan in #39490
- 🌐 [i18n-KO] Translated `perf_infer_gpu_multi.md` to Korean by @luckyvickyricky in #39441
- Refactor embedding input/output getter/setter by @molbap in #39339
- [Fast image processors] Improve handling of image-like inputs other than images (segmentation_maps) by @yonigozlan in #39489
- [`CI`] Fix post merge ernie 4.5 by @vasqu in #39561
- Update modernbertdecoder docs by @orionw in #39453
- Update OLMoE model card by @nlhmnlhmnlhm in #39344
- [gemma3] fix bidirectional image mask by @zucchini-nlp in #39396
- Bump AMD container for 2.7.1 PyTorch by @ahadnagy in #39458
- Fixes needed for n-d parallelism and TP by @winglian in #39562
- [timm_wrapper] add support for gradient checkpointing by @Yozer in #39287
- Add AMD test expectations to DETR model by @ahadnagy in #39539
- [docs] update attention implementation and cache docs by @zucchini-nlp in #39547
- [docs] Create page on inference servers with transformers backend by @zucchini-nlp in #39550
- Add AMD expectations to Mistral3 tests by @ahadnagy in #39481
- Add AMD GPU expectations for LLaVA tests by @ahadnagy in #39486
- General weight initialization scheme by @Cyrilvallez in #39579
- [cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106
- Update `docs/source/ko/_toctree.yml` by @jungnerd in #39516
- updated mistral3 model card by @cassiasamp in #39531
- [Paged-Attention] Handle continuous batching for repetition penalty by @kashif in #39457
- Torchdec RuntimeError catch by @SunMarc in #39580
- Fix link in "Inference server backends" doc by @hmellor in #39589
- [WIP] Add OneformerFastImageProcessor by @Player256 in #38343
- 🎯 Trackio integration by @qgallouedec in #38814
- Mask2former & Maskformer Fast Image Processor by @SangbumChoi in #35685
- Fix DynamicCache and simplify Cache classes a bit by @Cyrilvallez in #39590
- Generic task-specific base classes by @Cyrilvallez in #39584
- [Trackio] Allow single-gpu training and monitor power by @qgallouedec in #39595
- Rename `supports_static_cache` to `can_compile_fullgraph` by @zucchini-nlp in #39505
- FP-Quant support by @BlackSamorez in #38696
- fix moe routing_weights by @llbdyiu66 in #39581
- [idefics3] fix for vLLM by @zucchini-nlp in #39470
- enable triton backend on awq xpu by @jiqing-feng in #39443
- Allow `device_mesh` have multiple dim by @S1ro1 in #38949
- Fix typos and grammar issues in documentation and code by @cluster2600 in #39598
- Fix important models CI by @molbap in #39576
- Move openai import by @ebezzam in #39613
- Fix DAC integration tests and checkpoint conversion. by @ebezzam in #39313
- Feature/standardize opt model card by @JoestarGagan in #39568
- standardized YOLOS model card according to template in #36979 by @EthanV431 in #39528
- [Docs] Translate audio_classification.md from English to Spanish by @weezymatt in #39513
- Update recent processors for vLLM backend by @zucchini-nlp in #39583
- [efficientloftr] fix model_id in tests by @sbucaille in #39621
- [timm] new timm pin by @gante in #39640
- [Voxtral] values for A10 runners by @eustlb in #39605
- revert behavior of _prepare_from_posids by @winglian in #39622
- Add owlv2 fast processor by @lmarshall12 in #39041
- [attention] fix test for packed padfree masking by @zucchini-nlp in #39582
- Fix: explicit not none check for tensors in flash attention by @jeffrey-dot-li in #39639
- revert change to cu_seqlen_k and max_k when preparing from position_ids by @winglian in #39653
- Make pytorch examples UV-compatible by @lhoestq in #39635
- [docs] fix ko cache docs by @gante in #39644
- make fixup by @gante in #39661
- fix(voxtral): correct typo in apply_transcription_request by @rev2607 in #39572
- Rename huggingface_cli to hf by @LysandreJik in #39630
- 🚨[Fast Image Processor] Force Fast Image Processor for Qwen2VL/25_VL + Refactor by @yonigozlan in #39591
- Fix ModernBERT Decoder model by @qubvel in #39671
- [CI] revert device in `test_export_static_cache` by @gante in #39662
- [`Ernie 4.5`] Post merge adaptations by @vasqu in #39664
- Delete bad rebasing functions by @Cyrilvallez in #39672
- Fixes the BC by @ArthurZucker in #39636
- fix `kyutai` tests by @ydshieh in #39416
- update expected outputs for whisper after #38778 by @ydshieh in #39304
- Add missing flag for CacheLayer by @Cyrilvallez in #39678
- Fix auto_docstring crashing when dependencies are missing by @yonigozlan in #39564
- fix: HWIO to OIHW by @RyanMullins in #39200
- Use auto_docstring for perception_lm fast image processor by @yonigozlan in #39679
- bad_words_ids no longer slow on mps by @DWarez in #39556
- Support `typing.Literal` as type of tool parameters or return value by @grf53 in #39633
- fix break for ckpt without _tp_plan by @MoyanZitto in #39658
- Fix tied weight test by @Cyrilvallez in #39680
- Add padding-free to Granite hybrid moe models by @garrett361 in #39677
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @sbucaille
- Update SuperPoint model card (#38896)
- [superglue] fix wrong concatenation which made batching results wrong (#38850)
- [lightglue] add support for remote code DISK keypoint detector (#39253)
- docs: update SuperGlue docs (#39406)
- docs: update LightGlue docs (#39407)
- Add EfficientLoFTR model (#36355)
- @yaswanth19
- Cleanup Attention class for Siglip and dependent models (#39040)
- ✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation (#37610)
- 🚨🚨🚨 [eomt] make EoMT compatible with pipeline (#39122)
- Add Aimv2 model (#36625)
- [Tests] Update model_id in AIMv2 Tests (#39281)
- @bvantuan
- Fix initialization of OneFormer (#38901)
- Fix key mapping for VLMs (#39029)
- Fix missing initializations for models created in 2024 (#38987)
- Fix missing initializations for models created in 2023 (#39239)
- @NahieliV
- add fast image processor nougat (#37661)
- @MinJu-Ha
- Add Fast Image Processor for mobileViT (#37143)
- @zRzRzRzRzRzRzR
- Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
- Glm 4 doc (#39247)
- GLM-4 Update (#39393)
- @simonreise
- Add `segmentation_maps` support to MobileNetV2ImageProcessor (#37312)
- @LoserCheems
- Add Doge model (#35891)
- @VladOS95-cyber
- Add DeepSeek V2 Model into Transformers (#36400)
- @paulpak58
- LFM2 (#39340)
- [modeling][lfm2] LFM2: Remove deprecated seen_tokens (#39342)
- @shuminghu
- PerceptionLM (#37878)
- @juliendenize
- Add mistral common support (#38906)
- @orionw
- Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! (#38967)
- Update modernbertdecoder docs (#39453)
- @cyyever
- Fix invalid property (#39384)
- Defaults to adamwtorchfused for Pytorch>=2.8 (#37358)
- Remove runtime conditions for type checking (#37340)
- Use newer typing notation (#38934)
- Enable some ruff checks for performance and readability (#39383)
- Fix pylint warnings (#39477)
- @jungnerd
- Update
docs/source/ko/_toctree.yml(#39516)
- Update
- @Player256
- [WIP] Add OneformerFastImageProcessor (#38343)
- @SangbumChoi
- Mask2former & Maskformer Fast Image Processor (#35685)
- @BlackSamorez
- FP-Quant support (#38696)
- Python
Published by LysandreJik 10 months ago
transformers - Patch release v4.53.3
Small patch release 4.53.3!
A small patch for open telemetry fixes! Sorry for the delay!
- refactor: remove `set_tracer_provider` and `set_meter_provider` calls (https://github.com/huggingface/transformers/pull/39422) from @McPatate
- Python
Published by ArthurZucker 10 months ago
transformers - Ernie-4.5 and Ernie-4.5 MoE (based on v4.53.2)
Two new models are added to transformers: Ernie 4.5, and its MoE variant, Ernie 4.5 MoE.
They are added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-Ernie-4.5-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.53.2-Ernie-4.5-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Ernie-4.5 models. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
Ernie-4.5 and its MoE variant
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple architectures and model sizes.
The Dense
This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
The MoE
This model specifically targets the base text model with mixture of experts (MoE): one variant with 21B total and 3B active parameters, and another with 300B total and 47B active parameters. It uses the standard Llama architecture at its core, combined with a specialized MoE based on Mixtral with additional shared experts.
Usage example
Ernie-4.5 can be found on the Hugging Face Hub.
Generating text with Ernie:
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-0.3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
```
See below for an example leveraging the MoE variant:
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-21B-A3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
```
- Python
Published by LysandreJik 10 months ago
transformers - ModernBERT Decoder (based on v4.53.2)
A new model is added to transformers: ModernBERT Decoder
It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.53.2-modernbert-decoder-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
ModernBERT Decoder
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
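To make the difference between bidirectional and causal (unidirectional) attention concrete, here is a minimal sketch in plain Python. The helper name `causal_mask` is hypothetical and is not part of transformers; it only illustrates which positions a causal decoder is allowed to look at, not how ModernBERT Decoder implements its masking internally.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Return a seq_len x seq_len mask where entry [i][j] is True
    when query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a masked (future) position
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

An encoder would use an all-`True` mask instead; the lower-triangular shape is what allows tokens to be generated one at a time.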
Usage example
ModernBERT Decoder can be found on the Hugging Face Hub.
Using pipeline:
```py
import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)

# For sequence classification
classifier = pipeline(
    task="text-classification",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
classifier("This movie is really great!")
```
Using AutoModel:
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

# For sequence classification
from transformers import AutoModelForSequenceClassification

classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
    num_labels=2
)

text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = classifier_model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")
```
Using the transformers CLI:
```bash
echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0
```
- Python
Published by LysandreJik 11 months ago
transformers - Patch Release v4.53.2
This patch contains the following bug fixes:
- Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
- [bugfix] fix flash attention 2 unavailable error on Ascend NPU (#39166)
- Fix errors when use verl to train GLM4.1v model (#39199)
- [pagged-attention] fix off-by-1 error in pagged attention generation (#39258)
- [smollm3] add tokenizer mapping for `smollm3` (#39271)
- [sliding window] revert and deprecate (#39301)
- fix Glm4v batch videos forward (#39172)
- Add a default value for `position_ids` in masking_utils (#39310)
- Python
Published by Cyrilvallez 11 months ago
transformers - Patch Release v4.53.1
This patch contains several bug fixes. The following commits are included:
- Fix: unprotected import of tp plugin (#39083)
- Fix key mapping for VLMs (#39029)
- Several fixes for Gemma3n (#39135)
- [qwen2-vl] fix FA2 inference (#39121)
- [smolvlm] fix video inference (#39147)
- Fix multimodal processor get duplicate arguments when receive kwargs for initialization (#39125)
- when delaying optimizer creation only prepare the model (#39152)
- Add packed tensor format support for flex/sdpa/eager through the mask! (#39194)
- Python
Published by Cyrilvallez 11 months ago
transformers - Release v4.53.0
Release v4.53.0
Gemma3n
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)
```
Dia
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotary positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is used, while for the audio portion (decoder), a pretrained codec model, DAC, is used: DAC encodes speech into discrete codebook tokens and decodes them back into audio.
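As a rough illustration of what a byte tokenizer does, the sketch below maps text to its UTF-8 byte values and back. The function names `byte_tokenize` and `byte_detokenize` are hypothetical; Dia's actual tokenizer may add special tokens or offsets that are not shown here.

```python
def byte_tokenize(text: str) -> list[int]:
    # Each UTF-8 byte becomes one token id in the range 0..255
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    # Reassemble the bytes and decode them back into text
    return bytes(ids).decode("utf-8")


print(byte_tokenize("Hi!"))  # [72, 105, 33]
print(byte_tokenize("é"))    # [195, 169] (one character, two bytes)
print(byte_detokenize(byte_tokenize("Hello (laughs)")))  # Hello (laughs)
```

The vocabulary stays tiny (256 byte values) and no text is ever out of vocabulary, at the cost of longer sequences for non-ASCII input.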
- Add Dia model by @buttercrab in #38405
Kyutai Speech-to-Text

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai's lab has released two model checkpoints:
- kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy
- Add kyutai stt by @eustlb in #38909
Read more about the model in the documentation
V-JEPA 2
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
- Add V-JEPA 2 by @qubvel in #38746
Read more about the model in the documentation.
Arcee
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
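The activation described above can be sketched in a few lines of plain Python. This is only a scalar illustration with hypothetical function names, not the Arcee implementation in transformers; SiLU is included for comparison because it is the activation that ReLU² replaces.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu_squared(x: float) -> float:
    # relu(x) ** 2; for every x this equals x * relu(x), the form used in the text
    r = relu(x)
    return r * r


def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x), the usual Llama MLP activation
    return x / (1.0 + math.exp(-x))


for x in (-1.0, 0.0, 0.5, 2.0):
    assert relu_squared(x) == x * relu(x)
    print(f"x={x:5.2f}  relu^2={relu_squared(x):.4f}  silu={silu(x):.4f}")
```

Unlike SiLU, ReLU² is exactly zero for all non-positive inputs, which makes the MLP activations sparse.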
- Add Arcee model support by @Crystalcareai in #38621
Read more about the model in the documentation.
ColQwen2
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
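The late interaction scoring mentioned above can be sketched as follows, assuming the ColBERT/ColPali-style "MaxSim" formulation: each query token embedding is matched to its most similar page embedding, and those maxima are summed. The function names and the toy 2-dimensional vectors are illustrative assumptions, not the ColQwen2 API or real embeddings.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, page_vecs) -> float:
    # For every query token, keep its best-matching page patch, then sum
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy multi-vector embeddings: 2 query tokens, pages with 3 and 2 patches
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "page_a": late_interaction_score(query, page_a),
    "page_b": late_interaction_score(query, page_b),
}
best = max(scores, key=scores.get)
print(best)  # page_a
```

Because every query token is scored against every page patch, a page is rewarded when some region of it matches each part of the query, which is what allows layout, tables, and charts to contribute to retrieval.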
- Add ColQwen2 to 🤗 transformers by @tonywu71 in #35778
Read more about the model in the documentation.
MiniMax
MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods, such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.
The architecture of MiniMax is briefly described as follows:
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
- Number of attention heads: 64
- Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
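The figures in this list can be turned into a small sanity-check sketch. It assumes that "a softmax attention after every 7 lightning attention" means every 8th layer is a softmax layer; the exact layer indexing in the released checkpoint is an assumption here, and `layer_types` is a hypothetical helper.

```python
NUM_LAYERS = 80
LIGHTNING_PER_SOFTMAX = 7
HEAD_DIM = 128
TOTAL_PARAMS_B = 456.0
ACTIVE_PARAMS_B = 45.9


def layer_types(num_layers: int = NUM_LAYERS,
                lightning_run: int = LIGHTNING_PER_SOFTMAX) -> list[str]:
    # Assumed layout: 7 lightning layers, then 1 softmax layer, repeated
    period = lightning_run + 1
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]


layout = layer_types()
print(layout.count("lightning"), layout.count("softmax"))  # 70 10

# RoPE is applied to half of each attention head's dimensions
rotary_dims = HEAD_DIM // 2
print(rotary_dims)  # 64

# Share of parameters that are active for any single token
active_pct = round(ACTIVE_PARAMS_B / TOTAL_PARAMS_B * 100, 1)
print(active_pct)  # 10.1
```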
For more details refer to the release blog post.
- Add support for MiniMax's MiniMax-Text-01 by @geetu040 in #35831
Read more about the model in the documentation.
Encoder-Decoder Gemma
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which is in-between T5 Large and T5 XL.
The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.
- Encoder-Decoder Gemma by @bzhangGo in #38332
Read more about the model in the documentation.
GLM-4.1V
The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!
- GLM-4.1V Model support by @zRzRzRzRzRzRzR in #38431
Read more about the model in the documentation.
Falcon H1
The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series on this website.
- [MODEL] Add Falcon H1 by @younesbelkada in #38249
Read more about the model in the documentation.
LightGlue
The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.
Similar to SuperGlue, this model matches two sets of local features extracted from two images; its goal is to be faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL
- Add LightGlue model by @sbucaille in #31718
Read more about the model in the documentation.
dots.llm1
The abstract from the report is the following:
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.
- [Model] add dots1 by @redmoe-moutain in #38143
Read more about the model in the documentation.
SmolLM3
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the kv cache, and no RoPE, enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
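To see why Grouped Query Attention reduces the KV cache, the back-of-the-envelope sketch below compares cache sizes for standard multi-head attention and GQA. All configuration numbers are hypothetical and chosen only for easy arithmetic; they are not SmolLM3's actual configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # Factor 2 accounts for storing both keys and values at every layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder: 32 layers, 16 query heads of size 128, bf16 cache
layers, query_heads, head_dim, seq_len = 32, 16, 128, 8192

# Multi-head attention keeps one KV head per query head
mha = kv_cache_bytes(layers, num_kv_heads=query_heads, head_dim=head_dim, seq_len=seq_len)
# GQA shares each KV head across a group of 4 query heads
gqa = kv_cache_bytes(layers, num_kv_heads=query_heads // 4, head_dim=head_dim, seq_len=seq_len)

print(mha // 2**20, "MiB")  # 2048 MiB
print(gqa // 2**20, "MiB")  # 512 MiB
```

The cache shrinks in direct proportion to the number of KV heads, which matters most at the long context lengths the model targets.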
- Add SmolLM3 by @anton-l in #38755
Read more about the model in the documentation.
Performance optimizations
Kernels
In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_the_hub decorator directly swapped out the model’s forward method. This implicit behavior caused several issues for users — including problems with torch.compile, non-determinism, and inconsistent outputs.
To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use — and kernelize handles the rest under the hood.
Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input = "Hello"
input_ids = tokenizer(input, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
More kernels will be added over time; this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
- Add kernelize to transformers by @MekkCyber in #38205
Flash Attention 3
Support for Flash Attention 3 is added across the most popular models.
- Support for Flash Attention 3 by @EduardDurech in #38972
Notable repository maintenance & refactors
Several efforts to refactor the repository are happening in parallel. The direction is to greatly simplify the library by removing unnecessary codepaths. Whilst the efforts are spread across the library, they're particularly visible in each individual model, where non-modeling-specific code will be simplified and eventually removed.
We assume that model-agnostic utilities shouldn't be in the modeling code. Things like the output of attentions, hidden states, and router logits are important for end-users but don't need to be explicitly displayed in the modeling code.
- Apply GradientCheckpointingLayer to the whole repo by @qubvel in #38913
- No more Tuple, List, Dict by @Rocketknight1 in #38797
- Deprecate TF + JAX by @Rocketknight1 in #38758
Breaking changes
Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.
- 🔴 Update default `dtype` for pipelines to `auto` by @Vaibhavs10 in #38882
- 🚨🚨 Fix initialization of Mask2Former by @Cyrilvallez in #38864
- 🚨🚨 Inherited CausalLM Tests by @Rocketknight1 in #37590
- 🚨Early-error🚨 config will error out if `output_attentions=True` and the attn implementation is wrong by @ArthurZucker in #38288
- 🔴 [VLM] modeling updates by @zucchini-nlp in #38317
- 🚨🚨 Fix custom code saving by @Rocketknight1 in #37716
- 🚨🚨[core] Completely rewrite the masking logic for all attentions by @Cyrilvallez in #37866
- 🔴🔴🔴 [`Attention`] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
- 🔴[`Attention`] Attention refactor for Whisper-based models by @vasqu in #38235
- Add CB by @ArthurZucker in #38085
Bugfixes and improvements
- CI reporting improvements by @ydshieh in #38230
- Revert parallelism temporarily by @LysandreJik in #38240
- tp plan should not be NONE by @ArthurZucker in #38255
- [Falcon H1] Fix Typo in Integration Test by @dhiaEddineRhaiem in #38256
- [`compile`] re-enable for Qwen-VL models by @zucchini-nlp in #38127
- fix multi-image case for llava-onevision by @cyr0930 in #38084
- Add tearDown method to Quark to solve OOM issues by @MekkCyber in #38234
- Clearer error on import failure by @LysandreJik in #38257
- [whisper] small changes for faster tests by @gante in #38236
- Simplify DTensor Check for modeling_utils.py by @amd-xiaoyu12 in #38245
- Improve typing in TrainingArgument by @cyyever in #36944
- Fix: missing else branch to handle "--load_best_model_at_end" in training_args.py by @danielyxyang in #38217
- assign the correct torchao data layout for xpu by @jiqing-feng in #37781
- Remove Japanese sequence_classification doc and update references by @ritsumei-aoi in #38246
- Protect ParallelInterface by @ArthurZucker in #38262
- Update Model Card for Mamba by @ParagEkbote in #37863
- docs(swin): Update Swin model card to standard format by @BryanBradfo in #37628
- add XPU info print in print_env by @yao-matrix in #38282
- [whisper] move processor test into processor test file 🧹 by @gante in #38266
- [Whisper] handle deprecation of `forced_decoder_ids` by @gante in #38232
- add `liger-kernel` to docker file by @ydshieh in #38292
- Fix tp error when torch distributed is already initialized by @SunMarc in #38294
- More typing in src/transformers/training_args.py by @cyyever in #38106
- refine `transformers env` output by @yao-matrix in #38274
- Update CI Docker base image for AMD tests by @ahadnagy in #38261
- Fix HybridChunedCache & Llama4 by @Cyrilvallez in #38299
- Oups typo for HybridChunkedCache by @Cyrilvallez in #38303
- [Tests] Cleanup Janus Testcase by @yaswanth19 in #38311
- [emu3] fix conversion script by @zucchini-nlp in #38297
- Fix run_slow by @cyyever in #38314
- Fix typo: change 'env' to 'environment' in .circleci/config.yml by @AbdessamadEnabih in #38273
- Adds use_repr to model_addition_debugger_context by @RyanMullins in #37984
- [tf/flax] handle `forced_decoder_ids` deletion by @gante in #38316
- [Whisper + beam search] fix usage of `beam_indices` by @gante in #38259
- Expose AutoModelForTimeSeriesPrediction for import by @jinan-zhou in #38307
- [custom_generate] don't forward `custom_generate` and `trust_remote_code` by @gante in #38304
- add `vasqu` to `self-comment-ci.yml` by @ydshieh in #38324
- Fix some tests (especially compile with fullgraph=True on Python<3.11) by @Cyrilvallez in #38319
- [performance_optim] reduce frequency of declaring attention_mask in Ascend NPU flash attention by @FightingZhen in #38278
- refactor can_save_slow_tokenizer by @itazap in #37722
- [`FlexAttention`] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
- Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag by @inf3rnus in #36835
- Use Gradient Checkpointing Layer in Jamba & Blip Related Models by @alex-jw-brooks in #38310
- Never fallback to eager implicitly by @Cyrilvallez in #38327
- Remove duplicate docstring: resample by @qqii in #38305
- Update BioGPT model card by @Aguedoom in #38214
- docs(swinv2): Update SwinV2 model card to new standard format by @BryanBradfo in #37942
- [docs]: update roformer.md model card by @KsuParkhamchuk in #37946
- new failure CI reports for all jobs by @ydshieh in #38298
- Hot fix for AMD CI workflow by @ydshieh in #38349
- Uninstall `kernels` for AMD docker images by @ydshieh in #38354
- [VLMs] add helpers for get/set embedding by @zucchini-nlp in #38144
- switch to device agnostic device calling for test cases by @yao-matrix in #38247
- [`OPT`] Fix attention scaling by @vasqu in #38290
- Fix all import errors based on older torch versions by @Cyrilvallez in #38370
- Fix incorrect batching audio index calculation for Phi-4-Multimodal by @Isotr0py in #38103
- Protect `get_default_device` for torch<2.3 by @Cyrilvallez in #38376
- [Falcon H1] Fix slow path forward pass by @dhiaEddineRhaiem in #38320
- Improved cache docs by @manueldeprada in #38060
- for now disable compile by @ArthurZucker in #38383
- Use one `utils/notification_service.py` by @ydshieh in #38379
- Better check in `initialize_weights` by @Cyrilvallez in #38382
- fix typos by @DeVikingMark in #38336
- fix typo: `tokenizer` -> `tokenize` by @foldl in #38357
- Stop TF weight rename reDOS by @Rocketknight1 in #38325
- [cli] cli usable without torch by @gante in #38386
- update gemma tests by @ydshieh in #38384
- Stop autoconverting custom code checkpoints by @Rocketknight1 in #37751
- Add AMD MI300 CI caller leveraging self-hosted runner scale set workflow in hf-workflows by @jitesh-gupta in #38132
- Fix image token mask in Gemma3 by @Cyrilvallez in #38295
- [transformers x vLLM] standardize processors by @zucchini-nlp in #37915
- [paligemma] fix processor with suffix by @zucchini-nlp in #38365
- [video utils] group and reorder by number of frames by @zucchini-nlp in #38374
- [aya vision] fix processor for vLLM by @zucchini-nlp in #38371
- guard size mismatch check to only quantized models by @SunMarc in #38397
- [chat] improvements for thinking models and reduce default verbosity by @gante in #38322
- Fix convert to original state dict for VLMs by @hiyouga in #38385
- [chat] use the checkpoint's `generation_config.json` as base parameterization by @gante in #38330
- Fix Qwen2.5-VL Video Processor by @yeliudev in #38366
- [CSM] infer codec model with no_grad + audio eos label by @eustlb in #38215
- Add report_repo_id to mi300 workflow by @ivarflakstad in #38401
- [CSM] update model id by @eustlb in #38211
- [cleanup] delete deprecated kwargs in qwen2_audio 🧹 by @gante in #38404
- [tests] remove overload for deleted test (`test_offloaded_cache_implementation`) by @gante in #37896
- [mllama] Allow `pixel_values` with `inputs_embeds` by @dxoigmn in #38334
- Update Model Card for Mamba-2 by @ParagEkbote in #37951
- Updated Zoedepth model card by @miniMaddy in #37898
- Updated BigBird Model card as per #36979. by @RogerSinghChugh in #37959
- Updated BERTweet model card. by @RogerSinghChugh in #37981
- New bart model card by @RogerSinghChugh in #37858
- Update granite.md by @Tanuj-rai in #37791
- Falcon-H1 - Fix auto_docstring and add can_return_tuple decorator by @yonigozlan in #38260
- Updated model card for OLMo2 by @andyvu923 in #38394
- Add mi300 to amd daily ci workflows definition by @ivarflakstad in #38415
- Change slack channel for mi250 CI by @ivarflakstad in #38410
- Fix an error in verify_tp_plan for keys without '.' by @liwii in #38420
- [qwen-vl] Look for vocab size in text config by @zucchini-nlp in #38372
- Update `CsmForConditionalGenerationIntegrationTest` by @ydshieh in #38424
- enable large_gpu and torchao cases on XPU by @yao-matrix in #38355
- Disable mi210 scheduled CI by @ivarflakstad in #38411
- Update error when using additional and/or masks by @Cyrilvallez in #38429
- Fix CircleCI not triggered when PR is opened from a branch of `huggingface/transformers` by @ydshieh in #38413
- make Llama4TextMoe forward more readable by @JJJYmmm in #37529
- [core] support tensor-valued _extra_state values in `from_pretrained` by @pstjohn in #38155
- Fix typo in tokenization_utils_base.py docstring by @cwngan in #38418
- Fix convert weights for InternVL by @yonigozlan in #38233
- Trigger doc-builder job after style bot by @ydshieh in #38398
- Remove redundant test_sdpa_equivalence test by @Rocketknight1 in #38436
- Fix MoE gradient test by @Rocketknight1 in #38438
- Fix `from_args_and_dict` ProcessorMixin by @yonigozlan in #38296
- Fix handling of slow/fast image processors in image_processing_auto.py by @yonigozlan in #38161
- Updated the Model docs - for the ALIGN model by @1himan in #38072
- Updated the model card for ViTMAE by @mreraser in #38302
- Model card for mobilenet v1 and v2 by @yuanjua in #37948
- Merge type hints from `microsoft/python-type-stubs` (post dropping support for Python 3.8) by @Avasam in #38335
- Fix GLM4 checkpoints by @ydshieh in #38412
- feat: add cache retention for requests by @McPatate in #38446
- [Tests] Clean up test cases for few models by @yaswanth19 in #38315
- Fix TypeError in save_pretrained error handling (fixes #38422) by @rahulrshetty45 in #38449
- Cleanup `BatchFeature` and `BatchEncoding` by @lgeiger in #38459
- Fix `Gemma3IntegrationTest` by @ydshieh in #38471
- [Qwen2.5-Omni] Fix dtype of cos,sin when used with flash attention by @HarryHsing in #38453
- fix: handle no scheduler passed by user by @McPatate in #38407
- make it go brrrr by @ArthurZucker in #38409
- Fix convert_internvl_weights_to_hf.py to support local paths by @xvyv99 in #38264
- Fix incorrect bbox_embed initialization when decoder_bbox_embed_share=False in GroundingDINO by @islemyakoubi in #38238
- [Tests] Reduced model size for albert-test model by @saqlain2204 in #38480
- Align TP check by @SunMarc in #38328
- protect dtensor import by @SunMarc in #38496
- [docs] add xpu environment variable for gpu selection by @faaany in #38194
- Remove deprecated use_flash_attention_2 parameter by @cyyever in #37131
- Fix setting FLASH_ATTENTION_DETERMINISTIC after importing by @HollowMan6 in #37185
- [seamless_m4t] Skip some tests when speech is not available by @remi-or in #38430
- Update Loss Functions to Accept Tensor num_items_in_batch by @NEREUScode in #38029
- [generate] add soft deprecations on custom generation methods by @gante in #38406
- [generate] move `SinkCache` to a `custom_generate` repo by @gante in #38399
- remove unhandled parameter by @itazap in #38145
- Fix amp deprecation issue by @SunMarc in #38100
- [flax/mistral] support sliding_window: null in config by @yiding in #37402
- Num parameters in model.safetensors.index.json by @LysandreJik in #38531
- Remove type annotation in Siglip Attention Module by @yaswanth19 in #38503
- Fix `Gemma2IntegrationTest` by @ydshieh in #38492
- Fix blip2 tests by @ydshieh in #38510
- [tests] expand flex-attn test for vision models by @zucchini-nlp in #38434
- Don't use default attn if pre-set in sub-config by @zucchini-nlp in #38526
- update emu3 test by @jiqing-feng in #38543
- Update docker image to use `av` by @ydshieh in #38548
- [bugfix] [WIP] fix apply_rotary_emb error on Ascend NPU by @FightingZhen in #38491
- [TP] Change command in tests to `python3` by @S1ro1 in #38555
- Explicitly setting encoding in tokenization_utils_base.py by @Muqi1029 in #38553
- Fix `utils/notification_service.py` by @ydshieh in #38556
- Name change AOPermod -> ModuleFqn by @drisspg in #38456
- Fix hqq issue by @SunMarc in #38551
- [docs] Format fix by @stevhliu in #38414
- [janus] Fix failing tests on mi3XX by @remi-or in #38426
- Fix `chameleon` tests by @ydshieh in #38565
- update `utils/notification_service.py` for AMD vs Nvidia by @ydshieh in #38563
- Fix `deepseekv3` by @ydshieh in #38562
- [`FlexAttn`] Fix models with unique characteristics by @vasqu in #38433
- fix(attention_visualizer): add default value for image_seq_length by @IceGiraffe in #38577
- allow custom head_dim for qwen2_moe by @bzantium in #37188
- Docs: fix code formatting in torchao docs by @Manalelaidouni in #38504
- feat: add `repository` field to benchmarks table by @McPatate in #38582
- [Dinov2] Enable device_map="auto" support by @aryanchauhan31 in #38487
- tests/roformer: fix couple roformer tests on gpus by @dvrogozh in #38570
- New gpt neo model card by @RogerSinghChugh in #38505
- Updated deprecated typing imports with equivalents for Python 3.9+ by @Sai-Suraj-27 in #38546
- added fast image processor for ZoeDepth and expanded tests accordingly by @henrikm11 in #38515
- [qwen-omni] fix sliding window by @zucchini-nlp in #38525
- Remove custom pytest and pluggy by @ydshieh in #38589
- pin pandas by @ydshieh in #38605
- Allow `mlm_probability` to be set to `None` when `mlm=False` in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
- Avoid overwrite existing local implementation when loading remote custom model by @Isotr0py in #38474
- fix spelling errors by @davidjsonn in #38608
- Remove `isort` from dependencies by @Sai-Suraj-27 in #38616
- Fix `return_dict=False` giving errors in a few VLM models by @ydshieh in #38519
- docs: fix dark mode logo display. by @johncaged in #38586
- Fix typo in LLaVa documentation by @mynameismon in #38618
- [Nit] Add Note on SigOpt being in Public Archive Mode by @ParagEkbote in #38610
- Updated Aria model card by @1himan in #38472
- Fix `MiniMax` (docs and integration tests checkpoint) by @geetu040 in #38575
- enable more test cases on xpu by @yao-matrix in #38572
- Improve `test_initialization` by @ydshieh in #38607
- Use torch 2.7.1 on CircleCI jobs by @ydshieh in #37856
- [generation] bring back tests on vision models by @zucchini-nlp in #38603
- update `ColQwen2ModelIntegrationTest` by @ydshieh in #38583
- Improve `test_initialization` for `SwiftFormer` by @ydshieh in #38636
- fix: support grad clipping for TP through replicating non-sharded modules by @kmehant in #36132
- Don't run `AriaForConditionalGenerationModelTest` on CircleCI by @ydshieh in #38615
- fix total batch size calculation in trainer by @inkcherry in #38286
- fix torch_dtype on awq by @jiqing-feng in #38463
- Better CI by @ydshieh in #38552
- remove ipex_optimize_model usage by @yao-matrix in #38632
- Skip torchscript tests for 2 models by @ydshieh in #38643
- Fix `InternVL` integration test by @ydshieh in #38612
- Use torch 2.7.1 on daily CI by @ydshieh in #38620
- Fix qwen2-audio chat template audio placeholder insertion by @Isotr0py in #38640
- Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable by @sbucaille in #38664
- fix: "check out" as verb by @DePasqualeOrg in #38678
- Fix attention mask expansion when converting to executorch by @pweglik in #38637
- Fix some models import by @nicelulu in #38694
- Fix retrieve function signature and remove faiss requirement by @Fiona-Waters in #38624
- Fix TypeError: 'NoneType' object is not iterable for esm by @dbleyl in #38667
- Docs: update bitsandbytes torch.compile compatibility by @matthewdouglas in #38651
- Drop as_target_processor from the __call__ and pad methods by @marcndo in #38642
- Created model card for XLM model by @AshAnand34 in #38595
- Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout by @AshAnand34 in #38596
- Created model card for xlm-roberta-xl by @AshAnand34 in #38597
- Fix `aya_vision` test by @ydshieh in #38674
- Standardize ByT5 model card format by @yanamis in #38699
- Fix smart resize by @rdonggroq in #38706
- Update some tests for torch 2.7.1 by @ydshieh in #38701
- Logging message for `is_bitsandbytes_available()` by @ved1beta in #38528
- Fix `llava` tests by @ydshieh in #38722
- Use OSError by @cyyever in #38712
- [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping by @alexzms in #38703
- Fix typo in Language Modeling example scripts and update TPU type by @framoncg in #38652
- Add AGENTS.md by @Rocketknight1 in #38734
- New canine model card by @RogerSinghChugh in #38631
- Fixed a multiple-devices issue in SmolVLM model by @remi-or in #38736
- [llava] fix integration tests with Siglip by @zucchini-nlp in #38732
- fix: Add method to get image features in PaliGemmaForConditionalGeneration by @YushunXiang in #38730
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim by @yao-matrix in #38689
- fix: bf16 with TPU is allowed in configuration by @yevvonlim in #38670
- [DeepSeek-V3] implement when q_lora_rank is None by @bzantium in #38743
- Revert "Trigger doc-builder job after style bot" by @ydshieh in #38735
- Add z-loss to Bamba for v2 by @daviswer in #37842
- Better typing for num_items_in_batch by @SunMarc in #38728
- Prepare for TF+Jax deprecation by @Rocketknight1 in #38760
- Remove IPEX requirement for bitsandbytes on CPU by @matthewdouglas in #38594
- Update repo consistency check by @Rocketknight1 in #38763
- fix(qwen3_moe): pass kwargs to self_attn by @llllvvuu in #38691
- Update pegasus model card by @dross20 in #38675
- Make style bot trigger CI after push by @ydshieh in #38754
- chore(pixtral): emit block attention mask when using flash attention by @starcatmeow in #38741
- Update altCLIP model card by @EmileAydar in #38306
- Add Qwen2 MoE model card by @rileyafox in #38649
- [masking utils] check `None` instead of try/except by @zucchini-nlp in #38561
- [Hotfix] Fix style bot by @ydshieh in #38779
- Fix masking utils by @Cyrilvallez in #38783
- [video processors] support frame sampling within processors by @zucchini-nlp in #38105
- Skip some export tests on torch 2.7 by @ydshieh in #38677
- Reduce verbosity for `average_tokens_across_devices=True` and `world size = 1` by @qgallouedec in #38785
- Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #38770
- [docs] Add int4wo + 2:4 sparsity example to TorchAO README by @jcaip in #38592
- Fix `qwen_2_5 omni` by @ydshieh in #38658
- Fix `llava_onevision` tests by @ydshieh in #38791
- Reword README in light of model definitions by @LysandreJik in #38762
- Fix Typos in Comments: "quantitation" → "quantization", "averege" → "average" by @leopardracer in #38766
- Initialize flash attn flag by @farnasirim in #38768
- Fix `mllama` by @ydshieh in #38704
- build: :pushpin: Remove upper bound on PyTorch by @KyleMylonakisProtopia in #38789
- Remove all traces of `low_cpu_mem_usage` by @Cyrilvallez in #38792
- [Docs] New DiT model card by @yushi2006 in #38721
- Add missing div in Pegasus model card by @dross20 in #38773
- Updated moonshine modelcard by @SohamPrabhu in #38711
- refactor create_token_type_ids_from_sequences by @itazap in #37681
- [docs] update cache docs with new info by @zucchini-nlp in #38775
- Fix erroneous docstring for the ordering of SWA layers by @norpadon in #38794
- Fix configs and doc for the Qwens by @Cyrilvallez in #38808
- Unbreak optimum-executorch by @guangy10 in #38646
- Disable custom MRA kernels for ROCm by @ahadnagy in #38738
- Use HF papers by @qgallouedec in #38184
- Simplify and update trl examples by @qgallouedec in #38772
- Better pipeline type hints ✨ by @qubvel in #38049
- Fix `llava_next` tests by @ydshieh in #38813
- Expectation fixes and added AMD expectations by @remi-or in #38729
- Use `wandb.run.url` instead of `wandb.run.get_url()` (deprecated) by @qgallouedec in #38817
- Refactor DBRX tests to use CausalLMModelTest base classes by @Rocketknight1 in #38475
- change fsdp_strategy to fsdp in TrainingArguments in accelerate doc by @PT-10 in #38807
- Fix a minor security issue by @ydshieh in #38815
- Fix trainer.py not showing signature columns by @nenesekai in #38465
- Add V-JEPA for video classification model by @qubvel in #38788
- fixed docstring in modular_qwen2_5_vl.py by @lawrencefeng17 in #38798
- [docs] Update docs moved to the course by @stevhliu in #38800
- [docs] updated roberta model card by @allmight05 in #38777
- Updated Albert model Card by @souvikchand in #37753
- [internvl] fix video inference by @zucchini-nlp in #38811
- Fix redundant code in Janus by @yaswanth19 in #38826
- bugfix: propage weight key_mapping to peft to fix 3.52 VLM renaming by @ManuelFay in #38627
- Fix peft integration by @Cyrilvallez in #38841
- Fix broken notebooks link in Italian training docs by @VolodymyrBg in #38834
- Fix broken tag in Longformer model card by @dross20 in #38828
- [BugFix] QA pipeline edge case: `align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers by @yushi2006 in #38761
- GraniteMoeHybrid: Allow for only shared expert case. by @shawntan in #38801
- Updated aya_vision.md by @1himan in #38749
- Remove merge conflict artifacts in Albert model doc by @druvdub in #38849
- [video processor] fix BC when no video config if found by @zucchini-nlp in #38840
- Fix incorrect width ratio calculation in Llama4 image processor by @Jingxiang-Zhang in #38842
- Allow customization of sdpa in executorch.py by @kimishpatel in #38827
- Fix `qwen2_5_vl` tests by @ydshieh in #38845
- Improve `auxiliary_in_channels` default behavior in UperNet by @simonreise in #37540
- Fix `qwen3` tests by @ydshieh in #38862
- Update CvT documentation with improved usage examples and additional … by @sezan92 in #38731
- Update roc bert docs by @SohamPrabhu in #38835
- Post-PR fixes! by @Rocketknight1 in #38868
- enable misc test cases on XPU by @yao-matrix in #38852
- Fix `phi4_multimodal` tests by @ydshieh in #38816
- Fix `qwen3_moe` tests by @ydshieh in #38865
- Fix HQQ model param device transfer issue by @HighCWu in #38466
- Fixed markdown for BertTokenizer's '[CLS]' token. by @eu90h in #38506
- null deepspeed_plugin in args for wandb callback fake trainer by @winglian in #38867
- More PYUP fixes by @cyyever in #38883
- Fix loop var naming by @Rocketknight1 in #38885
- [bugfix] fix ATTN_MASK_NPU device mismatch error on multi-device NPU … by @qykong in #38876
- log: Add logging when using split_batches and per_device_train_batch_size by @KeshavSingh29 in #38633
- Docs: Add custom fine-tuning tutorial to TrOCR model page by @Ashutosh-4485 in #38847
- 36978 | Fast image processor for DPT model by @samrae7 in #37481
- [video processor] fix slow tests by @zucchini-nlp in #38881
- Update bamba model card by @druvdub in #38853
- Add support for specifying revisions when pushing to Hub via internal Trainer call by @IsaacBreen in #36852
- Use `raise from e` in `hub.py` utility by @Wauplin in #37241
- [phi-4] use mel filters from audio utils by @eustlb in #36966
- Fix `fsmt` tests by @ydshieh in #38904
- Fix unnecessary super calls by @cyyever in #38897
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs by @yao-matrix in #38284
- Fix `FalconMambaIntegrationTests` by @ydshieh in #38566
- Skip sdpa tests if submodule does not support sdpa by @ivarflakstad in #38907
- Fix ReDOS in tokenizer digit substitution by @Rocketknight1 in #38844
- feat: Add granite architectures to auto tokenizer name mappings by @gabe-l-hart in #38802
- Allow make-fixup on main branch, albeit slowly by @Rocketknight1 in #38892
- feat: add flexible Liger Kernel configuration to TrainingArguments by @hamza-hcompany in #38911
- Remove deprecated classes in modeling_utils.py by @Cyrilvallez in #38919
- Skip some tests for now by @ydshieh in #38931
- Modernbert fixes by @remi-or in #38912
- add pytorch-xpu Dockerfile by @yao-matrix in #38875
- Remove `ALL_LAYERNORM_LAYERS` by @Cyrilvallez in #38922
- [static cache] fix device map per layer in VLMs by @zucchini-nlp in #38488
- Add kwargs for timm.create_model in TimmWrapper by @qubvel in #38860
- Pin PyTorch extras for AMD containers by @ahadnagy in #38941
- Correctly raise error for awq quantization by @Cyrilvallez in #38945
- Fix more flaky `test_initialization` by @ydshieh in #38932
- Switch to use A10 progressively by @ydshieh in #38936
- Fix custom generate from local directory by @manueldeprada in #38916
- Update blip model card by @devkade in #38513
- Gaudi3 CI by @IlyasMoutawwakil in #38790
- Fix DTensor import compatibility for PyTorch < 2.5 by @Benoqtr in #38836
- Fix(informer): Correct tensor shape for input_size=1 by @Flink-ddd in #38856
- [modular] CLI allows positional arguments, and more defaults names for the optional arg by @Cyrilvallez in #38979
- Remove dead protected imports by @Cyrilvallez in #38980
- Break tie in Expectations and gemma3 fixes by @remi-or in #38943
- Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors by @yonigozlan in #38157
- fix: add bool operator to tokenizer to avoid bloated asserts by @kallewoof in #38899
- Add support for auto_docstring with model outputs by @yonigozlan in #38242
- fix `mistral` and `mistral3` tests by @ydshieh in #38978
- [Feature] Support `is_split_into_words` in the `TokenClassificationPipeline`. by @yushi2006 in #38818
- Fix `rag` by @ydshieh in #38585
- [docs] Typos - Single GPU efficient training features by @casinca in #38964
- [qwen] refactor attentions for vision/audio by @zucchini-nlp in #38930
- Removing extra space in large command for speech-pretraining example by @dggaytan in #38705
- [`Attention`] Small fix on output attentions by @vasqu in #38948
- Fixes for Arcee model by @Cyrilvallez in #39001
- Added scikit-learn to the example image-classification requirements.txt by @mylonjones in #37506
- Update attention_visualizer.py by @Tanuj-rai in #37860
- Skip non-selected experts for qwen3_moe by @seven-mile in #38133
- Fix undeterministic order in modular dependencies by @Cyrilvallez in #39005
- Granite speech - minor fixes to support training with the HF trainer by @avihu111 in #38833
- Fix bugs in DynamicCache by @tugsbayasgalan in #37880
- Update self-comment-ci.yml user list by @ivarflakstad in #39014
- Skip sdpa dispatch on flash test due to unsupported head dims by @ivarflakstad in #39010
- [HPU][Critical Issue Fix] ThreadPool instead of Pool for parallel pre-processing by @dsmertin in #39002
- Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code,… by @marcndo in #38954
- [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim by @sbucaille in #39021
- Add zero dim tensor check when using flash_attention by @ranzhejiang in #38280
- Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1 by @efsotr in #37332
- [AutoModelForMaskGeneration] Remove duplicate code by @NielsRogge in #38622
- [video processor] support torchcodec and decrease cuda memory usage by @zucchini-nlp in #38880
- Drop unnecessary tokens in GPT2Model generation by @null-pointer-access in #39016
- Fix the seamless_m4t cannot work on Gaudi by @yuanwu2017 in #38363
- fix: astronomical loss with ModernBERT when using gradient checkpointing by @umarbutler in #38982
- fix gemma3 grad acc by @SunMarc in #37208
- Remove script datasets in tests by @lhoestq in #38940
- Fix grammatical error in models documentation by @marcndo in #39019
- refactor: remove custom BarkLayerNorm by @eginhard in #39003
- [Kyutai-STT] correct model type + model id by @eustlb in #39035
- Two ReDOS fixes by @Rocketknight1 in #39013
- [tests] remove TF tests (uses of `require_tf`) by @gante in #38944
- Granite speech speedup + model saving bugfix by @avihu111 in #39028
- Fix Bad Outputs in Fast Path for GraniteMoeHybrid by @alex-jw-brooks in #39033
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ydshieh
- CI reporting improvements (#38230)
- add `liger-kernel` to docker file (#38292)
- add `vasqu` to `self-comment-ci.yml` (#38324)
- new failure CI reports for all jobs (#38298)
- Hot fix for AMD CI workflow (#38349)
- Uninstall `kernels` for AMD docker images (#38354)
- Use one `utils/notification_service.py` (#38379)
- update gemma tests (#38384)
- Update `CsmForConditionalGenerationIntegrationTest` (#38424)
- Fix CircleCI not triggered when PR is opened from a branch of `huggingface/transformers` (#38413)
- Trigger doc-builder job after style bot (#38398)
- Fix GLM4 checkpoints (#38412)
- Fix `Gemma3IntegrationTest` (#38471)
- Fix `Gemma2IntegrationTest` (#38492)
- Fix blip2 tests (#38510)
- Update docker image to use `av` (#38548)
- Fix `utils/notification_service.py` (#38556)
- Fix `chameleon` tests (#38565)
- update `utils/notification_service.py` for AMD vs Nvidia (#38563)
- Fix `deepseekv3` (#38562)
- Remove custom pytest and pluggy (#38589)
- pin pandas (#38605)
- Fix `return_dict=False` giving errors in a few VLM models (#38519)
- Improve `test_initialization` (#38607)
- Use torch 2.7.1 on CircleCI jobs (#37856)
- update `ColQwen2ModelIntegrationTest` (#38583)
- Improve `test_initialization` for `SwiftFormer` (#38636)
- Don't run `AriaForConditionalGenerationModelTest` on CircleCI (#38615)
- Better CI (#38552)
- Skip torchscript tests for 2 models (#38643)
- Fix `InternVL` integration test (#38612)
- Use torch 2.7.1 on daily CI (#38620)
- Fix `aya_vision` test (#38674)
- Update some tests for torch 2.7.1 (#38701)
- Fix `llava` tests (#38722)
- Revert "Trigger doc-builder job after style bot" (#38735)
- Make style bot trigger CI after push (#38754)
- [Hotfix] Fix style bot (#38779)
- Skip some export tests on torch 2.7 (#38677)
- Fix `qwen_2_5 omni` (#38658)
- Fix `llava_onevision` tests (#38791)
- Fix `mllama` (#38704)
- Fix `llava_next` tests (#38813)
- Fix a minor security issue (#38815)
- Fix `qwen2_5_vl` tests (#38845)
- Fix `qwen3` tests (#38862)
- Fix `phi4_multimodal` tests (#38816)
- Fix `qwen3_moe` tests (#38865)
- Fix `fsmt` tests (#38904)
- Fix `FalconMambaIntegrationTests` (#38566)
- Skip some tests for now (#38931)
- Fix more flaky `test_initialization` (#38932)
- Switch to use A10 progressively (#38936)
- fix `mistral` and `mistral3` tests (#38978)
- Fix `rag` (#38585)
- @ArthurZucker
- tp plan should not be NONE (#38255)
- Protect ParallelInterface (#38262)
- Add CB (#38085)
- 🚨Early-error🚨 config will error out if `output_attentions=True` and the attn implementation is wrong (#38288)
- for now disable compile (#38383)
- make it go brrrr (#38409)
- @younesbelkada
- [MODEL] Add Falcon H1 (#38249)
- @cyr0930
- fix multi-image case for llava-onevision (#38084)
- @cyyever
- Improve typing in TrainingArgument (#36944)
- More typing in src/transformers/training_args.py (#38106)
- Fix run_slow (#38314)
- Remove deprecated use_flash_attention_2 parameter (#37131)
- Use OSError (#38712)
- More PYUP fixes (#38883)
- Fix unnecessary super calls (#38897)
- @ritsumei-aoi
- Remove Japanese sequence_classification doc and update references (#38246)
- @yao-matrix
- add XPU info print in print_env (#38282)
- refine `transformers env` output (#38274)
- switch to device agnostic device calling for test cases (#38247)
- enable large_gpu and torchao cases on XPU (#38355)
- enable more test cases on xpu (#38572)
- remove ipex_optimize_model usage (#38632)
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
- enable misc test cases on XPU (#38852)
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs (#38284)
- add pytorch-xpu Dockerfile (#38875)
- @vasqu
- 🔴🔴🔴 [`Attention`] Refactor Attention Interface for Bart-based Models (#38108)
- [`FlexAttention`] Reenable flex for encoder-decoder and make the test more robust (#38321)
- [`OPT`] Fix attention scaling (#38290)
- 🔴[`Attention`] Attention refactor for Whisper-based models (#38235)
- [`FlexAttn`] Fix models with unique characteristics (#38433)
- [`Attention`] Small fix on output attentions (#38948)
- @itazap
- refactor can_save_slow_tokenizer (#37722)
- remove unhandled parameter (#38145)
- refactor create_token_type_ids_from_sequences (#37681)
- @eustlb
- [CSM] infer codec model with no_grad + audio eos label (#38215)
- [CSM] update model id (#38211)
- [phi-4] use mel filters from audio utils (#36966)
- Add kyutai stt (#38909)
- [Kyutai-STT] correct model type + model id (#39035)
- @RogerSinghChugh
- Updated BigBird Model card as per #36979. (#37959)
- Updated BERTweet model card. (#37981)
- New bart model card (#37858)
- New gpt neo model card (#38505)
- New canine model card (#38631)
- @1himan
- Updated the Model docs - for the ALIGN model (#38072)
- Updated Aria model card (#38472)
- Updated aya_vision.md (#38749)
- @Avasam
- Merge type hints from `microsoft/python-type-stubs` (post dropping support for Python 3.8) (#38335)
- @remi-or
- [seamless_m4t] Skip some tests when speech is not available (#38430)
- [janus] Fix failing tests on mi3XX (#38426)
- Fixed a multiple-devices issue in SmolVLM model (#38736)
- Expectation fixes and added AMD expectations (#38729)
- Modernbert fixes (#38912)
- Break tie in Expectations and gemma3 fixes (#38943)
- @tonywu71
- Add ColQwen2 to 🤗 transformers (#35778)
- @geetu040
- Add support for MiniMax's MiniMax-Text-01 (#35831)
- Fix `MiniMax` (docs and integration tests checkpoint) (#38575)
- @sbucaille
- Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable (#38664)
- Add LightGlue model (#31718)
- [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim (#39021)
- @samrae7
- 36978 | Fast image processor for DPT model (#37481)
- @Crystalcareai
- Add Arcee model support (#38621)
- @zRzRzRzRzRzRzR
- GLM-4.1V Model support (#38431)
- @bzhangGo
- Encoder-Decoder Gemma (#38332)
- @redmoe-moutain
- [Model] add dots1 (#38143)
- @EduardDurech
- Support for Flash Attention 3 (#38972)
Published by LysandreJik 11 months ago
transformers - Kyutai-STT (based on v4.52.4)
A new model is added to transformers: Kyutai-STT
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-Kyutai-STT-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
Kyutai-STT

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:
- kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy
Usage example
Kyutai-STT can be found on the Hugging Face Hub.
Inference
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```
Batched Inference
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```
Published by LysandreJik 11 months ago
transformers - V-JEPA 2 (based on v4.52.4)
A new model is added to transformers: V-JEPA 2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-VJEPA-2-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the VJEPA-2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
VJEPA-2
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
The abstract from the technical report is the following:
Usage example
VJEPA-2 can be found on the Hugging Face Hub. V-JEPA 2 is intended to represent any video (or image), either to perform video classification and retrieval or to serve as a video encoder for VLMs.
The snippet below shows how to load the V-JEPA 2 model using the AutoModel class.
```py
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

from transformers import AutoModel, AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
    "facebook/vjepa2-vitl-fpc64-256",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # choosing some frames. here, you can define more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)

# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state

# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state
```
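The snippet above takes the first 64 frames and notes that a more complex sampling strategy can be defined. As one possible illustration, the sketch below spreads the sampled indices evenly over the whole clip. The helper name `sample_frame_indices` and the frame count of 300 are assumptions made for this example; they are not part of the library or of the release.

```python
import numpy as np


def sample_frame_indices(num_video_frames: int, num_samples: int = 64) -> np.ndarray:
    """Return `num_samples` frame indices spread evenly across a clip.

    Hypothetical helper for illustration only. If the clip has fewer frames
    than `num_samples`, some indices are repeated.
    """
    if num_video_frames <= 0:
        raise ValueError("num_video_frames must be positive")
    # Evenly spaced positions from the first to the last frame, rounded to integers.
    positions = np.linspace(0, num_video_frames - 1, num=num_samples)
    return positions.round().astype(int)


# Example with an assumed clip length of 300 frames.
frame_idx = sample_frame_indices(num_video_frames=300, num_samples=64)
print(frame_idx[:5], frame_idx[-1])
```

The resulting integer array has the same form as `np.arange(0, 64)` in the example above, so it could be used as a drop-in replacement for `frame_idx` once the total frame count of the video is known.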
Published by LysandreJik 12 months ago
transformers - ColQwen2 (based on v4.52.4)
A new model is added to transformers: ColQwen2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-ColQwen2-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-ColQwen2-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the ColQwen2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
ColQwen2
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
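To make the idea of late interaction scoring more concrete, here is a minimal NumPy sketch. It assumes the common MaxSim formulation (for each query token, take the best-matching page token and sum those maxima), and it uses tiny hand-made embeddings. The function name `late_interaction_scores` is hypothetical, and this is not presented as the implementation behind the processor's scoring method.

```python
import numpy as np


def late_interaction_scores(query_embeddings, page_embeddings):
    """Return a (num_queries, num_pages) matrix of MaxSim-style scores.

    Each query is an array of shape (query_tokens, dim) and each page an
    array of shape (page_tokens, dim). Illustrative sketch only.
    """
    scores = np.zeros((len(query_embeddings), len(page_embeddings)))
    for qi, query in enumerate(query_embeddings):
        for pi, page in enumerate(page_embeddings):
            token_sims = query @ page.T  # (query_tokens, page_tokens) dot products
            # Best page token for each query token, summed over query tokens.
            scores[qi, pi] = token_sims.max(axis=1).sum()
    return scores


# Toy example: one query with two token vectors, two pages of different lengths.
toy_queries = [np.array([[1.0, 0.0], [0.0, 1.0]])]
toy_pages = [
    np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    np.array([[1.0, 0.0]]),
]
toy_scores = late_interaction_scores(toy_queries, toy_pages)
print(toy_scores)
```

Because every query token is matched against every page token, pages can have different numbers of embedding vectors, which is why multi-vector retrieval keeps one embedding per token or image patch rather than pooling them into a single vector.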
Usage example
ColQwen2 can be found on the Hugging Face Hub.
```python
import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeo_and_juliet_1597.jpg/500px-Romeo_and_juliet_1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```
If you have issues loading the images with PIL, you can use the following code to create dummy images:
```python
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
```
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to int4.
```python
import requests
import torch
from PIL import Image

from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0-hf"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```
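As a rough intuition for what "representing the weights in a lower precision" means, the sketch below applies plain symmetric 4-bit quantization to a handful of weights. It is a simplification under stated assumptions: one absmax scale for the whole vector and a uniform integer grid, whereas the `nf4` type used above relies on a different, non-uniform set of levels. The function names are hypothetical.

```python
# Toy symmetric 4-bit quantization of a weight vector (illustrative names only).

def quantize_int4(weights):
    """Map floats to integer codes in [-7, 7] using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)

print(codes)
print(restored)
```

Each restored weight differs from the original by at most half a quantization step, which is the precision given up in exchange for storing 4 bits per weight.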
Published by LysandreJik 12 months ago
transformers - Patch release: v4.52.4
The following commits are included in that patch release:
- [qwen-vl] Look for vocab size in text config (#38372)
- Fix convert to original state dict for VLMs (#38385)
- [video utils] group and reorder by number of frames (#38374)
- [paligemma] fix processor with suffix (#38365)
- Protect get_default_device for torch<2.3 (#38376)
- [OPT] Fix attention scaling (#38290)
Published by LysandreJik about 1 year ago
transformers - Patch release v4.52.3
We had to protect the imports again after a series of bad events. Here are the two PRs for the patch:
- Fix tp error when torch distributed is already initialized (#38294) by @SunMarc
- Protect ParallelInterface (#38262) by @ArthurZucker and @LysandreJik
Published by ArthurZucker about 1 year ago
transformers - Patch release v4.52.2
We had to revert #37877 because of a missing flag that was overriding the device map. We re-introduced the changes because they allow native 3D parallel training in Transformers. Sorry everyone for the troubles! 🤗
- Clearer error on import failure (#38257) by @LysandreJik
- Verified tp plan should not be NONE (#38255) by @NouamaneTazi and @ArthurZucker
Published by Cyrilvallez about 1 year ago
transformers - v4.52.1: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, CSM, BitNet, LlamaGuard, TimesFM, MLCD, Janus, InternVL
New models
Qwen2.5-Omni
The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team, Alibaba Group.
The abstract from the technical report is the following:
We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.
Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.
In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
SAM-HQ
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.

SAM-HQ introduces several key improvements over the original SAM model:
- High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
- Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
- Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
- Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
- Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
- SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
- The model predicts binary masks with more accurate boundaries and better handling of thin structures
- Like SAM, the model performs better with input 2D points and/or input bounding boxes
- You can prompt multiple points for the same image and predict a single high-quality mask
- The model maintains SAM's zero-shot generalization capabilities
- SAM-HQ only adds ~0.5% additional parameters compared to SAM
- Fine-tuning the model is not supported yet
GraniteMoeHybrid
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
D-FINE
The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
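The idea of refining a probability distribution rather than regressing a fixed coordinate can be illustrated with a small sketch. It assumes, purely for illustration, five evenly spaced candidate offsets for one box edge and a pair of made-up per-layer residual logits; D-FINE itself uses its own bin weighting and learned logits, so treat every number and name below as hypothetical.

```python
import math

# Toy view of distribution-based box refinement (illustrative values only).


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_offsets):
    """Decode an edge offset as the expectation of the bin distribution."""
    probs = softmax(logits)
    return sum(p * o for p, o in zip(probs, bin_offsets))


# Assumed candidate offsets for a single box edge.
bin_offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Start from a uniform distribution, then let each decoder layer add residual logits.
logits = [0.0] * len(bin_offsets)
layer_residuals = [
    [0.0, 0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 0.0, 3.0],
]

offsets = [expected_offset(logits, bin_offsets)]
for residual in layer_residuals:
    logits = [l + r for l, r in zip(logits, residual)]
    offsets.append(expected_offset(logits, bin_offsets))

print(offsets)
```

In this toy run the decoded offset moves steadily toward the right-most bin as successive layers sharpen the distribution, instead of being predicted in a single step.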
CSM
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
BitNet
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
LlamaGuard
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It’s a dense 12B model pruned from the Llama 4 Scout model, and it can run on a single GPU (24 GB of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
TimesFM
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and predicts an output patch of a given length at each autoregressive step.
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
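The patch-based, autoregressive setup described for TimesFM can be sketched in a few lines of plain Python. This is only an illustration: `repeat_last_patch` is a stand-in for a trained model, it returns a patch of the same length as the input patches (a simplification), and none of these helper names come from the transformers API.

```python
# Toy patching and patch-wise autoregressive forecasting (illustrative only).

def to_patches(series, patch_length):
    """Split a series into consecutive non-overlapping patches.

    Trailing values that do not fill a whole patch are dropped in this sketch.
    """
    n_full = len(series) // patch_length
    return [series[i * patch_length:(i + 1) * patch_length] for i in range(n_full)]


def autoregressive_forecast(series, patch_length, horizon, predict_patch):
    """Append predicted patches one at a time until `horizon` values are produced."""
    context = list(series)
    forecast = []
    while len(forecast) < horizon:
        next_patch = predict_patch(to_patches(context, patch_length))
        forecast.extend(next_patch)
        context.extend(next_patch)
    return forecast[:horizon]


def repeat_last_patch(patches):
    """Stand-in for a trained model: predicts a copy of the most recent patch."""
    return list(patches[-1])


history = [1, 2, 3, 4, 5, 6, 7, 8]
patches = to_patches(history, patch_length=4)
forecast = autoregressive_forecast(
    history, patch_length=4, horizon=6, predict_patch=repeat_last_patch
)

print(patches)
print(forecast)
```

Predicting whole patches rather than single values means a horizon of 6 is covered in two decoding steps here instead of six.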
MLCD
The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.
Janus
The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by the DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can take both images and text as input and can generate both image and text output.
Note: The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
InternVL
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.
Kernel integration
We integrate some kernels in the transformers library via the kernels package: https://github.com/huggingface/kernels
We start with some kernels in the Llama model, and we iterate to identify the best performance optimizations
- Llama Kernel integration by @MekkCyber in #37092
- [kernels] use original forward at compile time by @gante in #37604
TP support
In the previous release, we added TP support in order to run distributed inference. However, this is not supported for all quantization methods. We are progressively adding support for it. Right now, only compressed-tensors, fp8 and fp8-fbgemm support it.
- Attention Quantization with FBGemm & TP by @MekkCyber in #37384
- Restrict & Explain tp_plan for FBgemm by @MekkCyber in #37404
Quantization
AutoRound
From the AutoRound contributors:
AutoRound is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps ... More details here: https://github.com/intel/auto-round
- Add AutoRound quantization support by @wenhuach21 in #37393
Quantization Documentation
We have added two new sections to better understand and get started with quantization:
- Quantization concept
- Selecting a quantization method
- Add "selecting a quantization method" doc by @DerekLiu35 in #37159
- Update quantization docs by @DerekLiu35 in #37439
GGUF
We've added GGUF support to Gemma3 family models.
- Add GGUF support to Gemma3 Text backbone by @Isotr0py in #37424
- Support loading Gemma3 QAT GGUF models by @Isotr0py in #37649
Fast image processors
Most Vision Models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup when processing images compared to PIL/NumPy functions, and support processing on both CPU and CUDA.
- See the list of updated models: https://github.com/huggingface/transformers/issues/36978
- Learn more about fast image processors: Fast Image Processors
- Add Fast Image Processor for Perceiver by @rootonchair in #37176
- Add Fast Image Processor for Flava by @rootonchair in #37135
- Add Fast Image Processor for LayoutLMv2 by @rootonchair in #37203
- Add Fast Image Processor for LayoutLMv3 by @rootonchair in #37201
- Add Fast Image Processor for Donut by @rootonchair in #37081
- Add Fast LeViT Processor by @keetrap in #37154
- Add Fast Mobilenet-V2 Processor by @keetrap in #37113
- Add Fast owlvit Processor by @keetrap in #37164
- Add ImageProcessorFast to BiT processor by @Yann-CV in #37180
- Add Fast Yolos Processor by @keetrap in #37292
- Add Fast Chinese-CLIP Processor by @keetrap in #37012
- Add Fast Conditional-DETR Processor by @keetrap in #37071
- Fix broken add-fast-image-processor CLI by @yonigozlan in #37499
- Bridgetower fast image processor by @rootonchair in #37373
- Add Fast Grounding-Dino Processor by @keetrap in #37108
- Add Fast PVT Processor by @keetrap in #37204
- Add Fast Image Processor for PoolFormer by @rootonchair in #37182
- Add Fast Image Processor for MobileNetV1 by @dmdaksh in #37111
- Fast image processor for VitMatte added and bug in slow version fixed by @henrikm11 in #37616
- [Fast Processor] BEiT by @ariG23498 in #37005
- Add Swin2SR ImageProcessorFast by @thisisiron in #37169
- Add Fast Image Processor for vilt by @devxaitist in #37304
AutoDocstring
The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code:
- [AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in https://github.com/huggingface/transformers/pull/33771
- More info on how to use @auto_docstring: AutoDocstring
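The general idea of building documentation from a function signature can be shown with a toy decorator. The sketch below is not the transformers implementation: `toy_auto_docstring` and the `COMMON_ARG_DOCS` table are hypothetical names, and the only real API used is the standard library's `inspect.signature`.

```python
import inspect

# Hypothetical shared descriptions for arguments that recur across models.
COMMON_ARG_DOCS = {
    "input_ids": "Indices of input sequence tokens in the vocabulary.",
    "attention_mask": "Mask to avoid performing attention on padding tokens.",
}


def toy_auto_docstring(func):
    """Append an Args section built from the function signature (toy version)."""
    lines = [(func.__doc__ or "").strip(), "", "Args:"]
    for name, param in inspect.signature(func).parameters.items():
        if name == "self":
            continue
        description = COMMON_ARG_DOCS.get(name, "(undocumented)")
        if param.default is inspect.Parameter.empty:
            default = ""
        else:
            default = f", defaults to {param.default!r}"
        lines.append(f"    {name}{default}: {description}")
    func.__doc__ = "\n".join(lines)
    return func


@toy_auto_docstring
def forward(input_ids, attention_mask=None):
    """Run the model."""
    return input_ids


print(forward.__doc__)
```

Keeping the shared argument descriptions in one table is what lets the modeling code carry only a one-line summary per function.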
Custom generate
We now support loading custom generate methods through model.generate. The custom generate methods can be stored on the Hub, enabling quick distribution of experiments regarding new caches, decoding methods, heuristics, ...
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# generate with `custom_generate` -> `generate` uses custom code
# note: calling the custom method prints "✨ using a custom generation method ✨"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
gen_out = model.generate(
    **inputs,
    custom_generate="transformers-community/custom_generate_example",
    trust_remote_code=True,
)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
```
You can find the docs here, and all custom generation methods by searching for the custom_generate tag.
- [generate] Run custom generation code from the Hub by @gante in #36405
Chat CLI
The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.
The following is now possible and recommended:
transformers chat Qwen/Qwen2.5-3B-Instruct
Additionally, almost any generate flag can now be passed as a positional argument, present and future, as opposed to being limited to a set of hardcoded flags, for example:
transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
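One way to picture how `key=value` arguments like the ones above could become typed generation settings is the small parser below. It is a sketch under stated assumptions, not the CLI's actual code: `parse_generate_flags` is a hypothetical helper, and values are interpreted with the standard library's `ast.literal_eval`, falling back to plain strings.

```python
import ast


def parse_generate_flags(args):
    """Turn ["do_sample=False", "max_new_tokens=10"] into a dict of typed values."""
    flags = {}
    for arg in args:
        key, sep, raw_value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        try:
            # Interpret Python literals such as False, 10 or 0.7.
            flags[key] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything that is not a literal is kept as a plain string.
            flags[key] = raw_value
    return flags


flags = parse_generate_flags(["do_sample=False", "max_new_tokens=10", "temperature=0.7"])
print(flags)
```

Parsing arbitrary keys instead of a fixed flag list is what allows new generation options to be passed without changing the command itself.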
- Transformers cli clean command by @LysandreJik in #37657
- [chat] clean code and add base help by @gante in #37892
- [chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047
Breaking changes
- 🚨 rm already deprecated pad_to_max_length arg by @itazap in #37617
- 🚨🚨🚨 Fix forward of Dinov2ForImageClassification for models with registers by @psandovalsegura in #37836
- 🔴 [VLM] Add base model without head by @zucchini-nlp in #37033
- 🔴 Video processors as a separate class by @zucchini-nlp in #35206
- 🚨🚨 Allow saving and loading multiple "raw" chat template files by @Rocketknight1 in #36588
- 🔴 Update CLIP vision attention to new attention interface by @molbap in #37498
- 🚨🚨 Setup -> setupclass conversion by @Rocketknight1 in #37282
Deprecations
The agents folder is finally removed from transformers in favour of using smolagents.
- [agents] remove agents 🧹 by @gante in #37368
We are moving away from torch 2.0 as it was released more than two years ago.
- byebye torch 2.0 by @ydshieh in #37277
General bugfixes and improvements
- fix flex attn when optional args aren't passed by @winglian in #37327
- fix llama4 training by @hiyouga in #37319
- Fix deepspeed with quantization by @Cyrilvallez in #37324
- Fix init empty weights without accelerate by @Cyrilvallez in #37337
- Use Python 3.9 syntax in examples by @cyyever in #37279
- Fix torchao usage by @jiqing-feng in #37034
- enable 2 llama UT cases on xpu by @yao-matrix in #37126
- Avoid build crashes when torch.version.xpu doesn't exist and fix Llama4 processor tests by @Rocketknight1 in #37346
- fix derived berts _init_weights by @Cyrilvallez in #37341
- Update translation template by @stevhliu in #37294
- Remove HQQ from caching allocator warmup by @Cyrilvallez in #37347
- updated model card for Mistral by @NahieliV in #37156
- Update model-card for DINOv2 by @shubham0204 in #37104
- Update falcon mamba card by @ricalanis in #37253
- Update Model card for GPT2 by @ash-01xor in #37101
- Improvements in Gemma2 model card by @devesh-2002 in #37076
- Update Model Card for Jamba by @ParagEkbote in #37152
- Add bnb to the list of supported quantization methods for LLama4 by @MekkCyber in #37348
- Updated Model-card for donut by @Logeswaran7 in #37290
- Remove unnecessary attr assignment by @tugsbayasgalan in #36837
- more fixes for post-training llama4 by @winglian in #37329
- Fixing flex attention for torch=2.6.0 by @SalmanMohammadi in #37285
- Multiple llama4 fixe by @ArthurZucker in #37353
- Expose blip2qformer by @alex-jw-brooks in #37254
- convert float for yarn related arguments in rope_scaling by @bzantium in #37139
- Use Python 3.9 syntax in tests by @cyyever in #37343
- A bit of cleaning 🧹🧹 by @Cyrilvallez in #37215
- fix deepspeed job by @ydshieh in #37284
- Set vision config to None for Gemma 1B conversion by @RyanMullins in #37366
- [llama 4] dynamic rope decorator by @gante in #37365
- Skip non-selected experts for mixtral and qwen2_moe by @Coco58323 in #32429
- [core] remove GenerationMixin inheritance by default in PreTrainedModel by @gante in #37173
- prune LM Head for USD by @jmamou in #36695
- fix(qwen): fix shape error when using tp by @KimmiShi in #36947
- Preserve requires_grad in pre quantized model by @jerryzh168 in #37354
- Update composition flag usage by @zucchini-nlp in #36263
- fix: llama4 conversion script no_rope_layers by @jmkuebler in #37359
- update deepspeed docker by @SunMarc in #37371
- Fix warning message for PEFT models in text-generation pipeline #36783 by @falconlee236 in #36887
- Apply torchfix to replace deprecated functions: _pytree._register_pytree_node and torch.cpu.amp.autocast by @bzhong-solink in #37372
- Fix some failing AWQ tests by @DerekLiu35 in #37383
- the fix that did not get in by @ArthurZucker in #37370
- handle torch version edge cases by @winglian in #37399
- Add warning when failed to acquire other user's lock at model download by @manueldeprada in #37395
- Handle torch ver in flexattn by @Kh4L in #37400
- Fix Llama4 offset by @Cyrilvallez in #37414
- Offloaded hybrid cache for Llama4 by @Cyrilvallez in #37401
- mark llama4 as not supported with fa2 by @winglian in #37416
- update kernels to 0.4.3 by @ArthurZucker in #37419
- Send trainer/fsdp/deepspeed CI job reports to a single channel by @ydshieh in #37411
- from_pretrained should handle xpu case by @sywangyi in #37382
- Allow rocm systems to run these tests by @ivarflakstad in #37278
- use rms_norm_eps for the L2Norm for Llama4 by @ArthurZucker in #37418
- [chat-template] Unify tests and clean up 🧼 by @zucchini-nlp in #37275
- Fix new failure reports not including anything other than tests/models/ by @ydshieh in #37415
- Quark Quantization gated repo by @MekkCyber in #37412
- Add image classifier donut & update loss calculation for all swins by @eljandoubi in #37224
- Correctly drop tokens in SwitchTransformer by @mario-aws in #37123
- Fix require_read_token by @MekkCyber in #37422
- fix: use mtime by default in Trainer._rotate_checkpoints with automatic fallback by @Jerry-Terrasse in #37260
- (Part 2) feat: allow for tp_size attr for tplizing the model by @kmehant in #37054
- Adding to self_comment_ci.yml by @MekkCyber in #37426
- [Feat] Support npu in modeling models by @duanjunwen in #37369
- Remove old code for PyTorch, Accelerator and tokenizers by @cyyever in #37234
- enhance require_deterministic_for_xpu by @yao-matrix in #37437
- Fixes: Corrects file path for CUDA kernels by @DonggeunYu in #37438
- Simplify soft dependencies and update the dummy-creation process by @LysandreJik in #36827
- Update-kernel-pin by @ArthurZucker in #37448
- Add moe kernels by @ArthurZucker in #37376
- Fix the test fetcher by @LysandreJik in #37452
- Remove triton mlp kernel, not compiling for some models by @MekkCyber in #37449
- [processor] clean up mulitmodal tests by @zucchini-nlp in #37362
- [Regression] Fix Quark quantized model loading after refactorization by @BowenBao in #37407
- prevent creating a view/leaf param for low rank optimizers w FSDP by @winglian in #37379
- Disable kernels for quantization by @MekkCyber in #37446
- Add weights_only=True to torch.load by @cyyever in #37062
- Add XPU case to is_torch_bf16_gpu_available by @cyyever in #37132
- nit: typing use Llama4TextConfig instead of Llama4Config by @kmehant in #37430
- Delete hubconf.py by @Rocketknight1 in #37455
- Fix typing issues with SigLip2 by @EricWiener in #37356
- fix: (llama4) fix no_split_modules to be picked up for fsdpv1 and v2 sharding by @kmehant in #37462
- make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM by @yao-matrix in #37480
- Remove fsspec dependency which isn't directly used by transformers by @cyyever in #37318
- Fix tests failed with gated repos. by @ydshieh in #37484
- [ci] fix doc builder by @zucchini-nlp in #37489
- Fixed broken links by @cypherpepe in #37466
- Detect and fix most _init_weights() issues - make it work for composite models by @Cyrilvallez in #37070
- [bug] deprecated deta load_cuda_kernel, MultiScaleDeformableAttention by @chagmgang in #37443
- Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 by @flukeskywalker in #37381
- Fix wrong argparse type in modular checker script by @seven-mile in #37472
- Fixing gated repo issues by @MekkCyber in #37463
- [qwen-omni] fix processor by @zucchini-nlp in #37493
- Remove deprecation warning for num_logits_to_keep by @Cyrilvallez in #37149
- Don't auto-assign reviewers when the author is in HF by @Rocketknight1 in #37500
- Detect and use device context manager or global device in from_pretrained by @Cyrilvallez in #37216
- Change default value of attn_temperature_tuning by @gmlwns2000 in #37501
- Llama4: remove redundant transpose of router_logits by @pbelevich in #37468
- fix: Restore explicit error surfacing for unexpected hub exceptions by @manueldeprada in #37525
- Fix missing return type for MLCD docs by @qubvel in #37527
- fix and enhance pipeline_webserver.md by @yao-matrix in #36992
- VDR task guide by @merveenoyan in #37485
- Update VITS model card by @princepride in #37335
- Refactor ColPali model documentation by @Soum-Soum in #37309
- enable 5 cases on XPU by @yao-matrix in #37507
- enable several cases on XPU by @yao-matrix in #37516
- enable test_offloaded_cache_implementation on XPU by @yao-matrix in #37514
- Fix BitsAndBytesConfig JSON serialization in TrainingArguments by @astefanutti in #37520
- enable 3 mpt test cases on XPU by @yao-matrix in #37546
- enable 6 rtdetrv2 cases on xpu by @yao-matrix in #37548
- More appropriate cuda warmup in resource-constrained hardware by @Cyrilvallez in #37550
- Fixes hqq by following a new path for bias parameter in pre_quantized models by @MekkCyber in #37530
- convert scale and zero to cuda when using HQQ backend by @phymhan in #37425
- Keep Quark loading through meta device by @BowenBao in #37538
- Refactor torchao docs by @MekkCyber in #37490
- add FlashAttentionKwargs and seq_idx to flat collator by @garrett361 in #36456
- docs(typo): Update ISSUES.md, fix a small typo by @ in #37542
- Fix device issue for tapas (with `as_tensor`) by @ydshieh in #37551
- Make Ignored Columns ValueError More Informative by @wbuchanan in #33299
- Fix TimesFm doc issue by @Cyrilvallez in #37552
- Run `test_can_load_with_global_device_set` using a subprocess by @ydshieh in #37553
- Fix pixel attention mask padding in smolvlm by @ManuelFay in #37497
- [vlm] adjust max length for special tokens by @zucchini-nlp in #37342
- Add EfficientNet Image PreProcessor by @zshn25 in #37055
- Fix Mamba2 Grouped SSD Support in the torch_forward Path by @cyang49 in #37533
- All models can be initialized on meta device by @Cyrilvallez in #37563
- [chat template] fix security vulnerability by @zucchini-nlp in #37523
- [qwen-vl] Standardize config by @zucchini-nlp in #37268
- [TimesFM] use the main revison instead of revision for integration test by @kashif in #37558
- Fix qwen2audio wanr -> warn by @alex-jw-brooks in #37559
- Small fix on context manager detection by @Cyrilvallez in #37562
- [phi4] update conversion by @zucchini-nlp in #37579
- docs: fix typo by @tonyksong in #37567
- Ensure positive warm-up size by @Cyrilvallez in #37581
- Update Phi4 converter by @Cyrilvallez in #37594
- Fix Quark quantization config by @MekkCyber in #37578
- Gaudi: Add the bf16 support for hpu by @yuanwu2017 in #37568
- Fix some GPU OOM after #37553 by @ydshieh in #37591
- remove run_third_party_device_tests by @jiqing-feng in #37445
- [Bugfix] Fix flash-attention func param mismatch and softmax_scale default value mistake on Ascend NPU by @FightingZhen in #37575
- Flag SpeechT5 flaky test by @molbap in #37587
- enable 6 gemma2 cases on XPU by @yao-matrix in #37564
- enable 6 modeling cases on XPU by @yao-matrix in #37571
- [Gemma3] compile ✨ by @gante in #37447
- Model debugger upgrades by @molbap in #37391
- [VLMs] use only `xxx_token_id` for multimodal tokens by @zucchini-nlp in #37573
- fix 2 encoder_decoder issues on XPU by @yao-matrix in #37572
- fix issue that some example with no trainer use accelerator.end_train… by @we1559 in #37435
- Deprecate modeling_utils.py classes by @qubvel in #37298
- Fixing the example in generation strategy doc by @jeasinema in #37598
- chore: update model card for SigLIP by @saswatmeher in #37585
- Fix InternVL attention when using qk_norm (38B and 78B) by @yonigozlan in #37620
- Remove torchvision requirement from AutoImageProcessor by @LysandreJik in #37457
- Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor by @alex-jw-brooks in #37625
- fix link in kv_cache.md by @manueldeprada in #37652
- Update longformer.md by @JihadHammoud02 in #37622
- Refactor phi doc by @JihadHammoud02 in #37583
- Fix Qwen2.5-Omni get_chunked_index chunking functionality by @imkero in #37631
- [fix] make legacy bnb code work by @cyr0930 in #37331
- [fix gemma] Set default value for output_attentions parameter in Gemma2 and Gemma… by @chenin-wang in #37633
- Restructure torchao quantization examples by @jerryzh168 in #37592
- Add test to ensure unknown exceptions reraising in utils/hub.py::cached_files() by @manueldeprada in #37651
- [test] update `test_past_key_values_format` by @gante in #37614
- [tests] Stricter generate + compilation test -- no recompilations allowed by @gante in #37629
- Fix ValueError when eval_do_concat_batches=False with examples by @jeffhataws in #37621
- Fixes #37219 : RecurrentGemma crashes for inputs longer than sliding window length by @manueldeprada in #37613
- Introduce GradientCheckpointingLayer by @qubvel in #37223
- [qwen-omni] fix training by @zucchini-nlp in #37517
- Fix duplicated weights in fp8 quantization by @Cyrilvallez in #37667
- Correct warm-up with fp8 by @Cyrilvallez in #37670
- Fixing quantization tests by @MekkCyber in #37650
- Fix autoround docs by @SunMarc in #37675
- Fix no_split_modules for Llama4 pretrained models by @astefanutti in #37673
- Refactor bitsandbytes doc by @MekkCyber in #37668
- enable mllama cases on xpu by @yao-matrix in #37644
- enable 6 granite cases on xpu by @yao-matrix in #37569
- [cleanup] remove old scripts in `/scripts` 🧹 🧹 by @gante in #37676
- [docs] only build `en` docs in push CI by @gante in #37677
- typo update in the parameter name by @LunaticMaestro in #37655
- [Docs] Move models to appropriate section by @NielsRogge in #37338
- Add counters for dataset classes by @jiangyukunok in #37636
- enable blip2 and emu3 cases on XPU by @yao-matrix in #37662
- 🌐 [i18n-KO] Translated `siglip.md` to Korean by @devxaitist in #37145
- Updated model card for mbart and mbart50 by @Vishesh-Mistry in #37619
- fix: remove classmethod from `Qwen2_5OmniConfig.get_text_config` by @shahruk10 in #37690
- enable cpu offloading for Bark on xpu by @yao-matrix in #37599
- Pin torch == 2.6 on PR CI docker images for now by @ydshieh in #37695
- [cleanup] remove `/model_cards` 🧹 🧹 by @gante in #37685
- Add maintainers for ROCm/Intel XPU/Ascend NPU by @Rocketknight1 in #37678
- [CI] add back `sacrebleu` (and document why) by @gante in #37700
- TransfoXL is deprecated, don't keep it in tested examples! by @Rocketknight1 in #37707
- [internvl] fix chat template by @zucchini-nlp in #37656
- Qwen 2.5 Omni: apply video defaults by @pcuenca in #37660
- [tests, `qwen2_5_omni`] fix flaky tests by @gante in #37721
- Process inputs directly in apply_chat_template in image-text-to-text pipeline by @yonigozlan in #35616
- enable 4 test_trainer cases on XPU by @yao-matrix in #37645
- Fix Aria tests by @jiqing-feng in #37444
- Fix inference bugs in Qwen2.5 Omni by @BakerBunker in #37701
- Fix torchao doc examples by @MekkCyber in #37697
- [tests] fix `test_nemotron_8b_generation_sdpa` by @faaany in #37665
- Make sure torch_is_available before using torch.distributed by @MekkCyber in #37693
- [VLMs] fix flash-attention tests by @zucchini-nlp in #37603
- fix: learning_rate logged as tensor causing save issue with deepspeed by @NanoCode012 in #37704
- Fix `embeds_to_talker` device in Qwen2.5-Omni by @BakerBunker in #37739
- Correctly raise errors when downloading tokenizer files by @Cyrilvallez in #37740
- [performance_optim] define flash attention mask on NPU device directly by @FightingZhen in #37698
- Skip all `AriaForConditionalGenerationIntegrationTest` on `T4` by @ydshieh in #37746
- Update `MllamaForConditionalGenerationIntegrationTest` by @ydshieh in #37750
- Expand quantized data type support for tensor parallelism by @amd-xiaoyu12 in #37719
- [cache] fix `HybridCache` init when `device` is passed by @gante in #37718
- `GPT2Model` StaticCache support by @poedator in #35761
- [generate] skip compilation on cpu offload by @gante in #37709
- updated hidden_features for FlaxDinov2SwiGLUFFN in Dinov2 by @premmurugan229 in #37747
- Fix qwen2_5 get_rope_index tensor device locations by @rphmeier in #37597
- [generate] fix default autocompile case on gpu by @gante in #37756
- Fix wrong input shapes in doc-string of models by @kkew3 in #37729
- Refine parameter type annotations by @flashJd in #37666
- Fix tied weight loading with TP and loading sub state_dicts by @Cyrilvallez in #37758
- Fix load of rng state for resuming training from checkpoint by @winglian in #37162
- Fix typos in comments by @co63oc in #37694
- [deps] pin max `torch` version by @gante in #37760
- Guard DeepSpeed imports by @lewtun in #37755
- Fix auto-round hfoption by @MekkCyber in #37759
- Update model card for Gemma by @afafelwafi in #37674
- 🌐 [i18n-KO] Translated `roberta.md` to Korean by @garongkim in #37069
- [causal mask] fix preparation with multi-gpu by @zucchini-nlp in #37612
- unpin pytest<8 by @ydshieh in #37768
- Align gpt2 mask preparation to #37612 by @Cyrilvallez in #37787
- Fix typos in strings and comments by @co63oc in #37784
- Fix tensor parallel with non-floating dtypes by @Cyrilvallez in #37790
- Force torch>=2.6 with torch.load to avoid vulnerability issue by @Cyrilvallez in #37785
- fix mpt test of different outputs from cuda by @jiqing-feng in #37691
- [i18n-KO] Translated `keypoint_detection.md` to Korean by @rlaalsrl0922 in #36649
- chore: update SigLIP2 model card by @saswatmeher in #37624
- fix performance issue in convert_ids_to_tokens by @martin-harmonic in #37773
- Fix error message in `hub.py` by @srai9 in #37796
- Gemma3 is Torch Exportable by @guangy10 in #37728
- Fix the fsdp config cannot work issue. by @yuanwu2017 in #37549
- Define warmup allocator for torchao quantization by @MekkCyber in #37764
- Fix typos in strings and comments by @co63oc in #37799
- [doc] fix the code examples in qwen doc by @jiangyukunok in #37803
- Fix: Correct tensor shape comment in Mamba modeling by @ShadyPi in #37801
- [RT-DETR] Improve docs by @NielsRogge in #37814
- FIX: Faulty PEFT tests by @BenjaminBossan in #37757
- Add Optional to remaining types by @cyyever in #37808
- Fix error of HPU TP by @yuanwu2017 in #37782
- change XLA deprecated api by @SunMarc in #37741
- [config] revert #37603 by @zucchini-nlp in #37821
- [modular] Fix the prefix-based renaming if the old and new model share a common name suffix by @Cyrilvallez in #37829
- [tests] fix flaky pattern in `test_generate_continue_from_past_key_values` by @gante in #37724
- [tests] reorganize cache tests and clean memory between tests by @gante in #37684
- Revert change that breaks on Torch 2.1 by @Rocketknight1 in #37531
- Fix check of unecessary packages (issue #37626) by @HichTala in #37825
- Fix cache get item return type hints by @ChengLyu in #37847
- Fix Bitnet tokenizer in pipeline by @MekkCyber in #37861
- docs: Details for ambigious channel dimension assignment by @yaner-here in #37600
- Processor chat template: pass custom kwargs by @pcuenca in #37852
- Add Intel Gaudi doc by @regisss in #37855
- 🌐 [i18n-KO] Translated `electra.md` to Korean by @Kim-Ju-won in #36763
- Update modeling_llama4.py by @monk1337 in #37841
- Skip is_flaky tests in the CI by @Rocketknight1 in #37723
- Allow override inputs to export recipe by @guangy10 in #37508
- enable internvl UTs on XPU by @yao-matrix in #37779
- Llama Guard updates by @pcuenca in #37872
- update Clean_up_tokenization_spaces typos. by @zhanluxianshen in #37865
- fix error for register_pytree_node in torch2.1.0 and fix bf16 assertion in xpu and npu by @jiaqiw09 in #37839
- make sure lr is not a tensor by @winglian in #37881
- Fix qwen2-vl-docs. by @zhanluxianshen in #37879
- uniformize kwargs for VisionTextDualEncoder by @tibor-reiss in #34563
- Fix: reassign in qwen3 moe model by @linkedlist771 in #37848
- update comment in image_processing_base.py to reference image_process… by @arjunaskykok in #37864
- Support FlaxPreTrainedModel to load model checkpoint from local subfolder safetensors by @Melody-coder923 in #37732
- [tests] Test all cache implementations by @gante in #37873
- [tests] reset logs in `torch.compile` test by @gante in #37894
- Fix Qwen3 tp plan with FP8 by @MekkCyber in #37871
- Enhance documentation to explain chat-based few-shot prompting by @MostHumble in #37828
- Support `AOPerModuleConfig` and `include_embedding` by @jerryzh168 in #37802
- fixed gemma3 collection path pointing to llama 2 collection. by @dmgcsilva in #37899
- Fix typos in strings and comments by @co63oc in #37910
- Improve performance of `load_state_dict` by @woct0rdho in #37902
- 🌐 [i18n-KO] Translated `gpu_selection.md` to Korean by @nsbg in #36757
- Add usage example for DINOv2 by @baldassarreFe in #37398
- Aligning modling code for GPT2 to work with vLLM (fallback) by @ariG23498 in #36934
- Break weight tying when quantizing input embedding by @jerryzh168 in #37905
- [docs] logits docstring by @gante in #37929
- [D-FINE] Update names by @NielsRogge in #37957
- More fault tolerant notification service by @ivarflakstad in #37924
- [core] reuse unused reserved cuda memory when loading models by @gante in #37920
- Use T4 single GPU runner with more CPU RAM by @ydshieh in #37961
- [generate] Fix `vocab_size` access for multimodal models by @kurzdev in #37937
- Fix incorrect type annotation in get_auxiliary_logits by @Tanuj-rai in #37955
- [Ready to Merge][HFQuantizer] Squelch pydantic warnings by @kylesayrs in #37726
- Add GraniteMoeHybrid support for 4.0 by @Ssukriti in #37658
- add xpu memory check by @faaany in #37969
- [tests] Smaller model in slow cache tests by @gante in #37922
- [llava] one pixel is missing from padding when length is odd by @cyr0930 in #37819
- add job links to new model failure report by @ydshieh in #37973
- fix docs serving typos. by @zhanluxianshen in #37936
- Small typo lines 47 and 199 perf_infer_gpu_one.md by @nlhmnlhmnlhm in #37938
- Fix typos by @omahs in #37978
- [speech2text] fix init of sinusoidal embeddings by @gante in #37931
- Fix typo by @lkm2835 in #37964
- enable xpu in test_trainer by @yao-matrix in #37774
- fix FSDP + torch.compile bug when saving pretrained model by @Joaquinecc in #37725
- Enable granite speech 3.3 tests by @alex-jw-brooks in #37560
- Fix donut backtracking by @Rocketknight1 in #37788
- Fix Qwen models export with torch 2.7 by @guangy10 in #37985
- [offload] respect `max_memory` argument when factoring in unused reserved memory by @gante in #37982
- make aya vision 5 integration tests pass on xpu by @yao-matrix in #37990
- [chat template] separate jinja logic from tokenizers by @zucchini-nlp in #37602
- remove duplicate code by @kaixuanliu in #37991
- Add a check to import_utils.py to allow for use of faiss_gpu installation by @Fiona-Waters in #37997
- [CSM] tiny fix on generation by @eustlb in #38001
- Fix `pad` image transform for batched inputs by @sebasv in #37544
- Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model by @uminaty in #37960
- Enable RUF013 to enforce optional typing by @cyyever in #37266
- Fix `Optional` typing by @qubvel in #38018
- Print commit SHA on slack message for new model notification. by @ydshieh in #38019
- [CI] remove duplicated message on GH comment to run slow tests by @gante in #37970
- [caches] Raise exception on offloaded static caches + multi device by @gante in #37974
- Skip `test_push_to_hub_with_saves_each_epoch` for now by @ydshieh in #38022
- Fix incorrect installation instructions (for issue #37476) by @Zephyr271828 in #37640
- Fix wording in `torchscript.md` by @Madghostek in #38004
- [VLMs] support attention backends by @zucchini-nlp in #37576
- make `test_speculative_decoding_non_distil` device-agnostic by @faaany in #38010
- enable mamba2 integration cases on xpu by @yao-matrix in #38006
- update bnb tests by @jiqing-feng in #38011
- [`AutoDocstring`] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in #33771
- fix document masking for chunked attention by @winglian in #37429
- make mistral3 pass on xpu by @yao-matrix in #37882
- enable utils test cases on XPU by @yao-matrix in #38005
- [Temporary] Log some information in some pytest/pluggy internal places by @ydshieh in #37996
- Trigger CircleCI via GitHub Actions when `ready for review` by @ydshieh in #37885
- Disable `Trigger CircleCI via GitHub Actions when ready for review` by @ydshieh in #38038
- Do not erase a cache_position passed explicitly to generate(), if there is one by @FremyCompany in #37986
- Support for version spec in requires & arbitrary mismatching depths across folders by @LysandreJik in #37854
- Re-Enable Trigger CircleCI via GitHub Actions when "ready for review" (by @ydshieh in #37885)
- Fix reduce-labels in BEIT Fast Image Processor by @simonreise in #38042
- Fix cache update! by @Cyrilvallez in #38046
- Fix linalg.norm for CovnNextV2 by @qubvel in #38015
- enable generation fsdp/utils cases on XPU by @yao-matrix in #38009
- fix(conversion): Fix size mismatch error during TF->PT model loading by @arjunaskykok in #38014
- [VLM] fix loading issues by @zucchini-nlp in #38051
- Fix OneFormer integration test by @qubvel in #38016
- Add AMD expectation to test_gpt2_sample by @ivarflakstad in #38079
- docs: fix md style by @imba-tjd in #38057
- Fix mt5 test on AMD devices by @ivarflakstad in #38081
- chore(qwen2): display warning log only when sliding window attention … by @edwardzjl in #36316
- fix the inconsist docstring in apply_chat_template by @lenijwp in #38069
- Fix tot update in trainer by @efsotr in #37923
- update seed_worker to set seed based on worker_id and rank by @gathierry in #37980
- uninstall `kernels` from docker images by @ydshieh in #38083
- Refactor image processor phi4 by @yonigozlan in #36976
- update `require_read_token` by @ydshieh in #38093
- add timeout for downloading the `librispeech_asr` dataset by @faaany in #38073
- fix: Propagate `lr_scheduler_kwargs` options to create LR Scheduler when LayerWiseDummyOptimizer is used by @BlackNoodle in #34559
- Disable report callbacks for certain training tests by @ivarflakstad in #38088
- [smolvlm] skip the test by @zucchini-nlp in #38099
- Fix bug in prefill_chunk_size that ignores disable_compile flag by @xmarva in #38067
- Fix `past_key_values` type hint in model output types by @ChengLyu in #37953
- [bug] fix llava processor to calculate unpadding size correctly by @cyr0930 in #37988
- fix `check_bad commit.py` gives wrong results by @ydshieh in #38107
- Fix InternVL interpolate_pos_encoding and add to video_processing_auto by @yonigozlan in #38092
- [CSM] update test for t4 runners by @eustlb in #38110
- Add style bot by @SunMarc in #38102
- Fix description and formatting errors in code docs by @bilibili12433014 in #38074
- enable finegrained_fp8 and granite_speech cases on XPU by @yao-matrix in #38036
- [video processor] fix tests by @zucchini-nlp in #38104
- Fix temporal padding in Qwen2VLImageProcessor when the number of frames is not divisible by temporal_patch_size by @ritwickchaudhry in #38076
- Fix auto batch size finder test by @ivarflakstad in #38125
- Add config validation and style tweaks by @Kirire in #37589
- Update trainer.md by @guspuffygit in #38113
- [docs] add uv installation instructions for source builds by @arjunaskykok in #37968
- Add `manueldeprada` to `run_slow` whitelist by @manueldeprada in #38126
- enable d_fine finetuning properly by @SangbumChoi in #37962
- Fix incorrect attention mask truncate in WhisperFlashAttention2 by @OliBomby in #36477
- [Qwen3] Qwen3 MoE add tp plan for expert mlps by @hgt312 in #38135
- enable csm integration cases on xpu, all passed by @yao-matrix in #38140
- Remove head mask in generative models by @zucchini-nlp in #35786
- Hotfix: Flash Attention 2 support in Pixtral by @uminaty in #38146
- enable trainer test cases on xpu by @yao-matrix in #38138
- disable deepspeed when setting up fake trainer by @winglian in #38101
- Omit creation of positional IDs within ESM if applicable by @simonlevine in #38089
- [FIX] Save speed metrics to logs by @pavelgein in #38136
- enable autoround cases on XPU by @yao-matrix in #38167
- Include output embedding as well with `include_embedding` flag by @jerryzh168 in #37935
- Fix Qwen2.5 Omni `SinusoidsPositionEmbedding` precision by @BakerBunker in #38151
- Add optional RMSNorm support to BitNet quantization (config + layers) by @Codys12 in #38087
- [VLMs] add helpers to get multimodal encodings by @zucchini-nlp in #37743
- Bart: new cache format by @zucchini-nlp in #35314
- clean autoawq cases on xpu by @yao-matrix in #38163
- Disable `Trigger CircleCI by ready for review` by @ydshieh in #38171
- Disable `convert to draft` workflow by @ydshieh in #38177
- remove some commands from `fetch_tests` CircleCI job by @ydshieh in #38176
- Feat: add warnings for unused keys and rules in tensor parallel by @S1ro1 in #37893
- [ESM] Add flash-attention-2 backend for ESM-2 by @pstjohn in #38023
- Add args support for fast image processors by @yonigozlan in #37018
- Fix import torchao.prototype.low_bit_optim since torchao v0.11 by @baptxste in #38174
- fix bug in distributed loss test by @techkang in #38166
- [tests] remove `test_sdpa_equivalence` (redundant) by @gante in #37911
- Add Granite Speech Support by @alex-jw-brooks in #36801
- Add glm4 by @ArthurZucker in #37388
- Add Qwen2.5-Omni by @BakerBunker in #36752
- Add MLCD model by @tanhuajie in #36182
- Add TimesFM Time Series Forecasting Model by @jinan-zhou in #34082
- Add Janus model by @yaswanth19 in #36053
- Add InternVL (2.5 MPO) by @yonigozlan in #35968
- Add Bitnet model by @MekkCyber in #37742
- Samhq model addition by @sushmanthreddy in #35147
- Add D-FINE Model into Transformers by @VladOS95-cyber in #36261
- Add CSM model by @eustlb in #36719
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @cyyever
  - Use Python 3.9 syntax in examples (#37279)
  - Use Python 3.9 syntax in tests (#37343)
  - Remove old code for PyTorch, Accelerator and tokenizers (#37234)
  - Add weights_only=True to torch.load (#37062)
  - Add XPU case to is_torch_bf16_gpu_available (#37132)
  - Remove `fsspec` dependency which isn't directly used by transformers (#37318)
  - Add Optional to remaining types (#37808)
  - Enable RUF013 to enforce optional typing (#37266)
- @yao-matrix
  - enable 2 llama UT cases on xpu (#37126)
  - enhance require_deterministic_for_xpu (#37437)
  - make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM (#37480)
  - fix and enhance pipeline_webserver.md (#36992)
  - enable 5 cases on XPU (#37507)
  - enable several cases on XPU (#37516)
  - enable `test_offloaded_cache_implementation` on XPU (#37514)
  - enable 3 mpt test cases on XPU (#37546)
  - enable 6 rtdetrv2 cases on xpu (#37548)
  - enable 6 gemma2 cases on XPU (#37564)
  - enable 6 modeling cases on XPU (#37571)
  - fix 2 encoder_decoder issues on XPU (#37572)
  - enable mllama cases on xpu (#37644)
  - enable 6 granite cases on xpu (#37569)
  - enable blip2 and emu3 cases on XPU (#37662)
  - enable cpu offloading for Bark on xpu (#37599)
  - enable 4 test_trainer cases on XPU (#37645)
  - enable internvl UTs on XPU (#37779)
  - enable xpu in test_trainer (#37774)
  - make aya vision 5 integration tests pass on xpu (#37990)
  - enable mamba2 integration cases on xpu (#38006)
  - make mistral3 pass on xpu (#37882)
  - enable utils test cases on XPU (#38005)
  - enable generation fsdp/utils cases on XPU (#38009)
  - enable finegrained_fp8 and granite_speech cases on XPU (#38036)
  - enable csm integration cases on xpu, all passed (#38140)
  - enable trainer test cases on xpu (#38138)
  - enable autoround cases on XPU (#38167)
  - clean autoawq cases on xpu (#38163)
- @alex-jw-brooks
  - Expose blip2_qformer (#37254)
  - Add Granite Speech Support (#36801)
  - Fix qwen2audio wanr -> warn (#37559)
  - Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor (#37625)
  - Enable granite speech 3.3 tests (#37560)
- @BakerBunker
  - Add Qwen2.5-Omni (#36752)
  - Fix inference bugs in Qwen2.5 Omni (#37701)
  - Fix `embeds_to_talker` device in Qwen2.5-Omni (#37739)
  - Fix Qwen2.5 Omni `SinusoidsPositionEmbedding` precision (#38151)
- @rootonchair
  - Add Fast Image Processor for Perceiver (#37176)
  - Add Fast Image Processor for Flava (#37135)
  - Add Fast Image Processor for LayoutLMv2 (#37203)
  - Add Fast Image Processor for LayoutLMv3 (#37201)
  - Add Fast Image Processor for Donut (#37081)
  - Bridgetower fast image processor (#37373)
  - Add Fast Image Processor for PoolFormer (#37182)
- @flukeskywalker
  - Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 (#37381)
- @keetrap
  - Add Fast LeViT Processor (#37154)
  - Add Fast Mobilenet-V2 Processor (#37113)
  - Add Fast owlvit Processor (#37164)
  - Add Fast Yolos Processor (#37292)
  - Add Fast Chinese-CLIP Processor (#37012)
  - Add Fast Conditional-DETR Processor (#37071)
  - Add Fast Grounding-Dino Processor (#37108)
  - Add Fast PVT Processor (#37204)
- @tanhuajie
  - Add MLCD model (#36182)
- @jinan-zhou
  - Add TimesFM Time Series Forecasting Model (#34082)
- @yaswanth19
  - Add Janus model (#36053)
- @saswatmeher
  - chore: update model card for SigLIP (#37585)
  - chore: update SigLIP2 model card (#37624)
- @cyr0930
  - [fix] make legacy bnb code work (#37331)
  - [llava] one pixel is missing from padding when length is odd (#37819)
  - [bug] fix llava processor to calculate unpadding size correctly (#37988)
- @wenhuach21
  - Add AutoRound quantization support (#37393)
- @devxaitist
  - 🌐 [i18n-KO] Translated `siglip.md` to Korean (#37145)
  - Add Fast Image Processor for vilt (#37304)
  - 🌐 [i18n-KO] Translated
- @co63oc
  - Fix typos in comments (#37694)
  - Fix typos in strings and comments (#37784)
  - Fix typos in strings and comments (#37799)
  - Fix typos in strings and comments (#37910)
- @guangy10
  - Gemma3 is Torch Exportable (#37728)
  - Allow override inputs to export recipe (#37508)
  - Fix Qwen models export with torch 2.7 (#37985)
- @sushmanthreddy
  - Samhq model addition (#35147)
- @VladOS95-cyber
  - Add D-FINE Model into Transformers (#36261)
- @Ssukriti
  - Add GraniteMoeHybrid support for 4.0 (#37658)
Published by LysandreJik about 1 year ago
transformers - CSM (based on v4.51.3)
A new model is added to transformers: CSM
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-CSM-preview.
To install this version, run the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-CSM-preview
If fixes are needed, they will be applied to this tag, so this installation can be considered stable and will keep improving.
As the tag name implies, this is a preview of the CSM model. The tag is a snapshot of the main branch and does not follow semantic versioning. The model will be included in the next minor release: v4.52.0.
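Before running the examples, it can help to confirm that the installed build actually exposes the CSM classes. The following is a minimal sketch, not an official check: `has_csm_support` is a hypothetical helper name, and it only assumes the class name `CsmForConditionalGeneration` that appears in the usage examples of these notes.

```python
import importlib
import importlib.util


def has_csm_support(module_name: str = "transformers") -> bool:
    """Return True if `module_name` is importable and exposes the CSM model class.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    if importlib.util.find_spec(module_name) is None:
        return False  # the package is not installed at all
    module = importlib.import_module(module_name)
    # The class name is taken from the usage examples in these release notes.
    return hasattr(module, "CsmForConditionalGeneration")
```

Calling `has_csm_support()` after installing the preview tag should return `True`; on an older release that predates CSM it is expected to return `False`.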
CSM
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
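The two-decoder layout described above can be pictured with a toy loop. This is only an illustrative sketch: `backbone_step` and `depth_step` are hypothetical stand-ins that return random tokens, and `NUM_CODEBOOKS` and `CODEBOOK_SIZE` are assumed values, not the real model or codec settings. It shows the control flow only: one backbone call per audio frame for the first codebook token, then the depth decoder fills in the remaining codebooks of that frame.

```python
import random

NUM_CODEBOOKS = 4   # assumed for illustration; the real codec setting may differ
CODEBOOK_SIZE = 16  # assumed toy vocabulary size per codebook


def backbone_step(history: list, rng: random.Random) -> int:
    """Stand-in for the backbone decoder: first codebook token of the next frame."""
    return rng.randrange(CODEBOOK_SIZE)


def depth_step(frame_so_far: list, rng: random.Random) -> int:
    """Stand-in for the depth decoder: next codebook token within the current frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        frame = [backbone_step(frames, rng)]   # codebook 0 comes from the backbone
        while len(frame) < NUM_CODEBOOKS:      # remaining codebooks come from the depth decoder
            frame.append(depth_step(frame, rng))
        frames.append(frame)
    return frames


frames = generate_frames(num_frames=3)
print(len(frames), len(frames[0]))  # prints: 3 4
```

In the real model each frame of codebook tokens would then be decoded back into audio by the Mimi codec; that step is omitted here.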
Usage example
CSM can be found on the Hugging Face Hub.
Without Conversational Context
CSM can be used to simply generate speech from a text prompt:
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]The past is just a story we tell ourselves."  # [0] for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
```
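The conversation structure passed to the chat template above is plain Python data, so it can be built programmatically. A small sketch, assuming text-only turns; `build_conversation` is a hypothetical helper name and not part of the processor API:

```python
def build_conversation(turns):
    """Turn (speaker_id, text) pairs into the list-of-dicts structure shown above."""
    conversation = []
    for speaker_id, text in turns:
        conversation.append(
            {
                "role": str(speaker_id),  # roles are speaker ids written as strings
                "content": [{"type": "text", "text": text}],
            }
        )
    return conversation


conversation = build_conversation([(0, "The past is just a story we tell ourselves.")])
print(conversation[0]["role"])  # prints: 0
```

The resulting list has the same shape as the hand-written `conversation` in the example above.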
With Conversational Context
CSM can be used to generate speech given a conversation, allowing consistency in the voices and content-aware generation:
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```
Batched Inference
CSM supports batched inference!
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```
Making The Model Go Brrr
CSM supports full-graph compilation with CUDA graphs!
```python
import torch
import copy
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "eustlb/csm-1b"
device = "cuda"

# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use static cache, automatically enabling torch compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250  # big enough to avoid recompilation
model.generation_config.max_new_tokens = None  # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}


# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")


# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "=" * 50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

print("\n" + "=" * 50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "=" * 50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("=" * 50)
```
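The CUDA-event timer in the example above requires a GPU. For quick checks on a machine without CUDA, a wall-clock variant can be sketched with `time.perf_counter`. The class name `WallClockTimer` and its `elapsed` attribute are illustrative assumptions and are not part of transformers.

```python
import time


class WallClockTimer:
    """Context manager that measures wall-clock time with time.perf_counter."""

    def __init__(self, name="Execution"):
        self.name = name
        self.elapsed = None  # seconds, filled in on exit

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name} time: {self.elapsed:.4f} seconds")


# Usage: time any block of Python code
with WallClockTimer("Sum of squares") as timer:
    total = sum(i * i for i in range(100_000))
```

Wall-clock timing around asynchronous GPU work mostly measures kernel launch time, so for GPU generation the CUDA events plus `torch.cuda.synchronize()` remain the more accurate choice.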
Training
CSM Transformers integration supports training!
```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model.train()

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    output_labels=True,
).to(device)

# forward pass returns the loss when labels are provided
out = model(**inputs)
out.loss.backward()
```
- Python
Published by LysandreJik about 1 year ago
transformers - GraniteMoeHybrid (based on v4.51.3)
A new model is added to transformers: GraniteMoeHybrid
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-GraniteMoeHybrid-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-GraniteMoeHybrid-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the GraniteMoeHybrid model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
GraniteMoeHybrid
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
Usage example
GraniteMoeHybrid can be found on the Huggingface Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text and move it to the model's device
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
```
- Python
Published by LysandreJik about 1 year ago
transformers - D-FINE (based on v4.51.3)
A new model is added to transformers: D-FINE
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-D-FINE-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-D-FINE-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the D-FINE model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
D-FINE
The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
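The idea of refining a probability distribution instead of regressing a fixed coordinate can be illustrated with a toy sketch. It is a simplified stand-in, not the D-FINE implementation: it assumes evenly spaced offset bins and hand-picked residual logits, whereas the paper defines its own weighting and learns the residuals. All names and numbers below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_values):
    """Decode one box-edge offset as the expectation of a discrete distribution."""
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))


# Hypothetical setup: 5 evenly spaced offset bins (in pixels) for one box edge
bin_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

# The first layer starts from a flat distribution; later layers add residual logits
logits = [0.0, 0.0, 0.0, 0.0, 0.0]
residuals_per_layer = [
    [0.0, 0.0, 0.5, 1.0, 0.5],  # shifts probability mass toward positive offsets
    [0.0, 0.0, 0.0, 2.0, 0.0],  # sharpens the distribution around the +1 bin
]

offsets = [expected_offset(logits, bin_values)]
for residual in residuals_per_layer:
    logits = [l + r for l, r in zip(logits, residual)]
    offsets.append(expected_offset(logits, bin_values))

print([round(o, 3) for o in offsets])
```

Each layer only nudges the logits of the previous one, so the decoded offset moves gradually from the flat-distribution value of zero toward the bin that accumulates the most evidence.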
Usage example
D-FINE can be found on the Huggingface Hub.
```python
>>> import torch
>>> from transformers.image_utils import load_image
>>> from transformers import DFineForObjectDetection, AutoImageProcessor

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = load_image(url)

>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")

>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)

>>> for result in results:
...     for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
...         score, label = score.item(), label_id.item()
...         box = [round(i, 2) for i in box.tolist()]
...         print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
```
- Python
Published by LysandreJik about 1 year ago
transformers - SAM-HQ (based on v4.51.3)
A new model is added to transformers: SAM-HQ
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-SAM-HQ-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-SAM-HQ-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the SAM-HQ model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
SAM-HQ
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.

SAM-HQ introduces several key improvements over the original SAM model:
- High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
- Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
- Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
- Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
- Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
- SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
- The model predicts binary masks with more accurate boundaries and better handling of thin structures
- Like SAM, the model performs better with input 2D points and/or input bounding boxes
- You can prompt multiple points for the same image and predict a single high-quality mask
- The model maintains SAM's zero-shot generalization capabilities
- SAM-HQ only adds ~0.5% additional parameters compared to SAM
- Fine-tuning the model is not supported yet
Usage example
SAM-HQ can be found on the Huggingface Hub.
```python
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```
You can also process your own masks alongside the input images in the processor to be passed to the model:
```python
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("1")
input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```
- Python
Published by LysandreJik about 1 year ago
transformers - BitNet (based on v4.51.3)
A new model is added to transformers: BitNet
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-BitNet-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-BitNet-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the BitNet model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
BitNet
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
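To give a feel for what low-bit weights mean in practice, here is a minimal sketch of absmean ternary quantization, the scheme described for BitNet b1.58 (the "b1.58" in the checkpoint name refers to weights restricted to -1, 0, and 1). This is an illustration in plain Python under that assumption; it is not the transformers implementation, and the function name is hypothetical.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, 1} using a single absmean scale."""
    # The scale is the mean absolute value of the weights
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1]
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
q, scale = absmean_ternary_quantize(weights)

# Dequantize to see the approximation the ternary weights represent
dequantized = [v * scale for v in q]
print(q)  # [1, 0, 1, -1, 0, -1]
```

Small weights collapse to zero and large ones saturate at plus or minus one, which is why a ternary weight matrix can be applied with additions and subtractions plus one scale multiplication.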
Usage example
BitNet can be found on the Huggingface Hub.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
)

# Apply the chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How are you?"},
]
chat_input = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate response
chat_outputs = model.generate(chat_input, max_new_tokens=50)
response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special_tokens=True)  # Decode only the response part
print("\nAssistant Response:", response)
```
- Python
Published by LysandreJik about 1 year ago
transformers - LlamaGuard-4 (based on v4.51.3)
A new model is added to transformers: LlamaGuard
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-LlamaGuard-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-LlamaGuard-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the LlamaGuard-4 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
LlamaGuard
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It is a dense 12B model pruned from the Llama 4 Scout model, and it can run on a single GPU (24 GB of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
Usage example
LlamaGuard can be found on the Huggingface Hub.
Here is a simple snippet of how to run Llama Guard 4 on the user inputs.
```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-Guard-4-12B"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "how do I make a bomb?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# OUTPUT
# unsafe
# S9
```
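The response is plain text: a verdict on the first line and, when the verdict is `unsafe`, the violated category codes after it. A small helper can turn that into structured values. This is a sketch that assumes the two-line format shown above, with codes possibly separated by commas; `parse_guard_response` is a hypothetical helper, not a transformers API.

```python
def parse_guard_response(response):
    """Split a Llama Guard style response into (is_safe, category_codes)."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty response")
    verdict = lines[0].lower()
    categories = []
    # Any following lines are assumed to hold comma-separated category codes
    for line in lines[1:]:
        categories.extend(code.strip() for code in line.split(",") if code.strip())
    return verdict == "safe", categories


is_safe, categories = parse_guard_response("unsafe\nS9")
print(is_safe, categories)  # False ['S9']
```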
If your application does not require moderation on some of the supported categories, you can ignore the ones you are not interested in, as follows:
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-Guard-4-12B"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "how do I make a bomb?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    excluded_category_keys=["S9", "S2", "S1"],
).to("cuda:0")

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# OUTPUTS
# safe
```
Sometimes it is not just the user input, but also the model’s generations that can contain harmful content. We can also moderate the model’s generation!
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How to make a bomb?"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Here is how one could make a bomb. Take chemical x and add water to it."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to("cuda")
```
This works because the chat template generates a system prompt that does not mention the excluded categories as part of the list of categories to watch for.
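The effect of excluding categories can be pictured with a toy version of the filtering step. This is only a sketch of the idea: the real chat template is a Jinja template shipped with the checkpoint, and the category table below is a small illustrative subset whose names are assumptions.

```python
# Illustrative subset of category codes and names (assumed, not read from the checkpoint)
CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
}


def build_category_block(categories, excluded_category_keys=()):
    """Render the category list for a system prompt, skipping excluded keys."""
    excluded = set(excluded_category_keys)
    lines = [f"{key}: {name}" for key, name in categories.items() if key not in excluded]
    return "\n".join(lines)


block = build_category_block(CATEGORIES, excluded_category_keys=["S9", "S2", "S1"])
print(block)  # S10: Hate
```

Because the excluded codes never reach the prompt, the model has no instruction to flag them, which is consistent with the `safe` verdict in the earlier example.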
Here’s how you can infer with images in the conversation.
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "I cannot help you with that."},
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fruit_knife.png"},
        ],
    },
]

# excluded_category_keys is a list of category keys to skip, e.g. ["S9", "S2", "S1"] as in the earlier example
processor.apply_chat_template(messages, excluded_category_keys=excluded_category_keys)
```
Llama Prompt Guard 2
You can use Llama Prompt Guard 2 directly via the pipeline API:
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="meta-llama/Llama-Prompt-Guard-2-86M")
classifier("Ignore your previous instructions.")

# MALICIOUS
```
Alternatively, it can also be used via AutoTokenizer + AutoModel API:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "meta-llama/Llama-Prompt-Guard-2-86M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Ignore your previous instructions."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

# MALICIOUS
```
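Taking the argmax of the logits gives a label but no confidence. When a tunable threshold is preferable, the logits can be converted to probabilities with a softmax first. The sketch below uses plain Python and made-up logits; the label order (index 1 meaning MALICIOUS), the BENIGN label name, and the 0.5 threshold are assumptions to be checked against `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input, assumed order: [BENIGN, MALICIOUS]
logits = [-1.2, 2.3]
probs = softmax(logits)
malicious_prob = probs[1]

# Flag the input only when the malicious probability clears the threshold
threshold = 0.5
label = "MALICIOUS" if malicious_prob >= threshold else "BENIGN"
print(label, round(malicious_prob, 3))
```

Raising the threshold trades recall for fewer false positives, which can matter when benign prompts are blocked too often.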
- Python
Published by LysandreJik about 1 year ago
transformers - InternVL (2.5 & 3) (based on v4.51.3)
A new model is added to transformers: InternVL (2.5 & 3)
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-InternVL-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-InternVL-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the InternVL model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
InternVL
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.
Usage example
InternVL can be found on the Huggingface Hub.
Inference with Pipeline
Here is how you can use the image-text-to-text pipeline to perform inference with the InternVL3 models in just a few lines of code:
```python
>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. Foreground Flowers: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
```
Inference on a single image
This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.
[!NOTE] The model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to correctly format your prompts.
```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Please describe the image explicitly."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
```
Text-only generation
This example shows how to generate text using the InternVL model without providing any image input.
```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "text", "text": "Write a haiku"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> print(decoded_output)
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
```
Batched image and text inputs
InternVL models also support batched image and text inputs.
```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
```
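The batched outputs above still contain the prompt, because the full sequences are decoded. One lightweight way to keep only the reply is to split each string on the assistant marker. This sketch assumes the marker text `assistant\n` seen in the decoded outputs above; `extract_assistant_reply` is a hypothetical helper.

```python
def extract_assistant_reply(decoded, marker="assistant\n"):
    """Return the text after the last assistant marker, or the input if it is absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply if found else decoded


# The first decoded string is copied from the example output above
decoded_outputs = [
    "user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
    "no marker here",
]

replies = [extract_assistant_reply(text) for text in decoded_outputs]
print(replies[0])
```

Slicing the generated ids by the prompt length before decoding, as done in the single-image example, avoids relying on the marker text and is the more robust option when the reply itself may contain the word "assistant".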
Batched multi-image input
This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.
```python
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
```
Video input
InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.
```python
>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "video",
...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
...             },
...             {"type": "text", "text": "What type of shot is the man performing?"},
...         ],
...     }
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     return_tensors="pt",
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
... ).to(model.device, dtype=torch.float16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
```
Interleaved image and video inputs
This example shows how to handle a batch of chat conversations with interleaved image and video inputs using a chat template.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16
)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
                {"type": "text", "text": "What type of shot is the man performing?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
]
inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
print(decoded_outputs)
# ['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
#  'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
#  "user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
```
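The decoded strings in the batched example still contain the prompt, because the full sequences were decoded. Below is a minimal sketch of trimming the prompt before decoding, using plain lists in place of tensors. `strip_prompt` is a hypothetical helper, and the sketch assumes every row was padded to the same prompt length.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` token ids from every generated row."""
    return [row[prompt_length:] for row in sequences]


# Toy stand-ins for `outputs` (prompt followed by new tokens); 0 is a made-up padding id.
generated = [
    [0, 0, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]
new_tokens = strip_prompt(generated, prompt_length=4)
print(new_tokens)  # [[101, 102], [201, 202]]
```

With real tensors, the same idea is a column slice starting at the prompt length, as in the single-video example that precedes this one.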
Published by LysandreJik about 1 year ago
transformers - Janus (based on v4.51.3)
A new model is added to transformers: Janus
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Janus-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-Janus-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Janus model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
Janus
The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can generate both image and text output, and it can also take both images and text as input.
[!NOTE] The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Usage example
Janus can be found on the Huggingface Hub.
Single image inference
Here is an example of visual understanding with a single image.
[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
```python
import torch
from PIL import Image
import requests

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Prepare input for generation.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# Set generation mode to `text` to perform text generation.
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
```
Multi image inference
Janus can perform inference with multiple images as input. The images can belong to the same prompt or to different prompts in batched inference, where the model processes many conversations in parallel. Here is how you can do it:
```python
import torch
from PIL import Image
import requests

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

image_urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://www.ilankelman.org/stopsigns/australia.jpg",
    "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg",
]

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the difference between"},
                {"type": "image", "url": image_urls[0]},
                {"type": "text", "text": " and "},
                {"type": "image", "url": image_urls[1]},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_urls[2]},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
]

# Load model and processor
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Generate response
output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=False)
text = processor.batch_decode(output, skip_special_tokens=True)
print(text)
```
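Batching prompts of different lengths relies on `padding=True`. The sketch below illustrates the general idea on plain token-id lists. `left_pad` and the padding id `0` are hypothetical, and left padding is an assumption here: the side actually used depends on the tokenizer configuration.

```python
def left_pad(sequences, pad_id=0):
    """Pad every sequence on the left to the longest length and build an attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The attention mask marks padded positions with 0 so that they can be ignored during generation.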
Text to Image generation
Janus can also generate images given a prompt.
```python
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

# Set generation mode to `image` to prepare inputs for image generation.
model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, generation_mode="image", return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# Set the num_return_sequences parameter to generate multiple images per prompt.
model.generation_config.num_return_sequences = 2
outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True,
)

# Perform post-processing on the generated token ids.
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()), return_tensors="PIL.Image.Image")

# Save the images
for i, image in enumerate(images["pixel_values"]):
    image.save(f"result{i}.png")
```
Published by LysandreJik about 1 year ago
transformers - TimesFM (based on v4.51.3)
A new model is added to transformers: TimesFM
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-TimesFM-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-TimesFM-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the TimesFM model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
TimesFM
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and autoregressively predicts one output patch (of the configured output patch length) at a time.
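To make "non-overlapping patches" concrete, here is a minimal sketch on a plain Python list. `to_patches`, the patch length of 32, and zero padding on the left are illustrative assumptions; the model performs its own patching internally.

```python
def to_patches(series, patch_len=32, pad_value=0.0):
    """Split a series into non-overlapping patches, left-padding to a multiple of patch_len."""
    pad = (patch_len - len(series) % patch_len) % patch_len
    padded = [pad_value] * pad + list(series)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]


patches = to_patches(list(range(100)), patch_len=32)
print(len(patches), len(patches[0]))  # 4 32
```

A 100-step series is padded to 128 values and split into four patches, each of which would become one input token for the decoder.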
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
Usage example
TimesFM can be found on the Huggingface Hub.
```python
import numpy as np
import torch
from transformers import TimesFmModelForPrediction

model = TimesFmModelForPrediction.from_pretrained(
    "google/timesfm-2.0-500m-pytorch",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda" if torch.cuda.is_available() else None,
)

# Create dummy inputs
forecast_input = [
    np.sin(np.linspace(0, 20, 100)),
    np.sin(np.linspace(0, 20, 200)),
    np.sin(np.linspace(0, 20, 400)),
]
frequency_input = [0, 1, 2]

# Convert inputs to sequence of tensors
forecast_input_tensor = [
    torch.tensor(ts, dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")
    for ts in forecast_input
]
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# Get predictions from the pre-trained model
with torch.no_grad():
    outputs = model(past_values=forecast_input_tensor, freq=frequency_input_tensor, return_dict=True)
    point_forecast_conv = outputs.mean_predictions.float().cpu().numpy()
    quantile_forecast_conv = outputs.full_predictions.float().cpu().numpy()
```
Published by LysandreJik about 1 year ago
transformers - MLCD (based on 4.51.3)
A new model is added to transformers: MLCD
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-MLCD-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the MLCD model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
MLCD
The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.
Usage example
MLCD can be found on the Huggingface Hub.
```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Generate outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get visual features
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```
Published by LysandreJik about 1 year ago
transformers - Qwen2.5-Omni (based on 4.51.3)
A new model is added to transformers: Qwen2.5-Omni.
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Qwen2.5-Omni-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Qwen2.5-Omni model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
Qwen2.5-Omni
The Qwen2.5-Omni model is a unified multimodal model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
The abstract from the technical report is the following:
We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.
Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.
In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
Usage example
Qwen2.5-Omni can be found on the Huggingface Hub.
Single Media inference
The model can accept text, images, audio and videos as input. Here's an example code for inference.
```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# Generation params for audio or text can be different and have to be prefixed with `thinker_` or `talker_`
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```
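The `thinker_` and `talker_` prefixes route generation parameters to one of the two sub-models. The sketch below only illustrates that naming convention on a plain dictionary. `split_generation_kwargs` is a hypothetical helper and not the library's internal code.

```python
def split_generation_kwargs(kwargs):
    """Separate prefixed generation kwargs into shared, thinker and talker groups."""
    shared, thinker, talker = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("thinker_"):
            thinker[key[len("thinker_"):]] = value
        elif key.startswith("talker_"):
            talker[key[len("talker_"):]] = value
        else:
            shared[key] = value
    return shared, thinker, talker


shared, thinker, talker = split_generation_kwargs(
    {"use_audio_in_video": True, "thinker_do_sample": False, "talker_do_sample": True}
)
print(shared)   # {'use_audio_in_video': True}
print(thinker)  # {'do_sample': False}
print(talker)   # {'do_sample': True}
```

This is why text sampling and audio sampling can be configured independently in a single `generate` call.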
Text-only generation
To generate only text output and save compute by not loading the audio generation model, use the Qwen2_5OmniThinkerForConditionalGeneration model.
```python
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# The Thinker produces text only, so there is no audio to save here
text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```
Batch Mixed Media Inference
The model can batch inputs composed of mixed samples of various types, such as text, images, audio and videos, when using the Qwen2_5OmniThinkerForConditionalGeneration model. Here is an example.
```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ],
    },
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ],
    },
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "who are you?"}],
    },
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What are the elements can you see and hear in these medias?"},
        ],
    },
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

# Request text only; this assumes audio output is not available for batched inputs
text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```
Usage Tips
Image Resolution trade-off
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs.
python
min_pixels = 128*28*28
max_pixels = 768*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min_pixels, max_pixels=max_pixels)
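To show what a pixel budget means in practice, here is a simplified sketch of how an image size could be snapped to multiples of 28 and scaled into the `[min_pixels, max_pixels]` range. `fit_to_pixel_budget` is a hypothetical helper, and the 28-pixel granularity is an assumption inferred from the `28*28` factor in the snippet. This is not the processor's actual implementation.

```python
import math


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=28):
    """Round both sides to multiples of `factor` and scale into the pixel budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink, rounding down so the result stays under the budget.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge, rounding up so the result reaches the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


min_pixels = 128 * 28 * 28
max_pixels = 768 * 28 * 28
h, w = fit_to_pixel_budget(1080, 1920, min_pixels, max_pixels)
num_patches = (h // 28) * (w // 28)
print(h, w, num_patches)
```

Under this reading, `min_pixels = 128*28*28` and `max_pixels = 768*28*28` bound each image to roughly 128 to 768 patches of 28x28 pixels, which is the knob that trades accuracy for compute.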
Prompt for audio output
If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
{
"role": "system",
"content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
Use audio output or not
The model supports both text and audio outputs. If users do not need audio outputs, they can set enable_audio_output=False in the from_pretrained function. This option saves about 2GB of GPU memory, but the return_audio option of the generate function can then only be set to False.
python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=False,
)
For a flexible experience, we recommend setting enable_audio_output to True when initializing the model through the from_pretrained function, and then deciding whether to return audio when the generate function is called. When return_audio is set to False, the model returns only text outputs, so text responses arrive faster.
python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
Change voice type of output audio
Qwen2.5-Omni can change the voice of the output audio. Use the spk parameter of the generate function to specify the voice type. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports two voice types: Chelsie, a female voice, and Ethan, a male voice. If spk is not specified, the default voice type is Chelsie.
python
text_ids, audio = model.generate(**inputs, spk="Chelsie")
python
text_ids, audio = model.generate(**inputs, spk="Ethan")
Flash-Attention 2 to speed up generation
First, make sure to install the latest version of Flash Attention 2:
bash
pip install -U flash-attn --no-build-isolation
Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.
To load and run a model using FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model:
```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
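Since FlashAttention-2 requires a half-precision dtype and an installed `flash-attn` package, a loading script may want to fall back gracefully. The sketch below picks a value for `attn_implementation` under those two conditions. `pick_attn_implementation` is a hypothetical helper, the `"sdpa"` fallback is an assumption, and hardware compatibility is not checked.

```python
import importlib.util


def pick_attn_implementation(dtype_name, flash_attn_available=None):
    """Return "flash_attention_2" only for fp16/bf16 with flash-attn importable."""
    if flash_attn_available is None:
        # Detect the package without importing it.
        flash_attn_available = importlib.util.find_spec("flash_attn") is not None
    if flash_attn_available and dtype_name in {"float16", "bfloat16"}:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation("bfloat16", flash_attn_available=True))   # flash_attention_2
print(pick_attn_implementation("float32", flash_attn_available=True))    # sdpa
print(pick_attn_implementation("bfloat16", flash_attn_available=False))  # sdpa
```

The returned string can then be passed as `attn_implementation=...` when loading the model.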
Published by LysandreJik about 1 year ago
transformers - Patch release v4.51.3
A mix of bugs were fixed in this patch; very exceptionally, we diverge from semantic versioning to merge GLM-4 in this patch release.
- Handle torch ver in flexattn (#37400)
- handle torch version edge cases (#37399)
- Add glm4 (#37388)
Published by LysandreJik about 1 year ago
transformers - Patch Release 4.51.2
Patch Release 4.51.2
This is another round of bug fixes, but they are a lot more minor and outputs were not really affected!
- Fix Llama4 offset (#37414) by @Cyrilvallez
- Attention Quantization with FBGemm & TP (#37384) by @MekkCyber
- use rms_norm_eps for the L2Norm for Llama4 (#37418) by @danielhanchen
- mark llama4 as not supported with fa2 (#37416) by @winglian
Published by ArthurZucker about 1 year ago
transformers - Patch release v4.51.1
Patch release v4.51.1
Since the release of Llama 4, we have fixed a few issues that we are now releasing in patch v4.51.1
- Fixing flex attention for torch=2.6.0 (#37285)
- more fixes for post-training llama4 (#37329)
- Remove HQQ from caching allocator warmup (#37347)
- fix derived berts `_init_weights` (#37341)
- Fix init empty weights without accelerate (#37337)
- Fix deepspeed with quantization (#37324)
- fix llama4 training (#37319)
- fix flex attn when optional args aren't passed (#37327)
- Multiple llama4 fixes (#37353)
Thanks all for your patience
Published by ArthurZucker about 1 year ago
transformers - v4.51.0: Llama 4, Phi4-Multimodal, DeepSeek-v3, Qwen3
New Model Additions
Llama 4
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
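A rough back-of-the-envelope estimate shows why quantization matters for Scout. The sketch below counts weight storage only, using the approximate 109B total parameters quoted above. `weight_memory_gb` is a hypothetical helper, and activations, KV cache and quantization overhead are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


scout_params = 109e9  # approximate total parameter count
estimates = {
    label: weight_memory_gb(scout_params, bits)
    for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]
}
for label, gigabytes in estimates.items():
    print(f"Scout {label}: ~{gigabytes:.1f} GB")
```

Under these assumptions the weights need roughly 218 GB in BF16, 109 GB at 8-bit and 54.5 GB at 4-bit, which is why the lower-precision formats are the ones aimed at a single GPU.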
Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:
pip install -U transformers[hf_xet]
Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
torchrun --nproc-per-node=8 script.py
```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the tokens generated after the prompt
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```
Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
Phi4-Multimodal
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
- Add Phi4 multimodal by @Cyrilvallez in #36939
DeepSeek-v3
DeepSeek-v3 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
The model is detailed in the following paper.
Overview
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
- [WIP] add deepseek-v3 by @bzantium in #35926
Qwen3
The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At time of release, the models themselves have not yet been released - stay tuned for a release from the Qwen team!
- Adding Qwen3 and Qwen3MoE by @bozheng-hit in #36878
Documentation
Model docs are getting a significant overhaul by providing much needed, ready-to-use examples one can copy-paste in their modules/consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.
- [docs] Model docs by @stevhliu in #36469
Significant model improvements
A very large PR was provided by @nikosanto13 that helped add modular files to all speech models in the library; seeing the difference between each of them is now much simpler, as well as maintenance and eventual refactors.
- Introduce modular files for speech models by @nikosanto13 in #35902
Bugfixes and improvements
- fix: loss computation after embeddings resize - mllama by @Ssukriti in #36840
- Simplify keep_in_fp32_modules logic by @Cyrilvallez in #36722
- Fix Pan and Scan on batched images Gemma3 by @yonigozlan in #36864
- Update installation.md by @ariG23498 in #36826
- fix Gemma3 Config by @eljandoubi in #36893
- Fix torch version guard at import by @zucchini-nlp in #36907
- [Fix] Add `original_max_position_embeddings` to YARN rope_scaling optional keys by @JustinTong0323 in #36877
- tests: fix asyncio.wait() usage for python>=3.11 by @dvrogozh in #36898
- [chameleon] fix num image token check by @zucchini-nlp in #36918
- Fix Compressed tensors to_dict_diff by @MekkCyber in #36922
- Use another repo. for Mistral3 processor testing by @ydshieh in #36925
- Fix typos by @omahs in #36910
- Update `trainer_pt_utils.py` docstrings for consistency by @ethanknights in #36912
- [2/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36857
- Fix pytorch defomr attn path by @qubvel in #36923
- More precise comment by @ydshieh in #36935
- Added support for seed in `DataCollatorForWholeWordMask` by @capemox in #36903
- Fix processor kwargs qwen2 vl by @yonigozlan in #36890
- Disallow Offload to disk for gguf files by @MekkCyber in #36933
- Deprecate #36741 and map Causal to Conditional by @zucchini-nlp in #36917
- Fixing _pre_quantization_dtype when torch_dtype is None by @MekkCyber in #36930
- Export for Phi4-mini by @guangy10 in #36780
- fix typos in the tests directory by @threewebcode in #36932
- Fix cuda index issue in cache allocator by @SunMarc in #36937
- [Utils] torch version checks optionally accept dev versions by @gante in #36847
- Update after #36962 by @ydshieh in #36965
- Change GPUS to GPUs by @zhanluxianshen in #36945
- typo fixed in README_fr.md by @NargiT in #36951
- Updated docker files to use `uv` for installing packages by @Sai-Suraj-27 in #36957
- update examples after ruff being updated by @ydshieh in #36972
- Remove extra tensor clone in PyTorch code by @cyyever in #36748
- [docs] Fix image link by @stevhliu in #36869
- Add ruff target-version by @cyyever in #36971
- update bot comment again by @ydshieh in #36974
- 🚨Deprecate legacy argument for image-text-to-text models and adopt new behavior by default by @yonigozlan in #36307
- Fix tensor dtype mismatch by @cyyever in #36985
- byebye CircleCI TF jobs by @ydshieh in #36998
- Use torch.expm1 by @cyyever in #36995
- Install `networkx==3.2.1` manually in some CircleCI jobs after #36957 by @ydshieh in #37000
- Fix Optional type annotation by @cyyever in #36841
- Fix get_device_properties by @ivarflakstad in #36997
- Allow easy registration of custom attention functions by @Cyrilvallez in #36889
- Fix removing "cpu" from frozenset in bitsandbytes.py to allow better ROCm support. by @anadon in #36975
- Fix device_map check for ggml files by @MekkCyber in #37003
- Log the correct learning rate by @SunMarc in #36973
- fix typos in the code comments and error messages by @threewebcode in #36993
- Remove deprecated training arguments by @cyyever in #36946
- [docs] Attention mask image by @stevhliu in #36970
- fix transformers_cli import relative path issue by @yao-matrix in #36989
- Support QuestionAnswering Module for ModernBert based models. by @bakrianoo in #35566
- Fix PixtralProcessor patch_size when spatial_merge_size is used by @mgoin in #37019
- [Modeling] Load FP8 safetensors such as DeepSeek by @kylesayrs in #36828
- Mark 2 tests as flaky for now by @ydshieh in #37038
- remove redundant code in trainer by @hiyouga in #36994
- Skip FP8 linear tests For device capability 9.0 by @MekkCyber in #37008
- Add Distill Any Depth by @keetrap in #36614
- fix pegasus init weights and other copied models by @jiqing-feng in #36844
- Optimize `to_py_obj` for python-native numeric lists and scalars by @n0gu-furiosa in #36885
- Fixup for distill_any_depth conversion script by @qubvel in #37043
- [chat templates} support loading audio from video by @zucchini-nlp in #36955
- [audio utils] fix fft_bin_width computation by @eustlb in #36603
- [generate, cache] handle more complex device maps by @gante in #37014
- clean pipeline question_answering. by @zhanluxianshen in #36986
- Avoid unnecessary device operations in loss computing by @cyyever in #36950
- Set weights_only in torch.load by @cyyever in #36991
- Replace default split function with jnp.split() in flax models by @premmurugan229 in #37001
- Remove deprecated batch_size parameter by @cyyever in #37007
- fixed typo by @finnoh in #37036
- fix: Fully remove legacy cache from Llama by @Wheest in #36958
- Fix SDPA implementation in Qwen2-VL (issues with torch==2.6.0) by @ManuelFay in #36891
- fix: AttributeError: 'LlavaProcessor' object has no attribute 'image_token_id' by @jp1924 in #37026
- Fix some typos about benchmark scripts. by @zhanluxianshen in #37027
- Change deprecated PT functions by @cyyever in #37041
- [blip-2] Fix dtype mismatch when keep in fp32 by @zucchini-nlp in #37068
- fix tied weigths issue by @ydshieh in #37031
- Update w/ new account by @muellerzr in #37084
- Fix state_dict map location when quantized by @Cyrilvallez in #37086
- Fix AttentionInterface following feedback by @Cyrilvallez in #37010
- fixed typo. by @zhanluxianshen in #37057
- [generate] beam search -- fix output cropping by @gante in #37080
- [Cache] rename dtype attribute 🚨 🚨 by @gante in #37044
- Kenlm by @ydshieh in #37091
- 🌐 [i18n-KO] Translated `qwen2_vl.md` to Korean by @MinJu-Ha in #36750
- Gaudi: Fix the pipeline failed issue with hpu device by @yuanwu2017 in #36990
- Support passing flash_attn_kwargs when gradient_checkpointing is enabled by @efsotr in #37037
- Fix 4090/ada not detected as having FP8 support by @Qubitium in #37067
- enable tp on CPU by @jiqing-feng in #36299
- fix whisper re-compile by @jiqing-feng in #36712
- [MLU] Fix FA2 check error, remove deepspeed-mlu deps. by @huismiling in #36159
- Fix Gemma3 embedding scaling by @gau-nernst in #37109
- RWKV: fix mask warning typo by @RobinKa in #37114
- Remove deprecated code by @cyyever in #37059
- [tests] remove cuda-only test marker in `AwqConfigTest` by @faaany in #37032
- Export T5 (encoder-decoder) to ExecuTorch by @guangy10 in #36486
- skip by @ydshieh in #37141
- [qwen3] fix generation tests by @zucchini-nlp in #37142
- Fix more inefficient PT operations by @cyyever in #37060
- Fix std initialization in Idefics variants by @yaswanth19 in #37100
- add gpt2 test on XPU by @jiqing-feng in #37028
- Fix llava xpu tests. by @jiqing-feng in #37130
- enable `test_assisted_decoding_in_different_gpu` test on XPU by @yao-matrix in #37120
- Use public export API on torch 2.5 and future by @guangy10 in #36781
- Convert `_VALID_DICT_FIELDS` to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
- Only count num items in batch when needed by @IlyasMoutawwakil in #36867
- Make canine model exportable by removing unncessary complicated logic by @tugsbayasgalan in #37124
- [`ModernBERT`] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
- fix XPU UT error case brough by RNG difference btw XPU and CUDA by @yao-matrix in #37121
- Fixes the inconsistency of the optionality of attention_mask by @Zephyr271828 in #37153
- Avoid pipeline test failing related to Hub call by @ydshieh in #37170
- Fix meta state dict loading with quantizers by @Cyrilvallez in #37136
- Revert #37031 by @Cyrilvallez in #37178
- [doc] Fix link for Quark quantization page by @BowenBao in #37179
- [chat-template] fix video loading by @zucchini-nlp in #37146
- Skip code `307` in `RequestCounter` by @ydshieh in #36953
- Add device workaround for int4 weight only quantization after API update by @jerryzh168 in #36980
- Fixes DynamicCache export issues due to control flow and inplace modifications by @xadupre in #36652
- Try to avoid/reduce some remaining CI job failures by @ydshieh in #37202
- fix: Add 'image-text-to-text' to `TASK_MAPPING` by @saattrupdan in #37107
- Fix some code annotation typos. by @zhanluxianshen in #37102
- Merge tensor operations with device transfer operations by @cyyever in #37097
- [3/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36936
- Add py.typed by @cyyever in #37022
- No more dtype_byte_size() by @Rocketknight1 in #37144
- [Tests] add `min_new_tokens` to prevent flaky length checks by @gante in #37175
- Stop DOSing the Hub in the CI by @Rocketknight1 in #37209
- More ReDOS fixes! by @Rocketknight1 in #36964
- Updated the model card for CLIP by @purusharthmalik in #37040
- Update falcon model card by @ricalanis in #37184
- Updated model card for Qwen2 by @Aravind-11 in #37192
- Fix static cache export by @guangy10 in #37229
- [Phi4] add multimodal chat template by @zucchini-nlp in #36996
- Add new dim to `num_items_in_batch` if necessary by @regisss in #36967
- Fix test by @Cyrilvallez in #37213
- [tests] fix mamba integration simple inference precision issue by @faaany in #37193
- [CI] lazy loading external datasets by @gante in #37218
- enable 2 types of case on XPU by @yao-matrix in #37198
- Fix AST parsing when looking for remote code imports by @Rocketknight1 in #37245
- Add support for fast image processing in image-pretraining example by @jafraustro in #37021
- Allow flexible generation params arg when checking pipeline specs by @Rocketknight1 in #37211
- [CI] green llama tests by @gante in #37244
- Adding links to ShieldGemma 2 technical report by @RyanMullins in #37247
- feat: updated model card for qwen2.5vl by @arkhamHack in #37099
- Update model card for Cohere by @bimal-gajera in #37056
- chore: Update model doc for code_llama by @AbhishekRP2002 in #37115
- Update Model Card for ModernBERT by @ParagEkbote in #37052
- Update model card for electra by @Wu-n0 in #37063
- [qwen-vl] fix image processor by @zucchini-nlp in #37258
- update error msg by @itazap in #37207
- Fix `utils/check_bad_commit.py` by @ydshieh in #37272
- Support `return_tensors` in audio chat templates by @zucchini-nlp in #34601
- Update ruff to `0.11.2` by @ydshieh in #36962
- Fix typing for None valued variables by @cyyever in #37004
- Use `lru_cache` for tokenization tests by @ydshieh in #36818
- Create and Expose SamVisionModel as public for better accessibility by @geetu040 in #36493
- [Feature] Support using FlashAttention2 on Ascend NPU by @FightingZhen in #36696
- Remove low_cpu_mem_usage and _fast_init by @Cyrilvallez in #36963
- Refactor `return_dict` logic to remove complicated if/else paths by @qubvel in #36794
- Refactor attention for SigLIP based models by @qubvel in #36981
- Add Optional to types by @cyyever in #37163
- Purge unused ModelTester code by @Rocketknight1 in #37085
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @cyyever
- [2/N] Use pyupgrade --py39-plus to improve code (#36857)
- Remove extra tensor clone in PyTorch code (#36748)
- Add ruff target-version (#36971)
- Fix tensor dtype mismatch (#36985)
- Use torch.expm1 (#36995)
- Fix Optional type annotation (#36841)
- Remove deprecated training arguments (#36946)
- Avoid unnecessary device operations in loss computing (#36950)
- Fix typing for None valued variables (#37004)
- Set weights_only in torch.load (#36991)
- Remove deprecated batch_size parameter (#37007)
- Change deprecated PT functions (#37041)
- Remove deprecated code (#37059)
- Fix more inefficient PT operations (#37060)
- Merge tensor operations with device transfer operations (#37097)
- [3/N] Use pyupgrade --py39-plus to improve code (#36936)
- Add py.typed (#37022)
- Add Optional to types (#37163)
- @bzantium
- [WIP] add deepseek-v3 (#35926)
- @bozheng-hit
- Adding Qwen3 and Qwen3MoE (#36878)
- @geetu040
- Create and Expose SamVisionModel as public for better accessibility (#36493)
- @FightingZhen
- [Feature] Support using FlashAttention2 on Ascend NPU (#36696)
- @nikosanto13
- Introduce modular files for speech models (#35902)
Published by LysandreJik about 1 year ago
transformers - Deepseek v3 (based on 4.50.3)
A new model is added to transformers: DeepSeek 3 (Also known as DeepSeek R1). It is added on top of the v4.50.3 release, and can be installed from the following tag: v4.50.3-DeepSeek-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.50.3-DeepSeek-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
DeepSeek 3 (Also known as DeepSeek R1)
The model is detailed in the following paper.
Overview
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Limitations and call for contribution!
We are super happy to make this code community-powered, and would love to see how you can help optimize the following:
- current implementation uses the "naive" attention computation (so not really MLA)
- current implementation loops through the experts. This should be replaced. Pointers to use `get_packed_weights` from `integrations/tensor_parallel`.
- current implementation uses the eleuther formula for ROPE, using the original one would be more efficient! (should still follow our API)
- static cache is not supported (this should be just a generation config issue / config shape issues)
Usage tips
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
You can run the model in FP8 automatically; 2 nodes of 8 H100 GPUs should be more than enough!
```python
# run_deepseek_r1.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
```
This generated:
``````
<|Assistant|>
First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.
They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.
In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.
I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.
Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.
Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.
Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.
Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.
I think that's a solid approach. Let me structure it step by step to make it clear.
Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!
Step 1: Raw Conversation History
Suppose we have this conversation:
- User: "Hello, how are you?"
- Assistant: "I'm doing great. How can I help you today?"
- User: "I'd like to show off how chat templating works!"
Step 2: Structured Messages
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with role and content:
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```
Step 3: Apply a Chat Template
A chat template converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):
```jinja
{% for message in messages %}
{% if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
{% elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
{% endif %}
{% endfor %}
<|assistant|>
```
Step 4: Final Templated Output
Applying the template to our messages list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```
This tells the model:
1. The conversation history (user/assistant turns).
2. The model’s turn to generate a response (<|assistant|> at the end).
Key Notes:
- Role Separation: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- Special Tokens: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- Flexibility: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).
Why This Matters:
- Consistency: Ensures the model understands dialogue structure.
- Context Preservation: Maintains the flow of multi-turn conversations.
- Alignment: Matches the format the model was trained on for better performance.
Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<|end▁of▁sentence|> ``````
Use the following to run it
```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
```
If you see the following error, it means NCCL was probably not loaded:
```bash
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found
```
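As a rough sanity check of the "2 nodes of 8 H100" figure in the usage tips, the arithmetic below compares the size of the FP8 weights with the combined GPU memory. It is a back-of-the-envelope sketch: the 80 GB per GPU figure is an assumption about the H100 variant, and activations, the KV cache, and framework overhead are ignored.

```python
# Back-of-the-envelope check: do the FP8 weights fit on 2 nodes of 8 H100s?
TOTAL_PARAMS = 671e9        # total parameters, as stated in the abstract
BYTES_PER_PARAM_FP8 = 1     # FP8 stores one byte per weight
GPU_MEMORY_GB = 80          # assumption: 80 GB H100 variant
GPUS_PER_NODE = 8
NUM_NODES = 2

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9
single_node_gb = GPUS_PER_NODE * GPU_MEMORY_GB
cluster_gb = NUM_NODES * single_node_gb
headroom_gb = cluster_gb - weights_gb

print(f"weights:     {weights_gb:.0f} GB")
print(f"one node:    {single_node_gb} GB")
print(f"two nodes:   {cluster_gb} GB")
print(f"headroom:    {headroom_gb:.0f} GB (left for activations, KV cache, overhead)")
```

Under these assumptions the weights alone exceed a single node's memory, which is consistent with the two-node launch command above.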
Published by ArthurZucker about 1 year ago
transformers - Patch release v4.50.3
Patch release v4.50.3
Thanks to the vLLM team, we caught a few more bugs that had slipped in!
[generate] beam search -- fix output cropping (#37080) by @gante
[blip-2] Fix dtype mismatch when keep in fp32 (#37068) by @zucchini-nlp
Fix PixtralProcessor patch_size when spatial_merge_size is used (#37019)
Published by ArthurZucker about 1 year ago
transformers - Patch release v4.50.2
Patch release v4.50.2
I completely forgot to put these in the previous patch sorry! Should put the transformers backend in a good spot!
[Utils] torch version checks optionally accept dev versions (#36847) by @gante
Fix processor kwargs qwen2 vl (#36890) by @yonigozlan
Fix Pan and Scan on batched images Gemma3 (#36864) by @yonigozlan
Published by ArthurZucker about 1 year ago
transformers - Patch release v4.50.1
Patch release v4.50.1
There were some very minor bugs with the new hub kernels, and with remote code that we had to fix
Deprecate #36741 and map Causal to Conditional (#36917) by @zucchini-nlp
Fix pytorch deform attn path (#36923) by @qubvel
[chameleon] fix num image token check (#36918) by @zucchini-nlp
Fix torch version guard at import (#36907) by @zucchini-nlp
Published by ArthurZucker about 1 year ago
transformers - Release v4.50.0
Release v4.50.0
New Model Additions
Model-based releases
Starting with version v4.49.0, we have been doing model-based releases in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.
Unlike our software releases, these are not pushed to PyPI and are kept on our GitHub. Each release has a tag attributed to it, such as:
- v4.49.0-Gemma-3
- v4.49.0-AyaVision
⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.
Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.
For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:
o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
/ \
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
\
o---- v4.49.0-AyaVision
We strive to merge model specific fixes on their respective branches as fast as possible!
Gemma 3
Gemma 3 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens in the same way as SigLIP if the image does not exceed a certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. Also, the model interleaves sliding window local attention with full causal attention in the language backbone, where each sixth layer is a full causal attention layer.
- Gemma3 by @RyanMullins in #36658
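The interleaving of sliding window and full causal attention described above can be pictured with a small pure-Python sketch. It is illustrative only and not the library's implementation: `layer_types` and `causal_mask` are hypothetical helpers, the full-attention layer is assumed to be the last of each group of six, the window size is arbitrary, and the bidirectional attention over image tokens is not modeled.

```python
def layer_types(num_layers, full_every=6):
    """Label each layer: every `full_every`-th layer is full causal attention."""
    return ["full" if (i + 1) % full_every == 0 else "sliding" for i in range(num_layers)]


def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    With `window=None` this is a plain causal mask; otherwise each query only
    sees the `window` most recent positions (itself included).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


types = layer_types(12)
full = causal_mask(5)
local = causal_mask(5, window=2)

print(types)
for name, mask in (("full", full), ("sliding", local)):
    print(name)
    for row in mask:
        print("".join("x" if v else "." for v in row))
```

Printing both masks side by side shows the trade-off: sliding window layers keep only a narrow band below the diagonal, while the occasional full layer lets every token see the whole prefix.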
Shield Gemma2
ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:
- No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault).
- No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).
- No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death).
We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.
- Shieldgemma2 #36678 by @RyanMullins
Aya Vision
AyaVision is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, by contrast, uses Aya Expanse 32B as the language model.
Key features of Aya Vision include:
- Multimodal capabilities in 23 languages
- Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
- High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
- Seamless integration of visual and textual information in 23 languages.
- Add aya by @ArthurZucker in #36521
Mistral 3.1
Mistral 3.1 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.
- Add Mistral3 by @Cyrilvallez in #36790
Smol VLM 2
SmolVLM-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
- It uses SmolLM2 for the text model.
- It supports multi-image and video inputs.

- SmolVLM2 by @orrzohar in #36126
SigLIP-2
SigLIP-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants
1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)
- Add SigLIP 2 by @qubvel in #36323
Prompt Depth Anything
PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

- Add Prompt Depth Anything Model by @haotongl in #35401
New tool: attention visualization
We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:
```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer(" You are an assistant.", suffix = "What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side
```
- Add attention visualization tool by @ArthurZucker in #36630
Deprecating transformers.agents in favor of smolagents
We are deprecating transformers.agents in favour of the smolagents library. Read more about smolagents here.
- Deprecate transformers.agents by @aymeric-roucher in #36415
Quantization
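We now support adding custom quantization methods using the `@register_quantization_config` and `@register_quantizer` decorators:
```python
from transformers import AutoModelForCausalLM
from transformers.quantizers import HfQuantizer, register_quantization_config, register_quantizer
from transformers.utils.quantization_config import QuantizationConfigMixin

@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
    pass

@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
    pass

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
```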
- Added Support for Custom Quantization by @keetrap in #35915
- Add Example for Custom quantization by @MekkCyber in #36286
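To give an intuition for how name-based registration via decorators generally works, here is a minimal pure-Python sketch of the pattern. It is not the transformers implementation: `TOY_QUANTIZER_REGISTRY`, `toy_register_quantizer`, and `build_quantizer` are hypothetical names introduced only for illustration.

```python
# Hypothetical registry, for illustration only (not the transformers internals).
TOY_QUANTIZER_REGISTRY = {}


def toy_register_quantizer(name):
    """Return a class decorator that stores the decorated class under `name`."""
    def decorator(cls):
        if name in TOY_QUANTIZER_REGISTRY:
            raise ValueError(f"Quantizer '{name}' is already registered")
        TOY_QUANTIZER_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@toy_register_quantizer("custom")
class CustomQuantizer:
    def describe(self):
        return "custom quantizer"


def build_quantizer(name):
    """Look up a registered class by its string name and instantiate it."""
    return TOY_QUANTIZER_REGISTRY[name]()


print(build_quantizer("custom").describe())
```

The appeal of this design is that a loader only needs the string key stored in a config to find the matching class, so third-party methods can plug in without changes to the loader itself.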
AMD is developing its in-house quantizer named Quark, released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the Quark library:
```python
# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
```
- Support loading Quark quantized models in Transformers by @fxmarty-amd and @BowenBao in #36372
Torchao is augmented with autoquant support, CPU-quantization, as well as new AOBaseConfig object instances for more advanced configuration.
- Add autoquant support for torchao quantizer by @jerryzh168 in #35503
- enable torchao quantization on CPU by @jiqing-feng in #36146
- Add option for ao base configs by @drisspg in #36526
Tensor Parallelism implementation changes
At loading time, the parallelization is now applied module-by-module, so that no memory overhead is required compared to what the final weight distribution will be!
- TP initialization module-by-module by @Cyrilvallez in #35996
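The memory argument can be illustrated with a toy accounting model. This is only a sketch of the idea, not the library's loading code: the function name `simulate_loading`, the module sizes and the assumption that every weight is split evenly across ranks are all hypothetical.

```python
def simulate_loading(module_sizes, world_size, module_by_module):
    """Return (peak, final) per-device memory for a toy tensor-parallel load.

    module_sizes: full size of each module's weights (arbitrary units).
    world_size: number of devices the weights are split across (assumed even split).
    """
    resident = 0.0
    peak = 0.0
    if module_by_module:
        # Each module is sharded as it is loaded, so only the local shard is kept.
        for size in module_sizes:
            resident += size / world_size
            peak = max(peak, resident)
    else:
        # Naive approach: materialize every full weight first, then shard.
        for size in module_sizes:
            resident += size
            peak = max(peak, resident)
        resident = sum(module_sizes) / world_size
    return peak, resident


sizes = [4.0, 4.0, 8.0]  # hypothetical module sizes
print(simulate_loading(sizes, world_size=4, module_by_module=True))   # (4.0, 4.0)
print(simulate_loading(sizes, world_size=4, module_by_module=False))  # (16.0, 4.0)
```

In the module-by-module case the peak equals the final footprint, which is the "no memory overhead" property described above.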
Generation
This release includes two speed upgrades to generate:
1. Assisted generation now works with ANY model as an assistant, even with do_sample=True;
```py
from transformers import pipeline
import torch

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

pipe = pipeline(
    "text-generation",
    model=checkpoint,
    assistant_model=assistant_checkpoint,
    do_sample=True
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
```
2. Beam search was vectorized, and should be significantly faster with a large num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time.
- Universal Speculative Decoding CandidateGenerator by @keyboardAnt in #35029
- [generate] ✨ vectorized beam search ✨ by @gante in #35802
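For readers new to assisted generation, the draft-and-verify loop behind it can be sketched with toy next-token functions. This is a simplified greedy variant written for illustration: it ignores sampling and the tokenizer translation that the universal version handles, and the names `assisted_generate`, `target_next` and `draft_next` are hypothetical, not transformers APIs.

```python
def assisted_generate(target_next, draft_next, prompt, max_new_tokens, num_draft=4):
    """Greedy assisted decoding with toy models.

    `target_next` and `draft_next` map a token list to the next token.
    Returns the generated sequence and the number of verification rounds.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes a short continuation.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposals (one forward pass in a real model).
        rounds += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # keep the target's token and stop
                break
            accepted.append(proposed)
        else:
            # Every draft token matched, so the target contributes one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def target_next(seq):
    """Toy target model: counts upwards modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except right after token 2."""
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


output, rounds = assisted_generate(target_next, draft_next, prompt=[0], max_new_tokens=8)
print(output, rounds)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 2
```

The output is identical to what the target model would produce on its own, but it takes 2 verification rounds instead of 8 sequential target steps. That gap is where the speedup comes from.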
Documentation
A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation and make it much easier to navigate. Let us know what you think!
- [docs] Redesign by @stevhliu in #31757
Notable repo maintenance
The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/
- Remove research projects by @Rocketknight1 in #36645
We have updated our flex attention support so as to have it be on-par with our Flash Attention 2 support.
- Proper_flex by @ArthurZucker in #36643
More models support flex attention now thanks to @qubvel
- Refactor Attention implementation for ViT-based models by @qubvel in #36545
First integration of hub kernels for deformable detr!
- Use deformable_detr kernel from the Hub (#36853) by @danieldk
Bugfixes and improvements
- [tests] fix EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
- Fix LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
- AMD DeepSpeed image additional HIP dependencies by @ivarflakstad in #36195
- [generate] remove cache v4.47 deprecations by @gante in #36212
- Add missing atol to torch.testing.assert_close where rtol is specified by @ivarflakstad in #36234
- [tests] remove tf/flax tests in /generation by @gante in #36235
- [generate] Fix encoder decoder models attention mask by @eustlb in #36018
- Add compressed tensor in quant dockerfile by @SunMarc in #36239
- [tests] remove test_export_to_onnx by @gante in #36241
- Au revoir flaky test_fast_is_faster_than_slow by @ydshieh in #36240
- Fix TorchAoConfig not JSON serializable by @andrewor14 in #36206
- Remove flakiness in VLMs by @zucchini-nlp in #36242
- feat: add support for tensor parallel training workflow with accelerate by @kmehant in #34194
- Fix XGLM loss computation (PyTorch and TensorFlow) by @damianoamatruda in #35878
- GitModelIntegrationTest - flatten the expected slice tensor by @ivarflakstad in #36260
- Added Support for Custom Quantization by @keetrap in #35915
- Qwen2VL fix cos,sin dtypes to float when used with deepspeed by @ArdalanM in #36188
- Uniformize LlavaNextVideoProcessor kwargs by @yonigozlan in #35613
- Add support for post-processing kwargs in image-text-to-text pipeline by @yonigozlan in #35374
- Add dithering to the Speech2TextFeatureExtractor API. by @KarelVesely84 in #34638
- [tests] remove pt_tf equivalence tests by @gante in #36253
- TP initialization module-by-module by @Cyrilvallez in #35996
- [tests] deflake dither test by @gante in #36284
- [tests] remove flax-pt equivalence and cross tests by @gante in #36283
- [tests] make test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
- Add Example for Custom quantization by @MekkCyber in #36286
- docs: Update README_zh-hans.md by @hyjbrave in #36269
- Fix callback handler reference by @SunMarc in #36250
- Make cache traceable by @IlyasMoutawwakil in #35873
- Fix broken CI on release branch due to missing conversion files by @ydshieh in #36275
- Ignore conversion files in test fetcher by @ydshieh in #36251
- SmolVLM2 by @orrzohar in #36126
- Fix typo in Pixtral example by @12v in #36302
- fix: prevent second save in the end of training if last step was saved already by @NosimusAI in #36219
- [smolvlm] make CI green by @gante in #36306
- Fix default attention mask of generate in MoshiForConditionalGeneration by @cyan-channel-io in #36171
- VLMs: even more clean-up by @zucchini-nlp in #36249
- Add SigLIP 2 by @qubvel in #36323
- [CI] Check test if the GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
- [tests] make quanto tests device-agnostic by @faaany in #36328
- Uses Collection in transformers.image_transforms.normalize by @CalOmnie in #36301
- Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese by @Rocketknight1 in #36121
- [tests] enable bnb tests on xpu by @faaany in #36233
- Improve model loading for compressed tensor models by @rahul-tuli in #36152
- Change slack channel for mi250 CI to amd-hf-ci by @ivarflakstad in #36346
- Add autoquant support for torchao quantizer by @jerryzh168 in #35503
- Update amd pytorch index to match base image by @ivarflakstad in #36347
- fix(type): padding_side type should be Optional[str] by @shenxiangzhuang in #36326
- [Modeling] Reduce runtime when loading missing keys by @kylesayrs in #36312
- notify new model merged to main by @ydshieh in #36375
- Update modeling_llava_onevision.py by @yinsong1986 in #36391
- Load models much faster on accelerator devices!! by @Cyrilvallez in #36380
- [modular] Do not track imports in functions by @Cyrilvallez in #36279
- Fix is_causal fail with compile by @Cyrilvallez in #36374
- enable torchao quantization on CPU by @jiqing-feng in #36146
- Update _get_eval_sampler to reflect Trainer.tokenizer is deprecation self.tokenizer -> self.processing_class by @yukiman76 in #36315
- Fix doc formatting in forward passes & modular by @Cyrilvallez in #36243
- Added handling for length <2 of suppress_tokens for whisper by @andreystarenky in #36336
- addressing the issue #34611 to make FlaxDinov2 compatible with any batch size by @MHRDYN7 in #35138
- tests: revert change of torch_require_multi_gpu to be device agnostic by @dvrogozh in #35721
- [tests] enable autoawq tests on XPU by @faaany in #36327
- fix audio classification pipeline fp16 test on cuda by @jiqing-feng in #36359
- chore: fix function argument descriptions by @threewebcode in #36392
- Fix pytorch integration tests for SAM by @qubvel in #36397
- [CLI] add import guards by @gante in #36376
- Fix convert_to_rgb for SAM ImageProcessor by @MSt-10 in #36369
- Security fix for benchmark.yml by @ydshieh in #36402
- Fixed VitDet for non-squre Images by @cjfghk5697 in #35969
- Add retry hf hub decorator by @muellerzr in #35213
- Deprecate transformers.agents by @aymeric-roucher in #36415
- Fixing the docs corresponding to the breaking change in torch 2.6. by @Narsil in #36420
- add recommendations for NPU using flash_attn by @zheliuyu in #36383
- fix: prevent model access error during Optuna hyperparameter tuning by @emapco in #36395
- Universal Speculative Decoding CandidateGenerator by @keyboardAnt in #35029
- Fix compressed tensors config by @MekkCyber in #36421
- Update form pretrained to make TP a first class citizen by @ArthurZucker in #36335
- Fix Expected output for compressed-tensors tests by @MekkCyber in #36425
- restrict cache allocator to non quantized model by @SunMarc in #36428
- Change PR to draft when it is (re)opened by @ydshieh in #36417
- Fix permission by @ydshieh in #36443
- Fix another permission by @ydshieh in #36444
- Add contents: write by @ydshieh in #36445
- [save_pretrained ] Skip collecting duplicated weight by @wejoncy in #36409
- [generate] torch.distributed-compatible DynamicCache by @gante in #36373
- Lazy import libraries in src/transformers/image_utils.py by @hmellor in #36435
- Fix hub_retry by @ydshieh in #36449
- [GroundingDino] Fix grounding dino loss 🚨 by @EduardoPach in #31828
- Fix loading models with mismatched sizes by @qubvel in #36463
- [docs] fix bug in deepspeed config by @faaany in #36081
- Add Got-OCR 2 Fast image processor and refactor slow one by @yonigozlan in #36185
- Fix couples of issues from #36335 by @SunMarc in #36453
- Fix _load_state_dict_into_meta_model with device_map=None by @hlky in #36488
- Fix loading zero3 weights by @muellerzr in #36455
- Check TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
- Fix kwargs UserWarning in SamImageProcessor by @MSt-10 in #36479
- fix torch_dtype, contiguous, and load_state_dict regression by @SunMarc in #36512
- Fix some typos in docs by @co63oc in #36502
- chore: fix message descriptions in arguments and comments by @threewebcode in #36504
- Fix pipeline+peft interaction by @Rocketknight1 in #36480
- Fix edge case for continue_final_message by @Rocketknight1 in #36404
- [Style] fix E721 warnings by @kashif in #36474
- Remove unused code by @Rocketknight1 in #36459
- [docs] Redesign by @stevhliu in #31757
- Add aya by @ArthurZucker in #36521
- chore: Fix typos in docs and examples by @co63oc in #36524
- Fix bamba tests amd by @ivarflakstad in #36535
- Fix links in quantization doc by @MekkCyber in #36528
- chore: enhance messages in docstrings by @threewebcode in #36525
- guard torch version for uint16 by @SunMarc in #36520
- Fix typos in tests by @co63oc in #36547
- Fix typos . by @zhanluxianshen in #36551
- chore: enhance message descriptions in parameters,comments,logs and docstrings by @threewebcode in #36554
- Delete redundancy if case in model_utils by @zhanluxianshen in #36559
- Modular Conversion --fix_and_overwrite on Windows by @hlky in #36583
- Integrate SwanLab for offline/online experiment tracking and local visualization by @ShaohonChen in #36433
- [bark] fix loading of generation config by @gante in #36587
- [XGLM] tag tests as slow by @gante in #36592
- fix: argument by @ariG23498 in #36558
- Mention UltraScale Playbook 🌌 in docs by @NouamaneTazi in #36589
- avoid errors when the size of input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
- Export base streamer. by @AndreasAbdi in #36500
- Github action for auto-assigning reviewers by @Rocketknight1 in #35846
- Update chat_extras.md with content correction by @krishkkk in #36599
- Update "who to tag" / "who can review" by @gante in #36394
- Fixed datatype related issues in DataCollatorForLanguageModeling by @capemox in #36457
- Fix check for XPU. PyTorch >= 2.6 no longer needs ipex. by @tripzero in #36593
- [HybridCache] disable automatic compilation by @gante in #36620
- Fix auto-assign reviewers by @Rocketknight1 in #36631
- chore: fix typos in language models by @threewebcode in #36586
- [docs] Serving LLMs by @stevhliu in #36522
- Refactor some core stuff by @ArthurZucker in #36539
- Fix bugs in mllama image processing by @tjohnson31415 in #36156
- Proper_flex by @ArthurZucker in #36643
- Fix AriaForConditionalGeneration flex attn test by @ivarflakstad in #36604
- Remove remote code warning by @Rocketknight1 in #36285
- Stop warnings from unnecessary torch.tensor() overuse by @Rocketknight1 in #36538
- [docs] Update docs dependency by @stevhliu in #36635
- Remove research projects by @Rocketknight1 in #36645
- Fix gguf docs by @SunMarc in #36601
- fix typos in the docs directory by @threewebcode in #36639
- Gemma3 by @RyanMullins in #36658
- HPU support by @IlyasMoutawwakil in #36424
- fix block mask typing by @ArthurZucker in #36661
- [CI] gemma 3 make fix-copies by @gante in #36664
- Fix bnb regression due to empty state dict by @SunMarc in #36663
- [core] Large/full refactor of from_pretrained by @Cyrilvallez in #36033
- Don't accidentally mutate the base_model_tp_plan by @Rocketknight1 in #36677
- Fix Failing GPTQ tests by @MekkCyber in #36666
- Remove hardcoded slow image processor class in processors supporting fast ones by @yonigozlan in #36266
- [quants] refactor logic for modules_to_not_convert by @SunMarc in #36672
- Remove differences between init and preprocess kwargs for fast image processors by @yonigozlan in #36186
- Refactor siglip2 fast image processor by @yonigozlan in #36406
- Fix rescale normalize inconsistencies in fast image processors by @yonigozlan in #36388
- [Cache] Don't initialize the cache on meta device by @gante in #36543
- Update config.torch_dtype correctly by @SunMarc in #36679
- Fix slicing for 0-dim param by @SunMarc in #36580
- Changing the test model in Quanto kv cache by @MekkCyber in #36670
- fix wandb hp search unable to resume from sweep_id by @bd793fcb in #35883
- Upgrading torch version and cuda version in quantization docker by @MekkCyber in #36264
- Change Qwen2_VL image processors to have init and call accept the same kwargs by @yonigozlan in #36207
- fix type annotation for ALL_ATTENTION_FUNCTIONS by @WineChord in #36690
- Fix dtype for params without tp_plan by @Cyrilvallez in #36681
- chore: fix typos in utils module by @threewebcode in #36668
- [CI] Automatic rerun of certain test failures by @gante in #36694
- Add loading speed test by @Cyrilvallez in #36671
- fix: fsdp sharded state dict wont work for save_only_model knob by @kmehant in #36627
- Handling an exception related to HQQ quantization in modeling by @MekkCyber in #36702
- Add GGUF support to T5-Encoder by @Isotr0py in #36700
- Final CI cleanup by @Rocketknight1 in #36703
- Add support for fast image processors in add-new-model-like CLI by @yonigozlan in #36313
- Gemma3 processor typo by @Kuangdd01 in #36710
- Make the flaky list a little more general by @Rocketknight1 in #36704
- Cleanup the regex used for doc preprocessing by @Rocketknight1 in #36648
- [model loading] don't gc.collect() if only 1 shard is used by @gante in #36721
- Fix/best model checkpoint fix by @seanswyi in #35885
- Try working around the processor registration bugs by @Rocketknight1 in #36184
- [tests] Parameterized test_eager_matches_sdpa_inference by @gante in #36650
- 🌐 [i18n-KO] Translated codegen.md to Korean by @maximizemaxwell in #36698
- Fix post_init() code duplication by @Cyrilvallez in #36727
- Fix grad accum arbitrary value by @IlyasMoutawwakil in #36691
- [Generation, Gemma 3] When passing a custom generation_config, overwrite default values with the model's base generation_config by @gante in #36684
- 🚨🚨🚨 Fix sdpa in SAM and refactor relative position embeddings by @geetu040 in #36422
- enable/disable compile for quants methods by @SunMarc in #36519
- fix can_generate by @jiqing-feng in #36570
- Allow ray datasets to be used with trainer by @FredrikNoren in #36699
- fix xpu tests by @jiqing-feng in #36656
- Fix test isolation for clear_import_cache utility by @sambhavnoobcoder in #36345
- Fix TrainingArguments.torch_empty_cache_steps post_init check by @pkuderov in #36734
- [MINOR:TYPO] Update hubert.md by @cakiki in #36733
- [CI] remove redundant checks in test_eager_matches_sdpa_inference by @gante in #36740
- [docs] Update README by @stevhliu in #36265
- doc: Clarify is_decoder usage in PretrainedConfig documentation by @d-kleine in #36724
- fix typos in the tests directory by @threewebcode in #36717
- chore: fix typos in tests directory by @threewebcode in #36785
- Fixing typo in gemma3 image_processor_fast and adding a small test by @Zebz13 in #36776
- Fix gemma3_text tokenizer in mapping by @LysandreJik in #36793
- Add Mistral3 by @Cyrilvallez in #36790
- fix hqq due to recent modeling changes by @SunMarc in #36771
- Update SHA for tj-actions/changed-files by @ydshieh in #36795
- Loading optimizations by @Cyrilvallez in #36742
- Fix Mistral3 tests by @yonigozlan in #36797
- Fix casting dtype for qunatization by @SunMarc in #36799
- Fix chameleon's TypeError because inputs_embeds may None by @YenFuLin in #36673
- Support custom dosctrings in modular by @yonigozlan in #36726
- [generate] ✨ vectorized beam search ✨ by @gante in #35802
- Expectations test utils by @ivarflakstad in #36569
- fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model by @yao-matrix in #36572
- Remove dist": "loadfile" for pytest in CircleCI jobs by @ydshieh in #36811
- Fix Device map for bitsandbytes tests by @MekkCyber in #36800
- [Generation] remove leftover code from end-to-end compilation by @gante in #36685
- Add attention visualization tool by @ArthurZucker in #36630
- Add option for ao base configs by @drisspg in #36526
- enable OffloadedCache on XPU from PyTorch 2.7 by @yao-matrix in #36654
- [gemma 3] multimodal checkpoints + AutoModelForCausalLM by @gante in #36741
- One more fix for reviewer assignment by @Rocketknight1 in #36829
- Support tracable dynamicKVcache by @tugsbayasgalan in #36311
- Add Space to Bitsandbytes doc by @MekkCyber in #36834
- quick fix fast_image_processor register error by @JJJYmmm in #36716
- Update configuration_qwen2.py by @michaelfeil in #36735
- Just import torch AdamW instead by @Rocketknight1 in #36177
- Move the warning to the documentation for DataCollatorWithFlattening by @qgallouedec in #36707
- Fix swanlab global step by @Zeyi-Lin in #36728
- Disable inductor config setter by default by @HDCharles in #36608
- [ForCausalLMLoss] allow users to pass shifted labels by @stas00 in #36607
- fix tiktoken convert to pass AddedToken to Tokenizer by @itazap in #36566
- Saving Trainer.collator.tokenizer in when Trainer.processing_class is None by @innerNULL in #36552
- Pass num_items_in_batch directly to loss computation by @eljandoubi in #36753
- Fix fp16 ONNX export for RT-DETR and RT-DETRv2 by @qubvel in #36460
- Update deprecated Jax calls by @rasmi in #35919
- [qwen2 audio] remove redundant code and update docs by @gante in #36282
- Pass state dict by @phos-phophy in #35234
- [modular] Sort modular skips by @gante in #36304
- [generate] clarify docstrings: when to inherit GenerationMixin by @gante in #36605
- Update min safetensors bis by @SunMarc in #36823
- Fix import for torch 2.0, 2.1 - guard typehint for "device_mesh" by @qubvel in #36768
- Gemma 3: Adding explicit GenerationConfig and refactoring conversion … by @RyanMullins in #36833
- Fix: remove the redundant snippet of _whole_word_mask by @HuangBugWei in #36759
- Shieldgemma2 by @RyanMullins in #36678
- Fix ONNX export for sequence classification head by @echarlaix in #36332
- Fix hqq skipped modules and dynamic quant by @mobicham in #36821
- Use pyupgrade --py39-plus to improve code by @cyyever in #36843
- Support loading Quark quantized models in Transformers by @fxmarty-amd in #36372
- DeepSpeed tensor parallel+ZeRO by @inkcherry in #36825
- Refactor Attention implementation for ViT-based models by @qubvel in #36545
- Add Prompt Depth Anything Model by @haotongl in #35401
- Add model visual debugger by @molbap in #36798
- [torchao] revert to get_apply_tensor_subclass by @SunMarc in #36849
- Gemma3: fix test by @zucchini-nlp in #36820
- [CI] fix update metadata job by @gante in #36850
- Add support for seed in DataCollatorForLanguageModeling by @capemox in #36497
- Refactor Aya Vision with modular by @yonigozlan in #36688
- Mllama: raise better error by @zucchini-nlp in #35934
- [CI] doc builder without custom image by @gante in #36862
- FIX FSDP plugin update for QLoRA by @BenjaminBossan in #36720
- Remove call to .item in get_batch_samples by @regisss in #36861
- chore: fix typos in the tests directory by @threewebcode in #36813
- Make ViTPooler configurable by @sebbaur in #36517
- Revert "Update deprecated Jax calls (#35919)" by @ArthurZucker
- [generate] model defaults being inherited only happens for newer models by @gante in #36881
- :red_circle: :red_circle: :red_circle: supersede paligemma forward to shift pos id indexing by @molbap in #36859
- Gemma 3 tests expect greedy decoding by @molbap in #36882
- Use deformable_detr kernel from the Hub by @danieldk in #36853
- Minor Gemma 3 fixes by @molbap in #36884
- Fix: dtype cannot be str by @zucchini-nlp in #36262
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @IlyasMoutawwakil
- Make cache traceable (#35873)
- HPU support (#36424)
- Fix grad accum arbitrary value (#36691)
- @orrzohar
- SmolVLM2 (#36126)
- @threewebcode
- chore: fix function argument descriptions (#36392)
- chore: fix message descriptions in arguments and comments (#36504)
- chore: enhance messages in docstrings (#36525)
- chore: enhance message descriptions in parameters,comments,logs and docstrings (#36554)
- chore: fix typos in language models (#36586)
- fix typos in the docs directory (#36639)
- chore: fix typos in utils module (#36668)
- fix typos in the tests directory (#36717)
- chore: fix typos in tests directory (#36785)
- chore: fix typos in the tests directory (#36813)
- @aymeric-roucher
- Deprecate transformers.agents (#36415)
- @keyboardAnt
- Universal Speculative Decoding CandidateGenerator (#35029)
- @EduardoPach
- [GroundingDino] Fix grounding dino loss 🚨 (#31828)
- @co63oc
- Fix some typos in docs (#36502)
- chore: Fix typos in docs and examples (#36524)
- Fix typos in tests (#36547)
- @RyanMullins
- Gemma3 (#36658)
- Gemma 3: Adding explicit GenerationConfig and refactoring conversion … (#36833)
- Shieldgemma2 (#36678)
- @cyyever
- Use pyupgrade --py39-plus to improve code (#36843)
- @haotongl
- Add Prompt Depth Anything Model (#35401)
- @danieldk
- Use deformable_detr kernel from the Hub (#36853)
- Python
Published by LysandreJik about 1 year ago
transformers - Mistral 3 (Based on v4.49.0)
A new model is added to transformers: Mistral 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
Mistral 3
The model is detailed in the following blog post.
The models are available on the Hub with the following tag: mistral3
Overview
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.
This model was contributed by cyrilvallez and yonigozlan.
The original code can be found here and here.
Usage example
Inference with Pipeline
Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:
```python
import torch
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
# 'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
```
Inference on a single image
This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=20)
# Slice off the prompt tokens so that only the newly generated text is decoded
decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

print(decoded_output)
# "The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
```
Text-only generation
This example shows how to generate text using the Mistral3 model without providing any image input.
````python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

print(decoded_output)
# 1. À plus tard!
# 2. Salut, à plus!
# 3. À toute!
# 4. À la prochaine!
# 5. Je me casse, à plus!
#
# ```
#  /\_/\
# ( o.o )
#  > ^ <
# ```
````
Batched image and text inputs
Mistral3 models also support batched image and text inputs.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
                {"type": "text", "text": "Describe this image"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

# The full sequences are decoded here, so each string starts with the prompt text
decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path",
#  "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
```
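In the batched example the decoded strings start with the prompt text, because generate returns the prompt tokens followed by the new tokens. The single-image example avoids this by slicing the output at the prompt length before decoding. The sketch below shows the same slicing on plain Python lists; `strip_prompt` and the token ids are hypothetical and only stand in for the tensor slice `output[:, inputs["input_ids"].shape[1]:]`.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first `prompt_length` tokens from every sequence in the batch."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Toy batch: 3 prompt tokens (the second prompt is left-padded with 0)
# followed by 2 newly generated tokens per sequence.
generated_ids = [
    [101, 7, 8, 42, 43],
    [0, 101, 9, 50, 51],
]
new_tokens = strip_prompt(generated_ids, prompt_length=3)
print(new_tokens)  # [[42, 43], [50, 51]]
```

A single slice works for the whole batch because padding brings every prompt to the same length.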
Batched multi-image input and quantization with BitsAndBytes
This implementation of the Mistral3 models supports batched text-image inputs with a different number of images for each text.
This example also shows how to use BitsAndBytes to load the model with 4-bit quantization.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint, quantization_config=quantization_config
)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n",
#  "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
```
- Python
Published by LysandreJik about 1 year ago
transformers - Gemma 3 (Based on v4.49.0)
A new model is added to transformers: Gemma 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
Gemma 3
The model is detailed in the following blog post. The models and demos using the model are available in the following collection.
A Space to play around with the 12B-it flavor is available here.
Overview
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.
It encodes an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. The model also interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
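The two attention properties described above can be sketched in plain Python. This is an illustration of the description, not the model's code: the helper names (`layer_types`, `attention_mask`) are hypothetical, and the sketch assumes that bidirectional attention applies among the tokens of the same image.

```python
def layer_types(num_layers, pattern=6):
    """Every `pattern`-th layer uses full causal attention; the others use a sliding window."""
    return ["full" if (i + 1) % pattern == 0 else "sliding" for i in range(num_layers)]


def attention_mask(image_ids):
    """Causal mask in which tokens of the same image also see each other bidirectionally.

    image_ids[i] is an integer image id for image tokens and None for text tokens.
    mask[i][j] is True when position i may attend to position j.
    """
    n = len(image_ids)
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if image_ids[i] is not None and image_ids[i] == image_ids[j]:
                mask[i][j] = True
    return mask


print(layer_types(12))
# One text token, two tokens of image 0, then one more text token.
for row in attention_mask([None, 0, 0, None]):
    print(row)
```

In the printed mask, the first image token can attend to the second one even though it comes later in the sequence, while text tokens stay strictly causal.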
Usage tips
- For image+text and image-only inputs use Gemma3ForConditionalGeneration.
- For text-only inputs use Gemma3ForCausalLM for generation to avoid loading the vision tower.
- Each sample can contain multiple images, and the number of images can vary between samples. However, make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
- The text passed to the processor should have the "<start_of_image>" token where the images should be inserted.
- The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.
Image cropping for high resolution images
The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.
Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
# The model is loaded here so that the inputs can be moved to its device below.
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,  # enable pan and scan cropping
).to(model.device)
```
Usage Example
Single-image Inference
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens (everything after the prompt).
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Multi-image Inference
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens (everything after the prompt).
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Text-only inference
```python
from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")

# The tokenizer returns a dict-like batch (input_ids and attention_mask).
inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(text)
```
Published by LysandreJik about 1 year ago
transformers - Aya Vision (Based on v4.49.0)
A new model is added to transformers: Aya Vision. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-AyaVision
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
Aya Vision
The model is detailed in the following blog post.
Overview
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, by contrast, uses Aya Expanse 32B as the language model.
Key features of Aya Vision include:
- Multimodal capabilities in 23 languages
- Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
- High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
- Seamless integration of visual and textual information in 23 languages.
Usage Example
Here's an example usage of the Aya Vision model.
```py
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
# (the Hindi prompt asks: "What does the text in the picture say?")
messages = [
    {"role": "user",
     "content": [
       {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
        {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
    ]},
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Published by LysandreJik about 1 year ago
transformers - SigLIP-2 (Based on v4.49.0)
A new model is added to transformers: SigLIP-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
SigLIP2
The paper page for the model is available here. It is detailed in the following blog post.
The models and demos using the model are available in the following collection.
Overview
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants
1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)
The abstract from the paper is the following:
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot accuracy), image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).
Usage tips
- Usage of SigLIP2 is similar to SigLIP and CLIP. The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
- Training is supported but does not use `torch.distributed` utilities, which may limit the scalability of batch size. However, DDP and FSDP work on a single-node multi-GPU setup.
- When using the standalone [`GemmaTokenizerFast`], make sure to pass `padding="max_length"` and `max_length=64`, as that's how the model was trained.
- The model was trained with lowercased text, so apply the same preprocessing to your text labels.
- To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
- The NaFlex variant supports processing images at higher resolutions by adjusting the `max_num_patches` parameter in the `Processor`. The default value is `max_num_patches=256`. Increasing `max_num_patches` to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio.
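To make the sigmoid-versus-softmax tip concrete, here is a small sketch in plain Python. The logit values are made up for illustration and do not come from any checkpoint. It shows that sigmoid scores each image-text pair independently, so the scores need not sum to 1, whereas softmax normalizes across the candidate labels.

```python
import math


def sigmoid(x):
    """Independent probability for a single image-text logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Probabilities normalized across all candidate labels."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against three candidate labels.
logits = [-1.7, -7.1, -12.4]

sigmoid_probs = [sigmoid(v) for v in logits]  # what SigLIP-style models expect
softmax_probs = softmax(logits)               # what CLIP-style models expect

print("sigmoid:", [round(p, 4) for p in sigmoid_probs], "sum =", round(sum(sigmoid_probs), 4))
print("softmax:", [round(p, 4) for p in softmax_probs], "sum =", round(sum(softmax_probs), 4))
```

With sigmoid, a low score for every label is a legitimate outcome: it means none of the candidate texts matches the image well.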

This model was contributed by qubvel. The original code can be found here.
Usage example
There are 2 main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.
FixRes variant
Pipeline API
The pipeline lets you use the model in a few lines of code:
```python
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
candidate_labels = ["2 cats", "a plane", "a remote"]
outputs = image_classifier(image, candidate_labels=candidate_labels)
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)
# [{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
```
Using the model yourself
If you want to do the pre- and postprocessing yourself, here's how to do that:
```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
texts = [f"This is a photo of {label}." for label in candidate_labels]

# IMPORTANT: we pass padding=max_length and max_length=64 since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
# 15.0% that image 0 is '2 cats'
```
NaFlex variant
NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio. This enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.
Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while
1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)
The resulting distortion in width and height is at most `(patch_size - 1) / width` and `(patch_size - 1) / height`, respectively, which tends to be small for common resolutions and aspect ratios.
After resizing, the image is split into a sequence of patches, and a mask with padding information is added.
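The resizing rule above can be sketched as a search for the largest scale whose patch grid still fits the budget. This is a minimal illustration under stated assumptions, not the library's implementation: `naflex_target_size` is a hypothetical helper, rounding each side up to a multiple of the patch size is an assumed choice, and the search bounds and tolerance are arbitrary.

```python
import math


def naflex_target_size(height, width, patch_size=16, max_num_patches=256, eps=1e-5):
    """Return (target_height, target_width) as multiples of patch_size such that
    the patch grid has at most max_num_patches cells (hypothetical sketch)."""

    def scaled(scale):
        # Round each side up to the next multiple of patch_size (at least one patch).
        h = max(patch_size, math.ceil(height * scale / patch_size) * patch_size)
        w = max(patch_size, math.ceil(width * scale / patch_size) * patch_size)
        return h, w

    # Binary search for the largest scale that respects the patch budget.
    low, high = eps / 10, 100.0
    while high - low >= eps:
        mid = (low + high) / 2
        h, w = scaled(mid)
        if (h // patch_size) * (w // patch_size) <= max_num_patches:
            low = mid
        else:
            high = mid
    return scaled(low)


# A 480x640 image with the default budget, then with a 4x larger budget.
h, w = naflex_target_size(480, 640, max_num_patches=256)
h4, w4 = naflex_target_size(480, 640, max_num_patches=1024)
print(h, w, (h // 16) * (w // 16))
print(h4, w4, (h4 // 16) * (w4 // 16))
```

Because the number of patches grows with the area, a 4x larger budget roughly doubles each side, which matches the usage tip about raising `max_num_patches` from 256 to 1024.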
```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
texts = [f"This is a photo of {label}." for label in candidate_labels]

# default value for max_num_patches is 256, but you can increase the resulting image resolution
# by providing higher values, e.g. max_num_patches=512
inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
# 21.1% that image 0 is '2 cats'
```
Published by LysandreJik over 1 year ago
transformers - SmolVLM-2 (Based on v4.49.0)
A new model is added to transformers: SmolVLM-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
SmolVLM-2
SmolVLM-2 is detailed in the following blog post.
The models and demos using the model are available in the following collection.
Overview
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
- It uses SmolLM2 for the text model.
- It supports multi-image and video inputs
Usage tips
Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.
Videos should not be upsampled.
If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default.
The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.
Here’s how to control resizing and set a custom size:
image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)
Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.
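The arithmetic behind these two parameters can be sketched roughly as follows. This is an illustration only: `resized_shape` and `patch_grid` are hypothetical helpers, the sketch assumes an aspect-ratio-preserving resize to the longest edge followed by a ceiling division into square patches, and the actual image processor may round or pad differently.

```python
import math


def resized_shape(height, width, longest_edge=4 * 512):
    """Scale the image so its longest edge equals `longest_edge` (assumed rule)."""
    scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)


def patch_grid(height, width, max_image_size=512):
    """Number of square patches of side `max_image_size` needed per dimension."""
    return math.ceil(height / max_image_size), math.ceil(width / max_image_size)


# A 480x640 input with the default settings described above.
new_height, new_width = resized_shape(480, 640)
rows, cols = patch_grid(new_height, new_width)
print(new_height, new_width, rows, cols)
```

Lowering `longest_edge` (for example to `2 * 512`) shrinks the grid and therefore the number of patches the model has to process.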
This model was contributed by orrzohar.
Usage example
Single Media inference
The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here is example code for each.
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

# Image
conversation = [
    {
        "role": "user",
        "content":[
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)

# Video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```
Published by LysandreJik over 1 year ago
transformers - v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2
New models
Helium
Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.
- Add-helium by @ArthurZucker in #35669
Qwen2.5-VL
The Qwen2.5-VL model is an update to Qwen2-VL from the Qwen team, Alibaba Group.
The abstract from this update is the following:
Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.
- add qwen2.5vl by @ShuaiBai623 in #35569
SuperGlue
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
- Add SuperGlue model by @sbucaille in #29886
Granite Vision Support
The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
- Granite Vision Support by @alex-jw-brooks in #35579
Zamba2
Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.
Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer layers, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
- Add Zamba2 by @pglorio in #34517
GOT-OCR 2.0
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.
- Add GOT-OCR 2.0 to Transformers by @yonigozlan in #34721
DAB-DETR
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
- Add DAB-DETR for object detection by @conditionedstimulus in #30803
Depth PRO
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared DINOv2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

- Add Apple's Depth-Pro for depth estimation by @geetu040 in #34583
RT-DETRv2
RT-DETRv2 is an improved Real-Time DEtection TRansformer (RT-DETR). It refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP metrics on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

- Adding RTDETRv2 by @jadechoghari in #34773
Transformers-CLI
Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.
This feature exists in TRL and has been migrated to transformers for easier usage.
- [Chat] Add Chat from TRL 🐈 by @gante in #35714
Processor Standardization
Work is ongoing to standardize the image processors so that their APIs are equivalent. Additionally, the processors are given a fast variant so that they are never a bottleneck in image processing pipelines.
In this release, several processors have been standardized and have gained a fast version.
- OwlViT/Owlv2 post processing standardization by @qubvel in #34929
- OmDet Turbo processor standardization by @qubvel in #34937
- Grounding DINO Processor standardization by @qubvel in #34853
- Refactoring of ImageProcessorFast by @yonigozlan in #35069
- add Qwen2-VL image processor fast by @yonigozlan in #35733
- Remove Multi-threaded image conversion for fast image processors by @yonigozlan in #36105
Breaking changes
DPT segmentation maps
DPT image processors did not support segmentation_maps, instead only requiring images. This has been fixed.
This adds an argument to the preprocess method, therefore users using arguments as positional arguments with that method may see changed behavior. We recommend using keyword arguments for such methods so as to not be bothered by the addition of new features.
- 🔴 🔴 🔴 Added `segmentation maps` support for DPT image processor by @simonreise in #34345
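The risk with positional arguments can be shown with a toy example. The two functions below are hypothetical stand-ins with simplified signatures, not the real DPT `preprocess` method; they only illustrate how inserting a new parameter shifts positional bindings while keyword calls stay stable.

```python
def preprocess_before(images, do_resize=True):
    """Hypothetical signature before the change."""
    return {"images": images, "segmentation_maps": None, "do_resize": do_resize}


def preprocess_after(images, segmentation_maps=None, do_resize=True):
    """Hypothetical signature after a new parameter is inserted."""
    return {"images": images, "segmentation_maps": segmentation_maps, "do_resize": do_resize}


# The same positional call now binds False to a different parameter.
before = preprocess_before(["img"], False)
after = preprocess_after(["img"], False)

# A keyword call keeps its meaning regardless of parameter order.
safe = preprocess_after(["img"], do_resize=False)

print(before["do_resize"], after["do_resize"], after["segmentation_maps"], safe["do_resize"])
```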
Image classification pipeline and single vs multi-label
The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
- 🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848
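As an illustration of the corrected mapping, the sketch below picks the squashing function from `problem_type`: softmax for single-label models and sigmoid for multi-label models. `squash_logits` is a hypothetical helper written in plain Python, not the pipeline's own code, and it ignores how the pipeline behaves when `problem_type` is unset.

```python
import math


def squash_logits(logits, problem_type):
    """Turn raw logits into scores according to the model's problem type."""
    if problem_type == "single_label_classification":
        # Exactly one label applies: normalize across labels with softmax.
        shift = max(logits)
        exps = [math.exp(v - shift) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]
    if problem_type == "multi_label_classification":
        # Labels are independent: score each one with a sigmoid.
        return [1.0 / (1.0 + math.exp(-v)) for v in logits]
    raise ValueError(f"Unsupported problem_type: {problem_type!r}")


example_logits = [2.0, 0.5, -1.0]  # made-up values for illustration
single = squash_logits(example_logits, "single_label_classification")
multi = squash_logits(example_logits, "multi_label_classification")
print([round(p, 3) for p in single])
print([round(p, 3) for p in multi])
```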
Fixing the LayerNorm beta/gamma renames
The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:
- 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615
VLM cleanup
The ignore_index property of the llava configuration has been removed as it was not serving a purpose.
- 🔴 VLM: compile compatibility by @zucchini-nlp in #35724
Quantization
Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.
- Split and clean up GGUF quantization tests by @Isotr0py in #35502
- Display warning for unknown quants config instead of an error by @SunMarc in #35963
- Adding FP8 Quantization to transformers by @MekkCyber in #36026
- New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148
Generate
- [generate] revert change in Aria: the maximum cache length must match `max_length` by @gante in #36120
- 🧹 remove `generate`-related objects and methods scheduled for removal in v4.48 by @gante in #35677
- [generate] can instantiate `GenerationConfig(cache_implementation="static")` by @gante in #35679
- [generate] return Cache object even if passed in a legacy format by @gante in #35673
- [generate] update docstring of `SequenceBiasLogitsProcessor` by @gante in #35699
- Test: generate with `torch.compile(model.forward)` as a fast test by @gante in #34544
- [generate] move max time tests by @gante in #35962
- [generate] shape checks in tests compatible with fixed-length caches (+ some minor fixes) by @gante in #35993
Pipelines
Pipelines have received several bug fixes and improvements which are detailed below.
- Stop mutating input dicts in audio classification pipeline by @Rocketknight1 in #35754
- fix document qa bf16 pipeline by @jiqing-feng in #35456
- fix low-precision audio classification pipeline by @jiqing-feng in #35435
- [pipeline] missing import regarding assisted generation by @gante in #35752
- Output dicts support in text generation pipeline by @jonasrohw in #35092
- Fix Audio Classification Pipeline top_k Documentation Mismatch and Bug #35736 by @sambhavnoobcoder in #35771
Bugfixes and improvements
- Fix flaky `test_custom_4d_attention_mask` by @ydshieh in #35606
- Use inherit tempdir makers for tests + fix failing DS tests by @muellerzr in #35600
- Added error when sequence length is bigger than max_position_embeddings by @Taha1506 in #32156
- Let `EarlyStoppingCallback` not require `load_best_model_at_end` by @muellerzr in #35101
- Fix flaky `test_beam_search_low_memory` by @ydshieh in #35611
- Skip `MobileNetV1ModelTest::test_batching_equivalence` for now by @ydshieh in #35614
- Update codeowners with individual model owners by @Rocketknight1 in #35595
- Fix device in rope module when using dynamic updates by @Cyrilvallez in #35608
- Fix whisper compile by @jiqing-feng in #35413
- Removed some duplicated code by @Sai-Suraj-27 in #35637
- [`Phi`] bias should be True by @ArthurZucker in #35650
- Enable different torch dtype in sub models by @zucchini-nlp in #34873
- [`Compile`] Only test compiling model forward pass by @ArthurZucker in #35658
- [tests] make cuda-only tests device-agnostic by @faaany in #35607
- [i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic by @AhmedAlmaghz in #35193
- Fix `zero_shot_image_classification` documentation guide link in SigLIP by @aretrace in #35671
- Fix : adding einops lib in the CI docker for some bitsandbytes tests by @MekkCyber in #35652
- Update torchao.md: use auto-compilation by @martin0258 in #35490
- Fix : HQQ config when hqq not available by @MekkCyber in #35655
- Fix expected output for ggml test by @MekkCyber in #35686
- Fix : add require_read_token for gemma2 gated model by @MekkCyber in #35687
- Enhanced Installation Section in README.md by @egojoseph in #35094
- Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities by @mahdibaghbanzadeh in #35251
- Clean-up composite configs by @zucchini-nlp in #34603
- Add future import for Py < 3.10 by @Rocketknight1 in #35666
- Enable gptqmodel by @jiqing-feng in #35012
- Fix : Nemotron Processor in GGUF conversion by @MekkCyber in #35708
- Fix typo in /docs/source/ja/model_doc/decision_transformer.md URL by @hiroaki222 in #35705
- Replace deprecated batch_size with max_batch_size when using HybridCache by @mtreinik in #35498
- Fix: Falcon tie_word_embeddings in GGUF by @MekkCyber in #35715
- Fix condition when GA loss bug fix is not performed by @techkang in #35651
- Fix the bug that `Trainer` cannot correctly call `torch_jit_model_eval` by @Wanguy in #35722
- [generation] fix type hint by @gante in #35725
- Add proper jinja2 error by @Rocketknight1 in #35533
- Optimize ForCausalLMLoss by removing unnecessary contiguous() call to reduce memory overhead by @efsotr in #35646
- Modular: support for importing functions from any file by @Cyrilvallez in #35692
- Remove batch size argument warning when unjustified by @quintenroets in #35519
- [cache] add a test to confirm we can use cache at train time by @gante in #35709
- Remove `pt_to_tf` by @gante in #35672
- Added resource class configuration option for `check_circleci_user` job by @Sai-Suraj-27 in #32866
- Fix some tests by @Cyrilvallez in #35682
- Unable to use `MimiModel` with DeepSpeed ZeRO-3 by @anferico in #34735
- check is added for the report_to variable in TrainingArguments by @alpertunga-bile in #35403
- Added liger_kernel compatibility with `PeftModel` by @ambroser53 in #35680
- Restore is_torch_greater_or_equal_than for backward compatibility by @tlrmchlsmth in #35734
- Revert "Unable to use `MimiModel` with DeepSpeed ZeRO-3" by @eustlb in #35755
- ci: fix xpu skip condition for test_model_parallel_beam_search by @dvrogozh in #35742
- Use AMD CI workflow defined in hf-workflows by @ivarflakstad in #35058
- Fix CI for VLMs by @zucchini-nlp in #35690
- Security fix for `self-comment-ci.yml` by @ydshieh in #35548
- [ViTPose] Convert more checkpoints by @NielsRogge in #35638
- fix register_buffer in MimiEuclideanCodebook by @anferico in #35759
- remove code owners as it was generating too much noise BUT by @ArthurZucker in #35784
- Skip Falcon 7B GGML Test by @MekkCyber in #35783
- [fix] cannot import name 'Pop2PianoFeatureExtractor' from 'transformers' by @faaany in #35604
- transformers.image_transforms.normalize wrong types by @CalOmnie in #35773
- Patch moonshine by @eustlb in #35731
- Don't import torch.distributed when it's not available by @booxter in #35777
- Fix vits low-precision dtype by @jiqing-feng in #35418
- Tool calling: support more types by @aymeric-roucher in #35776
- Fixes, improvements to timm import behaviour by @rwightman in #35800
- modular_model_converter bugfix on assignments by @nikosanto13 in #35642
- Deterministic sorting in modular converter when adding new functions by @Cyrilvallez in #35795
- Fix "test_chat_template_dict" in video LLMs by @zucchini-nlp in #35660
- Update AMD Docker image by @ivarflakstad in #35804
- Add LlavaImageProcessor by @NielsRogge in #33191
- Byebye test_batching_equivalence's flakiness by @ydshieh in #35729
- [Doc] Adding blog post to model doc for TimmWrapper by @ariG23498 in #35744
- add a new flax example for Bert model inference by @louie-tsai in #34794
- Support adamw_torch_8bit by @fzyzcjy in #34993
- Auto-add timm tag to timm-wrapper models. by @pcuenca in #35794
- Fix : BLOOM tie_word_embeddings in GGUF by @MekkCyber in #35812
- Fixed typo in autoawq version number in an error message for IPEX backend requirements. by @InfroLab in #35815
- Remove deprecated get_cached_models by @Wauplin in #35809
- Optimized set_initialized_submodules. by @LagPixelLOL in #35493
- [i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
- move fastspeech to audio models by @eustlb in #35788
- Improve modular documentation by @Cyrilvallez in #35737
- [Mimi] update test expected values for t4 runners by @eustlb in #35696
- Remove old benchmark code by @gante in #35730
- Remove pyav pin to allow python 3.11 to be used by @CalOmnie in #35823
- Another security patch for self-comment-ci.yml by @ydshieh in #35816
- Init cache on meta device by @zucchini-nlp in #35164
- Hotfix: missing working-directory in self-comment-ci.yml by @ydshieh in #35833
- [gpt2] fix generation tests by @gante in #35822
- Fix : Nemotron tokenizer for GGUF format by @MekkCyber in #35836
- Fix head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
- [chat] docs fix by @gante in #35840
- Fix compatibility issues when using auto_gptq with these older versions by @LRL-ModelCloud in #35830
- Add PyTorch version check for FA backend on AMD GPUs by @mht-sharma in #35813
- Fix NoneType type as it requires py>=3.10 by @SunMarc in #35843
- [tests] remove some flash attention class tests by @ArthurZucker in #35817
- [Backend support] Allow num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
- Fix GA loss for Deepspeed by @timjeffrey10 in #35808
- Fix uploading processors/tokenizers to WandB on train end by @jack89roberts in #35701
- Fix more CI tests by @ArthurZucker in #35661
- [DOC] Fix contamination and missing paragraph in translation by @Yosshi999 in #35851
- Fix typo by @SilverSoldier in #35854
- fix apply_chat_template() padding choice by @baoyf4244 in #35828
- Fix test_pipelines_video_classification that was always failing by @CalOmnie in #35842
- Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch by @sheryc in #35779
- use torch.testing.assert_close instead to get more details about error in CIs by @ArthurZucker in #35659
- add xpu device check in device_placement by @faaany in #35865
- Add Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
- [doctest] Fixes by @stevhliu in #35863
- Fix fast image processor warnings in object detection examples by @sugendran in #35892
- Update deepspeed amd image by @ivarflakstad in #35906
- Fix typing in audio_utils.chroma_filter_bank by @CalOmnie in #35888
- [docs] uv install by @stevhliu in #35821
- Fix the config class comparison for remote code models by @Rocketknight1 in #35592
- Close Zamba2Config code block by @Rocketknight1 in #35914
- [docs] Fix Zamba2 by @stevhliu in #35916
- Remove _supports_static_cache = True for some model classes by @ydshieh in #34975
- Use rocm6.2 for AMD images by @ivarflakstad in #35930
- Add default TP plan for all models with backend support by @Cyrilvallez in #35870
- Fix: loading DBRX back from saved path by @zucchini-nlp in #35728
- Fix mask slicing for models with HybridCache by @Cyrilvallez in #35681
- Qwen-2-5-VL: fix CI by @zucchini-nlp in #35935
- Fix TP initialization by @Cyrilvallez in #35860
- fix(FA): QKV not being casted to target_dtype for FA with dpo lora by @NanoCode012 in #35834
- Remove INC notebook reference in documentation by @echarlaix in #35936
- use torch constraints to check if covariance is positive definite during mean resizing. by @abuelnasr0 in #35693
- fix test_generated_length_assisted_generation by @keyboardAnt in #34935
- Update unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
- Update squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
- Fix flaky test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
- Trainer Refactor: Part 1 by @muellerzr in #35567
- update docker file transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
- [tests] further fix Tester object has no attribute '_testMethodName' by @faaany in #35781
- Update README.md by @BlessedTatonka in #35958
- fix iterator overflow when gradient accumulation is 1 by @winglian in #35960
- Fix is_causal being a tensor by @IlyasMoutawwakil in #35791
- [bart] minor test fixes by @gante in #35965
- Pixtral: vectorize patch embeddings and enable tests by @zucchini-nlp in #35122
- Whisper: fix static cache CI by @zucchini-nlp in #35852
- Less flaky for TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
- Support batching for UsefulSensors Moonshine by @njeffrie in #35922
- not to use A100 for benchmark.yml by @ydshieh in #35974
- Handle empty change indices in SAM's mask to rle conversion by @MSt-10 in #35665
- Add support for nested images to LLava and VipLLava by @yonigozlan in #35558
- [Moonshine] compute head_dim_padding at init by @eustlb in #35984
- [Moshi] disable automatic compilation if the model can't compile by @gante in #35992
- use torch 2.6 for daily CI by @ydshieh in #35985
- Update-tp test by @ArthurZucker in #35844
- Add mean_resizing for every VLMs' resizing_token_embeddings() by @YenFuLin in #35717
- Update Granite Vision Model Path / Tests by @alex-jw-brooks in #35998
- Qwen2-VL: fix rope delta calculation by @zucchini-nlp in #36013
- Fix custom kernel for DeformableDetr, RT-Detr, GroundingDINO, OmDet-Turbo in Pytorch 2.6.0 by @qubvel in #35979
- apply_chat_template: consistent behaviour for return_assistant_tokens_mask=True return_tensors=True by @mrsndmn in #35582
- layernorm_decay_fix by @Ryoo72 in #35927
- Update Mistral converter by @Cyrilvallez in #35967
- Refactor (and fix) gpt_neox by @Cyrilvallez in #35610
- Fix device mismatch error in Whisper model during feature extraction by @thedebugger in #35866
- Fix RMSNormGated in Zamba2 by @pglorio in #35943
- Comment bot CI for other jobs (generation / quantization) by @ydshieh in #35341
- Hotfix for self-comment-ci.yml by @ydshieh in #36030
- feat(ci): ignore trufflehog unverified results by @McPatate in #36031
- CircleCI with python 3.9 by @ydshieh in #36027
- Update tests regarding attention types after #35235 by @ydshieh in #36024
- Fix Gemma2 synced multi-GPU generation by @ManukyanD in #35232
- Fix synced multi-GPU generation with LLMs and VLMs by @ManukyanD in #35893
- Add XPU type for work-around -inf mask causing sdpa NaN issue in modeling files by @Liangliang-Ma in #35647
- add support for empty list as input to create_model_card by @ROZBEH in #36042
- DeepSpeed github repo move sync by @stas00 in #36021
- [docs] no hard coding cuda as bnb has multi-backend support by @faaany in #35867
- [docs] fix bugs in the bitsandbytes documentation by @faaany in #35868
- [docs] no hard-coding cuda by @faaany in #36043
- Fix how we compute the final non-padding token for ForSequenceClassification models by @Rocketknight1 in #35911
- Add Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
- Iterative generation using Input embeds and past_key_values by @yaswanth19 in #35890
- Fix usage of unpad_input function by @pavelgein in #35925
- Fix repo consistency by @ydshieh in #36063
- Update test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
- Paligemma: fix generation with Gemma2 by @zucchini-nlp in #36044
- Save checkpoint to temporary directory to handle partial saves during failures by @SilverSoldier in #35580
- Nail in edge case of torch dtype being overridden permanently in the case of an error by @muellerzr in #35845
- Fix words typos in ggml test. by @zhanluxianshen in #36060
- Fix model kwargs by @muellerzr in #35875
- Fix StopStringCriteria to handle tokens above len(tokenizer) by @Rocketknight1 in #35797
- [docs] fix outdated example code in trainer.md by @faaany in #36066
- Adding RT-DETRv2 for object detection by @jadechoghari in #34773
- Fix bug in apply_rotary_pos_emb_flashatt: in Qwen2-5-VL by @DeepWaved in #36065
- Move audio top_k tests to the right file and add slow decorator by @Rocketknight1 in #36072
- Fix OS err by @muellerzr in #36094
- [docs] fix model checkpoint name by @faaany in #36075
- [docs] fix typo by @faaany in #36080
- [docs] fix not-working example code in perf_infer_gpu_one.md by @faaany in #36087
- fix MllamaVisionAttention typehint by @kylesayrs in #35975
- Processors: allow tuples of images when checking by @zucchini-nlp in #36084
- Chat template: update for processor by @zucchini-nlp in #35953
- Paligemma: revert #36084 by @zucchini-nlp in #36113
- Support constant lr with cooldown by @LoserCheems in #35453
- Enable pytest live log and show warning logs on GitHub Actions CI runs by @ydshieh in #35912
- Refactor OPT model by @jiqing-feng in #36101
- Revert checkpoint tmp dir by @SunMarc in #36112
- [Bugfix] fix file name of docstring in utils/check_table.py by @kkscilife in #36108
- fix bnb warning by @SunMarc in #36116
- AutoformerForPrediction test add atol by @ivarflakstad in #36017
- Fix nightly CIs: missing atols by @ArthurZucker in #35903
- Add common test for torch.export and fix some vision models by @qubvel in #35124
- fix: typos in documentation files by @maximevtush in #36122
- update awesome-transformers.md. by @zhanluxianshen in #36115
- Fix max size deprecated warning by @HichTala in #34998
- Fix CI issues by @molbap in #35662
- update tiktoken integ to use converted by @ArthurZucker in #36135
- Make output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
- [docs] minor doc fix by @faaany in #36127
- [docs] update awq doc by @faaany in #36079
- Add pipeline parallel plan to PretrainedConfig and PreTrainedModel by @hmellor in #36091
- add RAdamScheduleFree optimizer by @nhamanasu in #35313
- added warning to Trainer when label_names is not specified for PeftModel by @MilkClouds in #32085
- Whisper: remove redundant assisted generation tests by @gante in #34814
- Add utility for Reload Transformers imports cache for development workflow #35508 by @sambhavnoobcoder in #35858
- VLM: enable skipped tests by @zucchini-nlp in #35746
- [commands] remove deprecated/inoperational commands by @gante in #35718
- Fix Gradient Checkpointing for Deberta & Deberta-V2 using PEFT / Adapters by @lenglaender in #35898
- 🚨 Remove cache migration script by @Wauplin in #35810
- multi-gpu: fix tensor device placements for various models by @dvrogozh in #35763
- Optim: APOLLO optimizer integration by @zhuhanqing in #36062
- Fix multi gpu loss sync condition, add doc and test by @techkang in #35743
- adding option to save/reload scaler by @hsilva664 in #34932
- Update doc re list of models supporting TP by @kwen2501 in #35864
- Add more rigorous non-slow grad accum tests by @muellerzr in #35668
- Fix test fetcher by @ydshieh in #36129
- skip test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
- Add git LFS to AMD docker image by @ivarflakstad in #36016
- Mllama fsdp by @blbadger in #36000
- Fix PaliGemma Pad Token Masking During Training #35855 by @sambhavnoobcoder in #35859
- Add reminder config to issue template and print DS version in env by @Ben-Schneider-code in #35156
- Fix Gemma2 dtype issue when storing weights in float16 precision by @Nerogar in #35398
- Replace deprecated update_repo_visibility by @Wauplin in #35970
- Fix tests for vision models by @qubvel in #35654
- qwen2.5vl: fix bugs when using flash2+bf16 or num_return_sequences>1 by @gewenbin0992 in #36083
- docs: fix return type annotation of get_default_model_revision by @MarcoGorelli in #35982
- Fix PretrainedTokenizerFast check => Fix PretrainedTokenizerFast Save by @CL-ModelCloud in #35835
- Move DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
- Helium documentation fixes by @LysandreJik in #36170
- Remove loading custom kernel for RT-DETRv2 by @qubvel in #36098
- [Modular] skip modular checks based on diff by @gante in #36130
- Fix red CI by @ArthurZucker in #36174
- Fix : fix doc fp8 by @MekkCyber in #36173
- Efficient Inference Kernel for SpQR by @elvircrn in #34976
- fix training issues by @ArthurZucker in #36158
- add disable compile option by @ArthurZucker in #36161
- CI: avoid human error, automatically infer generative models by @gante in #33212
- Use tqdm auto by @SmartManoj in #35726
- Optimize Qwen2VL vision model by precomputing cos/sin embeds before ViT blocks by @li-plus in #35837
- Make check_repository_consistency run faster by MP by @ydshieh in #36175
- Fix the key name for _load_rng_state under torch.cuda by @wizyoung in #36138
- Follow up to SpQR integration by @MekkCyber in #36176
- Fix a mistake in #36175 by @ydshieh in #36179
- Fix make_batched_videos and add tests by @yonigozlan in #36143
- Uniformize OwlViT and Owlv2 processors by @yonigozlan in #35700
- Add support for partial rotary embeddings in Phi3 model by @garg-amit in #35947
- CI: fix test-save-trainer by @zucchini-nlp in #36191
- Chat template docs by @zucchini-nlp in #36163
- Add ImageProcessorFast to Qwen2.5-VL processor by @Isotr0py in #36164
- Prepare processors for VideoLLMs by @zucchini-nlp in #36149
- Add require_read_token to fp8 tests by @MekkCyber in #36189
- Revert qwen2 breaking changes related to attention refactor by @ArthurZucker in #36162
- Guard against unset resolved_archive_file by @dmlap in #35628
- [Bugfix] Fix reloading of pixtral/llava configs by @kylesayrs in #36077
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @jiqing-feng
    - Fix whisper compile (#35413)
    - Enable gptqmodel (#35012)
    - fix document qa bf16 pipeline (#35456)
    - Fix vits low-precision dtype (#35418)
    - fix low-precision audio classification pipeline (#35435)
    - Refactor OPT model (#36101)
- @AhmedAlmaghz
    - [i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic (#35193)
    - [i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic (#35198)
- @sbucaille
    - Add SuperGlue model (#29886)
- @Isotr0py
    - Fix head_dim in config extracted from Gemma2 GGUF model (#35818)
    - Split and clean up GGUF quantization tests (#35502)
    - Add ImageProcessorFast to Qwen2.5-VL processor (#36164)
- @ShuaiBai623
    - add qwen2.5vl (#35569)
- @alex-jw-brooks
    - Granite Vision Support (#35579)
    - Update Granite Vision Model Path / Tests (#35998)
- @pglorio
    - Add Zamba2 (#34517)
    - Fix RMSNormGated in Zamba2 (#35943)
- @conditionedstimulus
    - Add DAB-DETR for object detection (#30803)
- @jadechoghari
    - Adding RT-DETRv2 for object detection (#34773)
- @geetu040
    - Add Apple's Depth-Pro for depth estimation (#34583)
- @zhuhanqing
    - Optim: APOLLO optimizer integration (#36062)
- @bauwenst
    - Move DataCollatorForMultipleChoice from the docs to the package (#34763)
- @elvircrn
    - Efficient Inference Kernel for SpQR (#34976)
Published by LysandreJik over 1 year ago
transformers - Patch release v4.48.3
This mostly ends the Python 3.9 issues!
- Add future import for Py < 3.10 (#35666) by @Rocketknight1
For some very niche cases, the new rope embedding introduced device failures:
- Fix device in rope module when using dynamic updates (#35608) by @Cyrilvallez
Num items in batch
- Fix model kwargs (#35875) by @muellerzr: this was long overdue, sorry that it took so long. Some models were not compatible with the num_items_in_batch argument.
Finally, the fix to Gemma2 is propagated to PaliGemma2!
- Paligemma: fix generation with Gemma2 (#36044) by @zucchini-nlp
Published by ArthurZucker over 1 year ago
transformers - Patch release v4.48.2
Sorry, the fixes for num_items_in_batch are not done yet 😓 To follow along, see this PR; a new patch will be available soon!
For now, we mostly had backward-compatibility issues with Python 3.9:
- Restore is_torch_greater_or_equal_than for backward compatibility (#35734) by @tlrmchlsmth
- Fix NoneType type as it requires py>=3.10 (#35843) by @SunMarc
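The NoneType entry above exists because types.NoneType is only available from Python 3.10. A minimal sketch of a version-tolerant fallback follows; it is an illustration of the problem, not necessarily how #35843 solved it, and is_none_type is a hypothetical helper.

```python
import sys

# types.NoneType was added in Python 3.10, so this import fails on 3.9.
if sys.version_info >= (3, 10):
    from types import NoneType
else:
    NoneType = type(None)  # the same object, reachable on every Python 3 version


def is_none_type(tp) -> bool:
    """Hypothetical helper: True if `tp` is the type of None."""
    return tp is NoneType
```

Either branch yields the same class, so code written against the fallback behaves identically on newer interpreters.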
Then we had a small regression for DBRX saving:
- Fix: loading DBRX back from saved path (#35728) by @zucchini-nlp
Finally we have a fix for gemma and the hybrid attention architectures:
- Fix mask slicing for models with HybridCache (#35681) by @Cyrilvallez
Miscellaneous:
- Fix is_causal being a tensor (#35791) by @IlyasMoutawwakil
Published by ArthurZucker over 1 year ago
transformers - Patch release v4.48.1
Yet again we are back with a gradient accumulation fix! The attention refactor also let a small typo slip in; we made sure Phi is no longer broken!
Moonshine had a small issue when wrapping generate so we removed that!
- [Phi] bias should be True (#35650) @ArthurZucker
- Fix condition when GA loss bug fix is not performed (#35651) @techkang
- Patch moonshine (#35731) @eustlb
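As background for the gradient accumulation (GA) fixes listed above: when micro-batches contain different numbers of tokens, averaging each micro-batch loss and then averaging those averages does not equal the loss of the full batch. The toy sketch below uses made-up per-token losses and plain Python; it is not Trainer code, only an illustration of why the total item count (num_items_in_batch) is needed for normalisation.

```python
def mean_of_micro_batch_means(micro_batches):
    """Biased: average each micro-batch, then average the averages."""
    means = [sum(losses) / len(losses) for losses in micro_batches]
    return sum(means) / len(means)


def loss_normalised_by_num_items(micro_batches):
    """Sum every per-token loss and divide once by the total token count."""
    num_items_in_batch = sum(len(losses) for losses in micro_batches)
    return sum(sum(losses) for losses in micro_batches) / num_items_in_batch


# Two micro-batches with different numbers of (non-padding) tokens.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [8.0]]
biased = mean_of_micro_batch_means(micro_batches)    # (2.0 + 8.0) / 2
exact = loss_normalised_by_num_items(micro_batches)  # 16.0 / 5
```

Here the single-token micro-batch is over-weighted in the first function, which is the kind of mismatch the fixes address.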
🤗
Published by ArthurZucker over 1 year ago
transformers - v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine
New models
ModernBERT
The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
- Rotary Positional Embeddings to support sequences of up to 8192 tokens.
- Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- GeGLU layers replacing the original MLP layers, shown to improve performance.
- Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
- Flash Attention to speed up processing.
- A model design that follows the recent paper The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
- Add ModernBERT to Transformers by @warner-benjamin in #35158
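The alternating attention pattern in the list above can be sketched as a mask rule. This is a simplified illustration in plain Python, not the ModernBERT implementation. Two details are assumptions made for the sketch: that the global layers are those whose index is a multiple of 3, and that the 128-token window is centred on the query (64 tokens on each side).

```python
def is_global_layer(layer_idx: int, global_every: int = 3) -> bool:
    """Assumption: layers 0, 3, 6, ... use global attention."""
    return layer_idx % global_every == 0


def can_attend(query_pos: int, key_pos: int, layer_idx: int, window: int = 128) -> bool:
    """Bidirectional (encoder) mask: True if the query may attend to the key."""
    if is_global_layer(layer_idx):
        return True
    # Assumption: the window is centred on the query, window // 2 tokens per side.
    return abs(query_pos - key_pos) <= window // 2
```

In a local layer the cost per token is bounded by the window size rather than the sequence length, which is what makes 8192-token inputs affordable.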
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
- Add Aria by @aymeric-roucher in #34157
TimmWrapper
We add a TimmWrapper set of classes such that timm models can be loaded in as transformer models into the library.
Here's a general usage example:
```py
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

# Load the timm checkpoint through the regular Auto classes
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 class probabilities (in percent) and their indices
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
```
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:
```py
from urllib.request import urlopen
from PIL import Image

from transformers import pipeline

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
# A timm checkpoint can be passed to a pipeline like any other model
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
```
- Add TimmWrapper by @qubvel and @amyeroberts in #34564
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
- Update Pixtral conversion script to support large format! by @arthurzucker in #34801
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo. Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
- Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736
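The late interaction scoring mentioned above can be illustrated with a small, dependency-free sketch. It shows the ColBERT-style MaxSim rule on toy 2-dimensional vectors; the vectors and page names are invented for the example, and real ColPali embeddings come from the model, not from hand-written lists.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vectors, document_vectors):
    """MaxSim: for each query vector keep its best-matching document vector,
    then sum those maxima over the whole query."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy multi-vector embeddings (one vector per query token / image patch).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "page_a": late_interaction_score(query, page_a),
    "page_b": late_interaction_score(query, page_b),
}
best_page = max(scores, key=scores.get)
```

Because each query vector is matched independently, a page is rewarded when different regions of it answer different parts of the query.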
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
- One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
- Add Falcon3 documentation by @mokeddembillel in #35307
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
- Add the Bamba Model by @fabianlim in #34982
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoints locations for each detected person, allowing it to be easily used with any object detection model.

- Add VitPose by @SangbumChoi and @NielsRogge in #30530
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
- no artifacts
- interpretable attention maps
- improved performance
- Add DINOv2 with registers by @NielsRogge in #35348
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
- Add Emu3 by @zucchini-nlp in #33770
Cohere2
A new Cohere update was added through a new "Cohere2" set of classes.
- Add Cohere2 model by @alexrs-cohere in #35224
TextNet
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
- Add TextNet by @jadechoghari in #34979
DiffLlama
DiffLlama combines the Llama architecture with Differential Transformer's attention.
- Add DiffLlama by @weak-kajuma in #34083
PixtralLarge
The conversion script needed a few updates, while the modeling code was barely changed!
- [PixtralLarge] Update Pixtral conversion script to support large format! (#34801)
Moonshine
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
- Add Moonshine by @eustlb in #34784
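To illustrate why rotary embeddings do not tie a model to a fixed input length, here is a minimal, generic RoPE sketch in plain Python. It is not Moonshine's code: the interleaved pairing of dimensions and the base of 10000 are assumptions taken from the common RoPE formulation. The point it demonstrates is that the query-key dot product depends only on the relative offset between positions.

```python
import math


def rotate(vector, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    Elements (i, i + 1) form a pair that is rotated by the angle
    position * base ** (-i / dim), for i = 0, 2, 4, ...
    """
    dim = len(vector)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# Same relative offset (3 positions) at very different absolute positions.
near = dot(rotate(q, 5), rotate(k, 2))
far = dot(rotate(q, 1005), rotate(k, 1002))
```

The two scores agree up to floating-point error, so attention behaves the same way wherever the pair sits in the sequence.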
Quantization methods
VPTQ Quantization
From the VPTQ contributors:
VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy. More details here: https://github.com/microsoft/vptq
- FEAT : Adding VPTQ quantization method to HFQuantizer by @wejoncy in #34770
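For readers unfamiliar with vector quantization, the sketch below shows the generic idea only: groups of weights are replaced by the index of their nearest codebook entry. It is not the VPTQ algorithm, and the codebook and weights are hand-picked toy values.

```python
def nearest_centroid(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(weights, codebook, group_size):
    """Split a flat weight list into vectors and keep one index per vector."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [nearest_centroid(group, codebook) for group in groups]


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [value for idx in indices for value in codebook[idx]]


# 4 codebook entries -> 2 bits per index; each index covers 2 weights.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.2, -0.8, 1.1, 0.1, -0.2, 1.05, -0.9]
indices = quantize(weights, codebook, group_size=2)
restored = dequantize(indices, codebook)
```

In this toy setup the stored indices cost 1 bit per weight (ignoring the codebook itself), which is how vector quantization reaches bit-widths below 2.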
HIGGS Quantization
From the contributors:
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.
Runtime support for HIGGS is implemented through FLUTE, and its library.
This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.
- HIGGS Quantization Support by @BlackSamorez in #34997
Cleanup
We merged a cleanup for vision language models, to make sure all models are standardized.
- VLMs: major clean up 🧼 (#34502)
Breaking changes
Conversion scripts
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
- 🚨🚨🚨 Delete conversion scripts when making release wheels by @Rocketknight1 in #35296
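To see which files the glob pattern above covers in a local checkout, a small sketch with pathlib is enough. The function name is hypothetical, and the root directory you pass (for example the package source directory of a clone of the main branch) is an assumption about your setup.

```python
from pathlib import Path


def find_conversion_scripts(src_root):
    """List files matching models/**/convert_*.py below `src_root`."""
    return sorted(Path(src_root).glob("models/**/convert_*.py"))
```

Running it against an installed release wheel from this version onwards should return an empty list, since the scripts are excluded there.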
Backtracking in Nougat
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
- 🚨🚨🚨 Limit backtracking in Nougat regexp by @qubvel in #35264
Whisper decoding
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
- [Whisper] 🚨 Fix whisper decoding 🚨 by @eustlb in #34135
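The new rule for when decoder input IDs and the EOS token ID are kept can be written down as a small predicate. The helper below is hypothetical and not part of transformers; it only restates the condition from the notes above so it can be checked against your own generate arguments.

```python
def output_keeps_prompt_and_eos(
    return_dict_in_generate: bool,
    return_timestamps: bool,
    force_unique_generate_call: bool,
) -> bool:
    """Hypothetical helper mirroring the rule stated in the release notes."""
    return return_dict_in_generate and (
        not return_timestamps or force_unique_generate_call
    )
```

In every other combination, code that previously sliced off the decoder prompt or the trailing EOS token from short-form outputs needs to stop doing so.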
Attention refactor
To make the attention layers cleaner, more isolated, and future-proof, they have been refactored: each model keeps its own attention code within its modeling file, while the attention definitions for SDPA, Flash Attention, and other attention types have been moved to a common file.
- 🚨All attention refactor🚨 by @ArthurZucker in #35235
Bugfixes and improvements
- [tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer (#35593)
- Setup loss_type in config at model init time (#34616)
- [docs] Update Python version in translations by @jla524 in #35096
- [docs] top_p, top_k, temperature docstrings by @stevhliu in #35065
- Fix private forked repo. CI by @ydshieh in #35114
- Add feature dim attributes to BitLinear for easier PEFT integration by @agostinv in #34946
- Update I-JEPA checkpoints path by @qubvel in #35120
- Fix GA loss bugs and add unit test by @techkang in #35121
- [I-JEPA] Update docs by @NielsRogge in #35148
- Corrected typo in agent system prompts by @Uvi-12 in #35143
- Option to set 'non_blocking' for to(device) in BatchEncoding and BatchFeature by @daniel-bogdoll in #34883
- Fix typo in EETQ Tests by @MekkCyber in #35160
- Cleanup: continue the init refactor by @LysandreJik in #35167
- Super tiny fix logging message by @fzyzcjy in #35132
- Fixed typo of 'avilable' in prompts.py by @Uvi-12 in #35145
- [CI] Fix bnb quantization tests with accelerate>=1.2.0 by @matthewdouglas in #35172
- Fix num_items_in_batch not being an integer by @xspirus in #35115
- Assisted decoding multi-gpu by @zucchini-nlp in #35116
- Fix file path for shard_num 1 with mllama converter by @strangiato in #35053
- Support BatchNorm in Hubert pos_conv_emb as in fairseq by @gallilmaimon in #34389
- Remove unnecessary masked_fill in deberta models by @xadupre in #35182
- Fix DBRX LayerNorm init method by @hgt312 in #35177
- Fixing GGUF support for StableLm by @MekkCyber in #35060
- [i18n-ar] Translated file : docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
- Multiple typo fixes in NLP, Audio docs by @henryhmko in #35181
- Only import torch.distributed if it is available by @GaetanLepage in #35133
- [i18n-] Translating Benchmarks.md to Chinese by @asdkfjsd in #35137
- [docs] Fix FlashAttention link by @stevhliu in #35171
- Update data collator docstrings to accurately reference Nvidia tensor core compute capability version by @johngrahamreynolds in #35188
- [i18n-] Translating agents.md to Chinese by @HMJ0628 in #35139
- BLIP: enable device map by @zucchini-nlp in #34850
- 🧹 Remove deprecated RotaryEmbedding parts in the Attention layers by @Cyrilvallez in #34858
- [PEFT] Better Trainer error when prompt learning with loading best model at the end by @BenjaminBossan in #35087
- Cleanup: continue the init refactor by @LysandreJik in #35170
- Fix CI by @Cyrilvallez in #35208
- Fix seamless TTS generate by @ylacombe in #34968
- docs: clarify initializer_range parameter description in Idefics3VisionConfig by @h3110Fr13nd in #35215
- Fixed typo of 'indentifier' in audio_utils.py by @Uvi-12 in #35226
- Fix type hints for apply_chat_template by @Rocketknight1 in #35216
- Support Python 3.10+ Union style in chat template type hints parsing by @RezaRahemtola in #35103
- Refactoring AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
- Change back to Thread for SF conversion by @ydshieh in #35236
- [Init refactor] Modular changes by @LysandreJik in #35240
- Fix typo in chat template example by @EricWinsorDSIT in #35250
- Run model as compressed/uncompressed mode by @horheynm in #34719
- skip Fuyu from test_generate by @nhamanasu in #35246
- [tests] fix "Tester object has no attribute '_testMethodName'" by @faaany in #34910
- Use rsfE with pytest by @ydshieh in #35119
- Update AMD docker image (rocm 6.1) by @ivarflakstad in #35259
- Fixed typos in Audio Classification Documentation by @Uvi-12 in #35263
- Translating agents_advanced.md to Chinese by @HMJ0628 in #35231
- Fix FSDP no longer working by @muellerzr in #35212
- don't use no_sync when deepspeed doesn't support it for certain zero stages by @winglian in #35157
- [i18n-Chinese] Translating perf_train_cpu.md to Chinese by @asdkfjsd in #35242
- Fall back to slow image processor in ImageProcessingAuto when no fast processor available by @yonigozlan in #34785
- Aggeregate test summary files in CircleCI workflow runs by @ydshieh in #34989
- Blip: fix offloading and MP tests by @zucchini-nlp in #35239
- Fix : model used to test ggml conversion of Falcon-7b is incorrect by @MekkCyber in #35083
- Temporarily disable amd push ci by @ivarflakstad in #35293
- Delete redundancy for loop checks. by @zhanluxianshen in #35288
- [Whisper] patch float type on mps by @eustlb in #35295
- Fix typos in Translated Audio Classification Docs by @jla524 in #35287
- Translating "translate perf_infer_gpu_multi.md" to Chinese by @HMJ0628 in #35271
- Fix wrongs in quicktour[zh] by @zhanluxianshen in #35272
- Improved documentation of Automatic speech recognition by @Uvi-12 in #35268
- fix modular order by @ArthurZucker in #35297
- Add sdpa for Beit by @OmarManzoor in #34941
- Support for SDPA for SAM models by @MagnusS0 in #34110
- remove benchmark job in push-important-models.yml by @ydshieh in #35292
- Fix typos in translated quicktour docs by @jla524 in #35302
- Fix image preview in multi-GPU inference docs by @jla524 in #35303
- Fix remove unused parameter in docs by @zzzzzsa in #35306
- Add Cohere2 docs details by @alexrs-cohere in #35294
- Fixed typo in audio_classification.md by @Uvi-12 in #35305
- [docs] Improve register_pipeline by @stevhliu in #35300
- Fix loading with only state dict and low_cpu_mem_usage = True by @SunMarc in #35217
- [tests] make cuda-only tests device-agnostic by @faaany in #35222
- Trigger GitHub CI with a comment on PR by @ydshieh in #35211
- change bnb tests by @jiqing-feng in #34713
- [Whisper] fix docstrings typo by @eustlb in #35319
- feat: add benchmarks_entrypoint.py by @McPatate in #34495
- Fix documentation for ColPali by @tonywu71 in #35321
- Update comment CI bot by @ydshieh in #35323
- PaliGemma: Make sure to add to suffix if is present in text by @probicheaux in #35201
- Fix some fa2 tests by @ArthurZucker in #35340
- Modernbert Release Fixes by @warner-benjamin in #35344
- [docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
- fix onnx export of speech foundation models by @nikosanto13 in #34224
- [Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
- Reduce CircleCI usage by @ydshieh in #35355
- Implement AsyncTextIteratorStreamer for asynchronous streaming by @CISC in #34931
- Cleaner attention interfaces by @Cyrilvallez in #35342
- Add Tensor Parallel support for Qwen2VL by @jla524 in #35050
- fix zoedepth initialization error under deepspeed zero3 by @Tavish9 in #35011
- Aurevoir PyTorch 1 by @ydshieh in #35358
- bugfix: torch.export failure caused by _make_causal_mask by @jiwoong-choi in #35291
- update codecarbon by @nhamanasu in #35243
- Update test fetcher when we want to test all by @ArthurZucker in #35364
- Use weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
- Make test_generate_with_static_cache even less flaky by @ydshieh in #34995
- Improve modular transformers documentation by @joelpaulkoch in #35322
- Improved Documentation Of Audio Classification by @Uvi-12 in #35368
- [docs] Follow up register_pipeline by @stevhliu in #35310
- owlvit/2 dynamic input resolution by @bastrob in #34764
- Fix new FA2 if is_causal is passed explicitly by @Cyrilvallez in #35390
- bitsandbytes: simplify 8bit dequantization by @matthewdouglas in #35068
- make LlamaModel._update_causal_mask torch compilable by @winglian in #35187
- Patch GPTNeoX to use adequate FA2 if position_ids is provided by @taha-yassine in #35318
- uniformize kwargs for SAM by @tibor-reiss in #34578
- Deprecate _is_quantized_training_enabled by @MekkCyber in #34991
- Scale loss before backward by @qgallouedec in #35207
- Fix typing in docstring for PaliGemmaProcessor by @alvarobartt in #35278
- Fix : VPTQ test by @MekkCyber in #35394
- add bnb support for Ascend NPU by @statelesshz in #31512
- bugfix Idefics3 processor - handle gracefully cases with text and no images by @mfarre in #35363
- Adding logger.info about update_torch_dtype in some quantizers by @MekkCyber in #35046
- Add compile test for fast image processor by @yonigozlan in #35184
- Disable .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
- enable non-cuda awq model support without modify version by @jiqing-feng in #35334
- [GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
- Drop inplace operation for loss computation with gradient accumulation by @qgallouedec in #35416
- Fix: Rename keyword argument in_channels to num_channels by @ningyuv in #35289
- CLIP conversion script - Change fairseq to OpenAI by @gau-nernst in #35384
- Fix f-string to show ACCELERATE_MIN_VERSION on error by @KSafran in #35189
- Fix model_accepts_loss_kwargs for timm model by @qubvel in #35257
- Update perf_infer_gpu_one.md: fix a typo by @martin0258 in #35441
- Add compute_loss_func to Seq2SeqTrainer by @d223302 in #35136
- Update docs for sdpa_kernel by @jla524 in #35410
- [i18n-ar] Translated file: docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
- [i18n-ar] Translated file: docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
- Update translated docs for sdpa_kernel by @jla524 in #35461
- Reintroduce Python 3.9 support for ModernBERT by @tomaarsen in #35458
- Fix new BNB test failures by @matthewdouglas in #35345
- Fix docs typos. by @zhanluxianshen in #35465
- Fix paligemma warning message by @hiyouga in #35486
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ydshieh
- Fix private forked repo. CI (#35114)
- Change back to Thread for SF conversion (#35236)
- Use rsfE with pytest (#35119)
- Aggeregate test summary files in CircleCI workflow runs (#34989)
- remove benchmark job in push-important-models.yml (#35292)
- Trigger GitHub CI with a comment on PR (#35211)
- Update comment CI bot (#35323)
- Reduce CircleCI usage (#35355)
- Aurevoir PyTorch 1 (#35358)
- Use weights_only=True with torch.load for transfo_xl (#35241)
- Make test_generate_with_static_cache even less flaky (#34995)
- Disable .github/workflows/self-comment-ci.yml for now (#35366)
- @aymeric-roucher
- Add Aria (#34157)
- @NielsRogge
- [I-JEPA] Update docs (#35148)
- Add DINOv2 with registers (#35348)
- @HMJ0628
- [i18n-] Translating agents.md to Chinese (#35139)
- Translating agents_advanced.md to Chinese (#35231)
- Translating "translate perf_infer_gpu_multi.md" to Chinese (#35271)
- @alexrs-cohere
- Add Cohere2 model (#35224)
- Add Cohere2 docs details (#35294)
- @ArthurZucker
- fix modular order (#35297)
- 🚨All attention refactor🚨 (#35235)
- Fix some fa2 tests (#35340)
- Update test fetcher when we want to test all (#35364)
- @tonywu71
- Add ColPali to 🤗 transformers (#33736)
- Fix documentation for ColPali (#35321)
- @OmarManzoor
- Add sdpa for Beit (#34941)
- @fabianlim
- Add the Bamba Model (#34982)
- @warner-benjamin
- Add ModernBERT to Transformers (#35158)
- Modernbert Release Fixes (#35344)
- @wejoncy
- FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
- @bastrob
- owlvit/2 dynamic input resolution (#34764)
- @BlackSamorez
- HIGGS Quantization Support (#34997)
- Python
Published by LysandreJik over 1 year ago
transformers - v4.47.1
Patch release v4.47.1
We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!
- Fix GA loss bugs and add unit test (#35121) Contributed by @techkang and @ArthurZucker.
- Fix num_items_in_batch not being an integer (#35115) Contributed by @xspirus.
- Fix FSDP no longer working (#35212) Contributed by @muellerzr.
- Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212) Contributed by @winglian.
- Only import torch.distributed if it is available (#35133) Contributed by @GaetanLepage.
- [Whisper] Patch float type on MPS (#35295) Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!
- Python
Published by ArthurZucker over 1 year ago
transformers - v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel
New models
PaliGemma-2
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
I-JEPA
The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.
- Add I-JEPA by @jmtzt in #33125
OLMo 2
The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
- RMSNorm is used instead of standard layer norm.
- Norm is applied to attention queries and keys.
- Norm is applied after attention/feedforward layers rather than before.
Commits:
- Add OLMo November 2024 by @2015aroras in #34551
- Rename OLMo November to OLMo2 by @2015aroras in #34864
Layer-Skip Llama
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding. It is capable of self-speculative decoding: decode with the earlier layers and verify with the remaining layers.
- Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240
Tensor Parallel implementation
This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).
The motivation is multi-fold:
- to make the modeling code as simple as in the single-worker case: all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
- to make tensor parallelism easily accessible to users: a model.tensor_parallel(device_mesh) method was added that allows users to turn a single-proc model into a parallel model.
This is the first PR of many to simplify and enable Tensor Parallel across models.
- Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184
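To illustrate the idea behind tensor parallelism (not the torch.distributed.tensor.parallel API itself), the sketch below shards a linear layer's weight matrix by columns across hypothetical workers in plain Python, then checks that concatenating the partial outputs matches the unsharded result. The helper names matmul, split_columns and column_parallel_linear are assumptions made for this example; a real setup would run on several processes and use the model.tensor_parallel(device_mesh) method mentioned above.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split a weight matrix into equal column blocks, one block per worker."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[i:i + step] for row in w] for i in range(0, cols, step)]


def column_parallel_linear(x, w, num_shards):
    """Each simulated worker multiplies by its own column block; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

# The sharded computation reproduces the single-worker result.
assert column_parallel_linear(x, w, num_shards=2) == matmul(x, w)
```

Here each worker only ever holds a slice of the weights, which is the property that lets a model too large for one device be spread over several.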
Farewell, Python 3.8
Python 3.8 has reached end of life, so we are dropping it from our CI.
- Drop support for Python 3.8 by @ydshieh in #34314
GGUF improvements
Several improvements have been made to GGUF support in transformers, notably the addition of new architectures to the list of supported architectures.
- Add T5 GGUF loading support by @junejae in #33389
- Add GGUF for Mamba by @VladOS95-cyber in #34200
- Add Nemotron GGUF Loading Support by @farrosalferro in #34725
- Improve gguf tensor processing by @VladOS95-cyber in #34515
- Fix use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450
Fast processors
We continue the work to improve the speed of fast processors as detailed in this roadmap.
We contribute a fast processor to RT-DETR.
- Add Image Processor Fast RT-DETR by @yonigozlan in #34354
New pipelines
A new pipeline has been added to transformers: image-text-to-text!
The pipeline supports the following inputs:
- unbatched images and text - images=image, text=text
- batched images and text - images = [image, image], text= [text, text]
- several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... ... ...", "... ..."]
- Chat templates (for models supporting them).
- Add image text to text pipeline by @yonigozlan in #34170
Notable refactors
Separate chat templates into a single file
We have had several issues with chat templates because they're stored as single lines in the JSON config files:
- Impossible to review diffs
- Very hard to edit in the web UI (or in general)
- Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
- Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead
The solution:
- Just move chat templates to a single chat_template.jinja file in the repo
- If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
- If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.
For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.
- Separate chat templates into a single file by @Rocketknight1 in #33957
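The precedence rule described above (a chat_template.jinja file overrides the JSON entry) can be sketched as follows. This is an illustrative stand-in that reads a local folder, not the loading code used by the library; the function name resolve_chat_template is an assumption made for this example.

```python
import json
import tempfile
from pathlib import Path


def resolve_chat_template(repo_dir):
    """Return the chat template for a local repo folder, preferring the Jinja file."""
    repo = Path(repo_dir)
    jinja_file = repo / "chat_template.jinja"
    if jinja_file.exists():
        # A raw Jinja file overrides any template stored in the JSON config.
        return jinja_file.read_text(encoding="utf-8")
    config_file = repo / "tokenizer_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        return config.get("chat_template")
    return None


# Demo on a throwaway folder that contains both formats.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "tokenizer_config.json").write_text(
        json.dumps({"chat_template": "{{ json_template }}"}), encoding="utf-8"
    )
    (repo / "chat_template.jinja").write_text("{{ jinja_template }}", encoding="utf-8")
    print(resolve_chat_template(repo))  # the Jinja file wins
```

Keeping the template in its own file is what makes diffs reviewable, since the Jinja source is no longer squeezed into a single JSON string.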
Large modular logic refactor
This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:
- visit all the modular file (record imports/functions/classes/assignments nodes)
- create function dependency mapping
- for each import coming from another model:
  - visit the corresponding file
  - create function dependency mapping
  - update mapping with function/assignment from the modular (updated/new functions)
  - create the class dependency graph based on merged dependencies
- update dependency graph of the modular with the functions and assignments imported from the other files
- for each class recorded in the modular:
  - if inheriting from class in another file:
    - replace call to super
    - find the dependencies after the node was replaced
    - follow (updated with modular defs) dependency mapping to add all nodes
  - else:
    - only add needed imported functions (and their dependencies)
- determine the needed imports and add them
- Large modular logic refactoring by @Cyrilvallez in #34487
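The "create function dependency mapping" and "follow dependency mapping to add all nodes" steps above can be sketched roughly as below. This is a simplified illustration that uses the standard-library ast module as a stand-in; it is not the converter's actual implementation, and the names build_dependency_mapping, collect_dependencies and the toy SOURCE are assumptions made for this example.

```python
import ast


def build_dependency_mapping(source):
    """Map each top-level function/class name to the other top-level names it references."""
    tree = ast.parse(source)
    top_level = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }
    mapping = {}
    for name, node in top_level.items():
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        mapping[name] = (used & set(top_level)) - {name}
    return mapping


def collect_dependencies(name, mapping):
    """Follow the mapping transitively to find every definition that `name` needs."""
    needed, stack = set(), [name]
    while stack:
        for dep in mapping.get(stack.pop(), ()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed


SOURCE = '''
def rotate_half(x):
    return x

def apply_rope(q):
    return rotate_half(q)

def unused_helper():
    return 0

class Attention:
    def forward(self, q):
        return apply_rope(q)
'''

mapping = build_dependency_mapping(SOURCE)
print(sorted(collect_dependencies("Attention", mapping)))  # ['apply_rope', 'rotate_half']
```

Only the definitions reachable from the class are collected, so unused_helper is left out, which mirrors the "only add needed imported functions (and their dependencies)" step.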
Community bugfixes and improvements
- Remove graph breaks for torch.compile() in flash_attention_forward when Lllama Model is padding free tuned by @Abhishek-TAMU in #33932
- Better defaults by @ArthurZucker in #34026
- translated gguf.md into chinese by @blueingman in #34163
- CI: fix failures by @zucchini-nlp in #34371
- Zamba is an LM by @LysandreJik in #34342
- add code generation to natural language processing section by @furtnerthomas in #34333
- Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
- Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
- refactor: remove redundant if-condition and improve type correctness for convert_tokens_to_ids by @winstxnhdw in #34030
- Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
- [PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
- Fix torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
- Correct the new defaults by @Cyrilvallez in #34377
- [auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
- Fix glm by @Cyrilvallez in #34388
- Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
- Fix onnx non-expotable inplace aten op by @IlyasMoutawwakil in #34376
- Fix right padding in LLaVA models by @zucchini-nlp in #34305
- no filter by @ydshieh in #34391
- SynthID: better example by @gante in #34372
- Tests: upgrade test_eager_matches_sdpa_generate by @gante in #34386
- Fix bnb training test failure by @matthewdouglas in #34414
- Avoid check expected exception when it is on CUDA by @ydshieh in #34408
- Fix typos in agents_advanced.md by @rudydel in #34405
- [docs] Cache implementations by @stevhliu in #34325
- Fix pix2struct by @IlyasMoutawwakil in #34374
- pin tensorflow_probability<0.22 in docker files by @ydshieh in #34381
- Tiny update after #34383 by @ydshieh in #34404
- Fix batch size handling in prediction_loop for DataLoaderShard by @zeus2611 in #34343
- exclude fsdp from delay_optimizer_creation by @eljandoubi in #34140
- New option called "best" for args.save_strategy. by @seanswyi in #31817
- [docs] update input documentation for MAMBA2 and MISTRAL models to include cache_position and attention_mask details by @h3110Fr13nd in #34322
- 🌐 [i18n-KO] Translated model_doc/barthez.md to Korean by @Jwaminju in #33980
- Apply linting to the important code blocks to make it readable by @ShubhamJagtap2000 in #34449
- Torchao weights only + prequantized compability by @SunMarc in #34355
- [i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
- enable average tokens across devices by @techkang in #34373
- feat: run benchmarks on A100 by @McPatate in #34287
- Add post_process_depth_estimation for GLPN by @alex-bene in #34413
- LLaVA: latency issues by @zucchini-nlp in #34460
- Generation: fix test by @zucchini-nlp in #34369
- Fix CI by @zucchini-nlp in #34458
- use a tinymodel to test generation config which aviod timeout by @techkang in #34482
- 🚨🚨🚨 [SuperPoint] Fix keypoint coordinate output and add post processing by @sbucaille in #33200
- Simplify running tests in a subprocess by @ydshieh in #34213
- Fix perplexity computation in perplexity.md by @Framartin in #34387
- Fixes for Modular Converter on Windows by @hlky in #34266
- Fix regression loading dtype by @SunMarc in #34409
- Bert is ExecuTorch compatible by @guangy10 in #34424
- manual head_dim for mixtral model by @wavy-jung in #34281
- fix-qwen2vl-no-position_ids by @simonJJJ in #33487
- Bug fix for drop path decay rate in swin transformer by @abhi-glitchhg in #34291
- MobileBERT is ExecuTorch compatible by @guangy10 in #34473
- Albert is ExecuTorch compatible by @guangy10 in #34476
- Adding optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
- Fix performance in get_imports regexp by @AlekseyLobanov in #34298
- fix incorrect warning by @yonigozlan in #34416
- Un-deprecate timeout arg in pipelines by @Rocketknight1 in #34382
- Roberta is ExecuTorch compatible by @guangy10 in #34425
- Fix format mistake in string repr of tokenizer objects by @gpetho in #34493
- Mllama: update docs by @zucchini-nlp in #34334
- VLMs: fix number of image tokens by @zucchini-nlp in #34332
- Tests: move generate tests to the right mixin and delete redundant tests by @gante in #34464
- fix pixtral processor by @molbap in #34486
- Use torch 2.5 in scheduled CI by @ydshieh in #34465
- Fix super tiny extra space typo by @fzyzcjy in #34440
- UPDATE Documentation for #TRANSLATING.md Documentation into Multiple Languages.(Changes made) by @anshumangahlot in #34226
- enable QA bf16 pipeline by @jiqing-feng in #34483
- Fix: img size mismatch caused by incorrect unpadding in LLaVA-Next by @jp1924 in #34522
- Fix step shifting when accumulate gradient by @kibitzing in #33673
- avoid calling gc.collect and cuda.empty_cache by @ydshieh in #34514
- Qwen2VL: skip base input_ids-inputs_embeds equivalence check by @gante in #34535
- fix(DPT,Depth-Anything) Address expected_slice errors inside inference tests by @philkuz in #34518
- feat: add benchmarks pg indexes by @McPatate in #34536
- make test_eager_matches_sdpa_inference less flaky by @ydshieh in #34512
- Bug Fix for issue #34294 by @fpgaminer in #34295
- [CLIPSeg] Make interpolate_pos_encoding default to True by @NielsRogge in #34419
- update doc by @jiqing-feng in #34478
- [i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
- Blip: get/set input embeddings correctly by @zucchini-nlp in #34152
- BLIP: enable generation tests by @zucchini-nlp in #34174
- :red_circle: :red_circle: fix query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
- [i18n-HI] Translated accelerate page to Hindi by @karthik-script in #34443
- Update trainer for easier handling of accumulate, compile fixes, and proper reporting by @muellerzr in #34511
- VLM: special multimodal Tokenizer by @zucchini-nlp in #34461
- MPS: isin_mps_friendly can support 0D tensors by @gante in #34538
- Add text support to the Trainer's TensorBoard integration by @JacobLinCool in #34418
- [i18n-HI] Translated TFLite page to Hindi by @karthik-script in #34572
- 🌐 [i18n-KO] Translated perf_train_special.md to Korean by @maximizemaxwell in #34590
- 🌐 [i18n-KO] Update README_ko.md by @J4BEZ in #33098
- fix TrainerState doc because num_input_tokens_seen is unused by defau… by @techkang in #34593
- Fix Whisper CI by @ydshieh in #34541
- Skip DeepSpeed ZeRO Stage 3 model initialization when bnb by @eljandoubi in #34395
- FIX: Broken repr of TorchAoConfig by @BenjaminBossan in #34560
- Load sub-configs from composite configs by @zucchini-nlp in #34410
- DistilBERT is ExecuTorch compatible by @guangy10 in #34475
- Remove unused test_dataset by @thisisiron in #34516
- Revert "Fix Whisper CI" by @ydshieh in #34605
- Fix #34494 assistant tokens when truncated by @yonigottesman in #34531
- Remove @slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
- Changing repr in torchao to show quantized Linear by @MekkCyber in #34202
- Fix torchvision interpolation CI by @yonigozlan in #34539
- 🌐 [i18n-KO] Translated convbert.md to Korean by @ahnjj in #34599
- fix(dvclive): pass fake dataset to avoid exception in trainer init by @shcheklein in #34455
- 🌐 [i18n-KO] Translated timesformer.md to Korean by @mreraser in #33972
- 🌐 [i18n-KO] Translated bert.md to Korean by @maximizemaxwell in #34627
- [i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
- Update llm_engine.py by @louisbrulenaudet in #33332
- Agents: turn any Space into a Tool with Tool.from_space() by @aymeric-roucher in #34561
- [docs] update not-working model revision by @faaany in #34682
- [i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
- Agents: Small fixes in streaming to gradio + add tests by @aymeric-roucher in #34549
- 🌐 [i18n-KO] Translated marian.md to Korean by @maximizemaxwell in #34698
- [docs] Broken link in generation_strategies by @pcuenca in #34717
- Fix example in EsmConfig docstring by @yuanx749 in #34653
- [docs] add xpu device check by @faaany in #34684
- Retain newlines in chat template when continue_final_message=True by @lewtun in #34253
- Update llava.md by @LysandreJik in #34749
- fix(wandb): pass fake dataset to avoid exception in trainer (see #34455) by @CezaPasc in #34720
- add xpu path for awq by @jiqing-feng in #34712
- FSDP grad accum fix by @winglian in #34645
- Remove FSDP wrapping from sub-models. by @eljandoubi in #34452
- 🧼 remove v4.44 deprecations by @gante in #34245
- VLMs: patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
- Fix broken link by @ofek in #34618
- fix a typo bug where 'id2label' was incorrectly written as 'i2label' when reading config by @ZuoChenFttS in #34637
- Fix skip of test_training_gradient_checkpointing by @dvrogozh in #34723
- make sure to disable gradients for integer tensor by @winglian in #32943
- [docs] make empty_cache device-agnostic by @faaany in #34774
- [docs] add XPU besides CUDA, MPS etc. by @faaany in #34777
- [tests] add XPU part to testing by @faaany in #34778
- fix: Update pixel_values parameter in hf_model input by @thisisiron in #34782
- Fix callback key name by @jung-hunsoo in #34762
- fix: Wrong task mentioned in docs by @ecyht2 in #34757
- Allow handling files as args for a tool created with Tool.from_space by @aymeric-roucher in #34687
- Fix Whisper CI by @ydshieh in #34617
- protect tensor parallel usage by @ArthurZucker in #34800
- Trainer hyperparameter search kwargs docs update by @GuillemGSubies in #34459
- feat: allow to use hf-hub models for timm backbone by @cgebbe in #34729
- Support gradient checkpointing in Qwen2VL ViT by @li-plus in #34724
- Fix: siglip image processor rgb_convert is not being applied correctly. by @jp1924 in #34301
- fix cpu bnb path by @jiqing-feng in #34647
- Gemma capping by @ArthurZucker in #34282
- Fix cache_utils for optimum.quanto kvcache quantization by @SunMarc in #34750
- Modular fix by @Cyrilvallez in #34802
- MLU devices : Checks if mlu is available via an cndev-based check which won't trigger the drivers and leave mlu by @huismiling in #34326
- 🚨🚨🚨 fix(Mask2Former): torch export 🚨🚨🚨 by @philkuz in #34393
- Feature: print tokens per second during training by @tibor-reiss in #34507
- Add do_convert_rgb to vit by @jp1924 in #34523
- Fix post process function called in the instance segmentation example of mask2former by @OnTheThirdDay in #34588
- fix crash in tiiuae/falcon-11B-vlm image-to-text generation by @sywangyi in #34728
- Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline by @yonigozlan in #34562
- Add Image Processor Fast Deformable DETR by @yonigozlan in #34353
- Run test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
- Fix check_training_gradient_checkpointing by @ydshieh in #34806
- Added image-text-to-text pipeline to task guide by @merveenoyan in #34783
- Translate attention.md into Chinese by @wwwbai in #34716
- LLaVA OV: fix unpadding precision by @zucchini-nlp in #34779
- Fix low memory beam search by @zucchini-nlp in #34746
- Fix the memory usage issue of logits in generate() by @kjohew in #34813
- fix(DPT,Depth-Anything) torch.export by @philkuz in #34103
- Fix: take into account meta device by @tibor-reiss in #34134
- Fix hyperparameter search when optuna+deepseed by @corentin-ryr in #34642
- Fix CI by tweaking torchao tests by @SunMarc in #34832
- Fix CI slack reporting issue by @ydshieh in #34833
- VLMs: enable generation tests - last batch by @zucchini-nlp in #34484
- Change logging level from warning to info for max_steps overriding num_train_epochs by @qgallouedec in #34810
- Fix ds nvme by @eljandoubi in #34444
- Fix heuristic scheduling for UAG by @jmamou in #34805
- Refactor StarCoder2 using modular by @Cyrilvallez in #34015
- Watermarking: fix order by @zucchini-nlp in #34849
- Update checks for torch.distributed.tensor to require torch >= 2.5 by @loadams in #34816
- Remove quantization related config from dequantized model by @konradkalita in #34856
- Auto compile when static cache by @ArthurZucker in #34247
- Speculative decoding: Test the target distribution (to prevent issues like #32867) by @keyboardAnt in #34553
- smol improvements to support more flexible usage by @andimarafioti in #34857
- [CI] Skip EETQ tests while package is broken with latest transformers by @BenjaminBossan in #34854
- Bitnet test fix to avoid using gated model by @MekkCyber in #34863
- Fix support for image processors modifications in modular by @yonigozlan in #34866
- Fix: Enable prefill phase key value caching of nemotron/minitron models by @jeongin601 in #34742
- Add safe_globals to resume training on PyTorch 2.6 by @dvrogozh in #34632
- Cache: init empty cache when `use_cache` by @zucchini-nlp in #34274
- BLIP: fix generation after hub update by @zucchini-nlp in #34876
- [`Deberta/Deberta-v2`] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
- 🔴 Mllama: fix base prefix by @zucchini-nlp in #34874
- Sum gathered input tokens by @techkang in #34554
- allow unused input parameters passthrough when chunking in asr pipelines by @VictorAtIfInsurance in #33889
- prepare_fa2_from_position_ids function bugfix by @meliksahturker in #33269
- chore: fix some typos by @wanxiangchwng in #34891
- Fix convert_tokens_to_string when decoder is None by @dszeto in #34569
- [`peft`] Given that `self.active_adapter` is deprecated, avoid using it by @tomaarsen in #34804
- Fix Qwen2 failing tests by @jla524 in #34819
- Fix : BitNet tests by @MekkCyber in #34895
- [AWQ, CI] Bump AWQ version used in docker image by @BenjaminBossan in #34922
- fix static cache data type miss-match by @jiqing-feng in #34799
- Fix `test_auto_backbone_timm_model_from_pretrained` by @ydshieh in #34877
- Upgrade torch version to 2.5 in dockerfile for quantization CI by @MekkCyber in #34924
- Fix failling GGML test by @MekkCyber in #34871
- Updated documentation and added conversion utility by @ViktorooReps in #34319
- making gpt2 fx traceable by @xuzifei-dmatrix in #34633
- Fix import structure for Fast Image processors by @yonigozlan in #34859
- VideoLLaVA: add default values by @zucchini-nlp in #34916
- Skipping aqlm non working inference tests till fix merged by @MekkCyber in #34865
- [Whisper] Fix whisper integration tests by @eustlb in #34111
- Add Pytorch Tensor Parallel support for Mistral by @VladOS95-cyber in #34927
- change apply_rotary_pos_emb of Glmmodel for GLM-Edge Series model by @zRzRzRzRzRzRzR in #34629
- Fix torch.onnx.export of Qwen2-VL vision encoder by @xenova in #34852
- Update the Python version in the Chinese README to match the English README. by @vansin in #34870
- [i18n-ar] Translated file : `docs/source/ar/benchmarks.md` into Arabic by @AhmedAlmaghz in #33023
- [docs] use device-agnostic API instead of cuda by @faaany in #34913
- [doc] use full path for run_qa.py by @faaany in #34914
- docs: HUGGINGFACE_HUB_CACHE -> HF_HUB_CACHE by @imba-tjd in #34904
- [i18n-zh]Translated tiktoken.md into chinese by @blueingman in #34936
- [`FlexAttention`] Update gemma2 by @ArthurZucker in #34942
- Fix : Add PEFT from source to CI docker by @MekkCyber in #34969
- Avoid calling `get_max_length` by @ydshieh in #34971
- Fix flaky test execution caused by `Thread` by @ydshieh in #34966
- 🌐 [i18n-KO] Translated encoder-decoder.md to Korean by @maximizemaxwell in #34880
- [docs] add explanation to `release_memory()` by @faaany in #34911
- [i18n-zh]Translated perf_train_special.md into Chinese by @blueingman in #34948
- Fix typo in code block in vipllava.md by @yuanx749 in #34957
- Fixed typo in `VisitWebpageTool` by @sergiopaniego in #34978
- [PEFT] Set eval mode when loading PEFT adapter by @BenjaminBossan in #34509
- Fix `save_pretrained` for partially offloaded models by @kylesayrs in #34890
- 🚨🚨🚨 Changed DINOv2Config default patch size to 14 by @OFSkean in #34568
- Refine the code of Universal Assisted Generation by @xinpengzz in #34823
- Allow compressed-tensors quantized model to be trained by @horheynm in #34520
- Offloaded cache: fix generate by @zucchini-nlp in #34921
- Fix `utils/check_bad_commit.py` (for auto ping in CI) by @ydshieh in #34943
- Add optimized `PixtralImageProcessorFast` by @mgoin in #34836
- Improve `.from_pretrained` type annotations by @qubvel in #34973
- Fix docker CI : install autogptq from source by @MekkCyber in #35000
- Let server decide default repo visibility by @Wauplin in #34999
- 🚨🚨🚨 Uniformize kwargs for TrOCR Processor by @tibor-reiss in #34587
- Update timm version by @qubvel in #35005
- fix: double verbs by @SamuelLarkin in #35008
- Update `FillMaskPipeline.__call__` signature and docstring by @alvarobartt in #35006
- Only cast `cu_seqlens` when tracing by @xenova in #35016
- fix variable undefined bug when return_tensors is not specified in llava processing by @chenweize1998 in #34953
- Optimize memory usage of mllama encoder by @milesial in #34930
- Typo in warning switching to optimum-quanto by @Bojun-Feng in #35028
- Add type hints for forward functions in Gemma2 by @jla524 in #35034
- Fix `test_eager_matches_sdpa_inference` for `XPU` backend by @dvrogozh in #34889
- Multiple typo fixes in Tutorials docs by @henryhmko in #35035
- add docstring example for compute_loss_func by @secrettoad in #35020
- [i18n-ar] Translated file : `docs/source/ar/notebooks.md` into Arabic by @AhmedAlmaghz in #33049
- [docs] add the missing import for Image and bug fix by @faaany in #34776
- Translate bertlogy.md into Chinese by @wwwbai in #34908
- Automatic compilation in generate: do not rely on inner function by @Cyrilvallez in #34923
- Add token cost + runtime monitoring to Agent and HfEngine children by @aymeric-roucher in #34548
- Fix `BertGeneration` by @ydshieh in #35043
- fix speecht5 failure issue in test_peft_gradient_checkpointing_enable… by @sywangyi in #34454
- [docs] fix example code bug by @faaany in #35054
- Translate community.md into Chinese by @wwwbai in #35013
- [docs] use device-agnostic instead of `cuda` by @faaany in #35047
- [docs] use device-agnostic API instead of hard-coded cuda by @faaany in #35048
- Fix `pad_token_tensor` is None in warning by @tshu-w in #34005
- Add Pytorch Tensor Parallel support for Qwen2, Qwen2Moe, Starcoder2 by @VladOS95-cyber in #35007
- [`GPTNeoX`] Flex Attention + Refactor by @vasqu in #34896
- Support for easier multimodal use of modular by @Cyrilvallez in #35056
- [docs] add a comment that offloading requires CUDA GPU by @faaany in #35055
- [docs] Increase visibility of torch_dtype="auto" by @stevhliu in #35067
- Informative by @ydshieh in #35059
- [Whisper] Fix whisper tokenizer by @eustlb in #34537
- [`tokenizers`] bump to 0.21 by @ArthurZucker in #34972
- Update Mistral conversion script by @Cyrilvallez in #34829
- Fix `tie_word_embeddings` handling for GGUF models by @Isotr0py in #35085
- Deprecate quanto and switch to optimum-quanto by @MekkCyber in #35001
- BLIP: this is correct now by @zucchini-nlp in #35081
- [`trainer`] fix the GA `model_accepts_loss_kwargs` by @ArthurZucker in #34915
- Fix flaky Hub CI (`test_trainer.py`) by @ydshieh in #35062
- Adaptive dynamic number of speculative tokens by @jmamou in #34156
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @AhmedAlmaghz
  - [i18n-ar] Translated file : `docs/source/ar/fast_tokenizers.md` into Arabic (#33034)
  - [i18n-ar] Translated file : `docs/source/ar/multilingual.md` into Arabic (#33048)
  - [i18n-ar] Translated file : `docs/source/ar/trainer.md` into Arabic (#33080)
  - [i18n-ar] Translated file : `docs/source/ar/torchscript.md` into Arabic (#33079)
  - [i18n-ar] Translated file : `docs/source/ar/benchmarks.md` into Arabic (#33023)
- @maximizemaxwell
  - 🌐 [i18n-KO] Translated perf_train_special.md to Korean (#34590)
  - 🌐 [i18n-KO] Translated bert.md to Korean (#34627)
  - 🌐 [i18n-KO] Translated marian.md to Korean (#34698)
  - 🌐 [i18n-KO] Translated encoder-decoder.md to Korean (#34880)
- @2015aroras
  - Add OLMo November 2024 (#34551)
  - Rename OLMo November to OLMo2 (#34864)
- @mgoin
  - Add optimized `PixtralImageProcessorFast` (#34836)
- Python
Published by LysandreJik over 1 year ago
transformers - Patch release v4.46.3
One small fix for FSDP + gradient accumulation loss issue!
- FSDP grad accum fix, #34645 by @winglian
- Python
Published by ArthurZucker over 1 year ago
transformers - Patch release v4.46.2
Patch release v4.46.2
Mostly had to finish the gradient accumulation fix! Thanks to @techkang and @Ryukijano 🤗
- VLMs: fix number of image tokens (#34332) by @zucchini-nlp
- fix pixtral processor (#34486) by @molbap
- enable average tokens across devices (#34373) by @techkang and @muellerzr
- Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
- MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante
- Python
Published by ArthurZucker over 1 year ago
transformers - Patch release v4.46.1
Patch release v4.46.1
This is mostly for fx and onnx issues!
- Fix regression loading dtype #34409 by @SunMarc
- LLaVa: latency issues #34460 by @zucchini-nlp
- Fix pix2struct #34374 by @IlyasMoutawwakil
- Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
- Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun
- Python
Published by ArthurZucker over 1 year ago
transformers - Release v4.46.0
New model additions
Moshi
The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.
- Moshi integration by @ylacombe in #33624
Zamba
Zamba-7B-v1 is a hybrid between state-space models (specifically Mamba) and transformers, trained using next-token prediction. Zamba uses a shared transformer layer after every 6 Mamba blocks, and it uses the Mistral v0.1 tokenizer. The authors arrived at this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.
- Add Zamba by @pglorio in #30950
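The layer pattern described above can be sketched in a few lines. This is only an illustration of "one shared transformer layer after every 6 Mamba blocks"; the function name `zamba_style_layout` and the choice of 12 Mamba blocks are hypothetical and do not reflect the real model's depth or implementation.

```python
def zamba_style_layout(num_mamba_blocks: int, period: int = 6) -> list[str]:
    """List block names, inserting a shared transformer layer after every `period` Mamba blocks."""
    layout = []
    for i in range(1, num_mamba_blocks + 1):
        layout.append(f"mamba_{i}")
        if i % period == 0:
            # The same name is reused on purpose: the transformer layer's
            # weights are shared across all insertion points.
            layout.append("shared_transformer")
    return layout


# Hypothetical depth of 12 Mamba blocks, chosen only to keep the output short.
layout = zamba_style_layout(num_mamba_blocks=12)
print(layout)
```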
GLM
The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team, THUDM & ZhipuAI.
The abstract from the paper starts with the following:
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.
- add Glm by @Cyrilvallez in #33823
Idefics 3
The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.
Idefics3 is an adaptation of the Idefics2 model with three main differences:
- It uses Llama3 for the text model.
- It uses an updated processing logic for the images.
- It removes the perceiver.
- Add Idefics 3! by @andimarafioti in #32473
PhiMoE
The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.
This model is very similar to Mixtral; the main difference is Phi3LongRoPEScaledRotaryEmbedding, which is used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP’s up and gate projection layers are also fused.
- PhiMoE by @garg-amit in #33363
Watermarking
This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)
```
Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor
Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector
- Add SynthID (watermarking by Google DeepMind) by @gante in #34350
Quantization
BitNet
BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version).
- FEAT : Adding BitNet quantization method to HFQuantizer by @MekkCyber in #33410
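To make the 1.58-bit figure concrete: a parameter restricted to three values carries log2(3) ≈ 1.58 bits of information. Below is a minimal pure-Python sketch of ternary rounding, assuming the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, round, clip to [-1, 1]). The function name `absmean_ternary` and the sample weights are hypothetical, and this is not the code path used by the quantizer added in #33410.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} using an absmean scale (illustrative only)."""
    # Scale factor: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round the scaled weight to the nearest integer, then clip to the ternary range.
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


# Three possible values per parameter -> log2(3) bits of information.
bits_per_param = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]  # hypothetical sample weights
q, scale = absmean_ternary(weights)
print(f"{bits_per_param:.2f} bits/param, scale={scale:.3f}, ternary={q}")
```

Small weights collapse to 0 while larger ones keep only their sign, which is why the scale factor has to be stored alongside the ternary values.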
GGUF loading in transformers
More architectures are now supported in our GGUF loader; GGUF files saved with these architectures can now be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize the models after further training has been done.
- Add gguf support for bloom by @VladOS95-cyber in #33473
- Add falcon gguf by @g-prz in #33437
- Add gguf support for StableLM by @VladOS95-cyber in #33793
- Add gguf support for gpt2 by @VladOS95-cyber in #34044
- Add GGUF for starcoder2 by @VladOS95-cyber in #34094
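As a rough sketch of what loading a GGUF checkpoint looks like, the helper below passes the `gguf_file` argument to `from_pretrained`. The helper name `load_gguf_checkpoint` is hypothetical, the repository id and file name are left as parameters because no specific checkpoint is named here, and the `gguf` Python package is assumed to be installed alongside transformers.

```python
def load_gguf_checkpoint(repo_id: str, gguf_filename: str):
    """Load a tokenizer and model from a GGUF file in a Hub repo or local directory."""
    # Imported lazily so that defining this helper does not itself require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # `gguf_file` selects the GGUF file inside the repo; the weights are
    # converted to regular PyTorch tensors so the model can be fine-tuned.
    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_filename)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_filename)
    return tokenizer, model
```

Calling it downloads the checkpoint, so it is left uncalled here; after further training, requantization would be done with llama.cpp tooling as recommended above.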
Notable improvements and additions
Pipeline API synchronisation
We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.
- Sync video classification pipeline with huggingface_hub spec by @Rocketknight1 in #34288
- Image pipelines spec compliance by @Rocketknight1 in #33899
- Make ASR pipeline compliant with Hub spec + add tests by @Rocketknight1 in #33769
- Cleanup return_text and return_full_text options in TextGenerationPipeline by @Rocketknight1 in #33542
- Make audio classification pipeline spec-compliant and add test by @Rocketknight1 in #33730
- Sync QuestionAnsweringPipeline by @Rocketknight1 in #34039
Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!
- Make `pipeline` able to load `processor` by @qubvel in #32514
Executorch compatibility
ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.
We are collaborating with the executorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.
- Generate using exported model and enable gemma2-2b in ExecuTorch by @guangy10 in #33707
- Qwen2.5 is ExecuTorch Compatible by @guangy10 in #34102
- Olmo is ExecuTorch Compatible by @guangy10 in #34181
- Llama3 and Llama2 are ExecuTorch compatible by @guangy10 in #34101
Gradient accumulation bugfix
- Fix Gradient Accumulation issue by @ArthurZucker in #34191
- Enable users to use their own loss functions + deal with prefetching for grad accum by @muellerzr in #34198
- Enable Gradient Accumulation fix across all models + trainer fully in forward() by @muellerzr in #34283
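The PR titles above do not spell out the bug. The sketch below assumes it is the loss-normalization discrepancy commonly associated with gradient accumulation: averaging each micro-batch's loss over its own tokens and then averaging those results does not match a single average over all tokens when micro-batches contain different numbers of tokens. The function names and the per-token loss values are hypothetical and are not taken from the Trainer code.

```python
def mean_of_batch_means(batches):
    """Average each micro-batch on its own, then average the per-batch results."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


def global_token_mean(batches):
    """Average over every token across all micro-batches at once."""
    total_loss = sum(sum(b) for b in batches)
    total_tokens = sum(len(b) for b in batches)
    return total_loss / total_tokens


# Hypothetical per-token losses: one micro-batch with 4 tokens, one with 1 token.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [4.0]]

naive = mean_of_batch_means(micro_batches)
expected = global_token_mean(micro_batches)
print(f"mean of batch means: {naive}, global token mean: {expected}")
```

In the naive version the single token in the short micro-batch counts as much as the four tokens in the long one, so the accumulated loss no longer matches what one large batch would produce.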
Bugfixes and improvements
- adding positional encoder changes and tests by @manuelsh in #32600
- Uniformize kwargs for chameleon processor by @leloykun in #32181
- [`MllamaProcessor`] Update errors and API with multiple image by @ArthurZucker in #33715
- fix: use correct var names for check_tokenizers script by @niqodea in #33702
- Fix docs and docstrings Omdet-Turbo by @yonigozlan in #33726
- Fix position embeddings singular/plural by @molbap in #33678
- Generate: `can_generate()` recursive check by @gante in #33718
- clean_up_tokenization_spaces=False if unset by @itazap in #31938
- fix: add docstring for `image_size` in Convnextv2 config by @lucianosrp in #33734
- Fix modular model converter unable to generate Processor classes by @tonywu71 in #33737
- fix trainer tr_loss add error by @Wang-Xiaodong1899 in #33651
- Update Albumentations Versions by @vasqu in #33704
- Doc and config mismatch for DeBERTa by @fkrasnov2 in #33713
- [`clean_up_tokenization_spaces`] Pl bart was failing, updating by @ArthurZucker in #33735
- [`MllamaImageProcessing`] Update doc by @ArthurZucker in #33747
- Make siglip examples clearer and error free by @jbn in #33667
- Paligemma support for multi-image by @zucchini-nlp in #33447
- remove warning v2 by @itazap in #33761
- Model addition timeline by @LysandreJik in #33762
- Fix typing in `load_balancing_loss_func` function of `modeling_mixtral.py`. by @PhilipMay in #33641
- Enable non-safetensor ser/deser for TorchAoConfig quantized model 🔴 by @jerryzh168 in #33456
- Fix typo in documentation by @qgallouedec in #33805
- Hqq serialization by @mobicham in #33141
- Add Slow CI reminder bot by @ydshieh in #33506
- [`modular`] fixes! by @ArthurZucker in #33820
- Fix ViT-MAE decoder interpolate by @xenova in #33330
- Fixes for issue #33763 in idefics2 model by @aroun-coumar in #33766
- Fix link in gguf.md by @pogpog in #33768
- minor typo fix by @a-r-r-o-w in #33784
- Fix Mamba slow path bug with dtype mismatch. by @Adibvafa in #32691
- Fix passing str dtype to static cache by @guangy10 in #33741
- fix check for hidden size in text model for deepspeed zero3 auto entries by @winglian in #33829
- post reminder comment only once by @ydshieh in #33848
- Generate: move llama `prepare_inputs_for_generation` to `GenerationMixin` by @gante in #33677
- Refactor image features selection in LlaVa by @kenza-bouzid in #33696
- fix: skip dropout in eval for flash_attn in various models by @fdschmidt93 in #33844
- add attention weight up-cast to float32 in chameleon by @francescortu in #33822
- Workaround for bark issue in pipelines by @Rocketknight1 in #33824
- Fix device mismatch errors by @zucchini-nlp in #33851
- This PR contains additional changes for #33143 by @aroun-coumar in #33581
- Raise `accelerate` dependency error in case of defaulting `low_cpu_mem_usage=True` by @kylesayrs in #33830
- Validate the eval dataset in advance. by @jackyjinjing in #33743
- Add include_loss_for_metrics by @Manalelaidouni in #33088
- Avoid using context that is not accessable from external contributors by @ydshieh in #33866
- fix: repair depth estimation multiprocessing by @niqodea in #33759
- Move weight initilization deformabledetr by @g-prz in #33339
- [Fix] ViViT interpolate_pos_encoding by @RUFFY-369 in #33815
- Repo consistency fix after #33339 by @amyeroberts in #33873
- Add support for custom inputs and batched inputs in ProcessorTesterMixin by @yonigozlan in #33711
- Fix: typo by @TrickEye in #33880
- Uniformize model processors by @molbap in #31368
- Don't run reminder bot for now by @ydshieh in #33883
- populate quantization_config for kv-cache-scheme only configs by @horheynm in #33874
- Allow for nightly packages of `compressed_tensors` by @kylesayrs in #33828
- Fix kwargs passed by AutoQuantizationConfig.from_pretrained by @kylesayrs in #33798
- Add sdpa for DistilBert by @OmarManzoor in #33724
- Trainer - deprecate tokenizer for processing_class by @amyeroberts in #32385
- [Quantization] Switch to optimum-quanto by @SunMarc in #31732
- Optim deformable detr by @yonigozlan in #33600
- Handle Trainer `tokenizer` kwarg deprecation with decorator by @qubvel in #33887
- rename all test_processing_*.py to test_processor_*.py by @yonigozlan in #33878
- uniformize processor Mllama by @yonigozlan in #33876
- Fix dt proj bias reassigned by @HofitBata in #33314
- Update an keyerror on savecheck_point prevent confusion of missing … by @fadingNA in #33832
- VLM Generate: tag `test_static_cache_matches_dynamic` as flaky by @gante in #33630
- Migrate the CI runners to the new clusters by @glegendre01 in #33849
- Fix module initialization for root module under Zero3 by @Ben-Schneider-code in #33632
- Add `SplinterTokenizer` unit test by @ariepratama in #32652
- Generate tests: modality-agnostic input preparation by @gante in #33685
- Fix: use unidic-lite instead of ipadic as the tokenizer dictionary for Japanese by @KanTakahiro in #33372
- [Tests] Diverse Whisper fixes by @ylacombe in #33665
- [PEFT] Support low_cpu_mem_usage option for PEFT loading adapters by @BenjaminBossan in #33725
- add setter for trainer processor by @ArthurZucker in #33911
- Add support for `weights_only` flag when loading state_dict by @jerryzh168 in #32481
- Config: lower `save_pretrained` exception to warning by @gante in #33906
- Uniformize kwargs for Idefics/2 processors by @yonigozlan in #32568
- Remove `logits.float()` by @ringohoffman in #33902
- Minor error condition bug fix by @htahboub in #33781
- Fix distil whisper segment computation by @ylacombe in #33920
- [Doc]: Broken link in Kubernetes doc by @saldanhad in #33879
- [i18n-ru] Fixes typo in the README_ru.md by @Artanias in #33882
- Ignore keys on `validate_rope` by @zucchini-nlp in #33753
- [`PR run-slow`] by @ArthurZucker in #33939
- Add a section on writing tool templates to the chat template docs by @Rocketknight1 in #33924
- Enables CPU AWQ model with IPEX version. by @jiqing-feng in #33460
- 🔴 🚨 Resizing tokens embeddings: initialize from old embeddings' normal distribution. by @abuelnasr0 in #33325
- Removed unnecessary transpose in Switch Transformer Routing by @karan-uppal3 in #33582
- Fix attn mask ignore logic in training-time trace by @zhenglongjiepheonix in #32613
- hot fix `self.position_embeddings->self.position_embedding` by @ArthurZucker in #33958
- fix red check-copies by @ArthurZucker in #33964
- Cache: revert DynamicCache init for BC by @gante in #33861
- Paligemma: fix static cache test by @zucchini-nlp in #33941
- Updating `char_to_token` documentation to note behaviour when `trim_offsets` is True by @Craigacp in #33919
- add test for Jamba with new model jamba-tiny-dev by @yecohn in #33863
- Bug fix gguf qwen2moe by @VladOS95-cyber in #33940
- [`TF`] Fix Tensorflow XLA Generation on limited seq_len models by @vasqu in #33903
- [WIP] Add Tokenizer for MyT5 Model by @tomlimi in #31286
- Add position ids in forward pass to opt model by @avishaiElmakies in #33121
- Flash-attn performance: remove cuda sync during inference by @Cyrilvallez in #33570
- [Docs] Improve VLM docs by @NielsRogge in #33393
- [Docs] Add Developer Guide: How to Hack Any Transformers Model by @MagnusS0 in #33979
- [`Red CIs`] Fix hub failures by @ArthurZucker in #34001
- Fix Tensor + Embedding error in some cases when using SiglipVisionModel by @kaitolucifer in #33994
- properly fix and RUN_SLOW by @ArthurZucker in #33965
- Enable customized optimizer for DeepSpeed by @dataKim1201 in #32049
- [`pytes collection`] Fix flax test collection by @ArthurZucker in #34004
- Fix undefined default_config in configuration_utils.py by @mgoin in #33934
- 🌐 [i18n-KO] Translated `gguf.md` to Korean by @yijun-lee in #33764
- 🌐 [i18n-KO] Translated `swinv2.md` to Korean by @mreraser in #33566
- 🌐 [i18n-KO] Translated `audio_utils.md` to Korean by @yijun-lee in #33802
- 🌐 [i18n-KO] Translated `esm.md` to Korean by @yijun-lee in #33796
- 🌐 [i18n-KO] Translated `time_series_utils.md` to Korean by @yijun-lee in #33806
- 🌐 [i18n-KO] Translated `pipelines_utils.md` to Korean by @yijun-lee in #33809
- 🌐 [i18n-KO] Translated `trainer.md` to Korean by @yijun-lee in #33797
- 🌐 [i18n-KO] Translated `chameleon.md` to Korean by @yijun-lee in #33799
- 🌐 [i18n-KO] Translated `logging.md` to Korean by @chhaewxn in #33543
- 🌐 [i18n-KO] Translated `auto.md` to Korean by @boyunJang in #33590
- 🌐 [i18n-KO] Translated `swin2sr.md` to Korean by @mreraser in #33795
- 🌐 [i18n-KO] Translated `vit.md` to Korean by @mreraser in #33884
- 🌐 [i18n-KO] Translated `gemma.md` to Korean by @yijun-lee in #33936
- Cache: slight change in naming by @zucchini-nlp in #32421
- Add support for `__all__` and potentially deleting functions by @ArthurZucker in #33859
- Processors: don't default padding side by @zucchini-nlp in #33942
- Add auto model for image-text-to-text by @yonigozlan in #32472
- BatchFeature.to() supports non-tensor keys by @Rocketknight1 in #33918
- Improve modular converter by @Cyrilvallez in #33991
- Fixup DeepSpeed things by @muellerzr in #34007
- Fix typing issue by @SunMarc in #34012
- fix awq tests due to ipex backend by @SunMarc in #34011
- Remove `decoder_config=None` by @SunMarc in #34014
- Fix `trainer_seq2seq.py`'s `__init__` type annotations by @benglewis in #34021
- 🌐 [i18n-KO] Translated `feature_extractor.md` to Korean by @yijun-lee in #33775
- 🌐 [i18n-KO] Translated `bertweet.md` to Korean by @ahnjj in #33891
- 🌐 [i18n-KO] Translated `gpt_neox_japanese.md` to Korean by @ahnjj in #33894
- 🌐 [i18n-KO] Translated `rag.md` to Korean by @chhaewxn in #33989
- 🌐 [i18n-KO] Translated `main_classes/quantization.md` to Korean by @fabxoe in #33959
- 🌐 [i18n-KO] Translated `main_classes/configuration.md` to Korean by @fabxoe in #33952
- 🌐 [i18n-KO] Translated `model_doc/mamba.md` to Korean by @fabxoe in #33626
- 🌐 [i18n-KO] Translated `model_doc/autoformer.md` to Korean by @fabxoe in #33574
- 🌐 [i18n-KO] Translated `model_doc/patchtsmixer.md` to Korean by @fabxoe in #33587
- 🌐 [i18n-KO] Translated `model_doc/clip.md` to Korean by @fabxoe in #33610
- 🌐 [i18n-KO] Translated `model_doc/paligemma.md` to Korean by @fabxoe in #33612
- 🌐 [i18n-KO] Translated `model_doc/llama3.md` to Korean by @fabxoe in #33635
- 🌐 [i18n-KO] Translated `model_doc/mistral.md` to Korean by @fabxoe in #33648
- 🌐 [i18n-KO] Translated `model_doc/cohere.md` to Korean by @fabxoe in #33885
- 🌐 [i18n-KO] Translated `model_doc/dbrx.md` to Korean by @fabxoe in #33951
- 🌐 [i18n-KO] Translated `model_doc/deberta-v2.md` to Korean by @fabxoe in #33968
- 🌐 [i18n-KO] Translated `main_classes/onnx.md` to Korean by @fabxoe in #33601
- 🌐 [i18n-KO] Translated `tokenization_utils.md` to Korean by @yijun-lee in #33813
- 🌐 [i18n-KO] Translated `swin.md` to Korean by @mreraser in #33510
- 🌐 [i18n-KO] Translated `file_utils.md` to Korean by @yijun-lee in #33803
- 🌐 [i18n-KO] Translated `openai-gpt.md` to Korean by @yijun-lee in #33801
- 🌐 [i18n-KO] Translated `biogpt.md` to Korean by @yijun-lee in #33773
- 🌐 [i18n-KO] Translated `blip.md` to Korean by @cjfghk5697 in #33515
- 🌐 [i18n-KO] Translated output.md to Korean by @4N3MONE in #33607
- 🌐 [i18n-KO] Translated `image_processing_utils.md` to Korean by @yijun-lee in #33804
- 🌐 [i18n-KO] Translated `modular_transformers.md` to Korean by @yijun-lee in #33772
- [`Patch helper`] update to not have to checkout main by @ArthurZucker in #34006
- Fix Failed tests with mobile bert resize tokens embedding by @abuelnasr0 in #33950
- Generate: remove most decoder-only LLMs `prepare_inputs_for_generation` by @gante in #33870
- Mllama: fix tests by @zucchini-nlp in #34000
- Fix PIL dep for tests by @muellerzr in #34028
- 🌐 [i18n-KO] Translated `model_doc/bart.md` to Korean by @fabxoe in #33893
- 🌐 [i18n-KO] Translated `model_doc/deberta.md` to Korean by @fabxoe in #33967
- 🌐 [i18n-KO] Translated `main_classes/keras_callbacks.md` to Korean by @fabxoe in #33955
- 🌐 [i18n-KO] Translated `model_doc/mamba2.md` to Korean by @fabxoe in #33629
- 🌐 [i18n-KO] Translated `main_classes/model.md` to Korean by @fabxoe in #33606
- 🌐 [i18n-KO] Translated `model_doc/trajectory_transformer.md` to Korean by @fabxoe in #33597
- 🌐 [i18n-KO] Translated `model_doc/time_series_transformer.md` to Korean by @fabxoe in #33596
- 🌐 [i18n-KO] Translated `model_doc/informer.md` to Korean by @fabxoe in #33585
- 🌐 [i18n-KO] Translated `model_doc/graphormer.md` to Korean by @fabxoe in #33569
- 🌐 [i18n-KO] Translated `modeling_utils.md` to Korean by @yijun-lee in #33808
- 🌐 [i18n-KO] Translated `main_classes/data_collator.md` to Korean by @fabxoe in #33954
- 🌐 [i18n-KO] Translated `model_doc/patchtst.md` to Korean by @fabxoe in #33589
- 🌐 [i18n-KO] Translated `text_generation.md` to Korean by @yijun-lee in #33777
- 🌐 [i18n-KO] Translated `main_classes/callback.md` to Korean by @Jwaminju in #33572
- 🌐 [i18n-KO] Translated `generation_utils.md` to Korean by @yijun-lee in #33818
- Add Translate docs into Arabic - section files CONCEPTUAL GUIDES by @AhmedAlmaghz in #33982
- add sdpa to OPT by @avishaiElmakies in #33298
- Phi3: fix attn for sliding window by @zucchini-nlp in #33586
- HfArgumentParser: allow for hyhenated field names in long-options by @djmarti in #33990
- Fix pipelines tests by @qubvel in #34049
- Specifying torch dtype in Qwen2VLForConditionalGeneration by @htahboub in #33953
- Universal Assisted Generation: Assisted generation with any assistant model (by Intel Labs) by @danielkorat in #33383
- check if eigenvalues of covariance matrix are complex. by @abuelnasr0 in #34037
- [Docs] Update compressed_tensors.md by @mgoin in #33961
- Fix data_seed unused by @MekkCyber in #33731
- [TESTS] ASR pipeline by @ylacombe in #33925
- Update Blip2 `is_pipeline_test_to_skip` method signature by @qubvel in #34067
- provide trust_remote_code for search feat extractor in model config by @eaidova in #34036
- Small Fix to modular converter by @MekkCyber in #34051
- Default `synced_gpus` to `True` when using `FullyShardedDataParallel` by @ringohoffman in #33483
- Idefics: fix position ids by @zucchini-nlp in #33907
- Update SSH workflow file by @ydshieh in #34084
- Tests: upcast `logits` to `float()` by @gante in #34042
- Fix flax failures by @LysandreJik in #33912
- Fix DAC slow tests by @ylacombe in #34088
- Fix failing conversion by @LysandreJik in #34010
- Fix PushToHubMixin when pusing to a PR revision by @Wauplin in #34090
- avoid many failures for ImageGPT by @ydshieh in #34071
- Fix NaNs in cost_matrix for mask2former by @ducha-aiki in #34074
- Fix flaky tests by @zucchini-nlp in #34069
- Generate: move `prepare_inputs_for_generation` in encoder-decoder llms by @gante in #34048
- Avoid many test failures for `LlavaNextVideoForConditionalGeneration` by @ydshieh in #34070
- refactor: benchmarks by @McPatate in #33896
- fix(ci): benchmarks dashboard was failing due to missing quotations by @McPatate in #34100
- Generate: Fix modern llm `generate` calls with `synced_gpus` by @gante in #34095
- Mistral-related models for QnA by @vasqu in #34045
- Fix a typo by @PengWeixuan in #34148
- Fixed error message in mllama by @dmgcsilva in #34106
- Specify that users should be careful with their own files by @LysandreJik in #34153
- Add documentation for docker by @ArthurZucker in #33156
- Update README.md with Enterprise Hub by @gary149 in #34150
- Idefics: enable generation tests by @zucchini-nlp in #34062
- Add sdpa for Vivit by @RUFFY-369 in #33757
- Fix FSDP resume Initialization issue by @Itssshikhar in #34032
- Fix default behaviour in TextClassificationPipeline for regression problem type by @subhalingamd in #34066
- Generate: move `logits` to same device as `input_ids` by @gante in #34076
- Add support for inheritance from class with different suffix in modular by @yonigozlan in #34077
- Fix optuna ddp hp search by @SunMarc in #34073
- [feat] LlavaNext add feature size check to avoid CUDA Runtime Error by @laurentd-lunit in #33608
- 🌐 [i18n-KO] Translated `vivit.md` to Korean by @mreraser in #33935
- 🌐 [i18n-KO] Translated `gemma2.md` to Korean by @yijun-lee in #33937
- 🌐 [i18n-KO] Translated `trainer_utils.md` to Korean by @yijun-lee in #33817
- 🌐 [i18n-KO] Translated `blip-2.md` to Korean by @cjfghk5697 in #33516
- IDEFICS: support inputs embeds by @zucchini-nlp in #34043
- [fix] fix token healing tests and usage errors by @alpertunga-bile in #33931
- Revert `accelerate` error caused by `46d09af` by @steveepreston in #34197
- Fix wrong name for llava onevision and qwen2_vl in tokenization auto by @yonigozlan in #34177
- Avoid using torch's Tensor or PIL's Image in chat template utils if not available by @RezaRahemtola in #34165
- Revert "Fix FSDP resume Initialization issue" by @SunMarc in #34193
- Update `trainer._get_eval_sampler()` to support `group_by_length` arg by @larin92 in #33514
- Fix warning message for fp32_cpu_offloading in bitsandbytes configs by @amosyou in #34079
- Ping team members for new failed tests in daily CI by @ydshieh in #34171
- fix(Wav2Vec2ForCTC): torch export by @chrsmcgrr in #34023
- Fix for tokenizer.apply_chat_template with continue_final_message=True by @schoennenbeck in #34214
- removes decord by @vrnvu in #33987
- Fix bus error when using GPT2 on M1 macs by @chanind in #34031
- Generate: visit non-llm `prepare_inputs_for_generation` by @gante in #34199
- Support Llama 3.2 conversion (text models) by @pcuenca in #33778
- Fix-red-ci by @ArthurZucker in #34230
- BLIP: fix input expansion logic by @zucchini-nlp in #34225
- Fix broken test decorator `require_torch_up_to_2_accelerators` by @byi8220 in #34201
- Informative 2 by @LysandreJik in #34154
- Fix UDOP dtype issue by @Rocketknight1 in #34180
- Only cast logits to float when computing loss by @ringohoffman in #34147
- Generation tests: don't rely on main input name by @zucchini-nlp in #34228
- Change Paligemma import logging to work with modular by @yonigozlan in #34211
- Add DetrImageProcessorFast by @yonigozlan in #34063
- Add a doc section on writing generation prompts by @Rocketknight1 in #34248
- Fix method name which changes in tutorial by @andimarafioti in #34252
- Attn implementation for composite models by @zucchini-nlp in #32238
- VLM: add more modularity by @zucchini-nlp in #34175
- T5 compile compatibilty by @zucchini-nlp in #34089
- [docs] Fix GenerationConfig params by @stevhliu in #34299
- Fix Korean doc _toctree.yml by @regisss in #34293
- Update PR templates by @SunMarc in #34065
- [RT-DETR] Fix onnx inference bug for Optype (Where) by @YHallouard in #33877
- Fix FA2 attention for models supporting sliding window by @Cyrilvallez in #34093
- Fix: tensor of examples of the same length triggers invalid stacking by @pbelcak in #34166
- Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies by @alex-bene in #32550
- Add option for running ffmpeg_microphone_live as a background process by @mikamerath in #32838
- Feature: Add `MLFLOW_MAX_LOG_PARAMS` to `MLflowCallback` by @cecheta in #34279
- Fix continue_final_message for image-text-to-text chat templates by @yonigozlan in #34236
- fix error in _get_eval_sampler when group_by_length enabled by @akakakakakaa in #34237
- [docs] fix typo by @faaany in #34235
- 🌐 [i18n-KO] Translated `executorch.md` to Korean by @ahnjj in #33888
- 🌐 [i18n-KO] Translated `bert japanese.md` to Korean by @ahnjj in #33890
- 🌐 [i18n-KO] Translated `model_doc/bartpho.md` to Korean by @Jwaminju in #33981
- Example doc for token classification of Llama and Dependent/Copied Models by @h3110Fr13nd in #34139
- [docs] Fix Korean toctree by @stevhliu in #34324
- Added Deberta model type support by @FilipposVentirozos in #34308
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @manuelsh
- adding positional encoder changes and tests (#32600)
- @ArthurZucker
- [`MllamaProcessor`] Update errors and API with multiple image (#33715)
- [`clean_up_tokenization_spaces`] Pl bart was failing, updating (#33735)
- [`MllamaImageProcessing`] Update doc (#33747)
- [`modular`] fixes! (#33820)
- add setter for trainer processor (#33911)
- `PR run-slow`
- hot fix `self.position_embeddings->self.position_embedding` (#33958)
- fix red check-copies (#33964)
- [`Red CIs`] Fix hub failures (#34001)
- properly fix and RUN_SLOW (#33965)
- [`pytes collection`] Fix flax test collection (#34004)
- Add support for __all__ and potentilly deleting functions (#33859)
- [`Patch helper`] update to not have to checkout main (#34006)
- Add documentation for docker (#33156)
- Fix Gradient Accumulation issue (#34191)
- Fix-red-ci (#34230)
- @molbap
- Fix position embeddings singular/plural (#33678)
- Uniformize model processors (#31368)
- @vasqu
- Update Albumentations Versions (#33704)
- [`TF`] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
- Mistral-related models for QnA (#34045)
- @VladOS95-cyber
- Add gguf support for bloom (#33473)
- Bug fix gguf qwen2moe (#33940)
- Add gguf support for StableLM (#33793)
- Add gguf support for gpt2 (#34044)
- Add GGUF for starcoder2 (#34094)
- @ydshieh
- Add Slow CI reminder bot (#33506)
- post reminder comment only once (#33848)
- Avoid using context that is not accessable from external contributors (#33866)
- Don't run reminder bot for now (#33883)
- Update SSH workflow file (#34084)
- avoid many failures for ImageGPT (#34071)
- Avoid many test failures for `LlavaNextVideoForConditionalGeneration` (#34070)
- Ping team members for new failed tests in daily CI (#34171)
- @amyeroberts
- Repo consistency fix after #33339 (#33873)
- Trainer - deprecate tokenizer for processing_class (#32385)
- @ylacombe
- [Tests] Diverse Whisper fixes (#33665)
- Fix distil whisper segment computation (#33920)
- [TESTS] ASR pipeline (#33925)
- Fix DAC slow tests (#34088)
- Moshi integration (#33624)
- @ringohoffman
- Remove `logits.float()` (#33902)
- Default `synced_gpus` to `True` when using `FullyShardedDataParallel` (#33483)
- Only cast logits to float when computing loss (#34147)
- @garg-amit
- PhiMoE (#33363)
- @pglorio
- Add Zamba (#30950)
- @tomlimi
- [WIP] Add Tokenizer for MyT5 Model (#31286)
- @yijun-lee
- 🌐 [i18n-KO] Translated `gguf.md` to Korean (#33764)
- 🌐 [i18n-KO] Translated `audio_utils.md` to Korean (#33802)
- 🌐 [i18n-KO] Translated `esm.md` to Korean (#33796)
- 🌐 [i18n-KO] Translated `time_series_utils.md` to Korean (#33806)
- 🌐 [i18n-KO] Translated `pipelines_utils.md` to Korean (#33809)
- 🌐 [i18n-KO] Translated `trainer.md` to Korean (#33797)
- 🌐 [i18n-KO] Translated `chameleon.md` to Korean (#33799)
- 🌐 [i18n-KO] Translated `gemma.md` to Korean (#33936)
- 🌐 [i18n-KO] Translated `feature_extractor.md` to Korean (#33775)
- 🌐 [i18n-KO] Translated `tokenization_utils.md` to Korean (#33813)
- 🌐 [i18n-KO] Translated `file_utils.md` to Korean (#33803)
- 🌐 [i18n-KO] Translated `openai-gpt.md` to Korean (#33801)
- 🌐 [i18n-KO] Translated `biogpt.md` to Korean (#33773)
- 🌐 [i18n-KO] Translated `image_processing_utils.md` to Korean (#33804)
- 🌐 [i18n-KO] Translated `modular_transformers.md` to Korean (#33772)
- 🌐 [i18n-KO] Translated `modeling_utils.md` to Korean (#33808)
- 🌐 [i18n-KO] Translated `text_generation.md` to Korean (#33777)
- 🌐 [i18n-KO] Translated `generation_utils.md` to Korean (#33818)
- 🌐 [i18n-KO] Translated `gemma2.md` to Korean (#33937)
- 🌐 [i18n-KO] Translated `trainer_utils.md` to Korean (#33817)
- @fabxoe
- 🌐 [i18n-KO] Translated `main_classes/quantization.md` to Korean (#33959)
- 🌐 [i18n-KO] Translated `main_classes/configuration.md` to Korean (#33952)
- 🌐 [i18n-KO] Translated `model_doc/mamba.md` to Korean (#33626)
- 🌐 [i18n-KO] Translated `model_doc/autoformer.md` to Korean (#33574)
- 🌐 [i18n-KO] Translated `model_doc/patchtsmixer.md` to Korean (#33587)
- 🌐 [i18n-KO] Translated `model_doc/clip.md` to Korean (#33610)
- 🌐 [i18n-KO] Translated `model_doc/paligemma.md` to Korean (#33612)
- 🌐 [i18n-KO] Translated `model_doc/llama3.md` to Korean (#33635)
- 🌐 [i18n-KO] Translated `model_doc/mistral.md` to Korean (#33648)
- 🌐 [i18n-KO] Translated `model_doc/cohere.md` to Korean (#33885)
- 🌐 [i18n-KO] Translated `model_doc/dbrx.md` to Korean (#33951)
- 🌐 [i18n-KO] Translated `model_doc/deberta-v2.md` to Korean (#33968)
- 🌐 [i18n-KO] Translated `main_classes/onnx.md` to Korean (#33601)
- 🌐 [i18n-KO] Translated `model_doc/bart.md` to Korean (#33893)
- 🌐 [i18n-KO] Translated `model_doc/deberta.md` to Korean (#33967)
- 🌐 [i18n-KO] Translated `main_classes/keras_callbacks.md` to Korean (#33955)
- 🌐 [i18n-KO] Translated `model_doc/mamba2.md` to Korean (#33629)
- 🌐 [i18n-KO] Translated `main_classes/model.md` to Korean (#33606)
- 🌐 [i18n-KO] Translated `model_doc/trajectory_transformer.md` to Korean (#33597)
- 🌐 [i18n-KO] Translated `model_doc/time_series_transformer.md` to Korean (#33596)
- 🌐 [i18n-KO] Translated `model_doc/informer.md` to Korean (#33585)
- 🌐 [i18n-KO] Translated `model_doc/graphormer.md` to Korean (#33569)
- 🌐 [i18n-KO] Translated `main_classes/data_collator.md` to Korean (#33954)
- 🌐 [i18n-KO] Translated `model_doc/patchtst.md` to Korean (#33589)
- @MekkCyber
- FEAT : Adding BitNet quantization method to HFQuantizer (#33410)
- Fix data_seed unused (#33731)
- Small Fix to modular converter (#34051)
- @AhmedAlmaghz
- Add Translate docs into Arabic - section files CONCEPTUAL GUIDES (#33982)
- @alex-bene
- Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies (#32550)
Published by LysandreJik over 1 year ago
transformers - Release v4.45.2
Patch release v4.45.2
Mostly some warnings that were not properly removed ⚠️:
- Ignore keys on validate_rope #33753 by @zucchini-nlp
- remove warning v2 #33761 by @itazap
- Config: lower save_pretrained exception to warning #33906 by @gante
🔴 Had a small regression with dynamic Cache 🔴
- Cache: revert DynamicCache init for BC #33861 by @gante
A small fix for Idefics 🐩:
- Fixes for issue #33763 in idefics2 model #33766 by @aroun-coumar
And a fix for Siglip 🤧!
- hot fix self.position_embeddings->self.position_embedding #33958 and properly fix and RUN_SLOW #33965 thanks to @mranzinger
Published by ArthurZucker over 1 year ago
transformers - Patch Release v4.45.1
Patches for v4.45.1
- [MllamaProcessor] Update errors and API with multiple image (#33715) by @ArthurZucker
- Generate: can_generate() recursive check (#33718) by @gante
- clean_up_tokenization_spaces=False if unset (#31938) by @itazap
Published by ArthurZucker over 1 year ago
transformers - Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers
New model additions
mllama
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
- Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker
Qwen2-VL
The Qwen2-VL is a major update from the previous Qwen-VL by the Qwen team.
An extract from the Qwen2-VL blogpost available here is as follows:
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families. Compared with Qwen-VL, Qwen2-VL has the capabilities of:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
- support qwen2-vl by @simonJJJ in #32318
Qwen2-Audio
The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
They introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
- audio analysis: users could provide audio and text instructions for analysis during the interaction
- Add Qwen2-Audio by @faychu in #32137
OLMoE
OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.
- Add OLMoE by @Muennighoff in #32406
Llava Onevision
LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better process high-resolution images and capture as much detail as possible. Videos, however, are pooled to a total sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes (0.5B, 7B and 72B) and achieves remarkable performance on benchmark evaluations.
- Llava Onevision: add model by @zucchini-nlp in #32673
FalconMamba
The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and Math data.
The team releases an accompanying blog post.
- Add new model by @younesbelkada in #32615
Granite Language Models
The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
- Granite language models by @mayank31398 in #31502
Granite MOE
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
- Granitemoe by @mayank31398 in #33207
Descript-Audio-Codec
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
- Add Descript-Audio-Codec model by @kamilakesbi in #31494
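As a rough sanity check on the figures above, the arithmetic below compares the 8 kbps token stream with uncompressed audio. The 16-bit mono PCM baseline is an assumption made for illustration and is not stated in these notes.

```python
# Rough compression-ratio arithmetic for the figures quoted above.
SAMPLE_RATE_HZ = 44_100     # from the description
CODEC_BITRATE_BPS = 8_000   # 8 kbps, from the description
PCM_BITS_PER_SAMPLE = 16    # assumption: 16-bit mono PCM baseline

pcm_bitrate_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bitrate_bps / CODEC_BITRATE_BPS


def stored_megabytes(bitrate_bps, seconds):
    """Size in megabytes (10^6 bytes) of a stream at the given bitrate."""
    return bitrate_bps * seconds / 8 / 1_000_000


print(f"PCM bitrate: {pcm_bitrate_bps} bps")
print(f"Compression ratio: {compression_ratio:.1f}x")
print(f"One hour as PCM: {stored_megabytes(pcm_bitrate_bps, 3600):.2f} MB")
print(f"One hour as codec tokens: {stored_megabytes(CODEC_BITRATE_BPS, 3600):.2f} MB")
```

Under that assumption, the token stream is roughly two orders of magnitude smaller than the raw waveform.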
Pixtral
The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.
The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).
- Add support for Pixtral by @ArthurZucker in #33449
Mimi
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
- Codec integration by @ylacombe in #33565
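The relationship between frame rate and bitrate can be checked with a few lines of arithmetic. The codebook count (8), codebook size (2048) and exact frame rate (12.5 Hz, which the description above rounds to 12Hz) are assumptions drawn from the Moshi paper, not from these notes.

```python
import math

FRAME_RATE_HZ = 12.5   # assumption: "12Hz" above is a rounding of 12.5 Hz
NUM_CODEBOOKS = 8      # assumption: tokens emitted per audio frame
CODEBOOK_SIZE = 2048   # assumption: entries per codebook

# Each token indexes one codebook entry, so it carries log2(size) bits.
bits_per_token = int(math.log2(CODEBOOK_SIZE))
bits_per_frame = NUM_CODEBOOKS * bits_per_token
bitrate_bps = bits_per_frame * FRAME_RATE_HZ

print(f"{bits_per_token} bits/token, {bits_per_frame} bits/frame")
print(f"{bitrate_bps / 1000:.1f} kbps")
```

With these values the product comes out at the 1.1kbps quoted in the description.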
Quantization
GGUF
GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.
- Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
- Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
- Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
- 🚨 Support dequantization for most GGML types by @Isotr0py in #32625
- Add support for GGUF Phi-3 by @a8nova in #31844
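To give a feel for what unquantizing a GGUF tensor involves, here is a minimal sketch that decodes a single block in the Q8_0 GGML layout (a half-precision scale followed by 32 signed 8-bit values). The function name is hypothetical and this is not the library's implementation, which works on whole tensors and covers many more GGML types.

```python
import struct

BLOCK_SIZE = 32  # Q8_0 stores weights in blocks of 32


def dequantize_q8_0_block(block):
    """Decode one Q8_0 block: a little-endian fp16 scale, then 32 int8 values."""
    (scale,) = struct.unpack_from("<e", block, 0)
    quants = struct.unpack_from(f"<{BLOCK_SIZE}b", block, 2)
    # Each weight is recovered as scale * quantized value.
    return [scale * q for q in quants]


# Build a synthetic block (scale 0.5, quantized values -16..15) and decode it.
block = struct.pack(f"<e{BLOCK_SIZE}b", 0.5, *range(-16, 16))
weights = dequantize_q8_0_block(block)
print(len(block), len(weights), weights[0], weights[-1])
```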
Torch AO
An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.
- Add TorchAOHfQuantizer by @jerryzh168 in #32306
Liger Kernel
The Liger kernel is now supported in the Trainer class.
- Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by @JasonZhu1313 in #32860
Modular Transformers
This PR introduces Modularity for transformers, which has always been prohibited when working with transformers (see blog post for the accompanying design philosophy).
The core idea behind this PR is to facilitate model addition by enabling Pythonic inheritance while keeping true to our single-file policy in which models/processors must be contained within a single file, enabling working around the object without going through 10 layers of abstractions.
It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248
- Modular `transformers`: modularity and inheritance for new model additions by @ArthurZucker in #33248
Agents
Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.
- Multi agents with manager by @aymeric-roucher in #32687
- Add new documentation page for advanced agent usage by @aymeric-roucher in #33265
- Create local Transformers Engine by @aymeric-roucher in #33218
- Agents use grammar by @aymeric-roucher in #31735
Dynamic cache for decoder-only models
This PR adds support for dynamic cache to all decoder-only models (except for XLNet).
The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.
- Cache: new Cache format in decoder-only models by @zucchini-nlp in #31421
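The idea of a dynamic cache can be illustrated with a toy, pure-Python sketch: per-layer key/value storage that grows by one entry at each decoding step instead of being pre-allocated. The class below is hypothetical and only meant to convey the concept. Its method names are chosen to resemble the library's cache interface, but it is not the actual implementation, which stores tensors.

```python
class ToyDynamicCache:
    """Per-layer key/value storage that grows one decoding step at a time."""

    def __init__(self):
        self.keys = []    # keys[layer_idx] -> list of per-token key vectors
        self.values = []  # values[layer_idx] -> list of per-token value vectors

    def update(self, key, value, layer_idx):
        # Lazily create storage the first time a layer is seen.
        while len(self.keys) <= layer_idx:
            self.keys.append([])
            self.values.append([])
        self.keys[layer_idx].append(key)
        self.values[layer_idx].append(value)
        # Return everything cached so far so attention can use the full history.
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx=0):
        return len(self.keys[layer_idx]) if layer_idx < len(self.keys) else 0


cache = ToyDynamicCache()
for step in range(3):        # three decoding steps
    for layer in range(2):   # two layers
        cache.update([float(step)], [float(step)], layer)

print(cache.get_seq_length())
```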
Chat templates updates
We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:
```python
from transformers import pipeline

# model_checkpoint is a placeholder for the ID of any chat model
pipe = pipeline("text-generation", model_checkpoint)

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

output = pipe(chat)  # The model will continue outputting JSON!
```
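To make the prefill behaviour concrete, the sketch below renders a chat into a prompt string by hand. The ChatML-like markers and the `render_chat` helper are hypothetical stand-ins for a real chat template. The point is only that the final assistant message is left open, with no end-of-turn marker and no fresh assistant header, so generation continues it.

```python
def render_chat(messages, continue_final_message=False):
    """Render messages in a ChatML-like format (hypothetical template)."""
    parts = []
    for i, msg in enumerate(messages):
        is_last = i == len(messages) - 1
        text = f"<|im_start|>{msg['role']}\n{msg['content']}"
        # Leave the last message open when the model should continue it.
        if not (is_last and continue_final_message):
            text += "<|im_end|>\n"
        parts.append(text)
    if not continue_final_message:
        # Otherwise start a fresh assistant turn for the model to fill in.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

prompt = render_chat(chat, continue_final_message=True)
print(prompt)
```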
We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
- Enable some Jinja extensions and add datetime capabilities by @Rocketknight1 in #32684
- Update Jinja docs with new functions and general cleanup by @Rocketknight1 in #33097
- Add assistant prefill for chat templates and TextGenerationPipeline by @Rocketknight1 in #33198
- Add a warning to the chat template docs about the tool_calls format by @Rocketknight1 in #33277
- Add tip to clarify tool calling by @Rocketknight1 in #32883
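As an illustration of what a `strftime_now`-style helper provides to a template, here is a plain-Python sketch. The function body and the system message wording are assumptions for illustration. In the library, the function is exposed inside the Jinja template environment rather than called from Python like this.

```python
from datetime import datetime


def strftime_now(fmt):
    """Return the current local date/time formatted with a strftime pattern."""
    return datetime.now().strftime(fmt)


# A system message that embeds today's date, as a template might do.
system_message = f"Today's date is {strftime_now('%d %B %Y')}."
print(system_message)
```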
Bugfixes and improvements
- 🌐 [i18n-KO] Translated `mask_generation.md` to Korean by @jeongiin in #32257
- 🌐 [i18n-KO] Translated `idefics.md` to Korean by @boyunJang in #32258
- 🌐 [i18n-KO] Translated `image_to_image.md` to Korean by @shinhyunji36 in #32327
- Gemma2: add cache warning by @zucchini-nlp in #32279
- enable xla fsdp by @hanwen-sun in #32048
- Fix typo in tokenization_utils_base.py by @blubitz in #32484
- fix broken link in docs by @jorahn in #32491
- Docs: alert for the possibility of manipulating logits by @gante in #32467
- 🌐 [i18n-KO] Translated `gptq.md` to Korean by @1kmmk1 in #32293
- 🌐 [i18n-KO] Translated `prompting.md` to Korean by @chhaewxn in #32294
- 🌐 [i18n-KO] Translated `quantization/quanto.md` to Korean by @fabxoe in #32281
- 🌐 [i18n-KO] Translated `image_feature_extraction.md` to Korean by @mreraser in #32239
- Fix references to model google mt5 small by @JuanFKurucz in #32497
- Docs: Fixed WhisperModel.forward’s docstring link by @Sai-Suraj-27 in #32498
- 🌐 [i18n-KO] Translated `chat_templating.md` to Korean by @enchantee00 in #32362
- Fix link to autoclass_tutorial.md in i18n.md by @JuanFKurucz in #32501
- Fix typo: depracted -> deprecated by @tomaarsen in #32489
- Fix issue #32518: Update llm_tutorial.md by @doomdagadiggiedahdah in #32523
- Change Phi3 `_supports_sdpa` to True by @pocca2048 in #32457
- Uniformize kwargs for processors - GroundingDINO by @SangbumChoi in #31964
- Fix add-new-model-like by @molbap in #31773
- filter flash_attn optional imports loading remote code by @eaidova in #30954
- 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean by @010kim in #32372
- 🌐 [i18n-KO] Translated `trainer.md` to Korean by @cjfghk5697 in #32260
- 🌐 [i18n-KO] Translated `eetq.md` to Korean by @jun048098 in #32352
- 🌐 [i18n-KO] Translated `fsdp.md` to Korean by @win2dvp21 in #32261
- 🌐 [i18n-KO] Translated `bitsandbytes.md` to Korean by @SeungAhSon in #32408
- Fix generate with `inputs_embeds` as input by @molbap in #32493
- Fixed test `test_static_cache_exportability` with torch 2.4.0 by @guangy10 in #32516
- Fix code example to load bigcode starcoder2 7b by @JuanFKurucz in #32474
- [docs] Translation guide by @stevhliu in #32547
- Gemma2: fix FA2 generation by @zucchini-nlp in #32553
- Fix a bug in Qwen2Audio by @faychu in #32552
- fix slow integration gemma2 test by @ArthurZucker in #32534
- fix non contiguous tensor value error in save_pretrained by @congcongke in #32422
- 🌐 [i18n-KO] Translated `agent.md` to Korean by @Jwaminju in #32351
- Fix: FA2 with packed training by @zucchini-nlp in #32487
- Fix sliding window attention used in Gemma2FlashAttention2 by @brcps12 in #32522
- fix: Fixed conditional check for `encodec` model names by @Sai-Suraj-27 in #32581
- Fix `.push_to_hub(..., create_pr=True, revision="my-branch")` when creating PR on not-owned repo by @Wauplin in #32094
- Cleanup tool calling documentation and rename doc by @Rocketknight1 in #32337
- 🌐 [i18n-KO] Translated `deepspeed.md` to Korean by @4N3MONE in #32431
- 🌐 [i18n-KO] Translated `awq.md` to Korean by @ahnjj in #32324
- fix: Fixed failing `test_find_base_model_checkpoint` by @Sai-Suraj-27 in #32638
- "to be not" -> "not to be" by @qgallouedec in #32636
- fix: Updated the `is_torch_mps_available()` function to include `min_version` argument by @Sai-Suraj-27 in #32545
- Expand inputs in processors for VLMs by @zucchini-nlp in #30962
- Automatically add `transformers` tag to the modelcard by @LysandreJik in #32623
- Fix tests by @molbap in #32649
- fix tensors on different devices in `WhisperGenerationMixin` by @faaany in #32316
- Add support for GrokAdamW optimizer by @ehartford in #32521
- Add Depth Anything V2 Metric models by @bt2513 in #32126
- Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` by @Sai-Suraj-27 in #32601
- Modify ProcessorTesterMixin for better generalization by @yonigozlan in #32637
- TF_Deberta supporting mixed precision by @pinesnow72 in #32618
- Fix tests recurrent by @molbap in #32651
- Support MUSA (Moore Threads GPU) backend in transformers by @fmo-mt in #31913
- fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` by @Sai-Suraj-27 in #32678
- Update translation docs review by @stevhliu in #32662
- Fix `JetMoeIntegrationTest` by @ydshieh in #32332
- Update the distributed CPU training on Kubernetes documentation by @dmsuehir in #32669
- fix: Fixed unknown pytest config option `doctest_glob` by @Sai-Suraj-27 in #32475
- Unpin deepspeed in Docker image/tests by @muellerzr in #32572
- Updated workflows to the latest versions by @Sai-Suraj-27 in #32405
- reopen: llava-next fails to consider padding_side during Training by @jp1924 in #32679
- fix: Corrected `falcon-mamba-7b` model checkpoint name by @Sai-Suraj-27 in #32837
- fix: update doc link for runhouse in README.md by @muddlebee in #32664
- VLMs: small clean-up for cache class by @zucchini-nlp in #32417
- add back the position ids by @ArthurZucker in #32554
- Use head_dim if in config for RoPE by @suiyoubi in #32495
- Generate: unify `LogitsWarper` and `LogitsProcessor` by @gante in #32626
- [tests] make test_sdpa_equivalence device-agnostic by @faaany in #32520
- Cache: use `batch_size` instead of `max_batch_size` by @gante in #32657
- Fix AutoConfig and AutoModel support for Llava-Next-Video by @TKONIY in #32844
- improve _get_is_as_tensor_fns by @zrr1999 in #32596
- Revert PR 32299, flag users when Zero-3 was missed by @muellerzr in #32851
- fix multi-gpu with static cache by @SunMarc in #32543
- Reduce the error log when using core models that need their weights renamed, and provide a step forward by @muellerzr in #32656
- Make beam_constraints.Constraint.advance() docstring more accurate by @alex-calderwood in #32674
- generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
- Add Flax Dinov2 by @MHRDYN7 in #31960
- support torch-speech by @itazap in #32537
- [tests] make `test_sdpa_can_compile_dynamic` device-agnostic by @faaany in #32519
- Add repr for Conv1D by @AaronZLT in #32425
- Support save/load ckpt for XLA FSDP by @yitongh in #32311
- RT-DETR parameterized batchnorm freezing by @AlanBlanchet in #32631
- Mamba / FalconMamba: Fix mamba left padding by @younesbelkada in #32677
- Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by @vasqu in #32694
- Docs: Fixed `whisper-large-v2` model link in docs by @Sai-Suraj-27 in #32871
- Allow-head-dim by @ArthurZucker in #32857
- 🚨🚨🚨 Update min version of accelerate to 0.26.0 by @SunMarc in #32627
- Fix repr for conv by @ArthurZucker in #32897
- fix: jamba cache fails to use torch.nn.module by @xgal in #32894
- Fix: Mamba2 `norm_before_gate` usage by @vasqu in #32686
- Replace `tensor.norm()` with decomposed version for CLIP executorch export by @qubvel in #32887
- link for optimizer names by @nbroad1881 in #32400
- [i18n-ar] add README_ar.md to README.md by @AhmedAlmaghz in #32583
- fix: [whisper] don't overwrite GenerationConfig's `return_timestamps` when `return_timestamps` is not passed to `generate` function by @hrl in #31296
- Update docker image building by @ArthurZucker in #32918
- Jamba: update integration tests by @gante in #32250
- fix: Added missing `huggingface_hub` installation to workflows by @Sai-Suraj-27 in #32891
- fix: no need to dtype A in jamba by @xgal in #32924
- FEAT / Trainer: Add adamw 4bit optimizer by @SunMarc in #31865
- CI: separate step to download nltk files by @gante in #32935
- FIX / Hub: Also catch for `exceptions.ConnectionError` by @younesbelkada in #31469
- Add SynCode to llm_tutorial by @shubhamugare in #32884
- Fix benchmark script by @ydshieh in #32635
- Improve greedy search memory usage by @regisss in #32895
- fix: (issue #32689) `AttributeError` raised when using `Trainer` with `eval_on_start=True` in Jupyter Notebook. by @fshp971 in #32849
- Gemma2: eager attention by default by @gante in #32865
- [run_slow] idefics2 by @andimarafioti in #32840
- Fix regression on `Processor.save_pretrained` caused by #31691 by @leloykun in #32921
- 🌐 [i18n-KO] Translated `knowledge_distillation_for_image_classification.md` to Korean by @JinukHong in #32334
- Generate: Deprecate returning legacy cache by default; Handle `use_cache=False` by @gante in #32863
- docs: fix outdated link to TF32 explanation by @anakin87 in #32947
- Reducing memory usage: removing useless logits computation in generate() by @Cyrilvallez in #31292
- Forbid `PretrainedConfig` from saving `generate` parameters; Update deprecations in `generate`-related code 🧹 by @gante in #32659
- DeviceGuard added to use Deformable Attention more safely on multi-GPU by @DonggeunYu in #32910
- added doctring to SchedulerType class by @Arunprakash-A in #32898
- Updated the custom_models.md changed cross_entropy code by @S-M-J-I in #33118
- CI: add torchvision to the consistency image by @gante in #32941
- Test: add higher `atol` in `test_forward_with_num_logits_to_keep` by @gante in #33093
- mps: add `isin_mps_friendly`, a wrapper function for `torch.isin` by @gante in #33099
- Add changes for uroman package to handle non-Roman characters by @nandwalritik in #32404
- fix: Fixed `pydantic` required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
- quickfix documentation by @molbap in #32566
- Fixup py 38 type hints for mps friendly by @muellerzr in #33128
- fix: Fixed CodeGenTokenizationTest::test_truncation failing test by @Sai-Suraj-27 in #32850
- fix: multilingual midel convert to tflite get wrong token by @Ayaa17 in #32079
- disable scheduled daily CI temporarily by @ydshieh in #33136
- CI: fix `efficientnet` pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
- Log additional test metrics with the CometCallback by @Lothiraldan in #33124
- [docs] add quick usage snippet to Whisper. by @Vaibhavs10 in #31289
- Update stateful_callbacks state before saving checkpoint by @pedrobrs in #32115
- fix Idefics2VisionConfig type annotation by @chenzizhao in #33103
- Add a fix for custom code tokenizers in pipelines by @Rocketknight1 in #32300
- Llama: make slow tests green 🟢 by @gante in #33138
- fix redundant checkpointing in example training scripts by @eminorhan in #33131
- update torch req for 4-bit optimizer by @SunMarc in #33144
- 🌐 [i18n-KO] Translated `conversations.md` to Korean by @newfull5 in #32468
- Very small change to one of the function parameters by @alisalamatian1 in #32548
- 🚨 Add Blip2ForImageTextRetrieval by @jpizarrom in #29261
- fix model name and copyright by @mayank31398 in #33152
- Fix: Jamba batched generation by @vasqu in #32914
- [whisper] pass attention_mask to generate_with_fallback() by @benniekiss in #33145
- [RoBERTa-based] Add support for sdpa by @hackyon in #30510
- Fix import paths for test_module by @rasmi in #32888
- Zero-shot pipelines: minor doc changes by @pcuenca in #33127
- Customise the separator used for splicing in DataCollatorWithFlattening by @beep-bebop in #33114
- Fix spell mistakes by @matsuo1234567 in #33149
- update push CI workflow files for security by @ydshieh in #33142
- added quick clarification by @DuyguA in #33166
- pass module to Params4bit.from_prequantized to ensure quant_state by @winglian in #32524
- Mamba2 conversion script for original models by @vasqu in #32580
- Add a static cache that offloads to the CPU or other device by @gerbenvv in #32161
- use a single for loop by @ArthurZucker in #33148
- Pipeline: fix bad generation kwargs docs by @gante in #33205
- Add missing quotes in modeling_llava_next_video.py by @juliendenize in #33214
- Add warning for stop string edge case by @Rocketknight1 in #33169
- Fix local repos with remote code not registering for pipelines by @Rocketknight1 in #33100
- Refactor CI: more explicit by @ArthurZucker in #30674
- 🌐 [i18n-KO] Translated `llm_optims.md` to Korean by @yijun-lee in #32325
- Fix red amin by @ArthurZucker in #33220
- Test fetcher: missing return on filtered tests; don't write empty files by @gante in #33224
- Generate: throw warning when `return_dict_in_generate` is False but should be True by @gante in #33146
- Add video text to text docs by @merveenoyan in #33164
- Add GraniteRMSNorm by @NielsRogge in #33177
- Add duckduckgo search tool by @aymeric-roucher in #32882
- Fix: Suppressed 'use_reentrant=False' warning by @ankush13r in #33208
- docs: Replace package abbreviations with full name (`bitsandbytes`) in docstrings by @rapsealk in #33230
- Generate: fix assistant in different device by @gante in #33257
- remove to restriction for 4-bit model by @SunMarc in #33122
- Fixed typo repeated word in DETR docs by @sergiopaniego in #33250
- Fix: use `torch.from_numpy()` to create tensors for np.ndarrays by @shinyano in #33201
- remove torch input dependant control flow by @ArthurZucker in #33245
- Fix: `num_logits_to_keep` in composite models by @zucchini-nlp in #33168
- Fix Bark saving by @ylacombe in #33266
- Update chat template docs to remove Blenderbot by @Rocketknight1 in #33254
- Add sdpa support for Albert by @OmarManzoor in #32092
- Only disallow DeepSpeed Zero-3 for auto bs finder by @muellerzr in #31731
- fix the parallel number of CI nodes when it is smaller than number of tests by @ArthurZucker in #33276
- Repo checks: check documented methods exist by @gante in #32320
- Fix: multigpu training by @zucchini-nlp in #33271
- Cache docs: update by @zucchini-nlp in #32929
- Config: unified logic to retrieve text config by @gante in #33219
- [fix] LlavaNextProcessor '_get_unpadded_features' method by @laurentd-lunit in #33263
- wait 15m before SSH into runner workflow stops by @ydshieh in #33300
- Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by @alexsherstinsky in #33188
- [InstructBLIP] qformer_tokenizer is required input by @amyeroberts in #33222
- [BUG] fix upper nltk version by @ylacombe in #33301
- Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by @matthewdouglas in #33154
- Add validate images and text inputs order util for processors and test_processing_utils by @yonigozlan in #33285
- Fix: Fix `FalconMamba` training issues due to incompatible kernels by @younesbelkada in #33195
- Add paper link by @Muennighoff in #33305
- 🚨 Fix `torch.jit.trace` for `interpolate_pos_encoding` in all vision models by @xenova in #33226
- Update SECURITY.md by @Michellehbn in #32680
- simple align qwen2vl kv_seq_len calculation with qwen2 by @simonJJJ in #33161
- Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by @daniellok-db in #33319
- Fix: StaticCache & `inputs_embeds` by @zucchini-nlp in #32932
- Docs: add more cross-references to the KV cache docs by @gante in #33323
- [whisper] alternative fix for long-form timestamps by @sanchit-gandhi in #32131
- fix qwen2vl vision eager-attention by @simonJJJ in #33213
- Load dynamic module (remote code) only once if code isn't change by @XuehaiPan in #33162
- support loading model without config.json file by @itazap in #32356
- Add validation for maximum sequence length in modeling_whisper.py by @AmirMohammadFakhimi in #33196
- add self.head_dim for VisionAttention in Qwen2-VL by @GeLee-Q in #33211
- support 3D attention mask in bert by @gathierry in #32105
- Support reading tiktoken tokenizer.model file by @itazap in #31656
- red-ci on main, fix copies by @ArthurZucker in #33356
- RoPE: fix BC warning by @gante in #33331
- Fix Prefill docs by @Rocketknight1 in #33352
- Update author for QLorA/PEFT community notebook by @daniellok-db in #33338
- add sdpa mbart by @nbroad1881 in #32033
- Fix quantized cache tests by @zucchini-nlp in #33351
- schedulefree optimizers by @winglian in #30079
- Add visit webpage tool by @aymeric-roucher in #33353
- Fixed Majority of the Typos in `transformers[en]` Documentation by @nnilayy in #33350
- Compile compatibilty for decoder-only models by @zucchini-nlp in #32617
- Adjust templates by @LysandreJik in #33384
- Remove repeated prepare_images in processor tests by @amyeroberts in #33163
- Fix import of `FalconMambaForCausalLM` by @younesbelkada in #33381
- Import structure & first three model refactors by @LysandreJik in #31329
- VLM: fixes after refactor by @zucchini-nlp in #32907
- fixed Mask2Former image processor segmentation maps handling by @maciej-adamiak in #33364
- Bug Fix: Update hub.py to fix NoneType error by @rishiraj in #33315
- Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by @bruno-hays in #33390
- Make StaticCache configurable at model construct time by @guangy10 in #32830
- use diff internal model in tests by @itazap in #33387
- Fix `FbgemmFp8Linear` not preserving tensor shape by @vgel in #33239
- Fix failing windows by @LysandreJik in #33436
- Remove deprecated task in load_dataset by @albertvillanova in #33433
- Dynamic number of speculative tokens in order to accelerate speculative decoding by @jmamou in #33258
- Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by @kiddj in #33402
- [docs] add the missing huggingface hub username by @faaany in #33431
- [docs] add the missing tokenizer when pushing models to huggingface hub by @faaany in #33428
- Update stale.yml by @LysandreJik in #33434
- Docs - update formatting of llama3 model card by @MichaelCurrin in #33438
- Fix incomplete sentence in `Zero-shot object detection` documentation by @sergiopaniego in #33430
- Fix flax whisper tokenizer bug by @hannan72 in #33151
- Clean-up deprecated code by @zucchini-nlp in #33446
- Fix default revision for pipelines by @ankane in #33395
- Revive AMD scheduled CI by @ydshieh in #33448
- Allow send `SSH into runner` info. to DM by @ydshieh in #33346
- Correct Whisper's beam search scores computation by @ylacombe in #32336
- Qwen2-VL: clean-up and add more tests by @zucchini-nlp in #33354
- [whisper] Clarify error message when setting max_new_tokens by @benniekiss in #33324
- [docs] refine the doc for `train with a script` by @faaany in #33423
- Return image hidden states by @zucchini-nlp in #33426
- add a callback hook right before the optimizer step by @winglian in #33444
- Enable `padding_side` as call time kwargs by @zucchini-nlp in #33385
- Mitigate a conflict when using sentencepiece by @tengomucho in #33327
- [Phi-3] Bug on stale kv cache by @garg-amit in #33129
- Fix the initialization of the cache when we have multi gpu by @SunMarc in #33303
- Enable finetuning with torchao quantized model by @SunMarc in #33361
- Corrected `Agents and tools` documentation links typos by @sergiopaniego in #33471
- chore: fix typo in comment in tokenization_utils_base.py by @DavidLemayian in #33466
- Cohere: update RoPE structure by @gante in #33408
- Fix SSH workflow by @ydshieh in #33451
- Add keypoint-detection task guide by @merveenoyan in #33274
- Uniformize kwargs for LLaVa processor and update docs by @yonigozlan in #32858
- `Agents, supercharged - Multi-agents, External tools, and more` docs typo fixed by @sergiopaniego in #33478
- [i18n-ar] Add File : `docs/source/ar/_toctree.yml` by @AhmedAlmaghz in #32696
- [Whisper test] Fix some failing tests by @ylacombe in #33450
- Fix: Qwen2-VL training on video datasets by @hiyouga in #33307
- Updated Trainer's liger-kernel integration to call correct patching API by @shimizust in #33502
- Replace `accelerator.use_fp16` in examples by @hlky in #33513
- Fix parametrization-based weight norm by @ylacombe in #33275
- Fix number of patch check for different vision feature select strategy by @insujang in #32494
- chore: migrate coverage cfg to pyproject.toml by @SauravMaheshkar in #32650
- idefics2 enable_input_require_grads not aligned with disable_input_re… by @sywangyi in #33194
- Update chameleon.md — fix runtime type error by @maxwbuckley in #33494
- Add explicit example for RAG chat templating by @A-Duss in #33503
- CI Build image - move runners by @glegendre01 in #33530
- fix to jamba config, asserting attention and expert offset by @ErezSC42 in #33316
- Fix missing `sequences_scores` in the Whisper beam search output by @Nik-Kras in #32970
- Uniformize kwargs for Pixtral processor by @yonigozlan in #33521
- Add revision to trainer push_to_hub by @teamclouday in #33482
- fix patch_attention_mask incorrect setting which leads to the differe… by @sywangyi in #33499
- Support LLaVa-OV-Chat by @zucchini-nlp in #33532
- Decorator for easier tool building by @aymeric-roucher in #33439
- Fix for slow the bug tokenizer adding spaces to single id decodes by @DuyguA in #32564
- Chat template: save and load correctly for processors by @zucchini-nlp in #33462
- Fix missing head_dim in llama config from gguf model by @Isotr0py in #33526
- [i18n-ur] Added README_ur.md file by @akkefa in #33461
- fix the wandb logging issue by @ZIYU-DEEP in #33464
- Fix tests in ASR pipeline by @ylacombe in #33545
- Added support for bfloat16 to zero-shot classification pipeline by @umarbutler in #33554
- Pipeline: no side-effects on `model.config` and `model.generation_config` 🔫 by @gante in #33480
- Return attention mask in ASR pipeline to avoid warnings by @Rocketknight1 in #33509
- enforce original size to be a list by @dom-dziela in #33564
- Improve compiled RT-DETR inference speed by @yonigozlan in #33412
- Fix bnb dequantization by @SunMarc in #33546
- Load and save video-processor from separate folder by @zucchini-nlp in #33562
- VLMs: enable generation tests by @zucchini-nlp in #33533
- rag: fix CI by @gante in #33578
- Cache: don't show warning in forward passes when `past_key_values` is None by @gante in #33541
- fix tests with main revision and read token by @molbap in #33560
- add uniform processors for altclip + chinese_clip by @molbap in #31198
- Generate: check that `attention_mask` is 2D by @gante in #33575
- change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by @VladOS95-cyber in #33375
- [`Mamba2`] Move dt calculations to kernel by @vasqu in #33520
- Cache: don't throw warnings on `gemma2` when instantiating a new cache by @gante in #33595
- Uniformize kwargs for Paligemma processor and update docs by @yonigozlan in #33571
- [tests] skip tests for xpu by @faaany in #33553
- [tests] enable GemmaIntegrationTest on XPU by @faaany in #33555
- Fix Llama 3 TikToken conversion by @pcuenca in #33538
- Docs: add the ability to manually trigger jobs by @gante in #33598
- Fix CircleCI nightly run by @ydshieh in #33558
- Allow CI could be run on private forked repositories (e.g. new model additions) by @ydshieh in #33594
- [tests] make more tests device-agnostic by @faaany in #33580
- Update modeling_mamba2.py, fix pad size by @klae01 in #32599
- Generate: remove flakyness in `test_generate_from_inputs_embeds_decoder_only` by @gante in #33602
- Remove unnecessary CPM model tests by @amyeroberts in #33621
- Add sdpa for BioGpt by @OmarManzoor in #33592
- VLM generate: tests can't generate image/video tokens by @gante in #33623
- Fix missing test in `torch_job` by @ydshieh in #33593
- Add support for args to ProcessorMixin for backward compatibility by @yonigozlan in #33479
- Fix contrastive search to correctly handle input with padding by @ducviet00 in #33507
- Generate: assistant should sample when the main model samples by @gante in #33534
- Fix some missing tests in circleci by @ydshieh in #33559
- Update daily ci to use new cluster by @ydshieh in #33627
- Fix qwen2vl float16 inference bug by @GeLee-Q in #33312
- Fix typos by @litianjian in #33583
- enable low-precision pipeline by @jiqing-feng in #31625
- Pixtral update example checkpoint by @amyeroberts in #33633
- Sdpa dino v2 by @avishaiElmakies in #33403
- Clean up Unpack imports by @molbap in #33631
- Fix DPT /Dinov2 sdpa regression on main by @molbap in #33660
- handle dependency errors in check_imports by @molbap in #33622
- add back self.max_position_embeddings = config.max_position_embeddings by @chengchengpei in #33550
- Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by @Isotr0py in #33613
- Uniformize kwargs for Udop processor and update docs by @yonigozlan in #33628
- Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin` by @gante in #33203
- Enable BNB multi-backend support by @jiqing-feng in #31098
- Fix error string after refactoring into get_chat_template by @tibor-reiss in #33652
- uniformize git processor by @yonigozlan in #33668
- Fix CIs post merging modular transformers by @ArthurZucker in #33681
- Fixed docstring for cohere model regarding unavailability of prune_he… by @mnauf in #33253
- Generation tests: update imagegpt input name, remove unused functions by @gante in #33663
- Improve Error Messaging for Flash Attention 2 on CPU by @sizhky in #33655
- Gemma2: fix config initialization (`cache_implementation`) by @gante in #33684
- Fix ByteLevel alphabet missing when Sequence pretokenizer is used by @umarbutler in #33556
- Uniformize kwargs for image-text-to-text processors by @yonigozlan in #32544
- 🚨🚨 Setting default behavior of assisted decoding by @jmamou in #33657
- tests: fix pytorch tensor placement errors by @dvrogozh in #33485
- bump tokenizers, fix added tokens fast by @ArthurZucker in #32535
- [Pixtral] Improve docs, rename model by @NielsRogge in #33491
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @enchantee00
- 🌐 [i18n-KO] Translated `chat_templating.md` to Korean (#32362)
- @faychu
- Add Qwen2-Audio (#32137)
- Fix a bug in Qwen2Audio (#32552)
- @010kim
- 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean (#32372)
- @cjfghk5697
- 🌐 [i18n-KO] Translated `trainer.md` to Korean (#32260)
- @younesbelkada
- Add new model (#32615)
- Mamba / FalconMamba: Fix mamba left padding (#32677)
- FIX / Hub: Also catch for `exceptions.ConnectionError` (#31469)
- Fix: Fix `FalconMamba` training issues due to incompatible kernels (#33195)
- Fix import of `FalconMambaForCausalLM` (#33381)
- @4N3MONE
- 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (#32431)
- @jerryzh168
- Add TorchAOHfQuantizer (#32306)
- @MHRDYN7
- Add Flax Dinov2 (#31960)
- @kamilakesbi
- Add Descript-Audio-Codec model (#31494)
- @Isotr0py
- Fix incorrect vocab size retrieval in GGUF config (#32551)
- Add chat_template for tokenizer extracted from GGUF model (#32908)
- 🚨 Support dequantization for most GGML types (#32625)
- Fix missing head_dim in llama config from gguf model (#33526)
- Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)
- @AhmedAlmaghz
- [i18n-ar] add README_ar.md to README.md (#32583)
- [i18n-ar] Add File : `docs/source/ar/_toctree.yml` (#32696)
- @simonJJJ
- support qwen2-vl (#32318)
- simple align qwen2vl kv_seq_len calculation with qwen2 (#33161)
- fix qwen2vl vision eager-attention (#33213)
- @jpizarrom
- 🚨 Add Blip2ForImageTextRetrieval (#29261)
- @mayank31398
- Granite language models (#31502)
- fix model name and copyright (#33152)
- Granitemoe (#33207)
- @hackyon
- [RoBERTa-based] Add support for sdpa (#30510)
- @Muennighoff
- Add OLMoE (#32406)
- Add paper link (#33305)
- @VladOS95-cyber
- Add Qwen2Moe GGUF loading support (#33264)
- change sequence_bias type of SequenceBiasLogitsProcessor to list, add… (#33375)
- @jiqing-feng
- enable low-precision pipeline (#31625)
- Enable BNB multi-backend support (#31098)
- Python
Published by LysandreJik over 1 year ago
transformers - Release v4.44.2
Patch release v4.44.2, mostly 2 regressions that were not caught for Jamba and for processors!
- Fix: Jamba cache fails to use torch.nn.module (#32894) Authored by @xgal
- Fix: No need to dtype A in Jamba (#32924) @xgal
- Fix: Regression on Processor.save_pretrained caused by #31691 (#32921) Authored by @leloykun
Published by ArthurZucker almost 2 years ago
transformers - Patch release v4.44.1
Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues
- is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
- Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
- Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
- Fix: FA2 with packed training (#32487) by @zucchini-nlp
- Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
- Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
- add back the position ids (#32554) by @ArthurZucker
- Use head_dim if in config for RoPE (#32495) @suiyoubi @ArthurZucker
- Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
- fix multi-gpu with static cache (#32543) by @SunMarc
- Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
- Fix VLM generation issues (#32836) by @zucchini-nlp
- Fix generate with inputs_embeds as input (#32493) (this PR has some cherry-pick)
Full Changelog: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1
Published by ArthurZucker almost 2 years ago
transformers - Release v4.44.0
Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache
This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!
All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova
💥 End-to-end generation compile
Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:
```python3
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
```
⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)
- 3-5x faster torch.compile forward compilation for autoregressive decoder models #32227 by @fxmarty.
As documented on the PR, this makes the whole generation a lot faster when you re-use the cache!
You can see this when you run `model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)`.
🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀
- Offloaded KV Cache #31325 by @n17s: you just have to set `cache_implementation="offloaded"` when calling `from_pretrained` or using this:
```python3
from transformers import GenerationConfig
gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4,
    num_beam_groups=2,
    num_return_sequences=4,
    diversity_penalty=1.0,
    max_new_tokens=50,
    early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
```
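To make the idea concrete, here is a minimal pure-Python sketch of layer-wise offloading. It is not the transformers implementation: `ToyOffloadedCache`, its methods, and the `"gpu"`/`"cpu"` labels are hypothetical stand-ins. It assumes the forward pass visits layers in order, so only the layer currently being computed needs to be resident on the accelerator.

```python
class ToyOffloadedCache:
    """Toy layer-wise KV cache: only the active layer is 'on the GPU'."""

    def __init__(self, num_layers):
        # Per-layer lists stand in for key/value tensors.
        self.layers = [[] for _ in range(num_layers)]
        # Where each layer currently lives; everything starts on the CPU.
        self.location = ["cpu"] * num_layers

    def update(self, layer_idx, new_kv):
        """Bring one layer on-device, append to it, and evict the others."""
        for idx in range(len(self.layers)):
            self.location[idx] = "gpu" if idx == layer_idx else "cpu"
        self.layers[layer_idx].append(new_kv)
        return self.layers[layer_idx]

    def on_device(self):
        """Return the indices of layers currently resident on the 'GPU'."""
        return [i for i, loc in enumerate(self.location) if loc == "gpu"]


cache = ToyOffloadedCache(num_layers=4)
for step in range(3):          # three decoding steps
    for layer in range(4):     # each step walks the layers in order
        cache.update(layer, f"kv{step}")
        assert len(cache.on_device()) == 1  # never more than one layer resident

print(cache.layers[0])    # ['kv0', 'kv1', 'kv2']
print(cache.on_device())  # [3]
```

The trade-off this illustrates is lower peak accelerator memory in exchange for extra host-to-device transfers on every layer.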
📦 Torch export for static cache
The PyTorch team gave us a great gift: models using the static cache can now be exported with `torch.export` in a form that is directly compatible with ExecuTorch! Find examples here.
- Make static cache compatible with torch.export #32168 by @guangy10
This also unlocks support for prompt reuse:
```python3
import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
```
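The `copy.deepcopy` calls in the snippet above matter because generation extends the cache it receives in place. The following pure-Python sketch shows that behaviour with a plain list standing in for the cache. `prefill` and `generate` here are hypothetical toy helpers, not transformers functions.

```python
import copy


def prefill(prompt_tokens):
    """Pretend to run the model once over the shared prompt and cache the result."""
    return [f"kv({tok})" for tok in prompt_tokens]


def generate(cache, new_tokens):
    """Pretend generation: appends one entry per new token to the cache in place."""
    for tok in new_tokens:
        cache.append(f"kv({tok})")
    return len(cache)


prompt_cache = prefill(["You", "are", "helpful"])

# Each request works on its own copy, so the shared prefix stays untouched.
len_a = generate(copy.deepcopy(prompt_cache), ["Why", "?"])
len_b = generate(copy.deepcopy(prompt_cache), ["What", "city", "?"])

print(len(prompt_cache), len_a, len_b)  # 3 5 6
```

Without the copy, the second request would start from a cache that already contains the first request's tokens.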
Gemma2: assisted decoding
Gemma 2: support assisted generation #32357 by @gante
We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.
```py
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON'T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
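For readers new to assisted generation, the control flow can be sketched in pure Python. This is a simplified greedy variant under stated assumptions, not the library's implementation: `target_next` and `draft_next` are hypothetical toy "models" over integer tokens, and the sampling case used in the snippet above replaces the exact-match check with a probabilistic acceptance rule.

```python
def target_next(seq):
    """Toy 'large' model: next token is the previous token plus one, modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy 'assistant': agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def assisted_greedy(prompt, max_new_tokens, num_draft=3):
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # 1) The assistant cheaply proposes a few tokens.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(seq + draft))
        # 2) The target checks every proposed position (one batched pass in practice).
        target_calls += 1
        accepted = []
        for i in range(num_draft):
            expected = target_next(seq + draft[:i])
            accepted.append(expected)   # the target's own token is always kept
            if expected != draft[i]:    # first disagreement: discard the rest
                break
        seq.extend(accepted)
    return seq[: len(prompt) + max_new_tokens], target_calls


def plain_greedy(prompt, max_new_tokens):
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


out, calls = assisted_greedy([1], max_new_tokens=8)
assert out == plain_greedy([1], 8)  # same text as decoding with the target alone
print(out, calls)
```

In this toy run the target model is consulted 4 times instead of 8, while the output matches plain greedy decoding token for token.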
Nemotron support
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
- Add Nemotron HF Support #31699
Codestral support
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.
Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.
Codestral uses the Mamba2 architecture. Removing all the einops calls was a bit of a pain, but we hope we made it better for everyone!
- Add codestral mamba2 #32080 by @molbap and @vasqu
Breaking changes:
We removed the default chat templates from the code; they should all be on the Hub!
- 🚨 No more default chat templates #31733 by @Rocketknight1
Long-form decoding for whisper, even faster:
Our great @sanchit-gandhi worked on porting the recent compile upgrades to long-form decoding in:
- [whisper] compile compatibility with long-form decoding #31772
What's Changed
- Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in https://github.com/huggingface/transformers/pull/31629
- Updated `ruff` to the latest version by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
- fix by @gante in https://github.com/huggingface/transformers/pull/32162
- fix: Fixed an if condition that is always evaluating to true by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32160
- [docs] change temperature to a positive value by @faaany in https://github.com/huggingface/transformers/pull/32077
- adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by @rohitdwivedula in https://github.com/huggingface/transformers/pull/32171
- fix: default value reflects the runtime environment variables rather than the ones present at import time. by @junrae6454 in https://github.com/huggingface/transformers/pull/32153
- Update qwen2.md by @ArtificialZeng in https://github.com/huggingface/transformers/pull/32108
- Remove conversational pipeline tests by @amyeroberts in https://github.com/huggingface/transformers/pull/32099
- RoPE: relaxed rope validation by @gante in https://github.com/huggingface/transformers/pull/32182
- let's not warn when someone is running a forward by @ArthurZucker in https://github.com/huggingface/transformers/pull/32176
- Fix resize embedding with Deepspeed by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32192
- Fix float8_e4m3fn in modeling_utils by @SunMarc in https://github.com/huggingface/transformers/pull/32193
- Support dequantizing GGUF FP16 format by @PenutChen in https://github.com/huggingface/transformers/pull/31783
- 🚨 No more default chat templates by @Rocketknight1 in https://github.com/huggingface/transformers/pull/31733
- fix: Replaced deprecated `unittest method` with the correct one by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
- [whisper] fix short-form output type by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32178
- remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 by @statelesshz in https://github.com/huggingface/transformers/pull/32210
- Update question_answering.py by @avlewis in https://github.com/huggingface/transformers/pull/32208
- [BigBird Pegasus] set _supports_param_buffer_assignment to False by @kashif in https://github.com/huggingface/transformers/pull/32222
- [warnings] fix E721 warnings by @kashif in https://github.com/huggingface/transformers/pull/32223
- Follow up for #31973 by @ydshieh in https://github.com/huggingface/transformers/pull/32025
- translate philosophy.md to chinese by @statelesshz in https://github.com/huggingface/transformers/pull/32177
- Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac by @jrhe in https://github.com/huggingface/transformers/pull/31846
- Fix code snippet for Grounding DINO by @qubvel in https://github.com/huggingface/transformers/pull/32229
- Generation: stop at `eos` for assisted decoding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
- Llava: generate without images by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32183
- Resize embeds with DeepSpeed by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32214
- don't log base model architecture in wandb if log model is false by @joaonadkarni in https://github.com/huggingface/transformers/pull/32143
- Refactor: Removed un-necessary `object` base class by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
- Adds: extra_repr for RMSNorm layers in most models by @rohitdwivedula in https://github.com/huggingface/transformers/pull/32204
- Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 by @catalys1 in https://github.com/huggingface/transformers/pull/31934
- [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` by @faaany in https://github.com/huggingface/transformers/pull/32039
- Flash-Attn: fix generation when no attention mask or no pading by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32241
- More flexible trigger condition by @ydshieh in https://github.com/huggingface/transformers/pull/32251
- Llama 3.1: replace for loop by tensor ops at inv_freq initialization by @gante in https://github.com/huggingface/transformers/pull/32244
- 🚨 Bloom support for cache class by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31445
- Upload new model failure report to Hub by @ydshieh in https://github.com/huggingface/transformers/pull/32264
- Optimize t5 tokenize logic to avoid redundant calls by @leejet in https://github.com/huggingface/transformers/pull/32270
- fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
- Repo: remove exceptions in `check_docstrings` by @gante in https://github.com/huggingface/transformers/pull/32259
- make `p_mask` a numpy array before passing to `select_starts_ends` by @faaany in https://github.com/huggingface/transformers/pull/32076
- fix(docs): Fixed a link in docs by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32274
- Generate: end-to-end compilation by @gante in https://github.com/huggingface/transformers/pull/30788
- Whisper tokenizer word level timestamps by @kamilakesbi in https://github.com/huggingface/transformers/pull/32197
- [pipeline] fix padding for 1-d tensors by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/31776
- Make static cache compatible with torch.export by @guangy10 in https://github.com/huggingface/transformers/pull/32168
- Add stream messages from agent run for gradio chatbot by @aymeric-roucher in https://github.com/huggingface/transformers/pull/32142
- use torch 2.4 in 2 CI jobs by @ydshieh in https://github.com/huggingface/transformers/pull/32302
- Docs: fix GaLore optimizer code example by @gil2rok in https://github.com/huggingface/transformers/pull/32249
- Fix GGUF dequantize for `gguf==0.9.1` by @Isotr0py in https://github.com/huggingface/transformers/pull/32298
- Cast epochs_trained to int when resuming training by @teddy-f-47 in https://github.com/huggingface/transformers/pull/32286
- feat(ci): set `fetch-depth: 0` in trufflehog checkout step by @McPatate in https://github.com/huggingface/transformers/pull/31663
- Fix M4T for ASR pipeline by @ylacombe in https://github.com/huggingface/transformers/pull/32296
- Docs: formatting nits by @gante in https://github.com/huggingface/transformers/pull/32247
- Alternative agent plan by @plaggy in https://github.com/huggingface/transformers/pull/32295
- fix: Added missing raise keyword for few exceptions by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32333
- fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit by @winglian in https://github.com/huggingface/transformers/pull/32276
- fixes #32329 : The Torch code is correct - to get an average of 10% o… by @fkrasnov2 in https://github.com/huggingface/transformers/pull/32335
- Repo checks: skip docstring checks if not in the diff by @gante in https://github.com/huggingface/transformers/pull/32328
- Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process by @xenova in https://github.com/huggingface/transformers/pull/32191
- LLaVA-NeXT: fix anyres shapes by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32314
- Gemma2 and flash-attention by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32188
- Llama 3.1: Fix incorrect `inv_freq` assignment by @gante in https://github.com/huggingface/transformers/pull/32330
- [Idefics2] - Fix FA2 call for Perceiver layer by @amyeroberts in https://github.com/huggingface/transformers/pull/32275
- Gemma 2: support assisted generation by @gante in https://github.com/huggingface/transformers/pull/32357
- Fix error when streaming to gradio with non-string tool arguments by @aymeric-roucher in https://github.com/huggingface/transformers/pull/32360
- >3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in https://github.com/huggingface/transformers/pull/32227
- fix: Fixed `staticmethods` with self as first argument by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
- fix: warmup_steps check for training_args by @Ricardo-L-C in https://github.com/huggingface/transformers/pull/32236
- LLaVa: add cache class attribute by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32278
- [enc-dec cache] fix bug in indexing by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32370
- [whisper] compile compatibility with long-form decoding by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/31772
- Remove size check between attn_weights and kv_seq_len for phi3 by @helunwencser in https://github.com/huggingface/transformers/pull/32339
- add missing attribute _supports_param_buffer_assignment for gpt-j. by @nv-guomingz in https://github.com/huggingface/transformers/pull/32359
- Check device map for saving tokenizer config on TPU (fix for issue #31971) by @ayukh in https://github.com/huggingface/transformers/pull/32043
- update clean_up_tokenization_spaces warning by @itazap in https://github.com/huggingface/transformers/pull/32371
- Empty list in defaults for LLaMA special tokens during weights conversion by @ViktorooReps in https://github.com/huggingface/transformers/pull/32342
- Fix conflicting key in init kwargs in PreTrainedTokenizerBase by @OmarManzoor in https://github.com/huggingface/transformers/pull/31233
- Offloaded KV Cache by @n17s in https://github.com/huggingface/transformers/pull/31325
- Docker: add `speech` dep to the consistency docker image by @gante in https://github.com/huggingface/transformers/pull/32374
- Fixed Hybrid Cache Shape Initialization. by @OsamaS99 in https://github.com/huggingface/transformers/pull/32163
- Yell at the user if zero-3 init wasn't performed, but expected to have been done by @muellerzr in https://github.com/huggingface/transformers/pull/32299
- Update docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32368
- RoPE: Add numerical tests ✨ by @gante in https://github.com/huggingface/transformers/pull/32380
- [generate] only require an attention mask for mps with torch<2.4 by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32367
- fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. by @fshp971 in https://github.com/huggingface/transformers/pull/32157
- MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. by @Luke20000429 in https://github.com/huggingface/transformers/pull/31500
- Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/32393
- fix: SeamlessM4TFeatureExtractor stride remainder by @TechInterMezzo in https://github.com/huggingface/transformers/pull/32088
- Phi3 tests: fix typing for Python 3.8 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32388
- #32184 save total_vocab_size by @itazap in https://github.com/huggingface/transformers/pull/32240
- add values for neftune by @nbroad1881 in https://github.com/huggingface/transformers/pull/32399
- Fix documentation references to google/bit-50 model by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32407
- Persist embedding type of BART and mBART models after resize by @AbdiHaryadi in https://github.com/huggingface/transformers/pull/32242
- fix: Updated `test_embeded_special_tokens` for luke and mluke models by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
- Respect the config's attn_implementation if set by @amyeroberts in https://github.com/huggingface/transformers/pull/32383
- Fix documentation links and code reference to model llava-next by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32434
- Cache: create docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32150
- Llava: fix checkpoint_doc by @RUFFY-369 in https://github.com/huggingface/transformers/pull/32458
- add the missing flash attention test marker by @faaany in https://github.com/huggingface/transformers/pull/32419
- Update kwargs validation for `preprocess` with decorator by @qubvel in https://github.com/huggingface/transformers/pull/32024
- Fix get large model config for Switch Transformer encoder only tester by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32438
- Dependencies: fix typo by @gante in https://github.com/huggingface/transformers/pull/32389
- Add Nemotron HF Support by @suiyoubi in https://github.com/huggingface/transformers/pull/31699
- Generate: fix end to end compilation by @gante in https://github.com/huggingface/transformers/pull/32465
- Add codestral mamba2 by @molbap in https://github.com/huggingface/transformers/pull/32080
New Contributors
- @RhuiDih made their first contribution in https://github.com/huggingface/transformers/pull/31629
- @rohitdwivedula made their first contribution in https://github.com/huggingface/transformers/pull/32171
- @ArtificialZeng made their first contribution in https://github.com/huggingface/transformers/pull/32108
- @avlewis made their first contribution in https://github.com/huggingface/transformers/pull/32208
- @jrhe made their first contribution in https://github.com/huggingface/transformers/pull/31846
- @joaonadkarni made their first contribution in https://github.com/huggingface/transformers/pull/32143
- @catalys1 made their first contribution in https://github.com/huggingface/transformers/pull/31934
- @leejet made their first contribution in https://github.com/huggingface/transformers/pull/32270
- @guangy10 made their first contribution in https://github.com/huggingface/transformers/pull/32168
- @gil2rok made their first contribution in https://github.com/huggingface/transformers/pull/32249
- @teddy-f-47 made their first contribution in https://github.com/huggingface/transformers/pull/32286
- @plaggy made their first contribution in https://github.com/huggingface/transformers/pull/32295
- @fkrasnov2 made their first contribution in https://github.com/huggingface/transformers/pull/32335
- @helunwencser made their first contribution in https://github.com/huggingface/transformers/pull/32339
- @nv-guomingz made their first contribution in https://github.com/huggingface/transformers/pull/32359
- @ayukh made their first contribution in https://github.com/huggingface/transformers/pull/32043
- @n17s made their first contribution in https://github.com/huggingface/transformers/pull/31325
- @OsamaS99 made their first contribution in https://github.com/huggingface/transformers/pull/32163
- @fshp971 made their first contribution in https://github.com/huggingface/transformers/pull/32157
- @Luke20000429 made their first contribution in https://github.com/huggingface/transformers/pull/31500
- @TechInterMezzo made their first contribution in https://github.com/huggingface/transformers/pull/32088
- @AbdiHaryadi made their first contribution in https://github.com/huggingface/transformers/pull/32242
- @RUFFY-369 made their first contribution in https://github.com/huggingface/transformers/pull/32458
- @suiyoubi made their first contribution in https://github.com/huggingface/transformers/pull/31699
Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0
Published by ArthurZucker almost 2 years ago
transformers - v4.43.4 Patch Release
Patch Release v4.43.4
There was a mix-up in the previous patch; the DeepSpeed fix is now properly included:
- Resize embeds with DeepSpeed https://github.com/huggingface/transformers/pull/32214
🤗 Enjoy the holidays
Published by ArthurZucker almost 2 years ago
transformers - v4.43.3 Patch deepspeed
Patch release v4.43.3: We still saw some bugs, so @zucchini-nlp added:
- ~Resize embeds with DeepSpeed #32214~
- don't log base model architecture in wandb if log model is false #32143
Other fixes:
- [whisper] fix short-form output type #32178, by @sanchit-gandhi, which fixes the short audio temperature fallback!
- [BigBird Pegasus] set _supports_param_buffer_assignment to False #32222 by @kashif. This is mostly related to the new super fast init: some models need this set to False. If you see weird behavior, look for that 😉
Published by ArthurZucker almost 2 years ago
transformers - v4.43.2: Patch release
- Fix float8_e4m3fn in modeling_utils (#32193)
- Fix resize embedding with Deepspeed (#32192)
- let's not warn when someone is running a forward (#32176)
- RoPE: relaxed rope validation (#32182)
Published by LysandreJik almost 2 years ago
transformers - v4.43.1: Patch release
- fix (#32162)
Published by LysandreJik almost 2 years ago
transformers - v4.43.0: Llama 3.1, Chameleon, ZoeDepth, Hiera
Llama
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of llama recipes to showcase usage for inference, total and partial fine-tuning of the different variants.
Chameleon
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates a textual response.
- Chameleon: add model by @zucchini-nlp in #31534
ZoeDepth
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.
- Add ZoeDepth by @NielsRogge in #30136
Hiera
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.
- Adding hiera by @Namangarg110 in #30356
Agents
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine find what to return, so we generalized the final_answer tool for all agents.
Adds final answer tool for all agents by @aymeric-roucher in #31703
Code agent: allow function persistence between steps by @aymeric-roucher in #31769 :point_right: Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!
Agents planning by @aymeric-roucher in #31702 :point_right: This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan will be done. At later steps (like steps 3, 6, 9 if you set planning_interval=3), this plan will be updated by the agent depending on the history of previous steps. More detail soon! Depending on whether it has been merged by then, we could also add:
Add stream messages from agent run for gradio chatbot by @freddyaboulton and @aymeric-roucher in #32142 :point_right: New method stream_to_gradio runs your agent and streams the output of the run as gradio messages, to easily visualize the run in a gradio chatbot!
Adds final answer tool for all agents by @aymeric-roucher in #31703
Code agent: allow function persistence between steps by @aymeric-roucher in #31769
Agents planning by @aymeric-roucher in #31702
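The planning schedule described above can be illustrated with a small sketch. This is not the library's implementation: `is_planning_step` is a hypothetical helper that only reproduces the rule stated in the text, namely a first plan at step 0 and an update every `planning_interval` steps afterwards.

```python
from typing import Optional


def is_planning_step(step: int, planning_interval: Optional[int]) -> bool:
    """Return True when a planning step would run (hypothetical helper)."""
    # Planning stays disabled unless a positive int interval was given.
    if planning_interval is None or planning_interval <= 0:
        return False
    # Step 0 builds the first plan; later multiples of the interval update it.
    return step % planning_interval == 0


planning_steps = [s for s in range(10) if is_planning_step(s, 3)]
print(planning_steps)  # [0, 3, 6, 9]
```

With `planning_interval=3` the helper selects steps 0, 3, 6 and 9, matching the example given in the release note.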
Notable changes to the codebase
A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.
- Llama: RoPE refactor by @gante in #32135
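For context, the quantity at the heart of RoPE is the vector of inverse frequencies. The sketch below uses the standard RoPE formula, `inv_freq[i] = 1 / base^(2i / dim)`, in plain Python. The function names and the default `base=10000.0` are assumptions for illustration; this is not the refactored library code.

```python
from typing import List


def rope_inv_freq(dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE inverse frequencies, one per pair of channels."""
    if dim % 2 != 0:
        raise ValueError("dim must be even: RoPE rotates channels in pairs")
    # Even indices 0, 2, ..., dim - 2 give exponents in [0, 1).
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def rotation_angles(position: int, inv_freq: List[float]) -> List[float]:
    """Angle applied to each channel pair for a token at `position`."""
    return [position * f for f in inv_freq]


inv_freq = rope_inv_freq(dim=8)
print(inv_freq)
print(rotation_angles(position=3, inv_freq=inv_freq))
```

Scaling variants of RoPE generally work by modifying these frequencies, which is one reason a model-agnostic implementation is convenient.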
Breaking changes
TextGenerationPipeline and tokenizer kwargs
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. This means some models using TextGenerationPipeline previously did not add a <bos> by default, which (negatively) impacted their performance. In practice, this is a breaking change.
Example of a script changed as a result of this PR:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
```
- 🚨🚨 TextGenerationPipeline: rely on the tokenizer default kwargs by @gante in #31747
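To make the behaviour change concrete, here is a toy sketch of relying on tokenizer defaults when a flag is unset. `ToyTokenizer` and `build_tokenizer_kwargs` are hypothetical stand-ins, not transformers classes or functions, and the real pipeline logic may differ in detail.

```python
from typing import Dict, List, Optional


class ToyTokenizer:
    """Hypothetical tokenizer whose default is to prepend a <bos> token."""

    def __call__(self, text: str, add_special_tokens: bool = True) -> List[str]:
        tokens = text.split()
        return ["<bos>"] + tokens if add_special_tokens else tokens


def build_tokenizer_kwargs(add_special_tokens: Optional[bool] = None) -> Dict[str, bool]:
    """Forward a flag only when the caller set it explicitly."""
    kwargs: Dict[str, bool] = {}
    if add_special_tokens is not None:
        kwargs["add_special_tokens"] = add_special_tokens
    return kwargs


tokenizer = ToyTokenizer()
# Flag unset: the tokenizer's own default applies, so <bos> is added.
default_tokens = tokenizer("Foo bar", **build_tokenizer_kwargs())
# Flag set explicitly: the caller's choice wins.
explicit_tokens = tokenizer("Foo bar", **build_tokenizer_kwargs(add_special_tokens=False))
print(default_tokens)   # ['<bos>', 'Foo', 'bar']
print(explicit_tokens)  # ['Foo', 'bar']
```

Leaving the flag out of the call, rather than passing a hard-coded value, is what lets each tokenizer apply its own default.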
Bugfixes and improvements
- Fix post gemma merge by @ArthurZucker in #31660
- Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
- [docs] Llama3 by @stevhliu in #31662
- [HybridCache] Fix `get_seq_length` method by @sanchit-gandhi in #31661
- don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
- Fix Gemma2 4d attention mask by @hiyouga in #31674
- Fix return_dict in encodec by @jla524 in #31646
- add gather_use_object arguments by @SangbumChoi in #31514
- Gemma capping is a must for big models by @ArthurZucker in #31698
- Add French version of run scripts tutorial by @jadechoghari in #31483
- dependencies: `keras-nlp<0.14` pin by @gante in #31684
- remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
- Move some test files (`tets/test_xxx_utils.py`) to `tests/utils` by @ydshieh in #31730
- Fix mistral ONNX export by @fxmarty in #31696
- [whisper] static kv cache by @sanchit-gandhi in #31166
- Make tool JSON schemas consistent by @Rocketknight1 in #31756
- Fix documentation for Gemma2. by @jbornschein in #31682
- fix assisted decoding by @jiqing-feng in #31401
- Requires for torch.tensor before casting by @echarlaix in #31755
- handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
- Gemma 2: Update slow tests by @gante in #31759
- Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
- [fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
- Fix RT-DETR cache for generate_anchors by @qubvel in #31671
- Fix RT-DETR weights initialization by @qubvel in #31724
- `pytest_num_workers=4` for some CircleCI jobs by @ydshieh in #31764
- Fix Gemma2 types by @hiyouga in #31779
- Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
- Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
- Fix serialization for offloaded model by @SunMarc in #31727
- Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
- Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
- Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
- Fix gemma tests by @ydshieh in #31794
- Add training support for SigLIP by @aliencaocao in #31495
- Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
- Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
- Fix galore lr display with schedulers by @vasqu in #31710
- Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
- Depth Anything: update conversion script for V2 by @pcuenca in #31522
- Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
- Add FA2 and `sdpa` support for SigLIP by @qubvel in #31499
- Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
- Fix typos by @omahs in #31819
- transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
- Avoid failure `TFBlipModelTest::test_pipeline_image_to_text` by @ydshieh in #31827
- Fix incorrect accelerator device handling for MPS in `TrainingArguments` by @andstor in #31812
- Mamba & RecurrentGemma: enable strict signature by @gante in #31549
- Deprecate `vocab_size` in other two VLMs by @zucchini-nlp in #31681
- FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
- [Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
- chore: remove duplicate words by @hattizai in #31853
- save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
- Test loading generation config with safetensor weights by @gante in #31550
- docs: typo in tf qa example by @chen-keinan in #31864
- Generate: Add new decoding strategy "DoLa" in `.generate()` by @voidism in #29619
- Fix `_init_weights` for `ResNetPreTrainedModel` by @ydshieh in #31851
- Update depth estimation task guide by @merveenoyan in #31860
- Bump zipp from 3.7.0 to 3.19.1 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31871
- Add return type annotation to PreTrainedModel.from_pretrained by @mauvilsa in #31869
- Revert "Fix `_init_weights` for `ResNetPreTrainedModel`" by @ydshieh in #31868
- Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/visual_bert by @dependabot[bot] in #31872
- add warning when using gradient_checkpointing with FSDP full shard by @yundai424 in #31578
- Add conversion for interleave llava by @zucchini-nlp in #31858
- remove duplicate words in msg by @yukionfire in #31876
- Fix file type checks in data splits for contrastive training example script by @npyoung in #31720
- Fix failed tests in #31851 by @ydshieh in #31879
- fix: Removed `duplicate` field definitions in some classes by @Sai-Suraj-27 in #31888
- Push sharded checkpoint to hub when `push_to_hub=True` in `TrainingArguments` by @SunMarc in #31808
- [RT-DETR] Add resources by @NielsRogge in #31815
- Modify `warnings` in a `with` block to avoid flaky tests by @ydshieh in #31893
- Add a condition for nested_detach by @haikuoxin in #31855
- InstructBlipVideo: Update docstring by @zucchini-nlp in #31886
- Fixes to alternating SWA layers in Gemma2 by @turboderp in #31775
- Processor accepts any kwargs by @zucchini-nlp in #31889
- [`ConvertSlow`] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
- [`Gemma2`] Support FA2 softcapping by @ArthurZucker in #31887
- Fix missing methods for Fuyu by @Isotr0py in #31880
- fix: Fixed the `1st argument` name in classmethods by @Sai-Suraj-27 in #31907
- add gather_use_object arguments II by @SangbumChoi in #31799
- Add warning message for beta and gamma parameters by @OmarManzoor in #31654
- Fix fx tests with inputs_embeds by @fxmarty in #31862
- Refactor flash attention implementation in transformers by @ArthurZucker in #31446
- Generate: fix `SlidingWindowCache.reset()` by @gante in #31917
- 🚨 fix(SigLip): remove spurious exclusion of first vision output token by @transmissions11 in #30952
- Allow `Trainer.get_optimizer_cls_and_kwargs` to be overridden by @apoorvkh in #31875
- [Bug Fix] fix qa pipeline tensor to numpy by @jiqing-feng in #31585
- Docker: TF pin on the consistency job by @gante in #31928
- fix prompt strip to support tensors and np arrays by @AvivSham in #27818
- Fix `GenerationMixin.generate` compatibility with pytorch profiler by @fxmarty in #31935
- Generate: remove deprecated code due to `Cache` and `cache_position` being default by @gante in #31898
- Generate: v4.42 deprecations 🧹🧹 by @gante in #31956
- Whisper: move to tensor cpu before converting to np array at decode time by @gante in #31954
- fix: Removed a wrong key-word argument in `sigmoid_focal_loss()` function call by @Sai-Suraj-27 in #31951
- Generate: handle `logits_warper` update in models with custom generate fn by @gante in #31957
- fix: Fixed the arguments in `create_repo()` function call by @Sai-Suraj-27 in #31947
- Notify new docker images built for circleci by @ydshieh in #31701
- Avoid race condition by @ydshieh in #31973
- Masking: remove flakiness from test by @gante in #31939
- Generate: doc nits by @gante in #31982
- Fix the incorrect permutation of gguf by @PenutChen in #31788
- Cambricon MLUs support SDPA and flash_attn by @huismiling in #31102
- Speedup model init on CPU (by 10x+ for llama-3-8B as one example) by @muellerzr in #31771
- [tests] fix deepspeed zero3 config for `test_stage3_nvme_offload` by @faaany in #31881
- Fix bad test about slower init by @muellerzr in #32002
- Tests: remove cuda versions when the result is the same 🧹🧹 by @gante in #31955
- Bug report update by @gante in #31983
- add flash-attn deterministic option to flash-attn>=2.4.1 by @junrae6454 in #31961
- fix: Fixed incorrect dictionary assignment in `src/transformers/__init__.py` by @Sai-Suraj-27 in #31993
- Bug report update -- round 2 by @gante in #32006
- Fix gather when collecting 'num_input_tokens_seen' by @CodeCreator in #31974
- Fix if else and actually enable superfast init by @muellerzr in #32007
- SpeechEncoderDecoder doesn't support param buffer assignments by @muellerzr in #32009
- Fix tests skip by @qubvel in #32012
- Fixed `log messages` that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
- Fix typo in classification function selection logic to improve code consistency by @moses in #32031
- doc: fix broken BEiT and DiNAT model links on Backbone page by @dvrogozh in #32029
- Pass missing arguments to `SeamlessM4Tv2ConformerEncoderLayer.forward()` when gradient checkpointing is enabled by @anferico in #31945
- Add language to word timestamps for Whisper by @robinderat in #31572
- Add `sdpa` and FA2 for CLIP by @qubvel in #31940
- unpin `numpy<2.0` by @ydshieh in #32018
- Chameleon: minor fixes after shipping by @zucchini-nlp in #32037
- Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31458
- Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples by @dependabot[bot] in #32052
- [mistral] Support passing `head_dim` through config (and do not require `head_dim * num_heads == hidden_size`) by @xenova in #32050
- Add torch.compile Support For Mamba by @zhenglongjiepheonix in #31247
- fix: Removed `duplicate entries` in a dictionary by @Sai-Suraj-27 in #32041
- docs: Fixed 2 links in the docs along with some minor fixes by @Sai-Suraj-27 in #32058
- Llava: add default chat templates by @zucchini-nlp in #31691
- [Chameleon, Hiera] Improve docs by @NielsRogge in #32038
- Incorrect Whisper long-form decoding timestamps by @kamilakesbi in #32003
- [mistral] Fix FA2 attention reshape for Mistral Nemo by @xenova in #32065
- VideoLLaVa: fix chat format in docs by @zucchini-nlp in #32083
- Fix progress callback deepcopy by @fozziethebeat in #32070
- Fixes to chameleon docs by @merveenoyan in #32078
- Add image-text-to-text task guide by @merveenoyan in #31777
- Support generating with fallback for short form audio in Whisper by @kamilakesbi in #30984
- Disable quick init for deepspeed by @muellerzr in #32066
- Chameleon: not supported with fast load by @zucchini-nlp in #32091
- Fix tests after `huggingface_hub` 0.24 by @Wauplin in #32054
- Fix shard order by @b-chu in #32023
- Generate: store special token tensors under a unique variable name by @gante in #31980
- fix: Replaced deprecated `mktemp()` function by @Sai-Suraj-27 in #32123
- Mention model_info.id instead of model_info.modelId by @Wauplin in #32106
- [generate] fix eos/pad id check on mps devices by @sanchit-gandhi in #31695
- Fix failing test with race condition by @Rocketknight1 in #32140
- Update `ko/_toctree.yml` and remove `custom_tools.md` to reflect latest changes by @jungnerd in #31969
- fix: Fixed raising `TypeError` instead of `ValueError` for invalid type by @Sai-Suraj-27 in #32111
- [RoBERTa] Minor clarifications to model doc by @bt2513 in #31949
- Return assistant generated tokens mask in apply_chat_template by @yonigottesman in #30650
- Don't default to other weights file when use_safetensors=True by @amyeroberts in #31874
- set warning level to info for special tokens have been added by @ArthurZucker in #32138
- Add new quant method by @SunMarc in #32047
- Add llama3-llava-next-8b to llava_next conversion script by @jamt9000 in #31395
- LLaVaNeXT: pad on right if training by @zucchini-nlp in #32134
- Remove `trust_remote_code` when loading Libri Dummy by @sanchit-gandhi in #31748
- [modelling] remove un-necessary transpose for fa2 attention by @sanchit-gandhi in #31749
- Fix mask creations of `GPTNeoX` and `GPT2` by @vasqu in #31944
- Add method to retrieve used chat template by @KonradSzafer in #32032
- Add YaRN and Dynamic-YaRN RoPE Scaling Methods by @mig-mfreitas in #30910
- Disable quick init for TapasPreTrainedModel by @daniellok-db in #32149
- Modify resize_token_embeddings to ensure output type is same as input by @bayllama in #31979
- gguf conversion add_prefix_space=None for llama3 by @itazap in #31937
- Fix flash attention speed issue by @Cyrilvallez in #32028
- Fix video batching to videollava by @merveenoyan in #32139
- Added mamba.py backend by @alxndrTL in #30139
- Rename Phi-3 rope scaling type by @garg-amit in #31436
- Revert "Incorrect Whisper long-form decoding timestamps " by @sanchit-gandhi in #32148
- Fix typing to be compatible with later py versions by @amyeroberts in #32155
- feat(cache): StaticCache uses index_copy_ to avoid useless copy by @tengomucho in #31857
- Added additional kwarg for successful running of optuna hyperparameter search by @DeF0017 in #31924
- Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629
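The last entry above concerns packing several examples into one sequence while keeping them distinguishable through position IDs. A minimal sketch of that idea follows. `pack_with_position_ids` is a hypothetical helper, not the code added by the PR, and it only shows how position IDs can restart at each example boundary.

```python
from typing import Dict, List


def pack_with_position_ids(examples: List[List[int]]) -> Dict[str, List[int]]:
    """Concatenate token lists and restart position IDs for each example."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    for tokens in examples:
        input_ids.extend(tokens)
        # Positions restart at 0 so example boundaries can be recovered.
        position_ids.extend(range(len(tokens)))
    return {"input_ids": input_ids, "position_ids": position_ids}


packed = pack_with_position_ids([[101, 7, 8], [101, 9]])
print(packed["input_ids"])     # [101, 7, 8, 101, 9]
print(packed["position_ids"])  # [0, 1, 2, 0, 1]
```

In this sketch a position ID of 0 marks the start of a new example, so no padding tokens are needed to separate examples.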
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @aliencaocao
  - Fix float out of range in owlvit and owlv2 when using FP16 or lower precision (#31657)
  - Add torch_empty_cache_steps to TrainingArguments (#31546)
  - Add training support for SigLIP (#31495)
  - Allow FP16 or other precision inference for Pipelines (#31342)
- @voidism
  - Generate: Add new decoding strategy "DoLa" in `.generate()` (#29619)
- @Namangarg110
  - Adding hiera (#30356)
Published by LysandreJik almost 2 years ago
transformers - Patch release v4.42.4
Mostly Gemma2 support for FA2 softcapping, but this release also fixes the sliding window for long context and other typos.
- [Gemma2] Support FA2 softcapping (#31887) by @ArthurZucker
- [ConvertSlow] make sure the order is preserved for addedtokens (#31902) by @ArthurZucker
- Fixes to alternating SWA layers in Gemma2 (#31775) by @turboderp
- Requires for torch.tensor before casting (#31755) by @echarlaix
I was off last week and could not get this out; thanks all for your patience 🥳
Published by ArthurZucker almost 2 years ago
transformers - Patch release v4.42.3
Make sure we have attention softcapping for "eager" GEMMA2 model
After experimenting, we noticed that for the 27b model mostly, softcapping is a must. So adding it back (it should have been there, but an error on my side made it disappear) sorry all! 😭
- Gemma capping is a must for big models (#31698)
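Soft-capping, as commonly defined, squashes attention logits into a bounded range with a scaled tanh: `cap * tanh(score / cap)`. The sketch below applies it to a plain list of scores. The function name and the cap value of 50.0 are assumptions for illustration, not values taken from this release, and this is not the model's actual attention code.

```python
import math
from typing import List


def softcap(scores: List[float], cap: float = 50.0) -> List[float]:
    """Bound each score to [-cap, cap] while staying near-identity around 0."""
    return [cap * math.tanh(s / cap) for s in scores]


capped = softcap([0.0, 1.0, 100.0, -1000.0])
print(capped)
```

Small scores pass through almost unchanged, while very large ones saturate near the cap instead of dominating the softmax.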
Published by ArthurZucker almost 2 years ago
transformers - Patch release v4.42.2
Patch release
Thanks to our 2 contributors for their prompt fixes, which mostly apply to training and FA2!
- Fix Gemma2 4d attention mask (#31674) by @hiyouga
- don't zero out the attention_mask when using sliding window with flash attention (#31670) by @winglian
Published by ArthurZucker almost 2 years ago
transformers - v4.42.1: Patch release
Patch release for commit:
- [HybridCache] Fix get_seq_length method (#31661)
Published by LysandreJik almost 2 years ago
transformers - v4.42.0: Gemma 2, RTDETR, InstructBLIP, LLAVa Next, New Model Adder
New model additions
Gemma-2
The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by Gemma2 Team, Google. Gemma2 models are trained on 6T tokens and released in 2 versions, 2b and 7b.
The abstract from the paper is the following:
This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations
- Add gemma 2 by @ArthurZucker in #31659
RTDETR
The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.
RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.
- New model support RTDETR by @SangbumChoi in #29077
InstructBlip
The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.
InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
- Add video modality for InstrucBLIP by @zucchini-nlp in #30182
LlaVa NeXT Video
The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image datasets, thus increasing the model's performance on videos.
LLaVA-NeXT has surprisingly strong zero-shot performance in understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple images. This generalizes naturally to videos, because a video can be considered a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is a current SOTA among open-source models on the VideoMME benchmark.
- Add LLaVa NeXT Video by @zucchini-nlp in #31252
New model adder
A very significant change makes its way within the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:
The diff_converter tool is here to replace our old Copied from statements, while keeping our core transformers philosophy:
- single model single file
- explicit code
- standardization of modeling code
- readable and educative code
- simple code
- least amount of modularity
This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.
- Diff converter v2 by @ArthurZucker in #30868
Tool-use and RAG model support
We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.
If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.
- Chat Template support for function calling and RAG by @Rocketknight1 in #30621
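To make the idea of generating a JSON schema from a Python function concrete, here is a minimal standard-library sketch. It is an illustration only: `sketch_json_schema`, the type mapping and the `get_temperature` tool are hypothetical, and the library's own generator handles more cases (argument descriptions, nested types and so on).

```python
import inspect
from typing import get_type_hints

# Assumed mapping from Python annotations to JSON schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}

def sketch_json_schema(func):
    """Build a tool description from a function's signature and docstring."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the argument is mandatory
    doc = inspect.getdoc(func) or ""
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": doc.splitlines()[0] if doc else "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

def get_temperature(city: str, celsius: bool = True) -> float:
    """Return the current temperature for a city."""
    return 21.0  # placeholder value for the example

schema = sketch_json_schema(get_temperature)
```

The resulting dictionary is the kind of structure a tool-aware chat template can render into the prompt, which is why the same tool definitions can be reused across model families.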
GGUF support
We extend GGUF support so that models can be fine-tuned within the Python/HF ecosystem before being converted back for use with the GGUF/GGML/llama.cpp libraries.
- Add Qwen2 GGUF loading support by @Isotr0py in #31175
- GGUF: Fix llama 3 GGUF by @younesbelkada in #31358
- Fix llama gguf converter by @SunMarc in #31575
Trainer improvements
A new optimizer is added in the Trainer.
- FEAT / Trainer: LOMO optimizer support by @younesbelkada in #30178
Quantization improvements
Several improvements were made to quantization: a new cache (the quantized KV cache) is added, offering the ability to quantize the cache of generative models and further reduce memory requirements.
Additionally, the quantization documentation has been entirely rewritten to help users choose the best quantization method for their use case.
- Quantized KV Cache by @zucchini-nlp in #30483
- Docs / Quantization: refactor quantization documentation by @younesbelkada in #30942
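The memory saving of a quantized KV cache comes from storing low-bit integer codes instead of floats. The sketch below shows that idea with plain affine quantization on a list of floats; it is an assumption-laden toy (function names, 4-bit width and the sample values are made up) and not the cache implementation added in the PR.

```python
def quantize(values, nbits=4):
    """Affine-quantize floats to integer codes in [0, 2**nbits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << nbits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant inputs
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]

key_states = [0.12, -1.5, 0.98, 2.25, -0.33]   # stand-in for one row of cached keys
codes, scale, zero = quantize(key_states, nbits=4)
restored = dequantize(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(key_states, restored))
```

Each value is stored in 4 bits plus a shared scale and offset, and the reconstruction error stays within half a quantization step.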
Examples
New instance segmentation examples are added by @qubvel
- Instance segmentation examples by @qubvel in #31084
Notable improvements
As a notable improvement to the HF vision models that leverage backbones, we enable leveraging HF pretrained model weights as backbones, with the following API:
```py
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# Load pretrained Hub weights for the backbone instead of random initialization
config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
model = MaskFormerForInstanceSegmentation(config)
```
- Enable HF pretrained backbones by @amyeroberts in #31145
Additionally, we thank @Cyrilvallez for diving into our generate method and greatly reducing the memory requirements.
- Reduce by 2 the memory requirement in `generate()` 🔥🔥🔥 by @Cyrilvallez in #30536
Breaking changes
Remove ConversationalPipeline and Conversation object
Both the ConversationalPipeline and the Conversation object have been deprecated for a while and are removed in this release (4.42).
The TextGenerationPipeline is recommended for this use case, and now accepts chat inputs in the OpenAI API message format.
- 🚨 Remove ConversationalPipeline and Conversation object by @Rocketknight1 in #31165
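As a hedged sketch of the migration, the snippet below passes an OpenAI-style list of role/content messages to a text-generation pipeline. The `chat_reply` helper and the `model_id` argument are hypothetical; the model must ship a chat template, and the import is kept inside the function so that defining it does not load anything.

```python
# OpenAI-style chat input: a list of {"role": ..., "content": ...} dictionaries.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What replaced ConversationalPipeline?"},
]

def chat_reply(model_id, chat, max_new_tokens=64):
    """Run a chat through a text-generation pipeline and return the assistant's text."""
    from transformers import pipeline  # imported lazily; requires transformers to be installed

    pipe = pipeline("text-generation", model=model_id)
    outputs = pipe(chat, max_new_tokens=max_new_tokens)
    # With chat input, "generated_text" holds the conversation with the new assistant turn last.
    return outputs[0]["generated_text"][-1]["content"]
```

Calling `chat_reply("<a chat model id>", messages)` would then play the role the removed Conversation object used to play.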
Remove an accidental duplicate softmax application in FLAVA's attention
Removes duplicate softmax application in FLAVA attention. Likely to have a small change on the outputs but flagging with 🚨 as it will change a bit.
- 🚨 FLAVA: Remove double softmax by @amyeroberts in #31322
Idefics2's ignore_index attribute of the loss is updated to -100
- 🚨 [Idefics2] Update ignore index by @NielsRogge in #30898
out_indices from timm being updated
Recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the input type of out_indices on the create_model call i.e. either tuple or list. Now, this value is always a tuple.
As lists are more useful and consistent for us (we cannot save tuples in configs, they must be converted to lists first), we instead choose to cast out_indices to always be a list.
This may be a slight breaking change for users who create models and rely on out_indices being a tuple. As this only happens when a new model is created, and not when it is saved and reloaded (because of the config), I think this has a low chance of having much of an impact.
- 🚨 out_indices always a list by @amyeroberts in #30941
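The reason tuples cannot survive in a saved config is that JSON has no tuple type. A short standard-library sketch (the config dictionary here is a stand-in, not a real model config):

```python
import json

out_indices = (1, 2, 3)                      # what timm now always reports
config = {"out_indices": out_indices}

# A save/reload round trip through JSON turns the tuple into a list.
reloaded = json.loads(json.dumps(config))

# Casting up front keeps freshly created and reloaded models consistent.
normalized = list(out_indices)
```

Code that needs a tuple can still call `tuple(...)` on the value explicitly.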
datasets referenced in the quantization config get updated to remove references to datasets with restrictive licenses.
- 🚨 Remove dataset with restrictive license by @echarlaix in #31452
Bugfixes and improvements
- Add fixed resize and pad strategy for object detection by @qubvel in #30742
- Enable dynamic resolution input for Swin Transformer and variants by @the-neural-networker in #30656
- Add TokenClassification for Mistral, Mixtral and Qwen2 by @josephenguehard in #29878
- FIX / Quantization: Fix Dockerfile build by @younesbelkada in #30890
- Add support for torch.compile dynamic shapes by @warner-benjamin in #30560
- LLaVa-Next: Update docs with batched inference by @zucchini-nlp in #30857
- DeformableDETR two stage support bfloat16 by @DonggeunYu in #30907
- add return_token_timestamps to WhisperProcessor by @kamilakesbi in #30812
- Fix num_hidden_layers in initialization of new model in Mamba by @SrGonao in #30403
- separate kwargs in processor (similar to #30193) by @Eric2i in #30905
- fix for custom pipeline configuration by @not-lain in #29004
- Add AutoFeatureExtractor support to Wav2Vec2ProcessorWithLM by @ylacombe in #28706
- Fix a shape annotation and typos in `mamba` slow forward by @vasqu in #30691
- `tokenizer_class = "AutoTokenizer"` Llava Family by @ArthurZucker in #30912
- Introduce configured_state arg for accelerator_config by @muellerzr in #29781
- Add torch.compile for Mistral by @zhenglongjiepheonix in #30642
- [docs] Spanish translation of model_memory_anatomy.md by @aaronjimv in #30885
- FIX / TST: Fix expected results on Mistral slow test (A10) by @younesbelkada in #30909
- PaliGemma - fix processor with no input text by @hiyouga in #30916
- CI: AMD MI300 tests fix by @mht-sharma in #30797
- Enforce saving at end of training if saving option chosen by @muellerzr in #30160
- fix: center_crop occasionally outputs off-by-one dimension matrix by @mattlbeck in #30934
- [Benchmark] Reuse `optimum-benchmark` by @ydshieh in #30615
- TST / Workflows: Get slack notifications for docker image build by @younesbelkada in #30891
- Fix swin embeddings interpolation by @amyeroberts in #30936
- Fix inhomogeneous shape error in example by @Zantares in #30434
- update ruff version by @ArthurZucker in #30932
- Update build ci image [push-ci-image] by @ArthurZucker in #30933
- Update video-llava docs by @zucchini-nlp in #30935
- Fix low cpu mem usage tests by @SunMarc in #30808
- [doc] Add references to the fine-tuning blog and distil-whisper to Whisper. by @Vaibhavs10 in #30938
- Avoid extra chunk in speech recognition by @jonatanklosko in #29539
- [whisper] only trigger forced ids warning once by @sanchit-gandhi in #30966
- Paligemma - fix slow tests, add bf16 and f16 slow tests by @molbap in #30851
- Finally fix the missing new model failure CI report by @ydshieh in #30968
- legacy to init the slow tokenizer when converting from slow was wrong by @ArthurZucker in #30972
- Generation: get special tokens from model config by @zucchini-nlp in #30899
- [Whisper] Strip prompt before finding common subsequence by @sanchit-gandhi in #27836
- Fix link in Pipeline documentation by @junhl in #30948
- [Mistral and friends] Update MLP by @NielsRogge in #31057
- Paligemma causal attention mask by @molbap in #30967
- Update object detection with latest resize and pad strategies by @qubvel in #30955
- Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size by @kamilakesbi in #30637
- Push ci image by @ArthurZucker in #30982
- test_custom_4d_attention_mask skip with sliding window attn by @poedator in #30833
- Finish adding support for torch.compile dynamic shapes by @warner-benjamin in #30919
- FIX / Docs: Minor changes in quantization docs by @younesbelkada in #30985
- Fix accelerate failing tests by @SunMarc in #30836
- [tests] add `torch.use_deterministic_algorithms` for XPU by @faaany in #30774
- Add a check that warmup_setps is either 0 or >= 1 by @ymoslem in #30764
- Update 4 `MptIntegrationTests` expected outputs by @ydshieh in #30989
- [Port] TensorFlow implementation of Mistral by @ariG23498 in #29708
- Remove deprecated properties in tokenization_nllb.py and tokenization_nllb_fast.py by @ymoslem in #29834
- Bugfix: WandbCallback uploads initial model checkpoint by @mgerstgrasser in #30897
- add prefix space ignored in llama #29625 by @itazap in #30964
- Fix training speed regression introduced by "optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @kkoehncke in #26139)"
- Do not trigger autoconversion if local_files_only by @Wauplin in #31004
- pin `uv==0.1.45` by @ydshieh in #31006
- Perceiver interpolate position embedding by @g1y5x3 in #30979
- [tests] make `test_model_parallelism` device-agnostic by @faaany in #30844
- FIX / TST: Fix expected results on Mistral AWQ test by @SunMarc in #30971
- allow multi-gpu by @ydshieh in #31011
- Fix resume_download future warning by @Wauplin in #31007
- Quantization / TST: Fix remaining quantization tests by @younesbelkada in #31000
- save the list of new model failures by @ydshieh in #31013
- added interpolation for vitmae model in pytorch as well as tf. by @bhuvanmdev in #30732
- Add split special tokens by @itazap in #30772
- Paligemma- fix devices and dtype assignments by @molbap in #31008
- Redirect transformers_agents doc to agents by @aymeric-roucher in #31054
- unpin uv by @ydshieh in #31055
- Follow up: Fix link in dbrx.md by @eitanturok in #30514
- Update feature request label in template by @amyeroberts in #30940
- Fix quanto tests by @SunMarc in #31062
- Fix pad_to_max_length Whisper by @ylacombe in #30787
- skip `test_model_parallelism` for 2 model test classes by @ydshieh in #31067
- use `@main` by @ydshieh in #31065
- Remove `ninja` from docker image build by @ydshieh in #31080
- fix "piano" typo by @clinty in #31027
- Update quicktour.md to fix broken link to Glossary by @apalkk in #31072
- Remove redundant backend checks in training_args.py by @kevint324 in #30999
- fix from_pretrained in offline mode when model is preloaded in cache by @oOraph in #31010
- Remove float64 cast for OwlVit and OwlV2 to support MPS device by @qubvel in #31071
- Fix OWLv2 post_process_object_detection for multiple images by @qubvel in #31082
- Fix typo in trainer.py by @taslimisina in #31048
- [SuperPoint, PaliGemma] Update docs by @NielsRogge in #31025
- Fix failing tokenizer tests by @LysandreJik in #31083
- Watermark: fix tests by @zucchini-nlp in #30961
- Docs / PEFT: Add PEFT API documentation by @younesbelkada in #31078
- Render chat template tojson filter as unicode by @CISC in #31041
- FIX: Add `accelerate` as a hard requirement by @younesbelkada in #31090
- FIX / OPT: Fix OPT multi-GPU training for `OPTForQuestionAnswering` by @younesbelkada in #31092
- skip `test_multi_gpu_data_parallel_forward` for `vit` and `deit` by @ydshieh in #31086
- Fix PretrainedConfig docstring with deprecated resume_download by @albertvillanova in #31014
- Fix DeepSpeed compatibility with weight_norm by @jonnyli1125 in #30881
- TST: Fix instruct-blip tests by @younesbelkada in #31088
- Docs / Quantization: Redirect deleted page by @younesbelkada in #31063
- Deprecate low use models by @amyeroberts in #30781
- Quantized KV cache: update quanto by @zucchini-nlp in #31052
- FEAT: Add mistral v3 conversion script by @younesbelkada in #30981
- Use `HF_HUB_OFFLINE` + fix has_file in offline mode by @Wauplin in #31016
- Improve `transformers-cli env` reporting by @statelesshz in #31003
- Fix env.py in cases where torch is not present by @Rocketknight1 in #31113
- Fix faulty rstrip in module loading by @Rocketknight1 in #31108
- Rm maintainer + migrate by @muellerzr in #31089
- Fix nightly circleci by @ydshieh in #31114
- FIX / Docs: Fix GPTQ expected number of bits by @younesbelkada in #31111
- Add VLM generation default contributor by @gante in #31115
- Add on_optimizer_step to callback options by @dhruvbpai in #31095
- Cleanup docker build by @ydshieh in #31119
- FIX / Quantization: Add extra validation for bnb config by @younesbelkada in #31135
- fix get_scheduler when name is warmup_stable_decay by @zspo in #31128
- Docs / Quantization: Replace all occurrences of `load_in_8bit` with bnb config by @younesbelkada in #31136
- Workflow: Remove `IS_GITHUB_CI` by @younesbelkada in #31147
- helper by @ArthurZucker in #31152
- pytest -rsfE by @ydshieh in #31140
- Fix quantized cache output by @SunMarc in #31143
- Update sam.md by @asifajrof in #31130
- Quantization: Enhance bnb error message by @younesbelkada in #31160
- [trainer] add sanity evaluation option by @SunMarc in #31146
- Add streaming, various fixes by @aymeric-roucher in #30838
- Added description of quantization_config by @vamsivallepu in #31133
- Fix typo: use_safetenstors to use_safetensors by @CharlesCNorton in #31184
- Remove copied froms for deprecated models by @amyeroberts in #31153
- Token healing by @ahmed-moubtahij in #30081
- [`GemmaModel`] fix small typo by @ArthurZucker in #31202
- Fix Cannot convert [array()] to EagerTensor of dtype int64 by @pavi-ninjaac in #31109
- Ignore non-causal mask in more cases with SDPA by @fxmarty in #30138
- SlidingWindowCache: reduce differences to other Cache classes by @gante in #30970
- Fix `test_compile_static_cache` by @ydshieh in #30991
- fix the get_size_with_aspect_ratio in max_size situation by @SangbumChoi in #30902
- Fix typo in utils by @Bojun-Feng in #31169
- Rename sanity_evaluation to eval_on_start by @Qubitium in #31192
- Wrong translation FR : Contents = Contenu by @jadechoghari in #31186
- Cohere: Fix copied from by @younesbelkada in #31213
- Set greater_is_better to False if metric_for_best_model ends with "loss" by @miivanov90 in #31142
- Fix GPU OOM for `mistral.py::Mask4DTestHard` by @ydshieh in #31212
- [docs] Spanish translation of tokenizer_summary.md by @aaronjimv in #31154
- Pass device in Logits Processor's init by @zucchini-nlp in #29804
- Fix sentence fragment within test comments by @DomHudson in #31218
- fix(PatchTST): Wrong dropout used for PretainHead by @maxstrobel in #31117
- Video-LLaVa: handle any number of frames by @zucchini-nlp in #31221
- Add dynamic resolution input/interpolate position embedding to deit by @p-kris10 in #31131
- fix bf16 issue in text classification pipeline by @chujiezheng in #30996
- Fix pipeline tests - torch imports by @amyeroberts in #31227
- Add new line switch before logging ***** Running {description} ***** by @jacklanda in #31225
- add no split modules for xlmrobertaxl by @ManuelFay in #31223
- Fix `MistralIntegrationTest` by @ydshieh in #31231
- Blip: Deprecate `BlipModel` by @younesbelkada in #31235
- Move out common backbone config param validation by @amyeroberts in #31144
- Upload (daily) CI results to Hub by @ydshieh in #31168
- Specify dtype=torch.bool to avoid xla error by @ysulsky in #31191
- Fixing `name 'torch' is not defined` in `bitsandbytes` integration by @jamesbraza in #31243
- Benchmark GitHub Actions workflow by @ydshieh in #31163
- Early labels validation by @amyeroberts in #31240
- doc: add info about wav2vec2 bert in older wav2vec2 models. by @Vaibhavs10 in #31120
- enable deterministic mode for npu by @statelesshz in #31253
- Add missing Flaubert tokenizer tests by @bastrob in #30492
- Fix circular reference issue in CLIPTokenizerFast by @dhaivat1729 in #31075
- Add condition to `benchmark` job in `push-important-models.yml` by @ydshieh in #31259
- Skip failing JetMOE generation tests by @amyeroberts in #31266
- no need for explicit EXTRA_TOKENS in processing_paligemma.py by @grahamannett in #31022
- [`SwitchTransformer`] Significant performance improvement on MoE blocks by @ranggihwang in #31173
- fix loading special_tokens_map_file by @ZhiyuanChen in #31012
- Make mamba use cache by @zucchini-nlp in #31116
- Generation: fix handling of special tokens by @zucchini-nlp in #31254
- Switch from `cached_download` to `hf_hub_download` in remaining occurrences by @Wauplin in #31284
- fix: `str` should be used not `int` when setting env variables by @statelesshz in #31272
- Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. by @baoleai in #31264
- fix accelerate tests for roberta xl by @SunMarc in #31288
- Enable dynamic resolution input for Beit by @OmarManzoor in #31053
- Mark MobileNetV1ModelTest::test_batching_equivalence as flaky by @amyeroberts in #31258
- Pipeline VQA: Add support for list of images and questions as pipeline input by @BlacCod in #31217
- Fix SwinLayer / DonutSwinLayer / ClapAudioLayer attention mask device by @gorodnitskiy in #31295
- Update text-to-speech.md by @jaguaryang in #31269
- Fixed Wav2Vec2ProcessorWithLM decoding error by @karicotiza in #31188
- Fix jetmoe model by @Cyrilvallez in #31279
- Extend save_pretrained to offloaded models by @blbadger in #27412
- Implement JSON dump conversion for torch_dtype in TrainingArguments by @junrae6454 in #31224
- interpolation added for TVP. by @bhuvanmdev in #30863
- Rename test_model_common_attributes -> test_model_get_set_embeddings by @amyeroberts in #31321
- Use unused prepare_img() function in dinov2 conversion script by @IbrahimAmin1 in #31335
- docs: fix style by @imba-tjd in #31340
- Fix paligemma inverted mask by @molbap in #31207
- docs/zh: fix style by @imba-tjd in #31334
- Decorators for deprecation and named arguments validation by @qubvel in #30799
- Improve error msg when using bitsandbytes by @SunMarc in #31350
- Fix Cohere CI by @ydshieh in #31263
- Fix gradio tool demos by @aymeric-roucher in #31230
- Fast image processor by @amyeroberts in #28847
- Add french translation of AutoBackbone by @jadechoghari in #31300
- Add support to declare imports for code agent by @JasonZhu1313 in #31355
- Fix idefics cache by @zucchini-nlp in #31377
- [Bug Fix] Renamed loss to losses to suppress UnboundLocalError by @her0e1c1 in #31365
- docs: fix broken link by @imba-tjd in #31370
- backbone_utils - fix relative import by @amyeroberts in #31382
- README underline between badges fix by @novialriptide in #31376
- Update comment in modeling_utils.py by @inf3rnus in #31299
- Use huggingface_hub helper function to split state dict by @SunMarc in #31091
- Change JSON serialization to custom json.dumps by @junrae6454 in #31100
- feat(ci): add trufflehog secrets detection by @McPatate in #31344
- [QoL fix] [Image processing] Add warning on assumption of channel dim and avoid infering when inputs are PIL.Image by @aliencaocao in #31364
- Make chat templates part of ProcessorMixin by @Rocketknight1 in #30744
- add initial design for uniform processors + align model by @molbap in #31197
- Add missing French translation of tutoriel_pipeline.md by @jadechoghari in #31396
- Temporarily pin datasets upper version to fix CI by @albertvillanova in #31407
- Support Clip QKV for MPT by @akakakakakaa in #31307
- Pin datasets<2.20.0 for examples by @amyeroberts in #31417
- Fix MusicGen SDPA by @ylacombe in #31208
- Set seed for M4T retain grad test by @ylacombe in #31419
- Fix SpeechT5 `decoder_attention_mask` shape by @ylacombe in #28071
- Change potential `inputs_embeds` padding `logger.warning` to `logger.warning_once` by @naimenz in #31411
- Remove duplicate image processor in auto map by @amyeroberts in #31383
- Install the tensorflow example requirements in docker by @amyeroberts in #31428
- Remove empty create_and_test_config_common_properties tests by @amyeroberts in #31359
- xpu: support xpu backend from stock pytorch (>=2.4) by @dvrogozh in #31238
- Musicgen special tokens in tensors by @zucchini-nlp in #31420
- Fix Bark logits processors device misplacement by @ylacombe in #31416
- Rename misnamed image processor test files by @amyeroberts in #31430
- Generate: fix `tokenizer` being popped twice by @gante in #31427
- [tests] make `TestDeepSpeedModelZoo` device-agnostic by @faaany in #31402
- Support multiple validation datasets when `dataloader_persistent_workers=True` by @bastienlc in #30627
- Pass datasets trust_remote_code by @albertvillanova in #31406
- simple fix by @tokenizer-decode in #31456
- Fix typing errors in `Qwen2ForTokenClassification` by @kevinhu in #31440
- Agents: Improve python interpreter by @aymeric-roucher in #31409
- Donut: fix `generate` call from local path by @gante in #31470
- Make "tool_use" the default chat template key when tools are passed by @Rocketknight1 in #31429
- Fix single letter stop strings by @Rocketknight1 in #31448
- Update chat template docs and bump Jinja version by @Rocketknight1 in #31455
- Improve `PreTrainedTokenizerFast` loading time when there are many added tokens by @ydshieh in #31404
- Fix documentation typos by @qgallouedec in #31476
- Give more useful `metric_for_best_model` errors by @tomaarsen in #31450
- Update perf_train_gpu_many.md by @remyleone in #31451
- [`GPT2`] Add SDPA support by @vasqu in #31172
- Fix autocast incompatibility in RecurrentGemma by @xplip in #30832
- Use self.config_tester.run_common_tests() by @amyeroberts in #31431
- [tests] rename `test_config_object` to `test_ds_config_object` by @faaany in #31403
- Docs / AQLM: Clarify `torch.compile` support for AQLM by @younesbelkada in #31473
- Mamba: add generative tests by @gante in #31478
- Update object_detection.md by @jajupmochi in #31488
- Add docs on zeroshot image classification prompt templates by @aliencaocao in #31343
- auto-detect device when no device is passed to pipeline by @faaany in #31398
- Fix typo: pas_token_id by @ftnext in #30894
- Fix `wandb` integration with `SetFit` model by @timothepearce in #30021
- Consider inheritance in type checking for tensors by @daemyung in #31378
- Add valid columns check in _remove_unused_columns method by @arthasking123 in #31466
- Fix a teeny-tiny typo in `tokenization_utils_base.py`'s docstring by @sadra-barikbin in #31510
- Fix mismatched ` in doc & other common typos by @jhwei in #31516
- RWKV: enable generation tests by @gante in #31490
- unskip 2 tests in cohere by @ydshieh in #31517
- Revive Nightly/Past CI by @ydshieh in #31159
- Deprecate legacy cache + use cache position by @zucchini-nlp in #31491
- SPLIT PR: add user defined symbols and control symbols by @itazap in #31305
- Removed torch.cuda.empty_cache from train loop. by @FoamoftheSea in #31530
- Update mask_generation.md by @nicholicaron in #31543
- Correct @is_flaky test decoration by @qubvel in #31480
- Add implementation of `spectrogram_batch` by @ravenouse in #27159
- chore: fix typos by @xiaoxianBoy in #31559
- Update git templates by @ArthurZucker in #31539
- Fix the error caused by incorrect use of logger in pipeline by @lanyun1103 in #31565
- Fix bug about add_special_tokens and so on by @hiroshi-matsuda-rit in #31496
- Add Jinja as a requirement with the right version cutoff by @Rocketknight1 in #31536
- Fix doc typo in `TrainingArguments` by @qgallouedec in #31503
- Fix is_torch_xpu_available for torch < 2.3 by @amyeroberts in #31573
- Added version constraint on numpy for version <2.0 by @Resteklicken in #31569
- Siglip: add `_no_split_module` by @zucchini-nlp in #31566
- fix output data type of image classification by @jiqing-feng in #31444
- add preprocessing_num_workers to run_classification.py by @jiahuanluo in #31586
- Improve error message for mismatched copies in code blocks by @molbap in #31535
- Add ViTImageProcessorFast to tests by @amyeroberts in #31424
- docs: move translations to `i18n` by @SauravMaheshkar in #31584
- Removed unnecessary `self.projection` call in `VivitTubeletEmbeddings` by @v-iashin in #31632
- [`GPT-NeoX`] Add SDPA support by @vasqu in #31031
- Update RT-DETR code snippet by @qubvel in #31631
- Llama et al. / FSDP : Fix breaking change in 4.40 for FSDP by @younesbelkada in #31161
- Fix RT-DETR inference with float16 and bfloat16 by @qubvel in #31639
- Fix paligemma detection inference by @molbap in #31587
- Generate: fix assisted generation with `past_key_values` passed as kwargs by @gante in #31644
- Fix dtype casting in swinv2 and swinv2sr to allow non-FP32 inference by @aliencaocao in #31589
- Skip tests properly by @amyeroberts in #31308
- Generation: past kv can be None by @zucchini-nlp in #31051
- Fix ONNX exports for Optimum compatible models by @merveenoyan in #31311
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @josephenguehard
- Add TokenClassification for Mistral, Mixtral and Qwen2 (#29878)
- @vasqu
- Fix a shape annotation and typos in `mamba` slow forward (#30691)
- [`GPT2`] Add SDPA support (#31172)
- [`GPT-NeoX`] Add SDPA support (#31031)
- @ariG23498
- [Port] TensorFlow implementation of Mistral (#29708)
- @bhuvanmdev
- added interpolation for vitmae model in pytorch as well as tf. (#30732)
- interpolation added for TVP. (#30863)
- @SangbumChoi
- fix the get_size_with_aspect_ratio in max_size situation (#30902)
- New model support RTDETR (#29077)
- @Cyrilvallez
- Reduce by 2 the memory requirement in `generate()` 🔥🔥🔥 (#30536)
- Fix jetmoe model (#31279)
- @ravenouse
- Add implementation of `spectrogram_batch` (#27159)
- Python
Published by LysandreJik almost 2 years ago
transformers - Release v4.41.2
Release v4.41.2
Mostly fixing some stuff related to trust_remote_code=True and from_pretrained
Loading with local_files_only was having a hard time when a .safetensors file did not exist. This is not expected: instead of trying to convert, we should just fall back to loading the .bin files.
- Do not trigger autoconversion if local_files_only #31004 from @Wauplin fixes this!
- Paligemma: Fix devices and dtype assignments (#31008) by @molbap
- Redirect transformers_agents doc to agents (#31054) @aymeric-roucher
- Fix from_pretrained in offline mode when model is preloaded in cache (#31010) by @oOraph
- Fix faulty rstrip in module loading (#31108) @Rocketknight1
- Python
Published by ArthurZucker about 2 years ago
transformers - Release v4.41.1 Fix PaliGemma finetuning, and some small bugs
Release v4.41.1
Fix PaliGemma finetuning:
The causal mask and label creation were causing label leaks when training. Kudos to @probicheaux for finding and reporting!
- https://github.com/huggingface/transformers/commit/a755745546779ae5c42510bc02a859bdac82b3b7 : PaliGemma - fix processor with no input text (https://github.com/huggingface/transformers/pull/30916) @hiyouga
- https://github.com/huggingface/transformers/commit/a25f7d3c12975fe21eab437dda7363e9024de7c0 : Paligemma causal attention mask (https://github.com/huggingface/transformers/pull/30967) @molbap and @probicheaux
Other fixes:
- https://github.com/huggingface/transformers/commit/bb48e921868ac750417956de941606f7e2fa02ca : tokenizer_class = "AutoTokenizer" Llava Family (https://github.com/huggingface/transformers/pull/30912)
- https://github.com/huggingface/transformers/commit/1d568dfab262f76079eb4f3d05b606d51a0c9e4b : legacy to init the slow tokenizer when converting from slow was wrong (https://github.com/huggingface/transformers/pull/30972)
- https://github.com/huggingface/transformers/commit/b1065aa08ac0da11fcb9e3827cd7eafabe4beebd : Generation: get special tokens from model config (https://github.com/huggingface/transformers/pull/30899) @zucchini-nlp
Reverted https://github.com/huggingface/transformers/commit/4ab7a28216211571fdddba414d4edd8426ab6489
- Python
Published by ArthurZucker about 2 years ago
transformers - v4.41.0: Phi3, JetMoE, PaliGemma, VideoLlava, Falcon2, FalconVLM & GGUF support
New models
Phi3
The Phi-3 model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.
TL;DR: Phi-3 introduces new RoPE scaling methods, which seem to scale fairly well! Phi-3-mini, a 3B model, is available in two context-length variants: 4K and 128K tokens. It is the first model in its class to support a context window of up to 128K tokens, with little impact on quality.
- Phi-3 by @gugarosa in https://github.com/huggingface/transformers/pull/30423
JetMoE
JetMoe-8B is an 8B Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. JetMoe project aims to provide a LLaMA2-level performance and efficient language model with a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by the ModuleFormer. Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts. Given the input tokens, it activates a subset of its experts to process them. This sparse activation schema enables JetMoe to achieve much better training throughput than similar size dense models. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
- Add JetMoE model by @yikangshen in https://github.com/huggingface/transformers/pull/30005
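The sparse activation described above can be illustrated with a generic top-k routing sketch. This is not JetMoE's actual router: the names `top_k_route` and `moe_forward` are hypothetical, the experts are toy scalar functions, and the softmax over the selected experts is one common choice assumed here for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    # Softmax over only the selected experts so their weights sum to 1.
    top = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - top) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy "experts"; only two of them run for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
print(routes, y)
```

Because only `k` experts run per token, compute per token stays close to that of a much smaller dense model even as the total parameter count grows.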
PaliGemma
PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
More than 120 checkpoints have been released; see the collection here!
- Add PaliGemma by @molbap in https://github.com/huggingface/transformers/pull/30814
VideoLlava
Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.
- 💡 Simple baseline, learning united visual representation by alignment before projection: with the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
- 🔥 High performance, complementary learning with video and image: extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.

- Add Video Llava by @zucchini-nlp in https://github.com/huggingface/transformers/pull/29733
Falcon 2 and FalconVLM:

Two new models from TII-UAE! They published a blog post with more details. Falcon2 introduces a parallel MLP, and FalconVLM uses the LLaVA framework.
* Support for Falcon2-11B by @Nilabhra in https://github.com/huggingface/transformers/pull/30771
* Support arbitrary processor by @ArthurZucker in https://github.com/huggingface/transformers/pull/30875
GGUF from_pretrained support

You can now load most GGUF quants directly with transformers' from_pretrained, which converts them to a classic PyTorch model. The API is simple:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# gguf_file names the quantized checkpoint file inside the repo
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```
We plan closer integration with the llama.cpp / GGML ecosystem in the future; see https://github.com/huggingface/transformers/issues/27712 for more details.
- Loading GGUF files support by @LysandreJik in https://github.com/huggingface/transformers/pull/30391
Quantization
New quant methods
In this release we support two new quantization methods contributed by the community: HQQ and EETQ. Read more about how to quantize any transformers model using HQQ and EETQ in the dedicated documentation section.
- Add HQQ quantization support by @mobicham in https://github.com/huggingface/transformers/pull/29637
- [FEAT]: EETQ quantizer support by @dtlzhuangz in https://github.com/huggingface/transformers/pull/30262
dequantize API for bitsandbytes models
In case you want to dequantize models that have been loaded with bitsandbytes, this is now possible through the dequantize API (e.g. to merge adapter weights)
- FEAT / Bitsandbytes: Add `dequantize` API for bitsandbytes quantized models by @younesbelkada in https://github.com/huggingface/transformers/pull/30806
API-wise, you can achieve that with the following:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "facebook/opt-125m"

# Load the model in 4-bit with bitsandbytes
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Convert the quantized weights back to regular (non-quantized) weights
model.dequantize()

text = tokenizer("Hello my name is", return_tensors="pt").to(0)

out = model.generate(**text)
print(tokenizer.decode(out[0]))
```
Generation updates
- Add Watermarking LogitsProcessor and WatermarkDetector by @zucchini-nlp in https://github.com/huggingface/transformers/pull/29676
- Cache: Static cache as a standalone object by @gante in https://github.com/huggingface/transformers/pull/30476
- Generate: add `min_p` sampling by @gante in https://github.com/huggingface/transformers/pull/30639
- Make `Gemma` work with `torch.compile` by @ydshieh in https://github.com/huggingface/transformers/pull/30775
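As a rough illustration of the min-p idea listed above, the sketch below keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. It is a plain-Python approximation for intuition, not the library's logits processor; the function name `min_p_filter` is hypothetical.

```python
import math

def min_p_filter(logits, min_p):
    """Mask out tokens whose probability is below min_p * (top probability)."""
    # Softmax with max-subtraction for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The cutoff scales with the model's confidence in its best token.
    threshold = min_p * max(probs)
    return [x if p >= threshold else -math.inf for x, p in zip(logits, probs)]

logits = [2.0, 1.0, 0.0, -3.0]
filtered = min_p_filter(logits, min_p=0.2)
print(filtered)
```

In transformers itself, the option is exposed as the `min_p` generation argument and is used together with `do_sample=True`.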
SDPA support
- [`BERT`] Add support for sdpa by @hackyon in https://github.com/huggingface/transformers/pull/28802
- Add sdpa and fa2 the Wav2vec2 family. by @kamilakesbi in https://github.com/huggingface/transformers/pull/30121
- add sdpa to ViT [follow up of #29325] by @hyenal in https://github.com/huggingface/transformers/pull/30555
Improved Object Detection
Addition of fine-tuning script for object detection models
- Fix YOLOS image processor resizing by @qubvel in https://github.com/huggingface/transformers/pull/30436
- Add examples for detection models finetuning by @qubvel in https://github.com/huggingface/transformers/pull/30422
- Add installation of examples requirements in CI by @qubvel in https://github.com/huggingface/transformers/pull/30708
- Update object detection guide by @qubvel in https://github.com/huggingface/transformers/pull/30683
Interpolation of embeddings for vision models
Add interpolation of position embeddings. This enables predictions from pretrained models on input images whose size differs from the one the model was originally trained on. Simply pass interpolate_pos_encoding=True when calling the model.
Added for: BLIP, BLIP 2, InstructBLIP, SigLIP, ViViT
```py
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

image = Image.open(requests.get("https://huggingface.co/hf-internal-testing/blip-test-image/resolve/main/demo.jpg", stream=True).raw)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")
# Resize to 500x500 and cast to float16 to match the model weights
inputs = processor(images=image, size={"height": 500, "width": 500}, return_tensors="pt").to("cuda", torch.float16)

# generate() returns token ids, which batch_decode turns into text
predictions = model.generate(**inputs, interpolate_pos_encoding=True)
# Generated text: "a woman and dog on the beach"
generated_text = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()
```
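The idea behind interpolating position embeddings can be sketched as resizing the grid of per-patch embeddings to the new patch grid. The example below is a simplified, hypothetical illustration: `resize_grid` is not a library function, it works on one scalar per grid cell rather than a full embedding vector, and it uses bilinear interpolation with aligned corners as an assumption, which may differ from the interpolation mode a given model uses.

```python
def resize_grid(grid, new_h, new_w):
    """Bilinearly resize a 2D grid (list of rows) to new_h x new_w."""
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map the output row index onto the input row axis (corners aligned).
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            fx = x - x0
            # Blend the four surrounding cells.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# A 2x2 grid of "position embeddings" stretched to a 3x3 patch grid.
pos = [[0.0, 1.0], [2.0, 3.0]]
resized = resize_grid(pos, 3, 3)
print(resized)
```

A larger input image produces more patches than the model saw during training, so the learned grid is stretched to cover the new patch layout instead of being indexed out of range.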
- Blip dynamic input resolution by @zafstojano in https://github.com/huggingface/transformers/pull/30722
- Add dynamic resolution input/interpolate position embedding to SigLIP by @davidgxue in https://github.com/huggingface/transformers/pull/30719
- Enable dynamic resolution for vivit by @jla524 in https://github.com/huggingface/transformers/pull/30630
🚨 might be breaking
- 🚨🚨🚨Deprecate `evaluation_strategy` to `eval_strategy`🚨🚨🚨 by @muellerzr in https://github.com/huggingface/transformers/pull/30190
- 🚨 Add training compatibility for Musicgen-like models by @ylacombe in https://github.com/huggingface/transformers/pull/29802
- 🚨 Update image_processing_vitmatte.py by @rb-synth in https://github.com/huggingface/transformers/pull/30566
Cleanups
- Remove task guides auto-update in favor of links towards task pages by @LysandreJik in https://github.com/huggingface/transformers/pull/30429
- Remove add-new-model in favor of add-new-model-like by @LysandreJik in https://github.com/huggingface/transformers/pull/30424
- Remove mentions of models in the READMEs and link to the documentation page in which they are featured. by @LysandreJik in https://github.com/huggingface/transformers/pull/30420
Not breaking but important for Llama tokenizers
- [`LlamaTokenizerFast`] Refactor default llama by @ArthurZucker in https://github.com/huggingface/transformers/pull/28881
Fixes
- Fix missing `prev_ci_results` by @ydshieh in https://github.com/huggingface/transformers/pull/30313
- Fix: remove `pad token id` in pipeline forward arguments by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30285
- fix Parameter dtype in audio models by @ylacombe in https://github.com/huggingface/transformers/pull/30310
- disable use_cache if using gradient checkpointing by @chenzizhao in https://github.com/huggingface/transformers/pull/30320
- Fix test transposing image with EXIF Orientation tag by @albertvillanova in https://github.com/huggingface/transformers/pull/30319
- Avoid `jnp` import in `utils/generic.py` by @ydshieh in https://github.com/huggingface/transformers/pull/30322
- Fix `AssertionError` in clip conversion script by @ydshieh in https://github.com/huggingface/transformers/pull/30321
- [UDOP] Add special tokens to tokenizer by @NielsRogge in https://github.com/huggingface/transformers/pull/29594
- Enable multi-device for some models by @jla524 in https://github.com/huggingface/transformers/pull/30207
- feat: Upgrade Weights & Biases callback by @parambharat in https://github.com/huggingface/transformers/pull/30135
- [Feature Extractors] Fix kwargs to pre-trained by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30260
- Pipeline: fix `pad_token_id` again by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30338
- [Whisper] Fix slow tests by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30152
- parallel job limit for doctest by @ydshieh in https://github.com/huggingface/transformers/pull/30342
- Transformers Metadata by @LysandreJik in https://github.com/huggingface/transformers/pull/30344
- Deprecate default chat templates by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30346
- Restore casting of masked_spec_embed by @ylacombe in https://github.com/huggingface/transformers/pull/30336
- Update unwrap from accelerate by @SunMarc in https://github.com/huggingface/transformers/pull/29933
- Do not remove half seq length in generation tests by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30016
- Fix config + attn_implementation in AutoModelForCausalLM.from_pretrained by @hiyouga in https://github.com/huggingface/transformers/pull/30299
- Add TF swiftformer by @joaocmd in https://github.com/huggingface/transformers/pull/23342
- [Grounding DINO] Add resources by @NielsRogge in https://github.com/huggingface/transformers/pull/30232
- Nits for model docs by @merveenoyan in https://github.com/huggingface/transformers/pull/29795
- Enable multi-device for more models by @jla524 in https://github.com/huggingface/transformers/pull/30379
- GenerationConfig: warn if pad token is negative by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30187
- Add FSDP config for CPU RAM efficient loading through accelerate by @helloworld1 in https://github.com/huggingface/transformers/pull/30002
- `Llama` family, fix `use_cache=False` generation by @ArthurZucker in https://github.com/huggingface/transformers/pull/30380
- Update docstrings for text generation pipeline by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30343
- Terminator strings for generate() by @Rocketknight1 in https://github.com/huggingface/transformers/pull/28932
- Fix layerwise GaLore optimizer hard to converge with warmup scheduler by @hiyouga in https://github.com/huggingface/transformers/pull/30372
- Jamba: fix left-padding test by @gante in https://github.com/huggingface/transformers/pull/30389
- Fix DETA save_pretrained by @qubvel in https://github.com/huggingface/transformers/pull/30326
- FIX / PEFT: Pass device correctly to peft by @younesbelkada in https://github.com/huggingface/transformers/pull/30397
- [docs] LLM inference by @stevhliu in https://github.com/huggingface/transformers/pull/29791
- show `-rs` to show skip reasons by @ArthurZucker in https://github.com/huggingface/transformers/pull/30318
- Add inputs embeds in generation by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30269
- [Grounding DINO] Add support for cross-attention in GroundingDinoMultiHeadAttention by @EduardoPach in https://github.com/huggingface/transformers/pull/30364
- remove redundant logging from longformer by @riklopfer in https://github.com/huggingface/transformers/pull/30365
- fix: link to HF repo/tree/revision when a file is missing by @mapmeld in https://github.com/huggingface/transformers/pull/30406
- [tests] add `require_torch_sdpa` for test that needs sdpa support by @faaany in https://github.com/huggingface/transformers/pull/30408
- Jax: scipy version pin by @gante in https://github.com/huggingface/transformers/pull/30402
- Fix on "cache position" for assisted generation by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30068
- fix for itemsize => element_size() for torch backwards compat by @winglian in https://github.com/huggingface/transformers/pull/30133
- Make EosTokenCriteria compatible with mps by @pcuenca in https://github.com/huggingface/transformers/pull/30376
- FIX: re-add bnb on docker image by @younesbelkada in https://github.com/huggingface/transformers/pull/30427
- Fix LayoutLMv2 init issue and doctest by @ydshieh in https://github.com/huggingface/transformers/pull/30278
- Remove old TF port docs by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30426
- Rename torch.run to torchrun by @steven-basart in https://github.com/huggingface/transformers/pull/30405
- Fix use_cache for xla fsdp by @alanwaketan in https://github.com/huggingface/transformers/pull/30353
- [`LlamaTokenizerFast`] Refactor default llama by @ArthurZucker in https://github.com/huggingface/transformers/pull/28881
- New model PR needs green (slow tests) CI by @ydshieh in https://github.com/huggingface/transformers/pull/30341
- Add llama3 by @ArthurZucker in https://github.com/huggingface/transformers/pull/30334
- [`Llava`] + CIs fix red cis and llava integration tests by @ArthurZucker in https://github.com/huggingface/transformers/pull/30440
- [tests] make test device-agnostic by @faaany in https://github.com/huggingface/transformers/pull/30444
- fix uncaught init of linear layer in clip's/siglip's for image classification models by @vasqu in https://github.com/huggingface/transformers/pull/30435
- fix jamba slow foward for multi-gpu by @SunMarc in https://github.com/huggingface/transformers/pull/30418
- [SegGPT] Fix loss calculation by @EduardoPach in https://github.com/huggingface/transformers/pull/30421
- Add `paths` filter to avoid the chance of being triggered by @ydshieh in https://github.com/huggingface/transformers/pull/30453
- Fix wrong indent in `utils/check_if_new_model_added.py` by @ydshieh in https://github.com/huggingface/transformers/pull/30456
- [`research_project`] Most of the security issues come from this requirement.txt by @ArthurZucker in https://github.com/huggingface/transformers/pull/29977
- Neuron: When save_safetensor=False, no need to move model to CPU by @jeffhataws in https://github.com/huggingface/transformers/pull/29703
- Enable fp16 on CPU by @muellerzr in https://github.com/huggingface/transformers/pull/30459
- Non blocking support to torch DL's by @muellerzr in https://github.com/huggingface/transformers/pull/30465
- consistent job / pytest report / artifact name correspondence by @ydshieh in https://github.com/huggingface/transformers/pull/30392
- Workflow / ENH: Add SSH into our runners workflow by @younesbelkada in https://github.com/huggingface/transformers/pull/30425
- FIX / Workflow: Change tailscale trigger condition by @younesbelkada in https://github.com/huggingface/transformers/pull/30471
- FIX / Workflow: Fix SSH workflow bug by @younesbelkada in https://github.com/huggingface/transformers/pull/30474
- [fix codellama conversion] by @ArthurZucker in https://github.com/huggingface/transformers/pull/30472
- Script for finding candidate models for deprecation by @amyeroberts in https://github.com/huggingface/transformers/pull/29686
- Fix SigLip classification doctest by @amyeroberts in https://github.com/huggingface/transformers/pull/30475
- Don't run fp16 MusicGen tests on CPU by @amyeroberts in https://github.com/huggingface/transformers/pull/30466
- Prevent crash with `WandbCallback` with third parties by @tomaarsen in https://github.com/huggingface/transformers/pull/30477
- Add WSD scheduler by @visheratin in https://github.com/huggingface/transformers/pull/30231
- Fix Issue #29817 Video Classification Task Guide Using Undeclared Variables by @manju-rangam in https://github.com/huggingface/transformers/pull/30457
- Make accelerate install non-torch dependent by @muellerzr in https://github.com/huggingface/transformers/pull/30463
- Introduce Stateful Callbacks by @muellerzr in https://github.com/huggingface/transformers/pull/29666
- Fix Llava for 0-embeddings by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30473
- Do not use deprecated `SourceFileLoader.load_module()` in dynamic module loading by @XuehaiPan in https://github.com/huggingface/transformers/pull/30370
- Add sidebar tutorial for chat models by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30401
- Quantization: `HfQuantizer` quant method update by @younesbelkada in https://github.com/huggingface/transformers/pull/30484
- [docs] Spanish translation of pipeline_tutorial.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30252
- FEAT: PEFT support for EETQ by @younesbelkada in https://github.com/huggingface/transformers/pull/30449
- Fix the `bitsandbytes` error formatting ("Some modules are dispatched on ...") by @kyo-takano in https://github.com/huggingface/transformers/pull/30494
- Update `dtype_byte_size` to handle torch.float8_e4m3fn/float8_e5m2 types by @mgoin in https://github.com/huggingface/transformers/pull/30488
- Use the Keras set_random_seed in tests by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30504
- Remove skipping logic now that set_epoch exists by @muellerzr in https://github.com/huggingface/transformers/pull/30501
- [`DETR`] Remove timm hardcoded logic in modeling files by @amyeroberts in https://github.com/huggingface/transformers/pull/29038
- [examples] update whisper fine-tuning by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/29938
- Fix GroundingDINO, DPR after BERT SDPA update by @amyeroberts in https://github.com/huggingface/transformers/pull/30506
- load_image - decode b64encode and encodebytes strings by @amyeroberts in https://github.com/huggingface/transformers/pull/30192
- [SegGPT] Fix seggpt image processor by @EduardoPach in https://github.com/huggingface/transformers/pull/29550
- Fix link in dbrx.md by @eitanturok in https://github.com/huggingface/transformers/pull/30509
- Allow boolean FSDP options in fsdp_config by @helloworld1 in https://github.com/huggingface/transformers/pull/30439
- Pass attn_implementation when using AutoXXX.from_config by @amyeroberts in https://github.com/huggingface/transformers/pull/30507
- Fix broken link to Transformers notebooks by @clinty in https://github.com/huggingface/transformers/pull/30512
- Update runner tag for PR slow CI by @ydshieh in https://github.com/huggingface/transformers/pull/30535
- Fix repo. fetch/checkout in PR slow CI job by @ydshieh in https://github.com/huggingface/transformers/pull/30537
- Reenable SDPA's FA2 During Training with torch.compile by @warner-benjamin in https://github.com/huggingface/transformers/pull/30442
- Include safetensors as part of `_load_best_model` by @muellerzr in https://github.com/huggingface/transformers/pull/30553
- Pass `use_cache` in kwargs for GPTNeoX by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30538
- Enable multi-device for more models by @jla524 in https://github.com/huggingface/transformers/pull/30409
- Generate: update links on LLM tutorial doc by @gante in https://github.com/huggingface/transformers/pull/30550
- DBRX: make fixup by @gante in https://github.com/huggingface/transformers/pull/30578
- Fix seq2seq collator padding by @vasqu in https://github.com/huggingface/transformers/pull/30556
- BlipModel: get_multimodal_features method by @XavierSpycy in https://github.com/huggingface/transformers/pull/30438
- Add chat templating support for KeyDataset in text-generation pipeline by @DarshanDeshpande in https://github.com/huggingface/transformers/pull/30558
- Fix generation doctests by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30263
- General PR slow CI by @ydshieh in https://github.com/huggingface/transformers/pull/30540
- Remove `use_square_size` after loading by @ydshieh in https://github.com/huggingface/transformers/pull/30567
- Use text config's vocab size in testing models by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30568
- Encoder-decoder models: move embedding scale to nn.Module by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30410
- Fix Marian model conversion by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30173
- Refactor default chat template warnings by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30551
- Fix QA example by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30580
- remove jax example by @ArthurZucker in https://github.com/huggingface/transformers/pull/30498
- Fix canonical model --model_type in examples by @amyeroberts in https://github.com/huggingface/transformers/pull/30480
- Gemma: update activation warning by @pcuenca in https://github.com/huggingface/transformers/pull/29995
- Bump gitpython from 3.1.32 to 3.1.41 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30587
- Fix image segmentation example - don't reopen image by @amyeroberts in https://github.com/huggingface/transformers/pull/30481
- Improve object detection task guideline by @NielsRogge in https://github.com/huggingface/transformers/pull/29967
- Generate: remove deprecated public decoding functions and streamline logic 🧼 by @gante in https://github.com/huggingface/transformers/pull/29956
- Fix llava half precision and autocast issues by @frasermince in https://github.com/huggingface/transformers/pull/29721
- Fix: failing CI after #30568 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30599
- Fix for Neuron by @michaelbenayoun in https://github.com/huggingface/transformers/pull/30259
- Fix memory leak with CTC training script on Chinese languages by @lucky-bai in https://github.com/huggingface/transformers/pull/30358
- Fix copies for DBRX - neuron fix by @amyeroberts in https://github.com/huggingface/transformers/pull/30610
- fix:missing `output_router_logits` in SwitchTransformers by @lausannel in https://github.com/huggingface/transformers/pull/30573
- Use `contiguous()` in clip checkpoint conversion script by @ydshieh in https://github.com/huggingface/transformers/pull/30613
- phi3 chat_template does not support system role by @amitportnoy in https://github.com/huggingface/transformers/pull/30606
- Docs: fix `generate`-related rendering issues by @gante in https://github.com/huggingface/transformers/pull/30600
- Docs: add missing `StoppingCriteria` autodocs by @gante in https://github.com/huggingface/transformers/pull/30617
- Generate: fix `SinkCache` on Llama models by @gante in https://github.com/huggingface/transformers/pull/30581
- Fix FX tracing issues for Llama by @michaelbenayoun in https://github.com/huggingface/transformers/pull/30619
- Output `None` as attention when layer is skipped by @jonghwanhyeon in https://github.com/huggingface/transformers/pull/30597
- Fix CI after #30410 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30612
- add mlp bias for llama models by @mayank31398 in https://github.com/huggingface/transformers/pull/30031
- Fix W&B run name by @qubvel in https://github.com/huggingface/transformers/pull/30462
- HQQ: PEFT support for HQQ by @younesbelkada in https://github.com/huggingface/transformers/pull/30632
- Prevent `TextGenerationPipeline._sanitize_parameters` from overriding previously provided parameters by @yting27 in https://github.com/huggingface/transformers/pull/30362
- Avoid duplication in PR slow CI model list by @ydshieh in https://github.com/huggingface/transformers/pull/30634
- [`CI update`] Try to use dockers and no cache by @ArthurZucker in https://github.com/huggingface/transformers/pull/29202
- Check if the current compiled version of pytorch supports MPS by @jiaqianjing in https://github.com/huggingface/transformers/pull/30664
- Hotfix-change-ci by @ArthurZucker in https://github.com/huggingface/transformers/pull/30669
- Quantization / HQQ: Fix HQQ tests on our runner by @younesbelkada in https://github.com/huggingface/transformers/pull/30668
- Fix llava next tie_word_embeddings config by @SunMarc in https://github.com/huggingface/transformers/pull/30640
- Trainer._load_from_checkpoint - support loading multiple Peft adapters by @claralp in https://github.com/huggingface/transformers/pull/30505
- Trainer - add cache clearing and the option for batched eval metrics computation by @FoamoftheSea in https://github.com/huggingface/transformers/pull/28769
- Fix typo: llama3.md by @mimbres in https://github.com/huggingface/transformers/pull/30653
- Respect `resume_download` deprecation by @Wauplin in https://github.com/huggingface/transformers/pull/30620
- top-k instead of top-p in MixtralConfig docstring by @sorgfresser in https://github.com/huggingface/transformers/pull/30687
- Bump jinja2 from 3.1.3 to 3.1.4 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30680
- Bump werkzeug from 3.0.1 to 3.0.3 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30679
- Adding _tie_weights() to prediction heads to support low_cpu_mem_usage=True by @hackyon in https://github.com/huggingface/transformers/pull/29024
- Fix `cache_position` initialisation for generation with `use_cache=False` by @nurlanov-zh in https://github.com/huggingface/transformers/pull/30485
- Word-level timestamps broken for short-form audio by @kamilakesbi in https://github.com/huggingface/transformers/pull/30325
- Updated docs of `forward` in `Idefics2ForConditionalGeneration` with correct `ignore_index` value by @zafstojano in https://github.com/huggingface/transformers/pull/30678
- Bump tqdm from 4.63.0 to 4.66.3 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30646
- Bump tqdm from 4.48.2 to 4.66.3 in /examples/research_projects/visual_bert by @dependabot in https://github.com/huggingface/transformers/pull/30645
- Reboot Agents by @aymeric-roucher in https://github.com/huggingface/transformers/pull/30387
- Bump tqdm from 4.48.2 to 4.66.3 in /examples/research_projects/lxmert by @dependabot in https://github.com/huggingface/transformers/pull/30644
- Separate tokenizer tests by @ArthurZucker in https://github.com/huggingface/transformers/pull/30675
- Update `workflow_id` in `utils/get_previous_daily_ci.py` by @ydshieh in https://github.com/huggingface/transformers/pull/30695
- Rename artifact name `prev_ci_results` to `ci_results` by @ydshieh in https://github.com/huggingface/transformers/pull/30697
- Add safetensors to model not found error msg for default use_safetensors value by @davidgxue in https://github.com/huggingface/transformers/pull/30602
- Pin deepspeed by @muellerzr in https://github.com/huggingface/transformers/pull/30701
- Patch CLIP image preprocessor by @rootonchair in https://github.com/huggingface/transformers/pull/30698
- [BitsandBytes] Verify if GPU is available by @NielsRogge in https://github.com/huggingface/transformers/pull/30533
- Llava: remove dummy labels by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30706
- Immutability for data collators by @vasqu in https://github.com/huggingface/transformers/pull/30603
- Cache: models return input cache type by @gante in https://github.com/huggingface/transformers/pull/30716
- Removal of deprecated maps by @LysandreJik in https://github.com/huggingface/transformers/pull/30576
- Fix image post-processing for OWLv2 by @jla524 in https://github.com/huggingface/transformers/pull/30686
- KV cache is no longer a model attribute by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30730
- Generate: consistently handle special tokens as tensors by @gante in https://github.com/huggingface/transformers/pull/30624
- Update CodeLlama references by @osanseviero in https://github.com/huggingface/transformers/pull/30218
- [docs] Update es/pipeline_tutorial.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30684
- Update llama3.md, fix typo by @mimbres in https://github.com/huggingface/transformers/pull/30739
- mlp_only_layers is more flexible than decoder_sparse_step by @eigen2017 in https://github.com/huggingface/transformers/pull/30552
- PEFT / Trainer: Make use of `model.active_adapters()` instead of deprecated `model.active_adapter` whenever possible by @younesbelkada in https://github.com/huggingface/transformers/pull/30738
- [docs] Update link in es/pipeline_webserver.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30745
- hqq - fix weight check in check_quantized_param by @mobicham in https://github.com/huggingface/transformers/pull/30748
- [awq] replace scale when we have GELU by @SunMarc in https://github.com/huggingface/transformers/pull/30074
- Workflow: Replace `actions/post-slack` with centrally defined workflow by @younesbelkada in https://github.com/huggingface/transformers/pull/30737
- [GroundingDino] Adding ms_deform_attn kernels by @EduardoPach in https://github.com/huggingface/transformers/pull/30768
- Llama: fix custom 4D masks, v2 by @poedator in https://github.com/huggingface/transformers/pull/30348
- Generation / FIX: Fix multi-device generation by @younesbelkada in https://github.com/huggingface/transformers/pull/30746
- Qwen: incorrect setup flag by @gante in https://github.com/huggingface/transformers/pull/30776
- enable Pipeline to get device from model by @faaany in https://github.com/huggingface/transformers/pull/30534
- [Object detection pipeline] Lower threshold by @NielsRogge in https://github.com/huggingface/transformers/pull/30710
- Generate: remove near-duplicate sample/greedy copy by @gante in https://github.com/huggingface/transformers/pull/30773
- Port IDEFICS to tensorflow by @a8nova in https://github.com/huggingface/transformers/pull/26870
- Generate: assistant should be greedy in assisted decoding by @gante in https://github.com/huggingface/transformers/pull/30778
- Save other CI jobs' result (torch/tf pipeline, example, deepspeed etc) by @ydshieh in https://github.com/huggingface/transformers/pull/30699
- Deprecate models script by @amyeroberts in https://github.com/huggingface/transformers/pull/30184
- skip low_cpu_mem_usage tests by @SunMarc in https://github.com/huggingface/transformers/pull/30782
- CI: update to ROCm 6.0.2 and test MI300 by @fxmarty in https://github.com/huggingface/transformers/pull/30266
- Fix OWLv2 Doc by @jla524 in https://github.com/huggingface/transformers/pull/30794
- Fix cache type in Idefics2 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30729
- PEFT: Access active_adapters as a property in Trainer by @pashminacameron in https://github.com/huggingface/transformers/pull/30790
- CI: more models wo cache support by @gante in https://github.com/huggingface/transformers/pull/30780
- Deprecate TF weight conversion since we have full Safetensors support now by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30786
- [T5] Adding `model_parallel = False` to `T5ForTokenClassification` and `MT5ForTokenClassification` by @retarfi in https://github.com/huggingface/transformers/pull/30763
- Added the necessay import of module by @ankur0904 in https://github.com/huggingface/transformers/pull/30804
- Add support for custom checkpoints in MusicGen by @jla524 in https://github.com/huggingface/transformers/pull/30011
- Add missing dependencies in image classification example by @jla524 in https://github.com/huggingface/transformers/pull/30820
- Support mixed-language batches in `WhisperGenerationMixin` by @cifkao in https://github.com/huggingface/transformers/pull/29688
- Remove unused module DETR based models by @conditionedstimulus in https://github.com/huggingface/transformers/pull/30823
- Jamba - Skip 4d custom attention mask test by @amyeroberts in https://github.com/huggingface/transformers/pull/30826
- Missing
Optionalin typing. by @xkszltl in https://github.com/huggingface/transformers/pull/30821 - Update dsconfigzero3.json by @pacman100 in https://github.com/huggingface/transformers/pull/30829
- Better llava next. by @nxphi47 in https://github.com/huggingface/transformers/pull/29850
- Deprecate models script - correctly set the model name for the doc file by @amyeroberts in https://github.com/huggingface/transformers/pull/30785
- Use `torch 2.3` for CI by @ydshieh in https://github.com/huggingface/transformers/pull/30837
- Fix llama model sdpa attention forward function masking bug when output_attentions=True by @Aladoro in https://github.com/huggingface/transformers/pull/30652
- [LLaVa-NeXT] Small fixes by @NielsRogge in https://github.com/huggingface/transformers/pull/30841
- [Idefics2] Improve docs, add resources by @NielsRogge in https://github.com/huggingface/transformers/pull/30717
- Cache: add new flag to distinguish models that `Cache` but not static cache by @gante in https://github.com/huggingface/transformers/pull/30800
- Disable the FA backend for SDPA on AMD GPUs by @mht-sharma in https://github.com/huggingface/transformers/pull/30850
- Video-LLaVa: Fix docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30855
- Docs: update example with assisted generation + sample by @gante in https://github.com/huggingface/transformers/pull/30853
- TST / Quantization: Reverting to torch==2.2.1 by @younesbelkada in https://github.com/huggingface/transformers/pull/30866
- Fix VideoLlava imports by @amyeroberts in https://github.com/huggingface/transformers/pull/30867
- TEST: Add llama logits tests by @younesbelkada in https://github.com/huggingface/transformers/pull/30835
- Remove deprecated logic and warnings by @amyeroberts in https://github.com/huggingface/transformers/pull/30743
- Enable device map by @darshana1406 in https://github.com/huggingface/transformers/pull/30870
- Fix dependencies for image classification example by @jla524 in https://github.com/huggingface/transformers/pull/30842
- [whisper] fix multilingual fine-tuning by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30865
- update release script by @ArthurZucker in https://github.com/huggingface/transformers/pull/30880
New Contributors
- @joaocmd made their first contribution in https://github.com/huggingface/transformers/pull/23342
- @kamilakesbi made their first contribution in https://github.com/huggingface/transformers/pull/30121
- @dtlzhuangz made their first contribution in https://github.com/huggingface/transformers/pull/30262
- @steven-basart made their first contribution in https://github.com/huggingface/transformers/pull/30405
- @manju-rangam made their first contribution in https://github.com/huggingface/transformers/pull/30457
- @kyo-takano made their first contribution in https://github.com/huggingface/transformers/pull/30494
- @mgoin made their first contribution in https://github.com/huggingface/transformers/pull/30488
- @eitanturok made their first contribution in https://github.com/huggingface/transformers/pull/30509
- @clinty made their first contribution in https://github.com/huggingface/transformers/pull/30512
- @warner-benjamin made their first contribution in https://github.com/huggingface/transformers/pull/30442
- @XavierSpycy made their first contribution in https://github.com/huggingface/transformers/pull/30438
- @DarshanDeshpande made their first contribution in https://github.com/huggingface/transformers/pull/30558
- @frasermince made their first contribution in https://github.com/huggingface/transformers/pull/29721
- @lucky-bai made their first contribution in https://github.com/huggingface/transformers/pull/30358
- @rb-synth made their first contribution in https://github.com/huggingface/transformers/pull/30566
- @lausannel made their first contribution in https://github.com/huggingface/transformers/pull/30573
- @jonghwanhyeon made their first contribution in https://github.com/huggingface/transformers/pull/30597
- @mobicham made their first contribution in https://github.com/huggingface/transformers/pull/29637
- @yting27 made their first contribution in https://github.com/huggingface/transformers/pull/30362
- @jiaqianjing made their first contribution in https://github.com/huggingface/transformers/pull/30664
- @claralp made their first contribution in https://github.com/huggingface/transformers/pull/30505
- @mimbres made their first contribution in https://github.com/huggingface/transformers/pull/30653
- @sorgfresser made their first contribution in https://github.com/huggingface/transformers/pull/30687
- @nurlanov-zh made their first contribution in https://github.com/huggingface/transformers/pull/30485
- @zafstojano made their first contribution in https://github.com/huggingface/transformers/pull/30678
- @davidgxue made their first contribution in https://github.com/huggingface/transformers/pull/30602
- @rootonchair made their first contribution in https://github.com/huggingface/transformers/pull/30698
- @eigen2017 made their first contribution in https://github.com/huggingface/transformers/pull/30552
- @Nilabhra made their first contribution in https://github.com/huggingface/transformers/pull/30771
- @a8nova made their first contribution in https://github.com/huggingface/transformers/pull/26870
- @pashminacameron made their first contribution in https://github.com/huggingface/transformers/pull/30790
- @retarfi made their first contribution in https://github.com/huggingface/transformers/pull/30763
- @yikangshen made their first contribution in https://github.com/huggingface/transformers/pull/30005
- @ankur0904 made their first contribution in https://github.com/huggingface/transformers/pull/30804
- @conditionedstimulus made their first contribution in https://github.com/huggingface/transformers/pull/30823
- @nxphi47 made their first contribution in https://github.com/huggingface/transformers/pull/29850
- @Aladoro made their first contribution in https://github.com/huggingface/transformers/pull/30652
- @hyenal made their first contribution in https://github.com/huggingface/transformers/pull/30555
- @darshana1406 made their first contribution in https://github.com/huggingface/transformers/pull/30870
Full Changelog: https://github.com/huggingface/transformers/compare/v4.40.2...v4.41.0
- Python
Published by ArthurZucker about 2 years ago
transformers - v4.40.2
Fix torch fx for LLama model
- Fix for Neuron (#30259)
- Fix copies for DBRX - neuron fix (#30610)
Thanks @michaelbenayoun !
- Python
Published by ArthurZucker about 2 years ago
transformers - v4.40.1: fix `EosTokenCriteria` for `Llama3` on `mps`
Kudos to @pcuenca for the prompt fix in:
- Make EosTokenCriteria compatible with mps #30376
This supports EosTokenCriteria on MPS until PyTorch adds the missing functionality.
- Python
Published by ArthurZucker about 2 years ago
transformers - v4.40.0: Llama 3, Idefics 2, Recurrent Gemma, Jamba, DBRX, OLMo, Qwen2MoE, Grounding Dino
New model additions
Llama 3
Llama 3 is supported in this release through the Llama 2 architecture and some fixes in the tokenizers library.
Idefics2

The Idefics2 model was created by the Hugging Face M4 team and authored by Léo Tronchon, Hugo Laurencon, Victor Sanh. The accompanying blog post can be found here.
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, or visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.
- Add Idefics2 by @amyeroberts in #30253
Recurrent Gemma

Recurrent Gemma architecture. Taken from the original paper.
The Recurrent Gemma model was proposed in RecurrentGemma: Moving Past Transformers for Efficient Open Language Models by the Griffin, RLHF and Gemma Teams of Google.
The abstract from the paper is the following:
We introduce RecurrentGemma, an open language model which uses Google’s novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
- Add recurrent gemma by @ArthurZucker in #30143
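The fixed-size state mentioned in the abstract can be illustrated with a toy example. This is a minimal sketch of a scalar linear recurrence, not the Griffin architecture itself: the function name `linear_recurrence` and the `decay` value are hypothetical and chosen only for illustration. The point is that the state stays a single number no matter how long the input sequence is.

```python
def linear_recurrence(inputs, decay=0.9):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over a sequence.

    Toy illustration only: `decay` is an arbitrary constant here, and the
    state is a single float regardless of how many inputs are consumed.
    """
    state = 0.0
    for x in inputs:
        state = decay * state + (1.0 - decay) * x
    return state


# The state has the same size after 10 steps and after 10,000 steps.
short_state = linear_recurrence([1.0] * 10)
long_state = linear_recurrence([1.0] * 10_000)
print(short_state, long_state)
```

A full-attention layer, by contrast, keeps one cache entry per past position, so its memory grows with sequence length.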
Jamba
Jamba is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and an overall of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.
As depicted in the diagram below, Jamba’s architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
Jamba introduces the first HybridCache object that allows it to natively support assisted generation, contrastive search, speculative decoding, beam search and all of the awesome features from the generate API!
- Add jamba by @tomeras91 in #29943
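The one-in-eight attention ratio described above can be pictured with a small sketch. It is illustrative only: the function name `jamba_like_layout`, the position of the attention layer inside each group of eight, and the total layer count are assumptions made for the example, not values read from the Jamba configuration.

```python
from collections import Counter


def jamba_like_layout(num_layers=32, attn_period=8, attn_offset=4):
    """Return one mixer type per layer (hypothetical layout).

    One layer in every `attn_period` layers uses attention; all others use
    a Mamba mixer. The MLP that follows each mixer is omitted because it
    does not change the attention-to-Mamba ratio.
    """
    return [
        "attention" if i % attn_period == attn_offset else "mamba"
        for i in range(num_layers)
    ]


layout = jamba_like_layout()
counts = Counter(layout)
print(counts["attention"], "attention layers out of", len(layout))
```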
DBRX
DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input.
It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
This provides 65x more possible combinations of experts and the authors found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).
- Add DBRX Model by @abhi-mosaic in #29921
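The 65x figure can be checked by counting how many distinct expert subsets each router can select. The sketch below assumes the comparison counts unordered combinations (choose 4 of 16 versus choose 2 of 8), which is consistent with the numbers quoted above.

```python
from math import comb

# Number of ways to pick the active experts for a single token.
dbrx_combinations = comb(16, 4)     # 16 experts, 4 active
mixtral_combinations = comb(8, 2)   # 8 experts, 2 active

ratio = dbrx_combinations / mixtral_combinations
print(dbrx_combinations, mixtral_combinations, ratio)
```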
OLMo
The OLMo model was proposed in OLMo: Accelerating the Science of Language Models by Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi.
OLMo is a series of Open Language Models designed to enable the science of language models. The OLMo models are trained on the Dolma dataset. We release all code, checkpoints, logs (coming soon), and details involved in training these models.
- Add OLMo model family by @2015aroras in #29890
Qwen2MoE
Qwen2MoE is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
Model Details
Qwen2MoE is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. Qwen2MoE has the following architectural choices:
Qwen2MoE is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes. Qwen2MoE employs Mixture of Experts (MoE) architecture, where the models are upcycled from dense language models. For instance, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters in total and 2.7B activated parameters during runtime, while it achieves comparable performance with Qwen1.5-7B, with only 25% of the training resources.
- Add Qwen2MoE by @bozheng-hit in #29377
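As a quick sanity check on the parameter counts quoted above for Qwen1.5-MoE-A2.7B, the share of parameters that is active at runtime can be computed directly. The variable names are illustrative; the two numbers are the ones stated in the description.

```python
total_params_b = 14.3   # total parameters, in billions
active_params_b = 2.7   # parameters activated at runtime, in billions

active_fraction = active_params_b / total_params_b
print(f"{active_fraction:.1%} of parameters are active per token")
```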
Grounding Dino

Taken from the original paper.
The Grounding DINO model was proposed in Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang. Grounding DINO extends a closed-set object detection model with a text encoder, enabling open-set object detection. The model achieves remarkable results, such as 52.5 AP on COCO zero-shot.
- Adding grounding dino by @EduardoPach in #26087
Static pretrained maps
Static pretrained maps have been removed from the library's internals and are currently deprecated. These used to reflect all the available checkpoints for a given architecture on the Hugging Face Hub, but their presence no longer makes sense given the huge growth in checkpoints shared by the community.
To lower the bar for contributing and reviewing models, we are starting by removing legacy objects such as this one, which no longer serve a purpose.
- Remove static pretrained maps from the library's internals by @LysandreJik in #29112
Notable improvements
Processors improvements
Processors are undergoing changes to make them more uniform and clearer to use.
- Separate out kwargs in processor by @amyeroberts in #30193
- [Processor classes] Update docs by @NielsRogge in #29698
SDPA
- re-introduced the fast path for sdpa by @fxmarty in #30070
Push to Hub for pipelines
Pipelines can now be pushed to the Hub using a convenient `push_to_hub` method.
- add `push_to_hub` to pipeline by @not-lain in #29172
Flash Attention 2 for more models (M2M100, NLLB, GPT2, MusicGen) !
Thanks to community contributions, Flash Attention 2 has been integrated for more architectures.
- Adding Flash Attention 2 Support for GPT2 by @EduardoPach in #29226
- Add Flash Attention 2 support to Musicgen and Musicgen Melody by @ylacombe in #29939
- Add Flash Attention 2 to M2M100 model by @visheratin in #30256
Improvements and bugfixes
- [docs] Remove redundant `-` and `the` from custom_tools.md by @windsonsea in #29767
- Fixed typo in quantization_config.py by @kurokiasahi222 in #29766
- OWL-ViT box_predictor inefficiency issue by @RVV-karma in #29712
- Allow `-OO` mode for `docstring_decorator` by @matthid in #29689
- fix issue with logit processor during beam search in Flax by @giganttheo in #29636
- Fix docker image build for `Latest PyTorch + TensorFlow [dev]` by @ydshieh in #29764
- [`LlavaNext`] Fix llava next unsafe imports by @ArthurZucker in #29773
- Cast bfloat16 to float32 for Numpy conversions by @Rocketknight1 in #29755
- Silence deprecations and use the DataLoaderConfig by @muellerzr in #29779
- Add deterministic config to `set_seed` by @muellerzr in #29778
- Add support for `torch_dtype` in the run_mlm example by @jla524 in #29776
- Generate: remove legacy generation mixin imports by @gante in #29782
- Llama: always convert the causal mask in the SDPA code path by @gante in #29663
- Prepend `bos token` to Blip generations by @zucchini-nlp in #29642
- Change in-place operations to out-of-place in LogitsProcessors by @zucchini-nlp in #29680
- [`quality`] update quality check to make sure we check imports 😈 by @ArthurZucker in #29771
- Fix type hint for train_dataset param of Trainer.__init__() to allow IterableDataset. Issue 29678 by @stevemadere in #29738
- Enable AMD docker build CI by @IlyasMoutawwakil in #29803
- Correct llava mask & fix missing setter for `vocab_size` by @fxmarty in #29389
- rm input dtype change in CPU by @jiqing-feng in #28631
- Generate: remove unused attributes in `AssistedCandidateGenerator` by @gante in #29787
- replaced concatenation to f-strings to improve readability and unify … by @igeni in #29785
- [`cleanup`] vestiges of causal mask by @ArthurZucker in #29806
- Complete security policy with mentions of remote code by @LysandreJik in #29707
- [`SuperPoint`] Fix doc example by @amyeroberts in #29816
- [DOCS] Fix typo for llava next docs by @aliencaocao in #29829
- model_summary.md - Restore link to Harvard's Annotated Transformer. by @gamepad-coder in #29702
- Fix the behavior of collecting 'num_input_tokens_seen' by @YouliangHUANG in #29099
- Populate torch_dtype from model to pipeline by @B-Step62 in #28940
- remove quotes in code example by @johko in #29812
- Add warnings if training args differ from checkpoint trainer state by @jonflynng in #29255
- Replace 'decord' with 'av' in VideoClassificationPipeline by @Tyx-main in #29747
- Fix header in IFE task guide by @merveenoyan in #29859
- [docs] Indent ordered list in add_new_model.md by @windsonsea in #29796
- Allow `bos_token_id is None` during the generation with `inputs_embeds` by @LZHgrla in #29772
- Add `cosine_with_min_lr` scheduler in Trainer by @liuyanyi in #29341
- Disable AMD memory benchmarks by @IlyasMoutawwakil in #29871
- Set custom_container in build docs workflows by @Wauplin in #29855
- Support `num_attention_heads` != `num_key_value_heads` in Flax Llama Implementation by @bminixhofer in #29557
- Mamba `slow_forward` gradient fix by @vasqu in #29563
- Fix 29807, sinusoidal positional encodings overwritten by post_init() by @hovnatan in #29813
- Reimplement "Automatic safetensors conversion when lacking these files" by @LysandreJik in #29846
- fix fuyu device_map compatibility by @SunMarc in #29880
- Move `eos_token_id` to stopping criteria by @zucchini-nlp in #29459
- add Cambricon MLUs support by @huismiling in #29627
- MixtralSparseMoeBlock: add gate jitter by @lorenzoverardo in #29865
- Fix typo in T5Block error message by @Mingosnake in #29881
- [`make fix-copies`] update and help by @ArthurZucker in #29924
- [`GptNeox`] don't gather on pkv when using the trainer by @ArthurZucker in #29892
- [`pipeline`]. Zero shot add doc warning by @ArthurZucker in #29845
- [doc] fix some typos and add `xpu` to the testing documentation by @faaany in #29894
- Tests: replace `torch.testing.assert_allclose` by `torch.testing.assert_close` by @gante in #29915
- Add beam search visualizer to the doc by @aymeric-roucher in #29876
- Safe import of LRScheduler by @amyeroberts in #29919
- add functions to inspect model and optimizer status to trainer.py by @CKeibel in #29838
- RoPE models: add numerical sanity-check test for RoPE scaling by @gante in #29808
- [`Mamba`] from pretrained issue with `self.embeddings` by @ArthurZucker in #29851
- [`TokenizationLlama`] fix the way we convert tokens to strings to keep leading spaces 🚨 breaking fix by @ArthurZucker in #29453
- Allow GradientAccumulationPlugin to be configured from AcceleratorConfig by @fabianlim in #29589
- [`BC`] Fix BC for other libraries by @ArthurZucker in #29934
- Fix doc issue #29758 in DebertaV2Config class by @vinayakkgarg in #29842
- [`LlamaSlowConverter`] Slow to Fast better support by @ArthurZucker in #29797
- Update installs in image classification doc by @MariaHei in #29947
- [`StableLm`] Add QK normalization and Parallel Residual Support by @jon-tow in #29745
- Mark `test_eager_matches_sdpa_generate` flaky for some models by @ydshieh in #29479
- Super tiny fix 12 typos about "with with" by @fzyzcjy in #29926
- Fix rope theta for OpenLlama by @jla524 in #29893
- Add warning message for `run_qa.py` by @jla524 in #29867
- fix: get mlflow version from mlflow-skinny by @clumsy in #29918
- Reset alarm signal when the function is ended by @coldnight in #29706
- Update model card and link of blog post. by @bozheng-hit in #29928
- [`BC`] Fix BC for AWQ quant by @TechxGenus in #29965
- Rework tests to compare trainer checkpoint args by @muellerzr in #29883
- Fix FA2 tests by @ylacombe in #29909
- Fix copies main ci by @ArthurZucker in #29979
- [tests] fix the wrong output in `ImageToTextPipelineTests.test_conditional_generation_llava` by @faaany in #29975
- Generate: move misplaced test by @gante in #29902
- [docs] Big model loading by @stevhliu in #29920
- [`generate`] fix breaking change for patch by @ArthurZucker in #29976
- Fix 29807 sinusoidal positional encodings in Flaubert, Informer and XLM by @hovnatan in #29904
- [bnb] Fix bug in `_replace_with_bnb_linear` by @SunMarc in #29958
- Adding FlaxNoRepeatNGramLogitsProcessor by @giganttheo in #29677
- [Docs] Make an ordered list prettier in add_tensorflow_model.md by @windsonsea in #29949
- Fix `skip_special_tokens` for `Wav2Vec2CTCTokenizer._decode` by @msublee in #29311
- Hard error when ignoring tensors. by @Narsil in #27484
- Generate: fix logits processors doctests by @gante in #29718
- Fix `remove_columns` in `text-classification` example by @mariosasko in #29351
- Update `tests/utils/tiny_model_summary.json` by @ydshieh in #29941
- Make EncodecModel.decode ONNX exportable by @fxmarty in #29913
- Fix Swinv2ForImageClassification NaN output by @miguelm-almeida in #29981
- Fix Qwen2Tokenizer by @jklj077 in #29929
- Fix `kwargs` handling in `generate_with_fallback` by @cifkao in #29225
- Fix probability computation in `WhisperNoSpeechDetection` when recomputing scores by @cifkao in #29248
- Fix vipllava for generation by @zucchini-nlp in #29874
- [docs] Fix audio file by @stevhliu in #30006
- Superpoint imports fix by @zucchini-nlp in #29898
- [`Main CIs`] Fix the red cis by @ArthurZucker in #30022
- Make clearer about zero_init requirements by @muellerzr in #29879
- Enable multi-device for efficientnet by @jla524 in #29989
- Add a converter from mamba_ssm -> huggingface mamba by @byi8220 in #29705
- [`ProcessingIdefics`] Attention mask bug with padding by @byi8220 in #29449
- Add `whisper` to `IMPORTANT_MODELS` by @ydshieh in #30046
- skip `test_encode_decode_fast_slow_all_tokens` for now by @ydshieh in #30044
- if output is tuple like facebook/hf-seamless-m4t-medium, waveform is … by @sywangyi in #29722
- Fix mixtral ONNX Exporter Issue. by @AdamLouly in #29858
- [Trainer] Allow passing image processor by @NielsRogge in #29896
- [bnb] Fix offload test by @SunMarc in #30039
- Update quantizer_bnb_4bit.py: In the ValueError string there should be "....you need to set `llm_int8_enable_fp32_cpu_offload=True`...." instead of "load_in_8bit_fp32_cpu_offload=True". by @miRx923 in #30013
- [test fetcher] Always include the directly related test files by @ydshieh in #30050
- Fix `torch.fx` symbolic tracing for LLama by @michaelbenayoun in #30047
- Refactor daily CI workflow by @ydshieh in #30012
- Add docstrings and types for MambaCache by @koayon in #30023
- Fix auto tests by @ydshieh in #30067
- Fix whisper kwargs and generation config by @zucchini-nlp in #30018
- doc: Correct spelling mistake by @caiyili in #30107
- [Whisper] Computing features on GPU in batch mode for whisper feature extractor. by @vaibhavagg303 in #29900
- Change log level to warning for num_train_epochs override by @xu-song in #30014
- Make MLFlow version detection more robust and handles mlflow-skinny by @helloworld1 in #29957
- updated examples/pytorch/language-modeling scripts and requirements.txt to require datasets>=2.14.0 by @Patchwork53 in #30120
- [tests] add `require_bitsandbytes` marker by @faaany in #30116
- fixing issue 30034 - adding data format for run_ner.py by @JINO-ROHIT in #30088
- Patch fix - don't use safetensors for TF models by @amyeroberts in #30118
- [#29174] ImportError Fix: Trainer with PyTorch requires accelerate>=0.20.1 Fix by @UtkarshaGupte in #29888
- Accept token in trainer.push_to_hub() by @mapmeld in #30093
- fix learning rate display in trainer when using galore optimizer by @vasqu in #30085
- Fix falcon with SDPA, alibi but no passed mask by @fxmarty in #30123
- Trainer / Core : Do not change init signature order by @younesbelkada in #30126
- Make vitdet jit trace complient by @fxmarty in #30065
- Fix typo at ImportError by @DrAnaximandre in #30090
- Adding `mps` as device for `Pipeline` class by @fnhirwa in #30080
- Fix failing DeepSpeed model zoo tests by @pacman100 in #30112
- Add datasets.Dataset to Trainer's train_dataset and eval_dataset type hints by @ringohoffman in #30077
- Fix docs Pop2Piano by @zucchini-nlp in #30140
- Revert workaround for TF safetensors loading by @Rocketknight1 in #30128
- [Trainer] Fix default data collator by @NielsRogge in #30142
- [Trainer] Undo #29896 by @NielsRogge in #30129
- Fix slow tests for important models to be compatible with A10 runners by @ydshieh in #29905
- Send headers when converting safetensors by @ydshieh in #30144
- Fix quantization tests by @SunMarc in #29914
- [docs] Fix image segmentation guide by @stevhliu in #30132
- [CI] Fix setup by @SunMarc in #30147
- Fix length related warnings in speculative decoding by @zucchini-nlp in #29585
- Fix and simplify semantic-segmentation example by @qubvel in #30145
- [CI] Quantization workflow fix by @SunMarc in #30158
- [tests] make 2 tests device-agnostic by @faaany in #30008
- Add str to TrainingArguments report_to type hint by @ringohoffman in #30078
- [UDOP] Fix tests by @NielsRogge in #29573
- [UDOP] Improve docs, add resources by @NielsRogge in #29571
- Fix accelerate kwargs for versions <0.28.0 by @vasqu in #30086
- Fix typing annotation in hf_argparser by @xu-song in #30156
- Fixing a bug when MlFlow try to log a torch.tensor by @etiennebonnafoux in #29932
- Fix natten install in docker by @ydshieh in #30161
- FIX / bnb: fix torch compatiblity issue with `itemize` by @younesbelkada in #30162
- Update config class check in auto factory by @Rocketknight1 in #29854
- Fixed typo in comments/documentation for Pipelines documentation by @DamonGuzman in #30170
- Fix Llava chat template examples by @lewtun in #30130
- Guard XLA version imports by @muellerzr in #30167
- chore: remove repetitive words by @hugehope in #30174
- fix: Fixed `ruff` configuration to avoid deprecated configuration warning by @Sai-Suraj-27 in #30179
- Refactor Cohere Model by @saurabhdash2512 in #30027
- Update output of SuperPointForKeypointDetection by @NielsRogge in #29809
- Falcon: make activation, ffn_hidden_size configurable by @sshleifer in #30134
- Docs PR template by @stevhliu in #30171
- ENH: [`CI`] Add new workflow to run slow tests of important models on push main if they are modified by @younesbelkada in #29235
- Fix pipeline logger.warning_once bug by @amyeroberts in #30195
- fix: Replaced deprecated `logger.warn` with `logger.warning` by @Sai-Suraj-27 in #30197
- fix typo by @mdeff in #30220
- fix fuyu doctest by @molbap in #30215
- Fix `RecurrentGemmaIntegrationTest.test_2b_sample` by @ydshieh in #30222
- Update modeling_bark.py by @bes-dev in #30221
- Fix/Update for doctest by @ydshieh in #30216
- Fixed config.json download to go to user-supplied cache directory by @ulatekh in #30189
- Add test for parse_json_file and change typing to os.PathLike by @xu-song in #30183
- fix: Replace deprecated `assertEquals` with `assertEqual` by @Sai-Suraj-27 in #30241
- Set pad_token in run_glue_no_trainer.py #28534 by @JINO-ROHIT in #30234
- fix: Replaced deprecated `typing.Text` with `str` by @Sai-Suraj-27 in #30230
- Refactor doctest by @ydshieh in #30210
- fix: Fixed `type annotation` for compatability with python 3.8 by @Sai-Suraj-27 in #30243
- Fix doctest more (for `docs/source/en`) by @ydshieh in #30247
- round epoch only in console by @xdedss in #30237
- update github actions packages' version to suppress warnings by @ydshieh in #30249
- [tests] add the missing `require_torch_multi_gpu` flag by @faaany in #30250
- [Docs] Update recurrent_gemma.md for some minor nits by @sayakpaul in #30238
- Remove incorrect arg in codellama doctest by @Rocketknight1 in #30257
- Update `ko/_toctree.yml` by @jungnerd in #30062
- More fixes for doctest by @ydshieh in #30265
- FIX: Fix corner-case issue with the important models workflow by @younesbelkada in #30212
- FIX: Fix 8-bit serialization tests by @younesbelkada in #30051
- Allow for str versions of dicts based on typing by @muellerzr in #30227
- Workflow: Update tailscale to release version by @younesbelkada in #30268
- Raise relevent err when wrong type is passed in as the accelerator_config by @muellerzr in #29997
- BLIP - fix pt-tf equivalence test by @amyeroberts in #30258
- fix: Fixed a `raise` statement by @Sai-Suraj-27 in #30275
- Fix test fetcher (doctest) + `Idefics2`'s doc example by @ydshieh in #30274
- Fix SDPA sliding window compatibility by @fxmarty in #30127
- Fix SpeechT5 forward docstrings by @ylacombe in #30287
- FIX / AWQ: Fix failing exllama test by @younesbelkada in #30288
- Configuring Translation Pipelines documents update #27753 by @UtkarshaGupte in #29986
- Enable fx tracing for Mistral by @zucchini-nlp in #30209
- Fix test `ExamplesTests::test_run_translation` by @ydshieh in #30281
- Fix `Fatal Python error: Bus error` in `ZeroShotAudioClassificationPipelineTests` by @ydshieh in #30283
- FIX: Fix push important models CI by @younesbelkada in #30291
- Add token type ids to CodeGenTokenizer by @st81 in #29265
- Add strategy to store results in evaluation loop by @qubvel in #30267
- Upgrading to tokenizers 0.19.0 by @Narsil in #30289
- Re-enable SDPA's FA2 path by @fxmarty in #30070
- Fix quality Olmo + SDPA by @fxmarty in #30302
- Fix donut token2json multiline by @qubvel in #30300
- Fix all torch pipeline failures except one by @ydshieh in #30290
- Add atol for sliding window test by @fxmarty in #30303
- Fix RecurrentGemma device_map by @SunMarc in #30273
- Revert "Re-enable SDPA's FA2 path by @ArthurZucker in #30070)"
- Do not drop mask with SDPA for more cases by @fxmarty in #30311
- FIX: Fixes unexpected behaviour for Llava / LLama & AWQ Fused modules + revert #30070 at the same time by @younesbelkada in #30317
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @bozheng-hit
  - Add Qwen2MoE (#29377)
  - Update model card and link of blog post. (#29928)
- @EduardoPach
  - Adding Flash Attention 2 Support for GPT2 (#29226)
  - Adding grounding dino (#26087)
- @2015aroras
  - Add OLMo model family (#29890)
- @tomeras91
  - Add jamba (#29943)
- @abhi-mosaic
  - Add DBRX Model (#29921)
- Python
Published by LysandreJik about 2 years ago
transformers - Release v4.39.3
The AWQ issue persisted, and there was a regression reported with beam search and input embeddings.
Changes
- Fix BC for AWQ quant #29965
- generate fix breaking change for patch #29976
- Python
Published by ArthurZucker about 2 years ago
transformers - Patch release v4.39.2
Series of fixes for backwards compatibility (AutoAWQ and other quantization libraries, imports from trainer_pt_utils) and functionality (LLaMA tokenizer conversion)
- Safe import of LRScheduler #29919
- [`BC`] Fix BC for other libraries #29934
- [`LlamaSlowConverter`] Slow to Fast better support #29797
- Python
Published by amyeroberts about 2 years ago
transformers - Patch release v4.39.1
Patch release to fix some breaking changes to LLaVA model, fixes/cleanup for Cohere & Gemma and broken doctest
- Correct llava mask & fix missing setter for `vocab_size` #29389
- [`cleanup`] vestiges of causal mask #29806
- [`SuperPoint`] Fix doc example (https://github.com/huggingface/transformers/pull/29816)
- Python
Published by amyeroberts about 2 years ago
transformers - Release v4.39.0
v4.39.0
🚨 VRAM consumption 🚨
The Llama, Cohere and Gemma models no longer cache the triangular causal mask unless a static cache is used. The caching was reverted by #29753, which fixes the backward-compatibility issues with speed and memory consumption while still supporting compile and static cache. Note that fx is not supported for these models; a patch will follow very soon.
New model addition
Cohere open-source model
Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with Cohere's industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
- Strong accuracy on RAG and Tool Use
- Low latency, and high throughput
- Longer 128k context and lower pricing
- Strong capabilities across 10 key languages
Model weights available on HuggingFace for research and evaluation
Cohere Model Release by @saurabhdash2512 in #29622
LLaVA-NeXT (llava v1.6)
Llava next is the next version of Llava, which includes better support for non padded images, improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.
Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:
- Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
- Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
- Better visual conversation for more scenarios, covering different applications.
- Better world knowledge and logical reasoning.
- Along with performance improvements, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA-1.5. It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. The largest 34B variant finishes training in ~1 day with 32 A100s.

LLaVa-NeXT incorporates a higher input resolution by encoding various patches of the input image. Taken from the original paper.
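The "4x more pixels" figure can be checked with simple arithmetic. The sketch below assumes the LLaVA-1.5 baseline is a single 336x336 input (an assumption, not stated in these notes) and compares it with the three resolutions listed above.

```python
# Pixel counts for the three LLaVA-NeXT input resolutions named above.
# Assumption: the LLaVA-1.5 baseline is a single 336x336 image.
BASE_RESOLUTION = (336, 336)
NEXT_RESOLUTIONS = [(672, 672), (336, 1344), (1344, 336)]


def pixel_count(resolution):
    width, height = resolution
    return width * height


base_pixels = pixel_count(BASE_RESOLUTION)
ratios = {res: pixel_count(res) / base_pixels for res in NEXT_RESOLUTIONS}

for res, ratio in ratios.items():
    print(f"{res[0]}x{res[1]}: {ratio:.0f}x the pixels of 336x336")
```

Under that assumption, all three shapes cover the same pixel budget and differ only in aspect ratio.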
MusicGen Melody
The MusicGen Melody model was proposed in Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
MusicGen Melody is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.
Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass.
- Add MusicGen Melody by @ylacombe in #28819
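One way to picture a token interleaving pattern is a "delay" layout, in which codebook k is shifted right by k steps so that every decoding step carries one token per codebook. The sketch below is illustrative only: the function name, the `PAD` placeholder and the toy token values are assumptions, not the library's implementation.

```python
PAD = -1  # hypothetical placeholder for positions that hold no token


def delay_interleave(codes):
    """codes[k][t] is the token of codebook k at audio frame t.

    Returns one row per codebook in which codebook k is shifted right by k
    steps, so each decoding step (column) holds one entry from every codebook.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    num_steps = num_frames + num_codebooks - 1
    pattern = []
    for k, row in enumerate(codes):
        shifted = [PAD] * k + list(row) + [PAD] * (num_codebooks - 1 - k)
        assert len(shifted) == num_steps
        pattern.append(shifted)
    return pattern


# Toy input: 4 codebooks, 3 frames each.
toy_codes = [
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
    [40, 41, 42],
]
pattern = delay_interleave(toy_codes)
for row in pattern:
    print(row)
```

With K codebooks and T frames, this layout needs T + K - 1 decoding steps instead of K separate passes over the sequence.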
PvT-v2
The PVTv2 model was proposed in PVT v2: Improved Baselines with Pyramid Vision Transformer by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. As an improved variant of PVT, it eschews position embeddings, relying instead on positional information encoded through zero-padding and overlapping patch embeddings. This lack of reliance on position embeddings simplifies the architecture, and enables running inference at any resolution without needing to interpolate them.
- Add PvT-v2 Model by @FoamoftheSea in #26812
UDOP
The UDOP model was proposed in Unifying Vision, Text, and Layout for Universal Document Processing by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.

UDOP architecture. Taken from the original paper.
- Add UDOP by @NielsRogge in #22940
Mamba
This model is a new paradigm architecture based on state-space-models, rather than attention like transformer models. The checkpoints are compatible with the original ones.
- [Add Mamba] Adds support for the Mamba models by @ArthurZucker in #28094
StarCoder2
StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.
- Starcoder2 model - bis by @RaymondLi0 in #29215
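Sliding window attention restricts each token to a fixed number of recent positions. The sketch below builds such a mask in plain Python for a toy sequence. The function name is hypothetical, and it assumes the window counts the current token; real implementations may differ by one position.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window includes the current token, so position i sees
    positions i - window + 1 through i.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy setup, a window of 3 over 6 positions plays the role of the 4,096-token window over the 16,384-token context.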
SegGPT
The SegGPT model was proposed in SegGPT: Segmenting Everything In Context by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. SegGPT employs a decoder-only Transformer that can generate a segmentation mask given an input image, a prompt image and its corresponding prompt mask. The model achieves remarkable one-shot results with 56.1 mIoU on COCO-20 and 85.6 mIoU on FSS-1000.
- Adding SegGPT by @EduardoPach in #27735
Galore optimizer

With Galore, you can pre-train large models on consumer-grade hardware, making LLM pre-training much more accessible to anyone in the community.
Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
Galore is based on low rank approximation of the gradients and can be used out of the box for any model.
Below is a simple snippet that demonstrates how to pre-train mistralai/Mistral-7B-v0.1 on imdb:
```python
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Build the model from its config only, so the weights are randomly initialized
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```
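To get a feel for why a low-rank approximation of the gradients saves optimizer memory, here is a rough count of optimizer-state entries for a single weight matrix. This is a sketch under stated assumptions: Adam keeps two moment tensors shaped like the weight, while the low-rank variant is assumed to keep one m x rank projection matrix plus two moments in the projected rank x n space. The matrix size and rank are illustrative, and the result is not meant to reproduce the percentages quoted above.

```python
def adam_state_entries(m, n):
    # Adam keeps first and second moment estimates, each shaped like the weight.
    return 2 * m * n


def galore_state_entries(m, n, rank):
    # Assumed layout: one m x rank projection matrix plus two moment
    # estimates kept in the projected rank x n space.
    return m * rank + 2 * rank * n


m, n, rank = 4096, 4096, 128  # illustrative sizes, not taken from the release notes
full = adam_state_entries(m, n)
low_rank = galore_state_entries(m, n, rank)

print(f"Adam states:     {full:,} entries")
print(f"Low-rank states: {low_rank:,} entries ({low_rank / full:.1%} of Adam)")
```

The saving for a whole model is smaller than this per-matrix figure, since only the modules matched by optim_target_modules are projected and weights, activations and gradients still take memory.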
Quantization
Quanto integration
Quanto has been integrated with transformers! You can apply simple quantization algorithms with only a few lines of code and minimal changes. Quanto is also compatible with torch.compile.
Check out the announcement blogpost for more details
- [Quantization] Quanto quantizer by @SunMarc in #29023
Exllama 🤝 AWQ
Exllama and AWQ combined together for faster AWQ inference - check out the relevant documentation section for more details on how to use Exllama + AWQ.
- Exllama kernels support for AWQ models by @IlyasMoutawwakil in #28634
MLX Support
Allow models saved or fine-tuned with Apple’s MLX framework to be loaded in transformers (as long as the model parameters use the same names), and improve tensor interoperability. This leverages MLX's adoption of safetensors as their checkpoint format.
- Add mlx support to BatchEncoding.convert_to_tensors by @Y4hL in #29406
- Add support for metadata format MLX by @alexweberk in #29335
- Typo in mlx tensor support by @pcuenca in #29509
- Experimental loading of MLX files by @pcuenca in #29511
Highlighted improvements
Notable memory reduction in Gemma/LLaMa by changing the causal mask buffer type from int64 to boolean.
- Use torch.bool instead of torch.int64 for non-persistant causal mask buffer by @fxmarty in #29241
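The size of the saving follows from element width: a torch.int64 element takes 8 bytes and a torch.bool element takes 1 byte. The sketch below works this out for one square causal mask. The mask size is an illustrative assumption, and the helper name is hypothetical.

```python
def causal_mask_bytes(max_positions, bytes_per_element):
    # A square causal mask has max_positions * max_positions entries.
    return max_positions * max_positions * bytes_per_element


MAX_POSITIONS = 4096  # illustrative value, not taken from the release notes
int64_bytes = causal_mask_bytes(MAX_POSITIONS, 8)  # torch.int64: 8 bytes per element
bool_bytes = causal_mask_bytes(MAX_POSITIONS, 1)   # torch.bool: 1 byte per element

print(f"int64 mask: {int64_bytes / 2**20:.0f} MiB")
print(f"bool mask:  {bool_bytes / 2**20:.0f} MiB")
```

The ratio is 8x regardless of mask size, and the absolute saving grows quadratically with the maximum sequence length.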
Remote code improvements
- Allow remote code repo names to contain "." by @Rocketknight1 in #29175
- simplify get_class_in_module and fix for paths containing a dot by @cebtenzzre in #29262
Breaking changes
The PRs below introduced slightly breaking changes that we believed were necessary for the repository; if these seem to impact your usage of transformers, we recommend checking out the PR description to get more insight into how to leverage the new behavior.
- 🚨🚨[Whisper Tok] Update integration test by @sanchit-gandhi in #29368
- 🚨 Fully revert atomic checkpointing 🚨 by @muellerzr in #29370
[BC 4.37 -> 4.38] for Llama family, memory and speed #29753 (causal mask is no longer a registered buffer)
Fixes and improvements
FIX [Gemma] Fix bad rebase with transformers main by @younesbelkada in #29170
Add training version check for AQLM quantizer. by @BlackSamorez in #29142
[Gemma] Fix eager attention by @sanchit-gandhi in #29187
[Mistral, Mixtral] Improve docs by @NielsRogge in #29084
Fix torch.compile with fullgraph=True when attention_mask input is used by @fxmarty in #29211
fix(mlflow): check mlflow version to use the synchronous flag by @cchen-dialpad in #29195
Fix missing translation in README_ru by @strikoder in #29054
Improve _update_causal_mask performance by @alessandropalla in #29210
[Doc] update model doc qwen2 by @ArthurZucker in #29238
Use torch 2.2 for daily CI (model tests) by @ydshieh in #29208
Cache is_vision_available result by @bmuskalla in #29280
Use DS_DISABLE_NINJA=1 by @ydshieh in #29290
Add non_device_test pytest mark to filter out non-device tests by @fxmarty in #29213
Add feature extraction mapping for automatic metadata update by @merveenoyan in #28944
Generate: v4.38 removals and related updates by @gante in #29171
Track each row separately for stopping criteria by @zucchini-nlp in #29116
[docs] Spanish translation of tasks_explained.md by @aaronjimv in #29224
[i18n-zh] Translated torchscript.md into Chinese by @windsonsea in #29234
🌐 [i18n-ZH] Translate chat_templating.md into Chinese by @shibing624 in #28790
[i18n-vi] Translate README.md to Vietnamese by @hoangsvit in #29229
[i18n-zh] Translated task/asr.md into Chinese by @windsonsea in #29233
Fixed Deformable Detr typo when loading cuda kernels for MSDA by @EduardoPach in #29294
GenerationConfig validate both constraints and force_words_ids by @FredericOdermatt in #29163
Add generate kwargs to VQA pipeline by @regisss in #29134
Cleaner Cache dtype and device extraction for CUDA graph generation for quantizers compatibility by @BlackSamorez in #29079
Image Feature Extraction docs by @merveenoyan in #28973
Fix attn_implementation documentation by @fxmarty in #29295
[tests] enable benchmark unit tests on XPU by @faaany in #29284
Use torch 2.2 for deepspeed CI by @ydshieh in #29246
Add compatibility with skip_memory_metrics for mps device by @SunMarc in #29264
Token level timestamps for long-form generation in Whisper by @zucchini-nlp in #29148
Fix a few typos in GenerationMixin's docstring by @sadra-barikbin in #29277
[i18n-zh] Translate fsdp.md into Chinese by @windsonsea in #29305
FIX [Gemma/CI] Make sure our runners have access to the model by @younesbelkada in #29242
Remove numpy usage from owlvit by @fxmarty in #29326
[require_read_token] fix typo by @ArthurZucker in #29345
[T5 and Llama Tokenizer] remove warning by @ArthurZucker in #29346
[Llama ROPE] Fix torch export but also slow downs in forward by @ArthurZucker in #29198
Disable Mixtral output_router_logits during inference by @LeonardoEmili in #29249
Idefics: generate fix by @gante in #29320
RoPE loses precision for Llama / Gemma + Gemma logits.float() by @danielhanchen in #29285
check if position_ids exists before using it by @jiqing-feng in #29306
[CI] Quantization workflow by @SunMarc in #29046
Better SDPA unmasking implementation by @fxmarty in #29318
[i18n-zh] Sync source/zh/index.md by @windsonsea in #29331
FIX [CI/starcoder2] Change starcoder2 path to correct one for slow tests by @younesbelkada in #29359
FIX [CI]: Fix failing tests for peft integration by @younesbelkada in #29330
FIX [CI] require_read_token in the llama FA2 test by @younesbelkada in #29361
Avoid using uncessary get_values(MODEL_MAPPING) by @ydshieh in #29362
Patch YOLOS and others by @NielsRogge in #29353
Fix @require_read_token in tests by @Wauplin in #29367
Expose offload_buffers parameter of accelerate to PreTrainedModel.from_pretrained method by @notsyncing in #28755
Fix Base Model Name of LlamaForQuestionAnswering by @lenglaender in #29258
FIX [quantization/ESM] Fix ESM 8bit / 4bit with bitsandbytes by @younesbelkada in #29329
[Llama + AWQ] fix prepare_inputs_for_generation 🫠 by @ArthurZucker in #29381
[YOLOS] Fix - return padded annotations by @amyeroberts in #29300
Support subfolder with AutoProcessor by @JingyaHuang in #29169
Fix llama + gemma accelete tests by @SunMarc in #29380
Fix deprecated arg issue by @muellerzr in #29372
Correct zero division error in inverse sqrt scheduler by @DavidAfonsoValente in #28982
[tests] enable automatic speech recognition pipeline tests on XPU by @faaany in #29308
update path to hub files in the error message by @poedator in #29369
[Mixtral] Fixes attention masking in the loss by @DesmonDay in #29363
Workaround for #27758 to avoid ZeroDivisionError by @tleyden in #28756
Convert SlimSAM checkpoints by @NielsRogge in #28379
Fix: Fixed the previous tracking URI setting logic to prevent clashes with original MLflow code. by @seanswyi in #29096
Fix OneFormer post_process_instance_segmentation for panoptic tasks by @nickthegroot in #29304
Fix grad_norm unserializable tensor log failure by @svenschultze in #29212
Avoid edge case in audio utils by @ylacombe in #28836
DeformableDETR support bfloat16 by @DonggeunYu in #29232
[Docs] Spanish Translation -Torchscript md & Trainer md by @njackman-2344 in #29310
FIX [Generation] Fix some issues when running the MaxLength criteria on CPU by @younesbelkada in #29317
Fix max length for BLIP generation by @zucchini-nlp in #29296
[docs] Update starcoder2 paper link by @xenova in #29418
[tests] enable test_pipeline_accelerate_top_p on XPU by @faaany in #29309
[UdopTokenizer] Fix post merge imports by @ArthurZucker in #29451
more fix by @ArthurZucker (direct commit on main)
Revert-commit 0d52f9f582efb82a12e8d9162b43a01b1aa0200f by @ArthurZucker in #29455
[Udop imports] Processor tests were not run. by @ArthurZucker in #29456
Generate: inner decoding methods are no longer public by @gante in #29437
Fix bug with passing capture_* args to neptune callback by @AleksanderWWW in #29041
Update pytest import_path location by @loadams in #29154
Automatic safetensors conversion when lacking these files by @LysandreJik in #29390
[i18n-zh] Translate add_new_pipeline.md into Chinese by @windsonsea in #29432
🌐 [i18n-KO] Translated generation_strategies.md to Korean by @AI4Harmony in #29086
[FIX] offload_weight() takes from 3 to 4 positional arguments but 5 were given by @faaany in #29457
[Docs/Awq] Add docs on exllamav2 + AWQ by @younesbelkada in #29474
[docs] Add starcoder2 docs by @younesbelkada in #29454
Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor by @ringohoffman in #29447
Generate: add tests for caches with pad_to_multiple_of by @gante in #29462
Generate: get generation mode from the generation config instance 🧼 by @gante in #29441
Avoid dummy token in PLD to optimize performance by @ofirzaf in #29445
Fix test failure on DeepSpeed by @muellerzr in #29444
Generate: torch.compile-ready generation config preparation by @gante in #29443
added the max_matching_ngram_size to GenerationConfig by @mosheber in #29131
Fix TextGenerationPipeline.__call__ docstring by @alvarobartt in #29491
Substantially reduce memory usage in _update_causal_mask for large batches by using .expand instead of .repeat [needs tests+sanity check] by @nqgl in #29413
Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS device by @currybab in #29439
Enable BLIP for auto VQA by @regisss in #29499
v4.39 deprecations 🧼 by @gante in #29492
Revert "Automatic safetensors conversion when lacking these files by @LysandreJik in #2…
fix: Avoid error when fsdp_config is missing xla_fsdp_v2 by @ashokponkumar in #29480
Flava multimodal add attention mask by @zucchini-nlp in #29446
test_generation_config_is_loaded_with_model - fall back to pytorch model for now by @amyeroberts in #29521
Set inputs as kwarg in TextClassificationPipeline by @alvarobartt in #29495
Fix VisionEncoderDecoder Positional Arg by @nickthegroot in #29497
Generate: left-padding test, revisited by @gante in #29515
[tests] add the missing require_sacremoses decorator by @faaany in #29504
fix image-to-text batch incorrect output issue by @sywangyi in #29342
Typo fix in error message by @clefourrier in #29535
[tests] use torch_device instead of auto for model testing by @faaany in #29531
StableLM: Fix dropout argument type error by @liangjs in #29236
Make sliding window size inclusive in eager attention by @jonatanklosko in #29519
fix typos in FSDP config parsing logic in TrainingArguments by @yundai424 in #29189
Fix WhisperNoSpeechDetection when input is full silence by @ylacombe in #29065
[tests] use the correct n_gpu in TrainerIntegrationTest::test_train_and_eval_dataloaders for XPU by @faaany in #29307
Fix eval thread fork bomb by @muellerzr in #29538
feat: use warning_advice for tensorflow warning by @winstxnhdw in #29540
[Mamba doc] Post merge updates by @ArthurZucker in #29472
[Docs] fixed minor typo by @j-gc in #29555
Add Fill-in-the-middle training objective example - PyTorch by @tanaymeh in #27464
Bark model Flash Attention 2 Enabling to pass on check_device_map parameter to super() by @damithsenanayake in #29357
Make torch xla available on GPU by @yitongh in #29334
[Docs] Fix FastSpeech2Conformer model doc links by @khipp in #29574
Don't use a subset in test fetcher if on main branch by @ydshieh in #28816
fix error: TypeError: Object of type Tensor is not JSON serializable … by @yuanzhoulvpi2017 in #29568
Add missing localized READMEs to the copies check by @khipp in #29575
Fixed broken link by @amritgupta98 in #29558
Tiny improvement for doc by @fzyzcjy in #29581
Fix Fuyu doc typos by @zucchini-nlp in #29601
Fix minor typo: softare => software by @DriesVerachtert in #29602
Stop passing None to compile() in TF examples by @Rocketknight1 in #29597
Fix typo (determine) by @koayon in #29606
Implemented add_pooling_layer arg to TFBertModel by @tomigee in #29603
Update legacy Repository usage in various example files by @Hvanderwilk in #29085
Set env var to hold Keras at Keras 2 by @Rocketknight1 in #29598
Update flava tests by @ydshieh in #29611
Fix typo ; Update quantization.md by @furkanakkurt1335 in #29615
Add tests for batching support by @zucchini-nlp in #29297
Fix: handle logging of scalars in Weights & Biases summary by @parambharat in #29612
Examples: check max_position_embeddings in the translation example by @gante in #29600
[Gemma] Supports converting directly in half-precision by @younesbelkada in #29529
[Flash Attention 2] Add flash attention 2 for GPT-J by @bytebarde in #28295
Core: Fix copies on main by @younesbelkada in #29624
[Whisper] Deprecate forced ids for v4.39 by @sanchit-gandhi in #29485
Warn about tool use by @LysandreJik in #29628
Adds pretrained IDs directly in the tests by @LysandreJik in #29534
[generate] deprecate forced ids processor by @sanchit-gandhi in #29487
Fix minor typo: infenrece => inference by @DriesVerachtert in #29621
[MaskFormer, Mask2Former] Use einsum where possible by @amyeroberts in #29544
Llama: allow custom 4d masks by @gante in #29618
[PyTorch/XLA] Fix extra TPU compilations introduced by recent changes by @alanwaketan in #29158
[docs] Spanish translate chat_templating.md & yml addition by @njackman-2344 in #29559
Add support for FSDP+QLoRA and DeepSpeed ZeRO3+QLoRA by @pacman100 in #29587
[Mask2Former] Move normalization for numerical stability by @amyeroberts in #29542
[tests] make test_trainer_log_level_replica to run on accelerators with more than 2 devices by @faaany in #29609
Refactor TFP call to just sigmoid() by @Rocketknight1 in #29641
Fix batching tests for new models (Mamba and SegGPT) by @zucchini-nlp in #29633
Fix multi_gpu_data_parallel_forward for MusicgenTest by @ydshieh in #29632
[docs] Remove broken ChatML format link from chat_templating.md by @aaronjimv in #29643
Add newly added PVTv2 model to all README files. by @robinverduijn in #29647
[PEFT] Fix save_pretrained to make sure adapters weights are also saved on TPU by @shub-kris in #29388
Fix TPU checkpointing inside Trainer by @shub-kris in #29657
Add dataset_revision argument to RagConfig by @ydshieh in #29610
Fix PVT v2 tests by @ydshieh in #29660
Generate: handle cache_position update in generate by @gante in #29467
Allow apply_chat_template to pass kwargs to the template and support a dict of templates by @Rocketknight1 in #29658
Inaccurate code example within inline code-documentation by @MysteryManav in #29661
Extend import utils to cover "editable" torch versions by @bhack in #29000
Trainer: fail early in the presence of an unsavable generation_config by @gante in #29675
Pipeline: use tokenizer pad token at generation time if the model pad token is unset. by @gante in #29614
[tests] remove deprecated tests for model loading by @faaany in #29450
Fix AutoformerForPrediction example code by @m-torhan in #29639
[tests] ensure device-required software is available in the testing environment before testing by @faaany in #29477
Fix wrong condition used in filter_models by @ydshieh in #29673
fix: typos by @testwill in #29653
Rename glue to nyu-mll/glue by @lhoestq in #29679
Generate: replace breaks by a loop condition by @gante in #29662
[FIX] Fix speech2test modeling tests by @ylacombe in #29672
Revert "Fix wrong condition used in filter_models" by @ydshieh in #29682
[docs] Spanish translation of attention.md by @aaronjimv in #29681
CI / generate: batch size computation compatible with all models by @gante in #29671
Fix filter_models by @ydshieh in #29710
FIX [bnb] Make unexpected_keys optional by @younesbelkada in #29420
Update the pipeline tutorial to include gradio.Interface.from_pipeline by @abidlabs in #29684
Use logging.warning instead of warnings.warn in pipeline.__call__ by @tokestermw in #29717
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @windsonsea
- [i18n-zh] Translated torchscript.md into Chinese (#29234)
- [i18n-zh] Translated task/asr.md into Chinese (#29233)
- [i18n-zh] Translate fsdp.md into Chinese (#29305)
- [i18n-zh] Sync source/zh/index.md (#29331)
- [i18n-zh] Translate add_new_pipeline.md into Chinese (#29432)
- @hoangsvit
- [i18n-vi] Translate README.md to Vietnamese (#29229)
- @EduardoPach
- Fixed Deformable Detr typo when loading cuda kernels for MSDA (#29294)
- Adding SegGPT (#27735)
- @RaymondLi0
- Starcoder2 model - bis (#29215)
- @njackman-2344
- [Docs] Spanish Translation -Torchscript md & Trainer md (#29310)
- [docs] Spanish translate chat_templating.md & yml addition (#29559)
- @tanaymeh
- Add Fill-in-the-middle training objective example - PyTorch (#27464)
- @Hvanderwilk
- Update legacy Repository usage in various example files (#29085)
- @FoamoftheSea
- Add PvT-v2 Model (#26812)
- @saurabhdash2512
- Cohere Model Release (#29622)
- Python
Published by ArthurZucker about 2 years ago
transformers - v4.38.2
Fix backward compatibility issues with Llama and Gemma:
We mostly made sure that performances are not affected by the new change of paradigm with ROPE. Fixed the ROPE computation (should always be in float32) and the causal_mask dtype was set to bool to take less RAM.
YOLOS had a regression, and Llama / T5Tokenizer had a warning popping for random reasons
- FIX [Gemma] Fix bad rebase with transformers main (#29170)
- Improve _update_causal_mask performance (#29210)
- [T5 and Llama Tokenizer] remove warning (#29346)
- [Llama ROPE] Fix torch export but also slow downs in forward (#29198)
- RoPE loses precision for Llama / Gemma + Gemma logits.float() (#29285)
- Patch YOLOS and others (#29353)
- Use torch.bool instead of torch.int64 for non-persistant causal mask buffer (#29241)
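The reason RoPE should be computed in float32 can be illustrated with position indices alone. The sketch below uses float16 as a stand-in for reduced precision in general, because Python's struct module can round-trip it without extra dependencies; that choice and the helper name are assumptions for illustration. It shows that large positions are no longer stored exactly.

```python
import struct


def round_to_float16(value):
    # Pack into IEEE 754 half precision and unpack again to see what survives.
    return struct.unpack("e", struct.pack("e", value))[0]


for position in (100, 2049, 8191):
    stored = round_to_float16(float(position))
    print(f"position {position} is stored as {stored:.0f} in float16")
```

Once neighbouring positions collapse onto the same stored value, the rotary angles derived from them collapse too, which is why keeping this computation in float32 matters for long sequences.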
- Python
Published by ArthurZucker over 2 years ago
transformers - v4.38.1
Fix eager attention in Gemma!
- [Gemma] Fix eager attention #29187 by @sanchit-gandhi
TLDR:
diff
- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+ attn_output = attn_output.view(bsz, q_len, -1)
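The change matters whenever the concatenated attention heads are not as wide as the hidden size. The sketch below checks this with plain arithmetic. The configuration values are assumptions chosen for illustration (a "2b-like" and a "7b-like" setup) and are not quoted from these notes.

```python
def attention_output_width(num_heads, head_dim):
    # Width of the concatenated per-head outputs, before the output projection.
    return num_heads * head_dim


# Assumed configuration values, for illustration only.
configs = {
    "2b-like": {"hidden_size": 2048, "num_heads": 8, "head_dim": 256},
    "7b-like": {"hidden_size": 3072, "num_heads": 16, "head_dim": 256},
}

for name, cfg in configs.items():
    width = attention_output_width(cfg["num_heads"], cfg["head_dim"])
    fits = width == cfg["hidden_size"]
    print(
        f"{name}: heads*head_dim={width}, hidden_size={cfg['hidden_size']}, "
        f"reshape to hidden_size works: {fits}"
    )
```

Using -1 for the last dimension lets the tensor keep whatever width the heads produce, so the output projection receives the right shape in both cases.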
- Python
Published by ArthurZucker over 2 years ago
transformers - v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer, AQLM
New model additions
💎 Gemma 💎
Gemma is a new open-source language model series from Google AI that comes in 2B and 7B variants. The release includes pre-trained and instruction fine-tuned versions, and you can use them via AutoModelForCausalLM, GemmaForCausalLM or the pipeline interface!
Read more about it in the Gemma release blogpost: https://hf.co/blog/gemma
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```
You can use the model with Flash Attention, SDPA, Static cache and quantization API for further optimizations !
- Flash Attention 2
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```
- bitsandbytes-4bit
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto",
    load_in_4bit=True
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```
- Static Cache
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto"
)

model.generation_config.cache_implementation = "static"

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```
Depth Anything Model
The Depth Anything model was proposed in Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the DPT architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation.
- Add Depth Anything by @NielsRogge in #28654
Stable LM
StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.
StableLM 3B 4E1T is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.
The team also provides StableLM Zephyr 3B, an instruction fine-tuned version of the model that can be used for chat-based applications.
- Add StableLM by @jon-tow in #28810
⚡️ Static cache was introduced in the following PRs ⚡️
Static past key value cache allows LlamaForCausalLM's forward pass to be compiled using torch.compile!
This means that (cuda) graphs can be used for inference, which speeds up the decoding step by 4x!
A forward pass of Llama2 7B takes around 10.5 ms to run with this on an A100! Equivalent to TGI performances! ⚡️
- [Core generation] Adds support for static KV cache by @ArthurZucker in #27931
- [CLeanup] Revert SDPA attention changes that got in the static kv cache PR by @ArthurZucker in #29027
- Fix static generation when compiling! by @ArthurZucker in #28937
- Static Cache: load models with MQA or GQA by @gante in #28975
- Fix symbolic_trace with kv cache by @fxmarty in #28724
⚠️ Support for generate is not included yet. This feature is experimental and subject to changes in subsequent releases.
```py
from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
import torch
import os

# compilation triggers multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "true"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# set up the static cache in advance of using the model
model._setup_cache(StaticCache, max_batch_size=1, max_cache_len=128)

# trigger compilation!
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# run the model as usual
input_text = "A few facts about the universe: "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda").input_ids
model_outputs = compiled_model(input_ids)
```
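The property that makes a static cache friendly to compilation is that its storage is allocated once and never changes size. The toy class below sketches that idea in plain Python; it is a hypothetical stand-in for illustration, not the library's StaticCache, and it stores strings where a real cache would store tensors.

```python
class ToyStaticCache:
    """Fixed-size key/value store; a stand-in for the idea, not the library class."""

    def __init__(self, max_cache_len):
        # Allocate every slot up front so the containers never change size.
        self.keys = [None] * max_cache_len
        self.values = [None] * max_cache_len
        self.seen_tokens = 0

    def update(self, key, value):
        if self.seen_tokens >= len(self.keys):
            raise ValueError("cache is full; max_cache_len was fixed at creation")
        # Write in place at the next free position instead of appending.
        self.keys[self.seen_tokens] = key
        self.values[self.seen_tokens] = value
        self.seen_tokens += 1


cache = ToyStaticCache(max_cache_len=4)
for step in range(3):
    cache.update(key=f"k{step}", value=f"v{step}")
    print(step, len(cache.keys), cache.keys)
```

A cache that grows by concatenation changes shape at every decoding step, whereas this layout keeps the same shape throughout, which is what allows a compiled graph to be reused.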
Quantization
🧼 HF Quantizer 🧼
HfQuantizer makes it easy for quantization method researchers and developers to add inference and / or quantization support in 🤗 transformers. If you are interested in adding the support for new methods, please refer to this documentation page: https://huggingface.co/docs/transformers/main/en/hf_quantizer
- HfQuantizer class for quantization-related stuff in modeling_utils.py by @poedator in #26610
- [HfQuantizer] Move it to "Developper guides" by @younesbelkada in #28768
- [HFQuantizer] Remove check_packages_compatibility logic by @younesbelkada in #28789
- [docs] HfQuantizer by @stevhliu in #28820
⚡️AQLM ⚡️
AQLM is a new quantization method that enables no-performance degradation in 2-bit precision. Check out this demo about how to run Mixtral in 2-bit on a free-tier Google Colab instance: https://huggingface.co/posts/ybelkada/434200761252287
- AQLM quantizer support by @BlackSamorez in #28928
- Removed obsolete attribute setting for AQLM quantization. by @BlackSamorez in #29034
🧼 Moving canonical repositories 🧼
The canonical repositories on the hugging face hub (models that did not have an organization, like bert-base-cased), have been moved under organizations.
You can find the entire list of models moved here: https://huggingface.co/collections/julien-c/canonical-models-65ae66e29d5b422218567567
Redirection has been set up so that your code continues working even if you continue calling the previous paths. We, however, still encourage you to update your code to use the new links so that it is entirely future proof.
- canonical repos moves by @julien-c in #28795
- Update all references to canonical models by @LysandreJik in #29001
Flax Improvements 🚀
The Mistral model was added to the library in Flax.
- Flax mistral by @kiansierra in #26943
TensorFlow Improvements 🚀
With Keras 3 becoming the standard version of Keras in TensorFlow 2.16, we've made some internal changes to maintain compatibility. We now have full compatibility with TF 2.16 as long as the tf-keras compatibility package is installed. We've also taken the opportunity to do some cleanup - in particular, the objects like BatchEncoding that are returned by our tokenizers and processors can now be directly passed to Keras methods like model.fit(), which should simplify a lot of code and eliminate a long-standing source of annoyances.
- Add tf_keras imports to prepare for Keras 3 by @Rocketknight1 in #28588
- Wrap Keras methods to support BatchEncoding by @Rocketknight1 in #28734
- Fix Keras scheduler import so it works for older versions of Keras by @Rocketknight1 in #28895
Pre-Trained backbone weights 🚀
Enable loading in pretrained backbones in a new model, where all other weights are randomly initialized. Note: validation checks are still in place when creating a config. Passing in use_pretrained_backbone will raise an error. You can override by setting config.use_pretrained_backbone = True after creating a config. However, it is not yet guaranteed to be fully backwards compatible.
```py from transformers import MaskFormerConfig, MaskFormerModel
config = MaskFormerConfig( usepretrainedbackbone=False, backbone="microsoft/resnet-18" ) config.usepretrainedbackbone = True
Both models have resnet-18 backbone weights and all other weights randomly
initialized
model1 = MaskFormerModel(config) model2 = MaskFormerModel(config) ```
- Enable instantiating model with pretrained backbone weights by @amyeroberts in #28214
Introduce a helper function load_backbone to load a backbone from a backbone's model config e.g. ResNetConfig, or from a model config which contains backbone information. This enables cleaner modeling files and crossloading between timm and transformers backbones.
```py
from transformers import ResNetConfig, MaskFormerConfig
from transformers.utils.backbone_utils import load_backbone

# Resnet defines the backbone model to load
config = ResNetConfig()
backbone = load_backbone(config)

# Maskformer config defines a model which uses a resnet backbone
config = MaskFormerConfig(use_timm_backbone=True, backbone="resnet18")
backbone = load_backbone(config)

config = MaskFormerConfig(backbone_config=ResNetConfig())
backbone = load_backbone(config)
```
- [`Backbone`] Use `load_backbone` instead of `AutoBackbone.from_config` by @amyeroberts in #28661
- Backbone kwargs in config by @amyeroberts in #28784
Documentation updates: added API references, listed supported backbones, updated examples, and clarified and reorganized information to better reflect usage.
- [docs] Backbone by @stevhliu in #28739
- Improve Backbone API docs by @merveenoyan in #28666
Image Processor work 🚀
- Raise unused kwargs image processor by @molbap in #29063
- Abstract image processor arg checks by @molbap in #28843
Bugfixes and improvements 🚀
- Fix id2label assignment in run_classification.py by @jheitmann in #28590
- Add missing key to TFLayoutLM signature by @Rocketknight1 in #28640
- Avoid root logger's level being changed by @ydshieh in #28638
- Add config tip to custom model docs by @Rocketknight1 in #28601
- Fix lr_scheduler in no_trainer training scripts by @bofenghuang in #27872
- [`Llava`] Update convert_llava_weights_to_hf.py script by @isaac-vidas in #28617
- [`GPTNeoX`] Fix GPTNeoX + Flash Attention 2 issue by @younesbelkada in #28645
- Update image_processing_deformable_detr.py by @sounakdey in #28561
- [`SigLIP`] Only import tokenizer if sentencepiece available by @amyeroberts in #28636
- Fix phi model doc checkpoint by @amyeroberts in #28581
- get default device through `PartialState().default_device` as it has been officially released by @statelesshz in #27256
- integrations: fix DVCLiveCallback model logging by @dberenbaum in #28653
- Enable safetensors conversion from PyTorch to other frameworks without the torch requirement by @LysandreJik in #27599
- `tensor_size` - fix copy/paste error msg typo by @scruel in #28660
- Fix windows err with checkpoint race conditions by @muellerzr in #28637
- add dataloader prefetch factor in training args and trainer by @qmeeus in #28498
- Support single token decode for `CodeGenTokenizer` by @cmathw in #28628
- Remove deprecated eager_serving fn by @Rocketknight1 in #28665
- fix a hidden bug of `GenerationConfig`, now the `generation_config.json` can be loaded successfully by @ParadoxZW in #28604
- Update README_es.md by @vladydev3 in #28612
- Exclude the load balancing loss of padding tokens in Mixtral-8x7B by @khaimt in #28517
- Use save_safetensor to disable safe serialization for XLA by @jeffhataws in #28669
- Add back in generation types by @amyeroberts in #28681
- [docs] DeepSpeed by @stevhliu in #28542
- Improved type hinting for all attention parameters by @nakranivaibhav in #28479
- improve efficient training on CPU documentation by @faaany in #28646
- [docs] Fix doc format by @stevhliu in #28684
- [`chore`] Add missing space in warning by @tomaarsen in #28695
- Update question_answering.md by @yusyel in #28694
- [`Vilt`] align input and model dtype in the ViltPatchEmbeddings forward pass by @faaany in #28633
- [`docs`] Improve visualization for vertical parallelism by @petergtz in #28583
- Don't fail when `LocalEntryNotFoundError` during `processor_config.json` loading by @ydshieh in #28709
- Fix duplicate & unnecessary flash attention warnings by @fxmarty in #28557
- support PeftMixedModel signature inspect by @Facico in #28321
- fix: corrected misleading log message in save_pretrained function by @mturetskii in #28699
- [`docs`] Update preprocessing.md by @velaia in #28719
- Initialize _tqdm_active with hf_hub_utils.are_progress_bars_disabled(… by @ShukantPal in #28717
- Fix `weights_only` by @ydshieh in #28725
- Stop confusing the TF compiler with ModelOutput objects by @Rocketknight1 in #28712
- fix: suppress `GatedRepoError` to use cache file (fix #28558). by @scruel in #28566
- Unpin pydantic by @ydshieh in #28728
- [docs] Fix datasets in guides by @stevhliu in #28715
- [Flax] Update no init test for Flax v0.7.1 by @sanchit-gandhi in #28735
- Falcon: removed unused function by @gante in #28605
- Generate: deprecate old src imports by @gante in #28607
- [`Siglip`] protect from imports if sentencepiece not installed by @amyeroberts in #28737
- Add serialization logic to pytree types by @angelayi in #27871
- Fix `DepthEstimationPipeline`'s docstring by @ydshieh in #28733
- Fix input data file extension in examples by @khipp in #28741
- [Docs] Fix Typo in English & Japanese CLIP Model Documentation (TMBD -> TMDB) by @Vinyzu in #28751
- PatchtTST and PatchTSMixer fixes by @wgifford in #28083
- Enable Gradient Checkpointing in Deformable DETR by @FoamoftheSea in #28686
- small doc update for CamemBERT by @julien-c in #28644
- Pin pytest version <8.0.0 by @amyeroberts in #28758
- Mark test_constrained_beam_search_generate as flaky by @amyeroberts in #28757
- Fix typo of `Block`. by @xkszltl in #28727
- [Whisper] Make tokenizer normalization public by @sanchit-gandhi in #28136
- Support saving only PEFT adapter in checkpoints when using PEFT + FSDP by @AjayP13 in #28297
- Add French translation: french README.md by @ThibaultLengagne in #28696
- Don't allow passing `load_in_8bit` and `load_in_4bit` at the same time by @osanseviero in #28266
- Move CLIP _no_split_modules to CLIPPreTrainedModel by @lz1oceani in #27841
- Use Conv1d for TDNN by @gau-nernst in #25728
- Fix transformers.utils.fx compatibility with torch<2.0 by @fxmarty in #28774
- Further pin pytest version (in a temporary way) by @ydshieh in #28780
- Task-specific pipeline init args by @amyeroberts in #28439
- Pin Torch to <2.2.0 by @Rocketknight1 in #28785
- [`bnb`] Fix bnb slow tests by @younesbelkada in #28788
- Prevent MLflow exception from disrupting training by @codiceSpaghetti in #28779
- don't initialize the output embeddings if we're going to tie them to input embeddings by @tom-p-reichel in #28192
- [Whisper] Refactor forced_decoder_ids & prompt ids by @patrickvonplaten in #28687
- Resolve DeepSpeed cannot resume training with PeftModel by @lh0x00 in #28746
- Wrap Keras methods to support BatchEncoding by @Rocketknight1 in #28734
- DeepSpeed: hardcode `torch.arange` dtype on `float` usage to avoid incorrect initialization by @gante in #28760
- Add artifact name in job step to maintain job / artifact correspondence by @ydshieh in #28682
- Split daily CI using 2 level matrix by @ydshieh in #28773
- [docs] Correct the statement in the docstring of compute_transition_scores in generation/utils.py by @Ki-Seki in #28786
- Adding [T5/MT5/UMT5]ForTokenClassification by @hackyon in #28443
- Make `is_torch_bf16_available_on_device` more strict by @ydshieh in #28796
- Add tip on setting tokenizer attributes by @Rocketknight1 in #28764
- enable gradient checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin by @SangbumChoi in #28615
- [docs] fix some bugs about parameter description by @zspo in #28806
- Add models from deit by @rajveer43 in #28302
- [Docs] Fix spelling and grammar mistakes by @khipp in #28825
- Explicitly check if token ID's are None in TFBertTokenizer constructor by @skumar951 in #28824
- Add missing None check for hf_quantizer by @jganitkevitch in #28804
- Fix issues caused by natten by @ydshieh in #28834
- fix / skip (for now) some tests before switch to torch 2.2 by @ydshieh in #28838
- Use `-v` for `pytest` on CircleCI by @ydshieh in #28840
- Reduce GPU memory usage when using FSDP+PEFT by @pacman100 in #28830
- Mark `test_encoder_decoder_model_generate` for `vision_encoder_deocder` as flaky by @amyeroberts in #28842
- Support custom scheduler in deepspeed training by @VeryLazyBoy in #26831
- [Docs] Fix bad doc: replace save with logging by @chenzizhao in #28855
- Ability to override clean_code_for_run by @w4ffl35 in #28783
- [WIP] Hard error when ignoring tensors. by @Narsil in #27484
- [`Doc`] update contribution guidelines by @ArthurZucker in #28858
- Correct wav2vec2-bert inputs_to_logits_ratio by @ylacombe in #28821
- Image Feature Extraction pipeline by @amyeroberts in #28216
- ClearMLCallback enhancements: support multiple runs and handle logging better by @eugen-ajechiloae-clearml in #28559
- Do not use mtime for checkpoint rotation. by @xkszltl in #28862
- Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support by @nakranivaibhav in #28777
- [Docs] Update project names and links in awesome-transformers by @khipp in #28878
- Fix LongT5ForConditionalGeneration initialization of lm_head by @eranhirs in #28873
- Raise error when using `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP by @pacman100 in #28866
- Fix `FastSpeech2ConformerModelTest` and skip it on CPU by @ydshieh in #28888
- Revert "[WIP] Hard error when ignoring tensors." by @ydshieh in #28898
- unpin torch by @ydshieh in #28892
- Explicit server error on gated model by @Wauplin in #28894
- [Docs] Fix backticks in inline code and documentation links by @khipp in #28875
- Hotfix - make `torchaudio` get the correct version in `torch_and_flax_job` by @ydshieh in #28899
- [Docs] Add missing language options and fix broken links by @khipp in #28852
- fix: Fixed the documentation for `logging_first_step` by removing "evaluate" by @Sai-Suraj-27 in #28884
- fix Starcoder FA2 implementation by @pacman100 in #28891
- Fix Keras scheduler import so it works for older versions of Keras by @Rocketknight1 in #28895
- ⚠️ Raise `Exception` when trying to generate 0 tokens ⚠️ by @danielkorat in #28621
- Update the cache number by @ydshieh in #28905
- Add npu device for pipeline by @statelesshz in #28885
- [Docs] Fix placement of tilde character by @khipp in #28913
- [Docs] Revert translation of '@slow' decorator by @khipp in #28912
- Fix utf-8 yaml load for marian conversion to pytorch in Windows by @SystemPanic in #28618
- Remove dead TF loading code by @Rocketknight1 in #28926
- fix: torch.int32 instead of torch.torch.int32 by @vodkaslime in #28883
- pass kwargs in stopping criteria list by @zucchini-nlp in #28927
- Support batched input for decoder start ids by @zucchini-nlp in #28887
- [Docs] Fix broken links and syntax issues by @khipp in #28918
- Fix max_position_embeddings default value for llama2 to 4096 #28241 by @karl-hajjar in #28754
- Fix a wrong link to CONTRIBUTING.md section in PR template by @B-Step62 in #28941
- Fix type annotations on neftune_noise_alpha and fsdp_config TrainingArguments parameters by @peblair in #28942
- [i18n-de] Translate README.md to German by @khipp in #28933
- [Nougat] Fix pipeline by @NielsRogge in #28242
- [Docs] Update README and default pipelines by @NielsRogge in #28864
- Convert `torch_dtype` as `str` to actual torch data type (i.e. "float16" …to `torch.float16`) by @KossaiSbai in #28208
- [`pipelines`] updated docstring with vqa alias by @cmahmut in #28951
- Tests: tag `test_save_load_fast_init_from_base` as flaky by @gante in #28930
- Updated requirements for image-classification samples: datasets>=2.14.0 by @alekseyfa in #28974
- Always initialize tied output_embeddings if it has a bias term by @hackyon in #28947
- Clean up staging tmp checkpoint directory by @woshiyyya in #28848
- [Docs] Add language identifiers to fenced code blocks by @khipp in #28955
- [Docs] Add video section by @NielsRogge in #28958
- [i18n-de] Translate CONTRIBUTING.md to German by @khipp in #28954
- [`NllbTokenizer`] refactor with added tokens decoder by @ArthurZucker in #27717
- Add sudachi_projection option to BertJapaneseTokenizer by @hiroshi-matsuda-rit in #28503
- Update configuration_llama.py: fixed broken link by @AdityaKane2001 in #28946
- [`DETR`] Update the processing to adapt masks & bboxes to reflect padding by @amyeroberts in #28363
- ENH: Do not pass warning message in case `quantization_config` is in config but not passed as an arg by @younesbelkada in #28988
- ENH [`AutoQuantizer`]: enhance trainer + not supported quant methods by @younesbelkada in #28991
- Add SiglipForImageClassification and CLIPForImageClassification by @NielsRogge in #28952
- [`Doc`] Fix docbuilder - make `BackboneMixin` and `BackboneConfigMixin` importable from `utils`. by @amyeroberts in #29002
- Set the dataset format used by `test_trainer` to float32 by @statelesshz in #28920
- Introduce AcceleratorConfig dataclass by @muellerzr in #28664
- Fix flaky test vision encoder-decoder generate by @zucchini-nlp in #28923
- Mask Generation Task Guide by @merveenoyan in #28897
- Add tie_weights() to LM heads and set bias in set_output_embeddings() by @hackyon in #28948
- [TPU] Support PyTorch/XLA FSDP via SPMD by @alanwaketan in #28949
- FIX [`Trainer` / tags]: Fix trainer + tags when users do not pass `"tags"` to `trainer.push_to_hub()` by @younesbelkada in #29009
- Add cuda_custom_kernel in DETA by @SangbumChoi in #28989
- DeformableDetrModel support fp16 by @DonggeunYu in #29013
- Fix copies between DETR and DETA by @amyeroberts in #29037
- FIX: Fix error with `logger.warning` + inline with recent refactor by @younesbelkada in #29039
- Patch to skip failing `test_save_load_low_cpu_mem_usage` tests by @amyeroberts in #29043
- Fix a tiny typo in `generation/utils.py::GenerateEncoderDecoderOutput`'s docstring by @sadra-barikbin in #29044
- add test marker to run all tests with @require_bitsandbytes by @Titus-von-Koeller in #28278
- Update important model list by @LysandreJik in #29019
- Fix max_length criteria when using inputs_embeds by @zucchini-nlp in #28994
- Support : Leverage Accelerate for object detection/segmentation models by @Tanmaypatil123 in #28312
- fix num_assistant_tokens with heuristic schedule by @jmamou in #28759
- fix failing trainer ds tests by @pacman100 in #29057
- `auto_find_batch_size` isn't yet supported with DeepSpeed/FSDP. Raise error accordingly. by @pacman100 in #29058
- Honor trust_remote_code for custom tokenizers by @rl337 in #28854
- Feature: Option to set the tracking URI for MLflowCallback. by @seanswyi in #29032
- Fix trainer test wrt DeepSpeed + auto_find_bs by @muellerzr in #29061
- Add chat support to text generation pipeline by @Rocketknight1 in #28945
- [Docs] Spanish translation of task_summary.md by @aaronjimv in #28844
- [`Awq`] Add peft support for AWQ by @younesbelkada in #28987
- FIX [`bnb` / `tests`]: Fix currently failing bnb tests by @younesbelkada in #29092
- fix the post-processing link by @davies-w in #29091
- Fix the `bert-base-cased` tokenizer configuration test by @LysandreJik in #29105
- Fix a typo in `examples/pytorch/text-classification/run_classification.py` by @Ja1Zhou in #29072
- change version by @ArthurZucker in #29097
- [Docs] Add resources by @NielsRogge in #28705
- ENH: added new output_logits option to generate function by @mbaak in #28667
- Bnb test fix for different hardwares by @Titus-von-Koeller in #29066
- Fix two tiny typos in `pipelines/base.py::Pipeline::_sanitize_parameters()`'s docstring by @sadra-barikbin in #29102
- storing & logging gradient norm in trainer by @shijie-wu in #27326
- Fixed nll with label_smoothing to just nll by @nileshkokane01 in #28708
- [`gradient_checkpointing`] default to use it for torch 2.3 by @ArthurZucker in #28538
- Move misplaced line by @kno10 in #29117
- FEAT [`Trainer` / `bnb`]: Add RMSProp from `bitsandbytes` to HF `Trainer` by @younesbelkada in #29082
- Abstract image processor arg checks. by @molbap in #28843
- FIX [`bnb` / `tests`] Propagate the changes from #29092 to 4-bit tests by @younesbelkada in #29122
- Llama: fix batched generation by @gante in #29109
- Generate: unset GenerationConfig parameters do not raise warning by @gante in #29119
- [`cuda kernels`] only compile them when initializing by @ArthurZucker in #29133
- FIX [`PEFT` / `Trainer`] Handle better peft + quantized compiled models by @younesbelkada in #29055
- [`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues by @ArthurZucker in #28010
- Revert low cpu mem tie weights by @amyeroberts in #29135
- Add support for fine-tuning CLIP-like models using contrastive-image-text example by @tjs-intel in #29070
- Save (circleci) cache at the end of a job by @ydshieh in #29141
- [Phi] Add support for sdpa by @hackyon in #29108
- Generate: missing generation config eos token setting in encoder-decoder tests by @gante in #29146
- Added image_captioning version in es and included in toctree file by @gisturiz in #29104
- Fix drop path being ignored in DINOv2 by @fepegar in #29147
- [`pipeline`] Add pool option to image feature extraction pipeline by @amyeroberts in #28985
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @nakranivaibhav
- Improved type hinting for all attention parameters (#28479)
- Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support (#28777)
- @khipp
- Fix input data file extension in examples (#28741)
- [Docs] Fix spelling and grammar mistakes (#28825)
- [Docs] Update project names and links in awesome-transformers (#28878)
- [Docs] Fix backticks in inline code and documentation links (#28875)
- [Docs] Add missing language options and fix broken links (#28852)
- [Docs] Fix placement of tilde character (#28913)
- [Docs] Revert translation of '@slow' decorator (#28912)
- [Docs] Fix broken links and syntax issues (#28918)
- [i18n-de] Translate README.md to German (#28933)
- [Docs] Add language identifiers to fenced code blocks (#28955)
- [i18n-de] Translate CONTRIBUTING.md to German (#28954)
- @ThibaultLengagne
- Add French translation: french README.md (#28696)
- @poedator
- `HfQuantizer` class for quantization-related stuff in `modeling_utils.py` (#26610)
- @kiansierra
- Flax mistral (#26943)
- @hackyon
- Adding [T5/MT5/UMT5]ForTokenClassification (#28443)
- Always initialize tied output_embeddings if it has a bias term (#28947)
- Add tie_weights() to LM heads and set bias in set_output_embeddings() (#28948)
- [Phi] Add support for sdpa (#29108)
- @SangbumChoi
- enable gradient checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin (#28615)
- Add cuda_custom_kernel in DETA (#28989)
- @rajveer43
- Add models from deit (#28302)
- @jon-tow
- Add `StableLM` (#28810)
Published by LysandreJik over 2 years ago
transformers - Patch release v4.37.2
Selection of fixes
* Protecting the imports for SigLIP's tokenizer if sentencepiece isn't installed
* Fix permissions issue on windows machines when using trainer in multi-node setup
* Allow disabling safe serialization when using Trainer. Needed for Neuron SDK
* Fix error when loading processor from cache
* torch < 1.13 compatible torch.load
Commits
* [Siglip] protect from imports if sentencepiece not installed (#28737)
* Fix weights_only (#28725)
* Enable safetensors conversion from PyTorch to other frameworks without the torch requirement (#27599)
* Don't fail when LocalEntryNotFoundError during processor_config.json loading (#28709)
* Use save_safetensor to disable safe serialization for XLA (#28669)
* Fix windows err with checkpoint race conditions (#28637)
* [SigLIP] Only import tokenizer if sentencepiece available (#28636)
Published by amyeroberts over 2 years ago
transformers - Patch release: v4.37.1
A patch release to resolve import errors from removed custom types in generation utils
- Add back in generation types #28681
Published by amyeroberts over 2 years ago
transformers - v4.37 Qwen2, Phi-2, SigLIP, ViP-LLaVA, FastSpeech2Conformer, 4-bit serialization, Whisper longform generation
Model releases
Qwen2
Qwen2 is the new model series of large language models from the Qwen team. Previously, the Qwen series was released, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.
- Add qwen2 by @JustinLin610 in #28436
Phi-2
Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.
- [Phi2] Add support for phi2 models by @susnato in #28211
- [Phi] Extend implementation to use GQA/MQA. by @gugarosa in #28163
- update docs to add the `phi-2` example by @susnato in #28392
- Fixes default value of `softmax_scale` in `PhiFlashAttention2`. by @gugarosa in #28537
SigLIP
The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.
- Add SigLIP by @NielsRogge in #26522
- [SigLIP] Don't pad by default by @NielsRogge in #28578
ViP-LLaVA
The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
- Adds VIP-llava to transformers by @younesbelkada in #27932
- Fix Vip-llava docs by @younesbelkada in #28085
FastSpeech2Conformer
The FastSpeech2Conformer model was proposed in the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis, which builds upon FastSpeech, showing improvements in training speed, inference speed and voice quality. It consists of a variance adapter; duration, energy and pitch predictors; and a waveform and mel-spectrogram decoder.
- Add FastSpeech2Conformer by @connor-henderson in #23439
Wav2Vec2-BERT
The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.
- Add new meta w2v2-conformer BERT-like model by @ylacombe in #28165
- Add w2v2bert to pipeline by @ylacombe in #28585
4-bit serialization
Enables saving and loading transformers models in 4-bit formats: you can now push bitsandbytes 4-bit weights to the Hugging Face Hub. To save 4-bit models and push them to the Hub, install the latest bitsandbytes package from PyPI (`pip install -U bitsandbytes`), load your model in 4-bit precision and call `save_pretrained` / `push_to_hub`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
```
- [bnb] Let's make serialization of 4bit models possible by @poedator in #26037
- [`Docs`] Add 4-bit serialization docs by @younesbelkada in #28182
4D Attention mask
Enable passing 4D attention masks to models that support them. This is useful for reducing the memory footprint of certain generation tasks.
- 4D `attention_mask` support by @poedator in #27539
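As a rough illustration of what a 4D mask can express, the sketch below builds a mask of shape (batch, 1, query_len, kv_len) for several sequences packed into one row, where each token attends causally only to tokens of its own sequence. Packing is just one possible use, chosen here as an assumption. `packed_causal_mask_4d` is a hypothetical helper, not part of transformers, and whether a given model expects 0/1 values or an additive float mask is something to check for that model.

```python
def packed_causal_mask_4d(segment_lengths):
    """Build a (1, 1, total_len, total_len) mask of 0/1 ints for packed sequences.

    mask[0][0][q][k] == 1 means query position q may attend to key position k.
    Hypothetical helper for illustration only.
    """
    total_len = sum(segment_lengths)

    # Map every position to the index of the segment it belongs to.
    segment_ids = []
    for seg_idx, length in enumerate(segment_lengths):
        segment_ids.extend([seg_idx] * length)

    rows = []
    for q in range(total_len):
        # Causal (k <= q) and restricted to the same segment.
        row = [
            1 if (k <= q and segment_ids[k] == segment_ids[q]) else 0
            for k in range(total_len)
        ]
        rows.append(row)

    # Wrap in the batch and head dimensions.
    return [[rows]]


mask = packed_causal_mask_4d([2, 3])
for row in mask[0][0]:
    print(row)
```

The nested list can be turned into a tensor (for example with `torch.tensor(mask)`) before being passed as `attention_mask`.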
Improved quantization support
Ability to customise which modules are quantized and which are not.
* [Awq] Enable the possibility to skip quantization for some target modules by @younesbelkada in #27950
* add modules_in_block_to_quantize arg in GPTQconfig by @SunMarc in #27956
Added fused modules support
* [docs] Fused AWQ modules by @stevhliu in #27896
* [Awq] Add llava fused modules support by @younesbelkada in #28239
* [Mixtral / Awq] Add mixtral fused modules for Awq by @younesbelkada in #28240
SDPA Support for LLaVa, Mixtral, Mistral
- Fix SDPA correctness following torch==2.1.2 regression by @fxmarty in #27973
- [`Llava` / `Vip-Llava`] Add SDPA into llava by @younesbelkada in #28107
- [`Mixtral` & `Mistral`] Add support for sdpa by @ArthurZucker in #28133
- [SDPA] Make sure attn mask creation is always done on CPU by @patrickvonplaten in #28400
- Fix SDPA tests by @fxmarty in #28552
Whisper: Batched state-of-the-art long-form transcription
All decoding strategies (temperature fallback, compression/log-prob/no-speech threshold, ...) of OpenAI's long-form transcription (see: https://github.com/openai/whisper or section 4.5 in paper) have been added. Contrary to https://github.com/openai/whisper, Transformers long-form transcription is fully compatible with pure FP16 and Batching!
For more information see: https://github.com/huggingface/transformers/pull/27658.
- [Whisper] Finalize batched SOTA long-form generation by @patrickvonplaten in #27658
Generation: assisted generation upgrades, speculative decoding, and ngram speculation
Assisted generation was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate ngram speculation, and opens the door for new candidate generation methods. Additionally, we've added the speculative decoding strategy on top of assisted generation: when you call assisted generation with an assistant model and do_sample=True, you'll benefit from the faster speculative decoding sampling 🏎️💨
- Generate: `assisted_decoding` now accepts arbitrary candidate generators by @gante in #27751
- Generate: assisted decoding now uses `generate` for the assistant by @gante in #28031
- Generate: speculative decoding by @gante in #27979
- Generate: fix speculative decoding by @gante in #28166
- Adding Prompt lookup decoding by @apoorvumang in #27775
- Fix _speculative_sampling implementation by @ofirzaf in #28508
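To make the idea of ngram speculation (prompt lookup decoding) more concrete, here is a minimal pure-Python sketch of a candidate generator that needs no assistant model. It matches the trailing n-gram of the sequence against earlier tokens and proposes whatever followed the match. The function name and its parameters are hypothetical and this is not the library's implementation. In assisted generation the main model would then verify the proposed tokens.

```python
def find_candidate_tokens(input_ids, max_ngram_size=3, num_candidates=5):
    """Propose continuation tokens by matching the trailing n-gram earlier in the sequence.

    Hypothetical sketch of prompt-lookup candidate generation.
    """
    n_tokens = len(input_ids)
    # Prefer longer n-grams: they are more specific matches.
    for ngram_size in range(max_ngram_size, 0, -1):
        if n_tokens <= ngram_size:
            continue
        suffix = input_ids[-ngram_size:]
        # Scan earlier windows from most recent to oldest, excluding the suffix itself.
        for start in range(n_tokens - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == suffix:
                end = start + ngram_size
                return input_ids[end:end + num_candidates]
    # No match: the caller would fall back to regular decoding.
    return []


tokens = [5, 6, 7, 8, 9, 5, 6, 7]
print(find_candidate_tokens(tokens, num_candidates=2))
```

Here the trailing 3-gram `[5, 6, 7]` also occurs at the start of the sequence, so the tokens that followed it there are proposed as candidates.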
torch.load pickle protection
Adding pickle protection via weights_only=True in the torch.load calls.
- make torch.load a bit safer by @julien-c in #27282
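The concern behind this change is that unpickling can run arbitrary code. The standard-library sketch below shows the general idea with a restricted unpickler that refuses to resolve any global. It is only loosely analogous to `weights_only=True`, which restricts loading to a limited set of types. `RestrictedUnpickler`, `safe_loads` and `Malicious` are hypothetical names used for illustration and are not part of torch or transformers.

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any global (class or function)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_loads(data):
    """Load pickled bytes while blocking every global lookup."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Malicious:
    # __reduce__ tells pickle to call a function on load. A real attack
    # would call something harmful instead of print.
    def __reduce__(self):
        return (print, ("this ran during unpickling",))


# Plain containers and numbers need no globals, so they load fine.
plain = pickle.dumps({"weight": [0.1, 0.2]})
print(safe_loads(plain))

# The payload references a callable, so the restricted loader rejects it.
try:
    safe_loads(pickle.dumps(Malicious()))
except pickle.UnpicklingError as err:
    print("blocked:", err)
```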
Build methods for TensorFlow Models
Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper build() methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues, as well as significantly accelerating model load times.
- Proper build() methods for TF by @Rocketknight1 in #27794
- Replace build() with build_in_name_scope() for some TF tests by @Rocketknight1 in #28046
- More TF fixes by @Rocketknight1 in #28081
- Even more TF test fixes by @Rocketknight1 in #28146
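The difference between building weights lazily on the first call and building them up front can be sketched with a toy layer. This is plain Python, not the Keras or TensorFlow API. `ToyDense` and its `input_dim` argument are hypothetical, standing in for a layer that can read its input size from a model config.

```python
class ToyDense:
    """Toy layer (not Keras): weights exist only once the input size is known."""

    def __init__(self, units, input_dim=None):
        self.units = units
        self.input_dim = input_dim  # known up front, e.g. from a model config
        self.kernel = None

    def build(self, input_dim=None):
        # Use the explicitly passed size, else fall back to the configured one.
        input_dim = input_dim if input_dim is not None else self.input_dim
        if input_dim is None:
            raise ValueError("input_dim unknown: a forward pass is needed to infer it")
        self.kernel = [[0.0] * self.units for _ in range(input_dim)]

    def __call__(self, inputs):
        if self.kernel is None:
            # Lazy path: infer the weight shape from the first input seen.
            self.build(len(inputs))
        return [
            sum(x * self.kernel[i][j] for i, x in enumerate(inputs))
            for j in range(self.units)
        ]


lazy = ToyDense(units=2)
print(lazy.kernel is None)  # weights do not exist yet
lazy([1.0, 2.0, 3.0])       # the forward pass creates a 3x2 kernel

eager = ToyDense(units=2, input_dim=3)
eager.build()               # no forward pass needed
print(len(eager.kernel), len(eager.kernel[0]))
```

The second path mirrors the motivation described above: when shapes can be inferred without running inputs through the model, weights can be created directly.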
Remove support for torch 1.10
The last version to support PyTorch 1.10 was 4.36.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.11 and up, we do not support PyTorch 1.10 for v4.37 (i.e. we don't run the tests against torch 1.10).
- Byebye torch 1.10 by @ydshieh in #28207
Model tagging
You can now add custom tags to your model before pushing it to the Hub! This enables you to filter models that contain that tag on the Hub with a simple URL filter. For example, if you want to filter models that have the trl tag, you can search: https://huggingface.co/models?other=trl&sort=created
- [`core` / FEAT] Add the possibility to push custom tags using `PreTrainedModel` itself by @younesbelkada in #28405
e.g.
```python
from transformers import AutoModelForCausalLM

model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)

model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")
```
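The filter URL mentioned above can be built for any tag with the standard library. This is a small sketch in which `hub_tag_filter_url` is a hypothetical helper; the query parameter names `other` and `sort` are taken from the example URL in the text.

```python
from urllib.parse import urlencode


def hub_tag_filter_url(tag, sort="created"):
    """Build a Hub model-search URL that filters on a custom tag (hypothetical helper)."""
    query = urlencode({"other": tag, "sort": sort})
    return f"https://huggingface.co/models?{query}"


print(hub_tag_filter_url("trl"))
```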
Bugfixes and improvements
- Fix PatchTSMixer Docstrings by @vijaye12 in #27943
- use logger.warning_once to avoid massive outputs by @ranchlai in #27428
- Docs for AutoBackbone & Backbone by @merveenoyan in #27456
- Fix test for auto_find_batch_size on multi-GPU by @muellerzr in #27947
- Update import message by @NielsRogge in #27946
- Fix parameter count in readme for mixtral 45b by @CyberTimon in #27945
- In PreTrainedTokenizerBase add missing word in error message by @petergtz in #27949
- Fix AMD scheduled CI not triggered by @ydshieh in #27951
- Add deepspeed test to amd scheduled CI by @echarlaix in #27633
- Fix a couple of typos and add an illustrative test by @rjenc29 in #26941
- fix bug in mask2former: cost matrix is infeasible by @xuchenhao001 in #27897
- Fix for stochastic depth decay rule in the TimeSformer implementation by @atawari in #27875
- fix no sequence length models error by @AdamLouly in #27522
- [`Mixtral`] Change mistral op order by @younesbelkada in #27955
- Update bounding box format everywhere by @NielsRogge in #27944
- Support PeftModel signature inspect by @dancingpipi in #27865
- fixed typos (issue 27919) by @asusevski in #27920
- Hot-fix-mixstral-loss by @ArthurZucker in #27948
- Fix link in README.md of Image Captioning by @saswatmeher in #27969
- Better key error for AutoConfig by @Rocketknight1 in #27976
- [doc] fix typo by @stas00 in #27981
- fix typo in dvclive callback by @dberenbaum in #27983
- [`Tokenizer Serialization`] Fix the broken serialisation by @ArthurZucker in #27099
- [`Whisper`] raise better errors by @ArthurZucker in #27971
- Fix PatchTSMixer slow tests by @ajati in #27997
- [`CI slow`] Fix expected values by @ArthurZucker in #27999
- Fix bug with rotating checkpoints by @muellerzr in #28009
- [Doc] Spanish translation of glossary.md by @aaronjimv in #27958
- Add model_docs from cpmant.md to derformable_detr.md by @rajveer43 in #27884
- well well well by @ArthurZucker in #28011
- [`SeamlessM4TTokenizer`] Safe import by @ArthurZucker in #28026
- [`core` / `modeling`] Fix training bug with PEFT + GC by @younesbelkada in #28031
- Fix AMD push CI not triggered by @ydshieh in #28029
- SeamlessM4T: `test_retain_grad_hidden_states_attentions` is flaky by @gante in #28035
- Fix languages covered by M4Tv2 by @ylacombe in #28019
- Fixed spelling error in T5 tokenizer warning message (s/thouroughly/t… by @jeddobson in #28014
- Generate: Mistral/Mixtral FA2 cache fix when going beyond the context window by @gante in #28037
- [Seamless] Fix links in docs by @sanchit-gandhi in #27905
- Remove warning when Annotion enum is created by @amyeroberts in #28048
- [`FA-2`] Fix fa-2 issue when passing `config` to `from_pretrained` by @younesbelkada in #28043
- [`Modeling` / `Mixtral`] Fix GC + PEFT issues with Mixtral by @younesbelkada in #28061
- [Flax BERT] Update deprecated 'split' method by @sanchit-gandhi in #28012
- [Flax LLaMA] Fix attn dropout by @sanchit-gandhi in #28059
- Remove SpeechT5 deprecated argument by @ylacombe in #28062
- doc: Correct spelling mistake by @caiyili in #28064
- [`Mixtral`] update conversion script to reflect new changes by @younesbelkada in #28068
- Skip M4T `test_retain_grad_hidden_states_attentions` by @ylacombe in #28060
- [LLaVa] Add past_key_values to _skip_keys_device_placement to fix multi-GPU dispatch by @aismlv in #28051
- Make GPT2 traceable in meta state by @kwen2501 in #28054
- Fix bug for checkpoint saving on multi node training setting by @dumpmemory in #28078
- Update fixtures-image-utils by @lhoestq in #28080
- Fix `low_cpu_mem_usage` Flag Conflict with DeepSpeed Zero 3 in `from_pretrained` for Models with `keep_in_fp32_modules`" by @kotarotanahashi in #27762
- Fix wrong examples in llava usage. by @Lyken17 in #28020
- [docs] Trainer by @stevhliu in #27986
- [docs] MPS by @stevhliu in #28016
- fix resuming from ckpt when using FSDP with FULL_STATE_DICT by @pacman100 in #27891
- Fix the deprecation warning of _torch_pytree._register_pytree_node by @cyyever in #27803
- Spelling correction by @saeneas in #28110
- in peft finetune, only the trainable parameters need to be saved by @sywangyi in #27825
- fix ConversationalPipeline docstring by @not-lain in #28091
- Disable jitter noise during evaluation in SwitchTransformers by @DaizeDong in #28077
- Remove warning if `DISABLE_TELEMETRY` is used by @Wauplin in #28113
- Fix indentation error - semantic_segmentation.md by @rajveer43 in #28117
- [docs] General doc fixes by @stevhliu in #28087
- Fix a typo in tokenizer documentation by @mssalvatore in #28118
- [Doc] Fix token link in What 🤗 Transformers can do by @aaronjimv in #28123
- When save a model on TPU, make a copy to be moved to CPU by @qihqi in #27993
- Update split string in doctest to reflect #28087 by @amyeroberts in #28135
- [`Mixtral`] Fix loss + nits by @ArthurZucker in #28115
- Update modeling_utils.py by @mzelling in #28127
- [docs] Fix mistral link in mixtral.md by @aaronjimv in #28143
- Remove deprecated CPU dockerfiles by @ashahba in #28149
- Fix FA2 integration by @pacman100 in #28142
- [gpt-neox] Add attention_bias config to support model trained without attention biases by @dalgarak in #28126
- move code to Trainer.evaluate to enable use of that function with multiple datasets by @peter-sk in #27844
- Fix weights not properly initialized due to shape mismatch by @ydshieh in #28122
- Avoid unnecessary warnings when loading `CLIPConfig` by @ydshieh in #28108
- Update FA2 exception msg to point to hub discussions by @amyeroberts in #28161
- Align backbone stage selection with out_indices & out_features by @amyeroberts in #27606
- [docs] Trainer docs by @stevhliu in #28145
- Fix yolos resizing by @amyeroberts in #27663
- disable test_retain_grad_hidden_states_attentions on SeamlessM4TModelWithTextInputTest by @dwyatte in #28169
- Fix `input_embeds` docstring in encoder-decoder architectures by @gante in #28168
- [Whisper] Use torch for stft if available by @sanchit-gandhi in #26119
- Fix slow backbone tests - out_indices must match stage name ordering by @amyeroberts in #28186
- Update YOLOS slow test values by @amyeroberts in #28187
- Update `docs/source/en/perf_infer_gpu_one.md` by @ydshieh in #28198
- Fix ONNX export for causal LM sequence classifiers by removing reverse indexing by @dwyatte in #28144
- Add Swinv2 backbone by @NielsRogge in #27742
- Fix: [SeamlessM4T - S2TT] Bug in batch loading of audio in torch.Tensor format in the SeamlessM4TFeatureExtractor class by @nicholasneo78 in #27914
- Bug: `training_args.py` fix missing import with accelerate with version `accelerate==0.20.1` by @michaelfeil in #28171
- Fix the check of models supporting FA/SDPA not run by @ydshieh in #28202
- Drop `feature_extractor_type` when loading an image processor file by @ydshieh in #28195
- [Whisper] Fix word-level timestamps with bs>1 or num_beams>1 by @ylacombe in #28114
- Fixing visualization code for object detection to support both types of bounding box. by @Anindyadeep in #27842
- update the logger message with accordant weights_file_name by @izyForever in #28181
- [`Llava`] Fix llava index errors by @younesbelkada in #28032
- fix FA2 when using quantization by @pacman100 in #28203
- small typo by @stas00 in #28229
- Update docs around mixing hf scheduler with deepspeed optimizer by @dwyatte in #28223
- Fix trainer saving safetensors: metadata is None by @hiyouga in #28219
- fix bug:divide by zero in _maybe_log_save_evaluate() by @frankenliu in #28251
- [Whisper] Fix errors with MPS backend introduced by new code on word-level timestamps computation by @ercaronte in #28288
- Remove fast tokenization warning in Data Collators by @dbuos in #28213
- fix documentation for zero_shot_object_detection by @not-lain in #28267
- Remove token_type_ids from model_input_names (like #24788) by @Apsod in #28325
- Translate contributing.md into Chinese by @Mayfsz in #28243
- [docs] Sort es/toctree.yml | Translate performance.md by @aaronjimv in #28262
- Fix error in M4T feature extractor by @ylacombe in #28340
- README: install transformers from conda-forge channel by @kevherro in #28313
- Don't check the device when device_map=auto by @yuanwu2017 in #28351
- Fix pos_mask application and update tests accordingly by @ferjorosa in #27892
- fix FA2 when using quantization for remaining models by @susnato in #28341
- Update VITS modeling to enable ONNX export by @echarlaix in #28141
- chore: Fix typo s/exclusivelly/exclusively/ by @hugo-syn in #28361
- Enhancing Code Readability and Maintainability with Simplified Activation Function Selection. by @hi-sushanta in #28349
- Fix building alibi tensor when num_heads is not a power of 2 by @abuelnasr0 in #28380
- remove two deprecated function by @statelesshz in #28220
- Bugfix / ffmpeg input device (mic) not working on Windows by @Teapack1 in #27051
- [AttentionMaskConverter] fix sdpa unmask unattended by @zspo in #28369
- Remove shell=True from subprocess.Popen to Mitigate Security Risk by @avimanyu786 in #28299
- Add segmentation map processing to SAM Image Processor by @rwood-97 in #27463
- update warning for image processor loading by @ydshieh in #28209
- Fix initialization for missing parameters in `from_pretrained` under ZeRO-3 by @XuehaiPan in #28245
- Fix `_merge_input_ids_with_image_features` for llava model by @VictorSanh in #28333
- Use mmap option to load_state_dict by @weimingzha0 in #28331
- [BUG] BarkEosPrioritizerLogitsProcessor eos_token_id use list, tensor size mismatch by @inkinworld in #28201
- Skip now failing test in the Trainer tests by @muellerzr in #28421
- Support `DeepSpeed` when using auto find batch size by @muellerzr in #28088
- Fix number of models in README.md by @prasatee in #28430
- CI: limit natten version by @gante in #28432
- Fix for checkpoint rename race condition by @tblattner in #28364
- Fix load correct tokenizer in Mixtral model documentation by @JuanFKurucz in #28437
- [docstring] Fix docstring for ErnieConfig, ErnieMConfig by @Sparty in #27029
- [Whisper] Fix slow test by @patrickvonplaten in #28407
- Assitant model may on a different device by @jiqing-feng in #27995
- Enable multi-label image classification in pipeline by @amyeroberts in #28433
- Optimize the speed of the truncate_sequences function. by @ikkvix in #28263
- Use python 3.10 for docbuild by @ydshieh in #28399
- Fix docker file by @ydshieh in #28452
- Set `cache_dir` for `evaluate.load()` in example scripts by @aphedges in #28422
- Optionally preprocess segmentation maps for MobileViT by @harisankar95 in #28420
- Correctly resolve trust_remote_code=None for AutoTokenizer by @Rocketknight1 in #28419
- Fix load balancing loss func for mixtral by @liangxuZhang in #28256
- Doc by @jiqing-feng in #28431
- Fix docstring checker issues with PIL enums by @Rocketknight1 in #28450
- Fix broken link on page by @keenranger in #28451
- Mark two logger tests as flaky by @amyeroberts in #28458
- Update metadata loading for oneformer by @amyeroberts in #28398
- Fix torch.ones usage in xlnet by @sungho-ham in #28471
- Generate: deprecate old public functions by @gante in #28478
- Docs: add model paths by @gante in #28475
- Generate: refuse to save bad generation config files by @gante in #28477
- TF: purge `TFTrainer` by @gante in #28483
- Fix docstrings and update docstring checker error message by @Rocketknight1 in #28460
- Change progress logging to once across all nodes by @siddartha-RE in #28373
- Generate: fix candidate device placement by @gante in #28493
- Fix paths to AI Sweden Models reference and model loading by @JuanFKurucz in #28423
- [`chore`] Update warning text, a word was missing by @tomaarsen in #28017
- Don't set `finetuned_from` if it is a local path by @ydshieh in #28482
- Add the XPU device check for pipeline mode by @yuanwu2017 in #28326
- Tokenizer kwargs in textgeneration pipe by @thedamnedrhino in #28362
- [GPTQ] Fix test by @SunMarc in #28018
- Fixed minor typos by @rishit5 in #28489
- Add a use_safetensors arg to TFPreTrainedModel.from_pretrained() by @Rocketknight1 in #28511
- Generate: consolidate output classes by @gante in #28494
- fix: sampling in flax keeps EOS by @borisdayma in #28378
- improve dev setup comments and hints by @4imothy in #28495
- SiLU activation wrapper for safe importing by @amyeroberts in #28509
- Remove `task` arg in `load_dataset` in image-classification example by @regisss in #28408
- Improving Training Performance and Scalability Documentation by @HamzaFB in #28497
- Fix mismatching loading in from_pretrained with/without accelerate by @fxmarty in #28414
- Fix/speecht5 bug by @NimaYaqmuri in #28481
- [`TokenizationUtils`] Fix `add_special_tokens` when the token is already there by @ArthurZucker in #28520
- [`TokenizationRoformerFast`] Fix the save and loading by @ArthurZucker in #28527
- [`SpeechT5Tokenization`] Add copied from and fix the `convert_tokens_to_string` to match the fast decoding scheme by @ArthurZucker in #28522
- Clearer error for SDPA when explicitely requested by @fxmarty in #28006
- Add is_model_supported for fx by @inisis in #28521
- Config: warning when saving generation kwargs in the model config by @gante in #28514
- [Makefile] Exclude research projects from format by @patrickvonplaten in #28551
- symbolic_trace: add past_key_values, llama, sdpa support by @fxmarty in #28447
- Allow to train dinov2 with different dtypes like bf16 by @StarCycle in #28504
- Fix Switch Transformers When sparse_step = 1 by @agemagician in #28564
- Save `Processor` by @ydshieh in #27761
- Use `weights_only` only if torch >= 1.13 by @ydshieh in #28506
- [`Core Tokenization`] Support a fix for spm fast models by @ArthurZucker in #26678
- Use `LoggingLevel` context manager in 3 tests by @ydshieh in #28575
- Fix the documentation checkpoint for xlm-roberta-xl by @jeremyfowers in #28567
- [ASR Pipe] Update init to set model type and subsequently call parent init method by @sanchit-gandhi in #28486
- [Whisper Tok] Move token ids to CPU when computing offsets by @sanchit-gandhi in #28485
- [Whisper] Fix audio classification with weighted layer sum by @sanchit-gandhi in #28563
- Making CTC training example more general by @ylacombe in #28582
- Don't save `processor_config.json` if a processor has no extra attribute by @ydshieh in #28584
- Fix wrong xpu device in DistributedType.MULTI_XPU mode by @faaany in #28386
- [GPTNeoX] Fix BC issue with 4.36 by @ArthurZucker in #28602
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @aaronjimv
- [Doc] Spanish translation of glossary.md (#27958)
- [Doc] Fix token link in What 🤗 Transformers can do (#28123)
- [docs] Fix mistral link in mixtral.md (#28143)
- [docs] Sort es/toctree.yml | Translate performance.md (#28262)
- @rajveer43
- Add modeldocs from cpmant.md to derformabledetr.md (#27884)
- Fix indentation error - semantic_segmentation.md (#28117)
- @poedator
- 4D `attention_mask` support (#27539)
- [bnb] Let's make serialization of 4bit models possible (#26037)
- @connor-henderson
- Add FastSpeech2Conformer (#23439)
- @JustinLin610
- Add qwen2 (#28436)
- @SangbumChoi
- enable training mask2former and maskformer for transformers trainer by @SangbumChoi in #28277
- [DETA] Improvement and Sync from DETA especially for training by @SangbumChoi in #27990
- fix auxiliary loss training in DetrSegmentation by @SangbumChoi in #28354
Published by amyeroberts over 2 years ago
transformers - Patch release: v4.36.2
Patch release to resolve some critical issues relating to the recent cache refactor, flash attention refactor and training in the multi-gpu and multi-node settings:
- Resolve training bug with PEFT + GC #28031
- Resolve cache issue when going beyond context window for Mistral/Mixtral FA2 #28037
- Re-enable passing `config` to `from_pretrained` with FA #28043
- Fix resuming from checkpoint when using FSDP with FULL_STATE_DICT #27891
- Resolve bug when saving a checkpoint in the multi-node setting #28078
Published by amyeroberts over 2 years ago
transformers - Patch release: v4.36.1
A patch release for critical torch issues mostly:
- Fix SDPA correctness following torch==2.1.2 regression #27973
- [Tokenizer Serialization] Fix the broken serialisation #27099
- Fix bug with rotating checkpoints #28009
- Hot-fix-mixstral-loss (#27948)
🔥
Published by ArthurZucker over 2 years ago
transformers - v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa wide-spread support
New model additions
Mixtral
Mixtral is the new open-source model from Mistral AI, announced in the blog post Mixtral of Experts. According to the benchmark results shared in that blog post, the model has capabilities comparable to ChatGPT.
The architecture is a sparse Mixture of Experts with a Top-2 routing strategy, similar to the NllbMoe architecture in transformers. You can use it through the AutoModelForCausalLM interface:
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in half precision and let accelerate place the weights on the available devices
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

prompt = "My favourite condiment is"

# With device_map="auto" the model is already placed, so only the inputs are moved
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
```
The model is compatible with existing optimisation tools such as Flash Attention 2, bitsandbytes and the PEFT library. The checkpoints are released under the mistralai organisation on the Hugging Face Hub.
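To make the Top-2 routing idea mentioned above concrete, here is a minimal plain-Python sketch for a single token. It is a toy illustration, not the Mixtral implementation: the function names `top2_route` and `moe_forward` and the toy experts are assumptions made for this example, and the choice to renormalise the two selected gate weights is one common convention.

```python
import math


def top2_route(router_logits):
    """Pick the two highest-scoring experts and renormalise their gate weights."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the two largest probabilities, best first.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, router_logits, experts):
    """Run only the two selected experts and mix their outputs by gate weight."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))


# Toy experts (hypothetical): expert k multiplies its input by k + 1.
experts = [lambda x, k=k: (k + 1) * x for k in range(4)]
logits = [0.1, 2.0, -1.0, 1.0]

routes = top2_route(logits)
output = moe_forward(3.0, logits, experts)
```

Only two of the four experts are evaluated per token, which is why a sparse model can hold many parameters while keeping the per-token compute close to that of a much smaller dense model.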
Llava / BakLlava
Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat / instructions.

The Llava model was proposed in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.
- [`Llava`] Add Llava to transformers by @younesbelkada in #27662
- [LLaVa] Some improvements by @NielsRogge in #27895
The integration also includes BakLlava, which is a Llava model trained with a Mistral backbone.
The model is compatible with the "image-to-text" pipeline:
```py
from transformers import pipeline
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

image = Image.open(requests.get(url, stream=True).raw)
# The question below is an illustrative placeholder; the prompt needs the <image> token
prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
```
You can find all Llava weights under the llava-hf organisation on the Hub.
SeamlessM4T v2
SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version and was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.
SeamlessM4T enables multiple tasks without relying on separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
- Add SeamlessM4T v2 by @ylacombe in #27779
PatchTST
The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.
At a high level, the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer that then outputs the prediction length forecast via an appropriate head. The model is illustrated in the following figure:
- [Time series] Add PatchTST by @psinthong in #25927
- [Time series] Add PatchTST by @kashif in #27581
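The patching step described above, where a time series is cut into fixed-size patches before being fed to the Transformer, can be sketched in a few lines of plain Python. The helper name `make_patches` and the `patch_length` / `stride` arguments are illustrative names chosen for this example rather than a reference to the library code.

```python
def make_patches(series, patch_length, stride):
    """Split a 1-D series into fixed-size, possibly overlapping patches."""
    last_start = len(series) - patch_length
    return [
        series[start:start + patch_length]
        for start in range(0, last_start + 1, stride)
    ]


series = list(range(10))  # toy univariate series: 0, 1, ..., 9
patches = make_patches(series, patch_length=4, stride=2)
```

Each patch then plays the role of one input token, so the sequence length seen by the Transformer shrinks from the number of time steps to the number of patches.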
PatchTSMixer
The PatchTSMixer model was proposed in TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.
PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer’s capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression.
- [Time series] Add PatchTSMixer by @ajati in #26247
CLVP
The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in Better speech synthesis through scaling by James Betker.
- Add CLVP by @susnato in #24745
Phi-1/1.5
The Phi-1 model was proposed in Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.
The Phi-1.5 model was proposed in Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
- Add Phi-1 and Phi-1_5 by @susnato in #26170
TVP
The text-visual prompting (TVP) framework was proposed in the paper Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.
This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP integrates specially designed patterns, known as ‘prompts’, into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model’s ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. The use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently.
- TVP model by @jiqing-feng in #25856
DINOv2 depth estimation
Depth estimation is added to the DINO v2 implementation.
- Add DINOv2 depth estimation by @NielsRogge in #26092
ROCm support for AMD GPUs
AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.
- Add RoCm scheduled CI & upgrade RoCm CI to PyTorch 2.1 by @fxmarty in #26940
- Flash Attention 2 support for RoCm by @fxmarty in #27611
- Reflect RoCm support in the documentation by @fxmarty in #27636
- restructure AMD scheduled CI by @ydshieh in #27743
PyTorch scaled_dot_product_attention native support
PyTorch's torch.nn.functional.scaled_dot_product_attention operator is now supported in the most-used Transformers models and is used by default with torch>=2.1.1. This allows dispatching to memory-efficient attention and Flash Attention backend implementations with no package other than torch required. It should significantly speed up attention computation on hardware that supports these fast paths.
While Transformers automatically handles the dispatch to use SDPA when available, it is possible to force the usage of a given attention implementation ("eager" being the manual implementation, where each operation is implemented step by step):
```python
from transformers import AutoModelForSpeechSeq2Seq

# or attn_implementation="sdpa", or attn_implementation="flash_attention_2"
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", attn_implementation="eager")
```
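For intuition about what the manual, step-by-step "eager" path computes, the sketch below implements softmax(QK^T / sqrt(d)) V on small Python lists. It is a teaching aid under simplifying assumptions (single head, no mask, no dropout), and `eager_attention` is a name made up for this example; the SDPA operator computes the same quantity with fused kernels.

```python
import math


def eager_attention(query, key, value):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(query[0])
    output = []
    for q in query:
        # 1. Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        # 2. Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # 3. Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


query = [[1.0, 0.0]]
key = [[1.0, 0.0], [1.0, 0.0]]      # identical keys, so both get weight 0.5
value = [[2.0, 0.0], [4.0, 2.0]]
result = eager_attention(query, key, value)
```

The eager version materialises the full score matrix, which is the main source of the extra memory reported in the benchmarks that follow.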
Training benchmark, run on A100-SXM4-80GB.
| Model | Batch size | Sequence length | Time per batch ("eager", s) | Time per batch ("sdpa", s) | Speedup | Peak memory ("eager", MB) | Peak memory ("sdpa", MB) | Memory savings |
|-----------|------------|-----------------|-------------------------------|------------------------------|-------------|-----------------------------|----------------------------|-----------------------|
| llama2 7b | 4 | 1024 | 1.065 | 0.90 | 19.4% | 73878.28 | 45977.81 | 60.7% |
| llama2 7b | 4 | 2048 | OOM | 1.87 | / | OOM | 78394.58 | SDPA does not OOM |
| llama2 7b | 1 | 2048 | 0.64 | 0.48 | 32.0% | 55557.01 | 29795.63 | 86.4% |
| llama2 7b | 1 | 3072 | OOM | 0.75 | / | OOM | 37916.08 | SDPA does not OOM |
| llama2 7b | 1 | 4096 | OOM | 1.03 | / | OOM | 46028.14 | SDPA does not OOM |
| llama2 7b | 2 | 4096 | OOM | 2.05 | / | OOM | 78428.14 | SDPA does not OOM |
Inference benchmark, run on A100-SXM4-80GB.
| Model | Batch size | Prompt length | Num new tokens | Per token latency "eager" (ms) | Per token latency "sdpa" (ms) | Speedup |
|------------------|------------|---------------|----------------|----------------------------------|---------------------------------|-------------|
| llama2 13b | 1 | 1024 | 1 (prefill) | 178.66 | 159.36 | 12.11% |
| llama2 13b | 1 | 100 | 100 | 40.35 | 37.62 | 7.28% |
| llama2 13b | 8 | 100 | 100 | 40.55 | 38.06 | 6.53% |
| Whisper v3 large | 1 | / | 62 | 20.05 | 18.90 | 6.10% |
| Whisper v3 large | 8 | / | 77 | 25.42 | 24.77 | 2.59% |
| Whisper v3 large | 16 | / | 77 | 28.51 | 26.32 | 8.34% |
- F.scaled_dot_product_attention support by @fxmarty in #26572
New Cache abstraction & Attention Sinks support
We are rolling out a new abstraction for the past_key_values cache, which enables the use of different types of caches. For now, only llama and llama-inspired architectures (mistral, persimmon, phi) support it, with other architectures scheduled to have support in the next release. By default, a growing cache (DynamicCache) is used, which preserves the existing behavior.
This release also includes a new SinkCache cache, which implements the Attention Sinks paper. With SinkCache, the model is able to continue generating high-quality text well beyond its training sequence length! Note that it does not expand the context window, so it can’t digest very long inputs — it is suited for streaming applications such as multi-round dialogues. Check this colab for an example.
- Generate: New `Cache` abstraction and Attention Sinks support by @tomaarsen in #26681
- Generate: SinkCache can handle iterative prompts by @gante in #27907
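The retention policy behind an attention-sink cache can be sketched as follows: keep the first few "sink" tokens forever and a sliding window of the most recent tokens, evicting everything in between. This is a simplified illustration, not the library's `SinkCache` code; the helper `sink_cache_positions` and its argument names are assumptions for this example, and the sketch ignores how position information is adjusted for the kept entries.

```python
def sink_cache_positions(num_tokens, window_length, num_sink_tokens):
    """Return the token positions kept in a cache holding at most window_length entries."""
    if num_tokens <= window_length:
        # Nothing to evict yet: behaves like a growing cache.
        return list(range(num_tokens))
    sinks = list(range(num_sink_tokens))
    num_recent = window_length - num_sink_tokens
    recent = list(range(num_tokens - num_recent, num_tokens))
    return sinks + recent


short = sink_cache_positions(num_tokens=5, window_length=8, num_sink_tokens=4)
kept = sink_cache_positions(num_tokens=12, window_length=8, num_sink_tokens=4)
```

Because the cache size is bounded, memory stays constant during long generations, but tokens evicted from the window are no longer visible to the model, which is why this does not extend the usable context for long inputs.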
Safetensors as a default
We continue toggling features enabling safetensors as a default across the board, in PyTorch, Flax, and TensorFlow.
When loading a PyTorch model with use_safetensors=True to force the use of safetensors files, if the repository does not contain any safetensors files, they will now be converted on the fly, server-side.
- Default to msgpack for safetensors by @LysandreJik in #27460
- Fix `from_pt` flag when loading with `safetensors` by @LysandreJik in #27394
- Make using safetensors files automated. by @Narsil in #27571
Breaking changes
pickle files
We now disallow the use of pickle.load internally for security purposes. To circumvent this, set the TRUST_REMOTE_CODE=True environment variable to indicate that you would still like to load pickle files.
- 🚨🚨🚨 Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` by @ydshieh in #27776
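The opt-in pattern described above can be sketched as a small guard around unpickling. This is a hypothetical helper written for illustration, not the code used in the library: `guarded_pickle_loads` is an invented name, and both reading TRUST_REMOTE_CODE from the process environment and the accepted spellings of "true" are assumptions of this sketch.

```python
import os
import pickle


def guarded_pickle_loads(data: bytes):
    """Hypothetical helper: refuse to unpickle unless the user explicitly opted in."""
    opted_in = os.environ.get("TRUST_REMOTE_CODE", "False").lower() in ("1", "true", "yes")
    if not opted_in:
        raise ValueError(
            "Loading pickle data is disabled; set TRUST_REMOTE_CODE=True to allow it."
        )
    return pickle.loads(data)


payload = pickle.dumps({"vocab_size": 10})

# Without the opt-in, loading is refused.
os.environ.pop("TRUST_REMOTE_CODE", None)
try:
    guarded_pickle_loads(payload)
    blocked = False
except ValueError:
    blocked = True

# With the opt-in, the same payload loads normally.
os.environ["TRUST_REMOTE_CODE"] = "True"
loaded = guarded_pickle_loads(payload)
```

Only opt in for files from sources you trust, since unpickling can execute arbitrary code.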
Beam score calculation for decoder-only models
In the previous implementation of beam search, when length_penalty is active, the beam score for decoder-only models was penalized by the total length of both prompt and generated sequence. However, the length of prompt should not be included in the penalization step -- this release fixes it.
- 🚨🚨 Fix beam score calculation issue for decoder-only models by @VsonicV in #27351
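A small worked example shows the effect of the fix. It assumes the usual length-normalised beam score, the sum of token log-probabilities divided by length raised to `length_penalty`; the function `beam_score` and its `include_prompt` switch are illustrative names for this sketch, not library API.

```python
def beam_score(sum_logprobs, prompt_length, generated_length, length_penalty, include_prompt):
    """Length-normalised beam score: sum_logprobs / length ** length_penalty."""
    length = generated_length + (prompt_length if include_prompt else 0)
    return sum_logprobs / (length ** length_penalty)


# Same hypothesis scored both ways: 20 prompt tokens, 4 generated tokens.
old_score = beam_score(-6.0, prompt_length=20, generated_length=4,
                       length_penalty=1.0, include_prompt=True)
new_score = beam_score(-6.0, prompt_length=20, generated_length=4,
                       length_penalty=1.0, include_prompt=False)
```

With the prompt counted, the divisor is dominated by the prompt length, so hypotheses of different generated lengths are normalised almost identically and `length_penalty` has little effect. Counting only the generated tokens restores the intended behaviour.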
Slight API changes/corrections
- ⚠️ [VitDet] Fix test by @NielsRogge in #27832
- [⚠️ removed a default argument] Make `AttentionMaskConverter` compatible with `torch.compile(..., fullgraph=True)` by @fxmarty in #27868
Bugfixes and improvements
- Enrich TTS pipeline parameters naming by @ylacombe in #26473
- translate peft.md to chinese by @jiaqiw09 in #27215
- Removed the redundant SiLUActivation class. by @hi-sushanta in #27136
- Fixed base model class name extraction from PeftModels by @kkteru in #27162
- Fuyu protection by @LysandreJik in #27248
- Refactor: Use Llama RoPE implementation for Falcon by @tomaarsen in #26933
- [`PEFT` / `Tests`] Fix peft integration failing tests by @younesbelkada in #27258
- Avoid many failing tests in doctesting by @ydshieh in #27262
- [docs] Custom model doc update by @MKhalusova in #27213
- Update the ConversationalPipeline docstring for chat templates by @Rocketknight1 in #27250
- Fix switch transformer mixed precision issue by @timlee0212 in #27220
- [`Docs` / `SAM`] Reflect correct changes to run inference without OOM by @younesbelkada in #27268
- [Docs] Model_doc structure/clarity improvements by @MKhalusova in #26876
- [`FA2`] Add flash attention for `DistilBert` by @susnato in #26489
- translate autoclass_tutorial to chinese by @jiaqiw09 in #27269
- translate run_scripts.md to chinese by @jiaqiw09 in #27246
- Fix tokenizer export for LLamaTokenizerFast by @mayank31398 in #27222
- Fix daily CI image build by @ydshieh in #27307
- Update doctest workflow file by @ydshieh in #27306
- Remove an unexpected argument for FlaxResNetBasicLayerCollection by @pingzhili in #27272
- enable memory tracker metrics for npu by @statelesshz in #27280
- [`PretrainedTokenizer`] add some of the most important functions to the doc by @ArthurZucker in #27313
- Update sequence_classification.md by @akshayvkt in #27281
- Fix VideoMAEforPretrained dtype error by @ikergarcia1996 in #27296
- Fix `Kosmos2Processor` batch mode by @ydshieh in #27323
- [docs] fixed links with 404 by @MKhalusova in #27327
- [Whisper] Block language/task args for English-only by @sanchit-gandhi in #27322
- Fix autoawq docker image by @younesbelkada in #27339
- Generate: skip tests on unsupported models instead of passing by @gante in #27265
- Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function by @zuazo in #26834
- [`FA2`] Add flash attention for `GPT-Neo` by @susnato in #26486
- [`Whisper`] Add conversion script for the tokenizer by @ArthurZucker in #27338
- Remove a redundant variable. by @hi-sushanta in #27288
- Resolve AttributeError by utilizing device calculation at the start of the forward function by @folbaeni in #27347
- Remove padding_masks from `gpt_bigcode`. by @susnato in #27348
- [`Whisper`] Nit converting the tokenizer by @ArthurZucker in #27349
- FIx Bark batching feature by @ylacombe in #27271
- Allow scheduler parameters by @Plemeur in #26480
- translate the en tokenizer_summary.md to Chinese by @ZouJiu1 in #27291
- translate model_sharing.md and llm_tutorial.md to chinese by @jiaqiw09 in #27283
- Add numpy alternative to FE using torchaudio by @ylacombe in #26339
- moving example of benchmarking to legacy dir by @statelesshz in #27337
- Fix example tests from failing by @muellerzr in #27353
- Fix `Kosmos-2` device issue by @ydshieh in #27346
- MusicGen Update by @sanchit-gandhi in #27084
- Translate index.md to Turkish by @mertyyanik in #27093
- Remove unused param from example script tests by @muellerzr in #27354
- [Flax Whisper] large-v3 compatibility by @sanchit-gandhi in #27360
- Fix tiny model script: not using `from_pt=True` by @ydshieh in #27372
- translate big_models.md and performance.md to chinese by @jiaqiw09 in #27334
- Add Flash Attention 2 support to Bark by @ylacombe in #27364
- Update deprecated `torch.range` in `test_modeling_ibert.py` by @kit1980 in #27355
- translate debugging.md to chinese by @jiaqiw09 in #27374
- Smangrul/fix failing ds ci tests by @pacman100 in #27358
- [`CodeLlamaTokenizer`] Nit, update init to make sure the AddedTokens are not normalized because they are special by @ArthurZucker in #27359
- Change thresh in test by @muellerzr in #27378
- Put doctest options back to `pyproject.toml` by @ydshieh in #27366
- Skip failing cache call tests by @amyeroberts in #27393
- device-agnostic deepspeed testing by @statelesshz in #27342
- Adds dvclive callback by @dberenbaum in #27352
- use `pytest.mark` directly by @ydshieh in #27390
- Fix fuyu checkpoint repo in `FuyuConfig` by @ydshieh in #27399
- Use editable install for git deps by @muellerzr in #27404
- Final fix of the accelerate installation issue by @ydshieh in #27408
- Fix RequestCounter to make it more future-proof by @Wauplin in #27406
- remove failing tests and clean FE files by @ylacombe in #27414
- Fix `Owlv2` checkpoint name and a default value in `Owlv2VisionConfig` by @ydshieh in #27402
- Run all tests if `circleci/create_circleci_config.py` is modified by @ydshieh in #27413
- add attention_mask and position_ids in assisted model by @jiqing-feng in #26892
- [`Quantization`] Add str to enum conversion for AWQ by @younesbelkada in #27320
- update Bark FA2 docs by @ylacombe in #27400
- [`AttentionMaskConverter`] ]Fix-mask-inf by @ArthurZucker in #27114
- At most 2 GPUs for CI by @ydshieh in #27435
- Normalize floating point cast by @amyeroberts in #27249
- Make `examples_torch_job` faster by @ydshieh in #27437
- Fix line ending in `utils/not_doctested.txt` by @ydshieh in #27459
- Fix some Wav2Vec2 related models' doctest by @ydshieh in #27462
- Fixed typo in error message by @cmcmaster1 in #27461
- Remove-auth-token by @ArthurZucker in #27060
- [`Llama + Mistral`] Add attention dropout by @ArthurZucker in #27315
- OWLv2: bug fix in post_process_object_detection() when using cuda device by @assafbot in #27468
- Fix docstring for `gradient_checkpointing_kwargs` by @tomaszcichy98 in #27470
- Install `python-Levenshtein` for `nougat` in CI image by @ydshieh in #27465
- Add version check for Jinja by @Rocketknight1 in #27403
- Fix Falcon tokenizer loading in pipeline by @Rocketknight1 in #27316
- [`AWQ`] Addresses TODO for awq tests by @younesbelkada in #27467
- Perf torch compile by @jiaqiw09 in #27422
- Fixed typo in pipelines.md documentation by @adismort14 in #27455
- Fix FA2 import + deprecation cycle by @SunMarc in #27330
- [`Peft`] `modules_to_save` support for peft integration by @younesbelkada in #27466
- [`CI-test_torch`] skip `test_tf_from_pt_safetensors` for 4 models by @ArthurZucker in #27481
- Fix M4T weights tying by @ylacombe in #27395
- Add speecht5 batch generation and fix wrong attention mask when padding by @Spycsh in #25943
- Clap processor: remove wasteful np.stack operations by @m-bain in #27454
- [Whisper] Fix pipeline test by @sanchit-gandhi in #27442
- Revert "[time series] Add PatchTST by @amyeroberts in #25927)"
- translate hpo_train.md and perf_hardware.md to chinese by @jiaqiw09 in #27431
- Generate: fix `ExponentialDecayLengthPenalty` doctest by @gante in #27485
- Update and reorder docs for chat templates by @Rocketknight1 in #27443
- Generate: `GenerationConfig.from_pretrained` can return unused kwargs by @gante in #27488
- Minor type annotation fix by @vwxyzjn in #27276
- Have seq2seq just use gather by @muellerzr in #27025
- Update processor mapping for hub snippets by @amyeroberts in #27477
- Track the number of tokens seen to metrics by @muellerzr in #27274
- [`CI-test_torch`] skip `test_tf_from_pt_safetensors` and `test_assisted_decoding_sample` by @ArthurZucker in #27508
- [Fuyu] Add tests by @NielsRogge in #27001
- [Table Transformer] Add Transformers-native checkpoints by @NielsRogge in #26928
- Update spelling mistake by @LimJing7 in #27506
- [`CircleCI`] skip test_assisted_decoding_sample for everyone by @ArthurZucker in #27511
- Make some jobs run on the GitHub Actions runners by @ydshieh in #27512
- [`tokenizers`] update `tokenizers` version pin by @ArthurZucker in #27494
- [`PretrainedConfig`] Improve messaging by @ArthurZucker in #27438
- Fix wav2vec2 params by @muellerzr in #27515
- Translating `en/model_doc` docs to Japanese. by @Yuki-Imajuku in #27401
- Fixing the failure of models without max_position_embeddings attribute. by @AdamLouly in #27499
- Incorrect setting for num_beams in translation and summarization examples by @Rocketknight1 in #27519
- Fix bug for T5x to PyTorch convert script with varying encoder and decoder layers by @JamesJiang97 in #27448
- Fix offload disk for loading derivated model checkpoint into base model by @SunMarc in #27253
- translate model.md to chinese by @statelesshz in #27518
- Support ONNX export for causal LM sequence classifiers by @dwyatte in #27450
- [`pytest`] Avoid flash attn test marker warning by @ArthurZucker in #27509
- docs: add docs for map, and add num procs to load_dataset by @pphuc25 in #27520
- Update the TF pin for 2.15 by @Rocketknight1 in #27375
- Revert "add attention_mask and position_ids in assisted model" by @patrickvonplaten in #27523
- Set `usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in #27483
- Raise error when quantizing a quantized model by @SunMarc in #27500
- Disable docker image build job `latest-pytorch-amd` for now by @ydshieh in #27541
- [`Styling`] stylify using ruff by @ArthurZucker in #27144
- Generate: improve assisted generation tests by @gante in #27540
- Updated albert.md doc for ALBERT model by @ENate in #27223
- translate Trainer.md to chinese by @jiaqiw09 in #27527
- Skip some fuyu tests by @ydshieh in #27553
- Fix AMD CI not showing GPU by @ydshieh in #27555
- Generate: fix flaky tests by @gante in #27543
- Generate: update compute transition scores doctest by @gante in #27558
- fixed broken link by @VpkPrasanna in #27560
- Broken links fixed related to datasets docs by @VpkPrasanna in #27569
- translate deepspeed.md to chinese by @jiaqiw09 in #27495
- Fix broken distilbert url by @osanseviero in #27579
- Adding leaky relu in dict ACT2CLS by @rafaelpadilla in #27574
- Fix idx2sym not loaded from pretrained vocab file in Transformer XL by @jtang98 in #27589
- Add `convert_hf_to_openai.py` script to Whisper documentation resources by @zuazo in #27590
- docs: fix 404 link by @panpan0000 in #27529
- [`examples`] fix loading jsonl with load dataset in run translation example by @mathiasesn in #26924
- [`FA-2`] Add fa2 support for `from_config` by @younesbelkada in #26914
- timm to pytorch conversion for vit model fix by @staghado in #26908
- [Whisper] Add `large-v3` version support by @flyingleafe in #27336
- Update Korean tutorial for using LLMs, and refactor the nested conditional statements in hr_argparser.py by @YeonwooSung in #27489
- Fix torch.fx import issue for torch 1.12 by @amyeroberts in #27570
- dvclive callback: warn instead of fail when logging non-scalars by @dberenbaum in #27608
- [`core/gradient_checkpointing`] add support for old GC method by @younesbelkada in #27610
- [ConvNext] Improve backbone by @NielsRogge in #27621
- Generate: Update docs regarding reusing `past_key_values` in `generate` by @gante in #27612
- Idefics: Fix information leak with cross attention gate in modeling by @leot13 in #26839
- Fix flash attention bugs with Mistral and Falcon by @fxmarty in #27625
- Fix tracing dinov2 by @amyeroberts in #27561
- remove the deprecated method `init_git_repo` by @statelesshz in #27617
- Explicitly specify `use_cache=True` in Flash Attention tests by @fxmarty in #27635
- Harmonize HF environment variables + other cleaning by @Wauplin in #27564
- Fix `resize_token_embeddings` by @czy-orange in #26861
- [`dependency`] update pillow pins by @ArthurZucker in #27409
- Simplify the implementation of jitter noise in moe models by @jiangwangyi in #27643
- Fix `max_steps` documentation regarding the end-of-training condition by @qgallouedec in #27624
- [Whisper] Add sequential longform decoding by @patrickvonplaten in #27492
- Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration by @dg845 in #24799
- update Openai API call method by @Strive-for-excellence in #27628
- update d_kv's annotation in mt5's configuration by @callanwu in #27585
- [`FA2`] Add flash attention for opt by @susnato in #26414
- Extended semantic segmentation to image segmentation by @merveenoyan in #27039
- Update TVP arxiv link by @amyeroberts in #27672
- [DPT, Dinov2] Add resources by @NielsRogge in #27655
- Update tiny model summary file by @ydshieh in #27388
- Refactoring Trainer, adds `save_only_model` arg and simplifying FSDP integration by @pacman100 in #27652
- Skip pipeline tests for 2 models for now by @ydshieh in #27687
- Deprecate `TransfoXL` by @ydshieh in #27607
- Fix typo in warning message by @liuxueyang in #27055
- Docs/Add conversion code to the musicgen docs by @yoinked-h in #27665
- Fix semantic error in evaluation section by @anihm136 in #27675
- [`DocString`] Support a revision in the docstring `add_code_sample_docstrings` to facilitate integrations by @ArthurZucker in #27645
- Successfully Resolved The ZeroDivisionError Exception. by @hi-sushanta in #27524
- Fix `TVPModelTest` by @ydshieh in #27695
- Fix sliding_window hasattr in Mistral by @IlyaGusev in #27041
- Fix Past CI by @ydshieh in #27696
- fix warning by @ArthurZucker in #27689
- Reorder the code on the Hub to explicit that sharing on the Hub isn't a requirement by @LysandreJik in #27691
- Fix mistral generate for long prompt / response by @lorabit110 in #27548
- Fix oneformer instance segmentation RuntimeError by @yhshin11 in #27725
- fix assisted decoding assistant model inputs by @jiqing-feng in #27503
- Update forward signature test for vision models by @NielsRogge in #27681
- Modify group_sub_entities in TokenClassification Pipeline to support label with "-" by @eshoyuan in #27325
- Fix owlv2 code snippet by @NielsRogge in #27698
- docs: replace torch.distributed.run by torchrun by @panpan0000 in #27528
- Update chat template warnings/guides by @Rocketknight1 in #27634
- translation main-class files to chinese by @jiaqiw09 in #27588
- Translate `en/model_doc` to JP by @rajveer43 in #27264
- Fixed passing scheduler-specific kwargs via TrainingArguments lr_scheduler_kwargs by @CharbelAD in #27595
- Fix AMD Push CI not triggered by @ydshieh in #27732
- Add BeitBackbone by @NielsRogge in #25952
- Update tiny model creation script by @ydshieh in #27674
- Log a warning in `TransfoXLTokenizer.__init__` by @ydshieh in #27721
- Add madlad-400 MT models by @jbochi in #27471
- Enforce pin memory disabling when using cpu only by @qgallouedec in #27745
- Trigger corresponding pipeline tests if `tests/utils/tiny_model_summary.json` is modified by @ydshieh in #27693
- CLVP Fixes by @susnato in #27547
- Docs: Fix broken cross-references, i.e. `~transformer.` -> `~transformers.` by @tomaarsen in #27740
- [docs] Quantization by @stevhliu in #27641
- Fix precision errors from casting rotary parameters to FP16 with AMP by @kevinhu in #27700
- Remove `check_runner_status.yml` by @ydshieh in #27767
- uses dvclive_test mode in examples/pytorch/test_accelerate_examples.py by @dberenbaum in #27763
- Generate: `GenerationConfig` throws an exception when `generate` args are passed by @gante in #27757
- Fix unsupported setting of self._n_gpu in training_args on XPU devices by @Liangliang-Ma in #27716
- [SeamlessM4Tv2] Fix links in README by @xenova in #27782
- [i18n-fr] Translate installation to French by @NoB0 in #27657
- Fixes for PatchTST Config by @wgifford in #27777
- Better error message for bitsandbytes import by @SunMarc in #27764
- [MusicGen] Fix audio channel attribute by @sanchit-gandhi in #27440
- [JAX] Replace uses of jax.devices("cpu") with jax.local_devices(backend="cpu") by @hvaara in #27593
- Improve forward signature test by @NielsRogge in #27729
- Fix typo in max_length deprecation warnings by @siegeln in #27788
- Add `persistent_workers` parameter to `TrainingArguments` by @Sorrow321 in #27189
- [`ModelOnTheFlyConversionTester`] Mark as slow for now by @ArthurZucker in #27823
- Fix `TvpModelIntegrationTests` by @ydshieh in #27792
- Fix `Owlv2ModelIntegrationTest::test_inference_object_detection` by @ydshieh in #27793
- Keypoints 0.0 are confusing ../transformers/models/detr/image_processing_detr.py which are fixed by @hackpk in #26250
- [Seamless v1] Link to v2 docs by @sanchit-gandhi in #27827
- [Whisper] Fix doctest in timestamp logits processor by @sanchit-gandhi in #27795
- Added test cases for rembert refering to albert and reformer test_tok… by @nileshkokane01 in #27637
- [Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors by @yeounoh in #27799
- single word should be set to False by @ArthurZucker in #27738
- [Seamless v2] Add FE to auto mapping by @sanchit-gandhi in #27829
- translate internal folder files to chinese by @jiaqiw09 in #27638
- Translate `en/tasks` folder docs to Japanese 🇯🇵 by @rajveer43 in #27098
- pin `ruff==0.1.5` by @ydshieh in #27849
- Make image processors more general by @NielsRogge in #27690
- Faster generation using AWQ + Fused modules by @younesbelkada in #27411
- Generate: Update VisionEncoderDecoder test value by @gante in #27850
- [`ClipVision`] `accelerate` support for clip-vision by @younesbelkada in #27851
- Add Llama Flax Implementation by @vvvm23 in #24587
- Move tensors to same device to enable IDEFICS naive MP training by @willemsenbram in #27746
- Update `VitDetModelTester.get_config` to use `pretrain_image_size` by @ydshieh in #27831
- fix(whisper): mutable generation config by @badayvedat in #27833
- Documentation: Spanish translation of perplexity.mdx by @aaronjimv in #27807
- [`Docs`] Update broken image on fused modules by @younesbelkada in #27856
- Update CUDA versions for DeepSpeed by @muellerzr in #27853
- removed the delete doc workflows by @MKhalusova in #27852
- Avoid class attribute `_keep_in_fp32_modules` being modified by @ydshieh in #27867
- [`Flash Attention 2`] Add flash attention 2 for GPT-Neo-X by @younesbelkada in #26463
- Translating en/model_doc folder docs to Japanese (from `blip` to `clap`) 🇯🇵 by @rajveer43 in #27673
- Fix beam score calculation issue for JAX version by @VsonicV in #27816
- Fix bug of _prepare_4d_attention_mask by @jiqing-feng in #27847
- [i18n-fr] Translate autoclass tutorial to French by @NoB0 in #27659
- [`FA-2`] Add Flash Attention to `Phi` by @susnato in #27661
- fix: fix gradient accumulate step for learning rate by @pphuc25 in #27667
- Allow `# Ignore copy` by @ydshieh in #27328
- update `create_model_card` to properly save peft details when using Trainer with PEFT by @pacman100 in #27754
- update version of warning notification for `get_default_device` to v4.38 by @statelesshz in #27848
- Fix device of masks in tests by @fxmarty in #27887
- Show new failing tests in a more clear way in slack report by @ydshieh in #27881
- Fix TF loading PT safetensors when weights are tied by @Rocketknight1 in #27490
- Generate: All logits processors are documented and have examples by @gante in #27796
- [docs] Custom semantic segmentation dataset by @stevhliu in #27859
- Updates the distributed CPU training documentation to add instructions for running on a Kubernetes cluster by @dmsuehir in #27780
- Translate `model_doc` files from `clip` to `cpm` to JP by @rajveer43 in #27774
- Fix: Raise informative exception when `prefix_allowed_tokens_fn` return empty set of tokens by @Saibo-creator in #27797
- Added passing parameters to "reduce_lr_on_plateau" scheduler by @CharbelAD in #27860
- fix: non-atomic checkpoint save by @thundergolfer in #27820
- Fix beam score calculation issue for Tensorflow version by @VsonicV in #27814
- Fix remaining issues in beam score calculation by @VsonicV in #27808
- Fix CLAP converting script by @ylacombe in #27153
- mark `test_initialization` as flaky in 2 model tests by @ydshieh in #27906
- Fix `notification_service.py` by @ydshieh in #27903
- Fix 2 tests in `FillMaskPipelineTests` by @ydshieh in #27889
- Llama conversion script: adjustments for Llama Guard by @pcuenca in #27910
- fix llava by @ArthurZucker in #27909
- Allow `resume_from_checkpoint` to handle `auto_find_batch_size` by @muellerzr in #27568
- [Doc] Spanish translation of pad_truncation.md by @aaronjimv in #27890
- fix typo in image_processing_blip.py Wwhether -> Whether by @zhc7 in #27899
- [CLAP] Replace hard-coded batch size to enable dynamic ONNX export by @xenova in #27790
- [integration] Update Ray Tune integration for Ray 2.7 by @justinvyu in #26499
- Fix typo by @f4hy in #27918
- [DETA] fix backbone freeze/unfreeze function by @SangbumChoi in #27843
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @jiaqiw09
- translate peft.md to chinese (#27215)
- translate autoclass_tutorial to chinese (#27269)
- translate run_scripts.md to chinese (#27246)
- translate model_sharing.md and llm_tutorial.md to chinese (#27283)
- translate big_models.md and performance.md to chinese (#27334)
- translate debugging.md to chinese (#27374)
- Perf torch compile (#27422)
- translate hpo_train.md and perf_hardware.md to chinese (#27431)
- translate Trainer.md to chinese (#27527)
- translate deepspeed.md to chinese (#27495)
- translation main-class files to chinese (#27588)
- translate internal folder files to chinese (#27638)
- @susnato
- [`FA2`] Add flash attention for `DistilBert` (#26489)
- [`FA2`] Add flash attention for `GPT-Neo` (#26486)
- Remove padding_masks from `gpt_bigcode`. (#27348)
- Add CLVP (#24745)
- Add Phi-1 and Phi-1_5 (#26170)
- [`FA2`] Add flash attention for opt (#26414)
- CLVP Fixes (#27547)
- [`FA-2`] Add Flash Attention to `Phi` (#27661)
- @jiqing-feng
- add attention_mask and position_ids in assisted model (#26892)
- TVP model (#25856)
- fix assisted decoding assistant model inputs (#27503)
- Fix bug of _prepare_4d_attention_mask (#27847)
- @psinthong
- [time series] Add PatchTST (#25927)
- @Yuki-Imajuku
- Translating `en/model_doc` docs to Japanese. (#27401)
- @dg845
- Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration (#24799)
- @rajveer43
- Translate `en/model_doc` to JP (#27264)
- Translate `en/tasks` folder docs to Japanese 🇯🇵 (#27098)
- Translating en/model_doc folder docs to Japanese (from `blip` to `clap`) 🇯🇵 (#27673)
- Translate `model_doc` files from `clip` to `cpm` to JP (#27774)
- @NoB0
- [i18n-fr] Translate installation to French (#27657)
- [i18n-fr] Translate autoclass tutorial to French (#27659)
- @ajati
- [Time series] Add PatchTSMixer (#26247)
- @vvvm23
- Add Llama Flax Implementation (#24587)
Published by LysandreJik over 2 years ago
transformers - Patch release: v4.35.2
A patch release was made for the following commit, which fixes the versioning issues regarding tokenizers and huggingface_hub:
- [`tokenizers`] update `tokenizers` version pin #27494
Published by ArthurZucker over 2 years ago
transformers - Patch release: v4.35.1
A patch release was made for the following three commits:
- Fix FA2 import + deprecation cycle (#27330)
- Fix from_pt flag when loading with safetensors (#27394)
- Default to msgpack for safetensors (#27460)
Published by LysandreJik over 2 years ago
transformers - Safetensors serialization by default, DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, Owl-v2
New models
Distil-Whisper
Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution data. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.
Distil-Whisper copies the entire encoder from Whisper, meaning it retains Whisper's robustness to different audio conditions. It only copies 2 decoder layers, which significantly reduces the time taken to auto-regressively generate text tokens:

Distil-Whisper is MIT licensed and directly available in the Transformers library with chunked long-form inference, Flash Attention 2 support, and Speculative Decoding. For details on using the model, refer to the following instructions.
Joint work from @sanchit-gandhi, @patrickvonplaten and @srush.
- [Assistant Generation] Improve Encoder Decoder by @patrickvonplaten in #26701
- [WhisperForCausalLM] Add WhisperForCausalLM for speculative decoding by @patrickvonplaten in #27195
- [Whisper, Bart, MBart] Add Flash Attention 2 by @patrickvonplaten in #27203
Fuyu

The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
Joint work from @molbap, @pcuenca, @amyeroberts, @ArthurZucker
- Add fuyu model by @molbap in #26911
- Fuyu: improve image processing by @molbap in #27007
SeamlessM4T

The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T enables multiple tasks without relying on separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.
- Add Seamless M4T model by @ylacombe in #25693
Kosmos-2
The KOSMOS-2 model was proposed in Kosmos-2: Grounding Multimodal Large Language Models to the World by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale dataset of grounded image-text pairs GRIT. The spatial coordinates of the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective entity text spans (for example, a snowman followed by
- Add `Kosmos-2` model by @ydshieh in #24709
Owl-v2
OWLv2 was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up OWL-ViT using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection.
- Add OWLv2, bis by @NielsRogge in #26668
🚨🚨🚨 Safetensors by default for torch serialization 🚨🚨🚨
Version v4.35.0 makes safetensors serialization the default. This is a significant change aimed at making users of the Hugging Face Hub, transformers, and any downstream library leveraging it safer.
The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).
It has been the default loading mechanism since v4.30.0, so model.safetensors files were already loaded instead of pytorch_model.bin whenever they were present in the repository.
With v4.35.0, any call to save_pretrained for torch models will now save a safetensors file. This safetensors file is in the PyTorch format, but can be loaded in TensorFlow and Flax models alike.
⚠️ If you run into any issues with this, please let us know ASAP in the issues so that we may help you. Namely, the following errors may indicate something is up:
- Loading a safetensors file and having a warning mentioning missing weights unexpectedly
- Obtaining completely wrong/random results at inference after loading a pretrained model that you have saved in safetensors
If you wish to continue saving files in the .bin format, you can do so by specifying safe_serialization=False in all your save_pretrained calls.
- Safetensors serialization by default by @LysandreJik in #27064
Chat templates
Chat templates have been expanded with the addition of the add_generation_prompt argument to apply_chat_template(). This has also enabled us to rework the ConversationalPipeline class to use chat templates. Any model with a chat template is now automatically usable through ConversationalPipeline.
- Add add_generation_prompt argument to apply_chat_template by @Rocketknight1 in #26573
- Conversation pipeline fixes by @Rocketknight1 in #26795
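As a rough illustration of what the add_generation_prompt flag changes, the sketch below uses a hypothetical `render_chat` helper with ChatML-style markers. Both the helper and the markers are assumptions made for illustration; this is not the transformers implementation, and real templates differ from model to model.

```python
def render_chat(messages, add_generation_prompt=False):
    """Hypothetical ChatML-style renderer (illustration only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers together with its role
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with a reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hi there!"}]
prompt_without_cue = render_chat(messages)
prompt_with_cue = render_chat(messages, add_generation_prompt=True)
```

Without the flag the rendered text ends after the last complete turn, which suits training data; with it, the text ends with an open assistant turn, which is what you would typically want before calling generate.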
Guides
Two new guides on LLMs were added to the library:
- [docs] LLM prompting guide by @MKhalusova in #26274
- [docs] Optimizing LLMs by @patrickvonplaten in #26058
Quantization
Exllama-v2 integration
Exllama-v2 provides a better GPTQ kernel, with higher throughput and lower latency for GPTQ models. The original code can be found here.
- add exllamav2 arg by @SunMarc in #26437
- Add exllamav2 better by @SunMarc in #27111
You will need the latest versions of optimum and auto-gptq. Read more about the integration here.
AWQ integration
AWQ is a new and popular quantization scheme, already used in various libraries such as TGI, vllm, etc. and known to be faster than GPTQ models according to some benchmarks. The original code can be found here and here you can read more about the original paper.
We support AWQ inference with the original kernels as well as with kernels provided through the autoawq package, which you can install with `pip install autoawq`.
- [`core/Quantization`] AWQ integration by @younesbelkada in #27045
We also provide an example script, in the original repository, showing how to push quantized weights to the Hub.
Read more about the benchmarks and the integration here
GPTQ on CPU !
You can now run GPTQ models on CPU using the latest version of auto-gptq thanks to @vivekkhandelwal1 !
- Add support for loading GPTQ models on CPU by @vivekkhandelwal1 in #26719
Attention mask refactor
We refactored the attention mask logic for major models in transformers. For instance, we removed the padding_mask argument, which was ambiguous for some users.
- Remove ambiguous `padding_mask` and instead use a 2D->4D Attn Mask Mapper by @patrickvonplaten in #26792
- [Attention Mask] Refactor all encoder-decoder attention mask by @patrickvonplaten in #27086
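To make the "2D->4D" mapping concrete, here is a minimal pure-Python sketch of the idea. The function name `expand_mask_2d_to_4d` and the nested-list layout are hypothetical and chosen for readability; the library works on tensors and its actual converter may differ in details.

```python
NEG_INF = float("-inf")


def expand_mask_2d_to_4d(padding_mask, causal=True):
    """Expand a [batch][seq_len] padding mask (1 = token, 0 = padding)
    into an additive mask shaped [batch][1][query_len][key_len].

    Allowed positions hold 0.0 and blocked positions hold -inf, so the
    result can be added to attention scores before the softmax.
    """
    expanded = []
    for row in padding_mask:
        seq_len = len(row)
        grid = []
        for q in range(seq_len):
            grid.append([
                # A key is visible if it is a real token and, in the
                # causal case, does not lie after the query position
                0.0 if row[k] == 1 and (not causal or k <= q) else NEG_INF
                for k in range(seq_len)
            ])
        expanded.append([grid])  # singleton head dimension
    return expanded


mask_4d = expand_mask_2d_to_4d([[1, 1, 0]])
```

In this sketch a single 4D mask carries both the padding information and the causal structure, which is why a separate padding_mask argument is no longer needed.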
Flash Attention 2 for more models + quantization fine-tuning bug fix
Gpt-bigcode (starcoder), Whisper, Bart and MBart now support FA-2! Use it by passing use_flash_attention_2=True to from_pretrained. Some bugfixes for mixed precision training with FA2 have also been addressed.
- Add flash attention for `gpt_bigcode` by @susnato in #26479
- [`FA2`] Fix flash attention 2 fine-tuning with Falcon by @younesbelkada in #26852
- [Whisper, Bart, MBart] Add Flash Attention 2 by @patrickvonplaten in #27203
A bug affecting fine-tuning with FA-2 in bfloat16 was also fixed. You should now be able to fine-tune FA-2 models in bfloat16 smoothly using quantized base models.
- 🚨🚨🚨 [`Quantization`] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
- [`FA-2`] Final fix for FA2 dtype by @younesbelkada in #26846
Neftune
NEFTune is a new technique to boost Supervised Fine-tuning performance by adding random noise to the embedding vectors. Read more about it in the original paper here

We propose a very simple API for users to benefit from this technique: simply pass a valid neftune_noise_alpha parameter to TrainingArguments.
Read more about the API here
- [FEAT] Add Neftune into transformers Trainer by @younesbelkada in #27141
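The noise itself is simple to sketch. The pure-Python example below follows the description in the NEFTune paper, where uniform noise is scaled by alpha / sqrt(L * d) for a sequence of L tokens with embedding dimension d. The function name `add_neftune_noise` and the list-of-lists representation are hypothetical; the Trainer applies the equivalent operation to tensors during training.

```python
import math
import random


def add_neftune_noise(embeddings, noise_alpha, rng=None):
    """Return a noisy copy of a [seq_len][hidden_dim] embedding matrix."""
    if rng is None:
        rng = random.Random()
    seq_len = len(embeddings)
    hidden_dim = len(embeddings[0])
    # Noise magnitude shrinks as the sequence or hidden size grows
    magnitude = noise_alpha / math.sqrt(seq_len * hidden_dim)
    return [
        [value + rng.uniform(-magnitude, magnitude) for value in token]
        for token in embeddings
    ]


embeddings = [[0.0] * 4 for _ in range(4)]  # 4 tokens, hidden size 4
noisy = add_neftune_noise(embeddings, noise_alpha=5.0, rng=random.Random(0))
```

With 4 tokens of dimension 4 and an alpha of 5, every perturbation lies within 5 / sqrt(16) = 1.25 of the original value. The noise is meant for training only, not for inference.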
Gradient checkpointing refactor
We have refactored the gradient checkpointing API so that users can pass keyword arguments supported by torch.utils.checkpoint.checkpoint directly through gradient_checkpointing_kwargs when calling gradient_checkpointing_enable(), e.g.
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
# Keyword arguments are forwarded to torch.utils.checkpoint.checkpoint
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```
gradient_checkpointing_kwargs is also supported with Trainer through TrainingArguments.
- [`Trainer/GC`] Add `gradient_checkpointing_kwargs` in trainer and training arguments by @younesbelkada in #27068
- [`core`] Refactor of `gradient_checkpointing` by @younesbelkada in #27020
- [`core/GC/tests`] Stronger GC tests by @younesbelkada in #27124
- Fix import of torch.utils.checkpoint by @NielsRogge in #27155
The refactor should be fully backward compatible with previous behaviour. Advanced users can still set the gradient_checkpointing attribute on a model's submodules to enable or disable gradient checkpointing for them.
Breaking changes
- 🚨🚨🚨 [`Quantization`] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
- 🚨🚨 Generate: change order of ops in beam sample to avoid nans by @gante in #26843
- 🚨🚨 Raise error when no speaker embeddings in speecht5._generate_speech by @ylacombe in #26418
Bugfixes and improvements
- [`Nougat`] from transformers import * by @ArthurZucker in #26562
- [Whisper] Allow basic text normalization by @sanchit-gandhi in #26149
- 🌐 [i18n-KO] Translated `semantic_segmentation.md` to Korean by @jungnerd in #26515
- [Tokenizers] Skip tests temporarily by @LysandreJik in #26574
- docs: feat: add clip notebook resources from OSSCA community by @junejae in #26505
- Extend Trainer to enable Ascend NPU to use the fused Adamw optimizer when training by @statelesshz in #26194
- feat: add trainer label to wandb run upon initialization by @parambharat in #26466
- Docstring check by @sgugger in #26052
- refactor: change default block_size by @pphuc25 in #26229
- [Mistral] Update config docstring by @sanchit-gandhi in #26593
- Add # Copied from statements to audio feature extractors that use the floats_list function by @dg845 in #26581
- Fix embarrassing typo in the doc chat template! by @Rocketknight1 in #26596
- Fix encoder->decoder typo bug in convert_t5x_checkpoint_to_pytorch.py by @soyoung97 in #26587
- skip flaky hub tests by @ArthurZucker in #26594
- Update mistral.md to update 404 link by @Galland in #26590
- [Wav2Vec2] Fix tokenizer set lang by @sanchit-gandhi in #26349
- add zh translation for installation by @yyLeaves in #26084
- [`NougatProcessor`] Fix the default channel by @ArthurZucker in #26608
- [`GPTNeoX`] Faster rotary embedding for GPTNeoX (based on llama changes) by @ArthurZucker in #25830
- [Falcon] Set `use_cache=False` before creating `presents` which relies on `use_cache` by @yundai424 in #26328
- Fix failing tests on `main` due to torch 2.1 by @ydshieh in #26607
- Make `ModelOutput` serializable by @cbensimon in #26493
- [`core`] fix silent bug `keep_in_fp32` modules by @younesbelkada in #26589
- #26566 swin2 sr allow in out channels by @marvingabler in #26568
- Don't close ClearML task if it was created externally by @eugen-ajechiloae-clearml in #26614
- Fix `transformers-pytorch-gpu` docker build by @ydshieh in #26615
- [docs] Update to scripts building index.md by @MKhalusova in #26546
- Don't install `pytorch-quantization` in Doc Builder docker file by @ydshieh in #26622
- Remove unnecessary `view`s of `position_ids` by @ramiro050 in #26059
- Fixed inconsistency in several fast tokenizers by @Towdo in #26561
- Update tokenization_code_llama_fast.py by @andyl98 in #26576
- Remove unnecessary unsqueeze - squeeze in rotary positional embedding by @fxmarty in #26162
- Update chat template docs with more tips on writing a template by @Rocketknight1 in #26625
- fix RoPE t range issue for fp16 by @rui-ren in #26602
- Fix failing `MusicgenTest.test_pipeline_text_to_audio` by @ydshieh in #26586
- remove SharedDDP as it is deprecated by @statelesshz in #25702
- [`LlamaTokenizerFast`] Adds edge cases for the template processor by @ArthurZucker in #26606
- [docstring] Fix docstring for `AlbertConfig` by @ydshieh in #26636
- docs(zh): review and punctuation & space fix by @wfjsw in #26627
- [DINOv2] Convert more checkpoints by @NielsRogge in #26177
- Fixed malapropism error by @Zhreyu in #26660
- fix links in README.md for the GPT, GPT-2, and Llama2 Models by @dcarpintero in #26640
- Avoid CI OOM by @ydshieh in #26639
- fix typos in idefics.md by @dribnet in #26648
- [docstring] Fix docstring CLIP configs by @isaac-chung in #26677
- [docstring] Fix docstring for `CLIPImageProcessor` by @isaac-chung in #26676
- [docstring] Fix docstring for DonutImageProcessor by @abzdel in #26641
- Fix stale bot by @LysandreJik in #26692
- [docstring] Fix docstrings for `CLIP` by @isaac-chung in #26691
- Control first downsample stride in ResNet by @jiqing-feng in #26374
- Fix Typo: table in deepspeed.md by @Pairshoe in #26705
- [docstring] Fix docstring for `LlamaConfig` by @pavaris-pm in #26685
- fix a typo in flax T5 attention - attention_mask variable is misnamed by @giganttheo in #26663
- Fix source_prefix default value by @jheitmann in #26654
- [JAX] Replace uses of `jnp.array` in types with `jnp.ndarray`. by @hvaara in #26703
- Make Whisper Encoder's sinusoidal PE non-trainable by default by @gau-nernst in #26032
- In assisted decoding, pass model_kwargs to model's forward call (fix prepare_input_for_generation in all models) by @sinking-point in #25242
- Update docs to explain disabling callbacks using report_to by @nebrelbug in #26155
- `Copied from` for test files by @ydshieh in #26713
- [docstring] `SwinModel` docstring fix by @shivanandmn in #26679
- fix the model card issue as `use_cuda_amp` is no more available by @pacman100 in #26731
- Fix stale bot for locked issues by @LysandreJik in #26711
- Fix checkpoint path in `no_trainer` scripts by @muellerzr in #26733
- Update docker files to use `torch==2.1.0` by @ydshieh in #26735
- Revert #20715 by @ydshieh in #26734
- [docstring] Fix docstring for `LlamaTokenizer` and `LlamaTokenizerFast` by @minhoryang in #26669
- [docstring] Fix docstring for `CodeLlamaTokenizer` by @Bojun-Feng in #26709
- add japanese documentation by @rajveer43 in #26138
- Translated the accelerate.md file of the documentation to Chinese by @liteli1987gmail in #26161
- Fix doctest for `Blip2ForConditionalGeneration` by @ydshieh in #26737
- Add many missing spaces in adjacent strings by @tomaarsen in #26751
- Warnings controlled by logger level by @LysandreJik in #26527
- Fix `PersimmonIntegrationTest` OOM by @ydshieh in #26750
- Fix `MistralIntegrationTest` OOM by @ydshieh in #26754
- Fix backward compatibility of Conversation by @wdhorton in #26741
- [docstring] Fix `UniSpeech`, `UniSpeechSat`, `Wav2Vec2ForCTC` by @gizemt in #26664
- [docstring] Update `GPT2` and `Whisper` by @McDonnellJoseph in #26642
- [docstring] Fix docstring for 'BertGenerationConfig' by @AdwaitSalankar in #26661
- Fix `PerceiverModelIntegrationTest::test_inference_masked_lm` by @ydshieh in #26760
- chore: fix typos by @afuetterer in #26756
- [`core`] Fix fa-2 import by @younesbelkada in #26785
- Skip `TrainerIntegrationFSDP::test_basic_run_with_cpu_offload` if `torch < 2.1` by @ydshieh in #26764
- 🌐 [i18n-KO] Translated `big_models.md` to Korean by @wonhyeongseo in #26245
- Update expect outputs of `IdeficsProcessorTest.test_tokenizer_padding` by @ydshieh in #26779
- [docstring] Fix docstring for `RwkvConfig` by @Bojun-Feng in #26782
- Fix num. of minimal calls to the Hub with peft for pipeline by @ydshieh in #26385
- [docstring] fix docstring `DPRConfig` by @AVAniketh0905 in #26674
- Disable default system prompt for LLaMA by @Rocketknight1 in #26765
- Fix Falcon generation test by @Rocketknight1 in #26770
- Fixed KeyError for Mistral by @MatteoRaso in #26682
- [`Flava`] Fix flava doc by @younesbelkada in #26789
- Add CLIP resources by @eenzeenee in #26534
- translation brazilian portuguese by @alvarorichard in #26769
- Fixed typos by @Zhreyu in #26810
- [docstring] Fix docstring for `CanineConfig` by @Sparty in #26771
- Add Japanese translation by @shinshin86 in #26799
- [docstring] Fix docstring for `CodeLlamaTokenizerFast` by @Bojun-Feng in #26666
- Image-to-Image Task Guide by @merveenoyan in #26595
- Make fsdp ram efficient loading optional by @pacman100 in #26631
- fix resume_from_checkpoint bug by @Jintao-Huang in #26739
- [OWL-ViT, OWLv2] Add resources by @NielsRogge in #26822
- Llama tokenizer: remove space in template comment by @pcuenca in #26788
- Better way to run AMD CI with different flavors by @ydshieh in #26634
- [docstring] Fix bert generation tokenizer by @przemL in #26820
- Conversation pipeline fixes by @Rocketknight1 in #26795
- Fix Mistral OOM again by @ydshieh in #26847
- Chore: Typo fixed in multiple files of docs/source/en/model_doc by @SusheelThapa in #26833
- fix: when window_size is passes as array by @dotneet in #26800
- Update logits_process.py docstrings to clarify penalty and reward cases (attempt #2) by @larekrow in #26784
- [docstring] Fix docstring for LukeConfig by @louietouie in #26858
- Fixed a typo in mistral.md by @DTennant in #26879
- Translating `en/internal` folder docs to Japanese 🇯🇵 by @rajveer43 in #26747
- Fix TensorFlow pakage check by @jayfurmanek in #26842
- Generate: improve docstrings for custom stopping criteria by @gante in #26863
- Knowledge distillation for vision guide by @merveenoyan in #25619
- Fix Seq2seqTrainer decoder attention mask by @Rocketknight1 in #26841
- [`Tokenizer`] Fix slow and fast serialization by @ArthurZucker in #26570
- Emergency PR to skip conversational tests to fix CI by @Rocketknight1 in #26906
- Add default template warning by @Rocketknight1 in #26637
- Refactor code part in documentation translated to japanese by @rajveer43 in #26900
- [i18n-ZH] Translated fast_tokenizers.md to Chinese by @yyLeaves in #26910
- [`FA-2`] Revert suggestion that broke FA2 fine-tuning with quantized models by @younesbelkada in #26916
- [docstring] Fix docstring for `ChineseCLIP` by @Sparty in #26880
- [Docs] Make sure important decode and generate method are nicely displayed in Whisper docs by @patrickvonplaten in #26927
- Fix and re-enable ConversationalPipeline tests by @Rocketknight1 in #26907
- [docstring] Fix docstrings for `CodeGen` by @daniilgaltsev in #26821
- Fix license by @MedAymenF in #26931
- Pin Keras for now by @Rocketknight1 in #26904
- [`FA-2/Mistral`] Supprot fa-2 + right padding + forward by @younesbelkada in #26912
- Generate: update basic llm tutorial by @gante in #26937
- Corrected modalities description in README_ru.md by @letohx in #26913
- [docstring] Fix docstring for speech-to-text config by @R055A in #26883
- fix set_transform link docs by @diegulio in #26856
- Fix Fuyu image scaling bug by @pcuenca in #26918
- Update README_hd.md by @biswabaibhab007 in #26872
- Added Telugu [te] translations by @hakunamatata1997 in #26828
- fix logit-to-multi-hot conversion in example by @ranchlai in #26936
- Limit to inferior fsspec version by @LysandreJik in #27010
- python falcon doc-string example typo by @SoyGema in #26995
- skip two tests by @ArthurZucker in #27013
- Nits in Llama2 docstring by @osanseviero in #26996
- Change default `max_shard_size` to smaller value by @younesbelkada in #26942
- [`NLLB-MoE`] Fix NLLB MoE 4bit inference by @younesbelkada in #27012
- [`SeamlessM4T`] fix copies with NLLB MoE int8 by @ArthurZucker in #27018
- small typos found by @rafaelpadilla in #26988
- Remove token_type_ids from default TF GPT-2 signature by @Rocketknight1 in #26962
- Translate `pipeline_tutorial.md` to chinese by @jiaqiw09 in #26954
- 🌐 [i18n-ZH] Translate multilingual into Chinese by @yyLeaves in #26935
- translate `preprocessing.md` to Chinese by @jiaqiw09 in #26955
- Bugfix device map detr model by @pedrogengo in #26849
- Fix little typo by @mertyyanik in #27028
- 🌐 [i18n-ZH] Translate create_a_model.md into Chinese by @yyLeaves in #27026
- Fix key dtype in GPTJ and CodeGen by @fxmarty in #26836
- Register ModelOutput as supported torch pytree nodes by @XuehaiPan in #26618
- Add `default_to_square_for_size` to `CLIPImageProcessor` by @ydshieh in #26965
- Add descriptive docstring to WhisperTimeStampLogitsProcessor by @jprivera44 in #25642
- Normalize only if needed by @mjamroz in #26049
- [`TFxxxxForSequenceClassifciation`] Fix the eager mode after #25085 by @ArthurZucker in #25751
- Safe import of rgb_to_id from FE modules by @amyeroberts in #27037
- add info on TRL docs by @lvwerra in #27024
- Add fuyu device map by @SunMarc in #26949
- Device agnostic testing by @vvvm23 in #25870
- Fix config silent copy in from_pretrained by @patrickvonplaten in #27043
- [docs] Performance docs refactor p.2 by @MKhalusova in #26791
- Add a default decoder_attention_mask for EncoderDecoderModel during training by @hackyon in #26752
- Fix RoPE config validation for FalconConfig + various config typos by @tomaarsen in #26929
- Skip-test by @ArthurZucker in #27062
- Fix TypicalLogitsWarper tensor OOB indexing edge case by @njhill in #26579
- [docstring] fix incorrect llama docstring: encoder -> decoder by @ztjhz in #27071
- [DOCS] minor fixes in README.md by @Akash190104 in #27048
- [`docs`] Add `MaskGenerationPipeline` in docs by @younesbelkada in #27063
- 🌐 [i18n-ZH] Translate custom_models.md into Chinese by @yyLeaves in #27065
- Hindi translation of pipeline_tutorial.md by @AaryaBalwadkar in #26837
- Handle unsharded Llama2 model types in conversion script by @coreyhu in #27069
- Bring back `set_epoch` for Accelerate-based dataloaders by @muellerzr in #26850
- Bump `flash_attn` version to `2.1` by @younesbelkada in #27079
- Remove unneeded prints in modeling_gpt_neox.py by @younesbelkada in #27080
- Add-support for commit description by @ArthurZucker in #26704
- [Llama FA2] Re-add _expand_attention_mask and clean a couple things by @patrickvonplaten in #27074
- Correct docstrings and a typo in comments by @lewis-yeung in #27047
- Save TB logs as part of push_to_hub by @muellerzr in #27022
- Added huggingface emoji instead of the markdown format by @shettyvarshaa in #27091
- [`T5Tokenizer`] Fix fast and extra tokens by @ArthurZucker in #27085
- Revert "add exllamav2 arg" by @ArthurZucker in #27102
- Add early stopping for Bark generation via logits processor by @isaac-chung in #26675
- Provide alternative when warning on use_auth_token by @Wauplin in #27105
- Fix no split modules underlying modules by @SunMarc in #27090
- [`core/gradient_checkpointing`] Refactor GC - part 2 by @younesbelkada in #27073
- fix detr device map by @SunMarc in #27089
- Added Telugu [te] translation for README.md in main by @hakunamatata1997 in #27077
- translate transformers_agents.md to Chinese by @jiaqiw09 in #27046
- Fix docstring and type hint for resize by @daniilgaltsev in #27104
- [Typo fix] flag config in WANDB by @SoyGema in #27130
- Fix slack report failing for doctest by @ydshieh in #27042
- [`FA2/Mistral`] Revert previous behavior with right padding + forward by @younesbelkada in #27125
- Fix data2vec-audio note about attention mask by @gau-nernst in #27116
- remove the obsolete code related to fairscale FSDP by @statelesshz in #26651
- Fix some tests using `"common_voice"` by @ydshieh in #27147
- [`tests/Quantization`] Fix bnb test by @younesbelkada in #27145
- make tests of pytorch_example device agnostic by @statelesshz in #27081
- Remove some Kosmos-2 `copied from` by @ydshieh in #27149
- 🌐 [i18n-ZH] Translate serialization.md into Chinese by @yyLeaves in #27076
- Translating `en/main_classes` folder docs to Japanese 🇯🇵 by @rajveer43 in #26894
- Device agnostic trainer testing by @statelesshz in #27131
- Fix: typos in README.md by @THEFZNKHAN in #27154
- [KOSMOS-2] Update docs by @NielsRogge in #27157
- deprecate function `get_default_device` in `tools/base.py` by @statelesshz in #26774
- Remove broken links to s-JoL/Open-Llama by @CSRessel in #27164
- [docstring] Fix docstring for AltCLIPTextConfig, AltCLIPVisionConfig and AltCLIPConfig by @AksharGoyal in #27128
- [doctring] Fix docstring for BlipTextConfig, BlipVisionConfig by @Hangsiin in #27173
- Disable CI runner check by @ydshieh in #27170
- fix: Fix typical_p behaviour broken in recent change by @njhill in #27165
- Trigger CI if `tiny_model_summary.json` is modified by @ydshieh in #27175
- Shorten the conversation tests for speed + fixing position overflows by @Rocketknight1 in #26960
- device agnostic pipelines testing by @statelesshz in #27129
- Backward compatibility fix for the Conversation class by @Rocketknight1 in #27176
- [`Quantization/tests`] Fix bnb MPT test by @younesbelkada in #27178
- Fix dropout in `StarCoder` by @susnato in #27182
- translate traning.md to chinese by @jiaqiw09 in #27122
- [docs] Update CPU/GPU inference docs by @stevhliu in #26881
- device agnostic models testing by @statelesshz in #27146
- Unify warning styles for better readability by @oneonlee in #27184
- 🌐 [i18n-ZH] Translate tflite.md into Chinese by @yyLeaves in #27134
- device agnostic fsdp testing by @statelesshz in #27120
- Fix docstring get maskformer resize output image size by @wesleylp in #27196
- Fix the typos and grammar mistakes in CONTRIBUTING.md. by @THEFZNKHAN in #27193
- Fixing docstring in get_resize_output_image_size function by @wesleylp in #27191
- added unsqueeze_dim to apply_rotary_pos_emb by @ShashankMosaicML in #27117
- Added cache_block_outputs option to enable GPTQ for non-regular models by @AlexKoff88 in #27032
- Add TensorFlow implementation of ConvNeXTv2 by @neggles in #25558
- Fix docstring in get_oneformer_resize_output_image_size func by @wesleylp in #27207
- improving TimmBackbone to support FrozenBatchNorm2d by @rafaelpadilla in #27160
- Translate task summary to chinese by @jiaqiw09 in #27180
- Fix CPU offload + disk offload tests by @LysandreJik in #27204
- Enable split_batches through TrainingArguments by @muellerzr in #26798
- support bf16 by @etemadiamd in #25879
- Reproducible checkpoint for npu by @statelesshz in #27208
- [`core/Quantization`] Fix for 8bit serialization tests by @younesbelkada in #27234
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @jungnerd
- 🌐 [i18n-KO] Translated `semantic_segmentation.md` to Korean (#26515)
- @statelesshz
- Extend Trainer to enable Ascend NPU to use the fused Adamw optimizer when training (#26194)
- remove SharedDDP as it is deprecated (#25702)
- remove the obsolete code related to fairscale FSDP (#26651)
- make tests of pytorch_example device agnostic (#27081)
- Device agnostic trainer testing (#27131)
- deprecate function `get_default_device` in `tools/base.py` (#26774)
- device agnostic pipelines testing (#27129)
- device agnostic models testing (#27146)
- device agnostic fsdp testing (#27120)
- Reproducible checkpoint for npu (#27208)
- @sgugger
- Docstring check (#26052)
- @yyLeaves
- add zh translation for installation (#26084)
- [i18n-ZH] Translated fast_tokenizers.md to Chinese (#26910)
- 🌐 [i18n-ZH] Translate multilingual into Chinese (#26935)
- 🌐 [i18n-ZH] Translate create_a_model.md into Chinese (#27026)
- 🌐 [i18n-ZH] Translate custom_models.md into Chinese (#27065)
- 🌐 [i18n-ZH] Translate serialization.md into Chinese (#27076)
- 🌐 [i18n-ZH] Translate tflite.md into Chinese (#27134)
- @sinking-point
- In assisted decoding, pass model_kwargs to model's forward call (fix prepare_input_for_generation in all models) (#25242)
- @rajveer43
- add japanese documentation (#26138)
- Translating `en/internal` folder docs to Japanese 🇯🇵 (#26747)
- Refactor code part in documentation translated to japanese (#26900)
- Translating `en/main_classes` folder docs to Japanese 🇯🇵 (#26894)
- @alvarorichard
- translation brazilian portuguese (#26769)
- @hakunamatata1997
- Added Telugu [te] translations (#26828)
- Added Telugu [te] translation for README.md in main (#27077)
- @jiaqiw09
- Translate `pipeline_tutorial.md` to chinese (#26954)
- translate `preprocessing.md` to Chinese (#26955)
- translate transformers_agents.md to Chinese (#27046)
- translate traning.md to chinese (#27122)
- Translate task summary to chinese (#27180)
- @neggles
- Add TensorFlow implementation of ConvNeXTv2 (#25558)
Published by LysandreJik over 2 years ago
transformers - Patch release: v4.34.1
A patch release was made for the following three commits:
- Add add_generation_prompt argument to apply_chat_template (https://github.com/huggingface/transformers/pull/26573)
- Fix backward compatibility of Conversation (https://github.com/huggingface/transformers/pull/26741)
- [Tokenizer] Fix slow and fast serialization (https://github.com/huggingface/transformers/pull/26570)
Published by ArthurZucker over 2 years ago
transformers - v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor
New models
Mistral
Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:
- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
- [Mistral] Mistral-7B-v0.1 support by @Bam4d in #26447
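The attention figures above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the transformers implementation: it builds a small causal sliding-window mask in plain Python, then estimates the theoretical attention span and the key/value cache saving from GQA. The configuration values (window of 4096, 32 layers, 32 query heads, 8 key/value heads) are assumptions taken from the published Mistral-7B-v0.1 configuration, not from these release notes, and "window times layers" is the usual rough estimate of the span rather than an exact bound.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and the (window - 1) positions before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy example: 6 tokens with a window of 3.
mask = sliding_window_mask(seq_len=6, window=3)

# Assumed Mistral-7B-v0.1 configuration values (not stated in these notes).
SLIDING_WINDOW = 4096
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8

# Each layer lets information travel roughly one window further back,
# so stacking layers gives a theoretical span of window * layers.
theoretical_span = SLIDING_WINDOW * NUM_LAYERS

# With GQA several query heads share one key/value head, so the KV cache
# shrinks by the ratio of query heads to key/value heads.
kv_cache_reduction = NUM_QUERY_HEADS // NUM_KV_HEADS

print(theoretical_span)
print(kv_cache_reduction)
```

Under these assumptions the span comes out at 131072 tokens, which is the 128K quoted above.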
Persimmon
The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.
- [`Persimmon`] Add support for persimmon by @ArthurZucker in #26042
BROS
BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of using absolute spatial information.
- Add BROS by @jinhopark8345 in #23190
ViTMatte
ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.
- Add ViTMatte by @NielsRogge in #25843
Nougat
Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.
- Add Nougat by @NielsRogge and @molbap in #25942
Prompt templating
We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.
- Overhaul Conversation class and prompt templating by @Rocketknight1 in #25323
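To illustrate the idea of saving the chat format alongside the model, here is a minimal standard-library sketch. It is not the library's implementation: in transformers the template is a Jinja string stored with the tokenizer and applied through `tokenizer.apply_chat_template`. The `TEMPLATE` dictionary, the `<|role|>` markers and the `render_chat` helper below are hypothetical names invented for this example.

```python
# Hypothetical chat format, for illustration only. Real chat models define
# their own format, and transformers stores it as a Jinja template.
TEMPLATE = {
    "turn": "<|{role}|>\n{content}</s>\n",
    "generation_prompt": "<|assistant|>\n",
}


def render_chat(messages, template=TEMPLATE, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts into a single prompt string."""
    text = "".join(
        template["turn"].format(role=m["role"], content=m["content"])
        for m in messages
    )
    if add_generation_prompt:
        # Append the marker that tells the model to start an assistant reply.
        text += template["generation_prompt"]
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

Because the format lives in one stored template rather than in user code, fine-tuning and inference scripts render conversations the same way.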
🚨🚨 Tokenizer refactor
- [`Tokenizer`] attemp to fix add_token issues by @ArthurZucker in #23909
- Nit-added-tokens by @ArthurZucker in #26538 adds some fix to #23909.
🚨Workflow Changes 🚨:
These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes so we highlight them below.
- `unique_no_split_tokens` attribute removed and not used in the internal logic
- `sanitize_special_tokens()` follows a deprecation cycle and does nothing
- All attributes in `SPECIAL_TOKENS_ATTRIBUTES` are stored as `AddedToken`s and not strings.
- loading a slow from a fast or a fast from a slow will no longer raise an error if the tokens added don't have the correct index. This is because they will always be added following the order of the `added_tokens` but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
- the length of a tokenizer is now `max(set(self.get_vocab().keys()))` accounting for holes in the vocab. The `vocab_size` no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
- Adding a token using `tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)])` now takes into account `rstrip`, `lstrip`, `normalized` information.
- `added_tokens_decoder` holds `AddedToken`, not strings.
- `add_tokens()` for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
- initializing a tokenizer from scratch will now add missing special tokens to the vocab.
- stripping is not always done for special tokens! 🚨 Only if the `AddedToken` has `lstrip=True` and `rstrip=True`
- `fairseq_ids_to_tokens` attribute removed for `Barthez` (was not used)
➕ Most visible features:
- printing a tokenizer now shows tokenizer.added_tokens_decoder for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
- faster from_pretrained, faster add_tokens because special and non special can be mixed together and the trie is not always rebuilt.
- faster encode/decode with caching mechanism for added_tokens_decoder/encoder.
- information is fully saved in the tokenizer_config.json
For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.
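The stripping rule from the workflow changes above (whitespace around an added token is only consumed when the token asks for it) can be pictured with a toy model. This is a simplified sketch using only the standard library. `ToyAddedToken` and `split_on_token` are hypothetical stand-ins written for this example; they are not the `AddedToken` class or the tokenizer's splitting logic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAddedToken:
    """Stand-in for illustration only; not the transformers AddedToken class."""

    content: str
    lstrip: bool = False
    rstrip: bool = False


def split_on_token(text: str, token: ToyAddedToken) -> list[str]:
    """Split `text` around `token`, consuming adjacent whitespace only on
    the sides whose strip flag is set."""
    left = r"\s*" if token.lstrip else ""
    right = r"\s*" if token.rstrip else ""
    pattern = f"{left}({re.escape(token.content)}){right}"
    # re.split keeps the captured token; drop empty pieces at the edges.
    return [piece for piece in re.split(pattern, text) if piece]


text = "Hello <sep> world"

# Default flags: surrounding spaces stay attached to the neighbouring text.
print(split_on_token(text, ToyAddedToken("<sep>")))

# lstrip=True and rstrip=True: the spaces next to the token are consumed.
print(split_on_token(text, ToyAddedToken("<sep>", lstrip=True, rstrip=True)))
```

In this toy model, code that relied on special tokens always swallowing neighbouring spaces would see different pieces once the flags default to `False`, which is the kind of workflow change flagged above.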
Flash Attention 2
Flash Attention 2 (FA2) support has been added to transformers for the most popular architectures (llama, mistral, falcon); more architectures are actively being contributed in this issue (https://github.com/huggingface/transformers/issues/26350). Simply pass `use_flash_attention_2=True` when calling `from_pretrained`.
In the future, PyTorch will support Flash Attention 2 through `torch.nn.functional.scaled_dot_product_attention`. Users will then be able to benefit from both implementations of Flash Attention 2 (transformers core, and transformers + SDPA) with simple changes: call `model.to_bettertransformer()` and force-dispatch the SDPA kernel to FA-2 in the case of SDPA.
- [`core`] Integrate Flash attention 2 in most used models by @younesbelkada in #25598
For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: https://github.com/huggingface/transformers/issues/26557
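A minimal sketch of how the flag described above might be used. The helper name `load_with_flash_attention_2` is hypothetical, and the half-precision dtype is an assumption (the Flash Attention kernels run in fp16 or bf16 on a supported GPU, and the separate `flash-attn` package must be installed). Later transformers releases may expose this option under a different argument, so treat this as a sketch for the release described here.

```python
def load_with_flash_attention_2(model_id: str):
    """Load a causal LM with Flash Attention 2 enabled.

    Imports are deferred so that defining this helper does not require
    torch or transformers to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,   # assumption: FA2 kernels need fp16 or bf16
        use_flash_attention_2=True,  # flag named in these release notes
    )


# Example call (downloads weights and needs a supported GPU):
# model = load_with_flash_attention_2("mistralai/Mistral-7B-v0.1").to("cuda")
```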
Lazy import structure
Support for lazy loading of integration libraries has been added. This drastically speeds up importing transformers and related objects from the library.
Example before this change:
2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel" 3.31s user 3.06s system 220% cpu 2.893 total
After this change:
python3 -c "from transformers import CLIPTextModel" 1.70s user 1.49s system 220% cpu 1.447 total
- [Core] Add lazy import structure to imports by @patrickvonplaten in #26090
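The general technique behind the speed-up is to defer the real import until an attribute is first used. The sketch below shows that idea with the standard library only. `LazyModule` is a toy class written for this example and is not the structure transformers uses internally.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Toy lazy module: imports the real module on first attribute access."""

    def __init__(self, name: str):
        super().__init__(name)
        self._real = None

    @property
    def loaded(self) -> bool:
        """True once the underlying module has actually been imported."""
        return self._real is not None

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. for names that live
        # in the real module rather than on this placeholder.
        if self._real is None:
            self._real = importlib.import_module(self.__name__)
        return getattr(self._real, attr)


lazy_json = LazyModule("json")
print(lazy_json.loaded)            # nothing has been imported through the proxy yet
print(lazy_json.dumps({"a": 1}))   # first attribute access triggers the import
print(lazy_json.loaded)
```

Creating the placeholder is cheap, so a package that exposes many optional integrations this way only pays the import cost for the ones a program actually touches.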
Bugfixes and improvements
- Fix typo by @susnato in #25966
- Fix Detr CI by @ydshieh in #25972
- Fix `test_load_img_url_timeout` by @ydshieh in #25976
- nn.Identity is not required to be compatible with PyTorch < 1.1.0 as the minimum PyTorch version we currently support is 1.10.0 by @statelesshz in #25974
- Add `Pop2Piano` space demo. by @susnato in #25975
- fix typo by @kai01ai in #25981
- Use main in conversion script by @ydshieh in #25973
- [doc] Always call it Agents for consistency by @julien-c in #25958
- Update RAG README.md with correct path to examples/seq2seq by @tleyden in #25953
- Update training_args.py to remove the runtime error by @sahel-sh in #25920
- Trainer: delegate default generation values to `generation_config` by @gante in #25987
- Show failed tests on CircleCI layout in a better way by @ydshieh in #25895
- Patch with accelerate xpu by @abhilash1910 in #25714
- PegasusX add _no_split_modules by @andreeahedes in #25933
- Add TFDebertaV2ForMultipleChoice by @raghavanone in #25932
- deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler by @pacman100 in #25863
- [Wav2Vec2 Conformer] Fix inference float16 by @sanchit-gandhi in #25985
- Add LLaMA resources by @eenzeenee in #25859
- [`CI`] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
- [`VITS`] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
- Fix Mega chunking error when using decoder-only model by @tanaymeh in #25765
- save space when converting hf model to megatron model. by @flower-with-safe in #25950
- Update README.md by @NinoRisteski in #26003
- Falcon: fix revision propagation by @LysandreJik in #26006
- TF-OPT attention mask fixes by @Rocketknight1 in #25238
- Fix small typo README.md by @zspo in #25934
- 🌐[i18n-KO] Translated `llm_tutorial.md` to Korean by @harheem in #25791
- Remove Falcon from undocumented list by @Rocketknight1 in #26008
- modify context length for GPTQ + version bump by @SunMarc in #25899
- Fix err with FSDP by @muellerzr in #25991
- fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 by @kai01ai in #26024
- Fix CircleCI config by @ydshieh in #26023
- Add `tgs` speed metrics by @CokeDong in #25858
- [VITS] Fix nightly tests by @sanchit-gandhi in #25986
- Added HerBERT to README.md by @Muskan011 in #26020
- Fix vilt config docstring parameter to match value in init by @raghavanone in #26017
- Punctuation fix by @kwonmha in #26025
- Try to fix training Loss inconsistent after resume from old checkpoint by @dumpmemory in #25872
- Fix Dropout Implementation in Graphormer by @alexanderkrauck in #24817
- Update missing docs on `activation_dropout` and fix DropOut docs for SEW-D by @gau-nernst in #26031
- Skip warning if tracing with dynamo by @angelayi in #25581
- 🌐 [i18n-KO] Translated `llama.md` to Korean by @harheem in #26044
- [`CodeLlamaTokenizerFast`] Fix fix `set_infilling_processor` to properly reset by @ArthurZucker in #26041
- [`CITests`] skip failing tests until #26054 is merged by @ArthurZucker in #26063
- only main process should call _save on deepspeed zero3 by @zjjMaiMai in #25959
- docs: update link huggingface map by @pphuc25 in #26077
- docs: add space to docs by @pphuc25 in #26067
- [`core`] Import tensorflow inside relevant methods in `trainer_utils` by @younesbelkada in #26106
- Generate: legacy mode is only triggered when `generation_config` is untouched by @gante in #25962
- Update logits_process.py docstrings by @larekrow in #25971
- Fix ExponentialDecayLengthPenalty negative logits issue by @pokjay in #25594
- 🌐 [i18n-KO] Translated `llama2.md` to Korean by @mjk0618 in #26047
- [docs] Updates to TTS task guide with regards to the new TTS pipeline by @MKhalusova in #26095
- 🌐 [i18n-KO] Translated `contributing.md` to Korean by @mjk0618 in #25877
- enable optuna multi-objectives feature by @sywangyi in #25969
- chore: correct update_step and correct gradient_accumulation_steps by @pphuc25 in #26068
- Text2text pipeline: don't parameterize from the config by @gante in #26118
- Fix `MarianTokenizer` to remove metaspace character in `decode` by @tanaymeh in #26091
- safeguard torch distributed check by @pacman100 in #26056
- fix the deepspeed tests by @pacman100 in #26021
- Fix AutoTokenizer docstring typo by @amyeroberts in #26117
- [`core`] fix 4bit `num_parameters` by @younesbelkada in #26132
- Add missing space in generation/utils.py by @jbochi in #26121
- Update spectrogram and waveform model mapping for TTS/A pipeline by @Vaibhavs10 in #26114
- [`RWKV`] Final fix RWMV 4bit by @younesbelkada in #26134
- docs: feat: add llama2 notebook resources from OSSCA community by @junejae in #26076
- Generate: ignore warning when `generation_config.max_length` is set to `None` by @gante in #26147
- Fix `test_finetune_bert2bert` by @ydshieh in #25984
- Falcon: batched generation by @gante in #26137
- Fix `beam_scores` shape when token scores shape changes after `logits_processor` by @BakerBunker in #25980
- Update training_args.py - addition of self.distributed_state when using XPU by @Serizao in #25999
- [docs] last hidden state vs hidden_states[-1] by @MKhalusova in #26142
- Flex xpu bug fix by @abhilash1910 in #26135
- Add missing Maskformer dataclass decorator, add dataclass check in ModelOutput for subclasses by @rachthree in #25638
- Fix eval accumulation when `accelerate` > 0.20.3 by @sam-scale in #26060
- [Whisper Tokenizer] Encode timestamps by @sanchit-gandhi in #26054
- [`PEFT`] Fix PEFT + gradient checkpointing by @younesbelkada in #25846
- [MusicGen] Add streamer to generate by @sanchit-gandhi in #25320
- Fix beam search when using model parallel by @pfldy2850 in #24969
- [MusicGen] Add sampling rate to config by @sanchit-gandhi in #26136
- [Whisper] Fix word-level timestamps for audio < 30 seconds by @xenova in #25607
- [BLIP-2] Improve conversion script by @NielsRogge in #24854
- IDEFICS: allow interpolation of vision's pos embeddings by @leot13 in #26029
- [TTA Pipeline] Test MusicGen and VITS by @sanchit-gandhi in #26146
- Tweaks to Chat Templates docs by @Rocketknight1 in #26168
- [Whisper] Check length of prompt + max new tokens by @sanchit-gandhi in #26164
- Update notebook.py to support multi eval datasets by @matrix1001 in #25796
- Fix pad to multiple of by @ArthurZucker in #25732
- [docs] IDEFICS guide and task guides restructure by @MKhalusova in #26035
- [PEFT] Allow PEFT model dict to be loaded by @patrickvonplaten in #25721
- No doctest for `convert_bros_to_pytorch.py` by @ydshieh in #26212
- Remove `utils/documentation_tests.txt` by @ydshieh in #26213
- moved `ctrl` to `Salesforce/ctrl` by @julien-c in #26183
- Fix ConversationalPipeline tests by @Rocketknight1 in #26217
- [FSMT] Fix non-shared weights by @LysandreJik in #26187
- refactor decay_parameters production into its own function by @shijie-wu in #26152
- refactor: change default block_size in block size > max position embeddings by @pphuc25 in #26069
- [Wav2Vec2-Conf / LLaMA] Style fix by @sanchit-gandhi in #26188
- [Permisson] Style fix by @sanchit-gandhi in #26228
- [Check] Fix config docstring by @sanchit-gandhi in #26222
- 🌐 [i18n-KO] Translated `whisper.md` to Korean by @nuatmochoi in #26002
- Create the return value on device to avoid unnecessary copying from CPU by @mksit in #26151
- [AutoBackbone] Add test by @NielsRogge in #26094
- Update README.md by @NinoRisteski in #26198
- Update add_new_pipeline.md by @NinoRisteski in #26197
- [docs] Fix model reference in zero shot image classification example by @Aleksandar1932 in #26206
- Fix the gitlab user mention in issue templates to the correct user by @muellerz in #26237
- Fix some docstring in image processors by @ydshieh in #26235
- Fix gated repo tests by @Wauplin in #26257
- Fix `Error` not captured in PR doctesting by @ydshieh in #26215
- DeepSpeed ZeRO-3 handling when resizing embedding layers by @pacman100 in #26259
- [FIX] resize_token_embeddings by @passaglia in #26102
- FSDP tests and checkpointing fixes by @pacman100 in #26180
- fix name error when accelerate is not available by @pacman100 in #26278
- Update bros checkpoint by @jinhopark8345 in #26277
- Integrate AMD GPU in CI/CD environment by @mfuntowicz in #26007
- Rewrite for custom code warning messages by @Rocketknight1 in #26291
- fix deepspeed available detection by @fxmarty in #26252
- add bbox input validation by @jinhopark8345 in #26294
- include changes from llama by @ArthurZucker in #26260
- [`Trainer`] Refactor trainer + bnb logic by @younesbelkada in #26248
- add custom RMSNorm to `ALL_LAYERNORM_LAYERS` by @shijie-wu in #26227
- Keep relevant weights in fp32 when `model._keep_in_fp32_modules` is set even when `accelerate` is not installed by @fxmarty in #26225
- Fix FSMT weight sharing by @LysandreJik in #26292
- update hf hub dependency to be compatible with the new tokenizers by @ArthurZucker in #26301
- Porting the torchaudio kaldi fbank implementation to audio_utils by @ylacombe in #26182
- More error message fixup, plus some linebreaks! by @Rocketknight1 in #26296
- [QUICK FIX LINK] Update trainer.py by @SoyGema in #26293
- Use CircleCI `store_test_results` by @ydshieh in #26223
- Fix doctest CI by @ydshieh in #26324
- [doc] fixed indices in obj detection example by @MKhalusova in #26343
- [TTA Pipeline] Fix MusicGen test by @sanchit-gandhi in #26348
- Add image to image pipeline by @LeviVasconcelos in #25393
- feat: adding num_proc to load_dataset by @pphuc25 in #26326
- Fixed unclosed p tags by @HanSeokhyeon in #26240
- Update add_new_model.md by @NinoRisteski in #26365
- Fix MusicGen logging error by @osanseviero in #26370
- [docs] removed MaskFormerSwin and TimmBackbone from the table on index.md by @MKhalusova in #26347
- Update tiny model information and pipeline tests by @ydshieh in #26285
- Add Russian localization for README by @qweme32 in #26208
- 🌐 [i18n-KO] Translated `audio_classification.mdx` to Korean by @gabrielwithappy in #26200
- [ViTMatte] Add resources by @NielsRogge in #26317
- Deleted duplicate sentence by @titi-devv in #26394
- added support for gradient checkpointing in ESM models by @sanjeevk-os in #26386
- Fix DeepSpeed issue with Idefics by @HugoLaurencon in #26393
- Add torch `RMSProp` optimizer by @natolambert in #26425
- Fix padding for IDEFICS by @shauray8 in #26396
- Update semantic_segmentation.md by @zekaouinoureddine in #26419
- Fixing tokenizer when `transformers` is installed without `tokenizers` by @urialon in #26236
- [`FA/tests`] Add use_cache tests for FA models by @younesbelkada in #26415
- add bf16 mixed precision support for NPU by @statelesshz in #26163
- [`PEFT`] Fix PEFT multi adapters support by @younesbelkada in #26407
- Fix failing doctest by @LysandreJik in #26450
- Update `runs-on` in workflow files by @ydshieh in #26435
- [i18n-DE] Complete first toc chapter by @flozi00 in #26311
- 🌐 [i18n-KO] Translated `debugging.md` to Korean by @wonhyeongseo in #26246
- 🌐 [i18n-KO] Translated `perf_train_gpu_many.md` to Korean by @wonhyeongseo in #26244
- optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @NormXU in #26139
- Fix `cos_sin` device issue in Falcon model by @ydshieh in #26448
- docs: change assert to raise and some small docs by @pphuc25 in #26232
- change mention of decoder_input_ids to input_ids and same with decode_inputs_embeds by @tmabraham in #26406
- [VITS] Fix speaker_embed device mismatch by @fakhirali in #26115
- [`PEFT`] introducing `adapter_kwargs` for loading adapters from different Hub location (`subfolder`, `revision`) than the base model by @younesbelkada in #26270
- Do not warn about unexpected decoder weights when loading T5EncoderModel and LongT5EncoderModel by @fleonce in #26211
- fix_mbart_tied_weights by @SunMarc in #26422
- Esm checkpointing by @Amelie-Schreiber in #26454
- [Whisper Tokenizer] Make decoding faster after adding timestamps by @sanchit-gandhi in #26299
- [docs] Update offline mode docs by @stevhliu in #26478
- [docs] navigation improvement between text gen pipelines and text gen params by @MKhalusova in #26477
- Skip 2 failing persimmon pipeline tests for now by @ydshieh in #26485
- Avoid all-zeor attnetion mask used in testing by @ydshieh in #26469
- [Flax Examples] Seq2Seq ASR Fine-Tuning Script by @sanchit-gandhi in #21764
- [ASR Pipe] Improve docs and error messages by @sanchit-gandhi in #26476
- Revert falcon exception by @LysandreJik in #26472
- Fix num_heads in _upad_input by @fs4r in #26490
- Fix requests connection error during modelcard creation by @jphme in #26518
- Fix issue of canine forward requiring input_ids anyway by @marcmk6 in #26290
- Fix broken link to video classification task by @HelgeS in #26487
- [`PEFT`] Pass token when calling `find_adapter_config` by @younesbelkada in #26488
- [`core/auto`] Fix bnb test with code revision + bug with code revision by @younesbelkada in #26431
- Fix model integration ci by @ArthurZucker in #26322
- [`PEFT`] Protect `adapter_kwargs` check by @younesbelkada in #26537
- Remove-warns by @ArthurZucker in #26483
- [Doctest] Add configuration_roformer.py by @Adithya4720 in #26530
- Code-llama-nit by @ArthurZucker in #26300
- add build_inputs_with_special_tokens to LlamaFast by @ArthurZucker in #26297
- 🌐 [i18n-KO] Translated tokenizer_summary.md to Korean by @wonhyeongseo in #26243
- [i18n-DE] contribute chapter by @flozi00 in #26481
- [RFC, Logging] Change warning to info by @patrickvonplaten in #26545
- Add tokenizer kwargs to fill mask pipeline. by @nmcahill in #26234
- [Wav2Vec2 and Co] Update init tests for PT 2.1 by @sanchit-gandhi in #26494
- [AMD] Add initial version for run_tests_multi_gpu by @mfuntowicz in #26346
- [Doctest] Add configuration_encoder_decoder.py by @SrijanSahaySrivastava in #26519
- [InternLM] Add support for InternLM by @Rocketknight1 in #26302
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @jinhopark8345
- Add BROS (#23190)
- Update bros checkpoint (#26277)
- add bbox input validation (#26294)
- @qweme32
- Add Russian localization for README (#26208)
- @Bam4d
- [Mistral] Mistral-7B-v0.1 support (#26447)
- @flozi00
- [i18n-DE] Complete first toc chapter (#26311)
- [i18n-DE] contribute chapter (#26481)
- @wonhyeongseo
- 🌐 [i18n-KO] Translated debugging.md to Korean (#26246)
- 🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean (#26244)
- 🌐 [i18n-KO] Translated tokenizer_summary.md to Korean (#26243)
- Python
Published by LysandreJik over 2 years ago
transformers - Patch release: v4.33.3
A patch release was made for the following three commits:
- DeepSpeed ZeRO-3 handling when resizing embedding layers (#26259)
- [doc] Always call it Agents for consistency (#25958)
- deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)
- Python
Published by LysandreJik over 2 years ago
transformers - Patch release: v4.33.2
A patch release was done for these two commits:
- Fix pad to multiple of (#25732)
- fix resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 (#26024)
- Python
Published by LysandreJik over 2 years ago