Recent Releases of transformers

transformers - v4.56: Dino v3, X-Codec, Ovis 2, MetaCLIP 2, Florence 2, SAM 2, Kosmos 2.5, HunYuan, GLM-4.5V

New model additions

Dino v3

DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.

You can find all the original DINOv3 checkpoints under the DINOv3 collection.

Add Dino v3 by @qubvel in #40167

X-Codec

The X-Codec model was proposed in Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model by Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue.

The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:

Music continuation: Better modeling of musical semantics yields more coherent continuations.
Text-to-Sound Synthesis: X-Codec captures semantic alignment between text prompts and generated audio.
Semantic-aware audio tokenization: X-Codec is used as an audio tokenizer in the YuE lyrics-to-song generation model.

Add X-Codec model by @Manalelaidouni in #38248

Ovis 2

Ovis2 is an updated version of the Ovis model developed by the AIDC-AI team at Alibaba International Digital Commerce Group.

Ovis2 is the latest advancement in multi-modal large language models (MLLMs), succeeding Ovis1.6. It retains the architectural design of the Ovis series, which focuses on aligning visual and textual embeddings, and introduces major improvements in data curation and training methods.

Add Ovis2 model and processor implementation by @thisisiron in #37088

MetaCLIP 2

MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.

Add MetaCLIP 2 by @NielsRogge in #39826

Florence 2

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.

Add support for Florence-2 by @ducviet00 in #38188

SAM 2

SAM2 (Segment Anything Model 2) was proposed in Segment Anything in Images and Videos by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.

The model can be used to predict segmentation masks of any object of interest given an input image or video, and input points or bounding boxes.

Add Segment Anything 2 (SAM2) by @SangbumChoi in #32317

Kosmos 2.5

The Kosmos-2.5 model was proposed in KOSMOS-2.5: A Multimodal Literate Model by Microsoft.

The abstract from the paper is the following:

We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.


Add Kosmos-2.5 by @tic-top in #31711

HunYuan

More information at release 🤗

HunYuan opensource by @yjc9696 in #39606

Seed OSS

More information at release 🤗

Adding ByteDance Seed Seed-OSS by @Fazziekey in #40272

GLM-4.5V

More information at release 🤗

GLM-4.5V Model Support by @zRzRzRzRzRzRzR in #39805

Cache

Beyond a large refactor of the caching system in Transformers, which makes it much more practical and general, models using sliding window attention/chunk attention no longer waste memory when caching past states. This was made possible most notably by:

New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039

See the following improvements on memory usage for Mistral (using only sliding layers) and GPT-OSS (1 out of 2 layers is sliding) respectively:

Beyond memory usage, it will also improve generation/forward speed by a large margin for large contexts, as only necessary states are passed to the attention computation, which is very sensitive to the sequence length.
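The memory behaviour described above can be illustrated with a toy cache that keeps only the most recent states. This is a minimal sketch, not the DynamicSlidingWindowLayer implementation: the class name `ToySlidingWindowCache` is hypothetical, and strings stand in for key/value tensors.

```python
from collections import deque


class ToySlidingWindowCache:
    """Toy per-layer cache that keeps only the last `window` key/value states.

    Hypothetical illustration; not the transformers implementation.
    """

    def __init__(self, window: int):
        self.window = window
        # deque(maxlen=...) drops the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def update(self, key, value):
        """Append one new state and return what attention would receive."""
        self.keys.append(key)
        self.values.append(value)
        return list(self.keys), list(self.values)


cache = ToySlidingWindowCache(window=4)
for step in range(10):
    keys, values = cache.update(f"k{step}", f"v{step}")

# After 10 steps only the 4 most recent states are still stored.
print(keys)  # ['k6', 'k7', 'k8', 'k9']
```

Because the stored length is bounded by the window, both the memory held by the cache and the sequence length passed to attention stop growing once the context exceeds the window.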

Quantization

MXFP4

Since the GPT-OSS release, which introduced the MXFP4 quantization type, several improvements have been made to the support, which should now stabilize.

Fix MXFP4 quantizer validation to allow CPU inference with dequantize option by @returnL in #39953
Enable gpt-oss mxfp4 on older hardware (sm75+) by @matthewdouglas in #39940
Fix typo and improve GPU kernel check error message in MXFP4 quantization by @akintunero in #40349
Default to dequantize if cpu in device_map for mxfp4 by @MekkCyber in #39993
Fix GPT-OSS swiglu_limit not passed in for MXFP4 by @danielhanchen in #40197
[Mxfp4] Add a way to save with a quantization method by @ArthurZucker in #40176

New standard

Now that we have deprecated TensorFlow and JAX, we felt that torch_dtype was not only misaligned with torch, but also redundant and hard to remember. For this reason, we switched to a much more standard dtype argument!

⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782

torch_dtype will still be a valid usage for as long as needed to ensure a smooth transition, but new code should use dtype, and we encourage you to update older code as well!
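For calling code that still receives the old spelling, a small shim can translate it before the arguments are forwarded. This is only a sketch: `normalize_dtype_kwargs` is a hypothetical helper, not part of transformers, and the dtype is written as a string so the example needs no torch import.

```python
def normalize_dtype_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` that uses `dtype` rather than `torch_dtype`.

    Hypothetical helper for calling code; not part of transformers.
    """
    kwargs = dict(kwargs)  # do not mutate the caller's dict
    if "torch_dtype" in kwargs:
        legacy = kwargs.pop("torch_dtype")
        if "dtype" in kwargs and kwargs["dtype"] != legacy:
            raise ValueError("`dtype` and `torch_dtype` were given different values.")
        kwargs.setdefault("dtype", legacy)
    return kwargs


load_kwargs = normalize_dtype_kwargs({"torch_dtype": "bfloat16", "device_map": "auto"})
print(load_kwargs)  # {'device_map': 'auto', 'dtype': 'bfloat16'}
# Assumed usage: AutoModel.from_pretrained(model_id, **load_kwargs)
```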

Breaking changes

The following commits are breaking changes in workflows that were either buggy or not working as expected.

Saner hub-defaults for hybrid cache implementation

On models where the hub checkpoint specifies cache_implementation="hybrid" (static sliding window hybrid cache), this value is now unset. This makes the model use the dynamic sliding window layers by default.

The previous hub default caused widespread, very slow first generate calls on models with hybrid caches, which should no longer be the case.

🚨🚨 [generate] ignore cache_implementation="hybrid" hub defaults by @gante in #40135

Sine positional embeddings for MaskFormer & LRU cache

Cache the computation of sine positional embeddings for MaskFormer; results in a 6% performance improvement.

🚨 Use lru_cache for sine pos embeddings MaskFormer by @yonigozlan in #40007
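The general idea of caching a positional-embedding computation can be sketched with `functools.lru_cache`. The function below is a simplified 1-D table written in pure Python for illustration; it is an assumption and not the MaskFormer code, which works on image feature maps.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=8)
def sine_position_table(num_positions: int, dim: int) -> tuple:
    """Build a (num_positions x dim) table of interleaved sin/cos values."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(tuple(row))
    return tuple(table)  # immutable, so the cached value is safe to share


first = sine_position_table(16, 8)
second = sine_position_table(16, 8)  # same arguments: served from the cache
print(second is first)
print(sine_position_table.cache_info())
```

The cache only helps when the same shapes recur, which is typically the case when inputs are resized to a fixed resolution.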

Explicit cache initialization

Adds explicit cache initialization to prepare for the deprecation of the from_legacy_cache utility.

🚨 Always return Cache objects in modelings (to align with generate) by @manueldeprada in #39765

Default compilation with `fullgraph=False`

Having fullgraph set to True during compilation ended up being very restrictive, especially with the arrival of widely-used MoEs.

🚨🚨 Switch default compilation to fullgraph=False by @Cyrilvallez in #40137

Remove decoding strategies

The DoLa decoding strategy was moved a few versions ago to the following remote-code repository: https://huggingface.co/transformers-community/dola

The Contrastive Search decoding strategy was moved a few versions ago to the following remote-code repository: https://huggingface.co/transformers-community/contrastive-search

Both have now been removed from the library as a result.

🚨 Remove DoLa decoding strategy by @manueldeprada in #40082
🚨 Remove Contrastive Search decoding strategy by @manueldeprada in #40428
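Code that relied on the removed strategies can point `generate` at the remote-code repositories instead. The sketch below only builds the keyword arguments; `custom_generate_kwargs` is a hypothetical helper, and it assumes the `custom_generate` and `trust_remote_code` arguments of `generate` are used to load a strategy from the Hub.

```python
# Hub repositories named in the notes above for the removed strategies.
MOVED_STRATEGIES = {
    "dola": "transformers-community/dola",
    "contrastive_search": "transformers-community/contrastive-search",
}


def custom_generate_kwargs(strategy: str, **extra) -> dict:
    """Build keyword arguments for `model.generate` (hypothetical helper)."""
    if strategy not in MOVED_STRATEGIES:
        raise KeyError(f"Unknown or non-moved strategy: {strategy!r}")
    return {
        "custom_generate": MOVED_STRATEGIES[strategy],
        # Remote-code strategies execute code from the Hub, hence the explicit opt-in.
        "trust_remote_code": True,
        **extra,
    }


kwargs = custom_generate_kwargs("dola", max_new_tokens=32)
print(kwargs["custom_generate"])  # transformers-community/dola
# Assumed usage: model.generate(**inputs, **kwargs)
```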

Fix sliding window in flash attention

Flash attention was using sliding window sizes that were off by one. This affected generations whose initial context was larger than the sliding window size.

🚨 [Flash Attention] Fix sliding window size by @vasqu in #40163
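Why an off-by-one window only matters for contexts longer than the window can be seen with a toy mask. This sketch assumes a convention in which a window of size W lets each query see W keys, itself included; it does not reproduce the flash attention code, and the direction of the original error is not stated in these notes.

```python
def window_mask(seq_len: int, window: int) -> list:
    """Causal mask where each query sees at most `window` keys, itself included."""
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


def keys_visible_to_last_query(seq_len: int, window: int) -> int:
    """Count how many keys the final query position can attend to."""
    return sum(window_mask(seq_len, window)[-1])


window = 4
# Short context: a window of 4 and a window of 5 produce the same mask.
print(window_mask(3, window) == window_mask(3, window + 1))  # True
# Longer context: the two windows differ by exactly one visible key.
print(keys_visible_to_last_query(8, window))      # 4
print(keys_visible_to_last_query(8, window + 1))  # 5
```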

Minimum Torch version is now 2.2

Torch 2.1 support has been unreliable for some time, so we've now made it official and bumped our minimum version to 2.2.

byebye torch 2.1 by @Rocketknight1 in #40317
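Projects that want to fail early on an unsupported install could compare versions themselves. This is a minimal sketch using only the standard library: the helper names are hypothetical, and the version string would normally come from `torch.__version__`.

```python
import re

MIN_TORCH = (2, 2)  # minimum version stated in these notes


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.7.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_supported(version: str) -> bool:
    """Return True when `version` is at least MIN_TORCH."""
    return parse_major_minor(version) >= MIN_TORCH


print(torch_is_supported("2.1.2"))        # False
print(torch_is_supported("2.2.0"))        # True
print(torch_is_supported("2.7.1+cu121"))  # True
```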

Bugfixes and improvements

[CI] post-GptOss fixes for green CI by @gante in #39929
Avoid utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) by @ydshieh in #39918
Fix CI: Tests failing on CPU due to torch.device('cpu').index being None by @manueldeprada in #39933
circleci: pin torch 2.7.1 until torchcodec is updated by @ydshieh in #39951
[docs] ko toc fix by @gante in #39927
docs: fix typo in 'quantization-aware training' by @luckyvickyricky in #39904
Fix grammatical error in MoE variable name: expert_hitted → expert_hit, hitted_experts → hit_experts by @Mihonarium in #39959
fix typo by @Tialo in #39936
[image processor] fix glm4v by @KeyKy in #39964
remove triton_kernels dep with kernels instead by @SunMarc in #39926
Fix fix_and_overwrite mode of utils/check_docstring.py by @manueldeprada in #39369
[bugfix] fix flashattention2 unavailable error on Ascend NPU by @FightingZhen in #39844
chore: update Deformable_Detr model card by @arpon-kapuria in #39902
Modular fix: remove the model name in find_file_type by @yonigozlan in #39897
Gemma3 fixes by @remi-or in #39960
[superglue] Fixed the way batch mask was applied to the scores before match assignment computation by @sbucaille in #39968
Support input_embeds in torch exportable decoders by @jackzhxng in #39836
Various test fixes for AMD by @remi-or in #39978
[Idefics] fix device mismatch by @zucchini-nlp in #39981
Fix gemma3n feature extractor's incorrect squeeze by @Isotr0py in #39919
[typing] Fix return typehint for decoder and inv_freq annotation by @qubvel in #39610
Fix consistency by @Cyrilvallez in #39995
Update expected output values after #39885 (part 1) by @ydshieh in #39990
Fix int4 quantized model cannot work with cpu by @yuanwu2017 in #39724
Fix missing video inputs for PerceptionLM. by @shuminghu in #39971
fix: remove CHAT_TEMPLATE import in tests for deepseek-vl by @geetu040 in #40003
Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips by @ducviet00 in #39965
Fix default values of getenv by @cyyever in #39867
FA2 can continue generation from cache by @zucchini-nlp in #39843
unpin torch<2.8 on circleci by @ydshieh in #40012
docs: fix duplication in 'en/optimizers.md' by @luckyvickyricky in #40014
Raising error when quantizing a quantized model by @MekkCyber in #39998
Update expected output values after #39885 (part 2) by @ydshieh in #40015
pin torchcodec==0.5.0 for now with torch 2.7.1 on daily CI by @ydshieh in #40013
Fix broken image inference for Fuyu model by @Isotr0py in #39915
Higgs modules_to_not_convert standardization by @MekkCyber in #39989
Fix an annoying flaky test by @zucchini-nlp in #40000
Harmonize past_key_value to past_key_valueS everywhere by @Cyrilvallez in #39956
Fix missing None default values for Gemma3n model in get_placeholder_mask by @Znerual in #39991
[core] Refactor the Cache logic to make it simpler and more general by @Cyrilvallez in #39797
Tie weights recursively on all submodels by @Cyrilvallez in #39996
Bnb failling tests by @MekkCyber in #40026
fix notification_service.py about time_spent by @ydshieh in #40037
Revert "fix notification_service.py about time_spent" by @ydshieh in #40044
Update HuBERT model card according to template by @reedrya in #39742
unpin torchcodec==0.5.0 and use torch 2.8 on daily CI by @ydshieh in #40072
fix: resolve triton version check compatibility on windows by @Tsumugii24 in #39986
[qwen-vl] fix beam search with videos by @zucchini-nlp in #39726
[gemma3] update conversion key mapping by @zucchini-nlp in #39778
fix: move super().__init__ after vision_config init in Mistral3Config by @starcatmeow in #40063
Remove deprecated cache-related objects by @Cyrilvallez in #40035
guard on model.eval when using torch.compile + FSDP2 by @winglian in #37413
Fix repo consistency by @zucchini-nlp in #40077
added Textnet fast image processor by @rahzaazhar in #39884
Fix time_spent in notification_service.py. by @ydshieh in #40081
chore: standardize DeBERTa model card by @Shoumik-Gandre in #37409
[GPT Big Code] Fix attention scaling by @vasqu in #40041
feat: extract rev in attn_implementation kernels via @ by @drbh in #40009
Update notification service MI325 by @ivarflakstad in #40078
Fix PerceptionLM image preprocessing for non-tiled image input. by @shuminghu in #40006
Revert FA2 kwargs construction by @zucchini-nlp in #40029
[fix] batch inference for llava_onevision by @cyr0930 in #40021
[docs] Zero Shot Object Detection Task by @ariG23498 in #40096
Update Glm4V processor and add tests by @zucchini-nlp in #39988
Add glm4.5&&glm4.5V doc by @lambertwjh in #40095
Causal loss for ForConditionalGeneration by @qgallouedec in #39973
Audio encodings now match conv2d weight dtype in Gemma3nAudioSSCPConvBlock by @Malav-P in #39743
New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039
Enable SIM rules by @cyyever in #39806
feat: add is_fast to ImageProcessor by @MilkClouds in #39603
Re-apply make style by @Cyrilvallez in #40106
Replace logger.warning with logger.warning_once in GradientCheckpointingLayer by @qgallouedec in #40091
Fix regression in mllama vision encoder by @Isotr0py in #40083
Switch the order of args in StaticCache (for BC and future logic) by @Cyrilvallez in #40100
Fix Qwen3 MoE GGUF architecture mismatch by @ctcanbol in #39976
Fix error on importing unavailable torch.distributed by @m-gallus in #40038
[Flash Attention] Fix flash attention integration by @vasqu in #40002
[trainer] ensure special tokens in model configs are aligned with tokenizer at train time by @gante in #38441
Fix Causality Handling in Flash Attention to Support Bidirectional Attention by @lucaswychan in #39707
[docs] Add reference to HF-maintained custom_generate collections by @gante in #39894
Add model card for MobileViT by @Shivamjan in #40033
remove sequence parallel in llama4 by @3outeille in #40084
🌐 [i18n-KO] Translated tiny_agents.md to Korean by @AhnJoonSung in #39913
[bugfix] Fix tensor device in Idefics2, Idefics3, and SmolVLM by @qgallouedec in #39975
changed xLSTMRMSNorm to RMSNorm by @nikitazuevblago in #40113
Fix QuantoQuantizedCache import issues by @manueldeprada in #40109
[serve] allow array content inputs for LLMs by @gante in #39829
decoding_method argument in generate by @manueldeprada in #40085
Collated reports by @ivarflakstad in #40080
DOCS: Add missing space in SECURITY.md by @shivaheidari in #40087
[trainer] handle case where EOS token is None in generation_config by @gante in #40127
Fix hidden torchvision>=0.15 dependency issue by @yonigozlan in #39928
🌐 [i18n-KO] Translated main_classes/processors.md to Korean by @TaskerJang in #39519
🌐 [i18n-KO] Translated jamba.md to Korean by @skwh54 in #39890
🌐 [i18n-KO] Translated main_classes/optimizer_schedules.md to Korean by @luckyvickyricky in #39713
🌐 [i18n-KO] Translated gpt2.md to Korean by @taemincode in #39808
🌐 [i18n-KO] Translated optimizers.md to Korean by @chelsseeey in #40011
🌐 [i18n-KO] Translated grounding-dino.md to Korean by @TaskerJang in #39861
🌐 [i18n-KO] Translated pipelines.md to Korean by @xhaktm00 in #39577
gpt oss is important by @ArthurZucker in #40139
Fix Janus by @Cyrilvallez in #40140
[docs] Fix ko toctree by @stevhliu in #40138
Remove an old badly designed test by @Cyrilvallez in #40142
updated visualBERT modelcard by @Anil-Red in #40057
🌐 [i18n-KO] Translated gemma3.md to Korean by @seopp in #39865
Fix quantized cache with only cache_implementation in generate by @Cyrilvallez in #40144
Add pytest marker: torch_compile_test and torch_export_test by @ydshieh in #39950
Update Dockerfiles to install packages inside a virtual environment by @Sai-Suraj-27 in #39098
Create self-scheduled-amd-mi355-caller.yml by @glegendre01 in #40134
[Cohere2Vision] remove unused arg by @zucchini-nlp in #40103
[efficientloftr] fix bugs and follow original cross attn implementation strictly by @sbucaille in #40141
Fix CI: Use correct import in SAM for torchvision InterpolationMode by @manueldeprada in #40160
[Continous Batching] set head_dim when config.head_dim is None by @kashif in #40159
Replace self.tokenizer by self.processing_class by @qgallouedec in #40119
[FA2] Fix it finally - revert fa kwargs preparation by @Cyrilvallez in #40161
[bugfix] fix flash-attention2 unavailable error for Ascend NPU by @FightingZhen in #40151
build: Add fast image processor tvp by @adutchengineer in #39529
Add GptOssForSequenceClassification for GPT-OSS models by @zyfedward in #40043
Standardize BARTpho model card: badges, new examples, fixed broken im… by @eshwanthkartitr in #40051
Add dates to the model docs by @MHRDYN7 in #39320
Pin torch to 2.7.1 on CircleCI for now by @ydshieh in #40174
Update dynamic attnt setter for multimodals by @zucchini-nlp in #39908
[MINOR:TYPO] Update base.py by @cakiki in #40169
make model doc device agnostic by @yao-matrix in #40143
fix to avoid modifying a view in place by @3outeille in #40162
Fix fsdp for generic-task models by @Cyrilvallez in #40191
Add repr to EncoderDecoderCache by @Cyrilvallez in #40195
Fix typos by @cyyever in #40175
Remove _prepare_flash_attention_from_position_ids by @cyyever in #40069
Avoid CUDA stream sync by @cyyever in #40060
Fix various Pylint warnings by @cyyever in #40107
Update: add type hints to check_tokenizers.py by @ajeet214 in #40094
Benchmarking improvements by @ahadnagy in #39768
docs: Update LayoutLM model card according to new standardized format by @Jin-HoMLee in #40129
Revert "Pin torch to 2.7.1 on CircleCI for now" + Final fix for too long with no output by @ydshieh in #40201
Use correct model_input_names for PixtralImageProcessor by @rohitrango in #40226
fix error vocab_size at Qwen2_5_VLForConditionalGeneration loss_function by @killight98 in #40130
[SAM 2] Change checkpoints in docs and tests by @yonigozlan in #40213
Fix more typos by @cyyever in #40212
Fix ESM token_dropout crash when using inputs_embeds instead of input_ids by @notkisk in #40181
AMD scheduled CI ref env file by @ivarflakstad in #40243
Fix more pylint warnings by @cyyever in #40204
remove transpose_for_scores call in ESM-2 by @pstjohn in #40210
Add chat_template (jinja2) as an extra dependency by @tboerstad in #40128
[typing] fix type annotation error in DepthPro model image processor by @MengAiDev in #40238
[serve] guard imports by @gante in #39825
[CI] Fix repo consistency by @vasqu in #40249
Fixes for EncoderDecoderCache by @remi-or in #40008
fix: Catch correct ConnectionError for additional_chat_templates by @akug in #39874
Model card for NLLB by @sahil-kabir in #40074
Correct typo and update notes in docs Readme by @PavloFesenko in #40234
Fix benchmark workflow by @ahadnagy in #40254
docs: Update OLMo model card by @rafakatri in #40233
Skip broken tests by @zucchini-nlp in #40157
Remove MI300 CI by @ivarflakstad in #40270
set inputs_embeds to None while generate to avoid audio encoder forward in generation process by @BakerBunker in #40248
[detection] fix attention mask for RT-DETR-based models by @materight in #40269
Fix slow static cache export tests by @jackzhxng in #40261
Fix setting attention for multimodal models by @zucchini-nlp in #39984
[detection] fix correct k_proj weight and bias slicing in D-FINE by @notkisk in #40257
Skipping pytree registration in case fsdp is enabled by @romitjain in #40075
Update image_processing_perception_lm_fast.py to allow for proper override of vision_input_type by @tyleryzhu in #40252
fix which routing method by @ArthurZucker in #40283
Fix chat CLI GPU loading and request_id validation issues by @robin-ede in #40230
docs(layoutlm): add missing id=usage to <hfoptions> tag in LayoutLM model card by @Jin-HoMLee in #40273
Standardize RAG model card by @aayush226 in #40222
docs: Update TrOCR model card to new format by @AceHunterr in #40240
Update model card for gpt neox japanese by @ahnjj in #39862
SmolVLM and InternVL: Ensure pixel values are converted to the correct dtype for fp16/bf16 by @qgallouedec in #40121
Standardize BertGeneration model card by @nemitha2005 in #40250
Adjust ROCm test output expectations by @ahadnagy in #40279
SmolVLM test fixes by @ahadnagy in #40275
make model docs device agnostic (2) by @yao-matrix in #40256
[3/3] make docs device agnostic, all en docs for existing models done by @yao-matrix in #40298
Allow to be able to run torch.compile tests with fullgraph=True by @ydshieh in #40164
[FA] Fix dtype in varlen with position ids by @vasqu in #40295
[docs] delete more TF/Flax docs by @gante in #40289
Clean up X-Codec. by @ebezzam in #40271
Remove OTel SDK dependencies by @anuraaga in #40305
Fix GOT-OCR2 and Cohere2Vision image processor patches caculation by @Isotr0py in #40312
[fix] Pass adamw optimizer parameters to StableAdamW by @emapco in #40184
chore: fix typo in find_executable_batch_size to match new 0.9 ratio by @MilkClouds in #40206
🚨 [Flash Attention] Fix sliding window size by @vasqu in #40163
Remove unnecessary contiguous calls for modern torch by @Rocketknight1 in #40315
Qwen2.5-Omni test fixes by @ahadnagy in #40307
Add back _tp_plan attribute by @rishub-tamirisa in #39944
byebye torch 2.1 by @Rocketknight1 in #40317
No more natten by @ydshieh in #40287
[GPT OSS] Refactor the tests as it was not properly checking the outputs by @ArthurZucker in #40288
Update CI with nightly torch workflow file by @ydshieh in #40306
Fix: Apply get_placeholder_mask in Ovis2 by @thisisiron in #40280
Update notification service amddailyci_workflows definition by @ivarflakstad in #40314
One cache class to rule them all by @Cyrilvallez in #40276
Fix chunked attention mask with left-padding by @Cyrilvallez in #40324
[docs] remove flax references from /en/model_doc by @gante in #40311
Fix qwen-omni processor text only mode by @yuekaizhang in #40336
Change Qwen2RMSNorm to RMSNorm from PyTorch by @cyyever in #40066
Add DeepseekV3ForSequenceClassification for Deepseek V3 models by @abdokaseb in #40200
Fix deprecation warning version by @Cyrilvallez in #40343
Add missing arguments to class constructors by @cyyever in #40068
[docs] remove TF references from /en/model_doc by @gante in #40344
Fix: Only call Trainer.alignspecialtokens if model has "config" attribute by @tomaarsen in #40322
add type hints by @wirthual in #40319
Fix an infinite loop bug in recursive search of relative imports by @eladsegal in #40326
Fix links in Glm4vMoe configuration classes to point to the correct H… by @vvvdwbvvv in #40310
T5 test and target device fixes by @ahadnagy in #40313
Update test_spm_converter_bytefallback_warning by @ydshieh in #40284
(small) fix conditional for inputids and inputembeds in marian by @cyntqliu in #40045
Fix attention vizualizer by @molbap in #40285
[ModernBert] Prevent the attention mask from being None in ModernBertForSequenceClassification by @ashmikuz in #35991
Clean up XCodec and other codecs by @ebezzam in #40348
[serve] add cors warnings by @gante in #40112
[detection] use consistent dtype for Conditional and DAB DETR positional embeddings by @agkphysics in #40300
Remove more PyTorch 2.2 compatible code by @cyyever in #40337
[FA] Fix some model tests by @vasqu in #40350
Qwen2.5-VL test fixes for ROCm by @ahadnagy in #40308
[generate] handle support for cache classes when num enc layers != num dec layers by @gante in #40277
[4/N]more docs to device agnostic by @yao-matrix in #40355
DOCS: Clarification on the use of label_names as an argument to TrainingArguments by @huzaifa-jawad367 in #40353
Fix idefics3 vision embeddings indices dtype by @Isotr0py in #40360
wav2vec2 fixes by @remi-or in #40341
Change multimodal data links to HF hub by @zucchini-nlp in #40309
[pipelines] add support to skip_special_tokens in the main text generation pipelines by @gante in #40356
⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
[processor] move commonalities to mixin by @zucchini-nlp in #40339
[configuration] allow to overwrite kwargs from subconfigs by @zucchini-nlp in #40241
fix(example): align parameter names with the latest function definition for gdino by @developer0hye in #40369
Add GptOssForTokenClassification for GPT-OSS models by @abdokaseb in #40190
Bug Fix: Dynamically set return_lse flag in FlexAttention by @amd-lalithnc in #40352
Chat Template Doc Fixes by @Rocketknight1 in #40173
Rework the Cache documentation by @Cyrilvallez in #40373
Update README_zh-hans.md by @TardC in #40380
HF papers in doc by @qgallouedec in #40381
Run FA2 tests in CI by @ydshieh in #40397
Reactivate a lot of tests skipped for no reason anymore by @Cyrilvallez in #40378
🧹 🧹 🧹 Get set decoder cleanup by @molbap in #39509
fix to accept cumulative_seqlens from TransformersKwargs in FA by @Kurt232 in #40194
[docs] flax/jax purge by @gante in #40372
Fix typo: 'casual' -> 'causal' in code and documentation by @akintunero in #40371
Fix CI (hunyuan moe does not support fullgraph) by @Cyrilvallez in #40423
Fix typo: 'seperator' to 'separator' in variable names by @Prawal-Sharma in #40389
Fix UnboundLocalError in WER metric computation by @prxshetty in #40402
Gpt oss optim by @jiqing-feng in #40304
Fix processing tests by @zucchini-nlp in #40379
Fix label smoothing incompatibility with multi-label classification by @avchauzov in #40296
Fix modular for modernbert-decoder by @Cyrilvallez in #40431
Update collated reports working directory and --path by @ivarflakstad in #40433
Add tokenizer_kwargs argument to the text generation pipeline by @Joshua-Chin in #40364
[docs] remove last references to transformers TF classes/methods by @gante in #40429
Remove working-dir from collated reports job by @ivarflakstad in #40435
🌐 [i18n-KO] Translated models.md to Korean by @Judy-Choi in #39518
Gemma3 text fixes: Add expectations for MI325 by @ahadnagy in #40384
Fix collated reports model directory traversal by @ivarflakstad in #40437
Fix https://github.com/huggingface/transformers/issues/40292 by @id01 in #40439
Fix collated reports uploading by @ivarflakstad in #40440
InternVL MI325 test expectations by @ahadnagy in #40387
Fix collated reports model name entry by @ivarflakstad in #40441
Fix non FA2 tests after FA2 installed in CI docker image by @ydshieh in #40430
Refactor ViT-like models by @qubvel in #39816
[Trainer] accelerate contextparallel support in trainer by @kashif in #40205
fix qwen25-vl grad acc by @iMountTai in #40333
[video processors] decode only sampled videos -> less RAM and faster processing by @zucchini-nlp in #39600
rename getcudawarmupfactor to getacceleratorwarmupfactor by @yao-matrix in #40363
Make cache_config not mandatory by @remi-or in #40316
Continuous batching refactor by @remi-or in #40426
flashpaged: saux may not exist by @pcuenca in #40434
Fix extra template loading by @Rocketknight1 in #40455
deci gguf support by @ved1beta in #38669
[fastimageprocessor] fix image normalization for resize by @audioXD in #40436
[RoPE] explicit factor > implicit factor in YaRN by @gante in #40320
[pipeline] Add Keypoint Matching pipeline by @sbucaille in #39970
Update SegFormer model card by @GSNCodes in #40417
Not to shock AMD team by the cancelled workflow run notification ❤️ 💖 by @ydshieh in #40467
Fix nightly torch CI by @ydshieh in #40469
CI when PR merged to main by @ydshieh in #40451
Validate GptOssConfig rope config after it's fully initialized by @zifeitong in #40474
[modular] Use multi-processing + fix model import issue by @Cyrilvallez in #40481
[modular] Remove ambiguity in all calls to parent class methods + fix dependency graph by @Cyrilvallez in #40456
[ESM] support attention API by @zucchini-nlp in #40370
[EfficientLoFTR] dynamic image size support by @sbucaille in #40329
Fix qwen2_moe tests by @ydshieh in #40494
[Whisper] Add rocm expected results to certain tests by @ivarflakstad in #40482
Collated reports: no need to upload artifact by @ivarflakstad in #40502
Fix the CI workflow of merge to main by @ydshieh in #40503
docs(pixtral): Update Pixtral model card to new format by @BryanBradfo in #40442
[modular] Classes can now be defined and referenced in arbitrary order (without bringing unwanted dependencies) by @Cyrilvallez in #40507
Include machine type in collated reports filename by @ivarflakstad in #40514

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@remi-or
- Gemma3 fixes (#39960)
- Various test fixes for AMD (#39978)
- Fixes for EncoderDecoderCache (#40008)
- wav2vec2 fixes (#40341)
- Make cache_config not mandatory (#40316)
- Continuous batching refactor (#40426)
@sbucaille
- [superglue] Fixed the way batch mask was applied to the scores before match assignment computation (#39968)
- [efficientloftr] fix bugs and follow original cross attn implementation strictly (#40141)
- [pipeline] Add Keypoint Matching pipeline (#39970)
- [EfficientLoFTR] dynamic image size support (#40329)
@ducviet00
- Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips (#39965)
- Add support for Florence-2 (#38188)
@cyyever
- Fix default values of getenv (#39867)
- Enable SIM rules (#39806)
- Fix typos (#40175)
- Remove _prepare_flash_attention_from_position_ids (#40069)
- Avoid CUDA stream sync (#40060)
- Fix various Pylint warnings (#40107)
- Fix more typos (#40212)
- Fix more pylint warnings (#40204)
- Change Qwen2RMSNorm to RMSNorm from PyTorch (#40066)
- Add missing arguments to class constructors (#40068)
- Remove more PyTorch 2.2 compatible code (#40337)
@zRzRzRzRzRzRzR
- GLM-4.5V Model Support (#39805)
@SangbumChoi
- Add Segment Anything 2 (SAM2) (#32317)
@adutchengineer
- build: Add fast image processor tvp (#39529)
@MHRDYN7
- Add dates to the model docs (#39320)
@yao-matrix
- make model doc device agnostic (#40143)
- make model docs device agnostic (2) (#40256)
- [3/3] make docs device agnostic, all en docs for existing models done (#40298)
- [4/N]more docs to device agnostic (#40355)
- rename getcudawarmupfactor to getacceleratorwarmupfactor (#40363)
@Manalelaidouni
- Add X-Codec model (#38248)
@thisisiron
- Add Ovis2 model and processor implementation (#37088)
- Fix: Apply get_placeholder_mask in Ovis2 (#40280)
@tic-top
- Add Kosmos-2.5 (#31711)
@yjc9696
- HunYuan opensource (#39606)
@Fazziekey
- Adding ByteDance Seed Seed-OSS (#40272)

- Python
Published by LysandreJik 9 months ago

transformers - Patch v4.55.4

Patch v4.55.4

There was a mix-up on our side when cherry-picking commit #40197, which led to a wrong commit in the patch! Sorry everyone 😭

This patch is just the official fix for #40197!

- Python
Published by ArthurZucker 9 months ago

transformers - Patch release v4.55.3

Patch release 4.55.3

Focused on stabilizing FlashAttention-2 on Ascend NPU, improving FSDP behavior for generic-task models, and fixing MXFP4 integration for GPT-OSS.

Bug Fixes & Improvements

FlashAttention-2 / Ascend NPU – Fix “unavailable” runtime error (#40151) by @FightingZhen
FlashAttention kwargs – Revert FA kwargs preparation to resolve regression (#40161) by @Cyrilvallez
FSDP (generic-task models) – Fix sharding/runtime issues (#40191) by @Cyrilvallez
GPT-OSS / MXFP4 – Ensure swiglu_limit is correctly passed through (#40197) by @danielhanchen
Mamba – Fix cache handling to prevent stale/incorrect state (#40203) by @manueldeprada
Misc – Minor follow-up fix addressing #40262 by @ArthurZucker

- Python
Published by ArthurZucker 9 months ago

transformers - Patch release 4.55.2: for FA2 users!

Patch release 4.55.2!

only affects `FA2` generations!

😢 Well sorry everyone, sometimes shit can happen... 4.55.1 was broken because of 🥁 a git merge conflict. I cherry-picked https://github.com/huggingface/transformers/pull/40002 without having https://github.com/huggingface/transformers/pull/40029, so `from ..modeling_flash_attention_utils import prepare_fa_kwargs_from_position_ids` was missing, and since this path is only covered by a slow test, nothing caught it.

Will work to remediate and write the post-mortem when yanking the release.

- Python
Published by ArthurZucker 10 months ago

transformers - Patch release 4.55.1

Patch release 4.55.1:

Mostly focused on stabilizing MXFP4 for the GPT-OSS model!

Bug Fixes & Improvements

Idefics2, Idefics3, SmolVLM – Fix tensor device issue (#39975) by @qgallouedec
Merge conflicts – Fix merge conflicts from previous changes by @vasqu
MXFP4 / CPU device_map – Default to dequantize when CPU is in device_map (#39993) by @MekkCyber
GPT Big Code – Fix attention scaling (#40041) by @vasqu
Windows compatibility – Resolve Triton version check compatibility (#39986) by @Tsumugii24 @MekkCyber
Gemma3n model – Add missing None default values for get_placeholder_mask (#39991, #40024) by @Znerual
Fuyu model – Fix broken image inference (#39915) by @Isotr0py
PerceptionLM – Fix missing video inputs (#39971) by @shuminghu
Idefics – Fix device mismatch (#39981) by @zucchini-nlp
Triton kernels – Remove triton_kernels dependency in favor of included kernels (#39926) by @SunMarc
GPT-OSS MXFP4 – Enable on older hardware (sm75+) (#39940) by @matthewdouglas @SunMarc
MXFP4 quantizer – Allow CPU inference with dequantize option (#39953) by @returnL

CI & Build

CI stability – Post-GPT-OSS fixes for green CI (#39929) by @gante @LysandreJik

- Python
Published by ArthurZucker 10 months ago

transformers - GLM-4.5V preview based on 4.55.0

GLM-4.5V preview based on 4.55.0

New model added by the Z.ai team to transformers! GLM-4.5V is a new multimodal reasoning model based on GLM-4.5-Air, which has 106B total and 12B active parameters.

It performs well on 42 benchmarks spanning several categories:

- Image reasoning (scene understanding, complex multi-image analysis, spatial recognition)
- Video understanding (long video segmentation and event recognition)
- GUI tasks (screen reading, icon recognition, desktop operation assistance)
- Complex chart & long document parsing (research report analysis, information extraction)
- Grounding (precise visual element localization)

To use it, install the transformers release branch:

`pip install transformers-v4.55.0-GLM-4.5V-preview`

Then you can run:

```python
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens, skipping the prompt
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
```

- Python
Published by ArthurZucker 10 months ago

transformers - v4.55.0: New openai GPT OSS model!

Welcome GPT OSS, the new open-source model family from OpenAI!

For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss

GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.

Overview of Capabilities and Architecture

21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
4-bit quantization scheme using mxfp4 format. Only applied on the MoE weights. As stated, the 120B fits in a single 80 GB GPU and the 20B fits in a single 16GB GPU.
Reasoning, text-only models; with chain-of-thought and adjustable reasoning effort levels.
Instruction following and tool use support.
Inference implementations using transformers, vLLM, llama.cpp, and ollama.
Responses API is recommended for inference.
License: Apache 2.0, with a small complementary use policy.
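The memory figures above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below counts weight storage only (no activations, KV cache, or runtime overhead). It assumes MXFP4 stores 4-bit elements in blocks of 32 that share one 8-bit scale, i.e. 4.25 bits per weight. The `moe_fraction` argument is a hypothetical knob: the share of parameters that live in MoE layers is not stated in these notes, so the value used in the demo is purely illustrative.

```python
BF16_BITS = 16.0
MXFP4_BITS = 4 + 8 / 32  # 4-bit element + one shared 8-bit scale per 32 elements


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def mixed_weight_gb(num_params: float, moe_fraction: float) -> float:
    """Weights when only the MoE share is MXFP4 and the rest stays in bfloat16.

    `moe_fraction` is an assumption supplied by the caller, not a published figure.
    """
    moe = weight_gb(num_params * moe_fraction, MXFP4_BITS)
    rest = weight_gb(num_params * (1.0 - moe_fraction), BF16_BITS)
    return moe + rest


if __name__ == "__main__":
    for name, n in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
        bf16 = weight_gb(n, BF16_BITS)
        mixed = mixed_weight_gb(n, moe_fraction=0.9)  # 0.9 is illustrative only
        print(f"{name}: bf16 ~{bf16:.1f} GB, mixed ~{mixed:.1f} GB")
```

Under these assumptions the bfloat16 weights of the 21B model alone come to 42 GB, which is in the same ballpark as the figure quoted for running without mxfp4.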

Architecture

Token-choice MoE with SwiGLU activations.
When calculating the MoE weights, a softmax is taken over selected experts (softmax-after-topk).
Each attention layer uses RoPE with 128K context.
Alternate attention layers: full-context, and sliding 128-token window.
Attention layers use a learned attention sink per-head, where the denominator of the softmax has an additional additive value.
It uses the same tokenizer as GPT-4o and other OpenAI API models.
Some new tokens have been incorporated to enable compatibility with the Responses API.
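Two of the points above are easier to see in code: softmax-after-topk routing, and a softmax whose denominator carries an extra sink term. The following is a minimal pure-Python sketch of both ideas, not the transformers implementation. The function names are hypothetical, and treating the sink as one scalar logit per head is an assumption made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_softmax_after_topk(router_logits, k):
    """Pick the k highest-scoring experts, then normalize over just those k."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return dict(zip(top, weights))


def softmax_with_sink(scores, sink_logit):
    """Attention probabilities with an extra exp(sink_logit) term in the denominator.

    The sink absorbs probability mass, so the returned weights sum to less than 1.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


if __name__ == "__main__":
    print(route_softmax_after_topk([0.1, 2.0, -1.0, 1.5], k=2))
    print(softmax_with_sink([1.0, 2.0, 3.0], sink_logit=0.0))
```

Because the routing softmax only sees the selected logits, the expert weights always sum to 1 regardless of how the unselected experts scored. The sink does the opposite for attention: it lets a head assign less than full weight to the tokens it attends to.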

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```

Flash Attention 3

The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to run `pip install --upgrade kernels` and add the following line to your snippet:

```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```

Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with `torchrun --nproc_per_node=4 generate.py` on a system with 4 GPUs:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
```
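The last line of the snippet keeps only the text after the final-channel marker. Here is a small standalone sketch of that step, so it can be tried without loading a model. The helper name is hypothetical, and the sample string is made up for illustration rather than taken from real model output.

```python
FINAL_MARKER = "<|channel|>final<|message|>"


def extract_final_message(decoded: str) -> str:
    """Return the text after the last final-channel marker, stripped of whitespace.

    If the marker is absent, the whole (stripped) string is returned unchanged.
    """
    return decoded.split(FINAL_MARKER)[-1].strip()


if __name__ == "__main__":
    # Hand-written sample, not real model output
    sample = (
        "<|channel|>analysis<|message|>Counting the letters one by one..."
        "<|channel|>final<|message|> There are three. "
    )
    print(extract_final_message(sample))  # prints "There are three."
```

Note that this simple split does not remove any special tokens that follow the answer; decoding with `skip_special_tokens=True` would also drop the marker itself, so the two approaches should not be combined blindly.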

Other optimizations

If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!

[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:

```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```

[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.

transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just: transformers serve

You can then send it requests using the Responses API:

```
# responses API
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
```

You can also send requests using the standard Completions API:

```
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
```
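The same chat-completions request can be prepared from Python with only the standard library. This is a sketch that mirrors the curl call above: the endpoint path and payload fields are copied from it, while the helper name `build_chat_request` and its defaults are hypothetical. It only builds the request object. Sending it (for example with `urllib.request.urlopen`) and reading the streamed response are left out, since they need a running server.

```python
import json
import urllib.request


def build_chat_request(messages, model, base_url="http://localhost:8000",
                       temperature=1.0, max_tokens=1000, stream=True):
    """Build a POST request equivalent to the curl example (hypothetical helper)."""
    payload = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
        "model": model,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_chat_request(
        [{"role": "system", "content": "hello"}],
        model="openai/gpt-oss-120b",
    )
    # Inspect what would be sent; no network traffic happens here
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```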

Command A Vision

Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.

The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.

Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.

[Model] Cohere2 Vision by @zucchini-nlp in #39810

MM Grounding DINO

The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.

MM Grounding DINO improves upon Grounding DINO by refining the contrastive class head and removing parameter sharing in the decoder, which raises zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).

You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.

Add MM Grounding DINO by @rziga in #37925

Bugfixes and improvements

More robust tied weight test by @Cyrilvallez in #39681
fix missing model._tp_size from ep refactor by @winglian in #39688
Fix missing initialization of FastSpeech2Conformer by @bvantuan in #39689
fix(tokenization): check token.content for trie by @pjo256 in #39587
xpu optimization for generation case by @sywangyi in #39573
[processors] add tests for helper fn by @zucchini-nlp in #39629
update ernie model card by @jzhang533 in #39657
[configuration] remove redundant classmethod by @zucchini-nlp in #38812
Add self-hosted runner scale set workflow for mi325 CI by @jitesh-gupta in #39651
PATCH: add back n-dim device-mesh + fix tp trainer saving by @S1ro1 in #39693
[CI] Add Eric to comment slow ci by @vasqu in #39601
Remove all expired deprecation cycles by @Cyrilvallez in #39725
mllama outputs refactor by @itazap in #39643
Update QAPipelineTests::test_large_model_course after #39193 by @ydshieh in #39666
skip Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in #39670
Fix Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in #39503
Fix Layer device placement in Caches by @Cyrilvallez in #39732
Fix cache-related tests by @zucchini-nlp in #39676
Fix AMD dockerfile for audio models by @remi-or in #39669
Superpoint fast image processor by @arkhamHack in #37804
Add Fast Segformer Processor by @capnmav77 in #37024
BLIPs clean-up by @zucchini-nlp in #35560
extend more trainer test cases to XPU, all pass by @yao-matrix in #39652
fix cache inheritance by @ArthurZucker in #39748
[Fix] import two missing typos in models/__init__.py for typo checking by @hebangwen in #39745
Fix: add back base model plan by @S1ro1 in #39733
update GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in #39731
Update IMPORTANT_MODELS list by @ivarflakstad in #39734
Fix mamba regression by @manueldeprada in #39728
Apply several ruff SIM rules by @cyyever in #37283
Use --gpus all in workflow files by @ydshieh in #39752
AMD disable torchcodec by @ivarflakstad in #39757
Avoid OOM when other tests are failing by @ydshieh in #39758
Fix GPT2 with cross attention by @zucchini-nlp in #39754
Support loading Qwen3 MoE GGUF by @ctcanbol in #39638
Enable xpu allocator on caching_allocator_warmup by @jiqing-feng in #39654
Fix version issue in modeling_utils.py by @Cyrilvallez in #39759
add libcst to extras["testing"] in setup.py by @ydshieh in #39761
[modenbert] fix regression by @zucchini-nlp in #39750
🌐 [i18n-KO] Translated main_classes/peft.md by @luckyvickyricky in #39515
🌐 [i18n-KO] Translated albert.md to Korean by @ahnjj in #39524
🌐 [i18n-KO] Translated tvp.md to Korean by @Kim-Ju-won in #39578
🌐 [i18n-KO] Translated tokenizer.md to Korean by @seopp in #39532
🌐 [i18n-KO] Translated pipeline_gradio.md to Korean by @AhnJoonSung in #39520
🌐 [i18n-KO] Translated perf_train_gpu_one.md to Korean by @D15M4S in #39552
🌐 [i18n-KO] Translated how_to_hack_models.md to Korean by @skwh54 in #39536
fix(trainer): Correct loss scaling for incomplete gradient accumulation steps by @hutaiHang in #39659
Fix Cache.max_cache_len max value for Hybrid models by @manueldeprada in #39737
[docs] Ko doc fixes after toc update by @gante in #39660
Remove python3.7 reference from doc link by @st81 in #39706
Fix OmDet test after arg deprecation by @Cyrilvallez in #39766
docs: Update EfficientLoFTR documentation by @sbucaille in #39620
Standardize CLAP model card format by @yanamis in #39738
Don't set run_name when none by @qgallouedec in #39695
Fix Evolla and xLSTM tests by @Cyrilvallez in #39769
enable static cache on vision encoder decoder by @jiqing-feng in #39773
[ASR pipline] fix with datasets 4.0 by @eustlb in #39504
more info in model_results.json by @ydshieh in #39783
Super tiny update by @zucchini-nlp in #39727
fix chameleonvision UT failure by @yao-matrix in #39646
Fix an invalid condition by @cyyever in #39762
Simplify conditional code by @cyyever in #39781
Fix re-compilations for cross attention cache by @zucchini-nlp in #39788
standardized BARThez model card by @EthanV431 in #39701
Update model card for Cohere2 (Command R7B) by @arpon-kapuria in #39604
Update mT5 model card by @dross20 in #39702
Add callback to monitor progress in whisper transcription by @poke1024 in #37483
fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test by @gante in #39300
feat(tokenization): add encode_message to tokenize messages one by one by @pco111 in #39507
[docs] fix korean docs yet again by @gante in #39813
Update documentation for Cohere2Vision models by @kyle-cohere in #39817
[cohere2 vision] move doc to multimodal section by @zucchini-nlp in #39820
Fix broken links by @oToToT in #39809
Fix bad markdown links by @ebezzam in #39819
Fix tp cb by @ArthurZucker in #39838
[VLMs] split out "get placeholder mask" to helper by @zucchini-nlp in #39777
[attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
[typecheck] proper export of private symbols by @cyyever in #39729
Update ux cb by @ArthurZucker in #39845
Fix responses add tests by @LysandreJik in #39848
Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid by @yonigozlan in #39739
[image-processing] deprecate plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in #39830
Allow TrackioCallback to work when pynvml is not installed by @qgallouedec in #39851
remove dtensors, not explicit by @ArthurZucker in #39840
Improve is_wandb_available function to verify WandB installation by @qgallouedec in #39875
Refactor label name handling for PEFT models in Trainer class by @qgallouedec in #39265
Use comment to build doc on PRs by @ydshieh in #39846
Add support for including in-memory videos (not just files/urls) in apply_chat_template by @akibjawad in #39494
[core] Fix attn_implementation setter with missing `sub_configs` by @qubvel in #39855
Fix quant docker for fp-quant by @SunMarc in #39641
Rework add-new-model-like with modular and make test filenames coherent by @Cyrilvallez in #39612
Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in #39858
Set torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in #39885
[typing] better return type hint for AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in #39881
Fix link to models in README by @qubvel in #39880
[DOCS] : Improved mimi model card by @rohitthewanderer in #39824
Update cohere2 vision test by @ydshieh in #39888
send some feedback when manually building doc via comment by @ydshieh in #39889
Add support for ModernBertForMultipleChoice by @netique in #39232
chore: update DETR model card by @arpon-kapuria in #39822
Reorder serving docs by @LysandreJik in #39634
[Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906
fix test_working_of_tp failure of accelerate ut by @yao-matrix in #39828
[qwen] remove unnecessary CUDA sync in qwen25vl by @cyyever in #39870
Avoid aliasing in cond's branches for torch 2.8 by @ydwu4 in #39488
Fix misleading WandB error when WANDB_DISABLED is set by @notkisk in #39891
Replace video_fps with fps in tests by @cyyever in #39898
Fix eval thread fork bomb by @JustinVanHeek in #39717
Fix aria tests by @zucchini-nlp in #39879

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@capnmav77
- Add Fast Segformer Processor (#37024)
@cyyever
- Apply several ruff SIM rules (#37283)
- Fix an invalid condition (#39762)
- Simplify conditional code (#39781)
- [typecheck] proper export of private symbols (#39729)
- [qwen] remove unnecessary CUDA sync in qwen25vl (#39870)
- Replace video_fps with fps in tests (#39898)
@rziga
- Add MM Grounding DINO (#37925)

- Python
Published by LysandreJik 10 months ago

transformers - Patch release 4.54.1

Patch release 4.54.1

We had quite a lot of bugs that got through! The release was a bit rushed, sorry everyone! 🤗 Mostly cache fixes, as we now have a layered cache, and fixes to distributed.

Fix Cache.max_cache_len max value for Hybrid models, @manueldeprada, @Cyrilvallez, #39737
[modenbert] fix regression, @zucchini-nlp, #39750
Fix version issue in modeling_utils.py, @Cyrilvallez, #39759
Fix GPT2 with cross attention, @zucchini-nlp, #39754
Fix mamba regression, @manueldeprada, #39728
Fix: add back base model plan, @S1ro1, #39733
fix cache inheritance, @ArthurZucker, #39748
Fix cache-related tests, @zucchini-nlp, #39676
Fix Layer device placement in Caches, @Cyrilvallez, #39732
PATCH: add back n-dim device-mesh + fix tp trainer saving, @S1ro1, @SunMarc, #39693
fix missing model._tp_size from ep refactor, @winglian, #39688

- Python
Published by ArthurZucker 10 months ago

transformers - v4.54.0: Kernels, Transformers Serve, Ernie, Voxtral, LFM2, DeepSeek v2, ModernBERT Decoder...

Important news!

In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:

transformers is bloated
transformers is slow

Our team has focused on improving both aspects, and we are now ready to announce this. The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."

The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well! It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).

This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!

This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.

We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!

New models

Ernie 4.5

The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.

Other models from the family can be found at Ernie 4.5 MoE.

[Ernie 4.5] Add ernie text models by @vasqu in #39228

Voxtral

Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.

You can read more in Mistral's release blog post.

The model is available in two checkpoints:

- 3B: mistralai/Voxtral-Mini-3B-2507
- 24B: mistralai/Voxtral-Small-24B-2507

Key Features

Voxtral builds on Ministral-3B by adding audio processing capabilities:

Transcription mode: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
Long-form context: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
Integrated Q&A and summarization: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
Multilingual support: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
Function calling via voice: Can trigger functions or workflows directly from spoken input based on detected user intent.
Text capabilities: Maintains the strong text processing performance of its Ministral-3B foundation.
Add voxtral by @eustlb in #39429

LFM2

LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.

The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.

LFM2 by @paulpak58 in #39340

DeepSeek v2

The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.

Add DeepSeek V2 Model into Transformers by @VladOS95-cyber in #36400

ModernBERT Decoder models

ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.

Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.

Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! by @orionw in #38967
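The difference between full causal attention and a local (sliding-window) causal pattern comes down to the attention mask. The sketch below builds both masks as nested lists of booleans. It is an illustration only, not the ModernBERT Decoder code. The window convention (a query sees itself plus the `window - 1` positions before it) and the function names are assumptions; which layers use which pattern is not covered here.

```python
def causal_mask(seq_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_causal_mask(seq_len, window):
    """Causal mask restricted to the most recent `window` positions (assumed convention)."""
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
    print()
    for row in sliding_window_causal_mask(4, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In both masks no position can look ahead, which is what makes autoregressive generation possible. The sliding-window variant additionally caps how far back each position can look.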

EoMT

The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.

✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation by @yaswanth19 in #37610

Doge

Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/modeldoc/dogearchitecture.png" alt="drawing" width="600"/>

Add Doge model by @LoserCheems in #35891

AIM v2

The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.

The abstract from the paper is the following:

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

Add Aimv2 model by @yaswanth19 in #36625

PerceptionLM

The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.

PerceptionLM by @shuminghu in #37878

Efficient LoFTR

The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.

This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

Add EfficientLoFTR model by @sbucaille in #36355

EVOLLA

Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.

Add evolla rebase main by @zhoubay in #36232

DeepSeek VL

Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.

Add support for DeepseekAI's DeepseekVL by @geetu040 in #36248

xLSTM

The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.

The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.

Add xlstm model by @Cyrilvallez in #39665

EXAONE 4.0

EXAONE 4.0 is a language model that integrates a non-reasoning mode and a reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.

The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.

Add EXAONE 4.0 model by @lgai-exaone in #39129

Parallelisation

We've added Expert Parallel support for Llama4; the next release will extend it to all models! To enable it, set a distributed_config with enable_expert_parallel=True. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer can run in parallel (instead of the previous TP, which requires more communication), significantly improving scalability and memory efficiency.
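To make the idea concrete, here is a toy, framework-free sketch of how expert parallelism can place experts on devices and group tokens by the device that owns their expert. It is not the Transformers implementation: the helper names `assign_experts` and `dispatch_tokens`, the contiguous sharding scheme, and the top-1 routing are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_experts(num_experts: int, world_size: int) -> dict[int, list[int]]:
    """Give each rank a contiguous slice of experts (assumed sharding scheme)."""
    if num_experts % world_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by world_size")
    per_rank = num_experts // world_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(world_size)
    }


def dispatch_tokens(routing: list[int], placement: dict[int, list[int]]) -> dict[int, list[int]]:
    """Group token indices by the rank that owns each token's routed expert."""
    owner = {expert: rank for rank, experts in placement.items() for expert in experts}
    per_rank = defaultdict(list)
    for token_idx, expert_id in enumerate(routing):
        per_rank[owner[expert_id]].append(token_idx)
    return dict(per_rank)


placement = assign_experts(num_experts=8, world_size=4)
# Token i is routed to expert routing[i] (top-1 routing, for simplicity).
tokens_per_rank = dispatch_tokens([0, 7, 3, 2, 5, 0], placement)
print(placement)
print(tokens_per_rank)
```

Each rank only needs to hold the weights of its own experts, which is where the memory savings come from; the cost is the token exchange between ranks that `dispatch_tokens` stands in for.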

Add ep by @ArthurZucker in #39501

Quantization

FP Quant

FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.

Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:

```python
import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```

FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use `FPQuantConfig(pseudoquant=True)` to emulate quantization (no QuTLASS needed).
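For intuition about what emulated quantization means, the sketch below rounds a list of floats to the MXFP4 number format and converts them straight back. It only illustrates the format (FP4 E2M1 elements sharing one power-of-two scale per block of 32) and is not FP-Quant's actual algorithm; the function names and the scale-selection rule (one common choice) are assumptions.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # MXFP4 shares one power-of-two scale across 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def fake_quantize_block(block: list[float]) -> list[float]:
    """Quantize one block to MXFP4 and immediately dequantize it."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Shared scale: a power of two that maps the largest value near the top of the FP4 range.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - E2M1_EMAX)
    out = []
    for x in block:
        scaled = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # clamp to the FP4 maximum
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - scaled))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(values: list[float]) -> list[float]:
    """Apply block-wise MXFP4 emulation to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out


weights = [0.1 * i for i in range(-16, 16)]
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs error: {max_error:.3f}")
```

Values that are already representable at the chosen scale pass through unchanged; everything else snaps to the nearest of the eight magnitudes, which is the rounding error a quantized model has to tolerate.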

The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.

FP-Quant support by @BlackSamorez in #38696

Kernels

The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!

You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it, and use it right away with no extra setup.

Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:

```python
model.set_attn_implementation("kernels-community/flash-attn3")
```

This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).

We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.

Kernels flash attn by @ArthurZucker in #39474

Transformers Serve

https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e

Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat into a new, separate utility called transformers serve.

This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.

Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).

The server supports the following REST APIs:

/v1/chat/completions
/v1/responses
/v1/audio/transcriptions
/v1/models
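Because these endpoints follow the OpenAI conventions, a plain HTTP POST is enough to talk to the server. The sketch below builds a chat completion request with the standard library only. The base URL `http://localhost:8000`, the model id, and the helper names are assumptions; adjust them to your own setup.

```python
import json
import urllib.request

# Assumed address of a locally running `transformers serve` instance.
BASE_URL = "http://localhost:8000"


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# The model id below is a placeholder, not a recommendation.
request = build_chat_request("Qwen/Qwen3-8B", "Hello!")
print(request.full_url)
```

Calling `send(request)` only works once a server is running; the exact response shape may depend on the server version and on whether streaming is used.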

Relevant commits:

Split transformers chat and transformers serve by @LysandreJik in #38443
[serve] Cursor support, move docs into separate page, add more examples by @gante in #39133
Fix continuous batching in transformers serve by @LysandreJik in #39149
[server] add tests and fix passing a custom generation_config by @gante in #39230
[serve] Model name or path should be required by @LysandreJik in #39178
Random serve fixes by @pcuenca in #39176
[tests] tag serve tests as slow by @gante in #39343
Responses API in transformers serve by @LysandreJik in #39155
[serve] Add speech to text (/v1/audio/transcriptions) by @gante in #39434
Transformers serve VLM by @LysandreJik in #39454

Refactors

Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. One metric we use to track the impact of these refactors is the number of lines in a given model file; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.

See the evolution here:

Some notable refactors:

KV caching

KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
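As a rough illustration of the per-layer idea, the toy sketch below gives each layer its own cache object, so a single cache can mix full-attention and sliding-window layers. The class names `FullLayerCache`, `SlidingWindowLayerCache`, and `HybridCache` are hypothetical and do not mirror the actual Transformers classes.

```python
class FullLayerCache:
    """Keeps every key/value pair seen so far (full attention)."""

    def __init__(self) -> None:
        self.keys: list = []
        self.values: list = []

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values


class SlidingWindowLayerCache(FullLayerCache):
    """Keeps only the most recent `window` key/value pairs."""

    def __init__(self, window: int) -> None:
        super().__init__()
        self.window = window

    def update(self, key, value):
        super().update(key, value)
        # Drop everything that fell out of the attention window.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]
        return self.keys, self.values


class HybridCache:
    """A cache is just a list of per-layer caches, one per decoder layer."""

    def __init__(self, layers: list) -> None:
        self.layers = layers

    def update(self, layer_idx: int, key, value):
        return self.layers[layer_idx].update(key, value)


# Layer 0 uses full attention, layer 1 uses a sliding window of 2 tokens.
cache = HybridCache([FullLayerCache(), SlidingWindowLayerCache(window=2)])
for step in range(4):
    for layer_idx in range(2):
        cache.update(layer_idx, key=f"k{step}", value=f"v{step}")

print(len(cache.layers[0].keys), len(cache.layers[1].keys))
```

Because the container only delegates to its layers, adding a new attention type means adding one new layer class rather than a whole new cache class.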

[cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106

Handling specific attributes like `output_attentions` or `output_hidden_states`

Such attributes require very specific handling within the forward call, while they're not important to understand how the model works. We remove that code but keep the functionality by providing a better utility to handle it.

Refactor the way we handle outputs for new llamas and new models by @ArthurZucker in #39120

Setting the attention implementation

We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.

[refactor] set attention implementation by @zucchini-nlp in #38974

Breaking changes

[Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! by @eustlb in #36632
🚨 Don't use cache in non-generative models by @zucchini-nlp in #38751
🚨🚨🚨 [eomt] make EoMT compatible with pipeline by @yaswanth19 in #39122
🚨🚨 Fix and simplify attention implementation dispatch and subconfigs handling by @Cyrilvallez in #39423
🚨🚨🚨 [Trainer] Enable average_tokens_across_devices by default in TrainingArguments by @Krish0909 in #39395
🔴 Fix EnCodec internals and integration tests by @ebezzam in #39431

Bugfixes and improvements

Add StableAdamW Optimizer by @SunMarc in #39446
[Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406
fix test_compare_unprocessed_logit_scores by @ydshieh in #39053
fix t5gemma tests by @ydshieh in #39052
Update SuperPoint model card by @sbucaille in #38896
fix layoutlmv3 tests by @ydshieh in #39050
[docs] Model contribution by @stevhliu in #38995
Update PEGASUS-X model card by @dross20 in #38971
[docs] @auto_docstring by @stevhliu in #39011
[docs] Tensor parallelism by @stevhliu in #38241
[Whisper] fix shape mismatch in tests by @eustlb in #39074
Cleanup Attention class for Siglip and dependent models by @yaswanth19 in #39040
fix Gemma3nProcessorTest by @ydshieh in #39068
Fix initialization of OneFormer by @bvantuan in #38901
Uninstallling Flash attention from quantization docker by @MekkCyber in #39078
fix a bunch of XPU UT failures on stock PyTorch 2.7 and 2.8 by @yao-matrix in #39069
Pipeline: fix unnecessary warnings by @eustlb in #35753
fix mistral3 tests by @ydshieh in #38989
fixed typo for docstring in prepare_inputs method by @JINO-ROHIT in #39071
TST PEFT integration tests with pipeline generate by @BenjaminBossan in #39086
add fast image processor nougat by @NahieliV in #37661
Add Fast Image Processor for mobileViT by @MinJu-Ha in #37143
guard torch distributed check by @tvukovic-amd in #39057
fix dots1 tests by @ydshieh in #39088
Add Fast Image Processor for Chameleon by @farrosalferro in #37140
Fix: unprotected import of tp plugin by @S1ro1 in #39083
TST Fix PEFT integration test bitsandbytes config by @BenjaminBossan in #39082
[fix] Add FastSpeech2ConformerWithHifiGan by @stevhliu in #38207
Sandeepyadav1478/2025 06 19 deberta v2 model card update by @sandeepyadav1478 in #38895
Fixes the failing test test_is_split_into_words in test_pipelines_token_classification.py by @st81 in #39079
skip some test_sdpa_can_dispatch_on_flash by @ydshieh in #39092
fix UT failures on XPU w/ stock PyTorch 2.7 & 2.8 by @yao-matrix in #39116
Fix some bug for finetune and batch infer For GLM-4.1V by @zRzRzRzRzRzRzR in #39090
docs: Gemma 3n audio encoder by @RyanMullins in #39087
All CI jobs with A10 by @ydshieh in #39119
Licenses by @LysandreJik in #39127
Fix chat by @gante in #39128
Enable XPU doc by @jiqing-feng in #38929
docs: correct two typos in awesome-transformers.md by @VladimirGutuev in #39102
switch default xpu tp backend to pytorch built-in XCCL from pytorch 2.8 by @yao-matrix in #39024
Update BigBirdPegasus model card by @dross20 in #39104
[Whisper] update token timestamps tests by @eustlb in #39126
Fix key mapping for VLMs by @bvantuan in #39029
Several fixes for Gemma3n by @Cyrilvallez in #39135
fix caching_allocator_warmup with tie weights by @jiqing-feng in #39070
feat: support indivisible shards for TP model loading and TPlizing. by @kmehant in #37220
[qwen2-vl] fix FA2 inference by @zucchini-nlp in #39121
[typing] LlamaAttention return typehint by @ArkVex in #38998
[VLMs] support passing embeds along with pixels by @zucchini-nlp in #38467
[superglue] fix wrong concatenation which made batching results wrong by @sbucaille in #38850
Fix missing fsdp & trainer jobs in daily CI by @ydshieh in #39153
Fix: Ensure wandb logs config in offline mode by @DavidS2106 in #38992
Change @lru_cache() to @lru_cache to match styles from #38883. by @rasmi in #39093
fix: remove undefined variable by @ybkurt in #39146
update bnb ground truth by @jiqing-feng in #39117
Suggest jobs to use in run-slow by @ydshieh in #39100
Update expected values (after switching to A10) by @ydshieh in #39157
fix llama tests by @ydshieh in #39161
Add activation sparsity reference in gemma3n doc by @ChongYou in #39160
fix default value of config to match checkpionts in LLaVa-OV models by @ved1beta in #39163
[smolvlm] fix video inference by @zucchini-nlp in #39147
Fix multimodal processor get duplicate arguments when receive kwargs for initialization by @Isotr0py in #39125
Blip2 fixes by @remi-or in #39080
Fix missing initializations for models created in 2024 by @bvantuan in #38987
Reduce Glm4v model test size significantly by @Cyrilvallez in #39173
[docs] ViTPose by @stevhliu in #38630
[generate] document non-canonical beam search default behavior by @gante in #39000
Update expected values (after switching to A10) - part 2 by @ydshieh in #39165
Update expected values (after switching to A10) - part 3 by @ydshieh in #39179
Test fixes for Aria (and some Expectation for llavanextvideo) by @remi-or in #39131
[glm4v] fix video inference by @zucchini-nlp in #39174
when delaying optimizer creation only prepare the model by @winglian in #39152
Decouple device_map='auto' and tp_plan='auto' by @SunMarc in #38942
Fix many HPU failures in the CI by @IlyasMoutawwakil in #39066
[Dia] Change ckpt path in docs by @vasqu in #39181
Update expected values (after switching to A10) - part 4 by @ydshieh in #39189
[typing] better return typehints for from_pretrained by @qubvel in #39184
Update expected values (after switching to A10) - part 5 by @ydshieh in #39205
Update expected values (after switching to A10) - part 6 by @ydshieh in #39207
Add packed tensor format support for flex/sdpa/eager through the mask! by @Cyrilvallez in #39194
Update expected values (after switching to A10) - part 7 by @ydshieh in #39218
Update expected values (after switching to A10) - part 8 - Final by @ydshieh in #39220
[video processors] Support float fps for precise frame sampling by @zrohyun in #39134
Expectations re-order and corrected FA3 skip by @remi-or in #39195
[vjepa2] replace einsum with unsqueeze by @xenova in #39234
Fix missing fast tokenizer/image_processor in whisper/qwen2.5-omni processor by @Isotr0py in #39244
[modular] Follow global indexing and attribute setting, and their dependencies by @Cyrilvallez in #39180
fix typo in Gemma3n notes by @davanstrien in #39196
Don't send new comment if the previous one is less than 30 minutes (unless the content is changed) by @ydshieh in #39170
fix bug using FSDP V1 will lead to model device not properly set by @kaixuanliu in #39177
Make _compute_dynamic_ntk_parameters exportable by @xadupre in #39171
[modular] Simplify logic and docstring handling by @Cyrilvallez in #39185
[bugfix] fix flash attention 2 unavailable error on Ascend NPU by @FightingZhen in #39166
fix fastspeech2_conformer tests by @ydshieh in #39229
RotaryEmbeddings change is not None -> isinstance(..., dict) by @qubvel in #39145
Fix patch helper by @Cyrilvallez in #39216
enable xpu on kv-cache and hqq doc by @jiqing-feng in #39246
adjust input and output texts for test_modeling_recurrent_gemma.py by @kaixuanliu in #39190
Update tiny-agents example by @Wauplin in #39245
Add Korean translation for glossary.md by @JoosunH in #38804
Clarify per_device_train_batch_size scaling in TrainingArguments by @Shohail-Ismail in #38…
Add segmentation_maps support to MobileNetV2ImageProcessor by @simonreise in #37312
Simplify Mixtral and its modular children by @Cyrilvallez in #39252
fix some flaky tests in tests/generation/test_utils.py by @ydshieh in #39254
Update LED model card by @dross20 in #39233
Glm 4 doc by @zRzRzRzRzRzRzR in #39247
fix xpu failures on PT 2.7 and 2.8 w/o IPEX and enable hqq cases on XPU by @yao-matrix in #39187
Fix license text, duplicate assignment, and typo in constant names by @gudwls215 in #39250
Skip test_eager_matches sdpa generate and update an integration test for blip-like models by @ydshieh in #39248
remove broken block by @molbap in #39255
fix(generation): stop beam search per-instance when heuristic satisfied by @guang-yng in #38778
fix recompiles due to instance key, and deepcopy issues by @ArthurZucker in #39270
Fix errors when use verl to train GLM4.1v model by @kaln27 in #39199
[CI] fix docs by @gante in #39273
[pagged-attention] fix off-by-1 error in pagged attention generation by @kashif in #39258
[smollm3] add tokenizer mapping for smollm3 by @gante in #39271
Refactor PretrainedConfig.__init__ method to make it more explicit by @qubvel in #39158
fix flaky test_generate_compile_model_forward by @ydshieh in #39276
[lightglue] add support for remote code DISK keypoint detector by @sbucaille in #39253
Add torchcodec in docstrings/tests for datasets 4.0 by @lhoestq in #39156
Update T5gemma by @bzhangGo in #39210
[Tests] Update model_id in AIMv2 Tests by @yaswanth19 in #39281
Fix SDPA attention precision issue in Qwen2.5-VL by @JJJYmmm in #37363
[flash attn 3] bring back flags by @zucchini-nlp in #39294
fix aria tests by @ydshieh in #39277
skip test_torchscript_* for now until the majority of the community ask for it by @ydshieh in #39307
[modular] Allow method with the same name in case of @property decorator by @Cyrilvallez in #39308
[sliding window] revert and deprecate by @zucchini-nlp in #39301
🌐 [i18n-KO] Translated quark.md to Korean by @maximizemaxwell in #39268
Fix consistency and a few docstrings warnings by @Cyrilvallez in #39314
add stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315
Updated the Model docs - for the MARIAN model by @emanrissha in #39138
skip files in src/ for doctest (for now) by @ydshieh in #39316
docs: update LLaVA-NeXT model card by @Bpriya42 in #38894
Fix typo: langauge -> language by @tomaarsen in #39317
Granite speech speedups by @avihu111 in #39197
Fix max_length_q and max_length_k types to flash_attn_varlen_func by @HollowMan6 in #37206
enable static cache on TP model by @jiqing-feng in #39164
Fix broken SAM after #39120 by @yonigozlan in #39289
Delete deprecated stuff by @zucchini-nlp in #38838
fix Glm4v batch videos forward by @Kuangdd01 in #39172
fix phi3 tests by @ydshieh in #39312
Handle DAC conversion when using weight_norm with newer PyTorch versions by @edwko in #36393
[modeling][lfm2] LFM2: Remove deprecated seen_tokens by @paulpak58 in #39342
[Core] [Offloading] Enable saving offloaded models with multiple shared tensor groups by @kylesayrs in #39263
Add a default value for position_ids in masking_utils by @Cyrilvallez in #39310
[modular] speedup check_modular_conversion with multiprocessing by @qubvel in #37456
Updated Switch Transformers model card with standardized format (Issue #36979) by @giuseppeCoccia in #39305
Fix link for testpypi by @Cyrilvallez in #39360
update cb TP by @ArthurZucker in #39361
fix failing test_sdpa_can_dispatch_on_flash by @ydshieh in #39259
Verbose error in fix mode for utils/check_docstrings.py by @manueldeprada in #38915
Remove device check in HQQ quantizer by @learning-chip in #39299
Add mistral common support by @juliendenize in #38906
Update Readme to Run Multiple Choice Script from Example Directory by @eromomon in #39323
Updated CamemBERT model card to new standardized format by @MShaheerMalik77 in #39227
fix gpt2 usage doc by @Xiang-cd in #39351
Update Model Card for Encoder Decoder Model by @ParagEkbote in #39272
update docker file to use latest timm (for perception_lm) by @ydshieh in #39380
Fix overriding Fast Image/Video Processors instance attributes affect other instances by @yonigozlan in #39363
[shieldgemma] fix checkpoint loading by @zucchini-nlp in #39348
[BLIP] remove cache from Qformer by @zucchini-nlp in #39335
[Qwen2.5-VL] Fix torch.finfo() TypeError for integer attention_mask_tensor by @dsnsabari in #39333
Deprecate AutoModelForVision2Seq by @zucchini-nlp in #38900
Fix Lfm2 and common tests by @Cyrilvallez in #39398
[examples] fix do_reduce_labels argument for run_semantic_segmentation_no_trainer by @eromomon in #39322
Totally rewrite how pipelines load preprocessors by @Rocketknight1 in #38947
Use np.pad instead of np.lib.pad. by @rasmi in #39346
[Docs] Fix typo in CustomTrainer compute_loss method and adjust loss reduction logic by @MilkClouds in #39391
Update phi4_multimodal.md by @tanuj-rai in #38830
[siglip] fix pooling comment by @sameerajashyam in #39378
Fix typo in /v1/models output payload by @alvarobartt in #39414
support loading qwen3 gguf by @44670 in #38645
Ignore extra position embeddings weights for ESM by @Rocketknight1 in #39063
set document_question_answering pipeline _load_tokenizer to True by @jiqing-feng in #39411
Fix invalid property by @cyyever in #39384
refactor: remove set_tracer_provider and set_meter_provider calls by @McPatate in #39422
Fix bugs from pipeline preprocessor overhaul by @Rocketknight1 in #39425
Fix bugs in pytorch example run_clm when streaming is enabled by @HRezaei in #39286
Remove deprecated audio utils functions by @jiangwangyi in #39330
Remove residual quantization attribute from dequantized models by @DWarez in #39373
handle training summary when creating modelcard but offline mode is set by @winglian in #37095
[vlm] fix loading of retrieval VLMs by @zucchini-nlp in #39242
docs: update SuperGlue docs by @sbucaille in #39406
docs: update LightGlue docs by @sbucaille in #39407
CI workflow for performed test regressions by @ahadnagy in #39198
[autodocstring] add video and audio inputs by @zucchini-nlp in #39420
[Core] [Offloading] Fix saving offloaded submodules by @kylesayrs in #39280
Remove double soft-max in load-balancing loss. Fixes #39055 . by @rudolfwilliam in #39056
Fixed a bug calculating cross entropy loss in JetMoeForCausalLM by @Phoenix-Shen in #37830
[chat template] add a testcase for kwargs by @zucchini-nlp in #39415
Fix L270 - hasattr("moe_args") returning False error by @wjdghks950 in #38715
Defaults to adamw_torch_fused for Pytorch>=2.8 by @cyyever in #37358
Change log level from warning to info for scheduled request logging in ContinuousBatchProcessor by @qgallouedec in #39372
Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer by @richardodliu in #31870
Fix missing definition of diff_file_url in notification service by @ahadnagy in #39445
add test scanner by @molbap in #39419
Remove runtime conditions for type checking by @cyyever in #37340
docs: add missing numpy import to minimal example by @IliasAarab in #39444
[cache] make all classes cache compatible finally by @zucchini-nlp in #38635
Fix typo in generation configuration for Janus model weight conversion by @thisisiron in #39432
Better typing for model.config by @qubvel in #39132
[Bugfix] [Quantization] Remove unused init arg by @kylesayrs in #39324
Fix processor tests by @zucchini-nlp in #39450
Remove something that should have never been there by @ArthurZucker in #38254
make the loss context manager easier to extend by @winglian in #39321
Fixes #39204: add fallback if get_base_model missing by @sebastianvlad1 in #39226
[CI] Fix partially red CI by @vasqu in #39448
Updated Megatron conversion script for gpt2 checkpoints by @LckyLke in #38969
Fix indentation bug in SmolVLM image processor causing KeyError by @Krish0909 in #39452
fix cached file error when repo type is dataset by @hiyouga in #36909
Improve grammar and clarity in perf_hardware.md by @ridima11 in #39428
create ijepa modelcard (ref : PR #36979 ). by @dhruvmalik007 in #39354
Corrections to PR #38642 and enhancements to Wav2Vec2Processor call and pad docstrings by @renet10 in #38822
fix(pipelines): QA pipeline returns fewer than top_k results in batch mode by @yushi2006 in #39193
fix max_length calculating using cu_seq_lens by @KKZ20 in #39341
Fix tests due to breaking change in accelerate by @SunMarc in #39451
Use newer typing notation by @cyyever in #38934
fix a comment typo in utils.py by @klimarissa17 in #39459
Update GemmaIntegrationTest::test_model_2b_bf16_dola by @ydshieh in #39362
Fix convert_and_export_with_cache failures for GPU models by @Stonepia in #38976
Enable some ruff checks for performance and readability by @cyyever in #39383
fix: ImageTextToTextPipeline handles user-defined generation_config by @peteryschneider in #39374
Update integration_utils.py by @zhaiji0727 in #39469
Add unified logits_to_keep support to LLMClass by @hellopahe in #39472
Fix typing order by @Tavish9 in #39467
[dependencies] temporary pyarrow pin by @gante in #39496
Slack CI bot: set default result for non-existing artifacts by @ahadnagy in #39499
[dependencies] Update datasets pin by @gante in #39500
[chat template] return assistant mask in processors by @zucchini-nlp in #38545
[gemma3] Fix do_convert_rgb in image processors. by @MohitIntel in #39438
Fix BatchEncoding.to() for nested elements by @eginhard in #38985
Add fast image processor SAM by @yonigozlan in #39385
Improve @auto_docstring doc and rename `args_doc.py` to `auto_docstring.py` by @yonigozlan in #39439
Update SAM/SAM HQ attention implementation + fix Cuda sync issues by @yonigozlan in #39386
Fix placeholders replacement logic in auto_docstring by @yonigozlan in #39433
[gemma3] support sequence classification task by @zucchini-nlp in #39465
[qwen2 vl] fix packing with all attentions by @zucchini-nlp in #39447
GLM-4 Update by @zRzRzRzRzRzRzR in #39393
Fix bad tensor shape in failing Hubert test. by @ebezzam in #39502
Fix the check in flex test by @Cyrilvallez in #39548
Rename _supports_flash_attn_2 in examples and tests by @zucchini-nlp in #39471
Fix Qwen Omni integration test by @Cyrilvallez in #39553
Fix pylint warnings by @cyyever in #39477
Raise TypeError instead of ValueError for invalid types by @Sai-Suraj-27 in #38660
Fix missing initializations for models created in 2023 by @bvantuan in #39239
use the enable_gqa param in torch.nn.functional.scaled_dot_product_at… by @sywangyi in #39412
Fix Docstring of BarkProcessor by @st81 in #39546
Refactor MambaCache to modeling_mamba.py by @manueldeprada in #38086
fix ndim check of device_mesh for TP by @winglian in #39538
[Fast image processor] refactor fast image processor glm4v by @yonigozlan in #39490
🌐 [i18n-KO] Translated perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441
Refactor embedding input/output getter/setter by @molbap in #39339
[Fast image processors] Improve handling of image-like inputs other than images (segmentation_maps) by @yonigozlan in #39489
[CI] Fix post merge ernie 4.5 by @vasqu in #39561
Update modernbertdecoder docs by @orionw in #39453
Update OLMoE model card by @nlhmnlhmnlhm in #39344
[gemma3] fix bidirectional image mask by @zucchini-nlp in #39396
Bump AMD container for 2.7.1 PyTorch by @ahadnagy in #39458
Fixes needed for n-d parallelism and TP by @winglian in #39562
[timm_wrapper] add support for gradient checkpointing by @Yozer in #39287
Add AMD test expectations to DETR model by @ahadnagy in #39539
[docs] update attention implementation and cache docs by @zucchini-nlp in #39547
[docs] Create page on inference servers with transformers backend by @zucchini-nlp in #39550
Add AMD expectations to Mistral3 tests by @ahadnagy in #39481
Add AMD GPU expectations for LLaVA tests by @ahadnagy in #39486
General weight initialization scheme by @Cyrilvallez in #39579
[cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106
Update docs/source/ko/_toctree.yml by @jungnerd in #39516
updated mistral3 model card by @cassiasamp in #39531
[Paged-Attention] Handle continuous batching for repetition penalty by @kashif in #39457
Torchdec RuntimeError catch by @SunMarc in #39580
Fix link in "Inference server backends" doc by @hmellor in #39589
[WIP] Add OneformerFastImageProcessor by @Player256 in #38343
🎯 Trackio integration by @qgallouedec in #38814
Mask2former & Maskformer Fast Image Processor by @SangbumChoi in #35685
Fix DynamicCache and simplify Cache classes a bit by @Cyrilvallez in #39590
Generic task-specific base classes by @Cyrilvallez in #39584
[Trackio] Allow single-gpu training and monitor power by @qgallouedec in #39595
Rename supports_static_cache to can_compile_fullgraph by @zucchini-nlp in #39505
FP-Quant support by @BlackSamorez in #38696
fix moe routing_weights by @llbdyiu66 in #39581
[idefics3] fix for vLLM by @zucchini-nlp in #39470
enable triton backend on awq xpu by @jiqing-feng in #39443
Allow device_mesh have multiple dim by @S1ro1 in #38949
Fix typos and grammar issues in documentation and code by @cluster2600 in #39598
Fix important models CI by @molbap in #39576
Move openai import by @ebezzam in #39613
Fix DAC integration tests and checkpoint conversion. by @ebezzam in #39313
Feature/standardize opt model card by @JoestarGagan in #39568
standardized YOLOS model card according to template in #36979 by @EthanV431 in #39528
[Docs] Translate audio_classification.md from English to Spanish by @weezymatt in #39513
Update recent processors for vLLM backend by @zucchini-nlp in #39583
[efficientloftr] fix model_id in tests by @sbucaille in #39621
[timm] new timm pin by @gante in #39640
[Voxtral] values for A10 runners by @eustlb in #39605
revert behavior of _prepare_from_posids by @winglian in #39622
Add owlv2 fast processor by @lmarshall12 in #39041
[attention] fix test for packed padfree masking by @zucchini-nlp in #39582
Fix: explicit not none check for tensors in flash attention by @jeffrey-dot-li in #39639
revert change to cu_seqlen_k and max_k when preparing from position_ids by @winglian in #39653
Make pytorch examples UV-compatible by @lhoestq in #39635
[docs] fix ko cache docs by @gante in #39644
make fixup by @gante in #39661
fix(voxtral): correct typo in apply_transcription_request by @rev2607 in #39572
Rename huggingface_cli to hf by @LysandreJik in #39630
🚨[Fast Image Processor] Force Fast Image Processor for Qwen2VL/25_VL + Refactor by @yonigozlan in #39591
Fix ModernBERT Decoder model by @qubvel in #39671
[CI] revert device in test_export_static_cache by @gante in #39662
[Ernie 4.5] Post merge adaptations by @vasqu in #39664
Delete bad rebasing functions by @Cyrilvallez in #39672
Fixes the BC by @ArthurZucker in #39636
fix kyutai tests by @ydshieh in #39416
update expected outputs for whisper after #38778 by @ydshieh in #39304
Add missing flag for CacheLayer by @Cyrilvallez in #39678
Fix auto_docstring crashing when dependencies are missing by @yonigozlan in #39564
fix: HWIO to OIHW by @RyanMullins in #39200
Use auto_docstring for perception_lm fast image processor by @yonigozlan in #39679
bad_words_ids no longer slow on mps by @DWarez in #39556
Support typing.Literal as type of tool parameters or return value by @grf53 in #39633
fix break for ckpt without _tp_plan by @MoyanZitto in #39658
Fix tied weight test by @Cyrilvallez in #39680
Add padding-free to Granite hybrid moe models by @garrett361 in #39677

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@sbucaille
- Update SuperPoint model card (#38896)
- [superglue] fix wrong concatenation which made batching results wrong (#38850)
- [lightglue] add support for remote code DISK keypoint detector (#39253)
- docs: update SuperGlue docs (#39406)
- docs: update LightGlue docs (#39407)
- Add EfficientLoFTR model (#36355)
@yaswanth19
- Cleanup Attention class for Siglip and dependent models (#39040)
- ✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation (#37610)
- 🚨🚨🚨 [eomt] make EoMT compatible with pipeline (#39122)
- Add Aimv2 model (#36625)
- [Tests] Update model_id in AIMv2 Tests (#39281)
@bvantuan
- Fix initialization of OneFormer (#38901)
- Fix key mapping for VLMs (#39029)
- Fix missing initializations for models created in 2024 (#38987)
- Fix missing initializations for models created in 2023 (#39239)
@NahieliV
- add fast image processor nougat (#37661)
@MinJu-Ha
- Add Fast Image Processor for mobileViT (#37143)
@zRzRzRzRzRzRzR
- Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
- Glm 4 doc (#39247)
- GLM-4 Update (#39393)
@simonreise
- Add segmentation_maps support to MobileNetV2ImageProcessor (#37312)
@LoserCheems
- Add Doge model (#35891)
@VladOS95-cyber
- Add DeepSeek V2 Model into Transformers (#36400)
@paulpak58
- LFM2 (#39340)
- [modeling][lfm2] LFM2: Remove deprecated seen_tokens (#39342)
@shuminghu
- PerceptionLM (#37878)
@juliendenize
- Add mistral common support (#38906)
@orionw
- Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! (#38967)
- Update modernbertdecoder docs (#39453)
@cyyever
- Fix invalid property (#39384)
- Defaults to adamw_torch_fused for Pytorch>=2.8 (#37358)
- Remove runtime conditions for type checking (#37340)
- Use newer typing notation (#38934)
- Enable some ruff checks for performance and readability (#39383)
- Fix pylint warnings (#39477)
@jungnerd
- Update docs/source/ko/_toctree.yml (#39516)
@Player256
- [WIP] Add OneformerFastImageProcessor (#38343)
@SangbumChoi
- Mask2former & Maskformer Fast Image Processor (#35685)
@BlackSamorez
- FP-Quant support (#38696)

- Python
Published by LysandreJik 10 months ago

transformers - Patch release v4.53.3

Small patch release 4.53.3!

A small patch for open telemetry fixes! Sorry for the delay!

refactor: remove set_tracer_provider and set_meter_provider calls (https://github.com/huggingface/transformers/pull/39422) from @McPatate

- Python
Published by ArthurZucker 10 months ago

transformers - Ernie-4.5 and Ernie-4.5 MoE (based on v4.53.2)

Two new models are added to transformers: Ernie 4.5, and its MoE variant, Ernie 4.5 MoE. They are added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-Ernie-4.5-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.53.2-Ernie-4.5-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Ernie-4.5 models. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.

Ernie-4.5 and its MoE variant

The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple architectures and model sizes.

The Dense

This model targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.

The MoE

This model targets the base text model with mixture of experts (MoE): one variant with 21B total and 3B active parameters, and another with 300B total and 47B active parameters. It uses the standard Llama architecture at its core, combined with a specialized MoE based on Mixtral with additional shared experts.

Usage example

Ernie-4.5 can be found on the Hugging Face Hub.

Generating text with Ernie:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-0.3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
# keep only the newly generated tokens, dropping the prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
```

See below for an example leveraging the MoE variant:

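```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-21B-A3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input (same steps as in the dense example above)
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
# keep only the newly generated tokens, dropping the prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
```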

- Python
Published by LysandreJik 10 months ago

transformers - ModernBERT Decoder (based on v4.53.2)

A new model is added to transformers: ModernBERT Decoder. It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.53.2-modernbert-decoder-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.

ModernBERT Decoder

Usage example

ModernBERT Decoder can be found on the Hugging Face Hub.

Using pipeline:

```py
import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)

# For sequence classification
classifier = pipeline(
    task="text-classification",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
classifier("This movie is really great!")
```

Using AutoModel:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "The future of artificial intelligence is"
# move the inputs to wherever device_map placed the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

# For sequence classification
from transformers import AutoModelForSequenceClassification

classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
    num_labels=2
)

text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to(classifier_model.device)

with torch.no_grad():
    outputs = classifier_model(**inputs)
    # convert logits to probabilities and pick the most likely class
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")
```

Using the transformers CLI:

echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0

- Python
Published by LysandreJik 11 months ago

transformers - Patch Release v4.53.2

This patch contains the following bug fixes:

Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
[bugfix] fix flash attention 2 unavailable error on Ascend NPU (#39166)
Fix errors when use verl to train GLM4.1v model (#39199)
[pagged-attention] fix off-by-1 error in pagged attention generation (#39258)
[smollm3] add tokenizer mapping for smollm3 (#39271)
[sliding window] revert and deprecate (#39301)
fix Glm4v batch videos forward (#39172)
Add a default value for position_ids in masking_utils (#39310)

- Python
Published by Cyrilvallez 11 months ago

transformers - Patch Release v4.53.1

This patch contains several bug fixes. The following commits are included:

Fix: unprotected import of tp plugin (#39083)
Fix key mapping for VLMs (#39029)
Several fixes for Gemma3n (#39135)
[qwen2-vl] fix FA2 inference (#39121)
[smolvlm] fix video inference (#39147)
Fix multimodal processor get duplicate arguments when receive kwargs for initialization (#39125)
when delaying optimizer creation only prepare the model (#39152)
Add packed tensor format support for flex/sdpa/eager through the mask! (#39194)

- Python
Published by Cyrilvallez 11 months ago

transformers - Release v4.53.0

Release v4.53.0

Gemma3n

Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)
```

Dia

Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Emotion and tone control is also possible via audio conditioning (voice cloning).

Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture, with some more modern features such as rotary positional embeddings (RoPE). The text portion (encoder) uses a byte tokenizer, while the audio portion (decoder) uses a pretrained codec model, DAC, which encodes speech into discrete codebook tokens and decodes them back into audio.

Add Dia model by @buttercrab in #38405

Kyutai Speech-to-Text

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

- kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

Add kyutai stt by @eustlb in #38909

Read more about the model in the documentation.

V-JEPA 2

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

Add V-JEPA 2 by @qubvel in #38746

Read more about the model in the documentation.

Arcee

Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.

The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
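As a quick check that ReLU² and x * relu(x) describe the same activation, the sketch below evaluates both forms on a few scalar values. It uses plain Python floats rather than tensors, and the helper names are illustrative only; they are not taken from the Arcee implementation.

```python
def relu(x: float) -> float:
    """Standard ReLU: pass positive values through, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


def relu_squared(x: float) -> float:
    """ReLU² written as relu(x) ** 2."""
    r = relu(x)
    return r * r


def x_times_relu(x: float) -> float:
    """The same activation written as x * relu(x)."""
    return x * relu(x)


samples = [-2.0, -0.5, 0.0, 0.5, 3.0]
squared = [relu_squared(x) for x in samples]
product = [x_times_relu(x) for x in samples]

# Both forms agree: zero for non-positive inputs, x**2 for positive inputs.
print(squared)
print(squared == product)
```

For negative inputs relu(x) is zero, so both expressions vanish; for positive inputs relu(x) equals x, so both reduce to x².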

Add Arcee model support by @Crystalcareai in #38621

Read more about the model in the documentation.

ColQwen2

ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
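To make the late interaction scoring concrete, here is a minimal sketch assuming the ColBERT-style MaxSim formulation used by ColPali-family retrievers: each query vector is matched to its most similar page vector, and those maxima are summed. The tiny 2-D embeddings and the function names are made up for illustration; real multi-vector embeddings are much larger and come from the model.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, page_vecs):
    """Sum over query vectors of the best-matching page vector similarity."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Hypothetical multi-vector embeddings: one vector per query token / page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.5]]

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)

# Rank pages by score, highest first.
ranking = sorted(
    [("page_a", score_a), ("page_b", score_b)],
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

Because every query vector picks its own best match, a page can score well when different regions of it answer different parts of the query.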

Add ColQwen2 to 🤗 transformers by @tonywu71 in #35778

Read more about the model in the documentation.

MiniMax

MiniMax is a language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock its long-context capabilities, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Using parallel strategies and compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention and Expert Tensor Parallel (ETP), MiniMax's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.

The architecture of MiniMax is briefly described as follows:

Total Parameters: 456B
Activated Parameters per Token: 45.9B
Number of Layers: 80
Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
- Number of attention heads: 64
- Attention head dimension: 128
Mixture of Experts:
- Number of experts: 32
- Expert hidden dimension: 9216
- Top-2 routing strategy
Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
Hidden Size: 6144
Vocab Size: 200,064
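The figures above can be turned into a small worked example. The sketch below builds a per-layer attention schedule from the "one softmax layer after every 7 lightning layers" rule and derives the rotary dimension from the head size. The exact position of the softmax layer inside each group of 8 is an assumption made for illustration, and the function name is hypothetical.

```python
NUM_LAYERS = 80
LIGHTNING_PER_SOFTMAX = 7
HEAD_DIM = 128


def attention_schedule(num_layers, lightning_per_softmax):
    """Return the attention type per layer, assuming the softmax layer
    closes each group of (lightning_per_softmax + 1) layers."""
    period = lightning_per_softmax + 1
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


schedule = attention_schedule(NUM_LAYERS, LIGHTNING_PER_SOFTMAX)
softmax_layers = schedule.count("softmax")
lightning_layers = schedule.count("lightning")

# RoPE is applied to half of each attention head's dimension.
rotary_dim = HEAD_DIM // 2

print(schedule[:8])
print(softmax_layers, lightning_layers, rotary_dim)
```

Under this assumption, the 80 layers split into 10 softmax attention layers and 70 lightning attention layers, and 64 of the 128 head dimensions receive rotary embeddings.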

For more details refer to the release blog post.

Add support for MiniMax's MiniMax-Text-01 by @geetu040 in #35831

Read more about the model in the documentation.

Encoder-Decoder Gemma

T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.

T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we provide a model at ML size (medium large, ~2B in total), which sits between T5 Large and T5 XL.

The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.

Encoder-Decoder Gemma by @bzhangGo in #38332

Read more about the model in the documentation.

GLM-4.1V

The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!

GLM-4.1V Model support by @zRzRzRzRzRzRzR in #38431

Read more about the model in the documentation.

Falcon H1

The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series on this website.

[MODEL] Add Falcon H1 by @younesbelkada in #38249

Read more about the model in the documentation.

LightGlue

The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.

Like SuperGlue, this model matches two sets of local features extracted from two images; its goal is to be faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

The abstract from the paper is the following:

We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL

Add LightGlue model by @sbucaille in #31718

Read more about the model in the documentation.

dots.llm1

The abstract from the report is the following:

Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.

[Model] add dots1 by @redmoe-moutain in #38143

Read more about the model in the documentation.

SmolLM3

SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the kv cache, and no RoPE, enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
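To illustrate why Grouped Query Attention reduces the kv cache, the sketch below compares cache sizes when every query head has its own key/value head against a grouped setup where several query heads share one. All configuration numbers are hypothetical and chosen only to keep the arithmetic simple; they are not the SmolLM3 configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to a 16-bit dtype such as bfloat16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
layers, query_heads, kv_heads, head_dim, seq_len = 32, 16, 4, 128, 8192

mha_bytes = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one kv head per query head
gqa_bytes = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)     # kv heads shared across groups

print(mha_bytes // 1024**2, "MiB without grouping")
print(gqa_bytes // 1024**2, "MiB with grouped kv heads")
```

The cache shrinks by the ratio of query heads to key/value heads, which is a factor of 4 in this made-up configuration.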

Add SmolLM3 by @anton-l in #38755

Read more about the model in the documentation.

Performance optimizations

Kernels

In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_the_hub decorator directly swapped out the model’s forward method. This implicit behavior caused several issues for users — including problems with torch.compile, non-determinism, and inconsistent outputs.

To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use — and kernelize handles the rest under the hood.

Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input = "Hello"
input_ids = tokenizer(input, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

More kernels will be added over time; this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
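The difference between the old implicit swap and the new opt-in step can be pictured with a toy sketch. Everything below is hypothetical and written in plain Python: the registry, the decorator and the `toy_kernelize` function are stand-ins for the idea, not the transformers or kernels API. The decorator only records a kernel name, and nothing changes until the explicit opt-in function is called.

```python
# Toy illustration only: these names are hypothetical and are not the
# transformers or kernels API.
OPTIMIZED_FORWARDS = {"toy_double": lambda self, x: x + x}


def use_kernel_name(name):
    """Decorator that only records which kernel a layer would like to use."""
    def wrap(cls):
        cls.kernel_name = name
        return cls
    return wrap


@use_kernel_name("toy_double")
class ToyLayer:
    def forward(self, x):
        return 2 * x  # reference implementation


def toy_kernelize(layer):
    """Explicit opt-in step: swap in the registered forward, if one exists."""
    fn = OPTIMIZED_FORWARDS.get(getattr(layer, "kernel_name", None))
    if fn is not None:
        layer.forward = fn.__get__(layer)  # bind the function to this instance
    return layer


plain = ToyLayer()                     # decorator alone changes nothing
opted_in = toy_kernelize(ToyLayer())   # forward replaced only after opting in

print(plain.forward(3), opted_in.forward(3))
```

Keeping the swap behind an explicit call means the default code path stays the reference implementation, which is the property the opt-in design is after.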

Add kernelize to transformers by @MekkCyber in #38205

Flash Attention 3

Support for Flash Attention 3 is added across the most popular models.

Support for Flash Attention 3 by @EduardDurech in #38972

Notable repository maintenance & refactors

Several efforts to refactor the repository are happening in parallel. The direction is to greatly simplify the library by removing unnecessary codepaths. While the efforts are spread across the library, they are particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.

We assume that model-agnostic utilities shouldn't be in the modeling code. Things like the output of attentions, hidden states, and router logits are important for end-users but don't need to be explicitly displayed in the modeling code.

Apply GradientCheckpointingLayer to the whole repo by @qubvel in #38913
No more Tuple, List, Dict by @Rocketknight1 in #38797
Deprecate TF + JAX by @Rocketknight1 in #38758

Breaking changes

Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.

🔴 Update default dtype for pipelines to auto by @Vaibhavs10 in #38882
🚨🚨 Fix initialization of Mask2Former by @Cyrilvallez in #38864
🚨🚨 Inherited CausalLM Tests by @Rocketknight1 in #37590
🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
🔴 [VLM] modeling updates by @zucchini-nlp in #38317
🚨🚨 Fix custom code saving by @Rocketknight1 in #37716
🚨🚨[core] Completely rewrite the masking logic for all attentions by @Cyrilvallez in #37866
🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
🔴[Attention] Attention refactor for Whisper-based models by @vasqu in #38235
Add CB by @ArthurZucker in #38085

Bugfixes and improvements

CI reporting improvements by @ydshieh in #38230
Revert parallelism temporarily by @LysandreJik in #38240
tp plan should not be NONE by @ArthurZucker in #38255
[Falcon H1] Fix Typo in Integration Test by @dhiaEddineRhaiem in #38256
[compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
fix multi-image case for llava-onevision by @cyr0930 in #38084
Add tearDown method to Quark to solve OOM issues by @MekkCyber in #38234
Clearer error on import failure by @LysandreJik in #38257
[whisper] small changes for faster tests by @gante in #38236
Simplify DTensor Check for modeling_utils.py by @amd-xiaoyu12 in #38245
Improve typing in TrainingArgument by @cyyever in #36944
Fix: missing else branch to handle "--load_best_model_at_end" in training_args.py by @danielyxyang in #38217
assign the correct torchao data layout for xpu by @jiqing-feng in #37781
Remove Japanese sequence_classification doc and update references by @ritsumei-aoi in #38246
Protect ParallelInterface by @ArthurZucker in #38262
Update Model Card for Mamba by @ParagEkbote in #37863
docs(swin): Update Swin model card to standard format by @BryanBradfo in #37628
add XPU info print in print_env by @yao-matrix in #38282
[whisper] move processor test into processor test file 🧹 by @gante in #38266
[Whisper] handle deprecation of forced_decoder_ids by @gante in #38232
add liger-kernel to docker file by @ydshieh in #38292
Fix tp error when torch distributed is already initialized by @SunMarc in #38294
More typing in src/transformers/training_args.py by @cyyever in #38106
refine transformers env output by @yao-matrix in #38274
Update CI Docker base image for AMD tests by @ahadnagy in #38261
Fix HybridChunedCache & Llama4 by @Cyrilvallez in #38299
Oups typo for HybridChunkedCache by @Cyrilvallez in #38303
[Tests] Cleanup Janus Testcase by @yaswanth19 in #38311
[emu3] fix conversion script by @zucchini-nlp in #38297
Fix run_slow by @cyyever in #38314
Fix typo: change 'env' to 'environment' in .circleci/config.yml by @AbdessamadEnabih in #38273
Adds use_repr to model_addition_debugger_context by @RyanMullins in #37984
[tf/flax] handle forced_decoder_ids deletion by @gante in #38316
[Whisper + beam search] fix usage of beam_indices by @gante in #38259
Expose AutoModelForTimeSeriesPrediction for import by @jinan-zhou in #38307
[custom_generate] don't forward `custom_generate` and `trust_remote_code` by @gante in #38304
add vasqu to self-comment-ci.yml by @ydshieh in #38324
Fix some tests (especially compile with fullgraph=True on Python<3.11) by @Cyrilvallez in #38319
[performance_optim] reduce frequency of declaring attention_mask in Ascend NPU flash attention by @FightingZhen in #38278
refactor can_save_slow_tokenizer by @itazap in #37722
[FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag by @inf3rnus in #36835
Use Gradient Checkpointing Layer in Jamba & Blip Related Models by @alex-jw-brooks in #38310
Never fallback to eager implicitly by @Cyrilvallez in #38327
Remove duplicate docstring: resample by @qqii in #38305
Update BioGPT model card by @Aguedoom in #38214
docs(swinv2): Update SwinV2 model card to new standard format by @BryanBradfo in #37942
[docs]: update roformer.md model card by @KsuParkhamchuk in #37946
new failure CI reports for all jobs by @ydshieh in #38298
Hot fix for AMD CI workflow by @ydshieh in #38349
Uninstall kernels for AMD docker images by @ydshieh in #38354
[VLMs] add helpers for get/set embedding by @zucchini-nlp in #38144
switch to device agnostic device calling for test cases by @yao-matrix in #38247
[OPT] Fix attention scaling by @vasqu in #38290
Fix all import errors based on older torch versions by @Cyrilvallez in #38370
Fix incorrect batching audio index calculation for Phi-4-Multimodal by @Isotr0py in #38103
Protect get_default_device for torch<2.3 by @Cyrilvallez in #38376
[Falcon H1] Fix slow path forward pass by @dhiaEddineRhaiem in #38320
Improved cache docs by @manueldeprada in #38060
for now disable compile by @ArthurZucker in #38383
Use one utils/notification_service.py by @ydshieh in #38379
Better check in initialize_weights by @Cyrilvallez in #38382
fix typos by @DeVikingMark in #38336
fix typo: tokenizer -> tokenize by @foldl in #38357
Stop TF weight rename reDOS by @Rocketknight1 in #38325
[cli] cli usable without torch by @gante in #38386
update gemma tests by @ydshieh in #38384
Stop autoconverting custom code checkpoints by @Rocketknight1 in #37751
Add AMD MI300 CI caller leveraging self-hosted runner scale set workflow in hf-workflows by @jitesh-gupta in #38132
Fix image token mask in Gemma3 by @Cyrilvallez in #38295
[transformers x vLLM] standardize processors by @zucchini-nlp in #37915
[paligemma] fix processor with suffix by @zucchini-nlp in #38365
[video utils] group and reorder by number of frames by @zucchini-nlp in #38374
[aya vision] fix processor for vLLM by @zucchini-nlp in #38371
guard size mismatch check to only quantized models by @SunMarc in #38397
[chat] improvements for thinking models and reduce default verbosity by @gante in #38322
Fix convert to original state dict for VLMs by @hiyouga in #38385
[chat] use the checkpoint's generation_config.json as base parameterization by @gante in #38330
Fix Qwen2.5-VL Video Processor by @yeliudev in #38366
[CSM] infer codec model with no_grad + audio eos label by @eustlb in #38215
Add report_repo_id to mi300 workflow by @ivarflakstad in #38401
[CSM] update model id by @eustlb in #38211
[cleanup] delete deprecated kwargs in qwen2_audio 🧹 by @gante in #38404
[tests] remove overload for deleted test (test_offloaded_cache_implementation) by @gante in #37896
[mllama] Allow pixel_values with inputs_embeds by @dxoigmn in #38334
Update Model Card for Mamba-2 by @ParagEkbote in #37951
Updated Zoedepth model card by @miniMaddy in #37898
Updated BigBird Model card as per #36979. by @RogerSinghChugh in #37959
Updated BERTweet model card. by @RogerSinghChugh in #37981
New bart model card by @RogerSinghChugh in #37858
Update granite.md by @Tanuj-rai in #37791
Falcon-H1 - Fix auto_docstring and add can_return_tuple decorator by @yonigozlan in #38260
Updated model card for OLMo2 by @andyvu923 in #38394
Add mi300 to amd daily ci workflows definition by @ivarflakstad in #38415
Change slack channel for mi250 CI by @ivarflakstad in #38410
Fix an error in verify_tp_plan for keys without '.' by @liwii in #38420
[qwen-vl] Look for vocab size in text config by @zucchini-nlp in #38372
Update CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
enable large_gpu and torchao cases on XPU by @yao-matrix in #38355
Disable mi210 scheduled CI by @ivarflakstad in #38411
Update error when using additional and/or masks by @Cyrilvallez in #38429
Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers by @ydshieh in #38413
make Llama4TextMoe forward more readable by @JJJYmmm in #37529
[core] support tensor-valued _extra_state values in from_pretrained by @pstjohn in #38155
Fix typo in tokenization_utils_base.py docstring by @cwngan in #38418
Fix convert weights for InternVL by @yonigozlan in #38233
Trigger doc-builder job after style bot by @ydshieh in #38398
Remove redundant test_sdpa_equivalence test by @Rocketknight1 in #38436
Fix MoE gradient test by @Rocketknight1 in #38438
Fix from_args_and_dict ProcessorMixin by @yonigozlan in #38296
Fix handling of slow/fast image processors in image_processing_auto.py by @yonigozlan in #38161
Updated the Model docs - for the ALIGN model by @1himan in #38072
Updated the model card for ViTMAE by @mreraser in #38302
Model card for mobilenet v1 and v2 by @yuanjua in #37948
Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
Fix GLM4 checkpoints by @ydshieh in #38412
feat: add cache retention for requests by @McPatate in #38446
[Tests] Clean up test cases for few models by @yaswanth19 in #38315
Fix TypeError in save_pretrained error handling (fixes #38422) by @rahulrshetty45 in #38449
Cleanup BatchFeature and BatchEncoding by @lgeiger in #38459
Fix Gemma3IntegrationTest by @ydshieh in #38471
[Qwen2.5-Omni] Fix dtype of cos,sin when used with flash attention by @HarryHsing in #38453
fix: handle no scheduler passed by user by @McPatate in #38407
make it go brrrr by @ArthurZucker in #38409
Fix convert_internvl_weights_to_hf.py to support local paths by @xvyv99 in #38264
Fix incorrect bbox_embed initialization when decoder_bbox_embed_share=False in GroundingDINO by @islemyakoubi in #38238
[Tests] Reduced model size for albert-test model by @saqlain2204 in #38480
Align TP check by @SunMarc in #38328
protect dtensor import by @SunMarc in #38496
[docs] add xpu environment variable for gpu selection by @faaany in #38194
Remove deprecated use_flash_attention_2 parameter by @cyyever in #37131
Fix setting FLASH_ATTENTION_DETERMINISTIC after importing by @HollowMan6 in #37185
[seamless_m4t] Skip some tests when speech is not available by @remi-or in #38430
Update Loss Functions to Accept Tensor num_items_in_batch by @NEREUScode in #38029
[generate] add soft deprecations on custom generation methods by @gante in #38406
[generate] move SinkCache to a custom_generate repo by @gante in #38399
remove unhandled parameter by @itazap in #38145
Fix amp deprecation issue by @SunMarc in #38100
[flax/mistral] support sliding_window: null in config by @yiding in #37402
Num parameters in model.safetensors.index.json by @LysandreJik in #38531
Remove type annotation in Siglip Attention Module by @yaswanth19 in #38503
Fix Gemma2IntegrationTest by @ydshieh in #38492
Fix blip2 tests by @ydshieh in #38510
[tests] expand flex-attn test for vision models by @zucchini-nlp in #38434
Don't use default attn if pre-set in sub-config by @zucchini-nlp in #38526
update emu3 test by @jiqing-feng in #38543
Update docker image to use av by @ydshieh in #38548
[bugfix] [WIP] fix apply_rotary_emb error on Ascend NPU by @FightingZhen in #38491
[TP] Change command in tests to python3 by @S1ro1 in #38555
Explicitly setting encoding in tokenization_utils_base.py by @Muqi1029 in #38553
Fix utils/notification_service.py by @ydshieh in #38556
Name change AOPermod -> ModuleFqn by @drisspg in #38456
Fix hqq issue by @SunMarc in #38551
[docs] Format fix by @stevhliu in #38414
[janus] Fix failing tests on mi3XX by @remi-or in #38426
Fix chameleon tests by @ydshieh in #38565
update utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
Fix deepseekv3 by @ydshieh in #38562
[FlexAttn] Fix models with unique characteristics by @vasqu in #38433
fix(attention_visualizer): add default value for image_seq_length by @IceGiraffe in #38577
allow custom head_dim for qwen2_moe by @bzantium in #37188
Docs: fix code formatting in torchao docs by @Manalelaidouni in #38504
feat: add repository field to benchmarks table by @McPatate in #38582
[Dinov2] Enable device_map="auto" support by @aryanchauhan31 in #38487
tests/roformer: fix couple roformer tests on gpus by @dvrogozh in #38570
New gpt neo model card by @RogerSinghChugh in #38505
Updated deprecated typing imports with equivalents for Python 3.9+ by @Sai-Suraj-27 in #38546
added fast image processor for ZoeDepth and expanded tests accordingly by @henrikm11 in #38515
[qwen-omni] fix sliding window by @zucchini-nlp in #38525
Remove custom pytest and pluggy by @ydshieh in #38589
pin pandas by @ydshieh in #38605
Allow mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
Avoid overwrite existing local implementation when loading remote custom model by @Isotr0py in #38474
fix spelling errors by @davidjsonn in #38608
Remove isort from dependencies by @Sai-Suraj-27 in #38616
Fix return_dict=False giving errors in a few VLM models by @ydshieh in #38519
docs: fix dark mode logo display. by @johncaged in #38586
Fix typo in LLaVa documentation by @mynameismon in #38618
[Nit] Add Note on SigOpt being in Public Archive Mode by @ParagEkbote in #38610
Updated Aria model card by @1himan in #38472
Fix MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
enable more test cases on xpu by @yao-matrix in #38572
Improve test_initialization by @ydshieh in #38607
Use torch 2.7.1 on CircleCI jobs by @ydshieh in #37856
[generation] bring back tests on vision models by @zucchini-nlp in #38603
update ColQwen2ModelIntegrationTest by @ydshieh in #38583
Improve test_initialization for SwiftFormer by @ydshieh in #38636
fix: support grad clipping for TP through replicating non-sharded modules by @kmehant in #36132
Don't run AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
fix total batch size calculation in trainer by @inkcherry in #38286
fix torch_dtype on awq by @jiqing-feng in #38463
Better CI by @ydshieh in #38552
remove ipex_optimize_model usage by @yao-matrix in #38632
Skip torchscript tests for 2 models by @ydshieh in #38643
Fix InternVL integration test by @ydshieh in #38612
Use torch 2.7.1 on daily CI by @ydshieh in #38620
Fix qwen2-audio chat template audio placeholder insertion by @Isotr0py in #38640
Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable by @sbucaille in #38664
fix: "check out" as verb by @DePasqualeOrg in #38678
Fix attention mask expansion when converting to executorch by @pweglik in #38637
Fix some models import by @nicelulu in #38694
Fix retrieve function signature and remove faiss requirement by @Fiona-Waters in #38624
Fix TypeError: 'NoneType' object is not iterable for esm by @dbleyl in #38667
Docs: update bitsandbytes torch.compile compatibility by @matthewdouglas in #38651
Drop as_target_processor from the __call__ and pad methods by @marcndo in #38642
Created model card for XLM model by @AshAnand34 in #38595
Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout by @AshAnand34 in #38596
Created model card for xlm-roberta-xl by @AshAnand34 in #38597
Fix aya_vision test by @ydshieh in #38674
Standardize ByT5 model card format by @yanamis in #38699
Fix smart resize by @rdonggroq in #38706
Update some tests for torch 2.7.1 by @ydshieh in #38701
Logging message for is_bitsandbytes_available() by @ved1beta in #38528
Fix llava tests by @ydshieh in #38722
Use OSError by @cyyever in #38712
[add-new-model-like] Robust search & proper outer '),' in tokenizer mapping by @alexzms in #38703
Fix typo in Language Modeling example scripts and update TPU type by @framoncg in #38652
Add AGENTS.md by @Rocketknight1 in #38734
New canine model card by @RogerSinghChugh in #38631
Fixed a multiple-devices issue in SmolVLM model by @remi-or in #38736
[llava] fix integration tests with Siglip by @zucchini-nlp in #38732
fix: Add method to get image features in PaliGemmaForConditionalGeneration by @YushunXiang in #38730
from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim by @yao-matrix in #38689
fix: bf16 with TPU is allowed in configuration by @yevvonlim in #38670
[DeepSeek-V3] implement when q_lora_rank is None by @bzantium in #38743
Revert "Trigger doc-builder job after style bot" by @ydshieh in #38735
Add z-loss to Bamba for v2 by @daviswer in #37842
Better typing for num_items_in_batch by @SunMarc in #38728
Prepare for TF+Jax deprecation by @Rocketknight1 in #38760
Remove IPEX requirement for bitsandbytes on CPU by @matthewdouglas in #38594
Update repo consistency check by @Rocketknight1 in #38763
fix(qwen3_moe): pass kwargs to self_attn by @llllvvuu in #38691
Update pegasus model card by @dross20 in #38675
Make style bot trigger CI after push by @ydshieh in #38754
chore(pixtral): emit block attention mask when using flash attention by @starcatmeow in #38741
Update altCLIP model card by @EmileAydar in #38306
Add Qwen2 MoE model card by @rileyafox in #38649
[masking utils] check None instead of try/except by @zucchini-nlp in #38561
[Hotfix] Fix style bot by @ydshieh in #38779
Fix masking utils by @Cyrilvallez in #38783
[video processors] support frame sampling within processors by @zucchini-nlp in #38105
Skip some export tests on torch 2.7 by @ydshieh in #38677
Reduce verbosity for average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #38770
[docs] Add int4wo + 2:4 sparsity example to TorchAO README by @jcaip in #38592
Fix qwen_2_5 omni by @ydshieh in #38658
Fix llava_onevision tests by @ydshieh in #38791
Reword README in light of model definitions by @LysandreJik in #38762
Fix Typos in Comments: "quantitation" → "quantization", "averege" → "average" by @leopardracer in #38766
Initialize flash attn flag by @farnasirim in #38768
Fix mllama by @ydshieh in #38704
build: :pushpin: Remove upper bound on PyTorch by @KyleMylonakisProtopia in #38789
Remove all traces of low_cpu_mem_usage by @Cyrilvallez in #38792
[Docs] New DiT model card by @yushi2006 in #38721
Add missing div in Pegasus model card by @dross20 in #38773
Updated moonshine modelcard by @SohamPrabhu in #38711
refactor create_token_type_ids_from_sequences by @itazap in #37681
[docs] update cache docs with new info by @zucchini-nlp in #38775
Fix erroneous docstring for the ordering of SWA layers by @norpadon in #38794
Fix configs and doc for the Qwens by @Cyrilvallez in #38808
Unbreak optimum-executorch by @guangy10 in #38646
Disable custom MRA kernels for ROCm by @ahadnagy in #38738
Use HF papers by @qgallouedec in #38184
Simplify and update trl examples by @qgallouedec in #38772
Better pipeline type hints ✨ by @qubvel in #38049
Fix llava_next tests by @ydshieh in #38813
Expectation fixes and added AMD expectations by @remi-or in #38729
Use wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
Refactor DBRX tests to use CausalLMModelTest base classes by @Rocketknight1 in #38475
change fsdp_strategy to fsdp in TrainingArguments in accelerate doc by @PT-10 in #38807
Fix a minor security issue by @ydshieh in #38815
Fix trainer.py not showing signature columns by @nenesekai in #38465
Add V-JEPA for video classification model by @qubvel in #38788
fixed docstring in modular_qwen2_5_vl.py by @lawrencefeng17 in #38798
[docs] Update docs moved to the course by @stevhliu in #38800
[docs] updated roberta model card by @allmight05 in #38777
Updated Albert model Card by @souvikchand in #37753
[internvl] fix video inference by @zucchini-nlp in #38811
Fix redundant code in Janus by @yaswanth19 in #38826
bugfix: propage weight key_mapping to peft to fix 3.52 VLM renaming by @ManuelFay in #38627
Fix peft integration by @Cyrilvallez in #38841
Fix broken notebooks link in Italian training docs by @VolodymyrBg in #38834
Fix broken tag in Longformer model card by @dross20 in #38828
[BugFix] QA pipeline edge case: align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
GraniteMoeHybrid: Allow for only shared expert case. by @shawntan in #38801
Updated aya_vision.md by @1himan in #38749
Remove merge conflict artifacts in Albert model doc by @druvdub in #38849
[video processor] fix BC when no video config if found by @zucchini-nlp in #38840
Fix incorrect width ratio calculation in Llama4 image processor by @Jingxiang-Zhang in #38842
Allow customization of sdpa in executorch.py by @kimishpatel in #38827
Fix qwen2_5_vl tests by @ydshieh in #38845
Improve auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
Fix qwen3 tests by @ydshieh in #38862
Update CvT documentation with improved usage examples and additional … by @sezan92 in #38731
Update roc bert docs by @SohamPrabhu in #38835
Post-PR fixes! by @Rocketknight1 in #38868
enable misc test cases on XPU by @yao-matrix in #38852
Fix phi4_multimodal tests by @ydshieh in #38816
Fix qwen3_moe tests by @ydshieh in #38865
Fix HQQ model param device transfer issue by @HighCWu in #38466
Fixed markdown for BertTokenizer's '[CLS]' token. by @eu90h in #38506
null deepspeed_plugin in args for wandb callback fake trainer by @winglian in #38867
More PYUP fixes by @cyyever in #38883
Fix loop var naming by @Rocketknight1 in #38885
[bugfix] fix ATTN_MASK_NPU device mismatch error on multi-device NPU … by @qykong in #38876
log: Add logging when using split_batches and per_device_train_batch_size by @KeshavSingh29 in #38633
Docs: Add custom fine-tuning tutorial to TrOCR model page by @Ashutosh-4485 in #38847
36978 | Fast image processor for DPT model by @samrae7 in #37481
[video processor] fix slow tests by @zucchini-nlp in #38881
Update bamba model card by @druvdub in #38853
Add support for specifying revisions when pushing to Hub via internal Trainer call by @IsaacBreen in #36852
Use raise from e in hub.py utility by @Wauplin in #37241
[phi-4] use mel filters from audio utils by @eustlb in #36966
Fix fsmt tests by @ydshieh in #38904
Fix unnecessary super calls by @cyyever in #38897
align xpu's autocast behavior w/ cuda by using device agnostic torch APIs by @yao-matrix in #38284
Fix FalconMambaIntegrationTests by @ydshieh in #38566
Skip sdpa tests if submodule does not support sdpa by @ivarflakstad in #38907
Fix ReDOS in tokenizer digit substitution by @Rocketknight1 in #38844
feat: Add granite architectures to auto tokenizer name mappings by @gabe-l-hart in #38802
Allow make-fixup on main branch, albeit slowly by @Rocketknight1 in #38892
feat: add flexible Liger Kernel configuration to TrainingArguments by @hamza-hcompany in #38911
Remove deprecated classes in modeling_utils.py by @Cyrilvallez in #38919
Skip some tests for now by @ydshieh in #38931
Modernbert fixes by @remi-or in #38912
add pytorch-xpu Dockerfile by @yao-matrix in #38875
Remove ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
[static cache] fix device map per layer in VLMs by @zucchini-nlp in #38488
Add kwargs for timm.create_model in TimmWrapper by @qubvel in #38860
Pin PyTorch extras for AMD containers by @ahadnagy in #38941
Correctly raise error for awq quantization by @Cyrilvallez in #38945
Fix more flaky test_initialization by @ydshieh in #38932
Switch to use A10 progressively by @ydshieh in #38936
Fix custom generate from local directory by @manueldeprada in #38916
Update blip model card by @devkade in #38513
Gaudi3 CI by @IlyasMoutawwakil in #38790
Fix DTensor import compatibility for PyTorch < 2.5 by @Benoqtr in #38836
Fix(informer): Correct tensor shape for input_size=1 by @Flink-ddd in #38856
[modular] CLI allows positional arguments, and more defaults names for the optional arg by @Cyrilvallez in #38979
Remove dead protected imports by @Cyrilvallez in #38980
Break tie in Expectations and gemma3 fixes by @remi-or in #38943
Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors by @yonigozlan in #38157
fix: add bool operator to tokenizer to avoid bloated asserts by @kallewoof in #38899
Add support for auto_docstring with model outputs by @yonigozlan in #38242
fix mistral and mistral3 tests by @ydshieh in #38978
[Feature] Support is_split_into_words in the TokenClassificationPipeline. by @yushi2006 in #38818
Fix rag by @ydshieh in #38585
[docs] Typos - Single GPU efficient training features by @casinca in #38964
[qwen] refactor attentions for vision/audio by @zucchini-nlp in #38930
Removing extra space in large command for speech-pretraining example by @dggaytan in #38705
[Attention] Small fix on output attentions by @vasqu in #38948
Fixes for Arcee model by @Cyrilvallez in #39001
Added scikit-learn to the example image-classification requirements.txt by @mylonjones in #37506
Update attention_visualizer.py by @Tanuj-rai in #37860
Skip non-selected experts for qwen3_moe by @seven-mile in #38133
Fix undeterministic order in modular dependencies by @Cyrilvallez in #39005
Granite speech - minor fixes to support training with the HF trainer by @avihu111 in #38833
Fix bugs in DynamicCache by @tugsbayasgalan in #37880
Update self-comment-ci.yml user list by @ivarflakstad in #39014
Skip sdpa dispatch on flash test due to unsupported head dims by @ivarflakstad in #39010
[HPU][Critical Issue Fix] ThreadPool instead of Pool for parallel pre-processing by @dsmertin in #39002
Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code,… by @marcndo in #38954
[LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim by @sbucaille in #39021
Add zero dim tensor check when using flash_attention by @ranzhejiang in #38280
Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1 by @efsotr in #37332
[AutoModelForMaskGeneration] Remove duplicate code by @NielsRogge in #38622
[video processor] support torchcodec and decrease cuda memory usage by @zucchini-nlp in #38880
Drop unnecessary tokens in GPT2Model generation by @null-pointer-access in #39016
Fix the seamless_m4t cannot work on Gaudi by @yuanwu2017 in #38363
fix: astronomical loss with ModernBERT when using gradient checkpointing by @umarbutler in #38982
fix gemma3 grad acc by @SunMarc in #37208
Remove script datasets in tests by @lhoestq in #38940
Fix grammatical error in models documentation by @marcndo in #39019
refactor: remove custom BarkLayerNorm by @eginhard in #39003
[Kyutai-STT] correct model type + model id by @eustlb in #39035
Two ReDOS fixes by @Rocketknight1 in #39013
[tests] remove TF tests (uses of require_tf) by @gante in #38944
Granite speech speedup + model saving bugfix by @avihu111 in #39028
Fix Bad Outputs in Fast Path for GraniteMoeHybrid by @alex-jw-brooks in #39033

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@ydshieh
- CI reporting improvements (#38230)
- add liger-kernel to docker file (#38292)
- add vasqu to self-comment-ci.yml (#38324)
- new failure CI reports for all jobs (#38298)
- Hot fix for AMD CI workflow (#38349)
- Uninstall kernels for AMD docker images (#38354)
- Use one utils/notification_service.py (#38379)
- update gemma tests (#38384)
- Update CsmForConditionalGenerationIntegrationTest (#38424)
- Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers (#38413)
- Trigger doc-builder job after style bot (#38398)
- Fix GLM4 checkpoints (#38412)
- Fix Gemma3IntegrationTest (#38471)
- Fix Gemma2IntegrationTest (#38492)
- Fix blip2 tests (#38510)
- Update docker image to use av (#38548)
- Fix utils/notification_service.py (#38556)
- Fix chameleon tests (#38565)
- update utils/notification_service.py for AMD vs Nvidia (#38563)
- Fix deepseekv3 (#38562)
- Remove custom pytest and pluggy (#38589)
- pin pandas (#38605)
- Fix return_dict=False giving errors in a few VLM models (#38519)
- Improve test_initialization (#38607)
- Use torch 2.7.1 on CircleCI jobs (#37856)
- update ColQwen2ModelIntegrationTest (#38583)
- Improve test_initialization for SwiftFormer (#38636)
- Don't run AriaForConditionalGenerationModelTest on CircleCI (#38615)
- Better CI (#38552)
- Skip torchscript tests for 2 models (#38643)
- Fix InternVL integration test (#38612)
- Use torch 2.7.1 on daily CI (#38620)
- Fix aya_vision test (#38674)
- Update some tests for torch 2.7.1 (#38701)
- Fix llava tests (#38722)
- Revert "Trigger doc-builder job after style bot" (#38735)
- Make style bot trigger CI after push (#38754)
- [Hotfix] Fix style bot (#38779)
- Skip some export tests on torch 2.7 (#38677)
- Fix qwen_2_5 omni (#38658)
- Fix llava_onevision tests (#38791)
- Fix mllama (#38704)
- Fix llava_next tests (#38813)
- Fix a minor security issue (#38815)
- Fix qwen2_5_vl tests (#38845)
- Fix qwen3 tests (#38862)
- Fix phi4_multimodal tests (#38816)
- Fix qwen3_moe tests (#38865)
- Fix fsmt tests (#38904)
- Fix FalconMambaIntegrationTests (#38566)
- Skip some tests for now (#38931)
- Fix more flaky test_initialization (#38932)
- Switch to use A10 progressively (#38936)
- fix mistral and mistral3 tests (#38978)
- Fix rag (#38585)
@ArthurZucker
- tp plan should not be NONE (#38255)
- Protect ParallelInterface (#38262)
- Add CB (#38085)
- 🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong (#38288)
- for now disable compile (#38383)
- make it go brrrr (#38409)
@younesbelkada
- [MODEL] Add Falcon H1 (#38249)
@cyr0930
- fix multi-image case for llava-onevision (#38084)
@cyyever
- Improve typing in TrainingArgument (#36944)
- More typing in src/transformers/training_args.py (#38106)
- Fix run_slow (#38314)
- Remove deprecated use_flash_attention_2 parameter (#37131)
- Use OSError (#38712)
- More PYUP fixes (#38883)
- Fix unnecessary super calls (#38897)
@ritsumei-aoi
- Remove Japanese sequence_classification doc and update references (#38246)
@yao-matrix
- add XPU info print in print_env (#38282)
- refine transformers env output (#38274)
- switch to device agnostic device calling for test cases (#38247)
- enable large_gpu and torchao cases on XPU (#38355)
- enable more test cases on xpu (#38572)
- remove ipex_optimize_model usage (#38632)
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
- enable misc test cases on XPU (#38852)
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs (#38284)
- add pytorch-xpu Dockerfile (#38875)
@vasqu
- 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108)
- [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
- [OPT] Fix attention scaling (#38290)
- 🔴[Attention] Attention refactor for Whisper-based models (#38235)
- [FlexAttn] Fix models with unique characteristics (#38433)
- [Attention] Small fix on output attentions (#38948)
@itazap
- refactor can_save_slow_tokenizer (#37722)
- remove unhandled parameter (#38145)
- refactor create_token_type_ids_from_sequences (#37681)
@eustlb
- [CSM] infer codec model with no_grad + audio eos label (#38215)
- [CSM] update model id (#38211)
- [phi-4] use mel filters from audio utils (#36966)
- Add kyutai stt (#38909)
- [Kyutai-STT] correct model type + model id (#39035)
@RogerSinghChugh
- Updated BigBird Model card as per #36979. (#37959)
- Updated BERTweet model card. (#37981)
- New bart model card (#37858)
- New gpt neo model card (#38505)
- New canine model card (#38631)
@1himan
- Updated the Model docs - for the ALIGN model (#38072)
- Updated Aria model card (#38472)
- Updated aya_vision.md (#38749)
@Avasam
- Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)
@remi-or
- [seamless_m4t] Skip some tests when speech is not available (#38430)
- [janus] Fix failing tests on mi3XX (#38426)
- Fixed a multiple-devices issue in SmolVLM model (#38736)
- Expectation fixes and added AMD expectations (#38729)
- Modernbert fixes (#38912)
- Break tie in Expectations and gemma3 fixes (#38943)
@tonywu71
- Add ColQwen2 to 🤗 transformers (#35778)
@geetu040
- Add support for MiniMax's MiniMax-Text-01 (#35831)
- Fix MiniMax (docs and integration tests checkpoint) (#38575)
@sbucaille
- Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable (#38664)
- Add LightGlue model (#31718)
- [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim (#39021)
@samrae7
- 36978 | Fast image processor for DPT model (#37481)
@Crystalcareai
- Add Arcee model support (#38621)
@zRzRzRzRzRzRzR
- GLM-4.1V Model support (#38431)
@bzhangGo
- Encoder-Decoder Gemma (#38332)
@redmoe-moutain
- [Model] add dots1 (#38143)
@EduardDurech
- Support for Flash Attention 3 (#38972)

- Python
Published by LysandreJik 11 months ago

transformers - Kyutai-STT (based on v4.52.4)

A new model is added to transformers: Kyutai-STT. It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.52.4-Kyutai-STT-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

Kyutai-STT

Usage example

Kyutai-STT can be found on the Hugging Face Hub.

Inference

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```

Batched Inference

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```

- Python
Published by LysandreJik 11 months ago

transformers - V-JEPA 2 (based on v4.52.4)

A new model is added to transformers: V-JEPA 2. It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-VJEPA-2-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the VJEPA-2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

VJEPA-2

The abstract from the technical report is the following:

Usage example

V-JEPA 2 can be found on the Hugging Face Hub. It is intended to produce representations of any video (or image) for video classification, retrieval, or use as a video encoder for VLMs.

The snippet below shows how to load the V-JEPA 2 model using the AutoModel class.

```py
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

from transformers import AutoModel, AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
    "facebook/vjepa2-vitl-fpc64-256",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # choosing some frames. here, you can define more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)

# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state

# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state
```
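The snippet above takes the first 64 frames of the clip. As one possible "more complex sampling strategy", the sketch below spreads the indices evenly over the whole video. `sample_frame_indices` is a hypothetical helper written for illustration, not part of transformers or torchcodec, and it assumes the total frame count of the clip is already known.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_samples: int = 64) -> np.ndarray:
    """Return `num_samples` frame indices spread evenly over a clip.

    Hypothetical helper for illustration; not a transformers or torchcodec API.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # linspace includes both endpoints, so the first and last frames are covered.
    positions = np.linspace(0, total_frames - 1, num=num_samples)
    # Round to the nearest valid frame; clips shorter than num_samples repeat frames.
    return np.round(positions).astype(np.int64)


# Example: a 300-frame clip (assumed length) reduced to 64 indices.
frame_idx = sample_frame_indices(total_frames=300, num_samples=64)
print(frame_idx[:5], frame_idx[-1])
```

The resulting indices can stand in for `np.arange(0, 64)` when selecting frames; convert them with `.tolist()` if a plain list of integers is preferred.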

- Python
Published by LysandreJik 12 months ago

transformers - ColQwen2 (based on v4.52.4)

A new model is added to transformers: ColQwen2. It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-ColQwen2-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.52.4-ColQwen2-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the ColQwen2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

ColQwen2

Usage example

ColQwen2 can be found on the Hugging Face Hub.

```python
import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```

If you have issues loading the images with PIL, you can use the following code to create dummy images:

images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
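For intuition about the scoring step in the usage example: ColQwen2 follows the ColPali/ColBERT late-interaction approach, in which each query and each page is represented by a set of token-level vectors rather than a single vector. The sketch below is a NumPy illustration of MaxSim scoring under that assumption. `maxsim_scores` is a hypothetical name, and the library's `score_retrieval` may differ in batching, padding, and dtype handling.

```python
import numpy as np


def maxsim_scores(query_embs, doc_embs):
    """Late-interaction (MaxSim) scores between queries and documents.

    query_embs: list of arrays shaped (num_query_tokens, dim)
    doc_embs:   list of arrays shaped (num_doc_tokens, dim)
    Returns an array shaped (num_queries, num_docs).
    Hypothetical helper for illustration only.
    """
    scores = np.zeros((len(query_embs), len(doc_embs)))
    for i, q in enumerate(query_embs):
        for j, d in enumerate(doc_embs):
            sim = q @ d.T  # (num_query_tokens, num_doc_tokens) dot products
            # Keep the best-matching document token for each query token,
            # then sum those maxima over the query tokens.
            scores[i, j] = sim.max(axis=1).sum()
    return scores


# Random stand-ins for model embeddings (two queries, two pages, dim 8).
rng = np.random.default_rng(0)
toy_queries = [rng.normal(size=(5, 8)), rng.normal(size=(7, 8))]
toy_pages = [rng.normal(size=(20, 8)), rng.normal(size=(30, 8))]
print(maxsim_scores(toy_queries, toy_pages).shape)  # (2, 2)
```

Because every query token is matched against every page token, the score rewards pages that contain a good match for each part of the query, which is why the output is a query x image matrix.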

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize the weights to int4.

```python
import requests
import torch
from PIL import Image

from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0-hf"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Same document page screenshots as in the previous example
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```
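To make "lower precision" concrete, here is a back-of-the-envelope sketch of weight-only memory use at different bit widths. The parameter count is an illustrative assumption, not a measured figure for ColQwen2, and real 4-bit checkpoints also store quantization constants and may keep some layers in higher precision, so actual savings are somewhat smaller.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 2_000_000_000  # assumed 2B parameters, for illustration only

for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{name:>12}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to roughly a quarter, before accounting for activations and quantization overhead.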

- Python
Published by LysandreJik 12 months ago

transformers - Patch release: v4.52.4

The following commits are included in that patch release:

[qwen-vl] Look for vocab size in text config (#38372)
Fix convert to original state dict for VLMs (#38385)
[video utils] group and reorder by number of frames (#38374)
[paligemma] fix processor with suffix (#38365)
Protect get_default_device for torch<2.3 (#38376)
[OPT] Fix attention scaling (#38290)

- Python
Published by LysandreJik about 1 year ago

transformers - Patch release v4.52.3

Patch release v4.52.3

We had to protect the imports again after a series of bad events. Here are the two PRs for the patch:

- Fix tp error when torch distributed is already initialized (#38294) by @SunMarc
- Protect ParallelInterface (#38262) by @ArthurZucker and @LysandreJik

- Python
Published by ArthurZucker about 1 year ago

transformers - Patch release v4.52.2

Patch release v4.52.2

We had to revert #37877 because of a missing flag that was overriding the device map. We re-introduced the changes because they allow native 3D parallel training in Transformers. Sorry everyone for the troubles! 🤗

Clearer error on import failure (#38257) by @LysandreJik
Verified tp plan should not be NONE (#38255) by @NouamaneTazi and @ArthurZucker

- Python
Published by Cyrilvallez about 1 year ago

transformers - v4.52.1: Qwen2.5-Omni, SAM-HQ, GraniteMoeHybrid, D-FINE, CSM, BitNet, LlamaGuard, TimesFM, MLCD, Janus, InternVL

New models

Qwen2.5-Omni

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.

Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.

In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.

Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

SAM-HQ

SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.

The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.

SAM-HQ introduces several key improvements over the original SAM model:

High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy

The abstract from the paper is the following:

The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.

Tips:

SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
The model predicts binary masks with more accurate boundaries and better handling of thin structures
Like SAM, the model performs better with input 2D points and/or input bounding boxes
You can prompt multiple points for the same image and predict a single high-quality mask
The model maintains SAM's zero-shot generalization capabilities
SAM-HQ only adds ~0.5% additional parameters compared to SAM
Fine-tuning the model is not supported yet

GraniteMoeHybrid

The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.

D-FINE

The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu

The abstract from the paper is the following:

We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.

CSM

The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.

Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
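
The two-decoder split described above can be pictured with a small sketch. It is not the CSM implementation: `generate_frame`, `toy_backbone_step`, and `toy_depth_step` are hypothetical names, the decoders are replaced by toy functions, and the assumption that the depth decoder conditions on the tokens already chosen for the frame is ours rather than something stated in these notes.

```python
def generate_frame(backbone_step, depth_step, context, num_codebooks):
    """Produce one audio frame as a list of `num_codebooks` codebook tokens."""
    # The backbone decoder predicts the first codebook token from the context.
    frame = [backbone_step(context)]
    # The depth decoder fills in the remaining codebooks one at a time
    # (assumed here to condition on the tokens already chosen for this frame).
    while len(frame) < num_codebooks:
        frame.append(depth_step(context, frame))
    return frame


# Hypothetical toy stand-ins so the sketch runs without any model.
def toy_backbone_step(context):
    return sum(context) % 10


def toy_depth_step(context, frame):
    return (frame[-1] + 1) % 10


frame = generate_frame(toy_backbone_step, toy_depth_step, context=[3, 1, 4], num_codebooks=4)
print(frame)
```

In the real model, each completed frame of codebook tokens would then be decoded back into audio by the Mimi codec.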

The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.

BitNet

Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).

LlamaGuard

Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It’s a dense 12B model pruned from Llama 4 Scout model, and it can run on a single GPU (24 GBs of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
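
The moderation flow described above (check the prompt, generate, then check the response) can be sketched in plain Python. Everything here is illustrative: `moderated_reply`, `toy_is_safe`, and `toy_generate` are hypothetical names, and a keyword filter stands in for the safety model so the example runs without downloading anything.

```python
def moderated_reply(prompt, generate, is_safe, refusal="Sorry, I can't help with that."):
    """Screen the prompt, generate a reply, then screen the reply."""
    # Stage 1: analyze the prompt before it reaches the model.
    if not is_safe(prompt):
        return refusal
    reply = generate(prompt)
    # Stage 2: review the generated response before returning it.
    if not is_safe(reply):
        return refusal
    return reply


# Hypothetical stand-ins: a keyword filter instead of a safety model,
# and an echo function instead of a large language model.
def toy_is_safe(text):
    return "forbidden" not in text.lower()


def toy_generate(prompt):
    return f"You said: {prompt}"


print(moderated_reply("hello", toy_generate, toy_is_safe))
print(moderated_reply("something forbidden", toy_generate, toy_is_safe))
```

Passing the classifier as a callable keeps the two checks symmetric, so the same safety model can guard both sides of the conversation.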

TimesFM

TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and autoregressively predicts output patches of a set length.

The abstract from the paper is the following:

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
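
To make the patching idea concrete, here is a minimal sketch in plain Python. It is not the TimesFM code: `split_into_patches`, `autoregressive_forecast`, and `repeat_last_patch` are hypothetical names, the "model" is a stand-in that repeats the last patch, and dropping an incomplete trailing patch is a simplification chosen for this sketch.

```python
def split_into_patches(series, patch_len):
    """Split a 1-D series into consecutive, non-overlapping patches.

    Trailing values that do not fill a whole patch are dropped in this sketch.
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    n_full = len(series) // patch_len
    return [list(series[i * patch_len:(i + 1) * patch_len]) for i in range(n_full)]


def autoregressive_forecast(series, patch_len, horizon, predict_next_patch):
    """Forecast `horizon` values by predicting one output patch at a time
    and feeding each prediction back in as context."""
    context = list(series)
    forecast = []
    while len(forecast) < horizon:
        patches = split_into_patches(context, patch_len)
        next_patch = list(predict_next_patch(patches))
        forecast.extend(next_patch)
        context.extend(next_patch)
    return forecast[:horizon]


def repeat_last_patch(patches):
    # Hypothetical stand-in for the model: predicts that the last patch repeats.
    return patches[-1]


history = [1, 2, 3, 4, 5, 6, 7, 8]
print(split_into_patches(history, patch_len=4))
print(autoregressive_forecast(history, patch_len=4, horizon=6, predict_next_patch=repeat_last_patch))
```

Because the predictor may return a patch of any length, the same loop also covers the case where the output patch is longer than the input patch.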

MLCD

The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.

Janus

The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by the DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can take both images and text as input and can generate both image and text output.

Note: The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.

The abstract from the original paper is the following:

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

InternVL

The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

The abstract from the paper is the following:

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.

Kernel integration

We integrate some kernels into the transformers library via the kernels package (https://github.com/huggingface/kernels). We start with some kernels in the Llama model, and we will iterate to identify the best performance optimizations.

Llama Kernel integration by @MekkCyber in #37092
[kernels] use original forward at compile time by @gante in #37604

TP support

In the previous release, we added TP support in order to run distributed inference. However, it is not yet supported for all quantization methods, and we are progressively adding support for it. Right now, only compressed-tensors, fp8 and fp8-fbgemm support it.

Attention Quantization with FBGemm & TP by @MekkCyber in #37384
Restrict & Explain tp_plan for FBgemm by @MekkCyber in #37404

Quantization

AutoRound

From the AutoRound contributors:

AutoRound is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps ... More details here: https://github.com/intel/auto-round

Add AutoRound quantization support by @wenhuach21 in #37393

Quantization Documentation

We have added two new sections to help you understand and get started with quantization:

Quantization concept
Selecting a quantization method

Add "selecting a quantization method" doc by @DerekLiu35 in #37159
Update quantization docs by @DerekLiu35 in #37439

GGUF

We've added GGUF support to gemma3 family models.

Add GGUF support to Gemma3 Text backbone by @Isotr0py in #37424
Support loading Gemma3 QAT GGUF models by @Isotr0py in #37649

Fast image processors

Most Vision Models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup when processing images compared to PIL/numpy functions, and support processing on both CPU and CUDA.

See the list of updated models: https://github.com/huggingface/transformers/issues/36978
Learn more about fast image processors: Fast Image Processors

Add Fast Image Processor for Perceiver by @rootonchair in #37176
Add Fast Image Processor for Flava by @rootonchair in #37135
Add Fast Image Processor for LayoutLMv2 by @rootonchair in #37203
Add Fast Image Processor for LayoutLMv3 by @rootonchair in #37201
Add Fast Image Processor for Donut by @rootonchair in #37081
Add Fast LeViT Processor by @keetrap in #37154
Add Fast Mobilenet-V2 Processor by @keetrap in #37113
Add Fast owlvit Processor by @keetrap in #37164
Add ImageProcessorFast to BiT processor by @Yann-CV in #37180
Add Fast Yolos Processor by @keetrap in #37292
Add Fast Chinese-CLIP Processor by @keetrap in #37012
Add Fast Conditional-DETR Processor by @keetrap in #37071
Fix broken add-fast-image-processor CLI by @yonigozlan in #37499
Bridgetower fast image processor by @rootonchair in #37373
Add Fast Grounding-Dino Processor by @keetrap in #37108
Add Fast PVT Processor by @keetrap in #37204
Add Fast Image Processor for PoolFormer by @rootonchair in #37182
Add Fast Image Processor for MobileNetV1 by @dmdaksh in #37111
Fast image processor for VitMatte added and bug in slow version fixed by @henrikm11 in #37616
[Fast Processor] BEiT by @ariG23498 in #37005
Add Swin2SR ImageProcessorFast by @thisisiron in #37169
Add Fast Image Processor for vilt by @devxaitist in #37304

AutoDocstring

The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code:

[AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in https://github.com/huggingface/transformers/pull/33771
More info on how to use @auto_docstring: AutoDocstring

Custom `generate`

Custom generate methods can now be loaded through model.generate. They can be stored on the Hub, enabling quick distribution of experiments regarding new caches, decoding methods, heuristics, ...

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# `generate` with `custom_generate` -> `generate` uses custom code
# note: calling the custom method prints "✨ using a custom generation method ✨"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# trust_remote_code=True is required because the generation code is fetched from the Hub
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
```

You can find the docs here, and all custom generation methods by searching for the custom_generate tag.

[generate] Run custom generation code from the Hub by @gante in #36405

Chat CLI

The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.

The following is now possible and recommended:

transformers chat Qwen/Qwen2.5-3B-Instruct

Additionally, almost any generate flag can now be passed as a positional argument, present and future, as opposed to being limited to a set of hardcoded flags, for example:

transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
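
As a rough illustration of how `key=value` arguments like these could be turned into generation keyword arguments, here is a small sketch. It is not the CLI's actual parser: `parse_generate_flags` is a hypothetical helper, and using `ast.literal_eval` to interpret values is an assumption made for this example.

```python
import ast


def parse_generate_flags(args):
    """Turn ["do_sample=False", "max_new_tokens=10"] into a dict of keyword arguments."""
    flags = {}
    for arg in args:
        key, sep, raw_value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        try:
            # Interpret Python literals such as False, 10, or 0.7.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything that is not a literal is kept as a plain string.
            value = raw_value
        flags[key] = value
    return flags


print(parse_generate_flags(["do_sample=False", "max_new_tokens=10"]))
```

Parsing arbitrary keys, rather than a fixed list of flags, is what would let such a command accept generation options added in future releases.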

Transformers cli clean command by @LysandreJik in #37657
[chat] clean code and add base help by @gante in #37892
[chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047

Breaking changes

🚨 rm already deprecated pad_to_max_length arg by @itazap in #37617
🚨🚨🚨 Fix forward of Dinov2ForImageClassification for models with registers by @psandovalsegura in #37836
🔴 [VLM] Add base model without head by @zucchini-nlp in #37033
🔴 Video processors as a separate class by @zucchini-nlp in #35206
🚨🚨 Allow saving and loading multiple "raw" chat template files by @Rocketknight1 in #36588
🔴 Update CLIP vision attention to new attention interface by @molbap in #37498
🚨🚨 Setup -> setupclass conversion by @Rocketknight1 in #37282

Deprecations

The agents folder is finally removed from transformers in favour of using smolagents.

[agents] remove agents 🧹 by @gante in #37368

We are moving away from torch 2.0, as it was released more than two years ago.

byebye torch 2.0 by @ydshieh in #37277

General bugfixes and improvements

fix flex attn when optional args aren't passed by @winglian in #37327
fix llama4 training by @hiyouga in #37319
Fix deepspeed with quantization by @Cyrilvallez in #37324
Fix init empty weights without accelerate by @Cyrilvallez in #37337
Use Python 3.9 syntax in examples by @cyyever in #37279
Fix torchao usage by @jiqing-feng in #37034
enable 2 llama UT cases on xpu by @yao-matrix in #37126
Avoid build crashes when torch.version.xpu doesn't exist and fix Llama4 processor tests by @Rocketknight1 in #37346
fix derived berts _init_weights by @Cyrilvallez in #37341
Update translation template by @stevhliu in #37294
Remove HQQ from caching allocator warmup by @Cyrilvallez in #37347
updated model card for Mistral by @NahieliV in #37156
Update model-card for DINOv2 by @shubham0204 in #37104
Update falcon mamba card by @ricalanis in #37253
Update Model card for GPT2 by @ash-01xor in #37101
Improvements in Gemma2 model card by @devesh-2002 in #37076
Update Model Card for Jamba by @ParagEkbote in #37152
Add bnb to the list of supported quantization methods for LLama4 by @MekkCyber in #37348
Updated Model-card for donut by @Logeswaran7 in #37290
Remove unnecessary attr assignment by @tugsbayasgalan in #36837
more fixes for post-training llama4 by @winglian in #37329
Fixing flex attention for torch=2.6.0 by @SalmanMohammadi in #37285
Multiple llama4 fixe by @ArthurZucker in #37353
Expose blip2qformer by @alex-jw-brooks in #37254
convert float for yarn related arguments in rope_scaling by @bzantium in #37139
Use Python 3.9 syntax in tests by @cyyever in #37343
A bit of cleaning 🧹🧹 by @Cyrilvallez in #37215
fix deepspeed job by @ydshieh in #37284
Set vision config to None for Gemma 1B conversion by @RyanMullins in #37366
[llama 4] dynamic rope decorator by @gante in #37365
Skip non-selected experts for mixtral and qwen2_moe by @Coco58323 in #32429
[core] remove GenerationMixin inheritance by default in PreTrainedModel by @gante in #37173
prune LM Head for USD by @jmamou in #36695
fix(qwen): fix shape error when using tp by @KimmiShi in #36947
Preserve requires_grad in pre quantized model by @jerryzh168 in #37354
Update composition flag usage by @zucchini-nlp in #36263
fix: llama4 conversion script no_rope_layers by @jmkuebler in #37359
update deepspeed docker by @SunMarc in #37371
Fix warning message for PEFT models in text-generation pipeline #36783 by @falconlee236 in #36887
Apply torchfix to replace deprecated functions: _pytree._register_pytree_node and torch.cpu.amp.autocast by @bzhong-solink in #37372
Fix some failing AWQ tests by @DerekLiu35 in #37383
the fix that did not get in by @ArthurZucker in #37370
handle torch version edge cases by @winglian in #37399
Add warning when failed to acquire other user's lock at model download by @manueldeprada in #37395
Handle torch ver in flexattn by @Kh4L in #37400
Fix Llama4 offset by @Cyrilvallez in #37414
Offloaded hybrid cache for Llama4 by @Cyrilvallez in #37401
mark llama4 as not supported with fa2 by @winglian in #37416
update kernels to 0.4.3 by @ArthurZucker in #37419
Send trainer/fsdp/deepspeed CI job reports to a single channel by @ydshieh in #37411
from_pretrained should handle xpu case by @sywangyi in #37382
Allow rocm systems to run these tests by @ivarflakstad in #37278
use rms_norm_eps for the L2Norm for Llama4 by @ArthurZucker in #37418
[chat-template] Unify tests and clean up 🧼 by @zucchini-nlp in #37275
Fix new failure reports not including anything other than tests/models/ by @ydshieh in #37415
Quark Quantization gated repo by @MekkCyber in #37412
Add image classifier donut & update loss calculation for all swins by @eljandoubi in #37224
Correctly drop tokens in SwitchTransformer by @mario-aws in #37123
Fix require_read_token by @MekkCyber in #37422
fix: use mtime by default in Trainer._rotate_checkpoints with automatic fallback by @Jerry-Terrasse in #37260
(Part 2) feat: allow for tp_size attr for tplizing the model by @kmehant in #37054
Adding to self_comment_ci.yml by @MekkCyber in #37426
[Feat] Support npu in modeling models by @duanjunwen in #37369
Remove old code for PyTorch, Accelerator and tokenizers by @cyyever in #37234
enhance require_deterministic_for_xpu by @yao-matrix in #37437
Fixes: Corrects file path for CUDA kernels by @DonggeunYu in #37438
Simplify soft dependencies and update the dummy-creation process by @LysandreJik in #36827
Update-kernel-pin by @ArthurZucker in #37448
Add moe kernels by @ArthurZucker in #37376
Fix the test fetcher by @LysandreJik in #37452
Remove triton mlp kernel, not compiling for some models by @MekkCyber in #37449
[processor] clean up mulitmodal tests by @zucchini-nlp in #37362
[Regression] Fix Quark quantized model loading after refactorization by @BowenBao in #37407
prevent creating a view/leaf param for low rank optimizers w FSDP by @winglian in #37379
Disable kernels for quantization by @MekkCyber in #37446
Add weights_only=True to torch.load by @cyyever in #37062
Add XPU case to is_torch_bf16_gpu_available by @cyyever in #37132
nit: typing use Llama4TextConfig instead of Llama4Config by @kmehant in #37430
Delete hubconf.py by @Rocketknight1 in #37455
Fix typing issues with SigLip2 by @EricWiener in #37356
fix: (llama4) fix no_split_modules to be picked up for fsdpv1 and v2 sharding by @kmehant in #37462
make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM by @yao-matrix in #37480
Remove fsspec dependency which isn't directly used by transformers by @cyyever in #37318
Fix tests failed with gated repos. by @ydshieh in #37484
[ci] fix doc builder by @zucchini-nlp in #37489
Fixed broken links by @cypherpepe in #37466
Detect and fix most _init_weights() issues - make it work for composite models by @Cyrilvallez in #37070
[bug] deprecated deta load_cuda_kernel, MultiScaleDeformableAttention by @chagmgang in #37443
Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 by @flukeskywalker in #37381
Fix wrong argparse type in modular checker script by @seven-mile in #37472
Fixing gated repo issues by @MekkCyber in #37463
[qwen-omni] fix processor by @zucchini-nlp in #37493
Remove deprecation warning for num_logits_to_keep by @Cyrilvallez in #37149
Don't auto-assign reviewers when the author is in HF by @Rocketknight1 in #37500
Detect and use device context manager or global device in from_pretrained by @Cyrilvallez in #37216
Change default value of attn_temperature_tuning by @gmlwns2000 in #37501
Llama4: remove redundant transpose of router_logits by @pbelevich in #37468
fix: Restore explicit error surfacing for unexpected hub exceptions by @manueldeprada in #37525
Fix missing return type for MLCD docs by @qubvel in #37527
fix and enhance pipeline_webserver.md by @yao-matrix in #36992
VDR task guide by @merveenoyan in #37485
Update VITS model card by @princepride in #37335
Refactor ColPali model documentation by @Soum-Soum in #37309
enable 5 cases on XPU by @yao-matrix in #37507
enable several cases on XPU by @yao-matrix in #37516
enable test_offloaded_cache_implementation on XPU by @yao-matrix in #37514
Fix BitsAndBytesConfig JSON serialization in TrainingArguments by @astefanutti in #37520
enable 3 mpt test cases on XPU by @yao-matrix in #37546
enable 6 rtdetrv2 cases on xpu by @yao-matrix in #37548
More appropriate cuda warmup in resource-constrained hardware by @Cyrilvallez in #37550
Fixes hqq by following a new path for bias parameter in pre_quantized models by @MekkCyber in #37530
convert scale and zero to cuda when using HQQ backend by @phymhan in #37425
Keep Quark loading through meta device by @BowenBao in #37538
Refactor torchao docs by @MekkCyber in #37490
add FlashAttentionKwargs and seq_idx to flat collator by @garrett361 in #36456
docs(typo): Update ISSUES.md, fix a small typo by @ in #37542
Fix device issue for tapas (with as_tensor) by @ydshieh in #37551
Make Ignored Columns ValueError More Informative by @wbuchanan in #33299
Fix TimesFm doc issue by @Cyrilvallez in #37552
Run test_can_load_with_global_device_set using a subprocess by @ydshieh in #37553
Fix pixel attention mask padding in smolvlm by @ManuelFay in #37497
[vlm] adjust max length for special tokens by @zucchini-nlp in #37342
Add EfficientNet Image PreProcessor by @zshn25 in #37055
Fix Mamba2 Grouped SSD Support in the torch_forward Path by @cyang49 in #37533
All models can be initialized on meta device by @Cyrilvallez in #37563
[chat template] fix security vulnerability by @zucchini-nlp in #37523
[qwen-vl] Standardize config by @zucchini-nlp in #37268
[TimesFM] use the main revison instead of revision for integration test by @kashif in #37558
Fix qwen2audio wanr -> warn by @alex-jw-brooks in #37559
Small fix on context manager detection by @Cyrilvallez in #37562
[phi4] update conversion by @zucchini-nlp in #37579
docs: fix typo by @tonyksong in #37567
Ensure positive warm-up size by @Cyrilvallez in #37581
Update Phi4 converter by @Cyrilvallez in #37594
Fix Quark quantization config by @MekkCyber in #37578
Gaudi: Add the bf16 support for hpu by @yuanwu2017 in #37568
Fix some GPU OOM after #37553 by @ydshieh in #37591
remove _run_third_party_device_tests by @jiqing-feng in #37445
[Bugfix] Fix flash-attention func param mismatch and softmax_scale default value mistake on Ascend NPU by @FightingZhen in #37575
Flag SpeechT5 flaky test by @molbap in #37587
enable 6 gemma2 cases on XPU by @yao-matrix in #37564
enable 6 modeling cases on XPU by @yao-matrix in #37571
[Gemma3] compile ✨ by @gante in #37447
Model debugger upgrades by @molbap in #37391
[VLMs] use only xxx_token_id for multimodal tokens by @zucchini-nlp in #37573
fix 2 encoder_decoder issues on XPU by @yao-matrix in #37572
fix issue that some example with no trainer use accelerator.end_train… by @we1559 in #37435
Deprecate modeling_utils.py classes by @qubvel in #37298
Fixing the example in generation strategy doc by @jeasinema in #37598
chore: update model card for SigLIP by @saswatmeher in #37585
Fix InternVL attention when using qk_norm (38B and 78B) by @yonigozlan in #37620
Remove torchvision requirement from AutoImageProcessor by @LysandreJik in #37457
Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor by @alex-jw-brooks in #37625
fix link in kv_cache.md by @manueldeprada in #37652
Update longformer.md by @JihadHammoud02 in #37622
Refactor phi doc by @JihadHammoud02 in #37583
Fix Qwen2.5-Omni get_chunked_index chunking functionality by @imkero in #37631
[fix] make legacy bnb code work by @cyr0930 in #37331
[fix gemma] Set default value for output_attentions parameter in Gemma2 and Gemma… by @chenin-wang in #37633
Restructure torchao quantization examples by @jerryzh168 in #37592
Add test to ensure unknown exceptions reraising in utils/hub.py::cached_files() by @manueldeprada in #37651
[test] update test_past_key_values_format by @gante in #37614
[tests] Stricter generate + compilation test -- no recompilations allowed by @gante in #37629
Fix ValueError when eval_do_concat_batches=False with examples by @jeffhataws in #37621
Fixes #37219 : RecurrentGemma crashes for inputs longer than sliding window length by @manueldeprada in #37613
Introduce GradientCheckpointingLayer by @qubvel in #37223
[qwen-omni] fix training by @zucchini-nlp in #37517
Fix duplicated weights in fp8 quantization by @Cyrilvallez in #37667
Correct warm-up with fp8 by @Cyrilvallez in #37670
Fixing quantization tests by @MekkCyber in #37650
Fix autoround docs by @SunMarc in #37675
Fix no_split_modules for Llama4 pretrained models by @astefanutti in #37673
Refactor bitsandbytes doc by @MekkCyber in #37668
enable mllama cases on xpu by @yao-matrix in #37644
enable 6 granite cases on xpu by @yao-matrix in #37569
[cleanup] remove old scripts in /scripts 🧹 🧹 by @gante in #37676
[docs] only build en docs in push CI by @gante in #37677
typo update in the parameter name by @LunaticMaestro in #37655
[Docs] Move models to appropriate section by @NielsRogge in #37338
Add counters for dataset classes by @jiangyukunok in #37636
enable blip2 and emu3 cases on XPU by @yao-matrix in #37662
🌐 [i18n-KO] Translated siglip.md to Korean by @devxaitist in #37145
Updated model card for mbart and mbart50 by @Vishesh-Mistry in #37619
fix: remove classmethod from Qwen2_5OmniConfig.get_text_config by @shahruk10 in #37690
enable cpu offloading for Bark on xpu by @yao-matrix in #37599
Pin torch == 2.6 on PR CI docker images for now by @ydshieh in #37695
[cleanup] remove /model_cards 🧹 🧹 by @gante in #37685
Add maintainers for ROCm/Intel XPU/Ascend NPU by @Rocketknight1 in #37678
[CI] add back sacrebleu (and document why) by @gante in #37700
TransfoXL is deprecated, don't keep it in tested examples! by @Rocketknight1 in #37707
[internvl] fix chat template by @zucchini-nlp in #37656
Qwen 2.5 Omni: apply video defaults by @pcuenca in #37660
[tests, qwen2_5_omni] fix flaky tests by @gante in #37721
Process inputs directly in apply_chat_template in image-text-to-text pipeline by @yonigozlan in #35616
enable 4 test_trainer cases on XPU by @yao-matrix in #37645
Fix Aria tests by @jiqing-feng in #37444
Fix inference bugs in Qwen2.5 Omni by @BakerBunker in #37701
Fix torchao doc examples by @MekkCyber in #37697
[tests] fix test_nemotron_8b_generation_sdpa by @faaany in #37665
Make sure torch_is_available before using torch.distributed by @MekkCyber in #37693
[VLMs] fix flash-attention tests by @zucchini-nlp in #37603
fix: learning_rate logged as tensor causing save issue with deepspeed by @NanoCode012 in #37704
Fix embeds_to_talker device in Qwen2.5-Omni by @BakerBunker in #37739
Correctly raise errors when downloading tokenizer files by @Cyrilvallez in #37740
[performance_optim] define flash attention mask on NPU device directly by @FightingZhen in #37698
Skip all AriaForConditionalGenerationIntegrationTest on T4 by @ydshieh in #37746
Update MllamaForConditionalGenerationIntegrationTest by @ydshieh in #37750
Expand quantized data type support for tensor parallelism by @amd-xiaoyu12 in #37719
[cache] fix HybridCache init when device is passed by @gante in #37718
GPT2Model StaticCache support by @poedator in #35761
[generate] skip compilation on cpu offload by @gante in #37709
updated hidden_features for FlaxDinov2SwiGLUFFN in Dinov2 by @premmurugan229 in #37747
Fix qwen2_5 get_rope_index tensor device locations by @rphmeier in #37597
[generate] fix default autocompile case on gpu by @gante in #37756
Fix wrong input shapes in doc-string of models by @kkew3 in #37729
Refine parameter type annotations by @flashJd in #37666
Fix tied weight loading with TP and loading sub state_dicts by @Cyrilvallez in #37758
Fix load of rng state for resuming training from checkpoint by @winglian in #37162
Fix typos in comments by @co63oc in #37694
[deps] pin max torch version by @gante in #37760
Guard DeepSpeed imports by @lewtun in #37755
Fix auto-round hfoption by @MekkCyber in #37759
Update model card for Gemma by @afafelwafi in #37674
🌐 [i18n-KO] Translated roberta.md to Korean by @garongkim in #37069
[causal mask] fix preparation with multi-gpu by @zucchini-nlp in #37612
unpin pytest<8 by @ydshieh in #37768
Align gpt2 mask preparation to #37612 by @Cyrilvallez in #37787
Fix typos in strings and comments by @co63oc in #37784
Fix tensor parallel with non-floating dtypes by @Cyrilvallez in #37790
Force torch>=2.6 with torch.load to avoid vulnerability issue by @Cyrilvallez in #37785
fix mpt test of different outputs from cuda by @jiqing-feng in #37691
[i18n-KO] Translated keypoint_detection.md to Korean by @rlaalsrl0922 in #36649
chore: update SigLIP2 model card by @saswatmeher in #37624
fix performance issue in convert_ids_to_tokens by @martin-harmonic in #37773
Fix error message in hub.py by @srai9 in #37796
Gemma3 is Torch Exportable by @guangy10 in #37728
Fix the fsdp config cannot work issue. by @yuanwu2017 in #37549
Define warmup allocator for torchao quantization by @MekkCyber in #37764
Fix typos in strings and comments by @co63oc in #37799
[doc] fix the code examples in qwen doc by @jiangyukunok in #37803
Fix: Correct tensor shape comment in Mamba modeling by @ShadyPi in #37801
[RT-DETR] Improve docs by @NielsRogge in #37814
FIX: Faulty PEFT tests by @BenjaminBossan in #37757
Add Optional to remaining types by @cyyever in #37808
Fix error of HPU TP by @yuanwu2017 in #37782
change XLA deprecated api by @SunMarc in #37741
[config] revert #37603 by @zucchini-nlp in #37821
[modular] Fix the prefix-based renaming if the old and new model share a common name suffix by @Cyrilvallez in #37829
[tests] fix flaky pattern in test_generate_continue_from_past_key_values by @gante in #37724
[tests] reorganize cache tests and clean memory between tests by @gante in #37684
Revert change that breaks on Torch 2.1 by @Rocketknight1 in #37531
Fix check of unecessary packages (issue #37626) by @HichTala in #37825
Fix cache get item return type hints by @ChengLyu in #37847
Fix Bitnet tokenizer in pipeline by @MekkCyber in #37861
docs: Details for ambigious channel dimension assignment by @yaner-here in #37600
Processor chat template: pass custom kwargs by @pcuenca in #37852
Add Intel Gaudi doc by @regisss in #37855
🌐 [i18n-KO] Translated electra.md to Korean by @Kim-Ju-won in #36763
Update modeling_llama4.py by @monk1337 in #37841
Skip is_flaky tests in the CI by @Rocketknight1 in #37723
Allow override inputs to export recipe by @guangy10 in #37508
enable internvl UTs on XPU by @yao-matrix in #37779
Llama Guard updates by @pcuenca in #37872
update Clean_up_tokenization_spaces typos. by @zhanluxianshen in #37865
fix error for _register_pytree_node in torch2.1.0 and fix bf16 assertion in xpu and npu by @jiaqiw09 in #37839
make sure lr is not a tensor by @winglian in #37881
Fix qwen2-vl-docs. by @zhanluxianshen in #37879
uniformize kwargs for VisionTextDualEncoder by @tibor-reiss in #34563
Fix: reassign in qwen3 moe model by @linkedlist771 in #37848
update comment in image_processing_base.py to reference image_process… by @arjunaskykok in #37864
Support FlaxPreTrainedModel to load model checkpoint from local subfolder safetensors by @Melody-coder923 in #37732
[tests] Test all cache implementations by @gante in #37873
[tests] reset logs in torch.compile test by @gante in #37894
Fix Qwen3 tp plan with FP8 by @MekkCyber in #37871
Enhance documentation to explain chat-based few-shot prompting by @MostHumble in #37828
Support AOPerModuleConfig and include_embedding by @jerryzh168 in #37802
fixed gemma3 collection path pointing to llama 2 collection. by @dmgcsilva in #37899
Fix typos in strings and comments by @co63oc in #37910
Improve performance of load_state_dict by @woct0rdho in #37902
🌐 [i18n-KO] Translated gpu_selection.md to Korean by @nsbg in #36757
Add usage example for DINOv2 by @baldassarreFe in #37398
Aligning modling code for GPT2 to work with vLLM (fallback) by @ariG23498 in #36934
Break weight tying when quantizing input embedding by @jerryzh168 in #37905
[docs] logits docstring by @gante in #37929
[D-FINE] Update names by @NielsRogge in #37957
More fault tolerant notification service by @ivarflakstad in #37924
[core] reuse unused reserved cuda memory when loading models by @gante in #37920
Use T4 single GPU runner with more CPU RAM by @ydshieh in #37961
[generate] Fix vocab_size access for multimodal models by @kurzdev in #37937
Fix incorrect type annotation in get_auxiliary_logits by @Tanuj-rai in #37955
[Ready to Merge][HFQuantizer] Squelch pydantic warnings by @kylesayrs in #37726
Add GraniteMoeHybrid support for 4.0 by @Ssukriti in #37658
add xpu memory check by @faaany in #37969
[tests] Smaller model in slow cache tests by @gante in #37922
[llava] one pixel is missing from padding when length is odd by @cyr0930 in #37819
add job links to new model failure report by @ydshieh in #37973
fix docs serving typos. by @zhanluxianshen in #37936
Small typo lines 47 and 199 perf_infer_gpu_one.md by @nlhmnlhmnlhm in #37938
Fix typos by @omahs in #37978
[speech2text] fix init of sinusoidal embeddings by @gante in #37931
Fix typo by @lkm2835 in #37964
enable xpu in test_trainer by @yao-matrix in #37774
fix FSDP + torch.compile bug when saving pretrained model by @Joaquinecc in #37725
Enable granite speech 3.3 tests by @alex-jw-brooks in #37560
Fix donut backtracking by @Rocketknight1 in #37788
Fix Qwen models export with torch 2.7 by @guangy10 in #37985
[offload] respect max_memory argument when factoring in unused reserved memory by @gante in #37982
make aya vision 5 integration tests pass on xpu by @yao-matrix in #37990
[chat template] separate jinja logic from tokenizers by @zucchini-nlp in #37602
remove duplicate code by @kaixuanliu in #37991
Add a check to import_utils.py to allow for use of faiss_gpu installation by @Fiona-Waters in #37997
[CSM] tiny fix on generation by @eustlb in #38001
Fix pad image transform for batched inputs by @sebasv in #37544
Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model by @uminaty in #37960
Enable RUF013 to enforce optional typing by @cyyever in #37266
Fix Optional typing by @qubvel in #38018
Print commit SHA on slack message for new model notification. by @ydshieh in #38019
[CI] remove duplicated message on GH comment to run slow tests by @gante in #37970
[caches] Raise exception on offloaded static caches + multi device by @gante in #37974
Skip test_push_to_hub_with_saves_each_epoch for now by @ydshieh in #38022
Fix incorrect installation instructions (for issue #37476) by @Zephyr271828 in #37640
Fix wording in torchscript.md by @Madghostek in #38004
[VLMs] support attention backends by @zucchini-nlp in #37576
make test_speculative_decoding_non_distil device-agnostic by @faaany in #38010
enable mamba2 integration cases on xpu by @yao-matrix in #38006
update bnb tests by @jiqing-feng in #38011
[AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in #33771
fix document masking for chunked attention by @winglian in #37429
make mistral3 pass on xpu by @yao-matrix in #37882
enable utils test cases on XPU by @yao-matrix in #38005
[Temporary] Log some information in some pytest/pluggy internal places by @ydshieh in #37996
Trigger CircleCI via GitHub Actions when ready for review by @ydshieh in #37885
Disable `Trigger CircleCI via GitHub Actions when ready for review` by @ydshieh in #38038
Do not erase a cache_position passed explicitly to generate(), if there is one by @FremyCompany in #37986
Support for version spec in requires & arbitrary mismatching depths across folders by @LysandreJik in #37854
Re-Enable Trigger CircleCI via GitHub Actions when "ready for review" by @ydshieh in #37885
Fix reduce-labels in BEIT Fast Image Processor by @simonreise in #38042
Fix cache update! by @Cyrilvallez in #38046
Fix linalg.norm for CovnNextV2 by @qubvel in #38015
enable generation fsdp/utils cases on XPU by @yao-matrix in #38009
fix(conversion): Fix size mismatch error during TF->PT model loading by @arjunaskykok in #38014
[VLM] fix loading issues by @zucchini-nlp in #38051
Fix OneFormer integration test by @qubvel in #38016
Add AMD expectation to test_gpt2_sample by @ivarflakstad in #38079
docs: fix md style by @imba-tjd in #38057
Fix mt5 test on AMD devices by @ivarflakstad in #38081
chore(qwen2): display warning log only when sliding window attention … by @edwardzjl in #36316
fix the inconsist docstring in apply_chat_template by @lenijwp in #38069
Fix tot update in trainer by @efsotr in #37923
update seed_worker to set seed based on worker_id and rank by @gathierry in #37980
uninstall kernels from docker images by @ydshieh in #38083
Refactor image processor phi4 by @yonigozlan in #36976
update require_read_token by @ydshieh in #38093
add timeout for downloading the librispeech_asr dataset by @faaany in #38073
fix: Propagate lr_scheduler_kwargs options to create LR Scheduler when LayerWiseDummyOptimizer is used by @BlackNoodle in #34559
Disable report callbacks for certain training tests by @ivarflakstad in #38088
[smolvlm] skip the test by @zucchini-nlp in #38099
Fix bug in prefill_chunk_size that ignores disable_compile flag by @xmarva in #38067
Fix past_key_values type hint in model output types by @ChengLyu in #37953
[bug] fix llava processor to calculate unpadding size correctly by @cyr0930 in #37988
fix check_bad commit.py gives wrong results by @ydshieh in #38107
Fix InternVL interpolate_pos_encoding and add to video_processing_auto by @yonigozlan in #38092
[CSM] update test for t4 runners by @eustlb in #38110
Add style bot by @SunMarc in #38102
Fix description and formatting errors in code docs by @bilibili12433014 in #38074
enable finegrained_fp8 and granite_speech cases on XPU by @yao-matrix in #38036
[video processor] fix tests by @zucchini-nlp in #38104
Fix temporal padding in Qwen2VLImageProcessor when the number of frames is not divisible by temporal_patch_size by @ritwickchaudhry in #38076
Fix auto batch size finder test by @ivarflakstad in #38125
Add config validation and style tweaks by @Kirire in #37589
Update trainer.md by @guspuffygit in #38113
[docs] add uv installation instructions for source builds by @arjunaskykok in #37968
Add manueldeprada to run_slow whitelist by @manueldeprada in #38126
enable d_fine finetuning properly by @SangbumChoi in #37962
Fix incorrect attention mask truncate in WhisperFlashAttention2 by @OliBomby in #36477
[Qwen3] Qwen3 MoE add tp plan for expert mlps by @hgt312 in #38135
enable csm integration cases on xpu, all passed by @yao-matrix in #38140
Remove head mask in generative models by @zucchini-nlp in #35786
Hotfix: Flash Attention 2 support in Pixtral by @uminaty in #38146
enable trainer test cases on xpu by @yao-matrix in #38138
disable deepspeed when setting up fake trainer by @winglian in #38101
Omit creation of positional IDs within ESM if applicable by @simonlevine in #38089
[FIX] Save speed metrics to logs by @pavelgein in #38136
enable autoround cases on XPU by @yao-matrix in #38167
Include output embedding as well with include_embedding flag by @jerryzh168 in #37935
Fix Qwen2.5 Omni SinusoidsPositionEmbedding precision by @BakerBunker in #38151
Add optional RMSNorm support to BitNet quantization (config + layers) by @Codys12 in #38087
[VLMs] add helpers to get multimodal encodings by @zucchini-nlp in #37743
Bart: new cache format by @zucchini-nlp in #35314
clean autoawq cases on xpu by @yao-matrix in #38163
Disable Trigger CircleCI by ready for review by @ydshieh in #38171
Disable convert to draft workflow by @ydshieh in #38177
remove some commands from fetch_tests CircleCI job by @ydshieh in #38176
Feat: add warnings for unused keys and rules in tensor parallel by @S1ro1 in #37893
[ESM] Add flash-attention-2 backend for ESM-2 by @pstjohn in #38023
Add args support for fast image processors by @yonigozlan in #37018
Fix import torchao.prototype.low_bit_optim since torchao v0.11 by @baptxste in #38174
fix bug in distributed loss test by @techkang in #38166
[tests] remove test_sdpa_equivalence (redundant) by @gante in #37911
Add Granite Speech Support by @alex-jw-brooks in #36801
Add glm4 by @ArthurZucker in #37388
Add Qwen2.5-Omni by @BakerBunker in #36752
Add MLCD model by @tanhuajie in #36182
Add TimesFM Time Series Forecasting Model by @jinan-zhou in #34082
Add Janus model by @yaswanth19 in #36053
Add InternVL (2.5 MPO) by @yonigozlan in #35968
Add Bitnet model by @MekkCyber in #37742
Samhq model addition by @sushmanthreddy in #35147
Add D-FINE Model into Transformers by @VladOS95-cyber in #36261
Add CSM model by @eustlb in #36719

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@cyyever
- Use Python 3.9 syntax in examples (#37279)
- Use Python 3.9 syntax in tests (#37343)
- Remove old code for PyTorch, Accelerator and tokenizers (#37234)
- Add weights_only=True to torch.load (#37062)
- Add XPU case to is_torch_bf16_gpu_available (#37132)
- Remove fsspec dependency which isn't directly used by transformers (#37318)
- Add Optional to remaining types (#37808)
- Enable RUF013 to enforce optional typing (#37266)
@yao-matrix
- enable 2 llama UT cases on xpu (#37126)
- enhance require_deterministic_for_xpu (#37437)
- make test_snowman_image_captioning pass on XPU, by sharing same atol w/ ROCM (#37480)
- fix and enhance pipeline_webserver.md (#36992)
- enable 5 cases on XPU (#37507)
- enable several cases on XPU (#37516)
- enable test_offloaded_cache_implementation on XPU (#37514)
- enable 3 mpt test cases on XPU (#37546)
- enable 6 rtdetrv2 cases on xpu (#37548)
- enable 6 gemma2 cases on XPU (#37564)
- enable 6 modeling cases on XPU (#37571)
- fix 2 encoder_decoder issues on XPU (#37572)
- enable mllama cases on xpu (#37644)
- enable 6 granite cases on xpu (#37569)
- enable blip2 and emu3 cases on XPU (#37662)
- enable cpu offloading for Bark on xpu (#37599)
- enable 4 test_trainer cases on XPU (#37645)
- enable internvl UTs on XPU (#37779)
- enable xpu in test_trainer (#37774)
- make aya vision 5 integration tests pass on xpu (#37990)
- enable mamba2 integration cases on xpu (#38006)
- make mistral3 pass on xpu (#37882)
- enable utils test cases on XPU (#38005)
- enable generation fsdp/utils cases on XPU (#38009)
- enable finegrained_fp8 and granite_speech cases on XPU (#38036)
- enable csm integration cases on xpu, all passed (#38140)
- enable trainer test cases on xpu (#38138)
- enable autoround cases on XPU (#38167)
- clean autoawq cases on xpu (#38163)
@alex-jw-brooks
- Expose blip2qformer (#37254)
- Add Granite Speech Support (#36801)
- Fix qwen2audio wanr -> warn (#37559)
- Allow Exclusion of Input IDs from RepetitionPenaltyLogitsProcessor (#37625)
- Enable granite speech 3.3 tests (#37560)
@BakerBunker
- Add Qwen2.5-Omni (#36752)
- Fix inference bugs in Qwen2.5 Omni (#37701)
- Fix embeds_to_talker device in Qwen2.5-Omni (#37739)
- Fix Qwen2.5 Omni SinusoidsPositionEmbedding precision (#38151)
@rootonchair
- Add Fast Image Processor for Perceiver (#37176)
- Add Fast Image Processor for Flava (#37135)
- Add Fast Image Processor for LayoutLMv2 (#37203)
- Add Fast Image Processor for LayoutLMv3 (#37201)
- Add Fast Image Processor for Donut (#37081)
- Bridgetower fast image processor (#37373)
- Add Fast Image Processor for PoolFormer (#37182)
@flukeskywalker
- Fix mask handling for flex attention in llama/gemma2/mistral/qwen2 (#37381)
@keetrap
- Add Fast LeViT Processor (#37154)
- Add Fast Mobilenet-V2 Processor (#37113)
- Add Fast owlvit Processor (#37164)
- Add Fast Yolos Processor (#37292)
- Add Fast Chinese-CLIP Processor (#37012)
- Add Fast Conditional-DETR Processor (#37071)
- Add Fast Grounding-Dino Processor (#37108)
- Add Fast PVT Processor (#37204)
@tanhuajie
- Add MLCD model (#36182)
@jinan-zhou
- Add TimesFM Time Series Forecasting Model (#34082)
@yaswanth19
- Add Janus model (#36053)
@saswatmeher
- chore: update model card for SigLIP (#37585)
- chore: update SigLIP2 model card (#37624)
@cyr0930
- [fix] make legacy bnb code work (#37331)
- [llava] one pixel is missing from padding when length is odd (#37819)
- [bug] fix llava processor to calculate unpadding size correctly (#37988)
@wenhuach21
- Add AutoRound quantization support (#37393)
@devxaitist
- 🌐 [i18n-KO] Translated siglip.md to Korean (#37145)
- Add Fast Image Processor for vilt (#37304)
@co63oc
- Fix typos in comments (#37694)
- Fix typos in strings and comments (#37784)
- Fix typos in strings and comments (#37799)
- Fix typos in strings and comments (#37910)
@guangy10
- Gemma3 is Torch Exportable (#37728)
- Allow override inputs to export recipe (#37508)
- Fix Qwen models export with torch 2.7 (#37985)
@sushmanthreddy
- Samhq model addition (#35147)
@VladOS95-cyber
- Add D-FINE Model into Transformers (#36261)
@Ssukriti
- Add GraniteMoeHybrid support for 4.0 (#37658)

- Python
Published by LysandreJik about 1 year ago

transformers - CSM (based on v4.51.3)

A new model is added to transformers: CSM. It is added on top of the v4.51.3 release and can be installed from the following tag: v4.51.3-CSM-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-CSM-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the CSM model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
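The preview tags in these notes all follow the same naming pattern: the base release, the model name, then `-preview`. As a minimal sketch, the install command can be composed as below. The helper names `preview_tag` and `pip_install_command` are hypothetical and are not part of transformers; only the tag pattern and repository URL come from the notes.

```python
# Hypothetical helpers (not part of transformers): compose the preview tag
# and pip command following the pattern used in these release notes.
BASE_RELEASE = "v4.51.3"
REPO_URL = "https://github.com/huggingface/transformers"


def preview_tag(model_name: str, base_release: str = BASE_RELEASE) -> str:
    """Return the preview tag name, e.g. 'v4.51.3-CSM-preview'."""
    return f"{base_release}-{model_name}-preview"


def pip_install_command(model_name: str) -> str:
    """Return the pip command that installs transformers from a preview tag."""
    return f"pip install git+{REPO_URL}@{preview_tag(model_name)}"


if __name__ == "__main__":
    print(pip_install_command("CSM"))
```

Pinning to the tag rather than to `main` keeps the install reproducible while the model is still in preview.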

CSM

The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.

Usage example

CSM can be found on the Hugging Face Hub.

Without Conversational Context

CSM can be used to simply generate speech from a text prompt:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]The past is just a story we tell ourselves."  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
```
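The example above prepares the same input in two ways: a plain string prefixed with the speaker id in square brackets, and a chat-style list whose `role` is the speaker id as a string. The sketch below only illustrates how those two structures relate, using plain Python data. The helpers `to_prompt` and `to_conversation` are hypothetical and are not part of the processor API.

```python
# Hypothetical helpers (not part of transformers) that build the two input
# shapes used in the example above from (speaker_id, text) pairs.

def to_prompt(speaker_id: int, text: str) -> str:
    """Plain-text form: the speaker id in square brackets, then the text."""
    return f"[{speaker_id}]{text}"


def to_conversation(turns):
    """Chat-template form: one dict per turn, with the speaker id as the role."""
    return [
        {"role": str(speaker_id), "content": [{"type": "text", "text": text}]}
        for speaker_id, text in turns
    ]


if __name__ == "__main__":
    turns = [(0, "The past is just a story we tell ourselves.")]
    print(to_prompt(*turns[0]))
    print(to_conversation(turns))
```

Turns that carry audio context add a second content entry of type `"audio"`, as in the conversational example that follows.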

With Conversational Context

CSM can be used to generate speech given a conversation, allowing consistency in the voices and content-aware generation:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```

Batched Inference

CSM supports batched inference!

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```

Making The Model Go Brrr

CSM supports full-graph compilation with CUDA graphs!

```python
import torch
import copy
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "eustlb/csm-1b"
device = "cuda"

# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use static cache, enabling automatically torch compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250  # big enough to avoid recompilation
model.generation_config.max_new_tokens = None  # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}


# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        # elapsed_time() returns milliseconds; convert to seconds
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")


# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "=" * 50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

print("\n" + "=" * 50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "=" * 50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("=" * 50)
```
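The timer in the example above relies on CUDA events, so it only runs on a GPU. For a quick sanity check on a machine without CUDA, a wall-clock variant can be sketched with the standard library alone. `WallClockTimer` is a hypothetical name, not something provided by transformers or PyTorch. Wall-clock timing is an assumption that suits CPU work only: GPU kernels run asynchronously, so timing them this way would need an explicit synchronization before reading the clock.

```python
import time


class WallClockTimer:
    """Hypothetical CPU-only counterpart to the CUDA-event timer above."""

    def __init__(self, name="Execution"):
        self.name = name
        self.elapsed = None  # seconds, set when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name} time: {self.elapsed:.4f} seconds")


if __name__ == "__main__":
    with WallClockTimer("Sleep"):
        time.sleep(0.01)
```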

Training

CSM Transformers integration supports training!

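```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model.train()

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# context (the loop is missing from these notes; this one is copied from the
# conversational-context example above so the conversation is not empty)
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    output_labels=True,
).to(device)

out = model(**inputs)
out.loss.backward()
```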

- Python
Published by LysandreJik about 1 year ago

transformers - GraniteMoeHybrid (based on v4.51.3)

A new model is added to transformers: GraniteMoeHybrid. It is added on top of the v4.51.3 release and can be installed from the following tag: v4.51.3-GraniteMoeHybrid-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-GraniteMoeHybrid-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the GraniteMoeHybrid model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

GraniteMoeHybrid

Usage example

GraniteMoeHybrid can be found on the Hugging Face Hub.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text and move it to the model's device
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)

# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)

# decode output tokens into text
output = tokenizer.batch_decode(output)

# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
```

- Python
Published by LysandreJik about 1 year ago

transformers - D-FINE (based on v4.51.3)

A new model is added to transformers: D-FINE. It is added on top of the v4.51.3 release and can be installed from the following tag: v4.51.3-D-FINE-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-D-FINE-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the D-FINE model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

D-FINE

The D-FINE model was proposed in "D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement" by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.

The abstract from the paper is the following:

Usage example

D-FINE can be found on the Hugging Face Hub.

```python
>>> import torch
>>> from transformers.image_utils import load_image
>>> from transformers import DFineForObjectDetection, AutoImageProcessor

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = load_image(url)

>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")

>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)

>>> for result in results:
...     for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
...         score, label = score.item(), label_id.item()
...         box = [round(i, 2) for i in box.tolist()]
...         print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
```

- Python
Published by LysandreJik about 1 year ago

transformers - SAM-HQ (based on v4.51.3)

A new model is added to transformers: SAM-HQ. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-SAM-HQ-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-SAM-HQ-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the SAM-HQ model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

SAM-HQ

SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.

SAM-HQ introduces several key improvements over the original SAM model:

High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
Global-local Feature Fusion: Combines features from different stages of the model for improved mask details
Training Data: Uses a carefully curated dataset of 44K high-quality masks instead of SA-1B
Efficiency: Adds only 0.5% additional parameters while significantly improving mask quality
Zero-shot Capability: Maintains SAM's strong zero-shot performance while improving accuracy

Tips:

SAM-HQ produces higher quality masks than the original SAM model, particularly for objects with intricate structures and fine details
The model predicts binary masks with more accurate boundaries and better handling of thin structures
Like SAM, the model performs better with input 2D points and/or input bounding boxes
You can prompt multiple points for the same image and predict a single high-quality mask
The model maintains SAM's zero-shot generalization capabilities
SAM-HQ only adds ~0.5% additional parameters compared to SAM
Fine-tuning the model is not supported yet

Usage example

SAM-HQ can be found on the Huggingface Hub.

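```python
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

# Load the input image and choose a point prompt
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```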

You can also process your own masks alongside the input images in the processor to be passed to the model:

```python
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("1")
input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(
    raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```

- Python
Published by LysandreJik about 1 year ago

transformers - BitNet (based on v4.51.3)

A new model is added to transformers: BitNet. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-BitNet-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-BitNet-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the BitNet model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

BitNet

Usage example

BitNet can be found on the Huggingface Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
)

# Apply the chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How are you?"},
]
chat_input = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate response
chat_outputs = model.generate(chat_input, max_new_tokens=50)
# Decode only the response part
response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special_tokens=True)
print("\nAssistant Response:", response)
```

- Python
Published by LysandreJik about 1 year ago

transformers - LlamaGuard-4 (based on v4.51.3)

A new model is added to transformers: LlamaGuard. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-LlamaGuard-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-LlamaGuard-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the LlamaGuard-4 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

LlamaGuard

Usage example

LlamaGuard can be found on the Huggingface Hub.

Here is a simple snippet showing how to run Llama Guard 4 on user inputs.

```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-Guard-4-12B"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "how do I make a bomb?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# OUTPUT
# unsafe
# S9
```
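The decoded response is plain text: a verdict (`safe` or `unsafe`), followed for unsafe inputs by the violated category code, as in the output shown for this example. A minimal sketch of turning that text into structured data follows. `parse_guard_response` is a hypothetical helper, not part of transformers, and it assumes that multiple category codes, if present, are comma-separated on the line after the verdict.

```python
def parse_guard_response(response: str):
    """Split a Llama Guard style response into (is_safe, category_codes)."""
    # Drop blank lines and surrounding whitespace
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty response")
    is_safe = lines[0].lower() == "safe"
    categories = []
    if not is_safe and len(lines) > 1:
        # Assumption: category codes are comma-separated, e.g. "S1,S9"
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return is_safe, categories


print(parse_guard_response("\n\nunsafe\nS9"))  # (False, ['S9'])
print(parse_guard_response("safe"))            # (True, [])
```

Anything other than an exact `safe` verdict is treated as unsafe here, which is the more conservative default for a moderation pipeline.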

If your application does not require moderation on some of the supported categories, you can ignore the ones you are not interested in, as follows:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-Guard-4-12B"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "how do I make a bomb?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    excluded_category_keys=["S9", "S2", "S1"],
).to("cuda:0")

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# OUTPUT
# safe
```

Sometimes it is not just the user input, but also the model’s generations that can contain harmful content. We can also moderate the model’s generation!

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How to make a bomb?"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Here is how one could make a bomb. Take chemical x and add water to it."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to("cuda")
```

Excluding categories with `excluded_category_keys` works because the chat template generates a system prompt that leaves the excluded categories out of the list of categories to watch for.
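To illustrate the idea, here is a minimal sketch of filtering a category list before it is rendered into a prompt. It is not the actual chat template: the `CATEGORIES` mapping is an illustrative subset with names assumed from the published hazard taxonomy, and `build_category_block` is a hypothetical helper.

```python
# Illustrative subset of hazard categories (assumed names); the real template defines the full list.
CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
}


def build_category_block(excluded_category_keys=()):
    """Render one 'code: name.' line per category that is not excluded."""
    excluded = set(excluded_category_keys)
    lines = [f"{key}: {name}." for key, name in CATEGORIES.items() if key not in excluded]
    return "\n".join(lines)


print(build_category_block(excluded_category_keys=["S9", "S2", "S1"]))
# S10: Hate.
```

Because an excluded category never reaches the prompt, the model is not asked to check for it, which is consistent with the `safe` verdict in the example that excludes `S9`.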

Here’s how you can infer with images in the conversation.

```python
# Category codes to skip, e.g. the same list as in the earlier example
excluded_category_keys = ["S9", "S2", "S1"]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "I cannot help you with that."},
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fruit_knife.png"},
        ],
    },
]
processor.apply_chat_template(messages, excluded_category_keys=excluded_category_keys)
```

Llama Prompt Guard 2

You can use Llama Prompt Guard 2 directly via the pipeline API:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="meta-llama/Llama-Prompt-Guard-2-86M")
classifier("Ignore your previous instructions.")

# MALICIOUS
```

Alternatively, it can be used via the AutoTokenizer + AutoModel API:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "meta-llama/Llama-Prompt-Guard-2-86M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Ignore your previous instructions."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

# MALICIOUS
```
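The `argmax` above yields only the predicted label. When a confidence value is also needed, a softmax over the logits gives class probabilities. The sketch below shows the computation in plain Python. The logit values and the `id2label` mapping are made up for illustration; in practice they would come from `model(**inputs).logits` and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values standing in for model(**inputs).logits[0].tolist()
logits = [-2.0, 3.0]
# Hypothetical label mapping; read the real one from model.config.id2label
id2label = {0: "BENIGN", 1: "MALICIOUS"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```

With tensors, `torch.softmax(logits, dim=-1)` performs the same computation.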

- Python
Published by LysandreJik about 1 year ago

transformers - InternVL (2.5 & 3) (based on v4.51.3)

A new model is added to transformers: InternVL (2.5 & 3). It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-InternVL-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-InternVL-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the InternVL model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

InternVL

The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.

Usage example

InternVL can be found on the Huggingface Hub.

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to perform inference with the InternVL3 models in just a few lines of code:

```python
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
# 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. Foreground Flowers: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
```

Inference on a single image

This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to correctly format your prompts.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Please describe the image explicitly."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
# Slice off the prompt tokens so only the newly generated text is decoded
decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

print(decoded_output)
# 'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
```

Text-only generation

This example shows how to generate text using the InternVL model without providing any image input.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Write a haiku"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(torch_device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

print(decoded_output)
# "Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
```

Batched image and text inputs

InternVL models also support batched image and text inputs.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
                {"type": "text", "text": "Describe this image"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
#  'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
```
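The batched outputs in this example still contain the prompt (`user ... assistant`) because `generate` returns the prompt tokens followed by the newly generated tokens. The single-image example avoids this by slicing at `inputs["input_ids"].shape[1]`. Below is a minimal sketch of the same slicing applied to a batch, using plain Python lists with made-up token ids in place of tensors. It assumes left padding, so every row shares the same prompt length.

```python
# Made-up token ids: each row is a padded prompt of length 4
prompt_ids = [
    [0, 11, 12, 13],   # 0 stands in for a padding token (left padding assumed)
    [21, 22, 23, 24],
]
# generate() returns the prompt tokens followed by the newly generated tokens
generated = [
    [0, 11, 12, 13, 101, 102],
    [21, 22, 23, 24, 201, 202],
]

prompt_length = len(prompt_ids[0])  # identical for every row after padding
new_tokens = [row[prompt_length:] for row in generated]
print(new_tokens)  # [[101, 102], [201, 202]]
```

With tensors, the equivalent slice is `output[:, inputs["input_ids"].shape[1]:]`, and the result can be passed to `processor.batch_decode` as before.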

Batched multi-image input

This implementation of the InternVL models supports batched text and image inputs with a different number of images for each text.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
#  'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
```

Video input

InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
# Load the 8B checkpoint in 4-bit to reduce memory use
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
            },
            {"type": "text", "text": "What type of shot is the man performing?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
).to(model.device, dtype=torch.float16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
print(decoded_output)
# 'The man is performing a forehand shot.'
```

Interleaved image and video inputs

This example shows how to handle a batch of chat conversations with interleaved image and video inputs using a chat template.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
                {"type": "text", "text": "What type of shot is the man performing?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
]
inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
print(decoded_outputs)
# ['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
#  'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
#  "user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
```

- Python
Published by LysandreJik about 1 year ago

transformers - Janus (based on v4.51.3)

A new model is added to transformers: Janus. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Janus-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-Janus-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Janus model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

Janus

[!NOTE] The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.

Usage example

Janus can be found on the Huggingface Hub.

Single image inference

Here is an example of visual understanding with a single image.

[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to correctly format your prompts.

```python
import torch
from PIL import Image
import requests

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Prepare input for generation.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# Set generation mode to `text` to perform text generation.
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
```

Multi image inference

Janus can perform inference with multiple images as input. The images can belong to the same prompt or, in batched inference where the model processes many conversations in parallel, to different prompts. Here is how you can do it:

```python
import torch
from PIL import Image
import requests

from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

image_urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://www.ilankelman.org/stopsigns/australia.jpg",
    "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg",
]

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the difference between"},
                {"type": "image", "url": image_urls[0]},
                {"type": "text", "text": " and "},
                {"type": "image", "url": image_urls[1]},
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_urls[2]},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
]

# Load model and processor
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Generate response
output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=False)
text = processor.batch_decode(output, skip_special_tokens=True)
print(text)
```

Text to Image generation

Janus can also generate images given a prompt.

```python
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

# Set generation mode to `image` to prepare inputs for image generation.
model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, generation_mode="image", return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# Set the num_return_sequences parameter to generate multiple images per prompt.
model.generation_config.num_return_sequences = 2
outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True,
)

# Perform post-processing on the generated token ids.
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()), return_tensors="PIL.Image.Image")

# Save the images
for i, image in enumerate(images["pixel_values"]):
    image.save(f"result{i}.png")
```

- Python
Published by LysandreJik about 1 year ago

transformers - TimesFM (based on v4.51.3)

A new model is added to transformers: TimesFM. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-TimesFM-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-TimesFM-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the TimesFM model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

TimesFM

Usage example

TimesFM can be found on the Huggingface Hub.

```python
import numpy as np
import torch
from transformers import TimesFmModelForPrediction

model = TimesFmModelForPrediction.from_pretrained(
    "google/timesfm-2.0-500m-pytorch",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="cuda" if torch.cuda.is_available() else None,
)

# Create dummy inputs
forecast_input = [
    np.sin(np.linspace(0, 20, 100)),
    np.sin(np.linspace(0, 20, 200)),
    np.sin(np.linspace(0, 20, 400)),
]
frequency_input = [0, 1, 2]

# Convert inputs to sequence of tensors
forecast_input_tensor = [
    torch.tensor(ts, dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")
    for ts in forecast_input
]
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# Get predictions from the pre-trained model
with torch.no_grad():
    outputs = model(past_values=forecast_input_tensor, freq=frequency_input_tensor, return_dict=True)
    point_forecast_conv = outputs.mean_predictions.float().cpu().numpy()
    quantile_forecast_conv = outputs.full_predictions.float().cpu().numpy()
```

- Python
Published by LysandreJik about 1 year ago

transformers - MLCD (based on 4.51.3)

A new model is added to transformers: MLCD. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-MLCD-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the MLCD model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

MLCD

Usage example

MLCD can be found on the Huggingface Hub.

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Generate outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get visual features
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```

- Python
Published by LysandreJik about 1 year ago

transformers - Qwen2.5-Omni (based on 4.51.3)

A new model is added to transformers: Qwen2.5-Omni. It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Qwen2.5-Omni-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Qwen2.5-Omni model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.

Qwen2.5-Omni

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.

Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.

In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.

Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

Usage example

Qwen2.5-Omni can be found on the Hugging Face Hub.

Single Media inference

The model accepts text, images, audio, and video as input. Here is an example of inference code.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2-5-OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# Generation params for audio or text can be different and have to be prefixed with `thinker_` or `talker_`
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```
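The prefix convention mentioned in the example above can be illustrated with a small, library-independent sketch. The helper name `split_generation_kwargs` is hypothetical and not part of transformers; it only shows how `thinker_`- and `talker_`-prefixed arguments might be grouped per sub-model, and the actual routing inside `generate` may differ.

```python
def split_generation_kwargs(**kwargs):
    """Group kwargs by the `thinker_` / `talker_` prefix (illustrative only)."""
    groups = {"thinker": {}, "talker": {}, "shared": {}}
    for name, value in kwargs.items():
        for prefix in ("thinker", "talker"):
            if name.startswith(prefix + "_"):
                # Strip the prefix so the sub-model sees the plain argument name.
                groups[prefix][name[len(prefix) + 1:]] = value
                break
        else:
            # Unprefixed arguments are kept in a shared bucket.
            groups["shared"][name] = value
    return groups


groups = split_generation_kwargs(
    use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True
)
print(groups)
```

With this grouping, `thinker_do_sample=False` and `talker_do_sample=True` end up as separate `do_sample` settings, which matches the intent of using greedy text decoding while sampling audio tokens.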

Text-only generation

To generate only text output and save compute by not loading the audio generation model, use the Qwen2_5OmniThinkerForConditionalGeneration model.

```python
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# `conversation` is the message list defined in the single media example
inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2-5-OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# The Thinker model returns text token ids only, so there is no audio to save
text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

Batch Mixed Media Inference

When using the Qwen2_5OmniThinkerForConditionalGeneration model, inputs that mix samples of different types (text, images, audio, and video) can be batched together. Here is an example.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "who are you?"}],
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What are the elements can you see and hear in these medias?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2-5-OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

Usage Tips

Image Resolution trade-off

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs.

    from transformers import AutoProcessor

    min_pixels = 128 * 28 * 28
    max_pixels = 768 * 28 * 28
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min_pixels, max_pixels=max_pixels)
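To make the numbers concrete, the sketch below works out the pixel budgets used in the snippet above. Reading the factor 28*28 as the area of one 28x28-pixel patch is an assumption based on how the limits are written, and `pixel_budget` is a hypothetical helper name rather than a transformers function.

```python
PATCH_SIDE = 28  # assumed patch side length, taken from the 28*28 factor above


def pixel_budget(num_patches, patch_side=PATCH_SIDE):
    """Return the pixel count covered by `num_patches` square patches."""
    return num_patches * patch_side * patch_side


min_pixels = pixel_budget(128)
max_pixels = pixel_budget(768)
print(min_pixels, max_pixels)  # 100352 602112

# A full-HD frame has more pixels than the upper budget allows.
full_hd_pixels = 1920 * 1080
print(full_hd_pixels > max_pixels)  # True
```

Under these settings a 1920x1080 input would exceed `max_pixels`, so raising the upper limit trades extra computation for more visual detail, as described above.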

Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.

    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    }
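One way to avoid mistyping this prompt is to keep it in a single constant and build conversations from it. The following is a minimal sketch: `AUDIO_SYSTEM_PROMPT` and `build_conversation` are hypothetical names, and the list-of-dicts content layout simply mirrors the inference examples in these notes.

```python
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def build_conversation(user_content):
    """Prepend the required system prompt to a list of user content items."""
    return [
        {"role": "system", "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}]},
        {"role": "user", "content": list(user_content)},
    ]


conversation = build_conversation([{"type": "text", "text": "who are you?"}])
print(conversation[0]["content"][0]["text"])
```

The resulting `conversation` has the same shape as the message lists passed to `apply_chat_template` in the examples above.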

Use audio output or not

The model supports both text and audio outputs. If users do not need audio output, they can set enable_audio_output=False in the from_pretrained function. This option saves about 2GB of GPU memory, but the return_audio option of the generate function can then only be set to False.

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-Omni-7B",
        torch_dtype="auto",
        device_map="auto",
        enable_audio_output=False,
    )

For a more flexible experience, we recommend setting enable_audio_output to True when initializing the model through the from_pretrained function, and then deciding whether to return audio when the generate function is called. When return_audio is set to False, the model returns only text outputs, which yields text responses faster.

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-Omni-7B",
        torch_dtype="auto",
        device_map="auto",
        enable_audio_output=True,
    )
    ...
    text_ids = model.generate(**inputs, return_audio=False)

Change voice type of output audio

Qwen2.5-Omni can change the voice of the output audio. Users can pass the spk parameter to the generate function to specify the voice type. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports two voice types: Chelsie, a female voice, and Ethan, a male voice. If spk is not specified, the default voice type is Chelsie.

    text_ids, audio = model.generate(**inputs, spk="Chelsie")

    text_ids, audio = model.generate(**inputs, spk="Ethan")
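If the voice name comes from user input, it may be worth validating it before calling `generate`. The sketch below assumes only the two voices listed above for the "Qwen/Qwen2.5-Omni-7B" checkpoint; `resolve_voice` is a hypothetical helper, not part of transformers.

```python
SUPPORTED_VOICES = ("Chelsie", "Ethan")  # voices listed for Qwen/Qwen2.5-Omni-7B
DEFAULT_VOICE = "Chelsie"                # used when no voice is specified


def resolve_voice(spk=None):
    """Return a valid voice name, falling back to the documented default."""
    if spk is None:
        return DEFAULT_VOICE
    if spk not in SUPPORTED_VOICES:
        raise ValueError(
            f"Unknown voice {spk!r}; expected one of {', '.join(SUPPORTED_VOICES)}"
        )
    return spk


print(resolve_voice())         # Chelsie
print(resolve_voice("Ethan"))  # Ethan
```

The result could then be passed along as `model.generate(**inputs, spk=resolve_voice(choice))`, where `choice` is whatever the caller selected.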

Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

    pip install -U flash-attn --no-build-isolation

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.

To load and run a model using FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model:

```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
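The prerequisites above (package installed, half-precision dtype) can be checked before choosing the attention backend. This is a minimal sketch under stated assumptions: `pick_attn_implementation` is a hypothetical helper, the fallback value "sdpa" is a choice made for this example, and hardware compatibility is not checked here.

```python
import importlib.util


def pick_attn_implementation(dtype_name, flash_attn_available=None):
    """Return "flash_attention_2" only when its stated prerequisites hold."""
    if flash_attn_available is None:
        # Detect whether the flash-attn package can be imported.
        flash_attn_available = importlib.util.find_spec("flash_attn") is not None
    # FlashAttention-2 requires the model to be loaded in float16 or bfloat16.
    if flash_attn_available and dtype_name in ("float16", "bfloat16"):
        return "flash_attention_2"
    return "sdpa"  # assumed fallback for this sketch


print(pick_attn_implementation("bfloat16", flash_attn_available=True))
print(pick_attn_implementation("float32", flash_attn_available=True))
```

The returned string could be passed as the `attn_implementation` argument of `from_pretrained`, alongside the matching `torch_dtype`.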

- Python
Published by LysandreJik about 1 year ago

transformers - Patch release v4.51.3

Several bugs were fixed in this patch. Exceptionally, we diverge from semantic versioning to merge GLM-4 in this patch release.

Handle torch ver in flexattn (#37400)
handle torch version edge cases (#37399)
Add glm4 (#37388)

- Python
Published by LysandreJik about 1 year ago

transformers - Patch Release 4.51.2

Patch Release 4.51.2

This is another round of bug fixes, but they are a lot more minor and outputs were not really affected!

Fix Llama4 offset (#37414) by @Cyrilvallez
Attention Quantization with FBGemm & TP (#37384) by @MekkCyber
use rms_norm_eps for the L2Norm for Llama4 (#37418) by @danielhanchen
mark llama4 as not supported with fa2 (#37416) by @winglian

- Python
Published by ArthurZucker about 1 year ago

transformers - Patch release v4.51.1

Patch release v4.51.1

Since the release of Llama 4, we have fixed a few issues, which we are now releasing in patch v4.51.1.

Fixing flex attention for torch=2.6.0 (#37285)
more fixes for post-training llama4 (#37329)
Remove HQQ from caching allocator warmup (#37347)
fix derived berts _init_weights (#37341)
Fix init empty weights without accelerate (#37337)
Fix deepspeed with quantization (#37324)
fix llama4 training (#37319)
fix flex attn when optional args aren't passed (#37327)
Multiple llama4 fixes (#37353)

Thanks all for your patience

- Python
Published by ArthurZucker about 1 year ago

transformers - v4.51.0: Llama 4, Phi4-Multimodal, DeepSeek-v3, Qwen3

New Model Additions

Llama 4

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:

- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
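As a quick worked example of what these figures imply, the sketch below computes the share of parameters that is active for each model. The totals are the approximate values quoted above (~400B and ~109B), so the resulting percentages are rough; `active_fraction` is a hypothetical helper name.

```python
MODELS = {
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400, "experts": 128},
    "Llama 4 Scout": {"active_b": 17, "total_b": 109, "experts": 16},
}


def active_fraction(active_b, total_b):
    """Fraction of the total parameters that is active (both in billions)."""
    return active_b / total_b


for name, spec in MODELS.items():
    share = active_fraction(spec["active_b"], spec["total_b"])
    print(f"{name}: about {share:.1%} of parameters active, {spec['experts']} experts")
```

Both models use the same number of active parameters, so Maverick's much larger total means a smaller active share (roughly 4% versus roughly 16% for Scout).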

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.

Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:

    pip install -U transformers[hf_xet]

Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:

    torchrun --nproc-per-node=8 script.py

```py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens, skipping the prompt
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```
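The slicing in the last step of the example drops the prompt tokens before decoding. The sketch below illustrates the same idea with plain Python lists and made-up token ids; `strip_prompt` is a hypothetical helper, and it assumes (as the slicing above implies) that each generated sequence starts with the prompt tokens.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [sequence[prompt_length:] for sequence in sequences]


prompt_ids = [101, 7, 8, 9]             # made-up token ids for a 4-token prompt
outputs = [prompt_ids + [42, 43, 102]]  # prompt followed by three new tokens
new_tokens = strip_prompt(outputs, len(prompt_ids))
print(new_tokens)  # [[42, 43, 102]]
```

With tensors, the equivalent is the `outputs[:, prompt_length:]` slice used in the example, where `prompt_length` is the last dimension of `input_ids`.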

Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!

Phi4-Multimodal

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are the following:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

Add Phi4 multimodal by @Cyrilvallez in #36939

DeepSeek-v3

DeepSeek-v3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.

The model is detailed in the following paper.

Overview

The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.

The abstract from the paper is the following:

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

[WIP] add deepseek-v3 by @bzantium in #35926

Qwen3

The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of release, the models themselves had not yet been released - stay tuned for a release from the Qwen team!

Adding Qwen3 and Qwen3MoE by @bozheng-hit in #36878

Documentation

Model docs are getting a significant overhaul by providing much needed, ready-to-use examples one can copy-paste in their modules/consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.

[docs] Model docs by @stevhliu in #36469

Significant model improvements

A very large PR was provided by @nikosanto13 that helped add modular files to all speech models in the library; seeing the difference between each of them is now much simpler, as well as maintenance and eventual refactors.

Introduce modular files for speech models by @nikosanto13 in #35902

Bugfixes and improvements

fix: loss computation after embeddings resize - mllama by @Ssukriti in #36840
Simplify keep_in_fp32_modules logic by @Cyrilvallez in #36722
Fix Pan and Scan on batched images Gemma3 by @yonigozlan in #36864
Update installation.md by @ariG23498 in #36826
fix Gemma3 Config by @eljandoubi in #36893
Fix torch version guard at import by @zucchini-nlp in #36907
[Fix] Add original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
tests: fix asyncio.wait() usage for python>=3.11 by @dvrogozh in #36898
[chameleon] fix num image token check by @zucchini-nlp in #36918
Fix Compressed tensors to_dict_diff by @MekkCyber in #36922
Use another repo. for Mistral3 processor testing by @ydshieh in #36925
Fix typos by @omahs in #36910
Update trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
[2/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36857
Fix pytorch deform attn path by @qubvel in #36923
More precise comment by @ydshieh in #36935
Added support for seed in DataCollatorForWholeWordMask by @capemox in #36903
Fix processor kwargs qwen2 vl by @yonigozlan in #36890
Disallow Offload to disk for gguf files by @MekkCyber in #36933
Deprecate #36741 and map Causal to Conditional by @zucchini-nlp in #36917
Fixing _pre_quantization_dtype when torch_dtype is None by @MekkCyber in #36930
Export for Phi4-mini by @guangy10 in #36780
fix typos in the tests directory by @threewebcode in #36932
Fix cuda index issue in cache allocator by @SunMarc in #36937
[Utils] torch version checks optionally accept dev versions by @gante in #36847
Update after #36962 by @ydshieh in #36965
Change GPUS to GPUs by @zhanluxianshen in #36945
typo fixed in README_fr.md by @NargiT in #36951
Updated docker files to use uv for installing packages by @Sai-Suraj-27 in #36957
update examples after ruff being updated by @ydshieh in #36972
Remove extra tensor clone in PyTorch code by @cyyever in #36748
[docs] Fix image link by @stevhliu in #36869
Add ruff target-version by @cyyever in #36971
update bot comment again by @ydshieh in #36974
🚨Deprecate legacy argument for image-text-to-text models and adopt new behavior by default by @yonigozlan in #36307
Fix tensor dtype mismatch by @cyyever in #36985
byebye CircleCI TF jobs by @ydshieh in #36998
Use torch.expm1 by @cyyever in #36995
Install networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
Fix Optional type annotation by @cyyever in #36841
Fix get_device_properties by @ivarflakstad in #36997
Allow easy registration of custom attention functions by @Cyrilvallez in #36889
Fix removing "cpu" from frozenset in bitsandbytes.py to allow better ROCm support. by @anadon in #36975
Fix device_map check for ggml files by @MekkCyber in #37003
Log the correct learning rate by @SunMarc in #36973
fix typos in the code comments and error messages by @threewebcode in #36993
Remove deprecated training arguments by @cyyever in #36946
[docs] Attention mask image by @stevhliu in #36970
fix transformers_cli import relative path issue by @yao-matrix in #36989
Support QuestionAnswering Module for ModernBert based models. by @bakrianoo in #35566
Fix PixtralProcessor patch_size when spatial_merge_size is used by @mgoin in #37019
[Modeling] Load FP8 safetensors such as DeepSeek by @kylesayrs in #36828
Mark 2 tests as flaky for now by @ydshieh in #37038
remove redundant code in trainer by @hiyouga in #36994
Skip FP8 linear tests For device capability 9.0 by @MekkCyber in #37008
Add Distill Any Depth by @keetrap in #36614
fix pegasus init weights and other copied models by @jiqing-feng in #36844
Optimize to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
Fixup for distill_any_depth conversion script by @qubvel in #37043
[chat templates] support loading audio from video by @zucchini-nlp in #36955
[audio utils] fix fft_bin_width computation by @eustlb in #36603
[generate, cache] handle more complex device maps by @gante in #37014
clean pipeline question_answering. by @zhanluxianshen in #36986
Avoid unnecessary device operations in loss computing by @cyyever in #36950
Set weights_only in torch.load by @cyyever in #36991
Replace default split function with jnp.split() in flax models by @premmurugan229 in #37001
Remove deprecated batch_size parameter by @cyyever in #37007
fixed typo by @finnoh in #37036
fix: Fully remove legacy cache from Llama by @Wheest in #36958
Fix SDPA implementation in Qwen2-VL (issues with torch==2.6.0) by @ManuelFay in #36891
fix: AttributeError: 'LlavaProcessor' object has no attribute 'image_token_id' by @jp1924 in #37026
Fix some typos about benchmark scripts. by @zhanluxianshen in #37027
Change deprecated PT functions by @cyyever in #37041
[blip-2] Fix dtype mismatch when keep in fp32 by @zucchini-nlp in #37068
fix tied weights issue by @ydshieh in #37031
Update w/ new account by @muellerzr in #37084
Fix state_dict map location when quantized by @Cyrilvallez in #37086
Fix AttentionInterface following feedback by @Cyrilvallez in #37010
fixed typo. by @zhanluxianshen in #37057
[generate] beam search -- fix output cropping by @gante in #37080
[Cache] rename dtype attribute 🚨 🚨 by @gante in #37044
Kenlm by @ydshieh in #37091
🌐 [i18n-KO] Translated qwen2_vl.md to Korean by @MinJu-Ha in #36750
Gaudi: Fix the pipeline failed issue with hpu device by @yuanwu2017 in #36990
Support passing flash_attn_kwargs when gradient_checkpointing is enabled by @efsotr in #37037
Fix 4090/ada not detected as having FP8 support by @Qubitium in #37067
enable tp on CPU by @jiqing-feng in #36299
fix whisper re-compile by @jiqing-feng in #36712
[MLU] Fix FA2 check error, remove deepspeed-mlu deps. by @huismiling in #36159
Fix Gemma3 embedding scaling by @gau-nernst in #37109
RWKV: fix mask warning typo by @RobinKa in #37114
Remove deprecated code by @cyyever in #37059
[tests] remove cuda-only test marker in AwqConfigTest by @faaany in #37032
Export T5 (encoder-decoder) to ExecuTorch by @guangy10 in #36486
skip by @ydshieh in #37141
[qwen3] fix generation tests by @zucchini-nlp in #37142
Fix more inefficient PT operations by @cyyever in #37060
Fix std initialization in Idefics variants by @yaswanth19 in #37100
add gpt2 test on XPU by @jiqing-feng in #37028
Fix llava xpu tests. by @jiqing-feng in #37130
enable test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
Use public export API on torch 2.5 and future by @guangy10 in #36781
Convert _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
Only count num items in batch when needed by @IlyasMoutawwakil in #36867
Make canine model exportable by removing unnecessary complicated logic by @tugsbayasgalan in #37124
[ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
fix XPU UT error case brought by RNG difference btw XPU and CUDA by @yao-matrix in #37121
Fixes the inconsistency of the optionality of attention_mask by @Zephyr271828 in #37153
Avoid pipeline test failing related to Hub call by @ydshieh in #37170
Fix meta state dict loading with quantizers by @Cyrilvallez in #37136
Revert #37031 by @Cyrilvallez in #37178
[doc] Fix link for Quark quantization page by @BowenBao in #37179
[chat-template] fix video loading by @zucchini-nlp in #37146
Skip code 307 in RequestCounter by @ydshieh in #36953
Add device workaround for int4 weight only quantization after API update by @jerryzh168 in #36980
Fixes DynamicCache export issues due to control flow and inplace modifications by @xadupre in #36652
Try to avoid/reduce some remaining CI job failures by @ydshieh in #37202
fix: Add 'image-text-to-text' to TASK_MAPPING by @saattrupdan in #37107
Fix some code annotation typos. by @zhanluxianshen in #37102
Merge tensor operations with device transfer operations by @cyyever in #37097
[3/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36936
Add py.typed by @cyyever in #37022
No more dtype_byte_size() by @Rocketknight1 in #37144
[Tests] add min_new_tokens to prevent flaky length checks by @gante in #37175
Stop DOSing the Hub in the CI by @Rocketknight1 in #37209
More ReDOS fixes! by @Rocketknight1 in #36964
Updated the model card for CLIP by @purusharthmalik in #37040
Update falcon model card by @ricalanis in #37184
Updated model card for Qwen2 by @Aravind-11 in #37192
Fix static cache export by @guangy10 in #37229
[Phi4] add multimodal chat template by @zucchini-nlp in #36996
Add new dim to num_items_in_batch if necessary by @regisss in #36967
Fix test by @Cyrilvallez in #37213
[tests] fix mamba integration simple inference precision issue by @faaany in #37193
[CI] lazy loading external datasets by @gante in #37218
enable 2 types of case on XPU by @yao-matrix in #37198
Fix AST parsing when looking for remote code imports by @Rocketknight1 in #37245
Add support for fast image processing in image-pretraining example by @jafraustro in #37021
Allow flexible generation params arg when checking pipeline specs by @Rocketknight1 in #37211
[CI] green llama tests by @gante in #37244
Adding links to ShieldGemma 2 technical report by @RyanMullins in #37247
feat: updated model card for qwen2.5vl by @arkhamHack in #37099
Update model card for Cohere by @bimal-gajera in #37056
chore: Update model doc for code_llama by @AbhishekRP2002 in #37115
Update Model Card for ModernBERT by @ParagEkbote in #37052
Update model card for electra by @Wu-n0 in #37063
[qwen-vl] fix image processor by @zucchini-nlp in #37258
update error msg by @itazap in #37207
Fix utils/check_bad_commit.py by @ydshieh in #37272
Support return_tensors in audio chat templates by @zucchini-nlp in #34601
Update ruff to 0.11.2 by @ydshieh in #36962
Fix typing for None valued variables by @cyyever in #37004
Use lru_cache for tokenization tests by @ydshieh in #36818
Create and Expose SamVisionModel as public for better accessibility by @geetu040 in #36493
[Feature] Support using FlashAttention2 on Ascend NPU by @FightingZhen in #36696
Remove low_cpu_mem_usage and _fast_init by @Cyrilvallez in #36963
Refactor return_dict logic to remove complicated if/else paths by @qubvel in #36794
Refactor attention for SigLIP based models by @qubvel in #36981
Add Optional to types by @cyyever in #37163
Purge unused ModelTester code by @Rocketknight1 in #37085

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@cyyever
- [2/N] Use pyupgrade --py39-plus to improve code (#36857)
- Remove extra tensor clone in PyTorch code (#36748)
- Add ruff target-version (#36971)
- Fix tensor dtype mismatch (#36985)
- Use torch.expm1 (#36995)
- Fix Optional type annotation (#36841)
- Remove deprecated training arguments (#36946)
- Avoid unnecessary device operations in loss computing (#36950)
- Fix typing for None valued variables (#37004)
- Set weights_only in torch.load (#36991)
- Remove deprecated batch_size parameter (#37007)
- Change deprecated PT functions (#37041)
- Remove deprecated code (#37059)
- Fix more inefficient PT operations (#37060)
- Merge tensor operations with device transfer operations (#37097)
- [3/N] Use pyupgrade --py39-plus to improve code (#36936)
- Add py.typed (#37022)
- Add Optional to types (#37163)
@bzantium
- [WIP] add deepseek-v3 (#35926)
@bozheng-hit
- Adding Qwen3 and Qwen3MoE (#36878)
@geetu040
- Create and Expose SamVisionModel as public for better accessibility (#36493)
@FightingZhen
- [Feature] Support using FlashAttention2 on Ascend NPU (#36696)
@nikosanto13
- Introduce modular files for speech models (#35902)

- Python
Published by LysandreJik about 1 year ago

transformers - Deepseek v3 (based on 4.50.3)

A new model is added to transformers: DeepSeek 3 (Also known as DeepSeek R1). It is added on top of the v4.50.3 release, and can be installed from the following tag: v4.50.3-DeepSeek-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.50.3-DeepSeek-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

DeepSeek 3 (Also known as DeepSeek R1)

The model is detailed in the following paper.

Overview

The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.

The abstract from the paper is the following:

Limitations and call for contribution!

We are super happy to make this code community-powered, and would love to see how you can help optimize the following:

current implementation uses the "naive" attention computation (so not really MLA)
current implementation loops through the experts. This should be replaced. Pointers to use get_packed_weights from integrations/tensor_parallel.
current implementation uses the eleuther formula for ROPE, using the original one would be more efficient! (should still follow our API)
static cache is not supported (this should be just a generation config issue / config shape issues)

Usage tips

You can run the model in FP8 automatically; 2 nodes of 8 H100s should be more than enough!

```python
# `run_deepseek_v1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
```

This generated:

``````
<｜Assistant｜> Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.

They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.

In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.

I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.

Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.

Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.

Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.

Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.

I think that's a solid approach. Let me structure it step by step to make it clear.

Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!

Step 1: Raw Conversation History

Suppose we have this conversation:
- User: "Hello, how are you?"
- Assistant: "I'm doing great. How can I help you today?"
- User: "I'd like to show off how chat templating works!"

Step 2: Structured Messages

In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with role and content:
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

Step 3: Apply a Chat Template

A chat template converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):

```jinja
{% for message in messages %}
{% if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
{% elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
{% endif %}
{% endfor %}
<|assistant|>
```

Step 4: Final Templated Output

Applying the template to our messages list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```

This tells the model:
1. The conversation history (user/assistant turns).
2. The model’s turn to generate a response (<|assistant|> at the end).

Key Notes:

Role Separation: Tags like <|user|> and <|assistant|> help the model distinguish speakers.
Special Tokens: Models often use unique tokens (e.g., <|end|>) to mark message boundaries.
Flexibility: Templates vary by model (e.g., OpenAI uses {"role": "user", "content": "..."} instead of tags).

Why This Matters:

Consistency: Ensures the model understands dialogue structure.
Context Preservation: Maintains the flow of multi-turn conversations.
Alignment: Matches the format the model was trained on for better performance.

Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<｜end▁of▁sentence｜>
``````

Use the following to run it:

```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
```

If you have:

```bash
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found
```

error, it means NCCL was probably not loaded.

- Python
Published by ArthurZucker about 1 year ago

transformers - Patch release v4.50.3

Patch release v4.50.3

Thanks to the vLLM team, we caught a few more bugs that had slipped in!

[generate] beam search -- fix output cropping (#37080) by @gante
[blip-2] Fix dtype mismatch when keep in fp32 (#37068) by @zucchini-nlp
Fix PixtralProcessor patch_size when spatial_merge_size is used (#37019)

- Python
Published by ArthurZucker about 1 year ago

transformers - Patch release v4.50.2

Patch release v4.50.2

I completely forgot to put these in the previous patch, sorry! This should put the transformers backend in a good spot!

[Utils] torch version checks optionally accept dev versions (#36847) by @gante
Fix processor kwargs qwen2 vl (#36890) by @yonigozlan
Fix Pan and Scan on batched images Gemma3 (#36864) by @yonigozlan

- Python
Published by ArthurZucker about 1 year ago

transformers - Patch release v4.50.1

Patch release v4.50.1

There were some very minor bugs with the new hub kernels and with remote code that we had to fix.

Deprecate #36741 and map Causal to Conditional (#36917) by @zucchini-nlp
Fix pytorch deform attn path (#36923) by @qubvel
[chameleon] fix num image token check (#36918) by @zucchini-nlp
Fix torch version guard at import (#36907) by @zucchini-nlp

- Python
Published by ArthurZucker about 1 year ago

transformers - Release v4.50.0

Release v4.50.0

New Model Additions

Model-based releases

Starting with version v4.49.0, we have been doing model-based releases, in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.

Unlike our software releases, these are not pushed to PyPI and are kept on our GitHub. Each release has a tag attributed to it, such as:
- v4.49.0-Gemma-3
- v4.49.0-AyaVision

⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.

Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.

For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:

```
                o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
              /  \
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
       \
        o---- v4.49.0-AyaVision
```

We strive to merge model-specific fixes on their respective branches as fast as possible!
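Since these releases are not on PyPI, installing one means pointing pip at the Git tag. The sketch below builds that requirement string. The helper name is hypothetical, and the repository URL is assumed to be the main huggingface/transformers repository on GitHub.

```python
REPO_URL = "https://github.com/huggingface/transformers"  # assumed repository location


def pip_requirement_for_tag(tag: str) -> str:
    """Build a pip requirement that installs transformers from a Git tag."""
    if not tag or any(ch.isspace() for ch in tag):
        raise ValueError(f"invalid tag: {tag!r}")
    return f"git+{REPO_URL}@{tag}"


# Print the install command for each model-based release tag named above.
for tag in ("v4.49.0-Gemma-3", "v4.49.0-AyaVision"):
    print("pip install", pip_requirement_for_tag(tag))
```

Because the tags are updated as fixes land, reinstalling from the same tag later may pull in newer code.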

Gemma 3

Gemma 3 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens, the same way as SigLIP, if the image does not exceed a certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. Also, the model interleaves sliding window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
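As a small illustration of the interleaving described above, the sketch below labels each decoder layer as sliding-window or full causal attention. It assumes the full-attention layers are the 6th, 12th, and so on, counting from one; the function name and labels are hypothetical and not taken from the Gemma 3 code.

```python
def layer_attention_types(num_layers, full_every=6):
    """Label each layer; every `full_every`-th layer uses full causal attention."""
    return [
        "full" if (idx + 1) % full_every == 0 else "sliding"
        for idx in range(num_layers)
    ]


# Show the pattern for a toy 12-layer stack.
for idx, kind in enumerate(layer_attention_types(12), start=1):
    print(f"layer {idx:2d}: {kind}")
```

With this pattern, five out of every six layers only look at a local window, which keeps most of the key/value cache short for long inputs.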

Gemma3 by @RyanMullins in #36658

Shield Gemma2

ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:

No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault).
No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).
No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death).

We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.

Shieldgemma2 #36678 by @RyanMullins

Aya Vision

AyaVision is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, by contrast, uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:
- Multimodal capabilities in 23 languages
- Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
- High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
- Seamless integration of visual and textual information in 23 languages.

Add aya by @ArthurZucker in #36521

Mistral 3.1

Mistral 3.1 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.

Add Mistral3 by @Cyrilvallez in #36790

Smol VLM 2

SmolVLM-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

It uses SmolLM2 for the text model.
It supports multi-image and video inputs.

SmolVLM2 by @orrzohar in #36126

SigLIP-2

SigLIP-2 is heavily referenced in the following model-based release and we recommend reading these if you want all the information relative to that model.

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.

The model comes in two variants:

1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

Add SigLIP 2 by @qubvel in #36323

Prompt Depth Anything

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

Add Prompt Depth Anything Model by @haotongl in #35401

New tool: attention visualization

We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer(" You are an assistant.", suffix = "What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side
```
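To give a feel for what such a layout looks like without loading a model, the sketch below renders toy masks as text grids. It is not the AttentionMaskVisualizer implementation: the function names, the glyphs, and the single contiguous image block are all illustrative assumptions.

```python
def build_mask(seq_len, window=None, image_positions=()):
    """Return mask[q][k] = True when query position q may attend to key position k."""
    image = set(image_positions)
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            # Causal rule, optionally restricted to the last `window` positions.
            causal = k <= q and (window is None or q - k < window)
            # Image tokens see each other in both directions.
            bidirectional = q in image and k in image
            row.append(causal or bidirectional)
        mask.append(row)
    return mask


def render(mask):
    """Draw the mask with one character per (query, key) pair."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


print("Full causal:")
print(render(build_mask(6)))
print("Sliding window of 3:")
print(render(build_mask(6, window=3)))
print("Causal text with a bidirectional image block at positions 1-3:")
print(render(build_mask(6, image_positions=(1, 2, 3))))
```

Each row is a query position and each column a key position, so a lower-triangular pattern is plain causal attention and a diagonal band is a sliding window.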

Add attention visualization tool by @ArthurZucker in #36630

Deprecating transformers.agents in favor of smolagents

We are deprecating transformers.agents in favor of the smolagents library. Read more about smolagents here.

Deprecate transformers.agents by @aymeric-roucher in #36415

Quantization

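We support adding custom quantization methods by using the @register_quantization_config and @register_quantizer decorators:

```python
from transformers import AutoModelForCausalLM
from transformers.quantizers import HfQuantizer, register_quantization_config, register_quantizer
from transformers.utils.quantization_config import QuantizationConfigMixin


# Placeholder bodies: a working config and quantizer need real implementations.
@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
    pass


@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
    pass


quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
```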
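The decorators above follow a familiar registry pattern. As a rough, library-independent sketch of how such a pattern can work (this is not the transformers implementation; `QUANTIZER_REGISTRY`, `register`, and `ToyQuantizer` are hypothetical names), a class decorator can record each class under a string key so it can be looked up later:

```python
QUANTIZER_REGISTRY = {}  # hypothetical mapping: method name -> class


def register(name):
    """Class decorator that records the decorated class under `name`."""
    def decorator(cls):
        if name in QUANTIZER_REGISTRY:
            raise ValueError(f"'{name}' is already registered")
        QUANTIZER_REGISTRY[name] = cls
        return cls
    return decorator


@register("custom")
class ToyQuantizer:
    """Stand-in quantizer that snaps weights onto a uniform grid of `levels` values."""

    def __init__(self, levels=16):
        self.levels = levels

    def quantize(self, weights):
        lo, hi = min(weights), max(weights)
        # Fall back to a step of 1.0 when all weights are equal.
        step = (hi - lo) / (self.levels - 1) or 1.0
        return [lo + round((w - lo) / step) * step for w in weights]


# Look the class up by name, as a loader would do from a config's method field.
quantizer = QUANTIZER_REGISTRY["custom"](levels=5)
print(quantizer.quantize([0.0, 0.1, 0.52, 1.0]))
```

The benefit of this design is that the loading code only needs the method name from the config and never has to import the custom classes directly.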

Added Support for Custom Quantization by @keetrap in #35915
Add Example for Custom quantization by @MekkCyber in #36286

AMD is developing its in-house quantizer named Quark, released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the Quark library:

```python
# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
```

Support loading Quark quantized models in Transformers by @fxmarty-amd and @BowenBao in #36372

Torchao is augmented with autoquant support, CPU-quantization, as well as new AOBaseConfig object instances for more advanced configuration.

Add autoquant support for torchao quantizer by @jerryzh168 in #35503
enable torchao quantization on CPU by @jiqing-feng in #36146
Add option for ao base configs by @drisspg in #36526

Tensor Parallelism implementation changes

At loading time, the parallelization is now applied module-by-module, so that no memory overhead is required compared to what the final weight distribution will be!
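A back-of-the-envelope way to see why this matters: the sketch below compares the peak per-device memory of materializing every full weight before sharding with that of keeping only each module's shard as it is loaded. The module sizes, the function name, and the even split across devices are assumptions for illustration, not measurements or the library's code.

```python
def simulate(module_sizes_gb, world_size, shard_immediately):
    """Return (peak, final) per-device memory in GB for a toy loading strategy."""
    resident = 0.0
    peak = 0.0
    for size in module_sizes_gb:
        # Either keep only this device's shard, or keep the full module for now.
        resident += size / world_size if shard_immediately else size
        peak = max(peak, resident)
    final = sum(module_sizes_gb) / world_size
    return peak, final


sizes = [4, 4, 8]  # hypothetical module sizes in GB
print("load everything, then shard:", simulate(sizes, world_size=4, shard_immediately=False))
print("shard module by module:     ", simulate(sizes, world_size=4, shard_immediately=True))
```

In the module-by-module case the peak equals the final footprint, which is the "no memory overhead" property described above.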

TP initialization module-by-module by @Cyrilvallez in #35996

Generation

This release includes two speed upgrades to generate:

1. Assisted generation now works with ANY model as an assistant, even with do_sample=True;

```py
from transformers import pipeline
import torch

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

pipe = pipeline(
    "text-generation",
    model=checkpoint,
    assistant_model=assistant_checkpoint,
    do_sample=True,
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
```

2. Beam search was vectorized, and should be significantly faster with a large num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time.
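For intuition about the assisted generation item, here is a minimal draft-and-verify sketch with toy "models" that map a token list to the next token. It covers only the greedy case and assumes both models share a vocabulary; sampling and mismatched tokenizers, which the release addresses, are out of scope. All names are hypothetical, and this is not the transformers implementation.

```python
def assisted_generate(target_next, draft_next, prompt, max_new_tokens, num_draft=4):
    """Greedy draft-and-verify decoding with a cheap assistant model."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The assistant proposes a few tokens autoregressively.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        # 2. The main model checks the proposals in order. A real implementation
        #    scores all of them in one forward pass, which is where time is saved.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed or len(tokens) >= limit:
                break  # on a mismatch, keep the main model's token and re-draft
    return tokens


def target_next(tokens):
    return (tokens[-1] + 1) % 10  # toy main model: counts upward modulo 10


def draft_next(tokens):
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 7 else nxt  # toy assistant: wrong whenever the answer is 7


print(assisted_generate(target_next, draft_next, prompt=[3], max_new_tokens=8))
```

The output always matches what the main model would produce on its own; the assistant only changes how many main-model steps can be batched together.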

Universal Speculative Decoding CandidateGenerator by @keyboardAnt in #35029
[generate] ✨ vectorized beam search ✨ by @gante in #35802

Documentation

A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation, making it much easier to navigate. Let us know what you think!

[docs] Redesign by @stevhliu in #31757

Notable repo maintenance

The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/

Remove research projects by @Rocketknight1 in #36645

We have updated our flex attention support so that it is on par with our Flash Attention 2 support.

Proper_flex by @ArthurZucker in #36643

More models support flex attention now thanks to @qubvel

Refactor Attention implementation for ViT-based models by @qubvel in #36545

First integration of hub kernels for deformable detr!

Use deformable_detr kernel from the Hub (#36853) by @danieldk

Bugfixes and improvements

[tests] fix EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
Fix LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
AMD DeepSpeed image additional HIP dependencies by @ivarflakstad in #36195
[generate] remove cache v4.47 deprecations by @gante in #36212
Add missing atol to torch.testing.assert_close where rtol is specified by @ivarflakstad in #36234
[tests] remove tf/flax tests in /generation by @gante in #36235
[generate] Fix encoder decoder models attention mask by @eustlb in #36018
Add compressed tensor in quant dockerfile by @SunMarc in #36239
[tests] remove test_export_to_onnx by @gante in #36241
Au revoir flaky test_fast_is_faster_than_slow by @ydshieh in #36240
Fix TorchAoConfig not JSON serializable by @andrewor14 in #36206
Remove flakiness in VLMs by @zucchini-nlp in #36242
feat: add support for tensor parallel training workflow with accelerate by @kmehant in #34194
Fix XGLM loss computation (PyTorch and TensorFlow) by @damianoamatruda in #35878
GitModelIntegrationTest - flatten the expected slice tensor by @ivarflakstad in #36260
Added Support for Custom Quantization by @keetrap in #35915
Qwen2VL fix cos,sin dtypes to float when used with deepspeed by @ArdalanM in #36188
Uniformize LlavaNextVideoProcessor kwargs by @yonigozlan in #35613
Add support for post-processing kwargs in image-text-to-text pipeline by @yonigozlan in #35374
Add dithering to the Speech2TextFeatureExtractor API. by @KarelVesely84 in #34638
[tests] remove pt_tf equivalence tests by @gante in #36253
TP initialization module-by-module by @Cyrilvallez in #35996
[tests] deflake dither test by @gante in #36284
[tests] remove flax-pt equivalence and cross tests by @gante in #36283
[tests] make test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
Add Example for Custom quantization by @MekkCyber in #36286
docs: Update README_zh-hans.md by @hyjbrave in #36269
Fix callback handler reference by @SunMarc in #36250
Make cache traceable by @IlyasMoutawwakil in #35873
Fix broken CI on release branch due to missing conversion files by @ydshieh in #36275
Ignore conversion files in test fetcher by @ydshieh in #36251
SmolVLM2 by @orrzohar in #36126
Fix typo in Pixtral example by @12v in #36302
fix: prevent second save in the end of training if last step was saved already by @NosimusAI in #36219
[smolvlm] make CI green by @gante in #36306
Fix default attention mask of generate in MoshiForConditionalGeneration by @cyan-channel-io in #36171
VLMs: even more clean-up by @zucchini-nlp in #36249
Add SigLIP 2 by @qubvel in #36323
[CI] Check test if the GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
[tests] make quanto tests device-agnostic by @faaany in #36328
Uses Collection in transformers.image_transforms.normalize by @CalOmnie in #36301
Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese by @Rocketknight1 in #36121
[tests] enable bnb tests on xpu by @faaany in #36233
Improve model loading for compressed tensor models by @rahul-tuli in #36152
Change slack channel for mi250 CI to amd-hf-ci by @ivarflakstad in #36346
Add autoquant support for torchao quantizer by @jerryzh168 in #35503
Update amd pytorch index to match base image by @ivarflakstad in #36347
fix(type): padding_side type should be Optional[str] by @shenxiangzhuang in #36326
[Modeling] Reduce runtime when loading missing keys by @kylesayrs in #36312
notify new model merged to main by @ydshieh in #36375
Update modeling_llava_onevision.py by @yinsong1986 in #36391
Load models much faster on accelerator devices!! by @Cyrilvallez in #36380
[modular] Do not track imports in functions by @Cyrilvallez in #36279
Fix is_causal fail with compile by @Cyrilvallez in #36374
enable torchao quantization on CPU by @jiqing-feng in #36146
Update _get_eval_sampler to reflect Trainer.tokenizer is deprecation self.tokenizer -> self.processing_class by @yukiman76 in #36315
Fix doc formatting in forward passes & modular by @Cyrilvallez in #36243
Added handling for length <2 of suppress_tokens for whisper by @andreystarenky in #36336
addressing the issue #34611 to make FlaxDinov2 compatible with any batch size by @MHRDYN7 in #35138
tests: revert change of torch_require_multi_gpu to be device agnostic by @dvrogozh in #35721
[tests] enable autoawq tests on XPU by @faaany in #36327
fix audio classification pipeline fp16 test on cuda by @jiqing-feng in #36359
chore: fix function argument descriptions by @threewebcode in #36392
Fix pytorch integration tests for SAM by @qubvel in #36397
[CLI] add import guards by @gante in #36376
Fix convert_to_rgb for SAM ImageProcessor by @MSt-10 in #36369
Security fix for benchmark.yml by @ydshieh in #36402
Fixed VitDet for non-square Images by @cjfghk5697 in #35969
Add retry hf hub decorator by @muellerzr in #35213
Deprecate transformers.agents by @aymeric-roucher in #36415
Fixing the docs corresponding to the breaking change in torch 2.6. by @Narsil in #36420
add recommendations for NPU using flash_attn by @zheliuyu in #36383
fix: prevent model access error during Optuna hyperparameter tuning by @emapco in #36395
Universal Speculative Decoding CandidateGenerator by @keyboardAnt in #35029
Fix compressed tensors config by @MekkCyber in #36421
Update form pretrained to make TP a first class citizen by @ArthurZucker in #36335
Fix Expected output for compressed-tensors tests by @MekkCyber in #36425
restrict cache allocator to non quantized model by @SunMarc in #36428
Change PR to draft when it is (re)opened by @ydshieh in #36417
Fix permission by @ydshieh in #36443
Fix another permission by @ydshieh in #36444
Add contents: write by @ydshieh in #36445
[save_pretrained ] Skip collecting duplicated weight by @wejoncy in #36409
[generate] torch.distributed-compatible DynamicCache by @gante in #36373
Lazy import libraries in src/transformers/image_utils.py by @hmellor in #36435
Fix hub_retry by @ydshieh in #36449
[GroundingDino] Fix grounding dino loss 🚨 by @EduardoPach in #31828
Fix loading models with mismatched sizes by @qubvel in #36463
[docs] fix bug in deepspeed config by @faaany in #36081
Add Got-OCR 2 Fast image processor and refactor slow one by @yonigozlan in #36185
Fix couples of issues from #36335 by @SunMarc in #36453
Fix _load_state_dict_into_meta_model with device_map=None by @hlky in #36488
Fix loading zero3 weights by @muellerzr in #36455
Check TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
Fix kwargs UserWarning in SamImageProcessor by @MSt-10 in #36479
fix torch_dtype, contiguous, and load_state_dict regression by @SunMarc in #36512
Fix some typos in docs by @co63oc in #36502
chore: fix message descriptions in arguments and comments by @threewebcode in #36504
Fix pipeline+peft interaction by @Rocketknight1 in #36480
Fix edge case for continue_final_message by @Rocketknight1 in #36404
[Style] fix E721 warnings by @kashif in #36474
Remove unused code by @Rocketknight1 in #36459
[docs] Redesign by @stevhliu in #31757
Add aya by @ArthurZucker in #36521
chore: Fix typos in docs and examples by @co63oc in #36524
Fix bamba tests amd by @ivarflakstad in #36535
Fix links in quantization doc by @MekkCyber in #36528
chore: enhance messages in docstrings by @threewebcode in #36525
guard torch version for uint16 by @SunMarc in #36520
Fix typos in tests by @co63oc in #36547
Fix typos . by @zhanluxianshen in #36551
chore: enhance message descriptions in parameters,comments,logs and docstrings by @threewebcode in #36554
Delete redundancy if case in model_utils by @zhanluxianshen in #36559
Modular Conversion --fix_and_overwrite on Windows by @hlky in #36583
Integrate SwanLab for offline/online experiment tracking and local visualization by @ShaohonChen in #36433
[bark] fix loading of generation config by @gante in #36587
[XGLM] tag tests as slow by @gante in #36592
fix: argument by @ariG23498 in #36558
Mention UltraScale Playbook 🌌 in docs by @NouamaneTazi in #36589
avoid errors when the size of input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
Export base streamer. by @AndreasAbdi in #36500
Github action for auto-assigning reviewers by @Rocketknight1 in #35846
Update chat_extras.md with content correction by @krishkkk in #36599
Update "who to tag" / "who can review" by @gante in #36394
Fixed datatype related issues in DataCollatorForLanguageModeling by @capemox in #36457
Fix check for XPU. PyTorch >= 2.6 no longer needs ipex. by @tripzero in #36593
[HybridCache] disable automatic compilation by @gante in #36620
Fix auto-assign reviewers by @Rocketknight1 in #36631
chore: fix typos in language models by @threewebcode in #36586
[docs] Serving LLMs by @stevhliu in #36522
Refactor some core stuff by @ArthurZucker in #36539
Fix bugs in mllama image processing by @tjohnson31415 in #36156
Proper_flex by @ArthurZucker in #36643
Fix AriaForConditionalGeneration flex attn test by @ivarflakstad in #36604
Remove remote code warning by @Rocketknight1 in #36285
Stop warnings from unnecessary torch.tensor() overuse by @Rocketknight1 in #36538
[docs] Update docs dependency by @stevhliu in #36635
Remove research projects by @Rocketknight1 in #36645
Fix gguf docs by @SunMarc in #36601
fix typos in the docs directory by @threewebcode in #36639
Gemma3 by @RyanMullins in #36658
HPU support by @IlyasMoutawwakil in #36424
fix block mask typing by @ArthurZucker in #36661
[CI] gemma 3 make fix-copies by @gante in #36664
Fix bnb regression due to empty state dict by @SunMarc in #36663
[core] Large/full refactor of from_pretrained by @Cyrilvallez in #36033
Don't accidentally mutate the base_model_tp_plan by @Rocketknight1 in #36677
Fix Failing GPTQ tests by @MekkCyber in #36666
Remove hardcoded slow image processor class in processors supporting fast ones by @yonigozlan in #36266
[quants] refactor logic for modules_to_not_convert by @SunMarc in #36672
Remove differences between init and preprocess kwargs for fast image processors by @yonigozlan in #36186
Refactor siglip2 fast image processor by @yonigozlan in #36406
Fix rescale normalize inconsistencies in fast image processors by @yonigozlan in #36388
[Cache] Don't initialize the cache on meta device by @gante in #36543
Update config.torch_dtype correctly by @SunMarc in #36679
Fix slicing for 0-dim param by @SunMarc in #36580
Changing the test model in Quanto kv cache by @MekkCyber in #36670
fix wandb hp search unable to resume from sweep_id by @bd793fcb in #35883
Upgrading torch version and cuda version in quantization docker by @MekkCyber in #36264
Change Qwen2_VL image processors to have init and call accept the same kwargs by @yonigozlan in #36207
fix type annotation for ALL_ATTENTION_FUNCTIONS by @WineChord in #36690
Fix dtype for params without tp_plan by @Cyrilvallez in #36681
chore: fix typos in utils module by @threewebcode in #36668
[CI] Automatic rerun of certain test failures by @gante in #36694
Add loading speed test by @Cyrilvallez in #36671
fix: fsdp sharded state dict wont work for save_only_model knob by @kmehant in #36627
Handling an exception related to HQQ quantization in modeling by @MekkCyber in #36702
Add GGUF support to T5-Encoder by @Isotr0py in #36700
Final CI cleanup by @Rocketknight1 in #36703
Add support for fast image processors in add-new-model-like CLI by @yonigozlan in #36313
Gemma3 processor typo by @Kuangdd01 in #36710
Make the flaky list a little more general by @Rocketknight1 in #36704
Cleanup the regex used for doc preprocessing by @Rocketknight1 in #36648
[model loading] don't gc.collect() if only 1 shard is used by @gante in #36721
Fix/best model checkpoint fix by @seanswyi in #35885
Try working around the processor registration bugs by @Rocketknight1 in #36184
[tests] Parameterized test_eager_matches_sdpa_inference by @gante in #36650
🌐 [i18n-KO] Translated codegen.md to Korean by @maximizemaxwell in #36698
Fix post_init() code duplication by @Cyrilvallez in #36727
Fix grad accum arbitrary value by @IlyasMoutawwakil in #36691
[Generation, Gemma 3] When passing a custom generation_config, overwrite default values with the model's base generation_config by @gante in #36684
🚨🚨🚨 Fix sdpa in SAM and refactor relative position embeddings by @geetu040 in #36422
enable/disable compile for quants methods by @SunMarc in #36519
fix can_generate by @jiqing-feng in #36570
Allow ray datasets to be used with trainer by @FredrikNoren in #36699
fix xpu tests by @jiqing-feng in #36656
Fix test isolation for clear_import_cache utility by @sambhavnoobcoder in #36345
Fix TrainingArguments.torch_empty_cache_steps post_init check by @pkuderov in #36734
[MINOR:TYPO] Update hubert.md by @cakiki in #36733
[CI] remove redundant checks in test_eager_matches_sdpa_inference by @gante in #36740
[docs] Update README by @stevhliu in #36265
doc: Clarify is_decoder usage in PretrainedConfig documentation by @d-kleine in #36724
fix typos in the tests directory by @threewebcode in #36717
chore: fix typos in tests directory by @threewebcode in #36785
Fixing typo in gemma3 image_processor_fast and adding a small test by @Zebz13 in #36776
Fix gemma3_text tokenizer in mapping by @LysandreJik in #36793
Add Mistral3 by @Cyrilvallez in #36790
fix hqq due to recent modeling changes by @SunMarc in #36771
Update SHA for tj-actions/changed-files by @ydshieh in #36795
Loading optimizations by @Cyrilvallez in #36742
Fix Mistral3 tests by @yonigozlan in #36797
Fix casting dtype for quantization by @SunMarc in #36799
Fix chameleon's TypeError because inputs_embeds may None by @YenFuLin in #36673
Support custom docstrings in modular by @yonigozlan in #36726
[generate] ✨ vectorized beam search ✨ by @gante in #35802
Expectations test utils by @ivarflakstad in #36569
fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model by @yao-matrix in #36572
Remove dist": "loadfile" for pytest in CircleCI jobs by @ydshieh in #36811
Fix Device map for bitsandbytes tests by @MekkCyber in #36800
[Generation] remove leftover code from end-to-end compilation by @gante in #36685
Add attention visualization tool by @ArthurZucker in #36630
Add option for ao base configs by @drisspg in #36526
enable OffloadedCache on XPU from PyTorch 2.7 by @yao-matrix in #36654
[gemma 3] multimodal checkpoints + AutoModelForCausalLM by @gante in #36741
One more fix for reviewer assignment by @Rocketknight1 in #36829
Support tracable dynamicKVcache by @tugsbayasgalan in #36311
Add Space to Bitsandbytes doc by @MekkCyber in #36834
quick fix fast_image_processor register error by @JJJYmmm in #36716
Update configuration_qwen2.py by @michaelfeil in #36735
Just import torch AdamW instead by @Rocketknight1 in #36177
Move the warning to the documentation for DataCollatorWithFlattening by @qgallouedec in #36707
Fix swanlab global step by @Zeyi-Lin in #36728
Disable inductor config setter by default by @HDCharles in #36608
[ForCausalLMLoss] allow users to pass shifted labels by @stas00 in #36607
fix tiktoken convert to pass AddedToken to Tokenizer by @itazap in #36566
Saving Trainer.collator.tokenizer in when Trainer.processing_class is None by @innerNULL in #36552
Pass num_items_in_batch directly to loss computation by @eljandoubi in #36753
Fix fp16 ONNX export for RT-DETR and RT-DETRv2 by @qubvel in #36460
Update deprecated Jax calls by @rasmi in #35919
[qwen2 audio] remove redundant code and update docs by @gante in #36282
Pass state dict by @phos-phophy in #35234
[modular] Sort modular skips by @gante in #36304
[generate] clarify docstrings: when to inherit GenerationMixin by @gante in #36605
Update min safetensors bis by @SunMarc in #36823
Fix import for torch 2.0, 2.1 - guard typehint for "device_mesh" by @qubvel in #36768
Gemma 3: Adding explicit GenerationConfig and refactoring conversion … by @RyanMullins in #36833
Fix: remove the redundant snippet of _whole_word_mask by @HuangBugWei in #36759
Shieldgemma2 by @RyanMullins in #36678
Fix ONNX export for sequence classification head by @echarlaix in #36332
Fix hqq skipped modules and dynamic quant by @mobicham in #36821
Use pyupgrade --py39-plus to improve code by @cyyever in #36843
Support loading Quark quantized models in Transformers by @fxmarty-amd in #36372
DeepSpeed tensor parallel+ZeRO by @inkcherry in #36825
Refactor Attention implementation for ViT-based models by @qubvel in #36545
Add Prompt Depth Anything Model by @haotongl in #35401
Add model visual debugger by @molbap in #36798
[torchao] revert to get_apply_tensor_subclass by @SunMarc in #36849
Gemma3: fix test by @zucchini-nlp in #36820
[CI] fix update metadata job by @gante in #36850
Add support for seed in DataCollatorForLanguageModeling by @capemox in #36497
Refactor Aya Vision with modular by @yonigozlan in #36688
Mllama: raise better error by @zucchini-nlp in #35934
[CI] doc builder without custom image by @gante in #36862
FIX FSDP plugin update for QLoRA by @BenjaminBossan in #36720
Remove call to .item in get_batch_samples by @regisss in #36861
chore: fix typos in the tests directory by @threewebcode in #36813
Make ViTPooler configurable by @sebbaur in #36517
Revert "Update deprecated Jax calls by @ArthurZucker in #35919)"
[generate] model defaults being inherited only happens for newer models by @gante in #36881
:red_circle: :red_circle: :red_circle: supersede paligemma forward to shift pos id indexing by @molbap in #36859
Gemma 3 tests expect greedy decoding by @molbap in #36882
Use deformable_detr kernel from the Hub by @danieldk in #36853
Minor Gemma 3 fixes by @molbap in #36884
Fix: dtype cannot be str by @zucchini-nlp in #36262

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@IlyasMoutawwakil
- Make cache traceable (#35873)
- HPU support (#36424)
- Fix grad accum arbitrary value (#36691)
@orrzohar
- SmolVLM2 (#36126)
@threewebcode
- chore: fix function argument descriptions (#36392)
- chore: fix message descriptions in arguments and comments (#36504)
- chore: enhance messages in docstrings (#36525)
- chore: enhance message descriptions in parameters,comments,logs and docstrings (#36554)
- chore: fix typos in language models (#36586)
- fix typos in the docs directory (#36639)
- chore: fix typos in utils module (#36668)
- fix typos in the tests directory (#36717)
- chore: fix typos in tests directory (#36785)
- chore: fix typos in the tests directory (#36813)
@aymeric-roucher
- Deprecate transformers.agents (#36415)
@keyboardAnt
- Universal Speculative Decoding CandidateGenerator (#35029)
@EduardoPach
- [GroundingDino] Fix grounding dino loss 🚨 (#31828)
@co63oc
- Fix some typos in docs (#36502)
- chore: Fix typos in docs and examples (#36524)
- Fix typos in tests (#36547)
@RyanMullins
- Gemma3 (#36658)
- Gemma 3: Adding explicit GenerationConfig and refactoring conversion … (#36833)
- Shieldgemma2 (#36678)
@cyyever
- Use pyupgrade --py39-plus to improve code (#36843)
@haotongl
- Add Prompt Depth Anything Model (#35401)
@danieldk
- Use deformable_detr kernel from the Hub (#36853)

- Python
Published by LysandreJik about 1 year ago

transformers - Mistral 3 (Based on v4.49.0)

A new model is added to transformers: Mistral 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Mistral 3

The model is detailed in the following blog post. The models are available on the Hub with the following tag: mistral3

Overview

This model was contributed by cyrilvallez and yonigozlan.

The original code can be found here and here.

Usage example

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:

```python
import torch
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
# 'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
```

Inference on a single image

This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=20)
# Decode only the newly generated tokens, not the prompt
decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

print(decoded_output)
# "The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
```

Text-only generation

This example shows how to generate text using the Mistral3 model without providing any image input.

````python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

print(decoded_output)
# 1. À plus tard!
# 2. Salut, à plus!
# 3. À toute!
# 4. À la prochaine!
# 5. Je me casse, à plus!
#
# ```
#  /\_/\
# ( o.o )
#  > ^ <
# ```
````

Batched image and text inputs

Mistral3 models also support batched image and text inputs.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

# One conversation per batch element
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
                {"type": "text", "text": "Describe this image"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path",
#  "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
```

Batched multi-image input and quantization with BitsAndBytes

This implementation of the Mistral3 models supports batched text-image inputs with a different number of images for each text. This example also shows how to use BitsAndBytes to load the model with 4-bit quantization.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

torch_device = "cuda"
model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
processor = AutoProcessor.from_pretrained(model_checkpoint)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint, quantization_config=quantization_config
)

# The first conversation has one image, the second has two
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=25)

decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
print(decoded_outputs)
# ["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n",
#  "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
```

- Python
Published by LysandreJik about 1 year ago

transformers - Gemma 3 (Based on v4.49.0)

A new model is added to transformers: Gemma 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Gemma 3

The model is detailed in the following blog post. The models and demos using the model are available in the following collection.

A Space to play around with the 12B-it flavor is available here.

Overview

The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.

Usage tips

For image+text and image-only inputs use Gemma3ForConditionalGeneration.
For text-only inputs use Gemma3ForCausalLM for generation to avoid loading the vision tower.
Each sample can contain multiple images, and the number of images can vary between samples. However, make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
The text passed to the processor should have the "<start_of_image>" token wherever an image should be inserted.
The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.
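
The batching rule above can be checked before calling the processor. The following is a minimal sketch in plain Python, not part of transformers: `check_image_batch` is a hypothetical helper, the placeholder string is assumed to be `<start_of_image>`, and file names stand in for real image objects.

```python
BOI_TOKEN = "<start_of_image>"  # assumed image placeholder string


def check_image_batch(texts, images):
    """Hypothetical helper: one image list per text, with matching placeholder counts."""
    if len(texts) != len(images):
        raise ValueError(f"{len(texts)} texts but {len(images)} image lists")
    for i, (text, sample_images) in enumerate(zip(texts, images)):
        expected = text.count(BOI_TOKEN)
        if expected != len(sample_images):
            raise ValueError(
                f"sample {i}: {expected} placeholder(s) but {len(sample_images)} image(s)"
            )


texts = [
    f"{BOI_TOKEN} What is shown in this image?",
    f"{BOI_TOKEN}{BOI_TOKEN} Are these two images identical?",
]
# Strings stand in for PIL images; each inner list belongs to one sample.
images = [["cow.jpg"], ["cow.jpg", "stop_sign.jpg"]]
check_image_batch(texts, images)  # raises ValueError on a mismatch
```

A nested list (one inner list per sample) keeps the image-to-text assignment unambiguous when samples carry different numbers of images.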

Image cropping for high resolution images

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.

Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
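
To make the idea concrete, here is a simplified sketch of how crops along the longer side could be chosen. It is an illustration only, not the Gemma 3 image processor's code: the function name, the `min_ratio` and `max_crops` parameters, and their default values are assumptions.

```python
def pan_and_scan_boxes(width, height, min_ratio=1.2, max_crops=4):
    """Return (left, top, right, bottom) crop boxes along the longer side.

    Hypothetical helper; thresholds are illustrative assumptions.
    """
    long_side, short_side = max(width, height), min(width, height)
    if long_side / short_side < min_ratio:
        return []  # aspect ratio is not skewed enough: keep only the base image
    # Roughly one crop per "square" that fits along the long side, capped at max_crops
    num_crops = max(2, min(max_crops, round(long_side / short_side)))
    step = long_side / num_crops
    boxes = []
    for i in range(num_crops):
        start, end = int(i * step), int((i + 1) * step)
        if width >= height:
            boxes.append((start, 0, end, height))
        else:
            boxes.append((0, start, width, end))
    return boxes


print(pan_and_scan_boxes(1000, 1000))  # square image: no extra crops
print(pan_and_scan_boxes(2000, 1000))  # wide image: crops side by side
```

Each crop would then be processed like a regular image and passed to the model alongside the base image.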

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,  # also return several crops per image
)
# Move `inputs` to the model's device with `.to(model.device)` before calling `generate`.
```

Usage Example

Single-image Inference

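```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

# `inputs` was not defined in this snippet; build it from a chat message as in the other examples.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {"role": "user", "content": [{"type": "image", "url": url}, {"type": "text", "text": "What is shown in this image?"}]},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Drop the prompt tokens before decoding, so only the generated text is printed
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```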

Multi-image Inference

```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Drop the prompt tokens before decoding, so only the generated text is printed
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Text-only inference

```python
from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(text)
```

- Python
Published by LysandreJik about 1 year ago

transformers - Aya Vision (Based on v4.49.0)

A new model is added to transformers: Aya Vision. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-AyaVision

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

Aya Vision

The model is detailed in the following blog post.

Overview

Usage Example

Here's an example usage of the Aya Vision model.

```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
messages = [
    {"role": "user",
     "content": [
       {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
        {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
    ]},
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

- Python
Published by LysandreJik about 1 year ago

transformers - SigLIP-2 (Based on v4.49.0)

A new model is added to transformers: SigLIP-2. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

SigLIP2

The paper page for the model is available here. It is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

The model comes in two variants:

1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

The abstract from the paper is the following:

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot accuracy), image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).

Usage tips

Usage of SigLIP2 is similar to SigLIP and CLIP. The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
Training is supported but does not use torch.distributed utilities, which may limit the scalability of the batch size. However, DDP and FSDP work on a single-node multi-GPU setup.
When using the standalone GemmaTokenizerFast, make sure to pass padding="max_length" and max_length=64, as that's how the model was trained.
The model was trained with lowercased text, so make sure you apply the same preprocessing to your text labels.
To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
The NaFlex variant supports processing images at higher resolutions by adjusting the max_num_patches parameter in the Processor. The default value is max_num_patches=256. Increasing max_num_patches to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio.
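
The difference between sigmoid and softmax scoring mentioned in the first tip can be seen on a toy example. This is a minimal sketch in plain Python; the logit values are made up for illustration and do not come from a real model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up image-text logits for three candidate labels (illustrative values only)
logits = [-1.7, -7.1, -12.0]

sigmoid_scores = [sigmoid(x) for x in logits]  # each label is scored independently
softmax_scores = softmax(logits)               # labels compete; scores sum to 1

print([round(s, 4) for s in sigmoid_scores])
print([round(s, 4) for s in softmax_scores])
```

With sigmoid, a low score for every label is a valid outcome (none of the candidates matches the image), whereas softmax always distributes a total probability of 1 across the candidates.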

This model was contributed by qubvel. The original code can be found here.

Usage example

There are 2 main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.

FixRes variant

Pipeline API

The pipeline lets you use the model in a few lines of code:

```python
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
candidate_labels = ["2 cats", "a plane", "a remote"]
outputs = image_classifier(image, candidate_labels=candidate_labels)
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)
# [{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
```

Using the model yourself

If you want to do the pre- and postprocessing yourself, here's how to do that:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
texts = [f"This is a photo of {label}." for label in candidate_labels]

# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
# 15.0% that image 0 is '2 cats'
```

NaFlex variant

NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio. This enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.

Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while

1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)

The resulting distortion in width and height is at most (patch_size - 1) / width and (patch_size - 1) / height, respectively, which tends to be small for common resolutions and aspect ratios. After resizing, the image is split into a sequence of patches, and a mask with padding information is added.
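
The resizing step described above can be sketched as a search for the largest scale factor that still fits the patch budget. This is an illustrative sketch in plain Python, not the transformers implementation: `naflex_target_size` is a hypothetical helper, and rounding each side up to a multiple of the patch size is an assumption made here for simplicity.

```python
import math


def naflex_target_size(height, width, patch_size=16, max_num_patches=256):
    """Hypothetical helper: largest patch-aligned (height, width) within the patch budget."""

    def scaled(scale):
        # Round each side up to a multiple of patch_size (at least one patch per side)
        h = max(patch_size, math.ceil(height * scale / patch_size) * patch_size)
        w = max(patch_size, math.ceil(width * scale / patch_size) * patch_size)
        return h, w

    lo, hi = 1e-3, 100.0
    for _ in range(60):  # bisection on the scale factor
        mid = (lo + hi) / 2
        h, w = scaled(mid)
        if (h // patch_size) * (w // patch_size) <= max_num_patches:
            lo = mid  # fits the budget: try a larger scale
        else:
            hi = mid  # too many patches: shrink
    return scaled(lo)


# A 480x640 (height x width) image with the default and a 4x larger patch budget
small = naflex_target_size(480, 640, max_num_patches=256)
large = naflex_target_size(480, 640, max_num_patches=1024)
print(small, large)
```

Both sides of the result are multiples of the patch size, the patch count stays within `max_num_patches`, and quadrupling the budget roughly doubles each side while keeping the aspect ratio close to the original.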

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
texts = [f"This is a photo of {label}." for label in candidate_labels]

# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing
# higher values e.g. `max_num_patches=512`
inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
# 21.1% that image 0 is '2 cats'
```

- Python
Published by LysandreJik over 1 year ago

transformers - SmolVLM-2 (Based on v4.49.0)

A new model is added to transformers: SmolVLM-2. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

SmolVLM-2

SmolVLM-2 is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

It uses SmolLM2 for the text model.
It supports multi-image and video inputs

Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.

Videos should not be upsampled.

If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default. The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.

Here’s how to control resizing and set a custom size:

image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)

Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.
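
To see how the two parameters interact, here is a simplified sketch in plain Python. It is not the image processor's code: `resize_and_count_patches` is a hypothetical helper that rescales the longest edge to `longest_edge` and counts square patches of side `max_image_size`; the real processor may round differently and may add extra views of the image.

```python
import math


def resize_and_count_patches(height, width, longest_edge=4 * 512, max_image_size=512):
    """Hypothetical helper: resized (height, width) and the number of square patches."""
    # Scale so that the longest edge matches `longest_edge`, preserving the aspect ratio
    scale = longest_edge / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Count patches of side `max_image_size` needed to cover the resized image
    rows = math.ceil(new_h / max_image_size)
    cols = math.ceil(new_w / max_image_size)
    return (new_h, new_w), rows * cols


print(resize_and_count_patches(480, 640))                        # default longest edge of 4 * 512
print(resize_and_count_patches(480, 640, longest_edge=2 * 512))  # smaller custom size
```

Lowering `longest_edge` reduces the number of patches, and therefore the number of image tokens the model has to process, at the cost of resolution.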

This model was contributed by orrzohar.

Usage example

Single Media inference

The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here's an example code for that.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

conversation = [
    {
        "role": "user",
        "content":[
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)


# Video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```

- Python
Published by LysandreJik over 1 year ago

transformers - v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2

New models

Helium

Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.

Add-helium by @ArthurZucker in #35669

Qwen2.5-VL

The Qwen2.5-VL model is an update to Qwen2-VL from Qwen team, Alibaba Group.

The abstract from this update is the following:

Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.

add qwen2.5vl by @ShuaiBai623 in #35569

SuperGlue

The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.

This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

Add SuperGlue model by @sbucaille in #29886

Granite Vision Support

The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.

Granite Vision Support by @alex-jw-brooks in #35579

Zamba2

Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.

Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
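
As a rough illustration of the interleaving described above, the sketch below builds a layer layout with one shared transformer layer after every 6 Mamba blocks. The function name and the string labels are hypothetical and are not taken from the actual Zamba2 implementation, which may arrange its layers differently.

```py
def hybrid_layout(num_mamba_blocks, every=6):
    """Return a list of layer labels (illustrative only)."""
    layout = []
    for i in range(1, num_mamba_blocks + 1):
        layout.append("mamba")
        # After every `every` Mamba blocks, reuse the same shared transformer layer.
        if i % every == 0:
            layout.append("shared_transformer")
    return layout


layout = hybrid_layout(12)
print(layout.count("mamba"), layout.count("shared_transformer"))  # 12 2
```

Because the transformer layer is shared, each repetition would reuse the same weights, so the label appears several times without adding new parameters in this simplified picture.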

Add Zamba2 by @pglorio in #34517

GOT-OCR 2.0

GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.

Add GOT-OCR 2.0 to Transformers by @yonigozlan in #34721

DAB-DETR

DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.

Add DAB-DETR for object detection by @conditionedstimulus in #30803

Depth Pro

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

Add Apple's Depth-Pro for depth estimation by @geetu040 in #34583

RT-DETRv2

RT-DETRv2 is an improved Real-Time DEtection TRansformer (RT-DETR). It refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP metrics on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

Adding RTDETRv2 by @jadechoghari in #34773

Transformers-CLI

Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.

This feature exists in TRL and has been migrated to transformers for easier usage.

[Chat] Add Chat from TRL 🐈 by @gante in #35714

Processor Standardization

Work is ongoing to standardize the image processors so that their APIs are equivalent. The processors are also being given a fast variant so that they never become bottlenecks in image processing pipelines.

In this release, several processors have been standardized and have had fast versions contributed.

OwlViT/Owlv2 post processing standardization by @qubvel in #34929
OmDet Turbo processor standardization by @qubvel in #34937
Grounding DINO Processor standardization by @qubvel in #34853
Refactoring of ImageProcessorFast by @yonigozlan in #35069
add Qwen2-VL image processor fast by @yonigozlan in #35733
Remove Multi-threaded image conversion for fast image processors by @yonigozlan in #36105

Breaking changes

DPT segmentation maps

DPT image processors did not support segmentation_maps and accepted only images. This has been fixed. The fix adds an argument to the preprocess method, so users who pass arguments positionally to that method may see changed behavior. We recommend using keyword arguments for such methods so that the addition of new features does not affect your code.
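
The following sketch shows why a newly inserted parameter can silently change the meaning of a positional call. The two functions are hypothetical stand-ins and are not the real DPT preprocess signatures; only the segmentation_maps name comes from the note above.

```py
# Hypothetical "before" and "after" versions of a preprocess method.
# These are NOT the real DPT signatures.
def preprocess_before(images, do_resize=True):
    return {"images": images, "segmentation_maps": None, "do_resize": do_resize}


def preprocess_after(images, segmentation_maps=None, do_resize=True):
    return {"images": images, "segmentation_maps": segmentation_maps, "do_resize": do_resize}


# Positional call: the second argument now lands in a different parameter.
old_call = preprocess_before("img", False)
new_call = preprocess_after("img", False)
print(old_call["do_resize"], new_call["do_resize"])  # False True

# Keyword call: the behavior is the same under both signatures.
stable_old = preprocess_before("img", do_resize=False)
stable_new = preprocess_after("img", do_resize=False)
print(stable_old["do_resize"] == stable_new["do_resize"])  # True
```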

🔴 🔴 🔴 Added segmentation maps support for DPT image processor by @simonreise in #34345

Image classification pipeline and single vs multi-label

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
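
To make the difference concrete, here is a minimal, dependency-free sketch of the two squashing functions named in the pull request title: softmax for single-label and sigmoid for multi-label problems. The postprocess helper is hypothetical and is not the pipeline's actual code; the problem_type strings are the values used in model configs.

```py
import math


def softmax(logits):
    # Scores compete with each other and sum to 1 (exactly one label applies).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    # Each score is squashed independently (several labels may apply).
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def postprocess(logits, problem_type):
    # Hypothetical helper illustrating the intended mapping.
    if problem_type == "single_label_classification":
        return softmax(logits)
    if problem_type == "multi_label_classification":
        return sigmoid(logits)
    raise ValueError(f"Unsupported problem_type: {problem_type}")


logits = [2.0, 1.0, 0.1]
single = postprocess(logits, "single_label_classification")
multi = postprocess(logits, "multi_label_classification")
print(round(sum(single), 6))  # 1.0
print(sum(multi) > 1.0)       # True
```

With the two functions swapped, a single-label model would report scores that do not sum to 1, and a multi-label model would be forced to share probability mass across labels.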

🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848

Fixing the LayerNorm beta/gamma renames

The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:

🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615
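
Going only by the pull request title, the idea can be sketched as follows: legacy gamma/beta names are rewritten to weight/bias only when they belong to a LayerNorm module, so that other parameters that happen to be called gamma or beta are left alone. The function below is a hypothetical illustration, not the library's implementation.

```py
def rename_legacy_layernorm_key(key):
    """Rename legacy LayerNorm parameter names (illustrative sketch)."""
    if "LayerNorm.gamma" in key:
        return key.replace("LayerNorm.gamma", "LayerNorm.weight")
    if "LayerNorm.beta" in key:
        return key.replace("LayerNorm.beta", "LayerNorm.bias")
    # Anything else is returned unchanged, including unrelated "gamma"/"beta" names.
    return key


print(rename_legacy_layernorm_key("encoder.layer.0.LayerNorm.gamma"))
# encoder.layer.0.LayerNorm.weight
print(rename_legacy_layernorm_key("blocks.0.layer_scale.gamma"))
# blocks.0.layer_scale.gamma
```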

VLM cleanup

The ignore_index property of the llava configuration has been removed as it was not serving a purpose.

🔴 VLM: compile compatibility by @zucchini-nlp in #35724

Quantization

Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.

Split and clean up GGUF quantization tests by @Isotr0py in #35502
Display warning for unknown quants config instead of an error by @SunMarc in #35963
Adding FP8 Quantization to transformers by @MekkCyber in #36026
New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148

Generate

[generate] revert change in Aria: the maximum cache length must match max_length by @gante in #36120
🧹 remove generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
[generate] can instantiate GenerationConfig(cache_implementation="static") by @gante in #35679
[generate] return Cache object even if passed in a legacy format by @gante in #35673
[generate] update docstring of SequenceBiasLogitsProcessor by @gante in #35699
Test: generate with torch.compile(model.forward) as a fast test by @gante in #34544
[generate] move max time tests by @gante in #35962
[generate] shape checks in tests compatible with fixed-length caches (+ some minor fixes) by @gante in #35993

Pipelines

Pipelines have received several bug fixes and improvements which are detailed below.

Stop mutating input dicts in audio classification pipeline by @Rocketknight1 in #35754
fix document qa bf16 pipeline by @jiqing-feng in #35456
fix low-precision audio classification pipeline by @jiqing-feng in #35435
[pipeline] missing import regarding assisted generation by @gante in #35752
Output dicts support in text generation pipeline by @jonasrohw in #35092
Fix Audio Classification Pipeline top_k Documentation Mismatch and Bug #35736 by @sambhavnoobcoder in #35771

Bugfixes and improvements

Fix flaky test_custom_4d_attention_mask by @ydshieh in #35606
Use inherit tempdir makers for tests + fix failing DS tests by @muellerzr in #35600
Added error when sequence length is bigger than max_position_embeddings by @Taha1506 in #32156
Let EarlyStoppingCallback not require load_best_model_at_end by @muellerzr in #35101
Fix flaky test_beam_search_low_memory by @ydshieh in #35611
Skip MobileNetV1ModelTest::test_batching_equivalence for now by @ydshieh in #35614
Update codeowners with individual model owners by @Rocketknight1 in #35595
Fix device in rope module when using dynamic updates by @Cyrilvallez in #35608
Fix whisper compile by @jiqing-feng in #35413
Removed some duplicated code by @Sai-Suraj-27 in #35637
[Phi] bias should be True by @ArthurZucker in #35650
Enable different torch dtype in sub models by @zucchini-nlp in #34873
[Compile] Only test compiling model forward pass by @ArthurZucker in #35658
[tests] make cuda-only tests device-agnostic by @faaany in #35607
[i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic by @AhmedAlmaghz in #35193
Fix zero_shot_image_classification documentation guide link in SigLIP by @aretrace in #35671
Fix : adding einops lib in the CI docker for some bitsandbytes tests by @MekkCyber in #35652
Update torchao.md: use auto-compilation by @martin0258 in #35490
Fix : HQQ config when hqq not available by @MekkCyber in #35655
Fix expected output for ggml test by @MekkCyber in #35686
Fix : add require_read_token for gemma2 gated model by @MekkCyber in #35687
Enhanced Installation Section in README.md by @egojoseph in #35094
Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities by @mahdibaghbanzadeh in #35251
Clean-up composite configs by @zucchini-nlp in #34603
Add future import for Py < 3.10 by @Rocketknight1 in #35666
Enable gptqmodel by @jiqing-feng in #35012
Fix : Nemotron Processor in GGUF conversion by @MekkCyber in #35708
Fix typo in /docs/source/ja/model_doc/decision_transformer.md URL by @hiroaki222 in #35705
Replace deprecated batch_size with max_batch_size when using HybridCache by @mtreinik in #35498
Fix: Falcon tie_word_embeddings in GGUF by @MekkCyber in #35715
Fix condition when GA loss bug fix is not performed by @techkang in #35651
Fix the bug that Trainer cannot correctly call torch_jit_model_eval by @Wanguy in #35722
[generation] fix type hint by @gante in #35725
Add proper jinja2 error by @Rocketknight1 in #35533
Optimize ForCausalLMLoss by removing unnecessary contiguous() call to reduce memory overhead by @efsotr in #35646
Modular: support for importing functions from any file by @Cyrilvallez in #35692
Remove batch size argument warning when unjustified by @quintenroets in #35519
[cache] add a test to confirm we can use cache at train time by @gante in #35709
Remove pt_to_tf by @gante in #35672
Added resource class configuration option for check_circleci_user job by @Sai-Suraj-27 in #32866
Fix some tests by @Cyrilvallez in #35682
Unable to use MimiModel with DeepSpeed ZeRO-3 by @anferico in #34735
check is added for the report_to variable in TrainingArguments by @alpertunga-bile in #35403
Added liger_kernel compatibility with PeftModel by @ambroser53 in #35680
Restore is_torch_greater_or_equal_than for backward compatibility by @tlrmchlsmth in #35734
Revert "Unable to use MimiModel with DeepSpeed ZeRO-3" by @eustlb in #35755
ci: fix xpu skip condition for test_model_parallel_beam_search by @dvrogozh in #35742
Use AMD CI workflow defined in hf-workflows by @ivarflakstad in #35058
Fix CI for VLMs by @zucchini-nlp in #35690
Security fix for self-comment-ci.yml by @ydshieh in #35548
[ViTPose] Convert more checkpoints by @NielsRogge in #35638
fix register_buffer in MimiEuclideanCodebook by @anferico in #35759
remove code owners as it was generating too much noise BUT by @ArthurZucker in #35784
Skip Falcon 7B GGML Test by @MekkCyber in #35783
[fix] cannot import name 'Pop2PianoFeatureExtractor' from 'transformers' by @faaany in #35604
transformers.image_transforms.normalize wrong types by @CalOmnie in #35773
Patch moonshine by @eustlb in #35731
Don't import torch.distributed when it's not available by @booxter in #35777
Fix vits low-precision dtype by @jiqing-feng in #35418
Tool calling: support more types by @aymeric-roucher in #35776
Fixes, improvements to timm import behaviour by @rwightman in #35800
modular_model_converter bugfix on assignments by @nikosanto13 in #35642
Deterministic sorting in modular converter when adding new functions by @Cyrilvallez in #35795
Fix "test_chat_template_dict" in video LLMs by @zucchini-nlp in #35660
Update AMD Docker image by @ivarflakstad in #35804
Add LlavaImageProcessor by @NielsRogge in #33191
Byebye test_batching_equivalence's flakiness by @ydshieh in #35729
[Doc] Adding blog post to model doc for TimmWrapper by @ariG23498 in #35744
add a new flax example for Bert model inference by @louie-tsai in #34794
Support adamw_torch_8bit by @fzyzcjy in #34993
Auto-add timm tag to timm-wrapper models. by @pcuenca in #35794
Fix : BLOOM tie_word_embeddings in GGUF by @MekkCyber in #35812
Fixed typo in autoawq version number in an error message for IPEX backend requirements. by @InfroLab in #35815
Remove deprecated get_cached_models by @Wauplin in #35809
Optimized set_initialized_submodules. by @LagPixelLOL in #35493
[i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
move fastspeech to audio models by @eustlb in #35788
Improve modular documentation by @Cyrilvallez in #35737
[Mimi] update test expected values for t4 runners by @eustlb in #35696
Remove old benchmark code by @gante in #35730
Remove pyav pin to allow python 3.11 to be used by @CalOmnie in #35823
Another security patch for self-comment-ci.yml by @ydshieh in #35816
Init cache on meta device by @zucchini-nlp in #35164
Hotfix: missing working-directory in self-comment-ci.yml by @ydshieh in #35833
[gpt2] fix generation tests by @gante in #35822
Fix : Nemotron tokenizer for GGUF format by @MekkCyber in #35836
Fix head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
[chat] docs fix by @gante in #35840
Fix compatibility issues when using auto_gptq with these older versions by @LRL-ModelCloud in #35830
Add PyTorch version check for FA backend on AMD GPUs by @mht-sharma in #35813
Fix NoneType type as it requires py>=3.10 by @SunMarc in #35843
[ tests] remove some flash attention class tests by @ArthurZucker in #35817
[Backend support] Allow num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
Fix GA loss for Deepspeed by @timjeffrey10 in #35808
Fix uploading processors/tokenizers to WandB on train end by @jack89roberts in #35701
Fix more CI tests by @ArthurZucker in #35661
[DOC] Fix contamination and missing paragraph in translation by @Yosshi999 in #35851
Fix typo by @SilverSoldier in #35854
fix apply_chat_template() padding choice by @baoyf4244 in #35828
Fix test_pipelines_video_classification that was always failing by @CalOmnie in #35842
Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch by @sheryc in #35779
use torch.testing.assert_close instead to get more details about error in cis by @ArthurZucker in #35659
add xpu device check in device_placement by @faaany in #35865
Add Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
[doctest] Fixes by @stevhliu in #35863
Fix fast image processor warnings in object detection examples by @sugendran in #35892
Update deepspeed amd image by @ivarflakstad in #35906
Fix typing in audio_utils.chroma_filter_bank by @CalOmnie in #35888
[docs] uv install by @stevhliu in #35821
Fix the config class comparison for remote code models by @Rocketknight1 in #35592
Close Zamba2Config code block by @Rocketknight1 in #35914
[docs] Fix Zamba2 by @stevhliu in #35916
Remove _supports_static_cache = True for some model classes by @ydshieh in #34975
Use rocm6.2 for AMD images by @ivarflakstad in #35930
Add default TP plan for all models with backend support by @Cyrilvallez in #35870
Fix: loading DBRX back from saved path by @zucchini-nlp in #35728
Fix mask slicing for models with HybridCache by @Cyrilvallez in #35681
Qwen-2-5-VL: fix CI by @zucchini-nlp in #35935
Fix TP initialization by @Cyrilvallez in #35860
fix(FA): QKV not being casted to target_dtype for FA with dpo lora by @NanoCode012 in #35834
Remove INC notebook reference in documentation by @echarlaix in #35936
use torch constraints to check if covariance is positive definite during mean resizing. by @abuelnasr0 in #35693
fix test_generated_length_assisted_generation by @keyboardAnt in #34935
Update unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
Update squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
Fix flaky test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
Trainer Refactor: Part 1 by @muellerzr in #35567
update docker file transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
[tests] further fix Tester object has no attribute '_testMethodName' by @faaany in #35781
Update README.md by @BlessedTatonka in #35958
fix iterator overflow when gradient accumulation is 1 by @winglian in #35960
Fix is_causal being a tensor by @IlyasMoutawwakil in #35791
[bart] minor test fixes by @gante in #35965
Pixtral: vectorize patch embeddings and enable tests by @zucchini-nlp in #35122
Whisper: fix static cache CI by @zucchini-nlp in #35852
Less flaky for TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
Support batching for UsefulSensors Moonshine by @njeffrie in #35922
not to use A100 for benchmark.yml by @ydshieh in #35974
Handle empty change indices in SAM's mask to rle conversion by @MSt-10 in #35665
Add support for nested images to LLava and VipLLava by @yonigozlan in #35558
[Moonshine] compute head_dim_padding at init by @eustlb in #35984
[Moshi] disable automatic compilation if the model can't compile by @gante in #35992
use torch 2.6 for daily CI by @ydshieh in #35985
Update-tp test by @ArthurZucker in #35844
Add mean_resizing for every VLMs' resizing_token_embeddings() by @YenFuLin in #35717
Update Granite Vision Model Path / Tests by @alex-jw-brooks in #35998
Qwen2-VL: fix rope delta calculation by @zucchini-nlp in #36013
Fix custom kernel for DeformableDetr, RT-Detr, GroindingDINO, OmDet-Turbo in Pytorch 2.6.0 by @qubvel in #35979
apply_chat_template: consistent behaviour for return_assistant_tokens_mask=True return_tensors=True by @mrsndmn in #35582
layernorm_decay_fix by @Ryoo72 in #35927
Update Mistral converter by @Cyrilvallez in #35967
Refactor (and fix) gpt_neox by @Cyrilvallez in #35610
Fix device mismatch error in Whisper model during feature extraction by @thedebugger in #35866
Fix RMSNormGated in Zamba2 by @pglorio in #35943
Commont bot CI for other jobs (generation / quantization) by @ydshieh in #35341
Hotfix for self-comment-ci.yml by @ydshieh in #36030
feat(ci): ignore trufflehog unverified results by @McPatate in #36031
CircleCI with python 3.9 by @ydshieh in #36027
Update tests regarding attention types after #35235 by @ydshieh in #36024
Fix Gemma2 synced multi-GPU generation by @ManukyanD in #35232
Fix synced multi-GPU generation with LLMs and VLMs by @ManukyanD in #35893
Add XPU type for work-around -inf mask causing sdpa NaN issue in modeling files by @Liangliang-Ma in #35647
add support for empty list as input to create_model_card by @ROZBEH in #36042
DeepSpeed github repo move sync by @stas00 in #36021
[docs] no hard coding cuda as bnb has multi-backend support by @faaany in #35867
[docs] fix bugs in the bitsandbytes documentation by @faaany in #35868
[docs] no hard-coding cuda by @faaany in #36043
Fix how we compute the final non-padding token for ForSequenceClassification models by @Rocketknight1 in #35911
Add Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
Iterative generation using Input embeds and past_key_values by @yaswanth19 in #35890
Fix usage of unpad_input function by @pavelgein in #35925
Fix repo consistency by @ydshieh in #36063
Update test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
Paligemma: fix generation with Gemma2 by @zucchini-nlp in #36044
Save checkpoint to temporary directory to handle partial saves during failures by @SilverSoldier in #35580
Nail in edge case of torch dtype being overriden permantly in the case of an error by @muellerzr in #35845
Fix words typos in ggml test. by @zhanluxianshen in #36060
Fix model kwargs by @muellerzr in #35875
Fix StopStringCriteria to handle tokens above len(tokenizer) by @Rocketknight1 in #35797
[docs] fix outdated example code in trainer.md by @faaany in #36066
Adding RT-DETRv2 for object detection by @jadechoghari in #34773
Fix bug in apply_rotary_pos_emb_flashatt: in Qwen2-5-VL by @DeepWaved in #36065
Move audio top_k tests to the right file and add slow decorator by @Rocketknight1 in #36072
Fix OS err by @muellerzr in #36094
[docs] fix model checkpoint name by @faaany in #36075
[docs] fix typo by @faaany in #36080
[docs] fix not-working example code in perf_infer_gpu_one.md by @faaany in #36087
fix MllamaVisionAttention typehint by @kylesayrs in #35975
Processors: allow tuples of images when checking by @zucchini-nlp in #36084
Chat template: update for processor by @zucchini-nlp in #35953
Paligemma: revert #36084 by @zucchini-nlp in #36113
Support constant lr with cooldown by @LoserCheems in #35453
Enable pytest live log and show warning logs on GitHub Actions CI runs by @ydshieh in #35912
Refactor OPT model by @jiqing-feng in #36101
Revert checkpoint tmp dir by @SunMarc in #36112
[Bugfix] fix file name of docstring in utils/check_table.py by @kkscilife in #36108
fix bnb warning by @SunMarc in #36116
AutoformerForPrediction test add atol by @ivarflakstad in #36017
Fix nighlty CIs: missing atols by @ArthurZucker in #35903
Add common test for torch.export and fix some vision models by @qubvel in #35124
fix: typos in documentation files by @maximevtush in #36122
update awesome-transformers.md. by @zhanluxianshen in #36115
Fix max size deprecated warning by @HichTala in #34998
Fix CI issues by @molbap in #35662
update tiktoken integ to use converted by @ArthurZucker in #36135
Make output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
[docs] minor doc fix by @faaany in #36127
[docs] update awq doc by @faaany in #36079
Add pipeline parallel plan to PretrainedConfig and PreTrainedModel by @hmellor in #36091
add RAdamScheduleFree optimizer by @nhamanasu in #35313
added warning to Trainer when label_names is not specified for PeftModel by @MilkClouds in #32085
Whisper: remove redundant assisted generation tests by @gante in #34814
Add utility for Reload Transformers imports cache for development workflow #35508 by @sambhavnoobcoder in #35858
VLM: enable skipped tests by @zucchini-nlp in #35746
[commands] remove deprecated/inoperational commands by @gante in #35718
Fix Gradient Checkpointing for Deberta & Deberta-V2 using PEFT / Adapters by @lenglaender in #35898
🚨 Remove cache migration script by @Wauplin in #35810
multi-gpu: fix tensor device placements for various models by @dvrogozh in #35763
Optim: APOLLO optimizer integration by @zhuhanqing in #36062
Fix multi gpu loss sync condition, add doc and test by @techkang in #35743
adding option to save/reload scaler by @hsilva664 in #34932
Update doc re list of models supporting TP by @kwen2501 in #35864
Add more rigerous non-slow grad accum tests by @muellerzr in #35668
Fix test fetcher by @ydshieh in #36129
skip test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
Add git LFS to AMD docker image by @ivarflakstad in #36016
Mllama fsdp by @blbadger in #36000
Fix PaliGemma Pad Token Masking During Training #35855 by @sambhavnoobcoder in #35859
Add reminder config to issue template and print DS version in env by @Ben-Schneider-code in #35156
Fix Gemma2 dtype issue when storing weights in float16 precision by @Nerogar in #35398
Replace deprecated update_repo_visibility by @Wauplin in #35970
Fix tests for vision models by @qubvel in #35654
qwen2.5vl: fix bugs when using flash2+bf16 or num_return_sequences>1 by @gewenbin0992 in #36083
docs: fix return type annotation of get_default_model_revision by @MarcoGorelli in #35982
Fix PretrainedTokenizerFast check => Fix PretrainedTokenizerFast Save by @CL-ModelCloud in #35835
Move DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
Helium documentation fixes by @LysandreJik in #36170
Remove loading custom kernel for RT-DETRv2 by @qubvel in #36098
[Modular] skip modular checks based on diff by @gante in #36130
Fix red CI by @ArthurZucker in #36174
Fix : fix doc fp8 by @MekkCyber in #36173
Efficient Inference Kernel for SpQR by @elvircrn in #34976
fix training issues by @ArthurZucker in #36158
add disable compile option by @ArthurZucker in #36161
CI: avoid human error, automatically infer generative models by @gante in #33212
Use tqdm auto by @SmartManoj in #35726
Optimize Qwen2VL vision model by precomputing cos/sin embeds before ViT blocks by @li-plus in #35837
Make check_repository_consistency run faster by MP by @ydshieh in #36175
Fix the key name for _load_rng_state under torch.cuda by @wizyoung in #36138
Follow up to SpQR integration by @MekkCyber in #36176
Fix a mistake in #36175 by @ydshieh in #36179
Fix make_batched_videos and add tests by @yonigozlan in #36143
Uniformize OwlViT and Owlv2 processors by @yonigozlan in #35700
Add support for partial rotary embeddings in Phi3 model by @garg-amit in #35947
CI: fix test-save-trainer by @zucchini-nlp in #36191
Chat template docs by @zucchini-nlp in #36163
Add ImageProcessorFast to Qwen2.5-VL processor by @Isotr0py in #36164
Prepare processors for VideoLLMs by @zucchini-nlp in #36149
Add require_read_token to fp8 tests by @MekkCyber in #36189
Revert qwen2 breaking changes related to attention refactor by @ArthurZucker in #36162
Guard against unset resolved_archive_file by @dmlap in #35628
[Bugfix] Fix reloading of pixtral/llava configs by @kylesayrs in #36077

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@jiqing-feng
- Fix whisper compile (#35413)
- Enable gptqmodel (#35012)
- fix document qa bf16 pipeline (#35456)
- Fix vits low-precision dtype (#35418)
- fix low-precision audio classification pipeline (#35435)
- Refactor OPT model (#36101)
@AhmedAlmaghz
- [i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic (#35193)
- [i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic (#35198)
@sbucaille
- Add SuperGlue model (#29886)
@Isotr0py
- Fix head_dim in config extracted from Gemma2 GGUF model (#35818)
- Split and clean up GGUF quantization tests (#35502)
- Add ImageProcessorFast to Qwen2.5-VL processor (#36164)
@ShuaiBai623
- add qwen2.5vl (#35569)
@alex-jw-brooks
- Granite Vision Support (#35579)
- Update Granite Vision Model Path / Tests (#35998)
@pglorio
- Add Zamba2 (#34517)
- Fix RMSNormGated in Zamba2 (#35943)
@conditionedstimulus
- Add DAB-DETR for object detection (#30803)
@jadechoghari
- Adding RT-DETRv2 for object detection (#34773)
@geetu040
- Add Apple's Depth-Pro for depth estimation (#34583)
@zhuhanqing
- Optim: APOLLO optimizer integration (#36062)
@bauwenst
- Move DataCollatorForMultipleChoice from the docs to the package (#34763)
@elvircrn
- Efficient Inference Kernel for SpQR (#34976)

- Python
Published by LysandreJik over 1 year ago

transformers - Patch release v4.48.3

Patch release v4.48.3

This mostly ends the Python 3.9 issues!

Add future import for Py < 3.10 (#35666) by @Rocketknight1

In some very niche cases, the new rope embedding introduced device failures:

Fix device in rope module when using dynamic updates (#35608) by @Cyrilvallez

Num items in batch

Fix model kwargs (#35875) by @muellerzr: this was long overdue, sorry that it took so long. Some models were not compatible with num_items_in_batch.
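
For context, the release note does not spell out what num_items_in_batch is for. The sketch below is an assumption-labeled illustration of the general idea behind normalizing a loss by the total number of items across gradient accumulation steps. The function names and toy numbers are hypothetical and are not the Trainer's code.

```py
def mean_of_means(micro_batches):
    # Average each micro-batch first, then average those averages.
    per_batch = [sum(b) / len(b) for b in micro_batches]
    return sum(per_batch) / len(per_batch)


def normalized_by_num_items(micro_batches):
    # Sum every per-item loss, then divide once by the total item count.
    num_items_in_batch = sum(len(b) for b in micro_batches)
    return sum(sum(b) for b in micro_batches) / num_items_in_batch


# Two accumulation steps with very different numbers of items (toy values).
micro_batches = [[1.0, 1.0, 1.0, 1.0], [4.0]]
print(mean_of_means(micro_batches))            # 2.5
print(normalized_by_num_items(micro_batches))  # 1.6
```

The two results only agree when every micro-batch holds the same number of items, which is why a model's loss function needs to accept the total count to reproduce the loss of one large batch.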

Finally, the fix to Gemma2 is propagated to PaliGemma 2!

Paligemma: fix generation with Gemma2 (#36044) by @zucchini-nlp

- Python
Published by ArthurZucker over 1 year ago

transformers - Patch release v4.48.2

Patch release v4.48.2

Sorry, the fixes for num_items_in_batch are not done yet 😓 To follow along, see this PR; a new patch will be available soon!

For now, we mostly had backward-compatibility (BC) issues with Python 3.9:

Restore is_torch_greater_or_equal_than for backward compatibility (#35734) by @tlrmchlsmth
Fix NoneType type as it requires py>=3.10 (#35843) by @SunMarc
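
The two kinds of Python 3.9 incompatibility mentioned in these patch notes can be illustrated with a small sketch. This is a generic example of the language features involved, not the code from the pull requests; the describe function is hypothetical.

```py
import sys

# types.NoneType only exists on Python 3.10+; type(None) is the same class
# and is available on every version.
if sys.version_info >= (3, 10):
    from types import NoneType
else:
    NoneType = type(None)

# A future import must be the first statement of a real module. It is compiled
# from a string here only so that the snippet runs wherever it is pasted.
source = (
    "from __future__ import annotations\n"
    "def describe(value: int | None = None) -> str:\n"
    "    return 'missing' if value is None else str(value)\n"
)
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
describe = namespace["describe"]

# With the future import, annotations are stored as strings and not evaluated,
# so the "int | None" syntax does not fail on Python 3.9.
print(describe.__annotations__["value"])
print(isinstance(None, NoneType))
```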

Then we had a small regression for DBRX saving:

Fix: loading DBRX back from saved path (#35728) by @zucchini-nlp

Finally, we have a fix for Gemma and the hybrid attention architectures:

Fix mask slicing for models with HybridCache (#35681) by @Cyrilvallez

Miscellaneous:

Fix is_causal being a tensor (#35791) by @IlyasMoutawwakil

- Python
Published by ArthurZucker over 1 year ago

transformers - Patch release v4.48.1

Patch release v4.48.1

Yet again, we have a gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken!

Moonshine had a small issue when wrapping generate so we removed that!

[Phi] bias should be True (#35650) @ArthurZucker
Fix condition when GA loss bug fix is not performed (#35651) @techkang
Patch moonshine (#35731) @eustlb

🤗

- Python
Published by ArthurZucker over 1 year ago

transformers - v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine

New models

ModernBERT

The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.

It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:

Rotary Positional Embeddings to support sequences of up to 8192 tokens.
Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
GeGLU layers replacing the original MLP layers, shown to improve performance.
Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
Flash Attention to speed up processing.
A model design following the recent paper The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
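
The alternating attention pattern in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the ModernBERT implementation: it assumes that layer 0 is a global layer, that the 128-token window is centered on the query (64 tokens on each side), and the helper names are hypothetical.

```py
def is_global_layer(layer_idx, global_every=3):
    # Assumption: layers 0, 3, 6, ... use global attention.
    return layer_idx % global_every == 0


def can_attend(query_pos, key_pos, layer_idx, window=128):
    # Global layers see the whole sequence; local layers only see a
    # sliding window around the query (the encoder is bidirectional).
    if is_global_layer(layer_idx):
        return True
    return abs(query_pos - key_pos) <= window // 2


print([is_global_layer(i) for i in range(6)])
# [True, False, False, True, False, False]
print(can_attend(0, 500, layer_idx=0))  # True
print(can_attend(0, 500, layer_idx=1))  # False
```

Restricting most layers to a local window keeps their attention cost roughly linear in sequence length, while the periodic global layers still let information travel across the full 8192-token context.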

Add ModernBERT to Transformers by @warner-benjamin in #35158

Aria

The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.

Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.

Add Aria by @aymeric-roucher in #34157

TimmWrapper

We add a set of TimmWrapper classes so that timm models can be loaded into the library as transformers models.

Here's a general usage example:

```py
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor

checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

# Load the image processor and the model from the same timm checkpoint
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 class probabilities (in percent) and their class indices
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
```

Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:

```py
import torch
from urllib.request import urlopen
from PIL import Image

from transformers import pipeline

img = Image.open(urlopen(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
```

Add TimmWrapper by @qubvel and @amyeroberts in #34564

Pixtral-Large

Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.

Update Pixtral conversion script to support large format! by @arthurzucker in #34801

ColPali

The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo. Work led by ILLUIN Technology.

In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
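
The late interaction scoring mentioned above can be sketched in a few lines of plain Python. This is a toy illustration of ColBERT-style scoring (for each query vector, take the best-matching document vector, then sum), not the ColPali implementation; the function names and the tiny 2-D embeddings are made up for the example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_embeddings, document_embeddings):
    """Sum, over query vectors, of the best dot product with any document vector."""
    return sum(
        max(dot(q, d) for d in document_embeddings)
        for q in query_embeddings
    )


# Two query vectors scored against the multi-vector embeddings of two "pages".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.1], [0.3, 0.2]]

scores = {
    "doc_a": late_interaction_score(query, doc_a),
    "doc_b": late_interaction_score(query, doc_b),
}
best = max(scores, key=scores.get)
print(best, scores)
```

Because each document keeps one vector per image patch, the documents can be embedded once offline and only this cheap scoring step runs at query time.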

Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736

Falcon3

Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performance while reducing training costs:

One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT (teratokens) of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT (gigatokens) of curated high-quality data, thereby redefining pre-training efficiency.

Add Falcon3 documentation by @mokeddembillel in #35307

Bamba

Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

Check out all Bamba-9B model checkpoints here.

Add the Bamba Model by @fabianlim in #34982

VitPose

ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".

The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.

Add VitPose by @SangbumChoi and @NielsRogge in #30530

DINOv2 with registers

The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.

Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.

The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:

no artifacts
interpretable attention maps
and improved performance.

Add DINOv2 with registers by @NielsRogge in #35348

Emu3

The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.

Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.

Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.

Add Emu3 by @zucchini-nlp in #33770

Cohere2

A new Cohere update was added through a new "Cohere2" set of classes.

Add Cohere2 model by @alexrs-cohere in #35224

TextNet

TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.

Add TextNet by @jadechoghari in #34979

DiffLlama

DiffLlama combines the Llama architecture with Differential Transformer's attention.

Add DiffLllama by @weak-kajuma in #34083

PixtralLarge

The conversion script needed a few updates, while the modeling code was barely changed!

[PixtralLarge] Update Pixtral conversion script to support large format! (#34801)

Moonshine

Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
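
To see why rotary embeddings are not tied to a fixed input length, here is a minimal sketch of the idea on a single 2-D feature pair. It is not Moonshine's code: real RoPE rotates many feature pairs at different frequencies, and the function names and the `theta` value are assumptions chosen for the illustration.

```python
import math


def rotate(pair, position, theta=0.1):
    """Rotate a 2-D feature pair by an angle proportional to its position."""
    x, y = pair
    angle = position * theta
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def attention_score(query, key, query_pos, key_pos):
    """Dot product of the position-rotated query and key."""
    q = rotate(query, query_pos)
    k = rotate(key, key_pos)
    return q[0] * k[0] + q[1] * k[1]


query, key = (1.0, 2.0), (0.5, -1.0)
# Same offset (2 positions apart), very different absolute positions.
near = attention_score(query, key, query_pos=3, key_pos=1)
far = attention_score(query, key, query_pos=103, key_pos=101)
print(abs(near - far) < 1e-9)
```

The score depends only on the offset between the two positions, not on where they sit in the sequence, so no table of absolute positions has to be sized in advance.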

Add Moonshine by @eustlb in #34784

Quantization methods

VPTQ Quantization

From the VPTQ contributors:

VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy. More details here: https://github.com/microsoft/vptq
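
As a rough intuition for how vector quantization reaches such low bit-widths, the sketch below replaces small groups of weights with the index of the nearest entry in a shared codebook. It is a generic toy example, not the VPTQ algorithm: the codebook, weights, and function names are invented for illustration, and a real method learns its codebooks rather than hard-coding them.

```python
import math


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def quantize(weights, codebook, group_size):
    """Split the weights into groups and store one codebook index per group."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [nearest_index(group, codebook) for group in groups]


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [value for index in indices for value in codebook[index]]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.1, -0.8, 1.2, 0.1, -0.1, 0.7, -1.3]

indices = quantize(weights, codebook, group_size=2)
restored = dequantize(indices, codebook)
# Four codebook entries need 2 bits per index, and each index covers 2 weights.
bits_per_weight = math.log2(len(codebook)) / 2
print(indices, restored, bits_per_weight)
```

In this toy setup every pair of weights costs a single 2-bit index, i.e. 1 bit per weight, plus the small fixed cost of storing the codebook itself.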

FEAT : Adding VPTQ quantization method to HFQuantizer by @wejoncy in #34770

HIGGS Quantization

From the contributors:

HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.

Runtime support for HIGGS is implemented through FLUTE and its library.

This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.

HIGGS Quantization Support by @BlackSamorez in #34997

Cleanup

We merged a cleanup for vision language models to make sure all models are standardized.

VLMs: major clean up 🧼 (#34502)

Breaking changes

Conversion scripts

Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.

In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.

However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
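
For readers who want to check which conversion scripts exist in a source checkout of the main branch, the glob pattern quoted above can be applied with `pathlib`. This is a small convenience sketch; the helper name is hypothetical, and the argument is assumed to be the directory that contains `models/` (for example `src/transformers` in a clone of the repository).

```python
from pathlib import Path


def find_conversion_scripts(transformers_src):
    """List files matching models/**/convert_*.py under the given directory."""
    models_dir = Path(transformers_src) / "models"
    return sorted(models_dir.glob("**/convert_*.py"))


# Example usage on a local clone (path is an assumption about your checkout):
# for script in find_conversion_scripts("transformers/src/transformers"):
#     print(script)
```

Run against an installed release wheel from this version onwards, the same helper would be expected to return an empty list, since the scripts are no longer packaged.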

🚨🚨🚨 Delete conversion scripts when making release wheels by @Rocketknight1 in #35296

Backtracking in Nougat

A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.

🚨🚨🚨 Limit backtracking in Nougat regexp by @qubvel in #35264

Whisper decoding

This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:

➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.

➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.

Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).

In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
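
The new rule can be restated as a tiny predicate. The helper below is hypothetical and is not part of transformers; it only encodes the condition described above so that the combinations are easy to check, and its arguments mirror the `generate` flags named in the text.

```python
def output_includes_decoder_ids_and_eos(
    *, return_dict_in_generate, return_timestamps, force_unique_generate_call
):
    """True when the Whisper output keeps decoder input IDs and the EOS token ID.

    Encodes: return_dict_in_generate=True and
    (return_timestamps=False or force_unique_generate_call=True).
    """
    return bool(
        return_dict_in_generate
        and (not return_timestamps or force_unique_generate_call)
    )


# Enumerate every flag combination to see which ones keep the extra tokens.
for rdig in (False, True):
    for timestamps in (False, True):
        for unique_call in (False, True):
            kept = output_includes_decoder_ids_and_eos(
                return_dict_in_generate=rdig,
                return_timestamps=timestamps,
                force_unique_generate_call=unique_call,
            )
            print(rdig, timestamps, unique_call, "->", kept)
```

Code that previously stripped the decoder prompt or EOS token from short-form outputs may need to branch on this condition after upgrading.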

[Whisper] 🚨 Fix whisper decoding 🚨 by @eustlb in #34135

Attention refactor

To make the attention layers cleaner, more isolated, and future-proof, they have been refactored: model-specific attention code stays within each model's own file, while the attention definitions for SDPA, Flash Attention, and other attention types have been moved to a common file.

🚨All attention refactor🚨 by @ArthurZucker in #35235

Bugfixes and improvements

[tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer (#35593)
Setup loss_type in config at model init time (#34616)
[docs] Update Python version in translations by @jla524 in #35096
[docs] top_p, top_k, temperature docstrings by @stevhliu in #35065
Fix private forked repo. CI by @ydshieh in #35114
Add feature dim attributes to BitLinear for easier PEFT integration by @agostinv in #34946
Update I-JEPA checkpoints path by @qubvel in #35120
Fix GA loss bugs and add unit test by @techkang in #35121
[I-JEPA] Update docs by @NielsRogge in #35148
Corrected typo in agent system prompts by @Uvi-12 in #35143
Option to set 'non_blocking' for to(device) in BatchEncoding and BatchFeature by @daniel-bogdoll in #34883
Fix typo in EETQ Tests by @MekkCyber in #35160
Cleanup: continue the init refactor by @LysandreJik in #35167
Super tiny fix logging message by @fzyzcjy in #35132
Fixed typo of 'avilable' in prompts.py by @Uvi-12 in #35145
[CI] Fix bnb quantization tests with accelerate>=1.2.0 by @matthewdouglas in #35172
Fix num_items_in_batch not being an integer by @xspirus in #35115
Assisted decoding multi-gpu by @zucchini-nlp in #35116
Fix file path for shard_num 1 with mllama converter by @strangiato in #35053
Support BatchNorm in Hubert pos_conv_emb as in fairseq by @gallilmaimon in #34389
Remove unnecessary masked_fill in deberta models by @xadupre in #35182
Fix DBRX LayerNorm init method by @hgt312 in #35177
Fixing GGUF support for StableLm by @MekkCyber in #35060
[i18n-ar] Translated file : docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
Multiple typo fixes in NLP, Audio docs by @henryhmko in #35181
Only import torch.distributed if it is available by @GaetanLepage in #35133
[i18n-] Translating Benchmarks.md to Chinese by @asdkfjsd in #35137
[docs] Fix FlashAttention link by @stevhliu in #35171
Update data collator docstrings to accurately reference Nvidia tensor core compute capability version by @johngrahamreynolds in #35188
[i18n-] Translating agents.md to Chinese by @HMJ0628 in #35139
BLIP: enable device map by @zucchini-nlp in #34850
🧹 Remove deprecated RotaryEmbedding parts in the Attention layers by @Cyrilvallez in #34858
[PEFT] Better Trainer error when prompt learning with loading best model at the end by @BenjaminBossan in #35087
Cleanup: continue the init refactor by @LysandreJik in #35170
Fix CI by @Cyrilvallez in #35208
Fix seamless TTS generate by @ylacombe in #34968
docs: clarify initializer_range parameter description in Idefics3VisionConfig by @h3110Fr13nd in #35215
Fixed typo of 'indentifier' in audio_utils.py by @Uvi-12 in #35226
Fix type hints for apply_chat_template by @Rocketknight1 in #35216
Support Python 3.10+ Union style in chat template type hints parsing by @RezaRahemtola in #35103
Refactoring AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
Change back to Thread for SF conversion by @ydshieh in #35236
[Init refactor] Modular changes by @LysandreJik in #35240
Fix typo in chat template example by @EricWinsorDSIT in #35250
Run model as compressed/uncompressed mode by @horheynm in #34719
skip Fuyu from test_generate by @nhamanasu in #35246
[tests] fix "Tester object has no attribute '_testMethodName'" by @faaany in #34910
Use rsfE with pytest by @ydshieh in #35119
Update AMD docker image (rocm 6.1) by @ivarflakstad in #35259
Fixed typos in Audio Classification Documentation by @Uvi-12 in #35263
Translating agents_advanced.md to Chinese by @HMJ0628 in #35231
Fix FSDP no longer working by @muellerzr in #35212
don't use no_sync when deepspeed doesn't support it for certain zero stages by @winglian in #35157
[i18n-Chinese] Translating perf_train_cpu.md to Chinese by @asdkfjsd in #35242
Fall back to slow image processor in ImageProcessingAuto when no fast processor available by @yonigozlan in #34785
Aggeregate test summary files in CircleCI workflow runs by @ydshieh in #34989
Blip: fix offloading and MP tests by @zucchini-nlp in #35239
Fix : model used to test ggml conversion of Falcon-7b is incorrect by @MekkCyber in #35083
Temporarily disable amd push ci by @ivarflakstad in #35293
Delete redundancy for loop checks. by @zhanluxianshen in #35288
[Whisper] patch float type on mps by @eustlb in #35295
Fix typos in Translated Audio Classification Docs by @jla524 in #35287
Translating "translate perf_infer_gpu_multi.md" to Chinese by @HMJ0628 in #35271
Fix wrongs in quicktour[zh] by @zhanluxianshen in #35272
Improved documentation of Automatic speech recognition by @Uvi-12 in #35268
fix modular order by @ArthurZucker in #35297
Add sdpa for Beit by @OmarManzoor in #34941
Support for SDPA for SAM models by @MagnusS0 in #34110
remove benchmark job in push-important-models.yml by @ydshieh in #35292
Fix typos in translated quicktour docs by @jla524 in #35302
Fix image preview in multi-GPU inference docs by @jla524 in #35303
Fix remove unused parameter in docs by @zzzzzsa in #35306
Add Cohere2 docs details by @alexrs-cohere in #35294
Fixed typo in audio_classification.md by @Uvi-12 in #35305
[docs] Improve register_pipeline by @stevhliu in #35300
Fix loading with only state dict and low_cpu_mem_usage = True by @SunMarc in #35217
[tests] make cuda-only tests device-agnostic by @faaany in #35222
Trigger GitHub CI with a comment on PR by @ydshieh in #35211
change bnb tests by @jiqing-feng in #34713
[Whisper] fix docstrings typo by @eustlb in #35319
feat: add benchmarks_entrypoint.py by @McPatate in #34495
Fix documentation for ColPali by @tonywu71 in #35321
Update comment CI bot by @ydshieh in #35323
PaliGemma: Make sure to add to suffix if is present in text by @probicheaux in #35201
Fix some fa2 tests by @ArthurZucker in #35340
Modernbert Release Fixes by @warner-benjamin in #35344
[docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
fix onnx export of speech foundation models by @nikosanto13 in #34224
[Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
Reduce CircleCI usage by @ydshieh in #35355
Implement AsyncTextIteratorStreamer for asynchronous streaming by @CISC in #34931
Cleaner attention interfaces by @Cyrilvallez in #35342
Add Tensor Parallel support for Qwen2VL by @jla524 in #35050
fix zoedepth initialization error under deepspeed zero3 by @Tavish9 in #35011
Aurevoir PyTorch 1 by @ydshieh in #35358
bugfix: torch.export failure caused by _make_causal_mask by @jiwoong-choi in #35291
update codecarbon by @nhamanasu in #35243
Update test fetcher when we want to test all by @ArthurZucker in #35364
Use weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
Make test_generate_with_static_cache even less flaky by @ydshieh in #34995
Improve modular transformers documentation by @joelpaulkoch in #35322
Improved Documentation Of Audio Classification by @Uvi-12 in #35368
[docs] Follow up register_pipeline by @stevhliu in #35310
owlvit/2 dynamic input resolution by @bastrob in #34764
Fix new FA2 if is_causal is passed explicitly by @Cyrilvallez in #35390
bitsandbytes: simplify 8bit dequantization by @matthewdouglas in #35068
make LlamaModel._update_causal_mask torch compilable by @winglian in #35187
Patch GPTNeoX to use adequate FA2 if position_ids is provided by @taha-yassine in #35318
uniformize kwargs for SAM by @tibor-reiss in #34578
Deprecate _is_quantized_training_enabled by @MekkCyber in #34991
Scale loss before backward by @qgallouedec in #35207
Fix typing in docstring for PaliGemmaProcessor by @alvarobartt in #35278
Fix : VPTQ test by @MekkCyber in #35394
add bnb support for Ascend NPU by @statelesshz in #31512
bugfix Idefics3 processor - handle gracefully cases with text and no images by @mfarre in #35363
Adding logger.info about update_torch_dtype in some quantizers by @MekkCyber in #35046
Add compile test for fast image processor by @yonigozlan in #35184
Disable .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
enable non-cuda awq model support without modify version by @jiqing-feng in #35334
[GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
Drop inplace operation for loss computation with gradient accumulation by @qgallouedec in #35416
Fix: Rename keyword argument in_channels to num_channels by @ningyuv in #35289
CLIP conversion script - Change fairseq to OpenAI by @gau-nernst in #35384
Fix f-string to show ACCELERATE_MIN_VERSION on error by @KSafran in #35189
Fix model_accepts_loss_kwargs for timm model by @qubvel in #35257
Update perf_infer_gpu_one.md: fix a typo by @martin0258 in #35441
Add compute_loss_func to Seq2SeqTrainer by @d223302 in #35136
Update docs for sdpa_kernel by @jla524 in #35410
[i18n-ar] Translated file: docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
[i18n-ar] Translated file: docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
Update translated docs for sdpa_kernel by @jla524 in #35461
Reintroduce Python 3.9 support for ModernBERT by @tomaarsen in #35458
Fix new BNB test failures by @matthewdouglas in #35345
Fix docs typos. by @zhanluxianshen in #35465
Fix paligemma warning message by @hiyouga in #35486

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@ydshieh
- Fix private forked repo. CI (#35114)
- Change back to Thread for SF conversion (#35236)
- Use rsfE with pytest (#35119)
- Aggeregate test summary files in CircleCI workflow runs (#34989)
- remove benchmark job in push-important-models.yml (#35292)
- Trigger GitHub CI with a comment on PR (#35211)
- Update comment CI bot (#35323)
- Reduce CircleCI usage (#35355)
- Aurevoir PyTorch 1 (#35358)
- Use weights_only=True with torch.load for transfo_xl (#35241)
- Make test_generate_with_static_cache even less flaky (#34995)
- Disable .github/workflows/self-comment-ci.yml for now (#35366)
@aymeric-roucher
- Add Aria (#34157)
@NielsRogge
- [I-JEPA] Update docs (#35148)
- Add DINOv2 with registers (#35348)
@HMJ0628
- [i18n-] Translating agents.md to Chinese (#35139)
- Translating agents_advanced.md to Chinese (#35231)
- Translating "translate perf_infer_gpu_multi.md" to Chinese (#35271)
@alexrs-cohere
- Add Cohere2 model (#35224)
- Add Cohere2 docs details (#35294)
@ArthurZucker
- fix modular order (#35297)
- 🚨All attention refactor🚨 (#35235)
- Fix some fa2 tests (#35340)
- Update test fetcher when we want to test all (#35364)
@tonywu71
- Add ColPali to 🤗 transformers (#33736)
- Fix documentation for ColPali (#35321)
@OmarManzoor
- Add sdpa for Beit (#34941)
@fabianlim
- Add the Bamba Model (#34982)
@warner-benjamin
- Add ModernBERT to Transformers (#35158)
- Modernbert Release Fixes (#35344)
@wejoncy
- FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
@bastrob
- owlvit/2 dynamic input resolution (#34764)
@BlackSamorez
- HIGGS Quantization Support (#34997)

- Python
Published by LysandreJik over 1 year ago

transformers - v4.47.1

Patch release v4.47.1

We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!

Fix GA loss bugs and add unit test (#35121) Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115) Contributed by @xspirus.
Fix FSDP no longer working (#35212) Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35157) Contributed by @winglian.
Only import torch.distributed if it is available (#35133) Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295) Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!

- Python
Published by ArthurZucker over 1 year ago

transformers - v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel

New models

PaliGemma-2

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

I-JEPA

The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

Add I-JEPA by @jmtzt in #33125

OLMo 2

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.

The architectural changes from the original OLMo model to this model are:

RMSNorm is used instead of standard layer norm.
Norm is applied to attention queries and keys.
Norm is applied after attention/feedforward layers rather than before.
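
As a minimal sketch of two of these changes, the snippet below implements RMSNorm on a plain list and contrasts applying a norm before a sublayer with applying it after. It is an illustration under assumptions, not the OLMo2 modeling code: the function names are hypothetical, and the post-norm form shown (normalize the sublayer output, then add the residual) is one reading of the third point.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def pre_norm_block(x, sublayer, norm):
    """Norm applied before the sublayer: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def post_norm_block(x, sublayer, norm):
    """Norm applied after the sublayer: x + norm(sublayer(x))."""
    return [a + b for a, b in zip(x, norm(sublayer(x)))]


hidden = [3.0, 4.0]
ones = [1.0, 1.0]
normed = rms_norm(hidden, ones)


def double(values):
    """Stand-in for an attention or feed-forward sublayer."""
    return [2.0 * v for v in values]


def unit_norm(values):
    """RMSNorm with unit weights, used as the `norm` argument."""
    return rms_norm(values, ones)


print(normed)
print(pre_norm_block(hidden, double, unit_norm))
print(post_norm_block(hidden, double, unit_norm))
```

Compared with standard layer norm, RMSNorm skips the mean subtraction and the bias term, so the output keeps the direction of the input and only its scale changes.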

Commits:

Add OLMo November 2024 by @2015aroras in #34551
Rename OLMo November to OLMo2 by @2015aroras in #34864

Layer-Skip Llama

We add support for Meta's Layer-Skip Llama 3.2 1B model.

The Llama3.2 1B model was continually pretrained with the LayerSkip recipe (early exit loss and layer dropout), as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.
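
The draft-then-verify loop behind self-speculative decoding can be sketched with two toy "models". This is a simplified greedy illustration, not the transformers implementation: `draft_next` stands in for an early-exit pass, `full_next` for the full model, and all names and the toy token rules are assumptions. A real implementation verifies all drafted tokens in one batched forward pass instead of one call per token.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference decoding: ask the full model for every token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(draft_next, full_next, prompt, max_new_tokens, num_draft=4):
    """Draft a few tokens cheaply, then keep only what the full model agrees with."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft: the cheap model proposes `num_draft` tokens.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: accept drafted tokens while they match the full model.
        for proposed in draft:
            expected = full_next(tokens)
            tokens.append(expected)  # on a mismatch, the full model's token wins
            if expected != proposed or len(tokens) >= target_len:
                break
    return tokens


def full_next(tokens):
    """Toy full model: a deterministic function of the context."""
    return (sum(tokens) * 7 + 3) % 11


def draft_next(tokens):
    """Toy draft model: agrees with the full model most of the time."""
    return full_next(tokens) if len(tokens) % 3 else 0


prompt = [1, 2, 3]
print(speculative_decode(draft_next, full_next, prompt, max_new_tokens=10))
print(greedy_decode(full_next, prompt, max_new_tokens=10))
```

Both calls print the same sequence: verification guarantees the output matches plain greedy decoding, and the speed-up in a real system comes from accepting several drafted tokens per full-model pass.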

Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240

Tensor Parallel implementation

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

to make modeling code as simple as in the single-worker case:
- all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible to users:
- added a model.tensor_parallel(device_mesh) method that allows users to turn a single-process model into a parallel model.

This is the first PR of many to simplify and enable Tensor Parallel across models.
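
For intuition about what tensor parallelism does to a single linear layer, here is a plain-Python sketch of column-wise sharding. It does not use torch.distributed or the API added in this PR: the "workers" are simulated in one process, the helper names are hypothetical, and column-wise splitting is only one of the sharding styles a tensor-parallel plan can use.

```python
def matvec(x, weight_columns):
    """Compute x @ W, where W is stored as a list of columns."""
    return [sum(a * b for a, b in zip(x, column)) for column in weight_columns]


def shard_columns(weight_columns, num_workers):
    """Give each worker a contiguous slice of the output columns."""
    if len(weight_columns) % num_workers != 0:
        raise ValueError("number of columns must be divisible by num_workers")
    per_worker = len(weight_columns) // num_workers
    return [
        weight_columns[i * per_worker:(i + 1) * per_worker]
        for i in range(num_workers)
    ]


def column_parallel_matvec(x, weight_columns, num_workers):
    """Each simulated worker computes its slice; the slices are then concatenated."""
    shards = shard_columns(weight_columns, num_workers)
    partial_outputs = [matvec(x, shard) for shard in shards]  # one per worker
    return [value for partial in partial_outputs for value in partial]  # "all-gather"


x = [1.0, 2.0, 3.0]
# A 3x4 weight matrix stored column by column.
weight_columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

print(matvec(x, weight_columns))
print(column_parallel_matvec(x, weight_columns, num_workers=2))
```

Each worker only ever holds its own slice of the weights, which is what lets a layer that is too large for one device be spread across several.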

Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184

Farewell, Python 3.8

Python 3.8 has reached end of life and, as such, we are dropping it from our CI.

Drop support for Python 3.8 by @ydshieh in #34314

GGUF improvements

Several improvements have been made to the GGUF support in transformers, notably by adding new architectures to the list of supported architectures.

Add T5 GGUF loading support by @junejae in #33389
Add GGUF for Mamba by @VladOS95-cyber in #34200
Add Nemotron GGUF Loading Support by @farrosalferro in #34725
Improve gguf tensor processing by @VladOS95-cyber in #34515
Fix use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450

Fast processors

We continue the work to improve the speed of fast processors as detailed in this roadmap.

We contribute a fast processor to RT-DETR.

Add Image Processor Fast RT-DETR by @yonigozlan in #34354

New pipelines

A new pipeline has been added to transformers: image-text-to-text!

The pipeline supports the following inputs:

unbatched images and text - images=image, text=text
batched images and text - images = [image, image], text= [text, text]
several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... ......", "......"]
Chat templates (for models supporting them).

Add image text to text pipeline by @yonigozlan in #34170
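
The input forms listed above can be exercised with a small helper that takes any pipeline-like callable. This is a sketch, not an official example: the helper name is hypothetical, the checkpoint in the commented usage is a placeholder rather than a name from the release notes, and prompts for models that use an image token would need that token placed in the text.

```python
def call_with_supported_inputs(pipe, image, text):
    """Call an image-text-to-text pipeline with the unbatched, batched, and nested forms."""
    return {
        # One image, one prompt.
        "unbatched": pipe(images=image, text=text),
        # One image per prompt, batched.
        "batched": pipe(images=[image, image], text=[text, text]),
        # Several images per prompt (only for models that use an image token).
        "several_per_prompt": pipe(images=[[image, image], [image]], text=[text, text]),
    }


# With transformers installed, `pipe` could be built like this:
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="<an image-text-to-text checkpoint>")
# results = call_with_supported_inputs(pipe, image, "Describe this image.")
```

Keeping the calls behind one helper makes it easy to confirm that a given checkpoint accepts each form before relying on it in a batch job.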

Notable refactors

Separate chat templates into a single file

We have had several issues with chat templates because they're stored as single lines in the JSON config files:

Impossible to review diffs
Very hard to edit in the web UI (or in general)
Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead

The solution:

Just move chat templates to a single chat_template.jinja file in the repo
If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.
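
The precedence rule described above (a chat_template.jinja file overrides the JSON entry) can be sketched as a small lookup function. This helper is hypothetical and is not how transformers loads templates internally; it only reads the two file names mentioned in the text from a local directory.

```python
import json
from pathlib import Path


def resolve_chat_template(repo_dir):
    """Return the chat template for a local repo, preferring chat_template.jinja."""
    repo = Path(repo_dir)

    # A raw Jinja file, when present, wins over anything stored in JSON.
    jinja_file = repo / "chat_template.jinja"
    if jinja_file.exists():
        return jinja_file.read_text(encoding="utf-8")

    # Otherwise fall back to the chat_template entry in tokenizer_config.json.
    config_file = repo / "tokenizer_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        return config.get("chat_template")

    return None
```

Storing the template as its own file keeps it multi-line and diffable, which addresses the review and editing problems listed earlier.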

For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.

Separate chat templates into a single file by @Rocketknight1 in #33957

Large modular logic refactor

This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:

visit the whole modular file (record imports/functions/classes/assignments nodes)
  - create function dependency mapping
for each import coming from another model:
  - visit the corresponding file
  - create function dependency mapping
  - update mapping with function/assignment from the modular (updated/new functions)
  - create the class dependency graph based on merged dependencies
update dependency graph of the modular with the functions and assignments imported from the other files
for each class recorded in the modular:
  - if inheriting from a class in another file:
    - replace call to super
    - find the dependencies after the node was replaced
    - follow (updated with modular defs) dependency mapping to add all nodes
  - else:
    - only add needed imported functions (and their dependencies)
determine the needed imports and add them

Large modular logic refactoring by @Cyrilvallez in #34487
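
The "follow the dependency mapping to add all nodes" step of the converter logic described above can be pictured as a transitive closure over a name-to-dependencies mapping. The sketch below is a simplified stand-in, not the converter's code: the mappings are hand-written dictionaries instead of parsed syntax trees, and every name in it is made up for the example.

```python
def collect_dependencies(names, dependency_mapping):
    """Return `names` plus everything they depend on, directly or transitively."""
    needed = set()
    stack = list(names)
    while stack:
        name = stack.pop()
        if name in needed:
            continue  # already handled, also protects against cycles
        needed.add(name)
        stack.extend(dependency_mapping.get(name, ()))
    return needed


# Mapping built from the parent model file.
parent_mapping = {
    "forward": {"rotate_half", "apply_rope"},
    "apply_rope": {"rotate_half"},
    "unused_helper": set(),
}
# Definitions from the modular file override the parent's entries.
modular_mapping = {"apply_rope": {"new_rotate"}}
merged = {**parent_mapping, **modular_mapping}

needed = collect_dependencies(["forward"], merged)
print(sorted(needed))
```

Only the names reachable from what a class actually uses end up in the generated file, so helpers such as `unused_helper` are never copied over.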

Community bugfixes and improvements

Remove graph breaks for torch.compile() in flash_attention_forward when Lllama Model is padding free tuned by @Abhishek-TAMU in #33932
Better defaults by @ArthurZucker in #34026
translated gguf.md into chinese by @blueingman in #34163
CI: fix failures by @zucchini-nlp in #34371
Zamba is an LM by @LysandreJik in #34342
add code generation to natural language processing section by @furtnerthomas in #34333
Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
refactor: remove redundant if-condition and improve type correctness for convert_tokens_to_ids by @winstxnhdw in #34030
Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
[PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
Fix torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
Correct the new defaults by @Cyrilvallez in #34377
[auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
Fix glm by @Cyrilvallez in #34388
Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
Fix onnx non-exportable inplace aten op by @IlyasMoutawwakil in #34376
Fix right padding in LLaVA models by @zucchini-nlp in #34305
no filter by @ydshieh in #34391
SynthID: better example by @gante in #34372
Tests: upgrade test_eager_matches_sdpa_generate by @gante in #34386
Fix bnb training test failure by @matthewdouglas in #34414
Avoid check expected exception when it is on CUDA by @ydshieh in #34408
Fix typos in agents_advanced.md by @rudydel in #34405
[docs] Cache implementations by @stevhliu in #34325
Fix pix2struct by @IlyasMoutawwakil in #34374
pin tensorflow_probability<0.22 in docker files by @ydshieh in #34381
Tiny update after #34383 by @ydshieh in #34404
Fix batch size handling in prediction_loop for DataLoaderShard by @zeus2611 in #34343
exclude fsdp from delay_optimizer_creation by @eljandoubi in #34140
New option called "best" for args.save_strategy. by @seanswyi in #31817
[docs] update input documentation for MAMBA2 and MISTRAL models to include cache_position and attention_mask details by @h3110Fr13nd in #34322
🌐 [i18n-KO] Translated model_doc/barthez.md to Korean by @Jwaminju in #33980
Apply linting to the important code blocks to make it readable by @ShubhamJagtap2000 in #34449
Torchao weights only + prequantized compatibility by @SunMarc in #34355
[i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
enable average tokens across devices by @techkang in #34373
feat: run benchmarks on A100 by @McPatate in #34287
Add post_process_depth_estimation for GLPN by @alex-bene in #34413
LLaVA: latency issues by @zucchini-nlp in #34460
Generation: fix test by @zucchini-nlp in #34369
Fix CI by @zucchini-nlp in #34458
use a tiny model to test generation config which avoids timeout by @techkang in #34482
🚨🚨🚨 [SuperPoint] Fix keypoint coordinate output and add post processing by @sbucaille in #33200
Simplify running tests in a subprocess by @ydshieh in #34213
Fix perplexity computation in perplexity.md by @Framartin in #34387
Fixes for Modular Converter on Windows by @hlky in #34266
Fix regression loading dtype by @SunMarc in #34409
Bert is ExecuTorch compatible by @guangy10 in #34424
manual head_dim for mixtral model by @wavy-jung in #34281
fix-qwen2vl-no-position_ids by @simonJJJ in #33487
Bug fix for drop path decay rate in swin transformer by @abhi-glitchhg in #34291
MobileBERT is ExecuTorch compatible by @guangy10 in #34473
Albert is ExecuTorch compatible by @guangy10 in #34476
Adding optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
Fix performance in get_imports regexp by @AlekseyLobanov in #34298
fix incorrect warning by @yonigozlan in #34416
Un-deprecate timeout arg in pipelines by @Rocketknight1 in #34382
Roberta is ExecuTorch compatible by @guangy10 in #34425
Fix format mistake in string repr of tokenizer objects by @gpetho in #34493
Mllama: update docs by @zucchini-nlp in #34334
VLMs: fix number of image tokens by @zucchini-nlp in #34332
Tests: move generate tests to the right mixin and delete redundant tests by @gante in #34464
fix pixtral processor by @molbap in #34486
Use torch 2.5 in scheduled CI by @ydshieh in #34465
Fix super tiny extra space typo by @fzyzcjy in #34440
UPDATE Documentation for #TRANSLATING.md Documentation into Multiple Languages.(Changes made) by @anshumangahlot in #34226
enable QA bf16 pipeline by @jiqing-feng in #34483
Fix: img size mismatch caused by incorrect unpadding in LLaVA-Next by @jp1924 in #34522
Fix step shifting when accumulate gradient by @kibitzing in #33673
avoid calling gc.collect and cuda.empty_cache by @ydshieh in #34514
Qwen2VL: skip base input_ids-inputs_embeds equivalence check by @gante in #34535
fix(DPT,Depth-Anything) Address expected_slice errors inside inference tests by @philkuz in #34518
feat: add benchmarks pg indexes by @McPatate in #34536
make test_eager_matches_sdpa_inferenceless flaky by @ydshieh in #34512
Bug Fix for issue #34294 by @fpgaminer in #34295
[CLIPSeg] Make interpolate_pos_encoding default to True by @NielsRogge in #34419
update doc by @jiqing-feng in #34478
[i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
Blip: get/set input embeddings correctly by @zucchini-nlp in #34152
BLIP: enable generation tests by @zucchini-nlp in #34174
🔴 🔴 fix query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
[i18n-HI] Translated accelerate page to Hindi by @karthik-script in #34443
Update trainer for easier handling of accumulate, compile fixes, and proper reporting by @muellerzr in #34511
VLM: special multimodal Tokenizer by @zucchini-nlp in #34461
MPS: isin_mps_friendly can support 0D tensors by @gante in #34538
Add text support to the Trainer's TensorBoard integration by @JacobLinCool in #34418
[i18n-HI] Translated TFLite page to Hindi by @karthik-script in #34572
🌐 [i18n-KO] Translated perf_train_special.md to Korean by @maximizemaxwell in #34590
🌐 [i18n-KO] Update README_ko.md by @J4BEZ in #33098
fix TrainerState doc because num_input_tokens_seen is unused by defau… by @techkang in #34593
Fix Whisper CI by @ydshieh in #34541
Skip DeepSpeed ZeRO Stage 3 model initialization when bnb by @eljandoubi in #34395
FIX: Broken repr of TorchAoConfig by @BenjaminBossan in #34560
Load sub-configs from composite configs by @zucchini-nlp in #34410
DistilBERT is ExecuTorch compatible by @guangy10 in #34475
Remove unused test_dataset by @thisisiron in #34516
Revert "Fix Whisper CI" by @ydshieh in #34605
Fix #34494 assistant tokens when truncated by @yonigottesman in #34531
Remove @slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
Changing repr in torchao to show quantized Linear by @MekkCyber in #34202
Fix torchvision interpolation CI by @yonigozlan in #34539
🌐 [i18n-KO] Translated convbert.md to Korean by @ahnjj in #34599
fix(dvclive): pass fake dataset to avoid exception in trainer init by @shcheklein in #34455
🌐 [i18n-KO] Translated timesformer.md to Korean by @mreraser in #33972
🌐 [i18n-KO] Translated bert.md to Korean by @maximizemaxwell in #34627
[i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
Update llm_engine.py by @louisbrulenaudet in #33332
Agents: turn any Space into a Tool with Tool.from_space() by @aymeric-roucher in #34561
[docs] update not-working model revision by @faaany in #34682
[i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
Agents: Small fixes in streaming to gradio + add tests by @aymeric-roucher in #34549
🌐 [i18n-KO] Translated marian.md to Korean by @maximizemaxwell in #34698
[docs] Broken link in generation_strategies by @pcuenca in #34717
Fix example in EsmConfig docstring by @yuanx749 in #34653
[docs] add xpu device check by @faaany in #34684
Retain newlines in chat template when continue_final_message=True by @lewtun in #34253
Update llava.md by @LysandreJik in #34749
fix(wandb): pass fake dataset to avoid exception in trainer (see #34455) by @CezaPasc in #34720
add xpu path for awq by @jiqing-feng in #34712
FSDP grad accum fix by @winglian in #34645
Remove FSDP wrapping from sub-models. by @eljandoubi in #34452
🧼 remove v4.44 deprecations by @gante in #34245
VLMs: patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
Fix broken link by @ofek in #34618
fix a typo bug where 'id2label' was incorrectly written as 'i2label' when reading config by @ZuoChenFttS in #34637
Fix skip of test_training_gradient_checkpointing by @dvrogozh in #34723
make sure to disable gradients for integer tensor by @winglian in #32943
[docs] make empty_cache device-agnostic by @faaany in #34774
[docs] add XPU besides CUDA, MPS etc. by @faaany in #34777
[tests] add XPU part to testing by @faaany in #34778
fix: Update pixel_values parameter in hf_model input by @thisisiron in #34782
Fix callback key name by @jung-hunsoo in #34762
fix: Wrong task mentioned in docs by @ecyht2 in #34757
Allow handling files as args for a tool created with Tool.from_space by @aymeric-roucher in #34687
Fix Whisper CI by @ydshieh in #34617
protect tensor parallel usage by @ArthurZucker in #34800
Trainer hyperparameter search kwargs docs update by @GuillemGSubies in #34459
feat: allow to use hf-hub models for timm backbone by @cgebbe in #34729
Support gradient checkpointing in Qwen2VL ViT by @li-plus in #34724
Fix: siglip image processor rgb_convert is not being applied correctly. by @jp1924 in #34301
fix cpu bnb path by @jiqing-feng in #34647
Gemma capping by @ArthurZucker in #34282
Fix cache_utils for optimum.quanto kvcache quantization by @SunMarc in #34750
Modular fix by @Cyrilvallez in #34802
MLU devices : Checks if mlu is available via a cndev-based check which won't trigger the drivers and leave mlu by @huismiling in #34326
🚨🚨🚨 fix(Mask2Former): torch export 🚨🚨🚨 by @philkuz in #34393
Feature: print tokens per second during training by @tibor-reiss in #34507
Add do_convert_rgb to vit by @jp1924 in #34523
Fix post process function called in the instance segmentation example of mask2former by @OnTheThirdDay in #34588
fix crash in tiiuae/falcon-11B-vlm image-to-text generation by @sywangyi in #34728
Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline by @yonigozlan in #34562
Add Image Processor Fast Deformable DETR by @yonigozlan in #34353
Run test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
Fix check_training_gradient_checkpointing by @ydshieh in #34806
Added image-text-to-text pipeline to task guide by @merveenoyan in #34783
Translate attention.md into Chinese by @wwwbai in #34716
LLaVA OV: fix unpadding precision by @zucchini-nlp in #34779
Fix low memory beam search by @zucchini-nlp in #34746
Fix the memory usage issue of logits in generate() by @kjohew in #34813
fix(DPT,Depth-Anything) torch.export by @philkuz in #34103
Fix: take into account meta device by @tibor-reiss in #34134
Fix hyperparameter search when optuna+deepspeed by @corentin-ryr in #34642
Fix CI by tweaking torchao tests by @SunMarc in #34832
Fix CI slack reporting issue by @ydshieh in #34833
VLMs: enable generation tests - last batch by @zucchini-nlp in #34484
Change logging level from warning to info for max_steps overriding num_train_epochs by @qgallouedec in #34810
Fix ds nvme by @eljandoubi in #34444
Fix heuristic scheduling for UAG by @jmamou in #34805
Refactor StarCoder2 using modular by @Cyrilvallez in #34015
Watermarking: fix order by @zucchini-nlp in #34849
Update checks for torch.distributed.tensor to require torch >= 2.5 by @loadams in #34816
Remove quantization related config from dequantized model by @konradkalita in #34856
Auto compile when static cache by @ArthurZucker in #34247
Speculative decoding: Test the target distribution (to prevent issues like #32867) by @keyboardAnt in #34553
smol improvements to support more flexible usage by @andimarafioti in #34857
[CI] Skip EETQ tests while package is broken with latest transformers by @BenjaminBossan in #34854
Bitnet test fix to avoid using gated model by @MekkCyber in #34863
Fix support for image processors modifications in modular by @yonigozlan in #34866
Fix: Enable prefill phase key value caching of nemotron/minitron models by @jeongin601 in #34742
Add safe_globals to resume training on PyTorch 2.6 by @dvrogozh in #34632
Cache: init empty cache when use_cache by @zucchini-nlp in #34274
BLIP: fix generation after hub update by @zucchini-nlp in #34876
[Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
🔴 Mllama: fix base prefix by @zucchini-nlp in #34874
Sum gathered input tokens by @techkang in #34554
allow unused input parameters passthrough when chunking in asr pipelines by @VictorAtIfInsurance in #33889
prepare_fa2_from_position_ids function bugfix by @meliksahturker in #33269
chore: fix some typos by @wanxiangchwng in #34891
Fix convert_tokens_to_string when decoder is None by @dszeto in #34569
[peft] Given that self.active_adapter is deprecated, avoid using it by @tomaarsen in #34804
Fix Qwen2 failing tests by @jla524 in #34819
Fix : BitNet tests by @MekkCyber in #34895
[AWQ, CI] Bump AWQ version used in docker image by @BenjaminBossan in #34922
fix static cache data type mismatch by @jiqing-feng in #34799
Fix test_auto_backbone_timm_model_from_pretrained by @ydshieh in #34877
Upgrade torch version to 2.5 in dockerfile for quantization CI by @MekkCyber in #34924
Fix failing GGML test by @MekkCyber in #34871
Updated documentation and added conversion utility by @ViktorooReps in #34319
making gpt2 fx traceable by @xuzifei-dmatrix in #34633
Fix import structure for Fast Image processors by @yonigozlan in #34859
VideoLLaVA: add default values by @zucchini-nlp in #34916
Skipping aqlm non working inference tests till fix merged by @MekkCyber in #34865
[Whisper] Fix whisper integration tests by @eustlb in #34111
Add Pytorch Tensor Parallel support for Mistral by @VladOS95-cyber in #34927
change apply_rotary_pos_emb of Glmmodel for GLM-Edge Series model by @zRzRzRzRzRzRzR in #34629
Fix torch.onnx.export of Qwen2-VL vision encoder by @xenova in #34852
Update the Python version in the Chinese README to match the English README. by @vansin in #34870
[i18n-ar] Translated file : docs/source/ar/benchmarks.md into Arabic by @AhmedAlmaghz in #33023
[docs] use device-agnostic API instead of cuda by @faaany in #34913
[doc] use full path for run_qa.py by @faaany in #34914
docs: HUGGINGFACE_HUB_CACHE -> HF_HUB_CACHE by @imba-tjd in #34904
[i18n-zh]Translated tiktoken.md into chinese by @blueingman in #34936
[FlexAttention] Update gemma2 by @ArthurZucker in #34942
Fix : Add PEFT from source to CI docker by @MekkCyber in #34969
Avoid calling get_max_length by @ydshieh in #34971
Fix flaky test execution caused by Thread by @ydshieh in #34966
🌐 [i18n-KO] Translated encoder-decoder.md to Korean by @maximizemaxwell in #34880
[docs] add explanation to release_memory() by @faaany in #34911
[i18n-zh]Translated perf_train_special.md into Chinese by @blueingman in #34948
Fix typo in code block in vipllava.md by @yuanx749 in #34957
Fixed typo in VisitWebpageTool by @sergiopaniego in #34978
[PEFT] Set eval mode when loading PEFT adapter by @BenjaminBossan in #34509
Fix save_pretrained for partially offloaded models by @kylesayrs in #34890
🚨🚨🚨 Changed DINOv2Config default patch size to 14 by @OFSkean in #34568
Refine the code of Universal Assisted Generation by @xinpengzz in #34823
Allow compressed-tensors quantized model to be trained by @horheynm in #34520
Offloaded cache: fix generate by @zucchini-nlp in #34921
Fix utils/check_bad_commit.py (for auto ping in CI) by @ydshieh in #34943
Add optimized PixtralImageProcessorFast by @mgoin in #34836
Improve .from_pretrained type annotations by @qubvel in #34973
Fix docker CI : install autogptq from source by @MekkCyber in #35000
Let server decide default repo visibility by @Wauplin in #34999
🚨🚨🚨 Uniformize kwargs for TrOCR Processor by @tibor-reiss in #34587
Update timm version by @qubvel in #35005
fix: double verbs by @SamuelLarkin in #35008
Update FillMaskPipeline.__call__ signature and docstring by @alvarobartt in #35006
Only cast cu_seqlens when tracing by @xenova in #35016
fix variable undefined bug when return_tensors is not specified in llava processing by @chenweize1998 in #34953
Optimize memory usage of mllama encoder by @milesial in #34930
Typo in warning switching to optimum-quanto by @Bojun-Feng in #35028
Add type hints for forward functions in Gemma2 by @jla524 in #35034
Fix test_eager_matches_sdpa_inference for XPU backend by @dvrogozh in #34889
Multiple typo fixes in Tutorials docs by @henryhmko in #35035
add docstring example for compute_loss_func by @secrettoad in #35020
[i18n-ar] Translated file : docs/source/ar/notebooks.md into Arabic by @AhmedAlmaghz in #33049
[docs] add the missing import for Image and bug fix by @faaany in #34776
Translate bertology.md into Chinese by @wwwbai in #34908
Automatic compilation in generate: do not rely on inner function by @Cyrilvallez in #34923
Add token cost + runtime monitoring to Agent and HfEngine children by @aymeric-roucher in #34548
Fix BertGeneration by @ydshieh in #35043
fix speecht5 failure issue in test_peft_gradient_checkpointing_enable… by @sywangyi in #34454
[docs] fix example code bug by @faaany in #35054
Translate community.md into Chinese by @wwwbai in #35013
[docs] use device-agnostic instead of cuda by @faaany in #35047
[docs] use device-agnostic API instead of hard-coded cuda by @faaany in #35048
Fix pad_token_tensor is None in warning by @tshu-w in #34005
Add Pytorch Tensor Parallel support for Qwen2, Qwen2Moe, Starcoder2 by @VladOS95-cyber in #35007
[GPTNeoX] Flex Attention + Refactor by @vasqu in #34896
Support for easier multimodal use of modular by @Cyrilvallez in #35056
[docs] add a comment that offloading requires CUDA GPU by @faaany in #35055
[docs] Increase visibility of torch_dtype="auto" by @stevhliu in #35067
Informative by @ydshieh in #35059
[Whisper] Fix whisper tokenizer by @eustlb in #34537
[tokenizers] bump to 0.21 by @ArthurZucker in #34972
Update Mistral conversion script by @Cyrilvallez in #34829
Fix tie_word_embeddings handling for GGUF models by @Isotr0py in #35085
Deprecate quanto and switch to optimum-quanto by @MekkCyber in #35001
BLIP: this is correct now by @zucchini-nlp in #35081
[trainer] fix the GA model_accepts_loss_kwargs by @ArthurZucker in #34915
Fix flaky Hub CI (test_trainer.py) by @ydshieh in #35062
Adaptive dynamic number of speculative tokens by @jmamou in #34156

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@AhmedAlmaghz
- [i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic (#33034)
- [i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic (#33048)
- [i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic (#33080)
- [i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic (#33079)
- [i18n-ar] Translated file : docs/source/ar/benchmarks.md into Arabic (#33023)
@maximizemaxwell
- 🌐 [i18n-KO] Translated perf_train_special.md to Korean (#34590)
- 🌐 [i18n-KO] Translated bert.md to Korean (#34627)
- 🌐 [i18n-KO] Translated marian.md to Korean (#34698)
- 🌐 [i18n-KO] Translated encoder-decoder.md to Korean (#34880)
@2015aroras
- Add OLMo November 2024 (#34551)
- Rename OLMo November to OLMo2 (#34864)
@mgoin
- Add optimized PixtralImageProcessorFast (#34836)

- Python
Published by LysandreJik over 1 year ago

transformers - Patch release v4.46.3

One small fix for FSDP + gradient accumulation loss issue!

FSDP grad accum fix, #34645 by @winglian

- Python
Published by ArthurZucker over 1 year ago

transformers - Patch release v4.46.2

Patch release v4.46.2

Mostly had to finish the gradient accumulation fix! Thanks to @techkang and @Ryukijano 🤗

VLMs: fix number of image tokens (#34332) by @zucchini-nlp
fix pixtral processor (#34486) by @molbap
enable average tokens across devices (#34373) by @techkang and @muellerzr
Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante

- Python
Published by ArthurZucker over 1 year ago

transformers - Patch release v4.46.1

Patch release v4.46.1

This is mostly for fx and onnx issues!

Fix regression loading dtype #34409 by @SunMarc
LLaVa: latency issues #34460 by @zucchini-nlp
Fix pix2struct #34374 by @IlyasMoutawwakil
Fix onnx non-exportable inplace aten op #34376 by @IlyasMoutawwakil
Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun

- Python
Published by ArthurZucker over 1 year ago

transformers - Release v4.46.0

New model additions

Moshi

The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.

Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

Moshi integration by @ylacombe in #33624

Zamba

Zamba-7B-v1 is a hybrid between state-space models (specifically Mamba) and transformers, and was trained using next-token prediction. Zamba uses a shared transformer layer after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.

Add Zamba by @pglorio in #30950
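
The layer pattern described for Zamba (one shared transformer layer after every 6 Mamba blocks) can be pictured with a small sketch. The function name `hybrid_layout` and the string labels are hypothetical; this only illustrates the ordering of blocks and says nothing about how the model is actually implemented.

```python
def hybrid_layout(num_mamba_blocks: int, shared_every: int = 6) -> list[str]:
    """Return the block order: a shared transformer slot after every `shared_every` Mamba blocks."""
    layout = []
    for index in range(1, num_mamba_blocks + 1):
        layout.append("mamba")
        if index % shared_every == 0:
            # The same (shared) transformer layer is reused at each of these positions.
            layout.append("shared_transformer")
    return layout


layout = hybrid_layout(12)
print(layout)
```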

GLM

The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team, THUDM & ZhipuAI.

The abstract from the paper starts with the following:

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.

add Glm by @Cyrilvallez in #33823

Idefics 3

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

It uses Llama3 for the text model.
It uses an updated processing logic for the images.
It removes the perceiver.

Add Idefics 3! by @andimarafioti in #32473

PhiMoE

The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

This model is very similar to Mixtral. The main difference is Phi3LongRoPEScaledRotaryEmbedding, which is used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP’s up and gate projection layers are also fused.

PhiMoE by @garg-amit in #33363

Watermarking

This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)
```

Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor

Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector

Add SynthID (watermarking by Google DeepMind) by @gante in #34350

Quantization

BitNet

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version).

FEAT : Adding BitNet quantization method to HFQuantizer by @MekkCyber in #33410
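
The 1.58-bit figure follows from the three possible values: encoding one of three states takes log2(3) ≈ 1.585 bits. The sketch below checks that arithmetic and shows a toy ternary rounding. The `ternarize` helper and its mean-absolute-value scaling are assumptions made for illustration, not the quantizer added in this release.

```python
import math

# Information needed to encode one of three values (-1, 0, 1).
bits_per_parameter = math.log2(3)


def ternarize(weights: list[float]) -> tuple[list[int], float]:
    """Toy ternary rounding: scale by the mean absolute value, round, clamp to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


quantized, scale = ternarize([0.9, -0.05, -1.4, 0.3])
print(f"{bits_per_parameter:.3f} bits per parameter")
print(quantized, scale)
```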

GGUF loading in transformers

More architectures are now supported in our GGUF loader; GGUF files saved with these architectures can now be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize the models after further training has been done.

Add gguf support for bloom by @VladOS95-cyber in #33473
Add falcon gguf by @g-prz in #33437
Add gguf support for StableLM by @VladOS95-cyber in #33793
Add gguf support for gpt2 by @VladOS95-cyber in #34044
Add GGUF for starcoder2 by @VladOS95-cyber in #34094
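
As a rough illustration of loading a GGUF file, the sketch below wraps the `gguf_file` argument of `from_pretrained` in a small helper. The helper name `load_gguf_checkpoint` and the repository and file names in the usage comment are hypothetical placeholders, and loading may additionally require the `gguf` package to be installed.

```python
def load_gguf_checkpoint(repo_id: str, gguf_file: str):
    """Load a tokenizer and a causal LM from a single GGUF file in a repository."""
    # Imported lazily so the sketch can be read and imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
    return tokenizer, model


# Hypothetical usage; both names are placeholders, not real checkpoints:
# tokenizer, model = load_gguf_checkpoint("some-org/some-model-GGUF", "some-model.Q4_K_M.gguf")
```

The returned model is a regular transformers model, so it can be fine-tuned as usual before being converted back with llama.cpp tooling.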

Notable improvements and additions

Pipeline API synchronisation

We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.

Sync video classification pipeline with huggingface_hub spec by @Rocketknight1 in #34288
Image pipelines spec compliance by @Rocketknight1 in #33899
Make ASR pipeline compliant with Hub spec + add tests by @Rocketknight1 in #33769
Cleanup return_text and return_full_text options in TextGenerationPipeline by @Rocketknight1 in #33542
Make audio classification pipeline spec-compliant and add test by @Rocketknight1 in #33730
Sync QuestionAnsweringPipeline by @Rocketknight1 in #34039

Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!

Make pipeline able to load processor by @qubvel in #32514

ExecuTorch compatibility

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

We are collaborating with the ExecuTorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.

Generate using exported model and enable gemma2-2b in ExecuTorch by @guangy10 in #33707
Qwen2.5 is ExecuTorch Compatible by @guangy10 in #34102
Olmo is ExecuTorch Compatible by @guangy10 in #34181
Llama3 and Llama2 are ExecuTorch compatible by @guangy10 in #34101

Gradient accumulation bugfix

Fix Gradient Accumulation issue by @ArthurZucker in #34191
Enable users to use their own loss functions + deal with prefetching for grad accum by @muellerzr in #34198
Enable Gradient Accumulation fix across all models + trainer fully in forward() by @muellerzr in #34283

Bugfixes and improvements

adding positional encoder changes and tests by @manuelsh in #32600
Uniformize kwargs for chameleon processor by @leloykun in #32181
[MllamaProcessor] Update errors and API with multiple image by @ArthurZucker in #33715
fix: use correct var names for check_tokenizers script by @niqodea in #33702
Fix docs and docstrings Omdet-Turbo by @yonigozlan in #33726
Fix position embeddings singular/plural by @molbap in #33678
Generate: can_generate() recursive check by @gante in #33718
clean_up_tokenization_spaces=False if unset by @itazap in #31938
fix: add docstring for image_size in Convnextv2 config by @lucianosrp in #33734
Fix modular model converter unable to generate Processor classes by @tonywu71 in #33737
fix trainer tr_loss add error by @Wang-Xiaodong1899 in #33651
Update Albumentations Versions by @vasqu in #33704
Doc and config mismatch for DeBERTa by @fkrasnov2 in #33713
[clean_up_tokenization_spaces] Pl bart was failing, updating by @ArthurZucker in #33735
[MllamaImageProcessing] Update doc by @ArthurZucker in #33747
Make siglip examples clearer and error free by @jbn in #33667
Paligemma support for multi-image by @zucchini-nlp in #33447
remove warning v2 by @itazap in #33761
Model addition timeline by @LysandreJik in #33762
Fix typing in load_balancing_loss_func function of modeling_mixtral.py. by @PhilipMay in #33641
Enable non-safetensor ser/deser for TorchAoConfig quantized model 🔴 by @jerryzh168 in #33456
Fix typo in documentation by @qgallouedec in #33805
Hqq serialization by @mobicham in #33141
Add Slow CI reminder bot by @ydshieh in #33506
[modular] fixes! by @ArthurZucker in #33820
Fix ViT-MAE decoder interpolate by @xenova in #33330
Fixes for issue #33763 in idefics2 model by @aroun-coumar in #33766
Fix link in gguf.md by @pogpog in #33768
minor typo fix by @a-r-r-o-w in #33784
Fix Mamba slow path bug with dtype mismatch. by @Adibvafa in #32691
Fix passing str dtype to static cache by @guangy10 in #33741
fix check for hidden size in text model for deepspeed zero3 auto entries by @winglian in #33829
post reminder comment only once by @ydshieh in #33848
Generate: move llama prepare_inputs_for_generation to GenerationMixin by @gante in #33677
Refactor image features selection in LlaVa by @kenza-bouzid in #33696
fix: skip dropout in eval for flash_attn in various models by @fdschmidt93 in #33844
add attention weight up-cast to float32 in chameleon by @francescortu in #33822
Workaround for bark issue in pipelines by @Rocketknight1 in #33824
Fix device mismatch errors by @zucchini-nlp in #33851
This PR contains additional changes for #33143 by @aroun-coumar in #33581
Raise accelerate dependency error in case of defaulting low_cpu_mem_usage=True by @kylesayrs in #33830
Validate the eval dataset in advance. by @jackyjinjing in #33743
Add include_loss_for_metrics by @Manalelaidouni in #33088
Avoid using context that is not accessible from external contributors by @ydshieh in #33866
fix: repair depth estimation multiprocessing by @niqodea in #33759
Move weight initialization deformabledetr by @g-prz in #33339
[Fix] ViViT interpolate_pos_encoding by @RUFFY-369 in #33815
Repo consistency fix after #33339 by @amyeroberts in #33873
Add support for custom inputs and batched inputs in ProcessorTesterMixin by @yonigozlan in #33711
Fix: typo by @TrickEye in #33880
Uniformize model processors by @molbap in #31368
Don't run reminder bot for now by @ydshieh in #33883
populate quantization_config for kv-cache-scheme only configs by @horheynm in #33874
Allow for nightly packages of compressed_tensors by @kylesayrs in #33828
Fix kwargs passed by AutoQuantizationConfig.from_pretrained by @kylesayrs in #33798
Add sdpa for DistilBert by @OmarManzoor in #33724
Trainer - deprecate tokenizer for processing_class by @amyeroberts in #32385
[Quantization] Switch to optimum-quanto by @SunMarc in #31732
Optim deformable detr by @yonigozlan in #33600
Handle Trainer tokenizer kwarg deprecation with decorator by @qubvel in #33887
rename all test_processing_*.py to test_processor_*.py by @yonigozlan in #33878
uniformize processor Mllama by @yonigozlan in #33876
Fix dt proj bias reassigned by @HofitBata in #33314
Update an keyerror on _save_check_point prevent confusion of missing … by @fadingNA in #33832
VLM Generate: tag test_static_cache_matches_dynamic as flaky by @gante in #33630
Migrate the CI runners to the new clusters by @glegendre01 in #33849
Fix module initialization for root module under Zero3 by @Ben-Schneider-code in #33632
Add SplinterTokenizer unit test by @ariepratama in #32652
Generate tests: modality-agnostic input preparation by @gante in #33685
Fix: use unidic-lite instead of ipadic as the tokenizer dictionary for Japanese by @KanTakahiro in #33372
[Tests] Diverse Whisper fixes by @ylacombe in #33665
[PEFT] Support low_cpu_mem_usage option for PEFT loading adapters by @BenjaminBossan in #33725
add setter for trainer processor by @ArthurZucker in #33911
Add support for weights_only flag when loading state_dict by @jerryzh168 in #32481
Config: lower save_pretrained exception to warning by @gante in #33906
Uniformize kwargs for Idefics/2 processors by @yonigozlan in #32568
Remove logits.float() by @ringohoffman in #33902
Minor error condition bug fix by @htahboub in #33781
Fix distil whisper segment computation by @ylacombe in #33920
[Doc]: Broken link in Kubernetes doc by @saldanhad in #33879
[i18n-ru] Fixes typo in the README_ru.md by @Artanias in #33882
Ignore keys on validate_rope by @zucchini-nlp in #33753
[PR run-slow] by @ArthurZucker in #33939
Add a section on writing tool templates to the chat template docs by @Rocketknight1 in #33924
Enables CPU AWQ model with IPEX version. by @jiqing-feng in #33460
🔴 🚨 Resizing tokens embeddings: initialize from old embeddings' normal distribution. by @abuelnasr0 in #33325
Removed unnecessary transpose in Switch Transformer Routing by @karan-uppal3 in #33582
Fix attn mask ignore logic in training-time trace by @zhenglongjiepheonix in #32613
hot fix self.position_embeddings->self.position_embedding by @ArthurZucker in #33958
fix red check-copies by @ArthurZucker in #33964
Cache: revert DynamicCache init for BC by @gante in #33861
Paligemma: fix static cache test by @zucchini-nlp in #33941
Updating char_to_token documentation to note behaviour when trim_offsets is True by @Craigacp in #33919
add test for Jamba with new model jamba-tiny-dev by @yecohn in #33863
Bug fix gguf qwen2moe by @VladOS95-cyber in #33940
[TF] Fix Tensorflow XLA Generation on limited seq_len models by @vasqu in #33903
[WIP] Add Tokenizer for MyT5 Model by @tomlimi in #31286
Add position ids in forward pass to opt model by @avishaiElmakies in #33121
Flash-attn performance: remove cuda sync during inference by @Cyrilvallez in #33570
[Docs] Improve VLM docs by @NielsRogge in #33393
[Docs] Add Developer Guide: How to Hack Any Transformers Model by @MagnusS0 in #33979
[Red CIs] Fix hub failures by @ArthurZucker in #34001
Fix Tensor + Embedding error in some cases when using SiglipVisionModel by @kaitolucifer in #33994
properly fix and RUN_SLOW by @ArthurZucker in #33965
Enable customized optimizer for DeepSpeed by @dataKim1201 in #32049
[pytes collection] Fix flax test collection by @ArthurZucker in #34004
Fix undefined default_config in configuration_utils.py by @mgoin in #33934
🌐 [i18n-KO] Translated gguf.md to Korean by @yijun-lee in #33764
🌐 [i18n-KO] Translated swinv2.md to Korean by @mreraser in #33566
🌐 [i18n-KO] Translated audio_utils.md to Korean by @yijun-lee in #33802
🌐 [i18n-KO] Translated esm.md to Korean by @yijun-lee in #33796
🌐 [i18n-KO] Translated time_series_utils.md to Korean by @yijun-lee in #33806
🌐 [i18n-KO] Translated pipelines_utils.md to Korean by @yijun-lee in #33809
🌐 [i18n-KO] Translated trainer.md to Korean by @yijun-lee in #33797
🌐 [i18n-KO] Translated chameleon.md to Korean by @yijun-lee in #33799
🌐 [i18n-KO] Translated logging.md to Korean by @chhaewxn in #33543
🌐 [i18n-KO] Translated auto.md to Korean by @boyunJang in #33590
🌐 [i18n-KO] Translated swin2sr.md to Korean by @mreraser in #33795
🌐 [i18n-KO] Translated vit.md to Korean by @mreraser in #33884
🌐 [i18n-KO] Translated gemma.md to Korean by @yijun-lee in #33936
Cache: slight change in naming by @zucchini-nlp in #32421
Add support for __all__ and potentilly deleting functions by @ArthurZucker in #33859
Processors: don't default padding side by @zucchini-nlp in #33942
Add auto model for image-text-to-text by @yonigozlan in #32472
BatchFeature.to() supports non-tensor keys by @Rocketknight1 in #33918
Improve modular converter by @Cyrilvallez in #33991
Fixup DeepSpeed things by @muellerzr in #34007
Fix typing issue by @SunMarc in #34012
fix awq tests due to ipex backend by @SunMarc in #34011
Remove decoder_config=None by @SunMarc in #34014
Fix trainer_seq2seq.py's __init__ type annotations by @benglewis in #34021
🌐 [i18n-KO] Translated feature_extractor.md to Korean by @yijun-lee in #33775
🌐 [i18n-KO] Translated bertweet.md to Korean by @ahnjj in #33891
🌐 [i18n-KO] Translated gpt_neox_japanese.md to Korean by @ahnjj in #33894
🌐 [i18n-KO] Translated rag.md to Korean by @chhaewxn in #33989
🌐 [i18n-KO] Translated main_classes/quantization.md to Korean by @fabxoe in #33959
🌐 [i18n-KO] Translated main_classes/configuration.md to Korean by @fabxoe in #33952
🌐 [i18n-KO] Translated model_doc/mamba.md to Korean by @fabxoe in #33626
🌐 [i18n-KO] Translated model_doc/autoformer.md to Korean by @fabxoe in #33574
🌐 [i18n-KO] Translated model_doc/patchtsmixer.md to Korean by @fabxoe in #33587
🌐 [i18n-KO] Translated model_doc/clip.md to Korean by @fabxoe in #33610
🌐 [i18n-KO] Translated model_doc/paligemma.md to Korean by @fabxoe in #33612
🌐 [i18n-KO] Translated model_doc/llama3.md to Korean by @fabxoe in #33635
🌐 [i18n-KO] Translated model_doc/mistral.md to Korean by @fabxoe in #33648
🌐 [i18n-KO] Translated model_doc/cohere.md to Korean by @fabxoe in #33885
🌐 [i18n-KO] Translated model_doc/dbrx.md to Korean by @fabxoe in #33951
🌐 [i18n-KO] Translated model_doc/deberta-v2.md to Korean by @fabxoe in #33968
🌐 [i18n-KO] Translated main_classes/onnx.md to Korean by @fabxoe in #33601
🌐 [i18n-KO] Translated tokenization_utils.md to Korean by @yijun-lee in #33813
🌐 [i18n-KO] Translated swin.md to Korean by @mreraser in #33510
🌐 [i18n-KO] Translated file_utils.md to Korean by @yijun-lee in #33803
🌐 [i18n-KO] Translated openai-gpt.md to Korean by @yijun-lee in #33801
🌐 [i18n-KO] Translated biogpt.md to Korean by @yijun-lee in #33773
🌐 [i18n-KO] Translated blip.md to Korean by @cjfghk5697 in #33515
🌐 [i18n-KO] Translated output.md to Korean by @4N3MONE in #33607
🌐 [i18n-KO] Translated image_processing_utils.md to Korean by @yijun-lee in #33804
🌐 [i18n-KO] Translated modular_transformers.md to Korean by @yijun-lee in #33772
[Patch helper] update to not have to checkout main by @ArthurZucker in #34006
Fix Failed tests with mobile bert resize tokens embedding by @abuelnasr0 in #33950
Generate: remove most decoder-only LLMs prepare_inputs_for_generation by @gante in #33870
Mllama: fix tests by @zucchini-nlp in #34000
Fix PIL dep for tests by @muellerzr in #34028
🌐 [i18n-KO] Translated model_doc/bart.md to Korean by @fabxoe in #33893
🌐 [i18n-KO] Translated model_doc/deberta.md to Korean by @fabxoe in #33967
🌐 [i18n-KO] Translated main_classes/keras_callbacks.md to Korean by @fabxoe in #33955
🌐 [i18n-KO] Translated model_doc/mamba2.md to Korean by @fabxoe in #33629
🌐 [i18n-KO] Translated main_classes/model.md to Korean by @fabxoe in #33606
🌐 [i18n-KO] Translated model_doc/trajectory_transformer.md to Korean by @fabxoe in #33597
🌐 [i18n-KO] Translated model_doc/time_series_transformer.md to Korean by @fabxoe in #33596
🌐 [i18n-KO] Translated model_doc/informer.md to Korean by @fabxoe in #33585
🌐 [i18n-KO] Translated model_doc/graphormer.md to Korean by @fabxoe in #33569
🌐 [i18n-KO] Translated modeling_utils.md to Korean by @yijun-lee in #33808
🌐 [i18n-KO] Translated main_classes/data_collator.md to Korean by @fabxoe in #33954
🌐 [i18n-KO] Translated model_doc/patchtst.md to Korean by @fabxoe in #33589
🌐 [i18n-KO] Translated text_generation.md to Korean by @yijun-lee in #33777
🌐 [i18n-KO] Translated main_classes/callback.md to Korean by @Jwaminju in #33572
🌐 [i18n-KO] Translated generation_utils.md to Korean by @yijun-lee in #33818
Add Translate docs into Arabic - section files CONCEPTUAL GUIDES by @AhmedAlmaghz in #33982
add sdpa to OPT by @avishaiElmakies in #33298
Phi3: fix attn for sliding window by @zucchini-nlp in #33586
HfArgumentParser: allow for hyhenated field names in long-options by @djmarti in #33990
Fix pipelines tests by @qubvel in #34049
Specifying torch dtype in Qwen2VLForConditionalGeneration by @htahboub in #33953
Universal Assisted Generation: Assisted generation with any assistant model (by Intel Labs) by @danielkorat in #33383
check if eigenvalues of covariance matrix are complex. by @abuelnasr0 in #34037
[Docs] Update compressed_tensors.md by @mgoin in #33961
Fix data_seed unused by @MekkCyber in #33731
[TESTS] ASR pipeline by @ylacombe in #33925
Update Blip2 is_pipeline_test_to_skip method signature by @qubvel in #34067
provide trust_remote_code for search feat extractor in model config by @eaidova in #34036
Small Fix to modular converter by @MekkCyber in #34051
Default synced_gpus to True when using FullyShardedDataParallel by @ringohoffman in #33483
Idefics: fix position ids by @zucchini-nlp in #33907
Update SSH workflow file by @ydshieh in #34084
Tests: upcast logits to float() by @gante in #34042
Fix flax failures by @LysandreJik in #33912
Fix DAC slow tests by @ylacombe in #34088
Fix failing conversion by @LysandreJik in #34010
Fix PushToHubMixin when pusing to a PR revision by @Wauplin in #34090
avoid many failures for ImageGPT by @ydshieh in #34071
Fix NaNs in cost_matrix for mask2former by @ducha-aiki in #34074
Fix flaky tests by @zucchini-nlp in #34069
Generate: move prepare_inputs_for_generation in encoder-decoder llms by @gante in #34048
Avoid many test failures for LlavaNextVideoForConditionalGeneration by @ydshieh in #34070
refactor: benchmarks by @McPatate in #33896
fix(ci): benchmarks dashboard was failing due to missing quotations by @McPatate in #34100
Generate: Fix modern llm generate calls with synced_gpus by @gante in #34095
Mistral-related models for QnA by @vasqu in #34045
Fix a typo by @PengWeixuan in #34148
Fixed error message in mllama by @dmgcsilva in #34106
Specify that users should be careful with their own files by @LysandreJik in #34153
Add documentation for docker by @ArthurZucker in #33156
Update README.md with Enterprise Hub by @gary149 in #34150
Idefics: enable generation tests by @zucchini-nlp in #34062
Add sdpa for Vivit by @RUFFY-369 in #33757
Fix FSDP resume Initialization issue by @Itssshikhar in #34032
Fix default behaviour in TextClassificationPipeline for regression problem type by @subhalingamd in #34066
Generate: move logits to same device as input_ids by @gante in #34076
Add support for inheritance from class with different suffix in modular by @yonigozlan in #34077
Fix optuna ddp hp search by @SunMarc in #34073
[feat] LlavaNext add feature size check to avoid CUDA Runtime Error by @laurentd-lunit in #33608
🌐 [i18n-KO] Translated vivit.md to Korean by @mreraser in #33935
🌐 [i18n-KO] Translated gemma2.md to Korean by @yijun-lee in #33937
🌐 [i18n-KO] Translated trainer_utils.md to Korean by @yijun-lee in #33817
🌐 [i18n-KO] Translated blip-2.md to Korean by @cjfghk5697 in #33516
IDEFICS: support inputs embeds by @zucchini-nlp in #34043
[fix] fix token healing tests and usage errors by @alpertunga-bile in #33931
Revert accelerate error caused by 46d09af by @steveepreston in #34197
Fix wrong name for llava onevision and qwen2_vl in tokenization auto by @yonigozlan in #34177
Avoid using torch's Tensor or PIL's Image in chat template utils if not available by @RezaRahemtola in #34165
Revert "Fix FSDP resume Initialization issue" by @SunMarc in #34193
Update trainer._get_eval_sampler() to support group_by_length arg by @larin92 in #33514
Fix warning message for fp32_cpu_offloading in bitsandbytes configs by @amosyou in #34079
Ping team members for new failed tests in daily CI by @ydshieh in #34171
fix(Wav2Vec2ForCTC): torch export by @chrsmcgrr in #34023
Fix for tokenizer.apply_chat_template with continue_final_message=True by @schoennenbeck in #34214
removes decord by @vrnvu in #33987
Fix bus error when using GPT2 on M1 macs by @chanind in #34031
Generate: visit non-llm prepare_inputs_for_generation by @gante in #34199
Support Llama 3.2 conversion (text models) by @pcuenca in #33778
Fix-red-ci by @ArthurZucker in #34230
BLIP: fix input expansion logic by @zucchini-nlp in #34225
Fix broken test decorator require_torch_up_to_2_accelerators by @byi8220 in #34201
Informative 2 by @LysandreJik in #34154
Fix UDOP dtype issue by @Rocketknight1 in #34180
Only cast logits to float when computing loss by @ringohoffman in #34147
Generation tests: don't rely on main input name by @zucchini-nlp in #34228
Change Paligemma import logging to work with modular by @yonigozlan in #34211
Add DetrImageProcessorFast by @yonigozlan in #34063
Add a doc section on writing generation prompts by @Rocketknight1 in #34248
Fix method name which changes in tutorial by @andimarafioti in #34252
Attn implementation for composite models by @zucchini-nlp in #32238
VLM: add more modularity by @zucchini-nlp in #34175
T5 compile compatibilty by @zucchini-nlp in #34089
[docs] Fix GenerationConfig params by @stevhliu in #34299
Fix Korean doc _toctree.yml by @regisss in #34293
Update PR templates by @SunMarc in #34065
[RT-DETR] Fix onnx inference bug for Optype (Where) by @YHallouard in #33877
Fix FA2 attention for models supporting sliding window by @Cyrilvallez in #34093
Fix: tensor of examples of the same length triggers invalid stacking by @pbelcak in #34166
Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies by @alex-bene in #32550
Add option for running ffmpeg_microphone_live as a background process by @mikamerath in #32838
Feature: Add MLFLOW_MAX_LOG_PARAMS to MLflowCallback by @cecheta in #34279
Fix continue_final_message for image-text-to-text chat templates by @yonigozlan in #34236
fix error in _get_eval_sampler when group_by_length enabled by @akakakakakaa in #34237
[docs] fix typo by @faaany in #34235
🌐 [i18n-KO] Translated executorch.md to Korean by @ahnjj in #33888
🌐 [i18n-KO] Translated bert japanese.md to Korean by @ahnjj in #33890
🌐 [i18n-KO] Translated model_doc/bartpho.md to Korean by @Jwaminju in #33981
Example doc for token classification of Llama and Dependent/Copied Models by @h3110Fr13nd in #34139
[docs] Fix Korean toctree by @stevhliu in #34324
Added Deberta model type support by @FilipposVentirozos in #34308

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@manuelsh
- adding positional encoder changes and tests (#32600)
@ArthurZucker
- [MllamaProcessor] Update errors and API with multiple image (#33715)
- [clean_up_tokenization_spaces] Pl bart was failing, updating (#33735)
- [MllamaImageProcessing] Update doc (#33747)
- [modular] fixes! (#33820)
- add setter for trainer processor (#33911)
- PR run-slow
- hot fix self.position_embeddings->self.position_embedding (#33958)
- fix red check-copies (#33964)
- [Red CIs] Fix hub failures (#34001)
- properly fix and RUN_SLOW (#33965)
- [pytes collection] Fix flax test collection (#34004)
- Add support for __all__ and potentilly deleting functions (#33859)
- [Patch helper] update to not have to checkout main (#34006)
- Add documentation for docker (#33156)
- Fix Gradient Accumulation issue (#34191)
- Fix-red-ci (#34230)
@molbap
- Fix position embeddings singular/plural (#33678)
- Uniformize model processors (#31368)
@vasqu
- Update Albumentations Versions (#33704)
- [TF] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
- Mistral-related models for QnA (#34045)
@VladOS95-cyber
- Add gguf support for bloom (#33473)
- Bug fix gguf qwen2moe (#33940)
- Add gguf support for StableLM (#33793)
- Add gguf support for gpt2 (#34044)
- Add GGUF for starcoder2 (#34094)
@ydshieh
- Add Slow CI reminder bot (#33506)
- post reminder comment only once (#33848)
- Avoid using context that is not accessable from external contributors (#33866)
- Don't run reminder bot for now (#33883)
- Update SSH workflow file (#34084)
- avoid many failures for ImageGPT (#34071)
- Avoid many test failures for LlavaNextVideoForConditionalGeneration (#34070)
- Ping team members for new failed tests in daily CI (#34171)
@amyeroberts
- Repo consistency fix after #33339 (#33873)
- Trainer - deprecate tokenizer for processing_class (#32385)
@ylacombe
- [Tests] Diverse Whisper fixes (#33665)
- Fix distil whisper segment computation (#33920)
- [TESTS] ASR pipeline (#33925)
- Fix DAC slow tests (#34088)
- Moshi integration (#33624)
@ringohoffman
- Remove logits.float() (#33902)
- Default synced_gpus to True when using FullyShardedDataParallel (#33483)
- Only cast logits to float when computing loss (#34147)
@garg-amit
- PhiMoE (#33363)
@pglorio
- Add Zamba (#30950)
@tomlimi
- [WIP] Add Tokenizer for MyT5 Model (#31286)
@yijun-lee
- 🌐 [i18n-KO] Translated gguf.md to Korean (#33764)
- 🌐 [i18n-KO] Translated audio_utils.md to Korean (#33802)
- 🌐 [i18n-KO] Translated esm.md to Korean (#33796)
- 🌐 [i18n-KO] Translated time_series_utils.md to Korean (#33806)
- 🌐 [i18n-KO] Translated pipelines_utils.md to Korean (#33809)
- 🌐 [i18n-KO] Translated trainer.md to Korean (#33797)
- 🌐 [i18n-KO] Translated chameleon.md to Korean (#33799)
- 🌐 [i18n-KO] Translated gemma.md to Korean (#33936)
- 🌐 [i18n-KO] Translated feature_extractor.md to Korean (#33775)
- 🌐 [i18n-KO] Translated tokenization_utils.md to Korean (#33813)
- 🌐 [i18n-KO] Translated file_utils.md to Korean (#33803)
- 🌐 [i18n-KO] Translated openai-gpt.md to Korean (#33801)
- 🌐 [i18n-KO] Translated biogpt.md to Korean (#33773)
- 🌐 [i18n-KO] Translated image_processing_utils.md to Korean (#33804)
- 🌐 [i18n-KO] Translated modular_transformers.md to Korean (#33772)
- 🌐 [i18n-KO] Translated modeling_utils.md to Korean (#33808)
- 🌐 [i18n-KO] Translated text_generation.md to Korean (#33777)
- 🌐 [i18n-KO] Translated generation_utils.md to Korean (#33818)
- 🌐 [i18n-KO] Translated gemma2.md to Korean (#33937)
- 🌐 [i18n-KO] Translated trainer_utils.md to Korean (#33817)
@fabxoe
- 🌐 [i18n-KO] Translated main_classes/quantization.md to Korean (#33959)
- 🌐 [i18n-KO] Translated main_classes/configuration.md to Korean (#33952)
- 🌐 [i18n-KO] Translated model_doc/mamba.md to Korean (#33626)
- 🌐 [i18n-KO] Translated model_doc/autoformer.md to Korean (#33574)
- 🌐 [i18n-KO] Translated model_doc/patchtsmixer.md to Korean (#33587)
- 🌐 [i18n-KO] Translated model_doc/clip.md to Korean (#33610)
- 🌐 [i18n-KO] Translated model_doc/paligemma.md to Korean (#33612)
- 🌐 [i18n-KO] Translated model_doc/llama3.md to Korean (#33635)
- 🌐 [i18n-KO] Translated model_doc/mistral.md to Korean (#33648)
- 🌐 [i18n-KO] Translated model_doc/cohere.md to Korean (#33885)
- 🌐 [i18n-KO] Translated model_doc/dbrx.md to Korean (#33951)
- 🌐 [i18n-KO] Translated model_doc/deberta-v2.md to Korean (#33968)
- 🌐 [i18n-KO] Translated main_classes/onnx.md to Korean (#33601)
- 🌐 [i18n-KO] Translated model_doc/bart.md to Korean (#33893)
- 🌐 [i18n-KO] Translated model_doc/deberta.md to Korean (#33967)
- 🌐 [i18n-KO] Translated main_classes/keras_callbacks.md to Korean (#33955)
- 🌐 [i18n-KO] Translated model_doc/mamba2.md to Korean (#33629)
- 🌐 [i18n-KO] Translated main_classes/model.md to Korean (#33606)
- 🌐 [i18n-KO] Translated model_doc/trajectory_transformer.md to Korean (#33597)
- 🌐 [i18n-KO] Translated model_doc/time_series_transformer.md to Korean (#33596)
- 🌐 [i18n-KO] Translated model_doc/informer.md to Korean (#33585)
- 🌐 [i18n-KO] Translated model_doc/graphormer.md to Korean (#33569)
- 🌐 [i18n-KO] Translated main_classes/data_collator.md to Korean (#33954)
- 🌐 [i18n-KO] Translated model_doc/patchtst.md to Korean (#33589)
@MekkCyber
- FEAT : Adding BitNet quantization method to HFQuantizer (#33410)
- Fix data_seed unused (#33731)
- Small Fix to modular converter (#34051)
@AhmedAlmaghz
- Add Translate docs into Arabic - section files CONCEPTUAL GUIDES (#33982)
@alex-bene
- Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies (#32550)

- Python
Published by LysandreJik over 1 year ago

transformers - Release v4.45.2

Patch release v4.45.2

Mostly some warnings that were not properly removed ⚠️:
- Ignore keys on validate_rope #33753 by @zucchini-nlp
- remove warning v2 #33761 by @itazap
- Config: lower save_pretrained exception to warning #33906 by @gante

🔴 Had a small regression with dynamic Cache 🔴
- Cache: revert DynamicCache init for BC #33861 by @gante

A small fix for idefics 🐩:
- Fixes for issue #33763 in idefics2 model #33766 by @aroun-coumar

And a fix for Siglip 🤧!
- hot fix self.position_embeddings->self.position_embedding #33958 and properly fix and RUN_SLOW #33965 thanks to @mranzinger

- Python
Published by ArthurZucker over 1 year ago

transformers - Patch Release v4.45.1

Patches for v4.45.1

[MllamaProcessor] Update errors and API with multiple image (#33715) by @ArthurZucker
Generate: can_generate() recursive check (#33718) by @gante
clean_up_tokenization_spaces=False if unset (#31938) by @itazap

- Python
Published by ArthurZucker over 1 year ago

transformers - Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker

Qwen2-VL

Qwen2-VL is a major update to the previous Qwen-VL from the Qwen team.

An extract from the Qwen2-VL blogpost available here is as follows:

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families. Compared with Qwen-VL, Qwen2-VL has the capabilities of:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

support qwen2-vl by @simonJJJ in #32318

Qwen2-Audio

Qwen2-Audio is the new series of large audio-language models from the Qwen team. Qwen2-Audio can accept various audio signal inputs and perform audio analysis or respond directly in text to speech instructions.

They introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
- audio analysis: users can provide audio and text instructions for analysis during the interaction

Add Qwen2-Audio by @faychu in #32137

OLMoE

OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

Add OLMoE by @Muennighoff in #32406

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better process high-resolution images and capture as much detail as possible. Videos, however, are pooled to a total sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B and achieves remarkable performance on benchmark evaluations.

Llava Onevision: add model by @zucchini-nlp in #32673

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and Math data.

The team releases an accompanying blog post.

Add new model by @younesbelkada in #32615

Granite Language Models

The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

Granite language models by @mayank31398 in #31502

Granite MOE

The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.

Granitemoe by @mayank31398 in #33207
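
To illustrate what "sparsely activates" means, here is a minimal sketch of generic top-k expert routing in plain Python. It is not taken from the GraniteMoe or OLMoE code: the function names, the toy experts, and the choice to apply softmax before selecting the top k are all assumptions made for illustration, and real implementations differ in these details.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalise their weights.
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest are never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy "experts" that just scale their input (hypothetical stand-ins).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(1.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Only the selected experts are evaluated for a given token, which is why the number of active parameters per token can be much smaller than the total parameter count.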

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

Add Descript-Audio-Codec model by @kamilakesbi in #31494
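
As a rough sense of scale for the figures quoted above, the following back-of-the-envelope calculation compares 8 kbps against uncompressed audio. The 16-bit, mono PCM baseline is an assumption made here for illustration; the release notes do not state which baseline is meant.

```python
SAMPLE_RATE_HZ = 44_100      # from the description above
CODEC_BITRATE_BPS = 8_000    # 8 kbps, from the description above
BITS_PER_SAMPLE = 16         # assumption: 16-bit PCM baseline
CHANNELS = 1                 # assumption: mono

pcm_bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
compression_ratio = pcm_bitrate_bps / CODEC_BITRATE_BPS

def stored_megabytes(seconds, bitrate_bps):
    """Size of an audio stream in megabytes (10**6 bytes)."""
    return seconds * bitrate_bps / 8 / 1e6

hour_pcm_mb = stored_megabytes(3600, pcm_bitrate_bps)
hour_codec_mb = stored_megabytes(3600, CODEC_BITRATE_BPS)
```

Under these assumptions the baseline is 705.6 kbps, so the codec stream is roughly 88 times smaller, and one hour of audio shrinks from about 318 MB to 3.6 MB.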

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.

The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are neither padded together nor resized).

Add support for Pixtral by @ArthurZucker in #33449
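
The placeholder mechanism mentioned above can be pictured with a small sketch. This is not the Pixtral implementation: `merge_image_embeddings` is a hypothetical helper, and the embeddings are tiny hand-written lists rather than tensors. It only shows the idea of swapping the embedding at each `[IMG]` position for an image embedding.

```python
def merge_image_embeddings(tokens, token_embeddings, image_embeddings,
                           placeholder="[IMG]"):
    """Replace the embedding at each placeholder position with an image embedding."""
    slots = [i for i, tok in enumerate(tokens) if tok == placeholder]
    if len(slots) != len(image_embeddings):
        raise ValueError(
            f"{len(slots)} placeholders but {len(image_embeddings)} image embeddings"
        )
    merged = list(token_embeddings)  # copy so the input list is left untouched
    for slot, image_vector in zip(slots, image_embeddings):
        merged[slot] = image_vector
    return merged

tokens = ["Describe", "[IMG]", "[IMG]", "please"]
token_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
image_embeddings = [[9.0, 9.1], [9.2, 9.3]]
merged = merge_image_embeddings(tokens, token_embeddings, image_embeddings)
```

The sequence length is unchanged by the merge, so the language decoder sees one mixed sequence of text and image embeddings.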

Mimi

The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

Codec integration by @ylacombe in #33565
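
The quoted bitrate can be related to the token rate with a short calculation. The numbers below are assumptions not stated in these notes: a frame rate of 12.5 Hz (which the description above appears to round to 12 Hz), 8 codebooks per frame, and 2048 entries per codebook.

```python
import math

frame_rate_hz = 12.5     # assumption: token frames per second
num_codebooks = 8        # assumption: codebooks used per frame
codebook_size = 2048     # assumption: entries per codebook

# Each code needs log2(codebook_size) bits to identify one entry.
bits_per_code = math.log2(codebook_size)
bitrate_bps = frame_rate_hz * num_codebooks * bits_per_code
tokens_per_second = frame_rate_hz * num_codebooks
```

With these values each code costs 11 bits, giving 1100 bits per second, which is consistent with the 1.1 kbps figure.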

Quantization

GGUF

GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by dequantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.

Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
🚨 Support dequantization for most GGML types by @Isotr0py in #32625
Add support for GGUF Phi-3 by @a8nova in #31844
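
To make the dequantization step concrete, here is a minimal sketch of block-wise dequantization in plain Python. It assumes a simple layout modelled on the Q8_0 type: blocks of 32 signed 8-bit values that share one fp16 scale. The helper names are hypothetical and this is not the code used by transformers, which handles many more quantization types.

```python
import struct

BLOCK_SIZE = 32            # values per block in this sketch
BLOCK_FORMAT = "<e32b"     # little-endian: one fp16 scale + 32 int8 quants

def quantize_block(values):
    """Pack 32 floats into a 34-byte block (helper for the demo only)."""
    assert len(values) == BLOCK_SIZE
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest else 0.0
    # Round-trip the scale through fp16 so both directions use the same value.
    scale = struct.unpack("<e", struct.pack("<e", scale))[0]
    quants = [
        max(-127, min(127, round(v / scale))) if scale else 0
        for v in values
    ]
    return struct.pack(BLOCK_FORMAT, scale, *quants)

def dequantize_block(block):
    """Recover approximate floats: value = scale * quant."""
    scale, *quants = struct.unpack(BLOCK_FORMAT, block)
    return [scale * q for q in quants]

weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
block = quantize_block(weights)
restored = dequantize_block(block)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Dequantization is lossy with respect to the original weights: the restored values match only up to about half a quantization step.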

Torch AO

An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.

Add TorchAOHfQuantizer by @jerryzh168 in #32306

Liger Kernel

The Liger kernel is now supported in the Trainer class.

Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by @JasonZhu1313 in #32860

Modular Transformers

This PR introduces Modularity for transformers, which has always been prohibited when working with transformers (see blog post for the accompanying design philosophy).

The core idea behind this PR is to facilitate model addition by enabling Pythonic inheritance while keeping true to our single-file policy in which models/processors must be contained within a single file, enabling working around the object without going through 10 layers of abstractions.

It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248

Modular transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248
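
The idea of writing a model through inheritance while still ending up with a standalone definition can be conveyed with a toy example. This sketch flattens live Python class objects; the converter described in the PR generates source files instead, so everything here (`flatten`, the class names) is hypothetical and only illustrates the concept.

```python
class BaseAttention:
    """Stands in for a layer from an existing model."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size

    def forward(self, x):
        return [v * 2 for v in x]

    def extra_repr(self):
        return f"hidden_size={self.hidden_size}"


class NewAttention(BaseAttention):
    """'Modular' definition: only the difference from the parent is written."""

    def forward(self, x):
        return [v * 3 for v in x]


def flatten(cls, name):
    """Build a parent-free class carrying every attribute cls would inherit.

    Limitation of this toy: methods that call super() would not work.
    """
    namespace = {}
    for klass in reversed(cls.__mro__[:-1]):  # oldest ancestor first, skip object
        for key, value in vars(klass).items():
            if key not in ("__dict__", "__weakref__"):
                namespace[key] = value
    return type(name, (), namespace)


StandaloneAttention = flatten(NewAttention, "StandaloneAttention")
layer = StandaloneAttention(hidden_size=4)
```

The flattened class has no parent other than `object`, yet it exposes both the overridden `forward` and the inherited `extra_repr`, which mirrors the goal of keeping each model readable in one place.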

Agents

Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.

Multi agents with manager by @aymeric-roucher in #32687
Add new documentation page for advanced agent usage by @aymeric-roucher in #33265
Create local Transformers Engine by @aymeric-roucher in #33218
Agents use grammar by @aymeric-roucher in #31735

Dynamic cache for decoder-only models

This PR adds dynamic cache support to all decoder-only models (except XLNet).

The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.

Cache: new Cache format in decoder-only models by @zucchini-nlp in #31421
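
For readers unfamiliar with the concept, a dynamic key/value cache grows as tokens are generated instead of being allocated up front. The toy class below is a sketch of that behaviour in plain Python; it is not the library's `DynamicCache`, and `ToyDynamicCache` with its method names is a hypothetical stand-in.

```python
class ToyDynamicCache:
    """Toy per-layer key/value cache that grows as new tokens arrive."""

    def __init__(self):
        self.keys = []    # keys[layer] -> list of per-token key vectors
        self.values = []  # values[layer] -> list of per-token value vectors

    def update(self, new_keys, new_values, layer_idx):
        # Allocate storage lazily the first time a layer is seen.
        while len(self.keys) <= layer_idx:
            self.keys.append([])
            self.values.append([])
        # Append along the sequence axis; nothing is pre-allocated.
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx=0):
        return len(self.keys[layer_idx]) if layer_idx < len(self.keys) else 0


cache = ToyDynamicCache()
# Prefill: three prompt tokens for layer 0, each with a 2-dim key and value.
cache.update([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[1.0, 0.0]] * 3, layer_idx=0)
# Decoding: one new token per step; only its key/value is computed and appended.
for step in range(2):
    keys, values = cache.update([[0.0, float(step)]], [[0.0, 1.0]], layer_idx=0)
```

Each decoding step reuses the stored keys and values for earlier tokens, so attention over the full sequence does not require recomputing them.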

Chat templates updates

We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:

```python
from transformers import pipeline

# model_checkpoint is the name or path of a chat model of your choice
pipe = pipeline("text-generation", model_checkpoint)

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

output = pipe(chat)  # The model will continue outputting JSON!
```
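
At the prompt level, assistant prefill amounts to rendering the final assistant message without its end-of-turn marker, so generation resumes mid-message. The sketch below shows this with a made-up template: the `<|role|>` and `<|end|>` markers and the `render_chat` function are hypothetical, since real chat templates are model-specific.

```python
def render_chat(chat, prefill=True):
    """Render a chat with a toy template; the markers here are hypothetical."""
    parts = []
    for i, message in enumerate(chat):
        is_last = i == len(chat) - 1
        turn = f"<|{message['role']}|>\n{message['content']}"
        # With prefill, the final assistant turn is left open so the model
        # continues it instead of starting a fresh reply.
        if not (prefill and is_last and message["role"] == "assistant"):
            turn += "<|end|>\n"
        parts.append(turn)
    # If no assistant turn was left open, start a new assistant turn.
    if not (prefill and chat and chat[-1]["role"] == "assistant"):
        parts.append("<|assistant|>\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]
prompt = render_chat(chat)
closed_prompt = render_chat(chat, prefill=False)
```

With prefill the prompt ends exactly where the partial assistant message stops; without it, the partial message is closed and a new assistant turn is opened.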

We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
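
As an illustration of the date helper, the sketch below assumes `strftime_now` behaves like `datetime.now().strftime(...)`; inside a template it would be called as `{{ strftime_now("%Y-%m-%d") }}`. The `build_system_message` helper and the message wording are hypothetical.

```python
from datetime import datetime

def strftime_now(fmt):
    """Return the current local date/time formatted with a strftime pattern."""
    return datetime.now().strftime(fmt)

def build_system_message(fmt="%Y-%m-%d"):
    # Hypothetical helper mimicking a template line such as
    # "Today Date: {{ strftime_now('%Y-%m-%d') }}".
    return f"Today Date: {strftime_now(fmt)}"

system_message = build_system_message()
```

Because the date is resolved when the template is rendered, the same template produces an up-to-date system message on every call.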

Enable some Jinja extensions and add datetime capabilities by @Rocketknight1 in #32684
Update Jinja docs with new functions and general cleanup by @Rocketknight1 in #33097
Add assistant prefill for chat templates and TextGenerationPipeline by @Rocketknight1 in #33198
Add a warning to the chat template docs about the tool_calls format by @Rocketknight1 in #33277
Add tip to clarify tool calling by @Rocketknight1 in #32883
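
For a sense of what a `strftime_now`-style helper provides, here is a minimal Python sketch. It assumes the helper simply formats the current local time with a `strftime` pattern; the `system_message` function and its wording are hypothetical, and in practice the call would live inside a Jinja template rather than in Python code.

```python
from datetime import datetime


def strftime_now(fmt):
    # Assumed behavior: format the current local time with a strftime pattern.
    return datetime.now().strftime(fmt)


def system_message():
    # Hypothetical system prompt that embeds today's date.
    return f"You are a helpful assistant. Today's date is {strftime_now('%Y-%m-%d')}."


print(system_message())
```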

Bugfixes and improvements

🌐 [i18n-KO] Translated mask_generation.md to Korean by @jeongiin in #32257
🌐 [i18n-KO] Translated idefics.md to Korean by @boyunJang in #32258
🌐 [i18n-KO] Translated image_to_image.md to Korean by @shinhyunji36 in #32327
Gemma2: add cache warning by @zucchini-nlp in #32279
enable xla fsdp by @hanwen-sun in #32048
Fix typo in tokenization_utils_base.py by @blubitz in #32484
fix broken link in docs by @jorahn in #32491
Docs: alert for the possibility of manipulating logits by @gante in #32467
🌐 [i18n-KO] Translated gptq.md to Korean by @1kmmk1 in #32293
🌐 [i18n-KO] Translated prompting.md to Korean by @chhaewxn in #32294
🌐 [i18n-KO] Translated quantization/quanto.md to Korean by @fabxoe in #32281
🌐 [i18n-KO] Translated image_feature_extraction.md to Korean by @mreraser in #32239
Fix references to model google mt5 small by @JuanFKurucz in #32497
Docs: Fixed WhisperModel.forward’s docstring link by @Sai-Suraj-27 in #32498
🌐 [i18n-KO] Translated chat_templating.md to Korean by @enchantee00 in #32362
Fix link to autoclass_tutorial.md in i18n.md by @JuanFKurucz in #32501
Fix typo: depracted -> deprecated by @tomaarsen in #32489
Fix issue #32518: Update llm_tutorial.md by @doomdagadiggiedahdah in #32523
Change Phi3 _supports_sdpa to True by @pocca2048 in #32457
Uniformize kwargs for processors - GroundingDINO by @SangbumChoi in #31964
Fix add-new-model-like by @molbap in #31773
filter flash_attn optional imports loading remote code by @eaidova in #30954
🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
🌐 [i18n-KO] Translated trainer.md to Korean by @cjfghk5697 in #32260
🌐 [i18n-KO] Translated eetq.md to Korean by @jun048098 in #32352
🌐 [i18n-KO] Translated fsdp.md to Korean by @win2dvp21 in #32261
🌐 [i18n-KO] Translated bitsandbytes.md to Korean by @SeungAhSon in #32408
Fix generate with inputs_embeds as input by @molbap in #32493
Fixed test test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
Fix code example to load bigcode starcoder2 7b by @JuanFKurucz in #32474
[docs] Translation guide by @stevhliu in #32547
Gemma2: fix FA2 generation by @zucchini-nlp in #32553
Fix a bug in Qwen2Audio by @faychu in #32552
fix slow integration gemma2 test by @ArthurZucker in #32534
fix non contiguous tensor value error in save_pretrained by @congcongke in #32422
🌐 [i18n-KO] Translated agent.md to Korean by @Jwaminju in #32351
Fix: FA2 with packed training by @zucchini-nlp in #32487
Fix sliding window attention used in Gemma2FlashAttention2 by @brcps12 in #32522
fix: Fixed conditional check for encodec model names by @Sai-Suraj-27 in #32581
Fix .push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
Cleanup tool calling documentation and rename doc by @Rocketknight1 in #32337
🌐 [i18n-KO] Translated deepspeed.md to Korean by @4N3MONE in #32431
🌐 [i18n-KO] Translated awq.md to Korean by @ahnjj in #32324
fix: Fixed failing test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
"to be not" -> "not to be" by @qgallouedec in #32636
fix: Updated the is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
Expand inputs in processors for VLMs by @zucchini-nlp in #30962
Automatically add transformers tag to the modelcard by @LysandreJik in #32623
Fix tests by @molbap in #32649
fix tensors on different devices in WhisperGenerationMixin by @faaany in #32316
Add support for GrokAdamW optimizer by @ehartford in #32521
Add Depth Anything V2 Metric models by @bt2513 in #32126
Fix: Fixed directory path for utils folder in test_tokenization_utils.py by @Sai-Suraj-27 in #32601
Modify ProcessorTesterMixin for better generalization by @yonigozlan in #32637
TF_Deberta supporting mixed precision by @pinesnow72 in #32618
Fix tests recurrent by @molbap in #32651
Support MUSA (Moore Threads GPU) backend in transformers by @fmo-mt in #31913
fix: Fixed failing tests in tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
Update translation docs review by @stevhliu in #32662
Fix JetMoeIntegrationTest by @ydshieh in #32332
Update the distributed CPU training on Kubernetes documentation by @dmsuehir in #32669
fix: Fixed unknown pytest config option doctest_glob by @Sai-Suraj-27 in #32475
Unpin deepspeed in Docker image/tests by @muellerzr in #32572
Updated workflows to the latest versions by @Sai-Suraj-27 in #32405
reopen: llava-next fails to consider padding_side during Training by @jp1924 in #32679
fix: Corrected falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
fix: update doc link for runhouse in README.md by @muddlebee in #32664
VLMs: small clean-up for cache class by @zucchini-nlp in #32417
add back the position ids by @ArthurZucker in #32554
Use head_dim if in config for RoPE by @suiyoubi in #32495
Generate: unify LogitsWarper and LogitsProcessor by @gante in #32626
[tests] make test_sdpa_equivalence device-agnostic by @faaany in #32520
Cache: use batch_size instead of max_batch_size by @gante in #32657
Fix AutoConfig and AutoModel support for Llava-Next-Video by @TKONIY in #32844
improve _get_is_as_tensor_fns by @zrr1999 in #32596
Revert PR 32299, flag users when Zero-3 was missed by @muellerzr in #32851
fix multi-gpu with static cache by @SunMarc in #32543
Reduce the error log when using core models that need their weights renamed, and provide a step forward by @muellerzr in #32656
Make beam_constraints.Constraint.advance() docstring more accurate by @alex-calderwood in #32674
generate: missing to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
Add Flax Dinov2 by @MHRDYN7 in #31960
support torch-speech by @itazap in #32537
[tests] make test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
Add repr for Conv1D by @AaronZLT in #32425
Support save/load ckpt for XLA FSDP by @yitongh in #32311
RT-DETR parameterized batchnorm freezing by @AlanBlanchet in #32631
Mamba / FalconMamba: Fix mamba left padding by @younesbelkada in #32677
Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by @vasqu in #32694
Docs: Fixed whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
Allow-head-dim by @ArthurZucker in #32857
🚨🚨🚨 Update min version of accelerate to 0.26.0 by @SunMarc in #32627
Fix repr for conv by @ArthurZucker in #32897
fix: jamba cache fails to use torch.nn.module by @xgal in #32894
Fix: Mamba2 norm_before_gate usage by @vasqu in #32686
Replace tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
link for optimizer names by @nbroad1881 in #32400
[i18n-ar] add README_ar.md to README.md by @AhmedAlmaghz in #32583
fix: [whisper] don't overwrite GenerationConfig's return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
Update docker image building by @ArthurZucker in #32918
Jamba: update integration tests by @gante in #32250
fix: Added missing huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
fix: no need to dtype A in jamba by @xgal in #32924
FEAT / Trainer: Add adamw 4bit optimizer by @SunMarc in #31865
CI: separate step to download nltk files by @gante in #32935
FIX / Hub: Also catch for exceptions.ConnectionError by @younesbelkada in #31469
Add SynCode to llm_tutorial by @shubhamugare in #32884
Fix benchmark script by @ydshieh in #32635
Improve greedy search memory usage by @regisss in #32895
fix: (issue #32689) AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook. by @fshp971 in #32849
Gemma2: eager attention by default by @gante in #32865
[run_slow] idefics2 by @andimarafioti in #32840
Fix regression on Processor.save_pretrained caused by #31691 by @leloykun in #32921
🌐 [i18n-KO] Translated knowledge_distillation_for_image_classification.md to Korean by @JinukHong in #32334
Generate: Deprecate returning legacy cache by default; Handle use_cache=False by @gante in #32863
docs: fix outdated link to TF32 explanation by @anakin87 in #32947
Reducing memory usage: removing useless logits computation in generate() by @Cyrilvallez in #31292
Forbid PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
DeviceGuard added to use Deformable Attention more safely on multi-GPU by @DonggeunYu in #32910
added docstring to SchedulerType class by @Arunprakash-A in #32898
Updated the custom_models.md changed cross_entropy code by @S-M-J-I in #33118
CI: add torchvision to the consistency image by @gante in #32941
Test: add higher atol in test_forward_with_num_logits_to_keep by @gante in #33093
mps: add isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
Add changes for uroman package to handle non-Roman characters by @nandwalritik in #32404
fix: Fixed pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
quickfix documentation by @molbap in #32566
Fixup py 38 type hints for mps friendly by @muellerzr in #33128
fix: Fixed CodeGenTokenizationTest::test_truncation failing test by @Sai-Suraj-27 in #32850
fix: multilingual model convert to tflite get wrong token by @Ayaa17 in #32079
disable scheduled daily CI temporarily by @ydshieh in #33136
CI: fix efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
Log additional test metrics with the CometCallback by @Lothiraldan in #33124
[docs] add quick usage snippet to Whisper. by @Vaibhavs10 in #31289
Update stateful_callbacks state before saving checkpoint by @pedrobrs in #32115
fix Idefics2VisionConfig type annotation by @chenzizhao in #33103
Add a fix for custom code tokenizers in pipelines by @Rocketknight1 in #32300
Llama: make slow tests green 🟢 by @gante in #33138
fix redundant checkpointing in example training scripts by @eminorhan in #33131
update torch req for 4-bit optimizer by @SunMarc in #33144
🌐 [i18n-KO] Translated conversations.md to Korean by @newfull5 in #32468
Very small change to one of the function parameters by @alisalamatian1 in #32548
🚨 Add Blip2ForImageTextRetrieval by @jpizarrom in #29261
fix model name and copyright by @mayank31398 in #33152
Fix: Jamba batched generation by @vasqu in #32914
[whisper] pass attention_mask to generate_with_fallback() by @benniekiss in #33145
[RoBERTa-based] Add support for sdpa by @hackyon in #30510
Fix import paths for test_module by @rasmi in #32888
Zero-shot pipelines: minor doc changes by @pcuenca in #33127
Customise the separator used for splicing in DataCollatorWithFlattening by @beep-bebop in #33114
Fix spell mistakes by @matsuo1234567 in #33149
update push CI workflow files for security by @ydshieh in #33142
added quick clarification by @DuyguA in #33166
pass module to Params4bit.from_prequantized to ensure quant_state by @winglian in #32524
Mamba2 conversion script for original models by @vasqu in #32580
Add a static cache that offloads to the CPU or other device by @gerbenvv in #32161
use a single for loop by @ArthurZucker in #33148
Pipeline: fix bad generation kwargs docs by @gante in #33205
Add missing quotes in modeling_llava_next_video.py by @juliendenize in #33214
Add warning for stop string edge case by @Rocketknight1 in #33169
Fix local repos with remote code not registering for pipelines by @Rocketknight1 in #33100
Refactor CI: more explicit by @ArthurZucker in #30674
🌐 [i18n-KO] Translated llm_optims.md to Korean by @yijun-lee in #32325
Fix red amin by @ArthurZucker in #33220
Test fetcher: missing return on filtered tests; don't write empty files by @gante in #33224
Generate: throw warning when return_dict_in_generate is False but should be True by @gante in #33146
Add video text to text docs by @merveenoyan in #33164
Add GraniteRMSNorm by @NielsRogge in #33177
Add duckduckgo search tool by @aymeric-roucher in #32882
Fix: Suppressed 'use_reentrant=False' warning by @ankush13r in #33208
docs: Replace package abbreviations with full name(bitsandbytes) in docstrings by @rapsealk in #33230
Generate: fix assistant in different device by @gante in #33257
remove to restriction for 4-bit model by @SunMarc in #33122
Fixed typo repeated word in DETR docs by @sergiopaniego in #33250
Fix: use torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
remove torch input dependant control flow by @ArthurZucker in #33245
Fix: num_logits_to_keep in composite models by @zucchini-nlp in #33168
Fix Bark saving by @ylacombe in #33266
Update chat template docs to remove Blenderbot by @Rocketknight1 in #33254
Add sdpa support for Albert by @OmarManzoor in #32092
Only disallow DeepSpeed Zero-3 for auto bs finder by @muellerzr in #31731
fix the parallel number of CI nodes when it is smaller than number of tests by @ArthurZucker in #33276
Repo checks: check documented methods exist by @gante in #32320
Fix: multigpu training by @zucchini-nlp in #33271
Cache docs: update by @zucchini-nlp in #32929
Config: unified logic to retrieve text config by @gante in #33219
[fix] LlavaNextProcessor '_get_unpadded_features' method by @laurentd-lunit in #33263
wait 15m before SSH into runner workflow stops by @ydshieh in #33300
Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by @alexsherstinsky in #33188
[InstructBLIP] qformer_tokenizer is required input by @amyeroberts in #33222
[BUG] fix upper nltk version by @ylacombe in #33301
Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by @matthewdouglas in #33154
Add validate images and text inputs order util for processors and test_processing_utils by @yonigozlan in #33285
Fix: Fix FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
Add paper link by @Muennighoff in #33305
🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
Update SECURITY.md by @Michellehbn in #32680
simple align qwen2vl kv_seq_len calculation with qwen2 by @simonJJJ in #33161
Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by @daniellok-db in #33319
Fix: StaticCache & inputs_embeds by @zucchini-nlp in #32932
Docs: add more cross-references to the KV cache docs by @gante in #33323
[whisper] alternative fix for long-form timestamps by @sanchit-gandhi in #32131
fix qwen2vl vision eager-attention by @simonJJJ in #33213
Load dynamic module (remote code) only once if code isn't change by @XuehaiPan in #33162
support loading model without config.json file by @itazap in #32356
Add validation for maximum sequence length in modeling_whisper.py by @AmirMohammadFakhimi in #33196
add self.head_dim for VisionAttention in Qwen2-VL by @GeLee-Q in #33211
support 3D attention mask in bert by @gathierry in #32105
Support reading tiktoken tokenizer.model file by @itazap in #31656
red-ci on main, fix copies by @ArthurZucker in #33356
RoPE: fix BC warning by @gante in #33331
Fix Prefill docs by @Rocketknight1 in #33352
Update author for QLorA/PEFT community notebook by @daniellok-db in #33338
add sdpa mbart by @nbroad1881 in #32033
Fix quantized cache tests by @zucchini-nlp in #33351
schedulefree optimizers by @winglian in #30079
Add visit webpage tool by @aymeric-roucher in #33353
Fixed Majority of the Typos in transformers[en] Documentation by @nnilayy in #33350
Compile compatibilty for decoder-only models by @zucchini-nlp in #32617
Adjust templates by @LysandreJik in #33384
Remove repeated prepare_images in processor tests by @amyeroberts in #33163
Fix import of FalconMambaForCausalLM by @younesbelkada in #33381
Import structure & first three model refactors by @LysandreJik in #31329
VLM: fixes after refactor by @zucchini-nlp in #32907
fixed Mask2Former image processor segmentation maps handling by @maciej-adamiak in #33364
Bug Fix: Update hub.py to fix NoneType error by @rishiraj in #33315
Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by @bruno-hays in #33390
Make StaticCache configurable at model construct time by @guangy10 in #32830
use diff internal model in tests by @itazap in #33387
Fix FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
Fix failing windows by @LysandreJik in #33436
Remove deprecated task in load_dataset by @albertvillanova in #33433
Dynamic number of speculative tokens in order to accelerate speculative decoding by @jmamou in #33258
Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by @kiddj in #33402
[docs] add the missing huggingface hub username by @faaany in #33431
[docs] add the missing tokenizer when pushing models to huggingface hub by @faaany in #33428
Update stale.yml by @LysandreJik in #33434
Docs - update formatting of llama3 model card by @MichaelCurrin in #33438
Fix incomplete sentence in Zero-shot object detection documentation by @sergiopaniego in #33430
Fix flax whisper tokenizer bug by @hannan72 in #33151
Clean-up deprecated code by @zucchini-nlp in #33446
Fix default revision for pipelines by @ankane in #33395
Revive AMD scheduled CI by @ydshieh in #33448
Allow send SSH into runner info. to DM by @ydshieh in #33346
Correct Whisper's beam search scores computation by @ylacombe in #32336
Qwen2-VL: clean-up and add more tests by @zucchini-nlp in #33354
[whisper] Clarify error message when setting max_new_tokens by @benniekiss in #33324
[docs] refine the doc for train with a script by @faaany in #33423
Return image hidden states by @zucchini-nlp in #33426
add a callback hook right before the optimizer step by @winglian in #33444
Enable padding_side as call time kwargs by @zucchini-nlp in #33385
Mitigate a conflict when using sentencepiece by @tengomucho in #33327
[Phi-3] Bug on stale kv cache by @garg-amit in #33129
Fix the initialization of the cache when we have multi gpu by @SunMarc in #33303
Enable finetuning with torchao quantized model by @SunMarc in #33361
Corrected Agents and tools documentation links typos by @sergiopaniego in #33471
chore: fix typo in comment in tokenization_utils_base.py by @DavidLemayian in #33466
Cohere: update RoPE structure by @gante in #33408
Fix SSH workflow by @ydshieh in #33451
Add keypoint-detection task guide by @merveenoyan in #33274
Uniformize kwargs for LLaVa processor and update docs by @yonigozlan in #32858
Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
[i18n-ar] Add File : docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
[Whisper test] Fix some failing tests by @ylacombe in #33450
Fix: Qwen2-VL training on video datasets by @hiyouga in #33307
Updated Trainer's liger-kernel integration to call correct patching API by @shimizust in #33502
Replace accelerator.use_fp16 in examples by @hlky in #33513
Fix parametrization-based weight norm by @ylacombe in #33275
Fix number of patch check for different vision feature select strategy by @insujang in #32494
chore: migrate coverage cfg to pyproject.toml by @SauravMaheshkar in #32650
idefics2 enable_input_require_grads not aligned with disable_input_re… by @sywangyi in #33194
Update chameleon.md — fix runtime type error by @maxwbuckley in #33494
Add explicit example for RAG chat templating by @A-Duss in #33503
CI Build image - move runners by @glegendre01 in #33530
fix to jamba config, asserting attention and expert offset by @ErezSC42 in #33316
Fix missing sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
Uniformize kwargs for Pixtral processor by @yonigozlan in #33521
Add revision to trainer push_to_hub by @teamclouday in #33482
fix patch_attention_mask incorrect setting which leads to the differe… by @sywangyi in #33499
Support LLaVa-OV-Chat by @zucchini-nlp in #33532
Decorator for easier tool building by @aymeric-roucher in #33439
Fix for slow the bug tokenizer adding spaces to single id decodes by @DuyguA in #32564
Chat template: save and load correctly for processors by @zucchini-nlp in #33462
Fix missing head_dim in llama config from gguf model by @Isotr0py in #33526
[i18n-ur] Added README_ur.md file by @akkefa in #33461
fix the wandb logging issue by @ZIYU-DEEP in #33464
Fix tests in ASR pipeline by @ylacombe in #33545
Added support for bfloat16 to zero-shot classification pipeline by @umarbutler in #33554
Pipeline: no side-effects on model.config and model.generation_config 🔫 by @gante in #33480
Return attention mask in ASR pipeline to avoid warnings by @Rocketknight1 in #33509
enforce original size to be a list by @dom-dziela in #33564
Improve compiled RT-DETR inference speed by @yonigozlan in #33412
Fix bnb dequantization by @SunMarc in #33546
Load and save video-processor from separate folder by @zucchini-nlp in #33562
VLMs: enable generation tests by @zucchini-nlp in #33533
rag: fix CI by @gante in #33578
Cache: don't show warning in forward passes when past_key_values is None by @gante in #33541
fix tests with main revision and read token by @molbap in #33560
add uniform processors for altclip + chinese_clip by @molbap in #31198
Generate: check that attention_mask is 2D by @gante in #33575
change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by @VladOS95-cyber in #33375
[Mamba2] Move dt calculations to kernel by @vasqu in #33520
Cache: don't throw warnings on gemma2 when instantiating a new cache by @gante in #33595
Uniformize kwargs for Paligemma processor and update docs by @yonigozlan in #33571
[tests] skip tests for xpu by @faaany in #33553
[tests] enable GemmaIntegrationTest on XPU by @faaany in #33555
Fix Llama 3 TikToken conversion by @pcuenca in #33538
Docs: add the ability to manually trigger jobs by @gante in #33598
Fix CircleCI nightly run by @ydshieh in #33558
Allow CI could be run on private forked repositories (e.g. new model additions) by @ydshieh in #33594
[tests] make more tests device-agnostic by @faaany in #33580
Update modeling_mamba2.py, fix pad size by @klae01 in #32599
Generate: remove flakyness in test_generate_from_inputs_embeds_decoder_only by @gante in #33602
Remove unnecessary CPM model tests by @amyeroberts in #33621
Add sdpa for BioGpt by @OmarManzoor in #33592
VLM generate: tests can't generate image/video tokens by @gante in #33623
Fix missing test in torch_job by @ydshieh in #33593
Add support for args to ProcessorMixin for backward compatibility by @yonigozlan in #33479
Fix contrastive search to correctly handle input with padding by @ducviet00 in #33507
Generate: assistant should sample when the main model samples by @gante in #33534
Fix some missing tests in circleci by @ydshieh in #33559
Update daily ci to use new cluster by @ydshieh in #33627
Fix qwen2vl float16 inference bug by @GeLee-Q in #33312
Fix typos by @litianjian in #33583
enable low-precision pipeline by @jiqing-feng in #31625
Pixtral update example checkpoint by @amyeroberts in #33633
Sdpa dino v2 by @avishaiElmakies in #33403
Clean up Unpack imports by @molbap in #33631
Fix DPT /Dinov2 sdpa regression on main by @molbap in #33660
handle dependency errors in check_imports by @molbap in #33622
add back self.max_position_embeddings = config.max_position_embeddings by @chengchengpei in #33550
Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by @Isotr0py in #33613
Uniformize kwargs for Udop processor and update docs by @yonigozlan in #33628
Generation: deprecate PreTrainedModel inheriting from GenerationMixin by @gante in #33203
Enable BNB multi-backend support by @jiqing-feng in #31098
Fix error string after refactoring into get_chat_template by @tibor-reiss in #33652
uniformize git processor by @yonigozlan in #33668
Fix CIs post merging modular transformers by @ArthurZucker in #33681
Fixed docstring for cohere model regarding unavailability of prune_he… by @mnauf in #33253
Generation tests: update imagegpt input name, remove unused functions by @gante in #33663
Improve Error Messaging for Flash Attention 2 on CPU by @sizhky in #33655
Gemma2: fix config initialization (cache_implementation) by @gante in #33684
Fix ByteLevel alphabet missing when Sequence pretokenizer is used by @umarbutler in #33556
Uniformize kwargs for image-text-to-text processors by @yonigozlan in #32544
🚨🚨 Setting default behavior of assisted decoding by @jmamou in #33657
tests: fix pytorch tensor placement errors by @dvrogozh in #33485
bump tokenizers, fix added tokens fast by @ArthurZucker in #32535
[Pixtral] Improve docs, rename model by @NielsRogge in #33491

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@enchantee00
- 🌐 [i18n-KO] Translated chat_templating.md to Korean (#32362)
@faychu
- Add Qwen2-Audio (#32137)
- Fix a bug in Qwen2Audio (#32552)
@010kim
- 🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean (#32372)
@cjfghk5697
- 🌐 [i18n-KO] Translated trainer.md to Korean (#32260)
@younesbelkada
- Add new model (#32615)
- Mamba / FalconMamba: Fix mamba left padding (#32677)
- FIX / Hub: Also catch for exceptions.ConnectionError (#31469)
- Fix: Fix FalconMamba training issues due to incompatible kernels (#33195)
- Fix import of FalconMambaForCausalLM (#33381)
@4N3MONE
- 🌐 [i18n-KO] Translated deepspeed.md to Korean (#32431)
@jerryzh168
- Add TorchAOHfQuantizer (#32306)
@MHRDYN7
- Add Flax Dinov2 (#31960)
@kamilakesbi
- Add Descript-Audio-Codec model (#31494)
@Isotr0py
- Fix incorrect vocab size retrieval in GGUF config (#32551)
- Add chat_template for tokenizer extracted from GGUF model (#32908)
- 🚨 Support dequantization for most GGML types (#32625)
- Fix missing head_dim in llama config from gguf model (#33526)
- Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)
@AhmedAlmaghz
- [i18n-ar] add README_ar.md to README.md (#32583)
- [i18n-ar] Add File : docs/source/ar/_toctree.yml (#32696)
@simonJJJ
- support qwen2-vl (#32318)
- simple align qwen2vl kv_seq_len calculation with qwen2 (#33161)
- fix qwen2vl vision eager-attention (#33213)
@jpizarrom
- 🚨 Add Blip2ForImageTextRetrieval (#29261)
@mayank31398
- Granite language models (#31502)
- fix model name and copyright (#33152)
- Granitemoe (#33207)
@hackyon
- [RoBERTa-based] Add support for sdpa (#30510)
@Muennighoff
- Add OLMoE (#32406)
- Add paper link (#33305)
@VladOS95-cyber
- Add Qwen2Moe GGUF loading support (#33264)
- change sequence_bias type of SequenceBiasLogitsProcessor to list, add… (#33375)
@jiqing-feng
- enable low-precision pipeline (#31625)
- Enable BNB multi-backend support (#31098)

- Python
Published by LysandreJik over 1 year ago

transformers - Release v4.44.2

Patch release v4.44.2, mostly fixing 2 regressions that were not caught, for Jamba and for processors!

Fix: Jamba cache fails to use torch.nn.module (#32894) Authored by @xgal
Fix: No need to dtype A in Jamba (#32924) @xgal
Fix: Regression on Processor.save_pretrained caused by #31691 (#32921) Authored by @leloykun

- Python
Published by ArthurZucker almost 2 years ago

transformers - Patch release v4.44.1

Here are the different fixes, mostly Gemma2 context length, nits here and there, and generation issues

is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
Fix: FA2 with packed training (#32487) by @zucchini-nlp
Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
add back the position ids (#32554) by @ArthurZucker
Use head_dim if in config for RoPE (#32495) by @suiyoubi and @ArthurZucker
Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
fix multi-gpu with static cache (#32543) by @SunMarc
Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
Fix VLM generation issues (#32836) by @zucchini-nlp
Fix generate with inputs_embeds as input (#32493) (this PR has some cherry-pick)

Full Changelog: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1

- Python
Published by ArthurZucker almost 2 years ago

transformers - Release v4.44.0

Release v4.44.0: End to end compile generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, Efficient SFT training, CPU Offloaded KVCache, torch export for static cache

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performance for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova

💥 End-to-end generation compile

Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:

```python3
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
```

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

3-5x faster torch.compile forward compilation for autoregressive decoder models #32227 by @fxmarty. As documented on the PR, this makes the whole generation a lot faster when you re-use the cache! You can see this when you run model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

Offloaded KV Cache #31325 by @n17s: you just have to set cache_implementation="offloaded" when calling from_pretrained or using this:

```python3
from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4,
    num_beam_groups=2,
    num_return_sequences=4,
    diversity_penalty=1.0,
    max_new_tokens=50,
    early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
```

📦 Torch export for static cache

The PyTorch team gave us a great gift: you can now use torch.export, with output directly compatible with ExecuTorch! Find examples here.

Make static cache compatible with torch.export #32168 by @guangy10

This also unlocks support for prompt reuse:

```python3
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Run the shared prompt once and keep the resulting KV cache.
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(device)
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values

# First question: generate from a copy so the prompt cache stays untouched.
prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to(device)
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

# Second question: reuse the same prompt cache again.
prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to(device)
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
```
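The copy.deepcopy calls in the prompt-reuse snippet matter because the cache is extended in place while tokens are processed. The following is only a toy sketch of that idea using the standard library: ToyCache and toy_forward are hypothetical stand-ins, not transformers classes, and the "cache" just records which tokens it has already seen.

```python
import copy


class ToyCache:
    """Hypothetical stand-in for a KV cache: it only remembers the tokens it has seen."""

    def __init__(self):
        self.seen = []


def toy_forward(tokens, cache):
    """Process only the tokens that are not cached yet, growing the cache in place."""
    new_tokens = tokens[len(cache.seen):]
    cache.seen.extend(new_tokens)
    return len(new_tokens)  # number of tokens that actually had to be processed


prefix = ["you", "are", "a", "historian"]
prompt_cache = ToyCache()
toy_forward(prefix, prompt_cache)  # prefill the shared prompt once

q1 = prefix + ["why", "french", "?"]
q2 = prefix + ["best", "city", "?"]

# Each question works on its own copy, so the shared prefix cache is never modified.
processed_q1 = toy_forward(q1, copy.deepcopy(prompt_cache))
processed_q2 = toy_forward(q2, copy.deepcopy(prompt_cache))
print(processed_q1, processed_q2, len(prompt_cache.seen))  # 3 3 4
```

Without the copy, the first question would be appended to the shared cache, and the second question would start from a prefix that no longer matches its own prompt.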

Gemma2: assisted decoding

Gemma 2: support assisted generation #32357 by @gante

We now have a 2B Gemma 2 model, a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in Gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for Gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.

```py
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON'T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
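To make the idea behind assisted generation more concrete, here is a minimal, library-free sketch of the greedy variant: a small assistant drafts a few tokens and the large model keeps them only while it agrees. Everything in it is an assumption made for illustration (the toy "models" are plain functions, and the names assisted_greedy_decode and num_draft are hypothetical). It is not how transformers implements it, and the sampling case used in the snippet above needs a more involved acceptance rule.

```python
from typing import Callable, List

# A toy "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def assisted_greedy_decode(
    target: NextToken,
    assistant: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft: int = 4,
) -> List[int]:
    """Greedy decoding where a cheap assistant proposes tokens and the target verifies them."""
    if num_draft < 1:
        raise ValueError("num_draft must be at least 1")
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) The assistant drafts a few candidate tokens autoregressively.
        draft: List[int] = []
        for _ in range(min(num_draft, limit - len(tokens))):
            draft.append(assistant(tokens + draft))
        # 2) The target checks the draft position by position. A real model would
        #    score every position in a single forward pass, which is where the
        #    speed-up comes from; here we simply call the toy function.
        accepted: List[int] = []
        for candidate in draft:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if expected != candidate:
                break  # first disagreement: discard the rest of the draft
        tokens.extend(accepted)
    return tokens


def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 3 + 1) % 7


def toy_assistant(prefix: List[int]) -> int:
    # Agrees with the target most of the time, but not always.
    return toy_target(prefix) if len(prefix) % 3 else 0


reference = greedy_decode(toy_target, [1, 2], 10)
assisted = assisted_greedy_decode(toy_target, toy_assistant, [1, 2], 10)
print(assisted == reference)
```

In this greedy sketch the output is always identical to what the target model would produce on its own; the assistant only changes how many target calls could be batched together.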

Nemotron support

Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
Add Nemotron HF Support #31699

Codestral support

Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It uses the Mamba2 architecture. Removing all the einops calls was a bit of a pain, but we hope we made it better for everyone!

Add codestral mamba2 #32080 by @molbap and @vasqu

Breaking changes:

We removed the default chat templates from the code; they should all be on the Hub!
🚨 No more default chat templates #31733 by @Rocketknight1

Long-form decoding for whisper, even faster:

Our great @sanchit-gandhi worked on porting the recent compile upgrades to long-form decoding in:
[whisper] compile compatibility with long-form decoding #31772

What's Changed

Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in https://github.com/huggingface/transformers/pull/31629
Updated ruff to the latest version by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
fix by @gante in https://github.com/huggingface/transformers/pull/32162
fix: Fixed an if condition that is always evaluating to true by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32160
[docs] change temperature to a positive value by @faaany in https://github.com/huggingface/transformers/pull/32077
adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer by @rohitdwivedula in https://github.com/huggingface/transformers/pull/32171
fix: default value reflects the runtime environment variables rather than the ones present at import time. by @junrae6454 in https://github.com/huggingface/transformers/pull/32153
Update qwen2.md by @ArtificialZeng in https://github.com/huggingface/transformers/pull/32108
Remove conversational pipeline tests by @amyeroberts in https://github.com/huggingface/transformers/pull/32099
RoPE: relaxed rope validation by @gante in https://github.com/huggingface/transformers/pull/32182
let's not warn when someone is running a forward by @ArthurZucker in https://github.com/huggingface/transformers/pull/32176
Fix resize embedding with Deepspeed by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32192
Fix float8_e4m3fn in modeling_utils by @SunMarc in https://github.com/huggingface/transformers/pull/32193
Support dequantizing GGUF FP16 format by @PenutChen in https://github.com/huggingface/transformers/pull/31783
🚨 No more default chat templates by @Rocketknight1 in https://github.com/huggingface/transformers/pull/31733
fix: Replaced deprecated unittest method with the correct one by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
[whisper] fix short-form output type by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32178
remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 by @statelesshz in https://github.com/huggingface/transformers/pull/32210
Update question_answering.py by @avlewis in https://github.com/huggingface/transformers/pull/32208
[BigBird Pegasus] set _supports_param_buffer_assignment to False by @kashif in https://github.com/huggingface/transformers/pull/32222
[warnings] fix E721 warnings by @kashif in https://github.com/huggingface/transformers/pull/32223
Follow up for #31973 by @ydshieh in https://github.com/huggingface/transformers/pull/32025
translate philosophy.md to chinese by @statelesshz in https://github.com/huggingface/transformers/pull/32177
Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac by @jrhe in https://github.com/huggingface/transformers/pull/31846
Fix code snippet for Grounding DINO by @qubvel in https://github.com/huggingface/transformers/pull/32229
Generation: stop at eos for assisted decoding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
Llava: generate without images by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32183
Resize embeds with DeepSpeed by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32214
don't log base model architecture in wandb if log model is false by @joaonadkarni in https://github.com/huggingface/transformers/pull/32143
Refactor: Removed un-necessary object base class by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
Adds: extra_repr for RMSNorm layers in most models by @rohitdwivedula in https://github.com/huggingface/transformers/pull/32204
Add check for target_sizes is None in post_process_image_guided_detection for owlv2 by @catalys1 in https://github.com/huggingface/transformers/pull/31934
[tests] fix static cache implementation is not compatible with attn_implementation==flash_attention_2 by @faaany in https://github.com/huggingface/transformers/pull/32039
Flash-Attn: fix generation when no attention mask or no padding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32241
More flexible trigger condition by @ydshieh in https://github.com/huggingface/transformers/pull/32251
Llama 3.1: replace for loop by tensor ops at inv_freq initialization by @gante in https://github.com/huggingface/transformers/pull/32244
🚨 Bloom support for cache class by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31445
Upload new model failure report to Hub by @ydshieh in https://github.com/huggingface/transformers/pull/32264
Optimize t5 tokenize logic to avoid redundant calls by @leejet in https://github.com/huggingface/transformers/pull/32270
fix: Fixed wrong argument passed to convert_blip_checkpoint function call by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
Repo: remove exceptions in check_docstrings by @gante in https://github.com/huggingface/transformers/pull/32259
make p_mask a numpy array before passing to select_starts_ends by @faaany in https://github.com/huggingface/transformers/pull/32076
fix(docs): Fixed a link in docs by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32274
Generate: end-to-end compilation by @gante in https://github.com/huggingface/transformers/pull/30788
Whisper tokenizer word level timestamps by @kamilakesbi in https://github.com/huggingface/transformers/pull/32197
[pipeline] fix padding for 1-d tensors by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/31776
Make static cache compatible with torch.export by @guangy10 in https://github.com/huggingface/transformers/pull/32168
Add stream messages from agent run for gradio chatbot by @aymeric-roucher in https://github.com/huggingface/transformers/pull/32142
use torch 2.4 in 2 CI jobs by @ydshieh in https://github.com/huggingface/transformers/pull/32302
Docs: fix GaLore optimizer code example by @gil2rok in https://github.com/huggingface/transformers/pull/32249
Fix GGUF dequantize for gguf==0.9.1 by @Isotr0py in https://github.com/huggingface/transformers/pull/32298
Cast epochs_trained to int when resuming training by @teddy-f-47 in https://github.com/huggingface/transformers/pull/32286
feat(ci): set fetch-depth: 0 in trufflehog checkout step by @McPatate in https://github.com/huggingface/transformers/pull/31663
Fix M4T for ASR pipeline by @ylacombe in https://github.com/huggingface/transformers/pull/32296
Docs: formatting nits by @gante in https://github.com/huggingface/transformers/pull/32247
Alternative agent plan by @plaggy in https://github.com/huggingface/transformers/pull/32295
fix: Added missing raise keyword for few exceptions by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32333
fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit by @winglian in https://github.com/huggingface/transformers/pull/32276
fixes #32329 : The Torch code is correct - to get an average of 10% o… by @fkrasnov2 in https://github.com/huggingface/transformers/pull/32335
Repo checks: skip docstring checks if not in the diff by @gante in https://github.com/huggingface/transformers/pull/32328
Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process by @xenova in https://github.com/huggingface/transformers/pull/32191
LLaVA-NeXT: fix anyres shapes by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32314
Gemma2 and flash-attention by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32188
Llama 3.1: Fix incorrect inv_freq assignment by @gante in https://github.com/huggingface/transformers/pull/32330
[Idefics2] - Fix FA2 call for Perceiver layer by @amyeroberts in https://github.com/huggingface/transformers/pull/32275
Gemma 2: support assisted generation by @gante in https://github.com/huggingface/transformers/pull/32357
Fix error when streaming to gradio with non-string tool arguments by @aymeric-roucher in https://github.com/huggingface/transformers/pull/32360
>3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in https://github.com/huggingface/transformers/pull/32227
fix: Fixed staticmethods with self as first argument by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
fix: warmup_steps check for training_args by @Ricardo-L-C in https://github.com/huggingface/transformers/pull/32236
LLaVa: add cache class attribute by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32278
[enc-dec cache] fix bug in indexing by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32370
[whisper] compile compatibility with long-form decoding by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/31772
Remove size check between attn_weights and kv_seq_len for phi3 by @helunwencser in https://github.com/huggingface/transformers/pull/32339
add missing attribute _supports_param_buffer_assignment for gpt-j. by @nv-guomingz in https://github.com/huggingface/transformers/pull/32359
Check device map for saving tokenizer config on TPU (fix for issue #31971) by @ayukh in https://github.com/huggingface/transformers/pull/32043
update clean_up_tokenization_spaces warning by @itazap in https://github.com/huggingface/transformers/pull/32371
Empty list in defaults for LLaMA special tokens during weights conversion by @ViktorooReps in https://github.com/huggingface/transformers/pull/32342
Fix conflicting key in init kwargs in PreTrainedTokenizerBase by @OmarManzoor in https://github.com/huggingface/transformers/pull/31233
Offloaded KV Cache by @n17s in https://github.com/huggingface/transformers/pull/31325
Docker: add speech dep to the consistency docker image by @gante in https://github.com/huggingface/transformers/pull/32374
Fixed Hybrid Cache Shape Initialization. by @OsamaS99 in https://github.com/huggingface/transformers/pull/32163
Yell at the user if zero-3 init wasn't performed, but expected to have been done by @muellerzr in https://github.com/huggingface/transformers/pull/32299
Update docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32368
RoPE: Add numerical tests ✨ by @gante in https://github.com/huggingface/transformers/pull/32380
[generate] only require an attention mask for mps with torch<2.4 by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/32367
fix: (issue #32124) Exception raised when running transformers/examples/flax/language-modeling/t5_tokenizer_model.py. by @fshp971 in https://github.com/huggingface/transformers/pull/32157
MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. by @Luke20000429 in https://github.com/huggingface/transformers/pull/31500
Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/32393
fix: SeamlessM4TFeatureExtractor stride remainder by @TechInterMezzo in https://github.com/huggingface/transformers/pull/32088
Phi3 tests: fix typing for Python 3.8 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32388
#32184 save total_vocab_size by @itazap in https://github.com/huggingface/transformers/pull/32240
add values for neftune by @nbroad1881 in https://github.com/huggingface/transformers/pull/32399
Fix documentation references to google/bit-50 model by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32407
Persist embedding type of BART and mBART models after resize by @AbdiHaryadi in https://github.com/huggingface/transformers/pull/32242
fix: Updated test_embeded_special_tokens for luke and mluke models by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
Respect the config's attn_implementation if set by @amyeroberts in https://github.com/huggingface/transformers/pull/32383
Fix documentation links and code reference to model llava-next by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32434
Cache: create docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/32150
Llava: fix checkpoint_doc by @RUFFY-369 in https://github.com/huggingface/transformers/pull/32458
add the missing flash attention test marker by @faaany in https://github.com/huggingface/transformers/pull/32419
Update kwargs validation for preprocess with decorator by @qubvel in https://github.com/huggingface/transformers/pull/32024
Fix get large model config for Switch Transformer encoder only tester by @JuanFKurucz in https://github.com/huggingface/transformers/pull/32438
Dependencies: fix typo by @gante in https://github.com/huggingface/transformers/pull/32389
Add Nemotron HF Support by @suiyoubi in https://github.com/huggingface/transformers/pull/31699
Generate: fix end to end compilation by @gante in https://github.com/huggingface/transformers/pull/32465
Add codestral mamba2 by @molbap in https://github.com/huggingface/transformers/pull/32080

New Contributors

@RhuiDih made their first contribution in https://github.com/huggingface/transformers/pull/31629
@rohitdwivedula made their first contribution in https://github.com/huggingface/transformers/pull/32171
@ArtificialZeng made their first contribution in https://github.com/huggingface/transformers/pull/32108
@avlewis made their first contribution in https://github.com/huggingface/transformers/pull/32208
@jrhe made their first contribution in https://github.com/huggingface/transformers/pull/31846
@joaonadkarni made their first contribution in https://github.com/huggingface/transformers/pull/32143
@catalys1 made their first contribution in https://github.com/huggingface/transformers/pull/31934
@leejet made their first contribution in https://github.com/huggingface/transformers/pull/32270
@guangy10 made their first contribution in https://github.com/huggingface/transformers/pull/32168
@gil2rok made their first contribution in https://github.com/huggingface/transformers/pull/32249
@teddy-f-47 made their first contribution in https://github.com/huggingface/transformers/pull/32286
@plaggy made their first contribution in https://github.com/huggingface/transformers/pull/32295
@fkrasnov2 made their first contribution in https://github.com/huggingface/transformers/pull/32335
@helunwencser made their first contribution in https://github.com/huggingface/transformers/pull/32339
@nv-guomingz made their first contribution in https://github.com/huggingface/transformers/pull/32359
@ayukh made their first contribution in https://github.com/huggingface/transformers/pull/32043
@n17s made their first contribution in https://github.com/huggingface/transformers/pull/31325
@OsamaS99 made their first contribution in https://github.com/huggingface/transformers/pull/32163
@fshp971 made their first contribution in https://github.com/huggingface/transformers/pull/32157
@Luke20000429 made their first contribution in https://github.com/huggingface/transformers/pull/31500
@TechInterMezzo made their first contribution in https://github.com/huggingface/transformers/pull/32088
@AbdiHaryadi made their first contribution in https://github.com/huggingface/transformers/pull/32242
@RUFFY-369 made their first contribution in https://github.com/huggingface/transformers/pull/32458
@suiyoubi made their first contribution in https://github.com/huggingface/transformers/pull/31699

Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0

- Python
Published by ArthurZucker almost 2 years ago

transformers - v4.43.4 Patch Release

Patch Release v4.43.4

There was a mix-up; the DeepSpeed fixes are now properly pushed with:
Resize embeds with DeepSpeed https://github.com/huggingface/transformers/pull/32214

🤗 Enjoy holidays

- Python
Published by ArthurZucker almost 2 years ago

transformers - v4.43.3 Patch deepspeed

Patch release v4.43.3: We still saw some bugs so @zucchini-nlp added:
~Resize embeds with DeepSpeed #32214~
don't log base model architecture in wandb if log model is false #32143

Other fixes:
[whisper] fix short-form output type #32178 by @sanchit-gandhi, which fixes the short audio temperature fallback!
[BigBird Pegasus] set _supports_param_buffer_assignment to False #32222 by @kashif. This is mostly related to the new super fast init: some models need this set to False. If you see weird behavior, look for that 😉

- Python
Published by ArthurZucker almost 2 years ago

transformers - v4.43.2: Patch release

Fix float8_e4m3fn in modeling_utils (#32193)
Fix resize embedding with Deepspeed (#32192)
let's not warn when someone is running a forward (#32176)
RoPE: relaxed rope validation (#32182)

- Python
Published by LysandreJik almost 2 years ago

transformers - v4.43.1: Patch release

fix (#32162)

- Python
Published by LysandreJik almost 2 years ago

transformers - v4.43.0: Llama 3.1, Chameleon, ZoeDepth, Hiera

Llama

The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.

To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.

We release a repository of llama recipes to showcase usage for inference, total and partial fine-tuning of the different variants.

Chameleon

The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates a textual response.

Chameleon: add model by @zucchini-nlp in #31534

ZoeDepth

The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

Add ZoeDepth by @NielsRogge in #30136

Hiera

Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.

Adding hiera by @Namangarg110 in #30356

Agents

Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a dedicated final_answer tool helps the llm_engine find what to return, so we generalized the final_answer tool to all agents.

Adds final answer tool for all agents by @aymeric-roucher in #31703
Code agent: allow function persistence between steps by @aymeric-roucher in #31769 👉 Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!
Agents planning by @aymeric-roucher in #31702 👉 This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan will be done. At later steps (like steps 3, 6, 9 if you set planning_interval=3), this plan will be updated by the agent depending on the history of previous steps. More detail soon! Depending on whether it is merged by then, we could also add:
Add stream messages from agent run for gradio chatbot by @freddyaboulton and @aymeric-roucher in #32142 👉 New method stream_to_gradio runs your agent and streams the output of the run as gradio messages, to easily visualize the run in a gradio chatbot!
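The planning schedule described above (a first plan at step 0, then an update every planning_interval steps) can be pictured with a tiny helper. This is only a sketch of the schedule as worded in these notes: the function name planning_steps is hypothetical and the agent's actual trigger condition may be implemented differently.

```python
from typing import List, Optional


def planning_steps(max_steps: int, planning_interval: Optional[int]) -> List[int]:
    """Return the step indices at which a planning step would run."""
    if planning_interval is None:
        return []  # planning is only activated when an int is provided
    if planning_interval < 1:
        raise ValueError("planning_interval must be a positive integer")
    return [step for step in range(max_steps) if step % planning_interval == 0]


print(planning_steps(10, 3))  # [0, 3, 6, 9]
```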

Notable changes to the codebase

A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.

Llama: RoPE refactor by @gante in #32135

Breaking changes

TextGenerationPipeline and tokenizer kwargs

🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. This means some models using TextGenerationPipeline previously did not add a <bos> by default, which (negatively) impacted their performance. In practice, this is a breaking change.

Example of a script changed as a result of this PR:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
```

🚨🚨 TextGenerationPipeline: rely on the tokenizer default kwargs by @gante in #31747
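One way to read "rely on the tokenizer's defaults when these flags are unset" is that a flag is only forwarded to the tokenizer when the caller explicitly set it. The sketch below illustrates that pattern with a toy tokenizer; forwarded_kwargs and toy_tokenizer are hypothetical names and this is not the pipeline's actual code.

```python
def forwarded_kwargs(**flags):
    """Keep only the flags the caller explicitly set (anything left as None is dropped)."""
    return {name: value for name, value in flags.items() if value is not None}


def toy_tokenizer(text, add_special_tokens=True):
    """Toy tokenizer whose own default is to prepend a <bos> token."""
    tokens = text.split()
    return ["<bos>"] + tokens if add_special_tokens else tokens


# Flag left unset: the tokenizer's default applies, so <bos> is added.
print(toy_tokenizer("Foo bar", **forwarded_kwargs(add_special_tokens=None)))
# ['<bos>', 'Foo', 'bar']

# Flag set explicitly: the caller's choice wins.
print(toy_tokenizer("Foo bar", **forwarded_kwargs(add_special_tokens=False)))
# ['Foo', 'bar']
```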

Bugfixes and improvements

Fix post gemma merge by @ArthurZucker in #31660
Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
[docs] Llama3 by @stevhliu in #31662
[HybridCache] Fix get_seq_length method by @sanchit-gandhi in #31661
don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
Fix Gemma2 4d attention mask by @hiyouga in #31674
Fix return_dict in encodec by @jla524 in #31646
add gather_use_object arguments by @SangbumChoi in #31514
Gemma capping is a must for big models by @ArthurZucker in #31698
Add French version of run scripts tutorial by @jadechoghari in #31483
dependencies: keras-nlp<0.14 pin by @gante in #31684
remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
Move some test files (tests/test_xxx_utils.py) to tests/utils by @ydshieh in #31730
Fix mistral ONNX export by @fxmarty in #31696
[whisper] static kv cache by @sanchit-gandhi in #31166
Make tool JSON schemas consistent by @Rocketknight1 in #31756
Fix documentation for Gemma2. by @jbornschein in #31682
fix assisted decoding by @jiqing-feng in #31401
Requires for torch.tensor before casting by @echarlaix in #31755
handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
Gemma 2: Update slow tests by @gante in #31759
Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
[fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
Fix RT-DETR cache for generate_anchors by @qubvel in #31671
Fix RT-DETR weights initialization by @qubvel in #31724
pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
Fix Gemma2 types by @hiyouga in #31779
Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
Fix serialization for offloaded model by @SunMarc in #31727
Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
Fix gemma tests by @ydshieh in #31794
Add training support for SigLIP by @aliencaocao in #31495
Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
Fix galore lr display with schedulers by @vasqu in #31710
Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
Depth Anything: update conversion script for V2 by @pcuenca in #31522
Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
Add FA2 and sdpa support for SigLIP by @qubvel in #31499
Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
Fix typos by @omahs in #31819
transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
Avoid failure TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
Fix incorrect accelerator device handling for MPS in TrainingArguments by @andstor in #31812
Mamba & RecurrentGemma: enable strict signature by @gante in #31549
Deprecate vocab_size in other two VLMs by @zucchini-nlp in #31681
FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
[Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
chore: remove duplicate words by @hattizai in #31853
save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
Test loading generation config with safetensor weights by @gante in #31550
docs: typo in tf qa example by @chen-keinan in #31864
Generate: Add new decoding strategy "DoLa" in .generate() by @voidism in #29619
Fix _init_weights for ResNetPreTrainedModel by @ydshieh in #31851
Update depth estimation task guide by @merveenoyan in #31860
Bump zipp from 3.7.0 to 3.19.1 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31871
Add return type annotation to PreTrainedModel.from_pretrained by @mauvilsa in #31869
Revert "Fix _init_weights for ResNetPreTrainedModel" by @ydshieh in #31868
Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/visual_bert by @dependabot[bot] in #31872
add warning when using gradient_checkpointing with FSDP full shard by @yundai424 in #31578
Add conversion for interleave llava by @zucchini-nlp in #31858
remove duplicate words in msg by @yukionfire in #31876
Fix file type checks in data splits for contrastive training example script by @npyoung in #31720
Fix failed tests in #31851 by @ydshieh in #31879
fix: Removed duplicate field definitions in some classes by @Sai-Suraj-27 in #31888
Push sharded checkpoint to hub when push_to_hub=True in TrainingArguments by @SunMarc in #31808
[RT-DETR] Add resources by @NielsRogge in #31815
Modify warnings in a with block to avoid flaky tests by @ydshieh in #31893
Add a condition for nested_detach by @haikuoxin in #31855
InstructBlipVideo: Update docstring by @zucchini-nlp in #31886
Fixes to alternating SWA layers in Gemma2 by @turboderp in #31775
Processor accepts any kwargs by @zucchini-nlp in #31889
[ConvertSlow] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
[Gemma2] Support FA2 softcapping by @ArthurZucker in #31887
Fix missing methods for Fuyu by @Isotr0py in #31880
fix: Fixed the 1st argument name in classmethods by @Sai-Suraj-27 in #31907
add gather_use_object arguments II by @SangbumChoi in #31799
Add warning message for beta and gamma parameters by @OmarManzoor in #31654
Fix fx tests with inputs_embeds by @fxmarty in #31862
Refactor flash attention implementation in transformers by @ArthurZucker in #31446
Generate: fix SlidingWindowCache.reset() by @gante in #31917
🚨 fix(SigLip): remove spurious exclusion of first vision output token by @transmissions11 in #30952
Allow Trainer.get_optimizer_cls_and_kwargs to be overridden by @apoorvkh in #31875
[Bug Fix] fix qa pipeline tensor to numpy by @jiqing-feng in #31585
Docker: TF pin on the consistency job by @gante in #31928
fix prompt strip to support tensors and np arrays by @AvivSham in #27818
Fix GenerationMixin.generate compatibility with pytorch profiler by @fxmarty in #31935
Generate: remove deprecated code due to Cache and cache_position being default by @gante in #31898
Generate: v4.42 deprecations 🧹🧹 by @gante in #31956
Whisper: move to tensor cpu before converting to np array at decode time by @gante in #31954
fix: Removed a wrong key-word argument in sigmoid_focal_loss() function call by @Sai-Suraj-27 in #31951
Generate: handle logits_warper update in models with custom generate fn by @gante in #31957
fix: Fixed the arguments in create_repo() function call by @Sai-Suraj-27 in #31947
Notify new docker images built for circleci by @ydshieh in #31701
Avoid race condition by @ydshieh in #31973
Masking: remove flakiness from test by @gante in #31939
Generate: doc nits by @gante in #31982
Fix the incorrect permutation of gguf by @PenutChen in #31788
Cambricon MLUs support SDPA and flash_attn by @huismiling in #31102
Speedup model init on CPU (by 10x+ for llama-3-8B as one example) by @muellerzr in #31771
[tests] fix deepspeed zero3 config for test_stage3_nvme_offload by @faaany in #31881
Fix bad test about slower init by @muellerzr in #32002
Tests: remove cuda versions when the result is the same 🧹🧹 by @gante in #31955
Bug report update by @gante in #31983
add flash-attn deterministic option to flash-attn>=2.4.1 by @junrae6454 in #31961
fix: Fixed incorrect dictionary assignment in src/transformers/__init__.py by @Sai-Suraj-27 in #31993
Bug report update -- round 2 by @gante in #32006
Fix gather when collecting 'num_input_tokens_seen' by @CodeCreator in #31974
Fix if else and actually enable superfast init by @muellerzr in #32007
SpeechEncoderDecoder doesn't support param buffer assignments by @muellerzr in #32009
Fix tests skip by @qubvel in #32012
Fixed log messages that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
Fix typo in classification function selection logic to improve code consistency by @moses in #32031
doc: fix broken BEiT and DiNAT model links on Backbone page by @dvrogozh in #32029
Pass missing arguments to SeamlessM4Tv2ConformerEncoderLayer.forward() when gradient checkpointing is enabled by @anferico in #31945
Add language to word timestamps for Whisper by @robinderat in #31572
Add sdpa and FA2 for CLIP by @qubvel in #31940
unpin numpy<2.0 by @ydshieh in #32018
Chameleon: minor fixes after shipping by @zucchini-nlp in #32037
Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31458
Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples by @dependabot[bot] in #32052
[mistral] Support passing head_dim through config (and do not require head_dim * num_heads == hidden_size) by @xenova in #32050
Add torch.compile Support For Mamba by @zhenglongjiepheonix in #31247
fix: Removed duplicate entries in a dictionary by @Sai-Suraj-27 in #32041
docs: Fixed 2 links in the docs along with some minor fixes by @Sai-Suraj-27 in #32058
Llava: add default chat templates by @zucchini-nlp in #31691
[Chameleon, Hiera] Improve docs by @NielsRogge in #32038
Incorrect Whisper long-form decoding timestamps by @kamilakesbi in #32003
[mistral] Fix FA2 attention reshape for Mistral Nemo by @xenova in #32065
VideoLLaVa: fix chat format in docs by @zucchini-nlp in #32083
Fix progress callback deepcopy by @fozziethebeat in #32070
Fixes to chameleon docs by @merveenoyan in #32078
Add image-text-to-text task guide by @merveenoyan in #31777
Support generating with fallback for short form audio in Whisper by @kamilakesbi in #30984
Disable quick init for deepspeed by @muellerzr in #32066
Chameleon: not supported with fast load by @zucchini-nlp in #32091
Fix tests after huggingface_hub 0.24 by @Wauplin in #32054
Fix shard order by @b-chu in #32023
Generate: store special token tensors under a unique variable name by @gante in #31980
fix: Replaced deprecated mktemp() function by @Sai-Suraj-27 in #32123
Mention model_info.id instead of model_info.modelId by @Wauplin in #32106
[generate] fix eos/pad id check on mps devices by @sanchit-gandhi in #31695
Fix failing test with race condition by @Rocketknight1 in #32140
Update ko/_toctree.yml and remove custom_tools.md to reflect latest changes by @jungnerd in #31969
fix: Fixed raising TypeError instead of ValueError for invalid type by @Sai-Suraj-27 in #32111
[RoBERTa] Minor clarifications to model doc by @bt2513 in #31949
Return assistant generated tokens mask in apply_chat_template by @yonigottesman in #30650
Don't default to other weights file when use_safetensors=True by @amyeroberts in #31874
set warning level to info for special tokens have been added by @ArthurZucker in #32138
Add new quant method by @SunMarc in #32047
Add llama3-llava-next-8b to llava_next conversion script by @jamt9000 in #31395
LLaVaNeXT: pad on right if training by @zucchini-nlp in #32134
Remove trust_remote_code when loading Libri Dummy by @sanchit-gandhi in #31748
[modelling] remove un-necessary transpose for fa2 attention by @sanchit-gandhi in #31749
Fix mask creations of GPTNeoX and GPT2 by @vasqu in #31944
Add method to retrieve used chat template by @KonradSzafer in #32032
Add YaRN and Dynamic-YaRN RoPE Scaling Methods by @mig-mfreitas in #30910
Disable quick init for TapasPreTrainedModel by @daniellok-db in #32149
Modify resize_token_embeddings to ensure output type is same as input by @bayllama in #31979
gguf conversion add_prefix_space=None for llama3 by @itazap in #31937
Fix flash attention speed issue by @Cyrilvallez in #32028
Fix video batching to videollava by @merveenoyan in #32139
Added mamba.py backend by @alxndrTL in #30139
Rename Phi-3 rope scaling type by @garg-amit in #31436
Revert "Incorrect Whisper long-form decoding timestamps " by @sanchit-gandhi in #32148
Fix typing to be compatible with later py versions by @amyeroberts in #32155
feat(cache): StaticCache uses index_copy_ to avoid useless copy by @tengomucho in #31857
Added additional kwarg for successful running of optuna hyperparameter search by @DeF0017 in #31924
Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@aliencaocao
- Fix float out of range in owlvit and owlv2 when using FP16 or lower precision (#31657)
- Add torch_empty_cache_steps to TrainingArguments (#31546)
- Add training support for SigLIP (#31495)
- Allow FP16 or other precision inference for Pipelines (#31342)
@voidism
- Generate: Add new decoding strategy "DoLa" in .generate() (#29619)
@Namangarg110
- Adding hiera (#30356)

- Python
Published by LysandreJik almost 2 years ago

transformers - Patch release v4.42.4

Mostly gemma2 support FA2 softcapping!

but also fix the sliding window for long context and other typos.

[Gemma2] Support FA2 softcapping (#31887) by @ArthurZucker
[ConvertSlow] make sure the order is preserved for addedtokens (#31902) by @ArthurZucker
Fixes to alternating SWA layers in Gemma2 (#31775) by @turboderp
Requires for torch.tensor before casting (#31755) by @echarlaix

Was off last week could not get this out, thanks all for your patience 🥳

- Python
Published by ArthurZucker almost 2 years ago

transformers - Patch release v4.42.3

Make sure we have attention softcapping for "eager" GEMMA2 model

After experimenting, we noticed that for the 27b model mostly, softcapping is a must. So adding it back (it should have been there, but an error on my side made it disappear) sorry all! 😭

Gemma capping is a must for big models (#31698)
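
For readers unfamiliar with the term, soft-capping squashes scores into a bounded range with a scaled tanh instead of clipping them abruptly. The snippet below is a minimal, framework-free sketch of that idea and not the library's implementation; the function name `soft_cap` and the cap value `50.0` are assumptions chosen only for illustration.

```python
import math


def soft_cap(scores, cap):
    """Squash each score into the open interval (-cap, cap) using a scaled tanh."""
    return [cap * math.tanh(s / cap) for s in scores]


# Hypothetical raw attention scores; 50.0 is an illustrative cap value.
raw_scores = [-200.0, -1.0, 0.0, 1.0, 200.0]
capped = soft_cap(raw_scores, cap=50.0)

for raw, value in zip(raw_scores, capped):
    print(f"{raw:8.1f} -> {value:8.3f}")
```

Small scores pass through almost unchanged, while very large ones saturate just below the cap, which keeps extreme attention logits from dominating.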

- Python
Published by ArthurZucker almost 2 years ago

transformers - Patch release v4.42.2

Patch release

Thanks to our 2 contributors for their prompt fixes, which mostly apply to training and FA2!

Fix Gemma2 4d attention mask (#31674) by @hiyouga
don't zero out the attention_mask when using sliding window with flash attention (#31670) by @winglian

- Python
Published by ArthurZucker almost 2 years ago

transformers - v4.42.1: Patch release

Patch release for commit:

[HybridCache] Fix get_seq_length method (#31661)

- Python
Published by LysandreJik almost 2 years ago

transformers - v4.42.0: Gemma 2, RTDETR, InstructBLIP, LLAVa Next, New Model Adder

New model additions

Gemma-2

The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by Gemma2 Team, Google. Gemma2 models are trained on 6T tokens, and released with 2 versions, 2b and 7b.

The abstract from the paper is the following:

This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations

Add gemma 2 by @ArthurZucker in #31659

RTDETR

The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.

RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.

New model support RTDETR by @SangbumChoi in #29077

InstructBlip

The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.

InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.

Add video modality for InstrucBLIP by @zucchini-nlp in #30182

LlaVa NeXT Video

The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image datasets, thus increasing the model’s performance on videos.

LLaVA-NeXT surprisingly has strong performance in understanding video content in zero-shot fashion with the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image as multiple images. This technique generalizes naturally to videos, because a video can be considered as a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-Next on video data to achieve better video understanding capabilities. The model is a current SOTA among open-source models on VideoMME bench.

Add LLaVa NeXT Video by @zucchini-nlp in #31252

New model adder

A very significant change makes its way within the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:

The diff_converter tool is here to replace our old Copied from statements, while keeping our core transformers philosophy:

- single model single file
- explicit code
- standardization of modeling code
- readable and educative code
- simple code
- least amount of modularity

This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.

Diff converter v2 by @ArthurZucker in #30868

Tool-use and RAG model support

We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.

If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.

Chat Template support for function calling and RAG by @Rocketknight1 in #30621
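
To make the idea of "JSON schema descriptions for Python functions" concrete, here is a minimal standard-library sketch that derives such a description from a function's type hints and docstring. It illustrates the concept only and is not the transformers implementation; the helper `describe_function`, the example tool `get_current_temperature`, and the exact schema layout are assumptions made for this example.

```python
import inspect
import json
from typing import get_type_hints

# Mapping from Python annotations to JSON schema type names (illustrative subset).
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def describe_function(func):
    """Build a JSON-schema-style description from a signature and docstring."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES[hints[name]]}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_current_temperature(location: str, unit: str = "celsius") -> float:
    """Get the current temperature at a location."""
    return 22.0  # Placeholder value; a real tool would query a weather service.


schema = describe_function(get_current_temperature)
print(json.dumps(schema, indent=2))
```

A description in this shape can be handed to a tool-capable model alongside the conversation, so the model knows which arguments the function expects.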

GGUF support

We further the support of GGUF files to offer fine-tuning within the python/HF ecosystem, before converting them back to the GGUF/GGML/llama.cpp libraries.

Add Qwen2 GGUF loading support by @Isotr0py in #31175
GGUF: Fix llama 3 GGUF by @younesbelkada in #31358
Fix llama gguf converter by @SunMarc in #31575

Trainer improvements

A new optimizer is added in the Trainer.

FEAT / Trainer: LOMO optimizer support by @younesbelkada in #30178

Quantization improvements

Several improvements are done related to quantization: a new cache (the quantized KV cache) is added, offering the ability to convert the cache of generative models, further reducing the memory requirements.

Additionally, the documentation related to quantization is entirely redone with the aim of helping users choose which is the best quantization method.

Quantized KV Cache by @zucchini-nlp in #30483
Docs / Quantization: refactor quantization documentation by @younesbelkada in #30942
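
As a rough intuition for why quantizing cached values saves memory, the sketch below stores floats as small integer codes plus a scale and an offset, then restores approximate values. This is a generic min-max (affine) quantization example in plain Python, not the library's quantized cache; the function names and the 4-bit width are assumptions for illustration.

```python
def quantize(values, nbits=4):
    """Map floats to integers in [0, 2**nbits - 1] with a min-max (affine) scheme."""
    lo, hi = min(values), max(values)
    levels = 2 ** nbits - 1
    scale = (hi - lo) / levels or 1.0  # Avoid a zero scale for constant inputs
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


cached_values = [-1.5, -0.25, 0.0, 0.75, 1.5]  # Stand-in for key/value entries
codes, scale, lo = quantize(cached_values, nbits=4)
restored = dequantize(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(cached_values, restored))

print(codes)
print(max_error)
```

Each value now needs only 4 bits instead of 16 or 32, at the cost of a rounding error bounded by half the scale.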

Examples

New instance segmentation examples are added by @qubvel

Instance segmentation examples by @qubvel in #31084

Notable improvements

As a notable improvement to the HF vision models that leverage backbones, we enable leveraging HF pretrained model weights as backbones, with the following API:

```py
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
model = MaskFormerForInstanceSegmentation(config)
```

Enable HF pretrained backbones by @amyeroberts in #31145

Additionally, we thank @Cyrilvallez for diving into our generate method and greatly reducing the memory requirements.

Reduce by 2 the memory requirement in generate() 🔥🔥🔥 by @Cyrilvallez in #30536

Breaking changes

Remove ConversationalPipeline and Conversation object

Both the ConversationalPipeline and the Conversation object have been deprecated for a while, and are due for removal in 4.42, which is the upcoming version.

The TextGenerationPipeline is recommended for this use-case, and now accepts inputs in the form of the OpenAI API.

🚨 Remove ConversationalPipeline and Conversation object by @Rocketknight1 in #31165

Remove an accidental duplicate softmax application in FLAVA's attention

Removes the duplicate softmax application in FLAVA's attention. The effect on outputs is likely small, but the change is flagged with 🚨 because outputs will differ slightly.

🚨 FLAVA: Remove double softmax by @amyeroberts in #31322
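
To see why a second softmax matters, the following plain-Python sketch applies softmax once and then twice to the same scores. It is a generic illustration, not FLAVA's attention code, and the input scores are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # Hypothetical attention scores
once = softmax(scores)
twice = softmax(once)  # Accidental second application

print([round(p, 3) for p in once])
print([round(p, 3) for p in twice])
```

Both results sum to 1, but the second application flattens the distribution, so attention weights computed that way are less peaked than intended.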

Idefics2's `ignore_index` attribute of the loss is updated to `-100`

🚨 [Idefics2] Update ignore index by @NielsRogge in #30898

out_indices from `timm` being updated

Recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the input type of out_indices on the create_model call i.e. either tuple or list. Now, this value is always a tuple.

As lists are more useful and consistent for us -- we cannot save tuples in configs, they must be converted to lists first -- we instead choose to cast out_indices to always be a list.

This has the possibility of being a slight breaking change if users are creating models and relying on out_indices being a tuple. As this only happens when a new model is created, and not if it's saved and reloaded (because of the config), I think this has a low chance of having much of an impact.

🚨 out_indices always a list by @amyeroberts in #30941
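
The point about tuples not surviving a config round trip can be checked with the standard `json` module alone. The sketch below uses a plain dictionary as a stand-in for a model config; the dictionary and its values are illustrative assumptions.

```python
import json

out_indices = (1, 2, 3)  # e.g. a tuple reported for a freshly created model
config = {"out_indices": out_indices}

# JSON has no tuple type, so the tuple is written as an array...
serialized = json.dumps(config)
# ...and comes back as a list after reloading.
reloaded = json.loads(serialized)

print(type(config["out_indices"]).__name__)    # tuple
print(type(reloaded["out_indices"]).__name__)  # list
print(list(out_indices) == reloaded["out_indices"])  # True
```

Casting to a list up front means a newly created model and a reloaded one expose the same type.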

datasets referenced in the quantization config get updated to remove references to datasets with restrictive licenses.

🚨 Remove dataset with restrictive license by @echarlaix in #31452

Bugfixes and improvements

Add fixed resize and pad strategy for object detection by @qubvel in #30742
Enable dynamic resolution input for Swin Transformer and variants by @the-neural-networker in #30656
Add TokenClassification for Mistral, Mixtral and Qwen2 by @josephenguehard in #29878
FIX / Quantization: Fix Dockerfile build by @younesbelkada in #30890
Add support for torch.compile dynamic shapes by @warner-benjamin in #30560
LLaVa-Next: Update docs with batched inference by @zucchini-nlp in #30857
DeformableDETR two stage support bfloat16 by @DonggeunYu in #30907
add return_token_timestamps to WhisperProcessor by @kamilakesbi in #30812
Fix num_hidden_layers in initialization of new model in Mamba by @SrGonao in #30403
separate kwargs in processor (similar to #30193) by @Eric2i in #30905
fix for custom pipeline configuration by @not-lain in #29004
Add AutoFeatureExtractor support to Wav2Vec2ProcessorWithLM by @ylacombe in #28706
Fix a shape annotation and typos in mamba slow forward by @vasqu in #30691
tokenizer_class = "AutoTokenizer" Llava Family by @ArthurZucker in #30912
Introduce configured_state arg for accelerator_config by @muellerzr in #29781
Add torch.compile for Mistral by @zhenglongjiepheonix in #30642
[docs] Spanish translation of model_memory_anatomy.md by @aaronjimv in #30885
FIX / TST: Fix expected results on Mistral slow test (A10) by @younesbelkada in #30909
PaliGemma - fix processor with no input text by @hiyouga in #30916
CI: AMD MI300 tests fix by @mht-sharma in #30797
Enforce saving at end of training if saving option chosen by @muellerzr in #30160
fix: center_crop occasionally outputs off-by-one dimension matrix by @mattlbeck in #30934
[Benchmark] Reuse optimum-benchmark by @ydshieh in #30615
TST / Workflows: Get slack notifications for docker image build by @younesbelkada in #30891
Fix swin embeddings interpolation by @amyeroberts in #30936
Fix inhomogeneous shape error in example by @Zantares in #30434
update ruff version by @ArthurZucker in #30932
Update build ci image [push-ci-image] by @ArthurZucker in #30933
Update video-llava docs by @zucchini-nlp in #30935
Fix low cpu mem usage tests by @SunMarc in #30808
[doc] Add references to the fine-tuning blog and distil-whisper to Whisper. by @Vaibhavs10 in #30938
Avoid extra chunk in speech recognition by @jonatanklosko in #29539
[whisper] only trigger forced ids warning once by @sanchit-gandhi in #30966
Paligemma - fix slow tests, add bf16 and f16 slow tests by @molbap in #30851
Finally fix the missing new model failure CI report by @ydshieh in #30968
legacy to init the slow tokenizer when converting from slow was wrong by @ArthurZucker in #30972
Generation: get special tokens from model config by @zucchini-nlp in #30899
[Whisper] Strip prompt before finding common subsequence by @sanchit-gandhi in #27836
Fix link in Pipeline documentation by @junhl in #30948
[Mistral and friends] Update MLP by @NielsRogge in #31057
Paligemma causal attention mask by @molbap in #30967
Update object detection with latest resize and pad strategies by @qubvel in #30955
Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size by @kamilakesbi in #30637
Push ci image by @ArthurZucker in #30982
test_custom_4d_attention_mask skip with sliding window attn by @poedator in #30833
Finish adding support for torch.compile dynamic shapes by @warner-benjamin in #30919
FIX / Docs: Minor changes in quantization docs by @younesbelkada in #30985
Fix accelerate failing tests by @SunMarc in #30836
[tests] add torch.use_deterministic_algorithms for XPU by @faaany in #30774
Add a check that warmup_steps is either 0 or >= 1 by @ymoslem in #30764
Update 4 MptIntegrationTests expected outputs by @ydshieh in #30989
[Port] TensorFlow implementation of Mistral by @ariG23498 in #29708
Remove deprecated properties in tokenization_nllb.py and tokenization_nllb_fast.py by @ymoslem in #29834
Bugfix: WandbCallback uploads initial model checkpoint by @mgerstgrasser in #30897
add prefix space ignored in llama #29625 by @itazap in #30964
Fix training speed regression introduced by "optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @kkoehncke in #26139)"
Do not trigger autoconversion if local_files_only by @Wauplin in #31004
pin uv==0.1.45 by @ydshieh in #31006
Perceiver interpolate position embedding by @g1y5x3 in #30979
[tests] make test_model_parallelism device-agnostic by @faaany in #30844
FIX / TST: Fix expected results on Mistral AWQ test by @SunMarc in #30971
allow multi-gpu by @ydshieh in #31011
Fix resume_download future warning by @Wauplin in #31007
Quantization / TST: Fix remaining quantization tests by @younesbelkada in #31000
save the list of new model failures by @ydshieh in #31013
added interpolation for vitmae model in pytorch as well as tf. by @bhuvanmdev in #30732
Add split special tokens by @itazap in #30772
Paligemma- fix devices and dtype assignments by @molbap in #31008
Redirect transformers_agents doc to agents by @aymeric-roucher in #31054
unpin uv by @ydshieh in #31055
Follow up: Fix link in dbrx.md by @eitanturok in #30514
Update feature request label in template by @amyeroberts in #30940
Fix quanto tests by @SunMarc in #31062
Fix pad_to_max_length Whisper by @ylacombe in #30787
skip test_model_parallelism for 2 model test classes by @ydshieh in #31067
use @main by @ydshieh in #31065
Remove ninja from docker image build by @ydshieh in #31080
fix "piano" typo by @clinty in #31027
Update quicktour.md to fix broken link to Glossary by @apalkk in #31072
Remove redundant backend checks in training_args.py by @kevint324 in #30999
fix from_pretrained in offline mode when model is preloaded in cache by @oOraph in #31010
Remove float64 cast for OwlVit and OwlV2 to support MPS device by @qubvel in #31071
Fix OWLv2 post_process_object_detection for multiple images by @qubvel in #31082
Fix typo in trainer.py by @taslimisina in #31048
[SuperPoint, PaliGemma] Update docs by @NielsRogge in #31025
Fix failing tokenizer tests by @LysandreJik in #31083
Watermark: fix tests by @zucchini-nlp in #30961
Docs / PEFT: Add PEFT API documentation by @younesbelkada in #31078
Render chat template tojson filter as unicode by @CISC in #31041
FIX: Add accelerate as a hard requirement by @younesbelkada in #31090
FIX / OPT: Fix OPT multi-GPU training for OPTForQuestionAnswering by @younesbelkada in #31092
skip test_multi_gpu_data_parallel_forward for vit and deit by @ydshieh in #31086
Fix PretrainedConfig docstring with deprecated resume_download by @albertvillanova in #31014
Fix DeepSpeed compatibility with weight_norm by @jonnyli1125 in #30881
TST: Fix instruct-blip tests by @younesbelkada in #31088
Docs / Quantization: Redirect deleted page by @younesbelkada in #31063
Deprecate low use models by @amyeroberts in #30781
Quantized KV cache: update quanto by @zucchini-nlp in #31052
FEAT: Add mistral v3 conversion script by @younesbelkada in #30981
Use HF_HUB_OFFLINE + fix has_file in offline mode by @Wauplin in #31016
Improve transformers-cli env reporting by @statelesshz in #31003
Fix env.py in cases where torch is not present by @Rocketknight1 in #31113
Fix faulty rstrip in module loading by @Rocketknight1 in #31108
Rm maintainer + migrate by @muellerzr in #31089
Fix nightly circleci by @ydshieh in #31114
FIX / Docs: Fix GPTQ expected number of bits by @younesbelkada in #31111
Add VLM generation default contributor by @gante in #31115
Add on_optimizer_step to callback options by @dhruvbpai in #31095
Cleanup docker build by @ydshieh in #31119
FIX / Quantization: Add extra validation for bnb config by @younesbelkada in #31135
fix get_scheduler when name is warmup_stable_decay by @zspo in #31128
Docs / Quantization: Replace all occurrences of load_in_8bit with bnb config by @younesbelkada in #31136
Workflow: Remove IS_GITHUB_CI by @younesbelkada in #31147
helper by @ArthurZucker in #31152
pytest -rsfE by @ydshieh in #31140
Fix quantized cache output by @SunMarc in #31143
Update sam.md by @asifajrof in #31130
Quantization: Enhance bnb error message by @younesbelkada in #31160
[trainer] add sanity evaluation option by @SunMarc in #31146
Add streaming, various fixes by @aymeric-roucher in #30838
Added description of quantization_config by @vamsivallepu in #31133
Fix typo: use_safetenstors to use_safetensors by @CharlesCNorton in #31184
Remove copied froms for deprecated models by @amyeroberts in #31153
Token healing by @ahmed-moubtahij in #30081
[GemmaModel] fix small typo by @ArthurZucker in #31202
Fix Cannot convert [array()] to EagerTensor of dtype int64 by @pavi-ninjaac in #31109
Ignore non-causal mask in more cases with SDPA by @fxmarty in #30138
SlidingWindowCache: reduce differences to other Cache classes by @gante in #30970
Fix test_compile_static_cache by @ydshieh in #30991
fix the get_size_with_aspect_ratio in max_size situation by @SangbumChoi in #30902
Fix typo in utils by @Bojun-Feng in #31169
Rename sanity_evaluation to eval_on_start by @Qubitium in #31192
Wrong translation FR : Contents = Contenu by @jadechoghari in #31186
Cohere: Fix copied from by @younesbelkada in #31213
Set greater_is_better to False if metric_for_best_model ends with "loss" by @miivanov90 in #31142
Fix GPU OOM for mistral.py::Mask4DTestHard by @ydshieh in #31212
[docs] Spanish translation of tokenizer_summary.md by @aaronjimv in #31154
Pass device in Logits Processor's init by @zucchini-nlp in #29804
Fix sentence fragment within test comments by @DomHudson in #31218
fix(PatchTST): Wrong dropout used for PretainHead by @maxstrobel in #31117
Video-LLaVa: handle any number of frames by @zucchini-nlp in #31221
Add dynamic resolution input/interpolate position embedding to deit by @p-kris10 in #31131
fix bf16 issue in text classification pipeline by @chujiezheng in #30996
Fix pipeline tests - torch imports by @amyeroberts in #31227
Add new line switch before logging ***** Running {description} ***** by @jacklanda in #31225
add no split modules for xlmrobertaxl by @ManuelFay in #31223
Fix MistralIntegrationTest by @ydshieh in #31231
Blip: Deprecate BlipModel by @younesbelkada in #31235
Move out common backbone config param validation by @amyeroberts in #31144
Upload (daily) CI results to Hub by @ydshieh in #31168
Specify dtype=torch.bool to avoid xla error by @ysulsky in #31191
Fixing name 'torch' is not defined in bitsandbytes integration by @jamesbraza in #31243
Benchmark GitHub Actions workflow by @ydshieh in #31163
Early labels validation by @amyeroberts in #31240
doc: add info about wav2vec2 bert in older wav2vec2 models. by @Vaibhavs10 in #31120
enable deterministic mode for npu by @statelesshz in #31253
Add missing Flaubert tokenizer tests by @bastrob in #30492
Fix circular reference issue in CLIPTokenizerFast by @dhaivat1729 in #31075
Add condition to benchmark job in push-important-models.yml by @ydshieh in #31259
Skip failing JetMOE generation tests by @amyeroberts in #31266
no need for explicit EXTRA_TOKENS in processing_paligemma.py by @grahamannett in #31022
[SwitchTransformer] Significant performance improvement on MoE blocks by @ranggihwang in #31173
fix loading special_tokens_map_file by @ZhiyuanChen in #31012
Make mamba use cache by @zucchini-nlp in #31116
Generation: fix handling of special tokens by @zucchini-nlp in #31254
Switch from cached_download to hf_hub_download in remaining occurrences by @Wauplin in #31284
fix: str should be used not int when setting env variables by @statelesshz in #31272
Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. by @baoleai in #31264
fix accelerate tests for roberta xl by @SunMarc in #31288
Enable dynamic resolution input for Beit by @OmarManzoor in #31053
Mark MobileNetV1ModelTest::test_batching_equivalence as flaky by @amyeroberts in #31258
Pipeline VQA: Add support for list of images and questions as pipeline input by @BlacCod in #31217
Fix SwinLayer / DonutSwinLayer / ClapAudioLayer attention mask device by @gorodnitskiy in #31295
Update text-to-speech.md by @jaguaryang in #31269
Fixed Wav2Vec2ProcessorWithLM decoding error by @karicotiza in #31188
Fix jetmoe model by @Cyrilvallez in #31279
Extend save_pretrained to offloaded models by @blbadger in #27412
Implement JSON dump conversion for torch_dtype in TrainingArguments by @junrae6454 in #31224
interpolation added for TVP. by @bhuvanmdev in #30863
Rename test_model_common_attributes -> test_model_get_set_embeddings by @amyeroberts in #31321
Use unused prepare_img() function in dinov2 conversion script by @IbrahimAmin1 in #31335
docs: fix style by @imba-tjd in #31340
Fix paligemma inverted mask by @molbap in #31207
docs/zh: fix style by @imba-tjd in #31334
Decorators for deprecation and named arguments validation by @qubvel in #30799
Improve error msg when using bitsandbytes by @SunMarc in #31350
Fix Cohere CI by @ydshieh in #31263
Fix gradio tool demos by @aymeric-roucher in #31230
Fast image processor by @amyeroberts in #28847
Add french translation of AutoBackbone by @jadechoghari in #31300
Add support to declare imports for code agent by @JasonZhu1313 in #31355
Fix idefics cache by @zucchini-nlp in #31377
[Bug Fix] Renamed loss to losses to suppress UnboundLocalError by @her0e1c1 in #31365
docs: fix broken link by @imba-tjd in #31370
backbone_utils - fix relative import by @amyeroberts in #31382
README underline between badges fix by @novialriptide in #31376
Update comment in modeling_utils.py by @inf3rnus in #31299
Use huggingface_hub helper function to split state dict by @SunMarc in #31091
Change JSON serialization to custom json.dumps by @junrae6454 in #31100
feat(ci): add trufflehog secrets detection by @McPatate in #31344
[QoL fix] [Image processing] Add warning on assumption of channel dim and avoid inferring when inputs are PIL.Image by @aliencaocao in #31364
Make chat templates part of ProcessorMixin by @Rocketknight1 in #30744
add initial design for uniform processors + align model by @molbap in #31197
Add missing French translation of tutoriel_pipeline.md by @jadechoghari in #31396
Temporarily pin datasets upper version to fix CI by @albertvillanova in #31407
Support Clip QKV for MPT by @akakakakakaa in #31307
Pin datasets<2.20.0 for examples by @amyeroberts in #31417
Fix MusicGen SDPA by @ylacombe in #31208
Set seed for M4T retain grad test by @ylacombe in #31419
Fix SpeechT5 decoder_attention_mask shape by @ylacombe in #28071
Change potential inputs_embeds padding logger.warning to logger.warning_once by @naimenz in #31411
Remove duplicate image processor in auto map by @amyeroberts in #31383
Install the tensorflow example requirements in docker by @amyeroberts in #31428
Remove empty create_and_test_config_common_properties tests by @amyeroberts in #31359
xpu: support xpu backend from stock pytorch (>=2.4) by @dvrogozh in #31238
Musicgen special tokens in tensors by @zucchini-nlp in #31420
Fix Bark logits processors device misplacement by @ylacombe in #31416
Rename misnamed image processor test files by @amyeroberts in #31430
Generate: fix tokenizer being popped twice by @gante in #31427
[tests] make TestDeepSpeedModelZoo device-agnostic by @faaany in #31402
Support multiple validation datasets when dataloader_persistent_workers=True by @bastienlc in #30627
Pass datasets trust_remote_code by @albertvillanova in #31406
simple fix by @tokenizer-decode in #31456
Fix typing errors in Qwen2ForTokenClassification by @kevinhu in #31440
Agents: Improve python interpreter by @aymeric-roucher in #31409
Donut: fix generate call from local path by @gante in #31470
Make "tool_use" the default chat template key when tools are passed by @Rocketknight1 in #31429
Fix single letter stop strings by @Rocketknight1 in #31448
Update chat template docs and bump Jinja version by @Rocketknight1 in #31455
Improve PreTrainedTokenizerFast loading time when there are many added tokens by @ydshieh in #31404
Fix documentation typos by @qgallouedec in #31476
Give more useful metric_for_best_model errors by @tomaarsen in #31450
Update perf_train_gpu_many.md by @remyleone in #31451
[GPT2] Add SDPA support by @vasqu in #31172
Fix autocast incompatibility in RecurrentGemma by @xplip in #30832
Use self.config_tester.run_common_tests() by @amyeroberts in #31431
[tests] rename test_config_object to test_ds_config_object by @faaany in #31403
Docs / AQLM: Clarify torch.compile support for AQLM by @younesbelkada in #31473
Mamba: add generative tests by @gante in #31478
Update object_detection.md by @jajupmochi in #31488
Add docs on zeroshot image classification prompt templates by @aliencaocao in #31343
auto-detect device when no device is passed to pipeline by @faaany in #31398
Fix typo: pas_token_id by @ftnext in #30894
Fix wandb integration with SetFit model by @timothepearce in #30021
Consider inheritance in type checking for tensors by @daemyung in #31378
Add valid columns check in _remove_unused_columns method by @arthasking123 in #31466
Fix a teeny-tiny typo in tokenization_utils_base.py's docstring by @sadra-barikbin in #31510
Fix mismatched ` in doc & other common typos by @jhwei in #31516
RWKV: enable generation tests by @gante in #31490
unskip 2 tests in cohere by @ydshieh in #31517
Revive Nightly/Past CI by @ydshieh in #31159
Deprecate legacy cache + use cache position by @zucchini-nlp in #31491
SPLIT PR: add user defined symbols and control symbols by @itazap in #31305
Removed torch.cuda.empty_cache from train loop. by @FoamoftheSea in #31530
Update mask_generation.md by @nicholicaron in #31543
Correct @is_flaky test decoration by @qubvel in #31480
Add implementation of spectrogram_batch by @ravenouse in #27159
chore: fix typos by @xiaoxianBoy in #31559
Update git templates by @ArthurZucker in #31539
Fix the error caused by incorrect use of logger in pipeline by @lanyun1103 in #31565
Fix bug about add_special_tokens and so on by @hiroshi-matsuda-rit in #31496
Add Jinja as a requirement with the right version cutoff by @Rocketknight1 in #31536
Fix doc typo in TrainingArguments by @qgallouedec in #31503
Fix is_torch_xpu_available for torch < 2.3 by @amyeroberts in #31573
Added version constraint on numpy for version <2.0 by @Resteklicken in #31569
Siglip: add _no_split_module by @zucchini-nlp in #31566
fix output data type of image classification by @jiqing-feng in #31444
add preprocessing_num_workers to run_classification.py by @jiahuanluo in #31586
Improve error message for mismatched copies in code blocks by @molbap in #31535
Add ViTImageProcessorFast to tests by @amyeroberts in #31424
docs: move translations to i18n by @SauravMaheshkar in #31584
Removed unnecessary self.projection call in VivitTubeletEmbeddings by @v-iashin in #31632
[GPT-NeoX] Add SDPA support by @vasqu in #31031
Update RT-DETR code snippet by @qubvel in #31631
Llama et al. / FSDP : Fix breaking change in 4.40 for FSDP by @younesbelkada in #31161
Fix RT-DETR inference with float16 and bfloat16 by @qubvel in #31639
Fix paligemma detection inference by @molbap in #31587
Generate: fix assisted generation with past_key_values passed as kwargs by @gante in #31644
Fix dtype casting in swinv2 and swinv2sr to allow non-FP32 inference by @aliencaocao in #31589
Skip tests properly by @amyeroberts in #31308
Generation: past kv can be None by @zucchini-nlp in #31051
Fix ONNX exports for Optimum compatible models by @merveenoyan in #31311

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@josephenguehard
- Add TokenClassification for Mistral, Mixtral and Qwen2 (#29878)
@vasqu
- Fix a shape annotation and typos in mamba slow forward (#30691)
- [GPT2] Add SDPA support (#31172)
- [GPT-NeoX] Add SDPA support (#31031)
@ariG23498
- [Port] TensorFlow implementation of Mistral (#29708)
@bhuvanmdev
- added interpolation for vitmae model in pytorch as well as tf. (#30732)
- interpolation added for TVP. (#30863)
@SangbumChoi
- fix the get_size_with_aspect_ratio in max_size situation (#30902)
- New model support RTDETR (#29077)
@Cyrilvallez
- Reduce by 2 the memory requirement in generate() 🔥🔥🔥 (#30536)
- Fix jetmoe model (#31279)
@ravenouse
- Add implementation of spectrogram_batch (#27159)

- Python
Published by LysandreJik almost 2 years ago

transformers - Release v4.41.2

Release v4.41.2

Mostly fixing some stuff related to trust_remote_code=True and from_pretrained

The local_files_only option was having a hard time when a .safetensors file did not exist. This is not expected: instead of trying to convert, we should just fall back to loading the .bin files.

Do not trigger autoconversion if local_files_only #31004 from @Wauplin fixes this!
Paligemma: Fix devices and dtype assignments (#31008) by @molbap
Redirect transformers_agents doc to agents (#31054) @aymeric-roucher
Fix from_pretrained in offline mode when model is preloaded in cache (#31010) by @oOraph
Fix faulty rstrip in module loading (#31108) @Rocketknight1
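
The fallback described above can be pictured with a small sketch. This is not the library's loading code: `pick_local_weights` is a hypothetical helper, and the only assumption taken from transformers is the pair of default weight file names, `model.safetensors` and `pytorch_model.bin`.

```python
from pathlib import Path
import tempfile


def pick_local_weights(model_dir):
    """Hypothetical helper: choose a weights file without touching the network."""
    model_dir = Path(model_dir)
    # Prefer safetensors when present; otherwise fall back to the .bin checkpoint
    # rather than attempting any conversion.
    for name in ("model.safetensors", "pytorch_model.bin"):
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no weights file found in {model_dir}")


# Simulate a cached model directory that only contains a .bin checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pytorch_model.bin").write_bytes(b"")
    chosen = pick_local_weights(tmp).name

print(chosen)
```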

- Python
Published by ArthurZucker about 2 years ago

transformers - Release v4.41.1 Fix PaliGemma finetuning, and some small bugs

Release v4.41.1

Fix PaliGemma finetuning:

The causal mask and label creation were causing label leaks during training. Kudos to @probicheaux for finding and reporting!

https://github.com/huggingface/transformers/commit/a755745546779ae5c42510bc02a859bdac82b3b7 : PaliGemma - fix processor with no input text (https://github.com/huggingface/transformers/pull/30916) @hiyouga
https://github.com/huggingface/transformers/commit/a25f7d3c12975fe21eab437dda7363e9024de7c0 : Paligemma causal attention mask (https://github.com/huggingface/transformers/pull/30967) @molbap and @probicheaux

Other fixes:

https://github.com/huggingface/transformers/commit/bb48e921868ac750417956de941606f7e2fa02ca : tokenizer_class = "AutoTokenizer" Llava Family (https://github.com/huggingface/transformers/pull/30912)
https://github.com/huggingface/transformers/commit/1d568dfab262f76079eb4f3d05b606d51a0c9e4b : legacy to init the slow tokenizer when converting from slow was wrong (https://github.com/huggingface/transformers/pull/30972)
https://github.com/huggingface/transformers/commit/b1065aa08ac0da11fcb9e3827cd7eafabe4beebd : Generation: get special tokens from model config (https://github.com/huggingface/transformers/pull/30899) @zucchini-nlp

Reverted https://github.com/huggingface/transformers/commit/4ab7a28216211571fdddba414d4edd8426ab6489

- Python
Published by ArthurZucker about 2 years ago

transformers - v4.41.0: Phi3, JetMoE, PaliGemma, VideoLlava, Falcon2, FalconVLM & GGUF support

New models

Phi3

The Phi-3 model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

TL;DR: Phi-3 introduces new RoPE scaling methods, which seem to scale fairly well! Phi-3-mini, a 3B-class model, is available in two context-length variants: 4K and 128K tokens. It is the first model in its class to support a context window of up to 128K tokens, with little impact on quality.

Phi-3 by @gugarosa in https://github.com/huggingface/transformers/pull/30423

JetMoE

JetMoe-8B is an 8B Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to provide an efficient language model with LLaMA2-level performance on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts. Given the input tokens, it activates a subset of its experts to process them. This sparse activation scheme enables JetMoe to achieve much better training throughput than dense models of similar size. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.

Add JetMoE model by @yikangshen in https://github.com/huggingface/transformers/pull/30005

PaliGemma

PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.

More than 120 checkpoints have been released; see the collection here!

Add PaliGemma by @molbap in https://github.com/huggingface/transformers/pull/30814

VideoLlava

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

💡 Simple baseline, learning united visual representation by alignment before projection: With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
🔥 High performance, complementary learning with video and image: Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.

Add Video Llava by @zucchini-nlp in https://github.com/huggingface/transformers/pull/29733

Falcon 2 and FalconVLM:

Two new models from TII-UAE! They published a blog-post with more details! Falcon2 introduces parallel mlp, and falcon VLM uses the Llava framework

Support for Falcon2-11B by @Nilabhra in https://github.com/huggingface/transformers/pull/30771
Support arbitrary processor by @ArthurZucker in https://github.com/huggingface/transformers/pull/30875

GGUF `from_pretrained` support

You can now load most GGUF quants directly with transformers' from_pretrained, which converts them to a classic PyTorch model. The API is simple:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# gguf_file selects which quantized file in the repo to load and convert
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```

We plan closer integration with the llama.cpp / GGML ecosystem in the future; see https://github.com/huggingface/transformers/issues/27712 for more details.

Loading GGUF files support by @LysandreJik in https://github.com/huggingface/transformers/pull/30391

Quantization

New quant methods

In this release we support two new quantization methods, HQQ and EETQ, contributed by the community. Read more about how to quantize any transformers model using HQQ and EETQ in the dedicated documentation section.

Add HQQ quantization support by @mobicham in https://github.com/huggingface/transformers/pull/29637
[FEAT]: EETQ quantizer support by @dtlzhuangz in https://github.com/huggingface/transformers/pull/30262

`dequantize` API for bitsandbytes models

If you want to dequantize models that have been loaded with bitsandbytes (e.g. to merge adapter weights), this is now possible through the dequantize API.

FEAT / Bitsandbytes: Add dequantize API for bitsandbytes quantized models by @younesbelkada in https://github.com/huggingface/transformers/pull/30806

API-wise, you can achieve that with the following:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "facebook/opt-125m"

# Load the model in 4-bit with bitsandbytes
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Convert the quantized weights back to a regular floating-point model
model.dequantize()

text = tokenizer("Hello my name is", return_tensors="pt").to(0)

out = model.generate(**text)
print(tokenizer.decode(out[0]))
```

Generation updates

Add Watermarking LogitsProcessor and WatermarkDetector by @zucchini-nlp in https://github.com/huggingface/transformers/pull/29676
Cache: Static cache as a standalone object by @gante in https://github.com/huggingface/transformers/pull/30476
Generate: add min_p sampling by @gante in https://github.com/huggingface/transformers/pull/30639
Make Gemma work with torch.compile by @ydshieh in https://github.com/huggingface/transformers/pull/30775
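
For readers unfamiliar with min_p sampling, here is a minimal pure-Python sketch of the filtering rule: tokens whose probability falls below `min_p` times the probability of the most likely token are masked out before sampling. The helper name `min_p_filter` is hypothetical and this is an illustration of the technique, not the library's logits processor; in transformers the feature is exposed through the `min_p` generation argument.

```python
import math


def min_p_filter(logits, min_p=0.05):
    """Mask tokens whose probability is below min_p * (probability of the top token)."""
    # Softmax with max-subtraction for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The cutoff scales with the model's confidence in its best token.
    threshold = min_p * max(probs)

    # Filtered tokens get -inf so they receive zero probability after re-normalising.
    return [x if p >= threshold else float("-inf") for x, p in zip(logits, probs)]


filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
print(filtered)  # [2.0, 1.0, -inf]
```

Because the threshold is relative to the top token, the filter is strict when the model is confident and permissive when the distribution is flat.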

SDPA support

[BERT] Add support for sdpa by @hackyon in https://github.com/huggingface/transformers/pull/28802
Add sdpa and fa2 the Wav2vec2 family. by @kamilakesbi in https://github.com/huggingface/transformers/pull/30121
add sdpa to ViT [follow up of #29325] by @hyenal in https://github.com/huggingface/transformers/pull/30555

Improved Object Detection

Addition of fine-tuning script for object detection models

Fix YOLOS image processor resizing by @qubvel in https://github.com/huggingface/transformers/pull/30436
Add examples for detection models finetuning by @qubvel in https://github.com/huggingface/transformers/pull/30422
Add installation of examples requirements in CI by @qubvel in https://github.com/huggingface/transformers/pull/30708
Update object detection guide by @qubvel in https://github.com/huggingface/transformers/pull/30683

Interpolation of embeddings for vision models

Add interpolation of embeddings. This enables predictions from pretrained models on input images whose size differs from the size the model was originally trained on. Simply pass interpolate_pos_encoding=True when calling the model.

Added for: BLIP, BLIP 2, InstructBLIP, SigLIP, ViViT

```py
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

image = Image.open(requests.get("https://huggingface.co/hf-internal-testing/blip-test-image/resolve/main/demo.jpg", stream=True).raw)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")
# Resize to 500x500 and cast pixel values to the model's dtype
inputs = processor(images=image, size={"height": 500, "width": 500}, return_tensors="pt").to("cuda", torch.float16)

# generate() returns token IDs, which batch_decode turns back into text
predictions = model.generate(**inputs, interpolate_pos_encoding=True)

# Generated text: "a woman and dog on the beach"
generated_text = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()
```

Blip dynamic input resolution by @zafstojano in https://github.com/huggingface/transformers/pull/30722
Add dynamic resolution input/interpolate position embedding to SigLIP by @davidgxue in https://github.com/huggingface/transformers/pull/30719
Enable dynamic resolution for vivit by @jla524 in https://github.com/huggingface/transformers/pull/30630
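
To see why interpolation is needed at all: a ViT-style encoder learns one position embedding per image patch, so a different input size yields a different number of patches than the embedding table covers. The short sketch below just does that arithmetic. The 224x224 pretraining resolution and 14x14 patch size are assumed values for illustration; check the config of the checkpoint you use.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches a ViT-style encoder extracts from an image."""
    return (height // patch_size) * (width // patch_size)


# Assumed values for illustration: 224x224 pretraining resolution, 14x14 patches.
pretrained = num_patches(224, 224, 14)
resized = num_patches(500, 500, 14)
print(pretrained, resized)  # 256 1225
```

With these numbers the learned table holds 256 patch positions while a 500x500 input produces 1225, which is the gap that interpolating the position embeddings bridges.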

🚨 might be breaking

🚨🚨🚨Deprecate evaluation_strategy to eval_strategy🚨🚨🚨 by @muellerzr in https://github.com/huggingface/transformers/pull/30190
🚨 Add training compatibility for Musicgen-like models by @ylacombe in https://github.com/huggingface/transformers/pull/29802
🚨 Update image_processing_vitmatte.py by @rb-synth in https://github.com/huggingface/transformers/pull/30566
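
For code that still passes the old keyword, the rename from evaluation_strategy to eval_strategy typically behaves like the shim sketched below: the old name keeps working for a while but emits a warning. `resolve_eval_strategy` is a hypothetical function written for illustration and is not the library's implementation; the practical takeaway is simply to switch call sites to `eval_strategy`.

```python
import warnings


def resolve_eval_strategy(eval_strategy="no", evaluation_strategy=None):
    """Hypothetical shim: accept the old keyword but steer callers to the new one."""
    if evaluation_strategy is not None:
        warnings.warn(
            "`evaluation_strategy` is deprecated; use `eval_strategy` instead.",
            FutureWarning,
        )
        eval_strategy = evaluation_strategy
    return eval_strategy


# Old-style call: still resolves, but records a FutureWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = resolve_eval_strategy(evaluation_strategy="steps")

# New-style call: no warning.
new_style = resolve_eval_strategy(eval_strategy="steps")
print(old_style, new_style, len(caught))
```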

Cleanups

Remove task guides auto-update in favor of links towards task pages by @LysandreJik in https://github.com/huggingface/transformers/pull/30429
Remove add-new-model in favor of add-new-model-like by @LysandreJik in https://github.com/huggingface/transformers/pull/30424
Remove mentions of models in the READMEs and link to the documentation page in which they are featured. by @LysandreJik in https://github.com/huggingface/transformers/pull/30420

Not breaking but important for Llama tokenizers

[LlamaTokenizerFast] Refactor default llama by @ArthurZucker in https://github.com/huggingface/transformers/pull/28881

Fixes

Fix missing prev_ci_results by @ydshieh in https://github.com/huggingface/transformers/pull/30313
Fix: remove pad token id in pipeline forward arguments by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30285
fix Parameter dtype in audio models by @ylacombe in https://github.com/huggingface/transformers/pull/30310
disable use_cache if using gradient checkpointing by @chenzizhao in https://github.com/huggingface/transformers/pull/30320
Fix test transposing image with EXIF Orientation tag by @albertvillanova in https://github.com/huggingface/transformers/pull/30319
Avoid jnp import in utils/generic.py by @ydshieh in https://github.com/huggingface/transformers/pull/30322
Fix AssertionError in clip conversion script by @ydshieh in https://github.com/huggingface/transformers/pull/30321
[UDOP] Add special tokens to tokenizer by @NielsRogge in https://github.com/huggingface/transformers/pull/29594
Enable multi-device for some models by @jla524 in https://github.com/huggingface/transformers/pull/30207
feat: Upgrade Weights & Biases callback by @parambharat in https://github.com/huggingface/transformers/pull/30135
[Feature Extractors] Fix kwargs to pre-trained by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30260
Pipeline: fix pad_token_id again by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30338
[Whisper] Fix slow tests by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30152
parallel job limit for doctest by @ydshieh in https://github.com/huggingface/transformers/pull/30342
Transformers Metadata by @LysandreJik in https://github.com/huggingface/transformers/pull/30344
Deprecate default chat templates by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30346
Restore casting of masked_spec_embed by @ylacombe in https://github.com/huggingface/transformers/pull/30336
Update unwrap from accelerate by @SunMarc in https://github.com/huggingface/transformers/pull/29933
Do not remove half seq length in generation tests by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30016
Fix config + attn_implementation in AutoModelForCausalLM.from_pretrained by @hiyouga in https://github.com/huggingface/transformers/pull/30299
Add TF swiftformer by @joaocmd in https://github.com/huggingface/transformers/pull/23342
[Grounding DINO] Add resources by @NielsRogge in https://github.com/huggingface/transformers/pull/30232
Nits for model docs by @merveenoyan in https://github.com/huggingface/transformers/pull/29795
Enable multi-device for more models by @jla524 in https://github.com/huggingface/transformers/pull/30379
GenerationConfig: warn if pad token is negative by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30187
Add FSDP config for CPU RAM efficient loading through accelerate by @helloworld1 in https://github.com/huggingface/transformers/pull/30002
Llama family, fix use_cache=False generation by @ArthurZucker in https://github.com/huggingface/transformers/pull/30380
Update docstrings for text generation pipeline by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30343
Terminator strings for generate() by @Rocketknight1 in https://github.com/huggingface/transformers/pull/28932
Fix layerwise GaLore optimizer hard to converge with warmup scheduler by @hiyouga in https://github.com/huggingface/transformers/pull/30372
Jamba: fix left-padding test by @gante in https://github.com/huggingface/transformers/pull/30389
Fix DETA save_pretrained by @qubvel in https://github.com/huggingface/transformers/pull/30326
FIX / PEFT: Pass device correctly to peft by @younesbelkada in https://github.com/huggingface/transformers/pull/30397
[docs] LLM inference by @stevhliu in https://github.com/huggingface/transformers/pull/29791
show -rs to show skip reasons by @ArthurZucker in https://github.com/huggingface/transformers/pull/30318
Add inputs embeds in generation by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30269
[Grounding DINO] Add support for cross-attention in GroundingDinoMultiHeadAttention by @EduardoPach in https://github.com/huggingface/transformers/pull/30364
remove redundant logging from longformer by @riklopfer in https://github.com/huggingface/transformers/pull/30365
fix: link to HF repo/tree/revision when a file is missing by @mapmeld in https://github.com/huggingface/transformers/pull/30406
[tests] add require_torch_sdpa for test that needs sdpa support by @faaany in https://github.com/huggingface/transformers/pull/30408
Jax: scipy version pin by @gante in https://github.com/huggingface/transformers/pull/30402
Fix on "cache position" for assisted generation by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30068
fix for itemsize => element_size() for torch backwards compat by @winglian in https://github.com/huggingface/transformers/pull/30133
Make EosTokenCriteria compatible with mps by @pcuenca in https://github.com/huggingface/transformers/pull/30376
FIX: re-add bnb on docker image by @younesbelkada in https://github.com/huggingface/transformers/pull/30427
Fix LayoutLMv2 init issue and doctest by @ydshieh in https://github.com/huggingface/transformers/pull/30278
Remove old TF port docs by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30426
Rename torch.run to torchrun by @steven-basart in https://github.com/huggingface/transformers/pull/30405
Fix use_cache for xla fsdp by @alanwaketan in https://github.com/huggingface/transformers/pull/30353
[LlamaTokenizerFast] Refactor default llama by @ArthurZucker in https://github.com/huggingface/transformers/pull/28881
New model PR needs green (slow tests) CI by @ydshieh in https://github.com/huggingface/transformers/pull/30341
Add llama3 by @ArthurZucker in https://github.com/huggingface/transformers/pull/30334
[Llava] + CIs fix red cis and llava integration tests by @ArthurZucker in https://github.com/huggingface/transformers/pull/30440
[tests] make test device-agnostic by @faaany in https://github.com/huggingface/transformers/pull/30444
fix uncaught init of linear layer in clip's/siglip's for image classification models by @vasqu in https://github.com/huggingface/transformers/pull/30435
fix jamba slow forward for multi-gpu by @SunMarc in https://github.com/huggingface/transformers/pull/30418
[SegGPT] Fix loss calculation by @EduardoPach in https://github.com/huggingface/transformers/pull/30421
Add paths filter to avoid the chance of being triggered by @ydshieh in https://github.com/huggingface/transformers/pull/30453
Fix wrong indent in utils/check_if_new_model_added.py by @ydshieh in https://github.com/huggingface/transformers/pull/30456
[research_project] Most of the security issues come from this requirement.txt by @ArthurZucker in https://github.com/huggingface/transformers/pull/29977
Neuron: When save_safetensor=False, no need to move model to CPU by @jeffhataws in https://github.com/huggingface/transformers/pull/29703
Enable fp16 on CPU by @muellerzr in https://github.com/huggingface/transformers/pull/30459
Non blocking support to torch DL's by @muellerzr in https://github.com/huggingface/transformers/pull/30465
consistent job / pytest report / artifact name correspondence by @ydshieh in https://github.com/huggingface/transformers/pull/30392
Workflow / ENH: Add SSH into our runners workflow by @younesbelkada in https://github.com/huggingface/transformers/pull/30425
FIX / Workflow: Change tailscale trigger condition by @younesbelkada in https://github.com/huggingface/transformers/pull/30471
FIX / Workflow: Fix SSH workflow bug by @younesbelkada in https://github.com/huggingface/transformers/pull/30474
[fix codellama conversion] by @ArthurZucker in https://github.com/huggingface/transformers/pull/30472
Script for finding candidate models for deprecation by @amyeroberts in https://github.com/huggingface/transformers/pull/29686
Fix SigLip classification doctest by @amyeroberts in https://github.com/huggingface/transformers/pull/30475
Don't run fp16 MusicGen tests on CPU by @amyeroberts in https://github.com/huggingface/transformers/pull/30466
Prevent crash with WandbCallback with third parties by @tomaarsen in https://github.com/huggingface/transformers/pull/30477
Add WSD scheduler by @visheratin in https://github.com/huggingface/transformers/pull/30231
Fix Issue #29817 Video Classification Task Guide Using Undeclared Variables by @manju-rangam in https://github.com/huggingface/transformers/pull/30457
Make accelerate install non-torch dependent by @muellerzr in https://github.com/huggingface/transformers/pull/30463
Introduce Stateful Callbacks by @muellerzr in https://github.com/huggingface/transformers/pull/29666
Fix Llava for 0-embeddings by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30473
Do not use deprecated SourceFileLoader.load_module() in dynamic module loading by @XuehaiPan in https://github.com/huggingface/transformers/pull/30370
Add sidebar tutorial for chat models by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30401
Quantization: HfQuantizer quant method update by @younesbelkada in https://github.com/huggingface/transformers/pull/30484
[docs] Spanish translation of pipeline_tutorial.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30252
FEAT: PEFT support for EETQ by @younesbelkada in https://github.com/huggingface/transformers/pull/30449
Fix the bitsandbytes error formatting ("Some modules are dispatched on ...") by @kyo-takano in https://github.com/huggingface/transformers/pull/30494
Update dtype_byte_size to handle torch.float8e4m3fn/float8e5m2 types by @mgoin in https://github.com/huggingface/transformers/pull/30488
Use the Keras set_random_seed in tests by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30504
Remove skipping logic now that set_epoch exists by @muellerzr in https://github.com/huggingface/transformers/pull/30501
[DETR] Remove timm hardcoded logic in modeling files by @amyeroberts in https://github.com/huggingface/transformers/pull/29038
[examples] update whisper fine-tuning by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/29938
Fix GroundingDINO, DPR after BERT SDPA update by @amyeroberts in https://github.com/huggingface/transformers/pull/30506
load_image - decode b64encode and encodebytes strings by @amyeroberts in https://github.com/huggingface/transformers/pull/30192
[SegGPT] Fix seggpt image processor by @EduardoPach in https://github.com/huggingface/transformers/pull/29550
Fix link in dbrx.md by @eitanturok in https://github.com/huggingface/transformers/pull/30509
Allow boolean FSDP options in fsdp_config by @helloworld1 in https://github.com/huggingface/transformers/pull/30439
Pass attn_implementation when using AutoXXX.from_config by @amyeroberts in https://github.com/huggingface/transformers/pull/30507
Fix broken link to Transformers notebooks by @clinty in https://github.com/huggingface/transformers/pull/30512
Update runner tag for PR slow CI by @ydshieh in https://github.com/huggingface/transformers/pull/30535
Fix repo. fetch/checkout in PR slow CI job by @ydshieh in https://github.com/huggingface/transformers/pull/30537
Reenable SDPA's FA2 During Training with torch.compile by @warner-benjamin in https://github.com/huggingface/transformers/pull/30442
Include safetensors as part of _load_best_model by @muellerzr in https://github.com/huggingface/transformers/pull/30553
Pass use_cache in kwargs for GPTNeoX by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30538
Enable multi-device for more models by @jla524 in https://github.com/huggingface/transformers/pull/30409
Generate: update links on LLM tutorial doc by @gante in https://github.com/huggingface/transformers/pull/30550
DBRX: make fixup by @gante in https://github.com/huggingface/transformers/pull/30578
Fix seq2seq collator padding by @vasqu in https://github.com/huggingface/transformers/pull/30556
BlipModel: get_multimodal_features method by @XavierSpycy in https://github.com/huggingface/transformers/pull/30438
Add chat templating support for KeyDataset in text-generation pipeline by @DarshanDeshpande in https://github.com/huggingface/transformers/pull/30558
Fix generation doctests by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30263
General PR slow CI by @ydshieh in https://github.com/huggingface/transformers/pull/30540
Remove use_square_size after loading by @ydshieh in https://github.com/huggingface/transformers/pull/30567
Use text config's vocab size in testing models by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30568
Encoder-decoder models: move embedding scale to nn.Module by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30410
Fix Marian model conversion by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30173
Refactor default chat template warnings by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30551
Fix QA example by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30580
remove jax example by @ArthurZucker in https://github.com/huggingface/transformers/pull/30498
Fix canonical model --model_type in examples by @amyeroberts in https://github.com/huggingface/transformers/pull/30480
Gemma: update activation warning by @pcuenca in https://github.com/huggingface/transformers/pull/29995
Bump gitpython from 3.1.32 to 3.1.41 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30587
Fix image segmentation example - don't reopen image by @amyeroberts in https://github.com/huggingface/transformers/pull/30481
Improve object detection task guideline by @NielsRogge in https://github.com/huggingface/transformers/pull/29967
Generate: remove deprecated public decoding functions and streamline logic 🧼 by @gante in https://github.com/huggingface/transformers/pull/29956
Fix llava half precision and autocast issues by @frasermince in https://github.com/huggingface/transformers/pull/29721
Fix: failing CI after #30568 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30599
Fix for Neuron by @michaelbenayoun in https://github.com/huggingface/transformers/pull/30259
Fix memory leak with CTC training script on Chinese languages by @lucky-bai in https://github.com/huggingface/transformers/pull/30358
Fix copies for DBRX - neuron fix by @amyeroberts in https://github.com/huggingface/transformers/pull/30610
fix:missing output_router_logits in SwitchTransformers by @lausannel in https://github.com/huggingface/transformers/pull/30573
Use contiguous() in clip checkpoint conversion script by @ydshieh in https://github.com/huggingface/transformers/pull/30613
phi3 chat_template does not support system role by @amitportnoy in https://github.com/huggingface/transformers/pull/30606
Docs: fix generate-related rendering issues by @gante in https://github.com/huggingface/transformers/pull/30600
Docs: add missing StoppingCriteria autodocs by @gante in https://github.com/huggingface/transformers/pull/30617
Generate: fix SinkCache on Llama models by @gante in https://github.com/huggingface/transformers/pull/30581
Fix FX tracing issues for Llama by @michaelbenayoun in https://github.com/huggingface/transformers/pull/30619
Output None as attention when layer is skipped by @jonghwanhyeon in https://github.com/huggingface/transformers/pull/30597
Fix CI after #30410 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30612
add mlp bias for llama models by @mayank31398 in https://github.com/huggingface/transformers/pull/30031
Fix W&B run name by @qubvel in https://github.com/huggingface/transformers/pull/30462
HQQ: PEFT support for HQQ by @younesbelkada in https://github.com/huggingface/transformers/pull/30632
Prevent TextGenerationPipeline._sanitize_parameters from overriding previously provided parameters by @yting27 in https://github.com/huggingface/transformers/pull/30362
Avoid duplication in PR slow CI model list by @ydshieh in https://github.com/huggingface/transformers/pull/30634
[CI update] Try to use dockers and no cache by @ArthurZucker in https://github.com/huggingface/transformers/pull/29202
Check if the current compiled version of pytorch supports MPS by @jiaqianjing in https://github.com/huggingface/transformers/pull/30664
Hotfix-change-ci by @ArthurZucker in https://github.com/huggingface/transformers/pull/30669
Quantization / HQQ: Fix HQQ tests on our runner by @younesbelkada in https://github.com/huggingface/transformers/pull/30668
Fix llava next tie_word_embeddings config by @SunMarc in https://github.com/huggingface/transformers/pull/30640
Trainer._load_from_checkpoint - support loading multiple Peft adapters by @claralp in https://github.com/huggingface/transformers/pull/30505
Trainer - add cache clearing and the option for batched eval metrics computation by @FoamoftheSea in https://github.com/huggingface/transformers/pull/28769
Fix typo: llama3.md by @mimbres in https://github.com/huggingface/transformers/pull/30653
Respect resume_download deprecation by @Wauplin in https://github.com/huggingface/transformers/pull/30620
top-k instead of top-p in MixtralConfig docstring by @sorgfresser in https://github.com/huggingface/transformers/pull/30687
Bump jinja2 from 3.1.3 to 3.1.4 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30680
Bump werkzeug from 3.0.1 to 3.0.3 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30679
Adding _tie_weights() to prediction heads to support low_cpu_mem_usage=True by @hackyon in https://github.com/huggingface/transformers/pull/29024
Fix cache_position initialisation for generation with use_cache=False by @nurlanov-zh in https://github.com/huggingface/transformers/pull/30485
Word-level timestamps broken for short-form audio by @kamilakesbi in https://github.com/huggingface/transformers/pull/30325
Updated docs of forward in Idefics2ForConditionalGeneration with correct ignore_index value by @zafstojano in https://github.com/huggingface/transformers/pull/30678
Bump tqdm from 4.63.0 to 4.66.3 in /examples/research_projects/decision_transformer by @dependabot in https://github.com/huggingface/transformers/pull/30646
Bump tqdm from 4.48.2 to 4.66.3 in /examples/research_projects/visual_bert by @dependabot in https://github.com/huggingface/transformers/pull/30645
Reboot Agents by @aymeric-roucher in https://github.com/huggingface/transformers/pull/30387
Bump tqdm from 4.48.2 to 4.66.3 in /examples/research_projects/lxmert by @dependabot in https://github.com/huggingface/transformers/pull/30644
Separate tokenizer tests by @ArthurZucker in https://github.com/huggingface/transformers/pull/30675
Update workflow_id in utils/get_previous_daily_ci.py by @ydshieh in https://github.com/huggingface/transformers/pull/30695
Rename artifact name prev_ci_results to ci_results by @ydshieh in https://github.com/huggingface/transformers/pull/30697
Add safetensors to model not found error msg for default use_safetensors value by @davidgxue in https://github.com/huggingface/transformers/pull/30602
Pin deepspeed by @muellerzr in https://github.com/huggingface/transformers/pull/30701
Patch CLIP image preprocessor by @rootonchair in https://github.com/huggingface/transformers/pull/30698
[BitsandBytes] Verify if GPU is available by @NielsRogge in https://github.com/huggingface/transformers/pull/30533
Llava: remove dummy labels by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30706
Immutability for data collators by @vasqu in https://github.com/huggingface/transformers/pull/30603
Cache: models return input cache type by @gante in https://github.com/huggingface/transformers/pull/30716
Removal of deprecated maps by @LysandreJik in https://github.com/huggingface/transformers/pull/30576
Fix image post-processing for OWLv2 by @jla524 in https://github.com/huggingface/transformers/pull/30686
KV cache is no longer a model attribute by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30730
Generate: consistently handle special tokens as tensors by @gante in https://github.com/huggingface/transformers/pull/30624
Update CodeLlama references by @osanseviero in https://github.com/huggingface/transformers/pull/30218
[docs] Update es/pipeline_tutorial.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30684
Update llama3.md, fix typo by @mimbres in https://github.com/huggingface/transformers/pull/30739
mlp_only_layers is more flexible than decoder_sparse_step by @eigen2017 in https://github.com/huggingface/transformers/pull/30552
PEFT / Trainer: Make use of model.active_adapters() instead of deprecated model.active_adapter whenever possible by @younesbelkada in https://github.com/huggingface/transformers/pull/30738
[docs] Update link in es/pipeline_webserver.md by @aaronjimv in https://github.com/huggingface/transformers/pull/30745
hqq - fix weight check in check_quantized_param by @mobicham in https://github.com/huggingface/transformers/pull/30748
[awq] replace scale when we have GELU by @SunMarc in https://github.com/huggingface/transformers/pull/30074
Workflow: Replace actions/post-slack with centrally defined workflow by @younesbelkada in https://github.com/huggingface/transformers/pull/30737
[GroundingDino] Adding ms_deform_attn kernels by @EduardoPach in https://github.com/huggingface/transformers/pull/30768
Llama: fix custom 4D masks, v2 by @poedator in https://github.com/huggingface/transformers/pull/30348
Generation / FIX: Fix multi-device generation by @younesbelkada in https://github.com/huggingface/transformers/pull/30746
Qwen: incorrect setup flag by @gante in https://github.com/huggingface/transformers/pull/30776
enable Pipeline to get device from model by @faaany in https://github.com/huggingface/transformers/pull/30534
[Object detection pipeline] Lower threshold by @NielsRogge in https://github.com/huggingface/transformers/pull/30710
Generate: remove near-duplicate sample/greedy copy by @gante in https://github.com/huggingface/transformers/pull/30773
Port IDEFICS to tensorflow by @a8nova in https://github.com/huggingface/transformers/pull/26870
Generate: assistant should be greedy in assisted decoding by @gante in https://github.com/huggingface/transformers/pull/30778
Save other CI jobs' result (torch/tf pipeline, example, deepspeed etc) by @ydshieh in https://github.com/huggingface/transformers/pull/30699
Deprecate models script by @amyeroberts in https://github.com/huggingface/transformers/pull/30184
skip low_cpu_mem_usage tests by @SunMarc in https://github.com/huggingface/transformers/pull/30782
CI: update to ROCm 6.0.2 and test MI300 by @fxmarty in https://github.com/huggingface/transformers/pull/30266
Fix OWLv2 Doc by @jla524 in https://github.com/huggingface/transformers/pull/30794
Fix cache type in Idefics2 by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30729
PEFT: Access active_adapters as a property in Trainer by @pashminacameron in https://github.com/huggingface/transformers/pull/30790
CI: more models wo cache support by @gante in https://github.com/huggingface/transformers/pull/30780
Deprecate TF weight conversion since we have full Safetensors support now by @Rocketknight1 in https://github.com/huggingface/transformers/pull/30786
[T5] Adding model_parallel = False to T5ForTokenClassification and MT5ForTokenClassification by @retarfi in https://github.com/huggingface/transformers/pull/30763
Added the necessay import of module by @ankur0904 in https://github.com/huggingface/transformers/pull/30804
Add support for custom checkpoints in MusicGen by @jla524 in https://github.com/huggingface/transformers/pull/30011
Add missing dependencies in image classification example by @jla524 in https://github.com/huggingface/transformers/pull/30820
Support mixed-language batches in WhisperGenerationMixin by @cifkao in https://github.com/huggingface/transformers/pull/29688
Remove unused module DETR based models by @conditionedstimulus in https://github.com/huggingface/transformers/pull/30823
Jamba - Skip 4d custom attention mask test by @amyeroberts in https://github.com/huggingface/transformers/pull/30826
Missing Optional in typing. by @xkszltl in https://github.com/huggingface/transformers/pull/30821
Update ds_config_zero3.json by @pacman100 in https://github.com/huggingface/transformers/pull/30829
Better llava next. by @nxphi47 in https://github.com/huggingface/transformers/pull/29850
Deprecate models script - correctly set the model name for the doc file by @amyeroberts in https://github.com/huggingface/transformers/pull/30785
Use torch 2.3 for CI by @ydshieh in https://github.com/huggingface/transformers/pull/30837
Fix llama model sdpa attention forward function masking bug when output_attentions=True by @Aladoro in https://github.com/huggingface/transformers/pull/30652
[LLaVa-NeXT] Small fixes by @NielsRogge in https://github.com/huggingface/transformers/pull/30841
[Idefics2] Improve docs, add resources by @NielsRogge in https://github.com/huggingface/transformers/pull/30717
Cache: add new flag to distinguish models that Cache but not static cache by @gante in https://github.com/huggingface/transformers/pull/30800
Disable the FA backend for SDPA on AMD GPUs by @mht-sharma in https://github.com/huggingface/transformers/pull/30850
Video-LLaVa: Fix docs by @zucchini-nlp in https://github.com/huggingface/transformers/pull/30855
Docs: update example with assisted generation + sample by @gante in https://github.com/huggingface/transformers/pull/30853
TST / Quantization: Reverting to torch==2.2.1 by @younesbelkada in https://github.com/huggingface/transformers/pull/30866
Fix VideoLlava imports by @amyeroberts in https://github.com/huggingface/transformers/pull/30867
TEST: Add llama logits tests by @younesbelkada in https://github.com/huggingface/transformers/pull/30835
Remove deprecated logic and warnings by @amyeroberts in https://github.com/huggingface/transformers/pull/30743
Enable device map by @darshana1406 in https://github.com/huggingface/transformers/pull/30870
Fix dependencies for image classification example by @jla524 in https://github.com/huggingface/transformers/pull/30842
[whisper] fix multilingual fine-tuning by @sanchit-gandhi in https://github.com/huggingface/transformers/pull/30865
update release script by @ArthurZucker in https://github.com/huggingface/transformers/pull/30880

New Contributors

@joaocmd made their first contribution in https://github.com/huggingface/transformers/pull/23342
@kamilakesbi made their first contribution in https://github.com/huggingface/transformers/pull/30121
@dtlzhuangz made their first contribution in https://github.com/huggingface/transformers/pull/30262
@steven-basart made their first contribution in https://github.com/huggingface/transformers/pull/30405
@manju-rangam made their first contribution in https://github.com/huggingface/transformers/pull/30457
@kyo-takano made their first contribution in https://github.com/huggingface/transformers/pull/30494
@mgoin made their first contribution in https://github.com/huggingface/transformers/pull/30488
@eitanturok made their first contribution in https://github.com/huggingface/transformers/pull/30509
@clinty made their first contribution in https://github.com/huggingface/transformers/pull/30512
@warner-benjamin made their first contribution in https://github.com/huggingface/transformers/pull/30442
@XavierSpycy made their first contribution in https://github.com/huggingface/transformers/pull/30438
@DarshanDeshpande made their first contribution in https://github.com/huggingface/transformers/pull/30558
@frasermince made their first contribution in https://github.com/huggingface/transformers/pull/29721
@lucky-bai made their first contribution in https://github.com/huggingface/transformers/pull/30358
@rb-synth made their first contribution in https://github.com/huggingface/transformers/pull/30566
@lausannel made their first contribution in https://github.com/huggingface/transformers/pull/30573
@jonghwanhyeon made their first contribution in https://github.com/huggingface/transformers/pull/30597
@mobicham made their first contribution in https://github.com/huggingface/transformers/pull/29637
@yting27 made their first contribution in https://github.com/huggingface/transformers/pull/30362
@jiaqianjing made their first contribution in https://github.com/huggingface/transformers/pull/30664
@claralp made their first contribution in https://github.com/huggingface/transformers/pull/30505
@mimbres made their first contribution in https://github.com/huggingface/transformers/pull/30653
@sorgfresser made their first contribution in https://github.com/huggingface/transformers/pull/30687
@nurlanov-zh made their first contribution in https://github.com/huggingface/transformers/pull/30485
@zafstojano made their first contribution in https://github.com/huggingface/transformers/pull/30678
@davidgxue made their first contribution in https://github.com/huggingface/transformers/pull/30602
@rootonchair made their first contribution in https://github.com/huggingface/transformers/pull/30698
@eigen2017 made their first contribution in https://github.com/huggingface/transformers/pull/30552
@Nilabhra made their first contribution in https://github.com/huggingface/transformers/pull/30771
@a8nova made their first contribution in https://github.com/huggingface/transformers/pull/26870
@pashminacameron made their first contribution in https://github.com/huggingface/transformers/pull/30790
@retarfi made their first contribution in https://github.com/huggingface/transformers/pull/30763
@yikangshen made their first contribution in https://github.com/huggingface/transformers/pull/30005
@ankur0904 made their first contribution in https://github.com/huggingface/transformers/pull/30804
@conditionedstimulus made their first contribution in https://github.com/huggingface/transformers/pull/30823
@nxphi47 made their first contribution in https://github.com/huggingface/transformers/pull/29850
@Aladoro made their first contribution in https://github.com/huggingface/transformers/pull/30652
@hyenal made their first contribution in https://github.com/huggingface/transformers/pull/30555
@darshana1406 made their first contribution in https://github.com/huggingface/transformers/pull/30870

Full Changelog: https://github.com/huggingface/transformers/compare/v4.40.2...v4.41.0

- Python
Published by ArthurZucker about 2 years ago

transformers - v4.40.2

Fix torch fx for LLama model

Fix for Neuron (#30259)
Fix copies for DBRX - neuron fix (#30610)

Thanks @michaelbenayoun !

- Python
Published by ArthurZucker about 2 years ago

transformers - v4.40.1: fix `EosTokenCriteria` for `Llama3` on `mps`

Kudos to @pcuenca for the prompt fix in:

Make EosTokenCriteria compatible with mps #30376

This release supports EosTokenCriteria on MPS until PyTorch adds the missing functionality.

- Python
Published by ArthurZucker about 2 years ago

transformers - v4.40.0: Llama 3, Idefics 2, Recurrent Gemma, Jamba, DBRX, OLMo, Qwen2MoE, Grounding Dino

New model additions

Llama 3

Llama 3 is supported in this release through the Llama 2 architecture and some fixes in the tokenizers library.

Idefics2

The Idefics2 model was created by the Hugging Face M4 team and authored by Léo Tronchon, Hugo Laurencon, Victor Sanh. The accompanying blog post can be found here.

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.

Add Idefics2 by @amyeroberts in #30253

Recurrent Gemma

Figure: Recurrent Gemma architecture, taken from the original paper.

The Recurrent Gemma model was proposed in RecurrentGemma: Moving Past Transformers for Efficient Open Language Models by the Griffin, RLHF and Gemma Teams of Google.

The abstract from the paper is the following:

We introduce RecurrentGemma, an open language model which uses Google’s novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
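
The fixed-size state mentioned in the abstract can be illustrated with a toy sketch. This is not the gated recurrence from the paper or the transformers implementation: it assumes a scalar state with a constant decay and a plain sliding window standing in for local attention. The function name `stream_tokens` and the parameters `decay` and `window` are hypothetical.

```python
from collections import deque


def stream_tokens(values, decay=0.9, window=4):
    """Process a stream while keeping only a bounded amount of state.

    Toy illustration only: `decay` and `window` are made-up parameters,
    and the recurrence is a simple exponential moving average.
    """
    recurrent_state = 0.0                 # one number, regardless of sequence length
    local_window = deque(maxlen=window)   # only the last `window` inputs are kept
    kept_sizes = []
    for value in values:
        # Linear recurrence: new state is a linear function of old state and input.
        recurrent_state = decay * recurrent_state + (1.0 - decay) * value
        local_window.append(value)
        # Track how many items we hold: the state plus the window contents.
        kept_sizes.append(1 + len(local_window))
    return recurrent_state, list(local_window), kept_sizes


final_state, window_contents, kept_sizes = stream_tokens(range(1000))
print(max(kept_sizes))  # never exceeds 1 + window, however long the stream is
```

The amount of state held is bounded by `1 + window` no matter how long the input is, in contrast to a full-attention key/value cache, which grows with every token.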

Add recurrent gemma by @ArthurZucker in #30143

Jamba

Jamba is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.

Jamba’s architecture features a blocks-and-layers approach that allows it to integrate the Transformer and Mamba architectures. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
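
The one-in-eight ratio can be sketched as a simple layer layout. This is a minimal illustration, not Jamba's actual configuration code: the function `jamba_like_layout`, the 32-layer depth, and the position of the attention layer within each group of eight are all assumptions made for the example.

```python
def jamba_like_layout(num_layers, attn_period=8, attn_offset=4):
    """Return layer kinds with one attention layer per `attn_period` layers.

    Hypothetical helper: `attn_period` and `attn_offset` are illustrative
    parameters, not values read from a real checkpoint.
    """
    if not 0 <= attn_offset < attn_period:
        raise ValueError("attn_offset must lie in [0, attn_period)")
    return [
        "attention" if index % attn_period == attn_offset else "mamba"
        for index in range(num_layers)
    ]


layout = jamba_like_layout(32)  # 32 layers is an assumed depth for the example
attention_count = layout.count("attention")
print(attention_count, len(layout))
```

With these assumed values, 4 of the 32 layers are attention layers and the remaining 28 are Mamba layers, which matches the stated ratio of one in eight.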

Jamba introduces the first HybridCache object that allows it to natively support assisted generation, contrastive search, speculative decoding, beam search and all of the awesome features from the generate API!

Add jamba by @tomeras91 in #29943

DBRX

DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input.

It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.

This provides 65x more possible combinations of experts and the authors found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).
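
The 65x figure can be checked by counting unordered subsets of experts. The sketch below assumes each token's routing is treated as an unordered choice of experts, using the expert counts quoted above.

```python
from math import comb

# Expert counts quoted in the text above.
dbrx_combinations = comb(16, 4)    # DBRX: 16 experts, 4 chosen
mixtral_combinations = comb(8, 2)  # Mixtral-8x7B and Grok-1: 8 experts, 2 chosen

ratio = dbrx_combinations // mixtral_combinations
print(dbrx_combinations, mixtral_combinations, ratio)  # 1820 28 65
```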

Add DBRX Model by @abhi-mosaic in #29921

OLMo

The OLMo model was proposed in OLMo: Accelerating the Science of Language Models by Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi.

OLMo is a series of Open Language Models designed to enable the science of language models. The OLMo models are trained on the Dolma dataset. We release all code, checkpoints, logs (coming soon), and details involved in training these models.

Add OLMo model family by @2015aroras in #29890

Qwen2MoE

Qwen2MoE is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.

Model Details

Qwen2MoE is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. Qwen2MoE has the following architectural choices:

Qwen2MoE is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes.
Qwen2MoE employs Mixture of Experts (MoE) architecture, where the models are upcycled from dense language models. For instance, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters in total and 2.7B activated parameters during runtime, while it achieves comparable performance with Qwen1.5-7B, with only 25% of the training resources.

Add Qwen2MoE by @bozheng-hit in #29377

Grounding Dino

The Grounding DINO model was proposed in Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang. Grounding DINO extends a closed-set object detection model with a text encoder, enabling open-set object detection. The model achieves remarkable results, such as 52.5 AP on COCO zero-shot.

Adding grounding dino by @EduardoPach in #26087

Static pretrained maps

Static pretrained maps have been removed from the library's internals and are currently deprecated. These used to reflect all the available checkpoints for a given architecture on the Hugging Face Hub, but their presence no longer makes sense given the huge growth in checkpoints shared by the community.

With the objective of lowering the bar for model contributions and reviews, we start by removing legacy objects such as this one, which no longer serve a purpose.

Remove static pretrained maps from the library's internals by @LysandreJik in #29112

Notable improvements

Processors improvements

Processors are undergoing changes to make them more uniform and clearer to use.

Separate out kwargs in processor by @amyeroberts in #30193
[Processor classes] Update docs by @NielsRogge in #29698

SDPA

re-introduced the fast path for sdpa by @fxmarty in #30070

Push to Hub for pipelines

Pipelines can now be pushed to Hub using a convenient push_to_hub method.

add push_to_hub to pipeline by @not-lain in #29172

Flash Attention 2 for more models (M2M100, NLLB, GPT2, MusicGen)!

Thanks to community contributions, Flash Attention 2 has been integrated into more architectures.

Adding Flash Attention 2 Support for GPT2 by @EduardoPach in #29226
Add Flash Attention 2 support to Musicgen and Musicgen Melody by @ylacombe in #29939
Add Flash Attention 2 to M2M100 model by @visheratin in #30256

Improvements and bugfixes

[docs] Remove redundant - and the from custom_tools.md by @windsonsea in #29767
Fixed typo in quantization_config.py by @kurokiasahi222 in #29766
OWL-ViT box_predictor inefficiency issue by @RVV-karma in #29712
Allow -OO mode for docstring_decorator by @matthid in #29689
fix issue with logit processor during beam search in Flax by @giganttheo in #29636
Fix docker image build for Latest PyTorch + TensorFlow [dev] by @ydshieh in #29764
[LlavaNext] Fix llava next unsafe imports by @ArthurZucker in #29773
Cast bfloat16 to float32 for Numpy conversions by @Rocketknight1 in #29755
Silence deprecations and use the DataLoaderConfig by @muellerzr in #29779
Add deterministic config to set_seed by @muellerzr in #29778
Add support for torch_dtype in the run_mlm example by @jla524 in #29776
Generate: remove legacy generation mixin imports by @gante in #29782
Llama: always convert the causal mask in the SDPA code path by @gante in #29663
Prepend bos token to Blip generations by @zucchini-nlp in #29642
Change in-place operations to out-of-place in LogitsProcessors by @zucchini-nlp in #29680
[quality] update quality check to make sure we check imports 😈 by @ArthurZucker in #29771
Fix type hint for train_dataset param of Trainer.__init__() to allow IterableDataset. Issue 29678 by @stevemadere in #29738
Enable AMD docker build CI by @IlyasMoutawwakil in #29803
Correct llava mask & fix missing setter for vocab_size by @fxmarty in #29389
rm input dtype change in CPU by @jiqing-feng in #28631
Generate: remove unused attributes in AssistedCandidateGenerator by @gante in #29787
replaced concatenation to f-strings to improve readability and unify … by @igeni in #29785
[cleanup] vestiges of causal mask by @ArthurZucker in #29806
Complete security policy with mentions of remote code by @LysandreJik in #29707
[SuperPoint] Fix doc example by @amyeroberts in #29816
[DOCS] Fix typo for llava next docs by @aliencaocao in #29829
model_summary.md - Restore link to Harvard's Annotated Transformer. by @gamepad-coder in #29702
Fix the behavior of collecting 'num_input_tokens_seen' by @YouliangHUANG in #29099
Populate torch_dtype from model to pipeline by @B-Step62 in #28940
remove quotes in code example by @johko in #29812
Add warnings if training args differ from checkpoint trainer state by @jonflynng in #29255
Replace 'decord' with 'av' in VideoClassificationPipeline by @Tyx-main in #29747
Fix header in IFE task guide by @merveenoyan in #29859
[docs] Indent ordered list in add_new_model.md by @windsonsea in #29796
Allow bos_token_id is None during the generation with inputs_embeds by @LZHgrla in #29772
Add cosine_with_min_lr scheduler in Trainer by @liuyanyi in #29341
Disable AMD memory benchmarks by @IlyasMoutawwakil in #29871
Set custom_container in build docs workflows by @Wauplin in #29855
Support num_attention_heads != num_key_value_heads in Flax Llama Implementation by @bminixhofer in #29557
Mamba slow_forward gradient fix by @vasqu in #29563
Fix 29807, sinusoidal positional encodings overwritten by post_init() by @hovnatan in #29813
Reimplement "Automatic safetensors conversion when lacking these files" by @LysandreJik in #29846
fix fuyu device_map compatibility by @SunMarc in #29880
Move eos_token_id to stopping criteria by @zucchini-nlp in #29459
add Cambricon MLUs support by @huismiling in #29627
MixtralSparseMoeBlock: add gate jitter by @lorenzoverardo in #29865
Fix typo in T5Block error message by @Mingosnake in #29881
[make fix-copies] update and help by @ArthurZucker in #29924
[GptNeox] don't gather on pkv when using the trainer by @ArthurZucker in #29892
[pipeline]. Zero shot add doc warning by @ArthurZucker in #29845
[doc] fix some typos and add xpu to the testing documentation by @faaany in #29894
Tests: replace torch.testing.assert_allclose by torch.testing.assert_close by @gante in #29915
Add beam search visualizer to the doc by @aymeric-roucher in #29876
Safe import of LRScheduler by @amyeroberts in #29919
add functions to inspect model and optimizer status to trainer.py by @CKeibel in #29838
RoPE models: add numerical sanity-check test for RoPE scaling by @gante in #29808
[Mamba] from pretrained issue with self.embeddings by @ArthurZucker in #29851
[ TokenizationLlama] fix the way we convert tokens to strings to keep leading spaces 🚨 breaking fix by @ArthurZucker in #29453
Allow GradientAccumulationPlugin to be configured from AcceleratorConfig by @fabianlim in #29589
[BC] Fix BC for other libraries by @ArthurZucker in #29934
Fix doc issue #29758 in DebertaV2Config class by @vinayakkgarg in #29842
[LlamaSlowConverter] Slow to Fast better support by @ArthurZucker in #29797
Update installs in image classification doc by @MariaHei in #29947
[StableLm] Add QK normalization and Parallel Residual Support by @jon-tow in #29745
Mark test_eager_matches_sdpa_generate flaky for some models by @ydshieh in #29479
Super tiny fix 12 typos about "with with" by @fzyzcjy in #29926
Fix rope theta for OpenLlama by @jla524 in #29893
Add warning message for run_qa.py by @jla524 in #29867
fix: get mlflow version from mlflow-skinny by @clumsy in #29918
Reset alarm signal when the function is ended by @coldnight in #29706
Update model card and link of blog post. by @bozheng-hit in #29928
[BC] Fix BC for AWQ quant by @TechxGenus in #29965
Rework tests to compare trainer checkpoint args by @muellerzr in #29883
Fix FA2 tests by @ylacombe in #29909
Fix copies main ci by @ArthurZucker in #29979
[tests] fix the wrong output in ImageToTextPipelineTests.test_conditional_generation_llava by @faaany in #29975
Generate: move misplaced test by @gante in #29902
[docs] Big model loading by @stevhliu in #29920
[generate] fix breaking change for patch by @ArthurZucker in #29976
Fix 29807 sinusoidal positional encodings in Flaubert, Informer and XLM by @hovnatan in #29904
[bnb] Fix bug in _replace_with_bnb_linear by @SunMarc in #29958
Adding FlaxNoRepeatNGramLogitsProcessor by @giganttheo in #29677
[Docs] Make an ordered list prettier in add_tensorflow_model.md by @windsonsea in #29949
Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode by @msublee in #29311
Hard error when ignoring tensors. by @Narsil in #27484
Generate: fix logits processors doctests by @gante in #29718
Fix remove_columns in text-classification example by @mariosasko in #29351
Update tests/utils/tiny_model_summary.json by @ydshieh in #29941
Make EncodecModel.decode ONNX exportable by @fxmarty in #29913
Fix Swinv2ForImageClassification NaN output by @miguelm-almeida in #29981
Fix Qwen2Tokenizer by @jklj077 in #29929
Fix kwargs handling in generate_with_fallback by @cifkao in #29225
Fix probability computation in WhisperNoSpeechDetection when recomputing scores by @cifkao in #29248
Fix vipllava for generation by @zucchini-nlp in #29874
[docs] Fix audio file by @stevhliu in #30006
Superpoint imports fix by @zucchini-nlp in #29898
[Main CIs] Fix the red cis by @ArthurZucker in #30022
Make clearer about zero_init requirements by @muellerzr in #29879
Enable multi-device for efficientnet by @jla524 in #29989
Add a converter from mamba_ssm -> huggingface mamba by @byi8220 in #29705
[ProcessingIdefics] Attention mask bug with padding by @byi8220 in #29449
Add whisper to IMPORTANT_MODELS by @ydshieh in #30046
skip test_encode_decode_fast_slow_all_tokens for now by @ydshieh in #30044
if output is tuple like facebook/hf-seamless-m4t-medium, waveform is … by @sywangyi in #29722
Fix mixtral ONNX Exporter Issue. by @AdamLouly in #29858
[Trainer] Allow passing image processor by @NielsRogge in #29896
[bnb] Fix offload test by @SunMarc in #30039
Update quantizer_bnb_4bit.py: In the ValueError string there should be "....you need to set llm_int8_enable_fp32_cpu_offload=True...." instead of "load_in_8bit_fp32_cpu_offload=True". by @miRx923 in #30013
[test fetcher] Always include the directly related test files by @ydshieh in #30050
Fix torch.fx symbolic tracing for LLama by @michaelbenayoun in #30047
Refactor daily CI workflow by @ydshieh in #30012
Add docstrings and types for MambaCache by @koayon in #30023
Fix auto tests by @ydshieh in #30067
Fix whisper kwargs and generation config by @zucchini-nlp in #30018
doc: Correct spelling mistake by @caiyili in #30107
[Whisper] Computing features on GPU in batch mode for whisper feature extractor. by @vaibhavagg303 in #29900
Change log level to warning for num_train_epochs override by @xu-song in #30014
Make MLFlow version detection more robust and handles mlflow-skinny by @helloworld1 in #29957
updated examples/pytorch/language-modeling scripts and requirements.txt to require datasets>=2.14.0 by @Patchwork53 in #30120
[tests] add require_bitsandbytes marker by @faaany in #30116
fixing issue 30034 - adding data format for run_ner.py by @JINO-ROHIT in #30088
Patch fix - don't use safetensors for TF models by @amyeroberts in #30118
[#29174] ImportError Fix: Trainer with PyTorch requires accelerate>=0.20.1 Fix by @UtkarshaGupte in #29888
Accept token in trainer.push_to_hub() by @mapmeld in #30093
fix learning rate display in trainer when using galore optimizer by @vasqu in #30085
Fix falcon with SDPA, alibi but no passed mask by @fxmarty in #30123
Trainer / Core : Do not change init signature order by @younesbelkada in #30126
Make vitdet jit trace complient by @fxmarty in #30065
Fix typo at ImportError by @DrAnaximandre in #30090
Adding mps as device for Pipeline class by @fnhirwa in #30080
Fix failing DeepSpeed model zoo tests by @pacman100 in #30112
Add datasets.Dataset to Trainer's train_dataset and eval_dataset type hints by @ringohoffman in #30077
Fix docs Pop2Piano by @zucchini-nlp in #30140
Revert workaround for TF safetensors loading by @Rocketknight1 in #30128
[Trainer] Fix default data collator by @NielsRogge in #30142
[Trainer] Undo #29896 by @NielsRogge in #30129
Fix slow tests for important models to be compatible with A10 runners by @ydshieh in #29905
Send headers when converting safetensors by @ydshieh in #30144
Fix quantization tests by @SunMarc in #29914
[docs] Fix image segmentation guide by @stevhliu in #30132
[CI] Fix setup by @SunMarc in #30147
Fix length related warnings in speculative decoding by @zucchini-nlp in #29585
Fix and simplify semantic-segmentation example by @qubvel in #30145
[CI] Quantization workflow fix by @SunMarc in #30158
[tests] make 2 tests device-agnostic by @faaany in #30008
Add str to TrainingArguments report_to type hint by @ringohoffman in #30078
[UDOP] Fix tests by @NielsRogge in #29573
[UDOP] Improve docs, add resources by @NielsRogge in #29571
Fix accelerate kwargs for versions <0.28.0 by @vasqu in #30086
Fix typing annotation in hf_argparser by @xu-song in #30156
Fixing a bug when MlFlow try to log a torch.tensor by @etiennebonnafoux in #29932
Fix natten install in docker by @ydshieh in #30161
FIX / bnb: fix torch compatiblity issue with itemize by @younesbelkada in #30162
Update config class check in auto factory by @Rocketknight1 in #29854
Fixed typo in comments/documentation for Pipelines documentation by @DamonGuzman in #30170
Fix Llava chat template examples by @lewtun in #30130
Guard XLA version imports by @muellerzr in #30167
chore: remove repetitive words by @hugehope in #30174
fix: Fixed ruff configuration to avoid deprecated configuration warning by @Sai-Suraj-27 in #30179
Refactor Cohere Model by @saurabhdash2512 in #30027
Update output of SuperPointForKeypointDetection by @NielsRogge in #29809
Falcon: make activation, ffn_hidden_size configurable by @sshleifer in #30134
Docs PR template by @stevhliu in #30171
ENH: [CI] Add new workflow to run slow tests of important models on push main if they are modified by @younesbelkada in #29235
Fix pipeline logger.warning_once bug by @amyeroberts in #30195
fix: Replaced deprecated logger.warn with logger.warning by @Sai-Suraj-27 in #30197
fix typo by @mdeff in #30220
fix fuyu doctest by @molbap in #30215
Fix RecurrentGemmaIntegrationTest.test_2b_sample by @ydshieh in #30222
Update modeling_bark.py by @bes-dev in #30221
Fix/Update for doctest by @ydshieh in #30216
Fixed config.json download to go to user-supplied cache directory by @ulatekh in #30189
Add test for parse_json_file and change typing to os.PathLike by @xu-song in #30183
fix: Replace deprecated assertEquals with assertEqual by @Sai-Suraj-27 in #30241
Set pad_token in run_glue_no_trainer.py #28534 by @JINO-ROHIT in #30234
fix: Replaced deprecated typing.Text with str by @Sai-Suraj-27 in #30230
Refactor doctest by @ydshieh in #30210
fix: Fixed type annotation for compatability with python 3.8 by @Sai-Suraj-27 in #30243
Fix doctest more (for docs/source/en) by @ydshieh in #30247
round epoch only in console by @xdedss in #30237
update github actions packages' version to suppress warnings by @ydshieh in #30249
[tests] add the missing require_torch_multi_gpu flag by @faaany in #30250
[Docs] Update recurrent_gemma.md for some minor nits by @sayakpaul in #30238
Remove incorrect arg in codellama doctest by @Rocketknight1 in #30257
Update ko/_toctree.yml by @jungnerd in #30062
More fixes for doctest by @ydshieh in #30265
FIX: Fix corner-case issue with the important models workflow by @younesbelkada in #30212
FIX: Fix 8-bit serialization tests by @younesbelkada in #30051
Allow for str versions of dicts based on typing by @muellerzr in #30227
Workflow: Update tailscale to release version by @younesbelkada in #30268
Raise relevent err when wrong type is passed in as the accelerator_config by @muellerzr in #29997
BLIP - fix pt-tf equivalence test by @amyeroberts in #30258
fix: Fixed a raise statement by @Sai-Suraj-27 in #30275
Fix test fetcher (doctest) + Idefics2's doc example by @ydshieh in #30274
Fix SDPA sliding window compatibility by @fxmarty in #30127
Fix SpeechT5 forward docstrings by @ylacombe in #30287
FIX / AWQ: Fix failing exllama test by @younesbelkada in #30288
Configuring Translation Pipelines documents update #27753 by @UtkarshaGupte in #29986
Enable fx tracing for Mistral by @zucchini-nlp in #30209
Fix test ExamplesTests::test_run_translation by @ydshieh in #30281
Fix Fatal Python error: Bus error in ZeroShotAudioClassificationPipelineTests by @ydshieh in #30283
FIX: Fix push important models CI by @younesbelkada in #30291
Add token type ids to CodeGenTokenizer by @st81 in #29265
Add strategy to store results in evaluation loop by @qubvel in #30267
Upgrading to tokenizers 0.19.0 by @Narsil in #30289
Re-enable SDPA's FA2 path by @fxmarty in #30070
Fix quality Olmo + SDPA by @fxmarty in #30302
Fix donut token2json multiline by @qubvel in #30300
Fix all torch pipeline failures except one by @ydshieh in #30290
Add atol for sliding window test by @fxmarty in #30303
Fix RecurrentGemma device_map by @SunMarc in #30273
Revert "Re-enable SDPA's FA2 path (#30070)" by @ArthurZucker
Do not drop mask with SDPA for more cases by @fxmarty in #30311
FIX: Fixes unexpected behaviour for Llava / LLama & AWQ Fused modules + revert #30070 at the same time by @younesbelkada in #30317

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@bozheng-hit
- Add Qwen2MoE (#29377)
- Update model card and link of blog post. (#29928)
@EduardoPach
- Adding Flash Attention 2 Support for GPT2 (#29226)
- Adding grounding dino (#26087)
@2015aroras
- Add OLMo model family (#29890)
@tomeras91
- Add jamba (#29943)
@abhi-mosaic
- Add DBRX Model (#29921)

- Python
Published by LysandreJik about 2 years ago

transformers - Release v4.39.3

The AWQ issue persisted, and there was a regression reported with beam search and input embeddings.

Changes

Fix BC for AWQ quant #29965
generate fix breaking change for patch #29976

- Python
Published by ArthurZucker about 2 years ago

transformers - Patch release v4.39.2

Series of fixes for backwards compatibility (AutoAWQ and other quantization libraries, imports from trainer_pt_utils) and functionality (LLaMA tokenizer conversion)

Safe import of LRScheduler #29919
[BC] Fix BC for other libraries #29934
[LlamaSlowConverter] Slow to Fast better support #29797

- Python
Published by amyeroberts about 2 years ago

transformers - Patch release v4.39.1

Patch release to fix some breaking changes to LLaVA model, fixes/cleanup for Cohere & Gemma and broken doctest

Correct llava mask & fix missing setter for vocab_size #29389
[cleanup] vestiges of causal mask #29806
[SuperPoint] Fix doc example (https://github.com/huggingface/transformers/pull/29816)

- Python
Published by amyeroberts about 2 years ago

transformers - Release v4.39.0

v4.39.0

🚨 VRAM consumption 🚨

The Llama, Cohere and Gemma models no longer cache the triangular causal mask unless static cache is used. This was reverted by #29753, which fixes the BC issues w.r.t. speed and memory consumption, while still supporting compile and static cache. Small note: fx is not supported for these models, but a patch will be brought very soon!

New model addition

Cohere open-source model

Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with Cohere's industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:

- Strong accuracy on RAG and Tool Use
- Low latency, and high throughput
- Longer 128k context and lower pricing
- Strong capabilities across 10 key languages
- Model weights available on HuggingFace for research and evaluation

Cohere Model Release by @saurabhdash2512 in #29622

LLaVA-NeXT (llava v1.6)

LLaVA-NeXT is the next version of LLaVA, which includes better support for non-padded images, improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.

Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:

- Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
- Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
- Better visual conversation for more scenarios, covering different applications.
- Better world knowledge and logical reasoning.
- Along with performance improvements, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA-1.5. It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. The largest 34B variant finishes training in ~1 day with 32 A100s.

LLaVa-NeXT incorporates a higher input resolution by encoding various patches of the input image. Taken from the original paper.
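
The "4x more pixels" figure can be checked with a little arithmetic. The sketch below assumes a 336x336 base resolution (the LLaVA-1.5 input size, which is not stated in these notes) and assumes each supported resolution is covered by non-overlapping base-sized tiles; `tile_grid` is a hypothetical helper written for illustration, not part of the library.

```python
BASE = 336  # assumed base input side length in pixels (not stated in these notes)

def tile_grid(width, height, base=BASE):
    """Return (columns, rows) of base-sized tiles that exactly cover the image."""
    if width % base or height % base:
        raise ValueError("resolution must be a multiple of the base size")
    return width // base, height // base

# The three resolutions listed above, each compared with one base-sized image.
for width, height in [(672, 672), (336, 1344), (1344, 336)]:
    cols, rows = tile_grid(width, height)
    ratio = (width * height) // (BASE * BASE)
    print(f"{width}x{height}: {cols}x{rows} tiles, {ratio}x the base pixel count")
```

Under these assumptions every listed resolution decomposes into four base-sized tiles, which is where the 4x pixel count comes from.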

MusicGen Melody

The MusicGen Melody model was proposed in Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.

MusicGen Melody is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.

Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass.

Add MusicGen Melody by @ylacombe in #28819
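
To give a feel for what a token interleaving pattern looks like, here is a minimal sketch of a "delay" layout, in which codebook k is shifted right by k steps so that one decoding step can emit one token per codebook. It assumes the delay pattern described in the MusicGen paper; `PAD` and `apply_delay_pattern` are hypothetical names used for illustration and are not the library's API.

```python
PAD = -1  # hypothetical placeholder for positions that hold no token

def apply_delay_pattern(codes):
    """Shift codebook k right by k steps.

    `codes` is a list of K equal-length lists (one per codebook).
    Returns K lists of length T + K - 1, padded with PAD.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [PAD] * k
        right = [PAD] * (total_steps - num_frames - k)
        delayed.append(left + list(row) + right)
    return delayed

# Two codebooks, three audio frames.
print(apply_delay_pattern([[1, 2, 3], [4, 5, 6]]))
```

Reading the result column by column shows which tokens would be predicted together at each decoding step under this layout.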

PvT-v2

The PVTv2 model was proposed in PVT v2: Improved Baselines with Pyramid Vision Transformer by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. As an improved variant of PVT, it eschews position embeddings, relying instead on positional information encoded through zero-padding and overlapping patch embeddings. This lack of reliance on position embeddings simplifies the architecture, and enables running inference at any resolution without needing to interpolate them.

Add PvT-v2 Model by @FoamoftheSea in #26812
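
Because overlapping patch embeddings are plain strided convolutions, the feature-map size at each stage follows from the usual convolution size formula for any input resolution. The sketch below assumes a (kernel, stride, padding) of (7, 4, 3) for the first stage and (3, 2, 1) for the next three; these values are assumptions for illustration and should be checked against the model configuration.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution with zero-padding."""
    return (size + 2 * padding - kernel) // stride + 1

def pvt_v2_feature_sizes(height, width):
    """Return the (height, width) of the feature map after each of four stages."""
    # Assumed (kernel, stride, padding) per stage; not taken from these notes.
    stages = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
    sizes = []
    for kernel, stride, padding in stages:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        sizes.append((height, width))
    return sizes

print(pvt_v2_feature_sizes(224, 224))
print(pvt_v2_feature_sizes(300, 500))  # non-square input, no interpolation step needed
```

No position-embedding table appears anywhere in this computation, which is why changing the input resolution only changes the resulting feature-map sizes.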

UDOP

The UDOP model was proposed in Unifying Vision, Text, and Layout for Universal Document Processing by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.

UDOP architecture. Taken from the original paper.

Add UDOP by @NielsRogge in #22940

Mamba

Mamba is a new architecture paradigm based on state-space models rather than the attention mechanism used by transformer models. The checkpoints are compatible with the original ones.

[Add Mamba] Adds support for the Mamba models by @ArthurZucker in #28094

StarCoder2

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.

Starcoder2 model - bis by @RaymondLi0 in #29215
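
As a rough illustration of how sliding window attention restricts what each token can see, the sketch below builds a boolean causal mask in plain Python. It assumes the window counts the current token (so a window of 3 covers the token itself and the two before it); the exact boundary convention is an assumption here, and `sliding_window_causal_mask` is a hypothetical helper rather than the library's implementation.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a 4,096-token window inside a 16,384-token context, each query would attend to at most 4,096 positions under this convention, however long the sequence grows.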

SegGPT

The SegGPT model was proposed in SegGPT: Segmenting Everything In Context by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. SegGPT employs a decoder-only Transformer that can generate a segmentation mask given an input image, a prompt image and its corresponding prompt mask. The model achieves remarkable one-shot results with 56.1 mIoU on COCO-20 and 85.6 mIoU on FSS-1000.

Adding SegGPT by @EduardoPach in #27735

Galore optimizer

With Galore, you can pre-train large models on consumer-grade hardware, making LLM pre-training much more accessible to anyone in the community.

Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Galore is based on low rank approximation of the gradients and can be used out of the box for any model.

Below is a simple snippet that demonstrates how to pre-train mistralai/Mistral-7B-v0.1 on imdb:

```python
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

# Select the GaLore optimizer and the module name patterns it should be applied to.
args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Initialize from the config (random weights, i.e. pre-training) and move to GPU 0.
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```
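
To see why projecting gradients to low rank saves optimizer memory, the following back-of-the-envelope sketch compares the number of optimizer-state elements for a single weight matrix. It assumes Adam keeps two moment tensors per parameter, and that the low-rank variant stores one projection matrix along the smaller dimension plus two moment tensors in the projected space; the rank and matrix sizes are illustrative values, and the function names are hypothetical.

```python
def adam_state_elements(rows, cols):
    """Full-rank Adam: first and second moments, each the size of the weight."""
    return 2 * rows * cols

def low_rank_adam_state_elements(rows, cols, rank):
    """Low-rank variant: projection matrix plus two moments in the projected space."""
    small, large = sorted((rows, cols))
    projection = small * rank
    moments = 2 * rank * large
    return projection + moments

rows, cols, rank = 4096, 4096, 128  # illustrative values only
full = adam_state_elements(rows, cols)
low = low_rank_adam_state_elements(rows, cols, rank)
print(f"full-rank: {full:,} elements, low-rank: {low:,} elements ({low / full:.1%})")
```

The savings quoted above are measured over whole models and training runs, so they will not match this single-matrix estimate; the sketch only shows where the reduction comes from.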

Quantization

Quanto integration

Quanto has been integrated with transformers! You can apply simple quantization algorithms with only a few lines of code. Quanto is also compatible with torch.compile.

Check out the announcement blogpost for more details.

[Quantization] Quanto quantizer by @SunMarc in #29023

Exllama 🤝 AWQ

Exllama and AWQ are now combined for faster AWQ inference. Check out the relevant documentation section for more details on how to use Exllama + AWQ.

Exllama kernels support for AWQ models by @IlyasMoutawwakil in #28634

MLX Support

Allow models saved or fine-tuned with Apple’s MLX framework to be loaded in transformers (as long as the model parameters use the same names), and improve tensor interoperability. This leverages MLX's adoption of safetensors as their checkpoint format.

Add mlx support to BatchEncoding.convert_to_tensors by @Y4hL in #29406
Add support for metadata format MLX by @alexweberk in #29335
Typo in mlx tensor support by @pcuenca in #29509
Experimental loading of MLX files by @pcuenca in #29511

Highlighted improvements

Notable memory reduction in Gemma/LLaMa by changing the causal mask buffer type from int64 to boolean.

Use torch.bool instead of torch.int64 for non-persistant causal mask buffer by @fxmarty in #29241
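
The size of the saving can be estimated with simple arithmetic: an int64 element takes 8 bytes and a bool element takes 1 byte. The sketch below assumes a square buffer of `max_positions x max_positions` elements and uses 8192 positions as an illustrative value; both are assumptions, not figures from the PR.

```python
BYTES_PER_ELEMENT = {"int64": 8, "bool": 1}

def causal_mask_bytes(max_positions, dtype):
    """Bytes needed for a square (max_positions x max_positions) mask buffer."""
    return max_positions * max_positions * BYTES_PER_ELEMENT[dtype]

def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20

max_positions = 8192  # illustrative value only
for dtype in ("int64", "bool"):
    size = causal_mask_bytes(max_positions, dtype)
    print(f"{dtype}: {to_mib(size):.0f} MiB")
```

Whatever the buffer shape, switching the element type from int64 to bool divides its footprint by eight.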

Remote code improvements

Allow remote code repo names to contain "." by @Rocketknight1 in #29175
simplify get_class_in_module and fix for paths containing a dot by @cebtenzzre in #29262

Breaking changes

The PRs below introduced slightly breaking changes that we believed were necessary for the repository; if these seem to impact your usage of transformers, we recommend checking out the PR description to get more insight into how to leverage the new behavior.

🚨🚨[Whisper Tok] Update integration test by @sanchit-gandhi in #29368
🚨 Fully revert atomic checkpointing 🚨 by @muellerzr in #29370
[BC 4.37 -> 4.38] for Llama family, memory and speed #29753 (causal mask is no longer a registered buffer)

Fixes and improvements
FIX [Gemma] Fix bad rebase with transformers main by @younesbelkada in #29170
Add training version check for AQLM quantizer. by @BlackSamorez in #29142
[Gemma] Fix eager attention by @sanchit-gandhi in #29187
[Mistral, Mixtral] Improve docs by @NielsRogge in #29084
Fix torch.compile with fullgraph=True when attention_mask input is used by @fxmarty in #29211
fix(mlflow): check mlflow version to use the synchronous flag by @cchen-dialpad in #29195
Fix missing translation in README_ru by @strikoder in #29054
Improve _update_causal_mask performance by @alessandropalla in #29210
[Doc] update model doc qwen2 by @ArthurZucker in #29238
Use torch 2.2 for daily CI (model tests) by @ydshieh in #29208
Cache is_vision_available result by @bmuskalla in #29280
Use DS_DISABLE_NINJA=1 by @ydshieh in #29290
Add non_device_test pytest mark to filter out non-device tests by @fxmarty in #29213
Add feature extraction mapping for automatic metadata update by @merveenoyan in #28944
Generate: v4.38 removals and related updates by @gante in #29171
Track each row separately for stopping criteria by @zucchini-nlp in #29116
[docs] Spanish translation of tasks_explained.md by @aaronjimv in #29224
[i18n-zh] Translated torchscript.md into Chinese by @windsonsea in #29234
🌐 [i18n-ZH] Translate chat_templating.md into Chinese by @shibing624 in #28790
[i18n-vi] Translate README.md to Vietnamese by @hoangsvit in #29229
[i18n-zh] Translated task/asr.md into Chinese by @windsonsea in #29233
Fixed Deformable Detr typo when loading cuda kernels for MSDA by @EduardoPach in #29294
GenerationConfig validate both constraints and force_words_ids by @FredericOdermatt in #29163
Add generate kwargs to VQA pipeline by @regisss in #29134
Cleaner Cache dtype and device extraction for CUDA graph generation for quantizers compatibility by @BlackSamorez in #29079
Image Feature Extraction docs by @merveenoyan in #28973
Fix attn_implementation documentation by @fxmarty in #29295
[tests] enable benchmark unit tests on XPU by @faaany in #29284
Use torch 2.2 for deepspeed CI by @ydshieh in #29246
Add compatibility with skip_memory_metrics for mps device by @SunMarc in #29264
Token level timestamps for long-form generation in Whisper by @zucchini-nlp in #29148
Fix a few typos in GenerationMixin's docstring by @sadra-barikbin in #29277
[i18n-zh] Translate fsdp.md into Chinese by @windsonsea in #29305
FIX [Gemma / CI] Make sure our runners have access to the model by @younesbelkada in #29242
Remove numpy usage from owlvit by @fxmarty in #29326
[require_read_token] fix typo by @ArthurZucker in #29345
[T5 and Llama Tokenizer] remove warning by @ArthurZucker in #29346
[Llama ROPE] Fix torch export but also slow downs in forward by @ArthurZucker in #29198
Disable Mixtral output_router_logits during inference by @LeonardoEmili in #29249
Idefics: generate fix by @gante in #29320
RoPE loses precision for Llama / Gemma + Gemma logits.float() by @danielhanchen in #29285
check if position_ids exists before using it by @jiqing-feng in #29306
[CI] Quantization workflow by @SunMarc in #29046
Better SDPA unmasking implementation by @fxmarty in #29318
[i18n-zh] Sync source/zh/index.md by @windsonsea in #29331
FIX [CI / starcoder2] Change starcoder2 path to correct one for slow tests by @younesbelkada in #29359
FIX [CI]: Fix failing tests for peft integration by @younesbelkada in #29330
FIX [CI] require_read_token in the llama FA2 test by @younesbelkada in #29361
Avoid using uncessary get_values(MODEL_MAPPING) by @ydshieh in #29362
Patch YOLOS and others by @NielsRogge in #29353
Fix @require_read_token in tests by @Wauplin in #29367
Expose offload_buffers parameter of accelerate to PreTrainedModel.from_pretrained method by @notsyncing in #28755
Fix Base Model Name of LlamaForQuestionAnswering by @lenglaender in #29258
FIX [quantization / ESM] Fix ESM 8bit / 4bit with bitsandbytes by @younesbelkada in #29329
[Llama + AWQ] fix prepare_inputs_for_generation 🫠 by @ArthurZucker in #29381
[YOLOS] Fix - return padded annotations by @amyeroberts in #29300
Support subfolder with AutoProcessor by @JingyaHuang in #29169
Fix llama + gemma accelete tests by @SunMarc in #29380
Fix deprecated arg issue by @muellerzr in #29372
Correct zero division error in inverse sqrt scheduler by @DavidAfonsoValente in #28982
[tests] enable automatic speech recognition pipeline tests on XPU by @faaany in #29308
update path to hub files in the error message by @poedator in #29369
[Mixtral] Fixes attention masking in the loss by @DesmonDay in #29363
Workaround for #27758 to avoid ZeroDivisionError by @tleyden in #28756
Convert SlimSAM checkpoints by @NielsRogge in #28379
Fix: Fixed the previous tracking URI setting logic to prevent clashes with original MLflow code. by @seanswyi in #29096
Fix OneFormer post_process_instance_segmentation for panoptic tasks by @nickthegroot in #29304
Fix grad_norm unserializable tensor log failure by @svenschultze in #29212
Avoid edge case in audio utils by @ylacombe in #28836
DeformableDETR support bfloat16 by @DonggeunYu in #29232
[Docs] Spanish Translation -Torchscript md & Trainer md by @njackman-2344 in #29310
FIX [Generation] Fix some issues when running the MaxLength criteria on CPU by @younesbelkada in #29317
Fix max length for BLIP generation by @zucchini-nlp in #29296
[docs] Update starcoder2 paper link by @xenova in #29418
[tests] enable test_pipeline_accelerate_top_p on XPU by @faaany in #29309
[UdopTokenizer] Fix post merge imports by @ArthurZucker in #29451
more fix by @ArthurZucker (direct commit on main)
Revert-commit 0d52f9f582efb82a12e8d9162b43a01b1aa0200f by @ArthurZucker in #29455
[Udop imports] Processor tests were not run. by @ArthurZucker in #29456
Generate: inner decoding methods are no longer public by @gante in #29437
Fix bug with passing capture_* args to neptune callback by @AleksanderWWW in #29041
Update pytest import_path location by @loadams in #29154
Automatic safetensors conversion when lacking these files by @LysandreJik in #29390
[i18n-zh] Translate add_new_pipeline.md into Chinese by @windsonsea in #29432
🌐 [i18n-KO] Translated generation_strategies.md to Korean by @AI4Harmony in #29086
[FIX] offload_weight() takes from 3 to 4 positional arguments but 5 were given by @faaany in #29457
[Docs / Awq] Add docs on exllamav2 + AWQ by @younesbelkada in #29474
[docs] Add starcoder2 docs by @younesbelkada in #29454
Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor by @ringohoffman in #29447
Generate: add tests for caches with pad_to_multiple_of by @gante in #29462
Generate: get generation mode from the generation config instance 🧼 by @gante in #29441
Avoid dummy token in PLD to optimize performance by @ofirzaf in #29445
Fix test failure on DeepSpeed by @muellerzr in #29444
Generate: torch.compile-ready generation config preparation by @gante in #29443
added the max_matching_ngram_size to GenerationConfig by @mosheber in #29131
Fix TextGenerationPipeline.__call__ docstring by @alvarobartt in #29491
Substantially reduce memory usage in _update_causal_mask for large batches by using .expand instead of .repeat [needs tests+sanity check] by @nqgl in #29413
Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS device by @currybab in #29439
Enable BLIP for auto VQA by @regisss in #29499
v4.39 deprecations 🧼 by @gante in #29492
Revert "Automatic safetensors conversion when lacking these files by @LysandreJik in #2…
fix: Avoid error when fsdp_config is missing xla_fsdp_v2 by @ashokponkumar in #29480
Flava multimodal add attention mask by @zucchini-nlp in #29446
test_generation_config_is_loaded_with_model - fall back to pytorch model for now by @amyeroberts in #29521
Set inputs as kwarg in TextClassificationPipeline by @alvarobartt in #29495
Fix VisionEncoderDecoder Positional Arg by @nickthegroot in #29497
Generate: left-padding test, revisited by @gante in #29515
[tests] add the missing require_sacremoses decorator by @faaany in #29504
fix image-to-text batch incorrect output issue by @sywangyi in #29342
Typo fix in error message by @clefourrier in #29535
[tests] use torch_device instead of auto for model testing by @faaany in #29531
StableLM: Fix dropout argument type error by @liangjs in #29236
Make sliding window size inclusive in eager attention by @jonatanklosko in #29519
fix typos in FSDP config parsing logic in TrainingArguments by @yundai424 in #29189
Fix WhisperNoSpeechDetection when input is full silence by @ylacombe in #29065
[tests] use the correct n_gpu in TrainerIntegrationTest::test_train_and_eval_dataloaders for XPU by @faaany in #29307
Fix eval thread fork bomb by @muellerzr in #29538
feat: use warning_advice for tensorflow warning by @winstxnhdw in #29540
[Mamba doc] Post merge updates by @ArthurZucker in #29472
[Docs] fixed minor typo by @j-gc in #29555
Add Fill-in-the-middle training objective example - PyTorch by @tanaymeh in #27464
Bark model Flash Attention 2 Enabling to pass on check_device_map parameter to super() by @damithsenanayake in #29357
Make torch xla available on GPU by @yitongh in #29334
[Docs] Fix FastSpeech2Conformer model doc links by @khipp in #29574
Don't use a subset in test fetcher if on main branch by @ydshieh in #28816
fix error: TypeError: Object of type Tensor is not JSON serializable … by @yuanzhoulvpi2017 in #29568
Add missing localized READMEs to the copies check by @khipp in #29575
Fixed broken link by @amritgupta98 in #29558
Tiny improvement for doc by @fzyzcjy in #29581
Fix Fuyu doc typos by @zucchini-nlp in #29601
Fix minor typo: softare => software by @DriesVerachtert in #29602
Stop passing None to compile() in TF examples by @Rocketknight1 in #29597
Fix typo (determine) by @koayon in #29606
Implemented add_pooling_layer arg to TFBertModel by @tomigee in #29603
Update legacy Repository usage in various example files by @Hvanderwilk in #29085
Set env var to hold Keras at Keras 2 by @Rocketknight1 in #29598
Update flava tests by @ydshieh in #29611
Fix typo ; Update quantization.md by @furkanakkurt1335 in #29615
Add tests for batching support by @zucchini-nlp in #29297
Fix: handle logging of scalars in Weights & Biases summary by @parambharat in #29612
Examples: check max_position_embeddings in the translation example by @gante in #29600
[Gemma] Supports converting directly in half-precision by @younesbelkada in #29529
[Flash Attention 2] Add flash attention 2 for GPT-J by @bytebarde in #28295
Core: Fix copies on main by @younesbelkada in #29624
[Whisper] Deprecate forced ids for v4.39 by @sanchit-gandhi in #29485
Warn about tool use by @LysandreJik in #29628
Adds pretrained IDs directly in the tests by @LysandreJik in #29534
[generate] deprecate forced ids processor by @sanchit-gandhi in #29487
Fix minor typo: infenrece => inference by @DriesVerachtert in #29621
[MaskFormer, Mask2Former] Use einsum where possible by @amyeroberts in #29544
Llama: allow custom 4d masks by @gante in #29618
[PyTorch/XLA] Fix extra TPU compilations introduced by recent changes by @alanwaketan in #29158
[docs] Spanish translate chat_templating.md & yml addition by @njackman-2344 in #29559
Add support for FSDP+QLoRA and DeepSpeed ZeRO3+QLoRA by @pacman100 in #29587
[Mask2Former] Move normalization for numerical stability by @amyeroberts in #29542
[tests] make test_trainer_log_level_replica to run on accelerators with more than 2 devices by @faaany in #29609
Refactor TFP call to just sigmoid() by @Rocketknight1 in #29641
Fix batching tests for new models (Mamba and SegGPT) by @zucchini-nlp in #29633
Fix multi_gpu_data_parallel_forward for MusicgenTest by @ydshieh in #29632
[docs] Remove broken ChatML format link from chat_templating.md by @aaronjimv in #29643
Add newly added PVTv2 model to all README files. by @robinverduijn in #29647
[PEFT] Fix save_pretrained to make sure adapters weights are also saved on TPU by @shub-kris in #29388
Fix TPU checkpointing inside Trainer by @shub-kris in #29657
Add dataset_revision argument to RagConfig by @ydshieh in #29610
Fix PVT v2 tests by @ydshieh in #29660
Generate: handle cache_position update in generate by @gante in #29467
Allow apply_chat_template to pass kwargs to the template and support a dict of templates by @Rocketknight1 in #29658
Inaccurate code example within inline code-documentation by @MysteryManav in #29661
Extend import utils to cover "editable" torch versions by @bhack in #29000
Trainer: fail early in the presence of an unsavable generation_config by @gante in #29675
Pipeline: use tokenizer pad token at generation time if the model pad token is unset. by @gante in #29614
[tests] remove deprecated tests for model loading by @faaany in #29450
Fix AutoformerForPrediction example code by @m-torhan in #29639
[tests] ensure device-required software is available in the testing environment before testing by @faaany in #29477
Fix wrong condition used in filter_models by @ydshieh in #29673
fix: typos by @testwill in #29653
Rename glue to nyu-mll/glue by @lhoestq in #29679
Generate: replace breaks by a loop condition by @gante in #29662
[FIX] Fix speech2test modeling tests by @ylacombe in #29672
Revert "Fix wrong condition used in filter_models" by @ydshieh in #29682
[docs] Spanish translation of attention.md by @aaronjimv in #29681
CI / generate: batch size computation compatible with all models by @gante in #29671
Fix filter_models by @ydshieh in #29710
FIX [bnb] Make unexpected_keys optional by @younesbelkada in #29420
Update the pipeline tutorial to include gradio.Interface.from_pipeline by @abidlabs in #29684
Use logging.warning instead of warnings.warn in pipeline.__call__ by @tokestermw in #29717

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@windsonsea
- [i18n-zh] Translated torchscript.md into Chinese (#29234)
- [i18n-zh] Translated task/asr.md into Chinese (#29233)
- [i18n-zh] Translate fsdp.md into Chinese (#29305)
- [i18n-zh] Sync source/zh/index.md (#29331)
- [i18n-zh] Translate add_new_pipeline.md into Chinese (#29432)
@hoangsvit
- [i18n-vi] Translate README.md to Vietnamese (#29229)
@EduardoPach
- Fixed Deformable Detr typo when loading cuda kernels for MSDA (#29294)
- Adding SegGPT (#27735)
@RaymondLi0
- Starcoder2 model - bis (#29215)
@njackman-2344
- [Docs] Spanish Translation -Torchscript md & Trainer md (#29310)
- [docs] Spanish translate chat_templating.md & yml addition (#29559)
@tanaymeh
- Add Fill-in-the-middle training objective example - PyTorch (#27464)
@Hvanderwilk
- Update legacy Repository usage in various example files (#29085)
@FoamoftheSea
- Add PvT-v2 Model (#26812)
@saurabhdash2512
- Cohere Model Release (#29622)

- Python
Published by ArthurZucker about 2 years ago

transformers - v4.38.2

Fix backward compatibility issues with Llama and Gemma:

We mostly made sure that performance is not affected by the new RoPE paradigm. The RoPE computation was fixed (it should always run in float32), and the causal_mask dtype was set to bool to use less RAM.
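To illustrate why the RoPE computation should stay in float32, the sketch below round-trips a few position indices through IEEE half precision using only the standard library. The helper name `to_float16` is made up for this example and is not part of transformers; it only shows the rounding effect, not the library's actual code path.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# float16 has an 11-bit significand, so not every integer above 2048 is
# representable: some position indices get rounded to a neighbouring value.
positions = [2047, 2048, 2049, 4001]
for pos in positions:
    print(pos, "->", to_float16(float(pos)))

# Positions that cannot be stored exactly in half precision. Rotary angles
# computed from them would be shifted relative to the float32 result.
lossy = [pos for pos in positions if to_float16(float(pos)) != pos]
```

Positions that land in `lossy` would get rotary angles that differ from the float32 result, which is the kind of precision loss the fix avoids.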

YOLOS had a regression, and the Llama and T5 tokenizers emitted a spurious warning.

FIX [Gemma] Fix bad rebase with transformers main (#29170)
Improve _update_causal_mask performance (#29210)
[T5 and Llama Tokenizer] remove warning (#29346)
[Llama ROPE] Fix torch export but also slow downs in forward (#29198)
RoPE loses precision for Llama / Gemma + Gemma logits.float() (#29285)
Patch YOLOS and others (#29353)
Use torch.bool instead of torch.int64 for non-persistant causal mask buffer (#29241)
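As a rough back-of-the-envelope check on the RAM saving from the bool causal mask, the sketch below compares buffer sizes for the two dtypes. It assumes a square mask of side 4096; the helper name and the context length are illustrative assumptions, not values taken from the PRs above.

```python
def causal_mask_bytes(max_positions: int, bytes_per_element: int) -> int:
    """Size in bytes of a square (max_positions x max_positions) mask buffer."""
    return max_positions * max_positions * bytes_per_element

MAX_POSITIONS = 4096  # assumed context length, for illustration only

int64_bytes = causal_mask_bytes(MAX_POSITIONS, 8)  # torch.int64: 8 bytes per element
bool_bytes = causal_mask_bytes(MAX_POSITIONS, 1)   # torch.bool: 1 byte per element

print(f"int64: {int64_bytes / 2**20:.0f} MiB, bool: {bool_bytes / 2**20:.0f} MiB")
```

Under these assumptions the bool buffer is eight times smaller than the int64 one.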

- Python
Published by ArthurZucker over 2 years ago

transformers - v4.38.1

Fix eager attention in Gemma!

[Gemma] Fix eager attention #29187 by @sanchit-gandhi

TLDR:

    - attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    + attn_output = attn_output.view(bsz, q_len, -1)
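A likely reason the hard-coded `self.hidden_size` breaks is that the merged head dimension (`num_heads * head_dim`) need not equal `hidden_size`. The sketch below checks this with plain arithmetic; the dimensions are assumed Gemma-7B-like values chosen for illustration, and `can_reshape` is a hypothetical helper, so verify the real numbers against the model config.

```python
def can_reshape(numel: int, target_shape: tuple) -> bool:
    """A reshape is valid only if the element counts match."""
    total = 1
    for dim in target_shape:
        total *= dim
    return total == numel

# Assumed Gemma-7B-like dimensions (illustrative, check the model config).
hidden_size, num_heads, head_dim = 3072, 16, 256
bsz, q_len = 1, 8

# Attention output before merging heads: (bsz, q_len, num_heads, head_dim).
numel = bsz * q_len * num_heads * head_dim

# Old code: assumes num_heads * head_dim == hidden_size.
old_ok = can_reshape(numel, (bsz, q_len, hidden_size))

# New code: -1 lets the last dimension be inferred from the element count.
inferred_last_dim = numel // (bsz * q_len)

print(old_ok, inferred_last_dim)
```

With these numbers the old target shape does not match the element count, while the inferred last dimension is `num_heads * head_dim`.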

- Python
Published by ArthurZucker over 2 years ago

transformers - v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer, AQLM

New model additions

💎 Gemma 💎

Gemma is a new open-source language model series from Google AI that comes in 2B and 7B variants. The release includes pre-trained and instruction fine-tuned versions, which you can use via AutoModelForCausalLM, GemmaForCausalLM, or the pipeline interface!

Read more about it in the Gemma release blogpost: https://hf.co/blog/gemma

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```

You can use the model with Flash Attention, SDPA, static cache, and the quantization API for further optimizations!

Flash Attention 2

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```

bitsandbytes-4bit

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto",
    load_in_4bit=True
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```

Static Cache

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    device_map="auto"
)

model.generation_config.cache_implementation = "static"

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
```

Depth Anything Model

The Depth Anything model was proposed in Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the DPT architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation.

Add Depth Anything by @NielsRogge in #28654

Stable LM

StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.

StableLM 3B 4E1T is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.

The team also provides StableLM Zephyr 3B, an instruction fine-tuned version of the model that can be used for chat-based applications.

Add StableLM by @jon-tow in #28810

⚡️ Static cache was introduced in the following PRs ⚡️

Static past key value cache allows LlamaForCausalLM's forward pass to be compiled using torch.compile! This means that (CUDA) graphs can be used for inference, which speeds up the decoding step by 4x!
A forward pass of Llama2 7B takes around 10.5 ms to run with this on an A100, on par with TGI performance! ⚡️

[Core generation] Adds support for static KV cache by @ArthurZucker in #27931
[CLeanup] Revert SDPA attention changes that got in the static kv cache PR by @ArthurZucker in #29027
Fix static generation when compiling! by @ArthurZucker in #28937
Static Cache: load models with MQA or GQA by @gante in #28975
Fix symbolic_trace with kv cache by @fxmarty in #28724

⚠️ Support for generate is not included yet. This feature is experimental and subject to changes in subsequent releases.

```py
from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
import torch
import os

# compilation triggers multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "true"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# set up the static cache in advance of using the model
model._setup_cache(StaticCache, max_batch_size=1, max_cache_len=128)

# trigger compilation!
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# run the model as usual
input_text = "A few facts about the universe: "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda").input_ids
model_outputs = compiled_model(input_ids)
```
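The idea behind a static cache can be sketched without any framework: the buffer is allocated once at its maximum size and written in place, so its shape never changes between decoding steps, which is what makes the forward pass friendly to compilation. The class below is a toy illustration with hypothetical names (`ToyStaticCache`, `update`); it is not the library's `StaticCache` implementation.

```python
class ToyStaticCache:
    """Fixed-size key cache: storage is allocated once and written in place."""

    def __init__(self, max_cache_len: int, head_dim: int):
        # Pre-allocate the full buffer so its shape never changes.
        self.keys = [[0.0] * head_dim for _ in range(max_cache_len)]
        self.seen_tokens = 0

    def update(self, new_key: list) -> list:
        if self.seen_tokens >= len(self.keys):
            raise ValueError("cache is full; max_cache_len was exceeded")
        # Overwrite the slot for the current position instead of appending.
        self.keys[self.seen_tokens] = list(new_key)
        self.seen_tokens += 1
        return self.keys  # always max_cache_len rows


cache = ToyStaticCache(max_cache_len=4, head_dim=2)
shapes = []
for step in range(3):
    keys = cache.update([float(step), float(step)])
    shapes.append((len(keys), len(keys[0])))

print(shapes)
```

A dynamic cache would instead grow by one row per step, giving a different shape at every decoding step.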

Quantization

🧼 HF Quantizer 🧼

HfQuantizer makes it easy for quantization method researchers and developers to add inference and / or quantization support in 🤗 transformers. If you are interested in adding the support for new methods, please refer to this documentation page: https://huggingface.co/docs/transformers/main/en/hf_quantizer

HfQuantizer class for quantization-related stuff in modeling_utils.py by @poedator in #26610
[HfQuantizer] Move it to "Developper guides" by @younesbelkada in #28768
[HFQuantizer] Remove check_packages_compatibility logic by @younesbelkada in #28789
[docs] HfQuantizer by @stevhliu in #28820

⚡️AQLM ⚡️

AQLM is a new quantization method that enables 2-bit precision with no performance degradation. Check out this demo about how to run Mixtral in 2-bit on a free-tier Google Colab instance: https://huggingface.co/posts/ybelkada/434200761252287

AQLM quantizer support by @BlackSamorez in #28928
Removed obsolete attribute setting for AQLM quantization. by @BlackSamorez in #29034

🧼 Moving canonical repositories 🧼

The canonical repositories on the Hugging Face Hub (models that did not have an organization, like bert-base-cased) have been moved under organizations.

You can find the entire list of models moved here: https://huggingface.co/collections/julien-c/canonical-models-65ae66e29d5b422218567567

Redirection has been set up so that your code continues working even if you continue calling the previous paths. We, however, still encourage you to update your code to use the new links so that it is entirely future proof.
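Updating code to the new paths can be as simple as a lookup table. The sketch below uses a small, non-exhaustive sample of old-to-new ids; the mapping entries and the helper name `resolve_repo_id` are assumptions for illustration, so check them against the collection linked above before relying on them.

```python
# Example old -> new repository ids (a small, non-exhaustive sample;
# verify each entry against the canonical-models collection).
CANONICAL_MOVES = {
    "bert-base-cased": "google-bert/bert-base-cased",
    "gpt2": "openai-community/gpt2",
    "roberta-base": "FacebookAI/roberta-base",
    "t5-small": "google-t5/t5-small",
}

def resolve_repo_id(repo_id: str) -> str:
    """Return the namespaced id for a moved canonical repo, else the input unchanged."""
    if "/" in repo_id:  # already namespaced under an organization or user
        return repo_id
    return CANONICAL_MOVES.get(repo_id, repo_id)

print(resolve_repo_id("bert-base-cased"))
```

The resolved id can then be passed to `from_pretrained` in place of the old path.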

canonical repos moves by @julien-c in #28795
Update all references to canonical models by @LysandreJik in #29001

Flax Improvements 🚀

The Mistral model was added to the library in Flax.

Flax mistral by @kiansierra in #26943

TensorFlow Improvements 🚀

With Keras 3 becoming the standard version of Keras in TensorFlow 2.16, we've made some internal changes to maintain compatibility. We now have full compatibility with TF 2.16 as long as the tf-keras compatibility package is installed. We've also taken the opportunity to do some cleanup - in particular, the objects like BatchEncoding that are returned by our tokenizers and processors can now be directly passed to Keras methods like model.fit(), which should simplify a lot of code and eliminate a long-standing source of annoyances.

Add tf_keras imports to prepare for Keras 3 by @Rocketknight1 in #28588
Wrap Keras methods to support BatchEncoding by @Rocketknight1 in #28734
Fix Keras scheduler import so it works for older versions of Keras by @Rocketknight1 in #28895

Pre-Trained backbone weights 🚀

Enable loading pretrained backbone weights into a new model in which all other weights are randomly initialized. Note: validation checks are still in place when creating a config, so passing use_pretrained_backbone=True will raise an error. You can override this by setting config.use_pretrained_backbone = True after creating the config. However, it is not yet guaranteed to be fully backwards compatible.

```py
from transformers import MaskFormerConfig, MaskFormerModel

config = MaskFormerConfig(
    use_pretrained_backbone=False,
    backbone="microsoft/resnet-18"
)
config.use_pretrained_backbone = True

# Both models have resnet-18 backbone weights and all other weights randomly
# initialized
model1 = MaskFormerModel(config)
model2 = MaskFormerModel(config)
```

Enable instantiating model with pretrained backbone weights by @amyeroberts in #28214

Introduce a helper function load_backbone to load a backbone from a backbone's model config e.g. ResNetConfig, or from a model config which contains backbone information. This enables cleaner modeling files and crossloading between timm and transformers backbones.

```py
from transformers import ResNetConfig, MaskFormerConfig
from transformers.utils.backbone_utils import load_backbone

# Resnet defines the backbone model to load
config = ResNetConfig()
backbone = load_backbone(config)

# Maskformer config defines a model which uses a resnet backbone
config = MaskFormerConfig(use_timm_backbone=True, backbone="resnet18")
backbone = load_backbone(config)

config = MaskFormerConfig(backbone_config=ResNetConfig())
backbone = load_backbone(config)
```

[Backbone] Use load_backbone instead of AutoBackbone.from_config by @amyeroberts in #28661
Backbone kwargs in config by @amyeroberts in #28784

Add API references, list supported backbones, update examples, and clarify and move information to better reflect usage and docs.

[docs] Backbone by @stevhliu in #28739
Improve Backbone API docs by @merveenoyan in #28666

Image Processor work 🚀

Raise unused kwargs image processor by @molbap in #29063
Abstract image processor arg checks by @molbap in #28843

Bugfixes and improvements 🚀

Fix id2label assignment in run_classification.py by @jheitmann in #28590
Add missing key to TFLayoutLM signature by @Rocketknight1 in #28640
Avoid root logger's level being changed by @ydshieh in #28638
Add config tip to custom model docs by @Rocketknight1 in #28601
Fix lr_scheduler in no_trainer training scripts by @bofenghuang in #27872
[Llava] Update convert_llava_weights_to_hf.py script by @isaac-vidas in #28617
[GPTNeoX] Fix GPTNeoX + Flash Attention 2 issue by @younesbelkada in #28645
Update image_processing_deformable_detr.py by @sounakdey in #28561
[SigLIP] Only import tokenizer if sentencepiece available by @amyeroberts in #28636
Fix phi model doc checkpoint by @amyeroberts in #28581
get default device through PartialState().default_device as it has been officially released by @statelesshz in #27256
integrations: fix DVCLiveCallback model logging by @dberenbaum in #28653
Enable safetensors conversion from PyTorch to other frameworks without the torch requirement by @LysandreJik in #27599
tensor_size - fix copy/paste error msg typo by @scruel in #28660
Fix windows err with checkpoint race conditions by @muellerzr in #28637
add dataloader prefetch factor in training args and trainer by @qmeeus in #28498
Support single token decode for CodeGenTokenizer by @cmathw in #28628
Remove deprecated eager_serving fn by @Rocketknight1 in #28665
fix a hidden bug of GenerationConfig, now the generation_config.json can be loaded successfully by @ParadoxZW in #28604
Update README_es.md by @vladydev3 in #28612
Exclude the load balancing loss of padding tokens in Mixtral-8x7B by @khaimt in #28517
Use save_safetensor to disable safe serialization for XLA by @jeffhataws in #28669
Add back in generation types by @amyeroberts in #28681
[docs] DeepSpeed by @stevhliu in #28542
Improved type hinting for all attention parameters by @nakranivaibhav in #28479
improve efficient training on CPU documentation by @faaany in #28646
[docs] Fix doc format by @stevhliu in #28684
[chore] Add missing space in warning by @tomaarsen in #28695
Update question_answering.md by @yusyel in #28694
[Vilt] align input and model dtype in the ViltPatchEmbeddings forward pass by @faaany in #28633
[docs] Improve visualization for vertical parallelism by @petergtz in #28583
Don't fail when LocalEntryNotFoundError during processor_config.json loading by @ydshieh in #28709
Fix duplicate & unnecessary flash attention warnings by @fxmarty in #28557
support PeftMixedModel signature inspect by @Facico in #28321
fix: corrected misleading log message in save_pretrained function by @mturetskii in #28699
[docs] Update preprocessing.md by @velaia in #28719
Initialize _tqdm_active with hf_hub_utils.are_progress_bars_disabled(… by @ShukantPal in #28717
Fix weights_only by @ydshieh in #28725
Stop confusing the TF compiler with ModelOutput objects by @Rocketknight1 in #28712
fix: suppress GatedRepoError to use cache file (fix #28558). by @scruel in #28566
Unpin pydantic by @ydshieh in #28728
[docs] Fix datasets in guides by @stevhliu in #28715
[Flax] Update no init test for Flax v0.7.1 by @sanchit-gandhi in #28735
Falcon: removed unused function by @gante in #28605
Generate: deprecate old src imports by @gante in #28607
[Siglip] protect from imports if sentencepiece not installed by @amyeroberts in #28737
Add serialization logic to pytree types by @angelayi in #27871
Fix DepthEstimationPipeline's docstring by @ydshieh in #28733
Fix input data file extension in examples by @khipp in #28741
[Docs] Fix Typo in English & Japanese CLIP Model Documentation (TMBD -> TMDB) by @Vinyzu in #28751
PatchtTST and PatchTSMixer fixes by @wgifford in #28083
Enable Gradient Checkpointing in Deformable DETR by @FoamoftheSea in #28686
small doc update for CamemBERT by @julien-c in #28644
Pin pytest version <8.0.0 by @amyeroberts in #28758
Mark test_constrained_beam_search_generate as flaky by @amyeroberts in #28757
Fix typo of Block. by @xkszltl in #28727
[Whisper] Make tokenizer normalization public by @sanchit-gandhi in #28136
Support saving only PEFT adapter in checkpoints when using PEFT + FSDP by @AjayP13 in #28297
Add French translation: french README.md by @ThibaultLengagne in #28696
Don't allow passing load_in_8bit and load_in_4bit at the same time by @osanseviero in #28266
Move CLIP _no_split_modules to CLIPPreTrainedModel by @lz1oceani in #27841
Use Conv1d for TDNN by @gau-nernst in #25728
Fix transformers.utils.fx compatibility with torch<2.0 by @fxmarty in #28774
Further pin pytest version (in a temporary way) by @ydshieh in #28780
Task-specific pipeline init args by @amyeroberts in #28439
Pin Torch to <2.2.0 by @Rocketknight1 in #28785
[bnb] Fix bnb slow tests by @younesbelkada in #28788
Prevent MLflow exception from disrupting training by @codiceSpaghetti in #28779
don't initialize the output embeddings if we're going to tie them to input embeddings by @tom-p-reichel in #28192
[Whisper] Refactor forced_decoder_ids & prompt ids by @patrickvonplaten in #28687
Resolve DeepSpeed cannot resume training with PeftModel by @lh0x00 in #28746
Wrap Keras methods to support BatchEncoding by @Rocketknight1 in #28734
DeepSpeed: hardcode torch.arange dtype on float usage to avoid incorrect initialization by @gante in #28760
Add artifact name in job step to maintain job / artifact correspondence by @ydshieh in #28682
Split daily CI using 2 level matrix by @ydshieh in #28773
[docs] Correct the statement in the docstring of compute_transition_scores in generation/utils.py by @Ki-Seki in #28786
Adding [T5/MT5/UMT5]ForTokenClassification by @hackyon in #28443
Make is_torch_bf16_available_on_device more strict by @ydshieh in #28796
Add tip on setting tokenizer attributes by @Rocketknight1 in #28764
enable graident checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin by @SangbumChoi in #28615
[docs] fix some bugs about parameter description by @zspo in #28806
Add models from deit by @rajveer43 in #28302
[Docs] Fix spelling and grammar mistakes by @khipp in #28825
Explicitly check if token ID's are None in TFBertTokenizer constructor by @skumar951 in #28824
Add missing None check for hf_quantizer by @jganitkevitch in #28804
Fix issues caused by natten by @ydshieh in #28834
fix / skip (for now) some tests before switch to torch 2.2 by @ydshieh in #28838
Use -v for pytest on CircleCI by @ydshieh in #28840
Reduce GPU memory usage when using FSDP+PEFT by @pacman100 in #28830
Mark test_encoder_decoder_model_generate for vision_encoder_deocder as flaky by @amyeroberts in #28842
Support custom scheduler in deepspeed training by @VeryLazyBoy in #26831
[Docs] Fix bad doc: replace save with logging by @chenzizhao in #28855
Ability to override clean_code_for_run by @w4ffl35 in #28783
[WIP] Hard error when ignoring tensors. by @Narsil in #27484
[Doc] update contribution guidelines by @ArthurZucker in #28858
Correct wav2vec2-bert inputs_to_logits_ratio by @ylacombe in #28821
Image Feature Extraction pipeline by @amyeroberts in #28216
ClearMLCallback enhancements: support multiple runs and handle logging better by @eugen-ajechiloae-clearml in #28559
Do not use mtime for checkpoint rotation. by @xkszltl in #28862
Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support by @nakranivaibhav in #28777
[Docs] Update project names and links in awesome-transformers by @khipp in #28878
Fix LongT5ForConditionalGeneration initialization of lm_head by @eranhirs in #28873
Raise error when using save_only_model with load_best_model_at_end for DeepSpeed/FSDP by @pacman100 in #28866
Fix FastSpeech2ConformerModelTest and skip it on CPU by @ydshieh in #28888
Revert "[WIP] Hard error when ignoring tensors." by @ydshieh in #28898
unpin torch by @ydshieh in #28892
Explicit server error on gated model by @Wauplin in #28894
[Docs] Fix backticks in inline code and documentation links by @khipp in #28875
Hotfix - make torchaudio get the correct version in torch_and_flax_job by @ydshieh in #28899
[Docs] Add missing language options and fix broken links by @khipp in #28852
fix: Fixed the documentation for logging_first_step by removing "evaluate" by @Sai-Suraj-27 in #28884
fix Starcoder FA2 implementation by @pacman100 in #28891
Fix Keras scheduler import so it works for older versions of Keras by @Rocketknight1 in #28895
⚠️ Raise Exception when trying to generate 0 tokens ⚠️ by @danielkorat in #28621
Update the cache number by @ydshieh in #28905
Add npu device for pipeline by @statelesshz in #28885
[Docs] Fix placement of tilde character by @khipp in #28913
[Docs] Revert translation of '@slow' decorator by @khipp in #28912
Fix utf-8 yaml load for marian conversion to pytorch in Windows by @SystemPanic in #28618
Remove dead TF loading code by @Rocketknight1 in #28926
fix: torch.int32 instead of torch.torch.int32 by @vodkaslime in #28883
pass kwargs in stopping criteria list by @zucchini-nlp in #28927
Support batched input for decoder start ids by @zucchini-nlp in #28887
[Docs] Fix broken links and syntax issues by @khipp in #28918
Fix max_position_embeddings default value for llama2 to 4096 #28241 by @karl-hajjar in #28754
Fix a wrong link to CONTRIBUTING.md section in PR template by @B-Step62 in #28941
Fix type annotations on neftune_noise_alpha and fsdp_config TrainingArguments parameters by @peblair in #28942
[i18n-de] Translate README.md to German by @khipp in #28933
[Nougat] Fix pipeline by @NielsRogge in #28242
[Docs] Update README and default pipelines by @NielsRogge in #28864
Convert torch_dtype as str to actual torch data type (i.e. "float16" …to torch.float16) by @KossaiSbai in #28208
[pipelines] updated docstring with vqa alias by @cmahmut in #28951
Tests: tag test_save_load_fast_init_from_base as flaky by @gante in #28930
Updated requirements for image-classification samples: datasets>=2.14.0 by @alekseyfa in #28974
Always initialize tied output_embeddings if it has a bias term by @hackyon in #28947
Clean up staging tmp checkpoint directory by @woshiyyya in #28848
[Docs] Add language identifiers to fenced code blocks by @khipp in #28955
[Docs] Add video section by @NielsRogge in #28958
[i18n-de] Translate CONTRIBUTING.md to German by @khipp in #28954
[NllbTokenizer] refactor with added tokens decoder by @ArthurZucker in #27717
Add sudachi_projection option to BertJapaneseTokenizer by @hiroshi-matsuda-rit in #28503
Update configuration_llama.py: fixed broken link by @AdityaKane2001 in #28946
[DETR] Update the processing to adapt masks & bboxes to reflect padding by @amyeroberts in #28363
ENH: Do not pass warning message in case quantization_config is in config but not passed as an arg by @younesbelkada in #28988
ENH [AutoQuantizer]: enhance trainer + not supported quant methods by @younesbelkada in #28991
Add SiglipForImageClassification and CLIPForImageClassification by @NielsRogge in #28952
[Doc] Fix docbuilder - make BackboneMixin and BackboneConfigMixin importable from utils. by @amyeroberts in #29002
Set the dataset format used by test_trainer to float32 by @statelesshz in #28920
Introduce AcceleratorConfig dataclass by @muellerzr in #28664
Fix flaky test vision encoder-decoder generate by @zucchini-nlp in #28923
Mask Generation Task Guide by @merveenoyan in #28897
Add tie_weights() to LM heads and set bias in set_output_embeddings() by @hackyon in #28948
[TPU] Support PyTorch/XLA FSDP via SPMD by @alanwaketan in #28949
FIX [Trainer / tags]: Fix trainer + tags when users do not pass "tags" to trainer.push_to_hub() by @younesbelkada in #29009
Add cuda_custom_kernel in DETA by @SangbumChoi in #28989
DeformableDetrModel support fp16 by @DonggeunYu in #29013
Fix copies between DETR and DETA by @amyeroberts in #29037
FIX: Fix error with logger.warning + inline with recent refactor by @younesbelkada in #29039
Patch to skip failing test_save_load_low_cpu_mem_usage tests by @amyeroberts in #29043
Fix a tiny typo in generation/utils.py::GenerateEncoderDecoderOutput's docstring by @sadra-barikbin in #29044
add test marker to run all tests with @require_bitsandbytes by @Titus-von-Koeller in #28278
Update important model list by @LysandreJik in #29019
Fix max_length criteria when using inputs_embeds by @zucchini-nlp in #28994
Support : Leverage Accelerate for object detection/segmentation models by @Tanmaypatil123 in #28312
fix num_assistant_tokens with heuristic schedule by @jmamou in #28759
fix failing trainer ds tests by @pacman100 in #29057
auto_find_batch_size isn't yet supported with DeepSpeed/FSDP. Raise error accrodingly. by @pacman100 in #29058
Honor trust_remote_code for custom tokenizers by @rl337 in #28854
Feature: Option to set the tracking URI for MLflowCallback. by @seanswyi in #29032
Fix trainer test wrt DeepSpeed + auto_find_bs by @muellerzr in #29061
Add chat support to text generation pipeline by @Rocketknight1 in #28945
[Docs] Spanish translation of task_summary.md by @aaronjimv in #28844
[Awq] Add peft support for AWQ by @younesbelkada in #28987
FIX [bnb / tests]: Fix currently failing bnb tests by @younesbelkada in #29092
fix the post-processing link by @davies-w in #29091
Fix the bert-base-cased tokenizer configuration test by @LysandreJik in #29105
Fix a typo in examples/pytorch/text-classification/run_classification.py by @Ja1Zhou in #29072
change version by @ArthurZucker in #29097
[Docs] Add resources by @NielsRogge in #28705
ENH: added new output_logits option to generate function by @mbaak in #28667
Bnb test fix for different hardwares by @Titus-von-Koeller in #29066
Fix two tiny typos in pipelines/base.py::Pipeline::_sanitize_parameters()'s docstring by @sadra-barikbin in #29102
storing & logging gradient norm in trainer by @shijie-wu in #27326
Fixed nll with label_smoothing to just nll by @nileshkokane01 in #28708
[gradient_checkpointing] default to use it for torch 2.3 by @ArthurZucker in #28538
Move misplaced line by @kno10 in #29117
FEAT [Trainer / bnb]: Add RMSProp from bitsandbytes to HF Trainer by @younesbelkada in #29082
Abstract image processor arg checks. by @molbap in #28843
FIX [bnb / tests] Propagate the changes from #29092 to 4-bit tests by @younesbelkada in #29122
Llama: fix batched generation by @gante in #29109
Generate: unset GenerationConfig parameters do not raise warning by @gante in #29119
[cuda kernels] only compile them when initializing by @ArthurZucker in #29133
FIX [PEFT / Trainer ] Handle better peft + quantized compiled models by @younesbelkada in #29055
[Core tokenization] add_dummy_prefix_space option to help with latest issues by @ArthurZucker in #28010
Revert low cpu mem tie weights by @amyeroberts in #29135
Add support for fine-tuning CLIP-like models using contrastive-image-text example by @tjs-intel in #29070
Save (circleci) cache at the end of a job by @ydshieh in #29141
[Phi] Add support for sdpa by @hackyon in #29108
Generate: missing generation config eos token setting in encoder-decoder tests by @gante in #29146
Added image_captioning version in es and included in toctree file by @gisturiz in #29104
Fix drop path being ignored in DINOv2 by @fepegar in #29147
[pipeline] Add pool option to image feature extraction pipeline by @amyeroberts in #28985

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@nakranivaibhav
- Improved type hinting for all attention parameters (#28479)
- Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support (#28777)
@khipp
- Fix input data file extension in examples (#28741)
- [Docs] Fix spelling and grammar mistakes (#28825)
- [Docs] Update project names and links in awesome-transformers (#28878)
- [Docs] Fix backticks in inline code and documentation links (#28875)
- [Docs] Add missing language options and fix broken links (#28852)
- [Docs] Fix placement of tilde character (#28913)
- [Docs] Revert translation of '@slow' decorator (#28912)
- [Docs] Fix broken links and syntax issues (#28918)
- [i18n-de] Translate README.md to German (#28933)
- [Docs] Add language identifiers to fenced code blocks (#28955)
- [i18n-de] Translate CONTRIBUTING.md to German (#28954)
@ThibaultLengagne
- Add French translation: french README.md (#28696)
@poedator
- HfQuantizer class for quantization-related stuff in modeling_utils.py (#26610)
@kiansierra
- Flax mistral (#26943)
@hackyon
- Adding [T5/MT5/UMT5]ForTokenClassification (#28443)
- Always initialize tied output_embeddings if it has a bias term (#28947)
- Add tie_weights() to LM heads and set bias in set_output_embeddings() (#28948)
- [Phi] Add support for sdpa (#29108)
@SangbumChoi
- enable graident checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin (#28615)
- Add cuda_custom_kernel in DETA (#28989)
@rajveer43
- Add models from deit (#28302)
@jon-tow
- Add StableLM (#28810)

- Python
Published by LysandreJik over 2 years ago

transformers - Patch release v4.37.2

Selection of fixes

- Protecting the imports for SigLIP's tokenizer if sentencepiece isn't installed
- Fix permissions issue on Windows machines when using trainer in multi-node setup
- Allow disabling safe serialization when using Trainer. Needed for Neuron SDK
- Fix error when loading processor from cache
- torch < 1.13 compatible torch.load

Commits

[Siglip] protect from imports if sentencepiece not installed (#28737)
Fix weights_only (#28725)
Enable safetensors conversion from PyTorch to other frameworks without the torch requirement (#27599)
Don't fail when LocalEntryNotFoundError during processor_config.json loading (#28709)
Use save_safetensor to disable safe serialization for XLA (#28669)
Fix windows err with checkpoint race conditions (#28637)
[SigLIP] Only import tokenizer if sentencepiece available (#28636)

- Python
Published by amyeroberts over 2 years ago

transformers - Patch release: v4.37.1

A patch release to resolve import errors from removed custom types in generation utils

Add back in generation types #28681

- Python
Published by amyeroberts over 2 years ago

transformers - v4.37 Qwen2, Phi-2, SigLIP, ViP-LLaVA, FastSpeech2Conformer, 4-bit serialization, Whisper longform generation

Model releases

Qwen2

Qwen2 is the new model series of large language models from the Qwen team. Previously, the Qwen series was released, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Add qwen2 by @JustinLin610 in #28436

Phi-2

Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.

[Phi2] Add support for phi2 models by @susnato in #28211
[Phi] Extend implementation to use GQA/MQA. by @gugarosa in #28163
update docs to add the phi-2 example by @susnato in #28392
Fixes default value of softmax_scale in PhiFlashAttention2. by @gugarosa in #28537

SigLIP

The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.

Add SigLIP by @NielsRogge in #26522
[SigLIP] Don't pad by default by @NielsRogge in #28578
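The pairwise sigmoid loss mentioned above can be sketched in plain Python: every image-text pair is scored independently, with matching pairs labeled +1 and all other pairs labeled -1. This is a minimal illustration of the technique, not the library's implementation; the function names and the default `temperature` and `bias` values are assumptions chosen for the example, and embeddings are assumed to be L2-normalized.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Mean over images of the summed pairwise sigmoid losses.

    Matching pairs (i == j) get label +1, all other pairs get label -1.
    Embeddings are assumed to be L2-normalized lists of floats.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n

# Correctly aligned pairs versus the same embeddings with texts swapped.
aligned = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Unlike a softmax-based contrastive loss, no normalization across the batch is needed, since each pair contributes its own binary term.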

ViP-LLaVA

The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.

VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.

Adds VIP-llava to transformers by @younesbelkada in #27932
Fix Vip-llava docs by @younesbelkada in #28085

FastSpeech2Conformer

The FastSpeech2Conformer model was proposed in the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.

FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds on FastSpeech, improving training speed, inference speed, and voice quality. It consists of a variance adapter; duration, energy, and pitch predictors; and waveform and mel-spectrogram decoders.

Add FastSpeech2Conformer by @connor-henderson in #23439

Wav2Vec2-BERT

The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.

Add new meta w2v2-conformer BERT-like model by @ylacombe in #28165
Add w2v2bert to pipeline by @ylacombe in #28585

4-bit serialization

Enables saving and loading transformers models in 4-bit formats: you can now push bitsandbytes 4-bit weights to the Hugging Face Hub. To save 4-bit models and push them to the Hub, install the latest bitsandbytes package from PyPI (pip install -U bitsandbytes), load your model in 4-bit precision and call save_pretrained / push_to_hub. An example repo here

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
```

[bnb] Let's make serialization of 4bit models possible by @poedator in #26037
[Docs] Add 4-bit serialization docs by @younesbelkada in #28182

4D Attention mask

Enable passing in 4D attention masks to models that support it. This is useful for reducing memory footprint of certain generation tasks.

4D attention_mask support by @poedator in #27539
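
To make the idea concrete, here is a minimal sketch of what a 4D mask can express: two sequences packed into a single row, each attending causally only to itself. The helper name and the 0/1 nested-list representation are assumptions for illustration; the exact tensor dtype and value convention a given model expects are not covered by these notes.

```python
def packed_causal_mask(sequence_lengths):
    """Build a [1, 1, total, total] mask for sequences packed into one row.

    mask[0][0][q][k] == 1 means query position q may attend to key position k.
    """
    total = sum(sequence_lengths)
    # segment_ids[p] tells which packed sequence position p belongs to
    segment_ids = [seg for seg, length in enumerate(sequence_lengths) for _ in range(length)]
    mask_2d = [
        [1 if segment_ids[q] == segment_ids[k] and k <= q else 0 for k in range(total)]
        for q in range(total)
    ]
    return [[mask_2d]]  # add the batch and head dimensions


mask = packed_causal_mask([2, 3])
for row in mask[0][0]:
    print(row)
```

A plain 2D mask can only mark positions as padded or not, whereas the extra query dimension lets several prompts share one forward pass without attending to each other.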

Improved quantization support

Ability to customise which modules are quantized and which are not.

[Awq] Enable the possibility to skip quantization for some target modules by @younesbelkada in #27950
add modules_in_block_to_quantize arg in GPTQconfig by @SunMarc in #27956

Added fused modules support

[docs] Fused AWQ modules by @stevhliu in #27896
[Awq] Add llava fused modules support by @younesbelkada in #28239
[Mixtral / Awq] Add mixtral fused modules for Awq by @younesbelkada in #28240

SDPA Support for LLaVa, Mixtral, Mistral

Fix SDPA correctness following torch==2.1.2 regression by @fxmarty in #27973
[Llava / Vip-Llava] Add SDPA into llava by @younesbelkada in #28107
[Mixtral & Mistral] Add support for sdpa by @ArthurZucker in #28133
[SDPA] Make sure attn mask creation is always done on CPU by @patrickvonplaten in #28400
Fix SDPA tests by @fxmarty in #28552

Whisper: Batched state-of-the-art long-form transcription

All decoding strategies (temperature fallback, compression/log-prob/no-speech threshold, ...) of OpenAI's long-form transcription (see: https://github.com/openai/whisper or section 4.5 in paper) have been added. Contrary to https://github.com/openai/whisper, Transformers long-form transcription is fully compatible with pure FP16 and Batching!

For more information see: https://github.com/huggingface/transformers/pull/27658.

[Whisper] Finalize batched SOTA long-form generation by @patrickvonplaten in #27658

Generation: assisted generation upgrades, speculative decoding, and ngram speculation

Assisted generation was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate ngram speculation, and opens the door for new candidate generation methods. Additionally, we've added the speculative decoding strategy on top of assisted generation: when you call assisted generation with an assistant model and do_sample=True, you'll benefit from the faster speculative decoding sampling 🏎️💨

Generate: assisted_decoding now accepts arbitrary candidate generators by @gante in #27751
Generate: assisted decoding now uses generate for the assistant by @gante in #28031
Generate: speculative decoding by @gante in #27979
Generate: fix speculative decoding by @gante in #28166
Adding Prompt lookup decoding by @apoorvumang in #27775
Fix _speculative_sampling implementation by @ofirzaf in #28508
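
The n-gram speculation mentioned above (prompt lookup decoding) can be sketched without any model: candidate tokens are copied from an earlier occurrence of the most recent n-gram. The function below is a simplified illustration with hypothetical names and defaults, not the library's candidate generator. In assisted generation the main model would then verify the proposed tokens in a single forward pass, which is not shown here.

```python
def prompt_lookup_candidates(token_ids, ngram_size=2, num_candidates=3):
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan right-to-left so the most recent earlier occurrence wins
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = token_ids[start + ngram_size:start + ngram_size + num_candidates]
            if follow:
                return follow
    return []


# The sequence ends with (5, 6), which was earlier followed by 7, 8, 9
print(prompt_lookup_candidates([5, 6, 7, 8, 9, 5, 6]))  # [7, 8, 9]
```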

torch.load pickle protection

Adding pickle protection via weights_only=True in the torch.load calls.

make torch.load a bit safer by @julien-c in #27282

Build methods for TensorFlow Models

Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper build() methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues, as well as significantly accelerating model load times.

Proper build() methods for TF by @Rocketknight1 in #27794
Replace build() with build_in_name_scope() for some TF tests by @Rocketknight1 in #28046
More TF fixes by @Rocketknight1 in #28081
Even more TF test fixes by @Rocketknight1 in #28146

Remove support for torch 1.10

The last version to support PyTorch 1.10 was 4.36.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.11 and up, we do not support PyTorch 1.10 for v4.37 (i.e. we don't run the tests against torch 1.10).

Byebye torch 1.10 by @ydshieh in #28207

Model tagging

You can now add custom tags to your model before pushing it to the Hub! This enables you to filter models that contain that tag on the Hub with a simple URL filter. For example, to filter models that have the trl tag, you can search: https://huggingface.co/models?other=trl&sort=created

[core/ FEAT] Add the possibility to push custom tags using PreTrainedModel itself by @younesbelkada in #28405 - e.g.

```python
from transformers import AutoModelForCausalLM

model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)

model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")
```

Bugfixes and improvements

Fix PatchTSMixer Docstrings by @vijaye12 in #27943
use logger.warning_once to avoid massive outputs by @ranchlai in #27428
Docs for AutoBackbone & Backbone by @merveenoyan in #27456
Fix test for auto_find_batch_size on multi-GPU by @muellerzr in #27947
Update import message by @NielsRogge in #27946
Fix parameter count in readme for mixtral 45b by @CyberTimon in #27945
In PreTrainedTokenizerBase add missing word in error message by @petergtz in #27949
Fix AMD scheduled CI not triggered by @ydshieh in #27951
Add deepspeed test to amd scheduled CI by @echarlaix in #27633
Fix a couple of typos and add an illustrative test by @rjenc29 in #26941
fix bug in mask2former: cost matrix is infeasible by @xuchenhao001 in #27897
Fix for stochastic depth decay rule in the TimeSformer implementation by @atawari in #27875
fix no sequence length models error by @AdamLouly in #27522
[Mixtral] Change mistral op order by @younesbelkada in #27955
Update bounding box format everywhere by @NielsRogge in #27944
Support PeftModel signature inspect by @dancingpipi in #27865
fixed typos (issue 27919) by @asusevski in #27920
Hot-fix-mixstral-loss by @ArthurZucker in #27948
Fix link in README.md of Image Captioning by @saswatmeher in #27969
Better key error for AutoConfig by @Rocketknight1 in #27976
[doc] fix typo by @stas00 in #27981
fix typo in dvclive callback by @dberenbaum in #27983
[Tokenizer Serialization] Fix the broken serialisation by @ArthurZucker in #27099
[Whisper] raise better errors by @ArthurZucker in #27971
Fix PatchTSMixer slow tests by @ajati in #27997
[CI slow] Fix expected values by @ArthurZucker in #27999
Fix bug with rotating checkpoints by @muellerzr in #28009
[Doc] Spanish translation of glossary.md by @aaronjimv in #27958
Add model_docs from cpmant.md to derformable_detr.md by @rajveer43 in #27884
well well well by @ArthurZucker in #28011
[SeamlessM4TTokenizer] Safe import by @ArthurZucker in #28026
[core / modeling] Fix training bug with PEFT + GC by @younesbelkada in #28031
Fix AMD push CI not triggered by @ydshieh in #28029
SeamlessM4T: test_retain_grad_hidden_states_attentions is flaky by @gante in #28035
Fix languages covered by M4Tv2 by @ylacombe in #28019
Fixed spelling error in T5 tokenizer warning message (s/thouroughly/t… by @jeddobson in #28014
Generate: Mistral/Mixtral FA2 cache fix when going beyond the context window by @gante in #28037
[Seamless] Fix links in docs by @sanchit-gandhi in #27905
Remove warning when Annotion enum is created by @amyeroberts in #28048
[FA-2] Fix fa-2 issue when passing config to from_pretrained by @younesbelkada in #28043
[Modeling / Mixtral] Fix GC + PEFT issues with Mixtral by @younesbelkada in #28061
[Flax BERT] Update deprecated 'split' method by @sanchit-gandhi in #28012
[Flax LLaMA] Fix attn dropout by @sanchit-gandhi in #28059
Remove SpeechT5 deprecated argument by @ylacombe in #28062
doc: Correct spelling mistake by @caiyili in #28064
[Mixtral] update conversion script to reflect new changes by @younesbelkada in #28068
Skip M4T test_retain_grad_hidden_states_attentions by @ylacombe in #28060
[LLaVa] Add past_key_values to _skip_keys_device_placement to fix multi-GPU dispatch by @aismlv in #28051
Make GPT2 traceable in meta state by @kwen2501 in #28054
Fix bug for checkpoint saving on multi node training setting by @dumpmemory in #28078
Update fixtures-image-utils by @lhoestq in #28080
Fix low_cpu_mem_usage Flag Conflict with DeepSpeed Zero 3 in from_pretrained for Models with keep_in_fp32_modules by @kotarotanahashi in #27762
Fix wrong examples in llava usage. by @Lyken17 in #28020
[docs] Trainer by @stevhliu in #27986
[docs] MPS by @stevhliu in #28016
fix resuming from ckpt when using FSDP with FULL_STATE_DICT by @pacman100 in #27891
Fix the deprecation warning of _torch_pytree._register_pytree_node by @cyyever in #27803
Spelling correction by @saeneas in #28110
in peft finetune, only the trainable parameters need to be saved by @sywangyi in #27825
fix ConversationalPipeline docstring by @not-lain in #28091
Disable jitter noise during evaluation in SwitchTransformers by @DaizeDong in #28077
Remove warning if DISABLE_TELEMETRY is used by @Wauplin in #28113
Fix indentation error - semantic_segmentation.md by @rajveer43 in #28117
[docs] General doc fixes by @stevhliu in #28087
Fix a typo in tokenizer documentation by @mssalvatore in #28118
[Doc] Fix token link in What 🤗 Transformers can do by @aaronjimv in #28123
When save a model on TPU, make a copy to be moved to CPU by @qihqi in #27993
Update split string in doctest to reflect #28087 by @amyeroberts in #28135
[Mixtral] Fix loss + nits by @ArthurZucker in #28115
Update modeling_utils.py by @mzelling in #28127
[docs] Fix mistral link in mixtral.md by @aaronjimv in #28143
Remove deprecated CPU dockerfiles by @ashahba in #28149
Fix FA2 integration by @pacman100 in #28142
[gpt-neox] Add attention_bias config to support model trained without attention biases by @dalgarak in #28126
move code to Trainer.evaluate to enable use of that function with multiple datasets by @peter-sk in #27844
Fix weights not properly initialized due to shape mismatch by @ydshieh in #28122
Avoid unnecessary warnings when loading CLIPConfig by @ydshieh in #28108
Update FA2 exception msg to point to hub discussions by @amyeroberts in #28161
Align backbone stage selection with out_indices & out_features by @amyeroberts in #27606
[docs] Trainer docs by @stevhliu in #28145
Fix yolos resizing by @amyeroberts in #27663
disable test_retain_grad_hidden_states_attentions on SeamlessM4TModelWithTextInputTest by @dwyatte in #28169
Fix input_embeds docstring in encoder-decoder architectures by @gante in #28168
[Whisper] Use torch for stft if available by @sanchit-gandhi in #26119
Fix slow backbone tests - out_indices must match stage name ordering by @amyeroberts in #28186
Update YOLOS slow test values by @amyeroberts in #28187
Update docs/source/en/perf_infer_gpu_one.md by @ydshieh in #28198
Fix ONNX export for causal LM sequence classifiers by removing reverse indexing by @dwyatte in #28144
Add Swinv2 backbone by @NielsRogge in #27742
Fix: [SeamlessM4T - S2TT] Bug in batch loading of audio in torch.Tensor format in the SeamlessM4TFeatureExtractor class by @nicholasneo78 in #27914
Bug: training_args.py fix missing import with accelerate with version accelerate==0.20.1 by @michaelfeil in #28171
Fix the check of models supporting FA/SDPA not run by @ydshieh in #28202
Drop feature_extractor_type when loading an image processor file by @ydshieh in #28195
[Whisper] Fix word-level timestamps with bs>1 or num_beams>1 by @ylacombe in #28114
Fixing visualization code for object detection to support both types of bounding box. by @Anindyadeep in #27842
update the logger message with accordant weights_file_name by @izyForever in #28181
[Llava] Fix llava index errors by @younesbelkada in #28032
fix FA2 when using quantization by @pacman100 in #28203
small typo by @stas00 in #28229
Update docs around mixing hf scheduler with deepspeed optimizer by @dwyatte in #28223
Fix trainer saving safetensors: metadata is None by @hiyouga in #28219
fix bug:divide by zero in _maybe_log_save_evaluate() by @frankenliu in #28251
[Whisper] Fix errors with MPS backend introduced by new code on word-level timestamps computation by @ercaronte in #28288
Remove fast tokenization warning in Data Collators by @dbuos in #28213
fix documentation for zero_shot_object_detection by @not-lain in #28267
Remove token_type_ids from model_input_names (like #24788) by @Apsod in #28325
Translate contributing.md into Chinese by @Mayfsz in #28243
[docs] Sort es/toctree.yml | Translate performance.md by @aaronjimv in #28262
Fix error in M4T feature extractor by @ylacombe in #28340
README: install transformers from conda-forge channel by @kevherro in #28313
Don't check the device when device_map=auto by @yuanwu2017 in #28351
Fix pos_mask application and update tests accordingly by @ferjorosa in #27892
fix FA2 when using quantization for remaining models by @susnato in #28341
Update VITS modeling to enable ONNX export by @echarlaix in #28141
chore: Fix typo s/exclusivelly/exclusively/ by @hugo-syn in #28361
Enhancing Code Readability and Maintainability with Simplified Activation Function Selection. by @hi-sushanta in #28349
Fix building alibi tensor when num_heads is not a power of 2 by @abuelnasr0 in #28380
remove two deprecated function by @statelesshz in #28220
Bugfix / ffmpeg input device (mic) not working on Windows by @Teapack1 in #27051
[AttentionMaskConverter] fix sdpa unmask unattended by @zspo in #28369
Remove shell=True from subprocess.Popen to Mitigate Security Risk by @avimanyu786 in #28299
Add segmentation map processing to SAM Image Processor by @rwood-97 in #27463
update warning for image processor loading by @ydshieh in #28209
Fix initialization for missing parameters in from_pretrained under ZeRO-3 by @XuehaiPan in #28245
Fix _merge_input_ids_with_image_features for llava model by @VictorSanh in #28333
Use mmap option to load_state_dict by @weimingzha0 in #28331
[BUG] BarkEosPrioritizerLogitsProcessor eos_token_id use list, tensor size mismatch by @inkinworld in #28201
Skip now failing test in the Trainer tests by @muellerzr in #28421
Support DeepSpeed when using auto find batch size by @muellerzr in #28088
Fix number of models in README.md by @prasatee in #28430
CI: limit natten version by @gante in #28432
Fix for checkpoint rename race condition by @tblattner in #28364
Fix load correct tokenizer in Mixtral model documentation by @JuanFKurucz in #28437
[docstring] Fix docstring for ErnieConfig, ErnieMConfig by @Sparty in #27029
[Whisper] Fix slow test by @patrickvonplaten in #28407
Assitant model may on a different device by @jiqing-feng in #27995
Enable multi-label image classification in pipeline by @amyeroberts in #28433
Optimize the speed of the truncate_sequences function. by @ikkvix in #28263
Use python 3.10 for docbuild by @ydshieh in #28399
Fix docker file by @ydshieh in #28452
Set cache_dir for evaluate.load() in example scripts by @aphedges in #28422
Optionally preprocess segmentation maps for MobileViT by @harisankar95 in #28420
Correctly resolve trust_remote_code=None for AutoTokenizer by @Rocketknight1 in #28419
Fix load balancing loss func for mixtral by @liangxuZhang in #28256
Doc by @jiqing-feng in #28431
Fix docstring checker issues with PIL enums by @Rocketknight1 in #28450
Fix broken link on page by @keenranger in #28451
Mark two logger tests as flaky by @amyeroberts in #28458
Update metadata loading for oneformer by @amyeroberts in #28398
Fix torch.ones usage in xlnet by @sungho-ham in #28471
Generate: deprecate old public functions by @gante in #28478
Docs: add model paths by @gante in #28475
Generate: refuse to save bad generation config files by @gante in #28477
TF: purge TFTrainer by @gante in #28483
Fix docstrings and update docstring checker error message by @Rocketknight1 in #28460
Change progress logging to once across all nodes by @siddartha-RE in #28373
Generate: fix candidate device placement by @gante in #28493
Fix paths to AI Sweden Models reference and model loading by @JuanFKurucz in #28423
[chore] Update warning text, a word was missing by @tomaarsen in #28017
Don't set finetuned_from if it is a local path by @ydshieh in #28482
Add the XPU device check for pipeline mode by @yuanwu2017 in #28326
Tokenizer kwargs in textgeneration pipe by @thedamnedrhino in #28362
[GPTQ] Fix test by @SunMarc in #28018
Fixed minor typos by @rishit5 in #28489
Add a use_safetensors arg to TFPreTrainedModel.from_pretrained() by @Rocketknight1 in #28511
Generate: consolidate output classes by @gante in #28494
fix: sampling in flax keeps EOS by @borisdayma in #28378
improve dev setup comments and hints by @4imothy in #28495
SiLU activation wrapper for safe importing by @amyeroberts in #28509
Remove task arg in load_dataset in image-classification example by @regisss in #28408
Improving Training Performance and Scalability Documentation by @HamzaFB in #28497
Fix mismatching loading in from_pretrained with/without accelerate by @fxmarty in #28414
Fix/speecht5 bug by @NimaYaqmuri in #28481
[ TokenizationUtils] Fix add_special_tokens when the token is already there by @ArthurZucker in #28520
[TokenizationRoformerFast] Fix the save and loading by @ArthurZucker in #28527
[SpeechT5Tokenization] Add copied from and fix the convert_tokens_to_string to match the fast decoding scheme by @ArthurZucker in #28522
Clearer error for SDPA when explicitely requested by @fxmarty in #28006
Add is_model_supported for fx by @inisis in #28521
Config: warning when saving generation kwargs in the model config by @gante in #28514
[Makefile] Exclude research projects from format by @patrickvonplaten in #28551
symbolic_trace: add past_key_values, llama, sdpa support by @fxmarty in #28447
Allow to train dinov2 with different dtypes like bf16 by @StarCycle in #28504
Fix Switch Transformers When sparse_step = 1 by @agemagician in #28564
Save Processor by @ydshieh in #27761
Use weights_only only if torch >= 1.13 by @ydshieh in #28506
[Core Tokenization] Support a fix for spm fast models by @ArthurZucker in #26678
Use LoggingLevel context manager in 3 tests by @ydshieh in #28575
Fix the documentation checkpoint for xlm-roberta-xl by @jeremyfowers in #28567
[ASR Pipe] Update init to set model type and subsequently call parent init method by @sanchit-gandhi in #28486
[Whisper Tok] Move token ids to CPU when computing offsets by @sanchit-gandhi in #28485
[Whisper] Fix audio classification with weighted layer sum by @sanchit-gandhi in #28563
Making CTC training example more general by @ylacombe in #28582
Don't save processor_config.json if a processor has no extra attribute by @ydshieh in #28584
Fix wrong xpu device in DistributedType.MULTI_XPU mode by @faaany in #28386
[GPTNeoX] Fix BC issue with 4.36 by @ArthurZucker in #28602

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@aaronjimv
- [Doc] Spanish translation of glossary.md (#27958)
- [Doc] Fix token link in What 🤗 Transformers can do (#28123)
- [docs] Fix mistral link in mixtral.md (#28143)
- [docs] Sort es/toctree.yml | Translate performance.md (#28262)
@rajveer43
- Add model_docs from cpmant.md to derformable_detr.md (#27884)
- Fix indentation error - semantic_segmentation.md (#28117)
@poedator
- 4D attention_mask support (#27539)
- [bnb] Let's make serialization of 4bit models possible (#26037)
@connor-henderson
- Add FastSpeech2Conformer (#23439)
@JustinLin610
- Add qwen2 (#28436)
@SangbumChoi
- enable training mask2former and maskformer for transformers trainer (#28277)
- [DETA] Improvement and Sync from DETA especially for training (#27990)
- fix auxiliary loss training in DetrSegmentation (#28354)

- Python
Published by amyeroberts over 2 years ago

transformers - Patch release: v4.36.2

Patch release to resolve some critical issues relating to the recent cache refactor, flash attention refactor and training in the multi-gpu and multi-node settings:

Resolve training bug with PEFT + GC #28031
Resolve cache issue when going beyond context window for Mistral/Mixtral FA2 #28037
Re-enable passing config to from_pretrained with FA #28043
Fix resuming from checkpoint when using FSDP with FULL_STATE_DICT #27891
Resolve bug when saving a checkpoint in the multi-node setting #28078

- Python
Published by amyeroberts over 2 years ago

transformers - Patch release: v4.36.1

A patch release for critical torch issues mostly:

Fix SDPA correctness following torch==2.1.2 regression #27973
[Tokenizer Serialization] Fix the broken serialisation #27099
Fix bug with rotating checkpoints #28009
Hot-fix-mixstral-loss (#27948)

🔥

- Python
Published by ArthurZucker over 2 years ago

transformers - v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa wide-spread support

New model additions

Mixtral

Mixtral is the new open-source model from Mistral AI, announced in the blog post Mixtral of Experts. According to the benchmark results shared in the release blog post, the model has capabilities comparable to Chat-GPT.

The architecture is a sparse Mixture of Experts with a Top-2 routing strategy, similar to the NllbMoe architecture in transformers. You can use it through the AutoModelForCausalLM interface:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint and its matching tokenizer from the same repository
model_id = "mistralai/Mixtral-8x7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "My favourite condiment is"

# device_map="auto" already places the model, so only the inputs need moving
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
```

The model is compatible with existing optimisation tools such as Flash Attention 2, bitsandbytes and the PEFT library. The checkpoints are released under the mistralai organisation on the Hugging Face Hub.
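
For readers unfamiliar with Top-2 routing, the sketch below shows the general idea for a single token: a router scores every expert, only the two best are evaluated, and their outputs are mixed using renormalised weights. All names here are hypothetical, the experts are toy functions, and the exact normalisation order in the actual implementation may differ.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Softmax over all experts
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the two most probable experts and renormalise their weights
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, experts, router_logits):
    """Combine the outputs of the selected experts, weighted by the router."""
    return sum(weight * experts[i](token) for i, weight in top2_route(router_logits))


routes = top2_route([0.0, 2.0, 1.0, -1.0])
toy_experts = [lambda x: 0.0, lambda x: 10.0, lambda x: 20.0, lambda x: 30.0]
output = moe_layer(1.0, toy_experts, [0.0, 2.0, 1.0, -1.0])
print(routes, output)
```

Because only two experts run per token, the compute per token stays well below what the total parameter count would suggest.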

Llava / BakLlava

Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. In other words, it is a multi-modal version of LLMs fine-tuned for chat / instructions.

The Llava model was proposed in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.

[Llava] Add Llava to transformers by @younesbelkada in #27662
[LLaVa] Some improvements by @NielsRogge in #27895

The integration also includes BakLlava which is a Llava model trained with Mistral backbone.

The model is compatible with the "image-to-text" pipeline:

```py
from transformers import pipeline
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

image = Image.open(requests.get(url, stream=True).raw)
# The <image> placeholder marks where the image features are inserted in the prompt
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
```

You can find all Llava weights under the llava-hf organisation on the Hub.

SeamlessM4T v2

SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version and was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.

SeamlessM4T enables multiple tasks without relying on separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

Add SeamlessM4T v2 by @ylacombe in #27779

PatchTST

The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.

At a high level, the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer that then outputs the prediction length forecast via an appropriate head. The model is illustrated in the following figure:

[Figure: PatchTST model architecture]

[Time series] Add PatchTST by @psinthong in #25927
[Time series] Add PatchTST by @kashif in #27581
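
The patching step described above can be pictured with a small sketch: a sliding window cuts the series into fixed-size vectors, which then play the role of tokens for the Transformer. The function and parameter names are illustrative assumptions; the actual model also handles details such as multiple channels and normalisation that are omitted here.

```python
def make_patches(series, patch_length, stride):
    """Split a 1-D series into (possibly overlapping) fixed-size patches."""
    last_start = len(series) - patch_length
    return [series[start:start + patch_length] for start in range(0, last_start + 1, stride)]


patches = make_patches(list(range(10)), patch_length=4, stride=2)
print(patches)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With a stride equal to the patch length the patches do not overlap, and the sequence the Transformer sees shrinks by roughly that factor.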

PatchTSMixer

The PatchTSMixer model was proposed in TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.

PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer’s capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression.

[Time series] Add PatchTSMixer by @ajati in #26247

CLVP

The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in Better speech synthesis through scaling by James Betker.

Add CLVP by @susnato in #24745

Phi-1/1.5

The Phi-1 model was proposed in Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.

The Phi-1.5 model was proposed in Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.

Add Phi-1 and Phi-1_5 by @susnato in #26170

TVP

The text-visual prompting (TVP) framework was proposed in the paper Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.

This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP involves integrating specially designed patterns, known as ‘prompts’, into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model’s ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. The use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently.

TVP model by @jiqing-feng in #25856

DINOv2 depth estimation

Depth estimation is added to the DINO v2 implementation.

Add DINOv2 depth estimation by @NielsRogge in #26092

ROCm support for AMD GPUs

AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.

Add RoCm scheduled CI & upgrade RoCm CI to PyTorch 2.1 by @fxmarty in #26940
Flash Attention 2 support for RoCm by @fxmarty in #27611
Reflect RoCm support in the documentation by @fxmarty in #27636
restructure AMD scheduled CI by @ydshieh in #27743

PyTorch `scaled_dot_product_attention` native support

PyTorch's torch.nn.functional.scaled_dot_product_attention operator is now supported in the most-used Transformers models and used by default when using torch>=2.1.1, allowing dispatch to memory-efficient attention and Flash Attention backend implementations with no package other than torch required. This should significantly speed up attention computation on hardware that supports these fast paths.

While Transformers automatically handles the dispatch to use SDPA when available, it is possible to force the usage of a given attention implementation ("eager" being the manual implementation, where each operation is implemented step by step):

```python
from transformers import AutoModelForSpeechSeq2Seq

# or `attn_implementation="sdpa"`, or `attn_implementation="flash_attention_2"`
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", attn_implementation="eager")
```

Training benchmark, run on A100-SXM4-80GB.

| Model | Batch size | Sequence length | Time per batch ("eager", s) | Time per batch ("sdpa", s) | Speedup | Peak memory ("eager", MB) | Peak memory ("sdpa", MB) | Memory savings |
|-----------|------------|-----------------|-------------------------------|------------------------------|-------------|-----------------------------|----------------------------|-----------------------|
| llama2 7b | 4 | 1024 | 1.065 | 0.90 | 19.4% | 73878.28 | 45977.81 | 60.7% |
| llama2 7b | 4 | 2048 | OOM | 1.87 | / | OOM | 78394.58 | SDPA does not OOM |
| llama2 7b | 1 | 2048 | 0.64 | 0.48 | 32.0% | 55557.01 | 29795.63 | 86.4% |
| llama2 7b | 1 | 3072 | OOM | 0.75 | / | OOM | 37916.08 | SDPA does not OOM |
| llama2 7b | 1 | 4096 | OOM | 1.03 | / | OOM | 46028.14 | SDPA does not OOM |
| llama2 7b | 2 | 4096 | OOM | 2.05 | / | OOM | 78428.14 | SDPA does not OOM |

Inference benchmark, run on A100-SXM4-80GB.

| Model | Batch size | Prompt length | Num new tokens | Per token latency "eager" (ms) | Per token latency "sdpa" (ms) | Speedup |
|------------------|------------|---------------|----------------|----------------------------------|---------------------------------|-------------|
| llama2 13b | 1 | 1024 | 1 (prefill) | 178.66 | 159.36 | 12.11% |
| llama2 13b | 1 | 100 | 100 | 40.35 | 37.62 | 7.28% |
| llama2 13b | 8 | 100 | 100 | 40.55 | 38.06 | 6.53% |
| Whisper v3 large | 1 | / | 62 | 20.05 | 18.90 | 6.10% |
| Whisper v3 large | 8 | / | 77 | 25.42 | 24.77 | 2.59% |
| Whisper v3 large | 16 | / | 77 | 28.51 | 26.32 | 8.34% |
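
The notes do not state how the Speedup and Memory savings columns are derived. The figures look consistent with `eager / sdpa - 1`, and the short check below reproduces two of the reported values under that assumption. The helper name is hypothetical, and some other rows differ in the last digit, presumably because the displayed timings are rounded.

```python
def relative_gain(eager, sdpa):
    """Percentage by which the eager value exceeds the SDPA value."""
    return (eager / sdpa - 1.0) * 100.0


# llama2 13b prefill latency (ms), inference benchmark
print(f"{relative_gain(178.66, 159.36):.2f}%")  # 12.11%
# llama2 7b peak memory (MB), batch size 4, sequence length 1024
print(f"{relative_gain(73878.28, 45977.81):.1f}%")  # 60.7%
```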

F.scaled_dot_product_attention support by @fxmarty in #26572

New Cache abstraction & Attention Sinks support

We are rolling out a new abstraction for the past_key_values cache, which enables the use of different types of caches. For now, only llama and llama-inspired architectures (mistral, persimmon, phi) support it, with other architectures scheduled to have support in the next release. By default, a growing cache (DynamicCache) is used, which preserves the existing behavior.

This release also includes a new SinkCache cache, which implements the Attention Sinks paper. With SinkCache, the model is able to continue generating high-quality text well beyond its training sequence length! Note that it does not expand the context window, so it can’t digest very long inputs — it is suited for streaming applications such as multi-round dialogues. Check this colab for an example.

Generate: New Cache abstraction and Attention Sinks support by @tomaarsen in #26681
Generate: SinkCache can handle iterative prompts by @gante in #27907

Safetensors as a default

We continue toggling features that enable safetensors as a default across the board, in PyTorch, Flax, and TensorFlow. When using a PyTorch model and forcing the load of safetensors files with use_safetensors=True, if the repository does not contain safetensors files, they will now be converted on the fly, server-side.

Default to msgpack for safetensors by @LysandreJik in #27460
Fix from_pt flag when loading with safetensors by @LysandreJik in #27394
Make using safetensors files automated. by @Narsil in #27571

Breaking changes

pickle files

We now disallow the use of pickle.load internally for security purposes. To circumvent this, you can set the environment variable TRUST_REMOTE_CODE=True to indicate that you would still like to load pickle files.
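Because TRUST_REMOTE_CODE is read from the environment, it has to be set before the loading code runs. A minimal sketch follows; `pickle_loading_allowed` is a hypothetical helper written for illustration, not a transformers function, and the set of values it accepts is an assumption rather than the library's exact parsing.

```python
import os


def pickle_loading_allowed(environ=None):
    """Hypothetical check mirroring an opt-in environment flag."""
    environ = os.environ if environ is None else environ
    value = environ.get("TRUST_REMOTE_CODE", "False")
    return value.strip().lower() in {"1", "true", "yes", "on"}


# Opt in for the current process; do this only for files you trust.
os.environ["TRUST_REMOTE_CODE"] = "True"
print(pickle_loading_allowed())
```

From a shell, the same opt-in can be expressed by prefixing the command, for example `TRUST_REMOTE_CODE=True python script.py`.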

🚨🚨🚨 Disallow pickle.load unless TRUST_REMOTE_CODE=True by @ydshieh in #27776

Beam score calculation for decoder-only models

In the previous implementation of beam search, when length_penalty was active, the beam score for decoder-only models was penalized by the total length of both the prompt and the generated sequence. However, the length of the prompt should not be included in the penalization step; this release fixes it.
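The effect of the fix can be seen with a small worked example. This is a sketch under the assumption that a hypothesis is scored as its summed log-probability divided by `length ** length_penalty`; the function names and the numbers are illustrative, not taken from the library.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized score: higher is better."""
    return sum_logprobs / (length ** length_penalty)


prompt_len = 100
# (generated length, summed log-probability of the generated tokens)
hypotheses = {"short": (3, -3.0), "long": (6, -5.4)}


def rank(include_prompt):
    """Return the name of the best hypothesis under the chosen length convention."""
    offset = prompt_len if include_prompt else 0
    scored = {
        name: beam_score(logp, gen_len + offset)
        for name, (gen_len, logp) in hypotheses.items()
    }
    return max(scored, key=scored.get)


print("prompt counted:", rank(include_prompt=True))
print("generated tokens only:", rank(include_prompt=False))
```

With the prompt counted, both denominators are dominated by the 100 prompt tokens, so the normalization barely distinguishes the two hypotheses and the shorter one wins on raw log-probability. Normalizing by generated length alone lets the penalty do its job.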

🚨🚨 Fix beam score calculation issue for decoder-only models by @VsonicV in #27351

Slight API changes/corrections

⚠️ [VitDet] Fix test by @NielsRogge in #27832
[⚠️ removed a default argument] Make AttentionMaskConverter compatible with torch.compile(..., fullgraph=True) by @fxmarty in #27868

Bugfixes and improvements

Enrich TTS pipeline parameters naming by @ylacombe in #26473
translate peft.md to chinese by @jiaqiw09 in #27215
Removed the redundant SiLUActivation class. by @hi-sushanta in #27136
Fixed base model class name extraction from PeftModels by @kkteru in #27162
Fuyu protection by @LysandreJik in #27248
Refactor: Use Llama RoPE implementation for Falcon by @tomaarsen in #26933
[PEFT / Tests ] Fix peft integration failing tests by @younesbelkada in #27258
Avoid many failing tests in doctesting by @ydshieh in #27262
[docs] Custom model doc update by @MKhalusova in #27213
Update the ConversationalPipeline docstring for chat templates by @Rocketknight1 in #27250
Fix switch transformer mixed precision issue by @timlee0212 in #27220
[Docs / SAM ] Reflect correct changes to run inference without OOM by @younesbelkada in #27268
[Docs] Model_doc structure/clarity improvements by @MKhalusova in #26876
[FA2] Add flash attention for for DistilBert by @susnato in #26489
translate autoclass_tutorial to chinese by @jiaqiw09 in #27269
translate run_scripts.md to chinese by @jiaqiw09 in #27246
Fix tokenizer export for LLamaTokenizerFast by @mayank31398 in #27222
Fix daily CI image build by @ydshieh in #27307
Update doctest workflow file by @ydshieh in #27306
Remove an unexpected argument for FlaxResNetBasicLayerCollection by @pingzhili in #27272
enable memory tracker metrics for npu by @statelesshz in #27280
[PretrainedTokenizer] add some of the most important functions to the doc by @ArthurZucker in #27313
Update sequence_classification.md by @akshayvkt in #27281
Fix VideoMAEforPretrained dtype error by @ikergarcia1996 in #27296
Fix Kosmos2Processor batch mode by @ydshieh in #27323
[docs] fixed links with 404 by @MKhalusova in #27327
[Whisper] Block language/task args for English-only by @sanchit-gandhi in #27322
Fix autoawq docker image by @younesbelkada in #27339
Generate: skip tests on unsupported models instead of passing by @gante in #27265
Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function by @zuazo in #26834
[FA2] Add flash attention for GPT-Neo by @susnato in #26486
[Whisper] Add conversion script for the tokenizer by @ArthurZucker in #27338
Remove a redundant variable. by @hi-sushanta in #27288
Resolve AttributeError by utilizing device calculation at the start of the forward function by @folbaeni in #27347
Remove padding_masks from `gpt_bigcode`. by @susnato in #27348
[Whisper] Nit converting the tokenizer by @ArthurZucker in #27349
FIx Bark batching feature by @ylacombe in #27271
Allow scheduler parameters by @Plemeur in #26480
translate the en tokenizer_summary.md to Chinese by @ZouJiu1 in #27291
translate model_sharing.md and llm_tutorial.md to chinese by @jiaqiw09 in #27283
Add numpy alternative to FE using torchaudio by @ylacombe in #26339
moving example of benchmarking to legacy dir by @statelesshz in #27337
Fix example tests from failing by @muellerzr in #27353
Fix Kosmos-2 device issue by @ydshieh in #27346
MusicGen Update by @sanchit-gandhi in #27084
Translate index.md to Turkish by @mertyyanik in #27093
Remove unused param from example script tests by @muellerzr in #27354
[Flax Whisper] large-v3 compatibility by @sanchit-gandhi in #27360
Fix tiny model script: not using from_pt=True by @ydshieh in #27372
translate big_models.md and performance.md to chinese by @jiaqiw09 in #27334
Add Flash Attention 2 support to Bark by @ylacombe in #27364
Update deprecated torch.range in test_modeling_ibert.py by @kit1980 in #27355
translate debugging.md to chinese by @jiaqiw09 in #27374
Smangrul/fix failing ds ci tests by @pacman100 in #27358
[CodeLlamaTokenizer] Nit, update init to make sure the AddedTokens are not normalized because they are special by @ArthurZucker in #27359
Change thresh in test by @muellerzr in #27378
Put doctest options back to pyproject.toml by @ydshieh in #27366
Skip failing cache call tests by @amyeroberts in #27393
device-agnostic deepspeed testing by @statelesshz in #27342
Adds dvclive callback by @dberenbaum in #27352
use pytest.mark directly by @ydshieh in #27390
Fix fuyu checkpoint repo in FuyuConfig by @ydshieh in #27399
Use editable install for git deps by @muellerzr in #27404
Final fix of the accelerate installation issue by @ydshieh in #27408
Fix RequestCounter to make it more future-proof by @Wauplin in #27406
remove failing tests and clean FE files by @ylacombe in #27414
Fix Owlv2 checkpoint name and a default value in Owlv2VisionConfig by @ydshieh in #27402
Run all tests if circleci/create_circleci_config.py is modified by @ydshieh in #27413
add attention_mask and position_ids in assisted model by @jiqing-feng in #26892
[Quantization] Add str to enum conversion for AWQ by @younesbelkada in #27320
update Bark FA2 docs by @ylacombe in #27400
[AttentionMaskConverter] ]Fix-mask-inf by @ArthurZucker in #27114
At most 2 GPUs for CI by @ydshieh in #27435
Normalize floating point cast by @amyeroberts in #27249
Make examples_torch_job faster by @ydshieh in #27437
Fix line ending in utils/not_doctested.txt by @ydshieh in #27459
Fix some Wav2Vec2 related models' doctest by @ydshieh in #27462
Fixed typo in error message by @cmcmaster1 in #27461
Remove-auth-token by @ArthurZucker in #27060
[Llama + Mistral] Add attention dropout by @ArthurZucker in #27315
OWLv2: bug fix in post_process_object_detection() when using cuda device by @assafbot in #27468
Fix docstring for gradient_checkpointing_kwargs by @tomaszcichy98 in #27470
Install python-Levenshtein for nougat in CI image by @ydshieh in #27465
Add version check for Jinja by @Rocketknight1 in #27403
Fix Falcon tokenizer loading in pipeline by @Rocketknight1 in #27316
[AWQ ] Addresses TODO for awq tests by @younesbelkada in #27467
Perf torch compile by @jiaqiw09 in #27422
Fixed typo in pipelines.md documentation by @adismort14 in #27455
Fix FA2 import + deprecation cycle by @SunMarc in #27330
[Peft] modules_to_save support for peft integration by @younesbelkada in #27466
[CI-test_torch] skip test_tf_from_pt_safetensors for 4 models by @ArthurZucker in #27481
Fix M4T weights tying by @ylacombe in #27395
Add speecht5 batch generation and fix wrong attention mask when padding by @Spycsh in #25943
Clap processor: remove wasteful np.stack operations by @m-bain in #27454
[Whisper] Fix pipeline test by @sanchit-gandhi in #27442
Revert "[time series] Add PatchTST by @amyeroberts in #25927)"
translate hpo_train.md and perf_hardware.md to chinese by @jiaqiw09 in #27431
Generate: fix ExponentialDecayLengthPenalty doctest by @gante in #27485
Update and reorder docs for chat templates by @Rocketknight1 in #27443
Generate: GenerationConfig.from_pretrained can return unused kwargs by @gante in #27488
Minor type annotation fix by @vwxyzjn in #27276
Have seq2seq just use gather by @muellerzr in #27025
Update processor mapping for hub snippets by @amyeroberts in #27477
Track the number of tokens seen to metrics by @muellerzr in #27274
[CI-test_torch] skip test_tf_from_pt_safetensors and test_assisted_decoding_sample by @ArthurZucker in #27508
[Fuyu] Add tests by @NielsRogge in #27001
[Table Transformer] Add Transformers-native checkpoints by @NielsRogge in #26928
Update spelling mistake by @LimJing7 in #27506
[CircleCI] skip test_assisted_decoding_sample for everyone by @ArthurZucker in #27511
Make some jobs run on the GitHub Actions runners by @ydshieh in #27512
[tokenizers] update tokenizers version pin by @ArthurZucker in #27494
[ PretrainedConfig] Improve messaging by @ArthurZucker in #27438
Fix wav2vec2 params by @muellerzr in #27515
Translating en/model_doc docs to Japanese. by @Yuki-Imajuku in #27401
Fixing the failure of models without max_position_embeddings attribute. by @AdamLouly in #27499
Incorrect setting for num_beams in translation and summarization examples by @Rocketknight1 in #27519
Fix bug for T5x to PyTorch convert script with varying encoder and decoder layers by @JamesJiang97 in #27448
Fix offload disk for loading derivated model checkpoint into base model by @SunMarc in #27253
translate model.md to chinese by @statelesshz in #27518
Support ONNX export for causal LM sequence classifiers by @dwyatte in #27450
[pytest] Avoid flash attn test marker warning by @ArthurZucker in #27509
docs: add docs for map, and add num procs to load_dataset by @pphuc25 in #27520
Update the TF pin for 2.15 by @Rocketknight1 in #27375
Revert "add attention_mask and position_ids in assisted model" by @patrickvonplaten in #27523
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #27483
Raise error when quantizing a quantized model by @SunMarc in #27500
Disable docker image build job latest-pytorch-amd for now by @ydshieh in #27541
[Styling] stylify using ruff by @ArthurZucker in #27144
Generate: improve assisted generation tests by @gante in #27540
Updated albert.md doc for ALBERT model by @ENate in #27223
translate Trainer.md to chinese by @jiaqiw09 in #27527
Skip some fuyu tests by @ydshieh in #27553
Fix AMD CI not showing GPU by @ydshieh in #27555
Generate: fix flaky tests by @gante in #27543
Generate: update compute transition scores doctest by @gante in #27558
fixed broken link by @VpkPrasanna in #27560
Broken links fixed related to datasets docs by @VpkPrasanna in #27569
translate deepspeed.md to chinese by @jiaqiw09 in #27495
Fix broken distilbert url by @osanseviero in #27579
Adding leaky relu in dict ACT2CLS by @rafaelpadilla in #27574
Fix idx2sym not loaded from pretrained vocab file in Transformer XL by @jtang98 in #27589
Add convert_hf_to_openai.py script to Whisper documentation resources by @zuazo in #27590
docs: fix 404 link by @panpan0000 in #27529
[ examples] fix loading jsonl with load dataset in run translation example by @mathiasesn in #26924
[FA-2] Add fa2 support for from_config by @younesbelkada in #26914
timm to pytorch conversion for vit model fix by @staghado in #26908
[Whisper] Add large-v3 version support by @flyingleafe in #27336
Update Korean tutorial for using LLMs, and refactor the nested conditional statements in hr_argparser.py by @YeonwooSung in #27489
Fix torch.fx import issue for torch 1.12 by @amyeroberts in #27570
dvclive callback: warn instead of fail when logging non-scalars by @dberenbaum in #27608
[core / gradient_checkpointing] add support for old GC method by @younesbelkada in #27610
[ConvNext] Improve backbone by @NielsRogge in #27621
Generate: Update docs regarding reusing past_key_values in generate by @gante in #27612
Idefics: Fix information leak with cross attention gate in modeling by @leot13 in #26839
Fix flash attention bugs with Mistral and Falcon by @fxmarty in #27625
Fix tracing dinov2 by @amyeroberts in #27561
remove the deprecated method init_git_repo by @statelesshz in #27617
Explicitely specify use_cache=True in Flash Attention tests by @fxmarty in #27635
Harmonize HF environment variables + other cleaning by @Wauplin in #27564
Fix resize_token_embeddings by @czy-orange in #26861
[dependency] update pillow pins by @ArthurZucker in #27409
Simplify the implementation of jitter noise in moe models by @jiangwangyi in #27643
Fix max_steps documentation regarding the end-of-training condition by @qgallouedec in #27624
[Whisper] Add sequential longform decoding by @patrickvonplaten in #27492
Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration by @dg845 in #24799
update Openai API call method by @Strive-for-excellence in #27628
update d_kv'annotation in mt5'configuration by @callanwu in #27585
[FA2] Add flash attention for opt by @susnato in #26414
Extended semantic segmentation to image segmentation by @merveenoyan in #27039
Update TVP arxiv link by @amyeroberts in #27672
[DPT, Dinov2] Add resources by @NielsRogge in #27655
Update tiny model summary file by @ydshieh in #27388
Refactoring Trainer, adds save_only_model arg and simplifying FSDP integration by @pacman100 in #27652
Skip pipeline tests for 2 models for now by @ydshieh in #27687
Deprecate TransfoXL by @ydshieh in #27607
Fix typo in warning message by @liuxueyang in #27055
Docs/Add conversion code to the musicgen docs by @yoinked-h in #27665
Fix semantic error in evaluation section by @anihm136 in #27675
[DocString] Support a revision in the docstring add_code_sample_docstrings to facilitate integrations by @ArthurZucker in #27645
Successfully Resolved The ZeroDivisionError Exception. by @hi-sushanta in #27524
Fix TVPModelTest by @ydshieh in #27695
Fix sliding_window hasattr in Mistral by @IlyaGusev in #27041
Fix Past CI by @ydshieh in #27696
fix warning by @ArthurZucker in #27689
Reorder the code on the Hub to explicit that sharing on the Hub isn't a requirement by @LysandreJik in #27691
Fix mistral generate for long prompt / response by @lorabit110 in #27548
Fix oneformer instance segmentation RuntimeError by @yhshin11 in #27725
fix assisted decoding assistant model inputs by @jiqing-feng in #27503
Update forward signature test for vision models by @NielsRogge in #27681
Modify group_sub_entities in TokenClassification Pipeline to support label with "-" by @eshoyuan in #27325
Fix owlv2 code snippet by @NielsRogge in #27698
docs: replace torch.distributed.run by torchrun by @panpan0000 in #27528
Update chat template warnings/guides by @Rocketknight1 in #27634
translation main-class files to chinese by @jiaqiw09 in #27588
Translate en/model_doc to JP by @rajveer43 in #27264
Fixed passing scheduler-specific kwargs via TrainingArguments lr_scheduler_kwargs by @CharbelAD in #27595
Fix AMD Push CI not triggered by @ydshieh in #27732
Add BeitBackbone by @NielsRogge in #25952
Update tiny model creation script by @ydshieh in #27674
Log a warning in TransfoXLTokenizer.__init__ by @ydshieh in #27721
Add madlad-400 MT models by @jbochi in #27471
Enforce pin memory disabling when using cpu only by @qgallouedec in #27745
Trigger corresponding pipeline tests if tests/utils/tiny_model_summary.json is modified by @ydshieh in #27693
CLVP Fixes by @susnato in #27547
Docs: Fix broken cross-references, i.e. ~transformer. -> ~transformers. by @tomaarsen in #27740
[docs] Quantization by @stevhliu in #27641
Fix precision errors from casting rotary parameters to FP16 with AMP by @kevinhu in #27700
Remove check_runner_status.yml by @ydshieh in #27767
uses dvclive_test mode in examples/pytorch/test_accelerate_examples.py by @dberenbaum in #27763
Generate: GenerationConfig throws an exception when generate args are passed by @gante in #27757
Fix unsupported setting of self.ngpu in training_args on XPU devices by @Liangliang-Ma in #27716
[SeamlessM4Tv2] Fix links in README by @xenova in #27782
[i18n-fr] Translate installation to French by @NoB0 in #27657
Fixes for PatchTST Config by @wgifford in #27777
Better error message for bitsandbytes import by @SunMarc in #27764
[MusicGen] Fix audio channel attribute by @sanchit-gandhi in #27440
[JAX] Replace uses of jax.devices("cpu") with jax.local_devices(backend="cpu") by @hvaara in #27593
Improve forward signature test by @NielsRogge in #27729
Fix typo in max_length deprecation warnings by @siegeln in #27788
Add persistent_workers parameter to TrainingArguments by @Sorrow321 in #27189
[ModelOnTheFlyConversionTester] Mark as slow for now by @ArthurZucker in #27823
Fix TvpModelIntegrationTests by @ydshieh in #27792
Fix Owlv2ModelIntegrationTest::test_inference_object_detection by @ydshieh in #27793
Keypoints 0.0 are confusing ../transformers/models/detr/image_processing_detr.py which are fixed by @hackpk in #26250
[Seamless v1] Link to v2 docs by @sanchit-gandhi in #27827
[Whisper] Fix doctest in timestamp logits processor by @sanchit-gandhi in #27795
Added test cases for rembert refering to albert and reformer test_tok… by @nileshkokane01 in #27637
[Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors by @yeounoh in #27799
single word should be set to False by @ArthurZucker in #27738
[Seamless v2] Add FE to auto mapping by @sanchit-gandhi in #27829
translate internal folder files to chinese by @jiaqiw09 in #27638
Translate en/tasks folder docs to Japanese 🇯🇵 by @rajveer43 in #27098
pin ruff==0.1.5 by @ydshieh in #27849
Make image processors more general by @NielsRogge in #27690
Faster generation using AWQ + Fused modules by @younesbelkada in #27411
Generate: Update VisionEncoderDecoder test value by @gante in #27850
[ClipVision] accelerate support for clip-vision by @younesbelkada in #27851
Add Llama Flax Implementation by @vvvm23 in #24587
Move tensors to same device to enable IDEFICS naive MP training by @willemsenbram in #27746
Update VitDetModelTester.get_config to use pretrain_image_size by @ydshieh in #27831
fix(whisper): mutable generation config by @badayvedat in #27833
Documentation: Spanish translation of perplexity.mdx by @aaronjimv in #27807
[Docs] Update broken image on fused modules by @younesbelkada in #27856
Update CUDA versions for DeepSpeed by @muellerzr in #27853
removed the delete doc workflows by @MKhalusova in #27852
Avoid class attribute _keep_in_fp32_modules being modified by @ydshieh in #27867
[Flash Attention 2] Add flash attention 2 for GPT-Neo-X by @younesbelkada in #26463
Translating en/model_doc folder docs to Japanese(from blip to clap) 🇯🇵 by @rajveer43 in #27673
Fix beam score calculation issue for JAX version by @VsonicV in #27816
Fix bug of _prepare_4d_attention_mask by @jiqing-feng in #27847
[i18n-fr] Translate autoclass tutorial to French by @NoB0 in #27659
[FA-2] Add Flash Attention to Phi by @susnato in #27661
fix: fix gradient accumulate step for learning rate by @pphuc25 in #27667
Allow # Ignore copy by @ydshieh in #27328
update create_model_card to properly save peft details when using Trainer with PEFT by @pacman100 in #27754
update version of warning notification for get_default_device to v4.38 by @statelesshz in #27848
Fix device of masks in tests by @fxmarty in #27887
Show new failing tests in a more clear way in slack report by @ydshieh in #27881
Fix TF loading PT safetensors when weights are tied by @Rocketknight1 in #27490
Generate: All logits processors are documented and have examples by @gante in #27796
[docs] Custom semantic segmentation dataset by @stevhliu in #27859
Updates the distributed CPU training documentation to add instructions for running on a Kubernetes cluster by @dmsuehir in #27780
Translate model_doc files from clip to cpm to JP by @rajveer43 in #27774
Fix: Raise informative exception when prefix_allowed_tokens_fn return empty set of tokens by @Saibo-creator in #27797
Added passing parameters to "reduce_lr_on_plateau" scheduler by @CharbelAD in #27860
fix: non-atomic checkpoint save by @thundergolfer in #27820
Fix beam score calculation issue for Tensorflow version by @VsonicV in #27814
Fix remaining issues in beam score calculation by @VsonicV in #27808
Fix CLAP converting script by @ylacombe in #27153
mark test_initialization as flaky in 2 model tests by @ydshieh in #27906
Fix notification_service.py by @ydshieh in #27903
Fix 2 tests in FillMaskPipelineTests by @ydshieh in #27889
Llama conversion script: adjustments for Llama Guard by @pcuenca in #27910
fix llava by @ArthurZucker in #27909
Allow resume_from_checkpoint to handle auto_find_batch_size by @muellerzr in #27568
[Doc] Spanish translation of pad_truncation.md by @aaronjimv in #27890
fix typo in image_processing_blip.py Wwhether -> Whether by @zhc7 in #27899
[CLAP] Replace hard-coded batch size to enable dynamic ONNX export by @xenova in #27790
[integration] Update Ray Tune integration for Ray 2.7 by @justinvyu in #26499
Fix typo by @f4hy in #27918
[DETA] fix backbone freeze/unfreeze function by @SangbumChoi in #27843

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@jiaqiw09
- translate peft.md to chinese (#27215)
- translate autoclass_tutorial to chinese (#27269)
- translate run_scripts.md to chinese (#27246)
- translate model_sharing.md and llm_tutorial.md to chinese (#27283)
- translate big_models.md and performance.md to chinese (#27334)
- translate debugging.md to chinese (#27374)
- Perf torch compile (#27422)
- translate hpo_train.md and perf_hardware.md to chinese (#27431)
- translate Trainer.md to chinese (#27527)
- translate deepspeed.md to chinese (#27495)
- translation main-class files to chinese (#27588)
- translate internal folder files to chinese (#27638)
@susnato
- [FA2] Add flash attention for for DistilBert (#26489)
- [FA2] Add flash attention for GPT-Neo (#26486)
- Remove padding_masks from `gpt_bigcode`. (#27348)
- Add CLVP (#24745)
- Add Phi-1 and Phi-1_5 (#26170)
- [FA2] Add flash attention for opt (#26414)
- CLVP Fixes (#27547)
- [FA-2] Add Flash Attention to Phi (#27661)
@jiqing-feng
- add attention_mask and position_ids in assisted model (#26892)
- TVP model (#25856)
- fix assisted decoding assistant model inputs (#27503)
- Fix bug of _prepare_4d_attention_mask (#27847)
@psinthong
- [time series] Add PatchTST (#25927)
@Yuki-Imajuku
- Translating en/model_doc docs to Japanese. (#27401)
@dg845
- Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration (#24799)
@rajveer43
- Translate en/model_doc to JP (#27264)
- Translate en/tasks folder docs to Japanese 🇯🇵 (#27098)
- Translating en/model_doc folder docs to Japanese(from blip to clap) 🇯🇵 (#27673)
- Translate model_doc files from clip to cpm to JP (#27774)
@NoB0
- [i18n-fr] Translate installation to French (#27657)
- [i18n-fr] Translate autoclass tutorial to French (#27659)
@ajati
- [Time series] Add PatchTSMixer (#26247)
@vvvm23
- Add Llama Flax Implementation (#24587)

- Python
Published by LysandreJik over 2 years ago

transformers - Patch release: v4.35.2

A patch release was made for the following commit:

[tokenizers] update tokenizers version pin #27494

to fix all the issues with versioning regarding tokenizers and huggingface_hub

- Python
Published by ArthurZucker over 2 years ago

transformers - Patch release: v4.35.1

A patch release was made for the following three commits:

Fix FA2 import + deprecation cycle (#27330)
Fix from_pt flag when loading with safetensors (#27394)
Default to msgpack for safetensors (#27460)

- Python
Published by LysandreJik over 2 years ago

transformers - Safetensors serialization by default, DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, Owl-v2

New models

Distil-Whisper

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution data. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.

Distil-Whisper copies the entire encoder from Whisper, meaning it retains Whisper's robustness to different audio conditions. It only copies 2 decoder layers, which significantly reduces the time taken to auto-regressively generate text tokens.

Distil-Whisper is MIT licensed and directly available in the Transformers library with chunked long-form inference, Flash Attention 2 support, and Speculative Decoding. For details on using the model, refer to the following instructions.

Joint work from @sanchit-gandhi, @patrickvonplaten and @srush.

[Assistant Generation] Improve Encoder Decoder by @patrickvonplaten in #26701
[WhisperForCausalLM] Add WhisperForCausalLM for speculative decoding by @patrickvonplaten in #27195
[Whisper, Bart, MBart] Add Flash Attention 2 by @patrickvonplaten in #27203

Fuyu

The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.

The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.

By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.

Joint work from @molbap, @pcuenca, @amyeroberts, @ArthurZucker

Add fuyu model by @molbap in #26911
Fuyu: improve image processing by @molbap in #27007

SeamlessM4T

The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.

Add Seamless M4T model by @ylacombe in #25693

Kosmos-2

The KOSMOS-2 model was proposed in Kosmos-2: Grounding Multimodal Large Language Models to the World by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.

KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale dataset of grounded image-text pairs GRIT. The spatial coordinates of the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective entity text spans (for example, a snowman followed by ). The data format is similar to “hyperlinks” that connect the object regions in an image to their text span in the corresponding caption.

Add Kosmos-2 model by @ydshieh in #24709

Owl-v2

OWLv2 was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up OWL-ViT using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection.

Add OWLv2, bis by @NielsRogge in #26668

🚨🚨🚨 Safetensors by default for `torch` serialization 🚨🚨🚨

Version v4.35.0 now makes safetensors serialization the default. This is a significant change aimed at making users of the Hugging Face Hub, transformers, and any downstream library leveraging it safer.

The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).

It has been the default loading mechanism since v4.30.0, so model.safetensors files were already loaded instead of pytorch_model.bin whenever they were present in the repository.

With v4.35.0, any call to save_pretrained for torch models will now save a safetensors file. This safetensors file is in the PyTorch format, but can be loaded in TensorFlow and Flax models alike.

⚠️ If you run into any issues with this, please let us know ASAP in the issues so that we may help you. Namely, the following errors may indicate something is up:
- Loading a safetensors file and having a warning mentioning missing weights unexpectedly
- Obtaining completely wrong/random results at inference after loading a pretrained model that you have saved in safetensors

If you wish to continue saving files in the .bin format, you can do so by specifying safe_serialization=False in all your save_pretrained calls.

Safetensors serialization by default by @LysandreJik in #27064

Chat templates

Chat templates have been expanded with the addition of the add_generation_prompt argument to apply_chat_template(). This has also enabled us to rework the ConversationalPipeline class to use chat templates. Any model with a chat template is now automatically usable through ConversationalPipeline.
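What add_generation_prompt changes is easiest to see on a rendered string. The sketch below is not the library's template engine: real chat templates are model-specific Jinja templates, whereas `render_chat` is a hypothetical plain-Python stand-in that assumes ChatML-style `<|im_start|>` / `<|im_end|>` markers purely for illustration.

```python
def render_chat(messages, add_generation_prompt=False):
    """Hypothetical ChatML-style renderer, for illustration only."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of
        # continuing the user's message.
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Hi there!"}]
print(render_chat(messages, add_generation_prompt=True))
```

Without the flag the rendered text ends after the last complete message, which is typically what you want when formatting training data; with it, the text ends with the start of an assistant turn, which is what you want before calling generate.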

Add add_generation_prompt argument to apply_chat_template by @Rocketknight1 in #26573
Conversation pipeline fixes by @Rocketknight1 in #26795

Guides

Two new guides on LLMs were added to the library:

[docs] LLM prompting guide by @MKhalusova in #26274
[docs] Optimizing LLMs by @patrickvonplaten in #26058

Quantization

Exllama-v2 integration

Exllama-v2 provides a better GPTQ kernel for higher throughput and lower latency for GPTQ models. The original code can be found here.

add exllamav2 arg by @SunMarc in #26437
Add exllamav2 better by @SunMarc in #27111

You will need the latest versions of optimum and auto-gptq. Read more about the integration here.

AWQ integration

AWQ is a new and popular quantization scheme, already used in various libraries such as TGI, vllm, etc. and known to be faster than GPTQ models according to some benchmarks. The original code can be found here and here you can read more about the original paper.


We support AWQ inference with the original kernels as well as kernels provided through the autoawq package, which you can install with pip install autoawq.

[core / Quantization ] AWQ integration by @younesbelkada in #27045

We also provide an example script on how to push quantized weights on the hub on the original repository.

Read more about the benchmarks and the integration here

GPTQ on CPU !

You can now run GPTQ models on CPU using the latest version of auto-gptq thanks to @vivekkhandelwal1 !

Add support for loading GPTQ models on CPU by @vivekkhandelwal1 in #26719

Attention mask refactor

We refactored the attention mask logic for major models in transformers. For instance, we removed the padding_mask argument, which was ambiguous for some users.

Remove ambiguous padding_mask and instead use a 2D->4D Attn Mask Mapper by @patrickvonplaten in #26792
[Attention Mask] Refactor all encoder-decoder attention mask by @patrickvonplaten in #27086

Flash Attention 2 for more models + quantization fine-tuning bug fix

GPT-BigCode (StarCoder), Whisper, Bart and MBart now support FA-2! Use it by simply passing use_flash_attention_2=True to from_pretrained. Some bugfixes with respect to mixed precision training with FA2 have also been addressed.

Add flash attention for gpt_bigcode by @susnato in #26479
[FA2] Fix flash attention 2 fine-tuning with Falcon by @younesbelkada in #26852
[Whisper, Bart, MBart] Add Flash Attention 2 by @patrickvonplaten in #27203

A bug with respect to fine-tuning with FA-2 in bfloat16 was fixed. You should now be able to smoothly fine-tune FA-2 models in bfloat16 using quantized base models.

🚨🚨🚨 [Quantization] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
[FA-2] Final fix for FA2 dtype by @younesbelkada in #26846

Neftune

NEFTune is a new technique that boosts supervised fine-tuning performance by adding random noise to the embedding vectors. Read more about it in the original paper here.

The API is deliberately simple: pass a valid neftune_noise_alpha value to TrainingArguments.
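
To make the role of the alpha parameter concrete, here is a small standalone sketch of the noise rule described in the NEFTune paper: uniform noise in [-1, 1] scaled by alpha / sqrt(seq_len * hidden_dim). It is an illustration only, not the Trainer's code; the function name neftune_noise and the nested-list embeddings are hypothetical.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng=None):
    """Return a noisy copy of `embeddings` (seq_len x hidden_dim nested lists)."""
    rng = rng or random.Random()
    seq_len, hidden_dim = len(embeddings), len(embeddings[0])
    # Larger alpha means more noise; longer or wider inputs get less noise per entry.
    scale = alpha / math.sqrt(seq_len * hidden_dim)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


embeddings = [[0.0] * 4 for _ in range(4)]  # toy 4-token, 4-dim embedding matrix
noisy = neftune_noise(embeddings, alpha=5.0, rng=random.Random(0))
bound = 5.0 / math.sqrt(4 * 4)  # no perturbation can exceed this magnitude
```

In the paper the noise is applied during training only, so evaluation and inference see the unmodified embeddings.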

Gradient checkpointing refactor

We have refactored the gradient checkpointing API so that users can pass keyword arguments supported by torch.utils.checkpoint.checkpoint directly through gradient_checkpointing_kwargs when calling gradient_checkpointing_enable(), e.g.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
# Extra keyword arguments are forwarded to torch.utils.checkpoint.checkpoint
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```

gradient_checkpointing_kwargs is also supported with Trainer through TrainingArguments.

[Trainer / GC] Add gradient_checkpointing_kwargs in trainer and training arguments by @younesbelkada in #27068
[core] Refactor of gradient_checkpointing by @younesbelkada in #27020
[core/ GC / tests] Stronger GC tests by @younesbelkada in #27124
Fix import of torch.utils.checkpoint by @NielsRogge in #27155

The refactor should be fully backward compatible with the previous behaviour. Advanced users can still set the gradient_checkpointing attribute on a model's submodules to activate or deactivate gradient checkpointing for each of them.

Breaking changes

🚨🚨🚨 [Quantization] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
🚨🚨 Generate: change order of ops in beam sample to avoid nans by @gante in #26843
🚨🚨 Raise error when no speaker embeddings in speecht5._generate_speech by @ylacombe in #26418

Bugfixes and improvements

[Nougat] from transformers import * by @ArthurZucker in #26562
[Whisper] Allow basic text normalization by @sanchit-gandhi in #26149
🌐 [i18n-KO] Translated semantic_segmentation.md to Korean by @jungnerd in #26515
[Tokenizers] Skip tests temporarily by @LysandreJik in #26574
docs: feat: add clip notebook resources from OSSCA community by @junejae in #26505
Extend Trainer to enable Ascend NPU to use the fused Adamw optimizer when training by @statelesshz in #26194
feat: add trainer label to wandb run upon initialization by @parambharat in #26466
Docstring check by @sgugger in #26052
refactor: change default block_size by @pphuc25 in #26229
[Mistral] Update config docstring by @sanchit-gandhi in #26593
Add # Copied from statements to audio feature extractors that use the floats_list function by @dg845 in #26581
Fix embarrassing typo in the doc chat template! by @Rocketknight1 in #26596
Fix encoder->decoder typo bug in convert_t5x_checkpoint_to_pytorch.py by @soyoung97 in #26587
skip flaky hub tests by @ArthurZucker in #26594
Update mistral.md to update 404 link by @Galland in #26590
[Wav2Vec2] Fix tokenizer set lang by @sanchit-gandhi in #26349
add zh translation for installation by @yyLeaves in #26084
[ NougatProcessor] Fix the default channel by @ArthurZucker in #26608
[GPTNeoX] Faster rotary embedding for GPTNeoX (based on llama changes) by @ArthurZucker in #25830
[Falcon] Set use_cache=False before creating presents which relies on use_cache by @yundai424 in #26328
Fix failing tests on main due to torch 2.1 by @ydshieh in #26607
Make ModelOutput serializable by @cbensimon in #26493
[core] fix silent bug keep_in_fp32 modules by @younesbelkada in #26589
#26566 swin2 sr allow in out channels by @marvingabler in #26568
Don't close ClearML task if it was created externally by @eugen-ajechiloae-clearml in #26614
Fix transformers-pytorch-gpu docker build by @ydshieh in #26615
[docs] Update to scripts building index.md by @MKhalusova in #26546
Don't install pytorch-quantization in Doc Builder docker file by @ydshieh in #26622
Remove unnecessary views of position_ids by @ramiro050 in #26059
Fixed inconsistency in several fast tokenizers by @Towdo in #26561
Update tokenization_code_llama_fast.py by @andyl98 in #26576
Remove unnecessary unsqueeze - squeeze in rotary positional embedding by @fxmarty in #26162
Update chat template docs with more tips on writing a template by @Rocketknight1 in #26625
fix RoPE t range issue for fp16 by @rui-ren in #26602
Fix failing MusicgenTest .test_pipeline_text_to_audio by @ydshieh in #26586
remove SharedDDP as it is deprecated by @statelesshz in #25702
[LlamaTokenizerFast] Adds edge cases for the template processor by @ArthurZucker in #26606
[docstring] Fix docstring for AlbertConfig by @ydshieh in #26636
docs(zh): review and punctuation & space fix by @wfjsw in #26627
[DINOv2] Convert more checkpoints by @NielsRogge in #26177
Fixed malapropism error by @Zhreyu in #26660
fix links in README.md for the GPT, GPT-2, and Llama2 Models by @dcarpintero in #26640
Avoid CI OOM by @ydshieh in #26639
fix typos in idefics.md by @dribnet in #26648
[docstring] Fix docstring CLIP configs by @isaac-chung in #26677
[docstring] Fix docstring for CLIPImageProcessor by @isaac-chung in #26676
[docstring] Fix docstring for DonutImageProcessor by @abzdel in #26641
Fix stale bot by @LysandreJik in #26692
[docstring] Fix docstrings for CLIP by @isaac-chung in #26691
Control first downsample stride in ResNet by @jiqing-feng in #26374
Fix Typo: table in deepspeed.md by @Pairshoe in #26705
[docstring] Fix docstring for LlamaConfig by @pavaris-pm in #26685
fix a typo in flax T5 attention - attention_mask variable is misnamed by @giganttheo in #26663
Fix source_prefix default value by @jheitmann in #26654
[JAX] Replace uses of jnp.array in types with jnp.ndarray. by @hvaara in #26703
Make Whisper Encoder's sinusoidal PE non-trainable by default by @gau-nernst in #26032
In assisted decoding, pass model_kwargs to model's forward call (fix prepare_input_for_generation in all models) by @sinking-point in #25242
Update docs to explain disabling callbacks using report_to by @nebrelbug in #26155
Copied from for test files by @ydshieh in #26713
[docstring] SwinModel docstring fix by @shivanandmn in #26679
fix the model card issue as use_cuda_amp is no more available by @pacman100 in #26731
Fix stale bot for locked issues by @LysandreJik in #26711
Fix checkpoint path in no_trainer scripts by @muellerzr in #26733
Update docker files to use torch==2.1.0 by @ydshieh in #26735
Revert #20715 by @ydshieh in #26734
[docstring] Fix docstring for LlamaTokenizer and LlamaTokenizerFast by @minhoryang in #26669
[docstring] Fix docstring for CodeLlamaTokenizer by @Bojun-Feng in #26709
add japanese documentation by @rajveer43 in #26138
Translated the accelerate.md file of the documentation to Chinese by @liteli1987gmail in #26161
Fix doctest for Blip2ForConditionalGeneration by @ydshieh in #26737
Add many missing spaces in adjacent strings by @tomaarsen in #26751
Warnings controlled by logger level by @LysandreJik in #26527
Fix PersimmonIntegrationTest OOM by @ydshieh in #26750
Fix MistralIntegrationTest OOM by @ydshieh in #26754
Fix backward compatibility of Conversation by @wdhorton in #26741
[docstring] Fix UniSpeech, UniSpeechSat, Wav2Vec2ForCTC by @gizemt in #26664
[docstring] Update GPT2 and Whisper by @McDonnellJoseph in #26642
[docstring] Fix docstring for 'BertGenerationConfig' by @AdwaitSalankar in #26661
Fix PerceiverModelIntegrationTest::test_inference_masked_lm by @ydshieh in #26760
chore: fix typos by @afuetterer in #26756
[core] Fix fa-2 import by @younesbelkada in #26785
Skip TrainerIntegrationFSDP::test_basic_run_with_cpu_offload if torch < 2.1 by @ydshieh in #26764
🌐 [i18n-KO] Translated big_models.md to Korean by @wonhyeongseo in #26245
Update expect outputs of IdeficsProcessorTest.test_tokenizer_padding by @ydshieh in #26779
[docstring] Fix docstring for RwkvConfig by @Bojun-Feng in #26782
Fix num. of minimal calls to the Hub with peft for pipeline by @ydshieh in #26385
[docstring] fix docstring DPRConfig by @AVAniketh0905 in #26674
Disable default system prompt for LLaMA by @Rocketknight1 in #26765
Fix Falcon generation test by @Rocketknight1 in #26770
Fixed KeyError for Mistral by @MatteoRaso in #26682
[Flava] Fix flava doc by @younesbelkada in #26789
Add CLIP resources by @eenzeenee in #26534
translation brazilian portuguese by @alvarorichard in #26769
Fixed typos by @Zhreyu in #26810
[docstring] Fix docstring for CanineConfig by @Sparty in #26771
Add Japanese translation by @shinshin86 in #26799
[docstring] Fix docstring for CodeLlamaTokenizerFast by @Bojun-Feng in #26666
Image-to-Image Task Guide by @merveenoyan in #26595
Make fsdp ram efficient loading optional by @pacman100 in #26631
fix resume_from_checkpoint bug by @Jintao-Huang in #26739
[OWL-ViT, OWLv2] Add resources by @NielsRogge in #26822
Llama tokenizer: remove space in template comment by @pcuenca in #26788
Better way to run AMD CI with different flavors by @ydshieh in #26634
[docstring] Fix bert generation tokenizer by @przemL in #26820
Conversation pipeline fixes by @Rocketknight1 in #26795
Fix Mistral OOM again by @ydshieh in #26847
Chore: Typo fixed in multiple files of docs/source/en/model_doc by @SusheelThapa in #26833
fix: when window_size is passes as array by @dotneet in #26800
Update logits_process.py docstrings to clarify penalty and reward cases (attempt #2) by @larekrow in #26784
[docstring] Fix docstring for LukeConfig by @louietouie in #26858
Fixed a typo in mistral.md by @DTennant in #26879
Translating en/internal folder docs to Japanese 🇯🇵 by @rajveer43 in #26747
Fix TensorFlow pakage check by @jayfurmanek in #26842
Generate: improve docstrings for custom stopping criteria by @gante in #26863
Knowledge distillation for vision guide by @merveenoyan in #25619
Fix Seq2seqTrainer decoder attention mask by @Rocketknight1 in #26841
[Tokenizer] Fix slow and fast serialization by @ArthurZucker in #26570
Emergency PR to skip conversational tests to fix CI by @Rocketknight1 in #26906
Add default template warning by @Rocketknight1 in #26637
Refactor code part in documentation translated to japanese by @rajveer43 in #26900
[i18n-ZH] Translated fast_tokenizers.md to Chinese by @yyLeaves in #26910
[FA-2] Revert suggestion that broke FA2 fine-tuning with quantized models by @younesbelkada in #26916
[docstring] Fix docstring for ChineseCLIP by @Sparty in #26880
[Docs] Make sure important decode and generate method are nicely displayed in Whisper docs by @patrickvonplaten in #26927
Fix and re-enable ConversationalPipeline tests by @Rocketknight1 in #26907
[docstring] Fix docstrings for CodeGen by @daniilgaltsev in #26821
Fix license by @MedAymenF in #26931
Pin Keras for now by @Rocketknight1 in #26904
[FA-2 / Mistral] Supprot fa-2 + right padding + forward by @younesbelkada in #26912
Generate: update basic llm tutorial by @gante in #26937
Corrected modalities description in README_ru.md by @letohx in #26913
[docstring] Fix docstring for speech-to-text config by @R055A in #26883
fix set_transform link docs by @diegulio in #26856
Fix Fuyu image scaling bug by @pcuenca in #26918
Update README_hd.md by @biswabaibhab007 in #26872
Added Telugu [te] translations by @hakunamatata1997 in #26828
fix logit-to-multi-hot conversion in example by @ranchlai in #26936
Limit to inferior fsspec version by @LysandreJik in #27010
python falcon doc-string example typo by @SoyGema in #26995
skip two tests by @ArthurZucker in #27013
Nits in Llama2 docstring by @osanseviero in #26996
Change default max_shard_size to smaller value by @younesbelkada in #26942
[NLLB-MoE] Fix NLLB MoE 4bit inference by @younesbelkada in #27012
[SeamlessM4T] fix copies with NLLB MoE int8 by @ArthurZucker in #27018
small typos found by @rafaelpadilla in #26988
Remove token_type_ids from default TF GPT-2 signature by @Rocketknight1 in #26962
Translate pipeline_tutorial.md to chinese by @jiaqiw09 in #26954
🌐 [i18n-ZH] Translate multilingual into Chinese by @yyLeaves in #26935
translate preprocessing.md to Chinese by @jiaqiw09 in #26955
Bugfix device map detr model by @pedrogengo in #26849
Fix little typo by @mertyyanik in #27028
🌐 [i18n-ZH] Translate create_a_model.md into Chinese by @yyLeaves in #27026
Fix key dtype in GPTJ and CodeGen by @fxmarty in #26836
Register ModelOutput as supported torch pytree nodes by @XuehaiPan in #26618
Add default_to_square_for_size to CLIPImageProcessor by @ydshieh in #26965
Add descriptive docstring to WhisperTimeStampLogitsProcessor by @jprivera44 in #25642
Normalize only if needed by @mjamroz in #26049
[TFxxxxForSequenceClassifciation] Fix the eager mode after #25085 by @ArthurZucker in #25751
Safe import of rgb_to_id from FE modules by @amyeroberts in #27037
add info on TRL docs by @lvwerra in #27024
Add fuyu device map by @SunMarc in #26949
Device agnostic testing by @vvvm23 in #25870
Fix config silent copy in from_pretrained by @patrickvonplaten in #27043
[docs] Performance docs refactor p.2 by @MKhalusova in #26791
Add a default decoder_attention_mask for EncoderDecoderModel during training by @hackyon in #26752
Fix RoPE config validation for FalconConfig + various config typos by @tomaarsen in #26929
Skip-test by @ArthurZucker in #27062
Fix TypicalLogitsWarper tensor OOB indexing edge case by @njhill in #26579
[docstring] fix incorrect llama docstring: encoder -> decoder by @ztjhz in #27071
[DOCS] minor fixes in README.md by @Akash190104 in #27048
[docs] Add MaskGenerationPipeline in docs by @younesbelkada in #27063
🌐 [i18n-ZH] Translate custom_models.md into Chinese by @yyLeaves in #27065
Hindi translation of pipeline_tutorial.md by @AaryaBalwadkar in #26837
Handle unsharded Llama2 model types in conversion script by @coreyhu in #27069
Bring back set_epoch for Accelerate-based dataloaders by @muellerzr in #26850
Bump flash_attn version to 2.1 by @younesbelkada in #27079
Remove unneeded prints in modeling_gpt_neox.py by @younesbelkada in #27080
Add-support for commit description by @ArthurZucker in #26704
[Llama FA2] Re-add _expand_attention_mask and clean a couple things by @patrickvonplaten in #27074
Correct docstrings and a typo in comments by @lewis-yeung in #27047
Save TB logs as part of push_to_hub by @muellerzr in #27022
Added huggingface emoji instead of the markdown format by @shettyvarshaa in #27091
[T5Tokenizer] Fix fast and extra tokens by @ArthurZucker in #27085
Revert "add exllamav2 arg" by @ArthurZucker in #27102
Add early stopping for Bark generation via logits processor by @isaac-chung in #26675
Provide alternative when warning on use_auth_token by @Wauplin in #27105
Fix no split modules underlying modules by @SunMarc in #27090
[core/ gradient_checkpointing] Refactor GC - part 2 by @younesbelkada in #27073
fix detr device map by @SunMarc in #27089
Added Telugu [te] translation for README.md in main by @hakunamatata1997 in #27077
translate transformers_agents.md to Chinese by @jiaqiw09 in #27046
Fix docstring and type hint for resize by @daniilgaltsev in #27104
[Typo fix] flag config in WANDB by @SoyGema in #27130
Fix slack report failing for doctest by @ydshieh in #27042
[FA2/ Mistral] Revert previous behavior with right padding + forward by @younesbelkada in #27125
Fix data2vec-audio note about attention mask by @gau-nernst in #27116
remove the obsolete code related to fairscale FSDP by @statelesshz in #26651
Fix some tests using "common_voice" by @ydshieh in #27147
[tests / Quantization] Fix bnb test by @younesbelkada in #27145
make tests of pytorch_example device agnostic by @statelesshz in #27081
Remove some Kosmos-2 copied from by @ydshieh in #27149
🌐 [i18n-ZH] Translate serialization.md into Chinese by @yyLeaves in #27076
Translating en/main_classes folder docs to Japanese 🇯🇵 by @rajveer43 in #26894
Device agnostic trainer testing by @statelesshz in #27131
Fix: typos in README.md by @THEFZNKHAN in #27154
[KOSMOS-2] Update docs by @NielsRogge in #27157
deprecate function get_default_device in tools/base.py by @statelesshz in #26774
Remove broken links to s-JoL/Open-Llama by @CSRessel in #27164
[docstring] Fix docstring for AltCLIPTextConfig, AltCLIPVisionConfig and AltCLIPConfig by @AksharGoyal in #27128
[doctring] Fix docstring for BlipTextConfig, BlipVisionConfig by @Hangsiin in #27173
Disable CI runner check by @ydshieh in #27170
fix: Fix typical_p behaviour broken in recent change by @njhill in #27165
Trigger CI if tiny_model_summary.json is modified by @ydshieh in #27175
Shorten the conversation tests for speed + fixing position overflows by @Rocketknight1 in #26960
device agnostic pipelines testing by @statelesshz in #27129
Backward compatibility fix for the Conversation class by @Rocketknight1 in #27176
[Quantization / tests ] Fix bnb MPT test by @younesbelkada in #27178
Fix dropout in StarCoder by @susnato in #27182
translate traning.md to chinese by @jiaqiw09 in #27122
[docs] Update CPU/GPU inference docs by @stevhliu in #26881
device agnostic models testing by @statelesshz in #27146
Unify warning styles for better readability by @oneonlee in #27184
🌐 [i18n-ZH] Translate tflite.md into Chinese by @yyLeaves in #27134
device agnostic fsdp testing by @statelesshz in #27120
Fix docstring get maskformer resize output image size by @wesleylp in #27196
Fix the typos and grammar mistakes in CONTRIBUTING.md. by @THEFZNKHAN in #27193
Fixing docstring in get_resize_output_image_size function by @wesleylp in #27191
added unsqueeze_dim to apply_rotary_pos_emb by @ShashankMosaicML in #27117
Added cache_block_outputs option to enable GPTQ for non-regular models by @AlexKoff88 in #27032
Add TensorFlow implementation of ConvNeXTv2 by @neggles in #25558
Fix docstring in get_oneformer_resize_output_image_size func by @wesleylp in #27207
improving TimmBackbone to support FrozenBatchNorm2d by @rafaelpadilla in #27160
Translate task summary to chinese by @jiaqiw09 in #27180
Fix CPU offload + disk offload tests by @LysandreJik in #27204
Enable split_batches through TrainingArguments by @muellerzr in #26798
support bf16 by @etemadiamd in #25879
Reproducible checkpoint for npu by @statelesshz in #27208
[core / Quantization] Fix for 8bit serialization tests by @younesbelkada in #27234

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@jungnerd
- 🌐 [i18n-KO] Translated semantic_segmentation.md to Korean (#26515)
@statelesshz
- Extend Trainer to enable Ascend NPU to use the fused Adamw optimizer when training (#26194)
- remove SharedDDP as it is deprecated (#25702)
- remove the obsolete code related to fairscale FSDP (#26651)
- make tests of pytorch_example device agnostic (#27081)
- Device agnostic trainer testing (#27131)
- deprecate function get_default_device in tools/base.py (#26774)
- device agnostic pipelines testing (#27129)
- device agnostic models testing (#27146)
- device agnostic fsdp testing (#27120)
- Reproducible checkpoint for npu (#27208)
@sgugger
- Docstring check (#26052)
@yyLeaves
- add zh translation for installation (#26084)
- [i18n-ZH] Translated fast_tokenizers.md to Chinese (#26910)
- 🌐 [i18n-ZH] Translate multilingual into Chinese (#26935)
- 🌐 [i18n-ZH] Translate create_a_model.md into Chinese (#27026)
- 🌐 [i18n-ZH] Translate custom_models.md into Chinese (#27065)
- 🌐 [i18n-ZH] Translate serialization.md into Chinese (#27076)
- 🌐 [i18n-ZH] Translate tflite.md into Chinese (#27134)
@sinking-point
- In assisted decoding, pass model_kwargs to model's forward call (fix prepare_input_for_generation in all models) (#25242)
@rajveer43
- add japanese documentation (#26138)
- Translating en/internal folder docs to Japanese 🇯🇵 (#26747)
- Refactor code part in documentation translated to japanese (#26900)
- Translating en/main_classes folder docs to Japanese 🇯🇵 (#26894)
@alvarorichard
- translation brazilian portuguese (#26769)
@hakunamatata1997
- Added Telugu [te] translations (#26828)
- Added Telugu [te] translation for README.md in main (#27077)
@jiaqiw09
- Translate pipeline_tutorial.md to chinese (#26954)
- translate preprocessing.md to Chinese (#26955)
- translate transformers_agents.md to Chinese (#27046)
- translate traning.md to chinese (#27122)
- Translate task summary to chinese (#27180)
@neggles
- Add TensorFlow implementation of ConvNeXTv2 (#25558)

- Python
Published by LysandreJik over 2 years ago

transformers - Patch release: v4.34.1

A patch release was made for the following three commits:
- Add add_generation_prompt argument to apply_chat_template (https://github.com/huggingface/transformers/pull/26573)
- Fix backward compatibility of Conversation (https://github.com/huggingface/transformers/pull/26741)
- [Tokenizer] Fix slow and fast serialization (https://github.com/huggingface/transformers/pull/26570)

- Python
Published by ArthurZucker over 2 years ago

transformers - v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor

New models

Mistral

Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:

Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
[Mistral] Mistral-7B-v0.1 support by @Bam4d in #26447
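
The sliding-window idea from the list above can be sketched as a boolean mask in which each query position sees only the most recent keys. This is a toy illustration, not the model's attention code: the function name is hypothetical, the convention that the window includes the current token is an assumption, and the figures 4096 (window size) and 32 (layers) are taken from the public Mistral-7B-v0.1 configuration rather than from these notes.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    # Causal (k <= q) and local (k is one of the last `window` positions).
    return [[q - window < k <= q for k in range(seq_len)] for q in range(seq_len)]


mask = sliding_window_causal_mask(seq_len=5, window=3)

# Each layer lets information travel one window further back, so stacking
# layers widens the theoretical span: window size times number of layers.
theoretical_span = 4096 * 32
```

Under those assumed values the product is 131072 tokens, which is the "128K" figure quoted above.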

Persimmon

The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.

[Persimmon] Add support for persimmon by @ArthurZucker in #26042

BROS

BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of using absolute spatial information.

Add BROS by @jinhopark8345 in #23190

ViTMatte

ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.

Add ViTMatte by @NielsRogge in #25843

Nougat

Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.

Add Nougat by @NielsRogge and @molbap in #25942

Prompt templating

We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.

Overhaul Conversation class and prompt templating by @Rocketknight1 in #25323
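
To show what "formatting" means here, the sketch below renders a conversation in a ChatML-style layout using plain Python. It is only a stand-in: real chat templates are Jinja templates stored with the tokenizer and applied through apply_chat_template, whereas render_chatml is a hypothetical helper with one hard-coded layout. Its add_generation_prompt flag is modelled on the argument of the same name in apply_chat_template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Format a list of {"role", "content"} dicts in a ChatML-style layout."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # cue the model to write the next turn
    return text


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
    {"role": "user", "content": "Summarize the release notes."},
]
prompt = render_chatml(chat, add_generation_prompt=True)
```

A model trained on one layout and prompted with another tends to degrade quietly, which is why storing the template alongside the model is useful.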

🚨🚨 Tokenizer refactor

[Tokenizer] attemp to fix add_token issues by @ArthurZucker in #23909
Nit-added-tokens by @ArthurZucker in #26538 adds some fixes to #23909.

🚨Workflow Changes 🚨:

These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes so we highlight them below.

unique_no_split_tokens attribute removed and not used in the internal logic
sanitize_special_tokens() follows a deprecation cycle and does nothing
All attributes in SPECIAL_TOKENS_ATTRIBUTES are stored as AddedTokens and not strings.
loading a slow from a fast or a fast from a slow will no longer raise an error if the tokens added don't have the correct index. This is because they will always be added following the order of the added_tokens but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
the length of a tokenizer is now max(set(self.get_vocab().keys())) accounting for holes in the vocab. The vocab_size no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
Adding a token using tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)]) now takes into account rstrip, lstrip, normalized information.
added_tokens_decoder holds AddedToken, not strings.
add_tokens() for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
initializing a tokenizer from scratch will now add missing special tokens to the vocab.
stripping is not always done for special tokens! 🚨 Only if the AddedToken has lstrip=True and rstrip=True
fairseq_ids_to_tokens attribute removed for Barthez (was not used)

➕ Most visible features:
- printing a tokenizer now shows tokenizer.added_tokens_decoder for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
- faster from_pretrained, faster add_tokens because special and non special can be mixed together and the trie is not always rebuilt.
- faster encode/decode with caching mechanism for added_tokens_decoder/encoder.
- information is fully saved in the tokenizer_config.json

For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.

Flash Attention 2

Flash Attention 2 (FA2) support has been added to transformers for the most popular architectures (llama, mistral, falcon); more architectures are actively being contributed in this issue (https://github.com/huggingface/transformers/issues/26350). Simply pass use_flash_attention_2=True when calling from_pretrained.
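
As a rough sketch of what that call might look like, the helper below wraps from_pretrained with the flag named above. The function name and the choice of float16 are assumptions for illustration; actually running it requires a CUDA GPU, the flash-attn package, and a checkpoint whose architecture supports FA2.

```python
def load_with_flash_attention_2(model_id):
    """Load a causal LM with Flash Attention 2 enabled (hypothetical helper)."""
    # Imports are deferred so that defining the helper has no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,   # FA2 kernels expect half precision (fp16 or bf16)
        use_flash_attention_2=True,  # flag described in these release notes
    )
```

The returned model still has to be moved to the GPU before running a forward pass.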

In the future, PyTorch will support Flash Attention 2 through torch.nn.functional.scaled_dot_product_attention. Users will then be able to benefit from both implementations of Flash Attention 2 (transformers core, and transformers + SDPA) with simple changes: calling model.to_bettertransformer() and, in the SDPA case, force-dispatching the SDPA kernel to FA-2.

[core ] Integrate Flash attention 2 in most used models by @younesbelkada in #25598

For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: https://github.com/huggingface/transformers/issues/26557

Lazy import structure

Support for lazily loading integration libraries has been added. This drastically speeds up importing transformers and related objects from the library.

Example before this change:

2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel" 3.31s user 3.06s system 220% cpu 2.893 total

After this change:

python3 -c "from transformers import CLIPTextModel" 1.70s user 1.49s system 220% cpu 1.447 total

[Core] Add lazy import structure to imports by @patrickvonplaten in #26090
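
The general mechanism behind a lazy import structure can be sketched with the standard library alone: a module-like object that imports a submodule only when one of its attributes is first requested. This is a simplified illustration under assumed names (LazyModule, import_structure), not the library's own class, and it uses json and math as stand-ins for heavy dependencies.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Resolve attributes on first access instead of importing everything up front."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported attribute to the module that defines it.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # modules imported so far, in order

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        mapping = self.__dict__.get("_attr_to_module", {})
        if attr not in mapping:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(mapping[attr])
        self.loaded.append(mapping[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy = LazyModule("demo", {"json": ["dumps"], "math": ["sqrt"]})
loaded_before = list(lazy.loaded)  # nothing has been imported through `lazy` yet
root = lazy.sqrt(9.0)              # first access triggers the import of math only
```

Because the cost of each import is paid only by code paths that need it, a statement such as the CLIPTextModel import timed above no longer has to load unrelated integrations.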

Bugfixes and improvements

Fix typo by @susnato in #25966
Fix Detr CI by @ydshieh in #25972
Fix test_load_img_url_timeout by @ydshieh in #25976
nn.Identity is not required to be compatible with PyTorch < 1.1.0 as the minimum PyTorch version we currently support is 1.10.0 by @statelesshz in #25974
Add Pop2Piano space demo. by @susnato in #25975
fix typo by @kai01ai in #25981
Use main in conversion script by @ydshieh in #25973
[doc] Always call it Agents for consistency by @julien-c in #25958
Update RAG README.md with correct path to examples/seq2seq by @tleyden in #25953
Update training_args.py to remove the runtime error by @sahel-sh in #25920
Trainer: delegate default generation values to generation_config by @gante in #25987
Show failed tests on CircleCI layout in a better way by @ydshieh in #25895
Patch with accelerate xpu by @abhilash1910 in #25714
PegasusX add _no_split_modules by @andreeahedes in #25933
Add TFDebertaV2ForMultipleChoice by @raghavanone in #25932
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler by @pacman100 in #25863
[Wav2Vec2 Conformer] Fix inference float16 by @sanchit-gandhi in #25985
Add LLaMA resources by @eenzeenee in #25859
[CI] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
[VITS] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
Fix Mega chunking error when using decoder-only model by @tanaymeh in #25765
save space when converting hf model to megatron model. by @flower-with-safe in #25950
Update README.md by @NinoRisteski in #26003
Falcon: fix revision propagation by @LysandreJik in #26006
TF-OPT attention mask fixes by @Rocketknight1 in #25238
Fix small typo README.md by @zspo in #25934
🌐[i18n-KO] Translated llm_tutorial.md to Korean by @harheem in #25791
Remove Falcon from undocumented list by @Rocketknight1 in #26008
modify context length for GPTQ + version bump by @SunMarc in #25899
Fix err with FSDP by @muellerzr in #25991
fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 by @kai01ai in #26024
Fix CircleCI config by @ydshieh in #26023
Add tgs speed metrics by @CokeDong in #25858
[VITS] Fix nightly tests by @sanchit-gandhi in #25986
Added HerBERT to README.md by @Muskan011 in #26020
Fix vilt config docstring parameter to match value in init by @raghavanone in #26017
Punctuation fix by @kwonmha in #26025
Try to fix training Loss inconsistent after resume from old checkpoint by @dumpmemory in #25872
Fix Dropout Implementation in Graphormer by @alexanderkrauck in #24817
Update missing docs on activation_dropout and fix DropOut docs for SEW-D by @gau-nernst in #26031
Skip warning if tracing with dynamo by @angelayi in #25581
🌐 [i18n-KO] Translated llama.md to Korean by @harheem in #26044
[CodeLlamaTokenizerFast] Fix fix set_infilling_processor to properly reset by @ArthurZucker in #26041
[CITests] skip failing tests until #26054 is merged by @ArthurZucker in #26063
only main process should call _save on deepspeed zero3 by @zjjMaiMai in #25959
docs: update link huggingface map by @pphuc25 in #26077
docs: add space to docs by @pphuc25 in #26067
[core] Import tensorflow inside relevant methods in trainer_utils by @younesbelkada in #26106
Generate: legacy mode is only triggered when generation_config is untouched by @gante in #25962
Update logits_process.py docstrings by @larekrow in #25971
Fix ExponentialDecayLengthPenalty negative logits issue by @pokjay in #25594
🌐 [i18n-KO] Translated llama2.md to Korean by @mjk0618 in #26047
[docs] Updates to TTS task guide with regards to the new TTS pipeline by @MKhalusova in #26095
🌐 [i18n-KO] Translated contributing.md to Korean by @mjk0618 in #25877
enable optuna multi-objectives feature by @sywangyi in #25969
chore: correct update_step and correct gradient_accumulation_steps by @pphuc25 in #26068
Text2text pipeline: don't parameterize from the config by @gante in #26118
Fix MarianTokenizer to remove metaspace character in decode by @tanaymeh in #26091
safeguard torch distributed check by @pacman100 in #26056
fix the deepspeed tests by @pacman100 in #26021
Fix AutoTokenizer docstring typo by @amyeroberts in #26117
[core] fix 4bit num_parameters by @younesbelkada in #26132
Add missing space in generation/utils.py by @jbochi in #26121
Update spectrogram and waveform model mapping for TTS/A pipeline by @Vaibhavs10 in #26114
[RWKV] Final fix RWMV 4bit by @younesbelkada in #26134
docs: feat: add llama2 notebook resources from OSSCA community by @junejae in #26076
Generate: ignore warning when generation_config.max_length is set to None by @gante in #26147
Fix test_finetune_bert2bert by @ydshieh in #25984
Falcon: batched generation by @gante in #26137
Fix beam_scores shape when token scores shape changes after logits_processor by @BakerBunker in #25980
Update training_args.py - addition of self.distributed_state when using XPU by @Serizao in #25999
[docs] last hidden state vs hidden_states[-1] by @MKhalusova in #26142
Flex xpu bug fix by @abhilash1910 in #26135
Add missing Maskformer dataclass decorator, add dataclass check in ModelOutput for subclasses by @rachthree in #25638
Fix eval accumulation when accelerate > 0.20.3 by @sam-scale in #26060
[Whisper Tokenizer] Encode timestamps by @sanchit-gandhi in #26054
[PEFT] Fix PEFT + gradient checkpointing by @younesbelkada in #25846
[MusicGen] Add streamer to generate by @sanchit-gandhi in #25320
Fix beam search when using model parallel by @pfldy2850 in #24969
[MusicGen] Add sampling rate to config by @sanchit-gandhi in #26136
[Whisper] Fix word-level timestamps for audio < 30 seconds by @xenova in #25607
[BLIP-2] Improve conversion script by @NielsRogge in #24854
IDEFICS: allow interpolation of vision's pos embeddings by @leot13 in #26029
[TTA Pipeline] Test MusicGen and VITS by @sanchit-gandhi in #26146
Tweaks to Chat Templates docs by @Rocketknight1 in #26168
[Whisper] Check length of prompt + max new tokens by @sanchit-gandhi in #26164
Update notebook.py to support multi eval datasets by @matrix1001 in #25796
Fix pad to multiple of by @ArthurZucker in #25732
[docs] IDEFICS guide and task guides restructure by @MKhalusova in #26035
[PEFT] Allow PEFT model dict to be loaded by @patrickvonplaten in #25721
No doctest for convert_bros_to_pytorch.py by @ydshieh in #26212
Remove utils/documentation_tests.txt by @ydshieh in #26213
moved ctrl to Salesforce/ctrl by @julien-c in #26183
Fix ConversationalPipeline tests by @Rocketknight1 in #26217
[FSMT] Fix non-shared weights by @LysandreJik in #26187
refactor decay_parameters production into its own function by @shijie-wu in #26152
refactor: change default block_size in block size > max position embeddings by @pphuc25 in #26069
[Wav2Vec2-Conf / LLaMA] Style fix by @sanchit-gandhi in #26188
[Permission] Style fix by @sanchit-gandhi in #26228
[Check] Fix config docstring by @sanchit-gandhi in #26222
🌐 [i18n-KO] Translated whisper.md to Korean by @nuatmochoi in #26002
Create the return value on device to avoid unnecessary copying from CPU by @mksit in #26151
[AutoBackbone] Add test by @NielsRogge in #26094
Update README.md by @NinoRisteski in #26198
Update add_new_pipeline.md by @NinoRisteski in #26197
[docs] Fix model reference in zero shot image classification example by @Aleksandar1932 in #26206
Fix the gitlab user mention in issue templates to the correct user by @muellerz in #26237
Fix some docstring in image processors by @ydshieh in #26235
Fix gated repo tests by @Wauplin in #26257
Fix Error not captured in PR doctesting by @ydshieh in #26215
DeepSpeed ZeRO-3 handling when resizing embedding layers by @pacman100 in #26259
[FIX] resize_token_embeddings by @passaglia in #26102
FSDP tests and checkpointing fixes by @pacman100 in #26180
fix name error when accelerate is not available by @pacman100 in #26278
Update bros checkpoint by @jinhopark8345 in #26277
Integrate AMD GPU in CI/CD environment by @mfuntowicz in #26007
Rewrite for custom code warning messages by @Rocketknight1 in #26291
fix deepspeed available detection by @fxmarty in #26252
add bbox input validation by @jinhopark8345 in #26294
include changes from llama by @ArthurZucker in #26260
[Trainer] Refactor trainer + bnb logic by @younesbelkada in #26248
add custom RMSNorm to ALL_LAYERNORM_LAYERS by @shijie-wu in #26227
Keep relevant weights in fp32 when model._keep_in_fp32_modules is set even when accelerate is not installed by @fxmarty in #26225
Fix FSMT weight sharing by @LysandreJik in #26292
update hf hub dependency to be compatible with the new tokenizers by @ArthurZucker in #26301
Porting the torchaudio kaldi fbank implementation to audio_utils by @ylacombe in #26182
More error message fixup, plus some linebreaks! by @Rocketknight1 in #26296
[QUICK FIX LINK] Update trainer.py by @SoyGema in #26293
Use CircleCI store_test_results by @ydshieh in #26223
Fix doctest CI by @ydshieh in #26324
[doc] fixed indices in obj detection example by @MKhalusova in #26343
[TTA Pipeline] Fix MusicGen test by @sanchit-gandhi in #26348
Add image to image pipeline by @LeviVasconcelos in #25393
feat: adding num_proc to load_dataset by @pphuc25 in #26326
Fixed unclosed p tags by @HanSeokhyeon in #26240
Update add_new_model.md by @NinoRisteski in #26365
Fix MusicGen logging error by @osanseviero in #26370
[docs] removed MaskFormerSwin and TimmBackbone from the table on index.md by @MKhalusova in #26347
Update tiny model information and pipeline tests by @ydshieh in #26285
Add Russian localization for README by @qweme32 in #26208
🌐 [i18n-KO] Translated audio_classification.mdx to Korean by @gabrielwithappy in #26200
[ViTMatte] Add resources by @NielsRogge in #26317
Deleted duplicate sentence by @titi-devv in #26394
added support for gradient checkpointing in ESM models by @sanjeevk-os in #26386
Fix DeepSpeed issue with Idefics by @HugoLaurencon in #26393
Add torch RMSProp optimizer by @natolambert in #26425
Fix padding for IDEFICS by @shauray8 in #26396
Update semantic_segmentation.md by @zekaouinoureddine in #26419
Fixing tokenizer when transformers is installed without tokenizers by @urialon in #26236
[FA / tests] Add use_cache tests for FA models by @younesbelkada in #26415
add bf16 mixed precision support for NPU by @statelesshz in #26163
[PEFT] Fix PEFT multi adapters support by @younesbelkada in #26407
Fix failing doctest by @LysandreJik in #26450
Update runs-on in workflow files by @ydshieh in #26435
[i18n-DE] Complete first toc chapter by @flozi00 in #26311
🌐 [i18n-KO] Translated debugging.md to Korean by @wonhyeongseo in #26246
🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean by @wonhyeongseo in #26244
optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @NormXU in #26139
Fix cos_sin device issue in Falcon model by @ydshieh in #26448
docs: change assert to raise and some small docs by @pphuc25 in #26232
change mention of decoder_input_ids to input_ids and same with decode_inputs_embeds by @tmabraham in #26406
[VITS] Fix speaker_embed device mismatch by @fakhirali in #26115
[PEFT] introducing adapter_kwargs for loading adapters from different Hub location (subfolder, revision) than the base model by @younesbelkada in #26270
Do not warn about unexpected decoder weights when loading T5EncoderModel and LongT5EncoderModel by @fleonce in #26211
fix_mbart_tied_weights by @SunMarc in #26422
Esm checkpointing by @Amelie-Schreiber in #26454
[Whisper Tokenizer] Make decoding faster after adding timestamps by @sanchit-gandhi in #26299
[docs] Update offline mode docs by @stevhliu in #26478
[docs] navigation improvement between text gen pipelines and text gen params by @MKhalusova in #26477
Skip 2 failing persimmon pipeline tests for now by @ydshieh in #26485
Avoid all-zero attention mask used in testing by @ydshieh in #26469
[Flax Examples] Seq2Seq ASR Fine-Tuning Script by @sanchit-gandhi in #21764
[ASR Pipe] Improve docs and error messages by @sanchit-gandhi in #26476
Revert falcon exception by @LysandreJik in #26472
Fix num_heads in _upad_input by @fs4r in #26490
Fix requests connection error during modelcard creation by @jphme in #26518
Fix issue of canine forward requiring input_ids anyway by @marcmk6 in #26290
Fix broken link to video classification task by @HelgeS in #26487
[PEFT] Pass token when calling find_adapter_config by @younesbelkada in #26488
[core/ auto ] Fix bnb test with code revision + bug with code revision by @younesbelkada in #26431
Fix model integration ci by @ArthurZucker in #26322
[PEFT] Protect adapter_kwargs check by @younesbelkada in #26537
Remove-warns by @ArthurZucker in #26483
[Doctest] Add configuration_roformer.py by @Adithya4720 in #26530
Code-llama-nit by @ArthurZucker in #26300
add build_inputs_with_special_tokens to LlamaFast by @ArthurZucker in #26297
🌐 [i18n-KO] Translated tokenizer_summary.md to Korean by @wonhyeongseo in #26243
[i18n-DE] contribute chapter by @flozi00 in #26481
[RFC, Logging] Change warning to info by @patrickvonplaten in #26545
Add tokenizer kwargs to fill mask pipeline. by @nmcahill in #26234
[Wav2Vec2 and Co] Update init tests for PT 2.1 by @sanchit-gandhi in #26494
[AMD] Add initial version for run_tests_multi_gpu by @mfuntowicz in #26346
[Doctest] Add configuration_encoder_decoder.py by @SrijanSahaySrivastava in #26519
[InternLM] Add support for InternLM by @Rocketknight1 in #26302

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@jinhopark8345
- Add BROS (#23190)
- Update bros checkpoint (#26277)
- add bbox input validation (#26294)
@qweme32
- Add Russian localization for README (#26208)
@Bam4d
- [Mistral] Mistral-7B-v0.1 support (#26447)
@flozi00
- [i18n-DE] Complete first toc chapter (#26311)
- [i18n-DE] contribute chapter (#26481)
@wonhyeongseo
- 🌐 [i18n-KO] Translated debugging.md to Korean (#26246)
- 🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean (#26244)
- 🌐 [i18n-KO] Translated tokenizer_summary.md to Korean (#26243)

- Python
Published by LysandreJik over 2 years ago

transformers - Patch release: v4.33.3

A patch release was made for the following three commits:

DeepSpeed ZeRO-3 handling when resizing embedding layers (#26259)
[doc] Always call it Agents for consistency (#25958)
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)

- Python
Published by LysandreJik over 2 years ago

transformers - Patch release: v4.33.2

A patch release was done for these two commits:

Fix pad to multiple of (#25732)
fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 (#26024)

- Python
Published by LysandreJik over 2 years ago
