Transformers v5 Release: PyTorch-First AI Library Update

Hugging Face releases Transformers v5 with PyTorch-only backend, quantization as first-class feature, and enhanced interoperability across the AI ecosystem.

by HowAIWorks Team
Tags: Transformers, Hugging Face, PyTorch, AI Libraries, Machine Learning, Open Source, Quantization, AI Development, Model Training, AI Inference, Deep Learning

Introduction

Hugging Face has released the first release candidate of Transformers v5.0.0, marking a significant milestone in the evolution of one of the most widely used artificial intelligence libraries. Arriving roughly five years after the first release candidate for version 4, this release represents a major shift toward simplicity, interoperability, and production-ready capabilities that will shape the AI development ecosystem for years to come.

The Transformers library has experienced extraordinary growth since v4, with installations increasing from 20,000 per day to more than 3 million per day via pip. The library has now surpassed 1.2 billion total installs, supporting over 400 model architectures (up from 40 in v4) and enabling access to more than 750,000 model checkpoints on the Hugging Face Hub (up from roughly 1,000 at the time of v4).

This growth reflects the mainstream adoption of AI and the library's central role in the open-source AI ecosystem. Transformers v5 focuses on four key areas: simplicity, training, inference, and production deployment, with interoperability as the overarching theme connecting all improvements.

Simplicity and Model Architecture

Modular Design Approach

One of the most significant improvements in v5 is the modular design approach for model architectures, which makes it easier to add new models, maintain existing ones, and collaborate across the community. The modular approach has dramatically reduced the lines of code required to contribute a new model, lowering the barrier for contributors while reducing the maintenance burden.

The introduction of the AttentionInterface exemplifies this modular philosophy. This centralized abstraction handles the various attention methods: the eager implementation remains in the modeling files, while specialized implementations such as FlashAttention 1/2/3, FlexAttention, and SDPA sit behind the interface. This standardization makes it easier to understand how models differ and what features each model supports.
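
As a rough illustration of how this surfaces in user code, the attention backend can be chosen at load time and custom implementations can be registered by name. This is a minimal sketch based on the attention-interface usage documented in recent releases (the model ID is a placeholder and exact signatures may evolve in v5):

```python
from transformers import AttentionInterface, AutoModelForCausalLM
from transformers.integrations.sdpa_attention import sdpa_attention_forward

model_id = "meta-llama/Llama-3.2-1B"  # any causal LM on the Hub

# 1) Select a built-in attention backend at load time (eager, sdpa, flash_attention_2, ...).
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")

# 2) Register a custom attention function and refer to it by name.
def logged_sdpa(*args, **kwargs):
    # Wrap the stock SDPA implementation, e.g. to add logging or profiling hooks.
    return sdpa_attention_forward(*args, **kwargs)

AttentionInterface.register("logged_sdpa", logged_sdpa)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="logged_sdpa")
```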

Streamlined Codebase

Transformers v5 includes significant refactoring of modeling and tokenization files:

  • Modeling files: Improved through modular approach and standardization, focusing only on relevant parts for forward/backward passes
  • Tokenization: Simplified around the Rust-based tokenizers library as the single backend, removing the distinction between "Fast" and "Slow" tokenizers
  • Image processors: Now exist only in their fast variant, backed by torchvision
  • PyTorch-only: Sunsetting Flax/TensorFlow support in favor of PyTorch as the sole backend

These changes result in cleaner, more maintainable code that's easier for developers to understand and contribute to. The library maintains its position as the "source of truth" for model definitions while becoming more accessible to the broader community.
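
From the user side, these changes are mostly invisible: the auto classes simply resolve to the fast, backend-accelerated implementations. A minimal sketch (model names are just public examples):

```python
from transformers import AutoImageProcessor, AutoTokenizer

# Tokenizers now always use the Rust-backed `tokenizers` library;
# there is no separate "slow" pure-Python implementation to fall back to.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)  # a *Fast tokenizer class

# Image processors similarly exist only in their fast, torchvision-backed variant.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
print(type(image_processor).__name__)
```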

Automated Model Conversion

Hugging Face is building tooling to help identify which existing model architecture a new model resembles, using machine learning to find code similarities between modeling files. The goal is to automate the conversion process by opening draft pull requests for new models, reducing manual effort and ensuring consistency across the library.

Training Capabilities

Pre-training at Scale

Transformers v5 includes significant improvements for pre-training at scale, which was previously less of a focus compared to fine-tuning. The team has reworked model initialization to work at scale with different parallelism paradigms and shipped support for optimized kernels for both forward and backward passes.
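
One way this shows up in user code is tensor-parallel loading directly through from_pretrained. The sketch below assumes the tp_plan argument documented in recent releases (the model ID and launch command are illustrative, and the API may evolve in v5):

```python
from transformers import AutoModelForCausalLM

# Shard the model across the GPUs of one node using the built-in tensor-parallel plan.
# Typically launched with: torchrun --nproc-per-node 4 pretrain.py
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    tp_plan="auto",
)
```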

The library now has extended compatibility with major pre-training tools including:

  • torchtitan: Training framework for large-scale models
  • megatron: Distributed training framework
  • nanotron: Scalable training framework

This compatibility ensures that Transformers models can be used seamlessly with the tools needed for large-scale pre-training, making the library more versatile for different training scenarios.

Fine-tuning and Post-training

The library continues to collaborate closely with fine-tuning tools in the Python ecosystem, ensuring compatibility with:

  • Unsloth: Efficient fine-tuning framework
  • Axolotl: Post-training tool for modern LLMs
  • LlamaFactory: Model training framework
  • TRL: Transformer Reinforcement Learning library
  • MaxText: JAX-based training framework

All fine-tuning and post-training tools can now rely on Transformers for model definitions, further enabling agentic use cases through OpenEnv or the Prime Environment Hub. This interoperability makes it easier for developers to choose the right tool for their specific training needs while using consistent model definitions.
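
As one hedged example of this interoperability, TRL's SFTTrainer accepts a plain Hub model ID and instantiates the Transformers model definition under the hood. The sketch below assumes TRL's current SFTTrainer/SFTConfig API; the dataset and model names are public examples, not a recommendation:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any checkpoint with a Transformers model definition can be passed by name.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-output", max_steps=100),
)
trainer.train()
```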

Inference Performance

Specialized Kernels and Optimizations

Transformers v5 puts significant focus on inference performance, with several paradigm changes:

  • Automatic kernel usage: Kernels are automatically used when hardware and software permit it, improving performance without requiring manual configuration
  • Continuous batching: Support for continuous batching mechanisms to handle multiple requests efficiently
  • Paged attention: Implementation of paged attention for more efficient memory usage during inference

These improvements are particularly valuable for evaluation scenarios where a large number of inference requests are processed simultaneously. The library aims to provide good performance while maintaining compatibility with specialized inference engines.

New Inference APIs

The release introduces two new APIs dedicated to inference:

  • Continuous batching and paged attention: Now available for production use after internal testing
  • transformers serve: A new Transformers-native serving command that launches an OpenAI API-compatible server

These APIs make it easier to deploy Transformers models in production environments while maintaining compatibility with existing OpenAI-based workflows.
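
For instance, once a local server has been started with transformers serve, any OpenAI-compatible client can talk to it. This is a minimal sketch in which the port and model name are assumptions for illustration, not guaranteed defaults:

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running `transformers serve` endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Summarize what Transformers v5 changes."}],
)
print(response.choices[0].message.content)
```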

Integration with Inference Engines

Transformers v5 works closely with popular inference engines, using Transformers as a backend:

  • vLLM: High-performance inference engine
  • SGLang: Fast structured generation framework
  • TensorRT-LLM: NVIDIA's optimized inference engine

The value proposition is significant: as soon as a model is added to Transformers, it becomes available in these inference engines while taking advantage of each engine's strengths, including inference optimizations, specialized kernels, and dynamic batching.
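
For example, vLLM can be pointed explicitly at the Transformers modeling code. This is a hedged sketch assuming vLLM's documented model_impl option; the model ID is a placeholder:

```python
from vllm import LLM, SamplingParams

# Ask vLLM to use the Transformers model implementation as its backend.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")

outputs = llm.generate(
    ["What does interoperability between libraries buy you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```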

Production and Local Deployment

Cross-Platform Compatibility

Transformers v5 emphasizes interoperability with various deployment targets:

  • ONNXRuntime: Easy conversion to ONNX format for cross-platform deployment
  • llama.cpp: Direct compatibility with GGUF files for local inference
  • MLX: Direct compatibility with MLX's models using safetensors files
  • executorch: Working with the executorch team for on-device deployment

This compatibility makes it possible to train a model with Unsloth/Axolotl/LlamaFactory, deploy it with vLLM/SGLang, and export it to llama.cpp/executorch/MLX for local execution. The library serves as a central hub that connects the entire AI development and deployment pipeline.

Model Format Support

The library now supports easy conversion between formats:

  • GGUF files: Can be loaded in Transformers for further fine-tuning
  • Safetensors: Direct compatibility with MLX and other frameworks
  • ONNX: Export capabilities for production deployment

This format flexibility ensures that models can move seamlessly between different tools and deployment targets, reducing friction in the development workflow.
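
As a hedged illustration of the GGUF path (the repository and file names below are examples taken from public GGUF repos), a quantized GGUF checkpoint can be loaded back into Transformers, dequantized into standard PyTorch weights, and then fine-tuned or re-saved as safetensors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# The gguf_file argument dequantizes the GGUF weights into a regular PyTorch state dict.
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

# The result can be trained further or exported back to safetensors.
model.save_pretrained("./tinyllama-dequantized")
```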

Quantization as First-Class Feature

Central Focus on Quantization

Quantization is emerging as the standard for state-of-the-art model development, with many SOTA models released in low-precision formats (8-bit, 4-bit); examples include gpt-oss, Kimi-K2, and DeepSeek-R1. Hardware is increasingly optimized for low-precision workloads, and the community is actively sharing high-quality quantized checkpoints.

Transformers v5 makes quantization a first-class citizen, ensuring full compatibility with all major features and delivering a reliable framework for both training and inference. This represents a significant change in how weights are loaded in models, with quantization integrated at the core level rather than as an add-on feature.
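
For example, a checkpoint can be quantized to 4-bit at load time through the long-standing quantization_config path (the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit NF4 via bitsandbytes as the model is loaded.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```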

Collaboration with Quantization Tools

The release includes close collaboration with quantization tooling:

  • TorchAO: Integration of TorchAO quantization features
  • bitsandbytes: Better support for key features like tensor parallelism and mixture-of-experts models
  • New quantization methods: Easier integration of new quantization techniques

This collaboration ensures that developers have access to the latest quantization techniques while maintaining compatibility with the broader Transformers ecosystem.
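
A hedged sketch of the TorchAO path follows; the quantization scheme name and arguments follow the documented TorchAoConfig usage and may differ between torchao versions:

```python
from transformers import AutoModelForCausalLM, TorchAoConfig

# Apply int4 weight-only quantization through the torchao integration.
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```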

Ecosystem Impact

Community Growth

The Transformers ecosystem has grown dramatically since v4:

  • Model architectures: From 40 to over 400
  • Model checkpoints: From 1,000 to 750,000+
  • Daily installations: From 20,000 to 3+ million
  • Total installs: Surpassed 1.2 billion

This growth reflects both the expansion of the AI field and the library's central role in making AI accessible to developers worldwide.

Industry Partnerships

Transformers v5 benefits from collaboration with major players in the AI ecosystem:

  • PyTorch Foundation: Working closely with PyTorch as the primary backend
  • Inference engines: vLLM, SGLang, TensorRT-LLM
  • Training frameworks: Unsloth, Axolotl, LlamaFactory, MaxText
  • Deployment tools: llama.cpp, MLX, ONNXRuntime, executorch

These partnerships ensure that Transformers remains compatible with the tools developers actually use, creating a cohesive ecosystem rather than isolated tools.

Technical Improvements

Code Quality and Maintainability

The refactoring work in v5 results in:

  • Cleaner code: More focused modeling files that only contain relevant parts
  • Better abstractions: Common functionality moved to shared interfaces
  • Easier contributions: Reduced code requirements for adding new models
  • Improved readability: Standardization makes it easier to understand model differences

These improvements benefit both maintainers and users, making the library more accessible and easier to work with.

Performance Optimizations

The release includes various performance improvements:

  • Optimized kernels: Automatic usage when available
  • Better memory management: Paged attention and other optimizations
  • Faster inference: Continuous batching and other techniques
  • Efficient training: Support for various parallelism paradigms

These optimizations ensure that Transformers models can be used efficiently across different hardware configurations and use cases.

Migration and Compatibility

Breaking Changes

Transformers v5 includes some breaking changes:

  • Flax/TensorFlow support: Being sunsetted in favor of PyTorch
  • Tokenizer changes: Removal of "Fast" and "Slow" distinction
  • Image processor changes: Only fast variant available

Users should review the release notes to understand migration requirements for their specific use cases.

Backward Compatibility

Despite breaking changes, the library maintains compatibility where possible:

  • Model checkpoints: Existing checkpoints remain compatible
  • API stability: Core APIs remain stable
  • Migration guides: Documentation provided for breaking changes

The team has worked to minimize disruption while moving toward a cleaner, more maintainable codebase.

Conclusion

Transformers v5 represents a significant evolution of one of the most important libraries in the AI ecosystem. The focus on interoperability, simplicity, and production readiness positions the library to continue serving as the foundation for AI development across training, inference, and deployment scenarios.

The move to PyTorch-only, the elevation of quantization to first-class status, and the emphasis on ecosystem compatibility reflect the current state of AI development while preparing for future needs. The library's growth from 20,000 to 3+ million daily installations demonstrates its critical role in making AI accessible to developers worldwide.

As the AI field continues to evolve rapidly, Transformers v5 provides a stable, well-maintained foundation that connects the entire development pipeline—from training with specialized frameworks to deployment on various platforms. The emphasis on interoperability ensures that developers can choose the best tools for each stage of their workflow while maintaining consistency through shared model definitions.

To learn more about AI development and model training, explore our AI fundamentals courses, check out our glossary of AI terms, or browse our AI tools catalog for related development tools and frameworks.

Frequently Asked Questions

What is Transformers v5?
Transformers v5 is the latest major release of Hugging Face's popular AI library, featuring PyTorch-only support, quantization as a first-class citizen, improved interoperability, and enhanced training and inference capabilities.

What are the key changes in v5?
Key changes include moving to a PyTorch-only backend (sunsetting Flax/TensorFlow), making quantization a first-class feature, introducing a modular architecture for easier model additions, and improving interoperability with inference engines like vLLM and SGLang.

How widely is the library used?
Transformers is installed more than 3 million times per day via pip, up from 20,000 per day in the v4 era. The library has surpassed 1.2 billion total installs and supports over 400 model architectures with 750,000+ model checkpoints on the Hub.

What is the overarching theme of the release?
The overarching theme of v5 is interoperability, with improvements in simplicity, training capabilities, inference performance, and production deployment. The library now works seamlessly with tools like Unsloth, Axolotl, vLLM, SGLang, llama.cpp, and MLX.

What happens to Flax and TensorFlow support?
Hugging Face is sunsetting Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend. However, they're working with partners in the JAX ecosystem to ensure compatibility between Transformers models and JAX-based frameworks.

How does v5 handle quantization?
Quantization is now a first-class citizen in v5, with full compatibility across all major features. This supports the growing trend of low-precision model formats (8-bit, 4-bit) used by many state-of-the-art models, with improved support for both training and inference.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.