Transformers v5 Release: PyTorch-First AI Library Update

Hugging Face releases Transformers v5 with PyTorch-only backend, quantization as first-class feature, and enhanced interoperability across the AI ecosystem.

by HowAIWorks Team
Tags: Transformers, Hugging Face, PyTorch, AI Libraries, Machine Learning, Open Source, Quantization, AI Development, Model Training, AI Inference, Deep Learning

Introduction

Hugging Face has released the first release candidate of Transformers v5.0.0, marking a significant milestone in the evolution of one of the most widely used artificial intelligence libraries. Arriving roughly five years after the first release candidate for version 4, this release represents a major shift toward simplicity, interoperability, and production-ready capabilities that will shape the AI development ecosystem for years to come.

The Transformers library has experienced extraordinary growth since v4, with installations increasing from 20,000 per day to more than 3 million per day via pip. The library has now surpassed 1.2 billion total installs, supporting over 400 model architectures (up from 40 in v4) and enabling access to more than 750,000 model checkpoints on the Hugging Face Hub (up from roughly 1,000 at the time of v4).

This growth reflects the mainstream adoption of AI and the library's central role in the open-source AI ecosystem. Transformers v5 focuses on four key areas: simplicity, training, inference, and production deployment, with interoperability as the overarching theme connecting all improvements.

Simplicity and Model Architecture

Modular Design Approach

One of the most significant improvements in v5 is the modular design approach for model architectures, which makes it easier to add new models, maintain existing ones, and collaborate across the community. The modular approach has dramatically reduced the lines of code required to contribute a new model, lowering the barrier for contributors while reducing the maintenance burden.

The introduction of the AttentionInterface exemplifies this modular philosophy. This centralized abstraction handles the various attention methods: the eager implementation remains in the modeling files, while specialized implementations such as FlashAttention 1/2/3, FlexAttention, and SDPA sit behind the interface. This standardization makes it easier to understand how models differ and what features each model supports.
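
As a rough illustration of how this surfaces in user code, the attention backend can be chosen at load time and custom implementations can be registered by name. This is a minimal sketch based on the attention-interface usage documented in recent releases (the model ID is a placeholder and exact signatures may evolve in v5):

```python
from transformers import AttentionInterface, AutoModelForCausalLM
from transformers.integrations.sdpa_attention import sdpa_attention_forward

model_id = "meta-llama/Llama-3.2-1B"  # any causal LM on the Hub

# 1) Select a built-in attention backend at load time (eager, sdpa, flash_attention_2, ...).
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")

# 2) Register a custom attention function and refer to it by name.
def logged_sdpa(*args, **kwargs):
    # Wrap the stock SDPA implementation, e.g. to add logging or profiling hooks.
    return sdpa_attention_forward(*args, **kwargs)

AttentionInterface.register("logged_sdpa", logged_sdpa)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="logged_sdpa")
```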

Streamlined Codebase

Transformers v5 includes significant refactoring of modeling and tokenization files:

  • Modeling files: Improved through modular approach and standardization, focusing only on relevant parts for forward/backward passes
  • Tokenization: Simplified around the Rust-based tokenizers library as the single backend, removing the distinction between "Fast" and "Slow" tokenizers
  • Image processors: Now exist only in their fast variant, backed by torchvision
  • PyTorch-only: Sunsetting Flax/TensorFlow support in favor of PyTorch as the sole backend

These changes result in cleaner, more maintainable code that's easier for developers to understand and contribute to. The library maintains its position as the "source of truth" for model definitions while becoming more accessible to the broader community.
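
From the user side, these changes are mostly invisible: the auto classes simply resolve to the fast, backend-accelerated implementations. A minimal sketch (model names are just public examples):

```python
from transformers import AutoImageProcessor, AutoTokenizer

# Tokenizers now always use the Rust-backed `tokenizers` library;
# there is no separate "slow" pure-Python implementation to fall back to.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)  # a *Fast tokenizer class

# Image processors similarly exist only in their fast, torchvision-backed variant.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
print(type(image_processor).__name__)
```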

Automated Model Conversion

Hugging Face is building tooling to help identify which existing model architecture a new model resembles, using machine learning to find code similarities between modeling files. The goal is to automate the conversion process by opening draft pull requests for new models, reducing manual effort and ensuring consistency across the library.

Training Capabilities

Pre-training at Scale

Transformers v5 includes significant improvements for pre-training at scale, which was previously less of a focus compared to fine-tuning. The team has reworked model initialization to work at scale with different parallelism paradigms and shipped support for optimized kernels for both forward and backward passes.
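
One way this shows up in user code is tensor-parallel loading directly through from_pretrained. The sketch below assumes the tp_plan argument documented in recent releases (the model ID and launch command are illustrative, and the API may evolve in v5):

```python
from transformers import AutoModelForCausalLM

# Shard the model across the GPUs of one node using the built-in tensor-parallel plan.
# Typically launched with: torchrun --nproc-per-node 4 pretrain.py
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    tp_plan="auto",
)
```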

The library now has extended compatibility with major pre-training tools including:

  • torchtitan: Training framework for large-scale models
  • megatron: Distributed training framework
  • nanotron: Scalable training framework

This compatibility ensures that Transformers models can be used seamlessly with the tools needed for large-scale pre-training, making the library more versatile for different training scenarios.

Fine-tuning and Post-training

The library continues to collaborate closely with fine-tuning tools in the Python ecosystem, ensuring compatibility with:

  • Unsloth: Efficient fine-tuning framework
  • Axolotl: Post-training tool for modern LLMs
  • LlamaFactory: Model training framework
  • TRL: Transformer Reinforcement Learning library
  • MaxText: JAX-based training framework

All fine-tuning and post-training tools can now rely on Transformers for model definitions, further enabling agentic use cases through OpenEnv or the Prime Environment Hub. This interoperability makes it easier for developers to choose the right tool for their specific training needs while using consistent model definitions.
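
As one hedged example of this interoperability, TRL's SFTTrainer accepts a plain Hub model ID and instantiates the Transformers model definition under the hood. The sketch below assumes TRL's current SFTTrainer/SFTConfig API; the dataset and model names are public examples, not a recommendation:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any checkpoint with a Transformers model definition can be passed by name.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-output", max_steps=100),
)
trainer.train()
```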

Inference Performance

Specialized Kernels and Optimizations

Transformers v5 puts significant focus on inference performance, with several paradigm changes:

  • Automatic kernel usage: Kernels are automatically used when hardware and software permit it, improving performance without requiring manual configuration
  • Continuous batching: Support for continuous batching mechanisms to handle multiple requests efficiently
  • Paged attention: Implementation of paged attention for more efficient memory usage during inference

These improvements are particularly valuable for evaluation scenarios where a large number of inference requests are processed simultaneously. The library aims to provide good performance while maintaining compatibility with specialized inference engines.

New Inference APIs

The release introduces two new APIs dedicated to inference:

  • Continuous batching and paged attention: Now available for production use after internal testing
  • transformers serve: A new Transformers-native serving command that launches an OpenAI API-compatible server

These APIs make it easier to deploy Transformers models in production environments while maintaining compatibility with existing OpenAI-based workflows.
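
For instance, once a local server has been started with transformers serve, any OpenAI-compatible client can talk to it. This is a minimal sketch in which the port and model name are assumptions for illustration, not guaranteed defaults:

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running `transformers serve` endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Summarize what Transformers v5 changes."}],
)
print(response.choices[0].message.content)
```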

Integration with Inference Engines

Transformers v5 works closely with popular inference engines, using Transformers as a backend:

  • vLLM: High-performance inference engine
  • SGLang: Fast structured generation framework
  • TensorRT-LLM: NVIDIA's optimized inference engine

The value proposition is significant: as soon as a model is added to Transformers, it becomes available in these inference engines while taking advantage of each engine's strengths, including inference optimizations, specialized kernels, and dynamic batching.
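
For example, vLLM can be pointed explicitly at the Transformers modeling code. This is a hedged sketch assuming vLLM's documented model_impl option; the model ID is a placeholder:

```python
from vllm import LLM, SamplingParams

# Ask vLLM to use the Transformers model implementation as its backend.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")

outputs = llm.generate(
    ["What does interoperability between libraries buy you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```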

Production and Local Deployment

Cross-Platform Compatibility

Transformers v5 emphasizes interoperability with various deployment targets:

  • ONNXRuntime: Easy conversion to ONNX format for cross-platform deployment
  • llama.cpp: Direct compatibility with GGUF files for local inference
  • MLX: Direct compatibility with MLX's models using safetensors files
  • executorch: Working with the executorch team for on-device deployment

This compatibility makes it possible to train a model with Unsloth/Axolotl/LlamaFactory, deploy it with vLLM/SGLang, and export it to llama.cpp/executorch/MLX for local execution. The library serves as a central hub that connects the entire AI development and deployment pipeline.

Model Format Support

The library now supports easy conversion between formats:

  • GGUF files: Can be loaded in Transformers for further fine-tuning
  • Safetensors: Direct compatibility with MLX and other frameworks
  • ONNX: Export capabilities for production deployment

This format flexibility ensures that models can move seamlessly between different tools and deployment targets, reducing friction in the development workflow.
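
As a hedged illustration of the GGUF path (the repository and file names below are examples taken from public GGUF repos), a quantized GGUF checkpoint can be loaded back into Transformers, dequantized into standard PyTorch weights, and then fine-tuned or re-saved as safetensors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# The gguf_file argument dequantizes the GGUF weights into a regular PyTorch state dict.
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

# The result can be trained further or exported back to safetensors.
model.save_pretrained("./tinyllama-dequantized")
```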

Quantization as First-Class Feature

Central Focus on Quantization

Quantization is emerging as the standard for state-of-the-art model development, with many SOTA models released in low-precision formats (8-bit, 4-bit); examples include gpt-oss, Kimi-K2, and DeepSeek-R1. Hardware is increasingly optimized for low-precision workloads, and the community is actively sharing high-quality quantized checkpoints.

Transformers v5 makes quantization a first-class citizen, ensuring full compatibility with all major features and delivering a reliable framework for both training and inference. This represents a significant change in how weights are loaded in models, with quantization integrated at the core level rather than as an add-on feature.
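
For example, a checkpoint can be quantized to 4-bit at load time through the long-standing quantization_config path (the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit NF4 via bitsandbytes as the model is loaded.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```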

Collaboration with Quantization Tools

The release includes close collaboration with quantization tooling:

  • TorchAO: Integration of TorchAO quantization features
  • bitsandbytes: Better support for key features like tensor parallelism and mixture-of-experts models
  • New quantization methods: Easier integration of new quantization techniques

This collaboration ensures that developers have access to the latest quantization techniques while maintaining compatibility with the broader Transformers ecosystem.
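
A hedged sketch of the TorchAO path follows; the quantization scheme name and arguments follow the documented TorchAoConfig usage and may differ between torchao versions:

```python
from transformers import AutoModelForCausalLM, TorchAoConfig

# Apply int4 weight-only quantization through the torchao integration.
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
```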

Ecosystem Impact

Community Growth

The Transformers ecosystem has grown dramatically since v4:

  • Model architectures: From 40 to over 400
  • Model checkpoints: From 1,000 to 750,000+
  • Daily installations: From 20,000 to 3+ million
  • Total installs: Surpassed 1.2 billion

This growth reflects both the expansion of the AI field and the library's central role in making AI accessible to developers worldwide.

Industry Partnerships

Transformers v5 benefits from collaboration with major players in the AI ecosystem:

  • PyTorch Foundation: Working closely with PyTorch as the primary backend
  • Inference engines: vLLM, SGLang, TensorRT-LLM
  • Training frameworks: Unsloth, Axolotl, LlamaFactory, MaxText
  • Deployment tools: llama.cpp, MLX, ONNXRuntime, executorch

These partnerships ensure that Transformers remains compatible with the tools developers actually use, creating a cohesive ecosystem rather than isolated tools.

Technical Improvements

Code Quality and Maintainability

The refactoring work in v5 results in:

  • Cleaner code: More focused modeling files that only contain relevant parts
  • Better abstractions: Common functionality moved to shared interfaces
  • Easier contributions: Reduced code requirements for adding new models
  • Improved readability: Standardization makes it easier to understand model differences

These improvements benefit both maintainers and users, making the library more accessible and easier to work with.

Performance Optimizations

The release includes various performance improvements:

  • Optimized kernels: Automatic usage when available
  • Better memory management: Paged attention and other optimizations
  • Faster inference: Continuous batching and other techniques
  • Efficient training: Support for various parallelism paradigms

These optimizations ensure that Transformers models can be used efficiently across different hardware configurations and use cases.

Migration and Compatibility

Breaking Changes

Transformers v5 includes some breaking changes:

  • Flax/TensorFlow support: Being sunsetted in favor of PyTorch
  • Tokenizer changes: Removal of "Fast" and "Slow" distinction
  • Image processor changes: Only fast variant available

Users should review the release notes to understand migration requirements for their specific use cases.

Backward Compatibility

Despite breaking changes, the library maintains compatibility where possible:

  • Model checkpoints: Existing checkpoints remain compatible
  • API stability: Core APIs remain stable
  • Migration guides: Documentation provided for breaking changes

The team has worked to minimize disruption while moving toward a cleaner, more maintainable codebase.

Conclusion

Transformers v5 represents a significant evolution of one of the most important libraries in the AI ecosystem. The focus on interoperability, simplicity, and production readiness positions the library to continue serving as the foundation for AI development across training, inference, and deployment scenarios.

The move to PyTorch-only, the elevation of quantization to first-class status, and the emphasis on ecosystem compatibility reflect the current state of AI development while preparing for future needs. The library's growth from 20,000 to 3+ million daily installations demonstrates its critical role in making AI accessible to developers worldwide.

As the AI field continues to evolve rapidly, Transformers v5 provides a stable, well-maintained foundation that connects the entire development pipeline—from training with specialized frameworks to deployment on various platforms. The emphasis on interoperability ensures that developers can choose the best tools for each stage of their workflow while maintaining consistency through shared model definitions.

To learn more about AI development and model training, explore our AI fundamentals courses, check out our glossary of AI terms, or browse our AI tools catalog for related development tools and frameworks.

Frequently Asked Questions

What is Transformers v5?
Transformers v5 is the latest major release of Hugging Face's popular AI library, featuring PyTorch-only support, quantization as a first-class citizen, improved interoperability, and enhanced training and inference capabilities.

What are the key changes in v5?
Key changes include moving to a PyTorch-only backend (sunsetting Flax/TensorFlow), making quantization a first-class feature, introducing a modular architecture for easier model additions, and improving interoperability with inference engines like vLLM and SGLang.

How widely is the library used?
Transformers is installed more than 3 million times per day via pip, up from 20,000 per day in the v4 era. The library has surpassed 1.2 billion total installs and supports over 400 model architectures with 750,000+ model checkpoints on the Hub.

What is the overarching theme of the release?
The overarching theme of v5 is interoperability, with improvements in simplicity, training capabilities, inference performance, and production deployment. The library now works seamlessly with tools like Unsloth, Axolotl, vLLM, SGLang, llama.cpp, and MLX.

What happens to Flax and TensorFlow support?
Hugging Face is sunsetting Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend. However, they're working with partners in the JAX ecosystem to ensure compatibility between Transformers models and JAX-based frameworks.

How does v5 handle quantization?
Quantization is now a first-class citizen in v5, with full compatibility across all major features. This supports the growing trend of low-precision model formats (8-bit, 4-bit) used by many state-of-the-art models, with improved support for both training and inference.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.