Step3-VL-10B: Redefining Multimodal AI

Stepfun AI releases Step3-VL-10B, a 10B parameter multimodal model that outperforms giants 20x its size through innovative Parallel Coordinated Reasoning.

by HowAIWorks Team
Tags: Stepfun AI, Step3-VL-10B, Multimodal, Open Source, Computer Vision, VLMs, PaCoRe

Introduction

In the rapidly evolving landscape of Large Multimodal Models (LMMs), the trade-off between efficiency and intelligence has always been a significant hurdle. Stepfun AI has shattered expectations with the release of Step3-VL-10B, a compact 10-billion parameter foundation model that punches far above its weight class.

Despite its relatively small footprint, Step3-VL-10B consistently outperforms other models at the sub-10B scale and rivals, or even surpasses, significantly larger open-weight models, some up to 20 times its size. By focusing on deep vision-language synergy and introducing advanced test-time compute scaling, Stepfun AI is setting a new standard for what "compact" models can achieve in visual perception, complex reasoning, and agentic tasks. This release builds on the success of the company's previous models, such as the Step-Audio-R1 audio reasoning model.

Key Capabilities & Benchmarks

Step3-VL-10B delivers best-in-class performance across a wide array of multimodal benchmarks. What stands out is its ability to handle extremely complex STEM reasoning and fine-grained visual recognition tasks that typically require much larger architectures.

  • STEM Reasoning: The model achieves a staggering 94.43% on AIME 2025 and 75.95% on MathVision when using its advanced reasoning mode.
  • Visual Perception: It records 92.05% on MMBench (EN) and 80.11% on MMMU, leading results for open models in its size class in general visual understanding.
  • GUI & OCR: Optimized for agentic use cases, it reaches 92.61% on ScreenSpot-V2 and 86.75% on OCRBench, making it highly effective for document analysis and UI navigation.
  • Spatial Awareness: With 66.79% on BLINK, it demonstrates emergent spatial intelligence, a critical component for embodied AI applications.

Comparison with Larger Models

| Benchmark  | Step3-VL-10B (PaCoRe) | GLM-4.6V (106B) | Qwen3-VL (235B) | Gemini-2.5-Pro |
|------------|-----------------------|-----------------|-----------------|----------------|
| MMMU       | 80.11                 | 75.20           | 78.70           | 83.89          |
| MathVision | 75.95                 | 63.50           | 72.10           | 73.30          |
| AIME 2025  | 94.43                 | 71.88           | 83.59           | 83.96          |
| OCRBench   | 89.00                 | 86.20           | 87.30           | 85.90          |

Parallel Coordinated Reasoning (PaCoRe)

The "secret sauce" behind Step3-VL-10B's success is a strategic innovation called Parallel Coordinated Reasoning (PaCoRe). Unlike standard models that generate a single sequential Chain-of-Thought (SeRe), Step3-VL-10B can intelligently scale its compute during inference.

PaCoRe works by aggregating evidence from up to 16 parallel visual explorations. This multi-path approach allows the model to synthesize a final answer based on a much broader understanding of the visual context, effectively increasing its max context length from 64K to 128K tokens. This mechanism is what enables a 10B model to out-reason flagship models like Gemini 2.5 Pro in highly specialized mathematics and vision tasks, similar to the reasoning capabilities seen in the Qwen3-VL and Qwen3-Max series.
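
Stepfun has not published PaCoRe's implementation beyond the description above, but the general pattern is easy to sketch. The snippet below is a minimal, illustrative Python sketch of that idea: sample several independent reasoning paths over the same image and question, then ask the model to synthesize them into one answer. The `model.generate` interface, prompt wording, and function name are assumptions, not Stepfun's actual API.

```python
# Minimal, illustrative sketch of the PaCoRe idea (not Stepfun's code).
# `model.generate` is a placeholder interface: image + prompt in, text out.

def pacore_answer(model, image, question, num_paths=16):
    # 1. Explore: sample independent reasoning traces over the same input.
    #    Nonzero temperature makes each path examine the image differently.
    paths = [
        model.generate(image, question, temperature=1.0)
        for _ in range(num_paths)
    ]

    # 2. Coordinate: pack the parallel traces into a single synthesis prompt.
    #    This aggregation step is where the effective context budget grows.
    synthesis_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"Exploration {i + 1}:\n{p}" for i, p in enumerate(paths))
        + "\n\nUsing the evidence above, give one final answer."
    )

    # 3. Synthesize: a last, greedy pass produces the aggregated answer.
    return model.generate(image, synthesis_prompt, temperature=0.0)
```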

Architecture & Training

Step3-VL-10B's architecture is built on a foundation of high-quality components and a rigorous training pipeline:

  • Visual Encoder (PE-lang): A language-optimized 1.8B parameter encoder designed for deep alignment with the LLM.
  • Decoder: Utilizes the powerful Qwen3-8B as its linguistic backbone.
  • Unified Pre-training: Trained on a 1.2 trillion token multimodal corpus using a single-stage, fully unfrozen strategy. This ensures the vision and language components are co-optimized from the start (see the schematic sketch after this list).
  • Scaling RL: The model underwent over 1,400 iterations of Reinforcement Learning, combining Reinforcement Learning with Verifiable Rewards (RLVR) for math and geometry with Reinforcement Learning from Human Feedback (RLHF) for general alignment.
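
To make the component list above concrete, here is a minimal schematic sketch of how a vision encoder, a projector, and a Qwen3-8B-style decoder typically compose. It is not the released implementation: the class name, dimensions, and the plain linear projector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    """Schematic only; Step3-VL-10B's actual internals are not reproduced here."""

    def __init__(self, vision_encoder: nn.Module, language_decoder: nn.Module,
                 vision_dim: int = 1536, text_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # stand-in for the ~1.8B PE-lang encoder
        self.projector = nn.Linear(vision_dim, text_dim)  # maps image tokens into the text space
        self.language_decoder = language_decoder          # stand-in for the Qwen3-8B backbone

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        image_tokens = self.vision_encoder(pixel_values)   # (batch, n_img_tokens, vision_dim)
        image_embeds = self.projector(image_tokens)        # (batch, n_img_tokens, text_dim)
        # "Single-stage, fully unfrozen" training means gradients flow through
        # every module in this composition, not just the projector.
        fused = torch.cat([image_embeds, text_embeds], dim=1)  # prepend visual tokens
        return self.language_decoder(inputs_embeds=fused)
```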

Model Zoo & Quick Start

Stepfun AI has released both the Base and the Chat/Instruct versions of the model, licensed under Apache 2.0.

| Model Name        | Type | ModelScope | Hugging Face |
|-------------------|------|------------|--------------|
| Step3-VL-10B-Base | Base | Download   | Download     |
| Step3-VL-10B      | Chat | Download   | Download     |
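
For a quick start, the sketch below assumes the Hugging Face release follows the standard Transformers multimodal chat pattern (Transformers 4.57.0+, as noted in the hardware requirements below). The repo id, message format, and Auto classes are assumptions; the official model card is authoritative.

```python
# Hypothetical quick-start; check the model card for the exact repo id and
# classes (some VLM releases use AutoModelForImageTextToText instead).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "stepfun-ai/Step3-VL-10B"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~20 GB of weights in bf16
    device_map="auto",
    trust_remote_code=True,
)

# Multimodal chat-template content types vary by processor; "url"/"text"
# below follow the common recent convention.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize this chart."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
))
```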

Minimum Hardware Requirements

To run Step3-VL-10B locally, you will need:

  • Model Weights: ~20 GB
  • Minimum VRAM: 24 GB (e.g., NVIDIA RTX 4090 or A100); see the back-of-envelope estimate after this list
  • Environment: Python 3.10+, Transformers 4.57.0+
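
As a sanity check on these figures, the arithmetic below (illustrative, not from Stepfun's documentation) shows why 24 GB is a practical floor: the bf16 weights alone account for roughly 20 GB.

```python
# Illustrative arithmetic: why 24 GB of VRAM is the practical minimum.
params = 10e9                 # ~10 billion parameters
weight_bytes = params * 2     # bf16/fp16 = 2 bytes per parameter
print(f"weights: ~{weight_bytes / 1e9:.0f} GB ({weight_bytes / 1024**3:.1f} GiB)")
# -> weights: ~20 GB (18.6 GiB); on a 24 GB card that leaves only a few GB
#    for the KV cache, activations, and CUDA runtime overhead.
```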

Conclusion

Step3-VL-10B is a testament to the fact that model size is no longer the sole determinant of AI intelligence. By combining a high-quality 1.2T token training corpus with architectural innovations like PaCoRe, Stepfun AI has created a model that is both accessible and incredibly powerful. Whether it's for complex STEM reasoning, precise OCR tasks, or advanced GUI grounding, Step3-VL-10B stands as the premier open-source choice in its class.

As more developers begin to leverage the Parallel Coordinated Reasoning capabilities, we expect to see Step3-VL-10B integrated into sophisticated agentic workflows where efficiency and frontier-level intelligence are both non-negotiable.

Frequently Asked Questions

What sets Step3-VL-10B apart from other compact models?
Step3-VL-10B introduces Parallel Coordinated Reasoning (PaCoRe), allowing it to outperform models 10-20x its size by intelligently scaling test-time compute.

How does it compare to Gemini 2.5 Pro?
In several benchmarks, such as MathVision and AIME 2025, Step3-VL-10B (using PaCoRe) actually surpasses Gemini 2.5 Pro, despite its significantly smaller footprint.

Can it run on consumer hardware?
Yes. With a 24 GB VRAM requirement (e.g., an RTX 4090), the ~20 GB of model weights plus runtime overhead make it accessible for local deployment.

What was the model trained on?
The model was pre-trained on a 1.2 trillion token, high-quality multimodal corpus using a single-stage, fully unfrozen strategy.
