Introduction
In the rapidly evolving landscape of Large Multimodal Models (LMMs), the trade-off between efficiency and intelligence has long been a central hurdle. Stepfun AI challenges that trade-off with the release of Step3-VL-10B, a compact 10-billion-parameter foundation model that punches far above its weight class.
Despite its relatively small footprint, Step3-VL-10B consistently outperforms other models at or below the 10B scale, and remarkably rivals or even surpasses significantly larger open-weight models—some up to 20 times its size. By focusing on deep vision-language synergy and introducing advanced test-time compute scaling, Stepfun AI is setting a new standard for what "compact" models can achieve in visual perception, complex reasoning, and agentic tasks. This release builds on the success of the company's previous models, such as the Step-Audio-R1 audio reasoning model.
Key Capabilities & Benchmarks
Step3-VL-10B delivers best-in-class performance across a wide array of multimodal benchmarks. What stands out is its ability to handle extremely complex STEM reasoning and fine-grained visual recognition tasks that typically require much larger architectures.
- STEM Reasoning: The model achieves a staggering 94.43% on AIME 2025 and 75.95% on MathVision when using its advanced reasoning mode.
- Visual Perception: It records 92.05% on MMBench (EN) and 80.11% on MMMU, establishing its dominance in general visual understanding.
- GUI & OCR: Optimized for agentic use cases, it reaches 92.61% on ScreenSpot-V2 and 86.75% on OCRBench (rising to 89.00% in PaCoRe mode, per the comparison table below), making it highly effective for document analysis and UI navigation.
- Spatial Awareness: With 66.79% on BLINK, it demonstrates emergent spatial intelligence, a critical component for embodied AI applications.
Comparison with Larger Models
| Benchmark | Step3-VL-10B (PaCoRe) | GLM-4.6V (106B) | Qwen3-VL (235B) | Gemini-2.5-Pro |
|---|---|---|---|---|
| MMMU | 80.11 | 75.20 | 78.70 | 83.89 |
| MathVision | 75.95 | 63.50 | 72.10 | 73.30 |
| AIME 2025 | 94.43 | 71.88 | 83.59 | 83.96 |
| OCRBench | 89.00 | 86.20 | 87.30 | 85.90 |
Parallel Coordinated Reasoning (PaCoRe)
The "secret sauce" behind Step3-VL-10B's success is a strategic innovation called Parallel Coordinated Reasoning (PaCoRe). Unlike standard models that generate a single sequential Chain-of-Thought (SeRe), Step3-VL-10B can intelligently scale its compute during inference.
PaCoRe works by aggregating evidence from up to 16 parallel visual explorations. This multi-path approach lets the model synthesize a final answer from a much broader sampling of the visual context, expanding its maximum context length from 64K to 128K tokens in the process. This mechanism is what enables a 10B model to out-reason flagship models like Gemini 2.5 Pro on highly specialized mathematics and vision tasks, similar to the reasoning capabilities seen in the Qwen3-VL and Qwen3-Max series.
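Stepfun AI has not published the exact aggregation rule, but the core idea of "aggregate evidence from parallel explorations" can be sketched as sampling several independent reasoning paths and coordinating them into one answer. The snippet below is a minimal illustration assuming a majority-vote aggregator and a toy `explore` stand-in for a real model rollout; both names are hypothetical.

```python
from collections import Counter


def explore(question: str, seed: int) -> str:
    """Hypothetical stand-in for one visual reasoning rollout.

    A real system would sample one chain-of-thought from the model;
    here a deterministic toy makes most rollouts agree while a few
    dissent, so the aggregation step has something to resolve.
    """
    return "42" if seed % 4 else str(40 + seed % 3)


def pacore_answer(question: str, n_paths: int = 16) -> str:
    """Run up to 16 parallel explorations and majority-vote the results.

    Majority voting is just one simple coordination rule; the actual
    PaCoRe mechanism may weight or merge paths differently.
    """
    answers = [explore(question, seed) for seed in range(n_paths)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner


print(pacore_answer("What is 6 x 7?"))  # prints "42"
```

The appeal of this pattern is that the parallel rollouts are independent, so test-time compute scales horizontally: more paths cost more inference but require no retraining.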
Architecture & Training
Step3-VL-10B's architecture is built on a foundation of high-quality components and a rigorous training pipeline:
- Visual Encoder (PE-lang): A language-optimized 1.8B parameter encoder designed for deep alignment with the LLM.
- Decoder: Utilizes the powerful Qwen3-8B as its linguistic backbone.
- Unified Pre-training: Trained on a 1.2 trillion token multimodal corpus using a single-stage, fully unfrozen strategy. This ensures the vision and language components are co-optimized from the start.
- Scaling RL: The model underwent over 1,400 iterations of reinforcement learning, combining Reinforcement Learning with Verifiable Rewards (RLVR) for math/geometry and Reinforcement Learning from Human Feedback (RLHF) for general alignment.
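The RLVR stage above relies on rewards that can be checked programmatically rather than judged by a model. As a rough illustration of what "verifiable" means in the math/geometry setting, the sketch below scores a completion 1.0 only if its final answer matches a reference exactly; the `\boxed{...}` answer convention and the fallback to the last number are assumptions for illustration, not Stepfun AI's actual reward function.

```python
import re


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Toy RLVR-style reward: exact match against a reference answer.

    Prefers an answer wrapped in \\boxed{...}; otherwise falls back to
    the last number in the completion. Returns 1.0 or 0.0 -- no partial
    credit, which is what makes the reward cheaply verifiable.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is not None:
        candidate = match.group(1).strip()
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if not numbers:
            return 0.0
        candidate = numbers[-1]
    return 1.0 if candidate == gold_answer else 0.0


print(verifiable_reward(r"The area is \boxed{12}.", "12"))   # 1.0
print(verifiable_reward("I think the answer is 7.", "12"))   # 0.0
```

Because the check is deterministic, such rewards avoid the reward-hacking and drift risks of learned reward models, which is why RLVR pairs naturally with RLHF for the softer, preference-style alignment objectives.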
Model Zoo & Quick Start
Stepfun AI has released both the Base and the Chat/Instruct versions of the model, licensed under Apache 2.0.
| Model Name | Type | ModelScope | Hugging Face |
|---|---|---|---|
| Step3-VL-10B-Base | Base | Download | Download |
| Step3-VL-10B | Chat | Download | Download |
Minimum Hardware Requirements
To run Step3-VL-10B locally, you will need:
- Model Weights: ~20 GB
- Minimum VRAM: 24 GB (e.g., NVIDIA RTX 4090 or A100)
- Environment: Python 3.10+, Transformers 4.57.0+
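The ~20 GB weight figure follows directly from the parameter count: 10 billion parameters at 2 bytes each (bf16/fp16) is 20 GB, and the extra headroom up to 24 GB covers the KV cache and activations. A quick back-of-the-envelope helper, with names of my own choosing:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight footprint in GB: parameters x bytes per parameter.

    bf16/fp16 store 2 bytes per parameter. This counts weights only;
    the KV cache and activations are why actual VRAM needs exceed it.
    """
    return n_params * bytes_per_param / 1e9


# 10B parameters in bf16 -> the ~20 GB figure quoted above
print(weight_memory_gb(10e9))     # 20.0
# An int8 quantization would roughly halve that
print(weight_memory_gb(10e9, 1))  # 10.0
```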
Conclusion
Step3-VL-10B demonstrates that model size is no longer the sole determinant of AI capability. By combining a high-quality 1.2T-token training corpus with architectural innovations like PaCoRe, Stepfun AI has created a model that is both accessible and remarkably powerful. Whether for complex STEM reasoning, precise OCR tasks, or advanced GUI grounding, Step3-VL-10B stands as the premier open-source choice in its class.
As more developers begin to leverage the Parallel Coordinated Reasoning capabilities, we expect to see Step3-VL-10B integrated into sophisticated agentic workflows where efficiency and frontier-level intelligence are both non-negotiable.