Introduction
In the rapidly evolving landscape of Large Language Models (LLMs), the focus is often on increasing scale. However, the release of Nanbeige4.1-3B by Nanbeige Lab (南北阁实验室) challenges this trend. Built upon Nanbeige4-3B-Base, this enhanced iteration demonstrates that compact models can achieve robust reasoning, exceptional preference alignment, and effective agentic behaviors simultaneously.
Nanbeige4.1-3B is an optimized version achieved through extensive post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL). It fills a significant gap in the small-model ecosystem, where models typically excel at either general reasoning or agentic tasks, but rarely both.
Key Features and Capabilities
Nanbeige4.1-3B stands out for several reasons:
- Strong Reasoning: Capable of solving complex, multi-step problems with sustained coherence, it achieves impressive results on challenging benchmarks like LiveCodeBench-Pro and AIME 2026 I.
- Robust Preference Alignment: It outperforms same-scale models and even substantially larger ones like Qwen3-32B on Arena-Hard-v2, showing superior understanding of human preferences.
- Agentic Capability: As the first general small model to natively support deep-search tasks, it can reliably handle complex problem solving involving hundreds of tool invocations.
Performance Benchmarks
The model's performance across diverse benchmarks is remarkable for its size. In many cases, it not only leads its class but also rivals or exceeds the performance of much larger high-profile models.
General Reasoning Tasks
| Benchmark | Qwen3-4B-2507 | Qwen3-32B | Nanbeige4.1-3B |
|---|---|---|---|
| LiveCodeBench-V6 | 57.4 | 55.7 | 76.9 |
| AIME 2026 I | 81.46 | 75.83 | 87.40 |
| GPQA | 65.8 | 68.4 | 83.8 |
| Arena-Hard-v2 | 34.9 | 56.0 | 73.2 |
| BFCL-V4 (Tool Use) | 44.87 | 47.90 | 56.50 |
Deep Search and Agentic Behavior
Nanbeige4.1-3B represents a qualitative leap in deep-search capability for small foundation models. On xBench-DeepSearch-2505, it achieved a score of 75, significantly higher than its small-model peers and even above several large foundation models when equipped with tools.
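To make the deep-search workflow more concrete, below is a minimal, hypothetical sketch of the kind of agent loop such tasks imply: the model proposes a tool call, the runtime executes it, and the result is fed back into the conversation until the model produces a final answer. The `<tool_call>` tag convention, the `web_search` stub, the `tool` message role, and the stopping condition are illustrative assumptions for this sketch, not Nanbeige4.1-3B's actual tool-calling protocol; consult the model card for the real format.

```python
import json
import re

# Hypothetical tool: a real deep-search agent would call an actual search API here.
def web_search(query: str) -> str:
    return f"(stub) top results for: {query}"

def run_agent(generate_fn, question: str, max_steps: int = 50) -> str:
    """Minimal ReAct-style loop (illustrative only).

    `generate_fn(messages)` is assumed to wrap the chat-generation call shown
    in the Quickstart below and return the assistant's raw text.
    """
    messages = [{'role': 'user', 'content': question}]
    for _ in range(max_steps):
        reply = generate_fn(messages)
        messages.append({'role': 'assistant', 'content': reply})

        # Assumed convention: the model emits
        # <tool_call>{"name": ..., "arguments": ...}</tool_call>
        # when it wants a tool; otherwise its reply is the final answer.
        match = re.search(r'<tool_call>(.*?)</tool_call>', reply, re.DOTALL)
        if not match:
            return reply

        call = json.loads(match.group(1))
        if call.get('name') == 'web_search':
            result = web_search(call['arguments']['query'])
        else:
            result = f"unknown tool: {call.get('name')}"
        messages.append({'role': 'tool', 'content': result})
    return "stopped after max_steps without a final answer"
```

In a real deployment, `generate_fn` would apply the chat template and call `model.generate` exactly as in the Quickstart below, and `web_search` would query an actual retrieval backend.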
Quickstart: How to Use Nanbeige4.1-3B
You can easily integrate Nanbeige4.1-3B into your projects using the Hugging Face transformers library. Here is a simple example for a chat scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4.1-3B',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4.1-3B',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

# Prepare the chat messages
messages = [
    {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]

# Render the chat template into a prompt string, then tokenize it
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids

# Generate a response; 166101 is the model-specific end-of-turn token id
output_ids = model.generate(
    input_ids.to(model.device),
    max_new_tokens=2048,  # leave room for long reasoning outputs; adjust as needed
    eos_token_id=166101
)

# Decode only the newly generated tokens
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
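For interactive use, you may prefer to stream tokens as they are generated rather than waiting for the full response. The snippet below is a small variation on the example above using the `TextStreamer` utility from transformers; it reuses the `tokenizer`, `model`, and `input_ids` defined earlier, and the `max_new_tokens` and `eos_token_id` values are carried over unchanged.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_ids.to(model.device),
    max_new_tokens=2048,
    eos_token_id=166101,
    streamer=streamer
)
```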
Conclusion
Nanbeige4.1-3B is a testament to the power of optimization over scale. By demonstrating top-tier reasoning and agentic performance at the 3B parameter level, it opens up new possibilities for efficient, high-performance AI applications that can run on more accessible hardware. Whether you are building complex agents or need a reliable reasoning engine, Nanbeige4.1-3B is a compelling new choice in the open-source community.