Introduction
In the rapidly evolving landscape of Large Language Models (LLMs), the focus is often on increasing scale. However, the release of Nanbeige4.1-3B by Nanbeige Lab (南北阁实验室) challenges this trend. Built upon Nanbeige4-3B-Base, this enhanced iteration demonstrates that compact models can achieve robust reasoning, exceptional preference alignment, and effective agentic behaviors simultaneously.
Nanbeige4.1-3B is an optimized version achieved through extensive post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL). It fills a significant gap in the small-model ecosystem, where models typically excel at either general reasoning or agentic tasks, but rarely both.
Key Features and Capabilities
Nanbeige4.1-3B stands out for several reasons:
- Strong Reasoning: Capable of solving complex, multi-step problems with sustained coherence, it achieves impressive results on challenging benchmarks like LiveCodeBench-Pro and AIME 2026 I.
- Robust Preference Alignment: It outperforms same-scale models and even substantially larger ones like Qwen3-32B on Arena-Hard-v2, showing superior understanding of human preferences.
- Agentic Capability: As the first general small model to natively support deep-search tasks, it can reliably handle complex problem solving involving hundreds of tool invocations.
Performance Benchmarks
The model's performance across diverse benchmarks is remarkable for its size. In many cases, it not only leads its class but also rivals or exceeds the performance of much larger high-profile models.
General Reasoning Tasks
| Benchmark | Qwen3-4B-2507 | Qwen3-32B | Nanbeige4.1-3B |
|---|---|---|---|
| LiveCodeBench-V6 | 57.4 | 55.7 | 76.9 |
| AIME 2026 I | 81.46 | 75.83 | 87.40 |
| GPQA | 65.8 | 68.4 | 83.8 |
| Arena-Hard-v2 | 34.9 | 56.0 | 73.2 |
| BFCL-V4 (Tool Use) | 44.87 | 47.90 | 56.50 |
Deep Search and Agentic Behavior
Nanbeige4.1-3B represents a qualitative leap in deep-search capability for small foundation models. On xBench-DeepSearch-2505, it achieved a score of 75, significantly higher than its small-model peers and even above several large foundation models when equipped with tools.
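To make the deep-search workflow more concrete, below is a minimal, hypothetical sketch of the kind of agent loop such tasks imply: the model proposes a tool call, the runtime executes it, and the result is fed back into the conversation until the model produces a final answer. The `<tool_call>` tag convention, the `web_search` stub, the `tool` message role, and the stopping condition are illustrative assumptions for this sketch, not Nanbeige4.1-3B's actual tool-calling protocol; consult the model card for the real format.

```python
import json
import re

# Hypothetical tool: a real deep-search agent would call an actual search API here.
def web_search(query: str) -> str:
    return f"(stub) top results for: {query}"

def run_agent(generate_fn, question: str, max_steps: int = 50) -> str:
    """Minimal ReAct-style loop (illustrative only).

    `generate_fn(messages)` is assumed to wrap the chat-generation call shown
    in the Quickstart below and return the assistant's raw text.
    """
    messages = [{'role': 'user', 'content': question}]
    for _ in range(max_steps):
        reply = generate_fn(messages)
        messages.append({'role': 'assistant', 'content': reply})

        # Assumed convention: the model emits
        # <tool_call>{"name": ..., "arguments": ...}</tool_call>
        # when it wants a tool; otherwise its reply is the final answer.
        match = re.search(r'<tool_call>(.*?)</tool_call>', reply, re.DOTALL)
        if not match:
            return reply

        call = json.loads(match.group(1))
        if call.get('name') == 'web_search':
            result = web_search(call['arguments']['query'])
        else:
            result = f"unknown tool: {call.get('name')}"
        messages.append({'role': 'tool', 'content': result})
    return "stopped after max_steps without a final answer"
```

In a real deployment, `generate_fn` would apply the chat template and call `model.generate` exactly as in the Quickstart below, and `web_search` would query an actual retrieval backend.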
Quickstart: How to Use Nanbeige4.1-3B
You can easily integrate Nanbeige4.1-3B into your projects using the Hugging Face transformers library. Here is a simple example for a chat scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4.1-3B',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4.1-3B',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

# Prepare the chat messages
messages = [
    {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]

# Render the chat template into a prompt string, then tokenize it
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids

# Generate a response; 166101 is the model-specific end-of-turn token id
output_ids = model.generate(
    input_ids.to(model.device),
    max_new_tokens=2048,  # leave room for long reasoning outputs; adjust as needed
    eos_token_id=166101
)

# Decode only the newly generated tokens
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
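For interactive use, you may prefer to stream tokens as they are generated rather than waiting for the full response. The snippet below is a small variation on the example above using the `TextStreamer` utility from transformers; it reuses the `tokenizer`, `model`, and `input_ids` defined earlier, and the `max_new_tokens` and `eos_token_id` values are carried over unchanged.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_ids.to(model.device),
    max_new_tokens=2048,
    eos_token_id=166101,
    streamer=streamer
)
```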
Conclusion
Nanbeige4.1-3B is a testament to the power of optimization over scale. By demonstrating top-tier reasoning and agentic performance at the 3B parameter level, it opens up new possibilities for efficient, high-performance AI applications that can run on more accessible hardware. Whether you are building complex agents or need a reliable reasoning engine, Nanbeige4.1-3B is a compelling new choice in the open-source community.