DeepSeek Teases Multimodal Capabilities: 'Now, We See You'

Xiaokang Chen of DeepSeek's multimodal team hints at upcoming vision features, signaling the lab's move toward integrated visual data understanding.

by HowAIWorks Team
DeepSeek, Multimodal AI, Computer Vision, AI Models, DeepSeek-VL, Xiaokang Chen, Machine Learning, Vision-Language Models, AI News

DeepSeek Multimodal Teaser

Introduction

DeepSeek, the AI research lab that has recently disrupted the industry with its high-performance open-weights models, is preparing to give its AI a pair of eyes. Xiaokang Chen, a leading developer on DeepSeek's multimodal team, recently posted a brief but unmistakable teaser on X (formerly Twitter) with the phrase: "Now, We See You."

The post was accompanied by a telling visual: the company's iconic whale mascot. In one image, the whale's eyes are covered by a blindfold; in the second, the eyes are wide open, glowing with intelligence. Given Chen’s background in vision-language models, this teaser almost certainly points to the imminent release of a multimodal foundation model or a significant vision update to the existing DeepSeek-V4 ecosystem.

The Shift to Multimodal AI

While DeepSeek has built its reputation on powerful large language models (LLMs) like DeepSeek-V3 and V4, the lab is no stranger to computer vision. In 2024, they released DeepSeek-VL and DeepSeek-VL2, specialized vision-language models designed for high-resolution image understanding.

However, the "Now, We See You" teaser suggests something more integrated. As frontier models like GPT-4o and Gemini 1.5 Pro move toward "omni" capabilities—where text, vision, and audio are processed by a single unified architecture—DeepSeek appears ready to close the gap. This move is crucial for AI agents that need to "see" a user's screen, interpret diagrams, or process real-world visual data.

Internal Testing Underway

Shortly after the teaser went viral, reports emerged from China that DeepSeek has already begun a limited grayscale (canary) test of an "image recognition mode" (识图模式) within their official mobile and web platforms.

Users with early access have reported that the new feature allows the model to:

  • Describe complex scenes with high accuracy.
  • Extract text from images (OCR) with improved precision.
  • Reason about visual layouts, which is essential for agentic tasks like UI navigation and code generation from mockups.
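If the feature ships through DeepSeek's existing OpenAI-compatible chat API, developers would likely pass images alongside text using the inline base64 "image_url" convention common to such APIs. The sketch below only builds the request payload; the model name "deepseek-vl" and vision support in DeepSeek's API are assumptions, since nothing has been officially announced.

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str,
                         model: str = "deepseek-vl") -> dict:
    """Assemble an OpenAI-style chat payload mixing text and an image.

    The image is inlined as a base64 data URL, the convention used by
    OpenAI-compatible chat endpoints for image input. The model name is
    a placeholder, not a confirmed DeepSeek identifier.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Example: ask the model to describe a (dummy) screenshot.
payload = build_vision_request(b"\x89PNG-dummy-bytes",
                               "Describe this UI screenshot.")
print(json.dumps(payload)[:80])
```

An actual call would POST this payload to the chat completions endpoint once (and if) vision input is enabled server-side.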

Why Vision Matters for DeepSeek

The addition of vision capabilities is more than just a feature update; it's a strategic necessity. For DeepSeek to remain competitive in the agentic AI space, their models must be able to interact with the visual world.

  • Coding Agents: Developers can share screenshots of UI bugs or design mockups for the AI to implement.
  • Data Analysis: The model can interpret charts, graphs, and tables directly from PDFs or images.
  • Robotics and Automation: Multimodal understanding is the foundation for future robotics applications that DeepSeek may explore.

Conclusion

The "Now, We See You" teaser marks a pivotal moment for DeepSeek. By bringing vision capabilities to their already efficient and powerful architectures, the lab is positioning itself as a full-spectrum competitor to the biggest names in AI. Whether this will be a direct upgrade to the V4 series or a standalone "DeepSeek-V5" remains to be seen, but the message is clear: DeepSeek's blindfold is coming off.

Stay tuned for more updates as we monitor the official rollout of these multimodal features.

Explore more about DeepSeek models and the latest in multimodal AI in our Glossary.

Frequently Asked Questions

What did the teaser show?
The teaser featured the phrase 'Now, We See You' and images of the company's whale mascot with eyes opening, strongly hinting at the addition of vision and multimodal capabilities to their models.

Who is Xiaokang Chen?
Xiaokang Chen is a key developer at DeepSeek specializing in multimodal projects and vision-language models. He previously worked on the DeepSeek-VL series.

When will the multimodal features be released?
No official release date has been announced yet, but internal testing of an 'image recognition mode' has already been spotted in the DeepSeek app.

Is DeepSeek-V4 multimodal?
DeepSeek-V4 was released recently as a text-only model. The upcoming multimodal features are expected to bring vision capabilities to the V4 series or launch as a specialized vision-language model.
