
Introduction
DeepSeek, the AI research lab that recently disrupted the industry with its high-performance open-weights models, is preparing to give its AI a pair of eyes. Xiaokang Chen, a lead developer on DeepSeek's multimodal team, recently posted a cryptic yet unmistakable teaser on X (formerly Twitter): "Now, We See You."
The post was accompanied by a telling pair of images featuring the company's iconic whale mascot. In the first, the whale's eyes are covered by a blindfold; in the second, they are wide open, glowing with intelligence. Given Chen's background in vision-language models, the teaser almost certainly points to the imminent release of a multimodal foundation model or a significant vision upgrade to the existing DeepSeek-V4 ecosystem.
The Shift to Multimodal AI
While DeepSeek has built its reputation on powerful large language models (LLMs) like DeepSeek-V3 and V4, the lab is no stranger to computer vision. In 2024, it released DeepSeek-VL and DeepSeek-VL2, specialized vision-language models designed for high-resolution image understanding.
However, the "Now, We See You" teaser suggests something more integrated. As frontier models like GPT-4o and Gemini 1.5 Pro move toward "omni" capabilities—where text, vision, and audio are processed by a single unified architecture—DeepSeek appears ready to close the gap. This move is crucial for AI agents that need to "see" a user's screen, interpret diagrams, or process real-world visual data.
Internal Testing Underway
Shortly after the teaser went viral, reports emerged from China that DeepSeek had already begun a limited grayscale (staged) rollout of an "Image recognition mode" (识图模式) within its official mobile and web platforms.
Users with early access have reported that the new feature allows the model to:
- Describe complex scenes with high accuracy.
- Extract text from images (OCR) with improved precision.
- Reason about visual layouts, which is essential for agentic tasks like UI navigation and code generation from mockups.
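If the image mode does ship through DeepSeek's existing OpenAI-compatible API, a vision request might look like the minimal sketch below. This is speculative: the model id "deepseek-vl2", the file name "ui_mockup.png", and the endpoint's support for image_url content parts are assumptions for illustration, not confirmed details.

```python
# Hypothetical sketch of a vision request to DeepSeek's OpenAI-compatible API.
# Assumes the rumored image mode accepts the standard "image_url" content-part
# format used by other multimodal chat APIs; nothing here is confirmed.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

# Encode a local screenshot as a base64 data URL, the usual pattern for image inputs.
with open("ui_mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="deepseek-vl2",  # hypothetical model id, not a confirmed name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this UI layout and extract any visible text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

If that pattern holds, existing agent toolchains built on the chat completions format would need little more than a model-name change to start sending screenshots.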
Why Vision Matters for DeepSeek
The addition of vision capabilities is more than a feature update; it is a strategic necessity. For DeepSeek to remain competitive in the agentic AI space, its models must be able to interact with the visual world.
- Coding Agents: Developers can share screenshots of UI bugs for the AI to diagnose, or design mockups for it to implement.
- Data Analysis: The model can interpret charts, graphs, and tables directly from PDFs or images.
- Robotics and Automation: Multimodal understanding is the foundation for future robotics applications that DeepSeek may explore.
Conclusion
The "Now, We See You" teaser marks a pivotal moment for DeepSeek. By bringing vision capabilities to their already efficient and powerful architectures, the lab is positioning itself as a full-spectrum competitor to the biggest names in AI. Whether this will be a direct upgrade to the V4 series or a standalone "DeepSeek-V5" remains to be seen, but the message is clear: DeepSeek's blindfold is coming off.
Stay tuned for more updates as we monitor the official rollout of these multimodal features.
Explore more about DeepSeek models and the latest in multimodal AI in our Glossary.