
DeepSeek V4 Preview: Two Models, Million-Token Context, Open Source
On April 24, 2025, DeepSeek released the preview version of their V4 family, introducing two distinct models: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both variants come standard with a 1 million token context window, a significant leap from most open-source models that typically cap at 128K or 200K tokens. According to DeepSeek’s official announcement, this preview represents a new milestone in making long-context capabilities accessible to the broader AI developer community.
The V4 family is designed to excel in agentic workflows, complex reasoning, and knowledge-intensive tasks. Based on the release notes we examined, the Pro variant targets high-performance deployment scenarios (e.g., large-scale batch inference, complex multi-step reasoning), while the Flash variant is optimized for lower latency and lower compute overhead, suitable for real-time applications such as chatbots and code completion. Both models have been open-sourced, continuing DeepSeek’s strategy of releasing model weights under permissive licenses.
What the Million-Token Context Means in Practice

Context length has become a key battleground for large language models. In early 2024, most models handled 32K–128K tokens. By late 2024, Gemini 1.5 Pro offered 1M tokens (though only in beta for certain regions) and Anthropic’s Claude 3 introduced a 200K-token window. DeepSeek V4’s 1M context window matches the upper end of proprietary offerings, with the added advantage of being fully open-source. For developers working on document analysis, long-form code repositories, or multi-turn agent interactions, this removes a major constraint: they can now feed entire codebases or thousand-page reports into a single inference call without chunking.
In our review of the technical documentation, DeepSeek highlights that V4 employs a sparse attention mechanism combined with a novel caching strategy to keep inference costs reasonable even at 1M tokens. This is critical because naïve attention scales quadratically with sequence length. The preview benchmarks (which DeepSeek published on their research page) show that V4-Pro retains over 95% accuracy on the RULER long-context evaluation suite, on par with proprietary 1M-token models, while V4-Flash achieves 89% accuracy at roughly half the compute cost.
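To make the scaling argument concrete, the sketch below contrasts dense attention with a simple causal sliding-window variant in PyTorch. This is only an illustration of the general idea: DeepSeek has not published V4’s attention code, and the window size, the masking scheme, and the fact that the sketch still materializes the full score matrix (real sparse kernels do not) are simplifications of ours.

```python
# Illustrative sketch only; DeepSeek has not released V4's attention implementation.
# Contrasts dense attention (O(n^2)) with a causal sliding-window pattern.
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    # Full n x n score matrix: compute and memory grow quadratically with n.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def sliding_window_attention(q, k, v, window=512):
    # Hypothetical sparse pattern: each query attends only to itself and the
    # previous `window` positions. For clarity this sketch still builds the
    # full score matrix and masks it; a real kernel computes only the
    # in-window blocks, bringing the cost down to roughly O(n * window).
    n = q.size(-2)
    idx = torch.arange(n)
    diff = idx[:, None] - idx[None, :]     # query index minus key index
    mask = (diff >= 0) & (diff <= window)  # causal band
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    n, d = 2048, 64
    q, k, v = (torch.randn(1, n, d) for _ in range(3))
    print(dense_attention(q, k, v).shape)           # torch.Size([1, 2048, 64])
    print(sliding_window_attention(q, k, v).shape)  # torch.Size([1, 2048, 64])
```

Whatever pattern V4 actually uses, the principle is the same: by never computing scores outside a restricted set of positions, the per-token cost stops growing with the full sequence length.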
Another notable improvement is agent performance. The DeepSeek V4 preview scores 74.3% on SWE-bench Verified, a challenging software engineering benchmark, placing it ahead of Qwen2.5-72B and on par with GPT-4o. This makes it a strong candidate for autonomous coding agents, a rapidly growing use case in the developer community.
Implications for the Open-Source Ecosystem and Developer Workflow

The release of DeepSeek V4 preview solidifies a trend: frontier-level model capabilities are no longer exclusive to proprietary APIs. Developers who self-host LLMs can now access million-token context, advanced reasoning, and state-of-the-art agent performance without paying per-token fees to cloud providers. This is particularly relevant for enterprises with strict data sovereignty requirements, as well as for startups building on custom fine-tuned models.
However, we note two limitations in the preview. First, the 1M-token context is supported only in the base inference mode; fine-tuning with such long sequences is not yet available, which restricts custom adaptation for domain-specific long-document tasks. Second, the memory footprint for loading a 1M-token context into GPU VRAM remains substantial—DeepSeek recommends at least 4× A100 80GB for smooth inference with V4-Pro at full context. This may limit adoption to teams with substantial hardware budgets.
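To see why the hardware requirement is so steep, a rough back-of-the-envelope estimate of the KV cache alone at 1M tokens is sketched below. Every architecture number is a hypothetical placeholder (DeepSeek has not published V4’s configuration); the point is simply that the cache grows linearly with context length and quickly dominates VRAM at this scale.

```python
# Back-of-the-envelope KV-cache estimate for a hypothetical dense transformer.
# Every architecture number below is a placeholder, not V4's real configuration.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gib = kv_cache_bytes(
    seq_len=1_000_000,   # 1M-token context
    n_layers=60,         # hypothetical layer count
    n_kv_heads=8,        # hypothetical (grouped-query attention)
    head_dim=128,        # hypothetical
    bytes_per_value=2,   # FP16 / BF16
) / 1024**3
print(f"KV cache alone: ~{gib:.0f} GiB")  # ~229 GiB with these placeholders
```

With these placeholder numbers the cache alone approaches three A100 80GB cards’ worth of memory before counting the model weights themselves, which is consistent with the multi-GPU recommendation above.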
Despite these constraints, DeepSeek’s move is likely to accelerate competition among open-source model providers (such as Meta, Alibaba’s Qwen, and Mistral) to match or exceed the 1M-token standard. The official changelog confirms that DeepSeek will release a full stable version of V4 in Q2 2025, along with a technical paper detailing architecture changes. For developers, the immediate takeaway is that the era of “infinite context” for open models has arrived ahead of many predictions.
We recommend that teams exploring long-context applications begin experimenting with DeepSeek V4-Flash for latency-sensitive tasks and V4-Pro for accuracy-critical workloads. The model weights and inference code are available on Hugging Face and GitHub under the DeepSeek organization. As always, developers should validate performance on their own benchmarks, as preview models may exhibit edge cases not covered in official evaluations.
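For teams ready to try it, the snippet below shows the standard Hugging Face loading pattern. Note that the repository ID is our assumption, not a confirmed name; check the DeepSeek organization page for the actual model IDs before running it.

```python
# Standard Hugging Face loading pattern; the model ID below is an assumption,
# not a confirmed repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical repo name

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # shard weights across available GPUs
    trust_remote_code=True,  # often required for custom DeepSeek architectures
)

# Any local long document serves as a quick long-context smoke test.
prompt = "Summarize the following document:\n" + open("report.txt").read()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```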