DeepSeek: The Good, the Bad, and the Ugly – Part 1
This article was originally published on quasi.pros.com.
The introduction of DeepSeek shocked the AI world. It shattered the belief that only tech giants (e.g. OpenAI, Anthropic, Google, Meta) with virtually infinite resources could train top-performing generative AI (GenAI) models. DeepSeek released a reasoning model, R1, that rivals OpenAI’s o1 while using only a fraction of the resources (roughly $5M as opposed to ~$100M, and ~2,000 GPUs as opposed to ~100,000). About $1T of market cap was erased in the wake of the R1 launch.
Many have pinged me for comments, and even more have asked how this is possible and what it means. Rather than repeating myself, I thought I’d share some of my thoughts here. Unlike most commentators, my take on DeepSeek is a bit mixed.
Don’t get me wrong. I do think there is some really clever engineering and data science innovation behind the development of DeepSeek-V3 (the base LLM) and R1 (their reasoning model), and there is a lot we can learn from it. However, there is also evidence suggesting that DeepSeek hasn’t been completely transparent about everything they did. So my feeling is a mix of the Good, the Bad, and the Ugly.
For today, let’s focus on the Good. These are great innovations that deserve accolades and the industry’s recognition.
Hyper-Efficient DeepSeek Innovations
DeepSeek’s Real Innovations
IMHO, here are some of DeepSeek’s most important innovations that, in combination, enabled them to engineer R1 with hyper-efficiency.
1. FP8 Precision Training for Faster and More Memory-Efficient Models
Most large language models (LLMs) use 16-bit floating point numbers (FP16) to store and compute their weights and probability outputs during training. However, as LLMs become commoditized, even closed-source providers have released smaller or quantized versions of their models to compete for adoption. And it has been observed that quantizing down to 8-bit precision doesn’t degrade an LLM’s performance much. So why not just train the LLM with 8-bit floating point numbers (FP8) from the start?
Since FP8 operations run roughly twice as fast as FP16 on hardware that supports them, DeepSeek was able to train its model about 2x faster (or, equivalently, with fewer GPUs). Moreover, because FP8 takes up only half the memory of FP16, larger models can be trained on fewer GPUs without significant performance loss.
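To make the memory savings concrete, here is a minimal PyTorch sketch (assuming a recent PyTorch build that exposes the torch.float8_e4m3fn dtype). It only illustrates the storage side of the argument; DeepSeek’s actual FP8 recipe adds fine-grained scaling factors and higher-precision accumulation that are not shown here.

```python
import torch

# A toy weight matrix, e.g. one projection layer of a transformer block.
w_fp32 = torch.randn(4096, 4096)

# Half precision (FP16): 2 bytes per parameter.
w_fp16 = w_fp32.to(torch.float16)

# 8-bit float (FP8, e4m3 variant): 1 byte per parameter.
# NOTE: torch.float8_e4m3fn is available in recent PyTorch releases; it is
# mainly a storage dtype, and most ops require casting back up to compute.
w_fp8 = w_fp32.to(torch.float8_e4m3fn)

def size_mb(t: torch.Tensor) -> float:
    return t.numel() * t.element_size() / 1e6

print(f"FP16 weights: {size_mb(w_fp16):.1f} MB")  # ~33.6 MB
print(f"FP8  weights: {size_mb(w_fp8):.1f} MB")   # ~16.8 MB (half the memory)

# In a mixed-precision setup, FP8 weights are upcast just in time for the
# matmul, keeping the resident memory footprint low:
x = torch.randn(8, 4096, dtype=torch.float16)
y = x @ w_fp8.to(torch.float16).T  # upcast only for compute in this sketch
```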
2. Multi-Token Prediction (MTP) for Faster Inference
There are two common approaches to training LLMs. OpenAI’s autoregressive approach predicts the very next word from a given sequence (i.e. the prompt), whereas Google’s bidirectional approach predicts a missing word in the middle of the sequence. Regardless of the approach, nearly all LLMs operate by predicting a single token at a time.
In contrast, DeepSeek’s model can predict multiple tokens at each inference step. Predicting two tokens jointly squares the size of the output space, but the strong correlations within realistic word sequences push the vast majority of those joint probabilities to near zero. So in practice, MTP lets DeepSeek generate responses several times faster without incurring much additional computational cost. Moreover, since DeepSeek uses FP8, many of those tiny probabilities simply underflow to zero, which further reduces the computing resources needed for training and inference.
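As a rough illustration of the idea (a generic sketch, not DeepSeek’s exact MTP architecture, which chains additional prediction modules during training), here is a toy PyTorch head that emits logits for the next k tokens from the same hidden state instead of just one. All module and parameter names are made up for the example.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy multi-token prediction head: from one hidden state, emit logits
    for the next `k` tokens instead of just the next one.
    (Simplified sketch; DeepSeek's MTP uses extra sequential modules.)"""

    def __init__(self, d_model: int, vocab_size: int, k: int = 2):
        super().__init__()
        # One linear projection per future position.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(k)]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        # returns: (batch, seq_len, k, vocab_size) -- logits for the next k tokens
        return torch.stack([head(hidden) for head in self.heads], dim=2)

# Usage sketch: draft k tokens from a single forward pass.
head = MultiTokenHead(d_model=512, vocab_size=32000, k=2)
hidden = torch.randn(1, 16, 512)         # pretend transformer output
logits = head(hidden)                    # (1, 16, 2, 32000)
next_two = logits[:, -1].argmax(dim=-1)  # greedy guess for the next 2 tokens
print(next_two.shape)                    # torch.Size([1, 2])
```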
3. Mixture-of-Experts (MoE) for Efficient Training and Inference
Today, most popular LLMs are dense, in the sense that the entire network is active during inference. For GPT-3, all 175B parameters contribute to the calculation of every token probability. DeepSeek’s MoE approach selectively activates only a subset of the model’s parameters for each token. So despite having 671B parameters, only 37B (~5.5% of the network) are active at any given time. The model behaves like a small model in terms of computing cost while retaining the expressive power of the larger one.
MoE reduces training compute costs because each token exercises (and updates) only about 5.5% of the 671B parameters, allowing faster, more scalable learning while different expert subnetworks specialize in specific tasks. And since only the most relevant 37B parameters are active per token at inference time, responses can be generated faster without massive GPU clusters. This further compounds the efficiency gains from FP8 and MTP.
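For intuition, here is a minimal top-k gated MoE layer in PyTorch. It is a generic sketch of the technique, not DeepSeek’s implementation (DeepSeek-V3 adds shared experts, fine-grained expert segmentation, and an auxiliary-loss-free load-balancing scheme that are omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer: a router picks the top-k experts
    per token, and only those experts run. Generic sketch, not DeepSeek's code."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.router(x)                                  # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # choose k experts per token
        weights = F.softmax(topk_scores, dim=-1)                 # mixing weights

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE(d_model=256, d_hidden=1024)
tokens = torch.randn(32, 256)
print(moe(tokens).shape)  # torch.Size([32, 256]) -- only 2 of 8 experts ran per token
```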
4. Hardware-Level Optimizations for Maximum GPU Utilization
Unlike a CPU, which executes instructions sequentially, GPU computations are inherently parallel and distributed. A standard GPU execution pipeline alternates between computing intermediate results and transferring that data for subsequent computation. While data is being transferred, the GPU often sits idle waiting for the next set to arrive, which can waste roughly half of the available GPU cycles.
DeepSeek’s DualPipe algorithm creates two parallel execution pipelines that work simultaneously to overlap computation and data transfer. While pipeline 1 is computing, pipeline 2 is transferring data; when pipeline 1 finishes its computation and starts transferring, pipeline 2 is ready to compute. Beyond DualPipe, DeepSeek also optimizes data flow between GPUs by leveraging high-speed NVLink for intra-node communication and InfiniBand for inter-node transfers, ensuring that data moves through the fastest available channel and reducing delay. These hardware optimizations keep the GPUs fully utilized, computing close to 100% of the time while data is transferred as fast as possible.
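DualPipe itself is a full bidirectional pipeline-parallel schedule, but the underlying principle (overlap computation with communication so the GPU never waits) can be illustrated with plain CUDA streams in PyTorch. The sketch below prefetches the next micro-batch on a side stream while the current one is being processed; all names are made up for illustration, and it requires a CUDA GPU.

```python
import torch

# Sketch: overlap host-to-GPU data transfer with GPU compute using CUDA streams.
# This only illustrates the general compute/communication overlap idea behind
# DualPipe; the real algorithm schedules forward/backward chunks across
# pipeline-parallel ranks, which is far more involved.

assert torch.cuda.is_available()
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream dedicated to data transfer

weight = torch.randn(4096, 4096, device=device)
batches = [torch.randn(4096, 4096, pin_memory=True) for _ in range(4)]

# Prefetch the first batch on the copy stream.
with torch.cuda.stream(copy_stream):
    current = batches[0].to(device, non_blocking=True)

for i in range(len(batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # ensure transfer finished
    x = current

    # Kick off the next transfer on the copy stream while we compute on this batch.
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):
            current = batches[i + 1].to(device, non_blocking=True)

    y = x @ weight  # compute on the default stream overlaps with the prefetch above

torch.cuda.synchronize()
print("done:", y.shape)
```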
Constraints Breed Innovation
Constraints are often seen as barriers that slow down the pace of tech development. In reality, however, they can act as powerful drivers of innovation. Being in China, DeepSeek operates under a completely different set of constraints than AI labs in the US and Europe.
US export restrictions have cut off access to the most advanced AI chips and GPUs. This makes it far more challenging for Chinese AI labs to compete by brute-force scaling. Meanwhile, the lack of complete financial transparency has made it harder for Chinese companies to attract foreign investment. This limits access to the vast capital pools that fuel AI development in the West.
These constraints have forced Chinese AI labs to innovate in ways that others haven’t. Rather than relying on massive GPU clusters, DeepSeek had to engineer its models with hyper-efficiency to get the most out of the GPUs it could access. In hindsight, many of the abovementioned innovations are no-brainers. US tech giants simply never bothered to try them, because without such constraints they were never compelled to.
Conclusion
My opinion about DeepSeek is mixed. In part 1 of this series, we focused on 4 of the most important innovations behind DeepSeek’s ability to train and deploy cost-effective LLMs.
- FP8 Precision Training for Faster and More Memory-Efficient Models
- Multi-Token Prediction (MTP) for Faster Inference
- Mixture-of-Experts (MoE) for Efficient Training and Inference
- Hardware Level Optimizations for Maximum GPU Utilization
These are highly effective engineering and data science techniques that made DeepSeek hyper-efficient, and there is a lot we can learn from them. Yet these innovations were driven primarily by necessity, in response to the first of the two major constraints DeepSeek faces.
- Limited access to GPUs due to restrictions on US export to China
- Limited access to foreign capital due to the lack of financial transparency
Stay tuned for the next installment, where we will discuss the opposite side (the Bad: what I don’t like about DeepSeek).