
DeepSeek: The Good, the Bad, and the Ugly – Part 1

This article was originally published on quasi.pros.com.

The introduction of DeepSeek shocked the AI world. It shattered the belief that only tech giants (e.g. OpenAI, Anthropic, Google, Meta) with virtually infinite resources could train top-performing generative AI (GenAI) models. DeepSeek released a reasoning model (R1) that rivals OpenAI’s o1 model while using only a fraction of the resources (reportedly ~$5M instead of ~$100M, and ~2,000 GPUs instead of ~100,000). About $1T of market cap was erased in the wake of the R1 launch.


Many have pinged me for comments, and even more have asked me how this is possible and what it means. Rather than repeating myself, I thought I’d share some of my thoughts here. Unlike most, my thoughts on DeepSeek are a bit mixed.

Don’t get me wrong. I do think there is some really clever engineering and data science innovation behind the development of DeepSeek-V3 (the base LLM) and R1 (their reasoning model), and there is a lot we could learn from it. However, there is also evidence suggesting that DeepSeek hasn’t been completely transparent about everything they did. So my feeling is a bit like the Good, the Bad, and the Ugly.

For today, let’s focus on the Good. These are great innovations that deserve accolades and the industry’s recognition.


DeepSeek’s Real Innovations

IMHO, here are some of DeepSeek’s most important innovations that, in combination, enabled them to engineer R1 with hyper-efficiency.

1. FP8 Precision Training for Faster and More Memory-Efficient Models

Most large language models (LLMs) use 16-bit floating point numbers (FP16) to store and compute their weights and probability outputs during training. However, as LLMs become commoditized, even closed-source providers have released smaller or quantized versions of their models to compete for adoption, and it has been observed that reducing precision to 8-bit quantization doesn’t degrade an LLM’s performance much. So why not just train the LLM using 8-bit floating point numbers (FP8) from the start?

Since FP8 operations are roughly twice as fast as FP16 on supported hardware, DeepSeek was able to train its model about 2x faster (or, equivalently, with fewer GPUs). Moreover, because FP8 takes up only half the memory of FP16, larger models can be trained on fewer GPUs without significant performance loss.
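
To put the memory side of this in perspective, here is a back-of-the-envelope sketch (my own illustration, using DeepSeek-V3’s published 671B parameter count) of how much memory the raw weights alone occupy at different precisions:

```python
# Back-of-the-envelope memory math for storing model weights at different
# precisions. Illustrative only: real training also needs gradients, optimizer
# states, and activations, and DeepSeek keeps some operations in higher precision.

params = 671e9  # DeepSeek-V3's total parameter count

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to store the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    print(f"{name:>9}: ~{weight_memory_gb(params, nbytes):,.0f} GB of weights")

# FP8 output: ~671 GB, i.e. half the footprint of FP16, so the same model fits
# on fewer GPUs (or a larger model fits on the same GPUs).
```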

2. Multi-Token Prediction (MTP) for Faster Inference

There are two main approaches to training LLMs. OpenAI’s autoregressive approach predicts the very next word from a given sequence (i.e. the prompt), whereas Google’s bidirectional (masked) approach predicts a missing word in the middle of the sequence. Regardless of the approach, nearly all LLMs operate by predicting a single token at a time.

In contrast, DeepSeek’s model can predict multiple tokens at each inference step. Although predicting several tokens jointly expands the space of possible outputs combinatorially, the strong correlations within realistic word sequences drive the vast majority of those token probabilities toward zero. So in practice, MTP enables DeepSeek to generate responses several times faster without incurring much additional computational cost. Moreover, since DeepSeek uses FP8, many tiny token probabilities are rounded to zero under FP8 precision, which further reduces the computing resources needed for training and inference.
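
To make the training-side idea concrete, here is a minimal toy sketch (my own simplification, not DeepSeek-V3’s actual MTP modules, and using a GRU as a stand-in for the transformer trunk): a shared trunk feeds several small heads, and head d is trained to predict the token d steps ahead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPModel(nn.Module):
    """Toy multi-token prediction: one shared trunk, one head per lookahead depth."""

    def __init__(self, vocab=1000, d_model=64, depth=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(depth))

    def forward(self, tokens):                     # tokens: (batch, seq)
        h, _ = self.trunk(self.embed(tokens))      # (batch, seq, d_model)
        return [head(h) for head in self.heads]    # one logit tensor per lookahead depth

def mtp_loss(model, tokens):
    """Average the cross-entropy losses over all prediction depths."""
    losses = []
    for d, logits in enumerate(model(tokens), start=1):
        pred = logits[:, :-d]                      # positions with a target d steps ahead
        target = tokens[:, d:]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1)))
    return sum(losses) / len(losses)

model = ToyMTPModel()
batch = torch.randint(0, 1000, (4, 32))
print(mtp_loss(model, batch))                      # one scalar combining 1-step and 2-step losses
```

At inference time, heads like these can draft several upcoming tokens in a single pass, which is where the generation speedup comes from.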

3. Mixture-of-Experts (MoE) for Efficient Training and Inference

Today, most popular LLMs are dense in the sense that the entire network is active during inference. For GPT-3, all 175B parameters are active and contribute to the calculation of each token probability. DeepSeek’s MoE approach selectively activates only a subset of the model’s parameters for each token. So despite having 671B parameters, only 37B parameters (~5.5% of the entire network) are active at any given time. This means the model behaves like a small model in terms of compute cost while retaining the expressive power of the larger model.

MoE reduces training compute because only ~5.5% of the 671B parameters are updated for any given token, allowing for faster, more scalable learning while different expert subnetworks specialize in specific tasks. Since only the most relevant 37B parameters are active per token, responses can also be generated faster at inference time without massive GPU clusters. This further compounds the efficiency gains from FP8 and MTP.
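
Here is a minimal sketch of the routing idea (illustrative only; DeepSeek-V3’s actual MoE layers use many more, finer-grained experts plus shared experts and load-balancing mechanisms): a small gating network scores the experts for each token, and only the top-k experts are actually run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer: route each token to k of n experts."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # the gating network
        self.top_k = top_k

    def forward(self, x):                                       # x: (tokens, d_model)
        scores, idx = self.router(x).topk(self.top_k, dim=-1)   # pick k experts per token
        weights = F.softmax(scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64]); only 2 of the 8 experts ran per token
```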

4. Hardware Level Optimizations for Maximum GPU Utilization

Unlike a CPU, which executes instructions sequentially, GPU computation is inherently parallel and distributed. A standard GPU execution pipeline cycles between computing some intermediate results and transferring that data for subsequent computation. While data is being transferred, the GPU often sits idle waiting for the next batch of data to arrive, wasting up to ~50% of GPU cycles.

DeepSeek’s DualPipe algorithm creates two parallel execution pipelines that work simultaneously to overlap computation and data transfer: while pipeline 1 is computing, pipeline 2 is transferring data, and when pipeline 1 finishes computing and starts transferring, pipeline 2 is ready to start computing. Beyond DualPipe, DeepSeek also optimizes data flow between GPUs by leveraging high-speed NVLink for intra-node communication and InfiniBand for inter-node transfers, ensuring that data moves through the fastest available channels and reducing delay. These hardware-level optimizations keep the GPUs nearly fully utilized, computing close to 100% of the time while transferring data as fast as possible.
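
The core trick is generic double buffering: keep the next chunk of data in flight while computing on the current one. Here is a deliberately simplified, CPU-only Python sketch of that overlap pattern (a conceptual analogy of my own, not DeepSeek’s DualPipe, which applies the idea to pipeline-parallel training across GPUs):

```python
import queue
import threading
import time

def transfer(chunks, q):
    """Stand-in for data transfer: push chunks into a small buffer one by one."""
    for chunk in chunks:
        time.sleep(0.1)          # pretend this is a slow transfer
        q.put(chunk)
    q.put(None)                  # sentinel: nothing left to transfer

def compute(chunk):
    """Stand-in for GPU compute on one chunk of data."""
    time.sleep(0.1)              # pretend this is the compute step
    return sum(chunk)

chunks = [[i] * 1000 for i in range(8)]
q = queue.Queue(maxsize=2)       # small buffer, analogous to double buffering
threading.Thread(target=transfer, args=(chunks, q), daemon=True).start()

start, results = time.time(), []
while (chunk := q.get()) is not None:
    results.append(compute(chunk))                 # compute overlaps the next transfer

serial = 0.2 * len(chunks)                         # transfer then compute, back to back
print(f"overlapped: {time.time() - start:.2f}s vs ~{serial:.2f}s if fully serialized")
```

Run as-is, the overlapped loop finishes in a bit over half the serialized time, which is exactly the kind of idle-time recovery DualPipe and the NVLink/InfiniBand routing aim for at cluster scale.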

Constraints Breed Innovation

Constraints are often seen as barriers that slow down the pace of tech development. In reality, however, they can act as powerful drivers of innovation. DeepSeek, being in China, operates under a completely different set of constraints from AI labs in the US and Europe.

US export restrictions have cut off access to the most advanced AI chips and GPUs. This makes it far more challenging for Chinese AI labs to compete by brute-force scaling. Meanwhile, the lack of complete financial transparency has made it harder for Chinese companies to attract foreign investment. This limits access to the vast capital pools that fuel AI development in the West.

These constraints have forced Chinese AI labs to innovate in ways that others haven’t. Rather than relying on massive GPU clusters, DeepSeek had to engineer its model with hyper-efficiency to get the most out of the GPUs it could access. In hindsight, many of the abovementioned innovations look like no-brainers; US tech giants simply never bothered to try them, because they were never compelled to by similar constraints.

Conclusion

My opinion about DeepSeek is mixed. In part 1 of this series, we focused on 4 of the most important innovations behind DeepSeek’s ability to train and deploy cost-effective LLMs.

  1. FP8 Precision Training for Faster and More Memory-Efficient Models
  2. Multi-Token Prediction (MTP) for Faster Inference
  3. Mixture-of-Experts (MoE) for Efficient Training and Inference
  4. Hardware Level Optimizations for Maximum GPU Utilization

These are highly effective engineering and data science techniques that made DeepSeek hyper-efficient, and there is a lot we could learn from them. Yet these innovations were primarily driven by necessity, in response to the first of the two major constraints DeepSeek faces.

  1. Limited access to GPUs due to restrictions on US export to China
  2. Limited access to foreign capital due to the lack of financial transparency

Stay tuned for the next installment, where we will discuss the opposite side (the Bad: what I don’t like about DeepSeek).

About the Author

Dr. Michael Wu is one of the world’s premier authorities on artificial intelligence (AI), machine learning (ML), data science, and behavioral economics. He’s the Chief AI Strategist at PROS (NYSE: PRO), an AI-powered SaaS provider that helps companies monetize more efficiently in the digital economy. He’s been appointed as a Senior Research Fellow at the Ecole des Ponts Business School for his work in data science. Prior to PROS, Michael was the Chief Scientist at Lithium for a decade, where he focused on developing predictive and prescriptive algorithms to extract insights from social media big data. His research spans many areas, including customer experience, CRM, online influence, gamification, digital transformation, and AI. His R&D won him recognition as an Influential Leader by CRM Magazine alongside Mark Zuckerberg, Marc Benioff, and other industry giants. Michael served as a DOE fellow at the Los Alamos National Lab conducting research in face recognition and was awarded four years of full fellowship under the Computational Science Graduate Fellowship. Prior to industry, Michael received his triple-major undergraduate degree in Applied Math, Physics, and Molecular & Cell Biology, and his Ph.D. from UC Berkeley’s Biophysics program, where he used machine learning to model visual processing within the human brain. Michael believes in knowledge dissemination and speaks internationally at universities, conferences, and enterprises. His insights have inspired many global enterprises and are made accessible through his two easy-reading e-books, “The Science of Social” and “The Science of Social 2.”
