DeepSeek: The Good, the Bad, and the Ugly – Part 1

Key Takeaways

  • FP8 precision training: Enables faster, memory-efficient model training.
  • Multi-token prediction (MTP): Speeds up inference with minimal cost.
  • Mixture-of-Experts (MoE): Activates only relevant parameters for efficiency.
  • Hardware optimizations: Maximizes GPU utilization with DualPipe algorithms.
  • Innovation through constraints: Limited resources drive hyper-efficient solutions.

This article was originally published on quasi.pros.com.

The introduction of DeepSeek shocked the AI world. It shattered the belief that only tech giants (e.g. OpenAI, Anthropic, Google, Meta) with virtually infinite resources could train top-performing generative AI (GenAI) models. DeepSeek released a reasoning model (R1) that rivals OpenAI’s o1 model while using only a fraction of the resources (roughly $5M instead of ~$100M, and ~2,000 GPUs instead of ~100,000). About $1T in market cap was erased in the wake of the R1 launch.


Many have pinged me for comments, and even more have asked me how this is possible and what it means. Rather than repeating myself, I thought I’d share some of my thoughts here. Unlike most, my thoughts on DeepSeek are a bit mixed.

Don’t get me wrong. I do think there is some really clever engineering and data science innovation behind the development of DeepSeek V3 (the base LLM) and R1 (their reasoning model), and there is a lot we can learn from it. However, there is also evidence suggesting DeepSeek hasn’t been completely transparent about everything they did. So my feeling is like the Good, the Bad, and the Ugly.

For today, let’s focus on the Good. These are great innovations that deserve accolades and the industry’s recognition.

DeepSeek’s Real Innovations

IMHO, here are some of DeepSeek’s most important innovations that, in combination, enabled them to engineer R1 with hyper-efficiency.

1. FP8 Precision Training for Faster and More Memory-Efficient Models

Most large language models (LLMs) use 16-bit floating-point numbers (FP16) to store and compute their weights and probability outputs during training. However, as LLMs become commoditized, even closed-source providers have released smaller or quantized versions of their models to compete for adoption. And it has been observed that quantizing down to 8-bit precision doesn’t degrade an LLM’s performance much. So why not train the LLM with 8-bit floating-point numbers (FP8) from the start?

Since FP8 operations are twice as fast as FP16, DeepSeek was able to train its model roughly 2x faster (or, equivalently, with fewer GPUs). Moreover, because FP8 takes up only half the memory of FP16, larger models can be trained on fewer GPUs without significant performance loss.
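To make the memory argument concrete, here is a toy sketch (not DeepSeek’s training code) that casts a weight matrix from FP16 to PyTorch’s FP8 (E4M3) dtype and compares storage size and round-trip error. Production FP8 training, as described in DeepSeek’s technical report, also relies on scaled FP8 matrix multiplies, typically provided by libraries such as NVIDIA’s Transformer Engine.

```python
import torch

# Illustrative only: compare the memory footprint and round-trip error of
# FP16 vs FP8 (E4M3) storage for a weight matrix.
# Requires a recent PyTorch version (>= 2.1) for the float8 dtypes.
w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8 = w_fp16.to(torch.float8_e4m3fn)

print(f"FP16 bytes: {w_fp16.element_size() * w_fp16.numel():,}")  # 2 bytes per element
print(f"FP8  bytes: {w_fp8.element_size() * w_fp8.numel():,}")    # 1 byte per element

# Cast back to FP16 to measure the precision lost by the narrower format.
roundtrip_err = (w_fp8.to(torch.float16) - w_fp16).abs().mean()
print(f"Mean absolute round-trip error: {roundtrip_err.item():.4f}")
```

The halved element size is where the memory saving comes from; the speed gain comes from FP8 tensor-core matmuls on supported hardware, which this sketch does not attempt to show.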

2. Multi-Token Prediction (MTP) for Faster Inference

There are two main approaches to training LLMs. OpenAI’s autoregressive approach predicts the very next word from a given sequence (i.e. the prompt), whereas Google’s bidirectional approach predicts a missing word in the middle of the sequence. Regardless of the approach, nearly all LLMs operate by predicting a single token at a time.

In contrast, DeepSeek’s model can predict multiple tokens at each inference step. Although this could quadratically increase the space of possible outputs, the strong correlations within realistic word sequences constrain the vast majority of token probabilities to near zero. So in practice, MTP lets DeepSeek generate responses several times faster without incurring much extra computational cost. Moreover, since DeepSeek uses FP8, many tiny token probabilities resolve to zero under FP8 precision, further reducing the compute needed for training and inference.
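As an illustration of the core idea (and not DeepSeek’s actual MTP module, which is described in the DeepSeek-V3 technical report), here is a minimal sketch of a shared trunk feeding two output heads: the usual next-token head plus an extra head that drafts the token after it.

```python
import torch
import torch.nn as nn

class ToyMTPHead(nn.Module):
    """Minimal multi-token prediction sketch: one extra output head that
    predicts token t+2 alongside the usual next-token (t+1) head.
    Illustrative simplification, not DeepSeek's actual MTP module."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.next_token_head = nn.Linear(hidden_dim, vocab_size)
        self.second_token_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from the shared transformer trunk
        logits_t1 = self.next_token_head(hidden_states)    # predicts token t+1
        logits_t2 = self.second_token_head(hidden_states)  # predicts token t+2
        return logits_t1, logits_t2

# During training, both heads get a loss against the shifted targets; at inference,
# the extra head drafts a second token that can be cheaply verified, which is where
# the speed-up comes from.
hidden = torch.randn(2, 16, 512)
head = ToyMTPHead(hidden_dim=512, vocab_size=32000)
logits_next, logits_second = head(hidden)
print(logits_next.shape, logits_second.shape)  # (2, 16, 32000) each
```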

3. Mixture-of-Experts (MoE) for Efficient Training and Inference

Today, most popular LLMs are dense, in the sense that the entire network is active during inference: for GPT-3, all 175B parameters contribute to the calculation of each token probability. DeepSeek’s MoE approach instead selectively activates only a subset of the model’s parameters for each token. So despite having 671B parameters, only 37B (~5.5% of the network) are active at any given time. This means the model behaves like a small model in terms of compute cost while retaining the expressive power of the larger one.

MoE reduces training compute by updating only ~5.5% of its 671B parameters per token, allowing faster, more scalable learning while letting different expert subnetworks specialize in specific tasks. Since only the most relevant 37B parameters are active per token, responses can be generated faster without massive GPU clusters. This further compounds the efficiency gains from FP8 and MTP.
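Here is a minimal sketch of the basic top-k routing idea behind MoE. It is illustrative only; DeepSeek’s actual DeepSeekMoE design adds refinements such as shared experts, fine-grained experts, and load balancing on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer: a router picks k experts per
    token, so only a small fraction of the layer's parameters runs for each token."""

    def __init__(self, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_logits = self.router(x)                                # (tokens, experts)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)  # route each token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
moe = ToyTopKMoE(hidden_dim=64)
print(moe(tokens).shape)  # (10, 64); only 2 of the 8 expert MLPs ran per token
```

The compute saving falls directly out of the routing: each token only pays for the top-k experts, while the full set of experts provides the model’s overall capacity.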

4. Hardware Level Optimizations for Maximum GPU Utilization

Unlike a CPU, which executes instructions sequentially, GPU computation is inherently parallel and distributed. A standard GPU execution pipeline cycles between computing intermediate results and transferring that data for subsequent computation. And while the GPU is transferring data, it often sits idle waiting for the next batch of data to arrive, which can waste roughly 50% of GPU cycles.

DeepSeek’s DualPipe algorithm creates two parallel execution pipelines that work simultaneously to overlap computation and data transfer: while pipeline 1 is computing, pipeline 2 is transferring data, and when pipeline 1 finishes computing and starts transferring, pipeline 2 is ready to compute. Beyond DualPipe, DeepSeek also optimizes data flow between GPUs by leveraging high-speed NVLink for intra-node communication and InfiniBand for inter-node transfers. This architecture ensures that data moves through the fastest available channels, reducing delay. Together, these hardware optimizations keep GPUs close to fully utilized, computing nearly all of the time and transferring data as fast as possible.
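DualPipe itself is a full bidirectional pipeline-parallel schedule, which is well beyond the scope of a blog snippet. The sketch below illustrates only the underlying principle of overlapping data transfer with computation, using a second CUDA stream to prefetch the next micro-batch while the current one is being processed (it assumes a CUDA-capable GPU and is not DeepSeek’s implementation).

```python
import torch

# Illustrative double-buffering sketch: while the GPU computes on the current
# micro-batch, a separate CUDA stream copies the next micro-batch host->device,
# so transfer overlaps with compute instead of leaving the GPU idle.
assert torch.cuda.is_available(), "this sketch assumes a CUDA GPU"

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Pinned host memory allows truly asynchronous host->device copies.
micro_batches = [torch.randn(1024, 1024, pin_memory=True) for _ in range(8)]
weight = torch.randn(1024, 1024, device=device)

# Prefetch the first micro-batch on the copy stream.
with torch.cuda.stream(copy_stream):
    current = micro_batches[0].to(device, non_blocking=True)

for i in range(len(micro_batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # make sure the prefetch finished
    nxt = None
    if i + 1 < len(micro_batches):
        with torch.cuda.stream(copy_stream):               # this copy overlaps the matmul below
            nxt = micro_batches[i + 1].to(device, non_blocking=True)
    out = current @ weight                                  # "compute" on the default stream
    current = nxt

torch.cuda.synchronize()
print("done:", out.shape)
```

The same overlap principle, applied at the scale of pipeline-parallel training across thousands of GPUs, is what DualPipe and the NVLink/InfiniBand routing are aiming at.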

Constraints Breed Innovation

Constraints are often seen as barriers that slow down the pace of tech development. In reality, however, they can act as powerful drivers of innovation. Being based in China, DeepSeek operates under a completely different set of constraints from AI labs in the US and Europe.

US export restrictions have cut off access to the most advanced AI chips and GPUs. This makes it far more challenging for Chinese AI labs to compete by brute-force scaling. Meanwhile, the lack of complete financial transparency has made it harder for Chinese companies to attract foreign investment. This limits access to the vast capital pools that fuel AI development in the West.

These constraints have forced Chinese AI labs to innovate in ways that others haven’t. Rather than relying on massive GPU clusters, DeepSeek had to engineer its model with hyper-efficiency to get the most out of the GPUs it could access. In hindsight, many of the abovementioned innovations are no-brainers; US tech giants simply never bothered to try them, because the absence of such constraints never compelled them to.

Conclusion

My opinion about DeepSeek is mixed. In part 1 of this series, we focused on 4 of the most important innovations behind DeepSeek’s ability to train and deploy cost-effective LLMs.

  1. FP8 Precision Training for Faster and More Memory-Efficient Models
  2. Multi-Token Prediction (MTP) for Faster Inference
  3. Mixture-of-Experts (MoE) for Efficient Training and Inference
  4. Hardware Level Optimizations for Maximum GPU Utilization

These are highly effective engineering and data science techniques that made DeepSeek hyper-efficient. There is a lot we can learn from these innovations. Yet they are primarily driven by necessity, in response to the first of the two major constraints DeepSeek faces:

  1. Limited access to GPUs due to restrictions on US export to China
  2. Limited access to foreign capital due to the lack of financial transparency

Stay tuned for the next installment, where we will discuss the opposite side (the Bad: what I don’t like about DeepSeek).

Frequently Asked Questions

What is DeepSeek, and why is it significant?

DeepSeek is an AI company that developed a reasoning model (R1) rivaling top-tier models like OpenAI’s, using significantly fewer resources. 

What is FP8 precision training, and how does it help?

FP8 uses 8-bit floating-point numbers for training, doubling speed and halving memory usage without significant performance loss. 

How does multi-token prediction (MTP) improve inference?

MTP predicts multiple tokens at once, accelerating response generation while maintaining computational efficiency. 

What is the Mixture-of-Experts (MoE) approach?

MoE activates only a subset of model parameters, reducing computational costs while retaining the power of larger models. 

How does DeepSeek optimize GPU utilization?

DeepSeek uses DualPipe algorithms and high-speed data transfer technologies like NVLink and InfiniBand to maximize GPU efficiency. 

What challenges drive DeepSeek’s innovations?

Limited access to advanced GPUs and foreign capital forced DeepSeek to innovate hyper-efficient AI solutions. 

Why is DeepSeek’s approach important for the AI industry?

DeepSeek’s innovations demonstrate how constraints can lead to groundbreaking efficiency, challenging the dominance of resource-heavy tech giants. 
