DeepSeek V3/R1 Architecture Explorer

Interactive demonstration of Multi-Head Latent Attention (MLA) and Mixture of Experts (MoE)

Total Parameters: 671B
Active Parameters per Token: 37B
Efficiency: 94.5% of parameters remain inactive for any single token
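
The efficiency figure is simply the complement of the active-parameter fraction. A minimal sketch of the arithmetic, using only the published 671B / 37B totals:

```python
# Headline sparsity numbers for DeepSeek V3, derived from the published
# parameter counts; no layer-level breakdown is modeled here.
TOTAL_PARAMS_B = 671    # total parameters, in billions
ACTIVE_PARAMS_B = 37    # parameters activated per token, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
inactive_fraction = 1.0 - active_fraction

print(f"Active per token:   {active_fraction:.1%}")    # ~5.5%
print(f"Inactive per token: {inactive_fraction:.1%}")  # ~94.5%, the efficiency figure above
```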

Attention Architecture Comparison

DeepSeek MLA vs Traditional Attention Mechanisms

Comparison panels:
DeepSeek MLA (DeepSeek V3)
Grouped Query Attention (Traditional, Sequential Processing)
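
At a high level, the difference the panels animate is what each mechanism must cache during decoding: GQA stores keys and values for every KV head, while MLA compresses each token into a small shared latent vector, caches only that, and reconstructs per-head keys and values from it. Below is a minimal PyTorch sketch of that compression idea; the class name and dimensions are illustrative, and DeepSeek V3 details such as the decoupled RoPE path, causal masking, and absorbed projections are omitted.

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Toy Multi-Head Latent Attention: cache a low-rank KV latent, not full K/V."""
    def __init__(self, d_model=1024, n_heads=16, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-projection to the shared KV latent: this is all that gets cached.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct per-head keys and values from the latent.
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)                               # (B, T, d_latent)
        k = self.k_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), c_kv                        # cache c_kv, not k/v

x = torch.randn(2, 8, 1024)
y, kv_cache = SimplifiedMLA()(x)
# Per token, the cache holds d_latent = 128 values instead of the
# 2 * n_heads * d_head = 2048 values needed for uncompressed per-head K and V.
print(y.shape, kv_cache.shape)
```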

Mixture of Experts (MoE)

Sparse activation with a shared-expert architecture

Active Parameters: 37B of 671B per token (94.5% of parameters skipped through sparse activation)
Expert Selection: 9 experts active per token (1 shared + 8 routed)
Router Efficiency: load balanced with no auxiliary loss
Total Experts: 256 routed experts, scaled down to 32 in the visualization
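
A minimal sketch of the routing pattern summarized above: every token always passes through the shared expert, a learned router scores the routed experts, and only the top-k highest-scoring ones run. The sizes here are deliberately tiny (8 routed experts, top-2) so the control flow is easy to read; DeepSeek V3 uses 256 routed experts with top-8 routing, and its auxiliary-loss-free load balancing (adjusting a per-expert selection bias based on observed load) is only hinted at by the route_bias buffer. Class and variable names are illustrative, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy MoE layer: one always-on shared expert plus top-k routed experts."""
    def __init__(self, d_model=64, d_ff=128, n_routed=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                         nn.Linear(d_ff, d_model))
        self.shared_expert = make_ffn()                       # runs for every token
        self.routed_experts = nn.ModuleList([make_ffn() for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        # Per-expert bias used only when *selecting* experts; nudging it up or
        # down based on observed load (not shown) is what stands in for an
        # auxiliary balancing loss.
        self.register_buffer("route_bias", torch.zeros(n_routed))

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))                # affinity per routed expert
        topk = torch.topk(scores + self.route_bias, self.top_k, dim=-1).indices
        outputs = []
        for t in range(x.size(0)):                            # per-token loop for clarity
            gates = scores[t, topk[t]]
            gates = gates / gates.sum()                       # normalize over the top-k
            routed = sum(g * self.routed_experts[e](x[t])
                         for g, e in zip(gates, topk[t].tolist()))
            outputs.append(self.shared_expert(x[t]) + routed)
        return torch.stack(outputs)

tokens = torch.randn(4, 64)
print(TinyMoE()(tokens).shape)   # (4, 64); each token used 1 shared + 2 routed experts
```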