What is Mixture of Experts in AI Models? Example - Qwen 3.5-122B-A10B
Answer
Mixture of Experts (MoE) in AI Models
Mixture of Experts (MoE) is a neural network architecture where instead of every token passing through a single dense feed-forward network, multiple expert subnetworks exist, and each token is sparsely routed to only a top-k subset of them. This lets models have massive total parameter counts while keeping inference compute proportional to a much smaller number of active parameters.
How MoE Works
An MoE layer replaces the standard FFN with two elements — a router/gate and an expert pool:
```python
# Conceptual MoE routing mechanism
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    def __init__(self, d_model=4096, num_experts=64, top_k=8):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # The router scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.SiLU(),
                nn.Linear(d_model * 4, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        batch, seq, dim = x.shape
        x_flat = x.reshape(-1, dim)  # (batch * seq, dim)
        router_logits = self.router(x_flat)
        # Keep the top-k experts per token, then normalize their weights.
        topk_weights, topk_indices = torch.topk(router_logits, self.top_k, dim=-1)
        topk_weights = torch.softmax(topk_weights, dim=-1)
        output = torch.zeros_like(x_flat)
        # Reference loop: easy to read but slow; real kernels batch tokens per expert.
        for k in range(self.top_k):
            expert_idx = topk_indices[:, k]
            weight = topk_weights[:, k]
            for e in range(self.num_experts):
                mask = expert_idx == e
                if mask.any():
                    output[mask] += weight[mask].unsqueeze(-1) * self.experts[e](x_flat[mask])
        return output.reshape(batch, seq, dim)
```
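To make the gating step concrete, here is a minimal pure-Python sketch of what the router does for a single token: pick the top-k logits, then apply softmax over only those. The function name `route_token` and the example logits are hypothetical and chosen for illustration; this is not taken from any particular model's implementation.

```python
import math


def route_token(router_logits, top_k):
    """Return (expert_index, gate_weight) pairs for one token.

    Selects the top_k largest logits, then applies softmax over only
    the selected logits so the gate weights sum to 1.
    """
    # Indices of the top_k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical logits for a 4-expert layer with top-2 routing.
logits = [0.1, 2.0, -1.0, 1.0]
for expert, weight in route_token(logits, top_k=2):
    print(f"expert {expert}: weight {weight:.2f}")
```

With these logits, experts 1 and 3 are selected and receive weights of roughly 0.73 and 0.27; the other two experts do no work for this token.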
Key Concepts
| Concept | Description |
|---|---|
| Sparse Activation | Only k experts activated per token (typically 2-8 out of 64-256 total) |
| Load Balancing | An auxiliary loss encourages the router to spread tokens evenly across experts, which helps prevent collapse |
| Expert Capacity | Maximum tokens an expert can process; tokens beyond capacity are dropped or overflowed |
| Total vs Active Params | Total = all experts combined; Active = k experts + shared layers |
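The load-balancing row can be illustrated with one common formulation, the auxiliary loss described for Switch Transformer with top-1 routing: `num_experts * sum_i(f_i * P_i)`, where `f_i` is the fraction of tokens sent to expert `i` and `P_i` is the mean router probability for that expert. The sketch below is a pure-Python illustration under that assumption; other models use different variants, and the function name and toy inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: per-token list of softmax probabilities over experts.
    assignments:  per-token index of the expert the token was sent to.
    Returns num_experts * sum_i(f_i * P_i).
    """
    num_tokens = len(assignments)
    frac_tokens = [0.0] * num_experts  # f_i
    mean_prob = [0.0] * num_experts    # P_i
    for probs, expert in zip(router_probs, assignments):
        frac_tokens[expert] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Balanced: 4 tokens spread evenly over 4 experts with uniform probabilities.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(uniform, [0, 1, 2, 3], num_experts=4)

# Collapsed: every token goes to expert 0 with probability 1.
one_hot = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
collapsed = load_balance_loss(one_hot, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)
```

In this toy setup the balanced case evaluates to 1 and the collapsed case to 4 (the number of experts), so minimizing the term pushes the router away from collapse.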
Qwen 3.5-122B-A10B Breakdown
| Metric | Value |
|---|---|
| Total parameters | 122B |
| Active parameters | 10B (per token) |
| Total-to-active ratio | ~12:1 |
| Expert count | 128 experts (64 shared + 64 routed) |
| Top-k routing | k=8 |
| Context window | 128K tokens |
| Architecture | Dense attention + MoE FFN layers |
This means the model stores knowledge across 122B parameters while spending per-token compute roughly comparable to a ~10B dense model. It is not equivalent to a dense 122B model, because each token only ever sees the experts it is routed to.
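As a back-of-the-envelope check, the sketch below applies the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule is an approximation (it ignores attention cost over long contexts), and the dense 122B model is a hypothetical comparison point; the parameter counts are the ones from the table above.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params


total_params = 122e9   # total parameters, from the table above
active_params = 10e9   # active parameters per token, from the table above

moe_flops = forward_flops_per_token(active_params)
dense_flops = forward_flops_per_token(total_params)  # hypothetical dense 122B model

print(f"MoE:   {moe_flops:.2e} FLOPs/token")
print(f"Dense: {dense_flops:.2e} FLOPs/token")
print(f"Ratio: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions the MoE model needs about 12x fewer FLOPs per token than a dense model of the same total size.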
MoE vs Dense Models
| Feature | Dense Model | MoE Model |
|---|---|---|
| Parameters used per token | 100% | ~5-28% for the models listed below |
| Example model | Llama 3 70B | Qwen 3.5-122B-A10B |
| Inference compute | Proportional to total params | Proportional to active params |
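| VRAM required | Lower for a given total size per unit of compute (every loaded parameter is used) | Higher relative to per-token compute (all experts must be loaded, even inactive ones) |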
| Training efficiency | Simpler to train | Harder (load balancing, expert collapse) |
| Knowledge capacity | Tied to per-token compute (more params means more FLOPs) | Decoupled from per-token compute (more capacity per FLOP) |
| Best for | Predictable latency, smaller scale | Maximum capacity at constrained compute budget |
Popular MoE Models
| Model | Total Params | Active Params | Experts |
|---|---|---|---|
| Qwen 3.5-122B-A10B | 122B | 10B | 128 |
| DeepSeek-V3 | 671B | 37B | 256 |
| Mixtral 8x7B | 47B | 13B | 8 |
| Mixtral 8x22B | 141B | 39B | 8 |
| DBRX | 132B | 36B | 16 |
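The share of parameters each model activates per token follows directly from the two parameter columns. The short sketch below computes it using only the figures in the table above; the helper name `active_fraction` is illustrative.

```python
# (total, active) parameter counts in billions, copied from the table above.
models = {
    "Qwen 3.5-122B-A10B": (122, 10),
    "DeepSeek-V3": (671, 37),
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DBRX": (132, 36),
}


def active_fraction(total_b, active_b):
    """Fraction of all parameters that are used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    print(f"{name:<20} {active_fraction(total_b, active_b):6.1%}")
```

The fractions range from about 5.5% (DeepSeek-V3) to about 27.7% (the Mixtral models), so sparsity varies considerably between MoE designs.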
Common Pitfalls
| Issue | Symptom | Fix |
|---|---|---|
| Expert collapse | Router sends all tokens to 1-2 experts | Add auxiliary load-balancing loss |
| Token dropping | Tokens assigned to over-capacity experts get dropped | Raise the capacity factor or use a dropless MoE implementation |
| VRAM overhead | All experts loaded even if unused | Use expert offloading or quantization |
| Training instability | Large router logits destabilize training | Use router z-loss regularization |
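The token-dropping row is easier to see with numbers. The sketch below assumes a common capacity convention, `ceil(capacity_factor * num_tokens * top_k / num_experts)`, and a made-up skewed top-1 assignment; both the helper names and the assignment list are hypothetical, and real frameworks differ in how they handle overflow.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Per-expert token budget under a common capacity-factor convention."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def count_dropped(assignments, capacity):
    """Count token assignments that exceed their expert's capacity."""
    load = Counter(assignments)
    return sum(max(0, n - capacity) for n in load.values())


# Hypothetical top-1 routing of 8 tokens over 4 experts, skewed toward expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]

for factor in (1.0, 1.5, 2.5):
    cap = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=factor)
    print(f"capacity_factor={factor}: capacity={cap}, "
          f"dropped={count_dropped(assignments, cap)}")
```

In this example a capacity factor of 1.0 drops three of the five tokens sent to expert 0, while 2.5 drops none at the cost of reserving more buffer space per expert.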
Key insight: MoE trades memory for compute efficiency. You need enough VRAM to hold all experts but only pay FLOPs for the active ones, which is why MoE models suit serving setups where memory is available and per-token compute is the bottleneck.
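The memory side of this trade-off can be estimated from parameter count and numeric precision. The sketch below counts weight memory only (it ignores KV cache, activations, and framework overhead), uses 1 GB = 1e9 bytes, and takes the 122B/10B figures from the Qwen example; the helper name is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = 122e9   # all experts must be resident in memory
active_params = 10e9   # parameters actually used for one token

for label, nbytes in (("bf16", 2), ("int8", 1), ("4-bit", 0.5)):
    resident = weight_memory_gb(total_params, nbytes)
    touched = weight_memory_gb(active_params, nbytes)
    print(f"{label:>5}: {resident:.0f} GB resident, {touched:.0f} GB used per token")
```

At bf16 this comes to about 244 GB of resident weights versus about 20 GB of weights used per token, which is the gap that quantization and expert offloading try to narrow.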
Learn more in the Qwen Technical Report and the Mixtral paper.