What is Mixture of Experts in AI Models? Example - Qwen 3.5-122B-A10B

#gen-ai #mixture-of-experts #moe #architecture #qwen #sparse-routing #llm

Answer

Mixture of Experts (MoE) in AI Models

Mixture of Experts (MoE) is a neural network architecture where instead of every token passing through a single dense feed-forward network, multiple expert subnetworks exist, and each token is sparsely routed to only a top-k subset of them. This lets models have massive total parameter counts while keeping inference compute proportional to a much smaller number of active parameters.

How MoE Works

An MoE layer replaces the standard FFN with two elements — a router/gate and an expert pool:

```python
# Conceptual MoE routing mechanism (written for readability, not speed)
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    def __init__(self, d_model=4096, num_experts=64, top_k=8):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Router: produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)

        # Expert pool: independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.SiLU(),
                nn.Linear(d_model * 4, d_model),
            ) for _ in range(num_experts)
        ])

    def forward(self, x):
        batch, seq, dim = x.shape
        # Flatten to (tokens, dim); reshape also handles non-contiguous input.
        x_flat = x.reshape(-1, dim)

        router_logits = self.router(x_flat)  # (tokens, num_experts)

        # Keep the top-k experts per token and normalize their weights.
        topk_logits, topk_indices = torch.topk(
            router_logits, self.top_k, dim=-1
        )
        topk_weights = torch.softmax(topk_logits, dim=-1)

        output = torch.zeros_like(x_flat)
        for e, expert in enumerate(self.experts):
            # Tokens that selected expert e, and the top-k slot it occupies.
            token_idx, slot_idx = torch.where(topk_indices == e)
            if token_idx.numel() == 0:
                continue  # no token was routed to this expert
            weight = topk_weights[token_idx, slot_idx].unsqueeze(-1)
            output[token_idx] += weight * expert(x_flat[token_idx])

        return output.view(batch, seq, dim)
```
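
To make the routing step concrete, here is a minimal sketch of what the top-k selection and softmax do for a single token, using plain Python lists instead of tensors. The function name `route_token` and the example logits are made up for illustration.

```python
import math


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and normalize their weights."""
    # Rank expert indices by logit, highest first, and keep the best top_k.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the selected logits only, mirroring the layer above.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    return chosen, weights


# Hypothetical router output for one token over six experts.
logits = [0.1, 2.0, -1.0, 1.0, 0.5, -0.3]
experts, weights = route_token(logits, top_k=2)
print(experts, weights)
```

Only the chosen experts run for this token, and their outputs are mixed using weights that sum to 1.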

Key Concepts

| Concept | Description |
| --- | --- |
| Sparse Activation | Only k experts are activated per token (typically 2-8 out of 8-256 total) |
| Load Balancing | An auxiliary loss encourages even use of all experts, which helps prevent collapse |
| Expert Capacity | Maximum number of tokens an expert can process per batch; tokens beyond capacity are dropped or rerouted |
| Total vs Active Params | Total = all experts combined; Active = k experts + shared layers |
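
The load-balancing and capacity rows can be sketched in a few lines. The answer does not say which auxiliary loss is meant, so this assumes the Switch Transformer style formulation, N * sum_i(f_i * P_i), and a commonly used capacity formula. The function names and the 1.25 capacity factor are illustrative assumptions.

```python
import math


def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i (assumed Switch-style form).

    router_probs: per-token softmax probabilities over all experts.
    assignments:  per-token list of expert indices the token was sent to.
    """
    num_tokens = len(router_probs)
    slots = sum(len(chosen) for chosen in assignments)
    # f_i: fraction of routing slots dispatched to expert i.
    f = [0.0] * num_experts
    for chosen in assignments:
        for e in chosen:
            f[e] += 1.0 / slots
    # P_i: mean router probability given to expert i.
    p = [sum(tok[e] for tok in router_probs) / num_tokens
         for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor=1.25):
    """Max tokens one expert accepts per batch (capacity_factor is assumed)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


# Perfectly balanced routing vs. a router that has collapsed onto expert 0.
balanced = load_balancing_loss(
    router_probs=[[0.25] * 4] * 4,
    assignments=[[0], [1], [2], [3]],
    num_experts=4,
)
collapsed = load_balancing_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[[0], [0], [0], [0]],
    num_experts=4,
)
print(balanced, collapsed)
print(expert_capacity(num_tokens=1024, num_experts=64, top_k=8))
```

The loss is lowest when routing is uniform and grows as tokens concentrate on a few experts, which is the pressure that discourages collapse.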

Qwen 3.5-122B-A10B Breakdown

| Metric | Value |
| --- | --- |
| Total parameters | 122B |
| Active parameters | 10B (per token) |
| Total-to-active ratio | ~12:1 |
| Expert count | 128 experts (64 shared + 64 routed) |
| Top-k routing | k=8 |
| Context window | 128K tokens |
| Architecture | Dense attention + MoE FFN layers |

In other words, the model stores 122B parameters' worth of weights, but each forward pass spends compute roughly proportional to a ~10B dense model, because only the routed experts and the shared layers run for a given token.

MoE vs Dense Models

| Feature | Dense Model | MoE Model |
| --- | --- | --- |
| Parameters used per token | 100% | Roughly 5-30%, depending on the model |
| Example model | Llama 3 70B | Qwen 3.5-122B-A10B |
| Inference compute | Proportional to total params | Proportional to active params |
| VRAM required | Lower at equal per-token compute | Higher at equal per-token compute (all experts must be loaded) |
| Training difficulty | Simpler to train | Harder (load balancing, expert collapse) |
| Knowledge capacity | Grows with parameters, and so does compute | Grows with total parameters while compute tracks active ones |
| Best for | Predictable latency, smaller scale | Maximum capacity at a constrained compute budget |

Popular MoE Models

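| Model | Total Params | Active Params | Experts |
| --- | --- | --- | --- |
| Qwen 3.5-122B-A10B | 122B | 10B | 128 |
| DeepSeek-V3 | 671B | 37B | 256 |
| Mixtral 8x7B | 47B | 13B | 8 |
| Mixtral 8x22B | 141B | 39B | 8 |
| DBRX | 132B | 36B | 16 |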
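
The share of parameters that actually runs per token follows directly from the two parameter columns. A small sketch, with the numbers copied from the table above and a helper name (`active_fraction`) chosen for illustration:

```python
# (total params, active params) in billions, copied from the table above.
models = {
    "Qwen 3.5-122B-A10B": (122, 10),
    "DeepSeek-V3": (671, 37),
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DBRX": (132, 36),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    frac = active_fraction(total_b, active_b)
    ratio = total_b / active_b
    print(f"{name:<20} {frac:6.1%} active, {ratio:4.1f}:1 total-to-active")
```

Models with many small experts (DeepSeek-V3, Qwen) activate a much smaller share per token than models with a few large experts (Mixtral, DBRX).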

Common Pitfalls

| Issue | Symptom | Fix |
| --- | --- | --- |
| Expert collapse | Router sends almost all tokens to 1-2 experts | Add an auxiliary load-balancing loss |
| Token dropping | Tokens assigned to over-capacity experts get dropped | Increase the capacity factor, or use dropless or expert-choice routing |
| VRAM overhead | All experts are loaded even if rarely used | Use expert offloading or quantization |
| Training instability | Large router logits destabilize training | Add a router z-loss penalty |
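
For the training-instability row, a router z-loss penalizes large router logits by taking the squared log-sum-exp of each token's logits and averaging over tokens. This is a plain-Python sketch of that quantity; the function name and sample logits are illustrative, and in practice the term is scaled by a small coefficient and added to the training loss.

```python
import math


def router_z_loss(batch_logits):
    """Mean over tokens of (log sum_j exp(logit_j)) squared."""
    total = 0.0
    for logits in batch_logits:
        # Subtract the max before exponentiating for numerical stability.
        peak = max(logits)
        lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)


# Same routing preferences (uniform), but very different logit magnitudes.
small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])
print(small, large)
```

Both inputs yield identical softmax probabilities, yet the second is penalized far more, which nudges the router toward small, well-behaved logits.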

Key insight: MoE trades memory for compute efficiency. You need enough VRAM to hold all experts but only pay FLOPs for the active ones, which is why MoE models suit serving setups where compute per token, rather than memory, is the limiting resource.
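
A back-of-the-envelope sketch of that trade-off, using the 122B total and 10B active figures from this answer. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and counts weight memory only (no KV cache or activations). The helper names are illustrative.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params):
    """Rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2 * active_params


TOTAL_PARAMS = 122e9   # every expert must be resident in memory
ACTIVE_PARAMS = 10e9   # only these are used for a given token

print("16-bit weights (GB):", weight_memory_gb(TOTAL_PARAMS, 2))
print("4-bit weights (GB): ", weight_memory_gb(TOTAL_PARAMS, 0.5))
print("FLOPs per token:    ", forward_flops_per_token(ACTIVE_PARAMS))
print("Compute saved vs dense 122B:",
      forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS))
```

Memory scales with the total parameter count while per-token compute scales with the active count, so quantization or offloading attacks the memory side without changing the FLOPs picture.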

Learn more at Qwen Technical Report and Mixtral Paper.