Explain the Transformer architecture. What are attention mechanisms and why are they important?
Answer
Transformer Architecture & Attention Mechanisms
The Transformer (introduced in "Attention Is All You Need", Vaswani et al., 2017) is the architecture behind most modern LLMs. It replaced the recurrence of RNNs with attention, which processes all tokens of a sequence in parallel. That makes training faster on parallel hardware and makes long-range dependencies easier to capture, because any two tokens are connected by a single attention step.
Core Components
| Component | Role |
|---|---|
| Input Embedding | Converts tokens to dense vectors |
| Positional Encoding | Injects token order (since Transformers have no recurrence) |
| Multi-Head Self-Attention | Lets each token attend to all other tokens simultaneously |
| Feed-Forward Network | Applies non-linear transformation per token |
| Layer Norm + Residuals | Stabilises training and enables deep stacking |
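To make the Positional Encoding row concrete, here is a minimal NumPy sketch. It assumes the fixed sinusoidal scheme from the original paper; many later models use learned or rotary position embeddings instead. The function name and the toy shapes are illustrative choices, not part of any library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even feature indices
    pe[:, 1::2] = np.cos(angles)                   # odd feature indices
    return pe

# Toy stand-in for the output of an embedding layer (random values, assumed shapes).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(6, 8))         # 6 tokens, d_model = 8

pe = sinusoidal_positional_encoding(6, 8)
x = token_embeddings + pe                          # order information is added, not concatenated
```

Because the encoding is added to the embeddings, two identical tokens at different positions enter the attention layers as different vectors.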
How Self-Attention Works
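For each token, the model computes three vectors, Query (Q), Key (K) and Value (V), via learned linear projections. Attention is then computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here $d_k$ is the dimension of the key vectors. Scaling by $\sqrt{d_k}$ prevents dot products from growing too large in high dimensions; without it the softmax saturates and gradients become very small, which destabilises training.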
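The computation softmax(QK^T / sqrt(d_k)) V can be sketched in a few lines of NumPy. This is a minimal single-head illustration under assumed toy shapes, with random matrices standing in for learned projection weights; it is not how any particular library implements attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ v, weights          # output is a weighted average of value vectors

# Toy input: 4 tokens with d_model = 8; random matrices stand in for learned projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.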
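Multi-Head Attention runs several attention heads in parallel, each with its own learned Q, K and V projections, so that different heads can specialise in different relationship types (e.g. syntactic, semantic, coreference). The head outputs are then concatenated and passed through a final linear projection.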
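The split, attend, concatenate and project steps can be sketched as follows. This assumes the common layout in which `d_model` is divided evenly across heads; the weight matrices are random placeholders for learned parameters, and the function name is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ v                  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                   # final output projection

# Toy input: 5 tokens, d_model = 16, 4 heads of size 4 each.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v, w_o = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=4)
```

Each head works in a smaller subspace (`d_head = d_model / num_heads`), so the total cost is similar to that of one full-width head.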
Why Attention Matters
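- Parallelism: all tokens in a sequence are processed simultaneously during training (unlike RNNs, which step through tokens one at a time)
- Long-range dependencies: any token can attend directly to any other, however far apart; e.g. "bank" in "river bank" vs "bank account" is resolved by context
- Interpretability: attention weights give a partial view of what the model focuses on, though they are not a complete explanation of its behaviour
- Scalability: performance tends to improve steadily as models grow (more layers, heads and parameters), given enough data and compute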
Encoder vs Decoder Stacks
| Stack | Type | Example Models | Use Case |
|---|---|---|---|
| Encoder-only | Bidirectional attention | BERT, RoBERTa | Classification, embeddings |
| Decoder-only | Causal (masked) attention | GPT-4, LLaMA | Text generation |
| Encoder-Decoder | Bidirectional encoder; causal decoder with cross-attention to the encoder output | T5, BART | Translation, summarisation |
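The difference between bidirectional and causal attention comes down to a mask applied to the scores before the softmax. The sketch below uses random scores on a toy 4-token sequence; the helper name `attention_weights` is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """scores: (seq_len, seq_len) raw attention scores."""
    if causal:
        seq_len = scores.shape[-1]
        # Lower-triangular mask: token i may only attend to positions <= i.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(mask, scores, -np.inf)   # masked positions get zero weight
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

bidirectional = attention_weights(scores, causal=False)  # encoder-style: every token sees all tokens
causal = attention_weights(scores, causal=True)          # decoder-style: no attention to future tokens
```

In the causal case the first token can only attend to itself, which is what lets a decoder be trained on next-token prediction without seeing the answer.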
Key insight: The Transformer's success comes from replacing sequential computation (RNN) with parallel attention, allowing models to scale to billions of parameters efficiently.