Explain the Transformer architecture. What are attention mechanisms and why are they important?

Question

Accepted Answer

## Transformer Architecture & Attention Mechanisms

The Transformer (introduced in *"Attention Is All You Need"*, 2017) is the foundational architecture behind nearly all modern LLMs. It replaced the recurrence of RNNs with **attention**, which processes all tokens of a sequence in parallel, making training faster and long-range dependencies easier to capture.

### Core Components

| Component | Role |
|-----------|------|
| **Input Embedding** | Converts tokens to dense vectors |
| **Positional Encoding** | Injects token order (since Transformers have no recurrence) |
| **Multi-Head Self-Attention** | Lets each token attend to every token in the sequence simultaneously |
| **Feed-Forward Network** | Applies the same non-linear transformation to each token independently |
| **Layer Norm + Residuals** | Stabilises training and enables deep stacking |
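
To make the positional-encoding row concrete, here is a minimal NumPy sketch of one option: the fixed sinusoidal scheme from the original paper. Many models use learned or rotary position encodings instead, so treat this as an illustration rather than what any particular LLM does. The function name and the toy sizes (4 tokens, 8 dimensions) are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the indices 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# Hypothetical toy embeddings: 4 tokens, 8 dimensions each (random stand-ins).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))

# Order information is injected by simple addition before the first layer.
model_input = token_embeddings + sinusoidal_positional_encoding(4, 8)
print(model_input.shape)  # (4, 8)
```

Because every position gets a distinct vector, two identical tokens at different positions enter the attention layers with different representations.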

### How Self-Attention Works

For each token, the model computes three vectors, a Query (Q), a Key (K), and a Value (V), via learned linear projections. The attention output is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

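Here $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension; without it, large scores push the softmax into saturated regions where gradients become very small.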
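
The formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the function names are made up for the example, and the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for weights that a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len): each query against each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights

# Toy setup (all values hypothetical): 3 tokens, model width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # token representations
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))  # random, not learned

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)   # (3, 4)
print(weights.shape)  # (3, 3)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.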

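**Multi-Head Attention** runs $h$ attention heads in parallel, each with its own learned projections, so different heads can specialise in different relationship types (e.g. syntactic, semantic, coreference). The head outputs are then concatenated and passed through a final linear projection.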
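
As a rough sketch of the split, attend, concatenate, project sequence, the NumPy code below slices the model dimension into `num_heads` equal parts. The function name, the weight names (`W_q`, `W_k`, `W_v`, `W_o`), and the toy sizes are assumptions for illustration; the weights are random rather than learned, and real implementations add batching, masking, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention over X with num_heads heads; returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ W_o                                   # final output projection

# Toy setup (all values hypothetical): 5 tokens, width 8, 2 heads.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8)
```

Each head works in a smaller `d_head`-dimensional subspace, so the total cost stays close to that of a single full-width head.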

### Why Attention Matters

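* **Parallelism**: all tokens in a sequence are processed simultaneously during training (unlike RNNs, which step through tokens one at a time)
* **Long-range dependencies**: any two tokens are connected by a single attention step regardless of distance, so context can resolve "bank" in "river bank" vs "bank account"
* **Interpretability**: attention weights give a partial view of which tokens the model draws on, though they are not a complete explanation of its behaviour
* **Scalability**: performance has tended to improve steadily as layers, heads, and parameters are added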

### Encoder vs Decoder Stacks

| Stack | Attention Type | Example Models | Typical Use Case |
|-------|----------------|----------------|------------------|
| **Encoder-only** | Bidirectional attention | BERT, RoBERTa | Classification, embeddings |
| **Decoder-only** | Causal (masked) attention | GPT-4, LLaMA | Text generation |
| **Encoder-Decoder** | Bidirectional encoder, causal decoder with cross-attention to the encoder | T5, BART | Translation, summarisation |
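
The difference between bidirectional and causal attention comes down to a mask applied to the scores before the softmax. The NumPy sketch below uses all-zero scores so the effect of the mask is easy to read; the function name and toy input are assumptions for illustration only.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over the last axis, optionally hiding future positions."""
    if causal:
        seq_len = scores.shape[-1]
        # Lower-triangular mask: token i may attend to tokens 0..i only.
        allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) is 0, so masked positions get zero weight
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))  # hypothetical: every token equally similar to every other

bidirectional = attention_weights(scores)           # encoder-style
causal = attention_weights(scores, causal=True)     # decoder-style
print(bidirectional)
print(causal)
```

With these uniform scores, every bidirectional row spreads its weight evenly over all three tokens, while the causal rows spread it only over the current and earlier tokens: the first token attends entirely to itself, the second splits its weight between the first two, and the third covers all three.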

> **Key insight:** The Transformer's success comes from replacing sequential computation (RNN) with parallel attention, allowing models to scale to billions of parameters efficiently.
