Concept #1 · Hard · gen-ai-fundamentals

Explain the Transformer architecture. What are attention mechanisms and why are they important?

#gen-ai #transformers #attention

Answer

Transformer Architecture & Attention Mechanisms

The Transformer (introduced in "Attention Is All You Need", 2017) is the foundational architecture behind nearly all modern LLMs. It replaced the token-by-token recurrence of RNNs with attention over all tokens in parallel, which makes training faster on parallel hardware and makes long-range dependencies easier to capture.

Core Components

| Component | Role |
|---|---|
| Input Embedding | Converts tokens to dense vectors |
| Positional Encoding | Injects token order (since Transformers have no recurrence) |
| Multi-Head Self-Attention | Lets each token attend to all other tokens simultaneously |
| Feed-Forward Network | Applies non-linear transformation per token |
| Layer Norm + Residuals | Stabilises training and enables deep stacking |
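
To make the Positional Encoding row concrete, here is a minimal NumPy sketch of one common choice, the fixed sinusoidal encoding from the original paper. Many models use learned or rotary position schemes instead, so treat this as an illustration only. The function name, the toy sizes, and the random `token_embeddings` are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position codes."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    angles = positions * freqs                             # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return pe

# Hypothetical toy input: 4 tokens with model width 8.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))

# Position information is simply added to the token embeddings.
x = token_embeddings + sinusoidal_positional_encoding(4, 8)
```

Because the encoding is added rather than concatenated, the model width stays the same and every later layer sees content and position mixed in one vector.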

How Self-Attention Works

For each token, the model computes three vectors, Query (Q), Key (K), and Value (V), via learned linear projections of the token's embedding. The attention output is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

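Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large in high dimensions: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Without the scaling, large scores would saturate the softmax and leave very small gradients.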
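
The formula can be sketched in a few lines of NumPy. This is a minimal single-head version without batching or masking; the projection matrices `W_q`, `W_k`, `W_v` and the toy sizes are random placeholders assumed for illustration, not trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query to every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over keys
    return weights @ V, weights           # weighted average of the value vectors

# Hypothetical toy input: 3 tokens, embedding width 8, head width 4.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Each output row is a weighted average of the value vectors, with the weights determined by how well that token's query matches every key.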

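Multi-Head Attention runs $h$ attention heads in parallel, each with its own learned Q, K, V projections, so different heads can specialise in different relationship types (e.g. syntactic, semantic, coreference). The head outputs are then concatenated and passed through a final linear projection.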
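
As a rough sketch of the split, attend, concatenate, project flow, the NumPy code below runs all heads at once by reshaping the projected matrices. It assumes the common convention that each head has width `d_model / num_heads`; the function name, weight matrices, and sizes are illustrative placeholders rather than any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently for each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                     # (h, n, d_head)

    # Concatenate the heads back to (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

# Hypothetical toy input: 5 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Splitting one wide projection into heads keeps the total cost close to that of a single full-width head, while letting each head form its own attention pattern.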

Why Attention Matters

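  • Parallelism: all tokens in a sequence are processed simultaneously during training (unlike RNNs, which step through tokens one at a time)
  • Long-range dependencies: any token can attend directly to any other, however far apart, so context resolves ambiguity ("bank" in "river bank" vs "bank account")
  • Interpretability: attention weights give a rough view of which tokens the model weighs, though they are not a complete explanation of its behaviour
  • Scalability: performance tends to improve smoothly as parameters, data, and compute grow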

Encoder vs Decoder Stacks

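| Stack | Attention Type | Example Models | Use Case |
|---|---|---|---|
| Encoder-only | Bidirectional attention | BERT, RoBERTa | Classification, embeddings |
| Decoder-only | Causal (masked) attention | GPT-4, LLaMA | Text generation |
| Encoder-Decoder | Both (bidirectional encoder, causal decoder) | T5, BART | Translation, summarisation |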
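
The difference between bidirectional and causal attention comes down to a mask applied to the scores before the softmax. The sketch below is a minimal NumPy illustration; `attention_weights` and the toy input are hypothetical names and data introduced for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K, causal=False):
    """Return the (n, n) attention weights, optionally with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        n = scores.shape[0]
        # True strictly above the diagonal, i.e. at positions in the future.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        # A score of -inf becomes a weight of exactly 0 after the softmax.
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

# Hypothetical toy input: 4 tokens, width 4, used as both queries and keys.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))

bidirectional = attention_weights(X, X)           # encoder-style: every token sees all tokens
causal = attention_weights(X, X, causal=True)     # decoder-style: no access to later tokens

print(np.round(bidirectional, 2))
print(np.round(causal, 2))  # upper triangle is zero
```

With the causal mask, the first token can only attend to itself, which is what allows a decoder to be trained on whole sequences in parallel while still predicting each token from earlier ones only.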

Key insight: The Transformer's success comes from replacing sequential computation (RNN) with parallel attention, allowing models to scale to billions of parameters efficiently.