What's the difference between encoder-only, decoder-only, and encoder-decoder models?

Question

Accepted Answer

## Encoder-only vs Decoder-only vs Encoder-Decoder

The Transformer architecture comes in three flavours, each optimised for different tasks.

### Encoder-only Models

**How it works:** All tokens attend to all other tokens bidirectionally (full self-attention). The model produces contextualised embeddings for each input token.

**Best for:** Tasks that require *understanding* the full input — classification, named entity recognition, semantic search, embeddings.

**Examples:** BERT, RoBERTa, DistilBERT, and the embedding models published through the `sentence-transformers` library (such as `all-MiniLM-L6-v2`)

```python
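import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tokenizer("What is RAG?", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# This model was trained with mean pooling, so average the token embeddings
# (ignoring padding via the attention mask) instead of taking the [CLS] vector.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)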
```
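
To see what "bidirectional" means mechanically, here is a minimal pure-Python sketch of unmasked self-attention. It is a toy, not how BERT is implemented: queries, keys, and values are all the raw token vectors (no learned projections, a single head), and the 2-dimensional vectors are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Unmasked self-attention with queries = keys = values (toy assumption)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score the query against every position, to the left and to the right
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Contextualised vector: weighted average of all token vectors
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs

# Hypothetical token vectors for a 3-token input
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last token differs

first_before = self_attention(tokens)[0]
first_after = self_attention(edited)[0]

# The first token's output depends on a token that comes after it
print(first_before != first_after)
```

Because nothing is masked, editing the last token changes the representation of the first one, which is why encoder-only models suit tasks where the whole input is available up front.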

### Decoder-only Models

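**How it works:** Each token can attend only to itself and the tokens before it (causal, i.e. masked, self-attention). The model generates output one token at a time, auto-regressively: each new token is appended to the input before the next one is predicted.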

**Best for:** Text generation, chatbots, code completion, instruction following.

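**Examples:** the GPT family, LLaMA, Mistral. Closed models such as GPT-4, Gemini, and Claude are generally reported to be decoder-only as well, although not all of their architecture details are public.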

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

inputs = tokenizer("Explain RAG in one sentence:", return_tensors="pt")
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
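
The causal mask and the auto-regressive loop can also be sketched without any library. This is a toy under stated assumptions: attention uses the raw token vectors as queries, keys, and values, and the "model" in `generate` is a hypothetical bigram lookup table (`BIGRAMS`) standing in for a real next-token predictor.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(vectors):
    """Self-attention where position i only sees positions 0..i."""
    dim = len(vectors[0])
    outputs = []
    for pos, query in enumerate(vectors):
        visible = vectors[: pos + 1]  # causal mask: later positions are hidden
        scores = [dot(query, key) / math.sqrt(dim) for key in visible]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last token differs

# Earlier positions are unaffected by a change to a later token
print(causal_self_attention(tokens)[:2] == causal_self_attention(edited)[:2])

# Hypothetical stand-in for a trained model: maps the last token to the next one
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def generate(prompt, max_new_tokens=10):
    """Greedy auto-regressive loop: predict, append, repeat until <eos>."""
    tokens_out = list(prompt)
    for _ in range(max_new_tokens):
        next_token = BIGRAMS.get(tokens_out[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens_out.append(next_token)
    return tokens_out

print(generate(["the"]))
```

A real decoder conditions on the whole prefix rather than just the last token, but the loop structure (predict, append, feed back in) is the same idea.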

### Encoder-Decoder Models

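**How it works:** The encoder processes the full input bidirectionally. The decoder then generates the output auto-regressively, combining causal self-attention over the tokens generated so far with cross-attention to the encoder's output.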

**Best for:** Sequence-to-sequence tasks — translation, summarisation, question answering from a passage.

**Examples:** T5, BART, mT5
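
Cross-attention is the piece that distinguishes this design, so here is a rough pure-Python sketch of a single cross-attention step. The encoder states and decoder state are made-up 2-dimensional vectors, and learned projection matrices are left out; real models such as T5 or BART apply this per layer and per head.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    dim = len(query)
    weights = softmax([dot(query, key) / math.sqrt(dim) for key in keys])
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return context, weights

# Hypothetical encoder outputs for a 3-token source sentence
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical decoder hidden state for the target token being generated
decoder_state = [0.0, 2.0]

# Cross-attention: the query comes from the decoder,
# the keys and values come from the encoder
context, weights = attend(decoder_state, encoder_states, encoder_states)

print(weights)   # one weight per source token, summing to 1
print(context)   # source information mixed into the decoder step
```

The decoder looks at the whole source sequence at every step, while its own self-attention stays causal.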

### Comparison Table

| Feature | Encoder-only | Decoder-only | Encoder-Decoder |
|---------|-------------|--------------|-----------------|
| **Attention type** | Bidirectional | Causal (left-to-right) | Bidirectional encoder; causal decoder with cross-attention |
| **Primary task** | Understanding | Generation | Seq2Seq |
| **Pretraining objective** | Masked LM (MLM) | Next-token prediction | Denoising / span corruption (seq2seq) |
| **Output** | Token embeddings | Generated text | Generated text |
| **Examples** | BERT, RoBERTa | GPT-4, LLaMA, Claude | T5, BART |
| **Best use case** | Search, classification | Chatbots, code gen | Translation, summarisation |
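
The pretraining objectives in the table differ mainly in how training examples are built from raw text. The sketch below illustrates that on a tokenised sentence. The helper names are hypothetical, the masked positions are fixed rather than random to keep the output deterministic, and the span-corruption format follows the T5 convention of `<extra_id_N>` sentinel tokens.

```python
def mlm_example(tokens, masked_positions):
    """Masked LM: hide some tokens, predict them from both sides."""
    inputs = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

def causal_lm_examples(tokens):
    """Next-token prediction: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def span_corruption_example(tokens, start, end):
    """T5-style denoising: replace a span with a sentinel, decode the span."""
    source = tokens[:start] + ["<extra_id_0>"] + tokens[end:]
    target = ["<extra_id_0>"] + tokens[start:end] + ["<extra_id_1>"]
    return source, target

sentence = ["the", "cat", "sat", "on", "the", "mat"]

print(mlm_example(sentence, {2}))
print(causal_lm_examples(sentence)[0])
print(span_corruption_example(sentence, 1, 3))
```

BART uses a related but broader set of denoising corruptions, so treat the span-corruption helper as one example of an encoder-decoder objective rather than the only one.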

> - A single decoder-only model can cover both comprehension and generation
> - Encoder-decoder is overkill for pure generation (decoder-only is simpler and scales better)
> - For modern LLM applications, decoder-only (e.g. LLaMA, GPT) is the dominant architecture
