What's the difference between encoder-only, decoder-only, and encoder-decoder models?
Answer
Encoder-only vs Decoder-only vs Encoder-Decoder
The Transformer architecture comes in three flavours, each optimised for different tasks.
Encoder-only Models
How it works: All tokens attend to all other tokens bidirectionally (full self-attention). The model produces contextualised embeddings for each input token.
Best for: Tasks that require understanding the full input — classification, named entity recognition, semantic search, embeddings.
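Examples: BERT, RoBERTa, DistilBERT, sentence-transformers models

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("What is RAG?", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# sentence-transformers checkpoints such as this one are typically used with
# mean pooling, so average the token embeddings (weighted by the attention
# mask) instead of taking only the [CLS] position.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```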
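For semantic search, embeddings like the one computed above are usually compared with cosine similarity. The following is a minimal sketch that uses toy 3-dimensional vectors in place of real model outputs; the function name `cosine_similarity`, the document names, and the numbers are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (hypothetical values).
query = [0.9, 0.1, 0.0]
documents = {
    "doc_rag": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings the vectors have hundreds of dimensions, but the ranking step stays the same.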
Decoder-only Models
How it works: Each token can only attend to previous tokens (causal/masked self-attention). Generates one token at a time, auto-regressively.
Best for: Text generation, chatbots, code completion, instruction following.
Examples: GPT-4, LLaMA, Mistral, Gemini, Claude
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Raw prompt for brevity; instruct checkpoints usually work best when the
# prompt is wrapped with tokenizer.apply_chat_template(...).
inputs = tokenizer("Explain RAG in one sentence:", return_tensors="pt")

# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
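The core difference between bidirectional and causal self-attention comes down to which positions each token is allowed to see. The sketch below builds both masks in plain Python for a short sequence; `attention_mask` and `show` are hypothetical helper names, and real implementations apply such masks to attention scores inside the model rather than building them as nested lists.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] == True when position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def show(mask):
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(
        " ".join("1" if allowed else "0" for allowed in row) for row in mask
    )

print("Bidirectional (encoder-only):")
print(show(attention_mask(4, causal=False)))
print("Causal (decoder-only):")
print(show(attention_mask(4, causal=True)))
```

In the causal mask, row i has ones only up to column i, so a token never sees positions to its right. This is what allows generation one token at a time.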
Encoder-Decoder Models
How it works: The encoder processes the full input bidirectionally. The decoder generates output auto-regressively, using causal self-attention over its own tokens plus cross-attention to the encoder's output.
Best for: Sequence-to-sequence tasks — translation, summarisation, question answering from a passage.
Examples: T5, BART, mT5
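Cross-attention is what connects the two halves: each decoder position forms a weighted average over the encoder's outputs. The following is a simplified single-head sketch in plain Python. It assumes the encoder states serve as both keys and values and omits the learned projection matrices a real model would apply; `cross_attention`, `softmax`, and the toy numbers are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_query, encoder_states):
    """Attend from one decoder position over all encoder positions.

    Simplification: encoder states serve as both keys and values, and no
    learned projection matrices are applied.
    """
    dim = len(decoder_query)
    # Scaled dot-product score between the query and each encoder state.
    scores = [
        sum(q * k for q, k in zip(decoder_query, state)) / math.sqrt(dim)
        for state in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy encoder output for a 3-token input (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_query = [0.0, 2.0]

weights, context = cross_attention(decoder_query, encoder_states)
print(weights)
print(context)
```

The weights sum to 1 and are largest for the encoder state most aligned with the query. No causal mask is needed here, since the decoder may look at every encoder position.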
Comparison Table
| Feature | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Attention type | Bidirectional | Causal (left-to-right) | Bidirectional encoder, causal decoder with cross-attention |
| Primary task | Understanding | Generation | Seq2Seq |
| Pretraining objective | Masked LM (MLM) | Next-token prediction | Denoising, e.g. span corruption (T5) or text infilling (BART) |
| Output | Token embeddings | Generated text | Generated text |
| Examples | BERT, RoBERTa | GPT-4, LLaMA, Claude | T5, BART |
| Best use case | Search, classification | Chatbots, code gen | Translation, summarisation |
- Encoder-decoder covers both comprehension and generation in one model
- Encoder-decoder is overkill for pure generation, where decoder-only is simpler and has generally scaled better in practice
- For modern LLM applications, decoder-only (e.g. LLaMA, GPT) is the dominant architecture
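
In Hugging Face transformers, the architecture family typically determines which Auto class loads a checkpoint. The lookup below is a small illustrative sketch: the dictionary and the helper `auto_class_for` are hypothetical names. `AutoModel` and `AutoModelForCausalLM` are the classes used in the examples above, and `AutoModelForSeq2SeqLM` is the corresponding transformers class for encoder-decoder checkpoints such as T5 or BART.

```python
# Hypothetical lookup from architecture family to a transformers Auto class name.
AUTO_CLASS_BY_FAMILY = {
    "encoder-only": "AutoModel",
    "decoder-only": "AutoModelForCausalLM",
    "encoder-decoder": "AutoModelForSeq2SeqLM",
}

def auto_class_for(family):
    """Return the Auto class name for a family, or raise on unknown input."""
    try:
        return AUTO_CLASS_BY_FAMILY[family.lower()]
    except KeyError:
        raise ValueError(f"Unknown architecture family: {family!r}") from None

for family in AUTO_CLASS_BY_FAMILY:
    print(f"{family}: {auto_class_for(family)}")
```

Task-specific heads (for example sequence classification on an encoder-only model) have their own Auto classes. The mapping here covers only the base case for each family.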