Concept #4 · Medium · gen-ai-fundamentals

What's the difference between encoder-only, decoder-only, and encoder-decoder models?

#gen-ai #transformers #llm

Answer

Encoder-only vs Decoder-only vs Encoder-Decoder

The Transformer architecture comes in three flavours, each optimised for different tasks.

Encoder-only Models

How it works: All tokens attend to all other tokens bidirectionally (full self-attention). The model produces contextualised embeddings for each input token.

Best for: Tasks that require understanding the full input — classification, named entity recognition, semantic search, embeddings.

Examples: BERT, RoBERTa, DistilBERT, sentence-transformers

python
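import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tokenizer("What is RAG?", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# This checkpoint is trained with mean pooling, so average the token embeddings
# (weighted by the attention mask) rather than taking the [CLS] vector.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)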
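For semantic search, embeddings like the one above are typically compared with cosine similarity. The following is a minimal sketch using only the standard library. The three-dimensional vectors and the names `doc_a` and `doc_b` are made up for illustration and stand in for real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [1.0, 0.1, 0.9],
    "doc_b": [0.0, 1.0, 0.0],
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```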

Decoder-only Models

How it works: Each token can only attend to previous tokens (causal/masked self-attention). Generates one token at a time, auto-regressively.
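The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. The sketch below builds both masks as plain nested lists. The helper name `attention_mask` is hypothetical; real frameworks usually represent the mask as a tensor applied to the attention scores.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # encoder-style: every token sees every token
causal = attention_mask(4, causal=True)          # decoder-style: token i sees tokens 0..i

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```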

Best for: Text generation, chatbots, code completion, instruction following.

Examples: GPT-4, LLaMA, Mistral, Gemini, Claude

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

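inputs = tokenizer("Explain RAG in one sentence:", return_tensors="pt")
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
# output[0] holds the prompt tokens followed by the newly generated ones
print(tokenizer.decode(output[0], skip_special_tokens=True))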
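Conceptually, generation is a loop: score the next token given everything so far, pick one, append it, and repeat. The sketch below illustrates that loop with greedy selection (always take the highest score) rather than temperature sampling. `TOY_SCORES` and `next_token_scores` are hypothetical stand-ins for a real decoder forward pass, and unlike a real model this toy only looks at the last token.

```python
# Hypothetical toy "model": maps the last token to scores over candidate next tokens.
TOY_SCORES = {
    "<s>": {"RAG": 0.9, "is": 0.1},
    "RAG": {"is": 0.8, "</s>": 0.2},
    "is": {"retrieval": 0.7, "</s>": 0.3},
    "retrieval": {"</s>": 1.0},
}

def next_token_scores(tokens):
    """Stand-in for a decoder forward pass; a real model conditions on all previous tokens."""
    return TOY_SCORES[tokens[-1]]

def greedy_generate(prompt, max_new_tokens=10, eos="</s>"):
    """Append the highest-scoring token until EOS or the token budget is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens

print(greedy_generate(["<s>"]))  # ['<s>', 'RAG', 'is', 'retrieval']
```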

Encoder-Decoder Models

How it works: Encoder processes the full input bidirectionally. Decoder generates output with cross-attention to the encoder's output.

Best for: Sequence-to-sequence tasks — translation, summarisation, question answering from a passage.

Examples: T5, BART, mT5
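Cross-attention is what links the two halves: queries come from the decoder, while keys and values come from the encoder's output. The following is a simplified single-head sketch in pure Python. It omits the learned projection matrices and multi-head splitting that real models use, and the function names and toy vectors are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Scaled dot-product attention without learned projections (a simplification).

    Each decoder state acts as a query; encoder states serve as both keys and values.
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in encoder_states
        ]
        weights = softmax(scores)
        # Weighted average of the encoder states, one output vector per decoder position.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, encoder_states))
            for i in range(d)
        ])
    return outputs

encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens, dimension 2
decoder_in = [[1.0, 0.0]]                           # 1 target position generated so far
context = cross_attention(decoder_in, encoder_out)
print(len(context), len(context[0]))  # 1 2
```

The output has one vector per decoder position, regardless of how many source tokens the encoder processed.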

Comparison Table

| Feature | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Attention type | Bidirectional | Causal (left-to-right) | Bidirectional encoder; causal decoder with cross-attention |
| Primary task | Understanding | Generation | Seq2Seq |
| Pretraining objective | Masked LM (MLM) | Next-token prediction | Denoising (e.g. span corruption in T5) |
| Output | Token embeddings | Generated text | Generated text |
| Examples | BERT, RoBERTa | GPT-4, LLaMA, Claude | T5, BART |
| Best use case | Search, classification | Chatbots, code gen | Translation, summarisation |
  • A single decoder-only model can cover both comprehension and generation
  • Encoder-decoder is overkill for pure generation (decoder-only is simpler and scales better)
  • For modern LLM applications, decoder-only (e.g. LLaMA, GPT) is the dominant architecture