Explain these LLM concepts: Tokens, Context window, Temperature & Top-p sampling, Beam search.

Question

Accepted Answer

## Core LLM Concepts

### Tokens

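A **token** is the basic unit an LLM processes. It is often a subword chunk rather than a whole word, although short, common words usually map to a single token. Tokenisation splits text using algorithms such as **Byte-Pair Encoding (BPE)**, which builds its vocabulary by repeatedly merging the most frequent adjacent symbol pairs in a training corpus.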

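```python
import tiktoken  # OpenAI's tokenizer library (pip install tiktoken)

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
tokens = enc.encode("Transformer architecture")
print(tokens)                             # list of integer token IDs
print(len(tokens))                        # token count for the 2-word input
print([enc.decode([t]) for t in tokens])  # text piece behind each ID
```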

| Text | Tokens (illustrative split; varies by tokenizer) | Count |
|------|--------|-------|
| "hello" | ["hello"] | 1 |
| "unhappiness" | ["un", "happiness"] | 2 |
| "GPT-4o" | ["G", "PT", "-", "4", "o"] | 5 |

**Rule of thumb:** 1 token ≈ 0.75 English words (100 tokens ≈ 75 words). The ratio depends on the tokenizer and is usually less favourable for code and non-English text.
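To make the BPE idea concrete, here is a toy sketch of the merge loop. It is not how any production tokenizer is implemented; the corpus, the helper names `most_frequent_pair` and `merge_pair`, and the choice of two merge steps are all assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

for step in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {sorted(corpus)}")
```

After two merges the frequent stem `low` has become a single symbol, while the rarer suffixes are still split into characters. This is the intuition behind common words costing one token and rare words costing several.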

### Context Window

The **context window** is the maximum number of tokens the model can "see" at once — both input and output combined.

| Model | Context Window |
|-------|---------------|
| GPT-3.5-turbo | 16K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |

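Tokens outside the context window are not available to the model, so it cannot reference them. In long conversations, applications typically drop or summarise the oldest messages to stay within the limit.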
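One common way to stay inside the window is to keep only the most recent messages that fit a token budget, leaving room for the reply. The sketch below is a minimal illustration under stated assumptions: `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, and `fit_to_window` and its parameters are names invented for this example.

```python
def count_tokens(text):
    # Hypothetical stand-in: counts whitespace-separated words.
    # A real tokenizer would usually report a different number.
    return len(text.split())


def fit_to_window(messages, max_tokens, reserve_for_output):
    """Keep the most recent messages whose total token count fits the input budget."""
    budget = max_tokens - reserve_for_output  # input and output share the window
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "first message with five words",
    "second message here",
    "third and most recent message",
]
print(fit_to_window(history, max_tokens=12, reserve_for_output=4))
```

With a 12-token window and 4 tokens reserved for output, only the two newest messages fit, so the oldest one is dropped.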

### Temperature

**Temperature** controls randomness in token sampling. At each step, the model produces a logit $z_i$ for every token in the vocabulary, and a softmax turns these logits into a probability distribution. Temperature $T$ divides the logits before the softmax:

$$P_i \propto \exp(z_i / T)$$

| Temperature | Effect | Use Case |
|-------------|--------|----------|
| `0.0` | Greedy: always picks the top token (treated as the $T \to 0$ limit, since the formula divides by $T$) | Factual Q&A, code generation |
| `0.7` | Balanced creativity and coherence | General chatbots |
| `1.0` | Unmodified model distribution | Creative writing |
| `>1.0` | Flatter distribution, more randomness | Exploration/brainstorming |
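To see the formula's effect numerically, here is a minimal sketch in plain Python. The logits are made-up values for three candidate tokens, and `softmax_with_temperature` is a helper written for this example, not part of any library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return probabilities P_i proportional to exp(z_i / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle T = 0 as argmax instead")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher temperatures spread it more evenly across the candidates.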

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Factual answer: low temperature
factual = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0,
)
print(factual.choices[0].message.content)

# Creative poem: higher temperature
creative = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    temperature=0.9,
)
print(creative.choices[0].message.content)
```
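The question also names top-p (nucleus) sampling. It restricts sampling to the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalises and samples from that set. Here is a minimal sketch, assuming a made-up four-token distribution; `top_p_filter` and `sample` are hypothetical helpers written for this example.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)          # the unlikely tail token has been removed
print(sample(nucleus))  # random draw from the remaining tokens
```

Temperature reshapes the whole distribution, whereas top-p cuts off its low-probability tail. The two are commonly exposed as separate parameters, and the OpenAI client used above accepts `top_p` alongside `temperature`.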

### Beam Search

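**Beam search** keeps the top-$k$ most probable sequences at each step, expanding each, then pruning back to $k$. Unlike greedy decoding, which commits to the single most probable token at each step, it can recover sequences whose overall probability is higher even though their first token was not the top choice. It is still a heuristic and is not guaranteed to find the globally best sequence.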

| Strategy | Speed | Quality | Diversity |
|----------|-------|---------|-----------|
| Greedy decoding | Fastest | Lower | None |
| Beam search (k=5) | Slower | Higher | Low |
| Sampling + top-p | Fast | Good | High |

> **Trade-off:** Better output quality but $k\times$ more compute. Mostly used in translation and summarisation; LLMs typically use sampling instead.
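The expand-and-prune loop can be sketched on a toy problem. The `NEXT` table below is a hypothetical stand-in for a model (next-token probabilities given only the previous token), chosen so that greedy decoding and beam search disagree; `beam_search` is an illustrative helper, not a library function.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "a": {"cat": 0.9, "dog": 0.1},
}


def beam_search(start, steps, k):
    """Keep the k highest log-probability sequences at each step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in NEXT.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]  # prune back to k
    return beams


greedy = beam_search("<s>", steps=2, k=1)[0]  # k=1 is greedy decoding
beam = beam_search("<s>", steps=2, k=2)[0]
print("greedy:", greedy[0], round(math.exp(greedy[1]), 2))
print("beam:  ", beam[0], round(math.exp(beam[1]), 2))
```

Greedy decoding commits to "the" because it is the likeliest first token and ends with a sequence of probability 0.24. With $k=2$ the less likely "a" survives the first pruning step and leads to "a cat" at 0.36.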
