Explain these LLM concepts: Tokens, Context window, Temperature & Top-p sampling, Beam search.
Answer
Core LLM Concepts
Tokens
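A token is the basic unit an LLM processes. It is often a subword chunk rather than a whole word: common words usually map to a single token, while rarer words are split into several pieces. Tokenisation splits text using algorithms such as Byte-Pair Encoding (BPE).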
```python
import tiktoken  # OpenAI tokenizer library

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Transformer architecture")
print(tokens)       # list of integer token IDs (exact values depend on the encoding)
print(len(tokens))  # token count, which can exceed the word count
```
| Text | Tokens (illustrative split, varies by tokenizer) | Count |
|---|---|---|
| "hello" | ["hello"] | 1 |
| "unhappiness" | ["un", "happiness"] | 2 |
| "GPT-4o" | ["G", "PT", "-", "4", "o"] | 5 |
Rule of thumb: 1 token ≈ 0.75 English words (100 tokens ≈ 75 words)
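To make the BPE idea concrete, here is a minimal toy sketch of the training loop: start from characters and repeatedly merge the most frequent adjacent pair. The function names (`train_bpe`, `merge_pair`, `most_frequent_pair`) and the tiny corpus are made up for illustration; production tokenizers work on bytes, use large corpora, and differ in many details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated toy corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab_words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)       # learned merge rules, in order
print(vocab_words)  # words as tuples of the resulting symbols
```

After two merges on this corpus the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus leftover characters, which mirrors how common words end up as one token and rarer ones as several.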
Context Window
The context window is the maximum number of tokens the model can "see" at once — both input and output combined.
| Model | Context Window |
|---|---|
| GPT-3.5-turbo | 16K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |
Tokens outside the context window are simply not available to the model — it cannot reference them.
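One practical consequence is that applications have to trim their input to fit. The sketch below keeps the most recent messages that fit into a token budget, leaving room for the reply. It is only an illustration: `count_tokens` is a hypothetical stand-in that splits on whitespace (a real application would use the model's tokenizer), and `fit_to_context` and its parameters are invented names.

```python
def count_tokens(text):
    # Hypothetical stand-in: whitespace split. A real tokenizer gives different counts.
    return len(text.split())

def fit_to_context(messages, max_tokens, reserve_for_output):
    """Keep the newest messages whose total size fits the input budget."""
    budget = max_tokens - reserve_for_output  # input and output share the window
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = [
    "first message here",
    "second message is a bit longer",
    "third one",
]
trimmed = fit_to_context(history, max_tokens=12, reserve_for_output=4)
print(trimmed)
```

Dropping the oldest messages first is just one possible policy; summarising older turns is another common choice.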
Temperature
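Temperature controls randomness in token sampling. At each step, the model produces a probability distribution over all tokens. Temperature rescales this distribution by dividing the logits by the temperature before the softmax, so lower values sharpen it and higher values flatten it: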
| Temperature | Effect | Use Case |
|---|---|---|
| 0 | Deterministic (always picks top token) | Factual Q&A, code generation |
| Between 0 and 1 | Balanced creativity and coherence | General chatbots |
| 1.0 | Raw model distribution | Creative writing |
| > 1.0 | Flat distribution, more randomness | Exploration/brainstorming |
```python
from openai import OpenAI

client = OpenAI()

# Factual answer: low temperature
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0,
)

# Creative story: high temperature
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    temperature=0.9,
)
```
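To see what the rescaling does numerically, here is a small standalone sketch that divides logits by the temperature before applying softmax. The function name `softmax_with_temperature` and the example logits are made up for illustration. A temperature of exactly 0 is not handled by the formula (it would divide by zero); it is commonly treated as picking the top token directly, so this sketch simply rejects it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: more peaked
raw = softmax_with_temperature(logits, 1.0)    # unchanged model distribution
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter

for name, probs in [("T=0.5", sharp), ("T=1.0", raw), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The top token's probability shrinks as the temperature rises, which is why higher settings produce more varied output.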
Beam Search
Beam search keeps the top-k most probable sequences at each step, expanding each, then pruning back to k. Unlike greedy decoding (which always picks the single best next token), it can find sequences with higher overall probability, although it is still not guaranteed to find the best one.
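A minimal sketch of this expand-and-prune loop, where `k` is the beam width. The `TOY_MODEL` table is a hypothetical stand-in for a real model's next-token probabilities, chosen so that the greedy first choice leads to a worse overall sequence; `beam_search` and `next_token_probs` are illustrative names, not a library API.

```python
import math

# Hypothetical toy "model": next-token probabilities for each prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_token_probs(prefix):
    return TOY_MODEL.get(tuple(prefix), {})

def beam_search(k, steps):
    """Return up to k (sequence, log-probability) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:  # expand every kept sequence
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]  # prune back to the k best
    return beams

greedy = beam_search(k=1, steps=2)[0]  # k=1 is equivalent to greedy decoding
best = beam_search(k=2, steps=2)[0]

print(greedy[0], round(math.exp(greedy[1]), 2))
print(best[0], round(math.exp(best[1]), 2))
```

With `k=1` the search commits to "a" because it looks best at the first step, while `k=2` also keeps "b" alive and ends up with the more probable sequence ("b", "x").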
| Strategy | Speed | Quality | Diversity |
|---|---|---|---|
| Greedy decoding | Fastest | Lower | None |
| Beam search (k=5) | Slower | Higher | Low |
| Sampling + top-p | Fast | Good | High |
Trade-off: Better output quality but more compute. Mostly used in translation and summarisation — LLMs typically use sampling instead.
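The "Sampling + top-p" row refers to top-p (nucleus) sampling: keep the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalise, and sample from that set. Below is a minimal sketch under that definition; the function names and the example distribution are made up for illustration, and real implementations operate on full vocabulary tensors rather than small dictionaries.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability >= top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}  # renormalise to sum to 1

def sample_top_p(probs, top_p, rng):
    """Sample one token from the renormalised nucleus."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}  # illustrative values
nucleus = top_p_filter(probs, top_p=0.9)
token = sample_top_p(probs, top_p=0.9, rng=random.Random(0))

print(sorted(nucleus))  # the unlikely tail token "zebra" has been cut
print(token)
```

Cutting the low-probability tail keeps outputs diverse without letting very unlikely tokens derail the text, which is one reason sampling with top-p is a common default for chat-style generation.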