Concept #3 · Medium · gen-ai-fundamentals

Explain these LLM concepts: Tokens, Context window, Temperature & Top-p sampling, Beam search.

#gen-ai #llm #tokens

Answer

Core LLM Concepts

Tokens

A token is the basic unit an LLM processes. It is usually a subword chunk rather than a whole word, although short, common words are often a single token. Tokenisation splits text using algorithms such as Byte-Pair Encoding (BPE).

python
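import tiktoken  # OpenAI's tokenizer library (pip install tiktoken)

enc = tiktoken.encoding_for_model("gpt-4")   # resolves to the cl100k_base encoding
tokens = enc.encode("Transformer architecture")
print(tokens)                                # list of integer token IDs
print(len(tokens))                           # token count; need not equal the word count
print([enc.decode([t]) for t in tokens])     # the text piece behind each ID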
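
| Text | Tokens (illustrative; exact splits depend on the tokenizer) | Count |
|---|---|---|
| "hello" | ["hello"] | 1 |
| "unhappiness" | ["un", "happiness"] | 2 |
| "GPT-4o" | ["G", "PT", "-", "4", "o"] | 5 |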

Rule of thumb: 1 token ≈ 0.75 English words (100 tokens ≈ 75 words)
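
As a quick worked example of the rule of thumb, the sketch below converts a word count into a rough token estimate. The function name `estimate_tokens` and its default ratio are illustrative assumptions, not part of any library. For exact counts, use a real tokenizer.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English text: about 0.75 words per token."""
    return round(word_count / words_per_token)

# A 1,500-word English article is on the order of 2,000 tokens.
print(estimate_tokens(1500))
```

The estimate is only a planning aid: code, non-English text, and unusual names tend to use more tokens per word.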

Context Window

The context window is the maximum number of tokens the model can "see" at once — both input and output combined.

| Model | Context Window |
|---|---|
| GPT-3.5-turbo | 16K tokens |
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 1M tokens |

Tokens outside the context window are simply not available to the model — it cannot reference them.
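
One common way to live within this limit is to drop the oldest messages until the conversation fits. The sketch below is a minimal illustration under stated assumptions: `fit_to_window` and `rough_count` are hypothetical helpers, and the whitespace-based counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, List

def fit_to_window(messages: List[str], max_tokens: int,
                  count_tokens: Callable[[str], int]) -> List[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > max_tokens:        # the next older message would overflow
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

def rough_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())

history = ["first message here", "second one", "the latest question"]
print(fit_to_window(history, max_tokens=5, count_tokens=rough_count))
```

In practice the budget should also leave room for the model's reply, since input and output share the same window.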

Temperature

Temperature controls randomness in token sampling. At each step, the model produces a probability distribution over all tokens. Temperature rescales this distribution:

$$P_i \propto \exp(z_i / T)$$

where $z_i$ is the model's raw score (logit) for token $i$ and $T$ is the temperature.

| Temperature | Effect | Use Case |
|---|---|---|
| 0.0 | Deterministic (always picks top token) | Factual Q&A, code generation |
| 0.7 | Balanced creativity and coherence | General chatbots |
| 1.0 | Raw model distribution | Creative writing |
| >1.0 | Flat distribution, more randomness | Exploration/brainstorming |

python
from openai import OpenAI
client = OpenAI()

# Factual answer — low temperature
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# Creative story — high temperature
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    temperature=0.9
)
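
To make the rescaling concrete, here is a minimal pure-Python sketch of a temperature-scaled softmax, together with a top-p (nucleus) filter, which keeps the smallest set of tokens whose cumulative probability reaches p and renormalises. The function names and the example logits are assumptions for illustration; real inference stacks do this on tensors over the full vocabulary.

```python
import math
from typing import Dict, List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Return probabilities P_i proportional to exp(z_i / T)."""
    if temperature <= 0:
        # T = 0 is not defined by the formula; it is handled as greedy argmax instead.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the most probable tokens until their cumulative mass reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(x, 3) for x in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top-scoring token, and higher temperatures spread it out. Top-p then trims the unlikely tail before sampling.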

Beam Search

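Beam search keeps the top-k most probable partial sequences at each step, expands each one, then prunes back to k. Unlike greedy decoding, which always commits to the single best next token, it can recover sequences whose first token was not the most likely one, so it often finds higher-probability sequences overall. It is still a heuristic and is not guaranteed to find the global optimum.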

| Strategy | Speed | Quality | Diversity |
|---|---|---|---|
| Greedy decoding | Fastest | Lower | None |
| Beam search (k=5) | Slower | Higher | Low |
| Sampling + top-p | Fast | Good | High |

Trade-off: better output quality, but roughly k× more compute per step. Beam search is mostly used in translation and summarisation; LLMs typically use sampling instead.
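
The difference between greedy decoding and beam search can be seen on a toy example. The sketch below assumes a hypothetical lookup table `NEXT` standing in for a model's next-token probabilities. With k=1 it behaves like greedy decoding, and with k=2 it keeps a second candidate alive long enough to find a more probable sequence.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy "model": next-token probabilities given the previous token.
NEXT: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(start: str, steps: int, k: int) -> List[Tuple[List[str], float]]:
    """Return up to k sequences with their log-probabilities, best first."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Expand every surviving sequence by every possible next token.
            for token, prob in NEXT.get(seq[-1], {}).items():
                candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break
        # Prune back to the k highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

for k in (1, 2):
    seq, score = beam_search("<s>", steps=2, k=k)[0]
    print(f"k={k}: {' '.join(seq)}  p={math.exp(score):.2f}")
```

Greedy decoding commits to "a" because it is the likelier first token, while the wider beam keeps "b" and discovers that "b x" is more probable overall. The extra cost is visible in the inner loop: each step scores candidates for all k beams instead of one.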