How are text/chat prompts converted to AI-understandable format?
#gen-ai #tokens
Answer
How Text/Chat Prompts Are Converted to AI-Understandable Format
This covers the complete journey from a chat message to model output.
Full Pipeline
```text
You type: "What is AI?"
        ↓
1. Chat formatting        → role labels added
        ↓
2. Chat template          → model-specific format string
        ↓
3. Tokenization           → integer token IDs
        ↓
4. Embedding lookup       → float vectors
        ↓
5. + Positional encoding  → position-aware vectors
        ↓
6. Transformer blocks     → context-enriched representations
        ↓
7. Output layer           → logits over vocabulary
        ↓
8. Sampling               → token IDs → detokenize → "AI stands for..."
```
Steps 1-2: Chat Template Formatting
Every model has a specific format it was fine-tuned on:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]

# Apply the model-specific template
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(formatted)
```
Output for Llama 3:
```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
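To make the template less opaque, here is a minimal hand-written sketch that builds the same kind of string as the Llama 3 output above. The helper name `format_llama3` is hypothetical. The real template ships with the tokenizer's configuration, so treat this as an illustration of the structure, not a replacement for `apply_chat_template`.

```python
def format_llama3(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Llama 3 chat format (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn: role header, blank line, content, end-of-turn marker
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]
prompt = format_llama3(messages)
print(prompt)
```

In practice, prefer the tokenizer's own template: small differences in whitespace or special tokens can degrade model output.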
Different models, different formats:
| Model Family | Format |
|---|---|
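| Llama 3 | `<\|start_header_id\|>role<\|end_header_id\|>` ... `<\|eot_id\|>` |
| ChatML (GPT) | `<\|im_start\|>role` ... `<\|im_end\|>` |
| Mistral/Mixtral | `[INST] ... [/INST]` |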
| Claude API | Handled server-side by Anthropic |
| Gemini API | Handled server-side by Google |
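The same message list renders differently under each template. As a rough sketch, the hypothetical helper `format_chatml` below renders ChatML-style turns. The newline between turns and the trailing assistant header follow common ChatML usage and are assumptions here, not taken from any specific model's template.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render messages as ChatML-style turns (illustration only)."""
    turns = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    text = "\n".join(turns)
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        text += "\n<|im_start|>assistant\n"
    return text


chatml_prompt = format_chatml([{"role": "user", "content": "What is AI?"}])
print(chatml_prompt)
```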
Step 3: Tokenization
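```python
import tiktoken

base = tiktoken.get_encoding("cl100k_base")
# Plain cl100k_base does not register the ChatML markers; extend it as in the tiktoken README
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={**base._special_tokens, "<|im_start|>": 100264, "<|im_end|>": 100265},
)
text = "<|im_start|>user\nWhat is AI?<|im_end|>"
token_ids = enc.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})
# → [100264, 882, 198, 3923, 374, 15592, 30, 100265]
```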
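To show the idea of mapping text to integer IDs without any library, here is a toy sketch that uses greedy longest-match against a tiny hand-written vocabulary. This is deliberately simpler than the byte-pair encoding that tiktoken implements. `VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names, and the IDs are made up.

```python
# Hypothetical toy vocabulary; real tokenizers learn on the order of 100k entries from data
VOCAB = {"<|im_start|>": 0, "<|im_end|>": 1, "user": 2, "\n": 3,
         "What": 4, " is": 5, " AI": 6, "?": 7}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match tokenization against VOCAB."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first so special markers match as one token
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers text at position {pos}")
    return ids


def toy_decode(ids):
    """Map IDs back to their strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


text = "<|im_start|>user\nWhat is AI?<|im_end|>"
ids = toy_encode(text)
print(ids)              # → [0, 2, 3, 4, 5, 6, 7, 1]
print(toy_decode(ids) == text)  # → True
```

Note how the leading space belongs to the token (`" is"`, `" AI"`), which mirrors how many real vocabularies store word boundaries.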
Steps 4-5: Embeddings + Position
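```python
import torch
import torch.nn as nn

token_ids_tensor = torch.tensor(token_ids)

# Illustrative sizes with freshly initialized (random) weights; a trained model loads learned ones
token_embedding = nn.Embedding(100277, 4096)   # vocab size x hidden size
position_embedding = nn.Embedding(8192, 4096)  # max positions x hidden size

# Token embedding: ID → semantic vector
token_embeds = token_embedding(token_ids_tensor)

# Position embedding: position → positional vector
pos_ids = torch.arange(len(token_ids))
pos_embeds = position_embedding(pos_ids)

# Combined input to transformer
x = token_embeds + pos_embeds  # shape: (n_tokens, 4096)
```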
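The snippet above uses a learned position table. As an alternative illustration, the sketch below uses the fixed sinusoidal encoding from the original Transformer design, written in plain Python so the arithmetic is visible. The toy sizes, the random embedding table, and the names `sinusoidal_position` and `embedding_table` are assumptions for illustration. Some recent models use other schemes, such as rotary embeddings, instead.

```python
import math
import random


def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal encoding: even dims use sin, odd dims use cos."""
    vec = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


random.seed(0)
d_model, vocab_size = 8, 16  # toy sizes, chosen only for illustration

# One random vector per token ID; a real model learns these during training
embedding_table = [
    [random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)
]

toy_ids = [0, 2, 3, 4]
x = [
    [e + p for e, p in zip(embedding_table[tok], sinusoidal_position(pos, d_model))]
    for pos, tok in enumerate(toy_ids)
]
print(len(x), len(x[0]))  # → 4 8
```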
Steps 6-8: Transformer → Output
```python
# Transformer blocks process x through attention + FFN layers
# Output: logits over vocabulary
# transformer_model is a placeholder for the model's block stack plus output projection
logits = transformer_model(x)  # shape: (n_tokens, vocab_size)

# Take last token's logits (predict next token)
next_token_logits = logits[-1]

# Sample from distribution (or argmax for greedy)
probs = torch.softmax(next_token_logits, dim=-1)
next_token_id = torch.multinomial(probs, 1).item()
next_token_text = enc.decode([next_token_id])
```
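The sampling step can be sketched without a model. The example below is a minimal plain-Python version of softmax with a temperature parameter, plus greedy and sampled selection. The logits are made up, and `softmax` and `pick_next_token` are hypothetical helper names. Temperature is a common sampling knob, included here as an assumption about how a provider might configure decoding.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, temperature=1.0, greedy=False, rng=random):
    """Return a token ID, either the argmax or a sample from the softmax."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 0.5, -1.0, 0.1]  # hypothetical scores over a 4-token vocabulary
print(pick_next_token(toy_logits, greedy=True))  # → 0
print(softmax(toy_logits, temperature=0.5))
```

Generation repeats this step: the chosen token ID is appended to the input and the model runs again, until it emits an end-of-turn token or hits a length limit.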
Key Insight for Engineers
When you send a message via API:
- The API provider applies the chat template server-side
- The formatted prompt is tokenized and the token IDs are mapped to embeddings
- The model runs a forward pass for each generated token
- Output token IDs are decoded back to text
- You receive a string response
You don't need to implement this manually when using APIs — but understanding it helps you debug token limits, prompt formatting issues, and model behavior.
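One place this knowledge pays off is staying under a token limit. The sketch below drops the oldest non-system turns until a conversation fits a budget. `trim_to_budget` and `word_count` are hypothetical helpers, and the whitespace word count is only a crude stand-in for a real tokenizer. It also ignores the extra tokens a chat template adds per turn, so a real budget needs some headroom.

```python
def trim_to_budget(messages, count_tokens, max_prompt_tokens):
    """Drop the oldest non-system messages until the prompt fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_prompt_tokens:
        rest.pop(0)  # discard the oldest turn first
    return system + rest


def word_count(text):
    """Crude stand-in counter; swap in the length of a real tokenizer's output."""
    return len(text.split())


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a long story about transformers."},
    {"role": "assistant", "content": "Once upon a time there was attention."},
    {"role": "user", "content": "What is AI?"},
]
trimmed = trim_to_budget(history, word_count, max_prompt_tokens=10)
print([m["content"] for m in trimmed])
# → ['You are a helpful assistant.', 'What is AI?']
```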