How is text converted to AI-understandable format?
#gen-ai #tokens #embeddings
Answer
AI models process numbers, not text. Text goes through multiple transformation stages before a model can work with it.
Full Pipeline
```text
"What is AI?" → Tokenize → [1921, 374, 15592, 30]
    ↓
Embedding lookup → [[0.2, -0.5, ...], [0.8, 0.1, ...], ...]
    ↓
+ Positional encoding → position-aware vectors
    ↓
Transformer blocks → context-enriched representations
    ↓
Output projection → logits over vocabulary
    ↓
Sample/argmax → token IDs → Decode → "AI stands for..."
```
Step 1: Tokenization
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox"
token_ids = enc.encode(text)
print(token_ids)
# → [791, 4062, 14198, 39935]

# Each token ID maps to a vocabulary entry
for tid in token_ids:
    print(f"  {tid}: '{enc.decode([tid])}'")
# 791: 'The'
# 4062: ' quick'
# 14198: ' brown'
# 39935: ' fox'
```
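To make the text-to-ID mapping concrete without a real vocabulary file, here is a minimal sketch of a tokenizer. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`, hypothetical) and uses greedy longest-match, which is simpler than the byte-pair encoding that `tiktoken` uses, so treat it as an illustration of the ID mapping rather than of the real algorithm.

```python
# Toy vocabulary (hypothetical): token string -> integer ID.
TOY_VOCAB = {"The": 0, " quick": 1, " brown": 2, " fox": 3, " ": 4,
             "q": 5, "u": 6, "i": 7, "c": 8, "k": 9}
ID_TO_TOKEN = {tid: tok for tok, tid in TOY_VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    """Greedy longest-match: at each position take the longest known token."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def toy_decode(ids: list[int]) -> str:
    """Concatenate the token strings back into text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode("The quick brown fox")
print(ids)                    # [0, 1, 2, 3]
print(toy_decode(ids))        # The quick brown fox
print(toy_encode(" quik"))    # [4, 5, 6, 7, 9]
```

The last call shows why an unfamiliar or misspelled word costs more tokens: with no single entry for it, the tokenizer falls back to smaller pieces.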
Step 2: Embedding Lookup
Each token ID is converted to a dense vector (embedding):
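```python
import torch
import torch.nn as nn

# One 768-dimensional row per vocabulary entry. 50257 is GPT-2's vocabulary
# size; cl100k_base has roughly 100k entries, so a real model sizes the table
# to match its own tokenizer (for example, enc.n_vocab).
embedding_layer = nn.Embedding(50257, 768)

token_ids_tensor = torch.tensor([791, 4062, 14198, 39935])
embeddings = embedding_layer(token_ids_tensor)
# Shape: (4, 768): 4 tokens, each represented as 768 floats
```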
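In a trained model these vectors encode semantic meaning: related words such as "king" and "queen" tend to lie close together in the space. The nn.Embedding layer constructed above starts with random values and only acquires that structure through training.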
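Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors (hypothetical values, not taken from any trained model) purely to show the computation.

```python
import math

# Hand-picked toy vectors (hypothetical values, not from a trained model).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these toy values the first score comes out much higher than the second, which is the pattern a trained embedding table is expected to show for related versus unrelated words.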
Step 3: Positional Encoding
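Transformers process all tokens in parallel rather than one at a time like RNNs, and self-attention by itself is insensitive to token order, so position information has to be added explicitly: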
```python
# Position embeddings encode where each token is in the sequence
position_embedding = nn.Embedding(2048, 768)  # up to 2048 positions
positions = torch.arange(4)  # [0, 1, 2, 3]
pos_embeddings = position_embedding(positions)

# Final input: token meaning + position
x = embeddings + pos_embeddings
```
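The snippet above uses learned position embeddings. A fixed alternative is the sinusoidal encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position at geometrically spaced wavelengths. A minimal pure-Python sketch, with the function name `sinusoidal_positions` chosen here for illustration:

```python
import math


def sinusoidal_positions(num_positions: int, dim: int) -> list[list[float]]:
    """Return a num_positions x dim table of fixed sinusoidal encodings.

    Even columns use sin, odd columns use cos; each sin/cos pair shares
    one wavelength, and wavelengths grow geometrically across pairs.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            pair_index = i // 2
            angle = pos / (10000 ** (2 * pair_index / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positions(num_positions=4, dim=8)
# Position 0 gives sin(0) = 0 in even columns and cos(0) = 1 in odd columns.
print(pe[0])
```

Either kind of table is added to the token embeddings element-wise, so every position ends up with a distinct offset.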
Step 4: Transformer Processing
The combined embeddings flow through transformer blocks:
- Self-attention: each token attends to all others, building context
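- Feed-forward network (FFN): a non-linear transformation applied to each position independently
- Layer normalization: rescales activations, which stabilizes training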
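
The self-attention step can be sketched in a few lines. This is a simplified single-head version under stated assumptions: queries, keys, and values are the input vectors themselves (no learned projection matrices), there is no masking of future tokens, and the function name `self_attention` is illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head attention with Q = K = V = x (no learned projections)."""
    dim = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        mixed = [sum(w * value[j] for w, value in zip(weights, x))
                 for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-d toy vectors
context, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each output vector is a blend of all input vectors, weighted by similarity, which is how a token's representation comes to reflect its context.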
Step 5: Decode Output
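```python
import torch.nn.functional as F

# `model` stands in for the transformer blocks plus the output projection;
# it is not defined in this snippet.
logits = model(x)  # shape: (sequence_len, vocab_size), one score per vocabulary token

# Convert the last position's logits to probabilities and sample
probs = F.softmax(logits[-1], dim=-1)
next_token_id = torch.multinomial(probs, 1).item()
next_token = enc.decode([next_token_id])  # may be a word fragment, not a whole word
```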
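The difference between greedy decoding (argmax) and sampling can be shown without a model. The sketch below uses made-up logits for a 3-token vocabulary (`toy_logits`, hypothetical) and adds a temperature parameter, a common way to control how sharply the distribution peaks; it is an illustration, not how any particular library implements sampling.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary

probs = softmax(toy_logits)
greedy_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
rng = random.Random(0)  # seeded so repeated runs pick the same token
sampled_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print("probabilities:", [round(p, 3) for p in probs])
print("greedy pick:", greedy_id)
print("sampled pick:", sampled_id)
print("sharper (T=0.5):", [round(p, 3) for p in softmax(toy_logits, 0.5)])
```

Greedy decoding always returns the highest-scoring token, while sampling can return any token in proportion to its probability, which is why sampled output varies between runs unless the random seed is fixed.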
Summary Table
| Stage | Input | Output | Purpose |
|---|---|---|---|
| Tokenization | Raw text | Integer IDs | Text → discrete units |
| Embedding | IDs | Float vectors | IDs → semantic meaning |
| Positional enc. | Embeddings | Position-aware embeddings | Add position awareness |
| Attention + FFN | Position-aware embeddings | Context vectors | Relate each token to its context |
| Output layer | Context | Logits | Score all possible next tokens |
| Sampling | Logits | Token ID → text | Generate readable output |
Knowing this pipeline helps when debugging tokenization issues, keeping prompts within a token budget, and reasoning about model behavior.