Concept #83 | Medium | extended-ai-concepts

How is text converted to AI-understandable format?

#gen-ai #tokens #embeddings

Answer

AI models process numbers, not text. Text goes through multiple transformation stages before a model can work with it.

Full Pipeline

text
"What is AI?" → Tokenize → [1921, 374, 15592, 30]
              → Embedding lookup → [[0.2, -0.5, ...], [0.8, 0.1, ...], ...]
              → + Positional encoding → position-aware vectors
              → Transformer blocks → context-aware representations
              → Output projection → logits over vocabulary
              → Sample/argmax → token IDs → Decode → "AI stands for..."

Step 1: Tokenization

python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox"
token_ids = enc.encode(text)
print(token_ids)  # → [791, 4062, 14198, 39935]

# Each token ID maps to a vocabulary entry
for tid in token_ids:
    print(f"  {tid}: '{enc.decode([tid])}'")
# 791: 'The'
# 4062: ' quick'
# 14198: ' brown'
# 39935: ' fox'
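
To make the mechanics concrete without any external library, here is a minimal sketch of a tokenizer. It assumes a tiny hand-written vocabulary (`VOCAB` is hypothetical) and uses greedy longest-match, which is a simplification of the byte-pair-encoding merges that real tokenizers such as tiktoken apply.

```python
# Hypothetical toy vocabulary: a few whole-word tokens plus single characters
# as a fallback. Real tokenizers learn their vocabulary from data.
VOCAB = [
    "The", " quick", " brown", " fox", " ",
    "T", "h", "e", "q", "u", "i", "c", "k", "b", "r", "o", "w", "n", "f", "x",
]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Greedy longest-match tokenization (a simplification of BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest vocabulary entry that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids):
    """Map IDs back to their strings and join them."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("The quick brown fox")
print(ids)                 # [0, 1, 2, 3]
print(decode(ids))         # The quick brown fox
print(encode("The box"))   # [0, 4, 13, 15, 19]
```

The last call shows why unfamiliar words cost more tokens: "box" is not in the toy vocabulary, so it falls back to one token per character.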

Step 2: Embedding Lookup

Each token ID is converted to a dense vector (embedding):

python
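import torch
import torch.nn as nn

# The table needs one row per tokenizer vocabulary entry. cl100k_base has
# about 100k entries, so GPT-2's size of 50,257 would be too small for its IDs.
vocab_size = enc.n_vocab  # enc is the tiktoken encoding from Step 1
embedding_layer = nn.Embedding(vocab_size, 768)  # each token ID → 768-dim vector

token_ids_tensor = torch.tensor([791, 4062, 14198, 39935])
embeddings = embedding_layer(token_ids_tensor)
# Shape: (4, 768), i.e. 4 tokens, each represented as 768 floats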

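After training, these vectors encode semantic meaning: tokens used in similar contexts, such as "king" and "queen", end up close together in this space. A freshly created nn.Embedding starts with random values.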
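
The idea of "closeness" can be sketched with plain Python. The 3-dimensional table below is hypothetical and hand-picked for illustration (real embeddings are learned and much wider); the point is only that an embedding layer is a row lookup and that similarity is usually measured with cosine similarity.

```python
import math

# Hypothetical embedding table: row i is the vector for token ID i.
# Values are made up so that "king" and "queen" point in a similar direction.
EMBEDDINGS = [
    [0.90, 0.80, 0.10],  # id 0: "king"
    [0.85, 0.90, 0.15],  # id 1: "queen"
    [0.10, 0.05, 0.95],  # id 2: "banana"
]


def lookup(token_ids):
    """An embedding layer is a table lookup: ID i selects row i."""
    return [EMBEDDINGS[i] for i in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, banana = lookup([0, 1, 2])
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```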

Step 3: Positional Encoding

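Transformers process all tokens in parallel rather than one at a time like RNNs, so self-attention on its own has no notion of word order. Position information is therefore added to the token embeddings: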

python
# Position embeddings encode where each token is in the sequence
position_embedding = nn.Embedding(2048, 768)  # up to 2048 positions
positions = torch.arange(4)  # [0, 1, 2, 3]
pos_embeddings = position_embedding(positions)

# Final input: token meaning + position
x = embeddings + pos_embeddings
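
The snippet above uses learned position embeddings. A common alternative is the fixed sinusoidal encoding, which needs no training. The sketch below swaps in that technique and uses only the standard library; the function names are illustrative, not from any particular library.

```python
import math


def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    vec = []
    for i in range(d_model):
        # Each (sin, cos) pair of dimensions shares one frequency.
        freq = 1.0 / (10000 ** ((i - i % 2) / d_model))
        angle = pos * freq
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_vectors):
    """Add the position vector for index p to the p-th token vector."""
    d_model = len(token_vectors[0])
    return [
        [t + p for t, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]


# The same token appearing twice gets a different input vector at each position.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
x = add_positions(same_token_twice)
print(x[0])          # [0.5, 1.5, 0.5, 1.5]
print(x[0] != x[1])  # True
```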

Step 4: Transformer Processing

The combined embeddings flow through transformer blocks:

  • Self-attention: each token attends to all others, building context
  • FFN (feed-forward network): non-linear transformation applied to each token independently
  • Layer norm: stabilizes training
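
The self-attention step listed above can be sketched in a few lines of plain Python. This is a deliberately reduced version: a single head with identity query/key/value projections, and no masking, residual connections, layer norm, or FFN. Real blocks use learned projection matrices for all three roles.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product attention where Q = K = V = x (a simplification)."""
    d = len(x[0])
    out = []
    for query in x:
        # Score this token against every token, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(x):
    print([round(v, 3) for v in row])
```

Each output row mixes information from every input row, which is how a token's representation comes to depend on its context.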

Step 5: Decode Output

python
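import torch.nn.functional as F

# `model` is a placeholder for the transformer blocks plus output projection (Step 4)
# Final layer outputs logits (one score per vocabulary token)
logits = model(x)  # shape: (sequence_len, vocab_size)

# Convert the last position's logits to probabilities and sample the next token
probs = F.softmax(logits[-1], dim=-1)
next_token_id = torch.multinomial(probs, 1).item()
next_token = enc.decode([next_token_id])  # a token may be a word piece, not a whole word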
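
The difference between greedy decoding (argmax) and sampling can be shown without a model. The sketch below assumes a hypothetical three-token vocabulary and made-up logits, and adds a temperature parameter, a common knob that is not part of the snippet above.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the highest-scoring token (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, rng=random):
    """Draw a token index at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_vocab = [" cat", " dog", " fox"]  # hypothetical vocabulary
logits = [1.0, 2.0, 4.0]              # made-up scores for the next token

print(toy_vocab[greedy(logits)])      # prints " fox"
rng = random.Random(0)                # seeded so repeated runs match
print([toy_vocab[sample(logits, rng=rng)] for _ in range(5)])
```

Greedy decoding is deterministic, while sampling can return a lower-scoring token; that is why the same prompt can produce different completions.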

Summary Table

| Stage | Input | Output | Purpose |
| --- | --- | --- | --- |
| Tokenization | Raw text | Integer IDs | Text → discrete units |
| Embedding | IDs | Float vectors | IDs → semantic meaning |
| Positional enc. | Embeddings | Enriched embeddings | Add position awareness |
| Attention + FFN | Embeddings | Context vectors | Build understanding |
| Output layer | Context | Logits | Score all possible next tokens |
| Sampling | Logits | Token ID → text | Generate readable output |

Understanding this pipeline helps you debug tokenization issues, keep prompts within token limits, and reason about model behavior.