How is text converted to AI-understandable format?

Question

Accepted Answer

## How Is Text Converted to AI-Understandable Format?

AI models process numbers, not text. Text goes through multiple transformation stages before a model can work with it.

### Full Pipeline

```
"What is AI?" → Tokenize → [1921, 374, 15592, 30]
                           ↓
              Embedding lookup → [[0.2, -0.5, ...], [0.8, 0.1, ...], ...]
                           ↓
              + Positional encoding → position-aware vectors
                           ↓
              Transformer blocks → enriched representations
                           ↓
              Output projection → logits over vocabulary
                           ↓
              Sample/argmax → token IDs → Decode → "AI stands for..."
```

### Step 1: Tokenization

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox"
token_ids = enc.encode(text)
print(token_ids)  # → [791, 4062, 14198, 39935]

# Each token ID maps to a vocabulary entry
for tid in token_ids:
    print(f"  {tid}: '{enc.decode([tid])}'")
# 791: 'The'
# 4062: ' quick'
# 14198: ' brown'
# 39935: ' fox'
```
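
To make the text-to-ID mapping concrete without a library dependency, here is a minimal sketch of a toy tokenizer. It uses greedy longest-match over a tiny vocabulary, which is a simplified stand-in: `tiktoken` uses byte-pair encoding with learned merges instead. The vocabulary and the names `toy_vocab`, `toy_encode`, and `toy_decode` are invented for illustration.

```python
# Hypothetical vocabulary, invented for illustration only.
toy_vocab = {"The": 0, " quick": 1, " brown": 2, " fox": 3,
             " f": 4, "ox": 5, " ": 6, "e": 7, "s": 8}
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_encode(text):
    """Greedy longest-match: at each position, take the longest vocabulary entry that fits."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in toy_vocab:
                ids.append(toy_vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids

def toy_decode(ids):
    """Concatenate the text piece for each ID to recover the original string."""
    return "".join(id_to_token[i] for i in ids)

ids = toy_encode("The quick brown fox")
print(ids, repr(toy_decode(ids)))

# A word missing from the vocabulary is split into smaller known pieces
print(toy_encode("The foxes"))
```

The second call shows why token counts and word counts differ: "foxes" is not in the toy vocabulary, so it is split into " fox", "e", and "s".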

### Step 2: Embedding Lookup

Each token ID is converted to a dense vector (embedding):

```python
import torch
import torch.nn as nn

# One row per vocabulary entry, each row a 768-dimensional vector.
# cl100k_base (Step 1) has 100,277 entries (enc.n_vocab); 50,257 is GPT-2's size.
vocab_size = 100_277
embedding_layer = nn.Embedding(vocab_size, 768)

token_ids_tensor = torch.tensor([791, 4062, 14198, 39935])
embeddings = embedding_layer(token_ids_tensor)
# Shape: (4, 768): 4 tokens, each represented as 768 floats
```

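After training, these vectors encode semantic relationships: tokens used in similar contexts, such as "king" and "queen", end up close together in this space. A freshly constructed `nn.Embedding` like the one above starts with random values and only acquires this structure through training.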
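
One common way to measure "close" is cosine similarity. The sketch below uses hand-written 4-dimensional vectors that are made up for illustration (they do not come from any trained model), and the helper name `cosine_similarity` is an assumption of this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    "king":   [0.80, 0.65, 0.10, 0.05],
    "queen":  [0.75, 0.70, 0.15, 0.10],
    "banana": [0.05, 0.10, 0.90, 0.70],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])
print(f"king vs queen:  {sim_royal:.3f}")
print(f"king vs banana: {sim_fruit:.3f}")
```

With these toy values the first similarity is much higher than the second, which is the pattern a trained embedding table would be expected to show for related versus unrelated tokens.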

### Step 3: Positional Encoding

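Transformers process all tokens in parallel (unlike RNNs), and self-attention on its own has no notion of token order, so position information has to be added explicitly. The example below uses learned position embeddings: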

```python
# Position embeddings encode where each token is in the sequence
position_embedding = nn.Embedding(2048, 768)  # up to 2048 positions
positions = torch.arange(4)  # [0, 1, 2, 3]
pos_embeddings = position_embedding(positions)

# Final input: token meaning + position
x = embeddings + pos_embeddings
```
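
The block above learns its position vectors during training. A different technique, the fixed sinusoidal encoding from the original Transformer paper, computes them from a formula instead. A minimal sketch, where the function name `sinusoidal_position` and the tiny `d_model` are choices made for this example:

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed encoding for one position: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(d_model):
        # Each (even, odd) pair of dimensions shares one frequency
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

d_model = 8  # tiny size for readability; the example above uses 768
pos_table = [sinusoidal_position(p, d_model) for p in range(4)]

for p, row in enumerate(pos_table):
    print(p, [round(v, 3) for v in row])
```

Either way, the result is one vector per position that is added to the token embedding, so the same token at two different positions produces two different inputs.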

### Step 4: Transformer Processing

The combined embeddings flow through transformer blocks:
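- **Self-attention**: each token attends to the other tokens in the sequence (in GPT-style decoders, only to itself and earlier tokens), building context
- **FFN** (feed-forward network): a non-linear transformation applied to each token's vector independently
- **Layer norm**: normalizes activations, which stabilizes training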
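
The self-attention step can be sketched in plain Python. This is a deliberately reduced version: it uses a single head, sets queries, keys, and values all equal to the input, and applies no causal mask, whereas real models use learned projection matrices and usually several heads. The names `self_attention` and `x_toy` are invented for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # Weighted average of all token vectors becomes the new representation
        mixed = [sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token vectors of dimension 4
x_toy = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0, 0.0]]
context, attn = self_attention(x_toy)

for i, weights in enumerate(attn):
    print(f"token {i} attends with weights {[round(w, 3) for w in weights]}")
```

Each output vector is a mix of all input vectors, weighted by similarity, which is how a token's representation comes to depend on its neighbors.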

### Step 5: Decode Output

```python
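import torch
import torch.nn.functional as F

# `model` stands for the transformer blocks plus output projection (not defined here).
# It returns logits: one score per vocabulary token at each position.
logits = model(x)  # shape: (sequence_len, vocab_size)

# Convert the last position's logits to probabilities and sample the next token
probs = F.softmax(logits[-1], dim=-1)
next_token_id = torch.multinomial(probs, 1).item()
next_token = enc.decode([next_token_id])  # may be a word fragment, not a whole word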
```
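
The pipeline diagram mentions both sampling and argmax. The difference can be shown on a toy logit vector without a model. In this sketch the logits are made up, and the `temperature` parameter is a common sampling knob that the code above does not use; it is included here as an assumption to show how the distribution can be sharpened.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores over a 4-token vocabulary
toy_logits = [2.0, 1.0, 0.1, -1.0]

# Greedy decoding (argmax): always pick the highest-scoring token
greedy_id = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

# Sampling: draw a token in proportion to its probability
probs_default = softmax(toy_logits)
probs_sharp = softmax(toy_logits, temperature=0.5)
rng = random.Random(0)  # seeded so repeated runs draw the same token
sampled_id = rng.choices(range(len(toy_logits)), weights=probs_default, k=1)[0]

print("greedy:", greedy_id)
print("sampled:", sampled_id)
print("p(top token) at T=1.0:", round(probs_default[0], 3))
print("p(top token) at T=0.5:", round(probs_sharp[0], 3))
```

Greedy decoding is deterministic, while sampling can return lower-probability tokens, which is one reason the same prompt can produce different outputs.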

### Summary Table

| Stage | Input | Output | Purpose |
|-------|-------|--------|---------|
| Tokenization | Raw text | Integer IDs | Text → discrete units |
| Embedding | IDs | Float vectors | IDs → semantic meaning |
| Positional enc. | Embeddings | Enriched embeddings | Add position awareness |
| Attention + FFN | Embeddings | Context vectors | Build understanding |
| Output layer | Context | Logits | Score all possible next tokens |
| Sampling | Logits | Token ID → text | Generate readable output |

Understanding this pipeline helps when debugging tokenization issues, keeping prompts within a model's token limit, and reasoning about model behavior.
