What are generators in Python? How are they used in streaming LLM responses?

Question

Accepted Answer

## Generators in Python

A **generator** is the object you get by calling a function that uses `yield` instead of `return`. Rather than computing and returning all values at once, it produces them one at a time and keeps its local state (variables and position in the code) between calls to `next()`.

### How Generators Work

```python
def count_up(n: int):
    for i in range(n):
        yield i   # Pause here, return i, resume on next()

gen = count_up(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2

# Or iterate directly
for val in count_up(5):
    print(val)
```
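To make the "maintains its state" point concrete, here is a small sketch of what happens when a generator runs out of values. `count_up` is repeated so the block runs on its own; the variable names are only illustrative.

```python
def count_up(n: int):
    for i in range(n):
        yield i

gen = count_up(2)
first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes right after the yield, with i preserved

try:
    next(gen)       # the loop is finished, so Python raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True

# next() accepts a default value, which avoids the exception
fallback = next(gen, None)

print(first, second, exhausted, fallback)  # 0 1 True None
```

A `for` loop handles `StopIteration` for you, which is why iterating directly is usually the more convenient form.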

### Why Generators Are Memory-Efficient

```python
# ❌ Returns all items at once — entire dataset in memory
def load_all_documents(file_path: str) -> list[str]:
    with open(file_path) as f:
        return [line.strip() for line in f]  # Millions of lines = GB of RAM

# ✅ Generator — processes one line at a time, O(1) memory
def stream_documents(file_path: str):
    with open(file_path) as f:
        for line in f:
            yield line.strip()

# Process a 10GB file with constant memory
for doc in stream_documents("large_corpus.jsonl"):
    embed_and_index(doc)
```
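A rough way to see the difference is to compare the size of a list with the size of an equivalent generator expression. This is only a sketch: `sys.getsizeof` reports the size of the container object itself (not the items it references), and the exact numbers depend on the Python implementation and version.

```python
import sys

n = 1_000_000
as_list = [i * 2 for i in range(n)]  # materializes every value up front
as_gen = (i * 2 for i in range(n))   # generator expression: nothing computed yet

list_bytes = sys.getsizeof(as_list)  # grows with n (one pointer per element)
gen_bytes = sys.getsizeof(as_gen)    # small and independent of n

print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")

# Values are produced and discarded one at a time while summing
total = sum(as_gen)
```

The generator still does the same amount of computation; it just never holds more than one value at a time.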

### Streaming LLM Responses

Generators are the natural model for streaming LLM output — tokens arrive one at a time:

```python
from openai import OpenAI

client = OpenAI()

def stream_llm_response(prompt: str):
    '''Generator that yields tokens as they arrive from the LLM'''
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
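    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:  # content can be None, e.g. on the final chunk
            yield token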

# Print tokens as they arrive
for token in stream_llm_response("Explain transformers"):
    print(token, end="", flush=True)
```
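If you want to try the pattern without an API key, a stand-in generator works the same way from the consumer's point of view. In the sketch below, `fake_token_stream` and `stream_and_collect` are hypothetical helpers (not part of any library), and splitting on spaces is only a crude approximation of real tokenization.

```python
import time
from typing import Iterator

def fake_token_stream(text: str, delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream: yields one word at a time."""
    for word in text.split(" "):
        time.sleep(delay)  # simulate latency between tokens
        yield word + " "

def stream_and_collect(tokens: Iterator[str]) -> str:
    """Print tokens as they arrive and also return the full response."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)
        parts.append(token)
    print()
    return "".join(parts).strip()

full_text = stream_and_collect(fake_token_stream("Attention is all you need"))
```

Collecting the tokens while printing them is a common need in practice, since the complete response is usually stored in chat history after streaming finishes.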

### Streaming in a FastAPI / Web Context

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # create once and reuse across requests

def generate_tokens(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            yield f"data: {token}\n\n"  # Server-Sent Events format
    yield "data: [DONE]\n\n"

@app.post("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
```
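One detail worth watching: a token that itself contains a newline would break the `data: ...` framing, because a blank line marks the end of an event. A minimal sketch of a framing helper is below. `sse_event` and `to_sse` are hypothetical names, and the `[DONE]` sentinel is the convention used in this answer, not part of the Server-Sent Events format itself.

```python
from typing import Iterable, Iterator

def sse_event(data: str) -> str:
    """Format a payload as one Server-Sent Events message."""
    # Each line of the payload gets its own "data:" field;
    # the trailing blank line terminates the event.
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"

def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap any token generator so it yields SSE-framed messages."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("[DONE]")  # sentinel so the client knows the stream ended

events = list(to_sse(["Hello", " world", "\nnew line"]))
for event in events:
    print(repr(event))
```

Because `to_sse` accepts any iterable of strings, it can wrap the LLM generator directly before handing it to `StreamingResponse`. Another common option is to JSON-encode each token, which sidesteps the newline issue entirely.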

### Generator Pipelines for RAG

Chain generators for memory-efficient document processing:

```python
def load_files(paths: list[str]):
    for path in paths:
        with open(path) as f:
            yield f.read()

def chunk_text(texts, chunk_size: int = 512):
    for text in texts:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            yield " ".join(words[i:i + chunk_size])

def embed_chunks(chunks, batch_size: int = 32):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            embeddings = model.encode(batch)
            yield from zip(batch, embeddings)
            batch = []
    if batch:
        yield from zip(batch, model.encode(batch))

# Compose the pipeline: only one file and one batch are held in memory at a time
files = load_files(["doc1.txt", "doc2.txt", "doc3.txt"])
chunks = chunk_text(files)
embedded = embed_chunks(chunks)

for text, embedding in embedded:
    vectorstore.add(text, embedding)
```
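The batching logic inside `embed_chunks` can also be pulled out into a reusable stage. The sketch below uses `itertools.islice` and records when items are produced, to show that nothing upstream runs until the consumer asks for a batch. `in_batches` and `numbered_chunks` are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator

def in_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group any iterable into lists of at most batch_size items, lazily."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

log: list[str] = []

def numbered_chunks(n: int) -> Iterator[str]:
    for i in range(n):
        log.append(f"produce {i}")  # record when each chunk is actually created
        yield f"chunk-{i}"

batches = in_batches(numbered_chunks(5), batch_size=2)
log_before = list(log)        # still empty: no stage has run yet

first_batch = next(batches)   # pulls exactly two chunks through the pipeline
log_after_first = list(log)

remaining = list(batches)     # drains the rest, including the short final batch
```

On Python 3.12 and later, the standard library's `itertools.batched` covers the same need, although it yields tuples rather than lists.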

| Feature | List | Generator |
|---------|------|-----------|
| **Memory** | All values at once | One value at a time |
| **Speed** | Fast repeated access once built | Similar per-item throughput; first value available immediately |
| **Reusable** | Yes | No (exhausted after iteration) |
| **Best for** | Small datasets, random access | Streaming, large data, pipelines |
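
The "Reusable" row is the one that most often causes surprises. A short sketch of the difference, with a factory function as one possible workaround (`make_squares` is a hypothetical name):

```python
squares_list = [x * x for x in range(4)]
squares_gen = (x * x for x in range(4))

list_first, list_second = sum(squares_list), sum(squares_list)
gen_first, gen_second = sum(squares_gen), sum(squares_gen)

print(list_first, list_second)  # 14 14
print(gen_first, gen_second)    # 14 0 (second pass finds the generator exhausted)

# To iterate again, create a fresh generator each time
def make_squares():
    return (x * x for x in range(4))

fresh_first, fresh_second = sum(make_squares()), sum(make_squares())
```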

> **Key insight:** LLM streaming maps naturally onto generators. The model produces one token at a time, so wrapping the stream in a Python generator lets a UI render each token as soon as it arrives instead of waiting for the full response.


