Concept #22 · Medium · python-for-gen-ai

What are generators in Python? How are they used in streaming LLM responses?

#gen-ai #python

Answer

Generators in Python

A generator is a function that yields values one at a time instead of computing and returning all values at once. It uses `yield` instead of `return` and maintains its state between calls.

How Generators Work

python
def count_up(n: int):
    for i in range(n):
        yield i   # Pause here, return i, resume on next()

gen = count_up(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2

# Or iterate directly
for val in count_up(5):
    print(val)
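
The pause-and-resume behaviour above also determines what happens when a generator runs out of values. A minimal sketch, reusing `count_up` from the example: `next()` raises `StopIteration` once the function body finishes, a `for` loop handles that exception for you, and `next(gen, default)` is the built-in way to get a fallback value instead.

```python
def count_up(n: int):
    for i in range(n):
        yield i  # Execution pauses here until the next value is requested


gen = count_up(2)
values = [next(gen), next(gen)]  # The two values the generator can produce

# A third next() finds nothing left: the body has finished
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default to next() avoids the exception on an exhausted generator
fallback = next(gen, None)
```

This is why manual `next()` calls are rare in practice: iterating with `for` stops cleanly at the end without any exception handling.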

Why Generators Are Memory-Efficient

python
# ❌ Returns all items at once — entire dataset in memory
def load_all_documents(file_path: str) -> list[str]:
    with open(file_path) as f:
        return [line.strip() for line in f]  # Millions of lines = GB of RAM

# ✅ Generator — processes one line at a time, O(1) memory
def stream_documents(file_path: str):
    with open(file_path) as f:
        for line in f:
            yield line.strip()

# Process a 10GB file with constant memory
# (embed_and_index stands in for your own per-document processing function)
for doc in stream_documents("large_corpus.jsonl"):
    embed_and_index(doc)
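
To make the memory difference concrete without a 10GB file, here is a small sketch that compares a materialized list with an equivalent generator. The `squares` helper is illustrative and not part of the original example. Note that `sys.getsizeof` reports only the size of the container object itself, so the list figure actually understates its real footprint.

```python
import sys


def squares(n: int):
    """Yield i*i for i in range(n), one value at a time."""
    for i in range(n):
        yield i * i


as_list = [i * i for i in range(100_000)]  # All 100,000 results exist at once
as_gen = squares(100_000)                  # Nothing computed yet

list_bytes = sys.getsizeof(as_list)  # Grows with the number of elements
gen_bytes = sys.getsizeof(as_gen)    # Small and independent of n

# Both produce the same values, so aggregate results match
total_from_list = sum(as_list)
total_from_gen = sum(as_gen)

print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")
```

The generator stays the same size no matter how large `n` is, which is what makes the file-streaming version above viable for inputs larger than RAM.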

Streaming LLM Responses

Generators are the natural model for streaming LLM output — tokens arrive one at a time:

python
from openai import OpenAI

client = OpenAI()

def stream_llm_response(prompt: str):
    """Generator that yields tokens as they arrive from the LLM."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        # Some chunks can arrive with an empty choices list (e.g. a usage-only chunk)
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content
        if token:
            yield token

# Print tokens as they arrive
for token in stream_llm_response("Explain transformers"):
    print(token, end="", flush=True)
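
Applications usually need the complete response as well (for logging or chat history), not just the live token stream. One way to get both is a pass-through generator that records tokens while forwarding them. The sketch below runs offline: `fake_token_stream` is a hypothetical stand-in for `stream_llm_response`, and `tee_tokens` is an illustrative helper name, not a library function.

```python
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream: yields word-sized 'tokens'."""
    for word in text.split(" "):
        yield word + " "


def tee_tokens(tokens: Iterable[str], sink: list[str]) -> Iterator[str]:
    """Forward each token unchanged while also appending it to sink."""
    for token in tokens:
        sink.append(token)
        yield token


received: list[str] = []
stream = fake_token_stream("Transformers use self attention")

# Tokens are displayed as they arrive and recorded at the same time
for token in tee_tokens(stream, received):
    print(token, end="", flush=True)

full_text = "".join(received).strip()
```

Because `tee_tokens` is itself a generator, it adds no buffering: each token is displayed as soon as the upstream generator yields it.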

Streaming in a FastAPI / Web Context

python
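from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # Create the client once and reuse it across requests

def generate_tokens(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        # Some chunks can arrive with an empty choices list
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content
        if token:
            # Server-Sent Events format: one "data:" line per line of payload,
            # so a token containing "\n" cannot end the event early
            lines = token.split("\n")
            yield "".join(f"data: {line}\n" for line in lines) + "\n"
    yield "data: [DONE]\n\n"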

@app.post("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
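
On the receiving side, the same idea applies in reverse: a generator can turn the raw event stream back into tokens. The following is a minimal sketch of a client-side parser, assuming the lines have already been read from the HTTP response by whatever client library is in use. `parse_sse` is a hypothetical helper name, and it handles only `data:` fields and the `[DONE]` sentinel used by the endpoint above.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event until [DONE] is seen."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event
            if data:
                payload = "\n".join(data)  # Multiple data lines join with "\n"
                data = []
                if payload == "[DONE]":
                    return
                yield payload
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # Drop the single optional space after the colon
            data.append(value)


raw_lines = [
    "data: Hello", "",
    "data:  world", "",
    "data: line1", "data: line2", "",
    "data: [DONE]", "",
    "data: ignored", "",
]
tokens = list(parse_sse(raw_lines))
```

The second event keeps its leading space because only one space after `data:` is treated as a separator, which matters for LLM tokens that begin with whitespace.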

Generator Pipelines for RAG

Chain generators for memory-efficient document processing:

python
def load_files(paths: list[str]):
    for path in paths:
        with open(path) as f:
            yield f.read()

def chunk_text(texts, chunk_size: int = 512):
    for text in texts:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            yield " ".join(words[i:i + chunk_size])

def embed_chunks(chunks, batch_size: int = 32):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            embeddings = model.encode(batch)
            yield from zip(batch, embeddings)
            batch = []
    if batch:
        yield from zip(batch, model.encode(batch))

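# Compose the pipeline. Nothing runs until the for loop pulls items, and only
# one file and one batch of chunks are held in memory at a time.
files = load_files(["doc1.txt", "doc2.txt", "doc3.txt"])
chunks = chunk_text(files)
embedded = embed_chunks(chunks)

# vectorstore is a placeholder for your vector database client
for text, embedding in embedded:
    vectorstore.add(text, embedding)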
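
The batching logic inside `embed_chunks` can also be factored into its own reusable stage. The sketch below shows one way to do that with `itertools.islice`, and traces how many chunks are pulled from upstream at a time. `in_batches`, `fake_embed`, and `traced_chunks` are hypothetical names introduced for illustration, and `fake_embed` stands in for `model.encode` so the example runs without downloading a model.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group an iterable into lists of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in for model.encode: a 1-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in batch]


def embed_in_batches(chunks: Iterable[str], batch_size: int = 2):
    for batch in in_batches(chunks, batch_size):
        yield from zip(batch, fake_embed(batch))


pulled: list[str] = []  # Records which chunks upstream has produced so far


def traced_chunks() -> Iterator[str]:
    for text in ["alpha", "be", "gamma"]:
        pulled.append(text)
        yield text


pipeline = embed_in_batches(traced_chunks(), batch_size=2)
pulled_before_start = list(pulled)  # Building the pipeline pulls nothing

first = next(pipeline)
pulled_after_first = list(pulled)   # Only the first batch has been read

rest = list(pipeline)
```

Requesting one result pulls exactly one batch from upstream, so the memory held by the pipeline is bounded by the batch size rather than by the size of the corpus.
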

| Feature | List | Generator |
|---|---|---|
| Memory | All values at once | One value at a time |
| Speed | Fast for small data | Same throughput |
| Reusable | Yes | No (exhausted after iteration) |
| Best for | Small datasets, random access | Streaming, large data, pipelines |
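
The "Reusable" row is the one that most often causes bugs: a second pass over a generator silently produces nothing. A short sketch of the behaviour, followed by one common workaround, a small class whose `__iter__` returns a fresh generator each time (`CountUp` is an illustrative name):

```python
def count_up(n: int):
    for i in range(n):
        yield i


gen = count_up(3)
first_pass = list(gen)   # Consumes every value
second_pass = list(gen)  # The generator is exhausted, so this is empty


class CountUp:
    """Re-iterable wrapper: every iteration starts a new generator."""

    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        return count_up(self.n)


reusable = CountUp(3)
passes = [list(reusable), list(reusable)]  # Both passes see all values
```

When the values are cheap to recompute, simply calling the generator function again is enough; when they are not, materializing them into a list once may be the better trade-off.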

Key insight: LLM streaming maps naturally onto generators. The model produces one token at a time, and wrapping the stream in a Python generator lets the caller display each token as soon as it arrives, which is the idiomatic pattern for low-latency streaming UIs.
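
The latency benefit comes from interleaving: with a generator, the consumer handles each token before the next one is produced. The sketch below records the order of events for an eager (list-returning) producer and a lazy (generator) producer. All function names here are illustrative, and the token list stands in for real model output.

```python
def generate_all(tokens: list[str], events: list[str]) -> list[str]:
    """Eager producer: builds the full response before returning."""
    out = []
    for token in tokens:
        events.append(f"generated {token}")
        out.append(token)
    return out


def generate_lazily(tokens: list[str], events: list[str]):
    """Lazy producer: hands over each token as soon as it exists."""
    for token in tokens:
        events.append(f"generated {token}")
        yield token


def display(tokens, events: list[str]) -> None:
    """Consumer: stands in for a UI rendering each token."""
    for token in tokens:
        events.append(f"displayed {token}")


eager_events: list[str] = []
display(generate_all(["Hi", "!"], eager_events), eager_events)

lazy_events: list[str] = []
display(generate_lazily(["Hi", "!"], lazy_events), lazy_events)
```

In the eager trace every "generated" event precedes the first "displayed" event, while in the lazy trace they alternate, so the user sees the first token after one generation step rather than after all of them.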