What are generators in Python? How are they used in streaming LLM responses?
#gen-ai #python
Answer
Generators in Python
A generator is a function that yields values one at a time instead of computing and returning all values at once. It uses `yield` instead of `return`.
How Generators Work
```python
def count_up(n: int):
    for i in range(n):
        yield i  # Pause here, hand i to the caller, resume on the next next()

gen = count_up(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2

# Or iterate directly
for val in count_up(5):
    print(val)
```
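To make the pause-and-resume behaviour concrete, here is a small sketch. The `traced_count` function and the `log` list are illustrative names, not part of the original answer. It shows that calling a generator function runs none of its body, that each `next()` advances to the following `yield`, and that an exhausted generator raises `StopIteration`.

```python
log: list[str] = []

def traced_count(n: int):
    log.append("started")            # runs only on the first next()
    for i in range(n):
        log.append(f"yielding {i}")
        yield i
    log.append("finished")           # runs once the loop is exhausted

gen = traced_count(2)
assert log == []                     # creating the generator ran none of its body

first = next(gen)                    # runs up to the first yield
second = next(gen)                   # resumes right after the first yield

try:
    next(gen)                        # the body finishes, so StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
print(log)
```

A `for` loop handles `StopIteration` for you, which is why direct iteration is usually preferred over manual `next()` calls.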
Why Generators Are Memory-Efficient
```python
# ❌ Returns all items at once: the entire dataset sits in memory
def load_all_documents(file_path: str) -> list[str]:
    with open(file_path) as f:
        return [line.strip() for line in f]  # Millions of lines = GBs of RAM

# ✅ Generator: processes one line at a time, O(1) memory
def stream_documents(file_path: str):
    with open(file_path) as f:
        for line in f:
            yield line.strip()

# Process a 10GB file with constant memory
# (embed_and_index is a placeholder for your own processing step)
for doc in stream_documents("large_corpus.jsonl"):
    embed_and_index(doc)
```
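The difference can be observed without a large file. The following is a rough sketch using `sys.getsizeof`, which reports only the size of the container object itself (not the integers it references), so treat the numbers as indicative rather than exact. The variable names are illustrative.

```python
import sys

n = 100_000
as_list = [i * 2 for i in range(n)]   # materialises every element up front
as_gen = (i * 2 for i in range(n))    # stores only the iteration state

list_bytes = sys.getsizeof(as_list)   # grows with n
gen_bytes = sys.getsizeof(as_gen)     # small and independent of n

print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")

# The generator still produces the same values when consumed.
total = sum(as_gen)
```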
Streaming LLM Responses
Generators are a natural model for streaming LLM output, because tokens arrive one at a time:
```python
from openai import OpenAI

client = OpenAI()

def stream_llm_response(prompt: str):
    """Generator that yields tokens as they arrive from the LLM."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices; skip them
            continue
        token = chunk.choices[0].delta.content
        if token:  # content can be None, e.g. on the final chunk
            yield token

# Print tokens as they arrive
for token in stream_llm_response("Explain transformers"):
    print(token, end="", flush=True)
```
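A common follow-up need is to display tokens as they arrive while also keeping the full response, for example to store it in chat history. Below is a minimal sketch of that pattern. `fake_token_stream` is a hypothetical stand-in for a real LLM stream so the example runs offline, and `tee_tokens` is an illustrative helper, not part of any library.

```python
from typing import Iterable, Iterator

def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream: yields one word-sized token at a time."""
    for word in text.split(" "):
        yield word + " "

def tee_tokens(tokens: Iterable[str], sink: list[str]) -> Iterator[str]:
    """Pass tokens through unchanged while recording each one in sink."""
    for token in tokens:
        sink.append(token)
        yield token

received: list[str] = []
for token in tee_tokens(fake_token_stream("Attention is all you need"), received):
    print(token, end="", flush=True)   # shown to the user immediately
print()

# After the stream ends, the recorded tokens form the complete response.
full_response = "".join(received).strip()
```

Because `tee_tokens` is itself a generator, it adds no buffering delay: each token is recorded and forwarded before the next one is requested.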
Streaming in a FastAPI / Web Context
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()

def generate_tokens(prompt: str):
    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices; skip them
            continue
        token = chunk.choices[0].delta.content
        if token:
            # Server-Sent Events format. A token containing a newline would
            # break this framing and needs escaping (for example JSON-encoding).
            yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream is over

@app.post("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
```
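On the consuming side, the same generator idea applies. The sketch below parses `data:` lines in the format the endpoint above emits and stops at the `[DONE]` sentinel. `iter_sse_data` is an illustrative helper rather than a library function, and it handles only single-line `data:` fields, not the full Server-Sent Events grammar. It takes any iterable of lines, so in practice it could be fed from an HTTP client's line iterator.

```python
from typing import Iterable, Iterator

def iter_sse_data(lines: Iterable[str], done_marker: str = "[DONE]") -> Iterator[str]:
    """Yield the payload of each `data:` line until the done marker appears."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue                       # skip blank separators and other fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]          # SSE drops one optional leading space
        if payload == done_marker:
            return                         # end of stream
        yield payload

# Simulated response body; a real client would read these lines from the network.
sample = "data: Hello\n\ndata:  world\n\ndata: [DONE]\n\ndata: ignored\n\n"
tokens = list(iter_sse_data(sample.splitlines()))
print("".join(tokens))
```

Removing only one leading space matters here: a token such as `" world"` keeps its own leading space, so the words do not run together when joined.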
Generator Pipelines for RAG
Chain generators for memory-efficient document processing:
```python
from sentence_transformers import SentenceTransformer

def load_files(paths: list[str]):
    for path in paths:
        with open(path) as f:
            yield f.read()

def chunk_text(texts, chunk_size: int = 512):
    for text in texts:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            yield " ".join(words[i:i + chunk_size])

def embed_chunks(chunks, batch_size: int = 32):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            embeddings = model.encode(batch)
            yield from zip(batch, embeddings)
            batch = []
    if batch:  # flush the final partial batch
        yield from zip(batch, model.encode(batch))

# Compose the pipeline: processes millions of docs with minimal RAM
files = load_files(["doc1.txt", "doc2.txt", "doc3.txt"])
chunks = chunk_text(files)
embedded = embed_chunks(chunks)

# vectorstore is a placeholder for your vector database client
for text, embedding in embedded:
    vectorstore.add(text, embedding)
```
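The batching logic inside `embed_chunks` can also be factored into its own reusable pipeline stage. Here is one possible sketch using `itertools.islice`. The `batched` helper and the `numbers` generator are illustrative. Python 3.12 and later ship a similar `itertools.batched`, which yields tuples rather than lists.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items, pulling lazily from items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice takes at most batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch

def numbers() -> Iterator[int]:
    yield from range(7)

batches = list(batched(numbers(), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With a stage like this, an embedding step only needs to loop over batches and encode each one, and the final partial batch is handled without a separate flush branch.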
| Feature | List | Generator |
|---|---|---|
| Memory | All values at once | One value at a time |
| Speed | Fast for small data | Similar throughput, first value available immediately |
| Reusable | Yes | No (exhausted after iteration) |
| Best for | Small datasets, random access | Streaming, large data, pipelines |
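The "Reusable" row is the one that most often surprises people in practice. A short sketch of what exhaustion looks like (the `squares` function is illustrative):

```python
def squares(n: int):
    for i in range(n):
        yield i * i

gen = squares(4)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to yield, so this is empty

# Call the generator function again to get a fresh iterator,
# or materialise the values in a list if they must be reused.
fresh_pass = list(squares(4))

print(first_pass, second_pass, fresh_pass)
```

Iterating an exhausted generator raises no error. It simply produces nothing, which can hide bugs when the same generator object is passed to two consumers.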
Key insight: streamed LLM responses map naturally onto generators. The model produces one token at a time, and wrapping that stream in a Python generator is the idiomatic pattern for low-latency streaming UIs.