How would you reduce inference latency for an LLM application?
#gen-ai #mlops #llm
Answer
Reducing LLM Inference Latency
Latency is often the biggest UX bottleneck in LLM applications. Here are the most impactful techniques, ordered roughly from least to most effort.
1. Streaming (Free Win — Always Do This)
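```python
from openai import OpenAI
from fastapi.responses import StreamingResponse

client = OpenAI()

def stream_response(prompt: str):
    """Yield tokens as they are generated, which cuts perceived latency dramatically."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # Some chunks (e.g. usage-only) carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:
            yield token

# In a FastAPI route, wrap the generator in a streaming response:
#   return StreamingResponse(stream_response(prompt), media_type="text/plain")

# Without streaming: the user waits for the full response (e.g. ~5s)
# With streaming: the user sees the first token much sooner (e.g. ~0.3s)
```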
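To check whether streaming actually helps in your setup, it is useful to measure time to first token (TTFT) separately from total response time. The sketch below is one way to do that for any token iterator; `measure_stream` and `fake_stream` are hypothetical helper names introduced here, and `fake_stream` only simulates a model with `time.sleep`.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    pieces = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        pieces.append(token)
    end = time.perf_counter()
    return {
        "text": "".join(pieces),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
    }


def fake_stream(n_tokens: int = 5, delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a real token stream: emits one token every delay_s seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "


if __name__ == "__main__":
    stats = measure_stream(fake_stream())
    print(f"TTFT: {stats['ttft_s']:.3f}s, total: {stats['total_s']:.3f}s")
```

In a real application you would pass the generator from your streaming call instead of `fake_stream()`.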
2. Choose Faster/Smaller Models
| Model | Typical p50 Latency | Quality | Cost |
|---|---|---|---|
| GPT-4o | 2–5s | Highest | High |
| GPT-4o-mini | 0.5–1.5s | Good | Low |
| Claude 3 Haiku | 0.3–1s | Good | Very Low |
| Local Mistral-7B | 0.1–0.5s | Moderate | No API fees (GPU required) |
```python
# Route simple queries to the fast, cheap model
def smart_model_routing(question: str, complexity_score: float) -> str:
    """Pick a model from a complexity score in [0, 1] that is computed upstream."""
    if complexity_score < 0.3:
        return "gpt-4o-mini"  # Simple factual queries
    # Moderate (0.3 to 0.7) and complex (0.7 and above) queries both use the larger model
    return "gpt-4o"
```
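The routing function takes a `complexity_score` but does not say where it comes from. One cheap option is a keyword and length heuristic, sketched below. Everything here is an assumption for illustration: `estimate_complexity`, `route`, the hint list, and the weights are hypothetical and untuned. In practice a small classifier or logged quality data would likely do better.

```python
import re

# Hypothetical markers of questions that tend to need multi-step reasoning.
REASONING_HINTS = ("why", "explain", "compare", "prove", "derive", "step by step", "trade-off")


def estimate_complexity(question: str) -> float:
    """Crude heuristic score in [0, 1]; weights are illustrative, not tuned."""
    text = question.lower()
    words = re.findall(r"\w+", text)
    length_score = min(len(words) / 100, 1.0)  # longer questions score higher
    hint_hits = sum(1 for hint in REASONING_HINTS if hint in text)
    hint_score = min(hint_hits / 3, 1.0)  # saturates after three hints
    return round(0.4 * length_score + 0.6 * hint_score, 3)


def route(question: str) -> str:
    """Send low-scoring questions to the small model, everything else to the large one."""
    return "gpt-4o-mini" if estimate_complexity(question) < 0.3 else "gpt-4o"
```

For example, `route("What is the capital of France?")` picks the small model, while a question containing "explain", "why", and "compare" crosses the threshold and goes to the large one.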
3. Caching
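```python
import hashlib

import redis

cache = redis.Redis()

def cached_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    """Return the cached response for an identical (model, prompt) pair, if any."""
    key = f"llm:{hashlib.md5(f'{model}:{prompt}'.encode()).hexdigest()}"
    cached = cache.get(key)
    if cached is not None:  # An empty string is still a valid cached value
        return cached.decode()
    # call_llm is your existing LLM wrapper, assumed to be defined elsewhere
    response = call_llm(prompt, model)
    cache.setex(key, 3600, response)  # Cache for 1 hour
    return response
```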
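An exact-match cache is only safe if the key covers everything that changes the output. Keying on the model and prompt alone would return the same entry for requests that differ in, say, temperature or system prompt. A minimal sketch of a more complete key follows; `make_cache_key` is a hypothetical helper, and which parameters to include is an assumption you should adapt to your own calls.

```python
import hashlib
import json


def make_cache_key(model: str, messages: list[dict], **params) -> str:
    """Build a deterministic key from the model, full message list, and sampling params."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the serialization independent of dict/kwarg ordering
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "llm:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    msgs = [{"role": "user", "content": "What is an LLM?"}]
    print(make_cache_key("gpt-4o", msgs, temperature=0, max_tokens=256))
```

Caching works best with `temperature=0` or other settings where returning a stored answer is acceptable.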
4. Reduce Token Count
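```python
import tiktoken
from openai import OpenAI

client = OpenAI()

# Shorter prompts cut time to first token; shorter outputs cut total generation time
def compress_context(documents: list[str], max_tokens: int = 2000) -> str:
    """Keep only the most relevant context that fits within the token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    kept, used = [], 0
    for doc in documents:  # Pre-ranked by relevance
        # Count each document once instead of re-encoding the whole context;
        # the per-document sum is a close approximation of the joined length
        n_tokens = len(enc.encode(doc + "\n\n"))
        if used + n_tokens > max_tokens:
            break
        kept.append(doc)
        used += n_tokens
    return "".join(doc + "\n\n" for doc in kept)

# Also: use max_tokens to cap response length
messages = [{"role": "user", "content": "Summarize the context in two sentences."}]  # placeholder
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=256,  # Don't generate more than needed
)
```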
5. Parallel Retrieval + Generation (Async)
```python
import asyncio
from openai import AsyncOpenAI

async def parallel_rag(questions: list[str]) -> list[str]:
    """Process multiple RAG queries concurrently."""
    client = AsyncOpenAI()

    async def single_query(q: str) -> str:
        # async_retrieve and build_messages are your own helpers, assumed defined elsewhere
        docs = await async_retrieve(q)
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=build_messages(q, docs),
        )
        return response.choices[0].message.content

    # All queries run concurrently, so total time is roughly that of the slowest
    # query, provided you stay within the provider's rate limits
    return await asyncio.gather(*(single_query(q) for q in questions))
```
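Launching every query at once can run into provider rate limits when the list is long. A common remedy is to cap the number of in-flight requests with a semaphore. The sketch below shows the pattern; `gather_bounded` and `fake_llm_call` are hypothetical names, and `fake_llm_call` stands in for a real async API call.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> list[str]:
    """Run worker over items concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        async with semaphore:  # waits here while max_concurrency calls are active
            return await worker(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_llm_call(question: str) -> str:
    """Stand-in for a real async LLM request."""
    await asyncio.sleep(0.01)
    return question.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_bounded(["a", "b", "c"], fake_llm_call, max_concurrency=2)))
```

With a cap in place, total time grows to roughly the number of queries divided by `max_concurrency`, multiplied by the typical per-query latency, so pick the largest cap your rate limit allows.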
6. Speculative Decoding (Advanced)
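Run a small "draft" model to propose several candidate tokens, then have the larger model verify them all in a single parallel forward pass. Every accepted token saves one sequential pass of the large model, which can speed up generation by roughly 2–3× when the draft model's guesses are usually accepted. The output matches what the large model would have produced on its own, so quality is unaffected; the cost is the extra memory and compute for the draft model.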
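The propose-and-verify loop can be sketched with toy models. This is a simplified greedy variant under stated assumptions: each "model" is just a function from a token list to the next token, `speculative_decode`, `target_model`, and `draft_model` are hypothetical names, and the verification loop stands in for what a real implementation does in one batched forward pass. Sampling-based variants use a rejection-sampling acceptance rule instead of exact matching, which is not shown here.

```python
from typing import Callable

NextToken = Callable[[list[str]], str]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: list[str],
    max_new_tokens: int,
    k: int = 4,
) -> tuple[list[str], int]:
    """Greedy speculative decoding. Returns (new_tokens, number_of_target_passes)."""
    out: list[str] = []
    target_passes = 0
    while len(out) < max_new_tokens:
        context = prompt + out
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: list[str] = []
        for _ in range(k):
            proposal.append(draft(context + proposal))
        # 2. The target model checks every proposed position. A real system
        #    scores all k positions in ONE forward pass; this toy loops over
        #    them but counts the whole check as a single pass.
        target_passes += 1
        accepted: list[str] = []
        for i in range(k):
            expected = target(context + proposal[:i])
            accepted.append(expected)  # always keep the target's own token
            if expected != proposal[i]:
                break  # first mismatch: discard the rest of the proposal
        else:
            # All k proposals matched; the same pass yields one bonus token.
            accepted.append(target(context + proposal))
        out.extend(accepted)
    return out[:max_new_tokens], target_passes


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_model(context: list[str]) -> str:
    """Toy 'large' model: recites SENTENCE based on position."""
    return SENTENCE[len(context) % len(SENTENCE)]


def draft_model(context: list[str]) -> str:
    """Toy 'small' model: agrees with the target except at every 4th position."""
    pos = len(context)
    return "???" if pos % 4 == 3 else SENTENCE[pos % len(SENTENCE)]


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [], max_new_tokens=9)
    print(" ".join(tokens), f"({passes} target passes instead of 9)")
```

Because the target's token is kept at every position, the result is identical to plain greedy decoding with the target model; only the number of sequential large-model passes changes, and it shrinks as the draft model's acceptance rate rises.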
7. KV Cache Reuse
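```python
# Prefix caching: if your prompts always start with the same content
# (system prompt, instructions, few-shot examples), OpenAI automatically
# caches the KV state for that repeated prefix once the prompt is
# 1024 tokens or longer. No code changes are needed: keep the static part
# first and identical across requests, and put per-request content last.
```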
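Prefix caching only helps when the leading part of the prompt is identical from request to request, so message ordering matters. The sketch below shows one way to keep static content first. The prompt text, `build_cache_friendly_messages`, and `shared_prefix_len` are hypothetical, and this example prompt is far shorter than the 1024-token threshold, so it illustrates the layout only. At the time of writing, OpenAI responses report cache hits under `usage.prompt_tokens_details.cached_tokens`, which is worth logging to confirm the cache is being used.

```python
# Static, identical on every request: this is the part that can be cached.
STATIC_SYSTEM_PROMPT = (
    "You are a concise support assistant. Answer only from the provided context."
)
STATIC_EXAMPLES = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings, choose Security, then select Reset password."},
]


def build_cache_friendly_messages(question: str, context: str) -> list[dict]:
    """Put static content first (cacheable prefix) and per-request content last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        *STATIC_EXAMPLES,
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Count the leading messages that are identical between two requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Avoid placing timestamps, user IDs, or retrieved documents ahead of the static block, since any difference early in the prompt invalidates the shared prefix after that point.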
Latency Reduction Summary
| Technique | Latency Reduction | Effort | Trade-off |
|---|---|---|---|
| Streaming | ~80% perceived | None | None |
| Smaller model | 50–80% | Low | Quality |
| Caching | ~100% on cache hits | Low | Freshness |
| Shorter prompts | 20–40% | Medium | Context |
| Async parallel | Up to N× for N independent queries | Medium | Complexity |
| Local model | 60–90% | High | GPU required |
Start here: Enable streaming first. It costs almost nothing and has the highest user-perceived impact, because users tolerate slow responses much better when they see tokens arriving.