How would you reduce inference latency for an LLM application?

Question

Accepted Answer

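## Reducing LLM Inference Latency

Latency is often the biggest UX bottleneck in LLM applications. Here are the most impactful techniques, ordered roughly by effort.

### 1. Streaming (Free Win: Always Do This)

Streaming does not shorten total generation time. It shortens the time until the user sees the first token, which is what most users perceive as latency.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

def stream_response(prompt: str):
    '''Yield tokens as they are generated to cut perceived latency.'''
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # Some chunks (e.g. usage-only) carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:
            yield token

@app.get("/chat")
def chat(prompt: str):
    # Forward tokens to the HTTP client as soon as they arrive
    return StreamingResponse(stream_response(prompt), media_type="text/plain")

# Without streaming: the user waits for the full response (e.g. ~5s)
# With streaming: the user sees the first token almost immediately (e.g. ~0.3s)
```

### 2. Choose Faster/Smaller Models

The figures below are rough ballparks. Actual latency depends on prompt length, output length, provider load, and (for local models) hardware.

| Model | Typical p50 Latency | Quality | Cost |
|-------|---------------------|---------|------|
| GPT-4o | 2–5s | Highest | High |
| GPT-4o-mini | 0.5–1.5s | Good | Low |
| Claude 3 Haiku | 0.3–1s | Good | Very Low |
| Local Mistral-7B | 0.1–0.5s | Moderate | No API fee (hardware cost) |

```python
# Route simple queries to the fast/cheap model
def smart_model_routing(question: str, complexity_score: float) -> str:
    '''Pick a model from a complexity score in [0, 1].

    The score is assumed to come from an upstream classifier or heuristic.
    '''
    if complexity_score < 0.3:
        return "gpt-4o-mini"  # Simple factual queries
    return "gpt-4o"           # Moderate complexity and complex reasoning
```

### 3. Caching

Exact-match caching only helps when the same prompt recurs, and it works best for deterministic requests (for example, `temperature=0`).

```python
import hashlib

import redis
from openai import OpenAI

cache = redis.Redis()
client = OpenAI()

def call_llm(prompt: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

def cached_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    # Key on model + prompt so different models never share entries
    key = f"llm:{hashlib.sha256(f'{model}:{prompt}'.encode()).hexdigest()}"
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()
    response = call_llm(prompt, model)
    cache.setex(key, 3600, response)  # Cache for 1 hour
    return response
```

### 4. Reduce Token Count

Input tokens mainly affect time to first token. Output tokens are generated one at a time, so they dominate total latency.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Shorter prompts = faster responses
def compress_context(documents: list[str], max_tokens: int = 2000) -> str:
    '''Keep only the most relevant context within a token budget.'''
    kept, used = [], 0
    for doc in documents:  # Assumed pre-ranked by relevance, best first
        n_tokens = len(enc.encode(doc))  # Per-document count (approximate total)
        if used + n_tokens > max_tokens:
            break
        kept.append(doc)
        used += n_tokens
    return " ".join(kept)

# Also: use max_tokens to cap response length
# (`client` and `messages` are assumed to be defined as in the earlier examples)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=256,  # Don't generate more than needed
)
```

### 5. Parallel Retrieval + Generation (Async)

```python
import asyncio

from openai import AsyncOpenAI

async def parallel_rag(questions: list[str]) -> list[str]:
    '''Process multiple RAG queries concurrently.

    `async_retrieve` and `build_messages` are placeholders for your own
    retrieval and prompt-building functions.
    '''
    client = AsyncOpenAI()

    async def single_query(q: str) -> str:
        docs = await async_retrieve(q)
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=build_messages(q, docs),
        )
        return response.choices[0].message.content or ""

    # Queries run concurrently: total time is roughly that of the slowest
    # single query, subject to provider rate limits
    return await asyncio.gather(*[single_query(q) for q in questions])
```

### 6. Speculative Decoding (Advanced)

Run a small "draft" model to propose several candidate tokens, then verify them in parallel with the larger model in a single forward pass. Accepted tokens cost one large-model pass instead of one pass each, and the verification step keeps the output consistent with what the large model would have produced on its own. Reported speedups are often in the 2–4× range, depending on how often the draft's tokens are accepted. This applies mainly when you host the model yourself.

### 7. KV Cache Reuse

```python
# Prefix caching: if your prompts always start with the same content,
# OpenAI automatically caches the KV state for the repeated prefix
# (prompts of 1,024 tokens or more). No code needed: keep the system prompt
# and other static content first and put the variable content last.
```

### Latency Reduction Summary

| Technique | Latency Reduction (rough) | Effort | Trade-off |
|-----------|---------------------------|--------|-----------|
| **Streaming** | ~80% perceived | None | None |
| **Smaller model** | 50–80% | Low | Quality |
| **Caching** | ~100% on cache hits | Low | Freshness |
| **Shorter prompts** | 20–40% | Medium | Context |
| **Async parallel** | Up to N× for N independent queries | Medium | Complexity |
| **Local model** | 60–90% | High | GPU required |

> **Start here:** Enable streaming first. It costs nothing and has the highest user-perceived impact, because users tolerate slow responses much better when they see tokens arriving.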
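
Speculative decoding (technique 6) is the only item above described without code, so here is a toy sketch of the greedy variant. Everything in it is an illustrative assumption rather than a real serving API: `target_model` and `draft_model` are hypothetical stand-ins that map a token sequence to the next token, and the per-position verification loop simulates what a real system would do in one batched forward pass of the large model. The optional "bonus" token that real implementations take after a fully accepted draft is omitted for brevity.

```python
from typing import Callable

# Hypothetical stand-in for a model: token sequence -> next token (greedy)
NextToken = Callable[[list[int]], int]


def greedy_decode(target: NextToken, prompt: list[int], max_new_tokens: int) -> list[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: list[int],
    max_new_tokens: int,
    k: int = 4,
) -> tuple[list[int], int]:
    """Greedy speculative decoding.

    Returns the generated sequence and the number of verification rounds,
    each of which would be a single large-model forward pass in practice.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks every proposed position (simulated
        #    here position by position; batched in a real system).
        rounds += 1
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the accepted prefix, take the
                # target's own token, and discard the rest of the draft.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # Every drafted token was accepted
    return tokens, rounds


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_model(seq: list[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: list[int]) -> int:
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = speculative_decode(target_model, draft_model, [0], max_new_tokens=12)
    print(out, rounds)
```

Because a mismatch always falls back to the target's own token, the output matches plain greedy decoding with the target model. The saving comes from needing fewer verification rounds than generated tokens, and it shrinks as the draft model's acceptance rate drops.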
