Concept #48 · Medium · production-mlops

How would you reduce inference latency for an LLM application?

#gen-ai #mlops #llm

Answer

Reducing LLM Inference Latency

Latency is often the biggest UX bottleneck in LLM applications. Here are the most impactful techniques, roughly ordered from least to most effort.

1. Streaming (Free Win — Always Do This)

python
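from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def stream_response(prompt: str):
    '''Yield tokens as they are generated, which cuts perceived latency.'''
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Hypothetical endpoint that forwards the token stream to the client
@app.get("/chat")
def chat(prompt: str):
    return StreamingResponse(stream_response(prompt), media_type="text/plain")

# Illustrative numbers:
# Without streaming: user waits ~5s for the full response
# With streaming: user sees the first token in ~0.3s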
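To check that streaming actually helps in a given setup, it is useful to measure time to first token separately from total response time. The following is a minimal sketch using only the standard library; `fake_token_stream` is a hypothetical stand-in for a real streaming call, and the delay value is an assumption chosen for illustration.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(tokens: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call: one token per delay."""
    for token in tokens:
        time.sleep(delay_s)  # Simulates per-token generation time
        yield token


def measure_stream(stream: Iterable[str]) -> tuple[str, float, float]:
    """Consume a stream; return (text, time_to_first_token, total_time) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    # An empty stream has no first token; report the total time for both values
    ttft = first_token_at if first_token_at is not None else total
    return "".join(parts), ttft, total


text, ttft, total = measure_stream(fake_token_stream(["Hello", ",", " world", "!"]))
print(f"first token after {ttft:.3f}s, full response after {total:.3f}s")
```

Passing the generator from a real streaming client to `measure_stream` should work the same way, since it only needs an iterable of strings.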

2. Choose Faster/Smaller Models

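| Model | Typical p50 Latency | Quality | Cost |
|---|---|---|---|
| GPT-4o | 2–5s | Highest | High |
| GPT-4o-mini | 0.5–1.5s | Good | Low |
| Claude 3 Haiku | 0.3–1s | Good | Very Low |
| Local Mistral-7B | 0.1–0.5s | Moderate | No API fee (GPU needed) |
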
python
# Route simple queries to the fast/cheap model
def smart_model_routing(question: str, complexity_score: float) -> str:
    '''Pick a model from a precomputed complexity score in [0, 1].'''
    if complexity_score < 0.3:
        return "gpt-4o-mini"   # Simple factual queries
    return "gpt-4o"            # Moderate to complex reasoning
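The routing function takes a `complexity_score` but the text does not say where it comes from. One cheap option is a heuristic based on question length and reasoning cue words, sketched below. The cue list, the weights, and the function name `estimate_complexity` are all assumptions for illustration; a production router would more likely use a small classifier tuned on real traffic.

```python
import re

# Hypothetical cue words that tend to signal multi-step reasoning
REASONING_CUES = ("why", "compare", "explain", "analyze", "trade-off", "step by step", "prove")


def estimate_complexity(question: str) -> float:
    """Return a rough complexity score in [0, 1] from length and reasoning cues."""
    text = question.lower()
    words = re.findall(r"\w+", text)
    length_score = min(len(words) / 100, 1.0)  # Saturates at 100 words
    # Match cues at a word boundary so "improve" does not count as "prove"
    cue_hits = sum(1 for cue in REASONING_CUES if re.search(r"\b" + re.escape(cue), text))
    cue_score = min(cue_hits / 3, 1.0)  # Saturates at 3 cues
    return round(0.5 * length_score + 0.5 * cue_score, 3)


simple = estimate_complexity("What is the capital of France?")
hard = estimate_complexity(
    "Explain why quicksort beats mergesort in practice and compare their trade-offs step by step."
)
print(simple, hard)
```

The score is computed locally in microseconds, so the routing step itself adds no meaningful latency.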

3. Caching

python
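import hashlib

import redis

cache = redis.Redis()  # Assumes a Redis server on localhost:6379

def cached_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    '''Exact-match cache: only identical (model, prompt) pairs hit.'''
    digest = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    key = f"llm:{digest}"
    cached = cache.get(key)
    if cached is not None:  # An empty string is still a valid cached value
        return cached.decode()

    response = call_llm(prompt, model)  # call_llm: your LLM wrapper, assumed defined elsewhere
    cache.setex(key, 3600, response)  # Cache for 1 hour
    return response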
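The same pattern can be tried without a Redis server. The sketch below swaps in a small in-process cache with per-entry expiry and normalizes whitespace before hashing, so trivially different prompts share a key. `TTLCache`, `make_key`, and `cached_call` are hypothetical names, and an in-process dict is only a stand-in: it is not shared across workers the way Redis is.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process cache with per-entry expiry (stand-in for Redis GET/SETEX)."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # Expired: drop the entry and report a miss
            return None
        return value

    def setex(self, key: str, ttl_s: float, value: str) -> None:
        self._store[key] = (time.monotonic() + ttl_s, value)


def make_key(prompt: str, model: str) -> str:
    normalized = " ".join(prompt.split())  # Collapse runs of whitespace
    return "llm:" + hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()


def cached_call(cache: TTLCache, llm: Callable[[str, str], str],
                prompt: str, model: str, ttl_s: float = 3600.0) -> str:
    """Return a cached response if present, otherwise call llm and store the result."""
    key = make_key(prompt, model)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = llm(prompt, model)
    cache.setex(key, ttl_s, response)
    return response
```

Including the model name in the key keeps responses from different models apart, which matters once routing sends the same prompt to more than one model.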

4. Reduce Token Count

python
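import tiktoken

# Shorter prompts reduce prefill time, so the first token arrives sooner
def compress_context(documents: list[str], max_tokens: int = 2000) -> str:
    '''Keep only the most relevant context within the token budget.'''
    enc = tiktoken.encoding_for_model("gpt-4o")

    parts, used = [], 0
    for doc in documents:  # Assumed pre-ranked by relevance
        n_tokens = len(enc.encode(doc))
        if used + n_tokens > max_tokens:
            break
        parts.append(doc)
        used += n_tokens  # Count each document once; separators add a few extra tokens
    return "\n\n".join(parts)

# Also: use max_tokens to cap response length. Output tokens are generated
# one at a time, so response length usually dominates total latency.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # messages: the chat history built for this request
    max_tokens=256,  # Don't generate more than needed
)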

5. Parallel Retrieval + Generation (Async)

python
import asyncio
from openai import AsyncOpenAI

# async_retrieve and build_messages are assumed to be defined elsewhere
async def parallel_rag(questions: list[str]) -> list[str]:
    '''Process multiple RAG queries concurrently.'''
    client = AsyncOpenAI()

    async def single_query(q: str) -> str:
        docs = await async_retrieve(q)
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=build_messages(q, docs),
        )
        return response.choices[0].message.content

    # Queries overlap, so total time approaches the slowest single query
    # (rate limits permitting) instead of the sum of all of them
    return await asyncio.gather(*[single_query(q) for q in questions])
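An unbounded `gather` can trip provider rate limits once the batch grows. A common refinement is to cap the number of in-flight requests with a semaphore. The sketch below shows the idea with `asyncio.sleep` standing in for retrieval plus generation; `fake_llm_call`, `bounded_gather`, and the default concurrency limit are assumptions for illustration.

```python
import asyncio
import time


async def fake_llm_call(question: str, delay_s: float) -> str:
    """Hypothetical stand-in for retrieval plus generation; sleeps instead of calling an API."""
    await asyncio.sleep(delay_s)
    return f"answer: {question}"


async def bounded_gather(questions: list[str], delay_s: float = 0.05,
                         max_concurrency: int = 4) -> list[str]:
    """Run all queries concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(q: str) -> str:
        async with semaphore:  # Waits here when the limit is reached
            return await fake_llm_call(q, delay_s)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(guarded(q) for q in questions)))


def run_demo() -> tuple[list[str], float]:
    """Run four simulated queries and return (answers, elapsed_seconds)."""
    questions = ["q1", "q2", "q3", "q4"]
    start = time.perf_counter()
    answers = asyncio.run(bounded_gather(questions))
    return answers, time.perf_counter() - start
```

With the limit set at or above the batch size, elapsed time stays close to one call's duration. Lowering the limit trades latency for staying under rate limits.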

6. Speculative Decoding (Advanced)

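Run a small "draft" model to propose several candidate tokens, then have the larger model verify them all in a single forward pass. Every accepted token saves one sequential step of the large model, and the final output matches what the large model would have produced on its own. Reported speedups are roughly 2–4×, depending on how often the draft model's guesses are accepted.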
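The draft-then-verify loop can be sketched with toy models. This is a minimal greedy variant under stated assumptions: `target_model` and `draft_model` are hypothetical deterministic functions, not real networks, and verification is written as a loop where a real system would score all drafted positions in one batched forward pass. Sampling-based variants use a rejection step instead of exact matching, which is not shown here.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # Maps a token prefix to the next token (greedy)


def target_model(prefix: List[int]) -> int:
    """Toy 'large' model: next token is (sum of prefix + length) mod 10."""
    return (sum(prefix) + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target unless the prefix length is a multiple of 4."""
    guess = target_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def speculative_decode(prompt: List[int], n_new: int, k: int = 3,
                       draft: Model = draft_model, target: Model = target_model):
    """Generate n_new tokens; return (tokens, number_of_target_verification_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap)
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position; counted as one pass
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # Replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(tokens + accepted))  # All k accepted: one bonus token
        tokens.extend(accepted)
    return tokens[:goal], passes


def greedy_decode(prompt: List[int], n_new: int, target: Model = target_model) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens
```

Each pass yields at least one token that the target model itself would have chosen, so the output equals plain greedy decoding while needing fewer sequential target passes whenever the draft agrees.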

7. KV Cache Reuse

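If every request starts with the same long prefix (system prompt, few-shot examples, tool definitions), the provider can reuse the KV state already computed for that prefix and skip most of the prefill work. OpenAI applies this prompt caching automatically to prompts of 1,024 tokens or more, so no code change is needed. Keep the static content at the start of the prompt and put per-request content (user question, retrieved documents) at the end.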
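Prefix caching only helps when requests really do share a leading prefix, so prompt layout matters. The sketch below keeps the static instructions first and the per-request content last, then measures how many leading characters two prompts share. The prompt text and the names `build_cache_friendly_messages` and `shared_prefix_chars` are assumptions for illustration; real caches match on tokens, so character counts are only a rough proxy.

```python
import os

# Hypothetical static instructions; in practice this is the long, unchanging part
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for Acme. "
    "Answer concisely and cite the policy section you relied on."
)


def build_cache_friendly_messages(user_question: str, retrieved_context: str) -> list[dict[str, str]]:
    """Put static content first and per-request content last so requests share a prefix."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # Identical across requests
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_question}"},
    ]


def shared_prefix_chars(a: list[dict[str, str]], b: list[dict[str, str]]) -> int:
    """Number of leading characters two flattened prompts have in common."""
    flat_a = "".join(m["content"] for m in a)
    flat_b = "".join(m["content"] for m in b)
    return len(os.path.commonprefix([flat_a, flat_b]))


first = build_cache_friendly_messages("How do refunds work?", "Refund policy text")
second = build_cache_friendly_messages("Where is my order?", "Shipping policy text")
print(shared_prefix_chars(first, second) >= len(STATIC_SYSTEM_PROMPT))
```

Placing a timestamp or user ID at the very start of the system prompt would shrink the shared prefix to almost nothing, which is the layout to avoid.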

Latency Reduction Summary

| Technique | Latency Reduction | Effort | Trade-off |
|---|---|---|---|
| Streaming | ~80% perceived | None | None |
| Smaller model | 50–80% | Low | Quality |
| Caching | 100% for cached | Low | Freshness |
| Shorter prompts | 20–40% | Medium | Context |
| Async parallel | N× for N queries | Medium | Complexity |
| Local model | 60–90% | High | GPU required |

Start here: Enable streaming first. It costs nothing and has the highest user-perceived impact — users tolerate slow responses much better when they see tokens arriving.