Concept #46 · Medium · production-mlops

How would you monitor a deployed LLM application?

#gen-ai #mlops

Answer

Monitoring a Deployed LLM Application

Monitoring an LLM app requires tracking both infrastructure metrics (latency, errors) and model quality metrics (faithfulness, hallucinations). The quality metrics are what set Gen AI monitoring apart from conventional service monitoring.

The Five Monitoring Layers

1. Infrastructure Metrics

python
import time
import structlog
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Prometheus metrics
REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
REQUEST_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency",
                            ["model"], buckets=[0.5, 1, 2, 5, 10, 30, 60])
TOKEN_USAGE = Counter("llm_tokens_total", "Total tokens used", ["model", "type"])
ACTIVE_REQUESTS = Gauge("llm_active_requests", "Currently active LLM requests")

logger = structlog.get_logger()

def monitored_llm_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        model = kwargs.get("model", "unknown")
        ACTIVE_REQUESTS.inc()
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(model=model, status="success").inc()
            REQUEST_LATENCY.labels(model=model).observe(duration)
            logger.info("llm_call_success", model=model, latency_ms=duration*1000)
            return result
        except Exception as e:
            REQUEST_COUNT.labels(model=model, status="error").inc()
            logger.error("llm_call_failed", model=model, error=str(e))
            raise
        finally:
            ACTIVE_REQUESTS.dec()
    return wrapper
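
The TOKEN_USAGE counter above is declared but never incremented, and the alert table later relies on a cost-per-hour figure. A minimal sketch of how token counts might be turned into a running cost, assuming a hypothetical model name and hypothetical per-token prices (substitute your provider's real rates):

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; these are NOT real provider rates.
PRICE_PER_MILLION = {
    "example-model": {"prompt": 0.50, "completion": 1.50},
}


@dataclass
class UsageTracker:
    """Accumulates token counts and estimated spend across calls."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Add one call's usage and return that call's estimated cost in USD."""
        prices = PRICE_PER_MILLION[model]
        cost = (
            prompt_tokens * prices["prompt"]
            + completion_tokens * prices["completion"]
        ) / 1_000_000
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.cost_usd += cost
        return cost
```

In the Prometheus setup above, the same two counts could also be fed to `TOKEN_USAGE.labels(model=model, type="prompt").inc(prompt_tokens)` and the matching `type="completion"` call, so that token spikes show up on dashboards.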

2. LLM Quality Metrics (Most Critical)

python
import json

from openai import OpenAI

client = OpenAI()

def evaluate_response_quality(question: str, answer: str, context: str) -> dict:
    '''Use LLM-as-judge to score response quality.'''
    eval_prompt = f'''Score this RAG response on a scale 1-5 for each criterion.
Respond as JSON: {{"faithfulness": 1-5, "relevance": 1-5, "completeness": 1-5, "hallucination": "yes/no"}}

Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}'''

    # temperature=0 keeps the judge's scores as repeatable as possible
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Log quality metrics for sampled requests (e.g., 5% sampling)
import random

def log_with_quality_eval(question: str, answer: str, context: str):
    if random.random() < 0.05:  # 5% sampling to control eval costs
        scores = evaluate_response_quality(question, answer, context)
        logger.info("quality_eval", **scores, question=question[:100])
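
Judge output is itself model-generated, so it can be malformed or out of range. The sketch below shows one way to validate the JSON before logging it, and to roll sampled evaluations up into an average faithfulness score and a hallucination rate. The function names `parse_judge_output` and `summarize` are illustrative, not part of any library:

```python
import json
from statistics import mean
from typing import List, Optional

SCORE_KEYS = ("faithfulness", "relevance", "completeness")


def parse_judge_output(raw: str) -> Optional[dict]:
    """Return validated scores, or None if the judge output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        # Reject missing, non-numeric (including bool), or out-of-range scores.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 1 <= value <= 5:
            return None
        scores[key] = float(value)
    flag = str(data.get("hallucination", "")).strip().lower()
    if flag not in ("yes", "no"):
        return None
    scores["hallucination"] = flag == "yes"
    return scores


def summarize(evals: List[dict]) -> dict:
    """Aggregate validated evaluations into the figures used for alerting."""
    if not evals:
        return {"count": 0}
    return {
        "count": len(evals),
        "avg_faithfulness": mean(e["faithfulness"] for e in evals),
        "hallucination_rate": sum(e["hallucination"] for e in evals) / len(evals),
    }
```

Counting how often `parse_judge_output` returns None is a useful signal in its own right: a rising rejection rate suggests the judge prompt or judge model has changed behavior.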

3. Key Metrics & Alerts

| Metric | Alert Threshold | Action |
| --- | --- | --- |
| p99 latency | > 10 s | Page on-call |
| Error rate | > 2% | Page on-call |
| Faithfulness score | < 3.5 avg | Slack alert |
| Hallucination rate | > 5% | Immediate review |
| Cost per hour | > $50 | Slack alert |
| Token usage spike | 3× baseline | Investigate |
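
In practice these rules would usually live in an alerting system such as Prometheus Alertmanager, but the logic is simple enough to sketch in plain Python. The snippet below encodes the thresholds from the table and checks a metrics snapshot against them. The metric key names and the `check_alerts` helper are assumptions made for illustration, and the token-spike rule is read as "at least 3× baseline":

```python
import math
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge}

# (metric key, comparison, threshold, action), mirroring the table above.
ALERT_RULES = [
    ("p99_latency_s", ">", 10.0, "Page on-call"),
    ("error_rate", ">", 0.02, "Page on-call"),
    ("avg_faithfulness", "<", 3.5, "Slack alert"),
    ("hallucination_rate", ">", 0.05, "Immediate review"),
    ("cost_per_hour_usd", ">", 50.0, "Slack alert"),
    ("token_usage_ratio", ">=", 3.0, "Investigate"),  # current / baseline
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence; pct in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(snapshot):
    """Return (metric, action) pairs for every rule the snapshot breaches."""
    fired = []
    for name, op, threshold, action in ALERT_RULES:
        if name not in snapshot:
            continue  # metric not reported in this window
        if OPS[op](snapshot[name], threshold):
            fired.append((name, action))
    return fired
```

A snapshot would typically be computed over a sliding window (for example the last five minutes of request latencies passed through `percentile(latencies, 99)`) rather than over all traffic since startup.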

4. Observability Stack

python
# LangSmith — automatic tracing of LangChain calls
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "production-rag"

# Every chain.invoke() is automatically traced with:
# - Input/output at each step
# - Token usage per step
# - Latency breakdown
# - Feedback collection

5. User Feedback Loop

python
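# Collect thumbs up/down on responses
# Assumes a FastAPI app; adapt the decorator to your web framework.
import structlog
from fastapi import FastAPI

app = FastAPI()
logger = structlog.get_logger()

@app.post("/feedback")
async def submit_feedback(request_id: str, rating: int, comment: str = ""):
    # request_id links the feedback back to the traced LLM call
    logger.info("user_feedback",
        request_id=request_id,
        rating=rating,
        comment=comment,
    )
    # Store in database for weekly quality review
    return {"status": "ok"}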

Production principle: Start with LangSmith for tracing and Prometheus for infrastructure metrics. Add LLM-as-judge quality evaluation at 5–10% sampling. Set up a weekly quality review to catch slow degradation that point-in-time alerts miss.
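
To make the last point concrete, here is a rough sketch of one way a weekly review could flag slow drift: compare the latest weekly average against a reference baseline taken from the first few weeks after launch. The function name, the four-week baseline, and the 0.3-point tolerance are all illustrative assumptions, not recommended values:

```python
from statistics import mean


def detect_slow_degradation(weekly_avgs, baseline_weeks=4, max_drop=0.3):
    """Return True if the latest weekly average has fallen more than
    max_drop below the mean of the first baseline_weeks weeks."""
    if len(weekly_avgs) <= baseline_weeks:
        return False  # not enough history beyond the baseline period
    baseline = mean(weekly_avgs[:baseline_weeks])
    return baseline - weekly_avgs[-1] > max_drop


# Faithfulness averages that decline a little every week.
drifting = [4.5, 4.4, 4.5, 4.4, 4.3, 4.2, 4.1, 4.0]
stable = [4.5, 4.4, 4.5, 4.4, 4.4, 4.5]
```

Every value in `drifting` stays above the 3.5 faithfulness threshold, so a point-in-time alert would never fire, yet `detect_slow_degradation(drifting)` flags the trend while `detect_slow_degradation(stable)` does not.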