How would you monitor a deployed LLM application?

Question

Accepted Answer

## Monitoring a Deployed LLM Application

Monitoring an LLM app requires tracking both **infrastructure metrics** (latency, errors) and **model quality metrics** (faithfulness, hallucinations). The quality metrics are specific to Gen AI applications and are not covered by conventional service monitoring.

### The Four Monitoring Layers

```mermaid
graph TD
    A[Infrastructure Metrics] --> M[Monitoring Dashboard]
    B[LLM Quality Metrics] --> M
    C[Business Metrics] --> M
    D[Safety & Security] --> M
    M --> AL[Alerting]
    AL --> ON[On-call]
```

### 1. Infrastructure Metrics

```python
import time
import structlog
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Prometheus metrics (call start_http_server(<port>) once at startup to expose them)
REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
REQUEST_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency",
                            ["model"], buckets=[0.5, 1, 2, 5, 10, 30, 60])
TOKEN_USAGE = Counter("llm_tokens_total", "Total tokens used", ["model", "type"])
ACTIVE_REQUESTS = Gauge("llm_active_requests", "Currently active LLM requests")

logger = structlog.get_logger()

def monitored_llm_call(func):
    """Record request count, latency, and in-flight requests for an LLM call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # The model label is read from the keyword argument, if one was passed
        model = kwargs.get("model", "unknown")
        ACTIVE_REQUESTS.inc()
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(model=model, status="success").inc()
            REQUEST_LATENCY.labels(model=model).observe(duration)
            logger.info("llm_call_success", model=model, latency_ms=duration * 1000)
            return result
        except Exception as e:
            REQUEST_COUNT.labels(model=model, status="error").inc()
            logger.error("llm_call_failed", model=model, error=str(e))
            raise
        finally:
            # Runs on both success and failure, so the gauge never leaks
            ACTIVE_REQUESTS.dec()
    return wrapper
```

### 2. LLM Quality Metrics (Most Critical)

```python
import json
import random

import structlog
from openai import OpenAI

client = OpenAI()
logger = structlog.get_logger()

def evaluate_response_quality(question: str, answer: str, context: str) -> dict:
    '''Use LLM-as-judge to score response quality.'''
    eval_prompt = f'''Score this RAG response on a scale 1-5 for each criterion.
Respond as JSON: {{"faithfulness": 1-5, "relevance": 1-5, "completeness": 1-5, "hallucination": "yes/no"}}

Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}'''
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Log quality metrics for sampled requests (e.g., 5% sampling)
def log_with_quality_eval(question: str, answer: str, context: str):
    if random.random() < 0.05:  # 5% sampling to control eval costs
        scores = evaluate_response_quality(question, answer, context)
        logger.info("quality_eval", **scores, question=question[:100])
```

### 3. Key Metrics & Alerts

| Metric | Alert Threshold | Action |
|--------|----------------|--------|
| **p99 latency** | > 10s | Page on-call |
| **Error rate** | > 2% | Page on-call |
| **Faithfulness score** | < 3.5 avg | Slack alert |
| **Hallucination rate** | > 5% | Immediate review |
| **Cost per hour** | > $50 | Slack alert |
| **Token usage spike** | 3× baseline | Investigate |

### 4. Observability Stack

```python
# LangSmith: automatic tracing of LangChain calls
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "production-rag"

# Every chain.invoke() is automatically traced with:
# - Input/output at each step
# - Token usage per step
# - Latency breakdown
# - Feedback collection
```

### 5. User Feedback Loop

```python
import structlog
from fastapi import FastAPI  # assumed framework; the @app.post route style matches FastAPI

app = FastAPI()
logger = structlog.get_logger()

# Collect thumbs up/down on responses
@app.post("/feedback")
async def submit_feedback(request_id: str, rating: int, comment: str = ""):
    logger.info(
        "user_feedback",
        request_id=request_id,
        rating=rating,
        comment=comment,
    )
    # Store in database for weekly quality review
```

> **Production principle:** Start with LangSmith for tracing and Prometheus for infrastructure. Add LLM-as-judge quality evaluation at 5–10% sampling. Set up a weekly quality review to catch slow degradation that point-in-time alerts miss.
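
The quality thresholds in the alert table and the weekly review could be wired together roughly as follows. This is a minimal sketch, not part of the original answer: the `QualityEval` record shape and the `summarize`, `check_alerts`, and `weekly_drift` helpers are hypothetical names, and the code assumes the sampled judge outputs have already been loaded from wherever the `quality_eval` log events are stored.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Thresholds copied from the "Key Metrics & Alerts" table.
FAITHFULNESS_MIN_AVG = 3.5
HALLUCINATION_MAX_RATE = 0.05


@dataclass
class QualityEval:
    """One sampled LLM-as-judge result (hypothetical record shape)."""
    faithfulness: int   # 1-5 score from the judge
    hallucination: str  # "yes" or "no", as returned by the judge


def summarize(evals: List[QualityEval]) -> dict:
    """Aggregate a window of sampled evaluations into the two alertable metrics."""
    if not evals:
        return {"count": 0, "avg_faithfulness": None, "hallucination_rate": None}
    flagged = sum(1 for e in evals if e.hallucination.strip().lower() == "yes")
    return {
        "count": len(evals),
        "avg_faithfulness": mean(e.faithfulness for e in evals),
        "hallucination_rate": flagged / len(evals),
    }


def check_alerts(summary: dict) -> List[str]:
    """Return the names of breached thresholds; an empty list means no alert."""
    if summary["count"] == 0:
        return []
    alerts = []
    if summary["avg_faithfulness"] < FAITHFULNESS_MIN_AVG:
        alerts.append("faithfulness_below_threshold")
    if summary["hallucination_rate"] > HALLUCINATION_MAX_RATE:
        alerts.append("hallucination_rate_above_threshold")
    return alerts


def weekly_drift(previous: dict, current: dict) -> Optional[float]:
    """Week-over-week change in average faithfulness (negative means degradation)."""
    if not previous["count"] or not current["count"]:
        return None
    return current["avg_faithfulness"] - previous["avg_faithfulness"]
```

Comparing two weekly summaries with `weekly_drift` is one way to surface a slow decline that never crosses a point-in-time threshold. With a 5% sample, small windows give noisy rates, so a minimum sample count before alerting is probably worth adding.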


