Answer
Monitoring a Deployed LLM Application
Monitoring an LLM app requires tracking both infrastructure metrics (latency, errors) and model quality metrics (faithfulness, hallucinations). The latter are unique to generative AI applications.
The Five Monitoring Layers
1. Infrastructure Metrics
```python
import time
import structlog
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Prometheus metrics
REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
REQUEST_LATENCY = Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    ["model"],
    buckets=[0.5, 1, 2, 5, 10, 30, 60],
)
TOKEN_USAGE = Counter("llm_tokens_total", "Total tokens used", ["model", "type"])
ACTIVE_REQUESTS = Gauge("llm_active_requests", "Currently active LLM requests")

logger = structlog.get_logger()


def monitored_llm_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        model = kwargs.get("model", "unknown")
        ACTIVE_REQUESTS.inc()
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            duration = time.perf_counter() - start
            REQUEST_COUNT.labels(model=model, status="success").inc()
            REQUEST_LATENCY.labels(model=model).observe(duration)
            logger.info("llm_call_success", model=model, latency_ms=duration * 1000)
            return result
        except Exception as e:
            REQUEST_COUNT.labels(model=model, status="error").inc()
            logger.error("llm_call_failed", model=model, error=str(e))
            raise
        finally:
            ACTIVE_REQUESTS.dec()

    return wrapper
```
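The `TOKEN_USAGE` counter above is declared but never incremented. The following is a minimal sketch of how token counts might be pulled from a response and turned into a cost estimate. It assumes the response exposes `usage.prompt_tokens` and `usage.completion_tokens`, as OpenAI chat completion responses do. The helper names `extract_token_counts` and `estimate_cost_usd` are hypothetical, and the prices are placeholders rather than real pricing.

```python
from types import SimpleNamespace


def extract_token_counts(response) -> dict:
    """Return prompt/completion token counts, or zeros if usage is missing."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return {"prompt": 0, "completion": 0}
    return {
        "prompt": getattr(usage, "prompt_tokens", 0) or 0,
        "completion": getattr(usage, "completion_tokens", 0) or 0,
    }


def estimate_cost_usd(counts: dict, prompt_price_per_m: float,
                      completion_price_per_m: float) -> float:
    """Estimate one request's cost from token counts and per-million-token prices."""
    total = (counts["prompt"] * prompt_price_per_m
             + counts["completion"] * completion_price_per_m)
    return total / 1_000_000


# Stand-in response object, for illustration only.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=300)
)
counts = extract_token_counts(fake_response)

# Placeholder prices (assumption): $1.00 / $2.00 per million prompt / completion tokens.
cost = estimate_cost_usd(counts, prompt_price_per_m=1.0, completion_price_per_m=2.0)
```

Inside the wrapper, the counts could then feed the counter with `TOKEN_USAGE.labels(model=model, type="prompt").inc(counts["prompt"])`, and likewise for `type="completion"`. Summing the per-request estimates over an hour gives a cost-per-hour figure to alert on.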
2. LLM Quality Metrics (Most Critical)
```python
import json
import random

import structlog
from openai import OpenAI

client = OpenAI()
logger = structlog.get_logger()


def evaluate_response_quality(question: str, answer: str, context: str) -> dict:
    '''Use LLM-as-judge to score response quality.'''
    eval_prompt = f'''Score this RAG response on a scale 1-5 for each criterion.
Respond as JSON: {{"faithfulness": 1-5, "relevance": 1-5, "completeness": 1-5, "hallucination": "yes/no"}}

Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}'''

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


# Log quality metrics for sampled requests (e.g., 5% sampling)
def log_with_quality_eval(question: str, answer: str, context: str):
    if random.random() < 0.05:  # 5% sampling to control eval costs
        scores = evaluate_response_quality(question, answer, context)
        logger.info("quality_eval", **scores, question=question[:100])
```
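The judge returns one score dictionary per sampled request, while alerting on average faithfulness or hallucination rate needs an aggregate. One possible approach is a rolling window over recent results that also rejects malformed judge output. This is only a sketch: `QualityWindow` is a hypothetical helper, and the window size is an arbitrary choice.

```python
from collections import deque
from typing import Optional


class QualityWindow:
    """Rolling window over the most recent judge results (hypothetical helper)."""

    def __init__(self, maxlen: int = 200):
        # Each entry is a (faithfulness, hallucinated) pair; old entries fall off.
        self._scores = deque(maxlen=maxlen)

    def add(self, scores: dict) -> bool:
        """Validate and store one judge result; return False if it is malformed."""
        try:
            faithfulness = int(scores["faithfulness"])
            hallucinated = str(scores["hallucination"]).strip().lower() == "yes"
        except (KeyError, TypeError, ValueError):
            return False
        if not 1 <= faithfulness <= 5:
            return False
        self._scores.append((faithfulness, hallucinated))
        return True

    def faithfulness_avg(self) -> Optional[float]:
        """Mean faithfulness over the window, or None if the window is empty."""
        if not self._scores:
            return None
        return sum(f for f, _ in self._scores) / len(self._scores)

    def hallucination_rate(self) -> Optional[float]:
        """Fraction of windowed results flagged as hallucinated, or None if empty."""
        if not self._scores:
            return None
        return sum(1 for _, h in self._scores if h) / len(self._scores)


window = QualityWindow(maxlen=3)
ok_1 = window.add({"faithfulness": 5, "relevance": 4, "completeness": 4, "hallucination": "no"})
ok_2 = window.add({"faithfulness": 3, "relevance": 3, "completeness": 2, "hallucination": "yes"})
ok_3 = window.add({"faithfulness": "oops"})  # malformed judge output is rejected
```

Tracking how often `add` returns `False` is also worthwhile, since a rising rejection rate suggests the judge prompt or model has stopped producing the expected JSON.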
3. Key Metrics & Alerts
| Metric | Alert Threshold | Action |
|---|---|---|
| p99 latency | > 10s | Page on-call |
| Error rate | > 2% | Page on-call |
| Faithfulness score | < 3.5 avg | Slack alert |
| Hallucination rate | > 5% | Immediate review |
| Cost per hour | > $50 | Slack alert |
| Token usage spike | 3× baseline | Investigate |
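The thresholds in the table can be expressed as data and checked against a metrics snapshot. The sketch below assumes the metrics have already been aggregated elsewhere. The metric key names are hypothetical, rates are expressed as fractions (2% is `0.02`), and treating "3× baseline" as "at least 3×" is an assumption.

```python
import operator
from typing import Callable, NamedTuple


class AlertRule(NamedTuple):
    metric: str
    breached: Callable[[float, float], bool]  # compares (value, threshold)
    threshold: float
    action: str


# Thresholds copied from the table; metric keys are hypothetical names.
RULES = [
    AlertRule("p99_latency_s", operator.gt, 10.0, "Page on-call"),
    AlertRule("error_rate", operator.gt, 0.02, "Page on-call"),
    AlertRule("faithfulness_avg", operator.lt, 3.5, "Slack alert"),
    AlertRule("hallucination_rate", operator.gt, 0.05, "Immediate review"),
    AlertRule("cost_per_hour_usd", operator.gt, 50.0, "Slack alert"),
    AlertRule("token_usage_vs_baseline", operator.ge, 3.0, "Investigate"),
]


def p99(samples: list) -> float:
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def triggered_alerts(snapshot: dict) -> list:
    """Return (metric, action) for every rule whose threshold is breached."""
    return [
        (rule.metric, rule.action)
        for rule in RULES
        if rule.metric in snapshot
        and rule.breached(snapshot[rule.metric], rule.threshold)
    ]


latencies = [1.0] * 98 + [12.0, 15.0]  # 100 illustrative samples, in seconds
snapshot = {
    "p99_latency_s": p99(latencies),
    "error_rate": 0.01,
    "faithfulness_avg": 3.2,
    "hallucination_rate": 0.05,  # exactly at the threshold, so not breached
}
alerts = triggered_alerts(snapshot)
```

In a Prometheus setup these conditions would more commonly live in alerting rules evaluated by the server. The Python form is mainly useful for the quality metrics that are computed in application code.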
4. Observability Stack
```python
# LangSmith: automatic tracing of LangChain calls
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "production-rag"

# Every chain.invoke() is automatically traced with:
# - Input/output at each step
# - Token usage per step
# - Latency breakdown
# - Feedback collection
```
5. User Feedback Loop
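```python
# Collect thumbs up/down on responses.
# Assumes `app` is a FastAPI-style application object and `logger` is the
# structlog logger created in the infrastructure metrics example.
@app.post("/feedback")
async def submit_feedback(request_id: str, rating: int, comment: str = ""):
    logger.info(
        "user_feedback",
        request_id=request_id,
        rating=rating,
        comment=comment,
    )
    # Store in database for weekly quality review
```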
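The endpoint leaves the database write as a comment. A minimal sketch of that step using SQLite from the standard library is shown here. The table layout, the function names, and the rating encoding (1 for thumbs up, 0 for thumbs down) are all assumptions, and a production service would likely use its existing database instead.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def init_feedback_db(conn: sqlite3.Connection) -> None:
    """Create the feedback table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               request_id TEXT NOT NULL,
               rating     INTEGER NOT NULL,  -- assumed: 1 = thumbs up, 0 = thumbs down
               comment    TEXT NOT NULL DEFAULT '',
               created_at TEXT NOT NULL      -- ISO 8601 timestamp in UTC
           )"""
    )


def store_feedback(conn: sqlite3.Connection, request_id: str,
                   rating: int, comment: str = "") -> None:
    """Insert one feedback row, rejecting ratings outside the assumed encoding."""
    if rating not in (0, 1):
        raise ValueError("rating must be 0 or 1")
    conn.execute(
        "INSERT INTO feedback (request_id, rating, comment, created_at) "
        "VALUES (?, ?, ?, ?)",
        (request_id, rating, comment, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def thumbs_up_rate(conn: sqlite3.Connection) -> Optional[float]:
    """Share of thumbs-up ratings across all stored feedback, or None if empty."""
    avg, count = conn.execute(
        "SELECT AVG(rating), COUNT(*) FROM feedback"
    ).fetchone()
    return avg if count else None


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
init_feedback_db(conn)
for request_id, rating in [("r1", 1), ("r2", 0), ("r3", 1), ("r4", 1)]:
    store_feedback(conn, request_id, rating)
rate = thumbs_up_rate(conn)
```

Storing `request_id` makes it possible to join feedback against traces and judge scores for the same request during review.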
Production principle: Start with LangSmith for tracing and Prometheus for infrastructure. Add LLM-as-judge quality evaluation at 5–10% sampling. Set up weekly quality review to catch slow degradation that point-in-time alerts miss.
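Slow degradation can stay above a fixed threshold for weeks, which is why a threshold alert alone misses it. One way a weekly review might flag it is to compare the mean of the most recent weeks with the mean of the weeks before them. This is a rough sketch: the function name, the window of three weeks, and the tolerated drop of 0.2 points are all assumptions to tune against real data.

```python
from statistics import mean


def detect_slow_degradation(weekly_avgs: list, window: int = 3,
                            max_drop: float = 0.2) -> bool:
    """Compare the last `window` weeks with the `window` weeks before them.

    Returns True when the recent mean is more than `max_drop` below the
    earlier mean, and False when there is not enough history to compare.
    """
    if len(weekly_avgs) < 2 * window:
        return False
    earlier = mean(weekly_avgs[-2 * window:-window])
    recent = mean(weekly_avgs[-window:])
    return earlier - recent > max_drop


# Illustrative weekly faithfulness averages: every week is above 3.5,
# so a fixed "< 3.5" threshold alert would never fire on this series.
drifting = [4.4, 4.3, 4.2, 4.1, 4.0, 3.9]
stable = [4.3, 4.2, 4.3, 4.3, 4.2, 4.3]

drifting_flagged = detect_slow_degradation(drifting)
stable_flagged = detect_slow_degradation(stable)
```

In the drifting series the earlier three weeks average about 4.3 and the recent three about 4.0, so the drop exceeds the tolerance even though no single week breaches the alert threshold.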