How would you estimate costs for a large-scale LLM application?
#gen-ai#mlops
Answer
Cost Estimation for Large-Scale LLM Applications
LLM costs can balloon quickly without careful planning. Here's a framework for estimating and controlling costs.
Cost Components
textTotal Cost = (LLM API costs) + (Embedding costs) + (Vector DB costs) + (Infrastructure)
LLM API Cost Calculator
pythonfrom dataclasses import dataclass # Pricing per 1M tokens (approximate, check official pricing) MODEL_PRICING = { "gpt-4o": {"input": 2.50, "output": 10.00}, "gpt-4o-mini": {"input": 0.15, "output": 0.60}, "claude-3-5-sonnet": {"input": 3.00, "output": 15.00}, "claude-3-haiku": {"input": 0.25, "output": 1.25}, "gemini-1.5-pro": {"input": 1.25, "output": 5.00}, "text-embedding-3-small": {"input": 0.02, "output": 0}, } @dataclass class UsagePattern: daily_queries: int avg_input_tokens: int # System prompt + context + user message avg_output_tokens: int # Generated response model: str def estimate_monthly_cost(usage: UsagePattern) -> dict: pricing = MODEL_PRICING[usage.model] monthly_queries = usage.daily_queries * 30 input_cost = (usage.avg_input_tokens / 1_000_000) * pricing["input"] * monthly_queries output_cost = (usage.avg_output_tokens / 1_000_000) * pricing["output"] * monthly_queries return { "model": usage.model, "monthly_queries": monthly_queries, "input_tokens_monthly": usage.avg_input_tokens * monthly_queries, "output_tokens_monthly": usage.avg_output_tokens * monthly_queries, "input_cost_usd": round(input_cost, 2), "output_cost_usd": round(output_cost, 2), "total_cost_usd": round(input_cost + output_cost, 2), "cost_per_query_cents": round((input_cost + output_cost) / monthly_queries * 100, 4), } # Example: Customer support chatbot usage = UsagePattern( daily_queries=10_000, avg_input_tokens=800, # System prompt (200) + RAG context (500) + user message (100) avg_output_tokens=200, # Concise answer model="gpt-4o-mini", ) estimate = estimate_monthly_cost(usage) print(f"Monthly cost: ${estimate['total_cost_usd']:,.2f}") print(f"Cost per query: {estimate['cost_per_query_cents']:.4f} cents")
RAG-Specific Cost Components
pythondef estimate_rag_costs( daily_queries: int, avg_context_chunks: int = 5, avg_chunk_tokens: int = 200, documents_to_embed: int = 10_000, avg_doc_tokens: int = 500, ) -> dict: # Embedding costs indexing_cost = (documents_to_embed * avg_doc_tokens / 1_000_000) * 0.02 # One-time query_embed_cost = (daily_queries * 50 / 1_000_000) * 0.02 * 30 # Monthly return { "indexing_cost_one_time": round(indexing_cost, 4), "monthly_query_embedding_cost": round(query_embed_cost, 4), "context_tokens_per_query": avg_context_chunks * avg_chunk_tokens, }
Cost by Scale
| Scale | Monthly Queries | GPT-4o Cost | GPT-4o-mini Cost |
|---|---|---|---|
| MVP | 10K | ~$35 | ~$2 |
| Startup | 300K | ~$1,050 | ~$63 |
| Growth | 3M | ~$10,500 | ~$630 |
| Enterprise | 30M | ~$105,000 | ~$6,300 |
Cost Optimisation Strategies
python# 1. Caching (reduces effective queries by 30-60%) cache_hit_rate = 0.40 effective_queries = daily_queries * (1 - cache_hit_rate) # 2. Model routing — use cheap model for simple queries routing_config = { "simple_faq": ("gpt-4o-mini", 0.70), # 70% of queries "complex_analysis": ("gpt-4o", 0.30), # 30% of queries } # 3. Prompt compression — reduce context length # Shorter prompts = proportionally lower cost avg_tokens_with_compression = 800 * 0.6 # 40% reduction via compression # 4. Batch processing — use batch API for non-real-time (50% discount) # OpenAI Batch API: half the cost, 24h turnaround
Spending Controls
python# Set hard spending limits in OpenAI dashboard # Also implement soft limits in code: from collections import defaultdict daily_spend: dict[str, float] = defaultdict(float) DAILY_LIMIT_USD = 500.0 def budget_gated_call(prompt: str, user_id: str, model: str = "gpt-4o") -> str: today = str(datetime.date.today()) if daily_spend[today] >= DAILY_LIMIT_USD: raise Exception("Daily budget exceeded") response = call_llm(prompt, model) estimated_cost = estimate_call_cost(prompt, response, model) daily_spend[today] += estimated_cost return response
Rule of thumb: For a new product, start with GPT-4o-mini. You'll spend 10–15× less while getting 80–90% of GPT-4o quality for most tasks. Only upgrade to GPT-4o when you've identified specific quality gaps.