Concept #49 | Medium | production-mlops

How would you estimate costs for a large-scale LLM application?

#gen-ai #mlops

Answer

Cost Estimation for Large-Scale LLM Applications

LLM costs can balloon quickly without careful planning. Here's a framework for estimating and controlling costs.

Cost Components

text
Total Cost = (LLM API costs) + (Embedding costs) + (Vector DB costs) + (Infrastructure)

LLM API Cost Calculator

python
from dataclasses import dataclass

# Pricing per 1M tokens (approximate, check official pricing)
MODEL_PRICING = {
    "gpt-4o":           {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini":      {"input": 0.15,  "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku":    {"input": 0.25, "output": 1.25},
    "gemini-1.5-pro":   {"input": 1.25,  "output": 5.00},
    "text-embedding-3-small": {"input": 0.02, "output": 0},
}

@dataclass
class UsagePattern:
    daily_queries: int
    avg_input_tokens: int    # System prompt + context + user message
    avg_output_tokens: int   # Generated response
    model: str

def estimate_monthly_cost(usage: UsagePattern) -> dict:
    pricing = MODEL_PRICING[usage.model]
    monthly_queries = usage.daily_queries * 30

    input_cost = (usage.avg_input_tokens / 1_000_000) * pricing["input"] * monthly_queries
    output_cost = (usage.avg_output_tokens / 1_000_000) * pricing["output"] * monthly_queries

    return {
        "model": usage.model,
        "monthly_queries": monthly_queries,
        "input_tokens_monthly": usage.avg_input_tokens * monthly_queries,
        "output_tokens_monthly": usage.avg_output_tokens * monthly_queries,
        "input_cost_usd": round(input_cost, 2),
        "output_cost_usd": round(output_cost, 2),
        "total_cost_usd": round(input_cost + output_cost, 2),
        "cost_per_query_cents": round((input_cost + output_cost) / monthly_queries * 100, 4),
    }

# Example: Customer support chatbot
usage = UsagePattern(
    daily_queries=10_000,
    avg_input_tokens=800,   # System prompt (200) + RAG context (500) + user message (100)
    avg_output_tokens=200,  # Concise answer
    model="gpt-4o-mini",
)

estimate = estimate_monthly_cost(usage)
print(f"Monthly cost: ${estimate['total_cost_usd']:,.2f}")
print(f"Cost per query: {estimate['cost_per_query_cents']:.4f} cents")
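Because the choice of model dominates the bill, it can help to run the same workload through several models side by side. The sketch below is one way to do that. It copies four of the approximate prices from MODEL_PRICING above, and the helper names monthly_cost and compare_models are illustrative assumptions, not part of any library.

```python
# Prices per 1M tokens, copied from the MODEL_PRICING table above
# (approximate; check official pricing before relying on them).
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def monthly_cost(model: str, daily_queries: int, input_tokens: int, output_tokens: int) -> float:
    """Monthly USD cost for one model, assuming a 30-day month."""
    price = PRICING[model]
    per_query = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return per_query * daily_queries * 30

def compare_models(daily_queries: int, input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Return {model: monthly cost in USD}, cheapest first."""
    costs = {
        model: round(monthly_cost(model, daily_queries, input_tokens, output_tokens), 2)
        for model in PRICING
    }
    return dict(sorted(costs.items(), key=lambda item: item[1]))

# Same workload as the customer support chatbot example.
costs = compare_models(daily_queries=10_000, input_tokens=800, output_tokens=200)
for model, cost in costs.items():
    print(f"{model:<20} ${cost:>10,.2f}")
```

For this workload the listed prices put gpt-4o-mini at about $72 per month and gpt-4o at about $1,200, a ratio of roughly 17×.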

RAG-Specific Cost Components

python
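def estimate_rag_costs(
    daily_queries: int,
    avg_context_chunks: int = 5,
    avg_chunk_tokens: int = 200,
    documents_to_embed: int = 10_000,
    avg_doc_tokens: int = 500,
    avg_query_tokens: int = 50,        # Tokens embedded per user query
    embed_price_per_m: float = 0.02,   # text-embedding-3-small, USD per 1M tokens
) -> dict:
    # One-time cost of embedding the document corpus
    indexing_cost = (documents_to_embed * avg_doc_tokens / 1_000_000) * embed_price_per_m
    # Recurring cost of embedding each incoming query (30-day month)
    query_embed_cost = (daily_queries * avg_query_tokens / 1_000_000) * embed_price_per_m * 30

    return {
        "indexing_cost_one_time": round(indexing_cost, 4),
        "monthly_query_embedding_cost": round(query_embed_cost, 4),
        # Retrieved context is billed as LLM input tokens on every query
        "context_tokens_per_query": avg_context_chunks * avg_chunk_tokens,
    }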
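The context_tokens_per_query figure matters because retrieved chunks are sent to the LLM as input on every query, and that is usually where most RAG spend goes. The following is a minimal sketch of how the context feeds into per-query LLM cost. The function name rag_llm_cost_per_query is hypothetical, the system prompt and user message sizes are assumed from the chatbot example, and the default prices are the approximate gpt-4o-mini figures from the pricing table.

```python
def rag_llm_cost_per_query(
    context_chunks: int = 5,
    chunk_tokens: int = 200,
    system_prompt_tokens: int = 200,   # assumed, as in the chatbot example
    user_message_tokens: int = 100,    # assumed, as in the chatbot example
    output_tokens: int = 200,
    input_price_per_m: float = 0.15,   # approximate gpt-4o-mini input price
    output_price_per_m: float = 0.60,  # approximate gpt-4o-mini output price
) -> dict:
    """Estimate LLM cost for one RAG query, including retrieved context."""
    context_tokens = context_chunks * chunk_tokens
    input_tokens = system_prompt_tokens + context_tokens + user_message_tokens
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return {
        "context_tokens": context_tokens,
        "input_tokens": input_tokens,
        "context_share_of_input": round(context_tokens / input_tokens, 3),
        "cost_per_query_usd": round(input_cost + output_cost, 6),
    }

result = rag_llm_cost_per_query()
print(result)
```

With the defaults of 5 chunks at 200 tokens each, the context is 1,000 tokens, about three quarters of the input. That is double the 500 context tokens assumed in the chatbot example, so retrieval settings are worth pinning down before trusting an estimate.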

Cost by Scale

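Scale        Monthly Queries   GPT-4o Cost   GPT-4o-mini Cost
MVP          10K               ~$35          ~$2
Startup      300K              ~$1,050       ~$63
Growth       3M                ~$10,500      ~$630
Enterprise   30M               ~$105,000     ~$6,300

These figures correspond to roughly 600 input and 200 output tokens per query at the prices listed above. The 800-input-token chatbot example works out about 14% higher (for instance, about $72 per month on GPT-4o-mini at 300K queries).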

Cost Optimisation Strategies

python
# 1. Caching (reduces effective queries by 30-60%)
daily_queries = 10_000  # Example volume, matching the chatbot above
cache_hit_rate = 0.40
effective_queries = daily_queries * (1 - cache_hit_rate)

# 2. Model routing — use cheap model for simple queries
routing_config = {
    "simple_faq": ("gpt-4o-mini", 0.70),   # 70% of queries
    "complex_analysis": ("gpt-4o", 0.30),   # 30% of queries
}

# 3. Prompt compression — reduce context length
# Shorter prompts = proportionally lower cost
avg_tokens_with_compression = 800 * 0.6  # 40% reduction via compression

# 4. Batch processing: use a batch API for non-real-time work (50% discount)
# OpenAI Batch API: half the cost, results returned within 24 hours
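The four strategies compound, so it is worth estimating them together and not one at a time. The sketch below combines them under simplifying assumptions: cache hits cost nothing (cache infrastructure is ignored), compression only shrinks input tokens, and routing splits traffic between a cheap and a strong model at the approximate gpt-4o-mini and gpt-4o prices listed earlier. The Optimisations class and both function names are illustrative, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class Optimisations:
    cache_hit_rate: float = 0.40     # share of queries answered from cache
    cheap_model_share: float = 0.70  # share of queries routed to the cheap model
    compression_ratio: float = 0.60  # fraction of input tokens kept after compression
    batch_share: float = 0.0         # share of traffic sent through a batch API
    batch_discount: float = 0.50     # discount applied to batched traffic

def per_query_cost(input_tokens: float, output_tokens: float,
                   input_price: float, output_price: float) -> float:
    """USD cost of one query, given prices per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def optimised_monthly_cost(
    daily_queries: int,
    input_tokens: int,
    output_tokens: int,
    opt: Optimisations,
    cheap: tuple = (0.15, 0.60),     # approximate gpt-4o-mini prices (input, output)
    strong: tuple = (2.50, 10.00),   # approximate gpt-4o prices (input, output)
) -> float:
    """Monthly USD cost after caching, routing, compression, and batching."""
    billable_queries = daily_queries * 30 * (1 - opt.cache_hit_rate)
    compressed_input = input_tokens * opt.compression_ratio
    cheap_cost = per_query_cost(compressed_input, output_tokens, *cheap)
    strong_cost = per_query_cost(compressed_input, output_tokens, *strong)
    blended = (opt.cheap_model_share * cheap_cost
               + (1 - opt.cheap_model_share) * strong_cost)
    batch_factor = 1 - opt.batch_share * opt.batch_discount
    return billable_queries * blended * batch_factor

# Baseline: everything goes to the strong model, with no optimisation.
baseline = optimised_monthly_cost(
    10_000, 800, 200,
    Optimisations(cache_hit_rate=0.0, cheap_model_share=0.0, compression_ratio=1.0),
)
optimised = optimised_monthly_cost(10_000, 800, 200, Optimisations())
print(f"Baseline:  ${baseline:,.2f}")
print(f"Optimised: ${optimised:,.2f}")
```

Under these assumptions the chatbot workload drops from about $1,200 to about $197 per month. Most of the saving comes from routing, which is why the achievable cheap-model share is the number to validate first.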

Spending Controls

python
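# Set hard spending limits in the OpenAI dashboard,
# and also enforce soft limits in code:
import datetime
from collections import defaultdict

# In-memory counter: resets on restart and is not shared across processes
daily_spend: dict[str, float] = defaultdict(float)
DAILY_LIMIT_USD = 500.0

class BudgetExceededError(RuntimeError):
    """Raised when the day's estimated spend has reached the limit."""

def budget_gated_call(prompt: str, user_id: str, model: str = "gpt-4o") -> str:
    # call_llm and estimate_call_cost are application-specific helpers
    # that must be defined elsewhere. user_id is accepted so per-user
    # limits can be added, but this check is global.
    today = datetime.date.today().isoformat()
    if daily_spend[today] >= DAILY_LIMIT_USD:
        raise BudgetExceededError("Daily budget exceeded")

    response = call_llm(prompt, model)
    estimated_cost = estimate_call_cost(prompt, response, model)
    daily_spend[today] += estimated_cost
    return response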
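budget_gated_call relies on an estimate_call_cost helper that is not shown. One possible version is sketched below. It assumes about 4 characters per token, a rough heuristic for English text and not a real tokenizer, and it copies two approximate prices from the pricing table. Where the API response reports actual token usage, that count is the better input.

```python
# Prices per 1M tokens, copied from the pricing table earlier (approximate).
PRICE_PER_M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}
CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English, not an exact count

def approx_tokens(text: str) -> int:
    """Rough token estimate: 0 for empty text, otherwise at least 1."""
    if not text:
        return 0
    return max(1, len(text) // CHARS_PER_TOKEN)

def estimate_call_cost(prompt: str, response: str, model: str) -> float:
    """Approximate USD cost of one call from prompt and response text."""
    price = PRICE_PER_M[model]
    input_cost = approx_tokens(prompt) * price["input"] / 1_000_000
    output_cost = approx_tokens(response) * price["output"] / 1_000_000
    return input_cost + output_cost

# A 4,000-character prompt and an 800-character response on gpt-4o.
example_cost = estimate_call_cost("a" * 4000, "b" * 800, "gpt-4o")
print(f"Estimated cost: ${example_cost:.4f}")
```

Because the estimate is approximate, it is safer to set the soft limit somewhat below the hard limit configured with the provider.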

Rule of thumb: For a new product, start with GPT-4o-mini. At the prices listed above it costs roughly 17× less per token than GPT-4o while, as a rough guide, delivering 80–90% of GPT-4o quality for most tasks. Upgrade to GPT-4o only once you've identified specific quality gaps.