Cost Control: Don't Blow Your Budget
Dear Engineering,
Please explain the $47,000 charge from 'Anthropic' on this month's AWS bill. We don't have anyone named Anthropic on staff.
Regards,
Finance
Table of Contents
- The Actual Cost Breakdown
- The Real Cost Killers
- Caching: Your Best Friend
- Rate Limits: The Invisible Walls
- Local Models: The Alternative
- Building Cost-Conscious Architecture
- The Awkward Conversation with Your Manager
- The Cost Control Checklist
- The Uncomfortable Truth
- What You’ve Learned
Let me tell you about the time I watched a junior developer burn through $800 in API costs during a single afternoon.
He was building a “smart” search feature. Every keystroke triggered an API call. Every API call used GPT-4. Every response was immediately discarded because the user was still typing. By the time someone noticed the billing alerts, we’d processed roughly 2 million tokens of pure waste.
The feature never shipped. The developer learned an expensive lesson. Finance sent a very pointed email.
This chapter is about not being that developer.
The Actual Cost Breakdown
Before you can control costs, you need to understand where the money goes. It’s not as simple as “API calls cost money.”
Token Costs: The Obvious One
Every API call is priced per token. But here’s what catches people:
Input tokens cost less than output tokens. With Claude, you might pay $3 per million input tokens but $15 per million output tokens. With GPT-4, the ratio is similar.
This means a chatbot that writes long responses costs way more than one that writes short ones. That verbose, helpful AI personality? It’s expensive.
// This costs more than you think
const expensivePrompt = `
Please provide a comprehensive, detailed explanation of ${topic}.
Include examples, edge cases, and potential pitfalls.
Be thorough and leave nothing out.
`;
// This is cheaper
const cheaperPrompt = `
Explain ${topic} in 2-3 sentences. Be concise.
`;
The difference in a production system handling thousands of requests? Hundreds or thousands of dollars per month.
That “system prompt” you’re sending with every request? It counts as input tokens. If your system prompt is 2,000 tokens and you make 10,000 requests per day, that’s 20 million tokens just for the system prompt. At $3/million, that’s $60/day or $1,800/month—before the actual user messages.
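To make that concrete, here's a rough back-of-the-envelope calculator. The volumes and prices mirror the example above and are illustrative, not a billing-grade estimate:

// Rough monthly cost of a system prompt resent with every request.
// Plug in your own prompt size, volume, and current pricing.
const SYSTEM_PROMPT_TOKENS = 2_000;
const REQUESTS_PER_DAY = 10_000;
const INPUT_PRICE_PER_MILLION = 3; // dollars

const tokensPerDay = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY; // 20 million
const costPerDay = (tokensPerDay / 1_000_000) * INPUT_PRICE_PER_MILLION; // $60
const costPerMonth = costPerDay * 30; // $1,800

console.log(`System prompt alone: $${costPerDay}/day, $${costPerMonth}/month`);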
Compute Costs: The Sneaky One
If you’re running local models or fine-tuning, you’re paying for compute. GPU time isn’t cheap:
- Cloud GPUs: $1-4 per hour for decent inference hardware
- Fine-tuning: Can run into hundreds or thousands of dollars depending on dataset size
- Self-hosted inference: Cheaper per request, but you’re paying even when idle
The math often surprises people. “I’ll save money by self-hosting!” they say, before realizing their GPU server costs $2,000/month whether they use it or not.
Storage Costs: The Forgotten One
Vector databases, embedding storage, conversation history, cached responses—it all takes space.
- Pinecone: $70/month for the starter tier, scales up fast
- Conversation history: Grows forever if you’re not careful
- Embedding storage: A million embeddings at 1536 dimensions = ~6GB
Not huge individually, but it adds up.
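If you want to sanity-check that embedding figure yourself, the arithmetic is simple. This sketch assumes float32 vectors (4 bytes per dimension) and ignores index overhead:

// Raw storage for embeddings: vectors x dimensions x bytes per float
const VECTORS = 1_000_000;
const DIMENSIONS = 1536;
const BYTES_PER_FLOAT32 = 4;

const bytes = VECTORS * DIMENSIONS * BYTES_PER_FLOAT32;
console.log(`${(bytes / 1e9).toFixed(1)} GB`); // ~6.1 GB before index overhead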
The Real Cost Killers
After watching dozens of teams blow their budgets, I’ve identified the patterns that cause 90% of cost problems:
1. The Conversation History Trap
Every chatbot tutorial shows you how to include conversation history in your prompts. None of them mention that every new request resends the entire history, so the cost climbs with each turn and a conversation's total token count grows quadratically with its length.
// The tutorial approach (expensive)
async function chat(userMessage: string, history: Message[]) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [
      ...history, // This grows forever
      { role: "user", content: userMessage }
    ]
  });
  history.push({ role: "user", content: userMessage });
  history.push({ role: "assistant", content: response.content });
  return response;
}
After 20 messages, you’re sending 20 previous messages with every new request. After 100 messages, you’re sending 100. Each message might be hundreds of tokens.
The fix: Summarize or truncate history. Keep the last N messages. Use a sliding window.
// Better approach
const MAX_HISTORY_MESSAGES = 10;

async function chat(
  userMessage: string,
  history: Message[],
  conversationSummary?: string // summary of the older, truncated context
) {
  // Keep only recent history
  const recentHistory = history.slice(-MAX_HISTORY_MESSAGES);

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    // Optional: pass a summary of older context as the system prompt.
    // (The Anthropic API takes system text as a top-level parameter,
    // not as a "system" role inside the messages array.)
    ...(history.length > MAX_HISTORY_MESSAGES && conversationSummary
      ? { system: conversationSummary }
      : {}),
    messages: [
      ...recentHistory,
      { role: "user", content: userMessage }
    ]
  });
  return response;
}
2. The “Use the Best Model” Mistake
Not every task needs GPT-4 or Claude Opus. Most tasks don’t.
| Task | Model You Need | Model People Use |
|---|---|---|
| Classification | Haiku/GPT-3.5 | Opus/GPT-4 |
| Simple extraction | Haiku/GPT-3.5 | Opus/GPT-4 |
| Summarization | Sonnet/GPT-4o mini | Opus/GPT-4 |
| Complex reasoning | Opus/GPT-4 | Opus/GPT-4 ✓ |
Using Opus for everything is like taking a helicopter to get groceries. It works, but you’re wasting money.
// Route to appropriate model based on task
function selectModel(task: TaskType): string {
switch (task) {
case 'classification':
case 'extraction':
case 'simple_qa':
return 'claude-3-5-haiku-20241022'; // Fast and cheap
case 'summarization':
case 'code_review':
case 'moderate_reasoning':
return 'claude-sonnet-4-20250514'; // Good balance
case 'complex_analysis':
case 'novel_problems':
case 'critical_decisions':
return 'claude-opus-4-20250514'; // When you actually need it
default:
return 'claude-sonnet-4-20250514'; // Safe default
}
}
In most applications, 80% of requests can be handled by the cheapest model. Route those correctly and you’ll cut costs dramatically without sacrificing quality where it matters.
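As a rough illustration of why routing matters, here's the blended cost for a workload where 80% of requests go to Haiku instead of everything going to Opus. The per-request token counts and per-million prices are assumptions for the sake of the example; check current pricing before relying on them:

// Hypothetical workload: 100k requests/day, ~1k input + 300 output tokens each
const REQUESTS = 100_000;
const INPUT_TOKENS = 1_000, OUTPUT_TOKENS = 300;

// Cost of one request given per-million-token prices
const costPerRequest = (inPrice: number, outPrice: number) =>
  (INPUT_TOKENS / 1e6) * inPrice + (OUTPUT_TOKENS / 1e6) * outPrice;

const opus = costPerRequest(15, 75);  // ~$0.0375 per request
const haiku = costPerRequest(0.8, 4); // ~$0.002 per request

const allOpus = REQUESTS * opus;                      // ~$3,750/day
const routed = REQUESTS * (0.8 * haiku + 0.2 * opus); // ~$910/day
console.log({ allOpus, routed });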
3. The Retry Storm
When API calls fail, the instinct is to retry. When retries fail, the instinct is to retry harder. This can turn a $0.01 failed request into a $10 incident.
// Dangerous: Unlimited retries with no backoff
async function dangerousRequest(prompt: string) {
while (true) {
try {
return await client.messages.create({ /* ... */ });
} catch (e) {
console.log('Retrying...');
// This can run forever, burning money
}
}
}
// Better: Limited retries with exponential backoff
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function saferRequest(prompt: string, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({ /* ... */ });
    } catch (e) {
      // Give up after the last attempt. (In a real client you'd also
      // skip retrying 4xx errors, which fail the same way every time.)
      if (attempt === maxRetries - 1) throw e;
      // Exponential backoff: 1s, 2s, 4s
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
}
4. The Streaming Overhead
Streaming responses feel more responsive, but they aren't free. Token pricing is the same either way; the cost is operational: each chunk means more connection handling on your side, and partial responses are harder to cache, log, and retry. For short responses, the latency win is negligible and non-streaming is simpler and cheaper to run.
Use streaming for: Long responses, user-facing chat, real-time applications
Skip streaming for: Backend processing, batch jobs, short responses
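For reference, here's roughly what the streaming path looks like with the Anthropic TypeScript SDK. This is a sketch: only text deltas are handled, and the prompt is a placeholder:

// Streaming: print text as it arrives (good for user-facing chat)
const stream = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain streaming in one paragraph." }],
  stream: true
});

for await (const event of stream) {
  // Other event types (message_start, message_stop, ...) are skipped here
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}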
Caching: Your Best Friend
The cheapest API call is the one you don’t make. Caching is the single most effective cost reduction strategy.
Response Caching
If the same question gets asked repeatedly, cache the answer:
import { createHash } from 'crypto';
import { Redis } from 'ioredis';

const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour

// Deterministic cache key derived from the prompt text
const hashString = (text: string) =>
  createHash('sha256').update(text).digest('hex');

async function cachedCompletion(prompt: string): Promise<string> {
  const cacheKey = `ai:${hashString(prompt)}`;

  // Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached; // Free!
  }

  // Cache miss - make the API call
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }]
  });
  const result = response.content[0].text;

  // Cache for next time
  await redis.setex(cacheKey, CACHE_TTL, result);
  return result;
}
A well-designed cache can handle 30-70% of requests in typical applications. For FAQ bots or documentation search, hit rates can exceed 90%. That’s a 90% cost reduction with minimal effort.
Semantic Caching
Exact string matching misses similar questions. “What is React?” and “What’s React?” are the same question but different strings.
Semantic caching uses embeddings to find similar cached responses:
async function semanticCache(query: string): Promise<string | null> {
const queryEmbedding = await getEmbedding(query);
// Find similar cached queries
const similar = await vectorStore.search(queryEmbedding, {
threshold: 0.95, // High similarity required
limit: 1
});
if (similar.length > 0) {
return similar[0].cachedResponse;
}
return null;
}
More complex to implement, but it catches the paraphrased and reworded queries that exact matching misses.
Embedding Caching
Embeddings are deterministic—the same text always produces the same embedding. Cache them:
// In-memory for illustration; in production, back this with Redis so the
// cache survives restarts and doesn't grow without bound
const embeddingCache = new Map<string, number[]>();
async function getCachedEmbedding(text: string): Promise<number[]> {
const cacheKey = hashString(text);
if (embeddingCache.has(cacheKey)) {
return embeddingCache.get(cacheKey)!;
}
const embedding = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text
});
const vector = embedding.data[0].embedding;
embeddingCache.set(cacheKey, vector);
return vector;
}
Rate Limits: The Invisible Walls
Every API has rate limits. Hit them and your application breaks. The limits vary by provider and pricing tier, and they change frequently, so treat the figures below as ballpark numbers rather than current quotas:
| Provider | Requests/min (free tier) | Requests/min (paid) |
|---|---|---|
| OpenAI | 3 | 500-10,000 |
| Anthropic | 5 | 50-4,000 |
| Google AI | 60 | 360-1,000 |
Handling Rate Limits Gracefully
class RateLimitedClient {
  private queue: Array<() => Promise<void>> = [];
  private processing = false;
  private requestsThisMinute = 0;
  private windowStart = Date.now();
  private readonly maxRequestsPerMinute = 50;

  async request<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push(async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (e) {
          reject(e);
        }
      });
      this.processQueue();
    });
  }

  private async processQueue() {
    if (this.processing) return;
    this.processing = true;
    while (this.queue.length > 0) {
      // Start a fresh window once the previous minute has elapsed
      this.resetWindowIfExpired();
      if (this.requestsThisMinute >= this.maxRequestsPerMinute) {
        // Wait out the remainder of the current one-minute window
        await this.waitForRateLimit();
      }
      const task = this.queue.shift()!;
      this.requestsThisMinute++;
      await task();
    }
    this.processing = false;
  }

  private resetWindowIfExpired() {
    if (Date.now() - this.windowStart >= 60000) {
      this.windowStart = Date.now();
      this.requestsThisMinute = 0;
    }
  }

  private async waitForRateLimit() {
    const waitTime = 60000 - (Date.now() - this.windowStart);
    if (waitTime > 0) await sleep(waitTime);
    this.windowStart = Date.now();
    this.requestsThisMinute = 0;
  }
}
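Usage is just a matter of wrapping each call so the queue enforces the per-minute budget; a sketch:

const limited = new RateLimitedClient();

// Every API call goes through the limiter instead of hitting the SDK directly
const response = await limited.request(() =>
  client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Classify this ticket: ..." }]
  })
);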
Local Models: The Alternative
Cloud APIs aren’t your only option. Local models trade variable API spend for fixed hardware costs:
When Local Makes Sense
- High volume, simple tasks: Thousands of classification requests per day
- Privacy requirements: Data can’t leave your servers
- Predictable costs: Fixed hardware cost vs variable API costs
- Latency sensitive: No network round trip
When Local Doesn’t Make Sense
- Occasional use: The GPU sits idle most of the time
- Quality critical: Local models are generally worse than frontier APIs
- Rapid iteration: Switching models means new deployments
- Peak handling: Can’t scale up for sudden demand
The Math
Let’s do the actual calculation:
Cloud API (Claude Sonnet):
- 1 million tokens/day (roughly 70% input, 30% output at $3/$15 per million) = ~$6.60/day = ~$200/month
Self-hosted (Llama 3.1 70B):
- GPU server: ~$2,000/month (or $30K+ to buy)
- Electricity: ~$200/month
- Maintenance: Your time
- Total: ~$2,200/month (but unlimited tokens)
At Sonnet prices and that mix, the crossover point is somewhere around 10 million tokens per day. Below that, cloud is cheaper. Above that, local can start to make sense, provided you can accept the quality trade-off.
Everyone remembers the GPU cost. They forget: setup time (days to weeks), maintenance, monitoring, failover, updates, and the opportunity cost of not working on your actual product. Factor in at least 10-20 hours per month of engineering time.
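If you want to run the break-even for your own numbers, it's a one-liner. This sketch uses the blended per-million price from the calculation above and whatever all-in monthly self-hosting figure you land on:

// Tokens/day at which self-hosting and the cloud API cost the same
function breakEvenTokensPerDay(
  selfHostMonthly: number,       // all-in monthly cost of self-hosting, in dollars
  blendedPricePerMillion: number // blended API price per million tokens
): number {
  const selfHostDaily = selfHostMonthly / 30;
  return (selfHostDaily / blendedPricePerMillion) * 1_000_000;
}

console.log(breakEvenTokensPerDay(2_200, 6.6)); // ~11 million tokens/day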
Building Cost-Conscious Architecture
The best cost control is built into your architecture from the start.
1. Budget Per Request
Set a maximum cost per request and enforce it:
interface CostTracker {
inputTokens: number;
outputTokens: number;
estimatedCost: number;
}
const MAX_COST_PER_REQUEST = 0.10; // 10 cents
async function budgetedRequest(prompt: string): Promise<string> {
const estimatedInputTokens = estimateTokens(prompt);
const estimatedCost = calculateCost(estimatedInputTokens, 500); // Assume 500 output
if (estimatedCost > MAX_COST_PER_REQUEST) {
// Truncate prompt or use cheaper model
return fallbackRequest(prompt);
}
return makeRequest(prompt);
}
2. User-Level Budgets
Don’t let one user bankrupt you:
class UserBudget {
private usage = new Map<string, number>();
private readonly dailyLimit = 1.00; // $1 per user per day
async checkBudget(userId: string, estimatedCost: number): Promise<boolean> {
const today = new Date().toISOString().split('T')[0];
const key = `${userId}:${today}`;
const currentUsage = this.usage.get(key) || 0;
if (currentUsage + estimatedCost > this.dailyLimit) {
return false; // Budget exceeded
}
return true;
}
recordUsage(userId: string, cost: number) {
const today = new Date().toISOString().split('T')[0];
const key = `${userId}:${today}`;
const current = this.usage.get(key) || 0;
this.usage.set(key, current + cost);
}
}
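Wiring it into the request path might look like this, reusing the hypothetical makeRequest, estimateTokens, and calculateCost helpers sketched earlier (and assuming makeRequest returns the raw API response):

const budgets = new UserBudget();

async function handleUserRequest(userId: string, prompt: string) {
  // Check the budget before spending, using a rough 500-output-token estimate
  const estimated = calculateCost(estimateTokens(prompt), 500);
  if (!(await budgets.checkBudget(userId, estimated))) {
    return "Daily AI budget reached. Please try again tomorrow.";
  }

  // Record what the request actually cost, not the estimate
  const response = await makeRequest(prompt);
  budgets.recordUsage(userId, calculateCost(
    response.usage.input_tokens,
    response.usage.output_tokens
  ));
  return response;
}

In production you'd back the usage map with Redis (with per-day key expiry) so limits survive restarts and apply across instances.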
3. Monitoring and Alerts
You can’t control what you don’t measure:
// Track every request
async function trackedRequest(prompt: string, metadata: RequestMetadata) {
const startTime = Date.now();
  const response = await client.messages.create({
    model: metadata.model,
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }]
  });
// Log for analysis
await analytics.track('ai_request', {
model: metadata.model,
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
latencyMs: Date.now() - startTime,
userId: metadata.userId,
feature: metadata.feature,
    estimatedCost: calculateCost(response.usage.input_tokens, response.usage.output_tokens)
});
// Alert if costs spike
await checkCostAlerts(metadata.feature);
return response;
}
Set up alerts for:
- Daily spend exceeding threshold
- Single request exceeding cost limit
- Unusual usage patterns
- Error rate spikes (which might indicate retry storms)
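The checkCostAlerts call in the tracking code above is where these alerts fire. A minimal version just compares a running daily total against a threshold; analytics.sumMetric and alerting.notify below are stand-ins for whatever analytics store and alerting channel you already use:

// Hypothetical spike check: alert once a feature crosses its daily budget
const DAILY_FEATURE_BUDGET = 100; // dollars

async function checkCostAlerts(feature: string) {
  const today = new Date().toISOString().split('T')[0];
  const spentToday = await analytics.sumMetric('estimatedCost', { feature, day: today });
  if (spentToday > DAILY_FEATURE_BUDGET) {
    await alerting.notify(
      `AI spend for "${feature}" is $${spentToday.toFixed(2)} today ` +
      `(budget: $${DAILY_FEATURE_BUDGET})`
    );
  }
}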
The Awkward Conversation with Your Manager
At some point, you’ll need to explain AI costs to someone who doesn’t understand tokens. Here’s how:
Frame It Correctly
Don’t say: “We spent $3,000 on API calls.”
Do say: “We processed 50,000 customer requests at $0.06 each, which is 90% cheaper than the human support cost.”
Have the Numbers Ready
Before the conversation:
- Cost per user action
- Comparison to alternatives (human labor, other services)
- Trend over time (hopefully downward)
- ROI if applicable
Propose Solutions, Not Problems
“We’re spending too much” is a problem.
“We’re spending $X, but by implementing caching and model routing, we can reduce this to $Y within 2 weeks” is a solution.
Remember the email at the start of this chapter? That was a real team I consulted with. The actual conversation went like this:
Finance: “Why did you spend $47,000 on AI?”
Team: “We processed 2 million support tickets automatically.”
Finance: “That would have cost us $400,000 in support staff time.”
Finance: “…carry on.”
Context matters. Always have the comparison ready.
The Cost Control Checklist
Before you deploy any AI feature:
- Model selection: Are you using the cheapest model that works?
- Caching: Are you caching responses that don’t need to be fresh?
- History management: Is conversation history bounded?
- Retry limits: Do failed requests have limited retries?
- Budget limits: Are there per-request and per-user limits?
- Monitoring: Can you see costs in real-time?
- Alerts: Will you know before the bill arrives?
- Fallbacks: What happens when you hit rate limits?
The Uncomfortable Truth
Here’s what nobody wants to admit: AI features are expensive to run. The “demo” cost is always way lower than the production cost. That cool prototype that costs $5/day will cost $500/day at scale.
The companies telling you AI is cheap are either:
- Not actually using it at scale
- Subsidizing it as a loss leader
- Lying
Budget accordingly. Build cost control into your architecture. Monitor obsessively. And always, always have an answer ready for when Finance sends that email.
Because they will send that email.
What You’ve Learned
After this chapter, you should be able to:
- Understand where AI costs actually come from
- Implement caching to reduce API calls
- Route requests to appropriate model tiers
- Build budget limits into your architecture
- Have productive conversations about AI costs with non-technical stakeholders
- Not be the developer who burns through $800 in an afternoon
The junior developer from the opening story? He’s now a tech lead who reviews every AI feature for cost implications before it ships. Some lessons you only need to learn once.
But learning from other people’s expensive mistakes is cheaper.