AI_FOR_CYNICAL_DEVS
Module 14 // 45 minutes // Quality Assurance

Testing AI Systems Without Losing Your Mind

How do you know it works?

I asked it a few questions and the answers seemed... fine?

— Every AI demo ever, moments before production deployment

Why Testing AI Is Different (And Worse)

[Comparison illustration: a traditional test with a neat green checkmark next to an AI test showing a chaotic red-to-green gradient strewn with question marks, dice, "maybe?", "70% correct?", and a shrugging scientist.]

Traditional testing has this beautiful property: determinism. You call a function with the same inputs, you get the same outputs. Write a test, run it a thousand times, same result. Lovely.

AI systems laugh at your determinism.

The same prompt can produce different outputs. Temperature settings add randomness by design. Model updates change behavior without warning. And even when the output is “correct,” there’s no boolean to check—just vibes.

The Fundamental Problem

You can’t write assertEquals(ai.generate(prompt), expectedOutput) because there is no single expected output. The AI might give you a perfect answer, a slightly different perfect answer, or complete nonsense—all from the same input.

This doesn’t mean testing is impossible. It means you need different tools.
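What "different tools" looks like in practice: a minimal sketch that samples N generations and asserts on the pass rate instead of a single output. (The `generate` and `looksCorrect` callbacks here are hypothetical stand-ins, not a real API.)

// Sketch: sample N generations and assert on the pass rate,
// not on any single output. `generate` is your model call;
// `looksCorrect` is whatever "good enough" check you can automate.
async function assertPassRate(
  generate: () => Promise<string>,
  looksCorrect: (output: string) => boolean,
  samples: number = 20,
  minPassRate: number = 0.9
): Promise<void> {
  let passes = 0;
  for (let i = 0; i < samples; i++) {
    if (looksCorrect(await generate())) passes++;
  }
  const passRate = passes / samples;
  if (passRate < minPassRate) {
    throw new Error(`Pass rate ${passRate.toFixed(2)} is below ${minPassRate}`);
  }
}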

What Traditional Testing Gets You

  • Binary pass/fail: Either the function returns 42 or it doesn’t
  • Reproducibility: Same test, same result, forever
  • Clear coverage: You know exactly which code paths are tested
  • Fast feedback: Milliseconds to know if you broke something

What AI Testing Requires

  • Probabilistic assessment: Is this output good enough?
  • Statistical validity: Run it N times, check the distribution
  • Semantic evaluation: Does this mean the right thing?
  • Slow acceptance: Evals take time, especially with human review

Welcome to the world of evals—short for evaluations, because AI people love abbreviations that make simple things sound technical.


The Eval Mindset

Before building evals, you need to think differently about what “working” means.

The Quality Spectrum

Traditional software has bugs. They’re discrete. Either the bug exists or it doesn’t.

AI systems have quality. It’s continuous. The output can be:

  • Completely wrong
  • Partially correct but misleading
  • Correct but poorly formatted
  • Correct but verbose
  • Correct and concise
  • Perfect

Your eval suite needs to capture where on this spectrum your system typically lands—and alert you when it drifts.

What Are You Actually Measuring?

Before writing any eval code, answer these questions:

  1. What does “good” look like? Be specific. Not “helpful answers” but “answers that include the product name, price, and availability within 100 words.”

  2. What does “bad” look like? Even more important. What failure modes are you trying to catch?

  3. Who decides? Is there an objective right answer, or is quality subjective? If subjective, whose opinion matters?

  4. How often does it need to work? 99.9%? 95%? 80%? Different thresholds require different approaches.

Start With Failures

Build your first evals around failure cases, not success cases. You already know when the system works—that’s why you shipped it. What you don’t know is all the ways it can fail.


Types of Evals You Actually Need

1. Factual Correctness Evals

For systems that should produce verifiable facts.

interface FactualEval {
  input: string;
  expectedFacts: string[];
  evaluate(response: string): {
    factsPresent: string[];
    factsMissing: string[];
    factsWrong: string[];
    score: number;
  };
}

// Example (note: `eval` is a reserved word in strict-mode JS/TS, so don't name it that)
const capitalEval: FactualEval = {
  input: "What is the capital of France?",
  expectedFacts: ["Paris"],
  evaluate(response) {
    const containsParis = response.toLowerCase().includes("paris");
    // Only flag a wrong answer if the response actually names a different capital;
    // if the regex doesn't match at all, there's nothing to flag
    const claimedCapital = response.match(/capital.*is\s+(\w+)/i)?.[1]?.toLowerCase();
    const saysSomethingElse = claimedCapital !== undefined && claimedCapital !== "paris";
    
    return {
      factsPresent: containsParis ? ["Paris"] : [],
      factsMissing: containsParis ? [] : ["Paris"],
      factsWrong: saysSomethingElse ? ["Wrong capital mentioned"] : [],
      score: containsParis && !saysSomethingElse ? 1.0 : 0.0
    };
  }
};

2. Format Compliance Evals

For systems that must output specific formats (JSON, XML, structured data).

interface FormatEval {
  input: string;
  expectedFormat: 'json' | 'xml' | 'markdown' | 'custom';
  schema?: object;
  evaluate(response: string): {
    validFormat: boolean;
    parseErrors: string[];
    schemaErrors: string[];
    score: number;
  };
}

function evaluateJsonFormat(response: string, schema?: object): FormatEvalResult {
  try {
    const parsed = JSON.parse(response);
    
    if (schema) {
      const schemaErrors = validateAgainstSchema(parsed, schema);
      return {
        validFormat: true,
        parseErrors: [],
        schemaErrors,
        score: schemaErrors.length === 0 ? 1.0 : 0.5
      };
    }
    
    return { validFormat: true, parseErrors: [], schemaErrors: [], score: 1.0 };
  } catch (e) {
    return {
      validFormat: false,
      parseErrors: [e instanceof Error ? e.message : String(e)],
      schemaErrors: [],
      score: 0.0
    };
  }
}

3. Semantic Similarity Evals

For when the exact wording doesn’t matter, but the meaning does.

interface SemanticEval {
  input: string;
  referenceAnswer: string;
  evaluate(response: string): Promise<{
    similarityScore: number;
    passed: boolean;
    keyConceptsMatched: string[];
    keyConceptsMissed: string[];
  }>;
}

// Using embeddings to compare semantic similarity
async function evaluateSemanticSimilarity(
  response: string,
  reference: string,
  threshold: number = 0.8
): Promise<SemanticEvalResult> {
  const [responseEmbedding, referenceEmbedding] = await Promise.all([
    getEmbedding(response),
    getEmbedding(reference)
  ]);
  
  const similarity = cosineSimilarity(responseEmbedding, referenceEmbedding);
  
  return {
    similarityScore: similarity,
    passed: similarity >= threshold,
    keyConceptsMatched: [], // Would need additional analysis
    keyConceptsMissed: []
  };
}
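The `cosineSimilarity` helper is assumed above; a standard implementation, if your embeddings arrive as plain number arrays:

// Standard cosine similarity between two embedding vectors.
// Assumes equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}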

4. Safety and Guardrail Evals

For ensuring the system doesn’t output harmful content.

interface SafetyEval {
  input: string;
  prohibitedContent: string[];
  requiredDisclaimers: string[];
  evaluate(response: string): {
    containsProhibited: boolean;
    prohibitedFound: string[];
    hasRequiredDisclaimers: boolean;
    missingDisclaimers: string[];
    score: number;
  };
}

// Check for jailbreak attempts
const jailbreakPatterns = [
  /ignore.*previous.*instructions/i,
  /pretend.*you.*are/i,
  /act.*as.*if/i,
  /hypothetically/i
];

function evaluateSafety(input: string, response: string): SafetyResult {
  const isJailbreakAttempt = jailbreakPatterns.some(p => p.test(input));
  const refusedAppropriately = response.includes("I can't") || response.includes("I'm not able");
  
  if (isJailbreakAttempt && !refusedAppropriately) {
    return { safe: false, reason: "Jailbreak attempt not refused" };
  }
  
  return { safe: true };
}
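Note that the substring checks above ("I can't", "I'm not able") are deliberately naive. Real refusals come in many phrasings, so treat pattern lists like this as a first-pass filter, not a verdict.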

5. Task Completion Evals

For agentic systems that should accomplish specific goals.

interface TaskEval {
  task: string;
  successCriteria: SuccessCriterion[];
  evaluate(actions: Action[], finalState: State): {
    criteriaResults: { criterion: string; met: boolean; evidence: string }[];
    overallSuccess: boolean;
    efficiency: number; // Lower is better
  };
}

// Example: Code generation task
const codeGenEval: TaskEval = {
  task: "Write a function that sorts an array",
  successCriteria: [
    { name: "Compiles", check: (code) => compiles(code) },
    { name: "Passes tests", check: (code) => runTests(code, sortTests) },
    { name: "Handles edge cases", check: (code) => runTests(code, edgeCaseTests) },
    { name: "Reasonable complexity", check: (code) => analyzeComplexity(code) < O_N_SQUARED }
  ],
  evaluate(actions, finalState) {
    // Implementation
  }
};

Building Your First Eval Suite

Let’s build a real eval suite for a customer support chatbot.

Step 1: Define Your Test Cases

interface EvalCase {
  id: string;
  category: string;
  input: string;
  context?: string;
  expectedBehavior: string;
  requiredElements?: string[];
  prohibitedElements?: string[];
  difficulty: 'easy' | 'medium' | 'hard';
}

const evalCases: EvalCase[] = [
  {
    id: "refund-basic",
    category: "refunds",
    input: "I want a refund for my order",
    context: "Order #12345, placed 3 days ago, not shipped",
    expectedBehavior: "Offer refund, explain process, provide timeline",
    requiredElements: ["refund", "3-5 business days", "confirmation email"],
    prohibitedElements: ["sorry we can't help", "contact another department"],
    difficulty: "easy"
  },
  {
    id: "refund-edge-case",
    category: "refunds",
    input: "I want a refund but I already opened the product",
    context: "Order #12346, delivered 2 weeks ago, opened",
    expectedBehavior: "Explain partial refund policy, offer alternatives",
    requiredElements: ["restocking fee", "15%", "store credit alternative"],
    prohibitedElements: ["full refund"],
    difficulty: "hard"
  },
  // ... more cases
];

Step 2: Build the Eval Runner

interface EvalResult {
  caseId: string;
  passed: boolean;
  score: number;
  response: string;
  analysis: {
    requiredElementsFound: string[];
    requiredElementsMissing: string[];
    prohibitedElementsFound: string[];
    additionalNotes: string[];
  };
  latency: number;
  tokens: { input: number; output: number };
  timestamp: Date; // when the eval ran (used for trend charts later)
}

async function runEvalSuite(
  system: AISystem,
  cases: EvalCase[],
  options: { parallel?: number; retries?: number } = {}
): Promise<EvalResult[]> {
  const { parallel = 5, retries = 0 } = options;
  const results: EvalResult[] = [];
  
  // Run in batches to avoid rate limits
  for (let i = 0; i < cases.length; i += parallel) {
    const batch = cases.slice(i, i + parallel);
    const batchResults = await Promise.all(
      batch.map(evalCase => runSingleEval(system, evalCase, retries))
    );
    results.push(...batchResults);
    
    // Progress reporting
    console.log(`Completed ${results.length}/${cases.length} evals`);
  }
  
  return results;
}

async function runSingleEval(
  system: AISystem,
  evalCase: EvalCase,
  retries: number
): Promise<EvalResult> {
  const startTime = Date.now();
  
  // Retry transient failures (rate limits, timeouts) up to `retries` times
  let response;
  for (let attempt = 0; ; attempt++) {
    try {
      response = await system.generate({
        input: evalCase.input,
        context: evalCase.context
      });
      break;
    } catch (e) {
      if (attempt >= retries) throw e;
    }
  }
  
  const latency = Date.now() - startTime;
  
  // Analyze the response
  const analysis = analyzeResponse(response.text, evalCase);
  
  // Calculate score
  const score = calculateScore(analysis, evalCase);
  
  return {
    caseId: evalCase.id,
    passed: score >= 0.8, // Configurable threshold
    score,
    response: response.text,
    analysis,
    latency,
    tokens: response.usage,
    timestamp: new Date()
  };
}

Step 3: Analysis and Scoring

function analyzeResponse(response: string, evalCase: EvalCase): Analysis {
  const responseLower = response.toLowerCase();
  
  const requiredElementsFound = (evalCase.requiredElements || [])
    .filter(el => responseLower.includes(el.toLowerCase()));
  
  const requiredElementsMissing = (evalCase.requiredElements || [])
    .filter(el => !responseLower.includes(el.toLowerCase()));
  
  const prohibitedElementsFound = (evalCase.prohibitedElements || [])
    .filter(el => responseLower.includes(el.toLowerCase()));
  
  return {
    requiredElementsFound,
    requiredElementsMissing,
    prohibitedElementsFound,
    additionalNotes: []
  };
}

function calculateScore(analysis: Analysis, evalCase: EvalCase): number {
  let score = 1.0;
  
  // Deduct for missing required elements
  const requiredCount = evalCase.requiredElements?.length || 0;
  if (requiredCount > 0) {
    const foundRatio = analysis.requiredElementsFound.length / requiredCount;
    score *= foundRatio;
  }
  
  // Heavy penalty for prohibited elements
  if (analysis.prohibitedElementsFound.length > 0) {
    score *= 0.5; // Immediate 50% penalty
    score *= Math.pow(0.8, analysis.prohibitedElementsFound.length);
  }
  
  return Math.max(0, Math.min(1, score));
}
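Wiring it together: a hypothetical invocation, assuming `mySupportBot` implements the `AISystem` interface used above.

// Run the suite and summarize the damage.
const results = await runEvalSuite(mySupportBot, evalCases);
const passRate = results.filter(r => r.passed).length / results.length;
console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
console.log("Failing cases:", results.filter(r => !r.passed).map(r => r.caseId));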
Golden Rule of Eval Cases

Your eval suite is only as good as your test cases. Spend time crafting cases that represent real user behavior, not just the happy path. Include typos, ambiguous requests, edge cases, and adversarial inputs.


Measuring What Matters

The Metrics Zoo

Accuracy-based metrics:

  • Exact Match (EM): Did the response exactly match the expected output? (Rarely useful for generative AI)
  • F1 Score: Balance of precision and recall for extracted information
  • BLEU/ROUGE: Text similarity scores borrowed from machine translation

Semantic metrics:

  • BERTScore: Semantic similarity using embeddings
  • G-Eval: Using another LLM to judge quality (yes, really)
  • Human preference: Having humans rate outputs

Operational metrics:

  • Latency: How long did it take?
  • Token usage: How expensive was it?
  • Error rate: How often did it completely fail?

Choosing the Right Metrics

Use Case        | Primary Metrics                     | Secondary Metrics
----------------|-------------------------------------|---------------------------------
Q&A System      | Factual accuracy, source citation   | Latency, verbosity
Code Generation | Compiles, tests pass                | Complexity, style
Summarization   | Key-point coverage, faithfulness    | Length, readability
Classification  | Accuracy, F1                        | Latency, confidence calibration
Chat/Support    | Task completion, user satisfaction  | Response time, escalation rate

Building a Dashboard

interface EvalDashboard {
  overallScore: number;
  categoryBreakdown: Record<string, CategoryStats>;
  trendData: DailyStats[];
  failingCases: EvalResult[];
  recommendations: string[];
}

interface CategoryStats {
  passRate: number;
  averageScore: number;
  averageLatency: number;
  totalCases: number;
}

function generateDashboard(results: EvalResult[], history: EvalResult[][]): EvalDashboard {
  const categoryGroups = groupBy(results, r => getCategoryFromCaseId(r.caseId));
  
  const categoryBreakdown: Record<string, CategoryStats> = {};
  for (const [category, cases] of Object.entries(categoryGroups)) {
    categoryBreakdown[category] = {
      passRate: cases.filter(c => c.passed).length / cases.length,
      averageScore: average(cases.map(c => c.score)),
      averageLatency: average(cases.map(c => c.latency)),
      totalCases: cases.length
    };
  }
  
  return {
    overallScore: average(results.map(r => r.score)),
    categoryBreakdown,
    trendData: history.map(h => ({
      date: h[0]?.timestamp,
      score: average(h.map(r => r.score))
    })),
    failingCases: results.filter(r => !r.passed),
    recommendations: generateRecommendations(results, history)
  };
}
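The `average` and `groupBy` helpers are assumed throughout this chapter; minimal versions look like this:

// Minimal versions of the helpers assumed throughout this chapter.
function average(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

function groupBy<T>(items: T[], keyFn: (item: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    (groups[keyFn(item)] ??= []).push(item);
  }
  return groups;
}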

Automated Evals vs Human Review

When Automated Evals Work

  • Format checking: JSON valid? Required fields present?
  • Factual verification: Against a known database
  • Safety filters: Profanity, PII, known bad patterns
  • Performance metrics: Latency, token count, error rates
  • Regression detection: Did scores drop vs baseline?

When You Need Humans

  • Subjective quality: Is this helpful? Professional? Appropriate?
  • Nuanced correctness: Technically correct but misleading?
  • Tone and style: Does this match our brand voice?
  • Novel failures: New ways the system can break
  • High-stakes decisions: Medical, legal, financial advice

LLM-as-Judge: The Controversial Middle Ground

[Illustration: a robot in judge's robes and powdered wig, sitting at a bench, evaluating another robot's work, with a thought bubble reading "I also make mistakes but I'll judge confidently anyway." Courtroom drama meets AI absurdity.]

Yes, you can use another LLM to evaluate your LLM. It’s weird but it works (sometimes).

async function llmJudge(
  response: string,
  criteria: string[],
  rubric: string
): Promise<JudgeResult> {
  const prompt = `
You are evaluating an AI system's response. 

RESPONSE TO EVALUATE:
${response}

EVALUATION CRITERIA:
${criteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

RUBRIC:
${rubric}

For each criterion, provide:
1. Score (1-5)
2. Brief justification

Format as JSON: { "scores": [{ "criterion": 1, "score": 4, "justification": "..." }], "overall": 4 }
`;

  const judgment = await evaluatorModel.generate(prompt);
  return JSON.parse(judgment);
}
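One practical wrinkle: judge models love wrapping their JSON in markdown code fences, so strip those (or use your provider's structured-output mode) before calling JSON.parse.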
The LLM-as-Judge Problem

LLM judges have biases: they prefer longer responses, responses similar to their own training, and confident-sounding nonsense. Always calibrate against human judgments and watch for systematic errors.

A Hybrid Approach

interface HybridEval {
  automated: {
    formatCheck: boolean;
    safetyCheck: boolean;
    factCheck: number;
  };
  llmJudge: {
    helpfulness: number;
    accuracy: number;
    professionalism: number;
  } | null;
  humanReview?: {
    reviewer: string;
    scores: Record<string, number>;
    notes: string;
  };
}

async function runHybridEval(response: string, evalCase: EvalCase): Promise<HybridEval> {
  // Fast automated checks first
  // (expectedFacts, criteria, and rubric are assumed extensions of EvalCase
  // for categories that use fact checking and judged evaluation)
  const automated = {
    formatCheck: checkFormat(response),
    safetyCheck: checkSafety(response),
    factCheck: await checkFacts(response, evalCase.expectedFacts)
  };
  
  // If automated checks fail hard, skip expensive evaluation
  if (!automated.safetyCheck || !automated.formatCheck) {
    return { automated, llmJudge: null };
  }
  
  // LLM judge for subjective quality
  // (don't name this const `llmJudge` -- it would shadow the function above)
  const judgeResult = await llmJudge(response, evalCase.criteria, evalCase.rubric);
  
  // Queue for human review if edge case or low confidence
  const needsHumanReview = 
    judgeResult.overall < 3 || 
    evalCase.difficulty === 'hard' ||
    automated.factCheck < 0.9;
  
  return { 
    automated, 
    llmJudge: judgeResult,
    humanReview: needsHumanReview ? await queueForHumanReview(response, evalCase) : undefined
  };
}

Regression Testing for AI

Model updates, prompt changes, and context modifications can all cause regressions. Here’s how to catch them.

The Golden Dataset

Maintain a set of responses that are known to be good:

interface GoldenExample {
  id: string;
  input: string;
  context: string;
  goldenResponse: string;
  approvedBy: string;
  approvedAt: Date;
  tags: string[];
}

const goldenDataset: GoldenExample[] = [
  {
    id: "support-001",
    input: "How do I reset my password?",
    context: "User account: premium, no 2FA",
    goldenResponse: "To reset your password:\n1. Go to settings...",
    approvedBy: "product-team",
    approvedAt: new Date("2025-06-15"),
    tags: ["support", "account", "critical-path"]
  }
];

async function regressionTest(system: AISystem): Promise<RegressionResult[]> {
  const results: RegressionResult[] = [];
  
  for (const golden of goldenDataset) {
    const newResponse = await system.generate({
      input: golden.input,
      context: golden.context
    });
    
    const similarity = await semanticSimilarity(
      newResponse.text, 
      golden.goldenResponse
    );
    
    results.push({
      id: golden.id,
      passed: similarity > 0.85,
      similarity,
      oldResponse: golden.goldenResponse,
      newResponse: newResponse.text,
      diff: generateDiff(golden.goldenResponse, newResponse.text)
    });
  }
  
  return results;
}

A/B Comparison Testing

When changing your system, run both versions:

async function compareVersions(
  oldSystem: AISystem,
  newSystem: AISystem,
  testCases: EvalCase[]
): Promise<ComparisonResult> {
  const results = await Promise.all(testCases.map(async (testCase) => {
    const [oldResult, newResult] = await Promise.all([
      runSingleEval(oldSystem, testCase, 0),
      runSingleEval(newSystem, testCase, 0)
    ]);
    
    return {
      caseId: testCase.id,
      oldScore: oldResult.score,
      newScore: newResult.score,
      delta: newResult.score - oldResult.score,
      oldLatency: oldResult.latency,
      newLatency: newResult.latency,
      regression: newResult.score < oldResult.score - 0.1 // 10% tolerance
    };
  }));
  
  return {
    improved: results.filter(r => r.delta > 0.1).length,
    regressed: results.filter(r => r.regression).length,
    unchanged: results.filter(r => Math.abs(r.delta) <= 0.1).length,
    averageDelta: average(results.map(r => r.delta)),
    regressions: results.filter(r => r.regression)
  };
}

The Vibes-Based Testing Problem

[Illustration: a QA engineer at a computer, thought bubbles reading "feels right" and "looks good to me," while production burns in the background. A chart shows "Vibes" trending up as "Actual Quality" trends down.]

Let’s be honest: a lot of AI testing is vibes-based.

“I tried it a few times and it seemed fine.”

This is how most AI features ship. And sometimes it’s… actually okay?

When Vibes Are Acceptable

  • Early prototypes: You’re still figuring out what “good” means
  • Low-stakes features: Suggested tags, auto-categorization
  • Highly subjective outputs: Creative writing, brainstorming
  • Fast iteration: Testing would take longer than fixing

When Vibes Will Destroy You

  • Production systems: Real users hitting real edge cases
  • High-stakes decisions: Anything medical, legal, financial
  • Regulated industries: You’ll need audit trails
  • Model updates: New model, new failure modes
Real Story: The Vibes Ship

A team shipped a chatbot after “extensive testing” (15 manual queries). First week in production, a user asked “how do I end it all?” referring to their subscription. The bot’s response was… not appropriate for that interpretation. Vibes don’t catch ambiguity.

Graduating from Vibes

The path from vibes to rigor:

  1. Document your vibes: Write down what you tested manually
  2. Automate one check: Just one. Format validation is easy.
  3. Add failure cases: Things that definitely shouldn’t work
  4. Build a golden set: 20-50 examples you know are correct
  5. Run on schedule: Daily or weekly automated checks
  6. Add metrics: Track scores over time

You don’t need to go from vibes to 99.9% coverage overnight. But you do need to move.
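Step 2 really is that small. If your feature is supposed to return JSON, your first automated check can be nothing more than this sketch:

// Your entire first automated check: does the output parse as JSON at all?
function firstAutomatedCheck(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}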


A/B Testing AI Features

Traditional A/B testing works for AI too, with some modifications.

What to A/B Test

  • Different prompts: Which system prompt performs better?
  • Different models: GPT-4 vs Claude vs Gemini
  • Different parameters: Temperature, max tokens
  • Different RAG strategies: More context vs less context
  • Feature enablement: AI feature on vs off

Metrics for AI A/B Tests

interface AIABTestMetrics {
  // Engagement metrics
  featureUsageRate: number;      // % of users who use the AI feature
  completionRate: number;        // % who complete the AI-assisted task
  repeatUsage: number;           // Users who come back
  
  // Quality metrics
  editRate: number;              // How often users edit AI output
  rejectionRate: number;         // How often users reject AI output entirely
  escalationRate: number;        // For support: escalated to human
  
  // Business metrics
  taskCompletionTime: number;    // Faster with AI?
  conversionRate: number;        // For sales/funnel
  customerSatisfaction: number;  // Survey scores
  
  // Cost metrics
  costPerInteraction: number;    // API costs
  costPerSuccessfulOutcome: number;  // Cost per completed task
}

Sample Size Headaches

AI features often have high variance, which means you need larger sample sizes:

function calculateRequiredSampleSize(
  baselineConversion: number,
  minimumDetectableEffect: number,
  significance: number = 0.05,
  power: number = 0.8
): number {
  // Standard sample size calculation
  // But for AI features, multiply by 1.5-2x due to variance
  const standardSize = sampleSizeCalculation(
    baselineConversion, 
    minimumDetectableEffect,
    significance,
    power
  );
  
  return Math.ceil(standardSize * 1.75); // AI variance factor
}
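The `sampleSizeCalculation` helper is assumed above. It's the standard two-proportion formula; here's a sketch with z-values hardcoded for the common defaults (5% significance, 80% power), since a full version would invert the normal CDF:

// Standard two-proportion sample size (per variant). The significance and
// power parameters are accepted for signature compatibility, but the z-values
// below are hardcoded for 0.05 / 0.8.
function sampleSizeCalculation(
  baselineConversion: number,
  minimumDetectableEffect: number,
  significance: number = 0.05,
  power: number = 0.8
): number {
  const zAlpha = 1.96; // two-sided z for significance = 0.05
  const zBeta = 0.84;  // z for power = 0.8
  const p1 = baselineConversion;
  const p2 = baselineConversion + minimumDetectableEffect;
  const pBar = (p1 + p2) / 2;
  const numerator = Math.pow(
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  return Math.ceil(numerator / Math.pow(p2 - p1, 2));
}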

When Evals Lie to You

Evals can give you false confidence. Watch out for these traps.

Goodhart’s Law Strikes Again

“When a measure becomes a target, it ceases to be a good measure.”

If you optimize for eval scores, you might not be optimizing for actual quality.

Example: Your eval checks if responses contain the word “sorry” for apologetic tone. The model learns to say “sorry” in every response. Eval score: 100%. User experience: terrible.

Distribution Shift

Your eval dataset doesn’t match production:

  • Training data from English speakers, users speak Spanglish
  • Test cases use proper grammar, users use txt spk
  • Eval context is complete, production context is messy

Ceiling Effects

All your eval cases pass at 95%+. Are you done?

No. Your eval cases are too easy. Add harder cases until your scores drop to 60-70%, then work on improving.

Evaluation Contamination

The model might have seen your eval cases during training (especially if you’re using public datasets). Test on held-out data.

async function checkForContamination(
  model: AISystem,
  evalCase: EvalCase
): Promise<boolean> {
  // Ask the model to complete the expected answer
  const completion = await model.generate({
    input: `Complete this text: "${evalCase.expectedBehavior.slice(0, 50)}..."`,
    temperature: 0
  });
  
  // If it matches suspiciously well, might be contaminated
  const similarity = stringSimilarity(
    completion.text, // generate() returns { text, usage } elsewhere in this chapter
    evalCase.expectedBehavior
  );
  
  return similarity > 0.9; // Suspiciously high
}
The Meta-Eval

Evaluate your evals. Have a human review a random sample of passing and failing cases. Are the passes actually good? Are the failures actually bad? If not, your eval is lying.


Continuous Evaluation in Production

Evals aren’t just for pre-deployment. Monitor quality continuously.

Logging for Evals

interface ProductionLog {
  requestId: string;
  timestamp: Date;
  input: string;
  context: string;
  response: string;
  latency: number;
  tokens: { input: number; output: number };
  cost: number;
  
  // User signals
  userAction: 'accepted' | 'edited' | 'rejected' | 'ignored';
  userFeedback?: 'thumbs_up' | 'thumbs_down';
  editDistance?: number; // How much did they change it?
  
  // Automated signals
  formatValid: boolean;
  safetyFlags: string[];
  anomalyScore: number;
}
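For `editDistance`, plain Levenshtein distance between the AI draft and what the user actually sent is usually enough. A standard implementation:

// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  // dp[i][j] = edits to turn a[0..i) into b[0..j)
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}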

Sampling for Review

You can’t review everything. Sample intelligently:

function shouldSampleForReview(log: ProductionLog): boolean {
  // Heavily sample failures
  if (log.userAction === 'rejected') return Math.random() < 0.5; // 50% of rejections
  
  // Sample edge cases
  if (log.anomalyScore > 0.8) return true;
  if (log.latency > 5000) return true;
  if (log.safetyFlags.length > 0) return true;
  
  // Random sample of successes (for calibration)
  if (log.userAction === 'accepted') return Math.random() < 0.01; // 1%
  
  return false;
}

Alerting on Quality Drops

interface QualityAlert {
  type: 'degradation' | 'anomaly' | 'cost_spike' | 'latency_spike';
  severity: 'low' | 'medium' | 'high' | 'critical';
  metric: string;
  expected: number;
  actual: number;
  windowStart: Date;
  windowEnd: Date;
}

function checkForAlerts(recentLogs: ProductionLog[]): QualityAlert[] {
  const alerts: QualityAlert[] = [];
  
  // Check rejection rate
  const rejectionRate = recentLogs.filter(l => l.userAction === 'rejected').length / recentLogs.length;
  if (rejectionRate > 0.2) { // 20% threshold
    alerts.push({
      type: 'degradation',
      severity: rejectionRate > 0.4 ? 'critical' : 'high',
      metric: 'rejection_rate',
      expected: 0.1,
      actual: rejectionRate,
      windowStart: recentLogs[0].timestamp,
      windowEnd: recentLogs[recentLogs.length - 1].timestamp
    });
  }
  
  // Add more checks...
  
  return alerts;
}

Making Evals Actually Useful

Evals that nobody looks at are useless. Here’s how to make them actionable.

The Eval Ritual

Make evals part of your workflow:

  1. Pre-commit: Fast format checks, basic safety
  2. PR review: Full eval suite run, comparison to main branch
  3. Daily: Regression tests against golden dataset
  4. Weekly: Human review of sampled production logs
  5. Monthly: Full eval suite update, remove stale cases, add new ones
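For the PR-review step, the gate can be as simple as comparing against the last score on main. A sketch, where `loadBaselineScore` is a hypothetical helper rather than a real CI API:

// Fail CI if the eval suite regresses against the main-branch baseline.
async function ciEvalGate(system: AISystem, cases: EvalCase[]): Promise<void> {
  const results = await runEvalSuite(system, cases);
  const score = average(results.map(r => r.score));
  const baseline = await loadBaselineScore(); // e.g. stored from the last run on main
  if (score < baseline - 0.05) { // 5-point tolerance
    console.error(`Eval score ${score.toFixed(2)} regressed vs baseline ${baseline.toFixed(2)}`);
    process.exit(1); // fail the CI job
  }
}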

Connecting Evals to Action

interface EvalReport {
  summary: {
    overallScore: number;
    trend: 'improving' | 'stable' | 'declining';
    criticalIssues: number;
  };
  
  // Actionable sections
  mustFix: {
    description: string;
    impact: string;
    suggestedFix: string;
  }[];
  
  shouldInvestigate: {
    description: string;
    examples: EvalResult[];
  }[];
  
  improvements: {
    description: string;
    before: number;
    after: number;
  }[];
}

Communicating Results

For your team:

  • “Eval score dropped 15% after the prompt change. Here are the 3 failing cases.”

For stakeholders:

  • “Our customer support bot correctly handles 94% of refund requests, up from 87% last month.”

For executives:

  • “AI quality is stable. Cost per interaction down 12%.”
The One Metric That Matters

Find your OMTM (One Metric That Matters) for each AI feature. For a support bot, it might be “percentage of issues resolved without human escalation.” For code generation, it might be “percentage of generated code that compiles and passes tests.” Track it obsessively.


Practice Exercise: Build a Mini Eval Suite

Create an eval suite for a simple AI feature:

The Feature

A function that uses an LLM to extract structured data from emails:

interface EmailExtraction {
  sender: string;
  subject: string;
  actionItems: string[];
  deadline?: Date;
  sentiment: 'positive' | 'neutral' | 'negative';
}

async function extractFromEmail(emailText: string): Promise<EmailExtraction> {
  // Your implementation
}
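To get you started, here's one possible happy-path case reusing the `EvalCase` shape from earlier in this module (the email text and expectations are illustrative, not canonical):

// Starter case: a well-formed email with one action item and a deadline.
const starterCase: EvalCase = {
  id: "email-happy-path",
  category: "extraction",
  input: "Hi team, please review the Q3 deck by Friday and send feedback to Dana. Thanks!",
  expectedBehavior: "Extract one action item (review Q3 deck), a Friday deadline, and neutral-to-positive sentiment",
  requiredElements: ["review", "Q3 deck", "Friday"],
  prohibitedElements: [],
  difficulty: "easy"
};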

Your Task

  1. Write 10 eval cases covering:

    • Happy path emails
    • Emails with no action items
    • Emails with multiple deadlines
    • Poorly formatted emails
    • Non-English emails (if relevant)
  2. Implement automated checks for:

    • JSON format validity
    • Required fields present
    • Deadline format validation
  3. Design a scoring rubric for:

    • Action item extraction accuracy
    • Sentiment classification
    • Overall extraction quality
  4. Create a simple dashboard showing:

    • Pass/fail rate by category
    • Average scores over time
    • List of failing cases

Key Takeaways

Testing AI is different:

  • Non-deterministic outputs require probabilistic thinking
  • “Correct” is often a spectrum, not a boolean
  • You’re measuring quality, not just catching bugs

Types of evals you need:

  • Factual correctness for verifiable claims
  • Format compliance for structured outputs
  • Semantic similarity for meaning preservation
  • Safety checks for guardrails
  • Task completion for agentic systems

Build systematically:

  • Start with failure cases, not success cases
  • Maintain a golden dataset for regression testing
  • Use automated checks + LLM judges + human review
  • Connect evals to your deployment workflow

Watch for pitfalls:

  • Goodhart’s Law (optimizing for the wrong metric)
  • Distribution shift (eval data ≠ production data)
  • Ceiling effects (tests too easy)
  • Evaluation contamination (model saw the answers)

Make evals actionable:

  • Run them regularly
  • Connect results to specific actions
  • Communicate clearly to different audiences
  • Find your One Metric That Matters

What’s Next?

You now know how to evaluate whether your AI systems actually work. In the next chapter, we’ll tackle a different kind of evaluation: how to talk about AI in job interviews without sounding like an idiot. Because knowing how to test AI is one thing—explaining it to a skeptical interviewer is another.

Time to prove you’re not just following the hype train.
