Testing AI Models Using Evals: A Practical Guide for QA & ML Engineers

Artificial Intelligence is revolutionizing software development — but how do we test AI models effectively?
Unlike traditional software, AI models often produce non-deterministic outputs: the same input can lead to different valid responses. This makes exact-match assertions insufficient on their own.

That’s where Evals come in — a structured way to evaluate, compare, and benchmark AI models consistently.

⚙️ What Are “Evals” in AI Testing?

Evals (short for Evaluations) are systematic frameworks used to:

  • Measure the accuracy, robustness, and consistency of AI models
  • Benchmark models using standard datasets or custom tests
  • Identify regressions or performance drops after model updates
  • Support automated testing pipelines for LLMs and AI applications

Think of evals as unit tests for AI, but with graded scores instead of strict pass/fail outcomes.

💡 Why Traditional Testing Fails for AI

Traditional Testing    | AI Testing
-----------------------|------------------------------------
Deterministic outputs  | Probabilistic outputs
Pass/Fail assertions   | Graded or fuzzy correctness
Static code paths      | Learned model behavior
Simple test cases      | Contextual and ambiguous scenarios

Example:

# Traditional software test
assert add(2, 3) == 5  # always true

# AI model test
response = llm("What’s 2 + 3?")
assert "5" in response  # fuzzy match, not exact

AI testing requires graded scoring or semantic comparison rather than simple boolean checks.
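As a rough illustration of graded scoring, the sketch below turns a string comparison into a score between 0 and 1 and only then applies a pass/fail threshold. It uses Python's standard difflib; the function names and the 0.6 threshold are assumptions chosen for the example, not part of any eval framework.

```python
from difflib import SequenceMatcher


def graded_score(response: str, ideal: str) -> float:
    """Return a similarity in [0, 1] after lowercasing and collapsing whitespace."""
    a = " ".join(response.lower().split())
    b = " ".join(ideal.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def passes(response: str, ideal: str, threshold: float = 0.6) -> bool:
    """Convert the graded score into a pass/fail verdict at a chosen threshold."""
    return graded_score(response, ideal) >= threshold
```

The threshold is a tuning knob: a stricter value catches more drift but also rejects more valid paraphrases.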

🧩 Popular Evaluation Frameworks

  1. OpenAI Evals – OpenAI’s open-source framework and benchmark registry for evaluating LLMs
  2. LangChain’s Evaluators – Built into the LangChain framework for prompt chains
  3. TruLens – A powerful tool for tracking model quality, bias, and consistency
  4. Helicone / PromptLayer – Logging and analytics tools that can support custom evals
  5. Custom Evals with Python – Build your own eval harness using scripts and datasets

🚀 Setting Up Evals Using OpenAI Evals

🧰 Step 1: Install the Evals Package

pip install evals

🧾 Step 2: Create a Sample Prompt & Dataset

Let’s test how well a model summarizes text.

data/sample_eval.json (JSON Lines format: one JSON object per line)

{"input": "Artificial Intelligence enables machines to learn from data.", "ideal": "AI allows computers to learn from data."}
{"input": "Testing AI models requires careful evaluation.", "ideal": "AI model testing needs structured evaluation."}
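Before running anything, it can help to check that the dataset parses and that every record has the expected keys. A minimal sketch, assuming the "input" and "ideal" keys used above; `load_samples` is a hypothetical helper, not part of OpenAI Evals.

```python
import json
from pathlib import Path


def load_samples(path):
    """Read a JSON Lines file and check each record has 'input' and 'ideal'."""
    samples = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = {"input", "ideal"} - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        samples.append(record)
    return samples
```

Calling `load_samples("data/sample_eval.json")` on the file above should return a list of two dictionaries.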

⚡ Step 3: Create a Custom Eval Script

evals/summary_eval.py

from difflib import SequenceMatcher

import evals
import evals.record


class SummaryEval(evals.Eval):
    def eval_sample(self, sample, *_):
        prompt = f"Summarize: {sample['input']}"
        result = self.completion_fn(prompt=prompt)
        response = result.get_completions()[0].strip()
        ideal = sample["ideal"]

        # String-based similarity in [0, 1]; higher is better.
        # (A raw edit distance would be the opposite: lower is better.)
        score = SequenceMatcher(None, response, ideal).ratio()
        evals.record.record_metrics(score=score)
        return {"score": score}

    def run(self, recorder):
        # Evaluate every sample, then aggregate the recorded scores.
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)
        scores = recorder.get_scores("score")
        return {"average_score": sum(scores) / len(scores)}

⚙️ Step 4: Run the Eval

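oaieval takes a model (completion function) name and the name of a registered eval, not file paths. Assuming the class and dataset above have been registered in a registry YAML file under the hypothetical name summary-eval:

oaieval gpt-4 summary-eval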

Illustrative output (the actual oaieval log format differs, and scores depend on the model):

Running eval SummaryEval on 2 samples...
Sample 1: score=0.9
Sample 2: score=0.8
Average Score: 0.85

You now have a quantitative metric for how well your model summarizes!

🧮 Building Custom Evals Without OpenAI Evals

You can create your own lightweight framework in Python, scoring outputs with metrics such as cosine similarity or BLEU.

Here’s a simple custom example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(a, b):
    embeddings = model.encode([a, b])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

# Example use:
pred = "AI helps computers learn from data"
ideal = "Artificial Intelligence enables machines to learn from data"

score = semantic_similarity(pred, ideal)
print(f"Semantic similarity: {score:.2f}")

Example output (the exact value depends on the model version):

Semantic similarity: 0.91

This provides semantic grading, which is often better than exact matching.
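To go from a single comparison to a small harness, one option is a loop that takes any model function and any scoring function. The sketch below is an assumption-laden example: `run_eval` and `token_f1` are hypothetical names, and token-overlap F1 stands in for the embedding scorer so the code runs without downloading a model.

```python
from statistics import mean


def token_f1(pred, ideal):
    """Token-overlap F1: a cheap lexical stand-in for an embedding-based scorer."""
    p, i = pred.lower().split(), ideal.lower().split()
    common = sum(min(p.count(t), i.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(i)
    return 2 * precision * recall / (precision + recall)


def run_eval(samples, model_fn, score_fn):
    """Score model_fn on every sample; return per-sample scores and their average."""
    scores = [score_fn(model_fn(s["input"]), s["ideal"]) for s in samples]
    return {"scores": scores, "average": mean(scores) if scores else 0.0}
```

Passing `semantic_similarity` from the example above as `score_fn` would swap the lexical metric for the semantic one without changing the harness.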

🧱 Integrating Evals into CI/CD

You can automate model testing in your DevOps pipeline:

  • Run evals as GitHub Actions
  • Fail builds if average accuracy < threshold
  • Store metrics in dashboards

Example GitHub Action:

name: Run Evals
on: [push, pull_request]
jobs:
  test-ai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install evals
      - name: Run model evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # summary-eval: hypothetical registry name for the eval defined earlier
        run: oaieval gpt-4 summary-eval
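Running the eval does not by itself fail the build on a low score. One way to add that is a small gate function whose return value becomes the process exit code. This is a sketch only: `gate` and the 0.8 default threshold are assumptions, and how the scores reach the script (a file, stdout, an artifact) is left open.

```python
def gate(scores, threshold=0.8):
    """Return 0 if the average score meets the threshold, 1 otherwise."""
    if not scores:
        print("No scores found; failing the build.")
        return 1
    average = sum(scores) / len(scores)
    print(f"average={average:.3f} threshold={threshold:.3f}")
    return 0 if average >= threshold else 1
```

In a CI script, calling `sys.exit(gate(scores))` would make a non-zero result fail the job.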

📊 Evaluating Results and Improving Models

After running your evals, you’ll typically get quantitative scores (like accuracy or similarity) and qualitative insights (like how natural or factual the outputs feel).
This section is about interpreting those results and using them to improve your AI models.

🧮 1. Key Metrics to Track

🧠 a. Accuracy / Similarity Score

Measures how close the model’s output is to the “ideal” answer.

Example:

similarity = 0.87  # High similarity = good summary

✅ Use when: testing summarization, paraphrasing, translation, etc.

💬 b. Response Consistency

Checks whether the model gives similar answers to similar inputs.

if similarity_score < 0.8:
    print("Model inconsistency detected ⚠️")

✅ Use when: testing reliability or reproducibility.
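One possible way to produce such a consistency score is to collect several responses (from repeated runs or from paraphrased inputs) and average their pairwise similarity. The sketch below uses difflib for a purely lexical comparison; `consistency` is a hypothetical helper, and a semantic scorer could be substituted.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency(responses):
    """Mean pairwise string similarity across responses; 1.0 means all identical."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response is trivially consistent
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```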

⚡ c. Hallucination Rate

Measures how often the model makes up facts or outputs incorrect information.
Can be checked by comparing outputs to trusted sources or gold datasets.

✅ Use when: evaluating knowledge-based or retrieval-augmented models.
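A crude proxy for this metric, assuming each test question has one short gold fact that a correct answer must mention, is the fraction of outputs that omit that fact. The helper name and the substring check are assumptions; real hallucination detection usually needs stronger checks such as entailment models or human review.

```python
def hallucination_rate(outputs, gold_facts):
    """Fraction of outputs that do not mention their expected gold fact."""
    if len(outputs) != len(gold_facts):
        raise ValueError("outputs and gold_facts must have the same length")
    if not outputs:
        return 0.0
    misses = sum(
        1 for out, fact in zip(outputs, gold_facts)
        if fact.lower() not in out.lower()
    )
    return misses / len(outputs)
```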

⏱️ d. Latency and Cost

Even if a model is accurate, it must also be efficient.

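import time

start = time.perf_counter()  # monotonic clock, suited to measuring durations
_ = llm("Explain AI Testing")  # llm: your model-calling function
latency = time.perf_counter() - start  # elapsed seconds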

✅ Use when: optimizing for performance or production use.
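A single timing is noisy, so it is usually worth repeating the call and summarizing. A minimal sketch, where `measure_latency` is a hypothetical helper and `fn` is whatever function calls your model:

```python
import statistics
import time


def measure_latency(fn, prompt, runs=5):
    """Call fn(prompt) several times and summarize wall-clock latency in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(prompt)
        timings.append(time.perf_counter() - start)
    return {"mean": statistics.mean(timings), "max": max(timings), "runs": runs}
```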

🎯 e. Prompt Sensitivity

Measures how sensitive the model’s output is to small changes in prompt phrasing.

✅ Use when: designing robust prompt templates.
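One simple way to quantify this is to score several phrasings of the same question against one ideal answer and look at the spread. In the sketch below, `prompt_sensitivity` is a hypothetical helper; `model_fn` and `score_fn` are whatever model call and scorer you already use.

```python
def prompt_sensitivity(model_fn, prompt_variants, ideal, score_fn):
    """Score each prompt variant against the same ideal and report the spread."""
    scores = [score_fn(model_fn(p), ideal) for p in prompt_variants]
    return {"scores": scores, "spread": max(scores) - min(scores)}
```

A spread near 0 suggests the prompt template is robust to rewording, while a large spread points to phrasing the model handles poorly.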

🧩 2. Visualizing and Interpreting Evals

Use tools like TruLens, Weights & Biases, or even custom dashboards to visualize metrics.

Example: TruLens

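# Sketch against the trulens_eval 0.x API; newer `trulens` releases rename some modules.
from trulens_eval import Tru, Feedback, TruLlama
from trulens_eval.feedback.provider import OpenAI
tru = Tru()
f_relevance = Feedback(OpenAI().relevance).on_input_output()
# query_engine: an existing LlamaIndex query engine (assumed to be defined elsewhere)
tru_app = TruLlama(query_engine, app_id="summary_app", feedbacks=[f_relevance])
with tru_app as recording:
    query_engine.query("Summarize: Testing AI models requires careful evaluation.")
tru.run_dashboard()  # browse recorded scores in the local dashboard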

Benefits:

  • 📈 Visual similarity score distribution
  • 📊 Response consistency trends
  • 🧩 Detect regression after model updates

🔄 3. Closing the Loop: Improving the Model

🧠 a. Prompt Engineering

If your evals show low accuracy or high hallucination:

  • Add context or examples to prompts
  • Use structured formats
  • Re-run evals to validate improvements

Example:

prompt = """
You are an expert QA engineer.
Summarize this text in 1 line without adding new facts.
Text: {input}
"""

🧰 b. Data Augmentation

Expand your test data to include:

  • Edge cases
  • Multilingual examples
  • Ambiguous or contradictory inputs

⚙️ c. Fine-Tuning or RAG

If evals show gaps in domain knowledge:

  • Fine-tune on specific datasets
  • Or use RAG pipelines to fetch real-time context

Example:

context = search_knowledge_base(query)
response = llm(f"Answer using this context: {context}")
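To make the retrieval step concrete, here is a toy version of `search_knowledge_base` that ranks documents by shared words and builds the prompt from the best match. It is an illustration only: the `documents` parameter, `build_prompt`, and the word-overlap ranking are assumptions, and a real pipeline would typically use embeddings and a vector store.

```python
def search_knowledge_base(query, documents, top_k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Put the retrieved context ahead of the question."""
    context = "\n".join(search_knowledge_base(query, documents))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"
```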

🔁 d. Continuous Evaluation in CI/CD

Integrate evals into your deployment pipeline:

if new_model_score < baseline_score * 0.95:
    raise Exception("Model quality regression detected ❌")

This ensures no model goes live without meeting quality benchmarks.
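When several metrics are tracked, the same relative check can be applied per metric against a stored baseline. A minimal sketch, assuming metrics are plain dictionaries of name to score; `find_regressions` and the 5% tolerance are illustrative choices.

```python
def find_regressions(new_metrics, baseline_metrics, tolerance=0.05):
    """Return names of metrics that fell more than `tolerance` (relative) below baseline."""
    regressed = []
    for name, baseline in baseline_metrics.items():
        new = new_metrics.get(name)
        # A metric missing from the new run is treated as a regression.
        if new is None or new < baseline * (1 - tolerance):
            regressed.append(name)
    return regressed
```

An empty list means the candidate model is within tolerance on every tracked metric.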

📘 Example Evaluation Workflow Summary

Step | Description               | Tool/Method
-----|---------------------------|----------------------------------------
1    | Define evaluation dataset | JSONL, CSV
2    | Run evals                 | OpenAI Evals / Custom script
3    | Collect metrics           | Similarity, Consistency, Hallucination
4    | Visualize & analyze       | TruLens / W&B
5    | Improve                   | Prompt tuning / Fine-tuning
6    | Automate                  | CI/CD with eval thresholds

🧭 Final Thought

Evaluating AI isn’t just about numbers — it’s about understanding model behavior.
Every eval gives you insights into why your model behaves the way it does, helping you build systems that are accurate, explainable, and reliable over time.

🏁 Conclusion

AI evals are becoming a core part of quality assurance for machine learning systems.
They bring rigor, repeatability, and accountability to an otherwise fuzzy domain.

By adopting Evals in your QA and CI/CD workflows, you can:

  • Detect regressions early
  • Benchmark multiple models
  • Build trustworthy AI systems
