Artificial Intelligence is revolutionizing software development — but how do we test AI models effectively?
Unlike traditional software, AI models are often non-deterministic: the same input can produce different, equally valid responses. Exact-match assertions alone are therefore not enough.
That’s where Evals come in — a structured way to evaluate, compare, and benchmark AI models consistently.
⚙️ What Are “Evals” in AI Testing?
Evals (short for Evaluations) are systematic frameworks used to:
- Measure the accuracy, robustness, and consistency of AI models
- Benchmark models using standard datasets or custom tests
- Identify regressions or performance drops after model updates
- Support automated testing pipelines for LLMs and AI applications
Think of evals as unit tests for AI, but with graded scores instead of strict pass/fail checks.
💡 Why Traditional Testing Fails for AI
| Traditional Testing | AI Testing |
|---|---|
| Deterministic outputs | Probabilistic outputs |
| Pass/Fail assertions | Graded or fuzzy correctness |
| Static code paths | Learned model behavior |
| Simple test cases | Contextual and ambiguous scenarios |
Example:
# Traditional software test
assert add(2, 3) == 5 # always true
# AI model test
response = llm("What’s 2 + 3?")
assert "5" in response # substring check rather than exact equality
AI testing requires graded scoring or semantic comparison rather than simple boolean checks.
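To make graded scoring concrete, here is a minimal sketch that turns a similarity score into a pass/fail decision. It uses Python's standard `difflib` as a lexical stand-in for a semantic metric; the function names and the 0.7 threshold are illustrative assumptions, not part of any framework.

```python
import difflib


def graded_score(response: str, ideal: str) -> float:
    """Return a similarity in [0, 1] between a response and a reference answer."""
    # Normalize case and whitespace so trivial formatting differences are ignored.
    a = " ".join(response.lower().split())
    b = " ".join(ideal.lower().split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes(response: str, ideal: str, threshold: float = 0.7) -> bool:
    """Turn the graded score into a pass/fail decision with a tunable threshold."""
    return graded_score(response, ideal) >= threshold
```

The threshold is the knob to tune: a stricter value catches more regressions but also rejects more valid paraphrases.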
🧩 Popular Evaluation Frameworks
- OpenAI Evals – OpenAI's open-source framework and registry of benchmarks for evaluating LLMs
- LangChain’s Evaluators – Built into the LangChain framework for prompt chains
- TruLens – A powerful tool for tracking model quality, bias, and consistency
- Helicone / PromptLayer – Logging and analytics tools that can support custom evals
- Custom Evals with Python – Build your own eval harness using scripts and datasets
🚀 Setting Up Evals Using OpenAI Evals
🧰 Step 1: Install the Evals Package
pip install evals
🧾 Step 2: Create a Sample Prompt & Dataset
Let’s test how well a model summarizes text.
data/sample_eval.json (JSON Lines format: one JSON object per line)
{"input": "Artificial Intelligence enables machines to learn from data.", "ideal": "AI allows computers to learn from data."}
{"input": "Testing AI models requires careful evaluation.", "ideal": "AI model testing needs structured evaluation."}
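Because each line of the dataset is a separate JSON object, it helps to validate the file before running anything against a model. The following is a small sketch; `parse_jsonl` is a hypothetical helper, and the required keys are assumed to be `input` and `ideal` as in the samples above.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of samples, validating required keys."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        sample = json.loads(line)
        missing = {"input", "ideal"} - sample.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        samples.append(sample)
    return samples
```

To use it on a file, read the file's contents as a string and pass that string to `parse_jsonl`.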
⚡ Step 3: Create a Custom Eval Script
evals/summary_eval.py
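# Written against the open-source openai/evals repository; class and method
# names may differ between versions, so check them against your installed copy.
import difflib

import evals
import evals.record
from evals.api import CompletionFn


class SummaryEval(evals.Eval):
    def __init__(self, completion_fns: list[CompletionFn], samples_jsonl: str, *args, **kwargs):
        super().__init__(completion_fns, *args, **kwargs)
        self.samples_jsonl = samples_jsonl  # dataset path, supplied by the registry entry

    def eval_sample(self, sample, *_):
        prompt = f"Summarize: {sample['input']}"
        result = self.completion_fn(prompt=prompt)
        response = result.get_completions()[0].strip()
        ideal = sample["ideal"]
        # String-based similarity in [0, 1]; 1.0 means identical text.
        # difflib is a stand-in here; swap in a semantic metric if preferred.
        score = difflib.SequenceMatcher(None, response, ideal).ratio()
        evals.record.record_metrics(score=score)
        return score

    def run(self, recorder):
        # Called by the oaieval CLI: score every sample, then report the mean.
        samples = evals.get_jsonl(self.samples_jsonl)
        scores = self.eval_all_samples(recorder, samples)
        return {"average_score": sum(scores) / len(scores)}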
⚙️ Step 4: Run the Eval
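oaieval expects a completion function (here, a model name) followed by the name of a registered eval, not file paths. Register the eval class and dataset in a registry YAML file first; summary-eval below is an assumed name for that entry:
oaieval gpt-4 summary-eval
Illustrative output (the exact log format depends on the Evals version):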
Running eval SummaryEval on 2 samples...
Sample 1: score=0.9
Sample 2: score=0.8
Average Score: 0.85
You now have a quantitative metric for how well your model summarizes!
🧮 Building Custom Evals Without OpenAI Evals
You can also build your own lightweight harness in plain Python, scoring outputs with embedding cosine similarity or BLEU.
Here’s a simple custom example:
from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; downloaded on first use
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(a, b):
    """Cosine similarity between the sentence embeddings of a and b."""
    embeddings = model.encode([a, b])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

# Example use:
pred = "AI helps computers learn from data"
ideal = "Artificial Intelligence enables machines to learn from data"
score = semantic_similarity(pred, ideal)
print(f"Semantic similarity: {score:.2f}")
Example output (the exact value depends on the model version):
Semantic similarity: 0.91
This provides semantic grading, which tolerates paraphrases that exact matching would reject.
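A scorer becomes a harness once it is applied to a whole dataset. The sketch below shows one possible shape; `run_eval`, `token_f1`, and `echo_model` are hypothetical names. `token_f1` is a cheap lexical scorer used here so the example runs without downloads, and any function with the same `(prediction, ideal)` signature, including an embedding-based one, can be passed instead.

```python
from statistics import mean
from typing import Callable


def token_f1(pred: str, ideal: str) -> float:
    """Token-overlap F1: a cheap lexical stand-in for a semantic scorer."""
    p, r = pred.lower().split(), ideal.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)


def run_eval(model_fn: Callable[[str], str], samples: list[dict],
             scorer: Callable[[str, str], float] = token_f1) -> dict:
    """Score every sample and return per-sample scores plus their average."""
    scores = [scorer(model_fn(s["input"]), s["ideal"]) for s in samples]
    return {"scores": scores, "average": mean(scores)}


# Hypothetical stand-in for a real model call: echoes the input unchanged.
def echo_model(prompt: str) -> str:
    return prompt
```

Keeping the model call and the scorer as plain callables makes it easy to compare several models or metrics on the same samples.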
🧱 Integrating Evals into CI/CD
You can automate model testing in your DevOps pipeline:
- Run evals as GitHub Actions
- Fail builds if average accuracy < threshold
- Store metrics in dashboards
Example GitHub Action:
name: Run Evals
on: [push, pull_request]
jobs:
  test-ai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install evals
      - name: Run model evals
        # summary-eval is an assumed name; it must match an eval registered in your registry YAML
        run: oaieval gpt-4 summary-eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
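A CI step fails when its command exits with a non-zero status, so the threshold check can be a very small script. This is a sketch under that assumption; `gate` is a hypothetical helper, and how the scores reach it (a results file, a custom harness) is left open.

```python
from statistics import mean


def gate(scores: list[float], threshold: float) -> int:
    """Return a process exit code: 0 if the average meets the threshold, else 1."""
    average = mean(scores)
    print(f"average={average:.3f} threshold={threshold:.3f}")
    return 0 if average >= threshold else 1
```

In a pipeline script, finishing with `sys.exit(gate(scores, 0.8))` would make the job fail whenever the average drops below 0.8.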
📊 Evaluating Results and Improving Models
After running your evals, you’ll typically get quantitative scores (like accuracy or similarity) and qualitative insights (like how natural or factual the outputs feel).
This section is about interpreting those results and using them to improve your AI models.
🧮 1. Key Metrics to Track
🧠 a. Accuracy / Similarity Score
Measures how close the model’s output is to the “ideal” answer.
Example:
similarity = 0.87 # High similarity = good summary
✅ Use when: testing summarization, paraphrasing, translation, etc.
💬 b. Response Consistency
Checks whether the model gives similar answers to similar inputs.
# similarity_score: similarity between two responses to the same or a paraphrased input
if similarity_score < 0.8:
    print("Model inconsistency detected ⚠️")
✅ Use when: testing reliability or reproducibility.
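One possible way to compute such a consistency score is the mean pairwise similarity across several responses to the same (or a paraphrased) prompt. The sketch below uses `difflib` as a lexical stand-in for a semantic metric; `consistency` is a hypothetical helper name.

```python
import difflib
from itertools import combinations
from statistics import mean


def consistency(responses: list[str]) -> float:
    """Mean pairwise similarity of responses; 1.0 means all are identical."""
    if len(responses) < 2:
        return 1.0  # a single response is trivially consistent with itself
    pairs = combinations(responses, 2)
    return mean(difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs)
```

Collect the responses by calling the model several times, then compare the result against a threshold such as the 0.8 used above.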
⚡ c. Hallucination Rate
Measures how often the model makes up facts or outputs incorrect information.
Can be checked by comparing outputs to trusted sources or gold datasets.
✅ Use when: evaluating knowledge-based or retrieval-augmented models.
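As a rough illustration of the gold-dataset approach, the sketch below counts an answer as a hallucination when it contains none of the accepted facts for its question. This is a crude substring proxy, not a full fact checker, and `hallucination_rate` is a hypothetical helper.

```python
def hallucination_rate(answers: list[str], gold: list[set[str]]) -> float:
    """Fraction of answers that contain none of their accepted gold facts."""
    if len(answers) != len(gold):
        raise ValueError("answers and gold must have the same length")
    misses = sum(
        not any(fact.lower() in answer.lower() for fact in facts)
        for answer, facts in zip(answers, gold)
    )
    return misses / len(answers) if answers else 0.0
```

Substring matching misses paraphrased facts and cannot spot extra invented claims, so treat the result as a lower bound and spot-check samples by hand.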
⏱️ d. Latency and Cost
Even if a model is accurate, it must also be efficient.
import time

start = time.perf_counter()
_ = llm("Explain AI Testing")
latency = time.perf_counter() - start  # seconds
✅ Use when: optimizing for performance or production use.
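A single timing is noisy, and cost depends on token counts as well as time. The sketch below repeats the call and adds a simple cost estimate; both helpers are hypothetical, and the per-1K-token prices are parameters you would fill in from your provider's current pricing.

```python
import time
from statistics import median


def measure_latency(model_fn, prompt: str, runs: int = 5) -> dict:
    """Call model_fn several times and summarize wall-clock latency in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model_fn(prompt)
        timings.append(time.perf_counter() - start)
    return {"median": median(timings), "max": max(timings)}


def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Estimate the cost of one call from token counts and per-1K-token prices."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)
```

Tracking the median alongside the maximum helps separate typical latency from occasional slow calls.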
🎯 e. Prompt Sensitivity
Measures how sensitive the model’s output is to small changes in prompt phrasing.
✅ Use when: designing robust prompt templates.
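One way to quantify prompt sensitivity is to score several phrasings of the same request against a single reference and look at the spread. In this sketch, `prompt_sensitivity` is a hypothetical helper, and `model_fn` and `scorer` are whatever model call and metric you already use.

```python
from statistics import pstdev


def prompt_sensitivity(model_fn, variants: list[str], ideal: str, scorer) -> dict:
    """Score each prompt variant against the same reference and report the spread."""
    scores = [scorer(model_fn(v), ideal) for v in variants]
    return {
        "scores": scores,
        "range": max(scores) - min(scores),
        "stdev": pstdev(scores),
    }
```

A large range suggests the prompt template is fragile: small rewordings change the quality of the answer.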
🧩 2. Visualizing and Interpreting Evals
Use tools like TruLens, Weights & Biases, or even custom dashboards to visualize metrics.
Example: TruLens
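# Sketch for the trulens_eval package; names may differ in later TruLens releases.
from trulens_eval import Feedback, Tru, TruBasicApp

tru = Tru()
# Feedback wraps a scoring callable; here it compares each prompt with its response
feedback = Feedback(semantic_similarity, name="semantic_similarity").on_input_output()
recorder = TruBasicApp(llm, app_id="summarizer", feedbacks=[feedback])
with recorder as recording:
    for sample in dataset:
        recorder.app(sample["input"])
tru.run_dashboard()  # open the local dashboard to inspect the recorded scores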
Benefits:
- 📈 Visual similarity score distribution
- 📊 Response consistency trends
- 🧩 Detect regression after model updates
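When a full dashboard is more than you need, even a text histogram shows whether scores cluster near the top or have a long tail of failures. This is a minimal sketch; `score_histogram` is a hypothetical helper and it assumes scores lie in the range 0 to 1.

```python
from collections import Counter


def score_histogram(scores: list[float], bins: int = 5) -> list[str]:
    """Bucket scores in [0, 1] into equal-width bins and render one text bar per bin."""
    # A score of exactly 1.0 is clamped into the last bin.
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    lines = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        lines.append(f"{lo:.1f}-{hi:.1f} | {'#' * counts[b]}")
    return lines
```

Printing the returned lines one per row before and after a model update gives a quick visual check for regressions.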
🔄 3. Closing the Loop: Improving the Model
🧠 a. Prompt Engineering
If your evals show low accuracy or high hallucination:
- Add context or examples to prompts
- Use structured formats
- Re-run evals to validate improvements
Example:
prompt = """
You are an expert QA engineer.
Summarize this text in 1 line without adding new facts.
Text: {input}
"""
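Adding examples to a prompt is easier to test when the prompt is assembled by a function rather than edited by hand. The sketch below is one possible approach; `build_prompt` is a hypothetical helper, and the "Text:" / "Summary:" layout is an assumed format.

```python
from typing import Sequence


def build_prompt(text: str, examples: Sequence[tuple[str, str]] = ()) -> str:
    """Assemble a summarization prompt with optional few-shot examples."""
    parts = [
        "You are an expert QA engineer.",
        "Summarize this text in 1 line without adding new facts.",
    ]
    # Each example pairs a source text with the summary we would like to see.
    for source, summary in examples:
        parts.append(f"Text: {source}\nSummary: {summary}")
    parts.append(f"Text: {text}\nSummary:")
    return "\n\n".join(parts)
```

Running the same eval with and without examples shows whether the added context actually improves the score.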
🧰 b. Data Augmentation
Expand your test data to include:
- Edge cases
- Multilingual examples
- Ambiguous or contradictory inputs
⚙️ c. Fine-Tuning or RAG
If evals show gaps in domain knowledge:
- Fine-tune on specific datasets
- Or use RAG pipelines to fetch real-time context
Example:
context = search_knowledge_base(query)
# Include the question itself, not only the retrieved context
response = llm(f"Answer using this context: {context}\n\nQuestion: {query}")
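To show the retrieval step end to end, here is a toy sketch that ranks documents by keyword overlap and builds the final prompt. Real RAG pipelines typically use embedding search instead. Both functions are hypothetical, and this `search_knowledge_base` takes the document list as an explicit argument, unlike the one-argument call above.

```python
def search_knowledge_base(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many query words they share (toy keyword retrieval)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Put the retrieved context and the question into one prompt."""
    context = "\n".join(search_knowledge_base(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Evals for a RAG system can then score retrieval (was the right document returned?) separately from generation (was the answer faithful to it?).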
🔁 d. Continuous Evaluation in CI/CD
Integrate evals into your deployment pipeline:
if new_model_score < baseline_score * 0.95:
    raise RuntimeError("Model quality regression detected ❌")
This blocks any release whose score falls more than 5% below the baseline.
📘 Example Evaluation Workflow Summary
| Step | Description | Tool/Method |
|---|---|---|
| 1 | Define evaluation dataset | JSONL, CSV |
| 2 | Run evals | OpenAI Evals / Custom script |
| 3 | Collect metrics | Similarity, Consistency, Hallucination |
| 4 | Visualize & analyze | TruLens / W&B |
| 5 | Improve | Prompt tuning / Fine-tuning |
| 6 | Automate | CI/CD with eval thresholds |
🧭 Final Thought
Evaluating AI isn’t just about numbers — it’s about understanding model behavior.
Every eval gives you insights into why your model behaves the way it does, helping you build systems that are accurate, explainable, and reliable over time.
🏁 Conclusion
AI evals are the future of quality assurance in machine learning.
They bring rigor, repeatability, and accountability to an otherwise fuzzy domain.
By adopting Evals in your QA and CI/CD workflows, you can:
- Detect regressions early
- Benchmark multiple models
- Build trustworthy AI systems