LLM Evaluation

Master DeepEval in 2025: A 7-Step Implementation Guide

Ready to master LLM evaluation? Our 2025 guide provides a 7-step walkthrough for implementing DeepEval, from setup to custom metrics. Start building reliable AI.


Dr. Alex Carter

AI research scientist specializing in large language model evaluation and reliability engineering.


Building applications with Large Language Models (LLMs) is exciting, but let's be honest: ensuring they're accurate, relevant, and not just making things up can feel like a guessing game. How do you really know if your RAG pipeline is faithful to the source context? How can you systematically prevent regressions when you tweak a prompt?

In 2025, the ad-hoc approach of "it looks good to me" is no longer enough. We need robust, developer-centric tools for evaluation. Enter DeepEval, an open-source framework designed to bring the rigor of unit testing to the world of LLMs.

This guide will walk you through a 7-step process to implement DeepEval, moving you from initial setup to running sophisticated evaluations on your own projects. No fluff, just a practical roadmap.

So, What is DeepEval?

Think of DeepEval as the Pytest for your LLM applications. It’s a Python framework that lets you write evaluation tests for your models and pipelines using a structure you're already familiar with. Its core strengths are:

  • Pytest Integration: It plugs directly into Pytest, so you can run your LLM evaluations just like any other software test.
  • State-of-the-Art Metrics: It comes packed with 14+ pre-built metrics like G-Eval, Faithfulness, Answer Relevancy, and Bias, which use LLMs to score the quality of your outputs.
  • Graded Scores: Instead of a simple pass/fail, each metric returns a score between 0 and 1 (and, for most metrics, a reason explaining it), giving you a nuanced understanding of your model's performance.
  • Customizability: You can easily create your own custom metrics to evaluate aspects unique to your application.

Ready to get your hands dirty? Let's dive in.

The 7-Step Implementation Guide

Step 1: Installation and Setup

Getting started is as simple as a pip install. Open your terminal and run:

pip install deepeval

You'll also need to configure your OpenAI API key, as DeepEval's default metrics use GPT models for evaluation. You can do this by setting an environment variable:

export OPENAI_API_KEY="YOUR_API_KEY"

With that, your environment is ready.
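
If you prefer not to touch your shell profile, you can also set the key from Python before DeepEval runs. A minimal sketch (the placeholder key is yours to supply):

import os

# Set the key for the current process only; DeepEval's default
# GPT-based metrics read it from the environment.
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"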

Step 2: Understanding the Core Concepts

Before writing a test, let's grasp two key components:

  1. LLMTestCase: This is a data structure that holds all the information needed for a single evaluation. It includes things like the user's input, the model's actual_output, the retrieval_context (for RAG), and optionally, an expected_output.
  2. Metrics: These are the heart of DeepEval. A metric is a class that defines a specific quality to measure, such as FaithfulnessMetric or AnswerRelevancyMetric. You pass a list of metrics to your test assertion.

Think of LLMTestCase as the evidence you're presenting in court, and the Metrics as the expert witnesses who analyze that evidence.
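
To make this concrete, here is a minimal sketch of an LLMTestCase carrying the fields described above (all strings are placeholder values):

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",                 # the user's query
    actual_output="The capital of France is Paris.",        # what your LLM returned
    expected_output="Paris",                                 # optional reference answer
    retrieval_context=["France's capital city is Paris."]   # optional, for RAG pipelines
)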

Step 3: Creating Your First Test Case

Let's write a simple test to see it in action. Create a file named test_simple.py.

In this test, we'll check whether a model's output meets criteria we define. DeepEval's GEval metric is perfect for this, as it assesses overall quality against the criteria you provide.


import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# This is the test function that will be run by pytest
def test_basic_summarization():
    # Define our evaluation criteria
    summarization_metric = GEval(
        name="Summarization",
        criteria="Summarization: The output should be a concise summary of the input.",
        evaluation_steps=["Is the summary shorter than the input?", "Does it capture the main points?"],
        # Tell GEval which test case fields to judge
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )

    # Create the test case
    test_case = LLMTestCase(
        input="The quick brown fox jumps over the lazy dog. It was a beautiful day in the neighborhood.",
        actual_output="A fast fox leaps over a sleepy dog on a lovely day.",
        expected_output="A fox jumped over a dog."
    )

    # Run the assertion with the metric
    assert_test(test_case, [summarization_metric])

To run this, simply execute Pytest from your terminal:

pytest test_simple.py

DeepEval will execute the test, use an LLM to score the actual_output against the criteria in GEval, and give you a pass or fail based on the result.
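
DeepEval also ships its own test runner, which you'll see again in Step 6; if you prefer it, the same file can be run with:

deepeval test run test_simple.py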

Step 4: Integrating with Your LLM Application (RAG)

Now for the real power: testing a real-world application. Let's imagine we have a simple RAG (Retrieval-Augmented Generation) pipeline that answers questions based on a provided context.

First, here's our hypothetical RAG function:


# Assume this function exists in your application (e.g., in a file named `rag_pipeline.py`)
def answer_question(query: str, context: str) -> str:
    # In a real app, you'd pass the query and context to an LLM
    # For this example, we'll just simulate a response.
    if "Paris" in query:
        return "The Eiffel Tower is the most famous landmark in Paris, which is the capital of France."
    return "I'm not sure about that."

Now, let's write a DeepEval test for it in a new file, test_rag.py.


import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
# from your_app.rag_pipeline import answer_question # Import your actual function

# Mock function for demonstration
def answer_question(query: str, context: str) -> str:
    return "The Eiffel Tower is the most famous landmark in Paris, which is the capital of France."


@pytest.mark.parametrize(
    "query, context",
    [("What is the most famous landmark in Paris?", "Paris is the capital of France and is known for its art, fashion, and landmarks, including the iconic Eiffel Tower.")]
)
def test_rag_pipeline(query, context):
    # 1. Get the actual output from our RAG pipeline
    actual_output = answer_question(query, context)

    # 2. Create the LLMTestCase
    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        retrieval_context=[context] # Note: retrieval_context is a list of strings
    )

    # 3. Define metrics and run the test (we'll do this in the next step)
    # For now, let's just create the test case
    assert test_case is not None, "Test case creation failed"
    print("Test case created successfully for RAG pipeline.")

This structure separates your application logic (answer_question) from your evaluation logic (the test function). We're now ready to add powerful metrics.

Step 5: Using Built-in Evaluation Metrics

Let's upgrade our RAG test from Step 4. We want to ensure two things:

  1. Faithfulness: Does the answer stick to the facts in the provided retrieval_context? (i.e., no hallucinations)
  2. Answer Relevancy: Is the answer actually relevant to the user's input query?

DeepEval has metrics for both. Let's add them to our test.


# ... (imports and mock function from previous step)

@pytest.mark.parametrize(
    "query, context",
    [("What is the most famous landmark in Paris?", "Paris is the capital of France and is known for its art, fashion, and landmarks, including the iconic Eiffel Tower.")]
)
def test_rag_pipeline_with_metrics(query, context):
    actual_output = answer_question(query, context)

    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        retrieval_context=[context]
    )

    # Define our metrics with a passing threshold
    # The threshold is the minimum score (from 0 to 1) required for the test to pass
    metrics = [
        FaithfulnessMetric(threshold=0.8),
        AnswerRelevancyMetric(threshold=0.8)
    ]

    # Run the test with the defined metrics
    assert_test(test_case, metrics)

Now, when you run pytest test_rag.py, DeepEval will execute the test, send the test case data to an LLM, and calculate scores for both faithfulness and answer relevancy. If either score falls below 0.8, the test will fail.
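
If you'd rather run the same evaluation outside of Pytest (say, in a notebook or a one-off script), DeepEval also exposes an evaluate function that accepts the same test cases and metrics. A minimal sketch, reusing the test_case and metrics objects from the test above:

from deepeval import evaluate

# Runs every metric against every test case and prints the results,
# no Pytest session required.
evaluate(test_cases=[test_case], metrics=metrics)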

Step 6: Running and Analyzing Your Tests

Running tests is simple with the Pytest command. However, DeepEval provides its own CLI for a more integrated experience, including logging to the web dashboard (if you choose to use it).


# Login to the Confident AI dashboard (optional)
deepeval login

# Run your tests with DeepEval's test runner (plain pytest also works)
deepeval test run test_rag.py

The terminal output will be rich with information:


✅ Test test_rag_pipeline_with_metrics

✨ Metrics Scores
- FaithfulnessMetric: 0.95 (threshold=0.8)
- AnswerRelevancyMetric: 0.92 (threshold=0.8)

✅ TEST PASSED

This detailed output allows you to pinpoint exactly which aspect of your LLM's response is underperforming. A low faithfulness score might mean your model is hallucinating, while a low relevancy score could indicate a problem with your prompt or retrieval system.
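
When a score does come back low, it helps to run a single metric on its own and read its explanation; most built-in metrics record a reason alongside the score. A minimal sketch, reusing the test_case from Step 4:

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)

print(metric.score)   # e.g. 0.95
print(metric.reason)  # the judge LLM's explanation for the score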

Step 7: Leveling Up with Custom Metrics

What if you need to measure something very specific, like whether your chatbot's response adheres to a certain brand voice? DeepEval's built-in metrics might not cover this. That’s when you create a custom metric.

Creating a custom metric involves subclassing BaseMetric and implementing the measure method. Here’s a conceptual skeleton:


from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class BrandVoiceMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.criteria = "The response should be friendly, professional, and avoid using slang."

    def measure(self, test_case: LLMTestCase) -> float:
        # Use an LLM to score the actual_output against the criteria
        # This is a simplified example. You'd use an LLM call here.
        prompt = f"Score the following response on a scale of 0-1 for adhering to the brand voice: '{self.criteria}'. Response: {test_case.actual_output}"
        # ... LLM call to get a score ...
        score = 0.9 # a simulated score from the LLM
        self.success = score >= self.threshold
        self.score = score
        return self.score
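
    # Async counterpart that DeepEval can call when running metrics asynchronously;
    # here it simply falls back to the synchronous measure method.
    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)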

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Brand Voice"

While the full implementation requires a bit more detail (see the official docs), this shows the core idea: you have complete control to define what “good” means for your application.
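
Once defined, a custom metric plugs into assert_test exactly like the built-in ones, and can be mixed with them; the threshold below is illustrative:

brand_voice_metric = BrandVoiceMetric(threshold=0.7)

# Custom and built-in metrics can be combined in a single assertion.
assert_test(test_case, [brand_voice_metric, AnswerRelevancyMetric(threshold=0.8)])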

Conclusion: Test, Don't Guess

DeepEval transforms LLM evaluation from a subjective art into a repeatable science. By integrating it into your development workflow, you can catch regressions early, objectively measure improvements, and build more reliable and trustworthy AI applications.

You've now seen the full 7-step process: from a simple install to crafting tests for a RAG pipeline and even conceptualizing custom metrics. The next step is to apply it to your own project. Stop guessing if your LLM is performing well—start testing it.
