AI & Machine Learning

DeepEval Tutorial 2025: Build 3 Powerful LLM Apps

Ready to build reliable LLM apps? Our 2025 DeepEval tutorial guides you through building and evaluating 3 projects: a RAG system, a chatbot, and a content generator.

Adrian Sharma

AI Engineer and technical writer specializing in MLOps and reliable AI systems.

You’ve done it. After countless hours of prompt engineering and API calls, you’ve built a shiny new LLM application. It works… most of the time. But how do you know it’s good? How do you prevent it from confidently making things up or going off the rails with a slightly different input? Welcome to the single biggest challenge in the Generative AI space: evaluation.

In 2025, saying "it works on my machine" is no longer enough. The novelty of AI-generated text has worn off, and users now demand reliability, accuracy, and safety. To move from a cool prototype to a production-ready product, you need robust, reproducible evaluation. That's where DeepEval comes in.

DeepEval is an open-source framework designed specifically for evaluating Large Language Models. It treats your LLM evaluations like unit tests, making them simple to write, run, and integrate into your CI/CD pipeline. In this tutorial, we're not just talking theory. We're rolling up our sleeves and building (and, more importantly, evaluating!) three powerful, real-world LLM applications.

What is DeepEval and Why Does It Matter in 2025?

Think about traditional software development. You wouldn't ship code without a suite of unit and integration tests to catch bugs and regressions. So why are we so often flying blind with LLMs? DeepEval aims to bring that same rigor to the world of AI.

It provides a rich set of pre-built evaluation metrics that go far beyond simple string matching. These metrics use LLMs themselves to score the quality of your outputs based on criteria like:

  • Faithfulness: Is the output factually consistent with the provided context? (Crucial for RAG)
  • Answer Relevancy: Does the output actually answer the user's question?
  • Toxicity & Bias: Is the output free from harmful or biased language?
  • Summarization: Does a summary capture the key points of the original text without introducing new information?

By integrating these checks directly into your development workflow, you can confidently iterate on your prompts, models, and retrieval strategies, knowing you have a safety net to measure your progress objectively. In 2025, this isn't a luxury; it's a necessity for building trust with your users.
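
To make that concrete, here's a minimal sketch of what scoring a single output with one of these metrics looks like outside a test suite. It assumes you've already configured an LLM provider key (covered in the next section), and the question-and-answer strings are just placeholders:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Wrap a single input/output pair in a test case
test_case = LLMTestCase(
    input="What does the Pro plan cost?",
    actual_output="The Pro plan costs $20 per user per month.",
)

# The metric calls an LLM under the hood to judge the output
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)

print(metric.score)   # a value between 0 and 1
print(metric.reason)  # a natural-language explanation of the score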

Setting Up Your DeepEval Environment

Getting started is refreshingly simple. DeepEval is a Python library, so you can install it with pip. You'll also need to configure an API key for the LLM that DeepEval will use for its own evaluations (meta, right?).

# Install DeepEval
pip install -U deepeval

# Set your LLM provider API key (e.g., OpenAI)
# You can set this as an environment variable
export OPENAI_API_KEY="sk-your-key-here"

With that, you're ready to start building and evaluating. Let's dive into our first project.

Project 1: A Hallucination-Proof RAG System

Retrieval-Augmented Generation (RAG) is a powerful technique for grounding LLMs in factual data. Instead of relying on the model's internal (and potentially outdated) knowledge, you retrieve relevant documents and pass them as context along with the user's query. The problem? LLMs can still "hallucinate," or invent facts, even when given the correct context.
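
If it's been a while since you've wired one up, here's a stripped-down sketch of that retrieve-then-generate flow. The keyword-overlap retriever and the canned generate_answer function are stand-ins for a real vector store and LLM call, not part of DeepEval itself:

# A toy RAG pipeline: retrieve the most relevant document, then answer from it
DOCUMENTS = [
    "The city of Paris serves as the capital of France.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    # Naive keyword-overlap retrieval (a stand-in for embedding-based vector search)
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def generate_answer(query: str, context: list[str]) -> str:
    # Stand-in for an LLM call that answers strictly from the retrieved context
    return f"Based on the provided context: {context[0]}"

retrieval_context = retrieve("What is the capital of France?", DOCUMENTS)
answer = generate_answer("What is the capital of France?", retrieval_context)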

Our goal is to build a simple RAG system and use DeepEval to ensure its output is faithful to the source documents.

Evaluating RAG Faithfulness with DeepEval

Imagine your RAG system answers a query. How do you programmatically check if it made something up? This is where DeepEval's `FaithfulnessMetric` shines. It compares the generated `actual_output` against the `retrieval_context` to see if every claim in the output is supported.

Here’s what a DeepEval test case looks like:

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Your amazing RAG function
def my_rag_system(query: str, context: list[str]) -> str:
    # In a real app, this would involve vector search and an LLM call
    # For this example, we'll simulate an output
    if "Paris" in context[0]:
        return "The capital of France is Paris."
    return "I am not sure what the capital of France is."

def test_rag_faithfulness():
    query = "What is the capital of France?"
    retrieval_context = ["The city of Paris serves as the capital of France."]
    actual_output = my_rag_system(query, retrieval_context)

    faithfulness_metric = FaithfulnessMetric(threshold=0.7)  # Score must be >= 0.7

    # Bundle the query, output, and retrieved documents into a test case;
    # the test passes or fails based on the metric's evaluation
    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        retrieval_context=retrieval_context,
    )
    assert_test(test_case, [faithfulness_metric])

Running this test is as simple as `deepeval test run test_rag.py`. If the output ever includes a claim the context doesn't support (say, a mention of the Eiffel Tower that appears nowhere in the retrieved documents), the `FaithfulnessMetric` will catch it and the test will fail. Now you can tweak your system prompt or retrieval strategy and re-run the test to see if you've fixed the issue!

Project 2: A Nuanced Customer Support Chatbot

For a customer support chatbot, just being factually correct isn't enough. It needs to be helpful, relevant, and maintain a professional, non-toxic tone. A bot that gives a technically correct but snarky or irrelevant answer is worse than no bot at all.

Let's build a chatbot that answers user queries about a product and use DeepEval to evaluate its conversational quality.

Measuring Relevancy and Tone

For this, we'll combine two metrics: `AnswerRelevancyMetric` and `ToxicityMetric`.

  • `AnswerRelevancyMetric` measures how well the output addresses the input query.
  • `ToxicityMetric` scans the output for any form of rude, disrespectful, or unreasonable language.

Here's the test that combines them:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

# Your customer support bot function
def support_chatbot(query: str) -> str:
    # Simulate a response
    if "refund" in query.lower():
        return "To request a refund, please visit our portal at example.com/refunds."
    return "Our product is the best on the market! You should buy another one."

def test_chatbot_quality():
    input_query = "How do I get a refund?"
    actual_output = support_chatbot(input_query)

    # Define our quality criteria
    relevancy_metric = AnswerRelevancyMetric(threshold=0.8)
    toxicity_metric = ToxicityMetric(threshold=0.1)  # We want the toxicity score to stay very low

    # A test can have multiple metrics!
    test_case = LLMTestCase(input=input_query, actual_output=actual_output)
    assert_test(test_case, [relevancy_metric, toxicity_metric])

If we run this test with the query "How do I get a refund?", it will pass. The answer is relevant and not toxic. But if we tested it with a different query where it gives the irrelevant marketing spiel, the `AnswerRelevancyMetric` would fail. This framework allows you to build a comprehensive test suite covering all the conversational qualities you care about.
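
To see the relevancy check earn its keep, here's a sketch of that failing case, reusing the imports and support_chatbot function from the block above. With the simulated bot as written, this test is expected to fail until the bot is improved:

def test_chatbot_relevancy_failure_case():
    # The simulated bot answers anything without "refund" with a marketing spiel,
    # so the relevancy metric should flag this output as off-topic
    input_query = "How do I reset my password?"
    actual_output = support_chatbot(input_query)

    relevancy_metric = AnswerRelevancyMetric(threshold=0.8)

    test_case = LLMTestCase(input=input_query, actual_output=actual_output)
    assert_test(test_case, [relevancy_metric])  # Expected to fail for this bot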

Project 3: A Creative Content Generator with Guardrails

Our final project is a bit more open-ended: an LLM that helps generate creative marketing copy. The challenge here is balancing creativity with control. We want novel and engaging text, but it must stay on-brand, be free of harmful bias, and not stray into controversial territory.

We'll use DeepEval to establish "guardrails" for our creative bot.

Enforcing Creative Guardrails

For this task, we can use metrics like `BiasMetric` and `ToxicityMetric` as safety checks. We can even create custom metrics to check for brand voice or other specific requirements (we'll sketch one at the end of this section). Let's focus first on the `BiasMetric`, which can detect problematic gender, racial, or other biases in the generated text.

from deepeval import assert_test
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

# Your creative copy generator
def generate_marketing_copy(topic: str) -> str:
    # Simulate a biased output for demonstration
    if "developer" in topic:
        return "A great developer is a guy who codes all night and drinks coffee."
    return "Our new software is perfect for everyone!"

def test_for_gender_bias():
    topic = "a good developer"
    actual_output = generate_marketing_copy(topic)

    # The BiasMetric uses an LLM to check for subtle and overt bias
    bias_metric = BiasMetric(threshold=0.1)  # Fail if the bias score exceeds 0.1

    test_case = LLMTestCase(input=topic, actual_output=actual_output)
    assert_test(test_case, [bias_metric])

When this test runs, the `BiasMetric` will analyze the sentence "A great developer is a guy..." and flag it for gender bias, causing the test to fail. By adding these guardrails, you can let your LLM be creative while ensuring its output aligns with your company's values and safety standards.
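
As for the brand-voice check mentioned at the start of this section, one way to sketch it is with DeepEval's GEval metric, which turns a plain-English rubric into an LLM-judged score. The criteria wording and threshold below are illustrative, and the test reuses generate_marketing_copy from the block above:

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_brand_voice():
    topic = "our new project management app"
    actual_output = generate_marketing_copy(topic)

    # GEval evaluates the output against a custom, plain-English rubric
    brand_voice_metric = GEval(
        name="Brand Voice",
        criteria=(
            "Determine whether the copy is friendly and concise, and avoids "
            "unsubstantiated hype such as 'best ever' or 'perfect for everyone'."
        ),
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.6,
    )

    test_case = LLMTestCase(input=topic, actual_output=actual_output)
    assert_test(test_case, [brand_voice_metric])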

Beyond the Prototype: Building with Confidence

Building with Large Language Models is exhilarating, but it's the disciplined practice of evaluation that separates fragile prototypes from robust, production-grade products. As we've seen, DeepEval provides a powerful yet simple framework to move beyond subjective "looks good to me" checks and into the realm of objective, measurable, and automated quality assurance.

The three projects we walked through—a faithful RAG system, a relevant support bot, and a safe creative assistant—are just the beginning. The real power comes when you integrate these tests into your daily workflow, creating a feedback loop that allows you to innovate rapidly without sacrificing reliability.

So, take these concepts, apply them to your own ideas, and start building LLM applications that you and your users can truly trust.
