deepeval - 🎉 New Interfaces, Reduce ETL Code < 50%!

Less Code to Load Data In and Out of DeepEval's Ecosystem :)

If you're using any of the features below, you'll likely see a 50% reduction in code required, especially around ETL for formatting things in and out of DeepEval's ecosystem. This includes:

🆚 Arena-GEval

The first LLM-arena-as-a-Judge metric now runs a blinded experiment and randomly swaps contestant positions for a fair verdict on which LLM output is better.

Docs: https://deepeval.com/docs/metrics-arena-g-eval
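
To make the blinding idea concrete, here is a minimal sketch of how a pairwise judge could hide contestant names and randomize their order. This is not DeepEval's implementation: `blind_contestants`, `unblind_winner`, and the "Contestant N" alias scheme are hypothetical names chosen for illustration, and the "judge" is a trivial stand-in for an LLM call.

```python
import random

def blind_contestants(contestants, rng):
    """Shuffle contestants and replace their names with neutral aliases.

    Returns (blinded, alias_to_name): `blinded` maps alias -> output in
    shuffled order, `alias_to_name` maps alias -> real contestant name.
    """
    names = list(contestants)
    rng.shuffle(names)  # random positions, so no contestant is always first
    blinded, alias_to_name = {}, {}
    for index, name in enumerate(names):
        alias = f"Contestant {index + 1}"
        blinded[alias] = contestants[name]
        alias_to_name[alias] = name
    return blinded, alias_to_name

def unblind_winner(alias, alias_to_name):
    """Map the judge's chosen alias back to the real contestant name."""
    return alias_to_name[alias]

outputs = {
    "GPT-4": "Paris",
    "Claude-4": "Paris is the capital of France.",
}
blinded, alias_to_name = blind_contestants(outputs, random.Random(0))

# Stand-in judge (assumption): pick the longest answer instead of calling an LLM.
judge_pick = max(blinded, key=lambda alias: len(blinded[alias]))
winner = unblind_winner(judge_pick, alias_to_name)
print(winner)  # Claude-4
```

Because the judge only ever sees aliases in a shuffled order, neither the model name nor the position can systematically bias the verdict.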

⚛️ You can now run component-level evals by simply running a for loop against your dataset of goldens.

Simply run your loop -> call your agent X number of times -> get your evaluation results. No more forcing your data into test-case-friendly formats. Instead, DeepEval will find your LLM traces automatically to run evals.

```python
import asyncio

from somewhere import your_async_llm_app  # Replace with your async LLM app
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="...")])

for golden in dataset.evals_iterator():
    # Create task to invoke your async LLM app
    task = asyncio.create_task(your_async_llm_app(golden.input))
    dataset.evaluate(task)
```

Docs: https://deepeval.com/docs/evaluation-component-level-llm-evals

💬 Conversation simulator is now based on goldens.

Previously you had to define a list of user intentions and profile items, with a ton of extra configs to juggle. Now you can define a list of goldens as a standardized benchmark of scenarios to have turns generated for.

```python
from deepeval.test_case import Turn
from deepeval.dataset import ConversationalGolden
from deepeval.simulator import ConversationSimulator

# Create ConversationalGolden
conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

# Define chatbot callback
async def chatbot_callback(input):
    return Turn(role="assistant", content=f"Chatbot response to: {input}")

# Run Simulation
simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=[conversation_golden])
print(conversational_test_cases)
```

Docs: https://deepeval.com/docs/conversation-simulator

We also updated our docs with more improvements to come 👀

- Python
Published by penguine-ip 12 months ago

deepeval - 🎉 Renewed datasets, single vs multi-turn

⚙️ New Features

DeepEval's 3.2.6 release focuses on single-turn vs. multi-turn use cases in datasets!

🧩 Support for Single-Turn and Multi-Turn Datasets

Single-turn datasets: Simple input → output pairs for one-off prompt testing.
Multi-turn datasets: Full conversation flows with alternating user/assistant turns. Perfect for simulating real chat interactions.

DeepEval now automatically detects whether a dataset is single-turn or multi-turn based on structure and routes to the appropriate evaluation logic.

🧪 Conversational Goldens

Introduced a new concept: conversational goldens, which contain a scenario (and optionally an expected_outcome), but not fields like input and expected output, which belong to single-turn use cases.
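
As a rough illustration of structure-based routing, the sketch below tells the two kinds of goldens apart by their fields. The dataclasses are simplified stand-ins, not DeepEval's classes, and `detect_dataset_kind` is a hypothetical helper; the real detection logic may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

@dataclass
class Golden:  # stand-in for a single-turn golden
    input: str
    expected_output: Optional[str] = None

@dataclass
class ConversationalGolden:  # stand-in for a multi-turn golden
    scenario: str
    expected_outcome: Optional[str] = None

def detect_dataset_kind(goldens: Sequence[Union[Golden, ConversationalGolden]]) -> str:
    """Return "single-turn" or "multi-turn", rejecting empty or mixed datasets."""
    if not goldens:
        raise ValueError("cannot detect the kind of an empty dataset")
    # A golden with a `scenario` field is treated as conversational.
    kinds = {"multi-turn" if hasattr(g, "scenario") else "single-turn" for g in goldens}
    if len(kinds) > 1:
        raise ValueError("dataset mixes single-turn and multi-turn goldens")
    return kinds.pop()

print(detect_dataset_kind([Golden(input="What is 2 + 2?")]))  # single-turn
print(detect_dataset_kind([ConversationalGolden(scenario="User wants a refund.")]))  # multi-turn
```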

✅ Improvements

Smarter dataset evaluation routing: Whether single-turn or multi-turn, DeepEval figures it out and builds test cases accordingly.
Improved multi-turn context preservation: Each conversational turn is maintained during evaluation, giving more accurate multi-turn metrics.

This release is setting the stage for future multi-turn use cases.

Docs here: https://deepeval.com/docs/evaluation-datasets

- Python
Published by penguine-ip about 1 year ago

deepeval - 🎉 New Arena GEval Metric, for Pairwise Comparisons

A Metric Like LLM Arena Is Here

In DeepEval's latest release, we are introducing ArenaGEval, the first ever metric that compares test cases and chooses the best-performing one based on your custom criteria.

It looks something like this:

```python
from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

# Run the comparison, then read the verdict off the metric itself
arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
```

Docs here: https://deepeval.com/docs/metrics-arena-g-eval

- Python
Published by penguine-ip about 1 year ago

deepeval - 🎉 New Multimodal Metrics, with Platform Support

In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics!

Previously we had great support for single-turn, text evaluation in the form of LLMTestCases, but now we're adding MLLMTestCase, which accepts images:

```python
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage
from deepeval import evaluate

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paperplane1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paperplane2", local=True),
    ],
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])
```

Docs here: https://deepeval.com/docs/multimodal-metrics-g-eval

PS. This also includes platform support


- Python
Published by penguine-ip about 1 year ago

deepeval - 🎉 New Conversational Evaluation, LiteLLM Integration

In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated.

Previously we treated a conversation as a list of LLMTestCases, which might not necessarily be the case. Now a conversational test case is made up of a list of Turns instead, which follows OpenAI's standard messages format:

```python
from deepeval.test_case import Turn

turns = [Turn(role="user", content="...")]
```

Docs here: https://deepeval.com/docs/evaluation-test-cases#conversational-test-case
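
Since turns mirror the OpenAI messages format, an existing chat log can be mapped onto them almost one to one. The sketch below uses a stand-in `Turn` dataclass with only the `role` and `content` fields so it runs without DeepEval installed; `messages_to_turns` is a hypothetical helper, and dropping system messages is an assumption made for this example, not documented behavior.

```python
from dataclasses import dataclass

@dataclass
class Turn:  # stand-in with the two fields used in these notes: role and content
    role: str
    content: str

def messages_to_turns(messages):
    """Convert OpenAI-style chat messages into Turn objects.

    Assumption: only user/assistant messages count as conversation turns,
    so system prompts and other roles are skipped.
    """
    turns = []
    for message in messages:
        if message["role"] not in ("user", "assistant"):
            continue
        turns.append(Turn(role=message["role"], content=message["content"]))
    return turns

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
turns = messages_to_turns(messages)
print([turn.role for turn in turns])  # ['user', 'assistant']
```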

- Python
Published by penguine-ip about 1 year ago

deepeval - New Loading Bars, And Cloud Storage

Added new loading bars for component-level evals, and deepeval view to see results on Confident AI.

- Python
Published by penguine-ip about 1 year ago

deepeval - LLM Evals - v3.0

🚀 DeepEval v3.0 — Evaluate Any LLM Workflow, Anywhere

We’re excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications — from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.

🔍 Component-Level Evaluation for Agentic Workflows

You can now apply DeepEval metrics to any step of your LLM workflow — tools, memories, retrievers, generators — and monitor them in both development and production.

Evaluate individual function calls, not just final outputs
Works with any framework or custom agent logic
Real-time evaluation in production using observe()
Track sub-component performance over time

📘 Learn more →

🧠 Conversation Simulation

Automatically simulate realistic multi-turn conversations to test your chatbots and agents.

Define model goals and user behavior
Generate labeled conversations at scale
Use DeepEval metrics to assess response quality
Customize turn count, persona types, and more

📘 Try the simulator →

🧬 Generate Goldens from Goldens

Bootstrapping eval datasets just got easier. Now you can exponentially expand your test cases using LLM-generated variants of existing goldens.

Transform goldens into many meaningful test cases
Preserve structure while diversifying content
Control tone, complexity, length, and more

📘 Read the guide →

🔒 Red Teaming Moved to DeepTeam

All red teaming functionality now lives in its own focused project: DeepTeam. DeepTeam is built for LLM security — adversarial testing, attack generation, and vulnerability discovery.

🛠️ Install or Upgrade

`pip install deepeval --upgrade`

🧠 Why v3.0 Matters

DeepEval v3.0 is more than an evaluation framework — it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.

Ready to explore? 📚 Full docs at deepeval.com →

- Python
Published by penguine-ip about 1 year ago

deepeval - G-Eval Rubric

Rubric Available for G-Eval

https://www.deepeval.com/docs/metrics-llm-evals#rubric

- Python
Published by penguine-ip about 1 year ago

deepeval - Cleanup Tracing, Component Evals, Etc.

In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging.

- Python
Published by penguine-ip about 1 year ago

deepeval - v3.0 Pre-Release

🚨 Breaking Changes

⚠️ This release introduces breaking changes in preparation for DeepEval v3.0. Please review carefully and adjust your code as needed.

The `evaluate()` function now has "configs"

Previously the evaluate() function had 13+ arguments to control display, async behaviors, caching, etc. and it was growing out of control. We've now abstracted it into "configs" instead:

```python
from deepeval.evaluate.configs import AsyncConfig
from deepeval import evaluate

evaluate(..., async_config=AsyncConfig(max_concurrent=20))
```

Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#configs-for-evaluate

Red Teaming Officially Migrated to DeepTeam

This shouldn't be a surprise, but DeepTeam now takes care of everything red-teaming related for the foreseeable future. Docs here: https://trydeepteam.com

🥳 New Feature

Dynamic Evaluations for Nested Components

Nested components are a mess to evaluate. In this version, in preparation for v3.0, we introduced dynamic evals, which let you apply a different set of metrics to different components in your LLM application:

```python
import openai  # legacy (pre-1.0) openai SDK, which provides ChatCompletion

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message["content"]

    # Attach a test case to the current span so its metrics can run
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
```

Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#setup-tracing-highly-recommended

- Python
Published by penguine-ip about 1 year ago

deepeval - Dependency Cleaning

Cleaned up dependencies for upcoming 3.0 release:

Removed automatic updates; they are now opt-in: https://www.deepeval.com/docs/miscellaneous
Removed instructor; we double-checked and it wasn't used anywhere
Moved LlamaIndex to an optional dependency, since it is only needed for one module

- Python
Published by penguine-ip about 1 year ago

deepeval - Conversation Simulator

The latest conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterwards, and the simulator works similarly to the goldens synthesizer. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator

- Python
Published by penguine-ip over 1 year ago

deepeval - Better Custom Model Support

What's New 🔥

Migrated default provider models to support Synthesizer
Default model providers are now in a different directory; those using deepeval < 2.5.6 might need to update imports

- Python
Published by penguine-ip over 1 year ago

deepeval - Custom Prompts for Metrics

What's New 🔥

Custom prompt template overriding for all RAG metrics. This was introduced for folks using weaker models for evaluation, or just models in general that don't fit too well with OpenAI's prompt formatting, which is what most of deepeval's metrics are built around. You can still use your favorite metrics and algorithms, but now with a custom template if required. Example here: https://docs.confident-ai.com/docs/metrics-answer-relevancy#customize-your-template
Fixes to our model providers. Now more stable and usable.
save_as() for datasets now saves test cases as well: https://docs.confident-ai.com/docs/evaluation-datasets#save-your-dataset
Bug fixes for Synthesizer
Improvements to prompt templates of DAGMetric: https://docs.confident-ai.com/docs/metrics-dag

- Python
Published by penguine-ip over 1 year ago

deepeval - Faithfulness template experimentation

🥳 The latest feature lets users supply their own custom template to the Faithfulness metric. It is best suited for custom LLMs where text data is highly formatted by data engineers and stored in databases according to different categories.

- Python
Published by penguine-ip over 1 year ago

deepeval - Deterministic LLM-judge metrics

Here are the new features we're bringing to you in the latest release:

💥 Releasing the beta for Deep Acyclic Graph (DAG), a new way in deepeval to build decision trees that produce deterministic outputs for LLM evaluation: https://docs.confident-ai.com/docs/metrics-dag
⚙️ Open-sourcing all LLM red teaming vulnerabilities: https://docs.confident-ai.com/docs/red-teaming-introduction
🪄 Fixes to the synthetic dataset generation pipeline
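
To show why a decision tree makes LLM-judge scoring deterministic, here is a minimal sketch of the idea: each internal node asks one yes/no question and each leaf carries a fixed score, so the same answers always lead to the same score. This is not the DAGMetric API. `Verdict`, `Judgement`, and `score_output` are hypothetical names, and the plain Python predicates stand in for yes/no LLM judgements.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Verdict:
    score: float  # fixed score returned when this leaf is reached

@dataclass
class Judgement:
    question: str
    check: Callable[[str], bool]  # stand-in for a yes/no LLM judgement
    if_yes: "Node"
    if_no: "Node"

Node = Union[Verdict, Judgement]

def score_output(node: Node, output: str) -> float:
    """Walk the tree from `node`, following yes/no answers until a leaf."""
    while isinstance(node, Judgement):
        node = node.if_yes if node.check(output) else node.if_no
    return node.score

root = Judgement(
    question="Is the output non-empty?",
    check=lambda o: bool(o.strip()),
    if_yes=Judgement(
        question="Does it mention Paris?",
        check=lambda o: "paris" in o.lower(),
        if_yes=Verdict(1.0),
        if_no=Verdict(0.5),
    ),
    if_no=Verdict(0.0),
)

print(score_output(root, "Paris is the capital of France."))  # 1.0
print(score_output(root, "London"))  # 0.5
```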

- Python
Published by penguine-ip over 1 year ago

deepeval - Version v2.0

Here are the new features we're bringing to you in the latest release:

⚙️ Automated LLM red teaming, aka vulnerability and security safety scanning. You can now scan for 40+ vulnerabilities using 10+ SOTA attack enhancement techniques in <10 lines of Python code.
🪄 Synthetic dataset generation with a highly customizable synthetic data generation pipeline to cover literally any use case.
🖼️ Multi-modal LLM evaluation, perfect for image editing or text-image use cases.
💬 Conversational evaluation, perfect for evaluating LLM chatbots.
💥 More LLM system metrics: Prompt Alignment (to determine whether your LLM is able to follow instructions specified in your prompt template), Tool Correctness (for agents), and Json Correctness (to validate if LLM outputs conform to your desired schema)

- Python
Published by penguine-ip over 1 year ago

deepeval - Red teaming, safety testing, and improved synthesizer, conversational metrics, multi-modal metrics

In DeepEval 1.4.7, we're releasing:
- LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction
- Improved synthetic data synthesizer, with much more functionality and customizability: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Conversational metrics: Dedicated metrics to evaluate LLM turns
- Multi-modal metrics: Image editing and text-to-image evaluation

- Python
Published by penguine-ip over 1 year ago

deepeval - Agentic Evaluation Metric, Custom Evaluation LLMs, and Async for Synthetic Data Generation

In DeepEval v0.21.74, we have:
- Agentic evaluation metric to evaluate tool calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness
- Pydantic schemas to enforce JSON outputs for custom, smaller LLMs: https://docs.confident-ai.com/docs/guides-using-custom-llms
- Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Tracing integration for LlamaIndex and LangChain: https://docs.confident-ai.com/docs/confident-ai-tracing
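
For intuition on what a tool calling correctness score can look like, here is a deliberately simplified sketch: the fraction of expected tools that the agent actually called. This scoring rule is an assumption for illustration, `tool_correctness` is a hypothetical function, and the real metric may also account for call order, arguments, or outputs.

```python
def tool_correctness(tools_called, expected_tools):
    """Fraction of expected tools that the agent actually called (order ignored)."""
    expected = set(expected_tools)
    if not expected:
        return 1.0  # nothing was expected, so nothing could be missed
    return len(expected & set(tools_called)) / len(expected)

# The agent called one of the two expected tools, plus an unexpected one.
print(tool_correctness(["search", "calculator"], ["search", "calendar"]))  # 0.5
```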

- Python
Published by penguine-ip almost 2 years ago

deepeval - Verbosity in Metrics, Hyperparameter Logging, Improved Synthetic Data Generation, Better Async Support

In DeepEval v0.21.62:
- Added an option to print out intermediate steps during metric execution, which can be configured via the verbose_mode parameter: https://docs.confident-ai.com/docs/metrics-answer-relevancy#example
- Hyperparameters can now be logged to Confident AI via the evaluate() function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters
- Synthetic data generation now gives more realistic results and is more customizable: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data

- Python
Published by penguine-ip about 2 years ago

deepeval - Synthetic Data, Caching, Benchmarks, and GEval improvement

For deepeval's latest release, v0.21.15, we are releasing:
- Synthetic data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Caching. If you're running 10k test cases and it fails at the 9999th test case, you no longer have to rerun the test cases that already completed, as you can just read from cache using the -c flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- Repeats. If you want to repeat each test case for statistical significance, use the -r flag: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
- LLM Benchmarks. Supporting popular benchmarks such as MMLU, HellaSwag, and BIG-Bench Hard so anyone can evaluate ANY model on research-backed benchmarks in a few lines of code.
- G-Eval improvements. The G-Eval metric now supports using logprobs of tokens to find the weighted summed score.
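
The weighted summed score can be sketched as follows: instead of taking the single score token the judge emitted, each candidate score is weighted by the probability the model assigned to it. This is an illustration of the general G-Eval idea, not deepeval's code; `weighted_score` and the sample log probabilities are hypothetical.

```python
import math

def weighted_score(score_logprobs):
    """Probability-weighted average of candidate scores.

    `score_logprobs` maps each candidate score (e.g. 1-5) to the log
    probability the judge model assigned to that score token.
    """
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score token")
    # Normalize, since the returned tokens rarely cover the full distribution.
    return sum(score * p for score, p in probs.items()) / total

# Made-up logprobs: the judge leans towards 4, with some mass on 5 and 3.
logprobs = {4: math.log(0.6), 5: math.log(0.3), 3: math.log(0.1)}
print(round(weighted_score(logprobs), 2))  # 4.2
```

Compared with a hard argmax (which would report 4 here), the weighted score reflects how confident the judge was, giving finer-grained and more stable scores across runs.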