Recent Releases of deepeval
deepeval - New Interfaces, Reduce ETL Code < 50%!
Less Code to Load Data In and Out of DeepEval's Ecosystem :)
If you're using any of the features below, you'll likely see a 50% reduction in code required, especially around ETL for formatting things in and out of DeepEval's ecosystem. This includes:
Arena-GEval
The first LLM-arena-as-a-judge metric now runs a blinded experiment and randomly swaps positions for a fair verdict on which LLM output is better.
Docs: https://deepeval.com/docs/metrics-arena-g-eval
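To make the "blinded experiment with random position swaps" idea concrete, here is a minimal sketch in plain Python. It is not DeepEval's implementation: `blinded_compare` and `longest_output_judge` are hypothetical names, and the judge is a deterministic stand-in for an LLM.

```python
import random


def blinded_compare(contestants, judge, rng):
    """Pick a winner without showing the judge contestant names or a fixed order."""
    names = list(contestants)
    rng.shuffle(names)  # randomize positions so no slot is systematically favored
    # Map anonymous labels ("A", "B", ...) back to the real contestant names.
    labels = {chr(ord("A") + i): name for i, name in enumerate(names)}
    blinded_outputs = {label: contestants[name] for label, name in labels.items()}
    winning_label = judge(blinded_outputs)  # the judge only ever sees the labels
    return labels[winning_label]


def longest_output_judge(blinded_outputs):
    # Hypothetical stand-in for an LLM judge: prefers the longest answer.
    return max(blinded_outputs, key=lambda label: len(blinded_outputs[label]))


contestants = {
    "GPT-4": "Paris",
    "Claude-4": "Paris is the capital of France.",
}
winner = blinded_compare(contestants, longest_output_judge, random.Random(0))
print(winner)  # Claude-4
```

Because the judge never sees names or a stable ordering, a verdict that changes when positions change would point to position bias rather than output quality.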
You can now run component-level evals by simply running a for loop against your dataset of goldens.
Simply run your loop, call your agent X number of times, and get your evaluation results. No more forcing your data into test-case formats that don't fit. Instead, DeepEval finds your LLM traces automatically to run evals.
```python
import asyncio

from somewhere import your_async_llm_app  # Replace with your async LLM app
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="...")])

for golden in dataset.evals_iterator():
    # Create task to invoke your async LLM app
    task = asyncio.create_task(your_async_llm_app(golden.input))
    dataset.evaluate(task)
```
Docs: https://deepeval.com/docs/evaluation-component-level-llm-evals
Conversation simulator is now based on goldens.
Previously you had to define a list of user intentions and profile items, plus many more configs to juggle. Now you can define a list of goldens, a standardized benchmark of scenarios, to have turns generated for.
```python
from deepeval.dataset import ConversationalGolden
from deepeval.test_case import Turn
from deepeval.simulator import ConversationSimulator

# Create ConversationalGolden
conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

# Define chatbot callback
async def chatbot_callback(input):
    return Turn(role="assistant", content=f"Chatbot response to: {input}")

# Run simulation
simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=[conversation_golden])
print(conversational_test_cases)
```
Docs: https://deepeval.com/docs/conversation-simulator
We also updated our docs, with more improvements to come.
- Python
Published by penguine-ip 10 months ago
deepeval - Renewed datasets, single vs multi-turn
New Features
DeepEval's 3.2.6 release focuses on single-turn vs. multi-turn use cases in datasets!
Support for Single-Turn and Multi-Turn Datasets
- Single-turn datasets: Simple input → output pairs for one-off prompt testing.
- Multi-turn datasets: Full conversation flows with alternating user/assistant turns. Perfect for simulating real chat interactions.
DeepEval now automatically detects whether a dataset is single-turn or multi-turn based on structure and routes to the appropriate evaluation logic.
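As a rough illustration of structure-based routing, the sketch below classifies a dataset by inspecting its records. It assumes goldens are plain dicts and that a `scenario` key marks a multi-turn golden; `detect_dataset_kind` is a hypothetical helper, not DeepEval's actual detection logic.

```python
def detect_dataset_kind(goldens):
    """Classify a list of golden dicts as 'single-turn' or 'multi-turn'."""
    if not goldens:
        raise ValueError("cannot classify an empty dataset")
    # Assumption: multi-turn goldens carry a scenario, single-turn ones an input.
    kinds = {"multi-turn" if "scenario" in golden else "single-turn" for golden in goldens}
    if len(kinds) > 1:
        raise ValueError("dataset mixes single-turn and multi-turn goldens")
    return kinds.pop()


print(detect_dataset_kind([{"input": "What is 2 + 2?"}]))           # single-turn
print(detect_dataset_kind([{"scenario": "User asks for a refund"}]))  # multi-turn
```

Rejecting mixed datasets up front keeps the routing decision unambiguous instead of guessing per record.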
Conversational Goldens
Introduced a new concept: conversational goldens, which contain a scenario (and optionally an expected_outcome) but not fields like input and expected output, as single-turn goldens do.
Improvements
- Smarter dataset evaluation routing: Whether single-turn or multi-turn, DeepEval figures it out and builds test cases accordingly.
- Improved multi-turn context preservation: Each conversational turn is maintained during evaluation, giving more accurate multi-turn metrics.
This release is setting the stage for future multi-turn use cases.
Docs here: https://deepeval.com/docs/evaluation-datasets
- Python
Published by penguine-ip 11 months ago
deepeval - New Arena GEval Metric, for Pairwise Comparisons
A Metric Like LLM Arena Is Here
In DeepEval's latest release, we are introducing ArenaGEval, the first ever metric to compare test cases to choose the best performing one based on your custom criteria.
It looks something like this:
```python
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

# Run the comparison and print the verdict
arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
```
Docs here: https://deepeval.com/docs/metrics-arena-g-eval
- Python
Published by penguine-ip 11 months ago
deepeval - New Multimodal Metrics, with Platform Support
In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics!
Previously we had great support for single-turn, text evaluation in the form of LLMTestCases, but now we're adding MLLMTestCase, which accepts images:
```python
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage
from deepeval import evaluate

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paperplane1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paperplane2", local=True),
    ],
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text are coherent in the actual output.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])
```
Docs here: https://deepeval.com/docs/multimodal-metrics-g-eval
PS. This also includes platform support
- Python
Published by penguine-ip 12 months ago
deepeval - New Conversational Evaluation, LiteLLM Integration
In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated.
Previously we treated a conversation as a list of LLMTestCases, which might not necessarily be the case. Now a conversational test case is made up of a list of Turns instead, which follows OpenAI's standard messages format:
```python
from deepeval.test_case import Turn

turns = [Turn(role="user", content="...")]
```
Docs here: https://deepeval.com/docs/evaluation-test-cases#conversational-test-case
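Since turns mirror the OpenAI messages format, converting between the two is mostly a field-by-field copy. The sketch below uses a hypothetical `TurnSketch` dataclass as a stand-in for DeepEval's `Turn` so it runs without the library; `to_messages` is likewise an illustrative helper, not part of DeepEval.

```python
from dataclasses import dataclass


@dataclass
class TurnSketch:
    # Hypothetical stand-in for deepeval's Turn; only role and content are modeled.
    role: str
    content: str


def to_messages(turns):
    """Convert turns into OpenAI-style message dicts."""
    return [{"role": turn.role, "content": turn.content} for turn in turns]


turns = [
    TurnSketch(role="user", content="Do you ship to Canada?"),
    TurnSketch(role="assistant", content="Yes, we ship to Canada."),
]
print(to_messages(turns))
```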
- Python
Published by penguine-ip 12 months ago
deepeval - New Loading Bars, And Cloud Storage
Added new loading bars for component-level evals, and deepeval view to see results on Confident AI.
- Python
Published by penguine-ip 12 months ago
deepeval - LLM Evals - v3.0
DeepEval v3.0: Evaluate Any LLM Workflow, Anywhere
We're excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications, from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.
Component-Level Evaluation for Agentic Workflows
You can now apply DeepEval metrics to any step of your LLM workflow (tools, memories, retrievers, generators) and monitor them in both development and production.
- Evaluate individual function calls, not just final outputs
- Works with any framework or custom agent logic
- Real-time evaluation in production using observe()
- Track sub-component performance over time
Conversation Simulation
Automatically simulate realistic multi-turn conversations to test your chatbots and agents.
- Define model goals and user behavior
- Generate labeled conversations at scale
- Use DeepEval metrics to assess response quality
- Customize turn count, persona types, and more
Generate Goldens from Goldens
Bootstrapping eval datasets just got easier. Now you can exponentially expand your test cases using LLM-generated variants of existing goldens.
- Transform goldens into many meaningful test cases
- Preserve structure while diversifying content
- Control tone, complexity, length, and more
Red Teaming Moved to DeepTeam
All red teaming functionality now lives in its own focused project: DeepTeam. DeepTeam is built for LLM security: adversarial testing, attack generation, and vulnerability discovery.
Install or Upgrade
```bash
pip install deepeval --upgrade
```
Why v3.0 Matters
DeepEval v3.0 is more than an evaluation framework; it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.
Ready to explore? Full docs at deepeval.com
- Python
Published by penguine-ip about 1 year ago
deepeval - G-Eval Rubric
Rubric Available for G-Eval
https://www.deepeval.com/docs/metrics-llm-evals#rubric
- Python
Published by penguine-ip about 1 year ago
deepeval - Cleanup Tracing, Component Evals, Etc.
In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging.
- Python
Published by penguine-ip about 1 year ago
deepeval - v3.0 Pre-Release
Breaking Changes
This release introduces breaking changes in preparation for DeepEval v3.0. Please review carefully and adjust your code as needed.
The evaluate() function now has "configs"
- Previously the evaluate() function had 13+ arguments to control display, async behaviors, caching, etc. and it was growing out of control. We've now abstracted it into "configs" instead:
```python
from deepeval.evaluate.configs import AsyncConfig
from deepeval import evaluate

evaluate(..., async_config=AsyncConfig(max_concurrent=20))
```
Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#configs-for-evaluate
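To show why grouping options into a config object helps, and what a concurrency cap like `max_concurrent` typically means, here is a minimal standard-library sketch. `AsyncConfigSketch`, `run_all`, and `fake_score` are hypothetical names; this is an assumption about the general pattern, not DeepEval's internals.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AsyncConfigSketch:
    # Hypothetical config object; the field name mirrors the example above.
    max_concurrent: int = 20


async def run_all(test_cases, score, config):
    """Score every test case, never running more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(config.max_concurrent)

    async def guarded(test_case):
        async with semaphore:
            return await score(test_case)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(tc) for tc in test_cases))


async def fake_score(test_case):
    await asyncio.sleep(0)  # placeholder for a real metric call
    return len(test_case)


results = asyncio.run(
    run_all(["a", "bb", "ccc"], fake_score, AsyncConfigSketch(max_concurrent=2))
)
print(results)  # [1, 2, 3]
```

Passing one config object keeps the function signature stable as new async options are added.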
Red Teaming Officially Migrated to DeepTeam
This shouldn't be a surprise, but DeepTeam now takes care of everything related to red teaming for the foreseeable future. Docs here: https://trydeepteam.com
New Feature
Dynamic Evaluations for Nested Components
Nested components are a mess to evaluate. In this version in preparation for v3.0, we introduced dynamic evals, where you can apply a different set of metrics for different components in your LLM application:
```python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    # Legacy (pre-1.0) OpenAI SDK call
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message["content"]

    # Attach a test case to the current span so the metric can evaluate it
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
```
Full docs here: https://www.deepeval.com/docs/evaluation-running-llm-evals#setup-tracing-highly-recommended
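For intuition about how a decorator can attach a different metric set to each nested component, here is a small sketch that does not use DeepEval. `observe_sketch` and the in-memory `SPANS` list are hypothetical, and metrics are represented only by their names.

```python
import functools

SPANS = []  # hypothetical in-memory trace store


def observe_sketch(metrics=()):
    """Record each call of the decorated component along with its metric names."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            SPANS.append(
                {"component": func.__name__, "metrics": list(metrics), "output": result}
            )
            return result
        return wrapper
    return decorator


@observe_sketch(metrics=["contextual_relevancy"])
def retrieve(query):
    return [f"document about {query}"]


@observe_sketch(metrics=["answer_relevancy"])
def generate(query):
    context = retrieve(query)  # nested component, traced separately
    return f"Answer based on {len(context)} document(s)"


generate("pricing")
print([span["component"] for span in SPANS])  # ['retrieve', 'generate']
```

The inner component finishes first, so its span is recorded before the outer one; each span keeps its own metric list for later evaluation.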
- Python
Published by penguine-ip about 1 year ago
deepeval - Dependency Cleaning
Cleaned up dependencies for upcoming 3.0 release:
Removed the automatic updates, it is now opt-in: https://www.deepeval.com/docs/miscellaneous
Removed instructor, double checked and it wasn't used anywhere
Removed LlamaIndex and moved it to optional, only needed for one module
- Python
Published by penguine-ip about 1 year ago
deepeval - Conversation Simulator
The latest conversation simulator simulates fake user interactions to generate conversations on your behalf. These conversations can be used for evaluation right afterwards, and the simulator works similarly to the goldens synthesizer. Docs here: https://docs.confident-ai.com/docs/evaluation-conversation-simulator
- Python
Published by penguine-ip about 1 year ago
deepeval - Better Custom Model Support
What's New
- Migrated default provider models to support Synthesizer
- Default model providers are now in a different directory, those that are using deepeval < 2.5.6 might need to update imports
- Python
Published by penguine-ip about 1 year ago
deepeval - Custom Prompts for Metrics
What's New
- Custom prompt template overriding for all RAG metrics. This was introduced for folks using weaker models for evaluation, or just models in general that don't fit too well with OpenAI's prompt formatting, which is what most of deepeval's metrics are built around. You can still use your favorite metrics and algorithms, but now with a custom template if required. Example here: https://docs.confident-ai.com/docs/metrics-answer-relevancy#customize-your-template
- Fixes to our model providers. Now more stable and usable.
- Including save_as() for datasets to save test cases as well: https://docs.confident-ai.com/docs/evaluation-datasets#save-your-dataset
- Bug fixes for Synthesizer
- Improvements to prompt templates of DAGMetric: https://docs.confident-ai.com/docs/metrics-dag
- Python
Published by penguine-ip about 1 year ago
deepeval - Faithfulness template experimentation
Latest feature to allow users to inject the Faithfulness metric with their custom template. Most suited for custom LLMs where text data is highly formatted by data engineers and stored in databases according to different categories.
- Python
Published by penguine-ip over 1 year ago
deepeval - Deterministic LLM-judge metrics
Here are the new features we're bringing to you in the latest release:
- Releasing the beta for Deep Acyclic Graph (DAG), a new way in deepeval to build decision trees that give deterministic outputs for LLM evaluation: https://docs.confident-ai.com/docs/metrics-dag
- Open-sourcing all LLM red teaming vulnerabilities: https://docs.confident-ai.com/docs/red-teaming-introduction
- Fixes to the synthetic dataset generation pipeline
- Python
Published by penguine-ip over 1 year ago
deepeval - Version v2.0
Here are the new features we're bringing to you in the latest release:
- Automated LLM red teaming, aka vulnerability and security safety scanning. You can now scan for 40+ vulnerabilities using 10+ SOTA attack enhancement techniques in <10 lines of Python code.
- Synthetic dataset generation with a highly customizable synthetic data generation pipeline to cover literally any use case.
- Multi-modal LLM evaluation, perfect for image editing or text-image use cases.
- Conversational evaluation, perfect for evaluating LLM chatbots.
- More LLM system metrics: Prompt Alignment (to determine whether your LLM is able to follow instructions specified in your prompt template), Tool Correctness (for agents), and Json Correctness (to validate if LLM outputs conform to your desired schema)
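The idea behind a JSON correctness check can be sketched with the standard library alone. This is not the metric's implementation: `json_conforms` is a hypothetical helper, and the "schema" is simplified to a mapping of required keys to expected Python types.

```python
import json


def json_conforms(output, required):
    """Return True if output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # A missing key yields None, which fails the isinstance check below.
    return all(isinstance(data.get(key), expected) for key, expected in required.items())


schema = {"name": str, "age": int}
print(json_conforms('{"name": "Ada", "age": 36}', schema))  # True
print(json_conforms('{"name": "Ada"}', schema))              # False
print(json_conforms("not json at all", schema))              # False
```

A real schema validator would also handle nested objects and optional fields; this sketch only covers flat, required keys.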
- Python
Published by penguine-ip over 1 year ago
deepeval - Red teaming, safety testing, and improved synthesizer, conversational metrics, multi-modal metrics
In DeepEval 1.4.7, we're releasing:
- LLM red teaming. Safety test your LLM application for 40+ vulnerabilities with 10+ attack enhancements, docs here: https://docs.confident-ai.com/docs/red-teaming-introduction
- Improved synthetic data synthesizer, with much more functionality and customizability: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Conversational metrics: Dedicated metrics to evaluate LLM turns
- Multi-modal metrics: Image editing and text to image evaluation
- Python
Published by penguine-ip over 1 year ago
deepeval - Agentic Evaluation Metric, Custom Evaluation LLMs, and Async for Synthetic Data Generation
In DeepEval v0.21.74, we have:
- Agentic evaluation metric to evaluate tool calling correctness for LLM agents: https://docs.confident-ai.com/docs/metrics-tool-correctness
- Pydantic Schemas to enforce JSON outputs for custom, smaller LLMs: https://docs.confident-ai.com/docs/guides-using-custom-llms
- Asynchronous support for synthetic data generation: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Tracing integration for LlamaIndex and LangChain: https://docs.confident-ai.com/docs/confident-ai-tracing
- Python
Published by penguine-ip almost 2 years ago
deepeval - Verbosity in Metrics, Hyperparameter Logging, Improved Synthetic Data Generation, Better Async Support
In DeepEval v0.21.62, we:
- added an option to print out intermediate steps during metric execution, which can be configured via the verbose_mode parameter: https://docs.confident-ai.com/docs/metrics-answer-relevancy#example
- added hyperparameter logging to Confident AI via the evaluate() function: https://docs.confident-ai.com/docs/getting-started#optimizing-hyperparameters
- made synthetic data generation more realistic and more customizable: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Python
Published by penguine-ip almost 2 years ago
deepeval - Synthetic Data, Caching, Benchmarks, and GEval improvement
For deepeval's latest release v0.21.15, we release:
- Synthetic data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
- Caching. If you're running 10k test cases and it fails at the 9999th test case, you no longer have to rerun the first 9998 test cases, as you can just read from cache using the -c flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache
- Repeats. If you want to repeat each test case for statistical significance, use the -r flag: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
- LLM Benchmarks. Supporting popular benchmarks such as MMLU, HellaSwag, and BIG-Bench Hard so anyone can evaluate ANY model on research-backed benchmarks in a few lines of code.
- G-Eval improvements. The G-Eval metric now supports using logprobs of tokens to find the weighted summed score.
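The weighted summed score mentioned in the last item can be illustrated as follows. This is a sketch of the general technique (weighting each candidate score by the probability of its token), with the probabilities normalized over the candidates as an assumption; `weighted_score` is a hypothetical helper and the logprobs are made up for the example.

```python
import math


def weighted_score(score_logprobs):
    """Probability-weighted average of candidate scores, given each score token's logprob."""
    probs = {score: math.exp(logprob) for score, logprob in score_logprobs.items()}
    total = sum(probs.values())
    # Normalize over the candidate scores so the weights sum to 1.
    return sum(score * prob / total for score, prob in probs.items())


# Made-up example: the judge puts 25% on a score of 4 and 75% on a score of 5.
example = {4: math.log(0.25), 5: math.log(0.75)}
print(round(weighted_score(example), 2))  # 4.75
```

Compared with taking only the most likely score (5 here), the weighted value keeps information about how confident the judge was.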
- Python
Published by penguine-ip about 2 years ago
deepeval - Async Support for Prod
In deepeval v0.20.85:
- asynchronous support throughout deepeval, and no longer using threads. Users can also call individual metrics asynchronously: https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async
- improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
- strict mode for all metrics!
- improved the evaluate() function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest
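As a loose illustration of the async pattern described in the items above, the sketch below defines a custom model with both a synchronous and an asynchronous generate method and awaits two calls concurrently. `EchoModelSketch` and the `a_generate` method name are assumptions made for this example, not a statement of the required interface.

```python
import asyncio


class EchoModelSketch:
    """Hypothetical custom evaluation model; it does not subclass any deepeval class."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

    async def a_generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for a non-blocking network call
        return f"echo: {prompt}"


async def main():
    model = EchoModelSketch()
    # Both calls are awaited concurrently on one event loop, with no threads involved.
    return await asyncio.gather(model.a_generate("first"), model.a_generate("second"))


outputs = asyncio.run(main())
print(outputs)  # ['echo: first', 'echo: second']
```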
- Python
Published by penguine-ip about 2 years ago
deepeval - Conversational Metrics and Synthetic Data Generation
In DeepEval's latest release, there is now:
- conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. This metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation
- synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset
- Python
Published by penguine-ip about 2 years ago
deepeval - Production Stability
For the newest release, deepeval is now stable for production use:
- reduced package size
- separated functionality of pytest vs deepeval test run command
- included coverage score for summarization
- fixed contextual precision node error
- released docs for better transparency into metrics calculation
- allowed users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example
- fixed bugs with checking for package updates
- Python
Published by penguine-ip over 2 years ago
deepeval - Hugging Face and LlamaIndex integration
For the latest release, DeepEval:
- Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface
- Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD, and by offering metrics in LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex
- Improvements to accuracy and reliability in Faithfulness and Answer Relevancy
- Summarization Metric now offers explanation
- You can now use ANY LLM for evaluation: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
- Python
Published by penguine-ip over 2 years ago
deepeval - LLM-Evals now support all LangChain chatmodels
- LLM-Evals (LLM evaluated metrics) now support all of langchain's chat models.
- LLMTestCase now has execution_time and cost, useful for those looking to evaluate on these parameters
- minimum_score is now threshold instead, meaning you can now create custom metrics that either have a "minimum" or "maximum" threshold
- LlamaIndex tracing integration: https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval
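The difference between a "minimum" and a "maximum" threshold comes down to the direction of the comparison. The helper below is a hypothetical sketch of that idea (`is_successful` and `higher_is_better` are illustrative names, not deepeval API):

```python
def is_successful(score, threshold, higher_is_better=True):
    """Pass when the score is on the right side of the threshold."""
    if higher_is_better:
        return score >= threshold  # threshold acts as a minimum
    return score <= threshold      # threshold acts as a maximum


# A relevancy-style score, where higher is better.
print(is_successful(0.8, threshold=0.7))                          # True
# A latency- or cost-style score, where lower is better.
print(is_successful(0.3, threshold=0.2, higher_is_better=False))  # False
```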
- Python
Published by penguine-ip over 2 years ago
deepeval - ALL RAG Metrics now offer score reasoning, and a lot more.
In this release:
- Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall each offer reasoning for their given score.
- Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai
- New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization
- Pulling datasets from Confident AI now offers an intermediate step for additional data processing before evaluation: https://docs.confident-ai.com/docs/confident-ai-evaluate-datasets#pull-your-dataset-from-confident-ai
- Decoupled imports from transformers, sentence_transformers, and pandas to reduce package size
- Python
Published by penguine-ip over 2 years ago
deepeval - Lots of new features
Lots of new features this release:
- JudgementalGPT now allows for different languages - useful for our APAC and European friends
- RAGAS metrics now support all OpenAI models - useful for those running into context length issues
- LLMEvalMetric now returns a reasoning for its score
- deepeval test run now has hooks that call on test run completion
- evaluate now displays retrieval_context for RAG evaluation
- RAGAS metric now displays metric breakdown for all its distinct metrics
- Python
Published by penguine-ip over 2 years ago
deepeval - Continuous Evaluation
Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze metrics passes / fails
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
- debug evaluation results via LLM traces
- manage evaluation test cases / datasets in one place
- track events to identify live LLM responses in production
- add production events to existing evaluation datasets to strengthen evals over time
- Python
Published by penguine-ip over 2 years ago
deepeval - Evaluate entire datasets
Mid-week bug fixes release with an extra feature:
- run_test now works
- new function evaluate, evaluates a list of test cases (dataset) on metrics you define, all without having to go through the CLI. More info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest
- Python
Published by penguine-ip over 2 years ago
deepeval - Judgemental GPT
In this release, deepeval has added support for:
- JudgementalGPT, a dedicated LLM app developed by Confident AI to perform evaluations more robustly and accurately. JudgementalGPT provides a score and a reason for the score.
- Parallel testing: execute test cases in parallel and speed up evaluation up to 100x.
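To illustrate why running test cases in parallel speeds up evaluation, here is a generic thread-pool sketch. It is not how deepeval schedules tests; `run_in_parallel` and `fake_test` are hypothetical, and the real speed-up depends on how much time each test spends waiting on I/O such as LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(test_cases, run_one, max_workers=4):
    """Run one callable per test case on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the inputs, not completion order.
        return list(pool.map(run_one, test_cases))


def fake_test(test_case):
    # Hypothetical test: passes when the output mentions the expected keyword.
    return test_case["expected"] in test_case["output"]


cases = [
    {"output": "Paris is the capital of France.", "expected": "Paris"},
    {"output": "I am not sure.", "expected": "Berlin"},
]
print(run_in_parallel(cases, fake_test))  # [True, False]
```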
- Python
Published by penguine-ip over 2 years ago
deepeval - v0.20.5
What's Changed
- firewall check for telemetry by @ColabDog in https://github.com/confident-ai/deepeval/pull/200
- hotfix telemetry setup by @ColabDog in https://github.com/confident-ai/deepeval/pull/201
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.20.3...v0.20.5
- Python
Published by ColabDog over 2 years ago
deepeval - v0.20.3
What's Changed
- clean quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/166
- Hotfix.readme by @penguine-ip in https://github.com/confident-ai/deepeval/pull/168
- Freeze typer v by @ColabDog in https://github.com/confident-ai/deepeval/pull/169
- update quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/170
- update sidebar by @ColabDog in https://github.com/confident-ai/deepeval/pull/171
- Fix sidebar by @ColabDog in https://github.com/confident-ai/deepeval/pull/172
- Additional error handling for network issues in initpy by @donaldwasserman in https://github.com/confident-ai/deepeval/pull/177
- Feature/improve assert dx by @ColabDog in https://github.com/confident-ai/deepeval/pull/178
- make query mandatory by @penguine-ip in https://github.com/confident-ai/deepeval/pull/182
- Iimprove prompting flow for api key by @penguine-ip in https://github.com/confident-ai/deepeval/pull/185
- Force query and output by @ColabDog in https://github.com/confident-ai/deepeval/pull/188
- force context to be a list by @ColabDog in https://github.com/confident-ai/deepeval/pull/192
- update version by @ColabDog in https://github.com/confident-ai/deepeval/pull/195
- Features/llmmetric by @penguine-ip in https://github.com/confident-ai/deepeval/pull/197
- add telemetry and sentry by @ColabDog in https://github.com/confident-ai/deepeval/pull/198
- Feature/add configuration by @ColabDog in https://github.com/confident-ai/deepeval/pull/199
New Contributors
- @donaldwasserman made their first contribution in https://github.com/confident-ai/deepeval/pull/177
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.20.0...v0.20.3
- Python
Published by ColabDog over 2 years ago
deepeval - v0.20.0
What's Changed
- Rename HOWTOCONTRIBUTE.md to CONTRIBUTING.md by @penguine-ip in https://github.com/confident-ai/deepeval/pull/164
- add image similarity metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/162
- Feature/add image similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/165
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.19.0...v0.20.0
- Python
Published by ColabDog over 2 years ago
deepeval - v0.19.0
What's Changed
- add guardrails integration by @ColabDog in https://github.com/confident-ai/deepeval/pull/158
- add github workflow results by @ColabDog in https://github.com/confident-ai/deepeval/pull/159
- Feature/add llm eval by @ColabDog in https://github.com/confident-ai/deepeval/pull/161
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.18.0...v0.19.0
- Python
Published by ColabDog over 2 years ago
deepeval - v0.18.0
What's Changed
- Add new customer support example by @ColabDog in https://github.com/confident-ai/deepeval/pull/154
- Add example test case by @ColabDog in https://github.com/confident-ai/deepeval/pull/156
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.17.9...v0.18.0
- Python
Published by ColabDog over 2 years ago
deepeval - v0.17.8
What's Changed
- fix by @ColabDog in https://github.com/confident-ai/deepeval/pull/147
- add context to the API by @ColabDog in https://github.com/confident-ai/deepeval/pull/150
- Resolves #151 by @ColabDog in https://github.com/confident-ai/deepeval/pull/152
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.17.6...v0.17.8
- Python
Published by ColabDog over 2 years ago
deepeval - v0.17.6
What's Changed
- adding length metric including test and documentation by @j-space-b in https://github.com/confident-ai/deepeval/pull/139
- add koala by @ColabDog in https://github.com/confident-ai/deepeval/pull/141
- Feature/update file name by @ColabDog in https://github.com/confident-ai/deepeval/pull/145
- add switch CLI by @ColabDog in https://github.com/confident-ai/deepeval/pull/146
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.17.5...v0.17.6
- Python
Published by ColabDog over 2 years ago
deepeval - v0.17.5
What's Changed
- Hotfix/fix conceptual similarity threshold by @ColabDog in https://github.com/confident-ai/deepeval/pull/133
- Hotfix/fix quickstart example for query creation by @ColabDog in https://github.com/confident-ai/deepeval/pull/130
- remove test utils by @ColabDog in https://github.com/confident-ai/deepeval/pull/134
- fix yml by @ColabDog in https://github.com/confident-ai/deepeval/pull/135
- add docs on evaluatin glegal by @ColabDog in https://github.com/confident-ai/deepeval/pull/136
- update social card image by @ColabDog in https://github.com/confident-ai/deepeval/pull/137
- update ragas by @ColabDog in https://github.com/confident-ai/deepeval/pull/138
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.17.4...v0.17.5
- Python
Published by ColabDog over 2 years ago
deepeval - v0.17.4
What's Changed
- Feature/cache pip dependencies by @ColabDog in https://github.com/confident-ai/deepeval/pull/128
- Hotfix/fix cli by @ColabDog in https://github.com/confident-ai/deepeval/pull/129
- FIx to Singleton class instantiation not considering arguments of the class
Full Changelog: https://github.com/confident-ai/deepeval/compare/v0.17.3...v0.17.4
- Python
Published by ColabDog over 2 years ago
deepeval - v0.17.3
What's Changed
- Feature/add synthetic query generation by @ColabDog in https://github.com/confident-ai/deepeval/pull/1
- Feature/add synthetic query generation by @ColabDog in https://github.com/confident-ai/deepeval/pull/2
- fix evals by @ColabDog in https://github.com/confident-ai/deepeval/pull/3
- add docs by @ColabDog in https://github.com/confident-ai/deepeval/pull/4
- Feature/add docs by @ColabDog in https://github.com/confident-ai/deepeval/pull/5
- Feature/add pytest cases by @ColabDog in https://github.com/confident-ai/deepeval/pull/6
- add black by @ColabDog in https://github.com/confident-ai/deepeval/pull/7
- add pytest by @ColabDog in https://github.com/confident-ai/deepeval/pull/8
- Feature/add langchain integration by @ColabDog in https://github.com/confident-ai/deepeval/pull/10
- update to modal by @ColabDog in https://github.com/confident-ai/deepeval/pull/11
- fix langchain integration pipeline by @ColabDog in https://github.com/confident-ai/deepeval/pull/14
- update API by @ColabDog in https://github.com/confident-ai/deepeval/pull/15
- switch to callable metrics by @ColabDog in https://github.com/confident-ai/deepeval/pull/16
- add answer relevancy by @ColabDog in https://github.com/confident-ai/deepeval/pull/18
- check for ranking similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/19
- add dashboard by @ColabDog in https://github.com/confident-ai/deepeval/pull/20
- add support for dict objects by @ColabDog in https://github.com/confident-ai/deepeval/pull/21
- Jacky/twi 332 add overall metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/23
- Jacky/twi 338 add alert score by @ColabDog in https://github.com/confident-ai/deepeval/pull/22
- Feature/add conceptual similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/24
- Feature/add overall score by @ColabDog in https://github.com/confident-ai/deepeval/pull/25
- Jacky/twi 332 add overall metric 2 by @ColabDog in https://github.com/confident-ai/deepeval/pull/27
- Jacky/twi 343 fix sending python events to server by @ColabDog in https://github.com/confident-ai/deepeval/pull/28
- Add assert non-toxic by @ColabDog in https://github.com/confident-ai/deepeval/pull/26
- Fix toxic classifier by @ColabDog in https://github.com/confident-ai/deepeval/pull/29
- Biasdetection by @j-space-b in https://github.com/confident-ai/deepeval/pull/31
- add length metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/30
- update debias documentation by @ColabDog in https://github.com/confident-ai/deepeval/pull/32
- update the name by @ColabDog in https://github.com/confident-ai/deepeval/pull/34
- make tests run in matrix by @ColabDog in https://github.com/confident-ai/deepeval/pull/33
- Feature/create implementation by @ColabDog in https://github.com/confident-ai/deepeval/pull/35
- fix prod by @ColabDog in https://github.com/confident-ai/deepeval/pull/36
- Feature/add retry manager by @ColabDog in https://github.com/confident-ai/deepeval/pull/37
- update tests by @ColabDog in https://github.com/confident-ai/deepeval/pull/38
- Feature/fix evaluation by @ColabDog in https://github.com/confident-ai/deepeval/pull/40
- Feature/run evaluation by @ColabDog in https://github.com/confident-ai/deepeval/pull/39
- Hotfix/assert not working by @ColabDog in https://github.com/confident-ai/deepeval/pull/41
- add classifiers by @ColabDog in https://github.com/confident-ai/deepeval/pull/42
- Add check for raising error by @ColabDog in https://github.com/confident-ai/deepeval/pull/43
- Feature/add success by @ColabDog in https://github.com/confident-ai/deepeval/pull/45
- Feature/fix error by @ColabDog in https://github.com/confident-ai/deepeval/pull/44
- Feature/rename input to query by @ColabDog in https://github.com/confident-ai/deepeval/pull/47
- Feature/improve question answering by @ColabDog in https://github.com/confident-ai/deepeval/pull/46
- Feature/add singleton metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/48
- Add quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/49
- Feature/make parameters available by @ColabDog in https://github.com/confident-ai/deepeval/pull/50
- add answer relevancy by @ColabDog in https://github.com/confident-ai/deepeval/pull/51
- add deepeval test by @ColabDog in https://github.com/confident-ai/deepeval/pull/52
- Add custom metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/53
- Feature/fix custom metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/54
- Feature/add posthog by @ColabDog in https://github.com/confident-ai/deepeval/pull/56
- Jacky/conf 356 add tests for answer relevancy by @ColabDog in https://github.com/confident-ai/deepeval/pull/55
- Hotfix/overall score tests by @ColabDog in https://github.com/confident-ai/deepeval/pull/58
- add test by @ColabDog in https://github.com/confident-ai/deepeval/pull/57
- Feature/add cli by @ColabDog in https://github.com/confident-ai/deepeval/pull/59
- add readme by @ColabDog in https://github.com/confident-ai/deepeval/pull/60
- add file handler to key env by @ColabDog in https://github.com/confident-ai/deepeval/pull/61
- Add llamaindex by @ColabDog in https://github.com/confident-ai/deepeval/pull/63
- Feature/improve get api key by @ColabDog in https://github.com/confident-ai/deepeval/pull/66
- update reqs by @ColabDog in https://github.com/confident-ai/deepeval/pull/69
- Feature/add test overall score by @ColabDog in https://github.com/confident-ai/deepeval/pull/73
- Fix overall score by @ColabDog in https://github.com/confident-ai/deepeval/pull/75
- Add gpt synthetic data by @penguine-ip in https://github.com/confident-ai/deepeval/pull/74
- Feature/log context by @ColabDog in https://github.com/confident-ai/deepeval/pull/77
- remove by @ColabDog in https://github.com/confident-ai/deepeval/pull/79
- Feature/fix quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/81
- Hotfix/fix project name by @ColabDog in https://github.com/confident-ai/deepeval/pull/82
- Hotfix/incorrect issue by @ColabDog in https://github.com/confident-ai/deepeval/pull/85
- Fix conceptual similarity by @ColabDog in https://github.com/confident-ai/deepeval/pull/86
- Feature/improve writing unit tests by @ColabDog in https://github.com/confident-ai/deepeval/pull/90
- Feature/add spinner progress by @ColabDog in https://github.com/confident-ai/deepeval/pull/91
- Feature/add context back to textcase by @ColabDog in https://github.com/confident-ai/deepeval/pull/94
- update quickstart by @ColabDog in https://github.com/confident-ai/deepeval/pull/103
- Fix answer relevancy by @ColabDog in https://github.com/confident-ai/deepeval/pull/111
- make crossencoder default by @ColabDog in https://github.com/confident-ai/deepeval/pull/112
- Feature/add test run by @ColabDog in https://github.com/confident-ai/deepeval/pull/115
- Feature/add test run by @ColabDog in https://github.com/confident-ai/deepeval/pull/116
- update metric by @ColabDog in https://github.com/confident-ai/deepeval/pull/117
- Hotfix/zero score logs by @ColabDog in https://github.com/confident-ai/deepeval/pull/119
- Feature/add ragas by @ColabDog in https://github.com/confident-ai/deepeval/pull/121
- Hotfix/not running with no pytest by @ColabDog in https://github.com/confident-ai/deepeval/pull/124
- add multiple metrics by @ColabDog in https://github.com/confident-ai/deepeval/pull/123
- Hotfix/remove code by @ColabDog in https://github.com/confident-ai/deepeval/pull/125
- add chatbot test by @ColabDog in https://github.com/confident-ai/deepeval/pull/126
New Contributors
- @ColabDog made their first contribution in https://github.com/confident-ai/deepeval/pull/1
- @j-space-b made their first contribution in https://github.com/confident-ai/deepeval/pull/31
- @penguine-ip made their first contribution in https://github.com/confident-ai/deepeval/pull/74
Full Changelog: https://github.com/confident-ai/deepeval/commits/v0.17.3
Published by ColabDog over 2 years ago