Recent Releases of farm-haystack

farm-haystack - v2.17.1

Release Notes

v2.17.1

Bug Fixes

  • Fixed the from_dict method of MetadataRouter so the output_type parameter introduced in Haystack 2.17 is now optional when loading from YAML. This ensures compatibility with older Haystack pipelines.
  • In OpenAIChatGenerator, improved the logic to exclude unsupported custom tool calls. The previous implementation caused compatibility issues with the Mistral Haystack core integration, which extends OpenAIChatGenerator.
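The optional-parameter fix can be illustrated with a small sketch; the function and key names here are hypothetical, not Haystack's actual code:

```python
# Hypothetical sketch: tolerate a missing output_type key when loading
# a config produced before the parameter existed in Haystack 2.17.
def from_dict_sketch(data):
    init_params = data.get("init_parameters", {})
    return {
        # output_type defaults to None for configs from older pipelines
        "output_type": init_params.get("output_type", None),
        "rule": init_params.get("rule"),
    }
```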

- Python
Published by github-actions[bot] 6 months ago

farm-haystack - v2.17.0

⭐️ Highlights

πŸ–ΌοΈ Image support for several model providers

Following the introduction of image support in Haystack 2.16.0, we've expanded this to more model providers in Haystack and Haystack Core integrations.

Now supported: Amazon Bedrock, Anthropic, Azure, Google, Hugging Face API, Meta Llama API, Mistral, Nvidia, Ollama, OpenAI, OpenRouter, STACKIT.

🧩 Extended components

We've improved several components to make them more flexible:

  • MetadataRouter, which is used to route Documents based on metadata, has been extended to also support routing ByteStream objects.
  • The SentenceWindowRetriever, which retrieves neighboring sentences around relevant Documents to provide full context, is now more flexible. Previously, its source_id_meta_field parameter accepted only a single field containing the ID of the original document. It now also accepts a list of fields, so that only documents matching all of the specified meta fields will be retrieved.

⬆️ Upgrade Notes

  • MultiFileConverter outputs a new key failed in the result dictionary, which contains a list of files that failed to convert. The documents output is included only if at least one file is successfully converted. Previously, documents could still be present but empty if a file with a supported MIME type was provided but did not actually exist.

  • The finish_reason field behavior in HuggingFaceAPIChatGenerator has been updated. Previously, the new finish_reason mapping (introduced in Haystack 2.15.0 release) was only applied when streaming was enabled. When streaming was disabled, the old finish_reason was still returned. This change ensures the updated finish_reason values are consistently returned regardless of streaming mode.

    How to know if you're affected: If you rely on finish_reason in responses from HuggingFaceAPIChatGenerator with streaming disabled, you may see different values after this upgrade.

    What to do: Review the updated mapping:

    • length β†’ length
    • eos_token β†’ stop
    • stop_sequence β†’ stop
    • If tool calls are present β†’ tool_calls
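The mapping can be sketched as a small lookup; the names below are illustrative, not Haystack's internals:

```python
# Illustrative lookup for the finish_reason mapping described above
FINISH_REASON_MAPPING = {
    "length": "length",
    "eos_token": "stop",
    "stop_sequence": "stop",
}

def map_finish_reason(raw_reason, has_tool_calls=False):
    # Tool calls take precedence over the provider's raw reason
    if has_tool_calls:
        return "tool_calls"
    return FINISH_REASON_MAPPING.get(raw_reason, raw_reason)
```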

πŸš€ New Features

  • Add support for ByteStream objects in MetadataRouter. It can now be used to route list[Document] or list[ByteStream] based on metadata.
  • Add support for the union type operator | (added in Python 3.10) in serialize_type and Pipeline.connect(). These functions support both typing.Union and the | operator, as well as mixtures of them, for backwards compatibility.
  • Added ReasoningContent as a new content part to the ChatMessage dataclass. This allows storing model reasoning text and additional metadata in assistant messages. Assistant messages can now include reasoning content using the reasoning parameter in ChatMessage.from_assistant(). We will progressively update the implementations for Chat Generators with LLMs that support reasoning to use this new content part.
  • Updated SentenceWindowRetriever's source_id_meta_field parameter to also accept a list of strings. If a list of fields is provided, only documents matching all of the specified meta fields will be retrieved.

⚑️ Enhancement Notes

  • Added multimodal support to HuggingFaceAPIChatGenerator to enable vision-language model (VLM) usage with images and text. Users can now send both text and images to VLM models through Hugging Face APIs. The implementation follows the HF VLM API format specification and maintains full backward compatibility with text-only messages.
  • Added serialization/deserialization methods for TextContent and ImageContent parts of ChatMessage.
  • Made the lazy import error message clearer explaining that the optional dependency is missing.
  • Adopted modern type hinting syntax using PEP 585 throughout the codebase. This improves readability and removes unnecessary imports from the typing module.
  • Support subclasses of ChatMessage in Agent state schema validation. The validation now checks for issubclass(args[0], ChatMessage) instead of requiring exact type equality, allowing custom ChatMessage subclasses to be used in the messages field.
  • The ToolInvoker run method now accepts a list of tools. When provided, this list overrides the tools set in the constructor, allowing you to switch tools at runtime in previously built pipelines.
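The runtime tool override in ToolInvoker can be sketched like this; the class below is an illustrative stand-in, not Haystack's implementation:

```python
# Illustrative sketch of "runtime tools override constructor tools"
class ToolInvokerSketch:
    def __init__(self, tools):
        self._tools = list(tools)

    def run(self, tool_name, tools=None):
        # A tools list passed at run time takes precedence over the constructor list
        active = list(tools) if tools is not None else self._tools
        for tool in active:
            if tool["name"] == tool_name:
                return tool["fn"]()
        raise ValueError(f"Tool {tool_name!r} not available at runtime")

invoker = ToolInvokerSketch(tools=[{"name": "add", "fn": lambda: 1 + 1}])
# Override the constructor tools for this run only
result = invoker.run("double", tools=[{"name": "double", "fn": lambda: 2 * 2}])
```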

πŸ› Bug Fixes

  • The English and German abbreviation files used by the SentenceSplitter are now included in the distribution. They were previously missing due to a config in the .gitignore file.

  • Add the encoding_format keyword argument to the OpenAI client when creating embeddings.

  • Addressed incorrect assumptions in the ChatMessage class that raised errors in valid usage scenarios.

    1. `ChatMessage.from_user` with `content_parts`: Previously, at least one text part was required, even though some model providers support messages with only image parts. This restriction has been removed. If a provider has such a limitation, it should now be enforced in the provider's implementation.

    2. `ChatMessage.to_openai_dict_format`: Messages containing multiple text parts weren't supported, despite this being allowed by the OpenAI API. This has now been corrected.

  • Improved validation in the ChatMessage.from_user class method. The method now raises an error if neither text nor content_parts are provided. It does not raise an error if text is an empty string.

  • Ensure that the score field in SentenceTransformersSimilarityRanker is returned as a Python float instead of numpy.float32. This prevents potential serialization issues in downstream integrations.

  • Raise a RuntimeError when AsyncPipeline.run is called from within an async context, indicating that run_async should be used instead.

  • Prevented in-place mutation of input Document objects in all Extractor and Classifier components by creating copies with dataclasses.replace before processing.

  • Prevented in-place mutation of input Document objects in all DocumentEmbedder components by creating copies with dataclasses.replace before processing.

  • FileTypeRouter has a new parameter raise_on_failure with a default value of False. When set to True, FileNotFoundError is always raised for non-existent files. Previously, this exception was raised only when processing a non-existent file and the meta parameter was provided to run().

  • Return a more informative error message when attempting to connect two components and the sender component does not have any OutputSockets defined.

  • Fixed tracing context not being propagated to tools when running via ToolInvoker.run_async.

  • Ensure consistent behavior in SentenceTransformersDiversityRanker. Like other rankers, it now returns all documents instead of raising an error when top_k exceeds the number of available documents.
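The copy-before-mutation pattern used in the Extractor, Classifier, and DocumentEmbedder fixes above can be sketched with a plain dataclass:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass
class Doc:
    content: str
    embedding: Optional[list] = None

original = Doc(content="hello")
# dataclasses.replace returns a copy, so the caller's document is untouched
processed = replace(original, embedding=[0.1, 0.2])
```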

πŸ’™ Big thank you to everyone who contributed to this release!

@abdokaseb @Amnah199 @anakin87 @bilgeyucel @ChinmayBansal @datbth @davidsbatista @dfokina @LastRemote @mpangrazzi @RafaelJohn9 @rolshoven @SaraCalla @SaurabhLingam @sjrl

- Python
Published by github-actions[bot] 6 months ago

farm-haystack - v2.17.0-rc2

- Python
Published by github-actions[bot] 6 months ago

farm-haystack - v2.17.0-rc1

Release Notes

Upgrade Notes

  • MultiFileConverter outputs a new key failed in the result dictionary, which contains a list of files that failed to convert. The documents output is included only if at least one file is successfully converted. Previously, documents could still be present but empty if a file with a supported MIME type was provided but did not actually exist.

  • The finish_reason field behavior in HuggingFaceAPIChatGenerator has been updated. Previously, the new finish_reason mapping (introduced in the Haystack 2.15.0 release) was only applied when streaming was enabled. When streaming was disabled, the old finish_reason was still returned. This change ensures the updated finish_reason values are consistently returned regardless of streaming mode.

    How to know if you're affected: If you rely on finish_reason in responses from HuggingFaceAPIChatGenerator with streaming disabled, you may see different values after this upgrade.

    What to do: Review the updated mapping:

    • length → length
    • eos_token → stop
    • stop_sequence → stop
    • If tool calls are present → tool_calls

New Features

  • Add support for ByteStream objects in MetadataRouter. It can now be used to route list[Document] or list[ByteStream] based on metadata.
  • Add support for the union type operator | (added in Python 3.10) in serialize_type and Pipeline.connect(). These functions support both typing.Union and the | operator, as well as mixtures of them, for backwards compatibility.
  • Added ReasoningContent as a new content part to the ChatMessage dataclass. This allows storing model reasoning text and additional metadata in assistant messages. Assistant messages can now include reasoning content using the reasoning parameter in ChatMessage.from_assistant(). We will progressively update the implementations for Chat Generators with LLMs that support reasoning to use this new content part.
  • Updated SentenceWindowRetriever's source_id_meta_field parameter to also accept a list of strings. If a list of fields is provided, only documents matching all of the specified meta fields will be retrieved.

Enhancement Notes

  • Added multimodal support to HuggingFaceAPIChatGenerator to enable vision-language model (VLM) usage with images and text. Users can now send both text and images to VLM models through Hugging Face APIs. The implementation follows the HF VLM API format specification and maintains full backward compatibility with text-only messages.
  • Added serialization/deserialization methods for TextContent and ImageContent parts of ChatMessage.
  • Made the lazy import error message clearer explaining that the optional dependency is missing.
  • Adopted modern type hinting syntax using PEP 585 throughout the codebase. This improves readability and removes unnecessary imports from the typing module.
  • Support subclasses of ChatMessage in Agent state schema validation. The validation now checks for issubclass(args[0], ChatMessage) instead of requiring exact type equality, allowing custom ChatMessage subclasses to be used in the messages field.
  • The ToolInvoker run method now accepts a list of tools. When provided, this list overrides the tools set in the constructor, allowing you to switch tools at runtime in previously built pipelines.

Bug Fixes

  • Add the encoding_format keyword argument to the OpenAI client when creating embeddings.

  • Addressed incorrect assumptions in the ChatMessage class that raised errors in valid usage scenario.

    1. `ChatMessage.from_user` with `content_parts`: Previously, at least one text part was required, even though some model providers support messages with only image parts. This restriction has been removed. If a provider has such a limitation, it should now be enforced in the provider's implementation.

    2. `ChatMessage.to_openai_dict_format`: Messages containing multiple text parts weren't supported, despite this being allowed by the OpenAI API. This has now been corrected.

  • Improved validation in the ChatMessage.from_user class method. The method now raises an error if neither text nor content_parts are provided. It does not raise an error if text is an empty string.

  • Ensure that the score field in SentenceTransformersSimilarityRanker is returned as a Python float instead of numpy.float32. This prevents potential serialization issues in downstream integrations.

  • Raise a RuntimeError when AsyncPipeline.run is called from within an async context, indicating that run_async should be used instead.

  • Prevented in-place mutation of input Document objects in all Extractor and Classifier components by creating copies with dataclasses.replace before processing.

  • Prevented in-place mutation of input Document objects in all DocumentEmbedder components by creating copies with dataclasses.replace before processing.

  • FileTypeRouter has a new parameter raise_on_failure with a default value of False. When set to True, FileNotFoundError is always raised for non-existent files. Previously, this exception was raised only when processing a non-existent file and the meta parameter was provided to run().

  • Return a more informative error message when attempting to connect two components and the sender component does not have any OutputSockets defined.

  • Fixed tracing context not being propagated to tools when running via ToolInvoker.run_async.

  • Ensure consistent behavior in SentenceTransformersDiversityRanker. Like other rankers, it now returns all documents instead of raising an error when top_k exceeds the number of available documents.

- Python
Published by github-actions[bot] 6 months ago

farm-haystack - v2.16.1

Release Notes

v2.16.1

Bug Fixes

  • Improved validation in the ChatMessage.from_user class method. The method now raises an error if neither text nor content_parts are provided. It does not raise an error if text is an empty string.

- Python
Published by github-actions[bot] 7 months ago

farm-haystack - v2.16.1-rc1

Release Notes

v2.16.1-rc1

Bug Fixes

  • Improved validation in the ChatMessage.from_user class method. The method now raises an error if neither text nor content_parts are provided. It does not raise an error if text is an empty string.

- Python
Published by github-actions[bot] 7 months ago

farm-haystack - v2.16.0

⭐️ Highlights

🧠 Agent Breakpoints

This release introduces Agent Breakpoints, a powerful new feature that enhances debugging and observability when working with Haystack Agents. You can pause execution mid-run by inserting breakpoints in the Agent or its tools to inspect internal state and resume execution seamlessly. This brings fine-grained control to agent development and significantly improves traceability during complex interactions.

```python
from haystack.dataclasses.breakpoints import AgentBreakpoint, Breakpoint
from haystack.dataclasses import ChatMessage

chat_generator_breakpoint = Breakpoint(
    component_name="chat_generator",
    visit_count=0,
    snapshot_file_path="debug_snapshots"
)
agent_breakpoint = AgentBreakpoint(break_point=chat_generator_breakpoint, agent_name='calculator_agent')

response = agent.run(
    messages=[ChatMessage.from_user("What is 7 * (4 + 2)?")],
    break_point=agent_breakpoint
)
```

πŸ–ΌοΈ Multimodal Pipelines and Agents

You can now blend text and image capabilities across generation, indexing, and retrieval in Haystack.

  • New ImageContent Dataclass: A dedicated structure to store image data along with base64_image, mime_type, detail, and metadata.

  • Image-Aware Chat Generators: Image inputs are now supported in OpenAIChatGenerator

```python
from haystack.dataclasses import ImageContent, ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator

image_url = "https://cdn.britannica.com/79/191679-050-C7114D2B/Adult-capybara.jpg"
image_content = ImageContent.from_url(image_url)

message = ChatMessage.from_user(
    content_parts=["Describe the image in short.", image_content]
)

llm = OpenAIChatGenerator(model="gpt-4o-mini")
print(llm.run([message])["replies"][0].text)
```

  • Powerful Multimodal Components:

    • PDFToImageContent, ImageFileToImageContent, DocumentToImageContent: Convert PDFs, image files, and Documents into ImageContent objects.
    • LLMDocumentContentExtractor: Extract text from images using a vision-enabled LLM.
    • SentenceTransformersDocumentImageEmbedder: Generate embeddings from image-based documents using models like CLIP.
    • DocumentLengthRouter: Route documents based on textual content lengthβ€”ideal for distinguishing scanned PDFs from text-based ones.
    • DocumentTypeRouter: Route documents automatically based on MIME type metadata.
  • Prompt Building with Image Support: The ChatPromptBuilder now supports templates with embedded images, enabling dynamic multimodal prompt creation.

With these additions, you can now build multimodal agents and RAG pipelines that reason over both text and visual content, unlocking richer interactions and retrieval capabilities.

πŸ‘‰ Learn more about multimodality in our Introduction to Multimodal Text Generation.

πŸš€ New Features

  • Add to_dict and from_dict to ByteStream so it is consistent with our other dataclasses in having serialization and deserialization methods.

  • Add to_dict and from_dict to classes StreamingChunk, ToolCallResult, ToolCall, ComponentInfo, and ToolCallDelta to make it consistent with our other dataclasses in having serialization and deserialization methods.

  • Added the tool_invoker_kwargs param to Agent so additional kwargs can be passed to the ToolInvoker like max_workers and enable_streaming_callback_passthrough.

  • ChatPromptBuilder now supports special string templates in addition to a list of ChatMessage objects. This new format is more flexible and allows structured parts like images to be included in the templatized ChatMessage.

    ```python
    from haystack.components.builders import ChatPromptBuilder
    from haystack.dataclasses.chat_message import ImageContent

    template = """
    {% message role="user" %}
    Hello! I am {{user_name}}. What's the difference between the following images?
    {% for image in images %}
    {{ image | templatize_part }}
    {% endfor %}
    {% endmessage %}
    """

    images = [
        ImageContent.from_file_path("apple-fruit.jpg"),
        ImageContent.from_file_path("apple-logo.jpg")
    ]

    builder = ChatPromptBuilder(template=template)
    builder.run(user_name="John", images=images)
    ```

  • Added convenience class methods to the ImageContent dataclass to create ImageContent objects from file paths and URLs.

  • Added multiple converters to help convert image data between different formats:

    • DocumentToImageContent: Converts documents sourced from PDF and image files into ImageContents.
    • ImageFileToImageContent: Converts image files to ImageContent objects.
    • ImageFileToDocument: Converts image file references into empty Document objects with associated metadata.
    • PDFToImageContent: Converts PDF files to ImageContent objects.
  • Chat Messages with the user role can now include images using the new ImageContent dataclass. We've added image support to OpenAIChatGenerator, and plan to support more model providers over time.

  • Raise a warning when a pipeline can no longer proceed because all remaining components are blocked from running and no expected pipeline outputs have been produced. This scenario can occur legitimately. For example, in pipelines with mutually exclusive branches where some components are intentionally blocked. To help avoid false positives, the check ensures that none of the expected outputs (as defined by Pipeline().outputs()) have been generated during the current run.

  • Added source_id_meta_field and split_id_meta_field to SentenceWindowRetriever for customizable metadata field names. Added raise_on_missing_meta_fields to control whether a ValueError is raised if any of the documents at runtime are missing the required meta fields (set to True by default). If False, then the documents missing the meta field will be skipped when retrieving their windows, but the original document will still be included in the results.

  • Add a ComponentInfo dataclass to the haystack.dataclasses module. This dataclass is used to store information about a component. We pass it to StreamingChunk so we can tell which component a stream is coming from.

  • Pass the component_info to the StreamingChunk in the OpenAIChatGenerator, AzureOpenAIChatGenerator, HuggingFaceAPIChatGenerator and HuggingFaceLocalChatGenerator.

  • Added the enable_streaming_callback_passthrough parameter to the ToolInvoker init, run and run_async methods. If set to True, the ToolInvoker will try to pass the streaming_callback function to a tool's invoke method, but only if the tool's invoke method has streaming_callback in its signature.

  • Added a dedicated finish_reason field to the StreamingChunk class to improve type safety and enable sophisticated streaming UI logic. The field uses a FinishReason type alias with standard values: "stop", "length", "tool_calls", "content_filter", plus the Haystack-specific value "tool_call_results" (used by ToolInvoker to indicate tool execution completion).

  • Updated ToolInvoker component to use the new finish_reason field when streaming tool results. The component now sets finish_reason="tool_call_results" in the final streaming chunk to indicate that tool execution has completed, while maintaining backward compatibility by also setting the value in meta["finish_reason"].

  • Added new HuggingFaceTEIRanker component to enable reranking with Text Embeddings Inference (TEI) API. This component supports both self-hosted Text Embeddings Inference services and Hugging Face Inference Endpoints.

  • Added a raise_on_failure boolean parameter to OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder. If set to True, the component will raise an exception when there is an error with the API request. It is set to False by default so that the previous behavior of logging an exception and continuing remains the default.

  • ToolInvoker now executes tool_calls in parallel for both sync and async mode.

  • Add AsyncHFTokenStreamingHandler for async streaming support in HuggingFaceLocalChatGenerator

  • We introduced the LLMMessagesRouter component, which routes Chat Messages to different connections, using a generative Language Model to perform classification. This component can be used with general-purpose LLMs and with specialized LLMs for moderation like Llama Guard.

    Usage example:

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers.llm_messages_router import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

# initialize a Chat Generator with a generative model for moderation
chat_generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B", "provider": "groq"},
)

router = LLMMessagesRouter(
    chat_generator=chat_generator,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"]
)

print(router.run([ChatMessage.from_user("How to rob a bank?")]))
```

  • For HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator, all additional key-value pairs passed in api_params are now passed to the initialization of the underlying Inference Clients. This allows passing additional parameters to the clients, such as timeout, headers, and provider. This means we can now easily specify a different inference provider by passing the provider key in api_params.

  • Updated StreamingChunk to add the fields tool_calls, tool_call_result, index, and start to make it easier to format the stream in a streaming callback.

  • Added new dataclass ToolCallDelta for the StreamingChunk.tool_calls field to reflect that the arguments can be a string delta.

  • Updated print_streaming_chunk and _convert_streaming_chunks_to_chat_message utility methods to use these new fields. This especially improves the formatting when using print_streaming_chunk with Agent.

  • Updated OpenAIGenerator, OpenAIChatGenerator, HuggingFaceAPIGenerator, HuggingFaceAPIChatGenerator, HuggingFaceLocalGenerator and HuggingFaceLocalChatGenerator to follow the new dataclasses.

  • Updated ToolInvoker to follow the StreamingChunk dataclass.
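The parallel tool execution mentioned above (ToolInvoker running tool_calls concurrently in both sync and async mode) can be sketched with a thread pool; the dict shape below is assumed for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def invoke(tool_call):
    # Each tool call is independent, so the calls can run in parallel
    return tool_call["fn"](**tool_call["args"])

tool_calls = [
    {"fn": lambda x: x * 2, "args": {"x": 3}},
    {"fn": lambda x: x + 1, "args": {"x": 3}},
]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map preserves input order even though execution is concurrent
    results = list(executor.map(invoke, tool_calls))
```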

⬆️ Upgrade Notes

  • HuggingFaceAPIGenerator might no longer work with the Hugging Face Inference API. As of July 2025, the Hugging Face Inference API no longer offers generative models that support the text_generation endpoint. Generative models are now only available through providers that support the chat_completion endpoint. As a result, the HuggingFaceAPIGenerator component might not work with the Hugging Face Inference API. It still works with Hugging Face Inference Endpoints and self-hosted TGI instances. To use generative models via Hugging Face Inference API, please use the HuggingFaceAPIChatGenerator component, which supports the chat_completion endpoint.

  • All parameters of the Pipeline.draw() and Pipeline.show() methods must now be specified as keyword arguments. Example:

```python
pipeline.draw(
    path="output.png",
    server_url="https://custom-server.com",
    params=None,
    timeout=30,
    super_component_expansion=False
)
```

  • The deprecated async_executor parameter has been removed from the ToolInvoker class. Please use the max_workers parameter instead and a ThreadPoolExecutor with these workers will be created automatically for parallel tool invocations.

  • The deprecated State class has been removed from the haystack.dataclasses module. The State class is now part of the haystack.components.agents module.

  • Remove the deserialize_value_with_schema_legacy function from the base_serialization module. This function was used to deserialize State objects created with Haystack 2.14.0 or older. Support for the old serialization format is removed in Haystack 2.16.0.

⚑️ Enhancement Notes

  • Add guess_mime_type parameter to ByteStream.from_file_path().

  • Add the init parameter skip_empty_documents to the DocumentSplitter component. The default value is True. Setting it to False can be useful when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.

  • Test that our type validation and connection validation works with builtin python types introduced in 3.9. We found that these types were already supported, we just now add explicit tests for them.

  • We relaxed the requirement in ToolCallDelta (introduced in Haystack 2.15) that either the arguments or name parameter be populated in order to create a ToolCallDelta dataclass. We removed this requirement to be more in line with OpenAI's SDK, since it was causing errors for some hosted versions of open-source models following OpenAI's SDK specification.

  • Added a return_embedding parameter to the InMemoryDocumentStore.__init__ method.

  • Updated methods bm25_retrieval, and filter_documents to use self.return_embedding to determine whether embeddings are returned.

  • Updated tests (test_in_memory and test_in_memory_embedding_retriever) to reflect the changes in the InMemoryDocumentStore.

  • Added a new deserialize_component_inplace function to handle generic component deserialization that works with any component type.

  • Made doc-parser a core dependency, since the ComponentTool that uses it is one of the core Tool components.

  • Make the PipelineBase().validate_input method public so users can use it with the confidence that it won't receive breaking changes without warning. This method is useful for checking that all required connections in a pipeline have a connection and is automatically called in the run method of Pipeline. It is being exposed as public for users who would like to call this method before runtime to validate the pipeline.

  • For component run Datadog tracing, set the span resource name to the component name instead of the operation name.

  • Added a trust_remote_code parameter to the SentenceTransformersSimilarityRanker component. When set to True, this enables execution of custom models and scripts hosted on the Hugging Face Hub.

  • Add a new parameter require_tool_call_ids to ChatMessage.to_openai_dict_format. The default is True, for compatibility with OpenAI's Chat API: if the id field is missing in a Tool Call, an error is raised. Using False is useful for shallow OpenAI-compatible APIs, where the id field is not required.

  • Haystack's core modules are now "type complete", meaning that all function parameters and return types are explicitly annotated. This increases the usefulness of the newly added py.typed marker and sidesteps differences in type inference between the various type checker implementations.

  • Refactored the HuggingFaceAPIChatGenerator to use the utility method _convert_streaming_chunks_to_chat_message. This helps keep the conversion of StreamingChunks into a final ChatMessage consistent.

  • We also add ComponentInfo to the StreamingChunks created in HuggingFaceGenerator and HuggingFaceLocalGenerator, so we can tell which component a stream is coming from.

  • If only system messages are provided as input, a warning is logged indicating that this is likely not intended and that user messages should probably also be provided.
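The require_tool_call_ids behavior described above can be sketched as follows; the dict shapes are assumed for illustration, not Haystack's exact wire format:

```python
def tool_call_to_openai(tool_call, require_tool_call_ids=True):
    # OpenAI's Chat API requires an id on each tool call; shallow
    # OpenAI-compatible APIs may not, hence the opt-out flag.
    if require_tool_call_ids and "id" not in tool_call:
        raise ValueError("Tool call is missing the id field required by OpenAI's Chat API")
    converted = {
        "type": "function",
        "function": {"name": tool_call["name"], "arguments": tool_call["arguments"]},
    }
    if "id" in tool_call:
        converted["id"] = tool_call["id"]
    return converted
```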

πŸ› Bug Fixes

  • Fix _convert_streaming_chunks_to_chat_message, which is used to convert Haystack StreamingChunks into a Haystack ChatMessage. This fixes the scenario where one StreamingChunk contains two ToolCallDeltas in StreamingChunk.tool_calls. With this fix, both ToolCallDeltas are saved correctly, whereas before they overwrote each other. This only occurs with some LLM providers like Mistral (and not OpenAI) due to how the provider returns tool calls.

  • Fixed a bug in the print_streaming_chunk utility function that prevented tool call name from being printed.

  • Fixed the to_dict and from_dict of ToolInvoker to properly serialize the streaming_callback init parameter.

  • Fixed a bug where, if raise_on_failure=False and an error occurred mid-batch, the following embeddings would be paired with the wrong documents.

  • Fix component_invoker used by ComponentTool to work when a dataclass like ChatMessage is directly passed to component_tool.invoke(...). Previously this would either cause an error or silently skip your input.

  • Fixed a bug in the LLMMetadataExtractor that occurred when processing Document objects with None or empty string content. The component now gracefully handles these cases by marking such documents as failed and providing an appropriate error message in their metadata, without attempting an LLM call.

  • RecursiveDocumentSplitter now generates a unique Document.id for every chunk. The meta fields (split_id, parent_id, etc.) are populated before Document creation, so the hash used for id generation is always unique.

  • In ConditionalRouter fixed the to_dict and from_dict methods to properly handle the case when output_type is a List of types or a List of strings. This occurs when a user specifies a route in ConditionalRouter to have multiple outputs.

  • Fix serialization of GeneratedAnswer when ChatMessage objects are nested in meta.

  • Fix the serialization of ComponentTool and Tool when specifying outputs_to_string. Previously an error occurred on deserialization right after serializing if outputs_to_string is not None.

  • When calling set_output_types we now also check that the decorator @component.output_types is not present on the run_async method of a Component. Previously we only checked that the Component.run method did not possess the decorator.

  • Fix type comparison in schema validation by replacing is not with != when checking the type List[ChatMessage]. This prevents false mismatches due to Python's is operator comparing object identity instead of equality.

  • Re-export symbols in __init__.py files. This ensures that short imports like from haystack.components.builders import ChatPromptBuilder work equivalently to from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder, without causing errors or warnings in mypy/Pylance.

  • The SuperComponent class can now correctly serialize and deserialize a SuperComponent based on an async pipeline. Previously, the SuperComponent class always assumed the underlying pipeline was synchronous.
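The RecursiveDocumentSplitter fix above works because per-chunk meta fields such as split_id enter the hash input before the id is computed; a minimal sketch:

```python
import hashlib

def chunk_id(content, meta):
    # Including per-chunk meta (split_id, parent_id, ...) makes the hash
    # input unique even when two chunks share identical text
    payload = content + repr(sorted(meta.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

id_a = chunk_id("same text", {"parent_id": "doc1", "split_id": 0})
id_b = chunk_id("same text", {"parent_id": "doc1", "split_id": 1})
```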

⚠️ Deprecation Notes

  • async_executor parameter in ToolInvoker is deprecated in favor of max_workers parameter and will be removed in Haystack 2.16.0. You can use max_workers parameter to control the number of threads used for parallel tool calling.

πŸ’™ Big thank you to everyone who contributed to this release!

@Amnah199 @RafaelJohn9 @anakin87 @bilgeyucel @davidsbatista @julian-risch @kanenorman @kr1shnasomani @mathislucka @mpangrazzi @sjr @srishti-git1110

- Python
Published by github-actions[bot] 7 months ago

farm-haystack - v2.16.0-rc1

Release Notes

v2.17.0-rc0

Highlights

  • Added the LLMDocumentContentExtractor, which extracts textual content from image-based documents using a vision-enabled LLM. This is useful for creating a textual representation of an image that can be used when only text-based retrieval is available.

  • Add to_dict and from_dict to ByteStream so it is consistent with our other dataclasses in having serialization and deserialization methods.

  • Add to_dict and from_dict to classes StreamingChunk, ToolCallResult, ToolCall, ComponentInfo, and ToolCallDelta to make them consistent with our other dataclasses in having serialization and deserialization methods.

  • Added the tool_invoker_kwargs param to Agent so additional kwargs can be passed to the ToolInvoker like max_workers and enable_streaming_callback_passthrough.

  • ChatPromptBuilder now supports special string templates in addition to a list of ChatMessage objects.

    This new format is more flexible and allows to include structured parts like images in the templatized ChatMessage.

```python
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ImageContent

template = """
{% message role="user" %}
Hello! I am {{user_name}}. What's the difference between the following images?
{% for image in images %}
{{ image | templatize_part }}
{% endfor %}
{% endmessage %}
"""

images = [
    ImageContent.from_file_path("apple-fruit.jpg"),
    ImageContent.from_file_path("apple-logo.jpg"),
]

builder = ChatPromptBuilder(template=template)
builder.run(user_name="John", images=images)
```

  • Introduce the DocumentLengthRouter, a component for routing Documents based on the length of the content field.

    A common use case for DocumentLengthRouter is handling documents obtained from PDFs that contain non-text content, such as scanned pages or images. This component can detect empty or low-content documents and route them to components that perform OCR, generate captions, or compute image embeddings.
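    The routing decision itself is simple; the sketch below illustrates the idea with plain dicts (parameter and output names are assumptions, not DocumentLengthRouter's actual API):

```python
# Illustrative sketch of length-based routing, not Haystack's implementation.
def route_by_content_length(documents, threshold=10):
    short, long_ = [], []
    for doc in documents:  # each doc is a dict with an optional "content" string
        text = doc.get("content") or ""
        (short if len(text) <= threshold else long_).append(doc)
    return {"short_documents": short, "long_documents": long_}

result = route_by_content_length(
    [{"content": ""}, {"content": "A full page of extracted text from a PDF."}]
)
# the empty document can now be sent to OCR / captioning components
```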

  • Add new component called DocumentTypeRouter which routes documents by their MIME types. MIME types can be extracted directly from document metadata or inferred from file paths using standard or user-supplied MIME type mappings.
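    Inferring MIME types from file paths can be done with Python's standard mimetypes module, with a user-supplied mapping taking precedence; a sketch of the idea (the custom mapping here is illustrative, not the component's code):

```python
import mimetypes
from typing import Optional

# User-supplied mapping checked before the stdlib guess, mirroring the idea
# of custom MIME type mappings.
CUSTOM_MAPPINGS = {".md": "text/markdown"}

def infer_mime_type(file_path: str) -> Optional[str]:
    for suffix, mime in CUSTOM_MAPPINGS.items():
        if file_path.endswith(suffix):
            return mime
    guessed, _ = mimetypes.guess_type(file_path)
    return guessed

assert infer_mime_type("report.pdf") == "application/pdf"
assert infer_mime_type("notes.md") == "text/markdown"
```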

  • The Pipeline and the Agent now support breakpoints, a feature useful for debugging. A breakpoint is associated with a component and stops the execution of a Pipeline/Agent, generating a JSON file with the execution status, which can be inspected, edited, and later used to resume the execution.

  • Added convenience class methods to the ImageContent dataclass to create ImageContent objects from file paths and URLs.

  • Added multiple converters to help convert image data between different formats:

    • DocumentToImageContent: Converts documents sourced from PDF and image files into ImageContents.
    • ImageFileToImageContent: Converts image files to ImageContent objects.
    • ImageFileToDocument: Converts image file references into empty Document objects with associated metadata.
    • PDFToImageContent: Converts PDF files to ImageContent objects.
  • Chat Messages with the user role can now include images using the new ImageContent dataclass.

    We've added image support to OpenAIChatGenerator, and plan to support more model providers over time.

  • Raise a warning when a pipeline can no longer proceed because all remaining components are blocked from running and no expected pipeline outputs have been produced. This scenario can occur legitimately, for example in pipelines with mutually exclusive branches where some components are intentionally blocked. To help avoid false positives, the check ensures that none of the expected outputs (as defined by Pipeline().outputs()) have been generated during the current run.

  • Introduce the SentenceTransformersDocumentImageEmbedder, a component for computing Document embeddings from images using Sentence Transformers models such as OpenAI CLIP or Jina CLIP. Each Document must have a meta field that specifies the path to an image or PDF file. The resulting embedding will be stored in the embedding field of the Document.

  • Added source_id_meta_field and split_id_meta_field to SentenceWindowRetriever for customizable metadata field names. Added raise_on_missing_meta_fields to control whether a ValueError is raised if any of the documents at runtime are missing the required meta fields (set to True by default). If False, the documents missing the meta field will be skipped when retrieving their windows, but the original document will still be included in the results.
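    Matching documents on several meta fields at once can be pictured as a filter that requires every field to agree; a simplified sketch (the data layout and function name are illustrative, not the retriever's actual logic):

```python
def matching_documents(documents, required_meta):
    # Keep only documents whose meta contains *all* required key/value pairs.
    return [
        doc for doc in documents
        if all(doc.get("meta", {}).get(k) == v for k, v in required_meta.items())
    ]

docs = [
    {"id": "a", "meta": {"source_id": "42", "page": 1}},
    {"id": "b", "meta": {"source_id": "42", "page": 2}},
]
hits = matching_documents(docs, {"source_id": "42", "page": 2})
# only document "b" satisfies both meta fields
```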

  • Add a ComponentInfo dataclass to the haystack.dataclasses module. This dataclass is used to store information about the component. We pass it to StreamingChunk so we can tell from which component a stream is coming.

    • Pass the component_info to the StreamingChunk in the OpenAIChatGenerator, AzureOpenAIChatGenerator, HuggingFaceAPIChatGenerator and HuggingFaceLocalChatGenerator.
  • Added the enable_streaming_callback_passthrough to the ToolInvoker init, run and run_async methods. If set to True, the ToolInvoker will try to pass the streaming_callback function to a tool's invoke method only if the tool's invoke method has streaming_callback in its signature.

  • Added dedicated finish_reason field to StreamingChunk class to improve type safety and enable sophisticated streaming UI logic. The field uses a FinishReason type alias with standard values: "stop", "length", "tool_calls", "content_filter", plus the Haystack-specific value "tool_call_results" (used by ToolInvoker to indicate tool execution completion).

  • Updated ToolInvoker component to use the new finish_reason field when streaming tool results. The component now sets finish_reason="tool_call_results" in the final streaming chunk to indicate that tool execution has completed, while maintaining backward compatibility by also setting the value in meta["finish_reason"].

  • Added new HuggingFaceTEIRanker component to enable reranking with Text Embeddings Inference (TEI) API. This component supports both self-hosted Text Embeddings Inference services and Hugging Face Inference Endpoints.

  • Added a raise_on_failure boolean parameter to OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder. If set to True, the component will raise an exception when there is an error with the API request. It is set to False by default so the previous behavior of logging an exception and continuing is still the default.

  • ToolInvoker now executes tool_calls in parallel for both sync and async mode.

  • Add AsyncHFTokenStreamingHandler for async streaming support in HuggingFaceLocalChatGenerator

  • We introduced the LLMMessagesRouter component, that routes Chat Messages to different connections, using a generative Language Model to perform classification.

    This component can be used with general-purpose LLMs and with specialized LLMs for moderation like Llama Guard.

    Usage example:

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers.llm_messages_router import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

# initialize a Chat Generator with a generative model for moderation
chat_generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B", "provider": "groq"},
)

router = LLMMessagesRouter(
    chat_generator=chat_generator,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"],
)

print(router.run([ChatMessage.from_user("How to rob a bank?")]))
```

    • For HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator all additional key, value pairs passed in api_params are now passed to the initializations of the underlying Inference Clients. This allows passing of additional parameters to the clients like timeout, headers, provider, etc.
    • This means we now can easily specify a different inference provider by passing the provider key in api_params.
  • Updated StreamingChunk to add the fields tool_calls, tool_call_result, index, and start to make it easier to format the stream in a streaming callback.

    • Added new dataclass ToolCallDelta for the StreamingChunk.tool_calls field to reflect that the arguments can be a string delta.
    • Updated print_streaming_chunk and _convert_streaming_chunks_to_chat_message utility methods to use these new fields. This especially improves the formatting when using print_streaming_chunk with Agent.
    • Updated OpenAIGenerator, OpenAIChatGenerator, HuggingFaceAPIGenerator, HuggingFaceAPIChatGenerator, HuggingFaceLocalGenerator and HuggingFaceLocalChatGenerator to follow the new dataclasses.
    • Updated ToolInvoker to follow the StreamingChunk dataclass.

Upgrade Notes

  • HuggingFaceAPIGenerator might no longer work with the Hugging Face Inference API.

    As of July 2025, the Hugging Face Inference API no longer offers generative models that support the text_generation endpoint. Generative models are now only available through providers that support the chat_completion endpoint. As a result, the HuggingFaceAPIGenerator component might not work with the Hugging Face Inference API. It still works with Hugging Face Inference Endpoints and self-hosted TGI instances.

    To use generative models via Hugging Face Inference API, please use the HuggingFaceAPIChatGenerator component, which supports the chat_completion endpoint.

  • All parameters of the Pipeline.draw() and Pipeline.show() methods must now be specified as keyword arguments. Example: pipeline.draw(path="output.png", server_url="https://custom-server.com", params=None, timeout=30, super_component_expansion=False)
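    Keyword-only parameters are enforced in Python with a bare `*` in the signature, which is the mechanism behind this change; a generic illustration (the signature below is illustrative, not the actual Pipeline.draw definition):

```python
# Everything after the bare `*` must be passed by keyword.
def draw(*, path="pipeline.png", server_url=None, params=None, timeout=30):
    return path, server_url, params, timeout

draw(path="output.png", timeout=10)  # OK: keyword arguments

try:
    draw("output.png")  # positional call now raises TypeError
except TypeError:
    pass
```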

  • The deprecated async_executor parameter has been removed from the ToolInvoker class. Please use the max_workers parameter instead and a ThreadPoolExecutor with these workers will be created automatically for parallel tool invocations.

  • The deprecated State class has been removed from the haystack.dataclasses module. The State class is now part of the haystack.components.agents module.

  • Remove the deserialize_value_with_schema_legacy function from the base_serialization module. This function was used to deserialize State objects created with Haystack 2.14.0 or older. Support for the old serialization format is removed in Haystack 2.16.0.

Enhancement Notes

  • Add guess_mime_type parameter to ByteStream.from_file_path()
  • Add the init parameter skip_empty_documents to the DocumentSplitter component. The default value is True. Setting it to False can be useful when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.
  • Test that our type validation and connection validation work with built-in Python types introduced in 3.9. We found that these types were already supported; we now add explicit tests for them.
  • We relaxed the requirement in ToolCallDelta (introduced in Haystack 2.15) that the arguments or name parameter be populated in order to create a ToolCallDelta dataclass. We removed this requirement to be more in line with OpenAI's SDK and because it was causing errors for some hosted versions of open-source models following OpenAI's SDK specification.
    • Added return_embedding parameter to the InMemoryDocumentStore __init__ method.
    • Updated the bm25_retrieval and filter_documents methods to use self.return_embedding to determine whether embeddings are returned.
    • Updated tests (test_in_memory & test_in_memory_embedding_retriever) to reflect the changes in the InMemoryDocumentStore.
  • Added a new deserialize_component_inplace function to handle generic component deserialization that works with any component type.

  • Made doc-parser a core dependency since ComponentTool that uses it is one of the core Tool components.

  • Make the PipelineBase().validate_input method public so users can use it with the confidence that it won't receive breaking changes without warning. This method is useful for checking that all required connections in a pipeline have a connection and is automatically called in the run method of Pipeline. It is being exposed as public for users who would like to call this method before runtime to validate the pipeline.

  • For component run Datadog tracing, set the span resource name to the component name instead of the operation name.

  • Added a trust_remote_code parameter to the SentenceTransformersSimilarityRanker component. When set to True, this enables execution of custom models and scripts hosted on the Hugging Face Hub.

  • Add a new parameter require_tool_call_ids to ChatMessage.to_openai_dict_format. The default is True, for compatibility with OpenAI's Chat API: if the id field is missing in a Tool Call, an error is raised. Using False is useful for shallow OpenAI-compatible APIs, where the id field is not required.
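    The validation behavior can be sketched as a conversion helper that checks tool-call ids (a hypothetical helper, not the actual ChatMessage implementation):

```python
def tool_call_to_openai_dict(tool_call: dict, require_tool_call_ids: bool = True) -> dict:
    # OpenAI's Chat API expects an id per tool call; shallow OpenAI-compatible
    # APIs may omit it, which require_tool_call_ids=False tolerates.
    if require_tool_call_ids and tool_call.get("id") is None:
        raise ValueError("Tool call is missing an id")
    result = {"type": "function", "function": {"name": tool_call["name"]}}
    if tool_call.get("id") is not None:
        result["id"] = tool_call["id"]
    return result

assert tool_call_to_openai_dict({"name": "search", "id": "call_1"})["id"] == "call_1"
assert "id" not in tool_call_to_openai_dict({"name": "search"}, require_tool_call_ids=False)
```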

  • Haystack's core modules are now ["type complete"](https://typing.python.org/en/latest/guides/libraries.html#how-much-of-my-library-needs-types), meaning that all function parameters and return types are explicitly annotated. This increases the usefulness of the newly added py.typed marker and sidesteps differences in type inference between the various type checker implementations.

    • Refactored the HuggingFaceAPIChatGenerator to use the util method _convert_streaming_chunks_to_chat_message. This helps keep us consistent in how we convert StreamingChunks into a final ChatMessage.
    • We also add ComponentInfo to the StreamingChunks made in HuggingFaceGenerator and HuggingFaceLocalGenerator so we can tell from which component a stream is coming.
    • If only system messages are provided as input, a warning will be logged indicating that this is likely not intended and that the user should probably also provide user messages.

Bug Fixes

  • Fix _convert_streaming_chunks_to_chat_message, which is used to convert Haystack StreamingChunks into a Haystack ChatMessage. This fixes the scenario where one StreamingChunk contains two ToolCallDeltas in StreamingChunk.tool_calls. With this fix both ToolCallDeltas are correctly saved, whereas before they were overwriting each other. This only occurs with some LLM providers like Mistral (and not OpenAI) due to how the provider returns tool calls.
  • Fixed a bug in the print_streaming_chunk utility function that prevented the tool call name from being printed.

  • Fixed the to_dict and from_dict of ToolInvoker to properly serialize the streaming_callback init parameter.

  • Fix bug where, if raise_on_failure=False and an error occurred mid-batch, the following embeddings would be paired with the wrong documents.

  • Fix component_invoker used by ComponentTool to work when a dataclass like ChatMessage is directly passed to component_tool.invoke(...). Previously this would either cause an error or silently skip your input.

  • Fixed a bug in the LLMMetadataExtractor that occurred when processing Document objects with None or empty string content. The component now gracefully handles these cases by marking such documents as failed and providing an appropriate error message in their metadata, without attempting an LLM call.

  • RecursiveDocumentSplitter now generates a unique Document.id for every chunk. The meta fields (split_id, parent_id, etc.) are populated before Document creation, so the hash used for id generation is always unique.

  • Fixed the to_dict and from_dict methods of ConditionalRouter to properly handle the case when output_type is a List of types or a List of strings. This occurs when a user specifies a route in ConditionalRouter to have multiple outputs.

  • Fix serialization of GeneratedAnswer when ChatMessage objects are nested in meta.

  • Fix the serialization of ComponentTool and Tool when specifying outputs_to_string. Previously, an error occurred on deserialization right after serializing if outputs_to_string was not None.

  • When calling set_output_types we now also check that the decorator @component.output_types is not present on the run_async method of a Component. Previously we only checked that the Component.run method did not possess the decorator.

  • Fix type comparison in schema validation by replacing is not with != when checking the type List[ChatMessage]. This prevents false mismatches due to Python's is operator comparing object identity instead of equality.

  • Re-export symbols in __init__.py files. This ensures that short imports like from haystack.components.builders import ChatPromptBuilder work equivalently to from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder, without causing errors or warnings in mypy/Pylance.

  • The SuperComponent class can now correctly serialize and deserialize a SuperComponent based on an async pipeline. Previously, the SuperComponent class always assumed the underlying pipeline was synchronous.

Deprecation Notes

  • async_executor parameter in ToolInvoker is deprecated in favor of max_workers parameter and will be removed in Haystack 2.16.0. You can use max_workers parameter to control the number of threads used for parallel tool calling.

- Python
Published by github-actions[bot] 7 months ago

farm-haystack - v2.15.2

Enhancement Notes

  • We’ve relaxed the requirements for the ToolCallDelta dataclass (introduced in Haystack 2.15). Previously, creating a ToolCallDelta instance required either the parameters argument or the name to be set. This constraint has now been removed to align more closely with OpenAI's SDK behavior. The change was necessary as the stricter requirement was causing errors in certain hosted versions of open-source models that adhere to the OpenAI SDK specification.

Bug Fixes

  • Fixed a bug in the print_streaming_chunk utility function that prevented ToolCall name from being printed.

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.15.2-rc1

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.15.1

Bug Fixes

  • Fix _convert_streaming_chunks_to_chat_message which is used to convert Haystack StreamingChunks into a Haystack ChatMessage. This fixes the scenario where one StreamingChunk contains two ToolCallDeltas in StreamingChunk.tool_calls. With this fix both ToolCallDeltas are correctly saved, whereas before they were overwriting each other. This only occurs with some LLM providers like Mistral (and not OpenAI) due to how the provider returns tool calls.

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.15.1-rc1

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.15.0

⭐️ Highlights

Parallel Tool Calling for Faster Agents

  • ToolInvoker now processes all tool calls passed to run or run_async in parallel using an internal ThreadPoolExecutor. This improves performance by reducing the time spent on sequential tool invocations.
  • This parallel execution capability enables ToolInvoker to batch and process multiple tool calls concurrently, allowing Agents to run complex pipelines efficiently with decreased latency.
  • You no longer need to pass an async_executor. ToolInvoker manages its own executor, configurable via the max_workers parameter in init.
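The parallel invocation described above follows the standard concurrent.futures pattern; a minimal sketch of the approach (the helper and tool functions below are stand-ins, not ToolInvoker internals):

```python
from concurrent.futures import ThreadPoolExecutor

def invoke_all(calls, max_workers=4):
    # Submit every tool call at once and collect results in input order,
    # so slow calls overlap instead of running one after another.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(tool, **kwargs) for tool, kwargs in calls]
        return [f.result() for f in futures]

def add(a, b):
    return a + b

results = invoke_all([(add, {"a": 1, "b": 2}), (add, {"a": 3, "b": 4})])
assert results == [3, 7]
```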

Introducing LLMMessagesRouter

The new LLMMessagesRouter component classifies and routes incoming ChatMessage objects to different connections using a generative LLM. This component can be used with general-purpose LLMs and with specialized LLMs for moderation like Llama Guard.

Usage example:

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers.llm_messages_router import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

chat_generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B", "provider": "groq"},
)

router = LLMMessagesRouter(
    chat_generator=chat_generator,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"],
)

print(router.run([ChatMessage.from_user("How to rob a bank?")]))
```

New HuggingFaceTEIRanker Component

HuggingFaceTEIRanker enables end-to-end reranking via the Text Embeddings Inference (TEI) API. It supports both self-hosted TEI services and Hugging Face Inference Endpoints, giving you flexible, high-quality reranking out of the box.

πŸš€ New Features

  • Added a ComponentInfo dataclass to haystack to store information about the component. We pass it to StreamingChunk so we can tell from which component a stream is coming.
  • Pass the component_info to the StreamingChunk in the OpenAIChatGenerator, AzureOpenAIChatGenerator, HuggingFaceAPIChatGenerator, HuggingFaceGenerator, HuggingFaceLocalGenerator and HuggingFaceLocalChatGenerator.

  • Added the enable_streaming_callback_passthrough to the init, run and run_async methods of ToolInvoker. If set to True the ToolInvoker will try and pass the streaming_callback function to a tool's invoke method only if the tool's invoke method has streaming_callback in its signature.

  • Added dedicated finish_reason field to StreamingChunk class to improve type safety and enable sophisticated streaming UI logic. The field uses a FinishReason type alias with standard values: "stop", "length", "tool_calls", "content_filter", plus the Haystack-specific value "tool_call_results" (used by ToolInvoker to indicate tool execution completion).

  • Updated ToolInvoker component to use the new finish_reason field when streaming tool results. The component now sets finish_reason="tool_call_results" in the final streaming chunk to indicate that tool execution has completed, while maintaining backward compatibility by also setting the value in meta["finish_reason"].

  • Added a raise_on_failure boolean parameter to OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder. If set to True then the component will raise an exception when there is an error with the API request. It is set to False by default so the previous behavior of logging an exception and continuing is still the default.

  • Add AsyncHFTokenStreamingHandler for async streaming support in HuggingFaceLocalChatGenerator

  • For HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator all additional key, value pairs passed in api_params are now passed to the initializations of the underlying Inference Clients. This allows passing of additional parameters to the clients like timeout, headers, provider, etc. This means we now can easily specify a different inference provider by passing the provider key in api_params.

  • Updated StreamingChunk to add the fields tool_calls, tool_call_result, index, and start to make it easier to format the stream in a streaming callback.

    • Added new dataclass ToolCallDelta for the StreamingChunk.tool_calls field to reflect that the arguments can be a string delta.
    • Updated print_streaming_chunk and _convert_streaming_chunks_to_chat_message utility methods to use these new fields. This especially improves the formatting when using print_streaming_chunk with Agent.
    • Updated OpenAIGenerator, OpenAIChatGenerator, HuggingFaceAPIGenerator, HuggingFaceAPIChatGenerator, HuggingFaceLocalGenerator and HuggingFaceLocalChatGenerator to follow the new dataclasses.
    • Updated ToolInvoker to follow the StreamingChunk dataclass.

⚑️ Enhancement Notes

  • Added a new deserialize_component_inplace function to handle generic component deserialization that works with any component type.
  • Made doc-parser a core dependency since ComponentTool that uses it is one of the core Tool components.
  • Make the PipelineBase().validate_input method public so users can use it with the confidence that it won't receive breaking changes without warning. This method is useful for checking that all required connections in a pipeline have a connection and is automatically called in the run method of Pipeline. It is being exposed as public for users who would like to call this method before runtime to validate the pipeline.
  • For component run Datadog tracing, set the span resource name to the component name instead of the operation name.
  • Added a trust_remote_code parameter to the SentenceTransformersSimilarityRanker component. When set to True, this enables execution of custom models and scripts hosted on the Hugging Face Hub.
  • Add a new parameter require_tool_call_ids to ChatMessage.to_openai_dict_format. The default is True, for compatibility with OpenAI's Chat API: if the id field is missing in a Tool Call, an error is raised. Using False is useful for shallow OpenAI-compatible APIs, where the id field is not required.
  • Haystack's core modules are now "type complete", meaning that all function parameters and return types are explicitly annotated. This increases the usefulness of the newly added py.typed marker and sidesteps differences in type inference between the various type checker implementations.
  • HuggingFaceAPIChatGenerator now uses the util method _convert_streaming_chunks_to_chat_message. This is to help with being consistent for how we convert StreamingChunks into a final ChatMessage.

    • If only system messages are provided as input, a warning will be logged indicating that this is likely not intended and that the user should probably also provide user messages.

⚠️ Deprecation Notes

  • async_executor parameter in ToolInvoker is deprecated in favor of max_workers parameter and will be removed in Haystack 2.16.0. You can use max_workers parameter to control the number of threads used for parallel tool calling.

πŸ› Bug Fixes

  • Fixed the to_dict and from_dict of ToolInvoker to properly serialize the streaming_callback init parameter.
  • Fix bug where if raise_on_failure=False and an error occurs mid-batch that the following embeddings would be paired with the wrong documents.
  • Fix component_invoker used by ComponentTool to work when a dataclass like ChatMessage is directly passed to component_tool.invoke(...). Previously this would either cause an error or silently skip your input.
  • Fixed a bug in the LLMMetadataExtractor that occurred when processing Document objects with None or empty string content. The component now gracefully handles these cases by marking such documents as failed and providing an appropriate error message in their metadata, without attempting an LLM call.
  • RecursiveDocumentSplitter now generates a unique Document.id for every chunk. The meta fields (split_id, parent_id, etc.) are populated before Document creation, so the hash used for id generation is always unique.
  • Fixed the to_dict and from_dict methods of ConditionalRouter to properly handle the case when output_type is a List of types or a List of strings. This occurs when a user specifies a route in ConditionalRouter to have multiple outputs.
  • Fix serialization of GeneratedAnswer when ChatMessage objects are nested in meta.
  • Fix the serialization of ComponentTool and Tool when specifying outputs_to_string. Previously, an error occurred on deserialization right after serializing if outputs_to_string was not None.
  • When calling set_output_types we now also check that the decorator @component.output_types is not present on the run_async method of a Component. Previously we only checked that the Component.run method did not possess the decorator.
  • Fix type comparison in schema validation by replacing is not with != when checking the type List[ChatMessage]. This prevents false mismatches due to Python's is operator comparing object identity instead of equality.
  • Re-export symbols in __init__.py files. This ensures that short imports like from haystack.components.builders import ChatPromptBuilder work equivalently to from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder, without causing errors or warnings in mypy/Pylance.
  • The SuperComponent class can now correctly serialize and deserialize a SuperComponent based on an async pipeline. Previously, the SuperComponent class always assumed the underlying pipeline was synchronous.
  • Fixed a bug in OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder where if an OpenAI API error occurred mid-batch then the following embeddings would be paired with the wrong documents.

πŸ’™ Big thank you to everyone who contributed to this release!

  • @Amnah199 @Seth-Peters @anakin87 @atopx @davidsbatista @denisw @gulbaki @julian-risch @lan666as @mdrazak2001 @mpangrazzi @sjrl @srini047 @vblagoje

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.15.0-rc1

Release Notes

v2.16.0-rc0

New Features

  • Add a ComponentInfo dataclass to the haystack.dataclasses module. This dataclass is used to store information about the component. We pass it to StreamingChunk so we can tell from which component a stream is coming.

    • Pass the component_info to the StreamingChunk in the OpenAIChatGenerator, AzureOpenAIChatGenerator, HuggingFaceAPIChatGenerator and HuggingFaceLocalChatGenerator.
  • Added the enable_streaming_callback_passthrough to the ToolInvoker init, run and run_async methods. If set to True, the ToolInvoker will try to pass the streaming_callback function to a tool's invoke method only if the tool's invoke method has streaming_callback in its signature.

  • Added dedicated finish_reason field to StreamingChunk class to improve type safety and enable sophisticated streaming UI logic. The field uses a FinishReason type alias with standard values: "stop", "length", "tool_calls", "content_filter", plus the Haystack-specific value "tool_call_results" (used by ToolInvoker to indicate tool execution completion).

  • Updated ToolInvoker component to use the new finish_reason field when streaming tool results. The component now sets finish_reason="tool_call_results" in the final streaming chunk to indicate that tool execution has completed, while maintaining backward compatibility by also setting the value in meta["finish_reason"].

  • Added new HuggingFaceTEIRanker component to enable reranking with Text Embeddings Inference (TEI) API. This component supports both self-hosted Text Embeddings Inference services and Hugging Face Inference Endpoints.

  • Added a raise_on_failure boolean parameter to OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder. If set to True, the component will raise an exception when there is an error with the API request. It is set to False by default so the previous behavior of logging an exception and continuing is still the default.

  • ToolInvoker now executes tool_calls in parallel for both sync and async mode.

  • Add AsyncHFTokenStreamingHandler for async streaming support in HuggingFaceLocalChatGenerator

  • We introduced the LLMMessagesRouter component, that routes Chat Messages to different connections, using a generative Language Model to perform classification.

    This component can be used with general-purpose LLMs and with specialized LLMs for moderation like Llama Guard.

    Usage example:

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.routers.llm_messages_router import LLMMessagesRouter
from haystack.dataclasses import ChatMessage

# initialize a Chat Generator with a generative model for moderation
chat_generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-4-12B", "provider": "groq"},
)

router = LLMMessagesRouter(
    chat_generator=chat_generator,
    output_names=["unsafe", "safe"],
    output_patterns=["unsafe", "safe"],
)

print(router.run([ChatMessage.from_user("How to rob a bank?")]))
```

    • For HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator, all additional key-value pairs passed in api_params are now passed to the initialization of the underlying Inference Clients. This allows passing additional parameters to the clients, like timeout, headers, provider, etc.
    • This means we can now easily specify a different inference provider by passing the provider key in api_params.
  • Updated StreamingChunk to add the fields tool_calls, tool_call_result, index, and start to make it easier to format the stream in a streaming callback.

    • Added new dataclass ToolCallDelta for the StreamingChunk.tool_calls field to reflect that the arguments can be a string delta.
    • Updated the print_streaming_chunk and _convert_streaming_chunks_to_chat_message utility methods to use these new fields. This especially improves the formatting when using print_streaming_chunk with Agent.
    • Updated OpenAIGenerator, OpenAIChatGenerator, HuggingFaceAPIGenerator, HuggingFaceAPIChatGenerator, HuggingFaceLocalGenerator and HuggingFaceLocalChatGenerator to follow the new dataclasses.
    • Updated ToolInvoker to follow the StreamingChunk dataclass.
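  The finish_reason values above can drive simple streaming UI logic. Below is a minimal, self-contained sketch of that idea; the Chunk dataclass and render function are hypothetical stand-ins, not Haystack's actual StreamingChunk API:

  ```python
  from dataclasses import dataclass
  from typing import Literal, Optional

  # The documented value set; "tool_call_results" is Haystack-specific.
  FinishReason = Literal["stop", "length", "tool_calls", "content_filter", "tool_call_results"]

  @dataclass
  class Chunk:
      """Hypothetical stand-in for a streaming chunk."""
      content: str
      finish_reason: Optional[FinishReason] = None

  def render(chunk: Chunk) -> str:
      """Decide what a streaming UI should show for each chunk."""
      if chunk.finish_reason is None:
          return chunk.content                  # still streaming text
      if chunk.finish_reason == "tool_calls":
          return "[running tools...]"           # the model requested tool calls
      if chunk.finish_reason == "tool_call_results":
          return "[tool execution finished]"    # ToolInvoker completed
      return ""                                 # "stop", "length", "content_filter": end of stream

  print(render(Chunk("Hel")))
  print(render(Chunk("", finish_reason="tool_call_results")))
  ```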

Enhancement Notes

  • Added a new deserialize_component_inplace function to handle generic component deserialization that works with any component type.
  • Made doc-parser a core dependency, since the ComponentTool that uses it is one of the core Tool components.
  • Made the PipelineBase().validate_input method public so users can use it with the confidence that it won't receive breaking changes without warning. This method is useful for checking that all required connections in a pipeline are satisfied and is automatically called in the run method of Pipeline. It is exposed as public for users who would like to validate the pipeline before runtime.
  • For component run Datadog tracing, set the span resource name to the component name instead of the operation name.
  • Added a trust_remote_code parameter to the SentenceTransformersSimilarityRanker component. When set to True, this enables execution of custom models and scripts hosted on the Hugging Face Hub.
  • Add a new parameter require_tool_call_ids to ChatMessage.to_openai_dict_format. The default is True, for compatibility with OpenAI's Chat API: if the id field is missing in a Tool Call, an error is raised. Using False is useful for shallow OpenAI-compatible APIs, where the id field is not required.
  • Haystack's core modules are now ["type complete"](https://typing.python.org/en/latest/guides/libraries.html#how-much-of-my-library-needs-types), meaning that all function parameters and return types are explicitly annotated. This increases the usefulness of the newly added py.typed marker and sidesteps differences in type inference between the various type checker implementations.
  • Refactored the HuggingFaceAPIChatGenerator to use the utility method _convert_streaming_chunks_to_chat_message. This keeps the conversion of StreamingChunks into a final ChatMessage consistent across generators.
    • We also add ComponentInfo to the StreamingChunks made in HuggingFaceGenerator and HuggingFaceLocalGenerator, so we can tell which component a stream is coming from.
  • If only system messages are provided as input, a warning will be logged indicating that this is likely not intended and that user messages should probably also be provided.
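The require_tool_call_ids behavior described above can be sketched in plain Python. This is a hypothetical re-implementation of the check for illustration, not Haystack's actual to_openai_dict_format code; the dict shapes are assumptions:

```python
from typing import Any, Dict, List

def to_openai_tool_calls(tool_calls: List[Dict[str, Any]],
                         require_tool_call_ids: bool = True) -> List[Dict[str, Any]]:
    """Hypothetical sketch: convert tool calls to OpenAI format, enforcing ids by default."""
    result = []
    for call in tool_calls:
        if require_tool_call_ids and call.get("id") is None:
            # OpenAI's Chat API requires an id on every tool call
            raise ValueError("`id` is required for OpenAI tool calls.")
        entry: Dict[str, Any] = {
            "type": "function",
            "function": {"name": call["name"], "arguments": call["arguments"]},
        }
        if call.get("id") is not None:
            entry["id"] = call["id"]
        result.append(entry)
    return result

# Shallow OpenAI-compatible APIs can skip the ids:
calls = [{"name": "get_weather", "arguments": '{"city": "Berlin"}', "id": None}]
print(to_openai_tool_calls(calls, require_tool_call_ids=False))
```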

Deprecation Notes

  • The async_executor parameter in ToolInvoker is deprecated in favor of the max_workers parameter and will be removed in Haystack 2.16.0. You can use the max_workers parameter to control the number of threads used for parallel tool calling.
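The thread-based parallelism that max_workers controls can be sketched with a plain thread pool. This is a simplified model of the pattern, not ToolInvoker's actual implementation; the tools below are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def invoke_tool(call):
    """Dispatch one (name, args) tool call; real ToolInvoker dispatches to registered Tool objects."""
    name, args = call
    tools = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}  # hypothetical tools
    return name, tools[name](*args)

tool_calls = [("add", (2, 3)), ("mul", (4, 5))]

# max_workers bounds the number of threads used for parallel tool calling
with ThreadPoolExecutor(max_workers=4) as executor:
    results = dict(executor.map(invoke_tool, tool_calls))

print(results)  # {'add': 5, 'mul': 20}
```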

Bug Fixes

  • Fixed the to_dict and from_dict of ToolInvoker to properly serialize the streaming_callback init parameter.
  • Fixed a bug where, if raise_on_failure=False and an error occurred mid-batch, the following embeddings would be paired with the wrong documents.
  • Fix component_invoker used by ComponentTool to work when a dataclass like ChatMessage is directly passed to component_tool.invoke(...). Previously this would either cause an error or silently skip your input.
  • Fixed a bug in the LLMMetadataExtractor that occurred when processing Document objects with None or empty string content. The component now gracefully handles these cases by marking such documents as failed and providing an appropriate error message in their metadata, without attempting an LLM call.
  • RecursiveDocumentSplitter now generates a unique Document.id for every chunk. The meta fields (split_id, parent_id, etc.) are populated before Document creation, so the hash used for id generation is always unique.
  • In ConditionalRouter fixed the to_dict and from_dict methods to properly handle the case when output_type is a List of types or a List of strings. This occurs when a user specifies a route in ConditionalRouter to have multiple outputs.
  • Fix serialization of GeneratedAnswer when ChatMessage objects are nested in meta.
  • Fix the serialization of ComponentTool and Tool when specifying outputs_to_string. Previously an error occurred on deserialization right after serializing if outputs_to_string is not None.
  • When calling set_output_types we now also check that the decorator @component.output_types is not present on the run_async method of a Component. Previously we only checked that the Component.run method did not possess the decorator.
  • Fix type comparison in schema validation by replacing is not with != when checking the type List[ChatMessage]. This prevents false mismatches due to Python's is operator comparing object identity instead of equality.
  • Re-export symbols in __init__.py files. This ensures that short imports like from haystack.components.builders import ChatPromptBuilder work equivalently to from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder, without causing errors or warnings in mypy/Pylance.
  • The SuperComponent class can now correctly serialize and deserialize a SuperComponent based on an async pipeline. Previously, the SuperComponent class always assumed the underlying pipeline was synchronous.
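The RecursiveDocumentSplitter fix above works because the id hash now covers per-chunk meta. A self-contained sketch of the idea; the hashing scheme here is illustrative, not Haystack's exact algorithm:

```python
import hashlib

def chunk_id(content: str, meta: dict) -> str:
    """Hash content together with meta so equal-content chunks still get unique ids."""
    payload = content + str(sorted(meta.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two chunks with identical text but different split positions:
a = chunk_id("same text", {"split_id": 0, "parent_id": "doc1"})
b = chunk_id("same text", {"split_id": 1, "parent_id": "doc1"})
assert a != b  # populating meta before id generation keeps ids unique
```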

v2.15.0-rc0

Highlights

Adding visualization capabilities to the SuperComponent via the following methods:

  • show() displays pipeline diagrams in Jupyter notebooks.
  • draw() saves pipeline diagrams as images to specified file paths.

Upgrade Notes

  • We've added a py.typed file to Haystack to enable type information to be used by downstream projects, in line with PEP 561. This means Haystack's type hints will now be visible to type checkers in projects that depend on it. Haystack is primarily type checked using mypy (not pyright) and, despite our efforts, some type information can be incomplete or unreliable. If you use static type checking in your own project, you may notice some changes: previously, Haystack's types were effectively treated as Any, but now actual type information will be available and enforced.
  • The deprecated deserialize_tools_inplace utility function has been removed. Use deserialize_tools_or_toolset_inplace instead, importing it as follows: from haystack.tools import deserialize_tools_or_toolset_inplace.

New Features

    • Added run_async method to ToolInvoker class to allow asynchronous tool invocations.
    • Agent can now stream tool result with run_async method as well.
    • Introduced serialize_value and deserialize_value utility methods for consistent value (de)serialization across modules.
    • Moved the State class to the agents.state module and added serialization and deserialization capabilities.
  • Agent now supports a List of Tools or a Toolset as input.

  • Add support for multiple outputs in ConditionalRouter

  • Implement JSON-safe serialization for OpenAI usage data by converting token counts and details (like CompletionTokensDetails and PromptTokensDetails) into plain dictionaries.

  • Added a new SentenceTransformersSimilarityRanker component that uses the Sentence Transformers library to rank documents based on their semantic similarity to the query.

    This component is a replacement for the legacy TransformersSimilarityRanker component, which may be deprecated in a future release, with removal following a deprecation period.

    The SentenceTransformersSimilarityRanker also allows choosing different inference backends: PyTorch, ONNX, and OpenVINO.

    To use the SentenceTransformersSimilarityRanker, you need to install sentence-transformers>=4.1.0.

  • Add a streaming_callback parameter to ToolInvoker to enable streaming of tool results. Note that tool_result is emitted only after the tool execution completes and is not streamed incrementally.

  • Update print_streaming_chunk to print ToolCall information if it is present in the chunk's metadata.

  • Update Agent to forward the streaming_callback to ToolInvoker to emit tool results during tool invocation.

  • Enhance SuperComponent's type compatibility check to return the detected common type between two input types.
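The JSON-safe serialization of OpenAI usage data mentioned above boils down to converting nested objects into plain dicts. A sketch with stand-in dataclasses; the field names mirror OpenAI's usage objects, but the classes below are hypothetical, not the OpenAI SDK's:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class CompletionTokensDetails:
    """Stand-in for the OpenAI SDK's nested usage-details object."""
    reasoning_tokens: int = 0

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    completion_tokens_details: CompletionTokensDetails

usage = Usage(
    prompt_tokens=12,
    completion_tokens=34,
    completion_tokens_details=CompletionTokensDetails(reasoning_tokens=5),
)

# asdict recurses into nested dataclasses, yielding a JSON-safe plain dict
plain = asdict(usage)
print(json.dumps(plain))
```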

Enhancement Notes

  • When using HuggingFaceAPIChatGenerator with streaming, the returned ChatMessage now contains the number of prompt tokens and completion tokens in its meta data. Internally, the HuggingFaceAPIChatGenerator requests an additional streaming chunk that contains usage data. It then processes the usage streaming chunk to add usage meta data to the returned ChatMessage.

  • We now have a Protocol for TextEmbedder. The protocol makes it easier to create custom components or SuperComponents that expect any TextEmbedder as init parameter.

  • We added a Component signature validation method that details the mismatches between the run and run_async method signatures. This allows a user to debug custom components easily.

  • Enhanced the AnswerBuilder component with two agent-friendly features:

1.  All generated messages are now stored in the `meta` field of the GeneratedAnswer objects under an `all_messages` key, improving traceability and debugging capabilities.
2.  Added a new `last_message_only` parameter that, when set to `True`, processes only the last message in the replies while still preserving the complete conversation history in metadata. This is particularly useful for agent workflows where only the final response needs to be processed.
  • A variety of improvements have been made so an Agent component can be directly used in ComponentTool enabling straightforward building of Multi-Agent systems. These improvements include:

    • Adding a last_message field to the Agent's output which returns the last generated ChatMessage.
    • Improving the _default_output_handler in the ToolInvoker to try and first serialize the outputs in the tool result before converting it into a string. This is especially relevant for getting a better representation when stringifying dataclasses like ChatMessage.
  • Added type hints to the component decorator. This improves support for Pyright/Pylance, enabling IDEs like VSCode to show docstrings for components.

  • Updated pipeline execution logic to use a new utility method _deepcopy_with_exceptions, which attempts to deep copy an object and safely falls back to the original object if copying fails. Additionally _deepcopy_with_exceptions skips deep-copying of Component, Tool, and Toolset instances when used as runtime parameters. This prevents errors and unintended behavior caused by trying to deepcopy objects that contain non-copyable attributes (e.g. Jinja2 templates, clients). Previously, standard deepcopy was used on inputs and outputs, which occasionally led to errors since certain Python objects cannot be deepcopied.

  • Refactored JSON Schema generation for ComponentTool parameters using Pydantic's model_json_schema, enabling expanded type support (e.g. Union, Enum, Dict, etc.). We also convert dataclasses to Pydantic models before calling model_json_schema to preserve docstring descriptions of the parameters in the schema. This means dataclasses like ChatMessage, Document, etc. now have correctly defined JSON schemas.

  • The draw() and show() methods from Pipeline now have an extra boolean parameter, super_component_expansion. When set to True and the pipeline contains SuperComponents, the visualization diagram shows the internal structure of super-components as if they were part of the pipeline, instead of a "black-box" with the name of the SuperComponent.

  • Improve the type annotations for @component and the Component protocol. The type checker can now ensure that a @component class provides a compatible run() method, whose required return type has been changed from Dict[str, Any] (invariant) to Mapping[str, Any] to allow TypedDict to be used for output types.

    • Updates StreamingChunk construction in ToolInvoker to also stream a chunk with a finish reason. This is useful when using the print_streaming_chunk utility method.
    • Update print_streaming_chunk to have better formatting of messages, especially when using it with Agent.
    • Also updated to work with the current version of the AWS Bedrock integration by working with the dict representation of ChoiceDeltaToolCall
  • ComponentTool now preserves and combines docstrings from underlying pipeline components when wrapping a SuperComponent. When a SuperComponent is used with ComponentTool, two key improvements are made:

1.  Parameter descriptions are now extracted from the original components in the wrapped pipeline. When a single input is mapped to multiple components, the parameter descriptions are combined from all mapped components, providing comprehensive information about how the parameter is used throughout the pipeline.
2.  The overall component description is now generated from the descriptions of all underlying components instead of using the generic SuperComponent description. This helps LLMs understand what the component actually does rather than just seeing "Runs the wrapped pipeline with the provided inputs."

These changes make SuperComponents much more useful with LLM function calling as the LLM will get detailed information about both the component's purpose and its parameters.
  • Adds local_files_only parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to allow loading models in offline mode.

  • The DocumentRecallEvaluator was updated. Now, when in MULTI_HIT mode, the division is over the unique ground truth documents instead of the total number of ground truth documents. We also added checks for emptiness. If there are no retrieved documents or all of them have an empty string as content, we return 0.0 and log a warning. Likewise, if there are no ground truth documents or all of them have an empty string as content, we return 0.0 and log a warning.
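The updated MULTI_HIT computation above can be expressed directly: recall is the number of unique ground-truth documents retrieved divided by the number of unique ground-truth documents, with 0.0 for the empty edge cases. A simplified sketch over plain strings; the evaluator itself works on Document objects:

```python
def multi_hit_recall(ground_truth: list, retrieved: list) -> float:
    """Recall over *unique* ground-truth entries, with 0.0 for empty inputs."""
    gt = {d for d in ground_truth if d}    # drop empty-content entries
    hits = {d for d in retrieved if d}
    if not gt or not hits:                 # emptiness checks return 0.0
        return 0.0
    return len(gt & hits) / len(gt)        # divide by unique ground truth

# Duplicate ground-truth entries no longer inflate the denominator:
print(multi_hit_recall(["a", "a", "b"], ["a"]))  # 0.5
print(multi_hit_recall([], ["a"]))               # 0.0
```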

Deprecation Notes

  • Deprecated the State class in the dataclasses module. Users are encouraged to transition to the new version of State now located in the agents.state module. A deprecation warning has been added to guide this migration.

Security Notes

  • Made the QUOTE_SPANS_RE regex ReDoS-safe. This prevents potential catastrophic backtracking on malicious inputs.

Bug Fixes

  • Fixed a potential ReDoS issue in the QUOTE_SPANS_RE regex used inside the SentenceSplitter component.
  • Add the init parameters timeout and max_retries to the to_dict methods of OpenAITextEmbedder and OpenAIDocumentEmbedder. This ensures that these values are properly serialized when using the to_dict method of these components.
  • Use coerce_tag_value in LoggingTracer to serialize tag values.
  • Update the __deepcopy__ of ComponentTool to gracefully handle NotImplementedError when trying to deepcopy attributes.
  • Fix an issue where OpenAIChatGenerator and OpenAIGenerator were not properly handling wrapped streaming responses from tools like Weave.
  • A bug in the RecursiveDocumentSplitter was fixed for the case where a split text is longer than the split_length and recursive chunking is triggered.
  • Make internal tool conversion in the HuggingFaceAPICompatibleChatGenerator compatible with huggingface_hub>=0.31.0. In the huggingface_hub library, the arguments attribute of ChatCompletionInputFunctionDefinition has been renamed to parameters. Our implementation is compatible with both the legacy version and the new one.
  • The HuggingFaceAPIChatGenerator now checks the type of the arguments variable in the tool calls returned by the Hugging Face API. If arguments is a JSON string, it is parsed into a dictionary. Previously, the arguments type was not checked, which sometimes led to failures later in the tool workflow.
  • Move deserialize_tools_inplace back to its original import path of from haystack.tools.tool import deserialize_tools_inplace.
  • To properly preserve the context when running AsyncPipeline with components that only have sync run methods, we copy the context using contextvars.copy_context() and run the component using ctx.run(...), so context like the active tracing span is preserved. This means that if your component 1) only has a sync run method and 2) logs something to the tracer, the trace will be properly nested within the parent context.
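The context-preservation pattern above can be shown with a minimal stdlib example. This is a sketch of the technique, not Haystack's pipeline code; the component and span names are hypothetical:

```python
import asyncio
import contextvars

active_span = contextvars.ContextVar("active_span", default=None)

def sync_run() -> str:
    """A sync-only component that reads tracing context (e.g. to nest its span)."""
    return f"running inside span: {active_span.get()}"

async def run_component() -> str:
    active_span.set("parent-span")
    ctx = contextvars.copy_context()   # snapshot the current context...
    loop = asyncio.get_running_loop()
    # ...and run the sync component inside it, so the active span is preserved
    # even though the work happens on an executor thread
    return await loop.run_in_executor(None, lambda: ctx.run(sync_run))

print(asyncio.run(run_component()))  # running inside span: parent-span
```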

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.14.3

Bug Fixes

  • In ConditionalRouter fixed the to_dict and from_dict methods to properly handle the case when output_type is a List of types or a List of strings. This occurs when a user specifies a route in ConditionalRouter to have multiple outputs.
  • Fix the serialization of ComponentTool and Tool when specifying outputs_to_string. Previously an error occurred on deserialization right after serializing if outputs_to_string is not None.

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.14.3-rc1

- Python
Published by github-actions[bot] 8 months ago

farm-haystack - v2.14.2

Bug Fixes

  • Fixed a bug in OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder where if an OpenAI API error occurred mid-batch then the following embeddings would be paired with the wrong documents.

New Features

  • Added a raise_on_failure boolean parameter to OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder. If set to True then the component will raise an exception when there is an error with the API request. It is set to False by default so the previous behavior of logging an exception and continuing is still the default.

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.2-rc1

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.1

Release Notes

v2.14.1

Bug Fixes

  • Fixed a mypy issue in the OpenAIChatGenerator and its handling of stream responses. This issue only occurs with mypy >=1.16.0.
  • Fix type comparison in schema validation by replacing is not with != when checking the type List[ChatMessage]. This prevents false mismatches due to Python's is operator comparing object identity instead of equality.

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.1-rc1

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.0

⭐️ Highlights

Enhancements for Complex Agentic Systems

We've improved agent workflows with better message handling and streaming support. The Agent component now returns a last_message output for quick access to the final message, and can use a streaming_callback to emit tool results in real time. You can use the updated print_streaming_chunk or write your own callback function to display ToolCall details during streaming.

```python
from haystack.components.websearch import SerperDevWebSearch
from haystack.components.agents import Agent
from haystack.components.generators.utils import print_streaming_chunk
from haystack.tools import tool, ComponentTool
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

web_search = ComponentTool(name="web_search", component=SerperDevWebSearch(top_k=5))
wiki_search = ComponentTool(name="wiki_search", component=SerperDevWebSearch(top_k=5, allowed_domains=["https://www.wikipedia.org/"]))

research_agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    system_prompt="""
    You are a research agent that can find information on web or specifically on wikipedia.
    Use wiki_search tool if you need facts and use web_search tool for latest news on topics.
    Use one tool at a time, use the other tool if the retrieved information is not enough.
    Summarize the retrieved information before returning response to the user.
    """,
    tools=[web_search, wiki_search],
    streaming_callback=print_streaming_chunk,
)

result = research_agent.run(messages=[ChatMessage.from_user("Can you tell me about Florence Nightingale's life?")])
```

Enabling streaming with the `print_streaming_chunk` function looks like this:

```
[TOOL CALL]
Tool: wiki_search
Arguments: {"query":"Florence Nightingale"}

[TOOL RESULT]
{'documents': [{'title': 'List of schools in Nottinghamshire', 'link': 'https://www.wikipedia.org/wiki/List_of_schools_in_Nottinghamshire', 'position': 1, 'id': 'a6d0fe00f1e0cd06324f80fb926ba647878fb7bee8182de59a932500aeb54a5b', 'content': 'The Florence Nightingale Academy, Eastwood; The Flying High Academy, Mansfield; Forest Glade Primary School, Sutton-in-Ashfield; Forest Town Primary School ...', 'blob': None, 'score': None, 'embedding': None, 'sparse_embedding': None}], 'links': ['https://www.wikipedia.org/wiki/List_of_schools_in_Nottinghamshire']}
...
```

Print the `last_message`:

```python
print("Final Answer:", result["last_message"].text)
```

```
Final Answer: Florence Nightingale (1820-1910) was a pioneering figure in nursing and is often hailed as the founder of modern nursing. She was born...
```

Additionally, [AnswerBuilder](https://docs.haystack.deepset.ai/docs/answerbuilder) stores all generated messages in the `all_messages` meta field of GeneratedAnswer and supports a new `last_message_only` mode for lightweight flows where only the final message needs to be processed.

Visualizing Pipelines with SuperComponents

We extended pipeline.draw() and pipeline.show(), which save pipeline diagrams to image files or display them in Jupyter notebooks. You can now pass super_component_expansion=True to expand any SuperComponents and draw more detailed visualizations.

Here is an example with a pipeline containing the MultiFileConverter and DocumentPreprocessor SuperComponents. After installing the dependencies that the MultiFileConverter needs for all supported file formats via pip install haystack-ai pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas, you can run:

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

# expanded pipeline that shows all components
path = Path("expanded_pipeline.png")
pipeline.draw(path=path, super_component_expansion=True)

# original pipeline
path = Path("original_pipeline.png")
pipeline.draw(path=path)
```

Extended vs Original Pipeline

SentenceTransformersSimilarityRanker with PyTorch, ONNX, and OpenVINO

We added a new SentenceTransformersSimilarityRanker component that uses the Sentence Transformers library to rank documents based on their semantic similarity to the query. This component replaces the legacy TransformersSimilarityRanker component, which may be deprecated in a future release, with removal following a deprecation period. The SentenceTransformersSimilarityRanker also allows choosing different inference backends: PyTorch, ONNX, and OpenVINO. For example, after installing sentence-transformers>=4.1.0, you can run:

```python
from haystack import Document
from haystack.components.rankers import SentenceTransformersSimilarityRanker
from haystack.utils.device import ComponentDevice

onnx_ranker = SentenceTransformersSimilarityRanker(
    model="sentence-transformers/all-MiniLM-L6-v2",
    token=None,
    device=ComponentDevice.from_str("cpu"),
    backend="onnx",
)
onnx_ranker.warm_up()
docs = [Document(content="Berlin"), Document(content="Sarajevo")]
output = onnx_ranker.run(query="City in Germany", documents=docs)
ranked_docs = output["documents"]
```

⬆️ Upgrade Notes

  • We've added a py.typed file to Haystack to enable type information to be used by downstream projects, in line with PEP 561. This means Haystack's type hints will now be visible to type checkers in projects that depend on it. Haystack is primarily type checked using mypy (not pyright) and, despite our efforts, some type information can be incomplete or unreliable. If you use static type checking in your own project, you may notice some changes: previously, Haystack's types were effectively treated as Any, but now actual type information will be available and enforced. We'll continue improving typing with the next release.
  • The deprecated deserialize_tools_inplace utility function has been removed. Use deserialize_tools_or_toolset_inplace instead, importing it as follows: from haystack.tools import deserialize_tools_or_toolset_inplace.

πŸš€ New Features

  • Added run_async method to ToolInvoker class to allow asynchronous tool invocations.

  • Agent can now stream tool result with run_async method as well.

  • Introduced serialize_value and deserialize_value utility methods for consistent value (de)serialization across modules.

  • Moved the State class to the agents.state module and added serialization and deserialization capabilities.

  • Add support for multiple outputs in ConditionalRouter

  • Implement JSON-safe serialization for OpenAI usage data by converting token counts and details (like CompletionTokensDetails and PromptTokensDetails) into plain dictionaries.

  • Added a new SentenceTransformersSimilarityRanker component that uses the Sentence Transformers library to rank documents based on their semantic similarity to the query. This component is a replacement for the legacy TransformersSimilarityRanker component, which may be deprecated in a future release, with removal following after a deprecation period. The SentenceTransformersSimilarityRanker also allows choosing different inference backends: PyTorch, ONNX, and OpenVINO. To use the SentenceTransformersSimilarityRanker, you need to install sentence-transformers>=4.1.0.

  • Add a streaming_callback parameter to ToolInvoker to enable streaming of tool results. Note that tool_result is emitted only after the tool execution completes and is not streamed incrementally.

  • Update print_streaming_chunk to print ToolCall information if it is present in the chunk's metadata.

  • Update Agent to forward the streaming_callback to ToolInvoker to emit tool results during tool invocation.

  • Enhance SuperComponent's type compatibility check to return the detected common type between two input types.

⚑️ Enhancement Notes

  • When using HuggingFaceAPIChatGenerator with streaming, the returned ChatMessage now contains the number of prompt tokens and completion tokens in its meta data. Internally, the HuggingFaceAPIChatGenerator requests an additional streaming chunk that contains usage data. It then processes the usage streaming chunk to add usage meta data to the returned ChatMessage.

  • We now have a Protocol for TextEmbedder. The protocol makes it easier to create custom components or SuperComponents that expect any TextEmbedder as init parameter.

  • We added a Component signature validation method that details the mismatches between the run and run_async method signatures. This allows a user to debug custom components easily.

  • Enhanced the AnswerBuilder component with two agent-friendly features:

1.  All generated messages are now stored in the `meta` field of the GeneratedAnswer objects under an `all_messages` key, improving traceability and debugging capabilities.
2.  Added a new `last_message_only` parameter that, when set to `True`, processes only the last message in the replies while still preserving the complete conversation history in metadata. This is particularly useful for agent workflows where only the final response needs to be processed.
  • A variety of improvements have been made so an Agent component can be directly used in ComponentTool enabling straightforward building of Multi-Agent systems. These improvements include:

    • Adding a last_message field to the Agent's output which returns the last generated ChatMessage.
    • Improving the _default_output_handler in the ToolInvoker to try and first serialize the outputs in the tool result before converting it into a string. This is especially relevant for getting a better representation when stringifying dataclasses like ChatMessage.
  • Added type hints to the component decorator. This improves support for Pyright/Pylance, enabling IDEs like VSCode to show docstrings for components.

  • Updated pipeline execution logic to use a new utility method _deepcopy_with_exceptions, which attempts to deep copy an object and safely falls back to the original object if copying fails. Additionally _deepcopy_with_exceptions skips deep-copying of Component, Tool, and Toolset instances when used as runtime parameters. This prevents errors and unintended behavior caused by trying to deepcopy objects that contain non-copyable attributes (e.g. Jinja2 templates, clients). Previously, standard deepcopy was used on inputs and outputs which occasionally lead to errors since certain Python objects cannot be deepcopied.

  • Refactored JSON Schema generation for ComponentTool parameters using Pydantic's model_json_schema, enabling expanded type support (e.g. Union, Enum, Dict, etc.). We also convert dataclasses to Pydantic models before calling model_json_schema to preserve docstring descriptions of the parameters in the schema. This means dataclasses like ChatMessage, Document, etc. now have correctly defined JSON schemas.

  • The draw() and show() methods from Pipeline now have an extra boolean parameter, super_component_expansion. When set to True and the pipeline contains SuperComponents, the visualization diagram shows the internal structure of super-components as if they were part of the pipeline, instead of a "black-box" with the name of the SuperComponent.

  • Improve the type annotations for @component and the Component protocol. The type checker can now ensure that a @component class provides a compatible run() method, whose required return type has been changed from Dict[str, Any] (invariant) to Mapping[str, Any] to allow TypedDict to be used for output types.

    • Updates StreamingChunk construction in ToolInvoker to also stream a chunk with a finish reason. This is useful when using the print_streaming_chunk utility method.
    • Update print_streaming_chunk to have better formatting of messages, especially when using it with Agent.
    • Also updated to work with the current version of the AWS Bedrock integration by working with the dict representation of ChoiceDeltaToolCall
  • ComponentTool now preserves and combines docstrings from underlying pipeline components when wrapping a SuperComponent. When a SuperComponent is used with ComponentTool, two key improvements are made:

1.  Parameter descriptions are now extracted from the original components in the wrapped pipeline. When a single input is mapped to multiple components, the parameter descriptions are combined from all mapped components, providing comprehensive information about how the parameter is used throughout the pipeline.
2.  The overall component description is now generated from the descriptions of all underlying components instead of using the generic SuperComponent description. This helps LLMs understand what the component actually does rather than just seeing "Runs the wrapped pipeline with the provided inputs."

These changes make SuperComponents much more useful with LLM function calling as the LLM will get detailed information about both the component's purpose and its parameters.
  • Adds local_files_only parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to allow loading models in offline mode.

  • The DocumentRecallEvaluator was updated. Now, when in MULTI_HIT mode, the division is over the unique ground truth documents instead of the total number of ground truth documents. We also added checks for emptiness. If there are no retrieved documents or all of them have an empty string as content, we return 0.0 and log a warning. Likewise, if there are no ground truth documents or all of them have an empty string as content, we return 0.0 and log a warning.
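The updated MULTI_HIT computation can be sketched as follows (a minimal sketch of the semantics described above, not Haystack's actual implementation; the function name is hypothetical):

```python
def multi_hit_recall(retrieved: list[str], ground_truth: list[str]) -> float:
    # Ignore empty-string contents, as the emptiness checks in the note do.
    retrieved = [doc for doc in retrieved if doc]
    unique_truth = {doc for doc in ground_truth if doc}
    # Empty retrieved or ground-truth sets yield 0.0 (a warning is logged in Haystack).
    if not retrieved or not unique_truth:
        return 0.0
    hits = sum(1 for doc in unique_truth if doc in retrieved)
    # Divide by the number of *unique* ground-truth documents.
    return hits / len(unique_truth)

print(multi_hit_recall(["a", "b"], ["a", "a", "c"]))  # 0.5
```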

⚠️ Deprecation Notes

  • Deprecated the State class in the dataclasses module. Users are encouraged to transition to the new version of State now located in the agents.state module. A deprecation warning has been added to guide this migration.

πŸ”’ Security Notes

  • Made the QUOTE_SPANS_RE regex in SentenceSplitter ReDoS-safe. This prevents potential backtracking on malicious inputs.
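As a rough illustration of the idea (the patterns below are illustrative, not Haystack's actual QUOTE_SPANS_RE), a ReDoS-safe quote-matching regex avoids nested quantifiers over overlapping character sets:

```python
import re

# Backtracking-prone pattern (illustrative): the nested quantifiers give the
# engine exponentially many ways to partition an unmatched input.
unsafe = re.compile(r'"(?:[^"]+\s*)+"')

# ReDoS-safe alternative: a single negated character class with one quantifier
# leaves only one way to consume each character, so no catastrophic backtracking.
safe = re.compile(r'"[^"]*"')

text = 'She said "hello world" and left.'
print(safe.search(text).group(0))  # "hello world"
```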

πŸ› Bug Fixes

  • Fixed a potential ReDoS issue in the QUOTE_SPANS_RE regex used inside the SentenceSplitter component.
  • Added the init parameters timeout and max_retries to the to_dict methods of OpenAITextEmbedder and OpenAIDocumentEmbedder. This ensures that these values are properly serialized when calling to_dict on these components.
  • Use coerce_tag_value in LoggingTracer to serialize tag values.
  • Update the __deepcopy__ of ComponentTool to gracefully handle NotImplementedError when trying to deepcopy attributes.
  • Fix an issue where OpenAIChatGenerator and OpenAIGenerator were not properly handling wrapped streaming responses from tools like Weave.
  • A bug in the RecursiveDocumentSplitter was fixed for the case where a split_text is longer than the split_length and recursive chunking is triggered.
  • Make internal tool conversion in the HuggingFaceAPICompatibleChatGenerator compatible with huggingface_hub>=0.31.0. In the huggingface_hub library, the arguments attribute of ChatCompletionInputFunctionDefinition has been renamed to parameters. Our implementation is compatible with both the legacy version and the new one.
  • The HuggingFaceAPIChatGenerator now checks the type of the arguments variable in the tool calls returned by the Hugging Face API. If arguments is a JSON string, it is parsed into a dictionary. Previously, the arguments type was not checked, which sometimes led to failures later in the tool workflow.
  • Move deserialize_tools_inplace back to its original import path: from haystack.tools.tool import deserialize_tools_inplace.
  • To properly preserve the context when running an AsyncPipeline with components that only have sync run methods, we copy the context using contextvars.copy_context() and run the component using ctx.run(...), so that context like the active tracing span is preserved. This means that if your component 1) only has a sync run method and 2) logs something to the tracer, the trace will now be properly nested within the parent context.
  • Fixed a bug in the LLMMetadataExtractor that occurred when processing Document objects with None or empty string content. The component now gracefully handles these cases by marking such documents as failed and providing an appropriate error message in their metadata, without attempting an LLM call.
  • Fix the component_invoker used by ComponentTool to work when a dataclass like ChatMessage is directly passed to component_tool.invoke(...). Previously this would either cause an error or silently skip your input.
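The context-propagation fix above relies on the standard library's contextvars module; here is a minimal sketch of the mechanism (variable and function names are illustrative, not Haystack's):

```python
import contextvars

# A context variable standing in for tracing state such as the active span.
active_span = contextvars.ContextVar("active_span", default=None)

def sync_component_run():
    # A sync run() method that reads tracing state from the current context.
    return active_span.get()

# Set the span, snapshot the context, and run the sync component inside the
# snapshot with ctx.run(...), so the component sees the active span.
active_span.set("agent-span")
ctx = contextvars.copy_context()
print(ctx.run(sync_component_run))  # agent-span
```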

πŸ’™ Big thank you to everyone who contributed to this release!

  • @Amnah199 @anakin87 @davidsbatista @denisw @dfokina @jantrienes @mdrazak2001 @medsriha @mpangrazzi @sjrl @vblagoje @wsargent @YassinNouh21

Special thanks and congratulations to our first time contributors!

  • @wsargent made their first contribution in https://github.com/deepset-ai/haystack/pull/9273
  • @YassinNouh21 made their first contribution in https://github.com/deepset-ai/haystack/pull/9303
  • @jantrienes made their first contribution in https://github.com/deepset-ai/haystack/pull/9400

Full Changelog: https://github.com/deepset-ai/haystack/compare/v2.13.0...v2.14.0

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.0-rc2

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.14.0-rc1

- Python
Published by github-actions[bot] 9 months ago

farm-haystack - v2.13.2

⚑️ Enhancement Notes

  • Updated pipeline execution logic to use a new utility method _deepcopy_with_exceptions, which attempts to deep copy an object and safely falls back to the original object if copying fails. Additionally, _deepcopy_with_exceptions skips deep-copying of Component, Tool, and Toolset instances when used as runtime parameters. This prevents errors and unintended behavior caused by trying to deepcopy objects that contain non-copyable attributes (e.g. Jinja2 templates, clients). Previously, standard deepcopy was used on inputs and outputs, which occasionally led to errors since certain Python objects cannot be deep-copied.
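The behavior can be sketched with a small helper (an assumed approximation of _deepcopy_with_exceptions, not its actual code):

```python
import copy
from typing import Any

def deepcopy_with_fallback(value: Any) -> Any:
    # Try to deep copy; if the object is not copyable (e.g. holds clients or
    # Jinja2 templates), fall back to returning the original object.
    try:
        return copy.deepcopy(value)
    except (TypeError, ValueError, NotImplementedError):
        return value

gen = (i for i in range(3))       # generator objects cannot be deep-copied
assert deepcopy_with_fallback(gen) is gen
assert deepcopy_with_fallback({"a": [1, 2]}) == {"a": [1, 2]}
```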

πŸ› Bug Fixes

  • Make internal tool conversion in the HuggingFaceAPICompatibleChatGenerator compatible with huggingface_hub>=0.31.0. In the huggingface_hub library, the arguments attribute of ChatCompletionInputFunctionDefinition has been renamed to parameters. Our implementation is compatible with both the legacy version and the new one.
  • The HuggingFaceAPIChatGenerator now checks the type of the arguments variable in the tool calls returned by the Hugging Face API. If arguments is a JSON string, it is parsed into a dictionary. Previously, the arguments type was not checked, which sometimes led to failures later in the tool workflow.

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.13.1

Release Notes

v2.13.1

Bug Fixes

  • Update the __deepcopy__ of ComponentTool to gracefully handle NotImplementedError when trying to deepcopy attributes.
  • Fix an issue where OpenAIChatGenerator and OpenAIGenerator were not properly handling wrapped streaming responses from tools like Weave.
  • Move deserialize_tools_inplace back to its original import path: from haystack.tools.tool import deserialize_tools_inplace.

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.13.0

⭐️ Highlights

Enhanced Agent Tracing and Async Support

Haystack's Agent got several improvements!

Agent Tracing Agent tracing now provides deeper visibility into the agent's execution. For every call, the inputs and outputs of the ChatGenerator and ToolInvoker are captured and logged using dedicated child spans. This makes it easier to debug, monitor, and analyze how an agent operates step-by-step.

Below is an example of what the trace looks like in Langfuse:

Langfuse UI for tracing

```python
# pip install langfuse-haystack

from haystack_integrations.components.connectors.langfuse.langfuse_connector import LangfuseConnector
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator

tracer = LangfuseConnector("My Haystack Agent")
agent = Agent(
    system_prompt="You help provide the weather for cities",
    chat_generator=OpenAIChatGenerator(),
    tools=[weather_tool],
)
```

Async Support Additionally, there's a new run_async method to enable built-in async support for Agent. Just use run_async instead of the run method. Here's an example of an async web search agent:

```python
# set SERPERDEV_API_KEY and OPENAI_API_KEY as env variables

from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.websearch import SerperDevWebSearch
from haystack.dataclasses import ChatMessage
from haystack.tools.component_tool import ComponentTool

web_tool = ComponentTool(component=SerperDevWebSearch())

web_search_agent = Agent(
    chat_generator=OpenAIChatGenerator(),
    tools=[web_tool],
)

result = await web_search_agent.run_async(
    messages=[ChatMessage.from_user("Find information about Haystack by deepset")]
)
```

New Toolset for Enhanced Tool Management

The new Toolset groups multiple Tool instances into a single manageable unit. It simplifies the passing of tools to components like ChatGenerator, ToolInvoker, or Agent, and supports filtering, serialization, and reuse. Check out the MCPToolset for dynamic tool discovery from an MCP server.

```python
from haystack.tools import Toolset
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator

math_toolset = Toolset([tool_one, tool_two, ...])
agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    tools=math_toolset,
)
```

@super_component decorator and new ready-made SuperComponents

Creating a custom SuperComponent just got even simpler. Now, all you need to do is define a class with a pipeline attribute and decorate it with @super_component. Haystack takes care of the rest!

Here's an example of building a custom HybridRetriever using the @super_component decorator:

```python
# pip install haystack-ai datasets "sentence-transformers>=3.0.0"

from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset

@super_component
class HybridRetriever:
    def __init__(self, document_store: InMemoryDocumentStore, embedder_model: str = "BAAI/bge-small-en-v1.5"):
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")

        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)

        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")

dataset = load_dataset("HaystackBot/medrag-pubmed-chunk-with-embeddings", split="train")
docs = [Document(content=doc["contents"], embedding=doc["embedding"]) for doc in dataset]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

query = "What treatments are available for chronic bronchitis?"
result = HybridRetriever(document_store).run(text=query, query=query)
print(result)
```

New ready-made SuperComponents: MultiFileConverter, DocumentPreprocessor
There are also two ready-made SuperComponents, [MultiFileConverter](https://docs.haystack.deepset.ai/docs/multifileconverter) and [DocumentPreprocessor](https://docs.haystack.deepset.ai/docs/documentpreprocessor), that encapsulate widely used common logic for indexing pipelines.

πŸ“š Learn more about SuperComponents and get the full code example in the Tutorial: Creating Custom SuperComponents

⬆️ Upgrade Notes

  • The deprecated api, api_key, and api_params parameters for LLMEvaluator, ContextRelevanceEvaluator, and FaithfulnessEvaluator have been removed. By default, these components will continue to use OpenAI in JSON mode. To customize the LLM, use the chat_generator parameter with a ChatGenerator instance configured to return a response in JSON format. For example: chat_generator=OpenAIChatGenerator(generation_kwargs={"response_format": {"type": "json_object"}})

  • The deprecated generator_api and generator_api_params initialization parameters of LLMMetadataExtractor and the LLMProvider enum have been removed. Use chat_generator instead to configure the underlying LLM. In order for the component to work, the LLM should be configured to return a JSON object. For example, if using OpenAI, you should initialize the LLMMetadataExtractor with chat_generator=OpenAIChatGenerator(generation_kwargs={"response_format": {"type": "json_object"}})

πŸš€ New Features

  • Add run_async for OpenAITextEmbedder.
  • Add run_async method to HuggingFaceAPIDocumentEmbedder. This method enriches Documents with embeddings. It supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Support custom HTTP client configuration via http_client_kwargs (proxy, SSL) for:
    • AzureOpenAIGenerator, OpenAIGenerator and DALLEImageGenerator
    • OpenAIDocumentEmbedder and OpenAITextEmbedder
    • RemoteWhisperTranscriber
  • OpenAIChatGenerator and AzureOpenAIChatGenerator now support custom HTTP client config via http_client_kwargs, enabling proxy and SSL setup.
  • Introduced the Toolset class, allowing for the grouping and management of related tool functionalities. This new abstraction supports dynamic tool loading and registration.
  • We have added internal tracing support to Agent. It is now possible to track the internal loops within the agent by viewing the inputs and outputs each time the ChatGenerator and ToolInvoker are called.
  • The HuggingFaceAPITextEmbedder now also supports asynchronous execution via a run_async method.
  • Add a run_async method to the Agent, which calls the run_async of the underlying ChatGenerator if available.
  • SuperComponents now support mapping non-leaf pipeline outputs to the SuperComponent's output when specifying them in output_mapping.
  • AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder now support custom HTTP client config via http_client_kwargs, enabling proxy and SSL setup.
  • The AzureOpenAIDocumentEmbedder component now inherits from the OpenAIDocumentEmbedder component, enabling asynchronous usage.
  • The AzureOpenAITextEmbedder component now inherits from the OpenAITextEmbedder component, enabling asynchronous usage.
  • Added async support to the OpenAIDocumentEmbedder component.
  • Agent now supports a List of Tools or a Toolset as input.

⚑️ Enhancement Notes

  • Added component_name and component_type attributes to PipelineRuntimeError.
    • Moved error message creation to within PipelineRuntimeError
    • Created a new subclass of PipelineRuntimeError called PipelineComponentsBlockedError for the specific case where the pipeline cannot run since no components are unblocked.
  • The ChatGenerator Protocol no longer requires to_dict and from_dict methods.
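Under the relaxed Protocol, any class with a compatible run method qualifies structurally; a minimal sketch (the Protocol below is illustrative, not Haystack's exact definition):

```python
from typing import Any, Protocol

class ChatGeneratorLike(Protocol):
    # Only run() is required; to_dict/from_dict are no longer part of the contract.
    def run(self, messages: list) -> dict: ...

class EchoGenerator:
    # No to_dict/from_dict, yet this still satisfies ChatGeneratorLike.
    def run(self, messages: list) -> dict:
        return {"replies": messages[-1:]}

def generate(gen: ChatGeneratorLike, messages: list) -> dict:
    return gen.run(messages)

print(generate(EchoGenerator(), ["hi"]))  # {'replies': ['hi']}
```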

⚠️ Deprecation Notes

  • The utility function deserialize_tools_inplace has been deprecated and will be removed in Haystack 2.14.0. Use deserialize_tools_or_toolset_inplace instead.

πŸ› Bug Fixes

  • OpenAITextEmbedder no longer replaces newlines with spaces in the text to embed. This was only required for the discontinued v1 embedding models.
  • OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder no longer replace newlines with spaces in the text to embed. This was only required for the discontinued v1 embedding models.
  • Fix ChatMessage.from_dict to handle cases where optional fields like name and meta are missing.
  • Make Document's first-level fields take precedence over meta fields when flattening the dictionary representation.
  • In Agent we make sure state_schema is always initialized to contain 'messages'. Previously this only happened at run time, which is why pipeline.connect failed: output types are set at init time. Now the Agent correctly sets everything in state_schema (including messages by default) at init time. Also, when you call an Agent without tools, it acts like a ChatGenerator, meaning it returns a ChatMessage based on the user input.
  • In AsyncPipeline, the span tag name is updated from hasytack.component.outputs to haystack.component.output. This matches the tag name used in Pipeline and is the tag name expected by our tracers.
  • The batch_size parameter has now been added to the to_dict method of TransformersSimilarityRanker. This means serialization of batch_size now works as expected.
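Why the missing parameter mattered can be sketched with a toy component (illustrative class, not TransformersSimilarityRanker itself): a parameter omitted from to_dict silently resets to its default on round-trip.

```python
class ToyRanker:
    def __init__(self, batch_size: int = 16):
        self.batch_size = batch_size

    def to_dict(self) -> dict:
        # After the fix, batch_size is included in the serialized parameters.
        return {"type": "ToyRanker", "init_parameters": {"batch_size": self.batch_size}}

    @classmethod
    def from_dict(cls, data: dict) -> "ToyRanker":
        return cls(**data["init_parameters"])

restored = ToyRanker.from_dict(ToyRanker(batch_size=64).to_dict())
print(restored.batch_size)  # 64
```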

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.13.0-rc2

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.13.0-rc1

⭐️ Highlights

Enhanced Agent Tracing and Async Support

Haystack's Agent got several improvements!

Agent Tracing Agent tracing now provides deeper visibility into the agent's execution. For every call, the inputs and outputs of the ChatGenerator and ToolInvoker are captured and logged using dedicated child spans. This makes it easier to debug, monitor, and analyze how an agent operates step-by-step.

Below is an example of what the trace looks like in Langfuse:

Langfuse UI for tracing

TODO: HOW DO I ENABLE AGENT TRACING? ADD A CODE SNIPPET

Async Support Additionally, there's a new run_async method to enable built-in async support for Agent. Just use run_async instead of the run method. Here's an example of an async web search agent:

```python
# set SERPERDEV_API_KEY and OPENAI_API_KEY as env variables

from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.websearch import SerperDevWebSearch
from haystack.dataclasses import ChatMessage
from haystack.tools.component_tool import ComponentTool

web_tool = ComponentTool(component=SerperDevWebSearch())

web_search_agent = Agent(
    chat_generator=OpenAIChatGenerator(),
    tools=[web_tool],
)

result = await web_search_agent.run_async(
    messages=[ChatMessage.from_user("Find information about Haystack by deepset")]
)
```

New Toolset for Enhanced Tool Management

The new Toolset groups multiple Tool instances into a single manageable unit. It simplifies the passing of tools to components like ChatGenerator, ToolInvoker, or Agent, and supports filtering, serialization, and reuse. Check out the MCPToolset for dynamic tool discovery from an MCP server.

```python
from haystack.tools import Toolset
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator

math_toolset = Toolset([tool_one, tool_two, ...])
agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    tools=math_toolset,
)
```

@super_component decorator and new ready-made SuperComponents

Creating a custom SuperComponent just got even simpler. Now, all you need to do is define a class with a pipeline attribute and decorate it with @super_component. Haystack takes care of the rest!

Here's an example of building a custom HybridRetriever using the @super_component decorator:

```python
# pip install haystack-ai datasets "sentence-transformers>=3.0.0"

from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset

@super_component
class HybridRetriever:
    def __init__(self, document_store: InMemoryDocumentStore, embedder_model: str = "BAAI/bge-small-en-v1.5"):
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")

        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)

        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")

dataset = load_dataset("HaystackBot/medrag-pubmed-chunk-with-embeddings", split="train")
docs = [Document(content=doc["contents"], embedding=doc["embedding"]) for doc in dataset]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

query = "What treatments are available for chronic bronchitis?"
result = HybridRetriever(document_store).run(text=query, query=query)
print(result)
```

New ready-made SuperComponents: MultiFileConverter, DocumentPreprocessor
There are also two ready-made SuperComponents, [MultiFileConverter](https://docs.haystack.deepset.ai/reference/converters-api#multifileconverter) and [DocumentPreprocessor](https://docs.haystack.deepset.ai/reference/preprocessors-api#documentpreprocessor), that encapsulate widely used common logic for indexing pipelines.

πŸ“š Learn more about SuperComponents and get the full code example in the Tutorial: Creating Custom SuperComponents

⬆️ Upgrade Notes

  • The deprecated api, api_key, and api_params parameters for LLMEvaluator, ContextRelevanceEvaluator, and FaithfulnessEvaluator have been removed.

    By default, these components will continue to use OpenAI in JSON mode.

    To customize the LLM, use the chat_generator parameter with a ChatGenerator instance configured to return a response in JSON format. For example: chat_generator=OpenAIChatGenerator(generation_kwargs={"response_format": {"type": "json_object"}}).

  • The deprecated generator_api and generator_api_params initialization parameters of LLMMetadataExtractor and the LLMProvider enum have been removed. Use chat_generator instead to configure the underlying LLM. In order for the component to work, the LLM should be configured to return a JSON object. For example, if using OpenAI, you should initialize the LLMMetadataExtractor with chat_generator=OpenAIChatGenerator(generation_kwargs={"response_format": {"type": "json_object"}}).

πŸš€ New Features

  • Add run_async for OpenAITextEmbedder.
  • Add run_async method to HuggingFaceAPIDocumentEmbedder. This method enriches Documents with embeddings. It supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Support custom HTTP client configuration via http_client_kwargs (proxy, SSL) for:
    • AzureOpenAIGenerator, OpenAIGenerator and DALLEImageGenerator
    • OpenAIDocumentEmbedder and OpenAITextEmbedder
    • RemoteWhisperTranscriber
  • OpenAIChatGenerator and AzureOpenAIChatGenerator now support custom HTTP client config via http_client_kwargs, enabling proxy and SSL setup.
  • Introduced the Toolset class, allowing for the grouping and management of related tool functionalities. This new abstraction supports dynamic tool loading and registration.
  • We have added internal tracing support to Agent. It is now possible to track the internal loops within the agent by viewing the inputs and outputs each time the ChatGenerator and ToolInvoker are called.
  • The HuggingFaceAPITextEmbedder now also supports asynchronous execution via a run_async method.
  • Add a run_async method to the Agent, which calls the run_async of the underlying ChatGenerator if available.
  • SuperComponents now support mapping non-leaf pipeline outputs to the SuperComponent's output when specifying them in output_mapping.
  • AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder now support custom HTTP client config via http_client_kwargs, enabling proxy and SSL setup.
  • The AzureOpenAIDocumentEmbedder component now inherits from the OpenAIDocumentEmbedder component, enabling asynchronous usage.
  • The AzureOpenAITextEmbedder component now inherits from the OpenAITextEmbedder component, enabling asynchronous usage.
  • Added async support to the OpenAIDocumentEmbedder component.
  • Created a super_component decorator (from haystack import super_component) that directly converts your class into a SuperComponent. This is an alternative to inheriting from SuperComponent.
  • Add support for initializing chat generators with a Toolset, allowing for more flexible tool management. The tools parameter can now accept either a list of Tool objects or a Toolset instance.

⚑️ Enhancement Notes

  • Added component_name and component_type attributes to PipelineRuntimeError.
    • Moved error message creation to within PipelineRuntimeError
    • Created a new subclass of PipelineRuntimeError called PipelineComponentsBlockedError for the specific case where the pipeline cannot run since no components are unblocked.
  • The ChatGenerator Protocol no longer requires to_dict and from_dict methods.

⚠️ Deprecation Notes

  • The utility function deserialize_tools_inplace has been deprecated and will be removed in Haystack 2.14.0. Use deserialize_tools_or_toolset_inplace instead.

πŸ› Bug Fixes

  • OpenAITextEmbedder no longer replaces newlines with spaces in the text to embed. This was only required for the discontinued v1 embedding models.
  • OpenAIDocumentEmbedder and AzureOpenAIDocumentEmbedder no longer replace newlines with spaces in the text to embed. This was only required for the discontinued v1 embedding models.
  • Fix ChatMessage.from_dict to handle cases where optional fields like name and meta are missing.
  • Make Document's first-level fields take precedence over meta fields when flattening the dictionary representation.
  • In Agent we make sure state_schema is always initialized to contain 'messages'. Previously this only happened at run time, which is why pipeline.connect failed: output types are set at init time. Now the Agent correctly sets everything in state_schema (including messages by default) at init time.
  • Now, when you call an Agent with no tools, it acts like a ChatGenerator, which means it returns a ChatMessage based on the user input.
  • In AsyncPipeline, the span tag name is updated from hasytack.component.outputs to haystack.component.output. This matches the tag name used in Pipeline and is the tag name expected by our tracers.
  • The batch_size parameter has now been added to the to_dict method of TransformersSimilarityRanker. This means serialization of batch_size now works as expected.

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.12.2

πŸ› Bug Fixes

  • Fix ChatMessage.from_dict to handle cases where optional fields like name and meta are missing.
  • Make Document's first-level fields take precedence over meta fields when flattening the dictionary representation.

- Python
Published by github-actions[bot] 10 months ago

farm-haystack - v2.12.1

πŸ› Bug Fixes

  • In Agent we make sure state_schema is always initialized to contain 'messages'. Previously this only happened at run time, which is why pipeline.connect failed: output types are set at init time. Now the Agent correctly sets everything in state_schema (including messages by default) at init time.
  • In AsyncPipeline, the span tag name is updated from hasytack.component.outputs to haystack.component.output. This matches the tag name used in Pipeline and is the tag name expected by our tracers.

- Python
Published by github-actions[bot] 11 months ago

farm-haystack - v2.12.0

⭐️ Highlights

Agent Component with State Management

The Agent component enables tool-calling functionality with provider-agnostic chat model support and can be used as a standalone component or within a pipeline. With SERPERDEV_API_KEY and OPENAI_API_KEY defined, a Web Search Agent is as simple as:

```python
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.websearch import SerperDevWebSearch
from haystack.dataclasses import ChatMessage
from haystack.tools.component_tool import ComponentTool

web_tool = ComponentTool(
    component=SerperDevWebSearch(),
)

agent = Agent(
    chat_generator=OpenAIChatGenerator(),
    tools=[web_tool],
)

result = agent.run(
    messages=[ChatMessage.from_user("Find information about Haystack by deepset")]
)
```

The Agent supports streaming responses, customizable exit conditions, and a flexible state management system that enables tools to share and modify data during execution:

```python
agent = Agent(
    chat_generator=OpenAIChatGenerator(),
    tools=[web_tool, weather_tool],
    exit_conditions=["text", "weather_tool"],
    state_schema={...},
    streaming_callback=streaming_callback,
)
```

SuperComponent for Reusable Pipelines

SuperComponent allows you to wrap complex pipelines into reusable components. This makes it easy to reuse them across your applications. Just initialize a SuperComponent with a pipeline:

```python
from haystack import Pipeline, SuperComponent

with open("pipeline.yaml", "r") as file:
    pipeline = Pipeline.load(file)

super_component = SuperComponent(pipeline)
```

That's not all! To show the benefits, there are three ready-made SuperComponents in haystack-experimental. For example, there is a MultiFileConverter that wraps a pipeline with converters for CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX. After installing the integration dependencies pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate, you can run it with any of the supported file types as input:

```python
from haystack_experimental.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
```

Here's an example of creating a custom SuperComponent from any Haystack pipeline:

```python
from haystack import Pipeline, SuperComponent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.dataclasses.chat_message import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document

document_store = InMemoryDocumentStore()
documents = [
    Document(content="Paris is the capital of France."),
    Document(content="London is the capital of England."),
]
document_store.write_documents(documents)

prompt_template = [
    ChatMessage.from_user(
        '''
        According to the following documents:
        {% for document in documents %}
        {{document.content}}
        {% endfor %}
        Answer the given question: {{query}}
        Answer:
        '''
    )
]
prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables="*")

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("llm", OpenAIChatGenerator())
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.messages")

# Create a super component with simplified input/output mapping
wrapper = SuperComponent(
    pipeline=pipeline,
    input_mapping={
        "query": ["retriever.query", "prompt_builder.query"],
    },
    output_mapping={"llm.replies": "replies"},
)

# Run the pipeline with simplified interface
result = wrapper.run(query="What is the capital of France?")
print(result)
# {'replies': [ChatMessage(_role=..., _content=[TextContent(text='The capital of France is Paris.')], ...)]}
```

⬆️ Upgrade Notes

  • Updated ChatMessage serialization and deserialization. ChatMessage.to_dict() now returns a dictionary with the keys: role, content, meta, and name. ChatMessage.from_dict() supports this format and maintains compatibility with older formats.

    If your application consumes the result of ChatMessage.to_dict(), update your code to handle the new format. No changes are needed if you're using ChatPromptBuilder in a Pipeline.

  • LLMEvaluator, ContextRelevanceEvaluator, and FaithfulnessEvaluator now internally use a ChatGenerator instance instead of a Generator instance. The public attribute generator has been replaced with _chat_generator.

  • to_pandas, comparative_individual_scores_report, and score_report were removed from EvaluationRunResult; use detailed_report, comparative_detailed_report, and aggregated_report instead.
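If your code consumes the result of ChatMessage.to_dict() (see the upgrade note above), a stdlib-only sketch of handling both the new and the old serialization format might look like this. The exact dict shapes here are assumptions for illustration, not the canonical Haystack schema:

```python
def message_text(message_dict: dict) -> str:
    """Extract the text of a serialized ChatMessage, tolerating old and new formats.

    Assumed new format: {"role": ..., "content": [{"text": ...}, ...], "meta": ..., "name": ...}
    Assumed old format: {"role": ..., "content": "plain text", ...}
    """
    content = message_dict["content"]
    if isinstance(content, str):  # old flat format
        return content
    for part in content:  # new format: a list of content parts
        if "text" in part:
            return part["text"]
    raise ValueError("No textual content found")

new_msg = {"role": "user", "content": [{"text": "Hello"}], "meta": {}, "name": None}
old_msg = {"role": "user", "content": "Hello"}
assert message_text(new_msg) == message_text(old_msg) == "Hello"
```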

πŸš€ New Features

  • Treat bare types (e.g., List, Dict) as generic types with Any arguments during type compatibility checks.
  • Add compatibility for Callable types.
  • Adds outputs_to_string to Tool and ComponentTool to allow users to customize how the output of a Tool should be converted into a string so that it can be provided back to the ChatGenerator in a ChatMessage. If outputs_to_string is not provided, a default converter is used within ToolInvoker. The default handler uses the current default behavior.
  • Added a new parameter split_mode to the CSVDocumentSplitter component to control the splitting mode. The new parameter can be set to row-wise to split the CSV file by rows. The default value is threshold, which is the previous behavior.
  • Added AutoMergingRetriever, a new retrieval component which, together with the HierarchicalDocumentSplitter, implements an auto-merging retrieval technique.
  • Add run_async method to HuggingFaceLocalChatGenerator. This method internally uses ThreadPoolExecutor to return coroutines that can be awaited.
  • Introduced asynchronous functionality and HTTP/2 support in the LinkContentFetcher component, thus improving content fetching in several aspects.
  • The DOCXToDocument component now has the option to include extracted hyperlink addresses in the output Documents. It accepts a link_format parameter that can be set to "markdown" or "plain". By default, no hyperlink addresses are extracted as before.
  • Added a new parameter azure_ad_token_provider to all Azure OpenAI components: AzureOpenAIGenerator, AzureOpenAIChatGenerator, AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder. This parameter optionally accepts a callable that returns a bearer token, enabling authentication via Azure AD.
    • Introduced the utility function default_azure_token_provider in haystack/utils/azure.py. This function provides a default token provider that is serializable by Haystack. Users can now pass default_azure_token_provider as the azure_ad_token_provider or implement a custom token provider.
  • Users can now work with date and time in the ChatPromptBuilder. In the same way as the PromptBuilder, the ChatPromptBuilder now supports arrow to work with datetime.
  • Introduce new State dataclass with a customizable schema for managing Agent state. Enhance error logging of Tool and extend the ToolInvoker component to work with newly introduced State.
  • The RecursiveDocumentSplitter now supports splitting by number of tokens. Setting split_unit to "token" will use a hard-coded tiktoken tokenizer (o200k_base) and requires having tiktoken installed.
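The outputs_to_string idea above can be illustrated without Haystack: a callable turns a tool's structured result into the string handed back to the LLM, with a JSON fallback when none is given. The names invoke_tool and default_to_string are illustrative, not the actual Tool/ToolInvoker API:

```python
import json

def default_to_string(result) -> str:
    """Fallback conversion: pass strings through, JSON-serialize anything else."""
    return result if isinstance(result, str) else json.dumps(result)

def invoke_tool(tool_fn, args: dict, outputs_to_string=None) -> str:
    """Run a tool and convert its result into the string returned to the LLM."""
    raw = tool_fn(**args)
    converter = outputs_to_string or default_to_string
    return converter(raw)

def weather(city: str) -> dict:
    return {"city": city, "temp_c": 20}

print(invoke_tool(weather, {"city": "Berlin"}))  # {"city": "Berlin", "temp_c": 20}
# A custom converter overrides the default JSON serialization:
print(invoke_tool(weather, {"city": "Berlin"},
                  outputs_to_string=lambda r: f"{r['temp_c']} degrees in {r['city']}"))
```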

⚑️ Enhancement Notes

  • LLMEvaluator, ContextRelevanceEvaluator, and FaithfulnessEvaluator now accept a chat_generator initialization parameter, consisting of a ChatGenerator instance pre-configured to return a JSON object. Previously, these components only supported OpenAI and LLMs with OpenAI-compatible APIs. Regardless of whether the evaluator components are initialized with api, api_key, and api_params or the new chat_generator parameter, the serialization format will now only include chat_generator, in preparation for the future removal of api, api_key, and api_params.
  • Improved error handling for component run failures by raising a runtime error that includes the component's name and type.
  • When using Haystack's Agent, the messages are stored and accumulated in State. This means:
    • State is required to have a "messages" type and handler defined in its schema. If not provided, a default type and handler are used. Users can now customize how messages are accumulated by providing a custom handler for messages through the State schema.
  • Added PDFMinerToDocument functionality to detect and report undecoded CID characters in PDF text extraction, helping users identify potential text extraction quality issues when processing PDFs with non-standard fonts.
  • The Agent component allows defining multiple exit conditions instead of a single condition. The init parameter has been renamed from exit_condition to exit_conditions to reflect that.
  • Introduce a ChatGenerator Protocol to qualify ChatGenerator components from a static type-checking perspective. It defines the minimal interface that Chat Generators must implement. This will especially help to standardize the integration of Chat Generators into other more complex components.
  • In Agent, we check all messages from the LLM when doing an exit condition check. For example, it's possible the LLM returns multiple messages, such as multiple tool calls, or includes messages with reasoning. Now we check all messages before assessing if we should exit the loop.
  • The Agent component checks whether the ChatGenerator it is initialized with supports tools. If it doesn't, the Agent raises a TypeError.
  • Updated SentenceTransformersDiversityRanker to use the token parameter internally instead of the deprecated use_auth_token. The public API of this component already utilizes token.
  • Simplified the serialization code for better readability and maintainability.
    • Updated deserialization to allow users to omit the typing. prefix for standard typing library types (e.g., List[str] instead of typing.List[str]).
  • Refactored the processing of streaming chunks from OpenAI to simplify logic.
    • Added tests to ensure expected behavior when handling streaming chunks when using include_usage=True.
  • Updated the docstrings of BranchJoiner to be more understandable and better highlight where it's useful.
  • Consolidated the use of the select_streaming_callback utility in OpenAI and Azure ChatGenerators, which checks the compatibility of streaming_callback with the async or non-async run method.
  • Added a warning to ChatPromptBuilder and PromptBuilder when prompt variables are present and required_variables is unset to help users avoid unexpected execution in multi-branch pipelines. The warning recommends users to set required_variables.
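The ChatGenerator Protocol mentioned above can be illustrated with stdlib typing. The method set shown here is an assumption for illustration, not the exact Haystack Protocol definition:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class ChatGenerator(Protocol):
    """Minimal structural interface a Chat Generator is expected to satisfy (assumed)."""

    def run(self, messages: list, **kwargs: Any) -> dict: ...

class EchoChatGenerator:
    # No inheritance needed: providing a matching run method is enough.
    def run(self, messages: list, **kwargs: Any) -> dict:
        return {"replies": [f"echo: {messages[-1]}"]}

print(isinstance(EchoChatGenerator(), ChatGenerator))  # True
```

Because the Protocol is structural, static type checkers accept any component with a conforming run method, which is what lets more complex components declare a ChatGenerator dependency without coupling to a concrete class.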

⚠️ Deprecation Notes

  • The api, api_key, and api_params parameters for LLMEvaluator, ContextRelevanceEvaluator, and FaithfulnessEvaluator are now deprecated and will be removed in Haystack 2.13.0. By default, these components will continue to use OpenAI in JSON mode. To configure a specific LLM, use the chat_generator parameter.
  • The generator_api and generator_api_params initialization parameters of LLMMetadataExtractor and the LLMProvider enum are deprecated and will be removed in Haystack 2.13.0. Use chat_generator instead to configure the underlying LLM. For example, change generator_api=LLMProvider.OPENAI to chat_generator=OpenAIChatGenerator().

πŸ› Bug Fixes

  • Add dataframe to legacy fields for the Document dataclass. This fixes a bug where Document.from_dict() in haystack-ai>=2.11.0 could not properly deserialize a Document dictionary obtained with document.to_dict(flatten=False) in haystack-ai<=2.10.0.
  • In DALLEImageGenerator, ensure that the max_retries initialization parameter is correctly set when it is equal to 0.
  • Fixed an index error in the logging module when arbitrary strings are logged.
  • Ensure that the max_retries initialization parameter is correctly set when equal to 0 in AzureOpenAIGenerator, AzureOpenAIChatGenerator, AzureOpenAITextEmbedder, and AzureOpenAIDocumentEmbedder.
  • Improved serialization and deserialization in haystack/utils/type_serialization.py to handle Optional types correctly.
  • Replace lazy imports with eager imports in haystack/__init__.py to avoid potential static type checking issues and simplify maintenance.
  • Fix an issue that prevented Jinja2-based ComponentTools from being passed into pipelines at runtime.
  • Improved type hinting for the component.output_types decorator. The type hinting for the decorator was originally introduced to avoid overshadowing the type hinting of the run method and allow proper static type checking. This update extends support to asynchronous run_async methods.
  • Fixed issue with MistralChatGenerator not returning a finish_reason when using streaming. Fixed by adjusting how we look for the finish_reason when processing streaming chunks. Now, the last non-None finish_reason is used to handle differences between OpenAI and Mistral.

- Python
Published by github-actions[bot] 11 months ago

farm-haystack - v2.12.0-rc1

- Python
Published by github-actions[bot] 11 months ago

farm-haystack - v2.11.2

Release Notes

v2.11.2

Enhancement Notes

  • Refactored the processing of streaming chunks from OpenAI to simplify logic.
  • Added tests to ensure expected behavior when handling streaming chunks when using include_usage=True.

Bug Fixes

  • Fixed issue with MistralChatGenerator not returning a finish_reason when using streaming. Fixed by adjusting how we look for the finish_reason when processing streaming chunks. Now, the last non-None finish_reason is used to handle differences between OpenAI and Mistral.

- Python
Published by github-actions[bot] 11 months ago

farm-haystack - v2.11.1

Release Notes

v2.11.1

Bug Fixes

  • Add dataframe to legacy fields for the Document dataclass. This fixes a bug where Document.from_dict() in haystack-ai>=2.11.0 could not properly deserialize a Document dictionary obtained with document.to_dict(flatten=False) in haystack-ai<=2.10.0.

- Python
Published by github-actions[bot] 12 months ago

farm-haystack - v2.11.1-rc1

- Python
Published by github-actions[bot] 12 months ago

farm-haystack - v2.11.0

⭐️ Highlights

Faster Imports

With lazy importing, importing individual components now requires 50% less CPU time on average. Overall import performance has also significantly improved: for example, import haystack now consumes only 2-5% of the CPU time it previously did.

Extended Async Run Support

As of this release, all chat generators and retrievers in the core package now include a run_async method, enabling asynchronous execution at the component level. When used in an AsyncPipeline, this method runs automatically, providing native async capabilities.
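For components without a native async client, a common pattern (used for example by HuggingFaceLocalChatGenerator, per its release note) is to offload the blocking run to a thread pool. A stdlib-only sketch with illustrative names, not the actual component code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class SlowComponent:
    """Stand-in for a component whose run() does blocking work (illustrative)."""

    _executor = ThreadPoolExecutor(max_workers=1)

    def run(self, text: str) -> dict:
        # imagine this blocks on model inference
        return {"result": text.upper()}

    async def run_async(self, text: str) -> dict:
        # delegate the blocking call to a worker thread so the event loop stays free
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.run, text)

print(asyncio.run(SlowComponent().run_async("hello")))  # {'result': 'HELLO'}
```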

AsyncPipeline vs Pipeline

New MSGToDocument Component

Use MSGToDocument to convert Microsoft Outlook .msg files into Haystack documents. This component extracts the email metadata (such as sender, recipients, CC, BCC, subject) and body content and converts any file attachments into ByteStream objects.

Turn off Validation for Pipeline Connections

Set connection_type_validation to False when initializing Pipeline to disable type validation for pipeline connections. This allows you to connect any edges and bypass errors you might get, for example, when you connect an Optional[str] output to a str input.
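A rough stdlib sketch of the idea: a strict connection check rejects Optional[str] -> str because the None case does not fit, and a flag turns the check off. This is a simplification for illustration, not Haystack's actual type-compatibility logic:

```python
from typing import Optional, Union, get_args, get_origin

def types_compatible(sender, receiver) -> bool:
    """Very rough sketch of a connection type check (not Haystack's real logic)."""
    if get_origin(sender) is Union:  # e.g. Optional[str] is Union[str, None]
        return all(types_compatible(arg, receiver) for arg in get_args(sender))
    return sender is receiver

def connect(sender_type, receiver_type, type_validation: bool = True) -> str:
    if type_validation and not types_compatible(sender_type, receiver_type):
        raise TypeError(f"Cannot connect {sender_type} to {receiver_type}")
    return "connected"

# Strict mode rejects Optional[str] -> str...
try:
    connect(Optional[str], str)
except TypeError as err:
    print(err)
# ...while disabling validation lets the connection through.
print(connect(Optional[str], str, type_validation=False))  # connected
```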

⬆️ Upgrade Notes

  • The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass, deprecated in Haystack 2.10.0, have now been removed. pandas is no longer a required dependency for Haystack, making the installation lighter. If a component you use requires pandas, an informative error will be raised, prompting you to install it. For details and motivation, see the GitHub discussion #8688.

  • Starting from Haystack 2.11.0, Python 3.8 is no longer supported. Python 3.8 reached its end of life in October 2024.

  • The AzureOCRDocumentConverter no longer produces Document objects with the deprecated dataframe field.

    Am I affected?

    • If your workflow relies on the dataframe field in Document objects generated by AzureOCRDocumentConverter, you are affected.
    • If you saw a DeprecationWarning in Haystack 2.10 when initializing a Document with a dataframe, this change will now remove that field entirely.

    How to handle the change:

    • Instead of storing detected tables as a dataframe, AzureOCRDocumentConverter now represents tables as CSV-formatted text in the content field of the Document.
    • Update your processing logic to handle CSV-formatted tables instead of a dataframe. If needed, you can convert the CSV text back into a dataframe using pandas.read_csv().
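For example, a table that previously arrived as a dataframe now arrives as CSV text in Document.content and can be parsed with the stdlib csv module (or with pandas.read_csv if pandas is installed). The content below is a made-up example:

```python
import csv
import io

# Assumed shape for illustration: a Document whose content holds a CSV-formatted table.
doc_content = "name,capital\nFrance,Paris\nEngland,London\n"

rows = list(csv.reader(io.StringIO(doc_content)))
header, data = rows[0], rows[1:]
print(header)  # ['name', 'capital']
print(data)    # [['France', 'Paris'], ['England', 'London']]
# With pandas installed, the same text can be loaded via pandas.read_csv(io.StringIO(doc_content)).
```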

πŸš€ New Features

  • Add a new MSGToDocument component to convert .msg files into Haystack Document objects.
    • Extracts email metadata (e.g. sender, recipients, CC, BCC, subject) and body content into a Document.
    • Converts attachments into ByteStream objects which can be passed onto a FileTypeRouter + relevant converters.
  • We've introduced a new type_validation parameter to control type compatibility checks in pipeline connections. It can be set to True (default) or False; when False, no type checks are performed and any connection is allowed.
  • Add run_async method to HuggingFaceAPIChatGenerator. This method relies internally on the AsyncInferenceClient from huggingface_hub to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Add run_async method to OpenAIChatGenerator. This method internally uses the async version of the OpenAI client to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • The InMemoryDocumentStore and the associated InMemoryBM25Retriever and InMemoryEmbeddingRetriever retrievers now support async mode.
  • Add run_async method to DocumentWriter. This method supports the same parameters as the run method and relies on the DocumentStore to implement write_documents_async. It returns a coroutine that can be awaited.
  • Add run_async method to AzureOpenAIChatGenerator. This method uses AsyncAzureOpenAI to generate chat completions and supports the same parameters as the run method. It returns a coroutine that can be awaited.
  • Sentence Transformers components now support ONNX and OpenVINO backends through the "backend" parameter. Supported backends are torch (default), onnx, and openvino. Refer to the Sentence Transformers documentation for more information.
  • Add run_async method to HuggingFaceLocalChatGenerator. This method internally uses ThreadPoolExecutor to return coroutines that can be awaited.

⚑️ Enhancement Notes

  • Improved AzureDocumentEmbedder to handle embedding generation failures gracefully. Errors are logged, and processing continues with the remaining batches.
  • In the FileTypeRouter add explicit support for classifying .msg files with mimetype "application/vnd.ms-outlook" since the mimetypes module returns None for .msg files by default.
  • Added the store_full_path init variable to XLSXToDocument to allow users to toggle whether to store the full path of the source file in the meta of the Document. This is set to False by default to increase privacy.
  • Increased default timeout for Mermaid server to 30 seconds. Mermaid server is used to draw Pipelines. Exposed the timeout as a parameter for the Pipeline.show and Pipeline.draw methods. This allows users to customize the timeout as needed.
  • Optimize import times through extensive use of lazy imports across packages. Importing one component of a certain package no longer leads to importing all components of the same package. For example, importing OpenAIChatGenerator no longer imports AzureOpenAIChatGenerator.
  • Haystack now officially supports Python 3.13. Some components and integrations may not yet be compatible. Specifically, the NamedEntityExtractor does not work with Python 3.13 when using the spacy backend. Additionally, you may encounter issues installing openai-whisper, which is required by the LocalWhisperTranscriber component, if you use uv or poetry for installation. In this case, we recommend using pip for installation.
  • EvaluationRunResult can now output the results in JSON, a pandas Dataframe or in a CSV file.
  • Updated ListJoiner so that list_type no longer needs to be passed. By default it uses type List, which acts like List[Any].
    • This allows the ListJoiner to combine any incoming lists into a single flattened list.
    • Users can still pass list_type if they would like to have stricter type validation in their pipelines.
  • Added PDFMinerToDocument functionality to detect and report undecoded CID characters in PDF text extraction, helping users identify potential text extraction quality issues when processing PDFs with non-standard fonts.
  • Simplified the serialization code for better readability and maintainability.
    • Updated deserialization to allow users to omit the typing. prefix for standard typing library types (e.g., List[str] instead of typing.List[str]).
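The default ListJoiner behavior described above (combining any incoming lists into a single flattened list) amounts to a stdlib one-liner; join_lists is an illustrative name, not the component's API:

```python
from itertools import chain
from typing import Any, List

def join_lists(*lists: List[Any]) -> List[Any]:
    """Flatten any incoming lists into a single list (ListJoiner's default behavior)."""
    return list(chain.from_iterable(lists))

print(join_lists([1, 2], ["a"], [3.0]))  # [1, 2, 'a', 3.0]
```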

⚠️ Deprecation Notes

  • The use of pandas DataFrame in EvaluationRunResult is now optional, and the methods score_report, to_pandas, and comparative_individual_scores_report are deprecated and will be removed in the next Haystack release.

πŸ› Bug Fixes

  • In the ChatMessage.to_openai_dict_format utility method, include the name field in the returned dictionary, if present. Previously, the name field was erroneously skipped.
  • Pipelines with components that return plain pandas dataframes failed. The comparison of socket values is now 'is not' instead of '!=' to avoid errors with dataframes.
  • Make sure that OpenAIChatGenerator sets additionalProperties: False in the tool schema when tool_strict is set to True.
  • Fix a bug where the output_type of a ConditionalRouter was not being serialized correctly. This would cause the router to work incorrectly after being serialized and deserialized.
  • Fixed accumulation of a tool's arguments when streaming with an OpenAIChatGenerator.
  • Added a fix to the pipeline's component scheduling algorithm to reduce edge cases where the execution order of components that are simultaneously waiting for inputs has an impact on a pipeline's output. We look at topological order first to see which of the waiting components should run first and fall back to lexicographical order when both components are on the same topology level. In cyclic pipelines, if the waiting components are in the same cycle, we fall back to lexicographical order immediately.
  • Fixes serialization of typing.Any when using the serialize_type utility.
  • Fixes an edge case in the pipeline-run logic where an existing input could be overwritten if the same component connects to the socket from multiple output sockets.
  • ComponentTool does not truncate description anymore.
  • Updates import paths for type hints to get ddtrace 3.0.0 working with our Datadog tracer.
  • Improved serialization and deserialization in haystack/utils/type_serialization.py to handle Optional types correctly.

- Python
Published by github-actions[bot] 12 months ago

farm-haystack - v2.11.0-rc3

- Python
Published by github-actions[bot] 12 months ago

farm-haystack - v2.11.0-rc2

- Python
Published by github-actions[bot] 12 months ago

farm-haystack - v2.10.3

Release Notes

v2.10.3

Bug Fixes

  • Fixed accumulation of a tool's arguments when streaming with an OpenAIChatGenerator.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.3-rc1

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.2

Release Notes

v2.10.2

Bug Fixes

  • Pipelines with components that return plain pandas dataframes failed. The comparison of socket values is now 'is not' instead of '!=' to avoid errors with dataframes.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.2-rc1

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.1

Release Notes

v2.10.1

Bug Fixes

  • ComponentTool does not truncate 'description' anymore.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.1-rc1

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.0

⭐️ Highlights

Improved Pipeline.run() Logic

The new Pipeline.run() logic fixes common pipeline issues, including exceptions, incorrect component execution, missing intermediate outputs, and premature execution of lazy variadic components. While most pipelines should remain unaffected, we recommend carefully reviewing your pipeline executions if you are using cyclic pipelines or pipelines with lazy variadic components to ensure their behavior has not changed. You can use this tool to compare the execution traces of your pipeline with the old and new logic.

AsyncPipeline for Async Execution

Together with the new Pipeline.run logic, AsyncPipeline enables asynchronous execution, allowing pipeline components to run concurrently whenever possible. This leads to significant speed improvements, especially for pipelines processing data in parallel branches, such as a hybrid retrieval setup.

AsyncPipeline vs Pipeline

Source Codes

**Hybrid Retrieval**

```python
import asyncio

# Imports and the document_joiner component were missing from the original snippet;
# DocumentJoiner is added here so the connections below resolve. document_store and
# query are assumed to be defined elsewhere.
from haystack import AsyncPipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever

hybrid_rag_retrieval = AsyncPipeline()
hybrid_rag_retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder())
hybrid_rag_retrieval.add_component("embedding_retriever", InMemoryEmbeddingRetriever(document_store=document_store))
hybrid_rag_retrieval.add_component("bm25_retriever", InMemoryBM25Retriever(document_store=document_store))
hybrid_rag_retrieval.add_component("document_joiner", DocumentJoiner())
hybrid_rag_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_rag_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_rag_retrieval.connect("embedding_retriever", "document_joiner")

async def run_inner():
    return await hybrid_rag_retrieval.run({
        "text_embedder": {"text": query},
        "bm25_retriever": {"query": query}
    })

results = asyncio.run(run_inner())
```

**Parallel Translation Pipeline**

```python
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack import AsyncPipeline
from haystack.utils import Secret

# Create prompt builders with templates at initialization
spanish_prompt_builder = ChatPromptBuilder(template="Translate this message to Spanish: {{user_message}}")
turkish_prompt_builder = ChatPromptBuilder(template="Translate this message to Turkish: {{user_message}}")
thai_prompt_builder = ChatPromptBuilder(template="Translate this message to Thai: {{user_message}}")

# Create LLM instances
spanish_llm = OpenAIChatGenerator()
turkish_llm = OpenAIChatGenerator()
thai_llm = OpenAIChatGenerator()

# Create and configure pipeline
pipe = AsyncPipeline()

# Add components
pipe.add_component("spanish_prompt_builder", spanish_prompt_builder)
pipe.add_component("turkish_prompt_builder", turkish_prompt_builder)
pipe.add_component("thai_prompt_builder", thai_prompt_builder)
pipe.add_component("spanish_llm", spanish_llm)
pipe.add_component("turkish_llm", turkish_llm)
pipe.add_component("thai_llm", thai_llm)

# Connect components
pipe.connect("spanish_prompt_builder.prompt", "spanish_llm.messages")
pipe.connect("turkish_prompt_builder.prompt", "turkish_llm.messages")
pipe.connect("thai_prompt_builder.prompt", "thai_llm.messages")

user_message = """
In computer programming, the async/await pattern is a syntactic feature of many
programming languages that allows an asynchronous, non-blocking function to be
structured in a way similar to an ordinary synchronous function. It is semantically
related to the concept of a coroutine and is often implemented using similar
techniques, and is primarily intended to provide opportunities for the program to
execute other code while waiting for a long-running, asynchronous task to complete,
usually represented by promises or similar data structures.
"""

# Run the pipeline with simplified input
res = pipe.run(data={"user_message": user_message})

# Print results
print("Spanish translation:", res["spanish_llm"]["generated_messages"][0].text)
print("Turkish translation:", res["turkish_llm"]["generated_messages"][0].text)
print("Thai translation:", res["thai_llm"]["generated_messages"][0].text)
```

Tool Calling Support Everywhere

Tool calling is now universally supported across all chat generators, making it easier than ever for developers to port tools across different platforms. Simply switch the chat generator used, and tooling will work seamlessly without any additional configuration. This update applies across AzureOpenAIChatGenerator, HuggingFaceLocalChatGenerator, and all core integrations, including AnthropicChatGenerator, CohereChatGenerator, AmazonBedrockChatGenerator, and VertexAIGeminiChatGenerator. With this enhancement, tool usage becomes a native capability across the ecosystem, enabling more advanced and interactive agentic applications.

Visualize Your Pipelines Locally

Pipeline visualization is now more flexible, allowing users to render pipeline graphs locally without requiring an internet connection or sending data to an external service. By running a local Mermaid server with Docker, you can generate visual representations of your pipelines using draw() or show(). Learn more in Visualizing Pipelines

New Components for Smarter Document Processing

This release introduces new components that enhance document processing capabilities. CSVDocumentSplitter and CSVDocumentCleaner make handling CSV files more efficient. LLMMetadataExtractor leverages an LLM to analyze documents and enrich them with relevant metadata, improving searchability and retrieval accuracy.

⬆️ Upgrade Notes

  • The DOCXToDocument converter now returns a Document object with DOCX metadata stored in the meta field as a dictionary under the key docx. Previously, the metadata was represented as a DOCXMetadata dataclass. This change does not impact reading from or writing to a Document Store.
  • Removed the deprecated NLTKDocumentSplitter; its functionality is now supported by the DocumentSplitter.
  • The deprecated FUNCTION role has been removed from the ChatRole enum. Use TOOL instead. The deprecated class method ChatMessage.from_function has been removed. Use ChatMessage.from_tool instead.

πŸš€ New Features

  • Added a new component ListJoiner which joins lists of values from different components to a single list.

  • Introduced the OpenAPIConnector component, enabling direct invocation of REST endpoints as specified in an OpenAPI specification. This component is designed for direct REST endpoint invocation without LLM-generated payloads; users need to pass the run parameters explicitly. Example:

    ```python
    from haystack.utils import Secret
    from haystack.components.connectors.openapi import OpenAPIConnector

    connector = OpenAPIConnector(
        openapi_spec="https://bit.ly/serperdev_openapi",
        credentials=Secret.from_env_var("SERPERDEV_API_KEY"),
    )
    response = connector.run(operation_id="search", parameters={"q": "Who was Nikola Tesla?"})
    ```

  • Adding a new component, LLMMetadataExtractor, which can be used in an indexing pipeline to extract metadata from documents based on a user-given prompt, returning the documents with their metadata field enriched by the output of the LLM.

  • Introduced CSVDocumentCleaner component for cleaning CSV documents.

    • Removes empty rows and columns, while preserving specified ignored rows and columns.
    • Customizable number of rows and columns to ignore during processing.
  • Introducing CSVDocumentSplitter: The CSVDocumentSplitter splits CSV documents into structured sub-tables by recursively splitting by empty rows and columns larger than a specified threshold. This is particularly useful when converting Excel files which can often have multiple tables within one sheet.
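The core idea of splitting on empty rows can be sketched with plain Python. This is a simplification for illustration; the actual component also splits on empty columns and supports a configurable threshold:

```python
def split_on_empty_rows(csv_text: str) -> list:
    """Split CSV text into sub-tables wherever blank rows appear (simplified sketch)."""
    tables, current = [], []
    for line in csv_text.splitlines():
        if line.strip(","):          # a row with at least one non-empty cell
            current.append(line)
        elif current:                # a blank row ends the current sub-table
            tables.append("\n".join(current))
            current = []
    if current:
        tables.append("\n".join(current))
    return tables

# Two tables in one sheet, separated by an empty row
sheet = "a,b\n1,2\n,,\n\nx,y\n3,4\n"
print(split_on_empty_rows(sheet))  # ['a,b\n1,2', 'x,y\n3,4']
```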

⚑️ Enhancement Notes

  • Enhanced SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder to accept an additional parameter, which is passed directly to the underlying SentenceTransformer.encode method for greater flexibility in embedding customization.
  • Added completion_start_time metadata to track time-to-first-token (TTFT) in streaming responses from Hugging Face API and OpenAI (Azure).
  • Enhancements to Date Filtering in MetadataRouter:
    • Improved date parsing in filter utilities by introducing _parse_date, which first attempts datetime.fromisoformat(value) for backward compatibility and then falls back to dateutil.parser.parse() for broader ISO 8601 support.
    • Resolved a common issue where comparing naive and timezone-aware datetimes resulted in TypeError. Added _ensure_both_dates_naive_or_aware, which ensures both datetimes are either naive or aware. If one is missing a timezone, it is assigned the timezone of the other for consistency.
  • When Pipeline.from_dict receives an invalid type (e.g. empty string), an informative PipelineError is now raised.
  • Add jsonschema library as a core dependency. It is used in Tool and JsonSchemaValidator.
  • Added support for the streaming_callback run parameter in Hugging Face chat generators.
  • For the CSVDocumentCleaner, added remove_empty_rows & remove_empty_columns to optionally remove rows and columns. Also added keep_id to optionally allow for keeping the original document ID.
  • Enhanced OpenAPIServiceConnector to support and be compatible with the new ChatMessage format.
  • Updated the Document's metadata after initializing the Document in DocumentSplitter, as requested in issue #8741.
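The naive/aware datetime fix described above can be sketched with the stdlib; ensure_both_naive_or_aware is an illustrative stand-in for the internal _ensure_both_dates_naive_or_aware helper:

```python
from datetime import datetime, timezone

def ensure_both_naive_or_aware(d1: datetime, d2: datetime):
    """Align timezone-awareness so the two datetimes can be ordered safely."""
    if d1.tzinfo is None and d2.tzinfo is not None:
        d1 = d1.replace(tzinfo=d2.tzinfo)
    elif d2.tzinfo is None and d1.tzinfo is not None:
        d2 = d2.replace(tzinfo=d1.tzinfo)
    return d1, d2

naive = datetime(2024, 1, 1)
aware = datetime(2024, 1, 1, tzinfo=timezone.utc)
a, b = ensure_both_naive_or_aware(naive, aware)
assert a == b  # ordering them directly (naive < aware) would raise TypeError
```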

⚠️ Deprecation Notes

  • The ExtractedTableAnswer dataclass and the dataframe field in the Document dataclass are deprecated and will be removed in Haystack 2.11.0. Check out the GitHub discussion for motivation and details.

πŸ› Bug Fixes

  • Fixes a bug that causes pyright type checker to fail for all component objects.
  • Haystack pipelines with Mermaid graphs are now compressed to reduce the size of the encoded base64 and avoid HTTP 400 errors when the graph is too large.
  • The DOCXToDocument component now skips comment blocks in DOCX files that previously caused errors.
  • Callable deserialization now works for all fully qualified import paths.
  • Fix error messages for Document Classifier components, that suggested using nonexistent components for text classification.
  • Fixed JSONConverter to properly skip converting JSON files that are not utf-8 encoded.
  • Fixed several pipeline execution issues:
    • acyclic pipelines with multiple lazy variadic components not running all components
    • cyclic pipelines not passing intermediate outputs to components outside the cycle
    • cyclic pipelines with two or more optional or greedy variadic edges showing unexpected execution behavior
    • cyclic pipelines with two cycles sharing an edge raising errors
  • Updated the PDFMinerToDocument convert function to use double newlines between container text so that passages can later be split by DocumentSplitter.
  • In the Hugging Face API embedders, the InferenceClient.feature_extraction method is now used instead of InferenceClient.post to compute embeddings. This ensures a more robust and future-proof implementation.
  • Improved OpenAIChatGenerator streaming response tool call processing: The logic now scans all chunks to correctly identify the first chunk with tool calls, ensuring accurate payload construction and preventing errors when tool call data isn't confined to the initial chunk.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.0-rc3

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.10.0-rc1

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.9.0

⭐️ Highlights

Tool Calling Support

We are introducing the Tool, a simple and unified abstraction for representing tools in Haystack, and the ToolInvoker, which executes tool calls prepared by LLMs. These features make it easy to integrate tool calling into your Haystack pipelines, enabling seamless interaction with tools when used with components like OpenAIChatGenerator and HuggingFaceAPIChatGenerator. Here's how you can use them:

```python
def dummy_weather_function(city: str):
    return f"The weather in {city} is 20 degrees."

tool = Tool(
    name="weather_tool",
    description="A tool to get the weather",
    function=dummy_weather_function,
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("How is the weather in Berlin today?")
result = pipeline.run({"llm": {"messages": [message]}})
```

**Use Components as Tools**

As an abstraction of Tool, [ComponentTool](https://docs.haystack.deepset.ai/docs/componenttool) allows LLMs to interact directly with components like web search, document processing, or custom user components. It simplifies schema generation and type conversion, making it easy to expose complex component functionality to LLMs.

```python
# Create a tool from the component
tool = ComponentTool(
    component=SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3),
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to the component docstring
)
```

New Splitting Method: RecursiveDocumentSplitter

RecursiveDocumentSplitter introduces a smarter way to split text. It uses a set of separators to divide text recursively, starting with the first separator. If chunks are still larger than the specified size, the splitter moves to the next separator in the list. This approach ensures efficient and granular text splitting for improved processing.

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
doc_chunks = splitter.run([Document(content="...")])
```

⚠️ Refactored ChatMessage dataclass

ChatMessage dataclass has been refactored to improve flexibility and compatibility. As part of this update, the content attribute has been removed and replaced with a new text property for accessing the ChatMessage's textual value. This change ensures future-proofing and better support for features like tool calls and their results. For details on the new API and migration steps, see the ChatMessage documentation. If you have any questions about this refactoring, feel free to let us know in this Github discussion.
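To illustrate the shape of the new API, here is a minimal, hypothetical stand-in (not Haystack's actual class) showing how a read-only text property can expose the textual part of a multi-content message:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleChatMessage:
    """Simplified illustration: content is a list of parts (text, tool calls, ...)."""
    role: str
    contents: List[object]

    @property
    def text(self) -> Optional[str]:
        # Return the first textual content part, if any.
        for part in self.contents:
            if isinstance(part, str):
                return part
        return None

    @classmethod
    def from_user(cls, text: str) -> "SimpleChatMessage":
        return cls(role="user", contents=[text])


msg = SimpleChatMessage.from_user("Hello!")
print(msg.text)  # Hello!
```

The point of the property-based design is that code reading `msg.text` keeps working even when the message carries non-textual parts such as tool calls.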

⬆️ Upgrade Notes

  • The refactoring of the ChatMessage data class includes some breaking changes involving ChatMessage creation and attribute access. If you have a Pipeline containing a ChatPromptBuilder that was serialized with haystack-ai <= 2.9.0, deserialization may break. For detailed information about the changes and how to migrate, see the ChatMessage documentation.
  • Removed the deprecated converter init argument from PyPDFToDocument. Use other init arguments instead, or create a custom component.
  • The SentenceWindowRetriever output key context_documents now outputs a List[Document] containing the retrieved documents and the context windows ordered by split_idx_start.
  • Updated the default value of store_full_path to False in converters.

πŸš€ New Features

  • Introduced the ComponentTool, a new tool that wraps Haystack components, allowing them to be utilized as tools for LLMs (various ChatGenerators). This ComponentTool supports automatic tool schema generation, input type conversion, and offers support for components with run methods that have input types:

    • Basic types (str, int, float, bool, dict)
    • Dataclasses (both simple and nested structures)
    • Lists of basic types (e.g., List[str])
    • Lists of dataclasses (e.g., List[Document])
    • Parameters with mixed types (e.g., List[Document], str etc.)

    Example usage:

    ```python
    from haystack import component, Pipeline
    from haystack.tools import ComponentTool
    from haystack.components.websearch import SerperDevWebSearch
    from haystack.utils import Secret
    from haystack.components.tools.tool_invoker import ToolInvoker
    from haystack.components.generators.chat import OpenAIChatGenerator
    from haystack.dataclasses import ChatMessage

    # Create a SerperDev search component
    search = SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3)

    # Create a tool from the component
    tool = ComponentTool(
        component=search,
        name="web_search",  # Optional: defaults to "serper_dev_web_search"
        description="Search the web for current information on any topic",  # Optional: defaults to component docstring
    )

    # Create pipeline with OpenAIChatGenerator and ToolInvoker
    pipeline = Pipeline()
    pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
    pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))

    # Connect components
    pipeline.connect("llm.replies", "tool_invoker.messages")

    message = ChatMessage.from_user("Use the web search tool to find information about Nikola Tesla")

    # Run pipeline
    result = pipeline.run({"llm": {"messages": [message]}})

    print(result)
    ```

  • Add XLSXToDocument converter that loads an Excel file using Pandas + openpyxl and by default converts each sheet into a separate Document in CSV format.

  • Added a new store_full_path parameter to the __init__ methods of PyPDFToDocument and AzureOCRDocumentConverter. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.

  • Add a new experimental component ToolInvoker. This component invokes tools based on tool calls prepared by Language Models and returns the results as a list of ChatMessage objects with tool role.

  • Adding a RecursiveSplitter, which uses a set of separators to split text recursively. It attempts to divide the text using the first separator, and if the resulting chunks are still larger than the specified size, it moves to the next separator in the list.

  • Added a create_tool_from_function function to create a Tool instance from a function, with automatic generation of name, description, and parameters. Added a tool decorator to achieve the same result.

  • Add support for Tools in the Hugging Face API Chat Generator.

  • Changed the ChatMessage dataclass to support different types of content, including tool calls, and tool call results.

  • Add support for Tools in the OpenAI Chat Generator.

  • Added a new Tool dataclass to represent a tool for which Language Models can prepare calls.

  • Added the StringJoiner component to join strings from different components into a list of strings.
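The recursive splitting strategy introduced above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea (it ignores chunk overlap and separator retention), not Haystack's implementation:

```python
def recursive_split(text, separators, max_len):
    """Try the first separator; any chunk still longer than max_len is
    re-split with the remaining separators, recursively."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, rest, max_len))
    return chunks


# Paragraph break first, then whitespace as a fallback separator.
pieces = recursive_split("aaa\n\nbbb ccc ddd", ["\n\n", " "], max_len=4)
print(pieces)  # ['aaa', 'bbb', 'ccc', 'ddd']
```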

⚑️ Enhancement Notes

  • Added default_headers parameter to AzureOpenAIDocumentEmbedder and AzureOpenAITextEmbedder.

  • Add token argument to NamedEntityExtractor to allow usage of private Hugging Face models.

  • Add the from_openai_dict_format class method to the ChatMessage class. It allows you to create a ChatMessage from a dictionary in the format that OpenAI's Chat API expects.

  • Add a testing job to check that all packages can be imported successfully. This should help detect several issues, such as forgetting to use a forward reference for a type hint coming from a lazy import.

  • DocumentJoiner methods _concatenate() and _distribution_based_rank_fusion() were converted to static methods.

  • Improve serialization and deserialization of callables. We now allow serialization of class methods and static methods and explicitly prohibit serialization of instance methods, lambdas, and nested functions.

  • Added new initialization parameters to the PyPDFToDocument component to customize the text extraction process from PDF files.

  • Reorganized the document store test suite to isolate dataframe filter tests. This change prepares for potential future deprecation of the Document class's dataframe field.

  • Move Tool to a new dedicated tools package. Refactor Tool serialization and deserialization to make it more flexible and include type information.

  • The NLTKDocumentSplitter was merged into the DocumentSplitter, which now provides the same functionality. The split_by="sentence" option now uses custom sentence boundary detection based on the nltk library. The previous sentence behaviour can still be achieved with split_by="period".

  • Improved deserialization of callables by using importlib instead of sys.modules. This change allows importing local functions and classes that are not in sys.modules when deserializing callable.

  • Changed OpenAIDocumentEmbedder to keep running if a batch fails embedding. Now, if OpenAI returns an error, we log that error and keep processing the following batches.
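The callable (de)serialization rules described above can be sketched as follows. This is an illustrative approximation, not Haystack's code: serialization rejects lambdas and nested functions (whose __qualname__ contains markers that cannot be re-imported), and deserialization imports the longest importable module prefix with importlib before walking the remaining attributes:

```python
import importlib
import math


def serialize_callable(fn) -> str:
    """Build a dotted import path; reject callables that cannot be re-imported."""
    qualname = fn.__qualname__
    if "<lambda>" in qualname or "<locals>" in qualname:
        raise ValueError(f"Cannot serialize {qualname!r}: lambdas and nested functions are not importable")
    return f"{fn.__module__}.{qualname}"


def deserialize_callable(path: str):
    """Import the longest importable module prefix, then walk the remaining
    attributes (this also resolves classmethods/staticmethods, whose
    qualname looks like "Class.method")."""
    parts = path.split(".")
    for i in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"Could not resolve {path!r}")


path = serialize_callable(math.sqrt)
print(path)                              # math.sqrt
print(deserialize_callable(path)(16.0))  # 4.0
```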

⚠️ Deprecation Notes

  • The NLTKDocumentSplitter is deprecated and will be removed in the next release. The DocumentSplitter now supports the functionality of the NLTKDocumentSplitter.

  • The function role and ChatMessage.from_function class method have been deprecated and will be removed in Haystack 2.10.0. ChatMessage.from_function now attempts to produce a valid tool message. For more information, see the documentation: https://docs.haystack.deepset.ai/docs/chatmessage

  • The SentenceWindowRetriever output of context_documents changed. Instead of a List[List[Document]], the output is a List[Document], where the documents are ordered by the split_idx_start value.
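The new flattened ordering can be illustrated with plain dictionaries standing in for Document objects (a sketch, assuming split_idx_start lives in each document's meta):

```python
# Hypothetical documents represented as dicts with a split_idx_start meta field.
docs = [
    {"content": "chunk C", "meta": {"split_idx_start": 200}},
    {"content": "chunk A", "meta": {"split_idx_start": 0}},
    {"content": "chunk B", "meta": {"split_idx_start": 100}},
]

# The flattened output is a single list ordered by split_idx_start,
# rather than the old list-of-lists grouping.
ordered = sorted(docs, key=lambda d: d["meta"]["split_idx_start"])
print([d["content"] for d in ordered])  # ['chunk A', 'chunk B', 'chunk C']
```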

πŸ› Bug Fixes

  • Added the missing stream MIME type assignment in the LinkContentFetcher for the single-URL scenario.

  • Fixed a bug where pipelines using FileTypeRouter could fail if they received a single URL as input.

  • OpenAIChatGenerator no longer passes tools to the OpenAI client if none are provided. Previously, a null value was passed. This change improves compatibility with OpenAI-compatible APIs that do not support tools.

  • ByteStream now truncates the data to 100 bytes in the string representation to avoid excessive log output.

  • Make the HuggingFaceLocalChatGenerator compatible with the new ChatMessage format, by converting the messages to the format expected by HuggingFace.

  • Serialize the chat_template parameter.

  • Moved the NLTK download of DocumentSplitter and NLTKDocumentSplitter to warm_up(). This prevents calling to an external API during instantiation. If a DocumentSplitter or NLTKDocumentSplitter is used for sentence splitting outside of a pipeline, warm_up() now needs to be called before running the component.

  • PDFMinerToDocument now creates documents with id based on converted text and metadata. Before, PDFMinerToDocument did not consider the document's meta field when generating the document's id.

  • Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.

  • PyPDFToDocument now creates documents with ids based on the converted text and metadata. Previously, it didn't take the metadata into account.

  • Fixes issues with deserialization of components in multi-threaded environments.
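The ByteStream truncation fix above can be illustrated with a minimal stand-in class (hypothetical, not Haystack's ByteStream): the string representation shows at most the first 100 bytes:

```python
class SimpleByteStream:
    """Hypothetical stand-in illustrating a truncated repr for large payloads."""

    def __init__(self, data: bytes):
        self.data = data

    def __repr__(self) -> str:
        truncated = self.data[:100]
        suffix = "..." if len(self.data) > 100 else ""
        return f"SimpleByteStream(data={truncated!r}{suffix}, total_bytes={len(self.data)})"


stream = SimpleByteStream(b"x" * 5000)
print(repr(stream))  # only the first 100 bytes appear, followed by "..."
```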

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.9.0-rc1

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.1

Release Notes

v2.8.1

Bug Fixes

  • Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
  • PyPDFToDocument now creates documents with ids based on the converted text and metadata. Previously, it didn't take the metadata into account.
  • Fixes issues with deserialization of components in multi-threaded environments.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.1-rc3

Release Notes

v2.8.1-rc3

Bug Fixes

  • PyPDFToDocument now creates documents with ids based on the converted text and metadata. Previously, it didn't take the metadata into account.

v2.8.1-rc2

Bug Fixes

  • Fixes issues with deserialization of components in multi-threaded environments.

v2.8.1-rc1

Bug Fixes

  • Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.1-rc2

Release Notes

v2.8.1-rc2

Bug Fixes

  • Fixes issues with deserialization of components in multi-threaded environments.

v2.8.1-rc1

Bug Fixes

  • Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.1-rc1

Release Notes

v2.8.1-rc1

Bug Fixes

  • Pin OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.0

Release Notes

⬆️ Upgrade Notes

  • Remove is_greedy deprecated argument from @component decorator. Change the Variadic input of your Component to GreedyVariadic instead.

πŸš€ New Features

  • We've added a new DALLEImageGenerator component, bringing image generation with OpenAI's DALL-E to Haystack.

    • Easy to Use: Just a few lines of code to get started:
      ```python
      from haystack.components.generators import DALLEImageGenerator

      image_generator = DALLEImageGenerator()
      response = image_generator.run("Show me a picture of a black cat.")
      print(response)
      ```

  • Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.

  • We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.

  • Added a new store_full_path parameter to the __init__ methods of the following converters: JSONConverter, CSVToDocument, DOCXToDocument, HTMLToDocument, MarkdownToDocument, PDFMinerToDocument, PPTXToDocument, TikaDocumentConverter, PyPDFToDocument, AzureOCRDocumentConverter, and TextFileToDocument. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.

  • When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.

  • Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.

  • Added a new option to the required_variables parameter of the PromptBuilder and ChatPromptBuilder. By passing required_variables="*" you can automatically set all variables in the prompt to be required.
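The grouping behavior of the new MetaFieldGroupingRanker can be sketched in plain Python (an illustration with dicts standing in for Documents; group_by_meta is a hypothetical helper, not the component's API):

```python
def group_by_meta(docs, key):
    """Stable reorder: documents sharing the same meta value end up adjacent,
    with groups appearing in first-seen order."""
    group_order = {}
    for doc in docs:
        group_order.setdefault(doc["meta"].get(key), len(group_order))
    return sorted(docs, key=lambda d: group_order[d["meta"].get(key)])


docs = [
    {"id": 1, "meta": {"source": "a.pdf"}},
    {"id": 2, "meta": {"source": "b.pdf"}},
    {"id": 3, "meta": {"source": "a.pdf"}},
]
print([d["id"] for d in group_by_meta(docs, "source")])  # [1, 3, 2]
```

Grouping related chunks together like this can help an LLM see each source's context as one contiguous span.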

⚑️ Enhancement Notes

  • Across Haystack codebase, we have replaced the use of ChatMessage data class constructor with specific class methods (ChatMessage.from_user, ChatMessage.from_assistant, etc.).
  • Added the Maximum Margin Relevance (MMR) strategy to the SentenceTransformersDiversityRanker. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
  • Introduces optional parameters in the ConditionalRouter component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
  • Added split by line to DocumentSplitter, which will split the document at "\n".
  • Changed OpenAIDocumentEmbedder to keep running if a batch fails embedding. Now, if OpenAI returns an error, we log that error and keep processing the following batches.
  • Added new initialization parameters to the PyPDFToDocument component to customize the text extraction process from PDF files.
  • Replace usage of ChatMessage.content with ChatMessage.text across the codebase. This is done in preparation for the removal of content in Haystack 2.9.0.
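The MMR strategy added to the SentenceTransformersDiversityRanker can be sketched as a greedy selection loop (an illustration with precomputed similarity scores, not the component's implementation):

```python
def mmr_select(query_sim, doc_sims, lambda_=0.5, top_k=2):
    """Greedy MMR: at each step pick the candidate maximizing
    lambda * relevance - (1 - lambda) * max similarity to already selected docs.
    query_sim[i] is doc i's relevance to the query; doc_sims[i][j] is the
    similarity between docs i and j."""
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < top_k:
        def score(i):
            diversity_penalty = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sim[i] - (1 - lambda_) * diversity_penalty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Docs 0 and 1 are near-duplicates; MMR picks 0, then the diverse doc 2.
query_sim = [0.9, 0.85, 0.6]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sims))  # [0, 2]
```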

⚠️ Deprecation Notes

  • The default value of the store_full_path parameter in converters will change to False in Haystack 2.9.0 to enhance privacy.
  • In Haystack 2.9.0, the ChatMessage data class will be refactored to make it more flexible and future-proof. As part of this change, the content attribute will be removed. A new text property has been introduced to provide access to the textual value of the ChatMessage. To ensure a smooth transition, start using the text property now in place of content.
  • The converter parameter in the PyPDFToDocument component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
  • The output of context_documents in SentenceWindowRetriever will change in the next release. Instead of a List[List[Document]], the output will be a List[Document], where the documents are ordered by split_idx_start.

πŸ› Bug Fixes

  • Fix DocumentCleaner not preserving all Document fields when run

  • Fix DocumentJoiner failing when ran with an empty list of Documents

  • For the NLTKDocumentSplitter we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap. i.e. we want to avoid cases of Doc1 = [s1, s2], Doc2 = [s1, s2, s3].

  • Finished adding function support for this component by updating the _split_into_units function and adding the splitting_function init parameter.

  • Add a specific to_dict method to override the underlying one from DocumentSplitter. This is needed to properly save the settings of the component to YAML.

  • Fix OpenAIChatGenerator and OpenAIGenerator crashing when using a streaming_callback and generation_kwargs contain {"stream_options": {"include_usage": True}}.

  • Fix tracing Pipeline with cycles to correctly track components execution

  • When meta is passed into AnswerBuilder.run(), it is now merged into GeneratedAnswer meta

  • Fix DocumentSplitter to handle custom splitting_function without requiring split_length. Previously the splitting_function provided would not override other settings.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.0-rc3

Release Notes

⬆️ Upgrade Notes

  • Remove is_greedy deprecated argument from @component decorator. Change the Variadic input of your Component to GreedyVariadic instead.

πŸš€ New Features

  • We've added a new DALLEImageGenerator component, bringing image generation with OpenAI's DALL-E to Haystack.
    • Easy to Use: Just a few lines of code to get started:
      ```python
      from haystack.components.generators import DALLEImageGenerator

      image_generator = DALLEImageGenerator()
      response = image_generator.run("Show me a picture of a black cat.")
      print(response)
      ```
  • Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
  • We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
  • Added a new store_full_path parameter to the __init__ methods of the following converters: JSONConverter, CSVToDocument, DOCXToDocument, HTMLToDocument, MarkdownToDocument, PDFMinerToDocument, PPTXToDocument, TikaDocumentConverter, PyPDFToDocument, AzureOCRDocumentConverter, and TextFileToDocument. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.
  • When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.
  • Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
  • Added a new option to the required_variables parameter of the PromptBuilder and ChatPromptBuilder. By passing required_variables="*" you can automatically set all variables in the prompt to be required.

⚑️ Enhancement Notes

  • Across Haystack codebase, we have replaced the use of ChatMessage data class constructor with specific class methods (ChatMessage.from_user, ChatMessage.from_assistant, etc.).
  • Added the Maximum Margin Relevance (MMR) strategy to the SentenceTransformersDiversityRanker. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
  • Introduces optional parameters in the ConditionalRouter component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
  • Added split by line to DocumentSplitter, which will split the document at "\n".
  • Changed OpenAIDocumentEmbedder to keep running if a batch fails embedding. Now, if OpenAI returns an error, we log that error and keep processing the following batches.
  • Added new initialization parameters to the PyPDFToDocument component to customize the text extraction process from PDF files.
  • Replace usage of ChatMessage.content with ChatMessage.text across the codebase. This is done in preparation for the removal of content in Haystack 2.9.0.

⚠️ Deprecation Notes

  • The default value of the store_full_path parameter in converters will change to False in Haystack 2.9.0 to enhance privacy.
  • In Haystack 2.9.0, the ChatMessage data class will be refactored to make it more flexible and future-proof. As part of this change, the content attribute will be removed. A new text property has been introduced to provide access to the textual value of the ChatMessage. To ensure a smooth transition, start using the text property now in place of content.
  • The converter parameter in the PyPDFToDocument component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
  • The output of context_documents will change in the next release. Instead of a List[List[Document]], the output will be a List[Document], where the documents are ordered by split_idx_start.

πŸ› Bug Fixes

  • Fix DocumentCleaner not preserving all Document fields when run

  • Fix DocumentJoiner failing when ran with an empty list of Documents

  • For the NLTKDocumentSplitter we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap. i.e. we want to avoid cases of Doc1 = [s1, s2], Doc2 = [s1, s2, s3].

  • Finished adding function support for this component by updating the _split_into_units function and adding the splitting_function init parameter.

  • Add a specific to_dict method to override the underlying one from DocumentSplitter. This is needed to properly save the settings of the component to YAML.

  • Fix OpenAIChatGenerator and OpenAIGenerator crashing when using a streaming_callback and generation_kwargs contain {"stream_options": {"include_usage": True}}.

  • Fix tracing Pipeline with cycles to correctly track components execution

  • When meta is passed into AnswerBuilder.run(), it is now merged into GeneratedAnswer meta

  • Fix DocumentSplitter to handle custom splitting_function without requiring split_length. Previously the splitting_function provided would not override other settings.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.0-rc2

Release Notes

⬆️ Upgrade Notes

  • Remove is_greedy deprecated argument from @component decorator. Change the Variadic input of your Component to GreedyVariadic instead.

πŸš€ New Features

  • We've added a new DALLEImageGenerator component, bringing image generation with OpenAI's DALL-E to Haystack.
    • Easy to Use: Just a few lines of code to get started:
      ```python
      from haystack.components.generators import DALLEImageGenerator

      image_generator = DALLEImageGenerator()
      response = image_generator.run("Show me a picture of a black cat.")
      print(response)
      ```
  • Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
  • We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
  • Added a new store_full_path parameter to the __init__ methods of the following converters: JSONConverter, CSVToDocument, DOCXToDocument, HTMLToDocument, MarkdownToDocument, PDFMinerToDocument, PPTXToDocument, TikaDocumentConverter, PyPDFToDocument, AzureOCRDocumentConverter, and TextFileToDocument. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.
  • When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.
  • Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
  • Added a new option to the required_variables parameter of the PromptBuilder and ChatPromptBuilder. By passing required_variables="*" you can automatically set all variables in the prompt to be required.

⚑️ Enhancement Notes

  • Across Haystack codebase, we have replaced the use of ChatMessage data class constructor with specific class methods (ChatMessage.from_user, ChatMessage.from_assistant, etc.).
  • Added the Maximum Margin Relevance (MMR) strategy to the SentenceTransformersDiversityRanker. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
  • Introduces optional parameters in the ConditionalRouter component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
  • Added split by line to DocumentSplitter, which will split the document at "\n".
  • Changed OpenAIDocumentEmbedder to keep running if a batch fails embedding. Now, if OpenAI returns an error, we log that error and keep processing the following batches.
  • Added new initialization parameters to the PyPDFToDocument component to customize the text extraction process from PDF files.
  • Replace usage of ChatMessage.content with ChatMessage.text across the codebase. This is done in preparation for the removal of content in Haystack 2.9.0.

⚠️ Deprecation Notes

  • The default value of the store_full_path parameter in converters will change to False in Haystack 2.9.0 to enhance privacy.
  • In Haystack 2.9.0, the ChatMessage data class will be refactored to make it more flexible and future-proof. As part of this change, the content attribute will be removed. A new text property has been introduced to provide access to the textual value of the ChatMessage. To ensure a smooth transition, start using the text property now in place of content.
  • The converter parameter in the PyPDFToDocument component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.

πŸ› Bug Fixes

  • Fix DocumentCleaner not preserving all Document fields when run

  • Fix DocumentJoiner failing when ran with an empty list of Documents

  • For the NLTKDocumentSplitter we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap. i.e. we want to avoid cases of Doc1 = [s1, s2], Doc2 = [s1, s2, s3].

  • Finished adding function support for this component by updating the _split_into_units function and adding the splitting_function init parameter.

  • Add a specific to_dict method to override the underlying one from DocumentSplitter. This is needed to properly save the settings of the component to YAML.

  • Fix OpenAIChatGenerator and OpenAIGenerator crashing when using a streaming_callback and generation_kwargs contain {"stream_options": {"include_usage": True}}.

  • Fix tracing Pipeline with cycles to correctly track components execution

  • When meta is passed into AnswerBuilder.run(), it is now merged into GeneratedAnswer meta

  • Fix DocumentSplitter to handle custom splitting_function without requiring split_length. Previously the splitting_function provided would not override other settings.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v2.8.0-rc1

Release Notes

⬆️ Upgrade Notes

  • Remove is_greedy deprecated argument from @component decorator. Change the Variadic input of your Component to GreedyVariadic instead.

πŸš€ New Features

  • We've added a new DALLEImageGenerator component, bringing image generation with OpenAI's DALL-E to Haystack.
    • Easy to Use: Just a few lines of code to get started:
      ```python
      from haystack.components.generators import DALLEImageGenerator

      image_generator = DALLEImageGenerator()
      response = image_generator.run("Show me a picture of a black cat.")
      print(response)
      ```
  • Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped. This behavior was already occurring, but now it's clearer through logs that this is happening.
  • We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
  • Added a new store_full_path parameter to the __init__ methods of the following converters: JSONConverter, CSVToDocument, DOCXToDocument, HTMLToDocument, MarkdownToDocument, PDFMinerToDocument, PPTXToDocument, TikaDocumentConverter, and TextFileToDocument. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.
  • When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.
  • Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
  • Added a new option to the required_variables parameter of the PromptBuilder and ChatPromptBuilder. By passing required_variables="*" you can automatically set all variables in the prompt to be required.

⚑️ Enhancement Notes

  • Added the Maximum Margin Relevance (MMR) strategy to the SentenceTransformersDiversityRanker. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents.
  • Introduces optional parameters in the ConditionalRouter component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
  • Added split by line to DocumentSplitter, which splits the document at newline characters ("\n").
  • Changed OpenAIDocumentEmbedder to keep running if a batch fails embedding. Now if OpenAI returns an error, we log that error and keep processing the following batches.
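
The MMR strategy can be illustrated with a framework-free sketch; the toy 2-d vectors and dot-product similarity below stand in for real embeddings, and `lambda_` is the usual relevance/diversity trade-off weight of the MMR formulation, not necessarily Haystack's parameter name:

```python
def mmr_select(query, docs, sim, k, lambda_=0.5):
    """Greedy Maximum Margin Relevance: at each step pick the candidate that is
    relevant to the query yet dissimilar from documents already selected."""
    selected, candidates = [], list(docs)
    while candidates and len(selected) < k:
        def score(d):
            diversity = max((sim(d, s) for s in selected), default=0.0)
            return lambda_ * sim(d, query) - (1 - lambda_) * diversity
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-d "embeddings": the first two docs are exact duplicates.
dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
query = (1.0, 0.0)
docs = [(0.9, 0.1), (0.9, 0.1), (0.5, 0.8)]
print(mmr_select(query, docs, dot, k=2, lambda_=0.3))  # [(0.9, 0.1), (0.5, 0.8)]
```

With a low `lambda_` the second pick skips the duplicate in favor of the more diverse document.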

⚠️ Deprecation Notes

  • The default value of the store_full_path parameter will change to False in Haystack 2.9.0 to enhance privacy.

πŸ› Bug Fixes

  • Fix DocumentCleaner not preserving all Document fields when run

  • Fix DocumentJoiner failing when run with an empty list of Documents

  • For the NLTKDocumentSplitter we are updating how chunks are made when splitting by word and sentence boundary is respected. Namely, to avoid fully subsuming the previous chunk into the next one, we ignore the first sentence from that chunk when calculating sentence overlap. i.e. we want to avoid cases of Doc1 = [s1, s2], Doc2 = [s1, s2, s3].
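
The idea of the fix can be shown with a plain-Python sketch of overlapping sentence chunks (a toy model, not the NLTKDocumentSplitter implementation): capping the overlap at size - 1 forces every chunk to advance by at least one sentence, so no chunk fully subsumes the previous one:

```python
def chunk_sentences(sentences, size, overlap):
    """Split a list of sentences into overlapping chunks; the overlap is capped
    at size - 1 so consecutive chunks always advance by at least one sentence."""
    overlap = min(overlap, size - 1)
    step = size - overlap
    return [sentences[i:i + size] for i in range(0, len(sentences), step) if sentences[i:i + size]]

sents = ["s1", "s2", "s3", "s4"]
# Requested overlap equals the chunk size; without the cap, each chunk would
# repeat the previous one in full (Doc1 = [s1, s2], Doc2 = [s1, s2, s3], ...).
print(chunk_sentences(sents, size=2, overlap=2))
# [['s1', 's2'], ['s2', 's3'], ['s3', 's4'], ['s4']]
```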

  • Finished adding splitting-function support to the NLTKDocumentSplitter by updating the _split_into_units function and adding the splitting_function init parameter.

  • Add specific to_dict method to overwrite the underlying one from DocumentSplitter. This is needed to properly save the settings of the component to yaml.

  • Fix OpenAIChatGenerator and OpenAIGenerator crashing when using a streaming_callback and generation_kwargs contains `{"stream_options": {"include_usage": True}}`.

  • Fix tracing Pipeline with cycles to correctly track components execution

  • When meta is passed into AnswerBuilder.run(), it is now merged into GeneratedAnswer meta

  • Fix DocumentSplitter to handle custom splitting_function without requiring split_length. Previously the splitting_function provided would not override other settings.

- Python
Published by github-actions[bot] about 1 year ago

farm-haystack - v1.26.4

Release Notes

v1.26.4

⚑️ Enhancement Notes

  • Upgrade the transformers dependency requirement to transformers>=4.46,<5.0
  • Updated tokenizer.json URL for Anthropic models as the old URL was no longer available.

- Python
Published by silvanocerza over 1 year ago

farm-haystack - v2.7.0

Release Notes

✨ Highlights

πŸš… Rework Pipeline.run() logic to better handle cycles

Pipeline.run() internal logic has been heavily reworked to be more robust and reliable than before. This new implementation makes it easier to run Pipelines that have cycles in their graph. It also fixes some corner cases in Pipelines that don't have any cycle.

πŸ“ Introduce LoggingTracer

With the new LoggingTracer, users can inspect the logs in real-time to see everything that is happening in their Pipelines. This feature aims to improve the user experience during experimentation and prototyping.

```python
import logging

from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer())
```


⬆️ Upgrade Notes

  • Removed Pipeline init argument debug_path. We do not support this anymore.

  • Removed Pipeline init argument max_loops_allowed. Use max_runs_per_component instead.

  • Removed PipelineMaxLoops exception. Use PipelineMaxComponentRuns instead.

  • The deprecated default converter class haystack.components.converters.pypdf.DefaultConverter used by PyPDFToDocument has been removed.

Pipeline YAMLs from haystack<2.7.0 that use the default converter must be updated in the following manner:

```yaml
# Old
components:
  Comp1:
    init_parameters:
      converter:
        type: haystack.components.converters.pypdf.DefaultConverter
    type: haystack.components.converters.pypdf.PyPDFToDocument

# New
components:
  Comp1:
    init_parameters:
      converter: null
    type: haystack.components.converters.pdf.PDFToTextConverter
```

Pipeline YAMLs from haystack<2.7.0 that use custom converter classes can be upgraded by simply loading them with haystack==2.6.x and saving them to YAML again.

  • Pipeline.connect() will now raise a PipelineConnectError if sender and receiver are the same Component. We do not support this use case anymore.

πŸš€ New Features

  • Added component StringJoiner to join strings from different components to a list of strings.

  • Improved serialization/deserialization errors to provide extra context about the delinquent components when possible.

  • Enhanced DOCX converter to support table extraction in addition to paragraph content. The converter supports both CSV and Markdown table formats, providing flexible options for representing tabular data extracted from DOCX documents.

  • Added a new parameter additional_mimetypes to the FileTypeRouter component. This allows users to specify additional MIME type mappings, ensuring correct file classification across different runtime environments and Python versions.
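
The situation this addresses is easy to reproduce with Python's standard mimetypes module; registering the mapping manually, as below, is essentially what additional_mimetypes automates:

```python
import mimetypes

# On stripped-down systems (e.g. some AWS Lambda images) guess_type() can
# return None for Office formats because the system mime.types file is absent.
docx_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
mimetypes.add_type(docx_type, ".docx")

guessed, _ = mimetypes.guess_type("report.docx")
print(guessed)  # application/vnd.openxmlformats-officedocument.wordprocessingml.document
```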

  • Introduce a LoggingTracer, that sends all traces to the logs.

It can be enabled as follows:

```python
import logging

from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer())
```

  • Fundamentally rework the internal logic of Pipeline.run(). The rework makes it more reliable and covers more use cases. We fixed some issues that made Pipelines with cycles unpredictable and with unclear Components execution order.

  • Each tracing span of a component run is now attached to the pipeline run span object. This allows users to trace the execution of multiple pipeline runs concurrently.

⚑️ Enhancement Notes

  • Add streaming_callback run parameter to HuggingFaceAPIGenerator and HuggingFaceLocalGenerator to allow users to pass a callback function that will be called after each chunk of the response is generated.
  • The SentenceWindowRetriever now supports the window_size parameter at run time, overwriting the value set in the constructor.
  • Add output type validation in ConditionalRouter. Setting validate_output_type to True will enable a check to verify if the actual output of a route returns the declared type. If it doesn't match a ValueError is raised.
  • Reduced numpy usage to speed up imports.
  • Improved file type detection in FileTypeRouter, particularly for Microsoft Office file formats like .docx and .pptx. This enhancement ensures more consistent behavior across different environments, including AWS Lambda functions and systems without pre-installed office suites.
  • The FileTypeRouter now supports passing metadata (meta) in the run method. When metadata is provided, the sources are internally converted to ByteStream objects and the metadata is added. This new parameter simplifies working with preprocessing/indexing pipelines.
  • SentenceTransformersDocumentEmbedder now supports config_kwargs for additional parameters when loading the model configuration
  • SentenceTransformersTextEmbedder now supports config_kwargs for additional parameters when loading the model configuration
  • Previously, numpy was pinned to <2.0 to avoid compatibility issues in several core integrations. This pin has been removed, and haystack can work with both numpy 1.x and 2.x. If necessary, we will pin numpy version in specific core integrations that require it.

⚠️ Deprecation Notes

  • The DefaultConverter class used by the PyPDFToDocument component has been deprecated. Its functionality will be merged into the component in 2.7.0.

πŸ› Bug Fixes

  • Serialized data of components are now explicitly enforced to be one of the following basic Python datatypes: str, int, float, bool, list, dict, set, tuple or None.
  • Addressed an issue where certain file types (e.g., .docx, .pptx) were incorrectly classified as 'unclassified' in environments with limited MIME type definitions, such as AWS Lambda functions.
  • Fixes logs containing JSON data getting lost due to string interpolation.
  • Use forward references for Hugging Face Hub types in the HuggingFaceAPIGenerator component to prevent import errors.
  • Fix the serialization of PyPDFToDocument component to prevent the default converter from being serialized unnecessarily.
  • Revert change to PyPDFConverter that broke the deserialization of pre 2.6.0 YAMLs.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.7.0-rc1

Release Notes

v2.7.0-rc1

New Features

  • Added component StringJoiner to join strings from different components to a list of strings.

v2.8.0-rc0

Highlights

With the new Logging Tracer, users can inspect in the logs everything that is happening in their Pipelines in real time. This feature aims to improve the user experience during experimentation and prototyping.

Pipeline.run() internal logic has been heavily reworked to be more robust and reliable than before. This new implementation makes it easier to run `Pipeline`s that have cycles in their graph. It also fixes some corner cases in `Pipeline`s that don't have any cycle.

Upgrade Notes

  • Removed Pipeline init argument debug_path. We do not support this anymore.

  • Removed Pipeline init argument max_loops_allowed. Use max_runs_per_component instead.

  • Removed PipelineMaxLoops exception. Use PipelineMaxComponentRuns instead.

  • The deprecated default converter class haystack.components.converters.pypdf.DefaultConverter used by PyPDFToDocument has been removed.

    Pipeline YAMLs from haystack<2.7.0 that use the default converter must be updated in the following manner:

```yaml
# Old
components:
  Comp1:
    init_parameters:
      converter:
        type: haystack.components.converters.pypdf.DefaultConverter
    type: haystack.components.converters.pypdf.PyPDFToDocument

# New
components:
  Comp1:
    init_parameters:
      converter: null
    type: haystack.components.converters.pdf.PDFToTextConverter
```

    Pipeline YAMLs from haystack<2.7.0 that use custom converter classes can be upgraded by simply loading them with haystack==2.6.x and saving them to YAML again.

  • Pipeline.connect() will now raise a PipelineConnectError if sender and receiver are the same Component. We do not support this use case anymore.

New Features

  • Added a new component DocumentNDCGEvaluator, which is similar to DocumentMRREvaluator and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground truth relevant documents and the order in which they are retrieved is important.

  • Improved serialization/deserialization errors to provide extra context about the delinquent components when possible.

  • Enhanced DOCX converter to support table extraction in addition to paragraph content. The converter supports both CSV and Markdown table formats, providing flexible options for representing tabular data extracted from DOCX documents.

  • Added a new parameter additional_mimetypes to the FileTypeRouter component.

    This allows users to specify additional MIME type mappings, ensuring correct

    file classification across different runtime environments and Python versions.

  • Introduce a Logging Tracer, that sends all traces to the logs.

    It can be enabled as follows:

```python
import logging

from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer())
```

  • Fundamentally rework the internal logic of Pipeline.run(). The rework makes it more reliable and covers more use cases. We fixed some issues that made `Pipeline`s with cycles unpredictable and with unclear Components execution order.

  • Each tracing span of a component run is now attached to the pipeline run span object. This allows users to trace the execution of multiple pipeline runs concurrently.

Enhancement Notes

  • Add streaming_callback run parameter to HuggingFaceAPIGenerator and HuggingFaceLocalGenerator to allow users to pass a callback function that will be called after each chunk of the response is generated.
  • The SentenceWindowRetriever now supports the window_size parameter at run time, overwriting the value set in the constructor.
  • Add output type validation in ConditionalRouter. Setting validate_output_type to True will enable a check to verify if the actual output of a route returns the declared type. If it doesn't match, a ValueError is raised.
  • Reduced numpy usage to speed up imports.
  • Improved file type detection in FileTypeRouter, particularly for Microsoft Office file formats like .docx and .pptx. This enhancement ensures more consistent behavior across different environments, including AWS Lambda functions and systems without pre-installed office suites.
  • The FileTypeRouter now supports passing metadata (meta) in the run method. When metadata is provided, the sources are internally converted to ByteStream objects and the metadata is added. This new parameter simplifies working with preprocessing/indexing pipelines.
  • SentenceTransformersDocumentEmbedder now supports config_kwargs for additional parameters when loading the model configuration
  • SentenceTransformersTextEmbedder now supports config_kwargs for additional parameters when loading the model configuration
  • Previously, numpy was pinned to <2.0 to avoid compatibility issues in several core integrations. This pin has been removed, and haystack can work with both numpy 1.x and 2.x. If necessary, we will pin numpy version in specific core integrations that require it.
  • Upgrade Hatch to 1.13.0 and adopt uv as installer, to speed up the CI.

Deprecation Notes

  • The DefaultConverter class used by the PyPDFToDocument component has been deprecated. Its functionality will be merged into the component in 2.7.0.

Bug Fixes

  • Adjusted a test on HuggingFaceAPIGenerator to ensure compatibility with huggingface_hub==0.26.0.
  • Serialized data of components are now explicitly enforced to be one of the following basic Python datatypes: str, int, float, bool, list, dict, set, tuple or None.
  • Addressed an issue where certain file types (e.g., .docx, .pptx) were incorrectly classified as 'unclassified' in environments with limited MIME type definitions, such as AWS Lambda functions.
  • Fixes logs containing JSON data getting lost due to string interpolation.
  • Use forward references for Hugging Face Hub types in the HuggingFaceAPIGenerator component to prevent import errors.
  • Add pip to test dependencies: mypy needs it to install missing stub packages.
  • Fix the serialization of PyPDFToDocument component to prevent the default converter from being serialized unnecessarily.
  • Revert change to PyPDFConverter that broke the deserialization of pre 2.6.0 YAMLs.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.6.1

Release Notes

v2.6.1

Bug Fixes

  • Revert change to PyPDFConverter that broke the deserialization of pre 2.6.0 YAMLs.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.6.1-rc1

Release Notes

v2.6.1-rc1

Bug Fixes

  • Revert change to PyPDFConverter that broke the deserialization of pre 2.6.0 YAMLs.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.6.0

Release Notes

⬆️ Upgrade Notes

  • gpt-3.5-turbo was replaced by gpt-4o-mini as the default model for all components relying on OpenAI API
  • Support for the legacy filter syntax and operators (e.g., "$and", "$or", "$eq", "$lt", etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax. See the docs for more details.
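
As a rough illustration of the switch, here is a legacy filter next to its new-syntax equivalent, plus a toy evaluator for the new shape (the evaluator is a sketch for illustration only, not Haystack's implementation, and handles only AND/OR with == and >=):

```python
# Legacy (removed) Haystack v1 syntax:
legacy = {"$and": {"type": {"$eq": "article"}, "rating": {"$gte": 3}}}

# New syntax: an explicit logical operator plus a list of field conditions.
new = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.rating", "operator": ">=", "value": 3},
    ],
}

def matches(doc_meta, flt):
    """Toy evaluator for the new filter syntax (AND/OR plus ==/>= only)."""
    if "conditions" in flt:
        results = (matches(doc_meta, c) for c in flt["conditions"])
        return all(results) if flt["operator"] == "AND" else any(results)
    value = doc_meta.get(flt["field"].removeprefix("meta."))
    return value == flt["value"] if flt["operator"] == "==" else value >= flt["value"]

print(matches({"type": "article", "rating": 4}, new))  # True
print(matches({"type": "blog", "rating": 4}, new))     # False
```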

πŸš€ New Features

  • Added a new component DocumentNDCGEvaluator, which is similar to DocumentMRREvaluator and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground truth relevant documents and the order in which they are retrieved is important.

  • Add new CSVToDocument component. It loads the file as a bytes object and adds the loaded string as a new Document that can be further processed by the DocumentSplitter.

  • Adds support for zero shot document classification via new TransformersZeroShotDocumentClassifier component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.

  • Added the option to use a custom splitting function in DocumentSplitter. The function must accept a string as input and return a list of strings, representing the split units. To use the feature initialise DocumentSplitter with split_by="function" providing the custom splitting function as splitting_function=custom_function.
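
A splitting function is any callable that takes a string and returns the list of split units. A minimal example (the "---" separator is arbitrary, and the DocumentSplitter call in the comment is illustrative only):

```python
def split_on_separator(text: str) -> list[str]:
    """A custom splitting function: a string in, a list of non-empty units out."""
    return [part.strip() for part in text.split("---") if part.strip()]

# With Haystack this would be wired up roughly as (not executed here):
#   splitter = DocumentSplitter(split_by="function", splitting_function=split_on_separator)
print(split_on_separator("intro --- methods --- results"))  # ['intro', 'methods', 'results']
```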

  • Add new JSONConverter Component to convert JSON files to Document. Optionally it can use jq to filter the source JSON files and extract only specific parts.

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)
results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

  • Added a new NLTKDocumentSplitter, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.

  • Updates SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder so model_max_length passed through tokenizer_kwargs also updates the max_seq_length of the underlying SentenceTransformer model.

⚑️ Enhancement Notes

  • Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.

  • Expose default_headers to pass custom headers to Azure API including APIM subscription key.

  • Add optional azure_kwargs dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.

  • Allow adding the current date inside a PromptBuilder template using the following syntax:

    • {% now 'UTC' %}: Get the current date for the UTC timezone.
    • {% now 'America/Chicago' + 'hours=2' %}: Add two hours to the current date in the Chicago timezone.
    • {% now 'Europe/Berlin' - 'weeks=2' %}: Subtract two weeks from the current date in the Berlin timezone.
    • {% now 'Pacific/Fiji' + 'hours=2', '%H' %}: Display only the number of hours after adding two hours to the Fiji timezone.
    • {% now 'Etc/GMT-4', '%I:%M %p' %}: Change the date format to AM/PM for the GMT-4 timezone.

    Note that if no date format is provided, the default is %Y-%m-%d %H:%M:%S. Please refer to the tz database for a list of timezones.
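
A quick stdlib check of that default format (independent of the Jinja `now` tag itself, which handles the timezone lookup and date arithmetic):

```python
from datetime import datetime, timezone

# "%Y-%m-%d %H:%M:%S" is the default pattern when no format is given.
stamp = datetime(2024, 5, 1, 13, 45, 7, tzinfo=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(stamp)  # 2024-05-01 13:45:07
```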

  • Adds usage meta field with prompt_tokens and completion_tokens keys to HuggingFaceAPIChatGenerator.

  • Add new GreedyVariadic input type. This has a similar behaviour to Variadic input type as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input without waiting for others. This replaces the is_greedy argument in the @component decorator. If you had a Component with a Variadic input type and @component(is_greedy=True) you need to change the type to GreedyVariadic and remove is_greedy=true from @component.

  • Add new Pipeline init argument max_runs_per_component; it behaves identically to the existing max_loops_allowed argument but its name better describes its actual effect.

  • Add new PipelineMaxComponentRuns exception to reflect the new max_runs_per_component init argument

  • We added batching during inference time to the TransformersSimilarityRanker to help prevent OOMs when ranking large numbers of Documents.
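
The batching idea can be sketched without any ML dependencies; `score_fn` below stands in for the cross-encoder forward pass, and the toy word-overlap scorer is purely illustrative:

```python
def rank_in_batches(query, docs, score_fn, batch_size=2):
    """Score documents in fixed-size batches so only batch_size items are fed
    to the model at once, then sort by descending score."""
    scores = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        scores.extend(score_fn(query, batch))
    return [doc for _, doc in sorted(zip(scores, docs), reverse=True)]

# Toy scorer: number of words shared between the query and each document.
toy = lambda q, batch: [len(set(q.split()) & set(d.split())) for d in batch]
print(rank_in_batches("fast haystack ranking", ["slow", "fast ranking", "haystack"], toy))
# ['fast ranking', 'haystack', 'slow']
```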

⚠️ Deprecation Notes

  • The DefaultConverter class used by the PyPDFToDocument component has been deprecated. Its functionality will be merged into the component in 2.7.0.
  • Pipeline init argument debug_path is deprecated and will be removed in version 2.7.0.
  • @component decorator is_greedy argument is deprecated and will be removed in version 2.7.0. Use GreedyVariadic type instead.
  • Deprecate connecting a Component to itself when calling Pipeline.connect(), it will raise an error from version 2.7.0 onwards
  • Pipeline init argument max_loops_allowed is deprecated and will be removed in version 2.7.0. Use max_runs_per_component instead.
  • PipelineMaxLoops exception is deprecated and will be removed in version 2.7.0. Use PipelineMaxComponentRuns instead.

πŸ› Bug Fixes

  • Fix the serialization of PyPDFToDocument component to prevent the default converter from being serialized unnecessarily.
  • Add constraints to component.set_input_type and component.set_input_types to prevent undefined behaviour when the run method does not contain a variadic keyword argument.
  • Prevent set_output_types from being called when the output_types decorator is used.
  • Update the CHAT_WITH_WEBSITE Pipeline template to reflect the changes in the HTMLToDocument converter component.
  • Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
  • Fixed the filters in the SentenceWindowRetriever, adding support for three more DocumentStores: Astra, PGVector, and Qdrant
  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this caused an AttributeError
  • Make the from_dict method of the PyPDFToDocument more robust to cases when the converter is not provided in the dictionary.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.6.0-rc3

Release Notes

⬆️ Upgrade Notes

  • gpt-3.5-turbo was replaced by gpt-4o-mini as the default model for all components relying on OpenAI API
  • Support for the legacy filter syntax and operators (e.g., "$and", "$or", "$eq", "$lt", etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax. See the docs for more details.

πŸš€ New Features

  • Added a new component DocumentNDCGEvaluator, which is similar to DocumentMRREvaluator and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground truth relevant documents and the order in which they are retrieved is important.

  • Add new CSVToDocument component. It loads the file as a bytes object and adds the loaded string as a new Document that can be further processed by the DocumentSplitter.

  • Adds support for zero shot document classification via new TransformersZeroShotDocumentClassifier component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.

  • Added the option to use a custom splitting function in DocumentSplitter. The function must accept a string as input and return a list of strings, representing the split units. To use the feature initialise DocumentSplitter with split_by="function" providing the custom splitting function as splitting_function=custom_function.

  • Add new JSONConverter Component to convert JSON files to Document. Optionally it can use jq to filter the source JSON files and extract only specific parts.

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)
results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

  • Added a new NLTKDocumentSplitter, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.

  • Updates SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder so model_max_length passed through tokenizer_kwargs also updates the max_seq_length of the underlying SentenceTransformer model.

⚑️ Enhancement Notes

  • Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.

  • Expose default_headers to pass custom headers to Azure API including APIM subscription key.

  • Add optional azure_kwargs dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.

  • Allow adding the current date inside a PromptBuilder template using the following syntax:

    • {% now 'UTC' %}: Get the current date for the UTC timezone.
    • {% now 'America/Chicago' + 'hours=2' %}: Add two hours to the current date in the Chicago timezone.
    • {% now 'Europe/Berlin' - 'weeks=2' %}: Subtract two weeks from the current date in the Berlin timezone.
    • {% now 'Pacific/Fiji' + 'hours=2', '%H' %}: Display only the number of hours after adding two hours to the Fiji timezone.
    • {% now 'Etc/GMT-4', '%I:%M %p' %}: Change the date format to AM/PM for the GMT-4 timezone.

    Note that if no date format is provided, the default is %Y-%m-%d %H:%M:%S. Please refer to the tz database for a list of timezones.

  • Adds usage meta field with prompt_tokens and completion_tokens keys to HuggingFaceAPIChatGenerator.

  • Add new GreedyVariadic input type. This has a similar behaviour to Variadic input type as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input without waiting for others. This replaces the is_greedy argument in the @component decorator. If you had a Component with a Variadic input type and @component(is_greedy=True) you need to change the type to GreedyVariadic and remove is_greedy=true from @component.

  • Add new Pipeline init argument max_runs_per_component; it behaves identically to the existing max_loops_allowed argument but its name better describes its actual effect.

  • Add new PipelineMaxComponentRuns exception to reflect the new max_runs_per_component init argument

  • We added batching during inference time to the TransformersSimilarityRanker to help prevent OOMs when ranking large numbers of Documents.

⚠️ Deprecation Notes

  • The DefaultConverter class used by the PyPDFToDocument component has been deprecated. Its functionality will be merged into the component in 2.7.0.
  • Pipeline init argument debug_path is deprecated and will be removed in version 2.7.0.
  • @component decorator is_greedy argument is deprecated and will be removed in version 2.7.0. Use GreedyVariadic type instead.
  • Deprecate connecting a Component to itself when calling Pipeline.connect(), it will raise an error from version 2.7.0 onwards
  • Pipeline init argument max_loops_allowed is deprecated and will be removed in version 2.7.0. Use max_runs_per_component instead.
  • PipelineMaxLoops exception is deprecated and will be removed in version 2.7.0. Use PipelineMaxComponentRuns instead.

πŸ› Bug Fixes

  • Fix the serialization of PyPDFToDocument component to prevent the default converter from being serialized unnecessarily.
  • Add constraints to component.set_input_type and component.set_input_types to prevent undefined behaviour when the run method does not contain a variadic keyword argument.
  • Prevent set_output_types from being called when the output_types decorator is used.
  • Update the CHAT_WITH_WEBSITE Pipeline template to reflect the changes in the HTMLToDocument converter component.
  • Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
  • Fixed the filters in the SentenceWindowRetriever, adding support for three more DocumentStores: Astra, PGVector, and Qdrant
  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this caused an AttributeError
  • Make the from_dict method of the PyPDFToDocument more robust to cases when the converter is not provided in the dictionary.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.6.0-rc2

Release Notes

⬆️ Upgrade Notes

  • gpt-3.5-turbo was replaced by gpt-4o-mini as the default model for all components relying on OpenAI API
  • The legacy filter syntax support has been completely removed. Users need to use the new filter syntax. See the docs for more details.

πŸš€ New Features

  • Add new CSVToDocument component. It loads the file as a bytes object and adds the loaded string as a new Document that can be further processed by the DocumentSplitter.

  • Adds support for zero shot document classification via new TransformersZeroShotDocumentClassifier component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.

  • Added the option to use a custom splitting function in DocumentSplitter. The function must accept a string as input and return a list of strings, representing the split units. To use the feature initialise DocumentSplitter with split_by="function" providing the custom splitting function as splitting_function=custom_function.

  • Add new JSONConverter Component to convert JSON files to Document. Optionally it can use jq to filter the source JSON files and extract only specific parts.

```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

  • Added a new NLTKDocumentSplitter, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.

  • Updates SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder so model_max_length passed through tokenizer_kwargs also updates the max_seq_length of the underlying SentenceTransformer model.

⚑️ Enhancement Notes

  • Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.

  • Expose default_headers to pass custom headers to Azure API including APIM subscription key.

  • Add optional azure_kwargs dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.

  • Allow the ability to add the current date inside a template in PromptBuilder using the following syntax:

    • {% now 'UTC' %}: Get the current date for the UTC timezone.
    • {% now 'America/Chicago' + 'hours=2' %}: Add two hours to the current date in the Chicago timezone.
    • {% now 'Europe/Berlin' - 'weeks=2' %}: Subtract two weeks from the current date in the Berlin timezone.
    • {% now 'Pacific/Fiji' + 'hours=2', '%H' %}: Display only the number of hours after adding two hours to the Fiji timezone.
    • {% now 'Etc/GMT-4', '%I:%M %p' %}: Change the date format to AM/PM for the GMT-4 timezone.

    Note that if no date format is provided, the default will be %Y-%m-%d %H:%M:%S. Please refer to the tz database for a list of time zones.
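
The offset arithmetic behind this tag can be sketched in plain Python. The helper below is a hypothetical illustration of the behaviour, not the PromptBuilder implementation:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

def render_now(tz: str, offset: str = "", fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Mimic {% now 'tz' + 'offset', 'fmt' %}: current time in tz, shifted by offset.

    `offset` is a string like "hours=2" or "weeks=-2"; negative values subtract.
    """
    dt = datetime.now(ZoneInfo(tz))
    if offset:
        pairs = (kv.split("=") for kv in offset.split(","))
        dt += timedelta(**{k.strip(): float(v) for k, v in pairs})
    return dt.strftime(fmt)

print(render_now("UTC", fmt="%Y-%m-%d"))        # e.g. the current UTC date
print(render_now("Europe/Berlin", "weeks=-2"))  # two weeks ago in Berlin
```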

  • Adds usage meta field with prompt_tokens and completion_tokens keys to HuggingFaceAPIChatGenerator.

  • Add new GreedyVariadic input type. This has a similar behaviour to Variadic input type as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input without waiting for others. This replaces the is_greedy argument in the @component decorator. If you had a Component with a Variadic input type and @component(is_greedy=True) you need to change the type to GreedyVariadic and remove is_greedy=true from @component.

  • Add new Pipeline init argument max_runs_per_component, this has the same identical behaviour as the existing max_loops_allowed argument but is more descriptive of its actual effects.

  • Add new PipelineMaxLoops to reflect new max_runs_per_component init argument

  • We added batching during inference time to the TransformerSimilarityRanker to help prevent OOMs when ranking large amounts of Documents.

⚠️ Deprecation Notes

  • Pipeline init argument debug_path is deprecated and will be removed in version 2.7.0.
  • @component decorator is_greedy argument is deprecated and will be removed in version 2.7.0. Use GreedyVariadic type instead.
  • Deprecate connecting a Component to itself when calling Pipeline.connect(), it will raise an error from version 2.7.0 onwards
  • Pipeline init argument max_loops_allowed is deprecated and will be removed in version 2.7.0. Use max_runs_per_component instead.
  • PipelineMaxLoops exception is deprecated and will be removed in version 2.7.0. Use PipelineMaxComponentRuns instead.

πŸ› Bug Fixes

  • Add constraints to component.set_input_type and component.set_input_types to prevent undefined behaviour when the run method does not contain a variadic keyword argument.
  • Prevent set_output_types from being called when the output_types decorator is used.
  • Update the CHAT_WITH_WEBSITE Pipeline template to reflect the changes in the HTMLToDocument converter component.
  • Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
  • Fixing the filters in the SentenceWindowRetriever allowing now support for 3 more DocumentStores: Astra, PGVector, Qdrant
  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this was causing an AttributeError
  • Make the from_dict method of the PyPDFToDocument more robust to cases when the converter is not provided in the dictionary.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.1

Release Notes

⚑️ Enhancement Notes

  • Add default_headers init argument to AzureOpenAIGenerator and AzureOpenAIChatGenerator

πŸ› Bug Fixes

  • Fix the Pipeline visualization issue due to changes in the new release of Mermaid
  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this was causing an AttributeError

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.1-rc2

Release Notes

⚑️ Enhancement Notes

  • Add default_headers init argument to AzureOpenAIGenerator and AzureOpenAIChatGenerator

πŸ› Bug Fixes

  • Fix the Pipeline visualization issue due to changes in the new release of Mermaid
  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this was causing an AttributeError

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.1-rc1

Release Notes

⚑️ Enhancement Notes

  • Add default_headers init argument to AzureOpenAIGenerator and AzureOpenAIChatGenerator

πŸ› Bug Fixes

  • Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
  • The from_dict method of ConditionalRouter now correctly handles the case where the dict passed to it contains the key custom_filters explicitly set to None. Previously this was causing an AttributeError

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.0

Release Notes

⬆️ Upgrade Notes

  • Removed ChatMessage.to_openai_format method. Use haystack.components.generators.openai_utils._convert_message_to_openai_format instead.
  • Removed unused debug parameter from Pipeline.run method.
  • Removed deprecated SentenceWindowRetrieval. Use SentenceWindowRetriever instead.

πŸš€ New Features

  β€’ Added the unsafe argument to enable behavior that could lead to remote code execution in ConditionalRouter and OutputAdapter. By default, unsafe behavior is disabled, and users must explicitly set unsafe=True to enable it. When unsafe is enabled, types such as ChatMessage, Document, and Answer can be used as output types. We recommend enabling unsafe behavior only when the Jinja template source is trusted. For more information, see the documentation for ConditionalRouter and OutputAdapter.

⚑️ Enhancement Notes

  • Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.
  • The parameter, min_top_k, has been added to the TopPSampler. This parameter sets the minimum number of documents to be returned when the top-p sampling algorithm selects fewer documents than desired. Documents with the next highest scores are added to meet the minimum. This is useful when guaranteeing a set number of documents to pass through while still allowing the Top-P algorithm to determine if more documents should be sent based on scores.
  • Introduced a utility function to deserialize a generic Document Store from the init_parameters of a serialized component.
  • Refactor deserialize_document_store_in_init_parameters to clarify that the function operates in place and does not return a value.
  • The SentenceWindowRetriever now returns context_documents as well as the context_windows for each Document in retrieved_documents . This allows you to get a list of Documents from within the context window for each retrieved document.

⚠️ Deprecation Notes

  • The default model for OpenAIGenerator and OpenAIChatGenerator, previously 'gpt-3.5-turbo', will be replaced by 'gpt-4o-mini'.

πŸ› Bug Fixes

  • Fixed an issue where page breaks were not being extracted from DOCX files.
  • Used a forward reference for the Paragraph class in the DOCXToDocument converter to prevent import errors.
  • The metadata produced by DOCXToDocument component is now JSON serializable. Previously, it contained datetime objects automatically extracted from DOCX files, which are not JSON serializable. These datetime objects are now converted to strings.
  • Starting from haystack-ai==2.4.0, Haystack is compatible with sentence-transformers>=3.0.0; earlier versions of sentence-transformers are not supported. We have updated the test dependencies and LazyImport messages to reflect this change.
  • For components that support multiple Document Stores, prioritize using the specific from_dict class method for deserialization when available. Otherwise, fall back to the generic default_from_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.0-rc3

Release Notes

Enhancement Notes

  • Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.5.0-rc2

Release Notes

Upgrade Notes

  • Remove ChatMessage.toopenaiformat method. Use haystack.components.generators.openaiutils._convertmessagetoopenai_format instead.
  • Remove unused debug parameter from Pipeline.run method.
  • Removing deprecated SentenceWindowRetrieval, replaced by SentenceWindowRetriever

New Features

  • Add unsafe argument to enable behaviour that could lead to remote code execution in ConditionalRouter and OutputAdapter. By default unsafe behaviour is not enabled, the user must set it explicitly to True. This means that user types like ChatMessage, Document, and Answer can be used as output types when unsafe is True. We recommend using unsafe behaviour only when the Jinja templates source is trusted. For more info see the documentation for ConditionalRouter and OutputAdapter

Enhancement Notes

  • The parameter mintopk is added to the TopPSampler which sets the minimum number of documents to be returned when the top-p sampling algorithm results in fewer documents being selected. The documents with the next highest scores are added to the selection. This is useful when we want to guarantee a set number of documents will always be passed on, but allow the Top-P algorithm to still determine if more documents should be sent based on document score.
  • Introduce an utility function to deserialize a generic Document Store from the init_parameters of a serialized component.
  • Refactor deserializedocumentstoreininit_parameters so that new function name indicates that the operation occurs in place, with no return value.
  • The SentenceWindowRetriever has now an extra output key containing all the documents belonging to the context window.

Deprecation Notes

  • SentenceWindowRetrieval is deprecated and will be removed in future. Use SentenceWindowRetriever instead.
  • The 'gpt-3.5-turbo' as the default model for the OpenAIGenerator and OpenAIChatGenerator will be replaced by 'gpt-4o-mini'.

Bug Fixes

  • Fixed an issue where page breaks were not being extracted from DOCX files.
  • Use a forward reference for the Paragraph class in the DOCXToDocument converter to prevent import errors.
  • The metadata produced by DOCXToDocument component is now JSON serializable. Previously, it contained datetime objects automatically extracted from DOCX files, which are not JSON serializable. Now, the datetime objects are converted to strings.
  • Starting from haystack-ai==2.4.0, Haystack is compatible with sentence-transformers>=3.0.0; earlier versions of sentence-transformers are not supported. We are updating the test dependency and the LazyImport messages to reflect that.
  • For components that support multiple Document Stores, prioritize using the specific fromdict class method for deserialization when available. Otherwise, fall back to the generic defaultfrom_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v1.26.3

Release Notes

v1.26.3

⬆️ Upgrade Notes

  • Upgrades ntlk to 3.9.1 as prior versions are affect by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. Due to these security vulnerabilities, it is not possible to use custom NLTK tokenizer models with the new version (for example in PreProcessor). Users can still use built-in nltk tokenizers by specifying the language parameter in the PreProcessor. See PreProcessor documentation for more details.

⚑️ Enhancement Notes

  • Pins sentence-transformers<=3.0.0,>=2.3.1 and python-pptx<=1.0 to avoid some minor typing incompatibilities with the newer version of the respective libraries.

πŸ› Bug Fixes

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.4.0

Release Notes

v2.4.0

Highlights

πŸ™Œ Local LLMs and custom generation parameters in evaluation

The new api_params init parameter added to LLM-based evaluators such as ContextRelevanceEvaluator and FaithfulnessEvaluator can be used to pass in supported OpenAIGenerator parameters, allowing for custom generation parameters (via generation_kwargs) and local LLM support (via api_base_url).
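
For example, an api_params dict might look like this (the endpoint URL and generation settings below are illustrative assumptions, not values from the release notes):

```python
# Hypothetical example: point an LLM-based evaluator at a local
# OpenAI-compatible server and pin deterministic generation settings.
api_params = {
    "api_base_url": "http://localhost:8000/v1",  # assumed local OpenAI-compatible endpoint
    "generation_kwargs": {"temperature": 0.0, "seed": 42},
}

# Usage sketch (requires haystack-ai and a reachable endpoint):
# from haystack.components.evaluators import ContextRelevanceEvaluator
# evaluator = ContextRelevanceEvaluator(api_params=api_params)

print(sorted(api_params))  # ['api_base_url', 'generation_kwargs']
```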

πŸ“ New Joiner

New AnswerJoiner component to combine multiple lists of Answers.

⬆️ Upgrade Notes

  • The ContextRelevanceEvaluator now returns a list of relevant sentences for each context, instead of all the sentences in a context. Also, a score of 1 is now returned if a relevant sentence is found, and 0 otherwise.
  • Removed the deprecated DynamicPromptBuilder and DynamicChatPromptBuilder components. Use PromptBuilder and ChatPromptBuilder instead.
  • OutputAdapter and ConditionalRouter can't return users inputs anymore.
  • Multiplexer is removed and users should switch to BranchJoiner instead.
  • Removed deprecated init parameters extractortype and tryothers from HTMLToDocument.
  • SentenceWindowRetrieval component has been renamed to SenetenceWindowRetriever.
  • The serializecallbackhandler and deserializecallbackhandler utility functions have been removed. Use serializecallable and deserializecallable instead. For more information on serializecallable and deserializecallable, see the API reference: https://docs.haystack.deepset.ai/reference/utils-api#module-callable_serialization

πŸš€ New Features

  • LLM based evaluators can pass in supported OpenAIGenerator parameters via apiparams. This allows for custom generationkwargs, changing the apibaseurl (for local evaluation), and all other supported parameters as described in the OpenAIGenerator docs.
  • Introduced a new AnswerJoiner component that allows joining multiple lists of Answers into a single list using the Concatenate join mode.
  • Add truncate_dim parameter to Sentence Transformers Embedders, which allows truncating embeddings. Especially useful for models trained with Matryoshka Representation Learning.
  • Add precision parameter to Sentence Transformers Embedders, which allows quantized embeddings. Especially useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.

⚑️ Enhancement Notes

  • Adds modelkwargs and tokenizerkwargs to the components TransformersSimilarityRanker, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder. This allows passing things like modelmaxlength or torch_dtype for better management of model inference.
  • Added unicode_normalization parameter to the DocumentCleaner, allowing to normalize the text to NFC, NFD, NFKC, or NFKD.
  • Added ascii_only parameter to the DocumentCleaner, transforming letters with diacritics to their ASCII equivalent and removing other non-ASCII characters.
  • Improved error messages for deserialization errors.
  • TikaDocumentConverter now returns page breaks ("f") in the output. This only works for PDF files.
  • Enhanced filter application logic to support merging of filters. It facilitates more precise retrieval filtering, allowing for both init and runtime complex filter combinations with logical operators. For more details see https://docs.haystack.deepset.ai/docs/metadata-filtering
  • The streaming_callback parameter can be passed to OpenAIGenerator and OpenAIChatGenerator during pipeline run. This prevents the need to recreate pipelines for streaming callbacks.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Document Python 3.11 and 3.12 support in project configuration.
  • Refactor DocumentJoiner to use enum pattern for the 'join_mode' parameter instead of bare string.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Introduce an utility function to deserialize a generic Document Store from the init_parameters of a serialized component.

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new filter style as described in the documentation - https://docs.haystack.deepset.ai/docs/metadata-filtering
  • Deprecate the method toopenaiformat of the ChatMessage dataclass. This method was never intended to be public and was only used internally. Now, each Chat Generator will know internally how to convert the messages to the format of their specific provider.
  • Deprecate the unused debug parameter in the Pipeline.run method.
  • SentenceWindowRetrieval is deprecated and will be removed in future. Use SentenceWindowRetriever instead.

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

πŸ› Bug Fixes

  • Fix ChatPromptBuilder from_dict method when template value is None.
  • Fix the DocumentCleaner removing the f tag from content preventing from counting page number (by Splitter for example).
  • The DocumentSplitter was incorrectly calculating the splitstartidx and _splitoverlap information due to slight miscalculations of appropriate indices. This fixes those so the splitstartidx and _splitoverlap information is correct.
  • Fix bug in Pipeline.run() executing Components in a wrong and unexpected order
  • Encoding of HTML files in LinkContentFetcher
  • Fix Output Adapter fromdict method when customfilters value is None.
  • Prevent Pipeline.from_dict from modifying the dictionary parameter passed to it.
  • Fix a bug in Pipeline.run() that would cause it to get stuck in an infinite loop and never return. This was caused by Components waiting forever for their inputs when parts of the Pipeline graph are skipped cause of a "decision" Component not returning outputs for that side of the Pipeline.
  • This updates the components, TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder and LocalWhisperTranscriber fromdict methods to work when loading with initparameters only containing required parameters.
  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer versions of structlog.
  • Correctly expose PPTXToDocument component in haystack namespace.
  • Fix TransformersZeroShotTextRouter and TransformersTextRouter fromdict methods to work when initparameters only contain required variables.
  • For components that support multiple Document Stores, prioritize using the specific fromdict class method for deserialization when available. Otherwise, fall back to the generic defaultfrom_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.4.0-rc1

Release Notes

v2.4.0-rc1

Highlights

πŸ™Œ Local LLMs and custom generation parameters in evaluation

The new api_params init parameter added to LLM-based evaluators such as ContextRelevanceEvaluator and FaithfulnessEvaluator can be used to pass in supported OpenAIGenerator parameters, allowing for custom generation parameters (via generation_kwargs) and local LLM support (via api_base_url).

πŸ“ New Joiner

New AnswerJoiner component to combine multiple lists of Answers.

⬆️ Upgrade Notes

  • The ContextRelevanceEvaluator now returns a list of relevant sentences for each context, instead of all the sentences in a context. Also, a score of 1 is now returned if a relevant sentence is found, and 0 otherwise.
  • Removed the deprecated DynamicPromptBuilder and DynamicChatPromptBuilder components. Use PromptBuilder and ChatPromptBuilder instead.
  • OutputAdapter and ConditionalRouter can't return users inputs anymore.
  • Multiplexer is removed and users should switch to BranchJoiner instead.
  • Removed deprecated init parameters extractortype and tryothers from HTMLToDocument.
  • SentenceWindowRetrieval component has been renamed to SenetenceWindowRetriever.
  • The serializecallbackhandler and deserializecallbackhandler utility functions have been removed. Use serializecallable and deserializecallable instead. For more information on serializecallable and deserializecallable, see the API reference: https://docs.haystack.deepset.ai/reference/utils-api#module-callable_serialization

πŸš€ New Features

  • LLM based evaluators can pass in supported OpenAIGenerator parameters via apiparams. This allows for custom generationkwargs, changing the apibaseurl (for local evaluation), and all other supported parameters as described in the OpenAIGenerator docs.
  • Introduced a new AnswerJoiner component that allows joining multiple lists of Answers into a single list using the Concatenate join mode.
  • Add truncate_dim parameter to Sentence Transformers Embedders, which allows truncating embeddings. Especially useful for models trained with Matryoshka Representation Learning.
  • Add precision parameter to Sentence Transformers Embedders, which allows quantized embeddings. Especially useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.

⚑️ Enhancement Notes

  • Adds modelkwargs and tokenizerkwargs to the components TransformersSimilarityRanker, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder. This allows passing things like modelmaxlength or torch_dtype for better management of model inference.
  • Added unicode_normalization parameter to the DocumentCleaner, allowing to normalize the text to NFC, NFD, NFKC, or NFKD.
  • Added ascii_only parameter to the DocumentCleaner, transforming letters with diacritics to their ASCII equivalent and removing other non-ASCII characters.
  • Improved error messages for deserialization errors.
  • TikaDocumentConverter now returns page breaks ("f") in the output. This only works for PDF files.
  • Enhanced filter application logic to support merging of filters. It facilitates more precise retrieval filtering, allowing for both init and runtime complex filter combinations with logical operators. For more details see https://docs.haystack.deepset.ai/docs/metadata-filtering
  • The streaming_callback parameter can be passed to OpenAIGenerator and OpenAIChatGenerator during pipeline run. This prevents the need to recreate pipelines for streaming callbacks.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Document Python 3.11 and 3.12 support in project configuration.
  • Refactor DocumentJoiner to use enum pattern for the 'join_mode' parameter instead of bare string.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Introduce an utility function to deserialize a generic Document Store from the init_parameters of a serialized component.

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new filter style as described in the documentation - https://docs.haystack.deepset.ai/docs/metadata-filtering
  • Deprecate the method toopenaiformat of the ChatMessage dataclass. This method was never intended to be public and was only used internally. Now, each Chat Generator will know internally how to convert the messages to the format of their specific provider.
  • Deprecate the unused debug parameter in the Pipeline.run method.
  • SentenceWindowRetrieval is deprecated and will be removed in future. Use SentenceWindowRetriever instead.

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

πŸ› Bug Fixes

  • Fix ChatPromptBuilder from_dict method when template value is None.
  • Fix the DocumentCleaner removing the f tag from content preventing from counting page number (by Splitter for example).
  • The DocumentSplitter was incorrectly calculating the splitstartidx and _splitoverlap information due to slight miscalculations of appropriate indices. This fixes those so the splitstartidx and _splitoverlap information is correct.
  • Fix bug in Pipeline.run() executing Components in a wrong and unexpected order
  • Encoding of HTML files in LinkContentFetcher
  • Fix Output Adapter fromdict method when customfilters value is None.
  • Prevent Pipeline.from_dict from modifying the dictionary parameter passed to it.
  • Fix a bug in Pipeline.run() that would cause it to get stuck in an infinite loop and never return. This was caused by Components waiting forever for their inputs when parts of the Pipeline graph are skipped cause of a "decision" Component not returning outputs for that side of the Pipeline.
  • This updates the components, TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder and LocalWhisperTranscriber fromdict methods to work when loading with initparameters only containing required parameters.
  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer versions of structlog.
  • Correctly expose PPTXToDocument component in haystack namespace.
  • Fix TransformersZeroShotTextRouter and TransformersTextRouter fromdict methods to work when initparameters only contain required variables.
  • For components that support multiple Document Stores, prioritize using the specific fromdict class method for deserialization when available. Otherwise, fall back to the generic defaultfrom_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.3.1

Release Notes

v2.3.1

⬆️ Upgrade Notes

  • For security reasons, OutputAdapter and ConditionalRouter can only return the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, None and Ellipsis (...). This implies that types like ChatMessage, Document, and Answer cannot be used as output types.
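
The list of allowed output types matches what Python's ast.literal_eval accepts, so the restriction can be illustrated as follows (an analogy we are drawing for clarity, not necessarily the components' exact mechanism):

```python
import ast

def is_literal_output(rendered: str) -> bool:
    """True if the rendered template output parses as a Python literal
    (str, bytes, number, tuple, list, dict, set, bool, None, Ellipsis)."""
    try:
        ast.literal_eval(rendered)
        return True
    except (ValueError, SyntaxError):
        return False

print(is_literal_output("{'answer': 42}"))    # True
print(is_literal_output("__import__('os')"))  # False — a call, not a literal
```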

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • DynamicPromptBuilder
    • DynamicChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

πŸ› Bug Fixes

  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer versions of structlog.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.3.1-rc1

Release Notes

v2.3.1-rc1

⬆️ Upgrade Notes

  • OutputAdapter and ConditionalRouter can't return users inputs anymore.

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • DynamicPromptBuilder
    • DynamicChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

πŸ› Bug Fixes

  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer versions of structlog.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.3.0

Release Notes

Highlights

πŸ§‘β€πŸ”¬ Haystack Experimental Package

Alongside this release, we're introducing a new repository and package: haystack-experimental. This package will be installed alongside haystack-ai and will give you access to experimental components. As the name suggests, these components will be highly exploratory, and may or may not make their way into the main haystack package.

  • Each experimental component in the haystack-experimental repo will have a life-span of 3 months
  • The end of the 3 months marks the end of the experiment. In which case the component will either move to the core haystack package, or be discontinued

To learn more about the experimental package, check out the Experimental Package docs[LINK] and the API references[LINK]. To use components in the experimental package, simply import them with from haystack_experimental.component_type import Component.

What's in there already?

  • The OpenAIFunctionCaller: use this component after Chat Generators to call the functions that the LLM returns.
  • The OpenAPITool: a component designed to interact with RESTful endpoints of OpenAPI services. Its primary function is to generate and send appropriate payloads to these endpoints based on human-provided instructions. OpenAPITool bridges the gap between natural language inputs and structured API calls, making it easier for users to interact with complex APIs and thus integrating the structured world of OpenAPI-specified services with LLM apps.
  • The EvaluationHarness: a tool that wraps pipelines to be evaluated, as well as complex evaluation tasks, into one simple runnable component.

For more information, visit https://github.com/deepset-ai/haystack-experimental or the haystack_experimental reference API at https://docs.haystack.deepset.ai/v2.3/reference/ (bottom left pane)

πŸ“ New Converter

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

  • The deprecated converter_name parameter has been removed from PyPDFToDocument.

    To specify a custom converter for PyPDFToDocument, use the converter initialization parameter and pass an instance of a class that implements the PyPDFConverter protocol.

    The PyPDFConverter protocol defines the methods convert, to_dict and from_dict. A default implementation of PyPDFConverter is provided in the DefaultConverter class.

  • Deprecated HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder have been removed. Use HuggingFaceAPITextEmbedder and HuggingFaceAPIDocumentEmbedder instead.

  • Deprecated HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator have been removed. Use HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator instead.

πŸš€ New Features

  • Added a new SentenceWindowRetrieval component that performs sentence-window retrieval, i.e. it retrieves the surrounding documents of a given document from the document store. This is useful when a document is split into multiple chunks and you want to retrieve the surrounding context of a given chunk.
  • Added custom filters support to ConditionalRouter. Users can now pass in one or more custom Jinja2 filter callables and be able to access those filters when defining condition expressions in routes.
  • Added a new mode in JoinDocuments, Distribution-based rank fusion, as described in [the article](https://medium.com/plain-simple-software/distribution-based-score-fusion-dbsf-a-new-approach-to-vector-search-ranking-f87c37488b18).
  • Added the DocxToDocument component in the converters category. It uses the python-docx library to convert DOCX files to Haystack Documents.
  • Added a PPTX to Document converter using the python-pptx library. It extracts all text from each slide. Each slide is separated with a page break "\f" so a DocumentSplitter can split by slide.
  • The DocumentSplitter now supports split_id and split_overlap to allow for more control over the splitting process.
  • Introduces the TransformersTextRouter! This component uses a transformers text classification pipeline to route text inputs onto different output connections based on the labels of the chosen text classification model.
  • Add memory sharing between different instances of InMemoryDocumentStore. Setting the same index argument as another instance ensures that the memory is shared, e.g.:

    ```python
    index = "my_personal_index"
    document_store_1 = InMemoryDocumentStore(index=index)
    document_store_2 = InMemoryDocumentStore(index=index)

    assert document_store_1.count_documents() == 0
    assert document_store_2.count_documents() == 0

    document_store_1.write_documents([Document(content="Hello world")])
    assert document_store_1.count_documents() == 1
    assert document_store_2.count_documents() == 1
    ```
  • Add a new missing_meta param to MetaFieldRanker, which determines what to do with documents that lack the ranked meta field. Supported values are "bottom" (which puts documents with missing meta at the bottom of the sorted list), "top" (which puts them at the top), and "drop" (which removes them from the results entirely).
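The three missing_meta modes can be sketched in plain Python. rank_by_meta below is a hypothetical helper for illustration only, not the MetaFieldRanker implementation:

```python
def rank_by_meta(docs, field, missing_meta="bottom"):
    """Sort docs (dicts with a "meta" dict) by a meta field, descending,
    handling documents that lack the field per missing_meta."""
    present = [d for d in docs if field in d.get("meta", {})]
    missing = [d for d in docs if field not in d.get("meta", {})]
    present.sort(key=lambda d: d["meta"][field], reverse=True)
    if missing_meta == "drop":
        return present
    if missing_meta == "top":
        return missing + present
    return present + missing  # "bottom" (default)

docs = [
    {"id": "a", "meta": {"rating": 2}},
    {"id": "b", "meta": {}},
    {"id": "c", "meta": {"rating": 5}},
]
print([d["id"] for d in rank_by_meta(docs, "rating", "drop")])  # ['c', 'a']
```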

⚑️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Added a new parameter to EvaluationRunResult.comparative_individual_scores_report() to specify columns to keep in the comparative DataFrame.
  • Added the 'remove_component' method in 'PipelineBase' to delete a component and its connections.
  • Added serialization methods save_to_disk and write_to_disk to InMemoryDocumentStore.
  • When using "openai" for the LLM-based evaluators the metadata from OpenAI will be in the output dictionary, under the key "meta".
  • Removed trafilatura as a direct dependency and made it a lazily imported one.
  • Renamed the DocxToDocument component to DOCXToDocument to follow the naming convention of other converter components.
  • Made the JSON schema validator compatible with all LLMs by switching error template handling to a single user message. This also reduces cost by only including the last error instead of the full message history.
  • Enhanced flexibility in HuggingFace API environment variable names across all related components to support both 'HF_API_TOKEN' and 'HF_TOKEN', improving compatibility with the widely used HF environment variable naming conventions.
  • Updated the ContextRelevance evaluator prompt, explicitly asking to score each statement.
  • Improved LinkContentFetcher to support a broader range of content types, including glob patterns for text, application, audio, and video types. This update introduces a more flexible content handler resolution mechanism, allowing for direct matches and pattern matching, thereby greatly improving the handler's adaptability to various content types encountered on the web.
  • Added max_retries to AzureOpenAIGenerator. AzureOpenAIGenerator can now be initialised by setting max_retries. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5. The timeout for AzureOpenAIGenerator, if not set, is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.
  • Pipeline serialization to YAML now supports tuples as field values.
  • Add support for [structlog context variables](https://www.structlog.org/en/24.2.0/contextvars.html) to structured logging.
  • AnswerBuilder can now accept ChatMessages as input in addition to strings. When using ChatMessages, metadata will be automatically added to the answer.
  • Update the error message when the sentence-transformers library is not installed and the used component requires it.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Improved error messages for deserialization errors.
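The replace and merge filter policies mentioned above boil down to a choice between runtime filters overriding or being combined with init-time filters. The sketch below uses a simplified flat-dict view of filters and a hypothetical apply_policy helper, not the actual apply_filter_policy signature:

```python
def apply_policy(init_filters, runtime_filters, policy="replace"):
    """Combine init-time and runtime filters according to the policy.

    Filters are shown as flat dicts for simplicity; real Haystack 2.x
    filters are nested condition trees.
    """
    if not runtime_filters:
        return init_filters
    if policy == "merge":
        merged = dict(init_filters or {})
        merged.update(runtime_filters)  # runtime values win on conflicts
        return merged
    return runtime_filters  # "replace": runtime filters take over entirely

init_f = {"meta.lang": "en"}
run_f = {"meta.year": 2024}
print(apply_policy(init_f, run_f, "replace"))  # {'meta.year': 2024}
print(apply_policy(init_f, run_f, "merge"))    # {'meta.lang': 'en', 'meta.year': 2024}
```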

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new filter style as described in the documentation - https://docs.haystack.deepset.ai/docs/metadata-filtering
  • The output of the ContextRelevanceEvaluator will change in Haystack 2.4.0. Contexts will be scored as a whole instead of individual statements and only the relevant sentences will be returned. A score of 1 is now returned if a relevant sentence is found, and 0 otherwise.
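As a quick reference for the filter migration mentioned above, a legacy 1.x filter and its 2.x equivalent look roughly like this (field names are illustrative):

```python
# Haystack 1.x legacy style (deprecated):
legacy_filters = {"lang": "en", "year": {"$gte": 2020}}

# Haystack 2.x style: an explicit condition tree.
new_filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.lang", "operator": "==", "value": "en"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}
print(new_filters["operator"])  # AND
```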

πŸ› Bug Fixes

  • Encoding of HTML files in LinkContentFetcher
  • Updated the from_dict methods of TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder, and LocalWhisperTranscriber to work when loading with init_parameters that only contain required parameters.
  • Fixed the TransformersZeroShotTextRouter and TransformersTextRouter from_dict methods to work when init_parameters only contain required variables.
  • SASEvaluator now raises a ValueError if a None value is contained in the predicted_answers input.
  • Automatically enable tracing upon import if ddtrace or opentelemetry is installed.
  • Fixed meta handling of ByteStreams in Azure OCR.
  • Use the new filter syntax in the CacheChecker component instead of the legacy one.
  • Solved a serialization bug in 'ChatPromptBuilder' by creating 'to_dict' and 'from_dict' methods on 'ChatMessage' and 'ChatPromptBuilder'.
  • Fixed some bugs when running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Other branches would cause the Pipeline to get stuck waiting to run, even if they received no inputs. The behaviour depended on whether the Component not receiving the input had an optional input or not.
  • Fixed the calculation for MRR and MAP scores.
  • Fix the deserialization of pipelines containing evaluator components that were subclasses of LLMEvaluator.
  • Fix recursive JSON type conversion in the schema validator to be less aggressive (no infinite recursion).
  • Adds the missing 'organization' parameter to the serialization function.
  • Correctly serialize tuples and types in the init parameters of the LLMEvaluator component and its subclasses.
  • Pin numpy<2 to avoid breaking changes that cause several core integrations to fail. Pin tenacity too (8.4.0 is broken).

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.3.0-rc2

Release Notes

v2.3.0-rc2

πŸš€ New Features

  • Added a new component that performs sentence-window retrieval, i.e. it retrieves the surrounding documents of a given document from the document store. This is useful when a document is split into multiple chunks and you want to retrieve the surrounding context of a given chunk.

⚑️ Enhancement Notes

  • Enhanced the PyPDF converter to ensure backwards compatibility with Pipelines dumped with versions older than 2.3.0. The update includes a conditional check to automatically default to the DefaultConverter if a specific converter is not provided, improving the component's robustness and ease of use.

πŸ› Bug Fixes

  • Encoding of HTML files in LinkContentFetcher
  • Updated the from_dict methods of TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder, and LocalWhisperTranscriber to work when loading with init_parameters that only contain required parameters.
  • Fixed the TransformersZeroShotTextRouter and TransformersTextRouter from_dict methods to work when init_parameters only contain required variables.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.3.0-rc1

Release Notes

Highlights

Adding the DocxToDocument component to convert Docx files to Documents.

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

  • The deprecated converter_name parameter has been removed from PyPDFToDocument.

    To specify a custom converter for PyPDFToDocument, use the converter initialization parameter and pass an instance of a class that implements the PyPDFConverter protocol.

    The PyPDFConverter protocol defines the methods convert, to_dict and from_dict. A default implementation of PyPDFConverter is provided in the DefaultConverter class.

  • Deprecated HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder have been removed. Use HuggingFaceAPITextEmbedder and HuggingFaceAPIDocumentEmbedder instead.

  • Deprecated HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator have been removed. Use HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator instead.

πŸš€ New Features

  • Added custom filters support to ConditionalRouter. Users can now pass in one or more custom Jinja2 filter callables and be able to access those filters when defining condition expressions in routes.
  • Added a new mode in JoinDocuments, Distribution-based rank fusion, as described in [the article](https://medium.com/plain-simple-software/distribution-based-score-fusion-dbsf-a-new-approach-to-vector-search-ranking-f87c37488b18).
  • Added the DocxToDocument component in the converters category. It uses the python-docx library to convert DOCX files to Haystack Documents.
  • Added haystack-experimental to the project's dependencies to enable automatic use of cutting-edge features from Haystack. Users can now access components from haystack-experimental by simply importing them from haystack_experimental instead of haystack. For more information, visit https://github.com/deepset-ai/haystack-experimental.
  • Added a PPTX to Document converter using the python-pptx library. It extracts all text from each slide. Each slide is separated with a page break "\f" so a DocumentSplitter can split by slide.
  • The DocumentSplitter now supports split_id and split_overlap to allow for more control over the splitting process.
  • Introduces the TransformersTextRouter! This component uses a transformers text classification pipeline to route text inputs onto different output connections based on the labels of the chosen text classification model.
  • Add memory sharing between different instances of InMemoryDocumentStore. Setting the same index argument as another instance ensures that the memory is shared, e.g.:

    ```python
    index = "my_personal_index"
    document_store_1 = InMemoryDocumentStore(index=index)
    document_store_2 = InMemoryDocumentStore(index=index)

    assert document_store_1.count_documents() == 0
    assert document_store_2.count_documents() == 0

    document_store_1.write_documents([Document(content="Hello world")])
    assert document_store_1.count_documents() == 1
    assert document_store_2.count_documents() == 1
    ```
  • Add a new missing_meta param to MetaFieldRanker, which determines what to do with documents that lack the ranked meta field. Supported values are "bottom" (which puts documents with missing meta at the bottom of the sorted list), "top" (which puts them at the top), and "drop" (which removes them from the results entirely).

⚑️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Added a new parameter to EvaluationRunResult.comparative_individual_scores_report() to specify columns to keep in the comparative DataFrame.
  • Added the 'remove_component' method in 'PipelineBase' to delete a component and its connections.
  • Added serialization methods save_to_disk and write_to_disk to InMemoryDocumentStore.
  • When using "openai" for the LLM-based evaluators the metadata from OpenAI will be in the output dictionary, under the key "meta".
  • Removed trafilatura as a direct dependency and made it a lazily imported one.
  • Renamed the DocxToDocument component to DOCXToDocument to follow the naming convention of other converter components.
  • Made the JSON schema validator compatible with all LLMs by switching error template handling to a single user message. This also reduces cost by only including the last error instead of the full message history.
  • Enhanced flexibility in HuggingFace API environment variable names across all related components to support both 'HF_API_TOKEN' and 'HF_TOKEN', improving compatibility with the widely used HF environment variable naming conventions.
  • Updated the ContextRelevance evaluator prompt, explicitly asking to score each statement.
  • Improved LinkContentFetcher to support a broader range of content types, including glob patterns for text, application, audio, and video types. This update introduces a more flexible content handler resolution mechanism, allowing for direct matches and pattern matching, thereby greatly improving the handler's adaptability to various content types encountered on the web.
  • Added max_retries to AzureOpenAIGenerator. AzureOpenAIGenerator can now be initialised by setting max_retries. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5. The timeout for AzureOpenAIGenerator, if not set, is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.
  • Pipeline serialization to YAML now supports tuples as field values.
  • Add support for [structlog context variables](https://www.structlog.org/en/24.2.0/contextvars.html) to structured logging.
  • AnswerBuilder can now accept ChatMessages as input in addition to strings. When using ChatMessages, metadata will be automatically added to the answer.
  • Update the error message when the sentence-transformers library is not installed and the used component requires it.

⚠️ Deprecation Notes

  • The output of the ContextRelevanceEvaluator will change in Haystack 2.4.0. Contexts will be scored as a whole instead of individual statements and only the relevant sentences will be returned. A score of 1 is now returned if a relevant sentence is found, and 0 otherwise.

πŸ› Bug Fixes

  • SASEvaluator now raises a ValueError if a None value is contained in the predicted_answers input.
  • Automatically enable tracing upon import if ddtrace or opentelemetry is installed.
  • Fixed meta handling of ByteStreams in Azure OCR.
  • Use the new filter syntax in the CacheChecker component instead of the legacy one.
  • Solved a serialization bug in 'ChatPromptBuilder' by creating 'to_dict' and 'from_dict' methods on 'ChatMessage' and 'ChatPromptBuilder'.
  • Fixed some bugs when running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Other branches would cause the Pipeline to get stuck waiting to run, even if they received no inputs. The behaviour depended on whether the Component not receiving the input had an optional input or not.
  • Fixed the calculation for MRR and MAP scores.
  • Fix the deserialization of pipelines containing evaluator components that were subclasses of LLMEvaluator.
  • Fix recursive JSON type conversion in the schema validator to be less aggressive (no infinite recursion).
  • Adds the missing 'organization' parameter to the serialization function.
  • Correctly serialize tuples and types in the init parameters of the LLMEvaluator component and its subclasses.
  • Pin numpy<2 to avoid breaking changes that cause several core integrations to fail. Pin tenacity too (8.4.0 is broken).

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.4

Release Notes

v2.2.4

⚑️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.

πŸ› Bug Fixes

  • Meta handling of bytestreams in Azure OCR has been fixed.
  • Fixed some bugs when running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Other branches would cause the Pipeline to get stuck waiting to run, even if they received no inputs. The behaviour depended on whether the Component not receiving the input had an optional input or not.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.4-rc1

Release Notes

v2.2.4-rc1

⚑️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.

πŸ› Bug Fixes

  • Fixed some bugs when running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Other branches would cause the Pipeline to get stuck waiting to run, even if they received no inputs. The behaviour depended on whether the Component not receiving the input had an optional input or not.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.3

Release Notes

v2.2.3

πŸ› Bug Fixes

  • Pin numpy<2 to avoid breaking changes that cause several core integrations to fail. Pin tenacity too (8.4.0 is broken).

⚑️ Enhancement Notes

  • Export ChatPromptBuilder in builders module

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.2

Release Notes

v2.2.2

πŸ› Bug Fixes

  • Add missing metrics column in DataFrame returned by EvaluationRunResult.score_report()

- Python
Published by silvanocerza over 1 year ago

farm-haystack - v2.2.2-rc1

Release Notes

v2.2.2-rc1

πŸ› Bug Fixes

  • Add missing metrics column in DataFrame returned by EvaluationRunResult.score_report()

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v1.26.2

Release Notes

v1.26.2

πŸ› Bug Fixes

  • Export fetch_archive_from_http in utils/__init__.py

- Python
Published by silvanocerza over 1 year ago

farm-haystack - v2.2.1

Release Notes

v2.2.1

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

⚑️ Enhancement Notes

  • Removed trafilatura as a direct dependency and made it a lazily imported one.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.1-rc1

Release Notes

v2.2.1-rc1

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

⚑️ Enhancement Notes

  • Removed trafilatura as a direct dependency and made it a lazily imported one.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v1.26.1

Release Notes

v1.26.1

πŸš€ New Features

  • Add the previously removed fetch_archive_from_http util function to fetch zip and gzip archives from a URL

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v1.26.0

Release Notes

v1.26.0

Prelude

We are announcing that Haystack 1.26 is the final minor release for Haystack 1.x. Although we will continue to release bug fixes for this version, we will neither be adding nor removing any functionalities. Instead, we will focus our efforts on Haystack 2.x. Haystack 1.26 will reach its end-of-life on March 11, 2025.

The utility functions fetch_archive_from_http, build_pipeline and add_example_data were removed from Haystack.

This release changes the PDFToTextConverter so that it doesn't support PyMuPDF anymore. The converter will always assume xpdf is used by default.

⬆️ Upgrade Notes

  • We recommend replacing calls to the fetch_archive_from_http function with other tools available in Python or in your operating system.
  • To keep using PyMuPDF you must create a custom node; you can use the previous Haystack version for inspiration.

⚑️ Enhancement Notes

  • Add a raise_on_failure flag to the BaseConverter class so that big processes can optionally continue without breaking from exceptions.

  • Support for Llama3 models on AWS Bedrock.

  • Support for MistralAI and new Claude 3 models on AWS Bedrock.

  • Upgrade Transformers to the latest version 4.37.2. This version adds support for the Phi-2 and Qwen2 models and improves support for quantization.

  • Upgrade transformers to version 4.39.3 so that Haystack can support the new Cohere Command R models.

  • Add support for latest OpenAI embedding models text-embedding-3-large and text-embedding-3-small.

  • API_BASE can now be passed as an optional parameter in the getting started sample. Only the openai provider is supported in this set of changes. PromptNode and PromptModel were enhanced to allow passing of this parameter. This allows RAG against a local endpoint (e.g., http://localhost:1234/v1), as long as it is OpenAI compatible (such as LM Studio)

    Logging in the getting started sample was made more verbose, to make it easier for people to see what was happening under the covers.

  • Added new option split_by="page" to the preprocessor so we can chunk documents by page break.

  • Review and update context windows for OpenAI GPT models.

  • Support gated repos for Huggingface inference.

  • Add a check to verify that the embedding dimension set in the FAISS Document Store and retriever are equal before running embedding calculations.
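The split_by="page" option above amounts to cutting text on form-feed characters. The helper below is a minimal stdlib sketch of the idea, not the PreProcessor implementation:

```python
def split_by_page(text):
    """Split a document on form-feed (page-break) characters."""
    return [page.strip() for page in text.split("\f") if page.strip()]

doc = "Page one text.\fPage two text.\fPage three."
print(split_by_page(doc))  # ['Page one text.', 'Page two text.', 'Page three.']
```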

πŸ› Bug Fixes

  • Fixed a Pipeline run error when using the FileTypeClassifier with the raise_on_error: True option. Instead of returning an unexpected NoneType, we route the file to a dead-end edge.

  • Ensure that the crawled files are downloaded to the output_dir directory, as specified in the Crawler constructor. Previously, some files were incorrectly downloaded to the current working directory.

  • Fixes the SearchEngineDocumentStore.get_metadata_values_by_key method to make use of self.index if no index is provided.

  • Fixes OutputParser usage in PromptTemplate after making invocation context immutable in https://github.com/deepset-ai/haystack/pull/7510.

  • When using a Pipeline with a JoinNode (e.g. JoinDocuments) all information from the previous nodes was lost other than a few select fields (e.g. documents). This was due to the JoinNode not properly passing on the information from the previous nodes. This has been fixed and now all information from the previous nodes is passed on to the next node in the pipeline.

    For example, consider a pipeline that rewrites the query during pipeline execution, combined with a hybrid retrieval setup that requires a JoinDocuments node. Specifically, the first prompt node rewrites the query to fix all spelling errors, and this new query is used for retrieval. The JoinDocuments node now passes on the rewritten query so it can be used by the QAPromptNode, whereas before it would pass on the original query.

    ```python
    from haystack import Pipeline
    from haystack.nodes import BM25Retriever, EmbeddingRetriever, PromptNode, Shaper, JoinDocuments, PromptTemplate
    from haystack.document_stores import InMemoryDocumentStore

    document_store = InMemoryDocumentStore(use_bm25=True)
    dicts = [{"content": "The capital of Germany is Berlin."}, {"content": "The capital of France is Paris."}]
    document_store.write_documents(dicts)

    query_prompt_node = PromptNode(
        model_name_or_path="gpt-3.5-turbo",
        api_key="",
        default_prompt_template=PromptTemplate(
            "You are a spell checker. Given a user query return the same query with all spelling errors fixed."
            "\nUser Query: {query}\nSpell Checked Query:"
        ),
    )
    shaper = Shaper(func="join_strings", inputs={"strings": "results"}, outputs=["query"])
    qa_prompt_node = PromptNode(
        model_name_or_path="gpt-3.5-turbo",
        api_key="",
        default_prompt_template=PromptTemplate("Answer the user query.\nQuery: {query}"),
    )
    sparse_retriever = BM25Retriever(document_store=document_store, top_k=2)
    dense_retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model="intfloat/e5-base-v2",
        model_format="sentence_transformers",
        top_k=2,
    )
    document_store.update_embeddings(dense_retriever)

    pipeline = Pipeline()
    pipeline.add_node(component=query_prompt_node, name="QueryPromptNode", inputs=["Query"])
    pipeline.add_node(component=shaper, name="ListToString", inputs=["QueryPromptNode"])
    pipeline.add_node(component=sparse_retriever, name="BM25", inputs=["ListToString"])
    pipeline.add_node(component=dense_retriever, name="Embedding", inputs=["ListToString"])
    pipeline.add_node(component=JoinDocuments(join_mode="concatenate"), name="Join", inputs=["BM25", "Embedding"])
    pipeline.add_node(component=qa_prompt_node, name="QAPromptNode", inputs=["Join"])

    out = pipeline.run(query="What is the captial of Grmny?", debug=True)
    print(out["invocation_context"])
    # Before the fix:
    # {'query': 'What is the captial of Grmny?',  <-- original query!
    #  'results': ['The capital of Germany is Berlin.'],
    #  'prompts': ['Answer the user query. Query: What is the captial of Grmny?']}  <-- original query!
    # After the fix:
    # {'query': 'What is the capital of Germany?',  <-- rewritten query!
    #  'results': ['The capital of Germany is Berlin.'],
    #  'prompts': ['Answer the user query. Query: What is the capital of Germany?']}  <-- rewritten query!
    ```

  • When passing empty inputs (such as query="") to PromptNode, the node would raise an error. This has been fixed.

  • Changed the dummy vector used internally in the Pinecone Document Store. A recent change to the Pinecone API no longer allows vectors filled with zeros, which was the previous dummy vector.

  • The types of metadata values accepted by RouteDocuments were unnecessarily restricted to string types. This caused validation errors (for example when loading from a YAML file) if a user tried to use, for example, a boolean type. We add boolean and int types as valid types for metadata_values.

  • Fixed a bug that made it impossible to write Documents to Weaviate when some of the fields were empty lists (e.g. split_overlap for preprocessed documents).

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v1.26.0-rc1

Release Notes

v1.26.0-rc1

Prelude

The utility functions fetch_archive_from_http, build_pipeline and add_example_data were removed from Haystack.

This release changes the PDFToTextConverter so that it doesn't support PyMuPDF anymore. The converter will always assume xpdf is used by default.

⬆️ Upgrade Notes

  • We recommend replacing calls to the fetch_archive_from_http function with other tools available in Python or in your operating system.
  • To keep using PyMuPDF you must create a custom node; you can use the previous Haystack version for inspiration.

⚑️ Enhancement Notes

  • Support for Llama3 models on AWS Bedrock.
  • Support for MistralAI and new Claude 3 models on AWS Bedrock.
  • Upgrade transformers to version 4.39.3 so that Haystack can support the new Cohere Command R models.
  • Review and update context windows for OpenAI GPT models.
  • Support gated repos for Huggingface inference.
  • Add a check to verify that the embedding dimension set in the FAISS Document Store and retriever are equal before running embedding calculations.

πŸ› Bug Fixes

  • Fixed a Pipeline run error when using the FileTypeClassifier with the raise_on_error: True option. Instead of returning an unexpected NoneType, we route the file to a dead-end edge.

  • Ensure that the crawled files are downloaded to the output_dir directory, as specified in the Crawler constructor. Previously, some files were incorrectly downloaded to the current working directory.

  • Fixes the SearchEngineDocumentStore.get_metadata_values_by_key method to make use of self.index if no index is provided.

  • Fixes OutputParser usage in PromptTemplate after making invocation context immutable in https://github.com/deepset-ai/haystack/pull/7510.

  • When using a Pipeline with a JoinNode (e.g. JoinDocuments), all information from the previous nodes was lost other than a few select fields (e.g. documents), because the JoinNode did not properly pass it on. This has been fixed: all information from the previous nodes is now passed on to the next node in the pipeline.

    For example, consider a pipeline that rewrites the query during pipeline execution, combined with a hybrid retrieval setup that requires a JoinDocuments node. The first PromptNode rewrites the query to fix all spelling errors, and this corrected query is used for retrieval. The JoinDocuments node now passes on the rewritten query so it can be used by the QAPromptNode, whereas before it passed on the original query.

```python
from haystack import Pipeline
from haystack.nodes import BM25Retriever, EmbeddingRetriever, PromptNode, Shaper, JoinDocuments, PromptTemplate
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)
dicts = [
    {"content": "The capital of Germany is Berlin."},
    {"content": "The capital of France is Paris."},
]
document_store.write_documents(dicts)

query_prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="",
    default_prompt_template=PromptTemplate(
        "You are a spell checker. Given a user query return the same query "
        "with all spelling errors fixed.\nUser Query: {query}\nSpell Checked Query:"
    ),
)
shaper = Shaper(func="join_strings", inputs={"strings": "results"}, outputs=["query"])
qa_prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="",
    default_prompt_template=PromptTemplate("Answer the user query.\nQuery: {query}"),
)
sparse_retriever = BM25Retriever(document_store=document_store, top_k=2)
dense_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="intfloat/e5-base-v2",
    model_format="sentence_transformers",
    top_k=2,
)
document_store.update_embeddings(dense_retriever)

pipeline = Pipeline()
pipeline.add_node(component=query_prompt_node, name="QueryPromptNode", inputs=["Query"])
pipeline.add_node(component=shaper, name="ListToString", inputs=["QueryPromptNode"])
pipeline.add_node(component=sparse_retriever, name="BM25", inputs=["ListToString"])
pipeline.add_node(component=dense_retriever, name="Embedding", inputs=["ListToString"])
pipeline.add_node(
    component=JoinDocuments(join_mode="concatenate"), name="Join", inputs=["BM25", "Embedding"]
)
pipeline.add_node(component=qa_prompt_node, name="QAPromptNode", inputs=["Join"])

out = pipeline.run(query="What is the captial of Grmny?", debug=True)
print(out["invocation_context"])

# Before the fix:
# {'query': 'What is the captial of Grmny?',  <-- original query!
#  'results': ['The capital of Germany is Berlin.'],
#  'prompts': ['Answer the user query. Query: What is the captial of Grmny?'], ...}

# After the fix:
# {'query': 'What is the capital of Germany?',  <-- rewritten query!
#  'results': ['The capital of Germany is Berlin.'],
#  'prompts': ['Answer the user query. Query: What is the capital of Germany?'], ...}
```

  • When passing empty inputs (such as query="") to PromptNode, the node would raise an error. This has been fixed.
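The get_metadata_values_by_key fix above boils down to the usual fall-back-to-instance-default pattern, sketched here with a hypothetical minimal class (not the actual document store code):

```python
from typing import Optional


class DemoDocumentStore:
    """Toy stand-in illustrating the index fallback behavior."""

    def __init__(self, index: str = "document"):
        self.index = index

    def get_metadata_values_by_key(self, key: str, index: Optional[str] = None) -> str:
        # Fall back to the store's own index when none is passed explicitly.
        index = index if index is not None else self.index
        return f"{index}:{key}"


store = DemoDocumentStore(index="my-index")
store.get_metadata_values_by_key("author")           # uses "my-index"
store.get_metadata_values_by_key("author", "other")  # uses "other"
```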

v1.26.0-rc0

⚑️ Enhancement Notes

  • Add a raise_on_failure flag to the BaseConverter class so that long-running processes can optionally continue without being interrupted by exceptions.

  • Upgrade Transformers to the latest version 4.37.2. This version adds support for the Phi-2 and Qwen2 models and improves support for quantization.

  • Add support for latest OpenAI embedding models text-embedding-3-large and text-embedding-3-small.

  • API_BASE can now be passed as an optional parameter in the getting started sample. Only the openai provider is supported in this set of changes. PromptNode and PromptModel were enhanced to allow passing of this parameter. This allows RAG against a local endpoint (e.g. http://localhost:1234/v1), as long as it is OpenAI-compatible (such as LM Studio).

    Logging in the getting started sample was made more verbose, to make it easier for people to see what is happening under the covers.

  • Added a new option split_by="page" to the preprocessor so documents can be chunked by page break.
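Page-based splitting typically keys on the form-feed character ("\f") that PDF-to-text converters emit between pages. A bare-bones illustration of the idea (not the preprocessor's actual code):

```python
text = "Page one content.\fPage two content.\fPage three content."

# Split on the form-feed page-break marker.
pages = [page.strip() for page in text.split("\f")]
print(pages)  # ['Page one content.', 'Page two content.', 'Page three content.']
```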

πŸ› Bug Fixes

  • Changed the dummy vector used internally in the Pinecone Document Store. A recent change to the Pinecone API no longer allows vectors filled with zeros, which the previous dummy vector was.
  • The types of metadata values accepted by RouteDocuments were unnecessarily restricted to strings. This caused validation errors (for example when loading from a YAML file) if a user tried to use, say, a boolean. Boolean and int are now also valid types for metadata_values.
  • Fixed a bug that made it impossible to write Documents to Weaviate when some of the fields were empty lists (e.g. split_overlap for preprocessed documents).

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.0

Release Notes

v2.2.0

Highlights

The Multiplexer component proved to be hard to explain and to understand. After reviewing its use cases, the documentation was rewritten and the component was renamed to BranchJoiner to better explain its functionalities.

Added the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables to the OpenAI components.

⬆️ Upgrade Notes

  • BranchJoiner has exactly the same interface as Multiplexer. To upgrade your code, just rename any occurrence of Multiplexer to BranchJoiner and adjust the imports accordingly.

πŸš€ New Features

  • Add BranchJoiner to eventually replace Multiplexer.
  • AzureOpenAIGenerator and AzureOpenAIChatGenerator can now be configured by passing a timeout for the underlying AzureOpenAI client.

⚑️ Enhancement Notes

  • ChatPromptBuilder now supports changing its template at runtime. This allows you to define a default template and then change it based on your needs at runtime.
  • If an LLM-based evaluator (e.g., Faithfulness or ContextRelevance) is initialised with raise_on_failure=False, and if a call to an LLM fails or an LLM outputs invalid JSON, the score of the sample is set to NaN instead of an exception being raised. The user is notified with a warning indicating the number of requests that failed.
  • Adds inference mode to the model call of the ExtractiveReader. This prevents PyTorch from calculating gradients during inference.
  • The DocumentCleaner class has an optional keep_id attribute that, if set to True, keeps the document ids unchanged after cleanup.
  • DocumentSplitter now has an optional split_threshold parameter. Use this parameter if you would rather not split inputs that are only slightly longer than the allowed split_length. If, when chunking, one of the chunks is smaller than the split_threshold, the chunk is concatenated with the previous one. This avoids chunks that are too small to be meaningful.
  • Re-implement InMemoryDocumentStore BM25 search with incremental indexing by avoiding re-creating the entire inverse index for every new query. This change also removes the dependency on haystack_bm25. Please refer to [PR #7549](https://github.com/deepset-ai/haystack/pull/7549) for the full context.
  • Improved MIME type management by directly setting MIME types on ByteStreams, enhancing the overall handling and routing of different file types. This update makes MIME type data more consistently accessible and simplifies the process of working with various document formats.
  • PromptBuilder now supports changing its template at runtime (e.g. for Prompt Engineering). This allows you to define a default template and then change it based on your needs at runtime.
  • Now you can set the timeout and max_retries parameters on OpenAI components by setting the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables or passing them at __init__.
  • The DocumentJoiner component's run method now accepts a top_k parameter, allowing users to specify the maximum number of documents to return at query time. This fixes issue #7702.
  • Enforce JSON mode on OpenAI LLM-based evaluators so that they always return valid JSON output. This ensures that the output is always in a consistent format, regardless of the input.
  • Make warm_up() usage consistent across the codebase.
  • Create a class hierarchy for pipeline classes, and move the run logic into the child class. Preparation work for introducing multiple run strategies.
  • Make the SerperDevWebSearch more robust when snippet is not present in the request response.
  • Make SparseEmbedding a dataclass; this makes it easier to use the class with Pydantic.
  • `HTMLToDocument`: change the HTML conversion backend from boilerpy3 to trafilatura, which is more robust and better maintained.
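The split_threshold behavior described above can be pictured with a small sketch (illustrative only, not the DocumentSplitter implementation): any chunk shorter than the threshold is folded back into the previous chunk.

```python
def merge_small_chunks(chunks, split_threshold):
    """Merge any chunk with fewer than split_threshold words into the previous chunk."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < split_threshold:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


merge_small_chunks(["one two three four", "five"], split_threshold=3)
# ['one two three four five']
```

A trailing one-word chunk is absorbed rather than emitted as a near-useless standalone document.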

⚠️ Deprecation Notes

  • Multiplexer is now deprecated.
  • DynamicChatPromptBuilder has been deprecated as ChatPromptBuilder fully covers its functionality. Use ChatPromptBuilder instead.
  • DynamicPromptBuilder has been deprecated as PromptBuilder fully covers its functionality. Use PromptBuilder instead.
  • The following parameters of HTMLToDocument are ignored and will be removed in Haystack 2.4.0: extractor_type and try_others.

πŸ› Bug Fixes

  • FaithfulnessEvaluator and ContextRelevanceEvaluator now return 0 instead of NaN when applied to an empty context or empty statements.
  • Fixed the Azure generator components; they were missing the @component decorator.
  • Updates the from_dict method of SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder, NamedEntityExtractor, SentenceTransformersDiversityRanker and LocalWhisperTranscriber to allow None as a valid value for device when deserializing from a YAML file. This allows a deserialized pipeline to auto-determine what device to use via the ComponentDevice.resolve_device logic.
  • Fix the broken serialization of HuggingFaceAPITextEmbedder, HuggingFaceAPIDocumentEmbedder, HuggingFaceAPIGenerator, and HuggingFaceAPIChatGenerator.
  • Fix NamedEntityExtractor crashing in Python 3.12 if constructed using a string backend argument.
  • Fixed the PdfMinerToDocument converter's outputs to be properly wired up to 'documents'.
  • Add to_dict method to DocumentRecallEvaluator to allow proper serialization of the component.
  • Improves/fixes type serialization of PEP 585 types (e.g. list[Document], and their nested version). This improvement enables better serialization of generics and nested types and improves/fixes matching of list[X] and List[X] types in component connections after serialization.
  • Fixed (de)serialization of NamedEntityExtractor. Includes updated tests verifying these fixes when NamedEntityExtractor is used in pipelines.
  • The include_outputs_from parameter in Pipeline.run correctly returns outputs of components with multiple outputs.
  • Return an empty list of answers when ExtractiveReader receives an empty list of documents instead of raising an exception.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.0-rc2

Release Notes

v2.2.0-rc1

Highlights

The Multiplexer component proved to be hard to explain and to understand. After reviewing its use cases, the documentation was rewritten and the component was renamed to BranchJoiner to better explain its functionalities.

Added the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables to the OpenAI components.

⬆️ Upgrade Notes

  • BranchJoiner has exactly the same interface as Multiplexer. To upgrade your code, just rename any occurrence of Multiplexer to BranchJoiner and adjust the imports accordingly.

πŸš€ New Features

  • Add BranchJoiner to eventually replace Multiplexer.
  • AzureOpenAIGenerator and AzureOpenAIChatGenerator can now be configured by passing a timeout for the underlying AzureOpenAI client.

⚑️ Enhancement Notes

  • ChatPromptBuilder now supports changing its template at runtime. This allows you to define a default template and then change it based on your needs at runtime.
  • If an LLM-based evaluator (e.g., Faithfulness or ContextRelevance) is initialised with raise_on_failure=False, and if a call to an LLM fails or an LLM outputs invalid JSON, the score of the sample is set to NaN instead of an exception being raised. The user is notified with a warning indicating the number of requests that failed.
  • Adds inference mode to the model call of the ExtractiveReader. This prevents PyTorch from calculating gradients during inference.
  • The DocumentCleaner class has an optional keep_id attribute that, if set to True, keeps the document ids unchanged after cleanup.
  • DocumentSplitter now has an optional split_threshold parameter. Use this parameter if you would rather not split inputs that are only slightly longer than the allowed split_length. If, when chunking, one of the chunks is smaller than the split_threshold, the chunk is concatenated with the previous one. This avoids chunks that are too small to be meaningful.
  • Re-implement InMemoryDocumentStore BM25 search with incremental indexing by avoiding re-creating the entire inverse index for every new query. This change also removes the dependency on haystack_bm25. Please refer to [PR #7549](https://github.com/deepset-ai/haystack/pull/7549) for the full context.
  • Improved MIME type management by directly setting MIME types on ByteStreams, enhancing the overall handling and routing of different file types. This update makes MIME type data more consistently accessible and simplifies the process of working with various document formats.
  • PromptBuilder now supports changing its template at runtime (e.g. for Prompt Engineering). This allows you to define a default template and then change it based on your needs at runtime.
  • Now you can set the timeout and max_retries parameters on OpenAI components by setting the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables or passing them at __init__.
  • The DocumentJoiner component's run method now accepts a top_k parameter, allowing users to specify the maximum number of documents to return at query time. This fixes issue #7702.
  • Enforce JSON mode on OpenAI LLM-based evaluators so that they always return valid JSON output. This ensures that the output is always in a consistent format, regardless of the input.
  • Make warm_up() usage consistent across the codebase.
  • Create a class hierarchy for pipeline classes, and move the run logic into the child class. Preparation work for introducing multiple run strategies.
  • Make the SerperDevWebSearch more robust when snippet is not present in the request response.
  • Make SparseEmbedding a dataclass; this makes it easier to use the class with Pydantic.
  • `HTMLToDocument`: change the HTML conversion backend from boilerpy3 to trafilatura, which is more robust and better maintained.

⚠️ Deprecation Notes

  • Multiplexer is now deprecated.
  • DynamicChatPromptBuilder has been deprecated as ChatPromptBuilder fully covers its functionality. Use ChatPromptBuilder instead.
  • DynamicPromptBuilder has been deprecated as PromptBuilder fully covers its functionality. Use PromptBuilder instead.
  • The following parameters of HTMLToDocument are ignored and will be removed in Haystack 2.4.0: extractor_type and try_others.

πŸ› Bug Fixes

  • FaithfulnessEvaluator and ContextRelevanceEvaluator now return 0 instead of NaN when applied to an empty context or empty statements.
  • Fixed the Azure generator components; they were missing the @component decorator.
  • Updates the from_dict method of SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder, NamedEntityExtractor, SentenceTransformersDiversityRanker and LocalWhisperTranscriber to allow None as a valid value for device when deserializing from a YAML file. This allows a deserialized pipeline to auto-determine what device to use via the ComponentDevice.resolve_device logic.
  • Fix the broken serialization of HuggingFaceAPITextEmbedder, HuggingFaceAPIDocumentEmbedder, HuggingFaceAPIGenerator, and HuggingFaceAPIChatGenerator.
  • Fix NamedEntityExtractor crashing in Python 3.12 if constructed using a string backend argument.
  • Fixed the PdfMinerToDocument converter's outputs to be properly wired up to 'documents'.
  • Add to_dict method to DocumentRecallEvaluator to allow proper serialization of the component.
  • Improves/fixes type serialization of PEP 585 types (e.g. list[Document], and their nested version). This improvement enables better serialization of generics and nested types and improves/fixes matching of list[X] and List[X] types in component connections after serialization.
  • Fixed (de)serialization of NamedEntityExtractor. Includes updated tests verifying these fixes when NamedEntityExtractor is used in pipelines.
  • The include_outputs_from parameter in Pipeline.run correctly returns outputs of components with multiple outputs.
  • Return an empty list of answers when ExtractiveReader receives an empty list of documents instead of raising an exception.

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.2.0-rc1

- Python
Published by github-actions[bot] over 1 year ago

farm-haystack - v2.1.2

Release Notes

v2.1.2

⚑️ Enhancement Notes

  • Enforce JSON mode on OpenAI LLM-based evaluators so that they always return valid JSON output. This ensures that the output is always in a consistent format, regardless of the input.

πŸ› Bug Fixes

  • FaithfulnessEvaluator and ContextRelevanceEvaluator now return 0 instead of NaN when applied to an empty context or empty statements.
  • Fixed the Azure generator components; they were missing the @component decorator.
  • Updates the from_dict method of SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder, NamedEntityExtractor, SentenceTransformersDiversityRanker and LocalWhisperTranscriber to allow None as a valid value for device when deserializing from a YAML file. This allows a deserialized pipeline to auto-determine what device to use via the ComponentDevice.resolve_device logic.
  • Improves/fixes type serialization of PEP 585 types (e.g. list[Document], and their nested versions). This improvement enables better serialization of generics and nested types and improves/fixes matching of list[X] and List[X] types in component connections after serialization.
  • Fixed (de)serialization of NamedEntityExtractor. Includes updated tests verifying these fixes when NamedEntityExtractor is used in pipelines.
  • The include_outputs_from parameter in Pipeline.run correctly returns outputs of components with multiple outputs.

- Python
Published by github-actions[bot] almost 2 years ago

farm-haystack - v2.1.1

Release Notes

v2.1.1

⚑️ Enhancement Notes

  • Make SparseEmbedding a dataclass; this makes it easier to use the class with Pydantic.

πŸ› Bug Fixes

  • Fix the broken serialization of HuggingFaceAPITextEmbedder, HuggingFaceAPIDocumentEmbedder, HuggingFaceAPIGenerator, and HuggingFaceAPIChatGenerator.
  • Add to_dict method to DocumentRecallEvaluator to allow proper serialization of the component.

- Python
Published by github-actions[bot] almost 2 years ago

farm-haystack - v2.1.1-rc1

Release Notes

v2.1.1-rc1

⚑️ Enhancement Notes

  • Make SparseEmbedding a dataclass; this makes it easier to use the class with Pydantic.

πŸ› Bug Fixes

  • Fix the broken serialization of HuggingFaceAPITextEmbedder, HuggingFaceAPIDocumentEmbedder, HuggingFaceAPIGenerator, and HuggingFaceAPIChatGenerator.
  • Add to_dict method to DocumentRecallEvaluator to allow proper serialization of the component.

- Python
Published by github-actions[bot] almost 2 years ago

farm-haystack -

Release Notes

Highlights

πŸ“Š New Evaluator Components

Haystack introduces new components for both model-based and statistical evaluation: AnswerExactMatchEvaluator, ContextRelevanceEvaluator, DocumentMAPEvaluator, DocumentMRREvaluator, DocumentRecallEvaluator, FaithfulnessEvaluator, LLMEvaluator, SASEvaluator

Here's an example of how to use DocumentMAPEvaluator to evaluate retrieved documents and calculate mean average precision score:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)

result["individual_scores"]
# [1.0, 0.8333333333333333]
result["score"]
# 0.9166666666666666
```
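For intuition, the mean-average-precision arithmetic behind those scores can be reproduced in a few lines of plain Python (an illustrative sketch; average_precision is a hypothetical helper, not part of Haystack):

```python
def average_precision(ground_truth, retrieved):
    """Average of precision values at each rank where a relevant document appears."""
    relevant = set(ground_truth)
    hits, score = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0


individual = [
    average_precision(["France"], ["France"]),
    average_precision(["9th century", "9th"],
                      ["9th century", "10th century", "9th"]),
]
map_score = sum(individual) / len(individual)
# individual -> [1.0, 0.8333...], map_score -> 0.9166...
```

The second query scores 5/6 because the two relevant documents are found at ranks 1 and 3, giving precisions of 1/1 and 2/3.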

To learn more about evaluating RAG pipelines with both the model-based and statistical metrics available in Haystack, check out Tutorial: Evaluating RAG Pipelines.

πŸ•ΈοΈ Support For Sparse Embeddings

Haystack offers robust support for Sparse Embedding Retrieval techniques, including SPLADE. Here's how to create a simple retrieval Pipeline with sparse embeddings:

```python
from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")
sparse_retriever = QdrantSparseEmbeddingRetriever(document_store=document_store)

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", sparse_retriever)

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
```

Learn more about this topic in our documentation on Sparse Embedding-based Retrievers. Start building with our new cookbook: πŸ§‘β€πŸ³ Sparse Embedding Retrieval using Qdrant and FastEmbed.

🧐 Inspect Component Outputs

As of 2.1.0, you can inspect each component's output after running a pipeline. Provide component names with the include_outputs_from key to pipeline.run:

```python
pipe.run(data, include_outputs_from={"prompt_builder", "llm", "retriever"})
```

And the pipeline output should look like this:

```text
{'llm': {'replies': ['The Rhodes Statue was described as being built with iron tie bars to which brass plates were fixed to form the skin. It stood on a 15-meter-high white marble pedestal near the Rhodes harbor entrance. The statue itself was about 70 cubits, or 32 meters, tall.'],
         'meta': [{'model': 'gpt-3.5-turbo-0125', ...
                   'usage': {'completion_tokens': 57, 'prompt_tokens': 446, 'total_tokens': 503}}]},
 'retriever': {'documents': [Document(id=a3ee3a9a55b47ff651ae11dc56d84d2b6f8d931b795bd866c14eacfa56000965, content: 'Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it w...', meta: {'url': 'https://en.wikipedia.org/wiki/Colossus_of_Rhodes', '_split_id': 9}, score: 0.648961685430463), ...]},
 'prompt_builder': {'prompt': "\nGiven the following information, answer the question.\n\nContext:\n\n Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it while... ... levels during construction.\n\n\n\nQuestion: What does Rhodes Statue look like?\nAnswer:"}}
```

πŸš€ New Features

  • Add several new evaluation components, namely:

    • AnswerExactMatchEvaluator
    • ContextRelevanceEvaluator
    • DocumentMAPEvaluator
    • DocumentMRREvaluator
    • DocumentRecallEvaluator
    • FaithfulnessEvaluator
    • LLMEvaluator
    • SASEvaluator
  • Introduce a new SparseEmbedding class that can store a sparse vector representation of a document. It will be instrumental in supporting sparse embedding retrieval with the subsequent introduction of sparse embedders and sparse embedding retrievers.

  • Added a SentenceTransformersDiversityRanker. The diversity ranker orders documents to maximize their overall diversity. The ranker leverages sentence-transformer models to calculate semantic embeddings for each document and the query.

  • Introduced new HuggingFace API components, namely:

    • HuggingFaceAPIChatGenerator, which will replace the HuggingFaceTGIChatGenerator in the future.
    • HuggingFaceAPIDocumentEmbedder, which will replace the HuggingFaceTEIDocumentEmbedder in the future.
    • HuggingFaceAPIGenerator, which will replace the HuggingFaceTGIGenerator in the future.
    • HuggingFaceAPITextEmbedder, which will replace the HuggingFaceTEITextEmbedder in the future.
    • These components support different Hugging Face APIs:
      • free Serverless Inference API
      • paid Inference Endpoints
      • self-hosted Text Generation Inference
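A sparse embedding of the kind introduced above is commonly stored as parallel index/value lists, so only the non-zero dimensions are materialized. A minimal sketch of such a class (an assumed shape, not necessarily Haystack's exact definition):

```python
from dataclasses import dataclass, field


@dataclass
class SparseEmbedding:
    """Sparse vector stored as parallel lists of non-zero indices and values."""
    indices: list = field(default_factory=list)
    values: list = field(default_factory=list)


# Only 3 of the (potentially tens of thousands of) dimensions are stored.
embedding = SparseEmbedding(indices=[0, 7, 42], values=[0.12, 0.8, 0.33])
```

Being a dataclass gives it free equality, repr, and easy integration with validation libraries such as Pydantic.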

⚑️ Enhancement Notes

  • Compatibility with huggingface_hub>=0.22.0 for HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator components.

  • Adds truncate and normalize parameters to HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder to allow truncation and normalization of embeddings.

  • Adds trust_remote_code parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder for allowing custom models and scripts.

  • Adds streaming_callback parameter to HuggingFaceLocalGenerator, allowing users to handle streaming responses.

  • Adds a ZeroShotTextRouter that uses an NLI model from HuggingFace to classify texts based on a set of provided labels and routes them based on the label they were classified with.

  • Adds dimensions parameter to Azure OpenAI Embedders (AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder) to fully support new embedding models like text-embedding-3-small, text-embedding-3-large and upcoming ones

  • Now the DocumentSplitter adds the page_number field to the metadata of all output documents to keep track of the page of the original document it belongs to.

  • Allows users to customise text extraction from PDF files. This is particularly useful for PDFs with unusual layouts, such as multiple text columns. For instance, users can configure the object to retain the reading order.

  • Enhanced PromptBuilder to specify and enforce required variables in prompt templates.

  • Set max_new_tokens default to 512 in HuggingFace generators.

  • Enhanced the AzureOCRDocumentConverter to include advanced handling of tables and text. Features such as extracting preceding and following context for tables, merging multiple column headers, and enabling single-column page layout for text have been introduced. This update furthers the flexibility and accuracy of document conversion within complex layouts.

  • Enhanced DynamicChatPromptBuilder's capabilities by allowing all user and system messages to be templated with provided variables. This update ensures a more versatile and dynamic templating process, making chat prompt generation more efficient and customised to user needs.

  • Improved HTML content extraction by attempting to use multiple extractors in order of priority until successful. An additional try_others parameter in HTMLToDocument, True by default, determines whether subsequent extractors are used after a failure. This enhancement decreases extraction failures, ensuring more dependable content retrieval.

  • Enhanced FileTypeRouter with regex pattern support for MIME types. This powerful addition allows for more granular control and flexibility in routing files based on their MIME types, enabling the handling of broad categories or specific MIME type patterns with ease. This feature particularly benefits applications requiring sophisticated file classification and routing logic.

  • In Jupyter notebooks, the image of the Pipeline will no longer be displayed automatically. Instead, the textual representation of the Pipeline will be displayed. To display the Pipeline image, use the show method of the Pipeline object.

  • Add support for callbacks during pipeline deserialization. Currently supports a pre-init hook for components that can be used to inspect and modify the initialization parameters before the invocation of the component's __init__ method.

  • pipeline.run() accepts a set of component names whose intermediate outputs are returned in the final pipeline output dictionary.

  • Refactor PyPDFToDocument to simplify support for custom PDF converters. PDF converters are classes that implement the PyPDFConverter protocol and have 3 methods: convert, to_dict and from_dict.
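The regex-based MIME routing mentioned above can be sketched in a few lines (illustrative only; the function name and routing logic here are hypothetical, not FileTypeRouter's internals):

```python
import re
from typing import List, Optional


def route_mime_type(mime_type: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern that fully matches the given MIME type, else None."""
    for pattern in patterns:
        if re.fullmatch(pattern, mime_type):
            return pattern
    return None


route_mime_type("text/markdown", [r"text/.*", r"application/pdf"])  # -> 'text/.*'
route_mime_type("image/png", [r"text/.*", r"application/pdf"])      # -> None
```

A broad pattern like `text/.*` catches whole categories, while an exact string such as `application/pdf` still works because a literal is its own regex.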

⚠️ Deprecation Notes

  • Deprecate HuggingFaceTGIChatGenerator, will be removed in Haystack 2.3.0. Use HuggingFaceAPIChatGenerator instead.
  • Deprecate HuggingFaceTEIDocumentEmbedder, will be removed in Haystack 2.3.0. Use HuggingFaceAPIDocumentEmbedder instead.
  • Deprecate HuggingFaceTGIGenerator, will be removed in Haystack 2.3.0. Use HuggingFaceAPIGenerator instead.
  • Deprecate HuggingFaceTEITextEmbedder, will be removed in Haystack 2.3.0. Use HuggingFaceAPITextEmbedder instead.
  • Using the converter_name parameter in the PyPDFToDocument component is deprecated. It will be removed in the 2.3.0 release. Use the converter parameter instead.

πŸ› Bug Fixes

  • Forward declaration of AnalyzeResult type in AzureOCRDocumentConverter. AnalyzeResult is already imported in a lazy import block. The forward declaration avoids issues when azure-ai-formrecognizer>=3.2.0b2 is not installed.

  • Fixed a bug in the MetaFieldRanker: when the weight parameter was set to 0 in the run method, the component incorrectly used the default parameter set in the __init__ method.

  • Fixes Pipeline.run() logic so that components whose inputs all have default values are run in the correct order.

  • Fix a bug when running a Pipeline that would cause it to get stuck in an infinite loop.

  • Fixed the HuggingFaceTEITextEmbedder returning an embedding of incorrect shape when used with a Text-Embedding-Inference endpoint deployed using Docker.

  • Add the @component decorator to HuggingFaceTGIChatGenerator. The lack of this decorator made it impossible to use the HuggingFaceTGIChatGenerator in a pipeline.

  • Updated the SearchApiWebSearch component with new search format and allowed users to specify the search engine via the engine parameter in search_params. The default search engine is Google, making it easier for users to tailor their web searches.

- Python
Published by davidsbatista almost 2 years ago

farm-haystack - v2.1.0-rc2

Release Notes

Highlights

πŸ“Š New Evaluator Components

Haystack introduces new components for both model-based and statistical evaluation: AnswerExactMatchEvaluator, ContextRelevanceEvaluator, DocumentMAPEvaluator, DocumentMRREvaluator, DocumentRecallEvaluator, FaithfulnessEvaluator, LLMEvaluator, SASEvaluator

Here's an example of how to use DocumentMAPEvaluator to evaluate retrieved documents and calculate mean average precision score:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)

result["individual_scores"]
# [1.0, 0.8333333333333333]
result["score"]
# 0.9166666666666666
```

To learn more about evaluating RAG pipelines with both the model-based and statistical metrics available in Haystack, check out Tutorial: Evaluating RAG Pipelines.

πŸ•ΈοΈ Support For Sparse Embeddings

Haystack offers robust support for Sparse Embedding Retrieval techniques, including SPLADE. Here's how to create a simple retrieval Pipeline with sparse embeddings:

```python
from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")
sparse_retriever = QdrantSparseEmbeddingRetriever(document_store=document_store)

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", sparse_retriever)

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
```

Learn more about this topic in our documentation on Sparse Embedding-based Retrievers. Start building with our new cookbook: πŸ§‘β€πŸ³ Sparse Embedding Retrieval using Qdrant and FastEmbed.

🧐 Inspect Component Outputs

As of 2.1.0, you can inspect each component's output after running a pipeline. Provide component names with the include_outputs_from key to pipeline.run:

```python
pipe.run(data, include_outputs_from=["prompt_builder", "llm", "retriever"])
```

And the pipeline output should look like this:

```text
{'llm': {'replies': ['The Rhodes Statue was described as being built with iron tie bars to which brass plates were fixed to form the skin. It stood on a 15-meter-high white marble pedestal near the Rhodes harbor entrance. The statue itself was about 70 cubits, or 32 meters, tall.'],
         'meta': [{'model': 'gpt-3.5-turbo-0125', ... 'usage': {'completion_tokens': 57, 'prompt_tokens': 446, 'total_tokens': 503}}]},
 'retriever': {'documents': [Document(id=a3ee3a9a55b47ff651ae11dc56d84d2b6f8d931b795bd866c14eacfa56000965, content: 'Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it w...', meta: {'url': 'https://en.wikipedia.org/wiki/Colossus_of_Rhodes', '_split_id': 9}, score: 0.648961685430463), ...]},
 'prompt_builder': {'prompt': "\nGiven the following information, answer the question.\n\nContext:\n\n Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it while... ... levels during construction.\n\n\n\nQuestion: What does Rhodes Statue look like?\nAnswer:"}}
```

πŸš€ New Features

  • Add several new Evaluation components, i.e:

    • AnswerExactMatchEvaluator
    • ContextRelevanceEvaluator
    • DocumentMAPEvaluator
    • DocumentMRREvaluator
    • DocumentRecallEvaluator
    • FaithfulnessEvaluator
    • LLMEvaluator
    • SASEvaluator
  • Introduce a new SparseEmbedding class that can store a sparse vector representation of a document. It will be instrumental in supporting sparse embedding retrieval with the subsequent introduction of sparse embedders and sparse embedding retrievers.

  • Added a SentenceTransformersDiversityRanker. The diversity ranker orders documents to maximize their overall diversity. The ranker leverages sentence-transformer models to calculate semantic embeddings for each document and the query.

  • Introduced new HuggingFace API components, namely:

    • HuggingFaceAPIChatGenerator, which will replace the HuggingFaceTGIChatGenerator in the future.
    • HuggingFaceAPIDocumentEmbedder, which will replace the HuggingFaceTEIDocumentEmbedder in the future.
    • HuggingFaceAPIGenerator, which will replace the HuggingFaceTGIGenerator in the future.
    • HuggingFaceAPITextEmbedder, which will replace the HuggingFaceTEITextEmbedder in the future.
    • These components support different Hugging Face APIs:
      • free Serverless Inference API
      • paid Inference Endpoints
      • self-hosted Text Generation Inference

⚑️ Enhancement Notes

  • Compatibility with huggingface_hub>=0.22.0 for HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator components.

  • Adds truncate and normalize parameters to HuggingFaceTEITextEmbedder and HuggingFaceTEITextEmbedder to allow truncation and normalization of embeddings.

  • Adds trust_remote_code parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder for allowing custom models and scripts.

  • Adds streaming_callback parameter to HuggingFaceLocalGenerator, allowing users to handle streaming responses.

  • Adds a ZeroShotTextRouter that uses an NLI model from HuggingFace to classify texts based on a set of provided labels and routes them based on the label they were classified with.

  • Adds dimensions parameter to Azure OpenAI Embedders (AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder) to fully support new embedding models like text-embedding-3-small, text-embedding-3-large and upcoming ones

  • Now the DocumentSplitter adds the page_number field to the metadata of all output documents to keep track of the page of the original document it belongs to.

  • Allows users to customise text extraction from PDF files. This is particularly useful for PDFs with unusual layouts, such as multiple text columns. For instance, users can configure the object to retain the reading order.

  • Enhanced PromptBuilder to specify and enforce required variables in prompt templates.

  • Set max_new_tokens default to 512 in HuggingFace generators.

  • Enhanced the AzureOCRDocumentConverter to include advanced handling of tables and text. Features such as extracting preceding and following context for tables, merging multiple column headers, and enabling single-column page layout for text have been introduced. This update furthers the flexibility and accuracy of document conversion within complex layouts.

  • Enhanced DynamicChatPromptBuilder's capabilities by allowing all user and system messages to be templated with provided variables. This update ensures a more versatile and dynamic templating process, making chat prompt generation more efficient and customised to user needs.

  • Improved HTML content extraction by attempting to use multiple extractors in order of priority until successful. An additional try_others parameter in HTMLToDocument, True by default, determines whether subsequent extractors are used after a failure. This enhancement decreases extraction failures, ensuring more dependable content retrieval.

  • Enhanced FileTypeRouter with regex pattern support for MIME types. This powerful addition allows for more granular control and flexibility in routing files based on their MIME types, enabling the handling of broad categories or specific MIME type patterns with ease. This feature particularly benefits applications requiring sophisticated file classification and routing logic.

  • In Jupyter notebooks, the image of the Pipeline will no longer be displayed automatically. Instead, the textual representation of the Pipeline will be displayed. To display the Pipeline image, use the show method of the Pipeline object.

  • Add support for callbacks during pipeline deserialization. Currently supports a pre-init hook for components that can be used to inspect and modify the initialization parameters before the invocation of the component's __init__ method.

  • pipeline.run() accepts a set of component names whose intermediate outputs are returned in the final pipeline output dictionary.

  • Refactor PyPDFToDocument to simplify support for custom PDF converters. PDF converters are classes that implement the PyPDFConverter protocol and have 3 methods: convert, to_dict and from_dict.

⚠️ Deprecation Notes

  • Deprecate HuggingFaceTGIChatGenerator, will be removed in Haystack 2.3.0. Use HuggingFaceAPIChatGenerator instead.
  • Deprecate HuggingFaceTEIDocumentEmbedder, will be removed in Haystack 2.3.0. Use HuggingFaceAPIDocumentEmbedder instead.
  • Deprecate HuggingFaceTGIGenerator, will be removed in Haystack 2.3.0. Use HuggingFaceAPIGenerator instead.
  • Deprecate HuggingFaceTEITextEmbedder, will be removed in Haystack 2.3.0. Use HuggingFaceAPITextEmbedder instead.
  • Using the converter_name parameter in the PyPDFToDocument component is deprecated. it will be removed in the 2.3.0 release. Use the converter parameter instead.

πŸ› Bug Fixes

  • Forward declaration of AnalyzeResult type in AzureOCRDocumentConverter. AnalyzeResult is already imported in a lazy import block. The forward declaration avoids issues when azure-ai-formrecognizer>=3.2.0b2 is not installed.

  • Fixed a bug in the MetaFieldRanker: when the weight parameter was set to 0 in the run method, the component incorrectly used the default parameter set in the__init__ method.

  • Fixes Pipeline.run() logic so components with all their inputs with a default are run in the correct order.

  • Fix a bug when running a Pipeline that would cause it to get stuck in an infinite loop

  • Fixes on the HuggingFaceTEITextEmbedder returning an embedding of incorrect shape when used with a Text-Embedding-Inference endpoint deployed using Docker.

  • Add the @component decorator to HuggingFaceTGIChatGenerator. The lack of this decorator made it impossible to use the HuggingFaceTGIChatGenerator in a pipeline.

  • Updated the SearchApiWebSearch component with new search format and allowed users to specify the search engine via the engine parameter in search_params. The default search engine is Google, making it easier for users to tailor their web searches.

- Python
Published by github-actions[bot] almost 2 years ago

farm-haystack - v2.1.0-rc1

Release Notes

v2.1.0-rc1

Highlights

Add the "page_number" field to the metadata of all output documents.

⬆️ Upgrade Notes

  • The HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator components have been modified to be compatible with huggingface_hub>=0.22.0.

    If you use these components, you may need to upgrade the huggingface_hub library. To do this, run the following command in your environment:

    ```bash
    pip install "huggingface_hub>=0.22.0"
    ```

πŸš€ New Features

  • Add SentenceTransformersDiversityRanker. The Diversity Ranker orders documents in such a way as to maximize the overall diversity of the given documents. The ranker leverages sentence-transformer models to calculate semantic embeddings for each document and the query.

  • Adds truncate and normalize parameters to HuggingFaceTEITextEmbedder and HuggingFaceTEITextEmbedder for allowing truncation and normalization of embeddings.

  • Add trustremotecode parameter to SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder for allowing custom models and scripts.

  • Add a new ContextRelevanceEvaluator component that can be used to evaluate whether retrieved documents are relevant to answer a question with a RAG pipeline. Given a question and a list of retrieved document contents (contexts), an LLM is used to score to what extent the provided context is relevant. The score ranges from 0 to 1.

  • Add DocumentMAPEvaluator, it can be used to calculate mean average precision of retrieved documents.

  • Add DocumentMRREvaluator, it can be used to calculate mean reciprocal rank of retrieved documents.

  • Add a new FaithfulnessEvaluator component that can be used to evaluate faithfulness / groundedness / hallucinations of LLMs in a RAG pipeline. Given a question, a list of retrieved document contents (contexts), and a predicted answer, FaithfulnessEvaluator returns a score ranging from 0 (poor faithfulness) to 1 (perfect faithfulness). The score is the proportion of statements in the predicted answer that could by inferred from the documents.

  • Introduce HuggingFaceAPIChatGenerator. This text-generation component uses the ChatMessage format and supports different Hugging Face APIs: - free Serverless Inference API - paid Inference Endpoints - self-hosted Text Generation Inference.

    This generator will replace the HuggingFaceTGIChatGenerator in the future.

  • Introduce HuggingFaceAPIDocumentEmbedder. This component can be used to compute Document embeddings using different Hugging Face APIs: - free Serverless Inference API - paid Inference Endpoints - self-hosted Text Embeddings Inference. This embedder will replace the HuggingFaceTEIDocumentEmbedder in the future.

  • Introduce HuggingFaceAPIGenerator. This text-generation component supports different Hugging Face APIs:

    • free Serverless Inference API
    • paid Inference Endpoints
    • self-hosted Text Generation Inference.

    This generator will replace the HuggingFaceTGIGenerator in the future.

  • Introduce HuggingFaceAPITextEmbedder. This component can be used to embed strings using different Hugging Face APIs: - free Serverless Inference API - paid Inference Endpoints - self-hosted Text Embeddings Inference. This embedder will replace the HuggingFaceTEITextEmbedder in the future.

  • Adds 'streaming_callback' parameter to 'HuggingFaceLocalGenerator', allowing users to handle streaming responses.

  • Added a new EvaluationRunResult dataclass that wraps the results of an evaluation pipeline, allowing for its transformation and visualization.

  • Add a new LLMEvaluator component that leverages LLMs through the OpenAI api to evaluate pipelines.

  • Add DocumentRecallEvaluator, a Component that can be used to calculate the Recall single-hit or multi-hit metric given a list of questions, a list of expected documents for each question and the list of predicted documents for each question.

  • Add SASEvaluator, it can be used to calculate Semantic Answer Similarity of generated answers from an LLM

  • Introduce a new SparseEmbedding class which can be used to store a sparse vector representation of a Document. It will be instrumental to support Sparse Embedding Retrieval with the subsequent introduction of Sparse Embedders and Sparse Embedding Retrievers.

  • Add a Zero Shot Text Router that uses an NLI model from HF to classify texts based on a set of provided labels and routes them based on the label they were classified with.

⚑️ Enhancement Notes

  • add dimensions parameter to Azure OpenAI Embedders (AzureOpenAITextEmbedder and AzureOpenAIDocumentEmbedder) to fully support new embedding models like text-embedding-3-small, text-embedding-3-large and upcoming ones

  • Now the DocumentSplitter adds the "page_number" field to the metadata of all output documents to keep track of the page of the original document it belongs to.

  • Provides users the ability to customize text extraction from PDF files. It is particularly useful for PDFs with unusual layouts, such as those containing multiple text columns. For instance, users can configure the object to retain the reading order.

  • Enhanced PromptBuilder to specify and enforce required variables in prompt templates.

  • Set maxnewtokens default to 512 in Hugging Face generators.

  • Enhanced the AzureOCRDocumentConverter to include advanced handling of tables and text. Features such as extracting preceding and following context for tables, merging multiple column headers, and enabling single column page layout for text have been introduced. This update furthers the flexibility and accuracy of document conversion within complex layouts.

  • Enhanced DynamicChatPromptBuilder's capabilities by allowing all user and system messages to be templated with provided variables. This update ensures a more versatile and dynamic templating process, making chat prompt generation more efficient and customized to user needs.

  • Improved HTML content extraction by attempting to use multiple extractors in order of priority until successful. An additional try_others parameter in HTMLToDocument, which is true by default, determines whether subsequent extractors are used after a failure. This enhancement decreases extraction failures, ensuring more dependable content retrieval.

  • Enhanced FileTypeRouter with Regex Pattern Support for MIME Types: This introduces a significant enhancement to the FileTypeRouter, now featuring support for regex pattern matching for MIME types. This powerful addition allows for more granular control and flexibility in routing files based on their MIME types, enabling the handling of broad categories or specific MIME type patterns with ease. This feature is particularly beneficial for applications requiring sophisticated file classification and routing logic.

    Usage example:

    ```python
    from pathlib import Path

    from haystack.components.routers import FileTypeRouter

    router = FileTypeRouter(mime_types=[r"text/.*", r"application/(pdf|json)"])

    # Example files to classify
    file_paths = [
        Path("document.pdf"),
        Path("report.json"),
        Path("notes.txt"),
        Path("image.png"),
    ]

    result = router.run(sources=file_paths)
    for mime_type, files in result.items():
        print(f"MIME Type: {mime_type}, Files: {[str(file) for file in files]}")
    ```

  • Improved pipeline run tracing to include pipeline input/output data.

  • In Jupyter notebooks, the image of the Pipeline will no longer be displayed automatically. The textual representation of the Pipeline will be displayed.

    To display the Pipeline image, use the show method of the Pipeline object.

  • Add support for callbacks during pipeline deserialization. Currently supports a pre-init hook for components that can be used to inspect and modify the initialization parameters before the invocation of the component's __init__ method.

  • pipeline.run accepts a set of component names whose intermediate outputs are returned in the final pipeline output dictionary.

  • Pipeline.inputs and Pipeline.outputs can optionally include components input/output sockets that are connected.

  • Refactor PyPDFToDocument to simplify support for custom PDF converters. PDF converters are classes that implement the PyPDFConverter protocol and have 3 methods: convert, todict and fromdict. The DefaultConverter class is provided as a default implementation.

  • Add an __eq__ method to SparseEmbedding class to compare two SparseEmbedding objects.

⚠️ Deprecation Notes

  • Deprecate HuggingFaceTGIChatGenerator. This component will be removed in Haystack 2.3.0. Use HuggingFaceAPIChatGenerator instead.
  • Deprecate HuggingFaceTEIDocumentEmbedder. This component will be removed in Haystack 2.3.0. Use HuggingFaceAPIDocumentEmbedder instead.
  • Deprecate HuggingFaceTGIGenerator. This component will be removed in Haystack 2.3.0. Use HuggingFaceAPIGenerator instead.
  • Deprecate HuggingFaceTEITextEmbedder. This component will be removed in Haystack 2.3.0. Use HuggingFaceAPITextEmbedder instead.
  • Using the converter_name parameter in the PyPDFToDocument component is deprecated. It will be removed in the 2.3.0 release. Use the converter parameter instead.

πŸ› Bug Fixes

  • Forward declaration of AnalyzeResult type in AzureOCRDocumentConverter.

    AnalyzeResult is already imported in a lazy import block. The forward declaration avoids issues when azure-ai-formrecognizer>=3.2.0b2 is not installed.

  • The testcomparisonin test case in the base document store tests used to always pass, no matter how the in filtering logic was implemented in document stores. With the fix, the in logic is actually tested. Some tests might start to fail for document stores that don't implement the in filter correctly.

  • Remove the usage of reserved keywords in the logger calls, causing a KeyError when setting the log level to DEBUG.

  • Fixed a bug in the `MetaFieldRanker`: when the weight parameter was set to 0 in the run method, the component was incorrectly using the default weight parameter set in the __init__ method.

  • Fixes Pipeline.run() logic so Components that have all their inputs with a default are run in the correct order. This happened we gather a list of Components to run internally when running the Pipeline in the order they are added during creation of the Pipeline. This caused some Components to run before they received all their inputs.

  • Fix a bug when running a Pipeline that would cause it to get stuck in an infinite loop

  • Fixes HuggingFaceTEITextEmbedder returning an embedding of incorrect shape when used with a Text-Embedding-Inference endpoint deployed using Docker.

  • Add the @component decorator to HuggingFaceTGIChatGenerator. The lack of this decorator made it impossible to use the HuggingFaceTGIChatGenerator in a pipeline.

  • Updated the SearchApiWebSearch component with new search format and allowed users to specify the search engine via the engine parameter in search_params. The default search engine is Google, making it easier for users to tailor their web searches.

  • Fixed a bug in the `MetaFieldRanker`: when the rankingmode parameter was overridden in the run method, the component was incorrectly using the rankingmode parameter set in the __init__ method.

v2.1.0-rc0

⬆️ Upgrade Notes

  • Removed the deprecated GPTGenerator and GPTChatGenerator components. Use OpenAIGenerator and OpenAIChatGeneratornotes instead.

  • Update secret handling for the ExtractiveReader component using the Secret type.

    If authentication is required, the init parameter token must now either use a token directly or the HF_API_TOKEN environment variable. The on-disk local token file is no longer supported.

πŸš€ New Features

  • Add a new pipeline template PredefinedPipeline.CHATWITHWEBSITE to quickly create a pipeline that will answer questions based on data collected from one or more web pages.

    Usage example:

    ```python
    from haystack import Pipeline, PredefinedPipeline

    pipe = Pipeline.from_template(PredefinedPipeline.CHAT_WITH_WEBSITE)
    result = pipe.run({
        "fetcher": {"urls": ["https://haystack.deepset.ai/overview/quick-start"]},
        "prompt": {"query": "How should I install Haystack?"},
    })
    print(result["llm"]["replies"][0])
    ```

  • Added option to instrument pipeline and component runs. This allows users to observe their pipeline runs and component runs in real-time via their chosen observability tool. Out-of-the-box support for OpenTelemetry and Datadog will be added in separate contributions.

    Example usage for [OpenTelemetry](https://opentelemetry.io/docs/languages/python/):

    1. Install OpenTelemetry SDK and exporter:

    ```bash
    pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
    ```

    2. Configure OpenTelemetry SDK with your tracing provider and exporter:

    ```python
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Service name is required for most backends
    resource = Resource(attributes={SERVICE_NAME: "haystack"})

    traceProvider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
    traceProvider.add_span_processor(processor)
    trace.set_tracer_provider(traceProvider)

    tracer = traceProvider.get_tracer("my_application")
    ```

    3. Create a tracer adapter for Haystack:

    ```python
    import contextlib
    from typing import Optional, Dict, Any, Iterator

    import opentelemetry.trace
    from opentelemetry import trace
    from opentelemetry.trace import NonRecordingSpan

    from haystack.tracing import Tracer, Span
    from haystack.tracing import utils as tracing_utils


    class OpenTelemetrySpan(Span):
        def __init__(self, span: opentelemetry.trace.Span) -> None:
            self._span = span

        def set_tag(self, key: str, value: Any) -> None:
            coerced_value = tracing_utils.coerce_tag_value(value)
            self._span.set_attribute(key, coerced_value)


    class OpenTelemetryTracer(Tracer):
        def __init__(self, tracer: opentelemetry.trace.Tracer) -> None:
            self._tracer = tracer

        @contextlib.contextmanager
        def trace(self, operation_name: str, tags: Optional[Dict[str, Any]] = None) -> Iterator[Span]:
            with self._tracer.start_as_current_span(operation_name) as span:
                span = OpenTelemetrySpan(span)
                if tags:
                    span.set_tags(tags)
                yield span

        def current_span(self) -> Optional[Span]:
            current_span = trace.get_current_span()
            if isinstance(current_span, NonRecordingSpan):
                return None
            return OpenTelemetrySpan(current_span)
    ```

    4. Use the tracer with Haystack:

    ```python
    from haystack import tracing

    haystack_tracer = OpenTelemetryTracer(tracer)
    tracing.enable_tracing(haystack_tracer)
    ```

    5. Run your pipeline.
  • Enhanced OpenAPI integration by handling complex types of requests and responses in OpenAPIServiceConnector and OpenAPIServiceToFunctions.

  • Added out-of-the-box support for the Datadog Tracer. This allows you to instrument pipeline and component runs using Datadog and send traces to your preferred backend.

    To use the Datadog Tracer you need to have the ddtrace package installed in your environment. To instruct Haystack to use the Datadog tracer, you have multiple options:

    • Run your Haystack application using the ddtrace command line tool as described in the the [ddtrace documentation](https://ddtrace.readthedocs.io/en/stable/installation_quickstart.html#tracing). This behavior can be disabled by setting the HAYSTACKAUTOTRACEENABLEDENV_VAR environment variable to false.
    • Configure the tracer manually in your code using the ddtrace package: `python from haystack.tracing import DatadogTracer import haystack.tracing import ddtrace tracer = ddtrace.tracer tracing.enable_tracing(DatadogTracer(tracer))`
  • Add AnswerExactMatchEvaluator, a component that can be used to calculate the Exact Match metric comparing a list of expected answers with a list of predicted answers.

  • Added out-of-the-box support for the OpenTelemetry Tracer. This allows you to instrument pipeline and component runs using OpenTelemetry and send traces to your preferred backend.

    To use the OpenTelemetry Tracer you need to have the opentelemetry-sdk package installed in your environment. To instruct Haystack to use the OpenTelemetry Tracer, you have multiple options:

    * Run your Haystack application using the opentelemetry-instrument command line tool as described in the
    [OpenTelemetry documentation](https://opentelemetry.io/docs/languages/python/automatic/#configuring-the-agent).
    This behavior can be disabled by setting the HAYSTACK_AUTO_TRACE_ENABLED environment variable to false.

    * Configure the tracer manually in your code using the opentelemetry package:

    ```python
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Service name is required for most backends
    resource = Resource(attributes={SERVICE_NAME: "haystack"})
    traceProvider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
    traceProvider.add_span_processor(processor)
    trace.set_tracer_provider(traceProvider)

    # Auto-configuration
    import haystack.tracing
    haystack.tracing.auto_enable_tracing()

    # Or explicitly
    from haystack.tracing import OpenTelemetryTracer
    tracer = traceProvider.get_tracer("my_application")
    haystack.tracing.enable_tracing(OpenTelemetryTracer(tracer))
    ```

  • Haystack now supports structured logging out-of-the box. Logging can be separated into 3 categories:

    • If [structlog](https://www.structlog.org/en/stable/) is not installed, Haystack will use the standard Python logging library with whatever configuration is present.
    • If structlog is installed, Haystack will log through [structlog](https://www.structlog.org/en/stable/) using structlog's console renderer. To disable structlog, set the environment variable HAYSTACKLOGGINGIGNORESTRUCTLOGENV_VAR to true.
    • To log in JSON, install [structlog](https://www.structlog.org/en/stable/) and
      • set the environment variable HAYSTACKLOGGINGJSON to true or
      • enable JSON logging from Python `python import haystack.logging haystack.logging.configure_logging(use_json=True)`

⚑️ Enhancement Notes

  • Allow code instrumentation to also trace the input and output of components. This is useful for debugging and understanding the behavior of components. This behavior is disabled by default and can be enabled with one of the following methods:

    • Set the environment variable HAYSTACKCONTENTTRACINGENABLEDENV_VAR to true before importing Haystack.
    • Enable content tracing in the code:
      `python from haystack import tracing tracing.tracer.is_content_tracing_enabled = True`
  • Update Component protocol to fix type checking issues with some Language Servers. Most Language Servers and some type checkers would show warnings when calling Pipeline.add_component() as technically most `Component`s weren't respecting the protocol we defined.

  • Added a new Logger implementation which eases and enforces logging via key-word arguments. This is an internal change only. The behavior of instances created via logging.getLogger is not affected.

  • If using JSON logging in conjunction with tracing, Haystack will automatically add correlation IDs to the logs. This is done by getting the necessary information from the current span and adding it to the log record. You can customize this by overriding the getcorrelationdataforlogs of your tracer's span:

    `python from haystack.tracing import Span class OpenTelemetrySpan(Span): ... def get_correlation_data_for_logs(self) -> Dict[str, Any]: span_context = ... return {"trace_id": span_context.trace_id, "span_id": span_context.span_id}`

  • The logging module now detects if the standard output is a TTY. If it is not and structlog is installed, it will automatically disable the console renderer and log in JSON format. This behavior can be overridden by setting the environment variable HAYSTACKLOGGINGUSE_JSON to false.

  • Enhanced OpenAPI service connector to better handle method invocation with support for security schemes, refined handling of method arguments including URL/query parameters and request body, alongside improved error validation for method calls. This update enables more versatile interactions with OpenAPI services, ensuring compatibility with a wide range of API specifications.

Security Notes

  • Remove the text value from a warning log in the TextLanguageRouter to avoid logging sensitive information. The text can be still be shown by switching to the debug log level.

    `python import logging logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING) logging.getLogger("haystack").setLevel(logging.DEBUG)`

πŸ› Bug Fixes

  • Fix a bug in the MetaFieldRanker where the weight parameter passed to the run method was not being used.
  • Pin the typing-extensions package to versions >= 4.7 to avoid [incompatibilities with the openai package](https://community.openai.com/t/error-while-importing-openai-from-open-import-openai/578166/26).
  • Restore transparent background for images generated with Pipeline.draw and Pipeline.show
  • Fix telemetry code that could cause a ValueError by trying to serialize a pipeline. Telemetry code does not serialize pipelines anymore.
  • Fix Pipeline.run() mistakenly running a Component before it should. This can happen when a greedy variadic Component must be executed before a Component with default inputs.

- Python
Published by github-actions[bot] almost 2 years ago

farm-haystack - v1.25.5

Release Notes

v1.25.5

πŸ› Bug Fixes

  • Pipeline run error when using the FileTypeClassifier with the raiseonerror: True option. Instead of returning an unexpected NoneType, we route the file to a dead-end edge.

- Python
Published by github-actions[bot] almost 2 years ago