Recent Releases of distilabel

distilabel - 1.5.3

What's Changed

  • Fix typo by @Riezebos in https://github.com/argilla-io/distilabel/pull/1111
  • Checks for images using PIL only if available by @plaguss in https://github.com/argilla-io/distilabel/pull/1112
  • Fix pipeline getting stuck when multiple step replicas by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1113

New Contributors

  • @Riezebos made their first contribution in https://github.com/argilla-io/distilabel/pull/1111

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.2...1.5.3

- Python
Published by gabrielmbmb about 1 year ago

distilabel - 1.5.2

What's Changed

  • Fix structured output JSON to pydantic.BaseModel and LiteLLM async completion client by @rolshoven in https://github.com/argilla-io/distilabel/pull/1105

New Contributors

  • @rolshoven made their first contribution in https://github.com/argilla-io/distilabel/pull/1105

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.1...1.5.2

Published by gabrielmbmb about 1 year ago

distilabel - 1.5.1

What's Changed

  • Remove deprecated CombineColumns step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1101
  • Fix image import handling and update MlxLLM initialisation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1102
  • Fix MlxLLM by aligning it with mlx-lm>=0.21 by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1103
  • 1.5.1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1104

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.0...1.5.1

Published by gabrielmbmb about 1 year ago

distilabel - 1.5.0

✨ Release highlights

🖼️ Image Generation Support

We're excited to introduce ImageGenerationModel, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.

Available Services

  • 🤗 InferenceEndpointsImageGeneration: Integration with Hugging Face's Inference Endpoints
  • OpenAIImageGeneration: Integration with OpenAI's DALL-E

Architecture

Just as LLMs are used by a Task, we've introduced ImageTask as a high-level abstraction for image generation workflows. ImageTask defines how a step should use an ImageGenerationModel to accomplish specific image generation tasks.

Our first implementation, the ImageGeneration task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.

We've also added a small tutorial on how to generate images using distilabel: distilabel - Tutorials - Image generation with distilabel

Images as inputs for LLMs

We've added initial support for providing images as input to an LLM through the new TextGenerationWithImage task. We've updated and tested InferenceEndpointsLLM and OpenAILLM with this new task, and we'll add image-as-input compatibility for others, such as vLLM, in upcoming releases.

Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!

💻 New MlxLLM integration

We've integrated mlx-lm package with the new MlxLLM class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.

New InstructionResponsePipeline template

We've started making changes so that distilabel is easier to use from minute one. We'll be adding presets or templates that let you quickly get a pipeline with sensible preconfigured defaults for generating data for certain tasks. The first one we've worked on is the SFT or Instruction Response tuning pipeline, which you can use like this:

```python
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
distiset = pipeline.run()
```

Define load stages

We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.

We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
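Conceptually, a load group is just a list of step names that may be loaded at the same time; the pipeline then runs the groups one stage after another. Here is a minimal plain-Python sketch of that scheduling idea (the step names are made up; in distilabel you would pass this structure as the `load_groups` argument of `Pipeline.run`):

```python
# Hypothetical step names; each inner list is one load group.
load_groups = [
    ["generate_responses"],                # stage 1: the heavy LLM step alone
    ["score_responses", "format_output"],  # stage 2: loaded after stage 1 unloads
]

execution_order = []
for stage, group in enumerate(load_groups):
    # In a real pipeline each group is loaded, executed, and unloaded
    # before the next stage starts; here we only record the schedule.
    for step_name in group:
        execution_order.append((stage, step_name))

print(execution_order)
```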

What's Changed

  • Add common typing module by @plaguss in https://github.com/argilla-io/distilabel/pull/1029
  • docs: textcat tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/949
  • Add task decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1028
  • Update docs workflows to use uv by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1032
  • fix: simplify prompt template ArgillaLabeller by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1033
  • Add dataset_batch_size argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1039
  • Move all LLMs to distilabel.models by @plaguss in https://github.com/argilla-io/distilabel/pull/1045
  • Fix a tiny typo in _Step docstring by @sadra-barikbin in https://github.com/argilla-io/distilabel/pull/1051
  • docs: improve docs for MinHashDedup Step by @anakin87 in https://github.com/argilla-io/distilabel/pull/1050
  • Fix new response_format variable in openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/1053
  • [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/argilla-io/distilabel/pull/1043
  • Update LLM.generate output to include statistics by @plaguss in https://github.com/argilla-io/distilabel/pull/1034
  • Add example of structured output. by @plaguss in https://github.com/argilla-io/distilabel/pull/1061
  • feat: implenent basic SFT pipeline based on synthetic data generator by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1059
  • fix: broken import in instruction by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1063
  • Fix StepOutput type by @plaguss in https://github.com/argilla-io/distilabel/pull/1072
  • docs: update issue templates by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1074
  • Update unload method from vLLM to properly free resources by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1077
  • Add tasks to replicate Math-shepherd by @plaguss in https://github.com/argilla-io/distilabel/pull/1052
  • Add load_groups argument to run by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1075
  • Add TextGenerationWithImage task by @plaguss in https://github.com/argilla-io/distilabel/pull/1066
  • Create columns with LLM returned extra keys by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1078
  • Fix vLLM unload logic when model is None by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1080
  • Fix merge_distilabel_metadata function when handling outputs from Task with group_generations==True by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1082
  • chore: update base.py by @eltociear in https://github.com/argilla-io/distilabel/pull/1085
  • Add magpie support llama cpp ollama by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1086
  • Feat/954 llama cpp by @bikash119 in https://github.com/argilla-io/distilabel/pull/1000
  • fix import by replacing GeneratorOutput with GeneratorStepOutput by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1093
  • add mlx support by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1089
  • Support custom default headers in OpenAILLM class. by @khulaifi95 in https://github.com/argilla-io/distilabel/pull/1088
  • fix/pip install messages by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1095
  • Fix handling empty list statistics by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1094
  • update to outlines010 by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1092
  • update: search by match by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1096
  • Add Legend to Component Gallery Icons by @ParagEkbote in https://github.com/argilla-io/distilabel/pull/1090
  • Image Language Models and ImageGeneration task by @plaguss in https://github.com/argilla-io/distilabel/pull/1060
  • Update LLMs to support prompt logprobs use-case by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1099
  • 1.5.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1100

New Contributors

  • @sadra-barikbin made their first contribution in https://github.com/argilla-io/distilabel/pull/1051
  • @anakin87 made their first contribution in https://github.com/argilla-io/distilabel/pull/1050
  • @pre-commit-ci made their first contribution in https://github.com/argilla-io/distilabel/pull/1043
  • @eltociear made their first contribution in https://github.com/argilla-io/distilabel/pull/1085
  • @bikash119 made their first contribution in https://github.com/argilla-io/distilabel/pull/1000
  • @khulaifi95 made their first contribution in https://github.com/argilla-io/distilabel/pull/1088
  • @ParagEkbote made their first contribution in https://github.com/argilla-io/distilabel/pull/1090

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.2...1.5.0

Published by gabrielmbmb about 1 year ago

distilabel - 1.4.2

What's Changed

  • Fix chat template not applied in TransformersLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1083

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.1...1.4.2

Published by gabrielmbmb about 1 year ago

distilabel - 1.4.1

What's Changed

  • Fix not handling list of all primitive types in SignatureMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1037

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.0...1.4.1

Published by gabrielmbmb over 1 year ago

distilabel - 1.4.0

✨ Release highlights

Offline Batch Generation and OpenAI Batch API

We’ve updated the LLM interface so that LLMs offered through an external platform with a batch service can be integrated into distilabel. In addition, OpenAILLM has been updated so it can use the OpenAI Batch API, which offers a 50% cost reduction.

https://github.com/user-attachments/assets/9a559ae1-099b-47a4-9f92-37a3171dfbff
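The general flow behind a batch service is: serialize all the requests, submit them as a single job, and poll for the results later. A plain-Python sketch of that idea follows (a stand-in service, not the OpenAI Batch API or distilabel's implementation):

```python
import json


class FakeBatchService:
    """Stand-in for a batch endpoint: jobs 'finish' instantly."""

    def __init__(self):
        self.jobs = {}

    def submit(self, requests):
        # A real service would receive a JSONL file and process it
        # asynchronously, possibly hours later.
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = [json.dumps({"echo": r}) for r in requests]
        return job_id

    def retrieve(self, job_id):
        # A real client would poll until the job status is "completed".
        return [json.loads(line) for line in self.jobs[job_id]]


service = FakeBatchService()
job_id = service.submit(["What is 2+2?", "Name a prime number"])
results = service.retrieve(job_id)
print(results)
```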

Improved cache for maximum outputs reusability

We all know that running LLMs is costly, and most of the time we want to reuse as many of their outputs as we can. Before this release, distilabel's cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the Distiset generated by one that finished its execution and was re-executed.

In this release, we've greatly improved the cache: the outputs of all the Steps are now cached, so they can be reused in other pipeline executions even if the pipeline has changed.


In addition, we've added a use_cache attribute to the Steps that allows toggling the use of the cache at the step level.
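The underlying idea is close to memoizing each step on its configuration and inputs; here is a conceptual sketch in plain Python (not distilabel's actual cache, which persists results so they survive across executions):

```python
import hashlib
import json

_cache = {}


def run_step(name, config, batch, use_cache=True):
    """Recompute a step's outputs only when its signature or inputs change."""
    key = hashlib.sha256(
        json.dumps([name, config, batch], sort_keys=True).encode()
    ).hexdigest()
    if use_cache and key in _cache:
        return _cache[key]
    outputs = [{"text": row["text"].upper()} for row in batch]  # stand-in work
    _cache[key] = outputs
    return outputs


first = run_step("upper", {}, [{"text": "hello"}])
second = run_step("upper", {}, [{"text": "hello"}])  # served from the cache
print(first is second)
```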

Steps can generate artifacts

In some cases, a Step produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That’s why we’ve added a new method called Step.save_artifact that can be called within the step to store the artifacts it generates. The artifacts generated by the Step will also get uploaded to the Hugging Face Hub.

```python
from typing import List, TYPE_CHECKING

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```

New Tasks: CLAIR, APIGEN and many more!

  • New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A ≺ A´ is much more contrastive and precise.
  • New tasks to replicate APIGen framework: APIGenGenerator, APIGenSemanticChecker, APIGenExecutionChecker. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
  • New URIAL task that allows using non-instruct models to generate a response for an instruction.
  • New TextClassification task to perform zero-shot text classification based on a predefined but highly customizable prompt.
  • TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with two steps to run the UMAP and DBSCAN algorithms.
  • Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
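To illustrate what zero-shot text classification boils down to, here is a conceptual sketch of the kind of prompt such a task builds (made-up wording, not TextClassification's actual template):

```python
def build_prompt(text: str, labels: list[str]) -> str:
    """Builds a zero-shot classification prompt that an LLM can answer
    with exactly one of the provided labels."""
    return (
        "Classify the following text into one of these labels: "
        + ", ".join(labels)
        + f"\n\nText: {text}\nLabel:"
    )


prompt = build_prompt("I loved this movie", ["positive", "negative", "neutral"])
print(prompt)
```

The LLM's completion is then parsed back into one of the candidate labels.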

New Steps to sample data in your pipelines and remove duplicates

  • New DataSampler step to sample data from other datasets, which can be useful to inject varied few-shot examples into your prompts.
  • New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
  • New MinHashDedup step to remove near-duplicates from text based on the MinHash and MinHashLSH algorithms.
  • New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens, based on a tokenizer.
  • New CombineOutputs to combine the outputs of two or more steps into a single output.
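As an illustration of the idea behind MinHashDedup, here is a plain-Python sketch of MinHash similarity estimation (not distilabel's implementation, which pairs MinHash with MinHashLSH for efficient lookup):

```python
import hashlib


def shingles(text, n=3):
    """Character n-grams used to represent a text as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash(text, num_perm=64):
    """One minimum hash per seeded hash function; similar texts
    end up with mostly identical signatures."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the
    Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash("The quick brown fox jumps over the lazy dog")
b = minhash("The quick brown fox jumped over the lazy dog")
c = minhash("Completely unrelated sentence about databases")

sim_ab = estimated_jaccard(a, b)  # near-duplicates: high similarity
sim_ac = estimated_jaccard(a, c)  # unrelated texts: low similarity
print(sim_ab, sim_ac)
```

Texts whose estimated similarity exceeds a threshold are treated as near-duplicates and dropped.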

Generate text embeddings using vLLM

Extra things

What's Changed

  • Make ClientvLLM.model_name a cached_property by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/862
  • Pass dataset to dry_run method by @plaguss in https://github.com/argilla-io/distilabel/pull/863
  • Add default structured output for GenerateSentencePair task by @plaguss in https://github.com/argilla-io/distilabel/pull/868
  • Complexity scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/870
  • Quality scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/873
  • Ultrafeedback default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/876
  • Remove use of default_chat_template by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/888
  • Temporary fix for installing llama-cpp-python by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/886
  • Fix unit tests after release of transformers==4.44.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/891
  • Fix default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/892
  • Send as many batches as possible to input queues by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/895
  • Exclude repo_id from LoadDataFromFileSystem by @plaguss in https://github.com/argilla-io/distilabel/pull/898
  • Fix loader to read from a glob pattern by @plaguss in https://github.com/argilla-io/distilabel/pull/877
  • Add save_artifact method to _Step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/871
  • Add new add_raw_input argument to _Task so we can automatically include the formatted input by @plaguss in https://github.com/argilla-io/distilabel/pull/903
  • New TruncateTextColumn to truncate the length of texts using the number of tokens or characters by @plaguss in https://github.com/argilla-io/distilabel/pull/902
  • Update inputs and outputs interface to allow returning dict indicating optionality by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/883
  • Update mistrallm by @plaguss in https://github.com/argilla-io/distilabel/pull/904
  • Deepseek prover by @plaguss in https://github.com/argilla-io/distilabel/pull/907
  • Update RewardModelScore.inputs property by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/908
  • Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/893
  • Fix load data from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/910
  • docs: minor fixes by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/913
  • Add URIAL task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/921
  • Add vLLMEmbeddings by @plaguss in https://github.com/argilla-io/distilabel/pull/920
  • docs: add tutorials preference and clean by @sdiazlor in https://github.com/argilla-io/distilabel/pull/917
  • Fix StructuredGeneration examples and internal check by @plaguss in https://github.com/argilla-io/distilabel/pull/912
  • Generate deterministic pipeline name when it's not given by @plaguss in https://github.com/argilla-io/distilabel/pull/878
  • Add custom errors by @plaguss in https://github.com/argilla-io/distilabel/pull/911
  • Docs/tutorials fix by @sdiazlor in https://github.com/argilla-io/distilabel/pull/922
  • Add revision runtime parameter to LoadDataFromHub by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/928
  • Add plausible as replacement for GA by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/929
  • Add minhash related steps to deduplicate texts by @plaguss in https://github.com/argilla-io/distilabel/pull/931
  • docs: API reference review by @sdiazlor in https://github.com/argilla-io/distilabel/pull/932
  • Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in https://github.com/argilla-io/distilabel/pull/937
  • Update make_generator_step to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/936
  • Add CombineOutputs step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/939
  • fix: regex expression in POSITIVE_NEGATIVE by @sdiazlor in https://github.com/argilla-io/distilabel/pull/940
  • Offline batch generation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/923
  • Fix applying input mapping when mapping overrides another column by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/938
  • Fix all replicas had the same _llm_identifier for CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/941
  • Fix empty load stage when two GlobalSteps are chained by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/945
  • Add system_prompt attribute to TextGeneration by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/950
  • Add step to deduplicate records based on embeddings by @plaguss in https://github.com/argilla-io/distilabel/pull/946
  • Updated setup_logging to use UTF-8 in FileHandler by @dameikle in https://github.com/argilla-io/distilabel/pull/952
  • Add more generation parameters to vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/955
  • Fix Magpie generating different columns names depending on LLM output by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/965
  • Docs/962 docs create a smoother transition from index installation quickstart by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/968
  • Add logging_handlers argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/969
  • [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/973
  • Add TextClassification, UMAP, DBSCAN and TextClustering tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/948
  • [FEATURE] Simplify customizing the TextGeneration task with custom prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/974
  • Update system_prompt attribute for adding probabilities in MagpieBase by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/981
  • Fix unloading steps with more than 1 replica by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/982
  • docs: 960 docs add a glossary concept section by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/970
  • Fix missing system_prompt_key column in Magpie tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/983
  • docs: update component gallery by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/987
  • fix missing batch when last batch arrive early by @zye1996 in https://github.com/argilla-io/distilabel/pull/989
  • Fine personas socialai tutorial by @plaguss in https://github.com/argilla-io/distilabel/pull/992
  • feat: add basic draw implementation to pipline by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/966
  • Fix schema inference structured generation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/994
  • [DOCS] Add developer documentation section in the docs by @plaguss in https://github.com/argilla-io/distilabel/pull/999
  • Fix vllm installation in CI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1009
  • fix metadata writeout when llm error by @zye1996 in https://github.com/argilla-io/distilabel/pull/1003
  • Add example of custom text generation step in quickstart by @plaguss in https://github.com/argilla-io/distilabel/pull/984
  • feat: 985 feature argillalabeller task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/986
  • Fixllvmlite install with uv by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1018
  • fix: failing tests argilla labeller by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1017
  • fix inpute when output_mapping is not empty by @zye1996 in https://github.com/argilla-io/distilabel/pull/1015
  • Add Tasks to replicate APIGen by @plaguss in https://github.com/argilla-io/distilabel/pull/925
  • Pretty print by @plaguss in https://github.com/argilla-io/distilabel/pull/934
  • Add CLAIR task by @plaguss in https://github.com/argilla-io/distilabel/pull/926
  • Add cache at Step level by @plaguss in https://github.com/argilla-io/distilabel/pull/766
  • Fix IndexError when overriding inputs and group_generations=False by @plaguss in https://github.com/argilla-io/distilabel/pull/1022
  • Update Pipeline cache docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1023
  • 1.4.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1024

New Contributors

  • @dameikle made their first contribution in https://github.com/argilla-io/distilabel/pull/952
  • @zye1996 made their first contribution in https://github.com/argilla-io/distilabel/pull/989

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.2...1.4.0

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.2

What's Changed

  • Deepseek prover task by @plaguss in https://github.com/argilla-io/distilabel/pull/733
  • Do not cancel in progress docs workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/919
  • Fix creating Ray placement groups for vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/918
  • Fix passing base_url in model_id in InferenceEndpointsLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/924

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.1...1.3.2

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.1

What's Changed

  • Create new distilabel.constants module to store constants and avoid circular imports by @plaguss in https://github.com/argilla-io/distilabel/pull/861
  • Add OpenAI request timeout by @ashim-mahara in https://github.com/argilla-io/distilabel/pull/858

New Contributors

  • @ashim-mahara made their first contribution in https://github.com/argilla-io/distilabel/pull/858

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.0...1.3.1

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.0

What's Changed

  • Add new step CombineKeys by @plaguss in https://github.com/argilla-io/distilabel/pull/747
  • Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/758
  • Drop remove deprecated LoadHubDataset by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/759
  • Add requirements list for Pipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/720
  • Add StepResources and step replicas in Pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/750
  • Add load stages by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/760
  • Update min required version to python==3.9 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/770
  • Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/762
  • Add docs-pr.yml and docs-pr-close.yml workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/774
  • Add RayPipeline class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/769
  • Fixed closed PR workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/776
  • Add Magpie and MagpieGenerator tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/778
  • Fix some issues related to Magpie task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/783
  • Add end_with_user and include_system_prompt flags to Magpie tasks and handle Nones. by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/784
  • Add workflow concurrency group for publishing docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/796
  • Add _desired_num_gpus attribute to CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/795
  • Compatibility with vLLM with tensor_parallel_size argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/805
  • Update default names in GroupColumns by @plaguss in https://github.com/argilla-io/distilabel/pull/808
  • Request batches to GeneratorStep if only step in pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/828
  • Add default name for a pipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/809
  • Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/821
  • Some more Magpie improvements by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/833
  • Add Embeddings base class, SentenceTransformerEmbeddings class, EmbeddingGeneration and FaissNearestNeighbour steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/830
  • Create file per hostname in CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/814
  • Create a GeneratorStep from a dataset using a helper function by @plaguss in https://github.com/argilla-io/distilabel/pull/812
  • Do not take into account disable_cuda_device_placement for pipeline signature by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/838
  • Add RewardModelScore step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/840
  • Fix LoadDataFromHub attribute _dataset had ellipsis by default instead of None by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/841
  • Create PlacementGroup for steps using vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/842
  • Update argilla integration to use argilla_sdk v2 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/705
  • Make overall-rating the default aspect for UltraFeedback task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/843
  • fix typo index.md by @franperic in https://github.com/argilla-io/distilabel/pull/844
  • Use CudaDevicePlacementMixin in RewardModelScore step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/845
  • Gather GPUs per Ray node to create placement groups by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/848
  • Fix typo in docs by @plaguss in https://github.com/argilla-io/distilabel/pull/850
  • Add xfail routing batch function tests by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/852
  • Fix creating placement group when pipeline_parallel_size>1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/851
  • docs: 846 docs include google analytics by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/847
  • Add ClientvLLM class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/854
  • Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in https://github.com/argilla-io/distilabel/pull/856
  • Add bibtex references in the docstrings to be shown in the README by @plaguss in https://github.com/argilla-io/distilabel/pull/855
  • distilabel 1.3.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/857

New Contributors

  • @franperic made their first contribution in https://github.com/argilla-io/distilabel/pull/844

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.4...1.3.0

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.4

What's Changed

  • Update InferenceEndpointsLLM to use chat_completion method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/815

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.3...1.2.4

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.3

What's Changed

  • Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/786
  • Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/791
  • docs: update script for issue dashboard by @sdiazlor in https://github.com/argilla-io/distilabel/pull/775
  • Fix 404 model not found for private Serverless IE by @dvsrepo in https://github.com/argilla-io/distilabel/pull/806

New Contributors

  • @Hassaan-Qaisar made their first contribution in https://github.com/argilla-io/distilabel/pull/786

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.2...1.2.3

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.2

What's Changed

  • Fix passing input to format_output function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/781

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.1...1.2.2

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.1

What's Changed

  • Fix docs for distiset.savetodisk kwargs by @fpreiss in https://github.com/argilla-io/distilabel/pull/745
  • docs: change references by @sdiazlor in https://github.com/argilla-io/distilabel/pull/754
  • Fix response_format for TogetherLLM and AnyScaleLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/764

New Contributors

  • @fpreiss made their first contribution in https://github.com/argilla-io/distilabel/pull/745

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.0...1.2.1

- Python
Published by gabrielmbmb over 1 year ago

distilabel - 1.2.0

✨ Release highlights

Structured generation with instructor, structured generation support in InferenceEndpointsLLM, and a new StructuredGeneration task

  • instructor has been integrated, bringing support for structured generation to OpenAILLM, AnthropicLLM, LiteLLM, MistralLLM, CohereLLM and GroqLLM:

Structured generation with instructor example

```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b",
            structured_output={"schema": KnowledgeGraph}
        ),
    )
    load_dataset >> text_generation
```

  • InferenceEndpointsLLM now supports structured generation.
  • New [StructuredGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/structuredgeneration/) task that allows defining the schema of the structured generation per input row.
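To make the per-row idea concrete, here is a minimal sketch of what such inputs could look like as plain Python dicts: each row bundles its own output schema next to the prompt. The column and key names (`structured_output`, `format`, `schema`) are illustrative assumptions based on the description above, not an exact transcription of the task's API:

```python
import json

# Each row carries its own structured_output definition, so a single task
# can emit differently-shaped outputs per input. Names are illustrative.
rows = [
    {
        "instruction": "Create a user profile for John, 25 years old",
        "structured_output": {
            "format": "json",
            "schema": json.dumps({
                "type": "object",
                "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
                "required": ["name", "age"],
            }),
        },
    },
    {
        "instruction": "Extract the 5-digit ZIP code from: 'Ships from 90210'",
        "structured_output": {"format": "regex", "schema": r"\d{5}"},
    },
]

# Every row is self-describing: the schema travels with the prompt.
assert all("structured_output" in row for row in rows)
```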

New tasks for generating datasets for training embedding models

sentence-transformers v3 was recently released, and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
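For context, sentence-transformers v3 trainers typically consume pairs or triplets of sentences. A toy sketch of the kind of rows such tasks target, with a small filter for incomplete generations (field names here are hypothetical, not distilabel's exact output columns):

```python
# Illustrative only: the (anchor, positive, negative) row shape that
# embedding-training datasets commonly use. Field names are assumptions.
def to_training_triplets(rows):
    """Keep only rows where all three fields were generated."""
    return [
        (r["anchor"], r["positive"], r["negative"])
        for r in rows
        if all(r.get(k) for k in ("anchor", "positive", "negative"))
    ]

rows = [
    {"anchor": "What is the capital of France?",
     "positive": "Paris is the capital of France.",
     "negative": "Berlin is the capital of Germany."},
    # A row where generation failed is dropped by the filter.
    {"anchor": "Define photosynthesis", "positive": None, "negative": None},
]

triplets = to_training_triplets(rows)
assert len(triplets) == 1
```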

New Steps for loading data from different sources and saving/loading Distiset to disk

We've added a few new steps that allow loading data from different sources:

  • LoadDataFromDisk allows loading a Distiset or datasets.Dataset that was previously saved using the save_to_disk method.
  • LoadDataFromFileSystem allows loading a datasets.Dataset from a file system.

Thanks to @rasdani for helping us test these new steps!

In addition, we have added a save_to_disk method to Distiset, akin to datasets.Dataset.save_to_disk, which saves the generated distiset to disk along with the pipeline.yaml and pipeline.log.

`save_to_disk` example

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```

MixtureOfAgentsLLM implementation

We've added a new LLM called MixtureOfAgentsLLM derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new LLM allows generating improved outputs thanks to the collective expertise of several LLMs.

`MixtureOfAgentsLLM` example

```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```

Saving cache and passing batches to GlobalSteps optimizations

  • The cache logic of the _BatchManager has been improved to incrementally update the cache, making the process much faster.
  • The data of the input batches of the GlobalSteps will be passed to the step using the file system, as this is faster than passing it through the queue. This is possible thanks to the new integration of fsspec, which can be configured to use a local file system or cloud storage as the backend for passing the batch data.
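The optimization can be sketched with the standard library alone. This is a toy stand-in for the fsspec-backed mechanism, not distilabel's actual implementation: the producer writes the batch payload to the file system, and only a small path string travels through the queue:

```python
import json
import queue
import tempfile
from pathlib import Path

# Toy sketch: instead of pushing a large batch through the inter-process
# queue, write it to the (possibly remote, via fsspec) file system and
# push only its path. Mirrors the idea, not distilabel's internals.
batch_queue: "queue.Queue[str]" = queue.Queue()
tmpdir = Path(tempfile.mkdtemp())

def send_batch(rows: list, batch_no: int) -> None:
    path = tmpdir / f"batch-{batch_no}.json"
    path.write_text(json.dumps(rows))
    batch_queue.put(str(path))  # only the path goes through the queue

def receive_batch() -> list:
    path = batch_queue.get()
    return json.loads(Path(path).read_text())

send_batch([{"instruction": "hi"}] * 3, batch_no=0)
rows = receive_batch()
assert rows == [{"instruction": "hi"}] * 3
```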

BasePipeline and _BatchManager refactor

The logic around BasePipeline and _BatchManager has been refactored, which will make it easier to implement new pipelines in the future.

Added ArenaHard as an example of how to use distilabel to implement a benchmark

distilabel can be easily used to create an LLM benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with distilabel: Arena Hard

📚 Improved documentation structure

We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.


What's Changed

  • Add prometheus.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/656
  • Reduce time required to execute _cache method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/672
  • [DOCS] Update theme styles and images by @leiyre in https://github.com/argilla-io/distilabel/pull/667
  • Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in https://github.com/argilla-io/distilabel/pull/675
  • Add CITATION.cff by @alvarobartt in https://github.com/argilla-io/distilabel/pull/677
  • Deprecate conversation support in TextGeneration in favour of ChatGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/676
  • Add functionality to load/save distisets to/from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/673
  • Integration instructor by @plaguss in https://github.com/argilla-io/distilabel/pull/654
  • Fix docs of saving/loading distiset from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/679
  • Pass data of batches using file system by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/678
  • Add python==3.12 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/615
  • Add codspeed benchmarks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/674
  • Add StructuredGeneration task and support for grammar in InferenceEndpointsLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/680
  • Fix InferenceEndpointsLLM not using cached token by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/690
  • Add GenerateSentencePair task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/689
  • Fix prepend batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/696
  • Fix EvolQuality._apply_random_mutation not properly injecting response in template by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/703
  • [FEATURE] Include new GeneratorStep classes to load datasets from different formats by @plaguss in https://github.com/argilla-io/distilabel/pull/691
  • Add citation readme by @plaguss in https://github.com/argilla-io/distilabel/pull/712
  • Move navigation to top bar by @plaguss in https://github.com/argilla-io/distilabel/pull/708
  • Fix install_dependencies.sh by @alvarobartt in https://github.com/argilla-io/distilabel/pull/713
  • Add context to guide the generate sentence pair task if informed by @plaguss in https://github.com/argilla-io/distilabel/pull/706
  • Add examples to the LLMs to be shown in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/714
  • Gather HF_TOKEN internally when calling `Distiset.push_to_hub` if token is None by @plaguss in https://github.com/argilla-io/distilabel/pull/707
  • Implement "Improving Text Embeddings with LLMs" by @alvarobartt in https://github.com/argilla-io/distilabel/pull/683
  • Add ArenaHard benchmark and ArenaHardResults step by @alvarobartt in https://github.com/argilla-io/distilabel/pull/670
  • Refactor Pipeline and BasePipeline classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/704
  • Fix AzureOpenAILLM load method setting the correct path to mock the internal class by @plaguss in https://github.com/argilla-io/distilabel/pull/725
  • Components examples steps by @plaguss in https://github.com/argilla-io/distilabel/pull/715
  • Add examples for tasks in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/724
  • [FEATURE] Refactor of structured generation and use schemas defined in a dataset by @plaguss in https://github.com/argilla-io/distilabel/pull/688
  • Update docs document phrasing and funnel by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/718
  • docs: 728 docs api reference tasktyping cannot be imported during doc build by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/729
  • docs: 730 docs add an index to the guide overview by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/731
  • Add MixtureOfAgentsLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/735
  • Add examples/arena_hard.py and remove from distilabel core by @alvarobartt in https://github.com/argilla-io/distilabel/pull/741
  • Add serving LLM section in the docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/742
  • distilabel v1.2.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/659

New Contributors

  • @leiyre made their first contribution in https://github.com/argilla-io/distilabel/pull/667

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.1...1.2.0

- Python
Published by gabrielmbmb over 1 year ago

distilabel - 1.1.1

What's Changed

  • Fix crash when using vLLM without structured generation by @cg123 in https://github.com/argilla-io/distilabel/pull/658
  • Fix error on Pipeline.dry_run without parameters by @plaguss in https://github.com/argilla-io/distilabel/pull/655

New Contributors

  • @cg123 made their first contribution in https://github.com/argilla-io/distilabel/pull/658

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.0...1.1.1

- Python
Published by alvarobartt almost 2 years ago

distilabel - 1.1.0

Distilabel 1.1.0

Two new tasks implemented!

Genstruct task (https://github.com/argilla-io/distilabel/pull/600)

You can now use Genstruct task as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    load_dataset >> task
```

PrometheusEval task (https://github.com/argilla-io/distilabel/pull/610)

A new PrometheusEval task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":

```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
```

Connect the steps in the pipeline with >> (https://github.com/argilla-io/distilabel/pull/490)

Now you can connect your steps using the right-shift operator (`>>`) in Python:

```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset
from distilabel.steps.task.evol_instruct.base import EvolInstruct
from distilabel.steps.combine import CombineColumns

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
```

Routing batch function (https://github.com/argilla-io/distilabel/pull/595)

Thanks to the new routing_batch_function, each batch of an upstream step can be conditionally routed to a list of specific downstream steps. In addition, we have included a sample_n_steps routing batch function, making it easier to replicate the setup of the original UltraFeedback paper:

```python
import random

from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration


@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
```

Generate structured outputs using outlines (https://github.com/argilla-io/distilabel/pull/601)

You can generate JSON or regex-constrained outputs using TransformersLLM, LlamaCppLLM or vLLM thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):

```python
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
```

New GroqLLM (https://github.com/argilla-io/distilabel/pull/583)

New integration with groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:

```python
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
```

Easily test your pipeline doing a dry_run (https://github.com/argilla-io/distilabel/pull/635)

```python
with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1,  # Optional, will be set to 1 by default.
    )
```

```
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
                    INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
```

Pipeline.log file is dumped to the Hugging Face repository (#568)

From now on, when you call distiset.push_to_hub, the pipeline.log file will automatically be uploaded to your dataset repository along with pipeline.yaml to keep track of the execution.

New distilabel_metadata column to store internal data (https://github.com/argilla-io/distilabel/pull/586)

You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is really handy for keeping the raw output from an LLM: if the task does some post-processing via format_output, the original output is preserved so nothing is lost.

You can include the metadata at the task level as:

```python
TextGeneration(..., add_raw_output=True|False)
```

And directly determine whether you want this column in your final Distiset:

```python
with Pipeline(..., enable_metadata=True|False):
    ...
```

This way you can decide whether to keep the column or remove it altogether.
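To see why keeping the raw output is handy, consider a toy parser. The key name used here (`raw_output`) is illustrative; the real metadata layout is defined by distilabel, not by this sketch:

```python
import json

# A task's format_output may fail or lose detail when parsing the raw LLM
# text; keeping the raw string in a metadata column means nothing is lost.
raw_output = 'Sure! Here is the JSON: {"score": 4}'

def format_output(raw: str) -> dict:
    """Best-effort parse; returns {} when the raw text is not clean JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}

row = {
    "instruction": "Rate this answer from 1 to 5.",
    "generation": format_output(raw_output),            # parsed (here: empty)
    "distilabel_metadata": {"raw_output": raw_output},  # original preserved
}

# The parse failed, but the raw completion can still be recovered later.
assert row["generation"] == {}
assert "score" in row["distilabel_metadata"]["raw_output"]
```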

All the changes in this PR

  • Allow nested connect calls and overload rshift method to connect steps by @plaguss in https://github.com/argilla-io/distilabel/pull/490
  • Fix llm_blender installation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/557
  • Warn user about unknown runtime parameters by @plaguss in https://github.com/argilla-io/distilabel/pull/555
  • Add missing model_name, update docstrings, and add *.jinja2 templates to Task subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/560
  • Split ChatGeneration from TextGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/558
  • Set extra="forbid" in {_Step,LLM}.model_config by @alvarobartt in https://github.com/argilla-io/distilabel/pull/577
  • Infer step name by @plaguss in https://github.com/argilla-io/distilabel/pull/575
  • Change the context of subprocesses depending on the platform by @plaguss in https://github.com/argilla-io/distilabel/pull/578
  • Dump logs within a file in .cache/distilabel/pipelines dir by @plaguss in https://github.com/argilla-io/distilabel/pull/568
  • Fix empty batches causing misalignment when branching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/590
  • Add GroqLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/583
  • Add Format{Chat,Text}Generation{DPO,SFT} by @alvarobartt in https://github.com/argilla-io/distilabel/pull/584
  • Fix title in RatingQuestion of PreferenceToArgilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/597
  • Set streaming=False and add num_examples to LoadHubDataset by @plaguss in https://github.com/argilla-io/distilabel/pull/565
  • Make pipeline argument of Step optional by @plaguss in https://github.com/argilla-io/distilabel/pull/566
  • Extend LLM kwargs to align with counterparts by @alvarobartt in https://github.com/argilla-io/distilabel/pull/594
  • Add Genstruct task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/600
  • Fix num_examples to be optional in LoadHubDataset by @plaguss in https://github.com/argilla-io/distilabel/pull/603
  • Fix list_files_in_dir returning unsorted files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/609
  • Add PrometheusEval task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/610
  • Update ValueError on missing inputs message by @alvarobartt in https://github.com/argilla-io/distilabel/pull/617
  • Add routing_batch_function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/595
  • Fix pipeline.log inconsistency & include LLM info in signature by @plaguss in https://github.com/argilla-io/distilabel/pull/598
  • Add custom rubrics attribute to PrometheusEval by @alvarobartt in https://github.com/argilla-io/distilabel/pull/621
  • Update UltraFeedback paper replication to use routing_batch_function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/620
  • Add distilabel_metadata column to the datasets to include general data by @plaguss in https://github.com/argilla-io/distilabel/pull/586
  • Add the option of passing the multiprocessing context via env var by @plaguss in https://github.com/argilla-io/distilabel/pull/604
  • Add name of the pipeline to group the hashed folders by it by @plaguss in https://github.com/argilla-io/distilabel/pull/626
  • Add routing_batch_function serialization by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/628
  • Excluding model path in serialization of llamacpp by @ignacioct in https://github.com/argilla-io/distilabel/pull/633
  • Fix problem with sorting method in list_files_in_dir function by @plaguss in https://github.com/argilla-io/distilabel/pull/622
  • Add dry_run method to the pipelines to run with a single example. by @plaguss in https://github.com/argilla-io/distilabel/pull/635
  • [FEATURE] Add structured outputs using outlines by @plaguss in https://github.com/argilla-io/distilabel/pull/601
  • Force pipeline stop after 2 SIGINT signals caught by @plaguss in https://github.com/argilla-io/distilabel/pull/630
  • Refactor and update docs by @alvarobartt in https://github.com/argilla-io/distilabel/pull/634
  • Export components info & components gallery in docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/640
  • Documentation updates by @plaguss in https://github.com/argilla-io/distilabel/pull/646
  • Refactor docs 1.1.0 by @plaguss in https://github.com/argilla-io/distilabel/pull/650
  • Fix routing batch function deadlocks and unordered batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/649

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.3...1.1.0

- Python
Published by plaguss almost 2 years ago

distilabel - 1.0.3

What's Changed

  • Add stop and stop_sequences in LLM.generate subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/585

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.2...1.0.3

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 1.0.2

What's Changed

  • Fix RuntimeParamater validation when provided as _Step attr by @alvarobartt in https://github.com/argilla-io/distilabel/pull/564
  • Add seed with random.randint to ensure cache is not used by @alvarobartt in https://github.com/argilla-io/distilabel/pull/571

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.1...1.0.2

- Python
Published by alvarobartt almost 2 years ago

distilabel - 1.0.1

What's Changed

  • Fix typo in readme and remove the ToArgilla step by @dvsrepo in https://github.com/argilla-io/distilabel/pull/548
  • Fix model_validator in InferenceEndpoints due to Pipeline pickling by @alvarobartt in https://github.com/argilla-io/distilabel/pull/552

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.0...1.0.1

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 1.0.0

What's Changed

  • Add Step abstract class and new Pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/338
  • Add runtime parameters validation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/345
  • Pipeline local execution by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/346
  • Add Task (minimal implementation) by @alvarobartt in https://github.com/argilla-io/distilabel/pull/347
  • Refactor _BatchManager to have list of batches per step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/353
  • Refactor getting parameters from Step.process method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/355
  • Add LLM, OpenAILLM, TransformersLLM, and LlamaCppLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/354
  • Fix Task and TextGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/356
  • Add combine_dicts function and CombineColumns class by @alvarobartt in https://github.com/argilla-io/distilabel/pull/358
  • Add PushToHub step and fix typing by @alvarobartt in https://github.com/argilla-io/distilabel/pull/357
  • Add serialization for the new components by @plaguss in https://github.com/argilla-io/distilabel/pull/349
  • Fix OpenAILLM.api_key due to SecretStr and StepInput wrong imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/359
  • Add GlobalStep, fix _BatchManager, and add logging by @alvarobartt in https://github.com/argilla-io/distilabel/pull/362
  • Migrate vllm to the new API by @plaguss in https://github.com/argilla-io/distilabel/pull/361
  • Update _BatchManager to work with GlobalSteps and input_batch_size per step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/366
  • Clean up outdated / unused files by @alvarobartt in https://github.com/argilla-io/distilabel/pull/369
  • Add input_mappings and output_mappings attributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/367
  • Move batching from Task to LLM, fix vLLM.generate and add DISTILABEL_LOG_LEVEL by @alvarobartt in https://github.com/argilla-io/distilabel/pull/371
  • Improve runtime parameter definition by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/372
  • Add AsyncOpenAI and update OpenAILLM accordingly by @alvarobartt in https://github.com/argilla-io/distilabel/pull/381
  • Update serde by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/382
  • Add MistralLLM and add generation_kwargs as RuntimeParameters by @alvarobartt in https://github.com/argilla-io/distilabel/pull/383
  • Move steps out of pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/384
  • Add tests and docstring for Task and subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/385
  • Add step decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/387
  • Add input propagation through Task.process by @alvarobartt in https://github.com/argilla-io/distilabel/pull/399
  • Improve Pipeline error handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/400
  • Fix combine_dicts and StepInput import in PushToHub by @alvarobartt in https://github.com/argilla-io/distilabel/pull/401
  • Improve GlobalStep error handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/402
  • Changed " by italics in EvolInstruct tutorial where one "" was missing by @ignacioct in https://github.com/argilla-io/distilabel/pull/398
  • Add get_last_hidden_states method and update TransformersLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/414
  • docs: correct small typos in tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/419
  • docs: readme positioning by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/386
  • Add num_generations and group_generations parameters to Task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/416
  • Add Argilla and PromptCompletionToArgilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/420
  • Add EvolInstruct and EvolInstructGenerator tasks by @alvarobartt in https://github.com/argilla-io/distilabel/pull/407
  • Wrap optional LLM dependencies under load by @alvarobartt in https://github.com/argilla-io/distilabel/pull/428
  • Add ComplexityScorer task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/421
  • Implement caching mechanism for the pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/370
  • Add method to Pipeline to handle keyboard interruptions via ctrl+c by @plaguss in https://github.com/argilla-io/distilabel/pull/406
  • Add GenerateEmbeddings task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/427
  • Add api_key within LLM.load and add llm_kwargs as RuntimeParameter by @alvarobartt in https://github.com/argilla-io/distilabel/pull/432
  • Add GeneratorStep.process validation in DAG and smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/435
  • Add EvolComplexity task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/415
  • Add QualityScorer Task by @ignacioct in https://github.com/argilla-io/distilabel/pull/425
  • Add CudaDevicePlacementMixin class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/436
  • Return distiset from Pipeline.run by @plaguss in https://github.com/argilla-io/distilabel/pull/417
  • Update README.md by @strickvl in https://github.com/argilla-io/distilabel/pull/451
  • Add InferenceEndpointsLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/439
  • Fix Distiset after PushToHub and smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/452
  • Fix Step.process_applying_mappings by @alvarobartt in https://github.com/argilla-io/distilabel/pull/453
  • Add AnyscaleLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/447
  • Add general function to obtain schema for parquet writer by @plaguss in https://github.com/argilla-io/distilabel/pull/454
  • Add TogetherLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/449
  • Fix LLM subclasses based on OpenAILLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/455
  • Improve batching and caching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/457
  • Add EvolQuality task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/429
  • Add VertexAILLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/445
  • Add use_cache to BasePipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/463
  • Add AnthropicLLM by @sdiazlor in https://github.com/argilla-io/distilabel/pull/444
  • Add multiprocess dependency by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/467
  • Add UltraFeedback by @alvarobartt in https://github.com/argilla-io/distilabel/pull/464
  • Add OllamaLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/405
  • Add RuntimeParametersMixin and LLM runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/466
  • Add LiteLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/441
  • Add CLI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/471
  • Set _batch_manager to None after run by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/473
  • Add create_distiset function by @plaguss in https://github.com/argilla-io/distilabel/pull/480
  • Add overload to step decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/474
  • Move Enum to Dict[str, str] to avoid serialization errors during caching by @plaguss in https://github.com/argilla-io/distilabel/pull/482
  • Include a dataset card and the pipeline.yaml on Distiset.push_to_hub by @plaguss in https://github.com/argilla-io/distilabel/pull/479
  • Add PairRM task for ranking responses by @plaguss in https://github.com/argilla-io/distilabel/pull/450
  • Update _WriteBuffer to write several parquet files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/483
  • Extend Argilla integration TextGeneration, Preference, and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/472
  • Add DeitaFiltering step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/481
  • Add InstructionBacktranslation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/486
  • Fix huggingface_hub TextGenerationError import by @Wauplin in https://github.com/argilla-io/distilabel/pull/485
  • Improve azure openai support by @BramVanroy in https://github.com/argilla-io/distilabel/pull/461
  • Add SelfInstruct task by @ignacioct in https://github.com/argilla-io/distilabel/pull/456
  • Use QueueHandler for Pipeline logging by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/489
  • Improve _stop and logging by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/491
  • Fix creating empty Dataset in create_distiset function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/492
  • Add imports from __init__ modules by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/493
  • batch_size and input_batch_size runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/495
  • Update serialization method of _BatchManager to write each step on its own file by @plaguss in https://github.com/argilla-io/distilabel/pull/496
  • Fix asyncio in AsyncLLM to use the running event loop if any by @alvarobartt in https://github.com/argilla-io/distilabel/pull/501
  • Added authentication header to allow private/gated dataset use by @bjoernpl in https://github.com/argilla-io/distilabel/pull/498
  • Fix generator yielding batches all at once if batch_size == input_batch_size by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/510
  • Run output queue loop in thread and improve stop by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/511
  • Update docs for distilabel v1.0 with mkdocs-material by @plaguss in https://github.com/argilla-io/distilabel/pull/476
  • Add CohereLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/508
  • distilabel v1.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/352
  • Remove draft comment by @plaguss in https://github.com/argilla-io/distilabel/pull/515
  • Fix docs/sections/papers/*.md and add example in docs/index.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/516
  • Small fixes for the docs (images and nav bar) by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/519
  • Fix CTRL + C when still loading steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/521
  • Empty input queues when CTRL + C by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/528
  • Add filelock and flash-attn to vllm extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/529
  • Fix error in README.md when pushing the custom dataset card by @plaguss in https://github.com/argilla-io/distilabel/pull/530
  • Fix pipeline stuck when empty batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/531
  • Add EvolQuality to tasks.__init__.py by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/525
  • Show information about subprocess exception by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/532
  • Update TextGeneration.format_input method to allow OpenAI format by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/533
  • Improve create_distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/534
  • Fixes regarding RuntimeParameters and pydantic model attributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/535
  • Fix parsing LLM generation kwargs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/537
  • pass on Distiset's kwargs to Dataset.push_to_hub() by @rasdani in https://github.com/argilla-io/distilabel/pull/522
  • Set config="default" in Distiset when only one leaf Step by @alvarobartt in https://github.com/argilla-io/distilabel/pull/540
  • docs: update documentation for huggingface inference endpoints. by @burtenshaw in https://github.com/argilla-io/distilabel/pull/539
  • Remove flash-attn from vllm extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/542
  • Docs fix argilla imports by @burtenshaw in https://github.com/argilla-io/distilabel/pull/541
  • Fix not all exceptions being able to be pickled by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/543
  • Update CLI example by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/544
  • Check that Step.name doesn't contain dots or spaces by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/545
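Many of the PRs above build up the architecture that became distilabel 1.0: steps wired into a pipeline, data flowing between them in batches, and the final rows collected into a distiset returned by Pipeline.run (#417). As a rough, purely illustrative sketch of that flow — toy code, not the real distilabel API:

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

def run_pipeline(rows: List[Row], steps: List[Step], batch_size: int = 2) -> List[Row]:
    """Push rows through a linear chain of steps, batch by batch."""
    out: List[Row] = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for step in steps:
            batch = step(batch)
        out.extend(batch)
    return out

# Two toy steps: a "generator" that adds a response and a "scorer" that rates it.
def generate(batch: List[Row]) -> List[Row]:
    return [{**row, "generation": str(row["instruction"]).upper()} for row in batch]

def score(batch: List[Row]) -> List[Row]:
    return [{**row, "score": len(str(row["generation"]))} for row in batch]

distiset = run_pipeline(
    [{"instruction": "hi"}, {"instruction": "hello"}, {"instruction": "hey"}],
    steps=[generate, score],
)
```

In the real library the steps form a DAG with per-step batch sizes and a batch manager (#457, #495); the linear chain here is only the simplest case.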

New Contributors

  • @strickvl made their first contribution in https://github.com/argilla-io/distilabel/pull/451
  • @Wauplin made their first contribution in https://github.com/argilla-io/distilabel/pull/485
  • @BramVanroy made their first contribution in https://github.com/argilla-io/distilabel/pull/461
  • @bjoernpl made their first contribution in https://github.com/argilla-io/distilabel/pull/498
  • @rasdani made their first contribution in https://github.com/argilla-io/distilabel/pull/522

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.6.0...1.0.0

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 0.6.0

What's Changed

  • Fix typo in docstring of to_argilla metrics to metric_ by @burtenshaw in https://github.com/argilla-io/distilabel/pull/334
  • Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in https://github.com/argilla-io/distilabel/pull/331
  • Add examples for the deita paper tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/329
  • Add checkpoint strategy to automatically push to hub by @plaguss in https://github.com/argilla-io/distilabel/pull/321
  • docs: update tutorials to avoid argilla installation error by @sdiazlor in https://github.com/argilla-io/distilabel/pull/337
  • Fix CustomDataset.load_from_disk with str/Path objects by @plaguss in https://github.com/argilla-io/distilabel/pull/341
  • Clarify number of generations produced when using LLMPool in docs by @davanstrien in https://github.com/argilla-io/distilabel/pull/339
  • Refactor _build_dataset piece for speed by @plaguss in https://github.com/argilla-io/distilabel/pull/344
  • Fix documentation and type variables in CustomDataset checkpoint methods by @plaguss in https://github.com/argilla-io/distilabel/pull/342
  • US Spelling and other typo correction on Distilabel tutorials by @ignacioct in https://github.com/argilla-io/distilabel/pull/324
  • docs: add a tutorial for evol-instruct by @sdiazlor in https://github.com/argilla-io/distilabel/pull/327
  • Fix OpenAI API error with OpenAI-compatible providers by @jphme in https://github.com/argilla-io/distilabel/pull/351
  • Add fix for labels not returned by openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/364
  • Refactor model availability check in is_serverless_endpoint_available by @davanstrien in https://github.com/argilla-io/distilabel/pull/363
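The checkpoint strategy added in #321 periodically pushes partial results so a long generation run isn't lost mid-way. A toy version of the "save every N rows" idea — local lists stand in for pushes to the Hugging Face Hub, and none of these names come from the distilabel API:

```python
from typing import List

class Checkpoint:
    """Persist accumulated rows every `save_frequency` rows (toy stand-in for push-to-hub)."""

    def __init__(self, save_frequency: int) -> None:
        self.save_frequency = save_frequency
        self.saves: List[List[str]] = []  # each entry represents one "push"

    def maybe_save(self, rows: List[str]) -> None:
        # Snapshot whenever we cross a multiple of the save frequency.
        if len(rows) % self.save_frequency == 0:
            self.saves.append(list(rows))

rows: List[str] = []
ckpt = Checkpoint(save_frequency=2)
for i in range(5):
    rows.append(f"row-{i}")
    ckpt.maybe_save(rows)
```

The trade-off is the usual one: a smaller save frequency loses less work on a crash but spends more time uploading.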

New Contributors

  • @burtenshaw made their first contribution in https://github.com/argilla-io/distilabel/pull/334
  • @jphme made their first contribution in https://github.com/argilla-io/distilabel/pull/351

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.5.0...0.6.0

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 0.5.0

What's Changed

  • fix: Correct import error by @plaguss in https://github.com/argilla-io/distilabel/pull/279
  • fix: Filter examples for which len generations != len ratings by @plaguss in https://github.com/argilla-io/distilabel/pull/284
  • feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/262
  • feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/271
  • feat: Add to_argilla method to EvolInstructTask generated datasets by @plaguss in https://github.com/argilla-io/distilabel/pull/291
  • docs: Shorten titles tutorials and update core example by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/289
  • feat: Add new serialization strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/288
  • feat: Review OllamaLLM and TogetherInferenceLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/305
  • refactor: Remove Metadata for Ratings by @ignacioct in https://github.com/argilla-io/distilabel/pull/303
  • docs: Add missing VertexAI information within README.md and docs/index.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/308
  • feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in https://github.com/argilla-io/distilabel/pull/297
  • feat: Add ComplexityScorer and QualityScorer tasks from Deita by @plaguss in https://github.com/argilla-io/distilabel/pull/302
  • fix: Fix logging visualization of labeller pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/310
  • feat: Add Improving Text Embeddings with LLMs tutorial by @alvarobartt in https://github.com/argilla-io/distilabel/pull/313
  • feat: Add EvolComplexity and EvolQuality by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/299
  • feat: Add validate_prompts method to LLMs to help validating the prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/314
  • fix: typo in clean an existing preference dataset by @sdiazlor in https://github.com/argilla-io/distilabel/pull/312
  • feat: Add new column for sft fine tuning with prepare_dataset by @plaguss in https://github.com/argilla-io/distilabel/pull/309
  • docs: Custom Task Documentation by @ignacioct in https://github.com/argilla-io/distilabel/pull/275
  • refactor: Align the LLM subclasses args by @alvarobartt in https://github.com/argilla-io/distilabel/pull/315
  • feat: Include rationale of the model responses on prepare_dataset if available by @plaguss in https://github.com/argilla-io/distilabel/pull/317
  • feat: Add embedding tutorial to docs by @ignacioct in https://github.com/argilla-io/distilabel/pull/319
  • feat: Add MistralAILLM by @plaguss in https://github.com/argilla-io/distilabel/pull/293
  • feat: Use ollama Python client within OllamaLLM by @sdiazlor in https://github.com/argilla-io/distilabel/pull/307
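Several of these PRs (prepare_dataset in #309 and #317, the Deita scorers in #302) revolve around turning multiple rated generations into preference pairs for fine-tuning. A toy illustration of the binarization idea — not the actual distilabel implementation:

```python
from typing import Dict, List

def binarize(example: Dict[str, list]) -> Dict[str, str]:
    """Pick the best- and worst-rated generations as a chosen/rejected pair."""
    ranked = sorted(zip(example["ratings"], example["generations"]), reverse=True)
    return {
        "prompt": example["prompt"],
        "chosen": ranked[0][1],
        "rejected": ranked[-1][1],
    }

pair = binarize({
    "prompt": "Explain DPO briefly.",
    "generations": ["short answer", "detailed, correct answer", "off-topic"],
    "ratings": [6, 9, 2],
})
```

The real prepare_dataset also carries extra columns (e.g. the labeller's rationale when available, per #317); only the core chosen/rejected selection is sketched here.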

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.4.0...0.5.0

- Python
Published by plaguss about 2 years ago

distilabel - 0.4.0

What's Changed

  • docs: Notus end2end example for preference and instruction generation by @ignacioct in https://github.com/argilla-io/distilabel/pull/145
  • docs: binders anchors by @ignacioct in https://github.com/argilla-io/distilabel/pull/235
  • feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in https://github.com/argilla-io/distilabel/pull/238
  • docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in https://github.com/argilla-io/distilabel/pull/249
  • feat: add ETA to progress bar and fix not showing the progress bar if irrelevant by @ignacioct in https://github.com/argilla-io/distilabel/pull/253
  • feat: Add Evol instruct task by @plaguss in https://github.com/argilla-io/distilabel/pull/237
  • docs: rename enable_checkpoints to checkpoint_strategy by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/257
  • feat: Fixing progress bar and ETA by @ignacioct in https://github.com/argilla-io/distilabel/pull/260
  • fix: resolved error with self instruct to argilla method by @plaguss in https://github.com/argilla-io/distilabel/pull/265
  • chore: Add extra check in LLMPool to ensure all the tasks share the same parent class by @plaguss in https://github.com/argilla-io/distilabel/pull/266
  • fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in https://github.com/argilla-io/distilabel/pull/267
  • feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in https://github.com/argilla-io/distilabel/pull/269
  • docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in https://github.com/argilla-io/distilabel/pull/270
  • feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in https://github.com/argilla-io/distilabel/pull/264
  • feat: add support ollama api by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/250
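The Evol-Instruct task added in #237 iteratively rewrites an instruction using randomly chosen mutation prompts, keeping the evolution trail. A toy version of that loop, with trivial string transformations standing in for the LLM-driven mutations (all names here are illustrative, not distilabel's):

```python
import random
from typing import Callable, List

def evolve(instruction: str, mutations: List[Callable[[str], str]],
           num_evolutions: int, seed: int = 0) -> List[str]:
    """Apply randomly chosen mutations, keeping the whole evolution trail."""
    rng = random.Random(seed)  # seeded for reproducibility
    trail = [instruction]
    for _ in range(num_evolutions):
        mutate = rng.choice(mutations)
        trail.append(mutate(trail[-1]))
    return trail

# Trivial stand-ins for the paper's "add constraints" / "deepen" mutation prompts.
def add_constraint(s: str) -> str:
    return s + " (answer in one sentence)"

def deepen(s: str) -> str:
    return s + " and explain why"

trail = evolve("Describe recursion", [add_constraint, deepen], num_evolutions=2)
```

In the real task each mutation is itself an LLM call, and a judge can discard failed evolutions; this sketch keeps only the loop structure.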

New Contributors

  • @philschmid made their first contribution in https://github.com/argilla-io/distilabel/pull/238
  • @davanstrien made their first contribution in https://github.com/argilla-io/distilabel/pull/249
  • @sdiazlor made their first contribution in https://github.com/argilla-io/distilabel/pull/270

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.3.0...0.4.0

- Python
Published by davidberenstein1957 about 2 years ago

distilabel - 0.3.0

What's Changed

  • Add VertexAILLM & VertexAIEndpointLLM classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/204
  • Add draft with social cards by @plaguss in https://github.com/argilla-io/distilabel/pull/197
  • Relax LLMPool check to match parent Task instead by @plaguss in https://github.com/argilla-io/distilabel/pull/210
  • Align README.md with docs/ and minor fixes / improvements by @alvarobartt in https://github.com/argilla-io/distilabel/pull/214
  • Add TogetherInferenceLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/215
  • Add checking valid inputs before calling _generate by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/216
  • Add TogetherInferenceLLM tests by @alvarobartt in https://github.com/argilla-io/distilabel/pull/217
  • Add Vertex AI LLMs documentation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/222
  • Documentation review by @alvarobartt in https://github.com/argilla-io/distilabel/pull/223
  • Rename for_text_quality to for_overall_quality method in UltraFeedbackTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/224
  • Add Anyscale endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/213
  • Feature dataset checkpoint strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/194
  • Fix rating parsing in RatingToArgillaMixin.to_argilla_record by @alvarobartt in https://github.com/argilla-io/distilabel/pull/227
  • Add badges to readme by @plaguss in https://github.com/argilla-io/distilabel/pull/226
  • Fix badges by @dvsrepo in https://github.com/argilla-io/distilabel/pull/228
  • Update LICENSE and add LICENSE_HEADER by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/221

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.1...0.3.0

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.2.1

What's Changed

  • Fix PrometheusTask could not be imported by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/190
  • Fix LLM.return_futures by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/192
  • Remove learn section from docs until developed by @plaguss in https://github.com/argilla-io/distilabel/pull/188
  • Add markdown to fields by default by @plaguss in https://github.com/argilla-io/distilabel/pull/189
  • Fix PrometheusTask and UltraCMTask could not be chained with TextGenerationTask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/195
  • Add missing use_markdown for every field by @plaguss in https://github.com/argilla-io/distilabel/pull/196
  • Add to_argilla_{dataset,record} for CritiqueTask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/198
  • Update generate_prompt in Task subclasses to always return Prompt by @alvarobartt in https://github.com/argilla-io/distilabel/pull/199
  • Add CritiqueTask documentation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/200
  • Fix UltraCMTask scoring range and align argilla imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/201

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.0...0.2.1

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.2.0

What's Changed

  • adds accelerate example by @edbeeching in https://github.com/argilla-io/distilabel/pull/141
  • Add a dry-run when calling Pipeline.generate by @alvarobartt in https://github.com/argilla-io/distilabel/pull/146
  • Add Notus format in Prompt.format_as and update examples/*.py by @alvarobartt in https://github.com/argilla-io/distilabel/pull/147
  • Add ProcessLLM class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/151
  • Adds CritiqueTask, UltraCMTask and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/152
  • docs: add llama.cpp to extras by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/154
  • Fix _build_dataset as processed_labels were ignored by @plaguss in https://github.com/argilla-io/distilabel/pull/158
  • Add to_argilla_{dataset,record} methods in TextGenerationTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/159
  • Fix UltraFeedbackTask.to_argilla_dataset ratings values by @alvarobartt in https://github.com/argilla-io/distilabel/pull/160
  • Align typing and typing_extensions with supported Python versions by @alvarobartt in https://github.com/argilla-io/distilabel/pull/161
  • Add LLMPool class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/156
  • Add missing CritiqueTask and UltraCMTask in __init__ and move argilla_utils to utils.argilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/162
  • Add test workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/163
  • Update LLM to return Future[List[List[LLMOutput]]] by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/164
  • Add PrometheusTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/165
  • Randomise generations order by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/167
  • Add custom to_argilla_{dataset,record} to SelfInstructTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/169
  • Fix shuffle_before_labelling and progress bar in Pipeline.generate by @alvarobartt in https://github.com/argilla-io/distilabel/pull/170
  • Replace multiprocessing with multiprocess by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/171
  • Refactor and improve docs by @plaguss in https://github.com/argilla-io/distilabel/pull/134
  • Fix SelfInstructTask.{parse_output,to_argilla_record} methods and _build_dataset by @alvarobartt in https://github.com/argilla-io/distilabel/pull/172
  • Fix results didn't have same order as futures by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/173
  • Remove unnecessary plugin by @plaguss in https://github.com/argilla-io/distilabel/pull/174
  • Add {generation,labelling}_model column as metadata in Argilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/175
  • Fix exporting model name to Argilla with LLMPool by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/177
  • Update docs to include info about ProcessLLM and LLMPool by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/176
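A recurring theme in this release is making generation asynchronous: #164 updates LLM to return Future[List[List[LLMOutput]]] and #173 fixes results not matching the order of their futures. A toy sketch of that pattern with concurrent.futures — a stand-in class, not the distilabel LLM interface:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

class ToyLLM:
    """Returns a Future so callers can schedule generation and collect results later."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=2)

    def generate(self, prompts: List[str],
                 num_generations: int = 2) -> "Future[List[List[str]]]":
        def _work() -> List[List[str]]:
            # One inner list of generations per prompt, prompt order preserved.
            return [[f"{p} #{i}" for i in range(num_generations)] for p in prompts]

        return self._executor.submit(_work)

llm = ToyLLM()
future = llm.generate(["a", "b"])
outputs = future.result()  # blocks until the work finishes
```

Submitting the whole batch as one unit of work is what keeps the outer list aligned with the prompts, which is the ordering property #173 restores.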

New Contributors

  • @edbeeching made their first contribution in https://github.com/argilla-io/distilabel/pull/141
  • @davidberenstein1957 made their first contribution in https://github.com/argilla-io/distilabel/pull/154

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.1...0.2.0

- Python
Published by gabrielmbmb about 2 years ago

distilabel - 0.1.1

What's Changed

  • Template for Documentation Issue created by @ignacioct in https://github.com/argilla-io/distilabel/pull/128
  • self.thread_pool_executor can be None, protecting it for print by @ignacioct in https://github.com/argilla-io/distilabel/pull/129
  • Use do_sample in transformers example by @dvsrepo in https://github.com/argilla-io/distilabel/pull/138
  • Fix llama-cpp and hf-inference-endpoints extras in pyproject.toml by @plaguss in https://github.com/argilla-io/distilabel/pull/139
  • Fix llama_cpp_python dependency check by @plaguss in https://github.com/argilla-io/distilabel/pull/140

New Contributors

  • @ignacioct made their first contribution in https://github.com/argilla-io/distilabel/pull/128
  • @plaguss made their first contribution in https://github.com/argilla-io/distilabel/pull/139

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.0...0.1.1

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.1.0

Stable Release - v0.1.0

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.1.0rc2

- Python
Published by gabrielmbmb over 2 years ago

distilabel - 0.1.0rc1

- Python
Published by gabrielmbmb over 2 years ago

distilabel - 0.1.0rc0

0.1.0rc0

- Python
Published by gabrielmbmb over 2 years ago