Recent Releases of distilabel

distilabel - 1.5.3

What's Changed

  • Fix typo by @Riezebos in https://github.com/argilla-io/distilabel/pull/1111
  • Checks for images using PIL only if available by @plaguss in https://github.com/argilla-io/distilabel/pull/1112
  • Fix pipeline getting stuck when multiple step replicas by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1113

New Contributors

  • @Riezebos made their first contribution in https://github.com/argilla-io/distilabel/pull/1111

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.2...1.5.3

- Python
Published by gabrielmbmb about 1 year ago

distilabel - 1.5.2

What's Changed

  • Fix structured output JSON to pydantic.BaseModel and LiteLLM async completion client by @rolshoven in https://github.com/argilla-io/distilabel/pull/1105

New Contributors

  • @rolshoven made their first contribution in https://github.com/argilla-io/distilabel/pull/1105

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.1...1.5.2

Published by gabrielmbmb about 1 year ago

distilabel - 1.5.1

What's Changed

  • Remove deprecated CombineColumns step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1101
  • Fix image import handling and update MlxLLM initialisation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1102
  • Fix MlxLLM by aligning it with mlx-lm>=0.21 by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1103
  • 1.5.1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1104

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.0...1.5.1

Published by gabrielmbmb about 1 year ago

distilabel - 1.5.0

✨ Release highlights

🖼️ Image Generation Support

We're excited to introduce ImageGenerationModel, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.

Available Services

  • 🤗 InferenceEndpointsImageGeneration: Integration with Hugging Face's Inference Endpoints
  • OpenAIImageGeneration: Integration with OpenAI's DALL-E

Architecture

Just as LLMs are used by a Task, we've introduced ImageTask as a high-level abstraction for image generation workflows. ImageTask defines how a step should use an ImageGenerationModel to accomplish specific image generation tasks.

Our first implementation, the ImageGeneration task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.

We've also added a small tutorial on how to generate images using distilabel: distilabel - Tutorials - Image generation with distilabel

Images as inputs for LLMs

We've added initial support for providing images as input to an LLM through the new TextGenerationWithImage task. We've updated and tested InferenceEndpointsLLM and OpenAILLM with this new task, and we'll add image-as-input compatibility for others, such as vLLM, in upcoming releases.

Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!

💻 New MlxLLM integration

We've integrated mlx-lm package with the new MlxLLM class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.

New InstructionResponsePipeline template

We've started making changes so that distilabel is easier to use from minute one. We'll be adding presets or templates that let you quickly get a pipeline with sensible preconfigured defaults for generating data for certain tasks. The first one we've worked on is the SFT or Instruction Response tuning pipeline, which you can use like this:

```python
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
distiset = pipeline.run()
```

Define load stages

We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.

We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
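Conceptually, a load group is just a list of step names that may be loaded at the same time; the pipeline then runs the groups one stage after another. Here is a minimal plain-Python sketch of that scheduling idea (the step names are made up; in distilabel you would pass this structure as the `load_groups` argument of `Pipeline.run`):

```python
# Hypothetical step names; each inner list is one load group.
load_groups = [
    ["generate_responses"],                # stage 1: the heavy LLM step alone
    ["score_responses", "format_output"],  # stage 2: loaded after stage 1 unloads
]

execution_order = []
for stage, group in enumerate(load_groups):
    # In a real pipeline each group is loaded, executed, and unloaded
    # before the next stage starts; here we only record the schedule.
    for step_name in group:
        execution_order.append((stage, step_name))

print(execution_order)
```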

What's Changed

  • Add common typing module by @plaguss in https://github.com/argilla-io/distilabel/pull/1029
  • docs: textcat tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/949
  • Add task decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1028
  • Update docs workflows to use uv by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1032
  • fix: simplify prompt template ArgillaLabeller by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1033
  • Add dataset_batch_size argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1039
  • Move all LLMs to distilabel.models by @plaguss in https://github.com/argilla-io/distilabel/pull/1045
  • Fix a tiny typo in _Step docstring by @sadra-barikbin in https://github.com/argilla-io/distilabel/pull/1051
  • docs: improve docs for MinHashDedup Step by @anakin87 in https://github.com/argilla-io/distilabel/pull/1050
  • Fix new response_format variable in openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/1053
  • [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/argilla-io/distilabel/pull/1043
  • Update LLM.generate output to include statistics by @plaguss in https://github.com/argilla-io/distilabel/pull/1034
  • Add example of structured output. by @plaguss in https://github.com/argilla-io/distilabel/pull/1061
  • feat: implenent basic SFT pipeline based on synthetic data generator by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1059
  • fix: broken import in instruction by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1063
  • Fix StepOutput type by @plaguss in https://github.com/argilla-io/distilabel/pull/1072
  • docs: update issue templates by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1074
  • Update unload method from vLLM to properly free resources by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1077
  • Add tasks to replicate Math-shepherd by @plaguss in https://github.com/argilla-io/distilabel/pull/1052
  • Add load_groups argument to run by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1075
  • Add TextGenerationWithImage task by @plaguss in https://github.com/argilla-io/distilabel/pull/1066
  • Create columns with LLM returned extra keys by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1078
  • Fix vLLM unload logic when model is None by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1080
  • Fix merge_distilabel_metadata function when handling outputs from Task with group_generations==True by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1082
  • chore: update base.py by @eltociear in https://github.com/argilla-io/distilabel/pull/1085
  • Add magpie support llama cpp ollama by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1086
  • Feat/954 llama cpp by @bikash119 in https://github.com/argilla-io/distilabel/pull/1000
  • fix import by replacing GeneratorOutput with GeneratorStepOutput by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1093
  • add mlx support by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1089
  • Support custom default headers in OpenAILLM class. by @khulaifi95 in https://github.com/argilla-io/distilabel/pull/1088
  • fix/pip install messages by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1095
  • Fix handling empty list statistics by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1094
  • update to outlines010 by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1092
  • update: search by match by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1096
  • Add Legend to Component Gallery Icons by @ParagEkbote in https://github.com/argilla-io/distilabel/pull/1090
  • Image Language Models and ImageGeneration task by @plaguss in https://github.com/argilla-io/distilabel/pull/1060
  • Update LLMs to support prompt logprobs use-case by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1099
  • 1.5.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1100

New Contributors

  • @sadra-barikbin made their first contribution in https://github.com/argilla-io/distilabel/pull/1051
  • @anakin87 made their first contribution in https://github.com/argilla-io/distilabel/pull/1050
  • @pre-commit-ci made their first contribution in https://github.com/argilla-io/distilabel/pull/1043
  • @eltociear made their first contribution in https://github.com/argilla-io/distilabel/pull/1085
  • @bikash119 made their first contribution in https://github.com/argilla-io/distilabel/pull/1000
  • @khulaifi95 made their first contribution in https://github.com/argilla-io/distilabel/pull/1088
  • @ParagEkbote made their first contribution in https://github.com/argilla-io/distilabel/pull/1090

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.2...1.5.0

Published by gabrielmbmb about 1 year ago

distilabel - 1.4.2

What's Changed

  • Fix chat template not applied in TransformersLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1083

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.1...1.4.2

Published by gabrielmbmb about 1 year ago

distilabel - 1.4.1

What's Changed

  • Fix not handling list of all primitive types in SignatureMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1037

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.0...1.4.1

Published by gabrielmbmb over 1 year ago

distilabel - 1.4.0

✨ Release highlights

Offline Batch Generation and OpenAI Batch API

We’ve updated the LLM interface so that LLMs offered through an external platform with a batch service can be integrated into distilabel. In addition, OpenAILLM has been updated so it can use the OpenAI Batch API, which offers a 50% cost reduction.

https://github.com/user-attachments/assets/9a559ae1-099b-47a4-9f92-37a3171dfbff
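The general flow behind a batch service is: serialize all the requests, submit them as a single job, and poll for the results later. A plain-Python sketch of that idea follows (a stand-in service, not the OpenAI Batch API or distilabel's implementation):

```python
import json


class FakeBatchService:
    """Stand-in for a batch endpoint: jobs 'finish' instantly."""

    def __init__(self):
        self.jobs = {}

    def submit(self, requests):
        # A real service would receive a JSONL file and process it
        # asynchronously, possibly hours later.
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = [json.dumps({"echo": r}) for r in requests]
        return job_id

    def retrieve(self, job_id):
        # A real client would poll until the job status is "completed".
        return [json.loads(line) for line in self.jobs[job_id]]


service = FakeBatchService()
job_id = service.submit(["What is 2+2?", "Name a prime number"])
results = service.retrieve(job_id)
print(results)
```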

Improved cache for maximum outputs reusability

We all know that running LLMs is costly, and most of the time we want to reuse as many of their outputs as we can. Before this release, distilabel's cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the Distiset generated by one that finished its execution and was re-executed.

In this release, we've greatly improved the cache: the outputs of all the Steps are now cached, so they can be reused in other pipeline executions even if the pipeline has changed.


In addition, we've added a use_cache attribute to the Steps that allows toggling the use of the cache at the step level.
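The underlying idea is close to memoizing each step on its configuration and inputs; here is a conceptual sketch in plain Python (not distilabel's actual cache, which persists results so they survive across executions):

```python
import hashlib
import json

_cache = {}


def run_step(name, config, batch, use_cache=True):
    """Recompute a step's outputs only when its signature or inputs change."""
    key = hashlib.sha256(
        json.dumps([name, config, batch], sort_keys=True).encode()
    ).hexdigest()
    if use_cache and key in _cache:
        return _cache[key]
    outputs = [{"text": row["text"].upper()} for row in batch]  # stand-in work
    _cache[key] = outputs
    return outputs


first = run_step("upper", {}, [{"text": "hello"}])
second = run_step("upper", {}, [{"text": "hello"}])  # served from the cache
print(first is second)
```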

Steps can generate artifacts

In some cases, a Step produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That’s why we’ve added a new method called Step.save_artifact that can be called within the step to store the artifacts it generates. The artifacts generated by the Step will also get uploaded to the Hugging Face Hub.

```python
from typing import List, TYPE_CHECKING

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```

New Tasks: CLAIR, APIGEN and many more!

  • New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A ≺ A´ is much more contrastive and precise.
  • New tasks to replicate APIGen framework: APIGenGenerator, APIGenSemanticChecker, APIGenExecutionChecker. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
  • New URIAL task that allows using non-instruct models to generate a response for an instruction.
  • New TextClassification task to perform zero-shot text classification based on a predefined but highly customizable prompt.
  • TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with two steps to run the UMAP and DBSCAN algorithms.
  • Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
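To illustrate what zero-shot text classification boils down to, here is a conceptual sketch of the kind of prompt such a task builds (made-up wording, not TextClassification's actual template):

```python
def build_prompt(text: str, labels: list[str]) -> str:
    """Builds a zero-shot classification prompt that an LLM can answer
    with exactly one of the provided labels."""
    return (
        "Classify the following text into one of these labels: "
        + ", ".join(labels)
        + f"\n\nText: {text}\nLabel:"
    )


prompt = build_prompt("I loved this movie", ["positive", "negative", "neutral"])
print(prompt)
```

The LLM's completion is then parsed back into one of the candidate labels.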

New Steps to sample data in your pipelines and remove duplicates

  • New DataSampler step to sample data from other datasets, which can be useful to inject varied few-shot examples into your prompts.
  • New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
  • New MinHashDedup step to remove near-duplicates from text based on the MinHash and MinHashLSH algorithms.
  • New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens, based on a tokenizer.
  • New CombineOutputs to combine the outputs of two or more steps into a single output.
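As an illustration of the idea behind MinHashDedup, here is a plain-Python sketch of MinHash similarity estimation (not distilabel's implementation, which pairs MinHash with MinHashLSH for efficient lookup):

```python
import hashlib


def shingles(text, n=3):
    """Character n-grams used to represent a text as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash(text, num_perm=64):
    """One minimum hash per seeded hash function; similar texts
    end up with mostly identical signatures."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the
    Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash("The quick brown fox jumps over the lazy dog")
b = minhash("The quick brown fox jumped over the lazy dog")
c = minhash("Completely unrelated sentence about databases")

sim_ab = estimated_jaccard(a, b)  # near-duplicates: high similarity
sim_ac = estimated_jaccard(a, c)  # unrelated texts: low similarity
print(sim_ab, sim_ac)
```

Texts whose estimated similarity exceeds a threshold are treated as near-duplicates and dropped.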

Generate text embeddings using vLLM

Extra things

What's Changed

  • Make ClientvLLM.model_name a cached_property by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/862
  • Pass dataset to dry_run method by @plaguss in https://github.com/argilla-io/distilabel/pull/863
  • Add default structured output for GenerateSentencePair task by @plaguss in https://github.com/argilla-io/distilabel/pull/868
  • Complexity scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/870
  • Quality scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/873
  • Ultrafeedback default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/876
  • Remove use of default_chat_template by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/888
  • Temporary fix for installing llama-cpp-python by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/886
  • Fix unit tests after release of transformers==4.44.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/891
  • Fix default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/892
  • Send as many batches as possible to input queues by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/895
  • Exclude repo_id from LoadDataFromFileSystem by @plaguss in https://github.com/argilla-io/distilabel/pull/898
  • Fix loader to read from a glob pattern by @plaguss in https://github.com/argilla-io/distilabel/pull/877
  • Add save_artifact method to _Step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/871
  • Add new add_raw_input argument to _Task so we can automatically include the formatted input by @plaguss in https://github.com/argilla-io/distilabel/pull/903
  • New TruncateTextColumn to truncate the length of texts using the number of tokens or characters by @plaguss in https://github.com/argilla-io/distilabel/pull/902
  • Update inputs and outputs interface to allow returning dict indicating optionality by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/883
  • Update mistrallm by @plaguss in https://github.com/argilla-io/distilabel/pull/904
  • Deepseek prover by @plaguss in https://github.com/argilla-io/distilabel/pull/907
  • Update RewardModelScore.inputs property by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/908
  • Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/893
  • Fix load data from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/910
  • docs: minor fixes by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/913
  • Add URIAL task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/921
  • Add vLLMEmbeddings by @plaguss in https://github.com/argilla-io/distilabel/pull/920
  • docs: add tutorials preference and clean by @sdiazlor in https://github.com/argilla-io/distilabel/pull/917
  • Fix StructuredGeneration examples and internal check by @plaguss in https://github.com/argilla-io/distilabel/pull/912
  • Generate deterministic pipeline name when it's not given by @plaguss in https://github.com/argilla-io/distilabel/pull/878
  • Add custom errors by @plaguss in https://github.com/argilla-io/distilabel/pull/911
  • Docs/tutorials fix by @sdiazlor in https://github.com/argilla-io/distilabel/pull/922
  • Add revision runtime parameter to LoadDataFromHub by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/928
  • Add plausible as replacement for GA by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/929
  • Add minhash related steps to deduplicate texts by @plaguss in https://github.com/argilla-io/distilabel/pull/931
  • docs: API reference review by @sdiazlor in https://github.com/argilla-io/distilabel/pull/932
  • Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in https://github.com/argilla-io/distilabel/pull/937
  • Update make_generator_step to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/936
  • Add CombineOutputs step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/939
  • fix: regex expression in POSITIVE_NEGATIVE by @sdiazlor in https://github.com/argilla-io/distilabel/pull/940
  • Offline batch generation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/923
  • Fix applying input mapping when mapping overrides another column by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/938
  • Fix all replicas had the same _llm_identifier for CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/941
  • Fix empty load stage when two GlobalSteps are chained by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/945
  • Add system_prompt attribute to TextGeneration by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/950
  • Add step to deduplicate records based on embeddings by @plaguss in https://github.com/argilla-io/distilabel/pull/946
  • Updated setup_logging to use UTF-8 in FileHandler by @dameikle in https://github.com/argilla-io/distilabel/pull/952
  • Add more generation parameters to vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/955
  • Fix Magpie generating different columns names depending on LLM output by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/965
  • Docs/962 docs create a smoother transition from index installation quickstart by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/968
  • Add logging_handlers argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/969
  • [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/973
  • Add TextClassification, UMAP, DBSCAN and TextClustering tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/948
  • [FEATURE] Simplify customizing the TextGeneration task with custom prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/974
  • Update system_prompt attribute for adding probabilities in MagpieBase by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/981
  • Fix unloading steps with more than 1 replica by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/982
  • docs: 960 docs add a glossary concept section by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/970
  • Fix missing system_prompt_key column in Magpie tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/983
  • docs: update component gallery by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/987
  • fix missing batch when last batch arrive early by @zye1996 in https://github.com/argilla-io/distilabel/pull/989
  • Fine personas socialai tutorial by @plaguss in https://github.com/argilla-io/distilabel/pull/992
  • feat: add basic draw implementation to pipline by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/966
  • Fix schema inference structured generation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/994
  • [DOCS] Add developer documentation section in the docs by @plaguss in https://github.com/argilla-io/distilabel/pull/999
  • Fix vllm installation in CI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1009
  • fix metadata writeout when llm error by @zye1996 in https://github.com/argilla-io/distilabel/pull/1003
  • Add example of custom text generation step in quickstart by @plaguss in https://github.com/argilla-io/distilabel/pull/984
  • feat: 985 feature argillalabeller task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/986
  • Fixllvmlite install with uv by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1018
  • fix: failing tests argilla labeller by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1017
  • fix inpute when output_mapping is not empty by @zye1996 in https://github.com/argilla-io/distilabel/pull/1015
  • Add Tasks to replicate APIGen by @plaguss in https://github.com/argilla-io/distilabel/pull/925
  • Pretty print by @plaguss in https://github.com/argilla-io/distilabel/pull/934
  • Add CLAIR task by @plaguss in https://github.com/argilla-io/distilabel/pull/926
  • Add cache at Step level by @plaguss in https://github.com/argilla-io/distilabel/pull/766
  • Fix IndexError when overriding inputs and group_generations=False by @plaguss in https://github.com/argilla-io/distilabel/pull/1022
  • Update Pipeline cache docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1023
  • 1.4.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1024

New Contributors

  • @dameikle made their first contribution in https://github.com/argilla-io/distilabel/pull/952
  • @zye1996 made their first contribution in https://github.com/argilla-io/distilabel/pull/989

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.2...1.4.0

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.2

What's Changed

  • Deepseek prover task by @plaguss in https://github.com/argilla-io/distilabel/pull/733
  • Do not cancel in progress docs workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/919
  • Fix creating Ray placement groups for vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/918
  • Fix passing base_url in model_id in InferenceEndpointsLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/924

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.1...1.3.2

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.1

What's Changed

  • Create new distilabel.constants module to store constants and avoid circular imports by @plaguss in https://github.com/argilla-io/distilabel/pull/861
  • Add OpenAI request timeout by @ashim-mahara in https://github.com/argilla-io/distilabel/pull/858

New Contributors

  • @ashim-mahara made their first contribution in https://github.com/argilla-io/distilabel/pull/858

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.0...1.3.1

Published by gabrielmbmb over 1 year ago

distilabel - 1.3.0

What's Changed

  • Add new step CombineKeys by @plaguss in https://github.com/argilla-io/distilabel/pull/747
  • Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/758
  • Drop remove deprecated LoadHubDataset by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/759
  • Add requirements list for Pipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/720
  • Add StepResources and step replicas in Pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/750
  • Add load stages by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/760
  • Update min required version to python==3.9 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/770
  • Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/762
  • Add docs-pr.yml and docs-pr-close.yml workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/774
  • Add RayPipeline class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/769
  • Fixed closed PR workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/776
  • Add Magpie and MagpieGenerator tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/778
  • Fix some issues related to Magpie task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/783
  • Add end_with_user and include_system_prompt flags to Magpie tasks and handle Nones. by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/784
  • Add workflow concurrency group for publishing docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/796
  • Add _desired_num_gpus attribute to CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/795
  • Compatibility with vLLM with tensor_parallel_size argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/805
  • Update default names in GroupColumns by @plaguss in https://github.com/argilla-io/distilabel/pull/808
  • Request batches to GeneratorStep if only step in pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/828
  • Add default name for a pipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/809
  • Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/821
  • Some more Magpie improvements by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/833
  • Add Embeddings base class, SentenceTransformerEmbeddings class, EmbeddingGeneration and FaissNearestNeighbour steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/830
  • Create file per hostname in CudaDevicePlacementMixin by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/814
  • Create a GeneratorStep from a dataset using a helper function by @plaguss in https://github.com/argilla-io/distilabel/pull/812
  • Do not take into account disable_cuda_device_placement for pipeline signature by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/838
  • Add RewardModelScore step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/840
  • Fix LoadDataFromHub attribute _dataset had ellipsis by default instead of None by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/841
  • Create PlacementGroup for steps using vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/842
  • Update argilla integration to use argilla_sdk v2 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/705
  • Make overall-rating the default aspect for UltraFeedback task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/843
  • fix typo index.md by @franperic in https://github.com/argilla-io/distilabel/pull/844
  • Use CudaDevicePlacementMixin in RewardModelScore step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/845
  • Gather GPUs per Ray node to create placement groups by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/848
  • Fix typo in docs by @plaguss in https://github.com/argilla-io/distilabel/pull/850
  • Add xfail routing batch function tests by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/852
  • Fix creating placement group when pipeline_parallel_size>1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/851
  • docs: 846 docs include google analytics by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/847
  • Add ClientvLLM class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/854
  • Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in https://github.com/argilla-io/distilabel/pull/856
  • Add bibtex references in the docstrings to be shown in the README by @plaguss in https://github.com/argilla-io/distilabel/pull/855
  • distilabel 1.3.0 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/857

New Contributors

  • @franperic made their first contribution in https://github.com/argilla-io/distilabel/pull/844

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.4...1.3.0

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.4

What's Changed

  • Update InferenceEndpointsLLM to use chat_completion method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/815

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.3...1.2.4

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.3

What's Changed

  • Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/786
  • Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/791
  • docs: update script for issue dashboard by @sdiazlor in https://github.com/argilla-io/distilabel/pull/775
  • Fix 404 model not found for private Serverless IE by @dvsrepo in https://github.com/argilla-io/distilabel/pull/806

New Contributors

  • @Hassaan-Qaisar made their first contribution in https://github.com/argilla-io/distilabel/pull/786

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.2...1.2.3

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.2

What's Changed

  • Fix passing input to format_output function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/781

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.1...1.2.2

Published by gabrielmbmb over 1 year ago

distilabel - 1.2.1

What's Changed

  • Fix docs for distiset.savetodisk kwargs by @fpreiss in https://github.com/argilla-io/distilabel/pull/745
  • docs: change references by @sdiazlor in https://github.com/argilla-io/distilabel/pull/754
  • Fix response_format for TogetherLLM and AnyScaleLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/764

New Contributors

  • @fpreiss made their first contribution in https://github.com/argilla-io/distilabel/pull/745

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.0...1.2.1

- Python
Published by gabrielmbmb over 1 year ago

distilabel - 1.2.0

✨ Release highlights

Structured generation with instructor, structured generation support in InferenceEndpointsLLM, and a new StructuredGeneration task

  • instructor has been integrated, bringing support for structured generation to OpenAILLM, AnthropicLLM, LiteLLM, MistralLLM, CohereLLM and GroqLLM:

Structured generation with instructor example

```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b",
            structured_output={"schema": KnowledgeGraph}
        ),
    )
    load_dataset >> text_generation
```

  • InferenceEndpointsLLM now supports structured generation.
  • New [StructuredGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/structuredgeneration/) task that allows defining the schema of the structured generation per input row.
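To make the per-row idea concrete, here is a minimal sketch of what such inputs could look like as plain Python dicts: each row bundles its own output schema next to the prompt. The column and key names (`structured_output`, `format`, `schema`) are illustrative assumptions based on the description above, not an exact transcription of the task's API:

```python
import json

# Each row carries its own structured_output definition, so a single task
# can emit differently-shaped outputs per input. Names are illustrative.
rows = [
    {
        "instruction": "Create a user profile for John, 25 years old",
        "structured_output": {
            "format": "json",
            "schema": json.dumps({
                "type": "object",
                "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
                "required": ["name", "age"],
            }),
        },
    },
    {
        "instruction": "Extract the 5-digit ZIP code from: 'Ships from 90210'",
        "structured_output": {"format": "regex", "schema": r"\d{5}"},
    },
]

# Every row is self-describing: the schema travels with the prompt.
assert all("structured_output" in row for row in rows)
```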

New tasks for generating datasets for training embedding models

sentence-transformers v3 was recently released, and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
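For context, sentence-transformers v3 trainers typically consume pairs or triplets of sentences. A toy sketch of the kind of rows such tasks target, with a small filter for incomplete generations (field names here are hypothetical, not distilabel's exact output columns):

```python
# Illustrative only: the (anchor, positive, negative) row shape that
# embedding-training datasets commonly use. Field names are assumptions.
def to_training_triplets(rows):
    """Keep only rows where all three fields were generated."""
    return [
        (r["anchor"], r["positive"], r["negative"])
        for r in rows
        if all(r.get(k) for k in ("anchor", "positive", "negative"))
    ]

rows = [
    {"anchor": "What is the capital of France?",
     "positive": "Paris is the capital of France.",
     "negative": "Berlin is the capital of Germany."},
    # A row where generation failed is dropped by the filter.
    {"anchor": "Define photosynthesis", "positive": None, "negative": None},
]

triplets = to_training_triplets(rows)
assert len(triplets) == 1
```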

New Steps for loading data from different sources and saving/loading Distiset to disk

We've added a few new steps that allow loading data from different sources:

  • LoadDataFromDisk allows loading a Distiset or datasets.Dataset that was previously saved using the save_to_disk method.
  • LoadDataFromFileSystem allows loading a datasets.Dataset from a file system.

Thanks to @rasdani for helping us test these new steps!

In addition, we have added a save_to_disk method to Distiset, akin to datasets.Dataset.save_to_disk, which saves the generated distiset to disk along with the pipeline.yaml and pipeline.log.

`save_to_disk` example

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```

MixtureOfAgentsLLM implementation

We've added a new LLM called MixtureOfAgentsLLM derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new LLM allows generating improved outputs thanks to the collective expertise of several LLMs.

`MixtureOfAgentsLLM` example

```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```

Saving cache and passing batches to GlobalSteps optimizations

  • The cache logic of the _BatchManager has been improved to incrementally update the cache, making the process much faster.
  • The data of the input batches of the GlobalSteps will be passed to the step using the file system, as this is faster than passing it through the queue. This is possible thanks to the new integration of fsspec, which can be configured to use a local file system or cloud storage as the backend for passing the batch data.
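The optimization can be sketched with the standard library alone. This is a toy stand-in for the fsspec-backed mechanism, not distilabel's actual implementation: the producer writes the batch payload to the file system, and only a small path string travels through the queue:

```python
import json
import queue
import tempfile
from pathlib import Path

# Toy sketch: instead of pushing a large batch through the inter-process
# queue, write it to the (possibly remote, via fsspec) file system and
# push only its path. Mirrors the idea, not distilabel's internals.
batch_queue: "queue.Queue[str]" = queue.Queue()
tmpdir = Path(tempfile.mkdtemp())

def send_batch(rows: list, batch_no: int) -> None:
    path = tmpdir / f"batch-{batch_no}.json"
    path.write_text(json.dumps(rows))
    batch_queue.put(str(path))  # only the path goes through the queue

def receive_batch() -> list:
    path = batch_queue.get()
    return json.loads(Path(path).read_text())

send_batch([{"instruction": "hi"}] * 3, batch_no=0)
rows = receive_batch()
assert rows == [{"instruction": "hi"}] * 3
```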

BasePipeline and _BatchManager refactor

The logic around BasePipeline and _BatchManager has been refactored, which will make it easier to implement new pipelines in the future.

Added ArenaHard as an example of how to use distilabel to implement a benchmark

distilabel can be easily used to create an LLM benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with distilabel: Arena Hard

📚 Improved documentation structure

We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.


What's Changed

  • Add prometheus.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/656
  • Reduce time required to execute _cache method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/672
  • [DOCS] Update theme styles and images by @leiyre in https://github.com/argilla-io/distilabel/pull/667
  • Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in https://github.com/argilla-io/distilabel/pull/675
  • Add CITATION.cff by @alvarobartt in https://github.com/argilla-io/distilabel/pull/677
  • Deprecate conversation support in TextGeneration in favour of ChatGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/676
  • Add functionality to load/save distisets to/from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/673
  • Integration instructor by @plaguss in https://github.com/argilla-io/distilabel/pull/654
  • Fix docs of saving/loading distiset from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/679
  • Pass data of batches using file system by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/678
  • Add python==3.12 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/615
  • Add codspeed benchmarks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/674
  • Add StructuredGeneration task and support for grammar in InferenceEndpointsLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/680
  • Fix InferenceEndpointsLLM not using cached token by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/690
  • Add GenerateSentencePair task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/689
  • Fix prepend batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/696
  • Fix EvolQuality._apply_random_mutation not properly injecting response in template by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/703
  • [FEATURE] Include new GeneratorStep classes to load datasets from different formats by @plaguss in https://github.com/argilla-io/distilabel/pull/691
  • Add citation readme by @plaguss in https://github.com/argilla-io/distilabel/pull/712
  • Move navigation to top bar by @plaguss in https://github.com/argilla-io/distilabel/pull/708
  • Fix install_dependencies.sh by @alvarobartt in https://github.com/argilla-io/distilabel/pull/713
  • Add context to guide the generate sentence pair task if informed by @plaguss in https://github.com/argilla-io/distilabel/pull/706
  • Add examples to the LLMs to be shown in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/714
  • Gather HF_TOKEN internally when calling `Distiset.push_to_hub` if token is None by @plaguss in https://github.com/argilla-io/distilabel/pull/707
  • Implement "Improving Text Embeddings with LLMs" by @alvarobartt in https://github.com/argilla-io/distilabel/pull/683
  • Add ArenaHard benchmark and ArenaHardResults step by @alvarobartt in https://github.com/argilla-io/distilabel/pull/670
  • Refactor Pipeline and BasePipeline classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/704
  • Fix AzureOpenAILLM load method setting the correct path to mock the internal class by @plaguss in https://github.com/argilla-io/distilabel/pull/725
  • Components examples steps by @plaguss in https://github.com/argilla-io/distilabel/pull/715
  • Add examples for tasks in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/724
  • [FEATURE] Refactor of structured generation and use schemas defined in a dataset by @plaguss in https://github.com/argilla-io/distilabel/pull/688
  • Update docs document phrasing and funnel by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/718
  • docs: 728 docs api reference tasktyping cannot be imported during doc build by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/729
  • docs: 730 docs add an index to the guide overview by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/731
  • Add MixtureOfAgentsLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/735
  • Add examples/arena_hard.py and remove from distilabel core by @alvarobartt in https://github.com/argilla-io/distilabel/pull/741
  • Add serving LLM section in the docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/742
  • distilabel v1.2.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/659

New Contributors

  • @leiyre made their first contribution in https://github.com/argilla-io/distilabel/pull/667

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.1...1.2.0

- Python
Published by gabrielmbmb over 1 year ago

distilabel - 1.1.1

What's Changed

  • Fix crash when using vLLM without structured generation by @cg123 in https://github.com/argilla-io/distilabel/pull/658
  • Fix error on Pipeline.dry_run without parameters by @plaguss in https://github.com/argilla-io/distilabel/pull/655

New Contributors

  • @cg123 made their first contribution in https://github.com/argilla-io/distilabel/pull/658

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.0...1.1.1

- Python
Published by alvarobartt almost 2 years ago

distilabel - 1.1.0

Distilabel 1.1.0

Two new tasks implemented!

Genstruct task (https://github.com/argilla-io/distilabel/pull/600)

You can now use Genstruct task as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    load_dataset >> task
```

PrometheusEval task (https://github.com/argilla-io/distilabel/pull/610)

A new PrometheusEval task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":

```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
```

Connect the steps in the pipeline with >> (https://github.com/argilla-io/distilabel/pull/490)

Now you can connect your steps using the right-shift operator (`>>`) in Python:

```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset
from distilabel.steps.task.evol_instruct.base import EvolInstruct
from distilabel.steps.combine import CombineColumns

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
```

Routing batch function (https://github.com/argilla-io/distilabel/pull/595)

Thanks to the new routing_batch_function, each batch of an upstream step can be conditionally routed to a list of specific downstream steps. In addition, we have included a sample_n_steps routing batch function, making it easier to replicate the setup of the original UltraFeedback paper:

```python
import random

from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration


@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
```

Generate structured outputs using outlines (https://github.com/argilla-io/distilabel/pull/601)

You can generate JSON or regex-constrained outputs using TransformersLLM, LlamaCppLLM or vLLM thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):

```python
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
```

New GroqLLM (https://github.com/argilla-io/distilabel/pull/583)

New integration with groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:

```python
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
```

Easily test your pipeline doing a dry_run (https://github.com/argilla-io/distilabel/pull/635)

```python
with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1,  # Optional, will be set to 1 by default.
    )
```

```
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
                    INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
```

Pipeline.log file is dumped to the Hugging Face repository (#568)

From now on, when you call distiset.push_to_hub, the pipeline.log file will automatically be uploaded to your dataset repository along with pipeline.yaml to keep track of the execution.

New distilabel_metadata column to store internal data (https://github.com/argilla-io/distilabel/pull/586)

You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is really handy for keeping the raw output from an LLM: if the task does some post-processing via format_output, the original output is preserved so nothing is lost.

You can include the metadata at the task level as:

```python
TextGeneration(..., add_raw_output=True|False)
```

And directly determine whether you want this column in your final Distiset:

```python
with Pipeline(..., enable_metadata=True|False):
    ...
```

This way you can decide whether to keep the column or remove it altogether.
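To see why keeping the raw output is handy, consider a toy parser. The key name used here (`raw_output`) is illustrative; the real metadata layout is defined by distilabel, not by this sketch:

```python
import json

# A task's format_output may fail or lose detail when parsing the raw LLM
# text; keeping the raw string in a metadata column means nothing is lost.
raw_output = 'Sure! Here is the JSON: {"score": 4}'

def format_output(raw: str) -> dict:
    """Best-effort parse; returns {} when the raw text is not clean JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}

row = {
    "instruction": "Rate this answer from 1 to 5.",
    "generation": format_output(raw_output),            # parsed (here: empty)
    "distilabel_metadata": {"raw_output": raw_output},  # original preserved
}

# The parse failed, but the raw completion can still be recovered later.
assert row["generation"] == {}
assert "score" in row["distilabel_metadata"]["raw_output"]
```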

All the changes in this PR

  • Allow nested connect calls and overload rshift method to connect steps by @plaguss in https://github.com/argilla-io/distilabel/pull/490
  • Fix llm_blender installation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/557
  • Warn user about unknown runtime parameters by @plaguss in https://github.com/argilla-io/distilabel/pull/555
  • Add missing model_name, update docstrings, and add *.jinja2 templates to Task subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/560
  • Split ChatGeneration from TextGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/558
  • Set extra="forbid" in {_Step,LLM}.model_config by @alvarobartt in https://github.com/argilla-io/distilabel/pull/577
  • Infer step name by @plaguss in https://github.com/argilla-io/distilabel/pull/575
  • Change the context of subprocesses depending on the platform by @plaguss in https://github.com/argilla-io/distilabel/pull/578
  • Dump logs within a file in .cache/distilabel/pipelines dir by @plaguss in https://github.com/argilla-io/distilabel/pull/568
  • Fix empty batches causing misalignment when branching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/590
  • Add GroqLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/583
  • Add Format{Chat,Text}Generation{DPO,SFT} by @alvarobartt in https://github.com/argilla-io/distilabel/pull/584
  • Fix title in RatingQuestion of PreferenceToArgilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/597
  • Set streaming=False and add num_examples to LoadHubDataset by @plaguss in https://github.com/argilla-io/distilabel/pull/565
  • Make pipeline argument of Step optional by @plaguss in https://github.com/argilla-io/distilabel/pull/566
  • Extend LLM kwargs to align with counterparts by @alvarobartt in https://github.com/argilla-io/distilabel/pull/594
  • Add Genstruct task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/600
  • Fix num_examples to be optional in LoadHubDataset by @plaguss in https://github.com/argilla-io/distilabel/pull/603
  • Fix list_files_in_dir returning unsorted files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/609
  • Add PrometheusEval task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/610
  • Update ValueError on missing inputs message by @alvarobartt in https://github.com/argilla-io/distilabel/pull/617
  • Add routing_batch_function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/595
  • Fix pipeline.log inconsistency & include LLM info in signature by @plaguss in https://github.com/argilla-io/distilabel/pull/598
  • Add custom rubrics attribute to PrometheusEval by @alvarobartt in https://github.com/argilla-io/distilabel/pull/621
  • Update UltraFeedback paper replication to use routing_batch_function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/620
  • Add distilabel_metadata column to the datasets to include general data by @plaguss in https://github.com/argilla-io/distilabel/pull/586
  • Add the option of passing the multiprocessing context via env var by @plaguss in https://github.com/argilla-io/distilabel/pull/604
  • Add name of the pipeline to group the hashed folders by it by @plaguss in https://github.com/argilla-io/distilabel/pull/626
  • Add routing_batch_function serialization by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/628
  • Excluding model path in serialization of llamacpp by @ignacioct in https://github.com/argilla-io/distilabel/pull/633
  • Fix problem with sorting method in list_files_in_dir function by @plaguss in https://github.com/argilla-io/distilabel/pull/622
  • Add dry_run method to the pipelines to run with a single example. by @plaguss in https://github.com/argilla-io/distilabel/pull/635
  • [FEATURE] Add structured outputs using outlines by @plaguss in https://github.com/argilla-io/distilabel/pull/601
  • Force pipeline stop after 2 SIGINT signals caught by @plaguss in https://github.com/argilla-io/distilabel/pull/630
  • Refactor and update docs by @alvarobartt in https://github.com/argilla-io/distilabel/pull/634
  • Export components info & components gallery in docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/640
  • Documentation updates by @plaguss in https://github.com/argilla-io/distilabel/pull/646
  • Refactor docs 1.1.0 by @plaguss in https://github.com/argilla-io/distilabel/pull/650
  • Fix routing batch function deadlocks and unordered batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/649

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.3...1.1.0

- Python
Published by plaguss almost 2 years ago

distilabel - 1.0.3

What's Changed

  • Add stop and stop_sequences in LLM.generate subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/585

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.2...1.0.3

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 1.0.2

What's Changed

  • Fix RuntimeParamater validation when provided as _Step attr by @alvarobartt in https://github.com/argilla-io/distilabel/pull/564
  • Add seed with random.randint to ensure cache is not used by @alvarobartt in https://github.com/argilla-io/distilabel/pull/571

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.1...1.0.2

- Python
Published by alvarobartt almost 2 years ago

distilabel - 1.0.1

What's Changed

  • Fix typo in readme and remove the ToArgilla step by @dvsrepo in https://github.com/argilla-io/distilabel/pull/548
  • Fix model_validator in InferenceEndpoints due to Pipeline pickling by @alvarobartt in https://github.com/argilla-io/distilabel/pull/552

Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.0...1.0.1

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 1.0.0

What's Changed

  • Add Step abstract class and new Pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/338
  • Add runtime parameters validation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/345
  • Pipeline local execution by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/346
  • Add Task (minimal implementation) by @alvarobartt in https://github.com/argilla-io/distilabel/pull/347
  • Refactor _BatchManager to have list of batches per step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/353
  • Refactor getting parameters from Step.process method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/355
  • Add LLM, OpenAILLM, TransformersLLM, and LlamaCppLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/354
  • Fix Task and TextGeneration by @alvarobartt in https://github.com/argilla-io/distilabel/pull/356
  • Add combine_dicts function and CombineColumns class by @alvarobartt in https://github.com/argilla-io/distilabel/pull/358
  • Add PushToHub step and fix typing by @alvarobartt in https://github.com/argilla-io/distilabel/pull/357
  • Add serialization for the new components by @plaguss in https://github.com/argilla-io/distilabel/pull/349
  • Fix OpenAILLM.api_key due to SecretStr and StepInput wrong imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/359
  • Add GlobalStep, fix _BatchManager, and add logging by @alvarobartt in https://github.com/argilla-io/distilabel/pull/362
  • Migrate vllm to the new API by @plaguss in https://github.com/argilla-io/distilabel/pull/361
  • Update _BatchManager to work with GlobalSteps and input_batch_size per step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/366
  • Clean up outdated / unused files by @alvarobartt in https://github.com/argilla-io/distilabel/pull/369
  • Add input_mappings and output_mappings attributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/367
  • Move batching from Task to LLM, fix vLLM.generate and add DISTILABEL_LOG_LEVEL by @alvarobartt in https://github.com/argilla-io/distilabel/pull/371
  • Improve runtime parameter definition by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/372
  • Add AsyncOpenAI and update OpenAILLM accordingly by @alvarobartt in https://github.com/argilla-io/distilabel/pull/381
  • Update serde by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/382
  • Add MistralLLM and add generation_kwargs as RuntimeParameters by @alvarobartt in https://github.com/argilla-io/distilabel/pull/383
  • Move steps out of pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/384
  • Add tests and docstring for Task and subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/385
  • Add step decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/387
  • Add input propagation through Task.process by @alvarobartt in https://github.com/argilla-io/distilabel/pull/399
  • Improve Pipeline error handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/400
  • Fix combine_dicts and StepInput import in PushToHub by @alvarobartt in https://github.com/argilla-io/distilabel/pull/401
  • Improve GlobalStep error handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/402
  • Changed " by italics in EvolInstruct tutorial where one "" was missing by @ignacioct in https://github.com/argilla-io/distilabel/pull/398
  • Add get_last_hidden_states method and update TransformersLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/414
  • docs: correct small typos in tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/419
  • docs: readme positioning by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/386
  • Add num_generations and group_generations parameters to Task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/416
  • Add Argilla and PromptCompletionToArgilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/420
  • Add EvolInstruct and EvolInstructGenerator tasks by @alvarobartt in https://github.com/argilla-io/distilabel/pull/407
  • Wrap optional LLM dependencies under load by @alvarobartt in https://github.com/argilla-io/distilabel/pull/428
  • Add ComplexityScorer task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/421
  • Implement caching mechanism for the pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/370
  • Add method to Pipeline to handle keyboard interruptions via ctrl+c by @plaguss in https://github.com/argilla-io/distilabel/pull/406
  • Add GenerateEmbeddings task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/427
  • Add api_key within LLM.load and add llm_kwargs as RuntimeParameter by @alvarobartt in https://github.com/argilla-io/distilabel/pull/432
  • Add GeneratorStep.process validation in DAG and smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/435
  • Add EvolComplexity task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/415
  • Add QualityScorer Task by @ignacioct in https://github.com/argilla-io/distilabel/pull/425
  • Add CudaDevicePlacementMixin class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/436
  • Return distiset from Pipeline.run by @plaguss in https://github.com/argilla-io/distilabel/pull/417
  • Update README.md by @strickvl in https://github.com/argilla-io/distilabel/pull/451
  • Add InferenceEndpointsLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/439
  • Fix Distiset after PushToHub and smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/452
  • Fix Step.process_applying_mappings by @alvarobartt in https://github.com/argilla-io/distilabel/pull/453
  • Add AnyscaleLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/447
  • Add general function to obtain schema for parquet writer by @plaguss in https://github.com/argilla-io/distilabel/pull/454
  • Add TogetherLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/449
  • Fix LLM subclasses based on OpenAILLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/455
  • Improve batching and caching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/457
  • Add EvolQuality task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/429
  • Add VertexAILLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/445
  • Add use_cache to BasePipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/463
  • Add AnthropicLLM by @sdiazlor in https://github.com/argilla-io/distilabel/pull/444
  • Add multiprocess dependency by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/467
  • Add UltraFeedback by @alvarobartt in https://github.com/argilla-io/distilabel/pull/464
  • Add OllamaLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/405
  • Add RuntimeParametersMixin and LLM runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/466
  • Add LiteLLM by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/441
  • Add CLI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/471
  • Set _batch_manager to None after run by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/473
  • Add create_distiset function by @plaguss in https://github.com/argilla-io/distilabel/pull/480
  • Add overload to step decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/474
  • Move Enum to Dict[str, str] to avoid serialization errors during caching by @plaguss in https://github.com/argilla-io/distilabel/pull/482
  • Include a dataset card and the pipeline.yaml on Distiset.push_to_hub by @plaguss in https://github.com/argilla-io/distilabel/pull/479
  • Add PairRM task for ranking responses by @plaguss in https://github.com/argilla-io/distilabel/pull/450
  • Update _WriteBuffer to write several parquet files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/483
  • Extend Argilla integration TextGeneration, Preference, and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/472
  • Add DeitaFiltering step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/481
  • Add InstructionBacktranslation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/486
  • Fix huggingface_hub TextGenerationError import by @Wauplin in https://github.com/argilla-io/distilabel/pull/485
  • Improve azure openai support by @BramVanroy in https://github.com/argilla-io/distilabel/pull/461
  • Add SelfInstruct task by @ignacioct in https://github.com/argilla-io/distilabel/pull/456
  • Use QueueHandler for Pipeline logging by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/489
  • Improve _stop and logging by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/491
  • Fix creating empty Dataset in create_distiset function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/492
  • Add imports from __init__ modules by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/493
  • batch_size and input_batch_size runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/495
  • Update serialization method of _BatchManager to write each step on its own file by @plaguss in https://github.com/argilla-io/distilabel/pull/496
  • Fix asyncio in AsyncLLM to use the running event loop if any by @alvarobartt in https://github.com/argilla-io/distilabel/pull/501
  • Added authentication header to allow private/gated dataset use by @bjoernpl in https://github.com/argilla-io/distilabel/pull/498
  • Fix generator yielding batches all at once if batch_size == input_batch_size by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/510
  • Run output queue loop in thread and improve stop by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/511
  • Update docs for distilabel v1.0 with mkdocs-material by @plaguss in https://github.com/argilla-io/distilabel/pull/476
  • Add CohereLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/508
  • distilabel v1.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/352
  • Remove draft comment by @plaguss in https://github.com/argilla-io/distilabel/pull/515
  • Fix docs/sections/papers/*.md and add example in docs/index.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/516
  • Small fixes for the docs (images and nav bar) by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/519
  • Fix CTRL + C when still loading steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/521
  • Empty input queues when CTRL + C by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/528
  • Add filelock and flash-attn to vllm extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/529
  • Fix error in README.md when pushing the custom dataset card by @plaguss in https://github.com/argilla-io/distilabel/pull/530
  • Fix pipeline stuck when empty batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/531
  • Add EvolQuality to tasks.__init__.py by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/525
  • Show information about subprocess exception by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/532
  • Update TextGeneration.format_input method to allow OpenAI format by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/533
  • Improve create_distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/534
  • Fixes regarding RuntimeParameters and pydantic model attributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/535
  • Fix parsing LLM generation kwargs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/537
  • pass on Distiset's kwargs to Dataset.push_to_hub() by @rasdani in https://github.com/argilla-io/distilabel/pull/522
  • Set config="default" in Distiset when only one leaf Step by @alvarobartt in https://github.com/argilla-io/distilabel/pull/540
  • docs: update documentation for huggingface inference endpoints. by @burtenshaw in https://github.com/argilla-io/distilabel/pull/539
  • Remove flash-attn from vllm extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/542
  • Docs fix argilla imports by @burtenshaw in https://github.com/argilla-io/distilabel/pull/541
  • Fix not all exceptions being able to be pickled by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/543
  • Update CLI example by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/544
  • Check that Step.name doesn't contain dots or spaces by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/545
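Many of the PRs above build up the architecture that became distilabel 1.0: steps wired into a pipeline, data flowing between them in batches, and the final rows collected into a distiset returned by Pipeline.run (#417). As a rough, purely illustrative sketch of that flow — toy code, not the real distilabel API:

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

def run_pipeline(rows: List[Row], steps: List[Step], batch_size: int = 2) -> List[Row]:
    """Push rows through a linear chain of steps, batch by batch."""
    out: List[Row] = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for step in steps:
            batch = step(batch)
        out.extend(batch)
    return out

# Two toy steps: a "generator" that adds a response and a "scorer" that rates it.
def generate(batch: List[Row]) -> List[Row]:
    return [{**row, "generation": str(row["instruction"]).upper()} for row in batch]

def score(batch: List[Row]) -> List[Row]:
    return [{**row, "score": len(str(row["generation"]))} for row in batch]

distiset = run_pipeline(
    [{"instruction": "hi"}, {"instruction": "hello"}, {"instruction": "hey"}],
    steps=[generate, score],
)
```

In the real library the steps form a DAG with per-step batch sizes and a batch manager (#457, #495); the linear chain here is only the simplest case.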

New Contributors

  • @strickvl made their first contribution in https://github.com/argilla-io/distilabel/pull/451
  • @Wauplin made their first contribution in https://github.com/argilla-io/distilabel/pull/485
  • @BramVanroy made their first contribution in https://github.com/argilla-io/distilabel/pull/461
  • @bjoernpl made their first contribution in https://github.com/argilla-io/distilabel/pull/498
  • @rasdani made their first contribution in https://github.com/argilla-io/distilabel/pull/522

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.6.0...1.0.0

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 0.6.0

What's Changed

  • Fix typo in docstring of to_argilla metrics to metric_ by @burtenshaw in https://github.com/argilla-io/distilabel/pull/334
  • Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in https://github.com/argilla-io/distilabel/pull/331
  • Add examples for the deita paper tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/329
  • Add checkpoint strategy to automatically push to hub by @plaguss in https://github.com/argilla-io/distilabel/pull/321
  • docs: update tutorials to avoid argilla installation error by @sdiazlor in https://github.com/argilla-io/distilabel/pull/337
  • Fix CustomDataset.load_from_disk with str/Path objects by @plaguss in https://github.com/argilla-io/distilabel/pull/341
  • Clarify number of generations produced when using LLMPool in docs by @davanstrien in https://github.com/argilla-io/distilabel/pull/339
  • Refactor _build_dataset piece for speed by @plaguss in https://github.com/argilla-io/distilabel/pull/344
  • Fix documentation and type variables in CustomDataset checkpoint methods by @plaguss in https://github.com/argilla-io/distilabel/pull/342
  • US Spelling and other typo correction on Distilabel tutorials by @ignacioct in https://github.com/argilla-io/distilabel/pull/324
  • docs: add a tutorial for evol-instruct by @sdiazlor in https://github.com/argilla-io/distilabel/pull/327
  • Fix OpenAI API error with OpenAI-compatible providers by @jphme in https://github.com/argilla-io/distilabel/pull/351
  • Add fix for labels not returned by openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/364
  • Refactor model availability check in is_serverless_endpoint_available by @davanstrien in https://github.com/argilla-io/distilabel/pull/363
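The checkpoint strategy added in #321 periodically pushes partial results so a long generation run isn't lost mid-way. A toy version of the "save every N rows" idea — local lists stand in for pushes to the Hugging Face Hub, and none of these names come from the distilabel API:

```python
from typing import List

class Checkpoint:
    """Persist accumulated rows every `save_frequency` rows (toy stand-in for push-to-hub)."""

    def __init__(self, save_frequency: int) -> None:
        self.save_frequency = save_frequency
        self.saves: List[List[str]] = []  # each entry represents one "push"

    def maybe_save(self, rows: List[str]) -> None:
        # Snapshot whenever we cross a multiple of the save frequency.
        if len(rows) % self.save_frequency == 0:
            self.saves.append(list(rows))

rows: List[str] = []
ckpt = Checkpoint(save_frequency=2)
for i in range(5):
    rows.append(f"row-{i}")
    ckpt.maybe_save(rows)
```

The trade-off is the usual one: a smaller save frequency loses less work on a crash but spends more time uploading.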

New Contributors

  • @burtenshaw made their first contribution in https://github.com/argilla-io/distilabel/pull/334
  • @jphme made their first contribution in https://github.com/argilla-io/distilabel/pull/351

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.5.0...0.6.0

- Python
Published by gabrielmbmb almost 2 years ago

distilabel - 0.5.0

What's Changed

  • fix: Correct import error by @plaguss in https://github.com/argilla-io/distilabel/pull/279
  • fix: Filter examples for which len generations != len ratings by @plaguss in https://github.com/argilla-io/distilabel/pull/284
  • feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/262
  • feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/271
  • feat: Add to_argilla method to EvolInstructTask generated datasets by @plaguss in https://github.com/argilla-io/distilabel/pull/291
  • docs: Shorten titles tutorials and update core example by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/289
  • feat: Add new serialization strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/288
  • feat: Review OllamaLLM and TogetherInferenceLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/305
  • refactor: Remove Metadata for Ratings by @ignacioct in https://github.com/argilla-io/distilabel/pull/303
  • docs: Add missing VertexAI information within README.md and docs/index.md by @alvarobartt in https://github.com/argilla-io/distilabel/pull/308
  • feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in https://github.com/argilla-io/distilabel/pull/297
  • feat: Add ComplexityScorer and QualityScorer tasks from Deita by @plaguss in https://github.com/argilla-io/distilabel/pull/302
  • fix: Fix logging visualization of labeller pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/310
  • feat: Add Improving Text Embeddings with LLMs tutorial by @alvarobartt in https://github.com/argilla-io/distilabel/pull/313
  • feat: Add EvolComplexity and EvolQuality by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/299
  • feat: Add validate_prompts method to LLMs to help validating the prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/314
  • fix: typo in clean an existing preference dataset by @sdiazlor in https://github.com/argilla-io/distilabel/pull/312
  • feat: Add new column for sft fine tuning with prepare_dataset by @plaguss in https://github.com/argilla-io/distilabel/pull/309
  • docs: Custom Task Documentation by @ignacioct in https://github.com/argilla-io/distilabel/pull/275
  • refactor: Align the LLM subclasses args by @alvarobartt in https://github.com/argilla-io/distilabel/pull/315
  • feat: Include rationale of the model responses on prepare_dataset if available by @plaguss in https://github.com/argilla-io/distilabel/pull/317
  • feat: Add embedding tutorial to docs by @ignacioct in https://github.com/argilla-io/distilabel/pull/319
  • feat: Add MistralAILLM by @plaguss in https://github.com/argilla-io/distilabel/pull/293
  • feat: Use ollama Python client within OllamaLLM by @sdiazlor in https://github.com/argilla-io/distilabel/pull/307
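Several of these PRs (prepare_dataset in #309 and #317, the Deita scorers in #302) revolve around turning multiple rated generations into preference pairs for fine-tuning. A toy illustration of the binarization idea — not the actual distilabel implementation:

```python
from typing import Dict, List

def binarize(example: Dict[str, list]) -> Dict[str, str]:
    """Pick the best- and worst-rated generations as a chosen/rejected pair."""
    ranked = sorted(zip(example["ratings"], example["generations"]), reverse=True)
    return {
        "prompt": example["prompt"],
        "chosen": ranked[0][1],
        "rejected": ranked[-1][1],
    }

pair = binarize({
    "prompt": "Explain DPO briefly.",
    "generations": ["short answer", "detailed, correct answer", "off-topic"],
    "ratings": [6, 9, 2],
})
```

The real prepare_dataset also carries extra columns (e.g. the labeller's rationale when available, per #317); only the core chosen/rejected selection is sketched here.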

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.4.0...0.5.0

- Python
Published by plaguss about 2 years ago

distilabel - 0.4.0

What's Changed

  • docs: Notus end2end example for preference and instruction generation by @ignacioct in https://github.com/argilla-io/distilabel/pull/145
  • docs: binders anchors by @ignacioct in https://github.com/argilla-io/distilabel/pull/235
  • feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in https://github.com/argilla-io/distilabel/pull/238
  • docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in https://github.com/argilla-io/distilabel/pull/249
  • feat: add ETA to progress bar and fix not showing the progress bar if irrelevant by @ignacioct in https://github.com/argilla-io/distilabel/pull/253
  • feat: Add Evol instruct task by @plaguss in https://github.com/argilla-io/distilabel/pull/237
  • docs: rename enable_checkpoints to checkpoint_strategy by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/257
  • feat: Fixing progress bar and ETA by @ignacioct in https://github.com/argilla-io/distilabel/pull/260
  • fix: resolved error with self instruct to argilla method by @plaguss in https://github.com/argilla-io/distilabel/pull/265
  • chore: Add extra check in LLMPool to ensure all the tasks share the same parent class by @plaguss in https://github.com/argilla-io/distilabel/pull/266
  • fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in https://github.com/argilla-io/distilabel/pull/267
  • feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in https://github.com/argilla-io/distilabel/pull/269
  • docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in https://github.com/argilla-io/distilabel/pull/270
  • feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in https://github.com/argilla-io/distilabel/pull/264
  • feat: add support ollama api by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/250
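The Evol-Instruct task added in #237 iteratively rewrites an instruction using randomly chosen mutation prompts, keeping the evolution trail. A toy version of that loop, with trivial string transformations standing in for the LLM-driven mutations (all names here are illustrative, not distilabel's):

```python
import random
from typing import Callable, List

def evolve(instruction: str, mutations: List[Callable[[str], str]],
           num_evolutions: int, seed: int = 0) -> List[str]:
    """Apply randomly chosen mutations, keeping the whole evolution trail."""
    rng = random.Random(seed)  # seeded for reproducibility
    trail = [instruction]
    for _ in range(num_evolutions):
        mutate = rng.choice(mutations)
        trail.append(mutate(trail[-1]))
    return trail

# Trivial stand-ins for the paper's "add constraints" / "deepen" mutation prompts.
def add_constraint(s: str) -> str:
    return s + " (answer in one sentence)"

def deepen(s: str) -> str:
    return s + " and explain why"

trail = evolve("Describe recursion", [add_constraint, deepen], num_evolutions=2)
```

In the real task each mutation is itself an LLM call, and a judge can discard failed evolutions; this sketch keeps only the loop structure.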

New Contributors

  • @philschmid made their first contribution in https://github.com/argilla-io/distilabel/pull/238
  • @davanstrien made their first contribution in https://github.com/argilla-io/distilabel/pull/249
  • @sdiazlor made their first contribution in https://github.com/argilla-io/distilabel/pull/270

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.3.0...0.4.0

- Python
Published by davidberenstein1957 about 2 years ago

distilabel - 0.3.0

What's Changed

  • Add VertexAILLM & VertexAIEndpointLLM classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/204
  • Add draft with social cards by @plaguss in https://github.com/argilla-io/distilabel/pull/197
  • Relax LLMPool check to match parent Task instead by @plaguss in https://github.com/argilla-io/distilabel/pull/210
  • Align README.md with docs/ and minor fixes / improvements by @alvarobartt in https://github.com/argilla-io/distilabel/pull/214
  • Add TogetherInferenceLLM by @alvarobartt in https://github.com/argilla-io/distilabel/pull/215
  • Add checking valid inputs before calling _generate by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/216
  • Add TogetherInferenceLLM tests by @alvarobartt in https://github.com/argilla-io/distilabel/pull/217
  • Add Vertex AI LLMs documentation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/222
  • Documentation review by @alvarobartt in https://github.com/argilla-io/distilabel/pull/223
  • Rename for_text_quality to for_overall_quality method in UltraFeedbackTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/224
  • Add Anyscale endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/213
  • Feature dataset checkpoint strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/194
  • Fix rating parsing in RatingToArgillaMixin.to_argilla_record by @alvarobartt in https://github.com/argilla-io/distilabel/pull/227
  • Add badges to readme by @plaguss in https://github.com/argilla-io/distilabel/pull/226
  • Fix badges by @dvsrepo in https://github.com/argilla-io/distilabel/pull/228
  • Update LICENSE and add LICENSE_HEADER by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/221

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.1...0.3.0

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.2.1

What's Changed

  • Fix PrometheusTask could not be imported by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/190
  • Fix LLM.return_futures by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/192
  • Remove learn section from docs until developed by @plaguss in https://github.com/argilla-io/distilabel/pull/188
  • Add markdown to fields by default by @plaguss in https://github.com/argilla-io/distilabel/pull/189
  • Fix PrometheusTask and UltraCMTask could not be chained with TextGenerationTask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/195
  • Add missing use_markdown for every field by @plaguss in https://github.com/argilla-io/distilabel/pull/196
  • Add to_argilla_{dataset,record} for CritiqueTask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/198
  • Update generate_prompt in Task subclasses to always return Prompt by @alvarobartt in https://github.com/argilla-io/distilabel/pull/199
  • Add CritiqueTask documentation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/200
  • Fix UltraCMTask scoring range and align argilla imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/201

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.0...0.2.1

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.2.0

What's Changed

  • adds accelerate example by @edbeeching in https://github.com/argilla-io/distilabel/pull/141
  • Add a dry-run when calling Pipeline.generate by @alvarobartt in https://github.com/argilla-io/distilabel/pull/146
  • Add Notus format in Prompt.format_as and update examples/*.py by @alvarobartt in https://github.com/argilla-io/distilabel/pull/147
  • Add ProcessLLM class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/151
  • Adds CritiqueTask, UltraCMTask and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/152
  • docs: add llama.cpp to extras by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/154
  • Fix _build_dataset as processed_labels were ignored by @plaguss in https://github.com/argilla-io/distilabel/pull/158
  • Add to_argilla_{dataset,record} methods in TextGenerationTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/159
  • Fix UltraFeedbackTask.to_argilla_dataset ratings values by @alvarobartt in https://github.com/argilla-io/distilabel/pull/160
  • Align typing and typing_extensions with supported Python versions by @alvarobartt in https://github.com/argilla-io/distilabel/pull/161
  • Add LLMPool class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/156
  • Add missing CritiqueTask and UltraCMTask in __init__ and move argilla_utils to utils.argilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/162
  • Add test workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/163
  • Update LLM to return Future[List[List[LLMOutput]]] by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/164
  • Add PrometheusTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/165
  • Randomise generations order by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/167
  • Add custom to_argilla_{dataset,record} to SelfInstructTask by @alvarobartt in https://github.com/argilla-io/distilabel/pull/169
  • Fix shuffle_before_labelling and progress bar in Pipeline.generate by @alvarobartt in https://github.com/argilla-io/distilabel/pull/170
  • Replace multiprocessing with multiprocess by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/171
  • Refactor and improve docs by @plaguss in https://github.com/argilla-io/distilabel/pull/134
  • Fix SelfInstructTask.{parse_output,to_argilla_record} methods and _build_dataset by @alvarobartt in https://github.com/argilla-io/distilabel/pull/172
  • Fix results didn't have same order as futures by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/173
  • Remove unnecessary plugin by @plaguss in https://github.com/argilla-io/distilabel/pull/174
  • Add {generation,labelling}_model column as metadata in Argilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/175
  • Fix exporting model name to Argilla with LLMPool by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/177
  • Update docs to include info about ProcessLLM and LLMPool by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/176
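A recurring theme in this release is making generation asynchronous: #164 updates LLM to return Future[List[List[LLMOutput]]] and #173 fixes results not matching the order of their futures. A toy sketch of that pattern with concurrent.futures — a stand-in class, not the distilabel LLM interface:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

class ToyLLM:
    """Returns a Future so callers can schedule generation and collect results later."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=2)

    def generate(self, prompts: List[str],
                 num_generations: int = 2) -> "Future[List[List[str]]]":
        def _work() -> List[List[str]]:
            # One inner list of generations per prompt, prompt order preserved.
            return [[f"{p} #{i}" for i in range(num_generations)] for p in prompts]

        return self._executor.submit(_work)

llm = ToyLLM()
future = llm.generate(["a", "b"])
outputs = future.result()  # blocks until the work finishes
```

Submitting the whole batch as one unit of work is what keeps the outer list aligned with the prompts, which is the ordering property #173 restores.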

New Contributors

  • @edbeeching made their first contribution in https://github.com/argilla-io/distilabel/pull/141
  • @davidberenstein1957 made their first contribution in https://github.com/argilla-io/distilabel/pull/154

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.1...0.2.0

- Python
Published by gabrielmbmb about 2 years ago

distilabel - 0.1.1

What's Changed

  • Template for Documentation Issue created by @ignacioct in https://github.com/argilla-io/distilabel/pull/128
  • self.thread_pool_executor can be None, protecting it for print by @ignacioct in https://github.com/argilla-io/distilabel/pull/129
  • Use do_sample in transformers example by @dvsrepo in https://github.com/argilla-io/distilabel/pull/138
  • Fix llama-cpp and hf-inference-endpoints extras in pyproject.toml by @plaguss in https://github.com/argilla-io/distilabel/pull/139
  • Fix llama_cpp_python dependency check by @plaguss in https://github.com/argilla-io/distilabel/pull/140

New Contributors

  • @ignacioct made their first contribution in https://github.com/argilla-io/distilabel/pull/128
  • @plaguss made their first contribution in https://github.com/argilla-io/distilabel/pull/139

Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.0...0.1.1

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.1.0

Stable Release - v0.1.0

- Python
Published by alvarobartt about 2 years ago

distilabel - 0.1.0rc2

- Python
Published by gabrielmbmb over 2 years ago

distilabel - 0.1.0rc1

- Python
Published by gabrielmbmb over 2 years ago

distilabel - 0.1.0rc0

0.1.0rc0

- Python
Published by gabrielmbmb over 2 years ago