Recent Releases of distilabel
distilabel - 1.5.3
What's Changed
- Fix typo by @Riezebos in https://github.com/argilla-io/distilabel/pull/1111
- Checks for images using PIL only if available by @plaguss in https://github.com/argilla-io/distilabel/pull/1112
- Fix pipeline getting stuck when multiple step replicas by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1113
New Contributors
- @Riezebos made their first contribution in https://github.com/argilla-io/distilabel/pull/1111
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.2...1.5.3
- Python
Published by gabrielmbmb about 1 year ago
distilabel - 1.5.2
What's Changed
- Fix structured output JSON to `pydantic.BaseModel` and `LiteLLM` async completion client by @rolshoven in https://github.com/argilla-io/distilabel/pull/1105
New Contributors
- @rolshoven made their first contribution in https://github.com/argilla-io/distilabel/pull/1105
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.1...1.5.2
Published by gabrielmbmb about 1 year ago
distilabel - 1.5.1
What's Changed
- Remove deprecated `CombineColumns` step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1101
- Fix image import handling and update MlxLLM initialisation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1102
- Fix `MlxLLM` by aligning it with `mlx-lm>=0.21` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1103
- `1.5.1` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1104
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.5.0...1.5.1
Published by gabrielmbmb about 1 year ago
distilabel - 1.5.0
✨ Release highlights
🖼️ Image Generation Support
We're excited to introduce ImageGenerationModel, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.
Available Services
- 🤗 `InferenceEndpointsImageGeneration`: Integration with Hugging Face's Inference Endpoints
- `OpenAIImageGeneration`: Integration with OpenAI's DALL-E
Architecture
Just as LLMs are used by a Task, we've introduced ImageTask as a high-level abstraction for image generation workflows. ImageTask defines how a step should use an ImageGenerationModel to accomplish specific image generation tasks.
Our first implementation, the ImageGeneration task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.
We've also added a small tutorial on how to generate images using distilabel: distilabel - Tutorials - Image generation with distilabel
Images as inputs for LLMs
We've added initial support for providing images as input to an LLM through the new TextGenerationWithImage task. We've updated and tested InferenceEndpointsLLM and OpenAILLM with this new task, and we'll add image-as-input compatibility for others, such as vLLM, in upcoming releases.
Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!
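For context, OpenAI-compatible chat APIs accept an image as one part of a multi-part user message. The sketch below builds such a message with the standard library only. The helper names `build_image_message` and `to_data_url` and the example URL are hypothetical, and how TextGenerationWithImage assembles its prompt internally is not described in these notes.

```python
import base64
from typing import Any, Dict, List


def build_image_message(instruction: str, image: str) -> List[Dict[str, Any]]:
    """Return a one-turn chat with a text part and an image part.

    `image` may be an http(s) URL or a data URL holding base64-encoded bytes.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": image}},
            ],
        }
    ]


def to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL so they can be sent inline."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


messages = build_image_message(
    "What's in this image?",
    "https://example.com/cat.png",  # hypothetical URL, for illustration only
)
```

Local files can be passed the same way by reading their bytes and wrapping them with `to_data_url` before building the message.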
💻 New MlxLLM integration
We've integrated the mlx-lm package through the new MlxLLM class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.
New InstructionResponsePipeline template
We've started making changes so distilabel is easier to use from minute one. We'll start adding presets or templates that let you quickly get a pipeline with sensible preconfigured defaults for generating data for certain tasks. The first task we've worked on is the SFT or Instruction Response tuning pipeline, which you can use like this:
```python
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
distiset = pipeline.run()
```
Define load stages
We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.
We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
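As a rough illustration of the idea (not distilabel's implementation), the sketch below runs a toy pipeline group by group, so only one group's steps are held at a time. `run_in_stages` and the lambda steps are hypothetical names made up for this example; in distilabel the groups are passed through the `load_groups` argument of `run`, as the change list for this release notes.

```python
from typing import Callable, Dict, List

Rows = List[dict]
Step = Callable[[Rows], Rows]


def run_in_stages(
    steps: Dict[str, Step], load_groups: List[List[str]], rows: Rows
) -> Rows:
    """Run `steps` one group at a time, releasing each group before the next."""
    for group in load_groups:
        # "Load" only the steps that belong to the current group.
        loaded = {name: steps[name] for name in group}
        for name in group:
            rows = loaded[name](rows)
        # "Unload" the group so the next stage starts with free resources.
        loaded.clear()
    return rows


steps: Dict[str, Step] = {
    "upper": lambda rows: [{**r, "text": r["text"].upper()} for r in rows],
    "count": lambda rows: [{**r, "n": len(r["text"])} for r in rows],
}

# Two stages: "upper" is loaded and released before "count" is loaded.
result = run_in_stages(steps, [["upper"], ["count"]], [{"text": "hi"}])
```

Putting each resource-heavy step in its own group trades some wall-clock time for a lower peak resource footprint.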
What's Changed
- Add common typing module by @plaguss in https://github.com/argilla-io/distilabel/pull/1029
- docs: textcat tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/949
- Add `task` decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1028
- Update `docs` workflows to use `uv` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1032
- fix: simplify prompt template `ArgillaLabeller` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1033
- Add `dataset_batch_size` argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1039
- Move all LLMs to distilabel.models by @plaguss in https://github.com/argilla-io/distilabel/pull/1045
- Fix a tiny typo in `_Step` docstring by @sadra-barikbin in https://github.com/argilla-io/distilabel/pull/1051
- docs: improve docs for `MinHashDedupStep` by @anakin87 in https://github.com/argilla-io/distilabel/pull/1050
- Fix new response_format variable in openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/1053
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/argilla-io/distilabel/pull/1043
- Update `LLM.generate` output to include `statistics` by @plaguss in https://github.com/argilla-io/distilabel/pull/1034
- Add example of structured output. by @plaguss in https://github.com/argilla-io/distilabel/pull/1061
- feat: implenent basic SFT pipeline based on synthetic data generator by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1059
- fix: broken import in instruction by @burtenshaw in https://github.com/argilla-io/distilabel/pull/1063
- Fix StepOutput type by @plaguss in https://github.com/argilla-io/distilabel/pull/1072
- docs: update issue templates by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1074
- Update `unload` method from `vLLM` to properly free resources by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1077
- Add tasks to replicate Math-shepherd by @plaguss in https://github.com/argilla-io/distilabel/pull/1052
- Add `load_groups` argument to `run` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1075
- Add `TextGenerationWithImage` task by @plaguss in https://github.com/argilla-io/distilabel/pull/1066
- Create columns with `LLM` returned extra keys by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1078
- Fix `vLLM` unload logic when model is `None` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1080
- Fix `merge_distilabel_metadata` function when handling outputs from `Task` with `group_generations==True` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1082
- chore: update base.py by @eltociear in https://github.com/argilla-io/distilabel/pull/1085
- Add magpie support llama cpp ollama by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1086
- Feat/954 llama cpp by @bikash119 in https://github.com/argilla-io/distilabel/pull/1000
- fix import by replacing GeneratorOutput with GeneratorStepOutput by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1093
- add mlx support by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1089
- Support custom default headers in `OpenAILLM` class. by @khulaifi95 in https://github.com/argilla-io/distilabel/pull/1088
- fix/pip install messages by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1095
- Fix handling empty list statistics by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1094
- update to outlines010 by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1092
- update: search by match by @sdiazlor in https://github.com/argilla-io/distilabel/pull/1096
- Add Legend to Component Gallery Icons by @ParagEkbote in https://github.com/argilla-io/distilabel/pull/1090
- Image Language Models and `ImageGeneration` task by @plaguss in https://github.com/argilla-io/distilabel/pull/1060
- Update `LLM`s to support prompt logprobs use-case by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1099
- `1.5.0` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1100
New Contributors
- @sadra-barikbin made their first contribution in https://github.com/argilla-io/distilabel/pull/1051
- @anakin87 made their first contribution in https://github.com/argilla-io/distilabel/pull/1050
- @pre-commit-ci made their first contribution in https://github.com/argilla-io/distilabel/pull/1043
- @eltociear made their first contribution in https://github.com/argilla-io/distilabel/pull/1085
- @bikash119 made their first contribution in https://github.com/argilla-io/distilabel/pull/1000
- @khulaifi95 made their first contribution in https://github.com/argilla-io/distilabel/pull/1088
- @ParagEkbote made their first contribution in https://github.com/argilla-io/distilabel/pull/1090
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.2...1.5.0
Published by gabrielmbmb about 1 year ago
distilabel - 1.4.2
What's Changed
- Fix chat template not applied in `TransformersLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1083
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.1...1.4.2
Published by gabrielmbmb about 1 year ago
distilabel - 1.4.1
What's Changed
- Fix not handling list of all primitive types in `SignatureMixin` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1037
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.4.0...1.4.1
Published by gabrielmbmb over 1 year ago
distilabel - 1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the LLM interface so now LLMs using an external platform that offers a batch service can be integrated in distilabel. In addition, OpenAILLM has been updated so it can use the OpenAI Batch API to get 50% cost reductions.
https://github.com/user-attachments/assets/9a559ae1-099b-47a4-9f92-37a3171dfbff
Improved cache for maximum outputs reusability
We all know that running LLMs is costly, and most of the time we want to reuse their outputs as much as we can. Before this release, the distilabel cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the Distiset generated by a pipeline that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the Steps are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed.
In addition, we've added a use_cache attribute in the Steps that allows toggling the use of the cache at step level.
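To make the idea concrete, here is a minimal sketch of step-level caching, assuming a cache keyed by a hash of the step's name, parameters and input rows. `CachedStep`, its `signature` method and the in-memory dict are hypothetical; distilabel's actual cache layout is not described in these notes.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


class CachedStep:
    """Toy step whose outputs are cached under a content-based signature."""

    def __init__(
        self,
        name: str,
        fn: Callable[[Rows], Rows],
        use_cache: bool = True,
        **params: Any,
    ) -> None:
        self.name = name
        self.fn = fn
        self.use_cache = use_cache
        self.params = params
        self.calls = 0  # how many times `fn` actually ran

    def signature(self, rows: Rows) -> str:
        # Same name, parameters and inputs always yield the same key.
        payload = json.dumps(
            {"name": self.name, "params": self.params, "rows": rows},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, rows: Rows, cache: Dict[str, Rows]) -> Rows:
        key = self.signature(rows)
        if self.use_cache and key in cache:
            return cache[key]  # reuse the stored output
        self.calls += 1
        output = self.fn(rows)
        cache[key] = output
        return output


cache: Dict[str, Rows] = {}
step = CachedStep(
    "add_length",
    lambda rows: [{**r, "length": len(r["text"])} for r in rows],
)
rows = [{"text": "hello"}]
first = step.run(rows, cache)
second = step.run(rows, cache)  # served from the cache, `fn` is not called again
```

Because the key depends only on the step and its inputs, a different pipeline that contains the same step with the same inputs could reuse the stored output, and setting `use_cache=False` forces a fresh run.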
Steps can generate artifacts
In some cases, a Step produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That’s why we’ve added a new method called Step.save_artifact that can be called within the step to store the artifacts it generates. The artifacts generated by the Step will also get uploaded to the Hugging Face Hub.
```python
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
New Tasks: CLAIR, APIGEN and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference A `preferred` A′ is much more contrastive and precise.
- New tasks to replicate APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
- New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New TextClassification task to make zero-shot text classification based on a predefined but highly customizable prompt.
- TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
- Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
New Steps to sample data in your pipelines and remove duplicates
- New DataSampler step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
- New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
- New MinHashDedup step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
- New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New CombineOutputs to combine the outputs of two or more steps into a single output.
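The near-duplicate idea behind MinHash can be sketched with the standard library alone. The example below is a toy version that compares every pair of signatures instead of using an LSH index, so it illustrates the technique rather than the MinHashDedup step itself. All function names, the word-bigram shingling and the 0.9 threshold are assumptions made for the example.

```python
import hashlib
from typing import List, Set


def shingles(text: str, size: int = 2) -> Set[str]:
    """Split a text into lowercase word n-grams (here: bigrams)."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i : i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(
            min(
                int.from_bytes(
                    hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8],
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(texts: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a text only if it is not too similar to any text kept so far."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for text in texts:
        signature = minhash_signature(shingles(text))
        if any(
            estimated_jaccard(signature, other) >= threshold
            for other in kept_signatures
        ):
            continue
        kept.append(text)
        kept_signatures.append(signature)
    return kept


texts = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # differs only in case and spacing
    "Synthetic data pipelines need careful deduplication",
]
unique = dedup(texts)
```

The pairwise comparison is quadratic in the number of texts, which is the cost an LSH index is meant to avoid on larger datasets.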
Generate text embeddings using vLLM
- Now you can generate embeddings using vLLMEmbeddings!
Extra things
- Easily visualize the tasks’ prompts using Task.print method.
- New use_default_structured_outputs flag in tasks to automatically use structured generation in some tasks that can benefit from it.
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/862
- Pass dataset to dry_run method by @plaguss in https://github.com/argilla-io/distilabel/pull/863
- Add default structured output for `GenerateSentencePair` task by @plaguss in https://github.com/argilla-io/distilabel/pull/868
- Complexity scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/870
- Quality scorer default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/873
- Ultrafeedback default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/876
- Remove use of `default_chat_template` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/891
- Fix default structured output by @plaguss in https://github.com/argilla-io/distilabel/pull/892
- Send as many batches as possible to input queues by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in https://github.com/argilla-io/distilabel/pull/898
- Fix loader to read from a glob pattern by @plaguss in https://github.com/argilla-io/distilabel/pull/877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in https://github.com/argilla-io/distilabel/pull/903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in https://github.com/argilla-io/distilabel/pull/902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/883
- Update mistrallm by @plaguss in https://github.com/argilla-io/distilabel/pull/904
- Deepseek prover by @plaguss in https://github.com/argilla-io/distilabel/pull/907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/893
- Fix load data from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/910
- docs: minor fixes by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/913
- Add `URIAL` task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/921
- Add `vLLMEmbeddings` by @plaguss in https://github.com/argilla-io/distilabel/pull/920
- docs: add tutorials preference and clean by @sdiazlor in https://github.com/argilla-io/distilabel/pull/917
- Fix `StructuredGeneration` examples and internal check by @plaguss in https://github.com/argilla-io/distilabel/pull/912
- Generate deterministic pipeline name when it's not given by @plaguss in https://github.com/argilla-io/distilabel/pull/878
- Add custom errors by @plaguss in https://github.com/argilla-io/distilabel/pull/911
- Docs/tutorials fix by @sdiazlor in https://github.com/argilla-io/distilabel/pull/922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/928
- Add plausible as replacement for GA by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/929
- Add minhash related steps to deduplicate texts by @plaguss in https://github.com/argilla-io/distilabel/pull/931
- docs: API reference review by @sdiazlor in https://github.com/argilla-io/distilabel/pull/932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in https://github.com/argilla-io/distilabel/pull/937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/936
- Add `CombineOutputs` step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/939
- fix: regex expression in POSITIVE_NEGATIVE by @sdiazlor in https://github.com/argilla-io/distilabel/pull/940
- Offline batch generation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/923
- Fix applying input mapping when mapping overrides another column by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/938
- Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/941
- Fix empty load stage when two `GlobalStep`s are chained by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/945
- Add `system_prompt` attribute to `TextGeneration` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/950
- Add step to deduplicate records based on embeddings by @plaguss in https://github.com/argilla-io/distilabel/pull/946
- Updated setup_logging to use UTF-8 in FileHandler by @dameikle in https://github.com/argilla-io/distilabel/pull/952
- Add more generation parameters to `vLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/955
- Fix `Magpie` generating different columns names depending on `LLM` output by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/965
- Docs/962 docs create a smoother transition from index installation quickstart by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/968
- Add `logging_handlers` argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/969
- [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/973
- Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/948
- [FEATURE] Simplify customizing the `TextGeneration` task with custom prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/974
- Update `system_prompt` attribute for adding probabilities in `MagpieBase` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/981
- Fix unloading steps with more than 1 replica by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/982
- docs: 960 docs add a glossary concept section by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/970
- Fix missing `system_prompt_key` column in `Magpie` tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/983
- docs: update component gallery by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/987
- fix missing batch when last batch arrive early by @zye1996 in https://github.com/argilla-io/distilabel/pull/989
- Fine personas socialai tutorial by @plaguss in https://github.com/argilla-io/distilabel/pull/992
- feat: add basic draw implementation to pipline by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/966
- Fix schema inference structured generation by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/994
- [DOCS] Add developer documentation section in the docs by @plaguss in https://github.com/argilla-io/distilabel/pull/999
- Fix `vllm` installation in CI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1009
- fix metadata writeout when llm error by @zye1996 in https://github.com/argilla-io/distilabel/pull/1003
- Add example of custom text generation step in quickstart by @plaguss in https://github.com/argilla-io/distilabel/pull/984
- feat: 985 feature argillalabeller task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/986
- Fix `llvmlite` install with `uv` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1018
- fix: failing tests argilla labeller by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/1017
- fix inpute when output_mapping is not empty by @zye1996 in https://github.com/argilla-io/distilabel/pull/1015
- Add Tasks to replicate `APIGen` by @plaguss in https://github.com/argilla-io/distilabel/pull/925
- Pretty print by @plaguss in https://github.com/argilla-io/distilabel/pull/934
- Add `CLAIR` task by @plaguss in https://github.com/argilla-io/distilabel/pull/926
- Add cache at `Step` level by @plaguss in https://github.com/argilla-io/distilabel/pull/766
- Fix `IndexError` when overriding inputs and `group_generations=False` by @plaguss in https://github.com/argilla-io/distilabel/pull/1022
- Update `Pipeline cache` docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1023
- `1.4.0` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/1024
New Contributors
- @dameikle made their first contribution in https://github.com/argilla-io/distilabel/pull/952
- @zye1996 made their first contribution in https://github.com/argilla-io/distilabel/pull/989
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.2...1.4.0
Published by gabrielmbmb over 1 year ago
distilabel - 1.3.2
What's Changed
- Deepseek prover task by @plaguss in https://github.com/argilla-io/distilabel/pull/733
- Do not cancel in progress docs workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/918
- Fix passing `base_url` in `model_id` in `InferenceEndpointsLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/924
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.1...1.3.2
Published by gabrielmbmb over 1 year ago
distilabel - 1.3.1
What's Changed
- Create new `distilabel.constants` module to store constants and avoid circular imports by @plaguss in https://github.com/argilla-io/distilabel/pull/861
- Add OpenAI request timeout by @ashim-mahara in https://github.com/argilla-io/distilabel/pull/858
New Contributors
- @ashim-mahara made their first contribution in https://github.com/argilla-io/distilabel/pull/858
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.3.0...1.3.1
Published by gabrielmbmb over 1 year ago
distilabel - 1.3.0
What's Changed
- Add new step `CombineKeys` by @plaguss in https://github.com/argilla-io/distilabel/pull/747
- Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/758
- Drop remove deprecated `LoadHubDataset` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/759
- Add `requirements` list for `Pipeline` by @plaguss in https://github.com/argilla-io/distilabel/pull/720
- Add `StepResources` and step replicas in `Pipeline` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/750
- Add load stages by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/760
- Update min required version to `python==3.9` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/770
- Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/762
- Add `docs-pr.yml` and `docs-pr-close.yml` workflows by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/774
- Add `RayPipeline` class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/769
- Fixed closed PR workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/776
- Add `Magpie` and `MagpieGenerator` tasks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/778
- Fix some issues related to `Magpie` task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/783
- Add `end_with_user` and `include_system_prompt` flags to `Magpie` tasks and handle `None`s. by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/784
- Add workflow concurrency group for publishing docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/796
- Add `_desired_num_gpus` attribute to `CudaDevicePlacementMixin` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/795
- Compatibility with `vLLM` with `tensor_parallel_size` argument by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/805
- Update default names in `GroupColumns` by @plaguss in https://github.com/argilla-io/distilabel/pull/808
- Request batches to `GeneratorStep` if only step in pipeline by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/828
- Add default name for a pipeline by @plaguss in https://github.com/argilla-io/distilabel/pull/809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/821
- Some more `Magpie` improvements by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/833
- Add `Embeddings` base class, `SentenceTransformerEmbeddings` class, `EmbeddingGeneration` and `FaissNearestNeighbour` steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/830
- Create file per hostname in `CudaDevicePlacementMixin` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/814
- Create a `GeneratorStep` from a dataset using a helper function by @plaguss in https://github.com/argilla-io/distilabel/pull/812
- Do not take into account `disable_cuda_device_placement` for pipeline signature by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/838
- Add `RewardModelScore` step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/840
- Fix `LoadDataFromHub` attribute `_dataset` had `ellipsis` by default instead of `None` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/841
- Create `PlacementGroup` for steps using `vLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/842
- Update `argilla` integration to use `argilla_sdk` v2 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/705
- Make `overall-rating` the default aspect for `UltraFeedback` task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/843
- fix typo index.md by @franperic in https://github.com/argilla-io/distilabel/pull/844
- Use `CudaDevicePlacementMixin` in `RewardModelScore` step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/845
- Gather GPUs per Ray node to create placement groups by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/848
- Fix typo in docs by @plaguss in https://github.com/argilla-io/distilabel/pull/850
- Add `xfail` routing batch function tests by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/852
- Fix creating placement group when `pipeline_parallel_size>1` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/851
- docs: 846 docs include google analytics by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/847
- Add `ClientvLLM` class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/854
- Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in https://github.com/argilla-io/distilabel/pull/856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in https://github.com/argilla-io/distilabel/pull/855
- distilabel `1.3.0` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/857
New Contributors
- @franperic made their first contribution in https://github.com/argilla-io/distilabel/pull/844
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.4...1.3.0
Published by gabrielmbmb over 1 year ago
distilabel - 1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/815
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.3...1.2.4
Published by gabrielmbmb over 1 year ago
distilabel - 1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in https://github.com/argilla-io/distilabel/pull/791
- docs: update script for issue dashboard by @sdiazlor in https://github.com/argilla-io/distilabel/pull/775
- Fix 404 model not found for private Serverless IE by @dvsrepo in https://github.com/argilla-io/distilabel/pull/806
New Contributors
- @Hassaan-Qaisar made their first contribution in https://github.com/argilla-io/distilabel/pull/786
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.2...1.2.3
Published by gabrielmbmb over 1 year ago
distilabel - 1.2.2
What's Changed
- Fix passing `input` to `format_output` function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/781
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.1...1.2.2
Published by gabrielmbmb over 1 year ago
distilabel - 1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in https://github.com/argilla-io/distilabel/pull/745
- docs: change references by @sdiazlor in https://github.com/argilla-io/distilabel/pull/754
- Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/764
New Contributors
- @fpreiss made their first contribution in https://github.com/argilla-io/distilabel/pull/745
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.2.0...1.2.1
Published by gabrielmbmb over 1 year ago
distilabel - 1.2.0
✨ Release highlights
Structured generation with instructor, InferenceEndpointsLLM now supports structured generation and StructuredGeneration task
`instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:
Structured generation with `instructor` example
```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., default_factory=list)
    edges: List[Edge] = Field(..., default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b",
            structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
```
- `InferenceEndpointsLLM` now supports structured generation.
- New [`StructuredGeneration`](https://distilabel.argilla.io/latest/components-gallery/tasks/structuredgeneration/) task that allows defining the schema of the structured generation per input row.
New tasks for generating datasets for training embedding models
sentence-transformers v3 was recently released and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
- New `GenerateSentencePair` task that generates a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task can create different kinds of data depending on the `action` to perform with respect to the `anchor`: paraphrase it, generate a semantically similar sentence, generate a query, or generate an answer.
- Implemented "Improving Text Embeddings with Large Language Models" and added the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
New Steps for loading data from different sources and saving/loading Distiset to disk
We've added a few new steps for loading data from different sources:
- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.

Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example
```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```
MixtureOfAgentsLLM implementation
We've added a new LLM called MixtureOfAgentsLLM derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new LLM allows generating improved outputs thanks to the collective expertise of several LLMs.
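The idea behind a mixture of agents can be sketched without any real model. The snippet below is a simplified reading of the approach, not the `MixtureOfAgentsLLM` implementation: the function `mixture_of_agents`, the prompt layout, and the toy callables standing in for LLMs are all assumptions made for illustration.

```python
from typing import Callable, List

# A "model" is any callable mapping a prompt to a completion (stand-in for a real LLM).
Model = Callable[[str], str]


def mixture_of_agents(
    prompt: str, proposers: List[Model], aggregator: Model, rounds: int = 1
) -> str:
    """Hypothetical helper: proposers answer for `rounds` rounds, then one model aggregates."""
    proposals: List[str] = []
    for _ in range(rounds):
        # After the first round, each proposer also sees the previous proposals.
        context = prompt
        if proposals:
            context += "\n\nPrevious answers:\n" + "\n".join(proposals)
        proposals = [propose(context) for propose in proposers]
    # The aggregator receives the original prompt plus the final proposals.
    return aggregator(prompt + "\n\nCandidate answers:\n" + "\n".join(proposals))


# Toy models so the sketch runs offline.
proposers = [lambda p: "A", lambda p: "B"]
aggregator = lambda p: p.split("Candidate answers:\n")[1].replace("\n", " | ")

result = mixture_of_agents("Why is the sky blue?", proposers, aggregator, rounds=2)
```

In this toy setup the aggregator simply joins the candidates; a real aggregator would be an LLM asked to synthesise them into one answer.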
`MixtureOfAgentsLLM` example
```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```
Saving cache and passing batches to GlobalSteps optimizations
- The cache logic of the `_BatchManager` has been improved to update the cache incrementally, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as the backend for passing the data of the batches.
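To illustrate the second point, here is a minimal standard-library sketch in which the producer writes the batch to a file and only the path travels through the queue. The helpers `write_batch` and `read_batch` and the JSON layout are hypothetical; distilabel itself relies on `fsspec` and its own batch format.

```python
import json
import queue
import tempfile
from pathlib import Path


def write_batch(batch_dir: Path, seq_no: int, rows: list[dict]) -> Path:
    """Persist a batch as JSON and return its path (hypothetical helper)."""
    path = batch_dir / f"batch_{seq_no}.json"
    path.write_text(json.dumps(rows))
    return path


def read_batch(path: Path) -> list[dict]:
    """Load a batch previously written by `write_batch`."""
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    batch_dir = Path(tmp)
    q: queue.Queue = queue.Queue()

    rows = [{"instruction": f"question {i}"} for i in range(3)]
    # Only a short string goes through the queue, regardless of the batch size.
    q.put(str(write_batch(batch_dir, 0, rows)))

    # The consumer reads the path from the queue and loads the data from disk.
    received = read_batch(Path(q.get()))
```

The queue payload stays small and constant in size, which is the property that matters when a global step has to receive the whole dataset at once.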
BasePipeline and _BatchManager refactor
The logic around BasePipeline and _BatchManager has been refactored, which will make it easier to implement new pipelines in the future.
Added ArenaHard as an example of how to use distilabel to implement a benchmark
distilabel can be easily used to create an LLM benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with distilabel: Arena Hard
📚 Improved documentation structure
We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.
What's Changed
- Add `prometheus.md` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/656
- Reduce time required to execute `_cache` method by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/672
- [DOCS] Update theme styles and images by @leiyre in https://github.com/argilla-io/distilabel/pull/667
- Fix circular import due to `DISTILABEL_METADATA_KEY` by @plaguss in https://github.com/argilla-io/distilabel/pull/675
- Add `CITATION.cff` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/676
- Add functionality to load/save distisets to/from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/673
- Integration instructor by @plaguss in https://github.com/argilla-io/distilabel/pull/654
- Fix docs of saving/loading distiset from disk by @plaguss in https://github.com/argilla-io/distilabel/pull/679
- Pass data of batches using file system by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/678
- Add `python==3.12` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/615
- Add `codspeed` benchmarks by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/674
- Add `StructuredGeneration` task and support for `grammar` in `InferenceEndpointsLLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/680
- Fix `InferenceEndpointsLLM` not using cached token by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/690
- Add `GenerateSentencePair` task by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/689
- Fix prepend batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/696
- Fix `EvolQuality._apply_random_mutation` not properly injecting `response` in template by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/703
- [FEATURE] Include new `GeneratorStep` classes to load datasets from different formats by @plaguss in https://github.com/argilla-io/distilabel/pull/691
- Add citation readme by @plaguss in https://github.com/argilla-io/distilabel/pull/712
- Move navigation to top bar by @plaguss in https://github.com/argilla-io/distilabel/pull/708
- Fix `install_dependencies.sh` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/713
- Add context to guide the generate sentence pair task if informed by @plaguss in https://github.com/argilla-io/distilabel/pull/706
- Add examples to the LLMs to be shown in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/714
- Gather `HF_TOKEN` internally when calling `Distiset.push_to_hub` if token is `None` by @plaguss in https://github.com/argilla-io/distilabel/pull/707
- Implement "Improving Text Embeddings with LLMs" by @alvarobartt in https://github.com/argilla-io/distilabel/pull/683
- Add `ArenaHard` benchmark and `ArenaHardResults` step by @alvarobartt in https://github.com/argilla-io/distilabel/pull/670
- Refactor `Pipeline` and `BasePipeline` classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/704
- Fix AzureOpenAILLM load method setting the correct path to mock the internal class by @plaguss in https://github.com/argilla-io/distilabel/pull/725
- Components examples steps by @plaguss in https://github.com/argilla-io/distilabel/pull/715
- Add examples for tasks in the components gallery by @plaguss in https://github.com/argilla-io/distilabel/pull/724
- [FEATURE] Refactor of structured generation and use schemas defined in a dataset by @plaguss in https://github.com/argilla-io/distilabel/pull/688
- Update docs document phrasing and funnel by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/718
- docs: 728 docs api reference tasktyping cannot be imported during doc build by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/729
- docs: 730 docs add an index to the guide overview by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/731
- Add `MixtureOfAgentsLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/735
- Add `examples/arena_hard.py` and remove from `distilabel` core by @alvarobartt in https://github.com/argilla-io/distilabel/pull/741
- Add serving LLM section in the docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/742
- `distilabel` v1.2.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/659
New Contributors
- @leiyre made their first contribution in https://github.com/argilla-io/distilabel/pull/667
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.1...1.2.0
- Python
Published by gabrielmbmb over 1 year ago
distilabel - 1.1.1
What's Changed
- Fix crash when using vLLM without structured generation by @cg123 in https://github.com/argilla-io/distilabel/pull/658
- Fix error on `Pipeline.dry_run` without `parameters` by @plaguss in https://github.com/argilla-io/distilabel/pull/655
New Contributors
- @cg123 made their first contribution in https://github.com/argilla-io/distilabel/pull/658
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.1.0...1.1.1
- Python
Published by alvarobartt almost 2 years ago
distilabel - 1.1.0
Distilabel 1.1.0
Two new tasks implemented!
Genstruct task (https://github.com/argilla-io/distilabel/pull/600)
You can now use the `Genstruct` task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:
```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )
```
PrometheusEval task (https://github.com/argilla-io/distilabel/pull/610)
A new PrometheusEval task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":
```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
```
Connect the steps in the pipeline with >> (https://github.com/argilla-io/distilabel/pull/490)
You can now connect your steps using the right-shift operator (`>>`) in Python:
```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset
from distilabel.steps.tasks.evol_instruct.base import EvolInstruct
from distilabel.steps.combine import CombineColumns

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
```
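For readers curious how such chaining can work, the following standalone sketch overloads `__rshift__` and `__rrshift__` on a toy `Node` class. It illustrates the Python mechanism only; the class and its `connect` method are hypothetical and do not reproduce distilabel's `Step` implementation.

```python
from typing import List, Union


class Node:
    """Toy stand-in for a pipeline step (hypothetical, for illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: List["Node"] = []

    def connect(self, other: "Node") -> None:
        self.downstream.append(other)

    def __rshift__(self, other: Union["Node", List["Node"]]):
        # Handles `node >> node` and `node >> [node, node]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.connect(target)
        # Returning the right-hand side is what makes chaining possible.
        return other

    def __rrshift__(self, other: List["Node"]) -> "Node":
        # Handles `[node, node] >> node`: lists do not define `>>`,
        # so Python falls back to the right operand's reflected method.
        for source in other:
            source.connect(self)
        return self


load = Node("load")
evol_1, evol_2 = Node("evol_1"), Node("evol_2")
combine = Node("combine")

load >> [evol_1, evol_2] >> combine
```

After the last line, `load` feeds both `evol_*` nodes and each of them feeds `combine`, mirroring the fan-out and fan-in shape of the pipeline above.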
Routing batch function (https://github.com/argilla-io/distilabel/pull/595)
Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a `sample_n_steps` routing batch function, making it easier to replicate the setup of the original UltraFeedback paper:
```python
import random

from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration


@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
```
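The effect of such a routing function can be simulated without distilabel. In the sketch below, `sample_n_steps` is a local stand-in that only borrows the name mentioned above, and `route_batches` is a hypothetical helper; neither reflects the library's internals.

```python
import random
from typing import Callable, Dict, List

RoutingFn = Callable[[List[str]], List[str]]


def sample_n_steps(n: int) -> RoutingFn:
    """Local stand-in: build a routing function that picks `n` distinct downstream steps."""

    def route(steps: List[str]) -> List[str]:
        return random.sample(steps, n)

    return route


def route_batches(
    num_batches: int, downstream: List[str], routing_fn: RoutingFn
) -> Dict[int, List[str]]:
    """Hypothetical helper: decide, per batch, which downstream steps receive it."""
    return {batch_no: routing_fn(downstream) for batch_no in range(num_batches)}


random.seed(0)  # only to make repeated runs of the sketch reproducible
routes = route_batches(4, ["gpt", "mistral", "gemini"], sample_n_steps(2))
```

Each batch ends up with two of the three candidate steps, so every instruction is answered by a random pair of models rather than by all of them.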
Generate structured outputs using outlines (https://github.com/argilla-io/distilabel/pull/601)
You can generate JSON or regex using `TransformersLLM`, `LlamaCppLLM` or `vLLM` thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):
```python
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )

    load_dataset >> text_generation
```
New GroqLLM (https://github.com/argilla-io/distilabel/pull/583)
New integration with Groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0.
```python
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
```
Easily test your pipeline doing a dry_run (https://github.com/argilla-io/distilabel/pull/635)
```python
with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1  # Optional, will be set to 1 by default.
    )
```
```
[05/13/24 16:22:30] INFO     ['distilabel.pipeline.local'] 🌵 Dry run mode                  local.py:103
                    INFO     ['distilabel.pipeline.local'] 📝 Pipeline data will be ...     local.py:125
```
Pipeline.log file is dumped to the Hugging Face repository (#568)
From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file is automatically uploaded to your dataset repository along with the `pipeline.yaml`, so you can keep track of the execution.
New distilabel_metadata column to store internal data (https://github.com/argilla-io/distilabel/pull/586)
You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is handy for keeping the raw output from an LLM: if the task post-processes that output via `format_output`, the original is kept so nothing is lost.
You can include the metadata at the task level as:
```python
TextGeneration(..., add_raw_output=True|False)
```
And directly determine whether you want this column in your final `Distiset`:
```python
with Pipeline(..., enable_metadata=True|False):
    ...
```
This way you can choose to drop the column altogether.
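Why keeping the raw output matters can be shown with a small standalone sketch. Everything here is illustrative: `format_output` and `process_row` are toy functions, and the `raw_output` key inside `distilabel_metadata` is an assumed layout rather than the library's documented schema.

```python
import json
from typing import Any, Dict


def format_output(raw: str) -> Dict[str, Any]:
    """Toy post-processing: extract `answer` from a JSON string, or None on failure."""
    try:
        return {"generation": json.loads(raw)["answer"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"generation": None}


def process_row(raw: str, add_raw_output: bool = True) -> Dict[str, Any]:
    """Hypothetical helper mimicking a task that optionally keeps the raw LLM output."""
    row = format_output(raw)
    if add_raw_output:
        # Assumed layout: the untouched LLM output lives under a metadata column.
        row["distilabel_metadata"] = {"raw_output": raw}
    return row


ok = process_row('{"answer": "42"}')
broken = process_row("not json at all")
```

For the `broken` row the parsed generation is `None`, yet the original text is still available in the metadata column, so a failed parse can be inspected or retried later.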
All the changes in this PR
- Allow nested connect calls and overload rshift method to connect steps by @plaguss in https://github.com/argilla-io/distilabel/pull/490
- Fix `llm_blender` installation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/557
- Warn user about unknown runtime parameters by @plaguss in https://github.com/argilla-io/distilabel/pull/555
- Add missing `model_name`, update docstrings, and add `*.jinja2` templates to `Task` subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/560
- Split `ChatGeneration` from `TextGeneration` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/558
- Set `extra="forbid"` in `{_Step,LLM}.model_config` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/577
- Infer step name by @plaguss in https://github.com/argilla-io/distilabel/pull/575
- Change the context of subprocesses depending on the platform by @plaguss in https://github.com/argilla-io/distilabel/pull/578
- Dump logs within a file in .cache/distilabel/pipelines dir by @plaguss in https://github.com/argilla-io/distilabel/pull/568
- Fix empty batches causing misalignment when branching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/590
- Add `GroqLLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/583
- Add `Format{Chat,Text}Generation{DPO,SFT}` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/584
- Fix `title` in `RatingQuestion` of `PreferenceToArgilla` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/597
- Set `streaming=False` and add `num_examples` to `LoadHubDataset` by @plaguss in https://github.com/argilla-io/distilabel/pull/565
- Make `pipeline` argument of `Step` optional by @plaguss in https://github.com/argilla-io/distilabel/pull/566
- Extend `LLM` kwargs to align with counterparts by @alvarobartt in https://github.com/argilla-io/distilabel/pull/594
- Add `Genstruct` task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/600
- Fix `num_examples` to be optional in `LoadHubDataset` by @plaguss in https://github.com/argilla-io/distilabel/pull/603
- Fix `list_files_in_dir` returning unsorted files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/609
- Add `PrometheusEval` task by @alvarobartt in https://github.com/argilla-io/distilabel/pull/610
- Update `ValueError` on missing inputs message by @alvarobartt in https://github.com/argilla-io/distilabel/pull/617
- Add `routing_batch_function` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/595
- Fix `pipeline.log` inconsistency & include LLM info in signature by @plaguss in https://github.com/argilla-io/distilabel/pull/598
- Add custom `rubrics` attribute to `PrometheusEval` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/621
- Update `UltraFeedback` paper replication to use `routing_batch_function` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/620
- Add `distilabel_metadata` column to the datasets to include general data by @plaguss in https://github.com/argilla-io/distilabel/pull/586
- Add the option of passing the multiprocessing context via env var by @plaguss in https://github.com/argilla-io/distilabel/pull/604
- Add name of the pipeline to group the hashed folders by it by @plaguss in https://github.com/argilla-io/distilabel/pull/626
- Add `routing_batch_function` serialization by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/628
- Excluding model path in serialization of llamacpp by @ignacioct in https://github.com/argilla-io/distilabel/pull/633
- Fix problem with sorting method in `list_files_in_dir` function by @plaguss in https://github.com/argilla-io/distilabel/pull/622
- Add `dry_run` method to the pipelines to run with a single example by @plaguss in https://github.com/argilla-io/distilabel/pull/635
- [FEATURE] Add structured outputs using `outlines` by @plaguss in https://github.com/argilla-io/distilabel/pull/601
- Force pipeline stop after 2 SIGINT signals caught by @plaguss in https://github.com/argilla-io/distilabel/pull/630
- Refactor and update `docs` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/634
- Export components info & components gallery in docs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/640
- Documentation updates by @plaguss in https://github.com/argilla-io/distilabel/pull/646
- Refactor docs 1.1.0 by @plaguss in https://github.com/argilla-io/distilabel/pull/650
- Fix routing batch function deadlocks and unordered batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/649
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.3...1.1.0
- Python
Published by plaguss almost 2 years ago
distilabel - 1.0.3
What's Changed
- Add `stop` and `stop_sequences` in `LLM.generate` subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/585
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.2...1.0.3
- Python
Published by gabrielmbmb almost 2 years ago
distilabel - 1.0.2
What's Changed
- Fix `RuntimeParameter` validation when provided as `_Step` attr by @alvarobartt in https://github.com/argilla-io/distilabel/pull/564
- Add `seed` with `random.randint` to ensure cache is not used by @alvarobartt in https://github.com/argilla-io/distilabel/pull/571
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.1...1.0.2
- Python
Published by alvarobartt almost 2 years ago
distilabel - 1.0.1
What's Changed
- Fix typo in readme and remove the ToArgilla step by @dvsrepo in https://github.com/argilla-io/distilabel/pull/548
- Fix `model_validator` in `InferenceEndpoints` due to `Pipeline` pickling by @alvarobartt in https://github.com/argilla-io/distilabel/pull/552
Full Changelog: https://github.com/argilla-io/distilabel/compare/1.0.0...1.0.1
- Python
Published by gabrielmbmb almost 2 years ago
distilabel - 1.0.0
What's Changed
- Add
Stepabstract class and newPipelineby @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/338 - Add runtime parameters validation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/345
- Pipeline local execution by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/346
- Add
Task(minimal implementation) by @alvarobartt in https://github.com/argilla-io/distilabel/pull/347 - Refactor
_BatchManagerto have list of batches per step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/353 - Refactor getting parameters from
Step.processmethod by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/355 - Add
LLM,OpenAILLM,TransformersLLM, andLlamaCppLLMby @alvarobartt in https://github.com/argilla-io/distilabel/pull/354 - Fix
TaskandTextGenerationby @alvarobartt in https://github.com/argilla-io/distilabel/pull/356 - Add
combine_dictsfunction andCombineColumnsclass by @alvarobartt in https://github.com/argilla-io/distilabel/pull/358 - Add
PushToHubstep and fixtypingby @alvarobartt in https://github.com/argilla-io/distilabel/pull/357 - Add serialization for the new components by @plaguss in https://github.com/argilla-io/distilabel/pull/349
- Fix
OpenAILLM.api_keydue toSecretStrandStepInputwrong imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/359 - Add
GlobalStep, fix_BatchManager, and addloggingby @alvarobartt in https://github.com/argilla-io/distilabel/pull/362 - Migrate vllm to the new API by @plaguss in https://github.com/argilla-io/distilabel/pull/361
- Update
_BatchManagerto work withGlobalSteps andinput_batch_sizeper step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/366 - Clean up outdated / unused files by @alvarobartt in https://github.com/argilla-io/distilabel/pull/369
- Add
input_mappingsandoutput_mappingsattributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/367 - Move batching from
TasktoLLM, fixvLLM.generateand addDISTILABEL_LOG_LEVELby @alvarobartt in https://github.com/argilla-io/distilabel/pull/371 - Improve runtime parameter definition by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/372
- Add
AsyncOpenAIand updateOpenAILLMaccordingly by @alvarobartt in https://github.com/argilla-io/distilabel/pull/381 - Update serde by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/382
- Add
MistralLLMand addgeneration_kwargsasRuntimeParametersby @alvarobartt in https://github.com/argilla-io/distilabel/pull/383 - Move
stepsout ofpipelineby @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/384 - Add tests and docstring for
Taskand subclasses by @alvarobartt in https://github.com/argilla-io/distilabel/pull/385 - Add
stepdecorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/387 - Add
inputpropagation throughTask.processby @alvarobartt in https://github.com/argilla-io/distilabel/pull/399 - Improve
Pipelineerror handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/400 - Fix
combine_dictsandStepInputimport inPushToHubby @alvarobartt in https://github.com/argilla-io/distilabel/pull/401 - Improve
GlobalSteperror handling by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/402 - Changed " by italics in EvolInstruct tutorial where one "" was missing by @ignacioct in https://github.com/argilla-io/distilabel/pull/398
- Add
get_last_hidden_statesmethod and updateTransformersLLMby @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/414 - docs: correct small typos in tutorial by @sdiazlor in https://github.com/argilla-io/distilabel/pull/419
- docs: readme positioning by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/386
- Add
num_generationsandgroup_generationsparameters toTaskby @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/416 - Add
ArgillaandPromptCompletionToArgillaby @alvarobartt in https://github.com/argilla-io/distilabel/pull/420 - Add
EvolInstructandEvolInstructGeneratortasks by @alvarobartt in https://github.com/argilla-io/distilabel/pull/407 - Wrap optional
LLMdependencies underloadby @alvarobartt in https://github.com/argilla-io/distilabel/pull/428 - Add
ComplexityScorertask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/421 - Implement caching mechanism for the pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/370
- Add method to Pipeline to handle keyboard interruptions via ctrl+c by @plaguss in https://github.com/argilla-io/distilabel/pull/406
- Add
GenerateEmbeddingstask by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/427 - Add
api_keywithinLLM.loadand addllm_kwargsasRuntimeParameterby @alvarobartt in https://github.com/argilla-io/distilabel/pull/432 - Add
GeneratorStep.processvalidation inDAGand smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/435 - Add
`EvolComplexity` task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/415
- Add `QualityScorer` Task by @ignacioct in https://github.com/argilla-io/distilabel/pull/425
- Add `CudaDevicePlacementMixin` class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/436
- Return `distiset` from `Pipeline.run` by @plaguss in https://github.com/argilla-io/distilabel/pull/417
- Update README.md by @strickvl in https://github.com/argilla-io/distilabel/pull/451
- Add `InferenceEndpointsLLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/439
- Fix `Distiset` after `PushToHub` and smaller fixes by @alvarobartt in https://github.com/argilla-io/distilabel/pull/452
- Fix `Step.process_applying_mappings` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/453
- Add `AnyscaleLLM` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/447
- Add general function to obtain schema for parquet writer by @plaguss in https://github.com/argilla-io/distilabel/pull/454
- Add `TogetherLLM` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/449
- Fix `LLM` subclasses based on `OpenAILLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/455
- Improve batching and caching by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/457
- Add `EvolQuality` task by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/429
- Add `VertexAILLM` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/445
- Add `use_cache` to `BasePipeline` by @plaguss in https://github.com/argilla-io/distilabel/pull/463
- Add `AnthropicLLM` by @sdiazlor in https://github.com/argilla-io/distilabel/pull/444
- Add `multiprocess` dependency by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/467
- Add `UltraFeedback` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/464
- Add `OllamaLLM` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/405
- Add `RuntimeParametersMixin` and `LLM` runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/466
- Add `LiteLLM` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/441
- Add CLI by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/471
- Set `_batch_manager` to `None` after run by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/473
- Add create_distiset function by @plaguss in https://github.com/argilla-io/distilabel/pull/480
- Add `overload` to `step` decorator by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/474
- Move Enum to Dict[str, str] to avoid serialization errors during caching by @plaguss in https://github.com/argilla-io/distilabel/pull/482
- Include a dataset card and the `pipeline.yaml` on `Distiset.push_to_hub` by @plaguss in https://github.com/argilla-io/distilabel/pull/479
- Add `PairRM` task for ranking responses by @plaguss in https://github.com/argilla-io/distilabel/pull/450
- Update `_WriteBuffer` to write several parquet files by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/483
- Extend `Argilla` integration `TextGeneration`, `Preference`, and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/472
- Add `DeitaFiltering` step by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/481
- Add `InstructionBacktranslation` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/486
- Fix huggingface_hub TextGenerationError import by @Wauplin in https://github.com/argilla-io/distilabel/pull/485
- Improve azure openai support by @BramVanroy in https://github.com/argilla-io/distilabel/pull/461
- Add `SelfInstruct` task by @ignacioct in https://github.com/argilla-io/distilabel/pull/456
- Use `QueueHandler` for `Pipeline` logging by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/489
- Improve `_stop` and `logging` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/491
- Fix creating empty `Dataset` in `create_distiset` function by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/492
- Add imports from `__init__` modules by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/493
- `batch_size` and `input_batch_size` runtime parameters by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/495
- Update serialization method of _BatchManager to write each step on its own file by @plaguss in https://github.com/argilla-io/distilabel/pull/496
- Fix `asyncio` in `AsyncLLM` to use the running event loop if any by @alvarobartt in https://github.com/argilla-io/distilabel/pull/501
- Added authentication header to allow private/gated dataset use by @bjoernpl in https://github.com/argilla-io/distilabel/pull/498
- Fix generator yielding batches all at once if `batch_size==input_batch_size` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/510
- Run output queue loop in thread and improve stop by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/511
- Update `docs` for `distilabel` v1.0 with `mkdocs-material` by @plaguss in https://github.com/argilla-io/distilabel/pull/476
- Add `CohereLLM` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/508
- `distilabel` v1.0 by @alvarobartt in https://github.com/argilla-io/distilabel/pull/352
- Remove draft comment by @plaguss in https://github.com/argilla-io/distilabel/pull/515
- Fix `docs/sections/papers/*.md` and add example in `docs/index.md` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/516
- Small fixes for the docs (images and nav bar) by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/519
- Fix CTRL + C when still loading steps by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/521
- Empty input queues when `CTRL + C` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/528
- Add `filelock` and `flash-attn` to `vllm` extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/529
- Fix error in README.md when pushing the custom dataset card by @plaguss in https://github.com/argilla-io/distilabel/pull/530
- Fix pipeline stuck when empty batches by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/531
- Add `EvolQuality` to `tasks.__init__.py` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/525
- Show information about subprocess exception by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/532
- Update `TextGeneration.format_input` method to allow OpenAI format by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/533
- Improve create_distiset by @plaguss in https://github.com/argilla-io/distilabel/pull/534
- Fixes regarding `RuntimeParameter`s and `pydantic` model attributes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/535
- Fix parsing `LLM` generation kwargs by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/537
- pass on Distiset's kwargs to Dataset.push_to_hub() by @rasdani in https://github.com/argilla-io/distilabel/pull/522
- Set `config="default"` in `Distiset` when only one leaf `Step` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/540
- docs: update documentation for huggingface inference endpoints. by @burtenshaw in https://github.com/argilla-io/distilabel/pull/539
- Remove `flash-attn` from `vllm` extra by @alvarobartt in https://github.com/argilla-io/distilabel/pull/542
- Docs fix argilla imports by @burtenshaw in https://github.com/argilla-io/distilabel/pull/541
- Fix not all exceptions being able to be pickled by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/543
- Update CLI example by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/544
- Check that `Step.name` doesn't contain dots or spaces by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/545
New Contributors
- @strickvl made their first contribution in https://github.com/argilla-io/distilabel/pull/451
- @Wauplin made their first contribution in https://github.com/argilla-io/distilabel/pull/485
- @BramVanroy made their first contribution in https://github.com/argilla-io/distilabel/pull/461
- @bjoernpl made their first contribution in https://github.com/argilla-io/distilabel/pull/498
- @rasdani made their first contribution in https://github.com/argilla-io/distilabel/pull/522
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.6.0...1.0.0
- Python
Published by gabrielmbmb almost 2 years ago
distilabel - 0.6.0
What's Changed
- Fix typo in docstring of to_argilla metrics_ to metric_ by @burtenshaw in https://github.com/argilla-io/distilabel/pull/334
- Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in https://github.com/argilla-io/distilabel/pull/331
- Add examples for the deita paper tasks by @plaguss in https://github.com/argilla-io/distilabel/pull/329
- Add checkpoint strategy to automatically push to hub by @plaguss in https://github.com/argilla-io/distilabel/pull/321
- docs: update tutorials avoid argilla installation error by @sdiazlor in https://github.com/argilla-io/distilabel/pull/337
- Fix `CustomDataset.load_from_disk` with `str`/`Path` objects by @plaguss in https://github.com/argilla-io/distilabel/pull/341
- Clarify number of generations produced when using LLMPool in docs by @davanstrien in https://github.com/argilla-io/distilabel/pull/339
- Refactor _build_dataset piece for speed by @plaguss in https://github.com/argilla-io/distilabel/pull/344
- Fix documentation and type variables in `CustomDataset` checkpoint methods by @plaguss in https://github.com/argilla-io/distilabel/pull/342
- US Spelling and other typo correction on Distilabel tutorials by @ignacioct in https://github.com/argilla-io/distilabel/pull/324
- docs: add a tutorial for evolinstruct by @sdiazlor in https://github.com/argilla-io/distilabel/pull/327
- Fix Openai api error with OpenAI-compatible providers by @jphme in https://github.com/argilla-io/distilabel/pull/351
- Add fix for labels not returned by openai api by @plaguss in https://github.com/argilla-io/distilabel/pull/364
- Refactor model availability check in is_serverless_endpoint_available by @davanstrien in https://github.com/argilla-io/distilabel/pull/363
New Contributors
- @burtenshaw made their first contribution in https://github.com/argilla-io/distilabel/pull/334
- @jphme made their first contribution in https://github.com/argilla-io/distilabel/pull/351
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.5.0...0.6.0
- Python
Published by gabrielmbmb almost 2 years ago
distilabel - 0.5.0
What's Changed
- fix: Correct import error by @plaguss in https://github.com/argilla-io/distilabel/pull/279
- fix: Filter examples for which len generations != len ratings by @plaguss in https://github.com/argilla-io/distilabel/pull/284
- feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/262
- feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/271
- feat: Add `to_argilla` method to `EvolInstructTask` generated datasets by @plaguss in https://github.com/argilla-io/distilabel/pull/291
- docs: Shorten titles tutorials and update core example by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/289
- feat: Add new serialization strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/288
- feat: Review `OllamaLLM` and `TogetherInferenceLLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/305
- refactor: Remove Metadata for Ratings by @ignacioct in https://github.com/argilla-io/distilabel/pull/303
- docs: Add missing VertexAI information within `README.md` and `docs/index.md` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/308
- feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in https://github.com/argilla-io/distilabel/pull/297
- feat: Add `ComplexityScorer` and `QualityScorer` tasks from Deita by @plaguss in https://github.com/argilla-io/distilabel/pull/302
- fix: Fix logging visualization of labeller pipelines by @plaguss in https://github.com/argilla-io/distilabel/pull/310
- feat: Add `Improving Text Embeddings with LLMs` tutorial by @alvarobartt in https://github.com/argilla-io/distilabel/pull/313
- feat: Add `EvolComplexity` and `EvolQuality` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/299
- feat: Add `validate_prompts` method to LLMs to help validating the prompts by @plaguss in https://github.com/argilla-io/distilabel/pull/314
- fix: typo in clean an existing preference dataset by @sdiazlor in https://github.com/argilla-io/distilabel/pull/312
- feat: Add new column for sft fine tuning with `prepare_dataset` by @plaguss in https://github.com/argilla-io/distilabel/pull/309
- docs: Custom Task Documentation by @ignacioct in https://github.com/argilla-io/distilabel/pull/275
- refactor: Align the `LLM` subclasses args by @alvarobartt in https://github.com/argilla-io/distilabel/pull/315
- feat: Include rationale of the model responses on `prepare_dataset` if available by @plaguss in https://github.com/argilla-io/distilabel/pull/317
- feat: Add embedding tutorial to docs by @ignacioct in https://github.com/argilla-io/distilabel/pull/319
- feat: Add `MistralAILLM` by @plaguss in https://github.com/argilla-io/distilabel/pull/293
- feat: Use `ollama` Python client within `OllamaLLM` by @sdiazlor in https://github.com/argilla-io/distilabel/pull/307
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.4.0...0.5.0
- Python
Published by plaguss about 2 years ago
distilabel - 0.4.0
What's Changed
- docs: Notus end2end example for preference and instruction generation by @ignacioct in https://github.com/argilla-io/distilabel/pull/145
- docs: binders anchors by @ignacioct in https://github.com/argilla-io/distilabel/pull/235
- feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in https://github.com/argilla-io/distilabel/pull/238
- docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in https://github.com/argilla-io/distilabel/pull/249
- feat: add ETA to progress bar and fix not showing the progress bar if irrelevant by @ignacioct in https://github.com/argilla-io/distilabel/pull/253
- feat: Add Evol instruct task by @plaguss in https://github.com/argilla-io/distilabel/pull/237
- docs: rename `enable_checkpoints` to `checkpoint_strategy` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/257
- feat: Fixing progress bar and ETA by @ignacioct in https://github.com/argilla-io/distilabel/pull/260
- fix: resolved error with self instruct to argilla method by @plaguss in https://github.com/argilla-io/distilabel/pull/265
- chore: Add extra check in llmpool to ensure all the tasks share the same parent class by @plaguss in https://github.com/argilla-io/distilabel/pull/266
- fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in https://github.com/argilla-io/distilabel/pull/267
- feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in https://github.com/argilla-io/distilabel/pull/269
- docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in https://github.com/argilla-io/distilabel/pull/270
- feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in https://github.com/argilla-io/distilabel/pull/264
- feat: add support `ollama` api by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/250
New Contributors
- @philschmid made their first contribution in https://github.com/argilla-io/distilabel/pull/238
- @davanstrien made their first contribution in https://github.com/argilla-io/distilabel/pull/249
- @sdiazlor made their first contribution in https://github.com/argilla-io/distilabel/pull/270
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.3.0...0.4.0
- Python
Published by davidberenstein1957 about 2 years ago
distilabel - 0.3.0
What's Changed
- Add `VertexAILLM` & `VertexAIEndpointLLM` classes by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/204
- Add draft with social cards by @plaguss in https://github.com/argilla-io/distilabel/pull/197
- Relax `LLMPool` check to match parent `Task` instead by @plaguss in https://github.com/argilla-io/distilabel/pull/210
- Align `README.md` with `docs/` and minor fixes / improvements by @alvarobartt in https://github.com/argilla-io/distilabel/pull/214
- Add `TogetherInferenceLLM` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/215
- Add checking valid `inputs` before calling `_generate` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/216
- Add `TogetherInferenceLLM` tests by @alvarobartt in https://github.com/argilla-io/distilabel/pull/217
- Add Vertex AI `LLM`s documentation by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/222
- Documentation review by @alvarobartt in https://github.com/argilla-io/distilabel/pull/223
- Rename `for_text_quality` to `for_overall_quality` method in `UltraFeedbackTask` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/224
- Add Anyscale endpoints by @plaguss in https://github.com/argilla-io/distilabel/pull/213
- Feature dataset checkpoint strategy by @plaguss in https://github.com/argilla-io/distilabel/pull/194
- Fix `rating` parsing in `RatingToArgillaMixin.to_argilla_record` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/227
- Add badges to readme by @plaguss in https://github.com/argilla-io/distilabel/pull/226
- Fix badges by @dvsrepo in https://github.com/argilla-io/distilabel/pull/228
- Update `LICENSE` and add `LICENSE_HEADER` by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/221
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.1...0.3.0
- Python
Published by alvarobartt about 2 years ago
distilabel - 0.2.1
What's Changed
- Fix `PrometheusTask` could not be imported by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/190
- Fix `LLM.return_futures` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/192
- Remove learn section from docs until developed by @plaguss in https://github.com/argilla-io/distilabel/pull/188
- Add markdown to fields by default by @plaguss in https://github.com/argilla-io/distilabel/pull/189
- Fix `PrometheusTask` and `UltraCMTask` could not be chained with `TextGenerationTask` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/195
- Add missing `use_markdown` for every field by @plaguss in https://github.com/argilla-io/distilabel/pull/196
- Add `to_argilla_{dataset,record}` for `CritiqueTask` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/198
- Update `generate_prompt` in `Task` subclasses to always return `Prompt` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/199
- Add `CritiqueTask` documentation by @alvarobartt in https://github.com/argilla-io/distilabel/pull/200
- Fix `UltraCMTask` scoring range and align `argilla` imports by @alvarobartt in https://github.com/argilla-io/distilabel/pull/201
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.2.0...0.2.1
- Python
Published by alvarobartt about 2 years ago
distilabel - 0.2.0
What's Changed
- adds accelerate example by @edbeeching in https://github.com/argilla-io/distilabel/pull/141
- Add a dry-run when calling `Pipeline.generate` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/146
- Add Notus format in `Prompt.format_as` and update `examples/*.py` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/147
- Add `ProcessLLM` class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/151
- Adds `CritiqueTask`, `UltraCMTask` and more by @alvarobartt in https://github.com/argilla-io/distilabel/pull/152
- docs: add `llama.cpp` to extras by @davidberenstein1957 in https://github.com/argilla-io/distilabel/pull/154
- Fix `_build_dataset` as `processed_labels` were ignored by @plaguss in https://github.com/argilla-io/distilabel/pull/158
- Add `to_argilla_{dataset,record}` methods in `TextGenerationTask` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/159
- Fix `UltraFeedbackTask.to_argilla_dataset` ratings values by @alvarobartt in https://github.com/argilla-io/distilabel/pull/160
- Align `typing` and `typing_extensions` with supported Python versions by @alvarobartt in https://github.com/argilla-io/distilabel/pull/161
- Add `LLMPool` class by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/156
- Add missing `CritiqueTask` and `UltraCMTask` in `__init__` and move `argilla_utils` to `utils.argilla` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/162
- Add `test` workflow by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/163
- Update `LLM` to return `Future[List[List[LLMOutput]]]` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/164
- Add `PrometheusTask` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/165
- Randomise generations order by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/167
- Add custom `to_argilla_{dataset,record}` to `SelfInstructTask` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/169
- Fix `shuffle_before_labelling` and progress bar in `Pipeline.generate` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/170
- Replace `multiprocessing` with `multiprocess` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/171
- Refactor and improve docs by @plaguss in https://github.com/argilla-io/distilabel/pull/134
- Fix `SelfInstructTask.{parse_output,to_argilla_record}` methods and `_build_dataset` by @alvarobartt in https://github.com/argilla-io/distilabel/pull/172
- Fix `results` didn't have same order as `futures` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/173
- Remove unnecessary plugin by @plaguss in https://github.com/argilla-io/distilabel/pull/174
- Add `{generation,labelling}_model` column as metadata in Argilla by @alvarobartt in https://github.com/argilla-io/distilabel/pull/175
- Fix exporting model name to Argilla with `LLMPool` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/177
- Update docs to include info about `ProcessLLM` and `LLMPool` by @gabrielmbmb in https://github.com/argilla-io/distilabel/pull/176
New Contributors
- @edbeeching made their first contribution in https://github.com/argilla-io/distilabel/pull/141
- @davidberenstein1957 made their first contribution in https://github.com/argilla-io/distilabel/pull/154
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.1...0.2.0
- Python
Published by gabrielmbmb about 2 years ago
distilabel - 0.1.1
What's Changed
- Template for Documentation Issue created by @ignacioct in https://github.com/argilla-io/distilabel/pull/128
- self.thread_pool_executor can be None, protecting it for print by @ignacioct in https://github.com/argilla-io/distilabel/pull/129
- Use `do_sample` in `transformers` example by @dvsrepo in https://github.com/argilla-io/distilabel/pull/138
- Fix `llama-cpp` and `hf-inference-endpoints` extras in `pyproject.toml` by @plaguss in https://github.com/argilla-io/distilabel/pull/139
- Fix `llama_cpp_python` dependency check by @plaguss in https://github.com/argilla-io/distilabel/pull/140
New Contributors
- @ignacioct made their first contribution in https://github.com/argilla-io/distilabel/pull/128
- @plaguss made their first contribution in https://github.com/argilla-io/distilabel/pull/139
Full Changelog: https://github.com/argilla-io/distilabel/compare/0.1.0...0.1.1
- Python
Published by alvarobartt about 2 years ago