distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

https://github.com/argilla-io/distilabel

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 37 committers (2.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation

Keywords from Contributors

transformers cryptocurrencies jax cryptography agents multi-agents application fine-tuning optimism rag
Last synced: 6 months ago

Repository

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

Basic Info
Statistics
  • Stars: 2,862
  • Watchers: 26
  • Forks: 216
  • Open Issues: 95
  • Releases: 33
Topics
ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

[!IMPORTANT]
The original authors have moved on to other projects. A group of community members has recently joined the GitHub project as collaborators to maintain it and is actively working towards the next release. In the meantime, check out the develop branch for the latest fixes and improvements.


Synthesize data for AI and add feedback on the fly!


Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, from traditional predictive NLP (classification, extraction, etc.) to generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Its programmatic approach lets you build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both problems at once. Distilabel helps you synthesize and judge data, so you can spend your valuable time achieving and maintaining high-quality standards for your data.

Take control of your data and models

Owning the data for fine-tuning your own LLMs is not easy, but distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.

Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Discord: get direct support from the community in #argilla-general and #argilla-help.

  • Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.

What do people build with Distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

  • The 1M OpenHermesPreferences is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use distilabel to synthesize data on an immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve its quality.

Installation

```sh
pip install distilabel --upgrade
```

Requires Python 3.9+
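Before installing, you can verify the interpreter version with a quick stdlib check (a minimal sketch, not part of distilabel itself):

```python
import sys

# distilabel requires Python 3.9+; fail fast with a clear message otherwise.
if sys.version_info < (3, 9):
    raise RuntimeError(
        f"distilabel requires Python 3.9+, found {sys.version.split()[0]}"
    )
print("Python version OK:", sys.version.split()[0])
```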

In addition, the following extras are available:

LLMs

  • anthropic: for using models available in Anthropic API via the AnthropicLLM integration.
  • cohere: for using models available in Cohere via the CohereLLM integration.
  • argilla: for exporting the generated datasets to Argilla.
  • groq: for using models available in Groq via the groq Python client and the GroqLLM integration.
  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
  • litellm: for using LiteLLM to call any LLM using OpenAI format via the LiteLLM integration.
  • llama-cpp: for using llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
  • mistralai: for using models available in Mistral AI API via the MistralAILLM integration.
  • ollama: for using Ollama and its available models via the OllamaLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration, as well as the other integrations that rely on the OpenAI client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
  • vllm: for using the vllm serving engine via the vLLM integration.
  • sentence-transformers: for generating sentence embeddings using sentence-transformers.
  • mlx: for using MLX models via the MlxLLM integration.

Structured generation

  • outlines: for using structured generation of LLMs with outlines.
  • instructor: for using structured generation of LLMs with Instructor.

Data processing

  • ray: for scaling and distributing a pipeline with Ray.
  • faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
  • text-clustering: for using text clustering with UMAP and Scikit-learn.
  • minhash: for using minhash for duplicate detection with datasketch and nltk.
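Extras can be combined in a single install command, e.g. `pip install "distilabel[openai,outlines]"`. A small stdlib sketch (the module names here are assumptions mapped from the extras above, not an official distilabel helper) to check which optional backends are importable in the current environment:

```python
import importlib.util

def available_backends(modules=("openai", "anthropic", "outlines", "vllm")):
    """Return the subset of optional modules that can be imported."""
    return [m for m in modules if importlib.util.find_spec(m) is not None]

print("importable optional backends:", available_backends())
```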

Example

To run the following example you must install distilabel with the hf-inference-endpoints extra:

```sh
pip install "distilabel[hf-inference-endpoints]" --upgrade
```

Then run:

```python
from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")
```

Badges

If you build something cool with distilabel consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


Contribute

To contribute to distilabel directly, check our good first issues or open a new one.

Citation

```bibtex
@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
```

Owner

  • Name: Argilla
  • Login: argilla-io
  • Kind: organization
  • Email: contact@argilla.io

Building the open-source tool for data-centric NLP

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Bartolomé"
  given-names: "Álvaro"
- family-names: "Martín-Blázquez"
  given-names: "Gabriel"
- family-names: "Piqueres-Lajarín"
  given-names: "Agustín"
- family-names: "Vila-Suero"
  given-names: "Daniel"
title: "Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs."
version: 1.1.1
date-released: 2024-05-22
url: "https://github.com/argilla-io/distilabel"
```

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 777
  • Total Committers: 37
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.692
Past Year
  • Commits: 297
  • Committers: 25
  • Avg Commits per committer: 11.88
  • Development Distribution Score (DDS): 0.626
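The Development Distribution Score appears to follow the common definition (an assumption; the page does not state its formula): one minus the top committer's share of commits. Checking against the all-time numbers, where the top committer has 239 of 777 commits:

```python
def dds(top_committer_commits: int, total_commits: int) -> float:
    """Development Distribution Score: 1 minus the top committer's commit share."""
    return 1 - top_committer_commits / total_commits

# All-time figures: top committer has 239 of 777 commits.
print(round(dds(239, 777), 3))  # 0.692, matching the reported all-time score
```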
Top Committers
Name Email Commits
Gabriel Martín Blázquez g****v@g****m 239
Agus a****n@a****o 202
Alvaro Bartolome a****o@a****o 149
David Berenstein d****n@g****m 54
Daniel Vila Suero d****l@a****o 51
Sara Han 1****r 16
Ignacio Talavera i****a@g****m 16
burtenshaw b****n@a****o 7
Luca Rolshoven l****n@b****h 4
zye1996 z****e@g****u 3
dependabot[bot] 4****] 3
Daniel van Strien d****n 3
plaguss a****s@A****l 3
DANIEL VILA SUERO d****o@M****l 3
Hassaan Qaisar 1****r 2
Leire l****e@r****i 1
Alex Strick van Linschoten s****l 1
rasdani 7****i 1
Zhangda Xu x****a@p****m 1
Stefano Fiorucci s****i@g****m 1
Sadra Barikbin s****1@y****m 1
Riezebos 2****s 1
Philipp Schmid 3****d 1
Parag Ekbote t****9@g****m 1
Lucain l****p@g****m 1
Jan Philipp Harries 2****e 1
Ikko Eltociear Ashimine e****r@g****m 1
Fran Peric 4****c 1
Fabian Preiß f****s@d****o 1
Edward Beeching e****g 1
and 7 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 395
  • Total pull requests: 664
  • Average time to close issues: 22 days
  • Average time to close pull requests: 6 days
  • Total issue authors: 64
  • Total pull request authors: 53
  • Average comments per issue: 0.81
  • Average comments per pull request: 1.15
  • Merged pull requests: 506
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 74
  • Pull requests: 194
  • Average time to close issues: 20 days
  • Average time to close pull requests: 9 days
  • Issue authors: 34
  • Pull request authors: 29
  • Average comments per issue: 0.34
  • Average comments per pull request: 1.81
  • Merged pull requests: 124
  • Bot issues: 0
  • Bot pull requests: 8
Top Authors
Issue Authors
  • gabrielmbmb (77)
  • davidberenstein1957 (72)
  • plaguss (65)
  • dvsrepo (45)
  • alvarobartt (31)
  • ignacioct (10)
  • sdiazlor (8)
  • jphme (5)
  • kcentric (5)
  • davanstrien (4)
  • yuqie (3)
  • Hassaan-Qaisar (3)
  • Josephrp (3)
  • YueWu0301 (2)
  • cmcmaster1 (2)
Pull Request Authors
  • plaguss (166)
  • gabrielmbmb (164)
  • alvarobartt (96)
  • davidberenstein1957 (60)
  • dvsrepo (22)
  • dameikle (22)
  • sdiazlor (20)
  • ignacioct (16)
  • burtenshaw (15)
  • pre-commit-ci[bot] (8)
  • bikash119 (5)
  • zye1996 (4)
  • bjoernpl (4)
  • dependabot[bot] (3)
  • davanstrien (3)
Top Labels
Issue Labels
enhancement (165) documentation (66) bug (48) team: ml (24) good first issue (16) team: interns (15) improvement (13) integrations (11) help wanted (9) interview (6) question (5) idea (3) refactor (3) fix (2) deprecation (2) dependencies (2) ci (2)
Pull Request Labels
enhancement (150) fix (110) documentation (59) improvement (35) integrations (28) refactor (11) bug (9) dependencies (6) ci (5) argilla (5) deprecation (3) example (3) team: ml (3) tutorial (2) test (2) release (2) benchmark (1) task (1) team: interns (1) question (1)

Dependencies

.github/workflows/release.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • Jinja2 >= 3.1.2
  • datasets >= 2.14.0
  • rich >= 13.5.0
  • tenacity >= 8
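The pyproject.toml constraints above would correspond to entries like the following (a partial reconstruction for illustration, not the project's actual file):

```toml
[project]
dependencies = [
    "Jinja2 >= 3.1.2",
    "datasets >= 2.14.0",
    "rich >= 13.5.0",
    "tenacity >= 8",
]
```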