distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

https://github.com/argilla-io/distilabel

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 37 committers (2.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation

Keywords from Contributors

transformers cryptocurrencies jax cryptography agents multi-agents application fine-tuning optimism rag
Last synced: 6 months ago

Repository

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

Basic Info
Statistics
  • Stars: 2,862
  • Watchers: 26
  • Forks: 216
  • Open Issues: 95
  • Releases: 33
Topics
ai huggingface llms openai python rlaif rlhf synthetic-data synthetic-dataset-generation
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

[!IMPORTANT]
The original authors have moved on to other projects. A group of community members has recently joined the GitHub project as collaborators to maintain it and is actively working towards the next release. In the meantime, check out the develop branch for the latest fixes and improvements.


Synthesize data for AI and add feedback on the fly!


Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, from traditional predictive NLP (classification, extraction, etc.) to generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Its programmatic approach lets you build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both problems at once. Distilabel helps you synthesize and judge data, so you can spend your valuable time achieving and maintaining high-quality standards for your data.

Take control of your data and models

Owning the data for fine-tuning your own LLMs is not easy, but distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.

Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Discord: get direct support from the community in #argilla-general and #argilla-help.

  • Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.

What do people build with Distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

  • The 1M OpenHermesPreferences is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use distilabel to synthesize data on an immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve its quality.

Installation

```sh
pip install distilabel --upgrade
```

Requires Python 3.9+
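Before installing, you can verify the interpreter version with a quick stdlib check (a minimal sketch, not part of distilabel itself):

```python
import sys

# distilabel requires Python 3.9+; fail fast with a clear message otherwise.
if sys.version_info < (3, 9):
    raise RuntimeError(
        f"distilabel requires Python 3.9+, found {sys.version.split()[0]}"
    )
print("Python version OK:", sys.version.split()[0])
```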

In addition, the following extras are available:

LLMs

  • anthropic: for using models available in Anthropic API via the AnthropicLLM integration.
  • cohere: for using models available in Cohere via the CohereLLM integration.
  • argilla: for exporting the generated datasets to Argilla.
  • groq: for using models available in Groq via the groq Python client and the GroqLLM integration.
  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
  • litellm: for using LiteLLM to call any LLM using OpenAI format via the LiteLLM integration.
  • llama-cpp: for using llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
  • mistralai: for using models available in Mistral AI API via the MistralAILLM integration.
  • ollama: for using Ollama and its available models via the OllamaLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration, as well as the other integrations that rely on the OpenAI client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
  • vllm: for using the vllm serving engine via the vLLM integration.
  • sentence-transformers: for generating sentence embeddings using sentence-transformers.
  • mlx: for using MLX models via the MlxLLM integration.

Structured generation

  • outlines: for using structured generation of LLMs with outlines.
  • instructor: for using structured generation of LLMs with Instructor.

Data processing

  • ray: for scaling and distributing a pipeline with Ray.
  • faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
  • text-clustering: for using text clustering with UMAP and Scikit-learn.
  • minhash: for using minhash for duplicate detection with datasketch and nltk.
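Extras can be combined in a single install command, e.g. `pip install "distilabel[openai,outlines]"`. A small stdlib sketch (the module names here are assumptions mapped from the extras above, not an official distilabel helper) to check which optional backends are importable in the current environment:

```python
import importlib.util

def available_backends(modules=("openai", "anthropic", "outlines", "vllm")):
    """Return the subset of optional modules that can be imported."""
    return [m for m in modules if importlib.util.find_spec(m) is not None]

print("importable optional backends:", available_backends())
```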

Example

To run the following example you must install distilabel with the hf-inference-endpoints extra:

```sh
pip install "distilabel[hf-inference-endpoints]" --upgrade
```

Then run:

```python
from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")
```

Badges

If you build something cool with distilabel consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


Contribute

To contribute to distilabel directly, check our good first issues or open a new one.

Citation

```bibtex
@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
```

Owner

  • Name: Argilla
  • Login: argilla-io
  • Kind: organization
  • Email: contact@argilla.io

Building the open-source tool for data-centric NLP

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Bartolomé"
  given-names: "Álvaro"
- family-names: "Martín-Blázquez"
  given-names: "Gabriel"
- family-names: "Piqueres-Lajarín"
  given-names: "Agustín"
- family-names: "Vila-Suero"
  given-names: "Daniel"
title: "Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs."
version: 1.1.1
date-released: 2024-05-22
url: "https://github.com/argilla-io/distilabel"
```

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 777
  • Total Committers: 37
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.692
Past Year
  • Commits: 297
  • Committers: 25
  • Avg Commits per committer: 11.88
  • Development Distribution Score (DDS): 0.626
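The Development Distribution Score appears to follow the common definition (an assumption; the page does not state its formula): one minus the top committer's share of commits. Checking against the all-time numbers, where the top committer has 239 of 777 commits:

```python
def dds(top_committer_commits: int, total_commits: int) -> float:
    """Development Distribution Score: 1 minus the top committer's commit share."""
    return 1 - top_committer_commits / total_commits

# All-time figures: top committer has 239 of 777 commits.
print(round(dds(239, 777), 3))  # 0.692, matching the reported all-time score
```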
Top Committers
Name Email Commits
Gabriel Martín Blázquez g****v@g****m 239
Agus a****n@a****o 202
Alvaro Bartolome a****o@a****o 149
David Berenstein d****n@g****m 54
Daniel Vila Suero d****l@a****o 51
Sara Han 1****r 16
Ignacio Talavera i****a@g****m 16
burtenshaw b****n@a****o 7
Luca Rolshoven l****n@b****h 4
zye1996 z****e@g****u 3
dependabot[bot] 4****] 3
Daniel van Strien d****n 3
plaguss a****s@A****l 3
DANIEL VILA SUERO d****o@M****l 3
Hassaan Qaisar 1****r 2
Leire l****e@r****i 1
Alex Strick van Linschoten s****l 1
rasdani 7****i 1
Zhangda Xu x****a@p****m 1
Stefano Fiorucci s****i@g****m 1
Sadra Barikbin s****1@y****m 1
Riezebos 2****s 1
Philipp Schmid 3****d 1
Parag Ekbote t****9@g****m 1
Lucain l****p@g****m 1
Jan Philipp Harries 2****e 1
Ikko Eltociear Ashimine e****r@g****m 1
Fran Peric 4****c 1
Fabian Preiß f****s@d****o 1
Edward Beeching e****g 1
and 7 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 395
  • Total pull requests: 664
  • Average time to close issues: 22 days
  • Average time to close pull requests: 6 days
  • Total issue authors: 64
  • Total pull request authors: 53
  • Average comments per issue: 0.81
  • Average comments per pull request: 1.15
  • Merged pull requests: 506
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 74
  • Pull requests: 194
  • Average time to close issues: 20 days
  • Average time to close pull requests: 9 days
  • Issue authors: 34
  • Pull request authors: 29
  • Average comments per issue: 0.34
  • Average comments per pull request: 1.81
  • Merged pull requests: 124
  • Bot issues: 0
  • Bot pull requests: 8
Top Authors
Issue Authors
  • gabrielmbmb (77)
  • davidberenstein1957 (72)
  • plaguss (65)
  • dvsrepo (45)
  • alvarobartt (31)
  • ignacioct (10)
  • sdiazlor (8)
  • jphme (5)
  • kcentric (5)
  • davanstrien (4)
  • yuqie (3)
  • Hassaan-Qaisar (3)
  • Josephrp (3)
  • YueWu0301 (2)
  • cmcmaster1 (2)
Pull Request Authors
  • plaguss (166)
  • gabrielmbmb (164)
  • alvarobartt (96)
  • davidberenstein1957 (60)
  • dvsrepo (22)
  • dameikle (22)
  • sdiazlor (20)
  • ignacioct (16)
  • burtenshaw (15)
  • pre-commit-ci[bot] (8)
  • bikash119 (5)
  • zye1996 (4)
  • bjoernpl (4)
  • dependabot[bot] (3)
  • davanstrien (3)
Top Labels
Issue Labels
enhancement (165) documentation (66) bug (48) team: ml (24) good first issue (16) team: interns (15) improvement (13) integrations (11) help wanted (9) interview (6) question (5) idea (3) refactor (3) fix (2) deprecation (2) dependencies (2) ci (2)
Pull Request Labels
enhancement (150) fix (110) documentation (59) improvement (35) integrations (28) refactor (11) bug (9) dependencies (6) ci (5) argilla (5) deprecation (3) example (3) team: ml (3) tutorial (2) test (2) release (2) benchmark (1) task (1) team: interns (1) question (1)

Dependencies

.github/workflows/release.yml actions
  • actions/cache v3 composite
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • Jinja2 >= 3.1.2
  • datasets >= 2.14.0
  • rich >= 13.5.0
  • tenacity >= 8
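The pyproject.toml constraints above would correspond to entries like the following (a partial reconstruction for illustration, not the project's actual file):

```toml
[project]
dependencies = [
    "Jinja2 >= 3.1.2",
    "datasets >= 2.14.0",
    "rich >= 13.5.0",
    "tenacity >= 8",
]
```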