argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets

https://github.com/argilla-io/argilla

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 107 committers (2.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.1%) to scientific vocabulary

Keywords

active-learning ai annotation-tool developer-tools gpt-4 human-in-the-loop langchain llm machine-learning mlops natural-language-processing nlp rlhf text-annotation text-labeling weak-supervision weakly-supervised-learning

Keywords from Contributors

transformer cryptocurrency jax huggingface agents llms synthetic-dataset-generation synthetic-data rlaif cryptography
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 4,645
  • Watchers: 33
  • Forks: 449
  • Open Issues: 32
  • Releases: 132
Topics
active-learning ai annotation-tool developer-tools gpt-4 human-in-the-loop langchain llm machine-learning mlops natural-language-processing nlp rlhf text-annotation text-labeling weak-supervision weakly-supervised-learning
Created almost 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

[!IMPORTANT] The original authors have moved on to exciting new projects! The codebase is mature and stable, having served users reliably for years. While we won't be adding new features going forward, we're committed to fixing bugs and publishing patches as needed. If you're interested in helping maintain or extend this project, we'd love to hear from you! Please open an issue to discuss becoming a maintainer; we're looking for dedicated contributors who can take ownership of the project's future development.

Argilla

Build high quality datasets for your AI models

Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

If you just want to get started, deploy Argilla on Hugging Face Spaces. Curious, and want to know more? Read our documentation.

Or play with the Argilla UI by signing in with your Hugging Face account.

Why use Argilla?

Argilla can be used for collecting human feedback for a wide variety of AI projects like traditional NLP (text classification, NER, etc.), LLMs (RAG, preference tuning, etc.), or multimodal models (text to image, etc.). Argilla's programmatic approach lets you build workflows for continuous evaluation and model improvement. The goal of Argilla is to ensure your data work pays off by quickly iterating on the right data and models.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data, which tackles the root cause of both of these problems at once. Argilla helps you to achieve and keep high-quality standards for your data. This means you can improve the quality of your AI output.

Take control of your data and models

Most AI tools are black boxes. Argilla is different. We believe that you should be the owner of both your data and your models. That's why we provide you with all the tools your team needs to manage your data and models in a way that suits you best.

Improve efficiency by quickly iterating on the right data and models

Gathering data is a time-consuming process. Argilla helps by providing a tool that allows you to interact with your data in a more engaging way. This means you can quickly and easily label your data with filters, AI feedback suggestions and semantic search, so you can focus on training your models and monitoring their performance.

Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Discord: get direct support from the community in #argilla-distilabel-general and #argilla-distilabel-help.

  • Roadmap: plans change but we love to discuss those with our community so feel encouraged to participate.

What do people build with Argilla?

Open-source datasets and models

The community uses Argilla to create amazing open-source datasets and models.

  • Cleaned UltraFeedback dataset used to fine-tune the Notus and Notux models. The original UltraFeedback dataset was curated using Argilla UI filters to find and report a bug in the original data generation code. Based on this data curation process, Argilla built this new version of the UltraFeedback dataset and fine-tuned Notus, outperforming Zephyr on several benchmarks.
  • distilabel Intel Orca DPO dataset used to fine-tune the improved OpenHermes model. This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset and outperforming models fine-tuned on the original dataset.

Example use cases

AI teams from organizations such as the Red Cross, Loris.ai and Prolific use Argilla to improve the quality and efficiency of AI projects. They shared their experiences in our AI community meetup.

  • AI for good: the Red Cross presentation showcases how the Red Cross domain experts and AI team collaborated by classifying and redirecting requests from refugees of the Ukrainian crisis to streamline the support processes of the Red Cross.
  • Customer support: during the Loris meetup, they showed how their AI team uses unsupervised and few-shot contrastive learning to quickly validate and gain labeled samples for a large number of multi-label classifiers.
  • Research studies: the showcase from Prolific announced their integration with our platform. They use it to actively distribute data collection projects among their annotating workforce. This allows Prolific to quickly and efficiently collect high-quality data for research studies.

Getting started

Installation

First things first! You can install the SDK with pip as follows:

```console
pip install argilla
```

After that, you will need to deploy Argilla Server. The easiest way to do this is through our free Hugging Face Spaces deployment integration.

To use the client, you need to import the Argilla class and instantiate it with the API URL and API key.

```python
import argilla as rg

# Replace the placeholders with your Space URL and your API key.
client = rg.Argilla(
    api_url="https://[your-owner-name]-[your_space_name].hf.space",
    api_key="owner.apikey",
)
```
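Hard-coding the API key is fine for a quick test, but a common alternative is to read both values from the environment. The sketch below shows one way to do that. The helpers `space_url`, `client_settings` and `make_client`, and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY`, are assumptions made for illustration and are not taken from this README.

```python
import os


def space_url(owner: str, space: str) -> str:
    """Build a Space URL following the pattern shown above (hypothetical helper)."""
    return f"https://{owner}-{space}.hf.space"


def client_settings(env=os.environ) -> dict:
    """Collect the two keyword arguments for rg.Argilla from a mapping.

    ARGILLA_API_URL and ARGILLA_API_KEY are assumed names; use whatever
    variables your deployment actually defines.
    """
    return {"api_url": env["ARGILLA_API_URL"], "api_key": env["ARGILLA_API_KEY"]}


def make_client():
    """Create the client; the SDK is imported lazily so the helpers above work without it."""
    import argilla as rg

    return rg.Argilla(**client_settings())
```

Calling `make_client()` then replaces the hard-coded instantiation, and a missing variable fails early with a `KeyError` instead of a confusing authentication error later.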

Create your first dataset

We can now create a dataset with a simple text classification task. First, you need to define the dataset settings.

```python
settings = rg.Settings(
    guidelines="Classify the reviews as positive or negative.",
    fields=[
        rg.TextField(
            name="review",
            title="Text from the review",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="my_label",
            title="In which category does this article fit?",
            labels=["positive", "negative"],
        )
    ],
)
dataset = rg.Dataset(
    name="my_first_dataset",
    settings=settings,
    client=client,
)
dataset.create()
```

Next, we can add records to the dataset.

```bash
pip install datasets
```

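```python
from datasets import load_dataset

# Load the first 100 IMDB training rows as a list of dictionaries.
data = load_dataset("imdb", split="train[:100]").to_list()
# Send the source column "text" to the dataset field "review".
dataset.records.log(records=data, mapping={"text": "review"})
```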

You have successfully created your first dataset with Argilla. You can now access it in the Argilla UI and start annotating the records. Need more info? Check out our docs.
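The `mapping={"text": "review"}` argument used when logging records pairs a key in the source rows with a name defined in the dataset settings. The plain-Python sketch below only illustrates that renaming idea. `apply_mapping` is a hypothetical helper, not Argilla's implementation, and the sample row is made up; how the SDK treats keys that are not mapped is not covered here.

```python
def apply_mapping(row: dict, mapping: dict) -> dict:
    """Rename source keys to dataset names; keys absent from the mapping are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


# A made-up row shaped like the source data: a text column plus a label column.
rows = [{"text": "A wonderful film.", "label": 1}]
mapping = {"text": "review"}

renamed = [apply_mapping(row, mapping) for row in rows]
print(renamed)  # [{'review': 'A wonderful film.', 'label': 1}]
```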

Contributors

To help our community with the creation of contributions, we have created our community docs.

Owner

  • Name: Argilla
  • Login: argilla-io
  • Kind: organization
  • Email: contact@argilla.io

Building the open-source tool for data-centric NLP

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 3,386
  • Total Committers: 107
  • Avg Commits per committer: 31.645
  • Development Distribution Score (DDS): 0.797
Past Year
  • Commits: 615
  • Committers: 37
  • Avg Commits per committer: 16.622
  • Development Distribution Score (DDS): 0.636
Top Committers
Name Email Commits
Francisco Aranda f****s@a****o 687
Francisco Aranda f****o@r****i 343
Alvaro Bartolome a****o@a****o 295
leiyre l****e@r****i 274
David Berenstein d****n@g****m 206
Damián Pumar d****r@g****m 201
leiyre l****e@a****o 179
David Fidalgo d****d@r****i 161
José Francisco Calvo j****e@a****o 146
Daniel Vila Suero d****l@r****i 134
Gabriel Martín Blázquez g****v@g****m 112
Keith Cuniah 8****h 95
Sara Han 1****r 68
Ignacio Talavera i****a@g****m 59
Tom Aarsen 3****n 56
burtenshaw b****n@a****o 51
Natalia Elvira 1****v 48
pre-commit-ci[bot] 6****] 39
Leire Rosado 3****l 28
Agus 5****s 23
kursathalat 8****t 18
Leire Aguirre Eguiluz l****k@g****m 14
dependabot[bot] 4****] 11
Garima Upadhyay 6****u 10
Paul Bauriegel p****l@w****e 8
Rohit Jadhav r****t@a****o 5
David Carreto Fidalgo d****o@g****m 4
bikash119 b****a@g****m 4
Thomas Chaigneau t****c@g****m 4
Ankush Chander a****r@g****m 4
and 77 more...
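
The two derived figures in the All Time block can be reproduced from numbers on this page. The sketch assumes the usual definition of the Development Distribution Score: one minus the share of commits made by the single most active committer. That definition is not stated here, but it reproduces the reported value.

```python
total_commits = 3386
total_committers = 107
top_committer_commits = 687  # first row of the Top Committers table

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed DDS definition: 1 - (commits by the top committer / total commits).
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 3), round(dds, 3))  # 31.645 0.797
```

A score near 1 means the work is spread across many committers, while a score near 0 means one committer dominates.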

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 642
  • Total pull requests: 995
  • Average time to close issues: 6 months
  • Average time to close pull requests: 15 days
  • Total issue authors: 123
  • Total pull request authors: 65
  • Average comments per issue: 2.07
  • Average comments per pull request: 1.14
  • Merged pull requests: 684
  • Bot issues: 0
  • Bot pull requests: 40
Past Year
  • Issues: 105
  • Pull requests: 398
  • Average time to close issues: 2 months
  • Average time to close pull requests: 8 days
  • Issue authors: 62
  • Pull request authors: 33
  • Average comments per issue: 0.77
  • Average comments per pull request: 1.11
  • Merged pull requests: 279
  • Bot issues: 0
  • Bot pull requests: 27
Top Authors
Issue Authors
  • davidberenstein1957 (127)
  • frascuchon (74)
  • jfcalvo (70)
  • nataliaElv (58)
  • sdiazlor (31)
  • dvsrepo (26)
  • burtenshaw (22)
  • ignacioct (19)
  • Amelie-V (14)
  • cceyda (12)
  • leiyre (11)
  • damianpumar (10)
  • plaguss (7)
  • MoritzLaurer (7)
  • gabrielmbmb (6)
Pull Request Authors
  • frascuchon (198)
  • jfcalvo (167)
  • damianpumar (127)
  • leiyre (99)
  • burtenshaw (66)
  • sdiazlor (57)
  • davidberenstein1957 (48)
  • dependabot[bot] (36)
  • dvsrepo (25)
  • nataliaElv (22)
  • ignacioct (21)
  • paulbauriegel (19)
  • bikash119 (9)
  • gabrielmbmb (8)
  • dcfidalgo (7)
Top Labels
Issue Labels
  • status: stale (196)
  • type: enhancement (154)
  • area: ui (100)
  • team: frontend (80)
  • area: ux (58)
  • team: backend (57)
  • type: bug (55)
  • team: ml (49)
  • area: python sdk (46)
  • type: documentation (40)
  • language: python (40)
  • area: server (28)
  • area: api (27)
  • severity: minor (23)
  • team: interns (22)
  • type: community request (21)
  • good first issue (15)
  • severity: major (14)
  • status: help wanted (14)
  • stale (13)
  • area: trainer (11)
  • status: question (9)
  • area: architecture (9)
  • type: popular request (8)
  • language: javascript (7)
  • type: improvement (6)
  • type: dependencies (6)
  • type: integration (4)
  • type: technical debt (3)
  • type: deprecation (3)
Pull Request Labels
  • area: ui (83)
  • lgtm (73)
  • language: python (69)
  • team: backend (65)
  • language: javascript (60)
  • team: frontend (59)
  • severity: minor (59)
  • size:L (50)
  • type: enhancement (48)
  • type: documentation (45)
  • type: dependencies (38)
  • size:S (34)
  • type: improvement (32)
  • size:XS (31)
  • type: bug (26)
  • area: python sdk (25)
  • team: ml (23)
  • type: refactor (22)
  • javascript (21)
  • size:M (20)
  • area: api (17)
  • area: server (15)
  • area: tests (12)
  • python (11)
  • size:XXL (10)
  • size:XL (9)
  • area: ci (8)
  • ready-to-merge (6)
  • status: stale (4)
  • dependencies (3)

Dependencies

.github/actions/docker-image-tag-from-ref/action.yml actions
  • Dockerfile * docker
.github/actions/generate-credentials/action.yml actions
  • Dockerfile * docker
.github/actions/slack-post-credentials/action.yml actions
  • Dockerfile * docker
.github/workflows/build-python-package.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • actions/upload-artifact v3 composite
.github/workflows/check-repo-files.yml actions
  • actions/checkout v3 composite
  • dorny/paths-filter v2 composite
.github/workflows/close-inactive-issues-bot.yml actions
  • actions/stale v5 composite
.github/workflows/close-pr.yml actions
  • actions/checkout v3 composite
  • google-github-actions/auth v1 composite
  • google-github-actions/setup-gcloud v1 composite
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/dependency-review.yml actions
  • actions/checkout v3 composite
  • actions/dependency-review-action v1 composite
.github/workflows/deploy-environment.yml actions
  • ./.github/actions/generate-credentials * composite
  • ./.github/actions/slack-post-credentials * composite
  • actions/checkout v3 composite
  • google-github-actions/auth v1 composite
  • google-github-actions/deploy-cloudrun v1 composite
  • thollander/actions-comment-pull-request v2 composite
.github/workflows/package.yml actions
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/download-artifact v2 composite
  • codecov/codecov-action v2 composite
  • pypa/gh-action-pypi-publish master composite
.github/workflows/run-python-tests.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • conda-incubator/setup-miniconda v2 composite
  • ${{inputs.searchEngineDockerImage}} * docker
.github/workflows/tutorials.yml actions
  • actions/checkout v3 composite
  • conda-incubator/setup-miniconda v2 composite
.github/actions/docker-image-tag-from-ref/Dockerfile docker
  • python 3.10 build
.github/actions/generate-credentials/Dockerfile docker
  • python 3.10 build
.github/actions/slack-post-credentials/Dockerfile docker
  • python 3.10 build
.github/workflows/end2end-examples.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • conda-incubator/setup-miniconda v2 composite
  • ${{inputs.searchEngineDockerImage}} * docker
.github/workflows/build-push-dev-frontend-docker.yml actions
  • ./.github/actions/docker-image-tag-from-ref * composite
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • docker/build-push-action v4 composite
  • docker/login-action v2 composite
  • docker/setup-buildx-action v2 composite
  • google-github-actions/auth v1 composite