Learning from Crowds with Crowd-Kit

Learning from Crowds with Crowd-Kit - Published in JOSS (2024)

https://github.com/toloka/crowd-kit

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    3 of 28 committers (10.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

aggregations annotation crowd crowdsourcing data-mining data-science labeling python quality-control toloka truth-inference

Keywords from Contributors

mesh

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 64% confidence
Last synced: 4 months ago

Repository

Control the quality of your labeled data with the Python tools you already know.

Statistics
  • Stars: 230
  • Watchers: 11
  • Forks: 18
  • Open Issues: 4
  • Releases: 24
Created almost 5 years ago · Last pushed 4 months ago
Metadata Files
Readme Changelog Contributing License Citation Codeowners Authors

README.md

Crowd-Kit: Computational Quality Control for Crowdsourcing

Crowd-Kit is a powerful Python library that implements commonly-used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets. We strive to implement functionality that simplifies working with crowdsourced data.

Currently, Crowd-Kit contains:

  • implementations of commonly-used aggregation methods for categorical, pairwise, textual, and segmentation responses;
  • metrics of uncertainty, consistency, and agreement with aggregate;
  • loaders for popular crowdsourced datasets.

Also, the learning subpackage contains PyTorch implementations of deep learning from crowds methods and advanced aggregation algorithms.
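
To make the "uncertainty" metric in the list above more concrete, here is a minimal pure-Python sketch of one common way to quantify it: the Shannon entropy of the label distribution within each task, averaged over tasks. This illustrates the idea and is not Crowd-Kit's own implementation; the function name `mean_label_entropy` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def mean_label_entropy(annotations):
    """Average Shannon entropy (in nats) of the per-task label distribution.

    `annotations` is an iterable of (task, worker, label) triples.
    """
    labels_by_task = defaultdict(list)
    for task, _worker, label in annotations:
        labels_by_task[task].append(label)
    if not labels_by_task:
        return 0.0

    entropies = []
    for labels in labels_by_task.values():
        total = len(labels)
        shares = [count / total for count in Counter(labels).values()]
        entropies.append(-sum(p * math.log(p) for p in shares))
    return sum(entropies) / len(entropies)


# Hypothetical toy data: workers agree on t1 and split evenly on t2.
annotations = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "cat"),
    ("t2", "w1", "cat"), ("t2", "w2", "dog"),
]
uncertainty = mean_label_entropy(annotations)
print(uncertainty)
```

A value of zero means every task received unanimous labels; larger values indicate more disagreement among workers.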

Installing

To install Crowd-Kit, run `pip install crowd-kit`. If you also want to use the `learning` subpackage, run `pip install crowd-kit[learning]`.

If you are interested in contributing to Crowd-Kit, use uv to manage the dependencies:

```shell
uv venv
uv pip install -e '.[dev,docs,learning]'
uv tool run pre-commit install
```

We use pytest for testing and a variety of linters, including pre-commit, Black, isort, Flake8, pyupgrade, and nbQA, to simplify code maintenance.

Getting Started

This example shows how to use Crowd-Kit for categorical aggregation using the classical Dawid-Skene algorithm.

First, let us do all the necessary imports.

```python
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

import pandas as pd
```

Then, read your annotations into a pandas DataFrame with the columns `task`, `worker`, and `label`. Alternatively, you can download an example dataset:

```python
# Option 1: load your own annotations (uncomment and point to your file).
# df = pd.read_csv('results.csv')  # should contain columns: task, worker, label

# Option 2: download an example dataset.
df, ground_truth = load_dataset('relevance-2')
```

Then, you can aggregate the workers' responses using the `fit_predict` method, which follows the scikit-learn estimator convention:

```python
aggregated_labels = DawidSkene(n_iter=100).fit_predict(df)
```
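
For readers who want intuition for what the Dawid-Skene aggregation does, the following is a rough pure-Python sketch of the expectation-maximization idea behind it: alternate between estimating each worker's confusion matrix from the current label beliefs and updating those beliefs from the confusion matrices. It is not Crowd-Kit's implementation; the function name `dawid_skene_sketch`, the `eps` smoothing term, and the toy data are assumptions made for this example.

```python
from collections import defaultdict


def dawid_skene_sketch(annotations, n_iter=20, eps=1e-9):
    """Toy Dawid-Skene EM over (task, worker, label) triples. Returns {task: label}."""
    annotations = list(annotations)
    labels = sorted({label for _, _, label in annotations})
    by_task = defaultdict(list)
    for task, worker, label in annotations:
        by_task[task].append((worker, label))

    # Initialisation: belief over the true label = share of votes per task.
    posterior = {
        task: {l: sum(given == l for _, given in resp) / len(resp) for l in labels}
        for task, resp in by_task.items()
    }

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker.
        prior = {l: sum(p[l] for p in posterior.values()) / len(posterior) for l in labels}
        confusion = defaultdict(lambda: {l: {k: eps for k in labels} for l in labels})
        for task, resp in by_task.items():
            for worker, given in resp:
                for l in labels:
                    confusion[worker][l][given] += posterior[task][l]
        for matrix in confusion.values():
            for l in labels:
                row_sum = sum(matrix[l].values())
                for k in labels:
                    matrix[l][k] /= row_sum

        # E-step: update beliefs from priors and confusion matrices.
        for task, resp in by_task.items():
            scores = {}
            for l in labels:
                score = prior[l]
                for worker, given in resp:
                    score *= confusion[worker][l][given]
                scores[l] = score
            total = sum(scores.values())
            posterior[task] = {l: s / total for l, s in scores.items()}

    return {task: max(labels, key=p.__getitem__) for task, p in posterior.items()}


# Hypothetical toy data: w1 and w2 agree with each other, w3 always answers "A".
annotations = [
    ("t1", "w1", "A"), ("t1", "w2", "A"), ("t1", "w3", "A"),
    ("t2", "w1", "B"), ("t2", "w2", "B"), ("t2", "w3", "A"),
    ("t3", "w1", "A"), ("t3", "w2", "A"), ("t3", "w3", "A"),
    ("t4", "w1", "B"), ("t4", "w2", "B"), ("t4", "w3", "A"),
]
result = dawid_skene_sketch(annotations)
print(result)
```

The sketch multiplies raw probabilities, which is fine for toy data but would underflow on tasks with many responses; a practical implementation would typically work with log-probabilities.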

More usage examples

Implemented Aggregation Methods

Below is the list of currently implemented methods, including the already available (✅) and in progress (🟡).

Categorical Responses

| Method | Status |
| ------------- | :-------------: |
| Majority Vote | ✅ |
| One-coin Dawid-Skene | ✅ |
| Dawid-Skene | ✅ |
| Gold Majority Vote | ✅ |
| M-MSR | ✅ |
| Wawa | ✅ |
| Zero-Based Skill | ✅ |
| GLAD | ✅ |
| KOS | ✅ |
| MACE | ✅ |
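
As an illustration of how two of the simpler categorical methods relate, the sketch below first computes a plain majority vote and then applies the Wawa idea (worker agreement with aggregate): each worker's skill is the share of their answers that match the majority vote, and the final label is a skill-weighted vote. This is a pure-Python approximation of the idea, not Crowd-Kit's code; the function name `wawa_sketch`, the tie-breaking rule, and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def wawa_sketch(annotations):
    """Return ({task: label}, {worker: skill}) from (task, worker, label) triples."""
    annotations = list(annotations)

    # Step 1: plain majority vote per task (ties broken by label order).
    counts = defaultdict(Counter)
    for task, _worker, label in annotations:
        counts[task][label] += 1
    majority = {t: max(sorted(c), key=c.__getitem__) for t, c in counts.items()}

    # Step 2: worker skill = share of answers that match the majority vote.
    hits, totals = Counter(), Counter()
    for task, worker, label in annotations:
        totals[worker] += 1
        hits[worker] += int(label == majority[task])
    skills = {w: hits[w] / totals[w] for w in totals}

    # Step 3: weighted vote, each answer weighted by its worker's skill.
    weighted = defaultdict(lambda: defaultdict(float))
    for task, worker, label in annotations:
        weighted[task][label] += skills[worker]
    labels = {t: max(sorted(s), key=s.__getitem__) for t, s in weighted.items()}
    return labels, skills


# Hypothetical toy data: w3 disagrees with the other workers on t1 and t3.
annotations = [
    ("t1", "w1", "A"), ("t1", "w2", "A"), ("t1", "w3", "B"),
    ("t2", "w1", "B"), ("t2", "w2", "B"), ("t2", "w3", "B"),
    ("t3", "w1", "A"), ("t3", "w3", "B"),
]
labels, skills = wawa_sketch(annotations)
print(labels)
print(skills)
```

On `t3` the raw votes are tied, and the weighted vote favors `w1`, whose answers matched the majority more often than those of `w3`.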

Multi-Label Responses

| Method | Status |
| ------------- | :-------------: |
| Binary Relevance | ✅ |

Textual Responses

| Method | Status |
| ------------- | :-------------: |
| RASA | ✅ |
| HRRASA | ✅ |
| ROVER | ✅ |

Image Segmentation

| Method | Status |
| ------------------ | :------------------: |
| Segmentation MV | ✅ |
| Segmentation RASA | ✅ |
| Segmentation EM | ✅ |

Pairwise Comparisons

| Method | Status |
| -------------- | :---------------------: |
| Bradley-Terry | ✅ |
| Noisy Bradley-Terry | ✅ |
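
For intuition about pairwise aggregation, here is a minimal pure-Python sketch of fitting a Bradley-Terry model with the classic minorization-maximization update, where each item's score is its number of wins divided by a sum over its comparisons. It is not Crowd-Kit's implementation; the function name `bradley_terry_sketch`, the `(winner, loser)` input format, and the toy data are assumptions, and the sketch assumes every item wins at least one comparison.

```python
from collections import defaultdict


def bradley_terry_sketch(comparisons, n_iter=100):
    """Fit Bradley-Terry scores from (winner, loser) pairs. Scores sum to 1."""
    wins = defaultdict(int)         # wins[i]: comparisons won by item i
    pair_counts = defaultdict(int)  # pair_counts[(i, j)]: comparisons between i and j
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(winner, loser)] += 1
        pair_counts[(loser, winner)] += 1
        items.update((winner, loser))

    scores = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum over all opponents j of n_ij / (p_i + p_j).
            denom = sum(
                n / (scores[i] + scores[j])
                for (a, j), n in pair_counts.items()
                if a == i
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        scores = {i: s / total for i, s in updated.items()}
    return scores


# Hypothetical toy data: "a" usually beats "b" and "c"; "b" usually beats "c".
comparisons = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 2 + [("c", "b")]
    + [("a", "c")] * 2 + [("c", "a")]
)
scores = bradley_terry_sketch(comparisons)
print(sorted(scores, key=scores.get, reverse=True))
```

The resulting scores can be read as relative strengths: under the model, item `i` beats item `j` with probability `scores[i] / (scores[i] + scores[j])`.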

Learning from Crowds

| Method | Status |
| ------------- | :-------------: |
| CrowdLayer | ✅ |
| CoNAL | ✅ |

Citation

```bibtex
@article{CrowdKit,
  author      = {Ustalov, Dmitry and Pavlichenko, Nikita and Tseitlin, Boris},
  title       = {{Learning from Crowds with Crowd-Kit}},
  year        = {2024},
  journal     = {Journal of Open Source Software},
  volume      = {9},
  number      = {96},
  pages       = {6227},
  publisher   = {The Open Journal},
  doi         = {10.21105/joss.06227},
  issn        = {2475-9066},
  eprint      = {2109.08584},
  eprinttype  = {arxiv},
  eprintclass = {cs.HC},
  language    = {english},
}
```

Support and Contributions

Please use GitHub Issues to seek support and submit feature requests. We accept contributions to Crowd-Kit via GitHub according to our guidelines in CONTRIBUTING.md.

License

© Crowd-Kit team authors, 2020–2024. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.

Owner

  • Name: Toloka
  • Login: Toloka
  • Kind: organization
  • Email: github@toloka.ai

Data labeling platform for ML

JOSS Publication

Learning from Crowds with Crowd-Kit
Published
April 06, 2024
Volume 9, Issue 96, Page 6227
Authors
Dmitry Ustalov ORCID
JetBrains, Serbia
Nikita Pavlichenko ORCID
JetBrains, Germany
Boris Tseitlin ORCID
Planet Farms, Portugal
Editor
Arfon Smith ORCID
Tags
crowdsourcing data labeling answer aggregation truth inference learning from crowds machine learning quality control data quality

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite the article from preferred-citation.
title: Crowd-Kit
type: software
authors:
- family-names: Ustalov
  given-names: Dmitry
  orcid: "https://orcid.org/0000-0002-9979-2188"
- family-names: Pavlichenko
  given-names: Nikita
  orcid: "https://orcid.org/0000-0002-7330-393X"
- given-names: Boris
  family-names: Tseitlin
  orcid: "https://orcid.org/0000-0001-8553-4260"
repository-code: "https://github.com/Toloka/crowd-kit"
url: "https://crowd-kit.readthedocs.io/"
repository-artifact: "https://pypi.org/project/crowd-kit/"
keywords:
- Python
- crowdsourcing
- data labeling
- answer aggregation
- truth inference
- learning from crowds
- machine learning
- quality control
- data quality
license: Apache-2.0
preferred-citation:
  type: article
  authors:
  - family-names: Ustalov
    given-names: Dmitry
    orcid: "https://orcid.org/0000-0002-9979-2188"
  - family-names: Pavlichenko
    given-names: Nikita
    orcid: "https://orcid.org/0000-0002-7330-393X"
  - given-names: Boris
    family-names: Tseitlin
    orcid: "https://orcid.org/0000-0001-8553-4260"
  title: "Learning from Crowds with Crowd-Kit"
  year: 2024
  journal: Journal of Open Source Software
  volume: 9
  issue: 96
  start: 6227
  end: 6227
  doi: "10.21105/joss.06227"
  issn: 2475-9066
  identifiers:
  - type: other
    value: "arXiv:2109.08584"
    description: The ArXiv preprint of the paper

GitHub Events

Total
  • Issues event: 3
  • Watch event: 23
  • Issue comment event: 9
  • Push event: 11
  • Pull request review event: 5
  • Pull request event: 14
  • Fork event: 3
  • Create event: 2
Last Year
  • Issues event: 4
  • Watch event: 23
  • Issue comment event: 9
  • Push event: 11
  • Pull request review event: 5
  • Pull request event: 14
  • Fork event: 3
  • Create event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 363
  • Total Committers: 28
  • Avg Commits per committer: 12.964
  • Development Distribution Score (DDS): 0.628
Past Year
  • Commits: 21
  • Committers: 4
  • Avg Commits per committer: 5.25
  • Development Distribution Score (DDS): 0.524
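
The All Time figures above can be reproduced from the commit counts on this page. The sketch below assumes that DDS is defined as one minus the share of commits made by the most active committer; the page does not state its formula, so treat this as a plausible reading rather than a confirmed definition. The top committer's count (135) is taken from the Top Committers list.

```python
total_commits = 363
total_committers = 28
top_committer_commits = 135  # first row of the Top Committers list

# Average number of commits per committer.
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (top committer's commits / total commits).
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 3), round(dds, 3))
```

Under this assumption both values match the reported 12.964 and 0.628, which suggests that just over a third of all commits come from a single committer.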
Top Committers
Name Email Commits
Dmitry Ustalov d****v@g****m 135
Nikita Pavlichenko p****v@p****u 37
Sukhorosov Aleksey a****v@g****m 26
Denis Fraltsov 8****t 22
Mathew Shen d****r@g****m 21
Alisa a****a@g****m 19
Vladimir Losev m****k@g****m 19
Natalia n****6@y****u 15
Stepan Nosov n****8@y****u 15
Daniil Likhobaba l****p@p****u 12
dependabot[bot] 4****] 7
pavlichenko p****o@y****u 7
DrhF m****w@g****m 5
Alexander Vnuchkov a****v@y****u 3
Evgeny Tulin t****v@y****u 3
vlad-mois v****s@y****u 3
Tahar Allouche t****o@g****m 2
mr-fedulow m****w@y****u 2
Aleksandr Dremov d****e@g****m 1
gilyazev-yu g****u@y****u 1
dsamuylov d****v@y****u 1
btseytlin b****n@y****u 1
Daniil Likhobaba l****p@y****u 1
Artem Grigorev o****j@t****i 1
Iulian Giliazev g****a@p****u 1
Pavel Gein p****n@y****u 1
arcadia-devtools a****s@y****u 1
shadchin s****n@y****u 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 25
  • Total pull requests: 125
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 7 days
  • Total issue authors: 15
  • Total pull request authors: 18
  • Average comments per issue: 2.2
  • Average comments per pull request: 1.46
  • Merged pull requests: 112
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 4
  • Pull requests: 13
  • Average time to close issues: 3 days
  • Average time to close pull requests: about 18 hours
  • Issue authors: 4
  • Pull request authors: 4
  • Average comments per issue: 0.75
  • Average comments per pull request: 0.69
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 3
Top Authors
Issue Authors
  • shenxiangzhuang (6)
  • jcklie (3)
  • pilot7747 (2)
  • Senarect (2)
  • LydiaMak (2)
  • ahundt (1)
  • alexdremov (1)
  • TanVD (1)
  • takumi1001 (1)
  • Mind-the-Cap (1)
  • vikasraykar (1)
  • Marceau-h (1)
  • taharallouche (1)
  • johann-petrak (1)
  • amine-boukriba (1)
Pull Request Authors
  • shenxiangzhuang (29)
  • dustalov (26)
  • pilot7747 (17)
  • dependabot[bot] (11)
  • Losik (5)
  • Natalyl3 (5)
  • aliskin (4)
  • Senarect (4)
  • taharallouche (4)
  • alexdrydew (4)
  • DrhF (3)
  • varfolomeii (3)
  • alexandervnuchkov (3)
  • denaxen (2)
  • ortemij (2)
Top Labels
Issue Labels
bug (9) documentation (8) enhancement (6) good first issue (2)
Pull Request Labels
dependencies (11) enhancement (9) documentation (7) good first issue (3) github_actions (3) bug (1) duplicate (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 4,916 last-month
  • Total dependent packages: 3
  • Total dependent repositories: 5
  • Total versions: 23
  • Total maintainers: 2
pypi.org: crowd-kit

Computational Quality Control for Crowdsourcing

  • Versions: 23
  • Dependent Packages: 3
  • Dependent Repositories: 5
  • Downloads: 4,916 Last month
Rankings
Dependent packages count: 2.3%
Downloads: 4.4%
Stargazers count: 5.3%
Average: 5.7%
Dependent repos count: 6.7%
Forks count: 9.6%
Maintainers (2)
Last synced: 4 months ago

Dependencies

setup.py pypi
  • attrs *
  • nltk *
  • numpy *
  • pandas *
  • scikit-learn *
  • tqdm *
  • transformers *
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • peter-evans/create-pull-request v4 composite
.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • citation-file-format/cffconvert-github-action 2.0.0 composite
Pipfile pypi
  • build * develop
  • codecov * develop
  • flake8 * develop
  • ipywidgets * develop
  • mypy * develop
  • notebook * develop
  • pytest * develop
  • stubmaker * develop
  • twine * develop
  • crowd-kit *
pyproject.toml pypi