Learning from Crowds with Crowd-Kit

Learning from Crowds with Crowd-Kit - Published in JOSS (2024)

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 7 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org
✓ Committers with academic emails: 3 of 28 committers (10.7%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

aggregations annotation crowd crowdsourcing data-mining data-science labeling python quality-control toloka truth-inference

Keywords from Contributors

mesh

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 64% confidence

Last synced: 6 months ago

Repository

Control the quality of your labeled data with the Python tools you already know.

Basic Info

Host: GitHub
Owner: Toloka
License: other
Language: Python
Default Branch: main
Homepage: https://crowd-kit.readthedocs.io/
Size: 1.43 MB

Statistics

Stars: 230
Watchers: 11
Forks: 18
Open Issues: 4
Releases: 24

Created almost 5 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Contributing License Citation Codeowners Authors

Crowd-Kit: Computational Quality Control for Crowdsourcing

Crowd-Kit is a powerful Python library that implements commonly-used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets. We strive to implement functionality that simplifies working with crowdsourced data.

Currently, Crowd-Kit contains:

implementations of commonly-used aggregation methods for categorical, pairwise, textual, and segmentation responses;
metrics of uncertainty, consistency, and agreement with aggregate;
loaders for popular crowdsourced datasets.
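To make the metrics item above more concrete, here is a minimal plain-Python sketch of two such quantities: the average label entropy per task (one notion of uncertainty) and the share of responses that agree with the per-task majority label. The function names and toy data are hypothetical and are not Crowd-Kit's API; the library's own metrics may be defined differently.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (task, worker, label) triples.
responses = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
    ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
]


def labels_by_task(rows):
    """Group the submitted labels by task."""
    grouped = defaultdict(list)
    for task, _worker, label in rows:
        grouped[task].append(label)
    return grouped


def mean_entropy(rows):
    """Average Shannon entropy (in nats) of the label distribution per task."""
    entropies = []
    for labels in labels_by_task(rows).values():
        total = len(labels)
        counts = Counter(labels)
        entropies.append(-sum(c / total * math.log(c / total) for c in counts.values()))
    return sum(entropies) / len(entropies)


def agreement_with_majority(rows):
    """Fraction of responses that match their task's most common label."""
    grouped = labels_by_task(rows)
    majority = {task: Counter(labels).most_common(1)[0][0] for task, labels in grouped.items()}
    return sum(label == majority[task] for task, _worker, label in rows) / len(rows)


uncertainty = mean_entropy(responses)
agreement = agreement_with_majority(responses)
```

On this toy data, five of the six responses match their task's majority label, and only task `t1` contributes non-zero entropy.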

Also, the learning subpackage contains PyTorch implementations of deep learning from crowds methods and advanced aggregation algorithms.

Installing

To install Crowd-Kit, run the following command: pip install crowd-kit. If you also want to use the learning subpackage, type pip install crowd-kit[learning].

If you are interested in contributing to Crowd-Kit, use uv to manage the dependencies:

````shell
uv venv
uv pip install -e '.[dev,docs,learning]'
uv tool run pre-commit install
````

We use pytest for testing and a variety of linters, including pre-commit, Black, isort, Flake8, pyupgrade, and nbQA, to simplify code maintenance.

Getting Started

This example shows how to use Crowd-Kit for categorical aggregation using the classical Dawid-Skene algorithm.

First, let us do all the necessary imports.

````python
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

import pandas as pd
````

Then, you need to read your annotations into a pandas DataFrame with the columns task, worker, and label. Alternatively, you can download an example dataset:

````python
df = pd.read_csv('results.csv')  # should contain columns: task, worker, label

df, groundtruth = load_dataset('relevance-2')  # or download an example dataset
````

Then, you can aggregate the workers' responses using the fit_predict method, which follows the scikit-learn naming convention:

````python
aggregated_labels = DawidSkene(n_iter=100).fit_predict(df)
````
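To illustrate what an aggregator does with task/worker/label data, the following is a minimal sketch in plain Python, not Crowd-Kit code. It applies a simple majority vote (a much simpler rule than Dawid-Skene) to hypothetical toy annotations and scores the result against a hypothetical ground truth. All names and values here are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations in the task/worker/label shape.
rows = [
    {"task": "t1", "worker": "w1", "label": "relevant"},
    {"task": "t1", "worker": "w2", "label": "relevant"},
    {"task": "t1", "worker": "w3", "label": "irrelevant"},
    {"task": "t2", "worker": "w1", "label": "irrelevant"},
    {"task": "t2", "worker": "w2", "label": "irrelevant"},
    {"task": "t2", "worker": "w3", "label": "relevant"},
]

# Hypothetical ground truth used only to score the aggregate.
ground_truth = {"t1": "relevant", "t2": "irrelevant"}


def majority_vote(rows):
    """Return the most frequent label for every task."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["task"]][row["label"]] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


aggregated = majority_vote(rows)

# Share of tasks whose aggregated label matches the ground truth.
accuracy = sum(aggregated[task] == label for task, label in ground_truth.items()) / len(ground_truth)
```

Each task here has an odd number of votes, so ties do not arise; a real pipeline would need an explicit tie-breaking rule.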

More usage examples

Implemented Aggregation Methods

Below is the list of currently implemented methods, including the already available (✅) and in progress (🟡).

Categorical Responses

| Method | Status |
| ------------- | :-------------: |
| Majority Vote | ✅ |
| One-coin Dawid-Skene | ✅ |
| Dawid-Skene | ✅ |
| Gold Majority Vote | ✅ |
| M-MSR | ✅ |
| Wawa | ✅ |
| Zero-Based Skill | ✅ |
| GLAD | ✅ |
| KOS | ✅ |
| MACE | ✅ |
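As an illustration of one entry in the table, Wawa (Worker Agreement with Aggregate) is usually described in three steps: take a majority vote, estimate each worker's skill as the fraction of their responses that match it, then repeat the vote with responses weighted by skill. The sketch below follows that description in plain Python. The function names and data are hypothetical, and Crowd-Kit's implementation may differ in details such as tie handling.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (task, worker, label) triples.
responses = [
    ("t1", "w1", "A"), ("t1", "w2", "A"), ("t1", "w3", "B"),
    ("t2", "w1", "B"), ("t2", "w2", "B"), ("t2", "w3", "A"),
    ("t3", "w1", "A"), ("t3", "w2", "B"), ("t3", "w3", "B"),
]


def weighted_vote(rows, weights=None):
    """Pick the label with the largest total weight per task (weight 1 if none given)."""
    scores = defaultdict(lambda: defaultdict(float))
    for task, worker, label in rows:
        scores[task][label] += 1.0 if weights is None else weights[worker]
    return {task: max(by_label, key=by_label.get) for task, by_label in scores.items()}


def worker_skills(rows, aggregate):
    """Skill = fraction of a worker's responses equal to the aggregated label."""
    hits, totals = Counter(), Counter()
    for task, worker, label in rows:
        totals[worker] += 1
        hits[worker] += label == aggregate[task]
    return {worker: hits[worker] / totals[worker] for worker in totals}


def wawa(rows):
    """Majority vote, then skills, then a skill-weighted vote."""
    first_pass = weighted_vote(rows)
    skills = worker_skills(rows, first_pass)
    return weighted_vote(rows, skills), skills


labels, skills = wawa(responses)
```

In this toy data, worker `w2` always agrees with the majority and gets skill 1.0, while `w3` agrees only once in three responses.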

Multi-Label Responses

|Method|Status|
|-|:-:|
|Binary Relevance|✅|
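Binary relevance generally means treating every label as an independent yes/no question and aggregating each one separately. The following plain-Python sketch uses a per-label majority vote as the base aggregator. This is an assumption made for illustration; the data and function name are hypothetical and not taken from Crowd-Kit.

```python
from collections import defaultdict

# Hypothetical toy data: each worker submits a set of labels per task.
responses = [
    ("t1", "w1", {"cat", "dog"}),
    ("t1", "w2", {"cat"}),
    ("t1", "w3", {"cat", "bird"}),
    ("t2", "w1", {"dog"}),
    ("t2", "w2", {"dog", "bird"}),
    ("t2", "w3", {"bird"}),
]


def binary_relevance(rows):
    """Keep a label for a task if a strict majority of its annotators chose it."""
    votes = defaultdict(lambda: defaultdict(int))
    annotators = defaultdict(int)
    for task, _worker, labels in rows:
        annotators[task] += 1
        for label in labels:
            votes[task][label] += 1
    return {
        task: {label for label, count in votes[task].items() if 2 * count > annotators[task]}
        for task in annotators
    }


aggregated = binary_relevance(responses)
```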

Textual Responses

| Method | Status |
| ------------- | :-------------: |
| RASA | ✅ |
| HRRASA | ✅ |
| ROVER | ✅ |

Image Segmentation

| Method | Status |
| ------------------ | :------------------: |
| Segmentation MV | ✅ |
| Segmentation RASA | ✅ |
| Segmentation EM | ✅ |
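The simplest idea in this family, pixel-wise majority voting, can be sketched in a few lines of plain Python: a pixel is kept when more than half of the workers' binary masks include it. The masks and function name below are hypothetical, and the sketch ignores worker weighting, which a library implementation may support.

```python
# Hypothetical binary masks (2 x 3 pixels) from three workers for one image.
masks = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
]


def segmentation_majority_vote(masks):
    """Keep a pixel if strictly more than half of the masks mark it."""
    n_masks = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(2 * sum(mask[r][c] for mask in masks) > n_masks) for c in range(width)]
        for r in range(height)
    ]


aggregated_mask = segmentation_majority_vote(masks)
```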

Pairwise Comparisons

| Method | Status |
| -------------- | :---------------------: |
| Bradley-Terry | ✅ |
| Noisy Bradley-Terry | ✅ |
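For intuition about the Bradley-Terry model, the sketch below estimates item scores from (winner, loser) pairs using the classic minorization-maximization update: each score becomes the item's win count divided by a sum over its opponents of games played over the combined scores, followed by normalization. This is a generic textbook iteration in plain Python with hypothetical data and names; it is not Crowd-Kit's code, and it assumes every item wins at least once.

```python
from collections import defaultdict

# Hypothetical comparisons as (winner, loser) pairs.
comparisons = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 2
)


def bradley_terry(pairs, n_iter=100):
    """Estimate normalized item scores with minorization-maximization updates."""
    items = sorted({item for pair in pairs for item in pair})
    wins = defaultdict(int)
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denominator = sum(
                games[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denominator if denominator else scores[i]
        total = sum(updated.values())
        scores = {item: value / total for item, value in updated.items()}
    return scores


scores = bradley_terry(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Items that never win would be driven to a zero score by this update, so real data usually needs smoothing or a regularized variant.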

Learning from Crowds

|Method|Status|
|-|:-:|
|CrowdLayer|✅|
|CoNAL|✅|

Citation

Ustalov D., Pavlichenko N., Tseitlin B. (2024). Learning from Crowds with Crowd-Kit. Journal of Open Source Software, 9(96), 6227

````bibtex
@article{CrowdKit,
  author      = {Ustalov, Dmitry and Pavlichenko, Nikita and Tseitlin, Boris},
  title       = {{Learning from Crowds with Crowd-Kit}},
  year        = {2024},
  journal     = {Journal of Open Source Software},
  volume      = {9},
  number      = {96},
  pages       = {6227},
  publisher   = {The Open Journal},
  doi         = {10.21105/joss.06227},
  issn        = {2475-9066},
  eprint      = {2109.08584},
  eprinttype  = {arxiv},
  eprintclass = {cs.HC},
  language    = {english},
}
````

Support and Contributions

Please use GitHub Issues to seek support and submit feature requests. We accept contributions to Crowd-Kit via GitHub according to our guidelines in CONTRIBUTING.md.

License

Owner

Name: Toloka
Login: Toloka
Kind: organization
Email: github@toloka.ai

Website: https://toloka.ai
Repositories: 11
Profile: https://github.com/Toloka

Data labeling platform for ML

JOSS Publication

Learning from Crowds with Crowd-Kit

Published

April 06, 2024

DOI

10.21105/joss.06227

Volume 9, Issue 96, Page 6227

Authors

Dmitry Ustalov

JetBrains, Serbia

Nikita Pavlichenko

JetBrains, Germany

Boris Tseitlin

Planet Farms, Portugal

Editor

Arfon Smith

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite the article from preferred-citation.
title: Crowd-Kit
type: software
authors:
- family-names: Ustalov
  given-names: Dmitry
  orcid: "https://orcid.org/0000-0002-9979-2188"
- family-names: Pavlichenko
  given-names: Nikita
  orcid: "https://orcid.org/0000-0002-7330-393X"
- given-names: Boris
  family-names: Tseitlin
  orcid: "https://orcid.org/0000-0001-8553-4260"
repository-code: "https://github.com/Toloka/crowd-kit"
url: "https://crowd-kit.readthedocs.io/"
repository-artifact: "https://pypi.org/project/crowd-kit/"
keywords:
- Python
- crowdsourcing
- data labeling
- answer aggregation
- truth inference
- learning from crowds
- machine learning
- quality control
- data quality
license: Apache-2.0
preferred-citation:
  type: article
  authors:
  - family-names: Ustalov
    given-names: Dmitry
    orcid: "https://orcid.org/0000-0002-9979-2188"
  - family-names: Pavlichenko
    given-names: Nikita
    orcid: "https://orcid.org/0000-0002-7330-393X"
  - given-names: Boris
    family-names: Tseitlin
    orcid: "https://orcid.org/0000-0001-8553-4260"
  title: "Learning from Crowds with Crowd-Kit"
  year: 2024
  journal: Journal of Open Source Software
  volume: 9
  issue: 96
  start: 6227
  end: 6227
  doi: "10.21105/joss.06227"
  issn: 2475-9066
  identifiers:
  - type: other
    value: "arXiv:2109.08584"
    description: The ArXiv preprint of the paper

GitHub Events

Total

Issues event: 3
Watch event: 23
Issue comment event: 9
Push event: 11
Pull request review event: 5
Pull request event: 14
Fork event: 3
Create event: 2

Last Year

Issues event: 4
Watch event: 23
Issue comment event: 9
Push event: 11
Pull request review event: 5
Pull request event: 14
Fork event: 3
Create event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 363
Total Committers: 28
Avg Commits per committer: 12.964
Development Distribution Score (DDS): 0.628

Past Year

Commits: 21
Committers: 4
Avg Commits per committer: 5.25
Development Distribution Score (DDS): 0.524

Top Committers

Name	Email	Commits
Dmitry Ustalov	d**v@g**m	135
Nikita Pavlichenko	p**v@p**u	37
Sukhorosov Aleksey	a**v@g**m	26
Denis Fraltsov	8****t	22
Mathew Shen	d**r@g**m	21
Alisa	a**a@g**m	19
Vladimir Losev	m**k@g**m	19
Natalia	n**6@y**u	15
Stepan Nosov	n**8@y**u	15
Daniil Likhobaba	l**p@p**u	12
dependabot[bot]	4****]	7
pavlichenko	p**o@y**u	7
DrhF	m**w@g**m	5
Alexander Vnuchkov	a**v@y**u	3
Evgeny Tulin	t**v@y**u	3
vlad-mois	v**s@y**u	3
Tahar Allouche	t**o@g**m	2
mr-fedulow	m**w@y**u	2
Aleksandr Dremov	d**e@g**m	1
gilyazev-yu	g**u@y**u	1
dsamuylov	d**v@y**u	1
btseytlin	b**n@y**u	1
Daniil Likhobaba	l**p@y**u	1
Artem Grigorev	o**j@t**i	1
Iulian Giliazev	g**a@p**u	1
Pavel Gein	p**n@y**u	1
arcadia-devtools	a**s@y**u	1
shadchin	s**n@y**u	1

Committer Domains (Top 20 + Academic)

yandex-team.ru: 12
phystech.edu: 3
yandex.ru: 2
toloka.ai: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 25
Total pull requests: 125
Average time to close issues: about 1 month
Average time to close pull requests: 7 days
Total issue authors: 15
Total pull request authors: 18
Average comments per issue: 2.2
Average comments per pull request: 1.46
Merged pull requests: 112
Bot issues: 0
Bot pull requests: 11

Past Year

Issues: 4
Pull requests: 13
Average time to close issues: 3 days
Average time to close pull requests: about 18 hours
Issue authors: 4
Pull request authors: 4
Average comments per issue: 0.75
Average comments per pull request: 0.69
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 3

Top Authors

Issue Authors

shenxiangzhuang (6)
jcklie (3)
pilot7747 (2)
Senarect (2)
LydiaMak (2)
ahundt (1)
alexdremov (1)
TanVD (1)
takumi1001 (1)
Mind-the-Cap (1)
vikasraykar (1)
Marceau-h (1)
taharallouche (1)
johann-petrak (1)
amine-boukriba (1)

Pull Request Authors

shenxiangzhuang (29)
dustalov (26)
pilot7747 (17)
dependabot[bot] (11)
Losik (5)
Natalyl3 (5)
aliskin (4)
Senarect (4)
taharallouche (4)
alexdrydew (4)
DrhF (3)
varfolomeii (3)
alexandervnuchkov (3)
denaxen (2)
ortemij (2)

Top Labels

Issue Labels

bug (9) documentation (8) enhancement (6) good first issue (2)

Pull Request Labels

dependencies (11) enhancement (9) documentation (7) good first issue (3) github_actions (3) bug (1) duplicate (1)

Packages

Total packages: 1
Total downloads:
- pypi 4,916 last-month

Total dependent packages: 3
Total dependent repositories: 5
Total versions: 23
Total maintainers: 2

pypi.org: crowd-kit

Computational Quality Control for Crowdsourcing

Homepage: https://github.com/Toloka/crowd-kit
Documentation: https://crowd-kit.readthedocs.io/
License: Apache 2.0
Latest release: 1.4.1
published over 1 year ago

Versions: 23
Dependent Packages: 3
Dependent Repositories: 5
Downloads: 4,916 Last month

Rankings

Dependent packages count: 2.3%

Downloads: 4.4%

Stargazers count: 5.3%

Average: 5.7%

Dependent repos count: 6.7%

Forks count: 9.6%

Maintainers (2)

losik toloka-opensource

Last synced: 6 months ago

Dependencies

setup.py pypi

attrs *
nltk *
numpy *
pandas *
scikit-learn *
tqdm *
transformers *

.github/workflows/release.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
peter-evans/create-pull-request v4 composite

.github/workflows/tests.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
citation-file-format/cffconvert-github-action 2.0.0 composite

Pipfile pypi

build * develop
codecov * develop
flake8 * develop
ipywidgets * develop
mypy * develop
notebook * develop
pytest * develop
stubmaker * develop
twine * develop
crowd-kit *

pyproject.toml pypi
