Learning from Crowds with Crowd-Kit
Learning from Crowds with Crowd-Kit - Published in JOSS (2024)
Science Score: 100.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 7 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to joss.theoj.org
- ✓ Committers with academic emails: 3 of 28 committers (10.7%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Control the quality of your labeled data with the Python tools you already know.
Basic Info
- Host: GitHub
- Owner: Toloka
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://crowd-kit.readthedocs.io/
- Size: 1.43 MB
Statistics
- Stars: 230
- Watchers: 11
- Forks: 18
- Open Issues: 4
- Releases: 24
Metadata Files
README.md
Crowd-Kit: Computational Quality Control for Crowdsourcing
Crowd-Kit is a Python library that implements commonly used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets. We strive to implement functionality that simplifies working with crowdsourced data.
Currently, Crowd-Kit contains:
- implementations of commonly used aggregation methods for categorical, pairwise, textual, and segmentation responses;
- metrics of uncertainty, consistency, and agreement with aggregate;
- loaders for popular crowdsourced datasets.
In addition, the learning subpackage contains PyTorch implementations of methods for deep learning from crowds and advanced aggregation algorithms.
Installing
To install Crowd-Kit, run `pip install crowd-kit`. If you also want to use the learning subpackage, run `pip install crowd-kit[learning]`.
If you are interested in contributing to Crowd-Kit, use uv to manage the dependencies:
```shell
uv venv
uv pip install -e '.[dev,docs,learning]'
uv tool run pre-commit install
```
We use pytest for testing and a variety of linters, including pre-commit, Black, isort, Flake8, pyupgrade, and nbQA, to simplify code maintenance.
Getting Started
This example shows how to use Crowd-Kit for categorical aggregation using the classical Dawid-Skene algorithm.
First, let us do all the necessary imports.
```python
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

import pandas as pd
```
Then, read your annotations into a pandas DataFrame with the columns `task`, `worker`, and `label`. Alternatively, you can download an example dataset:
```python
df = pd.read_csv('results.csv')  # should contain columns: task, worker, label

df, groundtruth = load_dataset('relevance-2')  # or download an example dataset
```
Then, you can aggregate the workers' responses using the scikit-learn-style `fit_predict` method:
```python
aggregated_labels = DawidSkene(n_iter=100).fit_predict(df)
```
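To give a feel for what Dawid-Skene aggregation does with the `task`, `worker`, `label` records, here is a minimal pure-Python sketch of the expectation-maximization loop. It is an illustration only and not Crowd-Kit's implementation: the function name `dawid_skene`, the `smoothing` parameter, and the toy records are assumptions made for this example.

```python
def dawid_skene(records, n_iter=100, smoothing=1e-2):
    """Toy Dawid-Skene EM over (task, worker, label) triples (illustrative only)."""
    tasks = sorted({t for t, _, _ in records})
    workers = sorted({w for _, w, _ in records})
    labels = sorted({l for _, _, l in records})

    # Initialise each task's posterior with normalised vote counts.
    posterior = {t: {l: 0.0 for l in labels} for t in tasks}
    for t, _, l in records:
        posterior[t][l] += 1.0
    for t in tasks:
        total = sum(posterior[t].values())
        posterior[t] = {l: v / total for l, v in posterior[t].items()}

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker.
        prior = {l: sum(posterior[t][l] for t in tasks) / len(tasks) for l in labels}
        confusion = {
            w: {true: {seen: smoothing for seen in labels} for true in labels}
            for w in workers
        }
        for t, w, seen in records:
            for true in labels:
                confusion[w][true][seen] += posterior[t][true]
        for w in workers:
            for true in labels:
                row_total = sum(confusion[w][true].values())
                for seen in labels:
                    confusion[w][true][seen] /= row_total

        # E-step: posterior over the true label of every task.
        # Plain products are fine for a toy; real code would work in log space.
        scores = {t: dict(prior) for t in tasks}
        for t, w, seen in records:
            for true in labels:
                scores[t][true] *= confusion[w][true][seen]
        for t in tasks:
            total = sum(scores[t].values())
            posterior[t] = {l: s / total for l, s in scores[t].items()}

    return {t: max(posterior[t], key=posterior[t].get) for t in tasks}


# Hypothetical toy data: workers "a" and "b" are consistent, "c" is noisy.
records = [
    ("t1", "a", "cat"), ("t1", "b", "cat"), ("t1", "c", "dog"),
    ("t2", "a", "dog"), ("t2", "b", "dog"), ("t2", "c", "dog"),
    ("t3", "a", "cat"), ("t3", "b", "cat"), ("t3", "c", "cat"),
    ("t4", "a", "dog"), ("t4", "b", "dog"), ("t4", "c", "cat"),
]
labels = dawid_skene(records, n_iter=20)
print(labels)
```

The sketch alternates between estimating how each worker confuses labels and re-estimating the most likely true label per task, which is the same idea the `DawidSkene` class applies to a DataFrame.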
Implemented Aggregation Methods
Below is the list of currently implemented methods, including the already available (✅) and in progress (🟡).
Categorical Responses
| Method | Status |
| ------------- | :-------------: |
| Majority Vote | ✅ |
| One-coin Dawid-Skene | ✅ |
| Dawid-Skene | ✅ |
| Gold Majority Vote | ✅ |
| M-MSR | ✅ |
| Wawa | ✅ |
| Zero-Based Skill | ✅ |
| GLAD | ✅ |
| KOS | ✅ |
| MACE | ✅ |
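As a rough illustration of the two simplest entries in the table, the following pure-Python sketch implements Majority Vote and a Wawa-style aggregation (worker agreement with aggregate: run a majority vote, score each worker by how often they agree with it, then re-vote with those scores as weights). This is a sketch under stated assumptions, not Crowd-Kit's code: the function names, the alphabetical tie-breaking rule, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(records, weights=None):
    """Pick the label with the largest (optionally worker-weighted) vote per task."""
    votes = defaultdict(Counter)
    for task, worker, label in records:
        votes[task][label] += 1.0 if weights is None else weights[worker]
    # Assumption: ties are broken by taking the alphabetically first label.
    return {task: max(sorted(c), key=c.get) for task, c in votes.items()}


def wawa(records):
    """Majority vote, then skills as agreement with it, then a weighted re-vote."""
    first_pass = majority_vote(records)
    hits, totals = Counter(), Counter()
    for task, worker, label in records:
        totals[worker] += 1
        hits[worker] += label == first_pass[task]
    skills = {worker: hits[worker] / totals[worker] for worker in totals}
    return majority_vote(records, weights=skills), skills


# Hypothetical toy data: task "t4" has only two conflicting answers.
records = [
    ("t1", "a", "x"), ("t1", "b", "x"), ("t1", "c", "y"),
    ("t2", "a", "y"), ("t2", "b", "y"), ("t2", "c", "x"),
    ("t3", "a", "x"), ("t3", "b", "x"), ("t3", "c", "x"),
    ("t4", "a", "y"), ("t4", "c", "x"),
]
plain = majority_vote(records)
labels, skills = wawa(records)
print(plain["t4"], labels["t4"], skills)
```

On this toy data the unweighted vote on `t4` is a tie, while the weighted re-vote sides with worker `a`, who agreed with the first-pass aggregate more often than worker `c`.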
Multi-Label Responses
| Method | Status |
| ------------- | :-------------: |
| Binary Relevance | ✅ |
Textual Responses
| Method | Status |
| ------------- | :-------------: |
| RASA | ✅ |
| HRRASA | ✅ |
| ROVER | ✅ |
Image Segmentation
| Method | Status |
| ------------------ | :------------------: |
| Segmentation MV | ✅ |
| Segmentation RASA | ✅ |
| Segmentation EM | ✅ |
Pairwise Comparisons
| Method | Status |
| -------------- | :---------------------: |
| Bradley-Terry | ✅ |
| Noisy Bradley-Terry | ✅ |
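For pairwise comparisons, the Bradley-Terry model assigns each item a positive score so that the probability of item i beating item j is score_i / (score_i + score_j). The sketch below fits such scores with the classical minorization-maximization update in plain Python. It is illustrative only: the function name, the `(left, right, winner)` tuple layout, and the toy comparisons are assumptions for this example, not Crowd-Kit's API.

```python
from collections import Counter


def bradley_terry(comparisons, n_iter=100):
    """Fit Bradley-Terry scores from (left, right, winner) triples (illustrative)."""
    wins = Counter()
    pair_counts = Counter()
    items = set()
    for left, right, winner in comparisons:
        items.update((left, right))
        wins[winner] += 1
        pair_counts[frozenset((left, right))] += 1
    items = sorted(items)

    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for item in items:
            # Sum n_ij / (p_i + p_j) over every opponent j the item has faced.
            denom = 0.0
            for pair, n in pair_counts.items():
                if item in pair and len(pair) == 2:
                    (other,) = pair - {item}
                    denom += n / (scores[item] + scores[other])
            updated[item] = wins[item] / denom if denom else 0.0
        # Normalise so that the scores sum to one.
        total = sum(updated.values())
        scores = {item: s / total for item, s in updated.items()}
    return scores


# Hypothetical toy data: every pair is compared three times.
comparisons = [
    ("a", "b", "a"), ("a", "b", "a"), ("a", "b", "b"),
    ("b", "c", "b"), ("b", "c", "b"), ("b", "c", "c"),
    ("a", "c", "a"), ("a", "c", "a"), ("a", "c", "c"),
]
scores = bradley_terry(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because every pair is compared equally often here, the fitted ranking follows the win counts; with unbalanced comparisons the scores also account for the strength of the opponents each item faced.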
Learning from Crowds
| Method | Status |
| ------------- | :-------------: |
| CrowdLayer | ✅ |
| CoNAL | ✅ |
Citation
- Ustalov D., Pavlichenko N., Tseitlin B. (2024). Learning from Crowds with Crowd-Kit. Journal of Open Source Software, 9(96), 6227
```bibtex
@article{CrowdKit,
  author      = {Ustalov, Dmitry and Pavlichenko, Nikita and Tseitlin, Boris},
  title       = {{Learning from Crowds with Crowd-Kit}},
  year        = {2024},
  journal     = {Journal of Open Source Software},
  volume      = {9},
  number      = {96},
  pages       = {6227},
  publisher   = {The Open Journal},
  doi         = {10.21105/joss.06227},
  issn        = {2475-9066},
  eprint      = {2109.08584},
  eprinttype  = {arxiv},
  eprintclass = {cs.HC},
  language    = {english},
}
```
Support and Contributions
Please use GitHub Issues to seek support and submit feature requests. We accept contributions to Crowd-Kit via GitHub according to the guidelines in CONTRIBUTING.md.
License
© Crowd-Kit team authors, 2020–2024. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.
Owner
- Name: Toloka
- Login: Toloka
- Kind: organization
- Email: github@toloka.ai
- Website: https://toloka.ai
- Repositories: 11
- Profile: https://github.com/Toloka
Data labeling platform for ML
JOSS Publication
Learning from Crowds with Crowd-Kit
Tags
crowdsourcing, data labeling, answer aggregation, truth inference, learning from crowds, machine learning, quality control, data quality
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite the article from preferred-citation.
title: Crowd-Kit
type: software
authors:
  - family-names: Ustalov
    given-names: Dmitry
    orcid: "https://orcid.org/0000-0002-9979-2188"
  - family-names: Pavlichenko
    given-names: Nikita
    orcid: "https://orcid.org/0000-0002-7330-393X"
  - given-names: Boris
    family-names: Tseitlin
    orcid: "https://orcid.org/0000-0001-8553-4260"
repository-code: "https://github.com/Toloka/crowd-kit"
url: "https://crowd-kit.readthedocs.io/"
repository-artifact: "https://pypi.org/project/crowd-kit/"
keywords:
  - Python
  - crowdsourcing
  - data labeling
  - answer aggregation
  - truth inference
  - learning from crowds
  - machine learning
  - quality control
  - data quality
license: Apache-2.0
preferred-citation:
  type: article
  authors:
    - family-names: Ustalov
      given-names: Dmitry
      orcid: "https://orcid.org/0000-0002-9979-2188"
    - family-names: Pavlichenko
      given-names: Nikita
      orcid: "https://orcid.org/0000-0002-7330-393X"
    - given-names: Boris
      family-names: Tseitlin
      orcid: "https://orcid.org/0000-0001-8553-4260"
  title: "Learning from Crowds with Crowd-Kit"
  year: 2024
  journal: Journal of Open Source Software
  volume: 9
  issue: 96
  start: 6227
  end: 6227
  doi: "10.21105/joss.06227"
  issn: 2475-9066
  identifiers:
    - type: other
      value: "arXiv:2109.08584"
      description: The ArXiv preprint of the paper
GitHub Events
Total
- Issues event: 3
- Watch event: 23
- Issue comment event: 9
- Push event: 11
- Pull request review event: 5
- Pull request event: 14
- Fork event: 3
- Create event: 2
Last Year
- Issues event: 4
- Watch event: 23
- Issue comment event: 9
- Push event: 11
- Pull request review event: 5
- Pull request event: 14
- Fork event: 3
- Create event: 2
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Dmitry Ustalov | d****v@g****m | 135 |
| Nikita Pavlichenko | p****v@p****u | 37 |
| Sukhorosov Aleksey | a****v@g****m | 26 |
| Denis Fraltsov | 8****t | 22 |
| Mathew Shen | d****r@g****m | 21 |
| Alisa | a****a@g****m | 19 |
| Vladimir Losev | m****k@g****m | 19 |
| Natalia | n****6@y****u | 15 |
| Stepan Nosov | n****8@y****u | 15 |
| Daniil Likhobaba | l****p@p****u | 12 |
| dependabot[bot] | 4****] | 7 |
| pavlichenko | p****o@y****u | 7 |
| DrhF | m****w@g****m | 5 |
| Alexander Vnuchkov | a****v@y****u | 3 |
| Evgeny Tulin | t****v@y****u | 3 |
| vlad-mois | v****s@y****u | 3 |
| Tahar Allouche | t****o@g****m | 2 |
| mr-fedulow | m****w@y****u | 2 |
| Aleksandr Dremov | d****e@g****m | 1 |
| gilyazev-yu | g****u@y****u | 1 |
| dsamuylov | d****v@y****u | 1 |
| btseytlin | b****n@y****u | 1 |
| Daniil Likhobaba | l****p@y****u | 1 |
| Artem Grigorev | o****j@t****i | 1 |
| Iulian Giliazev | g****a@p****u | 1 |
| Pavel Gein | p****n@y****u | 1 |
| arcadia-devtools | a****s@y****u | 1 |
| shadchin | s****n@y****u | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 25
- Total pull requests: 125
- Average time to close issues: about 1 month
- Average time to close pull requests: 7 days
- Total issue authors: 15
- Total pull request authors: 18
- Average comments per issue: 2.2
- Average comments per pull request: 1.46
- Merged pull requests: 112
- Bot issues: 0
- Bot pull requests: 11
Past Year
- Issues: 4
- Pull requests: 13
- Average time to close issues: 3 days
- Average time to close pull requests: about 18 hours
- Issue authors: 4
- Pull request authors: 4
- Average comments per issue: 0.75
- Average comments per pull request: 0.69
- Merged pull requests: 10
- Bot issues: 0
- Bot pull requests: 3
Top Authors
Issue Authors
- shenxiangzhuang (6)
- jcklie (3)
- pilot7747 (2)
- Senarect (2)
- LydiaMak (2)
- ahundt (1)
- alexdremov (1)
- TanVD (1)
- takumi1001 (1)
- Mind-the-Cap (1)
- vikasraykar (1)
- Marceau-h (1)
- taharallouche (1)
- johann-petrak (1)
- amine-boukriba (1)
Pull Request Authors
- shenxiangzhuang (29)
- dustalov (26)
- pilot7747 (17)
- dependabot[bot] (11)
- Losik (5)
- Natalyl3 (5)
- aliskin (4)
- Senarect (4)
- taharallouche (4)
- alexdrydew (4)
- DrhF (3)
- varfolomeii (3)
- alexandervnuchkov (3)
- denaxen (2)
- ortemij (2)
Packages
- Total packages: 1
- Total downloads: 4,916 last month (PyPI)
- Total dependent packages: 3
- Total dependent repositories: 5
- Total versions: 23
- Total maintainers: 2
pypi.org: crowd-kit
Computational Quality Control for Crowdsourcing
- Homepage: https://github.com/Toloka/crowd-kit
- Documentation: https://crowd-kit.readthedocs.io/
- License: Apache 2.0
- Latest release: 1.4.1 (published over 1 year ago)
Dependencies
- attrs *
- nltk *
- numpy *
- pandas *
- scikit-learn *
- tqdm *
- transformers *
- actions/checkout v3 composite
- actions/setup-python v4 composite
- peter-evans/create-pull-request v4 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- citation-file-format/cffconvert-github-action 2.0.0 composite
- build * develop
- codecov * develop
- flake8 * develop
- ipywidgets * develop
- mypy * develop
- notebook * develop
- pytest * develop
- stubmaker * develop
- twine * develop
- crowd-kit *

