neteasecrowd-dataset
NetEaseCrowd dataset, a collection of data obtained from You Ling crowdsourcing platform, Fuxi AI Lab, NetEase.
NetEaseCrowd
NetEaseCrowd: A Dataset for Long-term and Online Crowdsourcing Truth Inference
Introduction
We introduce NetEaseCrowd, a large-scale crowdsourcing annotation dataset built on a mature Chinese data crowdsourcing platform operated by NetEase Inc. The NetEaseCrowd dataset contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations, collected over about 6 months. We provide ground truths for all tasks and record timestamps for all annotations.
Task
The NetEaseCrowd dataset is built from various types of tasks.
Specifically, it contains 6 different task types, each associated with a different capability (as illustrated in our paper).
Some example tasks from the dataset, listed by capability id:
50: Expression similarity filtering
Question: Select the image from A, B, and C that looks the least similar in expression to the other two images.
52: Naturalness of Expression Judgment
Question: Select the most natural expression from the three images below.
53: Facial Similarity Screening
Question: Choose the face that is least like the other two from the following three faces.
56: Gesture Similarity Filter
Question: Select the gesture that looks the least similar to the other two gestures.
69: Article continuation classification
Question: Please select the best continuation between A and B based on: 1. Information richness; 2. Sentence fluency; 3. Coherence with the previous text; 4. Logical consistency, or select 'undecided'.
Question-related content (raw content in Chinese; 背景 is the background passage, 选项 A/B/C are the options, and 不确定 means "undecided"):
```
背景: 随着他的靠近,无形间带动一股越发猛烈的气流,他的脚步未曾停下半分,压强震裂了水泥浇灌的地面,半块凹陷成蛛网状,年轻男人紧咬牙关,迎着割裂盔甲的气流一步一步,左手展开挡在前面,右手握成拳聚起全身力气向前,数据条停了片刻,尖锐声波铺天盖地,刺眼的光突然炸开,黑洞般吞噬他的身体。 画面顷刻定格,小心眯起眼睛,有个声音散在一片白茫茫之中,空洞的、零落的,意外有些耳熟。 "他叫开心。"
选项A: 这是小心第几次梦见开心了?他不记得了,可每当他想要回忆起什么时,就觉得脑袋像是被人用锤子狠狠砸了一下,疼得他整个人都发麻,随后便是无尽的黑暗。
选项B: 小心不知道自己是怎么回到家的,那段记忆太过陌生,以至于他对此毫无印象,只是当他推开门的时候,看到屋内的景象,脑袋里轰然炸开一朵巨大的烟花。
选项C: 不确定
```
Comparison with existing datasets
Compared with the existing crowdsourcing datasets, our NetEaseCrowd dataset has the following characteristics:

| Characteristic | Existing datasets | NetEaseCrowd dataset |
|----------------|-------------------|----------------------|
| Scalability | Relatively small sizes in #workers/tasks/annotations | Large-scale data collection with 6 million annotations |
| Timestamps | Short-term data with no timestamps recorded | Complete timestamps recorded during a 6-month timespan |
| Task Type | Single type of tasks | Various task types with different required capabilities |

Dataset Statistics
The basic statistics of the NetEaseCrowd dataset and other previous datasets are as follows:

| Dataset | #Worker | #Task | #Groundtruth | #Anno | Avg(#Anno/worker) | Avg(#Anno/task) | Timestamp | Task type |
|---------|---------|-------|--------------|-------|-------------------|-----------------|-----------|-----------|
| NetEaseCrowd | 2,413 | 999,799 | 999,799 | 6,016,319 | 2,493.3 | 6.0 | ✔︎ | Multiple |
| Adult | 825 | 11,040 | 333 | 92,721 | 112.4 | 8.4 | ✘ | Single |
| Birds | 39 | 108 | 108 | 4,212 | 108.0 | 39.0 | ✘ | Single |
| Dog | 109 | 807 | 807 | 8,070 | 74.0 | 10.0 | ✘ | Single |
| CF | 461 | 300 | 300 | 1,720 | 3.7 | 5.7 | ✘ | Single |
| CF_amt | 110 | 300 | 300 | 6,030 | 54.8 | 20.1 | ✘ | Single |
| Emotion | 38 | 700 | 565 | 7,000 | 184.2 | 10.0 | ✘ | Single |
| Smile | 64 | 2,134 | 159 | 30,319 | 473.7 | 14.2 | ✘ | Single |
| Face | 27 | 584 | 584 | 5,242 | 194.1 | 9.0 | ✘ | Single |
| Fact | 57 | 42,624 | 576 | 216,725 | 3,802.2 | 5.1 | ✘ | Single |
| MS | 44 | 700 | 700 | 2,945 | 66.9 | 4.2 | ✘ | Single |
| Product | 176 | 8,315 | 8,315 | 24,945 | 141.7 | 3.0 | ✘ | Single |
| RTE | 164 | 800 | 800 | 8,000 | 48.8 | 10.0 | ✘ | Single |
| Sentiment | 1,960 | 98,980 | 1,000 | 569,375 | 290.5 | 5.8 | ✘ | Single |
| SP | 203 | 4,999 | 4,999 | 27,746 | 136.7 | 5.6 | ✘ | Single |
| SP_amt | 143 | 500 | 500 | 10,000 | 69.9 | 20.0 | ✘ | Single |
| Trec | 762 | 19,033 | 2,275 | 88,385 | 116.0 | 4.6 | ✘ | Single |
| Tweet | 85 | 1,000 | 1,000 | 20,000 | 235.3 | 20.0 | ✘ | Single |
| Web | 177 | 2,665 | 2,653 | 15,567 | 87.9 | 5.8 | ✘ | Single |
| ZenCrowd_us | 74 | 2,040 | 2,040 | 12,190 | 164.7 | 6.0 | ✘ | Single |
| ZenCrowd_in | 25 | 2,040 | 2,040 | 11,205 | 448.2 | 5.5 | ✘ | Single |
| ZenCrowd_all | 78 | 2,040 | 2,040 | 21,855 | 280.2 | 10.7 | ✘ | Single |

Data Content and Format
Obtain the data
There are two ways to access the dataset:
- Download the complete NetEaseCrowd dataset directly from Hugging Face [Recommended].
- Under the `data/` folder, the NetEaseCrowd dataset is provided in partitions in CSV format. Each partition is named `NetEaseCrowd_part_x.csv`. Concatenate them to get the entire NetEaseCrowd dataset.
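
The concatenation step can be sketched with pandas. This is a minimal example, assuming the partitions sit in a local `data/` folder and follow the `NetEaseCrowd_part_x.csv` naming; the helper name `load_netease_crowd` is hypothetical and not part of the repository.

```python
from pathlib import Path

import pandas as pd


def load_netease_crowd(data_dir="data"):
    """Concatenate all NetEaseCrowd_part_*.csv partitions into one DataFrame.

    Illustrative helper only; it is not shipped with the dataset.
    """
    parts = sorted(Path(data_dir).glob("NetEaseCrowd_part_*.csv"))
    if not parts:
        raise FileNotFoundError(f"no NetEaseCrowd partitions found in {data_dir!r}")
    # Read each partition and stack them; ignore_index gives a clean 0..n-1 index.
    frames = [pd.read_csv(path) for path in parts]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_netease_crowd("data")` from the repository root should then return one frame holding every annotation record. Partitions are read in lexicographic file-name order, so sort by a column afterwards if a particular row order matters.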
Dataset format
In the dataset, each record (one line) represents an interaction between a worker and a task, with the following columns:
- taskId: The unique id of the annotated task.
- tasksetId: The unique id of the task set. Each task set contains an unspecified number of tasks, and each task belongs to exactly one task set.
- workerId: The unique id of the worker.
- answer: The annotation given by the worker, encoded as an integer category index starting from 0.
- completeTime: The integer timestamp recording when the annotation was completed.
- truth: The ground truth of the annotated task, encoded, consistently with answer, as an integer category index starting from 0.
- capability: The unique id of the capability required by the annotated task set. Each task set belongs to exactly one capability.
For privacy reasons, all sensitive content, such as the -Id fields, has been anonymized.
Data sample

| tasksetId | taskId | workerId | answer | completeTime | truth | capability |
|-----------|---------------------|----------|--------|---------------|-------|------------|
| 6980 | 1012658482844795232 | 64 | 2 | 1661917345953 | 1 | 69 |
| 6980 | 1012658482844795232 | 150 | 1 | 1661871234755 | 1 | 69 |
| 6980 | 1012658482844795232 | 263 | 0 | 1661855450281 | 1 | 69 |

In the example above, there are three annotations, all from the same task set 6980 and the same task 1012658482844795232. Three annotators, with ids 64, 150, and 263, provide annotations of 2, 1, and 0, respectively, and complete the task at different times. The truth label for this task is 1, and the capability id of the task is 69.
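
To make the record layout concrete, the sketch below rebuilds the three sample rows in pandas, orders them chronologically, and flags which answers match the ground truth. It assumes that `completeTime` is a Unix timestamp in milliseconds; the documentation only says it is an integer timestamp, so treat the datetime conversion as a guess. The `completedAt` and `correct` columns are names introduced here for illustration.

```python
import pandas as pd

# The three sample annotations from the table above.
sample = pd.DataFrame(
    {
        "tasksetId": [6980, 6980, 6980],
        "taskId": [1012658482844795232] * 3,
        "workerId": [64, 150, 263],
        "answer": [2, 1, 0],
        "completeTime": [1661917345953, 1661871234755, 1661855450281],
        "truth": [1, 1, 1],
        "capability": [69, 69, 69],
    }
)

# Assumption: completeTime counts milliseconds since the Unix epoch.
sample["completedAt"] = pd.to_datetime(sample["completeTime"], unit="ms", utc=True)

# Order annotations chronologically, as an online setting would receive them.
sample = sample.sort_values("completeTime").reset_index(drop=True)

# Flag which annotations agree with the ground truth.
sample["correct"] = sample["answer"] == sample["truth"]

print(sample[["workerId", "answer", "truth", "correct", "completedAt"]])
```

Sorting by `completeTime` before feeding records to an algorithm is one simple way to replay the data in arrival order for long-term or online experiments.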
Baseline Models
We test several existing truth inference methods on our dataset. A detailed analysis with more experimental setups can be found in our paper.

| Method | Accuracy | F1-score |
|----------------|----------|----------|
| MV | 0.92695 | 0.92692 |
| DS | 0.95178 | 0.94817 |
| MACE | 0.95991 | 0.94957 |
| Wawa | 0.94814 | 0.94445 |
| ZeroBasedSkill | 0.94898 | 0.94585 |
| GLAD | 0.95064 | 0.95058 |
| EBCC | 0.91071 | 0.90996 |
| ZC | 0.95305 | 0.95301 |
| TiReMGE | 0.92713 | 0.92706 |
| LAA | 0.94173 | 0.94169 |
| BiLA | 0.88036 | 0.87896 |

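
For readers unfamiliar with the simplest baseline, the following is a rough sketch of majority voting (MV) and an accuracy check over records in the dataset's column format. It is not the implementation used for the numbers above: the tie-breaking rule (smallest answer id wins) is an arbitrary choice made here, the function names are hypothetical, and the toy data is invented for illustration.

```python
import pandas as pd


def majority_vote(annotations):
    """Return the most frequent answer per task as a Series indexed by taskId.

    Ties are broken by taking the smallest answer id, an arbitrary
    choice made for this sketch.
    """
    counts = (
        annotations.groupby(["taskId", "answer"]).size().rename("votes").reset_index()
    )
    # Sort so the first row per task has the most votes, then the smallest answer.
    counts = counts.sort_values(
        ["taskId", "votes", "answer"], ascending=[True, False, True]
    )
    return counts.drop_duplicates("taskId").set_index("taskId")["answer"]


def accuracy(predicted, annotations):
    """Fraction of tasks whose predicted label equals the ground truth."""
    truth = annotations.drop_duplicates("taskId").set_index("taskId")["truth"]
    return (predicted.reindex(truth.index) == truth).mean()


# Toy data (hypothetical, not taken from the dataset).
toy = pd.DataFrame(
    {
        "taskId": [1, 1, 1, 2, 2, 2],
        "workerId": [10, 11, 12, 10, 11, 12],
        "answer": [0, 0, 1, 2, 1, 1],
        "truth": [0, 0, 0, 2, 2, 2],
    }
)
pred = majority_vote(toy)
print(pred.to_dict(), accuracy(pred, toy))
```

In the toy data, task 1 is resolved correctly while task 2 is outvoted by two wrong answers, which is the kind of error that worker-aware methods such as DS try to reduce.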
Test with the dataset directly from crowd-kit
The NetEaseCrowd dataset has been integrated into crowd-kit via a pull request, so you can load it directly in your code (requires crowd-kit version > 1.2.1):
```python
from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

# Load the annotations (df) and the ground-truth labels (gt).
df, gt = load_dataset('netease_crowd')

# Run Dawid-Skene with 10 iterations.
ds = DawidSkene(10)
result = ds.fit_predict(df)

print(len(result))
# 999799
```
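
If you start from the raw CSV partitions instead of `load_dataset`, the columns have to be mapped to the names the aggregators read. The sketch below assumes the `task`, `worker`, `label` column convention of recent crowd-kit releases and a ground-truth Series indexed by task; check this against the crowd-kit version you have installed. `to_crowdkit_format` is a hypothetical helper name.

```python
import pandas as pd


def to_crowdkit_format(raw):
    """Split a NetEaseCrowd-style frame into (annotations, ground_truth).

    Assumes crowd-kit aggregators read the columns `task`, `worker`, `label`.
    """
    annotations = raw.rename(
        columns={"taskId": "task", "workerId": "worker", "answer": "label"}
    )[["task", "worker", "label"]]
    # One ground-truth label per task, indexed by task id.
    ground_truth = (
        raw.drop_duplicates("taskId").set_index("taskId")["truth"].rename_axis("task")
    )
    return annotations, ground_truth
```

The returned `annotations` frame can then be passed to an aggregator such as `DawidSkene(10).fit_predict(annotations)`, and `ground_truth` can be compared against the aggregated labels.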
Other public datasets
We provide a curated list of other public datasets for the truth inference task.

| Dataset Name | Resource |
|--------------|----------|
| adult | Quality management on Amazon Mechanical Turk. [paper][data] |
| sentiment, fact | Workshops Held at the First AAAI Conference on Human Computation and Crowdsourcing: A Report. [paper][data] |
| MS, zencrowd_all, zencrowd_us, zencrowd_in, sp, sp_amt, cf, cf_amt | The active crowd toolkit: An open-source tool for benchmarking active learning algorithms for crowdsourcing research. [paper][data] |
| Product, tweet, dog, face, duck, relevance, smile | Truth inference in crowdsourcing: Is the problem solved? [paper][data] Note that the tweet dataset is called sentiment in this source. It is different from the sentiment dataset in CrowdScale2013. |
| bird, rte, web, trec | Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. [paper][data] |

Citation
If you use this project in your research or work, please cite it using the following BibTeX entry:
```bibtex
@misc{wang2024dataset,
  title={A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment},
  author={Fei Wang and Haoyu Liu and Haoyang Bi and Xiangzhuang Shen and Renyu Zhu and Runze Wu and Minmin Lin and Tangjie Lv and Changjie Fan and Qi Liu and Zhenya Huang and Enhong Chen},
  year={2024},
  eprint={2403.08826},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
```
License
The NetEaseCrowd dataset is licensed under CC-BY-SA-4.0.
Owner
- Name: fuxiAIlab
- Login: fuxiAIlab
- Kind: organization
- Repositories: 29
- Profile: https://github.com/fuxiAIlab
Citation (CITATION.bib)
@misc{wang2024datasetvalidationtruthinference,
title={A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment},
author={Fei Wang and Haoyu Liu and Haoyang Bi and Xiangzhuang Shen and Renyu Zhu and Runze Wu and Minmin Lin and Tangjie Lv and Changjie Fan and Qi Liu and Zhenya Huang and Enhong Chen},
year={2024},
eprint={2403.08826},
archivePrefix={arXiv},
primaryClass={cs.HC},
url={https://arxiv.org/abs/2403.08826},
}