https://github.com/big-data-lab-umbc/chatgpt-comparison-detection
Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.1%) to scientific vocabulary
Last synced: 10 months ago
Repository
Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥
Basic Info
- Host: GitHub
- Owner: big-data-lab-umbc
- Default Branch: main
- Homepage: https://arxiv.org/abs/2301.07597
- Size: 27.3 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of Hello-SimpleAI/chatgpt-comparison-detection
Created about 3 years ago · Last pushed over 3 years ago
https://github.com/big-data-lab-umbc/chatgpt-comparison-detection/blob/main/
# ChatGPT-Comparison-Detection Project
Official repository of the paper ["How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection"](https://arxiv.org/abs/2301.07597).
Please star, watch, and fork our repo for active updates!
See also the [Feedback Space for Detectors](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/discussions/2); please feel free to leave your feedback there!
### About Us
We are a group of insignificant researchers (in the shadow of ChatGPT) hoping to do some significant work for the community.
The team for this project consists of PhD students and engineers from 6 universities/companies.

| | | | |
|:-:|:-:|:-:|:-:|
| [Biyang Guo](https://github.com/beyondguo) | [Minqi Jiang](https://github.com/Minqi824) | [Ziyuan Wang](https://github.com/SUFEHeisenberg) | [Xin Zhang](https://github.com/izhx) |
| [Jinran Nie](https://github.com/NJRBarry) | [Yuxuan Ding](https://github.com/yxding95) | [Jianwei Yue](https://github.com/TurquoiseA) | [Yupeng Wu](https://github.com/realRoc) |

Owner
- Name: Big Data Analytics Lab @ UMBC
- Login: big-data-lab-umbc
- Kind: organization
- Location: University of Maryland, Baltimore County
- Website: https://bdal.umbc.edu/
- Twitter: jianwuwang
- Repositories: 5
- Profile: https://github.com/big-data-lab-umbc
---
### Human ChatGPT Comparison Corpus (HC3)
Yes, we propose the first **Human vs. ChatGPT** comparison corpus, named **HC3**.
The first version of the HC3 datasets is now available on Hugging Face Datasets:
- [HC3-English](https://huggingface.co/datasets/Hello-SimpleAI/HC3)
- [HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese)
The HC3 datasets are also available on ModelScope:
- [HC3-English](https://www.modelscope.cn/datasets/simpleai/HC3)
- [HC3-Chinese](https://www.modelscope.cn/datasets/simpleai/HC3-Chinese)
> For the train/test splits and filtered versions used in the paper, refer to the Google Drive links in [HC3/README.md](HC3/README.md).
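
As a rough illustration of working with the corpus, the sketch below flattens one HC3-style record into labelled rows for training a detector. It is not code from this repository. The field names `question`, `human_answers`, and `chatgpt_answers`, the `"all"` configuration name, and the `"train"` split are assumptions that should be checked against the dataset card. `flatten_record` and `load_hc3_rows` are hypothetical helper names.

```python
from collections import Counter


def flatten_record(record):
    """Turn one HC3-style record into (question, answer, label) rows.

    Assumes the keys `question`, `human_answers`, and `chatgpt_answers`
    (an assumption; verify against the dataset card).
    """
    rows = []
    for answer in record.get("human_answers", []):
        rows.append((record["question"], answer, "human"))
    for answer in record.get("chatgpt_answers", []):
        rows.append((record["question"], answer, "chatgpt"))
    return rows


def load_hc3_rows(config="all", split="train"):
    """Download HC3 from the Hugging Face Hub and flatten it.

    Needs network access and `pip install datasets`. The config and
    split names are assumptions and may need adjusting.
    """
    from datasets import load_dataset

    dataset = load_dataset("Hello-SimpleAI/HC3", config, split=split)
    return [row for record in dataset for row in flatten_record(record)]


if __name__ == "__main__":
    # Hypothetical record, used only to show the flattening offline.
    example = {
        "question": "Why is the sky blue?",
        "human_answers": ["Short wavelengths scatter more in air."],
        "chatgpt_answers": ["The sky appears blue because of scattering."],
    }
    print(Counter(label for _, _, label in flatten_record(example)))
```

Keeping the download inside `load_hc3_rows` means the flattening logic can be tried on a hand-written record without network access.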
### Dataset Copyright
If a source dataset used in this corpus has a specific license that is stricter than CC-BY-SA, our products follow that license.
Otherwise, they follow the CC-BY-SA license.
| English Split | Source | Source License | Note |
|----------|-------------|--------|-------------|
| reddit_eli5 | [ELI5](https://github.com/facebookresearch/ELI5) | BSD License | |
| open_qa | [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419) | [PWC Custom](https://paperswithcode.com/datasets/license) | |
| wiki_csai | Wikipedia | CC-BY-SA | [Wiki FAQ](https://en.wikipedia.org/wiki/Wikipedia:FAQ/Copyright) |
| medicine | [Medical Dialog](https://github.com/UCSD-AI4H/Medical-Dialogue-System) | Unknown | [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| finance | [FiQA](https://paperswithcode.com/dataset/fiqa-1) | Unknown | Asking by |

| Chinese Split | Source | Source License | Note |
|----------|-------------|-----------|-------------|
| open_qa | [WebTextQA & BaikeQA](https://github.com/brightmart/nlp_chinese_corpus) | MIT license | |
| baike | Baidu Baike | None | |
| nlpcc_dbqa | [NLPCC-DBQA](https://github.com/msra-nlc/ChineseDBQA) | Unknown | [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| medicine | [Chinese Medical Dialogue](https://tianchi.aliyun.com/dataset/90163) | CC-BY-NC 4.0 | |
| finance | [FinanceZhidao](https://www.heywhale.com/mw/dataset/5e9588f8e7ec38002d0331b1/content) | CC-BY 4.0 | |
| psychology | [On Baidu AI Studio](https://aistudio.baidu.com/aistudio/datasetdetail/38489) | CC0 | |
| law | [LegalQA](https://github.com/siatnlp/LegalQA) | Unknown | [Asking](https://github.com/siatnlp/LegalQA/issues/2) |
---
### ChatGPT detectors
(Hosted on Hugging Face Spaces)
We provide three kinds of detectors, all bilingual (English and Chinese):
- [QA version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-qa): detects whether an **answer** to a given **question** was generated by ChatGPT, using PLM-based classifiers;
- [Single-text version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-single): detects whether a piece of text was generated by ChatGPT, using PLM-based classifiers;
- [Linguistic version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-ling): detects whether a piece of text was generated by ChatGPT, using linguistic features.
The detectors are also hosted on ModelScope:
- [QA version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-qa)
- [Single-text version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-single)
- [Linguistic version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-ling)
The model weights are all available on Hugging Face Models:

| Model Checkpoints | Comment |
|-----------------------|------------|
|[chatgpt-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta)|To detect a single piece of text|
|[chatgpt-qa-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta)|To detect a question-answer pair|
|[chatgpt-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta-chinese)|To detect a single piece of Chinese text|
|[chatgpt-qa-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta-chinese)|To detect a Chinese question-answer pair|

The English models are based on [roberta-base](https://huggingface.co/roberta-base).
The Chinese models are based on [hfl/chinese-roberta-wwm-ext](https://huggingface.co/hfl/chinese-roberta-wwm-ext).
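
The checkpoints above could be called through the `transformers` text-classification pipeline roughly as follows. This is a minimal sketch, not the code behind the hosted demos. The helper names `checkpoint_for`, `build_input`, and `detect` are hypothetical, and passing the question as `text` and the answer as `text_pair` is an assumption about how the QA detectors expect their input. The label names in the result depend on each checkpoint's configuration, so none are hard-coded here.

```python
# Checkpoint names come from the table above; the (kind, language) keys
# are an illustrative convention, not something the repository defines.
CHECKPOINTS = {
    ("single", "en"): "Hello-SimpleAI/chatgpt-detector-roberta",
    ("qa", "en"): "Hello-SimpleAI/chatgpt-qa-detector-roberta",
    ("single", "zh"): "Hello-SimpleAI/chatgpt-detector-roberta-chinese",
    ("qa", "zh"): "Hello-SimpleAI/chatgpt-qa-detector-roberta-chinese",
}


def checkpoint_for(kind, language="en"):
    """Return the Hub model id for a detector kind and language."""
    try:
        return CHECKPOINTS[(kind, language)]
    except KeyError:
        raise ValueError(f"no checkpoint for {kind!r}/{language!r}") from None


def build_input(text, question=None):
    """Build pipeline input: a plain string, or a text/text_pair dict for QA.

    Assumes the QA detectors take the question first and the answer second.
    """
    if question is None:
        return text
    return {"text": question, "text_pair": text}


def detect(text, question=None, language="en"):
    """Classify `text` (optionally as an answer to `question`).

    Needs network access and `pip install transformers torch`.
    """
    from transformers import pipeline

    kind = "single" if question is None else "qa"
    classifier = pipeline("text-classification", model=checkpoint_for(kind, language))
    return classifier(build_input(text, question))


if __name__ == "__main__":
    print(checkpoint_for("qa", "zh"))
    print(build_input("Because of scattering.", question="Why is the sky blue?"))
```

Calling `detect("some text")` would download the English single-text checkpoint on first use. Creating the pipeline once and reusing it is preferable when classifying many texts.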
---
### Important Dates
| Events | Dates |
|-----------------------|------------|
| Project Launch | 2022-12-09 |
| Comparison Data Collection | 2022-12-11 to Now |
| Release ChatGPT Detector (Demo) | 2023-01-11 |
| Models Release | 2023-01-18 |
| Comparison Corpus Release | 2023-01-18 |
| Research Paper | 2023-01-19 |
|...|...|
---
### Citation
Check out the paper: [arXiv:2301.07597](https://arxiv.org/abs/2301.07597)
```
@article{guo-etal-2023-hc3,
    title = "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection",
    author = "Guo, Biyang and
      Zhang, Xin and
      Wang, Ziyuan and
      Jiang, Minqi and
      Nie, Jinran and
      Ding, Yuxuan and
      Yue, Jianwei and
      Wu, Yupeng",
    journal = "arXiv preprint arXiv:2301.07597",
    year = "2023",
}
```
---
### Our Story...
On December 9, 2022, 10 days after the launch of [ChatGPT](https://openai.com/blog/chatgpt/), we started this project with two goals:
1. To create **open-source models** for efficiently detecting ChatGPT-generated content;
2. To collect a valuable **human-ChatGPT comparison Q&A corpus** to facilitate related research.

You are welcome to follow our project! We have released a preview of our ChatGPT detectors, and the **models and dataset will be open-sourced** in about a week. We look forward to receiving feedback from the community to help improve the models and to contribute to **open** academic research together :)