https://github.com/big-data-lab-umbc/chatgpt-comparison-detection

Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

https://github.com/big-data-lab-umbc/chatgpt-comparison-detection

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • â—‹
    CITATION.cff file
  • â—‹
    codemeta.json file
  • â—‹
    .zenodo.json file
  • â—‹
    DOI references
  • ✓
    Academic publication links
    Links to: arxiv.org
  • â—‹
    Academic email domains
  • â—‹
    Institutional organization owner
  • â—‹
    JOSS paper metadata
  • â—‹
    Scientific vocabulary similarity
    Low similarity (8.1%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of Hello-SimpleAI/chatgpt-comparison-detection
Created about 3 years ago · Last pushed over 3 years ago

https://github.com/big-data-lab-umbc/chatgpt-comparison-detection/blob/main/

# ChatGPT-Comparison-Detection Project 

![](https://img.shields.io/badge/Languages-%20English%2C%20Chinese-brightgreen) 
![](https://img.shields.io/badge/ChatGPT-Corpus%2C%20Detector-blue)

Official repository of paper ["How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection"](https://arxiv.org/abs/2301.07597). Please star, watch, and fork our repo for the active updates!

See also([ Feedback Space for Detectors](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/discussions/2) please feel free to leave your feedback here! )



image

---
### Human ChatGPT Comparison Corpus (HC3) / -ChatGPT 
Yes, we propose the first **Human vs. ChatGPT** comparison corpus, named **HC3**.

 **Human vs. ChatGPT** ,  **HC3**.

image

The first version of the HC3 datasets are now available on  Huggingface Datasets:
- [HC3-Engllish](https://huggingface.co/datasets/Hello-SimpleAI/HC3)
- [HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese)


HC3  ModelScope :
- [HC3-Engllish](https://www.modelscope.cn/datasets/simpleai/HC3)
- [HC3-Chinese](https://www.modelscope.cn/datasets/simpleai/HC3-Chinese)


> Train/Test splits & filtered versions of the paper, ref to Google Drive links in [HC3/README.md](HC3/README.md).

### Dataset Copyright

If the source datasets used in this corpus has a specific license which is stricter than CC-BY-SA, our products follow the same.
If not, they follow CC-BY-SA license.

| English Split       | Source | Source License | Note |
|----------|-------------|--------|-------------|
| reddit_eli5 | [ELI5](https://github.com/facebookresearch/ELI5)   | BSD License    |     |
| open_qa  | [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419)  | [PWC Custom](https://paperswithcode.com/datasets/license)   |      |
| wiki_csai   | Wikipedia | CC-BY-SA |   | [Wiki FAQ](https://en.wikipedia.org/wiki/Wikipedia:FAQ/Copyright) |
| medicine    | [Medical Dialog](https://github.com/UCSD-AI4H/Medical-Dialogue-System) | Unknown|  [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10)|
| finance     | [FiQA](https://paperswithcode.com/dataset/fiqa-1) | Unknown |  Asking by   |

| Chinese Split       | Source | Source License  | Note |
|----------|-------------|-----------|-------------|
| open_qa  | [WebTextQA & BaikeQA](https://github.com/brightmart/nlp_chinese_corpus) | MIT license |  |  |
| baike     | Baidu Baike  | None   |    |   |
| nlpcc_dbqa  | [NLPCC-DBQA](https://github.com/msra-nlc/ChineseDBQA) | Unknown |   [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| medicine    | [Chinese Medical Dialogue](https://tianchi.aliyun.com/dataset/90163) |  CC-BY-NC 4.0 | 
| finance     | [FinanceZhidao](https://www.heywhale.com/mw/dataset/5e9588f8e7ec38002d0331b1/content) | CC-BY 4.0 |  |
| psychology  | [On Baidu AI Studio](https://aistudio.baidu.com/aistudio/datasetdetail/38489) | CC0  | |
|law          | [LegalQA](https://github.com/siatnlp/LegalQA) | Unknown | [Asking](https://github.com/siatnlp/LegalQA/issues/2) |


---

### ChatGPT detectors / 
![image](https://user-images.githubusercontent.com/37113676/211677236-d7c028f5-b9a5-4d88-baee-8b86dc942ff7.png)
(Hosted on  Hugging Face Spaces)


We provide three kinds of detectors, all in Bilingual / :
- [QA version / ](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-qa): detect whether an **answer** is generated by ChatGPT for certain **question**, using PLM-based classifiers / ****ChatGPTPTM;
- [Sinlge-text version / ](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-single): detect whether a piece of text is ChatGPT generated, using PLM-based classifiers / ****ChatGPTPTM;
- [Linguistic version / ](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-ling): detect whether a piece of text is ChatGPT generated, using linguistic features / ****ChatGPT;


 modelscope :
- [QA version / ](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-qa)
- [Sinlge-text version / ](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-single)
- [Linguistic version / ](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-ling)


The model weights are all available at  Hugging Face Models:

| Model Checkpoints              | Comment      |
|-----------------------|------------|
|[chatgpt-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta)|To detect a single piece of text|
|[chatgpt-qa-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta)|To detect a question-answer pair|
|[chatgpt-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta-chinese)||
|[chatgpt-qa-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta-chinese)|QA|

The English models are based on [roberta-base](https://huggingface.co/roberta-base).
The Chinese models are based on [hfl/chinese-roberta-wwm-ext](https://huggingface.co/hfl/chinese-roberta-wwm-ext).


---

### Important Dates / :

| Events                | Dates      |
|-----------------------|------------|
| Project Launch /         | 2022-12-09  |
| Comparison Data Collection /         | 2022-12-11 to Now |
| Release ChatGPT Detector (Demo) /  Demo  | 2023-01-11 |
| Models Release /  | 2023-01-18 |
| Comparison Corpus Release /  | 2023-01-18 |
| Research Paper /  | 2023-01-19 |
|...|...|



---

### Citation

Checkout this paper [arxiv: 2301.07597](https://arxiv.org/abs/2301.07597)

```
@article{guo-etal-2023-hc3,
    title = "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection",
    author = "Guo, Biyang  and
      Zhang, Xin  and
      Wang, Ziyuan  and
      Jiang, Minqi  and
      Nie, Jinran  and
      Ding, Yuxuan  and
      Yue, Jianwei  and
      Wu, Yupeng",
    journal={arXiv preprint arxiv:2301.07597}
    year = "2023",
}
```



---
### Our Story... / 

On December 9, 2022, which is 10 days after the launch of [ChatGPT](https://openai.com/blog/chatgpt/), we started this project, for two purposes: 
1. To create some **open-source models** for efficiently detecting ChatGPT-generated content; 
2. To collect a valuable **human-ChatGPT comparison Q&A corpus**, to facilitate releated research.

2022  12  9  [ChatGPT](https://openai.com/blog/chatgpt/)  10 
1. **** ChatGPT 
2. ** ChatGPT **

Welcome to follow our project! We have released a preview of our ChatGPT detectors, and the **models, dataset will be open-sourced** in about a week. We look forward to receiving feedback from the community to help improve the models and make contributions to **open** academic research together:)
ChatGPT******** ### About Us / We are a group of insignificant researchers (in the shadow of ChatGPT) hoping to do some significant work for the community. The team for this projects consists of PhD students and engineers from 6 universities/companies.
ChatGPT 6/ | | | | | |:-:|:-:|:-:|:-:| | [Biyang Guo](https://github.com/beyondguo) | [Minqi Jiang](https://github.com/Minqi824) | [Ziyuan Wang](https://github.com/SUFEHeisenberg) | [Xin Zhang](https://github.com/izhx) | ||||| | [Jinran Nie](https://github.com/NJRBarry) | [Yuxuan Ding](https://github.com/yxding95) | [Jianwei Yue](https://github.com/TurquoiseA) | [Yupeng Wu](https://github.com/realRoc) | ||| | |

Owner

  • Name: Big Data Analytics Lab @ UMBC
  • Login: big-data-lab-umbc
  • Kind: organization
  • Location: University of Maryland, Baltimore County

GitHub Events

Total
Last Year