https://github.com/alixunxing/cluedatasetsearch

Search all Chinese NLP datasets, plus commonly used English NLP datasets.

Fork of CLUEbenchmark/CLUEDatasetSearch
Created almost 6 years ago · Last pushed over 6 years ago

https://github.com/alixunxing/CLUEDatasetSearch/blob/master/

# CLUEDatasetSearch
Chinese and English NLP datasets. [Search them online](https://www.cluebenchmarks.com/dataSet_search.html).



![gif](./scripts/git.gif)

- [NER](#ner)
- [QA](#qa)
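- [Sentiment Analysis](#sentiment-analysis)
- [Text Classification](#text-classification)
- [Text Matching](#text-matching)
- [Text Summarization](#text-summarization)
- [Machine Translation](#machine-translation)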
- [](#)
- [](#)
- [](#)
- [](#)

To add or update a dataset, please open an issue.
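
Because every table in this README uses the same ten-column pipe layout, the list can also be searched locally. The following is a minimal sketch, not part of the repository: the column labels in `COLUMNS` and the helper names `parse_rows` and `search` are assumptions chosen for illustration.

```python
import re

# Assumed labels for the ten columns; the README's own header cells may differ.
COLUMNS = ["id", "title", "date", "provider", "license",
           "description", "keywords", "category", "paper", "notes"]

# Matches a Markdown link and captures its text and its URL (up to whitespace).
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]*)[^)]*\)")


def parse_rows(markdown):
    """Yield one dict per data row of every pipe table in `markdown`."""
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        body = line[1:]
        if body.endswith("|"):
            body = body[:-1]
        cells = [c.strip() for c in body.split("|")]
        # Header and separator rows are skipped: data rows start with a numeric ID.
        if len(cells) != len(COLUMNS) or not cells[0].isdigit():
            continue
        row = dict(zip(COLUMNS, cells))
        match = LINK.search(row["title"])
        row["name"] = match.group(1) if match else row["title"]
        row["url"] = match.group(2) if match else ""
        yield row


def search(rows, term):
    """Return the rows in which any cell contains `term` (case-insensitive)."""
    term = term.lower()
    return [r for r in rows if any(term in v.lower() for v in r.values())]


sample = """
| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 7 | [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) | 2003 | CNTS - Language Technology Group |  |  | CoNLL-2003 |  |  |  |
| 2 | [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) |  |  |  |  |  | QA |  |  |
"""
rows = list(parse_rows(sample))
print([r["name"] for r in search(rows, "squad")])  # ['SQuAD']
```

Rows whose cell count differs from ten are skipped rather than guessed at, so a malformed row drops out of the results instead of shifting columns.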




# NER

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | --------- | -------------------------------------- | ---- | ------------------------------------------------------------ | ------------ | ------------ | ----------------------------------------------------- | ---- |
| 1    | [CCKS2017](https://biendata.com/competition/CCKS2017_2/data/) | 20175 |     |      | 800  |      |  | \                                                     |  |
| 2    | [CCKS2018](https://biendata.com/competition/CCKS2018_1/data/) | 2018    |              |      |  CCKS2018600  |      |  | \                                                     |  |
| 3    | [MSRA](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | \         | MSRA                                   |      |  MSRABIO46365                | Msra         |  | \                                                     |  |
| 4    | [1998](https://github.com/ThunderingII/nlp_ner/tree/master/data) | 19981 |                                |      |  98BIO23061        | 98   |  | \                                                     |  |
| 5    | [Boson](https://github.com/TomatoTang/BILSTM-CRF)            | \         |                                |      |  BosonBMEO,2000                | Boson        |  | \                                                     |  |
| 6    | [CLUE Fine-Grain NER](https://storage.googleapis.com/cluebenchmark/tasks/cluener_public.zip) | 2020    | CLUE                                   |      |  CLUENER2020THUCTCSina News RSS10107481343  | CLUE |  | \                                                     |  |
| 7    | [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) | 2003      | CNTS - Language Technology Group       |      |  CoNLL-2003; entity types PER, LOC, ORG, MISC  | CoNLL-2003   |  | [paper](https://www.aclweb.org/anthology/W03-0419.pdf) |  |
| 8    | [golden-horse](https://github.com/hltcoe/golden-horse)       | 2015    | https://github.com/hltcoe/golden-horse |      |                                                                | EMNLP-2015   |  |                                                       |      |
| 9    | [SIGHAN Bakeoff 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) | 2005    | MSR/PKU                                |      |                                                                | bakeoff-2005 |  |                                                       |      |
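
Several of the corpora above are distributed with BIO-style tags. As a rough illustration of how such tags map back to entities, here is a minimal sketch that assumes parallel lists of tokens and tags with labels of the form `B-TYPE`, `I-TYPE` and `O`; the function name `bio_to_spans` and the example labels are hypothetical, and each dataset's actual file layout and label set should be checked before use.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any span that is still open.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span, closes it.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, "".join(current_tokens)))
    return spans


tokens = ["张", "三", "在", "上", "海"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('PER', '张三'), ('LOC', '上海')]
```

Tokens are joined without a separator because the example is character-level Chinese; for whitespace-tokenized English data such as CoNLL-2003, joining with a space would be the natural change.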

# QA

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | --------- | ------------ | ---- | ------------------------------------------------------------ | ------ | ---- | ------------------------------------------------------------ | ---- |
| 1 | [NewsQA](https://github.com/Maluuba/newsqa) | 2019/9/13 |  |  | Maluuba NewsQA12000120,00061623 |  | QA | [paper](https://arxiv.org/abs/1611.09830) |  |
| 2 | [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) |  |  |  | SQuAD |  | QA | [paper](https://arxiv.org/abs/1606.05250) |  |
| 3 | [SimpleQuestions](https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz) |  | Facebook |  | , 100K |  | QA | [paper](https://arxiv.org/pdf/1506.02075v1.pdf) |  |
| 4 | [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F4495da01-db8c-4041-a7f6-7984a4f6a905%2Fdefault.aspx) | 2016/7/14 |  |  | WikiQABing3047292581473 |  | QA | [paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F252176%2Fyangyihmeek_emnlp-15_wikiqa.pdf) |  |
| 5 | [cMedQA](https://github.com/zhangsheng93/cMedQA) | 2019/2/25 | Zhang Sheng |  | 5.410 |  | QA | [paper](https://www.mdpi.com/2076-3417/7/8/767) |  |
| 6 | [cMedQA2](https://github.com/zhangsheng93/cMedQA2) | 2019/1/9 | Zhang Sheng |  | cMedQA1020 |  | QA | [paper](https://ieeexplore.ieee.org/abstract/document/8548603) |  |
| 7 | [webMedQA](https://github.com/hejunqing/webMedQA) | 2019/3/10 | He Junqing |  | 631 |  | QA | [paper](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-0761-8) |  |
| 8 | [XQA](https://github.com/thunlp/XQA) | 2019/7/29 |  |  | 9 |  | QA | [paper](https://www.aclweb.org/anthology/P19-1227) |  |
| 9 | [AmazonQA](https://github.com/amazonqa/amazonqa) | 2019/9/29 |  |  | QAQA |  | QA | [paper](https://arxiv.org/pdf/1908.04364v1.pdf) |  |

# Sentiment Analysis

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | -------- | ------------------------------------ | ---- | ------------------------------------------------------------ | ---------------------------- | -------- | ------------------------------------------------------------ | ---- |
| 1    | [NLPCC2013](http://tcci.ccf.org.cn/conference/2013/pages/page04_tdata.html) | 2013     | CCF                                  | \    |  7 emotions: like, disgust, happiness, sadness, anger, surprise, fear14 000 , 45 431  | NLPCC2013, Emotion           |  | [paper](http://jcip.cipsc.org.cn/CN/article/downloadArticleFile.do?attachType=PDF&id=143) |      |
| 2    | [NLPCC2014 Task1](http://tcci.ccf.org.cn/conference/2014/pages/page04_ans.html) | 2014     | CCF                                  | \    |  7 emotions: like, disgust, happiness, sadness, anger, surprise, fear 20000  | NLPCC2014, Emotion           |  | \                                                            |      |
| 3    | [NLPCC2014 Task2](http://tcci.ccf.org.cn/conference/2014/pages/page04_tdata.html) | 2014     | CCF                                  | \    |                                      | NLPCC2014, Sentiment         |  | \                                                            |      |
| 4    | [Weibo Emotion Corpus](https://github.com/MingleiLI/emotion_corpus_weibo) | 2016     | The Hong Kong Polytechnic University | \    |  7 emotions: like, disgust, happiness, sadness, anger, surprise, fear   | weibo emotion corpus         |  | [Emotion Corpus Construction Based on Selection from Noisy Natural Labels](http://www.lrec-conf.org/proceedings/lrec2016/pdf/515_Paper.pdf) |      |
| 5    | RenCECPs | 2009     | Fuji Ren                             | \    |  emotionsentiment15001100035000  | RenCECPs, emotion, sentiment |  | [Construction of a blog emotion corpus for Chinese emotional expression analysis](https://dl.acm.org/doi/10.5555/1699648.1699691) | Contact Fuji Ren (ren@is.tokushima-u.ac.jp) for a license agreement. |
| 6    | [weibo_senti_100k](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb) |      |                                  | \    |    5                     | weibo senti, sentiment       |  | \                                                            |      |
| 7    | [BDCI2018-](https://www.datafountain.cn/competitions/310/datasets) | 2018     | CCF                                  |      |  301-1  |      |  | \                                                            |      |
| 8    | [AI Challenger](https://blog.csdn.net/linxid/article/details/82764682) | 2018     |                                  | \    |  620  |                  |  | \                                                            |      |
| 9    | [BDCI2019](https://www.datafountain.cn/competitions/353) | 2019     |                              | \    |    |                  |  | \                                                            |      |
| 10   | [](https://zhejianglab.aliyun.com/entrance/231731/introduction?spm=5176.12281949.1003.3.2b58c341YnOFck) | 2019     |                            | \    |  {}4  |                  |  | \                                                            |      |
| 11   | [2019](https://biendata.com/competition/sohu2019/) | 2019     |                                  | \    |    |                  |  | \                                                            |      |

# Text Classification

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | ------------- | -------------------------------------------------------- | ---------------- | ------------------------------------------------------------ | ---------------- | -------- | -------- | ---- |
| 1    | [2018](https://www.pkbigdata.com/common/cmpt/ _.html) | 20187     |                                                  |                  |  idarticleword_segclass19102275  |      |  | \        |  |
| 2    | [toutiao-text-classfication-dataset](https://github.com/skdjfla/toutiao-text-classfication-dataset) | 20185     |                                                  |                  |  15382688  |      |  | \        |  |
| 3    | [THUCNews](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) | 2016        |                                                  |                  | THUCNewsRSS2005~2011UTF-814742.19 GB |        |  | \        |  |
| 4    | [](https://www.kesci.com/home/dataset/5d3a9c86cf76a600360edd04) | \             |  |                  |  209804  |        |  | \        |  |
| 5    | [](https://www.kesci.com/home/dataset/5dd645fca0cb22002c94e65d/files) | 201912    | chenfengshf                                              | CC0  |  Kesci(length<50)1538w  |  |  | \        |  |
| 6    | [2017 ](https://biendata.com/competition/zhihu/) | 20176     | ;                                    |                  |   1 1999  300   |      |  | \        |  |
| 7    | [2019-](https://zhejianglab.aliyun.com/entrance/231731/information) | 20198     |                                                |                  |  {}  |      |  | \        |  |
| 8    | [IFLYTEK' ](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) | \             |                                                  |                  |  1.7app119  |            |  | \        |  |
| 9    | [(SogouCA)](http://www.sogou.com/labs/resource/ca.php) | 2012816 |                                                      |                  |  20126718  |              |  | \        |  |
| 10   | [(SogouCS)](http://www.sogou.com/labs/resource/cs.php) | 20128     |                                                      |                  |  20126718  |              |  | \        |  |
| 11   | [](http://www.nlpir.org/?action-viewnews-itemid-145) | 201711    |                    |                  |                            |              |          |          |      |
| 12   | [ChnSentiCorp_htl_all](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  7000 5000 2000    |                  |          |          |      |
| 13   | [waimai_10k](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |   4000   8000        |                  |          |          |      |
| 14   | [online_shopping_10_cats](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  10  6  3    |                  |          |          |      |
| 15   | [weibo_senti_100k](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  10   5          |                  |          |          |      |
| 16   | [simplifyweibo_4_moods](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  36   4   20  5   |                  |          |          |      |
| 17   | [dmsc_v2](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  28  70   200  /           |                  |          |          |      |
| 18   | [yf_dianping](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  24 54 440 /                 |                  |          |          |      |
| 19   | [yf_amazon](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets) | 20183     | https://github.com/SophonPlus/ChineseNlpCorpus           |                  |  52 1100 142 720 /  |                  |          |          |      |

# Text Matching

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | -------------- | --------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -------------------------- | ------------------------------------------------------------ | ---- |
| 1    | [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)      | 2018/6/6       | ()                  | Creative Commons Attribution 4.0 International License |  26006810238766880212500  |                                      |        | [paper](https://www.aclweb.org/anthology/C18-1166)            |      |
| 2    | [The BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html) | 2018/9/4       | ()        |                                                        |  1200001:1  |                                        |  | [paper](https://www.aclweb.org/anthology/D18-1536/)           |      |
| 3    | [AFQMC ](https://dc.cloud.alipay.com/index?click_from=MAIL&_bdType=acafbbbiahdahhadhiih#/topic/intro?id=3) | 2018/4/25      |                                       |                                                        | 10 |                                                      |        |                                                              |      |
| 4    | [](https://ai.ppdai.com/mirror/goToMirrorDetail?mirrorId=1) | 2018/6/10      |                           |                                                        |  train.csv3label12101q12q2question.csv  |                                                      |        |                                                              |      |
| 5    | [CAIL2019](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | 2019/6         |                       |                                                        |  (A,B,C)A,B,CABABsim(A,B)>sim(A,C)  |                                            |                  |                                                              |      |
| 6    | [CCKS 2018 ](https://biendata.com/competition/CCKS2018_3/data/) | 2018/4/5       | ()        |                                                        |                                                                |                                        |        |                                                              |      |
| 7    | [ChineseTextualInference](https://github.com/liuhuanyong/ChineseTextualInference) | 2018/12/15     |                   |                                                        |  ,88,  | NLI                                                      |      |                                                              |      |
| 8    | [NLPCC-DBQA](https://biendata.com/ccf_tcci2018/datasets/tcci_tag/11) | 2016/2017/2018 | NLPCC                                         |                                                        |  -10  | DBQA                                                         |                    |                                                              |      |
| 9    | [](https://www.datafountain.cn/competitions/359) | 201/8/32       | CCF                                           |                                                        |     |                                        |                  |                                                              |      |
| 10   | [CNSD / CLUE-CMNLI](https://github.com/zengjunjun/CNSD)      | 2019/12        | ZengJunjun                                    |                                                        |    | NLI                                                      |            | [paper](https://6a75-junzeng-uxxxm-1300734931.tcb.qcloud.la/CNSD.pdf?sign=401485f4d6f256393a264e68464ca4ae&t=1578114336) |      |
| 11   | [cMedQA v1.0](https://github.com/zhangsheng93/cMedQA)        | 2017/4/5       |     |                                                        |    50,00094,134120212 2,0003774117212 2,0003835119211 54,000101,743119212  |                                                  |                    | [paper](https://www.mdpi.com/2076-3417/7/8/767)               |      |
| 12   | [cMedQA2](https://github.com/zhangsheng93/cMedQA2)           | 2018/11/8      |     |                                                        |    100,000188,49048101 4,000752749101 4,000755249100 108,000203,56949101  |                                                  |                    | [paper](https://www.mdpi.com/2076-3417/7/8/767)               |      |
| 13   | [ChineseSTS](https://github.com/IAdmireu/ChineseSTS)         | 2017/9/21      | , , .           |                                                        |  12747   |                                                |                  |                                                              |      |
| 14   | [  ](https://biendata.com/competition/chip2018/) | 2018           | CHIP 2018-CHIP  |                                                        |      20000  10000label>   |                                            |                  |                                                              |      |
| 15   | [COS960: A Chinese Word Similarity Dataset of 960 Word Pairs](https://github.com/thunlp/COS960) | 2019/6/6       |                                       |                                                        |  960 15 960 480240240  |                                              |                      | [paper](https://arxiv.org/abs/1906.00247)                     |      |
| 16   | [OPPO query-title](https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw) (access code: 7p3n) | 2018/11/6      | OPPO                                          |                                                        |  OPPO,   query-titlectr  |  ctr                                       |                  |                                                              |      |
| 17   | [(SogouE)](https://www.sogou.com/labs/resource/e.php) | 2012         |                                           |                              |  URL ]\tURL\t URL 12  |  |            | [Automatic Search Engine Performance Evaluation with Click-through Data Analysis](https://www.sogou.com/labs/paper/Automatic_Search_Engine_Performance_Evaluation_with_Click-through_Data_Analysis.pdf) |      |
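
The CAIL2019 entry above describes triplets (A, B, C) together with the relation sim(A,B) > sim(A,C). One way to score a similarity function against such triplets is sketched below. This is an illustration under stated assumptions only: `char_jaccard` is a toy stand-in similarity, not a CAIL baseline, and `triplet_accuracy` is a hypothetical helper name.

```python
def char_jaccard(a, b):
    """Toy similarity: Jaccard overlap of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def triplet_accuracy(triplets, sim=char_jaccard):
    """Fraction of (A, B, C) triplets for which sim(A, B) > sim(A, C)."""
    if not triplets:
        return 0.0
    correct = sum(1 for a, b, c in triplets if sim(a, b) > sim(a, c))
    return correct / len(triplets)


# Made-up strings, used only to exercise the function.
triplets = [
    ("abcd", "abce", "wxyz"),  # B shares characters with A, C does not
    ("abcd", "wxyz", "abce"),  # the reverse case
]
print(triplet_accuracy(triplets))  # 0.5
```

Any function that takes two texts and returns a number can be passed as `sim`, so the same loop works for embedding-based or model-based similarity.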

# Text Summarization

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | ---------- | ----------------------------------------------------- | ------ | ------------------------------------------------------------ | ------------------------------ | -------- | ----------------------------------------------------- | ---------- |
| 1    | [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html)      | 2015/8/6   | Qingcai Chen                                          |        |  10,6661-5  |  |  | [paper](http://arxiv.org/abs/1506.05865)               |            |
| 2    | [](https://www.jianshu.com/p/8f52352f0748?tdsourcetag=s_pcqq_aiomsg) | 2018/6/20  | He Zhengfang                                          |        |  679898        |              |  | \                                                     |            |
| 3    | [chinese_abstractive_corpus](https://github.com/wonderfulsuccess/chinese_abstractive_corpus) | 2018/6/5   |                                                   |        |  24500  |            |  | \                                                     |            |
| 4    | [NLPCC2017 Task3](http://tcci.ccf.org.cn/conference/2017/taskdata.php) | 2017/11/8  | NLPCC2017                                       |        |  NLPCC2017  |                |  | \                                                     |            |
| 5    | [2018](https://www.dcjingsai.com/common/cmpt/2018_.html) | 2018/10/11 | DC                                          |        |  DC  |                |  | \                                                     |            |
| 6    | [Byte Cup 2018](http://biendata.com/competition/bytecup2018/data/) | 2018/12/4  |                                               |        |  TopBuzz 130  1000   800    |          |  | \                                                     |        |
| 7    | [NEWSROOM](https://summari.es/)                              | 2018/6/1   | Grusky                                                |        |  1998201738130  |    |  | [paper](http://aclweb.org/anthology/N18-1065)          |        |
| 8    | [DUC](https://duc.nist.gov/) / [TAC](https://tac.nist.gov/)  | 2014/9/9   | NIST                                                  |        |  Document Understanding Conferences/Text Analysis ConferenceTAC KBPTAC Knowledge Base Population  | /        |  | \                                                     |        |
| 9    | [CNN/Daily Mail](https://cs.nyu.edu/~kcho/DMQA/)             | 2017/7/31  | Stanford                                              | GNU v3 |  CNN(DailyMail)  |        |  | [paper](https://arxiv.org/pdf/1704.04368.pdf)          |        |
| 10   | [Amazon SNAP Review](https://snap.stanford.edu/data/web-Amazon.html) | 2013/3/1   | Stanford                                              |        |  Amazon  |            |  | \                                                     |        |
| 11   | [Gigaword](https://github.com/harvardnlp/sent-summary)       | 2003/1/28  | David Graff, Christopher Cieri                        |        |  950w   |                |  |                                                       |        |
| 12   | [RA-MDS](http://www1.se.cuhk.edu.hk/~textmine/dataset/ra-mds/) | 2017/9/11  | Piji Li                                               |        |  Reader-Aware Multi-Document Summarization451042725  |      |  | [paper](http://lipiji.com/docs/li2017ramds.pdf)        |        |
| 13   | [TIPSTER SUMMAC](https://www-nlpir.nist.gov/related_projects/tipster_summac/cmp_lg.html) | 2003/5/21  | The MITRE Corporation and the University of Edinburgh |        |  183Computation and Language (cmp-lg) collectionACL  |              |  | \                                                     |        |
| 14   | [WikiHow](http://www.wikihow.com/)                           | 2018/10/18 | Mahnaz Koupaee                                        |        |  200,000  |              |  | [paper](https://arxiv.org/abs/1810.09305)              |        |
| 15   | [Multi-News](https://github.com/Alex-Fabbri/Multi-News)      | 2019/12/4  | Alex Fabbri                                           |        |  1500newser.com56,216  |                      |  | [paper](http://arxiv.org/abs/1906.01749)               |        |
| 16   | [MED Summaries](http://lear.inrialpes.fr/people/potapov/med_summaries) | 2018/8/17  | D.Potapov                                             |        |  1606010010  |            |  | [paper](http://hal.inria.fr/hal-01022967)              |        |
| 17   | [BIGPATENT](http://arxiv.org/abs/1906.03741)                 | 2019/7/27  | Sharma                                                |        |  130  |        |  | [paper](http://arxiv.org/abs/1906.03741)               |        |
| 18   | [NYT](https://catalog.ldc.upenn.edu/LDC2008T19) | 2008/10/17 | Evan Sandhaus                                         |        |  The New York Times,150,20091120101  |            |  | \                                                     |        |
| 19   | [The AQUAINT Corpus of English News Text](https://catalog.ldc.upenn.edu/LDC2002T31) | 2002/9/26  | David Graff                                           |        |  ()3.75  |                |  | \                                                     |  |
| 20   | [Legal Case Reports Data Set](https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports) | 2012/10/19 | Filippo Galgani                                       |        |  2006-2009(FCA)4000  |            |  | \                                                     |        |
| 21   | [17 Timelines](http://www.l3s.de/~gtran/timeline/)           | 2015/5/29  | G. B. Tran                                            |        |    |                |  | [paper](http://l3s.de/~gtran/publications/www2013.pdf) |      |
| 22   | [PTS Corpus](https://github.com/FeiSun/ProductTitleSummarizationCorpus) | 2018/10/9  | Fei Sun                                               |        |  Product Title Summarization Corpus  |              |  | [paper](https://arxiv.org/abs/1808.06885)              |            |
| 23   | [Scientific Summarization DataSets](https://github.com/Santosh-Gupta/ScientificSummarizationDataSets) | 2019/10/26 | Santosh Gupta                                         |        |  Semantic Scholar CorpusArXivSemantic Scholar/580ArXiv1991201975/10k26k417k157CS221k  |                |  | \                                                     |        |
| 24   | [Scientific Document Summarization Corpus and Annotations from the WING NUS group](https://github.com/WING-NUS/scisumm-corpus) | 2019/3/19  | Jaidka                                                |        |  ACL:()()40  |                |  | [paper](http://www.aclweb.org/anthology/W16-1511.pdf)  |        |

# Machine Translation

| ID | Title | Date | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | --------- | ------------------------------------------------------------ | ---------------------------------- | ------------------------------------------------------------ | -------------------------------- | ------------- | ------------------------------------------------------------ | --------------------------------------- |
| 1    | [WMT2017](http://www.statmt.org/wmt17/translation-task.html) | 2017/2/1  | EMNLP 2017 Workshop on Machine Translation                  |                                    |    Europarl corpusUN corpus 2017News Commentary corpus  EMNLP  benchmark  | Benchmark, WMT2017               |   | [paper](https://www.statmt.org/wmt17/pdf/WMT17.pdf)           |                                         |
| 2    | [WMT2018](http://statmt.org/wmt18/translation-task.html#download) | 2018/11/1 | EMNLP 2018 Workshop on Machine Translation                  |                                    |   Europarl corpusUN corpus 2018News Commentary corpus  EMNLP  benchmark | Benchmark, WMT2018               |   | [paper](http://www.statmt.org/wmt18/)                         |                                         |
| 3    | [WMT2019](http://www.statmt.org/wmt19/translation-task.html) | 2019/1/31 | EMNLP 2019 Workshop on Machine Translation                  |                                    |   Europarl corpusUN corpus,  news-commentary corpus  and the ParaCrawl corpus  | Benchmark, WMT2019               |   | [paper](http://www.statmt.org/wmt19/pdf/53/WMT01.pdf)         |                                         |
| 4    | [UM-Corpus: A Large English-Chinese Parallel Corpus](http://nlp2ct.cis.umac.mo/um-corpus/) | 2014/5/26 | Department of Computer and Information Science, University of Macau, Macau |                                    |                      | UM-Corpus; English; Chinese; large |   | [paper](http://www.lrec-conf.org/proceedings/lrec2014/pdf/774_Paper.pdf) |                                         |
| 5    | [AI Challenger translation 2017](https://pan.baidu.com/s/1E5gD5QnZvNxT3ZLtxe_boA) (access code: stjf) | 2017/8/14 |   AI               |                                    |   1000       10,000,000  934  8000   | AI challenger 2017               |   |                                                              |                                         |
| 6    | [MultiUN](http://opus.nlpl.eu/download.php?f=MultiUN/v1/tmx/en-zh.tmx.gz) | 2010      | Department of Linguistics  and Philology Uppsala  University, Uppsala/Sweden |                                    |      | MultiUN                          |   | [MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010](http://www.dfki.de/lt/publication_show.php?id=4790) |                                         |
| 7    | [NIST 2002 Open Machine Translation (OpenMT) Evaluation](https://catalog.ldc.upenn.edu/LDC2010T10) | 2010/5/14 | NIST Multimodal Information Group                            | LDC User Agreement for Non-Members |  Xinhua 70 Zaobao30100 212707 Xinhua25247 Zaobao39256  | NIST                             |   | [](http://www.lrec-conf.org/proceedings/lrec2018/pdf/678.pdf) |   |
| 8    | [The Multitarget TED Talks Task (MTTT)](http://cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/) | 2018      | Kevin Duh, JHU                                               |                                    |  TED20  | TED                              |   | The Multitarget TED Talks Task                               |                                         |
| 9    | [ASPEC Chinese-Japanese](http://lotus.kuee.kyoto-u.ac.jp/WAT/) | 2019      | Workshop on Asian Translation                                |                                    |      | Asian scientific patent Japanese |   | http://lotus.kuee.kyoto-u.ac.jp/WAT/                         |                                         |
| 10   | [casia2015](http://nlp.nju.edu.cn/cwmt-wmt/)                 | 2015      | Research group at the Institute of Automation, Chinese Academy of Sciences |                                    |                    | casia CWMT 2015                  |   |                                                              |                                         |
| 11   | [casict2011](http://nlp.nju.edu.cn/cwmt-wmt/)                | 2011      | Research group at the Institute of Computing Technology, Chinese Academy of Sciences |                                    |  2 12 90  | casict CWMT 2011                 |   |                                                              |                                         |
| 12   | [casict2015](http://nlp.nju.edu.cn/cwmt-wmt/)                | 2015      | Research group at the Institute of Computing Technology, Chinese Academy of Sciences |                                    |  20060 20/20 99  | casict CWMT 2015                 |   |                                                              |                                         |
| 13   | [datum2015](http://nlp.nju.edu.cn/cwmt-wmt/)                 | 2015      | Datum Data Co., Ltd.                                         |                                    |        | datum CWMT 2015                  |   |                                                              |                                         |
| 14   | [datum2017](http://nlp.nju.edu.cn/cwmt-wmt/)                 | 2017      | Datum Data Co., Ltd.                                         |                                    |  20 50,000  10Book1-Book10  | datum CWMT 2017                  |   |                                                              |                                         |
| 15   | [neu2017](http://nlp.nju.edu.cn/cwmt-wmt/)                   | 2017      | NLP lab of Northeastern University, China                    |                                    |  200 90  | neu CWMT 2017                    |   |                                                              |                                         |
| 16   | [translation2019zh](https://github.com/brightmart/nlp_chinese_corpus) | 2019      |                                                          |                                    |      |                                  |               |                                                              |                                         |
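
The MultiUN link above points to a gzipped TMX file (`en-zh.tmx.gz`). As a rough sketch of how such a file could be read into sentence pairs with the Python standard library: the code below assumes the usual TMX layout (`<tu>` units holding `<tuv xml:lang="...">` elements with a `<seg>` child) and the language codes `en` and `zh`. The function name `iter_tmx_pairs` and the inline sample are illustrative only and are not part of any dataset's tooling.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(fileobj, src="en", tgt="zh"):
    """Yield (source, target) sentence pairs from a TMX byte stream."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]
        elem.clear()  # release the unit; real corpora can be very large


# Tiny made-up sample in the assumed layout (not taken from any corpus).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>
<tuv xml:lang="zh"><seg>你好</seg></tuv></tu>
</body></tmx>"""

pairs = list(iter_tmx_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))

# For a real download (file name assumed from the link above):
# with gzip.open("en-zh.tmx.gz", "rb") as f:
#     for en, zh in iter_tmx_pairs(f):
#         ...
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters for corpora of this size; check the actual language codes in the downloaded file before relying on the defaults.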

# Knowledge Graph

| ID   | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | --------- | ---------------------------------------------- | ---- | ------------------------------------------------------------ | ------ | ---- | -------- | ---- |
| 1    | [NLPIR100](http://www.nlpir.org/wordpress/download/weibo_relation_corpus.rar) | 2017/12/2 |  |      |  NLPIR 1.NLPIR(127.0.0.1/wordpress)100010 2.urlEmailkevinzhang@bit.edu.cn 3.NLPIR(http://www.nlpir.org/) 4. person_id  id guanzhu_id id  |        |      |          |      |

# Corpus

| ID   | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | ---------- | ---------------------------------------------- | ---- | ------------------------------------------------------------ | ------ | ---- | -------- | ---- |
| 1    | [NLPIR Weibo Content Corpus (230,000 entries)](http://www.nlpir.org/wordpress/2017/12/03/nlpir%e5%be%ae%e5%8d%9a%e5%86%85%e5%ae%b9%e8%af%ad%e6%96%99%e5%ba%93-23%e4%b8%87%e6%9d%a1/) | 2017/12 |  |      |  NLPIR 1.NLPIR(127.0.0.1/wordpress)231000 2.urlEmailkevinzhang@bit.edu.cn 3.NLPIR(http://www.nlpir.org/) 4. id   article   discuss   insertTime  origin   person_id id time   transmit   |        |      |          |      |
| 2    | [500](http://www.nlpir.org/wordpress/download/weibo.7z) | 2018/1 |  |      |  500@ICTCLAS 500sqlmysql500      |        |      |          |      |
| 3    | [NLPIR-2400](http://www.nlpir.org/wordpress/download/NLPIR-news-corpus.rar) | 2017/7 | [www.NLPIR.org](http://www.nlpir.org/)         |      |  NLPIR   1.48MB2400 2.2009101220091214 3. 4. 5.www.NLPIR.org 6. NLPIR.org  |        |      |          |      |
| 4    | [NLPIR100](http://www.nlpir.org/wordpress/download/weibo_relation_corpus.rar) | 2017/12 |  |      |  NLPIR 1.NLPIR(127.0.0.1/wordpress)100010 2.urlEmailkevinzhang@bit.edu.cn 3.NLPIR(http://www.nlpir.org/) 4. person_id  id guanzhu_id id  |        |      |          |      |
| 5    | [NLPIR Weibo Blogger Corpus (1,000,000 entries)](http://www.nlpir.org/wordpress/2017/09/02/nlpir%e5%be%ae%e5%8d%9a%e5%8d%9a%e4%b8%bb%e8%af%ad%e6%96%99%e5%ba%93100%e4%b8%87%e6%9d%a1/) | 2017/9 |  |      |  NLPIR 1.NLPIR(127.0.0.1/wordpress)1001 2.urlEmailkevinzhang@bit.edu.cn 3.NLPIR(http://www.nlpir.org/) 4. id  id sex   address   fansNum   summary   wbNum   gzNum    blog   edu   work   renZh   brithday   |        |      |          |      |
| 6    | [NLPIR Short Text Corpus (400,000 characters)](http://www.nlpir.org/wordpress/2017/08/12/nlpir%e7%9f%ad%e6%96%87%e6%9c%ac%e8%af%ad%e6%96%99%e5%ba%93-40%e4%b8%87%e5%ad%97/) | 2017/8 |  (SMS@BIT) |      |  NLPIR   1.488704 2.www.NLPIR.org 3.  |        |      |          |      |
| 7    | [Chinese Wikipedia dumps (zhwiki)](https://dumps.wikimedia.org/zhwiki/) | \          |                                        |      |                                    |        |      |          |      |
| 8    | [chinese-poetry](https://github.com/chinese-poetry/chinese-poetry) | 2020     | GitHub, http://shici.store             |      |                                                                |        |      |          |      |
| 9    | [insuranceqa-corpus-zh](https://github.com/chatopera/insuranceqa-corpus-zh) | 2017     |                                                |      |  Insurance Library   QA         label""""  |        |      |          |      |
| 10   | [chaizi](https://github.com/kfcd/chaizi)         | 19057  |                                                |      |  17,803 chaizi-ft.txt chaizi-jt.txt    |        |      |          |      |
| 11   | [nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus) | 2016     |                                            |      |        |        |      |          |      |
| 12   | [baike2018qa (JSON)](https://github.com/brightmart/nlp_chinese_corpus) | 2018     |                                            |      |      |        |      |          |      |
| 13   | [webtext2019zh (JSON)](https://github.com/brightmart/nlp_chinese_corpus) | 2019     |                                            |      |  1  2()  3(cQA)      4  5  |        |      |          |      |
| 14   | [wiki2019zh (JSON)](https://github.com/brightmart/nlp_chinese_corpus) | 2019     |                                            |      |  wiki  |        |      |          |      |

# Reading Comprehension

| ID   | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
| ---- | ------------------------------------------------------------ | ---------- | --------------------------- | ---------------------- | ------------------------------------------------------------ | -------------------------- | ----------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| 1    | [DuReader](http://ai.baidu.com/broad/download?dataset=dureader) | 2018/3/1   |                         | Apache-2.0             |  5  |  |                   | [Paper](https://arxiv.org/abs/1711.05073)                |                                                              |
| 2    | [CJRC](https://github.com/china-ai-law-challenge/CAIL2019) | 2019/8/17  | HFL | \                      |  10,00050,000  |      |                   | [Paper](https://link.springer.com/chapter/10.1007/978-3-030-32381-3_36) |                                                              |
| 3    | [CMRC 2019](https://github.com/ymcui/cmrc2019) | 2019/10 | HFL | CC-BY-SA-4.0           |     |        |                   | \                                                            | https://hfl-rc.github.io/cmrc2019/                 |
| 4    | [CMRC 2018](https://github.com/ymcui/cmrc2018) | 2018/10/19 | HFL | CC-BY-SA-4.0           |  CMRC 2018 20,000  |  |                   | [Paper](https://www.aclweb.org/anthology/D19-1600/)      | https://hfl-rc.github.io/cmrc2018/                 |
| 5    | [CMRC 2017](https://github.com/ymcui/Chinese-Cloze-RC) | 2017/10/14 | HFL | CC-BY-SA-4.0           |  PD&CFT                            |              |                   | [Paper](https://arxiv.org/abs/1607.02250)                | https://hfl-rc.github.io/cmrc2017/                 |
| 6    | [Kesci competition dataset](https://www.kesci.com/home/competition/5d142d8cbb14e6002c04e14a/content/5) | 2019/9/3   |     | \                      |    |          |                   | \                                                            | [Competition page](https://www.kesci.com/home/competition/5d142d8cbb14e6002c04e14a) |
| 7    | [CoQA](https://stanfordnlp.github.io/coqa/)                  | 2018/9     |                   | CC BY-SA 4.0, Apache |  CoQA  |                    |                   | [Paper](https://arxiv.org/abs/1808.07042)                | [Article](https://www.jiqizhixin.com/articles/2018-09-11-3) |
| 8    | [SQuAD2.0](https://github.com/rajpurkar/SQuAD-explorer/tree/master/dataset) | 2018/1/11  |                   | \                      |   500    SQuAD 2.0   |          |                   | [Paper](https://arxiv.org/abs/1806.03822)                |                                                              |
| 9    | [SQuAD1.0](https://github.com/rajpurkar/SQuAD-explorer/tree/master/dataset) | 2016       |                   | \                      |  2016107,785 536   |      |                   | [Paper](https://arxiv.org/pdf/1606.05250.pdf)            |                                                              |
| 10   | [MCTest](https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/) | 2013       |                         | \                      |  100,000Bing1,000,000  |                  |                   | [](https://microsoft.github.io/msmarco/)                 |                                                              |
| 11   | [CNN/Dailymail](https://cs.nyu.edu/~kcho/DMQA/)              | 2015       | DeepMind                    | Apache-2.0             |   CNN 90k 380k Dailymail 197k 879k  |      |                   | [Paper](https://arxiv.org/abs/1506.03340)                |                                                              |
| 12   | [RACE](http://www.cs.cmu.edu/~glai1/data/race/)              | 2017       |               | /                      |   5  4  1  28000+ passages  100,000   |                  |                   | [Paper](https://arxiv.org/abs/1704.04683)                |                                                |
| 13   | [HEAD-QA](https://github.com/aghie/head-qa)                  | 2019       | aghie                       | MIT                    |    |        |   | [Paper](https://arxiv.org/pdf/1906.04701.pdf)            |                                                              |
| 14   | [Consensus Attention-based Neural Networks for Chinese Reading Comprehension](http://hfl.iflytek.com/chinese-rc/) | 2018       |         | /                      |                                          |              |                   | [Paper](https://arxiv.org/pdf/1607.02250.pdf)            |                                                              |
| 15   | [WikiQA](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/) | 2015       |                         | /                      |  WikiQA  |            |                   | [Paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/YangYihMeek_EMNLP-15_WikiQA.pdf) |                                                              |
| 16   | [Children's Book Test (CBT)](https://research.fb.com/downloads/babi/) | 2016       | Facebook                    | /                      |    |              |                   | [Paper](https://arxiv.org/pdf/1511.02301.pdf)            |                                                              |
| 17   | [NewsQA](https://www.microsoft.com/en-us/research/project/newsqa-dataset/) | 2017       | Maluuba Research            | /                      |  100000 CNN 10000  |            |                   | [Paper](https://arxiv.org/pdf/1611.09830.pdf)            |                                                              |
| 18   | [Frames dataset](https://www.microsoft.com/en-us/research/project/frames-dataset/#!download) | 2017       |                         | /                      | 136915 |              |                   | [Paper](https://arxiv.org/pdf/1704.00057.pdf)            |                                                              |
| 19   | [Quasar](https://github.com/bdhingra/quasar)                 | 2017       |               | BSD-2-Clause           |  Quasar-S 37000 Stack Overflow Quasar-T 43000  |            |                   | [Paper](https://arxiv.org/pdf/1707.03904.pdf)            |                                                              |
| 20   | [MS MARCO](http://www.msmarco.org/dataset.aspx)              | 2018       |                         | /                      |   BING 1020MARCO  BING  BING   |                      |                   | [Paper](https://arxiv.org/pdf/1611.09268.pdf)            |                                                              |
| 21   | [Chinese-Cloze-RC](https://github.com/ymcui/Chinese-Cloze-RC) | 2016     |                       |                        |  PD&CFT (People Daily and Children's Fairy Tale)   |              |                   | [Paper](http://aclanthology.info/papers/consensus-attention-based-neural-networks-for-chinese-reading-comprehension) |                                                              |
| 22   | [NLPCC ICCPOL2016](http://tcci.ccf.org.cn/conference/2016/)  | 2016/12/2  | NLPCC                 |                        |  1465914K        |              |                   | \                                                            |                                                              |
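
The SQuAD 1.0 and 2.0 files listed above are nested JSON (`data` → `paragraphs` → `qas`), and some other sets in this table are published in a SQuAD-style layout as well. A minimal sketch of flattening that structure into one record per question follows. It assumes the standard SQuAD field names; the helper names `load_squad` and `iter_squad_examples` and the inline sample are made up for illustration, and each download should be checked for field differences.

```python
import json


def load_squad(path):
    """Read a SQuAD-style JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_squad_examples(dataset):
    """Yield one flat dict per question from a SQuAD-style dataset dict."""
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                answers = qa.get("answers", [])
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # SQuAD 2.0 marks unanswerable questions with is_impossible.
                    "is_impossible": qa.get("is_impossible", False),
                    "answers": [(a["text"], a["answer_start"]) for a in answers],
                }


# Made-up sample in the assumed layout (not taken from any dataset).
SAMPLE = {
    "version": "sample",
    "data": [{
        "title": "demo",
        "paragraphs": [{
            "context": "CLUE collects Chinese NLP datasets.",
            "qas": [
                {"id": "q1", "question": "What does CLUE collect?",
                 "answers": [{"text": "Chinese NLP datasets", "answer_start": 14}]},
                {"id": "q2", "question": "Who funds CLUE?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

examples = list(iter_squad_examples(SAMPLE))
```

`answer_start` is a character offset into `context`, so `context[start:start + len(text)]` should reproduce the answer text; this is a quick sanity check to run after loading a real file with `load_squad`.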

# Contributing



[](dukeenglish.github.io)



Share your dataset with the community or make a contribution today! Just send an email to chineseGLUE#163.com, or join the QQ group: 836811304

Owner

  • Login: alixunxing
  • Kind: user
