https://github.com/chapzq77/chinese-nlp-corpus

Collections of Chinese NLP corpus

Repository

Collections of Chinese NLP corpus

Basic Info
  • Host: GitHub
  • Owner: chapzq77
  • Default Branch: master
  • Size: 7.13 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of OYE93/Chinese-NLP-Corpus
Created over 6 years ago · Last pushed over 6 years ago

https://github.com/chapzq77/Chinese-NLP-Corpus/blob/master/

# Chinese-NLP-Corpus
Collections of Chinese NLP corpus

## Open Domain
Corpora for the open domain, including law, social media, and comments.
### Word Segmentation and Part-of-Speech
|Name|Description|Link|
|:-:|---|:-:|
|ZhuXian|POS|[zhuxian](https://github.com/hankcs/OpenCorpus/tree/master/zhuxian)|
|CNLC|train : dev : test = 8 : 1 : 1|[CNLC](https://github.com/hankcs/OpenCorpus/tree/master/cncorpus)|

\* The URLs in the table are out of date; the data can be found via the following reference.  
**Reference**:  
Details of the corpus:  
![](https://camo.githubusercontent.com/8dfcf9fb2ea026c1e178ec9f70efea038fa4ca20/687474703a2f2f7778332e73696e61696d672e636e2f6c617267652f303036466d6a6d636c7931666d366a74686133746d6a33313872306c343078392e6a7067)

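The 8 : 1 : 1 train/dev/test ratio listed for CNLC can be reproduced on any list of sentences. The following is a minimal Python sketch, not the procedure used by the corpus authors; the function name `split_8_1_1` and the fixed `seed` are assumptions made for illustration.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle a copy of `items` and split it into train/dev/test at 8:1:1."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    # Whatever remains after train and dev goes to test.
    test = shuffled[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))
```

Any remainder from the integer division ends up in the test portion, so no item is dropped.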

### Named Entity Recognition (NER)
|Name|Description|Link|
|:-:|:-:|:-:|
|MSRA|NER|[MSRA](NER/MSRA)|
|People's Daily|NER|[People's Daily](NER/People%27s%20Daily)|
|Weibo Data|NER|[Weibo](NER/Weibo)|

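NER corpora are commonly distributed as plain text with one token and its tag per line and a blank line between sentences. This README does not state the file format of the datasets above, so the sketch below assumes that layout; `read_bio` is a hypothetical helper, and the sample lines are invented for illustration.

```python
def read_bio(lines):
    """Group 'token tag' lines into (tokens, tags) pairs, one pair per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # Assumption: the token is the first column and the tag is the last.
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = ["中 B-LOC", "国 I-LOC", "", "好 O"]
for tokens, tags in read_bio(sample):
    print(tokens, tags)
```

To read a real file, pass an open file object, for example `read_bio(open(path, encoding="utf-8"))`, after checking that the file actually follows this layout.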
### Text Classification
|Name|Description|Link|notes|
|:-:|---|:-:|---|
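|CAIL2018|Dataset of the 2018 CAIL legal AI challenge: about **2.68** million criminal case documents labeled with charges and law articles (**183** and **202** classes) and prison terms of 0-25 years|[CAIL2018](https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip)|[official site](http://cail.cipsc.org.cn/), [github](https://github.com/thunlp/CAIL2018)|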

### Sentiment Analysis and Rating
|Name|Description|Link|notes|
|:-:|---|:-:|---|
| ChnSentiCorp_htl_all | Over **7,000** hotel reviews: over **5,000** positive and over **2,000** negative | [ChnSentiCorp_htl_all](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all) |
| waimai_10k | Food-delivery reviews: **4,000** positive and about **8,000** negative | [waimai_10k](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k) |
| online_shopping_10_cats | **10** product categories, over **60,000** reviews, about **30,000** each positive and negative | [online_shopping_10_cats](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats) |
| weibo_senti_100k | Over **100,000** sentiment-labeled Weibo posts, about **50,000** each positive and negative | [weibo_senti_100k](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k) |See [issue #1](https://github.com/SophonPlus/ChineseNlpCorpus/issues/1) regarding emoji|
| simplifyweibo_4_moods | Over **360,000** sentiment-labeled Weibo posts in **4** moods: about **200,000** for joy and about **50,000** each for anger, disgust, and depression | [simplifyweibo_4_moods](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods) |
| dmsc_v2 | **28** movies, over **700,000** users, over **2 million** ratings/reviews | [dmsc_v2](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2) |
| yf_dianping | **240,000** restaurants, **540,000** users, **4.4 million** reviews/ratings | [yf_dianping](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping) |
| yf_amazon | **520,000** products, over **1,100** categories, **1.42 million** users, **7.2 million** reviews/ratings | [yf_amazon](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon) |
| ez_douban | Over **50,000** movies (over **30,000** with titles, over **20,000** without), **28,000** users, **2.8 million** ratings | [ez_douban](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ez_douban) |

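Before training on a sentiment dataset, it is worth checking how many examples each label has. The sketch below counts labels in CSV text using only the Python standard library. It assumes a header row with `label` and `review` columns, which this README does not state; `count_labels` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text, label_column="label"):
    """Count how often each value of `label_column` occurs in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Invented sample in the assumed `label,review` layout (1 = positive, 0 = negative).
sample = "label,review\n1,房间很干净\n0,服务太差\n1,位置方便\n"
print(count_labels(sample))
```

For a downloaded file, read it with `open(path, encoding="utf-8").read()` and adjust `label_column` if the header differs.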
### Other GitHub Repos
|Description|Link|notes|
|:-:|:-:|---|
|Chinese NLP Corpus|||


## Medical Domain
Corpora for the Chinese medical domain, including medical terminology, QA, and clinical NER.

### Word Segmentation
|Name|Description|Link|notes|
|:-:|---|:-:|---|
|AMTTL|open|[AMTTL](https://github.com/adapt-sjtu/AMTTL/tree/master/medical_data)|[Adaptive Multi-Task Transfer Learning for Chinese Word Segmentation in Medical Text](http://aclweb.org/anthology/C18-1307)|

### Clinical NER
|Name|Description|Link|notes|
|:-:|---|:-:|---|
|CNMER||[CNMER](https://github.com/yhzbit/CNMER/tree/master/data)|CCKS2017|
|CNMER|6|[CCKS2018](https://github.com/MenglinLu/Chinese-clinical-NER/tree/master/data)|
|CNMER||[CCKS2019](http://openkg.cn/dataset/yiducloud-ccks2019task1)|[OpenKG](http://openkg.cn)|

### Question Answering (QA)
|Name|Description|Link|notes|
|:-:|---|:-:|---|
|cMedQA|About **54,000** questions and about **100,000** answers|[cMedQA](https://github.com/zhangsheng93/cMedQA)|[Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs](https://www.mdpi.com/2076-3417/7/8/767)|
|cMedQA2|Extension of cMedQA: about **100,000** questions and about **200,000** answers|[cMedQA2](https://github.com/zhangsheng93/cMedQA2)|[Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection](https://ieeexplore.ieee.org/abstract/document/8548603)|

### Others
|Name|Description|Link|notes|
|:-:|---|:-:|---|
|medical-books|Open-source medical books in LaTeX|||

Owner

  • Name: 周奇
  • Login: chapzq77
  • Kind: user
