https://github.com/beomi/awesomekorean_data

Korean dataset links

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Beomi
Default Branch: master
Size: 1.5 MB

Statistics

Stars: 0
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 0

Fork of songys/AwesomeKorean_Data

Created almost 6 years ago · Last pushed almost 6 years ago

https://github.com/Beomi/AwesomeKorean_Data/blob/master/

# AwesomeKorean_Data

- 12 15        2020 8 21 @warnikchow     .      
- Natural language processing resources: [Awesome-Korean-NLP](https://github.com/datanada/Awesome-Korean-NLP)
- Preprocessing: [https://ratsgo.github.io/embedding/preprocess.html](https://ratsgo.github.io/embedding/preprocess.html)
- See also: huggingface.nlp, [ko-nlp](https://github.com/ko-nlp/Korpora)
                                    
# Open Datasets
![network](./network.jpg)

- License: commercially available (com), academic use only (aca), unknown (unk)
- Redistribution: allowed (red), allowed without modification (red/mod-x), not allowed (not), unknown (unk)
- Documentation (Docu): internationally available publication (INT)
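
The License column in the tables below combines a usage code and a redistribution code, for example `com/red (mod-x)`. A minimal Python sketch of splitting such a cell follows. The `License` class and `parse_license` function are hypothetical helpers, not part of the repository, and reading `mod-x` as "modification not allowed" is an assumption drawn from the legend.

```python
from dataclasses import dataclass

# Code tables copied from the legend; the wording of the values is ours.
USAGE = {"com": "commercially available", "aca": "academic use only", "unk": "unknown"}
REDISTRIBUTION = {
    "red": "redistribution allowed",
    "not": "redistribution not allowed",
    "unk": "unknown",
}


@dataclass
class License:
    usage: str
    redistribution: str
    no_modification: bool  # assumption: "mod-x" means modification is not allowed


def parse_license(code: str) -> License:
    """Split a License cell such as 'com/red (mod-x)' into its parts."""
    text = code.strip().lower()
    no_modification = "mod-x" in text
    # Drop the modification marker in either spelling used in the document.
    text = text.replace("(mod-x)", "").replace("/mod-x", "").strip()
    usage, _, redistribution = text.partition("/")
    return License(
        usage=USAGE.get(usage.strip(), "unknown"),
        redistribution=REDISTRIBUTION.get(redistribution.strip(), "unknown"),
        no_modification=no_modification,
    )
```

A cell with a missing half, such as `aca/`, falls back to "unknown" for the missing part rather than raising an error.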

## 1. Classical NLP pipeline


|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|                            
|1|[KAIST Morpho-syntactically Annotated Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Morphological analysis|Academia|article|aca/|70M (w)|-|ko|-|
|2|[Korean Tree-tagged Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Tree parsing|Academia|INT|aca/red|30K (s)|-|ko|-|
|3|[UD Korean KAIST](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Dependency parsing|Academia|INT|com/red|30K (s)|-|ko|Treebank|
|4|[PKT-UD](https://catalog.ldc.upenn.edu/LDC2006T09)|Dependency parsing|Academia|INT|com/red|5K (s)|-|ko|-|
|5|[KMOU NER](https://github.com/kmounlp/NER)|NER|Academia|article|aca/red|24K (s)|-|ko|-|
|6|[AIR x NAVER NER](http://air.changwon.ac.kr/?page_id=10)|NER|Competition|DOC|aca/not|90K (s)|-|ko|-|
|7|[AIR x NAVER SRL](http://air.changwon.ac.kr/?page_id=14)|SRL|Competition|DOC|aca/not|35K (s)|-|ko|Semantic Role Labeling|

## 2. Entailment and sentence similarity  

|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|                            
|1|[Question Pair](https://github.com/songys/Question_pair)|Paraphrase detection|Academia|DOC|com/red|10K (p)|-|ko|-|
|2|[KorNLI](https://github.com/kakaobrain/KorNLUDatasets)|NLI|Industry|INT|com/red|1,000K (p)|-|ko|-|
|3|[KorSTS](https://github.com/kakaobrain/KorNLUDatasets)|STS|Industry|INT|com/red|8,500 (p)|-|ko|-|
|4|[ParaKQC](https://github.com/warnikchow/ParaKQC)|STS|Academia|INT|com/red|540K (p)|-|ko |Parallel dataset of Korean Questions and Commands|

## 3. Semantics and question answering
 ''      (Y Kim(2014)). ,     QA           .                   .

|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|    
|1|[NSMC](https://github.com/e9t/nsmc)|Sentiment analysis|Academia|DOC|com/red|150K / 50K (s)|-|ko|-|
|2|[BEEP!](https://github.com/kocohub/korean-hate-speech)|Hate speech detection|Academia|INT|com/red|8K / 500 / 1,000 (s)|-|ko|-|
|3|[3i4K](https://github.com/warnikchow/3i4k)|Speech act classification|Academia|INT|com/red|55K / 6K (s)|-|ko|Intonation-aided intention identification for Korean|
|4|KorQuAD1|QA|Industry|INT|com/red (mod-x)|60K / 5K / 4K (p)|-|ko|[KorQuAD video](https://www.youtube.com/watch?v=ntGwv6Ifoe8)|
|5|[KorQuAD2](https://korquad.github.io/)|QA|Industry|article|com/red (mod-x)|80K / 10K / 10K (p)|-|ko| -|
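
The Volume column uses compact notation such as `150K / 50K (s)`: one size per split, followed by a unit letter in parentheses. A minimal Python sketch for turning such a cell into integers follows. `parse_volume` is a hypothetical helper, and it assumes `K` means thousand and `M` means million. It returns the unit letter unchanged, because the document does not define the letters.

```python
import re

# Assumption: K and M are decimal multipliers (thousand, million).
SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_volume(cell: str) -> tuple[list[int], str]:
    """Parse a Volume cell such as '150K / 50K (s)' into ([150000, 50000], 's')."""
    match = re.fullmatch(r"\s*(.+?)\s*\((\w)\)\s*", cell)
    if match is None:
        raise ValueError(f"unsupported Volume cell: {cell!r}")
    sizes_text, unit = match.groups()
    sizes = []
    for part in sizes_text.split("/"):
        number = re.fullmatch(r"([\d,]+)([KM]?)", part.strip())
        if number is None:
            raise ValueError(f"unsupported size: {part!r}")
        digits, suffix = number.groups()
        # Remove thousands separators before converting, e.g. '1,000' -> 1000.
        sizes.append(int(digits.replace(",", "")) * SUFFIX[suffix])
    return sizes, unit
```

Composite cells that mix several units, such as those in the speech table, are deliberately rejected with a `ValueError` instead of being guessed at.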


## 4. Parallel corpora
Aihub sample data: [http://aihub.or.kr/sample_data_board](http://aihub.or.kr/sample_data_board)

|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|                        
|1|[Sci-news-sum-kr](https://github.com/theeluwin/sci-news-sum-kr-50)|Summarization|Academia|DOC|aca/red|50 (p)|Eval|ko|-|
|2|[SAE4K](https://github.com/warnikchow/sae4k)|Summarization|Academia|INT|com/red|50K (p)|-|ko|Structured argument extraction for Korean|
|3|[Korean Parallel Corpora](https://github.com/jungyeul/korean-parallel-corpora)|MT|Academia|INT|com/red|97K (p)|-|ko, en, fr|-|
|4|[KAIST Translation Evaluation Set2](http://semanticweb.kaist.ac.kr/home/index.php/Evaluateset2)|MT|Academia|DOC|aca/red|3K (p)|Eval|ko, en|-|
|5|[Chinese-Korean Multilingual Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus9)|MT|Academia|DOC|aca/red|60K (p)|-|ko, zh|-|
|6|[Transliteration Dataset](https://github.com/muik/transliteration), [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Main_Page)|Transliteration|Academia|DOC|com/red|35K (p)|-|ko, en|-|
|7|[KAIST Transliteration Evaluation Set3](http://semanticweb.kaist.ac.kr/home/index.php/Evaluateset3)|Transliteration|Academia|DOC|aca/red|7K (p)|Eval|ko, en|-|


## 5. Korean in multilingual corpora
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|:---------:|:---------:|:---------:|                           
|1|[Sigmorphon G2P](https://sigmorphon.github.io/sharedtasks/2020/task1/)|G2P conversion|Competition|DOC|unk/unk|3,600 / 450 / 450 (p)|-|ko, en, hy, bg, fr, ka, hi, hu, is, lt, el|Multilingual Grapheme-to-Phoneme Conversion|
|2|[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx) | Paraphrase detection |Industry|INT |com/red |5K / 2K / 2K (p)|-|ko, fr, es, de, zh, ja|-| 
|3|[TyDi-QA](https://github.com/google-research-datasets/tydiqa)|QA|Industry|INT [DOC](https://arxiv.org/abs/2003.05002)|com/red |11K / 1,698 / 1,722 (p)|-|ko, en, ar, bn, fi, ja, id, sw, ru, te, th |-|
|4|[XPersona](https://github.com/HLTCHKUST/Xpersona)|Dialog|Academia|INT [Doc](https://arxiv.org/abs/2003.07568)|com/red|299 (d) / 4,684 (s)|-|ko, en, it, fr, id, zh, ja|-|


## 6. Speech recognition and spoken language understanding
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|:---------:|:---:|:---:|
|1|[KSS](https://github.com/Kyubyong/kss)|ASR|Academia|DOC|aca/red|12+ (h) / 13K (u) / 1 speaker|-|ko|STT|
|2|[Zeroth](https://github.com/goodatlas/zeroth)|ASR|Industry|DOC|com/red|51+ (h) / 27K (s) / 46K (u) / 181 speakers|-|ko|-|
|3|[ClovaCall](https://github.com/clovaai/ClovaCall)|ASR|Industry|INT|aca/not|80+ (h) / 60K (u) / 11K speakers|-|ko|-|
|4|[Pansori-TedXKR](https://github.com/yc9701/pansori-tedxkr-corpus)|ASR|Academia|INT|aca/red (mod-x)|3+ (h) / 3K (u) / 41 speakers|-|ko|-|
|5|[ProSem](https://github.com/warnikchow/prosem)|SLU|Academia|INT|com/red|6+ (h) / 3,500 (s) / 7K (u) / 2 speakers|-|ko|-|



## 7.  

|No|Dataset|Description|
|:---:|:-----------------:|:-----------------:|
|1|[politician_news_dataset](https://github.com/lovit/politician_news_dataset)|-|
|2|[president.go.kr petitions](https://www1.president.go.kr/petitions), [finished petitions](https://www1.president.go.kr/petitions?only=finished), [:octocat:](https://github.com/akngs/petitions)|-|
|3|[data.go.kr dataset 15012945](https://www.data.go.kr/dataset/15012945/fileData.do)|'Kinds'|

## 8.  
|No|Dataset|Description|
|:---:|:-----------------:|:-----------------:|
|1|[Chatbot_data](https://github.com/songys/Chatbot_data)|-|
|2|[kmrd](https://github.com/lovit/kmrd)|Synthetic dataset for recommender system created with Naver Movie rating system|
|3|[Curse-detection-data](https://github.com/2runo/Curse-detection-data)|-|


#    

|No|Dataset|Description|
|:---:|:-----------------:|:-----------------:|
|1|[opendict.korean.go.kr](https://opendict.korean.go.kr/main)|[:octocat:](https://github.com/songys/Dictionaries)|
|2|[NIA](https://kbig.kr/portal/kbig/knowledge/files/bigdata_report.page?bltnNo=10000000016451)|-|
|3|[ithub.korean.go.kr](https://ithub.korean.go.kr/user/total/database/corpusManager.do)|-|
|4|[AIHub](http://aihub.or.kr/)|-|

![pic](./aihub.png)

   
|No|Dataset|Description|
|:---:|:-----------------:|:-----------------:|
|5|[corpus.korean.go.kr](https://corpus.korean.go.kr/)|-|

![pic](./everyone.png)



Owner

Name: Junbum Lee
Login: Beomi
Kind: user
Location: Seoul, South Korea

Website: https://junbuml.ee
Twitter: __Beomi__
Repositories: 110
Profile: https://github.com/Beomi

AI/ML GDE @ml-gde. Korean AI/NLP Researcher and creator of multiple Korean PLMs. Focused on advancing Open LLMs.


Science Score: 10.0%
