https://github.com/beomi/awesomekorean_data
Korean dataset links
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.6%) to scientific vocabulary
Repository
Korean dataset links
Basic Info
- Host: GitHub
- Owner: Beomi
- Default Branch: master
- Size: 1.5 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Fork of songys/AwesomeKorean_Data
Created almost 6 years ago · Last pushed almost 6 years ago
https://github.com/Beomi/AwesomeKorean_Data/blob/master/
# AwesomeKorean_Data
- 2020 . . end to end , .
- 12 15 2020 8 21 @warnikchow .
- Natural language processing: [Awesome-Korean-NLP](https://github.com/datanada/Awesome-Korean-NLP)
- [https://ratsgo.github.io/embedding/preprocess.html](https://ratsgo.github.io/embedding/preprocess.html)
- ! , evaluation / huggingface.nlp , [ko-nlp](https://github.com/ko-nlp/Korpora)
# Open Datasets

- Commercially available (com), academic use only (aca), or unknown (unk)
- Redistribution allowed with modification (red), allowed only without modification (red/mod-x), not allowed (not), or unknown (unk)
- Internationally available publication (INT)
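The license and redistribution codes above appear in the License column of the tables as short strings such as `com/red`, `aca/not`, or `com/red (mod-x)`. As a minimal sketch of how such a cell could be decoded, the following Python helper splits a cell into its parts. The names `LicenseCode` and `parse_license` are hypothetical and not part of this repository, and the reading that a plain `red` implies modification is allowed is an assumption drawn from the legend.

```python
from dataclasses import dataclass
from typing import Optional

# Codes taken from the legend; anything else is treated as missing (None).
USAGE_CODES = {"com", "aca", "unk"}
REDISTRIBUTION_CODES = {"red", "not", "unk"}


@dataclass(frozen=True)
class LicenseCode:
    """Hypothetical container for one decoded License cell."""
    usage: Optional[str]                   # "com", "aca", "unk", or None
    redistribution: Optional[str]          # "red", "not", "unk", or None
    modification_allowed: Optional[bool]   # only meaningful when redistribution == "red"


def parse_license(cell: str) -> LicenseCode:
    """Decode a cell such as 'com/red', 'aca/not', or 'com/red (mod-x)'."""
    text = cell.strip().lower()
    # 'mod-x' marks redistribution without modification (assumed from the legend).
    no_modification = "mod-x" in text
    for marker in ("(mod-x)", "/mod-x"):
        text = text.replace(marker, "")
    usage, _, redistribution = (part.strip() for part in text.partition("/"))
    usage_code = usage if usage in USAGE_CODES else None
    redistribution_code = redistribution if redistribution in REDISTRIBUTION_CODES else None
    modification = (not no_modification) if redistribution_code == "red" else None
    return LicenseCode(usage_code, redistribution_code, modification)


if __name__ == "__main__":
    for cell in ("com/red", "com/red (mod-x)", "aca/not", "aca/", "unk/unk"):
        print(cell, "->", parse_license(cell))
```

Cells with a missing half, such as `aca/`, leave the corresponding field as `None` instead of guessing a value.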
## 1. Classical NLP pipeline
,(), , , . , , 'entity' 'entity' .
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1|[KAIST Morpho-syntactically Annotated Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Morphological analysis|Academia|article|aca/|70M(w)| - |ko| . Affiliation .|
|2|[Korean Tree-tagged Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Tree parsing|Academia|INT|aca/red|30K (s)|-|ko|-|
|3|[UD Korean KAIST](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)|Dependency parsing|Academia|INT|com/red|30K (s)|-|ko|Treebank|
|4|[PKT-UD](https://catalog.ldc.upenn.edu/LDC2006T09)|Dependency parsing|Academia|INT|com/red|5K (s)|-|ko|-|
|5|[KMOU NER](https://github.com/kmounlp/NER)| NER| Academia|article|aca/red|24K (s)|-|ko| |
|6|[AIR x NAVER NER](http://air.changwon.ac.kr/?page_id=10)| NER |Competition| DOC |aca/not|90K (s)|-|ko|, , |
|7|[AIR x NAVER SRL](http://air.changwon.ac.kr/?page_id=14)|SRL|Competition|DOC|aca/not|35K (s)|-|ko|Semantic Role Labeling|
## 2. Entailment and sentence similarity
() . , .
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1|[Question Pair](https://github.com/songys/Question_pair)| Paraphrase detection|Academia|DOC|com/red|10K (p)|-|ko| |
|2|[KorNLI](https://github.com/kakaobrain/KorNLUDatasets)|NLI|Industry|INT|com/red |1,000K (p)|-|ko | |
|3|[KorSTS](https://github.com/kakaobrain/KorNLUDatasets)|STS|Industry|INT|com/red|8,500 (p)|-|ko | |
|4|[ParaKQC](https://github.com/warnikchow/ParaKQC)|STS|Academia|INT|com/red|540K (p)|-|ko |Parallel dataset of Korean Questions and Commands|
## 3. Semantics and question answering
'' (Y Kim(2014)). , QA . .
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1|[NSMC](https://github.com/e9t/nsmc)|Sentiment analysis|Academia|DOC|com/red|150K / 50K (s)|-|ko| |
|2|[BEEP!](https://github.com/kocohub/korean-hate-speech)|Hate speech detection|Academia |INT |com/red |8K / 500 / 1,000 (s)|-|ko| |
|3|[3i4K](https://github.com/warnikchow/3i4k)|Speech act classification|Academia|INT|com/red|55K / 6K (s)|-|ko|Intonation-aided intention identification for Korean|
|4|KorQuAD1|QA|Industry|INT|com/red (mod-x)|60K / 5K / 4K (p)|-|ko| [KorQuAD ](https://www.youtube.com/watch?v=ntGwv6Ifoe8)|
|5|[KorQuAD2](https://korquad.github.io/)|QA|Industry|article|com/red (mod-x)|80K / 10K / 10K (p)|-|ko| -|
## 4. Parallel corpora
. Aihub [](http://aihub.or.kr/sample_data_board) . , .
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1|[Sci-news-sum-kr](https://github.com/theeluwin/sci-news-sum-kr-50)|Summarization|Academia|DOC|aca/red|50 (p)|Eval|ko|IT/science news (50)|
|2|[SAE4K](https://github.com/warnikchow/sae4k)|Summarization|Academia|INT|com/red|50K (p)|-|ko|Structured argument extraction for Korean|
|3|[Korean Parallel Corpora](https://github.com/jungyeul/korean-parallel-corpora)|MT|Academia|INT|com/red|97K (p)|-|ko, en, fr|-|
|4|[KAIST Translation Evaluation Set2](http://semanticweb.kaist.ac.kr/home/index.php/Evaluateset2)|MT|Academia|DOC|aca/red|3K (p)|Eval|ko, en|-|
|5|[Chinese-Korean Multilingual Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus9)|MT|Academia|DOC|aca/red|60K (p)|-|ko, zh|-|
|6|[Transliteration Dataset](https://github.com/muik/transliteration), [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Main_Page)|Transliteration|Academia|DOC|com/red|35K (p)|-|ko, en|-|
|7|[KAIST Transliteration Evaluation Set3](http://semanticweb.kaist.ac.kr/home/index.php/Evaluateset3)|Transliteration|Academia|DOC|aca/red|7K (p)|Eval|ko, en|-|
## 5. Korean in multilingual corpora
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|:---------:|:---------:|:---------:|
|1|[Sigmorphon G2P](https://sigmorphon.github.io/sharedtasks/2020/task1/)|G2P conversion|Competition|DOC|unk/unk|3,600 / 450 / 450 (p)|-|ko, en, hy, bg, fr, ka, hi, hu, is, lt, el|Multilingual Grapheme-to-Phoneme Conversion|
|2|[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx) | Paraphrase detection |Industry|INT |com/red |5K / 2K / 2K (p)|-|ko, fr, es, de, zh, ja|-|
|3|[TyDi-QA](https://github.com/google-research-datasets/tydiqa)|QA|Industry|INT [DOC](https://arxiv.org/abs/2003.05002)|com/red |11K / 1,698 / 1,722 (p)|-|ko, en, ar, bn, fi, ja, id, sw, ru, te, th |-|
|4|[XPersona](https://github.com/HLTCHKUST/Xpersona)|Dialog|Academia|INT [Doc](https://arxiv.org/abs/2003.07568)|com/red|299 (d) / 4,684 (s)|-|ko, en, it, fr, id, zh, ja|-|
## 6. Speech recognition and spoken language understanding
|No|Dataset|Typical Usage|Provider|Docu|License|Volume|Goal|Lang|Description|
|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|:---------:|:---:|:---:|
|1|[KSS](https://github.com/Kyubyong/kss) |ASR|Academia|DOC|aca/red|12+ (h)/ 13K (u) / 1 speaker |-|ko |STT|
|2|[Zeroth](https://github.com/goodatlas/zeroth) | ASR |Industry |DOC|com/red|51+(h)/ 27K (s)/ 46K (u)/181 speakers|-|ko|-|
|3|[ClovaCall](https://github.com/clovaai/ClovaCall)|ASR|Industry|INT|aca/not|80+ (h)/ 60K (u)/ 11K speakers|-|ko|-|
|4|[Pansori-TedXKR](https://github.com/yc9701/pansori-tedxkr-corpus)|ASR|Academia|INT|aca/red (mod-x)|3+ (h) / 3K (u) / 41 speakers|-|ko|-|
|5|[ProSem](https://github.com/warnikchow/prosem)|SLU|Academia|INT|com/red|6+ (h) / 3,500 (s) / 7K (u) / 2 speakers|-|ko|-|
## 7.
|| | |
|:---:|:-----------------:|:-----------------:|
|1|[politician_news_dataset](https://github.com/lovit/politician_news_dataset)|-|
|2|[president.go.kr petitions](https://www1.president.go.kr/petitions), [finished petitions](https://www1.president.go.kr/petitions?only=finished)|[:octocat:](https://github.com/akngs/petitions)|
|3|[data.go.kr dataset 15012945](https://www.data.go.kr/dataset/15012945/fileData.do)| 'Kinds' , |
## 8.
|| | |
|:---:|:-----------------:|:-----------------:|
|1|[Chatbot_data](https://github.com/songys/Chatbot_data)| |
|2|[kmrd](https://github.com/lovit/kmrd)|Synthetic dataset for recommender system created with Naver Movie rating system|
|3|[Curse-detection-data](https://github.com/2runo/Curse-detection-data)| |
#
|| | |
|:---:|:-----------------:|:-----------------:|
|1|[opendict.korean.go.kr](https://opendict.korean.go.kr/main)|[:octocat:](https://github.com/songys/Dictionaries)|
|2|[NIA](https://kbig.kr/portal/kbig/knowledge/files/bigdata_report.page?bltnNo=10000000016451)| |
|3|[ithub.korean.go.kr](https://ithub.korean.go.kr/user/total/database/corpusManager.do)| 2007 , |
|4|[AIHub](http://aihub.or.kr/)| , ( ) |
|5|[corpus.korean.go.kr](https://corpus.korean.go.kr/)| ( ), (, , , ), ( ) . , , . |

- SW , feature , .
Owner
- Name: Junbum Lee
- Login: Beomi
- Kind: user
- Location: Seoul, South Korea
- Website: https://junbuml.ee
- Twitter: __Beomi__
- Repositories: 110
- Profile: https://github.com/Beomi
AI/ML GDE @ml-gde. Korean AI/NLP Researcher and creator of multiple Korean PLMs. Focused on advancing Open LLMs.