https://github.com/alixunxing/nlp_chinese_corpus

大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP

Basic Info
  • Host: GitHub
  • Owner: alixunxing
  • License: MIT
  • Default Branch: master
  • Homepage:
  • Size: 3.93 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of brightmart/nlp_chinese_corpus
Created over 5 years ago · Last pushed over 6 years ago

https://github.com/alixunxing/nlp_chinese_corpus/blob/master/

To contribute a Chinese corpus, email: nlp_chinese_corpus@163.com

*** update ****

Language Understanding Evaluation benchmark for Chinese (CLUE benchmark): run 10 tasks and 9 baselines with one line of code, with a detailed performance comparison.

Releasing pre-trained models of ALBERT_Chinese:

Trained on a 30G+ raw Chinese corpus, with xxlarge, small, and other versions; the target is to match state-of-the-art performance in Chinese with 30% fewer parameters. 2019-Oct-7, during the National Day of China!



10 & 3(201951)

30 & 10 & 120191231

Update: added the community Q&A corpus in JSON format (webtext2019zh) and the translation corpus with 5.2 million sentence pairs (translation2019zh).


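#### 1. Wikipedia (wiki2019zh): 1 million entries
#### 2. News corpus (news2016zh): 2.5 million news articles
#### 3. Encyclopedia Q&A (baike2018qa): 1.5 million Q&A pairs
#### 4. Community Q&A, JSON version (webtext2019zh): 4.1 million Q&A items
#### 5. Translation corpus (translation2019zh): 5.2 million sentence pairs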

#### 

2019



github






1. Wikipedia, JSON version (wiki2019zh)
-------------------------------------------------------------------------

#### 1.04 million entries (1,043,224 entries; raw file size 1.6G, compressed file 519M; data updated 2019.2.7)
Google Drive    


#### 

    

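#### Structure

    {"id":<id>,"url":<url>,"title":<title>,"text":<text>}

    where title is the entry title and text is the body text; paragraphs are separated by "\n\n"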

#### Example

    {"id": "53", "url": "https://zh.wikipedia.org/wiki?curid=53", "title": "", "text": "\n\n\n\n..."}
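
A minimal sketch of reading records in the layout shown above and splitting `text` into paragraphs on `"\n\n"`. It assumes each line of a corpus file holds one JSON object, which this README does not state explicitly; `iter_wiki_records` and `split_paragraphs` are hypothetical helper names, and the sample record is made up for illustration.

```python
import io
import json


def iter_wiki_records(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_paragraphs(text):
    """Split an entry's text on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Tiny in-memory stand-in for a corpus file (made-up content, not real data).
sample = io.StringIO(
    '{"id": "1", "url": "https://example.org/1", "title": "T", '
    '"text": "T\\n\\nFirst paragraph.\\n\\nSecond paragraph."}\n'
)
records = list(iter_wiki_records(sample))
paragraphs = split_paragraphs(records[0]["text"])
print(records[0]["title"], len(paragraphs))
```

To read a real file, an open file handle (for example `open(path, encoding="utf-8")`) can be passed in place of the `StringIO` object, since both are iterated line by line.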

#### 

    
    
    
    ----
    21
    1776
    -1803Dismal science17981844
    .....

<img src="https://github.com/brightmart/nlp_chinese_corpus/blob/master/resources/img/wiki_zh.jpg"  width="90%" height="90%" />

<br>

2. News corpus, JSON version (news2016zh)
-------------------------------------------------------------------------

#### 2.5 million news articles (raw data 9G, compressed file 3.6G; news period 2014-2016)

<a href='https://drive.google.com/file/d/1TMKu1FpTr6kcjWXWlQHX7YJsMfhhcVKp/view?usp=sharing'>Google Drive</a> <a href='https://pan.baidu.com/s/1MLLM-CdM6BhJkj8D0u3atA'>Baidu Netdisk</a> (password: k265)

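#### Data description

Contains 2.5 million news articles from 63,000 media sources.

Dataset split: training set 2.43 million; validation set 77,000.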

#### 

    
   
    

    

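#### Structure

    {'news_id': <news_id>,'title':<title>,'content':<content>,'source': <source>,'time':<time>,'keywords': <keywords>,'desc': <desc>}

    where title is the headline, content is the body text, keywords are the keywords, desc is the description, source is the news source, and time is the publication time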

#### Example
    
    {"news_id": "610130831", "keywords": "","title": "40 140", "desc": "", "source": "", "time": "03-22 12:00", "content": "40140...."}
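
A minimal sketch of pulling (title, content) pairs out of news records like the one above, skipping records where either field is empty. It assumes the file is valid double-quoted JSON with one object per line, as in the example record; `title_content_pairs` is a hypothetical helper name and the sample records are made up.

```python
import json


def title_content_pairs(lines):
    """Yield (title, content) for records where both fields are non-empty."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        title = record.get("title", "").strip()
        content = record.get("content", "").strip()
        if title and content:
            yield title, content


# Made-up records for illustration only.
sample_lines = [
    '{"news_id": "1", "keywords": "k", "title": "Headline", "desc": "d", '
    '"source": "s", "time": "03-22 12:00", "content": "Body text."}',
    '{"news_id": "2", "keywords": "", "title": "", "desc": "", '
    '"source": "s", "time": "03-22 12:05", "content": "No headline."}',
]
pairs = list(title_content_pairs(sample_lines))
print(pairs)
```

Using a generator keeps memory use flat, which matters for a corpus of this size.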
  

<img src="https://github.com/brightmart/nlp_chinese_corpus/blob/master/resources/img/news2016zh.png"  width="100%" height="100%" />

<br>

3. Encyclopedia Q&A, JSON version (baike2018qa)
-------------------------------------------------------------------------

#### 1.5 million Q&A pairs (raw data 1G, compressed file 663M; data updated 2018)

<a href="https://drive.google.com/open?id=1_vgGQZpfSxN_Ng9iTAvE7hM3Z7NVwXP2">Google Drive</a>  <a href='https://pan.baidu.com/s/12TCEwC_Q3He65HtPKN17cA'>Baidu Netdisk</a> (password: fu45)


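#### Data description

Contains 1.5 million questions and answers; each question belongs to one category. There are 492 categories in total, of which 434 occur 10 or more times.

Dataset split: training set 1.425 million; validation set 45,000.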

#### 

    

    

#### Structure

    {"qid":<qid>,"category":<category>,"title":<title>,"desc":<desc>,"answer":<answer>}
    
    where category is the question category, title is the question title, and desc is the question description

#### Example
    
    {"qid": "qid_2540946131115409959", "category": "", "title": " ", "desc": "", "answer": "\r\r\r\r  \r\r\r\r \r\r"}
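
The example record above shows runs of `\r` inside `answer`. A minimal sketch, assuming one JSON object per line, of normalizing those runs and counting questions per `category`; `clean_answer` and `category_counts` are hypothetical helper names, and the sample records and category labels are made up.

```python
import json
import re
from collections import Counter


def clean_answer(answer):
    """Collapse runs of carriage returns/newlines into one newline and trim."""
    return re.sub(r"[\r\n]+", "\n", answer).strip()


def category_counts(lines):
    """Count how many questions fall under each category."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["category"]] += 1
    return counts


# Made-up records for illustration only.
sample_lines = [
    '{"qid": "qid_1", "category": "health", "title": "q1", "desc": "", '
    '"answer": "a\\r\\r\\r\\rb \\r\\r"}',
    '{"qid": "qid_2", "category": "health", "title": "q2", "desc": "", '
    '"answer": "c"}',
    '{"qid": "qid_3", "category": "sports", "title": "q3", "desc": "", '
    '"answer": "d"}',
]
counts = category_counts(sample_lines)
cleaned = clean_answer(json.loads(sample_lines[0])["answer"])
print(counts.most_common(), repr(cleaned))
```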
  

<img src="https://github.com/brightmart/nlp_chinese_corpus/blob/master/resources/img/baike_qa.png"  width="100%" height="100%" />

#### 

1 

#1#21PDF#3()

#2#3#1#2


<br>

4. Community Q&A, JSON version (webtext2019zh)
-------------------------------------------------------------------------

#### 4.1 million Q&A items (data size 3.7G, compressed file 1.7G; data period 2015-2016)

<a href='https://drive.google.com/open?id=1u2yW_XohbYL2YAK6Bzc5XrngHstQTf0v'>Google Drive</a>


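#### Data description

Contains 4.1 million questions and answers; each question belongs to a topic, with 28,000 topics in total.

The answers were selected from 14 million raw Q&A items, keeping those with at least 3 upvotes.

Each answer also carries its upvote count, answer ID, and answerer tags.

Dataset split: training set 4.12 million; validation set 68,000; test set a 68,000; test set b.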


#### 
    
    1
    
    2()
    
    3(cQA)
    
      

    4
    
    5

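#### Structure

    {"qid":<qid>,"title":<title>,"desc":<desc>,"topic":<topic>,"star":<star>,"content":<content>,
    
    "answer_id":<answer_id>,"answerer_tags":<answerer_tags>}
    
    where qid is the question id, title is the question title, desc is the question description, topic is the topic the question belongs to, star is the number of upvotes the answer received,
    
    content is the answer text, answer_id is the answer ID, and answerer_tags are the tags of the answerer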

#### Example
    
    {"qid": 65618973, "title": "AlphaGo", "desc": "<br>", "topic": "", "star": 3, "content": "AlphaGoMCTSPRAlphaGoNLP", "answer_id": 545576062, "answerer_tags": "@"}
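
Since one question can have several answers, each with its own `star` value, a common first step is grouping answers by `qid` and ranking them. A minimal sketch, assuming one JSON object per line; `answers_by_question` and its `min_star` default are hypothetical choices for illustration, and the sample records are made up.

```python
import json
from collections import defaultdict


def answers_by_question(lines, min_star=3):
    """Group answers by qid, keeping answers with at least min_star stars."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record["star"] >= min_star:
            grouped[record["qid"]].append((record["star"], record["content"]))
    # Highest-starred answer first within each question.
    return {qid: sorted(items, reverse=True) for qid, items in grouped.items()}


# Made-up records for illustration only.
sample_lines = [
    '{"qid": 1, "title": "t", "desc": "", "topic": "x", "star": 3, '
    '"content": "ok", "answer_id": 10, "answerer_tags": ""}',
    '{"qid": 1, "title": "t", "desc": "", "topic": "x", "star": 5, '
    '"content": "best", "answer_id": 11, "answerer_tags": ""}',
    '{"qid": 1, "title": "t", "desc": "", "topic": "x", "star": 1, '
    '"content": "weak", "answer_id": 12, "answerer_tags": ""}',
    '{"qid": 2, "title": "u", "desc": "", "topic": "y", "star": 2, '
    '"content": "meh", "answer_id": 13, "answerer_tags": ""}',
]
ranked = answers_by_question(sample_lines)
print(ranked)
```

The ranked lists can serve as a starting point for answer-ranking experiments, with `star` as a noisy quality signal.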
  

<img src="https://github.com/brightmart/nlp_chinese_corpus/blob/master/resources/img/webtext2019zh.png"  width="100%" height="100%" />


#### 

1 

#1#21PDF#3()

#2#3#1#2

2(cQA)

MAP

3webtext2019zh)OpenAIGPT-2zero-shot

<br>


5. Translation corpus (translation2019zh)
-------------------------------------------------------------------------

#### 5.2 million Chinese-English parallel sentence pairs (raw data 1.1G, compressed file 596M)

<a href='https://drive.google.com/open?id=1EX8eE5YWBxCaohBO8Fh4e2j3b9C2bTVQ'>Google Drive</a>


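#### Data description

5.2 million Chinese-English parallel sentence pairs.

On average, a Chinese sentence has 36 characters and an English sentence has 19 words (a word such as "she").

Dataset split: training set 5.16 million; validation set 39,000.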

#### 
    
    
    
    
    

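#### Structure

    {"english": <english>, "chinese": <chinese>}
    
    where english is the English sentence and chinese is the corresponding Chinese sentence; together they form one parallel pair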

#### Example
    
    {"english": "In Italy, there is no real public pressure for a new, fairer tax system.", "chinese": ""}
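
A minimal sketch of loading parallel pairs in the layout above and computing simple length statistics. It assumes one JSON object per line; `load_pairs` and `average_lengths` are hypothetical helper names, the sample sentences are made up, and English words are counted by whitespace splitting while Chinese length is counted in characters (punctuation included).

```python
import json


def load_pairs(lines):
    """Return (english, chinese) tuples, skipping pairs with an empty side."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        en, zh = record["english"].strip(), record["chinese"].strip()
        if en and zh:
            pairs.append((en, zh))
    return pairs


def average_lengths(pairs):
    """Return (mean English words, mean Chinese characters) per sentence."""
    n = len(pairs)
    en_words = sum(len(en.split()) for en, _ in pairs) / n
    zh_chars = sum(len(zh) for _, zh in pairs) / n
    return en_words, zh_chars


# Made-up pairs for illustration only.
sample_lines = [
    '{"english": "She is here.", "chinese": "她在这里。"}',
    '{"english": "Good morning.", "chinese": "早上好。"}',
    '{"english": "Untranslated.", "chinese": ""}',
]
pairs = load_pairs(sample_lines)
en_avg, zh_avg = average_lengths(pairs)
print(len(pairs), en_avg, zh_avg)
```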
  

<img src="https://github.com/brightmart/nlp_chinese_corpus/blob/master/resources/img/translation2019zh.jpeg"  width="100%" height="100%" />


Contribution
-------------------------------------------------------------------------

nlp_chinese_corpus@163.com



20



Add your Chinese corpus here by sending us an email.

If there is any issue with the data, you can also contact us; we will process it within one week.

Thank you for your understanding.



Contributors
-------------------------------------------------------------------------

1. <a href='https://github.com/ReactiveCJ'>ReactiveCJ</a>


Citation / How do I cite us?
-------------------------------------------------------------------------

    @misc{bright_xu_2019_3402023,
    author       = {Bright Xu},
    title        = {NLP Chinese Corpus: Large Scale Chinese Corpus for NLP},
    month        = sep,
    year         = 2019,
    doi          = {10.5281/zenodo.3402023},
    version      = {1.0},
    publisher    = {Zenodo},
    url          = {https://doi.org/10.5281/zenodo.3402023}
    }


<a href="https://zenodo.org/badge/latestdoi/169745123"><img src="https://zenodo.org/badge/169745123.svg" alt="DOI"></a>




Reference
-------------------------------------------------------------------------

1. <a href='https://github.com/AimeeLee77/wiki_zh_word2vec'>wiki_zh_word2vec (Python, Wiki)</a>

2. <a href='https://github.com/attardi/wikiextractor'>A tool for extracting plain text from Wikipedia dumps</a>

3. <a href='https://github.com/yichen0831/opencc-python'>Open Chinese Convert (OpenCC) in pure Python</a>

4. <a href='https://dumps.wikimedia.org/zhwiki/latest/'>Latest Chinese Wikipedia dumps</a>


</pre>
      </div>
  
Owner
  • Login: alixunxing
  • Kind: user
  • Repositories: 18
  • Profile: https://github.com/alixunxing

    
    
    
    
    

    
    
    
    

</div>
    </div>

  </body>
</html>