https://github.com/alixunxing/pycorrector
pycorrector is a toolkit for text error correction. Chinese text error correction with Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, Transformer, T5, and other model implementations; ready to use out of the box.
Science Score: 23.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 1 DOI reference in README
- ✓ Academic publication links: links to arxiv.org, springer.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.8%) to scientific vocabulary
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: alixunxing
- License: apache-2.0
- Default Branch: master
- Homepage: https://www.mulanai.com/product/corrector/
- Size: 50.1 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of shibing624/pycorrector
Created about 3 years ago
· Last pushed about 3 years ago
Owner
- Login: alixunxing
- Kind: user
- Repositories: 18
- Profile: https://github.com/alixunxing
# pycorrector

**pycorrector** is a Python 3 Chinese text error correction toolkit, implementing Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, Transformer, and other models, usable out of the box and evaluated on the SigHAN benchmark.

**Guide**
- [Question](#question)
- [Solution](#solution)
- [Evaluation](#evaluation)
- [Install](#install)
- [Usage](#usage)
- [Deep Model Usage](#deep-model-usage)
- [Dataset](#dataset)
- [Contact](#contact)
- [Reference](#reference)

# Question
Chinese text errors come from many sources, including user input, OCR output, and search queries; correcting them improves downstream query understanding.
# Solution
### Rule-based approach
1. Error detection: flag suspicious characters and words with a statistical language model.
2. Candidate generation: propose corrections from similar-pronunciation and similar-shape confusion sets.
3. Candidate ranking: score candidates with the language model and keep the best one.
### Deep-model approach
1. End-to-end sequence models, e.g. RNN with attention (RNN Attn).
2. CRF-based error detection (a 2016-era approach).
3. Seq2Seq correction with an encoder-decoder architecture.
4. Pretrained language models (BERT/ELECTRA/ERNIE/MacBERT): reuse NLP pretraining with MASK-style objectives and fine-tune for correction.
PS:
- [pycorrector source-code walkthrough (talk notes)](https://github.com/shibing624/pycorrector/wiki/pycorrector%E6%BA%90%E7%A0%81%E8%A7%A3%E8%AF%BB-%E7%9B%B4%E6%92%AD%E5%88%86%E4%BA%AB)
- [Introduction to Chinese text correction (Zhihu)](https://zhuanlan.zhihu.com/p/138981644)
# Feature
* [Kenlm](pycorrector/corrector.py): Kenlm n-gram statistical language model correction (the default rule-based method).
* [MacBERT](pycorrector/macbert): PyTorch MacBERT4CSC model for Chinese spelling correction.
* [Seq2Seq](pycorrector/seq2seq): PyTorch ConvSeq2Seq model, trained on NLPCC-2018 data.
* [T5](pycorrector/t5): PyTorch T5 model, fine-tuned from Langboat/mengzi-t5-base.
* [BERT](pycorrector/bert): PyTorch BERT fill-mask correction.
* [ELECTRA](pycorrector/electra): PyTorch ELECTRA fill-mask correction.
* [ERNIE_CSC](pycorrector/ernie_csc): PaddlePaddle ERNIE_CSC model, fine-tuned from ERNIE-1.0.
* [DeepContext](pycorrector/deepcontext): PyTorch DeepContext model (based on Stanford University's NLC 2014 work).
* [Transformer](pycorrector/transformer): PyTorch Transformer model via fairseq.
#### Supported error types
1. Spelling errors: supported.
2. Grammatical errors (CGED, Chinese Grammar Error Diagnosis): TODO.
# Demo
Official Demo: https://www.mulanai.com/product/corrector/
HuggingFace Demo: https://huggingface.co/spaces/shibing624/pycorrector
Run the example [examples/gradio_demo.py](examples/gradio_demo.py) to see the demo:
```shell
python examples/gradio_demo.py
```
# Evaluation
Evaluation script: [examples/evaluate_models.py](./examples/evaluate_models.py)
- Test set: SIGHAN 2015 (sighan15), [pycorrector/data/cn/sighan_2015/test.tsv](pycorrector/data/cn/sighan_2015/test.tsv)
- Metric: sentence-level precision, recall, and F1
### Results
Evaluated on SIGHAN 2015; GPU: Tesla V100 (32 GB).
| Model | Backbone | Device | Precision | Recall | F1 | QPS |
| :-- | :-- | :--- | :----- | :--| :--- | :--- |
| Rule(pycorrector.correct) | kenlm | CPU | 0.6860 | 0.1529 | 0.2500 | 9 |
| BERT | bert-base-chinese | GPU | 0.8029 | 0.4052 | 0.5386 | 2 |
| BART | fnlp/bart-base-chinese | GPU | 0.6984 | 0.6354 | 0.6654 | 58 |
| T5 | byt5-small | GPU | 0.5220 | 0.3941 | 0.4491 | 111 |
| Mengzi-T5 | mengzi-t5-base | GPU | 0.8321 | 0.6390 | 0.7229 | 214 |
| ConvSeq2Seq | ConvSeq2Seq | GPU | 0.2415 | 0.1436 | 0.1801 | 6 |
| **MacBert** | **macbert-base-chinese** | **GPU** | **0.8254** | **0.7311** | **0.7754** | **224** |
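The sentence-level precision/recall/F1 used in the table above can be computed from model predictions as in this minimal sketch (the function name and the tie-breaking convention are illustrative, not pycorrector's evaluation code):

```python
def sentence_metrics(sources, predictions, references):
    """Sentence-level correction metrics.

    A sentence counts as a true positive when the model changed it
    (prediction != source) and the result matches the reference exactly.
    """
    tp = fp = fn = 0
    for src, pred, ref in zip(sources, predictions, references):
        changed = pred != src
        needs_change = ref != src
        if changed and pred == ref:
            tp += 1
        elif changed:          # model changed the sentence, but wrongly
            fp += 1
        elif needs_change:     # model left an erroneous sentence untouched
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = sentence_metrics(
    sources=["abc", "def", "ghi", "jkl"],
    predictions=["abX", "def", "gXi", "jkl"],   # changed 1st (wrong) and 3rd (right)
    references=["aXc", "def", "gXi", "jXl"],    # 1st, 3rd, 4th need correction
)
```

Counting a wrong edit as a false positive and a missed error as a false negative is one common convention; papers differ on how to score a wrong edit on a sentence that did need correction.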
### Conclusion
- **MacBert** offers the best accuracy/speed trade-off in the table above; released as *shibing624/macbert4csc-base-chinese* (HuggingFace model: [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese)).
- **Seq2Seq** (BART-based): released as *shibing624/bart4csc-base-chinese* (HuggingFace model: [shibing624/bart4csc-base-chinese](https://huggingface.co/shibing624/bart4csc-base-chinese)).
- **T5**: released as *shibing624/mengzi-t5-base-chinese-correction* (HuggingFace model: [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction)); fine-tuned, it reaches strong (SOTA-level) results on `SIGHAN 2015`.
# Install
```shell
pip install -U pycorrector
```
or
```shell
pip install -r requirements.txt
git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
pip install --no-deps .
```
#### Installation notes
* Run with Docker:
```shell
docker run -it -v ~/.pycorrector:/root/.pycorrector shibing624/pycorrector:0.0.2
```
The image ships with Python, kenlm, and pycorrector preinstalled; see the [Dockerfile](Dockerfile).
* Install kenlm (required for the rule-based corrector):
```shell
pip install https://github.com/kpu/kenlm/archive/master.zip
```
See the [kenlm install wiki](https://github.com/shibing624/pycorrector/wiki/Install-kenlm) if you hit build issues.
* Install the remaining dependencies:
```shell
pip install -r requirements.txt
```
# Usage
### Text correction
example: [examples/base_demo.py](examples/base_demo.py)
```python
import pycorrector
corrected_sent, detail = pycorrector.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)
```
output:
```
少先队员应该为老人让座 [('因该', '应该', 4, 6), ('坐', '座', 10, 11)]
```
> On first run, the kenlm language model `~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm` is downloaded automatically; you can also fetch it manually:
[zh_giga.no_cna_cmn.prune01244.klm (2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm)
### Error detection
example: [examples/detect_demo.py](examples/detect_demo.py)
```python
import pycorrector
idx_errors = pycorrector.detect('少先队员因该为老人让坐')
print(idx_errors)
```
output:
```
[['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]
```
> The return value is a `list` of `[error_word, begin_pos, end_pos, error_type]`; `pos` indices are 0-based positions in the original text.
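The `(error_word, correct_word, begin_pos, end_pos)` details returned by `correct` can be applied back to the original string with plain slicing; a small illustrative helper (not part of pycorrector), demonstrated on the English example data shown later in this README:

```python
def apply_corrections(text, details):
    """Apply (wrong, right, begin, end) corrections to text.

    Offsets refer to the ORIGINAL text, so apply right-to-left to keep
    earlier offsets valid even when a replacement changes the length.
    """
    for wrong, right, begin, end in sorted(details, key=lambda d: d[2], reverse=True):
        assert text[begin:end] == wrong  # offsets must match the original text
        text = text[:begin] + right + text[end:]
    return text

fixed = apply_corrections("what happending?", [("happending", "happening", 5, 15)])
```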
### Proper-noun correction
example: [examples/proper_correct_demo.py](examples/proper_correct_demo.py)
```python
import sys
sys.path.append("..")
from pycorrector.proper_corrector import ProperCorrector
m = ProperCorrector()
x = [
'',
'',
]
for i in x:
print(i, ' -> ', m.proper_correct(i))
```
output:
```
-> ('', [('', '', 2, 6)])
-> ('', [('', '', 3, 6)])
```
### Custom confusion set
A user-defined confusion set lets you (1) force corrections the model misses and (2) protect terms from being mis-corrected.
example: [examples/use_custom_confusion.py](examples/use_custom_confusion.py)
```python
import pycorrector
error_sentences = [
'iphonex',
'',
]
for line in error_sentences:
print(pycorrector.correct(line))
print('*' * 42)
pycorrector.set_custom_confusion_path_or_dict('./my_custom_confusion.txt')
for line in error_sentences:
print(pycorrector.correct(line))
```
output:
```
('iphonex', [])  # 'iphonex' was not corrected to 'iphoneX'
('', [['', '', 14, 17]])  # a proper noun was mis-corrected
*****************************************************
('iphonex', [['iphonex', 'iphoneX', 1, 8]])
('', [])
```
> Format of the custom confusion file `./my_custom_confusion.txt` (UTF-8, one whitespace-separated pair per line):
```
iPhone iPhoneX
```
> The custom confusion set takes effect in subsequent `correct` calls.
> `set_custom_confusion_path_or_dict` accepts either a file `path` (str) or a confusion `dict`.
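Conceptually, a confusion set is just a mapping applied to the text, with each fix reported in pycorrector's `(wrong, right, begin, end)` detail format; a standalone sketch of that substitution step (illustrative only, not pycorrector's internal implementation):

```python
def confusion_correct(text, confusion):
    """Replace each confusion-set key found in text, recording
    (wrong, right, begin, end) detail tuples as replacements are made."""
    details = []
    for wrong, right in confusion.items():
        start = 0
        while (idx := text.find(wrong, start)) != -1:
            details.append((wrong, right, idx, idx + len(wrong)))
            text = text[:idx] + right + text[idx + len(wrong):]
            start = idx + len(right)  # continue scanning after the replacement
    return text, sorted(details, key=lambda d: d[2])

text, details = confusion_correct("buy iphonex now", {"iphonex": "iphoneX"})
```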
### Custom language model
The default kenlm model `zh_giga.no_cna_cmn.prune01244.klm` (2.8G) is general-purpose; `pycorrector` also supports loading your own kenlm language model.
For example, a smaller (140M) char-level model trained on the 2014 People's Daily corpus: [people2014corpus_chars.klm (code: o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw)
example: [examples/load_custom_language_model.py](examples/load_custom_language_model.py)
```python
from pycorrector import Corrector
import os
pwd_path = os.path.abspath(os.path.dirname(__file__))
lm_path = os.path.join(pwd_path, './people2014corpus_chars.klm')
model = Corrector(language_model_path=lm_path)
corrected_sent, detail = model.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)
```
output:
```
少先队员应该为老人让座 [('因该', '应该', 4, 6), ('坐', '座', 10, 11)]
```
### English spelling correction
example: [examples/en_correct_demo.py](examples/en_correct_demo.py)
```python
import pycorrector
sent = "what happending? how to speling it, can you gorrect it?"
corrected_text, details = pycorrector.en_correct(sent)
print(sent, '=>', corrected_text)
print(details)
```
output:
```
what happending? how to speling it, can you gorrect it?
=> what happening? how to spelling it, can you correct it?
[('happending', 'happening', 5, 15), ('speling', 'spelling', 24, 31), ('gorrect', 'correct', 44, 51)]
```
### Traditional and Simplified Chinese conversion
example: [examples/traditional_simplified_chinese_demo.py](examples/traditional_simplified_chinese_demo.py)
```python
import pycorrector
traditional_sentence = ''
simplified_sentence = pycorrector.traditional2simplified(traditional_sentence)
print(traditional_sentence, '=>', simplified_sentence)
simplified_sentence = ''
traditional_sentence = pycorrector.simplified2traditional(simplified_sentence)
print(simplified_sentence, '=>', traditional_sentence)
```
output:
```
=>
=>
```
### Command-line usage
```
python -m pycorrector -h
usage: __main__.py [-h] -o OUTPUT [-n] [-d] input
@description:
positional arguments:
input the input file path, file encode need utf-8.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
the output file path.
-n, --no_char disable char detect mode.
-d, --detail print detail info
```
Example:
```
python -m pycorrector input.txt -o out.txt -n -d
```
> Reads `input.txt` (UTF-8, one sentence per line) and writes corrections to `out.txt`, with fields separated by `\t`.
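The CLI essentially streams a UTF-8 file line by line through the corrector and writes tab-separated results. A minimal stand-in sketch; the dummy `correct` function below replaces pycorrector's real one, so only the I/O shape is faithful:

```python
from pathlib import Path

def correct(line):
    """Stand-in for pycorrector.correct(): returns (corrected_text, details)."""
    return line, []

def correct_file(input_path, output_path):
    """Mirror `python -m pycorrector input.txt -o out.txt`:
    read UTF-8 input line by line, write "corrected<TAB>details" per line."""
    out = []
    for line in Path(input_path).read_text(encoding="utf-8").splitlines():
        corrected, details = correct(line)
        out.append(f"{corrected}\t{details}")
    Path(output_path).write_text("\n".join(out) + "\n", encoding="utf-8")

# Create a tiny input file so the sketch is runnable end to end.
Path("input.txt").write_text("hello world\n", encoding="utf-8")
correct_file("input.txt", "out.txt")
```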
# Deep Model Usage
``[macbert](./pycorrector/macbert)[seq2seq](./pycorrector/seq2seq)
[bert](./pycorrector/bert)[electra](./pycorrector/electra)[transformer](./pycorrector/transformer)
[ernie-csc](./pycorrector/ernie_csc)[T5](./pycorrector/t5)`pycorrector``README.md`
-
```
pip install -r requirements-dev.txt
```
## Model details
### MacBert4csc [recommended]
The MacBERT-based correction model is released on HuggingFace Models: [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese)
Architecture notes:
- MacBERT4CSC uses a MacBERT (BERT-style) backbone.
- On top of BERT it adds an error-[detection](https://github.com/shibing624/pycorrector/blob/c0f31222b7849c452cc1ec207c71e9954bd6ca08/pycorrector/macbert/macbert4csc.py#L18) head; MacBERT4CSC trains detection and correction jointly, combining the two losses as a weighted sum, with the correction head being BERT's MLM objective.
For details see [pycorrector/macbert/README.md](./pycorrector/macbert/README.md).
example: [examples/macbert_demo.py](examples/macbert_demo.py)
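The joint objective can be sketched as a weighted sum of the two losses; the weight `w` below is an illustrative placeholder, not the value used in pycorrector's macbert training config:

```python
def macbert4csc_loss(detection_loss, correction_loss, w=0.3):
    """Joint loss: w * detection + (1 - w) * correction (MLM).
    w=0.3 is an illustrative placeholder, not the project's setting."""
    return w * detection_loss + (1 - w) * correction_loss

loss = macbert4csc_loss(detection_loss=0.8, correction_loss=2.0)
```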
#### Use with pycorrector
```python
import sys
sys.path.append("..")
from pycorrector.macbert.macbert_corrector import MacBertCorrector
if __name__ == '__main__':
error_sentences = [
'',
'',
'',
'',
'',
]
m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
for line in error_sentences:
correct_sent, err = m.macbert_correct(line)
print("query:{} => {}, err:{}".format(line, correct_sent, err))
```
output:
```bash
query: => , err:[('', '', 14, 15)]
query: => , err:[('', '', 4, 5)]
query: => , err:[('', '', 1, 2), ('', '', 10, 11)]
query: => , err:[]
query: => , err:[('', '', 6, 7)]
```
#### Use directly with transformers
```python
import operator
import torch
from transformers import BertTokenizerFast, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizerFast.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)
texts = ["", ""]
text_tokens = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
outputs = model(**text_tokens)
def get_errors(corrected_text, origin_text):
sub_details = []
for i, ori_char in enumerate(origin_text):
if ori_char in [' ', '', '', '', '', '\n', '', '', '']:
# add unk word
corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
continue
if i >= len(corrected_text):
break
if ori_char != corrected_text[i]:
if ori_char.lower() == corrected_text[i]:
# pass english upper char
corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
continue
sub_details.append((ori_char, corrected_text[i], i, i + 1))
sub_details = sorted(sub_details, key=operator.itemgetter(2))
return corrected_text, sub_details
result = []
for ids, (i, text) in zip(outputs.logits, enumerate(texts)):
_text = tokenizer.decode((torch.argmax(ids, dim=-1) * text_tokens.attention_mask[i]),
skip_special_tokens=True).replace(' ', '')
corrected_text, details = get_errors(_text, text)
print(text, ' => ', corrected_text, details)
result.append((corrected_text, details))
print(result)
```
output:
```shell
=> [('', '', 2, 3)]
=> [('', '', 15, 16)]
```
Files in the released model directory:
```
macbert4csc-base-chinese
config.json
added_tokens.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.txt
```
### ErnieCSC model
The ERNIE-based CSC model is released for use with [PaddleNLP](https://bj.bcebos.com/paddlenlp/taskflow/text_correction/csc-ernie-1.0/csc-ernie-1.0.pdparams).
Model download: [csc-ernie-1.0.pdparams](https://bj.bcebos.com/paddlenlp/taskflow/text_correction/csc-ernie-1.0/csc-ernie-1.0.pdparams)
For details see [pycorrector/ernie_csc/README.md](./pycorrector/ernie_csc/README.md).
example: [examples/ernie_csc_demo.py](examples/ernie_csc_demo.py)
#### Use with pycorrector
```python
from pycorrector.ernie_csc.ernie_csc_corrector import ErnieCSCCorrector
if __name__ == '__main__':
error_sentences = [
'',
'',
'',
'',
'',
]
corrector = ErnieCSCCorrector("csc-ernie-1.0")
for line in error_sentences:
result = corrector.ernie_csc_correct(line)[0]
print("query:{} => {}, err:{}".format(line, result['target'], result['errors']))
```
output:
```bash
query: => , err:[{'position': 14, 'correction': {'': ''}}]
query: => , err:[{'position': 4, 'correction': {'': ''}}, {'position': 10, 'correction': {'': ''}}]
query: => , err:[{'position': 1, 'correction': {'': ''}}, {'position': 10, 'correction': {'': ''}}]
query: => , err:[]
query: => , err:[{'position': 6, 'correction': {'': ''}}]
```
#### Use with PaddleNLP
You can also run the model through PaddleNLP's Taskflow:
```python
from paddlenlp import Taskflow
text_correction = Taskflow("text_correction")
text_correction('')
text_correction('')
```
output:
```shell
[{'source': '',
'target': '',
'errors': [{'position': 3, 'correction': {'': ''}}]}]
[{'source': '',
'target': '',
'errors': [{'position': 18, 'correction': {'': ''}}]}]
```
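The Taskflow output encodes each fix as a `position` plus a one-character `correction` mapping; applying such a record back to the source string is straightforward. An illustrative helper (not a PaddleNLP API), assuming 0-based character positions:

```python
def apply_csc_errors(source, errors):
    """Apply Taskflow-style errors: each has a 0-based 'position' and a
    {'wrong_char': 'right_char'} correction mapping."""
    chars = list(source)
    for err in errors:
        pos = err["position"]
        for wrong, right in err["correction"].items():
            assert chars[pos] == wrong  # position must point at the wrong char
            chars[pos] = right
    return "".join(chars)

target = apply_csc_errors("a dog", [{"position": 2, "correction": {"d": "f"}}])
```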
### Bart model
```python
from transformers import BertTokenizerFast
from textgen import BartSeq2SeqModel
tokenizer = BertTokenizerFast.from_pretrained('shibing624/bart4csc-base-chinese')
model = BartSeq2SeqModel(
encoder_type='bart',
encoder_decoder_type='bart',
encoder_decoder_name='shibing624/bart4csc-base-chinese',
tokenizer=tokenizer,
args={"max_length": 128, "eval_batch_size": 128})
sentences = [""]
print(model.predict(sentences))
```
output:
```shell
['']
```
Training code for the Bart model: https://github.com/shibing624/textgen/blob/main/examples/seq2seq/training_bartseq2seq_zh_demo.py
#### Release models
The Bart model trained on SIGHAN+Wang271K is released on HuggingFace Models:
- BART model: [shibing624/bart4csc-base-chinese](https://huggingface.co/shibing624/bart4csc-base-chinese)
### ConvSeq2Seq model
Code: [pycorrector/seq2seq](pycorrector/seq2seq)
#### Train
data example:
```
# train.txt:
```
```shell
cd seq2seq
python train.py
```
The `convseq2seq` model is trained on the SIGHAN corpus for 200 epochs on a single P40 GPU.
#### Predict
```shell
python infer.py
```
Notes on the output:
1. Out-of-vocabulary characters are emitted as `unk`; training on a larger corpus (e.g. nlpcc2018+hsk) mitigates this.
2. Inference is much faster on GPU; running on CPU is noticeably slower.
#### Release models
The convseq2seq model trained on SIGHAN 2015 is released on GitHub:
- convseq2seq model url: https://github.com/shibing624/pycorrector/releases/download/0.4.5/convseq2seq_correction.tar.gz
# Dataset
| Dataset | Description | Download link | Size |
| :------- | :--------- | :---------: | :---------: |
| **`SIGHAN+Wang271K`** | SIGHAN+Wang271K Chinese spelling-check data (270k sentences) | [Baidu pan (code: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ)| 106M |
| **`SIGHAN`** | SIGHAN 13/14/15 official data | [csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html)| 339K |
| **`Wang271K`** | Wang271K automatically generated data | [Automatic-Corpus-Generation dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml)| 93M |
| **`People's Daily 2014`** | People's Daily 2014 corpus | [Feishu (code: cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code)| 383M |
| **`NLPCC 2018 GEC`** | NLPCC2018-GEC training data | [trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
| **`NLPCC 2018+HSK`** | nlpcc2018+hsk+CGED | [Baidu pan (code: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) / [Feishu (code: gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
| **`NLPCC 2018+HSK`** | HSK+Lang8 | [Baidu pan (code: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) / [Feishu (code: Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
| **`CTC`** | Chinese Text Correction (CTC) competition data | [Tianchi](https://tianchi.aliyun.com/dataset/138195) | - |

Notes:
- SIGHAN+Wang271K (270k sentences) merges SIGHAN 13/14/15 with Wang271K, converted to JSON; the SIGHAN test set is test.json. It is the training data used for the macbert4csc paper; see [pycorrector/macbert/README.md](pycorrector/macbert/README.md). Each record contains an `id`, the `original_text`, `wrong_ids` (indices of the wrong characters), and the `correct_text`.
- NLPCC 2018 GEC official data: [NLPCC2018-GEC](http://tcci.ccf.org.cn/conference/2018/taskdata.php), [trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) [114.5MB].
- HSK and Lang8 data: [Baidu pan (code: n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g).
- NLPCC 2018 + HSK + CGED 16/17/18 combined (nlpcc2018+hsk): [Baidu pan (code: m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) [215MB].

## Language Model
[Statistical language model wiki](https://github.com/shibing624/pycorrector/wiki/%E7%BB%9F%E8%AE%A1%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B%E5%8E%9F%E7%90%86)
- General-domain kenlm model: [zh_giga.no_cna_cmn.prune01244.klm (2.8G)](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm)
- People's Daily 2014 char-level model: [people2014corpus_chars.klm (code: o5e9)](https://pan.baidu.com/s/1I2GElyHy_MAdek3YaziFYw)
To train your own model from the People's Daily 2014 corpus (text utilities in pycorrector.utils.text_utils; kenlm training tutorial: http://blog.csdn.net/mingzai624/article/details/79560063):
1. Download people2014.tar.gz.
2. Tokenize it into people2014_words.txt.
3. Train a char-level kenlm model: people2014corpus_chars.arps/klm.
4. Train a word-level kenlm model: people2014corpus_words.arps/klm.

# Todo
- [x] seq2seq model
- [x] seq2seq_attention with dropout
- [x] seq2seq with pointer-generator network, beam search, unknown-word replacement, and coverage mechanism
- [x] bert fine-tuned on wiki data (transformers 2.10.0)
- [x] TensorFlow 2.0 support
- [x] bert mask-based correction
- [x] electra model
- [x] bert/ernie models

# Contact
- GitHub issues (suggested): [issue tracker](https://github.com/shibing624/pycorrector/issues)
- GitHub discussions: [discussions](https://github.com/shibing624/pycorrector/discussions)
- Email: xuming624@qq.com
# Citation
If you use pycorrector in your research, please cite it:
APA:
```latex
Xu, M. Pycorrector: Text error correction tool (Version 0.4.2) [Computer software]. https://github.com/shibing624/pycorrector
```
BibTeX:
```latex
@misc{Xu_Pycorrector_Text_error,
title={Pycorrector: Text error correction tool},
author={Xu Ming},
year={2021},
howpublished={\url{https://github.com/shibing624/pycorrector}},
}
```
# License
pycorrector is released under the **Apache License 2.0** and is free for commercial use; please include a link to pycorrector and the license in your product documentation.
# Contribute
Contributions are welcome. Before submitting a PR:
- Add corresponding unit tests in `tests`.
- Run `python -m pytest` and make sure all tests pass.
Then submit the PR.
# Reference
* [CSDN blog post on text correction](https://blog.csdn.net/mingzai624/article/details/82390382)
* [Norvig's spelling corrector](http://norvig.com/spell-correct.html)
* [Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape[Yu, 2013]](http://www.aclweb.org/anthology/W/W14/W14-6835.pdf)
* [Chinese Spelling Checker Based on Statistical Machine Translation[Chiu, 2013]](http://www.aclweb.org/anthology/O/O13/O13-1005.pdf)
* [Chinese Word Spelling Correction Based on Rule Induction[yeh, 2014]](http://aclweb.org/anthology/W14-6822)
* [Neural Language Correction with Character-Based Attention[Ziang Xie, 2016]](https://arxiv.org/pdf/1603.09727.pdf)
* [Chinese Spelling Check System Based on Tri-gram Model[Qiang Huang, 2014]](http://www.anthology.aclweb.org/W/W14/W14-6827.pdf)
* [Neural Abstractive Text Summarization with Sequence-to-Sequence Models[Tian Shi, 2018]](https://arxiv.org/abs/1812.02303)
* [[, 2019]](https://github.com/shibing624/pycorrector/blob/master/docs/.pdf)
* [A Sequence to Sequence Learning for Chinese Grammatical Error Correction[Hongkai Ren, 2018]](https://link.springer.com/chapter/10.1007/978-3-319-99501-4_36)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB)
* [Revisiting Pre-trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)
* Ruiqing Zhang, Chao Pang et al. "Correcting Chinese Spelling Errors with Phonetic Pre-training", ACL, 2021
* DingminWang et al. "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check", EMNLP, 2018