nerpy

🌈 NERpy: Implementation of Named Entity Recognition using Python. A named entity recognition toolkit supporting models such as BertSoftmax and BertSpan, ready to use out of the box.

https://github.com/shibing624/nerpy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.5%) to scientific vocabulary

Keywords

bert bert-softmax bert-span named-entity-recognition ner nlp pytorch transformers
Last synced: 6 months ago

Repository

🌈 NERpy: Implementation of Named Entity Recognition using Python. A named entity recognition toolkit supporting models such as BertSoftmax and BertSpan, ready to use out of the box.

Basic Info
  • Host: GitHub
  • Owner: shibing624
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.13 MB
Statistics
  • Stars: 115
  • Watchers: 3
  • Forks: 15
  • Open Issues: 2
  • Releases: 3
Topics
bert bert-softmax bert-span named-entity-recognition ner nlp pytorch transformers
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
Readme Contributing License Citation

README.md


NERpy

🌈 Implementation of Named Entity Recognition using Python.

nerpy implements several named entity recognition models, including BertSoftmax, BertCrf, and BertSpan, and compares their performance on standard datasets.

Guide - Feature - Evaluation - Install - Usage - Contact - Reference

Feature

Named Entity Recognition Models

  • BertSoftmax: performs entity recognition on top of the pretrained BERT model; this project provides PyTorch-based training and prediction for BertSoftmax
  • BertSpan: trains span-boundary representations on top of BERT, a structure better suited to entity boundary detection; this project provides PyTorch-based training and prediction for BertSpan

Evaluation

Entity Recognition

  • Evaluation results on English NER datasets:

| Arch | Backbone | Model Name | CoNLL-2003 | QPS |
| :-- | :--- | :--- | :-: | :--: |
| BertSoftmax | bert-base-uncased | bert4ner-base-uncased | 90.43 | 235 |
| BertSoftmax | bert-base-cased | bert4ner-base-cased | 91.17 | 235 |
| BertSpan | bert-base-uncased | bertspan4ner-base-uncased | 90.61 | 210 |
| BertSpan | bert-base-cased | bertspan4ner-base-cased | 91.90 | 224 |

  • Evaluation results on Chinese NER datasets:

| Arch | Backbone | Model Name | CNER | PEOPLE | MSRA-NER | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: |
| BertSoftmax | bert-base-chinese | bert4ner-base-chinese | 94.98 | 95.25 | 94.65 | 222 |
| BertSpan | bert-base-chinese | bertspan4ner-base-chinese | 96.03 | 96.06 | 95.03 | 254 |

  • Evaluation results of the models released by this project:

| Arch | Backbone | Model Name | CNER(zh) | PEOPLE(zh) | CoNLL-2003(en) | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: |
| BertSpan | bert-base-chinese | shibing624/bertspan4ner-base-chinese | 96.03 | 96.06 | - | 254 |
| BertSoftmax | bert-base-chinese | shibing624/bert4ner-base-chinese | 94.98 | 95.25 | - | 222 |
| BertSoftmax | bert-base-uncased | shibing624/bert4ner-base-uncased | - | - | 90.43 | 243 |

Notes:
- All values are F1 scores
- Each model was trained only on the dataset's train split and evaluated on its test split, with no external data
- The shibing624/bertspan4ner-base-chinese model, trained with the BertSpan method, reaches SOTA among base-size models; run examples/training_bertspan_zh_demo.py to reproduce the results on the Chinese datasets
- The shibing624/bert4ner-base-chinese model, trained with the BertSoftmax method, reaches strong results among base-size models; run examples/training_ner_model_zh_demo.py to reproduce the results on the Chinese datasets
- The shibing624/bert4ner-base-uncased model was trained with the BertSoftmax method; run examples/training_ner_model_en_demo.py to reproduce the results on the English CoNLL-2003 dataset
- All pretrained models can be loaded via transformers, e.g. the Chinese BERT model: --model_name bert-base-chinese
- Download links for the Chinese NER datasets are given below
- QPS was measured on a Tesla V100 GPU with 32 GB of memory
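The scores above are entity-level F1: a predicted entity counts as correct only when both its type and its exact boundaries match the gold annotation. A self-contained, simplified sketch of this metric (the project itself evaluates with seqeval; `bio_entities` and `entity_f1` below are illustrative helpers, not project code):

```python
def bio_entities(tags):
    """Extract (type, start, end) entities from a BIO tag sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes a trailing entity
        boundary = tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype)
        if boundary and start is not None:
            entities.append((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """Entity-level F1: an entity counts only if type AND boundaries match exactly."""
    gold = set(bio_entities(true_tags))
    pred = set(bio_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

true = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_f1(true, pred))  # 0.666... (precision 1.0, recall 0.5)
```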

Demo

Demo: https://huggingface.co/spaces/shibing624/nerpy

Run the example examples/gradio_demo.py to see the demo:

```shell
python examples/gradio_demo.py
```

Install

Python 3.8+

```shell
pip install torch  # or: conda install pytorch
pip install -U nerpy
```

or

```shell
pip install torch  # or: conda install pytorch
pip install -r requirements.txt

git clone https://github.com/shibing624/nerpy.git
cd nerpy
pip install --no-deps .
```

Usage

Named Entity Recognition

English entity recognition:

```python
from nerpy import NERModel

model = NERModel("bert", "shibing624/bert4ner-base-uncased")
predictions, raw_outputs, entities = model.predict(
    ["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
# entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Chinese entity recognition:

```python
from nerpy import NERModel

model = NERModel("bert", "shibing624/bert4ner-base-chinese")
predictions, raw_outputs, entities = model.predict(
    ["常建良,男,1963年出生,工科学士,高级工程师"], split_on_space=False)
# entities: [('常建良', 'PER'), ('1963年', 'TIME')]
```

example: examples/base_zh_demo.py

```python
import sys

sys.path.append('..')
from nerpy import NERModel

if __name__ == '__main__':
    # BertSoftmax Chinese NER model: NERModel("bert", "shibing624/bert4ner-base-chinese")
    # BertSpan Chinese NER model: NERModel("bertspan", "shibing624/bertspan4ner-base-chinese")
    model = NERModel("bert", "shibing624/bert4ner-base-chinese")
    sentences = [
        "常建良,男,1963年出生,工科学士,高级工程师,北京物资学院客座副教授",
        "1985年8月-1993年在国家物资局、物资部、国内贸易部金属材料流通司从事国家统配钢材中特种钢材品种的调拨分配工作,先后任科员、主任科员。"
    ]
    predictions, raw_outputs, entities = model.predict(sentences)
    print(entities)
```

output:

```
[('常建良', 'PER'), ('1963年', 'TIME'), ('北京物资学院', 'ORG')]
[('1985年', 'TIME'), ('8月', 'TIME'), ('1993年', 'TIME'), ('国家物资局', 'ORG'), ('物资部', 'ORG'), ('国内贸易部金属材料流通司', 'ORG')]
```

  • The shibing624/bert4ner-base-chinese model was trained with the BertSoftmax method on the Chinese PEOPLE (People's Daily) dataset. It has been uploaded to the HuggingFace model hub as shibing624/bert4ner-base-chinese and is the default model used by nerpy.NERModel; you can call it as in the examples above, or via the transformers library as shown in the next section. The model is downloaded automatically to ~/.cache/huggingface/transformers
  • The shibing624/bertspan4ner-base-chinese model was trained with the BertSpan method on the Chinese PEOPLE (People's Daily) dataset and has been uploaded to the HuggingFace model hub as shibing624/bertspan4ner-base-chinese

Usage (HuggingFace Transformers)

Without nerpy, you can use the model like this:

First, pass your input through the transformer model, then decode the BIO tags to recover the entity words.

example: examples/predict_use_origin_transformers_zh_demo.py

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    char_tags = [(token, label_list[prediction]) for token, prediction in
                 zip(tokens, predictions[0].numpy())][1:-1]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```

output:

```
王宏伟来自北京,是个警察,喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), (',', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
```

Datasets

Entity Recognition Datasets

| Dataset | Corpus | Download Link | File Size |
| :------- | :--------- | :---------: | :---------: |
| CNER Chinese NER dataset | CNER (120K characters) | CNER github | 1.1MB |
| PEOPLE Chinese NER dataset | People's Daily dataset (2M characters) | PEOPLE github | 12.8MB |
| MSRA-NER Chinese NER dataset | MSRA-NER dataset (46K samples, 2.216M characters) | MSRA-NER github | 3.6MB |
| CoNLL03 English NER dataset | CoNLL-2003 dataset (220K words) | CoNLL03 github | 1.7MB |

Data format of the CNER Chinese NER dataset:

```text
美 B-LOC
国 I-LOC
的 O
华 B-PER
莱 I-PER
士 I-PER

我 O
跟 O
他 O
```
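A minimal reader for this format could look as follows (assuming, as in the sample above, one character-tag pair per line and a blank line between sentences; `read_bio` is an illustrative helper, not part of nerpy):

```python
# Minimal parser for the BIO format shown above: "char TAG" pairs, one per
# line, with blank lines separating sentences (assumption based on the sample).
def read_bio(text):
    sentences, words, labels = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if words:
                sentences.append((words, labels))
                words, labels = [], []
            continue
        w, t = line.split()
        words.append(w)
        labels.append(t)
    if words:  # flush a trailing sentence without a final blank line
        sentences.append((words, labels))
    return sentences

sample = "美 B-LOC\n国 I-LOC\n的 O\n\n我 O\n跟 O\n他 O\n"
print(read_bio(sample))
# [(['美', '国', '的'], ['B-LOC', 'I-LOC', 'O']), (['我', '跟', '他'], ['O', 'O', 'O'])]
```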

BertSoftmax Model

The BertSoftmax entity recognition model applies standard BERT-based sequence labeling:
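Concretely, the softmax head scores every BIO label for every token and the highest-scoring label is taken per token. A minimal, hypothetical decoding sketch (an illustration, not nerpy internals; the label set matches the released bert4ner-base-chinese model shown earlier):

```python
# Label set of the released Chinese model, as listed in the transformers example.
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

def decode(logits_per_token):
    """Take the argmax label per token; logits_per_token is a list of score lists."""
    return [label_list[max(range(len(row)), key=row.__getitem__)]
            for row in logits_per_token]

# Two tokens: the first scores highest on 'B-LOC' (index 1),
# the second on 'I-LOC' (index 4). Scores are made up for the example.
fake_logits = [
    [0.1, 2.3, 0.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.0, 1.9, 0.0, 0.0, 0.0, 0.0],
]
print(decode(fake_logits))  # ['B-LOC', 'I-LOC']
```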

Network structure:

Model files:

```
shibing624/bert4ner-base-chinese
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

BertSoftmax Model Training and Prediction

training example: examples/training_ner_model_toy_demo.py

```python
import sys

import pandas as pd

sys.path.append('..')
from nerpy.ner_model import NERModel

# Creating samples
train_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "started", "O"],
    [0, "with", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "can", "O"],
    [1, "now", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
train_data = pd.DataFrame(train_samples, columns=["sentence_id", "words", "labels"])

test_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "was", "O"],
    [0, "built", "O"],
    [0, "for", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "then", "O"],
    [1, "expanded", "O"],
    [1, "to", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
test_data = pd.DataFrame(test_samples, columns=["sentence_id", "words", "labels"])

# Create a NERModel
model = NERModel(
    "bert",
    "bert-base-uncased",
    args={"overwrite_output_dir": True, "reprocess_input_data": True, "num_train_epochs": 1},
    use_cuda=False,
)

# Train the model
model.train_model(train_data)

# Evaluate the model
result, model_outputs, predictions = model.eval_model(test_data)
print(result, model_outputs, predictions)

# Predictions on text strings
sentences = ["Nerpy Model perform sentence NER", "HuggingFace Transformers build for text"]
predictions, raw_outputs, entities = model.predict(sentences, split_on_space=True)
print(predictions, entities)
```

  • Train and evaluate the BertSoftmax model on the Chinese CNER dataset

example: examples/training_ner_model_zh_demo.py

```shell
cd examples
python training_ner_model_zh_demo.py --do_train --do_predict --num_epochs 5 --task_name cner
```

  • Train and evaluate the BertSoftmax model on the English CoNLL-2003 dataset

example: examples/training_ner_model_en_demo.py

```shell
cd examples
python training_ner_model_en_demo.py --do_train --do_predict --num_epochs 5
```

BertSpan Model Training and Prediction

  • Train and evaluate the BertSpan model on the Chinese CNER dataset

example: examples/training_bertspan_zh_demo.py

```shell
cd examples
python training_bertspan_zh_demo.py --do_train --do_predict --num_epochs 5 --task_name cner
```

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: xuming624@qq.com
  • WeChat: add my WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group.

Citation

If you use nerpy in your research, please cite it as follows:

APA:

```
Xu, M. nerpy: Named Entity Recognition Toolkit (Version 0.0.2) [Computer software]. https://github.com/shibing624/nerpy
```

BibTeX:

```latex
@software{Xu_nerpy_Text_to,
  author = {Xu, Ming},
  title = {{nerpy: Named Entity Recognition Toolkit}},
  url = {https://github.com/shibing624/nerpy},
  version = {0.0.2}
}
```

License

nerpy is released under The Apache License 2.0 and is free for commercial use. Please include a link to nerpy and the license in your product documentation.

Contribute

The project code is still rough. Improvements to the code are welcome; before submitting, please note the following two points:

  • Add corresponding unit tests under tests
  • Run all unit tests with python -m pytest and make sure they all pass

After that, you can submit a PR.

Reference

Owner

  • Name: xuming
  • Login: shibing624
  • Kind: user
  • Location: Beijing, China
  • Company: @tencent

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Xu
    given-names: Ming
    orcid: https://orcid.org/0000-0003-3402-7159
title: "nerpy: Named Entity Recognition toolkit"
version: 0.0.3
date-released: 2022-02-27
url: "https://github.com/shibing624/nerpy"

GitHub Events

Total
  • Watch event: 7
  • Fork event: 1
Last Year
  • Watch event: 7
  • Fork event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 87
  • Total Committers: 2
  • Avg Commits per committer: 43.5
  • Development Distribution Score (DDS): 0.011
Past Year
  • Commits: 4
  • Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
shibing624 s****4@1****m 86
XingKaiXin x****n@g****m 1
Committer Domains (Top 20 + Academic)
126.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 1
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 11 hours
  • Total issue authors: 7
  • Total pull request authors: 1
  • Average comments per issue: 3.75
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ScottishFold007 (2)
  • Hephaestusxg (1)
  • xxllp (1)
  • DuBaiSheng (1)
  • bingxin3chen (1)
  • zlszhonglongshen (1)
Pull Request Authors
  • xingkaixin (1)
Top Labels
Issue Labels
question (3) bug (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 32 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 10
  • Total maintainers: 1
pypi.org: nerpy

nerpy: Named Entity Recognition toolkit using Python

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 32 Last month
Rankings
Stargazers count: 7.7%
Dependent packages count: 10.1%
Forks count: 10.9%
Average: 18.8%
Dependent repos count: 21.6%
Downloads: 44.0%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • datasets *
  • jieba >=0.39
  • loguru *
  • numpy *
  • pandas *
  • scipy *
  • seqeval *
  • tensorboard *
  • tqdm *
  • transformers >=4.6.0
setup.py pypi
  • datasets *
  • jieba >=0.39
  • loguru *
  • numpy *
  • pandas *
  • scipy *
  • seqeval *
  • tensorboard *
  • tqdm *
  • transformers >=4.6.0
.github/workflows/ubuntu.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/windows.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite