https://github.com/percevalw/nlstruct

Natural language structuring library

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    5 of 9 committers (55.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

deep-learning machine-learning natural-language-processing notebook python structured-data
Last synced: 5 months ago

Repository

Natural language structuring library

Basic Info
  • Host: GitHub
  • Owner: percevalw
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 543 KB
Statistics
  • Stars: 21
  • Watchers: 3
  • Forks: 11
  • Open Issues: 9
  • Releases: 2
Topics
deep-learning machine-learning natural-language-processing notebook python structured-data
Created about 6 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

NLStruct

Natural language structuring library. Currently, it implements a nested NER model and a span classification model, but other algorithms might follow.

If you find this library useful in your research, please consider citing:

```
@phdthesis{wajsburt:tel-03624928,
  TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},
  AUTHOR = {Wajsb{\"u}rt, Perceval},
  URL = {https://hal.archives-ouvertes.fr/tel-03624928},
  SCHOOL = {{Sorbonne Universit{\'e}}},
  YEAR = {2021},
  MONTH = Dec,
  KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},
  TYPE = {Theses},
  PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},
  HAL_ID = {tel-03624928},
  HAL_VERSION = {v1},
}
```

This work was performed at LIMICS, in collaboration with AP-HP's Clinical Data Warehouse and funded by the Institute of Computing and Data Science.

Features

  • processes large documents seamlessly: it automatically handles tokenization and sentence splitting
  • do not train twice: an automatic caching mechanism detects when an experiment has already been run
  • stop & resume with checkpoints
  • easy import and export of data
  • handles nested or overlapping entities
  • multi-label classification of recognized entities
  • strict or relaxed multi-label end-to-end retrieval metrics (illustrated in the sketch after this list)
  • pretty logging with rich-logger
  • heavily customizable, without config files (see train_ner.py)
  • built on top of transformers and pytorch_lightning
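
To make the strict/relaxed distinction concrete, here is a minimal sketch of the two matching regimes. This illustrates the general idea only; it is not nlstruct's actual metric implementation, and the spans below are hypothetical.

```python
# Illustration only (not nlstruct's code): strict vs. relaxed entity matching.

def strict_match(pred, gold):
    # Strict: begin, end and label must all match exactly.
    return pred == gold

def relaxed_match(pred, gold):
    # Relaxed: same label and overlapping character spans suffice.
    (pb, pe, pl), (gb, ge, gl) = pred, gold
    return pl == gl and pb < ge and gb < pe

gold = [(19, 28, "substance")]
pred = [(19, 29, "substance")]  # predicted span is one character too long

print(any(strict_match(p, g) for g in gold for p in pred))   # False
print(any(relaxed_match(p, g) for g in gold for p in pred))  # True
```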

Training models

How to train a NER model

```python
from nlstruct.recipes import train_ner

model = train_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```

How to use it

```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner = load_pretrained("model.pt")
ner.eval()
ner.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."})
```

Out:

```python
{'doc_id': 'doc-0',
 'text': 'Je lui prescris du lorazepam.',
 'entities': [{'entity_id': 0,
   'label': ['substance'],
   'attributes': [],
   'fragments': [{'begin': 19,
     'end': 28,
     'label': 'substance',
     'text': 'lorazepam'}],
   'confidence': 0.9998705969553088}]}
```

```python
export_to_brat(ner.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
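
To tag several in-memory documents, the single-document predict call above can simply be repeated. A minimal sketch, continuing from the `ner` model loaded above; the loop and the example texts are ours, only the doc_id/text schema comes from the README:

```python
# Sketch: reuse the documented single-document predict call in a loop.
# The texts below are hypothetical; the dict schema matches the example above.
docs = [
    {"doc_id": "doc-1", "text": "Le patient prend de l'aspirine."},
    {"doc_id": "doc-2", "text": "Arrêt du lorazepam hier."},
]
preds = [ner.predict(doc) for doc in docs]
```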

How to train a NER model followed by a span classification model

```python
from nlstruct.recipes import train_qualified_ner

model = train_qualified_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```

Ensembling

Easily ensemble multiple models (same architecture, different seeds):

```python
model1 = load_pretrained("model-1.pt")
model2 = load_pretrained("model-2.pt")
model3 = load_pretrained("model-3.pt")
ensemble = model1.ensemble_with([model2, model3]).cuda()
export_to_brat(ensemble.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
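
The checkpoints being ensembled would typically come from the same recipe run with different seeds. A hedged sketch reusing the train_ner call from above; the loop and file naming are ours, the parameters are taken from the README example:

```python
from nlstruct.recipes import train_ner

# Sketch: train three models that differ only by their seed, then save each
# under a distinct name so they can be ensembled as shown above.
for seed in (41, 42, 43):
    model = train_ner(
        dataset={"train": "path to your train brat/standoff data", "val": 0.05},
        finetune_bert=False,
        seed=seed,
        bert_name="camembert/camembert-base",
        fasttext_file="",
        gpus=0,
        xp_name=f"my-xp-seed{seed}",  # hypothetical experiment name
        return_model=True,
    )
    model.save_pretrained(f"model-{seed}.pt")
```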

Advanced use

Should you need to further configure the training of a model, modify one of the recipes in the recipes folder directly.

Install

This project is still under development and subject to changes.

```bash
pip install nlstruct==0.2.0
```
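
Note that the published package pins exact versions of its dependencies (see the Dependencies section below, e.g. torch ==1.11.0), so installing it into a fresh virtual environment is advisable to avoid version conflicts.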

Owner

  • Name: Perceval Wajsburt
  • Login: percevalw
  • Kind: user
  • Location: Paris
  • Company: APHP

PhD in medical NLP; my main areas of interest are NLP, structured prediction models, and UI development.

GitHub Events

Total
  • Watch event: 3
  • Fork event: 2
Last Year
  • Watch event: 3
  • Fork event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 399
  • Total Committers: 9
  • Avg Commits per committer: 44.333
  • Development Distribution Score (DDS): 0.125
Past Year
  • Commits: 6
  • Committers: 4
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.667
Top Committers
Name Email Commits
Perceval Wajsburt p****t@g****m 349
Perceval Wajsburt p****t@s****r 32
tannier x****r@s****r 5
Perceval Wajsbürt p****t@a****r 4
YoannT y****e@g****m 3
Ghislain Vaillant g****t@i****r 2
tannier t****r@c****r 2
Camila A d****l@i****r 1
solenn-tl s****l@g****m 1
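
For reference, the DDS values above are consistent with defining the score as one minus the top committer's share of commits; a quick check (the formula is our assumption, the numbers come from the tables above):

```python
# Assumed definition: DDS = 1 - top_committer_commits / total_commits.
# All time: the top committer row shows 349 of 399 commits.
dds_all_time = 1 - 349 / 399
print(round(dds_all_time, 3))  # 0.125, matching the reported all-time DDS
```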

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: about 13 hours
  • Total issue authors: 3
  • Total pull request authors: 9
  • Average comments per issue: 0.75
  • Average comments per pull request: 1.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • solenn-tl (2)
  • meesamnaqvi (1)
  • BillelBenoudjit (1)
Pull Request Authors
  • dependabot[bot] (3)
  • solenn-tl (2)
  • marconaguib (2)
  • percevalw (2)
  • camila-ud (2)
  • xtannier (1)
  • YoannT (1)
  • TrellixVulnTeam (1)
  • ghisvail (1)
Top Labels
Issue Labels: none
Pull Request Labels
  • dependencies (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 36 last month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
pypi.org: nlstruct

Natural language structuring library

  • Versions: 8
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 36 last month
Rankings
  • Dependent packages count: 10.1%
  • Forks count: 11.9%
  • Stargazers count: 15.6%
  • Average: 17.4%
  • Dependent repos count: 21.5%
  • Downloads: 27.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • einops ==0.4.1
  • fire *
  • numpy ==1.22.3
  • pandas ==1.4.2
  • parse ==1.19.0
  • pytorch_lightning ==1.4.9
  • regex ==2020.11.13
  • rich_logger ==0.1.4
  • scikit-learn ==1.1.0rc1
  • sentencepiece ==0.1.96
  • torch ==1.11.0
  • torchmetrics ==0.7.3
  • tqdm ==4.64.0
  • transformers ==4.11.2
  • unidecode ==1.3.4
  • xxhash ==3.0.0