https://github.com/percevalw/nlstruct

Natural language structuring library

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    5 of 9 committers (55.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

deep-learning machine-learning natural-language-processing notebook python structured-data
Last synced: 5 months ago

Repository

Natural language structuring library

Basic Info
  • Host: GitHub
  • Owner: percevalw
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 543 KB
Statistics
  • Stars: 21
  • Watchers: 3
  • Forks: 11
  • Open Issues: 9
  • Releases: 2
Topics
deep-learning machine-learning natural-language-processing notebook python structured-data
Created about 6 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

NLStruct

Natural language structuring library. Currently, it implements a nested NER model and a span classification model, but other algorithms might follow.

If you find this library useful in your research, please consider citing:

```
@phdthesis{wajsburt:tel-03624928,
  TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},
  AUTHOR = {Wajsb{\"u}rt, Perceval},
  URL = {https://hal.archives-ouvertes.fr/tel-03624928},
  SCHOOL = {{Sorbonne Universit{\'e}}},
  YEAR = {2021},
  MONTH = Dec,
  KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},
  TYPE = {Theses},
  PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},
  HAL_ID = {tel-03624928},
  HAL_VERSION = {v1},
}
```

This work was performed at LIMICS, in collaboration with AP-HP's Clinical Data Warehouse and funded by the Institute of Computing and Data Science.

Features

  • processes large documents seamlessly: it automatically handles tokenization and sentence splitting
  • do not train twice: an automatic caching mechanism detects when an experiment has already been run
  • stop & resume with checkpoints
  • easy import and export of data
  • handles nested or overlapping entities
  • multi-label classification of recognized entities
  • strict or relaxed multi-label end-to-end retrieval metrics (illustrated in the sketch after this list)
  • pretty logging with rich-logger
  • heavily customizable, without config files (see train_ner.py)
  • built on top of transformers and pytorch_lightning
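
To make the strict/relaxed distinction concrete, here is a minimal sketch of the two matching regimes. This illustrates the general idea only; it is not nlstruct's actual metric implementation, and the spans below are hypothetical.

```python
# Illustration only (not nlstruct's code): strict vs. relaxed entity matching.

def strict_match(pred, gold):
    # Strict: begin, end and label must all match exactly.
    return pred == gold

def relaxed_match(pred, gold):
    # Relaxed: same label and overlapping character spans suffice.
    (pb, pe, pl), (gb, ge, gl) = pred, gold
    return pl == gl and pb < ge and gb < pe

gold = [(19, 28, "substance")]
pred = [(19, 29, "substance")]  # predicted span is one character too long

print(any(strict_match(p, g) for g in gold for p in pred))   # False
print(any(relaxed_match(p, g) for g in gold for p in pred))  # True
```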

Training models

How to train a NER model

```python
from nlstruct.recipes import train_ner

model = train_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```

How to use it

```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner = load_pretrained("model.pt")
ner.eval()
ner.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."})
```

Out:

```python
{'doc_id': 'doc-0',
 'text': 'Je lui prescris du lorazepam.',
 'entities': [{'entity_id': 0,
   'label': ['substance'],
   'attributes': [],
   'fragments': [{'begin': 19,
     'end': 28,
     'label': 'substance',
     'text': 'lorazepam'}],
   'confidence': 0.9998705969553088}]}
```

```python
export_to_brat(ner.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
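
To tag several in-memory documents, the single-document predict call above can simply be repeated. A minimal sketch, continuing from the `ner` model loaded above; the loop and the example texts are ours, only the doc_id/text schema comes from the README:

```python
# Sketch: reuse the documented single-document predict call in a loop.
# The texts below are hypothetical; the dict schema matches the example above.
docs = [
    {"doc_id": "doc-1", "text": "Le patient prend de l'aspirine."},
    {"doc_id": "doc-2", "text": "Arrêt du lorazepam hier."},
]
preds = [ner.predict(doc) for doc in docs]
```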

How to train a NER model followed by a span classification model

```python
from nlstruct.recipes import train_qualified_ner

model = train_qualified_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```

Ensembling

Easily ensemble multiple models (same architecture, different seeds):

```python
model1 = load_pretrained("model-1.pt")
model2 = load_pretrained("model-2.pt")
model3 = load_pretrained("model-3.pt")
ensemble = model1.ensemble_with([model2, model3]).cuda()
export_to_brat(ensemble.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
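
The checkpoints being ensembled would typically come from the same recipe run with different seeds. A hedged sketch reusing the train_ner call from above; the loop and file naming are ours, the parameters are taken from the README example:

```python
from nlstruct.recipes import train_ner

# Sketch: train three models that differ only by their seed, then save each
# under a distinct name so they can be ensembled as shown above.
for seed in (41, 42, 43):
    model = train_ner(
        dataset={"train": "path to your train brat/standoff data", "val": 0.05},
        finetune_bert=False,
        seed=seed,
        bert_name="camembert/camembert-base",
        fasttext_file="",
        gpus=0,
        xp_name=f"my-xp-seed{seed}",  # hypothetical experiment name
        return_model=True,
    )
    model.save_pretrained(f"model-{seed}.pt")
```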

Advanced use

Should you need to further configure the training of a model, modify one of the recipes in the recipes folder directly.

Install

This project is still under development and subject to changes.

```bash
pip install nlstruct==0.2.0
```
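
Note that the published package pins exact versions of its dependencies (see the Dependencies section below, e.g. torch ==1.11.0), so installing it into a fresh virtual environment is advisable to avoid version conflicts.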

Owner

  • Name: Perceval Wajsburt
  • Login: percevalw
  • Kind: user
  • Location: Paris
  • Company: APHP

PhD in medical NLP; my main areas of interest are NLP, structured prediction models, and UI development.

GitHub Events

Total
  • Watch event: 3
  • Fork event: 2
Last Year
  • Watch event: 3
  • Fork event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 399
  • Total Committers: 9
  • Avg Commits per committer: 44.333
  • Development Distribution Score (DDS): 0.125
Past Year
  • Commits: 6
  • Committers: 4
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.667
Top Committers
Name Email Commits
Perceval Wajsburt p****t@g****m 349
Perceval Wajsburt p****t@s****r 32
tannier x****r@s****r 5
Perceval Wajsbürt p****t@a****r 4
YoannT y****e@g****m 3
Ghislain Vaillant g****t@i****r 2
tannier t****r@c****r 2
Camila A d****l@i****r 1
solenn-tl s****l@g****m 1
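
For reference, the DDS values above are consistent with defining the score as one minus the top committer's share of commits; a quick check (the formula is our assumption, the numbers come from the tables above):

```python
# Assumed definition: DDS = 1 - top_committer_commits / total_commits.
# All time: the top committer row shows 349 of 399 commits.
dds_all_time = 1 - 349 / 399
print(round(dds_all_time, 3))  # 0.125, matching the reported all-time DDS
```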

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: about 13 hours
  • Total issue authors: 3
  • Total pull request authors: 9
  • Average comments per issue: 0.75
  • Average comments per pull request: 1.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • solenn-tl (2)
  • meesamnaqvi (1)
  • BillelBenoudjit (1)
Pull Request Authors
  • dependabot[bot] (3)
  • solenn-tl (2)
  • marconaguib (2)
  • percevalw (2)
  • camila-ud (2)
  • xtannier (1)
  • YoannT (1)
  • TrellixVulnTeam (1)
  • ghisvail (1)
Top Labels
Issue Labels: none
Pull Request Labels
  • dependencies (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 36 last month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
pypi.org: nlstruct

Natural language structuring library

  • Versions: 8
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 36 last month
Rankings
  • Dependent packages count: 10.1%
  • Forks count: 11.9%
  • Stargazers count: 15.6%
  • Average: 17.4%
  • Dependent repos count: 21.5%
  • Downloads: 27.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • einops ==0.4.1
  • fire *
  • numpy ==1.22.3
  • pandas ==1.4.2
  • parse ==1.19.0
  • pytorch_lightning ==1.4.9
  • regex ==2020.11.13
  • rich_logger ==0.1.4
  • scikit-learn ==1.1.0rc1
  • sentencepiece ==0.1.96
  • torch ==1.11.0
  • torchmetrics ==0.7.3
  • tqdm ==4.64.0
  • transformers ==4.11.2
  • unidecode ==1.3.4
  • xxhash ==3.0.0