https://github.com/percevalw/nlstruct
Natural language structuring library
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 5 of 9 committers (55.6%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Keywords
Repository
Natural language structuring library
Basic Info
- Host: GitHub
- Owner: percevalw
- License: mit
- Language: Python
- Default Branch: master
- Size: 543 KB
Statistics
- Stars: 21
- Watchers: 3
- Forks: 11
- Open Issues: 9
- Releases: 2
Topics
Metadata Files
README.md
NLStruct
Natural language structuring library. Currently, it implements a nested NER model and a span classification model, but other algorithms may follow.
If you find this library useful in your research, please consider citing:
@phdthesis{wajsburt:tel-03624928,
TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},
AUTHOR = {Wajsb{\"u}rt, Perceval},
URL = {https://hal.archives-ouvertes.fr/tel-03624928},
SCHOOL = {{Sorbonne Universit{\'e}}},
YEAR = {2021},
MONTH = Dec,
KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},
TYPE = {Theses},
PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},
HAL_ID = {tel-03624928},
HAL_VERSION = {v1},
}
This work was performed at LIMICS, in collaboration with AP-HP's Clinical Data Warehouse and funded by the Institute of Computing and Data Science.
Features
- processes large documents seamlessly: tokenization and sentence splitting are handled automatically
- no need to train twice: an automatic caching mechanism detects when an experiment has already been run
- stop & resume training with checkpoints
- easy import and export of data
- handles nested or overlapping entities
- multi-label classification of recognized entities
- strict or relaxed multi-label end-to-end retrieval metrics
- pretty logging with rich-logger
- heavily customizable, without config files (see train_ner.py)
- built on top of transformers and pytorch_lightning
Training models
How to train a NER model
```python
from nlstruct.recipes import train_ner

model = train_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```
How to use it
```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat

ner = load_pretrained("model.pt")
ner.eval()
ner.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."})
# Out:
# {'doc_id': 'doc-0',
#  'text': 'Je lui prescris du lorazepam.',
#  'entities': [{'entity_id': 0,
#    'label': ['substance'],
#    'attributes': [],
#    'fragments': [{'begin': 19,
#      'end': 28,
#      'label': 'substance',
#      'text': 'lorazepam'}],
#    'confidence': 0.9998705969553088}]}

export_to_brat(ner.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
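The prediction shown above is a plain dict per document. As an illustration only (not part of the README), here is a minimal sketch of walking that structure to list each recognized span, reusing the `ner` model loaded above:

```python
# Sketch: iterate over the dict returned by ner.predict (structure shown above)
# and print each fragment's label, its surface text, and the entity confidence.
doc = ner.predict({"doc_id": "doc-0", "text": "Je lui prescris du lorazepam."})
for entity in doc["entities"]:
    for fragment in entity["fragments"]:
        span_text = doc["text"][fragment["begin"]:fragment["end"]]
        print(fragment["label"], span_text, entity["confidence"])
```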
How to train a NER model followed by a span classification model
```python
from nlstruct.recipes import train_qualified_ner

model = train_qualified_ner(
    dataset={
        "train": "path to your train brat/standoff data",
        "val": 0.05,  # or path to your validation data
        # "test": # and optional path to your test data
    },
    finetune_bert=False,
    seed=42,
    bert_name="camembert/camembert-base",
    fasttext_file="",
    gpus=0,
    xp_name="my-xp",
    return_model=True,
)
model.save_pretrained("model.pt")
```
Ensembling
Easily ensemble multiple models (same architecture, different seeds):
```python
model1 = load_pretrained("model-1.pt")
model2 = load_pretrained("model-2.pt")
model3 = load_pretrained("model-3.pt")
ensemble = model1.ensemble_with([model2, model3]).cuda()

export_to_brat(ensemble.predict(load_from_brat("path/to/brat/test")), filename_prefix="path/to/exported_brat")
```
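The ensembled checkpoints are ordinary saved models. A hedged sketch (not from the README) of producing them by re-running the `train_ner` recipe with different seeds, assuming the same arguments as in the training example above:

```python
from nlstruct.recipes import train_ner

# Train three models that differ only by their random seed, then save each
# checkpoint so it can be loaded and ensembled as shown above.
for seed in (1, 2, 3):
    model = train_ner(
        dataset={"train": "path to your train brat/standoff data", "val": 0.05},
        finetune_bert=False,
        seed=seed,
        bert_name="camembert/camembert-base",
        fasttext_file="",
        gpus=0,
        xp_name=f"my-xp-seed-{seed}",
        return_model=True,
    )
    model.save_pretrained(f"model-{seed}.pt")
```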
Advanced use
Should you need to further configure the training of a model, directly modify one of the recipes located in the recipes folder.
Install
This project is still under development and subject to changes.
```bash
pip install nlstruct==0.2.0
```
Owner
- Name: Perceval Wajsburt
- Login: percevalw
- Kind: user
- Location: Paris
- Company: APHP
- Repositories: 47
- Profile: https://github.com/percevalw
PhD in medical NLP, my main areas of interest are NLP, structured prediction models and UI development
GitHub Events
Total
- Watch event: 3
- Fork event: 2
Last Year
- Watch event: 3
- Fork event: 2
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Perceval Wajsburt | p****t@g****m | 349 |
| Perceval Wajsburt | p****t@s****r | 32 |
| tannier | x****r@s****r | 5 |
| Perceval Wajsbürt | p****t@a****r | 4 |
| YoannT | y****e@g****m | 3 |
| Ghislain Vaillant | g****t@i****r | 2 |
| tannier | t****r@c****r | 2 |
| Camila A | d****l@i****r | 1 |
| solenn-tl | s****l@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 4
- Total pull requests: 14
- Average time to close issues: N/A
- Average time to close pull requests: about 13 hours
- Total issue authors: 3
- Total pull request authors: 9
- Average comments per issue: 0.75
- Average comments per pull request: 1.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- solenn-tl (2)
- meesamnaqvi (1)
- BillelBenoudjit (1)
Pull Request Authors
- dependabot[bot] (3)
- solenn-tl (2)
- marconaguib (2)
- percevalw (2)
- camila-ud (2)
- xtannier (1)
- YoannT (1)
- TrellixVulnTeam (1)
- ghisvail (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 36 last-month (pypi)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 8
- Total maintainers: 1
pypi.org: nlstruct
Natural language structuring library
- Homepage: https://github.com/percevalw/nlstruct
- Documentation: https://nlstruct.readthedocs.io/
- License: MIT
- Latest release: 0.2.0 (published about 2 years ago)
Rankings
Maintainers (1)
Dependencies
- einops ==0.4.1
- fire *
- numpy ==1.22.3
- pandas ==1.4.2
- parse ==1.19.0
- pytorch_lightning ==1.4.9
- regex ==2020.11.13
- rich_logger ==0.1.4
- scikit-learn ==1.1.0rc1
- sentencepiece ==0.1.96
- torch ==1.11.0
- torchmetrics ==0.7.3
- tqdm ==4.64.0
- transformers ==4.11.2
- unidecode ==1.3.4
- xxhash ==3.0.0