multi-head-crf

https://github.com/ieeta-pt/multi-head-crf

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ieeta-pt
License: mit
Language: Python
Default Branch: master
Size: 94.7 KB

Statistics

Stars: 2
Watchers: 6
Forks: 0
Open Issues: 3
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

Multi-Head-CRF

This repository contains the implementation for the Multi-Head-CRF model as described in:

Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes

Setup

Create a Python environment:

    python -m venv venv
    PIP=venv/bin/pip
    $PIP install --upgrade pip
    $PIP install -r requirements.txt

Dataset

The dataset used in this work merges four separate datasets:

- SymptEMIST: Zenodo
- DisTEMIST: Zenodo
- MedProcNER: Zenodo
- PharmaCoNER: Zenodo

All datasets are licensed under CC4.

To set up the dataset, a script is provided (dataset/download_dataset.sh) that downloads these datasets, prepares them in the correct format, and merges them to create a unified dataset.

Alternatively, the dataset is available on:

- Hugging Face
- Zenodo

This step is required if you wish to run the Named Entity Linking or Evaluation.

Named Entity Recognition

Go to the src directory:

    cd src

To train a model, use the following command:

    python hf_trainer.py lcampillos/roberta-es-clinical-trials-ner --augmentation random --number_of_layer_per_head 3 --context 32 --epochs 60 --batch 16 --percentage_tags 0.25 --aug_prob 0.5 --classes SYMPTOM PROCEDURE DISEASE PROTEIN CHEMICAL

lcampillos/roberta-es-clinical-trials-ner: Model checkpoint.
--number_of_layer_per_head: Number of hidden layers to use in each CRF head (Good options: 1-3).
--context: Context size for splitting documents exceeding the 512 token limit (Good options: 2 or 32).
--epochs: Number of epochs to train.
--batch: Batch size.
--augmentation: Augmentation strategy (None, 'random', or 'unk').
--aug_prob: Probability to apply augmentation to a sample.
--percentage_tags: Percentage of tokens to change.
--classes: Classes to train, must be a combination of: SYMPTOM PROCEDURE DISEASE PROTEIN CHEMICAL.
--val: Whether to use a validation dataset; otherwise, the test dataset is utilized.
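How --augmentation, --aug_prob, and --percentage_tags interact can be pictured with a minimal sketch. It assumes that augmentation is drawn per sample with probability aug_prob and then alters roughly percentage_tags of the token ids, replacing them with random vocabulary ids ('random') or with the unknown-token id ('unk'). The function name, its arguments, and the selection rule are hypothetical; the repository's RandomlyReplaceTokens and RandomlyUKNTokens classes may behave differently.

```python
import random


def augment_token_ids(token_ids, strategy, aug_prob, percentage_tags,
                      vocab_size, unk_id, rng=None):
    """Hypothetical helper: return a possibly augmented copy of token_ids."""
    rng = rng or random.Random()
    ids = list(token_ids)
    # Leave the sample untouched when augmentation is off or not drawn.
    if strategy is None or rng.random() >= aug_prob:
        return ids
    # Number of positions to alter (assumption: a fraction of all tokens).
    n_change = int(len(ids) * percentage_tags)
    for pos in rng.sample(range(len(ids)), n_change):
        if strategy == "unk":
            ids[pos] = unk_id
        elif strategy == "random":
            ids[pos] = rng.randrange(vocab_size)
        else:
            raise ValueError(f"unknown augmentation strategy: {strategy}")
    return ids
```

With aug_prob set to 0.5 and percentage_tags set to 0.25, as in the command above, about half of the samples would be altered under these assumptions, each in a quarter of its positions.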

To run inference, we provide an inference script, which runs over the test dataset by default:

    python inference.py MODEL_CHECKPOINT

Several of our best-performing models are also available on Hugging Face.

Example:

    python inference.py IEETA/RobertaMultiHeadCRF-C32-0

Named Entity Linking

To use the SNOMED CT terminology, you must create a UMLS account and download the terminology files, which are expected to be extracted into the embeddings directory. Although we do not supply the original resource, we do supply all the embeddings used for SNOMED CT and the various gazetteers; they can be downloaded with the script embeddings/download_embeddings.sh.

To build the embeddings yourself, first run the embeddings/prepare_jsonl_for_embedding.py script, which creates jsonl files from the various gazetteers.
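For readers unfamiliar with the jsonl format, the sketch below writes one JSON object per line. It is only an illustration: the field names `code` and `term`, the placeholder rows, and the helper name are assumptions, not the schema produced by prepare_jsonl_for_embedding.py.

```python
import io
import json


def write_jsonl(entries, handle):
    """Write each dict in `entries` as one JSON object per line."""
    for entry in entries:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Placeholder gazetteer rows; the field names are hypothetical.
gazetteer = [
    {"code": "EXAMPLE-1", "term": "dolor torácico"},
    {"code": "EXAMPLE-2", "term": "fiebre"},
]

buffer = io.StringIO()
write_jsonl(gazetteer, buffer)
lines = buffer.getvalue().splitlines()
```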

Then build the embeddings index by running embeddings/build_embeddings_index.py:

    python build_embeddings_index.py snomedCT.jsonl

With these embeddings, normalization can be run from the src directory:

    python normalize.py INPUT_RUN --t 0.6 --use_gazetteer False --output_folder runs

Here --t is the acceptance threshold, and --use_gazetteer controls whether the gazetteers are used for normalization.
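The role of an acceptance threshold can be sketched as follows. This is a minimal illustration that assumes a mention is linked to the concept whose embedding has the highest cosine similarity, and that the link is rejected when the score falls below the threshold. The source does not state which similarity normalize.py uses, so the functions and the toy vectors here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention_vec, concept_vecs, threshold=0.6):
    """Return (code, score) for the best concept, or (None, score) if rejected."""
    best_code, best_score = None, -1.0
    for code, vec in concept_vecs.items():
        score = cosine(mention_vec, vec)
        if score > best_score:
            best_code, best_score = code, score
    if best_score < threshold:
        return None, best_score
    return best_code, best_score
```

Under this assumption, a higher threshold trades recall for precision: fewer mentions receive a code, but the accepted links have higher similarity scores.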

Evaluation

The evaluation (NER and entity linking) can be run in the evaluation/ directory as follows:

    python3 evaluation.py train/test PREDICTIONS_FILE.tsv
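The source does not describe the metrics that evaluation.py reports. As a point of reference only, the sketch below computes micro-averaged precision, recall, and F1 under strict matching, where a prediction counts only if its document id, span, and label all match a gold annotation. The tuple layout and function name are assumptions.

```python
def micro_prf(gold, predicted):
    """Strict-match precision, recall, and F1 over sets of annotation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical tuples: (doc_id, start, end, label)
gold = {("d1", 0, 5, "DISEASE"), ("d1", 10, 15, "SYMPTOM")}
pred = {("d1", 0, 5, "DISEASE"), ("d1", 20, 25, "CHEMICAL")}
scores = micro_prf(gold, pred)
```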

How to use our Dataset

Usage Example

```python
from transformers import AutoTokenizer

from data import (
    Spanish_Biomedical_NER_Corpus, CorpusTokenizer, CorpusDataset,
    CorpusPreProcessor, BIOTagger, SelectModelInputs, RandomlyUKNTokens,
    EvaluationDataCollator, RandomlyReplaceTokens, TrainDataCollator,
)

# Values assumed for illustration; they mirror the training command above.
classes = ["SYMPTOM", "PROCEDURE", "DISEASE", "PROTEIN", "CHEMICAL"]
CONTEXT_SIZE = 32
tokenizer = AutoTokenizer.from_pretrained("lcampillos/roberta-es-clinical-trials-ner")
# `transforms` and `train_augmentation` must be defined beforehand (for example
# from BIOTagger, SelectModelInputs, and RandomlyReplaceTokens); their
# construction is not shown here.

# First, create a generic Corpus
spanishCorpus = Spanish_Biomedical_NER_Corpus(
    "../dataset/mergeddatasubtask1train.tsv",
    "../dataset/documents",
)

# Create a CorpusPreProcessor, which handles certain preprocessing tasks:
# merging annotations, filtering labels, and splitting the data.
spanishCorpusProcessor = CorpusPreProcessor(spanishCorpus)
spanishCorpusProcessor.merge_annotations()
spanishCorpusProcessor.filter_labels(classes)

# Split the corpus into training and testing sets with a 33% split.
train_corpus, test_corpus = spanishCorpusProcessor.split_data(0.33)

# Create a CorpusTokenizer, using the CorpusPreProcessor.
# This internally tokenizes the dataset and splits the documents.
tokenized_train_corpus = CorpusTokenizer(train_corpus, tokenizer, CONTEXT_SIZE)

# Finally, create the dataset by applying transformations.
# The order of transformations is important (BIO tagging should be applied first).
train_ds = CorpusDataset(
    tokenized_corpus=tokenized_train_corpus,
    transforms=transforms,
    augmentation=train_augmentation,
)

# Repeat the process for the test set
tokenized_test_corpus = CorpusTokenizer(test_corpus, tokenizer, CONTEXT_SIZE)
test_ds = CorpusDataset(tokenized_corpus=tokenized_test_corpus)
```

This example shows the workflow of using the Corpus, CorpusPreProcessor, CorpusTokenizer, and CorpusDataset classes to create a dataset for a Named Entity Recognition (NER) task. It includes:

Loading a corpus from a dataset.
Preprocessing the corpus to merge annotations, filter specific labels, and split the data.
Tokenizing the processed corpus and splitting documents.
Creating a dataset with specified transformations and augmentations.
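The BIO tagging step mentioned in the example can be illustrated with a small, self-contained sketch. It assumes that each token carries character offsets and that a token overlapping an annotation span receives a B- tag (first token) or an I- tag (following tokens). The function below is hypothetical and is not the repository's BIOTagger; in particular, it produces a single tag sequence rather than one per class.

```python
def bio_tags(offsets, annotations):
    """Assign BIO tags to tokens from character-level annotation spans.

    offsets: list of (start, end) character offsets, one per token.
    annotations: list of dicts with "label", "start_span", "end_span".
    """
    tags = ["O"] * len(offsets)
    for ann in annotations:
        start, end = int(ann["start_span"]), int(ann["end_span"])
        inside = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            # A token belongs to the entity if the two spans overlap.
            if tok_start < end and tok_end > start:
                tags[i] = ("I-" if inside else "B-") + ann["label"]
                inside = True
    return tags


# "dolor de cabeza intenso", split on whitespace for illustration
offsets = [(0, 5), (6, 8), (9, 15), (16, 23)]
annotations = [{"label": "SYMPTOM", "start_span": 0, "end_span": 15}]
tags = bio_tags(offsets, annotations)
```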

Bring your own data

To create a new dataset, you will need to subclass the Corpus class, as the Spanish_Biomedical_NER_Corpus class does. The key requirement is the data format shown below: a list of documents, each represented as a dictionary.

1. `Corpus`

The Corpus class represents a collection of documents with annotations. Each document in the corpus (data) must adhere to the following format:

    {
      "doc_id": "unique_document_identifier",
      "text": "document_text",
      "annotations": [
        {"label": "LABEL", "start_span": "start_position", "end_span": "end_position"},
        ...
      ]
    }
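When preparing your own data, a quick structural check can catch malformed documents early. The sketch below is a hypothetical helper, not part of the repository. It assumes that start_span and end_span are character offsets into text that can be converted to integers, which the format above does not state explicitly.

```python
def validate_document(doc):
    """Raise ValueError if `doc` does not follow the documented structure."""
    for key in ("doc_id", "text", "annotations"):
        if key not in doc:
            raise ValueError(f"missing key: {key}")
    for ann in doc["annotations"]:
        if not ann.get("label"):
            raise ValueError("annotation without a label")
        start, end = int(ann["start_span"]), int(ann["end_span"])
        # Assumption: spans are character offsets within the document text.
        if not 0 <= start < end <= len(doc["text"]):
            raise ValueError(f"span out of range: {start}-{end}")
    return True


example_doc = {
    "doc_id": "doc-1",
    "text": "dolor de cabeza intenso",
    "annotations": [{"label": "SYMPTOM", "start_span": 0, "end_span": 15}],
}
is_valid = validate_document(example_doc)
```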

Methods:

__init__(data: list): Initializes the corpus. The data must be a list of documents with the structure mentioned above.
__len__(): Returns the number of documents in the corpus.
get_entities(): Returns a set of all unique entity labels in the corpus.
split(split: float): Splits the corpus into two sets (training and testing) based on the provided split ratio.
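To make the interface concrete, here is a minimal stand-in that mirrors the documented methods. It is not the repository's Corpus class. The shuffling, the seed argument, and the reading of the split ratio as the test fraction are assumptions made for this sketch.

```python
import random


class SimpleCorpus:
    """Stand-in mirroring the documented Corpus interface (not the real class)."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def get_entities(self):
        """Return the set of unique entity labels across all documents."""
        return {ann["label"] for doc in self.data for ann in doc["annotations"]}

    def split(self, split, seed=0):
        """Return (train, test); `split` is taken as the test fraction."""
        docs = self.data[:]
        random.Random(seed).shuffle(docs)
        n_test = int(len(docs) * split)
        return SimpleCorpus(docs[n_test:]), SimpleCorpus(docs[:n_test])


docs = [
    {"doc_id": str(i), "text": "texto", "annotations": [
        {"label": "DISEASE" if i % 2 else "SYMPTOM", "start_span": 0, "end_span": 5}
    ]}
    for i in range(4)
]
corpus = SimpleCorpus(docs)
train, test = corpus.split(0.25)
```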

2. `CorpusPreProcessor`

The CorpusPreProcessor class is responsible for applying transformations and filters to the corpus.

Methods:

__init__(corpus: Corpus): Initializes the preprocessor with a Corpus object.
filter_labels(labels: list): Filters the annotations to keep only those with labels that match the provided list.
merge_annotations(): Merges overlapping or adjacent annotations based on their span. If a collision is found, the annotations are merged.
split_data(test_split_percentage: float): Splits the corpus into training and test sets based on the provided split percentage.
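The two annotation-level operations can be sketched on the document format described earlier. This is an illustration under stated assumptions: spans are integer character offsets, only annotations that share a label are merged, and spans that touch count as adjacent. The repository's implementation may resolve collisions differently.

```python
def filter_labels(annotations, labels):
    """Keep only annotations whose label is in `labels`."""
    keep = set(labels)
    return [ann for ann in annotations if ann["label"] in keep]


def merge_annotations(annotations):
    """Merge overlapping or touching spans that share a label (assumption)."""
    merged = []
    ordered = sorted(annotations, key=lambda a: (a["label"], int(a["start_span"])))
    for ann in ordered:
        start, end = int(ann["start_span"]), int(ann["end_span"])
        last = merged[-1] if merged else None
        if last and last["label"] == ann["label"] and start <= last["end_span"]:
            # Collision: extend the previous span instead of adding a new one.
            last["end_span"] = max(last["end_span"], end)
        else:
            merged.append({"label": ann["label"], "start_span": start, "end_span": end})
    return merged


anns = [
    {"label": "DISEASE", "start_span": 0, "end_span": 5},
    {"label": "DISEASE", "start_span": 5, "end_span": 9},
    {"label": "SYMPTOM", "start_span": 3, "end_span": 7},
    {"label": "DISEASE", "start_span": 20, "end_span": 25},
]
merged = merge_annotations(anns)
```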

3. `CorpusTokenizer`

The CorpusTokenizer class tokenizes the corpus using a given tokenizer (e.g., from the Hugging Face library) and prepares it for training by splitting it into context windows.

Parameters:

corpus: CorpusPreProcessor: The preprocessed corpus to tokenize.
tokenizer: A tokenizer that tokenizes the text (e.g., Hugging Face's tokenizers).
context_size: The size of the context to be added before and after the main token sequence (optional).

Methods:

__tokenize(): Tokenizes the entire corpus.
__split(): Splits the tokenized corpus into windows, handling context and special tokens (if any).
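Splitting into context windows can be sketched as follows. The sketch assumes that each window consists of a main token sequence plus up to context_size tokens on either side, and that the total (including special tokens) must fit a maximum length such as 512. The function name, the n_special parameter, and the return layout are hypothetical and do not reproduce the private __split() method.

```python
def split_with_context(token_ids, max_len=512, context_size=32, n_special=2):
    """Split token ids into (left_context, center, right_context) windows."""
    # Room left for the main sequence once context and special tokens are reserved.
    center_len = max_len - n_special - 2 * context_size
    if center_len <= 0:
        raise ValueError("context_size leaves no room for the main sequence")
    windows = []
    for start in range(0, len(token_ids), center_len):
        end = min(start + center_len, len(token_ids))
        left = token_ids[max(0, start - context_size):start]
        right = token_ids[end:end + context_size]
        windows.append((left, token_ids[start:end], right))
    return windows


windows = split_with_context(list(range(20)), max_len=12, context_size=2)
```

In this layout, only the center tokens of each window would contribute predictions, while the context tokens give the encoder surrounding text at window boundaries.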

4. `CorpusDataset`

This class wraps the tokenized corpus into a dataset compatible with PyTorch's DataLoader.

Parameters:

tokenized_corpus: CorpusTokenizer: The tokenized corpus.
transforms: Optional transformations to apply to each sample.
augmentations: Optional augmentations to apply to each sample.
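PyTorch's DataLoader accepts any map-style dataset, that is, any object that defines __len__ and __getitem__. The stand-in below shows that protocol together with per-sample transforms. The class is hypothetical, and applying transforms before augmentations is an assumption about ordering, not a statement about CorpusDataset.

```python
class SimpleCorpusDataset:
    """Map-style dataset sketch: defines __len__ and __getitem__."""

    def __init__(self, samples, transforms=None, augmentations=None):
        self.samples = list(samples)
        self.transforms = list(transforms or [])
        self.augmentations = list(augmentations or [])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Copy so that transforms never mutate the stored sample.
        sample = dict(self.samples[idx])
        for fn in self.transforms + self.augmentations:
            sample = fn(sample)
        return sample


def add_length(sample):
    """Example transform: record the number of input ids."""
    sample["length"] = len(sample["input_ids"])
    return sample


dataset = SimpleCorpusDataset(
    [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}],
    transforms=[add_length],
)
first = dataset[0]
```

An object like this can be passed to torch.utils.data.DataLoader, typically with a collate function that pads the samples in each batch.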

Reference

Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes -- https://academic.oup.com/database/article/doi/10.1093/database/baae068/7724924

    @article{Jonker_Multi-head_CRF_classifier_2024,
      author = {Jonker, Richard A A and Almeida, Tiago and Antunes, Rui and Almeida, João R and Matos, Sérgio},
      doi = {10.1093/database/baae068},
      journal = {Database},
      title = {{Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes}},
      url = {https://doi.org/10.1093/database/baae068},
      volume = {2024},
      year = {2024}
    }

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors:

- Richard A A Jonker (ORCID: 0000-0002-3806-6940)
- Tiago Almeida (ORCID: 0000-0002-4258-3350)
- Rui Antunes (ORCID: 0000-0003-3533-8872)
- João R Almeida (ORCID: 0000-0003-0729-2264)
- Sérgio Matos (ORCID: 0000-0003-1941-3983)

Owner

Name: IEETA
Login: ieeta-pt
Kind: organization

Website: www.ieeta.pt
Repositories: 28
Profile: https://github.com/ieeta-pt

Citation (citation.cff)

cff-version: 1.2.0
message: If you use this software, please cite both the article from preferred-citation and the software itself.
authors:
  - family-names: Jonker
    given-names: Richard A A
  - family-names: Almeida
    given-names: Tiago
  - family-names: Antunes
    given-names: Rui
  - family-names: Almeida
    given-names: João R
  - family-names: Matos
    given-names: Sérgio
title: Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes
version: 1.0.0
url: https://doi.org/10.1093/database/baae068
doi: 10.1093/database/baae068
date-released: '2024-11-07'
preferred-citation:
  type: article
  authors:
    - family-names: Jonker
      given-names: Richard A A
    - family-names: Almeida
      given-names: Tiago
    - family-names: Antunes
      given-names: Rui
    - family-names: Almeida
      given-names: João R
    - family-names: Matos
      given-names: Sérgio
  title: Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes
  doi: 10.1093/database/baae068
  url: https://doi.org/10.1093/database/baae068
  journal: Database
  volume: 2024
  pages: baae068
  year: '2024'
  conference: {}
  publisher: {}

GitHub Events

Total

Issues event: 1
Delete event: 1
Issue comment event: 5
Push event: 14
Pull request event: 3
Pull request review event: 7
Pull request review comment event: 13
Create event: 2

Last Year

Issues event: 1
Delete event: 1
Issue comment event: 5
Push event: 14
Pull request event: 3
Pull request review event: 7
Pull request review comment event: 13
Create event: 2

Dependencies

requirements.txt pypi

click ==8.1.7
numpy ==1.26.4
pandas ==2.2.0
torch ==2.2.0
tqdm ==4.66.2
transformers ==4.37.2
