span-marker

SpanMarker for Named Entity Recognition

https://github.com/tomaarsen/SpanMarkerNER

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary

Keywords

huggingface ner nlp spacy spacy-extension transformers
Last synced: 6 months ago

Repository

SpanMarker for Named Entity Recognition

Basic Info
Statistics
  • Stars: 451
  • Watchers: 7
  • Forks: 33
  • Open Issues: 33
  • Releases: 0
Topics
huggingface ner nlp spacy spacy-extension transformers
Created almost 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

SpanMarker for Named Entity Recognition

[🤗 Models](https://huggingface.co/models?library=span-marker) | [🛠️ Getting Started In Google Colab](https://colab.research.google.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) | [📄 Documentation](https://tomaarsen.github.io/SpanMarkerNER) | 📊 [Thesis](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and ELECTRA. Built on top of the 🤗 Transformers library, SpanMarker inherits a wide range of powerful functionality, such as easy loading and saving of models, hyperparameter optimization, automatic logging to various tools, checkpointing, callbacks, mixed precision training, 8-bit inference, and more.

Based on the PL-Marker paper, SpanMarker breaks the mold through its accessibility and ease of use. Crucially, SpanMarker works out of the box with many common encoders such as bert-base-cased, roberta-large and bert-base-multilingual-cased, and automatically works with datasets using the IOB, IOB2, BIOES, BILOU or no label annotation scheme.
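Concretely, a training dataset only needs "tokens" and "ner_tags" columns; the annotation scheme of the tags is detected automatically. Below is a minimal, hypothetical sketch (the sentence and IOB2 tags are invented purely for illustration) of what such a row can look like:

```python
# Hypothetical example row illustrating the "tokens"/"ner_tags" layout that
# SpanMarker consumes; the sentence and IOB2 tags are made up for illustration.
from datasets import Dataset

train_dataset = Dataset.from_dict({
    "tokens": [["Amelia", "Earhart", "flew", "to", "Paris", "."]],
    "ner_tags": [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]],
})
print(train_dataset[0])
# {'tokens': ['Amelia', 'Earhart', 'flew', 'to', 'Paris', '.'], 'ner_tags': ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']}
```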

Additionally, the SpanMarker library has been integrated with the Hugging Face Hub and the Hugging Face Inference API. See the SpanMarker documentation on Hugging Face or see all SpanMarker models on the Hugging Face Hub. Through the Inference API integration, users can test any SpanMarker model on the Hugging Face Hub for free using a widget on the model page. Furthermore, each public SpanMarker model offers a free API for fast prototyping and can be deployed to production using Hugging Face Inference Endpoints.

Screenshots: the Inference API widget on a model page, and the free Inference API (Deploy > Inference API on a model page).
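As a rough sketch of that free Inference API (the endpoint pattern and bearer-token header follow the generic Hugging Face Inference API conventions and are not specific to this repository), a public SpanMarker model can be queried with a plain HTTP request:

```python
# Sketch: query a public SpanMarker model through the Hugging Face Inference API.
# Replace "hf_..." with your own access token.
import requests

API_URL = "https://api-inference.huggingface.co/models/tomaarsen/span-marker-bert-base-fewnerd-fine-super"
headers = {"Authorization": "Bearer hf_..."}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."},
)
print(response.json())
```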

Documentation

Feel free to have a look at the documentation.

Installation

You may install the `span_marker` Python module via pip: `pip install span_marker`

Quick Start

Training

Please have a look at our Getting Started notebook for details on how SpanMarker is commonly used. It explains the following snippet in more detail. Alternatively, have a look at the training scripts that have been successfully used in the past.

| Colab | Kaggle | Gradient | Studio Lab |
|:---|:---|:---|:---|
| Open In Colab | Kaggle | Gradient | Open In SageMaker Studio Lab |

```python
from pathlib import Path
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData


def main() -> None:
    # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
    dataset_id = "DFKI-SLT/few-nerd"
    dataset_name = "FewNERD"
    dataset = load_dataset(dataset_id, "supervised")
    dataset = dataset.remove_columns("ner_tags")
    dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
    labels = dataset["train"].features["ner_tags"].feature.names
    # ['O', 'art-broadcastprogram', 'art-film', 'art-music', 'art-other', ...

    # Initialize a SpanMarker model using a pretrained BERT-style encoder
    encoder_id = "bert-base-cased"
    model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"
    model = SpanMarkerModel.from_pretrained(
        encoder_id,
        labels=labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=8,
        # Model card arguments
        model_card_data=SpanMarkerModelCardData(
            model_id=model_id,
            encoder_id=encoder_id,
            dataset_name=dataset_name,
            dataset_id=dataset_id,
            license="cc-by-sa-4.0",
            language="en",
        ),
    )

    # Prepare the 🤗 transformers training arguments
    output_dir = Path("models") / model_id
    args = TrainingArguments(
        output_dir=output_dir,
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=3000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )

    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()

    # Compute & save the metrics on the test set
    metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
    trainer.save_metrics("test", metrics)

    # Save the final checkpoint
    trainer.save_model(output_dir / "checkpoint-final")


if __name__ == "__main__":
    main()
```

Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")

# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
# [{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7659597396850586, 'char_start_index': 0, 'char_end_index': 14},
#  {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9725785851478577, 'char_start_index': 38, 'char_end_index': 54},
#  {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7587679028511047, 'char_start_index': 66, 'char_end_index': 74},
#  {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
```
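The returned `char_start_index` and `char_end_index` fields map each span back to character offsets in the input. As a small follow-up sketch (not part of the README), they can be used to annotate the original text:

```python
# Sketch: use the documented char_start_index/char_end_index fields from
# model.predict() to bracket each entity in the original text.
text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
annotated = text
# Process entities from right to left so earlier offsets stay valid.
for entity in sorted(entities, key=lambda e: e["char_start_index"], reverse=True):
    start, end = entity["char_start_index"], entity["char_end_index"]
    annotated = f"{annotated[:start]}[{annotated[start:end]} | {entity['label']}]{annotated[end:]}"
print(annotated)
# [Amelia Earhart | person-other] flew her single engine [Lockheed Vega 5B | product-airplane] across the [Atlantic | location-bodiesofwater] to [Paris | location-GPE].
```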

Pretrained Models

All models in this list contain train.py files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the training_scripts directory. These trained models have Hosted Inference API widgets that you can use to experiment with the models on their Hugging Face model pages. Additionally, Hugging Face provides each model with a free API (Deploy > Inference API on the model page).

These models are further elaborated on in my thesis.

FewNERD

OntoNotes v5.0

  • tomaarsen/span-marker-roberta-large-ontonotes5 was trained in 3 hours on the OntoNotes v5.0 dataset, reaching a performance of 91.54 F1. For reference, the current strongest spaCy model (en_core_web_trf) reaches 89.8 F1. This SpanMarker model uses a roberta-large encoder under the hood.

CoNLL03

CoNLL++

MultiNERD

Using pretrained SpanMarker models with spaCy

All SpanMarker models on the Hugging Face Hub can also be easily used in spaCy. It's as simple as adding one line to include the `span_marker` pipeline component. See the Documentation or API Reference for more information.

```python
import spacy

# Load the spaCy model with the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Cleopatra VII, "PERSON"), (Cleopatra the Great, "PERSON"), (the Ptolemaic Kingdom of Egypt, "GPE"),
 (69 BCE, "DATE"), (Egypt, "GPE"), (51 BCE, "DATE"), (30 BCE, "DATE")]
"""
```

Context

Argilla

I have developed this library as a part of my thesis work at Argilla. Feel free to read my finished thesis here in this repository!

Changelog

See CHANGELOG.md for news on all SpanMarker versions.

License

See LICENSE for the current license.

Owner

  • Name: Tom Aarsen
  • Login: tomaarsen
  • Kind: user
  • Location: Netherlands
  • Company: Hugging Face

Sentence Transformers, SetFit & NLTK maintainer, ML Engineer & Fellow @ 🤗Hugging Face

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: SpanMarker
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Tom
    family-names: Aarsen
repository-code: 'https://github.com/tomaarsen/SpanMarkerNER'
keywords:
  - nlp
  - ner
  - transformers
license: Apache-2.0

GitHub Events

Total
  • Issues event: 8
  • Watch event: 56
  • Delete event: 1
  • Issue comment event: 22
  • Push event: 19
  • Pull request event: 6
  • Fork event: 3
  • Create event: 4
Last Year
  • Issues event: 8
  • Watch event: 56
  • Delete event: 1
  • Issue comment event: 22
  • Push event: 19
  • Pull request event: 6
  • Fork event: 3
  • Create event: 4

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 274
  • Total Committers: 3
  • Avg Commits per committer: 91.333
  • Development Distribution Score (DDS): 0.007
Past Year
  • Commits: 8
  • Committers: 2
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.125
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Tom Aarsen | C****v@g****m | 272 |
| Logan | l****h@l****m | 1 |
| David Berenstein | d****n@g****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 51
  • Total pull requests: 27
  • Average time to close issues: 24 days
  • Average time to close pull requests: 7 days
  • Total issue authors: 41
  • Total pull request authors: 7
  • Average comments per issue: 2.27
  • Average comments per pull request: 2.96
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 9
  • Pull requests: 5
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 13 hours
  • Issue authors: 8
  • Pull request authors: 2
  • Average comments per issue: 1.67
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ganga7445 (3)
  • tomaarsen (3)
  • lambdavi (2)
  • 2533245542 (2)
  • sanmai-NL (2)
  • tan-js (2)
  • davidberenstein1957 (2)
  • gilljon (1)
  • alvarobartt (1)
  • mitchins (1)
  • polodealvarado (1)
  • logan-markewich (1)
  • YeDeming (1)
  • Ulipenitz (1)
  • andysingal (1)
Pull Request Authors
  • tomaarsen (19)
  • vangheem (2)
  • davidberenstein1957 (2)
  • mrwunderbar666 (2)
  • polodealvarado (1)
  • logan-markewich (1)
  • jayant-yadav (1)
  • cceyda (1)
Top Labels
Issue Labels
enhancement (5) question (1) bug (1)
Pull Request Labels
enhancement (4) documentation (1) bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 20,621 last-month
  • Total dependent packages: 8
  • Total dependent repositories: 1
  • Total versions: 20
  • Total maintainers: 1
pypi.org: span-marker

Named Entity Recognition using Span Markers

  • Versions: 20
  • Dependent Packages: 8
  • Dependent Repositories: 1
  • Downloads: 20,621 Last month
Rankings
Dependent packages count: 3.2%
Downloads: 3.4%
Stargazers count: 4.0%
Average: 8.3%
Forks count: 9.1%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago