Recent Releases of flair

flair - Release 0.15.1

This release fixes compatibility bugs with the newest PyTorch and SciPy versions, and adds a number of small improvements and new features.

Improvements and new features

  • SegtokTokenizer: Add option to customize SegtokTokenizer, by @alanakbik in https://github.com/flairNLP/flair/pull/3592
  • RegexpTagger: Add option to define matching groups to RegexpTagger, by @alanakbik in https://github.com/flairNLP/flair/pull/3598
  • RelationClassifier: Optimize RelationClassifier by adding the option to filter long sentences and truncate context, by @alanakbik in https://github.com/flairNLP/flair/pull/3593
  • RelationClassifier: Modify printouts in RelationClassifier evaluation to remove clutter by @alanakbik in https://github.com/flairNLP/flair/pull/3591
  • Add sentence labeler, by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3570
  • Adding a Deep Nearest Class Means Classifier model to Flair, by @sheldon-roberts in https://github.com/flairNLP/flair/pull/3532
  • Add per-task metrics by @ntravis22 in https://github.com/flairNLP/flair/pull/3605
  • Add options to load full documents as Sentence objects, by @alanakbik in https://github.com/flairNLP/flair/pull/3595

New Model: Deep Nearest Class Means Classifier (#3532)

Adds a new Nearest Class Mean classification approach to Flair that classifies data points to the class with the closest class data mean. This approach can be used as an alternative to fitting a Softmax Classifier. It is now available for any class in Flair that implements DefaultClassifier. For instance, to train a TextClassifier with DeepNCMs you can use the following code:

```python
from flair.data import Corpus
from flair.datasets import TREC_50
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.nn import DeepNCMDecoder
from flair.trainers import ModelTrainer
from flair.trainers.plugins import DeepNCMPlugin

# load the TREC dataset
corpus: Corpus = TREC_50()
label_type = "class"

# make a transformer document embedding
document_embeddings = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)

# create the label_dictionary
label_dictionary = corpus.make_label_dictionary(label_type=label_type)

# create a text classifier with a special DeepNCM decoder
classifier = TextClassifier(
    document_embeddings,
    label_type=label_type,
    label_dictionary=label_dictionary,
    decoder=DeepNCMDecoder(
        mean_update_method="condensation",
        embeddings_size=document_embeddings.embedding_length,
        label_dictionary=label_dictionary,
    ),
)

# initialize the trainer
trainer = ModelTrainer(classifier, corpus)

# train the model using the DeepNCM plugin
trainer.fine_tune(
    "resources/taggers/deepncm_baseline",
    plugins=[DeepNCMPlugin()],
)
```

Contributed by @sheldon-roberts in https://github.com/flairNLP/flair/pull/3532
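
To make the nearest-class-mean idea described above concrete, here is a minimal pure-Python sketch. It is not Flair's implementation: the function names `fit_class_means` and `predict` and the toy 2-D points are made up for illustration, and the means are computed once rather than updated during training as a DeepNCM decoder would do.

```python
from collections import defaultdict
from math import dist


def fit_class_means(points, labels):
    """Return a dict mapping each label to the mean vector of its points."""
    sums = {}
    counts = defaultdict(int)
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
        sums[label] = [s + x for s, x in zip(sums[label], point)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(class_means, point):
    """Assign the point to the class whose mean is closest (Euclidean distance)."""
    return min(class_means, key=lambda label: dist(class_means[label], point))


# Toy 2-D "embeddings" (hypothetical data, for illustration only)
train_points = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0), (6.0, 4.0)]
train_labels = ["A", "A", "B", "B"]

means = fit_class_means(train_points, train_labels)
print(means)                        # class "A" has mean [0.0, 1.0], class "B" has mean [5.0, 4.0]
print(predict(means, (1.0, 1.5)))   # closest to the mean of "A"
print(predict(means, (5.0, 3.0)))   # closest to the mean of "B"
```

No weights are fitted for the decision itself: a class is represented only by the mean of its points, which is what distinguishes this approach from a softmax layer.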

Datasets

  • Add BarNER Dataset by @stefan-it in https://github.com/flairNLP/flair/pull/3604

Bug Fixes

  • Fix model loading for compatibility with PyTorch 2.6, by @helpmefindaname in https://github.com/flairNLP/flair/pull/3608
  • Fix SciPy compatibility by updating scipy .A to toarray(), by @sg-wbi in https://github.com/flairNLP/flair/pull/3606
  • Fix: use proper eval default main eval metrics for text regression model by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3602
  • Fix: cast indices tensor to int to fix bug by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3601

New Contributors

  • @sg-wbi made their first contribution in https://github.com/flairNLP/flair/pull/3606
  • @ntravis22 made their first contribution in https://github.com/flairNLP/flair/pull/3605

Full Changelog: https://github.com/flairNLP/flair/compare/v0.15.0...v0.15.1

Published by alanakbik about 1 year ago

flair - Release 0.15.0

This release adds multi-GPU support, an improved documentation page with API docs and finally deprecates Python 3.8!

Improved Documentation and API Docs

Thanks to @konstantin-lukas we have a completely new design of our documentation page, which now includes API docs.

You can check it out here!

  • Check out our tutorials
  • Check out the new Python API docs

Future releases will improve docstring coverage and further improve upon the documentation!

PRs:

  • Fix doc build by @helpmefindaname in https://github.com/flairNLP/flair/pull/3528
  • Rework Doc page by @konstantin-lukas in https://github.com/flairNLP/flair/pull/3563
  • Test new docstrings and apidocs deployment by @alanakbik in https://github.com/flairNLP/flair/pull/3573

Multi-GPU Support

Flair now offers support for training models on multiple GPUs! Big thanks to @jeffpicard!

PRs:

  • Add multi-GPU support by @jeffpicard in https://github.com/flairNLP/flair/pull/3548
  • Fix gradient accumulation and learning rate aggregation by @jeffpicard in #3583

Deprecations

Since Python 3.8 is no longer supported upstream, we are dropping support for it as well, in favor of features added in Python 3.9. In response to CVE-2024-10073, we also removed the flair.models.clustering module. Since we aren't aware of any usage of it, we decided on a hard removal instead of a deprecation.

  • Drop python 3.8 by @helpmefindaname in https://github.com/flairNLP/flair/pull/3560
  • Remove clustering support by @helpmefindaname in https://github.com/flairNLP/flair/pull/3567

Other Improvements

New Datasets

  • Add CleanCoNLL object by @susannaruecker in https://github.com/flairNLP/flair/pull/3557
  • Add NoiseBench object by @elenamer in #3512

Performance Improvements

  • perf: optimize dictionary items check by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3569
  • Refactor fill_mean_token_embeddings for performance optimization on GPU by @sheldon-roberts in https://github.com/flairNLP/flair/pull/3525

New Features and Improvements

  • Add proxies information to requests.head by @diego-morientez in https://github.com/flairNLP/flair/pull/3535
  • Allow specifying proxy information in TransformerEmbeddings by @diego-morientez in https://github.com/flairNLP/flair/pull/3539
  • Add use_tokenizer to JsonlDataset by @david-waterworth in https://github.com/flairNLP/flair/pull/3486
  • Use built-in version parsing from packaging by @adrianeboyd in https://github.com/flairNLP/flair/pull/3502

Bugfixes

  • TransformerDocumentEmbeddings: Fix error when cls_pooling="mean" or cls_pooling="max" by @fkdosilovic in https://github.com/flairNLP/flair/pull/3558
  • SequenceTagger : Fix the incorrect token prediction distribution from _all_scores_for_token() by @mdmotaharmahtab in https://github.com/flairNLP/flair/pull/3449
  • TransformerEmbeddings: Fix T5 tokenizer loading by @helpmefindaname in https://github.com/flairNLP/flair/pull/3544
  • TextPairRegressor: Fix: use proper eval default main eval metrics by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3538
  • TextPairRegressor: Fix state dict key mismatch for embeddings by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3537
  • Make onnx export work again by @helpmefindaname in https://github.com/flairNLP/flair/pull/3530
  • Fix support metric by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3510

Operations/Development

  • Invalidate tars classifier and tars ner tests to save disk space by @helpmefindaname in https://github.com/flairNLP/flair/pull/3527
  • Ignore FutureWarning by @alanakbik in https://github.com/flairNLP/flair/pull/3526
  • Update SECURITY.md with current contact by @alanakbik in https://github.com/flairNLP/flair/pull/3568

New Contributors

  • @adrianeboyd made their first contribution in https://github.com/flairNLP/flair/pull/3502
  • @david-waterworth made their first contribution in https://github.com/flairNLP/flair/pull/3486
  • @diego-morientez made their first contribution in https://github.com/flairNLP/flair/pull/3535
  • @jeffpicard made their first contribution in https://github.com/flairNLP/flair/pull/3548
  • @mdmotaharmahtab made their first contribution in https://github.com/flairNLP/flair/pull/3449
  • @fkdosilovic made their first contribution in https://github.com/flairNLP/flair/pull/3558

Full Changelog: https://github.com/flairNLP/flair/compare/v0.14.0...v0.15.0

Published by helpmefindaname about 1 year ago

flair - Release 0.14.0

This release adds major new support for biomedical text analytics! It adds improved biomedical NER and a state-of-the-art model for biomedical entity linking. Other new features include (1) support for parameter-efficient fine-tuning and (2) various new datasets, bug fixes and enhancements! We also removed a few dependencies, so Flair should install faster and take up less space!

Biomedical NER and Entity Linking

With Flair 0.14.0, you can now detect and normalize biomedical entities in text.

For example, to analyze the sentence "We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome", use this code snippet:

```python
from flair.models import EntityMentionLinker
from flair.nn import Classifier
from flair.data import Sentence

# A sentence from biomedical literature
sentence = Sentence("We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome.")

# Tag named entities in the text
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

# Normalize gene names
gene_linker = EntityMentionLinker.load("gene-linker")
gene_linker.predict(sentence)

# Normalize disease names
disease_linker = EntityMentionLinker.load("disease-linker")
disease_linker.predict(sentence)

# Iterate over predicted entities and print
for label in sentence.get_labels():
    print(label)
```

This should print out:

```console
Span[5:6]: "IFNAR2" → Gene (1.0)
Span[5:6]: "IFNAR2" → 3455/name=IFNAR2

Span[7:8]: "POLG" → Gene (1.0)
Span[7:8]: "POLG" → 5428/name=POLG

Span[9:11]: "long-COVID syndrome" → Disease (1.0)
Span[9:11]: "long-COVID syndrome" → MESH:D000094024/name=Post-Acute COVID-19 Syndrome
```

The printout shows that:

  • "IFNAR2" is a gene. Further, it is recognized as gene 3455 ("interferon alpha and beta receptor subunit 2") in the NCBI database.

  • "POLG" is a gene. Further, it is recognized as gene 5428 ("DNA polymerase gamma, catalytic subunit") in the NCBI database.

  • "long-COVID syndrome" is a disease. Further, it is uniquely linked to "Post-Acute COVID-19 Syndrome" in the MESH database.

Big thanks to @sg-wbi @WangXII @mariosaenger @helpmefindaname for all their work:

  • Entity Mention Linker by @helpmefindaname in https://github.com/flairNLP/flair/pull/3388
  • Support for biomedical datasets with multiple entity types by @WangXII in https://github.com/flairNLP/flair/pull/3387
  • Update documentation for Hunflair2 release by @mariosaenger in https://github.com/flairNLP/flair/pull/3410
  • Improve nel tutorial by @helpmefindaname in https://github.com/flairNLP/flair/pull/3369
  • Incorporate hunflair2 docs to docpage by @helpmefindaname in https://github.com/flairNLP/flair/pull/3442

Parameter-Efficient Fine-Tuning

Flair 0.14.0 also adds support for PEFT.

For instance, to fine-tune a BERT model on the TREC question classification task using LoRA, use the following snippet:

```python
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Note: you need to install peft to use this feature!
from peft import LoraConfig, TaskType

# Get corpus and make label dictionary
corpus: Corpus = TREC_6()
label_type = "question_class"
label_dict = corpus.make_label_dictionary(label_type=label_type)

# Define embeddings with LoRA fine-tuning
document_embeddings = TransformerDocumentEmbeddings(
    "bert-base-uncased",
    fine_tune=True,
    # set LoRA config
    peft_config=LoraConfig(
        task_type=TaskType.FEATURE_EXTRACTION,
        inference_mode=False,
    ),
)

# define model
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)

# train model
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune(
    "resources/taggers/question-classification-with-transformer",
    learning_rate=5.0e-4,
    mini_batch_size=4,
    max_epochs=1,
)
```

Big thanks to @janpf for this new feature!

  • Add PEFT training and explicit kwarg passthrough by @janpf in https://github.com/flairNLP/flair/pull/3480
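
As a rough illustration of why LoRA counts as parameter-efficient: it keeps a pretrained weight matrix frozen and trains only two small low-rank matrices whose product is added to it. The arithmetic below is a sketch under stated assumptions, not a measurement of the example above. It considers a single square 768 x 768 projection (768 is the hidden size of bert-base) and assumes a LoRA rank of 8; the helper names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a low-rank update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


hidden_size = 768  # hidden size of bert-base
rank = 8           # assumed LoRA rank, chosen for illustration

full = full_finetune_params(hidden_size, hidden_size)
lora = lora_params(hidden_size, hidden_size, rank)

print(full)                  # 589824
print(lora)                  # 12288
print(f"{lora / full:.2%}")  # share of trainable parameters for this one matrix
```

The saving for a whole model depends on which layers receive adapters, which this sketch does not model.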

Smaller Library

We've removed dependencies such as gensim from the core package, since they increased the size of the Flair library and caused some compatibility/maintenance issues. This means the core package is now smaller and faster to install.

Install as always with:

    pip install flair

For certain features, you still need gensim, such as training a model that uses classic word embeddings. For this use case, install with:

    pip install flair[word-embeddings]

Or just install gensim separately.
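
If your own code should work both with and without the optional dependency, one common pattern is to check for the module before using it. The sketch below uses only the standard library; the helper names `is_installed` and `require` are hypothetical and are not part of Flair.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def require(module_name: str, install_hint: str) -> None:
    """Raise a helpful error when an optional dependency is missing."""
    if not is_installed(module_name):
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature. Install it with: {install_hint}"
        )


if is_installed("gensim"):
    print("gensim found: classic word embeddings can be used")
else:
    print("gensim missing: run `pip install flair[word-embeddings]`")
```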

Big thanks to @helpmefindaname for this new feature!

  • Make gensim optional by @helpmefindaname in https://github.com/flairNLP/flair/pull/3493
  • Update models for v0.14.0 by @alanakbik in https://github.com/flairNLP/flair/pull/3505
  • Relax version constraint for konoha by @himkt in https://github.com/flairNLP/flair/pull/3394
  • Dependencies maintenance updates by @helpmefindaname in https://github.com/flairNLP/flair/pull/3402
  • Make janome optional by @himkt in https://github.com/flairNLP/flair/pull/3405
  • Bump min. version of bpemb by @stefan-it in https://github.com/flairNLP/flair/pull/3468

Other Improvements

New Features and Improvements

  • Speed up euclidean distance calculation by @sheldon-roberts in https://github.com/flairNLP/flair/pull/3485
  • Add DataTriples which act just like DataPairs by @janpf in https://github.com/flairNLP/flair/pull/3481
  • Add random seed parameter to dataset splitting and downsampling for better reproducibility by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3475
  • Allow cpu device even if gpu available by @drbh in https://github.com/flairNLP/flair/pull/3417
  • Add prediction label type for span classifier by @helpmefindaname in https://github.com/flairNLP/flair/pull/3432
  • Character embeddings store their embedding name too by @helpmefindaname in https://github.com/flairNLP/flair/pull/3477
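
The reproducibility benefit of a seed parameter for splitting and downsampling can be sketched in a few lines. This is an illustration of the general technique only: `split_dataset` is a hypothetical helper, and its signature is not taken from Flair.

```python
import random


def split_dataset(items, fraction, seed=None):
    """Shuffle a copy of `items` with a private RNG and split off `fraction` of them."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


data = list(range(10))

# The same seed yields the same split on every call
first_a, rest_a = split_dataset(data, 0.3, seed=42)
first_b, rest_b = split_dataset(data, 0.3, seed=42)
print(first_a == first_b)  # True
```

Using a private `random.Random` instance instead of seeding the global generator keeps the split reproducible without affecting other code that draws random numbers.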

Bug Fixes

  • TextPairRegressor: Fix data point iteration by @ya0guang in https://github.com/flairNLP/flair/pull/3413
  • TextPairRegressor: Fix GPU memory leak by @MattGPT-ai in https://github.com/flairNLP/flair/pull/3490
  • TextRegressor: Fix label_name bug by @sheldon-roberts in https://github.com/flairNLP/flair/pull/3491
  • SequenceTagger: Fix _all_scores_for_token in ViterbiDecoder by @mauryaland in https://github.com/flairNLP/flair/pull/3455
  • SentenceSplitter: Fix linking of sentences by @mariosaenger in https://github.com/flairNLP/flair/pull/3397
  • SentenceSplitter: Fix case where split was performed on special characters by @helpmefindaname in https://github.com/flairNLP/flair/pull/3404
  • Classifier: Fix loading by moving error message to main load function by @alanakbik in https://github.com/flairNLP/flair/pull/3504
  • Trainer: Fix edge case by loading best model at end, even when there is no final evaluation by @helpmefindaname in https://github.com/flairNLP/flair/pull/3470
  • TransformerEmbeddings: Fix special tokens by not replacing replace_additional_special_tokens by @helpmefindaname in https://github.com/flairNLP/flair/pull/3451
  • Unit tests: Fix double data_folder in unit test by @ya0guang in https://github.com/flairNLP/flair/pull/3412

New Datasets

  • Add revision support for all Universal Dependencies datasets by @stefan-it in https://github.com/flairNLP/flair/pull/3420
  • NER_ESTONIAN_NOISY: Support for Estonian NER dataset with noise by @teresaloeffelhardt in https://github.com/flairNLP/flair/pull/3463
  • MASAKHA_POS: Support for two new languages by @stefan-it in https://github.com/flairNLP/flair/pull/3421
  • UD_BAVARIAN_MAIBAAM: Add support for new Bavarian MaiBaam UD by @stefan-it in https://github.com/flairNLP/flair/pull/3426

Documentation

  • Minor readme fixes by @stefan-it in https://github.com/flairNLP/flair/pull/3424
  • Fix typo transformer-embeddings.md by @abhisheklomsh in https://github.com/flairNLP/flair/pull/3500
  • Fix typo in how-model-training-works.md by @abhisheklomsh in https://github.com/flairNLP/flair/pull/3499

Build Management

  • Fix black and ruff by @stefan-it in https://github.com/flairNLP/flair/pull/3423
  • Remove zappr yaml by @helpmefindaname in https://github.com/flairNLP/flair/pull/3435
  • Fix tests package being incorrectly included in builds by @asumagic in https://github.com/flairNLP/flair/pull/3440

New Contributors

  • @ya0guang made their first contribution in https://github.com/flairNLP/flair/pull/3413
  • @drbh made their first contribution in https://github.com/flairNLP/flair/pull/3417
  • @asumagic made their first contribution in https://github.com/flairNLP/flair/pull/3440
  • @MattGPT-ai made their first contribution in https://github.com/flairNLP/flair/pull/3475
  • @janpf made their first contribution in https://github.com/flairNLP/flair/pull/3481
  • @sheldon-roberts made their first contribution in https://github.com/flairNLP/flair/pull/3485
  • @abhisheklomsh made their first contribution in https://github.com/flairNLP/flair/pull/3500
  • @teresaloeffelhardt made their first contribution in https://github.com/flairNLP/flair/pull/3463

Full Changelog: https://github.com/flairNLP/flair/compare/v0.13.1...v0.14.0

Published by alanakbik over 1 year ago

flair - Release 0.13.1

This release adds some bug fixes on top of the 0.13.0 release, and adds a new dataset.

Bug fixes

  • fix doc redirect by @helpmefindaname in https://github.com/flairNLP/flair/pull/3366
  • fix awaiting response check by @helpmefindaname in https://github.com/flairNLP/flair/pull/3371
  • fix has unknown label is not always initialized by @helpmefindaname in https://github.com/flairNLP/flair/pull/3372
  • Fix classification report if dataset has no labels by @alanakbik in https://github.com/flairNLP/flair/pull/3375
  • fix flert hidden context breaks reduced vocab by @helpmefindaname in https://github.com/flairNLP/flair/pull/3370
  • update HF cache env variable by @helpmefindaname in https://github.com/flairNLP/flair/pull/3386

Enhancements

  • use batch count instead of total training samples for logging metrics by @helpmefindaname in https://github.com/flairNLP/flair/pull/3374

New Datasets

  • Add AGNews corpus by @elenamer in https://github.com/flairNLP/flair/pull/3385

New Contributors

  • @elenamer made their first contribution in https://github.com/flairNLP/flair/pull/3385

Full Changelog: https://github.com/flairNLP/flair/compare/v0.13.0...v0.13.1

Published by alanakbik about 2 years ago

flair - Release 0.13.0

This release adds several major new features such as (1) faster and more memory-efficient transformer training, (2) a new plugin system for custom logging and training, (3) new API docs for better documentation - still in beta, and (4) various new models, datasets, bug fixes and enhancements. This release also increases the minimum requirement to Python 3.8!

New Feature: Faster and more memory-efficient transformer training

This release integrates @helpmefindaname's transformer-smaller-training-vocab into the ModelTrainer. This temporarily reduces a transformer's vocabulary to only the tokens in the training dataset, and restores the full vocabulary after training. Depending on the dataset, this can yield large savings in GPU memory and faster fine-tuning.

To use this feature, simply add the flag reduce_transformer_vocab=True to the fine_tune method. For example, to fine-tune a distilbert model on TREC_6, run this code (step 7 has the flag to reduce the vocabulary):

```python
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. what label do we want to predict?
label_type = "question_class"

# 3. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)

# 4. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)

# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# 7. fine-tune the model, but reduce the vocabulary for faster training
trainer.fine_tune(
    "resources/taggers/question-classification-with-transformer",
    reduce_transformer_vocab=True,  # set this to False for slow version
)
```

Involved PR: add reduce transformer vocab plugin by @helpmefindaname in https://github.com/flairNLP/flair/pull/3217
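
The idea of temporarily shrinking the vocabulary can be sketched with plain lists standing in for an embedding matrix. This is a conceptual illustration under simplifying assumptions, not the implementation used by transformer-smaller-training-vocab: `reduce_vocab` and `restore_vocab` are hypothetical helpers, and a real model would also need its tokenizer remapped.

```python
def reduce_vocab(embedding_table, token_ids_in_corpus):
    """Keep only the embedding rows whose ids occur in the corpus.

    Returns the reduced table and a mapping old_id -> new_id.
    """
    used_ids = sorted(set(token_ids_in_corpus))
    old_to_new = {old: new for new, old in enumerate(used_ids)}
    reduced = [list(embedding_table[old]) for old in used_ids]  # copy the rows
    return reduced, old_to_new


def restore_vocab(full_table, reduced_table, old_to_new):
    """Write the (possibly updated) reduced rows back into a copy of the full table."""
    restored = [list(row) for row in full_table]
    for old, new in old_to_new.items():
        restored[old] = list(reduced_table[new])
    return restored


# A toy "embedding matrix" with 6 token ids and 2 dimensions
full_table = [[float(i), float(i)] for i in range(6)]
corpus_ids = [4, 1, 4, 2]  # only ids 1, 2 and 4 occur in the training data

reduced, mapping = reduce_vocab(full_table, corpus_ids)
print(len(full_table), len(reduced))  # 6 3

# Pretend that training updated the row of token id 4
reduced[mapping[4]][0] += 0.5

restored = restore_vocab(full_table, reduced, mapping)
print(restored[4])  # [4.5, 4.0]
print(restored[0])  # [0.0, 0.0], untouched because id 0 never occurred
```

During training only the reduced table has to be held and updated, which is where the memory saving comes from; rows of unused tokens are carried over unchanged when the full vocabulary is restored.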

New Feature: Trainer Plugins

A new "Plugin" system was added to the ModelTrainer, allowing far greater options to customize the training cycle (and slimming down the code of the ModelTrainer somewhat). For instance, it is now possible to customize logging to a far greater degree and integrate third-party logging tools.

For instance, if you want to integrate ClearML logging into the above script, simply instantiate the plugin and attach it to the trainer:

```python
import clearml
from flair.trainers.plugins import ClearmlLoggerPlugin

# [...] steps 1 to 5 as in the previous example

# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# NEW: instantiate a special logger and attach it to the trainer before the training run
ClearmlLoggerPlugin(clearml.Task.init(project_name="test", task_name="test")).attach_to(trainer)

# 7. fine-tune the model, but reduce the vocabulary for faster training
trainer.fine_tune(
    "resources/taggers/question-classification-with-transformer",
    reduce_transformer_vocab=True,  # set this to False for slow version
)
```

Involved PRs:

  • Proposal: Pluggable ModelTrainer train function by @plonerma in https://github.com/flairNLP/flair/pull/3084
  • Major refactoring of ModelTrainer by @alanakbik in https://github.com/flairNLP/flair/pull/3182
  • Allow users to use no scheduler and use a custom scheduling plugin by @plonerma in https://github.com/flairNLP/flair/pull/3200
  • Don't pickle classes & plugins in modelcard by @helpmefindaname in https://github.com/flairNLP/flair/pull/3325
  • Clearml logger by @helpmefindaname in https://github.com/flairNLP/flair/pull/3259
  • Add a convenience conversion for flair.device by @alanakbik in https://github.com/flairNLP/flair/pull/3350
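
The general shape of such a plugin system can be sketched without Flair. In the toy example below, only the `attach_to` call mirrors the usage shown in the snippet above; `Plugin`, `MiniTrainer`, `HistoryLogger` and the `after_epoch` hook are hypothetical names and do not describe Flair's actual classes or events.

```python
class Plugin:
    """Base class: subclasses override only the hooks they need."""

    def attach_to(self, trainer):
        trainer.plugins.append(self)

    def after_epoch(self, epoch, loss):
        pass


class MiniTrainer:
    """Hypothetical trainer that dispatches events to attached plugins."""

    def __init__(self):
        self.plugins = []

    def train(self, losses):
        for epoch, loss in enumerate(losses, start=1):
            # ... a real training step would run here ...
            for plugin in self.plugins:
                plugin.after_epoch(epoch, loss)


class HistoryLogger(Plugin):
    """Example plugin that records the loss after every epoch."""

    def __init__(self):
        self.records = []

    def after_epoch(self, epoch, loss):
        self.records.append((epoch, loss))


trainer = MiniTrainer()
logger = HistoryLogger()
logger.attach_to(trainer)

trainer.train([0.9, 0.5, 0.3])  # stand-in loss values
print(logger.records)  # [(1, 0.9), (2, 0.5), (3, 0.3)]
```

Because the trainer only knows about hooks, logging to a third-party tool becomes a matter of writing another plugin rather than changing the training loop.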

API Docs and other documentation

We are working towards improving our documentation. A first step was the release of our tutorial page. Now, we are adding (in beta) online API docs to make navigating the code and options offered by Flair easier. To enable it, we changed all docstrings to Google docstrings. However, this process is still ongoing, so expect the API docs to improve in coming versions of Flair.

You can find the API docs here: https://flairnlp.github.io/flair/master/api/index.html

Involved PRs:

  • Creating a doc page with autodocs by @helpmefindaname in https://github.com/flairNLP/flair/pull/3273
  • Google doc strings by @helpmefindaname in https://github.com/flairNLP/flair/pull/3164
  • Add redirects to old tutorials by @alanakbik in https://github.com/flairNLP/flair/pull/3211
  • Add some more documentation and (rather empty) glossary page by @helpmefindaname in https://github.com/flairNLP/flair/pull/3339
  • Update README.md by @eltociear in https://github.com/flairNLP/flair/pull/3241
  • Fix embedding finetuning tutorial by @helpmefindaname in https://github.com/flairNLP/flair/pull/3301
  • Fix build doc page action trigger by @helpmefindaname in https://github.com/flairNLP/flair/pull/3319
  • Reduce gh-actions diskspace by @helpmefindaname in https://github.com/flairNLP/flair/pull/3327
  • Orange secondary color by @helpmefindaname in https://github.com/flairNLP/flair/pull/3321
  • Bump Flair and Python versions by @alanakbik in https://github.com/flairNLP/flair/pull/3355

Model Refactorings

In an effort to unify class names, we now offer models that inherit from DefaultClassifier for each label type we predict, i.e.:

  • TokenClassifier for predicting Token labels
  • TextPairClassifier for predicting TextPair labels
  • RelationClassifier for predicting Relation labels
  • SpanClassifier for predicting Span labels
  • TextClassifier for predicting Sentence labels

An advantage of such a structure is that most functionality (such as new decoders) needs to only be implemented once in DefaultClassifier and then is immediately usable for all model classes.

To enable this, we renamed and extended WordTagger as TokenClassifier, and renamed EntityLinker to SpanClassifier. This is not a breaking change yet, as the old names are still available. In the future, however, WordTagger and EntityLinker will be removed.

Involved PRs:

  • TokenClassifier model by @alanakbik in https://github.com/flairNLP/flair/pull/3203
  • Rename EntityLinker and remove some legacy embeddings by @alanakbik in https://github.com/flairNLP/flair/pull/3295
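
The benefit of a shared base class can be illustrated with a toy hierarchy. This sketch is not Flair code: `BaseClassifier`, `TokenLevel`, `SentenceLevel` and `length_decoder` are hypothetical names. The point is only that a decoder plugged into the base class works for every subclass, because subclasses differ solely in which data points they extract.

```python
class BaseClassifier:
    """Hypothetical stand-in for a shared base class with a pluggable decoder."""

    def __init__(self, decoder):
        self.decoder = decoder  # any callable mapping a data point to a label

    def data_points(self, sentence):
        raise NotImplementedError

    def predict(self, sentence):
        # Prediction logic lives here once and is inherited by all subclasses
        return [(dp, self.decoder(dp)) for dp in self.data_points(sentence)]


class TokenLevel(BaseClassifier):
    """Labels every whitespace-separated token."""

    def data_points(self, sentence):
        return sentence.split()


class SentenceLevel(BaseClassifier):
    """Labels the sentence as a whole."""

    def data_points(self, sentence):
        return [sentence]


def length_decoder(text):
    """Toy decoder: label by string length."""
    return "long" if len(text) > 4 else "short"


print(TokenLevel(length_decoder).predict("Flair tags text"))
# [('Flair', 'long'), ('tags', 'short'), ('text', 'short')]
print(SentenceLevel(length_decoder).predict("Flair tags text"))
# [('Flair tags text', 'long')]
```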

New Models

We also add two new model classes: (1) a TextPairRegressor for regression tasks on pairs of sentences (such as STS-B), and (2) an experimental Label Encoder method for few-shot classification.

Involved PRs:

  • Add TextPair regression model by @plonerma in https://github.com/flairNLP/flair/pull/3202
  • Add dual encoder by @whoisjones in https://github.com/flairNLP/flair/pull/3208
  • Adapt LabelVerbalizer so that it also works for non-BIOES span labels by @alanakbik in https://github.com/flairNLP/flair/pull/3231

New Datasets

  • Integrate BigBio NER data sets into HunFlair by @mariosaenger in https://github.com/flairNLP/flair/pull/3146
  • Add datasets STS-B and SST-2 to flair by @plonerma in https://github.com/flairNLP/flair/pull/3201
  • Extend German LER Dataset by @stefan-it in https://github.com/flairNLP/flair/pull/3288
  • Add support for MasakhaPOS Dataset by @stefan-it in https://github.com/flairNLP/flair/pull/3247
  • Gh3275: sample_missing_splits in SST-2 by @plonerma in https://github.com/flairNLP/flair/pull/3276
  • Add German MobIE NER Dataset by @stefan-it in https://github.com/flairNLP/flair/pull/3351

Build Process

  • Use ruff instead of flake8 and isort by @Lingepumpe in https://github.com/flairNLP/flair/pull/3213
  • Update mypy by @Lingepumpe in https://github.com/flairNLP/flair/pull/3210
  • Use poetry instead of pipenv for developer/testing by @Lingepumpe in https://github.com/flairNLP/flair/pull/3214
  • Remove poetry by @helpmefindaname in https://github.com/flairNLP/flair/pull/3258

Bug Fixes

  • Fix serialization of config in transformers by @helpmefindaname in https://github.com/flairNLP/flair/pull/3178
  • Add stacklevel to log_line in order to display correct file and line number (backwards compatible) by @plonerma in https://github.com/flairNLP/flair/pull/3175
  • Fix tars loading by @helpmefindaname in https://github.com/flairNLP/flair/pull/3212
  • Fix best epoch score update by @lephong in https://github.com/flairNLP/flair/pull/3220
  • Fix loading of (not so) old models by @helpmefindaname in https://github.com/flairNLP/flair/pull/3229
  • Fix false warning for "An empty Sentence was created!" by @AbdiHaryadi in https://github.com/flairNLP/flair/pull/3268
  • Fix bug with sentences that do not contain a single valid transformer token by @helpmefindaname in https://github.com/flairNLP/flair/pull/3230
  • Fix loading of old models by @helpmefindaname in https://github.com/flairNLP/flair/pull/3228
  • Fix multiple arguments destination by @helpmefindaname in https://github.com/flairNLP/flair/pull/3272
  • Support transformers 4310 by @helpmefindaname in https://github.com/flairNLP/flair/pull/3289
  • Fix import error by @helpmefindaname in https://github.com/flairNLP/flair/pull/3336

Enhancements

  • Bump min version to 3.8 by @helpmefindaname in https://github.com/flairNLP/flair/pull/3297
  • Use torch native amp by @helpmefindaname in https://github.com/flairNLP/flair/pull/3128
  • Unpin gdown dependency by @helpmefindaname in https://github.com/flairNLP/flair/pull/3176
  • get_spans_from_bio: Start new span for previous S- if class also changed by @Lingepumpe in https://github.com/flairNLP/flair/pull/3195
  • Include flair/py.typed and requirements.txt in source distribution by @dobbersc in https://github.com/flairNLP/flair/pull/3206
  • Better tars inference by @helpmefindaname in https://github.com/flairNLP/flair/pull/3222
  • prevent fasttext embeddings to be stored separately by @helpmefindaname in https://github.com/flairNLP/flair/pull/3293
  • recreate to_dict and add relations by @helpmefindaname in https://github.com/flairNLP/flair/pull/3271
  • github: bug report description should be textarea by @stefan-it in https://github.com/flairNLP/flair/pull/3181
  • Making gradient clipping optional & max gradient norm variable by @plonerma in https://github.com/flairNLP/flair/pull/3240
  • Save final model only if save_final_model is True (even if the training is interrupted) by @plonerma in https://github.com/flairNLP/flair/pull/3251
  • Fix inconsistency between best path and scores in ViterbiDecoder by @mauryaland in https://github.com/flairNLP/flair/pull/3189
  • Add action to remove Awaiting Response label when a response was made by @helpmefindaname in https://github.com/flairNLP/flair/pull/3300
  • Add onnx session config by @helpmefindaname in https://github.com/flairNLP/flair/pull/3302
  • Feature jsonldataset metadata by @helpmefindaname in https://github.com/flairNLP/flair/pull/3349

Breaking Changes

  • Removed the following legacy embeddings, as their support was dropped long ago:
    • XLNetEmbeddings
    • XLMEmbeddings
    • OpenAIGPTEmbeddings
    • OpenAIGPT2Embeddings
    • RoBERTaEmbeddings
    • CamembertEmbeddings
    • XLMRobertaEmbeddings
    • BertEmbeddings

    You can use TransformerWordEmbeddings or TransformerDocumentEmbeddings instead.
  • Removed ELMoTransformerEmbeddings, as allennlp is no longer maintained.
  • Removed the flair.hyperparameter module. We recommend using the hyperparameter optimizer of your choice as an external module; for example, see here how to fine-tune Flair models with the Hugging Face AutoTrain SpaceRunner.
  • Dropped the trainer.resume(...) functionality. Similarly to the flair.hyperparameter module, this functionality was dropped due to the trainer rework.
  • Changes to the trainer.train(...) and trainer.fine_tune(...) parameters:
    • monitor_train: bool was replaced by monitor_train_sample: float. This allows you to specify the fraction of training data points used for monitoring (setting monitor_train_sample=1.0 is equivalent to the previous behaviour of monitor_train=True).
    • eval_on_train_fraction is removed in favour of monitor_train_sample (see monitor_train above).
    • eval_on_train_shuffle is removed.
    • anneal_with_prestarts and batch_growth_annealing have been removed.
    • num_workers has been removed. A single worker is now always used for data loading, as it is the fastest option for in-memory datasets.
    • checkpoint has been removed as parameter. You can use the CheckpointPlugin for the same behaviour.
    • cycle_momentum has been removed, as schedulers have been moved to Plugins.
    • param_selection_mode has been removed, similar to the hyperparameter optimization.
    • optimizer_state_dict and scheduler_state_dict were removed as part of the resume functionality.
    • anneal_against_dev_loss has been dropped, as annealing now always goes against the metric specified by main_evaluation_metric.
    • use_swa has been removed.
    • use_tensorboard, tensorboard_comment, tensorboard_log_dir and metrics_for_tensorboard are removed in favour of the TensorboardLogger plugin.
    • amp_opt_level is removed, as we moved to the torch integration.
  • WordTagger has been deprecated, as it was renamed to TokenClassifier.
  • EntityLinker has been deprecated, as it was renamed to SpanClassifier.

New Contributors

  • @lephong made their first contribution in https://github.com/flairNLP/flair/pull/3220
  • @AbdiHaryadi made their first contribution in https://github.com/flairNLP/flair/pull/3268
  • @eltociear made their first contribution in https://github.com/flairNLP/flair/pull/3241

Full Changelog: https://github.com/flairNLP/flair/compare/v0.12.2...v0.13.0

- Python
Published by helpmefindaname over 2 years ago

flair - Release 0.12.2

Another follow-up release to 0.12 that fixes several bugs and adds a new multilingual frame tagger. Further, our new documentation website at https://flairnlp.github.io/docs/intro is now online!

New frame tagging model #3172

Adds a new model for detecting PropBank frames. The model is trained using the "FLERT" approach, so it is much stronger than the previous 'frame' model. We also added some training data from the universal proposition bank to improve multilingual frame detection.

Use it like this:

```python
from flair.data import Sentence
from flair.nn import Classifier

# load the large frame model
model = Classifier.load('frame-large')

# English sentence with the verb "return" in two different senses
sentence = Sentence("Dirk returned to Berlin to return his hat.")
model.predict(sentence)
print(sentence)

# German sentence with the verb "trug" in two different senses
sentence_de = Sentence("Dirk trug einen Koffer und trug einen Hut.")
model.predict(sentence_de)
print(sentence_de)
```

This should print:

```console
Sentence[9]: "Dirk returned to Berlin to return his hat." → ["returned"/return.01, "return"/return.02]

Sentence[9]: "Dirk trug einen Koffer und trug einen Hut." → ["trug"/carry.01, "trug"/wear.01]
```

The printout tells us that the verbs in both sentences are correctly disambiguated.

Documentation

  • adds a pointer to the new Flair documentation website at https://flairnlp.github.io/docs/intro
  • adds a night mode Flair logo #3145

Enhancements / New Features

  • more consistent behavior of context dropout and FLERT token #3168
  • setting the device through an environment variable #3148 (thanks @HallerPatrick)
  • modify Sentence.to_original_text() to take into account Sentence.start_position for whitespace calculation #3150 (thanks @mauryaland)
  • gather dev and test labels if the dataset is available #3162 (thanks @helpmefindaname)

Bug fixes

  • fix bugs caused by wrong data point equality and caching #3157
  • fix transformer smaller training vocab #3155 (thanks @helpmefindaname)
  • update scispacy version #3144 (thanks @mariosaenger)
  • unpin huggingface-hub #3149 (thanks @marctorsoc)

- Python
Published by alanakbik almost 3 years ago

flair - Release 0.12.1

This is a quick follow-up release to 0.12 that fixes a few small bugs and includes an improved version of our Zelda entity linker.

New Entity Linking model

We include a new version of our Zelda entity linker with improved predictions. Try it as follows:

```python
from flair.nn import Classifier
from flair.data import Sentence

# load the model
tagger = Classifier.load('linker')

# make a sentence
sentence = Sentence('Kirk and Spock met on the Enterprise.')

# predict NER tags
tagger.predict(sentence)

# print predicted entities
for label in sentence.get_labels():
    print(label)
```

This should print:

```console
Span[0:1]: "Kirk" → James_T._Kirk (0.9969)
Span[2:3]: "Spock" → Spock (0.9971)
Span[6:7]: "Enterprise" → USS_Enterprise_(NCC-1701-D) (0.975)
```

This correctly indicates that the span "Kirk" points to "James_T._Kirk". As the prediction for the string "Enterprise" shows, the model is still beta and will be further improved with future releases.

Bug fixes

  • make transformer training vocab optional #3132
  • change token.get_tag() to token.get_label() #3135
  • update required version of transformers library #3138
  • update HunFlair tutorial to Flair 0.12 #3137

- Python
Published by alanakbik almost 3 years ago

flair - Release 0.12

Release 0.12 is out! This release greatly simplifies model usage for our users, includes our first entity linking model, adds support for the Ukrainian language, adds easy-to-use multitask learning, and many more features, improvements and bug fixes!

New Features

Simplify Flair model usage #3067

You can now load any Flair model through its parent class. Since most models inherit from Classifier, you can load and run multiple different models with exactly the same code. So, to run three different taggers for sentiment, entities and frames, do:

```python
from flair.data import Sentence
from flair.nn import Classifier

# load three taggers to tag entities, frames and sentiment
tagger_1 = Classifier.load('ner')
tagger_2 = Classifier.load('frame')
tagger_3 = Classifier.load('sentiment')

# example sentence
sentence = Sentence('Dirk celebrated in Essen')

# predict with all three models
tagger_1.predict(sentence)
tagger_2.predict(sentence)
tagger_3.predict(sentence)

# print all predictions
for label in sentence.get_labels():
    print(label)
```

With this change, users no longer need to know which model classes implement which model. For more advanced users who do know this, the regular way for loading a model still works:

```python
from flair.models import TextClassifier

sentiment_tagger = TextClassifier.load('sentiment')
```

Entity Linking (BETA)

As of Flair 0.12 we ship an experimental entity linker trained on the Zelda dataset. The linker not only tags entities, but also attempts to link each entity to the corresponding Wikipedia URL if one exists.

To illustrate, let's use a short example text with two mentions of "Barcelona". The first refers to the football club "FC Barcelona", the second to the city "Barcelona".

```python
from flair.nn import Classifier
from flair.data import Sentence

# load the model
tagger = Classifier.load('linker')

# make a sentence
sentence = Sentence('Bayern played against Barcelona. The match took place in Barcelona.')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence)
```

This should print:

```console
Sentence[12]: "Bayern played against Barcelona. The match took place in Barcelona." → ["Bayern"/FC_Bayern_Munich, "Barcelona"/FC_Barcelona, "Barcelona"/Barcelona]
```

As we can see, the linker can resolve what the two mentions of "Barcelona" refer to:

  • the first mention "Barcelona" is linked to "FC_Barcelona"
  • the second mention "Barcelona" is linked to "Barcelona"

Additionally, the mention "Bayern" is linked to "FC_Bayern_Munich", telling us that here the football club is meant.

Entity linking support includes:

  • Support for the ZELDA candidate lists #3108 #3111
  • Support for the ZELDA training and evaluation dataset #3088

Support for Ukrainian language #3026

This version adds support for Ukrainian taggers, embeddings and datasets. For instance, to do NER and POS tagging of a Ukrainian sentence, do:

```python
# Load Ukrainian NER and POS taggers
from flair.models import SequenceTagger

ner_tagger = SequenceTagger.load('ner-ukrainian')
pos_tagger = SequenceTagger.load('pos-ukrainian')

# Tag a sentence
from flair.data import Sentence
sentence = Sentence("Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди.")

ner_tagger.predict(sentence)
pos_tagger.predict(sentence)

print(sentence)
# "Сьогодні в Знам’янці проживають нащадки поета — родина Шкоди." →
# ["Сьогодні"/ADV, "в"/ADP, "Знам’янці"/LOC, "Знам’янці"/PROPN, "проживають"/VERB, "нащадки"/NOUN, "поета"/NOUN, "—"/PUNCT, "родина"/NOUN, "Шкоди"/PERS, "Шкоди"/PROPN, "."/PUNCT]
```

Multitask Learning (#2910 #3085 #3101)

We add support for multitask learning in Flair (closes #2508 and closes #1260) with hopefully a simple syntax to define multiple tasks that share parts of the model.

The most common part to share is the transformer, which you might want to fine-tune across several tasks. Instantiate a transformer embedding and pass it to two separate models that you instantiate as before:

```python
from flair.datasets import SENTEVAL_CR, SENTEVAL_SST_GRANULAR
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.nn.multitask import make_multitask_model_and_corpus
from flair.trainers import ModelTrainer

# --- Embeddings that are shared by both models ---
shared_embedding = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)

# --- Task 1: Sentiment Analysis (5-class) ---
corpus_1 = SENTEVAL_SST_GRANULAR()

model_1 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_1.make_label_dictionary("class"),
                         label_type="class")

# --- Task 2: Binary Sentiment Analysis on Customer Reviews ---
corpus_2 = SENTEVAL_CR()

model_2 = TextClassifier(shared_embedding,
                         label_dictionary=corpus_2.make_label_dictionary("sentiment"),
                         label_type="sentiment",
                         )

# --- Define mapping (which tagger should train on which corpus) ---
multitask_model, multicorpus = make_multitask_model_and_corpus(
    [
        (model_1, corpus_1),
        (model_2, corpus_2),
    ]
)

# --- Create model trainer and train ---
trainer = ModelTrainer(multitask_model, multicorpus)
trainer.fine_tune("resources/taggers/multitask_test")
```

The mapping part here defines which tagger should be trained on which corpus. By calling make_multitask_model_and_corpus with a mapping, you get a corpus and model object that you can train as before.

Explicit context boundaries in Transformer embeddings #3073 #3078

We improve our FLERT model by now explicitly marking up context boundaries using a new [FLERT] special token in our transformer embeddings. Our experiments show that the context marker leads to improved NER results:

| Transformer | Context-Marker | CoNLL-03 Test F1 |
|----------|:-------------:|------:|
| bert-base-uncased | none | 91.52 +- 0.16 |
| | [SEP] | 91.38 +- 0.18 |
| | [FLERT] | 91.56 +- 0.17 |
| xlm-roberta-large | none | 93.73 +- 0.2 |
| | [SEP] | 93.76 +- 0.13 |
| | [FLERT] | 93.92 +- 0.14 |

In the table, none is the approach used in previous Flair versions. [SEP] means using the standard separator symbol as context delimiter. [FLERT] means using a new dedicated special token.

As [FLERT] performs best in our experiments, the [FLERT] context marker is now activated by default.

More details: Assume the current sentence is Peter Blackburn and the previous sentence ends with to boycott British lamb ., while the next sentence starts with BRUSSELS 1996-08-22 The European Commission.

In this case:

  1. if use_context_separator=False, the embedding is produced from this string: to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission
  2. if use_context_separator=True, the embedding is produced from this string: to boycott British lamb . [FLERT] Peter Blackburn [FLERT] BRUSSELS 1996-08-22 The European Commission
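The two cases can be reproduced with a few lines of plain Python. This is only a string-level sketch of the idea, not Flair's implementation: the function name `build_context_input` and its parameters are made up for illustration, and the real embeddings operate on subtokens rather than whitespace-joined strings.

```python
def build_context_input(left_context, sentence, right_context,
                        use_context_separator=True, marker="[FLERT]"):
    """Join left context, current sentence and right context into one string."""
    if use_context_separator:
        # mark both boundaries of the current sentence with the special token
        parts = [left_context, marker, sentence, marker, right_context]
    else:
        parts = [left_context, sentence, right_context]
    # skip empty pieces so that no double spaces appear
    return " ".join(part for part in parts if part)


left = "to boycott British lamb ."
current = "Peter Blackburn"
right = "BRUSSELS 1996-08-22 The European Commission"

print(build_context_input(left, current, right, use_context_separator=False))
print(build_context_input(left, current, right, use_context_separator=True))
```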

Integrate transformer-smaller-training-vocab #3066

We integrate the transformer-smaller-training-vocab library into the ModelTrainer. With it, you can reduce the size of transformer models when training and evaluating models on specific datasets. This leads to faster training times and a smaller memory footprint. Documentation on this new feature will be added soon!

Masked Relation Classifier #2748 #2993 with various Encoding Strategies #3023 (BETA)

We now include BETA support for a new type of relation extraction model that leads to much higher accuracies than our vanilla relation extraction, but increases computational costs. Documentation for this will be added as we iterate on the model.

ONNX compatible models #2640 #2643 #3041 #3075

This release continues the journey on making our models more ONNX compatible.

Other features

  • Add push to Hub functionalities #2897
  • Add layoutlm and layoutxlm support and the SROIE dataset #2980
  • Convenience method for learning rate factor #2888 #2893

New Datasets

  • Add fewnerd corpus #3103
  • Add support for NERMuD 2023 Dataset #3087
  • Adds ZELDA Entity Linking dataset #3088
  • Added Ukrainian NER and UD datasets #3069
  • Add support for the MasakhaNER v2 dataset #3013
  • Add support for MultiCoNerV2 #3006
  • Add support for new ICDAR Europeana NER Dataset #2911
  • datasets: add support for HIPE-2022 #2735 #2827 #2805

Major refactorings

  • Unify loss reduction by making sure that all losses are summed over all points, instead of averaged #2933 #2910
  • Python 3.7 #2769
  • Flatten DefaultClassifier interface #2978
  • Restructure Tokenizer and Splitter modules #3002
  • Refactor Token and Sentence Positional Properties #3001
  • Serialization of embeddings #3011
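The loss-reduction item in this list (summing over all points instead of averaging) is easiest to see with numbers. The release notes do not spell out the motivation, so the sketch below only illustrates one general property and is not Flair code: averaging per-batch means depends on how points are split into batches, while summing first and normalizing once does not. Both function names are illustrative.

```python
def mean_of_batch_means(batches):
    """Average each batch first, then average the batch means."""
    batch_means = [sum(batch) / len(batch) for batch in batches]
    return sum(batch_means) / len(batch_means)


def summed_then_normalized(batches):
    """Sum losses over all points, then divide once by the number of points."""
    total_loss = sum(sum(batch) for batch in batches)
    total_points = sum(len(batch) for batch in batches)
    return total_loss / total_points


per_point_losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
even_batches = [per_point_losses[:3], per_point_losses[3:]]
uneven_batches = [per_point_losses[:4], per_point_losses[4:]]

print(mean_of_batch_means(even_batches), summed_then_normalized(even_batches))
print(mean_of_batch_means(uneven_batches), summed_then_normalized(uneven_batches))
```

With equally sized batches both reductions agree; with a smaller final batch only the summed variant still equals the true per-point mean.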

Various Improvements

Enhancements

  • add functionality for using proxies #3082
  • add option not to shuffle the first epoch #3076
  • improved Tars Context #3063
  • release optimizer memory and fix legacy tokenization #3043
  • add time elapsed to training printout #2983
  • separate between token-lengths and sub-token lengths #2990
  • small speed optimizations #2975
  • change output of .text to original string #2974
  • remove BAD_EPOCHS printout for most schedulers #2970
  • warn if resuming with too low max_epochs & 'additional_epochs' parameter #2895
  • embeddings: add support for T5 encoder models #2896
  • add py.typed file for PEP-561 compatibility #2858
  • tars classifier always predict something on single label #2838
  • make add_unk optional and don't use it for ner #2839
  • add deprecation warning for SentenceDataset rename #2819
  • more precise type hint for eval_on_train_fraction #2811
  • better handling for consecutive whitespaces in Sentence #2721 (already in flair 0.11.3)
  • remove unnecessary more-itertools pin #2730 (already in flair 0.11.3)
  • add exclude_labels parameter to trainer.train #2724 (already in flair 0.11.3)
  • add option to force token-level predictions in SequenceTagger #2750 (already in flair 0.11.3)

Build

  • unified test classes, to ensure that all models & embeddings have tested the basic functionality #2981
  • add missing dependency pre-commit to requirements-dev.txt #3093
  • fix pre-commit bug by upgrading to isort 5.11.5 #3106 #3107
  • update pytest and flake8 versions #2741
  • pytest flake precommit update #2820
  • pin flake8 to v4 #2892
  • specify test paths #2932
  • pin versions for unit tests #2994
  • unit tests: Set a seed so test_train_load_use_classifier doesn't randomly fail #2834
  • replace issue templates with issue forms #3051
  • github actions cache #2753 (already in flair 0.11.3)

Documentation

  • Add Missing Import to Tutorial 5 #2902
  • Documentation pointers #2927
  • readme: fix BibTeX for FLERT paper #2806 #2821
  • docs: mention HIPE-2022 in corpus tutorial #2807

Code improvements

  • add return types to Model and Classifier #3121
  • removed undefined names #3054 #3056
  • add docstrings missing for ModelTrainer.train() parameters #2961
  • remove "tag_to_bioes" (Sequence) Corpus parameter, as it is not used #2812
  • update hf-hub version #2837
  • use transformers sentencepiece requirement #2835
  • replace deprecated logging.warn with logging.warning #2829
  • various mypy issues #2822 #2845 #2905
  • removed some model classes that were very beta: the DependencyParser, the DistancePredictor and the SimilarityLearner. #2910
  • remove legacy TransformerXLEmbeddings class #2768 (already in flair 0.11.3)

Bug fixes

  • fix train error missing dev split #3115
  • fix Avg Pooling in the Entity Linker #3123
  • call super().__setstate__() in Embeddings #3057
  • remove konoha from requirements.txt #3060
  • fix label alignment if the sentence contains invalid tokens #3052
  • change indexing in TARSTagger predict #3058
  • fix training sample count in UD English #3044
  • fix comment parsing for conllu datasets #3020
  • HunFlair: Fix loading of datasets #3030 #3029
  • persist needs_manual_ocr #3012
  • save initial hidden states in sequence tagger #3010
  • do not save Path objects to model cards #2998
  • make JsonlCorpus create span labels #2863
  • JsonlDataset: Fix code that claims to set "O" labels to actually set them #2817
  • relationClassifier fix #2986
  • fix problem in loading TARSClassifier #2987
  • add missing tab for tensorboard #2922
  • fast tokenizer reload fix pt.2: Bloom model #2904
  • fix transformer embeddings for sentence with trailing whitespace #2891
  • added label_name parameter to render_ner_html #2850
  • allow BIO evaluation on sequence tagger #2787
  • refactorings for initialization from state dict #2846
  • save and load "tag_format" for sequence tagger model #2840
  • do not remove other labels of sentence for set_label on Token and Span #2831
  • fix left-over cases of token.get_tag(), which was renamed #2815
  • remove wrong boolean check for loading datasets RE_ENGLISH_CONLL04 #2779
  • added missing property decorator in PooledFlairEmbeddings #2744 (already in flair 0.11.3)
  • fix wrong initialisations of label (where data_type was missing) #2731 (already in flair 0.11.3)
  • update gdown requirement, fix download for dataset NER_MULTI_WIKIANN #2757 (already in flair 0.11.3)
  • make Span detection more robust #2752 (already in flair 0.11.3)

- Python
Published by alanakbik almost 3 years ago

flair - Release 0.11

Release 0.11 is taking us ever closer to that 1.0 release! This release makes large internal refactorings and code quality / efficiency improvements to prepare Flair 1.0. We also add new features such as text clustering, a regular expression tagger, more dataset manipulation options, and some preview features like a prototype decoder.

New Features

Regular Expression Tagger (#2533)

You can now do sequence labeling in Flair with regular expressions! Simply define a RegexpTagger and add some regular expressions, like in the example below:

```python
from flair.data import Sentence
from flair.models import RegexpTagger

# sentence with a number and two quotes
sentence = Sentence('Figure 11 is both "too colorful" and "not informative enough".')

# instantiate regex tagger with a quote matching pattern
tagger = RegexpTagger(mapping=(r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'))

# also add a number mapping
tagger.register_labels(mapping=(r'\b\d+\b', 'NUMBER'))

# tag sentence
tagger.predict(sentence)

# check out matches
for entity in sentence.get_labels():
    print(entity)
```
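To see what a quote pattern and a number pattern of this kind match on the example sentence, here is a standalone sketch that uses only Python's `re` module. It is not how RegexpTagger is implemented (the tagger maps matches onto token spans); the helper name `tag_with_regex` is made up for illustration, and the quote pattern is the common form with an optional escaping backslash.

```python
import re


def tag_with_regex(text, mapping):
    """Return (matched_text, label) pairs for every pattern, ordered by position."""
    matches = []
    for pattern, label in mapping:
        for match in re.finditer(pattern, text):
            matches.append((match.start(), match.group(0), label))
    matches.sort()
    return [(matched_text, label) for _, matched_text, label in matches]


mapping = [
    (r'(["\'])(?:(?=(\\?))\2.)*?\1', 'QUOTE'),  # quoted strings
    (r'\b\d+\b', 'NUMBER'),                      # standalone integers
]
text = 'Figure 11 is both "too colorful" and "not informative enough".'

for matched_text, label in tag_with_regex(text, mapping):
    print(f"{matched_text} -> {label}")
```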

Clustering with Flair (#2573 #2619)

Flair now supports clustering by way of sklearn. Embed your sentences with a pre-trained embedding like below, then cluster them with any algorithm. Check the example below where we use sentence transformers and k-means clustering. A 'trained' clustering model can be saved and loaded for prediction, just like any other Flair classifier:

```python
from sklearn.cluster import KMeans

from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel

embeddings = SentenceTransformerDocumentEmbeddings()

# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)

clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)

# fit the model on a corpus
clustering_model.fit(corpus)

# save the model
clustering_model.save(model_file="clustering_model.pt")

# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")

# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')

# predict for sentence
model.predict(sentence)

# print sentence with prediction
print(sentence)
```

Dataset Manipulations

You can now change label names, ignore labels and add custom preprocessing when loading a dataset.

For instance, the standard WNUT_17 dataset comes with 7 NER labels:

```python
from flair.datasets import WNUT_17

corpus = WNUT_17(in_memory=False)
print(corpus.make_label_dictionary('ner'))
```

which prints:

```console
Dictionary with 7 tags: <unk>, person, location, group, corporation, product, creative-work
```

With the following code, you rename some labels ('person' is renamed to 'PER'), merge 2 labels into 1 ('group' and 'corporation' are merged into 'ORG'), and ignore 2 other labels ('creative-work' and 'product' are ignored):

```python
corpus = WNUT_17(in_memory=False, label_name_map={
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O',  # by renaming to 'O' this tag gets ignored
})
print(corpus.make_label_dictionary('ner'))
```

which prints:

```console
Dictionary with 4 tags: <unk>, PER, LOC, ORG
```
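The effect of such a label map can be mimicked on a plain list of labels. The sketch below is illustrative only and does not use Flair: `apply_label_name_map` and `label_dictionary` are hypothetical helpers, and the input list is a made-up sample of WNUT_17-style labels.

```python
def apply_label_name_map(labels, label_name_map):
    """Rename labels via the map and drop every label that ends up as 'O'."""
    renamed = (label_name_map.get(label, label) for label in labels)
    return [label for label in renamed if label != 'O']


def label_dictionary(labels):
    """Collect distinct labels in first-seen order, with '<unk>' first."""
    distinct = ['<unk>']
    for label in labels:
        if label not in distinct:
            distinct.append(label)
    return distinct


label_name_map = {
    'person': 'PER',
    'location': 'LOC',
    'group': 'ORG',
    'corporation': 'ORG',
    'product': 'O',
    'creative-work': 'O',
}
raw_labels = ['person', 'location', 'group', 'corporation', 'product', 'creative-work', 'person']

mapped = apply_label_name_map(raw_labels, label_name_map)
print(label_dictionary(mapped))
```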

You can manipulate the data even more with custom preprocessing functions. See the example in #2708.

Other New Features and Data Sets

  • A new WordTagger class for simple word-level predictions (#2607)
  • Classic WordEmbeddings can now be fine-tuned in Flair (#2491) by setting fine_tune=True. Also adds the fine-tuning mode of https://arxiv.org/abs/2110.02861, which seems to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
  • Add NER_MULTI_CONER Dataset (#2507)
  • Add support for HIPE 2022 (#2675)
  • Allow trainer to work with multiple learning rates (#2641)
  • Update hyperparameter tuning (#2633)

Preview Features

Some preview features in beta stage, use at your own risk.

Prototypical networks in Flair (#2627)

Prototype networks learn prototypes for each target class. For each data point to be classified, the network predicts a vector in class-prototype-space, which is then compared to all class prototypes. The prediction is then the closest class prototype. See the paper "Prototypical Networks for Few-shot Learning" for more info.
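The prediction step itself is a nearest-neighbour lookup over class prototypes. A minimal sketch in plain Python, assuming Euclidean distance and hand-picked two-dimensional prototypes (in a real model both the prototypes and the predicted vector are learned; the function name and class labels are illustrative):

```python
import math


def nearest_prototype(vector, prototypes):
    """Return the class whose prototype has the smallest Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(vector, prototypes[label]))


# made-up prototypes in a 2-dimensional class-prototype space
prototypes = {
    "question": [1.0, 0.0],
    "statement": [0.0, 1.0],
    "command": [-1.0, 0.0],
}

print(nearest_prototype([0.9, 0.2], prototypes))   # question
print(nearest_prototype([-0.1, 0.8], prototypes))  # statement
```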

@plonerma implemented a custom decoder that can be added to any Flair model that inherits from DefaultClassifier (i.e. nearly all Flair models). For instance, use this script:

```python
from flair.data import Corpus
from flair.datasets import UP_ENGLISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import WordTagger
from flair.nn import PrototypicalDecoder
from flair.trainers import ModelTrainer

# what tag do we want to predict?
tag_type = 'frame'

# get a corpus
corpus: Corpus = UP_ENGLISH().downsample(0.1)

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

# initialize simple embeddings
embeddings = TransformerWordEmbeddings(model="distilbert-base-uncased",
                                       fine_tune=True,
                                       layers='-1')

# initialize prototype decoder
decoder = PrototypicalDecoder(num_prototypes=len(tag_dictionary),
                              embeddings_size=embeddings.embedding_length,
                              distance_function='euclidean',
                              normal_distributed_initial_prototypes=True,
                              )

# initialize the WordTagger, but pass the prototype decoder
tagger = WordTagger(embeddings,
                    tag_dictionary,
                    tag_type,
                    decoder=decoder)

# initialize trainer
trainer = ModelTrainer(tagger, corpus)

# run training
trainer.fine_tune('resources/taggers/prototypical_decoder')
```

Other Beta features

  • Dependency Parsing in Flair (#2486 #2579)
  • Lemmatization in Flair (#2531)
  • Initial implementation of JsonCorpora and Datasets (#2653)

Major Refactorings

With Flair expanding to many new NLP tasks (relation extraction, entity linking, etc.) and model types, we made a number of refactorings to reduce redundancy and make it easier to extend Flair.

Major refactoring of Label Logic in Flair (#2607 #2609 #2645)

The labeling logic was growing too complex to accommodate new tasks. With this release, we refactored this logic such that complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of label. The Sentence object will now be automatically aware of all labels added to it.

To illustrate the difference, consider a before-and-after of how to add an entity label to a sentence.

Before:

```python
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# create span for "Humboldt Universität zu Berlin"
span = Span(sentence[0:4])

# make a Span-label
span_label = SpanLabel(span=span, value='University')

# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)
```

Now:

```python
# example sentence
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# directly add a label to the span "Humboldt Universität zu Berlin"
sentence[0:4].add_label("ner", "Organization")
```

So you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.

Refactoring of printouts (#2704)

We changed and unified printouts across all Flair data points and labels, and updated the documentation to reflect this. Printouts should hopefully now be more concise. Let us know what you think.

Unified classes to reduce redundancy

Next to too many Label classes (see above), we also had too many corpora that essentially do the same thing, two partially overlapping transformer embedding classes and too much redundancy in our tokenization classes. This release makes many refactorings to make the code more maintainable:

  • Unify Corpora (#2607): Unifies several corpora into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNNLuCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets
  • Unify Transformer Embeddings (#2558, #2584, #2586): There was too much redundancy and inconsistency between the two Transformer-based embeddings classes TransformerWordEmbeddings and TransformerDocumentEmbeddings. Thanks to @helpmefindaname, they now both inherit from the same base object and share all features.
  • Unify Tokenizers (#2607) : The Tokenizer classes no longer return lists of Token, rather lists of strings that the Sentence object converts to tokens, centralizing the offset and whitespace_after detection in one place.
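The tokenizer change in the last item means that offsets are derived in one place from the original text and the token strings. A rough sketch of that idea in plain Python; this is not Flair's Sentence code, and `tokens_with_offsets` and the dictionary keys are illustrative names:

```python
def tokens_with_offsets(text, token_strings):
    """Locate each token string in the text and record offset and whitespace_after."""
    tokens = []
    cursor = 0
    for token in token_strings:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        # a token has whitespace after it if the next character exists and is a space
        whitespace_after = end < len(text) and text[end].isspace()
        tokens.append({"text": token, "start": start, "whitespace_after": whitespace_after})
        cursor = end
    return tokens


# a tokenizer now only needs to return strings such as these
for token in tokens_with_offsets("Berlin is nice.", ["Berlin", "is", "nice", "."]):
    print(token)
```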

Simplifications to DefaultClassifier

The DefaultClassifier is the base class for nearly all models in Flair. With this release, we make a number of simplifications to reduce redundancy across classes and make it more modular:

  • forward_pass simplified to return 3 instead of 4 arguments
  • forward_pass returns embeddings instead of logits, allowing us to easily switch out the decoder (see the Beta feature on Prototype Networks above)
  • removed the unintuitive spawn logic we no longer need due to Label refactoring
  • unify dropouts across all classes (#2669)

Sequence tagger refactoring (#2361 #2550, #2561,#2564, #2585, #2565)

Major refactoring of SequenceTagger for better modularity and code readability.

Refactoring of Span Logic (#2607 #2609 #2645)

Spans are no longer stored as word-level 'bioes' tags, but rather directly stored as span-level annotations. The SequenceTagger will still internally use BIO/BIOES tags, but the corpora and sentences no longer explicitly store this information.

So you now choose the labeling format when instantiating the SequenceTagger, i.e.:

```python
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    tag_format="BIOES",  # choose if you want to use BIOES or BIO internally
)
```
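For readers unfamiliar with the two formats, the following standalone sketch converts span-level annotations into per-token BIO or BIOES tags. It illustrates the tagging schemes only and is not the SequenceTagger's internal code; `spans_to_tags` is a hypothetical helper and spans are given as (start, end, label) with an exclusive end index.

```python
def spans_to_tags(num_tokens, spans, tag_format="BIOES"):
    """Convert (start, end, label) spans, end exclusive, into per-token tags."""
    if tag_format not in ("BIO", "BIOES"):
        raise ValueError(f"unsupported tag format: {tag_format}")
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        length = end - start
        for position in range(start, end):
            if tag_format == "BIOES" and length == 1:
                prefix = "S"  # single-token span
            elif position == start:
                prefix = "B"  # beginning of a span
            elif tag_format == "BIOES" and position == end - 1:
                prefix = "E"  # end of a multi-token span
            else:
                prefix = "I"  # inside a span
            tags[position] = f"{prefix}-{label}"
    return tags


# "Humboldt Universität zu Berlin is located in Berlin ." has 9 tokens
spans = [(0, 4, "ORG"), (7, 8, "LOC")]
print(spans_to_tags(9, spans, tag_format="BIOES"))
print(spans_to_tags(9, spans, tag_format="BIO"))
```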

Internally, this refactoring makes a number of changes and simplifications:

  • a number of fields have been added or moved up to the DataPoint class, for convenience, including properties to get start_position and end_position of datapoints, their text, their tag and score (if they have only one tag) and an unlabeled_identifier
  • moves up set_embedding() and to() from the data point classes (Sentence, Token, etc.) to their parent DataPoint
  • a number of methods like get_tag and add_tag have been removed from Token in favor of the get_label and add_label methods of the parent DataPoint class
  • the ColumnCorpus will automatically identify which columns are span labels and treat them accordingly

Code Quality Checks (#2611)

They are back and more strict than ever! Thanks to @helpmefindaname, we now include mypy and formatting tests as part of our build process, which led to many changes in the code and a much greater chance of catching errors early.

Speed and Memory Improvements:

  • EntityLinker class refactored for speed (#2607)
  • Performance improvements in standard evaluate() method, especially for large datasets (#2607)
  • ColumnCorpus no longer does disk reads when in_memory=False, it simply stores the raw data in memory leading to significant speed-ups on large datasets (#2607)
  • Memory management improvements for embeddings (#2645)
  • Efficiency improvements for WordEmbeddings (#2491) and OneHotEmbeddings (#2490)

Bug Fixes and Improvements

  • Add equality method to Dictionary (#2532)
  • Fix encoding error in lemmatizer (#2539)
  • Fixed printing and logging inconsistencies. (#2665)
  • Readme (#2525 #2618 #2617 #2662)
  • Fix bug in WSD_UFSAC corpus (#2521)
  • change position of model saving in between epochs (#2548)
  • Fix loss weights in TextPairClassifier and RelationExtractor models (#2576)
  • Fix token positions on column corpus (#2440)
  • long sequence transformers of any kind (#2599)
  • The deprecated data_fetcher is finally removed (#2607)
  • Small lm training improvements (#2590)
  • Remove minor bug in NEL_ENGLISH_AIDA corpus (#2615)
  • Fix module import bug (#2616)
  • Fix reloading fast tokenizers (#2622)
  • Fix two small bugs (#2634)
  • Fix .pre-commit-config.yaml (#2651)
  • patch the missing document_delimiter for lm.__getstate__() (#2658)
  • DocumentPoolEmbeddings class can now be instantiated only with a single embedding (#2645)
  • You can now specify a min_count when computing the label dictionary. Labels below that count will be UNK'ed. (e.g. tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)) (#2607)
  • The Dictionary will now compute count statistics for labels in a corpus (#2607)
  • The ColumnCorpus can now handle relation annotation, dependency tree information and UD feats and misc (#2607)
  • Embeddings are stored as a torch Embedding instead of a gensim keyedvector. That way, version issues cannot arise if gensim does not ensure backwards compatibility.
  • Make transformer offset calculation more robust (#2714)
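The min_count option and the count statistics mentioned in this list can be pictured with a small counter-based sketch. It is a simplified stand-in, not Flair's Dictionary: the function names are reused here only for readability and the label counts are made up.

```python
from collections import Counter


def make_label_dictionary(labels, min_count=1):
    """Keep labels seen at least min_count times; the rest will map to '<unk>'."""
    counts = Counter(labels)
    kept = ['<unk>'] + [label for label, count in counts.items() if count >= min_count]
    return kept, counts


def encode(label, dictionary):
    """Return the label itself if it is in the dictionary, else '<unk>'."""
    return label if label in dictionary else '<unk>'


# made-up label occurrences: MISC is rarer than the threshold
labels = ['PER'] * 12 + ['LOC'] * 10 + ['MISC'] * 3
dictionary, counts = make_label_dictionary(labels, min_count=10)

print(dictionary)
print(counts['MISC'], encode('MISC', dictionary))
```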

- Python
Published by alanakbik almost 4 years ago

flair - Release 0.10

This release adds several new features such as in-built "model cards" for all Flair models, the first pre-trained models for Relation Extraction, better support for fine-tuning and a refactoring of the model training methods for more flexibility. It also fixes a number of critical bugs that were introduced by the refactorings in Flair 0.9.

Model Trainer Enhancements

Breaking change: We changed the ModelTrainer such that you now no longer pass the optimizer during initialization. Rather, it is now passed as a parameter of the train or fine_tune method.

Old syntax:

```python
# 1. initialize trainer with AdamW optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)

# 2. run training with small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
              learning_rate=5.0e-5,
              mini_batch_size=4,
              )
```

New syntax (optimizer is parameter of train method):

```python
# 1. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# 2. run training with AdamW, small learning rate and mini-batch size
trainer.train('resources/taggers/question-classification-with-transformer',
              learning_rate=5.0e-5,
              mini_batch_size=4,
              optimizer=torch.optim.AdamW,
              )
```

Convenience function for fine-tuning (#2439)

Adds a fine_tune routine that sets default parameters used for fine-tuning (AdamW optimizer, small learning rate, few epochs, cyclic learning rate scheduling, etc.). Uses the new linear scheduler with warmup (#2415).
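As a rough picture of what a linear schedule with warmup does, the sketch below computes the learning rate per step in plain Python. It is not Flair's scheduler: the function name is illustrative, and the 10% warmup fraction is an assumption chosen for the example.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_fraction=0.1):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    # decay linearly from base_lr to 0 over the remaining steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_lr(step, total_steps=100, base_lr=5.0e-5))
```

The rate peaks at the end of warmup and reaches zero at the final step, which is why this kind of schedule is usually paired with a small number of epochs.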

New syntax with fine_tune method:

```python
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. what label do we want to predict?
label_type = 'question_class'

# 3. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)

# 4. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)

# 6. initialize trainer
trainer = ModelTrainer(classifier, corpus)

# 7. run training with fine-tuning
trainer.fine_tune('resources/taggers/question-classification-with-transformer',
                  learning_rate=5.0e-5,
                  mini_batch_size=4,
                  )
```

Model Cards (#2457)

When you train any Flair model, a "model card" will now automatically be saved that stores all training parameters and versions used to train this model. Later when you load a Flair model, you can print the model card and understand how the model was trained.

The following example trains a small POS tagger and prints the model card at the end:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# initialize corpus and make label dictionary for POS tags
corpus = UD_ENGLISH().downsample(0.01)
tag_type = "pos"
tag_dictionary = corpus.make_label_dictionary(tag_type)

# simple sequence tagger
tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings("glove"),
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# initialize model trainer and experiment path
trainer = ModelTrainer(tagger, corpus)
path = 'resources/taggers/model-card'

# train for a few epochs
trainer.train(path,
              max_epochs=20,
              )

# load best model and print "model card"
trained_model = SequenceTagger.load(path + '/best-model.pt')
trained_model.print_model_card()
```

This should print a model card like:

~~~
--------- Flair Model Card ---------
- this Flair model was trained with:
-- Flair version 0.9
-- PyTorch version 1.7.1
-- Transformers version 4.8.1
------------------------------------
------- Training Parameters: -------
------------------------------------
-- base_path = resources/taggers/model-card
-- learning_rate = 0.1
-- mini_batch_size = 32
-- mini_batch_chunk_size = None
-- max_epochs = 20
-- train_with_dev = False
-- train_with_test = False
[... shortened ...]
------------------------------------
~~~

Resume training any model (#2457)

Previously, we distinguished between checkpoints and model files. Now all models can function as checkpoints, meaning you can load them and continue training them. Say you want to load the model above (trained to epoch 20) and continue training it to epoch 25. Do it like this:

```python
# resume training best model, but this time until epoch 25
trainer.resume(trained_model,
               base_path=path + '-resume',
               max_epochs=25,
               )
```

Pass optimizer and scheduler instance

You can also now pass an initialized optimizer and scheduler to the train and fine_tune methods.

Multi-Label Predictions and Confidence Threshold in TARS models (#2430)

Adding the possibility to set confidence thresholds on multi-label prediction in TARS, and setting whether a problem is single-label or multi-label:

```python
from flair.models import TARSClassifier
from flair.data import Sentence

# 1. Load our pre-trained TARS model for English
tars: TARSClassifier = TARSClassifier.load('tars-base')

# switch to a multi-label task (emotion detection)
tars.switch_to_task('GO_EMOTIONS')

# sentence with two emotions
sentence = Sentence("I am happy and sad")

# predict normally
tars.predict(sentence)
print(sentence)

# predict with lower label threshold (you can set this to 0. to get all labels)
tars.predict(sentence, label_threshold=0.01)
print(sentence)

# predict and enforce a single-label prediction
tars.predict(sentence, label_threshold=0.01, multi_label=False)
print(sentence)
```
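To make the interplay between the label threshold and single-label mode concrete, here is a minimal sketch of the selection logic in plain Python. It is not the TARS code: the helper `select_labels`, its default threshold, the use of `>=` for the comparison, and the scores are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def select_labels(scores: Dict[str, float],
                  label_threshold: float = 0.5,
                  multi_label: bool = True) -> List[Tuple[str, float]]:
    """Pick predicted labels from per-label confidence scores (hypothetical helper)."""
    # rank labels from most to least confident
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # keep every label whose confidence reaches the threshold
    kept = [(label, score) for label, score in ranked if score >= label_threshold]
    # in single-label mode, only the most confident surviving label is kept
    if not multi_label:
        kept = kept[:1]
    return kept


# made-up confidence scores for one sentence
scores = {"joy": 0.62, "sadness": 0.31, "anger": 0.02}

default_prediction = select_labels(scores)
low_threshold_prediction = select_labels(scores, label_threshold=0.01)
single_label_prediction = select_labels(scores, label_threshold=0.01, multi_label=False)
```

Lowering the threshold lets weaker labels through in multi-label mode, while single-label mode returns at most one label regardless of how many pass the threshold.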

Relation Extraction ( #2471 #2492)

We refactored the RelationExtractor for more options, hopefully better code clarity and small speed improvements.

We also added two relation extraction models, trained over a modified version of TACRED: relations and relations-fast. To use these models, you also need an entity tagger. The tagger identifies entities, then the relation extractor predicts possible relations between them.

For instance, use this code:

```python
from flair.data import Sentence
from flair.models import RelationExtractor, SequenceTagger

# 1. make example sentence
sentence = Sentence("George was born in Washington")

# 2. load entity tagger and predict entities
tagger = SequenceTagger.load('ner-fast')
tagger.predict(sentence)

# check which entities have been found in the sentence
entities = sentence.get_labels('ner')
for entity in entities:
    print(entity)

# 3. load relation extractor
extractor: RelationExtractor = RelationExtractor.load('relations-fast')

# predict relations
extractor.predict(sentence)

# check which relations have been found
relations = sentence.get_labels('relation')
for relation in relations:
    print(relation)
```

Embeddings

  • Refactoring of WordEmbeddings to avoid gensim version issues and enable further fine-tuning of pre-trained embeddings (#2491)
  • Refactoring of OneHotEmbeddings to fix errors caused by some corpora and enable "stable embeddings" (#2490)

Other Enhancements and Bug Fixes

  • Compatibility with gensim 4 and Python 3.9 (#2496)
  • Fix TransformerWordEmbeddings if model_max_length not set in Tokenizer (#2502)
  • Fix TransformerWordEmbeddings handling of lang ids (#2417)
  • Fix attention mask for special Transformer architectures (#2485)
  • Fix regression model (#2424)
  • Fix problems caused by refactoring of Dictionary (#2429 #2435 #2453)
  • Fix infinite loop in Span::to_original_text (#2462)
  • Fix result object in ModelTrainer (#2519)
  • Fix bug in wsd_ufsac corpus (#2521)
  • Fix bugs in TARS and simple sequence tagger (#2468)
  • Add Amharic FLAIR EMBEDDING model (#2494)
  • Add MultiCoNer Dataset (#2507)
  • Add Korean Flair Tutorials (#2516 #2517)
  • Remove hyperparameter features (#2518)
  • Make it optional to create logfiles and loss files (#2421)
  • Small simplification of TransformerWordEmbeddings (#2425)

- Python
Published by alanakbik over 4 years ago

flair - Release 0.9

With release 0.9 we are refactoring Flair for simplicity and speed, to make Flair faster and more easily scale to new NLP tasks. The first new tasks included in this release are Relation Extraction (RE), support for GLUE benchmark tasks and Entity Linking - all in beta for early adopters! We're working towards a Flair 1.0 release that will span the whole suite of standard NLP tasks. Also included is a new approach for Zero-Shot Sequence Labeling based on TARS! This release also includes a wealth of new datasets for all these tasks and tons of other new features and bug fixes.

Zero-Shot Sequence Labeling with TARS (#2260)

We extend the TARS zero-shot learning approach to sequence labeling and ship a pre-trained model for English NER. Try defining some classes and see if the model can find them:

```python
from flair.data import Sentence
from flair.models import TARSTagger

# 1. Load zero-shot NER tagger
tars = TARSTagger.load('tars-ner')

# 2. Prepare some test sentences
sentences = [
    Sentence("The Humboldt University of Berlin is situated near the Spree in Berlin, Germany"),
    Sentence("Bayern Munich played against Real Madrid"),
    Sentence("I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne"),
    Sentence("Game of Thrones is my favorite series"),
]

# 3. Define some classes of named entities such as "soccer teams", "TV shows" and "rivers"
labels = ["Soccer Team", "University", "Vehicle", "River", "City", "Country", "Person", "Movie", "TV Show"]
tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')

# 4. Predict for these classes and print results
for sentence in sentences:
    tars.predict(sentence)
    print(sentence.to_tagged_string("ner"))
```

This should print:

```console
The Humboldt University of Berlin is situated near the Spree in Berlin , Germany

Bayern Munich played against Real Madrid

I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne

Game of Thrones is my favorite series
```

So in these examples, we are finding entity classes such as "TV show" (Game of Thrones), "vehicle" (Airbus A380 and Porsche Cayenne), "soccer team" (Bayern Munich and Real Madrid) and "river" (Spree), even though the model was never explicitly trained for this. Note that this is ongoing research and the examples are a bit cherry-picked. We expect the zero-shot model to improve quite a bit until the next release.

New NLP Tasks and Datasets

We now support, as prototypes, new tasks such as the GLUE benchmark, Relation Extraction and Entity Linking. With this, we ship the datasets and model classes you need to train your own models. We are still tweaking these methods, so we do not ship any pre-trained models yet.

GLUE Benchmark (#2149 #2363)

A standard benchmark to evaluate progress in language understanding, mostly consisting of single and pairwise sentence classification tasks.

New datasets in Flair:

  • 'GLUE_COLA' - The Corpus of Linguistic Acceptability from GLUE benchmark
  • 'GLUE_MNLI' - The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark
  • 'GLUE_RTE' - The RTE task from the GLUE benchmark
  • 'GLUE_QNLI' - The Stanford Question Answering Dataset formatted as NLI task from the GLUE benchmark
  • 'GLUE_WNLI' - The Winograd Schema Challenge formatted as NLI task from the GLUE benchmark
  • 'GLUE_MRPC' - The MRPC task from GLUE benchmark
  • 'GLUE_QQP' - The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent

Initialize datasets like so:

```python
from flair.datasets import GLUE_QNLI

# load corpus
corpus = GLUE_QNLI()

# print corpus
print(corpus)

# print first sentence-pair of training data split
print(corpus.train[0])

# print all labels in corpus
print(corpus.make_label_dictionary("entailment"))
```

Relation Extraction (#2333 #2352)

Relation extraction classifies if and which relationship holds between two entities in a text.

Model class: RelationExtractor

Datasets in Flair:

  • 'RE_ENGLISH_CONLL04' - the CoNLL-04 Relation Extraction dataset (#2333)
  • 'RE_ENGLISH_SEMEVAL2010' - the SemEval-2010 Task 8 dataset on Multi-Way Classification of Semantic Relations Between Pairs of Nominals (#2333)
  • 'RE_ENGLISH_TACRED' - the TAC Relation Extraction Dataset (https://nlp.stanford.edu/projects/tacred/) with 41 relations (download required) (#2333)
  • 'RE_ENGLISH_DRUGPROT' - the DrugProt corpus from Biocreative VII Track 1 on drug and chemical-protein interactions (#2340 #2352)

Initialize datasets like so:

```python
from flair.datasets import RE_ENGLISH_CONLL04

# initialize CoNLL 04 corpus for Relation extraction
corpus = RE_ENGLISH_CONLL04()
print(corpus)

# print first sentence of training split with annotations
sentence = corpus.train[0]
print(sentence)

# print label dictionary
label_dict = corpus.make_label_dictionary("relation")
print(label_dict)
```

Entity Linking (#2375)

Entity Linking goes one step further than NER and uniquely links entities to knowledge bases such as Wikipedia.

Model class: EntityLinker

Datasets in Flair:

  • 'NEL_ENGLISH_AIDA' - the AIDA CoNLL-YAGO Entity Linking corpus on the CoNLL-03 dataset for English
  • 'NEL_ENGLISH_AQUAINT' - the Aquaint Entity Linking corpus introduced in Milne and Witten (2008)
  • 'NEL_ENGLISH_IITB' - the IITB Entity Linking corpus introduced in Sayali et al. (2009)
  • 'NEL_ENGLISH_REDDIT' - the Reddit Entity Linking corpus introduced in Botzer et al. (2021) (only gold annotations)
  • 'NEL_ENGLISH_TWEEKI' - the Tweeki Entity Linking corpus introduced in Harandizadeh and Singh (2020)
  • 'NEL_GERMAN_HIPE' - the HIPE Entity Linking corpus for historical German as a sentence-segmented version

```python
from flair.datasets import NEL_ENGLISH_REDDIT

# load corpus
corpus = NEL_ENGLISH_REDDIT()

# print corpus
print(corpus)

# print a sentence of training data split
print(corpus.train[3])
```

New NER Datasets

Other datasets

  • 'YAHOO_ANSWERS' - The 10 largest main categories from the Yahoo! Answers (#2198)
  • Various Universal Dependencies datasets (#2211, #2216, #2219, #2221, #2244, #2245, #2246, #2247, #2223, #2248, #2235, #2236, #2239, #2226)

New Functionality

Support for Arabic NER (#2188)

Flair now supports NER and POS tagging for Arabic. To tag an Arabic sentence, just load the appropriate model:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load model
tagger = SequenceTagger.load('ar-ner')

# make Arabic sentence
sentence = Sentence("احب برلين")

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
for entity in sentence.get_labels('ner'):
    print(entity)
```

This should print:

```console
LOC [برلين (2)] (0.9803)
```

More flexibility on main metric (#2161)

When training models, you can now choose any standard evaluation metric for model selection (previously it was fixed to micro F1). When calling the trainer, simply pass the desired metric as main_evaluation_metric like so:

```python
trainer.train('resources/taggers/your_model',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=10,
              main_evaluation_metric=("macro avg", 'f1-score'),
              )
```

In this example, we now use macro F1 instead of the default micro F1.
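The difference between the two averages matters most when classes are imbalanced. As a minimal sketch with made-up per-class counts (the helper names are hypothetical and this is not Flair's evaluation code), micro F1 pools all counts before computing a single score, while macro F1 averages the per-class scores:

```python
from typing import Dict, Tuple

# per-class counts: (true positives, false positives, false negatives)
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(per_class: Dict[str, Counts]) -> float:
    # pool the counts of all classes, then compute one F1 score
    tp = sum(counts[0] for counts in per_class.values())
    fp = sum(counts[1] for counts in per_class.values())
    fn = sum(counts[2] for counts in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Dict[str, Counts]) -> float:
    # compute F1 per class, then take the unweighted mean
    return sum(f1(*counts) for counts in per_class.values()) / len(per_class)


# made-up example: a frequent, well-predicted class and a rare, poorly predicted one
counts = {"PER": (90, 10, 10), "LOC": (1, 3, 5)}

micro = micro_f1(counts)
macro = macro_f1(counts)
```

Here the rare class barely affects micro F1 but pulls macro F1 down substantially, so selecting models by macro F1 rewards performance on infrequent labels.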

Add handling for mapping labels to 'O' (#2254)

In ColumnDataset, labels can be remapped to other labels. But sometimes you may not wish to use all label types in a given dataset. You can now remap them to 'O' and so exclude them.

For instance, to load CoNLL-03 without MISC, do:

```python
corpus = CONLL_03(
    label_name_map={'MISC': 'O'}
)
print(corpus.make_label_dictionary('ner'))
print(corpus.train[0].to_tagged_string('ner'))
```
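Conceptually, remapping a label to 'O' turns every token of that entity type into a non-entity token. The sketch below shows this on BIO-style tags in plain Python; `remap_tag` is a hypothetical helper written for this illustration and does not reproduce how ColumnDataset applies the map internally.

```python
from typing import Dict, List


def remap_tag(tag: str, label_name_map: Dict[str, str]) -> str:
    """Rename the label part of a tag such as 'B-MISC'; map it to 'O' to drop it."""
    if tag == "O":
        return tag
    prefix, separator, name = tag.partition("-")
    if not separator:
        # tag has no B-/I- prefix, so the whole tag is the label name
        prefix, name = "", tag
    new_name = label_name_map.get(name, name)
    if new_name == "O":
        # excluded label types become plain non-entity tokens
        return "O"
    return f"{prefix}-{new_name}" if prefix else new_name


# made-up tag sequence for one sentence
tags: List[str] = ["B-PER", "I-PER", "O", "B-MISC", "I-MISC", "B-LOC"]
remapped = [remap_tag(tag, {"MISC": "O"}) for tag in tags]
```

The same function also covers plain renaming, for example mapping 'PER' to 'person' while keeping the B-/I- prefix.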

Other

  • add per-label thresholds for prediction (#2366)
  • add support for Spanish clinical Flair embeddings (#2323)
  • added 'mean', 'max' pooling strategy for TransformerDocumentEmbeddings class (#2180)
  • new DocumentCNNEmbeddings class to embed text with a trainable CNN (#2141)
  • allow negative examples in ClassificationCorpus (#2233)
  • added new parameter to save model each k epochs during training (#2146)
  • log epoch of best model instead of printing it during training (#2286)
  • add option to exclude specific sentences from dataset (#2262)
  • improved tensorboard logging (#2164)

  • return predictions during evaluation (#2162)

Internal Refactorings

Refactor for simplicity and extensibility (#2333 #2351 #2356 #2377 #2379 #2382 #2184)

In order to accommodate all these new NLP task types (plus many more in the pipeline), we restructure the flair.nn.Model class such that most models now inherit from DefaultClassifier. This removes many redundancies as most models do classification and are really only different in what they classify and how they apply embeddings. Models that inherit from DefaultClassifier need only implement the method forward_pass, making each model class only a few lines of code.

Check for instance our implementation of the RelationExtractor class to see how easy it now is to add a new task!

Refactor for speed

  • Flair models trained with transformers (such as the FLERT models) were previously not making use of mini-batching, greatly slowing down training and application of such models. We refactored the TransformerWordEmbeddings class, yielding significant speed-ups depending on the mini-batch size used. We observed speed-ups from x2 to x6. (#2385 #2389 #2384)

  • Improve training speed of Flair embeddings (#2203)

Bug fixes & improvements

  • fixed references to multi-X-fast Flair embedding models (#2150)
  • fixed serialization of DocumentRNNEmbeddings (#2155)
  • fixed separator in cross-attention mode (#2156)
  • fixed ID for Slovene word embeddings in the doc (#2166)
  • close log_handler after training is complete. (#2170)
  • fixed bug in IMDB dataset (#2172)
  • fixed IMDB data splitting logic (#2175)
  • fixed XLNet and Transformer-XL Execution (#2191)
  • remove unk token from Ner labeling (#2225)
  • fixed typo in property name (#2267)
  • fixed typos (#2303 #2373)
  • fixed parallel corpus (#2306)
  • fixed SegtokSentenceSplitter Incorrect Sentence Position Attributes (#2312)
  • fixed loading of old serialized models (#2322)
  • updated url for BioSemantics corpus (#2327)
  • updated requirements (#2346)
  • serialize multi_label_threshold for classification models (#2368)
  • small refactorings in ModelTrainer (#2184)
  • moving Path construction of flair.cache_root (#2241)
  • documentation improvement (#2304)
  • add model fit tests (#2378)

- Python
Published by alanakbik over 4 years ago

flair - Release 0.8

Release 0.8 adds major new features to Flair, including our best named entity recognition (NER) models yet and the ability to host, share and test Flair models on the HuggingFace model hub! In addition, there is a host of improvements, new features and new datasets to check out!

FLERT (#2031 #2032 #2104)

This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in our recent paper. This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.

Pre-trained FLERT models (#2130)

We add 5 new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). Load for instance with:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("ner-large")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub, like here.

Contextualized Sentences

In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:

  1. Sentence objects now have next_sentence() and previous_sentence() methods that are set automatically if loaded through ColumnCorpus. This is a pointer system to navigate through sentences in a corpus:

```python
# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)

# get the previous sentence
print(sentence.previous_sentence())

# get the sentence after that
print(sentence.next_sentence())

# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())
```

This allows dynamic computation of contexts in the embedding classes.

  2. Sentence objects now have the is_document_boundary field which is set through the ColumnCorpus. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object.
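To illustrate why these two additions are useful together, the following sketch walks backwards through linked sentences to collect left context and stops at a document boundary. It uses a stand-in `LinkedSentence` class rather than Flair's Sentence object, and both `link` and `left_context` are hypothetical helpers; how the embedding classes actually assemble context may differ.

```python
from typing import List, Optional


class LinkedSentence:
    """Stand-in for a sentence with previous/next pointers (not Flair's Sentence)."""

    def __init__(self, tokens: List[str], is_document_boundary: bool = False):
        self.tokens = tokens
        self.is_document_boundary = is_document_boundary
        self.previous: Optional["LinkedSentence"] = None
        self.next: Optional["LinkedSentence"] = None


def link(sentences: List[LinkedSentence]) -> None:
    # set the previous/next pointers in corpus order
    for left, right in zip(sentences, sentences[1:]):
        left.next = right
        right.previous = left


def left_context(sentence: LinkedSentence, max_tokens: int) -> List[str]:
    """Collect up to max_tokens tokens preceding the sentence within its document."""
    context: List[str] = []
    node = sentence.previous
    while node is not None and not node.is_document_boundary and len(context) < max_tokens:
        context = node.tokens + context
        node = node.previous
    return context[-max_tokens:] if max_tokens > 0 else []


# made-up mini corpus: a boundary marker followed by three sentences
boundary = LinkedSentence(["-DOCSTART-"], is_document_boundary=True)
first = LinkedSentence(["A", "B", "C"])
second = LinkedSentence(["D", "E"])
third = LinkedSentence(["F"])
link([boundary, first, second, third])

short_context = left_context(third, max_tokens=3)
full_context = left_context(third, max_tokens=64)
```

Because the walk stops at the boundary marker, context never leaks across documents, and the token budget keeps the input length bounded.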

Refactored TransformerWordEmbeddings (breaking)

TransformerWordEmbeddings refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: pooling_operation is now subtoken_pooling (to make clear that we pool subtokens), use_scalar_mean is now layer_mean (we only do a simple layer mean) and use_context can now optionally take an integer to indicate the length of the context. Default arguments are also changed.

For instance, to create embeddings with a document-level context of 64 subtokens, init like this:

```python
embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)
```

Train your Own FLERT Models

You can train a FLERT-model like this:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

from flair.data import Sentence
from flair.datasets import CONLL_03, WNUT_17
from flair.embeddings import TransformerWordEmbeddings, DocumentPoolEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# train with XLM parameters (AdamW, 20 epochs, small LR)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

context_string = '+context' if use_context else ''

trainer.train("resources/flert",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```

We recommend training FLERT this way if accuracy is by far the most important feature you need. FLERT is quite slow since it works on the document-level.

HuggingFace model hub integration (#2040 #2108 #2115)

We now host Flair sequence tagging models on the HF model hub (thanks for all the support @HuggingFace!).

Overview of all models. There is a dedicated 'Flair' tag on the hub, so to get a list of all Flair models, check here.

The hub allows all users to upload and share their own models. Even better, you can enable the Inference API and so test all models online without downloading and running them. For instance, you can test our new very powerful English 18-class NER model here.

To load any sequence tagger on the model hub, use the string identifier when instantiating a model. For instance, to load our English ontonotes model with the id "flair/ner-english-ontonotes-large", do

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

Other New Features

New Task: Recognizing Textual Entailment (#2123)

Thanks to @marcelmmm we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.

For instance, if you want to train an RTE task of the GLUE benchmark use this script:

```python
import torch

from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings

# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()

# 2. make the tag dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()

# 3. initialize text pair tagger
from flair.models import TextPairClassifier

tagger = TextPairClassifier(
    document_embeddings=TransformerDocumentEmbeddings(),
    label_dictionary=label_dictionary,
)

# 4. train trainer with AdamW
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 5. run training
trainer.train('resources/taggers/glue-rte-english',
              learning_rate=2e-5,
              mini_batch_chunk_size=2,  # this can be removed if you have a big GPU
              train_with_dev=True,
              max_epochs=3)
```

Add possibility to specify empty label name to CSV corpora (#2068)

Some CSV classification datasets contain a value that means "no class". We now extend the CSVClassificationDataset so that it is possible to specify which value should be skipped using the no_class_label argument.

For instance:

```python
# load corpus
corpus = CSVClassificationCorpus(
    data_folder='resources/tasks/code/',
    train_file='javaio.csv',
    skip_header=True,
    column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
    no_class_label='NONE',
)
```

This causes all entries of NONE in one of the label columns to be skipped.
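The skipping behavior described here can be pictured with a small standalone reader built on Python's csv module. This is a sketch only: `read_labeled_rows` and the sample data are invented for illustration and do not mirror the internals of CSVClassificationDataset.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional, TextIO, Tuple


def read_labeled_rows(handle: TextIO,
                      column_name_map: Dict[int, str],
                      no_class_label: Optional[str] = None,
                      skip_header: bool = False) -> Iterator[Tuple[str, List[str]]]:
    """Yield (text, labels) per CSV row, dropping label cells equal to no_class_label."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader, None)
    for row in reader:
        text_parts: List[str] = []
        labels: List[str] = []
        for index, role in sorted(column_name_map.items()):
            if index >= len(row):
                continue
            value = row[index]
            if role == "text":
                text_parts.append(value)
            elif role == "label" and value and value != no_class_label:
                # cells holding the "no class" marker are skipped, not turned into labels
                labels.append(value)
        yield " ".join(text_parts), labels


# made-up CSV content with two label columns
data = "id,text,label_a,label_b\n1,open a file,IO,NONE\n2,hello world,NONE,NONE\n"

rows = list(read_labeled_rows(io.StringIO(data),
                              column_name_map={1: "text", 2: "label", 3: "label"},
                              no_class_label="NONE",
                              skip_header=True))
```

A row whose label cells all contain the marker simply ends up with an empty label list in this sketch.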

More options for splits in corpora and training (#2034)

For various reasons, we might want to have a Corpus that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.

We add several ways of doing so.

  1. If a dataset has predefined splits, like most NLP datasets, you can pass the arguments train_with_test and train_with_dev to the ModelTrainer. This causes the trainer to train over all three splits (and do no evaluation):

```python
trainer.train("path/to/your/folder",
              learning_rate=0.1,
              mini_batch_size=16,
              train_with_dev=True,
              train_with_test=True,
              )
```

  2. You can also now create a Corpus with fewer splits without having all three splits automatically sampled. Pass sample_missing_splits=False as argument to do this. For instance, to load SemCor WSD corpus only as training data, do:

```python
semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)
```

Add TFIDF Embeddings (#2086)

We added some old-school embeddings (thanks @yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.

To initialize these embeddings, you must pass the train split of your training corpus, i.e.

```python
embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)
```

This triggers the process where the most common words are used to featurize documents.
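For readers unfamiliar with TF-IDF, here is a minimal pure-Python sketch of the idea: fit a vocabulary of the most common words on the training documents, then weight each document's term counts by inverse document frequency. The function names are hypothetical, and the smoothed IDF formula is an assumption borrowed from common practice; the exact weighting used by DocumentTFIDFEmbeddings may differ.

```python
import math
from collections import Counter
from typing import List


def fit_vocabulary(docs: List[List[str]], max_features: int) -> List[str]:
    """Keep the max_features most common tokens across all training documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [token for token, _ in counts.most_common(max_features)]


def tfidf_vector(doc: List[str], vocabulary: List[str], docs: List[List[str]]) -> List[float]:
    """Featurize one document as term frequency times smoothed inverse document frequency."""
    n_docs = len(docs)
    term_frequency = Counter(doc)
    vector = []
    for term in vocabulary:
        document_frequency = sum(1 for other in docs if term in other)
        # smoothed IDF (an assumption for this sketch)
        idf = math.log((1 + n_docs) / (1 + document_frequency)) + 1.0
        vector.append(term_frequency[term] * idf)
    return vector


# made-up, pre-tokenized training documents
train_docs = [
    ["the", "grass", "is", "green"],
    ["the", "sky", "is", "blue"],
    ["the", "grass", "is", "wet"],
]

vocabulary = fit_vocabulary(train_docs, max_features=3)
vector = tfidf_vector(["the", "sky", "is", "blue"], vocabulary, train_docs)
```

Each document becomes a fixed-length vector with one dimension per vocabulary word, which is why the training split has to be seen before any document can be embedded.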

New Datasets

Hungarian NER Corpus (#2045)

Added the Hungarian business news corpus annotated with NER information (thanks to @alibektas).

```python
# load Hungarian business NER corpus
corpus = BUSINESS_HUN()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

StackOverflow NER Corpus (#2052)

```python
# load StackOverflow NER corpus
corpus = STACKOVERFLOW_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

Added GermEval 18 Offensive Language dataset (#2102)

```python
# load GermEval 2018 Offensive Language corpus
corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()
print(corpus)
print(corpus.make_label_dictionary())
```

Added RTE corpora of GLUE and SuperGLUE

```python
# load the recognizing textual entailment corpus of the GLUE benchmark
corpus = GLUE_RTE()
print(corpus)
print(corpus.make_label_dictionary())
```

Improvements

Allow newlines as Tokens in a Sentence (#2070)

Newlines and tabs can now become Tokens in a Sentence:

```python
# make sentence with newlines and tabs
sentence: Sentence = Sentence(["I", "\t", "ich", "\n", "you", "\t", "du", "\n"], use_tokenizer=True)

# Alternatively: sentence: Sentence = Sentence("I \t ich \n you \t du \n", use_tokenizer=False)

# print sentence and each token
print(sentence)
for token in sentence:
    print(token)
```

Improve transformer serialization (#2046)

We improved the serialization of the TransformerWordEmbeddings class such that you can now train a model with one version of the transformers library and load it with another version. Previously, if you trained a model with transformers 3.5.1 and loaded it with 3.1.01, or trained with 3.5.1 and loaded with 4.1.1, or other version mismatches, there would either be errors or bad predictions.

Migration guide: If you have a model trained with an older version of Flair that uses TransformerWordEmbeddings you can save it in the new version-independent format by loading the model with the same transformers version you used to train it, and then saving it again. The newly saved model is then version-independent:

```python
# load old model, but use the same transformer version you used when training this model
tagger = SequenceTagger.load('path/to/old-model.pt')

# save the model. It is now version-independent and can for instance be loaded with transformers 4.
tagger.save('path/to/new-model.pt')
```

Fix regression prediction errors (#2067)

Fix of two problems in the regression model:

  • the predict() method was unable to set labels and threw errors (see #2056)
  • predicted labels had no label name

Now, you can set a label name either in the predict method or during instantiation of the regression model you want to train. So the full code for training a regression model and using it to predict is:

```python
# load regression dataset
corpus = WASSA_JOY()

# make simple document embeddings
embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')], fine_tune_mode='linear')

# init model and give name to label
model = TextRegressor(embeddings, label_name='happiness')

# target folder
output_folder = 'resources/taggers/regression_test/'

# run training
trainer = ModelTrainer(model, corpus)
trainer.train(
    output_folder,
    mini_batch_size=16,
    max_epochs=10,
)

# load model
model = TextRegressor.load(output_folder + 'best-model.pt')

# predict for sentence
sentence = Sentence('I am so happy')
model.predict(sentence)

# print sentence and prediction
print(sentence)
```

In my example run, this prints the following sentence + predicted value:

~~~
Sentence: "I am so happy" [− Tokens: 4 − Sentence-Labels: {'happiness': [0.9239126443862915 (1.0)]}]
~~~

Do not shuffle first epoch during training (#2058)

Normally, we shuffle sentences at each epoch during training in the ModelTrainer class. However, in some cases it makes sense to see sentences in their natural order during the first epoch, and shuffle only from the second epoch onward.
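As a minimal sketch of this behavior, the generator below yields the data order for each epoch, keeping the natural order in the first epoch and shuffling from the second epoch onward. The function `epoch_orders`, its `shuffle_first_epoch` flag and the `seed` argument are illustrative names chosen for this example and are not taken from the ModelTrainer code.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def epoch_orders(data: Sequence[T],
                 max_epochs: int,
                 shuffle_first_epoch: bool = False,
                 seed: int = 0) -> Iterator[List[T]]:
    """Yield the order in which data points are visited in each epoch."""
    rng = random.Random(seed)
    for epoch in range(1, max_epochs + 1):
        order = list(data)
        # the first epoch keeps the natural order unless explicitly requested otherwise
        if epoch > 1 or shuffle_first_epoch:
            rng.shuffle(order)
        yield order


# made-up "sentences" represented by their position in the corpus
sentences = list(range(10))
orders = list(epoch_orders(sentences, max_epochs=3))
```

Seeing sentences in corpus order once can help when neighboring sentences belong to the same document, while shuffling in later epochs retains the usual benefits of randomized mini-batches.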

Bug Fixes and Improvements

  • Update to transformers 4 (#2057)
  • Fix the evaluate() method in the SimilarityLearner class (#2113)
  • Fix memory leak in WordEmbeddings (#2018)
  • Add support for Transformer-XL Embeddings (#2009)
  • Restrict numpy version to <1.20 for Python 3.6 (#2014)
  • Small formatting and variable declaration changes (#2022)
  • Fix document boundary offsets for Dutch CoNLL-03 (#2061)
  • Changed the torch version in requirements.txt: Torch>=1.5.0 (#2063)
  • Fix linear input dimension if the reproject (#2073)
  • Various improvements for TARS (#2090 #2128)
  • Added a link to the interpret-flair repo (#2096)
  • Improve documentation (#2110)
  • Update sentencepiece and gdown version (#2131)
  • Add to_plain_string method to Span class (#2091)

- Python
Published by alanakbik almost 5 years ago

flair - Release 0.7

Release 0.7 adds major few-shot and zero-shot learning capabilities to Flair with our new TARS approach, plus support for the Universal Proposition Banks, new NER datasets and lots of other new features!

Few-Shot and Zero-Shot Classification with TARS (#1917 #1926)

With TARS we add a major new feature to Flair for zero-shot and few-shot classification. Details on the approach can be found in our paper Halder et al. (2020). Our approach allows you to classify text in cases in which you have little or even no training data at all.

This example illustrates how you predict new classes without training data:

```python
from flair.data import Sentence
from flair.models import TARSClassifier

# 1. Load our pre-trained TARS model for English
tars = TARSClassifier.load('tars-base')

# 2. Prepare a test sentence
sentence = Sentence("I am so glad you liked it!")

# 3. Define some classes that you want to predict using descriptive names
classes = ["happy", "sad"]

# 4. Predict for these classes
tars.predict_zero_shot(sentence, classes)

# Print sentence with predicted labels
print(sentence)
```

For a full overview of TARS features, please refer to our new TARS tutorial.

Other New Features

Option to set Flair seed (#1979)

Adds the possibility to set a seed via wrapping the Hugging Face Transformers library helper method (thanks @stefan-it).

By specifying a seed with:

```python import flair

flair.set_seed(42) ```

you can make experimental runs reproducible. The wrapped set_seed method sets seeds for random, numpy and torch. More details here.

Control multi-word behavior in UD datasets (#1981)

To better handle multi-words in UD corpora, we introduce the split_multiwords constructor argument to all UD corpora, which by default is set to True. It controls the handling of multiwords that are split into different tokens. For instance, the German "am" is split into two different tokens: "am" -> "an" + "dem". Or the French "aux" -> "à" + "les".

If split_multiwords is set to True, they are split as in UD. If set to False, we keep the original multiword as a single token. Example:

```python
# default mode: multiwords are split
corpus = UD_GERMAN(split_multiwords=True)

# print sentence 179
print(corpus.dev[179].to_plain_string())

# alternative mode: multiwords are kept as original
corpus = UD_GERMAN(split_multiwords=False)

# print sentence 179
print(corpus.dev[179].to_plain_string())
```

This prints

~~~
Ein Hotel zu dem Wohlfühlen.

Ein Hotel zum Wohlfühlen.
~~~

The latter is how it appears in text, the former is after splitting of multiwords.

Pass pretokenized sentence to Sentence object (#1965)

You can now pass a pretokenized sequence as a list of words (thanks @ulf1):

```python
from flair.data import Sentence

sentence = Sentence(['The', 'grass', 'is', 'green', '.'])
print(sentence)
```

This should print:

```console
Sentence: "The grass is green ." [− Tokens: 5]
```

Map label names in sequence labeling datasets (#1988)

You can now pass a label map to sequence labeling datasets to change label names (thanks @pharnisch).

```python
from flair.datasets import CONLL_03_DUTCH

# print tag dictionary with mapped names
corpus = CONLL_03_DUTCH(label_name_map={'PER': 'person', 'ORG': 'organization', 'LOC': 'location', 'MISC': 'other'})
print(corpus.make_tag_dictionary('ner'))

# print tag dictionary with original names
corpus = CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))
```
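The idea behind a label map can be sketched without Flair. The helper `map_tag` below is hypothetical, and the way it treats BIO-style prefixes such as `B-` and `I-` is an assumption made for illustration, not a description of how Flair applies the map internally.

```python
def map_tag(tag: str, label_name_map: dict) -> str:
    """Rename the label part of a BIO-style tag, keeping its prefix."""
    if "-" not in tag:
        # e.g. "O" or a label without prefix
        return label_name_map.get(tag, tag)
    prefix, label = tag.split("-", 1)
    return f"{prefix}-{label_name_map.get(label, label)}"


label_name_map = {"PER": "person", "ORG": "organization", "LOC": "location", "MISC": "other"}
tags = ["B-PER", "I-PER", "O", "B-LOC"]
mapped = [map_tag(tag, label_name_map) for tag in tags]
print(mapped)
```

Labels that do not occur in the map are passed through unchanged, which keeps the mapping safe to apply to corpora with extra tags.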

Data Sets

Universal Proposition Banks (#1870 #1866 #1888)

Flair 0.7 adds support for 7 Universal Proposition Banks to train your own multilingual semantic role labelers (thanks to @Dabendorf).

Load for instance with:

```python
# load English Universal Proposition Bank
corpus = UP_ENGLISH()
print(corpus)

# make dictionary of frames
frame_dictionary = corpus.make_tag_dictionary('frame')
print(frame_dictionary)
```

Now available for Finnish, Chinese, Italian, French, German, Spanish and English

NER Corpora

We add support for 6 new NER corpora:

Arabic NER Corpus (#1901)

Added the ANER corpus for Arabic NER (thanks to @megantosh).

```python
# load Arabic NER corpus
corpus = ANER_CORP()
print(corpus)
```

Movie NER Corpora (#1912)

Added the MIT movie reviews corpora annotated with NER information, in the simple and complex variant (thanks to @pharnisch).

```python
# load simple movie NER corpus
corpus = MITMovieNERSimple()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

# load complex movie NER corpus
corpus = MITMovieNERComplex()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

Added SEC Filings NER corpus (#1922)

Added a corpus of SEC filings annotated with 4-class NER tags (thanks to @samahakk). Note that the corpus class itself is spelled SEC_FILLINGS.

```python
# load SEC filings corpus
corpus = SEC_FILLINGS()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

WNUT 2020 NER dataset support (#1942)

Added corpus of wet lab protocols annotated with NER information used for WNUT 2020 challenge (thanks to @aynetdia).

```python
# load wet lab protocol data
corpus = WNUT_2020_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

Weibo NER dataset support (#1944)

Added dataset about NER for Chinese Social Media (thanks to @87302380).

```python
# load Weibo NER data
corpus = WEIBO_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

Added Finnish NER corpus (#1946)

Added the TURKU corpus for Finnish NER (thanks to @melvelet).

```python
# load Finnish NER data
corpus = TURKU_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))
```

Universal Dependency Treebanks

We add support for 11 new UD treebanks:

  • Greek UD Treebank (#1933, thanks @malamasn)
  • Livvi UD Treebank (#1953, thanks @hebecked)
  • Naija UD Treebank (#1952, thanks @teddim420)
  • Buryat UD Treebank (#1954, thanks @MaxDall)
  • North Sami UD Treebank (#1955, thanks @dobbersc)
  • Maltese UD Treebank (#1957, thanks @phkuep)
  • Marathi UD Treebank (#1958, thanks @polarlyset)
  • Afrikaans UD Treebank (#1959, thanks @QueStat)
  • Gothic UD Treebank (#1961, thanks @wjSimon)
  • Old French UD Treebank (#1964, thanks @Weyaaron)
  • Wolof UD Treebank (#1967, thanks @LukasOpp)

Load each with language name, for instance:

```python
# load Gothic UD treebank data
corpus = UD_GOTHIC()
print(corpus)
print(corpus.test[0])
```

Added GoEmotions text classification corpus (#1914)

Added GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories. Load with:

```python
# load GoEmotions corpus
corpus = GO_EMOTIONS()
print(corpus)
print(corpus.make_label_dictionary())
```

Enhancements and bug fixes

  • Add handling for micro-average precision and recall (#1935)
  • Make dev and test splits in treebanks optional (#1951)
  • Updated communicative functions model (#1857)
  • Biomedical Data: Explicit encodings for Windows Support (#1893)
  • Fix wrong abstract method (#1923 #1940)
  • Improve tutorial (#1939)
  • Fix requirements (#1971)

- Python
Published by alanakbik about 5 years ago

flair - Release 0.6.1

Release 0.6.1 is a bugfix release that fixes the issues caused by moving the server that originally hosted the Flair models. Additionally, this release adds many new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.

New Model: Legal NER (#1872)

Add legal NER model for German. Trained using the German legal NER dataset available here that can be loaded in Flair with the LER_GERMAN corpus object.

Uses German Flair and FastText embeddings and gets 96.35 F1 score.

Use like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load German LER tagger
tagger = SequenceTagger.load('de-ler')

# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."

sentence = Sentence(text)

# predict and print entities
tagger.predict(sentence)

for entity in sentence.get_spans('ner'):
    print(entity)
```

New Datasets

Add XTREME and WikiANN corpora for multilingual NER (#1862)

These huge corpora provide training data for NER in 176 languages. You can either load the language-specific parts of it by supplying a language code:

```python
# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)

# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)
```

Or you can load the default 40 languages at once into one huge MultiCorpus by not providing a language ID:

```python
# load Xtreme MultiCorpus for all
multi_corpus = XTREME()
print(multi_corpus)
```

Add Twitter NER Dataset (#1850)

Dataset of tweets annotated with NER tags. Load with:

```python
# load twitter dataset
corpus = TWITTER_NER()

# print example tweet
print(corpus.test[0])
```

Add German Europarl NER Dataset (#1849)

Dataset of German-language speeches in the European parliament annotated with standard NER tags like person and location. Load with:

```python
# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print an example test sentence
print(corpus.test[1])
```

Add MIT Restaurant NER Dataset (#1177)

Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:

```python
# load restaurant dataset
corpus = MIT_RESTAURANTS()

# print example sentence
print(corpus.test[0])
```

Add Universal Propositions Banks for French and German (#1866)

Our kickoff into supporting the Universal Proposition Banks adds the first two UP datasets to Flair. Load with:

```python
# load German UP
corpus = UP_GERMAN()
print(corpus)

# print example sentence
print(corpus.dev[1])
```

Add Universal Dependencies Dataset for Chinese (#1880)

Adds the Kyoto dataset for Chinese. Load with:

```python
# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()

# print example sentence
print(corpus.test[0])
```

Bug fixes

  • Move models to HU server (#1834 #1839 #1842)
  • Fix deserialization issues in transformer tokenizers (#1865)
  • Documentation fixes (#1819 #1821 #1836 #1852)
  • Add link to a repo with examples of Flair on GCP (#1825)
  • Correct variable names (#1875)
  • Fix problem with custom delimiters in ColumnDataset (#1876)
  • Fix offensive language detection model (#1877)
  • Correct Dutch NER model (#1881)

- Python
Published by alanakbik over 5 years ago

flair - Release 0.6

Release 0.6 is a major biomedical NLP upgrade for Flair, adding state-of-the-art models for biomedical NER, support for 31 biomedical NER corpora, clinical POS tagging, speculation and negation detection in biomedical literature, and many other features such as multi-tagging and one-cycle learning.

Biomedical Models and Datasets:

Most of the biomedical models and datasets were developed together with the Knowledge Management in Bioinformatics group at the HU Berlin, in particular @leonweber and @mariosaenger. This page gives an overview of the new models and datasets, and example tutorials. Some highlights:

Biomedical NER models (#1790)

Flair now has pre-trained models for biomedical NER trained over unified versions of 31 different biomedical corpora. Because they are trained on so many different datasets, the models are shown to be very robust on new datasets, outperforming all previously available off-the-shelf models. For instance, if you want to load a model to detect "diseases" in text, do:

```python
# make a sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load disease tagger and predict
tagger = SequenceTagger.load("hunflair-disease")
tagger.predict(sentence)
```

Done! Let's print the diseases found by the tagger:

```python
for entity in sentence.get_spans():
    print(entity)
```

This should print:

~~~
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
~~~

You can also get one model that finds 5 biomedical entity types (diseases, genes, species, chemicals and cell lines), like this:

```python
# load bio-NER tagger and predict
tagger = MultiTagger.load("hunflair")
tagger.predict(sentence)
```

This should print:

~~~
Span [1,2]: "Behavioral abnormalities" [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome" [− Labels: Disease (0.99)]
Span [5]: "Fmr1" [− Labels: Gene (0.838)]
Span [7]: "Mouse" [− Labels: Species (0.9979)]
~~~

So it now also finds genes and species. As explained here these models work best if you use them together with a biomedical tokenizer.

Biomedical NER datasets (#1790)

Flair now supports 31 biomedical NER datasets out of the box, both in their standard versions as well as the "Huner" splits for reproducibility of experiments. For a full list of datasets, refer to this page.

You can load a dataset like this:

```python
# load one of the bioinformatics corpora
corpus = JNLPBA()

# print statistics and one sentence
print(corpus)
print(corpus.train[0])
```

We also include "huner" corpora that combine many different biomedical datasets into a single corpus. For instance, if you execute the following line:

```python
# load combined chemicals corpus
corpus = HUNER_CHEMICAL()
```

This loads a combination of 6 different corpora that contain annotation of chemicals into a single corpus. This allows you to train stronger cross-corpus models since you now combine training data from many sources. See more info here.

POS model for Portuguese clinical text (#1789)

Thanks to @LucasFerroHAILab, we now include a model for part-of-speech tagging in Portuguese clinical text. Run this model like this:

```python
# load your tagger
tagger = SequenceTagger.load('pt-pos-clinical')

# example sentence
sentence = Sentence('O vírus Covid causa fortes dores .')
tagger.predict(sentence)
print(sentence)
```

You can find more details in their paper here.

Model for negation and speculation in biomedical literature (#1758)

Using the BioScope corpus, we trained a model to recognize negation and speculation in biomedical literature. Use it like this:

```python
sentence = Sentence("The picture most likely reflects airways disease")

tagger = SequenceTagger.load("negation-speculation")
tagger.predict(sentence)

for entity in sentence.get_spans():
    print(entity)
```

This should print:

~~~
Span [4,5,6,7]: "likely reflects airways disease" [− Labels: SPECULATION (0.9992)]
~~~

This indicates that this portion of the sentence is speculation.

Other New Features:

MultiTagger (#1791)

We added support for tagging text with multiple models at the same time. This can reduce memory usage and increase tagging speed.

For instance, if you want to POS tag, chunk, NER and detect frames in your text at the same time, do:

```python
# load tagger for POS, chunking, NER and frame detection
tagger = MultiTagger.load(['pos', 'upos', 'chunk', 'ner', 'frame'])

# example sentence
sentence = Sentence("George Washington was born in Washington")

# predict
tagger.predict(sentence)

print(sentence)
```

This will give you a sentence annotated with 5 different layers of annotation.

Sentence splitting

Flair now includes convenience methods for sentence splitting. For instance, to use segtok to split and tokenize a text into sentences, use the following code:

```python
from flair.tokenization import SegtokSentenceSplitter

# example text with many sentences
text = "This is a sentence. This is another sentence. I love Berlin."

# initialize sentence splitter
splitter = SegtokSentenceSplitter()

# use splitter to split text into list of sentences
sentences = splitter.split(text)
```

We also ship other splitters, such as SpacySentenceSplitter (requires spaCy to be installed).

Japanese tokenization (#1786)

Thanks to @himkt we now have expanded support for Japanese tokenization in Flair. For instance, use the following code to tokenize a Japanese sentence without installing extra libraries:

```python
from flair.data import Sentence
from flair.tokenization import JapaneseTokenizer

# init japanese tokenizer
tokenizer = JapaneseTokenizer("janome")

# make sentence (and tokenize)
sentence = Sentence("私はベルリンが好き", use_tokenizer=tokenizer)

# output tokenized sentence
print(sentence)
```

One-Cycle Learning (#1776)

Thanks to @lucaventurini2, Flair now supports one-cycle learning, which may give quicker convergence. For instance, train a model in 20 epochs using the code below:

```python
from torch.optim.lr_scheduler import OneCycleLR

# train as always
trainer = ModelTrainer(tagger, corpus)

# set one cycle LR as scheduler
trainer.train('onecycle_ner', scheduler=OneCycleLR, max_epochs=20)
```
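For intuition about the shape of such a schedule, the following is a simplified sketch: the learning rate warms up linearly to a peak and then decays linearly. It is not PyTorch's `OneCycleLR` implementation (which anneals with a cosine curve by default and decays to a lower final rate); the function `one_cycle_lr` is hypothetical and only borrows the parameter names `pct_start` and `div_factor`.

```python
def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Simplified one-cycle schedule: linear warm-up, then linear decay."""
    initial_lr = max_lr / div_factor
    peak_step = pct_start * total_steps
    if step <= peak_step:
        progress = step / peak_step
        return initial_lr + progress * (max_lr - initial_lr)
    progress = (step - peak_step) / (total_steps - peak_step)
    return max_lr - progress * (max_lr - initial_lr)


# learning rate at each of 11 steps (0..10), peaking at step 3
schedule = [one_cycle_lr(step, total_steps=10, max_lr=0.1) for step in range(11)]
print([round(lr, 4) for lr in schedule])
```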

Improvements:

Changes in convention

Turn on tokenizer by default in Sentence object (#1806)

The Sentence object now executes tokenization (use_tokenizer=True) by default:

```python
# Tokenizes by default
sentence = Sentence("I love Berlin.")
print(sentence)

# i.e. this is equivalent to
sentence = Sentence("I love Berlin.", use_tokenizer=True)
print(sentence)

# i.e. if you don't want to use tokenization, set it to False
sentence = Sentence("I love Berlin.", use_tokenizer=False)
print(sentence)
```

TransformerWordEmbeddings now handle long documents by default

Previously, you had to set allow_long_sentences=True to enable handling of long sequences (greater than 512 subtokens) in TransformerWordEmbeddings. This is no longer necessary as this value is now set to True by default.

Bug fixes

  • Fix serialization of BytePairEmbeddings (#1802)
  • Fix issues with loading models that use ELMoEmbeddings (#1803)
  • Allow longer lengths in transformers that can handle more than 512 subtokens (#1804)
  • Fix encoding for WASSA datasets (#1766)
  • Update BPE package (#1764)
  • Improve documentation (#1752 #1778)
  • Fix evaluation of TextClassifier if no label_type is passed (#1748)
  • Remove torch version checks that throw errors (#1744)
  • Update DaNE dataset URL (#1800)
  • Fix weight extraction error for empty sentences (#1805)

- Python
Published by alanakbik over 5 years ago

flair - Release 0.5.1

Release 0.5.1 with new features, datasets and models, including support for sentence transformers, transformer embeddings for arbitrary length sentences, new Dutch NER models, new tasks and more refactorings of evaluation and training routines to better organize the code!

New Features and Enhancements:

TransformerWordEmbeddings can now process long sentences (#1680)

Adds a heuristic as a workaround to the max sequence length of some transformer embeddings, making it possible to now embed sequences of arbitrary length if you set allow_long_sentences=True, like so:

```python
TransformerWordEmbeddings(
    allow_long_sentences=True,  # set allow_long_sentences to True to enable this feature
)
```
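The release note does not spell out the heuristic. One common way to work around a maximum sequence length is to split the subtoken sequence into overlapping windows and embed each window separately; the sketch below shows only that splitting step. The function `split_into_windows` and its `stride` parameter are assumptions for illustration, not Flair's actual code.

```python
def split_into_windows(subtokens, max_length, stride):
    """Split a long sequence into overlapping windows of at most max_length."""
    if stride <= 0 or stride > max_length:
        raise ValueError("stride must be between 1 and max_length")
    windows = []
    start = 0
    while True:
        windows.append(subtokens[start:start + max_length])
        if start + max_length >= len(subtokens):
            break  # the last window reaches the end of the sequence
        start += stride
    return windows


subtokens = list(range(10))
windows = split_into_windows(subtokens, max_length=4, stride=2)
print(windows)
```

Because consecutive windows overlap, every subtoken away from the sequence boundaries is seen with context on both sides in at least one window.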

Setting random seeds (#1671)

It is now possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

```python
# set a random seed
import random
random.seed(4)

# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test
print(corpus.dev[0])
print(corpus.test[0])
```
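The principle can be seen in isolation with plain Python. In this sketch the `downsample` function is hypothetical and uses a local `random.Random` generator with an explicit seed, whereas the Flair example above seeds the global `random` module; either way, a fixed seed yields the same sample on every run.

```python
import random


def downsample(items, proportion, seed=None):
    """Return a random subset containing roughly `proportion` of the items."""
    rng = random.Random(seed)  # local generator, global state stays untouched
    sample_size = round(len(items) * proportion)
    return rng.sample(items, sample_size)


sentences = [f"sentence {i}" for i in range(100)]
sample_a = downsample(sentences, 0.1, seed=4)
sample_b = downsample(sentences, 0.1, seed=4)

# same seed, same sample
print(len(sample_a), sample_a == sample_b)
```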

Make reprojection layer optional (#1676)

Makes the reprojection layer optional in SequenceTagger. You can control this behavior through the reproject_embeddings parameter. If you set it to True, embeddings are reprojected via linear map to identical size. If set to False, no reprojection happens. If you set this parameter to an integer, the linear map maps embedding vectors to vectors of this size.

```python
# tagger with standard reprojection
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=True,
)

# tagger without reprojection
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=False,
)

# reprojection to vectors of length 128
tagger = SequenceTagger(
    hidden_size=256,
    # [...]
    reproject_embeddings=128,
)
```
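The three cases can be summarized as a small decision function. The helper `reprojection_size` below is hypothetical and only mirrors the behavior described above; it is not taken from the SequenceTagger source.

```python
def reprojection_size(embedding_length, reproject_embeddings):
    """Return the output size of the linear map, or None if it is skipped."""
    # bool must be checked before int, because bool is a subclass of int
    if isinstance(reproject_embeddings, bool):
        return embedding_length if reproject_embeddings else None
    if isinstance(reproject_embeddings, int):
        return reproject_embeddings
    raise TypeError("reproject_embeddings must be a bool or an int")


same_size = reprojection_size(768, True)
skipped = reprojection_size(768, False)
custom_size = reprojection_size(768, 128)
print(same_size, skipped, custom_size)
```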

Set label name when predicting (#1671)

You can now optionally specify the "label name" of the predicted label. This may be useful if you want to for instance run two different NER models on the same sentence:

```python
sentence = Sentence('I love Berlin')

# load two NER taggers
tagger1 = SequenceTagger.load('ner')
tagger2 = SequenceTagger.load('ontonotes-ner')

# specify label name of tagger1 to be 'conll03_ner'
tagger1.predict(sentence, label_name='conll03_ner')

# specify label name of tagger2 to be 'onto_ner'
tagger2.predict(sentence, label_name='onto_ner')

print(sentence)
```

This lets you distinguish between the tags produced by each tagger. Note that it is no longer possible to pass a string to the predict method; you must now pass a Sentence.

Sentence Transformers (#1696)

Adds the SentenceTransformerDocumentEmbeddings class so you can get embeddings from the sentence-transformers library. Use as follows:

```python
from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings

# init embedding
embedding = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)
```

You can find a full list of their pretrained models here.

Other enhancements

  • Update to transformers 3.0.0 (#1727)
  • Better Memory mode presets for classification corpora (#1701)
  • ClassificationDataset now also accepts lines with a "\t" separator in addition to blank spaces (#1654)
  • Change default fine-tuning in DocumentPoolEmbeddings to "none" (#1675)
  • Short-circuit the embedding loop (#1684)
  • Add option to pass kwargs into transformer models when initializing model (#1694)

New Datasets and Models

Two new Dutch NER models (#1687)

The new default model is a BERT-based RNN model with the highest accuracy:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load the default BERT-based model
tagger = SequenceTagger.load('nl-ner')

# tag sentence
sentence = Sentence('Ik hou van Amsterdam')
tagger.predict(sentence)
```

You can also load a Flair-based RNN model (might be faster on some setups):

```python
# load the Flair-based RNN model
tagger = SequenceTagger.load('nl-ner-rnn')
```

Corpus of communicative functions (#1683) and pre-trained model (#1706)

Adds a corpus of communicative functions in scientific literature, described in this LREC paper and available here. Load with:

```python
corpus = COMMUNICATIVE_FUNCTIONS()
print(corpus)
```

We also ship a pre-trained model on this corpus, which you can load with:

```python
# load communicative function tagger
tagger = TextClassifier.load('communicative-functions')

# example sentence
sentence = Sentence("However, previous approaches are limited in scalability .")

# predict and print labels
tagger.predict(sentence)
print(sentence.labels)
```

Keyword Extraction Corpora (#1629) and pre-trained model (#1689)

Added 3 datasets available for keyphrase extraction via sequence labeling: Inspec, SemEval-2017 and Processed SemEval-2010

Load like this:

```python
inspec_corpus = INSPEC()
semeval_2010_corpus = SEMEVAL2010()
semeval_2017 = SEMEVAL2017()
```

We also ship a pre-trained model on this corpus, which you can load with:

```python
# load keyphrase tagger
tagger = SequenceTagger.load('keyphrase')

# example sentence
sentence = Sentence("Here, we describe the engineering of a new class of ECHs through the "
                    "functionalization of non-conductive polymers with a conductive choline-based "
                    "bio-ionic liquid (Bio-IL).", use_tokenizer=True)

# predict and print labels
tagger.predict(sentence)
print(sentence)
```

Swedish NER (#1652)

Adds a corpus for Swedish NER using the dataset at https://github.com/klintan/swedish-ner-corpus/. Load with:

```python
corpus = NER_SWEDISH()
print(corpus)
```

German Legal Named Entity Recognition (#1697)

Adds a corpus of legal named entities for German. Load with:

```python
corpus = LER_GERMAN()
print(corpus)
```

Refactoring of evaluation

We made a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn (instead of our own implementations, which kept running into issues). This applies to text classification and (most) sequence tagging.

A notable exception is "span-F1" which is used to evaluate NER because there is no good way of counting true negatives. After this PR, our implementation should now exactly mirror the original conlleval script of the CoNLL-02 challenge. In addition to using our reimplementation, an output file is now automatically generated that can be directly used with the conlleval script.
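To illustrate why span-F1 needs no true negatives, here is a minimal sketch that compares sets of gold and predicted spans. It is not Flair's implementation or the conlleval script; the function `span_f1` and the `(start, end, label)` tuple layout are assumptions chosen for the example.

```python
def span_f1(gold_spans, predicted_spans):
    """Compute precision, recall and F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    tp = len(gold & predicted)   # spans with matching boundaries and label
    fp = len(predicted - gold)   # predicted spans that are not in the gold set
    fn = len(gold - predicted)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(0, 2, "PER"), (5, 6, "LOC")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (8, 9, "LOC")]
precision, recall, f1 = span_f1(gold, predicted)
print(precision, recall, f1)
```

Only true positives, false positives and false negatives enter the computation, so the (ill-defined) number of "non-spans" is never needed.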

In more detail, this PR makes the following changes:

  • Span is now a list of Token and can now be iterated like a sentence
  • flair.DataLoader is now used throughout
  • The evaluate() interface in the Model base class is changed so that it no longer requires a data loader, but can run over either a list of Sentence objects or a Dataset
  • SequenceTagger.evaluate() now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted (#1663) and a non-sklearn implementation is used.
  • In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method.

Bug fixes:

  • Fix figsize issue (#1622)
  • Allow strings to be passed instead of Path (#1637)
  • Fix segtok tokenization issue (#1653)
  • Serialize dropout in SequenceTagger (#1659)
  • Fix serialization error in DocumentPoolEmbeddings (#1671)
  • Fix subtokenization issues in transformers (#1674)
  • Add new datasets to __init__.py (#1677)
  • Fix deprecation warnings due to invalid escape sequences. (#1678)
  • Fix PooledFlairEmbeddings deserialization error (#1604)
  • Fix transformer tokenizer deserialization (#1686)
  • Fix issues caused by embedding mode and lambda functions in ELMoEmbeddings (#1692)
  • Fix serialization error in PooledFlairEmbeddings (#1593)
  • Fix mean pooling in PooledFlairEmbeddings (#1698)
  • Fix condition to assign whitespace_after attribute in the build_spacy_tokenizer wrapper (#1700)
  • Fix WIKINER encoding for windows (#1713)
  • Detect and ignore empty sentences in BERT embeddings (#1716)
  • Fix error in returning multiple classes (#1717)

- Python
Published by alanakbik over 5 years ago

flair - Release 0.5

Release 0.5 with tons of new models, embeddings and datasets, support for fine-tuning transformers, greatly improved sentiment analysis models for English, tons of new features and big internal refactorings to better organize the code!

New Fine-tuneable Transformers (#1494 #1544)

Flair 0.5 adds support for transformers and fine-tuning with two new embeddings classes: TransformerWordEmbeddings and TransformerDocumentEmbeddings, for word- and document-level transformer embeddings respectively. Both classes can be initialized with a model name that indicates what type of transformer (BERT, XLNet, RoBERTa, etc.) you wish to use (check the full list here).

Transformer Word Embeddings

If you want to embed the words in a sentence with transformers, do it like this:

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```

If instead you want to use RoBERTa, do:

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```

Transformer Document Embeddings

To get a single embedding for the whole document with BERT, do:

```python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)
```

If instead you want to use RoBERTa, do:

```python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)
```

Text classification by fine-tuning a transformer

Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks. Use TransformerDocumentEmbeddings for this and set fine_tune=True. Then, use the following example code:

```python
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('resources/taggers/trec',
              learning_rate=3e-5,  # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4,  # optionally set this if transformer is too much for your machine
              max_epochs=5,  # terminate after 5 epochs
              )
```

New Taggers, Embeddings and Datasets

Flair 0.5 adds a ton of new taggers, embeddings and datasets.

New Taggers

New sentiment models (#1613)

We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews, so they should be applicable to more domains than the old sentiment models, which were only trained on movie reviews.

There are two new models, a transformer-based model you can load like this:

```python
# load tagger
classifier = TextClassifier.load('sentiment')

# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)

# check prediction
print(sentence)
```

And a faster, slightly less accurate model based on RNNs you can load like this:

```python
classifier = TextClassifier.load('sentiment-fast')
```

Fine-grained POS models for English (#1625)

Adds fine-grained POS models for English, so you can now choose between the 'pos' and 'upos' models for fine-grained and universal POS tags respectively. Load like this:

```python
# Fine-grained POS model
tagger = SequenceTagger.load('pos')

# Fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')

# Universal POS model
tagger = SequenceTagger.load('upos')

# Universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')
```

Added Malayalam POS and XPOS tagger model (#1522)

Added taggers for historical German speech and thought (#1532)

New Embeddings

Added language models for historical German by @redewiedergabe (#1507)

Load the language models with:

```python
embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')
```

Added Malayalam flair embeddings models (#1458)

```python
embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')
```

Added Flair Embeddings from CLEF HIPE Shared Task (#1554)

Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the CLEF HIPE shared task.

New Datasets

Added NER dataset for Finnish (#1620)

You can now load a Finnish NER corpus with:

```python
ner_finnish = flair.datasets.NER_FINNISH()
```

Added DaNE dataset (#1425)

You can now load a Danish NER corpus with:

```python
dane = flair.datasets.DANE()
```

Added SentEval classification datasets (#1454)

Adds 6 SentEval classification datasets to Flair:

```python
senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()
```

Added Sentiment Datasets (#1545)

Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.

```python
amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()
```

Added BIOfid dataset (#1589)

```python
biofid = flair.datasets.BIOFID()
```

Refactorings

Any DataPoint can now be labeled (#1450)

Refactored the DataPoint class and classes that inherit from it (Token, Sentence, Image, Span, etc.) so that all have the same methods for adding and accessing labels.

  • DataPoint base class now defines labeling methods (closes #1449)
  • Labels can no longer be passed to the Sentence constructor, so instead of:

    ```python
    sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])
    ```

    you should now do:

    ```python
    sentence_1 = Sentence("this is great")
    sentence_1.add_label('sentiment', 'POSITIVE')
    ```

    or:

    ```python
    sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')
    ```

Note that Sentence labels now have a label_type (in the example that's 'sentiment').

  • The Corpus method _get_class_to_count is renamed to _count_sentence_labels
  • The Corpus method _get_tag_to_count is renamed to _count_token_labels
  • Span is now a DataPoint (so it has an embedding and labels)
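The labeling interface described in this list can be pictured with a small stand-in class. `LabeledPoint` below is hypothetical and much simpler than Flair's DataPoint; it only shows labels being grouped by label type and `add_label` returning the object so that calls can be chained onto the constructor.

```python
class LabeledPoint:
    """Hypothetical stand-in for a data point that stores labels by type."""

    def __init__(self, text):
        self.text = text
        self.annotation_layers = {}

    def add_label(self, label_type, value, score=1.0):
        """Attach a (value, score) label under the given label type."""
        self.annotation_layers.setdefault(label_type, []).append((value, score))
        return self  # returning self allows chaining onto the constructor

    def get_labels(self, label_type):
        """Return all labels of one type (empty list if there are none)."""
        return list(self.annotation_layers.get(label_type, []))


point = LabeledPoint("this is great").add_label("sentiment", "POSITIVE")
print(point.get_labels("sentiment"))
print(point.get_labels("topic"))
```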

Embeddings module was split into smaller submodules (#1588)

Split the previously huge embeddings.py into several submodules organized in an embeddings/ folder. The submodules are:

  • token.py for all TokenEmbeddings classes
  • document.py for all DocumentEmbeddings classes
  • image.py for all ImageEmbeddings classes
  • legacy.py for embeddings that are now deprecated
  • base.py for remaining basic classes

All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:

```python
from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')
```

so specifying the submodule is not needed.

Datasets module was split into smaller submodules (#1510)

Split the previously huge datasets.py into several submodules organized in a datasets/ folder. The submodules are:

  • sequence_labeling.py for all sequence labeling datasets
  • document_classification.py for all document classification datasets
  • treebanks.py for all dependency parsed corpora (UD treebanks)
  • text_text.py for all bi-text datasets (currently only parallel corpora)
  • text_image.py for all paired text-image datasets (currently only Feidegger)
  • base.py for remaining basic classes

All datasets are still exposed through the datasets package, so it is still possible to load corpora with python from flair.datasets import TREC_6 without specifying the submodule.

Other refactorings

  • Refactor datasets for code legibility (#1394)

Small refactorings on flair.datasets for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent Corpus class. You can now instantiate a Corpus only with a train file which will trigger the sampling. (2) Moved the default logic for identifying train, dev and test files into a dedicated method to avoid duplicates in code.

  • Extend string output of Sentence (#1452)

Other

New Features

Add option to specify document delimiter for language model training (#1541)

You now have the option of specifying a document_delimiter when training a LanguageModel. Say, you have a corpus of textual lists and use "[SEP]" to mark boundaries between two lists, like this:

Colors: - blue - green - red [SEP] Cities: - Berlin - Munich [SEP] ...

Then you can now train a language model by setting the document_delimiter in the TextCorpus and LanguageModel objects. This will make sure only documents as a whole will get shuffled during training (i.e. the lists in the above example):

```python
# your document delimiter
delimiter = '[SEP]'

# set it when you load the corpus
corpus = TextCorpus(
    "data/corpora/conala-corpus/",
    dictionary,
    is_forward_lm,
    character_level=True,
    document_delimiter=delimiter,
)

# set it when you init the language model
language_model = LanguageModel(
    dictionary,
    is_forward_lm=True,
    hidden_size=512,
    nlayers=1,
    document_delimiter=delimiter,
)

# train your language model as always
trainer = LanguageModelTrainer(language_model, corpus)
```
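To picture what "only documents as a whole get shuffled" means, here is a minimal standalone sketch. It does not use Flair: `shuffle_documents` is a hypothetical helper, and the third list ("Animals") is made up for illustration.

```python
import random
from typing import List


def shuffle_documents(text: str, delimiter: str, seed: int = 0) -> List[str]:
    """Split raw text on the delimiter and shuffle the resulting documents."""
    # each document stays intact; only the order of documents changes
    documents = [doc.strip() for doc in text.split(delimiter) if doc.strip()]
    rng = random.Random(seed)
    rng.shuffle(documents)
    return documents


raw = "Colors: - blue - green - red [SEP] Cities: - Berlin - Munich [SEP] Animals: - cat - dog"
shuffled = shuffle_documents(raw, "[SEP]")
print(shuffled)
```

Whatever order comes out, the entries of each list are never separated from their heading.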

Allow column delimiter to be set in ColumnCorpus (#1526)

Added the possibility to set a different column delimiter for ColumnCorpus, i.e.

python corpus = ColumnCorpus( Path("/path/to/corpus/"), column_format={0: 'text', 1: 'ner'}, column_delimiter='\t', # set a different delimiter )

if you want to read a tab-separated column corpus.

Improvements in classification corpus datasets (#1545)

There are a number of improvements for the ClassificationCorpus and ClassificationDataset classes:

  • It is now possible to select from three memory modes ('full', 'partial' and 'disk'). Use 'full' if the entire dataset and all objects fit into memory. Use 'partial' if they do not, and 'disk' if even 'partial' does not fit.
  • It is also now possible to provide "name maps" to rename labels in datasets. For instance, some sentiment analysis datasets use '0' and '1' as labels, while some others use 'POSITIVE' and 'NEGATIVE'. By providing name maps you can rename labels so they are consistent across datasets.
  • You can now choose which splits to downsample (for instance you might want to downsample 'train' and 'dev' but not 'test').
  • You can now specify the option filter_if_longer_than to filter out all sentences that contain more than the given number of whitespaces. This is useful to limit corpus size, as some sentiment analysis datasets are gigantic.
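The name-map and length-filter options can be illustrated with a small standalone sketch. This is not Flair's implementation: `prepare_examples` and its arguments are hypothetical names chosen to mirror the options described above.

```python
from typing import Dict, List, Tuple


def prepare_examples(
    examples: List[Tuple[str, str]],
    label_name_map: Dict[str, str],
    filter_if_longer_than: int = -1,
) -> List[Tuple[str, str]]:
    """Rename labels via the name map and drop texts with too many whitespaces."""
    prepared = []
    for text, label in examples:
        # a non-positive threshold disables the filter
        if 0 < filter_if_longer_than < text.count(" "):
            continue
        # labels missing from the map are kept unchanged
        prepared.append((text, label_name_map.get(label, label)))
    return prepared


examples = [("great movie", "1"), ("this was a very long and boring film", "0")]
result = prepare_examples(
    examples, {"0": "NEGATIVE", "1": "POSITIVE"}, filter_if_longer_than=5
)
print(result)
```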

Added different ways to combine ELMo layers (#1547)

Improved default annealing scheme to anneal against score and loss (#1570)

Add new scheduler that uses dev score as main metric to anneal against, but additionally uses dev loss in case two epochs have the same dev score.
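The comparison rule described here can be sketched as follows. This is an illustration of the idea only, not the scheduler's actual code; `is_improvement` is a hypothetical name.

```python
from typing import Optional, Tuple


def is_improvement(
    current: Tuple[float, float], best: Optional[Tuple[float, float]]
) -> bool:
    """Compare (dev_score, dev_loss) pairs.

    A higher dev score always wins; if the scores are equal,
    the lower dev loss breaks the tie.
    """
    if best is None:
        return True
    score, loss = current
    best_score, best_loss = best
    if score != best_score:
        return score > best_score
    return loss < best_loss
```

An epoch for which `is_improvement` returns False would count as a "bad epoch" towards annealing.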

Added option for hidden state position in FlairEmbeddings (#1571)

Adds the option to choose which hidden state to use in FlairEmbeddings: either the state at the end of each word, or the state at the whitespace after. Default is the state at the whitespace after.

You can change the default like this: python embeddings = FlairEmbeddings('news-forward', with_whitespace=False)

Reading the state at the end of each word (with_whitespace=False) seems to work better for syntactic tasks such as POS tagging. For instance, on UD_ENGLISH POS tagging, we get 96.56 +- 0.03 with whitespace and 96.72 +- 0.04 without, averaged over three runs.

See the discussion in #1362 for more details.
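As a rough illustration of the two positions, the sketch below computes, for a space-joined sentence, the character offset at which a forward character language model's state would be read for each token. It is a simplification that assumes a single trailing whitespace and no start or end markers; `state_offsets` is a hypothetical helper, not part of Flair.

```python
from typing import List


def state_offsets(tokens: List[str], with_whitespace: bool = True) -> List[int]:
    """Character offsets in ' '.join(tokens) + ' ' used for each token."""
    offsets = []
    position = 0
    for token in tokens:
        position += len(token)  # index of the whitespace right after the token
        # either the whitespace itself, or the last character of the word
        offsets.append(position if with_whitespace else position - 1)
        position += 1  # move past the separating whitespace
    return offsets


print(state_offsets(["I", "love", "Berlin"]))                          # [1, 6, 13]
print(state_offsets(["I", "love", "Berlin"], with_whitespace=False))   # [0, 5, 12]
```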

Other features

  • Added the option of passing different tokenizers when loading classification datasets (#1579)

  • Added option for true whitespaces in ColumnCorpus (#1583)

  • Configurable cache_root from environment variable (#507)

Performance improvements

  • Improve performance for loading not-in-memory corpus (#1413)

  • A new lmdb based alternative backend for word embeddings (#1515 #1536)

  • Slim down requirements (#1419)

Bug Fixes

  • Fix issue where flair was crashing for cpu only version of pytorch (#1393 #1418)

  • Fix GPU memory error in PooledFlairEmbeddings (#1417)

  • Various small fixes (#1402 #1533 #1511 #1560 #1616)

  • Improve documentation (#1446 #1447 #1520 #1525 #1556)

  • Fix various issues in classification datasets (#1499)

Published by alanakbik almost 6 years ago

flair - Release 0.4.5

This is an enhancement release that slims down Flair for quicker/easier installation and smaller library size. It also makes Flair compatible with torch 1.4.0 and adds enhancements that reduce model size and improve runtime speed for some embeddings. New features include the ability to steer the precision/recall tradeoff during training of models and support for CamemBERT embeddings.

Memory, Runtime and Dependency Improvements

Slim down dependency tree (#1296 #1299 #1335 #1336)

We want to keep the list of dependencies of Flair generally small to avoid errors like #1245 and keep the library small and quick to set up. So we removed dependencies that were each only used for one particular feature, namely:

  • ipython and ipython-genutils, only used for visualization settings in iPython notebooks
  • tiny_tokenizer, used for Japanese tokenization (replaced with instructions for how to install for all users who want to use Japanese tokenizers)
  • pymongo, used for MongoDB datasets (replaced with instructions for how to install for all users who want to use MongoDB datasets)
  • torchvision, now only loaded when needed

We also relaxed version requirements for easier installation on Google CoLab (#1335 #1336)

Dramatic speed-up of BERT embeddings (#1308)

@shoarora optimized the BERTEmbeddings implementation by removing redundant calls. This was shown to lead to dramatic speed improvements.

Reduce size of models that use WordEmbeddings (#1315)

@timnon added a method to replace the word embeddings in a trained model with a sqlite database to dramatically reduce memory usage. It creates the class WordEmbeddingsStore, which can be used to replace a WordEmbeddings instance in a Flair model via duck typing. By using this, @timnon was able to reduce our NER server's memory consumption from 6 GB to 600 MB (a 10x decrease) by adding a few lines of code. It can be tested using the following lines (also in the docstring). First create a headless version of a model without word embeddings:

python from flair.inference_utils import WordEmbeddingsStore from flair.models import SequenceTagger import pickle tagger = SequenceTagger.load("multi-ner-fast") WordEmbeddingsStore.create_stores(tagger) pickle.dump(tagger, open("multi-ner-fast-headless.pickle", "wb")) and then to run the stored headless model without word embeddings, use: python from flair.data import Sentence tagger = pickle.load(open("multi-ner-fast-headless.pickle", "rb")) WordEmbeddingsStore.load_stores(tagger) text = "Schade um den Ameisenbären. Lukas Bärfuss veröffentlicht Erzählungen aus zwanzig Jahren." sentence = Sentence(text) tagger.predict(sentence)
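The underlying idea, keeping word vectors in a sqlite table and fetching them on demand instead of holding them all in memory, can be sketched with the standard library alone. `SqliteVectorStore` below is a hypothetical stand-in and makes no claim about how WordEmbeddingsStore is actually implemented.

```python
import sqlite3
import struct
from typing import Dict, List


class SqliteVectorStore:
    """Toy word-vector lookup backed by sqlite (illustration only)."""

    def __init__(self, path: str, vectors: Dict[str, List[float]]):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, data BLOB)"
        )
        # store each vector as packed 32-bit floats
        rows = [(w, struct.pack(f"{len(v)}f", *v)) for w, v in vectors.items()]
        self.db.executemany("INSERT OR REPLACE INTO vectors VALUES (?, ?)", rows)
        self.db.commit()

    def lookup(self, word: str) -> List[float]:
        """Return the vector for a word, or an empty list if it is unknown."""
        row = self.db.execute(
            "SELECT data FROM vectors WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            return []
        return list(struct.unpack(f"{len(row[0]) // 4}f", row[0]))


store = SqliteVectorStore(":memory:", {"berlin": [0.5, -1.0], "munich": [0.25, 2.0]})
print(store.lookup("berlin"))
```

Only the vectors that are actually looked up are read, which is where the memory saving would come from when the database lives on disk.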

New Features

Prioritize precision/recall or specific classes during training (#1345)

@klasocki added ways to steer the precision/recall tradeoff during training of models, as well as prioritize certain classes. This option was added to the SequenceTagger and the TextClassifier.

You can steer the precision/recall tradeoff with the beta parameter, which indicates how many times more important recall is than precision. So if you set beta=0.5, precision becomes twice as important as recall. If you set beta=2, recall becomes twice as important as precision. Do it like this:

python tagger = SequenceTagger( hidden_size=256, embeddings=embeddings, tag_dictionary=tag_dictionary, tag_type=tag_type, beta=0.5)
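Assuming beta is the β of the standard F-beta measure, its effect can be checked with a few lines of plain Python. `f_beta` is a helper written for this illustration, not a Flair function.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# a model with high precision but lower recall
precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

For this high-precision model the score is highest with beta=0.5 and lowest with beta=2, so model selection under a small beta favors precise models.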

If you want to prioritize classes, you can pass a loss_weights dictionary to the model classes. For instance, to prioritize learning the NEGATIVE class in a sentiment tagger, do:

python tagger = TextClassifier( document_embeddings=embeddings, label_dictionary=tag_dictionary, loss_weights={'NEGATIVE': 10.})

which will increase the importance of class NEGATIVE by a factor of 10.
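A minimal sketch of what such class weights do to a loss, assuming a weighted negative log-likelihood averaged over the sum of weights (the convention used by weighted cross-entropy in PyTorch). `weighted_nll` and the toy probabilities are made up for illustration.

```python
import math
from typing import Dict, List


def weighted_nll(
    probs: List[Dict[str, float]], gold: List[str], loss_weights: Dict[str, float]
) -> float:
    """Weighted mean of negative log-likelihoods; missing weights default to 1.0."""
    weights = [loss_weights.get(label, 1.0) for label in gold]
    losses = [-math.log(p[label]) for p, label in zip(probs, gold)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


probs = [
    {"POSITIVE": 0.5, "NEGATIVE": 0.5},  # gold: POSITIVE
    {"POSITIVE": 0.9, "NEGATIVE": 0.1},  # gold: NEGATIVE, badly predicted
]
gold = ["POSITIVE", "NEGATIVE"]

unweighted = weighted_nll(probs, gold, {})
weighted = weighted_nll(probs, gold, {"NEGATIVE": 10.0})
print(unweighted, weighted)
```

With the weight in place, the mistake on the NEGATIVE example dominates the loss, so the optimizer is pushed harder to fix it.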

CamemBERT Embeddings (#1297)

@stefan-it added support for the recently proposed French language model: CamemBERT.

Thanks to the awesome 🤗/Transformers library, CamemBERT can be used in Flair like in this example:

```python
from flair.data import Sentence
from flair.embeddings import CamembertEmbeddings

embedding = CamembertEmbeddings()

sentence = Sentence("J'aime le camembert !")
embedding.embed(sentence)

for token in sentence.tokens:
    print(token.embedding)
```

Bug fixes and enhancements

  • Fix new RNN format for torch 1.4.0 (#1360, #1382 )
  • Fix memory issue in PooledFlairEmbeddings (#1337 #1339)
  • Correct subtoken mapping function for GPT-2 and RoBERTa (#1242)
  • Update the transformers library to the latest 2.3 version (#1333)
  • Add staticmethod decorator to some functions (#1257)
  • Add a warning if validation data is too small (#1115)
  • Remove leftover printline from MUSE embeddings (#1224)
  • Correct generate_text() UTF-8 conversion (#1238)
  • Clarify documentation (#1295 #1332)
  • Replace sklearn by scikit-learn (#1321)
  • Fix off-by-one error in progress logging (#1334)
  • Fix typo and annotation (#1341)
  • Various improvements (#1347)
  • Make load_big_file work with read-only file (#1353)
  • Rename tiny_tokenizer to konoha (#1363)
  • Make test loss plotting optional (#1372)
  • Add pretty print function for Dictionary (#1375)

Published by alanakbik about 6 years ago

flair - Release 0.4.4

Release 0.4.4 introduces dramatic improvements in inference speed for taggers (thanks to many contributions by @pommedeterresautee), Flair embeddings in 300 languages (thanks @stefan-it), modular tokenization and many new features and refactorings.

Speed optimizations

Many refactorings by @pommedeterresautee to improve inference speed of sequence tagger (#1038 #1053 #1068 #1093 #1130), Flair embeddings (#1074 #1095 #1107 #1132 #1145), word embeddings (#1084), embeddings memory management (#1082 #1117), general optimizations (#1112) and classification (#1187).

The combined improvements increase inference speed by a factor of 2-3!

New features

Modular tokenization (#1022)

You can now pass custom tokenizers to Sentence objects and Dataset loaders to use different tokenizers than the included segtok library by implementing a tokenizer method. Currently, in-built support exists for whitespace tokenization, segtok tokenization and Japanese tokenization with mecab (requires mecab to be installed). In the future, we expect support for additional external tokenizers to be added.

For instance, if you wish to use Japanese tokenization performed by mecab, you can instantiate the Sentence object like this:

```python
from flair.data import build_japanese_tokenizer
from flair.data import Sentence

# instantiate Japanese tokenizer
japanese_tokenizer = build_japanese_tokenizer()

# init sentence and pass this tokenizer
sentence = Sentence("私はベルリンが好きです。", use_tokenizer=japanese_tokenizer)
print(sentence)
```

Flair Embeddings for 300 languages (#1146)

Thanks to @stefan-it, there is now a massively multilingual Flair embeddings model that covers 300 languages. See #1099 for more info on these embeddings and this repo for more details.

This replaces the old multilingual Flair embeddings that were trained for 6 languages. Load them with:

python embeddings_fw = FlairEmbeddings('multi-forward') embeddings_bw = FlairEmbeddings('multi-backward')

Multilingual Character Dictionaries (#1157)

Adds two multilingual character dictionaries computed by @stefan-it.

Load with

```python
dictionary = Dictionary.load('chars-large')
print(len(dictionary.idx2item))

dictionary = Dictionary.load('chars-xl')
print(len(dictionary.idx2item))
```

Batch-growth annealing (#1138)

The paper Don't Decay the Learning Rate, Increase the Batch Size makes the case for increasing the batch size over time instead of annealing the learning rate.

This version adds the possibility to have arbitrarily large mini-batch sizes with an accumulating gradient strategy. It introduces the parameter mini_batch_chunk_size that you can set to break down large mini-batches into smaller chunks for processing purposes.

So let's say you want to have a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:

python trainer = ModelTrainer(tagger, corpus) trainer.train( "path/to/experiment/folder", # set large mini-batch size mini_batch_size=128, # set chunk size to lower memory requirements mini_batch_chunk_size=32, )
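The accumulating-gradient strategy rests on the fact that the gradient of a summed loss over a mini-batch equals the sum of the gradients over its chunks. The toy example below checks this for a one-parameter linear model in plain Python. It is a sketch of the principle, not of Flair's training loop.

```python
from typing import List, Tuple


def gradient(batch: List[Tuple[float, float]], w: float) -> float:
    """Gradient w.r.t. w of the summed squared error (w * x - y)^2."""
    return sum(2 * (w * x - y) * x for x, y in batch)


def accumulated_gradient(
    batch: List[Tuple[float, float]], w: float, chunk_size: int
) -> float:
    """Process the batch chunk by chunk and accumulate the gradients."""
    total = 0.0
    for start in range(0, len(batch), chunk_size):
        total += gradient(batch[start:start + chunk_size], w)
    return total


# mini-batch of 128 samples, processed 32 at a time
batch = [(float(i), 2.0 * i + 1.0) for i in range(128)]
full = gradient(batch, w=0.5)
chunked = accumulated_gradient(batch, w=0.5, chunk_size=32)
print(full, chunked)
```

Only one chunk has to be held in memory at a time, while the parameter update still reflects the whole mini-batch.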

Because we can now raise the mini-batch size arbitrarily, we can execute the annealing strategy from the above paper. Do it like this:

python trainer = ModelTrainer(tagger, corpus) trainer.train( "path/to/experiment/folder", # set initial mini-batch size mini_batch_size=32, # choose batch growth annealing batch_growth_annealing=True, )

Document-level sequence labeling (#1194)

Introduces the option for reading entire documents into one Sentence object for sequence labeling. This option is now supported for CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets which indicate document boundaries.

Here's how to train a model on CoNLL-03 on the document level:

```python
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# read CoNLL-03 with document_as_sequence=True
corpus = CONLL_03(in_memory=True, document_as_sequence=True)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# init simple tagger with GloVe embeddings
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)
```
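What "reading entire documents into one Sentence object" amounts to can be sketched on raw column-format lines: tokens are collected until the next document boundary instead of the next blank line. The sketch assumes the boundary marker is a line starting with `-DOCSTART-`, as in the CoNLL-03 files; `group_into_documents` is a hypothetical helper, not Flair's reader.

```python
from typing import List


def group_into_documents(lines: List[str]) -> List[List[str]]:
    """Collect the first column of each token line into one list per document."""
    documents: List[List[str]] = []
    for line in lines:
        if line.startswith("-DOCSTART-"):
            documents.append([])  # a new document begins
        elif line.strip():
            if not documents:
                documents.append([])  # file without a leading marker
            documents[-1].append(line.split()[0])
        # blank lines (sentence boundaries) are ignored on purpose
    return documents


lines = [
    "-DOCSTART- -X- O O", "",
    "EU NNP B-ORG", "rejects VBZ O", "",
    "Peter NNP B-PER", "",
    "-DOCSTART- -X- O O", "",
    "Berlin NNP B-LOC",
]
docs = group_into_documents(lines)
print(docs)
```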

Option to evaluate on training split (#1202)

Previously, the ModelTrainer only allowed monitoring of dev and test splits during training. Now, you can also monitor the train split to better check if your method is overfitting.

Support for Danish tagging (#1183)

Adds support for Danish POS and NER thanks to @AmaliePauli!

Use like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# example sentence
sentence = Sentence("København er en fantastisk by .")

# load Danish NER model and predict
ner_tagger = SequenceTagger.load('da-ner')
ner_tagger.predict(sentence)

# print annotations (NER)
print(sentence.to_tagged_string())

# load Danish POS model and predict
pos_tagger = SequenceTagger.load('da-pos')
pos_tagger.predict(sentence)

# print annotations (NER + POS)
print(sentence.to_tagged_string())
```

Support for DistilBERT embeddings (#1044)

You can use them like this:

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("distilbert-base-uncased")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
    print(token.embedding)
    print(token.embedding.shape)
```

MongoDataset for reading text classification data from a Mongo database (#1192)

Adds the option of reading data from MongoDB. See this documentation on how to use this feature.

Feidegger corpus (#1199)

Adds a dataset downloader for the Feidegger corpus consisting of text-image pairs. Instantiate the corpus like this:

```python
from flair.datasets import FeideggerCorpus

# instantiate Feidegger corpus
corpus = FeideggerCorpus()

# print a text-image pair
print(corpus.train[0])
```

Refactorings

Refactor checkpointing mechanism (#1101)

Refactored the checkpointing mechanism and slimmed down interfaces / code required to load checkpoints.

In detail:

  • The methods save_checkpoint and load_checkpoint are no longer part of the flair.nn.Model interface. Instead, saving and restoring checkpoints is now (fully) performed by the ModelTrainer.
  • The optimizer state and scheduler state are removed from the ModelTrainer constructor since they are no longer required here.
  • Loading a checkpoint is now one line of code (previously two lines).

```python
# 1. initialize trainer as always with a model and a corpus
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(model, corpus)

# 2. train your model for 2 epochs
trainer.train(
    'experiment/folder',
    max_epochs=2,
    # example checkpointing
    checkpoint=True,
)

# 3. load last checkpoint with one line of code
trainer = ModelTrainer.load_checkpoint('experiment/folder/checkpoint.pt', corpus)

# 4. continue training for 2 extra epochs
trainer.train('experiment/folder2', max_epochs=4)
```

Refactor data sampling during training (#1154)

Adds a FlairSampler interface to better enable passing custom samplers to the ModelTrainer.

For instance, if you want to always shuffle your dataset in chunks of 5 to 10 sentences, you provide a sampler like this:

```python
# your trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# execute training run
trainer.train(
    'path/to/experiment/folder',
    max_epochs=150,
    # sample data in chunks of 5 to 10
    sampler=ChunkSampler(block_size=5, plus_window=5),
)
```
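The chunked shuffling described above can be sketched in plain Python. The sketch assumes one block size is drawn per pass, between `block_size` and `block_size + plus_window`; `chunk_shuffle` is a hypothetical function and may differ from the actual ChunkSampler in detail.

```python
import random
from typing import List


def chunk_shuffle(
    indices: List[int], block_size: int = 5, plus_window: int = 5, seed: int = 0
) -> List[int]:
    """Cut the data into consecutive blocks and shuffle the blocks."""
    rng = random.Random(seed)
    # block size between block_size and block_size + plus_window (inclusive)
    size = block_size + rng.randint(0, plus_window)
    blocks = [indices[i:i + size] for i in range(0, len(indices), size)]
    rng.shuffle(blocks)
    # order inside each block is preserved
    return [index for block in blocks for index in block]


order = chunk_shuffle(list(range(100)))
print(order[:20])
```

Neighboring sentences mostly stay neighbors, which keeps local context together while still randomizing the order in which chunks are seen.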

Other refactorings

  • Switch everything to batch first mode (#1077)

  • Refactor classification to be more consistent with SequenceTagger (#1151)

  • PyTorch-Transformers -> Transformers (#1163)

  • In-place transpose of tensors (#1047)

Enhancements

Documentation fixes (#1045 #1098 #1121 #1157 #1160 #1168 )

Add option to set rnn_type used in SequenceTagger (#1113)

Accept string as input in NER predict (#1142)

Example usage:

```python
# init tagger
tagger = SequenceTagger.load('ner')

# predict over list of strings
sentences = tagger.predict(
    [
        'George Washington went to Berlin .',
        'George Berlin lived in Washington .',
    ]
)

# output predictions
for sentence in sentences:
    print(sentence.to_tagged_string())
```

Enable One-hot Embeddings of other Tags (#1191)

Bug fixes

  • Fix the learning rate finder (#1119)
  • Fix OneHotEmbeddings on Cuda (#1147)
  • Fix encoding error in CSVClassificationDataset (#1055)
  • Fix encoding errors related to old windows chars (#1135)
  • Fix length error in CharacterEmbeddings (#1088 )
  • Fix tokenizer insert empty token to sentence object (#1226)
  • Ensure StackedEmbeddings always has the same embedding order (#1114)
  • Use $HOME instead of ~ for cache_root (#1134)

Published by alanakbik over 6 years ago

flair - Release 0.4.3

Release 0.4.3 includes a host of new features including transformer-based embeddings (roBERTa, XLNet, XLM, etc.), fine-tuneable FlairEmbeddings, crosslingual MUSE embeddings, new data loading/sampling methods, speed/memory optimizations, bug fixes and enhancements. It also begins a refactoring of interfaces that prepares more general applicability of Flair to other types of downstream tasks.

Embeddings

Transformer embeddings (#941 #972 #993)

Updates the old pytorch-pretrained-BERT library to the latest version of pytorch-transformers to support various new Transformer-based architectures for embeddings.

A total of 7 (new/updated) transformer-based embeddings can be used in Flair now:

```python
from flair.embeddings import (
    BertEmbeddings,
    OpenAIGPTEmbeddings,
    OpenAIGPT2Embeddings,
    TransformerXLEmbeddings,
    XLNetEmbeddings,
    XLMEmbeddings,
    RoBERTaEmbeddings,
)

bert_embeddings = BertEmbeddings()
gpt1_embeddings = OpenAIGPTEmbeddings()
gpt2_embeddings = OpenAIGPT2Embeddings()
txl_embeddings = TransformerXLEmbeddings()
xlnet_embeddings = XLNetEmbeddings()
xlm_embeddings = XLMEmbeddings()
roberta_embeddings = RoBERTaEmbeddings()
```

Detailed benchmarks on the downsampled CoNLL-2003 NER dataset for English can be found in #873 .

Crosslingual MUSE Embeddings (#853)

Use the new MuseCrosslingualEmbeddings class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes the class first does language detection of the sentence to be embedded, and then embeds it with the appropriate language embeddings. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may widely vary.

Here's how to embed:

```python
import torch

from flair.data import Sentence
from flair.embeddings import MuseCrosslingualEmbeddings

# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()

# two sentences in different languages
sentence1 = Sentence("This red shoe is new .")
sentence2 = Sentence("Dieser rote Schuh ist rot .")

# language code is auto-detected
print(sentence1.get_language_code())
print(sentence2.get_language_code())

# embed sentences
embeddings.embed([sentence1, sentence2])

# print similarities
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token1, token2 in zip(sentence1, sentence2):
    print(f"'{token1.text}' and '{token2.text}' similarity: {cos(token1.embedding, token2.embedding)}")
```

FastTextEmbeddings (#879 )

Adds FastTextEmbeddings, which are capable of handling out-of-vocabulary (OOV) words. Be warned though that these embeddings are huge. BytePairEmbeddings are much smaller and reportedly of similar quality, so it is probably advisable to use those instead.

Fine-tuneable FlairEmbeddings (#922)

You can now fine-tune FlairEmbeddings on downstream tasks. You can fine-tune an existing LM by simply passing the fine_tune parameter in the FlairEmbeddings constructor, like this:

python embeddings = FlairEmbeddings('news-forward', fine_tune=True)

You can also use this option to task-train a wholly new language model by passing an empty LanguageModel to the FlairEmbeddings constructor and the fine_tune parameter, like this:

```python
# make an empty language model
language_model = LanguageModel(
    Dictionary.load('chars'),
    is_forward_lm=True,
    hidden_size=256,
    nlayers=1,
)

# init FlairEmbeddings to task-train this model
embeddings = FlairEmbeddings(language_model, fine_tune=True)
```

Optimizations

Automatic mixed precision support (#934)

Mixed precision training can significantly speed up training. It can now be enabled by setting use_amp=True in the trainer classes. For instance for training language models you can do:

```python
# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train(
    'resources/taggers/language_model',
    sequence_length=256,
    mini_batch_size=256,
    max_epochs=10,
    use_amp=True,
)
```

In our experiments, we saw 3x speedup of training large language models though results vary depending on your model size and experimental setup.

Control memory / speed tradeoff during training (#891 #809).

This release introduces the embeddings_storage_mode parameter to the ModelTrainer class and predict() methods. This parameter can be one of 'none', 'cpu' and 'gpu' and allows you to control the tradeoff between memory usage and speed during training:

  • If set to 'none' all embeddings are deleted after usage - this has lowest memory requirements but means that embeddings need to be recomputed at each epoch of training potentially causing a slowdown.
  • If set to 'cpu' all embeddings are moved to CPU memory after usage. During training, this means that they only need to be moved back to GPU for the forward pass, and not recomputed so in many cases this is faster, but requires memory.
  • If set to 'gpu' all embeddings stay on GPU memory after computation. This eliminates memory shuffling during training, causing a speedup. However this option requires enough GPU memory to be available for all embeddings of the dataset.

To use this option during training, simply set the parameter:

python # initialize trainer trainer: ModelTrainer = ModelTrainer(tagger, corpus) trainer.train( "path/to/your/model", embeddings_storage_mode='gpu', )

This release also removes the FlairEmbeddings-specific disk-caching mechanism. In the future, a more general caching mechanism applicable to all embedding types may potentially be added as a fourth memory management option.

Speed-ups on in-memory datasets (#792)

A new DataLoader abstract base class used in Flair will speed up data loading for in-memory datasets.

Refactoring of interfaces (#891 #843)

This release also slims down interfaces of flair.nn.Model and adds a new DataPoint interface that is currently implemented by the Token and Sentence classes. The idea is to widen the applicability of Flair to other data types and other tasks. In the future, the DataPoint interface will for example also be implemented by an Image object and new downstream tasks added to Flair.

The release also slims down the evaluate() method in the flair.nn.Model interface to take a DataLoader instead of a group of parameters, and refactors the logging header logic. Both refactorings prepare for adding new downstream tasks to Flair in the near future.

Other features

Training Classifiers with CSV files (#826 #952 #967)

Adds the CSVClassificationCorpus so you can train classifiers directly from CSVs instead of first having to convert to FastText format. To load a CSV, you need to pass a column_name_map (like in ColumnCorpus), which indicates which column(s) in the CSV holds the text and which field(s) the label(s):

python corpus = CSVClassificationCorpus( # path to the data folder containing train / test / dev files data_folder='path/to/data', # indicates which columns are text and labels column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"}, # if CSV has a header, you can skip it skip_header=True)
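How a column_name_map of this shape could be applied to CSV rows is sketched below with the standard `csv` module. `read_classification_rows` is a hypothetical helper that treats every column named "text" as text and every column whose name starts with "label" as a label. It is not the CSVClassificationCorpus implementation.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_classification_rows(
    handle: TextIO, column_name_map: Dict[int, str], skip_header: bool = False
) -> List[Tuple[str, List[str]]]:
    """Turn CSV rows into (text, labels) pairs according to the column map."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader, None)
    columns = sorted(column_name_map.items())
    examples = []
    for row in reader:
        text = " ".join(row[i] for i, name in columns if name == "text")
        labels = [row[i] for i, name in columns if name.startswith("label")]
        examples.append((text, labels))
    return examples


data = io.StringIO(
    "id,topic,subtopic,date,text\n"
    "1,sports,tennis,2019,Great match today\n"
)
examples = read_classification_rows(
    data, {4: "text", 1: "label_topic", 2: "label_subtopic"}, skip_header=True
)
print(examples)
```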

Data sampling (#908)

We added the first (of many) data samplers that can be passed to the ModelTrainer to influence training. The ImbalancedClassificationDatasetSampler, for instance, will upsample rare classes and downsample common classes in a classification dataset. It may help with imbalanced datasets. Call like this:

python # initialize trainer trainer: ModelTrainer = ModelTrainer(tagger, corpus) trainer.train( 'path/to/folder', learning_rate=0.1, mini_batch_size=32, sampler=ImbalancedClassificationDatasetSampler, )

There are also two experimental chunk samplers (ChunkSampler and ExpandingChunkSampler) that split a dataset into chunks and shuffle them. This preserves some ordering of the original data while also randomizing the data.
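The upsampling and downsampling idea behind an imbalanced-class sampler can be sketched by drawing examples with replacement, each weighted by the inverse frequency of its class. This is an illustration under that assumption; `imbalanced_sample` is a hypothetical function rather than the sampler's actual code.

```python
import random
from collections import Counter
from typing import List


def imbalanced_sample(labels: List[str], seed: int = 0) -> List[int]:
    """Draw len(labels) indices with replacement, weighted by 1 / class count."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=len(labels))


# 95 common examples, 5 rare ones
labels = ["POSITIVE"] * 95 + ["NEGATIVE"] * 5
sampled = imbalanced_sample(labels)
print(Counter(labels[i] for i in sampled))
```

Because every class receives the same total weight, the drawn sample is roughly balanced even though the dataset is not.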

Visualization

  • Adds HTML visualization of sequence labeling (#933). Call like this:

```python
from flair.visual.ner_html import render_ner_html

tagger = SequenceTagger.load('ner')

sentence = Sentence(
    "Thibaut Pinot's challenge ended on Friday due to injury, and then Julian Alaphilippe saw "
    "his lead fall away. The BBC's Hugh Schofield in Paris reflects on 34 years of hurt."
)

tagger.predict(sentence)
html = render_ner_html(sentence)

with open("sentence.html", "w") as writer:
    writer.write(html)
```

  • Plotter now returns images for use in iPython notebooks (#943)
  • Initial TensorBoard support (#924)
  • Add pointer to Flair Visualizer (#1014)

Additional parameterization options

  • CharacterEmbeddings now lets you specify the number of hidden states and the embedding size (#834) python embedding = CharacterEmbeddings(char_embedding_dim=64, hidden_size_char=64)
  • Adds configuration option for minimal learning rate stopping criterion (#871)
  • num_workers is a parameter of LanguageModelTrainer (#962 )

Bug fixes / enhancements

  • Updates old pretrained models to remove old bugs / performance issues (#1017)
  • Fix error in RNN initialization in DocumentRNNEmbeddings (#793)
  • ELMoEmbeddings now use flair.device param (#825)
  • Fix download of TREC_6 dataset (#896)
  • Fix download of UD_GERMAN-HDT (#980)
  • Fix download of WikiNER_German (#1006)
  • Fix error in ColumnCorpus in which words that begin with hashtags were skipped as comments (#956)
  • Fix max_tokens_per_doc param in ClassificationCorpus (#991)
  • Simplify split rule in ColumnCorpus (#990)
  • Fix import error message for ELMoEmbeddings (#1019)
  • References to Persian language unified across embeddings (#773)
  • Updates most pre-trained models fixing quality issues / bugs (#800)
  • Clarifications in documentation (#803 #860 #868)
  • Fixes infinite loop for tokens without start_pos (#1030)

Enhancements

  • Adds a learnable initial hidden state to SequenceTagger (#899)
  • Now keeps order of sentences in mini-batch when embedding (#866)
  • SequenceTagger now optionally returns a distribution of tag probabilities over all classes (#782 #949 #1016)
  • The model trainer now outputs a 'test.tsv' file that contains prediction of final model when done training (#771 )
  • Releases logging handler when finishing training a model (#799)
  • Fixes bad_epochs in training logs and no longer evaluates on test data at each epoch by default (#818 )
  • Convenience method to remove all empty sentences from a corpus (#795)

Published by alanakbik over 6 years ago

flair - Release 0.4.2

Release 0.4.2 includes new features such as streaming data loading (allowing training over very large datasets), support for OpenAI GPT Embeddings, pre-trained Flair embeddings for many new languages, better classification baselines using one-hot embeddings and fine-tuneable document pool embeddings, and text regression as a third task next to sequence labeling and text classification.

New way of loading data (#768)

The data loading part has been completely refactored to enable streaming data loading from disk using PyTorch's DataLoaders. I.e. training no longer requires the full dataset to be kept in memory, allowing us to train models over much larger datasets. This version also changes the syntax of how to load datasets.

Old way (now deprecated): python from flair.data_fetcher import NLPTaskDataFetcher, NLPTask corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

New way: python import flair.datasets corpus = flair.datasets.UD_ENGLISH()

To use streaming loading, i.e. to not load into memory, you can pass the in_memory parameter: python import flair.datasets corpus = flair.datasets.UD_ENGLISH(in_memory=False)

Embeddings

Flair embeddings (#614)

This release brings Flair embeddings to 11 new languages (thanks @stefan-it!): Arabic (ar), Danish (da), Persian (fa), Finnish (fi), Hebrew (he), Hindi (hi), Croatian (hr), Indonesian (id), Italian (it), Norwegian (no) and Swedish (sv). It also improves support for Bulgarian (bg), Czech (cs), Basque (eu), Dutch (nl) and Slovenian (sl), and adds special language models for historical German. Load with the language code, e.g.

```python
# load Flair embeddings for Italian
embeddings = FlairEmbeddings('it-forward')
```

One-hot encoded embeddings (#747)

Some classification baselines work astonishingly well with simple learnable word embeddings. To support testing these baselines, we've added learnable word embeddings that start from a one-hot encoding of words. To initialize, you need to pass a corpus to initialize the vocabulary.

```python
# load corpus
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

# init learnable word embeddings with corpus
embeddings = OneHotEmbeddings(corpus)
```

More options in DocumentPoolEmbeddings (#747)

We now allow users to specify a fine-tuning option that is applied before the pooling operation in document pool embeddings. Options are 'none' (no fine-tuning), 'linear' (linear remapping of word embeddings) and 'nonlinear' (nonlinear remapping of word embeddings). 'nonlinear' should be used together with WordEmbeddings, while 'none' should be used with OneHotEmbeddings (a remapping is not necessary there since they are already learnt on the data). So, to replicate FastText classification you can either do:

```python
# instantiate one-hot encoded word embeddings
embeddings = OneHotEmbeddings(corpus)

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
```

or

```python
# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
```

OpenAI GPT Embeddings (#624)

We now support embeddings from the OpenAI GPT model. We use the excellent pytorch-pretrained-BERT library to download the GPT model, tokenize the input and extract embeddings from the subtokens.

Initialize with:

```python
embeddings = OpenAIGPTEmbeddings()
```

Portuguese embeddings from NILC (#576)

Extensibility to new downstream tasks (#681)

Previously, we had the SequenceTagger and TextClassifier as the two downstream tasks supported by Flair. The ModelTrainer had specific methods to train these two models, making it difficult for users to add new types of tasks (such as text regression) to Flair.

This release refactors the flair.nn.Model and ModelTrainer functionality to make it uniform across tagging models and enable users to add new tasks to Flair. By implementing the 5 methods in the flair.nn.Model interface, a custom model immediately becomes trainable with the ModelTrainer. Three types of downstream tasks currently implement this interface:

  • the SequenceTagger,
  • the TextClassifier
  • and the beta TextRegressor.

The code refactor removes a lot of code redundancies and slims down the interfaces of the downstream tasks classes. As the sole breaking change, it removes the load_from_file() methods, which are now part of the load() method, i.e. if previously you loaded a self-trained model like this:

```python
tagger = SequenceTagger.load_from_file('/path/to/model.pt')
```

You now do it like this:

```python
tagger = SequenceTagger.load('/path/to/model.pt')
```

New features

  • New beta support for text regression (#564)
  • Return confidence scores for single-label classification (#664)
  • Add method to find probability for each class in case of multi-class classification (#693)
  • Capability to change threshold during multi label classification #707
  • Support for customized ELMo embeddings (#661)
  • Detect multi-label problems automatically: Previously, users always had to specify whether their text classification problem was multi_label or not. Now, this is detected automatically if users do not specify. So now you can init like this:

```python
# corpus
corpus = TREC_6()

# make label_dictionary
label_dictionary = corpus.make_label_dictionary()

# init text classifier
classifier = TextClassifier(document_embeddings, label_dictionary)
```

  • We added better module descriptions to embeddings and dropout so that more parameters get printed by default for models for better logging. (#747)
  • Make 'cache_root' a global variable so that different directories can be chosen for caching (#667)
  • Both string and Token objects can now be passed to the add_token method (#628)
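The automatic multi-label detection mentioned in the list above can be pictured with a short sketch. This is only an assumed heuristic (a corpus is treated as multi-label as soon as one document carries more than one distinct label), not necessarily the check Flair performs; the function name is hypothetical.

```python
from typing import Iterable, List


def is_multi_label(documents: Iterable[List[str]]) -> bool:
    """Return True if any document carries more than one distinct label."""
    return any(len(set(labels)) > 1 for labels in documents)


# Each inner list holds the labels of one document.
single_label_corpus = [["sports"], ["politics"], ["sports"]]
multi_label_corpus = [["sports"], ["politics", "economy"]]

single_result = is_multi_label(single_label_corpus)
multi_result = is_multi_label(multi_label_corpus)
```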

New datasets

  • Added IMDB classification corpus to flair.datasets (#749)
  • Added TREC_6 classification corpus to flair.datasets (#749)
  • Added 20 newsgroups classification corpus to flair.datasets (NEWSGROUPS object)
  • WASSA-17 emotion intensity text regression tasks (#695)

Bug fixes

  • We normalized the training loss across modules so that train / test loss are consistent. (#670)
  • Permission error on Windows preventing model download (#557)
  • Handling of empty sentences (#566 #758)
  • Fix text generation on CUDA (#666)
  • others ...

- Python
Published by alanakbik over 6 years ago

flair - Release 0.4.1

Release 0.4.1 with lots of new features, new embeddings (RNN, Transformer and BytePair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes and speed improvements (2x training speed for language models).

New Embeddings

Biomedical Embeddings

Added first embeddings trained over PubMed data, namely:

  • ELMo embeddings
  • Flair embeddings

Load these for instance with:

```python
# Flair embeddings PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')

# ELMo embeddings PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')
```

Byte Pair Embeddings

Added the byte pair embeddings library by @bheinzerling. Support for 275 languages. Very useful if you want to train small models. Load these for instance with:

```python
# initialize embeddings
embeddings = BytePairEmbeddings(language='en')
```

Transformer-XL Embeddings

Transformer-XL embeddings added by @stefan-it. Load with:

```python
# initialize embeddings
embeddings = TransformerXLEmbeddings()
```

ELMo Transformer Embeddings

Experimental transformer version of ELMo embeddings added by @stefan-it.

DocumentRNNEmbeddings

The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.

Initialize like this:

```python
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

glove_embedding = WordEmbeddings('glove')

document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
```

New languages

Japanese

FlairEmbeddings for Japanese trained by @frtacoa and @minh-agent:

```python
# forward and backward embedding
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')
```

Spanish

Added pre-computed FlairEmbeddings for Spanish. Embeddings were computed over Wikipedia by @iamyihwa (see #80 )

To load Spanish FlairEmbeddings, simply do:

```python
# default forward and backward embedding
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')

# CPU-friendly forward and backward embedding
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')
```

Basque

  • @stefan-it trained FlairEmbeddings for Basque which we now include, load with:

```python
forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')
```

  • Added Basque FastText embeddings, load with:

```python
wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')
```

New Datasets

  • IMDB dataset #410 - load with

```python
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)
```

  • TREC6 and TREC50 #450 - load with

```python
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)
```

  • Added download routines for Basque Universal Dependencies and Named Entities, load with

```python
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
```

Other features

FlairEmbeddings for long text

FlairEmbeddings can now be generated for arbitrarily long strings without causing out of memory errors. See #444

Function for calculating perplexity of a string #531

Use like this:

```python
from flair.embeddings import FlairEmbeddings

# get language model
language_model = FlairEmbeddings('news-forward-fast').lm

# calculate perplexity for grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)

# calculate perplexity for ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)

# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')
```
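For readers unfamiliar with the measure, perplexity is the exponential of the average negative log-probability a language model assigns to each symbol of a string; lower values mean the model finds the string more predictable. The sketch below computes it from a list of per-character probabilities. The probabilities are made up for illustration and the function is not Flair's calculate_perplexity implementation.

```python
import math
from typing import Sequence


def perplexity(char_probabilities: Sequence[float]) -> float:
    """Exponential of the average negative log-probability per character."""
    if not char_probabilities:
        raise ValueError("need at least one probability")
    total_log_prob = sum(math.log(p) for p in char_probabilities)
    return math.exp(-total_log_prob / len(char_probabilities))


# Hypothetical per-character probabilities from some language model.
predictable = [0.9, 0.8, 0.9, 0.85]
surprising = [0.1, 0.05, 0.2, 0.1]

predictable_ppl = perplexity(predictable)
surprising_ppl = perplexity(surprising)
```

A model that assigns probability 0.25 to every character has a perplexity of 4, i.e. it is as uncertain as a uniform choice among four symbols.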

Bug fixes

  • Overflow error in text generation #322
  • Sentence embeddings are now vectors #368
  • macro average F-score computation #521
  • character embeddings on CUDA #434
  • accuracy calculation #553

Speed improvements

  • Asynchronous loading of mini batches in language model training (roughly doubles training speed) #406
  • Only send mini-batches to GPU #350
  • Speed up sequence tagger prediction #353
  • Use new cuda semantics #402
  • Reduce CPU-GPU shuffling #459
  • LM memory tweaks #466

- Python
Published by alanakbik about 7 years ago

flair - Release 0.4

Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.

New Features

Support for new languages

Flair embeddings

We now include new language models for:

  • Swedish
  • Polish
  • Bulgarian
  • Slovenian
  • Dutch

These come in addition to the existing models for English and German. You can load FlairEmbeddings for Dutch, for instance, with:

```python
flair_embeddings = FlairEmbeddings('dutch-forward')
```

Word Embeddings

We now include pre-trained FastText Embeddings for 30 languages: English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.

Each language has embeddings trained over Wikipedia, or Web crawls. So instantiate with:

```python
# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')

# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')
```

Named Entity Recognition

Thanks to the Flair community, we now include NER models for:

  • French
  • Dutch

These come in addition to the previous models for English and German.

Part-of-Speech Tagging

Thanks to the Flair community, we now include PoS models for:

  • German tweets

Multilingual models

As a major new feature, we now include models that can tag text in various languages.

12-language Part-of-Speech Tagging

We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).

```python
# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```

4-language Named Entity Recognition

We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).

```python
# load model
tagger = SequenceTagger.load('ner-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
```

This model also works to some extent on languages it was not trained on, such as French.

Pre-trained classification models (issue 70)

Flair now also includes two pre-trained classification models:

  • de-offensive-language: detecting offensive language in German text (GermEval 2018 Task 1)
  • en-sentiment: detecting positive and negative sentiment in English text (IMDB)

Simply load the TextClassifier using the preferred model, such as:

```python
TextClassifier.load('en-sentiment')
```

BERT and ELMo embeddings

We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.

BERT Embeddings (issue 251)

We added BERT embeddings to Flair. We are using the implementation of huggingface. The embeddings can be used as any other embedding type in Flair:

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```

ELMo Embeddings (issue 260)

Flair now also includes ELMo embeddings. We use the implementation of AllenNLP. As this implementation comes with a lot of sub-dependencies, you need to first install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:

```python
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
```

Multi-Dataset Training (issue 232)

You can now train a model on multiple datasets with the MultiCorpus object. We use this to train our multilingual models.

Just create multiple corpora and put them into MultiCorpus:

```python
english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
```

The multi_corpus can now be used for training, just like any other corpus before. Check the tutorial for more details.

Parameter Selection using Hyperopt (issue 242)

We built a wrapper around hyperopt to allow you to search for the best hyperparameters for your downstream task.

Define your search space and start training using several different parameter settings. The results are written to a specific file called param_selection.txt in the result directory. Check the tutorial for more details.

NLP Dataset Downloader (issue 243)

To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:

```python
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
```

you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in the tutorial.

Model training features

We added various other features to model training.

Saving training log (issue 212)

The training log output will from now on be automatically saved in the result directory you provide for training. The log will be saved in training.log.

Resuming training (issue 217)

It is now possible to stop training at any point in time and to resume it later by training with checkpoint set to True. Check the tutorial for more details.

Custom Optimizers (issue 220)

You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.

Learning Rate Finder (issue 228)

A new helper method to assist you in finding a good learning rate for model training.

Breaking Changes

This release introduces breaking changes. The most important are:

Unified Model Trainer (issue 189)

Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely ModelTrainer. This replaces the earlier classes SequenceTaggerTrainer and TextClassifierTrainer.

Downstream task models now implement the new flair.nn.Model interface. So, both the SequenceTagger and TextClassifier now inherit from flair.nn.Model. This allows both models to be trained with the ModelTrainer, like this:

```python
# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')

# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')
```

The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes it easier to understand.

Metric class

The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum EvaluationMetric which you can pass to the ModelTrainer to tell it what to use for evaluation.
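As a reminder of how the two averages differ, the following sketch computes micro- and macro-averaged F1 from per-class counts of true positives, false positives and false negatives. It is a generic illustration with made-up counts and hypothetical function names, not the code of Flair's Metric class.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Average the per-class F1 scores, weighting every class equally."""
    return sum(f1(*counts) for counts in per_class.values()) / len(per_class)


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts of all classes first, then compute a single F1."""
    tp = sum(counts[0] for counts in per_class.values())
    fp = sum(counts[1] for counts in per_class.values())
    fn = sum(counts[2] for counts in per_class.values())
    return f1(tp, fp, fn)


# Made-up counts: a frequent, well-predicted class and a rare, poorly predicted one.
counts = {"PER": (8, 2, 2), "LOC": (1, 1, 3)}
macro = macro_f1(counts)
micro = micro_f1(counts)
```

The macro average is pulled down more strongly by the rare class, because every class counts equally regardless of its frequency.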

Updates and Bug Fixes

Torch 1.0 (issue 176)

Flair now builds on torch 1.0.

Use Pathlib (issue 176)

Flair now uses Path wherever possible to allow easier operations on files/directories. However, our interfaces still allow you to pass a string, which will then be transformed into a Path by Flair.
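The pattern of accepting either type and normalizing to Path can be sketched as follows. The helper name as_path and the example paths are hypothetical and only illustrate the idea; they are not taken from Flair's code base.

```python
from pathlib import Path
from typing import Union


def as_path(location: Union[str, Path]) -> Path:
    """Accept either a string or a Path and always hand back a Path."""
    return location if isinstance(location, Path) else Path(location)


# Both call styles end up with a Path, so path arithmetic works uniformly.
from_string = as_path("resources/taggers/example") / "final-model.pt"
from_path = as_path(Path("resources/taggers/example")) / "final-model.pt"
```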

Bug Fixes

  • Fix: Non-whitespace tokenized text results in an infinite loop (issue 226)
  • Fix: Getting IndexError: list index out of range error (issue 233)
  • Do not reset cache directory always to None (issue 249)
  • Filter sentences with zero tokens (issue 266)

- Python
Published by alanakbik about 7 years ago

flair - Release 0.3.2

This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.

New Features

Embeddings

More word embeddings (#194 )

We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'), load using the two-letter language code, like this:

```python
french_embedding = WordEmbeddings('fr')
```

More character LM embeddings (#204 #187 )

Thanks to contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load like this:

```python
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
```

Custom embeddings (#170 )

Add explanation on how to use your own custom word embeddings. Simply convert to gensim.KeyedVectors and point embedding class there:

```python
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
```

New embeddings type: DocumentPoolEmbeddings (#191 )

Add a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:

```python
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
```
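What the pooling options do can be shown on plain lists: each dimension of the document vector is the minimum, maximum or average of that dimension over all word vectors. The sketch below is a pure-Python illustration; the pool function and the mode string 'mean' are assumptions of this example rather than the DocumentPoolEmbeddings implementation.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Map each pooling mode to the function that reduces one dimension.
POOLING: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": min,
    "max": max,
    "mean": _mean,
}


def pool(word_vectors: List[Vector], mode: str = "mean") -> Vector:
    """Reduce a list of equally sized word vectors to one document vector."""
    reduce_dimension = POOLING[mode]
    # zip(*word_vectors) groups the values of each dimension across all words.
    return [reduce_dimension(dimension) for dimension in zip(*word_vectors)]


word_vectors = [[1.0, 4.0], [3.0, 2.0]]
min_pooled = pool(word_vectors, "min")
max_pooled = pool(word_vectors, "max")
mean_pooled = pool(word_vectors, "mean")
```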

Language model

New method: generate_text() (#167 )

The LanguageModel class now has an in-built generate_text() method to sample the LM. Run code like this:

```python
# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20000 characters
text = model.generate_text(20000)
print(text)
```

Metrics

Class-based metrics in Metric class (#164 )

Refactored Metric class to provide class-based metrics, as well as micro and macro averaged F1 scores.

Bug Fixes

Fix serialization error for MacOS and Windows (#174 )

On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. Added bonus is that models are smaller now.

"Frozen" dropout (#184 )

Potentially big issue in which dropout was frozen in the first epoch in embeddings produced from the character LM, meaning that throughout training the same dimensions stayed dropped. Fixed this.

Testing step in language model trainer (#178 )

Previously, the language model was never applied to test data during training. A final testing step has been added in (again).

Testing

Distinguish between unit and integration tests (#183)

Instructions on how to run tests with pipenv (#161 )

Optimizations

Disable autograd during testing and prediction (#175)

Since autograd is unused here this gives us minor speedups.

- Python
Published by alanakbik over 7 years ago

flair - Release 0.3.1

This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.

Optimizations

Retain Token embeddings in memory by default (#146 )

Allow for faster training of the text classifier on large datasets by keeping token embeddings in memory.

Always clear embeddings after prediction (#149 )

After prediction, remove embeddings from memory to avoid filling up memory.

Refactorings

Aligned TextClassifierTrainer and SequenceTaggerTrainer (#148 )

Align signatures and features of the two training classes to make it easier to understand training options.

Updated DocumentLSTMEmbeddings (#150 )

Remove unused flag and code from DocumentLSTMEmbeddings

Removed unneeded AWS and Jinja2 dependencies (#158 )

Some dependencies are no longer required.

Bug Fixes

Fixed error when predicting over empty sentences. (#157)

Serialization: reset cache settings when saving a model. (#153 )

- Python
Published by alanakbik over 7 years ago

flair - Release 0.3.0

Breaking Changes

New Label class with confidence score (https://github.com/zalandoresearch/flair/issues/38)

A tag prediction is not a simple string anymore but a Label, which holds a value and a confidence score. To obtain the tag name you need to call tag.value. To get the score call tag.score. This can help you build applications in which you only want to use predictions that lie above a specific confidence threshold.
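The thresholding use case can be sketched with a stand-in class. The Label dataclass below only mirrors the two attributes named above (value and score); it is not Flair's Label class, and the predictions and threshold are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    """Stand-in for a prediction with a value and a confidence score."""
    value: str
    score: float


def confident_labels(labels: List[Label], threshold: float) -> List[Label]:
    """Keep only predictions whose confidence reaches the threshold."""
    return [label for label in labels if label.score >= threshold]


# Hypothetical predictions for three tokens.
predictions = [Label("PER", 0.98), Label("LOC", 0.41), Label("ORG", 0.87)]
kept = confident_labels(predictions, threshold=0.8)
kept_values = [label.value for label in kept]
```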

LockedDropout moved to the new flair.nn module (https://github.com/zalandoresearch/flair/issues/48)

New Features

Multi-token spans (https://github.com/zalandoresearch/flair/issues/54, https://github.com/zalandoresearch/flair/issues/97)

Entities can now be wrapped into multi-token spans (type: Span). This is helpful for entities that span multiple words, such as "George Washington". A Span contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the get_spans() method, like so:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))
```

Predictions with confidence score (https://github.com/zalandoresearch/flair/issues/38)

Predicted tags are no longer simple strings, but objects of type Label that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('George Washington went to Washington .')

# load the POS tagger
tagger = SequenceTagger.load('pos')

# run POS over sentence
tagger.predict(sentence)

# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))
```

Visualization routines (https://github.com/zalandoresearch/flair/issues/61)

flair now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:

```python
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer

# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev

# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')

# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)

# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')
```

Implementation of different dropouts (https://github.com/zalandoresearch/flair/issues/48)

Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.

Memory management for training on large data sets (https://github.com/zalandoresearch/flair/issues/137)

flair now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.

Pre-trained language models for Polish

Added pre-trained language models for Polish, donated by (Borchmann et al., 2018). Load the Polish embeddings like this:

```python
flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')
```

Bug Fixes

Fix evaluation of sequence tagger (https://github.com/zalandoresearch/flair/issues/79, https://github.com/zalandoresearch/flair/issues/75)

The script eval.pl for sequence tagger contained bugs. flair now uses its own evaluation methods.

Fix bugs in text classifier (https://github.com/zalandoresearch/flair/issues/108)

Fixed bugs in single label training and out-of-memory errors during evaluation.

Others

Standardize logging output (https://github.com/zalandoresearch/flair/issues/16)

Logging output for sequence tagger and text classifier is improved and standardized.

Update torch version (https://github.com/zalandoresearch/flair/issues/34, https://github.com/zalandoresearch/flair/issues/106)

flair now uses torch version 0.4.1

Updated documentation (https://github.com/zalandoresearch/flair/issues/138, https://github.com/zalandoresearch/flair/issues/89)

Expanded documentation and tutorials.

- Python
Published by alanakbik over 7 years ago

flair - Version 0.2.0

Breaking Changes

Reorganized package structure #12

There are now two packages: flair.models and flair.trainers for the models and model trainers respectively.

Models package

flair.models contains 3 model classes: SequenceTagger, TextClassifier and LanguageModel.

Trainers package

flair.trainers contains 3 model trainer classes: SequenceTaggerTrainer, TextClassifierTrainer and LanguageModelTrainer.

Direct import from package

You call these classes directly from the packages, for instance the SequenceTagger is now instantiated as:

```python
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
```

Reorganized embeddings #12

Clear distinction between token-level and document-level embeddings by adding two classes, namely TokenEmbeddings and DocumentEmbeddings from which respective embeddings need to inherit.

New Features

LanguageModelTrainer #24 #17

Added LanguageModelTrainer class to train your own LM embeddings.

Document Classification #10

Added experimental TextClassifier model for document-level text classification. Also added corresponding model trainer class, i.e. TextClassifierTrainer.

Batch prediction #7

Added batching into prediction method for faster sequence tagging

CPU-friendly pre-trained models #29

Added pre-trained models with smaller LM embeddings for faster CPU-inference speed

You can load them by adding '-fast' to the model name. Only for English at present.

```python
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner-fast')
```

Learning Rate Scheduling #19

Added learning rate schedulers to all trainer classes for improved learning rate annealing functionality and control.
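One common form of learning rate annealing is to reduce the rate whenever the loss has stopped improving for a number of epochs. The sketch below illustrates that idea in plain Python; the function name, the anneal_factor and patience parameters and the loss values are assumptions for illustration and do not describe the exact schedulers used by the trainer classes.

```python
from typing import List


def anneal_on_plateau(
    losses: List[float],
    learning_rate: float,
    anneal_factor: float = 0.5,
    patience: int = 2,
) -> List[float]:
    """Return the learning rate in effect after each epoch's loss is observed."""
    best_loss = float("inf")
    bad_epochs = 0
    schedule = []
    for loss in losses:
        if loss < best_loss:
            # Improvement: remember the new best and reset the counter.
            best_loss, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Too many epochs without improvement: shrink the learning rate.
            learning_rate *= anneal_factor
            bad_epochs = 0
        schedule.append(learning_rate)
    return schedule


# Made-up losses that stop improving after the second epoch.
schedule = anneal_on_plateau([1.0, 0.9, 0.95, 0.96, 0.97], learning_rate=0.1)
```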

Auto-spawn on GPUs #19

All model classes now automatically spawn on GPUs if available. The separate .cuda() call is no longer necessary.

Bug Fixes

Retagging error #23

Fixed error that occurred when using multiple pre-trained taggers on the same sentence.

Empty sentence error #33

Fixed error that caused data fetchers to sometimes create empty sentences.

Other

Unit Tests #15

Added a large set of automated unit tests for better stability.

Documentation #15

Expanded documentation and tutorials. Also expanded descriptions of APIs.

Code Simplifications in sequence tagger #19

A number of code simplifications all around, hopefully making the code easier to understand.

- Python
Published by alanakbik over 7 years ago

flair - Version 0.1.0

First release of Flair Framework

Static word embeddings:

  • includes prepared word embeddings from GloVe, FastText, Numberbatch and Extvec
  • includes prepared word embeddings for English, German and Swedish

Contextual string embeddings:

  • includes pre-trained models for English and German

Text embeddings:

  • Two experimental methods for full-text embeddings (LSTM and Mean)

Sequence labeling:

  • pre-trained models for English (PoS-tagging, chunking and NER)
  • pre-trained models for German (PoS-tagging and NER)
  • experimental semantic frame detector for English

- Python
Published by alanakbik over 7 years ago