band

BAND: BERT Application aNd Deployment, a simple and efficient BERT model training and deployment framework.

https://github.com/sunyancn/band

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

bert named-entity-recognition question-answering reading-comprehension sequence-labeling text-classification transformer

Keywords from Contributors

interactive serializer packaging network-simulation hacking autograding observability embedded optim standardization
Last synced: 6 months ago

Repository

BAND: BERT Application aNd Deployment, a simple and efficient BERT model training and deployment framework.

Basic Info
  • Host: GitHub
  • Owner: SunYanCN
  • License: apache-2.0
  • Language: JavaScript
  • Default Branch: master
  • Homepage:
  • Size: 2.42 MB
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 1
  • Open Issues: 3
  • Releases: 0
Topics
bert named-entity-recognition question-answering reading-comprehension sequence-labeling text-classification transformer
Created about 6 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing Funding License

README.md

BAND: BERT Application aNd Deployment

A simple and efficient BERT model training and deployment framework.


What is it

**Encoding/Embedding** is an upstream task that encodes any input, whether text, image, audio, video, or transactional data, into a fixed-length vector. Embeddings are very popular in NLP, and researchers have proposed a variety of embedding models in recent years; some well-known ones are BERT, XLNet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques: it starts with popular text embeddings, with the aim of later adding techniques for image, audio, and video inputs as well. **Finally**, **`embedding-as-service`** helps you encode any given text into a fixed-length vector using the supported embeddings and models.
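To make the idea concrete, here is a minimal sketch of turning a sentence into a fixed-length vector. It uses the plain `transformers` TF API with the public `bert-base-uncased` checkpoint rather than a band-specific call, and mean-pools the final hidden states; treat it as an illustration, not band's own encoding path:

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Load a public BERT checkpoint (downloads on first use).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

# Encode a sentence; the model returns one hidden state per token.
inputs = tokenizer.encode_plus("a simple sentence", return_tensors="tf")
sequence_output = model(inputs)[0]                   # (1, seq_len, 768)

# Mean-pool over tokens: any-length text in, one fixed-length vector out.
embedding = tf.reduce_mean(sequence_output, axis=1)  # (1, 768)
print(embedding.shape)
```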

💾 Installation

Install band via pip:

```bash
$ pip install band -U
```

Note that the code MUST be run on Python >= 3.6; the module does not support Python 2.
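If in doubt, a one-line guard (not part of band itself) makes the requirement explicit:

```python
import sys

# band targets Python 3.6+; fail fast on older interpreters.
assert sys.version_info >= (3, 6), "band requires Python >= 3.6"
```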

⚡ Getting Started

Text Classification Example

```python
import time

import tensorflow as tf
from transformers import BertConfig, BertTokenizer

from band.model import TFBertForSequenceClassification
from band.dataset import ChnSentiCorp
from band.progress import classification_convert_examples_to_features

USE_XLA = False
USE_AMP = False

EPOCHS = 1
BATCH_SIZE = 16
EVAL_BATCH_SIZE = 16
TEST_BATCH_SIZE = 1
MAX_SEQ_LEN = 128
LEARNING_RATE = 3e-5
SAVE_MODEL = False
pretrained_dir = "/home/band/models"

tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

# Download the ChnSentiCorp sentiment dataset and print its statistics.
dataset = ChnSentiCorp(save_path="/tmp/band")
data, label = dataset.data, dataset.label
dataset.dataset_information()

train_number, eval_number, test_number = (dataset.train_examples_num,
                                          dataset.eval_examples_num,
                                          dataset.test_examples_num)

# Convert raw examples into BERT input features.
tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
train_dataset = classification_convert_examples_to_features(
    data['train'], tokenizer, max_length=MAX_SEQ_LEN, label_list=label, output_mode="classification")
valid_dataset = classification_convert_examples_to_features(
    data['validation'], tokenizer, max_length=MAX_SEQ_LEN, label_list=label, output_mode="classification")
test_dataset = classification_convert_examples_to_features(
    data['test'], tokenizer, max_length=MAX_SEQ_LEN, label_list=label, output_mode="classification")

# Build the tf.data input pipelines.
train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE, drop_remainder=True).repeat(EPOCHS)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
valid_dataset = valid_dataset.prefetch(tf.data.experimental.AUTOTUNE)
test_dataset = test_dataset.batch(TEST_BATCH_SIZE)
test_dataset = test_dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Load the pretrained model (converted from PyTorch weights) and compile it.
config = BertConfig.from_pretrained(pretrained_dir, num_labels=dataset.num_labels)
model = TFBertForSequenceClassification.from_pretrained(pretrained_dir, config=config, from_pt=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, epsilon=1e-08)
if USE_AMP:
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, 'dynamic')
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_number // BATCH_SIZE,
                    validation_data=valid_dataset, validation_steps=eval_number // EVAL_BATCH_SIZE)

loss, accuracy = model.evaluate(test_dataset, steps=test_number // TEST_BATCH_SIZE)
print(loss, accuracy)

if SAVE_MODEL:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
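Once trained, the model can classify new text directly. A hedged sketch reusing `tokenizer`, `model`, and `MAX_SEQ_LEN` from the example above; it assumes the model returns logits the way the underlying `transformers` TF models do, and uses the `encode_plus`/`pad_to_max_length` API of the transformers 2.x release pinned in requirements.txt:

```python
import numpy as np

# Encode a single review ("The service at this hotel is great").
inputs = tokenizer.encode_plus("这家酒店的服务很好", max_length=MAX_SEQ_LEN,
                               pad_to_max_length=True, return_tensors="tf")
outputs = model(inputs)
logits = outputs[0] if isinstance(outputs, tuple) else outputs  # (1, num_labels)
print("predicted label id:", int(np.argmax(logits, axis=-1)[0]))
```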

Named Entity Recognition

```python
import time

import tensorflow as tf
from transformers import BertTokenizer, BertConfig

from band.dataset import MSRA_NER
from band.seqeval.callbacks import F1Metrics
from band.model import TFBertForTokenClassification
from band.utils import TrainConfig
from band.progress import NERDataset

pretrained_dir = '/home/band/models'

train_config = TrainConfig(epochs=3, train_batch_size=32, eval_batch_size=32, test_batch_size=1,
                           max_length=128, learning_rate=3e-5, save_model=False)

# Download the MSRA NER sequence-labeling dataset.
dataset = MSRA_NER(save_path="/tmp/band")

config = BertConfig.from_pretrained(pretrained_dir, num_labels=dataset.num_labels, return_unused_kwargs=True)
tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
model = TFBertForTokenClassification.from_pretrained(pretrained_dir, config=config, from_pt=True)

# NERDataset wraps tokenization, batching, and a default optimizer/loss/metric.
ner = NERDataset(dataset=dataset, tokenizer=tokenizer, train_config=train_config)
model.compile(optimizer=ner.optimizer, loss=ner.loss, metrics=[ner.metric])

# Report entity-level F1 on the validation set after each epoch.
f1 = F1Metrics(dataset.get_labels(), validation_data=ner.valid_dataset, steps=ner.valid_steps)

# steps_per_epoch assumes NERDataset exposes train_steps; the upstream snippet passed test_steps here.
history = model.fit(ner.train_dataset, epochs=train_config.epochs,
                    steps_per_epoch=ner.train_steps, callbacks=[f1])

loss, accuracy = model.evaluate(ner.test_dataset, steps=ner.test_steps)

if train_config.save_model:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
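For tagging a new sentence, a hedged sketch along the same lines (assuming per-token logits from the model and that `dataset.get_labels()` lists labels in id order, as the `F1Metrics` call above suggests):

```python
import numpy as np

labels = dataset.get_labels()
text = "我在武汉上学"  # "I study in Wuhan"; Chinese BERT tokenizes roughly per character
tokens = tokenizer.tokenize(text)
inputs = tokenizer.encode_plus(text, max_length=128,
                               pad_to_max_length=True, return_tensors="tf")
outputs = model(inputs)
logits = outputs[0] if isinstance(outputs, tuple) else outputs  # (1, seq_len, num_labels)
pred_ids = np.argmax(logits, axis=-1)[0]

# Position 0 is [CLS]; align the remaining predictions with the tokens.
for token, label_id in zip(tokens, pred_ids[1:len(tokens) + 1]):
    print(token, labels[label_id])
```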

Question Answering

```python
import time

import tensorflow as tf
from transformers import BertConfig, BertTokenizer

from band.model import TFBertForQuestionAnswering
from band.dataset import Squad
from band.progress import squad_convert_examples_to_features, parallel_squad_convert_examples_to_features

USE_XLA = False
USE_AMP = False

EPOCHS = 1
BATCH_SIZE = 4
EVAL_BATCH_SIZE = 4
TEST_BATCH_SIZE = 1
MAX_SEQ_LEN = 128
LEARNING_RATE = 3e-5
SAVE_MODEL = False
pretrained_dir = "/home/band/models"

tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

# Download SQuAD.
dataset = Squad(save_path="/tmp/band")
data, label = dataset.data, dataset.label

train_number, eval_number = dataset.train_examples_num, dataset.eval_examples_num

# Convert (question, context) pairs into span-prediction features in parallel.
tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
train_dataset = parallel_squad_convert_examples_to_features(
    data['train'], tokenizer, max_length=MAX_SEQ_LEN, doc_stride=128, is_training=True, max_query_length=64)
valid_dataset = parallel_squad_convert_examples_to_features(
    data['validation'], tokenizer, max_length=MAX_SEQ_LEN, doc_stride=128, is_training=False, max_query_length=64)

train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE, drop_remainder=True).repeat(EPOCHS)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
valid_dataset = valid_dataset.prefetch(tf.data.experimental.AUTOTUNE)

config = BertConfig.from_pretrained(pretrained_dir)
model = TFBertForQuestionAnswering.from_pretrained(pretrained_dir, config=config, from_pt=True,
                                                   max_length=MAX_SEQ_LEN)

print(model.summary())

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, epsilon=1e-08)
if USE_AMP:
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, 'dynamic')

# One classification head per answer-span boundary: start position and end position.
loss = {'start_position': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        'end_position': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)}
metrics = {'start_position': tf.keras.metrics.SparseCategoricalAccuracy('accuracy'),
           'end_position': tf.keras.metrics.SparseCategoricalAccuracy('accuracy')}

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_number // BATCH_SIZE,
                    validation_data=valid_dataset, validation_steps=eval_number // EVAL_BATCH_SIZE)

if SAVE_MODEL:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
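At inference time, the answer span is read off the start/end logits. A hedged sketch (assuming the model returns the two logit tensors in start, end order, matching the `start_position`/`end_position` heads compiled above):

```python
import numpy as np

question = "Where is the Eiffel Tower?"
context = "The Eiffel Tower is located in Paris, France."
inputs = tokenizer.encode_plus(question, context, max_length=MAX_SEQ_LEN,
                               pad_to_max_length=True, return_tensors="tf")

start_logits, end_logits = model(inputs)
start = int(np.argmax(start_logits, axis=-1)[0])
end = int(np.argmax(end_logits, axis=-1)[0])

# Decode the predicted token span back to text.
token_ids = inputs["input_ids"].numpy()[0]
print(tokenizer.decode(token_ids[start:end + 1]))
```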

Dataset

For more information about the supported datasets, see the table below.

| Dataset Name | Language | Task                          | Description                |
| :----------: | :------: | :---------------------------: | :------------------------: |
| ChnSentiCorp | CN       | Text Classification           | Binary Classification      |
| LCQMC        | CN       | Question Answer Match         | Binary Classification      |
| MSRA_NER     | CN       | Named Entity Recognition      | Sequence Labeling          |
| Toxic        | EN       | Text Classification           | Multi-label Classification |
| Thucnews     | CN       | Text Classification           | Multi-class Classification |
| SQUAD        | EN       | Machine Reading Comprehension | Span                       |
| DRCD         | CN       | Machine Reading Comprehension | Span                       |
| CMRC         | CN       | Machine Reading Comprehension | Span                       |
| GLUE         | EN       |                               |                            |

✅ Supported Embeddings and Models

For more information about pretrained models, see the project documentation.

Stargazers over time

Owner

  • Name: SunYan
  • Login: SunYanCN
  • Kind: user
  • Location: WuHan
  • Company: HSUT
  • Bio: Smile Like Sunshine


Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 25
  • Total Committers: 3
  • Avg Commits per committer: 8.333
  • Development Distribution Score (DDS): 0.08
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
SunYan 4****N 23
SunYan 4****n 1
dependabot[bot] 4****] 1
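For reference, the DDS figures above are consistent with defining DDS as one minus the top committer's share of commits (an assumption about how this site computes it):

```python
# All-time: 23 of 25 commits come from the top committer.
dds = 1 - 23 / 25
print(round(dds, 2))  # 0.08
```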

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 12
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 months
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.67
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 12
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (12)
Top Labels
Issue Labels
Pull Request Labels
dependencies (12)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 225 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 3
  • Total versions: 14
  • Total maintainers: 1
pypi.org: band

BERT Application

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 225 Last month
Rankings
Dependent repos count: 9.0%
Dependent packages count: 10.0%
Average: 15.6%
Downloads: 16.0%
Stargazers count: 20.3%
Forks count: 22.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • h5py *
  • jieba ==0.39
  • nltk ==3.4.5
  • numpy ==1.16.4
  • pandas ==0.23.4
  • prettytable *
  • scikit-learn >=0.21.1
  • six *
  • tabulator ==1.30.0
  • tensorflow ==2.0.1
  • tqdm *
  • transformers ==2.2.0
setup.py pypi