band
BAND: BERT Application aNd Deployment, a simple and efficient BERT model training and deployment framework.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
BAND: BERT Application aNd Deployment
A simple and efficient BERT model training and deployment framework.
Documents » · Examples · Report Bug · Feature Request · Questions
What is it
**Encoding/Embedding** is an upstream task that encodes any input, whether text, image, audio, video, or transactional data, into a fixed-length vector. Embeddings are quite popular in the field of NLP, and researchers have proposed various embedding models in recent years; some of the well-known ones are BERT, XLNet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings for now, and later we aim to add as many techniques for image, audio, and video inputs as possible.
**Finally**, **`embedding-as-service`** helps you encode any given text into a fixed-length vector using the supported embeddings and models.
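To make the idea concrete, here is a minimal sketch of encoding one sentence into a fixed-length vector. It calls the plain `transformers` API directly rather than band's own wrappers, and the checkpoint name and mean-pooling strategy are illustrative assumptions, not choices this repo prescribes:

```python
# Minimal sketch: encode one sentence into a fixed-length vector with BERT.
# Uses the generic transformers 2.x API (as pinned by this repo), not band's
# wrappers; "bert-base-uncased" and mean pooling are illustrative choices.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer.encode_plus("BERT maps text to vectors.", return_tensors="tf")
sequence_output = model(inputs)[0]                   # (1, seq_len, 768)
embedding = tf.reduce_mean(sequence_output, axis=1)  # fixed-length (1, 768)
print(embedding.shape)
```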
💾 Installation
Install band via pip:
```bash
$ pip install band -U
```
Note that the code MUST be run on Python >= 3.6; the module does not support Python 2!
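A quick smoke test after installing (nothing beyond a successful import is guaranteed here):

```python
# Post-install smoke test: the import should succeed on Python >= 3.6.
import band
print("band imported successfully")
```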
⚡ Getting Started
Text Classification Example
```python
import time

import tensorflow as tf
from transformers import BertConfig, BertTokenizer

from band.model import TFBertForSequenceClassification
from band.dataset import ChnSentiCorp
from band.progress import classification_convert_examples_to_features

USE_XLA = False
USE_AMP = False

EPOCHS = 1
BATCH_SIZE = 16
EVAL_BATCH_SIZE = 16
TEST_BATCH_SIZE = 1
MAX_SEQ_LEN = 128
LEARNING_RATE = 3e-5
SAVE_MODEL = False
pretrained_dir = "/home/band/models"

tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

dataset = ChnSentiCorp(save_path="/tmp/band")
data, label = dataset.data, dataset.label
dataset.dataset_information()

train_number, eval_number, test_number = dataset.train_examples_num, dataset.eval_examples_num, dataset.test_examples_num

tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
train_dataset = classification_convert_examples_to_features(data['train'], tokenizer, max_length=MAX_SEQ_LEN,
                                                            label_list=label, output_mode="classification")
valid_dataset = classification_convert_examples_to_features(data['validation'], tokenizer, max_length=MAX_SEQ_LEN,
                                                            label_list=label, output_mode="classification")
test_dataset = classification_convert_examples_to_features(data['test'], tokenizer, max_length=MAX_SEQ_LEN,
                                                           label_list=label, output_mode="classification")

train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE, drop_remainder=True).repeat(EPOCHS)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
valid_dataset = valid_dataset.prefetch(tf.data.experimental.AUTOTUNE)
test_dataset = test_dataset.batch(TEST_BATCH_SIZE)
test_dataset = test_dataset.prefetch(tf.data.experimental.AUTOTUNE)

config = BertConfig.from_pretrained(pretrained_dir, num_labels=dataset.num_labels)
model = TFBertForSequenceClassification.from_pretrained(pretrained_dir, config=config, from_pt=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, epsilon=1e-08)
if USE_AMP:
    # loss scaling keeps mixed-precision training numerically stable
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, 'dynamic')
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_number // BATCH_SIZE,
                    validation_data=valid_dataset, validation_steps=eval_number // EVAL_BATCH_SIZE)

loss, accuracy = model.evaluate(test_dataset, steps=test_number // TEST_BATCH_SIZE)
print(loss, accuracy)

if SAVE_MODEL:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
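Deployment then amounts to reloading that SavedModel and feeding it freshly tokenized text. A minimal sketch under stated assumptions: the timestamped path and sample sentence are made up, and the reloaded model is assumed to accept `encode_plus()` outputs directly and to expose logits as its first output:

```python
# Sketch: serve the exported classifier on one sentence.
# ASSUMPTIONS: hypothetical timestamped export path; the reloaded model
# accepts encode_plus() outputs directly and returns logits first.
import tensorflow as tf
from transformers import BertTokenizer

saved_model_path = "./saved_models/1600000000"  # hypothetical
model = tf.keras.models.load_model(saved_model_path)
tokenizer = BertTokenizer.from_pretrained("/home/band/models")

enc = tokenizer.encode_plus("这家酒店的服务很好", max_length=128,
                            pad_to_max_length=True, return_tensors="tf")
outputs = model(enc)
logits = outputs[0] if isinstance(outputs, (list, tuple)) else outputs
print(tf.argmax(logits, axis=-1).numpy())  # predicted class id
```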
Named Entity Recognition
```python
import time

import tensorflow as tf
from transformers import BertTokenizer, BertConfig

from band.dataset import MSRA_NER
from band.seqeval.callbacks import F1Metrics
from band.model import TFBertForTokenClassification
from band.utils import TrainConfig
from band.progress import NERDataset

pretrained_dir = '/home/band/models'

train_config = TrainConfig(epochs=3, train_batch_size=32, eval_batch_size=32, test_batch_size=1,
                           max_length=128, learning_rate=3e-5, save_model=False)

dataset = MSRA_NER(save_path="/tmp/band")

config = BertConfig.from_pretrained(pretrained_dir, num_labels=dataset.num_labels)
tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
model = TFBertForTokenClassification.from_pretrained(pretrained_dir, config=config, from_pt=True)

ner = NERDataset(dataset=dataset, tokenizer=tokenizer, train_config=train_config)
model.compile(optimizer=ner.optimizer, loss=ner.loss, metrics=[ner.metric])

f1 = F1Metrics(dataset.get_labels(), validation_data=ner.valid_dataset, steps=ner.valid_steps)

history = model.fit(ner.train_dataset, epochs=train_config.epochs,
                    steps_per_epoch=ner.train_steps,  # steps from the training split
                    callbacks=[f1])

loss, accuracy = model.evaluate(ner.test_dataset, steps=ner.test_steps)

if train_config.save_model:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
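To see what the trained tagger predicts, the per-token logits can be mapped back to label names. A minimal sketch, continuing from the variables above and assuming `model.predict` returns logits of shape `(batch, seq_len, num_labels)` whose ids index into `dataset.get_labels()`:

```python
# Sketch: decode per-token predictions back into NER label strings.
# ASSUMPTION: label ids index into dataset.get_labels() in order, as
# implied by the F1Metrics callback above.
import tensorflow as tf

labels = dataset.get_labels()
logits = model.predict(ner.test_dataset, steps=ner.test_steps)
label_ids = tf.argmax(logits, axis=-1).numpy()
print([labels[i] for i in label_ids[0]])  # labels for the first sequence
```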
Question Answering
```python
import time

import tensorflow as tf
from transformers import BertConfig, BertTokenizer

from band.model import TFBertForQuestionAnswering
from band.dataset import Squad
from band.progress import squad_convert_examples_to_features, parallel_squad_convert_examples_to_features

USE_XLA = False
USE_AMP = False

EPOCHS = 1
BATCH_SIZE = 4
EVAL_BATCH_SIZE = 4
TEST_BATCH_SIZE = 1
MAX_SEQ_LEN = 128
LEARNING_RATE = 3e-5
SAVE_MODEL = False
pretrained_dir = "/home/band/models"

tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

dataset = Squad(save_path="/tmp/band")
data, label = dataset.data, dataset.label

train_number, eval_number = dataset.train_examples_num, dataset.eval_examples_num

tokenizer = BertTokenizer.from_pretrained(pretrained_dir)
train_dataset = parallel_squad_convert_examples_to_features(data['train'], tokenizer, max_length=MAX_SEQ_LEN,
                                                            doc_stride=128, is_training=True, max_query_length=64)
valid_dataset = parallel_squad_convert_examples_to_features(data['validation'], tokenizer, max_length=MAX_SEQ_LEN,
                                                            doc_stride=128, is_training=False, max_query_length=64)

train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE, drop_remainder=True).repeat(EPOCHS)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
valid_dataset = valid_dataset.prefetch(tf.data.experimental.AUTOTUNE)

config = BertConfig.from_pretrained(pretrained_dir)
model = TFBertForQuestionAnswering.from_pretrained(pretrained_dir, config=config, from_pt=True,
                                                   max_length=MAX_SEQ_LEN)

model.summary()

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, epsilon=1e-08)
if USE_AMP:
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, 'dynamic')

loss = {'start_position': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        'end_position': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)}
metrics = {'start_position': tf.keras.metrics.SparseCategoricalAccuracy('accuracy'),
           'end_position': tf.keras.metrics.SparseCategoricalAccuracy('accuracy')}

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_number // BATCH_SIZE,
                    validation_data=valid_dataset, validation_steps=eval_number // EVAL_BATCH_SIZE)

if SAVE_MODEL:
    saved_model_path = "./saved_models/{}".format(int(time.time()))
    model.save(saved_model_path, save_format="tf")
```
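At inference time, the start/end logits still have to be decoded into a text span. A greedy sketch reusing `tokenizer` and `model` from the example above; the output ordering, the toy question/context, and the lack of a `start <= end` check are all simplifying assumptions:

```python
# Sketch: greedy answer-span decoding from start/end logits.
# ASSUMPTION: the model returns (start_logits, end_logits) in that order.
import tensorflow as tf

question = "Who wrote Hamlet?"  # toy inputs
context = "Hamlet is a tragedy written by William Shakespeare."
enc = tokenizer.encode_plus(question, context, max_length=128,
                            pad_to_max_length=True, return_tensors="tf")
start_logits, end_logits = model(enc)
start = int(tf.argmax(start_logits, axis=-1)[0])
end = int(tf.argmax(end_logits, axis=-1)[0])
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].numpy().tolist())
print(tokenizer.convert_tokens_to_string(tokens[start:end + 1]))
```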
Dataset
For more information about the datasets, see the documentation.
| Dataset Name | Language | Task | Description |
| :----------: | :------: | :---------------------------: | :------------------------: |
| ChnSentiCorp | CN | Text Classification | Binary Classification |
| LCQMC | CN | Question Answer Match | Binary Classification |
| MSRA_NER | CN | Named Entity Recognition | Sequence Labeling |
| Toxic | EN | Text Classification | Multi-label Classification |
| Thucnews | CN | Text Classification | Multi-class Classification |
| SQUAD | EN | Machine Reading Comprehension | Span |
| DRCD | CN | Machine Reading Comprehension | Span |
| CMRC | CN | Machine Reading Comprehension | Span |
| GLUE | EN | | |
✅ Supported Embeddings and Models
For more information about pretrained models, see the documentation.
Stargazers over time
Owner
- Name: SunYan
- Login: SunYanCN
- Kind: user
- Location: WuHan
- Company: HSUT
- Website: http://pddj99.coding-pages.com/
- Repositories: 50
- Profile: https://github.com/SunYanCN
- Bio: Smile Like Sunshine
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| SunYan | 4****N | 23 |
| SunYan | 4****n | 1 |
| dependabot[bot] | 4****] | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 12
- Average time to close issues: N/A
- Average time to close pull requests: 3 months
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.67
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 12
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- dependabot[bot] (12)
Packages
- Total packages: 1
- Total downloads: 225 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 3
- Total versions: 14
- Total maintainers: 1
pypi.org: band
BERT Application
- Homepage: https://github.com/sunyancn/band
- Documentation: https://band.readthedocs.io/
- License: MIT License
- Latest release: 0.3.3 (published about 6 years ago)
Maintainers (1)
Dependencies
- h5py *
- jieba ==0.39
- nltk ==3.4.5
- numpy ==1.16.4
- pandas ==0.23.4
- prettytable *
- scikit-learn >=0.21.1
- six *
- tabulator ==1.30.0
- tensorflow ==2.0.1
- tqdm *
- transformers ==2.2.0