https://github.com/chenghaomou/embeddings
zero-vocab or low-vocab embeddings
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 8.4%, to scientific vocabulary)
Keywords
- embeddings
- nlp
- text-processing
- transformers
Last synced: 5 months ago
Repository
Statistics
- Stars: 18
- Watchers: 3
- Forks: 1
- Open Issues: 0
- Releases: 0
Topics
- embeddings
- nlp
- text-processing
- transformers
Created almost 5 years ago · Last pushed over 3 years ago
https://github.com/ChenghaoMou/embeddings/blob/main/

[PyPI version](https://badge.fury.io/py/text-embeddings) · [Codacy grade](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Grade) · [Codacy coverage](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Coverage)
## Features
- [x] `VTRTokenizer` from [Robust Open-Vocabulary Translation from Visual Text Representations](https://t.co/l9E6rL8O5p?amp=1)
- [x] `PQRNNTokenizer` from [Advancing NLP with Efficient Projection-Based Model Architectures](https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html)
- [x] `CANINETokenizer` from [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874)
- [x] `ByT5Tokenizer` from [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/pdf/2105.13626.pdf)
- [x] `GBST` and `ByteTokenizer` from [Charformer: Fast Character Transformers via Gradient-based Subword Tokenization](https://arxiv.org/abs/2106.12672)
- [x] `LTPMultiHeadAttention` from [Learned Token Pruning for Transformers](https://arxiv.org/abs/2107.00910)
- [x] `X` and `XLoss`, a model inspired by [PonderNet](https://arxiv.org/abs/2107.05407) and [Perceiver](https://arxiv.org/abs/2103.03206), with byte embeddings.
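
What all of these share is that no learned subword vocabulary is needed: tokens come directly from bytes, characters, rendered glyphs, or hashed projections. As a plain-Python illustration of the byte-level idea behind `ByteTokenizer` and `ByT5Tokenizer` (a conceptual sketch, not this library's code):

```python
# Conceptual sketch only: byte-level "tokenization" needs no vocabulary.
# Every string maps to UTF-8 byte values in [0, 255]; byte-level models
# reserve a few extra ids for special tokens (hence vocab_size=259 in the
# GBST example below: 256 byte values plus a handful of specials).
text = "Hello world!"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

decoded = bytes(byte_ids).decode("utf-8")
assert decoded == text  # lossless round trip, any language, no OOV tokens
```

Since every UTF-8 string round-trips through the same 256 byte values, there is no out-of-vocabulary problem by construction.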
## Examples
- [x] [Machine Translation](examples/translation/nmt_transformer.py)
- [x] [Text Classification](examples/classification/rnn.py)
## Installation
```bash
pip install text-embeddings --upgrade
```
## Documentation
[Link](https://chenghaomou.github.io/embeddings/)
## Example Usage
```python
from text_embeddings.visual import VTRTokenizer

data = [
    "Hello world!",
    "Hola Mundo!",
    "",
]

# Render each text as an image and slice it into fixed-size windows;
# every "token" is a font_size x window_size patch of rendered text.
tokenizer = VTRTokenizer(
    font_size=14,
    window_size=10,
    font="resources/NotoSans-Regular.ttf",
    max_length=36
)
results = tokenizer(
    text=data,
    text_pair=data,
    add_special_tokens=True,
    padding="longest",
    return_tensors='pt',
    truncation="longest_first",
    return_attention_mask=True,
    return_special_tokens_mask=True,
    return_length=True,
    prepend_batch_axis=True,
    return_overflowing_tokens=False,
)

# input_ids holds image patches, not integer ids, so it is 4-d:
# (batch, sequence_length, font_size, window_size)
assert results["input_ids"].shape == (3, results["input_ids"].shape[1], 14, 10)
assert results["attention_mask"].shape == (3, results["input_ids"].shape[1])
assert results["token_type_ids"].shape == (3, results["input_ids"].shape[1])
assert results["length"].shape == (3, )
```
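
Because `input_ids` contains rendered image slices rather than integer ids, a downstream model replaces its embedding lookup with a small projection over each patch. A minimal sketch of that step (plain PyTorch; the linear projection here is illustrative, not part of this library):

```python
import torch
import torch.nn as nn

# Hypothetical downstream step: flatten each rendered patch and project it
# to a model dimension, taking the place of an nn.Embedding lookup table.
font_size, window_size, hidden = 14, 10, 128
project = nn.Linear(font_size * window_size, hidden)

patches = torch.randn(3, 36, font_size, window_size)  # like results["input_ids"]
embeddings = project(patches.flatten(start_dim=2))    # (3, 36, 128)
assert embeddings.shape == (3, 36, hidden)
```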
## Write Your Own Embedding Tokenizer
```python
import numpy as np
from typing import Optional, List, Dict

from text_embeddings.base import EmbeddingTokenizer


class MyOwnTokenizer(EmbeddingTokenizer):

    def __init__(
        self,
        model_input_names: Optional[List[str]] = None,
        special_tokens: Optional[Dict[str, np.ndarray]] = None,
        max_length: Optional[int] = 2048,
    ):
        super().__init__(model_input_names, special_tokens, max_length)

    def text2embeddings(self, text: str) -> np.ndarray:
        # Map a string to (sequence_length, *dimensions);
        # here each token is mapped to a 3-d array.
        sequence_length = 10
        dimensions = (10, 10, 10)
        return np.zeros((sequence_length, *dimensions))

    def create_padding_token_embedding(self, input_embeddings=None) -> np.ndarray:
        # The padding embedding must match the per-token shape above,
        # so create a consistent 3-d array.
        return np.zeros((10, 10, 10))
```
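
With the two abstract methods implemented, the subclass can be called like the built-in tokenizers. A sketch, assuming `EmbeddingTokenizer` provides the same Hugging Face-style `__call__` interface shown in the VTR example above:

```python
tokenizer = MyOwnTokenizer()
results = tokenizer(
    text=["Hello world!", "Hola Mundo!"],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
    return_tensors="pt",
)

# Each token is a 3-d array, so input_ids should be 5-dimensional:
# (batch, sequence_length, 10, 10, 10)
print(results["input_ids"].shape)
```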
## Example Usage for GBST
```python
import torch
import torch.onnx  # nightly torch only

from text_embeddings.byte.charformer import GBST, ByteTokenizer

model = GBST(
    embed_size=128,
    max_block_size=4,
    downsampling_factor=2,
    score_calibration=True,
    vocab_size=259,  # 256 byte values plus special tokens
)

tokenizer = ByteTokenizer()
results = tokenizer(
    ["Life is like a box of chocolates.", "Coding is fun."],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
)

# Export the model. Note: integer inputs cannot require gradients,
# and none are needed for export.
torch.onnx.export(
    model,
    torch.tensor(results["input_ids"]).long(),
    "gbst.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)
```
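
The exported graph can then be run outside PyTorch, for example with ONNX Runtime (a sketch assuming the `onnxruntime` package is installed; the `input`/`output` names match the export call, and `results` is the tokenizer output from the block above):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("gbst.onnx", providers=["CPUExecutionProvider"])

# The tokenizer returns Python lists here; ONNX Runtime expects numpy int64.
input_ids = np.array(results["input_ids"], dtype=np.int64)

(output,) = session.run(["output"], {"input": input_ids})
# GBST downsamples the byte sequence, so expect roughly
# (batch_size, sequence_length / downsampling_factor, embed_size).
print(output.shape)
```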
Owner
- Name: Chenghao Mou
- Login: ChenghaoMou
- Kind: user
- Location: Ireland
- Website: https://sleeplessindebugging.blog/
- Repositories: 32
- Profile: https://github.com/ChenghaoMou
- Bio: NLP/AI
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
poetry.lock
pypi
- absl-py 1.1.0 develop
- aiohttp 3.8.1 develop
- aiosignal 1.2.0 develop
- async-timeout 4.0.2 develop
- cachetools 5.2.0 develop
- datasets 1.18.4 develop
- dill 0.3.5.1 develop
- frozenlist 1.3.0 develop
- fsspec 2022.5.0 develop
- google-auth 2.9.1 develop
- google-auth-oauthlib 0.4.6 develop
- multidict 6.0.2 develop
- multiprocess 0.70.13 develop
- oauthlib 3.2.0 develop
- pandas 1.4.3 develop
- protobuf 3.20.1 develop
- pyarrow 8.0.0 develop
- pyasn1 0.4.8 develop
- pyasn1-modules 0.2.8 develop
- pydeprecate 0.3.2 develop
- python-dateutil 2.8.2 develop
- pytorch-lightning 1.6.5 develop
- pytz 2022.1 develop
- requests-oauthlib 1.3.1 develop
- responses 0.18.0 develop
- rsa 4.8 develop
- tensorboard 2.9.0 develop
- tensorboard-data-server 0.6.1 develop
- tensorboard-plugin-wit 1.8.1 develop
- torchmetrics 0.9.2 develop
- werkzeug 2.1.2 develop
- xxhash 3.0.0 develop
- yarl 1.7.2 develop
- atomicwrites 1.4.1
- attrs 21.4.0
- certifi 2022.6.15
- charset-normalizer 2.1.0
- colorama 0.4.5
- einops 0.3.2
- filelock 3.7.1
- grpcio 1.38.1
- huggingface-hub 0.8.1
- idna 3.3
- importlib-metadata 4.12.0
- iniconfig 1.1.1
- joblib 1.1.0
- loguru 0.5.3
- mako 1.2.1
- markdown 3.4.1
- markupsafe 2.1.1
- mmh3 3.0.0
- numpy 1.23.1
- packaging 21.3
- pdoc3 0.9.2
- pillow 8.4.0
- pluggy 1.0.0
- py 1.11.0
- pyparsing 3.0.9
- pytest 6.2.5
- pyyaml 6.0
- regex 2022.7.9
- requests 2.28.1
- scikit-learn 1.1.1
- scipy 1.8.1
- six 1.16.0
- threadpoolctl 3.1.0
- tokenizers 0.12.1
- toml 0.10.2
- torch 1.12.0
- tqdm 4.64.0
- transformers 4.20.1
- typing-extensions 4.3.0
- urllib3 1.26.10
- win32-setctime 1.1.0
- zipp 3.8.1
pyproject.toml
pypi
- datasets ^1.9.0 develop
- pytorch-lightning ^1.3.8 develop
- tqdm ^4.61.2 develop
- Pillow ^8.3.1
- einops ^0.3.0
- grpcio 1.38.1
- loguru ^0.5.3
- mmh3 ^3.0.0
- numpy ^1.21.1
- pdoc3 ^0.9.2
- pytest ^6.2.4
- python >=3.8,<3.11
- scikit-learn ^1.1.1
- scipy ^1.8.1
- tokenizers ^0.12.1
- torch ^1.12.0
- transformers ^4.8.2