https://github.com/chenghaomou/embeddings

zero-vocab or low-vocab embeddings

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary

Keywords

embeddings nlp text-processing transformers
Last synced: 5 months ago

Repository

zero-vocab or low-vocab embeddings

Basic Info
  • Host: GitHub
  • Owner: ChenghaoMou
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.33 MB
Statistics
  • Stars: 18
  • Watchers: 3
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
embeddings nlp text-processing transformers
Created almost 5 years ago · Last pushed over 3 years ago

https://github.com/ChenghaoMou/embeddings/blob/main/

![banner](./banner.png)
[![PyPI version](https://badge.fury.io/py/text-embeddings.svg)](https://badge.fury.io/py/text-embeddings) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/112e50abd97444a4aca06f94fb7e8873)](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Grade)[![Codacy Badge](https://app.codacy.com/project/badge/Coverage/112e50abd97444a4aca06f94fb7e8873)](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Coverage)

## Features

-   [x] `VTRTokenizer` from [Robust Open-Vocabulary Translation from Visual Text Representations](https://t.co/l9E6rL8O5p?amp=1)
-   [x] `PQRNNTokenizer` from [Advancing NLP with Efficient Projection-Based Model Architectures](https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html)
-   [x] `CANINETokenizer` from [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874)
-   [x] `ByT5Tokenizer` from [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/pdf/2105.13626.pdf)
-   [x] `GBST` and `ByteTokenizer` from [Charformer: Fast Character Transformers via Gradient-based Subword Tokenization](https://arxiv.org/abs/2106.12672)
-   [x] `LTPMultiHeadAttention` from [Learned Token Pruning for Transformers](https://arxiv.org/abs/2107.00910)
-   [x] `X` and `XLoss`, a model inspired by [PonderNet](https://arxiv.org/abs/2107.05407) and [Perceiver](https://arxiv.org/abs/2103.03206), with byte embeddings.

## Examples

-   [x] [Machine Translation](examples/translation/nmt_transformer.py)
-   [x] [Text Classification](examples/classification/rnn.py)

## Installation

```bash
pip install text-embeddings --upgrade
```
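
To confirm the install, importing the classes used in the examples below should succeed. The import paths here are taken from those examples; other tokenizers listed under Features may live under different modules.

```python
# Quick post-install sanity check; import paths are taken from the usage examples below.
from text_embeddings.visual import VTRTokenizer
from text_embeddings.byte.charformer import GBST, ByteTokenizer

print(VTRTokenizer.__name__, GBST.__name__, ByteTokenizer.__name__)
```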

## Documentation

[Link](https://chenghaomou.github.io/embeddings/)

## Example Usage

```python
from text_embeddings.visual import VTRTokenizer

data = [
    "Hello world!",
    "Hola Mundo!",
    "",
]

tokenizer = VTRTokenizer(
    font_size=14,
    window_size=10,
    font="resources/NotoSans-Regular.ttf",
    max_length=36
)

results = tokenizer(
    text=data,
    text_pair=data,
    add_special_tokens=True,
    padding="longest", 
    return_tensors='pt',
    truncation="longest_first", 
    return_attention_mask=True, 
    return_special_tokens_mask=True,
    return_length=True,
    prepend_batch_axis=True,
    return_overflowing_tokens=False,
)

assert results["input_ids"].shape == (3, results["input_ids"].shape[1], 14, 10) 
assert results["attention_mask"].shape == (3, results["input_ids"].shape[1])
assert results["token_type_ids"].shape == (3, results["input_ids"].shape[1])
assert results["length"].shape == (3, )
```
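
The `input_ids` above are per-token image patches of shape `(font_size, window_size)`, so a downstream model needs a small visual encoder instead of an embedding lookup. Below is a minimal sketch of such an encoder in plain PyTorch; it is not part of this library, and the shapes simply follow the asserts above.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Project each (font_size, window_size) text image into a d_model vector."""

    def __init__(self, height: int = 14, width: int = 10, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(height * width, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, seq_len, height, width) -> (batch, seq_len, d_model)
        batch, seq_len, height, width = patches.shape
        return self.proj(patches.float().view(batch, seq_len, height * width))

encoder = PatchEncoder()
# `results["input_ids"]` from the call above has this shape; a dummy tensor keeps the sketch standalone.
patches = torch.zeros(3, 36, 14, 10)
token_vectors = encoder(patches)
assert token_vectors.shape == (3, 36, 128)
```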

## Write Your Own Embedding Tokenizer

```python
import numpy as np
from typing import Optional, List, Dict
from text_embeddings.base import EmbeddingTokenizer


class MyOwnTokenizer(EmbeddingTokenizer):

    def __init__(
        self,
        model_input_names: Optional[List[str]] = None,
        special_tokens: Optional[Dict[str, np.ndarray]] = None,
        max_length: Optional[int] = 2048,
    ):
        super().__init__(model_input_names, special_tokens, max_length)

    def text2embeddings(self, text: str) -> np.ndarray:
        # Map the input string to an array of shape (sequence_length, *dimensions).
        sequence_length = 10
        dimensions = (10, 10, 10)  # each token is mapped to a 3-d array
        return np.zeros((sequence_length, *dimensions))

    def create_padding_token_embedding(self, input_embeddings=None) -> np.ndarray:
        # The padding embedding must match the per-token shape returned above.
        return np.zeros((10, 10, 10))

```
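
A hypothetical usage sketch follows, assuming `EmbeddingTokenizer` subclasses share the HuggingFace-style `__call__` interface demonstrated with `VTRTokenizer` above; the exact keyword arguments and return format are assumptions here, not documented behaviour.

```python
tokenizer = MyOwnTokenizer(max_length=2048)

# Assumed to follow the same call pattern as the VTRTokenizer example above.
results = tokenizer(
    text=["Hello world!", "Hola Mundo!"],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
    return_tensors="pt",
    return_attention_mask=True,
)

# Each token should be a (10, 10, 10) array, as defined in text2embeddings.
print(results["input_ids"].shape)       # expected: (2, sequence_length, 10, 10, 10)
print(results["attention_mask"].shape)  # expected: (2, sequence_length)
```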

## Example Usage for GBST

```python
import torch.onnx  # nightly torch only
from text_embeddings.byte.charformer import GBST, ByteTokenizer

model = GBST(
    embed_size=128,
    max_block_size=4,
    downsampling_factor=2,
    score_calibration=True,
    vocab_size=259,
)

tokenizer = ByteTokenizer()
results = tokenizer(
    ["Life is like a box of chocolates.", "Coding is fun."],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
)

# Export the model
torch.onnx.export(
    model,
    torch.tensor(results["input_ids"]).long(),  # example inputs; integer ids cannot require gradients
    "gbst.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)
```
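
The exported graph can then be run outside PyTorch. Below is a minimal sketch with `onnxruntime` (assumed to be installed separately), using the input and output names declared in the export call above.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("gbst.onnx")

# Byte ids padded to a common length, e.g. results["input_ids"] from the tokenizer call above.
byte_ids = np.zeros((2, 34), dtype=np.int64)

(downsampled,) = session.run(["output"], {"input": byte_ids})
print(downsampled.shape)  # batch dimension is dynamic, per the dynamic_axes above
```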

Owner

  • Name: Chenghao Mou
  • Login: ChenghaoMou
  • Kind: user
  • Location: Ireland
  • Bio: NLP/AI

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

poetry.lock (pypi)
  • absl-py 1.1.0 develop
  • aiohttp 3.8.1 develop
  • aiosignal 1.2.0 develop
  • async-timeout 4.0.2 develop
  • cachetools 5.2.0 develop
  • datasets 1.18.4 develop
  • dill 0.3.5.1 develop
  • frozenlist 1.3.0 develop
  • fsspec 2022.5.0 develop
  • google-auth 2.9.1 develop
  • google-auth-oauthlib 0.4.6 develop
  • multidict 6.0.2 develop
  • multiprocess 0.70.13 develop
  • oauthlib 3.2.0 develop
  • pandas 1.4.3 develop
  • protobuf 3.20.1 develop
  • pyarrow 8.0.0 develop
  • pyasn1 0.4.8 develop
  • pyasn1-modules 0.2.8 develop
  • pydeprecate 0.3.2 develop
  • python-dateutil 2.8.2 develop
  • pytorch-lightning 1.6.5 develop
  • pytz 2022.1 develop
  • requests-oauthlib 1.3.1 develop
  • responses 0.18.0 develop
  • rsa 4.8 develop
  • tensorboard 2.9.0 develop
  • tensorboard-data-server 0.6.1 develop
  • tensorboard-plugin-wit 1.8.1 develop
  • torchmetrics 0.9.2 develop
  • werkzeug 2.1.2 develop
  • xxhash 3.0.0 develop
  • yarl 1.7.2 develop
  • atomicwrites 1.4.1
  • attrs 21.4.0
  • certifi 2022.6.15
  • charset-normalizer 2.1.0
  • colorama 0.4.5
  • einops 0.3.2
  • filelock 3.7.1
  • grpcio 1.38.1
  • huggingface-hub 0.8.1
  • idna 3.3
  • importlib-metadata 4.12.0
  • iniconfig 1.1.1
  • joblib 1.1.0
  • loguru 0.5.3
  • mako 1.2.1
  • markdown 3.4.1
  • markupsafe 2.1.1
  • mmh3 3.0.0
  • numpy 1.23.1
  • packaging 21.3
  • pdoc3 0.9.2
  • pillow 8.4.0
  • pluggy 1.0.0
  • py 1.11.0
  • pyparsing 3.0.9
  • pytest 6.2.5
  • pyyaml 6.0
  • regex 2022.7.9
  • requests 2.28.1
  • scikit-learn 1.1.1
  • scipy 1.8.1
  • six 1.16.0
  • threadpoolctl 3.1.0
  • tokenizers 0.12.1
  • toml 0.10.2
  • torch 1.12.0
  • tqdm 4.64.0
  • transformers 4.20.1
  • typing-extensions 4.3.0
  • urllib3 1.26.10
  • win32-setctime 1.1.0
  • zipp 3.8.1
pyproject.toml (pypi)
  • datasets ^1.9.0 develop
  • pytorch-lightning ^1.3.8 develop
  • tqdm ^4.61.2 develop
  • Pillow ^8.3.1
  • einops ^0.3.0
  • grpcio 1.38.1
  • loguru ^0.5.3
  • mmh3 ^3.0.0
  • numpy ^1.21.1
  • pdoc3 ^0.9.2
  • pytest ^6.2.4
  • python >=3.8,<3.11
  • scikit-learn ^1.1.1
  • scipy ^1.8.1
  • tokenizers ^0.12.1
  • torch ^1.12.0
  • transformers ^4.8.2