https://github.com/chenghaomou/embeddings

zero-vocab or low-vocab embeddings

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary

Keywords

embeddings nlp text-processing transformers
Last synced: 5 months ago

Repository

zero-vocab or low-vocab embeddings

Basic Info
  • Host: GitHub
  • Owner: ChenghaoMou
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.33 MB
Statistics
  • Stars: 18
  • Watchers: 3
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
embeddings nlp text-processing transformers
Created almost 5 years ago · Last pushed over 3 years ago

https://github.com/ChenghaoMou/embeddings/blob/main/

![banner](./banner.png)
[![PyPI version](https://badge.fury.io/py/text-embeddings.svg)](https://badge.fury.io/py/text-embeddings) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/112e50abd97444a4aca06f94fb7e8873)](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Grade)[![Codacy Badge](https://app.codacy.com/project/badge/Coverage/112e50abd97444a4aca06f94fb7e8873)](https://www.codacy.com/gh/ChenghaoMou/embeddings/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/embeddings&utm_campaign=Badge_Coverage)

## Features

-   [x] `VTRTokenizer` from [Robust Open-Vocabulary Translation from Visual Text Representations](https://t.co/l9E6rL8O5p?amp=1)
-   [x] `PQRNNTokenizer` from [Advancing NLP with Efficient Projection-Based Model Architectures](https://ai.googleblog.com/2020/09/advancing-nlp-with-efficient-projection.html)
-   [x] `CANINETokenizer` from [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874)
-   [x] `ByT5Tokenizer` from [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/pdf/2105.13626.pdf)
-   [x] `GBST` and `ByteTokenizer` from [Charformer: Fast Character Transformers via Gradient-based Subword Tokenization](https://arxiv.org/abs/2106.12672)
-   [x] `LTPMultiHeadAttention` from [Learned Token Pruning for Transformers](https://arxiv.org/abs/2107.00910)
-   [x] `X` and `XLoss`, a model inspired by [PonderNet](https://arxiv.org/abs/2107.05407) and [Perceiver](https://arxiv.org/abs/2103.03206), with byte embeddings.

## Examples

-   [x] [Machine Translation](examples/translation/nmt_transformer.py)
-   [x] [Text Classification](examples/classification/rnn.py)

## Installation

```bash
pip install text-embeddings --upgrade
```
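
To confirm the install, importing the classes used in the examples below should succeed. The import paths here are taken from those examples; other tokenizers listed under Features may live under different modules.

```python
# Quick post-install sanity check; import paths are taken from the usage examples below.
from text_embeddings.visual import VTRTokenizer
from text_embeddings.byte.charformer import GBST, ByteTokenizer

print(VTRTokenizer.__name__, GBST.__name__, ByteTokenizer.__name__)
```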

## Documentation

[Link](https://chenghaomou.github.io/embeddings/)

## Example Usage

```python
from text_embeddings.visual import VTRTokenizer

data = [
    "Hello world!",
    "Hola Mundo!",
    "",
]

tokenizer = VTRTokenizer(
    font_size=14,
    window_size=10,
    font="resources/NotoSans-Regular.ttf",
    max_length=36
)

results = tokenizer(
    text=data,
    text_pair=data,
    add_special_tokens=True,
    padding="longest", 
    return_tensors='pt',
    truncation="longest_first", 
    return_attention_mask=True, 
    return_special_tokens_mask=True,
    return_length=True,
    prepend_batch_axis=True,
    return_overflowing_tokens=False,
)

assert results["input_ids"].shape == (3, results["input_ids"].shape[1], 14, 10) 
assert results["attention_mask"].shape == (3, results["input_ids"].shape[1])
assert results["token_type_ids"].shape == (3, results["input_ids"].shape[1])
assert results["length"].shape == (3, )
```
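
The `input_ids` above are per-token image patches of shape `(font_size, window_size)`, so a downstream model needs a small visual encoder instead of an embedding lookup. Below is a minimal sketch of such an encoder in plain PyTorch; it is not part of this library, and the shapes simply follow the asserts above.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Project each (font_size, window_size) text image into a d_model vector."""

    def __init__(self, height: int = 14, width: int = 10, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(height * width, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, seq_len, height, width) -> (batch, seq_len, d_model)
        batch, seq_len, height, width = patches.shape
        return self.proj(patches.float().view(batch, seq_len, height * width))

encoder = PatchEncoder()
# `results["input_ids"]` from the call above has this shape; a dummy tensor keeps the sketch standalone.
patches = torch.zeros(3, 36, 14, 10)
token_vectors = encoder(patches)
assert token_vectors.shape == (3, 36, 128)
```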

## Write Your Own Embedding Tokenizer

```python
import numpy as np
from typing import Optional, List, Dict
from text_embeddings.base import EmbeddingTokenizer


class MyOwnTokenizer(EmbeddingTokenizer):

    def __init__(
        self,
        model_input_names: Optional[List[str]] = None,
        special_tokens: Optional[Dict[str, np.ndarray]] = None,
        max_length: Optional[int] = 2048,
    ):
        super().__init__(model_input_names, special_tokens, max_length)

    def text2embeddings(self, text: str) -> np.ndarray:
        # Map the input string to an array of shape (sequence_length, *dimensions).
        sequence_length = 10
        dimensions = (10, 10, 10)  # each token is mapped to a 3-d array
        return np.zeros((sequence_length, *dimensions))

    def create_padding_token_embedding(self, input_embeddings=None) -> np.ndarray:
        # The padding embedding must match the per-token shape returned above.
        return np.zeros((10, 10, 10))

```
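
A hypothetical usage sketch follows, assuming `EmbeddingTokenizer` subclasses share the HuggingFace-style `__call__` interface demonstrated with `VTRTokenizer` above; the exact keyword arguments and return format are assumptions here, not documented behaviour.

```python
tokenizer = MyOwnTokenizer(max_length=2048)

# Assumed to follow the same call pattern as the VTRTokenizer example above.
results = tokenizer(
    text=["Hello world!", "Hola Mundo!"],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
    return_tensors="pt",
    return_attention_mask=True,
)

# Each token should be a (10, 10, 10) array, as defined in text2embeddings.
print(results["input_ids"].shape)       # expected: (2, sequence_length, 10, 10, 10)
print(results["attention_mask"].shape)  # expected: (2, sequence_length)
```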

## Example Usage for GBST

```python
import torch.onnx  # nightly torch only
from text_embeddings.byte.charformer import GBST, ByteTokenizer

model = GBST(
    embed_size=128,
    max_block_size=4,
    downsampling_factor=2,
    score_calibration=True,
    vocab_size=259,
)

tokenizer = ByteTokenizer()
results = tokenizer(
    ["Life is like a box of chocolates.", "Coding is fun."],
    add_special_tokens=True,
    padding="longest",
    truncation="longest_first",
)

# Export the model
torch.onnx.export(
    model,
    torch.tensor(results["input_ids"]).long(),  # example inputs; integer ids cannot require gradients
    "gbst.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
)
```
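
The exported graph can then be run outside PyTorch. Below is a minimal sketch with `onnxruntime` (assumed to be installed separately), using the input and output names declared in the export call above.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("gbst.onnx")

# Byte ids padded to a common length, e.g. results["input_ids"] from the tokenizer call above.
byte_ids = np.zeros((2, 34), dtype=np.int64)

(downsampled,) = session.run(["output"], {"input": byte_ids})
print(downsampled.shape)  # batch dimension is dynamic, per the dynamic_axes above
```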

Owner

  • Name: Chenghao Mou
  • Login: ChenghaoMou
  • Kind: user
  • Location: Ireland
  • Bio: NLP/AI

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

poetry.lock (pypi)
  • absl-py 1.1.0 develop
  • aiohttp 3.8.1 develop
  • aiosignal 1.2.0 develop
  • async-timeout 4.0.2 develop
  • cachetools 5.2.0 develop
  • datasets 1.18.4 develop
  • dill 0.3.5.1 develop
  • frozenlist 1.3.0 develop
  • fsspec 2022.5.0 develop
  • google-auth 2.9.1 develop
  • google-auth-oauthlib 0.4.6 develop
  • multidict 6.0.2 develop
  • multiprocess 0.70.13 develop
  • oauthlib 3.2.0 develop
  • pandas 1.4.3 develop
  • protobuf 3.20.1 develop
  • pyarrow 8.0.0 develop
  • pyasn1 0.4.8 develop
  • pyasn1-modules 0.2.8 develop
  • pydeprecate 0.3.2 develop
  • python-dateutil 2.8.2 develop
  • pytorch-lightning 1.6.5 develop
  • pytz 2022.1 develop
  • requests-oauthlib 1.3.1 develop
  • responses 0.18.0 develop
  • rsa 4.8 develop
  • tensorboard 2.9.0 develop
  • tensorboard-data-server 0.6.1 develop
  • tensorboard-plugin-wit 1.8.1 develop
  • torchmetrics 0.9.2 develop
  • werkzeug 2.1.2 develop
  • xxhash 3.0.0 develop
  • yarl 1.7.2 develop
  • atomicwrites 1.4.1
  • attrs 21.4.0
  • certifi 2022.6.15
  • charset-normalizer 2.1.0
  • colorama 0.4.5
  • einops 0.3.2
  • filelock 3.7.1
  • grpcio 1.38.1
  • huggingface-hub 0.8.1
  • idna 3.3
  • importlib-metadata 4.12.0
  • iniconfig 1.1.1
  • joblib 1.1.0
  • loguru 0.5.3
  • mako 1.2.1
  • markdown 3.4.1
  • markupsafe 2.1.1
  • mmh3 3.0.0
  • numpy 1.23.1
  • packaging 21.3
  • pdoc3 0.9.2
  • pillow 8.4.0
  • pluggy 1.0.0
  • py 1.11.0
  • pyparsing 3.0.9
  • pytest 6.2.5
  • pyyaml 6.0
  • regex 2022.7.9
  • requests 2.28.1
  • scikit-learn 1.1.1
  • scipy 1.8.1
  • six 1.16.0
  • threadpoolctl 3.1.0
  • tokenizers 0.12.1
  • toml 0.10.2
  • torch 1.12.0
  • tqdm 4.64.0
  • transformers 4.20.1
  • typing-extensions 4.3.0
  • urllib3 1.26.10
  • win32-setctime 1.1.0
  • zipp 3.8.1
pyproject.toml (pypi)
  • datasets ^1.9.0 develop
  • pytorch-lightning ^1.3.8 develop
  • tqdm ^4.61.2 develop
  • Pillow ^8.3.1
  • einops ^0.3.0
  • grpcio 1.38.1
  • loguru ^0.5.3
  • mmh3 ^3.0.0
  • numpy ^1.21.1
  • pdoc3 ^0.9.2
  • pytest ^6.2.4
  • python >=3.8,<3.11
  • scikit-learn ^1.1.1
  • scipy ^1.8.1
  • tokenizers ^0.12.1
  • torch ^1.12.0
  • transformers ^4.8.2