https://github.com/google-research/retvec

RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary

Keywords

deep-learning natural-language-processing nlp python tensorflow text-classification

Keywords from Contributors

archival projection profiles embedded sequences interactive generic ecosystem-modeling modular network-simulation
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 10.9 MB
Statistics
  • Stars: 292
  • Watchers: 10
  • Forks: 23
  • Open Issues: 24
  • Releases: 0
Archived
Topics
deep-learning natural-language-processing nlp python tensorflow text-classification
Created about 4 years ago · Last pushed 11 months ago
Metadata Files
Readme Contributing License Citation

README.md

RETVec: Resilient & Efficient Text Vectorizer

NOTE (4/3/2025): This repository has been archived and is no longer actively maintained.

Overview

RETVec is a next-gen text vectorizer designed to be efficient and multilingual, and to provide built-in adversarial resilience using robust word embeddings trained with similarity learning. You can read the paper at https://arxiv.org/abs/2302.09207.

RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently. Thus, RETVec works out-of-the-box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.
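To make the "no lookup table or fixed vocabulary" idea concrete, here is a minimal sketch of one vocabulary-free way to turn text into numbers: represent each character by the bits of its Unicode code point. The 24-bit width, the 16-character word length, and the function names are assumptions chosen for illustration; this is not the RETVec implementation, only a sketch of the general approach.

```python
from typing import List

BITS_PER_CHAR = 24   # assumption: any width >= 21 covers U+0000..U+10FFFF
CHARS_PER_WORD = 16  # assumption: fixed word length, padded or truncated


def encode_char(ch: str, bits: int = BITS_PER_CHAR) -> List[int]:
    """Return the code point of `ch` as a list of bits, least significant first."""
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(bits)]


def encode_word(word: str, max_chars: int = CHARS_PER_WORD) -> List[List[int]]:
    """Encode a word as a fixed-size grid of bits, zero-padding short words."""
    rows = [encode_char(ch) for ch in word[:max_chars]]
    padding = [[0] * BITS_PER_CHAR for _ in range(max_chars - len(rows))]
    return rows + padding


if __name__ == "__main__":
    # Every word yields the same grid shape, whatever script it is written in.
    for word in ["hello", "héllo", "привет", "你好"]:
        grid = encode_word(word)
        print(word, len(grid), "x", len(grid[0]))
```

Because the encoding is computed directly from code points, no word or character ever falls "out of vocabulary", which is the property the paragraph above relies on.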

RETVec's speed and size (~200k instead of millions of parameters) also make it a great choice for on-device and web use cases. It is natively supported in TensorFlow Lite via custom ops in TensorFlow Text, and we provide a JavaScript implementation of RETVec which allows you to deploy web models via TensorFlow.js.

Please see our example colabs to get started with training your own models with RETVec. train_retvec_model_tf.ipynb is a great starting point for training a TF model using RETVec.

Demos

To see RETVec in action, visit our demos.

Getting started

Installation

You can use pip to install the latest TensorFlow version of RETVec:

    pip install retvec

RETVec has been tested on TensorFlow 2.6+ and Python 3.8+.
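A quick way to compare a local environment against the versions named above is sketched below, using only the standard library. The helper names are hypothetical, and the check assumes TensorFlow is installed under the distribution name `tensorflow`; platform-specific builds published under other names would need to be added.

```python
import re
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.15.0rc1' -> (2, 15, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> Dict[str, bool]:
    """Compare the local setup against the versions the README says were tested."""
    tf_version = installed_version("tensorflow")  # assumption: standard package name
    return {
        "python_ok": sys.version_info >= (3, 8),
        "tensorflow_ok": tf_version is not None
        and parse_version(tf_version) >= (2, 6),
        "retvec_installed": installed_version("retvec") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```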

Basic Usage

You can use RETVec as the vectorization layer in any TensorFlow model with just a single line of code. RETVec operates on raw strings with pre-processing options built-in (e.g. lowercasing text). For example:

```python
import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

NUM_CLASSES = 2  # example value; set to the number of classes in your dataset

# Define the input layer, which accepts raw strings
inputs = layers.Input(shape=(1, ), name="input", dtype=tf.string)

# Add the RETVec Tokenizer layer using the RETVec embedding model -- that's it!
x = RETVecTokenizer(sequence_length=128)(inputs)

# Create your model like normal
# e.g. a simple LSTM model for classification with NUM_CLASSES classes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
```

Then you can compile, train, and save your model as usual. As demonstrated in our paper, models trained using RETVec are more resilient against adversarial attacks and typos, and are computationally efficient. RETVec also offers support in TFJS and TF Lite, making it well suited to on-device mobile and web use cases.
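To probe the resilience claim on your own model, one option is to feed it perturbed copies of held-out strings and compare predictions. The sketch below generates a few of the character-level manipulations named in the overview (insertion, deletion, typos, homoglyphs, LEET). The function names and the tiny substitution tables are assumptions made for illustration; they are not part of the RETVec package.

```python
import random
from typing import Dict

# Small illustrative tables; real attacks draw on far larger mappings.
LEET: Dict[str, str] = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
# Latin letters mapped to visually similar Cyrillic letters.
HOMOGLYPHS: Dict[str, str] = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def insert_char(word: str, rng: random.Random) -> str:
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    return word[:pos] + rng.choice(ALPHABET) + word[pos:]


def delete_char(word: str, rng: random.Random) -> str:
    """Delete one random character; words shorter than 2 are left alone."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word))
    return word[:pos] + word[pos + 1:]


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two neighbouring characters."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word) - 1)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def substitute(word: str, table: Dict[str, str]) -> str:
    """Replace every character that has an entry in `table`."""
    return "".join(table.get(ch, ch) for ch in word)


def perturbations(word: str, seed: int = 0) -> Dict[str, str]:
    """Return one perturbed variant of `word` per manipulation type."""
    rng = random.Random(seed)
    return {
        "insertion": insert_char(word, rng),
        "deletion": delete_char(word, rng),
        "typo": swap_adjacent(word, rng),
        "homoglyph": substitute(word, HOMOGLYPHS),
        "leet": substitute(word, LEET),
    }


if __name__ == "__main__":
    for kind, variant in perturbations("password").items():
        print(f"{kind:10s} {variant}")
```

Comparing a model's accuracy on the original strings against its accuracy on these variants gives a rough, informal robustness check.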

Colabs

Detailed example colabs for RETVec can be found under notebooks. These are a good way to get started with using RETVec. You can run the notebooks in Google Colab by clicking the Google Colab button. If none of the examples are similar to your use case, please let us know!

We have the following example colabs:

  • Training RETVec-based models using TensorFlow: train_retvec_model_tf.ipynb for GPU/CPU training, and train_tpu.ipynb for a TPU-compatible training example.
  • Converting RETVec models into TF Lite models to run on-device: tf_lite_retvec.ipynb
  • (Coming soon!) Using RETVec JS to deploy RETVec models in the web using TensorFlow.js

Citing

Please cite this reference if you use RETVec in your research:

    @article{retvec2023,
        title={RETVec: Resilient and Efficient Text Vectorizer},
        author={Elie Bursztein and Marina Zhang and Owen Vallis and Xinyu Jia and Alexey Kurakin},
        year={2023},
        eprint={2302.09207}
    }

Contributing

To contribute to the project, please check out the contribution guidelines. Thank you!

Disclaimer

This is not an official Google product.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - given-names: "Elie"
    family-names: "Bursztein"
  - given-names: "Marina"
    family-names: "Zhang"
  - given-names: "Owen"
    family-names: "Vallis"
  - given-names: "Yury"
    family-names: "Kartynnik"
title: "RETVec: Resilient and Efficient Text Vectorizer"
version: 1.0.2
date-released: 2023-10-12
url: "https://github.com/google-research/retvec"
preferred-citation:
  type: article
  authors:
  - given-names: "Elie"
    family-names: "Bursztein"
  - given-names: "Marina"
    family-names: "Zhang"
  - given-names: "Owen"
    family-names: "Vallis"
  - given-names: "Xinyu"
    family-names: "Jia"
  - given-names: "Alexey"
    family-names: "Kurakin"
  doi: "10.48550/arXiv.2302.09207"
  title: "RETVec: Resilient and Efficient Text Vectorizer"
  year: 2023
  month: 10
  journal: "arXiv"
  url: "https://arxiv.org/abs/2302.09207"
  publisher:
    name: "arXiv"

GitHub Events

Total
  • Watch event: 13
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 1
  • Pull request event: 9
  • Fork event: 3
  • Create event: 7
Last Year
  • Watch event: 13
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 1
  • Pull request event: 9
  • Fork event: 3
  • Create event: 7

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 67
  • Total Committers: 5
  • Avg Commits per committer: 13.4
  • Development Distribution Score (DDS): 0.194
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| Marina | m****h@g****m | 54 |
| Elie Bursztein | g****b@e****t | 5 |
| Luca Invernizzi | i****l@g****m | 4 |
| Yury Kartynnik | k****k@g****m | 3 |
| dependabot[bot] | 4****] | 1 |
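
The all-time summary figures can be reproduced from the per-committer counts listed here. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption, although it does match the reported value.

```python
from typing import Dict

# Commit counts copied from the Top Committers list above.
COMMITS: Dict[str, int] = {
    "Marina": 54,
    "Elie Bursztein": 5,
    "Luca Invernizzi": 4,
    "Yury Kartynnik": 3,
    "dependabot[bot]": 1,
}


def average_commits(commits: Dict[str, int]) -> float:
    """Mean number of commits per committer."""
    return sum(commits.values()) / len(commits)


def development_distribution_score(commits: Dict[str, int]) -> float:
    """Assumed definition: 1 - (commits by top committer / total commits)."""
    total = sum(commits.values())
    if total == 0:
        return 0.0
    return 1 - max(commits.values()) / total


if __name__ == "__main__":
    print("Total commits:", sum(COMMITS.values()))
    print("Avg per committer:", round(average_commits(COMMITS), 1))
    print("DDS:", round(development_distribution_score(COMMITS), 3))
```

A score near 0 means one person wrote almost everything; here a single committer accounts for 54 of the 67 commits.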

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 26
  • Total pull requests: 40
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 2 months
  • Total issue authors: 9
  • Total pull request authors: 4
  • Average comments per issue: 0.35
  • Average comments per pull request: 0.33
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 22
Past Year
  • Issues: 0
  • Pull requests: 15
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 15
Top Authors
Issue Authors
  • MarinaZhang (15)
  • ebursztein (3)
  • brijeshthakur (1)
  • AfroEuro (1)
  • ny1236 (1)
  • duguwanglong (1)
  • huanan254 (1)
  • chinsu70802 (1)
  • delkind-dnsf (1)
Pull Request Authors
  • dependabot[bot] (30)
  • MarinaZhang (10)
  • invernizzi (4)
  • kartynnik (3)
Top Labels
Issue Labels
documentation (8) enhancement (8) good first issue (4) help wanted (4) bug (1)
Pull Request Labels
dependencies (30) javascript (6) enhancement (5) documentation (3) bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 141 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: retvec

Resilient and Efficient Text Vectorizer

  • Versions: 2
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 141 Last month
Rankings
Dependent packages count: 7.6%
Stargazers count: 28.2%
Forks count: 30.4%
Average: 33.9%
Dependent repos count: 69.4%
Maintainers (1)
Last synced: 5 months ago

Dependencies

.github/workflows/pages.yml actions
  • actions/checkout v3 composite
  • actions/configure-pages v3 composite
  • actions/deploy-pages v2 composite
  • actions/setup-node v3 composite
  • actions/upload-pages-artifact v2 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/tests-tensorflow.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • isort/isort-action master composite
  • psf/black stable composite
demos/js/package-lock.json npm
  • 323 dependencies
demos/js/package.json npm
  • @vitejs/plugin-vue ^4.2.3 development
  • eslint ^8.53.0 development
  • eslint-config-airbnb-base ^15.0.0 development
  • eslint-plugin-import ^2.29.0 development
  • eslint-plugin-vue ^9.18.1 development
  • sass ^1.60.0 development
  • unplugin-fonts ^1.0.3 development
  • vite ^4.2.0 development
  • vite-plugin-vuetify ^1.0.0 development
  • @mdi/font 7.0.96
  • @tensorflow/tfjs ^4.12.0
  • core-js ^3.29.0
  • lodash ^4.17.21
  • roboto-fontface *
  • vue ^3.2.0
  • vue-router ^4.0.0
  • vuetify ^3.0.0
retvecjs/package-lock.json npm
  • 382 dependencies
retvecjs/package.json npm
  • @web/dev-server ^0.1.31 development
  • @web/dev-server-legacy ^1.0.0 development
  • typescript ~4.7.4 development
  • @tensorflow/tfjs ^4.10.0
  • file-url ^4.0.0
  • path-browserify ^1.0.1
pyproject.toml pypi
setup.py pypi