uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

https://github.com/unum-cloud/uform

Keywords

bert clip clustering contrastive-learning cross-attention huggingface-transformers image-search language-vision llava multi-lingual multimodal neural-network openai openclip pretrained-models pytorch representation-learning semantic-search transformer vector-search

Keywords from Contributors

webassembly text-search similarity-search simd search-engine recommender-system nearest-neighbor-search kann fuzzy-search full-text-search

Last synced: 10 months ago · JSON representation ·

Repository

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Basic Info

Host: GitHub
Owner: unum-cloud
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://unum-cloud.github.io/uform/
Size: 678 KB

Statistics

Stars: 1,162
Watchers: 15
Forks: 72
Open Issues: 15
Releases: 37

Topics

bert clip clustering contrastive-learning cross-attention huggingface-transformers image-search language-vision llava multi-lingual multimodal neural-network openai openclip pretrained-models pytorch representation-learning semantic-search transformer vector-search

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Citation

README.md

UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation

Multimodal Embeddings from 64 to 768 Dimensions • 1B Parameter Chat
Short Texts • Images • 🔜 Video Clips • 🔜 Long Documents
ONNX • CoreML • PyTorch
Python • JavaScript • Swift

UForm Chat Preview

Welcome to UForm, a multimodal AI library that's as versatile as it is efficient. UForm tiny embedding models will help you understand and search visual and textual content across various languages. UForm small generative models, on the other hand, don't only support conversational and chat use-cases, but are great for fast image captioning and Visual Question Answering (VQA). With compact custom pre-trained transformer models, this can run anywhere from your server farm down to your smartphone.

Features

Tiny Embeddings: 64-dimensional Matryoshka-style embeddings for extremely fast search.
Throughput: Thanks to the small size, the inference speed is 2-4x faster than competitors.
Portable: Models come with native ONNX support, making them easy to deploy on any platform.
Quantization Aware: Down-cast embeddings from f32 to i8 without losing much recall.
Multilingual: Trained on a balanced dataset, the recall is great across over 20 languages.

Models

For accuracy and speed benchmarks refer to the evaluation page.

Embedding Models

Model	Parameters	Languages	Architecture
`uform3-image-text-english-large` 🆕	365 M	1	12 layer BERT, ViT-L/14
`uform3-image-text-english-base`	143 M	1	4 layer BERT, ViT-B/16
`uform3-image-text-english-small` 🆕	79 M	1	4 layer BERT, ViT-S/16
`uform3-image-text-multilingual-base`	206M	21	12 layer BERT, ViT-B/16

Generative Models

Model	Parameters	Purpose	Architecture
`uform-gen2-dpo` 🆕	1.2 B	Chat, Image Captioning, VQA	qwen1.5-0.5B, ViT-H/14
`uform-gen2-qwen-500m`	1.2 B	Chat, Image Captioning, VQA	qwen1.5-0.5B, ViT-H/14
`uform-gen` ⚠️	1.5 B	Image Captioning, VQA	llama-1.3B, ViT-B/16

Quick Start Examples

Embedding Models

First, pip install uform. Then, load the model:

```py from uform import get_model, Modality

processors, models = get_model('unum-cloud/uform3-image-text-english-small')

modeltext = models[Modality.TEXTENCODER] modelimage = models[Modality.IMAGEENCODER] processortext = processors[Modality.TEXTENCODER] processorimage = processors[Modality.IMAGEENCODER] ```

Embed images:

```py import requests from io import BytesIO from PIL import Image

imageurl = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg' image = Image.open(BytesIO(requests.get(imageurl).content)) imagedata = processorimage(image) imagefeatures, imageembedding = modelimage.encode(imagedata, return_features=True) ```

Embed queries:

py text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background' text_data = processor_text(text) text_features, text_embedding = model_text.encode(text_data, return_features=True)

For more details check out:

Python docs on embedding models in python/README.md
JavaScript docs on embedding models in javascript/README.md
Swift docs on embedding models in swift/README.md

Generative Models

The generative models are natively compatible with

```python from transformers import AutoModel, AutoProcessor

model = AutoModel.frompretrained('unum-cloud/uform-gen2-dpo', trustremotecode=True) processor = AutoProcessor.frompretrained('unum-cloud/uform-gen2-dpo', trustremotecode=True)

prompt = 'Question or Instruction' image = Image.open('image.jpg')

inputs = processor(text=[prompt], images=[image], return_tensors='pt')

with torch.inferencemode(): output = model.generate( **inputs, dosample=False, usecache=True, maxnewtokens=256, eostokenid=151645, padtokenid=processor.tokenizer.padtokenid ) promptlen = inputs['inputids'].shape[1] decodedtext = processor.batchdecode(output[:, promptlen:])[0] ```

For more details check out:

Python docs on generative models in python/README.md
JavaScript docs on generative models 🔜
Swift docs on generative models 🔜

Technical Details

Down-casting, Quantization, Matryoshka, and Slicing

Depending on the application, the embeddings can be down-casted to smaller numeric representations without losing much recall. Switching from f32 to f16 is recommended in almost all cases, unless you are running on very old hardware without half-precision support. Switching to i8 with linear scaling is also possible, but will be noticeable in the recall on larger collections with millions of searchable entries. Similarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.

```python import numpy as np

f32embedding: np.ndarray = model.encodetext(textdata, returnfeatures=False) f16embedding: np.ndarray = f32embedding.astype(np.float16) i8embedding: np.ndarray = (f32embedding * 127).astype(np.int8) b1embedding: np.ndarray = np.packbits((f32embedding > 0).astype(np.uint8)) ```

Alternative approach to quantization is to use the Matryoshka embeddings, where the embeddings are sliced into smaller parts, and the search is performed in a hierarchical manner.

```python import numpy as np

largeembedding: np.ndarray = model.encodetext(textdata, returnfeatures=False) smallembedding: np.ndarray = largeembedding[:, :256] tinyembedding: np.ndarray = largeembedding[:, :64] ```

Both approaches are natively supported by the USearch vector-search engine and the SimSIMD numerics libraries. When dealing with small collections (up to millions of entries) and looking for low-latency cosine distance calculations, you can achieve 5x-2500x performance improvement over Torch, NumPy, SciPy, and vanilla Python using SimSIMD.

```python from simsimd import cosine, hamming

distance: float = cosine(f32embedding, f32embedding) # 32x SciPy performance on Apple M2 CPU distance: float = cosine(f16embedding, f16embedding) # 79x SciPy performance on Apple M2 CPU distance: float = cosine(i8embedding, i8embedding) # 133x SciPy performance on Apple M2 CPU distance: float = hamming(b1embedding, b1embedding) # 17x SciPy performance on Apple M2 CPU ```

Similarly, when dealing with large collections (up to billions of entries per server) and looking for high-throughput search, you can achieve 100x performance improvement over FAISS and other vector-search solutions using USearch. Here are a couple of examples:

```python from usearch.index import Index

f32index = Index(ndim=64, metric='cos', dtype='f32') # for Matryoshka embeddings f16index = Index(ndim=64, metric='cos', dtype='f16') # for Matryoshka embeddings i8index = Index(ndim=256, metric='cos', dtype='i8') # for quantized embeddings b1index = Index(ndim=768, metric='hamming', dtype='b1') # for binary embeddings ```

Compact Packaging

PyTorch is a heavy dependency to carry, especially if you run on Edge or IoT devices. Using vanilla ONNX runtime, one can significantly reduce memory consumption and deployment latency.

```sh $ conda create -n uformtorch python=3.10 -y $ conda create -n uformonnx python=3.10 -y $ conda activate uformtorch && pip install -e ".[torch]" && conda deactivate $ conda activate uformonnx && pip install -e ".[onnx]" && conda deactivate $ du -sh $(conda info --envs | grep 'uform_torch' | awk '{print $2}')

5.2G ~/conda/envs/uformtorch $ du -sh $(conda info --envs | grep 'uformonnx' | awk '{print $2}') 461M ~/conda/envs/uform_onnx ```

Most of that weight can be further reduced down to 100 MB for both the model and the runtime. You can pick one of many supported ONNX execution providers, which includes XNNPACK, CUDA and TensorRT for Nvidia GPUs, OpenVINO on Intel, DirectML on Windows, ROCm on AMD, CoreML on Apple devices, and more to come.

Multimodal Chat in CLI

The generative models can be used for chat-like experiences in the command line. For that, you can use the uform-chat CLI tool, which is available in the UForm package.

```bash $ pip install uform $ uform-chat --model unum-cloud/uform-gen2-dpo --image=zebra.jpg $ uform-chat --model unum-cloud/uform-gen2-dpo \

--image="https://bit.ly/3tIVg9M" \
--device="cuda:0" \
--fp16

```

Owner

Name: Unum
Login: unum-cloud
Kind: organization
Email: info@unum.cloud
Location: Armenia

Website: https://unum.cloud
Twitter: unum_cloud
Repositories: 7
Profile: https://github.com/unum-cloud

Scaling Intelligence

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Kim"
  given-names: "Mikhail"
  orcid: "https://orcid.org/0009-0003-8413-3221"
- family-names: "Orshulevich"
  given-names: "Vladimir"
  orcid: "https://orcid.org/0009-0007-8961-6969"
- family-names: "Vardanian"
  given-names: "Ash"
  orcid: "https://orcid.org/0000-0002-4882-1815"
title: "UForm by Unum Cloud"
version: 3.1.3
keywords:
- "text-to-image retrieval"
- "multimodal"
- "visual-language pre-training"
doi: 10.5281/zenodo.7951497
date-released: 2023-01-03
url: "https://github.com/unum-cloud/uform"

GitHub Events

Total

Create event: 7
Release event: 2
Issues event: 9
Watch event: 124
Delete event: 3
Issue comment event: 15
Push event: 10
Pull request review event: 1
Pull request event: 8
Fork event: 11

Last Year

Create event: 7
Release event: 2
Issues event: 9
Watch event: 124
Delete event: 3
Issue comment event: 15
Push event: 10
Pull request review event: 1
Pull request event: 8
Fork event: 11

Committers

Last synced: about 1 year ago

All Time

Total Commits: 225
Total Committers: 19
Avg Commits per committer: 11.842
Development Distribution Score (DDS): 0.453

Past Year

Commits: 8
Committers: 5
Avg Commits per committer: 1.6
Development Distribution Score (DDS): 0.625

Top Committers

Name	Email	Commits
Ash Vardanian	1****n	123
semantic-release-bot	s**t@m**t	31
Mikhail Kim	k**v@g**m	29
Mike	m**m@u**d	9
Ishkhan Nazaryan	1****2	6
VoVoR	v**r@V**l	5
Vladimir Orshulevich	3****R	5
vov_or	v**a@l**t	3
TinySemVer	t**r@a**m	3
Gurgen Yegoryan	2****n	2
Jake Zhang	a**5@1**m	1
Kapulkin Stanislav	k**n@g**m	1
Louis Maddox	l****x	1
Niels Horn	n**s@h**a	1
Oliver Sauter	o**i@w**o	1
SebK	s**b@l**m	1
root	r**t@u**l	1
Vincent Botta	v**t@g**m	1
djacobs7	d**7@g**m	1

Committer Domains (Top 20 + Academic)

uform-train.eu-north1.internal: 1 lottoshield.com: 1 worldbrain.io: 1 horn.ninja: 1 163.com: 1 ashvardanian.com: 1 unum.cloud: 1 martynus.net: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 36
Total pull requests: 82
Average time to close issues: 14 days
Average time to close pull requests: 3 days
Total issue authors: 32
Total pull request authors: 15
Average comments per issue: 1.53
Average comments per pull request: 0.78
Merged pull requests: 74
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 10
Pull requests: 11
Average time to close issues: 11 days
Average time to close pull requests: 1 day
Issue authors: 9
Pull request authors: 4
Average comments per issue: 0.2
Average comments per pull request: 0.36
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

ashvardanian (3)
javiabellan (2)
bupianlizhugui (2)
ake020675 (1)
skull8888888 (1)
0asa (1)
orca-zhang (1)
jsbintu (1)
ppbrown (1)
beaugunderson (1)
karthikra (1)
afsaneh-ebrahimi (1)
Apro123 (1)
laclouis5 (1)
chadbrewbaker (1)

Pull Request Authors

ashvardanian (35)
kimihailv (14)
VoVoR (8)
lmmx (4)
djacobs7 (4)
wnma3mz (2)
0asa (2)
nilq (2)
sebouh (2)
ishkhan42 (2)
ake020675 (2)
kapulkin (2)
xyb (1)
blackforestboi (1)
gurgenyegoryan (1)

Top Labels

Issue Labels

good first issue (3) help wanted (2) released (1) bug (1)

Pull Request Labels

released (41)

Packages

Total packages: 2
Total downloads:
- pypi 272 last-month

Total dependent packages: 1
(may contain duplicates)
Total dependent repositories: 1
(may contain duplicates)
Total versions: 79
Total maintainers: 2

pypi.org: uform

Pocket-Sized Multimodal AI for Content Understanding and Generation

Homepage: https://github.com/unum-cloud/uform
Documentation: https://uform.readthedocs.io/
License: Apache Software License
Latest release: 3.1.3
published 10 months ago

Versions: 39
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 272 Last month
Docker Downloads: 0

Rankings

Stargazers count: 3.1%

Docker downloads count: 3.7%

Forks count: 8.4%

Downloads: 9.3%

Average: 9.4%

Dependent packages count: 10.0%

Dependent repos count: 21.8%

Maintainers (2)

unum kimihailv

Last synced: 10 months ago

swiftpackageindex.com: github.com/unum-cloud/uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Homepage: https://unum-cloud.github.io/uform/
Documentation: https://swiftpackageindex.com/unum-cloud/uform/documentation
License: apache-2.0
Latest release: v3.1.1
published over 1 year ago

Versions: 40
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 14.5%

Average: 22.3%

Dependent repos count: 30.1%

Last synced: 11 months ago

uform

Science Score: 64.0%

Keywords

Keywords from Contributors

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

UForm

Pocket-Sized Multimodal AI For Content Understanding and Generation

Features

Models

Embedding Models

Generative Models

Quick Start Examples

Embedding Models

Generative Models

Technical Details

Down-casting, Quantization, Matryoshka, and Slicing

Compact Packaging

Multimodal Chat in CLI

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: uform

Rankings

Maintainers (2)

swiftpackageindex.com: github.com/unum-cloud/uform

Rankings

Pocket-Sized Multimodal AI
For Content Understanding and Generation