https://github.com/kyegomez/gemini

The open source implementation of Gemini, the model that will "eclipse ChatGPT" by Google

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary

Keywords

ai artificial-intelligence gemini gpt4 machine-learning ml multi-modality multimodla

Keywords from Contributors

projection interactive serializer measurement cycles packaging charts network-simulation archival shellcodes

Last synced: 10 months ago

Repository

The open source implementation of Gemini, the model that will "eclipse ChatGPT" by Google

Basic Info

Host: GitHub
Owner: kyegomez
License: mit
Language: Python
Default Branch: main
Homepage: https://discord.gg/GYbXvDGevY
Size: 659 KB

Statistics

Stars: 457
Watchers: 11
Forks: 60
Open Issues: 5
Releases: 0

Topics

ai artificial-intelligence gemini gpt4 machine-learning ml multi-modality multimodla

Created almost 3 years ago · Last pushed 11 months ago

Metadata Files

Readme Funding License

Gemini

The open source implementation of Gemini, the model that will "eclipse ChatGPT". It seems to work by taking in all modalities at once into a single transformer, with special decoders for text or image generation.

Join the Agora Discord channel to help with the implementation! Here is the project board:

The input sequences for Gemini consist of text, audio, images, and videos. These inputs are transformed into tokens, which are then processed by a transformer. Conditional decoding then generates image outputs. The architecture resembles Fuyu's but is expanded to cover multiple modalities: instead of using a vision transformer (ViT) encoder, Gemini simply feeds image embeddings directly into the transformer. For Gemini, the token inputs will likely be indicated by special modality tokens such as [IMG], , [AUDIO], or
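
As a rough illustration of the idea above, the sketch below assembles one flat sequence from several modalities and prefixes each non-text segment with a marker token. Only the marker strings `[IMG]` and `[AUDIO]` come from the text; the `MODALITY_MARKERS` mapping, the `build_sequence` helper, and the placeholder token names are assumptions made for this example and are not part of `gemini_torch`.

```python
from typing import List, Tuple

# Marker strings are taken from the description above; the mapping itself
# is a hypothetical convention used only for this sketch.
MODALITY_MARKERS = {"image": "[IMG]", "audio": "[AUDIO]"}


def build_sequence(segments: List[Tuple[str, List[str]]]) -> List[str]:
    """Flatten (modality, tokens) segments into a single token sequence.

    Each non-text segment is prefixed with its modality marker so that one
    transformer can tell where a modality starts.
    """
    sequence: List[str] = []
    for modality, tokens in segments:
        marker = MODALITY_MARKERS.get(modality)
        if marker is not None:
            sequence.append(marker)
        sequence.extend(tokens)
    return sequence


segments = [
    ("text", ["describe", "this"]),
    ("image", ["patch_0", "patch_1"]),  # placeholders for image embeddings
    ("audio", ["frame_0"]),  # placeholder for audio features
]
sequence = build_sequence(segments)
print(sequence)
```

In a real model the placeholders would be embedding vectors rather than strings; the point here is only the ordering and tagging of segments.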

Install

pip3 install gemini-torch

Usage

Gemini Transformer Usage

Base transformer
Multi Grouped Query Attn / flash attn
rope
alibi
xpos
qk norm
no pos embeds
kv cache

```python
import torch

from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=50432,
    max_seq_len=4096,  # Reduced from 8192
    dim=1280,  # Reduced from 2560
    depth=16,  # Reduced from 32
    dim_head=64,  # Reduced from 128
    heads=12,  # Reduced from 24
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Text token ids, shape: [batch, seq_len]
text = torch.randint(0, 50432, (1, 4096))  # Reduced seq_len from 8192

# Apply model to text
y = model(
    text,
)

# Output shape: [batch, seq_len, dim]
print(y)
```
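
The configuration above uses 12 attention heads and 2 key/value heads. Under the usual grouped-query attention scheme, that would mean each key/value head is shared by 6 query heads. The sketch below shows that mapping in plain Python; the helper name `kv_head_for_query_head` and the contiguous grouping are assumptions for illustration, and the library's internal layout may differ.

```python
def kv_head_for_query_head(query_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head index that a given query head reads from.

    Assumes contiguous groups: the first num_heads // num_kv_heads query
    heads share key/value head 0, the next group shares head 1, and so on.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


num_heads, num_kv_heads = 12, 2
mapping = [kv_head_for_query_head(h, num_heads, num_kv_heads) for h in range(num_heads)]
print(mapping)
```

Fewer key/value heads shrink the KV cache roughly in proportion, which is the main reason to use this scheme at inference time.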

Full Multi-Modal Gemini

Processes images and audio through a series of reshapes
Ready to train for production grade usage
Hyper optimized with flash attention, qk norm, and other methods

```python
import torch

from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=10000,  # Reduced from 50432
    max_seq_len=1024,  # Reduced from 4096
    dim=320,  # Reduced from 1280
    depth=8,  # Reduced from 16
    dim_head=32,  # Reduced from 64
    heads=6,  # Reduced from 12
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
    post_fusion_norm=True,
    post_modal_transform_norm=True,
)

# Text token ids, shape: [batch, seq_len]
text = torch.randint(0, 10000, (1, 1024))  # Reduced seq_len from 4096

# Img shape: [batch, channels, height, width]
img = torch.randn(1, 3, 64, 64)  # Reduced height and width from 128

# Audio shape: [batch, audio_seq_len]
audio = torch.randn(1, 32)  # Reduced audio_seq_len from 64

# Apply model to text, img, and audio
y, _ = model(text=text, img=img, audio=audio)

# Output shape: [batch, seq_len, dim]
print(y)
print(y.shape)

# After much training
model.eval()

# Illustrative inference flow only: `tokenize`, `detokenize`, and `texts`
# are placeholders that this snippet does not define.
# text = tokenize(texts)
# logits = model(text)
# text = detokenize(logits)
```
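
To make "a series of reshapes" more concrete, here is a minimal sketch of one common way to turn an image into a token-like sequence: cut it into square patches and flatten each patch into a vector. This is not taken from the `gemini_torch` source; the `patchify` helper and the patch size of 16 are assumptions, and plain Python lists stand in for tensors.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][column]


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into non-overlapping square patches and flatten each one.

    Returns a list of patches in row-major order; every patch is a flat list
    of channels * patch_size * patch_size values.
    """
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")

    patches: List[List[float]] = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[c][y][x]
                for c in range(channels)
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 3 x 64 x 64 image, matching the example input size above
image = [[[0.0] * 64 for _ in range(64)] for _ in range(3)]
patches = patchify(image, patch_size=16)  # patch size is a hypothetical choice
print(len(patches), len(patches[0]))
```

Each flattened patch would then typically pass through a learned linear projection to the model dimension before being placed in the sequence next to the text tokens.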

LongGemini

An implementation of Gemini with Ring Attention. It does not include multi-modality processing yet.

```python
import torch
from gemini_torch import LongGemini

# Text tokens
x = torch.randint(0, 10000, (1, 1024))

# Create an instance of the LongGemini model
model = LongGemini(
    dim=512,  # Dimension of the input tensor
    depth=32,  # Number of transformer blocks
    dim_head=128,  # Dimension of the query, key, and value vectors
    long_gemini_depth=9,  # Number of long gemini transformer blocks
    heads=24,  # Number of attention heads
    qk_norm=True,  # Whether to apply layer normalization to query and key vectors
    ring_seq_size=512,  # The size of the ring sequence
)

# Apply the model to the input tensor
out = model(x)

# Print the output tensor
print(out)
```
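
For readers unfamiliar with Ring Attention, the sketch below shows only its communication pattern, not the attention math and not the `LongGemini` internals. It assumes the sequence is split into equal blocks of 512 tokens, one per device, and that key/value blocks are passed one hop around the ring at each step. The helper names `split_into_blocks` and `ring_schedule` are made up for this example, and treating the ring sequence size as the block length is an assumption.

```python
from typing import List


def split_into_blocks(seq_len: int, block_size: int) -> List[range]:
    """Split token positions 0..seq_len-1 into equal contiguous blocks."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be divisible by block_size")
    return [range(start, start + block_size) for start in range(0, seq_len, block_size)]


def ring_schedule(num_blocks: int) -> List[List[int]]:
    """Return schedule[step][device]: the key/value block a device holds at a step.

    Every device keeps its own query block and, at each step, receives the
    key/value block from its neighbour, so after num_blocks steps each query
    block has attended to every key/value block exactly once.
    """
    return [
        [(device - step) % num_blocks for device in range(num_blocks)]
        for step in range(num_blocks)
    ]


blocks = split_into_blocks(seq_len=1024, block_size=512)
schedule = ring_schedule(len(blocks))
print(len(blocks))
print(schedule)
```

Because each device only ever holds one key/value block at a time, memory per device scales with the block size rather than the full sequence length.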

Tokenizer

SentencePiece tokenizer
We're using the same tokenizer as LLaMA, with special tokens denoting the beginning and end of the multi-modality tokens.
It does not fully process images, audio, or videos yet; we need help with that.

```python
from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer

# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)
```

References

Combine reinforcement learning with a modular pretrained transformer, multi-modal capabilities, image, audio,
self-improving mechanisms like RoboCat
PPO? or MPO
get good at backtracking and exploring alternative paths
speculative decoding
Algorithm of Thoughts
RLHF
Gemini Report
Gemini Landing Page

Todo

[ ] Check out the project board for more todos
[x] Implement the img feature embedder and align imgs with text and pass into transformer: Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
[x] Implement the audio processing using USM by Google: In addition, Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the model to capture nuances that are typically lost when the audio is naively mapped to a text input (for example, see audio understanding demo on the website).
[ ] Video Processing Technique: " Video understanding is accomplished by encoding the video as a sequence of frames in the large context window. Video frames or images can be interleaved naturally with text or audio as part of the model input"
[ ] Prompting Technique: We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought. We refer the reader to appendix for a detailed breakdown of how this approach compares with only chain-of-thought prompting or only greedy sampling.
[ ] Train a 1.8B + 3.25B Model: Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B parameters respectively. Despite their size, they show exceptionally strong performance on factuality, i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and
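
The prompting technique quoted in the list above (sample k chains of thought, accept the majority answer if consensus is above a threshold, otherwise fall back to the greedy answer) can be sketched as follows. The function name `uncertainty_routed_answer`, the example answers, and the threshold value of 0.6 are hypothetical; the quoted report says the threshold is selected on a validation split, and generating the samples is left out entirely.

```python
from collections import Counter
from typing import List


def uncertainty_routed_answer(
    cot_answers: List[str], greedy_answer: str, threshold: float
) -> str:
    """Pick the majority chain-of-thought answer if consensus is high enough.

    cot_answers: final answers extracted from k sampled chains of thought.
    greedy_answer: answer from greedy decoding without chain of thought.
    threshold: minimum fraction of samples that must agree (exclusive).
    """
    if not cot_answers:
        return greedy_answer
    answer, votes = Counter(cot_answers).most_common(1)[0]
    if votes / len(cot_answers) > threshold:
        return answer
    return greedy_answer


# k = 8 samples with strong agreement: the consensus answer is used
confident = uncertainty_routed_answer(["B"] * 6 + ["A", "C"], greedy_answer="A", threshold=0.6)

# k = 8 samples with no clear majority: fall back to the greedy answer
uncertain = uncertainty_routed_answer(
    ["A", "B", "C", "D", "A", "B", "C", "D"], greedy_answer="C", threshold=0.6
)
print(confident, uncertain)
```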

Owner

Name: Eternal Reclaimer
Login: kyegomez
Kind: user
Location: Miami
Company: Automated Public Assistance Company

Website: https://www.swarms.world/
Twitter: KyeGomezB
Repositories: 331
Profile: https://github.com/kyegomez

Leader of Agora, the open source Multi-Modal AI research lab. Join our community here: https://discord.gg/hCJpnhA5aP

GitHub Events

Total

Watch event: 47
Delete event: 32
Issue comment event: 42
Pull request event: 60
Fork event: 7
Create event: 28

Last Year

Watch event: 47
Delete event: 32
Issue comment event: 42
Pull request event: 60
Fork event: 7
Create event: 28

Committers

Last synced: about 1 year ago

All Time

Total Commits: 115
Total Committers: 3
Avg Commits per committer: 38.333
Development Distribution Score (DDS): 0.409

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Kye	k**e@a**m	68
dependabot[bot]	4****]	34
Eternal Reclaimer	9****z	13

Committer Domains (Top 20 + Academic)

apacmediasolutions.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 6
Total pull requests: 179
Average time to close issues: about 2 months
Average time to close pull requests: 26 days
Total issue authors: 5
Total pull request authors: 2
Average comments per issue: 3.33
Average comments per pull request: 0.63
Merged pull requests: 57
Bot issues: 0
Bot pull requests: 177

Past Year

Issues: 0
Pull requests: 77
Average time to close issues: N/A
Average time to close pull requests: about 1 month
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.95
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 77

Top Authors

Issue Authors

corous (2)
smithgi (1)
fyang064 (1)
meysamKianian (1)
pjwang0928 (1)

Pull Request Authors

dependabot[bot] (177)
James4Ever0 (2)

Top Labels

Issue Labels

bug (3) no-issue-activity (3)

Pull Request Labels

dependencies (177) github_actions (127) python (50) no-pr-activity (34)

Dependencies

pyproject.toml pypi

python ^3.6
torch *
