gemma3n-profiling
Profiling Google Gemma 3n Model Using PyTorch Profiler
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file — found
- ✓ codemeta.json file — found
- ✓ .zenodo.json file — found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity — low similarity (11.9%) to scientific vocabulary
Repository
Profiling Google Gemma 3n Model Using PyTorch Profiler
Basic Info
- Host: GitHub
- Owner: sbnb-io
- License: MIT
- Language: Python
- Default Branch: main
- Size: 26.4 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Profiling Google Gemma 3n Model Using PyTorch Profiler
In this work, we profile the Google Gemma 3n model running on an NVIDIA GPU using PyTorch Profiler, performing image-to-text generation on a sample bee.jpg image. We share both the code and the raw profiling metrics so that anyone can reproduce these results: token generation averaged 24 milliseconds per token on our hardware, and the trace suggests there may be opportunities to further improve GPU utilization.
We visualize the profiling results using https://ui.perfetto.dev/, as shown in the animated GIF below.

Prerequisites
- Ubuntu 24.04
- NVIDIA GPU
Quick Start
```bash
apt-get update && apt-get install -y libgl1 python3-venv
```

```bash
git clone https://github.com/sbnb-io/gemma3n-profiling
cd gemma3n-profiling
python3 -m venv .
. bin/activate
pip3 install -r requirements.txt
```
To update CUDA to version 12.9 (which supports the latest NVIDIA Blackwell 50 series GPUs), run:
```bash
pip3 install --pre torch torchvision torchaudio nvidia-cuda-cupti-cu12==12.9.79 --index-url https://download.pytorch.org/whl/nightly/cu129
```
Set your Hugging Face token:
```bash
export HF_TOKEN=hf_REPLACE_ME
```
To start profiling, run:
```bash
python3 gemma3n-profiling.py
```
Viewing the Results
A `gemma3n-profiling.json` file will be generated, approximately 80 MB in size.
To visualize the trace, go to https://ui.perfetto.dev/ and select "Open trace file", pointing to your `gemma3n-profiling.json`.
- Expand the `python3 PID` row to explore the code running on the CPU.
- Expand the `python3 0 (stream 7 7)` row to examine the code running on the GPU.
Profiling Notes
- In this work, we ask the Google Gemma 3n model to describe the image bee.jpg.
- We limit generation to 10 tokens to keep the resulting trace file smaller and easier to analyze.
- The script performs two runs and skips the first as a warm-up. The first run takes around 60 seconds, but subsequent runs finish in about 0.4 seconds. If you wish to profile the warm-up run, you can adjust the `warmup` and `active` arguments of `torch.profiler.schedule`.
- For convenience, we have included the resulting `gemma3n-profiling.json` file in this repository, in case you prefer to explore the results without running the setup yourself.
- The exact package versions in our environment: transformers 4.53.1, timm 1.0.16, nvidia-cuda-cupti-cu12 12.9.79, nvidia-cuda-nvrtc-cu12 12.9.86, nvidia-cuda-runtime-cu12 12.9.79, torch 2.9.0.dev20250706+cu129.
- Profiling was done inside an Ubuntu 24.04.2 LTS virtual machine running on AI Linux (Sbnb Linux), with an NVIDIA RTX 5060 Ti 16GB Blackwell GPU.
Diving Deeper Into the Results
Total runtime measured was 483 milliseconds.
Initially, the trace shows the get_image_features function of Gemma3n (source), which then calls forward_features in MobileNetV5 (source), taking about 74 milliseconds.

Next, a series of Gemma3nTextDecoderLayer (source) calls took 142 milliseconds.

Finally, generating the 10 tokens took approximately 244 milliseconds total, which averages around 24 milliseconds per token.

Each token generation involves a cudaGraphLaunch (which launches an executable graph in a stream), followed by a cudaStreamSynchronize (which waits for the stream’s tasks to complete).
The MobileNetV5 and Gemma3nTextDecoderLayer phases accounted for around 50% of the total runtime. However, their share would shrink significantly if more tokens were generated; for instance, generating 100 tokens would reduce it to roughly 10%.
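As a sanity check, the shares above can be recomputed from the reported timings (74 ms for MobileNetV5 plus 142 ms for the decoder layers, versus 244 ms / 10 ≈ 24.4 ms per generated token). The helper below simply restates that arithmetic:

```python
def prefill_share(prefill_ms, per_token_ms, num_tokens):
    """Fraction of total runtime spent in the pre-generation phases."""
    total = prefill_ms + per_token_ms * num_tokens
    return prefill_ms / total

# Reported numbers: 74 ms (MobileNetV5) + 142 ms (decoder layers) = 216 ms,
# and 244 ms / 10 tokens = 24.4 ms per token.
print(round(prefill_share(216, 24.4, 10), 2))   # -> 0.47, i.e. "around 50%"
print(round(prefill_share(216, 24.4, 100), 2))  # -> 0.08, i.e. "roughly 10%"
```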
Open Questions
- Potential for Speedup: The GPU sits idle for around 12 milliseconds after each token is generated. This delay occurs because the CPU is busy with the next call to `prepare_inputs_for_generation`. Could this step be optimized to load the next tasks into the GPU more quickly, improving GPU utilization?
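If the ~12 ms CPU-bound gap within each ~24 ms token interval is accurate, a back-of-the-envelope bound on the potential gain follows directly (this is an estimate from the reported numbers, not a measurement):

```python
def idle_fraction(idle_ms, per_token_ms):
    """Fraction of each token interval the GPU spends waiting on the CPU."""
    return idle_ms / per_token_ms

def best_case_speedup(idle_ms, per_token_ms):
    """Ideal speedup of the generation phase if the gap were fully hidden,
    e.g. by overlapping input preparation with GPU execution."""
    return per_token_ms / (per_token_ms - idle_ms)

print(idle_fraction(12, 24))      # -> 0.5: GPU idle for ~half of each token
print(best_case_speedup(12, 24))  # -> 2.0: up to ~2x faster generation
```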

Questions or Suggestions?
We recognize this work might be missing critical pieces. We welcome any feedback! Feel free to open an issue or discussion in this repository.
Owner
- Name: sbnb-io
- Login: sbnb-io
- Kind: organization
- Repositories: 1
- Profile: https://github.com/sbnb-io
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this work, please cite it as below."
title: "Profiling Google Gemma 3n Model with PyTorch Profiler"
authors:
  - family-names: Ospan
    given-names: Abylay
date-released: 2025-07-06
url: "https://github.com/sbnb-io/gemma3n-profiling"
version: "1.0"
```
GitHub Events
Total
- Watch event: 11
- Push event: 1
- Public event: 1
- Fork event: 2
Last Year
- Watch event: 11
- Push event: 1
- Public event: 1
- Fork event: 2
Dependencies
- accelerate >=0.26.0
- ipython *
- opencv-python *
- pillow *
- timm *
- torch >=2.1
- transformers >=4.53.0