https://github.com/adithya-s-k/yologemma

Testing and evaluating the capabilities of Vision-Language models (PaliGemma) in performing computer vision tasks such as object detection and segmentation.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low similarity, 13.1%, to scientific vocabulary)

Keywords

gemma paligemma vlm
Last synced: 6 months ago

Repository

Testing and evaluating the capabilities of Vision-Language models (PaliGemma) in performing computer vision tasks such as object detection and segmentation.

Basic Info
  • Host: GitHub
  • Owner: adithya-s-k
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 11.8 MB
Statistics
  • Stars: 80
  • Watchers: 2
  • Forks: 5
  • Open Issues: 2
  • Releases: 0
Topics
gemma paligemma vlm
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • License

README.md

YoloGemma

YoloGemma is a project showcasing the capabilities of Vision-Language models in performing computer vision tasks such as object detection and segmentation. At the heart of this experiment lies PaliGemma, a state-of-the-art model that bridges the gap between Language and Vision. Through YoloGemma, we aim to explore whether Vision-Language models can match conventional methods of computer vision.

Outputs

YoloGemma generates outputs by processing images and videos to identify and segment objects within them. The results are visualized as annotated images or videos, highlighting detected objects with bounding boxes or segmentation masks.
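PaliGemma reports detections as text containing special location tokens rather than numeric boxes directly. A minimal sketch of turning such output into pixel bounding boxes, assuming the commonly documented format of four `<locXXXX>` tokens (values 0–1023, in y_min, x_min, y_max, x_max order) followed by a label — the project's actual parsing may differ:

```python
import re

def parse_paligemma_detections(text, img_w, img_h):
    """Parse '<loc0256><loc0128>...' style model output into pixel boxes.

    Assumes four <locXXXX> tokens per detection, coordinates on a 0-1023
    grid in y_min, x_min, y_max, x_max order, followed by the label.
    """
    detections = []
    pattern = re.compile(r"((?:<loc\d{4}>){4})\s*([\w ]+)")
    for coords, label in pattern.findall(text):
        y0, x0, y1, x1 = (int(v) for v in re.findall(r"\d{4}", coords))
        detections.append({
            "label": label.strip(),
            # scale the 0-1023 grid to pixel coordinates
            "box": (x0 / 1024 * img_w, y0 / 1024 * img_h,
                    x1 / 1024 * img_w, y1 / 1024 * img_h),
        })
    return detections

# One detection on a 640x480 frame
out = parse_paligemma_detections("<loc0256><loc0128><loc0768><loc0896> cat", 640, 480)
```

Drawing these boxes onto each frame with OpenCV (or similar) yields the annotated videos shown above.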

Sample detections (annotated output images in the repository): Detect Big Cat · Detect Small Cat · Detect Gojo · Detect Short Person

Installation

To get started with YoloGemma, follow these simple installation steps:

  1. Clone the repository:

     ```bash
     git clone https://github.com/your-username/YoloGemma.git
     cd YoloGemma
     ```

  2. Create an environment and install the required dependencies:

     ```bash
     conda create -n YoloGemma-venv python=3.10
     conda activate YoloGemma-venv
     pip install -e .
     ```

How to Run

Model Download

You can download the model by running the following command:

```bash
python download.py
```

This command will download and quantize the model.
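The quantization step reduces the checkpoint to 8-bit weights (note the `modelint8.pth` filename used later). A simplified, dependency-free sketch of symmetric int8 quantization — an illustration of the general technique, not the project's actual download.py, and real schemes typically quantize per channel over tensors:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (simplified sketch).

    Plain Python lists stand in for weight tensors to keep the
    example dependency-free.
    """
    # Map the largest-magnitude weight to 127; guard the all-zero case.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

q, s = quantize_int8([0.5, -1.27, 0.0])
approx = dequantize(q, s)
```

Quantization trades a small amount of accuracy for a much smaller checkpoint and faster inference on consumer GPUs.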

YoloGemma provides three main scripts to facilitate various tasks. Below are instructions on how to run each script:

Main Script for Object Detection

```bash
python main.py --prompt "Detect 4 people" --vid_path ./people.mp4 --vid_start 1 --vid_end 12 --max_new_tokens 10
```

Command Line Arguments

  • --prompt (type: str, default: "detect cat"): The prompt specifying what to detect in the video.
  • --vid_path (type: str, default: ""): The path to the input MP4 video file.
  • --vid_start (type: int, default: 0): The start time in seconds where the detection should begin.
  • --vid_end (type: int, default: 10): The end time in seconds where the detection should stop.
  • --max_new_tokens (type: int, default: 15): Maximum number of new tokens.

Additional Parameters

  • --interactive (action: store_true): Launch the application in interactive mode.
  • --top_k (type: int, default: 200): Top-k sampling for generating new tokens.
  • --temperature (type: float, default: 0.8): Sampling temperature.
  • --checkpoint_path (type: Path, default: Path("checkpoints/google/paligemma-3b-mix-224/modelint8.pth")): Path to the model checkpoint file.
  • --compile (action: store_true, default: True): Whether to compile the model.
  • --compile_prefill (action: store_true): Whether to compile the prefill for improved performance.
  • --profile (type: Path, default: None): Path to the profile.
  • --speculate_k (type: int, default: 5): Speculative execution depth.
  • --draft_checkpoint_path (type: Path, default: None): Path to the draft checkpoint.
  • --device (type: str, default: "cuda"): Device to use for running the model (e.g., "cuda" for GPU).
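Taken together, the flags above suggest an argument parser along these lines — a reconstruction from the documented names and defaults, not the project's actual main.py:

```python
import argparse
from pathlib import Path

def build_parser():
    """Argument parser reconstructed from the documented flags and defaults."""
    p = argparse.ArgumentParser(description="YoloGemma video object detection")
    p.add_argument("--prompt", type=str, default="detect cat")
    p.add_argument("--vid_path", type=str, default="")
    p.add_argument("--vid_start", type=int, default=0)
    p.add_argument("--vid_end", type=int, default=10)
    p.add_argument("--max_new_tokens", type=int, default=15)
    p.add_argument("--interactive", action="store_true")
    p.add_argument("--top_k", type=int, default=200)
    p.add_argument("--temperature", type=float, default=0.8)
    p.add_argument("--checkpoint_path", type=Path,
                   default=Path("checkpoints/google/paligemma-3b-mix-224/modelint8.pth"))
    p.add_argument("--compile", action="store_true", default=True)
    p.add_argument("--compile_prefill", action="store_true")
    p.add_argument("--profile", type=Path, default=None)
    p.add_argument("--speculate_k", type=int, default=5)
    p.add_argument("--draft_checkpoint_path", type=Path, default=None)
    p.add_argument("--device", type=str, default="cuda")
    return p

args = build_parser().parse_args(["--prompt", "Detect 4 people", "--vid_start", "1"])
```

Unspecified flags fall back to the defaults listed above, so a bare `python main.py` runs the "detect cat" prompt on the first ten seconds of the video.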

Example

```bash
python main.py --prompt "Detect 4 people" --vid_path ./people.mp4 --vid_start 1 --vid_end 12 --max_new_tokens 10
```

This command will start the detection process for the prompt "Detect 4 people" on the video located at ./people.mp4, beginning at 1 second and ending at 12 seconds into the video. It will use a maximum of 10 new tokens during processing.
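The `--vid_start`/`--vid_end` trimming amounts to simple frame-index arithmetic. A hypothetical helper sketching the idea (the repository may slice video differently):

```python
def frame_range(vid_start, vid_end, fps):
    """Convert start/end times in seconds into a half-open range of frame indices."""
    first = int(vid_start * fps)
    last = int(vid_end * fps)
    return range(first, last)

# 1 s to 12 s of a 30 fps video covers frames 30 through 359
frames = frame_range(1, 12, 30)
```

Each frame in the range is then passed to the model with the prompt, so shorter windows and lower frame rates directly reduce processing time.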

Gradio Interface (Coming Soon)

```bash
python demo.py
```

This command will launch a Gradio interface, providing an interactive web application for object detection and segmentation.

Troubleshooting

If you encounter any issues, please ensure that:

  - The video file path is correct and the file is accessible.
  - The required dependencies are installed.
  - Your system has the necessary hardware (e.g., a compatible GPU if using CUDA).

For further assistance, please refer to the project's issues page or contact the maintainers.

Acknowledgements

Special thanks to the PaliGemma team for their groundbreaking work in Vision-Language models, which serves as the foundation for this project. This project was inspired by the loopvlm repository.


YoloGemma is an exciting experimental step towards the future of vision-language model-based computer vision, blending the strengths of language models with visual understanding.

Owner

  • Name: Adithya S K
  • Login: adithya-s-k
  • Kind: user
  • Location: India
  • Company: Cognitivelab

Exploring Generative AI • Google DSC Lead'23 • Cloud & Full Stack Engineer • Drones & IoT • FOSS Contributor

GitHub Events

Total
  • Issues event: 1
  • Watch event: 9
Last Year
  • Issues event: 1
  • Watch event: 9

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 18
  • Total Committers: 1
  • Avg Commits per committer: 18.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 18
  • Committers: 1
  • Avg Commits per committer: 18.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Adithya S K a****i@g****m 18

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: about 7 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 3
  • Total pull request authors: 0
  • Average comments per issue: 0.67
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: about 7 hours
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 0.67
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JunMa11 (1)
  • Sohaib9920 (1)
  • chemicalking (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels