vision-agent

Vision agent

https://github.com/landing-ai/vision-agent

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Vision agent

Basic Info
  • Host: GitHub
  • Owner: landing-ai
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 16 MB
Statistics
  • Stars: 4,990
  • Watchers: 56
  • Forks: 561
  • Open Issues: 9
  • Releases: 0
Created about 2 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

VisionAgent _Prompt with an image/video → Get runnable vision code → Build Visual AI App in minutes_ [![](https://dcbadge.vercel.app/api/server/wPdN8RCYew?compact=true&style=flat)](https://discord.gg/wPdN8RCYew) ![ci_status](https://github.com/landing-ai/vision-agent/actions/workflows/ci_cd.yml/badge.svg) [![PyPI version](https://badge.fury.io/py/vision-agent.svg)](https://badge.fury.io/py/vision-agent) ![version](https://img.shields.io/pypi/pyversions/vision-agent)

Discord · Architecture · YouTube


VisionAgent is the Visual AI pilot from LandingAI. Give it a prompt and an image, and it automatically picks the right vision models and outputs ready‑to‑run code—letting you build vision‑enabled apps in minutes. You can play around with VisionAgent using our local webapp in examples/chat by following the directions in the README.md:

https://github.com/user-attachments/assets/752632b3-dda5-44f1-b27e-5cb4c97757ac

Steps to Set Up the Library

Get Your VisionAgent API Key

The most important step is to create an account and obtain your API key.

Other Prerequisites

Why do I need Anthropic and Google API Keys?

VisionAgent uses models from Anthropic and Google to respond to prompts and generate code.

When you run VisionAgent, the app uses your API keys to access the Anthropic and Google models. This keeps your projects from being constrained by LandingAI's own rate limits and prevents many users from overloading them.

Anthropic and Google each have their own rate limits and paid tiers. Refer to their documentation and pricing to learn more.

NOTE: In VisionAgent v1.0.2 and earlier, VisionAgent was powered by Anthropic Claude-3.5 and OpenAI o1. If you're using one of these VisionAgent versions, you must get an OpenAI API key and set it as an environment variable.

Get an Anthropic API Key

  1. If you don’t have one yet, create an Anthropic Console account.
  2. In the Anthropic Console, go to the API Keys page.
  3. Generate an API key.

Get a Google API Key

  1. If you don’t have one yet, create a Google AI Studio account.
  2. In Google AI Studio, go to the Get API Key page.
  3. Generate an API key.

Installation

Install with uv:

```bash
uv add vision-agent
```

Install with pip:

```bash
pip install vision-agent
```

Quickstart: Prompt VisionAgent

Follow this quickstart to learn how to prompt VisionAgent. After learning the basics, customize your prompt and workflow to meet your needs.

  1. Get your Anthropic, Google, and VisionAgent API keys.
  2. Set the Anthropic, Google, and VisionAgent API keys as environment variables.
  3. Install VisionAgent.
  4. Create a folder called quickstart.
  5. Find an image you want to analyze and save it to the quickstart folder.
  6. Copy the Sample Script to a file called source.py. Save the file to the quickstart folder.
  7. Run source.py.
  8. VisionAgent creates a file called generated_code.py and saves the generated code there.

Set API Keys as Environment Variables

Before running VisionAgent code, you must set the Anthropic, Google, and VisionAgent API keys as environment variables. Each operating system offers different ways to do this.

Here is the code for setting the variables:

```bash
export VISION_AGENT_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"
export GOOGLE_API_KEY="your-api-key"
```
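Before launching VisionAgent, you can confirm from Python that all three variables are visible to the process. This is a small hedged sketch, not part of the VisionAgent API; `missing_api_keys` is a hypothetical helper:

```python
import os

# Hypothetical helper (not a VisionAgent API): report which required keys are unset
REQUIRED_KEYS = ["VISION_AGENT_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

def missing_api_keys(env=None):
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

# Example with an incomplete environment: two keys are reported missing
print(missing_api_keys({"VISION_AGENT_API_KEY": "key"}))
```

Calling `missing_api_keys()` with no argument checks the real environment, so you can fail fast with a clear message instead of a mid-run authentication error.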

Sample Script: Prompt VisionAgent

To use VisionAgent to generate code, use the following script as a starting point:

```python
# Import the classes you need from the VisionAgent package
from vision_agent.agent import VisionAgentCoderV2
from vision_agent.models import AgentMessage

# Enable verbose output
agent = VisionAgentCoderV2(verbose=True)

# Add your prompt (content) and image file (media)
code_context = agent.generate_code(
    [
        AgentMessage(
            role="user",
            content="Describe the image",
            media=["friends.jpg"]
        )
    ]
)

# Write the output to a file
with open("generated_code.py", "w") as f:
    f.write(code_context.code + "\n" + code_context.test)
```

What to Expect When You Prompt VisionAgent

When you submit a prompt, VisionAgent performs the following tasks.

  1. Generates a plan for the code generation task. If verbose output is on, the numbered steps for this plan display.
  2. Generates code and a test case based on the plan.
  3. Tests the generated code with the test case. If the test case fails, VisionAgent iterates on the code generation process until the test case passes.
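The plan → generate → test loop above can be sketched roughly as follows. This is a hedged illustration of the control flow, not VisionAgent's internals; `make_plan`, `write_code`, and `run_tests` are hypothetical stand-ins (a real planner and coder would call an LLM):

```python
def make_plan(prompt, feedback=None):
    # Stand-in planner: a real one would ask an LLM for numbered steps
    note = f" ({feedback})" if feedback else ""
    return [f"step for: {prompt}{note}"]

def write_code(plan):
    # Stand-in coder: a real one would generate code and a test from the plan
    return "result = 42", "assert result == 42"

def run_tests(code, test):
    # Execute the generated code together with its test case
    ns = {}
    try:
        exec(code + "\n" + test, ns)
        return True
    except AssertionError:
        return False

def generate_with_retries(prompt, max_attempts=3):
    # Iterate until the generated code passes its own test, as described above
    plan = make_plan(prompt)
    for _ in range(max_attempts):
        code, test = write_code(plan)
        if run_tests(code, test):
            return code
        plan = make_plan(prompt, feedback="test failed")
    raise RuntimeError("no passing code within retry budget")

print(generate_with_retries("Describe the image"))
```

The key point is the feedback edge: a failing test case re-enters the planning step rather than aborting, which is why verbose runs can show several numbered plans.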

Example: Count Cans in an Image

Check out how to use VisionAgent in this Jupyter Notebook to learn how to count the number of cans in an image:

Count Cans in an Image

Use Specific Tools from VisionAgent

The VisionAgent library includes a set of tools, which are standalone models or functions that complete specific tasks. When you prompt VisionAgent, VisionAgent selects one or more of these tools to complete the tasks outlined in your prompt.

For example, if you prompt VisionAgent to “count the number of dogs in an image”, VisionAgent might use the florence2_object_detection tool to detect all the dogs, and then the countgd_object_detection tool to count the number of detected dogs.

After installing the VisionAgent library, you can also use the tools in your own scripts. For example, if you’re writing a script to track objects in videos, you can call the owlv2_sam2_video_tracking function. In other words, you can use the VisionAgent tools outside of simply prompting VisionAgent.

The tools are in the vision_agent.tools API.

Sample Script: Use Specific Tools for Images

You can call the countgd_object_detection function to count the number of objects in an image.

To do this, you could run this script:

```python
# Import the VisionAgent Tools library; import Matplotlib to visualize the results
import vision_agent.tools as T
import matplotlib.pyplot as plt

# Load the image
image = T.load_image("people.png")

# Call the function to count objects in an image, and specify that you want to count people
dets = T.countgd_object_detection("person", image)

# Visualize the countgd bounding boxes on the image
viz = T.overlay_bounding_boxes(image, dets)

# Save the visualization to a file
T.save_image(viz, "people_detected.png")

# Display the visualization
plt.imshow(viz)
plt.show()
```

Sample Script: Use Specific Tools for Videos

You can call the countgd_sam2_video_tracking function to track people in a video and pair it with the extract_frames_and_timestamps function to return the frames and timestamps in which those people appear.

To do this, you could run this script:

```python
# Import the VisionAgent Tools library
import vision_agent.tools as T

# Call the function to get the frames and timestamps
frames_and_ts = T.extract_frames_and_timestamps("people.mp4")

# Extract the frames from the frames_and_ts list
frames = [f["frame"] for f in frames_and_ts]

# Call the function to track objects, and specify that you want to track people
tracks = T.countgd_sam2_video_tracking("person", frames)

# Visualize the countgd tracking results on the frames and save the video
viz = T.overlay_segmentation_masks(frames, tracks)
T.save_video(viz, "people_detected.mp4")
```

Use Other LLM Providers

VisionAgent uses Anthropic Claude 3.7 Sonnet and Gemini Flash 2.0 Experimental (gemini-2.0-flash-exp) to respond to prompts and generate code. We’ve found that these provide the best performance for VisionAgent and are available on the free tiers (with rate limits) from their providers.

If you prefer to use only one of these models or a different set of models, you can change the selected LLM provider in this file: vision_agent/configs/config.py. You must also add the provider’s API Key as an environment variable.

For example, if you want to use only the Anthropic model, run this command:

```bash
cp vision_agent/configs/anthropic_config.py vision_agent/configs/config.py
```

Or, you can manually enter the model details in the config.py file. For example, if you want to change the planner model from Anthropic to OpenAI, you would replace this code:

```python
planner: Type[LMM] = Field(default=AnthropicLMM)
planner_kwargs: dict = Field(
    default_factory=lambda: {
        "model_name": "claude-3-7-sonnet-20250219",
        "temperature": 0.0,
        "image_size": 768,
    }
)
```

with this code:

```python
planner: Type[LMM] = Field(default=OpenAILMM)
planner_kwargs: dict = Field(
    default_factory=lambda: {
        "model_name": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        "image_size": 768,
        "image_detail": "low",
    }
)
```
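As noted above, switching providers also requires that provider's API key as an environment variable. For OpenAI, the standard OpenAI SDK convention is `OPENAI_API_KEY`; the exact variable name is an assumption here, since this README doesn't spell it out for the v1.x config:

```shell
# Assumed variable name: the OpenAI SDK reads OPENAI_API_KEY by default
export OPENAI_API_KEY="your-api-key"
```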

Resources

  • Discord: Check out our community of VisionAgent users to share use cases and learn about updates.
  • VisionAgent Library Docs: Learn how to use this library.
  • Video Tutorials: Watch the latest video tutorials to see how VisionAgent is used in a variety of use cases.

Owner

  • Name: Landing AI
  • Login: landing-ai
  • Kind: organization
  • Location: United States of America

Landing AI’s cutting-edge software platform makes computer vision easy for a wide range of applications across all industries

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Laird"
  given-names: "Dillon"
- family-names: "Jagadeesan"
  given-names: "Shankar"
- family-names: "Cao"
  given-names: "Yazhou"
- family-names: "Ng"
  given-names: "Andrew"
title: "Vision Agent"
version: 0.2
date-released: 2024-02-12
url: "https://github.com/landing-ai/vision-agent"

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 35
  • Total pull requests: 80
  • Average time to close issues: 25 days
  • Average time to close pull requests: 19 days
  • Total issue authors: 32
  • Total pull request authors: 19
  • Average comments per issue: 1.17
  • Average comments per pull request: 0.25
  • Merged pull requests: 48
  • Bot issues: 0
  • Bot pull requests: 9
Past Year
  • Issues: 35
  • Pull requests: 79
  • Average time to close issues: 25 days
  • Average time to close pull requests: 15 days
  • Issue authors: 32
  • Pull request authors: 19
  • Average comments per issue: 1.17
  • Average comments per pull request: 0.24
  • Merged pull requests: 48
  • Bot issues: 0
  • Bot pull requests: 8
Top Authors
Issue Authors
  • claude89757 (2)
  • rishabh-akridata (2)
  • dillonalaird (2)
  • hmppt (2)
  • DejaYang (2)
  • moutasemalakkad (2)
  • lxyzler (1)
  • eshasadia (1)
  • FifthFractal (1)
  • sahulsumra (1)
  • Abubakar17 (1)
  • ws4349893 (1)
  • dependabot[bot] (1)
  • yungyuantseng (1)
  • Eric-Canas (1)
Pull Request Authors
  • dillonalaird (143)
  • humpydonkey (45)
  • wuyiqunLu (24)
  • dependabot[bot] (18)
  • yzld2002 (17)
  • shankar-vision-eng (13)
  • cmaloney111 (13)
  • AsiaCao (11)
  • CamiloInx (9)
  • shankar-landing-ai (9)
  • camiloaz (9)
  • Dayof (7)
  • MingruiZhang (7)
  • hugohonda (5)
  • hrnn (5)
Top Labels
Issue Labels
dependencies (1)
Pull Request Labels
dependencies (17) javascript (6) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 4,696 last-month
  • Total dependent packages: 2
  • Total dependent repositories: 0
  • Total versions: 321
  • Total maintainers: 1
pypi.org: vision-agent

Toolset for Vision Agent

  • Versions: 321
  • Dependent Packages: 2
  • Dependent Repositories: 0
  • Downloads: 4,696 Last month
Rankings
Dependent packages count: 9.8%
Average: 37.2%
Dependent repos count: 64.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci_cd.yml actions
  • abatilo/actions-poetry v2.1.0 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
poetry.lock pypi
  • 103 dependencies
pyproject.toml pypi
  • autoflake 1.* develop
  • black 23.* develop
  • data-science-types ^0.2.23 develop
  • flake8 5.* develop
  • isort 5.* develop
  • mkdocs ^1.5.3 develop
  • mkdocs-material ^9.4.2 develop
  • mkdocstrings ^0.23.0 develop
  • mypy <1.8.0 develop
  • pytest 7.* develop
  • responses ^0.23.1 develop
  • setuptools ^68.0.0 develop
  • types-pillow ^9.5.0.4 develop
  • types-requests ^2.31.0.0 develop
  • types-tqdm ^4.65.0.1 develop
  • faiss-cpu 1.*
  • numpy >=1.21.0,<2.0.0
  • openai 1.*
  • pandas 2.*
  • pillow 10.*
  • python >=3.10,<3.12
  • requests 2.*
  • sentence-transformers 2.*
  • torch 2.1.*
  • tqdm >=4.64.0,<5.0.0
  • typing_extensions 4.*