Skip to content

Bounding boxes

This example demonstrates how to use the Gemini API to detect an object (a cat) in an image and retrieve its bounding box coordinates.

Import necessary libraries. Make sure Pillow is installed!

from google import genai
import requests
from PIL import Image
from io import BytesIO

Initialize the Gemini client with your API key

client = genai.Client(api_key="YOUR_API_KEY")

Specify the prompt, asking for a bounding box around the cat

prompt = (
    "Return a bounding box for the cat in this image "
    "in [ymin, xmin, ymax, xmax] format."
)

Download the cat image from cataas.com

image_url = "https://cataas.com/cat"
response = requests.get(image_url)
cat_image = Image.open(BytesIO(response.content))

Call the Gemini API to generate content with the image and prompt

response = client.models.generate_content(
    model="gemini-1.5-pro", contents=[cat_image, prompt]
)

Print the response text, which will contain the bounding box coordinates

print(response.text)

Normalize Coordinates The model returns bounding box coordinates in the format [y_min, x_min, y_max, x_max]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps: 1. Divide each output coordinate by 1000. 2. Multiply the x-coordinates by the original image width. 3. Multiply the y-coordinates by the original image height.

Example Calculation (assuming the model returns [200, 300, 700, 800] and the image is 1000x800):

y_min = (200 / 1000) * 800  # 160
x_min = (300 / 1000) * 1000  # 300
y_max = (700 / 1000) * 800  # 560
x_max = (800 / 1000) * 1000  # 800

Running the Example

First, install the Google Generative AI library, requests, and Pillow

$ pip install google-genai Pillow requests

Then run the program with Python

$ python object-detection.py
[0.1, 0.2, 0.7, 0.8]

Further Information