Perception sources

Perception is how the agent sees the world before each think step. Without a perception source, the agent reads only its task description and tool results. With one, it receives a fresh screenshot or accessibility-tree snapshot on every loop iteration.

Do you need perception?

Yes — if the task requires reading what's currently on screen before deciding what to do:

  • Desktop automation (click buttons, read UI text, fill forms in a native app)
  • Web scraping or form filling (read page state, navigate based on what's visible)
  • Monitoring dashboards (watch for a value to change)

No — if the agent works purely through tool results:

  • Read files, run commands, call APIs → tools alone are sufficient
  • The agent receives tool outputs as text; no visual input needed

DesktopScreen — screenshot the monitor

Captures the primary monitor before each think step and sends the image to the LLM. Runs in a thread pool so it never blocks the event loop.

from gantrygraph.perception import DesktopScreen

# Full native resolution — highest quality, highest token cost
screen = DesktopScreen()

# Cap resolution to reduce vision token cost
screen = DesktopScreen(max_resolution=(1280, 720))

No extra dependencies — mss is bundled with the core install.

Reduce cost with vision_mode

vision_mode="low" caps the image at 1280×720 regardless of max_resolution. Use it when the task doesn't require pixel-perfect reads — UI labels, button text, and form fields are still readable.

Mode              Resolution cap     Token cost    Use when
"high" (default)  Native resolution  Higher        Reading fine UI details, small text
"low"             1280 × 720         ~4× cheaper   Navigating apps, clicking buttons

screen = DesktopScreen(max_resolution=(1920, 1080), vision_mode="low")
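
To make the "~4× cheaper" figure concrete: a 2560×1440 capture has four times as many pixels as a 1280×720 one. The sketch below shows one way a resolution cap could be applied while preserving aspect ratio. `fit_within` is a hypothetical helper written for this illustration, not part of GantryGraph, and the library's actual scaling rule may differ. Token cost also depends on the model, so pixel count is only a rough proxy.

```python
def fit_within(size: tuple[int, int], cap: tuple[int, int] = (1280, 720)) -> tuple[int, int]:
    """Scale (width, height) down to fit inside cap, preserving aspect ratio.

    Hypothetical helper for illustration only; it never upscales.
    """
    width, height = size
    cap_w, cap_h = cap
    # Use the tighter of the two axis ratios; 1.0 means "already small enough".
    scale = min(cap_w / width, cap_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


native = (2560, 1440)
capped = fit_within(native)
pixel_ratio = (native[0] * native[1]) / (capped[0] * capped[1])
print(capped, pixel_ratio)  # (1280, 720) 4.0
```

A display that is already at or below the cap passes through unchanged, so on small screens the two modes would produce the same image under this scheme.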

Screenshot diffing

The engine automatically skips sending an image if the screen hasn't changed since the previous step. This saves vision tokens during steps where the agent is waiting for a tool result or processing text.
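
The idea behind diffing can be sketched with a content hash: keep the digest of the last frame and only forward a new frame when its digest differs. This is an illustration of the technique, not GantryGraph's actual implementation; `FrameDeduplicator` is a hypothetical name, and a real engine may use a tolerance-based comparison rather than an exact byte match.

```python
import hashlib
from typing import Optional


class FrameDeduplicator:
    """Remembers the previous frame's digest and reports whether a new frame differs.

    Hypothetical helper for illustration; not part of GantryGraph.
    """

    def __init__(self) -> None:
        self._last_digest: Optional[bytes] = None

    def should_send(self, frame: bytes) -> bool:
        # Exact comparison: any changed byte counts as a new frame.
        digest = hashlib.sha256(frame).digest()
        changed = digest != self._last_digest
        self._last_digest = digest
        return changed


dedup = FrameDeduplicator()
frames = [b"frame-A", b"frame-A", b"frame-B"]
decisions = [dedup.should_send(frame) for frame in frames]
print(decisions)  # [True, False, True]
```

With exact hashing, a blinking cursor or a ticking clock is enough to count as a change, which is why a production implementation might compare downscaled frames or allow a small pixel difference.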

DesktopAXTree — read any macOS app without screenshots

Instead of capturing a screenshot and spending vision tokens, DesktopAXTree reads the native macOS Accessibility API (AXUIElement) and returns the full UI hierarchy as structured text. The LLM sees every button, text field, and label without consuming a single image token.

pip install 'gantrygraph[desktop-ax]'
# Grant: System Settings → Privacy & Security → Accessibility

from gantrygraph.perception import DesktopAXTree

# Target a specific app
perception = DesktopAXTree(app_name="Obsidian")

# Or whichever app is currently focused
perception = DesktopAXTree()

# AX tree + screenshot together
perception = DesktopAXTree(app_name="Obsidian", include_screenshot=True)

What the LLM receives instead of a screenshot:

AXApplication 'Obsidian'
  AXWindow 'My Vault — Obsidian'
    AXTextArea 'Q2 Goals\n- Ship v1\n- Write docs' (focused, editable)
    AXButton 'New note' (enabled)
    AXButton 'Search' (enabled)

                  DesktopScreen          DesktopAXTree
Platform          macOS, Linux, Windows  macOS only
Token cost        ~2 000 / step          ~200 / step
Works off-screen  No                     Yes

WebPage — screenshot a browser page

Renders a URL via Playwright, captures a screenshot, and extracts the page's accessibility tree. Both are sent to the LLM before each think step. Requires pip install 'gantrygraph[browser]'.

from gantrygraph.perception import WebPage

page = WebPage(url="https://myapp.example.com", headless=True)

# Lower token cost — downscales screenshots to 1280×720
page = WebPage(url="https://myapp.example.com", vision_mode="low")

Share the browser with BrowserTools

Pass the same WebPage instance to both perception= and BrowserTools so they operate on the same Playwright Page object — no double browser launch.

from gantrygraph import GantryEngine
from gantrygraph.perception import WebPage
from gantrygraph.actions import BrowserTools

web = WebPage(url="https://app.example.com", headless=True)

agent = GantryEngine(
    llm=...,
    perception=web,
    tools=[BrowserTools(web_page=web)],
)

Without this, BrowserTools would launch a second browser instance pointing at a different page — the agent would act on one page but perceive another.

MultiPerception — combine sources

Use MultiPerception when your agent needs to observe several sources at once, for example controlling the desktop while monitoring a web dashboard:

from gantrygraph import GantryEngine, MultiPerception
from gantrygraph.perception import DesktopScreen, WebPage

agent = GantryEngine(
    llm=...,
    perception=MultiPerception([
        DesktopScreen(),
        WebPage(url="https://dashboard.internal"),
    ]),
)

The first source's screenshot is used as the primary image. Accessibility trees from all sources are concatenated with source labels.
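
The merge rule described above can be sketched as follows. `Observation` and `merge` are hypothetical stand-ins written for this illustration (they are not `PerceptionResult` or any GantryGraph API), and the `[source]` label format is an assumption; the library's real labels may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical stand-in for one source's output (not PerceptionResult)."""
    source: str
    screenshot_b64: Optional[str]
    accessibility_tree: Optional[str]


def merge(observations: list[Observation]) -> Observation:
    # Primary image: the first source's screenshot, following the rule above.
    primary = observations[0].screenshot_b64 if observations else None
    # Concatenate every non-empty tree under a label naming its source.
    sections = [
        f"[{obs.source}]\n{obs.accessibility_tree}"
        for obs in observations
        if obs.accessibility_tree
    ]
    return Observation("merged", primary, "\n\n".join(sections) or None)


merged = merge([
    Observation("DesktopScreen", "aGVsbG8=", None),
    Observation("WebPage", "d29ybGQ=", "button 'Refresh'"),
])
print(merged.accessibility_tree)
```

Under this rule, source order matters: the source whose image the model most needs to see should be listed first.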

Write a custom perception source

Subclass BasePerception and implement observe(). Return a PerceptionResult.

import asyncio
from gantrygraph import BasePerception
from gantrygraph.core.events import PerceptionResult

class SystemMetricsPerception(BasePerception):
    """Feed CPU and memory stats to the agent instead of a screenshot."""

    async def observe(self) -> PerceptionResult:
        import psutil  # imported lazily, so psutil is only required when this source is used

        # psutil calls block, so sample them in the default thread pool executor.
        stats = await asyncio.get_running_loop().run_in_executor(
            None,
            lambda: {
                # A short sampling interval avoids the meaningless 0.0 that the
                # first non-blocking cpu_percent() call returns.
                "cpu": psutil.cpu_percent(interval=0.1),
                "mem": psutil.virtual_memory().percent,
            },
        )
        return PerceptionResult(
            screenshot_b64=None,  # no image: text-only perception
            accessibility_tree=f"CPU: {stats['cpu']}%\nMEM: {stats['mem']}%",
            url=None,
            width=0,
            height=0,
            metadata=stats,
        )

    async def close(self) -> None:
        pass  # nothing to clean up

Use it exactly like a built-in source:

agent = GantryEngine(
    llm=...,
    perception=SystemMetricsPerception(),
    tools=[...],
)
agent.run("Alert me if CPU stays above 90% for more than 30 seconds.")

PerceptionResult.screenshot_b64 can be None — the engine skips the image and only sends the accessibility_tree text to the LLM. This pattern works for any text-based sensor: log tails, metric APIs, file watchers, database cursors.
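
As a sketch of the log-tail case, the helper below produces the kind of text a custom source could return as its accessibility_tree. `tail_text` is a hypothetical function written for this illustration and uses only the standard library. File reads block, so inside an async observe() it would be run in an executor, as in the metrics example above.

```python
from collections import deque
from pathlib import Path


def tail_text(path: Path, max_lines: int = 20) -> str:
    """Return the last max_lines lines of a text file as a single string.

    Hypothetical helper for illustration; not part of GantryGraph.
    """
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the newest lines while streaming the file.
        return "".join(deque(handle, maxlen=max_lines)).rstrip("\n")
```

Bounding the number of lines keeps the per-step token cost predictable no matter how large the log grows.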


See also: Desktop agent guide · Browser agent guide · API reference