Perception sources

Perception is how the agent sees the world before each think step. Without a perception source the agent only reads its task description and tool results. With one, it receives a fresh screenshot or DOM snapshot every loop iteration.

Do you need perception?

Yes — if the task requires reading what's currently on screen before deciding what to do:

Desktop automation (click buttons, read UI text, fill forms in a native app)
Web scraping or form filling (read page state, navigate based on what's visible)
Monitoring dashboards (watch for a value to change)

No — if the agent works purely through tool results:

Read files, run commands, call APIs → tools alone are sufficient
The agent receives tool outputs as text; no visual input needed

`DesktopScreen` — screenshot the monitor

Captures the primary monitor before each think step and sends the image to the LLM. Runs in a thread pool so it never blocks the event loop.

from gantrygraph.perception import DesktopScreen

# Full native resolution — highest quality, highest token cost
screen = DesktopScreen()

# Cap resolution to reduce vision token cost
screen = DesktopScreen(max_resolution=(1280, 720))

No extra dependencies — mss is bundled with the core install.

Reduce cost with `vision_mode`

vision_mode="low" caps the image at 1280×720 regardless of max_resolution. Use it when the task doesn't require pixel-perfect reads — UI labels, button text, and form fields are still readable.

| Mode | Resolution cap | Token cost | Use when |
|---|---|---|---|
| `"high"` (default) | Native resolution | Higher | Reading fine UI details, small text |
| `"low"` | 1280 × 720 | ~4× cheaper | Navigating apps, clicking buttons |

screen = DesktopScreen(max_resolution=(1920, 1080), vision_mode="low")
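To make the resolution cap concrete, here is a minimal sketch of how a capture could be scaled to fit inside a cap. It assumes the aspect ratio is preserved and smaller captures are never upscaled; the library's actual resizing behavior is not documented here, and `fit_within` and `LOW_CAP` are hypothetical names used only for illustration.

```python
def fit_within(size: tuple[int, int], cap: tuple[int, int]) -> tuple[int, int]:
    """Scale (width, height) down to fit inside cap, keeping the aspect ratio."""
    width, height = size
    cap_w, cap_h = cap
    # The 1.0 term means a capture that already fits is left untouched.
    scale = min(cap_w / width, cap_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


LOW_CAP = (1280, 720)  # the cap the text gives for vision_mode="low"

print(fit_within((2560, 1440), LOW_CAP))  # 16:9 monitor, halved on both axes
print(fit_within((1920, 1200), LOW_CAP))  # 16:10 monitor, limited by height
print(fit_within((800, 600), LOW_CAP))    # already within the cap, unchanged
```

Under these assumptions, a 2560 × 1440 capture shrinks to a quarter of its pixel count, which is the kind of reduction that drives the lower vision token cost.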

Screenshot diffing

The engine automatically skips sending an image if the screen hasn't changed since the previous step. This saves vision tokens during steps where the agent is waiting for a tool result or processing text.
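The idea behind diffing can be sketched with a content hash. This is an illustration only: it treats two frames as identical when their bytes match exactly, whereas the engine may use a different or more tolerant comparison. `FrameDeduper` is a hypothetical helper, not part of gantrygraph.

```python
import hashlib
from typing import Optional


class FrameDeduper:
    """Remembers the previous frame's digest to detect an unchanged screen."""

    def __init__(self) -> None:
        self._last_digest: Optional[str] = None

    def should_send(self, image_bytes: bytes) -> bool:
        """Return True when the frame differs from the previous one."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        changed = digest != self._last_digest
        self._last_digest = digest
        return changed


deduper = FrameDeduper()
for frame in (b"frame-A", b"frame-A", b"frame-B"):
    action = "send" if deduper.should_send(frame) else "skip"
    print(action)
```

Exact hashing is cheap but sensitive to any pixel change, such as a blinking cursor or a clock, so a real implementation might compare downscaled frames or allow a small tolerance.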

`DesktopAXTree` — read any macOS app without screenshots

Instead of capturing a screenshot and spending vision tokens, DesktopAXTree reads the native macOS Accessibility API (AXUIElement) and returns the full UI hierarchy as structured text. The LLM sees every button, text field, and label without consuming a single image token.

pip install 'gantrygraph[desktop-ax]'
# Grant: System Settings → Privacy & Security → Accessibility

from gantrygraph.perception import DesktopAXTree

# Target a specific app
perception = DesktopAXTree(app_name="Obsidian")

# Or whichever app is currently focused
perception = DesktopAXTree()

# AX tree + screenshot together
perception = DesktopAXTree(app_name="Obsidian", include_screenshot=True)

What the LLM receives instead of a screenshot:

AXApplication 'Obsidian'
  AXWindow 'My Vault — Obsidian'
    AXTextArea 'Q2 Goals\n- Ship v1\n- Write docs' (focused, editable)
    AXButton 'New note' (enabled)
    AXButton 'Search' (enabled)

| | `DesktopScreen` | `DesktopAXTree` |
|---|---|---|
| Platform | macOS, Linux, Windows | macOS only |
| Token cost | ~2 000 / step | ~200 / step |
| Works off-screen | No | Yes |
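Because `DesktopAXTree` is macOS-only, cross-platform code may want to pick a source at startup. The sketch below is one possible approach, not a library feature: `prefers_ax_tree` and `make_desktop_perception` are hypothetical helpers, and the constructor arguments are limited to those shown on this page.

```python
import sys
from typing import Optional


def prefers_ax_tree(platform: str = sys.platform) -> bool:
    """The accessibility-tree source is only available on macOS ("darwin")."""
    return platform == "darwin"


def make_desktop_perception(app_name: Optional[str] = None):
    """Return DesktopAXTree on macOS, otherwise a low-cost DesktopScreen."""
    if prefers_ax_tree():
        # Assumes the 'desktop-ax' extra is installed and the
        # Accessibility permission has been granted.
        from gantrygraph.perception import DesktopAXTree

        return DesktopAXTree(app_name=app_name) if app_name else DesktopAXTree()

    from gantrygraph.perception import DesktopScreen

    return DesktopScreen(vision_mode="low")
```

The imports are deferred into the function so the module still loads on machines where the macOS extra is not installed.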

`WebPage` — screenshot a browser page

Renders a URL via Playwright, captures a screenshot, and extracts the page's accessibility tree. Both are sent to the LLM before each think step. Requires pip install 'gantrygraph[browser]'.

from gantrygraph.perception import WebPage

page = WebPage(url="https://myapp.example.com", headless=True)

# Lower token cost — downscales screenshots to 1280×720
page = WebPage(url="https://myapp.example.com", vision_mode="low")

Share the browser with `BrowserTools`

Pass the same WebPage instance to both perception= and BrowserTools so they operate on the same Playwright Page object — no double browser launch.

from gantrygraph import GantryEngine
from gantrygraph.perception import WebPage
from gantrygraph.actions import BrowserTools

web = WebPage(url="https://app.example.com", headless=True)

agent = GantryEngine(
    llm=...,
    perception=web,
    tools=[BrowserTools(web_page=web)],
)

Without this, BrowserTools would launch a second browser instance pointing at a different page — the agent would act on one page but perceive another.

`MultiPerception` — combine sources

Use MultiPerception when your agent needs to observe several sources at once, for example controlling the desktop while monitoring a web dashboard:

from gantrygraph import GantryEngine, MultiPerception
from gantrygraph.perception import DesktopScreen, WebPage

agent = GantryEngine(
    llm=...,
    perception=MultiPerception([
        DesktopScreen(),
        WebPage(url="https://dashboard.internal"),
    ]),
)

The first source's screenshot is used as the primary image. Accessibility trees from all sources are concatenated with source labels.
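The merge rule described above can be sketched in plain Python. `Observation` is a hypothetical stand-in for a source's result, and the `[label]` header format is an assumption; the actual labels `MultiPerception` emits are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Observation:
    """Hypothetical stand-in for what one perception source returns."""

    label: str
    screenshot_b64: Optional[str]
    accessibility_tree: Optional[str]


def merge(observations: Sequence[Observation]) -> Observation:
    """Keep the first source's screenshot; concatenate labeled trees."""
    if not observations:
        raise ValueError("need at least one observation")
    sections = [
        f"[{obs.label}]\n{obs.accessibility_tree}"
        for obs in observations
        if obs.accessibility_tree  # skip sources with no text
    ]
    return Observation(
        label="merged",
        screenshot_b64=observations[0].screenshot_b64,
        accessibility_tree="\n\n".join(sections) or None,
    )


merged = merge([
    Observation("desktop", "AAA", None),
    Observation("web", "BBB", "button 'Save'"),
])
print(merged.screenshot_b64)
print(merged.accessibility_tree)
```

One consequence of this rule is that source order matters: put the source whose image the agent most needs to see first in the list.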

Write a custom perception source

Subclass BasePerception and implement observe(). Return a PerceptionResult.

import asyncio
from gantrygraph import BasePerception
from gantrygraph.core.events import PerceptionResult

class SystemMetricsPerception(BasePerception):
    """Feed CPU and memory stats to the agent instead of a screenshot."""

    async def observe(self) -> PerceptionResult:
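        import psutil  # third-party: pip install psutil

        # psutil calls are synchronous, so run them in the default executor
        # to keep the event loop free.
        stats = await asyncio.get_running_loop().run_in_executor(
            None,
            lambda: {
                # Without an interval, cpu_percent() measures usage since the
                # previous call; the very first reading is 0.0.
                "cpu": psutil.cpu_percent(),
                "mem": psutil.virtual_memory().percent,
            },
        )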
        return PerceptionResult(
            screenshot_b64=None,       # no image — text-only perception
            accessibility_tree=f"CPU: {stats['cpu']}%\nMEM: {stats['mem']}%",
            url=None,
            width=0,
            height=0,
            metadata=stats,
        )

    async def close(self) -> None:
        pass  # nothing to clean up

Use it exactly like a built-in source:

agent = GantryEngine(
    llm=...,
    perception=SystemMetricsPerception(),
    tools=[...],
)
agent.run("Alert me if CPU stays above 90% for more than 30 seconds.")

PerceptionResult.screenshot_b64 can be None — the engine skips the image and only sends the accessibility_tree text to the LLM. This pattern works for any text-based sensor: log tails, metric APIs, file watchers, database cursors.
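As a sketch of the log-tail case, the helper below produces the text a custom `observe()` could pass as `accessibility_tree`. `tail_text` is a hypothetical name, and the 20-line default is an arbitrary choice for illustration.

```python
from collections import deque
from typing import Iterable


def tail_text(lines: Iterable[str], max_lines: int = 20) -> str:
    """Return the last max_lines lines joined into a single string."""
    # deque with maxlen keeps only the most recent lines while iterating,
    # so the whole input never has to be held in memory.
    return "".join(deque(lines, maxlen=max_lines)).rstrip("\n")


sample = ["boot ok\n", "worker started\n", "ERROR: disk full\n"]
print(tail_text(sample, max_lines=2))
```

An open file handle is also an iterable of lines, so it can be passed in directly. Reading a file blocks, so inside `observe()` that call would belong in an executor, as in the metrics example above.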

See also: Desktop agent guide · Browser agent guide · API reference
