# Automate the desktop
This guide shows you how to build an agent that takes screenshots and controls the mouse and keyboard — like a human at a computer, but autonomous.
## Prerequisites
On Linux, a display server is required; headless machines can use a virtual display such as Xvfb.
## A working agent
```python
from gantrygraph import GantryEngine
from gantrygraph.perception import DesktopScreen
from gantrygraph.actions import MouseKeyboardTools
from gantrygraph.security import BudgetPolicy
from langchain_anthropic import ChatAnthropic

agent = GantryEngine(
    llm=ChatAnthropic(model="claude-sonnet-4-6"),
    perception=DesktopScreen(max_resolution=(1280, 720)),
    tools=[MouseKeyboardTools()],
    max_steps=30,
    budget=BudgetPolicy(max_wall_seconds=120),
    system_prompt=(
        "You are a desktop automation agent. "
        "After each action, wait for the screen to update before proceeding."
    ),
)

agent.run("Open the terminal, run 'echo hello world', and close it.")
```
What happens on each loop iteration:
1. `DesktopScreen` takes a screenshot of your monitor
2. The LLM receives the image and decides what to click, type, or press
3. `MouseKeyboardTools` executes the action
4. The loop repeats until the task is done
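The loop above can be sketched in plain Python, with stub perception and tool functions standing in for gantrygraph's internals (all names here are illustrative, not the library's API):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str                        # e.g. "mouse_click"
    args: dict = field(default_factory=dict)  # e.g. {"x": 100, "y": 200}
    done: bool = False               # the LLM signals task completion

def run_loop(take_screenshot, decide, execute, max_steps=30):
    """Perception -> decision -> action, repeated until done or out of steps."""
    for step in range(max_steps):
        image = take_screenshot()    # perception: capture the screen
        action = decide(image)       # LLM picks the next action
        if action.done:
            return step + 1          # iterations used, including the final check
        execute(action)              # tools carry out the click/keypress
    return max_steps

# Toy run: the "LLM" replays a fixed script and finishes on the third turn.
script = [Action("mouse_click", {"x": 10, "y": 20}),
          Action("type_text", {"text": "hello"}),
          Action("", done=True)]
it = iter(script)
steps_used = run_loop(lambda: b"fake-png", lambda img: next(it), lambda a: None)
```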
## Available mouse and keyboard tools
| Tool | What it does |
|---|---|
| `mouse_click` | Click at `(x, y)` coordinates |
| `mouse_move` | Move the cursor to `(x, y)` |
| `key_press` | Press a key combination (e.g. `ctrl+c`) |
| `type_text` | Type a string character by character |
| `screenshot` | Take a screenshot (also happens automatically via perception) |
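The `ctrl+c`-style combo strings that `key_press` accepts are easy to normalize before passing them along; a minimal parser might look like this (the alias table is an assumption for illustration, not gantrygraph's spec):

```python
def parse_key_combo(combo: str) -> list[str]:
    """Split a combo like 'Ctrl + Shift + C' into normalized key names."""
    aliases = {"control": "ctrl", "cmd": "meta", "command": "meta"}
    keys = [part.strip().lower() for part in combo.split("+")]
    return [aliases.get(k, k) for k in keys]
```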
## Tips for reliable automation
**Use lower resolution.** Smaller screenshots cost fewer tokens and let the LLM focus on relevant UI elements. `(1024, 768)` works well for most tasks; `(1280, 720)` is a safe maximum for Claude.
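Downscaling a screenshot to a resolution cap is an aspect-ratio-preserving fit; a small helper (not part of gantrygraph, just a sketch of the arithmetic) shows how a 1920×1080 frame maps into a `(1280, 720)` cap:

```python
def fit_within(size: tuple[int, int], max_size: tuple[int, int]) -> tuple[int, int]:
    """Scale size down (never up) to fit inside max_size, keeping aspect ratio."""
    w, h = size
    mw, mh = max_size
    scale = min(mw / w, mh / h, 1.0)   # the 1.0 cap avoids enlarging small screens
    return (round(w * scale), round(h * scale))
```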
**Set a time budget.** GUI state can change unexpectedly. Always add a `BudgetPolicy` for unattended agents so a stuck loop doesn't run forever.
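A wall-clock budget like the one `BudgetPolicy(max_wall_seconds=...)` configures can be sketched as a monotonic-clock check (an illustration of the idea, not gantrygraph's implementation):

```python
import time

class WallClockBudget:
    """Stops a loop once max_seconds of wall time have elapsed."""

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start = time.monotonic()  # monotonic: immune to system clock changes

    def exhausted(self) -> bool:
        return time.monotonic() - self.start >= self.max_seconds

# A tight budget cuts the loop off even though "work" remains.
budget = WallClockBudget(max_seconds=0.05)
steps = 0
while not budget.exhausted():
    steps += 1
    time.sleep(0.01)   # stand-in for one agent iteration
```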
**Write a descriptive system prompt.** Tell the agent which application it's using, what success looks like, and any app-specific quirks.
**Combine with file tools.** Desktop agents often need to read or write files alongside GUI interaction.
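gantrygraph's file-tool API isn't shown in this guide, so as a sketch only, here is what two minimal file tools an agent could call might look like (the function names and signatures are hypothetical):

```python
from pathlib import Path

def read_file(path: str) -> str:
    """Tool: return the text contents of a file on disk."""
    return Path(path).read_text(encoding="utf-8")

def write_file(path: str, text: str) -> int:
    """Tool: write text to a file, creating parent dirs; returns chars written."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    return p.write_text(text, encoding="utf-8")
```

Passing these alongside `MouseKeyboardTools()` would let the LLM move data between the GUI and the filesystem in a single task.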