Reduce Token Cost

Four orthogonal controls that together cut token spend by 5–10× on long agent runs.

Summary

Control	Where to set it	Typical savings
Accessibility-tree perception	`GantryEngine(perception_mode=)`	~80% per observe step
Shell output truncation	`ShellTools(max_output_chars=)`	Prevents 100k-token log dumps
Sliding message window	`GantryEngine(message_window=)`	Caps O(N²) history growth
Anthropic prompt cache	`GantryEngine(enable_caching=True)`	Up to 90% on system messages

Enable all four together for maximum effect:

from gantrygraph import GantryEngine
from gantrygraph.actions import ShellTools
from langchain_anthropic import ChatAnthropic

agent = GantryEngine(
    llm=ChatAnthropic(model="claude-sonnet-4-6"),
    tools=[ShellTools(max_output_chars=2000)],
    perception_mode="axtree",
    message_window=20,
    enable_caching=True,
    max_steps=50,
)

1. Accessibility-tree perception (`perception_mode`)

Default: "auto"

Every observe step sends a representation of the page to the LLM. By default in "auto" or "axtree" mode, GantryGraph sends the browser's accessibility tree as plain text instead of a screenshot image.

Mode	Cost per step	When to use
`"axtree"`	~300 tokens	Web pages with a readable DOM — most scraping and form-filling tasks
`"auto"`	~300 tokens when DOM readable, screenshot as fallback	Best default for mixed workloads
`"vision"`	~1,500–3,000 tokens	Canvas apps, PDFs, desktop GUIs, pixel-level verification

# Maximum savings — text-only, no screenshots
agent = GantryEngine(llm=..., perception_mode="axtree")

# Default — text when available, screenshot as fallback
agent = GantryEngine(llm=..., perception_mode="auto")

# Always screenshot — for visual tasks
agent = GantryEngine(llm=..., perception_mode="vision")

Rule of thumb: use "axtree" for web scraping and form filling; use "vision" only when the agent needs to read a chart, verify layout, or interact with a canvas element.

2. Shell output truncation (`max_output_chars`)

Default: 2000

A single cat large_logfile.log can produce megabytes of output. Without truncation, every byte ends up in the conversation history and is re-paid on every subsequent LLM call.

ShellTools truncates combined stdout+stderr at max_output_chars and appends a hint so the agent knows to refine with grep or head:

[48,312 chars truncated — use grep/head/tail for details]

from gantrygraph.actions import ShellTools

# Default — 2 000 chars is enough for most task decisions
ShellTools(max_output_chars=2000)

# Raise for log-heavy debugging tasks
ShellTools(max_output_chars=5000)

# Lower for very token-sensitive pipelines
ShellTools(max_output_chars=500)

Do not set max_output_chars to a very large value in production. The agent rarely needs more than the first 2 000 characters to decide its next action; the truncation hint guides it to refine its query with targeted shell commands.

3. Sliding message window (`message_window`)

Default: None (full history)

Without a window, every step appends new messages to the conversation. After N steps the context contains ~N messages, and the next LLM call pays for all of them — O(N²) token growth over the lifetime of a long task.

message_window=20 keeps messages[0] (the task/system prompt) plus the last 20 messages, bounding context to a fixed size regardless of run length.

# Recommended for long-running tasks (50+ steps)
agent = GantryEngine(llm=..., message_window=20)

Choosing a value:

Window	Good for
`10`	Short extraction tasks with a clear goal
`20`	Most web and shell agents — covers the recent action cycle
`40`	Tasks with complex multi-step dependencies
`None`	Short tasks (< 20 steps) where full history is affordable

messages[0] (the task prompt) is always preserved regardless of window size, so the agent never loses its objective.

4. Anthropic prompt cache (`enable_caching`)

Default: False — only effective with langchain_anthropic.ChatAnthropic

When enable_caching=True, GantryGraph adds Anthropic's cache_control: {"type": "ephemeral"} to system messages. Requests that hit the cache pay the cached input rate — up to 90% cheaper than the standard input rate.

The cache TTL is 5 minutes on Anthropic's side. For runs shorter than 5 minutes, every step after the first gets the cached rate. Longer runs re-prime the cache automatically.

from gantrygraph import GantryEngine
from langchain_anthropic import ChatAnthropic

agent = GantryEngine(
    llm=ChatAnthropic(model="claude-sonnet-4-6"),
    enable_caching=True,
)

Provider note: enable_caching=True has no effect with OpenAI or other providers — the flag is silently ignored, no error is raised.

Before / after — 50-step browser task

Metric	Unoptimized	Optimized
Tokens per observe step	~1,500 (screenshot)	~300 (AXTree)
Context growth	O(N²)	O(N) bounded
Input token discount	none	up to 90% (cache)
Risk of log-flood	high	eliminated

Next: Observability · State persistence · Browser agent

Reduce Token Cost

Summary

1. Accessibility-tree perception (perception_mode)

2. Shell output truncation (max_output_chars)

3. Sliding message window (message_window)

4. Anthropic prompt cache (enable_caching)

Before / after — 50-step browser task

1. Accessibility-tree perception (`perception_mode`)

2. Shell output truncation (`max_output_chars`)

3. Sliding message window (`message_window`)

4. Anthropic prompt cache (`enable_caching`)