Frequently Asked Questions

β–ΆHow much does it cost to run an AI agent 24/7?

Without optimization, a single always-on agent using a premium model like Claude Opus can cost $500-2,000/month. With the routing and filtering strategies described here β€” local models for heartbeats, cheap cloud for simple tasks, premium only when needed β€” you can run the same agent for $50-200/month. The key is matching model capability to task complexity.

β–ΆWhat is model routing for AI agents?

Model routing is the practice of directing different tasks to different AI models based on complexity. Simple classification tasks go to small local models (free). Medium-complexity work goes to fast cheap cloud models like Grok Mini ($0.10/M tokens). Complex reasoning and tool use go to premium models like Claude Opus. A well-designed router can cut costs 60-80% while maintaining output quality.

β–ΆCan you run AI agents locally without cloud APIs?

For certain tasks, yes. Local models like Llama 3.2 (3B parameters) run on consumer hardware and handle classification, summarization, and simple parsing well. They cannot replace cloud models for complex reasoning, multi-step tool use, or nuanced conversation. The optimal setup uses local models as a first filter β€” handling 40-60% of routine tasks for free β€” and routes the rest to cloud APIs.

β–ΆWhat is the biggest waste of tokens in AI agent systems?

Context bloat. Every time an agent wakes up, it re-reads its entire conversation history. Without compaction, a single day of activity can accumulate 50,000+ tokens of context that gets re-processed every turn. Memory compaction β€” summarizing old context and archiving raw logs β€” can reduce per-turn input tokens by 70% or more.

β–ΆHow do you measure AI agent token efficiency?

Track three metrics: cost per useful action (total spend divided by actions that produced user value), waste ratio (tokens spent on heartbeats and context re-reads vs. actual work), and model hit rate (percentage of tasks correctly handled by the cheapest capable model). Most unoptimized systems have a waste ratio above 60% β€” meaning more than half their spend produces no user-visible output.

March 2026Β·AI & AutomationΒ·10 min read

Token Optimization for AI Agents: A Practical Guide That Saved Us 70%

Your AI agent is burning money while you sleep. Here's the playbook we used to cut costs without cutting capability.

I run a production AI agent system. It monitors emails, reconciles financial data across NetSuite and HubSpot, watches market signals, generates reports, and manages its own memory β€” all autonomously, 24/7.

The first month, I looked at the bill and almost shut it down.

Not because it wasn't working β€” it was phenomenal. But running Claude Opus for every heartbeat check, every "nothing new" email scan, every memory rotation? That's like hiring a Harvard MBA to check if the mail came.

Here's every optimization we implemented, what worked, what didn't, and the exact savings.

The Problem: 60% of Your Tokens Produce Zero Value

Before optimization, here's where our tokens went:

  • Heartbeat polls (40%) β€” Agent wakes up every 30 minutes, re-reads context, checks if anything needs attention. 90% of the time: nothing. That's 40% of total spend returning "HEARTBEAT_OK."
  • Context re-reads (25%) β€” Every turn, the agent re-processes its entire conversation history, memory files, tool configs. Same content, re-tokenized every single time.
  • Routine checks (15%) β€” Email scans, calendar lookups, weather checks. Simple API calls wrapped in an expensive reasoning engine.
  • Actual work (20%) β€” The stuff that matters: analyzing data, writing reports, making decisions, executing multi-step workflows.

Read that again. 80% of spend on overhead. 20% on output.

Token waste breakdown: heartbeat polls 40% · context re-reads 25% · routine checks 15% · actual work 20%.

β€œWhen 80% of your AI spend is context loading and 20% is actual reasoning, you are not running an intelligent system. You are running an expensive filing cabinet.”

Optimization 1: Three-Tier Model Routing

Not every task needs the smartest model. This is obvious in theory and revolutionary in practice. The full architecture is detailed in the layered model architecture guide.

Tier 1: Local Models (Cost: $0)

We run Llama 3.2 (3B) via Ollama on the same machine as the agent. It handles:

  • Heartbeat pre-filtering. A bash script reads the heartbeat config, passes it to the local model, and asks: "Does anything here need attention?" If no β†’ return HEARTBEAT_OK without ever touching the cloud API.
  • Memory compaction drafts. Summarize yesterday's raw logs into a compact digest. The local model drafts it; the premium model reviews on the next real interaction.
  • Simple classification. "Is this email from a VIP?" "Does this order look like eBay?" Binary decisions that don't need nuance.

Impact: 40-50% of all agent invocations handled for free.

Tier 2: Cheap Cloud Models (Cost: ~$0.10-0.30/M tokens)

Grok 3 Mini, Claude Haiku, GPT-4o Mini. Fast, cheap, good enough for:

  • Medium-complexity Q&A
  • Structured data extraction
  • Template-based report generation
  • Multi-step but well-defined workflows

Impact: Another 20-30% of invocations at 1/10th the cost.

Tier 3: Premium Models (Cost: ~$15-75/M tokens)

Claude Opus, GPT-4 Turbo. Reserved for what actually needs it:

  • Complex multi-tool orchestration
  • Nuanced conversation with the user
  • Financial analysis and decision-making
  • Code generation and debugging

Impact: Only 20-30% of invocations hit the expensive tier.

The routing decision itself costs nothing β€” it's a simple script, not another LLM call. Don't use AI to decide which AI to use. Use rules.
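As a sketch of that rule-based approach (the model names and task categories here are illustrative, not our exact production table):

```python
# Minimal rule-based model router. No LLM call is needed to pick a model;
# the categories and tier assignments below are illustrative assumptions.

TIER_LOCAL = "llama-3.2-3b"    # free, runs via Ollama
TIER_CHEAP = "grok-3-mini"     # ~$0.10-0.30/M tokens
TIER_PREMIUM = "claude-opus"   # ~$15-75/M tokens

def route(task_type: str, needs_tools: bool = False) -> str:
    """Return the cheapest model believed capable of the task."""
    if task_type in {"classify", "heartbeat_filter", "summarize_logs"}:
        return TIER_LOCAL
    if task_type in {"extract", "template_report", "simple_qa"} and not needs_tools:
        return TIER_CHEAP
    # Complex reasoning, multi-tool orchestration, user conversation
    return TIER_PREMIUM
```

Because the router is a plain lookup, it adds zero tokens and zero latency to every invocation.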

Optimization 2: Heartbeat Pre-Filtering

This single optimization saved more than everything else combined.

An always-on agent checks in periodically β€” typically every 15-30 minutes. Each check loads the full system prompt, workspace files, memory context, and conversation history. Even if there's nothing to do, you're paying for all that context processing.

The fix: a 20-line bash script that runs before the agent wakes up.

  1. Read the heartbeat config file
  2. Check each data source directly (Gmail API, calendar, etc.)
  3. If nothing new β†’ return immediately, skip the LLM entirely
  4. If something needs attention β†’ pass to the agent with pre-fetched context

Result: 85% of heartbeat polls never touch the AI model. At 48 heartbeats per day, that's 40+ premium model calls eliminated daily.
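Ours is a bash script; the same logic sketched in Python, with the source checks stubbed as plain dict snapshots (real versions hit the Gmail and calendar APIs directly):

```python
def prefilter(current: dict, last: dict) -> tuple[bool, list]:
    """Compare fresh source snapshots against the last saved heartbeat state.

    Returns (wake_agent, changed_sources). If nothing changed, the agent
    never wakes and no LLM call is made.
    """
    changed = [key for key, value in current.items() if last.get(key) != value]
    return (bool(changed), changed)

# Snapshots come from direct API calls, not from the model.
last = {"inbox_unread": 3, "calendar_next": "standup@09:00"}
now = {"inbox_unread": 5, "calendar_next": "standup@09:00"}

wake, changed = prefilter(now, last)
if not wake:
    print("HEARTBEAT_OK")  # skip the LLM entirely
```

The saved state lives in a small JSON file between polls, so the comparison itself costs nothing.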

Optimization 3: Memory Architecture That Doesn't Bloat

The naive approach to agent memory: append everything to a conversation, let the context window handle it. This works until day three, when your agent is re-reading 100K tokens of raw logs every single turn. We built a complete six-layer memory architecture to solve this permanently.

Our six-layer memory stack:

  • Layer 1 β€” Short-term (Markdown, auto-rotated). Daily files. Raw observations and task logs. Auto-archived after 3 days.
  • Layer 2 β€” Medium-term (Markdown, curated). Active project status, key decisions, working context. Manually maintained.
  • Layer 3 β€” Long-term (Markdown, distilled). Core memories β€” relationships, preferences, lessons learned. Updated weekly from daily notes.
  • Layer 4 β€” Structured (SQLite). Config values, rep directories, schema definitions. Queryable, not re-read every turn.
  • Layer 5 β€” Semantic (Vector DB). Mem0 + Qdrant for recall by meaning, not keyword. "What did we decide about COGS?" works without knowing the exact date.
  • Layer 6 β€” System prompt injection. Only the current day's notes + MEMORY.md get injected into context. Everything else is queryable on-demand.

Impact: Per-turn context reduced from ~80K tokens to ~15K tokens. That's 5x fewer input tokens on every single interaction.
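Layer 6 can be as small as a context builder that reads exactly two files; a minimal sketch (the file layout is an assumption, not a prescription):

```python
from datetime import date
from pathlib import Path

def build_context(workspace: Path) -> str:
    """Inject only MEMORY.md plus today's daily notes into the prompt.

    Archives, SQLite, and the vector store stay on disk, queryable
    on demand instead of being re-read every turn.
    """
    wanted = ["MEMORY.md", f"notes/{date.today().isoformat()}.md"]
    parts = [(workspace / name).read_text()
             for name in wanted if (workspace / name).exists()]
    return "\n\n".join(parts)
```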

β€œPer-turn context reduced from ~80K tokens to ~15K tokens. That's 5x fewer input tokens on every single interaction.”

Optimization 4: Compaction, Not Truncation

Most frameworks truncate old messages when the context window fills up. This is the worst possible approach β€” you lose information unpredictably.

Compaction is different: you actively summarize the conversation history into a compact representation that preserves all critical context in fewer tokens.

Our approach:

  1. Local model (Tier 1) drafts a summary of the oldest conversation segment
  2. Summary replaces the raw messages in the context window
  3. Raw messages get archived to daily notes (Layer 1 memory)
  4. Critical decisions and context get promoted to Layer 2-3

The result: a conversation that's been running for 8 hours has the same context efficiency as one that just started, but with full memory of everything that happened.
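The four steps above can be sketched as one function; `summarize` and `archive` are passed in as stand-ins for the local-model draft and the daily-notes write:

```python
def compact(messages: list, summarize, archive, keep_recent: int = 20) -> list:
    """Collapse the oldest conversation segment into a summary message.

    `summarize(old)` stands in for the Tier 1 local-model draft;
    `archive(old)` stands in for the Layer 1 daily-notes write.
    Recent messages are kept verbatim.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    archive(old)  # raw logs survive on disk, not in the context window
    summary = {"role": "system",
               "content": "[Earlier conversation, compacted]\n" + summarize(old)}
    return [summary] + recent
```

Each compaction pass shrinks the window without discarding anything: the raw messages remain recoverable from the archive.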

Optimization 5: Tool Call Batching

Every tool call is a round trip: the model generates a tool call, you execute it, return the result, the model processes it. Each round trip adds tokens.

Optimization: when multiple independent tool calls are needed, make them all in one turn instead of sequentially. Reading three files? One batch call, not three round trips.

This sounds simple, but most agent frameworks default to sequential execution. Parallel tool calls reduce total turns by 30-50% in data-gathering workflows.
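With async tool execution, a batch of independent calls is one `gather` instead of N sequential round trips; `call_tool` here is a stand-in for a real tool invocation:

```python
import asyncio

async def call_tool(name: str, arg: str) -> str:
    """Stand-in for one tool round trip (file read, API call, etc.)."""
    await asyncio.sleep(0)
    return f"{name}({arg}) -> ok"

async def gather_batch(calls: list) -> list:
    """Issue independent tool calls concurrently, in a single turn."""
    return await asyncio.gather(*(call_tool(name, arg) for name, arg in calls))

# Three file reads: one batched turn instead of three round trips.
results = asyncio.run(gather_batch([("read_file", "a.md"),
                                    ("read_file", "b.md"),
                                    ("read_file", "c.md")]))
```

The ordering guarantee of `asyncio.gather` means results map back to their calls without bookkeeping.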

The Punch List: What We've Done and What's Next

Implemented

  • Three-tier model routing (Local β†’ Cheap Cloud β†’ Premium)
  • Heartbeat pre-filter script (bash, runs before agent wake)
  • Six-layer memory architecture (markdown + SQLite + vector)
  • Auto-compaction with local model drafts
  • Memory rotation script (daily β†’ archive after 3 days)
  • Semantic search via Mem0 + Qdrant (on-disk, $0/month)
  • Heartbeat state tracking (JSON) to avoid redundant checks
  • Tool response truncation for large API results
  • Context-aware prompt injection (only load relevant memory layers)

In Progress

  • Automated cost tracking per task category
  • Dynamic model selection based on task complexity scoring
  • Streaming compaction (compact while generating, not after)

Planned

  • Fine-tuned local model for domain-specific classification
  • Predictive heartbeat scheduling (check more often during business hours)
  • Cross-agent memory sharing (specialist agents contribute to shared knowledge)
  • Token budget per session with automatic model downgrade
  • Embedding-based context selection (inject only the 5 most relevant memory chunks)

The Numbers

Before optimization: roughly $800-1,200/month for a single always-on agent with moderate activity.

After implementing everything above: $150-250/month for the same agent doing the same work.

That's a 70-80% reduction without removing a single capability. The agent still uses Claude Opus for complex reasoning. It still monitors email, reconciles financial data, generates reports. It just doesn't use a $75/M-token model to check if the inbox is empty.

Optimization results: 70-80% cost reduction · $150-250/month after optimization · 85% of heartbeats filtered out · 5x context reduction per turn · 40-50% of tasks handled free locally · 30-50% fewer turns via batching.

What Most People Get Wrong

Three mistakes I see constantly:

  1. "Just use the cheapest model for everything." This kills quality. You need the premium model for complex work. The optimization isn't using cheap models everywhere β€” it's using expensive models only where they matter. Understanding every AI's specific weaknesses helps you route more intelligently.
  2. "Optimize the prompts." Prompt engineering saves 10-20% at best. Architectural changes (routing, filtering, memory) save 60-80%. Don't polish the deck chairs.
  3. "Use a smaller context window." Truncation loses information. Compaction preserves it. The goal isn't less context β€” it's denser context.

β€œThe cheapest token is the one you never send. Before optimizing how your agent thinks, optimize whether it needs to think at all.”

The cheapest token is the one you never send. Before optimizing how your agent thinks, optimize whether it needs to think at all.

Token optimization isn't about being cheap. It's about being intentional. Every token should earn its place in the context window. Every model invocation should match the complexity of the task. Every heartbeat should justify waking up the most expensive brain in the room.

Build the routing. Build the filters. Build the memory layers. Then let the expensive model do what it's actually good at β€” and nothing else. If you want to see the full economic comparison between running models locally and calling cloud APIs, check out the real math behind local vs. cloud AI.