Frequently Asked Questions

β–ΆHow much does it cost to run an AI agent 24/7?

Without optimization, a single always-on agent using a premium model like Claude Opus can cost $500-2,000/month. With the routing and filtering strategies described here β€” local models for heartbeats, cheap cloud for simple tasks, premium only when needed β€” you can run the same agent for $50-200/month. The key is matching model capability to task complexity.

β–ΆWhat is model routing for AI agents?

Model routing is the practice of directing different tasks to different AI models based on complexity. Simple classification tasks go to small local models (free). Medium-complexity work goes to fast cheap cloud models like Grok Mini ($0.10/M tokens). Complex reasoning and tool use go to premium models like Claude Opus. A well-designed router can cut costs 60-80% while maintaining output quality.

β–ΆCan you run AI agents locally without cloud APIs?

For certain tasks, yes. Local models like Llama 3.2 (3B parameters) run on consumer hardware and handle classification, summarization, and simple parsing well. They cannot replace cloud models for complex reasoning, multi-step tool use, or nuanced conversation. The optimal setup uses local models as a first filter β€” handling 40-60% of routine tasks for free β€” and routes the rest to cloud APIs.

β–ΆWhat is the biggest waste of tokens in AI agent systems?

Context bloat. Every time an agent wakes up, it re-reads its entire conversation history. Without compaction, a single day of activity can accumulate 50,000+ tokens of context that gets re-processed every turn. Memory compaction β€” summarizing old context and archiving raw logs β€” can reduce per-turn input tokens by 70% or more.

β–ΆHow do you measure AI agent token efficiency?

Track three metrics: cost per useful action (total spend divided by actions that produced user value), waste ratio (tokens spent on heartbeats and context re-reads vs. actual work), and model hit rate (percentage of tasks correctly handled by the cheapest capable model). Most unoptimized systems have a waste ratio above 60% β€” meaning more than half their spend produces no user-visible output.

March 2026Β·AI & AutomationΒ·10 min read

Token Optimization for AI Agents: A Practical Guide That Saved Us 70%

Your AI agent is burning money while you sleep. Here's the playbook we used to cut costs without cutting capability.

I run a production AI agent system. It monitors emails, reconciles financial data across NetSuite and HubSpot, watches market signals, generates reports, and manages its own memory β€” all autonomously, 24/7.

The first month, I looked at the bill and almost shut it down.

Not because it wasn't working β€” it was phenomenal. But running Claude Opus for every heartbeat check, every "nothing new" email scan, every memory rotation? That's like hiring a Harvard MBA to check if the mail came.

Here's every optimization we implemented, what worked, what didn't, and the exact savings.

The Problem: 60% of Your Tokens Produce Zero Value

Before optimization, here's where our tokens went:

  • Heartbeat polls (40%) β€” Agent wakes up every 30 minutes, re-reads context, checks if anything needs attention. 90% of the time: nothing. That's 40% of total spend returning "HEARTBEAT_OK."
  • Context re-reads (25%) β€” Every turn, the agent re-processes its entire conversation history, memory files, tool configs. Same content, re-tokenized every single time.
  • Routine checks (15%) β€” Email scans, calendar lookups, weather checks. Simple API calls wrapped in an expensive reasoning engine.
  • Actual work (20%) β€” The stuff that matters: analyzing data, writing reports, making decisions, executing multi-step workflows.

Read that again. 80% of spend on overhead. 20% on output.

Token waste breakdown: heartbeat polls 40% · context re-reads 25% · routine checks 15% · actual work 20%.

β€œWhen 80% of your AI spend is context loading and 20% is actual reasoning, you are not running an intelligent system. You are running an expensive filing cabinet.”

Optimization 1: Three-Tier Model Routing

Not every task needs the smartest model. This is obvious in theory and revolutionary in practice. The full architecture is detailed in the layered model architecture guide.

Tier 1: Local Models (Cost: $0)

We run Llama 3.2 (3B) via Ollama on the same machine as the agent. It handles:

  • Heartbeat pre-filtering. A bash script reads the heartbeat config, passes it to the local model, and asks: "Does anything here need attention?" If no β†’ return HEARTBEAT_OK without ever touching the cloud API.
  • Memory compaction drafts. Summarize yesterday's raw logs into a compact digest. The local model drafts it; the premium model reviews on the next real interaction.
  • Simple classification. "Is this email from a VIP?" "Does this order look like eBay?" Binary decisions that don't need nuance.

Impact: 40-50% of all agent invocations handled for free.

Tier 2: Cheap Cloud Models (Cost: ~$0.10-0.30/M tokens)

Grok 3 Mini, Claude Haiku, GPT-4o Mini. Fast, cheap, good enough for:

  • Medium-complexity Q&A
  • Structured data extraction
  • Template-based report generation
  • Multi-step but well-defined workflows

Impact: Another 20-30% of invocations at 1/10th the cost.

Tier 3: Premium Models (Cost: ~$15-75/M tokens)

Claude Opus, GPT-4 Turbo. Reserved for what actually needs it:

  • Complex multi-tool orchestration
  • Nuanced conversation with the user
  • Financial analysis and decision-making
  • Code generation and debugging

Impact: Only 20-30% of invocations hit the expensive tier.

The routing decision itself costs nothing β€” it's a simple script, not another LLM call. Don't use AI to decide which AI to use. Use rules.
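As a sketch of that rule-based approach (the model names and task categories here are illustrative, not our exact production table):

```python
# Minimal rule-based model router. No LLM call is needed to pick a model;
# the categories and tier assignments below are illustrative assumptions.

TIER_LOCAL = "llama-3.2-3b"    # free, runs via Ollama
TIER_CHEAP = "grok-3-mini"     # ~$0.10-0.30/M tokens
TIER_PREMIUM = "claude-opus"   # ~$15-75/M tokens

def route(task_type: str, needs_tools: bool = False) -> str:
    """Return the cheapest model believed capable of the task."""
    if task_type in {"classify", "heartbeat_filter", "summarize_logs"}:
        return TIER_LOCAL
    if task_type in {"extract", "template_report", "simple_qa"} and not needs_tools:
        return TIER_CHEAP
    # Complex reasoning, multi-tool orchestration, user conversation
    return TIER_PREMIUM
```

Because the router is a plain lookup, it adds zero tokens and zero latency to every invocation.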

Optimization 2: Heartbeat Pre-Filtering

This single optimization saved more than everything else combined.

An always-on agent checks in periodically β€” typically every 15-30 minutes. Each check loads the full system prompt, workspace files, memory context, and conversation history. Even if there's nothing to do, you're paying for all that context processing.

The fix: a 20-line bash script that runs before the agent wakes up.

  1. Read the heartbeat config file
  2. Check each data source directly (Gmail API, calendar, etc.)
  3. If nothing new β†’ return immediately, skip the LLM entirely
  4. If something needs attention β†’ pass to the agent with pre-fetched context

Result: 85% of heartbeat polls never touch the AI model. At 48 heartbeats per day, that's 40+ premium model calls eliminated daily.
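Ours is a bash script; the same logic sketched in Python, with the source checks stubbed as plain dict snapshots (real versions hit the Gmail and calendar APIs directly):

```python
def prefilter(current: dict, last: dict) -> tuple[bool, list]:
    """Compare fresh source snapshots against the last saved heartbeat state.

    Returns (wake_agent, changed_sources). If nothing changed, the agent
    never wakes and no LLM call is made.
    """
    changed = [key for key, value in current.items() if last.get(key) != value]
    return (bool(changed), changed)

# Snapshots come from direct API calls, not from the model.
last = {"inbox_unread": 3, "calendar_next": "standup@09:00"}
now = {"inbox_unread": 5, "calendar_next": "standup@09:00"}

wake, changed = prefilter(now, last)
if not wake:
    print("HEARTBEAT_OK")  # skip the LLM entirely
```

The saved state lives in a small JSON file between polls, so the comparison itself costs nothing.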

Optimization 3: Memory Architecture That Doesn't Bloat

The naive approach to agent memory: append everything to a conversation, let the context window handle it. This works until day three, when your agent is re-reading 100K tokens of raw logs every single turn. We built a complete six-layer memory architecture to solve this permanently.

Our six-layer memory stack:

  • Layer 1 β€” Short-term (Markdown, auto-rotated). Daily files. Raw observations and task logs. Auto-archived after 3 days.
  • Layer 2 β€” Medium-term (Markdown, curated). Active project status, key decisions, working context. Manually maintained.
  • Layer 3 β€” Long-term (Markdown, distilled). Core memories β€” relationships, preferences, lessons learned. Updated weekly from daily notes.
  • Layer 4 β€” Structured (SQLite). Config values, rep directories, schema definitions. Queryable, not re-read every turn.
  • Layer 5 β€” Semantic (Vector DB). Mem0 + Qdrant for recall by meaning, not keyword. "What did we decide about COGS?" works without knowing the exact date.
  • Layer 6 β€” System prompt injection. Only the current day's notes + MEMORY.md get injected into context. Everything else is queryable on-demand.

Impact: Per-turn context reduced from ~80K tokens to ~15K tokens. That's 5x fewer input tokens on every single interaction.
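Layer 6 can be as small as a context builder that reads exactly two files; a minimal sketch (the file layout is an assumption, not a prescription):

```python
from datetime import date
from pathlib import Path

def build_context(workspace: Path) -> str:
    """Inject only MEMORY.md plus today's daily notes into the prompt.

    Archives, SQLite, and the vector store stay on disk, queryable
    on demand instead of being re-read every turn.
    """
    wanted = ["MEMORY.md", f"notes/{date.today().isoformat()}.md"]
    parts = [(workspace / name).read_text()
             for name in wanted if (workspace / name).exists()]
    return "\n\n".join(parts)
```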

β€œPer-turn context reduced from ~80K tokens to ~15K tokens. That's 5x fewer input tokens on every single interaction.”

Optimization 4: Compaction, Not Truncation

Most frameworks truncate old messages when the context window fills up. This is the worst possible approach β€” you lose information unpredictably.

Compaction is different: you actively summarize the conversation history into a compact representation that preserves all critical context in fewer tokens.

Our approach:

  1. Local model (Tier 1) drafts a summary of the oldest conversation segment
  2. Summary replaces the raw messages in the context window
  3. Raw messages get archived to daily notes (Layer 1 memory)
  4. Critical decisions and context get promoted to Layer 2-3

The result: a conversation that's been running for 8 hours has the same context efficiency as one that just started, but with full memory of everything that happened.
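The four steps above can be sketched as one function; `summarize` and `archive` are passed in as stand-ins for the local-model draft and the daily-notes write:

```python
def compact(messages: list, summarize, archive, keep_recent: int = 20) -> list:
    """Collapse the oldest conversation segment into a summary message.

    `summarize(old)` stands in for the Tier 1 local-model draft;
    `archive(old)` stands in for the Layer 1 daily-notes write.
    Recent messages are kept verbatim.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    archive(old)  # raw logs survive on disk, not in the context window
    summary = {"role": "system",
               "content": "[Earlier conversation, compacted]\n" + summarize(old)}
    return [summary] + recent
```

Each compaction pass shrinks the window without discarding anything: the raw messages remain recoverable from the archive.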

Optimization 5: Tool Call Batching

Every tool call is a round trip: the model generates a tool call, you execute it, return the result, the model processes it. Each round trip adds tokens.

Optimization: when multiple independent tool calls are needed, make them all in one turn instead of sequentially. Reading three files? One batch call, not three round trips.

This sounds simple, but most agent frameworks default to sequential execution. Parallel tool calls reduce total turns by 30-50% in data-gathering workflows.
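With async tool execution, a batch of independent calls is one `gather` instead of N sequential round trips; `call_tool` here is a stand-in for a real tool invocation:

```python
import asyncio

async def call_tool(name: str, arg: str) -> str:
    """Stand-in for one tool round trip (file read, API call, etc.)."""
    await asyncio.sleep(0)
    return f"{name}({arg}) -> ok"

async def gather_batch(calls: list) -> list:
    """Issue independent tool calls concurrently, in a single turn."""
    return await asyncio.gather(*(call_tool(name, arg) for name, arg in calls))

# Three file reads: one batched turn instead of three round trips.
results = asyncio.run(gather_batch([("read_file", "a.md"),
                                    ("read_file", "b.md"),
                                    ("read_file", "c.md")]))
```

The ordering guarantee of `asyncio.gather` means results map back to their calls without bookkeeping.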

The Punch List: What We've Done and What's Next

Implemented

  • Three-tier model routing (Local β†’ Cheap Cloud β†’ Premium)
  • Heartbeat pre-filter script (bash, runs before agent wake)
  • Six-layer memory architecture (markdown + SQLite + vector)
  • Auto-compaction with local model drafts
  • Memory rotation script (daily β†’ archive after 3 days)
  • Semantic search via Mem0 + Qdrant (on-disk, $0/month)
  • Heartbeat state tracking (JSON) to avoid redundant checks
  • Tool response truncation for large API results
  • Context-aware prompt injection (only load relevant memory layers)

In Progress

  • Automated cost tracking per task category
  • Dynamic model selection based on task complexity scoring
  • Streaming compaction (compact while generating, not after)

Planned

  • Fine-tuned local model for domain-specific classification
  • Predictive heartbeat scheduling (check more often during business hours)
  • Cross-agent memory sharing (specialist agents contribute to shared knowledge)
  • Token budget per session with automatic model downgrade
  • Embedding-based context selection (inject only the 5 most relevant memory chunks)

The Numbers

Before optimization: roughly $800-1,200/month for a single always-on agent with moderate activity.

After implementing everything above: $150-250/month for the same agent doing the same work.

That's a 70-80% reduction without removing a single capability. The agent still uses Claude Opus for complex reasoning. It still monitors email, reconciles financial data, generates reports. It just doesn't use a $75/M-token model to check if the inbox is empty.

Optimization results: 70-80% cost reduction · $150-250/month after optimization · 85% of heartbeats filtered out · 5x context reduction per turn · 40-50% of tasks handled free locally · 30-50% fewer turns via batching.

What Most People Get Wrong

Three mistakes I see constantly:

  1. "Just use the cheapest model for everything." This kills quality. You need the premium model for complex work. The optimization isn't using cheap models everywhere β€” it's using expensive models only where they matter. Understanding every AI's specific weaknesses helps you route more intelligently.
  2. "Optimize the prompts." Prompt engineering saves 10-20% at best. Architectural changes (routing, filtering, memory) save 60-80%. Don't polish the deck chairs.
  3. "Use a smaller context window." Truncation loses information. Compaction preserves it. The goal isn't less context β€” it's denser context.

β€œThe cheapest token is the one you never send. Before optimizing how your agent thinks, optimize whether it needs to think at all.”

The cheapest token is the one you never send. Before optimizing how your agent thinks, optimize whether it needs to think at all.

Token optimization isn't about being cheap. It's about being intentional. Every token should earn its place in the context window. Every model invocation should match the complexity of the task. Every heartbeat should justify waking up the most expensive brain in the room.

Build the routing. Build the filters. Build the memory layers. Then let the expensive model do what it's actually good at β€” and nothing else. If you want to see the full economic comparison between running models locally and calling cloud APIs, check out the real math behind local vs. cloud AI.