Long-Running [AI Agents](/services/custom-ai-agents): What Anthropic Got Right, and Where We Diverge
A practitioner's response to Anthropic's engineering guide on long-running agents.
Anthropic recently published an engineering guide on building effective harnesses for long-running agents. It's a solid piece—practical, grounded in real implementation experience, and refreshingly honest about the hard problems.
We've wrestled with the same challenges: agents that try to do everything at once, declare victory prematurely, and leave chaos for the next session to clean up. But we've solved some of these problems differently.
Here's what works for us, what doesn't work yet, and where we respectfully disagree with the proposed solutions.
The Memory Problem: Three Layers Beat One
Anthropic's solution is a `progress.txt` file plus git history. It works, but it's flat. Everything lives in one place—what was done, what's blocked, what's next. The agent has to parse context, status, and history from a single source.
We use three layers instead:
Layer 1: Semantic Memory (Model Update)
A memory system that helps the orchestrating agent understand context—not just "what files exist" but "what are we building and why." This is the soft layer that answers "who am I in this project."
This layer stores:
- Project goals and constraints
- Architectural decisions and their rationale
- Domain knowledge specific to this codebase
- Patterns and conventions from previous sessions
Layer 2: Intention Pool (Structured Task State)
Think of it as Jira meets Git, but designed for AI agents. Structured storage of tasks with rich metadata: blockers, decision paths, dependencies, progress state, and the reasoning behind each choice.
The agent doesn't just know what to do next—it understands the logic of how we got here and where we're going. When a task is blocked, it knows why. When a decision was made, it knows what alternatives were considered.
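To make that concrete, here is a minimal sketch of what one intention-pool entry could look like. The schema and field names are illustrative assumptions, not our production data model:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    DONE = "done"


@dataclass
class IntentionPoolEntry:
    """One task in the intention pool (illustrative schema)."""
    task_id: str
    goal: str                                               # the "what", never the "how"
    state: TaskState = TaskState.PENDING
    depends_on: list[str] = field(default_factory=list)     # task_ids that must finish first
    blockers: list[str] = field(default_factory=list)       # why progress stopped, in plain language
    decision_log: list[str] = field(default_factory=list)   # choices made and alternatives considered
    rationale: str = ""                                      # why this task exists at all


# Example entry: enough context that a fresh agent understands the reasoning, not just the todo.
entry = IntentionPoolEntry(
    task_id="auth-03",
    goal="Implement Google OAuth with token refresh handling",
    depends_on=["auth-01"],
    rationale="Enterprise SSO is a hard requirement for the pilot client",
)
```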
Layer 3: Git (Version Control)
The non-negotiable foundation. Because sometimes you just need to roll back, and no amount of clever memory architecture replaces the ability to see exactly what changed when.
Why Three Layers Matter
The key insight: separating "understanding" from "tracking" from "versioning" reduces cognitive load on the agent. It doesn't need to parse a monolithic progress file to figure out what's relevant right now.
Each layer serves a different temporal need:
- Semantic memory: "What is this project about?" (stable, rarely changes)
- Intention pool: "What are we doing and why?" (changes per task)
- Git: "What exactly happened?" (immutable history)
When an agent returns to a project after losing context, it can rebuild understanding progressively: first the big picture, then current state, then specific changes.
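A hedged sketch of that recovery order, assuming a simple on-disk layout (the file names and the `.agent/` directory are placeholders, not our actual interface):

```python
import subprocess
from pathlib import Path


def rebuild_context(project_root: Path) -> dict:
    """Rebuild working context in three passes, coarse to fine."""
    # 1. Semantic memory: what the project is and why it exists (stable, rarely changes)
    semantic = (project_root / ".agent" / "semantic_memory.md").read_text()

    # 2. Intention pool: what we are doing right now, what is blocked, and why
    intentions = (project_root / ".agent" / "intention_pool.json").read_text()

    # 3. Git: what exactly changed recently (immutable history, consulted last)
    recent_commits = subprocess.run(
        ["git", "-C", str(project_root), "log", "--oneline", "-20"],
        capture_output=True, text=True, check=True,
    ).stdout

    return {"semantic": semantic, "intentions": intentions, "recent_commits": recent_commits}
```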
On Premature Victory: Prompt Engineering > Programmatic Constraints
Anthropic's approach to the "agent declares victory too early" problem is a JSON file with `passes: true/false` flags and strongly worded instructions not to edit it.
This feels like a band-aid on a structural decomposition problem.
If an agent is declaring victory prematurely, the question isn't "how do we prevent it from lying?" It's "why does the agent think it's done when it isn't?"
Our Approach: Atomic Task Decomposition
We make the supervisor agent responsible for proper task decomposition into what we call atomic structures—concrete enough to be unambiguous, but not so granular that they micromanage implementation.
The sweet spot is functional clarity:
- Too vague: "implement authentication"
- Too specific: "create function handleOAuthCallback on line 47"
- Just right: "implement Google OAuth with token refresh handling"
The task definition answers "what" without prescribing "how."
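For illustration, a task spec at that granularity might look like the following; the fields are hypothetical, and the point is what they leave out:

```python
# Hypothetical task definition at the "functional clarity" level: concrete enough
# to be unambiguous, silent on implementation details.
task = {
    "goal": "Implement Google OAuth with token refresh handling",
    "in_scope": ["login flow", "refresh-token rotation", "logout and revocation"],
    "out_of_scope": ["UI styling", "non-Google identity providers"],
    # Deliberately absent: file names, function names, line numbers.
    # The "how" belongs to the implementing agent.
}
```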
Completion Criteria Live in the Agent
Completion criteria live in the sub-agent's prompt, not in the task definition. The sub-agent knows: tests must pass, lint must be clean, migrations applied, no hardcoded secrets.
The supervisor agent doesn't repeat this for every task—it's baked into how the sub-agent operates. The sub-agent has internalized what "done" means.
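Roughly, "baked in" means something like this; the wording and the helper function are illustrative, not our actual prompts:

```python
# Illustrative standing instructions for the coding sub-agent. Because the criteria
# travel with the agent, the supervisor never restates them per task.
CODER_COMPLETION_CRITERIA = """
Before reporting a task as complete, verify all of the following yourself:
- The full test suite passes locally.
- Lint reports no errors.
- Any new migrations have been applied and are reversible.
- No secrets, tokens, or credentials are hardcoded anywhere in the diff.
If any check fails, the task is not done. Report what is blocking you instead.
"""


def build_coder_prompt(task_goal: str) -> str:
    """Combine the per-task goal with the standing definition of 'done'."""
    return f"Your task: {task_goal}\n\n{CODER_COMPLETION_CRITERIA}"
```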
The Result
Yes, this requires better prompts. But it also produces more robust behavior. The agent develops something closer to judgment rather than just following rules it's told not to break.
When an agent understands why tests must pass (not just that they must), it's less likely to game the constraint. It's the difference between compliance and understanding.
The Multi-Agent Question: Minimum Viable Agents
Anthropic asks whether a single generalist agent or a multi-agent architecture works better.
Our answer: use as few agents as possible.
Why Fewer Is Better
Every handoff between agents is a potential break in reasoning continuity. Context gets lost. Assumptions don't transfer. Nuance evaporates. The more agents you add, the more failure points you create.
We've seen this pattern repeatedly:
- Agent A makes an architectural decision
- Agent B doesn't have that context
- Agent B makes a locally reasonable choice that conflicts with Agent A's decision
- Hours of debugging follow
Our Setup
For small projects or clean microservice architectures: Two agents.
- One strategic orchestrator (planning, decomposition, integration)
- One coding agent (implementation)
For complex systems: Add a code reviewer. Maximum three agents.
The reviewer doesn't need to be expensive. A smaller, faster model handles security basics—parameterized queries, environment variable handling, port exposure, common vulnerability patterns. Save the big context windows for architectural decisions where nuance matters.
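One way to express that split is a simple role-to-model routing table; the tiers and config shape below are placeholders for whatever your stack uses:

```python
# Hypothetical routing table: match the model to the cognitive demand of the role.
AGENT_MODELS = {
    "orchestrator": {"tier": "large-context flagship", "why": "planning, decomposition, integration"},
    "coder":        {"tier": "mid-tier coding model",  "why": "implementation, bounded per task"},
    "reviewer":     {"tier": "small, fast model",      "why": "pattern checks: secrets, SQL params, exposed ports"},
}
```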
The Exception
More agents make sense when you have genuinely independent workstreams with minimal interaction. But "genuinely independent" is rarer than it seems. Most codebases have hidden dependencies that only surface when agents step on each other.
What Anthropic Missed: Human-in-the-Loop as Synchronization
The Anthropic piece treats human involvement as bookends—you provide the prompt, you review the result. Everything in between is autonomous.
We've built something different: the user can intervene at any time in either agent's work. More importantly, the sub-agent's completion report doesn't reach the orchestrator until the user validates it.
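A minimal sketch of that gate, assuming the report passes through a single checkpoint function (names and types are illustrative):

```python
from collections.abc import Callable


def forward_completion_report(report: str, human_approves: Callable[[str], bool]) -> str | None:
    """Checkpoint between sub-agent and orchestrator.

    `human_approves` is any mechanism that shows the report to a person and
    returns True or False. Until it returns True, the orchestrator sees nothing.
    """
    if human_approves(report):
        return report    # validated: the orchestrator now receives the report
    return None          # rejected: the task goes back to the sub-agent with feedback
```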
How We Discovered This
This started as a bug. We had a race condition where the user could see the sub-agent's output before it was reported upstream. We fixed the race condition but kept the checkpoint.
It became our favorite feature.
Why It Matters
Verification of reality, not claims: The user confirms what was actually done, not what the agent claims was done. Agents are confident. Sometimes they're confidently wrong.
Catching destructive operations: "Wait, that database has production data" is a sentence you want to say before the migration runs, not after.
Eliminating false completions: The "victory" moment requires human confirmation. The agent can't unilaterally declare success—someone who can actually verify the work must agree.
This Isn't Limiting Autonomy
We're not adding friction for its own sake. The agent still does the work autonomously. We're synchronizing understanding between human and AI at critical checkpoints.
Think of it like code review. The developer writes the code autonomously. The review happens before merge, not instead of writing code.
What We Haven't Solved Yet
Honesty moment: end-to-end testing is still manual for us.
Lint passes. Unit tests run. Integration tests run. But visual verification—"does this actually render correctly in a browser?"—happens in a separate session with a human watching.
Anthropic's Puppeteer integration for browser testing is genuinely useful. We haven't automated that layer yet. It's on the roadmap, right after "pay rent" and "occasionally sleep."
We also don't have good solutions for:
- Cross-session learning: When an agent figures out something clever, that insight doesn't automatically transfer to future sessions
- Graceful degradation: When an agent is truly stuck, the handoff back to human could be smoother
- Cost prediction: Estimating how many tokens a task will consume before starting it
These are hard problems. We're working on them.
The Philosophical Difference
Long-running agents are hard. Anthropic's solutions work. Ours work differently.
The core philosophical difference:
Anthropic leans toward programmatic constraints: JSON files, explicit flags, structured formats the model "can't" edit, guardrails that enforce behavior through rules.
We lean toward better task decomposition and human checkpoints: Clearer definitions of what success looks like, agents that understand (not just follow) completion criteria, and humans in the loop at moments that matter.
Neither approach is wrong. They solve different failure modes.
Programmatic constraints work well when agents are adversarial or unreliable—when you can't trust them to follow instructions. Human checkpoints work well when agents are capable but sometimes confused—when the problem is understanding, not compliance.
The Practical Takeaway
If your agent keeps declaring victory too early, ask yourself:
Is the problem that you need more guardrails? Or is it that you need clearer definitions of what victory actually looks like?
Sometimes the answer is more constraints. Often the answer is better prompts.
What's Next
We're continuing to iterate on this architecture. Current experiments include:
- Semantic compression of long context to preserve reasoning across sessions
- Automated detection of "agent is stuck" patterns
- Cost-aware task scheduling (do expensive things when they matter)
If you're building long-running agent systems and want to compare notes—let's talk.
Maryna Vyshnyvetska is CEO of Kenaz GmbH, a Swiss AI consultancy building practical agent architectures for enterprise clients. Connect on LinkedIn
Frequently Asked Questions
What's the difference between long-running agents and regular AI assistants?
Regular AI assistants handle single requests—you ask, they answer, context is lost. Long-running agents maintain state across multiple sessions, remember what they've done, track progress on complex tasks, and can resume where they left off. They're closer to a team member than a tool.
Why not just use one agent for everything?
One agent can work for simple projects. But for complex systems, specialization helps. An orchestrator thinks strategically, a coder implements, a reviewer catches mistakes. The key is keeping the count minimal—every handoff between agents loses context.
How do you handle context limits with long-running tasks?
Three strategies: semantic memory stores the important stuff that rarely changes, the intention pool tracks current state without full history, and aggressive summarization compresses completed work. The agent doesn't need to remember every keystroke—it needs to remember what matters.
Isn't human-in-the-loop slower than fully autonomous agents?
Yes, technically. But it's faster than debugging the mess a confidently wrong agent creates. The checkpoint adds minutes; fixing autonomous disasters adds hours. For production systems, we'll take the minutes.
What models work best for long-running agents?
It depends on the role. Orchestration benefits from larger context windows and stronger reasoning—Claude or GPT-5 class. Code review can use smaller, faster models. Implementation varies by complexity. Match the model to the cognitive demand of the task.
How do you prevent agents from contradicting each other?
Centralized state (the intention pool) and clear ownership. Every task has one responsible agent. Decisions are recorded with rationale. When Agent B starts work, it reads Agent A's reasoning, not just Agent A's output.
