Long-Running [AI Agents](/services/custom-ai-agents): What Anthropic Got Right, and Where We Diverge
A practitioner's response to Anthropic's engineering guide on long-running agents.
Anthropic recently published an engineering guide on building effective harnesses for long-running agents. It's a solid piece—practical, grounded in real implementation experience, and refreshingly honest about the hard problems.
We've wrestled with the same challenges: agents that try to do everything at once, declare victory prematurely, and leave chaos for the next session to clean up. But we've solved some of these problems differently.
Here's what works for us, what doesn't work yet, and where we respectfully disagree with the proposed solutions.
The Memory Problem: Three Layers Beat One
Anthropic's solution is a `progress.txt` file plus git history. It works, but it's flat. Everything lives in one place—what was done, what's blocked, what's next. The agent has to parse context, status, and history from a single source.
We use three layers instead:
Layer 1: Semantic Memory (Model Update)
A memory system that helps the orchestrating agent understand context—not just "what files exist" but "what are we building and why." This is the soft layer that answers "who am I in this project."
This layer stores:
- Project goals and constraints
- Architectural decisions and their rationale
- Domain knowledge specific to this codebase
- Patterns and conventions from previous sessions
Layer 2: Intention Pool (Structured Task State)
Think of it as Jira meets Git, but designed for AI agents. Structured storage of tasks with rich metadata: blockers, decision paths, dependencies, progress state, and the reasoning behind each choice.
The agent doesn't just know what to do next—it understands the logic of how we got here and where we're going. When a task is blocked, it knows why. When a decision was made, it knows what alternatives were considered.
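To make that concrete, here is a minimal sketch of what one intention-pool entry could look like. The schema and field names are illustrative assumptions, not our production data model:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    DONE = "done"


@dataclass
class IntentionPoolEntry:
    """One task in the intention pool (illustrative schema)."""
    task_id: str
    goal: str                                               # the "what", never the "how"
    state: TaskState = TaskState.PENDING
    depends_on: list[str] = field(default_factory=list)     # task_ids that must finish first
    blockers: list[str] = field(default_factory=list)       # why progress stopped, in plain language
    decision_log: list[str] = field(default_factory=list)   # choices made and alternatives considered
    rationale: str = ""                                      # why this task exists at all


# Example entry: enough context that a fresh agent understands the reasoning, not just the todo.
entry = IntentionPoolEntry(
    task_id="auth-03",
    goal="Implement Google OAuth with token refresh handling",
    depends_on=["auth-01"],
    rationale="Enterprise SSO is a hard requirement for the pilot client",
)
```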
Layer 3: Git (Version Control)
The non-negotiable foundation. Because sometimes you just need to roll back, and no amount of clever memory architecture replaces the ability to see exactly what changed when.
Why Three Layers Matter
The key insight: separating "understanding" from "tracking" from "versioning" reduces cognitive load on the agent. It doesn't need to parse a monolithic progress file to figure out what's relevant right now.
Each layer serves a different temporal need:
- Semantic memory: "What is this project about?" (stable, rarely changes)
- Intention pool: "What are we doing and why?" (changes per task)
- Git: "What exactly happened?" (immutable history)
When an agent returns to a project after losing context, it can rebuild understanding progressively: first the big picture, then current state, then specific changes.
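A hedged sketch of that recovery order, assuming a simple on-disk layout (the file names and the `.agent/` directory are placeholders, not our actual interface):

```python
import subprocess
from pathlib import Path


def rebuild_context(project_root: Path) -> dict:
    """Rebuild working context in three passes, coarse to fine."""
    # 1. Semantic memory: what the project is and why it exists (stable, rarely changes)
    semantic = (project_root / ".agent" / "semantic_memory.md").read_text()

    # 2. Intention pool: what we are doing right now, what is blocked, and why
    intentions = (project_root / ".agent" / "intention_pool.json").read_text()

    # 3. Git: what exactly changed recently (immutable history, consulted last)
    recent_commits = subprocess.run(
        ["git", "-C", str(project_root), "log", "--oneline", "-20"],
        capture_output=True, text=True, check=True,
    ).stdout

    return {"semantic": semantic, "intentions": intentions, "recent_commits": recent_commits}
```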
On Premature Victory: Prompt Engineering > Programmatic Constraints
Anthropic's approach to the "agent declares victory too early" problem is a JSON file with `passes: true/false` flags and strongly worded instructions not to edit it.
This feels like a band-aid on a structural decomposition problem.
If an agent is declaring victory prematurely, the question isn't "how do we prevent it from lying?" It's "why does the agent think it's done when it isn't?"
Our Approach: Atomic Task Decomposition
We make the supervisor agent responsible for proper task decomposition into what we call atomic structures—concrete enough to be unambiguous, but not so granular that they micromanage implementation.
The sweet spot is functional clarity:
- Too vague: "implement authentication"
- Too specific: "create function handleOAuthCallback on line 47"
- Just right: "implement Google OAuth with token refresh handling"
The task definition answers "what" without prescribing "how."
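For illustration, a task spec at that granularity might look like the following; the fields are hypothetical, and the point is what they leave out:

```python
# Hypothetical task definition at the "functional clarity" level: concrete enough
# to be unambiguous, silent on implementation details.
task = {
    "goal": "Implement Google OAuth with token refresh handling",
    "in_scope": ["login flow", "refresh-token rotation", "logout and revocation"],
    "out_of_scope": ["UI styling", "non-Google identity providers"],
    # Deliberately absent: file names, function names, line numbers.
    # The "how" belongs to the implementing agent.
}
```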
Completion Criteria Live in the Agent
Completion criteria live in the sub-agent's prompt, not in the task definition. The sub-agent knows: tests must pass, lint must be clean, migrations applied, no hardcoded secrets.
The supervisor agent doesn't repeat this for every task—it's baked into how the sub-agent operates. The sub-agent has internalized what "done" means.
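Roughly, "baked in" means something like this; the wording and the helper function are illustrative, not our actual prompts:

```python
# Illustrative standing instructions for the coding sub-agent. Because the criteria
# travel with the agent, the supervisor never restates them per task.
CODER_COMPLETION_CRITERIA = """
Before reporting a task as complete, verify all of the following yourself:
- The full test suite passes locally.
- Lint reports no errors.
- Any new migrations have been applied and are reversible.
- No secrets, tokens, or credentials are hardcoded anywhere in the diff.
If any check fails, the task is not done. Report what is blocking you instead.
"""


def build_coder_prompt(task_goal: str) -> str:
    """Combine the per-task goal with the standing definition of 'done'."""
    return f"Your task: {task_goal}\n\n{CODER_COMPLETION_CRITERIA}"
```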
The Result
Yes, this requires better prompts. But it also produces more robust behavior. The agent develops something closer to judgment rather than just following rules it's told not to break.
When an agent understands why tests must pass (not just that they must), it's less likely to game the constraint. It's the difference between compliance and understanding.
The Multi-Agent Question: Minimum Viable Agents
Anthropic asks whether a single generalist agent or a multi-agent architecture works better.
Our answer: use as few agents as possible.
Why Fewer Is Better
Every handoff between agents is a potential break in reasoning continuity. Context gets lost. Assumptions don't transfer. Nuance evaporates. The more agents you add, the more failure points you create.
We've seen this pattern repeatedly:
- Agent A makes an architectural decision
- Agent B doesn't have that context
- Agent B makes a locally reasonable choice that conflicts with Agent A's decision
- Hours of debugging follow
Our Setup
For small projects or clean microservice architectures: Two agents.
- One strategic orchestrator (planning, decomposition, integration)
- One coding agent (implementation)
For complex systems: Add a code reviewer. Maximum three agents.
The reviewer doesn't need to be expensive. A smaller, faster model handles security basics—parameterized queries, environment variable handling, port exposure, common vulnerability patterns. Save the big context windows for architectural decisions where nuance matters.
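One way to express that split is a simple role-to-model routing table; the tiers and config shape below are placeholders for whatever your stack uses:

```python
# Hypothetical routing table: match the model to the cognitive demand of the role.
AGENT_MODELS = {
    "orchestrator": {"tier": "large-context flagship", "why": "planning, decomposition, integration"},
    "coder":        {"tier": "mid-tier coding model",  "why": "implementation, bounded per task"},
    "reviewer":     {"tier": "small, fast model",      "why": "pattern checks: secrets, SQL params, exposed ports"},
}
```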
The Exception
More agents make sense when you have genuinely independent workstreams with minimal interaction. But "genuinely independent" is rarer than it seems. Most codebases have hidden dependencies that only surface when agents step on each other.
What Anthropic Missed: Human-in-the-Loop as Synchronization
The Anthropic piece treats human involvement as bookends—you provide the prompt, you review the result. Everything in between is autonomous.
We've built something different: the user can intervene at any time in either agent's work. More importantly, the sub-agent's completion report doesn't reach the orchestrator until the user validates it.
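A minimal sketch of that gate, assuming the report passes through a single checkpoint function (names and types are illustrative):

```python
from collections.abc import Callable


def forward_completion_report(report: str, human_approves: Callable[[str], bool]) -> str | None:
    """Checkpoint between sub-agent and orchestrator.

    `human_approves` is any mechanism that shows the report to a person and
    returns True or False. Until it returns True, the orchestrator sees nothing.
    """
    if human_approves(report):
        return report    # validated: the orchestrator now receives the report
    return None          # rejected: the task goes back to the sub-agent with feedback
```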
How We Discovered This
This started as a bug. We had a race condition where the user could see the sub-agent's output before it was reported upstream. We fixed the race condition but kept the checkpoint.
It became our favorite feature.
Why It Matters
Verification of reality, not claims: The user confirms what was actually done, not what the agent claims was done. Agents are confident. Sometimes they're confidently wrong.
Catching destructive operations: "Wait, that database has production data" is a sentence you want to say before the migration runs, not after.
Eliminating false completions: The "victory" moment requires human confirmation. The agent can't unilaterally declare success—someone who can actually verify the work must agree.
This Isn't Limiting Autonomy
We're not adding friction for its own sake. The agent still does the work autonomously. We're synchronizing understanding between human and AI at critical checkpoints.
Think of it like code review. The developer writes the code autonomously. The review happens before merge, not instead of writing code.
What We Haven't Solved Yet
Honesty moment: end-to-end testing is still manual for us.
Lint passes. Unit tests run. Integration tests run. But visual verification—"does this actually render correctly in a browser?"—happens in a separate session with a human watching.
Anthropic's Puppeteer integration for browser testing is genuinely useful. We haven't automated that layer yet. It's on the roadmap, right after "pay rent" and "occasionally sleep."
We also don't have good solutions for:
- Cross-session learning: When an agent figures out something clever, that insight doesn't automatically transfer to future sessions
- Graceful degradation: When an agent is truly stuck, the handoff back to human could be smoother
- Cost prediction: Estimating how many tokens a task will consume before starting it
These are hard problems. We're working on them.
The Philosophical Difference
Long-running agents are hard. Anthropic's solutions work. Ours work differently.
The core philosophical difference:
Anthropic leans toward programmatic constraints: JSON files, explicit flags, structured formats the model "can't" edit, guardrails that enforce behavior through rules.
We lean toward better task decomposition and human checkpoints: Clearer definitions of what success looks like, agents that understand (not just follow) completion criteria, and humans in the loop at moments that matter.
Neither approach is wrong. They solve different failure modes.
Programmatic constraints work well when agents are adversarial or unreliable—when you can't trust them to follow instructions. Human checkpoints work well when agents are capable but sometimes confused—when the problem is understanding, not compliance.
The Practical Takeaway
If your agent keeps declaring victory too early, ask yourself:
Is the problem that you need more guardrails? Or is it that you need clearer definitions of what victory actually looks like?
Sometimes the answer is more constraints. Often the answer is better prompts.
What's Next
We're continuing to iterate on this architecture. Current experiments include:
- Semantic compression of long context to preserve reasoning across sessions
- Automated detection of "agent is stuck" patterns
- Cost-aware task scheduling (do expensive things when they matter)
If you're building long-running agent systems and want to compare notes—let's talk.
Maryna Vyshnyvetska is CEO of Kenaz GmbH, a Swiss AI consultancy building practical agent architectures for enterprise clients. Connect on LinkedIn
Frequently Asked Questions
What's the difference between long-running agents and regular AI assistants?
Regular AI assistants handle single requests—you ask, they answer, context is lost. Long-running agents maintain state across multiple sessions, remember what they've done, track progress on complex tasks, and can resume where they left off. They're closer to a team member than a tool.
Why not just use one agent for everything?
One agent can work for simple projects. But for complex systems, specialization helps. An orchestrator thinks strategically, a coder implements, a reviewer catches mistakes. The key is keeping the count minimal—every handoff between agents loses context.
How do you handle context limits with long-running tasks?
Three strategies: semantic memory stores the important stuff that rarely changes, the intention pool tracks current state without full history, and aggressive summarization compresses completed work. The agent doesn't need to remember every keystroke—it needs to remember what matters.
Isn't human-in-the-loop slower than fully autonomous agents?
Yes, technically. But it's faster than debugging the mess a confidently wrong agent creates. The checkpoint adds minutes; fixing autonomous disasters adds hours. For production systems, we'll take the minutes.
What models work best for long-running agents?
It depends on the role. Orchestration benefits from larger context windows and stronger reasoning—Claude or GPT-5 class. Code review can use smaller, faster models. Implementation varies by complexity. Match the model to the cognitive demand of the task.
How do you prevent agents from contradicting each other?
Centralized state (the intention pool) and clear ownership. Every task has one responsible agent. Decisions are recorded with rationale. When Agent B starts work, it reads Agent A's reasoning, not just Agent A's output.
