Cursor vs Anthropic multi-agent coding experiments
Today, both Cursor (building a web browser, using Claude 4.5) and Anthropic (building a C compiler, using Claude 4.6) published write-ups of multi-agent coding experiments. They are very similar, but there seem to be some important differences once one tries to scale the parallelism. Given that I am a big fan of agentic coding, here are my three takeaways from reading both posts:
1. Both systems used hierarchical decomposition
Anthropic:
In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it’s working on, figuring out what to work on next, and to effectively keep going until it’s perfect.
Cursor uses a more explicitly defined structure:
A root planner owns the entire scope of the user’s instructions. It’s responsible for understanding the current state and delivering specific, targeted tasks that would progress toward the goal. It does no coding itself. It’s not aware of whether its tasks are being picked up or by whom.
When a planner feels its scope can be subdivided, it spawns subplanners that fully own the delegated narrow slice, taking full ownership in a similar way but only for that slice. This is recursive.
Workers pick up tasks and are solely responsible for driving them to completion. They’re unaware of the larger system. They don’t communicate with any other planners or workers. They work on their own copy of the repo, and when done, they write up a single handoff that the system submits to the planner that requested the task.
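To make the role split concrete, here is a minimal sketch (in Python, and not Cursor's actual code) of the planner / subplanner / worker hierarchy described above. All names (`Planner`, `Task`, `Handoff`, `worker_run`) are hypothetical; the point is only the shape: planners recursively delegate and never code, while workers execute one task in isolation and report back with a single handoff.

```python
# Hypothetical sketch of the described hierarchy, not Cursor's implementation.
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str          # a narrow, self-contained unit of work


@dataclass
class Handoff:
    task: Task
    summary: str              # the single write-up a worker submits when done


@dataclass
class Planner:
    scope: str                                    # the slice of the goal this planner owns
    subplanners: list["Planner"] = field(default_factory=list)
    pending: list[Task] = field(default_factory=list)

    def split(self, subscope: str) -> "Planner":
        """Delegate a narrower slice to a new subplanner (recursive)."""
        child = Planner(scope=subscope)
        self.subplanners.append(child)
        return child

    def emit(self, description: str) -> Task:
        """Produce a targeted task; the planner never writes code itself."""
        task = Task(description=description)
        self.pending.append(task)
        return task

    def receive(self, handoff: Handoff) -> None:
        """Absorb a worker's handoff and re-plan from the new state."""
        self.pending.remove(handoff.task)


def worker_run(task: Task) -> Handoff:
    """A worker drives one task to completion in its own copy of the repo,
    unaware of the rest of the system, then writes a single handoff."""
    # ... the actual coding happens here ...
    return Handoff(task=task, summary=f"done: {task.description}")
```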
2. Coordination and correctness are the bottleneck on parallelism
The simple parallelism mechanism reported by Anthropic is:
Claude takes a “lock” on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git’s synchronization forces the second agent to pick a different one.
Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.
The infinite agent-generation-loop spawns a new Claude Code session in a fresh container, and the cycle repeats.
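Mechanically, the claim step can be sketched like this. This is my reconstruction in Python, not Anthropic's harness code: the current_tasks/ path comes from the post, while the branch name and the helper are assumptions. The only synchronization primitive is git itself; whoever pushes the lock-file commit first wins, and the loser resets and picks another task.

```python
# Rough sketch of the "lock file in git" claim step; all details beyond
# current_tasks/ are assumptions.
import subprocess
from pathlib import Path


def sh(*args: str) -> bool:
    """Run a git command in the current repo, return True on success."""
    return subprocess.run(["git", *args], capture_output=True).returncode == 0


def try_claim(task_name: str) -> bool:
    """Claim a task by committing a lock file; lose the race gracefully."""
    sh("pull", "--rebase")                        # see locks other agents pushed
    lock = Path("current_tasks") / f"{task_name}.txt"
    if lock.exists():
        return False                              # someone already holds this task
    lock.parent.mkdir(exist_ok=True)
    lock.write_text("claimed\n")
    sh("add", str(lock))
    sh("commit", "-m", f"claim {task_name}")
    if sh("push"):
        return True                               # push accepted: we own the task
    # Push rejected: another agent pushed first. Undo and pick a different task.
    sh("reset", "--hard", "origin/main")          # assumes the shared branch is main
    return False
```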
Cursor tried something similar but reports that the basic todo-file approach breaks down at scale:
The coordination file quickly created more problems. Agents held locks for too long, forgot to release them, tried to lock or unlock when it was illegal to, and in general didn’t understand the significance of holding a lock on the coordination file. Locking is easy to get wrong and narrowly correct, and more prompting didn’t help.
Expecting perfect commits also limits scaling:
When we required 100% correctness before every single commit, it caused major serialization and slowdowns of effective throughput. Even a single small error, like an API change or typo, would cause the whole system to grind to a halt. Workers would go outside their scope and start fixing irrelevant things. Many agents would pile on and trample each other trying to fix the same issue.
This does not seem to have been a problem for Anthropic at 16 parallel workers. Cursor got past it at 1,000 workers by introducing explicit task coordination and by not expecting every commit to be perfect, assuming instead that bugs would be discovered and fixed later.
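The posts do not spell out exactly what "explicit task coordination" looks like, so the following is purely illustrative and not Cursor's design: a coordinator that grants expiring leases instead of trusting agents to hold and release locks, so a stuck or crashed worker can never wedge the whole system the way a forgotten lock file can.

```python
# Illustrative lease-based coordinator; names, structure, and the lease
# duration are all assumptions.
import time
from dataclasses import dataclass, field


@dataclass
class Lease:
    task_id: str
    worker_id: str
    expires_at: float


@dataclass
class Coordinator:
    lease_seconds: float = 300.0
    leases: dict[str, Lease] = field(default_factory=dict)

    def assign(self, task_id: str, worker_id: str) -> bool:
        """Grant the task unless someone else holds an unexpired lease."""
        now = time.monotonic()
        current = self.leases.get(task_id)
        if current and current.expires_at > now:
            return False
        self.leases[task_id] = Lease(task_id, worker_id, now + self.lease_seconds)
        return True

    def complete(self, task_id: str, worker_id: str) -> None:
        """Release the lease when the worker hands the finished task back."""
        current = self.leases.get(task_id)
        if current and current.worker_id == worker_id:
            del self.leases[task_id]
```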
3. Human input seems to still matter a lot (at least for Cursor)
While the titles, in the service of hype, try to downplay it, human input still matters a lot.
Cursor:
Instructions given to this multi-agent system were very important.
Initially, we didn’t make them our primary goal, but instead aimed for a stable and effective harness. But the significance of instructions became apparent quickly. We were essentially interacting with a typical coding agent, except with orders of magnitude more time and compute. This amplifies everything, including suboptimal and unclear instructions.
Initially, the instructions focused on implementing specs and squashing bugs. Instructions like “spec implementation” were vague enough that agents would go deep into obscure, rarely used features rather than intelligently prioritizing.
We assumed implicitly that there were performance expectations within user-friendly bounds. But it took explicit instructions and enforced timeouts to force agents to balance performance alongside other goals.
For complex parts of the system, agents may write code that has memory leaks or causes deadlocks. Humans would notice this, but it wasn’t always obvious to agents. Explicit process-based resource management tools were required to allow the system to gracefully recover and be more defensive.
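As a hedged illustration of that last point, process-based resource management can be as simple as running agent-written code in a child process with a wall-clock timeout and a memory cap, so a deadlock or leak kills only that process and the harness can recover. This is my own sketch (Linux-specific, with arbitrary limits), not Cursor's tooling.

```python
# Sketch: run untrusted, agent-written code under a timeout and an
# address-space cap. Linux-specific; the limit values are arbitrary.
import resource
import subprocess


def run_limited(cmd: list[str], timeout_s: int = 60, mem_bytes: int = 2**31) -> int:
    """Run cmd; return its exit code, or -1 if it hit the limits."""

    def cap_memory() -> None:
        # Applied in the child just before exec: cap the address space.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    try:
        proc = subprocess.run(cmd, preexec_fn=cap_memory, timeout=timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return -1  # likely a deadlock or runaway loop; recover and move on
```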
Anthropic also reports quite heavy human involvement: writing tests, instruction tuning, failure analysis, etc. They describe some tasks the agents could not solve that needed workarounds; this is clearly quite involved expert work.


