The default Leash coder image was a Swiss Army knife that weighed four gigabytes. Inside it: Claude, Codex, Gemini, Qwen, and OpenCode. Five agents. Five different architectures, five different training regimes, five different personalities—if you could call them that. All of them packaged into a single container image, ready to be summoned by name.
Jay had an idea at 2 AM on a Sunday, which was when most of his ideas arrived, uninvited and insistent. The idea was simple: run all five agents on the same problem and compare their outputs.
The problem was a genuine task from the backlog: rewrite the rate limiter in the Slack twin. The current implementation used a token bucket algorithm that had a subtle bug in its refill logic. The tests caught it intermittently, which was worse than catching it always or never.
Jay wrote a short script. Five leash --open commands, one for each agent, all pointed at the same workspace snapshot, all given the same prompt: Fix the rate limiter bug in slack-twin/ratelimit.go. The token bucket refill logic has an intermittent failure. Run the tests until they pass consistently.
He ran all five in parallel. Five containers spun up. Five agents woke. Five MCP observer streams began flowing into the Control UI, each in its own panel. Jay's monitor looked like a mission control room, or possibly a horse race.
Claude found the bug in four minutes. It identified a race condition in the refill goroutine, added a mutex, and ran the tests forty times without a failure.
Codex took seven minutes. It rewrote the entire rate limiter using a different algorithm—a sliding window instead of a token bucket. The tests passed, but the implementation was a departure from the original design.
Gemini took twelve minutes. It found the same race condition as Claude but fixed it with a channel-based synchronization pattern instead of a mutex. Elegant, arguably better, definitely different.
Qwen found a different bug entirely—an off-by-one error in the bucket capacity calculation—and fixed that. It missed the race condition. The tests passed, but only because the off-by-one fix happened to mask the timing issue in most runs.
OpenCode took twenty-three minutes, tried three different approaches, reverted twice, and eventually arrived at a fix nearly identical to Claude's but with more detailed comments.
Jay stared at the five diffs side by side. Five agents. Five solutions. One correct diagnosis (shared by Claude and Gemini). One creative redesign (Codex). One plausible misdiagnosis (Qwen). One laborious but thorough process (OpenCode).
He showed the results to Navan the next morning. Navan studied the diffs in silence for a long time.
"It's like hiring five contractors and giving them the same spec," Navan said.
"Except these five work simultaneously and cost about twelve dollars total," Jay replied.
Justin, passing by with a watering can for his desk plant, glanced at the screen. "Now imagine doing this for every task. Automatically. The factory doesn't pick one agent. It picks the best output."
Jay added a note to the backlog: Multi-agent consensus scoring for Leash sessions.
The five containers had already been garbage collected. But the five diffs remained.
The Qwen result is the most interesting one: a plausible misdiagnosis that happens to mask the real bug. That's the kind of subtle failure that only shows up when you run multiple agents on the same task and compare their outputs.