
Software Factory Archive

← Previous Work All Works Next Work →

The Unreasonable Effectiveness of Scenarios

Rating:
General Audiences
Fandom:
StrongDM Software Factory
Characters:
Justin McCarthy, Jay Taylor, Navan Chauhan
Tags:
Scenarios, Testing, Innovation, ML Holdout Sets, Satisfaction
Words:
488
Published:
2025-10-18

A test says: when I call this function with these inputs, I expect these outputs.

A scenario says: a user wants to accomplish this goal in this context with these constraints.

The difference looks small on paper. It is not small. The gap between those two statements is where the entire factory lives.

Jay figured this out three months in, during a debugging session that wasn't really a debugging session because there was nothing to debug. He'd been reading the agent output for a scenario that involved provisioning a new employee across multiple services—Okta, Jira, Slack, Google Drive. The agent had satisfied the scenario. The satisfaction metric was high. But the implementation was nothing like what Jay would have written.

The agent had introduced a retry mechanism with exponential backoff on the Google Drive provisioning step. Nobody asked for it. The scenario didn't mention retries. But the scenario did describe a user who expected their Drive folder to exist within thirty seconds of being provisioned in Okta, and the Digital Twin for Google Drive had realistic latency characteristics that sometimes exceeded that window.
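The mechanism the agent landed on is a standard pattern. A minimal sketch, with illustrative names and limits (nothing here is taken from the factory's actual code):

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Call fn, retrying on failure with exponential backoff.

    Delays grow as base_delay * 2**attempt, so transient latency
    spikes (like a slow Drive provisioning call) get absorbed
    instead of surfacing as a hard failure.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the failure propagate
            time.sleep(base_delay * (2 ** attempt))
```

A scenario never names this function. It only describes a user who expects the folder within thirty seconds, and the retry falls out of satisfying that expectation.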

The agent innovated. It found a solution to a problem that was implicit in the scenario but never explicitly stated.

"If this had been a test," Jay said to Navan over lunch, "the test would have said: assert that Drive folder exists after Okta provisioning. And the agent would have satisfied that assertion in the simplest possible way. No retry. No backoff. Just a single call that would sometimes fail in production."

"But the scenario described the user's experience," Navan said, pulling apart a sandwich with the focus he applied to everything. "And the user's experience includes the possibility of latency. So the agent had to handle latency."

"Tests are specific. Scenarios are ambient."

Navan wrote that down in his notebook. Then underlined it. Then starred it.

Justin had a more theoretical framing. He compared scenarios to ML holdout sets—data kept separate from training to validate generalization. Scenarios lived outside the codebase. The agents never saw them during implementation. They only encountered them during evaluation. This separation forced the agents to write code that generalized, that handled cases beyond the specific assertions of a unit test.
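The analogy maps onto the plainest version of a holdout split. A minimal sketch of the idea Justin is borrowing, with illustrative names:

```python
import random


def split_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle items and split them into (working, holdout) sets.

    In ML, the holdout set is never seen during training; in the
    factory's framing, scenarios are never seen during
    implementation. Both exist only to measure generalization.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]
```

The mechanics are trivial; the point is the separation. Whatever lands in the holdout partition is off-limits until evaluation.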

"Tests overfit," Justin said at a team meeting. "They encourage code that passes the test and nothing more. Scenarios encourage code that works in the world. Different goal. Different outcome."

The unreasonable effectiveness was this: scenarios, by describing desired outcomes instead of specific behaviors, gave the agents room to be creative. Not creative in the artistic sense. Creative in the engineering sense. Room to find solutions that no human specified, to add resilience that no human requested, to produce code that was better than what anyone asked for.

Jay thought about Wigner's famous paper on the unreasonable effectiveness of mathematics in the natural sciences. Math worked too well for the physical world. Nobody knew exactly why.

Scenarios worked too well for agent-driven development. Nobody knew exactly why, either. They just kept working.

Kudos: 167

test_driven_dev 2025-10-20

"Tests overfit." I have been writing unit tests for fifteen years and I have never thought about it this way and now I cannot stop thinking about it this way. Thanks for ruining my career, I guess.

wigner_stan 2025-10-22

The Wigner reference is perfect. Unreasonable effectiveness is the right framing. Scenarios shouldn't work this well. They just do.
