The Holdout Set

It was Jay who asked the question that started the argument.

"Why don't the scenarios live in the repo?"

They were at lunch. A Thai place two blocks from the office, the kind with laminated menus and a fish tank by the register. Justin had pad see ew. Navan had green curry. Jay had asked the question while holding a spring roll in mid-air, as if the thought had struck him between the dip and the bite.

Justin set down his chopsticks. "Same reason you don't put the test set inside the training data."

Navan looked up from his curry. Jay lowered the spring roll.

"Think about it," Justin continued. "The agents have full access to the codebase. They read it, modify it, reason about it. If the scenarios lived in the repo, the agents would have access to them too. They'd be able to read the exact acceptance criteria while generating the code that's supposed to satisfy those criteria."

"Overfitting," Navan said immediately.

"Exactly. The agent would optimize for the specific scenarios rather than for the underlying user intent. It would learn to pass the tests rather than solve the problem. Classic Goodhart's Law: when a measure becomes a target, it ceases to be a good measure."

Jay bit the spring roll, chewed, thought. "But the agents know scenarios exist. They know they're being measured."

"Knowing you'll be tested is different from having the answer key. A student who knows there'll be an exam on chapter five still has to understand chapter five. A student who has the exact questions just memorizes answers."

"So the scenarios are the holdout set," Jay said. "In ML terms. The code is the model, the scenarios are the evaluation data, and you keep them separate to measure true generalization."

"That's the analogy. It's not perfect—scenarios are handwritten, not sampled from a distribution—but the principle holds."

Navan shook his head slowly. "I don't love it."

Justin raised an eyebrow. "Go on."

"In ML, the holdout set is drawn from the same distribution as the training data. Our scenarios are written by three humans with specific mental models of what users want. We're not sampling from the real distribution of user intent. We're approximating it based on our own biases." He pointed his spoon at Justin. "What if our scenarios have blind spots? The agents could satisfy every scenario we write and still fail in ways we didn't think to test."

"That's true," Justin said. "And it's a feature, not a bug."

"How is a blind spot a feature?"

"Because it keeps us honest about what we don't know. If we pretended our scenarios covered everything, we'd have false confidence. Instead, we know our coverage is incomplete. We know we're measuring a proxy for user satisfaction, not satisfaction itself. The gap between our scenarios and reality is a thing we have to actively think about and close over time."

Jay leaned back. "So we're not just writing scenarios. We're maintaining a model of what we don't understand about our users."

"Now you're getting it."

Navan was quiet for a moment, stirring his curry. "I still don't love it," he said. "But I don't have a better alternative. And the honesty argument is real. I'd rather know my test suite is incomplete than believe it's comprehensive when it's not."

"That," Justin said, picking up his chopsticks, "is the most important sentence anyone's said in this factory all week."

They finished lunch. On the walk back, Navan drafted three new scenarios on his phone, each one targeting a blind spot he'd been ignoring. By the time they reached the office, the holdout set was a little less incomplete than before.

It would never be complete. That was the point.

Software Factory Archive
