
Software Factory Archive


The 10x Test

Rating: General Audiences
Fandom: StrongDM Software Factory
Characters: Navan Chauhan, Jay Taylor, Justin McCarthy
Tags: Digital Twin Universe, Scale Testing, Volume, Cost, No Limits
Words: 441
Published: 2025-11-20

Navan ran the numbers on a Tuesday afternoon while waiting for a scenario suite to finish. He wrote them in his notebook, double-checked them, and then walked over to Jay's desk.

"If we ran last week's scenario suite against real APIs," Navan said, "it would have cost us approximately eleven thousand dollars in API usage."

Jay looked up. "How much did it cost us against the twins?"

"The electricity to run the binaries on our hardware. Maybe four dollars."

The math was simple but the implications were not. The factory ran thousands of scenarios per day. Each scenario made dozens of API calls across multiple services. Against real APIs, each call consumed rate limit budget, incurred usage charges, and left state in production systems that had to be cleaned up or isolated. Against the twins, each call consumed CPU cycles and nothing else.
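Navan's comparison reduces to a few multiplications. A back-of-the-envelope sketch, where the calls-per-scenario count and per-call price are assumptions chosen to match the story's figures rather than anything Navan states:

```python
# Rough cost model behind Navan's comparison. The call count and per-call
# price are illustrative assumptions, not figures from the story.
scenarios = 14_200                 # one week's suite
calls_per_scenario = 25            # "dozens of API calls" per scenario (assumed)
total_calls = scenarios * calls_per_scenario

price_per_call = 0.031             # USD per call, implied by ~$11,000 total (assumed)
real_api_cost = total_calls * price_per_call
twin_cost = 4.00                   # electricity only

print(f"real APIs: ${real_api_cost:,.0f}")             # real APIs: $11,005
print(f"twins:     ${twin_cost:.2f}")                   # twins:     $4.00
print(f"ratio:     {round(real_api_cost / twin_cost):,}x")
```

At these assumed rates the twins come out roughly three orders of magnitude cheaper, which is the gap the dialogue is pointing at.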

Justin had a phrase for it: "test at volumes far exceeding production limits." The twins made that possible. Not ten percent more than production. Not double production. Ten times production. A hundred times production, if you wanted. The twins didn't throttle. The twins didn't charge.

"We ran 14,200 scenarios last week," Navan continued. "Against real APIs, Okta's rate limit would have let us run about 1,400 per day before we hit throttling. Jira's limits would have capped us at around 900 per day. Slack would have been the bottleneck at maybe 600 scenarios per day before Tier 4 limits kicked in."

"So against real APIs, we could have run roughly 600 scenarios per day," Jay said. "Against the twins, we ran 14,200 in a week. That's about 2,000 per day."

"And we could go higher. We're not limited by the twins. We're limited by our scenario runner's concurrency settings and the hardware's CPU capacity. If we threw more machines at it, we could run 10,000 per day."
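The rate-limit math in this exchange is a `min()` over per-service daily caps. A quick sketch using the story's numbers (the caps are the characters' estimates, not real published Okta, Jira, or Slack limits):

```python
# Per-service daily scenario caps from the conversation; illustrative
# figures from the story, not real API rate limits.
daily_caps = {"Okta": 1_400, "Jira": 900, "Slack": 600}

# Against real APIs, throughput is gated by the slowest service.
bottleneck = min(daily_caps, key=daily_caps.get)
real_api_per_day = daily_caps[bottleneck]

# Against the twins, the observed rate was 14,200 scenarios in a week.
twin_per_day = round(14_200 / 7)

print(bottleneck, real_api_per_day)   # Slack 600
print(twin_per_day)                   # 2029
```

The bottleneck service sets the ceiling for the whole suite, which is why Jay's 600-per-day figure comes from Slack rather than from any kind of average across services.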

Jay leaned back. "What would we do with 10,000 scenarios per day?"

"Find bugs faster. The factory's whole thesis is that scenarios are the holdout set. The more scenarios you run, the more confident your satisfaction score. With 600 per day, you're sampling. With 10,000 per day, you're approaching census."

Justin had been listening from across the room. "The 10x test isn't about bragging rights. It's about statistical confidence. When we say a satisfaction score is 94%, the confidence interval depends on sample size. More scenarios, tighter interval. Tighter interval, more trustworthy score."

"And the cost of a wider confidence interval?" Jay asked.

"Bugs in production that your satisfaction score should have caught but didn't. A 94% score based on 600 scenarios means something different than a 94% score based on 14,000 scenarios. The number is the same. The confidence is not."
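Justin's point can be made concrete with a normal-approximation (Wald) interval for a proportion. Wald is an assumption here, chosen as the simplest such interval; the 94% score and the two sample sizes are the story's:

```python
from math import sqrt

def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) interval for a satisfaction proportion."""
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Same 94% score, two very different sample sizes.
for n in (600, 14_000):
    low, high = wald_interval(0.94, n)
    print(f"n={n:>6}: 94.0% ({low:.1%} to {high:.1%})")
# n=   600: 94.0% (92.1% to 95.9%)
# n= 14000: 94.0% (93.6% to 94.4%)
```

Roughly ±1.9 percentage points at 600 scenarios versus ±0.4 at 14,000: the same headline number, with the interval around it shrinking as the square root of the sample size.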

Navan wrote that down in his notebook: Same number, different confidence. He underlined it twice.

The scenario suite finished. 2,031 scenarios run. 2,004 satisfied. Twenty-seven failures, each one a potential bug that real API testing might never have found because real API testing would never have reached scenario number 1,401.

Kudos: 59

cost_conscious_dev 2025-11-22

$11,000 vs $4. That's not an optimization. That's a different universe of what's economically possible in testing. The twins change the calculus entirely.

stats_nerd 2025-11-23

"Same number, different confidence" is one of those deceptively simple statistical insights that should be tattooed on every QA engineer's forearm. Great line.
