Satisfaction: 0.73

The number had been 0.73 for four hours.

Navan had watched it through dinner (a granola bar from the vending machine down the hall), through the sunset that painted the office windows amber and then dark blue, and through two complete cycles of the scenario suite. Every run, the agents would churn through the codebase, propose patches, validate against the Digital Twin Universe, and produce a satisfaction metric. Every run: 0.73. Sometimes 0.7301. Once, tantalizingly, 0.7314. Then back to 0.73.

The metric was the thing. Not pass-fail. Not green-or-red. A probability: what fraction of observed trajectories through all scenarios likely satisfy the user? Anything below 0.95 meant something was wrong. And 0.73 meant something was very wrong.

Jay had gone home at nine, then come back at ten-thirty with cold brew in a Mason jar. He set it on the table next to Navan without a word and pulled up the scenario logs on his own laptop.

"It's the Jira twin," Navan said. He'd been saying it for two hours, but now he had evidence. He turned his screen so Jay could see. "Look at Scenario Fourteen. User creates a subtask under an epic. The Jira twin accepts the API call, returns a 201, populates the subtask in the board view. All correct. But the parent field in the response payload—"

"It's null," Jay finished, reading the log.

"It's null. The subtask exists, it shows up in the right place in the UI twin, but the API response says it has no parent. So when Scenario Fifteen runs—the one that queries subtask relationships—it finds an orphan. The agent's code correctly handles the response, but the response is lying."

Jay set down his cold brew. "The twin's behavior is inconsistent with itself."

"Exactly. The board-view behavior was cloned from one traffic capture. The API response behavior was cloned from a different capture, taken three weeks later. Jira shipped a change in between. Our twin has two different versions of reality stitched together."

They sat with that for a moment. The office was quiet. Somewhere down the hall, a cleaning crew was running a vacuum.

"How many scenarios does this affect?" Jay asked.

Navan had already run the analysis. "Directly? Just Fourteen and Fifteen. But those trajectories fan out. Sixteen through Twenty-Two all depend on consistent subtask relationships. That's nine scenarios with poisoned starting conditions. Nine out of thirty-three." He paused. "0.73."

Jay let out a breath that was almost a laugh. "Twenty-seven percent of our scenarios are failing because one API field is null when it shouldn't be."

"One field. One twin. One inconsistency." Navan was already drafting the prompt for the agent. Not writing code. Describing the problem. The Jira twin's subtask creation endpoint returns a null parent field inconsistent with its board-view representation. The behavioral model should unify these to reflect Jira's post-June behavior...

He submitted the prompt at 11:42 PM. The agent began its work. They watched the satisfaction metric in silence.

At 12:07 AM, it ticked to 0.74.

At 12:09, it jumped to 0.91.

Navan didn't say anything. Jay picked up his cold brew, took a long sip, and nodded once.

By 12:15, it was 0.97. Navan screenshot the dashboard, then screenshot his own screenshot. He'd want to remember this one.

Software Factory Archive

Kudos: 63