The investor's name was David, and he wore the particular expression of a man who had funded fourteen AI startups and understood exactly none of them. He was pleasant. He asked good questions. He took notes on an iPad with a stylus that cost more than Jay's first car.
"So walk me through the review process," David said, settling into a chair in the main room. The dashboard glowed on the wall behind him. He didn't look at it. "When the agents produce code, who reviews it?"
Justin, Jay, and Navan exchanged a look.
It was the kind of look that contained an entire conversation. Jay's eyebrows went up fractionally: You want to take this one? Navan's mouth twitched: This is going to be fun. Justin's expression didn't change at all, which was how you knew he was already composing his answer.
"Nobody," Justin said.
David's stylus stopped moving. "Nobody reviews the code?"
"That's correct."
"The agents write code, and nobody—no human—looks at it before it ships?"
"Code must not be written by humans," Justin said, as if reciting from a text. "Code must not be reviewed by humans. Those are the two rules of the factory."
David set his iPad down. This was, in Jay's experience, the moment when people either got very interested or very nervous. David was getting nervous.
"Then how do you know the code is correct?"
"We don't know the code is correct," Justin said. "We know the system is correct. Or more precisely, we know the probability that observed user trajectories through our scenarios are satisfactory. We measure satisfaction, not correctness."
"But surely someone needs to—"
"Let me show you something." Justin pulled up the dashboard. "This is the satisfaction metric for our Okta provisioning scenario cluster. Right now it's at 0.94. That means that 94% of the time, when a simulated user goes through the provisioning flow across all our digital twins, they get the outcome they expected. The code that produces this outcome was written entirely by agents. No human has read it."
"And you're comfortable with that?"
"I'm comfortable with 0.94. I'd be more comfortable with 0.97. I wouldn't trust a human reviewer to get me from 0.94 to 0.97 faster than the agents will."
David picked up his iPad again. "So the human role is..."
"We write the scenarios," Navan said. "We describe what the user wants to accomplish. We define what satisfaction looks like. We build and maintain the digital twins—the simulated services that the code runs against. We decide what matters."
"We're not in the loop," Jay added. "We're around the loop. We define the shape of the loop, but we don't insert ourselves into it."
David was writing quickly now. "And this works?"
"We spend over a thousand dollars a day per engineer on AI tokens," Justin said. "We've been running for five months. Our scenario satisfaction metrics are higher than any traditional test suite I've ever seen, and they cover behavioral edge cases that no human QA team would catch because no human QA team would think to test them."
"The agents test things humans wouldn't think of?"
"The agents test things humans wouldn't imagine," Navan said. "And the twins catch failures humans wouldn't notice. The humans are the least reliable component. That's why we took ourselves out of the loop."
David was quiet for a moment. Then he said, with the tone of a man having a revelation he wasn't sure he wanted: "You've automated yourselves into a management role."
Justin, Jay, and Navan exchanged another look.
"We prefer 'world-builders,'" Justin said.
"We're not in the loop. We're around the loop." That distinction is everything. And "the humans are the least reliable component" coming from Navan, who keeps handwritten notebooks tracking twin fidelity, is such a good character beat. He knows the humans are unreliable AND he knows the humans are essential. Both things are true.