The Mock Debate Revisited

Six months after Navan's talk at the conference, someone forked the Digital Twin Universe repository. Then someone else forked it. Then forty-seven more people forked it. Within a week, the repository had more forks than the team had expected to accumulate in a year.

The cause was a Hacker News post. Someone had written a blog post titled "Behavioral Clones vs. Mocks: The StrongDM Approach," and it had hit the front page on a slow Tuesday afternoon when the HN audience was hungry for something to argue about.

Jay saw it first. He had a personal history with Hacker News—he'd built an HN live feed viewer years ago, and he still checked the site with the guilty regularity of a former habit. The post had 347 comments when he found it. By the time he called Navan over, it had 412.

"The top comment says we reinvented WireMock," Navan reported, scrolling through the thread.

"We didn't reinvent WireMock."

"I know. The second comment explains why we didn't reinvent WireMock. The third comment disagrees with the second. The fourth comment asks what WireMock is. This is a very normal HN thread."

The debate crystallized around a simple question: were behavioral clones meaningfully different from sophisticated mocks? One camp said no. Mocks had been around for decades. Record-and-replay testing was standard practice. The twins were just mocks with better marketing. The other camp said yes. Mocks returned canned responses to anticipated inputs. Behavioral clones modeled the service's decision-making process and generated responses to unanticipated inputs. The distinction was generalization.

"Comment 187 is good," Jay said. "Someone pointed out that a mock fails open—if you send an unexpected request, it returns nothing. A behavioral clone fails closed—if you send an unexpected request, it generates the most plausible response based on its behavioral model. The failure mode is the differentiator."

Justin read the thread over lunch. He didn't comment. He never commented on threads about the factory. His position was that the work spoke for itself, and if the work needed defending, it wasn't done yet.

"The forks are interesting," Justin said. "Forty-nine people forked the repository. How many of them will actually build something?"

"Maybe three," Navan estimated. "Most forks are bookmarks. GitHub's fork button is the internet's 'save for later' button."

"But three is enough. If three people build their own twins using our approach, we'll learn from their implementations. Gene transfusion runs in both directions."

The thread continued growing. By evening it had 600 comments and had spawned two spin-off posts: "I Built a Behavioral Clone of Stripe in a Weekend" (the author had not, in fact, built a behavioral clone of Stripe) and "Why Behavioral Clones Won't Replace Integration Tests" (a measured take that Navan bookmarked and later cited in an internal document).

"The debate is useful," Jay said that night, writing up a summary of the thread for the team's internal log. "Not because anyone changes their mind—nobody changes their mind on HN—but because the arguments surface assumptions we haven't examined."

"Like what?"

"Like whether twin fidelity needs to be above a certain threshold before behavioral cloning beats traditional mocking. One commenter suggested ninety percent fidelity is the crossover point. Below that, mocks are cheaper and sufficient. Above that, clones provide value mocks can't."

Navan looked at their fidelity dashboard. All six twins were above ninety percent. They'd crossed the threshold without knowing there was a threshold to cross.

Software Factory Archive
