The fidelity score lived on a dashboard that nobody had asked for but everyone checked daily. It was a single number per twin, updated every morning at 6 AM Pacific, representing the percentage of API interactions where the twin's response was indistinguishable from the real service's response.
The Okta twin was at 97.3%. The Jira twin was at 94.8%. The Slack twin was at 96.1%. Google Docs sat at 91.2%. Google Drive was at 93.7%. Google Sheets was at 92.4%.
"Why is Docs the lowest?" Justin asked during the Monday review. He asked it the way he asked most questions—not accusatory, just curious, the way you'd ask why a plant was leaning toward a particular window.
Navan had the answer ready. "Collaboration edge cases. The OT implementation handles the common paths correctly, but there are about thirty interaction patterns around simultaneous suggestion-mode edits that we haven't fully characterized. Each one costs us a fraction of a percent."
"How do we measure this?" Jay asked. He was new enough to the fidelity process that the methodology still interested him more than the numbers.
"Shadow testing," Navan explained. "We run a subset of our scenarios against both the twin and the real service. Same inputs, same ordering, same timing as close as we can manage. Then we compare the responses. Exact match on status codes, headers we care about, and response body structure. Semantic match on response body content—we allow for differences in timestamps, request IDs, and other non-deterministic fields."
"So the fidelity score is empirical. Not theoretical."
"Strictly empirical. It measures what we've tested, not what we haven't. There could be API paths we've never exercised where the twin would score zero." Navan pulled up the breakdown. "The Okta twin's 97.3% comes from 2,341 API interactions tested last week. The 2.7% gap is sixty-three interactions where the twin's response differed. Forty of those are known issues with tracking. The other twenty-three are under investigation."
Justin studied the numbers. "What's the trajectory?"
Navan switched to the trend view. Six lines on a chart, one per twin, all sloping upward over the past eight weeks. The Okta twin had started at 89.1% and climbed to 97.3%. Jira had started at 82.4% and reached 94.8%. Every twin was improving, but the rate of improvement was slowing. The easy fidelity gains had been captured. What remained were the edge cases, the undocumented behaviors, the interactions that only surfaced under specific timing conditions.
"Asymptotic," Jay said, reading the curve.
"Every percentage point above 95 costs more than the previous one," Navan confirmed. "Going from 95 to 96 took us a week. Going from 96 to 97 took us two weeks. Going from 97 to 98 will probably take a month."
"Is 100% the goal?" Jay asked.
Justin shook his head. "100% means the twin is the service. That's not the goal. The goal is high enough fidelity that scenarios validated against the twin are reliable predictors of behavior against the real service. We need fidelity sufficient for confidence, not fidelity for its own sake."
"What's the threshold for confidence?"
"We'll know it when scenarios stop surfacing production bugs that the twins didn't catch." Justin closed the dashboard. "We're not there yet. Keep climbing."