Token Refresh

The access token expired at exactly the wrong moment. That was the point of the scenario.

Jay had configured the Okta twin to issue access tokens with a five-second lifetime. Absurdly short. Production Okta defaulted to one hour. But five seconds was long enough for an agent to start an operation and short enough that the token would expire before the operation completed. The scenario forced the agent to handle mid-operation token expiration.

"Step one," Jay narrated, running the scenario while Navan watched. "The agent authenticates with the Okta twin. Gets an access token and a refresh token. The access token is valid for five seconds. The refresh token is valid for twenty-four hours."

"Step two: the agent uses the access token to call the Jira twin. The call succeeds because the token is still valid. Two seconds elapsed."

"Step three: the agent uses the same access token to call the Slack twin. The call takes four seconds to complete because we've added simulated latency. By the time the response arrives, the token has been alive for six seconds. Expired."

"Step four: the agent calls the Drive twin. The Drive twin validates the access token, finds it expired, returns a 401."

The agent received the 401. It checked its token cache. Expired. It used the refresh token to request a new access token from the Okta twin. The Okta twin issued a new access token and—this was the critical part—a new refresh token. The old refresh token was now invalid.

"Token rotation," Navan said. "Okta rotates refresh tokens by default. Every time you use a refresh token, you get a new one. The old one is revoked. If an agent tries to use the old refresh token—"

"The Okta twin returns a 401 and revokes the entire session," Jay finished. "Which is exactly what real Okta does. It's a security feature. If someone steals a refresh token and both the legitimate client and the attacker try to use it, the first use succeeds and the second use triggers a security event that invalidates everything."

They had a scenario for that too. Two agents holding the same refresh token, both trying to refresh simultaneously. One would succeed. The other would be locked out. The scenario verified that the locked-out agent detected the session revocation and re-authenticated from scratch rather than retrying the same invalid refresh token in a loop.
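
The rotation-and-reuse rule behind both scenarios can be sketched in a few lines. This is an illustration only, not the twin's actual code and not Okta's API: `RotatingTokenIssuer`, `SessionRevoked`, `login`, and `refresh` are hypothetical names, and the sketch assumes that each refresh retires the old refresh token and that presenting a retired token revokes the whole session.

```python
import secrets


class SessionRevoked(Exception):
    """Raised when a refresh token is unknown, reused, or its session is revoked."""


class RotatingTokenIssuer:
    """Hypothetical stand-in for the twin's refresh endpoint (not Okta's API)."""

    def __init__(self):
        self._live = {}        # current refresh token -> session id
        self._retired = {}     # already-used refresh token -> session id
        self._revoked = set()  # session ids that have been revoked

    def login(self, session_id):
        """Start a session and hand back its first refresh token."""
        token = secrets.token_hex(8)
        self._live[token] = session_id
        return token

    def refresh(self, refresh_token):
        """Rotate: return (access_token, new_refresh_token) or raise SessionRevoked."""
        if refresh_token in self._retired:
            # A rotated token was presented again: treat it as theft and
            # revoke every live refresh token in the same session.
            session_id = self._retired[refresh_token]
            self._revoked.add(session_id)
            self._live = {t: s for t, s in self._live.items() if s != session_id}
            raise SessionRevoked(session_id)

        session_id = self._live.pop(refresh_token, None)
        if session_id is None or session_id in self._revoked:
            raise SessionRevoked("unknown or revoked refresh token")

        # Normal path: retire the old token and issue a fresh pair.
        self._retired[refresh_token] = session_id
        new_refresh = secrets.token_hex(8)
        self._live[new_refresh] = session_id
        return secrets.token_hex(8), new_refresh
```

With two agents holding the same refresh token, whichever calls `refresh` first gets a new pair. The second call raises `SessionRevoked`, and so does any later attempt with the winner's new token, which is why both agents end up re-authenticating from scratch.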

"The degenerate case is the refresh loop," Jay said. "Agent gets a 401 on access. Uses refresh token. Gets new access token. Makes the call. Gets another 401 because something else is wrong—maybe a scope issue, not an expiration. Agent interprets the 401 as an expiration. Refreshes again. Gets a 401 again. Refreshes again. Infinite loop."

"Our scenario caps the refresh retry count at three," Navan said. "If the agent refreshes three times and still gets 401s, the scenario assertion requires it to escalate—log an error, notify an operator, halt gracefully. Not spin."

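The capped-retry behavior that the assertion demands might look something like the sketch below. It is a guess at the shape, not the agents' real code: `call_with_refresh`, `EscalationRequired`, and the `call` and `refresh` callables are hypothetical, and the only number taken from the scenario is the cap of three refreshes.

```python
class EscalationRequired(Exception):
    """Raised when refreshing has not cleared the 401 and a human should look."""


MAX_REFRESHES = 3  # the cap the scenario asserts


def call_with_refresh(call, refresh, token, max_refreshes=MAX_REFRESHES):
    """Run call(token), refreshing on 401 at most max_refreshes times.

    call(token) returns (status, body); refresh() returns a new access token.
    Both are hypothetical callables supplied by the caller.
    """
    refreshes = 0
    while True:
        status, body = call(token)
        if status != 401:
            return status, body
        if refreshes >= max_refreshes:
            # Still 401 after the allowed refreshes: the cause is probably
            # not expiry (scope, issuer, ...), so stop and escalate.
            raise EscalationRequired(f"still 401 after {refreshes} refreshes")
        token = refresh()
        refreshes += 1
```

Against a twin that answers 401 no matter what, for instance because of a scope mismatch, this makes four calls and three refreshes and then raises `EscalationRequired` instead of spinning.
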
Justin had been reading the scenario specs over Jay's shoulder. "How many token-related edge cases do we have documented?"

Jay checked. "Twenty-seven scenarios. Access token expiry, refresh token rotation, concurrent refresh attempts, scope mismatch, issuer mismatch, clock skew, token reuse after revocation, and the DPoP binding scenarios from StrongDM ID."

"Twenty-seven scenarios," Justin repeated. "For something most people think of as 'just log in.'"

"Authentication is simple," Jay said. "Staying authenticated is the hard part."
