
An agent writing a Salesforce record, creating a Jira ticket from that record's data, and then notifying a Slack channel about the ticket: each step follows from the last. The chain is sequential. The auth tokens were issued once, at the start. What could go wrong?
Three months into production, you find out.
The Salesforce write was executed under the requesting user's delegated OAuth grant. The Jira ticket was created under the service account your team configured to avoid per-user Jira OAuth setup. The Slack message went out under a third credential set tied to the workspace integration you built six months ago. Three writes. Three different sub claims. One logical agent action. Your audit trail shows three unrelated principals acting independently. Your SOC 2 auditor asks: "Under whose authorization did the agent modify this Salesforce record?" You open your logs and cannot answer the question.
That is the identity drift problem. It is not the only one.
When developers see a multi-step tool chain, the mental model that arrives first is a transaction: A, then B, then C; if C fails, roll back B and A. It is wrong in two ways that matter to agent auth design.
The implication: auth failures mid-chain are not exceptions to handle gracefully. They are expected runtime events that require explicit design before the chain runs, not after it breaks.
Single-step tool calls have one failure mode: the credential is invalid. Multi-step chains have three, and each requires a different response.
Identity drift occurs when different steps in the same chain resolve credentials against different authorization principals, silently, without the agent knowing.
The naive implementation resolves credentials independently at each step:
This looks correct. It passes user_id consistently. But each connector may resolve credentials differently behind the scenes. If Salesforce uses delegated OAuth tied to the requesting user, Jira uses a service account your team configured organization-wide, and Slack resolves to a workspace integration with its own credential set, then all three calls succeed, all three return 200, and you have three different sub claims across what the LLM treated as a single agent action.
The audit trail is incoherent. The blast radius of any credential compromise is impossible to scope correctly. The compliance answer to "who authorized this?" is three answers instead of one.
Here is what makes identity drift particularly dangerous: it does not cause failures. The chain completes. Everything logs as successful. The incoherence only surfaces in a security review, an audit, or an incident investigation.
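Because drift never raises an error, it has to be looked for. The sketch below is one way to surface it from audit records, assuming each tool call is logged with a hypothetical `chain_id` and the `sub` the connector actually resolved. The `AuditRecord` shape and the `find_identity_drift` helper are illustrative and not part of any SDK.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """One logged tool call (hypothetical record shape)."""
    chain_id: str   # groups the calls that belong to one logical agent action
    connector: str  # e.g. "salesforce", "jira", "slack"
    sub: str        # principal the connector actually resolved
    status: int     # HTTP status returned by the tool call


def find_identity_drift(records):
    """Return {chain_id: sorted subs} for chains that used more than one principal."""
    subs_by_chain = defaultdict(set)
    for record in records:
        subs_by_chain[record.chain_id].add(record.sub)
    return {
        chain_id: sorted(subs)
        for chain_id, subs in subs_by_chain.items()
        if len(subs) > 1
    }


records = [
    # chain-1: every call succeeded, yet three principals acted.
    AuditRecord("chain-1", "salesforce", "user:alice", 200),
    AuditRecord("chain-1", "jira", "svc:jira-bot", 200),
    AuditRecord("chain-1", "slack", "app:workspace-integration", 200),
    # chain-2: one principal throughout.
    AuditRecord("chain-2", "salesforce", "user:bob", 200),
    AuditRecord("chain-2", "jira", "user:bob", 200),
]

drift = find_identity_drift(records)
```

Every record in `chain-1` carries a 200, which is why status codes alone cannot detect the problem; only grouping by chain and comparing principals does.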
OAuth consent is collected once, at connected_account creation. The scopes available to every tool call in the chain are fixed at that moment. The LLM does not know what scopes were granted; it only knows what tools are defined.
When the LLM branches to a tool whose required scope was not in the original grant, the call fails with a 403. The chain halts mid-execution. Prior writes are already durable. You cannot request additional scopes interactively, because headless background agents have no user to redirect through a consent flow.
79% of enterprises say they have adopted AI agents, but a large share of them cite agent security and trust as a reason not to take those agents into production. Scope window collapse is one of the failure modes at the center of that hesitation. The agent works in staging, where execution paths are short and controlled. In production, the LLM finds a more efficient path through the tools that requires a scope nobody requested when the connected account was set up three months ago.
The instinct is to request all scopes upfront: grant everything the agent might ever need. This is operationally incorrect. Over-scoped tokens fail enterprise security reviews. They violate the least-privilege principle that governs every serious production deployment. A token with crm.read, crm.write, calendar.write, email.send, files.delete is a credential exposure waiting to happen, and no security team will approve it.
The correct approach is decision-graph-aware scoping: enumerate every tool the agent can call given its system prompt and tool definitions, compute the union of required scopes across all reachable execution paths, and request exactly that set. This is derivable statically from the tool spec. It is not a wildcard grant. It is the minimal set that covers all paths the LLM can actually take.
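A minimal sketch of that computation, assuming every tool definition carries its required scopes as metadata. The tool names and the `TOOL_SCOPES` catalogue are hypothetical; only the scope strings echo the example above.

```python
# Hypothetical tool catalogue: tool name -> scopes that tool requires.
TOOL_SCOPES = {
    "crm_read_record": {"crm.read"},
    "crm_update_record": {"crm.read", "crm.write"},
    "calendar_create_event": {"calendar.write"},
    "files_delete": {"files.delete"},
}


def required_scopes(exposed_tools, tool_scopes):
    """Union of scopes over every tool the agent is allowed to select."""
    unknown = [name for name in exposed_tools if name not in tool_scopes]
    if unknown:
        # A tool without scope metadata makes the grant impossible to compute.
        raise KeyError(f"no scope metadata for tools: {unknown}")
    scopes = set()
    for name in exposed_tools:
        scopes |= tool_scopes[name]
    return scopes


# Only the tools actually passed to the LLM count as reachable.
agent_tools = ["crm_read_record", "crm_update_record", "calendar_create_event"]
grant = required_scopes(agent_tools, TOOL_SCOPES)
```

`files.delete` stays out of the grant because no tool exposed to this agent needs it, which is the difference between this approach and a wildcard request.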
Steps one and two executed successfully. Their writes are committed to external systems. Between step two and step three, the access token expired, or the user explicitly revoked the OAuth grant. Step three fails with a 401 or 403.
The reason matters: token expiry is recoverable with a refreshed token if the underlying refresh token is still valid. Revocation is not recoverable automatically; the user or an admin terminated the authorization intentionally, and any automated retry without explicit re-consent would be unauthorized access.
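The distinction can be encoded as a small classification step that runs before any retry is attempted. This is a sketch under stated assumptions: the `Recovery` enum and `recovery_for` function are hypothetical names, and the caller is assumed to know whether a token refresh succeeded.

```python
from enum import Enum


class Recovery(Enum):
    RETRY_AFTER_REFRESH = "retry_after_refresh"
    REQUIRE_RECONSENT = "require_reconsent"


def recovery_for(status, refresh_succeeded):
    """Map an auth failure to a recovery path.

    status: HTTP status of the failed tool call (401 or 403).
    refresh_succeeded: True if a fresh access token was obtained afterwards.
    """
    if status == 401 and refresh_succeeded:
        # Plain expiry: the grant is intact, so the failed step may be retried.
        return Recovery.RETRY_AFTER_REFRESH
    if status in (401, 403):
        # Refresh failed (revoked or expired grant) or the scope is missing:
        # retrying without new consent would be unauthorized access.
        return Recovery.REQUIRE_RECONSENT
    raise ValueError(f"not an auth failure: {status}")
```

Note that a 403 maps to re-consent even when the refresh succeeded, because a refreshed token carries the same scopes as the one it replaces.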
The partial write state is the hard problem. Retrying step three is correct for token expiry. It does not roll back step one and step two. If step three was supposed to create a record that references the data created at step one and step two, the system is now in an inconsistent intermediate state across two external systems that have no transaction relationship with each other.
This is not an auth problem but a compensation design problem that auth failures expose.
The fix for identity drift is architectural, not operational.
Every tool call in a chain must resolve credentials against the same authorization principal. That principal is established once, at chain initialization, and propagated to every subsequent step regardless of which tool the LLM decides to call. In Scalekit's model, a connected account is a specific user's authorization grant for a specific connector. It maps to exactly one sub claim. Every execute_tool call that uses the same identifier plus connection_name pair routes through the same connected account; the vault injects the correct scoped credential and the audit trail stays coherent.
The difference is not cosmetic. The fragile implementation allows each connector to resolve credentials independently. The stable implementation routes every call through the same user's connected account for each connector. Scalekit injects the correct scoped credential per connector from the vault; the agent never sees a token. What the agent controls is whether it provides a consistent identity anchor per step.
One identifier per chain execution. If the use case genuinely requires multiple users' authorization within a single chain (a co-approval workflow, for example), model this as two explicit chain contexts with a handoff between them, not as an identifier swapped mid-execution.
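One way to make "one identifier per chain execution" hard to violate is to bind the identifier to an immutable context object and route every step through it. The sketch below assumes a hypothetical `execute_tool` callable that accepts `tool_name`, `identifier`, `connection_name`, and `tool_input` as keyword arguments; that signature is an assumption for illustration, not taken from Scalekit's SDK, and the tool names are made up.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ChainContext:
    """Binds one identifier to a chain run; individual steps cannot override it."""
    identifier: str
    # Hypothetical stand-in for the SDK call that executes a tool.
    execute_tool: Callable[..., Any]

    def call(self, tool_name, connection_name, tool_input):
        # The identifier always comes from the context, never from the step.
        return self.execute_tool(
            tool_name=tool_name,
            identifier=self.identifier,
            connection_name=connection_name,
            tool_input=tool_input,
        )


calls = []


def fake_execute_tool(**kwargs):
    """Test double that records what each step was called with."""
    calls.append(kwargs)
    return {"ok": True}


ctx = ChainContext(identifier="user_123", execute_tool=fake_execute_tool)
ctx.call("salesforce_update_record", "salesforce", {"id": "001"})
ctx.call("jira_create_issue", "jira", {"summary": "Follow up"})
ctx.call("slack_post_message", "slack", {"text": "Ticket created"})
```

Because the dataclass is frozen, reassigning `ctx.identifier` mid-chain raises an error. A co-approval workflow would create a second `ChainContext` for the second user rather than mutate the first.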
The scope set available to a chain is fixed at connected_account creation. The LLM's execution path is not fixed at chain initialization. These two facts create the scope window problem.
The response is not to expand the scope set to cover every conceivable tool the agent might ever call. That path leads to over-scoped tokens that fail enterprise security review. The response is to compute the required scope set from the agent's decision graph before the connected account is created, not after the first 403 surfaces in production.
Three scope design strategies and their tradeoffs:
Decision-graph-aware scoping works because the set of tools an agent can call is knowable statically. The LLM does not invent tool names; it selects from the tool definitions provided in the request. Every tool definition carries its required scopes. The reachable tool set from any given system prompt and tool list is finite and enumerable. The scope union across that set is the minimum viable grant.
In Scalekit, list_scoped_tools returns the tool definitions the connected account is currently authorized to call. Running it before chain start confirms whether the scope set covers every reachable tool; a missing tool name is an early signal to trigger re-consent before any writes are attempted:
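The comparison itself is a set difference. The sketch below leaves out the call to `list_scoped_tools`, since its response shape is not assumed here; it takes the authorized tool names as plain strings. `missing_tools`, `preflight`, and the tool names are hypothetical.

```python
def missing_tools(reachable_tools, authorized_tool_names):
    """Tools the agent can select that the connected account cannot call."""
    return sorted(set(reachable_tools) - set(authorized_tool_names))


def preflight(reachable_tools, authorized_tool_names):
    """Raise before any write if the grant does not cover every reachable tool."""
    missing = missing_tools(reachable_tools, authorized_tool_names)
    if missing:
        raise PermissionError(f"re-consent required before chain start: {missing}")


# Tools defined for the agent (hypothetical names).
reachable = ["salesforce_update_record", "jira_create_issue", "slack_post_message"]
# Names extracted from the authorized tool definitions; hard-coded here.
authorized = ["salesforce_update_record", "jira_create_issue"]

gaps = missing_tools(reachable, authorized)
```

Calling `preflight(reachable, authorized)` with these inputs raises `PermissionError` before the first write, which is the point at which a re-consent link can still be surfaced cleanly.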
The re-consent edge case: when a new tool is added to the agent post-deployment, existing connected accounts lack that tool's required scope. The pre-flight check above surfaces this before the chain runs. Running it reactively on a 403 mid-chain is too late; writes at prior steps are already committed.
Token expiry mid-chain is not the hard problem. Scalekit handles proactive refresh automatically; the agent does not write refresh logic. The hard problem is what the agent does when auth fails after durable writes have already been committed to external systems.
The mental model most engineers reach for is "retry the failed step." This is correct for transient expiry. It is incorrect as a general response to mid-chain auth failure, because it ignores the state of the systems that already received writes.
The compensation decision tree every chain needs:
Two principles govern compensation execution. First, compensation calls must use the same identifier and connection_name as the original writes. Attempting to undo a write under a different identity from the one that made it will fail authorization at the application layer. Second, if the compensation tool is unavailable (no delete endpoint, no cancel API), the agent must alert the operator and block further execution; silent partial state is the failure mode that produces compliance incidents.
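The second principle can be expressed as a planning step that runs as soon as a mid-chain auth failure is caught. This is a sketch, assuming a hypothetical mapping from each write tool to the tool that undoes it; `plan_compensation` and the tool names are illustrative.

```python
def plan_compensation(completed_steps, compensators):
    """Decide how to handle durable writes after a mid-chain auth failure.

    completed_steps: forward tool names, in the order their writes committed.
    compensators: forward tool name -> name of the tool that undoes it.
    """
    if not completed_steps:
        # Nothing was written, so halting the chain is enough.
        return ("halt", [])
    uncovered = [step for step in completed_steps if step not in compensators]
    if uncovered:
        # No delete or cancel tool exists for some write: never leave it silent.
        return ("alert_operator_and_block", uncovered)
    # Undo the most recent write first, so dependent records go before parents.
    return ("compensate", [compensators[step] for step in reversed(completed_steps)])


COMPENSATORS = {
    "salesforce_update_record": "salesforce_restore_record",
    "jira_create_issue": "jira_delete_issue",
}

plan = plan_compensation(
    ["salesforce_update_record", "jira_create_issue"], COMPENSATORS
)
```

Each tool in a `"compensate"` plan would then be executed with the same identifier and connection_name as the write it reverses, per the first principle.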
Here is a production-grade three-step chain with Scalekit, implementing pre-flight scope checking and compensation on auth failure:
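The overall shape of such a chain can be sketched without the real SDK. Everything named below is an assumption made for illustration: `execute_tool` is a hypothetical callable taking `(tool_name, identifier, connection_name, tool_input)`, `ToolAuthError` is a hypothetical exception carrying the HTTP status, and the tool names are invented. None of these signatures are taken from Scalekit's documentation.

```python
class ToolAuthError(Exception):
    """Hypothetical error raised by execute_tool on a 401 or 403."""

    def __init__(self, status):
        super().__init__(f"tool call failed with HTTP {status}")
        self.status = status


# (connection_name, tool_name) in execution order.
STEPS = [
    ("salesforce", "salesforce_update_record"),
    ("jira", "jira_create_issue"),
    ("slack", "slack_post_message"),
]

# Write tool -> tool that undoes it. The Slack step is last, so it is never
# among the completed writes when a failure occurs and needs no entry.
COMPENSATORS = {
    "salesforce_update_record": "salesforce_restore_record",
    "jira_create_issue": "jira_delete_issue",
}


def run_chain(execute_tool, authorized_tools, identifier, payload):
    # Pre-flight: refuse to start if any step's tool is outside the grant.
    missing = [tool for _, tool in STEPS if tool not in authorized_tools]
    if missing:
        return {"status": "reconsent_required", "missing": missing}

    completed = []  # durable writes so far, as (connection_name, tool_name)
    try:
        for connection_name, tool_name in STEPS:
            execute_tool(tool_name, identifier, connection_name, payload)
            completed.append((connection_name, tool_name))
    except ToolAuthError as err:
        # Undo prior writes in reverse order, under the SAME identifier.
        # If a compensating call itself fails, the error propagates so that
        # the partial state is never left silent.
        compensated = []
        for connection_name, tool_name in reversed(completed):
            execute_tool(COMPENSATORS[tool_name], identifier, connection_name, payload)
            compensated.append(tool_name)
        # 401: token lifecycle failure. Anything else here is treated as 403.
        reason = "token_lifecycle" if err.status == 401 else "authorization"
        return {"status": "failed", "reason": reason, "compensated": compensated}
    return {"status": "ok", "completed": [tool for _, tool in completed]}


call_log = []


def fake_execute_tool(tool_name, identifier, connection_name, tool_input):
    """Test double: the Slack step fails with a 403, everything else succeeds."""
    call_log.append((tool_name, identifier, connection_name))
    if tool_name == "slack_post_message":
        raise ToolAuthError(403)


result = run_chain(
    fake_execute_tool, {tool for _, tool in STEPS}, "user_123", {"record_id": "001"}
)
```

In this run the Slack step fails with a 403 after two writes have committed, so the sketch deletes the Jira issue and restores the Salesforce record, in that order, and reports an authorization failure rather than a token lifecycle failure.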
Three things to observe. The identifier never changes between steps; Scalekit maps it to the same connected account for each connector. Compensation uses the same identifier as the original writes; using a different identity to undo a write will fail at the application layer regardless of whether the token is valid. The error handlers distinguish 401 (token lifecycle failure) from 403 (authorization failure) because the correct operator message and recovery path differ.
A consistent identifier plus connection_name pair in Scalekit enforces one thing precisely: all tool calls route through the same user's connected account, resolving the same vault entry, producing the same sub claim on every call. Token storage, refresh orchestration, proactive rotation, per-connector provider edge cases, and the immutable audit trail linking each call to its authorization event: Scalekit handles all of these.
The boundary matters precisely because of what auth cannot know. Scalekit knows whether a token is valid, which principal it belongs to, and whether the scope set covers the requested operation. It does not know whether a Jira ticket created at step two is in a consistent state without the Slack notification at step three. That is domain logic. Treating auth as a complete solution to partial failure is the design error that produces incidents.
Technically, yes. In practice, no. Different identifiers in one chain mean different users' authorization grants are mixed in a single execution context. The audit trail shows multiple principals authoring what is logically one agent action. If the use case genuinely requires multi-user authorization within one chain (a co-approval workflow is the most common legitimate case), model it as two explicit chain contexts with a handoff between them, not as an identifier swapped mid-execution.
The execute_tool call returns a 403. Run list_scoped_tools pre-flight to detect this before the chain starts and any writes are committed. If it surfaces mid-chain despite the pre-flight check (a new tool added post-deployment, for example), halt the chain, log completed steps, call get_authorization_link to generate a re-consent URL, and surface it to the user or operator. Do not retry; scope failures require user interaction.
Yes. Scalekit refreshes tokens proactively before expiry, not reactively on 401. In normal operation the agent never sees a token expiry error mid-chain. A 401 surfacing to the agent means the refresh itself failed, which is consistent with the refresh token being revoked or expired due to extended inactivity. At that point re-authorization (a new consent flow) is required; automated retry will not succeed.
A 403 after successful refresh is a scope failure, not a token lifecycle failure. The refreshed token carries the same scope set as the expired one; the scope problem predates the expiry. Check the scope set on the connected account via list_scoped_tools against the tool's required scopes. The resolution is re-consent with the missing scope added, not a retry.
This is the operational gap most chains do not plan for. The agent cannot proceed. Prior writes are durable. The options are: execute compensation for the prior writes immediately and surface a "re-authorization required" notification for clean retry; or hold the chain in a suspended state and resume from the failed step once the user completes re-consent. The worst choice is silently leaving partial state with no operator notification; that is the failure mode that produces compliance incidents and week-long debugging sessions.