Announcing CIMD support for MCP Client registration
Learn more

Tool Call Failures in Production (Debug/Prevent)

TL;DR

  • Most tool call failures in production agents are silent: no exception, no crash, no alert. The agent moves on; the action never happened.
  • The failure taxonomy has four independent layers: identity/auth, connector/proxy, upstream API, and execution semantics. Each requires a different detection and recovery path.
  • A 401 from an expired access token and a 401 from a revoked refresh token look identical at the execution layer but require completely different responses. Treating them the same produces retry storms and cascading failures.
  • Partial execution (the tool call was accepted; the action was never committed) is the hardest failure class to detect and the most consequential to recover from, and it returns HTTP 200.
  • Five metrics surface silent failures before your users do; an auth event pipeline makes 2AM runbooks unnecessary.
  • Scalekit handles the identity layer failures automatically: token lifecycle, proactive refresh, revocation detection, and per-user isolation via connected accounts, so the failure surface your agent owns is bound to the connector and upstream layers.

An agent automates ticket routing: pulling open items from Jira, reading account context from Salesforce, posting assignments to Slack. In staging it runs clean. No exceptions, consistent output. You ship.

Three weeks in, escalations go unrouted. Engineering opens the logs. No exception anywhere in the call chain. The agent ran, returned a success response at each step, and moved on. The Salesforce query had been executing against a delegated OAuth token that was revoked four days prior when a user disconnected the app from their Google account settings. The connector returned an empty result set without surfacing the auth failure. The agent treated empty context as valid state. Routing decisions ran without account data. Not a crash. Not a traceable error. A quiet degradation that looked like correct execution from every observability surface the team had instrumented.

This is not a staging versus production gap in any conventional sense. It is a structural property of how tool-calling agents traverse delegation chains.

Production telemetry confirms the scale of the problem.

As per Datadog, 1 in 20 AI model requests already fail in production, yet systems continue to run and return outputs that appear correct, making these failures difficult to detect. This captures only errors that surface as error codes. Silent failures: degraded outputs, stale state, partial executions, swallowed auth failures, do not appear in it at all.

The compound math makes this worse at scale. At 99% per-step reliability across a 10-step workflow, overall reliability is 90.4%. At 95% per step across 20 steps, it falls to 35.8%. Individual tool calls fail 3 to 15% of the time in well-engineered production systems due to network timeouts, rate limits, and upstream service interruptions. In development, tool calls succeed reliably. In production, those rates compound across multi-step workflows.

The correct diagnostic frame is not "my agent failed." It is: which layer in my agent's delegation chain degraded, and did it surface a signal I can act on? The answer to the second question is almost always: not yet.

The Four-Layer Failure Taxonomy

A tool-calling agent does not execute a single operation. It traverses a delegation chain: LLM planner → orchestration layer → tool executor → connector → OAuth provider → upstream API. Each hop is an independent failure surface. The orchestrator observes only the tool executor's response; it has no direct visibility into what the connector, the OAuth provider, or the upstream API actually did.

This is why the same HTTP 401 can mean three structurally different things, and why the same HTTP 200 can represent either success or a committed write that never happened.

Classifying failures by status code alone is insufficient. The correct classification is by layer of origin:

Layer
Failure Class
Typical Signal
Silent?
Identity / Auth
Expired access token; revoked refresh token; scope mismatch; tenant isolation violation
401, 403, or nothing if connector swallows the failure
Sometimes
Connector / Proxy
Rate limiting; schema translation error; malformed tool parameters from LLM
429, 400, 422, or nothing
Sometimes
Upstream API
Provider outage; schema drift; quota exhaustion; write accepted but not committed
500, 503, or 200 with a no-op body
Frequently
Execution Semantics
Partial execution; idempotency violation; stale state used in read-modify-write
200 everywhere; wrong outcome in downstream state
Always

Each layer has a different signal profile, a different recovery path, and a different architectural fix. The identity layer gets the most depth below because it is where the majority of production failures originate and where signals are most consistently misread.

The Identity Layer: One Status Code, Three Structurally Different Failures

When a Salesforce tool call returns a 401, three distinct states could be true:

Auth Failure Mode
HTTP Signal
Recovery
Automatable?
Expired access token
401
Refresh using refresh token; retry once
Yes
Revoked refresh token
401 (OAuth: invalid_grant)
Re-authorization required from user
No
Scope mismatch
403 (sometimes 401)
Consent flow modification required
No
Tenant isolation violation
403
Architecture fix required
Never
  • Expired access token is the recoverable case. The access token has exceeded its TTL (typically 5–60 minutes; the expires_in field from RFC 6749 tells you exactly when). The refresh token is still valid. Refresh it, retry once. If the retry fails, surface the error and stop.
  • Revoked refresh token is where most retry logic breaks down in production. When a user revokes OAuth consent in the upstream provider's settings — disconnecting your app from Google, revoking access in Salesforce's Connected Apps, or triggering an admin-initiated credential rotation — the refresh token is invalidated at the provider level. Your agent discovers this on the next tool call, not at revocation time. The OAuth 2.0 error returned is invalid_grant (RFC 6749), not a distinct "revoked" code. A retry with the same refresh token fails. An attempt to issue a new refresh also fails. Every retry attempt generates a new failed call.

In multi-threaded agents, this becomes a race.

When thread A discovers a 401 and starts a refresh, threads B through N have already begun the same attempt, each unaware the others are doing the same. Thread A succeeds and issues a new token. Every other thread's refresh call invalidates that token by issuing yet another one. The thread that "wins" holds a valid token; every other thread is now holding an invalidated credential. This is the exact failure mode Box documents for its single-use refresh tokens: "If two concurrent requests both attempt a refresh, one will succeed and one will invalidate a now-stale token."

This is not a code bug. It is a structural consequence of reactive refresh (refreshing on 401 instead of proactively scheduling refresh before expiry). The fix is a refresh scheduler driven by the expires_in value from RFC 6749, with distributed locking to ensure exactly one refresh attempt executes per credential. If the refresh fails with invalid_grant, surface a re-authorization signal immediately. Do not retry.

Scope mismatch surfaces even more quietly.

An agent was authorized to read Salesforce records when the user consented. The workflow was later extended to create records. Nobody updated the OAuth consent. The create_opportunity call returns 403. There is no automated recovery. The required scope was never granted. The fix is a new consent flow, not a retry.

OAuth providers return different error codes when refresh operations fail, and each error requires a different recovery strategy. Treating all refresh failures identically — the most common implementation pattern — leads to unnecessary re-authorization requests or repeated failed refresh attempts that compound rather than resolve.

Scalekit handles the identity layer

Connected accounts store tokens outside agent runtime; token refresh runs proactively before expiry, not reactively on 401; revocation detection tracks connection state so actions.execute_tool() either succeeds with a valid credential or raises an exception that routes to re-authorization, not a blind retry loop. The connection status API (GET /api/v1/connected_accounts) lets you check auth state before executing critical operations.

2. The Connector and Upstream Layers: Where 200 Becomes a Lie

60% of all LLM call errors observed across production deployments were caused by exceeded rate limits. Note this is LLM API call telemetry specifically, but the pattern holds at the tool execution layer for the same reason: rate limits compound across parallel agent threads in ways that single-request clients never encounter.

  • Rate limits (Layer 2) are the most tractable failure class in this group because they produce 429 with a Retry-After header. The operational error is applying exponential backoff uniformly to all 429 responses without inspecting that header. Some providers include retry_after in milliseconds. Others return 429 with no retry guidance. The correct behavior: parse Retry-After when present; otherwise apply exponential backoff with jitter, max 3 retries per tool call context. Circuit-break after the third retry. Do not propagate the retry loop to the orchestrator layer.
  • Malformed tool parameters (Layer 2) return 400 or 422 and are a signal that the LLM generated structurally incorrect parameters for a tool schema. The correct response is not a retry. The parameters that produced a 400 on the first call produce a 400 on the second. Log the raw parameters the LLM generated, log the tool schema the call was made against, and surface a schema mismatch event for review. Retrying 400s is a source of noise in retry rate metrics and masks the underlying LLM behavior or tool schema problem.
  • Upstream API partial execution (Layer 3) is the hardest failure class to catch. The upstream API acknowledged the write request (HTTP 200), but the write was queued and not immediately committed — or was committed to a buffer that requires a subsequent flush. The signal from the tool call is success. The downstream state is not what the agent expects. Detection requires semantic instrumentation: a post-call state check for data-mutating tools, not just status code logging. For a create_issue tool call, the correct pattern is: after the 200 response, query the resource by the ID returned in the response body. If it is not found or is in a pending state, emit a partial execution event. This adds latency. It is the only way to catch this failure class before it propagates to downstream steps.
  • Schema drift is the slow version of upstream failure. A field gets renamed in a provider's API update. The connector's tool schema still references the old field name. Calls succeed but return unexpected structure. The agent misreads the response. No error is raised. This is why connector schema versioning and drift alerting are production requirements.

The Five Metrics That Catch Silent Failures First

Aggregate error rate is the least useful signal for agent reliability. It averages across tenants, obscures per-connection degradation, and captures only failures loud enough to produce an error code. These five metrics surface the failure classes above.

1. Tool Call Success Rate by connection_id

A single tenant's revoked credential drops their tool call success rate to zero while the aggregate stays healthy. Aggregate success rate hides this entirely.

Instrument: tool_calls_succeeded / tool_calls_attempted, segmented by connection_id. Alert on any connection_id where the success rate drops below 95% for more than two consecutive minutes.

2. Token Refresh Success Rate and Latency by Provider

A declining refresh success rate is the earliest leading indicator of an incoming auth failure wave. Refresh failures accumulate silently; the agent discovers them only when the next expired access token is used in a tool call.

Instrument: token_refresh_succeeded / token_refresh_attempted, with p50/p99 latency segmented by OAuth provider. A latency spike on the token endpoint signals provider-side rate limiting on refresh — a precursor to refresh failures, not a consequence.

3. Partial Execution Rate

Tool calls that returned HTTP 200 but produced no observable state change downstream. This is the hardest metric to instrument and the most consequential to miss.

Instrument: post-call state verification for all data-mutating tools. For every create, update, or delete call, query the affected resource by the ID or key returned in the response. If the state does not match the expected post-call state, emit a partial_execution_detected event. Track: partial_executions / total_write_calls. A non-zero value here is always worth investigating.

4. Retry Rate by Error Class

Retry rate is useful only when segmented by error class. A high retry rate on expired-token failures means proactive refresh is not working. A high retry rate on malformed parameters means retries are being applied to non-retryable failures. A high retry rate on upstream_5xx with no circuit breaker means cascading failures are already in progress.

Instrument: retry count per tool call, tagged with error class (token_expired, refresh_failed, rate_limited, upstream_5xx, malformed_params). Retry rate on refresh failures and malformed parameters should be zero. If it is not, the retry classification logic has a bug.

5. Auth Event Delivery Lag

If you are consuming connected account lifecycle webhooks, the time between when an auth event occurred and when your agent acted on it is a live risk window. An agent that processes a revocation event 4 minutes after it was emitted has spent 4 minutes executing tool calls against a credential it no longer holds.

Instrument: webhook_received_at - event_occurred_at for all auth lifecycle events. Alert if delivery lag exceeds 30 seconds for revocation or refresh failure event types.

What not to measure: aggregate error rate as a primary agent health signal. It captures only loud failures, averages across tenants, and produces no actionable routing to a specific failure class.

Scalekit's webhooks fire real-time events across three categories: connected account lifecycle changes (account created, authorization completed, account disconnected), auth events (authorization failures, re-auth required, scope changes), and token activity (token refresh succeeded, token refresh failed). These land in your system before the next tool call discovers the failure. Metrics 1, 2, and 5 above are zero-instrumentation-lift when Scalekit manages the agent auth layer.

What to Do When Failures Happen

Detection without a response playbook is still a 2AM page. The decision tree below is deterministic: given a failure class, the correct response is not a judgment call. Retry storms originate almost exclusively from engineers applying the wrong response to a correctly detected failure.

# on_tool_call_failure(error, tool_name, connected_account_id) # --- Identity layer --- if error.layer == "identity": if error.oauth_error == "token_expired": # Proactive refresh should have prevented this path. # If reached, refresh now and retry exactly once. refresh_token() retry_once() if retry_fails: emit("auth_failure", connected_account_id=connected_account_id) halt_workflow_step() # STOP. Do not loop. if error.oauth_error == "invalid_grant": # Refresh token revoked. Cannot be resolved by retry. halt_agent_workflow() emit("reauth_required", connected_account_id=connected_account_id) # NEVER RETRY. Surface re-authorization link to user. if error.http_status == 403: # Scope mismatch or tenant isolation violation. halt_agent_workflow() emit("consent_required", connected_account_id=connected_account_id, required_scopes=error.required_scopes) # NEVER RETRY. Requires consent flow update. # --- Connector layer --- if error.layer == "connector": if error.http_status == 429: retry_after = parse_retry_after_header(error.headers) exponential_backoff_with_jitter( initial_delay=retry_after or 1.0, max_retries=3, window_seconds=60) if circuit_open: emit("connector_rate_limit_sustained", connector=error.connector) halt_workflow_step() if error.http_status in [400, 422]: # LLM generated invalid parameters. Retry is pointless. log_raw_params(tool_name=tool_name, params=error.request_params) log_tool_schema(tool_name=tool_name) emit("tool_schema_mismatch", tool_name=tool_name) # NEVER RETRY. # --- Upstream API layer --- if error.layer == "upstream_api": if error.http_status in [500, 502, 503]: retry_with_circuit_breaker(max_retries=2, window_seconds=60) if circuit_open: emit("upstream_degraded", provider=error.provider) halt_workflow_step() if error.http_status == 200 and not state_check_passes(): emit("partial_execution_detected", tool_name=tool_name) if is_idempotent(tool_name): retry_once() else: emit("human_review_required", workflow_id=workflow_id) halt() # --- Execution semantics layer --- if error.layer == "execution_semantics": run_compensating_transaction() # if available emit("human_review_required", workflow_id=workflow_id, step_id=step_id) halt() # NEVER retry blindly.

Three rules govern every branch of this tree.

invalid_grant and scope failures are never retried. These are not transient failures. Retrying them generates noise, inflates retry metrics, and delays the only correct action: surfacing a re-authorization signal to the affected user. Every retry on a revoked credential is a wasted call.

400 and 422 are never retried. The LLM generated structurally incorrect parameters. The same parameters produce the same error on the second call. Log the raw parameters and the schema; route to tool schema review; do not retry.

Write retries require an idempotency check first. Before retrying any data-mutating tool call, verify that the previous attempt did not partially execute. Most upstream APIs expose provider-specific idempotency mechanisms: Jira's X-Atlassian-Token, Stripe's Idempotency-Key header, Salesforce's composite request referenceId. Use the provider's native mechanism, not a generic key, and verify downstream state before issuing the retry. A retry on a non-idempotent create that is already partially committed produces a duplicate.

Getting Auth Events Into Your Observability Stack Before You Need Them

Datadog and Grafana observe the execution plane: HTTP status codes, latency, throughput, retry rates. They do not observe the identity plane: refresh failures, revocation events, scope changes, tenant disconnections. These are two independent telemetry streams. Most teams instrument only one.

Failures that originate in the identity plane are invisible to execution plane telemetry until they surface as tool call failures — which is the moment of maximum operational impact, not the earliest detectable moment.

The auth events that must reach your observability stack before go-live:

Auth Event Category
Without It, Execution Layer Sees
Token refresh failed
401 on the next tool call against that connected_account_id
Connected account disconnected
401 or empty result on the next tool call
Scope change (granted or denied)
403 on the next out-of-scope tool call
Authorization failure
401 only if the connector surfaces it

The Pipeline

Scalekit Auth Events (connected account lifecycle, token activity, auth failures) ↓ Webhook endpoint (your service; verify signature before processing) ↓ Event bus (SNS / Pub/Sub / Kafka) ↓ Datadog Log Intake / Grafana Loki ↓ Alert rules

The documented Scalekit webhook payload structure uses "type", "object", "occurred_at", "environment_id", and "organization_id" as top-level fields, with event-specific data in the "data" field. Index connection_id, occurred_at, and the event "type" as the dimensions your alert rules filter on. Verify webhook signatures using the Scalekit SDK before processing any event payload.

The Two Alert Rules That Eliminate the 2AM Page

Alert 1: Token refresh failure sustained

alert: token_refresh_failure_sustained condition: event_type: token_activity # refresh failed connected_account_id: any sustained_for: 5 minutes group_by: [connected_account_id, connector] actions: - notify: agent_ops_channel - trigger: reauth_notification to affected user - tag: workflow_pause_required = true for connected_account_id

A sustained refresh failure means the agent will rediscover it as a 401 on every subsequent tool call until the credential is re-authorized. This alert fires before the tool calls fail, not after.

Alert 2: Connected account disconnected for an active workflow

alert: active_connection_disconnected condition: event_type: connected_account # account disconnected connected_account_id: IN (active_workflow_connections) actions: - immediately: pause all workflow steps using connected_account_id - notify: affected_user with reauth link - emit: workflow_paused event to orchestrator - log: last_successful_tool_call_timestamp for connected_account_id

This alert should fire within 30 seconds of the event. The gap between a disconnection event and the agent discovering it on the next tool call is an unconstrained failure window. Every tool call executed during that window fails; every one of those failures was preventable.

Why "SIEM Integration" Is the Wrong Frame

SIEM is security event correlation and compliance audit. Auth event observability for agents is an operational concern: it drives workflow pause logic, re-authorization notifications, and retry suppression. The same event stream serves both purposes, but the alert rules and response playbooks are different.

A connected_account_disconnected event is a P2 operational incident for the agent platform and a medium-severity audit event for the security team. These go to different channels and trigger different responses. Routing both through a SIEM adds latency and buries the operational signal behind severity thresholds tuned for breach detection, not workflow reliability.

Route auth events to your observability stack first. Forward a copy to SIEM for compliance retention. Operational alert rules need sub-60-second delivery. SIEM ingestion pipelines are not designed for that SLA.

FAQs

Why does my agent return 200 when the underlying tool call failed?

The agent's orchestrator observes the tool executor's response, not the upstream API's response. If the connector returns 200 after swallowing an auth failure internally (a common pattern in retry-on-failure middleware), the orchestrator has no signal that the action failed. Add semantic verification — post-call state checks for write operations — and ensure connector-level errors surface as typed exceptions rather than swallowed into empty responses.

How do I distinguish a revoked refresh token from an expired access token?

By the OAuth error body, not the HTTP status. Both return 401. Expired access tokens produce error: "invalid_token" in the response body. Revoked refresh tokens produce error: "invalid_grant" (RFC 6749, Section 5.2). Parse the error body. If you are using Scalekit, check connected_account status via the API — if the account is in an error state, re-authorization is required.

Should I apply exponential backoff to all error classes?

No. Exponential backoff applies to transient failures: 429, 500, 502, 503. It is incorrect for invalid_grant (the token will still be revoked 30 seconds later), 403 scope failures, and 400/422 malformed parameters. None of these are transient. Retrying them generates noise and delays the correct remediation.

My agent has multiple concurrent threads sharing the same connected account. How do I prevent refresh race conditions?

Use distributed locking on the refresh operation, keyed by connected_account_id. Acquire the lock before attempting a refresh. If the lock is held, wait for the holder to complete, then read the updated token from the store. Do not attempt a parallel refresh. If your auth layer runs through Scalekit, refresh coordination happens at the infrastructure level via the connected account model — your agent threads call actions.execute_tool() and always receive a valid token regardless of concurrency.

What is the correct escalation path when an upstream API is degraded?

Circuit-break at two retries within a 60-second window. Emit an upstream degradation event tagged with the provider and connected_account_id. Halt the affected workflow step. Do not propagate the retry loop to the orchestration layer. Queue the step for resumption when the provider recovers — poll the provider's status endpoint or subscribe to their status page feed if available.

How do I handle a tool call that returned 200 but the action was never committed?

This is the partial execution case. The detection gate is a post-call state check: query the resource by the ID returned in the response body and verify the expected state. Before retrying, confirm the upstream API exposes a provider-native idempotency mechanism (such as X-Atlassian-Token for Jira or Idempotency-Key for Stripe). Pass a stable key — one that does not change between retry attempts — and verify downstream state before issuing the retry. If the API does not support idempotent writes, halt and route to human review rather than risk a duplicate action.

No items found.
Agent Auth Quickstart
Share this article
Agent Auth Quickstart

Acquire enterprise customers with zero upfront cost

Every feature unlocked. No hidden fees.
Start Free
$0
/ month
1 million Monthly Active Users
100 Monthly Active Organizations
1 SSO connection
1 SCIM connection
10K Connected Accounts
Unlimited Dev & Prod environments