Jun 30, 2026

Rate Limiting in Virtual MCP Servers: Per-User, Per-Tool, and Per-Tenant Controls

TL;DR

Agent traffic changed the math. A single agent conversation now chains ~8–15 tool calls in seconds (up from 2–3 in 2024). Rate limiting is now a cost, security, and reliability control at once.
Three failure modes, three keys. Runaway agents → key on the user; expensive-tool abuse → key on the tool; noisy neighbors → key on the tenant. Most teams instrument one and get blindsided by the other two, so layer all three.
Token bucket beats sliding window for agents. Bursty agent traffic is best absorbed by a token bucket with a burst allowance. Weight calls by cost (a DB query shouldn't spend the same budget as a trivial read) and count in sessions, not raw HTTP requests.
Mind the limits you don't own. Model-provider, gateway, and downstream SaaS ceilings fire regardless of your own rules. Set your edge limit against the tightest downstream ceiling so you slow traffic gracefully instead of relaying a wall of 429s.
Make every 429 actionable. Return machine-readable headers showing remaining budget and reset time, and guard against silent throttling (some providers return success with empty data when limited).

A virtual MCP server turns a set of tools into something a language model can drive at full speed. That is the point of it, and it is also the problem. A request pattern that looked like a careful human a year ago now looks like a burst of automated calls, because a single agent conversation commonly chains many tool calls in seconds rather than minutes.

Why this is not ordinary API throttling

The numbers make the shift concrete, and they are worth keeping in mind as design constraints:

A single agent conversation now averages roughly 8 to 15 tool calls, up from 2 to 3 in 2024, according to field reports.
A mostly read-only Claude session has been reported hitting GitHub's 5,000-per-hour ceiling in about two minutes.
A single runaway loop reportedly ran up around 47,000 dollars in cloud costs across 127,000 calls in roughly eight hours.

Rate limiting in this setting is no longer a latency nicety. It is a cost, security, and reliability control at the same time.

Three failure modes, three keys

The three controls (per-user, per-tool, and per-tenant) are not arbitrary. Each answers a distinct failure mode, and each keys on a different identity dimension:

| Failure mode | What goes wrong | Key the limit on |
|---|---|---|
| Runaway agent or user | One user's agent loops and floods the server | The user |
| Expensive-tool abuse | A costly tool is called far too often | The tool |
| Noisy neighbor | One tenant starves the rest behind a shared resource | The tenant |

Why the distinction matters

Most teams instrument one of these and get surprised by the other two. A per-user limit does nothing to stop a single expensive tool from dominating, and neither a per-user nor a per-tool limit prevents one tenant from exhausting a shared upstream quota.

Defense in depth here means layering limits, which is exactly what OWASP's AI agent guidance recommends across gateway, application, and tool levels.
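As a rough illustration of layering, the sketch below checks all three keys before spending any budget, so a rejection at one layer does not quietly drain the others. It uses plain fixed-window counters to stay short; the class name, the limits, and the identifiers are all hypothetical, and a real deployment would more likely use token buckets in a shared store.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class LayeredLimiter:
    """Fixed-window counters for three identity dimensions (illustrative only)."""

    # Calls allowed per window for each layer, e.g. {"user": 2, "tool": 3, "tenant": 4}.
    limits: dict
    counts: dict = field(default_factory=dict)

    def check(self, user: str, tool: str, tenant: str) -> Tuple[bool, Optional[str]]:
        keys = {"user": user, "tool": tool, "tenant": tenant}
        # First pass: look for an exhausted layer without consuming anything.
        for layer, ident in keys.items():
            if self.counts.get((layer, ident), 0) >= self.limits[layer]:
                return False, layer  # report which layer rejected the call
        # Second pass: every layer has room, so charge all three.
        for layer, ident in keys.items():
            self.counts[(layer, ident)] = self.counts.get((layer, ident), 0) + 1
        return True, None

    def reset_window(self) -> None:
        """Call at the start of each window (scheduling is out of scope here)."""
        self.counts.clear()
```

Returning the name of the rejecting layer is a small design choice that pays off in logs: it tells you whether you are looking at a looping user, a hot tool, or a noisy tenant.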

The mechanism every limit shares

Whatever you key on, a rate limit is built from the same pieces:

An identity key, extracted from the request, that says whose budget this call draws from.
An algorithm that decides to allow or reject.
A storage backend that holds the counters.
An error response that tells the caller what happened and when to retry.

Choosing an algorithm

Two algorithms cover most cases, and the choice is a tradeoff.

Token bucket is the common default: it refills at a fixed rate and spends one or more tokens per call, which absorbs short bursts while holding a sustained average.
Sliding window is more precise because it tracks individual timestamps, at the cost of more memory.

For the bursty shape of agent traffic, token bucket with a burst allowance is usually the better starting point.
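To make the tradeoff concrete, here is a minimal in-memory sketch of both algorithms. It is single-process and not thread-safe, and the class and parameter names are illustrative rather than taken from any particular library. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from collections import deque


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` (the burst allowance)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class SlidingWindowLog:
    """Allows at most `limit` calls in any trailing `window` seconds."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps = deque()  # one timestamp per admitted call: the memory cost

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False
```

The token bucket keeps two numbers per key no matter how busy the key is, while the sliding window keeps one timestamp per admitted call, which is where its extra memory goes.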

Weighting by cost, and counting in sessions

Two refinements separate a naive limiter from a useful one.

First, weight calls by cost: a trivial read should not draw the same budget as a database query or an inference call, so expensive tools spend more tokens.

Second, count in sessions, not raw requests. Over Streamable HTTP, each MCP operation is its own HTTP POST, so one tool call is really a short sequence of POSTs: the handshake and tool discovery come before the call itself. A limit of five requests per second is therefore closer to one tool-call session per second once discovery and handshake are included.
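Both refinements are mostly arithmetic, which a short sketch can show. The tool names and weights below are hypothetical and would need to come from measured cost, and the figure of five POSTs per tool-call session is an assumption chosen to match the example above, not a property of the protocol.

```python
# Hypothetical tool names and weights; tune them to measured cost.
TOOL_COST = {"echo": 1, "read_file": 1, "run_query": 5, "run_inference": 10}
DEFAULT_COST = 1


def cost_of(tool: str) -> int:
    """Tokens a call to `tool` should spend; unknown tools get the default."""
    return TOOL_COST.get(tool, DEFAULT_COST)


def spend(budget: int, calls: list) -> tuple:
    """Admit calls in order while the remaining budget covers their cost."""
    admitted = []
    for tool in calls:
        cost = cost_of(tool)
        if cost <= budget:
            budget -= cost
            admitted.append(tool)
    return admitted, budget


def sessions_per_second(requests_per_second: float, posts_per_session: int) -> float:
    """Convert a raw request limit into tool-call sessions per second."""
    return requests_per_second / posts_per_session
```

With these weights, a budget of 12 tokens admits two `run_query` calls and one `echo` but not a third query, and `sessions_per_second(5, 5)` works out to one session per second.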

Per-user controls

Per-user limiting addresses the runaway agent. When one user's agent gets stuck in a retry cycle or fans out across many parallel conversations, a per-user budget contains the damage to that user rather than the whole server.

How to key it

On a remote virtual MCP server over Streamable HTTP, the caller is identified from the request, then used as the bucket key:

Use the authenticated identity, such as the bearer token subject or session identity, as the per-user key.
Give each user a token bucket with a sustained rate plus a burst allowance for natural sequences like listing then reading.
Watch the rejection rate. If a user sits at the limit for minutes, that is usually a loop to fix at the source, not just throttle.
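The three points above can be sketched as follows. This assumes the token has already been verified upstream and its claims are available as a dict; `user_key`, `PerUserLimiter`, and the `sub` claim as the identity source are illustrative choices, not a prescribed API.

```python
import time
from collections import defaultdict


def user_key(claims: dict) -> str:
    """Build the bucket key from already-verified token claims."""
    sub = claims.get("sub")
    if not sub:
        # Refuse rather than fall back to a shared bucket for anonymous callers.
        raise ValueError("no subject claim on request")
    return f"user:{sub}"


class PerUserLimiter:
    """One token bucket per user, plus counters for spotting stuck agents."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.state = {}  # key -> (tokens, last_seen)
        self.stats = defaultdict(lambda: [0, 0])  # key -> [allowed, rejected]

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self.state.get(key, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        ok = tokens >= 1
        if ok:
            tokens -= 1
        self.state[key] = (tokens, now)
        self.stats[key][0 if ok else 1] += 1
        return ok

    def rejection_rate(self, key: str) -> float:
        allowed, rejected = self.stats[key]
        total = allowed + rejected
        return rejected / total if total else 0.0
```

A rejection rate that stays high for one key over several minutes is the signal to go look for a loop, rather than a reason to raise the limit.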

Per-tool controls

Not all tools are equal, and per-tool limiting reflects that. A cheap echo and an expensive long-running query should never share one ceiling, because the expensive one is where cost and downstream load concentrate.

How to key it

Per-tool limiting reads the tool name out of the call and gives each tool its own counter. Gateway implementations do this by inspecting the JSON-RPC body:

Extract the tool name from the request and combine it with the method to form a descriptor, so each tool gets an independent bucket.
Set tight ceilings on expensive tools and loose ones on cheap reads. A common shape is something like three calls per minute for a heavy operation against ten per minute for ordinary ones.
Cache read-heavy tool results where you can, so repeated reads do not each spend budget.
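A minimal sketch of the descriptor step, assuming the body carries a single JSON-RPC message and that tool calls use the `tools/call` method with the tool name in `params.name`. The tool name `run_report` and the limits table are hypothetical; the three and ten per minute figures echo the shape mentioned above.

```python
import json

# Hypothetical per-descriptor ceilings, in calls per minute.
LIMITS_PER_MINUTE = {"tools/call:run_report": 3}
DEFAULT_LIMIT_PER_MINUTE = 10


def descriptor(body: bytes) -> str:
    """Return 'method:tool' for tool calls, or just the method otherwise."""
    msg = json.loads(body)
    if not isinstance(msg, dict):
        raise ValueError("expected a single JSON-RPC message")
    method = msg.get("method", "")
    if method == "tools/call":
        name = (msg.get("params") or {}).get("name", "")
        return f"{method}:{name}"
    return method


def limit_for(desc: str) -> int:
    """Tight ceiling for listed heavy tools, the default for everything else."""
    return LIMITS_PER_MINUTE.get(desc, DEFAULT_LIMIT_PER_MINUTE)
```

Combining the method with the tool name keeps discovery traffic such as `tools/list` in its own bucket, so listing tools never eats into a heavy tool's small allowance.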

Per-tenant controls

Per-tenant limiting is the cure for the noisy neighbor. In a multi-tenant product, one aggressive tenant can degrade or break service for everyone sharing the same resource, and per-tenant isolation is the only durable fix.

The root cause worth naming

The sharpest version of this problem comes from shared credentials.

When one OAuth credential is shared across many customers, the first agent to get throttled can trigger a cascade: the upstream returns a 429, and every other session behind that same credential inherits it. Reports describe the first call being throttled within twenty seconds and dozens of sessions failing behind it. Per-tenant buckets contain the rate, but the credential model underneath is what decides the blast radius.
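A toy simulation makes the starvation effect visible. It models one window of traffic against a single shared upstream quota, with and without a per-tenant cap at the edge. The tenant names and numbers are invented for illustration, and the model deliberately ignores timing and retries.

```python
from collections import Counter
from typing import Optional


def serve(arrivals: list, upstream_quota: int, per_tenant_cap: Optional[int] = None) -> dict:
    """Return how many calls each tenant got through in one window.

    arrivals: tenant ids in arrival order.
    upstream_quota: calls the shared credential may make in the window.
    per_tenant_cap: optional edge limit applied before the upstream is touched.
    """
    served = Counter()
    used = 0
    for tenant in arrivals:
        if per_tenant_cap is not None and served[tenant] >= per_tenant_cap:
            continue  # rejected at the edge; upstream quota is untouched
        if used >= upstream_quota:
            continue  # the upstream would answer 429 for everyone from here on
        used += 1
        served[tenant] += 1
    return dict(served)


# One noisy tenant arrives first with 20 calls; a quiet tenant follows with 3.
arrivals = ["noisy"] * 20 + ["quiet"] * 3
without_cap = serve(arrivals, upstream_quota=10)
with_cap = serve(arrivals, upstream_quota=10, per_tenant_cap=5)
```

Without the cap, the noisy tenant consumes the whole quota and the quiet tenant gets nothing. With a cap of five, both are served. The cap only helps because it fires before the shared quota is spent, which is why the credential model still decides the blast radius when the edge limit is missing or misjudged.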

The ceilings you do not own: your limits versus upstream limits

There is a second set of limits you do not control, and they fire whether or not you have your own. A single agent prompt can run into three independent ceilings:

The model provider's request and token limits, exposed through response headers you should read on every call.
Your own gateway or session limits, such as concurrency caps and tool timeouts.
The downstream SaaS limits, which vendors have tightened specifically to slow agent traffic.

Respecting them before they bite

Those downstream numbers are unforgiving at agent speed. Public figures include rates such as a few requests per second on some platforms and fixed hourly ceilings on others, none of which the model knows about. Your edge limit should be set with the tightest relevant downstream ceiling in mind, so your server slows traffic gracefully instead of relaying a wall of upstream 429s.
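One way to derive an edge limit is to normalize each downstream ceiling to calls per second, take the tightest, and leave some headroom. The sketch below does that; the 5,000-per-hour figure is the one cited earlier, while the three-per-second ceiling and the 0.8 headroom factor are assumptions for illustration.

```python
def edge_rate_per_second(ceilings: dict, headroom: float = 0.8) -> tuple:
    """Pick the tightest downstream ceiling and return (name, edge rate).

    ceilings: name -> (max_calls, per_seconds), e.g. (5000, 3600) for 5,000/hour.
    headroom: fraction of the tightest ceiling the edge is allowed to use.
    """
    rates = {name: calls / seconds for name, (calls, seconds) in ceilings.items()}
    tightest = min(rates, key=rates.get)
    return tightest, rates[tightest] * headroom


name, rate = edge_rate_per_second(
    {
        "hourly_api": (5000, 3600),  # fixed hourly ceiling
        "per_second_api": (3, 1),    # assumed "a few requests per second"
    }
)
```

Here the hourly ceiling turns out to be the tighter one, at about 1.39 calls per second, so the edge would sustain roughly 1.1 per second. Averaging an hourly quota this way is conservative: it gives up burst capacity the upstream would technically allow in exchange for never hitting the wall mid-hour.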

Returning a 429 an agent can act on

Why silent or signal-free limits fail

A rate limit is only as good as the message it returns. The common failure is a 429 with no machine-readable guidance, which leaves the client to do something arbitrary: retry immediately, give up, or surface a confusing error. Without an explicit signal, the client never learns to slow down and simply repeats the pattern moments later.
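For contrast, here is a sketch of what a client can do when the signal is present. It reads `Retry-After`, which HTTP allows as either a number of seconds or an HTTP date, and falls back to capped exponential backoff only when the header is missing. The function name, the defaults, and the plain-dict header lookup are assumptions; a production client would also add jitter.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(headers: dict, attempt: int, now=None, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after a 429."""
    value = headers.get("Retry-After")
    if value is not None:
        value = value.strip()
        if value.isdigit():
            return float(value)  # delta-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable signal: guess with exponential backoff, capped.
    return min(cap, base * 2 ** attempt)
```

The fallback branch is exactly the arbitrary behavior described above: without a header, the client can only guess, and its guess has no relation to when the budget actually resets.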

What a good response includes

A useful rejection tells the caller how to behave, and a robust client reads it:

Return standard rate-limit headers that state how much budget remains and when it resets.
Read upstream provider headers too, since model request and token budgets can sit in separate buckets.
Guard against silent throttling. Some providers return a success with empty data when limited, so a sentinel check, such as alerting when a result set drops sharply against the last good call, catches what a status code would have missed.
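The server side of the first point and the sentinel check from the last can be sketched in a few lines. `Retry-After` is a standard HTTP header; the `RateLimit-*` names below follow one common convention from an IETF draft, and many APIs use `X-RateLimit-*` variants instead, so treat the exact names as an assumption. The 0.2 drop ratio in the sentinel is likewise an arbitrary starting point.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_after_s: float) -> dict:
    """Headers to attach to a 429 so the caller knows when to come back."""
    reset = max(0, math.ceil(reset_after_s))  # round up so clients never retry early
    return {
        "Retry-After": str(reset),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(reset),
    }


def looks_silently_throttled(status: int, row_count: int, last_good_count: int,
                             drop_ratio: float = 0.2) -> bool:
    """Flag a 'successful' response whose result set collapsed versus the last good call."""
    if status != 200 or last_good_count <= 0:
        return False  # real errors are handled elsewhere; no baseline means no verdict
    return row_count < last_good_count * drop_ratio
```

The sentinel will produce false positives when data legitimately shrinks, so it is better suited to raising an alert than to triggering automatic retries.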

The identity foundation: where Scalekit fits, and where it does not

Rate limiting is not a Scalekit feature (not yet), and Scalekit does not sit in the request path enforcing token buckets for you. The enforcement layer described above lives in your MCP server or in a gateway in front of it. If you build it into the server itself, the FastMCP rate-limiting middleware ships both token bucket and sliding window implementations to start from.

What Scalekit does provide

What every one of these controls needs first is a reliable identity to key on, and that is the part Scalekit supplies:

Per-user identity through connected accounts, so a limit can attach to the specific authorizing user rather than a shared agent.
Per-tenant isolation through a per-tenant vault and namespacing, so a tenant key is meaningful and not leaky.
Per-tool scoping through declared, scoped tool surfaces, so a per-tool limit lines up with the tools an agent role can actually call.

The credential model changes the blast radius

The worst noisy-neighbor case above came from a shared credential, where one tenant's throttling cascaded to everyone behind it. Scalekit resolves a separate per-user credential for each call rather than a shared bot token, so each user's calls hit their own upstream quota instead of a common one. That does not throttle anyone by itself, but it changes where an upstream 429 lands, which is half the per-tenant problem. You can see the per-user credential model on any connector page, and the auth setup in the MCP authorization quickstart.

Conclusion: limit on identity, signal on rejection

Rate limiting for virtual MCP servers comes down to two disciplines. Key your limits on the right identity, and layer them so the three failure modes are each covered:

Per-user buckets to contain a runaway agent.
Cost-weighted per-tool buckets to protect expensive operations.
Per-tenant isolation to stop one customer starving the rest.

Where to start

Decide what each limit keys on before you tune any numbers, because the keys are the architecture and the numbers are just configuration. Get a clean per-user, per-tool, and per-tenant identity in place first, set conservative buckets with burst room, and make every rejection return a signal an agent can actually act on.

For the auth and identity groundwork underneath, the SSO-backed MCP authentication guide and the secure MCP server writeup go deeper than this overview can.
