Part 3 of 7 in the “APIM for AI Workloads” series
The Azure API Management token limit policy is one of the most direct cost control levers you have for AI workloads. In Part 1 of this series, I argued that token consumption is invisible without the right instrumentation. The token limit policy is the enforcement side of that equation: once you know how many tokens consumers are using, you set boundaries so that no single consumer can exhaust your model capacity or run up an unexpected bill.
This post covers how the policy works, which counter-key strategy to choose for your workload, how to size your tokens-per-minute (TPM) limits, and the difference between the Azure OpenAI-specific policy and the generic LLM variant for non-Microsoft backends.
Azure API Management Token Limit Policy: How It Works
The azure-openai-token-limit policy sits in the inbound section of your APIM policy pipeline. Before any request reaches the AI backend, APIM checks a sliding window counter keyed to the value you specify. If the caller is within their TPM budget, the request passes through. If they’ve exceeded it, APIM returns a 429 Too Many Requests response with a Retry-After header, and the backend never sees the request.
This is important: the throttling happens at the gateway, not at the Azure OpenAI endpoint. That means you’re not paying for rejected requests, and your model deployment is protected from saturation by a single runaway consumer.

The policy has two variants. The azure-openai-token-limit policy is purpose-built for Azure OpenAI and Microsoft Foundry endpoints, and uses the actual token counts returned in the API response. The llm-token-limit policy is the generic variant for any LLM backend, including Mistral, Cohere, and others. Both share the same attribute model, so the configuration patterns below apply to either.
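As a minimal sketch, an inbound section using the attributes discussed below might look like this (the tokens-per-minute value and the variable name are illustrative, not recommendations):

```xml
<policies>
    <inbound>
        <base />
        <!-- Reject at the gateway once this subscription's sliding-window
             budget of 5000 tokens per minute is exhausted; for a non-Microsoft
             backend, swap in llm-token-limit with the same attributes -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="false"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```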
Choosing a counter-key for Azure API Management Token Limiting
The counter-key attribute is the most important decision in configuring the token limit policy. It determines the scope of the limit: who shares a TPM bucket, and who gets their own.

The three main strategies are:
- Per subscription: @(context.Subscription.Id). This is the most common pattern for internal enterprise use. Each API product subscription gets its own TPM counter, which maps cleanly to a team, a product, or a cost center. Combined with the Token Metric policy covered in Part 4, this provides per-subscriber cost visibility and enforcement in a single configuration.
- Per IP address: @(context.Request.IpAddress). Better suited to public-facing endpoints or developer portals where you don’t have a subscription model. It’s a blunt instrument — NAT and shared egress can mean multiple users share a counter — but it’s effective for abuse prevention and trial access scenarios.
- Per JWT claim or custom header: @(context.Request.Headers.GetValueOrDefault("x-user-id","")). The most flexible option. If your application passes a user identifier in a header or JWT claim, you can scope limits to the individual user. This is the right approach for multi-tenant applications where each end user should have their own token budget, independent of which subscription they’re calling through.
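Each strategy changes only the counter-key expression. A sketch of the three side by side — the TPM values and the x-user-id header name are placeholders, and the inner quotes are XML-escaped as &amp;quot; because the expression sits inside an attribute:

```xml
<!-- Per subscription: one TPM bucket per team, product, or cost center -->
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="10000" />

<!-- Per IP address: blunt, but useful for public or trial endpoints -->
<azure-openai-token-limit counter-key="@(context.Request.IpAddress)"
    tokens-per-minute="2000" />

<!-- Per custom header: one bucket per end user in a multi-tenant app -->
<azure-openai-token-limit
    counter-key="@(context.Request.Headers.GetValueOrDefault(&quot;x-user-id&quot;, &quot;&quot;))"
    tokens-per-minute="1000" />
```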
Sizing Your TPM Limits
TPM limits are context-dependent, but a few principles apply across most workloads.
Start by profiling your actual token usage in a staging environment before setting production limits. The remaining-tokens-variable-name attribute exposes the remaining token budget as a policy variable, which you can log via the Token Metric policy to build a usage baseline before enforcing hard limits.
For the estimate-prompt-tokens attribute: set it to false in production. When set to true, APIM estimates prompt tokens before the response is returned, enabling earlier throttling but reducing accuracy. In practice, counting actual tokens from the response is more reliable and avoids throttling requests that would have been within budget.
A common mistake is setting a single global TPM limit too low, which throttles all consumers the moment a batch job runs on any team. The better pattern is tiered limits by API product: a Developer product with a low TPM ceiling, a Standard product for normal workloads, and an Unlimited product for production pipelines that need burst capacity.
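One way to sketch the tiered pattern is a token limit policy attached at each API product's scope, with ceilings sized from your own usage baseline (the numbers here are placeholders):

```xml
<!-- Developer product: low ceiling for experimentation -->
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="1000" />

<!-- Standard product: sized for normal interactive workloads -->
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="20000" />

<!-- Unlimited product: no token limit policy attached, or a very
     high ceiling kept only as a safety backstop for runaway jobs -->
```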
Handling 429 Responses in Calling Applications
Any application calling an APIM-fronted AI endpoint needs to handle 429 responses gracefully. APIM returns a Retry-After header indicating how many seconds until the token window resets. Well-behaved clients respect this header and back off rather than retrying immediately.
For agentic workloads with multiple pipeline steps, a 429 response midway through can leave the agent in an inconsistent state. The recommended pattern is to expose the remaining-tokens-variable-name value in a response header so the calling application can monitor its own budget and slow down proactively, rather than waiting for a hard rejection.
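A sketch of that pattern, assuming the variable was populated by remaining-tokens-variable-name in the inbound section; the x-remaining-tokens header name is an arbitrary choice:

```xml
<inbound>
    <base />
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        remaining-tokens-variable-name="remainingTokens" />
</inbound>
<outbound>
    <base />
    <!-- Surface the remaining budget so clients can slow down
         proactively instead of waiting for a hard 429 -->
    <set-header name="x-remaining-tokens" exists-action="override">
        <value>@(context.Variables.ContainsKey("remainingTokens") ? context.Variables["remainingTokens"].ToString() : "")</value>
    </set-header>
</outbound>
```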
The Azure OpenAI token limit policy documentation covers the full attribute reference, including tokens-per-minute, counter-key, estimate-prompt-tokens, and remaining-tokens-variable-name. The llm-token-limit variant has the same interface for non-Azure backends.
What’s Next in This Azure API Management for AI Series
Here’s the rest of the series:
- Part 4: Token Metric policy — emitting token usage data to Application Insights broken down by consumer dimensions, and using that data for internal cross-charging and spend dashboards.
- Part 5: Load balancing and circuit breaking across PTU and PAYG backends.
- Part 6: Semantic caching — reducing token consumption with similarity-based response reuse.
- Part 7: APIM as an MCP gateway for agentic AI workloads.