Azure API Management Token Limit Policy: Controlling AI Token Consumption Per Consumer

Part 3 of 7 in the “APIM for AI Workloads” series

The Azure API Management token limit policy is one of the most direct cost control levers you have for AI workloads. In Part 1 of this series, I argued that token consumption is invisible without the right instrumentation. The token limit policy is the enforcement side of that equation: once you know how many tokens consumers are using, you set boundaries so that no single consumer can exhaust your model capacity or run up an unexpected bill.

This post covers how the policy works, which counter-key strategy to choose for your workload, how to size your tokens-per-minute (TPM) limits, and the difference between the Azure OpenAI-specific policy and the generic LLM variant for non-Microsoft backends.

Azure API Management Token Limit Policy: How It Works

The azure-openai-token-limit policy sits in the inbound section of your APIM policy pipeline. Before any request reaches the AI backend, APIM checks a sliding window counter keyed to the value you specify. If the caller is within their TPM budget, the request passes through. If they’ve exceeded it, APIM returns a 429 Too Many Requests response with a Retry-After header, and the backend never sees the request.

This is important: the throttling happens at the gateway, not at the Azure OpenAI endpoint. That means you’re not paying for rejected requests, and your model deployment is protected from saturation by a single runaway consumer.

Azure API Management token limit policy funnel throttling AI requests with 429 response and Retry-After header
Diagram 1: The token limit policy acts as a funnel in the APIM inbound pipeline. Requests within the TPM budget pass through to the AI backend. Requests exceeding the limit receive a 429 status code with a Retry-After header before the backend is even reached.

The policy has two variants. The azure-openai-token-limit policy is purpose-built for Azure OpenAI and Microsoft Foundry endpoints, and uses the actual token counts returned in the API response. The llm-token-limit policy is the generic variant for any LLM backend, including Mistral, Cohere, and others. Both share the same attribute model, so the configuration patterns below apply to either.

Choosing a counter-key for Azure API Management Token Limiting

The counter-key attribute is the most important decision in configuring the token limit policy. It determines the scope of the limit: who shares a TPM bucket, and who gets their own.

Azure API Management token limit policy counter-key strategies per subscription IP address and JWT claim with TPM sizing table
Diagram 2: Three counter-key strategies and TPM sizing guidance by workload type. The right scope depends on whether you are separating teams, protecting a public endpoint, or enforcing per-user limits in a multi-tenant application.

The three main strategies are:

Per IP address: @(context.Request.IpAddress). Better suited to public-facing endpoints or developer portals where you don’t have a subscription model. It’s a blunt instrument — NAT and shared egress can mean multiple users share a counter — but it’s effective for abuse prevention and trial access scenarios.

Per JWT claim or custom header: @(context.Request.Headers.GetValueOrDefault(“x-user-id”,””)). The most flexible option. If your application passes a user identifier in a header or JWT claim, you can scope limits to the individual user. This is the right approach for multi-tenant applications where each end user should have their own token budget, independent of which subscription they’re calling through.

Sizing Your TPM Limits

TPM limits are context-dependent, but a few principles apply across most workloads.

Start by profiling your actual token usage in a staging environment before setting production limits. The remaining-tokens-variable-name attribute exposes the remaining token budget as a policy variable, which you can log via the Token Metric policy to build a usage baseline before enforcing hard limits.

For the estimate-prompt-tokens attribute: set it to false in production. When set to true, APIM estimates prompt tokens before the response is returned, enabling earlier throttling but reducing accuracy. In practice, counting actual tokens from the response is more reliable and avoids throttling requests that would have been within budget.

A common mistake is setting a single global TPM limit too low, which throttles all consumers the moment a batch job runs on any team. The better pattern is tiered limits by API product: a Developer product with a low TPM ceiling, a Standard product for normal workloads, and an Unlimited product for production pipelines that need burst capacity.

Handling 429 Responses in Calling Applications

Any application calling an APIM-fronted AI endpoint needs to handle 429 responses gracefully. APIM returns a Retry-After header indicating how many seconds until the token window resets. Well-behaved clients respect this header and back off rather than retrying immediately.

For agentic workloads with multiple pipeline steps, a 429 response midway through can leave the agent in an inconsistent state. The recommended pattern is to expose the remaining-tokens-variable-name value in a response header so the calling application can monitor its own budget and slow down proactively, rather than waiting for a hard rejection.

The Azure OpenAI token limit policy documentation covers the full attribute reference, including tokens-per-minute, counter-key, estimate-prompt-tokens, and remaining-tokens-variable-name. The llm-token-limit variant has the same interface for non-Azure backends.

What’s Next in This Azure API Management for AI Series

Part 4 covers the Token Metric policy: how to emit token usage data to Application Insights broken down by consumer dimensions, and how to use that data for internal cross-charging and spend dashboards.

Azure API Management for AI: Securing Your AI APIs with Authentication and Authorization

Part 2 of 7 in the “APIM for AI Workloads” series

In Part 1 of this series, I made the case for why Azure API Management for AI workloads is the right control plane for governing AI traffic across an organization. This post gets practical: how do you actually secure access to your AI backends with APIM without creating a credential-management nightmare?

Security is where many AI projects cut corners, and understandably so. When you’re moving fast to prove value with a new model, authentication feels like overhead. But AI endpoints are expensive, and an unsecured Azure OpenAI endpoint is a real risk: anyone with the URL and key can start consuming tokens at your cost. At scale, that’s a significant financial and compliance exposure.

APIM addresses this with a three-layer security model. Let’s walk through each layer.

Azure API Management for AI Security: A Three-Layer Model

The authentication and authorization pattern in APIM is deliberately layered. Each layer answers a different question and operates independently, so a failure at any layer stops the request before it reaches the AI backend.

Azure API Management for AI three-layer authentication flow showing subscription key, JWT validation and Managed Identity policy pipeline
Diagram 1: Three-layer auth in APIM for AI workloads. Layer 1 identifies the caller via subscription key. JWT validation in Layer 2 then determines what they’re permitted to do. Finally, Layer 3 authenticates APIM itself to the AI backend via Managed Identity.

The three layers are:

  • Subscription keys to identify and track API consumers.
  • JWT validation to enforce fine-grained access control based on claims.
  • Managed Identity to authenticate APIM to Azure OpenAI without storing credentials.

Each layer has a distinct role. Confusing them is a common mistake, so it’s worth being explicit about what each one does and does not do.

Layer 1: Subscription Keys

Subscription keys are APIM’s mechanism for identifying API consumers. When you create an API product in APIM and require a subscription, callers must include their key in the Ocp-Apim-Subscription-Key header. APIM validates the key, maps it to a subscriber, and lets the request proceed.

This is important for AI workloads specifically because subscription keys enable per-consumer token tracking. When you combine subscription key validation with the Token Metric policy we’ll cover in Part 4, you get usage data broken down by subscriber, which is the foundation of any internal cross-charging model.

Subscription keys answer the question: Who is calling? They don’t answer what the caller is allowed to do. For that, you need JWT validation.

Layer 2: JWT Validation and Claims-Based Authorization

The validate-jwt policy is where you enforce what a caller is permitted to do. It validates the JWT token in the Authorization header against your identity provider, and can inspect any claim in the token to make authorization decisions.

For Azure OpenAI specifically, this is where you control which teams or applications can access which model deployments. A team working on an internal chatbot should not be able to call a GPT-4o deployment reserved for a production workload. JWT claims let you enforce that boundary at the gateway layer, with no changes required in the calling application.

A typical policy checks the token signature against your Azure AD tenant’s OpenID Connect configuration, then validates that a required scope or role claim is present:

The failed-validation-httpcode=”401″ attribute ensures unauthenticated callers get a clean rejection before they ever reach the backend. You can also use failed-validation-error-message to return a specific error message, which helps consumers debug auth failures without exposing internal details.

For multi-provider setups where you’re routing to non-Azure backends like Mistral or Cohere, the same JWT policy applies. The claims model is provider-agnostic, which is one of the advantages of centralizing auth in APIM rather than handling it per-backend.

Layer 3: Managed Identity for Backend Authentication

Managed Identity is the most important security improvement you can make when setting up Azure API Management for AI. It replaces the pattern of storing an Azure OpenAI API key in APIM’s named values with a system-assigned or user-assigned Managed Identity that APIM uses to authenticate directly to Azure OpenAI via Azure AD.

Azure API Management for AI comparing API key authentication risks versus Managed Identity benefits for Azure OpenAI backend access
Diagram 2: API key authentication (left) vs. Managed Identity (right). The key difference is that Managed Identity requires no stored credentials anywhere in your configuration.

The practical difference is significant. With API key authentication, you have a long-lived secret that needs to be stored, rotated, and kept out of source control. With Managed Identity, there is no secret. APIM requests a short-lived token from Azure AD at runtime, and Azure AD issues it based on the APIM instance’s identity. Nothing is stored. Nothing can leak.

The configuration is a single policy element in the inbound section: <authentication-managed-identity resource=”https://cognitiveservices.azure.com”/&gt;. APIM handles the rest, automatically fetching and refreshing the token.

On the Azure OpenAI side, you grant the APIM instance’s Managed Identity the Cognitive Services User role on the Azure OpenAI resource. That’s the minimum required permission. You can scope it further to specific deployments if needed.

For organizations in regulated industries, such as healthcare, financial services, and government, Managed Identity is not optional. It satisfies Zero Trust authentication requirements and produces a full audit trail in Azure Monitor, tied to the APIM instance identity rather than a shared key.

Azure API Management for AI: Putting the Three Layers Together

In a production setup, all three layers run sequentially within the inbound policy pipeline. A request arrives with a subscription key and a JWT. APIM validates the key first (fast, no external call), then validates the JWT against Azure AD, then forwards the request to Azure OpenAI using its Managed Identity token. The AI backend never sees the caller’s JWT, and APIM never stores an API key.

The result is a clean separation of concerns:

  • The calling application manages its own JWT (issued by Azure AD based on its own identity or the user’s identity).
  • APIM enforces the authorization policy without the backend needing to know anything about it.
  • The AI backend trusts only APIM’s Managed Identity, not arbitrary callers.

This is the architecture you want before you go to production with any AI workload that touches sensitive data or incurs meaningful cost.

What’s Next in This Series

Part 3 covers the Token Limit policy: how to enforce tokens-per-minute limits per consumer, configure throttling behavior, and handle the differences between the azure-openai-token-limit and llm-token-limit policy variants.

Azure API Management for AI: Why Your APIs Need a Gateway

Part 1 of 7 in the “APIM for AI Workloads” series

Over the past year, I’ve been doing a lot of work with integration services, including Azure API Management and, recently, also on AI adoption: evaluating models, designing agentic architectures, and figuring out how to govern AI consumption across the organization responsibly. One thing that keeps coming up in those conversations is a question that sounds almost too basic to ask: Who is keeping track of what we’re spending on tokens?

The answer, more often than not, is nobody.

That’s the problem this series is about. AI APIs are fundamentally different from the REST APIs we’ve been managing for the past decade, and the differences matter operationally. Before we dive into the mechanics of Azure API Management policies, load balancing, and semantic caching in subsequent posts, I want to make the case for a gateway layer in front of your AI services. Before we dive into the mechanics of Azure API Management for AI workloads, policies, load balancing, and semantic caching, I want to make the case for why you need a gateway layer in front of your AI services.

Tokens Are Not Requests

Traditional API management was built around a relatively simple model: count the requests, enforce rate limits, log the traffic, and call it done. One call in, one response out. The cost model was predictable.

AI APIs broke that model completely.

When you call an Azure OpenAI endpoint, you’re not paying per request. You’re paying per token. And a token count is invisible at the API gateway layer unless you specifically instrument for it. A single call from a conversational agent might consume 500 tokens. A call from a poorly-optimized batch process might consume 50,000. Both look the same at the HTTP level: one POST, one 200 OK.

This creates a blind spot that grows dangerously as AI adoption scales across an organization. Teams start building intelligent apps: conversational agents, personalized content generators, voice assistants, copilots, and each one is independently calling AI backend services. Nobody has a view across the whole estate of what’s being consumed, by whom, and at what cost.

The diagram below shows what this looks like in practice: multiple application types hitting multiple AI providers, with token-based pricing models sitting underneath.

Azure API Management control plane between intelligent apps and AI providers showing PTU and PAYG token billing
Diagram 1: Intelligent applications on the left, AI service providers on the right, with both PTU and PAYG billing models underneath. Without a control plane in the middle, you’re flying blind.

The Three Problems Azure API Management for AI Solves

Azure API Management acts as the centralized control plane between your intelligent applications and your AI backends. It addresses three distinct categories of problems.

Performance optimization: AI model endpoints have throughput limits. Azure OpenAI Provisioned Throughput Units (PTU) give you reserved capacity at a fixed price, but cap out at a hard ceiling. Pay-as-you-go (PAYG) endpoints scale elastically but incur higher per-token costs at high volumes. Without a gateway layer, individual applications can’t know whether PTU capacity is available or saturated. A gateway can make that routing decision automatically, serving from PTU when it has headroom, falling back to PAYG when it doesn’t. That’s a meaningful cost optimization with no changes required to the calling applications.

Cost control: Tokens consumed by one team are costs borne by another team’s budget if you’re centralizing AI spend, which most organizations will do, at least initially. Without per-consumer visibility into token usage, internal cross-charging and showback are impossible. APIM’s token metric policies make this tractable by emitting token consumption data broken down by dimensions such as User ID, Subscription ID, or API product, all of which feed into Application Insights for dashboarding and alerting.

Data security: Routing AI traffic through a managed gateway gives you a single enforcement point for authentication, authorization, and policy. You can validate JWT claims, require subscription keys from API consumers, use Managed Identity to authenticate to Azure OpenAI without exposing credentials, and ensure traffic never leaves your controlled perimeter. Without a gateway, every team builds its own auth story, or more commonly, skips it.

PTU vs. PAYG: Why the Billing Model Shapes Your Architecture

Before we go further, it’s worth spending a moment on the two Azure OpenAI billing models, because they have direct architectural implications.

Provisioned Throughput Units (PTU) give you reserved capacity on a model. You pay a fixed hourly rate regardless of how many tokens you actually consume. The benefits are predictable costs and guaranteed throughput. The risk is waste if your utilization is low, and hard throttling if you exceed the provisioned limit.

Pay-as-you-go (PAYG) charges per token consumed. No upfront commitment, no capacity ceiling, but costs scale linearly with usage and can surprise you if consumption spikes.

Most production AI deployments end up using both: PTU for baseline load, where utilization is predictable, and PAYG as an overflow layer. This makes a load balancer with circuit breaking essential, which we’ll cover in Part 5 of this series.

The same logic applies beyond Azure OpenAI. APIM now supports generic LLM backends via the llm-* policy family, which means you can manage traffic to Mistral, Cohere, LLaMA, and other providers through the same control plane. The diagram below shows this architecture: APIM in the center, with load balancing across PTU and PAYG instances, token metrics flowing to Application Insights, and the full provider landscape behind it.

Azure API Management AI control plane with token limit, token metric, load balancing, semantic caching and circuit breaker policies across PTU and PAYG backends
Azure API Management as the centralized AI control plane, with performance, cost, and security governance across multiple providers and billing models.

What This Looks Like in Practice

Let me make this concrete with a scenario I’ve seen play out multiple times.

An organization deploys its first Azure OpenAI service for a conversational agent. A few months later, a second team wants to use AI for content generation. Then a third team builds an internal copilot. Each team provisions its own Azure OpenAI resource, authenticates directly, and manages its own rate limiting. There’s no visibility into combined spend. No shared capacity optimization. No centralized audit trail.

This is the point where someone in finance asks a question that nobody can answer: “How much are we spending on AI, and which team is spending what?”

Centralizing AI traffic through APIM is how you get out of that situation before it becomes a problem. The policy-based approach means you can add governance without changing anything in the calling applications. They call the APIM endpoint, APIM handles the rest.

Azure API Management for AI Workloads: What’s Coming in This Series

The next six posts will go deep on the specific capabilities that make APIM a serious AI control plane:

Each post will include the relevant policy XML, real-world sizing guidance, and the architectural decisions behind the patterns.

If you’re building AI-powered applications at scale and you’re not yet routing that traffic through a gateway, the rest of this series is for you.