Part 1 of 7 in the “APIM for AI Workloads” series
Over the past year, I’ve been doing a lot of work with integration services, Azure API Management in particular, and more recently with AI adoption: evaluating models, designing agentic architectures, and figuring out how to govern AI consumption responsibly across the organization. One thing that keeps coming up in those conversations is a question that sounds almost too basic to ask: Who is keeping track of what we’re spending on tokens?
The answer, more often than not, is nobody.
That’s the problem this series is about. AI APIs are fundamentally different from the REST APIs we’ve been managing for the past decade, and the differences matter operationally. Before we dive into the mechanics of Azure API Management for AI workloads in subsequent posts (policies, load balancing, semantic caching), I want to make the case for why you need a gateway layer in front of your AI services.
Tokens Are Not Requests
Traditional API management was built around a relatively simple model: count the requests, enforce rate limits, log the traffic, and call it done. One call in, one response out. The cost model was predictable.
AI APIs broke that model completely.
When you call an Azure OpenAI endpoint, you’re not paying per request. You’re paying per token. And token counts are invisible at the API gateway layer unless you specifically instrument for them. A single call from a conversational agent might consume 500 tokens. A call from a poorly optimized batch process might consume 50,000. Both look the same at the HTTP level: one POST, one 200 OK.
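The counts do exist, but only in the response body that goes back to the caller. A chat completion response from Azure OpenAI reports them in a usage object shaped roughly like this (the numbers are illustrative):

```json
{
  "choices": [ "..." ],
  "usage": {
    "prompt_tokens": 412,
    "completion_tokens": 88,
    "total_tokens": 500
  }
}
```

Unless the gateway reads and records that field per request, the spend never shows up anywhere except the monthly bill.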
This creates a blind spot that grows more dangerous as AI adoption scales across an organization. Teams start building intelligent apps (conversational agents, personalized content generators, voice assistants, copilots), and each one independently calls AI backend services. Nobody has a view across the whole estate of what’s being consumed, by whom, and at what cost.
The diagram below shows what this looks like in practice: multiple application types hitting multiple AI providers, with token-based pricing models sitting underneath.

The Three Problems Azure API Management for AI Solves
Azure API Management acts as the centralized control plane between your intelligent applications and your AI backends. It addresses three distinct categories of problems.
Performance optimization: AI model endpoints have throughput limits. Azure OpenAI Provisioned Throughput Units (PTU) give you reserved capacity at a fixed price, but throughput is capped at a hard ceiling. Pay-as-you-go (PAYG) endpoints scale elastically, but at sustained high volumes they typically end up costing more than an equivalent PTU reservation. Without a gateway layer, individual applications have no way of knowing whether PTU capacity is available or saturated. A gateway can make that routing decision automatically, serving from PTU when it has headroom and falling back to PAYG when it doesn’t. That’s a meaningful cost optimization with no changes required to the calling applications.
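A sketch of what that routing looks like in policy terms, assuming a hypothetical backend pool named openai-backend-pool (the pool itself, including the PTU-first priority and the circuit breaker, is defined on the APIM backend resources rather than in the policy, and is Part 5’s topic):

```xml
<policies>
    <inbound>
        <base />
        <!-- Send every request to a load-balanced backend pool; the pool is
             configured to prefer the PTU deployment and spill over to PAYG
             when the PTU backend throttles. -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
</policies>
```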
Cost control: If you’re centralizing AI spend, which most organizations do at least initially, tokens consumed by one team land on another team’s budget. Without per-consumer visibility into token usage, internal cross-charging and showback are impossible. APIM’s token metric policies make this tractable by emitting token consumption data broken down by dimensions such as User ID, Subscription ID, or API product, all of which feeds into Application Insights for dashboarding and alerting.
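As a preview of Part 4, the policy itself is small; the namespace and the set of dimensions you emit are choices you make per API:

```xml
<azure-openai-emit-token-metric namespace="AzureOpenAI">
    <dimension name="Subscription ID" />
    <dimension name="Product ID" />
    <dimension name="User ID" />
</azure-openai-emit-token-metric>
```

Assuming Application Insights is wired up as the logger for the API, the metrics land there and can be sliced by exactly those dimensions.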
Data security: Routing AI traffic through a managed gateway gives you a single enforcement point for authentication, authorization, and policy. You can validate JWT claims, require subscription keys from API consumers, use Managed Identity to authenticate to Azure OpenAI without exposing credentials, and ensure traffic never leaves your controlled perimeter. Without a gateway, every team builds its own auth story, or more commonly, skips it.
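A minimal sketch of the inbound section, assuming an Entra ID tenant and a hypothetical audience value (Part 2 walks through each of these policies properly):

```xml
<inbound>
    <base />
    <!-- Reject callers that don't present a valid Entra ID token for this gateway. -->
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401">
        <openid-config url="https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration" />
        <audiences>
            <audience>api://my-ai-gateway</audience>
        </audiences>
    </validate-jwt>
    <!-- Exchange the caller's credential for the gateway's Managed Identity when
         calling Azure OpenAI, so no API keys ever live in the applications. -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com"
                                     output-token-variable-name="msi-token" ignore-error="false" />
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (string)context.Variables["msi-token"])</value>
    </set-header>
</inbound>
```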
PTU vs. PAYG: Why the Billing Model Shapes Your Architecture
Before we go further, it’s worth spending a moment on the two Azure OpenAI billing models, because they have direct architectural implications.
Provisioned Throughput Units (PTU) give you reserved capacity on a model. You pay a fixed hourly rate regardless of how many tokens you actually consume. The benefits are predictable costs and guaranteed throughput. The risk is waste if your utilization is low, and hard throttling if you exceed the provisioned limit.
Pay-as-you-go (PAYG) charges per token consumed. No upfront commitment, no capacity ceiling, but costs scale linearly with usage and can surprise you if consumption spikes.
Most production AI deployments end up using both: PTU for baseline load, where utilization is predictable, and PAYG as an overflow layer. This makes a load balancer with circuit breaking essential, which we’ll cover in Part 5 of this series.
The same logic applies beyond Azure OpenAI. APIM now supports generic LLM backends via the llm-* policy family, which means you can manage traffic to Mistral, Cohere, LLaMA, and other providers through the same control plane. The diagram below shows this architecture: APIM in the center, with load balancing across PTU and PAYG instances, token metrics flowing to Application Insights, and the full provider landscape behind it.

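To make the “same control plane” point concrete, here is a sketch of the generic token limit policy applied per subscription; the counter key and the 10,000 tokens-per-minute figure are illustrative values, and Part 3 covers how to size them:

```xml
<llm-token-limit counter-key="@(context.Subscription.Id)"
                 tokens-per-minute="10000"
                 estimate-prompt-tokens="true"
                 remaining-tokens-header-name="x-remaining-tokens" />
```

The same policy applies whether the backend behind it is Azure OpenAI or another provider exposed through an OpenAI-compatible API, which is exactly what makes a single control plane possible.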
What This Looks Like in Practice
Let me make this concrete with a scenario I’ve seen play out multiple times.
An organization deploys its first Azure OpenAI service for a conversational agent. A few months later, a second team wants to use AI for content generation. Then a third team builds an internal copilot. Each team provisions its own Azure OpenAI resource, authenticates directly, and manages its own rate limiting. There’s no visibility into combined spend. No shared capacity optimization. No centralized audit trail.
This is the point where someone in finance asks a question that nobody can answer: “How much are we spending on AI, and which team is spending what?”
Centralizing AI traffic through APIM is how you get out of that situation before it becomes a problem. The policy-based approach means you can add governance without changing anything in the calling applications. They call the APIM endpoint; APIM handles the rest.
Azure API Management for AI Workloads: What’s Coming in This Series
The next six posts will go deep on the specific capabilities that make APIM a serious AI control plane:
- Part 2 covers authentication and authorization: JWT validation, Managed Identity, and subscription keys.
- Part 3 covers the Token Limit policy: enforcing tokens-per-minute limits per consumer.
- Part 4 covers the Token Metric policy: emitting usage data for observability and cross-charging.
- Part 5 covers load balancing and circuit breaking across PTU and PAYG backends.
- Part 6 covers semantic caching: reducing token consumption by serving cached responses for similar prompts.
- Part 7 covers APIM’s emerging role as an MCP gateway for agentic AI workloads.
Each post will include the relevant policy XML, real-world sizing guidance, and the architectural decisions behind the patterns.
If you’re building AI-powered applications at scale and you’re not yet routing that traffic through a gateway, the rest of this series is for you.



