Azure API Management as MCP Gateway: Governing Agentic AI Workloads

Part 7 of 7 in the “APIM for AI Workloads” series

Azure API Management as MCP gateway is the natural endpoint of everything this series has built. In Parts 1 through 6, we established APIM as the control plane for AI workloads: securing access, limiting and measuring token consumption, routing traffic resiliently across backends, and reducing costs through semantic caching. All of that applies equally to agentic workloads. The difference is that agents introduce a new communication pattern: the Model Context Protocol (MCP), which standardizes how AI agents discover and call tools.

In my work and online research on agentic AI architecture, I consistently returned to the same question: how does one govern agent tool calls with the same rigor we apply to API calls? The answer, increasingly, is that APIM handles both. This post covers what that looks like in practice.

What MCP Is and Why It Changes the APIM Story

MCP is an open protocol, originally developed by Anthropic, that defines a standard interface between AI agents (MCP clients) and the tools they call (MCP servers). Instead of each agent framework implementing its own bespoke tool-calling mechanism, MCP gives agents a consistent way to discover available tools, understand their input schemas, and invoke them. Frameworks including Semantic Kernel, AutoGen, and LangGraph are all adding MCP client support.
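To make the discover-then-invoke sequence concrete, the sketch below builds the two JSON-RPC 2.0 messages an MCP client sends: a tools/list request to discover tools and a tools/call request to invoke one. The method names and envelope shape follow the MCP specification. The tool name get_order_status and its arguments are hypothetical and used only for illustration.

```python
from itertools import count
from typing import Any, Dict, Optional

# JSON-RPC request IDs must be unique within a session.
_request_ids = count(1)


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request envelope, the wire format MCP uses."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request() -> Dict[str, Any]:
    """Discovery: ask the MCP server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool_request(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Invocation: call one tool by name with arguments matching its input schema."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool name and arguments, for illustration only.
example_call = call_tool_request("get_order_status", {"order_id": "A-1001"})
```

Because both messages are ordinary HTTP payloads once they reach the gateway, they can pass through the same policy pipeline as any other request.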

For APIM, MCP matters because it transforms the gateway from a proxy for AI completions into a broker for agent tool calls. An agent no longer calls your internal APIs directly. Instead, it discovers them as MCP tools through APIM, and APIM enforces the same governance policies on those tool calls that it enforces on any other request. The control plane extends naturally into the agentic layer.

Azure API Management as MCP Gateway: Three Capabilities

Diagram showing Azure API Management acting as an MCP gateway. On the left, AI agents connect as MCP clients. In the centre, APIM exposes REST APIs as MCP tool definitions, proxies external MCP servers, and routes agent-to-agent traffic through the policy pipeline. On the right, Azure OpenAI and AI Foundry backends receive governed requests.
Azure API Management as an MCP gateway. Existing REST APIs are auto-exposed as MCP tool definitions via the export-rest-mcp-server policy. External MCP servers are proxied through APIM. Agent-to-agent traffic passes through the same inbound policy pipeline, with all series policies (authentication, token limits, token metrics, and load balancing) applied uniformly.

APIM’s MCP gateway capabilities fall into three categories:

Expose REST APIs as MCP servers. The export-rest-mcp-server policy takes any API already registered in your APIM catalog and auto-generates MCP tool definitions from it. An agent connecting to your APIM MCP endpoint discovers those tools via the standard MCP protocol and can call them without any knowledge of the underlying REST implementation. Crucially, no changes are required to the underlying API. The policy handles the translation layer entirely within APIM.

Pass through external MCP servers. APIM can proxy external MCP servers — whether third-party services like GitHub or Jira, or custom MCP servers built by your own teams — through the same gateway. All traffic passes through APIM’s policy pipeline, so you apply JWT validation, subscription key enforcement, token limits, and logging to external MCP calls exactly as you would to any other API call. Agents get a single APIM endpoint; APIM handles the routing.

Agent-to-agent (A2A) traffic. In multi-agent architectures, orchestrator agents call sub-agents to delegate tasks. Routing that traffic through APIM means every A2A hop is governed: authenticated, rate-limited, logged, and subject to the same token budget controls applied to end-user traffic. This is particularly relevant for agentic pipelines running on Microsoft Foundry, where multiple specialized agents collaborate within a single workflow.

Applying Series Policies to Agentic Workloads

One of the practical advantages of routing MCP traffic through APIM is that every policy covered in this series applies without modification. Agentic workloads are not a special case requiring a separate governance layer. They use the same pipeline.

  • Authentication (Part 2): Agents authenticate to APIM using subscription keys or JWT tokens. APIM authenticates to AI backends via Managed Identity. The agent never holds backend credentials.
  • Token limits (Part 3): Multi-step agentic pipelines can consume large token volumes per workflow. Per-subscription TPM limits prevent a single runaway pipeline from exhausting shared capacity.
  • Token metrics (Part 4): Token consumption from agentic workflows is attributed to the subscribing team or pipeline via the emit-token-metric policy. FinOps visibility extends automatically to agentic workloads.
  • Load balancing (Part 5): Agentic pipelines often run longer and consume more tokens per call than chat applications. PTU-to-PAYG failover protects pipeline continuity when primary capacity saturates.
  • Semantic caching (Part 6): Agents that make repeated identical tool calls, such as checking a status or looking up a reference value, benefit from semantic caching in the same way chat applications do.
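The per-subscription token limit from the list above can be modelled in a few lines. This is a minimal sketch of a fixed-window tokens-per-minute counter keyed by subscription ID. It is not how APIM implements its token limit policy internally, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class TokenRateLimiter:
    """Fixed-window tokens-per-minute (TPM) counter keyed by subscription ID."""

    def __init__(self, tokens_per_minute: int) -> None:
        self.tokens_per_minute = tokens_per_minute
        # (subscription_id, minute_window) -> tokens consumed in that window
        self._used: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def try_consume(self, subscription_id: str, tokens: int, now_seconds: float) -> bool:
        """Record the tokens and return True, or return False if the limit would be exceeded."""
        key = (subscription_id, int(now_seconds // 60))
        if self._used[key] + tokens > self.tokens_per_minute:
            return False  # a gateway would typically reject here, for example with HTTP 429
        self._used[key] += tokens
        return True

    def remaining(self, subscription_id: str, now_seconds: float) -> int:
        """Tokens still available to this subscription in the current window."""
        key = (subscription_id, int(now_seconds // 60))
        return self.tokens_per_minute - self._used[key]
```

A runaway pipeline on one subscription exhausts only its own window. Other subscriptions keep their full allowance, which is the isolation property the bullet on token limits describes.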

Practical Considerations for APIM as MCP Gateway

A few agentic-specific considerations are worth calling out before you start routing MCP traffic through APIM.

Tool discovery latency. MCP clients typically discover available tools at session start by calling the MCP server’s tool list endpoint. With APIM in the path, that discovery call passes through the full policy pipeline. Keep your inbound policies lightweight for discovery calls, or cache the tool list response to avoid repeated round trips.

Streaming responses. Many AI completions endpoints support streaming via server-sent events. APIM supports streaming passthrough, but some policies — including semantic cache lookup — do not apply to streaming responses. Structure your pipeline accordingly: apply caching only to non-streaming completion calls.

Session state. MCP conversations are stateful within a session. APIM is stateless between requests, so per-session state must live in the calling agent or an external store. The vary-by pattern from the semantic cache policy can scope cached tool responses by session ID if the agent passes one in a header.

Token budget propagation. In multi-agent pipelines, token budgets need to propagate from the orchestrator to sub-agents. Exposing the remaining token budget from the remaining-tokens-variable-name attribute (Part 3) as a response header lets orchestration frameworks like Semantic Kernel make informed decisions about which sub-agent to invoke next.
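On the orchestrator side, budget-aware selection might look like the sketch below. It assumes the gateway returns the remaining budget in a response header. The header name x-remaining-tokens, the agent names, and the per-agent cost estimates are all hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical header name; use whatever name your outbound policy sets.
REMAINING_TOKENS_HEADER = "x-remaining-tokens"


def read_remaining_tokens(headers: Mapping[str, str]) -> Optional[int]:
    """Return the remaining token budget from response headers, or None if absent or invalid."""
    for name, value in headers.items():
        if name.lower() == REMAINING_TOKENS_HEADER:
            try:
                return int(value)
            except ValueError:
                return None
    return None


def choose_sub_agent(
    headers: Mapping[str, str],
    candidates: Sequence[Tuple[str, int]],
) -> Optional[str]:
    """Pick the first candidate whose estimated token cost fits the remaining budget.

    candidates is a list of (agent_name, estimated_token_cost) in order of preference.
    """
    if not candidates:
        return None
    remaining = read_remaining_tokens(headers)
    if remaining is None:
        return candidates[0][0]  # no budget signal: fall back to the preferred agent
    for name, estimated_cost in candidates:
        if estimated_cost <= remaining:
            return name
    return None  # nothing fits: the orchestrator should stop or wait
```

Ordering the candidates by preference keeps the policy simple: the orchestrator degrades to a cheaper sub-agent only when the preferred one no longer fits the budget.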

Azure API Management as MCP Gateway: Closing the Series

This post closes the series, but the control plane it describes is not static. MCP is still evolving rapidly. New APIM policy capabilities for agentic workloads are shipping frequently. The architecture board conversation at various enterprises has shifted from “should we centralize AI traffic through APIM?” to “what do we govern next?”, which is a good place to be.

Diagram showing the complete Azure API Management AI control plane. On the left, five consumer types — AI agents, chat apps, copilots, pipelines, and enterprise apps — connect through a single APIM instance. In the centre, seven policy layers are stacked vertically: authentication, token limit, token metric, load balancing and circuit breaker, semantic caching, MCP gateway, and named value kill switch, each labelled with its series part number. On the right, Azure AI backends including Azure OpenAI PTU and PAYG, AI Foundry, and MCP-enabled backends receive governed requests.
The complete APIM for AI control plane across all seven parts of the series. One APIM instance governs every consumer type, every Azure AI backend, and every governance requirement — including agentic MCP workloads introduced in this post. Each policy layer can be implemented incrementally, starting with authentication and adding capability as workloads mature.

Looking back across the seven posts, the consistent theme is that AI workloads are not fundamentally different from other API workloads in terms of governance requirements. They need authentication, rate limiting, observability, resilience, and cost control. APIM provides all of those. What changes with AI is the unit of measurement (tokens, not requests), the billing model (PTU vs. PAYG), and now the communication protocol (MCP for agents). The control plane adapts to each of these without requiring a parallel governance infrastructure.

The full series index is below for reference. Each post links to the relevant Microsoft documentation and includes policy XML you can use directly.

  • Part 1: Why your AI APIs need a gateway.
  • Part 2: Authentication and authorization.
  • Part 3: Token limit policy.
  • Part 4: Token metric policy and cross-charging.
  • Part 5: Load balancing and circuit breaking.
  • Part 6: Semantic caching.
  • Part 7 (this post): APIM as MCP gateway for agentic AI workloads.

Azure API Management Semantic Caching: Cut AI Token Costs with Similarity-Based Responses

Azure API Management semantic caching is the most operationally transparent cost optimization in this series. Every technique covered so far (auth, token limits, token metrics, and load balancing) requires deliberate design decisions in how you configure APIM. Semantic caching, by contrast, works silently. Calling applications send prompts as normal. APIM checks whether a semantically similar prompt has already been answered. If a match exists within a configurable similarity threshold, APIM returns the cached response without touching the AI backend. Zero tokens consumed. No model latency incurred.

For workloads with repetitive prompt patterns, such as internal FAQ bots, document classifiers, and support agents that see the same questions repeatedly, the cache hit rate can be surprisingly high. Even a 20% hit rate on a high-volume workload translates directly into cost reduction and lower average latency.

How Azure API Management Semantic Caching Works

The azure-openai-semantic-cache-lookup policy sits in the inbound section of your APIM pipeline, before the request reaches the AI backend. When a prompt arrives, APIM sends it to a configured embedding model, typically Azure OpenAI text-embedding-ada-002 or equivalent, to generate a vector representation of the prompt. APIM then compares that vector against cached embeddings stored in Azure Managed Redis using cosine similarity.

The comparison yields a score in which smaller values mean greater semantic similarity. If the score between the incoming prompt and a cached prompt falls below the configured score-threshold, APIM treats it as a cache hit and returns the stored response. If no match meets the threshold, APIM forwards the request to the AI backend as normal and stores the response in Redis for future lookups.
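The hit-or-miss decision can be illustrated with a small in-memory model. This sketch assumes the score behaves like a cosine distance (1 minus cosine similarity), so smaller means closer. The actual scoring inside APIM and Redis may differ, and the SemanticCache class is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory stand-in for the lookup and store steps (assumption, not APIM's code)."""

    def __init__(self, score_threshold: float) -> None:
        self.score_threshold = score_threshold
        self._entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: Sequence[float], response: str) -> None:
        """Outbound step: remember the response under the prompt's embedding."""
        self._entries.append((list(embedding), response))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Inbound step: return the closest cached response if it is within the threshold."""
        best_score: Optional[float] = None
        best_response: Optional[str] = None
        for cached_embedding, response in self._entries:
            score = cosine_distance(embedding, cached_embedding)
            if best_score is None or score < best_score:
                best_score, best_response = score, response
        if best_score is not None and best_score <= self.score_threshold:
            return best_response  # cache hit: the backend is never called
        return None  # cache miss: forward to the backend, then call store()
```

With a threshold of 0.05, a stored vector [1.0, 0.0] matches the nearby query [1.0, 0.1] but not the more distant [1.0, 0.5]. Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.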

Azure API Management semantic caching policy flow showing cache hit returning stored response and cache miss forwarding to Azure OpenAI
Diagram 1: Semantic cache request flow. On a cache hit, APIM returns a stored response directly — consuming zero tokens. On a miss, APIM forwards to the AI backend and stores the response in Azure Managed Redis for future hits.

The generic variant, llm-semantic-cache-lookup, works identically for non-Azure backends. Both require the same supporting infrastructure: an embedding model backend and an Azure Managed Redis instance configured in APIM. The semantic cache store policy handles writing responses back to the cache in the outbound section.

Tuning the Score Threshold for Azure API Management Semantic Caching

The score-threshold attribute is the most consequential configuration decision in the semantic caching policy. It controls how similar an incoming prompt must be to a cached prompt for APIM to treat it as a hit. The value runs from 0.0 to 1.0, but the practical range is much narrower.

Azure API Management semantic caching score threshold tuning guide from aggressive to conservative with vary-by subscription user and global scope strategies
Diagram 2: Score threshold tuning guide and vary-by scope strategies. The score behaves like a distance, so lower thresholds require a closer match and higher thresholds cache more aggressively. A value of 0.05 is a sensible starting point for most production workloads. A global cache (no vary-by) maximizes hit rate but risks serving the wrong user’s response.

In practice, three zones matter:

0.01 to 0.05 (conservative). At this range, only prompts that are very close in wording and meaning produce hits. Creative workloads, code generation, and document drafting tend to have high prompt variance, so a conservative threshold avoids serving cached responses to genuinely different requests. The common starting value of 0.05 sits at the top of this range.

0.05 to 0.20 (aggressive). At this range, prompts that are paraphrases of each other (“What is my account balance?” and “Can you show me my current balance?”) become increasingly likely to produce cache hits. This is the range to explore for FAQ bots, support agents, and any workload where users ask the same questions in slightly different words.

Above 0.30 (too loose). At this threshold, loosely related prompts start to match, and the cache can return a response written for a different question. Avoid this range. To switch caching off for a specific API product while keeping the policy in the pipeline for future use, move the threshold toward 0.0 instead, where almost no prompts match.

Start at 0.05 and monitor cache hit rates in Application Insights. If the hit rate is low for a workload you expect to be repetitive, raise the threshold incrementally. If you start seeing complaints about incorrect responses, lower it.

vary-by Scope: Preventing Cache Pollution

The vary-by element scopes the cache namespace. Without it, all consumers share a single global cache. That maximizes the hit rate but introduces a significant risk: APIM could serve one user’s cached response to a different user. For most enterprise AI workloads, that is unacceptable.

The safest default is to vary by Subscription ID, which gives each API subscriber their own cache namespace. This prevents cross-team cache pollution while still achieving high hit rates within each subscriber’s own prompt patterns. For multi-tenant applications where individual users have distinct contexts, vary by a user identifier extracted from the JWT or a custom header instead.

A global cache with no vary-by is appropriate only for fully public, stateless APIs where responses are identical regardless of who requests them. Internal enterprise AI workloads rarely meet that bar.
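The effect of vary-by on the cache namespace can be shown with a toy cache whose keys include the vary-by value. This is a minimal sketch: it uses exact prompt keys instead of embeddings to stay short, and the PartitionedCache class is hypothetical.

```python
from typing import Dict, Optional, Tuple


class PartitionedCache:
    """Cache whose entries are namespaced by a vary-by value (None means global)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[Optional[str], str], str] = {}

    def put(self, vary_by: Optional[str], prompt_key: str, response: str) -> None:
        """Store a response inside the partition identified by vary_by."""
        self._store[(vary_by, prompt_key)] = response

    def get(self, vary_by: Optional[str], prompt_key: str) -> Optional[str]:
        """Look up a response; entries from other partitions are never visible."""
        return self._store.get((vary_by, prompt_key))
```

Passing a subscription ID as vary_by means a response cached for one subscriber can never be returned to another. Passing None for every caller reproduces the global cache and its cross-user risk.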

Infrastructure Requirements for Azure API Management Semantic Caching

Semantic caching requires two supporting Azure resources beyond APIM itself. First, an Azure Managed Redis instance configured as an external cache in APIM. Redis stores the prompt embeddings and cached responses. The cache TTL is configurable in the store policy, so you control how long responses remain valid before APIM re-queries the backend.

Second, an embeddings model backend registered in APIM. For Azure OpenAI, this is typically a separate deployment of text-embedding-ada-002 or text-embedding-3-small. The embeddings backend is referenced by the embeddings-backend-id attribute. It is separate from your completions backend, so you can apply independent token limits and load balancing to the embeddings traffic.

One practical consideration: the embeddings call itself consumes tokens and adds a small amount of latency on every request, whether or not the cache hits. For workloads with very low prompt repetition, the overhead of generating embeddings for every request may outweigh the savings from occasional cache hits. Measure the hit rate before committing the infrastructure cost.
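The trade-off can be estimated with simple arithmetic before any infrastructure is deployed. The sketch below computes the break-even hit rate and the net saving per request. It ignores the fixed cost of the Redis instance, and all prices passed in are hypothetical placeholders, not published rates.

```python
def break_even_hit_rate(completion_cost: float, embedding_cost: float) -> float:
    """Hit rate at which embedding overhead equals the completion cost avoided.

    Every request pays embedding_cost; each hit avoids completion_cost.
    """
    if completion_cost <= 0:
        raise ValueError("completion_cost must be positive")
    return embedding_cost / completion_cost


def net_saving_per_request(hit_rate: float, completion_cost: float, embedding_cost: float) -> float:
    """Average saving per request; a negative value means caching costs more than it saves."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * completion_cost - embedding_cost
```

For example, if a completion costs 0.01 and an embedding costs 0.0001 in the same currency, the break-even hit rate is 1%. At a 20% hit rate the average saving is 0.0019 per request, before Redis costs.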

What’s Next in This Azure API Management for AI Series

Part 7 closes the series by covering APIM’s emerging role as an MCP gateway for agentic AI workloads: how to expose REST APIs as MCP servers, pass through existing MCP servers, and manage agent-to-agent traffic through the same control plane we’ve built across this series.

Anatomy of an Agent Loop in Azure Logic Apps

Part 2 of 7 in the Logic Apps Agent Loop series

Part 1 explained why the Azure Logic Apps agent loop is a different design paradigm from conventional workflow automation. This post gets hands-on with the anatomy of that loop. We will look at the four building blocks that make up every agent loop (trigger, instructions, connected model, and tools) and walk through how to wire them together in a Standard logic app.

By the end of this post you will have a working autonomous agent that accepts a prompt from a trigger, reasons over it using Azure OpenAI, invokes a connector action as a tool, and returns a result. The run history will show you exactly how the loop iterated.

The Azure Logic Apps agent loop: four building blocks

Before opening the designer, it helps to have a clear mental model of what you are assembling. Every Azure Logic Apps agent loop consists of four parts.

Trigger

The trigger starts the workflow, exactly as it does in any nonagentic Logic Apps workflow. For an autonomous agent, this can be any supported trigger: an HTTP request, a timer, a Service Bus message, a new email, or anything else in the connector library. The trigger’s output becomes the initial input to the agent: the prompt or data the model will reason over.

Instructions

Instructions are the system prompt for the agent. You provide them as a block of natural language text in the agent action’s configuration pane. They define the agent’s role, what it can and cannot do, how it should respond, and any constraints it should observe. A well-written instructions block is the single most important factor in how well the agent performs. Think of it as the job description you hand to the model at the start of every run.

Connected model

The agent needs a language model to reason with. In Standard Logic Apps, you connect the agent to an Azure OpenAI Service resource and specify the model deployment to use — typically a GPT-4o deployment. The agent sends the instructions, the trigger input, and the results of any tool calls to the model at each iteration of the loop. The model’s response tells the agent what to do next.

Tools

A tool is a sequence of one or more connector actions that the agent can choose to invoke. You build tools directly in the Logic Apps designer by adding actions from the connector gallery inside the agent action. Each tool gets a name and a description — the model uses these to decide which tool to call and when. A single agent can have multiple tools. An agent with no tools can still respond to prompts using the model’s built-in knowledge, but it cannot take action on external systems.

The diagram below shows how these four parts fit together inside a single agent loop execution.

Anatomy of an Azure Logic Apps agent loop — trigger, instructions, model, and tools
Figure 1 — Every Azure Logic Apps agent loop consists of four building blocks: a trigger that starts the workflow, instructions that define the agent’s role, a connected model (Azure OpenAI / GPT-4o) that reasons over each iteration, and tools built from connector actions. The loop cycles through Think, Act, and Observe until the model determines the task is complete.

Building your first agent loop in Azure Logic Apps

The demo for this post is deliberately simple: an agent that receives a topic via an HTTP trigger, uses Azure OpenAI to generate a summary, and returns the result to the caller. One trigger, one model, one tool is enough to see all four building blocks in action and to read a meaningful run history.

Prerequisites

  • A Standard logic app resource deployed in Azure
  • An Azure OpenAI Service resource with a GPT-4o model deployment
  • Contributor access to both resources

Step 1: Create the workflow

In the Azure portal, open your Standard logic app and select Workflows from the sidebar. Choose Add, then select Autonomous Agents as the workflow type. Give the workflow a name and select Stateful. Logic Apps creates a new workflow with an empty agent action already in place.

Step 2: Configure the trigger

The autonomous agent workflow template starts with a When a HTTP request is received trigger by default. Leave the method as POST. In the request body JSON schema, add a single property: topic of type string. This is the input the agent will work with.
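Expressed as data, the schema described above is small. The sketch below shows it as a Python dictionary, together with a hand-rolled check that mirrors it. The helper is_valid_body is hypothetical, and it also insists that topic is present, which the schema as described does not strictly require.

```python
from typing import Any

# The request body JSON schema for the trigger: one string property named "topic".
TRIGGER_BODY_SCHEMA = {
    "type": "object",
    "properties": {
        "topic": {"type": "string"},
    },
}


def is_valid_body(body: Any) -> bool:
    """Check a parsed JSON body against the schema, additionally requiring 'topic'."""
    if not isinstance(body, dict):
        return False
    return isinstance(body.get("topic"), str)
```

Serializing TRIGGER_BODY_SCHEMA with json.dumps gives the text to paste into the trigger’s schema field.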

Step 3: Write the instructions

Select the agent action in the designer to open its configuration pane. On the Parameters tab, find the Instructions field. Enter something like the following:

You are a research assistant. When given a topic, use the available tools to retrieve relevant information and return a concise summary of no more than three sentences. Always cite your source.

Keep instructions specific and bounded. Vague instructions produce unpredictable behaviour. The model will take the instructions literally, so precision matters.

Step 4: Connect the model

Still on the Parameters tab, select Add connection under the model configuration section. Choose Azure OpenAI Service, select your resource, and choose your GPT-4o deployment. Logic Apps establishes the connection and stores it against the workflow.

Step 5: Add a tool

Inside the agent action, select Add a tool. This opens the connector gallery filtered to actions that can be used as tools. For this demo, add the HTTP action as a tool — name it search_web, give it the description “Retrieves content from a given URL”, and configure it to accept a URL as input. In a production scenario you would use Azure AI Search or a more capable connector here; the HTTP action keeps the demo self-contained.

Step 6: Save and run

Save the workflow. Use a REST client to POST a JSON body like {"topic": "Azure Logic Apps agent loop"} to the workflow’s trigger URL. The agent fires, the model reasons over the instructions and the topic, invokes the search tool, and returns a summary.
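If you prefer a script to a REST client, the same call can be made with the Python standard library. This is a minimal sketch: the trigger URL is a placeholder you copy from the trigger in the designer, the function names are hypothetical, and the response is returned as raw text because its exact shape depends on how the workflow responds.

```python
import json
import urllib.request


def build_trigger_request(trigger_url: str, topic: str) -> urllib.request.Request:
    """Build the POST request that starts the agent workflow."""
    body = json.dumps({"topic": topic}).encode("utf-8")
    return urllib.request.Request(
        trigger_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_agent(trigger_url: str, topic: str, timeout: float = 120.0) -> str:
    """Send the request and return the workflow's response body as text."""
    request = build_trigger_request(trigger_url, topic)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A generous timeout is deliberate: the agent may loop through several model and tool calls before it responds.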

Logic Apps designer view of an autonomous agent workflow. The canvas shows an HTTP request trigger connected to an Agent action containing a Tool with an HTTP action inside. The right pane shows the Agent parameters: AI model set to GPT-4o via Foundry Models, instructions for the research assistant role, and the topic dynamic value wired as user instructions item 1.
Figure 2 — The completed agent configuration in the Logic Apps designer. The Agent action is connected to GPT-4o via Foundry Models, the instructions define the research assistant role and output format, and the topic value from the HTTP trigger is passed in as the user instruction. The Tool contains a single HTTP action that the agent can invoke to retrieve content from a given URL.

Reading the run history

The run history is where the Azure Logic Apps agent loop becomes visible. Open the workflow’s Run history and select the latest run. You will see the trigger, followed by the agent action. Expand the agent action and you will find each iteration of the loop shown as a numbered step: the model’s reasoning output, the tool call with its inputs and outputs, and the model’s decision on whether to loop again or return a final answer.

This is the key difference from a nonagentic run history. In a conventional workflow, the run history shows a flat list of actions. In an agent loop, it shows a nested, iterative structure: the model’s chain of thought made visible.

Run history of a Logic Apps autonomous agent workflow completed in 9.21 seconds, showing the HTTP trigger, a first agent iteration that invoked the HTTP tool in 3.7 seconds, and a second agent iteration that sent the final chat message.
Figure 3 — The run history of the minimal autonomous agent from this post. The loop ran two iterations: the first agent step (3.9s) reasoned over the topic prompt and invoked the HTTP tool (0.6s); the second agent step (3.2s) observed the result and composed the final response. The canvas shows iteration 1 of 3 steps — trigger, tool, and HTTP action — all succeeded in 9.21 seconds total.

For a simple prompt, you may see a single iteration. For a more complex task involving multiple tool calls, you will see the loop unfold across three, five, or more steps. Each step shows exactly what the model decided and why.

Standard versus Consumption: model connections

In Standard logic apps, you configure the model connection yourself — selecting an Azure OpenAI Service resource and specifying the deployment. This gives you full control over which model version you use, where it is hosted, and how it is secured via Managed Identity.

In Consumption logic apps (currently in public preview), the model connection is handled via Microsoft Foundry and the configuration is more constrained. For any production workload, Standard remains the right choice.

What comes next

The agent in this post is autonomous: it runs without human interaction, triggered by an HTTP call and returning a result when done. That covers a wide range of integration scenarios, but not all of them. Some tasks require a back-and-forth with a user: a support conversation, a guided data-entry flow, a multi-turn research session.

The next part will cover exactly that distinction, autonomous versus conversational agentic workflows, and walk through when to choose each pattern and what changes in the designer when you do.

Why the Agent Loop Changes Everything in Azure Logic Apps

Part 1 of 7 in the Logic Apps Agent Loop series

The Azure Logic Apps agent loop introduces a fundamentally different way to design workflows on the platform. While conventional Logic Apps workflows follow a fixed sequence of steps defined at design time, the agent loop delegates reasoning to a large language model at runtime, looping through think, act, and observe cycles until a task is complete. This post opens a seven-part series on building agentic workflows in Logic Apps. It starts with the question that matters most: why does this change anything?

For years, Azure Logic Apps has been the platform of choice for integration architects who need to orchestrate business processes across cloud services and on-premises systems. You build a workflow, wire up triggers and actions, define your conditions, handle your errors, deploy, and move on. The flow is predictable (deterministic): given the same inputs, it does the same thing every time. That predictability is the point.

The agent loop breaks that contract, deliberately and usefully.

With the introduction of agentic workflows in Azure Logic Apps, Microsoft has extended the platform from a fixed automation engine into something that can reason, adapt, and decide. At its core, the agent loop drives this shift. It is a repeating process: the connected language model thinks through a problem, selects a tool, acts on the result, and decides whether the task is done. Unlike a conventional workflow, there is no hardcoded sequence of steps. Instead, the model determines the path based on the task.

This post is the opening of a seven-part series on building agentic workflows in Azure Logic Apps. Before going hands-on with triggers, connectors, and multi-agent patterns in later posts, this one makes the case for why the agent loop matters and what it fundamentally changes about how you think about workflow design.

How the Azure Logic Apps agent loop differs from nonagentic workflows

Nonagentic Logic Apps workflows are excellent at exactly the kind of work they were designed for: stable, predictable, repeatable processes. An approval workflow, an ETL pipeline, and a B2B message exchange are all scenarios where the path through the workflow is known in advance. The trigger fires, the conditions evaluate, the actions execute in sequence, and the run history tells you exactly what happened and why.

The challenge arises when the environment you are integrating with is unstable or unpredictable. When incoming data is unstructured. Or when the right action depends on context that cannot be captured in a condition expression. Or when you need to handle a customer query that could go a dozen different directions depending on what the customer actually says.

These are the cases where deterministic workflows buckle. You end up building sprawling switch-case structures, hardcoding edge cases as branches, and constantly patching the workflow every time a new variation appears. The workflow becomes a maintenance problem rather than a solution.

Agentic workflows excel in dynamic environments where unexpected events occur, the choice of the right tool relies on the input, and the system must manage unstructured data without specific instructions for each variant.

How the agent loop works: Think, Act, Learn

The Azure Logic Apps agent loop follows a three-step process.

Think. The agent collects available information: task instructions, prior inputs, and previous tool results. It then passes all of this to the connected language model. The model reasons over the context and decides what to do next: invoke a tool, ask a follow-up question, or return a final answer.

Act. The agent invokes the tool the model selected. In Logic Apps, tools are actions drawn from 400+ connectors. These include Azure OpenAI, Azure AI Search, Office 365, and custom APIs. Once the action runs, the result feeds back into the next cycle.

Learn (optional). The agent can use feedback or external signals to adjust its behaviour over time. This is the most advanced capability and is not required for most workflows.

Iterations, not instructions

This loop continues (think, act, observe, decide) until the model determines the task is complete. You can change the number of iterations as needed. A simple query might resolve in one loop. A complex multi-step task might require five or ten.
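The control flow of the loop can be sketched independently of Logic Apps. The code below is a conceptual model only: the Decision type, the model callable, and the tools dictionary are hypothetical stand-ins for the connected model and the connector actions, and max_iterations stands in for whatever iteration cap you configure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Decision:
    """What the model decided in the Think step: call a tool, or finish."""
    tool: Optional[str] = None
    tool_input: Optional[str] = None
    final_answer: Optional[str] = None


Model = Callable[[str, List[str]], Decision]  # (task, observations) -> decision
Tool = Callable[[str], str]                   # tool input -> tool result


def run_agent_loop(task: str, model: Model, tools: Dict[str, Tool], max_iterations: int = 10) -> str:
    """Iterate think, act, observe until the model returns a final answer."""
    observations: List[str] = []
    for _ in range(max_iterations):
        decision = model(task, observations)               # Think
        if decision.final_answer is not None:              # Decide: the task is complete
            return decision.final_answer
        if decision.tool is None or decision.tool not in tools:
            raise ValueError(f"Model requested an unknown tool: {decision.tool!r}")
        result = tools[decision.tool](decision.tool_input or "")  # Act
        observations.append(result)                        # Observe
    raise RuntimeError("Agent did not finish within max_iterations")
```

Nothing in run_agent_loop fixes the sequence of tool calls. The path is chosen by the model at runtime, and the iteration cap is the only hard limit the architect imposes.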

The diagram below shows the difference between a conventional nonagentic workflow, which follows a linear sequence of predetermined steps, and the agent loop, which dynamically iterates until the model determines that the task is complete.

Figure 1 — A conventional nonagentic workflow follows a fixed path defined at design time (left). The agent loop iterates dynamically at runtime: the LLM thinks, acts, observes the result, and decides whether to loop again or return a final answer (right).

Agent versus nonagentic: a structural comparison

The difference is not just philosophical. It shows up in how you design, deploy, and maintain the workflow.

In a nonagentic workflow, the logic architect owns the decision tree. Every branch, every condition, every action path is explicitly modelled. This is powerful for known, bounded scenarios, but it places all the reasoning burden on the architect at design time.

In an agentic workflow, the reasoning is delegated to the model at runtime. The architect’s job shifts: instead of modelling every path, you define the agent’s instructions, give it the right tools, and trust the model to navigate the task. This is a different skill and a different mindset, closer to prompt engineering and system design than to traditional workflow modelling.

The Microsoft documentation puts it plainly: agentic workflows can adapt to environments where unexpected events happen, choose which tools to use based on prompts and available data, and handle unstructured data at a level of flexibility that nonagentic workflows simply cannot match. Nonagentic workflows, by contrast, function best in stable environments with static, predictable, repetitive tasks.

Neither is universally better. They address different problems. But for integration architects, the arrival of the agent loop means Logic Apps can now cover territory that previously required a custom-coded application or a fully separate agent framework.

Standard versus Consumption: what you need to know now

Azure Logic Apps offers two hosting models: Standard (single-tenant, runs on Azure Functions runtime) and Consumption (multitenant, pay-per-execution). Agentic workflows are fully available in Standard. Consumption support is in public preview as of early 2026 and carries some restrictions.

For production agentic workloads, Standard is the right choice today. The rest of this series will use Standard throughout, with notes where the Consumption behaviour differs.

What this series covers

The seven posts in this series move from concept to production:

  1. Why the agent loop changes everything — this post
  2. Anatomy of an agent loop — instructions, the connected model, tool calls, and how the loop iterates
  3. Autonomous versus conversational workflows — choosing between unattended execution and human-in-the-loop patterns
  4. Building tools for the agent — connectors, custom connectors, and MCP servers as tool providers
  5. Multi-agent patterns — handoffs, orchestrators, and sequential agent loops
  6. Securing agentic workflows — authentication, the expanded caller surface, and Easy Auth
  7. Observability, pricing, and running in production — Application Insights, agent loop pricing, and DevOps deployment

The next post gets hands-on: we will look at the anatomy of a single agent loop in the Logic Apps designer, walk through the instructions pane, wire up Azure OpenAI as the model, and watch the run history to see how the iterations unfold.

Azure API Management Load Balancing and Circuit Breaker for AI Backends

Part 5 of 7 in the “APIM for AI Workloads” series

Azure API Management load balancing for AI workloads solves a problem that every team hits once they move beyond a single Azure OpenAI deployment: PTU capacity is finite, PAYG is a safety net, and when things go wrong on one backend, the rest of your workload should not notice. In Part 1 of this series, I described PTU vs. PAYG as a routing problem. This post is where we solve it.

The combination of backend pools, priority-based routing, and circuit breaker rules in APIM gives you a resilient AI gateway that handles three distinct failure modes: PTU saturation (too many tokens consumed against reserved capacity), regional outages, and transient backend errors. None of these requires changes to calling applications. APIM absorbs the complexity and presents a single stable endpoint.

Azure API Management Load Balancing: Backend Pools for AI

APIM’s backend pool feature lets you define a named group of AI backends and route to them as a unit. You reference the pool in the set-backend-service policy by its pool ID. When a request arrives, APIM selects a backend from the pool based on priority and weight, tracks health state via the circuit breaker, and retries on the next available member if the selected backend fails.

For AI workloads, the standard pattern uses two tiers. The first tier is your PTU deployment (reserved capacity in a primary region), assigned priority 1. The second tier is a PAYG deployment in a secondary region, assigned priority 2. APIM routes all traffic to the PTU backend as long as it is healthy. When the PTU backend returns a 429 (capacity exceeded) error or becomes unreachable, the circuit breaker trips, and APIM automatically fails over to the PAYG backend.

Azure API Management load balancing backend pool with PTU primary PAYG overflow and circuit breaker tripped on unavailable backend
Diagram 1: APIM backend pool with three members. The PTU backend (priority 1) handles normal load, while the PAYG backend (priority 2) absorbs overflow. After repeated 429 responses, Backend #3 has tripped its circuit breaker and is bypassed until the probe succeeds.

Priority determines the preference order: lower numbers are preferred. Weight applies when multiple backends share the same priority, distributing load proportionally between them. A common pattern for multi-region PTU deployments is two PTU backends at priority 1, each with a different weight reflecting their provisioned capacity, and a shared PAYG backend at priority 2 as the common overflow.
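
The selection rule described above can be sketched in a few lines of Python. This is an illustration of the priority-then-weight logic only, not APIM's implementation; the Backend class, the select_backend function, and the backend names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    name: str            # hypothetical backend name, for illustration only
    priority: int        # lower number = preferred
    weight: int          # share of traffic among backends with equal priority
    healthy: bool = True


def select_backend(pool, rng=random):
    """Pick the lowest-priority healthy group, then choose within it by weight."""
    healthy = [b for b in pool if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backend in pool")
    best = min(b.priority for b in healthy)
    candidates = [b for b in healthy if b.priority == best]
    weights = [b.weight for b in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Two PTU backends share priority 1 with a 3:1 weight split; PAYG is the overflow.
pool = [
    Backend("ptu-region-a", priority=1, weight=3),
    Backend("ptu-region-b", priority=1, weight=1),
    Backend("payg-overflow", priority=2, weight=1),
]
```

With both PTU backends healthy, the PAYG member is never selected. It only receives traffic once every priority 1 member is marked unhealthy.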

Circuit Breaker Configuration for Azure API Management AI Backends

The circuit breaker is what makes the backend pool resilient rather than just load-balanced. Without it, APIM continues routing to a saturated or unavailable backend on every request, each one failing with a 429 or timeout before falling back. The circuit breaker short-circuits that path: after a configurable number of failures within a time window, it marks the backend as OPEN and stops sending traffic to it entirely.

Azure API Management circuit breaker state machine showing closed open and half-open states for AI backend failover
Diagram 2: Circuit breaker state machine. CLOSED is normal operation. Exceeding the failure threshold trips the breaker to OPEN, bypassing the backend. After tripDuration seconds, APIM sends a single probe request to test recovery. Success returns to CLOSED; failure reopens the circuit.

The three circuit breaker states map directly to operational behavior:

CLOSED is the normal state. All requests are routed to the backend. APIM counts failures within the configured interval, and the counter resets at the end of each interval if the number of failures remains below the threshold.

After enough failures to exceed the threshold, the breaker trips to OPEN. In this state, APIM bypasses the backend entirely and routes to the next available pool member without attempting the failed backend again. The tripDuration timer starts counting down immediately.

Once tripDuration elapses, the breaker enters HALF-OPEN and sends a single probe request to test recovery. A successful response transitions the backend back to CLOSED. A failure resets the timer and keeps the circuit OPEN.
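
To make the three states concrete, here is a minimal Python sketch of the state machine. It is a simplified model and not APIM's internal logic: the CircuitBreaker class and its method names are hypothetical, the breaker trips when the failure count reaches the threshold (an assumption), and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class CircuitBreaker:
    """Simplified CLOSED / OPEN / HALF-OPEN model (illustrative only)."""

    def __init__(self, threshold=3, interval=60.0, trip_duration=30.0,
                 failure_codes=(429, 503, 504), clock=time.monotonic):
        self.threshold = threshold          # failures needed to trip (assumed: count >= threshold)
        self.interval = interval            # counting window in seconds
        self.trip_duration = trip_duration  # seconds to stay OPEN before probing
        self.failure_codes = failure_codes
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.window_start = clock()
        self.opened_at = None

    def allow_request(self):
        """Return True if a request may be sent to this backend right now."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.trip_duration:
                self.state = "HALF-OPEN"    # let exactly one probe through
                return True
            return False
        return False                        # HALF-OPEN: a probe is already in flight

    def record(self, status_code):
        """Feed the outcome of a request back into the breaker."""
        now = self.clock()
        failed = status_code in self.failure_codes
        if self.state == "HALF-OPEN":
            if failed:
                self._trip(now)             # probe failed: reopen and restart the timer
            else:
                self._close(now)            # probe succeeded: back to normal
            return
        if self.state == "OPEN":
            return
        if now - self.window_start >= self.interval:
            self.window_start = now         # new counting window
            self.failures = 0
        if failed:
            self.failures += 1
            if self.failures >= self.threshold:
                self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.opened_at = now

    def _close(self, now):
        self.state = "CLOSED"
        self.failures = 0
        self.window_start = now
```

A gateway loop would call allow_request() before forwarding to a pool member and record() once the response status is known.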

For Azure OpenAI specifically, 429 should always be in your failureCondition alongside 503 and 504. A 429 from a PTU endpoint indicates that the provisioned throughput ceiling has been reached and the backend is temporarily unable to serve requests. That is exactly the condition you want to trip the circuit and fail over to PAYG, rather than returning errors to the caller.

Sizing Circuit Breaker Parameters for AI Workloads

The right circuit breaker parameters depend on your traffic pattern and how quickly you need failover to activate. A few practical guidelines:

threshold: For AI workloads, 3 to 5 failures is a reasonable starting point. PTU endpoints return 429 consistently when saturated, so you don’t need a high threshold to detect the condition. Setting it too high means you absorb too many failed requests before failing over.

interval: 60 seconds works well for most workloads. This is the window over which failures are counted. Shorter intervals are more sensitive to transient errors, while longer ones suit bursty traffic patterns where a few failures in a short window are expected.

tripDuration: 30 seconds is a sensible default. PTU capacity refreshes on a per-minute basis, so a 30-second trip duration gives the backend time to recover before the probe fires. For deployments where PTU saturation is a known recurring pattern, a longer trip duration (60 to 120 seconds) reduces the frequency of failed probes.

Retry Policy and Agentic Workload Considerations

Backend pool failover and circuit breaking handle backend-level failures, but you may also want a retry policy in your APIM inbound pipeline for transient errors that do not warrant a full circuit trip. The retry policy can be scoped to specific status codes and configured with a backoff interval, giving you a two-level resilience model: retry for transient errors, circuit break for sustained failures.
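
The retry half of that two-level model can be sketched as follows. This is a generic Python illustration, not APIM policy syntax. The call_with_retry function, the TRANSIENT set, and the choice to leave 429 out of the retried codes (so that sustained saturation is left to the circuit breaker) are assumptions made for the example.

```python
import time

# Status codes treated as transient in this sketch (an assumption, tune per workload).
TRANSIENT = {502, 503, 504}


def call_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient status codes with exponential backoff.

    `send` is any zero-argument callable that returns an HTTP status code.
    Returns the last status code observed.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in TRANSIENT:
            return status                            # success, or an error not worth retrying
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))       # 0.5s, 1s, 2s, ...
    return status
```

Keeping the retry budget small matters here: every retry against a struggling backend is also a failure the circuit breaker will count.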

For agentic workloads specifically, failover behavior needs careful thought. When APIM silently switches a conversational agent from a PTU to a PAYG backend mid-session, nothing changes at the model API level, so the agent does not notice. But agentic pipelines with multiple sequential tool calls are more sensitive: a mid-pipeline failover can introduce latency spikes that cause timeouts in orchestration layers such as Azure Logic Apps or Semantic Kernel.

The practical mitigation is to expose the remaining token budget via the token limit policy variable from Part 3 and have the orchestration layer monitor it to proactively slow down before circuit breaking kicks in. Prevention is cheaper than recovery when the workload is stateful.

What’s Next in This Azure API Management for AI Series

Part 6 covers semantic caching: how APIM uses an embeddings model and Azure Managed Redis to serve cached responses for semantically similar prompts, reducing token consumption and latency without any changes to calling applications.

Azure API Management Token Metric Policy: AI Cost Observability and Cross-Charging

Part 4 of 7 in the “APIM for AI Workloads” series

The Azure API Management token metric policy turns AI cost data from a finance problem into an engineering one. In Part 3, we covered enforcement: how to set consumption boundaries per consumer. This post covers the complementary piece: how to measure that consumption. More importantly, it shows how to make it visible to the right people and use it to drive internal cross-charging and FinOps dashboards.

At my current company, one of the first questions the architecture board asked was straightforward: which teams are consuming what, and what does it cost? Without instrumentation at the gateway layer, that question is genuinely unanswerable. The token metric policy is how you answer it.

Azure API Management Token Metric Policy: How It Works

The policy sits in the outbound section of your APIM pipeline. After the AI backend returns a response, APIM reads the token usage fields from the response body. These include prompt tokens, completion tokens, and total tokens. APIM then emits them as custom metrics to Application Insights under a namespace you define.

Crucially, the policy emits metrics after the response arrives. It uses actual token counts from the API response rather than estimates. As a result, the data is accurate rather than approximated. It also means the metric emission adds no latency to the request path: the response is returned to the caller immediately, and the metric is emitted asynchronously.

Azure API Management token metric policy observability pipeline emitting token counts to Application Insights for cross-charging
Diagram 1: Token metric policy observability pipeline. Token counts from the AI backend response flow through the APIM metrics layer to Application Insights, broken down by dimensions for cross-charging and cost allocation.

The policy comes in two variants: azure-openai-emit-token-metric for Azure OpenAI endpoints, and the generic llm-emit-token-metric, which works identically for non-Azure backends. Both policies share the same dimension model, so the configuration patterns below apply regardless of which AI provider sits behind APIM.

Choosing Dimensions for Azure API Management Token Metric Policy

Dimensions are the labels attached to each metric event. They determine how you can slice and aggregate token consumption data in Application Insights. Choosing the right dimensions is the most important configuration decision for making the data useful for cross-charging.

Azure API Management token metric policy dimension strategies for cross-charging using Subscription ID User ID and API ID
Diagram 2: Three dimension strategies for cross-charging and showback. Subscription ID maps to teams and cost centers, User ID enables per-user billing in multi-tenant apps, and API ID breaks down cost by AI workload or feature.

The three primary dimension options are:

Subscription ID. The most common choice for internal enterprise deployments. Each APIM subscription maps to a team, product, or cost center, so filtering Application Insights metrics by Subscription ID gives you direct per-team token consumption. This pairs naturally with the subscription key authentication pattern from Part 2 and the per-subscription counter-key from Part 3.

User ID. Sourced from the JWT subject claim or a custom header, User ID enables per-user consumption reporting. This is the right dimension for multi-tenant SaaS applications where individual end users have their own token budgets, or where you need to identify heavy consumers within a shared subscription.

API ID. Identifies which APIM API generated the consumption. Useful when a single subscription uses multiple AI-backed APIs: one for a conversational agent, one for content generation, and one for document summarization. API ID lets you break down cost by use case rather than just by subscriber.

In practice, combining all three dimensions gives you the most flexibility. A single metric event tagged with Subscription ID, User ID, and API ID can answer questions at every level: how much did the platform spend in total, how much did Team A spend, how much did User X consume, and which AI feature is the most expensive to run.
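
A small Python sketch shows why tagging every event with all three dimensions pays off. The same events can be rolled up at any level. The event records, field names, and the tokens_by helper are made up for illustration. Real data would come from Application Insights.

```python
from collections import defaultdict

# Hypothetical metric events, each tagged with all three dimensions.
events = [
    {"subscription_id": "team-a", "user_id": "u1", "api_id": "chat", "total_tokens": 1200},
    {"subscription_id": "team-a", "user_id": "u2", "api_id": "chat", "total_tokens": 800},
    {"subscription_id": "team-a", "user_id": "u1", "api_id": "summarize", "total_tokens": 500},
    {"subscription_id": "team-b", "user_id": "u3", "api_id": "chat", "total_tokens": 300},
]


def tokens_by(events, *dims):
    """Sum total_tokens grouped by the given dimension names."""
    totals = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dims)
        totals[key] += event["total_tokens"]
    return dict(totals)


platform_total = tokens_by(events)                      # one group: the whole platform
per_team = tokens_by(events, "subscription_id")         # Team A vs Team B
per_user = tokens_by(events, "user_id")                 # heaviest individual consumers
per_feature = tokens_by(events, "api_id")               # most expensive AI feature
```

Dropping a dimension at emit time removes the corresponding roll-up for good, which is why it is cheaper to emit all three from the start.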

Querying Token Metrics in Application Insights

Once the policy is emitting metrics, you query them in Application Insights using the custom metrics namespace you configured. The metrics appear under the namespace name you set in the policy (for example, “AzureOpenAI” or “MyLLM”), with separate metric events for prompt tokens and completion tokens.

A practical starting point is a KQL query that aggregates the total number of tokens by Subscription ID over the past 30 days. From there, you can add filters by API ID to isolate specific workloads, or pivot by User ID to identify the highest consumers within a team.

For FinOps dashboards, the most useful view is a stacked time-series chart of total token consumption broken down by subscription, updated daily. This gives finance and engineering a shared view of AI spend trends without exporting data from Azure Monitor to a separate BI tool. Azure Workbooks can host this directly in the Azure portal, making it accessible to non-technical stakeholders.

From Observability to Cross-Charging

Observability is the prerequisite for cross-charging. However, they are not the same thing. Observability tells you what happened. Cross-charging, by contrast, is the organizational process of allocating those costs to the right budget owners.

The token metric policy gives you the raw data. To turn that into a cross-charge, you need two additional steps. First, agree on a price per token with your finance team, usually derived from the Azure cost per 1,000 tokens for your model and region. Second, automate a monthly report that multiplies each subscription’s token consumption by that agreed price.

This does not need to be complex. For example, a Logic App or Azure Function that queries Application Insights on the first of each month works well for most organizations starting out. It aggregates tokens by subscription, multiplies by the agreed rate, and emails a cost summary to each team lead. The Application Insights REST API makes this straightforward to automate.
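
The arithmetic behind such a report is simple enough to sketch in Python. The cross_charge function, the subscription names, and the rate of 0.01 per 1,000 tokens are placeholders, not real Azure prices. The token totals would come from your Application Insights query.

```python
from decimal import Decimal


def cross_charge(tokens_by_subscription, price_per_1k):
    """Return the monthly charge per subscription, rounded to two decimals."""
    cent = Decimal("0.01")
    return {
        subscription: (Decimal(tokens) / 1000 * price_per_1k).quantize(cent)
        for subscription, tokens in tokens_by_subscription.items()
    }


# Hypothetical monthly totals and an agreed (made-up) rate per 1,000 tokens.
monthly_tokens = {"team-a": 2_500_000, "team-b": 300_000}
agreed_rate = Decimal("0.01")

charges = cross_charge(monthly_tokens, agreed_rate)
```

Decimal is used instead of float so that the amounts in the emailed summary match what finance computes by hand.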

Finally, the most important advice: have this conversation with finance and product teams before AI consumption scales. Retroactive cross-charging is significantly harder to establish than an upfront model with clear methodology and tooling.

What’s Next in This Azure API Management for AI Series

Part 5 covers load balancing and circuit breaking: how to distribute traffic across PTU and PAYG backends, configure backend pools, and set up circuit breaker rules for automatic failover when a primary endpoint becomes unavailable.

AI Gateway Commercial vs Open Source: How to Choose the Right Control Plane

The AI gateway commercial vs. open-source decision is one that most organizations reach not by planning but by accident. One team has already integrated directly with Azure OpenAI. Another is using LiteLLM to wrap a few models. A third wants to use the enterprise API management platform you already have. Suddenly, you need to make a choice, and the conversation gets complicated fast.

This companion post to my APIM for AI Workloads series takes a step back from the Azure API Management specifics and addresses the question that comes before all of it: which gateway should you be using in the first place? The series covers APIM in depth because it’s the right answer for the Microsoft ecosystem. But it’s not the only answer, and for some organizations it’s not the right one.

Here is how to think through the decision properly.

Why the AI Gateway Commercial vs Open Source Choice Matters More Than You Think

Most API gateway decisions are relatively low-stakes. If you pick the wrong one, you migrate. But the AI gateway decision carries more weight for two reasons.

First, the gateway sits in the critical path of every AI interaction in your organization. Its policy language, authentication model, and observability hooks become embedded in the way your teams build AI-powered applications. Switching later is not impossible, but it is disruptive.

Second, the governance patterns you establish now, how you handle token limits, cross-charging, PII, and compliance logging, are much harder to retrofit than to design in from the start. The Team Rockstars IT AI Gateway whitepaper, published this month, makes this point well: organizations that set up audit logging via an AI gateway from day one build a direct compliance advantage under the EU AI Act. Those who add it later risk complex and costly rework.

So the choice deserves deliberate thought, not a default.

The Commercial Options for AI Gateway

Commercial AI gateways offer a faster path to production and offload operational complexity to the vendor. The main options in the market today are:

Azure API Management is the right choice if you are already in the Microsoft ecosystem. Its AI-specific policy extensions for token limits, token metrics, semantic caching, and load balancing across PTU and PAYG backends are mature and tightly integrated with Azure Monitor and Application Insights. The series covers this in depth from Part 1 onwards.

Kong Konnect is a strong option for organizations that already use Kong for API management and want to extend it into AI. Its plugin ecosystem covers rate limiting, authentication, and observability, with AI-specific plugins growing quickly.

Portkey is purpose-built as an AI gateway with a lightweight footprint and fast time-to-value. It supports a broad range of model providers, has built-in semantic caching and observability, and is a practical option for teams that want AI governance without the overhead of a full enterprise API management platform.

Apigee (Google Cloud) is the natural choice for GCP-centric organizations. Like APIM in the Microsoft world, its AI gateway capabilities are deepening with each release as Google embeds Gemini and Vertex AI integrations.

The common advantages across all commercial options are faster deployment, built-in compliance features, vendor support contracts, and operational burden offloaded to the vendor. The common risks are licensing costs, proprietary policy languages that create switching friction, and dependency on the vendor’s roadmap.

The Open Source Options for AI Gateway

Open-source gateways offer maximum control and no licensing costs, but they require your organization to own what the vendor would otherwise handle.

LiteLLM is the most widely adopted open source AI gateway today. It provides a unified API across more than 100 model providers, with built-in rate limiting, spend tracking, and a proxy server that is straightforward to self-host. The community is active, and the feature velocity is high. The supply chain risk is real, though: a 2025 attack targeting LiteLLM and Trivy demonstrated that even widely used security-adjacent tools can become attack vectors. If you run LiteLLM in production, you own the patching cadence.

Agent Gateway (agentgateway), an open source project started by Solo.io and now hosted by the Linux Foundation, is purpose-built for MCP and agentic traffic. If your primary use case is governing tool calls from AI agents rather than managing completion API traffic, it is worth evaluating alongside the broader options.

One API provides a unified, OpenAI-compatible interface across multiple providers and is widely used by organizations seeking provider-agnostic routing without vendor lock-in.

HelixML focuses on self-hosted deployments with strong data-sovereignty properties, making it relevant for organizations where data-residency requirements rule out SaaS-based gateway options.

AI Gateway Commercial vs Open Source: Five Decision Factors

AI gateway commercial vs open source comparison matrix across time to value compliance internal capability flexibility and supply chain risk
Diagram 1: Commercial vs open source AI gateway decision factors. Neither option wins across the board — the right choice depends on your compliance posture, internal capability, and how much operational complexity you want to own.

Five factors consistently determine which direction is right for a given organization:

Time to value. In my experience, commercial gateways can be production-ready in days to weeks. Open source deployments typically take weeks to months to reach production quality, depending on how much custom policy logic you need to build. If you have an urgent compliance or cost control problem to solve, commercial is the pragmatic choice.

Compliance and data residency. For Dutch and European organizations operating under AVG, NIS2, and the EU AI Act, commercial gateways offer contractual guarantees: data processing agreements, certified regions, and SLAs with defined incident response times. Open source can meet the same requirements, but you are responsible for demonstrating compliance yourself rather than relying on a vendor certification.

Internal platform capability. Open source is not free. The licensing cost is zero, but the ownership cost is not, and the CNCF’s platform engineering maturity model is a useful yardstick for judging whether your organization can carry it. Organizations without a dedicated platform engineering team that can credibly own the gateway long-term should not choose open source. The operational gap will become visible at the worst possible moment.

Flexibility and lock-in risk. Open source wins on long-term flexibility. Proprietary policy languages in commercial gateways create switching friction that grows over time as you invest in custom policies. If multi-cloud strategy and provider-agnosticism are strategic priorities, design your gateway layer with that in mind from the start. Even if you begin with a commercial option, you can apply the strangler fig pattern to abstract away proprietary dependencies over time.

Supply chain risk. This factor is underweighted in most evaluations. The 2025 supply chain attack targeting LiteLLM and Trivy demonstrated that open source security tooling itself can become an attack vector. Commercial vendors have contractual obligations around vulnerability disclosure and patching. With open source, that obligation falls to your team.

A Decision Framework for AI Gateway Commercial vs Open Source

AI gateway decision flowchart showing when to choose commercial APIM Kong Portkey versus open source LiteLLM Agent Gateway based on compliance capability and cloud ecosystem
Diagram 2: Decision flowchart for choosing between commercial and open source AI gateways. Compliance requirements, internal capability, and cloud ecosystem fit are the three most decisive factors.

The flowchart above works through the most decisive questions in order. A few practical observations from applying it:

Regulated industries almost always land in commercial. Healthcare, financial services, and insurance organizations operating under Dutch or European regulation have compliance requirements that are significantly easier to satisfy with contractual vendor guarantees than with self-operated open source tooling. At my company, the AVG and healthcare-specific data processing requirements made APIM the clear choice.

The hybrid pattern is underused. Many organizations run a commercial gateway in production for governed workloads, while developer teams use LiteLLM or a lightweight open source option in lower environments for experimentation. This gives you the compliance and operational properties you need in production while keeping the innovation surface open. It is more work to maintain two gateway patterns, but the tradeoff is often worth it.

Design for replaceability regardless of what you choose. The Team Rockstars whitepaper frames this well: choose your first gateway deliberately, but design for replacement. Use open standards, abstract your policy logic where possible, and avoid deep coupling to proprietary features without open-source equivalents. The gateway landscape is evolving fast enough that what is the right choice today may not be in two years.

Where This Fits in the APIM for AI Workloads Series

The rest of the series goes deep on Azure API Management specifically: the token metric policy, load balancing and circuit breaking, semantic caching, and MCP gateway for agentic workloads. If you have landed on APIM as your gateway of choice, or if you are in a Microsoft-centric organization where it is the natural fit, the series covers the production patterns you need.

  • Part 1: Why your AI APIs need a gateway.
  • Part 2: Authentication and authorization.
  • Part 3: Token limit policy.
  • Part 4: Token metric policy and cross-charging.
  • Part 5: Load balancing and circuit breaking.
  • Part 6: Semantic caching.
  • Part 7: APIM as MCP gateway for agentic AI workloads.

Azure API Management Token Limit Policy: Controlling AI Token Consumption Per Consumer

Part 3 of 7 in the “APIM for AI Workloads” series

The Azure API Management token limit policy is one of the most direct cost control levers you have for AI workloads. In Part 1 of this series, I argued that token consumption is invisible without the right instrumentation. The token limit policy is the enforcement side of that equation: once you know how many tokens consumers are using, you set boundaries so that no single consumer can exhaust your model capacity or run up an unexpected bill.

This post covers how the policy works, which counter-key strategy to choose for your workload, how to size your tokens-per-minute (TPM) limits, and the difference between the Azure OpenAI-specific policy and the generic LLM variant for non-Microsoft backends.

Azure API Management Token Limit Policy: How It Works

The azure-openai-token-limit policy sits in the inbound section of your APIM policy pipeline. Before any request reaches the AI backend, APIM checks a sliding window counter keyed to the value you specify. If the caller is within their TPM budget, the request passes through. If they’ve exceeded it, APIM returns a 429 Too Many Requests response with a Retry-After header, and the backend never sees the request.

This is important: the throttling happens at the gateway, not at the Azure OpenAI endpoint. That means you’re not paying for rejected requests, and your model deployment is protected from saturation by a single runaway consumer.
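
The gateway-side check can be pictured with a small sliding-window counter in Python. This is a conceptual sketch, not how APIM implements the policy. The TokenLimiter class, its method names, and the way Retry-After is derived from the oldest entry in the window are assumptions.

```python
import math
import time
from collections import defaultdict, deque


class TokenLimiter:
    """Sliding 60-second token budget per counter key (illustrative only)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.tpm = tokens_per_minute
        self.clock = clock
        self.usage = defaultdict(deque)     # counter key -> deque of (timestamp, tokens)

    def _used(self, key, now):
        window = self.usage[key]
        while window and now - window[0][0] >= 60:
            window.popleft()                # drop usage older than one minute
        return sum(tokens for _, tokens in window)

    def check(self, key):
        """Inbound check: return (allowed, retry_after_seconds)."""
        now = self.clock()
        if self._used(key, now) < self.tpm:
            return True, 0
        oldest = self.usage[key][0][0]
        return False, max(1, math.ceil(60 - (now - oldest)))

    def record(self, key, tokens):
        """Called once the backend response reports actual token usage."""
        self.usage[key].append((self.clock(), tokens))
```

When check() returns False, the gateway would answer 429 with the returned value in the Retry-After header and never forward the request, which is the cost-saving property described above.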

Azure API Management token limit policy funnel throttling AI requests with 429 response and Retry-After header
Diagram 1: The token limit policy acts as a funnel in the APIM inbound pipeline. Requests within the TPM budget pass through to the AI backend. Requests exceeding the limit receive a 429 status code with a Retry-After header before the backend is even reached.

The policy has two variants. The azure-openai-token-limit policy is purpose-built for Azure OpenAI and Microsoft Foundry endpoints, and uses the actual token counts returned in the API response. The llm-token-limit policy is the generic variant for any LLM backend, including Mistral, Cohere, and others. Both share the same attribute model, so the configuration patterns below apply to either.

Choosing a counter-key for Azure API Management Token Limiting

The counter-key attribute is the most important decision in configuring the token limit policy. It determines the scope of the limit: who shares a TPM bucket, and who gets their own.

Azure API Management token limit policy counter-key strategies per subscription IP address and JWT claim with TPM sizing table
Diagram 2: Three counter-key strategies and TPM sizing guidance by workload type. The right scope depends on whether you are separating teams, protecting a public endpoint, or enforcing per-user limits in a multi-tenant application.

The three main strategies are:

Per subscription: @(context.Subscription.Id). The usual choice for internal deployments where each APIM subscription maps to a team or product. Every subscription gets its own TPM bucket, so one team cannot consume another team’s budget.

Per IP address: @(context.Request.IpAddress). Better suited to public-facing endpoints or developer portals where you don’t have a subscription model. It’s a blunt instrument — NAT and shared egress can mean multiple users share a counter — but it’s effective for abuse prevention and trial access scenarios.

Per JWT claim or custom header: @(context.Request.Headers.GetValueOrDefault("x-user-id","")). The most flexible option. If your application passes a user identifier in a header or JWT claim, you can scope limits to the individual user. This is the right approach for multi-tenant applications where each end user should have their own token budget, independent of which subscription they’re calling through.

Sizing Your TPM Limits

TPM limits are context-dependent, but a few principles apply across most workloads.

Start by profiling your actual token usage in a staging environment before setting production limits. The remaining-tokens-variable-name attribute exposes the remaining token budget as a policy variable, which you can log via the Token Metric policy to build a usage baseline before enforcing hard limits.

For the estimate-prompt-tokens attribute: set it to false in production. When set to true, APIM estimates prompt tokens before the response is returned, enabling earlier throttling but reducing accuracy. In practice, counting actual tokens from the response is more reliable and avoids throttling requests that would have been within budget.

A common mistake is setting a single global TPM limit too low, which throttles all consumers the moment a batch job runs on any team. The better pattern is tiered limits by API product: a Developer product with a low TPM ceiling, a Standard product for normal workloads, and an Unlimited product for production pipelines that need burst capacity.

Handling 429 Responses in Calling Applications

Any application calling an APIM-fronted AI endpoint needs to handle 429 responses gracefully. APIM returns a Retry-After header indicating how many seconds until the token window resets. Well-behaved clients respect this header and back off rather than retrying immediately.
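
A well-behaved client can be sketched in Python as follows. The call_with_retry_after function and the shape of the send callable, which returns a (status, headers, body) tuple, are assumptions for the example. In practice you would wrap whatever HTTP client your application already uses.

```python
import time


def call_with_retry_after(send, max_attempts=5, default_delay=1.0, sleep=time.sleep):
    """Call `send` and honour the Retry-After header on 429 responses.

    `send` is a zero-argument callable returning (status, headers, body).
    Returns (status, body) for the first non-429 response.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break                                   # no point sleeping before giving up
        try:
            delay = float(headers.get("Retry-After", default_delay))
        except ValueError:
            delay = default_delay                   # header was not a plain number of seconds
        sleep(delay)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

The important property is that the wait comes from the gateway rather than from a fixed client-side guess, so clients resume as soon as the window resets and no sooner.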

For agentic workloads with multiple pipeline steps, a 429 response midway through can leave the agent in an inconsistent state. The recommended pattern is to expose the remaining-tokens-variable-name value in a response header so the calling application can monitor its own budget and slow down proactively, rather than waiting for a hard rejection.
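
One way an orchestrator could act on that header is to pause between pipeline steps once the remaining budget falls below a low-water mark. The sketch below is a hypothetical pacing rule in Python. The pacing_delay function, the 20% low-water mark, and the 5-second ceiling are illustrative choices, and the header that carries the remaining budget is whatever name you configure in your own policy.

```python
def pacing_delay(remaining_tokens, tokens_per_minute, low_water=0.2, max_delay=5.0):
    """Return how many seconds to pause before the next agent step.

    No pause while more than `low_water` of the budget remains; below that,
    the pause grows linearly up to `max_delay` as the budget approaches zero.
    """
    fraction = max(0.0, remaining_tokens / tokens_per_minute)
    if fraction >= low_water:
        return 0.0
    return max_delay * (1 - fraction / low_water)


# Example: the orchestrator reads the remaining budget from the response header
# after each step and sleeps for pacing_delay(...) before issuing the next call.
```

A linear ramp is only one option. The point is that the agent slows itself between steps, where a pause is harmless, instead of being rejected in the middle of one.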

The Azure OpenAI token limit policy documentation covers the full attribute reference, including tokens-per-minute, counter-key, estimate-prompt-tokens, and remaining-tokens-variable-name. The llm-token-limit variant has the same interface for non-Azure backends.

What’s Next in This Azure API Management for AI Series

Part 4 covers the Token Metric policy: how to emit token usage data to Application Insights broken down by consumer dimensions, and how to use that data for internal cross-charging and spend dashboards.

Junior Developer Pipeline AI Crisis: The Narrowing Pyramid

Back in February, I wrote about how AI is reshaping software development at a cost we’re only beginning to understand. I looked at three threads: skill erosion, open-source sustainability, and Agile methodology. Together they pointed at the same underlying tension: AI is accelerating what we can measure while quietly degrading what we can’t. Two months later, the evidence has hardened, and a new dimension has come into focus. It’s not just about individual developers losing depth. It’s about the junior developer pipeline itself.

How AI is collapsing the junior developer pipeline

For my April piece on InfoQ, I covered a peer-reviewed opinion paper by Microsoft Azure CTO Mark Russinovich and VP Scott Hanselman, published in the April 2026 issue of Communications of the ACM. Their argument is precise and uncomfortable: agentic AI coding tools are creating a structural crisis in software engineering. AI boosts senior engineers while imposing what they call “AI drag” on early-in-career developers who haven’t yet built the judgment to steer, verify, and fix AI output. The incentive consequence is predictable: organizations hire seniors, automate juniors, and the pipeline that produces the next generation of seniors quietly collapses.

They call this the “narrowing pyramid hypothesis.” Traditionally, junior developers grow through the bottom rungs: bug fixes, straightforward implementation, exposure to real architecture, and build systems. Over time, the best rise to the tech lead role. When AI eliminates that entry-level work, the apprenticeship disappears with it.

Two pyramid diagrams comparing the traditional junior developer pipeline versus the AI-era pipeline, showing how AI eliminates entry-level roles and collapses career progression from the bottom up.
The junior developer pipeline is eroding from the bottom: entry-level roles are disappearing faster than organizations are replacing them.

Payroll records, not projections

A Harvard study cited in the paper found that after GPT-4’s release, employment of 22- to 25-year-olds in AI-exposed jobs fell by roughly 13%, even as senior roles grew. The Stanford AI Index 2026 adds a harder data point: employment for software developers aged 22 to 25 has dropped nearly 20% from its 2022 peak, based on ADP payroll data matched against AI exposure. These aren’t speculative projections; they’re payroll records.

The structural picture at the job posting level is equally stark. Postings for developer roles such as Android, Java, .NET, iOS, and web development are down 60% or more from 2020, while postings for machine learning engineers are up 59%. Forrester’s 2026 Predictions project a 20% decline in CS enrolments. Prospective students are responding to deteriorating signals in the job market. Fewer graduates entering today means a potential shortage of senior engineers in five to ten years.

The vacancy chain is breaking. In a healthy market, a senior leaves, a mid-level moves up, and a junior gets hired. AI disrupts this chain by automating the bottom link, severing the pathway for new entrants. This is the mechanism behind the junior developer pipeline crisis: the bottom rung disappears, and the whole ladder stops working.

What the productivity data doesn’t show

I spend a lot of time on the productivity narrative because it dominates the boardroom conversation. McKinsey analyzed nearly 300 publicly traded companies and found that top-quintile performers are achieving 16-30% improvements in productivity and 31-45% gains in software quality. That’s real. I don’t dispute it.

In principle, productivity gains at the team level and investment in the junior developer pipeline are not mutually exclusive; in practice, organizations are treating them as if they were. A Harvard study that tracked 62 million workers across 285,000 US firms found that junior employment at AI-adopting companies declined 9-10% within six quarters of implementation, while senior employment remained virtually unchanged. Organizations are taking the productivity gains and banking them rather than reinvesting them in the junior developer pipeline.

The problem with this trade-off is temporal. The people who become your senior engineers in 2031 need to be junior engineers today. A Harvard/Berkeley study captures the downstream effect for the seniors who remain: instead of mentoring juniors, senior engineers now spend hours reviewing and fixing AI-generated code. One engineer described the experience as being “a judge at an assembly line that is never-ending.” The time cost of AI output review hasn’t disappeared; it’s just shifted upward.

Why the junior developer pipeline problem runs deeper than hiring

This connects directly to what I wrote in February, drawing on the Anthropic study showing that developers using AI assistance scored 17% lower on comprehension tests. The dividing line sat around a 65% threshold: above it were developers using AI as a thinking partner; below it were those who had delegated the thinking entirely.

Russinovich and Hanselman illustrate what that delegation looks like in production. An AI agent responding to a race condition inserted a sleep call, a classic masking fix that hides the underlying synchronization bug. An experienced engineer catches this immediately. A developer who has never debugged a real race condition, because they’ve always had an AI to write the code, does not. The term they use for what’s being lost is “systems taste,” the intuition developed through years of production exposure. You can’t prompt-engineer your way to systems taste.

The US Bureau of Labor Statistics reports that overall employment for programmers fell 27.5% between 2023 and 2025, while more design-oriented software developer roles held roughly flat. The market is bifurcating: roles that require judgment are surviving, roles that are primarily about code production are being automated. If junior developers are no longer doing the production work that builds the judgment, we’re investing in neither.

A possible path forward

Rebuilding the junior developer pipeline requires treating it as infrastructure, not overhead. Russinovich and Hanselman propose a preceptor model borrowed from medical education: pair early-career developers with experienced mentors in real product teams, with learning measured and compensated as an explicit organizational goal rather than a side effect of shipping.

The preceptor model

The senior’s role shifts from “person who answers questions” to “person who teaches judgment.” The pair uses AI tools together, with the senior observing what the junior accepts, rejects, and misses. Hanselman explained the thinking behind this in a LeadDev interview: just as a nurse needs to prove clinical readiness, engineers need to do the same to earn the title.

Honeycomb CTO Charity Majors noted on X in response to the paper that at every organization she has seen successfully hire junior engineers in recent years, the charge was led and lobbied for by senior engineers. That’s the critical variable. This isn’t something HR can mandate. It requires senior engineers who recognize that their own long-term relevance depends on a healthy profession.

Community reaction has been sharp on the question of whether good intentions survive corporate incentive structures. One Reddit thread puts it bluntly: the math doesn’t work. Hiring a junior who takes two years to become productive loses out against an AI assistant that makes a mid-level engineer 30% more productive today. Unless you’re training juniors specifically to oversee AI output, which is not what CS programs teach.

That reframe is actually the right one. The goal isn’t to protect junior roles from AI. It’s to deliberately design a new apprenticeship path in which AI is part of what juniors learn to manage, not a substitute for the learning process.

More code is not better architecture

From my vantage point as both an enterprise architect and a technology editor, I see this playing out in real organizations. The teams moving fast on AI tooling are producing more code. But “more code” and “better architecture” are different things, and the gap only becomes visible under pressure: during incidents, migrations, and the inevitable moment when someone has to explain why a system behaves the way it does. That explanation requires comprehension, not generation.

I wrote in February that the choices made in the next year or two would shape the industry for a decade. The narrowing junior developer pipeline makes that window feel shorter than it did then.

Azure API Management for AI: Securing Your AI APIs with Authentication and Authorization

Part 2 of 7 in the “APIM for AI Workloads” series

In Part 1 of this series, I made the case for why Azure API Management for AI workloads is the right control plane for governing AI traffic across an organization. This post gets practical: how do you actually secure access to your AI backends with APIM without creating a credential-management nightmare?

Security is where many AI projects cut corners, and understandably so. When you’re moving fast to prove value with a new model, authentication feels like overhead. But AI endpoints are expensive, and an unsecured Azure OpenAI endpoint is a real risk: anyone with the URL and key can start consuming tokens at your cost. At scale, that’s a significant financial and compliance exposure.

APIM addresses this with a three-layer security model. Let’s walk through each layer.

Azure API Management for AI Security: A Three-Layer Model

The authentication and authorization pattern in APIM is deliberately layered. Each layer answers a different question and operates independently, so a failure at any layer stops the request before it reaches the AI backend.

Azure API Management for AI three-layer authentication flow showing subscription key, JWT validation and Managed Identity policy pipeline
Diagram 1: Three-layer auth in APIM for AI workloads. Layer 1 identifies the caller via subscription key. JWT validation in Layer 2 then determines what they’re permitted to do. Finally, Layer 3 authenticates APIM itself to the AI backend via Managed Identity.

The three layers are:

  • Subscription keys to identify and track API consumers.
  • JWT validation to enforce fine-grained access control based on claims.
  • Managed Identity to authenticate APIM to Azure OpenAI without storing credentials.

Each layer has a distinct role. Confusing them is a common mistake, so it’s worth being explicit about what each one does and does not do.

Layer 1: Subscription Keys

Subscription keys are APIM’s mechanism for identifying API consumers. When you create an API product in APIM and require a subscription, callers must include their key in the Ocp-Apim-Subscription-Key header. APIM validates the key, maps it to a subscriber, and lets the request proceed.

This is important for AI workloads specifically because subscription keys enable per-consumer token tracking. When you combine subscription key validation with the Token Metric policy we’ll cover in Part 4, you get usage data broken down by subscriber, which is the foundation of any internal cross-charging model.

Subscription keys answer the question: Who is calling? They don’t answer what the caller is allowed to do. For that, you need JWT validation.

Layer 2: JWT Validation and Claims-Based Authorization

The validate-jwt policy is where you enforce what a caller is permitted to do. It validates the JWT in the Authorization header against your identity provider and can inspect any claim in the token to make authorization decisions.

For Azure OpenAI specifically, this is where you control which teams or applications can access which model deployments. A team working on an internal chatbot should not be able to call a GPT-4o deployment reserved for a production workload. JWT claims let you enforce that boundary at the gateway layer, with no changes required in the calling application.

A typical policy checks the token signature against your Azure AD tenant’s OpenID Connect configuration, then validates that a required scope or role claim is present.
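A minimal sketch of such a policy follows. The tenant ID, audience URI, and role name are placeholders introduced for illustration; substitute the values from your own app registration.

```xml
<validate-jwt header-name="Authorization"
              failed-validation-httpcode="401"
              failed-validation-error-message="Unauthorized. Access token is missing or invalid."
              require-scheme="Bearer">
    <!-- Signing keys and issuer metadata are read from the tenant's OpenID Connect configuration. -->
    <openid-config url="https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration" />
    <audiences>
        <!-- Placeholder audience for the API's app registration. -->
        <audience>api://{api-app-id}</audience>
    </audiences>
    <required-claims>
        <!-- Hypothetical app role; the request is rejected unless the token carries it. -->
        <claim name="roles" match="any">
            <value>OpenAI.Chat.Invoke</value>
        </claim>
    </required-claims>
</validate-jwt>
```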

The failed-validation-httpcode="401" attribute ensures unauthenticated callers get a clean rejection before they ever reach the backend. You can also use failed-validation-error-message to return a specific error message, which helps consumers debug auth failures without exposing internal details.

For multi-provider setups where you’re routing to non-Azure backends like Mistral or Cohere, the same JWT policy applies. The claims model is provider-agnostic, which is one of the advantages of centralizing auth in APIM rather than handling it per-backend.

Layer 3: Managed Identity for Backend Authentication

Managed Identity is the most important security improvement you can make when setting up Azure API Management for AI. It replaces the pattern of storing an Azure OpenAI API key in APIM’s named values with a system-assigned or user-assigned Managed Identity that APIM uses to authenticate directly to Azure OpenAI via Azure AD.

Azure API Management for AI comparing API key authentication risks versus Managed Identity benefits for Azure OpenAI backend access
Diagram 2: API key authentication (left) vs. Managed Identity (right). The key difference is that Managed Identity requires no stored credentials anywhere in your configuration.

The practical difference is significant. With API key authentication, you have a long-lived secret that needs to be stored, rotated, and kept out of source control. With Managed Identity, there is no secret. APIM requests a short-lived token from Azure AD at runtime, and Azure AD issues it based on the APIM instance’s identity. Nothing is stored. Nothing can leak.

The configuration is a single policy element in the inbound section: <authentication-managed-identity resource="https://cognitiveservices.azure.com"/>. APIM handles the rest, automatically fetching and refreshing the token.

On the Azure OpenAI side, you grant the APIM instance’s Managed Identity the Cognitive Services User role on the Azure OpenAI resource. That’s the minimum required permission. You can scope it further to specific deployments if needed.

For organizations in regulated industries, such as healthcare, financial services, and government, Managed Identity is not optional. It satisfies Zero Trust authentication requirements and produces a full audit trail in Azure Monitor, tied to the APIM instance identity rather than a shared key.

Azure API Management for AI: Putting the Three Layers Together

In a production setup, all three layers run sequentially within the inbound policy pipeline. A request arrives with a subscription key and a JWT. APIM validates the key first (fast, no external call), then validates the JWT against Azure AD, then forwards the request to Azure OpenAI using its Managed Identity token. The AI backend never sees the caller’s JWT, and APIM never stores an API key.
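Put together, the inbound section might look like the sketch below. The tenant ID, audience, and role name are placeholders, not values from a real deployment. The subscription key check does not appear as a policy element because, as far as I can tell, the gateway performs it before the policy pipeline runs whenever the API or product requires a subscription.

```xml
<inbound>
    <base />
    <!-- Layer 1: the subscription key is checked by the gateway itself when the
         API or product requires a subscription, so no policy element is needed. -->

    <!-- Layer 2: validate the caller's JWT and require a (hypothetical) app role. -->
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401" require-scheme="Bearer">
        <openid-config url="https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration" />
        <audiences>
            <audience>api://{api-app-id}</audience>
        </audiences>
        <required-claims>
            <claim name="roles" match="any">
                <value>OpenAI.Chat.Invoke</value>
            </claim>
        </required-claims>
    </validate-jwt>

    <!-- Layer 3: replace the caller's bearer token with APIM's own token for the backend. -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
```

Ordering matters here: the managed identity element comes last because it overwrites the Authorization header, so the JWT has to be validated before that point.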

The result is a clean separation of concerns:

  • The calling application manages its own JWT (issued by Azure AD based on its own identity or the user’s identity).
  • APIM enforces the authorization policy without the backend needing to know anything about it.
  • The AI backend trusts only APIM’s Managed Identity, not arbitrary callers.

This is the architecture you want before you go to production with any AI workload that touches sensitive data or incurs meaningful cost.

What’s Next in This Series

Part 3 covers the Token Limit policy: how to enforce tokens-per-minute limits per consumer, configure throttling behavior, and handle the differences between the azure-openai-token-limit and llm-token-limit policy variants.