Microsoft Foundry Citadel Platform on Azure: A Practitioner’s Deployment Guide

Microsoft Foundry Citadel Platform on Azure is a layered AI governance architecture that delivers production-ready agent deployments with unified governance, end-to-end observability, and centralized policy enforcement via Azure API Management. It is still in preview, and the documentation assumes a degree of familiarity with Azure infrastructure that not everyone has on day one. This post walks through what it actually takes to get a working hub-and-spoke running in Sweden Central, including the pitfalls, so you can decide whether it is a viable starting point for your own AI platform journey.

What Citadel Is (and Is Not)

Before touching the tooling, it helps to understand what Citadel actually deploys. The architecture has four layers:

Layer 1, the Governance Hub, is the runtime enforcement plane: Azure API Management as a centralized AI gateway, Azure API Center as a model registry, and supporting services for content safety, PII detection, cost attribution, and usage telemetry.

Layer 2, the AI Control Plane, provides observability via the Foundry Control Plane: agent-level execution traces, AI evaluations in development and production, red-teaming, drift monitoring, and fleet dashboards.

Layer 3, Agent Identity, transforms agents into managed enterprise assets via Microsoft Entra ID, with lifecycle management, sponsorship models for human accountability, and shadow AI discovery.

Layer 4, the Security Fabric, weaves Defender, Purview, and Entra across the other three layers for real-time threat intelligence, data governance, and compliance automation.

For this guide, we deploy Layer 1 (the Governance Hub via the AI Hub Gateway Solution Accelerator) and a Layer 1/2 spoke (via the AI Landing Zone Bicep). Layers 3 and 4 reference existing Azure services (Entra ID, Defender, Purview) that you integrate separately.

Important: Citadel is currently in preview. The repos, parameter schemas, and CLI commands will change. Treat everything in this post as a starting point, not a stable reference.

Prerequisites

Before you start, make sure you have:

  • An Azure subscription with Azure OpenAI access approved (aka.ms/oaiapply)
  • Microsoft.Authorization/roleAssignments/write on the subscription (Owner or User Access Administrator role)
  • Azure CLI installed and authenticated (az login)
  • Azure Developer CLI (azd) installed
  • Node.js — use v20 LTS, not v24. Node 24 on Windows has a known issue where npm bundles are incomplete, causing MODULE_NOT_FOUND errors on npm-cli.js and npm-prefix.js when azd tries to package Logic App components

If you run into npm issues on Windows, the cleanest workaround is Azure Cloud Shell, where Node, npm, az, and azd are all pre-installed and healthy.

Part 1: Deploying the Microsoft Foundry Citadel Governance Hub

Clone the AI Hub Gateway Solution Accelerator:

git clone https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator.git
cd ai-hub-gateway-solution-accelerator

Create your azd environment:

azd auth login
azd env new ai-hub-gateway-dev
azd env set AZURE_LOCATION swedencentral

Create a parameters file at infra/main.parameters.json. The key decisions:

Model versions matter. At the time of writing, gpt-4o-mini versions 2024-07-18 and 2024-10-18 are retired. Use gpt-4o version 2024-11-20 with GlobalStandard SKU. Always verify current model availability at aka.ms/aoai-regions before deploying; these change frequently.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environmentName": { "value": "ai-hub-gateway-dev" },
    "location": { "value": "swedencentral" },
    "apimSku": { "value": "Developer" },
    "openAiInstances": {
      "value": {
        "openAi1": {
          "name": "openai1",
          "location": "swedencentral",
          "deployments": [
            {
              "name": "chat",
              "model": { "format": "OpenAI", "name": "gpt-4o", "version": "2024-11-20" },
              "sku": { "name": "GlobalStandard", "capacity": 20 }
            },
            {
              "name": "embedding",
              "model": { "format": "OpenAI", "name": "text-embedding-3-large", "version": "1" },
              "sku": { "name": "Standard", "capacity": 20 }
            }
          ]
        }
      }
    },
    "provisionFunctionApp": { "value": false },
    "createAppInsightsDashboard": { "value": false },
    "enableAIGatewayPiiRedaction": { "value": true },
    "enableAIModelInference": { "value": true }
  }
}
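Because a retired model version only surfaces as a failure partway through a long deployment, it can help to inspect the parameters file first. The following is a minimal Python sketch, not part of the accelerator: the function names are hypothetical, and the RETIRED set contains only the two versions named above, so you would need to maintain it yourself against aka.ms/aoai-regions.

```python
# Hypothetical pre-flight check; only the two versions named in this post are listed.
RETIRED = {("gpt-4o-mini", "2024-07-18"), ("gpt-4o-mini", "2024-10-18")}


def list_deployments(parameters: dict) -> list[tuple[str, str, str, str]]:
    """Return (deployment name, model name, model version, SKU name) per deployment."""
    rows = []
    instances = parameters["parameters"]["openAiInstances"]["value"]
    for instance in instances.values():
        for dep in instance["deployments"]:
            rows.append(
                (dep["name"], dep["model"]["name"], dep["model"]["version"], dep["sku"]["name"])
            )
    return rows


def retired_deployments(parameters: dict) -> list[str]:
    """Return the names of deployments whose (model, version) pair is in RETIRED."""
    return [
        name
        for name, model, version, _sku in list_deployments(parameters)
        if (model, version) in RETIRED
    ]
```

To use it, load infra/main.parameters.json with json.load and pass the resulting dict to retired_deployments; an empty list means none of the listed versions are in use.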

Deploy:

azd up

Expect 45–90 minutes. APIM Developer SKU is the slow component. If the deployment fails partway through, re-run azd up; it is idempotent and will pick up where it left off.

Azure CLI output showing successful deployment of the Microsoft Foundry Citadel Governance Hub including APIM, Azure OpenAI chat and embedding model deployments, private endpoints, and Logic App in Sweden Central.
The AI Hub Gateway Solution Accelerator was deployed successfully in Azure Sweden Central after 21 hours and 31 minutes, provisioning APIM, Azure OpenAI, Content Safety, Application Insights, private endpoints, and the usage processing Logic App.

Pitfall: Managed Identity Race Condition

You will likely see this error on first attempt:

BadRequest: The provided principal ID was not found in the AAD tenant(s)

This is a known race condition — the Managed Identity is created but has not yet propagated in Entra ID before the role assignment fires. Re-run azd up without any changes and it will succeed.
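If you script the deployment, the manual re-run can be expressed as a small retry loop. This is a sketch under assumptions: the helper names, attempt count, and delay are all hypothetical, and matching on the error text is a heuristic taken from the message quoted above rather than anything azd documents.

```python
import subprocess
import time
from typing import Callable

# Substring of the error quoted above; treated here as the signal to retry.
PROPAGATION_ERROR = "was not found in the AAD tenant"


def run_azd_up() -> tuple[int, str]:
    """Run `azd up` and return (exit code, combined stdout and stderr).

    Output is captured, so this only suits runs that need no interactive input.
    """
    result = subprocess.run(["azd", "up"], capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def retry_on_propagation(
    run: Callable[[], tuple[int, str]],
    attempts: int = 3,
    delay_seconds: float = 60.0,
) -> bool:
    """Retry only when the failure looks like the identity propagation race."""
    for attempt in range(1, attempts + 1):
        code, output = run()
        if code == 0:
            return True
        if PROPAGATION_ERROR not in output:
            # Any other failure needs a human; do not mask it with retries.
            raise RuntimeError(f"deployment failed for another reason:\n{output}")
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

Calling retry_on_propagation(run_azd_up) would run the real deployment; passing the runner as an argument keeps the retry logic separate from the CLI call.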

Validate the Hub

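Once deployed, run the following (in PowerShell, where grep is not available by default, pipe to Select-String APIM instead):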

azd env get-values | grep APIM

You will get your APIM gateway URL. Test it with a chat completion:

$headers = @{
"Content-Type" = "application/json"
"api-key" = "<YOUR_APIM_SUBSCRIPTION_KEY>"
}
$body = '{"messages":[{"role":"user","content":"Hello from the AI Hub Gateway!"}],"max_tokens":100}'
Invoke-RestMethod `
-Uri "https://<your-apim>.azure-api.net/openai/deployments/chat/chat/completions?api-version=2024-02-01" `
-Method POST -Headers $headers -Body $body

PowerShell output showing a successful chat completion response from the Microsoft Foundry Citadel APIM gateway in Azure Sweden Central, with content filter results, prompt filter results, and token usage confirmed.
Validating the Citadel Governance Hub by calling the APIM gateway endpoint via PowerShell. The response confirms gpt-4o-2024-11-20 routing, Content Safety filtering, PII redaction, and token usage tracking are all active.

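A successful response with content_filter_results and prompt_filter_results indicates that the gateway routed the request to the model and that content filtering ran on it. The usage block reports the token counts that the hub's usage pipeline is meant to record in Cosmos DB for cost attribution; check the Cosmos DB container to confirm that the record actually landed.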
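The same check can be automated once the response body has been parsed. The sketch below is written in Python rather than PowerShell, uses a hypothetical function name, and assumes the response has the usual Azure OpenAI chat completion shape (choices, prompt_filter_results, usage).

```python
def check_gateway_response(response: dict) -> list[str]:
    """Return a list of problems; an empty list means every expected field is present."""
    problems = []

    choices = response.get("choices") or []
    if not choices:
        problems.append("no choices returned")
    elif "content_filter_results" not in choices[0]:
        problems.append("choices[0] has no content_filter_results")

    if not response.get("prompt_filter_results"):
        problems.append("no prompt_filter_results")

    usage = response.get("usage") or {}
    if usage.get("total_tokens", 0) <= 0:
        problems.append("usage.total_tokens missing or zero")

    return problems
```

An empty result only shows that the fields are present in the response; it does not prove that usage records reached the logging backend.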

Part 2: Deploying a Citadel Platform Agent Spoke on Azure

The spoke is deployed from the AI Landing Zone Bicep repo. Download it as a ZIP (no GitHub account required):

https://github.com/Azure/bicep-ptn-aiml-landing-zone/archive/refs/heads/main.zip

Extract and navigate to the folder. Create a resource group for the spoke:

az group create --name rg-ai-spoke-dev --location swedencentral

Create a spoke.parameters.json file. Several things to know upfront:

The parameter schema is not the same as the Citadel README suggests. The actual template parameters differ from the example file. Key differences discovered in practice: aiFoundryLocation does not exist as a separate parameter; deployMcp, greenFieldDeployment, deployPostgres, and useCMK are not in this version of the template; and solutionStorageAccountName is simply storageAccountName.

The modelDeploymentList uses nested objects, not flat properties:

"modelDeploymentList": {
"value": [
{
"name": "chat",
"model": { "format": "OpenAI", "name": "gpt-4o", "version": "2024-11-20" },
"sku": { "name": "GlobalStandard", "capacity": 20 },
"canonical_name": "CHAT_DEPLOYMENT_NAME",
"apiVersion": "2025-04-01-preview"
},
{
"name": "text-embedding",
"model": { "format": "OpenAI", "name": "text-embedding-3-large", "version": "1" },
"sku": { "name": "Standard", "capacity": 10 },
"canonical_name": "EMBEDDING_DEPLOYMENT_NAME",
"apiVersion": "2025-04-01-preview"
}
]
}

containerAppsList cannot be an empty array. The template references containerApps[0] internally and will fail validation if the array is empty. Pass at least one placeholder entry.
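Both constraints above (nested model and sku objects, and a non-empty containerAppsList) can be checked before a deployment is submitted. This is a hypothetical Python sketch based only on the behaviour described in this post; it does not read the template's real schema.

```python
def validate_spoke_parameters(parameters: dict) -> list[str]:
    """Return a list of problems found in a spoke parameters document."""
    values = parameters.get("parameters", {})
    problems = []

    # The template indexes the first container app, so the list must not be empty.
    apps = values.get("containerAppsList", {}).get("value", [])
    if not apps:
        problems.append("containerAppsList must contain at least one entry")

    # Each model deployment needs nested 'model' and 'sku' objects, not flat keys.
    for dep in values.get("modelDeploymentList", {}).get("value", []):
        for key in ("model", "sku"):
            if not isinstance(dep.get(key), dict):
                problems.append(f"{dep.get('name', '?')}: '{key}' must be a nested object")

    return problems
```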

Deploy:

az deployment group create `
--resource-group rg-ai-spoke-dev `
--template-file main.bicep `
--parameters @spoke.parameters.json

Pitfalls in the Spoke Deployment

AI Search Standard SKU capacity exhaustion. Sweden Central frequently runs out of AI Search Standard SKU capacity. You will see ResourcesForSkuUnavailable. This affects both the standalone Search Service and the AI Foundry Agent Service’s internal Search instance. Disable both:

"deploySearchService": { "value": false },
"deployAAfAgentSvc": { "value": false }

You can re-enable them later once capacity is available, or deploy Search in a different region.

Soft-deleted resources block redeployment. Azure retains soft-deleted Cognitive Services accounts, Key Vaults, and App Configuration stores for a retention period that depends on the service (up to 90 days for Key Vault). If you delete a resource group and redeploy with the same names inside that window, the deployment will fail with FlagMustBeSetForRestore or NameUnavailable. Purge them explicitly before redeploying:

# List and purge soft-deleted resources
az keyvault list-deleted --subscription <sub-id> -o table
az keyvault purge --name <name> --location swedencentral
az appconfig list-deleted --subscription <sub-id> -o table
az appconfig purge --name <name> --location swedencentral --yes
az cognitiveservices account list-deleted --subscription <sub-id> -o table
az cognitiveservices account purge --name <name> --resource-group <resource-group> --location swedencentral

Key Vault purges are slow — allow 2–5 minutes per vault.

Bastion subnet ID resolution fails with networkIsolation=false. When you disable network isolation, the template passes a relative subnet ID to Bastion instead of a fully qualified resource ID. Disable Bastion, Jump VM, and NAT Gateway for the dev spoke:

"deployBastion": { "value": false },
"deployJumpbox": { "value": false },
"deployVM": { "value": false },
"deployNatGateway": { "value": false }

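Write parameters files without BOM. In Windows PowerShell 5.1, Out-File -Encoding utf8 adds a Byte Order Mark that causes az deployment to fail with Unable to parse parameter. Use either of the following (utf8NoBOM requires PowerShell 6 or later; the .NET call works in both):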

$content | Out-File -FilePath "spoke.parameters.json" -Encoding utf8NoBOM
# or
[System.IO.File]::WriteAllText("spoke.parameters.json", $content, [System.Text.UTF8Encoding]::new($false))
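If a parameters file has already been written with a BOM, it can be detected and repaired after the fact. The following Python sketch uses only the standard library; the function names are hypothetical.

```python
import codecs
from pathlib import Path


def has_bom(path: Path) -> bool:
    """True if the file starts with the three-byte UTF-8 Byte Order Mark."""
    return path.read_bytes().startswith(codecs.BOM_UTF8)


def strip_bom(path: Path) -> bool:
    """Rewrite the file without its BOM. Returns True if a BOM was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

For example, strip_bom(Path("spoke.parameters.json")) leaves a file that has no BOM untouched, so it is safe to run before every deployment.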

Part 3: Wiring the Citadel Spoke to the Azure APIM Hub

Add the hub’s APIM gateway URL and subscription key to the spoke’s App Configuration:

az appconfig kv set `
--name <spoke-appconfig-name> `
--key "APIM_GATEWAY_URL" `
--label "ai-lz" `
--value "https://<your-apim>.azure-api.net/openai" `
--yes
az appconfig kv set `
--name <spoke-appconfig-name> `
--key "APIM_SUBSCRIPTION_KEY" `
--label "ai-lz" `
--value "<YOUR_APIM_KEY>" `
--yes

Note: az cognitiveservices account connection create with a YAML file for creating an APIM connection in AI Foundry has known bugs in the current CLI version and will throw NoneType or codec errors. Create this connection via the Azure AI Foundry portal UI instead.

Validate End-to-End

$headers = @{
"Content-Type" = "application/json"
"api-key" = "<YOUR_APIM_KEY>"
}
$body = '{"messages":[{"role":"user","content":"Hello from the Citadel spoke!"}],"max_tokens":50}'
Invoke-RestMethod `
-Uri "https://<your-apim>.azure-api.net/openai/deployments/chat/chat/completions?api-version=2024-02-01" `
-Method POST -Headers $headers -Body $body

A successful response with content_filter_results, prompt_filter_results, and usage confirms the full Citadel loop: spoke → APIM gateway → Azure OpenAI → governance telemetry.

PowerShell output showing a successful end-to-end chat completion from the Citadel agent spoke through the Azure APIM Governance Hub, confirming spoke to hub routing, content filter results, and token usage tracking in Sweden Central.
End-to-end validation of the Citadel hub-and-spoke setup: a request from the agent spoke routes through the APIM Governance Hub in Sweden Central, returning a successful gpt-4o response, with Content Safety filtering and token usage tracking confirmed.

What the Microsoft Foundry Citadel Platform Deploys

After following this guide, your rg-ai-hub-gateway-dev resource group contains:

  • APIM gateway with content safety, PII redaction, token rate limiting, and cost attribution policies
  • Azure OpenAI with gpt-4o and text-embedding-3-large
  • Cosmos DB for usage event logging
  • Logic App for usage processing
  • Application Insights for gateway telemetry

Your rg-ai-spoke-dev resource group contains:

  • AI Foundry account and project
  • gpt-4o and text-embedding-3-large deployments
  • Cosmos DB with a conversations container
  • Key Vault, App Configuration, Storage Account, Application Insights, Log Analytics

App Configuration is fully populated with canonical keys (CHAT_DEPLOYMENT_NAME, AI_FOUNDRY_PROJECT_ENDPOINT, COSMOS_DB_ENDPOINT, and more) ready for agent applications to consume.

This Is a Dev Setup — Here Is What Changes for Non-Prod and Production

The configuration above is a starting point, not a production blueprint. Key differences when moving up the environment stack:

APIM SKU. Developer SKU has no SLA and no VNet support. Switch to Premium SKU for non-prod and production. This significantly increases cost and deployment time but enables private networking, multi-region, and availability zones.

Network isolation. For production, set networkIsolation=true and wire the spoke VNet to your hub VNet via peering (hubIntegrationHubVnetResourceId). This requires coordinating private DNS zones across the hub and spoke. The template supports bringing existing DNS zones via the existingPrivateDnsZone* parameters.

AI Search. Re-enable deploySearchService and deployAAfAgentSvc for non-prod and production. If Sweden Central remains capacity-constrained on Standard SKU, deploy Search to a paired region (East US 2 works well) using the searchServiceLocation parameter.

Bastion and Jump VM. For production with networkIsolation=true, re-enable deployBastion and deployJumpbox so operators can access resources inside the private VNet without public endpoints.

Separate parameter files per environment. Maintain spoke.parameters.dev.json, spoke.parameters.nonprod.json, and spoke.parameters.prod.json with environment-specific values. Use a deployment pipeline (GitHub Actions or Azure DevOps) to apply them consistently.
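One way to keep the per-environment files consistent is to generate them from a shared base plus a small set of overrides. This is a minimal Python sketch with hypothetical names; the override values are illustrative and simply reuse parameters discussed in this post.

```python
import copy

# Illustrative overrides only; adjust to the parameters your template version accepts.
ENVIRONMENTS = {
    "dev": {"networkIsolation": False, "deployBastion": False, "deploySearchService": False},
    "prod": {"networkIsolation": True, "deployBastion": True, "deploySearchService": True},
}


def build_parameters(base: dict, overrides: dict) -> dict:
    """Return a new parameters document: base values with overrides applied on top."""
    merged = copy.deepcopy(base)  # never mutate the shared base
    for name, value in overrides.items():
        merged["parameters"][name] = {"value": value}
    return merged
```

Writing each result out with json.dump, for example to spoke.parameters.prod.json, gives the pipeline one file per environment while the shared values live in a single place.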

Model versions. Pin specific model versions in parameters files and validate availability in your target region before each deployment. Azure OpenAI model lifecycle moves fast; versions retire on 18-month cycles, and regional availability varies.

Preview Caveats

Citadel is in active development. Several things you should expect to change:

The parameter schemas for both the hub and spoke accelerators will evolve. Parameters discovered missing or renamed in this guide will likely be reorganized again as the repos mature. Always check the actual main.bicep parameter definitions rather than relying on example files.

The az cognitiveservices account connection create CLI command for AI Foundry connections is incomplete at the time of writing. This will improve as the Foundry CLI surface area matures.

The citadel-v1 branch in the AI Hub Gateway repo is flagged as the recommended path for new deployments. By the time you read this, it may have become the default branch with a cleaner deployment experience.

Regional capacity for AI Search Standard SKU fluctuates. Sweden Central is a high-demand region for AI workloads; plan for capacity constraints in any SKU beyond Basic for dev scenarios.

Conclusion

Citadel gives you a credible, opinionated starting point for enterprise AI governance on Azure: APIM as the AI gateway, AI Foundry as the agent runtime, Cosmos DB for conversation state, and App Configuration as the configuration backbone. Getting it running today requires navigating several rough edges: parameter schema inconsistencies, soft-delete cascades, model version deprecations, regional capacity constraints, and Windows-specific tooling issues.

None of these are blockers. They are the expected friction of working with a platform in active preview. The underlying architecture is sound, and the pieces that do work (APIM governance policies, Content Safety integration, App Config population, and AI Foundry project wiring) deliver real value immediately.

If you are building an AI platform for your organization, a Citadel dev setup is a reasonable first step. Treat it as a learning environment to understand the architecture, validate the tooling, and build the parameter files you will need for non-prod and production. Then evolve it deliberately: add network isolation, re-enable Search and Agent Services as capacity allows, and adopt the Citadel contracts (AI Access Contract, AI Publish Contract) to formalize the hub-spoke integration as your agent portfolio grows.

The governance-velocity paradox Citadel sets out to solve is real. Now, while the platform is still in preview and the patterns are malleable, is the right time to start getting the foundation right.

Final note: This post reflects a hands-on deployment performed in June 2026. Given the pace of change in this space, verify all CLI commands, parameter schemas, and model versions against current documentation before applying them in your own environment.

Multi-Agent Patterns in Azure Logic Apps: Handoffs, Orchestrators, and Sequential Loops

Part 5 of 7 in the Logic Apps Agent Loop series

Part 4 covered the three tooling layers available to an Azure Logic Apps agent. A single agent with well-defined tools handles a wide range of integration scenarios, but some workloads are too complex for one agent to handle well. Azure Logic Apps multi-agent patterns let you compose multiple agent loops into a coordinated system, where each agent has a single focused responsibility and the output of one feeds directly into the next.

This post covers the four patterns Microsoft has defined for multi-agent composition in Azure Logic Apps (prompt chaining, routing, handoff, and orchestrator-workers) and includes a working demo that builds a two-agent sequential loop: an extract agent that pulls key facts from a report and a summarize agent that turns them into an executive summary.

Why Azure Logic Apps multi-agent patterns matter

A single agent loop works well when the task is bounded and the instructions can cover every case. The problem comes when a task has multiple distinct phases that require different expertise, different tools, or different models. Packing all of that into one agent’s instructions creates a sprawling, hard-to-maintain prompt. The model has to context-switch between roles in a single loop, which degrades quality and makes the run history harder to interpret.

Multi-agent patterns solve this by giving each agent a single, clear responsibility. The agents are composed at the workflow level: one agent’s output becomes another agent’s input, and each agent can have its own model, its own tools, and its own focused instructions.

The four Azure Logic Apps multi-agent patterns explained

Microsoft’s documentation defines four patterns for multi-agent composition in Logic Apps. They are ordered by complexity.

Prompt chaining

The simplest pattern. A sequence of agent loops runs one after another, where the output of each loop becomes the input to the next. Each agent has a single focused task: extract, then format, then sort, then summarise. The chain is linear and predictable.

Use prompt chaining when the workload can be decomposed into sequential steps with clear handover points and when the output of each step is well-defined. A business report processing chain, raw data in, executive summary out, is the canonical example from the Microsoft documentation.
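The control flow of prompt chaining is small enough to sketch in a few lines. The Python below is a conceptual illustration only: the agents are plain stub functions standing in for agent loops, not Logic Apps actions, and every name is hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]


def extract(text: str) -> str:
    # Stub for an extraction agent: keep only sentences that contain a digit.
    sentences = [s.strip() for s in text.split(".") if any(ch.isdigit() for ch in s)]
    return "\n".join(f"- {s}" for s in sentences)


def summarize(facts: str) -> str:
    # Stub for a summarising agent: report how many facts it received.
    return f"Summary of {len(facts.splitlines())} facts."


def chain(agents: list[Agent], initial_input: str) -> str:
    """Run agents in order; each agent's output is the next agent's input."""
    result = initial_input
    for agent in agents:
        result = agent(result)
    return result
```

The important property is that chain knows nothing about what each agent does; it only fixes the order and the handover.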

Routing

A classification agent examines the incoming request and routes it to one of several specialist agent loops based on what it finds. The routing agent does not do the work itself; it decides which agent should do the work and passes control there.

Use routing when incoming requests fall into distinct categories that need different handling: a customer service triage agent that routes billing queries to a billing agent loop, technical questions to a technical support agent loop, and general inquiries to a general response agent loop. The routing pattern prevents optimization conflicts, allowing a billing specialist agent to be tuned for billing tasks without being distracted by technical support scenarios.
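As a conceptual sketch of routing, again with stub functions in place of real agent loops, the classifier returns a category and a lookup table selects the specialist. The keyword rules and names here are hypothetical stand-ins for a model-driven classifier.

```python
def classify(request: str) -> str:
    """Stub classifier: a real routing agent would use a model, not keywords."""
    text = request.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"


# Each specialist is a stub that tags the request with the agent that handled it.
SPECIALISTS = {
    "billing": lambda request: f"[billing] {request}",
    "technical": lambda request: f"[technical] {request}",
    "general": lambda request: f"[general] {request}",
}


def route(request: str) -> str:
    """Classify once up front, then pass control to exactly one specialist."""
    return SPECIALISTS[classify(request)](request)
```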

Handoff

Similar to routing but more dynamic. Instead of a central classifier making an upfront routing decision, each agent loop decides during its own execution whether it needs to hand off to another agent. The handoff preserves conversation context and state across the transition: the receiving agent knows the full history of what the previous agent did and said.

Use handoff when the trigger for transferring control depends on what emerges during the conversation: a general support agent that escalates to a technical specialist when it detects a complex issue, or a research agent that hands off to a writer agent once it has gathered enough material. The handoff pattern mimics human escalation patterns: a front-line agent handles what it can and passes on what it cannot.
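The difference from routing is that the decision happens inside the first agent and the shared history travels with the handoff. A minimal Python sketch, with stub agents and a hypothetical escalation trigger:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Conversation:
    # (speaker, text) pairs shared by every agent in the conversation.
    history: list[tuple[str, str]] = field(default_factory=list)


def general_agent(message: str, convo: Conversation) -> Optional[str]:
    """Handle the message, or return the name of the agent to hand off to."""
    convo.history.append(("user", message))
    if "stack trace" in message.lower():  # stub escalation trigger
        convo.history.append(("general", "escalating to specialist"))
        return "specialist"
    convo.history.append(("general", "handled"))
    return None


def specialist_agent(convo: Conversation) -> None:
    """Pick up the conversation with the full prior history available."""
    seen = len(convo.history)
    convo.history.append(("specialist", f"picked up with {seen} prior turns"))


def handle(message: str, convo: Conversation) -> tuple[str, str]:
    if general_agent(message, convo) == "specialist":
        specialist_agent(convo)
    return convo.history[-1]
```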

Orchestrator-workers

The most sophisticated pattern. A central orchestrator agent dynamically decomposes a task into subtasks and delegates each subtask to a worker agent loop. The worker agents operate as tools that the orchestrator can invoke, exactly the tool provider pattern from Part 4, applied to agents rather than connectors.

Use orchestrator-workers when you cannot predict the required subtasks in advance. A coding agent that needs to make changes to an unpredictable number of files, a research agent that gathers information from multiple dynamic sources, or a content pipeline with a writer, reviewer, and publisher working together: these are all orchestrator-worker scenarios. The orchestrator dynamically determines what needs to be done; the workers execute it.
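In sketch form, the orchestrator produces a plan at run time and invokes workers as if they were tools. The Python below is conceptual: the plan function is a stub for model-driven decomposition, and the worker names are hypothetical.

```python
# Workers are stubs; in the pattern described above each would be an agent loop.
WORKERS = {
    "research": lambda task: f"notes on {task}",
    "write": lambda task: f"draft for {task}",
    "review": lambda task: f"review of {task}",
}


def plan(task: str) -> list[str]:
    """Stub for dynamic decomposition: the step list depends on the task."""
    steps = ["research", "write"]
    if "publish" in task.lower():
        steps.append("review")
    return steps


def orchestrate(task: str) -> list[str]:
    """Decide the subtasks at run time, delegate each one, and collect the results."""
    return [WORKERS[step](task) for step in plan(task)]
```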

Demo: Building a sequential agent loop — Extract and Summarise

This demo builds a two-agent prompt chaining workflow in a new sequential-agents workflow inside la-agent-loop. The scenario is a business report processing chain: Agent 1 extracts key facts and metrics from a raw text input, Agent 2 takes those facts and writes a concise executive summary. The output of Agent 1 feeds directly into Agent 2 — this is the prompt chaining pattern in its simplest form.

Prerequisites

  • The la-agent-loop Standard logic app from previous posts
  • An Azure OpenAI / Foundry Models connection already configured

Step 1: Create the workflow

In la-agent-loop, click Create and name the workflow sequential-agents. Select Autonomous Agents as the workflow type. Logic Apps creates the workflow with an HTTP trigger and an empty Agent action.

Step 2: Configure the HTTP trigger

Click the When an HTTP request is received trigger and paste this request body schema:

{
"type": "object",
"properties": {
"report": {
"type": "string"
}
},
"required": ["report"]
}

Step 3: Configure the Extract Agent

Click the first Agent action and rename it Extract Agent. Configure it:

  • AI model: your GPT-4o / Foundry Models connection
  • Instructions: You are a data extraction specialist. Extract all numerical values, metrics, and key facts from the provided text. Return them as a clean bulleted list. Do not summarise or interpret — only extract.
  • User instructions item – 1: select report from the HTTP trigger dynamic content

Step 4: Add a Compose action

This is a critical step. The Extract Agent output is a JSON object containing a messages array — not a plain string. The Summarize Agent cannot process it directly. A Compose action between the two agents extracts the plain text content.

Click + below the Extract Agent container and select Add an action → Simple Operations → Compose. Set the Inputs expression to:

outputs('Extract_Agent')?['body']?['messages'][0]['content']

This extracts the bulleted list text from the Extract Agent’s output object and passes it as a clean string to the next agent.
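To make the navigation explicit, here is roughly what that expression does, written as a Python function over an equivalent dictionary. It is a sketch: the output shape is assumed from the description above, the function name is hypothetical, and this version is slightly more forgiving than the workflow expression because it returns None when the messages list is missing or empty.

```python
from typing import Any, Optional


def first_message_content(agent_output: Optional[dict]) -> Any:
    """Return body.messages[0].content, or None if body or messages is absent."""
    body = (agent_output or {}).get("body") or {}
    messages = body.get("messages") or []
    if not messages:
        return None
    return messages[0]["content"]
```

For an output such as {"body": {"messages": [{"role": "assistant", "content": "- Revenue: 4.2M"}]}}, the function returns the plain string held in content.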

Step 5: Add the Summarize Agent

Click + below the Compose action and select Add an agent. Rename it Summarize Agent. Configure it:

  • AI model: your GPT-4o / Foundry Models connection
  • Instructions: You are an executive communications specialist. Take the provided list of facts and metrics and write a concise three-sentence executive summary suitable for a board report. Be professional and direct.
  • User instructions item – 1: select the Outputs of the Compose action from the dynamic content picker

Step 6: Add a Response action

Click + below the Summarize Agent container and add a Response action:

  • Status Code: 200
  • Content-Type header: application/json
  • Body: set the expression to outputs('Summarize_Agent')?['body']?['messages'][0]['content']

Azure Logic Apps designer showing the sequential-agents workflow. An HTTP request trigger connects to an Extract Agent action, followed by a Compose action that extracts the agent output content, then a Summarize Agent action, and finally a Response action that returns the executive summary to the caller.
Figure 1 — The complete sequential agent loop workflow in the Logic Apps designer. The Extract Agent receives the raw report text from the HTTP trigger and returns a bulleted list of facts. A Compose action bridges the two agents by extracting the plain text content from the Extract Agent’s JSON output object — a required intermediate step since Agent actions do not expose their output as a typed string in the dynamic content picker. The Summarize Agent receives the extracted facts and produces a three-sentence executive summary, which the Response action returns as a 200 OK.

Step 7: Save and test

Save the workflow and POST this to the trigger URL:

{ "report": "Q3 revenue was €4.2M, up 18% year on year. Customer acquisition cost dropped to €142, down from €198. Net promoter score reached 67. Headcount grew from 43 to 51. Churn rate fell to 2.3%." }

The workflow runs in approximately 16 seconds and returns a clean executive summary:

In Q3, revenue reached €4.2M, reflecting an 18% year-on-year increase, supported by a significant reduction in customer acquisition cost from €198 to €142. The company saw operational growth with headcount rising from 43 to 51, while maintaining strong customer satisfaction, evidenced by a Net Promoter Score of 67 and a low churn rate of 2.3%. These metrics highlight sustained growth and improved efficiency across key areas.

The run history shows two distinct agent iterations, Extract Agent and Summarize Agent, each with their own Think → Observe cycle, confirming the prompt chaining pattern is working end to end.

Logic Apps run history for the sequential-agents workflow completed in 7.37 seconds. The log shows the HTTP trigger, Extract Agent completing in 3.1 seconds, a Compose action at 0 seconds, Summarize Agent completing in 4 seconds, and a Response action at 0 seconds. The canvas on the right shows all five steps with green success indicators, with both agent actions showing iteration 1 of 2.
Figure 2 — The run history of the sequential agent loop, completed in 7.37 seconds. The Extract Agent ran for 3.1 seconds and passed its output to the Summarize Agent via the Compose action, which completed in 4 seconds. Both agent actions show iteration 1 of 2 on the canvas, confirming that each ran its own Think → Observe cycle independently. The Compose action completed in 0 seconds, serving purely as a data-transformation bridge between the two agent outputs.

Practitioner note: The Compose action between the two agents is not optional. Logic Apps Agent actions return a structured JSON object, not a plain string, so the second agent cannot consume the first agent’s output directly from dynamic content. The Compose expression outputs('Extract_Agent')?['body']?['messages'][0]['content'] bridges this gap. This is not documented clearly by Microsoft at the time of writing and is the most common point of failure when building sequential agent loops.


Choosing the right pattern

Pattern | Complexity | Use when
Prompt chaining | Low | Sequential steps with clear handover points
Routing | Low–medium | Distinct input categories needing different handling
Handoff | Medium | Dynamic escalation based on conversation content
Orchestrator-workers | High | Unpredictable subtasks requiring dynamic decomposition

The patterns are not mutually exclusive. A production customer service system might use routing to direct initial requests, handoff for mid-conversation escalations, and prompt chaining within each specialist agent to process the request through multiple steps.

Diagram showing four Azure Logic Apps multi-agent patterns arranged in rows. Row 1: prompt chaining — Agent 1 Extract, Agent 2 Format, Agent 3 Summarise, Output. Row 2: routing — a Classifier triage agent routes to either a Billing agent or a Technical agent. Row 3: handoff — a General agent detects escalation and passes context via a dashed arrow to a Specialist agent with full history. Row 4: orchestrator-workers — an Orchestrator with dynamic breakdown fans out to Worker A, Worker B, and Worker C, which converge into a Synthesised output. Legend shows teal for agent/worker, purple for orchestrator/classifier, coral for specialist.
Figure 3 — The four multi-agent patterns available in Azure Logic Apps, ordered by complexity. Prompt chaining (top) runs agents sequentially, with each output feeding the next, as demonstrated in this post’s demo. Routing uses a classifier agent to direct requests to the right specialist. Handoff transfers control dynamically mid-conversation, preserving the full conversation history across the transition. Orchestrator-workers (bottom) is the most advanced pattern: a central orchestrator dynamically decomposes tasks and delegates them to worker agents, synthesizing their results into a final output.

What comes next

Part 6 covers securing agentic workflows: the expanded caller surface that multi-agent and conversational patterns introduce, Easy Auth setup for production, and Managed Identity for backend connections.

Azure API Management Build 2026 AI Gateway: What’s New

The Azure API Management Build 2026 AI gateway announcements mark a significant expansion of APIM’s control plane capabilities. Microsoft shipped three headline additions: a Unified Model API that lets clients standardize on one format while APIM transforms requests to Anthropic, Google Vertex AI, and other backends; content safety policies extended to cover MCP tool calls and agent-to-agent traffic; and expanded token metrics that now track reasoning, cached, and audio tokens across providers. This post explains what each change means in practice for teams building enterprise AI workloads on Azure.

Azure API Management Build 2026 AI Gateway: Three Headline Changes

The biggest announcement is the Unified Model API, now in public preview. It lets clients standardize on a single API format, currently OpenAI Chat Completions, while APIM transparently converts requests to the backend provider’s native format, whether that is Anthropic’s Messages API, Google Vertex AI, or another provider.

For teams running multi-model architectures, this is significant. Until now, switching providers or adding a new model required client-side changes. With the Unified Model API, the routing decision moves entirely to APIM. Teams can swap backends, add providers, or route traffic based on cost or latency without touching client code.
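To give a sense of what that kind of translation involves, the sketch below converts an OpenAI Chat Completions request body into the shape Anthropic's Messages API expects. It is an illustration of the idea only, not APIM's implementation: the function name and default are hypothetical, and it handles only plain-text messages.

```python
def openai_to_anthropic(request: dict, model: str, default_max_tokens: int = 1024) -> dict:
    """Translate a plain-text Chat Completions body into a Messages API body."""
    # Anthropic takes the system prompt as a top-level field, not as a message.
    system_parts = [m["content"] for m in request["messages"] if m["role"] == "system"]
    messages = [
        {"role": m["role"], "content": m["content"]}
        for m in request["messages"]
        if m["role"] in ("user", "assistant")
    ]
    translated = {
        "model": model,
        # max_tokens is required by the Messages API, so fall back to a default.
        "max_tokens": request.get("max_tokens", default_max_tokens),
        "messages": messages,
    }
    if system_parts:
        translated["system"] = "\n".join(system_parts)
    return translated
```

Even this small case shows why moving the translation into the gateway is attractive: the client keeps sending one format, and provider differences such as where the system prompt lives stay out of client code.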

Diagram showing a client sending requests in OpenAI Chat Completions format to Azure API Management. APIM's Unified Model API layer transforms the request to each provider's native format — Azure OpenAI natively, Anthropic Messages API, and Google Vertex AI format — while applying governance policies and unified token metrics uniformly across all backends. A caption notes that client code is unchanged when swapping providers.
The APIM Unified Model API transformation layer. Clients standardize on a single API format, while APIM handles per-provider translation transparently. All governance policies, rate limits, content safety, and token metrics apply uniformly regardless of which provider handles inference. Teams can swap backends or add providers without touching client code.
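
To make the translation step concrete, here is a minimal Python sketch of the kind of mapping a gateway has to perform between the OpenAI Chat Completions shape and Anthropic’s Messages shape. It is an illustration only, not APIM’s implementation: the function name and the default max_tokens value are assumptions, it handles plain-text user and assistant messages only, and it passes the model name through unchanged where a real gateway would map it to a backend deployment.

```python
from typing import Any


def openai_chat_to_anthropic(
    request: dict[str, Any], default_max_tokens: int = 1024
) -> dict[str, Any]:
    """Translate a Chat Completions request body into a Messages request body."""
    system_parts: list[str] = []
    messages: list[dict[str, str]] = []
    for message in request["messages"]:
        if message["role"] == "system":
            # The Messages API takes the system prompt as a top-level field.
            system_parts.append(message["content"])
        else:
            messages.append({"role": message["role"], "content": message["content"]})

    body: dict[str, Any] = {
        "model": request["model"],  # assumption: no model-name mapping in this sketch
        "messages": messages,
        # max_tokens is required by the Messages API but optional in Chat Completions.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        body["system"] = "\n".join(system_parts)
    if "temperature" in request:
        body["temperature"] = request["temperature"]
    return body


translated = openai_chat_to_anthropic(
    {
        "model": "example-model",
        "messages": [
            {"role": "system", "content": "You are terse."},
            {"role": "user", "content": "Hi"},
        ],
    }
)
```

The point of the sketch is where the work happens: the client keeps sending one format, and differences such as the system prompt location and the required max_tokens field are absorbed in one place rather than in every client.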

From an architecture perspective, this strengthens the case for APIM as the single AI control plane. Every governance policy, rate limit, content safety, and token metric applies consistently regardless of which provider handles inference. There is no need for a parallel governance stack per provider.

One practical implication: the three-layer auth model from Part 2 of this series applies uniformly across all providers. Managed Identity to backend is the cleanest approach, but the provider must support it. For Anthropic and Vertex AI, check the current authentication requirements before assuming token-based auth transfers directly.

Content Safety for MCP and A2A: The Gap That Needed Closing

Extending the llm-content-safety policy to MCP tool calls and agent-to-agent payloads is the most architecturally significant change. Until now, content safety only covered LLM completions traffic. MCP tool-call arguments and A2A messages were ungoverned at the gateway layer.

This matters because prompt injection attacks do not only arrive via the user-facing chat interface. A malicious payload embedded in a tool response from an external MCP server, for example, can propagate through an agentic pipeline if there is no inspection at the gateway layer. The shield-prompt attribute specifically addresses this by checking for adversarial prompt-injection patterns in MCP and A2A traffic, not just in LLM input.

Side-by-side comparison diagram. On the left, before Build 2026, Azure API Management content safety covers only LLM completions traffic. MCP tool calls, agent-to-agent traffic, and prompt injection via tool responses are shown in red as ungoverned. On the right, after Build 2026, all four traffic types are shown in teal as covered — MCP tool call arguments, A2A agent payloads, and prompt injection attacks are now scanned by the llm-content-safety policy with the shield-prompt attribute enforced.
Content safety coverage before and after Build 2026. Prior to the announcement, the llm-content-safety policy only applied to LLM completions traffic. MCP tool-call arguments, agent-to-agent payloads, and prompt injection attacks arriving via tool responses were ungoverned at the gateway layer. The Build 2026 update closes all three gaps with the same policy, extended to cover MCP and A2A traffic.

One implementation detail worth calling out: the policy behaves differently for streaming responses. In non-streaming mode, a violation returns a clean 403. In streaming mode, the policy buffers events in a sliding window and stops forwarding without returning an explicit error code. Agents consuming streaming completions need to handle an abrupt stop gracefully. If you are designing agentic pipelines that use streaming, build in a timeout and an explicit error handling path for this case.
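
One way to handle the abrupt stop on the consuming side is sketched below in Python. It assumes the stream uses OpenAI-style server-sent events that end with a `data: [DONE]` line, and it treats a stream that ends without that marker as interrupted. The exception and function names are hypothetical. Note that the deadline is only checked when a line arrives, so a connection that hangs silently still needs a read timeout in your HTTP client.

```python
import json
import time
from typing import Callable, Iterable


class StreamInterrupted(Exception):
    """Raised when a streamed completion ends without its terminal marker."""

    def __init__(self, partial_text: str, reason: str) -> None:
        super().__init__(reason)
        self.partial_text = partial_text


def read_completion_stream(
    lines: Iterable[str],
    max_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Collect streamed text, failing loudly if the stream stops early."""
    deadline = clock() + max_seconds
    chunks: list[str] = []
    for line in lines:
        if clock() > deadline:
            raise StreamInterrupted("".join(chunks), "overall deadline exceeded")
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            return "".join(chunks)
        choices = json.loads(payload).get("choices") or []
        if choices:
            content = choices[0].get("delta", {}).get("content")
            if content:
                chunks.append(content)
    # The iterator ran out without [DONE]: treat it as a blocked or dropped stream.
    raise StreamInterrupted("".join(chunks), "stream ended without [DONE]")
```

The caller can then decide what to do with `partial_text`: discard it, log it, or surface a policy-violation message to the user instead of a half-finished answer.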

The two new attributes — window-size and window-overlap-size — let you tune how content exceeding Azure Content Safety’s 10,000 character limit is split for evaluation. For agentic pipelines with large tool responses, these will need tuning based on your typical payload sizes.
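
To get a feel for how window size and overlap interact, the following Python sketch splits a payload into overlapping character windows. This is a generic sliding-window split, not the policy’s actual algorithm: how APIM interprets the two attributes (units, boundary handling) is an assumption here, and the function name is hypothetical.

```python
def split_into_windows(text: str, window_size: int, overlap_size: int) -> list[str]:
    """Split text into windows of window_size characters that overlap by overlap_size."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be >= 0 and smaller than window_size")

    step = window_size - overlap_size
    windows: list[str] = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


# A 25,000-character tool response with 10,000-character windows and 500 overlap.
window_count = len(split_into_windows("x" * 25_000, 10_000, 500))
```

A larger overlap reduces the chance that a harmful phrase is cut in half at a window boundary, at the cost of more windows to evaluate per payload.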

Expanded Token Metrics: Catching What Was Missing

The token metric policy from Part 4 of this series now logs reasoning tokens, cached tokens, and audio tokens to Application Insights. This is a meaningful improvement for FinOps visibility.

Reasoning models like o1 and o3 consume significant token budgets in their internal reasoning chain before producing output. Without reasoning token tracking, cross-charging dashboards systematically undercount consumption from teams using these models. The expanded metrics fix this.

Matrix diagram with token types as rows and AI providers as columns. Prompt tokens and completion tokens are tracked across all five providers: Azure OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, and Microsoft Foundry. Three new token types added at Build 2026 are highlighted in amber: reasoning tokens, tracked for Azure OpenAI, Anthropic, and Microsoft Foundry; cached tokens, tracked for Azure OpenAI, Anthropic, Google Vertex AI, and Microsoft Foundry; and audio tokens, tracked for Azure OpenAI only. Grey cells indicate token types not reported by a given provider. All data flows to Application Insights for FinOps dashboards and budget alerts.
Token metric coverage in Application Insights after Build 2026. The three amber rows — reasoning, cached, and audio tokens — are new additions. Reasoning token tracking is particularly significant for FinOps teams using o1 or o3 models, where the internal reasoning chain can consume a substantial portion of the total token budget that earlier metrics did not capture. Grey cells indicate that a provider does not expose that token type in its API response.

Cached token tracking is equally important for cost optimization. Azure OpenAI’s prompt caching reduces the cost of repeated prompt prefixes. Tracking cached vs. uncached tokens separately lets you measure the actual cache hit rate and tune your prompt structure accordingly.
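
The arithmetic behind those two measurements is simple enough to sketch in Python. The record shape below is hypothetical, and the sketch assumes that cached tokens are reported as a subset of prompt tokens and reasoning tokens as a subset of completion tokens; check how your provider and your Application Insights dimensions report them before reusing the formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    prompt_tokens: int
    cached_tokens: int       # assumed to be included in prompt_tokens
    completion_tokens: int
    reasoning_tokens: int    # assumed to be included in completion_tokens


def cache_hit_rate(records: list[Usage]) -> float:
    """Fraction of prompt tokens that were served from the prompt cache."""
    total_prompt = sum(r.prompt_tokens for r in records)
    if total_prompt == 0:
        return 0.0
    return sum(r.cached_tokens for r in records) / total_prompt


def reasoning_share(records: list[Usage]) -> float:
    """Fraction of completion tokens spent on internal reasoning."""
    total_completion = sum(r.completion_tokens for r in records)
    if total_completion == 0:
        return 0.0
    return sum(r.reasoning_tokens for r in records) / total_completion


sample = [
    Usage(prompt_tokens=1000, cached_tokens=800, completion_tokens=200, reasoning_tokens=150),
    Usage(prompt_tokens=1000, cached_tokens=0, completion_tokens=200, reasoning_tokens=50),
]
```

Tracked per team or per prompt template, a low cache hit rate is a hint that the stable part of the prompt is not at the front, and a high reasoning share shows where cross-charging would have been undercounted.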

The multi-provider coverage of Microsoft Foundry, OpenAI, Amazon Bedrock, and Google Vertex AI means the FinOps dashboard built in Part 4 now works across your entire model estate, not just Azure OpenAI.

API Center MCP Server: Enterprise Discovery at GA

The Azure API Center data plane MCP server reached general availability. It acts as a unified discovery endpoint: agents and developer tools can find registered MCP servers, tools, APIs, and AI assets through a single MCP connection. When a team registers a new MCP server in API Center, it becomes automatically discoverable without requiring individual client reconfigurations.

This is the enterprise catalogue layer that makes the MCP gateway story from Part 7 operationally sustainable at scale. Without it, discovery is a manual configuration problem. With it, the control plane extends automatically as new capabilities are registered.

Where This Leaves the Control Plane

Looking at the Build announcements together, the pattern is consistent with what the series argued: APIM is becoming the governance layer for all AI traffic, not just LLM completions. The Unified Model API extends it across providers. Content safety for MCP and A2A extends it across protocols. The API Center MCP server extends discovery to the enterprise catalogue layer.

The competitive context is worth noting. AWS Bedrock Guardrails handles content filtering but has no equivalent to the Unified Model API or MCP/A2A coverage. Google Apigee has added AI gateway features, but not at this protocol breadth. Cloudflare’s AI Gateway focuses on spend limits and caching. APIM’s position that the API gateway is the natural control plane for AI workloads is increasingly defensible.

For teams that have followed the series and implemented the seven patterns, the Build announcements are additive rather than disruptive. The policy pipeline you built still works. The new capabilities slot in: swap your backend URL configuration to use the Unified Model API, add the llm-content-safety policy to your MCP server inbound pipeline, and update your Application Insights queries to include reasoning and cached token dimensions.

Lastly, the Microsoft AI Gateway labs, with 30+ Jupyter notebooks and deployable Bicep templates, are worth bookmarking if you are implementing any of these patterns.

Building Azure Logic Apps Agent Tools: Connectors and MCP

Part 4 of 7 in the Logic Apps Agent Loop series

Part 3 covered the two agentic workflow patterns in Azure Logic Apps, autonomous and conversational, and how to choose between them. Both patterns rely on the same mechanism for getting work done: tools. An Azure Logic Apps agent loop tool is the means by which the model reaches out to the world to query a database, send an email, call an API, or retrieve a document. Without tools, the agent can only reason over what the model already knows.

This post is the most hands-on in the series. It covers the three layers of the Azure Logic Apps tooling model: built-in connectors, custom connectors, and MCP servers. It also includes a demo showing how to expose a Logic Apps workflow as a tool provider that can be called by an external agent in Azure AI Foundry.

Choosing the right Azure Logic Apps agent tools layer

Before building anything, it helps to understand what a tool actually is in Logic Apps terms. A tool is a sequence of one or more connector actions that the agent can choose to invoke during a loop iteration. The model decides which tool to call based on the tool’s name and description, so naming and describing tools clearly is one of the most important decisions you will make when building an agentic workflow.

Logic Apps offers three layers of tooling, each adding capability and complexity.

Layer 1: Built-in and managed connectors

The foundation layer is the 1,400+ connector library that Logic Apps has always offered. For agent tools, the most relevant connectors are those that give the agent access to data and services: Azure OpenAI, Azure AI Search, Azure Blob Storage, Office 365 Outlook, SharePoint, SQL Server, HTTP, and Service Bus among them.

You build a tool by adding one or more of these connector actions inside the tool container within the agent action. Each tool gets a name and a description. The model reads these at runtime to decide whether to invoke the tool and what arguments to pass. You then create agent parameters for any action inputs that the model should supply dynamically: a city name for a weather lookup, a query string for a search, a recipient address for an email.

Agent parameters differ from standard Logic Apps parameters in two important ways. They are scoped to the tool where you define them and cannot be shared across tools. They also receive their values only when the agent invokes the tool, not at workflow start time. You can call the same tool multiple times in a single loop using different parameter values: for example, you could invoke a weather tool for both Amsterdam and London in the same run.
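
As a mental model, a tool can be sketched in Python as a name, a description, a set of tool-scoped parameters, and an action. This is a conceptual illustration only, not the Logic Apps workflow schema: the `Tool` class, the `get_weather` tool, and the stubbed lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str                    # read by the model when choosing a tool
    description: str             # read by the model when choosing a tool
    parameters: tuple[str, ...]  # agent parameters, scoped to this tool only
    action: Callable[..., str]   # stands in for the connector action(s)

    def invoke(self, **arguments: str) -> str:
        """Run the action with values the model supplied for this invocation."""
        if set(arguments) != set(self.parameters):
            raise ValueError(
                f"{self.name} expects exactly {sorted(self.parameters)}, "
                f"got {sorted(arguments)}"
            )
        return self.action(**arguments)


def fake_weather_lookup(city: str) -> str:
    # Stub in place of a real connector call.
    return f"Forecast requested for {city}"


weather_tool = Tool(
    name="get_weather",
    description="Look up the current weather forecast for a single city.",
    parameters=("city",),
    action=fake_weather_lookup,
)

# The same tool invoked twice in one loop with different parameter values.
results = [weather_tool.invoke(city=city) for city in ("Amsterdam", "London")]
```

The values for `city` exist only at invocation time, which mirrors the point above: agent parameters are filled in by the model per call, not bound when the workflow starts.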

Layer 2: Custom connectors

Where the built-in connector library has gaps, custom connectors fill them. A custom connector in Logic Apps is an OpenAPI-described wrapper around any REST API, internal or external. Once you register it, it appears in the connector gallery just like a managed connector, and you can use it inside a tool in the same way.

For enterprise integration architects, custom connectors are the bridge between the agent loop and any internal system that does not have a first-party Logic Apps connector: an internal HR system, a legacy claims processing API, a proprietary data platform. The investment in defining the OpenAPI specification pays off because the connector becomes reusable across all workflows in the tenant, not just the agentic ones.

Building a custom connector for use in an agent tool follows the standard Logic Apps custom connector creation process: define the API, specify authentication, and configure the operations. There is one addition: write clear operation descriptions, because the model uses these descriptions to decide when to invoke the connector.
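
Because the descriptions carry so much weight, it can be worth checking them before you register the connector. The Python sketch below scans an OpenAPI document (already loaded as a dictionary) and lists operations whose description is missing or very short. The function name, the 20-character threshold, and the example claims API are all hypothetical choices for illustration.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations_missing_description(spec: dict, min_length: int = 20) -> list[str]:
    """Return 'METHOD /path' for every operation without a usable description."""
    problems: list[str] = []
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters'
            description = (operation.get("description") or "").strip()
            if len(description) < min_length:
                problems.append(f"{method.upper()} {path}")
    return problems


example_spec = {
    "openapi": "3.0.0",
    "paths": {
        "/claims/{claim_id}": {
            "get": {
                "operationId": "getClaim",
                "description": "Retrieve the status and details of one insurance claim by its ID.",
            }
        },
        "/claims": {
            "post": {"operationId": "createClaim", "description": "Create."}
        },
    },
}

flagged = operations_missing_description(example_spec)
```

A length check is a crude proxy for quality, but it catches the common case of operations that were never described at all.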

Layer 3: MCP servers

The third layer is the newest and the most architecturally significant. Azure Logic Apps can serve as the backend for a Model Context Protocol (MCP) server, exposing connector actions as a structured, discoverable toolset that external agents and models can call over a standard protocol.

MCP is an open standard that defines how AI components discover and invoke tools; an MCP server acts as the bridge between an AI agent and the tools it can use. This is a significant shift from the previous two layers. Built-in and custom connectors are tools that the agent in your Logic Apps workflow invokes. An MCP server inverts the relationship: your Logic Apps workflow becomes the tool provider, and the calling agent lives somewhere else entirely.

Structural diagram showing three tooling layers for Azure Logic Apps agentic workflows. Layer 1 contains built-in and managed connectors including Azure OpenAI, Azure AI Search, Office 365, HTTP, and 1,400 more. Layer 2 shows custom connectors wrapping internal REST APIs such as HR, claims, and ERP systems. Both layers sit inside the Logic App boundary with an agent parameters note. Layer 3 sits below as a separate MCP server section, showing an external agent connecting via Azure API Center to MCP tools backed by Logic Apps connectors.
Figure 1 — The three tooling layers available to an Azure Logic Apps agent. Layer 1 (purple) covers the 1,400+ built-in and managed connectors packaged as tools directly inside the agent action. Layer 2 (coral) adds custom connectors that wrap internal REST APIs not covered by first-party connectors, reusable across the tenant. Both layers follow the same pattern: the agent in your workflow calls the tool. Layer 3 (purple, below) inverts the relationship — your Standard logic app becomes the tool provider, registered through Azure API Center and callable by any external MCP-compatible agent. Agent parameters apply across all three layers: the model supplies tool input values at runtime, scoped per tool.
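
To show what “discover and invoke” looks like on the wire, here is a minimal Python sketch of the two MCP tool methods, `tools/list` and `tools/call`, handled as JSON-RPC 2.0 messages. It is not how Logic Apps implements its MCP server: the transport and the initialization handshake are left out, and the `lookup_order` tool with its stubbed result is hypothetical.

```python
from typing import Any

TOOLS = [
    {
        "name": "lookup_order",
        "description": "Return the current status of a customer order by order ID.",
        "inputSchema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]


def lookup_order(order_id: str) -> str:
    # Stub in place of the workflow or connector action behind the tool.
    return f"Order {order_id}: status unknown (stub)"


def handle_request(request: dict[str, Any]) -> dict[str, Any]:
    """Answer a single JSON-RPC request for tool discovery or tool invocation."""
    response: dict[str, Any] = {"jsonrpc": "2.0", "id": request.get("id")}
    method = request.get("method")
    if method == "tools/list":
        response["result"] = {"tools": TOOLS}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "lookup_order":
            response["error"] = {"code": -32602, "message": "Unknown tool"}
        else:
            text = lookup_order(**params.get("arguments", {}))
            response["result"] = {"content": [{"type": "text", "text": text}]}
    else:
        response["error"] = {"code": -32601, "message": "Method not found"}
    return response


listing = handle_request({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle_request(
    {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "tools/call",
        "params": {"name": "lookup_order", "arguments": {"order_id": "A-100"}},
    }
)
```

The calling agent needs nothing beyond the name, description, and input schema returned by the listing, which is why the description quality discussed for Layers 1 and 2 matters just as much here.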

A note on the demo: real-world limitations of the tooling preview

For this post I set out to build a working end-to-end demo showing a Logic Apps workflow exposed as an MCP tool provider callable by an Azure AI Foundry agent. The concept is sound and the architecture is correct, but two practical blockers prevented a clean demo at the time of writing.

API Center MCP wizard limitations. The registration wizard in Azure API Center is in active preview. The connector picker surfaces only managed connectors, so the built-in HTTP action from Part 2 is unavailable. The logic app dropdown is also filtered by region: a logic app in West Europe will not appear in an API Center resource deployed to a different region.

Foundry OpenAPI tool network restrictions. Azure AI Foundry’s OpenAPI tool sandbox cannot reach azurewebsites.net endpoints directly. Calls from the Foundry playground return an Unknown error regardless of the spec configuration. The workaround is to front the Logic Apps endpoint with Azure API Management, which Foundry can reach; however, that adds infrastructure complexity beyond the scope of this post.

Both limitations are preview-stage issues that Microsoft will likely resolve. The OpenAPI spec, the Foundry agent configuration, and the mcp-research workflow pattern described above are all correct and will work once network access between Foundry and Logic Apps endpoints is available or via an APIM gateway.

The Layer 3 pattern, with your Logic App acting as a tool provider for any external MCP-compatible agent, remains the most architecturally significant development in this series. Part 6 picks up the security implications of that expanded caller surface.

Choosing the right tooling layer

The table below summarises how Azure Logic Apps agentic workflows differ across the three tooling layers.

| | Built-in connectors | Custom connectors | MCP server |
| --- | --- | --- | --- |
| Who calls the tool | Agent in your workflow | Agent in your workflow | Any external MCP-compatible agent |
| Setup complexity | Low | Medium | Medium–high |
| Reusability | Within the workflow | Across the tenant | Across agents and platforms |
| Best for | Standard integrations | Internal APIs without a connector | Multi-agent, cross-platform tooling |

The three layers are not mutually exclusive. A production agentic workflow will typically use built-in connectors for standard integrations, custom connectors for internal systems, and an MCP server where the toolset needs to be shared across multiple agents or platforms.


What comes next

The next post moves from individual tools to multi-agent composition. Part 5 covers orchestrator-worker topologies, agent handoffs, and how to build sequential agent loops.

Autonomous vs Conversational Agentic Workflows in Logic Apps

Part 3 of 7 in the Logic Apps Agent Loop series

Part 2 walked through the anatomy of an Azure Logic Apps agent loop and built a minimal autonomous agent from scratch. Before opening the designer, though, there is a design decision to make. Azure Logic Apps agentic workflows come in two patterns, autonomous and conversational, and choosing the right one shapes the trigger, the prompt source, the output destination, and the authentication you need before going to production. This post covers both patterns and helps you decide which fits your scenario.

Two Azure Logic Apps agentic workflow patterns, one agent loop

Both autonomous and conversational agentic workflows use the same Azure Logic Apps agent loop under the hood: the same Think, Act, Observe cycle from Post 2, the same connected model, and the same tools built from connector actions. The differences arise from how the workflow starts, who supplies the prompts, and how the results get delivered.

Autonomous agentic workflows

An autonomous agentic workflow starts from any supported Logic Apps trigger: an HTTP request, a timer, a Service Bus message, a new file in Blob Storage, or an email arriving in an inbox. The trigger fires, its output becomes the agent’s prompt, the loop runs, and the workflow then returns the result to the caller or forwards it to a downstream system. No human is in the loop during execution.

This is the pattern from Post 2. It works well in scenarios where the input is clear and the agent’s task is specific: summarize this document, classify this support ticket, extract these fields from this invoice, or route this order based on its contents. The workflow runs unattended, potentially thousands of times a day, without any human interaction between trigger and result.

The key design characteristic of an autonomous workflow is that the prompt comes from the system, not from a person. The trigger output, whether a message body, a file name, or a queue payload, is what the agent reasons over. The instructions you write in the agent’s configuration pane define the agent’s role for every run.

Conversational agentic workflows

A conversational agentic workflow introduces a human in the loop. Instead of firing from a system trigger, it always starts with the “When a chat session starts” trigger, the only trigger supported for this pattern. From there, the agent receives prompts through an integrated chat interface: a person types a message, the agent reasons over it, invokes tools if needed, and responds. The conversation continues turn by turn until the session ends.

This pattern suits scenarios that require dialogue: a support agent that asks clarifying questions, a guided data-entry flow, a research assistant that refines its output based on feedback, or any situation where the right response depends on what the user says next. The agent maintains session state across turns, so each prompt it receives includes the history of the conversation so far.
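
The session-state behaviour can be pictured with a small Python sketch: every turn appends to a history, and the model is called with the instructions plus the full history so far. This is a conceptual model only, not how Logic Apps stores sessions; the `ChatSession` class and the counting stand-in for the model are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

Message = dict[str, str]


@dataclass
class ChatSession:
    instructions: str
    model: Callable[[list[Message]], str]
    history: list[Message] = field(default_factory=list)

    def send(self, user_text: str) -> str:
        """Add a user turn, call the model with the whole history, store the reply."""
        self.history.append({"role": "user", "content": user_text})
        prompt = [{"role": "system", "content": self.instructions}, *self.history]
        reply = self.model(prompt)
        self.history.append({"role": "assistant", "content": reply})
        return reply


def counting_model(messages: list[Message]) -> str:
    # Stand-in for a real model: reports how many user turns it can see.
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"I have seen {user_turns} user message(s)."


session = ChatSession(instructions="You are a support agent.", model=counting_model)
first_reply = session.send("My order is late.")
second_reply = session.send("It is order A-100.")
```

An autonomous run corresponds to a session with exactly one turn and no stored history, which is why it cannot ask a clarifying question and pick up the answer later.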

The integrated chat interface is accessible directly from the Logic Apps designer in the Azure portal during development. For production use, conversational workflows also support an external chat client that people outside the portal can access, which introduces authentication requirements covered later in this post.

Choosing the right Azure Logic Apps agentic workflow pattern

The decision comes down to one question: does the workflow need a human in the loop during execution?

If the input is fully available at trigger time and the task can be completed without further human input, use the autonomous pattern. If the workflow needs to ask questions, receive feedback, or maintain a conversation across multiple turns, use the conversational pattern.

A few other factors are worth considering:

Trigger flexibility. Autonomous workflows support any Logic Apps trigger, the full library of 1,400+ connectors. Conversational workflows are locked to the When a chat session starts trigger. If your scenario requires a scheduled run, a queue-based trigger, or any event-driven start, autonomous is your only option.

Output destination. Autonomous agents return results to the workflow caller or pass them to a downstream action such as an email, a queue message, or a database write. Conversational agents respond through the chat interface. If the output needs to go somewhere other than a chat window, autonomous is the right fit.

Authentication complexity. Autonomous workflows authenticate using the same patterns as any other Logic Apps workflow: Managed Identity, SAS tokens, and Easy Auth. Conversational workflows that expose an external chat client face a broader authentication challenge: callers can come from dynamic, unknown, or untrusted networks, and every external caller must be authenticated and authorized before going to production. During development, the Azure portal provides a developer key for quick testing in the designer, but this key is explicitly not suitable for production use.

State management. Conversational workflows maintain conversation history across turns automatically. Autonomous workflows have no concept of a session — each run is independent. If your scenario needs memory across multiple interactions, the conversational pattern handles this natively.

What changes in the designer for Logic Apps

Setting up Azure Logic Apps agentic workflows in the designer follows the same steps for both patterns, with two key differences.

When you create a new workflow, select Conversational Agents instead of Autonomous Agents as the workflow type. Logic Apps creates the workflow with the When a chat session starts trigger already in place and an empty agent action connected to it.

The second difference is the chat interface itself. Once the workflow is saved, a chat panel is accessible from the designer toolbar. During development, this is where you test the agent interactively: type a prompt, read the response, and continue the conversation. The run history records each turn as a separate agent iteration, giving you the same visibility into the loop’s behaviour as in an autonomous workflow.

Authentication for conversational workflows in production

The developer key that the Azure portal uses during design and testing is a convenience mechanism tied to your portal session. It is not a substitute for production authentication. The developer key is not designed for large or untrusted caller populations, is not governed by Conditional Access policies at the request execution layer, and cannot be distributed externally.

For production conversational agentic workflows, you need to set up Easy Auth on the Logic App. This applies to external callers, meaning people or agents that access the chat endpoint from outside the Azure portal, all of whom must use proper identity-based authentication. Post 6 of this series covers the complete security picture for agentic workflows, including Easy Auth setup, Managed Identity for backend connections, and the broader threat model for conversational workflows.

Choosing the right pattern: a quick reference

| | Autonomous | Conversational |
| --- | --- | --- |
| Trigger | Any supported trigger | When a chat session starts only |
| Human interaction | None during execution | Turn-by-turn via chat interface |
| Prompt source | Trigger or preceding action output | Human input through chat |
| Output destination | Caller, downstream action, or system | Chat interface response |
| Session state | None (each run is independent) | Maintained across turns |
| External access | Standard Logic Apps auth | Requires Easy Auth for production |
| Best for | Unattended, event-driven tasks | Dialogue, guided flows, multi-turn tasks |

Azure Logic Apps supports two agentic workflow patterns: autonomous and conversational. This post explains how they differ in trigger, prompt source, output, and authentication and helps you decide which pattern fits your scenario.
Figure 1 — Autonomous agentic workflows (left) accept input from any supported Logic Apps trigger and run without human interaction, returning results to a caller or downstream system. Conversational agentic workflows (right) always start with the When a chat session starts trigger, receive prompts from a human through the integrated chat interface, and maintain session state across turns. Both patterns use the same agent loop mechanics: Think, Act, Observe, but differ in trigger flexibility, prompt source, output destination, and production authentication requirements.

What comes next

The next post moves from pattern selection to tooling. Part 4 covers how to build tools for the agent, from built-in and custom connectors to MCP servers as tool providers, and includes the most hands-on demo in the series.

Azure API Management as MCP Gateway: Governing Agentic AI Workloads

Part 7 of 7 in the “APIM for AI Workloads” series

Azure API Management as MCP gateway is the natural endpoint of everything this series has built. In Parts 1 through 6, we established APIM as the control plane for AI workloads: securing access, limiting and measuring token consumption, routing traffic resiliently across backends, and reducing costs through semantic caching. All of that applies equally to agentic workloads. The difference is that agents introduce a new communication pattern: the Model Context Protocol (MCP), which standardizes how AI agents discover and call tools.

In my work and online research on agentic AI architecture, I consistently returned to the same question: how does one govern agent tool calls with the same rigor we apply to API calls? The answer, increasingly, is that APIM handles both. This post covers what that looks like in practice.

What MCP Is and Why It Changes the APIM Story

MCP is an open protocol, originally developed by Anthropic, that defines a standard interface between AI agents (MCP clients) and the tools they call (MCP servers). Instead of each agent framework implementing its own bespoke tool-calling mechanism, MCP gives agents a consistent way to discover available tools, understand their input schemas, and invoke them. Frameworks including Semantic Kernel, AutoGen, and LangGraph are all adding MCP client support.

For APIM, MCP matters because it transforms the gateway from a proxy for AI completions into a broker for agent tool calls. An agent no longer calls your internal APIs directly. Instead, it discovers them as MCP tools through APIM, and APIM enforces the same governance policies on those tool calls that it enforces on any other request. The control plane extends naturally into the agentic layer.

Azure API Management as MCP Gateway: Three Capabilities

Diagram showing Azure API Management acting as an MCP gateway. On the left, AI agents connect as MCP clients. In the centre, APIM exposes REST APIs as MCP tool definitions, proxies external MCP servers, and routes agent-to-agent traffic through the policy pipeline. On the right, Azure OpenAI and AI Foundry backends receive governed requests.
Azure API Management as an MCP gateway. Existing REST APIs are auto-exposed as MCP tool definitions via the export-rest-mcp-server policy. External MCP servers are proxied through APIM. Agent-to-agent traffic passes through the same inbound policy pipeline, with all series policies (authentication, token limits, token metrics, and load balancing) applied uniformly.

APIM’s MCP gateway capabilities fall into three categories:

Expose REST APIs as MCP servers. The export-rest-mcp-server policy takes any API already registered in your APIM catalog and auto-generates MCP tool definitions from it. An agent connecting to your APIM MCP endpoint discovers those tools via the standard MCP protocol and can call them without any knowledge of the underlying REST implementation. Crucially, no changes are required to the underlying API. The policy handles the translation layer entirely within APIM.
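
Conceptually, generating a tool definition from a REST operation means mapping the operation’s identifier, description, and parameters onto a tool name, description, and input schema. The Python sketch below shows one plausible mapping from an OpenAPI operation. It is an assumption about the general shape of such a translation, not a description of how the policy works internally, and the function name is hypothetical.

```python
import re
from typing import Any


def operation_to_tool(path: str, method: str, operation: dict[str, Any]) -> dict[str, Any]:
    """Build an MCP-style tool definition from one OpenAPI operation."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for parameter in operation.get("parameters", []):
        schema = dict(parameter.get("schema", {"type": "string"}))
        if "description" in parameter:
            schema["description"] = parameter["description"]
        properties[parameter["name"]] = schema
        if parameter.get("required"):
            required.append(parameter["name"])

    # Fall back to a name derived from method and path when operationId is absent.
    fallback_name = re.sub(r"[^A-Za-z0-9]+", "_", f"{method}_{path}").strip("_")
    return {
        "name": operation.get("operationId") or fallback_name,
        "description": operation.get("description") or operation.get("summary", ""),
        "inputSchema": {"type": "object", "properties": properties, "required": required},
    }


order_tool = operation_to_tool(
    "/orders/{order_id}",
    "get",
    {
        "operationId": "getOrder",
        "description": "Fetch one order by its ID.",
        "parameters": [
            {"name": "order_id", "in": "path", "required": True, "schema": {"type": "string"}}
        ],
    },
)
```

Whatever the exact mapping, the practical consequence is the same: the quality of the operation IDs and descriptions already in your API catalog determines how usable the generated tools are for an agent.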

Pass through external MCP servers. APIM can proxy external MCP servers — whether third-party services like GitHub or Jira, or custom MCP servers built by your own teams — through the same gateway. All traffic passes through APIM’s policy pipeline, so you apply JWT validation, subscription key enforcement, token limits, and logging to external MCP calls exactly as you would to any other API call. Agents get a single APIM endpoint; APIM handles the routing.

Agent-to-agent (A2A) traffic. In multi-agent architectures, orchestrator agents call sub-agents to delegate tasks. Routing that traffic through APIM means every A2A hop is governed: authenticated, rate-limited, logged, and subject to the same token budget controls applied to end-user traffic. This is particularly relevant for agentic pipelines running on Microsoft Foundry, where multiple specialized agents collaborate within a single workflow.

Applying Series Policies to Agentic Workloads

One of the practical advantages of routing MCP traffic through APIM is that every policy covered in this series applies without modification. Agentic workloads are not a special case requiring a separate governance layer. They use the same pipeline.

  • Authentication (Part 2): Agents authenticate to APIM using subscription keys or JWT tokens. APIM authenticates to AI backends via Managed Identity. The agent never holds backend credentials.
  • Token limits (Part 3): Multi-step agentic pipelines can consume large token volumes per workflow. Per-subscription TPM limits prevent a single runaway pipeline from exhausting shared capacity.
  • Token metrics (Part 4): Token consumption from agentic workflows is attributed to the subscribing team or pipeline via the emit-token-metric policy. FinOps visibility extends automatically to agentic workloads.
  • Load balancing (Part 5): Agentic pipelines often run longer and consume more tokens per call than chat applications. PTU-to-PAYG failover protects pipeline continuity when primary capacity saturates.
  • Semantic caching (Part 6): Agents that make repeated identical tool calls, such as checking a status or looking up a reference value, benefit from semantic caching in the same way chat applications do.
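
To illustrate the token limit item in the list above, here is a minimal Python sketch of a per-subscription tokens-per-minute limiter using a fixed one-minute window. The windowing strategy is an assumption made for simplicity and may differ from what APIM does internally; the class name and the team identifiers are hypothetical.

```python
import time
from typing import Callable


class TokensPerMinuteLimiter:
    """Fixed-window token budget tracked separately for each subscription."""

    def __init__(self, tokens_per_minute: int, clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = tokens_per_minute
        self._clock = clock
        self._usage: dict[str, tuple[int, int]] = {}  # subscription -> (window, used)

    def _used_now(self, subscription_id: str) -> tuple[int, int]:
        window = int(self._clock() // 60)
        stored_window, used = self._usage.get(subscription_id, (window, 0))
        # Usage recorded in an earlier window no longer counts.
        return window, (used if stored_window == window else 0)

    def remaining(self, subscription_id: str) -> int:
        _, used = self._used_now(subscription_id)
        return self._limit - used

    def try_consume(self, subscription_id: str, tokens: int) -> bool:
        """Record the tokens and return True, or return False if over budget."""
        window, used = self._used_now(subscription_id)
        if used + tokens > self._limit:
            return False
        self._usage[subscription_id] = (window, used + tokens)
        return True
```

Because the budget is keyed by subscription, a runaway pipeline in one team exhausts only its own window and leaves other subscribers untouched.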

Practical Considerations for APIM as MCP Gateway

A few agentic-specific considerations are worth calling out before you start routing MCP traffic through APIM.

Tool discovery latency. MCP clients typically discover available tools at session start by calling the MCP server’s tool list endpoint. With APIM in the path, that discovery call passes through the full policy pipeline. Keep your inbound policies lightweight for discovery calls, or cache the tool list response to avoid repeated round trips.
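
If you choose to cache the tool list on the client side, a small time-to-live cache is usually enough. The Python sketch below wraps any function that fetches the tool list; the class name and the five-minute default are hypothetical, and it deliberately ignores concurrency and change notifications from the server.

```python
import time
from typing import Any, Callable, Optional


class ToolListCache:
    """Cache the result of a tool discovery call for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[], list[dict[str, Any]]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[list[dict[str, Any]]] = None
        self._expires_at = 0.0

    def get(self) -> list[dict[str, Any]]:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache miss or expired entry: perform the real discovery call.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

The trade-off is freshness: a newly registered tool stays invisible to the agent until the entry expires, so pick a TTL that matches how often your catalog changes.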

Streaming responses. Many AI completions endpoints support streaming via server-sent events. APIM supports streaming passthrough, but some policies — including semantic cache lookup — do not apply to streaming responses. Structure your pipeline accordingly: apply caching only to non-streaming completion calls.

Session state. MCP conversations are stateful within a session. APIM is stateless between requests, so per-session state must live in the calling agent or an external store. The vary-by pattern from the semantic cache policy can scope cached tool responses by session ID if the agent passes one in a header.

Token budget propagation. In multi-agent pipelines, token budgets need to propagate from the orchestrator to sub-agents. Exposing the remaining token budget from the remaining-tokens-variable-name attribute (Part 3) as a response header lets orchestration frameworks like Semantic Kernel make informed decisions about which sub-agent to invoke next.
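
On the orchestrator side, using that header could look like the Python sketch below: read the remaining budget from the response headers, then pick the preferred sub-agent that still fits. The header name `x-remaining-tokens`, the `SubAgent` shape, and the per-agent token estimates are all hypothetical; use whatever header name your own policy emits.

```python
from dataclasses import dataclass
from typing import Optional

REMAINING_TOKENS_HEADER = "x-remaining-tokens"  # hypothetical header name


@dataclass(frozen=True)
class SubAgent:
    name: str
    estimated_tokens: int  # rough cost of one delegation to this agent
    priority: int          # lower number means preferred


def read_remaining_tokens(headers: dict[str, str]) -> Optional[int]:
    """Parse the remaining budget, or return None if absent or malformed."""
    normalised = {key.lower(): value for key, value in headers.items()}
    raw = normalised.get(REMAINING_TOKENS_HEADER)
    if raw is None:
        return None
    try:
        return max(int(raw), 0)
    except ValueError:
        return None


def pick_sub_agent(agents: list[SubAgent], remaining: Optional[int]) -> Optional[SubAgent]:
    """Choose the highest-priority agent whose estimated cost fits the budget."""
    if remaining is None:
        candidates = list(agents)  # budget unknown: do not restrict the choice
    else:
        candidates = [a for a in agents if a.estimated_tokens <= remaining]
    return min(candidates, key=lambda a: a.priority) if candidates else None


agents = [
    SubAgent(name="deep-research", estimated_tokens=20_000, priority=1),
    SubAgent(name="quick-lookup", estimated_tokens=1_500, priority=2),
]
choice = pick_sub_agent(agents, read_remaining_tokens({"X-Remaining-Tokens": "5000"}))
```

Treating an unknown budget as unrestricted is a design choice; a stricter orchestrator might refuse to delegate at all when the header is missing.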

Azure API Management as MCP Gateway: Closing the Series

This post closes the series, but the control plane it describes is not static. MCP is still evolving rapidly. New APIM policy capabilities for agentic workloads are shipping frequently. The architecture board conversation at various enterprises has shifted from “should we centralize AI traffic through APIM?” to “what do we govern next?”, which is a good place to be.

Diagram showing the complete Azure API Management AI control plane. On the left, five consumer types — AI agents, chat apps, copilots, pipelines, and enterprise apps — connect through a single APIM instance. In the centre, seven policy layers are stacked vertically: authentication, token limit, token metric, load balancing and circuit breaker, semantic caching, MCP gateway, and named value kill switch, each labelled with its series part number. On the right, Azure AI backends including Azure OpenAI PTU and PAYG, AI Foundry, and MCP-enabled backends receive governed requests.
The complete APIM for AI control plane across all seven parts of the series. One APIM instance governs every consumer type, every Azure AI backend, and every governance requirement — including agentic MCP workloads introduced in this post. Each policy layer can be implemented incrementally, starting with authentication and adding capability as workloads mature.

Looking back across the seven posts, the consistent theme is that AI workloads are not fundamentally different from other API workloads in terms of governance requirements. They need authentication, rate limiting, observability, resilience, and cost control. APIM provides all of those. What changes with AI is the unit of measurement (tokens, not requests), the billing model (PTU vs. PAYG), and now the communication protocol (MCP for agents). The control plane adapts to each of these without requiring a parallel governance infrastructure.

The full series index is below for reference. Each post links to the relevant Microsoft documentation and includes policy XML you can use directly.

  • Part 1: Why your AI APIs need a gateway.
  • Part 2: Authentication and authorization.
  • Part 3: Token limit policy.
  • Part 4: Token metric policy and cross-charging.
  • Part 5: Load balancing and circuit breaking.
  • Part 6: Semantic caching.
  • Part 7 (this post): APIM as MCP gateway for agentic AI workloads.

Azure API Management Semantic Caching: Cut AI Token Costs with Similarity-Based Responses

Part 6 of 7 in the “APIM for AI Workloads” series

Azure API Management semantic caching is the most operationally transparent cost optimization in this series. Every technique covered so far (auth, token limits, token metrics, and load balancing) requires deliberate design decisions in how you configure APIM. Semantic caching, by contrast, works silently. Calling applications send prompts as normal. APIM checks whether a semantically similar prompt has already been answered. If a cached prompt matches within a configurable score threshold, APIM returns the cached response without touching the AI backend. Zero completion tokens consumed. No model latency added.

For workloads with repetitive prompt patterns, such as internal FAQ bots, document classifiers, and support agents that see the same questions repeatedly, the cache hit rate can be surprisingly high. Even a 20% hit rate on a high-volume workload translates directly into cost reduction and lower average latency.

How Azure API Management Semantic Caching Works

The azure-openai-semantic-cache-lookup policy sits in the inbound section of your APIM pipeline, before the request reaches the AI backend. When a prompt arrives, APIM sends it to a configured embedding model, typically Azure OpenAI text-embedding-ada-002 or equivalent, to generate a vector representation of the prompt. APIM then compares that vector against cached embeddings stored in Azure Managed Redis using cosine similarity.

The score behaves like a distance: the lower it is, the closer the match. If the score between the incoming prompt and a cached prompt falls below the configured score-threshold, APIM treats it as a cache hit and returns the stored response. If no match meets the threshold, APIM forwards the request to the AI backend as normal and stores the response in Redis for future lookups.
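The lookup decision can be sketched in a few lines of Python. This is a minimal illustration, not APIM's implementation: the function names, the in-memory list standing in for Redis, and the use of cosine distance (1 minus cosine similarity) as the score are all assumptions, chosen so that a lower score means a closer match, in line with the below-threshold rule described above.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def lookup(prompt_vector, cache, score_threshold):
    """Return the closest cached response if its score is below the threshold."""
    best_score, best_response = None, None
    for cached_vector, response in cache:
        score = cosine_distance(prompt_vector, cached_vector)
        if best_score is None or score < best_score:
            best_score, best_response = score, response
    if best_score is not None and best_score < score_threshold:
        return best_response  # cache hit: the backend is never called
    return None  # cache miss: the caller forwards to the backend and stores the result

# Toy 3-dimensional "embeddings" (assumed data); real embeddings are far larger.
cache = [([1.0, 0.0, 0.0], "cached answer A"), ([0.0, 1.0, 0.0], "cached answer B")]
near_a = [0.99, 0.05, 0.0]
unrelated = [0.0, 0.0, 1.0]

hit = lookup(near_a, cache, score_threshold=0.05)
miss = lookup(unrelated, cache, score_threshold=0.05)
```

Here near_a scores roughly 0.001 against the first entry, so it hits, while unrelated scores 1.0 against both entries and misses.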

Azure API Management semantic caching policy flow showing cache hit returning stored response and cache miss forwarding to Azure OpenAI
Diagram 1: Semantic cache request flow. On a cache hit, APIM returns a stored response directly — consuming zero tokens. On a miss, APIM forwards to the AI backend and stores the response in Azure Managed Redis for future hits.

The generic variant, llm-semantic-cache-lookup, works identically for non-Azure backends. Both require the same supporting infrastructure: an embedding model backend and an Azure Managed Redis instance configured in APIM. The semantic cache store policy handles writing responses back to the cache in the outbound section.

Tuning the Score Threshold for Azure API Management Semantic Caching

The score-threshold attribute is the most consequential configuration decision in the semantic caching policy. It controls how similar an incoming prompt must be to a cached prompt for APIM to treat it as a hit. The value runs from 0.0 to 1.0, but the practical range is much narrower.

Azure API Management semantic caching score threshold tuning guide from aggressive to conservative with vary-by subscription user and global scope strategies
Diagram 2: Score threshold tuning guide and vary-by scope strategies. Because a lower score means a closer match, lower thresholds cache more conservatively and higher thresholds cache more aggressively. A starting value of 0.05 suits most production workloads. A global cache (no vary-by) maximizes hit rate but risks serving the wrong user’s response.

In practice, three zones matter:

0.01 to 0.05 (conservative). At this range, only prompts that are very close in wording produce hits. Creative workloads, code generation, and document drafting tend to have high prompt variance, so a conservative threshold avoids serving cached responses to genuinely different requests. The starting value of 0.05 sits at the top of this range.

0.05 to 0.20 (aggressive). At this range, prompts that are paraphrases of each other, such as “What is my account balance?” and “Can you show me my current balance?”, become increasingly likely to produce cache hits. This is the range to explore for FAQ bots, support agents, and any workload where users ask the same questions in slightly different words.

Above 0.30 (too loose). At this threshold, prompts that are only loosely related can match, so the cache risks returning answers to questions the caller did not ask. Avoid this range for production traffic.

Start at 0.05 and monitor cache hit rates in Application Insights. If the hit rate is low for a workload you expect to be repetitive, raise the threshold incrementally. If you start seeing complaints about incorrect or stale responses, lower it.

vary-by Scope: Preventing Cache Pollution

The vary-by element scopes the cache namespace. Without it, all consumers share a single global cache. That maximizes the hit rate but introduces a significant risk: APIM could serve one user’s cached response to a different user. For most enterprise AI workloads, that is unacceptable.

The safest default is to vary by Subscription ID, which gives each API subscriber their own cache namespace. This prevents cross-team cache pollution while still achieving high hit rates within each subscriber’s own prompt patterns. For multi-tenant applications where individual users have distinct contexts, vary by a user identifier extracted from the JWT or a custom header instead.

A global cache with no vary-by is appropriate only for fully public, stateless APIs where responses are identical regardless of who requests them. Internal enterprise AI workloads rarely meet that bar.
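One way to picture vary-by is as a partition prefix on the cache key. The sketch below is illustrative only: the dictionary stands in for Redis, an exact-match lookup stands in for the similarity search, and the key layout is an assumption rather than a description of how APIM stores entries.

```python
class ScopedCache:
    """Toy cache whose entries are partitioned by a vary-by value."""

    def __init__(self):
        self._entries = {}

    def store(self, vary_by, prompt, response):
        # The vary-by value is part of the key, so partitions never overlap.
        self._entries[(vary_by, prompt)] = response

    def lookup(self, vary_by, prompt):
        return self._entries.get((vary_by, prompt))

cache = ScopedCache()
cache.store("subscription-team-a", "What is our refund policy?", "Team A's answer")

same_team = cache.lookup("subscription-team-a", "What is our refund policy?")
other_team = cache.lookup("subscription-team-b", "What is our refund policy?")
```

The second lookup misses even though the prompt is identical, which is the isolation a subscription-scoped vary-by is meant to provide.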

Infrastructure Requirements for Azure API Management Semantic Caching

Semantic caching requires two supporting Azure resources beyond APIM itself. First, an Azure Managed Redis instance configured as an external cache in APIM. Redis stores the prompt embeddings and cached responses. The cache TTL is configurable in the store policy, so you control how long responses remain valid before APIM re-queries the backend.

Second, an embeddings model backend registered in APIM. For Azure OpenAI, this is typically a separate deployment of text-embedding-ada-002 or text-embedding-3-small. The embeddings backend is referenced by the embeddings-backend-id attribute. It is separate from your completions backend, so you can apply independent token limits and load balancing to the embeddings traffic.

One practical consideration: the embeddings call itself consumes tokens and adds a small amount of latency on every request, whether or not the cache hits. For workloads with very low prompt repetition, the overhead of generating embeddings for every request may outweigh the savings from occasional cache hits. Measure the hit rate before committing the infrastructure cost.
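A rough break-even check makes this trade-off concrete. All figures below are hypothetical placeholders, not published prices, and the calculation ignores the fixed cost of the Redis instance; substitute your own per-request costs and measured hit rate.

```python
def net_saving_per_request(hit_rate, completion_cost, embedding_cost):
    """Expected saving per request: avoided completion cost on hits minus the
    embedding cost, which is paid on every request whether or not it hits."""
    return hit_rate * completion_cost - embedding_cost

def break_even_hit_rate(completion_cost, embedding_cost):
    """Hit rate at which the cache pays for its own embedding calls."""
    return embedding_cost / completion_cost

# Hypothetical costs in arbitrary currency units per request.
completion_cost = 0.0100
embedding_cost = 0.0005

threshold_rate = break_even_hit_rate(completion_cost, embedding_cost)
saving_at_20_percent = net_saving_per_request(0.20, completion_cost, embedding_cost)
saving_at_2_percent = net_saving_per_request(0.02, completion_cost, embedding_cost)
```

With these assumed numbers the cache breaks even at a 5% hit rate: a 20% hit rate yields a positive saving per request, while a 2% hit rate costs more than it saves.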

What’s Next in This Azure API Management for AI Series

Part 7 closes the series by covering APIM’s emerging role as an MCP gateway for agentic AI workloads: how to expose REST APIs as MCP servers, pass through existing MCP servers, and manage agent-to-agent traffic through the same control plane we’ve built across this series.

Anatomy of an Agent Loop in Azure Logic Apps

Part 2 of 7 in the Logic Apps Agent Loop series

Part 1 explained why the Azure Logic Apps agent loop is a different design paradigm from conventional workflow automation. This post gets hands-on with the anatomy of that loop. We will look at the four building blocks that make up every agent loop (trigger, instructions, connected model, and tools) and walk through how to wire them together in a Standard logic app.

By the end of this post you will have a working autonomous agent that accepts a prompt from a trigger, reasons over it using Azure OpenAI, invokes a connector action as a tool, and returns a result. The run history will show you exactly how the loop iterated.

The Azure Logic Apps agent loop: four building blocks

Before opening the designer, it helps to have a clear mental model of what you are assembling. Every Azure Logic Apps agent loop consists of four parts.

Trigger

The trigger starts the workflow, exactly as it does in any nonagentic Logic Apps workflow. For an autonomous agent, this can be any supported trigger: an HTTP request, a timer, a Service Bus message, a new email, or anything else in the connector library. The trigger’s output becomes the initial input to the agent: the prompt or data the model will reason over.

Instructions

Instructions are the system prompt for the agent. You provide them as a block of natural language text in the agent action’s configuration pane. They define the agent’s role, what it can and cannot do, how it should respond, and any constraints it should observe. A well-written instructions block is the single most important factor in how well the agent performs. Think of it as the job description you hand to the model at the start of every run.

Connected model

The agent needs a language model to reason with. In Standard Logic Apps, you connect the agent to an Azure OpenAI Service resource and specify the model deployment to use — typically a GPT-4o deployment. The agent sends the instructions, the trigger input, and the results of any tool calls to the model at each iteration of the loop. The model’s response tells the agent what to do next.

Tools

A tool is a sequence of one or more connector actions that the agent can choose to invoke. You build tools directly in the Logic Apps designer by adding actions from the connector gallery inside the agent action. Each tool gets a name and a description — the model uses these to decide which tool to call and when. A single agent can have multiple tools. An agent with no tools can still respond to prompts using the model’s built-in knowledge, but it cannot take action on external systems.

The diagram below shows how these four parts fit together inside a single agent loop execution.

Anatomy of an Azure Logic Apps agent loop — trigger, instructions, model, and tools
Figure 1 — Every Azure Logic Apps agent loop consists of four building blocks: a trigger that starts the workflow, instructions that define the agent’s role, a connected model (Azure OpenAI / GPT-4o) that reasons over each iteration, and tools built from connector actions. The loop cycles through Think, Act, and Observe until the model determines the task is complete.

Building your first agent loop in Azure Logic Apps

The demo for this post is deliberately simple: an agent that receives a topic via an HTTP trigger, uses Azure OpenAI to generate a summary, and returns the result to the caller. One trigger, one model, one tool is enough to see all four building blocks in action and to read a meaningful run history.

Prerequisites

  • A Standard logic app resource deployed in Azure
  • An Azure OpenAI Service resource with a GPT-4o model deployment
  • Contributor access to both resources

Step 1: Create the workflow

In the Azure portal, open your Standard logic app and select Workflows from the sidebar. Choose Add, then select Autonomous Agents as the workflow type. Give the workflow a name and select Stateful. Logic Apps creates a new workflow with an empty agent action already in place.

Step 2: Configure the trigger

The autonomous agent workflow template starts with a When a HTTP request is received trigger by default. Leave the method as POST. In the request body JSON schema, add a single property: topic of type string. This is the input the agent will work with.

Step 3: Write the instructions

Select the agent action in the designer to open its configuration pane. On the Parameters tab, find the Instructions field. Enter something like the following:

You are a research assistant. When given a topic, use the available tools to retrieve relevant information and return a concise summary of no more than three sentences. Always cite your source.

Keep instructions specific and bounded. Vague instructions produce unpredictable behaviour. The model will take the instructions literally, so precision matters.

Step 4: Connect the model

Still on the Parameters tab, select Add connection under the model configuration section. Choose Azure OpenAI Service, select your resource, and choose your GPT-4o deployment. Logic Apps establishes the connection and stores it against the workflow.

Step 5: Add a tool

Inside the agent action, select Add a tool. This opens the connector gallery filtered to actions that can be used as tools. For this demo, add the HTTP action as a tool — name it search_web, give it the description “Retrieves content from a given URL”, and configure it to accept a URL as input. In a production scenario you would use Azure AI Search or a more capable connector here; the HTTP action keeps the demo self-contained.

Step 6: Save and run

Save the workflow. Use a REST client to POST a JSON body like {"topic": "Azure Logic Apps agent loop"} to the workflow’s trigger URL. The agent fires, the model reasons over the instructions and the topic, invokes the search tool, and returns a summary.
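If you prefer a script to a REST client, the request can be built with the Python standard library. This is a minimal sketch: the URL is a placeholder (copy the real callback URL from the trigger in the designer), and the helper name build_trigger_request is made up for this example.

```python
import json
import urllib.request

def build_trigger_request(trigger_url, topic):
    """Build (but do not send) a POST request carrying the topic as JSON."""
    body = json.dumps({"topic": topic}).encode("utf-8")
    return urllib.request.Request(
        trigger_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL, not a real endpoint.
request = build_trigger_request(
    "https://example.invalid/workflows/agent-demo/triggers/request",
    "Azure Logic Apps agent loop",
)

# To send it for real, uncomment the lines below.
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the build step separate from the send step makes it easy to inspect the payload before anything leaves your machine.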

Logic Apps designer view of an autonomous agent workflow. The canvas shows an HTTP request trigger connected to an Agent action containing a Tool with an HTTP action inside. The right pane shows the Agent parameters: AI model set to GPT-4o via Foundry Models, instructions for the research assistant role, and the topic dynamic value wired as user instructions item 1.
Figure 2 — The completed agent configuration in the Logic Apps designer. The Agent action is connected to GPT-4o via Foundry Models, the instructions define the research assistant role and output format, and the topic value from the HTTP trigger is passed in as the user instruction. The Tool contains a single HTTP action that the agent can invoke to retrieve content from a given URL.

Reading the run history

The run history is where the Azure Logic Apps agent loop becomes visible. Open the workflow’s Run history and select the latest run. You will see the trigger, followed by the agent action. Expand the agent action and you will find each iteration of the loop shown as a numbered step: the model’s reasoning output, the tool call with its inputs and outputs, and the model’s decision on whether to loop again or return a final answer.

This is the key difference from a nonagentic run history. In a conventional workflow, the run history shows a flat list of actions. In an agent loop, it shows a nested, iterative structure: the model’s chain of thought made visible.

Run history of a Logic Apps autonomous agent workflow completed in 9.21 seconds, showing the HTTP trigger, a first agent iteration that invoked the HTTP tool in 3.7 seconds, and a second agent iteration that sent the final chat message.
Figure 3 — The run history of the minimal autonomous agent from this post. The loop ran two iterations: the first agent step (3.9s) reasoned over the topic prompt and invoked the HTTP tool (0.6s); the second agent step (3.2s) observed the result and composed the final response. The canvas shows iteration 1 of 3 steps — trigger, tool, and HTTP action — all succeeded in 9.21 seconds total.

For a simple prompt, you may see a single iteration. For a more complex task involving multiple tool calls, you will see the loop unfold across three, five, or more steps. Each step shows exactly what the model decided and why.

Standard versus Consumption: model connections

In Standard logic apps, you configure the model connection yourself — selecting an Azure OpenAI Service resource and specifying the deployment. This gives you full control over which model version you use, where it is hosted, and how it is secured via Managed Identity.

In Consumption logic apps (currently in public preview), the model connection is handled via Microsoft Foundry and the configuration is more constrained. For any production workload, Standard remains the right choice.

What comes next

The agent in this post is autonomous: it runs without human interaction, triggered by an HTTP call and returning a result when done. That covers a wide range of integration scenarios, but not all of them. Some tasks require a back-and-forth with a user: a support conversation, a guided data-entry flow, a multi-turn research session.

The next part will cover exactly that distinction, autonomous versus conversational agentic workflows, and walk through when to choose each pattern and what changes in the designer when you do.

Why the Agent Loop Changes Everything in Azure Logic Apps

Part 1 of 7 in the Logic Apps Agent Loop series

The Azure Logic Apps agent loop introduces a fundamentally different way to design workflows on the platform. While conventional Logic Apps workflows follow a fixed sequence of steps defined at design time, the agent loop delegates reasoning to a large language model at runtime, looping through think, act, and observe cycles until a task is complete. This post opens a seven-part series on building agentic workflows in Logic Apps. It starts with the question that matters most: why does this change anything?

For years, Azure Logic Apps has been the platform of choice for integration architects who need to orchestrate business processes across cloud services and on-premises systems. You build a workflow, wire up triggers and actions, define your conditions, handle your errors, deploy, and move on. The flow is predictable (deterministic): given the same inputs, it does the same thing every time. That predictability is the point.

The agent loop breaks that contract, deliberately and usefully.

With the introduction of agentic workflows in Azure Logic Apps, Microsoft has extended the platform from a fixed automation engine into something that can reason, adapt, and decide. At its core, the agent loop drives this shift. It is a repeating process: the connected language model thinks through a problem, selects a tool, acts on the result, and decides whether the task is done. Unlike a conventional workflow, there is no hardcoded sequence of steps. Instead, the model determines the path based on the task.

Before going hands-on with triggers, connectors, and multi-agent patterns in later posts, this opening post makes the case for why the agent loop matters and what it fundamentally changes about how you think about workflow design.

How the Azure Logic Apps agent loop differs from nonagentic workflows

Nonagentic Logic Apps workflows are excellent at exactly the kind of work they were designed for: stable, predictable, repeatable processes. An approval workflow, an ETL pipeline, and a B2B message exchange are all scenarios where the path through the workflow is known in advance. The trigger fires, the conditions evaluate, the actions execute in sequence, and the run history tells you exactly what happened and why.

The challenge arises when the environment you are integrating with is unstable or unpredictable. When incoming data is unstructured. Or when the right action depends on context that cannot be captured in a condition expression. Or when you need to handle a customer query that could go a dozen different directions depending on what the customer actually says.

These are the cases where deterministic workflows buckle. You end up building sprawling switch-case structures, hardcoding edge cases as branches, and constantly patching the workflow every time a new variation appears. The workflow becomes a maintenance problem rather than a solution.

Agentic workflows excel in dynamic environments where unexpected events occur, the choice of the right tool relies on the input, and the system must manage unstructured data without specific instructions for each variant.

The agent loop: Think, Act, Learn

The Azure Logic Apps agent loop follows a three-step process.

Think. The agent collects available information: task instructions, prior inputs, and previous tool results. It then passes all of this to the connected language model. The model reasons over the context and decides what to do next: invoke a tool, ask a follow-up question, or return a final answer.

Act. The agent invokes the tool the model selected. In Logic Apps, tools are built from actions drawn from 400+ connectors, including Azure OpenAI, Azure AI Search, Office 365, and custom APIs. Once the action runs, the result feeds back into the next cycle.

Learn (optional). The loop can adapt: the agent can use feedback or external signals to adjust its behaviour over time, though this is the most advanced capability and not required for most workflows.

Iterations, not instructions

This loop continues (think, act, observe, decide) until the model determines the task is complete. You can change the number of iterations as needed. A simple query might resolve in one loop. A complex multi-step task might require five or ten.
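The control flow can be sketched in plain Python. This is a conceptual illustration only: Logic Apps runs the loop for you, and every name here (run_agent_loop, scripted_model, the lookup tool, the max_iterations cap) is hypothetical, with a scripted function standing in for the language model.

```python
def run_agent_loop(task, model, tools, max_iterations=10):
    """Iterate think -> act -> observe until the model returns a final answer."""
    history = [("task", task)]
    for iteration in range(1, max_iterations + 1):
        decision = model(history)                            # think
        if decision["type"] == "final":
            return decision["answer"], iteration             # decide: done
        result = tools[decision["tool"]](decision["input"])  # act
        history.append((decision["tool"], result))           # observe
    raise RuntimeError("iteration limit reached without a final answer")

def scripted_model(history):
    """Stand-in for an LLM: call the lookup tool once, then answer."""
    if len(history) == 1:
        return {"type": "tool", "tool": "lookup", "input": history[0][1]}
    return {"type": "final", "answer": f"Summary based on: {history[-1][1]}"}

tools = {"lookup": lambda topic: f"notes about {topic}"}

answer, iterations = run_agent_loop("agent loops", scripted_model, tools)
```

In this run the first iteration calls the tool and the second returns the final answer, so the loop ends after two iterations. Nothing in the code fixes that number in advance; the model's decisions do.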

The diagram below shows the difference between a conventional nonagentic workflow, which follows a linear sequence of predetermined steps, and the agent loop, which dynamically iterates until the model determines that the task is complete.

Figure 1 — A conventional nonagentic workflow follows a fixed path defined at design time (left). The agent loop iterates dynamically at runtime: the LLM thinks, acts, observes the result, and decides whether to loop again or return a final answer (right).

Agent versus nonagentic: a structural comparison

The difference is not just philosophical. It shows up in how you design, deploy, and maintain the workflow.

In a nonagentic workflow, the logic architect owns the decision tree. Every branch, every condition, every action path is explicitly modelled. This is powerful for known, bounded scenarios, but it places all the reasoning burden on the architect at design time.

In an agentic workflow, the reasoning is delegated to the model at runtime. The architect’s job shifts: instead of modelling every path, you define the agent’s instructions, give it the right tools, and trust the model to navigate the task. This is a different skill and a different mindset, closer to prompt engineering and system design than to traditional workflow modelling.

The Microsoft documentation puts it plainly: agentic workflows can adapt to environments where unexpected events happen, choose which tools to use based on prompts and available data, and handle unstructured data at a level of flexibility that nonagentic workflows simply cannot match. Nonagentic workflows, by contrast, function best in stable environments with static, predictable, repetitive tasks.

Neither is universally better. They address different problems. But for integration architects, the arrival of the agent loop means Logic Apps can now cover territory that previously required a custom-coded application or a fully separate agent framework.

Standard versus Consumption: what you need to know now

Azure Logic Apps offers two hosting models: Standard (single-tenant, runs on Azure Functions runtime) and Consumption (multitenant, pay-per-execution). Agentic workflows are fully available in Standard. Consumption support is in public preview as of early 2026 and carries some restrictions.

For production agentic workloads, Standard is the right choice today. The rest of this series will use Standard throughout, with notes where the Consumption behaviour differs.

What this series covers

The seven posts in this series move from concept to production:

  1. Why the agent loop changes everything — this post
  2. Anatomy of an agent loop — instructions, the connected model, tool calls, and how the loop iterates
  3. Autonomous versus conversational workflows — choosing between unattended execution and human-in-the-loop patterns
  4. Building tools for the agent — connectors, custom connectors, and MCP servers as tool providers
  5. Multi-agent patterns — handoffs, orchestrators, and sequential agent loops
  6. Securing agentic workflows — authentication, the expanded caller surface, and Easy Auth
  7. Observability, pricing, and running in production — Application Insights, agent loop pricing, and DevOps deployment

The next post gets hands-on: we will look at the anatomy of a single agent loop in the Logic Apps designer, walk through the instructions pane, wire up Azure OpenAI as the model, and watch the run history to see how the iterations unfold.

Azure API Management Load Balancing and Circuit Breaker for AI Backends

Part 5 of 7 in the “APIM for AI Workloads” series

Azure API Management load balancing for AI workloads solves a problem that every team hits once they move beyond a single Azure OpenAI deployment: PTU capacity is finite, PAYG is a safety net, and when things go wrong on one backend, the rest of your workload should not notice. In Part 1 of this series, I described PTU vs. PAYG as a routing problem. This post is where we solve it.

The combination of backend pools, priority-based routing, and circuit breaker rules in APIM gives you a resilient AI gateway that handles three distinct failure modes: PTU saturation (too many tokens consumed against reserved capacity), regional outages, and transient backend errors. None of these requires changes to calling applications. APIM absorbs the complexity and presents a single stable endpoint.

Azure API Management Load Balancing: Backend Pools for AI

APIM’s backend pool feature lets you define a named group of AI backends and route to them as a unit. You reference the pool in the set-backend-service policy by its pool ID. When a request arrives, APIM selects a backend from the pool based on priority and weight, tracks health state via the circuit breaker, and retries on the next available member if the selected backend fails.

For AI workloads, the standard pattern uses two tiers. The first tier is your PTU deployment (reserved capacity in a primary region), assigned priority 1. The second tier is a PAYG deployment in a secondary region, assigned priority 2. APIM routes all traffic to the PTU backend as long as the PTU backend is healthy. When PTU returns a 429 (capacity exceeded) error or becomes unreachable, the circuit breaker trips, and APIM automatically fails over to the PAYG backend.

Azure API Management load balancing backend pool with PTU primary PAYG overflow and circuit breaker tripped on unavailable backend
Diagram 1: APIM backend pool with three members. The PTU backend (priority 1) handles normal load, while the PAYG backend (priority 2) absorbs overflow. After repeated 429 responses, Backend #3 has tripped its circuit breaker and is bypassed until the probe succeeds.

Priority determines the preference order: lower numbers are preferred. Weight applies when multiple backends share the same priority, distributing load proportionally between them. A common pattern for multi-region PTU deployments is two PTU backends at priority 1, each with a different weight reflecting their provisioned capacity, and a shared PAYG backend at priority 2 as the common overflow.
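The selection rule can be sketched as follows. This is a simplified model, not APIM's algorithm: the backend names, the weights, and the healthy flag (which stands in for circuit breaker state) are assumptions for illustration.

```python
import random

def select_backend(backends, rng=random):
    """Pick a healthy backend: lowest priority number first, then weighted choice."""
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backend available")
    top_priority = min(b["priority"] for b in healthy)
    candidates = [b for b in healthy if b["priority"] == top_priority]
    weights = [b["weight"] for b in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical pool: two PTU backends sharing priority 1, one PAYG overflow.
pool = [
    {"name": "ptu-region-1", "priority": 1, "weight": 3, "healthy": True},
    {"name": "ptu-region-2", "priority": 1, "weight": 1, "healthy": True},
    {"name": "payg-overflow", "priority": 2, "weight": 1, "healthy": True},
]

normal_choice = select_backend(pool)["name"]

# Simulate both PTU backends having tripped their circuit breakers.
for backend in pool[:2]:
    backend["healthy"] = False
failover_choice = select_backend(pool)["name"]
```

With these assumed weights, ptu-region-1 receives about three quarters of the priority 1 traffic, and the PAYG backend is used only once no priority 1 backend is healthy.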

Circuit Breaker Configuration for Azure API Management AI Backends

The circuit breaker is what makes the backend pool resilient rather than just load-balanced. Without it, APIM continues routing to a saturated or unavailable backend on every request, each one failing with a 429 or timeout before falling back. The circuit breaker short-circuits that path: after a configurable number of failures within a time window, it marks the backend as OPEN and stops sending traffic to it entirely.

Azure API Management circuit breaker state machine showing closed open and half-open states for AI backend failover
Diagram 2: Circuit breaker state machine. CLOSED is normal operation. Exceeding the failure threshold trips the breaker to OPEN, bypassing the backend. After tripDuration seconds, APIM sends a single probe request to test recovery. Success returns to CLOSED; failure reopens the circuit.

The three circuit breaker states map directly to operational behavior:

CLOSED is the normal state. All requests are routed to the backend. APIM counts failures within the configured interval, and the counter resets at the end of each interval if the number of failures remains below the threshold.

After enough failures to exceed the threshold, the breaker trips to OPEN. In this state, APIM bypasses the backend entirely and routes to the next available pool member without attempting the failed backend again. The tripDuration timer starts counting down immediately.

Once tripDuration elapses, the breaker enters HALF-OPEN and sends a single probe request to test recovery. A successful response transitions the backend back to CLOSED. A failure resets the timer and keeps the circuit OPEN.

For Azure OpenAI specifically, 429 should always be in your failureCondition alongside 503 and 504. A 429 from a PTU endpoint indicates that the provisioned throughput ceiling has been reached and the backend is temporarily unable to serve requests. That is exactly the condition you want to trip the circuit and fail over to PAYG, rather than returning errors to the caller.
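The state machine described above can be modelled in a short Python class. This is a toy sketch under stated assumptions, not APIM's implementation: the class and method names are invented, time is passed in explicitly as seconds, the breaker trips when the failure count reaches the threshold, and the single-probe limit in HALF-OPEN is not enforced.

```python
class CircuitBreaker:
    """Toy breaker: CLOSED -> OPEN after `threshold` failures within `interval`
    seconds; OPEN -> HALF-OPEN after `trip_duration`; the probe decides the rest."""

    FAILURE_STATUSES = {429, 503, 504}

    def __init__(self, threshold=3, interval=60, trip_duration=30):
        self.threshold = threshold
        self.interval = interval
        self.trip_duration = trip_duration
        self.state = "CLOSED"
        self.failures = 0
        self.window_start = 0.0
        self.opened_at = None

    def allow_request(self, now):
        """Return True if a request (or a probe) may go to the backend."""
        if self.state == "OPEN" and now - self.opened_at >= self.trip_duration:
            self.state = "HALF-OPEN"
        return self.state != "OPEN"

    def record(self, status, now):
        """Update the breaker with the status code of a completed request."""
        failed = status in self.FAILURE_STATUSES
        if self.state == "HALF-OPEN":
            if failed:
                self._trip(now)   # probe failed: reopen and restart the timer
            else:
                self._reset(now)  # probe succeeded: back to normal operation
            return
        if now - self.window_start >= self.interval:
            self.failures, self.window_start = 0, now  # start a new counting window
        if failed:
            self.failures += 1
            if self.failures >= self.threshold:
                self._trip(now)

    def _trip(self, now):
        self.state, self.opened_at = "OPEN", now

    def _reset(self, now):
        self.state, self.failures, self.window_start = "CLOSED", 0, now

breaker = CircuitBreaker(threshold=3, interval=60, trip_duration=30)
for second in (1, 2, 3):
    breaker.record(429, now=second)
state_after_failures = breaker.state

allowed_while_open = breaker.allow_request(now=10)
allowed_after_trip = breaker.allow_request(now=33)
state_during_probe = breaker.state
breaker.record(200, now=34)
state_after_probe = breaker.state
```

Three 429 responses open the circuit, requests are refused until 30 seconds have passed, and a successful probe closes it again.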

Sizing Circuit Breaker Parameters for AI Workloads

The right circuit breaker parameters depend on your traffic pattern and how quickly you need failover to activate. A few practical guidelines:

threshold: For AI workloads, 3 to 5 failures is a reasonable starting point. PTU endpoints return 429 consistently when saturated, so you don’t need a high threshold to detect the condition. Setting it too high means you absorb too many failed requests before failing over.

interval: 60 seconds works well for most workloads. This is the window over which failures are counted. Shorter intervals are more sensitive to transient errors, while longer ones suit bursty traffic patterns where a few failures in a short window are expected.

tripDuration: 30 seconds is a sensible default. PTU capacity refreshes on a per-minute basis, so a 30-second trip duration gives the backend time to recover before the probe fires. For deployments where PTU saturation is a known recurring pattern, a longer trip duration (60 to 120 seconds) reduces the frequency of failed probes.

Retry Policy and Agentic Workload Considerations

Backend pool failover and circuit breaking handle backend-level failures, but you may also want a retry policy in your APIM inbound pipeline for transient errors that do not warrant a full circuit trip. The retry policy can be scoped to specific status codes and configured with a backoff interval, giving you a two-level resilience model: retry for transient errors, circuit break for sustained failures.
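The retry half of that model can be illustrated with a small helper. This is a conceptual sketch rather than the APIM retry policy itself: the function name, the retryable status codes, and the exponential backoff schedule are assumptions chosen for the example.

```python
import time

def call_with_retry(send, retry_statuses=(502, 503), max_attempts=3,
                    base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.
    Waits base_delay, then 2x, 4x, ... between attempts (exponential backoff)."""
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in retry_statuses or attempt == max_attempts:
            return status, attempt
        sleep(base_delay * 2 ** (attempt - 1))

# Fake backend: one transient 503, then success. Delays are recorded, not slept.
responses = iter([503, 200])
delays = []
final_status, attempts_used = call_with_retry(lambda: next(responses), sleep=delays.append)
```

A single transient error is absorbed by one retry and never reaches the circuit breaker's failure count in this model; sustained failures would exhaust the attempts and surface to the breaker instead.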

For agentic workloads specifically, failover behavior needs careful thought. A conversational agent whose traffic APIM silently switches mid-session from a PTU to a PAYG backend will not notice the change at the model API level. But agentic pipelines with multiple sequential tool calls are more sensitive: a mid-pipeline failover can introduce latency spikes that cause timeouts in orchestration layers such as Azure Logic Apps or Semantic Kernel.

The practical mitigation is to expose the remaining token budget via the token limit policy variable from Part 3 and have the orchestration layer monitor it to proactively slow down before circuit breaking kicks in. Prevention is cheaper than recovery when the workload is stateful.

What’s Next in This Azure API Management for AI Series

Part 6 covers semantic caching: how APIM uses an embeddings model and Azure Managed Redis to serve cached responses for semantically similar prompts, reducing token consumption and latency without any changes to calling applications.