Microsoft Foundry Citadel Platform on Azure: A Practitioner’s Deployment Guide

Posted on June 24, 2026 by steefjan1970

Microsoft Foundry Citadel Platform on Azure is a layered AI governance architecture that delivers production-ready agent deployments with unified governance, end-to-end observability, and centralized policy enforcement via Azure API Management. It is still in preview, and the documentation assumes a degree of familiarity with Azure infrastructure that not everyone has on day one. This post walks through what it actually takes to get a working hub-and-spoke running in Sweden Central, including the pitfalls, so you can decide whether it is a viable starting point for your own AI platform journey.

What Citadel Is (and Is Not)

Before touching the tooling, it helps to understand what Citadel actually deploys. The architecture has four layers:

Layer 1, the Governance Hub, is the runtime enforcement plane: Azure API Management as a centralized AI gateway, Azure API Center as a model registry, and supporting services for content safety, PII detection, cost attribution, and usage telemetry.

Layer 2, the AI Control Plane, provides observability via the Foundry Control Plane: agent-level execution traces, AI evaluations in development and production, red-teaming, drift monitoring, and fleet dashboards.

Layer 3, Agent Identity, turns agents into managed enterprise assets via Microsoft Entra ID, with lifecycle management, sponsorship models for human accountability, and shadow AI discovery.

Layer 4, the Security Fabric, weaves Defender, Purview, and Entra across the other three layers for real-time threat intelligence, data governance, and compliance automation.

For this guide, we deploy Layer 1 (the Governance Hub via the AI Hub Gateway Solution Accelerator) and a Layer 1/2 spoke (via the AI Landing Zone Bicep). Layers 3 and 4 reference existing Azure services (Entra ID, Defender, Purview) that you integrate separately.

Important: Citadel is currently in preview. The repos, parameter schemas, and CLI commands will change. Treat everything in this post as a starting point, not a stable reference.

Prerequisites

Before you start, make sure you have:

An Azure subscription with Azure OpenAI access approved (aka.ms/oaiapply)
Microsoft.Authorization/roleAssignments/write on the subscription (Owner or User Access Administrator role)
Azure CLI installed and authenticated (az login)
Azure Developer CLI (azd) installed
Node.js — use v20 LTS, not v24. Node 24 on Windows has a known issue where npm bundles are incomplete, causing MODULE_NOT_FOUND errors on npm-cli.js and npm-prefix.js when azd tries to package Logic App components

If you run into npm issues on Windows, the cleanest workaround is Azure Cloud Shell, where Node, npm, az, and azd are all pre-installed and healthy.

Part 1: Deploying the Microsoft Foundry Citadel Governance Hub

Clone the AI Hub Gateway Solution Accelerator:

			
git clone https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator.git
cd ai-hub-gateway-solution-accelerator

Create your azd environment:

			
azd auth login
azd env new ai-hub-gateway-dev
azd env set AZURE_LOCATION swedencentral

Create a parameters file at infra/main.parameters.json. The key decisions:

Model versions matter. At the time of writing, gpt-4o-mini versions 2024-07-18 and 2024-10-18 are retired. Use gpt-4o version 2024-11-20 with GlobalStandard SKU. Always verify current model availability at aka.ms/aoai-regions before deploying, because these change frequently.

			
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environmentName": { "value": "ai-hub-gateway-dev" },
    "location": { "value": "swedencentral" },
    "apimSku": { "value": "Developer" },
    "openAiInstances": {
      "value": {
        "openAi1": {
          "name": "openai1",
          "location": "swedencentral",
          "deployments": [
            {
              "name": "chat",
              "model": { "format": "OpenAI", "name": "gpt-4o", "version": "2024-11-20" },
              "sku": { "name": "GlobalStandard", "capacity": 20 }
            },
            {
              "name": "embedding",
              "model": { "format": "OpenAI", "name": "text-embedding-3-large", "version": "1" },
              "sku": { "name": "Standard", "capacity": 20 }
            }
          ]
        }
      }
    },
    "provisionFunctionApp": { "value": false },
    "createAppInsightsDashboard": { "value": false },
    "enableAIGatewayPiiRedaction": { "value": true },
    "enableAIModelInference": { "value": true }
  }
}

		

Deploy:

azd up

Expect 45–90 minutes. APIM Developer SKU is the slow component. If the deployment fails partway through, re-run azd up; it is idempotent and will pick up where it left off.

Azure CLI output showing successful deployment of the Microsoft Foundry Citadel Governance Hub including APIM, Azure OpenAI chat and embedding model deployments, private endpoints, and Logic App in Sweden Central. The AI Hub Gateway Solution Accelerator was deployed successfully in Azure Sweden Central after 21 hours and 31 minutes, provisioning APIM, Azure OpenAI, Content Safety, Application Insights, private endpoints, and the usage processing Logic App.

Pitfall: Managed Identity Race Condition

You will likely see this error on first attempt:

BadRequest: The provided principal ID was not found in the AAD tenant(s)

This is a known race condition — the Managed Identity is created but has not yet propagated in Entra ID before the role assignment fires. Re-run azd up without any changes and it will succeed.
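The "re-run until the identity has propagated" advice can also be scripted. The following is a minimal sketch, not part of the accelerator: run_with_retry, azd_up, and the fixed 60-second delay are hypothetical choices for illustration, and the only real command involved is azd up with its --no-prompt flag.

```python
import subprocess
import time
from typing import Callable


def run_with_retry(step: Callable[[], bool], attempts: int = 3,
                   delay_seconds: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `step` until it returns True; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if step():
            return attempt
        if attempt < attempts:
            # Assumed delay: give Entra ID time to propagate the new identity.
            sleep(delay_seconds)
    raise RuntimeError(f"step still failing after {attempts} attempts")


def azd_up() -> bool:
    """Run `azd up` non-interactively and report whether it exited cleanly."""
    return subprocess.run(["azd", "up", "--no-prompt"]).returncode == 0


# Usage (runs the real deployment, so it is left commented out):
# run_with_retry(azd_up)
```

Because azd up is idempotent, repeating it is safe; the retry only saves you from watching the terminal.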

Validate the Hub

Once deployed, run:

azd env get-values | grep APIM

You will get your APIM gateway URL. Test it with a chat completion:

			
$headers = @{
  "Content-Type" = "application/json"
  "api-key" = "<YOUR_APIM_SUBSCRIPTION_KEY>"
}
$body = '{"messages":[{"role":"user","content":"Hello from the AI Hub Gateway!"}],"max_tokens":100}'
Invoke-RestMethod `
  -Uri "https://<your-apim>.azure-api.net/openai/deployments/chat/chat/completions?api-version=2024-02-01" `
  -Method POST -Headers $headers -Body $body

		

PowerShell output showing a successful chat completion response from the Microsoft Foundry Citadel APIM gateway in Azure Sweden Central, with content filter results, prompt filter results, and token usage confirmed. — Validating the Citadel Governance Hub by calling the APIM gateway endpoint via PowerShell, the response confirms gpt-4o-2024-11-20 routing, Content Safety filtering, PII redaction, and token usage tracking are all active.

A successful response that includes content_filter_results and prompt_filter_results confirms that Content Safety filtering ran on the request through the gateway pipeline that also applies PII redaction. The usage block reports the token counts that the hub logs to Cosmos DB for cost attribution.
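If you want to turn that visual check into a repeatable smoke test, something like the sketch below works. It assumes the response follows the Azure OpenAI chat completion shape, with prompt_filter_results at the top level and content_filter_results on each choice; missing_governance_fields is a hypothetical helper name.

```python
from typing import Any


def missing_governance_fields(response: dict[str, Any]) -> list[str]:
    """Return the names of expected fields that are absent from a chat completion."""
    missing = []
    if "prompt_filter_results" not in response:
        missing.append("prompt_filter_results")
    choices = response.get("choices") or []
    # content_filter_results is reported per choice, so inspect the first one.
    if not choices or "content_filter_results" not in choices[0]:
        missing.append("content_filter_results")
    usage = response.get("usage") or {}
    if "total_tokens" not in usage:
        missing.append("usage.total_tokens")
    return missing
```

An empty list means all three signals are present. It does not prove that the downstream logging pipeline wrote anything, so check Cosmos DB separately.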

Part 2: Deploying a Citadel Platform Agent Spoke on Azure

The spoke is deployed from the AI Landing Zone Bicep repo. Download it as a ZIP (no GitHub account required):

			
https://github.com/Azure/bicep-ptn-aiml-landing-zone/archive/refs/heads/main.zip

Extract and navigate to the folder. Create a resource group for the spoke:

az group create --name rg-ai-spoke-dev --location swedencentral

Create a spoke.parameters.json file. Several things to know upfront:

The parameter schema does not match what the Citadel README suggests. The actual template parameters differ from the example file. Key differences discovered in practice: aiFoundryLocation does not exist as a separate parameter; deployMcp, greenFieldDeployment, deployPostgres, and useCMK are not in this version of the template; and solutionStorageAccountName is simply storageAccountName.

The modelDeploymentList uses nested objects, not flat properties:

			
"modelDeploymentList": {
  "value": [
    {
      "name": "chat",
      "model": { "format": "OpenAI", "name": "gpt-4o", "version": "2024-11-20" },
      "sku": { "name": "GlobalStandard", "capacity": 20 },
      "canonical_name": "CHAT_DEPLOYMENT_NAME",
      "apiVersion": "2025-04-01-preview"
    },
    {
      "name": "text-embedding",
      "model": { "format": "OpenAI", "name": "text-embedding-3-large", "version": "1" },
      "sku": { "name": "Standard", "capacity": 10 },
      "canonical_name": "EMBEDDING_DEPLOYMENT_NAME",
      "apiVersion": "2025-04-01-preview"
    }
  ]
}

		

containerAppsList cannot be an empty array. The template references containerApps[0] internally and will fail validation if the array is empty. Pass at least one placeholder entry.
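Both constraints (nested model and sku objects, and a non-empty containerAppsList) are cheap to check before you wait on a deployment. This is a minimal pre-flight sketch that covers only those two rules as described in this post; preflight is a hypothetical helper, not something shipped with the template.

```python
import json


def preflight(parameters_json: str) -> list[str]:
    """Return a list of problems found in a deployment parameters document."""
    params = json.loads(parameters_json).get("parameters", {})
    problems = []

    # The template indexes the first container app, so an empty list fails.
    container_apps = params.get("containerAppsList", {}).get("value")
    if not container_apps:
        problems.append("containerAppsList must contain at least one entry")

    deployments = params.get("modelDeploymentList", {}).get("value", [])
    for index, deployment in enumerate(deployments):
        for key in ("model", "sku"):
            if not isinstance(deployment.get(key), dict):
                problems.append(
                    f"modelDeploymentList[{index}].{key} must be a nested object"
                )
    return problems
```

Run it against the contents of spoke.parameters.json and only call az deployment group create when the list comes back empty.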

Deploy:

			
az deployment group create `
  --resource-group rg-ai-spoke-dev `
  --template-file main.bicep `
  --parameters @spoke.parameters.json

Pitfalls in the Spoke Deployment

AI Search Standard SKU capacity exhaustion. Sweden Central frequently runs out of AI Search Standard SKU capacity. You will see ResourcesForSkuUnavailable. This affects both the standalone Search Service and the AI Foundry Agent Service’s internal Search instance. Disable both:

			
"deploySearchService": { "value": false },
"deployAAfAgentSvc": { "value": false }

You can re-enable them later once capacity is available, or deploy Search in a different region.

Soft-deleted resources block redeployment. Azure retains soft-deleted Cognitive Services accounts, Key Vaults, and App Configuration stores for a retention period that differs per service (up to 90 days for Key Vault). If you delete a resource group and redeploy with the same names inside that period, the deployment will fail with FlagMustBeSetForRestore or NameUnavailable. Purge them explicitly before redeploying:

			
# List and purge soft-deleted resources
az keyvault list-deleted --subscription <sub-id> -o table
az keyvault purge --name <name> --location swedencentral
az appconfig list-deleted --subscription <sub-id> -o table
az appconfig purge --name <name> --location swedencentral --yes
az cognitiveservices account list-deleted --subscription <sub-id> -o table
az cognitiveservices account purge --name <name> --resource-group <original-resource-group> --location swedencentral

		

Key Vault purges are slow — allow 2–5 minutes per vault.

Bastion subnet ID resolution fails with networkIsolation=false. When you disable network isolation, the template passes a relative subnet ID to Bastion instead of a fully qualified resource ID. Disable Bastion, Jump VM, and NAT Gateway for the dev spoke:

			
"deployBastion": { "value": false },
"deployJumpbox": { "value": false },
"deployVM": { "value": false },
"deployNatGateway": { "value": false }

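Write parameters files without BOM. On Windows PowerShell 5.1, Out-File -Encoding utf8 adds a Byte Order Mark (BOM) that causes az deployment to fail with Unable to parse parameter. Use either of the following; utf8NoBOM requires PowerShell 6 or later, while the .NET call works in both versions: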

			
$content | Out-File -FilePath "spoke.parameters.json" -Encoding utf8NoBOM
# or
[System.IO.File]::WriteAllText("spoke.parameters.json", $content, [System.Text.UTF8Encoding]::new($false))
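If a parameters file has already been written with a BOM, you can repair it in place instead of regenerating it. This is a small sketch using only the Python standard library; strip_utf8_bom is a hypothetical helper name.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False if the file was clean.
    """
    file = Path(path)
    data = file.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    file.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Usage: strip_utf8_bom("spoke.parameters.json")
```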

Part 3: Wiring the Citadel Spoke to the Azure APIM Hub

Add the hub’s APIM gateway URL and subscription key to the spoke’s App Configuration:

			
az appconfig kv set `
  --name <spoke-appconfig-name> `
  --key "APIM_GATEWAY_URL" `
  --label "ai-lz" `
  --value "https://<your-apim>.azure-api.net/openai" `
  --yes
az appconfig kv set `
  --name <spoke-appconfig-name> `
  --key "APIM_SUBSCRIPTION_KEY" `
  --label "ai-lz" `
  --value "<YOUR_APIM_KEY>" `
  --yes

		

Note: az cognitiveservices account connection create with a YAML file for creating an APIM connection in AI Foundry has known bugs in the current CLI version and will throw NoneType or codec errors. Create this connection via the Azure AI Foundry portal UI instead.

Validate End-to-End

			
$headers = @{
  "Content-Type" = "application/json"
  "api-key" = "<YOUR_APIM_KEY>"
}
$body = '{"messages":[{"role":"user","content":"Hello from the Citadel spoke!"}],"max_tokens":50}'
Invoke-RestMethod `
  -Uri "https://<your-apim>.azure-api.net/openai/deployments/chat/chat/completions?api-version=2024-02-01" `
  -Method POST -Headers $headers -Body $body

		

A successful response with content_filter_results, prompt_filter_results, and usage confirms the full Citadel loop: spoke → APIM gateway → Azure OpenAI → governance telemetry.

PowerShell output showing a successful end-to-end chat completion from the Citadel agent spoke through the Azure APIM Governance Hub, confirming spoke to hub routing, content filter results, and token usage tracking in Sweden Central. — End-to-end validation of the Citadel hub-and-spoke setup: a request from the agent spoke routes through the APIM Governance Hub in Sweden Central, returning a successful gpt-4o response, with Content Safety filtering and token usage tracking confirmed.

What the Microsoft Foundry Citadel Platform Deploys

After following this guide, your rg-ai-hub-gateway-dev resource group contains:

APIM gateway with content safety, PII redaction, token rate limiting, and cost attribution policies
Azure OpenAI with gpt-4o and text-embedding-3-large
Cosmos DB for usage event logging
Logic App for usage processing
Application Insights for gateway telemetry

Your rg-ai-spoke-dev resource group contains:

AI Foundry account and project
gpt-4o and text-embedding-3-large deployments
Cosmos DB with a conversations container
Key Vault, App Configuration, Storage Account, Application Insights, Log Analytics

App Configuration is fully populated with canonical keys (CHAT_DEPLOYMENT_NAME, AI_FOUNDRY_PROJECT_ENDPOINT, COSMOS_DB_ENDPOINT, and more) ready for agent applications to consume.

This Is a Dev Setup — Here Is What Changes for Non-Prod and Production

The configuration above is a starting point, not a production blueprint. Key differences when moving up the environment stack:

APIM SKU. Developer SKU has no SLA, runs as a single unit, and is not intended for production. Switch to Premium SKU for non-prod and production. This significantly increases cost and deployment time but adds an SLA, scale-out, multi-region deployment, and availability zones alongside private networking.

Network isolation. For production, set networkIsolation=true and wire the spoke VNet to your hub VNet via peering (hubIntegrationHubVnetResourceId). This requires coordinating private DNS zones across the hub and spoke. The template supports bringing existing DNS zones via the existingPrivateDnsZone* parameters.

AI Search. Re-enable deploySearchService and deployAAfAgentSvc for non-prod and production. If Sweden Central remains capacity-constrained on Standard SKU, deploy Search to a different region (East US 2 works well) using the searchServiceLocation parameter.

Bastion and Jump VM. For production with networkIsolation=true, re-enable deployBastion and deployJumpbox so operators can access resources inside the private VNet without public endpoints.

Separate parameter files per environment. Maintain spoke.parameters.dev.json, spoke.parameters.nonprod.json, and spoke.parameters.prod.json with environment-specific values. Use a deployment pipeline (GitHub Actions or Azure DevOps) to apply them consistently.

Model versions. Pin specific model versions in parameters files and validate availability in your target region before each deployment. Azure OpenAI model lifecycle moves fast; versions retire on 18-month cycles, and regional availability varies.

Preview Caveats

Citadel is in active development. Several things you should expect to change:

The parameter schemas for both the hub and spoke accelerators will evolve. Parameters discovered missing or renamed in this guide will likely be reorganized again as the repos mature. Always check the actual main.bicep parameter definitions rather than relying on example files.

The az cognitiveservices account connection create CLI command for AI Foundry connections is incomplete at the time of writing. This will improve as the Foundry CLI surface area matures.

The citadel-v1 branch in the AI Hub Gateway repo is flagged as the recommended path for new deployments. By the time you read this, it may have become the default branch with a cleaner deployment experience.

Regional capacity for AI Search Standard SKU fluctuates. Sweden Central is a high-demand region for AI workloads, so plan for capacity constraints in any SKU beyond Basic for dev scenarios.

Conclusion

Citadel gives you a credible, opinionated starting point for enterprise AI governance on Azure: APIM as the AI gateway, AI Foundry as the agent runtime, Cosmos DB for conversation state, and App Configuration as the configuration backbone. Getting it running today requires navigating several rough edges: parameter schema inconsistencies, soft-delete cascades, model version deprecations, regional capacity constraints, and Windows-specific tooling issues.

None of these are blockers. They are the expected friction of working with a platform in active preview. The underlying architecture is sound, and the pieces that do work (APIM governance policies, Content Safety integration, App Config population, and AI Foundry project wiring) deliver real value immediately.

If you are building an AI platform for your organization, a Citadel dev setup is a reasonable first step. Treat it as a learning environment to understand the architecture, validate the tooling, and build the parameter files you will need for non-prod and production. Then evolve it deliberately: add network isolation, re-enable Search and Agent Services as capacity allows, and adopt the Citadel contracts (AI Access Contract, AI Publish Contract) to formalize the hub-spoke integration as your agent portfolio grows.

The governance-velocity paradox Citadel sets out to solve is real. Now, while the platform is still in preview and the patterns are malleable, is the right time to start getting the foundation right.

Final note: This post reflects a hands-on deployment performed in June 2026. Given the pace of change in this space, verify all CLI commands, parameter schemas, and model versions against current documentation before applying them in your own environment.

Azure API Management Build 2026 AI Gateway: What’s New

Posted on June 18, 2026 by steefjan1970

The Azure API Management Build 2026 AI gateway announcements mark a significant expansion of APIM’s control plane capabilities. Microsoft shipped three headline additions: a Unified Model API that lets clients standardize on one format while APIM transforms requests to Anthropic, Google Vertex AI, and other backends; content safety policies extended to cover MCP tool calls and agent-to-agent traffic; and expanded token metrics that now track reasoning, cached, and audio tokens across providers. This post explains what each change means in practice for teams building enterprise AI workloads on Azure.

Azure API Management Build 2026 AI Gateway: Three Headline Changes

The biggest announcement is the Unified Model API, now in public preview. It lets clients standardize on a single API format, currently OpenAI Chat Completions. At the same time, APIM transparently converts requests to the backend provider’s native format, whether that is Anthropic’s Messages API, Google Vertex AI, or another provider.

For teams running multi-model architectures, this is significant. Until now, switching providers or adding a new model required client-side changes. With the Unified Model API, the routing decision moves entirely to APIM. Teams can swap backends, add providers, or route traffic based on cost or latency without touching client code.

Diagram showing a client sending requests in OpenAI Chat Completions format to Azure API Management. APIM's Unified Model API layer transforms the request to each provider's native format — Azure OpenAI natively, Anthropic Messages API, and Google Vertex AI format — while applying governance policies and unified token metrics uniformly across all backends. A caption notes that client code is unchanged when swapping providers. — The APIM Unified Model API transformation layer. Clients standardize on a single API format, while APIM handles per-provider translation transparently. All governance policies, rate limits, content safety, and token metrics apply uniformly regardless of which provider handles inference. Teams can swap backends or add providers without touching client code.
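To make the translation step concrete, here is a rough sketch of the kind of mapping involved when an OpenAI Chat Completions request is rewritten for Anthropic's Messages API. It is an illustration based on the two publicly documented request shapes, not APIM's implementation: chat_completions_to_anthropic and default_max_tokens are hypothetical, only plain-text messages are handled, and the model name is passed through unchanged where a real gateway would map it.

```python
from typing import Any


def chat_completions_to_anthropic(request: dict[str, Any],
                                  default_max_tokens: int = 1024) -> dict[str, Any]:
    """Map an OpenAI Chat Completions body to an Anthropic Messages body.

    Tools, images, and streaming options are out of scope for this sketch.
    """
    messages = request["messages"]
    # Chat Completions carries system prompts as messages; the Messages API
    # takes them as a separate top-level field.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] in ("user", "assistant")
    ]
    translated: dict[str, Any] = {
        "model": request["model"],
        # Optional for Chat Completions, required by the Messages API.
        "max_tokens": request.get("max_tokens", default_max_tokens),
        "messages": turns,
    }
    if system_parts:
        translated["system"] = "\n".join(system_parts)
    if "temperature" in request:
        translated["temperature"] = request["temperature"]
    return translated
```

The point of the Unified Model API is that this mapping lives in the gateway, so client code never needs a function like this.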

From an architecture perspective, this strengthens the case for APIM as the single AI control plane. Every governance policy, rate limit, content safety, and token metric applies consistently regardless of which provider handles inference. There is no need for a parallel governance stack per provider.

One practical implication: the three-layer auth model from Part 2 of this series applies uniformly across all providers. Managed Identity to backend is the cleanest approach, but the provider must support it. For Anthropic and Vertex AI, check the current authentication requirements before assuming token-based auth transfers directly.

Content Safety for MCP and A2A: The Gap That Needed Closing

Extending the llm-content-safety policy to MCP tool calls and agent-to-agent payloads is the most architecturally significant change. Until now, content safety only covered LLM completions traffic. MCP tool-call arguments and A2A messages were ungoverned at the gateway layer.

This matters because prompt injection attacks do not only arrive via the user-facing chat interface. A malicious payload embedded in a tool response from an external MCP server, for example, can propagate through an agentic pipeline if there is no inspection at the gateway layer. The shield-prompt attribute specifically addresses this by checking for adversarial prompt-injection patterns in MCP and A2A traffic, not just in LLM input.

Side-by-side comparison diagram. On the left, before Build 2026, Azure API Management content safety covers only LLM completions traffic. MCP tool calls, agent-to-agent traffic, and prompt injection via tool responses are shown in red as ungoverned. On the right, after Build 2026, all four traffic types are shown in teal as covered — MCP tool call arguments, A2A agent payloads, and prompt injection attacks are now scanned by the llm-content-safety policy with the shield-prompt attribute enforced. — Content safety coverage before and after Build 2026. Prior to the announcement, the llm-content-safety policy only applied to LLM completions traffic. MCP tool-call arguments, agent-to-agent payloads, and prompt injection attacks arriving via tool responses were ungoverned at the gateway layer. The Build 2026 update closes all three gaps with the same policy, extended to cover MCP and A2A traffic.

One implementation detail worth calling out: the policy behaves differently for streaming responses. In non-streaming mode, a violation returns a clean 403. In streaming mode, the policy buffers events in a sliding window and stops forwarding without returning an explicit error code. Agents consuming streaming completions need to handle an abrupt stop gracefully. If you are designing agentic pipelines that use streaming, build in a timeout and an explicit error handling path for this case.
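One way to handle the abrupt stop is to treat a stream that ends without its normal terminator as incomplete. The sketch below assumes the OpenAI-style server-sent event format, where each event is a data: line and a clean stream ends with data: [DONE]. That a policy-terminated stream ends without this marker is an assumption you should verify against your own gateway; read_stream is a hypothetical helper.

```python
import json
from typing import Iterable


def read_stream(lines: Iterable[str]) -> tuple[str, bool]:
    """Collect streamed text and report whether the stream ended cleanly.

    Returns (text, completed). `completed` is False when the stream stops
    without the `data: [DONE]` terminator.
    """
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return "".join(text), True
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
    return "".join(text), False
```

Pair this with a read timeout on the HTTP client so that a stalled connection also ends up on the incomplete path, and decide explicitly whether a partial answer is discarded or shown to the user.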

The two new attributes — window-size and window-overlap-size — let you tune how content exceeding Azure Content Safety’s 10,000 character limit is split for evaluation. For agentic pipelines with large tool responses, these will need tuning based on your typical payload sizes.
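To build intuition for how these two values interact, here is a generic overlapping-window splitter. It illustrates the general technique only. How the policy actually interprets window-size and window-overlap-size is not described in this post, so treat the character-based semantics below as an assumption; split_into_windows is a hypothetical name.

```python
def split_into_windows(text: str, window_size: int, overlap: int) -> list[str]:
    """Split text into windows of `window_size` characters.

    Consecutive windows share `overlap` characters, so content that straddles
    a boundary still appears whole in at least one window.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be at least 0 and smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows
```

A larger overlap lowers the chance that a harmful phrase is cut in half at a boundary, at the cost of more evaluation calls per payload.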

Expanded Token Metrics: Catching What Was Missing

The token metric policy from Part 4 of this series now logs reasoning tokens, cached tokens, and audio tokens to Application Insights. This is a meaningful improvement for FinOps visibility.

Reasoning models like o1 and o3 consume significant token budgets in their internal reasoning chain before producing output. Without reasoning token tracking, cross-charging dashboards systematically undercount consumption from teams using these models. The expanded metrics fix this.

Matrix diagram with token types as rows and AI providers as columns. Prompt tokens and completion tokens are tracked across all five providers: Azure OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, and Microsoft Foundry. Three new token types added at Build 2026 are highlighted in amber: reasoning tokens, tracked for Azure OpenAI, Anthropic, and Microsoft Foundry; cached tokens, tracked for Azure OpenAI, Anthropic, Google Vertex AI, and Microsoft Foundry; and audio tokens, tracked for Azure OpenAI only. Grey cells indicate token types not reported by a given provider. All data flows to Application Insights for FinOps dashboards and budget alerts. — Token metric coverage in Application Insights after Build 2026. The three amber rows — reasoning, cached, and audio tokens — are new additions. Reasoning token tracking is particularly significant for FinOps teams using o1 or o3 models, where the internal reasoning chain can consume a substantial portion of the total token budget that earlier metrics did not capture. Grey cells indicate that a provider does not expose that token type in its API response.

Cached token tracking is equally important for cost optimization. Azure OpenAI’s prompt caching reduces the cost of repeated prompt prefixes. Tracking cached vs. uncached tokens separately lets you measure the actual cache hit rate and tune your prompt structure accordingly.
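The ratios worth charting are simple to derive once these dimensions are available. The sketch below works on the usage object of an OpenAI-style response, where cached tokens are reported under prompt_tokens_details and reasoning tokens under completion_tokens_details. summarize_usage is a hypothetical helper, and providers that do not report a field are counted as zero.

```python
from typing import Any


def summarize_usage(usage: dict[str, Any]) -> dict[str, float]:
    """Derive cache and reasoning ratios from a chat completion usage block."""
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get("reasoning_tokens", 0)
    return {
        # Share of prompt tokens served from the provider's prompt cache.
        "cache_hit_rate": cached / prompt if prompt else 0.0,
        # Share of completion tokens spent on internal reasoning.
        "reasoning_share": reasoning / completion if completion else 0.0,
        "uncached_prompt_tokens": prompt - cached,
    }
```

A low cache_hit_rate on a workload with a long, stable system prompt usually means something variable sits too early in the prompt.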

The multi-provider coverage of Microsoft Foundry, OpenAI, Amazon Bedrock, and Google Vertex AI means the FinOps dashboard built in Part 4 now works across your entire model estate, not just Azure OpenAI.

API Center MCP Server: Enterprise Discovery at GA

The Azure API Center data plane MCP server reached general availability. It acts as a unified discovery endpoint: agents and developer tools can find registered MCP servers, tools, APIs, and AI assets through a single MCP connection. When a team registers a new MCP server in API Center, it becomes automatically discoverable without requiring individual client reconfigurations.

This is the enterprise catalogue layer that makes the MCP gateway story from Part 7 operationally sustainable at scale. Without it, discovery is a manual configuration problem. With it, the control plane extends automatically as new capabilities are registered.

Where This Leaves the Control Plane

Looking at the Build announcements together, the pattern is consistent with what the series argued: APIM is becoming the governance layer for all AI traffic, not just LLM completions. The Unified Model API extends it across providers. Content safety for MCP and A2A extends it across protocols. The API Center MCP server extends discovery to the enterprise catalogue layer.

The competitive context is worth noting. AWS Bedrock Guardrails handles content filtering but has no equivalent to the Unified Model API or MCP/A2A coverage. Google Apigee has added AI gateway features, but not at this protocol breadth. Cloudflare’s AI Gateway focuses on spend limits and caching. APIM’s position, that the API gateway is the natural control plane for AI workloads, is increasingly defensible.

For teams that have followed the series and implemented the seven patterns, the Build announcements are additive rather than disruptive. The policy pipeline you built still works. The new capabilities slot in: swap your backend URL configuration to use the Unified Model API, add the llm-content-safety policy to your MCP server inbound pipeline, and update your Application Insights queries to include reasoning and cached token dimensions.

Lastly, the Microsoft AI Gateway labs, with 30+ Jupyter notebooks and deployable Bicep templates, are worth bookmarking if you are implementing any of these patterns.

Azure API Management Semantic Caching: Cut AI Token Costs with Similarity-Based Responses

Posted on June 3, 2026 by steefjan1970

Part 6 of 7 in the “APIM for AI Workloads” series

Azure API Management semantic caching is the most operationally transparent cost optimization in this series. Every technique covered so far (auth, token limits, token metrics, and load balancing) requires deliberate design decisions in how you configure APIM. Semantic caching, by contrast, works silently. Calling applications send prompts as normal. APIM checks whether a semantically similar prompt has already been answered. If a match exists within a configurable similarity threshold, APIM returns the cached response without touching the AI backend. Zero tokens consumed. Zero latency added by the model.

For workloads with repetitive prompt patterns, such as internal FAQ bots, document classifiers, and support agents that see the same questions repeatedly, the cache hit rate can be surprisingly high. Even a 20% hit rate on a high-volume workload translates directly into cost reduction and lower average latency.

How Azure API Management Semantic Caching Works

The azure-openai-semantic-cache-lookup policy sits in the inbound section of your APIM pipeline, before the request reaches the AI backend. When a prompt arrives, APIM sends it to a configured embedding model, typically Azure OpenAI text-embedding-ada-002 or equivalent, to generate a vector representation of the prompt. APIM then compares that vector against cached embeddings stored in Azure Managed Redis using cosine similarity.

APIM scores the incoming prompt against each cached prompt, and a lower score means a closer match. If the score falls at or below the configured score-threshold, APIM treats it as a cache hit and returns the stored response. If no match meets the threshold, APIM forwards the request to the AI backend as normal and stores the response in Redis for future lookups.
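The lookup logic can be sketched in a few lines. This is an illustration, not APIM's implementation: it assumes the score is a cosine distance (1 minus cosine similarity), which matches the policy documentation's statement that lower values require higher semantic similarity, and it uses an in-memory list where APIM uses Redis. cosine_distance and lookup are hypothetical names.

```python
import math
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def lookup(prompt_vector: Sequence[float],
           cache: list[tuple[Sequence[float], str]],
           score_threshold: float) -> Optional[str]:
    """Return the closest cached response within the threshold, else None."""
    best_response = None
    best_distance = score_threshold
    for vector, response in cache:
        distance = cosine_distance(prompt_vector, vector)
        if distance <= best_distance:
            best_response, best_distance = response, distance
    return best_response
```

With this reading, a threshold of 0.05 only accepts near-identical embeddings, and raising the number widens the set of prompts that count as a hit.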

Azure API Management semantic caching policy flow showing cache hit returning stored response and cache miss forwarding to Azure OpenAI — Diagram 1: Semantic cache request flow. On a cache hit, APIM returns a stored response directly — consuming zero tokens. On a miss, APIM forwards to the AI backend and stores the response in Azure Managed Redis for future hits.

The generic variant, llm-semantic-cache-lookup, works identically for non-Azure backends. Both require the same supporting infrastructure: an embedding model backend and an Azure Managed Redis instance configured in APIM. The semantic cache store policy handles writing responses back to the cache in the outbound section.

Tuning the Score Threshold for Azure API Management Semantic Caching

The score-threshold attribute is the most consequential configuration decision in the semantic caching policy. It controls how similar an incoming prompt must be to a cached prompt for APIM to treat it as a hit. The value runs from 0.0 to 1.0, but the practical range is much narrower.

Azure API Management semantic caching score threshold tuning guide from aggressive to conservative with vary-by subscription user and global scope strategies. Diagram 2: Score threshold tuning guide and vary-by scope strategies. Lower thresholds require closer matches, so higher thresholds cache more aggressively. A starting value of 0.05 suits most production workloads. A global cache (no vary-by) maximizes hit rate but risks serving the wrong user’s response.

In practice, three zones matter. Keep in mind that the policy documentation defines lower values as requiring higher semantic similarity, so a smaller number is the stricter setting:

0.01 to 0.05 (conservative). At this range, only prompts that are very close in wording produce hits. Creative workloads, code generation, and document drafting tend to have high prompt variance, so a conservative threshold avoids serving cached responses to genuinely different requests. The commonly used starting value of 0.05 sits at the top of this range.

0.05 to 0.20 (aggressive). At this range, prompts that are paraphrases of each other (“What is my account balance?” and “Can you show me my current balance?”) are more likely to produce cache hits. This suits FAQ bots, support agents, and any workload where users ask the same questions in slightly different words.

Above 0.30 (too loose). At this threshold, loosely related prompts start to match, and the cache returns answers to questions that were not asked. Avoid this range.

Start at 0.05 and monitor cache hit rates in Application Insights. If the hit rate is low for a workload you expect to be repetitive, raise the threshold incrementally. If you start seeing complaints about incorrect or stale responses, lower it.

vary-by Scope: Preventing Cache Pollution

The vary-by element scopes the cache namespace. Without it, all consumers share a single global cache. That maximizes the hit rate but introduces a significant risk: APIM could serve one user’s cached response to a different user. For most enterprise AI workloads, that is unacceptable.

The safest default is to vary by Subscription ID, which gives each API subscriber their own cache namespace. This prevents cross-team cache pollution while still achieving high hit rates within each subscriber’s own prompt patterns. For multi-tenant applications where individual users have distinct contexts, vary by a user identifier extracted from the JWT or a custom header instead.

A global cache with no vary-by is appropriate only for fully public, stateless APIs where responses are identical regardless of who requests them. Internal enterprise AI workloads rarely meet that bar.
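As a rough mental model, vary-by behaves like a prefix on the cache key. The Python sketch below is a toy in-memory model, not how APIM implements it: it matches prompts exactly instead of by embedding similarity, and the class, field, and team names are all hypothetical.

```python
from typing import Callable, Optional


class ScopedCache:
    """Toy cache whose namespace is chosen by an optional vary-by function."""

    def __init__(self, vary_by: Optional[Callable[[dict], str]] = None):
        self.vary_by = vary_by
        self.store: dict = {}

    def _key(self, request: dict) -> tuple:
        # Without vary-by, every caller lands in the same global namespace.
        scope = self.vary_by(request) if self.vary_by else "global"
        return (scope, request["prompt"])

    def lookup(self, request: dict) -> Optional[str]:
        return self.store.get(self._key(request))

    def save(self, request: dict, response: str) -> None:
        self.store[self._key(request)] = response


team_a = {"subscription_id": "team-a", "prompt": "What is my account balance?"}
team_b = {"subscription_id": "team-b", "prompt": "What is my account balance?"}

# Global cache: team-b receives the answer that was cached for team-a.
global_cache = ScopedCache()
global_cache.save(team_a, "Balance for team-a: 1,200")
leaked = global_cache.lookup(team_b)

# Scoped cache: team-b has its own namespace, so the lookup misses.
scoped_cache = ScopedCache(vary_by=lambda r: r["subscription_id"])
scoped_cache.save(team_a, "Balance for team-a: 1,200")
isolated = scoped_cache.lookup(team_b)
```

The only difference between the two caches is the scope component of the key, which is why a missing vary-by is so easy to overlook in review.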

Infrastructure Requirements for Azure API Management Semantic Caching

Semantic caching requires two supporting Azure resources beyond APIM itself. First, an Azure Managed Redis instance configured as an external cache in APIM. Redis stores the prompt embeddings and cached responses. The cache TTL is configurable in the store policy, so you control how long responses remain valid before APIM re-queries the backend.

Second, an embeddings model backend registered in APIM. For Azure OpenAI, this is typically a separate deployment of text-embedding-ada-002 or text-embedding-3-small. The embeddings backend is referenced by the embeddings-backend-id attribute. It is separate from your completions backend, so you can apply independent token limits and load balancing to the embeddings traffic.

One practical consideration: the embeddings call itself consumes tokens and adds a small amount of latency on every request, whether or not the cache hits. For workloads with very low prompt repetition, the overhead of generating embeddings for every request may outweigh the savings from occasional cache hits. Measure the hit rate before committing the infrastructure cost.
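To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The per-request costs are hypothetical placeholders, not Azure prices, and the model ignores the Redis bill and the added latency, so treat it as a first filter rather than a business case.

```python
def net_saving_per_request(hit_rate: float, completion_cost: float, embedding_cost: float) -> float:
    """Expected saving per request: completions avoided on hits, minus the
    embedding call that is paid on every request."""
    return hit_rate * completion_cost - embedding_cost


def break_even_hit_rate(completion_cost: float, embedding_cost: float) -> float:
    """Hit rate at which the cache pays for its own embedding calls."""
    return embedding_cost / completion_cost


# Hypothetical average costs per request, in USD.
COMPLETION_COST = 0.004
EMBEDDING_COST = 0.0002

threshold = break_even_hit_rate(COMPLETION_COST, EMBEDDING_COST)  # about 0.05 here
low_repetition = net_saving_per_request(0.02, COMPLETION_COST, EMBEDDING_COST)
high_repetition = net_saving_per_request(0.30, COMPLETION_COST, EMBEDDING_COST)
```

With these placeholder numbers, a workload that hits the cache on 2 percent of requests loses money on embeddings, while one that hits on 30 percent comes out ahead.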

What’s Next in This Azure API Management for AI Series

Part 7 closes the series by covering APIM’s emerging role as an MCP gateway for agentic AI workloads: how to expose REST APIs as MCP servers, pass through existing MCP servers, and manage agent-to-agent traffic through the same control plane we’ve built across this series.

Part 7: APIM as an MCP gateway for agentic AI workloads.

Azure API Management Token Metric Policy: AI Cost Observability and Cross-Charging

Posted on May 20, 2026 by steefjan1970

Part 4 of 7 in the “APIM for AI Workloads” series

The Azure API Management token metric policy turns AI cost data from a finance problem into an engineering one. In Part 3, we covered enforcement: how to set consumption boundaries per consumer. This post covers the complementary piece: how to measure that consumption. More importantly, it shows how to make it visible to the right people and use it to drive internal cross-charging and FinOps dashboards.

At my current company, one of the first questions the architecture board asked was straightforward: which teams are consuming what, and what does it cost? Without instrumentation at the gateway layer, that question is genuinely unanswerable. The token metric policy is how you answer it.

Azure API Management Token Metric Policy: How It Works

The policy sits in the outbound section of your APIM pipeline. After the AI backend returns a response, APIM reads the token usage fields from the response body. These include prompt tokens, completion tokens, and total tokens. APIM then emits them as custom metrics to Application Insights under a namespace you define.

Crucially, the policy emits metrics after the response arrives. It uses actual token counts from the API response rather than estimates. As a result, the data is accurate rather than approximated. It also means the metric emission adds no latency to the request path: the response is returned to the caller immediately, and the metric is emitted asynchronously.
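The sketch below mimics that step in Python so the data flow is easy to see. It is not the policy's implementation: the usage field names follow the Azure OpenAI chat completions response, while the metric names and the dimension values are assumptions made for illustration.

```python
import json


def token_metrics(response_body: str, dimensions: dict) -> list:
    """Read actual token counts from the backend response and shape them
    as metric events tagged with the given dimensions."""
    usage = json.loads(response_body).get("usage", {})
    fields = (
        ("Prompt Tokens", "prompt_tokens"),
        ("Completion Tokens", "completion_tokens"),
        ("Total Tokens", "total_tokens"),
    )
    return [
        {"name": name, "value": usage.get(field, 0), "dimensions": dimensions}
        for name, field in fields
    ]


# Hand-written sample response body; only the usage block matters here.
sample_body = json.dumps({
    "choices": [],
    "usage": {"prompt_tokens": 120, "completion_tokens": 380, "total_tokens": 500},
})

metrics = token_metrics(sample_body, {"subscription_id": "team-a"})
```

Because the counts come from the response rather than from a tokenizer estimate, the metric can only be produced once the backend has answered.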

Azure API Management token metric policy observability pipeline emitting token counts to Application Insights for cross-charging — Diagram 1: Token metric policy observability pipeline. Token counts from the AI backend response flow through the APIM metrics layer to Application Insights, broken down by dimensions for cross-charging and cost allocation.

The generic variant, llm-emit-token-metric, works identically for non-Azure backends. It shares the same dimension model as the Azure OpenAI variant, azure-openai-emit-token-metric, so the configuration patterns below apply regardless of which AI provider sits behind APIM.

Choosing Dimensions for Azure API Management Token Metric Policy

Dimensions are the labels attached to each metric event. They determine how you can slice and aggregate token consumption data in Application Insights. Choosing the right dimensions is the most important configuration decision for making the data useful for cross-charging.

Azure API Management token metric policy dimension strategies for cross-charging using Subscription ID User ID and API ID — Diagram 2: Three dimension strategies for cross-charging and showback. Subscription ID maps to teams and cost centers, User ID enables per-user billing in multi-tenant apps, and API ID breaks down cost by AI workload or feature.

The three primary dimension options are:

Subscription ID. The most common choice for internal enterprise deployments. Each APIM subscription maps to a team, product, or cost center, so filtering Application Insights metrics by Subscription ID gives you direct per-team token consumption. This pairs naturally with the subscription key authentication pattern from Part 2 and the per-subscription counter-key from Part 3.

User ID. Sourced from the JWT subject claim or a custom header, User ID enables per-user consumption reporting. This is the right dimension for multi-tenant SaaS applications where individual end users have their own token budgets, or where you need to identify heavy consumers within a shared subscription.

API ID. Identifies which APIM API generated the consumption. Useful when a single subscription uses multiple AI-backed APIs: one for a conversational agent, one for content generation, and one for document summarization. API ID lets you break down cost by use case rather than just by subscriber.

In practice, combining all three dimensions gives you the most flexibility. A single metric event tagged with Subscription ID, User ID, and API ID can answer questions at every level: how much did the platform spend in total, how much did Team A spend, how much did User X consume, and which AI feature is the most expensive to run.

Querying Token Metrics in Application Insights

Once the policy is emitting metrics, you query them in Application Insights using the custom metrics namespace you configured. The metrics appear under the namespace name you set in the policy (for example, “AzureOpenAI” or “MyLLM”), with separate metric events for prompt tokens and completion tokens.

A practical starting point is a KQL query that aggregates the total number of tokens by Subscription ID over the past 30 days. From there, you can add filters by API ID to isolate specific workloads, or pivot by User ID to identify the highest consumers within a team.
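The shape of that aggregation is easy to prototype before writing any KQL. The Python sketch below works on a handful of hand-written rows; the row layout and dimension names are assumptions for illustration, not the Application Insights schema.

```python
from collections import defaultdict

# Hypothetical metric rows, one per request, as they might look after export.
rows = [
    {"subscription_id": "team-a", "user_id": "u1", "api_id": "chat", "total_tokens": 500},
    {"subscription_id": "team-a", "user_id": "u2", "api_id": "summarize", "total_tokens": 50_000},
    {"subscription_id": "team-b", "user_id": "u3", "api_id": "chat", "total_tokens": 1_200},
    {"subscription_id": "team-a", "user_id": "u1", "api_id": "chat", "total_tokens": 700},
]


def total_by(rows: list, dimension: str, **filters: str) -> dict:
    """Sum total_tokens per value of `dimension`, keeping only rows that
    match every filter (for example api_id="chat")."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[key] == value for key, value in filters.items()):
            totals[row[dimension]] += row["total_tokens"]
    return dict(totals)


by_subscription = total_by(rows, "subscription_id")
chat_by_subscription = total_by(rows, "subscription_id", api_id="chat")
users_in_team_a = total_by(rows, "user_id", subscription_id="team-a")
```

The same three calls map onto the three questions in the text: total per team, one workload in isolation, and the heaviest consumers inside a team.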

For FinOps dashboards, the most useful view is a stacked time-series chart of total token consumption broken down by subscription, updated daily. This gives finance and engineering a shared view of AI spend trends without exporting data from Azure Monitor to a separate BI tool. Azure Workbooks can host this directly in the Azure portal, making it accessible to non-technical stakeholders.

From Observability to Cross-Charging

Observability is the prerequisite for cross-charging. However, they are not the same thing. Observability tells you what happened. Cross-charging, by contrast, is the organizational process of allocating those costs to the right budget owners.

The token metric policy gives you the raw data. To turn that into a cross-charge, you need two additional steps. First, agree on a price per token with your finance team, usually derived from the Azure cost per 1,000 tokens for your model and region. Second, automate a monthly report that multiplies each subscription’s token consumption by that agreed price.

This does not need to be complex. For example, a Logic App or Azure Function that queries Application Insights on the first of each month works well for most organizations starting out. It aggregates tokens by subscription, multiplies by the agreed rate, and emails a cost summary to each team lead. The Application Insights REST API makes this straightforward to automate.
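The calculation at the heart of that report is small. Here is a minimal Python sketch, assuming the monthly totals per subscription have already been fetched; the token volumes and the rate are made up, and a real report would likely price prompt and completion tokens separately.

```python
from decimal import Decimal


def cross_charge(tokens_by_subscription: dict, price_per_1k_tokens: Decimal) -> dict:
    """Multiply each subscription's monthly token total by the agreed rate,
    rounded to cents."""
    return {
        subscription: (Decimal(tokens) / 1000 * price_per_1k_tokens).quantize(Decimal("0.01"))
        for subscription, tokens in tokens_by_subscription.items()
    }


# Hypothetical inputs: monthly token totals and an agreed USD rate per 1,000 tokens.
monthly_tokens = {"team-a": 12_500_000, "team-b": 3_200_000}
agreed_rate = Decimal("0.002")

charges = cross_charge(monthly_tokens, agreed_rate)
```

Using Decimal rather than float keeps the cents exact, which matters once the numbers land in a finance system.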

Finally, the most important advice: have this conversation with finance and product teams before AI consumption scales. Retroactive cross-charging is significantly harder to establish than an upfront model with clear methodology and tooling.

What’s Next in This Azure API Management for AI Series

Part 5 covers load balancing and circuit breaking: how to distribute traffic across PTU and PAYG backends, configure backend pools, and set up circuit breaker rules for automatic failover when a primary endpoint becomes unavailable.

Part 5: Load balancing and circuit breaking across PTU and PAYG backends.
Part 6: Semantic caching — reducing token consumption with similarity-based response reuse.
Part 7: APIM as an MCP gateway for agentic AI workloads.

Azure API Management for AI: Securing Your AI APIs with Authentication and Authorization

Posted on May 5, 2026 by steefjan1970

Part 2 of 7 in the “APIM for AI Workloads” series

In Part 1 of this series, I made the case for why Azure API Management for AI workloads is the right control plane for governing AI traffic across an organization. This post gets practical: how do you actually secure access to your AI backends with APIM without creating a credential-management nightmare?

Security is where many AI projects cut corners, and understandably so. When you’re moving fast to prove value with a new model, authentication feels like overhead. But AI endpoints are expensive, and an unsecured Azure OpenAI endpoint is a real risk: anyone with the URL and key can start consuming tokens at your cost. At scale, that’s a significant financial and compliance exposure.

APIM addresses this with a three-layer security model. Let’s walk through each layer.

Azure API Management for AI Security: A Three-Layer Model

The authentication and authorization pattern in APIM is deliberately layered. Each layer answers a different question and operates independently, so a failure at any layer stops the request before it reaches the AI backend.

Azure API Management for AI three-layer authentication flow showing subscription key, JWT validation and Managed Identity policy pipeline — Diagram 1: Three-layer auth in APIM for AI workloads. Layer 1 identifies the caller via subscription key. JWT validation in Layer 2 then determines what they’re permitted to do. Finally, Layer 3 authenticates APIM itself to the AI backend via Managed Identity.

The three layers are:

Subscription keys to identify and track API consumers.
JWT validation to enforce fine-grained access control based on claims.
Managed Identity to authenticate APIM to Azure OpenAI without storing credentials.

Each layer has a distinct role. Confusing them is a common mistake, so it’s worth being explicit about what each one does and does not do.

Layer 1: Subscription Keys

Subscription keys are APIM’s mechanism for identifying API consumers. When you create an API product in APIM and require a subscription, callers must include their key in the Ocp-Apim-Subscription-Key header. APIM validates the key, maps it to a subscriber, and lets the request proceed.

This is important for AI workloads specifically because subscription keys enable per-consumer token tracking. When you combine subscription key validation with the Token Metric policy we’ll cover in Part 4, you get usage data broken down by subscriber, which is the foundation of any internal cross-charging model.

Subscription keys answer the question: Who is calling? They don’t answer what the caller is allowed to do. For that, you need JWT validation.

Layer 2: JWT Validation and Claims-Based Authorization

The validate-jwt policy is where you enforce what a caller is permitted to do. It validates the JWT token in the Authorization header against your identity provider, and can inspect any claim in the token to make authorization decisions.

For Azure OpenAI specifically, this is where you control which teams or applications can access which model deployments. A team working on an internal chatbot should not be able to call a GPT-4o deployment reserved for a production workload. JWT claims let you enforce that boundary at the gateway layer, with no changes required in the calling application.

A typical policy checks the token signature against your Azure AD tenant’s OpenID Connect configuration, then validates that a required scope or role claim is present.
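For orientation, here is a sketch of what such a validate-jwt fragment can look like, wrapped in a few lines of Python that only check it is well-formed XML. It is not the exact policy from my environment: the tenant ID, audience, and role name are placeholders, and you should verify the element and attribute names against the current policy reference before using it.

```python
import xml.etree.ElementTree as ET

TENANT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder tenant
AUDIENCE = "api://my-ai-gateway"                     # hypothetical app ID URI
REQUIRED_ROLE = "AI.Chat.Invoke"                     # hypothetical app role

VALIDATE_JWT_POLICY = f"""
<validate-jwt header-name="Authorization"
              failed-validation-httpcode="401"
              failed-validation-error-message="Unauthorized: missing or invalid token."
              require-scheme="Bearer">
    <openid-config url="https://login.microsoftonline.com/{TENANT_ID}/v2.0/.well-known/openid-configuration" />
    <audiences>
        <audience>{AUDIENCE}</audience>
    </audiences>
    <required-claims>
        <claim name="roles" match="any">
            <value>{REQUIRED_ROLE}</value>
        </claim>
    </required-claims>
</validate-jwt>
"""

# Parsing catches unbalanced tags or bad quoting before the fragment reaches APIM.
policy = ET.fromstring(VALIDATE_JWT_POLICY.strip())
required_claim = policy.find("required-claims/claim")
```

Generating the fragment from variables like this is one way to keep tenant and role values out of hand-edited XML, but pasting the rendered XML straight into the inbound section works just as well.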

The failed-validation-httpcode="401" attribute ensures unauthenticated callers get a clean rejection before they ever reach the backend. You can also use failed-validation-error-message to return a specific error message, which helps consumers debug auth failures without exposing internal details.

For multi-provider setups where you’re routing to non-Azure backends like Mistral or Cohere, the same JWT policy applies. The claims model is provider-agnostic, which is one of the advantages of centralizing auth in APIM rather than handling it per-backend.

Layer 3: Managed Identity for Backend Authentication

Managed Identity is the most important security improvement you can make when setting up Azure API Management for AI. It replaces the pattern of storing an Azure OpenAI API key in APIM’s named values with a system-assigned or user-assigned Managed Identity that APIM uses to authenticate directly to Azure OpenAI via Azure AD.

Azure API Management for AI comparing API key authentication risks versus Managed Identity benefits for Azure OpenAI backend access — Diagram 2: API key authentication (left) vs. Managed Identity (right). The key difference is that Managed Identity requires no stored credentials anywhere in your configuration.

The practical difference is significant. With API key authentication, you have a long-lived secret that needs to be stored, rotated, and kept out of source control. With Managed Identity, there is no secret. APIM requests a short-lived token from Azure AD at runtime, and Azure AD issues it based on the APIM instance’s identity. Nothing is stored. Nothing can leak.

The configuration is a single policy element in the inbound section: <authentication-managed-identity resource="https://cognitiveservices.azure.com"/>. APIM handles the rest, automatically fetching and refreshing the token.

On the Azure OpenAI side, you grant the APIM instance’s Managed Identity the Cognitive Services User role on the Azure OpenAI resource. That’s the minimum required permission. You can scope it further to specific deployments if needed.

For organizations in regulated industries, such as healthcare, financial services, and government, Managed Identity is not optional. It satisfies Zero Trust authentication requirements and produces a full audit trail in Azure Monitor, tied to the APIM instance identity rather than a shared key.

Azure API Management for AI: Putting the Three Layers Together

In a production setup, all three layers run sequentially within the inbound policy pipeline. A request arrives with a subscription key and a JWT. APIM validates the key first (fast, no external call), then validates the JWT against Azure AD, then forwards the request to Azure OpenAI using its Managed Identity token. The AI backend never sees the caller’s JWT, and APIM never stores an API key.
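The ordering is easier to reason about as code. The Python sketch below is a toy model of the inbound pipeline, not APIM itself: the subscription table, role name, and helper functions are hypothetical, and JWT validation and token acquisition are stubbed out.

```python
class Rejected(Exception):
    """Raised when a layer stops the request before it reaches the backend."""

    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status


SUBSCRIPTIONS = {"key-team-a": "team-a"}  # hypothetical key -> subscriber
REQUIRED_ROLE = "AI.Chat.Invoke"          # hypothetical app role


def handle(request: dict, validate_jwt, get_backend_token) -> dict:
    headers = request["headers"]

    # Layer 1: who is calling? A local lookup, no external call.
    subscriber = SUBSCRIPTIONS.get(headers.get("Ocp-Apim-Subscription-Key"))
    if subscriber is None:
        raise Rejected(401, "unknown subscription key")

    # Layer 2: what may they do? Validate the JWT and check its claims.
    claims = validate_jwt(headers.get("Authorization", ""))
    if claims is None or REQUIRED_ROLE not in claims.get("roles", []):
        raise Rejected(401, "token missing, invalid, or lacking the required role")

    # Layer 3: the gateway authenticates as itself. The caller's key and JWT
    # are not forwarded to the AI backend.
    return {
        "subscriber": subscriber,
        "headers": {"Authorization": f"Bearer {get_backend_token()}"},
        "body": request["body"],
    }


def fake_validate_jwt(authorization_header: str):
    # Stub: accepts exactly one token and returns its claims.
    if authorization_header == "Bearer caller-jwt":
        return {"roles": [REQUIRED_ROLE]}
    return None


forwarded = handle(
    {
        "headers": {
            "Ocp-Apim-Subscription-Key": "key-team-a",
            "Authorization": "Bearer caller-jwt",
        },
        "body": "{}",
    },
    fake_validate_jwt,
    lambda: "managed-identity-token",
)
```

Note that the forwarded request carries only the gateway's own token, which is the property that lets the backend trust a single identity.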

The result is a clean separation of concerns:

The calling application manages its own JWT (issued by Azure AD based on its own identity or the user’s identity).
APIM enforces the authorization policy without the backend needing to know anything about it.
The AI backend trusts only APIM’s Managed Identity, not arbitrary callers.

This is the architecture you want before you go to production with any AI workload that touches sensitive data or incurs meaningful cost.

What’s Next in This Series

Part 3 covers the Token Limit policy: how to enforce tokens-per-minute limits per consumer, configure throttling behavior, and handle the differences between the azure-openai-token-limit and llm-token-limit policy variants.

Part 3: Token Limit policy — enforcing tokens-per-minute limits per consumer.
Part 4: Token Metric policy — emitting usage data for observability and cross-charging.
Part 5: Load balancing and circuit breaking across PTU and PAYG backends.
Part 6: Semantic caching — reducing token consumption with similarity-based response reuse.
Part 7: APIM as an MCP gateway for agentic AI workloads.

Azure API Management for AI: Why Your APIs Need a Gateway

Posted on May 2, 2026 by steefjan1970

Part 1 of 7 in the “APIM for AI Workloads” series

Over the past year, I’ve been doing a lot of work with integration services, including Azure API Management, and more recently on AI adoption: evaluating models, designing agentic architectures, and figuring out how to govern AI consumption across the organization responsibly. One thing that keeps coming up in those conversations is a question that sounds almost too basic to ask: Who is keeping track of what we’re spending on tokens?

The answer, more often than not, is nobody.

That’s the problem this series is about. AI APIs are fundamentally different from the REST APIs we’ve been managing for the past decade, and the differences matter operationally. Before we dive into the mechanics of Azure API Management policies, load balancing, and semantic caching in subsequent posts, I want to make the case for a gateway layer in front of your AI services.

Tokens Are Not Requests

Traditional API management was built around a relatively simple model: count the requests, enforce rate limits, log the traffic, and call it done. One call in, one response out. The cost model was predictable.

AI APIs broke that model completely.

When you call an Azure OpenAI endpoint, you’re not paying per request. You’re paying per token. And a token count is invisible at the API gateway layer unless you specifically instrument for it. A single call from a conversational agent might consume 500 tokens. A call from a poorly-optimized batch process might consume 50,000. Both look the same at the HTTP level: one POST, one 200 OK.

This creates a blind spot that grows dangerously as AI adoption scales across an organization. Teams start building intelligent apps (conversational agents, personalized content generators, voice assistants, copilots), and each one independently calls AI backend services. Nobody has a view across the whole estate of what’s being consumed, by whom, and at what cost.

The diagram below shows what this looks like in practice: multiple application types hitting multiple AI providers, with token-based pricing models sitting underneath.

Azure API Management control plane between intelligent apps and AI providers showing PTU and PAYG token billing — Diagram 1: Intelligent applications on the left, AI service providers on the right, with both PTU and PAYG billing models underneath. Without a control plane in the middle, you’re flying blind.

The Three Problems Azure API Management for AI Solves

Azure API Management acts as the centralized control plane between your intelligent applications and your AI backends. It addresses three distinct categories of problems.

Performance optimization: AI model endpoints have throughput limits. Azure OpenAI Provisioned Throughput Units (PTU) give you reserved capacity at a fixed price, but cap out at a hard ceiling. Pay-as-you-go (PAYG) endpoints scale elastically but incur higher per-token costs at high volumes. Without a gateway layer, individual applications can’t know whether PTU capacity is available or saturated. A gateway can make that routing decision automatically, serving from PTU when it has headroom, falling back to PAYG when it doesn’t. That’s a meaningful cost optimization with no changes required to the calling applications.

Cost control: Tokens consumed by one team are costs borne by another team’s budget if you’re centralizing AI spend, which most organizations will do, at least initially. Without per-consumer visibility into token usage, internal cross-charging and showback are impossible. APIM’s token metric policies make this tractable by emitting token consumption data broken down by dimensions such as User ID, Subscription ID, or API product, all of which feed into Application Insights for dashboarding and alerting.

Data security: Routing AI traffic through a managed gateway gives you a single enforcement point for authentication, authorization, and policy. You can validate JWT claims, require subscription keys from API consumers, use Managed Identity to authenticate to Azure OpenAI without exposing credentials, and ensure traffic never leaves your controlled perimeter. Without a gateway, every team builds its own auth story, or more commonly, skips it.

PTU vs. PAYG: Why the Billing Model Shapes Your Architecture

Before we go further, it’s worth spending a moment on the two Azure OpenAI billing models, because they have direct architectural implications.

Provisioned Throughput Units (PTU) give you reserved capacity on a model. You pay a fixed hourly rate regardless of how many tokens you actually consume. The benefits are predictable costs and guaranteed throughput. The risk is waste if your utilization is low, and hard throttling if you exceed the provisioned limit.

Pay-as-you-go (PAYG) charges per token consumed. No upfront commitment, no capacity ceiling, but costs scale linearly with usage and can surprise you if consumption spikes.

Most production AI deployments end up using both: PTU for baseline load, where utilization is predictable, and PAYG as an overflow layer. This makes a load balancer with circuit breaking essential, which we’ll cover in Part 5 of this series.
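The routing idea fits in a few lines. The Python sketch below is a simplified model under my own assumptions: it tracks capacity with a local counter, whereas a real gateway typically learns that PTU is saturated from throttling responses and circuit breaker state. Backend names and limits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    priority: int                      # lower number is tried first
    tokens_per_minute: Optional[int]   # None models "no fixed ceiling"
    used_this_minute: int = 0

    def has_headroom(self, tokens: int) -> bool:
        if self.tokens_per_minute is None:
            return True
        return self.used_this_minute + tokens <= self.tokens_per_minute


def route(backends: list, estimated_tokens: int) -> str:
    """Pick the highest-priority backend that can still absorb the request."""
    for backend in sorted(backends, key=lambda b: b.priority):
        if backend.has_headroom(estimated_tokens):
            backend.used_this_minute += estimated_tokens
            return backend.name
    raise RuntimeError("no backend has capacity")


ptu = Backend("ptu", priority=1, tokens_per_minute=10_000)
payg = Backend("payg", priority=2, tokens_per_minute=None)

# Three requests within the same minute: 6,000, 6,000, and 3,000 tokens.
decisions = [route([ptu, payg], tokens) for tokens in (6_000, 6_000, 3_000)]
```

The second request overflows to PAYG because it would push PTU past its ceiling, and the third fits back into the remaining PTU headroom.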

The same logic applies beyond Azure OpenAI. APIM now supports generic LLM backends via the llm-* policy family, which means you can manage traffic to Mistral, Cohere, LLaMA, and other providers through the same control plane. The diagram below shows this architecture: APIM in the center, with load balancing across PTU and PAYG instances, token metrics flowing to Application Insights, and the full provider landscape behind it.

Azure API Management AI control plane with token limit, token metric, load balancing, semantic caching and circuit breaker policies across PTU and PAYG backends — Azure API Management as the centralized AI control plane, with performance, cost, and security governance across multiple providers and billing models.

What This Looks Like in Practice

Let me make this concrete with a scenario I’ve seen play out multiple times.

An organization deploys its first Azure OpenAI service for a conversational agent. A few months later, a second team wants to use AI for content generation. Then a third team builds an internal copilot. Each team provisions its own Azure OpenAI resource, authenticates directly, and manages its own rate limiting. There’s no visibility into combined spend. No shared capacity optimization. No centralized audit trail.

This is the point where someone in finance asks a question that nobody can answer: “How much are we spending on AI, and which team is spending what?”

Centralizing AI traffic through APIM is how you get out of that situation before it becomes a problem. The policy-based approach means you can add governance without changing anything in the calling applications. They call the APIM endpoint, and APIM handles the rest.

Azure API Management for AI Workloads: What’s Coming in This Series

The next six posts will go deep on the specific capabilities that make APIM a serious AI control plane:

Part 2 covers authentication and authorization: JWT validation, Managed Identity, and subscription keys.
Part 3 covers the Token Limit policy: enforcing tokens-per-minute limits per consumer.
Part 4 covers the Token Metric policy: emitting usage data for observability and cross-charging.
Part 5 covers load balancing and circuit breaking across PTU and PAYG backends.
Part 6 covers semantic caching: reducing token consumption by serving cached responses for similar prompts.
Part 7 covers APIM’s emerging role as an MCP gateway for agentic AI workloads.

Each post will include the relevant policy XML, real-world sizing guidance, and the architectural decisions behind the patterns.

If you’re building AI-powered applications at scale and you’re not yet routing that traffic through a gateway, the rest of this series is for you.

Build an AI Tech News Aggregator: Azure Functions & Claude

Posted on March 25, 2026 by steefjan1970

There’s a lot of noise on the internet. Between Reddit, Hacker News, and tech blogs, keeping up with what actually matters in enterprise software is a full-time job. So I built a fully automated system that does it for me, runs in the cloud, is powered by AI, and was deployed end-to-end in less than two hours using Claude Code.

Here’s how.

What We Built (Mostly What Claude Did)

A C# Azure Function that runs every hour and:

Fetches posts from configurable Reddit subreddits and Hacker News
Filters for recency: only posts from the last 7 days
Deduplicates across runs: never evaluates the same URL twice
Applies an AI editorial filter: Claude decides what’s genuinely newsworthy
Writes curated results to Azure Blob Storage as timestamped JSON

The output is clean, structured JSON ready to feed into a newsletter, dashboard, or notification system.

The Architecture

The system has three layers: data collection, AI filtering, and persistence.

Reddit RSS feeds ──┐
                   ├─► Aggregator Function ─► Claude AI Filter ─► Blob Storage
HN Firebase API ───┘          │
                              └─► State Store (seen URLs)

Tech Stack

Concern	Choice
Runtime	Azure Functions v4, .NET 8 isolated worker
Reddit data	Public Atom/RSS feed (r/{sub}/top.rss)
HN data	Firebase REST API
AI filtering	Anthropic Claude (claude-opus-4-6) via raw HttpClient
Storage	Azure Blob Storage
Schedule	NCRONTAB timer trigger

Interesting Engineering Decisions

Reddit: RSS over JSON API

The Reddit JSON API (/top.json) started returning 403s without authentication. Rather than deal with OAuth, we switched to Reddit’s public Atom/RSS feed (no credentials required) and parsed it with System.Xml.Linq in a handful of lines. Simple wins.
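The function itself does this in C# with System.Xml.Linq. For readers who want to try the idea quickly, here is an equivalent sketch in Python using only the standard library. The sample feed is hand-written for illustration and is not real Reddit output, and the field selection is my assumption about what a collector needs.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample in Atom format, standing in for a fetched feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Fresh post</title>
    <link href="https://example.com/fresh"/>
    <updated>2026-03-24T08:15:00+00:00</updated>
  </entry>
  <entry>
    <title>Old post</title>
    <link href="https://example.com/old"/>
    <updated>2026-03-01T08:15:00+00:00</updated>
  </entry>
</feed>"""


def recent_entries(feed_xml: str, now: datetime, max_age: timedelta = timedelta(days=7)) -> list:
    """Parse Atom entries and keep only those updated within max_age."""
    posts = []
    for entry in ET.fromstring(feed_xml).findall("atom:entry", ATOM_NS):
        updated = datetime.fromisoformat(entry.findtext("atom:updated", namespaces=ATOM_NS))
        if now - updated <= max_age:
            posts.append({
                "title": entry.findtext("atom:title", namespaces=ATOM_NS),
                "url": entry.find("atom:link", ATOM_NS).get("href"),
                "updated": updated,
            })
    return posts


posts = recent_entries(SAMPLE_FEED, now=datetime(2026, 3, 25, tzinfo=timezone.utc))
```

The namespace map is the one detail that tends to trip people up: Atom elements live in a default namespace, so an unqualified lookup for entry finds nothing.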

Claude as an Editorial Filter

Instead of writing brittle keyword heuristics to judge whether a post is “real tech news,” we hand that job to Claude with a carefully crafted system prompt based on Editorial Guidelines:

A post qualifies if it is relevant to enterprise software development AND meets at least one of the following: Change, Innovation, or Emergent Ideas, and is not a minor patch release, pure marketing, or clickbait.

Claude receives posts in batches of 25, returns a JSON array of qualifying indices, and we map those back to posts. If the API is unreachable, the batch passes through unfiltered as a deliberate fail-safe so the pipeline never breaks.

We used structured JSON output (output_config.format.type = "json_schema") to guarantee a parseable response every time, no regex needed.
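The batching and fail-safe logic is independent of the model call, so it can be sketched without an API key. The Python version below is an illustration, not the C# from the project: classify_batch stands in for the Claude request and is expected to return the indices of qualifying posts, and the flaky classifier exists only to exercise the fallback.

```python
def filter_posts(posts: list, classify_batch, batch_size: int = 25) -> list:
    """Send posts to the classifier in batches and keep the ones it selects.
    If a batch fails, pass it through unfiltered so the pipeline keeps running."""
    kept = []
    for start in range(0, len(posts), batch_size):
        batch = posts[start:start + batch_size]
        try:
            indices = classify_batch(batch)
        except Exception:
            indices = range(len(batch))  # deliberate fail-safe: keep everything
        # Ignore any out-of-range index the classifier might return.
        kept.extend(batch[i] for i in indices if 0 <= i < len(batch))
    return kept


calls = {"count": 0}


def flaky_classifier(batch: list) -> list:
    """Stand-in for the model call: keeps even indices, fails on the second batch."""
    calls["count"] += 1
    if calls["count"] == 2:
        raise ConnectionError("API unreachable")
    return [i for i in range(len(batch)) if i % 2 == 0]


all_posts = [f"post-{i}" for i in range(47)]
kept_posts = filter_posts(all_posts, flaky_classifier)
```

Failing open is a choice that suits a news digest, where an unfiltered batch costs a bit of noise. For a pipeline where a bad item is expensive, failing closed would be the safer default.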

Deduplication Without a Database

To prevent re-evaluating the same URLs across hourly runs (and paying for unnecessary AI API calls), we persist a rolling state file — state/seen-urls.json — in Blob Storage. On each run:

Load seen URLs into a HashSet<string> for O(1) lookup
Filter new posts against it
After filtering, mark all new posts as seen (not just the ones that passed the AI filter — rejected posts shouldn’t be retried)
Prune entries older than 7 days to keep the file small
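The steps above can be sketched as follows, in Python for brevity. One assumption is worth calling out: pruning by age requires a timestamp per URL, so this sketch keeps the state as a URL-to-first-seen mapping rather than a bare set. Loading and saving the blob is left out.

```python
from datetime import datetime, timedelta, timezone


def dedupe(posts: list, seen: dict, now: datetime, retention: timedelta = timedelta(days=7)):
    """Return posts not seen before, plus the updated and pruned state.
    `seen` maps URL -> first-seen timestamp (an assumed state layout)."""
    new_posts = [post for post in posts if post["url"] not in seen]

    # Mark every new post as seen, whether or not the AI filter keeps it later.
    updated = dict(seen)
    for post in new_posts:
        updated[post["url"]] = now

    # Drop entries older than the retention window to keep the file small.
    pruned = {url: ts for url, ts in updated.items() if now - ts <= retention}
    return new_posts, pruned


now = datetime(2026, 3, 24, 9, 0, tzinfo=timezone.utc)
state = {
    "https://example.com/a": now - timedelta(days=8),  # will be pruned
    "https://example.com/b": now - timedelta(days=1),  # still remembered
}
fetched = [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}]

new_posts, state = dedupe(fetched, state, now)
```

Matching the retention window to the 7-day recency filter means a pruned URL can no longer reappear in a fetch, so forgetting it is safe.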

No database, no Redis, no infrastructure overhead. A blob file is enough.

The AI Filter in Practice

A typical hourly run might look like this:

Fetched 312 posts from the last 7 days.
Deduplication: 47 new / 265 already seen (skipped).
Running news quality filter on 47 new posts…
News filter: 11/25 posts passed.
News filter: 9/22 posts passed.
Filter complete: 20/47 posts kept.
20 posts saved to 2026/03/24/09-00-01.json

Out of 312 raw posts, 20 make it through. That’s the kind of signal-to-noise ratio that makes a curated feed actually worth reading.

Deployment

The whole thing deploys with two commands:

# Push app settings (API keys, schedule, etc.)
az functionapp config appsettings set \
  --name FuncNewsAggregation \
  --resource-group rg-news-aggregators \
  --settings @appsettings.json

# Publish the function
func azure functionapp publish FuncNewsAggregation --dotnet-isolated

Done. The function is live, running on Azure’s infrastructure, costing pennies per day.

What’s Next

A few natural extensions:

Email or Slack digest — trigger a Logic App when a new blob is written
Web frontend — serve the JSON blobs as a read-only news feed
Scoring — weight HN scores more heavily now that RSS drops Reddit scores
More sources — dev.to, lobste.rs, or custom RSS feeds are easy to add

Takeaways

The most interesting lesson here isn’t the code, it’s the division of labor. Deterministic logic handles the mechanical work: fetching, deduplicating, and scheduling. The judgment call “Is this actually news?” goes to the model.

That separation keeps the system simple, cheap to run, and easy to adjust. Change the system prompt, and you change the editorial policy. No retraining, no feature engineering.

Two hours from idea to deployed function. That’s the pace at which you can build now.

All source code is C# targeting .NET 8. The function runs on an Azure Consumption plan and, on an hourly schedule, stays well within the free tier, so its compute cost is roughly $0.

AI Is Reshaping Software Development — At What Cost?

Posted on February 28, 2026 by steefjan1970

February has been a busy month for me at InfoQ. I wrote three articles that, on the surface, cover different topics: skill formation, open-source sustainability, and Agile methodology. But when I stepped back and looked at them together, a pattern jumped out at me. Each one tells a piece of the same story: AI is transforming how we build software at a pace that exceeds our ability to think about the consequences.

I want to use this post to connect the dots.

AI Software Development Is Eroding Developer Skills

The first piece I wrote covered an Anthropic study on how AI coding assistance affects skill development. The research was a randomized controlled trial with 52 junior engineers learning a Python library called Trio, which none of them had used before. The findings were stark. Developers who used AI assistance scored 17 percent lower on comprehension tests compared to those who coded by hand. That gap is roughly equivalent to two letter grades.

What struck me most wasn’t the headline number, though. It was the nuance underneath. Participants who used AI as a thinking partner, asking conceptual questions, requesting explanations, and working through problems alongside the tool, retained far more knowledge than those who asked the AI to generate code for them. The dividing line sat around a 65 percent score threshold. Above it, you found the curious developers. Below it were the ones who had delegated the thinking.

I’ve been working in IT for a long time. I’ve seen junior engineers grow into senior architects, and the path always involved struggle. Debugging code you don’t understand at 11 PM on a Tuesday. Reading documentation that makes your eyes glaze over. Writing something that breaks, then figuring out why. That struggle is where the learning happens. What concerns me is not that AI exists (I use it daily and find it genuinely helpful), but that we might be removing the friction that develops competence in the first place.

The full article is here: Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%

AI Coding Tools Are Overwhelming Open Source Maintainers

My second article examined a problem I’ve been watching develop for months. Daniel Stenberg shut down cURL’s bug bounty after AI-generated submissions reached 20 percent of the total. Mitchell Hashimoto banned AI-generated code from Ghostty entirely. Steve Ruiz took it even further with tldraw, auto-closing all external pull requests. These aren’t fringe projects. cURL runs on billions of devices. These are maintainers reaching a breaking point.

RedMonk analyst Kate Holterhoff coined the term “AI Slopageddon” to capture what’s happening, and the name fits. AI-generated contributions arrive in a flood; they look plausible at first glance but fall apart on inspection. The problem isn’t just quality; it’s volume. Maintainers are human beings with limited time, and they’re now spending that time sifting through submissions that an AI produced in seconds without any real understanding of the project.

A research paper from the Central European University and the Kiel Institute for the World Economy modeled the bigger structural risk here. Open-source projects depend on user engagement, documentation views, bug reports, and community recognition as a return on the maintainer’s investment. When AI agents assemble packages without developers ever reading the docs or filing bugs, that feedback loop breaks. The researchers tried to model a “Spotify-style” revenue redistribution, but the numbers didn’t work: vibe-coded users would need to generate 84 percent of the engagement that direct users currently provide. That’s not realistic.

I keep thinking about this one. My entire career has been built on open source, from the tools I integrate at work to the libraries I rely on for InfoQ articles. If the ecosystem that produces and maintains these tools becomes unsustainable because AI-generated noise overwhelms the people doing the actual work, we all lose. Not eventually. Soon.

More details here: AI “Vibe Coding” Threatens Open Source as Maintainers Face Crisis.

AI Software Development Puts Agile Under Pressure

The third article I wrote covered a debate sparked by Steve Jones, an executive VP at Capgemini, who declared that AI has killed the Agile Manifesto. His argument: when agentic SDLC systems can build applications in hours, the Manifesto’s human-centric principles no longer apply. If the tooling matters as much as or more than the people using it, then the Manifesto’s preference for “individuals and interactions over processes and tools” breaks down.

It’s a provocative claim that generated a lot of discussion. Casey West proposed an “Agentic Manifesto” that shifts the focus from verification to validation. AWS’s 2026 prescriptive guidance suggests “Intent Design” should replace sprint planning. Kent Beck, one of the original Manifesto signatories, has been talking about “augmented coding” as a new paradigm.

But here’s the counterpoint that keeps sticking with me. Forrester’s 2025 State of Agile Development report found that 95 percent of professionals still consider Agile critically relevant to their work. That’s not a methodology on its deathbed. And as one commenter noted in the discussion thread, bureaucracy killed Agile long before AI agents came along.

I think the question isn’t whether the Agile Manifesto is obsolete. It’s whether we’ve ever fully lived by its principles in the first place. The Manifesto says “responding to change over following a plan.” If there’s ever been a moment that demands responsiveness and adaptation, it’s right now. The irony of declaring Agile dead precisely when we need its core philosophy the most isn’t lost on me.

Full article: Does AI Make the Agile Manifesto Obsolete?

What AI’s Impact on Software Development Really Tells Us

When I look at these three stories together, I see a common tension. AI is accelerating what we can measure (lines of code produced, pull requests submitted, applications prototyped) while eroding what is harder to quantify. Deep understanding of a codebase. Thoughtful engagement with an open-source community. The human judgment that sits at the heart of iterative development.

The Anthropic study shows that speed and learning pull in opposite directions, at least for developers acquiring new skills. The open-source crisis tells us that volume and quality are diverging at an alarming rate. The Agile debate tells us that our existing frameworks for organizing human work are straining under the weight of AI-driven change.

None of this means we should reject AI tools. I certainly won’t. But I think we need to be far more intentional about how we deploy them. That means designing AI assistants that support learning rather than replace it. It means building platforms that protect maintainers from low-quality noise. It means evolving our methodologies rather than abandoning them.

I have spent years exploring new technologies; it’s one of the things I enjoy most about working in this field. I remain optimistic about where AI can take us. But optimism without caution is just naivety. The choices we make in the next year or two about how AI integrates into our development practices will shape the industry for a decade.

We should probably pay attention.
