Azure API Management Semantic Caching: Cut AI Token Costs with Similarity-Based Responses

Azure API Management semantic caching is the most operationally transparent cost optimization in this series. Every technique covered so far, auth, token limits, token metrics, and load balancing, requires deliberate design decisions in how you configure APIM. Semantic caching, by contrast, works silently. Calling applications sends prompts as normal. APIM checks whether a semantically similar prompt has already been answered. If a match exists above a configurable similarity threshold, APIM returns the cached response without touching the AI backend. Zero tokens consumed. Zero latency is added by the model.

For workloads with repetitive prompt patterns, internal FAQ bots, document classifiers, and support agents that see the same questions repeatedly, the cache hit rate can be surprisingly high. Even a 20% hit rate on a high-volume workload translates directly into cost reduction and lower average latency.

How Azure API Management Semantic Caching Works

The azure-openai-semantic-cache-lookup policy sits in the inbound section of your APIM pipeline, before the request reaches the AI backend. When a prompt arrives, APIM sends it to a configured embedding model, typically Azure OpenAI text-embedding-ada-002 or equivalent, to generate a vector representation of the prompt. APIM then compares that vector against cached embeddings stored in Azure Managed Redis using cosine similarity.

If the similarity score between the incoming prompt and a cached prompt falls below the configured score-threshold, APIM treats it as a cache hit and returns the stored response. If no match meets the threshold, APIM forwards the request to the AI backend as normal and stores the response in Redis for future lookups.

Azure API Management semantic caching policy flow showing cache hit returning stored response and cache miss forwarding to Azure OpenAI
Diagram 1: Semantic cache request flow. On a cache hit, APIM returns a stored response directly — consuming zero tokens. On a miss, APIM forwards to the AI backend and stores the response in Azure Managed Redis for future hits.

The generic variant, llm-semantic-cache-lookup, works identically for non-Azure backends. Both require the same supporting infrastructure: an embedding model backend and an Azure Managed Redis instance configured in APIM. The semantic cache store policy handles writing responses back to the cache in the outbound section.

Tuning the Score Threshold for Azure API Management Semantic Caching

The score-threshold attribute is the most consequential configuration decision in the semantic caching policy. It controls how similar an incoming prompt must be to a cached prompt for APIM to treat it as a hit. The value runs from 0.0 to 1.0, but the practical range is much narrower.

Azure API Management semantic caching score threshold tuning guide from aggressive to conservative with vary-by subscription user and global scope strategies
Diagram 2: Score threshold tuning guide and vary-by scope strategies. Lower thresholds cache more aggressively. The default of 0.05 suits most production workloads. A global cache (no vary-by) maximizes hit rate but risks serving the wrong user’s response.

In practice, three zones matter:

0.01 to 0.05 (aggressive). At this range, prompts that are paraphrases of each other — “What is my account balance?” and “Can you show me my current balance?” — reliably produce cache hits. This is the right range for FAQ bots, support agents, and any workload where users ask the same questions in slightly different words. The default of 0.05 sits here and suits most production deployments.

0.05 to 0.20 (conservative). At this range, only prompts that are very close in wording produce hits. Creative workloads, code generation, and document drafting tend to have high prompt variance, so a more conservative threshold avoids serving stale cached responses to genuinely different requests.

Above 0.30 (too strict). At this threshold, almost no prompts match. The cache effectively stops functioning. Avoid this range unless you are deliberately disabling caching for a specific API product while keeping the policy in the pipeline for future use.

Start at 0.05 and monitor cache hit rates in Application Insights. If the hit rate is low for a workload you expect to be repetitive, lower the threshold incrementally. If you start seeing complaints about incorrect or stale responses, raise it.

vary-by Scope: Preventing Cache Pollution

The vary-by element scopes the cache namespace. Without it, all consumers share a single global cache. That maximizes the hit rate but introduces a significant risk: APIM could serve one user’s cached response to a different user. For most enterprise AI workloads, that is unacceptable.

The safest default is to vary by Subscription ID, which gives each API subscriber their own cache namespace. This prevents cross-team cache pollution while still achieving high hit rates within each subscriber’s own prompt patterns. For multi-tenant applications where individual users have distinct contexts, vary by a user identifier extracted from the JWT or a custom header instead.

A global cache with no vary-by is appropriate only for fully public, stateless APIs where responses are identical regardless of who requests them. Internal enterprise AI workloads rarely meet that bar.

Infrastructure Requirements for Azure API Management Semantic Caching

Semantic caching requires two supporting Azure resources beyond APIM itself. First, an Azure Managed Redis instance configured as an external cache in APIM. Redis stores the prompt embeddings and cached responses. The cache TTL is configurable in the store policy, so you control how long responses remain valid before APIM re-queries the backend.

Second, an embeddings model backend registered in APIM. For Azure OpenAI, this is typically a separate deployment of text-embedding-ada-002 or text-embedding-3-small. The embeddings backend is referenced by the embeddings-backend-id attribute. It is separate from your completions backend, so you can apply independent token limits and load balancing to the embeddings traffic.

One practical consideration: the embeddings call itself consumes tokens and adds a small amount of latency on every request, whether or not the cache hits. For workloads with very low prompt repetition, the overhead of generating embeddings for every request may outweigh the savings from occasional cache hits. Measure the hit rate before committing the infrastructure cost.

What’s Next in This Azure API Management for AI Series

Part 7 closes the series by covering APIM’s emerging role as an MCP gateway for agentic AI workloads: how to expose REST APIs as MCP servers, pass through existing MCP servers, and manage agent-to-agent traffic through the same control plane we’ve built across this series.

  • Part 7: APIM as an MCP gateway for agentic AI workloads.

Azure API Management Token Metric Policy: AI Cost Observability and Cross-Charging

Part 4 of 7 in the “APIM for AI Workloads” series

The Azure API Management token metric policy turns AI cost data from a finance problem into an engineering one. In Part 3, we covered enforcement: how to set consumption boundaries per consumer. This post covers the complementary piece: how to measure that consumption. More importantly, it shows how to make it visible to the right people and use it to drive internal cross-charging and FinOps dashboards.

At my current company, one of the first questions the architecture board asked was straightforward: which teams are consuming what, and what does it cost? Without instrumentation at the gateway layer, that question is genuinely unanswerable. The token metric policy is how you answer it.

Azure API Management Token Metric Policy: How It Works

The policy sits in the outbound section of your APIM pipeline. After the AI backend returns a response, APIM reads the token usage fields from the response body. These include prompt tokens, completion tokens, and total tokens. APIM then emits them as custom metrics to Application Insights under a namespace you define.

Crucially, the policy emits metrics after the response arrives. It uses actual token counts from the API response rather than estimates. As a result, the data is accurate rather than approximated. It also means the metric emission adds no latency to the request path: the response is returned to the caller immediately, and the metric is emitted asynchronously.

Azure API Management token metric policy observability pipeline emitting token counts to Application Insights for cross-charging
Diagram 1: Token metric policy observability pipeline. Token counts from the AI backend response flow through the APIM metrics layer to Application Insights, broken down by dimensions for cross-charging and cost allocation.

The generic variant, llm-emit-token-metric, works identically for non-Azure backends. Both policies share the same dimension model, so the configuration patterns below apply regardless of which AI provider sits behind APIM.

Choosing Dimensions for Azure API Management Token Metric Policy

Dimensions are the labels attached to each metric event. They explain how to slice and aggregate token consumption data in Application Insights. Choosing the right dimensions is the most important configuration decision for making the data useful for cross-charging.

Azure API Management token metric policy dimension strategies for cross-charging using Subscription ID User ID and API ID
Diagram 2: Three-dimensional strategies for cross-charging and showback. Subscription ID maps to teams and cost centers, User ID enables per-user billing in multi-tenant apps, and API ID breaks down cost by AI workload or feature.

The three primary dimension options are:

Subscription ID. The most common choice for internal enterprise deployments. Each APIM subscription maps to a team, product, or cost center, so filtering Application Insights metrics by Subscription ID gives you direct per-team token consumption. This pairs naturally with the subscription key authentication pattern from Part 2 and the per-subscription counter-key from Part 3.

User ID. Sourced from the JWT subject claim or a custom header, User ID enables per-user consumption reporting. This is the right dimension for multi-tenant SaaS applications where individual end users have their own token budgets, or where you need to identify heavy consumers within a shared subscription.

API ID. Identifies which APIM API product generated the consumption. Useful when a single subscription uses multiple AI-backed APIs: one for a conversational agent, one for content generation, and one for document summarization. API ID lets you break down cost by use case rather than just by subscriber.

In practice, combining all three dimensions gives you the most flexibility. A single metric event tagged with Subscription ID, User ID, and API ID can answer questions at every level: how much did the platform spend in total, how much did Team A spend, how much did User X consume, and which AI feature is the most expensive to run.

Querying Token Metrics in Application Insights

Once the policy is emitting metrics, you query them in Application Insights using the custom metrics namespace you configured. The metrics appear under the namespace name you set in the policy (for example, “AzureOpenAI” or “MyLLM”), with separate metric events for prompt tokens and completion tokens.

A practical starting point is a KQL query that aggregates the total number of tokens by Subscription ID over the past 30 days. From there, you can add filters by API ID to isolate specific workloads, or pivot by User ID to identify the highest consumers within a team.

For FinOps dashboards, the most useful view is a stacked time-series chart of total token consumption broken down by subscription, updated daily. This gives finance and engineering a shared view of AI spend trends without exporting data from Azure Monitor to a separate BI tool. Azure Workbooks can host this directly in the Azure portal, making it accessible to non-technical stakeholders.

From Observability to Cross-Charging

Observability is the prerequisite for cross-charging. However, they are not the same thing. Observability tells you what happened. Cross-charging, by contrast, is the organizational process of allocating those costs to the right budget owners.

The token metric policy gives you the raw data. To turn that into a cross-charge, you need two additional steps. First, agree on a price per token with your finance team — usually derived from the Azure cost per 1,000 tokens for your model and region. Second, automate a monthly report that multiplies token consumption by the subscription price.

This does not need to be complex. For example, a Logic App or Azure Function that queries Application Insights on the first of each month works well for most organizations starting out. It aggregates tokens by subscription, multiplies by the agreed rate, and emails a cost summary to each team lead. The Application Insights REST API makes this straightforward to automate.

Finally, the most important advice: have this conversation with finance and product teams before AI consumption scales. Retroactive cross-charging is significantly harder to establish than an upfront model with clear methodology and tooling.

What’s Next in This Azure API Management for AI Series

Part 5 covers load balancing and circuit breaking: how to distribute traffic across PTU and PAYG backends, configure backend pools, and set up circuit breaker rules for automatic failover when a primary endpoint becomes unavailable.

Azure API Management for AI: Securing Your AI APIs with Authentication and Authorization

Part 2 of 7 in the “APIM for AI Workloads” series

In Part 1 of this series, I made the case for why Azure API Management for AI workloads is the right control plane for governing AI traffic across an organization. This post gets practical: how do you actually secure access to your AI backends with APIM without creating a credential-management nightmare?

Security is where many AI projects cut corners, and understandably so. When you’re moving fast to prove value with a new model, authentication feels like overhead. But AI endpoints are expensive, and an unsecured Azure OpenAI endpoint is a real risk: anyone with the URL and key can start consuming tokens at your cost. At scale, that’s a significant financial and compliance exposure.

APIM addresses this with a three-layer security model. Let’s walk through each layer.

Azure API Management for AI Security: A Three-Layer Model

The authentication and authorization pattern in APIM is deliberately layered. Each layer answers a different question and operates independently, so a failure at any layer stops the request before it reaches the AI backend.

Azure API Management for AI three-layer authentication flow showing subscription key, JWT validation and Managed Identity policy pipeline
Diagram 1: Three-layer auth in APIM for AI workloads. Layer 1 identifies the caller via subscription key. JWT validation in Layer 2 then determines what they’re permitted to do. Finally, Layer 3 authenticates APIM itself to the AI backend via Managed Identity.

The three layers are:

  • Subscription keys to identify and track API consumers.
  • JWT validation to enforce fine-grained access control based on claims.
  • Managed Identity to authenticate APIM to Azure OpenAI without storing credentials.

Each layer has a distinct role. Confusing them is a common mistake, so it’s worth being explicit about what each one does and does not do.

Layer 1: Subscription Keys

Subscription keys are APIM’s mechanism for identifying API consumers. When you create an API product in APIM and require a subscription, callers must include their key in the Ocp-Apim-Subscription-Key header. APIM validates the key, maps it to a subscriber, and lets the request proceed.

This is important for AI workloads specifically because subscription keys enable per-consumer token tracking. When you combine subscription key validation with the Token Metric policy we’ll cover in Part 4, you get usage data broken down by subscriber, which is the foundation of any internal cross-charging model.

Subscription keys answer the question: Who is calling? They don’t answer what the caller is allowed to do. For that, you need JWT validation.

Layer 2: JWT Validation and Claims-Based Authorization

The validate-jwt policy is where you enforce what a caller is permitted to do. It validates the JWT token in the Authorization header against your identity provider, and can inspect any claim in the token to make authorization decisions.

For Azure OpenAI specifically, this is where you control which teams or applications can access which model deployments. A team working on an internal chatbot should not be able to call a GPT-4o deployment reserved for a production workload. JWT claims let you enforce that boundary at the gateway layer, with no changes required in the calling application.

A typical policy checks the token signature against your Azure AD tenant’s OpenID Connect configuration, then validates that a required scope or role claim is present:

The failed-validation-httpcode=”401″ attribute ensures unauthenticated callers get a clean rejection before they ever reach the backend. You can also use failed-validation-error-message to return a specific error message, which helps consumers debug auth failures without exposing internal details.

For multi-provider setups where you’re routing to non-Azure backends like Mistral or Cohere, the same JWT policy applies. The claims model is provider-agnostic, which is one of the advantages of centralizing auth in APIM rather than handling it per-backend.

Layer 3: Managed Identity for Backend Authentication

Managed Identity is the most important security improvement you can make when setting up Azure API Management for AI. It replaces the pattern of storing an Azure OpenAI API key in APIM’s named values with a system-assigned or user-assigned Managed Identity that APIM uses to authenticate directly to Azure OpenAI via Azure AD.

Azure API Management for AI comparing API key authentication risks versus Managed Identity benefits for Azure OpenAI backend access
Diagram 2: API key authentication (left) vs. Managed Identity (right). The key difference is that Managed Identity requires no stored credentials anywhere in your configuration.

The practical difference is significant. With API key authentication, you have a long-lived secret that needs to be stored, rotated, and kept out of source control. With Managed Identity, there is no secret. APIM requests a short-lived token from Azure AD at runtime, and Azure AD issues it based on the APIM instance’s identity. Nothing is stored. Nothing can leak.

The configuration is a single policy element in the inbound section: <authentication-managed-identity resource=”https://cognitiveservices.azure.com”/&gt;. APIM handles the rest, automatically fetching and refreshing the token.

On the Azure OpenAI side, you grant the APIM instance’s Managed Identity the Cognitive Services User role on the Azure OpenAI resource. That’s the minimum required permission. You can scope it further to specific deployments if needed.

For organizations in regulated industries, such as healthcare, financial services, and government, Managed Identity is not optional. It satisfies Zero Trust authentication requirements and produces a full audit trail in Azure Monitor, tied to the APIM instance identity rather than a shared key.

Azure API Management for AI: Putting the Three Layers Together

In a production setup, all three layers run sequentially within the inbound policy pipeline. A request arrives with a subscription key and a JWT. APIM validates the key first (fast, no external call), then validates the JWT against Azure AD, then forwards the request to Azure OpenAI using its Managed Identity token. The AI backend never sees the caller’s JWT, and APIM never stores an API key.

The result is a clean separation of concerns:

  • The calling application manages its own JWT (issued by Azure AD based on its own identity or the user’s identity).
  • APIM enforces the authorization policy without the backend needing to know anything about it.
  • The AI backend trusts only APIM’s Managed Identity, not arbitrary callers.

This is the architecture you want before you go to production with any AI workload that touches sensitive data or incurs meaningful cost.

What’s Next in This Series

Part 3 covers the Token Limit policy: how to enforce tokens-per-minute limits per consumer, configure throttling behavior, and handle the differences between the azure-openai-token-limit and llm-token-limit policy variants.

Azure API Management for AI: Why Your APIs Need a Gateway

Part 1 of 7 in the “APIM for AI Workloads” series

Over the past year, I’ve been doing a lot of work with integration services, including Azure API Management and, recently, also on AI adoption: evaluating models, designing agentic architectures, and figuring out how to govern AI consumption across the organization responsibly. One thing that keeps coming up in those conversations is a question that sounds almost too basic to ask: Who is keeping track of what we’re spending on tokens?

The answer, more often than not, is nobody.

That’s the problem this series is about. AI APIs are fundamentally different from the REST APIs we’ve been managing for the past decade, and the differences matter operationally. Before we dive into the mechanics of Azure API Management policies, load balancing, and semantic caching in subsequent posts, I want to make the case for a gateway layer in front of your AI services. Before we dive into the mechanics of Azure API Management for AI workloads, policies, load balancing, and semantic caching, I want to make the case for why you need a gateway layer in front of your AI services.

Tokens Are Not Requests

Traditional API management was built around a relatively simple model: count the requests, enforce rate limits, log the traffic, and call it done. One call in, one response out. The cost model was predictable.

AI APIs broke that model completely.

When you call an Azure OpenAI endpoint, you’re not paying per request. You’re paying per token. And a token count is invisible at the API gateway layer unless you specifically instrument for it. A single call from a conversational agent might consume 500 tokens. A call from a poorly-optimized batch process might consume 50,000. Both look the same at the HTTP level: one POST, one 200 OK.

This creates a blind spot that grows dangerously as AI adoption scales across an organization. Teams start building intelligent apps: conversational agents, personalized content generators, voice assistants, copilots, and each one is independently calling AI backend services. Nobody has a view across the whole estate of what’s being consumed, by whom, and at what cost.

The diagram below shows what this looks like in practice: multiple application types hitting multiple AI providers, with token-based pricing models sitting underneath.

Azure API Management control plane between intelligent apps and AI providers showing PTU and PAYG token billing
Diagram 1: Intelligent applications on the left, AI service providers on the right, with both PTU and PAYG billing models underneath. Without a control plane in the middle, you’re flying blind.

The Three Problems Azure API Management for AI Solves

Azure API Management acts as the centralized control plane between your intelligent applications and your AI backends. It addresses three distinct categories of problems.

Performance optimization: AI model endpoints have throughput limits. Azure OpenAI Provisioned Throughput Units (PTU) give you reserved capacity at a fixed price, but cap out at a hard ceiling. Pay-as-you-go (PAYG) endpoints scale elastically but incur higher per-token costs at high volumes. Without a gateway layer, individual applications can’t know whether PTU capacity is available or saturated. A gateway can make that routing decision automatically, serving from PTU when it has headroom, falling back to PAYG when it doesn’t. That’s a meaningful cost optimization with no changes required to the calling applications.

Cost control: Tokens consumed by one team are costs borne by another team’s budget if you’re centralizing AI spend, which most organizations will do, at least initially. Without per-consumer visibility into token usage, internal cross-charging and showback are impossible. APIM’s token metric policies make this tractable by emitting token consumption data broken down by dimensions such as User ID, Subscription ID, or API product, all of which feed into Application Insights for dashboarding and alerting.

Data security: Routing AI traffic through a managed gateway gives you a single enforcement point for authentication, authorization, and policy. You can validate JWT claims, require subscription keys from API consumers, use Managed Identity to authenticate to Azure OpenAI without exposing credentials, and ensure traffic never leaves your controlled perimeter. Without a gateway, every team builds its own auth story, or more commonly, skips it.

PTU vs. PAYG: Why the Billing Model Shapes Your Architecture

Before we go further, it’s worth spending a moment on the two Azure OpenAI billing models, because they have direct architectural implications.

Provisioned Throughput Units (PTU) give you reserved capacity on a model. You pay a fixed hourly rate regardless of how many tokens you actually consume. The benefits are predictable costs and guaranteed throughput. The risk is waste if your utilization is low, and hard throttling if you exceed the provisioned limit.

Pay-as-you-go (PAYG) charges per token consumed. No upfront commitment, no capacity ceiling, but costs scale linearly with usage and can surprise you if consumption spikes.

Most production AI deployments end up using both: PTU for baseline load, where utilization is predictable, and PAYG as an overflow layer. This makes a load balancer with circuit breaking essential, which we’ll cover in Part 5 of this series.

The same logic applies beyond Azure OpenAI. APIM now supports generic LLM backends via the llm-* policy family, which means you can manage traffic to Mistral, Cohere, LLaMA, and other providers through the same control plane. The diagram below shows this architecture: APIM in the center, with load balancing across PTU and PAYG instances, token metrics flowing to Application Insights, and the full provider landscape behind it.

Azure API Management AI control plane with token limit, token metric, load balancing, semantic caching and circuit breaker policies across PTU and PAYG backends
Azure API Management as the centralized AI control plane, with performance, cost, and security governance across multiple providers and billing models.

What This Looks Like in Practice

Let me make this concrete with a scenario I’ve seen play out multiple times.

An organization deploys its first Azure OpenAI service for a conversational agent. A few months later, a second team wants to use AI for content generation. Then a third team builds an internal copilot. Each team provisions its own Azure OpenAI resource, authenticates directly, and manages its own rate limiting. There’s no visibility into combined spend. No shared capacity optimization. No centralized audit trail.

This is the point where someone in finance asks a question that nobody can answer: “How much are we spending on AI, and which team is spending what?”

Centralizing AI traffic through APIM is how you get out of that situation before it becomes a problem. The policy-based approach means you can add governance without changing anything in the calling applications. They call the APIM endpoint, APIM handles the rest.

Azure API Management for AI Workloads: What’s Coming in This Series

The next six posts will go deep on the specific capabilities that make APIM a serious AI control plane:

Each post will include the relevant policy XML, real-world sizing guidance, and the architectural decisions behind the patterns.

If you’re building AI-powered applications at scale and you’re not yet routing that traffic through a gateway, the rest of this series is for you.

Build an AI Tech News Aggregator: Azure Functions & Claude

There’s a lot of noise on the internet. Reddit, Hacker News, tech blogs, keeping up with what actually matters in enterprise software is a full-time job. So I built a fully automated system that does it for me, runs in the cloud, is powered by AI, and was deployed end-to-end in less than two hours using Claude Code.

Here’s how.

What We Built (What Claude did mostly)

A C# Azure Function that runs every hour and:

  1. Fetches posts from configurable Reddit subreddits and Hacker News
  2. Filters for recency only posts from the last 7 days
  3. Deduplicates across runs never evaluates the same URL twice
  4. Applies an AI editorial filter Claude decides what’s genuinely newsworthy
  5. Writes curated results to Azure Blob Storage as timestamped JSON

The output is clean, structured JSON ready to feed into a newsletter, dashboard, or notification system.

The Architecture

The system has three layers: data collectionAI filtering, and persistence.

Reddit RSS feeds ──┐

                   ├─► Aggregator Function ─► Claude AI Filter ─► Blob Storage

HN Firebase API ───┘         │

                              └─► State Store (seen URLs)

Tech Stack

ConcernChoice
RuntimeAzure Functions v4, .NET 8 isolated worker
Reddit dataPublic Atom/RSS feed (r/{sub}/top.rss)
HN dataFirebase REST API
AI filteringAnthropic Claude (claude-opus-4-6) via raw HttpClient
StorageAzure Blob Storage
ScheduleNCRONTAB timer trigger

Interesting Engineering Decisions

Reddit: RSS over JSON API

The Reddit JSON API (/top.json) started returning 403s without authentication. Rather than deal with OAuth, we switched to Reddit’s public Atom/RSS feed (no credentials required) and parsed it with System.Xml.Linq in a handful of lines. Simple wins.

Claude as an Editorial Filter

Instead of writing brittle keyword heuristics to judge whether a post is “real tech news,” we hand that job to Claude with a carefully crafted system prompt based on Editorial Guidelines:

A post qualifies if it is relevant to enterprise software development AND meets at least one of the following: Change, Innovation, or Emergent Ideas, and is not a minor patch release, pure marketing, or clickbait.

Claude receives posts in batches of 25, returns a JSON array of qualifying indices, and we map those back to posts. If the API is unreachable, the batch passes through unfiltered as a deliberate fail-safe so the pipeline never breaks.

We used structured JSON output (output_config.format.type = “json_schema”) to guarantee a parseable response every time, no regex needed.

Deduplication Without a Database

To prevent re-evaluating the same URLs across hourly runs (and paying for unnecessary AI API calls), we persist a rolling state file — state/seen-urls.json — in Blob Storage. On each run:

  • Load seen URLs into a HashSet<string> for O(1) lookup
  • Filter new posts against it
  • After filtering, mark all new posts as seen (not just the ones that passed the AI filter — rejected posts shouldn’t be retried)
  • Prune entries older than 7 days to keep the file small

No database, no Redis, no infrastructure overhead. A blob file is enough.

The AI Filter in Practice

A typical hourly run might look like this:

Fetched 312 posts from the last 7 days.

Deduplication: 47 new / 265 already seen (skipped).

Running news quality filter on 47 new posts…

News filter: 11/25 posts passed.

News filter: 9/22 posts passed.

Filter complete: 20/47 posts kept.

20 posts saved to 2026/03/24/09-00-01.json

Out of 312 raw posts, 20 make it through. That’s the kind of signal-to-noise ratio that makes a curated feed actually worth reading.

Deployment

The whole thing deploys with two commands:

# Push app settings (API keys, schedule, etc.)

az functionapp config appsettings set \

  –name FuncNewsAggregation \

  –resource-group rg-news-aggregators \

  –settings @appsettings.json

# Publish the function

func azure functionapp publish FuncNewsAggregation –dotnet-isolated

Done. The function is live, running on Azure’s infrastructure, costing pennies per day.

What’s Next

A few natural extensions:

  • Email or Slack digest — trigger a Logic App when a new blob is written
  • Web frontend — serve the JSON blobs as a read-only news feed
  • Scoring — weight HN scores more heavily now that RSS drops Reddit scores
  • More sources — dev.to, lobste.rs, or custom RSS feeds are easy to add

Takeaways

The most interesting lesson here isn’t the code, it’s the division of labor. Deterministic logic handles the mechanical work: fetching, deduplicating, and scheduling. The judgment call “Is this actually news?”  goes to the model.

That separation keeps the system simple, cheap to run, and easy to adjust. Change the system prompt, and you change the editorial policy. No retraining, no feature engineering.

Two hours from idea to deployed function. That’s the pace at which you can build now.


All source code is C# targeting .NET 8. The function runs on an Azure Consumption plan and incurs roughly $0 in hourly costs well within the free tier.

AI Is Reshaping Software Development — At What Cost?

February has been a busy month for me at InfoQ. I wrote three articles that, on the surface, cover different topics: skill formation, open-source sustainability, and Agile methodology. But when I stepped back and looked at them together, a pattern jumped out at me. Each one tells a piece of the same story: AI is transforming how we build software at a pace that exceeds our ability to think about the consequences.

I want to use this post to connect the dots.

AI Software Development Is Eroding Developer Skills

The first piece I wrote covered an Anthropic study on how AI coding assistance affects skill development. The research was a randomized controlled trial with 52 junior engineers learning a Python library called Trio, which none of them had used before. The findings were stark. Developers who used AI assistance scored 17 percent lower on comprehension tests compared to those who coded by hand. That gap is roughly equivalent to two letter grades.

What struck me most wasn’t the headline number, though. It was the nuance underneath. Participants who used AI as a thinking partner, asking conceptual questions, requesting explanations, and working through problems alongside the tool, retained far more knowledge than those who asked the AI to generate code for them. The dividing line sat around a 65 percent score threshold. Above it, you found the curious developers. Below it are the ones who had delegated the thinking.

I’ve been working in IT for a long time. I’ve seen junior engineers grow into senior architects, and the path always involved struggle. Debugging code you don’t understand at 11 PM on a Tuesday. Reading documentation that makes your eyes glaze over. Writing something that breaks, then figuring out why. That struggle is where the learning happens. What concerns me is not that AI exists; I use it daily and find it genuinely helpful, but that we might be removing the friction that develops competence in the first place.

The full article is here: Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%

AI Coding Tools Are Overwhelming Open Source Maintainers

My second article examined a problem I’ve been watching develop for months. Daniel Stenberg shut down cURL’s bug bounty after AI-generated submissions reached 20 percent of the total. Mitchell Hashimoto banned AI-generated code from Ghostty entirely. Steve Ruiz took it even further with tldraw, auto-closing all external pull requests. These aren’t fringe projects. cURL runs on billions of devices. These are maintainers reaching a breaking point.

RedMonk analyst Kate Holterhoff coined the term “AI Slopageddon” to capture what’s happening, and it does so well. The flood of AI-generated contributions looks plausible at first glance but falls apart on inspection. The problem isn’t just quality, it’s volume. Maintainers are human beings with limited time, and they’re now spending that time sifting through submissions that an AI produced in seconds without any real understanding of the project.

A research paper from the Central European University and the Kiel Institute for the World Economy modeled the bigger structural risk here. Open-source projects depend on user engagement, documentation views, bug reports, and community recognition as a return on the maintainer’s investment. When AI agents assemble packages without developers ever reading the docs or filing bugs, that feedback loop breaks. The researchers tried to model a “Spotify-style” revenue redistribution. Still, the numbers didn’t work: vibe-coded users would need to generate 84 percent of the engagement that direct users currently provide. That’s not realistic.

I keep thinking about this one. My entire career has been built on open source, from the tools I integrate at work to the libraries I rely on for InfoQ articles. If the ecosystem that produces and maintains these tools becomes unsustainable because AI-generated noise overwhelms the people doing the actual work, we all lose. Not eventually. Soon.

More details here: AI “Vibe Coding” Threatens Open Source as Maintainers Face Crisis.

AI Software Development Puts Agile Under Pressure

The third article I wrote covered a debate sparked by Steve Jones, an executive VP at Capgemini, who declared that AI has killed the Agile Manifesto. His argument: when agentic SDLC systems can build applications in hours, the Manifesto’s human-centric principles no longer apply. If the tooling matters as much as or more than the people using it, then the Manifesto’s preference for “individuals and interactions over processes and tools” breaks down.

It’s a provocative claim that generated a lot of discussion. Casey West proposed an “Agentic Manifesto” that shifts the focus from verification to validation. AWS’s 2026 prescriptive guidance suggests “Intent Design” should replace sprint planning. Kent Beck, one of the original Manifesto signatories, has been talking about “augmented coding” as a new paradigm.

But here’s the counterpoint that keeps sticking with me. Forrester’s 2025 State of Agile Development report found that 95 percent of professionals still consider Agile critically relevant to their work. That’s not a methodology on its deathbed. And as one commenter noted in the discussion thread, bureaucracy killed Agile long before AI agents came along.

I think the question isn’t whether the Agile Manifesto is obsolete. It’s whether we’ve ever fully lived by its principles in the first place. The Manifesto says “responding to change over following a plan.” If there’s ever been a moment that demands responsiveness and adaptation, it’s right now. The irony of declaring Agile dead precisely when we need its core philosophy the most isn’t lost on me.

Full article: Does AI Make the Agile Manifesto Obsolete?

What AI’s Impact on Software Development Really Tells Us

When I look at these three stories together, I see a common tension. AI is accelerating what we can measure, lines of code produced, pull requests submitted, and applications prototyped, while eroding what is harder to quantify. Deep understanding of a codebase. Thoughtful engagement with an open-source community. The human judgment that sits at the heart of iterative development.

The Anthropic study shows that speed and learning pull in opposite directions, at least for developers acquiring new skills. The open-source crisis tells us that volume and quality are diverging at an alarming rate. The Agile debate tells us that our existing frameworks for organizing human work are straining under the weight of AI-driven change.

None of this means we should reject AI tools. I certainly won’t. But I think we need to be far more intentional about how we deploy them. That means designing AI assistants that support learning rather than replace it. It means building platforms that protect maintainers from low-quality noise. It means evolving our methodologies rather than abandoning them.

As someone who has spent years exploring new technologies, it’s one of the things I enjoy most about working in this field. I remain optimistic about where AI can take us. But optimism without caution is just naivety. The choices we make in the next year or two about how AI integrates into our development practices will shape the industry for a decade.

We should probably pay attention.