Azure API Management Semantic Caching: Cut AI Token Costs with Similarity-Based Responses

Azure API Management semantic caching is the most operationally transparent cost optimization in this series. Every technique covered so far (auth, token limits, token metrics, and load balancing) requires deliberate design decisions in how you configure APIM. Semantic caching, by contrast, works silently. Calling applications send prompts as normal. APIM checks whether a semantically similar prompt has already been answered. If a cached prompt is similar enough to satisfy the configured score threshold, APIM returns the cached response without touching the AI backend. The completion model consumes zero tokens and adds zero latency.

For workloads with repetitive prompt patterns, such as internal FAQ bots, document classifiers, and support agents that see the same questions repeatedly, the cache hit rate can be surprisingly high. Even a 20% hit rate on a high-volume workload means roughly one in five requests never reaches the completion model, which lowers both token spend and average latency.
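The effect of a given hit rate can be sketched as a weighted average. The figures below (800 completion tokens per request, 2,000 ms backend latency, 50 ms for a cache hit) are hypothetical placeholders rather than measurements from APIM, and blended_average is an illustrative helper name.

```python
def blended_average(hit_rate: float, miss_value: float, hit_value: float = 0.0) -> float:
    """Expected per-request value when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_value + (1.0 - hit_rate) * miss_value


# Hypothetical workload figures, for illustration only.
# A cache hit consumes no completion tokens, so hit_value defaults to 0.
avg_completion_tokens = blended_average(0.20, miss_value=800.0)
# A cache hit still pays for the embeddings call and the Redis lookup (assumed 50 ms).
avg_latency_ms = blended_average(0.20, miss_value=2000.0, hit_value=50.0)

print(f"average completion tokens per request: {avg_completion_tokens:.0f}")
print(f"average latency per request: {avg_latency_ms:.0f} ms")
```

With these placeholder figures, the average falls from 800 to 640 completion tokens and from 2,000 ms to 1,610 ms per request. Replace the inputs with values from your own token metrics before drawing conclusions.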

How Azure API Management Semantic Caching Works

The azure-openai-semantic-cache-lookup policy sits in the inbound section of your APIM pipeline, before the request reaches the AI backend. When a prompt arrives, APIM sends it to a configured embedding model, typically Azure OpenAI text-embedding-ada-002 or equivalent, to generate a vector representation of the prompt. APIM then compares that vector against cached embeddings stored in Azure Managed Redis using cosine similarity.

APIM scores each comparison so that smaller values mean closer matches. If the score between the incoming prompt and a cached prompt falls below the configured score-threshold, APIM treats it as a cache hit and returns the stored response. If no cached prompt meets the threshold, APIM forwards the request to the AI backend as normal, and the store policy in the outbound section writes the response to Redis for future lookups.
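The hit-or-miss decision can be sketched in a few lines. This is a toy model, not APIM's implementation: it assumes the score is the cosine distance (1 minus cosine similarity), uses two-dimensional vectors in place of real embeddings, and the function names are illustrative.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_lookup(
    cache: List[Tuple[Vector, str]], prompt_vec: Vector, score_threshold: float
) -> Optional[str]:
    """Return the closest cached response if its score is below the threshold, else None."""
    best_score: Optional[float] = None
    best_response: Optional[str] = None
    for cached_vec, response in cache:
        # Assumption: score = cosine distance, so smaller means more similar.
        score = 1.0 - cosine_similarity(prompt_vec, cached_vec)
        if best_score is None or score < best_score:
            best_score, best_response = score, response
    if best_score is not None and best_score < score_threshold:
        return best_response  # cache hit: the backend is never called
    return None  # cache miss: forward to the backend, then store the response


# Toy data: one cached prompt embedding and its stored response.
cache = [([1.0, 0.0], "Your balance is shown on the Accounts page.")]
near_prompt = [0.99, 0.05]  # almost the same direction as the cached vector
far_prompt = [0.0, 1.0]     # orthogonal to the cached vector

print(semantic_lookup(cache, near_prompt, score_threshold=0.05))
print(semantic_lookup(cache, far_prompt, score_threshold=0.05))
```

The first call returns the stored response because the two vectors point in nearly the same direction. The second returns None, which corresponds to the path where APIM forwards the request to the backend.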

Diagram 1: Semantic cache request flow. On a cache hit, APIM returns a stored response directly — consuming zero tokens. On a miss, APIM forwards to the AI backend and stores the response in Azure Managed Redis for future hits.

The generic variant, llm-semantic-cache-lookup, works identically for non-Azure backends. Both require the same supporting infrastructure: an embedding model backend and an Azure Managed Redis instance configured in APIM. The matching store policy (azure-openai-semantic-cache-store or llm-semantic-cache-store) sits in the outbound section and writes responses back to the cache.

Tuning the Score Threshold for Azure API Management Semantic Caching

The score-threshold attribute is the most consequential configuration decision in the semantic caching policy. It controls how similar an incoming prompt must be to a cached prompt for APIM to treat it as a hit. The value runs from 0.0 to 1.0, but the practical range is much narrower.

Diagram 2: Score threshold tuning guide and vary-by scope strategies. Lower thresholds require a closer match, so they cache more conservatively. A starting value of 0.05 suits most production workloads. A global cache (no vary-by) maximizes hit rate but risks serving one user’s response to another.

In practice, three zones matter. Because APIM returns a hit only when the score falls below the threshold, lower values are stricter.

0.01 to 0.05 (conservative). Only prompts that are very close in wording and meaning produce hits. Creative workloads, code generation, and document drafting tend to have high prompt variance, so a conservative threshold avoids serving cached responses to genuinely different requests. The common starting value of 0.05 sits at the top of this range.

0.05 to 0.20 (aggressive). Looser paraphrases, such as “What is my account balance?” and “Can you show me my current balance?”, become more likely to produce hits. This range suits FAQ bots, support agents, and any workload where users ask the same questions in slightly different words. Where paraphrases start to match depends on the embedding model, so verify with your own prompts.

Above 0.30 (too loose). Prompts that are only loosely related begin to match, and the cache starts returning answers to questions that were never asked. Avoid this range for anything user-facing.

Start at 0.05 and monitor cache hit rates in Application Insights. If the hit rate is low for a workload you expect to be repetitive, raise the threshold incrementally. If you start seeing complaints about incorrect or mismatched responses, lower it.

vary-by Scope: Preventing Cache Pollution

The vary-by element scopes the cache namespace. Without it, all consumers share a single global cache. That maximizes the hit rate but introduces a significant risk: APIM could serve one user’s cached response to a different user. For most enterprise AI workloads, that is unacceptable.

The safest default is to vary by Subscription ID, which gives each API subscriber their own cache namespace. This prevents cross-team cache pollution while still achieving high hit rates within each subscriber’s own prompt patterns. For multi-tenant applications where individual users have distinct contexts, vary by a user identifier extracted from the JWT or a custom header instead.

A global cache with no vary-by is appropriate only for fully public, stateless APIs where responses are identical regardless of who requests them. Internal enterprise AI workloads rarely meet that bar.
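One way to picture vary-by is as a partition key in front of the cache. The sketch below is a toy model, not APIM's implementation: PartitionedCache and its methods are hypothetical names, and exact string matching stands in for the semantic lookup.

```python
from collections import defaultdict
from typing import Dict, Optional


class PartitionedCache:
    """Toy cache whose entries are only visible within their own vary-by partition."""

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[str, str]] = defaultdict(dict)

    def store(self, vary_by: str, prompt: str, response: str) -> None:
        self._partitions[vary_by][prompt] = response

    def lookup(self, vary_by: str, prompt: str) -> Optional[str]:
        # .get() avoids creating an empty partition on a miss.
        return self._partitions.get(vary_by, {}).get(prompt)


cache = PartitionedCache()
# Hypothetical subscription IDs used as the vary-by value.
cache.store("sub-team-a", "What is our refund policy?", "Team A's cached answer")

print(cache.lookup("sub-team-a", "What is our refund policy?"))  # same partition
print(cache.lookup("sub-team-b", "What is our refund policy?"))  # different partition
```

The second lookup misses even though the prompt is identical, because it runs in a different partition. Passing the same constant key for every caller would model the global cache, where any consumer can receive any other consumer's stored response.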

Infrastructure Requirements for Azure API Management Semantic Caching

Semantic caching requires two supporting Azure resources beyond APIM itself. The first is an Azure Managed Redis instance configured as an external cache in APIM. Redis stores the prompt embeddings and cached responses. The store policy's duration attribute sets the cache TTL in seconds, so you control how long responses remain valid before APIM re-queries the backend.

The second is an embeddings model backend registered in APIM. For Azure OpenAI, this is typically a separate deployment of text-embedding-ada-002 or text-embedding-3-small. The lookup policy references it through the embeddings-backend-id attribute. Because it is separate from your completions backend, you can apply independent token limits and load balancing to the embeddings traffic.

One practical consideration: the embeddings call itself consumes tokens and adds a small amount of latency on every request, whether or not the cache hits. For workloads with very low prompt repetition, the overhead of generating embeddings for every request may outweigh the savings from occasional cache hits. Measure the hit rate before committing the infrastructure cost.
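A rough break-even check makes this trade-off concrete. The sketch assumes every request pays for one embeddings call and every hit avoids one completion call. The per-request costs are hypothetical placeholders, and the fixed cost of the Redis instance is deliberately left out.

```python
def net_saving_per_request(hit_rate: float, completion_cost: float, embedding_cost: float) -> float:
    """Average saving per request: avoided completion spend minus the embeddings overhead."""
    return hit_rate * completion_cost - embedding_cost


def break_even_hit_rate(completion_cost: float, embedding_cost: float) -> float:
    """Hit rate at which avoided completion spend exactly covers the embeddings overhead."""
    if completion_cost <= 0:
        raise ValueError("completion_cost must be positive")
    return embedding_cost / completion_cost


# Hypothetical per-request costs in dollars, for illustration only.
COMPLETION_COST = 0.01
EMBEDDING_COST = 0.0001

threshold_rate = break_even_hit_rate(COMPLETION_COST, EMBEDDING_COST)
saving_at_20_percent = net_saving_per_request(0.20, COMPLETION_COST, EMBEDDING_COST)

print(f"break-even hit rate: {threshold_rate:.2%}")
print(f"net saving per request at a 20% hit rate: ${saving_at_20_percent:.4f}")
```

Multiply the net saving by your request volume and compare it with the monthly cost of the Redis instance to see whether the cache pays for itself.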

What’s Next in This Azure API Management for AI Series

Part 7 closes the series by covering APIM’s emerging role as an MCP gateway for agentic AI workloads: how to expose REST APIs as MCP servers, pass through existing MCP servers, and manage agent-to-agent traffic through the same control plane we’ve built across this series.
