AI Is Reshaping Software Development — But Are We Paying Attention to What We’re Losing?

Posted on February 28, 2026 by steefjan1970

February has been a busy month for me at InfoQ. I wrote three articles that, on the surface, cover different topics: skill formation, open-source sustainability, and Agile methodology. But when I stepped back and looked at them together, a pattern jumped out at me. Each one tells a piece of the same story: AI is transforming how we build software at a pace that exceeds our ability to think about the consequences.

I want to use this post to connect the dots.

The Skill Problem Nobody Wants to Talk About

The first piece I wrote covered an Anthropic study on how AI coding assistance affects skill development. The research was a randomized controlled trial with 52 junior engineers learning a Python library called Trio, which none of them had used before. The findings were stark. Developers who used AI assistance scored 17 percent lower on comprehension tests compared to those who coded by hand. That gap is roughly equivalent to two letter grades.

What struck me most wasn’t the headline number, though. It was the nuance underneath. Participants who used AI as a thinking partner, asking conceptual questions, requesting explanations, and working through problems alongside the tool, retained far more knowledge than those who asked the AI to generate code for them. The dividing line sat around a 65 percent score threshold. Above it were the curious developers. Below it were the ones who had delegated the thinking.

I’ve been working in IT for a long time. I’ve seen junior engineers grow into senior architects, and the path always involved struggle. Debugging code you don’t understand at 11 PM on a Tuesday. Reading documentation that makes your eyes glaze over. Writing something that breaks, then figuring out why. That struggle is where the learning happens. What concerns me is not that AI exists (I use it daily and find it genuinely helpful), but that we might be removing the friction that develops competence in the first place.

The full article is here: Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%

Open Source Is Drowning in AI Slop

My second article examined a problem I’ve been watching develop for months. Daniel Stenberg shut down cURL’s bug bounty after AI-generated submissions reached 20 percent of the total. Mitchell Hashimoto banned AI-generated code from Ghostty entirely. Steve Ruiz took it even further with tldraw, auto-closing all external pull requests. These aren’t fringe projects. cURL runs on billions of devices. These are maintainers reaching a breaking point.

RedMonk analyst Kate Holterhoff coined the term “AI Slopageddon” for what’s happening, and it captures the reality well. AI-generated contributions look plausible at first glance but fall apart on inspection. The problem isn’t just quality; it’s volume. Maintainers are human beings with limited time, and they’re now spending that time sifting through submissions that an AI produced in seconds without any real understanding of the project.

A research paper from the Central European University and the Kiel Institute for the World Economy modeled the bigger structural risk here. Open-source projects depend on user engagement, documentation views, bug reports, and community recognition as a return on the maintainer’s investment. When AI agents assemble packages without developers ever reading the docs or filing bugs, that feedback loop breaks. The researchers modeled a “Spotify-style” revenue redistribution, but the numbers didn’t work: vibe-coded users would need to generate 84 percent of the engagement that direct users currently provide. That’s not realistic.

I keep thinking about this one. My entire career has been built on open source, from the tools I integrate at work to the libraries I rely on for InfoQ articles. If the ecosystem that produces and maintains these tools becomes unsustainable because AI-generated noise overwhelms the people doing the actual work, we all lose. Not eventually. Soon.

More details here: AI “Vibe Coding” Threatens Open Source as Maintainers Face Crisis.

Is the Agile Manifesto Dead? Not So Fast.

The third article I wrote covered a debate sparked by Steve Jones, an executive VP at Capgemini, who declared that AI has killed the Agile Manifesto. His argument: when agentic SDLC systems can build applications in hours, the Manifesto’s human-centric principles no longer apply. If the tooling matters as much as or more than the people using it, then the Manifesto’s preference for “individuals and interactions over processes and tools” breaks down.

It’s a provocative claim that generated a lot of discussion. Casey West proposed an “Agentic Manifesto” that shifts the focus from verification to validation. AWS’s 2026 prescriptive guidance suggests “Intent Design” should replace sprint planning. Kent Beck, one of the original Manifesto signatories, has been talking about “augmented coding” as a new paradigm.

But here’s the counterpoint that keeps sticking with me. Forrester’s 2025 State of Agile Development report found that 95 percent of professionals still consider Agile critically relevant to their work. That’s not a methodology on its deathbed. And as one commenter noted in the discussion thread, bureaucracy killed Agile long before AI agents came along.

I think the question isn’t whether the Agile Manifesto is obsolete. It’s whether we’ve ever fully lived by its principles in the first place. The Manifesto says “responding to change over following a plan.” If there’s ever been a moment that demands responsiveness and adaptation, it’s right now. The irony of declaring Agile dead precisely when we need its core philosophy the most isn’t lost on me.

Full article: Does AI Make the Agile Manifesto Obsolete?

Connecting the Threads

When I look at these three stories together, I see a common tension. AI is accelerating what we can measure (lines of code produced, pull requests submitted, applications prototyped) while eroding what is harder to quantify. Deep understanding of a codebase. Thoughtful engagement with an open-source community. The human judgment that sits at the heart of iterative development.

The Anthropic study shows that speed and learning pull in opposite directions, at least for developers acquiring new skills. The open-source crisis tells us that volume and quality are diverging at an alarming rate. The Agile debate tells us that our existing frameworks for organizing human work are straining under the weight of AI-driven change.

None of this means we should reject AI tools. I certainly won’t. But I think we need to be far more intentional about how we deploy them. That means designing AI assistants that support learning rather than replace it. It means building platforms that protect maintainers from low-quality noise. It means evolving our methodologies rather than abandoning them.

I have spent years exploring new technologies; it’s one of the things I enjoy most about working in this field. I remain optimistic about where AI can take us. But optimism without caution is just naivety. The choices we make in the next year or two about how AI integrates into our development practices will shape the industry for a decade.

We should probably pay attention.

AWS European Sovereign Cloud Launches—But Does It Solve the Real Problem?

Posted on February 6, 2026 by steefjan1970

Earlier, AWS officially launched its European Sovereign Cloud, backed by a €7.8 billion investment in Brandenburg, Germany. The infrastructure is physically and logically separated from AWS global regions, managed by a new German parent company (AWS European Sovereign Cloud GmbH), and staffed exclusively by EU residents. On paper, it checks every compliance box for data residency and operational sovereignty. AWS CEO Matt Garman called it “a big bet” for the company—and it is. The question is whether it’s the right bet for Europe.

The Technical Reality: Real Isolation, Real Trade-offs

The technical separation is genuine. An AWS engineer who deployed services to the European Sovereign Cloud confirmed on Hacker News that proper boundaries exist—U.S.-based engineers can’t see anything happening in the sovereign cloud. To fix issues there, they play “telephone” with EU-based engineers. The infrastructure uses partition name *aws-eusc* and region name *eusc-de-east-1*, completely separate from AWS’s global regions. All components—IAM, billing systems, Route 53 name servers using European Top-Level Domains—remain within EU borders.
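One practical consequence of the separate partition is worth spelling out: resource identifiers in the sovereign cloud carry the `aws-eusc` partition, so tooling that hardcodes `arn:aws:` will not resolve resources there. Below is a minimal sketch, assuming the standard ARN layout `arn:partition:service:region:account-id:resource`. The account ID, the queue name, and the `build_arn` helper are my own placeholders, not anything from AWS documentation.

```python
def build_arn(service: str, resource: str, *, partition: str = "aws",
              region: str = "", account_id: str = "") -> str:
    """Assemble an ARN using the standard colon-separated layout."""
    return f"arn:{partition}:{service}:{region}:{account_id}:{resource}"


# Placeholder account ID and resource name, for illustration only.
global_arn = build_arn("sqs", "orders-queue",
                       region="eu-central-1", account_id="123456789012")
sovereign_arn = build_arn("sqs", "orders-queue", partition="aws-eusc",
                          region="eusc-de-east-1", account_id="123456789012")

print(global_arn)
print(sovereign_arn)
```

Keeping the partition a parameter rather than a literal is the kind of small design choice that makes a later move between partitions less painful.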

But this isolation comes with costs. As that same engineer warned, “it really slows down debugging issues. Problems that would be fixed in a day or two can take a month.” This is the sovereignty trade-off in practice: more control, less velocity. The service launches with approximately 90 AWS services, not the full catalog. Plans exist to expand into sovereign Local Zones in Belgium, the Netherlands, and Portugal, but this remains a subset of what AWS offers globally.

For some workloads, this trade-off makes sense. For others, it’s a deal-breaker.

The Legal Problem: U.S. Ownership, U.S. Jurisdiction

Here’s the uncomfortable truth that AWS’s marketing carefully sidesteps: technical isolation doesn’t create legal isolation. AWS, headquartered in America, remains subject to U.S. jurisdiction. The CLOUD Act allows U.S. authorities to compel U.S.-based technology companies to provide data, regardless of where it is stored globally. Courts can require parent companies to produce data held by subsidiaries.

This isn’t theoretical hand-waving. Microsoft had to admit in a French court that it cannot guarantee data sovereignty for EU customers. When Airbus executive Catherine Jestin discussed AWS’s sovereignty claims with lawyers late last year, she said: “I still don’t understand how it is possible” for AWS to be immune to extraterritorial laws.

Cristina Caffarra, founder of the Eurostack Foundation and competition economist, puts it bluntly:

A company subject to the extraterritorial laws of the United States cannot be considered sovereign for Europe. That simply doesn’t work.

The AWS response focuses on technical controls—encryption, the Nitro System preventing employee access, and hardware security modules. These are important safeguards, but they don’t address the core legal issue. If a U.S. court orders Amazon.com Inc. to produce data, technical barriers become legal obstacles the parent company must overcome, not protections.

Europe’s Broader Response: The Cloud and AI Development Act

AWS’s launch comes as Europe finalizes its own legislative response. The EU Cloud and AI Development Act, expected in Q1 2026, aims to strengthen Europe’s autonomy over cloud infrastructure and data. As Christoph Strnadl, CTO of Gaia-X, explains:

For critical data, you will never, ever use a US company. Sovereignty means having strategic options — not doing everything yourself.

The Act is part of the EU’s Competitiveness Compass and addresses a fundamental problem: Europe’s 90% dependency on non-EU cloud infrastructure, predominantly American companies. This dependency isn’t just about data residency—it’s about strategic autonomy. When essential services depend on infrastructure governed by foreign law, questions arise about jurisdiction, resilience, and what happens during geopolitical disruption.

Current estimates indicate that AWS, Microsoft Azure, and Google Cloud collectively control over 60% of the European cloud market. European providers account for only a small share of revenues. The Cloud and AI Development Act aims to establish minimum criteria for cloud services in Europe, mobilize public and private initiatives for AI infrastructure, and create a single EU-wide cloud policy for public administrations and procurement.

Importantly, Brussels isn’t seeking to ban non-EU providers. As Strnadl notes:

Sovereignty does not mean you have to do everything yourself. Sovereignty means that for critical things, you have strategic options.

The Gaia-X Lesson: Sovereignty Washing

Europe has been down this path before. Gaia-X, launched in 2019, was intended to create a trustworthy European data infrastructure. Then American companies lobbied to be included. Once Microsoft, Google, and AWS were inside, critics argue, Gaia-X lost its purpose. The fear now is that AWS’s European Sovereign Cloud represents sophisticated “sovereignty washing”: placing datacenters on European soil without resolving the fundamental legal issue.

Recent European actions suggest growing awareness of this problem. Austria, Germany, France, and the International Criminal Court in The Hague are taking concrete steps toward genuine digital independence. These aren’t just policy statements—they’re actual migrations away from U.S. hyperscalers toward European alternatives.

The Market Reality: No Complete Migration in 2026

Forrester predicts that no European enterprise will fully shift away from U.S. hyperscalers in 2026, citing geopolitical tensions, volatility, and new legislation, such as the EU AI Act, as barriers. The scale of dependency is too deep, the feature gap too wide, and the migration costs too high for rapid change.

Gartner forecasts European IT spending will grow 11% in 2026 to $1.4 trillion, with 61% of European CIOs and tech leaders wanting to increase their use of local cloud providers. Around half (53%) said geopolitical factors would limit their use of global providers in the future. The direction is clear, even if the pace remains uncertain.

This creates a transitional period where organizations must make pragmatic choices. For non-critical workloads, AWS’s European Sovereign Cloud may be sufficient. For truly sensitive data—government communications, defense systems, critical infrastructure—organizations need genuinely European alternatives: Hetzner, Scaleway, OVHCloud, StackIT by Schwarz Digits.

What AWS Actually Delivers

Let’s be precise about what AWS European Sovereign Cloud achieves. It provides:

Data residency within the EU
Operational control by EU residents
Governance through EU-based legal entities
Technical isolation from the global AWS infrastructure
An advisory board of EU citizens with independent oversight

What it doesn’t provide is independence from U.S. legal jurisdiction. For compliance requirements focused purely on data residency and operational transparency, this may be sufficient. For organizations requiring protection from U.S. government data requests, it fundamentally isn’t.

As Eric Swanson from CarMax noted in a LinkedIn post:

Sovereign cloud offerings do not override the Patriot Act. They mainly reduce overlap across other contexts: data location, operational control, employee access, and customer jurisdiction.

Looking Forward: Strategic Autonomy, Not Autarky

Europe’s path forward isn’t about digital isolationism. As Strnadl emphasizes, technology adoption that involves a paradigm shift doesn’t happen in two years. The challenge is adoption, not frameworks. “Cooperation needs trust,” he says, “and trust needs a trust framework.”

The Cloud and AI Development Act, expected this quarter, will provide that framework. It will set minimum criteria, promote interoperability, and establish procurement rules that favor sovereignty for critical workloads. The question for organizations is: what constitutes critical?

For email, public administration, political communication, defense systems—the answer should be obvious. These require European alternatives. For other workloads, AWS’s European Sovereign Cloud may strike an acceptable balance between capability and control.

The Bottom Line

AWS’s €7.8 billion investment is real. The technical isolation is real. The economic contribution to Germany’s GDP (€17.2 billion over 20 years) is real. What’s also real is that Amazon.com Inc., a U.S. company, ultimately controls this infrastructure and remains subject to U.S. law.

For organizations seeking compliance checkboxes and data residency guarantees, AWS European Sovereign Cloud delivers. For organizations requiring genuine independence from U.S. legal jurisdiction, it remains fundamentally insufficient. That’s not a criticism of AWS’s engineering—it’s a statement of legal reality.

The sovereignty question Europe faces isn’t technical. It’s strategic: do we accept managed dependency or build genuine autonomy? AWS offers the former. Only European alternatives can provide the latter.

The market will decide which answer matters more.

From Rigid Choreography to Intelligent Collaboration: Agentic Orchestration as the Evolution of SOA

Posted on January 25, 2026 by steefjan1970

For decades, my friends and I, along with many other integration professionals, worked in the trenches of integration, shaping the digital backbone of enterprises. From the heady days of EAI to the structured elegance of SOA, and the agile pragmatism of microservices, our quest has remained constant: how do we weave disparate capabilities into a cohesive, valuable whole?

We built the bridges, the highways, and the intricate railway networks of the digital world. Yet, let’s be honest—for all our sophistication, our orchestrations often felt like a meticulous, rigid dance.

Enter Agentic Orchestration. This isn’t just another buzzword. It’s a profound shift, an evolution that takes the core principles of SOA and infuses them with intelligence, dynamism, and a remarkable degree of autonomy. For the seasoned integration architect and engineer, this isn’t about replacing what we know—it’s about enhancing it, elevating it to a new plane of capability.

The Deterministic Dance of SOA Composites

Cast your mind back to the golden age of SOA. For those of us in the Microsoft ecosystem, this meant nearly two and a half decades with BizTalk Server—our workhorse, our battleground, our canvas. We diligently crafted composite services using orchestration designers, adapters, and pipelines. Others wielded BPEL and ESBs, but the principle was the same. Our logic was clear, explicit, and, crucially, deterministic.

If a business process required validating a customer, then checking inventory, and finally processing an order, we laid out that sequence with unwavering precision—whether in BizTalk’s visual orchestration designer or in BPEL code:

XML

<bpel:sequence name="OrderFulfillmentProcess">
  <bpel:invoke operation="validateCustomer" partnerLink="CustomerService"/>
  <bpel:invoke operation="checkInventory" partnerLink="InventoryService"/>
  <bpel:invoke operation="processPayment" partnerLink="PaymentService"/>
</bpel:sequence>

Those of us who spent years with BizTalk know this dance intimately: the Receive shapes, the Decision shapes, the carefully constructed correlation sets, the Scope shapes wrapped around every potentially fragile operation. We debugged orchestrations at 2 AM, optimized dehydration points, and became masters of the Box-Line-Polygon visual language.

This approach delivered immense value. It brought order to chaos, enabled service reuse, and provided a clear, auditable trail. However, its strength was also its weakness: rigidity. Any deviation or unforeseen circumstance required a developer to step in, modify the orchestration, and redeploy. The system couldn’t “think” its way around a problem; it merely executed a predefined script, like a well-choreographed ballet: beautiful, but with no room for improvisation.

The Rise of the Intelligent Collaborator: Agentic Orchestration

Now, imagine an orchestration that doesn’t just execute a script, but reasons. An orchestration where the “participants” are not passive services waiting for an instruction, but intelligent agents equipped with goals, memory, and a suite of “tools”—which, for us, are often our existing services and APIs.

This is the essence of agentic orchestration. It shifts from a predefined, top-down command structure to a more collaborative, goal-driven paradigm. Instead of meticulously charting every step, we define the desired outcome and empower intelligent agents to find the best path to it.

Think of it as moving from a detailed project plan (SOA) to giving a highly skilled project manager (the Orchestrator Agent) a clear objective and a team of specialists (worker agents, each with specific skills/tools).

Key Differences that Matter

From Fixed Sequence to Dynamic Planning:

SOA: “Execute Step A, then Step B, then Step C.”
Agentic: “Achieve Goal X. What tools do I have? Which one is best for this step? What’s my next logical action?” Agents dynamically construct their plan based on the current context and available resources.

From Explicit Error Handling to Self-Correction:

SOA: We built elaborate try-catch blocks for every potential failure. In BizTalk, we wrapped Scope shapes around Scope shapes, each with its own exception handler.
Agentic: If an agent tries a tool and it fails, it can reason why it failed, perhaps try a different tool, consult another agent, or even adjust its plan. This isn’t magic—it’s the underlying Large Language Model doing what it does best: problem-solving within constraints.

From API Contracts to Intent-Based Communication:

SOA: Services communicate via strict, often verbose, XML or JSON contracts. We spent countless hours on schema design and message transformation.
Agentic: Agents communicate intents. An “Order Fulfillment Agent” can tell a “Shipping Agent” in natural language (or a structured representation of intent), “Ship this package to customer X by date Y.” The Shipping Agent, understanding the intent, then uses its own tools (e.g., FedEx API, DHL API) to achieve that goal. The complexity of the underlying service calls is abstracted away.

From Static Connectors to Smart Tools:

SOA: Connectors and adapters are fixed pathways requiring explicit configuration for each scenario. Remember configuring BizTalk adapters for each specific integration point?
Agentic: Our existing APIs, databases, message queues, and even legacy systems become “tools” that agents can discover and wield intelligently. A Logic App connector to SAP is no longer just a connector—it’s a powerful SAP tool that an agent can learn to use when needed. The Model Context Protocol (MCP) is making this discovery even more seamless.
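To make the intent and tool ideas above a little more tangible, here is a minimal sketch in plain Python. It is not the Logic Apps or MCP API. `Intent`, `Tool`, `ToolRegistry`, and the two carrier functions are hypothetical names I made up, and the single urgency rule stands in for the reasoning an LLM would do over the tool descriptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Intent:
    """A structured statement of what the caller wants, not how to do it."""
    goal: str
    params: dict = field(default_factory=dict)


@dataclass
class Tool:
    name: str
    description: str
    goals: set[str]                    # goals this tool can help achieve
    run: Callable[[dict], str]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: list[Tool] = []

    def register(self, tool: Tool) -> None:
        self._tools.append(tool)

    def discover(self, goal: str) -> list[Tool]:
        """Return every registered tool that advertises support for the goal."""
        return [t for t in self._tools if goal in t.goals]


# Hypothetical carrier wrappers; real ones would call the carriers' APIs.
def ship_express(params: dict) -> str:
    return f"express shipment booked for {params['customer']}"


def ship_economy(params: dict) -> str:
    return f"economy shipment booked for {params['customer']}"


registry = ToolRegistry()
registry.register(Tool("express_carrier", "1-2 day delivery", {"ship_package"}, ship_express))
registry.register(Tool("economy_carrier", "5-7 day delivery", {"ship_package"}, ship_economy))


def shipping_agent(intent: Intent, registry: ToolRegistry) -> str:
    """Pick a tool for the intent. A real agent would let an LLM weigh the
    tool descriptions; this stand-in applies one urgency rule."""
    candidates = registry.discover(intent.goal)
    if not candidates:
        raise LookupError(f"no tool can handle goal {intent.goal!r}")
    urgent = intent.params.get("days_until_due", 99) <= 2
    wanted = "express_carrier" if urgent else "economy_carrier"
    tool = next((t for t in candidates if t.name == wanted), candidates[0])
    return tool.run(intent.params)


result = shipping_agent(
    Intent("ship_package", {"customer": "X", "days_until_due": 1}), registry
)
print(result)
```

The caller never names a carrier. It states the goal and the constraints, and the receiving agent decides which of its tools fits.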

A Concrete Example

Consider an order that fails the inventory check in our traditional BPEL or BizTalk orchestration. In SOA, that means a hard stop: send an error notification, await human intervention, and redesign the process.

In an agentic system, the orchestrator agent might dynamically query alternate suppliers, adjust delivery timelines based on customer priority, suggest product substitutions, or even negotiate partial fulfillment—all without hardcoded logic for each scenario. The agent reasons about the business goal (fulfill the customer order) and uses available tools to achieve it, adapting to circumstances we never explicitly programmed for.
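The control flow behind that behavior can be sketched in a few lines. This is a toy under stated assumptions: the three tool functions and their failure messages are invented, and `choose_next_action` is a deterministic stand-in for the step where an LLM-backed agent would read the recorded errors and decide what to try next.

```python
from typing import Callable, Optional


# Hypothetical tools. Each returns a result string or raises on failure.
def check_inventory(order: dict) -> str:
    raise RuntimeError("out of stock in primary warehouse")


def query_alternate_supplier(order: dict) -> str:
    raise RuntimeError("alternate supplier cannot deliver in time")


def offer_substitution(order: dict) -> str:
    return f"offered substitute for {order['sku']}"


TOOLS: dict[str, Callable[[dict], str]] = {
    "check_inventory": check_inventory,
    "query_alternate_supplier": query_alternate_supplier,
    "offer_substitution": offer_substitution,
}


def choose_next_action(history: list[tuple[str, str]]) -> Optional[str]:
    """Stand-in for the reasoning step: pick the first tool not yet tried.
    An LLM-backed agent would instead weigh the errors in `history`."""
    tried = {name for name, _ in history}
    return next((name for name in TOOLS if name not in tried), None)


def fulfil(order: dict, max_steps: int = 5) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Try tools toward the goal, recording each observation, until one succeeds."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        action = choose_next_action(history)
        if action is None:
            break
        try:
            outcome = TOOLS[action](order)
        except RuntimeError as err:
            history.append((action, f"failed: {err}"))
            continue
        history.append((action, outcome))
        return outcome, history
    return None, history  # nothing worked: escalate to a human


outcome, history = fulfil({"sku": "widget", "quantity": 5})
print(outcome)
for step in history:
    print(step)
```

The step cap and the explicit "escalate to a human" exit matter as much as the retries. An agent that can adapt also needs a bounded way to give up.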

Azure Logic Apps: The Bridge to the Agentic Future

Azure Logic Apps demonstrates this evolution in practice, and it’s particularly compelling for integration professionals. For those of us coming from the BizTalk world, Logic Apps already felt familiar—the visual designer, the connectors, the enterprise reliability. Now, we’re not throwing away our decades of experience with these patterns. Instead, we’re adding an “intelligence layer” on top.

The Agent Loop within Logic Apps, with its “Think-Act-Reflect” cycle, transforms our familiar integration canvas into a dynamic decision-making engine. We can build multi-agent patterns—agent “handoffs” in which one agent completes a task and passes it to another, or “evaluator-optimizer” setups in which one agent generates a solution and another critiques and refines it.
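The evaluator-optimizer pattern is easy to sketch outside any particular product. The version below is plain Python, not Logic Apps configuration. `generate` and `evaluate` are hypothetical stand-ins for two LLM-backed agents, and the feedback strings are invented for the example.

```python
def generate(task: str, feedback: list[str]) -> str:
    """Stand-in generator agent: drafts a plan and applies feedback received so far."""
    draft = f"Plan for {task}: ship from main warehouse"
    if "add fallback supplier" in feedback:
        draft += "; fall back to alternate supplier if stock is short"
    if "add customer notification" in feedback:
        draft += "; notify the customer of the delivery date"
    return draft


def evaluate(draft: str) -> list[str]:
    """Stand-in evaluator agent: returns requested improvements (empty means accept)."""
    issues = []
    if "fall back" not in draft:
        issues.append("add fallback supplier")
    if "notify" not in draft:
        issues.append("add customer notification")
    return issues


def evaluator_optimizer(task: str, max_rounds: int = 3) -> tuple[str, bool]:
    """Alternate generation and critique until the evaluator accepts or rounds run out."""
    feedback: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        issues = evaluate(draft)
        if not issues:
            return draft, True
        feedback.extend(issues)
    return draft, False  # best effort, not accepted


draft, accepted = evaluator_optimizer("order 42")
print(accepted)
print(draft)
```

Returning the acceptance flag alongside the draft lets the calling workflow decide whether to proceed or route the result for human review.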

All this, while leveraging the robust, enterprise-ready connectors we already depend on. Our existing investments in integration infrastructure don’t become obsolete; they become more powerful. The knowledge we gained from debugging BizTalk orchestrations, understanding message flows, and designing for reliability? All of that remains valuable. Microsoft is simply upgrading our toolkit.

The Path Forward: Embrace the Evolution

For integration engineers and architects, this is not a threat but an immense opportunity. We are uniquely positioned to lead this charge. We understand the nuances of enterprise systems, the criticality of data integrity, and the challenges of connecting disparate technologies. Those of us who survived the BizTalk years are battle-tested—we know what real-world integration demands.

Agentic orchestration frees us from the burden of explicit, step-by-step programming for every conceivable scenario. It allows us to design systems that are more resilient, more adaptive, and ultimately, more intelligent. It enables us to build solutions that don’t just execute business processes but actively participate in achieving business outcomes.

Start small: Identify one rigid orchestration in your current architecture that would benefit from adaptive decision-making. Perhaps it’s an order fulfillment process with too many exception handlers, or a customer onboarding workflow that breaks when regional requirements change. That’s your first candidate for agentic enhancement.

Let’s cast aside the notion of purely deterministic choreography. Let us instead embrace the era of intelligent collaboration, where our meticulously crafted services become the powerful tools in the hands of autonomous, reasoning agents.

The evolution is here. It’s time to orchestrate a smarter future.

Digital Destiny: Navigating Europe’s Sovereignty Challenge – A Framework for Control

Posted on November 9, 2025 by steefjan1970

With the geopolitical changes since Trump took office, I’ve been following developments in digital sovereignty and have seen the industry’s response to Europe’s strategic demands through various InfoQ news items.

Today, Europe and the Netherlands find themselves at a crucial junction, navigating the complex landscape of digital autonomy. The recent introduction of the EU’s new Cloud Sovereignty Framework is the clearest signal yet that the continent is ready to take back control of its digital destiny.

This isn’t just about setting principles; it’s about introducing a standardized, measurable scorecard that will fundamentally redefine cloud procurement.

The Digital Predicament: Why Sovereignty is Non-Negotiable

The digital revolution has brought immense benefits, yet it has also positioned Europe in a state of significant dependency. Approximately 80% of our digital infrastructure relies on foreign companies, primarily American cloud providers. This dependence is not merely a matter of convenience; it’s a profound strategic vulnerability.

The core threat stems from U.S. legislation such as the CLOUD Act, which grants American law enforcement the power to request data from U.S. cloud service providers, even if that data is stored abroad. This directly clashes with Europe’s stringent privacy regulations (GDPR) and exposes critical European data to external legal and geopolitical risk.

As we’ve seen with incidents like the Microsoft-ICC blockade, foreign political pressures can impact essential digital services. Geopolitical shifts, such as the second Trump presidency, only amplify this collective awareness: we cannot afford to depend on foreign legislation for our critical infrastructure. The risk is present, and we must build resilience against it.

The Sovereignty Scorecard: From Principles to SEAL Rankings

The new Cloud Sovereignty Framework is the EU’s proactive response. It shifts the discussion from abstract aspirations to concrete, auditable metrics by evaluating cloud services against eight Sovereignty Objectives (SOVs) that cover legal, strategic, supply chain, and technological aspects.

The result is a rigorous “scorecard.” A provider’s weighted score determines its SEAL ranking (from SEAL-0 to SEAL-4, with SEAL-4 indicating full digital sovereignty). Crucially, this ranking is intended to serve as the definitive minimum assurance factor in government and public sector cloud procurement tenders. The Commission wants to create a level playing field where providers must tangibly demonstrate their sovereignty strengths.
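The arithmetic of a weighted scorecard is simple enough to sketch. Everything numeric below is an assumption for illustration: the weights, the 0 to 4 scale per objective, and the provider's scores are placeholders I made up, not values from the framework. How a score maps to a SEAL level is not covered here; the framework text is the authority on that.

```python
# Illustrative only: weights and scores are placeholders,
# not the values published in the Cloud Sovereignty Framework.
WEIGHTS = {
    "SOV-1": 0.15, "SOV-2": 0.10, "SOV-3": 0.10, "SOV-4": 0.15,
    "SOV-5": 0.20, "SOV-6": 0.15, "SOV-7": 0.10, "SOV-8": 0.05,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-objective scores, normalized by total weight."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same objectives")
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight


# Hypothetical provider, scored 0 (weakest) to 4 (strongest) per objective.
provider_scores = {
    "SOV-1": 2, "SOV-2": 1, "SOV-3": 3, "SOV-4": 3,
    "SOV-5": 2, "SOV-6": 2, "SOV-7": 4, "SOV-8": 3,
}

score = weighted_score(provider_scores, WEIGHTS)
print(round(score, 2))
```

The point of the exercise: a strong result on one heavily marketed objective does not lift the total much if the heavily weighted objectives score low.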

The Duel for Dominance: Hyperscalers vs. European Federation

The framework has accelerated a critical duality in the market: massive, centralized investments by US hyperscalers versus strategic, federated growth by European alternatives.

Hyperscalers Adapt: Deepening European Ties

Global providers are making sovereignty a mandatory architectural and legal prerequisite by localizing their operations and governance.

AWS explicitly responded by announcing its EU Sovereign Cloud unit. This service is structured to ensure data residency and operational autonomy within Europe, explicitly targeting the SOV-3 (Data & AI Sovereignty: The degree of control customers have over their data and AI models, including where data is processed) criteria through physically and logically separated infrastructure and governance.
Google Cloud has also made significant moves, approaching digital sovereignty across three distinct pillars:
- Data Sovereignty (focusing on control over data storage, processing, and access with features like the Data Boundary and External Key Management, EKM, where keys can be held outside Google Cloud’s infrastructure);
- Operational Sovereignty (ensuring local partner oversight, such as the partnership with T-Systems in Germany); and
- Software Sovereignty (providing tools to reduce lock-in and enable workload portability).
To help organizations navigate these complex choices, Google introduced the Digital Sovereignty Explorer, an interactive online tool that clarifies terms, explains trade-offs, and guides European organizations in developing a tailored cloud strategy across these three domains. Furthermore, Google has developed highly specialized options, including Air-Gapped solutions for the defense and intelligence sectors, demonstrating a commitment to the highest levels of security and residency.
Microsoft has demonstrated a profound deepening of its commitment, outlining five comprehensive digital commitments designed to address sovereignty concerns, including:
- Massive Infrastructure Investment: Pledging a 40% increase in European datacenter capacity over two years, which, combined with recent construction, would more than double its European capacity between 2023 and 2027.
- Governance and Resilience: Instituting a “European cloud for Europe” overseen by a dedicated European board of directors (composed exclusively of European nationals) and backed by a “Digital Resilience Commitment” to legally contest any government order to suspend European operations.
- Data Control: Completing the EU Data Boundary project to ensure European customers can store and process core cloud service data within the EU/EFTA.

European Contenders Scale Up

Strategic, open-source European initiatives powerfully mirror this regulatory push:

Virt8ra Expands: The Virt8ra sovereign cloud, which positions itself as a significant European alternative, recently announced a substantial expansion of its federated infrastructure. The platform, coordinated by OpenNebula Systems, added six new cloud service providers, including OVHcloud and Scaleway, significantly broadening its reach and capacity across the continent.
IPCEI Funding: This initiative, leveraging the open-source OpenNebula technology, is part of the Important Project of Common European Interest (IPCEI) on Next Generation Cloud Infrastructure and Services, backed by over €3 billion in public and private funding. This is a clear indicator that the vision for a robust, distributed European cloud ecosystem is gaining significant traction.

Sovereignty Redefined: Resilience and Governance

Industry experts emphasize that the framework embodies a more mature understanding of digital sovereignty. It’s not about isolation (autarky), but about resilience and governance.

Sovereignty is about how an organization is “resilient against specific scenarios.” True sovereignty, in this view, lies in the proven, auditable ability to govern your own digital estate. For developers, this means separating cloud-specific infrastructure code from core business logic to maximize portability, allowing the use of necessary hyper-scale features while preserving architectural flexibility.
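
To make the separation of cloud-specific code from business logic a bit more tangible, here is a minimal sketch in Python. The names (DocumentStore, InMemoryStore, archive_invoice) are hypothetical and only illustrate the idea: the business logic depends on a small interface, and each cloud provider would get its own adapter behind it.

```python
from dataclasses import dataclass, field
from typing import Protocol


class DocumentStore(Protocol):
    """Port: the only storage contract the business logic knows about."""

    def save(self, key: str, body: str) -> None: ...
    def load(self, key: str) -> str: ...


@dataclass
class InMemoryStore:
    """Stand-in adapter (hypothetical). A cloud-specific adapter would
    implement the same two methods using a provider SDK."""

    _items: dict[str, str] = field(default_factory=dict)

    def save(self, key: str, body: str) -> None:
        self._items[key] = body

    def load(self, key: str) -> str:
        return self._items[key]


def archive_invoice(store: DocumentStore, invoice_id: str, total: float) -> str:
    """Core business logic: no provider imports, only the port."""
    key = f"invoices/{invoice_id}"
    store.save(key, f"invoice={invoice_id};total={total:.2f}")
    return key


store = InMemoryStore()
key = archive_invoice(store, "2025-001", 199.5)
print(store.load(key))
```

Swapping providers then means writing a new adapter, while archive_invoice stays untouched.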

The Challenge: Balancing Features with Control

Despite the massive investments and public commitments from all major players, the framework faces two key hurdles:

The Feature Gap: European providers often lack the “huge software suite” and “deep feature integration” of US hyperscalers, which can slow down rapid development. Advanced analytics platforms, serverless computing, and tightly integrated security services often lack direct equivalents at smaller providers. This creates a complex chicken-and-egg problem: large enterprises won’t migrate to European providers because they lack features, but local providers struggle to develop those capabilities without enterprise revenue.
Skepticism and Compliance Complexity: Some analysts fear the framework’s complexity will inadvertently favor the global giants with larger compliance teams. Furthermore, deep-seated apprehension in the community remains, with some expressing the fundamental desire for purely European technological solutions: “I don’t want a Microsoft cloud or AI solutions in Europe. I want European ones.” Some experts suggest that European providers should focus on building something different by innovating with European privacy and control values baked in, rather than trying to catch up with US providers’ feature sets.

My perspective on this situation is that achieving true digital sovereignty for Europe is a complex and multifaceted endeavor. While the commitments from global hyperscalers are significant, the underlying desire for independent, European-led solutions remains strong. It’s about strategic autonomy, ensuring that we, as Europeans, maintain ultimate control over our digital destiny and critical data, irrespective of where the technology originates.

The race is now on. The challenge for the cloud industry is to translate the high-level, technical criteria of the SOVs into auditable, real-world reality to achieve that elusive top SEAL-4 ranking. The battle for the future of Europe’s cloud is officially underway.

The Walking Skeleton and Pipes & Filters: Building Resilient Integration Architectures

Posted on November 7, 2025 by steefjan1970

I’ve spent quite some time in IT doing enterprise integration, and if there’s one truth that consistently holds up, it’s that a solid foundation prevents future disappointment or failure. We’ve all been there: a rush to deliver features on a shaky, unvalidated architecture, leading to months of painful, expensive refactoring down the line.

My experience in retail, where I was involved in rebuilding an integration platform, showed me exactly that. In the world of integration, where you’re constantly juggling disparate systems, multiple data formats, and unpredictable volumes, a solid architecture is paramount. Thus, I always try to build the best solution based on experience rather than on what’s written in the literature.

What is funny to me is that when I built the integration platform, I was, without realizing it, applying patterns like the Walking Skeleton for architectural validation and the Pipes and Filters pattern for resilient, flexible integration flows.

The Walking Skeleton came to my attention when a fellow architect at my current workplace mentioned it, and I realized that this is what I had actually done with my team at the retailer. Hence, I should read some literature from time to time!

The Walking Skeleton: Your Architectural First Step

Before you write a line of business logic, you need to prove your stack works from end to end. The Walking Skeleton is precisely that: a minimal, fully functional implementation of your system’s architecture.

It’s not an MVP (Minimum Viable Product), which is a business concept focused on features; the Skeleton is a technical proof-of-concept focused on connectivity.

Why Build the Skeleton First?

Risk Mitigation: You validate that your major components (UI, API Gateway, Backend Services, Database, Message Broker) can communicate and operate correctly before you invest heavily in complex features.
CI/CD Foundation: By its nature, the Skeleton must run end-to-end. This forces you to set up your CI/CD pipelines early, giving you a working deployment mechanism from day one.
Team Alignment: A running system is the best documentation. Everyone on the team gets a shared, tangible understanding of how data flows through the architecture.

Suppose you’re building an integration platform in the cloud, for example on Azure. In that case, the Walking Skeleton confirms that your service choices, such as Azure Functions and Logic Apps, integrate with your storage, networking, and security layers. Guess what I hope to be doing again in the near future.

Leveraging Pipes and Filters Within the Skeleton

Now, let’s look at what that “minimal, end-to-end functionality” should look like, especially for data and process flow. The Pipes and Filters pattern is ideally suited for building the first functional slice of your integration Skeleton.

The pattern works by breaking down a complex process into a sequence of independent, reusable processing units (Filters) connected by communication channels (Pipes).

How They Map to Integration:

Filters = Single Responsibility: Each Filter performs one specific, discrete action on the data stream, such as:
- Schema Validation
- Data Mapping (XML to JSON)
- Business Rule Enrichment
- Auditing/Logging
Pipes = Decoupled Flow: The Pipes ensure data flows reliably between Filters, typically via a message broker or an orchestration layer.
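
The mapping above can be sketched in a few lines of Python. This is an in-process illustration only: the Filter names and message fields are made up, the mapping step renames dictionary keys rather than converting XML to JSON, and the Pipe is a plain function call where a real platform would typically use a message broker.

```python
import json
from functools import reduce
from typing import Callable

Message = dict
Filter = Callable[[Message], Message]


def validate_schema(msg: Message) -> Message:
    """Filter: reject messages that lack required fields."""
    missing = {"sku", "name"} - msg.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return msg


def map_fields(msg: Message) -> Message:
    """Filter: map source field names to the target format."""
    return {"articleId": msg["sku"], "description": msg["name"]}


def enrich(msg: Message) -> Message:
    """Filter: add a value derived from a (hypothetical) business rule."""
    return {**msg, "channel": "online"}


def audit(msg: Message) -> Message:
    """Filter: log the message and pass it on unchanged."""
    print("audit:", json.dumps(msg, sort_keys=True))
    return msg


def run_pipeline(filters: list[Filter], msg: Message) -> Message:
    """Pipe: hand each Filter's output to the next one."""
    return reduce(lambda current, f: f(current), filters, msg)


article_pipeline = [validate_schema, map_fields, enrich, audit]
result = run_pipeline(article_pipeline, {"sku": "A-100", "name": "Espresso cups"})
```

Because every Filter has the same shape (message in, message out), changing the flow is a matter of editing the list.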

In a serverless environment (e.g., using Azure Functions for the Filters and Azure Service Bus/Event Grid for the Pipes), this pattern delivers immense value:

Composability: Need to change a validation rule? You only update one small, isolated Filter. Need a new output format? You add a new mapping Filter at the end of the pipe.
Resilience: If one Filter fails, the data is typically held in the Pipe (queue/topic), preventing the loss of the entire transaction and allowing for easy retries.
Observability: Each Filter is a dedicated unit of execution. This makes monitoring, logging, and troubleshooting precise: no more “black box” failures.
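
The resilience point can be illustrated with a small simulation. A plain Python deque stands in for the queue or topic here; the retry limit, the dead-letter list, and the failing Filter are assumptions for the sake of the example, not Azure Service Bus behavior.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: a bounded number of delivery attempts
attempts_seen = {"count": 0}


def flaky_mapping_filter(msg: dict) -> dict:
    """Hypothetical Filter that fails on its first two invocations."""
    attempts_seen["count"] += 1
    if attempts_seen["count"] < 3:
        raise RuntimeError("downstream system unavailable")
    return {**msg, "mapped": True}


def drain(pipe: deque, filter_fn, dead_letter: list) -> list:
    """Process every message; on failure the message goes back on the pipe."""
    done = []
    while pipe:
        msg, attempt = pipe.popleft()
        try:
            done.append(filter_fn(msg))
        except RuntimeError:
            if attempt + 1 < MAX_ATTEMPTS:
                pipe.append((msg, attempt + 1))  # held in the pipe for a retry
            else:
                dead_letter.append(msg)  # give up, but keep the message
    return done


pipe = deque([({"sku": "A-100"}, 0)])
dead_letter: list = []
processed = drain(pipe, flaky_mapping_filter, dead_letter)
print(processed, dead_letter)
```

The message survives two failures because it never leaves the pipe until a Filter has handled it successfully.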

The Synergy: Building and Expanding

The real power comes from using the pattern within the process of building and expanding your Walking Skeleton:

Initial Validation (The Skeleton): Select the absolute simplest, non-critical domain (e.g., an Article Data Distribution pipeline, as I have done with my team for retailers). Implement this single, end-to-end flow using the Pipes and Filters pattern. This proves that your architectural blueprint and your chosen integration pattern work together.
Iterative Expansion: Once the Article Pipe is proven, validating the architectural choice, deployment, monitoring, and scaling, you have a template.
- At the retailer, we subsequently built the integration for the Pricing domain by creating a new Pipe that reuses common Filters (e.g., the logging or basic validation Filters).
- Next, we picked another domain by cloning the proven pipeline architecture and swapping in the domain-specific Filters.

You don’t start from scratch; you reapply a proven, validated template across domains. This approach dramatically reduces time-to-market and ensures that every new domain is built on a resilient, transparent, and scalable foundation.
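
As a rough sketch of what reapplying the template can look like, the Python below builds two domain pipelines from the same common Filters. All names and fields are hypothetical; the point is only that a new domain contributes its own Filters and inherits the rest.

```python
from typing import Callable

Filter = Callable[[dict], dict]


def log_filter(msg: dict) -> dict:
    """Common Filter shared by every domain pipeline."""
    print("processing", msg)
    return msg


def basic_validation(msg: dict) -> dict:
    """Common Filter: every message needs an identifier."""
    if "id" not in msg:
        raise ValueError("message has no id")
    return msg


def build_pipeline(domain_filters: list[Filter]) -> list[Filter]:
    """Template: common Filters first, then the domain-specific ones."""
    return [log_filter, basic_validation, *domain_filters]


def run(pipeline: list[Filter], msg: dict) -> dict:
    for f in pipeline:
        msg = f(msg)
    return msg


def article_mapping(msg: dict) -> dict:
    return {**msg, "domain": "article"}


def price_rounding(msg: dict) -> dict:
    return {**msg, "domain": "pricing", "price": round(msg["price"], 2)}


article_pipeline = build_pipeline([article_mapping])
pricing_pipeline = build_pipeline([price_rounding])

article_result = run(article_pipeline, {"id": "A-100"})
pricing_result = run(pricing_pipeline, {"id": "A-100", "price": 4.999})
```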

My advice, based on what I know now and my experience, is not to skip the Skeleton. And don’t build a monolith inside it. When rebuilding an integration platform in Azure, start with the Walking Skeleton and Pipes and Filters for a future-proof, durable enterprise integration architecture.

What architectural pattern do you find most useful when kicking off a new integration project? Drop a comment!

AWS Shifts to a Credit-Based Free Plan, Aligning with Azure and GCP

Posted on August 7, 2025 by steefjan1970

AWS is officially moving away from its long-standing 12-month free tier for new accounts. The new standard, called the Free Account Plan, is a credit-based model designed to eliminate the risk of unexpected bills for new users.

With this new plan, you get:

A risk-free environment for experimenting and building proofs of concept for up to six months.
A starting credit of $100, with the potential to earn another $100 by completing specific exploration activities, such as launching an EC2 instance. This means you can get up to $200 in credits to use across eligible services.
The plan ends after six months or once your credits are entirely spent, whichever comes first. After that, you have a 90-day window to upgrade to a paid plan and restore access to your account and data.

This shift, as Principal Developer Advocate Channy Yun explains, allows new users to get hands-on experience without cost commitments. However, it’s worth noting that some services typically used by large enterprises won’t be available on this free plan.

While some may see this as a step back, I tend to agree with Corey Quinn’s perspective. He writes that this is “a return to product-led growth rather than focusing on enterprise revenue to the exclusion of all else.” Let’s face it: big companies aren’t concerned with the free tier. But for students and hobbyists, who can be seen as the next generation of cloud builders, a credit-based, risk-free sandbox is a much more attractive proposition. The new notifications for credit usage and expiration dates are a smart addition that provides peace of mind.

How the New Plan Compares to Other Hyperscalers

I think this is a helpful plan for those who like to experiment on AWS. The other hyperscalers offer similar plans: Microsoft Azure and Google Cloud Platform (GCP) have long operated on credit-based models.

Azure offers a different model: $200 in credits for the first 30 days, supplemented by over 25 “always free” services and a selection of services available for free for 12 months.
GCP provides a 90-day, $300 Free Trial for new customers, which can be applied to most products, along with an “Always Free” tier that gives ongoing access to core services like Compute Engine and Cloud Storage up to specific monthly limits.

This alignment among the major cloud providers highlights a consensus on the best way to attract and onboard new developers.

Microsoft also offers $100 in Azure credits through Azure for Students. These different models can be confusing: MSDN credits are typically a monthly allowance tied to a specific Visual Studio subscription, whereas the student credits are a lump sum for a particular period (e.g., 12 months).

Speaking of other cloud providers, my own experience with Azure is an excellent example of how these credit models can be beneficial. I receive Azure credits through my MVP benefits, and MSDN subscriptions come with $150 in credits per month. These are different options from the general one I mentioned earlier. Anyway, there are ways to access services from the three big hyperscalers that let you get hands-on experience, in combination with their documentation and what you can find in public repos.

In general, if you would like to learn more about Azure, AWS, or GCP, the following table shows the most straightforward options:

Cloud Hyperscaler	Free Credits	Documentation	Repo (samples)
Azure	Azure Free Account	Microsoft Learn	Azure Samples · GitHub
AWS	AWS Free Tier	AWS Documentation	AWS Samples · GitHub
GCP	GCP Free Trial	Google Cloud Documentation	Google Cloud Platform · GitHub

Decoding Figma’s AWS Spend: Beyond the Hype and Panic

Posted on July 16, 2025 by steefjan1970

Figma’s recent IPO filing revealed a daily AWS expenditure of roughly $300,000, translating to approximately $109 million annually, or 12% of its reported revenue of $821 million. The company also committed to a minimum spend of $545 million over the next five years with AWS. Cue the online meltdown. “Figma is doomed!” “Fire the CTO!” the internet, in its infinite wisdom, declared. I wrote a news item on it for InfoQ and thought, ‘Let’s put things into perspective and add my own experience.’

(Source: Figma.com)

But let’s inject a dose of reality, shall we? As Corey Quinn from The Duckbill Group, who probably sees more AWS invoices than you’ve seen Marvel movies, rightly points out, this kind of spending for a company like Figma is boringly normal.

As Quinn extensively details in his blog post, Figma isn’t running a simple blog. It’s a compute-intensive, real-time collaborative platform serving 13 million monthly active users and 450,000 paying customers. It renders complex designs with sub-100ms latency. This isn’t just about spinning up a few virtual machines; it’s about providing a seamless, high-performance experience on a global scale.

The Numbers Game: What the Armchair Experts Missed

The initial panic conveniently ignored a few crucial realities, according to Quinn:

Ramping Spend: Most large AWS contracts increase year-over-year. A $109 million annual average over five years likely starts lower (e.g., $80 million) and gradually increases to a higher figure (e.g., $150 million in year five) as the company expands.
Post-Discount Figures: These spend targets are post-discount. At Figma’s scale, they’re likely getting a significant discount (think 30% effective discount) on their cloud spend. So, their “retail” spend would be closer to $785 million over five years, not $545 million.
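
For those who like to check the math, the arithmetic behind these figures is easy to reproduce. The sketch below uses the numbers from the filing as quoted in this post, and treats the 30% discount as an assumption, as Quinn does.

```python
DAILY_SPEND = 300_000       # USD per day, as reported in the filing
COMMITMENT = 545_000_000    # USD minimum spend over five years
YEARS = 5
ASSUMED_DISCOUNT = 0.30     # the "think 30%" figure; an assumption, not a disclosed number

annual_from_daily = DAILY_SPEND * 365
average_commitment = COMMITMENT / YEARS
implied_retail = COMMITMENT / (1 - ASSUMED_DISCOUNT)

print(f"annualized daily spend: ${annual_from_daily / 1e6:.1f}M")
print(f"average yearly commitment: ${average_commitment / 1e6:.1f}M")
print(f"implied pre-discount total: ${implied_retail / 1e6:.0f}M")
```

At exactly 30 percent the pre-discount total comes out at roughly $779 million, in the same ballpark as the figure quoted above; the exact number depends on the discount you assume.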

When you factor these in, Figma’s 12% of revenue on cloud infrastructure for a company of its type falls squarely within industry benchmarks:

Compute-lite SaaS: Around 5% of revenue.
Compute-heavy platforms (like Figma): 10-15% of revenue.
AI/ML-intensive companies: Often exceeding 15%.

Furthermore, the increasing adoption of AI and Machine Learning in application development is introducing a new dimension to cloud costs. AI workloads, particularly for training and continuous inference, are incredibly resource-intensive, pushing the boundaries of compute, storage, and specialized hardware (like GPUs), which naturally translates to higher cloud bills. This makes effective FinOps and cost optimization strategies even more crucial for companies that leverage AI at scale.

So, while the internet was busy getting its math wrong and forecasting doom, Figma was operating within a completely reasonable range for its business model and scale.

The “Risky Dependency” Non-Story

Another popular narrative was the “risky dependency” on AWS. Figma’s S-1 filing includes standard boilerplate language about vendor dependencies, a common feature found in virtually every cloud-dependent company’s SEC filings. It’s the legal equivalent of saying, “If the sky falls, our business might be affected.”

Breaking news: a SaaS company that uses a cloud provider might be affected by outages. In related news, restaurants depend on food suppliers. This isn’t groundbreaking insight; it’s just common business risk disclosure. Figma’s “deep entanglement” with AWS, as described by Hacker News commenter nevon, underscores the complexity of modern cloud architectures, where every aspect, from permissions to disaster recovery, is seamlessly integrated. This makes a quick migration akin to performing open-heart surgery without anesthetic – highly complex and not something you do on a whim.

Cloud Repatriation: A Valid Strategy, But Not a Universal Panacea

The discussion around Figma’s costs also brought up the topic of cloud repatriation, with examples like 37signals, whose CTO, David Heinemeier Hansson, has been a vocal advocate for exiting the cloud to save millions. While repatriating certain workloads can indeed lead to significant savings for some companies, it’s not a one-size-fits-all solution.

Every company’s needs are different. For a company like Scrimba, which runs on dedicated servers and spends less than 1% of its revenue on infrastructure, this might be a perfect fit. For Figma, with its real-time collaborative demands and massive user base, the agility, scalability, and managed services offered by a hyperscale cloud provider like AWS are critical to their business model and growth.

This brings us to a broader conversation, especially relevant in the European context: digital sovereignty. As I’ve discussed in my blog post, “Digital Destiny: Navigating Europe’s Sovereignty Challenge,” the deep integration with a single hyperscaler, such as AWS, isn’t just about cost or technical complexity; it also affects the control and autonomy an organization retains over its data and operations. While the convenience of cloud services is undeniable, the potential for vendor lock-in can have strategic implications, particularly concerning data governance, regulatory compliance, and the ability to dictate terms. The ongoing debate around data residency and the extraterritorial reach of foreign laws further amplifies these concerns, pushing some organizations to consider multi-cloud strategies or even hybrid models to mitigate risks and assert greater control over their digital destiny.

My Cloud Anecdote: Costs vs. Value

This whole debate reminds me of a scenario I encountered back in 2017. I was working on a proof of concept for a customer, building a future-proof knowledge base using Cosmos DB, the Graph Model, and Search. The operating cost, primarily driven by Cosmos DB, was approximately 1,000 euros per month. As I recall, some developers immediately flagged it as “too expensive” or even thought I was selling Cosmos DB. When I presented the solution in a talk, the reception wasn’t universally positive either. In fact, one attendee later wrote in their blog:

The most uninteresting talk of the day came from Steef-Jan Wiggers, who, in my opinion, delivered an hour-long marketing pitch for CosmosDB. I think it’s expensive for what it currently offers, and many developers could architect something with just as much performance without needing CosmosDB.

However, the proposed solution was for a knowledge base that customers could leverage via a subscription model. The crucial point was that the costs were negligible compared to the potential revenue the subscription model would net for the customer. It was an investment in a revenue-generating asset, not just a pure expense.

The Bottom Line: Innovation vs. Optimization

Thanks to Quinn, I understand that Figma is actively optimizing its infrastructure, transitioning from Ruby to C++ pipelines, migrating workloads, and implementing dynamic cluster scaling. He concluded:

They’re doing the work. More importantly, they’re growing at 46% year-over-year with a 91% gross margin. If you’re losing sleep over their AWS bill while they’re printing money like this, you might need to reconsider your priorities.

The “innovation <-> optimization continuum” is always at play. Companies often prioritize rapid innovation and speed to market, leveraging the cloud for its agility and flexibility. As they scale, they can then focus on optimizing those costs.

This increasing complexity underscores the growing importance of FinOps (Cloud Financial Operations), a cultural practice that brings financial accountability to the variable spend model of cloud, empowering teams to make data-driven decisions on cloud usage and optimize costs without sacrificing innovation.

Figma’s transparency in disclosing its cloud costs is actually a good thing. It forces a much-needed conversation about the true cost of running enterprise-scale infrastructure in 2025. The hyperbolic reactions, however, expose a fundamental misunderstanding of these realities, one I also encountered with my Cosmos DB project in 2017.

So, the next time someone tells you that a company spending 12% of its revenue on infrastructure that literally runs its entire business is “doomed,” perhaps ask them how much they think it should cost to serve real-time collaborative experiences to 13 million users across the globe. The answer, if based on reality, might surprise them.

Lastly, as the cloud landscape continues to evolve, with new services, AI integration, and shifting geopolitical considerations, the core lesson remains: smart cloud investment isn’t about avoiding the bill, but understanding its true value in driving business outcomes and strategic advantage. The dialogue about cloud costs is far from over, but it’s time we grounded it in reality.

Digital Destiny: Navigating Europe’s Sovereignty Challenge

Posted on June 19, 2025 by steefjan1970

During my extensive career in IT, I’ve often seen how technology can both empower and entangle us. Today, Europe and the Netherlands find themselves at a crucial juncture, navigating the complex landscape of digital sovereignty. Recent geopolitical shifts and the looming possibility of a “Trump II” presidency have only amplified our collective awareness: we cannot afford to be dependent on foreign legislation when it comes to our critical infrastructure.

In this post, I will delve into the threats and strategic risks that underpin this challenge. We’ll explore the initiatives being undertaken at both the European and Dutch levels, and crucially, what the major U.S. Hyperscalers are now bringing to the table in response.

The Digital Predicament: Threats to Our Autonomy

The digital revolution has certainly brought unprecedented benefits, not least through innovative Cloud Services that are transforming our economy and society. However, this advancement has also positioned Europe in a state of significant dependency. Approximately 80% of our digital infrastructure relies on foreign companies, primarily American cloud providers, including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. This reliance isn’t just a matter of convenience; it’s a strategic vulnerability.

The Legal Undercurrent: U.S. Legislation

One of the most persistent threats to European digital sovereignty stems from American legislation. The CLOUD Act (2018), which amended the Stored Communications Act and followed earlier laws such as the Patriot Act (2001) and the Freedom Act (2015), grants American law enforcement and security services the power to request data from American cloud service providers, even if that data is stored abroad.

Think about it: if U.S. intelligence agencies can request data from powerhouses like AWS, Microsoft, or Google without your knowledge, what does this mean for European organizations that have placed their crown jewels there? This directly clashes with Europe’s stringent privacy regulations, the General Data Protection Regulation (GDPR), which sets strict requirements for the protection of personal data of individuals in the EU.

While the Dutch National Cyber Security Centre (NCSC) has stated that, in practice, the chance of the U.S. government requesting European data via the CLOUD Act has historically been minimal, they also acknowledge that this could change with recent geopolitical developments. The risk is present, even though it has rarely materialized thus far.

Geopolitics: The Digital Chessboard

Beyond legal frameworks, geopolitical developments pose a very real threat to our digital autonomy. Foreign governments may impose trade barriers and sanctions on Cloud Services. Imagine scenarios where tensions between major powers lead to access restrictions for essential Cloud Services. The European Union or even my country cannot afford to be a digital pawn in such a high-stakes game.

We’ve already seen these dynamics play out. In negotiations for a minerals deal with Ukraine, the White House reportedly made a phone call to stop the delivery of satellite images from Maxar Technologies, an American space company. These images were crucial for monitoring Russian troop movements and documenting war crimes.

Another stark example is the Microsoft-ICC incident, where Microsoft blocked access to email and Office 365 services for the chief prosecutor of the International Criminal Court in The Hague due to American sanctions. These incidents serve as powerful reminders of how strongly external political pressure can affect digital services.

Europe’s Response: A Collaborative Push for Sovereignty

Recognizing these challenges, both Europe and the Netherlands are actively pursuing initiatives to bolster digital autonomy. It’s also worth noting how major cloud providers are responding to these evolving demands.

European Ambitions:

The European Union has been a driving force behind initiatives to reinforce its digital independence:

Gaia-X: This ambitious European project aims to create a trustworthy and secure data infrastructure, fostering a federated system that connects existing European cloud providers and ensures compliance with European regulations, such as the General Data Protection Regulation (GDPR). It’s about creating a transparent and controlled framework.
Digital Markets Act (DMA) & Digital Services Act (DSA): These legislative acts aim to regulate the digital economy, fostering fairer competition and greater accountability from large online platforms.
Cloud and AI Development Act (proposed): This upcoming legislation seeks to ensure that strategic EU use cases can rely on sovereign cloud solutions, with the public sector acting as a crucial “anchor client.”
EuroStack: This broader initiative envisions Europe as a leader in digital sovereignty, building a comprehensive digital ecosystem from semiconductors to AI systems.

Crucially, we’re seeing tangible progress here. Virt8ra, a significant European initiative positioning itself as a major alternative to US-based cloud vendors, recently announced a substantial expansion of its federated infrastructure. The platform, which initially included Arsys, BIT, Gdańsk University of Technology, Infobip, IONOS, Kontron, MONDRAGON Corporation, and Oktawave, all coordinated by OpenNebula Systems, has now been joined by six new cloud service providers: ADI Data Center Euskadi, Clever Cloud, CloudFerro, OVHcloud, Scaleway, and Stackscale. This expansion is a clear indicator that the vision for a robust, distributed European cloud ecosystem is gaining significant traction.

Dutch Determination:

The Netherlands is equally committed to this journey:

Strategic Digital Autonomy and Government-Wide Cloud Policy: A coalition of Dutch organizations has developed a roadmap, proposing a three-layer model for government cloud policy that advocates for local storage of state secret data and autonomy requirements for sensitive government data.
Cloud Kootwijk: This initiative brings together local providers to develop viable alternatives to hyperscaler clouds, fostering homegrown digital infrastructure.
“Reprogram the Government” Initiative: This initiative advocates for a more robust and self-reliant digital government, pushing for IT procurement reforms and in-house expertise.
GPT-NL: A project to develop a Dutch language model, strengthening national strategic autonomy in AI and ensuring alignment with Dutch values.

Hyperscalers and the Sovereignty Landscape:

The growing demand for digital sovereignty has prompted significant responses from major cloud providers, demonstrating a recognition of European concerns:

AWS European Sovereign Cloud: AWS has announced key components of its independent European governance for the AWS European Sovereign Cloud.
Microsoft’s Five Digital Commitments: Microsoft recently outlined five significant digital commitments to deepen its investment and support for Europe’s technological landscape.

These efforts from hyperscalers highlight a critical balance. As industry analyst David Linthicum noted, while Europe’s drive for homegrown solutions is vital for data control, it also prompts questions about access to cutting-edge innovations. He stresses the importance of “striking the right balance” to ensure sovereignty efforts don’t inadvertently limit access to crucial capabilities that drive innovation.

However, despite these significant investments, skepticism persists. There is an ongoing debate within Europe regarding digital sovereignty and reliance on technology providers headquartered outside the European Union. Some in the community express doubts about how such companies can truly operate independently and prioritize European interests, with comments like, “Microsoft is going to do exactly what the US government tells them to do. Their proclamations are meaningless.” Others echo the sentiment that “European money should not flow to American pockets in such a way. Europe needs to become independent from American tech giants as a way forward.” This collective feedback highlights Europe’s ongoing effort to develop its own technological capabilities and reduce its reliance on non-European entities for critical digital infrastructure.

My perspective on this situation is that achieving true digital sovereignty for Europe is a complex and multifaceted endeavor, marked by both opportunities and challenges. While the commitments from global hyperscalers are significant and demonstrate a clear response to European demands, the underlying desire for independent, European-led solutions remains strong. It’s not about outright rejection of external providers, but about strategic autonomy – ensuring that we, as Europeans, maintain ultimate control over our digital destiny and critical data, irrespective of where the technology originates.

Resilient by Design — Timeouts, Retries, and Idempotency

Posted on April 18, 2025 by steefjan1970

Recently, I attended various sessions at QCon London 2025, and one that I liked in particular was Sam Newman’s session on Timeouts, Retries, and Idempotency in Distributed Systems. My InfoQ colleague Olimpiu Pop wrote an excellent news item about it for InfoQ, yet I wanted to write a more in-depth blog post on the session. I also feel that this topic relates to integration and building cloud solutions.

Hence, this post discusses his session, which tackles the often overlooked but critical dimensions of distributed systems: timeouts, retries, and idempotency. Delivered with clarity and urgency, it provides a blueprint for designing systems that prioritize resilience without sacrificing performance.

While much attention in modern software architecture is paid to scalability, service meshes, and observability, this post challenges that focus by spotlighting what can quietly and catastrophically derail a system—poor handling of network failures and repeated operations.

Timeouts: The Silent Guardians of System Health

Timeouts were introduced as a performance lever and a protective mechanism. A timeout isn’t about rushing a request; it’s about setting boundaries. It enforces a contract that prevents one unresponsive component from exhausting the system’s resources.

The guiding philosophy was profound yet straightforward: timeouts prioritize the overall system’s health over the success of a single request. Letting a request fail fast, while seemingly harsh, is an act of system-wide preservation.

(Source: Sam Newman’s Definition of Insanity Slide Deck)

Newman explored scenarios in which a frontend service invokes multiple backend APIs. Without timeouts, a stalled backend can tie up resources indefinitely, leading to cascading failures. The takeaway was that timeout values should be explicitly defined and never left to defaults.
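
To illustrate the frontend scenario, here is a minimal sketch using Python’s asyncio. The backend names, latencies, and timeout values are invented for the example; the relevant part is that every call carries an explicit timeout and a slow backend fails fast instead of holding the whole request hostage.

```python
import asyncio


async def call_backend(name: str, delay: float) -> str:
    """Stand-in for a backend API call; `delay` simulates its latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def call_with_timeout(name: str, delay: float, timeout: float) -> str:
    """Fail fast instead of waiting indefinitely on a stalled backend."""
    try:
        return await asyncio.wait_for(call_backend(name, delay), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{name}: timed out after {timeout}s"


async def frontend() -> list[str]:
    # Each backend gets an explicit timeout; none is left to a default.
    return await asyncio.gather(
        call_with_timeout("catalog", delay=0.01, timeout=0.2),
        call_with_timeout("pricing", delay=1.0, timeout=0.2),
    )


results = asyncio.run(frontend())
print(results)
```

The stalled pricing call costs the frontend 0.2 seconds rather than a full second, and the catalog result is still returned.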

Strategic Retries: When Retrying Becomes a Threat

When misused, retries were described as potential self-inflicted denial-of-service attacks. Sam Newman’s discussion of retries highlighted how indiscriminate retrying, especially under load, can amplify outages and destabilize services.

Instead, Newman recommends adopting structured retry strategies using exponential backoff, jitter, and bounded attempts. These techniques reduce the likelihood of synchronized retry storms and give struggling services a chance to recover.

(Source: Sam Newman’s Definition of Insanity Slide Deck)

Libraries like Resilience4J and Polly were noted for their configurable retry mechanisms, but attendees were cautioned that no tool can replace intentional system design. The message was clear: retries should be deliberate, context-aware, and failure-conscious.

Idempotency: Making Repetition Safe

Newman then turned to idempotency—the idea that repeating an operation should have the same effect as doing it once. In distributed environments, duplicate requests are inevitable, whether due to retries (as discussed earlier), client behavior, or network glitches.

Without idempotent operations, these duplicates can lead to data corruption, financial discrepancies, or compounding business logic errors. Imagine, for example, a scenario where a payment request is processed multiple times due to network issues.

(Source: Sam Newman’s Definition of Insanity Slide Deck)

A practical solution discussed was using idempotency keys—unique identifiers that allow systems to recognize and ignore repeated operations. This approach was framed as essential, not optional, for write operations. However, achieving idempotency, especially in complex distributed systems, isn’t always straightforward. For example, ensuring idempotency across multiple services or databases (a distributed transaction) can be particularly challenging, requiring careful coordination and potentially distributed locking mechanisms. Even with idempotency keys, issues like key generation, storage, and handling concurrent requests must be addressed thoughtfully.
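As an illustration of the idempotency-key idea, the following Python sketch keeps results in an in-memory dictionary and uses a lock to handle concurrent duplicates. The `PaymentService` class and its fields are hypothetical. A real system would need durable, shared storage for the keys and an expiry policy, which this sketch deliberately leaves out.

```python
import threading
import uuid


class PaymentService:
    """Toy payment service that ignores repeated requests with the same key."""

    def __init__(self):
        self._results = {}             # idempotency key -> stored result
        self._lock = threading.Lock()  # serializes concurrent duplicates
        self.charges = []              # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        with self._lock:
            if idempotency_key in self._results:
                # Duplicate request: replay the stored result, charge nothing.
                return self._results[idempotency_key]
            self.charges.append(amount_cents)
            result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
            self._results[idempotency_key] = result
            return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry
first = service.charge(key, 1999)
second = service.charge(key, 1999)  # e.g. a retry after a lost response
print(first == second, len(service.charges))
```

The client generates the key once per logical operation and reuses it for every retry, so the server can tell a retry apart from a genuinely new payment.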

Participants were encouraged to audit their systems using the following question: Can this operation be safely retried? If not, safeguards need to be built in.

Timeout Budgets: Coordinating Time Across Services

One of the more advanced concepts introduced was timeout budget propagation. Rather than treating timeouts in isolation, systems should treat them as shared contracts across the entire call chain.

For instance, if a user’s request has a 2-second budget, every downstream call should complete within its share of that total time. Once the budget is exhausted, subsequent calls should short-circuit to avoid waste.

This leads to more intelligent and responsive systems that avoid making pointless calls and fail quickly with clarity.
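The budget idea can be sketched in Python as a deadline object that is passed down the call chain. Everything here is an illustrative assumption: the `Deadline` class, the `fake_service` helper, the service names, and the simulated clock are hypothetical and were not part of the session.

```python
import time


class BudgetExhausted(Exception):
    pass


class Deadline:
    """Tracks how much of the overall request budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())


def call_downstream(name, deadline, work):
    remaining = deadline.remaining()
    if remaining <= 0:
        # Short-circuit: do not start work that cannot finish in time.
        raise BudgetExhausted(f"skipping {name}: budget exhausted")
    return work(remaining)  # the remaining budget becomes the per-call timeout


def fake_service(duration, clock_state):
    # Simulated downstream call that advances a fake clock (hypothetical).
    def work(timeout):
        if duration > timeout:
            clock_state[0] += timeout
            raise TimeoutError
        clock_state[0] += duration
        return "ok"
    return work


now = [0.0]  # simulated clock, in seconds
deadline = Deadline(2.0, clock=lambda: now[0])
results = []
for name, duration in [("inventory", 1.5), ("pricing", 0.7), ("recommendations", 0.1)]:
    try:
        outcome = call_downstream(name, deadline, fake_service(duration, now))
        results.append(f"{name}: {outcome}")
    except TimeoutError:
        results.append(f"{name}: timed out")
    except BudgetExhausted:
        results.append(f"{name}: skipped")
print(results)
```

In this run the first call uses 1.5 of the 2 seconds, the second is cut off when the remaining 0.5 seconds run out, and the third is never attempted because the budget is already spent.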

Tools Are Helpers, Not Saviors

The final theme reinforced the importance of understanding over automation. Tools like Resilience4J and Polly provide robust functionality but cannot replace deep knowledge of a system’s behavior under duress.

It was emphasized that installing these libraries without understanding failure patterns, latency curves, and operational context could worsen reliability.

The recommendation was for teams to invest time in studying their systems’ behavior, conduct chaos testing, and build observability around failure and recovery mechanisms.

Bringing It Together: A Blueprint for Resilience

The trio of timeouts, retries, and idempotency formed a comprehensive framework for resilience. They were positioned not as technical trivia but as strategic imperatives.

Attendees were encouraged to formalize resilience patterns, create shared documentation for timeout policies, and continuously test their assumptions through simulated failures.

The session closed by highlighting that resilience does not emerge by accident. It must be engineered deliberately and iterated constantly.

Conclusion: From Fragile to Fault-Tolerant

Newman’s talk offered actionable insights and cautionary tales for software teams building and operating distributed systems. It shifted the conversation from high-level abstraction to the gritty realities of how distributed systems behave under failure.

In a landscape increasingly dominated by complexity and scale, small design choices—timeout values, retry conditions, and idempotency guarantees—determine whether systems bend or break.

The overarching message was simple: Every system fails. The only question is how gracefully it does so.

Key Takeaways of the session:

Timeouts Are About System Health, Not Just Performance: Timeouts protect the entire system by ensuring a single failing component doesn’t compromise the larger architecture.
Retries Need Strategy, Not Hope: Blindly retrying failed requests can worsen problems. Controlled, contextual retries with backoff policies are essential.
Idempotency is a Survival Mechanism: Distributed systems must gracefully handle repeated operations without unintended side effects.
Timeout Budgets Should Be Propagated: Passing timeout constraints downstream ensures coordinated request handling and better system responsiveness.
Tools Matter, But Understanding Comes First: Resilience4J, Polly, and similar libraries are powerful, but they must be used with a solid grasp of distributed system behavior.

Lastly, Newman’s website provides more information on this topic and details on his book about distributed systems.

Azure Cosmos DB’s Latest Performance Features

Posted on July 20, 2023 by steefjan1970

As an early adopter of Azure Cosmos DB, I have always followed the developments of this service and have built up experience leveraging it for monitoring purposes (a recent example was presented at Azure Cosmos DB Conf 2023 – Leveraging Azure Cosmos DB for End-to-End Monitoring of Retail Processes).

Azure Cosmos DB

For those unfamiliar with it, Azure Cosmos DB is Microsoft’s globally distributed, multi-model database service, offering low-latency, scalable storage and querying of diverse data types. It allows developers to build applications with data access and high availability across regions. Its well-known counterpart is Amazon DynamoDB.

In this blog post, I would like to point out some recent performance optimizations of the service. I have also recently written an InfoQ news item on this.

Priority-based execution

One of the more recent features introduced in the service is priority-based execution, which is currently in public preview. It allows users to define the priority of requests sent to Azure Cosmos DB. When the number of requests surpasses the configured Request Units per second (RU/s) limit, lower-priority requests are throttled so that high-priority requests are processed first.

As mentioned in a blog post by Microsoft, this feature empowers users to prioritize critical tasks over less crucial ones in situations where a container surpasses its configured request units per second (RU/s) capacity. Less important tasks are automatically retried by clients using an SDK with the specified retry policy until they can be successfully processed.

With priority-based execution, you have the flexibility to allocate varying priorities to workloads operating within the same container in your application. This proves beneficial in numerous scenarios, including prioritizing read, write, or query operations, as well as giving precedence to user actions over background tasks like bulk execution, stored procedures, and data ingestion/migration.
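The behavior described above can be pictured with a toy Python model. This is not the service’s actual algorithm and does not use the Cosmos DB SDK. The `Request` class, the `admit` function, and the RU numbers are hypothetical and only show high-priority requests being served first once the RU/s limit is reached.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: str  # "High" or "Low"
    cost_ru: int   # request units consumed by the request


def admit(requests, ru_limit):
    """Toy model of one second of traffic against a container's RU/s limit."""
    served, throttled = [], []
    remaining = ru_limit
    # Stable sort: high priority first, original order within a priority.
    for req in sorted(requests, key=lambda r: r.priority != "High"):
        if req.cost_ru <= remaining:
            remaining -= req.cost_ru
            served.append(req.name)
        else:
            throttled.append(req.name)  # a client SDK would retry these later
    return served, throttled


requests = [
    Request("bulk-import", "Low", 300),
    Request("checkout", "High", 250),
    Request("report", "Low", 200),
    Request("login", "High", 100),
]
served, throttled = admit(requests, ru_limit=400)
print(served, throttled)
```

With a limit of 400 RU/s, the two user-facing requests fit within the budget, and the background work is pushed to a later retry.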

To access the feature, you need to submit a nomination form; once accepted, you can use it with the .NET SDK.

Hierarchical Partition Keys

In addition to Priority-based execution, the product group for Cosmos DB also introduced Hierarchical Partition Keys to optimize performance.

Hierarchical partition keys enhance Cosmos DB’s elasticity, particularly in scenarios where users rely on synthetic partition keys or have logical partition keys surpassing 20 GB of data. By employing up to three keys with hierarchical partitioning, users can effectively sub-partition their data, achieving superior data distribution and enabling greater scalability. Azure Cosmos DB automatically distributes the data among physical partitions, allowing logical partition prefixes to exceed the 20 GB storage limit.

According to the documentation, the simplest way to create a container and specify hierarchical partition keys is using the Azure portal.

For example, you can use hierarchical partition keys to partition data by tenant ID and then by item ID. This way, the items for a given tenant can be spread across multiple physical partitions, while queries that specify the tenant ID are routed only to the physical partitions that hold that tenant’s data. This can improve query performance by reducing the number of physical partitions that need to be queried.
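To picture what a tenant-then-item key looks like, here is a small Python sketch that builds a partition key definition as a plain dictionary and derives the full key of an item. It does not call the Azure SDK. The property names `/tenantId` and `/id` are hypothetical, and the `MultiHash` kind with version 2 is an assumption based on how the Azure documentation describes container definitions, so verify it before relying on it.

```python
MAX_LEVELS = 3  # the text above mentions "up to three keys"


def hierarchical_key_definition(paths):
    # Builds a container partition key definition as a plain dict.
    # "MultiHash" / version 2 are assumed from the Azure docs, not verified here.
    if not 1 <= len(paths) <= MAX_LEVELS:
        raise ValueError(f"expected 1 to {MAX_LEVELS} paths, got {len(paths)}")
    return {"paths": list(paths), "kind": "MultiHash", "version": 2}


def full_key(item, definition):
    # The item's value at each level of the hierarchy, in order.
    return [item[path.lstrip("/")] for path in definition["paths"]]


definition = hierarchical_key_definition(["/tenantId", "/id"])
item = {"tenantId": "contoso", "id": "item-1", "payload": "..."}
print(definition)
print(full_key(item, definition))
```

A query that filters on `tenantId` alone supplies only the first level of the key, which is what allows it to be scoped to that tenant’s data.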

A more detailed explanation and use case for hierarchical keys in Azure Cosmos DB can be found in the blog post by Leonard Lobel.

Burst Capacity Feature

Lastly, the team also made the burst capacity feature for Azure Cosmos DB generally available (GA) to allow you to take advantage of your database or container’s idle throughput capacity to handle traffic spikes.

Burst capacity allows each physical partition to accumulate up to 5 minutes of idle capacity, which can be utilized at a rate of up to 3000 RU/s. This feature is applicable to databases and containers utilizing manual or autoscale throughput, provided they have less than 3000 RU/s provisioned per physical partition.
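The numbers above lend themselves to a quick back-of-the-envelope calculation. The Python sketch below is a simplified model based only on the two limits quoted in the text (5 minutes of accumulation, bursts of up to 3000 RU/s). The function names and the 100 RU/s example partition are assumptions for illustration.

```python
MAX_ACCUMULATION_SECONDS = 5 * 60  # "up to 5 minutes of idle capacity"
MAX_BURST_RATE_RU = 3000           # "at a rate of up to 3000 RU/s"


def accumulated_ru(provisioned_ru_per_s, idle_seconds):
    # Idle capacity banked by one physical partition, capped at 5 minutes.
    return provisioned_ru_per_s * min(idle_seconds, MAX_ACCUMULATION_SECONDS)


def burst_seconds(provisioned_ru_per_s, idle_seconds, burst_rate=MAX_BURST_RATE_RU):
    # How long the banked capacity lasts when consumed at the burst rate.
    return accumulated_ru(provisioned_ru_per_s, idle_seconds) / burst_rate


# A partition with 100 RU/s provisioned that has been idle for 5 minutes:
banked = accumulated_ru(100, 300)
duration = burst_seconds(100, 300)
print(banked, duration)
```

Under these assumptions, the partition banks 30,000 RU and can sustain a 3000 RU/s burst for about 10 seconds. Staying idle longer than 5 minutes does not increase that amount.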

To begin utilizing burst capacity, access the Features page within your Azure Cosmos DB account and enable the Burst Capacity feature. Please note that the feature may take approximately 15-20 minutes to become active once enabled.

Enabling the burst capacity feature (Source: Microsoft Learn Burst Capacity)

According to the documentation, to use the feature, you need to consider the following:

Burst capacity applies only to Azure Cosmos DB accounts configured with provisioned throughput (manual or autoscale). It is not applicable to serverless containers.
Additionally, burst capacity is compatible with Azure Cosmos DB accounts utilizing the API for NoSQL, Cassandra, Gremlin, MongoDB, or Table.

Lastly, in case you are wondering what the difference between burst capacity and priority-based execution is, Jay Gordon, a Senior Cosmos DB program manager, explained it in the discussion section of the blog post about these performance features:

The difference between burst capacity and execution based on priority lies in their impact on performance and resource allocation:

Burst capacity affects the overall throughput capacity of your Azure Cosmos DB container or database. It allows you to temporarily exceed the provisioned throughput to handle sudden spikes in workload. Burst capacity helps maintain low latency and prevent throttling during peak usage periods.

Execution based on priority determines the order in which requests are processed when multiple concurrent requests exist. Higher priority requests are prioritized and typically get faster access to resources for execution. This ensures that essential or time-sensitive operations are processed promptly, while lower-priority requests may experience slight delays.

“In terms of results, burst capacity and execution based on priority are independent. Utilizing burst capacity allows you to handle temporary workload spikes, whereas execution based on importance ensures that higher-priority requests are processed more promptly. These mechanisms work together to optimize performance and resource allocation in Azure Cosmos DB, but they serve different purposes.”

Conclusion

In conclusion, Azure Cosmos DB continues to evolve with new features designed to enhance performance and scalability. Priority-based execution, currently in public preview, enables users to prioritize critical tasks over less important ones when the request unit capacity is exceeded. This flexibility is further enhanced by hierarchical partition keys, which allow better data distribution and larger scale in scenarios with substantial data. Additionally, the burst capacity feature, now generally available, provides an efficient way to handle traffic spikes by utilizing idle throughput capacity. Users can easily enable burst capacity through the Azure Cosmos DB account’s Features page, making it a valuable tool for accounts with provisioned throughput.

Returning to Amazon DynamoDB, the Cosmos DB counterpart on AWS: it also offers performance-optimizing capabilities, and the concepts are similar.