
Cloud Architecture

Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!

Latest Premium Content
Trend Report: Cloud Native
Refcard #370: Data Orchestration on Cloud Essentials
Refcard #379: Getting Started With Serverless Application Architecture

DZone's Featured Cloud Architecture Resources

The DevOps Security Paradox: Why Faster Delivery Often Creates More Risk

By Jaswinder Kumar
A few years ago, I was part of a large enterprise transformation program where the leadership team proudly announced that they had successfully implemented DevOps across hundreds of applications. Deployments were faster. Release cycles dropped from months to days. Developers were happy.

But within six months, the security team discovered something alarming: misconfigured cloud storage, exposed internal APIs, containers running with root privileges, and unpatched base images being deployed daily. Ironically, the same DevOps practices that accelerated innovation had also accelerated risk. This is the DevOps Security Paradox: the faster organizations move, the easier it becomes for security gaps to slip into production.

The Velocity vs. Security Conflict

Traditional software delivery worked like a relay race. Developers wrote the code. Operations deployed it. Security reviewed it near the end. DevOps changed that model entirely. Instead of a relay race, delivery became a high-speed continuous conveyor belt. Code moves through:

- Source control
- CI pipelines
- Container builds
- Infrastructure provisioning
- Production deployment

Sometimes this entire journey happens in minutes. The problem is that security processes did not evolve at the same speed. Many organizations still rely on:

- Manual reviews
- Security gates late in the pipeline
- Periodic compliance audits

By the time issues are discovered, the code is already running in production.

The Hidden Security Gaps in Modern DevOps

In my experience working with cloud and DevOps teams, most security issues come from a few recurring patterns.

1. Infrastructure as Code Without Guardrails

Infrastructure as Code (IaC) is powerful. Teams can provision entire environments with a few lines of code. But this also means developers can accidentally deploy insecure infrastructure at scale. Common issues include:

- Public S3 buckets
- Security groups open to the internet
- Databases without encryption
- Missing network segmentation

Because IaC is automated, one mistake can replicate across hundreds of environments instantly.

2. Container Security Is Often Ignored

Containers made application packaging simple, but they also introduced new attack surfaces. Many container images in production today still include:

- Outdated base images
- Hundreds of unnecessary packages
- Critical vulnerabilities

Developers often pull images from public registries without verification. A single vulnerable dependency can quietly introduce risk into the entire platform.

3. CI/CD Pipelines Become a Security Blind Spot

CI/CD pipelines now have enormous power. They can:

- Access source code
- Build artifacts
- Push images
- Deploy to production
- Access cloud credentials

Yet pipelines are rarely treated as high-value targets. Common risks include:

- Hardcoded secrets
- Over-privileged IAM roles
- Lack of pipeline integrity verification
- Untrusted third-party actions

A compromised pipeline can become the fastest route to compromising production systems.

4. Identity and Access Sprawl

Cloud environments grow quickly. What starts with a few roles and service accounts soon becomes hundreds. Without strong identity governance, teams end up with:

- Overly permissive IAM roles
- Long-lived credentials
- Unused service accounts
- Cross-account trust misconfigurations

Identity is now the primary attack vector in cloud environments, yet it remains one of the least governed areas.
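Before moving on, here is a minimal, purely illustrative sketch of what a lightweight guardrail for the first gap could look like: a CI step that scans a Terraform plan (exported with terraform show -json) for public S3 bucket ACLs and security groups open to the internet. The file layout and attribute names are simplified assumptions; dedicated policy-as-code tools such as OPA, Kyverno, or Checkov cover far more cases.

Python
import json
import sys

# Illustrative only: scan a Terraform plan JSON for two common misconfigurations.
# Usage: terraform show -json plan.out > plan.json && python check_plan.py plan.json

def check_plan(plan_path):
    with open(plan_path) as f:
        plan = json.load(f)

    findings = []
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        rtype = change.get("type")
        address = change.get("address", "unknown")

        # Public S3 bucket ACLs
        if rtype == "aws_s3_bucket" and after.get("acl") in ("public-read", "public-read-write"):
            findings.append(f"{address}: S3 bucket ACL is public")

        # Security group ingress rules open to the whole internet
        if rtype == "aws_security_group":
            for rule in after.get("ingress") or []:
                if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                    findings.append(f"{address}: ingress open to 0.0.0.0/0")
    return findings

if __name__ == "__main__":
    issues = check_plan(sys.argv[1])
    for issue in issues:
        print(f"POLICY VIOLATION: {issue}")
    sys.exit(1 if issues else 0)  # a non-zero exit code fails the CI stage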
Why Security Teams Struggle to Keep Up

The reality is that most security teams were never designed for the pace of DevOps. Traditional security approaches rely heavily on:

- Ticket-based reviews
- Static compliance checklists
- Quarterly audits

But modern cloud environments change daily. A Kubernetes cluster may create or destroy hundreds of resources every hour. Manual reviews simply cannot scale. Security must evolve from manual inspection to automated enforcement.

The DevSecOps Shift

The solution is not slowing down DevOps. The solution is making security move at the same speed as DevOps. This is where DevSecOps becomes critical. Instead of adding security at the end, it becomes embedded throughout the delivery lifecycle. Key practices include:

Policy as Code

Security rules should be enforced automatically. Tools like Open Policy Agent or Kyverno allow teams to define policies such as:

- Containers cannot run as root
- Required resource limits must be defined
- Public cloud resources must be restricted
- Encryption must be enabled

These policies run automatically during CI pipelines or Kubernetes deployments.

Automated Security Scanning

Every pipeline should automatically scan for:

- Container vulnerabilities
- IaC misconfigurations
- Dependency risks
- Secret leaks

Developers receive immediate feedback before code reaches production.

Secure CI/CD Design

CI pipelines themselves must follow security best practices:

- Short-lived credentials
- Isolated runners
- Signed artifacts
- Verified dependencies

Pipelines should be treated as critical infrastructure, not just build tools.

Continuous Cloud Posture Monitoring

Even with preventive controls, misconfigurations still happen. Continuous monitoring tools help detect issues such as:

- Public resources
- IAM privilege escalation risks
- Compliance violations
- Drift from security baselines

Security becomes an ongoing process rather than a periodic audit.

Culture Matters More Than Tools

One of the biggest lessons I have learned after two decades in the industry is this: security failures rarely happen because tools are missing. They happen because security is treated as someone else's responsibility. When developers view security as a blocker, they find ways to bypass it. But when security is built into the developer workflow, it becomes part of normal engineering. Successful DevSecOps cultures usually follow three principles:

- Security feedback must be immediate
- Security controls must be automated
- Security must empower developers, not slow them down

The Future of Secure DevOps

Over the next few years, we will see security becoming deeply integrated into engineering platforms. Some trends are already emerging:

- Secure software supply chains
- Signed container artifacts
- Zero Trust cloud architectures
- Policy-driven infrastructure
- AI-assisted security detection

Organizations that succeed will not treat security as a checkpoint. They will treat it as an automated system woven into the fabric of their delivery platforms.

Final Thoughts

DevOps changed how we build and deliver software. But it also changed how attackers find opportunities. Speed without security creates fragile systems. The organizations that thrive will be those that learn to balance velocity with resilience. DevOps helped us move faster. DevSecOps ensures we move fast without breaking trust.

Stay Connected

If you found this article useful and want more insights on Cloud, DevOps, and Security engineering, feel free to follow and connect.
Architecting AI-Native Cloud Platforms: Signals to Insights to Actions

By Harvendra Singh
Cloud platforms have historically been built to execute applications at scale. Over the past decade, enterprises pushed workloads from private datacenters to the cloud, taking advantage of elasticity, automation, and worldwide scale. Today's digital businesses process massive amounts of signals every second:

- customer behavior
- system activity
- transaction processing
- IoT data
- operational health

Legacy cloud architectures collect these signals and then process them later for insights via dashboards and reports. This model served organizations well, but in a world where microseconds matter, it comes with a crippling drawback: there is often a significant delay between when an event occurs and when a decision can be made about that event.

Enterprises now demand systems that can sense what is going on in real time, comprehend what those signals mean, and act upon those insights immediately. Platforms that do these three things - Sense, Comprehend, Act - will be AI-Native Cloud Platforms. Applications can no longer simply run on platforms; they must think.

Architecting an AI-Native Platform

Platform architects treat artificial intelligence as a first-class citizen. Data streaming continuously through the platform is analyzed by machine learning models, which trigger automated decisions inside business applications. Platform components talk to each other intelligently. At their core, AI-Native Platforms transform cloud platforms from infrastructure providers into digital brains for your applications.

What Do I Mean by AI-Native Cloud Platform?

When I say AI-Native Cloud Platform, I'm referring to an architecture where data, machine learning, and applications are tightly integrated. Instead of data feeds going into a traditional analytics stack while ML models are built and run elsewhere, AI-native platforms weave intelligence into the fabric of the platform. Signals become Insights, which automatically trigger Actions.

Characteristics of AI-Native Platforms

- Data is streamed continuously as event feeds.
- Machine learning models are embedded into operational workflows.
- AI systems automate decisions.
- Models learn from real-world decisions.
- Intelligence flows through the platform.

Evolution of Cloud System Architecture

Here's how I see cloud system architecture evolving:

- Traditional cloud platforms: driven by compute and storage needs. Automation was focused on infrastructure rather than applications.
- Analytics-driven platforms: modern data lakes, warehouses, and analytics systems generated significant business insights, but most processing was still done after the fact.
- AI-native cloud platforms: everything starts to come together. Intelligence is infused into the platform itself. Signals are sensed → processed into Insights → trigger Actions.

Building Blocks of AI-Native Platforms

Let's take a deeper look at the architecture of AI-Native Cloud Platforms. Below is my own high-level architecture that shows how intelligence flows through an AI-Native Platform.

1. Event Ingestion Layer

Apps, devices, and services on modern digital platforms emit streams of events and signals. They need to be reliably captured at scale. Popular ingestion technologies:

- event streaming platforms
- API gateways
- message queues
- log shippers

Ingestion layers move data "into" the platform. Think of them like data on-ramps. Streaming platforms also allow services to decouple from each other. Data pipelines connect systems without requiring tight integration.
2. Data Platform Layer

Raw events are persisted to and processed within your data platform. Modern data platforms often include:

- data lakes
- lakehouse platforms
- stream processing backends
- feature stores

Feature stores warrant a special call-out here. Feature stores allow you to atomically version and serve machine learning features, so the same features used to train models are used at production time. Serving stale features is a common reason models go from performing well to performing poorly.

3. Machine Learning Platform

With the data captured and served, data scientists and engineers can use a machine learning platform to build models. These platforms enable:

- distributed model training
- experimentation tools
- model versioning
- retraining pipelines

Organizations are implementing MLOps practices to operationalize these workflows. MLOps applies software engineering practices to the development and lifecycle management of ML models.

4. Model Serving Layer

Building models is great, but serving models to generate predictions is where the value lies. ML models are hosted on AI-native platforms using low-latency model serving infrastructure. This includes technologies like:

- containerized model servers
- scalable Kubernetes clusters
- GPU inference servers
- prediction endpoints

Coupled with your ingestion layer, this allows digital systems to make predictions seconds (often milliseconds) after data is created.

5. Decision & Automation Layer

The output from our models should lead to decisions, and AI-powered decisions should trigger follow-on actions. Examples include, but are not limited to:

- approving vs. blocking a financial transaction
- making product recommendations to users
- scaling infrastructure up or down
- identifying cybersecurity threats

Decisioning capabilities can include:

- ML predictions
- policy engines
- workflow automation
- AI agents

Combine these tools to build a closed-loop decisioning system.

Intelligent Feedback Loop Architecture

AI-native systems also rely on continuous feedback. Below is another original conceptual diagram showing the learning loop. This loop ensures the system constantly improves. As more data flows through the platform, models are retrained and refined.

Examples of AI-Native Applications

- Financial services: banks process streams of transactions to identify fraud in milliseconds.
- Retail: online retailers serve real-time recommendations as customers browse their site.
- Manufacturing: factories correlate millions of sensor readings to predict machine failures.
- Cloud operations: AI correlates metrics and events to automatically troubleshoot issues.

AI-Native Applications Are Not Without Challenges

- Bad data means bad predictions: predictions generated from poor-quality data are inaccurate.
- Model drift: a model's knowledge becomes less accurate as real-world variables shift.
- Accountability: leaders need to be sure AI decisions can be explained and audited.
- Integration complexity: connecting data infrastructure, ML pipelines, and application services at scale is hard to do well.

What's Next: Towards Autonomous Digital Ecosystems

Tomorrow's cloud platforms will be capable of:

- automatically optimizing themselves
- managing infrastructure without human intervention
- operating entire ecosystems of AI-driven applications

Cloud platforms won't just run our applications. They will see. They will think. They will act. Cloud. Rethought. Intelligence is the new scale. The next evolution of cloud-native architecture isn't about how big your deployment can grow, or how available your application could be.
It's about creating an environment that knows what's happening in real time across your business. AI-native applications weave together real-time event streams, machine learning models, and automated decision-making to deliver business value that responds to the needs of your customers at the speed of data.
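To ground the Sense, Comprehend, Act flow in something concrete, here is a tiny, purely illustrative sketch. In a real platform the signal would come from a streaming layer, the score from a served model endpoint, and the action from a workflow or policy engine; every name and threshold here is hypothetical.

Python
import random
import time

# Purely illustrative: a closed loop that senses an event, scores it, and acts on the score.

def sense():
    # Sense: receive a raw signal, e.g. a payment event from the ingestion layer
    return {"type": "payment", "amount": random.uniform(1, 5000)}

def comprehend(event):
    # Comprehend: score the event with an embedded model (stubbed as a simple ratio here)
    return min(1.0, event["amount"] / 5000)

def act(event, score):
    # Act: turn the insight into an automated decision
    verdict = "BLOCK" if score > 0.8 else "approve"
    print(f"{verdict} transaction of ${event['amount']:.2f} (score={score:.2f})")

if __name__ == "__main__":
    for _ in range(5):        # a real system would loop over a continuous event stream
        evt = sense()
        act(evt, comprehend(evt))
        time.sleep(0.1)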
How CNAPP Bridges the Gap Between DevSecOps and Cloud Security Companies
By Anastasios Arampatzis
When Kubernetes Breaks Session Consistency: Using Cosmos DB and Redis Together
By Vikas Mittal
Runtime FinOps: Making Cloud Cost Observable
By David Iyanu Jonathan
NeMo Agent Toolkit With Docker Model Runner

The year 2025 has been widely recognized as the year of AI agents. With the launch of frameworks like Docker Cagent, Microsoft Agent Framework (MAF), and Google's Agent Development Kit (ADK), organizations rapidly embraced agentic systems. However, one critical area received far less attention: agent observability. While teams moved quickly to build and deploy agent-based solutions, a fundamental question remained largely unanswered: how do we know these agents are actually working as intended?

- Are multiple agents coordinating effectively?
- Are their outputs reliable and of high quality?
- Can we diagnose failures or unexpected behaviors in complex, multi-agent workflows?

These challenges sit at the core of agent observability. This is where Nvidia's open-source NeMo Agent Toolkit comes into the picture. It brings much-needed, enterprise-grade observability to LLM-powered systems, enabling teams to monitor, evaluate, and trust their agent infrastructure at scale. At the same time, Docker Model Runner is emerging as the de facto standard for local inference from the desktop. It provides a unified, "single pane of glass" experience for experimenting with a wide range of open-source models available through the Docker Models Hub. In this tutorial, we will look at how to add observability to your AI agents when inferencing through Docker Model Runner.

Docker Model Runner Setup

First, let's set up Docker Model Runner using a small language model. In this tutorial, we will use ai/smollm2. The setup instructions for Docker Model Runner are available in the official documentation. Follow those steps to get your environment ready. Make sure to enable TCP access in Docker Desktop. This step is essential; without it, your prototype will not be able to communicate with the model runner over localhost.

Command to pull and run the small language model we will use for inferencing:

Plain Text
docker model run ai/smollm2

NeMo Agent Toolkit Setup

The first step is installing the Nvidia NAT Python package. I recommend installing uv and installing all the nat dependencies through uv, because going down the plain "pip" route causes timeouts.

Plain Text
uv pip install nvidia-nat

The NeMo agentic setup is done through YAML, so declare a YAML configuration file, e.g., agent-run.yaml:

YAML
functions:
  # Add a tool to search wikipedia
  wikipedia_search:
    _type: wiki_search
    max_results: 2

llms:
  # Tell NeMo Agent Toolkit which LLM to use for the agent
  openai_llm:
    _type: openai
    model_name: ai/smollm2
    base_url: http://localhost:12434/engines/v1  # Docker Model Runner endpoint
    api_key: "empty"  # because we are using local inference, this can be empty
    temperature: 0.7
    max_tokens: 1000
    timeout: 30

general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # The endpoint where you have deployed the otel collector
        endpoint: http://0.0.0.0:5216/v1/traces
        project: nemo_project

workflow:
  # Use an agent that 'reasons' and 'acts'
  _type: react_agent
  # Give it access to our wikipedia search tool
  tool_names: [wikipedia_search]
  # Tell it which LLM to use (now using OpenAI with Docker endpoint)
  llm_name: openai_llm
  # Make it verbose
  verbose: true
  # Retry up to 3 times
  parse_agent_response_max_retries: 3

There are four important sections in the YAML file:

- Functions: These are simple components that perform a specific operation, such as the built-in Wikipedia search used here. You can define your own functions too.
- LLMs: The large language model provider we plan to use. Currently, OpenAI, Anthropic, Azure OpenAI, Bedrock, and Hugging Face are the supported providers. Since Docker Model Runner supports both the OpenAI and Anthropic API formats, we can leverage it for either provider.
- Telemetry: This is where observability comes into the picture. In this example, we have added OTel-based tracing. As a result, we will be logging spans to the configured OpenTelemetry destination.
- Workflow: This is the final piece of the puzzle, where we configure all the functions, LLMs, and tools into a workflow. For the current workflow, we are configuring a reasoning-and-acting (ReAct) agent along with the Wikipedia search tool and the Docker Model Runner inference endpoint.

Before we run the workflow, we will configure the OpenTelemetry exporter to publish spans to the otel_logs folder. Create a file named otel_config.yml:

YAML
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:5216

processors:
  batch:
    send_batch_size: 100
    timeout: 10s

exporters:
  file:
    path: /otel_logs/spans.json
    format: json

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [file]

Run the following commands in the terminal:

Plain Text
mkdir otel_logs
chmod 777 otel_logs
docker run -v $(pwd)/otel_config.yml:/etc/otelcol-contrib/config.yaml \
  -p 5216:5216 \
  -v $(pwd)/otel_logs:/otel_logs/ \
  otel/opentelemetry-collector-contrib:0.128.0

Finally, run the NeMo workflow using the following command:

Plain Text
nat run --config_file ./agent-run.yaml --input "What is the capital of Washington"

Output:

Plain Text
[AGENT] Agent input: What is the capital of Washington
Agent's thoughts:
WikiSearch: {'annotation': 'Washington State', 'required': False}
Thought: You should always think about what to do.
Action: Wikipedia Search: {'annotation': 'Washington State', 'required': False}
------------------------------
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:357 - [AGENT] Retrying ReAct Agent, including output parsing Observation
2026-03-22 21:55:18 - INFO - httpx:1740 - HTTP Request: POST http://localhost:12434/engines/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:270 - ------------------------------
[AGENT] Agent input: What is the capital of Washington State
Agent's thoughts: The capital of Washington State is Olympia.

After running the above command, you will see a spans.json file under the otel_logs folder, which contains the entire span along with inputs and outputs. In addition to what we discussed, it is also possible to set up logging and evaluations on model responses that check for coherence, relevance, and groundedness.

References

- Docker Model Runner: https://docs.docker.com/ai/model-runner/
- Nvidia NeMo Agent Toolkit: https://docs.nvidia.com/nemo/agent-toolkit/latest/get-started/installation.html

By Siri Varma Vegiraju
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

- Python 3.11+ installed on your machine
- AWS credentials configured (aws configure or an active IAM role)
- Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
- kubectl and helm v3 installed — only required if you plan to run live remediations; dry-run mode works without them

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

Shell
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

Plain Text
sre-incident-response-agent/
├── sre_agent.py          # Main agent: 4 agents + 8 tools
├── test_sre_agent.py     # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

Shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt pins the core dependencies:

Plain Text
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

Shell
cp .env.example .env

Open .env and set the following:

Shell
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=

Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

JSON
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

Shell
python sre_agent.py

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

Shell
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"

Example Output

Running the targeted trigger above produces output similar to the following:

Shell
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
Metric stats: avg 91.3%, max 97.8% over last 30 min
Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
Root cause: Memory leak causing CPU spike as GC thrashes
Severity: P2 - single service, <5% of users affected
Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
[DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened:
CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause:
Memory leak in application heap leading to aggressive GC, causing CPU saturation.
Likely introduced in the last deployment.

Remediation:
Rolling restart of deployment/my-api in namespace prod initiated (dry-run).
All pods will be replaced with fresh instances.

Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

Shell
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed

Enabling Live Remediation

Once you have validated the agent's behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

Shell
DRY_RUN=false

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning. From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away. I hope you found this article helpful and that it will inspire you to explore the AWS Strands Agents SDK and AI agents more deeply.

By Ayush Raj Jha
The 4 Signals That Actually Predict Production Failures - Part 2

A Practical Guide

In the first part, I covered the two initial signals for diagnosing that something is wrong:

- Latency
- Traffic

Those two alone explain a surprising number of production incidents. But they don't explain everything. Rising latency tells you a problem is developing. Traffic tells you what the system is dealing with. I mentioned two more signals:

- Errors
- Saturation

These two tell you something more important: whether the system is approaching failure. And this is where monitoring becomes truly operational. I will cover those two signals in this blog. Let us start with errors.

Errors - The Most Misunderstood Signal

Many teams think error monitoring is simple: count failures and raise an alert when they increase. In practice, error metrics are rarely that straightforward. The first mistake teams make is treating all errors as equal. They are not. Some errors are expected and some errors are harmless. Others indicate an outage in progress. Monitoring must differentiate between them. Otherwise alerts become noise. And noisy alerts get ignored, which defeats the entire purpose. I have seen production systems where engineers simply muted error alerts because they fired every few hours.

Error Rate Is More Important Than Error Count

Raw error counts are misleading. Ten errors per minute might be catastrophic or irrelevant; it depends on traffic. If you process:

- 100 requests per minute → 10 errors = disaster
- 100,000 requests per minute → 10 errors = background noise

Error rate is what matters. A simple production alert looks like this: alert when error rate > 2%. This works far better than static thresholds because it scales automatically with traffic.

4xx vs 5xx - Critical Distinction

One of the most common monitoring mistakes is combining 4xx and 5xx errors. They represent completely different problems. Let me talk through them.

5xx errors indicate system failures:

- Exceptions
- Timeouts
- Dependency failures
- Resource exhaustion

5xx errors should almost always trigger alerts. They mean the system is failing users.

4xx errors usually indicate client behaviour:

- Invalid input
- Authentication failures
- Missing resources

Most of the time, 4xx errors should not page engineers. But they should still be monitored. Their spikes often reveal integration problems:

- Partner systems misbehaving
- Clients sending unexpected requests
- Sometimes bots discovering your APIs

I once saw a system where 40% of traffic suddenly became 401 responses. Nothing was broken in my service. A client service had deployed a change with an incorrect token configuration. The service was healthy. The integration was not. Without separate 4xx monitoring we would never have noticed.

Error Budget Thinking

Once services mature, error monitoring becomes less about incidents and more about error budgets. Instead of asking "Did we have errors?" you ask "Did we exceed acceptable failure levels?" Example SLO: a 99.9% success rate, which allows a 0.1% failure rate. Error budgets prevent overreaction to minor fluctuations. Without them, teams end up firefighting dashboards instead of protecting user experience.
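As a minimal illustration of the two ideas above, rate-based alerting and error budgets, here is a short sketch. The 2% threshold and the 99.9% SLO mirror the examples in the text; everything else is an assumption.

Python
# Illustrative only: rate-based error alerting and error-budget arithmetic.

def error_rate(errors, requests):
    return errors / requests if requests else 0.0

def should_alert(errors, requests, threshold=0.02):
    # Alert on the error *rate* (2% here), not raw counts, so the rule scales with traffic
    return error_rate(errors, requests) > threshold

def budget_remaining(failures, requests, slo=0.999):
    # With a 99.9% SLO, the error budget is the 0.1% of requests allowed to fail
    allowed_failures = (1 - slo) * requests
    return 1.0 - (failures / allowed_failures) if allowed_failures else 0.0

if __name__ == "__main__":
    print(should_alert(errors=10, requests=100))        # True: a 10% error rate
    print(should_alert(errors=10, requests=100_000))    # False: 0.01% is background noise
    print(f"{budget_remaining(failures=600, requests=1_000_000):.0%} of the error budget left")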
In most post-mortems, latency and errors are symptoms. Saturation is usually the cause. Let us move to the next indicator: saturation.

Saturation — Where Failures Actually Begin

If latency is the early warning signal, saturation is the root cause signal. Most production outages start with a resource limit somewhere. I am not necessarily talking about CPU or memory. I am talking about less obvious resources like thread pools, connection pools, queue consumers, file descriptors, and rate limits. These limits quietly fill up until requests start waiting, then timing out, then failing. By the time error rates increase, saturation has usually been happening for a while.

CPU and Memory - Necessary but Not Enough

Infrastructure metrics still matter. They just don't tell the whole story. Monitor:

- CPU utilization
- Memory usage
- Disk I/O
- Network throughput

Example:

Plain Text
rate(container_cpu_usage_seconds_total[1m])

and:

Plain Text
container_memory_usage_bytes

The Metrics That Break Systems Most Often

As I mentioned in my previous blog, you need effective metrics. In this section I will list a few metrics that can prove useful.

Connection Pool Usage

Monitor connection pool usage. When a connection pool fills up, requests queue internally, latency increases, timeouts appear, and errors follow. In this scenario CPU can still be at 30%. Memory can still be healthy. The service still looks "green." Except users are waiting seconds for responses.

Example: monitoring a connection pool. Micrometer automatically exposes Hikari metrics:

- hikaricp_connections_active
- hikaricp_connections_idle
- hikaricp_connections_pending

The critical one is hikaricp_connections_pending. If pending connections increase steadily, saturation is approaching and action is needed.

Kubernetes Saturation Signals

Container platforms introduce new saturation points. An important metric to monitor is kube_pod_container_status_restarts_total. Restarts indicate instability. Also watch container_cpu_cfs_throttled_seconds_total. CPU throttling causes latency spikes even when CPU usage looks normal. That one surprises a lot of teams.

Dependency Metrics — The Missing Visibility Layer

Most services are only as reliable as their dependencies: databases, caches, APIs, queues, and third-party integrations. When dependencies slow down, your service slows down. But if you only monitor your service, you won't see the cause. You only see the symptoms. Dependency metrics close that gap. Without them, incident investigations turn into guesswork.

Downstream Latency Metrics

Every external call should have a latency metric. Even if the dependency is "reliable." Especially then. Simple example:

Java
Timer.Sample sample = Timer.start(registry);
Response response = paymentClient.process(request);
sample.stop(registry.timer("payment.api.latency"));

During incidents, this metric often points directly at the problem.

Dependency Error Metrics

Track dependency failures separately, for example with a counter such as payment_api_errors_total. This helps answer: are we failing, or is the dependency failing? That distinction saves time during incidents.

Database Metrics — Where Many Incidents Begin

Databases rarely fail suddenly. They slowly degrade. I have seen these follow a pattern. First queries take slightly longer. Then pools begin filling. Then request latency increases. Then timeouts appear. The progression is almost always the same, which means the signals are predictable.

Query Latency

Slow queries often trigger cascading failures. Track db_query_duration_seconds. Watch percentiles, not averages. The same rule applies as for service latency.

Connection Pool Usage

Database pools deserve dedicated dashboards. Track:

- db_connections_active
- db_connections_idle

Pool exhaustion is a classic outage pattern.
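As a small illustration of the connection-pool pattern above, the check below flags saturation while CPU still looks healthy. The parameter names mirror the HikariCP gauges mentioned earlier; the thresholds are assumptions, not recommendations.

Python
# Illustrative only: connection-pool saturation check based on pool gauges.

def pool_saturated(active, max_size, pending, utilization_threshold=0.9):
    # Saturation shows up as high utilization and/or requests queueing for a connection
    return (active / max_size) >= utilization_threshold or pending > 0

if __name__ == "__main__":
    # CPU and memory can look fine while the pool is the real bottleneck
    print(pool_saturated(active=19, max_size=20, pending=4))   # True: requests are waiting
    print(pool_saturated(active=6, max_size=20, pending=0))    # False: plenty of headroom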
Lock Contention

Lock waits produce unpredictable latency spikes, especially under load. Important metrics include:

- Lock wait time
- Deadlocks
- Blocked queries

These metrics explain incidents that otherwise look random.

Queue Metrics — The Early Warning

Event-driven systems fail differently and have a different pattern. Instead of request latency increasing, queues begin filling. Messages accumulate silently until delays become visible. Queue metrics often detect issues earlier than service metrics.

Queue Depth

Example metric: messages_available. If depth increases steadily, something is wrong. Either:

- Producers are too fast
- Consumers are too slow
- Dependencies are degraded

Queue depth is one of the most reliable early warning signals in distributed systems.

Consumer Lag

For streaming systems, lag is critical. Example: kafka_consumer_lag. Increasing lag means consumers cannot keep up. Eventually processing delays impact users.

A Pattern Worth Recognizing

After enough incidents you start recognizing patterns. One of the most common looks like this:

- Dependency latency increases
- Connection pools fill
- Request latency increases
- Queues grow
- Errors appear

When you see that progression on dashboards, you already know the story before the investigation begins. Good monitoring turns incidents into recognizable shapes. And recognizable shapes reduce stress during outages. Experienced engineers eventually learn that most outages are not mysterious; they follow patterns. That matters because uncertainty, not complexity, is what makes incidents difficult. I hope you find these useful. I will continue the discussion in the final blog of this series.

By Gaurav Gaur
FinOps for Engineers: Turning Cloud Bills Into Runtime Signals

The bill lands in your inbox. $37,000 this month. Was $29,000 last month. Someone in Finance cc's half the engineering org asking what happened. Engineering doesn't know. Nobody knows. The thread dies with "we'll investigate" and everyone goes back to fighting fires. Month later, same thing.

This is how most companies run cloud infrastructure. Cost is something Finance worries about quarterly while Engineering optimizes for uptime and latency. The feedback loop is measured in weeks. By the time anyone notices the spend anomaly, you've already burned through the overage and the root cause is buried under three deployments.

What if your cloud spend behaved like request latency? Spiked in Grafana when something broke. Triggered the same on-call rotation as a degraded service. Lived in the same mental space where you reason about capacity and performance. Not as a finance exercise. As an operational metric that engineers own.

That's FinOps. Cloud Financial Operations. The idea that cost is telemetry — another dimension of system health you instrument, monitor, and optimize in real time. Your AWS bill stops being a monthly surprise from Finance and starts being a dashboard that updates hourly, tagged by service and team, graphed alongside request rates and error budgets.

Every Workload Has a Cost Signature

Start here: cloud resources cost money in specific, measurable ways. A Lambda invocation costs $0.0000002 per request at 128MB memory. Sounds trivial until you're handling 50 million requests daily and the bill is $10k monthly. An RDS db.r5.2xlarge burns $0.504/hour whether it's serving 10 queries or 10,000. You pay for provisioned capacity, not utilization. An S3 GET request costs $0.0004 per thousand. An S3 LIST operation over a bucket with 10 million objects can cost $50 if you're iterating stupidly. These aren't abstract numbers. They're the unit economics of running code.

New Relic's engineering team did something that sounds obvious in retrospect but almost nobody does: they instrumented every operational metric with its marginal cost. Cost per API call. Cost per trace ingested. Cost per metric scraped. When a service starts hammering an endpoint, two graphs spike simultaneously — request volume and dollars per minute. You see the correlation immediately. The cost becomes visceral, not theoretical.

This matters because cloud infrastructure obscures its economics by design. When you bought physical servers, the constraints were obvious. You ordered a rack, waited six weeks for delivery, racked and cabled it, and then you squeezed every milliwatt out of that hardware because the capital was spent. You knew exactly what you had. Cloud abstracts that away.

Auto-scaling groups spin up instances when CPU crosses 70%. Reasonable behavior. Also capable of burning $8,000 on a Saturday because someone pushed a bad regex that triggers catastrophic backtracking in a log parser and every request starts taking 4 seconds instead of 40ms. The auto-scaler sees high CPU, adds instances. More instances, same bug, more CPU, more instances. By the time someone notices and rolls back, you've scaled to 80 instances serving the same traffic that normally runs on 6. The cloud bill arrives two weeks later. Nobody connects it to that Saturday incident because the feedback loop is broken. Making cost legible means fixing that loop.

Instrumenting Spend the Same Way You Instrument Latency

You already know how to do this for performance metrics. Prometheus scrapes endpoints every 15 seconds.
Grafana renders time-series graphs. Alerts fire when error rates cross thresholds. Runbooks trigger. On-call gets paged. Extend that model to cost.

AWS publishes Cost and Usage Reports — massive gzipped CSV files dumped to S3 with line-item billing detail. Every EC2 instance-hour, every GB-month of S3 storage, every Lambda invocation, tagged with resource IDs, availability zones, usage types. The files are enormous. Last month's CUR for a medium-sized infrastructure might be 4GB compressed, 40GB uncompressed, millions of rows. Parse it. Azure has equivalent exports. GCP pushes billing data to BigQuery. The mechanics differ but the pattern is identical: get granular billing data, tag it with the same metadata you use for observability, aggregate it, and shove it into your metrics pipeline.

Here's what that looks like in practice. You write a script — Python with boto3 and pandas, or Go with the AWS SDK, doesn't matter. Runs every hour via cron. Pulls the latest CUR data from S3, parses the CSV, groups by resource tags (team, environment, service, feature), computes deltas since last run, exports to Prometheus. Now cost is time-series data. You graph it next to your operational metrics. Dual-axis chart: requests per second on the left Y-axis, dollars per hour on the right.

Watch what happens. Traffic doubles during a product launch. Cost doubles. That's healthy — linear scaling, expected behavior. But then three days later: traffic flat, cost up 40%. That's the signal. Something changed and it's not traffic. You investigate. New deployment went out Tuesday. Changelog shows a "minor optimization" to caching logic. You dig deeper. The optimization broke cache key generation. Cache hit rate dropped from 85% to 12%. Every request that should hit cache now hits the database. RDS connection count spiked. Auto-scaling added read replicas. Cost follows. Without cost telemetry, this surfaces as a vague sense that "the database seems slower lately" and maybe someone investigates next sprint. With cost telemetry, it's a P2 incident Tuesday afternoon and you revert the deployment before dinner.

Kubernetes complicates this. Workloads are ephemeral. Pods get scheduled across nodes. A single node might run workloads from six different teams. Cloud billing shows you the EC2 instance cost, but how do you allocate that to teams? Kubecost solves this by querying the Kubernetes API for pod resource requests and limits, correlating that with node pricing, and exporting per-pod, per-namespace, per-label cost metrics. You tag your Deployments and StatefulSets the same way you tag everything else. Kubecost tells you the data-pipeline namespace in the prod cluster burned $340 last Tuesday. You trace it back. A CronJob that should run nightly and terminate ran 16 times because of a misconfigured schedule. Each run spawned 20 pods requesting 4 cores each. Most of the work was waiting on I/O but Kubernetes saw the resource requests and provisioned accordingly. The pods sat there, allocated but mostly idle, burning money. Without namespace-level cost visibility, that's invisible. With it, it's a line item you investigate Wednesday morning.

Granularity is everything. Cluster-wide cost is useless — it's just a big number. Per-team cost is better but still vague. Per-service cost is actionable. Per-customer cost lets you calculate unit economics and answer whether your pricing model actually covers infrastructure.
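As a rough illustration of the hourly script described above (a sketch, not a production implementation), the following groups CUR line items by a "team" cost-allocation tag and prints Prometheus-style samples. The file location, tag key, and output format are assumptions for the example; the column names follow the AWS Cost and Usage Report schema.

Python
import pandas as pd

CUR_PATH = "cur-latest.csv.gz"              # hypothetical local copy of the CUR export
TAG_COLUMN = "resourceTags/user:team"       # assumes a 'team' cost-allocation tag

def cost_by_team(path=CUR_PATH):
    df = pd.read_csv(
        path,
        compression="gzip",
        usecols=["lineItem/UnblendedCost", TAG_COLUMN],
    )
    df[TAG_COLUMN] = df[TAG_COLUMN].fillna("unallocated")
    return df.groupby(TAG_COLUMN)["lineItem/UnblendedCost"].sum().sort_values(ascending=False)

if __name__ == "__main__":
    # A real job would push these to Prometheus (e.g. via the Pushgateway) on an hourly cron
    for team, dollars in cost_by_team().items():
        print(f'aws_cost_dollars{{team="{team}"}} {dollars:.2f}')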
Cost as a Service Level Indicator

If cost is telemetry, it deserves the same rigor as uptime. Define budget burn rate as an SLI. Set an SLO: "Monthly spend shall not exceed projected budget by more than 15% for three consecutive days." Alert on violations the same way you alert on error rate thresholds.

This sounds straightforward until you try to implement it and realize your budget projections are wildly wrong. They're based on last quarter's usage, extrapolated linearly, ignoring seasonality and feature launches and customer growth patterns. Your projections say you'll spend $45k this month. You're on track for $62k by day 10. Is that a problem? Maybe. Maybe you launched a feature that's more popular than expected and the increased cost maps to increased revenue. Or maybe someone left a data pipeline running in dev that's scanning the entire production database every hour for no reason. The projection being wrong isn't the problem. The problem is not knowing about the divergence until the bill closes. Start with bad projections. Iterate. Build a feedback loop where actual spend informs next month's forecast. The goal isn't perfect forecasting — it's timely detection of unexpected changes.

Netflix uses anomaly detection for this. Not because ML is magic, but because their scale makes manual thresholding impossible. When you're spending millions monthly across thousands of microservices, you can't manually review every cost trend. Anomaly detection flags outliers — services whose cost trajectory deviates from historical patterns adjusted for traffic and seasonality. An engineer investigates. Often it's legitimate: new feature shipped, traffic grew, cost followed proportionally. Sometimes it's pathological. A retry loop that exponentially backs off but never terminates. A memory leak that causes pods to restart every 20 minutes, and Kubernetes keeps scheduling replacements. An auto-scaler that scales up aggressively but down conservatively, ratcheting instance count higher over days. These are all real incidents I've debugged. None of them showed up in traditional monitoring because the services technically worked. Requests succeeded. Latency was acceptable. But cost was hemorrhaging and nobody noticed until the monthly bill.

The anti-pattern here is the financial silo. Cost analysts in Finance who don't understand the workload architecture. Engineers who never see the bill. The gap between them guarantees dysfunction. Finance sees numbers without context — "EC2 spend up 35%" — but can't trace it to a service or deployment. Engineering makes architectural decisions without feedback on cost implications.

Showback bridges this gap. Allocate cost to teams based on tagged resources. Publish monthly dashboards showing each team's spend broken down by service. No penalties, no hard budget enforcement — just visibility. Teams start asking questions they've never asked before. "Why did we spend $4,200 on NAT Gateway last month?" Someone investigates. Turns out half the VPC subnets are misconfigured, routing all egress traffic through a single NAT Gateway instead of using VPC endpoints for S3 and DynamoDB. They fix the routing. Next month NAT Gateway cost drops to $600.

Chargeback goes further — actually billing teams internally for their infrastructure spend. This creates budget accountability but also introduces perverse incentives. Teams might under-provision to save budget, degrading reliability. They might game the allocation system. Politics emerge. Showback delivers most of the value — awareness, attribution, cultural shift toward cost consciousness — without the hazards.
What You Actually Do Monday Morning

You're convinced. Cost observability makes sense. Now what?

Tag everything. This is tedious, unglamorous infrastructure work. It's also foundational. Without tags, attribution is impossible. Define a standard schema: team, environment (prod, staging, dev), service, feature, cost-center. Enforce it with policy-as-code. Terraform modules that reject resource creation without required tags. Kubernetes admission controllers that reject pod specs missing labels. OPA policies. Sentinel. Whatever your infrastructure-as-code stack supports. Legacy resources will violate the schema. That's fine. Tag them retroactively. Write a script that queries the AWS API for untagged resources and bulk-applies tags based on naming conventions or VPC associations. It won't be perfect. You'll have orphaned resources you can't attribute. Tag what you can, document what you can't, accept that you'll be chasing this forever.

Ingest billing data. Set up automated CUR exports to S3. Write the parser — runs hourly, aggregates by tag, computes deltas, pushes to Prometheus or your metrics backend. If you're on Azure, use the Billing API. GCP exports to BigQuery, so you write SQL queries instead of parsing CSVs. The mechanics differ; the pattern doesn't.

Build dashboards. Grafana is usually the right answer because you're already using it for everything else. Add cost panels. Create a "FinOps Overview" showing total spend, top services, week-over-week trends, cost per customer if you track that granularly. Create team-specific dashboards showing their allocated spend. Make cost visible in the places engineers already look — not in a separate finance tool they'll never open.

Define alerts. Start simple: "Daily spend exceeded $X." That'll fire false positives. Refine it: "Service Y's cost increased 50% week-over-week while request volume increased 10%." These are heuristics, not perfect detectors. They'll still fire false positives. That's acceptable. The goal is building the muscle memory of investigating cost anomalies the same way you investigate latency regressions.

Integrate cost into CI/CD. This is harder. More speculative. But powerful when it works. Imagine a GitHub Action that runs on pull requests. It parses the Terraform diff, estimates the cost impact of proposed changes (new instance types, additional replicas, modified auto-scaling bounds), and posts a comment: "This change will increase monthly spend by approximately $430." Engineers see it during code review. They weigh cost against benefit. Sometimes they proceed — the feature justifies the expense. Sometimes they rethink the approach — maybe there's a cheaper architecture that accomplishes the same goal. Infracost does this. It's imperfect. Cloud pricing is Byzantine. Usage varies. Reserved Instances and Savings Plans complicate the math. Spot pricing fluctuates. But even a rough estimate is infinitely better than no estimate. It makes cost a first-class consideration during design instead of a surprise discovered three weeks later.
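To make the week-over-week heuristic from "Define alerts" above concrete, here is a minimal sketch. The 50% and 10% thresholds mirror the example in the text; where the cost and traffic series come from is left as an assumption.

Python
# Illustrative only: flag services whose cost grows much faster than their traffic.

def cost_anomaly(cost_now, cost_last_week, traffic_now, traffic_last_week,
                 cost_jump=0.5, traffic_jump=0.1):
    cost_growth = (cost_now - cost_last_week) / cost_last_week
    traffic_growth = (traffic_now - traffic_last_week) / traffic_last_week
    return cost_growth > cost_jump and traffic_growth < traffic_jump

if __name__ == "__main__":
    # Spend up 60% while requests are up only 5%: worth investigating
    print(cost_anomaly(cost_now=1600, cost_last_week=1000,
                       traffic_now=1_050_000, traffic_last_week=1_000_000))   # True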
Where This Falls Apart

Theory is clean. Practice is messy, full of edge cases and incomplete data and tooling that almost works.

Billing data lags. AWS CUR updates hourly at best, often with delays. Azure and GCP have similar latencies. You're trying to build real-time observability on top of data that's 60 to 90 minutes old. For long-running workloads — databases, cache clusters — this is fine. For burst workloads — Lambda functions, Fargate tasks, spot instances — you're often diagnosing yesterday's problem.

Tagging is perpetually incomplete. Legacy resources predate your schema. External teams don't follow the standard because they don't know it exists or don't care. Someone spins up an instance manually during an incident and forgets to tag it. Your dashboards show "unallocated spend" growing every month. You chase it down, tag what you can find, but there's always more. It's Sisyphean.

Attribution gets philosophical fast. A shared RDS instance serves three services owned by different teams. How do you allocate the cost? By query count? By table size? By connection time? By team ownership percentage? There's no obviously correct answer. You pick a heuristic, document it clearly, communicate it to stakeholders, and accept that someone will complain it's unfair. The team that runs heavy analytics queries will argue they shouldn't pay the same as the team doing lightweight lookups. The team that owns the largest tables will argue query count is a better metric than storage. You can't make everyone happy. Pick something reasonable, be transparent about the methodology, and move on.

Cost optimization competes with reliability. Every optimization is a trade-off you have to think through. Spot instances are 70% cheaper than on-demand but can be interrupted with two minutes notice. Right-sizing instances saves money but reduces headroom for traffic spikes. Aggressive auto-scaling-down minimizes waste but introduces cold-start latency when you need to scale back up. Switching from RDS to Aurora might save money but requires refactoring connection pooling logic. FinOps doesn't resolve these tensions. It makes them explicit so you can make informed trade-offs instead of optimizing blindly.

Cultural friction is real. Developers already juggle latency, error rates, saturation, on-call rotation, tech debt, feature delivery. Now you're asking them to care about cost too? It feels like scope creep. Like Finance trying to colonize engineering decisions with spreadsheets and budget restrictions. The pushback is legitimate. You mitigate it through framing. Cost visibility isn't about policing spending or denying resource requests. It's about enablement — giving engineers the information they need to make good decisions. When someone proposes a costly architecture, you don't say "no, that's too expensive." You say "here's what it will cost; here are three cheaper alternatives; here are the trade-offs; your call." Engineers appreciate having the data. What they resent is having decisions made for them by people who don't understand the constraints.

What Actually Changes When You Get This Right

FinOps shifts the conversation from reactive to proactive. Finance stops discovering overruns at month-end and demanding retroactive cuts. Engineering sees trends early, investigates, optimizes continuously.

Real example: a team notices their CloudWatch Logs bill tripled month-over-month. They investigate, discover they're logging full request bodies at DEBUG level in production — something someone enabled during an incident six weeks ago and forgot to revert. They change the log level to WARN, keep detailed logs only in staging. Spend drops 70%. Simple. Obvious in hindsight. Completely invisible without the cost signal.

Another team runs ETL batch jobs on on-demand instances. They check utilization: jobs run overnight, instances sit idle 16 hours daily.
They switch to Spot instances with a Spot Fleet configuration that tolerates interruptions and falls back to on-demand only when Spot capacity is unavailable. Cost drops 60%. Jobs sometimes take 10% longer when Spot gets interrupted, but they're not user-facing, so the latency doesn't matter.

A database sized for peak load two years ago. Traffic declined since then — customer churn, product pivot, whatever. Nobody ever resized it. Monitoring shows consistent 15% CPU utilization. They downsize from db.r5.8xlarge to db.r5.2xlarge. Performance metrics stay healthy. Monthly cost drops $2,000.

None of these are heroic optimizations. They're hygiene. Basic operational discipline. But hygiene compounds. Ten optimizations saving $200 each is $2,000 monthly, $24,000 annually. At scale, it's hundreds of thousands. More importantly, the culture changes. Teams start asking "what's this going to cost?" during design, not as an afterthought. Cost becomes part of the conversation alongside performance and reliability.

The Tools You'll End Up Evaluating

Cloud cost observability is now a legitimate product category with venture-backed companies and competitive positioning. CloudZero, Vantage, Yotascale, Apptio Cloudability — they ingest billing data, correlate it with resource tags and business metrics, and render dashboards showing unit economics. The pitch is visibility and optimization insights. Pricing varies wildly. Some charge a percentage of your cloud spend. Some charge per seat. The ROI calculation depends on your scale.

Datadog, New Relic, Dynatrace — observability platforms adding cost modules as a feature. They already instrument your infrastructure for performance. Adding cost is a natural extension. The value proposition is consolidation: one tool for operations and economics instead of separate platforms.

Kubecost focuses specifically on Kubernetes. Open-source core, commercial tier with extra features. For Kubernetes-heavy organizations, it's nearly essential — native cloud billing has no visibility into namespace or pod-level costs.

The FinOps Foundation publishes frameworks, maturity models, case studies. They run certifications — FinOps Certified Practitioner. Whether that certification has value depends on your organization, but the community and knowledge sharing are real. Consulting follows the tooling. Organizations hire people to embed FinOps culture: how to structure teams, run cost reviews, build accountability mechanisms. There's a certification ecosystem emerging. The quality varies.

CostOps — integrating cost intelligence directly into DevOps pipelines — is still nascent. Terraform modules that estimate cost before apply. CI runners that block deployments exceeding budget without approval. GitOps workflows where cost is a merge check. The tooling isn't mature, but the concept makes sense if you're already doing everything else as code.

The Skeptical Counterargument

Is any of this actually necessary? Can't Finance just handle cost management like they always have? Only if you're comfortable with month-long feedback loops and blunt instruments. Finance can identify that EC2 spend increased 40%, but they can't trace it to a specific service, deployment, or bug. They can't fix it. They can escalate to engineering, who then spend days investigating with incomplete data because nobody instrumented cost in the first place. FinOps collapses that loop. Engineers see cost in real time, correlate it with their changes, optimize autonomously.
It's not replacing Finance — it's shifting left, handling problems at the source before they become quarterly budget disasters. But it requires discipline. Instrumentation, tagging, alerting, dashboards — these don't build themselves. They require ongoing time, maintenance, evolution as your infrastructure changes. If your team is barely keeping production running, adding FinOps infrastructure might legitimately be a luxury you can't afford right now. Fair. Prioritize. Maybe you start minimal: allocate costs monthly, publish basic reports, build awareness. Low investment, high value. Once teams start caring, they'll demand better data. Then you invest in real-time telemetry. Or maybe your spend is low enough it genuinely doesn't matter. If you're burning $500 monthly, optimizing down to $300 saves trivial money relative to engineering time. FinOps scales with spend. When you're spending tens of thousands monthly, the ROI is obvious. The Actual Goal Netflix's stated goal is "nearly complete cost insight coverage." Every service, every workload, every feature instrumented with cost telemetry. It's aspirational. Probably impossible to fully achieve. But the direction is right. You won't fix everything Monday. You'll tag some resources, build a basic dashboard, maybe set up one alert. That's sufficient. The value accumulates through repetition — making cost visible in the daily flow of work, asking "what does this cost?" as routinely as "how fast is this?" or "will this scale?" The cloud hides its economics behind layers of abstraction. Auto-scaling, serverless, pay-per-request — all brilliant innovations that make infrastructure invisible until the bill arrives. FinOps makes those economics legible again. That's the entire game.
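To make the shared-cost attribution heuristic above concrete, here is a minimal sketch; the team names, query counts, and the choice of query share as the allocation metric are all illustrative assumptions, not a prescribed method: Python
# Minimal sketch: split a shared database's monthly cost by a documented heuristic.
# The metric (query share) and every number below are illustrative assumptions.
SHARED_DB_MONTHLY_COST = 4200.00  # taken from the billing export

query_counts = {  # e.g., sourced from database or application metrics
    "checkout-team": 18_500_000,
    "search-team": 9_200_000,
    "reporting-team": 3_100_000,
}

total_queries = sum(query_counts.values())
allocation = {
    team: round(SHARED_DB_MONTHLY_COST * count / total_queries, 2)
    for team, count in query_counts.items()
}
print(allocation)  # publish these numbers next to the methodology, not instead of it
Whatever metric you choose, the value comes from writing the rule down and publishing the numbers, so teams argue with the methodology rather than with each other.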

By David Iyanu Jonathan
Unlocking the Potential: Integrating AI-Driven Insights with MuleSoft and AWS for Scalable Enterprise Solutions
Unlocking the Potential: Integrating AI-Driven Insights with MuleSoft and AWS for Scalable Enterprise Solutions

This article explores the transformative potential of integrating artificial intelligence (AI)-driven insights with MuleSoft and AWS platforms to achieve scalable enterprise solutions. This integration promises to enhance enterprise scalability through predictive maintenance, improve data quality through AI-driven data enrichment, and revolutionize customer experiences across industries like healthcare and retail. Furthermore, it emphasises navigating the balance between centralized and decentralized integration structures and highlights the importance of dismantling data silos to facilitate a more agile and adaptive business environment. Enterprises are encouraged to invest in AI skills and infrastructure to leverage these new capabilities and maintain competitive advantage. Introduction Not long ago, I had one of those "aha" moments while working late at our Woodland Hills office. Picture this: I was elbows-deep in the spaghetti of our MuleSoft integrations, and it hit me — what if we could fuse our conventional setup with AI-driven insights to revolutionize our enterprise scalability? As someone who has spent countless hours with MuleSoft and AWS, toggling between Anypoint Platform and cloud paradigms, I realized we were standing on the precipice of something transformative. The Magic of AI-Augmented Integration Platforms The trend of merging AI with platforms like MuleSoft is becoming a game-changer. Think about it — self-optimizing integration pipelines that don't just react but predict. AI-driven anomaly detection is no longer a futuristic notion but a present-day reality. A critical takeaway here is that enterprises must shift their focus toward building predictive maintenance into their integration solutions. This isn't just about reducing downtime; it's about reliability, a quality all stakeholders crave. Here's a personal aside: in one of my projects at TCS, we faced repeated disruptions due to undetected anomalies in our pipeline. After integrating an AI-centric approach using AWS’s AI/ML services, we saw a 30% decrease in system alerts. It felt like watching a well-oiled machine where everything just fit. It was hard work getting there, but the reduced manual monitoring was worth every bit of effort. Centralized Control vs. Decentralized Agility Let's face it — a debate that's been brewing is centralized versus decentralized integration. I'm of two minds here. Centralized platforms like MuleSoft offer comprehensive control, yet there's a strong argument for decentralized, microservices-led frameworks powered by AI. These can make autonomous decisions at the edge, thus providing agility. In practice, evaluating trade-offs is crucial. During Farmers Insurance projects, we struggled with balancing centralized governance with the nimbleness of decentralized systems — often a tug-of-war. Through trial and error, we realized that a hybrid approach, leveraging MuleSoft for core integrations while empowering microservices with AI-driven intelligence, struck the right chord. The key was not in choosing sides but in finding harmony between the two. Cross-Industry Applications: Breaking the Mold AI-driven insights aren’t limited to tech giants — they're creeping into retail and healthcare, too. In a recent pilot, we explored using MuleSoft solutions in a healthcare setting, where real-time data processing played a critical role in patient interactions. The challenge was integrating vast datasets, something AI handled adeptly. The result? 
Improved patient engagement and faster response times. In another example, a retail client used AI integration to enrich customer experiences, from personalized offers to stock predictions. You might say these are exceptions, not the rule, but they demonstrate the potential of cross-industry applications. The lesson here? Look beyond traditional tech spaces for unique use cases and new revenue streams. AI-Driven Data Enrichment: A Technical Deep Dive One of the lesser-known but powerful capabilities of AI is data enrichment. Within MuleSoft and AWS environments, machine learning algorithms are at work to refine and enhance data for superior analytics. It's like having a data wizard on your team. In practical terms, we deployed advanced algorithms to improve data quality at Farmers Insurance. The challenge was ensuring seamless integration without disrupting existing architectures — a frequent pain point. This experience taught us the importance of innovative middleware solutions to streamline AI insights integration. The result? Enhanced data accuracy and business intelligence, empowering informed decision-making. Lessons from the Trenches: Navigating Market Dynamics Market dynamics are shifting rapidly, but the struggle with siloed data persists. Inefficient integration architectures can be a thorn in the side of digital transformation. Here, AI-driven insights can play a crucial role. In a project where data silos were hindering progress, we revamped our strategy. By prioritizing AI integrations, we dismantled these silos, resulting in a more fluid and flexible system. The critical lesson was understanding that breaking down silos is just as important as building new integrations. A balance of both ensures scalable and adaptive solutions. Future Horizons: Preparing for the AI Revolution The enterprise integration landscape is on the cusp of a new era. AI-driven insights will automate decision-making and predictive analytics, fundamentally changing business operations and competitive dynamics. To stay ahead, it's imperative for companies to invest in AI skills and infrastructure. In my own journey, continuous learning and adaptation have been key. Embracing new technologies and methodologies isn't just a requirement — it's an ongoing pursuit of excellence. And yes, I still hit roadblocks. There's always more to learn, more to implement, but that's what makes this field so exciting. Conclusion: Embracing the Transformation Integrating AI-driven insights with MuleSoft and AWS opens doors to innovation and competitiveness. As we stand on the verge of this transformation, the opportunities are vast. By focusing on emerging trends, questioning conventions, and exploring new applications, enterprises can unlock unprecedented value. In conclusion, if you're like me, sipping a coffee and wondering how to elevate your integration game, take the leap. Blend AI with your MuleSoft and AWS strategy, embrace imperfections, learn from every hiccup, and watch your enterprise soar to new heights.

By Abhijit Roy
Migration from Lovable Cloud to Supabase
Migration from Lovable Cloud to Supabase

Once your vibe-coded prototype on Lovable is up and running, you might be ready to graduate to self-hosting or a more advanced service that scales better and gives you more control. If your Lovable application uses Lovable Cloud, a common choice is to migrate your data to Supabase. Supabase is a popular Backend-as-a-Service platform based on PostgreSQL. It has a reasonable free tier, and it offers baked-in solutions for auth, serverless code execution (Edge Functions), real-time database change notifications, and a REST API. One of the big advantages of the platform is that at its core it's based on open-source technologies and the vast PostgreSQL ecosystem. Internally, Lovable Cloud already uses a shared Supabase instance behind the scenes, but it doesn't expose direct database access, which makes migration a little more involved than it needs to be. Below we list specific instructions for getting your data and users to Supabase in just 7 steps. For this exercise, I created a demo "SpendSmart" Lovable application - it shows analytics for credit card spending, supports email-based authentication, and enables row-level security to protect users from viewing each other's personal information and transactions. I assume that you already have development tools installed on your machine, such as git, node, and npm. Step 1: Connect your GitHub account and sync your Lovable project to a repository As our first step, we will need to export the project code from Lovable. Fortunately, it is easy to do using their GitHub sync. You can follow the detailed instructions in their official documentation. Step 2: Clone your repository locally After the project code has been successfully synced to GitHub, we can clone the repository locally: Shell
git clone git@github.com:<USER>/<REPO>.git
cd <REPO>
Step 3: Create a new Supabase project If you don't already have an existing project, navigate to the Supabase dashboard and create one. As of this writing, their free tier provides a shared CPU, 500 MB RAM, and a 500 MB database size. While this is indeed not enough for a production database serving live traffic, it's plenty for moving your prototype from Lovable. Step 4: Initialize Supabase config in your repo After creating the Supabase project, we need to link it to our repo and push the schema. Note that Lovable already exports the Supabase database schema and RLS policies along with the original project code, so we don't need a separate step to export them. Shell
npm install supabase --save-dev
npx supabase login
npx supabase link
npx supabase db push # this will create the schema and RLS policies
Step 5: Update the project's environment variables The project's environment variables for connecting the application to the Supabase backend are stored in the .env file in our repo: Shell
VITE_SUPABASE_PROJECT_ID="..."
VITE_SUPABASE_PUBLISHABLE_KEY="..."
VITE_SUPABASE_URL="..."
From the Supabase Project view in your web browser, get the Supabase Project ID, URL, and Publishable Key. Replace the values in the .env file for all three variables. Step 6: Migrate auth data from Lovable If your Lovable app is serving live traffic, before we proceed with steps 6 and 7, I recommend that you temporarily un-publish your app in Lovable=>Project Settings. This preserves data integrity, because we are now about to migrate the database itself.
Step 6 is a bit non-trivial: to keep things clean, we need to capture the records from Lovable Cloud's auth.users table and import them using Supabase's createUser() API. We wrote a simple helper script for that. Save it as migrate.js and edit SUPABASE_URL and SERVICE_ROLE_KEY: JavaScript
import fs from 'node:fs';
import csv from 'csv-parser';
import { createClient } from '@supabase/supabase-js';

// 1. Configuration - Update these with your NEW project details
const SUPABASE_URL = '<NEW_SUPABASE_PROJECT_URL>';   // New Supabase Project URL
const SERVICE_ROLE_KEY = '<SUPABASE_SECRET_KEY>';    // Supabase secret key from Settings->API Keys
const supabase = createClient(SUPABASE_URL, SERVICE_ROLE_KEY);

async function migrateUsers(filePath) {
  const users = [];
  // 2. Read and parse the CSV
  fs.createReadStream(filePath)
    .pipe(csv({ separator: ';' })) // Lovable's CSV export uses a semicolon delimiter
    .on('data', (row) => users.push(row))
    .on('end', async () => {
      console.log(`Found ${users.length} users. Starting migration...`);
      for (const user of users) {
        try {
          // Parse the metadata JSON string
          const metadata = user.raw_user_meta_data ? JSON.parse(user.raw_user_meta_data) : {};
          const { data, error } = await supabase.auth.admin.createUser({
            id: user.id,                            // Keeps the original ID so your foreign keys don't break
            email: user.email,
            password_hash: user.encrypted_password, // Injects the hash directly
            user_metadata: metadata,
            email_confirm: true                     // Prevents sending confirmation emails to everyone
          });
          if (error) {
            console.error(`Error importing ${user.email}:`, error.message);
          } else {
            console.log(`Imported: ${user.email}`);
          }
        } catch (parseError) {
          console.error(`Failed to parse data for ${user.email}:`, parseError.message);
        }
      }
      console.log('Migration complete!');
    });
}

// Get the filename from the command line argument
const csvFile = process.argv[2];
if (!csvFile) {
  console.log('Usage: node migrate.js your_file.csv');
} else {
  migrateUsers(csvFile);
}
Now run the following SQL query in Lovable Cloud to get the auth information, and export the result as a CSV file using the "Export CSV" button in the UI:
SELECT id, email, encrypted_password, raw_user_meta_data, created_at FROM auth.users;
After that, you can import the users using the script above: Shell
npm install @supabase/supabase-js csv-parser
node migrate.js query-results-export-....csv
Step 7: Migrate your tables from Lovable Cloud to Supabase (Final Step) Finally, export each table's data from Lovable Cloud as CSV files. To import the data into your Supabase project, you can use pgAdmin, write a script with psql and the COPY command, or use a tool like Dsync. Here we used Dsync because it automates the whole import task and doesn't require custom scripting or ordering the files with respect to foreign keys in the schema. Lovable exports CSV files with the naming convention <TABLE_NAME>-export-<DATE>.csv. Rename those files to <TABLE_NAME>.csv and put them into a temporary folder, like /tmp/love-export/public/. The "public" subfolder name will be interpreted by Dsync as the schema name, and the file names will be interpreted as table names. You will also need your new Supabase direct connection string (IPv4 compatible if IPv6 doesn't work). The sample Dsync command: Shell
dsync --mode InitialSync file:///tmp/love-export --delimiter=";" "postgresql://postgres....:.....@....:5432/postgres"
Done After the Dsync command has successfully completed, you should check the tables in Supabase and ensure that they all exist and are populated.
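If you prefer to script that check rather than click through the dashboard, a minimal sketch along these lines can count the rows in every public table; it assumes psycopg2 is installed and that the DATABASE_URL environment variable holds your Supabase direct connection string: Python
# check_import.py - minimal sketch for verifying the import (assumes psycopg2-binary
# is installed and DATABASE_URL is set to the Supabase direct connection string)
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # List every base table in the public schema
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' AND table_type = 'BASE TABLE'"
    )
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        # Table names come from information_schema, so quoting them here is safe
        cur.execute(f'SELECT COUNT(*) FROM public."{table}"')
        print(f"{table}: {cur.fetchone()[0]} rows")
conn.close()
Any table that shows zero rows is worth re-checking against the CSV you exported from Lovable Cloud.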
You can now start your project locally, authenticate with the same credentials and see the same data in your app: Shell npm i npm run dev

By Alexander Komyagin
MCP + AWS AgentCore: Give Your AI Agent Real Tools in 60 Minutes
MCP + AWS AgentCore: Give Your AI Agent Real Tools in 60 Minutes

If you've been building with AI agents, you've probably hit the same wall I did: your agent needs to do things — query databases, call APIs, check systems — but wiring up each tool is a bespoke integration every time. The Model Context Protocol (MCP) solves this by giving agents a standard way to discover and invoke tools. Think of it as USB-C for AI tooling. The problem? Most MCP tutorials stop at "run it locally with stdio." That's fine for solo dev work, but it falls apart the moment you need: Multiple clients connecting to the same serverAuth, session isolation, and scalingA deployment that doesn't die when your laptop sleeps AWS Bedrock AgentCore Runtime changes the equation. You write an MCP server, hand it over, and AgentCore handles containerization, scaling, IAM auth, and session isolation — each user session runs in a dedicated microVM. No ECS clusters to configure. No load balancers to tune. In this post, we'll build a practical MCP server from scratch, deploy it to AgentCore Runtime, and connect an AI agent to it. The whole thing takes about 30-60 minutes. What We're Building We'll create an MCP server that exposes infrastructure health tools — the kind of thing a DevOps agent would use to check system status, list recent deployments, and surface alerts. It's more interesting than a dice roller but simple enough to follow. Here's the architecture: Your agent connects via IAM auth → AgentCore discovers the tools → your server executes them → results stream back. You never manage servers, containers, or networking. Prerequisites Before we start, make sure you have: Python 3.10+ and uv (or pip — but uv is faster)AWS CLI configured with credentials that have Bedrock AgentCore permissionsNode.js 18+ (for the AgentCore CLI)An AWS account with AgentCore access (there's a free tier) Install the AgentCore tooling: Shell # AgentCore CLI npm install -g @aws/agentcore # AgentCore Python SDK pip install bedrock-agentcore # AgentCore Starter Toolkit (handles scaffolding + deployment) pip install bedrock-agentcore-starter-toolkit Step 1: Build the MCP Server Create your project structure: Shell mkdir infra-health-mcp && cd infra-health-mcp uv init --bare uv add mcp bedrock-agentcore Now create server.py. We'll use FastMCP, which gives us a decorator-based API for defining tools: Python from mcp.server.fastmcp import FastMCP from datetime import datetime, timedelta import random mcp = FastMCP("infra-health") @mcp.tool() def get_service_status(service_name: str) -> dict: """Check the health status of a deployed service. Args: service_name: Name of the service to check (e.g., 'api-gateway', 'auth-service', 'payments') """ # In production, this would hit your monitoring API statuses = ["healthy", "healthy", "healthy", "degraded", "unhealthy"] uptime = round(random.uniform(95.0, 99.99), 2) return { "service": service_name, "status": random.choice(statuses), "uptime_percent": uptime, "last_checked": datetime.utcnow().isoformat(), "active_instances": random.randint(2, 10), "avg_latency_ms": round(random.uniform(12, 250), 1) } @mcp.tool() def list_recent_deployments(hours: int = 24) -> list[dict]: """List deployments that occurred in the last N hours. 
Args: hours: Number of hours to look back (default: 24) """ services = ["api-gateway", "auth-service", "payments", "notification-svc", "user-profile"] deployers = ["ci-pipeline", "ci-pipeline", "hotfix-manual"] deployments = [] for i in range(random.randint(1, 5)): deploy_time = datetime.utcnow() - timedelta( hours=random.randint(1, hours) ) deployments.append({ "service": random.choice(services), "version": f"v1.{random.randint(20,45)}.{random.randint(0,9)}", "deployed_at": deploy_time.isoformat(), "deployed_by": random.choice(deployers), "status": random.choice(["success", "success", "rolled_back"]) }) return sorted(deployments, key=lambda d: d["deployed_at"], reverse=True) @mcp.tool() def get_active_alerts(severity: str = "all") -> list[dict]: """Retrieve currently active infrastructure alerts. Args: severity: Filter by severity level - 'critical', 'warning', 'info', or 'all' """ alerts = [ { "id": "ALT-1024", "severity": "warning", "message": "auth-service p99 latency above threshold (>500ms)", "triggered_at": ( datetime.utcnow() - timedelta(minutes=23) ).isoformat(), "service": "auth-service" }, { "id": "ALT-1025", "severity": "critical", "message": "payments service error rate at 2.3% (threshold: 1%)", "triggered_at": ( datetime.utcnow() - timedelta(minutes=8) ).isoformat(), "service": "payments" }, { "id": "ALT-1026", "severity": "info", "message": "Scheduled maintenance window in 4 hours", "triggered_at": ( datetime.utcnow() - timedelta(hours=2) ).isoformat(), "service": "all" }, ] if severity != "all": alerts = [a for a in alerts if a["severity"] == severity] return alerts if __name__ == "__main__": mcp.run(transport="streamable-http") Key decisions here: Each tool has a clear docstring with typed args — this is what the LLM sees when deciding which tool to call, so be descriptiveWe're using streamable-http transport, which is what AgentCore Runtime expectsIn production, you'd replace the mock data with calls to Datadog, CloudWatch, your deployment system, etc. Step 2: Test Locally Before deploying anything, make sure the server works: Python # Start the server uv run server.py In another terminal, test it with the MCP inspector or a quick curl: Shell # Using the MCP CLI inspector npx @modelcontextprotocol/inspector http://localhost:8000/mcp You should see your three tools listed. Click through them, pass some args, verify the responses look right. Fix any issues now — it's much faster than debugging after deployment. Step 3: Prepare for AgentCore Runtime AgentCore Runtime needs your server wrapped with the BedrockAgentCoreApp. Update server.py by adding this at the top and modifying the entrypoint: Python from bedrock_agentcore.runtime import BedrockAgentCoreApp # ... (keep all your existing tool definitions) ... # Replace the if __name__ block: app = BedrockAgentCoreApp() @app.entrypoint() def handler(payload): return mcp.run(transport="streamable-http") if __name__ == "__main__": app.run() Alternatively, use the AgentCore Starter Toolkit to scaffold the project structure automatically: Shell agentcore init --protocol mcp This generates the Dockerfile, IAM role config, and agentcore.json for you. Copy your server.py into the generated project and point the entry point to it. Step 4: Deploy to AWS This is the part that used to take hours of ECS/ECR/IAM wrangling. 
With the Starter Toolkit, it's two commands: Shell
# Configure (generates IAM roles, ECR repo, build config)
agentcore configure

# Deploy (builds container via CodeBuild, pushes to ECR,
# deploys to AgentCore Runtime)
agentcore deploy
That's it. No Docker installed locally. No Terraform. CodeBuild handles the container image, and AgentCore Runtime manages the rest. The output gives you a Runtime ARN — save this, you'll need it to connect your agent. Step 5: Invoke Your Deployed Server Test the deployed server using the AWS CLI: Shell
aws bedrock-agent-runtime invoke-agent-runtime \
  --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \
  --payload '{"jsonrpc":"2.0","method":"tools/list","id":1}' \
  --output text
You should see your three tools returned. Now try calling one: Shell
aws bedrock-agent-runtime invoke-agent-runtime \
  --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \
  --payload '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"get_active_alerts","arguments":{"severity":"critical"}},"id":2}' \
  --output text
Step 6: Connect an AI Agent Now the fun part. Let's wire this up to a Strands agent that can use our infrastructure tools conversationally: Python
from strands import Agent
from strands.tools.mcp import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# Connect to your deployed MCP server via IAM auth
mcp_client = MCPClient(
    lambda: streamablehttp_client(
        url="https://your-agentcore-endpoint/mcp",
        # IAM auth is handled automatically via your AWS credentials
    )
)

with mcp_client:
    agent = Agent(
        model="us.anthropic.claude-sonnet-4-20250514",
        tools=mcp_client.list_tools_sync(),
        system_prompt="""You are a DevOps assistant with access to infrastructure
        health tools. When asked about system status, check services, review recent
        deployments, and surface any active alerts. Be concise and flag anything
        that needs immediate attention."""
    )

    response = agent(
        "Give me a quick health check — any services having issues? "
        "And were there any recent deployments that might be related?"
    )
    print(response)
The agent will automatically discover the tools, decide which ones to call, and synthesize the results into a coherent answer. You'll see it call get_active_alerts, then get_service_status for the flagged services, then list_recent_deployments to correlate — all without you writing any orchestration logic. What AgentCore Gives You for Free It's worth pausing to appreciate what you didn't have to build:

| Concern | Without AgentCore | With AgentCore |
| --- | --- | --- |
| Container infra | ECR + ECS/EKS + ALB | Handled |
| Session isolation | Custom session management | microVM per session |
| Auth | OAuth setup, token management | IAM SigV4 built in |
| Scaling | Auto-scaling policies, metrics | Automatic |
| Networking | VPC, security groups, NAT | Managed |
| Health checks | Custom implementation | Built in |

You wrote a Python file with tool definitions. Everything else is infrastructure you didn't touch. Production Considerations Before going live with real data, a few things to think about: Replace mock data with real integrations. The tool signatures stay the same — swap random.choice(statuses) with a call to your CloudWatch API, PagerDuty, or whatever you use. Add error handling. MCP tools should return meaningful errors, not stack traces. Wrap your integrations in try/except and return structured error responses. Think about tool granularity. Three focused tools are better than one "do everything" tool.
The LLM needs clear, specific tool descriptions to make good decisions about what to call. Stateful vs. stateless. Our server is stateless (the default and recommended mode). If you need multi-turn interactions where the server asks the user for clarification mid-execution, look into AgentCore's stateful MCP support with elicitation and sampling. Connect to AgentCore Gateway. If your agent needs tools from multiple MCP servers, the Gateway acts as a single entry point that discovers and routes to all of them. You can also use the Responses API with a Gateway ARN to get server-side tool execution — Bedrock handles the entire orchestration loop in a single API call. Cleanup When you're done experimenting: Shell agentcore destroy This tears down the Runtime, CodeBuild project, IAM roles, and ECR artifacts. You'll be prompted to confirm. What's Next? A few directions to take this further: Add a Gateway to combine your MCP server with AWS's open-source MCP servers (S3, DynamoDB, CloudWatch, etc.) into a single agent toolkit.Try the AG-UI protocol alongside MCP — it standardizes how agents communicate with frontends, enabling streaming progress updates and interactive UIs. References https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.htmlhttps://github.com/strands-agents/sdk-pythonhttps://aws.amazon.com/solutions/guidance/deploying-model-context-protocol-servers-on-aws

By Jubin Abhishek Soni DZone Core CORE
Mastering Multi-Cloud Integration: SAFe 5.0, MuleSoft, and AWS - A Personal Journey
Mastering Multi-Cloud Integration: SAFe 5.0, MuleSoft, and AWS - A Personal Journey

The article explores the journey of multi-cloud integration through the lens of personal experience, focusing on integrating MuleSoft and AWS using SAFe 5.0 principles. It begins by outlining the necessity of multi-cloud solutions in today's digitally connected world, highlighting challenges such as security and vendor lock-ins. The author discusses overcoming these challenges by employing SAFe 5.0's modular designs and integrating AI services like AWS SageMaker with MuleSoft for real-time decision-making. The article also emphasizes the importance of comprehensive training and cross-functional collaboration to bridge skills gaps. A real-world case study illustrates the approach’s success in reducing latency for an e-commerce giant. The conclusion stresses continuous learning and aligning technical initiatives with business objectives as key to leveraging multi-cloud environments. Introduction I still remember the first time I heard the term "multi-cloud integration." It was during a client meeting at Tata Consultancy Services in 2014. Fresh-faced and eager, I couldn't fathom the complexities that lay ahead. Fast forward to today, I find myself at the heart of pioneering integrations leveraging SAFe 5.0 principles with MuleSoft and AWS — a journey full of insights, occasional blunders, and numerous successes. Let's dive into this strategic blueprint which modern enterprises can adopt for optimizing their multi-cloud strategies. Embracing the Multi-Cloud Revolution In today's digitally connected world, multi-cloud solutions are more of a necessity than an option. From banking to retail, industries are transitioning to multi-cloud environments to harness flexibility, scalability, and redundancy. But with great power comes great responsibility, especially when it comes to security and governance. Emerging Trends: Security and Governance at the Forefront The financial sector, often risk-averse, has been a significant adopter of MuleSoft and AWS for real-time data processing. I recall a project where we integrated real-time transaction data across several cloud environments for a leading bank. We utilized AWS's Lambda for automated validations, ensuring compliance across different jurisdictions — a crucial step in maintaining data integrity and security. Personal Insight: During our deployment, we found that while AWS and MuleSoft offer robust frameworks for security, the challenge lay in integrating these seamlessly. Detailed planning and understanding of each platform's native capabilities were vital. My advice? Never underestimate the power of thorough documentation and the importance of a well-documented API architecture. The Contrarian View: The Vendor Lock-in Debate Many advocate that multi-cloud strategies eliminate vendor lock-in. Yet, as someone who's navigated these waters, I challenge this notion. The intricacies of integration can often weave a web of dependencies, especially when working with MuleSoft and AWS. Solving the Dependency Puzzle with SAFe 5.0 One strategy we've employed is designing modular and agnostic solutions. Utilizing SAFe 5.0's modular design principles, we ensure our integrations are flexible and can pivot with changing vendor landscapes. In a recent project at a healthcare firm, we leveraged MuleSoft's Anypoint Platform to create a loosely coupled architecture, enabling easy transitions between cloud providers. Lesson Learned: Over-engineering for flexibility can be a pitfall, adding unnecessary complexity. 
It's about striking a balance — focusing on critical services that need agility while ensuring core systems remain stable and robust. Surviving the Technical Trenches: AWS AI and MuleSoft Integrating AI services like AWS SageMaker with MuleSoft has been a game-changer, enabling real-time intelligent decision-making. For instance, in a retail analytics project, we created custom connectors in MuleSoft for seamless data flow into SageMaker, enhancing predictive analytics and improving customer personalization. Technical Deep-Dive: Crafting Custom Connectors Creating these connectors isn't just about linking systems; it’s about understanding the data lifecycle and business objectives. We encountered challenges with data latency and consistency, but by iterating our API definitions and leveraging AWS's data pipeline services, we achieved near-instantaneous data processing — a key success metric in that project. Behind the Scenes: Engaging with MuleSoft's C4E team was instrumental in overcoming integration roadblocks. If there's one thing I’ve learned, it’s that community collaboration often yields the most innovative solutions. Bridging the Skill Gap with SAFe 5.0 Despite its many benefits, the learning curve for integrating MuleSoft and AWS using SAFe 5.0 principles is steep. Here's what worked for us: Comprehensive Training Programs: We developed focused training sessions highlighting SAFe 5.0 frameworks and contextualizing them within our projects. This approach demystified complex topics and empowered our teams to innovate confidently. Cross-Functional Collaboration: By facilitating dialogue across departments — from developers to QA teams — we fostered a culture of shared knowledge and innovation. This collaborative ethos became a bedrock for overcoming integration hurdles. Real-World Implementation: A Case Study Last year, we spearheaded an integration initiative for an e-commerce giant aiming to reduce latency in order processing. Utilizing AWS's Outposts and Local Zones, paired with MuleSoft's capabilities, we achieved remarkable results. Concrete Example: We reduced latency by 40%, improving customer satisfaction scores by a significant margin. The key was aligning technical prowess with business goals—something SAFe 5.0 principles advocate strongly. Actionable Takeaway: Always align technical initiatives with overarching business objectives. It's not just about the technology; it's about driving tangible business outcomes. Conclusion: The Road Ahead The integration of MuleSoft with AWS, underpinned by SAFe 5.0 principles, offers a robust framework for tackling modern multi-cloud challenges. As we look to the future, the demand for hybrid solutions with integrated AI capabilities will only grow. Final Thought: If there's one piece of advice I'd impart — never stop learning. The technology landscape is ever-evolving, and staying curious ensures we remain at the forefront of innovation. As I share these hard-won insights over a metaphorical cup of coffee, I hope they serve as a guide for your own multi-cloud journey. Let's embrace the complexities with enthusiasm and turn challenges into opportunities for growth.

By Abhijit Roy
AI-Based Multi-Cloud Cost and Resource Optimization
AI-Based Multi-Cloud Cost and Resource Optimization

Why Multi Cloud Cost Control Has Become Harder Than Ever Most enterprises did not intentionally design a multi cloud strategy from day one. It evolved. One team adopted one provider. Another team preferred a different ecosystem. Over time, resilience, vendor leverage, and geographic expansion pushed workloads across multiple platforms. What began as flexibility slowly became complexity. Finance teams see growing invoices. Engineering teams see healthy dashboards. But somewhere in between, the link between infrastructure behavior and financial impact disappears. Small inefficiencies compound. Idle compute hides inside development clusters. Storage volumes remain attached to nothing. Traffic patterns fluctuate, but infrastructure remains static. Multi cloud does not fail because of one catastrophic mistake. It fails because of accumulated invisibility. This is where AI becomes transformative. Not as a buzzword, but as a continuous reasoning layer that interprets infrastructure behavior, predicts demand, and enforces optimization automatically. Where Cost Inefficiencies Actually Hide Multi cloud cost challenges typically fall into three structural categories: fragmented visibility, resource sprawl, and reactive scaling. 1. Fragmented Visibility Every provider exposes billing differently. Some charge by the second. Others by the hour. Network egress categories vary. Storage tiers differ. Without normalization, cost comparison becomes guesswork. A practical solution is to aggregate billing data into a unified model. Python
import pandas as pd

# Example unified billing normalization
billing = pd.read_csv("multi_cloud_billing.csv")
billing["cost_per_vcpu_hour"] = (
    billing["total_cost"] / billing["vcpu_hours"].replace(0, 1)
)
billing["cost_per_gb"] = (
    billing["storage_cost"] / billing["gb_used"].replace(0, 1)
)
print(billing.head())
Once cost is standardized, anomalies and inefficiencies become visible across providers. This is the foundation of intelligent optimization. 2. Resource Sprawl Speed is both a strength and a weakness of cloud native systems. Developers deploy quickly. Infrastructure scales automatically. But cleanup rarely scales at the same rate. Idle virtual machines remain active. Test clusters are forgotten. Unattached disks accumulate. These do not create alarms. They simply create slow budget erosion. An AI driven waste detection loop continuously scans inventory and usage: Inventory → Utilization → Pattern Analysis → Recommendation → Automation Example logic for detecting idle compute: Python
def detect_idle(avg_cpu, avg_memory, threshold=10):
    if avg_cpu < threshold and avg_memory < threshold:
        return "Potentially Idle"
    return "Active"

print(detect_idle(avg_cpu=5, avg_memory=7))
This may seem simple, but at scale, applied across thousands of resources, it unlocks meaningful savings. 3. Reactive Scaling Instead of Predictive Scaling Most auto scaling policies are reactive. They trigger when CPU crosses a threshold. By then, user experience may already be affected. The smarter approach is predictive scaling. Forecasting models analyze historical traffic and anticipate future demand.
Python
import pandas as pd

traffic = pd.read_csv("traffic.csv")

# Simple rolling average baseline
traffic["forecast"] = traffic["requests"].rolling(window=60).mean()
print(traffic.tail())
In production, more advanced models such as LSTM networks are used: Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, input_shape=(24, 1)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")
The shift from reactive to predictive scaling alone can reduce overprovisioning significantly while improving performance stability. 4. Designing the Continuous Optimization Loop Optimization must operate as a closed system. Data enters. Models interpret it. Decisions are made. Changes are enforced. Results are validated. The system learns again. The architecture typically includes a billing ingestion layer, metrics and telemetry aggregation, a data normalization store, a machine learning layer, an optimization decision engine, an automation pipeline, and a monitoring and feedback loop. The critical element is feedback. Every change must be evaluated. Did savings occur? Did latency increase? Did reliability degrade? Without feedback, intelligence becomes guesswork. 5. Machine Learning That Actually Delivers Value Machine learning in FinOps must solve practical, measurable problems. Demand Forecasting Forecast short term infrastructure needs so scaling decisions are proactive. Cost Anomaly Detection Detect unexpected cost spikes within hours. Python
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05)
cost_data["anomaly"] = model.fit_predict(cost_data[["daily_cost"]])
This prevents runaway spend. Reinforcement Learning for Placement Placement decisions involve tradeoffs between cost, latency, and compliance. A simplified reward function may look like: Python
reward = savings - (sla_penalty + migration_cost)
Over time, the model learns which placements maximize long term benefit. 6. Intelligent Rightsizing: Reducing Cost Without Breaking Performance Rightsizing is often described as the simplest way to reduce cloud cost. In practice, it is rarely simple. Most organizations either ignore it entirely or apply it too aggressively. Both approaches create problems. At its core, rightsizing means aligning infrastructure capacity with actual workload demand. But real workloads are not static. They fluctuate by hour, by day, and by season. Some systems show stable utilization. Others spike unpredictably. A naïve downsizing decision based only on average CPU can introduce instability. Intelligent rightsizing goes beyond surface metrics. It evaluates patterns, risk, and business impact before making any recommendation. Why Traditional Rightsizing Fails Many teams rely on basic thresholds such as: if average CPU usage is below 30 percent, reduce instance size. This logic ignores important realities: peak usage may be short but critical, workloads may have burst patterns, memory usage may differ from CPU trends, latency sensitivity varies by service, and some services are revenue critical. A system that appears underutilized most of the day may still require burst capacity for short high traffic windows. Blind downsizing creates performance degradation and erodes trust in optimization systems. Intelligent rightsizing avoids that mistake. Rightsizing is one of the most immediate levers for cost reduction. But it must be risk aware.
Python
def recommend_resize(avg_cpu, peak_cpu):
    if avg_cpu < 30 and peak_cpu < 70:
        return "Downsize"
    return "Keep Current"
A proper system evaluates variance, seasonality, and workload criticality before making decisions. Safe optimization always balances cost and stability. 7. Cross Cloud Workload Placement Most organizations adopt multi cloud for flexibility. Very few use it strategically. In many environments, workloads stay where they were originally deployed. An application built in one region remains there for years. A data job runs on the same provider simply because it always has. Placement becomes historical rather than intentional. But in multi cloud environments, location directly affects cost, performance, and reliability. Different providers and regions vary in compute pricing, storage costs, network egress fees, latency, and compliance rules. Even small differences become significant at scale. A batch workload might run much cheaper in one region. A customer facing API might need to stay close to users to maintain low latency. Intelligent placement means continuously asking: Where should this workload run today? An AI driven placement engine evaluates workload type, demand patterns, pricing differences, network costs, compliance constraints, and service dependencies. It scores each possible location and selects the one that balances savings with performance and risk. Here is a simplified scoring example, where the lowest score wins: Python
def score_region(cost, latency, reliability):
    return (0.5 * cost) + (0.3 * latency) - (0.2 * reliability)
The goal is not just to reduce cost, but to optimize intelligently. Moving a workload must not increase latency or create hidden egress charges. Good placement decisions always consider long term impact. For example, a nightly analytics job can shift to a lower cost region without affecting users. A real time API remains in a high performance region. The result is targeted savings without compromising stability. When done correctly, cross cloud placement turns multi cloud from a passive architecture choice into an active economic strategy. Infrastructure stops being static and becomes adaptive. And adaptive systems are always more efficient than fixed ones. 8. Automation: Turning Decisions Into Reality Insights are useless without execution. Optimization systems integrate with infrastructure as code pipelines. Terraform example: HCL
variable "instance_type" {}

resource "cloud_instance" "app" {
  instance_type = var.instance_type
}
Instead of directly modifying infrastructure, AI updates configuration variables. CI pipelines validate changes. Deployment strategies such as canary rollouts minimize risk. Automation must always include rollback capability. 9. Governance: Control Without Slowing Innovation When people hear the word governance, they often think of restrictions, approvals, and delays. In cloud environments, governance has traditionally meant slowing things down to reduce risk. But in autonomous multi cloud systems, governance plays a different role. It does not block innovation.
It enables safe optimization at scale. As AI begins making recommendations about resizing workloads, shifting regions, or adjusting scaling policies, organizations need confidence that those decisions align with business priorities. Governance ensures that cost savings never compromise reliability, compliance, or customer experience. A well designed governance layer answers four simple questions: Is this change within budget guardrails?Does it respect compliance and data residency rules?Will it impact critical workloads?Can it be audited and rolled back if needed? Instead of acting as a bottleneck, governance becomes a policy engine that runs automatically before execution. For example, production systems may require approval before downsizing, while development environments can be optimized automatically. Budget guardrails can prevent aggressive changes that might introduce risk. Compliance rules can block workloads from moving outside approved regions. Audit logs capture every decision, who approved it, and what the expected savings were. The key is balance. Too little governance creates instability. Too much governance kills automation. The right approach embeds governance directly into the optimization pipeline. AI recommends. Policies validate. Automation executes. Monitoring verifies. This structure allows organizations to move quickly without losing control. In modern FinOps, governance is not about slowing innovation. It is about building trust so innovation can happen safely. 10. Measuring Success Optimization sounds impressive in theory. But in practice, it only matters if it delivers real results. When an AI system recommends resizing servers, shifting workloads, or adjusting scaling policies, the real question is simple: did it actually make things better? Success in multi cloud optimization comes down to balance. You want lower cost, but not at the expense of reliability or performance. Saving money while increasing latency or causing instability is not success. It is just a different kind of problem. To measure whether optimization is working, focus on a few meaningful signals: Is the cost per transaction decreasing?Are idle resources shrinking over time?Are scaling predictions accurate?Has service reliability remained stable?Did the actual savings match what was predicted? These metrics tell a story. If costs go down and performance remains steady, the system is working. If performance drops or customer experience suffers, something needs to be adjusted. Measurement also improves intelligence. When predictions are slightly off, the system learns from the gap between expected and actual results. Over time, decisions become more precise. In the end, measuring success is not about dashboards filled with numbers. It is about proving that optimization improves efficiency without sacrificing stability. That is when AI driven FinOps becomes truly valuable. 11. The Road Ahead Cloud environments are not getting simpler. They are becoming more distributed, more dynamic, and more financially complex. Managing cost across multiple providers using manual reviews and static policies will soon feel outdated. The next phase of multi cloud optimization is continuous intelligence. Systems will not just report inefficiencies. They will predict them, correct them, and learn from the results. Cost will become a real time signal, just like performance or reliability. We will also see optimization expand beyond price alone. 
Future systems will consider sustainability, energy efficiency, and long term commitment strategies automatically. Cloud infrastructure will gradually behave less like fixed capacity and more like a responsive economic system. The organizations that embrace adaptive, AI driven control early will operate leaner and move faster. The future of cloud operations is not more dashboards. It is smarter systems working quietly in the background. Final Thoughts AI-based multi cloud cost and resource optimization is not about replacing engineers. It is about giving them intelligent systems that continuously align infrastructure cost with business demand. The organizations that embrace autonomous FinOps will operate leaner, react faster, and innovate more confidently. The future of cloud operations is not manual tuning. It is intelligent adaptation.
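As a closing illustration of the measurement signals described in section 10, here is a minimal sketch; the CSV source and its column names (total_cost, transactions, predicted_savings, actual_savings) are assumptions for the example, not a standard report format: Python
import pandas as pd

# Assumed daily export with columns: date, total_cost, transactions,
# predicted_savings, actual_savings
df = pd.read_csv("optimization_report.csv", parse_dates=["date"]).sort_values("date")

# Unit economics: is cost per transaction trending down?
df["cost_per_txn"] = df["total_cost"] / df["transactions"]
trend = df["cost_per_txn"].tail(30).mean() - df["cost_per_txn"].head(30).mean()

# Forecast quality: how far off were the savings predictions?
mean_error = (df["predicted_savings"] - df["actual_savings"]).abs().mean()

print(f"Cost per transaction change (last 30 days vs first 30): {trend:+.4f}")
print(f"Mean absolute savings prediction error: {mean_error:.2f}")
If unit cost is falling while reliability holds steady and the prediction error keeps shrinking, the optimization loop is doing its job.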

By Venkatesan Thirumalai
Run AI Agents Safely With Docker Sandboxes: A Complete Walkthrough
Run AI Agents Safely With Docker Sandboxes: A Complete Walkthrough

There are days when I want an agent to work on a project, run commands, install packages, and poke around a repo without getting anywhere near the rest of my machine. That is exactly why Docker Sandboxes clicked for me. The nice part is that the setup is not complicated. You install the CLI, sign in once, choose a network policy, and launch a sandbox from your project folder. After that, you can list it, stop it, reconnect to it, or remove it when you are done. In this post, I am keeping the focus narrow on purpose: Set up Docker Sandboxes, run one against a local project, understand the few commands that matter, and avoid the mistakes that usually slow people down on day one. What Are Docker Sandboxes? Docker Sandboxes give you an isolated environment for coding agents. Each sandbox runs inside its own microVM and gets its own filesystem, network, and Docker daemon. The simple way to think about it is this: the agent gets a workspace to do real work, but it does not get free access to your whole laptop. That is the reason this feature is interesting. You can let an agent install packages, edit files, run builds, and even run Docker commands inside the sandbox without turning your host machine into the experiment. Before You Start You do not need a big lab setup to try this, but you do need: macOS or Windows machine installedWindows "HypervisorPlatform" feature enabledDocker Sbx CLI installedAPI key or authentication for the agent you want to use If you start with the built-in shell agent, Docker sign-in is enough for your first walkthrough. If you want to start with claude, copilot, codex, gemini, or another coding agent, make sure you also have that agent's authentication ready. If you are on Windows, make sure Windows Hypervisor Platform is enabled first. PowerShell Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -All If Windows asks for a restart, do that before moving on. Note: Docker documents the getting-started flow with the sbx CLI. There is also a docker sandbox command family, but sbx is the cleanest way to get started, so that is what I am using in this walkthrough. Step 1: Install the Docker Sandboxes CLI On Windows: PowerShell winget install -h Docker.sbx On macOS: PowerShell brew install docker/tap/sbx That is it for installation. If sbx is not recognized immediately after install, open a new terminal window and try again. I hit that once on Windows after installation, and a fresh terminal fixed it. Note: Docker Desktop is not required for sbx. Step 2: Sign In Now sign in once: PowerShell sbx login This opens the Docker sign-in flow in your browser. During login, Docker asks you to choose a default network policy for your sandboxes: Open – Everything is allowedBalanced – Common development traffic is allowed, but it is more controlledLocked down – Everything is blocked unless you explicitly allow it If you are just getting started, pick Balanced. That is the easiest choice for a first run because it usually works without making the sandbox too open. Step 3: Pick a Small Project Folder You can use an existing project folder, or create a tiny test folder just for this walkthrough. For example: PowerShell mkdir hello-sandbox cd hello-sandbox If you want, drop a file into it so you have something visible inside the sandbox: PowerShell echo "# hello-sandbox" > README.md Nothing fancy is needed here. The goal is just to have a folder you are comfortable letting the agent work in. 
Step 4: Run Your First Sandbox Here is the command that matters most: PowerShell sbx run shell . Figure 1.1: Shows how to create a new sandbox using Sbx command What this does: Starts a sandbox for the shell agentMounts your current folder into the sandboxOpens an isolated environment where the agent can work on that folder If you prefer naming your sandbox from the start, use: PowerShell sbx run --name my-first-sandbox shell . On the first run, Docker may take a little longer because it needs to pull the agent image. That is normal. Later runs are much faster. I like starting with shell because it is the easiest way to prove the sandbox is working before you bring an actual coding agent into the mix. Once that works, replace shell with the agent you actually want to use, such as claude, copilot, codex, gemini, or another supported agent from the Docker docs. Step 5: See What Is Running To check your active sandboxes, run: PowerShell sbx ls You should see output with a name, status, and uptime. This is a handy command because once you start using sandboxes regularly, it becomes the quickest way to see what is still running and what needs cleanup. Figure 1.2: Shows how to verify list of all active sandboxes running on the machine Step 6: Switch to a Real Coding Agent Once you have proved the sandbox works with shell, move to the coding agent you actually want to use. For example: PowerShell sbx run copilot Figure 1.3: Shows how to run Copilot agent on Docker sandbox or PowerShell sbx run gemini Figure 1.4: Shows how to run gemini agent on Docker sandbox The workflow is the same as shell. The only thing that changes is the agent inside the sandbox. If the agent needs its own provider login or API key, complete that setup and then continue. The important point is that the agent is still running inside the sandbox, not directly on your host machine. Step 7: Stop the Sandbox When You Are Done When you are finished using Sandbox, you can stop it by running the command below: PowerShell sbx stop copilot-dockersandboxtest If you don't remember the name, run sbx ls first to see all the active sandboxes running. Stopping is useful when you want to pause work without removing the sandbox immediately. Step 8: Remove the Sandbox When You No Longer Need It When you are done for good, you can remove it by running the command below: PowerShell sbx rm copilot-dockersandboxtest Or remove all sandboxes by simply passing --all flag as shown below: PowerShell sbx rm --all Figure 1.5: Removing all sandboxes using sbx rm --all command Step 9: Use YOLO Mode Safely Now for the newer idea Docker has just announced, which is YOLO mode. If you want to read more about it, refer to Docker's recent blog post, which is worth bookmarking: Docker Sandboxes: Run Agents in YOLO Mode, Safely. In simple terms, YOLO mode means letting a coding agent work with fewer interruptions and fewer approval prompts. That can save time, but it only makes sense when the agent is already inside a sandbox. Note: I would not start with YOLO mode on day one. I would start with a normal sandbox run, get comfortable with the lifecycle first, and only then try YOLO mode. Conclusion This article explains Docker Sandboxes and provides step-by-step instructions for getting started. What I like about Docker Sandboxes is that they remove a lot of friction from a very real problem. Sometimes you want an agent to have freedom, but not too much freedom. 
You want it to run commands, inspect files, and do useful work, but you also want a clear boundary around that work. That is the sweet spot Docker Sandboxes are aiming for. If you are curious about them, my advice is simple: do not start with a giant repo or a complicated setup. Pick one small folder, use the Balanced policy, run a single sandbox, and get comfortable with the basic lifecycle first. Once that clicks, stepping up to YOLO mode feels much easier.

By Naga Santhosh Reddy Vootukuri DZone Core CORE

Top Cloud Architecture Experts


Abhishek Gupta

Principal PM, Azure Cosmos DB,
Microsoft

I mostly work on open-source technologies including distributed data systems, Kubernetes and Go

Srinivas Chippagiri

Sr. Member of Technical Staff

Srinivas Chippagiri is a highly skilled software engineering leader with over a decade of experience in cloud computing, distributed systems, virtualization, and AI/ML-applications across multiple industries, including telecommunications, healthcare, energy, and CRM software. He is currently involved in the development of core features for analytics products, at a Fortune 500 CRM company, where he collaborates with cross-functional teams to deliver innovative, scalable solutions. Srinivas has a proven track record of success, demonstrated by multiple awards recognizing his commitment to excellence and innovation. With a strong background in systems and cloud engineering at GE Healthcare, Siemens, and RackWare Inc, Srinivas also possesses expertise in designing and developing complex software systems in regulated environments. He holds an Master's degree from the University of Utah, where he was honored for his academic achievements and leadership contributions.

Vidyasagar (Sarath Chandra) Machupalli FBCS

Executive IT Architect,
IBM

Executive IT Architect, IBM Cloud | BCS Fellow, Distinguished Architect (The Open Group Certified)

Pratik Prakash

Principal Solution Architect,
Capital One

Pratik, an experienced solution architect and passionate open-source advocate, combines hands-on engineering expertise with extensive experience in multi-cloud and data science. Leading transformative initiatives across current and previous roles, he specializes in large-scale multi-cloud technology modernization. Pratik's leadership is highlighted by his proficiency in developing scalable serverless application ecosystems, implementing event-driven architecture, deploying AI/ML and NLP models, and crafting hybrid mobile apps. Notably, his strategic focus on an API-first approach drives digital transformation while embracing SaaS adoption to reshape technological landscapes.

The Latest Cloud Architecture Topics

AWS vs GCP Security: Best Practices for Protecting Infrastructure, Data, and Networks
A practical guide to securing AWS and GCP using IAM, encryption, network controls, and continuous monitoring to help improve resilience on the cloud.
April 24, 2026
by Kadir Arslan
· 392 Views
Advanced Middleware Architecture For Secure, Auditable, and Reliable Data Exchange Across Systems
A secure, high-performance middleware using JWT, async messaging, and cryptographic auditing enables reliable, scalable, and fully traceable data exchange across systems.
April 23, 2026
by Abhijit Roy
· 519 Views
Coding Agents Need a Feedback Loop; Cloud-Native Systems Make That Hard
The utility of coding agents compounds with the quality of their feedback loop. In cloud-native systems, closing that loop involves solving two problems.
April 23, 2026
by Arjun Iyer
· 447 Views
The Pod Prometheus Never Saw: Kubernetes' Sampling Blind Spot
Prometheus sampling gaps are irreducible — reducing the scrape interval just moves the threshold. The Kubernetes watch API eliminates it entirely.
April 23, 2026
by Shamsher Khan, DZone Core
· 450 Views
Revolutionizing Scaled Agile Frameworks with AI, MuleSoft, and AWS: An Insider’s Perspective
AI + MuleSoft + AWS enhance SAFe with automated insights, better integration, and smarter DevOps—guided by human judgment.
April 22, 2026
by Abhijit Roy
· 779 Views
The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes
A Kubernetes pod may restart due to an OOMKill when the Java process exceeds the container’s memory limit. JVM memory tuning and correct resource limits prevent crashes.
April 22, 2026
by Ramya vani Rayala
· 1,004 Views · 1 Like
AWS Bedrock: The Future of Enterprise AI
Amazon Bedrock simplifies enterprise AI with multi-model access, built-in security, RAG, and scalable, no-infrastructure deployment.
April 21, 2026
by Subrahmanyam Katta
· 2,038 Views · 1 Like
Demystifying Intelligent Integration: AI and ML in Hybrid Clouds
AI and ML are transforming hybrid clouds with edge intelligence, federated learning, and explainable, scalable integration.
April 21, 2026
by Abhijit Roy
· 1,642 Views
The DevOps Security Paradox: Why Faster Delivery Often Creates More Risk
DevOps speeds delivery and risk. Without built-in security, vulnerabilities reach production fast — DevSecOps embeds automated security into the pipeline.
April 21, 2026
by Jaswinder Kumar
· 1,156 Views · 1 Like
Architecting AI-Native Cloud Platforms: Signals to Insights to Actions
AI-native platforms embed intelligence into cloud infrastructure, allowing systems to sense events, generate insights with AI, and trigger automated actions in real time.
April 21, 2026
by Harvendra Singh
· 663 Views
How CNAPP Bridges the Gap Between DevSecOps and Cloud Security Companies
CNAPP embeds security directly into the cloud‑native build process, unifying teams and catching risks early so organizations ship safer apps faster and with less waste.
April 20, 2026
by Anastasios Arampatzis
· 765 Views
When Kubernetes Breaks Session Consistency: Using Cosmos DB and Redis Together
Cosmos DB stores durable state; Redis acts as a coordination layer, enabling predictable, stateless scaling without sticky sessions, strong consistency, or high costs.
April 15, 2026
by Vikas Mittal
· 1,399 Views
Runtime FinOps: Making Cloud Cost Observable
Treat cloud cost as a real-time system metric tied to deployments. With tagging, CI/CD estimates, and alerts to service owners, teams can catch spend spikes early.
April 15, 2026
by David Iyanu Jonathan
· 1,489 Views
NeMo Agent Toolkit With Docker Model Runner
Agent observability is often missing in the rush to build AI agents. NeMo adds observability to AI agents, helping trace, evaluate, and debug multi-agent workflows.
April 15, 2026
by Siri Varma Vegiraju, DZone Core
· 1,504 Views
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents
Learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.
April 14, 2026
by Ayush Raj Jha
· 1,437 Views · 1 Like
The 4 Signals That Actually Predict Production Failures - Part 2
Master cloud-native observability. Learn the essential monitoring metrics to keep your distributed systems reliable and fast.
April 13, 2026
by Gaurav Gaur, DZone Core
· 1,776 Views
FinOps for Engineers: Turning Cloud Bills Into Runtime Signals
Most teams treat cloud cost as a finance problem. FinOps treats it as telemetry engineers monitor to detect anomalies early and prevent runaway spending.
April 10, 2026
by David Iyanu Jonathan
· 1,985 Views
Unlocking the Potential: Integrating AI-Driven Insights with MuleSoft and AWS for Scalable Enterprise Solutions
AI-powered MuleSoft and AWS integrations improve scalability, data quality, and customer experience while dismantling silos.
April 8, 2026
by Abhijit Roy
· 1,829 Views
Migration from Lovable Cloud to Supabase
This article provides a step-by-step guide to migrating your data and users from Lovable Cloud to Supabase, breaking the process down into seven clear steps.
April 8, 2026
by Alexander Komyagin
· 2,057 Views
MCP + AWS AgentCore: Give Your AI Agent Real Tools in 60 Minutes
A hands-on walkthrough of building an AI agent with real tools using AWS Bedrock AgentCore Runtime, FastMCP, and the Strands agent.
April 8, 2026
by Jubin Abhishek Soni, DZone Core
· 2,165 Views