The year 2025 has been widely recognized as the year of AI agents. With the launch of frameworks like Docker cagent, Microsoft Agent Framework (MAF), and Google's Agent Development Kit (ADK), organizations rapidly embraced agentic systems. However, one critical area received far less attention: agent observability. While teams moved quickly to build and deploy agent-based solutions, fundamental questions remained largely unanswered:

How do we know these agents are actually working as intended?
Are multiple agents coordinating effectively?
Are their outputs reliable and of high quality?
Can we diagnose failures or unexpected behaviors in complex, multi-agent workflows?

These challenges sit at the core of agent observability. This is where NVIDIA's open-source NeMo Agent Toolkit comes into the picture. NeMo brings much-needed, enterprise-grade observability to LLM-powered systems, enabling teams to monitor, evaluate, and trust their agent infrastructure at scale. At the same time, Docker Model Runner is emerging as the de facto standard for local inference from the desktop. It provides a unified, "single pane of glass" experience for experimenting with a wide range of open-source models available through Docker Hub. In this tutorial, we will look at how to add observability to your AI agents when inferencing through Docker Model Runner.

Docker Model Runner Setup

First, let's set up Docker Model Runner with a small language model; in this tutorial we will use ai/smollm2. The setup instructions for Docker Model Runner are available in the official documentation; follow those steps to get your environment ready. Make sure to enable TCP access in Docker Desktop. This step is essential; without it, your prototype will not be able to communicate with the model runner over localhost. Pull the small language model we will use for inferencing:
```shell
docker model run ai/smollm2
```

NeMo Agent Toolkit Setup

The first step is installing the NVIDIA NAT package from PyPI. I recommend installing uv and pulling all the nat dependencies through it, because going down the plain pip route causes timeouts.

```shell
uv pip install nvidia-nat
```

NeMo's agentic setup is done through YAML, so declare a YAML configuration, e.g. agent-run.yaml:

```yaml
functions:
  # Add a tool to search Wikipedia
  wikipedia_search:
    _type: wiki_search
    max_results: 2

llms:
  # Tell NeMo Agent Toolkit which LLM to use for the agent
  openai_llm:
    _type: openai
    model_name: ai/smollm2
    base_url: http://localhost:12434/engines/v1  # Docker Model Runner endpoint
    api_key: "empty"  # local inference, so this can be empty
    temperature: 0.7
    max_tokens: 1000
    timeout: 30

general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # The endpoint where you have deployed the OTel collector
        endpoint: http://0.0.0.0:5216/v1/traces
        project: nemo_project

workflow:
  # Use an agent that 'reasons' and 'acts'
  _type: react_agent
  # Give it access to our Wikipedia search tool
  tool_names: [wikipedia_search]
  # Tell it which LLM to use (OpenAI-compatible, pointed at the Docker endpoint)
  llm_name: openai_llm
  # Make it verbose
  verbose: true
  # Retry up to 3 times
  parse_agent_response_max_retries: 3
```

There are four important sections in the YAML file:

Functions: Simple components that perform a specific operation, in this case the built-in Wikipedia search. You can define your own functions too.
LLMs: The large language model provider we plan to use. Currently, OpenAI, Anthropic, Azure OpenAI, Bedrock, and Hugging Face are the supported providers. Since Docker Model Runner supports both the OpenAI and Anthropic API formats, we can leverage it for either provider type.
Telemetry: This is where observability comes into the picture. In this example, we have added OTel-based tracing.
As a result, we will be logging spans to the configured OpenTelemetry destination.
Workflow: The final piece of the puzzle, where we wire the functions, LLMs, and tools together into a workflow. Here we configure a reason-and-act (ReAct) agent with the Wikipedia search tool and the Docker Model Runner inference endpoint.

Before we run the workflow, we will configure the OpenTelemetry Collector to export spans to the otel_logs folder. Create a file named otel_config.yml:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:5216

processors:
  batch:
    send_batch_size: 100
    timeout: 10s

exporters:
  file:
    path: /otel_logs/spans.json
    format: json

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [file]
```

Run the following commands in the terminal (note that the mounted config file and log directory must match the names used above):

```shell
mkdir otel_logs
chmod 777 otel_logs
docker run -v $(pwd)/otel_config.yml:/etc/otelcol-contrib/config.yaml \
  -p 5216:5216 \
  -v $(pwd)/otel_logs:/otel_logs/ \
  otel/opentelemetry-collector-contrib:0.128.0
```

Finally, run the NeMo workflow:

```shell
nat run --config_file ./agent-run.yaml --input "What is the capital of Washington"
```

Output:

```
[AGENT] Agent input: What is the capital of Washington
Agent's thoughts: WikiSearch: {'annotation': 'Washington State', 'required': False}
Thought: You should always think about what to do.
```
```
Action: Wikipedia Search: {'annotation': 'Washington State', 'required': False}
------------------------------
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:357 - [AGENT] Retrying ReAct Agent, including output parsing Observation
2026-03-22 21:55:18 - INFO - httpx:1740 - HTTP Request: POST http://localhost:12434/engines/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:270 - ------------------------------
[AGENT] Agent input: What is the capital of Washington State
Agent's thoughts: The capital of Washington State is Olympia.
```

After running the above command, you will see a spans.json file under the otel_logs directory, which contains the full spans along with inputs and outputs. Beyond what we covered here, it is also possible to set up logging and evaluations on model responses that check for coherence, relevance, and groundedness.

References

Docker Model Runner: https://docs.docker.com/ai/model-runner/
Nvidia NeMo Agent Toolkit: https://docs.nvidia.com/nemo/agent-toolkit/latest/get-started/installation.html
The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

Python 3.11+ installed on your machine
AWS credentials configured (aws configure or an active IAM role)
Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
kubectl and helm v3 installed — only required if you plan to run live remediations; dry-run mode works without them

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

```shell
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent
```

The directory contains the following files:

```
sre-incident-response-agent/
├── sre_agent.py        # Main agent: 4 agents + 8 tools
├── test_sre_agent.py   # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md
```

Step 2: Create a Virtual Environment and Install Dependencies

```shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

The requirements.txt pins the core dependencies:

```
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
```

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

```shell
cp .env.example .env
```

Open .env and set the following:

```shell
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
```

Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
```

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

```shell
python sre_agent.py
```

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

```shell
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
```

Example Output

Running the targeted trigger above produces output similar to the following:

```
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
  Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
  Metric stats: avg 91.3%, max 97.8% over last 30 min
  Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
  Root cause: Memory leak causing CPU spike as GC thrashes
  Severity: P2 - single service, <5% of users affected
  Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
  [DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened:
CloudWatch alarm my-api-HighCPU fired at 09:18 UTC. CPU reached
97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause:
Memory leak in application heap leading to aggressive GC, causing
CPU saturation. Likely introduced in the last deployment.

Remediation:
Rolling restart of deployment/my-api in namespace prod initiated
(dry-run). All pods will be replaced with fresh instances.

Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================
```

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

```shell
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed
```

Enabling Live Remediation

Once you have validated the agent's behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

```shell
DRY_RUN=false
```

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning. From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.
I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.
AI data mapping automates the complex process of connecting disparate data sources, significantly reducing manual effort. Integration pipelines are essential for syncing data between enterprise SaaS platforms (like Workday) and downstream systems. Traditional pipelines require manual schema alignment and field mapping, which is error-prone. Emerging AI techniques can automate and accelerate these tasks, improving accuracy and speed.

Challenges in SaaS Data Integration

As one source explains, modern integration needs semantic understanding of fields to align them. Workday and similar SaaS platforms have complex, evolving data models. Moving Workday data to a data warehouse or another system requires matching fields to the target schema. This mapping is time-consuming and brittle if done manually, and frequent API or report changes can break hard-coded mappings. Key challenges include:

Schema drift: Workday reports or custom fields change, requiring pipeline updates.
Complex mappings: Fields like emp_id vs. Employee_ID differ in naming or semantics.
Data quality: Missing or duplicate values can go unnoticed without checks.
Scalability: Pipelines must handle large volumes of HR/finance data for analytics.
Governance: Automated flows must still enforce Workday's security and compliance.

AI-assisted pipelines address these issues by automating mapping and monitoring. Some AI agents continuously scan streaming data to spot outliers. Vendors report that AI-powered integration can cut maintenance by roughly 80% by handling routine schema tasks. In practice, an AI-augmented pipeline can flag mismatches or new fields immediately, reducing manual troubleshooting.

Leveraging AI for Data Mapping

AI data mapping uses ML, NLP, and rule-based techniques to align source and target schemas.
Common approaches include:

Rule-Based: Explicit mapping rules or functions.
Machine Learning: Supervised models learn from example mappings to predict new ones.
Large Language Models (LLMs): GPT-4 or Claude can interpret schema names and propose mappings.
Semantic Graphs: Ontologies/knowledge graphs infer equivalent fields.

Often a hybrid approach is used: a pipeline might first apply explicit rules for known fields, then use an ML model for fuzzy matches, and finally invoke an LLM to resolve any remaining cases. By automating field alignment, AI greatly cuts manual work. Below are Python examples of rule-based, ML-based, and LLM-based mapping logic.

Rule-Based Mapping

```python
def rule_based_mapping(source_record, mapping_rules):
    target_record = {}
    for src, tgt, transform in mapping_rules:
        if src in source_record:
            target_record[tgt] = transform(source_record[src])
    return target_record

# Example with Workday-like fields
source = {"Employee_ID": "E123", "Employee_Name": "Jane Doe", "Dept": "Engineering"}
rules = [
    ("Employee_ID", "emp_id", lambda x: x),
    ("Employee_Name", "full_name", lambda x: x.strip().title()),
    ("Dept", "department", lambda x: x.lower())
]

mapped = rule_based_mapping(source, rules)
print(mapped)
# {'emp_id': 'E123', 'full_name': 'Jane Doe', 'department': 'engineering'}
```

This function applies each source-to-target rule. In practice, one would loop over Workday records and apply this to each. Rule-based methods are transparent but must be updated whenever the Workday schema changes.
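Because rule-based mappings break when the schema drifts, it is worth pairing them with a check that flags source fields no rule covers. This is a small sketch building on the rule-style triples above; the function name detect_unmapped_fields and the Cost_Center field are illustrative, not from any Workday API.

```python
def detect_unmapped_fields(source_record, mapping_rules):
    """Return source fields that no (source, target, transform) rule
    covers -- a cheap schema-drift alarm to run on each extract."""
    known = {src for src, _tgt, _fn in mapping_rules}
    return sorted(set(source_record) - known)

rules = [
    ("Employee_ID", "emp_id", lambda x: x),
    ("Employee_Name", "full_name", lambda x: x.strip().title()),
]
# A new field has appeared in the Workday feed:
record = {"Employee_ID": "E123", "Employee_Name": "Jane Doe", "Cost_Center": "CC42"}
print(detect_unmapped_fields(record, rules))  # ['Cost_Center']
```

Surfacing the unmapped fields immediately is what lets the pipeline flag drift instead of silently dropping data.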
ML-Based Schema Matching

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def ml_schema_matching(src_cols, tgt_cols, train_pairs):
    # Positive examples: known-good (source, target) pairs
    X_train = [f"src: {s} tgt: {t}" for (s, t) in train_pairs]
    y_train = [1] * len(train_pairs)

    # Negative examples: a comparable number of non-matching pairs
    neg = []
    for s in src_cols:
        for t in tgt_cols:
            if (s, t) not in train_pairs:
                neg.append((s, t))
                if len(neg) >= len(train_pairs):
                    break
        if len(neg) >= len(train_pairs):
            break
    X_train += [f"src: {s} tgt: {t}" for s, t in neg]
    y_train += [0] * len(neg)

    vectorizer = TfidfVectorizer()
    X_vec = vectorizer.fit_transform(X_train)
    model = LogisticRegression().fit(X_vec, y_train)

    # Score every candidate pair and keep the best match above 0.5
    mapping = {}
    for s in src_cols:
        best_prob, best_t = 0, None
        for t in tgt_cols:
            prob = model.predict_proba(
                vectorizer.transform([f"src: {s} tgt: {t}"]))[0][1]
            if prob > best_prob:
                best_prob, best_t = prob, t
        if best_prob > 0.5:
            mapping[s] = best_t
    return mapping

# Example usage
src_cols = ["Employee_ID", "Employee_Name", "Department"]
tgt_cols = ["emp_id", "full_name", "department", "location"]
train_pairs = [("Employee_ID", "emp_id"), ("Employee_Name", "full_name")]

matches = ml_schema_matching(src_cols, tgt_cols, train_pairs)
print(matches)  # e.g., {'Employee_ID': 'emp_id', 'Employee_Name': 'full_name'}
```

This ML approach learns from example pairs and predicts the best match for each source column. It can generalize to new field names by learning semantics. As more mappings are confirmed, the model improves, reducing manual workload.
LLM-Assisted Mapping

```python
import os
import openai  # legacy (pre-1.0) OpenAI SDK interface

openai.api_key = os.getenv("OPENAI_API_KEY")

src = "['Employee_ID', 'Employee_Name', 'Dept']"
tgt = "['emp_id', 'full_name', 'department']"
prompt = f"""Map Workday fields to target fields:
Workday: {src}
Target: {tgt}
Answer with JSON mapping."""

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a data integration assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature=0
)
mapping = resp.choices[0].message['content']
print(mapping)
```

This code asks GPT-4 to output a JSON mapping. LLMs use contextual understanding to match fields, which helps with ambiguous cases, but it's crucial to verify the output against your schema to avoid errors.

Building the Integration Pipeline

An AI-assisted Workday pipeline might proceed as follows:

Extract: Pull data from Workday via its API or reports-as-a-service. Use Python's requests or a connector (e.g., CData) to query a Workday report.
Map/Transform: Apply the mapping logic to align Workday fields to the target schema.
Load: Write the transformed data to the destination (database, data lake, or another SaaS).
Monitor: Track pipeline health with logs/alerts. Include checks or an AI agent to spot anomalies (like schema drift or null spikes).

For instance, using CData's Workday connector and petl to load into Postgres:

```python
import cdata.workday as mod
import petl as etl

conn = mod.connect(
    "https://wd3-impl-services1.workday.com;Tenant=mytenant;"
    "ConnectionType=WQL;InitiateOAuth=GETANDREFRESH;")
query = "SELECT Employee_ID, Name_Full, Department FROM Worker"
table = etl.fromdb(conn, query)

# Rename columns to match target schema
table = table.rename('Employee_ID', 'emp_id') \
             .rename('Name_Full', 'full_name') \
             .rename('Department', 'department')

etl.todb(table, 'postgresql://user:pass@host/db', 'employees')
```

This streams Workday data into a Postgres table, applying simple renames.
In a real pipeline, you could insert ML or LLM mapping steps between fromdb and todb as needed.

Workday Integration Use Case

A common scenario is syncing Workday HR data into a cloud data warehouse for analytics. A daily ETL job might pull Workday's All Workers report, map fields (Employee_ID → employee_id, First_Name + Last_Name → full_name, Country → office_region), and load the results into a warehouse. Instead of manually coding each mapping, an ML model or GPT-4 can suggest them. For instance, an AI might infer that Workday's Country field should map to the office_region column, or that a Start_Date in one report is the same as Hire_Date in another. Modern ETL frameworks (like Apache Airflow) can orchestrate these tasks with AI steps validating or refining mappings on the fly. This accelerates development and eases maintenance, since the AI flags any new or changed fields as Workday evolves.

Best Practices

Verify AI Outputs: Always review and test AI-generated mappings before production.
Incremental Loads: Use timestamps or CDC to sync only new Workday records, improving efficiency.
Observability: Log pipeline metrics and set alerts. Include anomaly detection to catch issues early.
DevOps/CI-CD: Version-control all pipeline code and mapping configs. Automate testing so changes to mapping logic are validated.
Governance: Ensure secure auth (OAuth, encryption) and compliance for sensitive HR data.

In an era defined by data, building a scalable and flexible integration strategy is more critical than ever. AI-driven pipelines enable faster, smarter integration. Research suggests ML-driven mapping can cut data prep time by up to ~80%. By shifting routine mapping tasks to AI, engineers focus on higher-value work. For architects, this means faster rollouts of new integrations and more trustworthy data for analytics and decision-making.
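The incremental-load practice above can be as simple as filtering the extract query on a last-sync timestamp. A minimal sketch, assuming a WQL-style query; the lastModified field name is illustrative and must be replaced with whatever change-tracking field your Workday report actually exposes:

```python
from datetime import datetime, timezone

def incremental_query(last_sync: datetime) -> str:
    """Build a WQL-style query that fetches only records changed since
    the last sync. 'lastModified' is a placeholder field name; substitute
    your report's real change-tracking field."""
    ts = last_sync.strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "SELECT Employee_ID, Name_Full, Department FROM Worker "
        f"WHERE lastModified > '{ts}'"
    )

print(incremental_query(datetime(2025, 6, 1, tzinfo=timezone.utc)))
```

Persisting the timestamp after each successful load (in a state table or Airflow Variable) turns a full daily extract into a cheap delta sync.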
Workday Extend lets you build custom in-Workday apps that leverage Workday's data model, UI, and security. Extend apps are fully integrated into the Workday interface and can tap into Workday data via APIs and reports. In practice, a dashboard app on Extend will call Workday data services (native REST or "Report-as-a-Service" reports) behind the scenes, transform the results into chart-ready data, and render interactive charts or grids in the custom UI. The high-level architecture is:

Data Source (RaaS/REST): Workday HCM data exposed via custom RaaS reports or built-in REST endpoints.
Integration Layer: The Workday Extend app defines an integration that invokes those services and retrieves JSON/XML data.
Extend App & UI: The app's UI screens (built with Extend's UI builder or XSLT views) bind to this data, apply filters, and use chart components to display the analytics. Since Extend apps inherit Workday's UI framework, the dashboard feels like a native Workday report.
Security & Deployment: All data access goes through Workday's security framework. You assign minimal permissions to the Integration System User (ISU) or API client and only expose the needed fields.

Figure: An example Workday Prism Analytics dashboard. A similar visual style can be embedded in an Extend dashboard app using Workday's UI components and data; the app fetches data via REST/RaaS and displays charts in a page.

Architecture Overview

A typical self-service dashboard app flows as follows: the Extend app launches a request, then Workday's Integration Cloud services call a RaaS or REST endpoint. For example, you might create an Advanced Custom Report named Worker Summary in Workday and check "Enable as Web Service". Workday generates a REST URL returning JSON/XML. The Extend integration invokes this URL and gets back Report_Entry XML nodes for each worker. The app's XSLT or data mapping then picks out fields into the UI model.
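The RaaS call at the heart of this flow is easy to prototype outside Extend. The sketch below builds a prompt-parameterized RaaS URL in the shape shown later in this article; the tenant, report owner, report name, and prompt names are placeholders, and the commented requests call assumes an ISU with access to the shared report:

```python
from urllib.parse import urlencode

def raas_url(base, tenant, owner, report, **prompts):
    """Build a Workday RaaS URL with prompt values as query parameters.
    Path segments (tenant/owner/report) and prompt names are placeholders."""
    qs = urlencode({**prompts, "format": "json"})
    return f"{base}/ccx/service/customreport2/{tenant}/{owner}/{report}?{qs}"

url = raas_url(
    "https://wd2-impl-services1.workday.com",
    "MyTenant", "HR", "Employee_Summary",
    SupervisorOrg="ORG123",
)
print(url)

# Actual call (requires network and an ISU the report is shared with):
# import requests
# resp = requests.get(url, auth=("ISU_Dashboard@MyTenant", "secret"))
# entries = resp.json()  # verify the JSON shape against your report
```

Prototyping the URL and auth this way, before wiring the Extend integration, makes it much easier to debug report sharing and prompt issues.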
Architecturally, this keeps data close to the business context: the dashboards live inside Workday, and Extend apps automatically appear in Workday menus or worklets alongside standard reports. The app can also use Workday's Orchestrations if needed to sequence multiple API calls or pre-process data before rendering. All in all, the pieces are: Workday data (in HCM/FIN objects) → RaaS/REST call → Extend app logic → UI screen (charts/widgets).

Using Report-as-a-Service (RaaS)

Workday's Report-as-a-Service is a simple way to expose HR data. Any Advanced Custom Report in Workday can be marked Enable as Web Service and turned into a RESTful endpoint. You add all required fields to the report, set filters/prompts, and share the report with an Integration System User. After saving, use the View URLs related action to get the JSON/XML endpoint for your report. You can even embed prompts as query parameters in the URL to fetch filtered data. For example, a RaaS URL might look like:

```
https://wd2-impl-services1.workday.com/ccx/service/customreport2/MyTenant/HR/Employee_Summary?SupervisorOrg={orgId}&_startDate={date}&_endDate={date}&format=json
```

The Extend integration can call this and receive a payload like:

```xml
<wd:Report_Data xmlns:wd="urn:com.workday.report/Employee_Summary">
  <wd:Report_Entry>
    <wd:Worker_Name>Jane Doe</wd:Worker_Name>
    <wd:Job_Title>Engineer</wd:Job_Title>
    <wd:Compensation>90000</wd:Compensation>
    <wd:Department>R&amp;D</wd:Department>
    <!-- more fields -->
  </wd:Report_Entry>
  <!-- more entries -->
</wd:Report_Data>
```

Your Extend app's XSLT or code can transform these wd:Report_Entry items into JSON for the UI chart component.
For example:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:wd="urn:com.workday.report/Employee_Summary"
                version="2.0">
  <xsl:template match="/wd:Report_Data">
    <DashboardData>
      <xsl:for-each select="wd:Report_Entry">
        <Employee>
          <Name><xsl:value-of select="wd:Worker_Name"/></Name>
          <Title><xsl:value-of select="wd:Job_Title"/></Title>
          <Dept><xsl:value-of select="wd:Department"/></Dept>
          <Salary><xsl:value-of select="wd:Compensation"/></Salary>
        </Employee>
      </xsl:for-each>
    </DashboardData>
  </xsl:template>
</xsl:stylesheet>
```

This extracts key fields into a simplified XML/JSON model that the dashboard UI can bind to. In short, RaaS makes it easy to pull Workday report data as JSON/XML for any external or Extend consumer.

Using Workday REST APIs

Workday also provides RESTful web services for core objects. Unlike RaaS, these are out-of-the-box APIs. The Workday REST API supports standard HTTP methods (GET/POST/PUT/DELETE) and returns JSON. For example, to list workers you call:

```
GET https://wd2-impl-services1.workday.com/ccx/api/v1/<Tenant>/workers
```

with OAuth2 Bearer token authorization. You obtain the OAuth token by registering an API client in Workday, then exchanging its credentials and a refresh token for an access token. Workday REST endpoints cover objects like workers, time_off_requests, departments, etc., and responses return the relevant fields as JSON. This approach is useful for real-time queries or if you need to create/update records. For dashboards, you typically use GET queries and retrieve JSON that you then feed into your UI.
For example, a code snippet (pseudocode) calling a REST endpoint in an Extend app might look like:

```xml
<IntegrationDefinition id="Get_Workers">
  <HttpRequest>
    <Url>https://wd2-impl-services1.workday.com/ccx/api/v1/MyTenant/workers</Url>
    <Method>GET</Method>
    <Headers>
      <Header name="Authorization" value="Bearer ${access_token}"/>
      <Header name="Accept" value="application/json"/>
    </Headers>
  </HttpRequest>
</IntegrationDefinition>
```

This returns JSON like:

```json
{
  "workers": [
    {"id": "123", "name": "Jane Doe", "title": "Engineer", "supervisory_org": "R&D"}
  ]
}
```

You can bind this JSON into your Extend screen data model or transform it as needed. In summary, REST APIs give live access to Workday data using OAuth2-secured HTTP calls.

Development Best Practices

Modular Integrations: Keep RaaS reports small and focused. For large datasets, use filters or prompt parameters so the dashboard only retrieves needed records. Consider multiple reports (or API calls) if you need different data slices.
Efficient XSLT/Data Transforms: Write XSL or JavaScript transformations to extract and reshape only the fields you need. Use grouping or XML-to-JSON functions to simplify outputs. For large JSON, paginate the requests at the report level or limit the date range.
UI Design: Use Extend's screen builder or custom XSL views to create charts (bar, pie, line) and tables. Allow user inputs by adding Workday prompt pickers on the screen, then re-invoke the integration with those parameters. The UI should feel native and responsive. Remember that Extend apps support responsive design for mobile.
Error Handling: Catch and log any integration errors. For example, wrap the REST calls in an Orchestration step with try/catch, and display a friendly message if the API fails. Also validate input prompts on the client side before sending.
Reusability: If multiple dashboards need the same base data, reuse the same RaaS report or API integration. Use Extend's shared components or sub-flows for common logic.
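The OAuth2 step from the REST section, exchanging the API client's credentials and refresh token for an access token, can be sketched as follows. The token path /ccx/oauth2/{tenant}/token and the parameter names follow the standard refresh_token grant; verify both against your tenant's API client configuration, and note the client values below are placeholders:

```python
def token_request(base, tenant, client_id, client_secret, refresh_token):
    """Build the refresh_token grant request for a Workday API client.
    Returns (url, form_data) ready to POST; credentials here are placeholders."""
    url = f"{base}/ccx/oauth2/{tenant}/token"
    data = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return url, data

url, data = token_request(
    "https://wd2-impl-services1.workday.com", "MyTenant",
    "my-client-id", "my-client-secret", "my-refresh-token",
)
print(url)

# Actual exchange (requires network access and real credentials):
# import requests
# access_token = requests.post(url, data=data).json()["access_token"]
# headers = {"Authorization": f"Bearer {access_token}"}
```

Keeping the token exchange in one helper makes it easy to store the secrets in secure credential storage rather than scattering them through UI code, which is exactly the practice the next section recommends.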
Security and Governance

Workday Extend apps inherit Workday's security model; use this to your advantage, since no separate login is needed. However, the integrations that fetch data must be carefully secured:

Integration System Users (ISUs): Create a dedicated ISU for your dashboard app. Each integration (REST call or RaaS) should use its own ISU or API client with minimal privileges. The ISU's security group should have only Get/Put access to the specific integration services and only View access to the underlying data domains.
Report Sharing: Only share RaaS reports with the specific ISU accounts; do not leave RaaS open to all. Always use prompts or filters to avoid inadvertently exposing all employee data.
OAuth Tokens: Store OAuth client secrets and tokens securely. Workday Extend can manage OAuth in its connection settings, but never hard-code tokens in UI code. Use secure credential storage or the built-in OAuth client in Extend/Orchestrate.
Least Privilege: Avoid giving broad access. The principle is to "only let approved users and systems access your RaaS endpoints". Review domain security policies so that even if someone had the URL, they could not get data unless they are in the correct security group.
Audit & Monitoring: Workday automatically logs integration events. Set up alerts or use the Integration Reports to monitor how often the dashboard runs. Also watch performance: if a RaaS report slows down, consider optimizing it.

By following Workday's security framework, you ensure your dashboard app meets enterprise security standards. In short, treat the dashboard app just like any other Workday integration: use an ISU, scope its permissions tightly, and use HTTPS plus OAuth/WS-Security on all calls.

Summary

A Workday Extend dashboard app brings analytics into the user's flow of work by combining Workday's data with custom UI. In practice, you expose HR data via RaaS reports or REST APIs, call them from your Extend app, and bind the results to charts.
Key steps include designing optimized reports, coding the data transformation and building the screens. Leveraging Workday’s built-in integration cloud and security model means you get single sign-on, role-based access control, and real-time data all in one. With careful design you can deliver responsive, drillable self-service dashboards in Workday that give HR users up-to-date insights without ever leaving the system.
Integrating Workday with third-party payroll systems is crucial for organizations that use Workday HCM but rely on external payroll providers globally. Workday provides Cloud Connect for Third-Party Payroll, a set of pre-built integration frameworks to securely send HR data changes to, and receive payroll results from, external payroll systems. Two key Workday integration methods in this context are PICOF (Payroll Integration Common Output File) and PECI (Payroll Effective Change Interface). PICOF and PECI serve as the outbound interfaces from Workday to payroll vendors, and together with inbound result integrations they enable a full end-to-end payroll data flow. This article provides a brief primer on these integration approaches and how a Workday consultant can design an end-to-end payroll integration using PICOF/PECI, from sending employee data changes to pulling back payroll results.

Workday Integration Basics for Payroll (Brief Primer)

Workday's Integration Cloud allows building integrations without external middleware, using delivered templates and tools. For payroll, Workday acts as the system of record for HR and compensation data, while actual payroll calculation happens in a third-party system. The goal of an integration is to keep these systems in sync. Workday's Cloud Connect for Third-Party Payroll provides delivered connector templates that can be configured to extract relevant data changes from Workday and output them in a file for the payroll provider. These connectors support bidirectional data exchange, meaning they not only send data to the payroll system but can also receive data back for a seamless global payroll experience. In practice, outbound integrations (Workday to payroll) are often scheduled around each pay period to send new hires, terminations, compensation changes, and other updates, while inbound flows bring back processed payroll results into Workday for reporting and employee self-service.
Workday integration templates greatly simplify development: a Workday consultant can create a new Integration System using a payroll connector template, configure the included data fields, and let Workday handle the data extraction logic. Security is managed via integration system user accounts to ensure only necessary data is accessed and transferred. With these basics in mind, let's examine PICOF and PECI, the two primary outbound payroll interface formats, and how they differ. Outbound Payroll Data Integration: PICOF vs. PECI PICOF and PECI are Workday’s two generations of payroll outbound integrations. They serve a similar purpose, extracting changes in worker data from Workday to send to the payroll vendor, but with different philosophies and capabilities. Below is an overview of their differences and evolution: Data Captured (Top-of-Stack vs. Full History) PICOF outputs worker data largely in a top-of-stack manner: for each worker, it captures the final state of each data element within the pay period. This means that if multiple changes occur in one period, some intermediate changes might not appear, risking missed transactions unless additional audits are done. In contrast, PECI was designed to capture the full stack of effective changes: it transmits each change event in the order it occurred, giving complete visibility of all payroll-relevant events during the period. Handling of Corrections and Rescinds With PICOF, certain actions like rescinding a hire or correcting a transaction are not explicitly flagged in the output; Workday would generate a separate HTML report for manual review of rescinds/corrections, and the integration output may need manual adjustments. PECI introduces automated change labeling: every correction or rescinded event is annotated in the data file so the payroll system can process them systematically.
This eliminates manual intervention and reduces errors, as corrections are directly included in the interface file rather than handled through offline ticketing or reports. Ongoing Support and Features PICOF is considered an end-of-life product; Workday no longer provides enhancements to it. It was the standard prior to 2016, but since then Workday has shifted focus to PECI. PECI is the current standard and is continually improved with new features. For instance, Workday’s Event-Driven Integration (EDI) for real-time triggering of urgent changes is only compatible with PECI, not with PICOF. Similarly, PECI connectors introduced features like tracking the last extracted date at each worker level to avoid duplicate data, and they can even capture future-dated changes to send proactively. Choosing PECI “future-proofs” the integration, ensuring support for the latest Workday innovations. Integration Output and Flexibility Both PICOF and PECI output data in a structured format. PICOF historically could be run in two modes: a changes-only mode or a full-file mode that outputs all current data for all employees. PECI by design sends only changes by default, which is more efficient: only employees who got a new package or changed data are sent each run. However, PECI also supports a full snapshot on demand if an initial full file or resynchronization is needed. Both integration types allow configuration of which data sections to include, and Workday allows adding custom fields or value mappings so the output aligns with the payroll provider’s required format. PECI offers field-override flexibility in most sections, making it easier to incorporate additional fields than older PICOF mappings. Vendor Compatibility One deciding factor between PICOF and PECI is the payroll vendor’s capability to consume the data. PECI sends a detailed sequence of events, which requires the vendor’s payroll system to handle incremental changes and possibly multiple records per employee.
Some simpler payroll systems may only accept a single consolidated record per employee per period. In those cases, PICOF might be used as a simpler feed. Workday’s guidance is to use PECI for new integrations whenever possible, but if a vendor cannot support full-stack changes, PICOF is a fallback. Additionally, if the population is extremely large, PICOF may perform better, or multiple PECI integrations may be needed. Workday also provides a variant called WECI to handle changes for contingent workers; since PECI’s standard output excludes contingent workers, a consultant should include WECI if the payroll vendor needs contractor data as well. In summary, PICOF was the original common output file for payroll integration, but it has limitations in change visibility and automation. PECI is the “game-changer” that addresses those gaps by sending every relevant change with effective dates, greatly improving payroll data integrity and reducing manual effort. Given Workday’s direction, PECI is now the recommended approach for outbound payroll integrations, with PICOF only used in special scenarios. Designing the End-to-End Integration Flow A successful end-to-end payroll integration covers both outbound and inbound data flows. The diagram below illustrates the high-level flow between Workday and a third-party payroll system: Workday-to-Payroll outbound flow (via PICOF/PECI) and Payroll-to-Workday inbound flow for results. Workday’s Cloud Connect provides delivered templates to facilitate these integrations. 1. Building the Outbound PICOF/PECI Integration In Workday, you would start by creating an Integration System using the appropriate template. For example, for a PICOF integration you choose the Payroll Interface Common Output File template when creating the integration system. Workday then presents a set of configurable Integration Services, which represent categories of data that can be included.
As a consultant, you select which data services the payroll vendor needs and configure any integration attributes and integration maps. Once configuration is done, the integration system will generate an output file. If the payroll provider requires a different format, Workday can often accommodate that via built-in format options or by using a transformation step. In some cases, Workday Studio or an Enterprise Interface Builder (EIB) might be used to transform the XML to CSV or apply custom logic, but Workday’s delivered payroll integrations cover most needs. Scheduling & Triggers Typically, the outbound integration is scheduled to run at a regular interval. PECI integrations can also leverage event-driven triggers for certain changes, ensuring timely updates for critical events when supported. Each run will capture all changes since the last run. It’s important to coordinate the timing with the payroll calendar. Testing & Validation A best practice is to test the outbound file thoroughly with various scenarios: multiple job changes in one period, retroactive changes, rescinds, leaves of absence, etc., to ensure the integration logic captures them correctly. Workday provides a Payroll Data Audit report that shows exactly which workers and fields were included in the output, which is extremely helpful to verify completeness. If something seems missing, Workday consultants should play “integration detective,” checking if the worker should have been picked up and whether filters or inclusion rules need adjustment. By iterating on configurations and using Workday’s integration event logs and error reports, you can ensure the outbound feed is accurate and trusted by the payroll team. 2. Setting Up the Inbound (Payroll Results) Integration End-to-end integration doesn’t stop at sending data out. Once the payroll provider processes the payroll, key results often need to come back into Workday. 
Workday’s Cloud Connect supports importing various payroll outputs for a unified experience. Common inbound elements include payroll results, paycheck or payslip information, and year-end tax documents. To bring these into Workday, one might use a delivered EIB or connector if provided by the payroll partner, or configure a custom integration. The inbound integration is usually scheduled to run after each payroll is completed on the vendor side. It might be initiated by the vendor placing a results file on an SFTP server which Workday pulls, or the vendor could call a Workday web service. In either case, once imported, Workday can utilize this data for composite reports and for user access. Essentially, this step closes the loop, ensuring Workday is updated with actual payroll outcomes so that it remains the “single source of truth” for both HR and high-level pay data. 3. Security and Data Governance Throughout the integration setup, ensure that data is handled securely. Use secure protocols for file transfers. Workday integration system users should have access only to the necessary domains. Auditing who can launch or modify integrations is also important. Workday provides an Integration Dashboard to monitor runs, and you can set up error alerts so issues are promptly addressed. 4. Testing End-to-End It’s advisable to run parallel tests with the legacy process when replacing an old integration or implementing a new one. Compare payroll results to ensure no discrepancies. Pay special attention to complex scenarios. Both outbound and inbound pieces should be tested together. 5. Cutover and Beyond Once deployed, the integration should be closely monitored for the first few payroll cycles. Workday’s logging and the payroll vendor’s feedback are invaluable. Over time, maintain the integration by applying any needed updates. If using PECI, stay current with Workday’s releases as they continue to enhance PECI’s capabilities.
Remember that PICOF will not receive such enhancements, so if you’re still on PICOF, plan a future migration to PECI to avoid missing out on improvements and support. Conclusion Designing an end-to-end payroll integration in Workday using PICOF or PECI involves a thorough understanding of both the Workday integration tools and the payroll provider’s needs. For Workday consultants, PECI has become the tool of choice due to its robust handling of change data, automated corrections, and alignment with Workday’s future roadmap. PICOF, while still usable, represents an older paradigm and should mainly be considered in scenarios where a payroll system cannot handle the richness of PECI’s output, or for other edge considerations. By leveraging Workday’s delivered connector templates, configuring the integration with the right data fields and mappings, and setting up inbound result flows, one can achieve a seamless integration where Workday and the payroll system operate in concert. The result is that employees and administrators get a unified experience: workers see their up-to-date pay information in Workday, and the organization benefits from accurate, timely payroll data for compliance and analysis. In summary, building a Workday payroll integration with PICOF/PECI is about enabling data unity between HR and payroll: Workday captures every HR change and transmits it reliably to payroll, then payroll outputs are fed back to keep Workday in sync. This end-to-end loop, when done right, ensures that despite using separate systems, the HR and payroll domains function as one cohesive whole, reducing errors, saving manual effort, and providing confidence in the data for all stakeholders.
This article explores the transformative potential of integrating artificial intelligence (AI)-driven insights with MuleSoft and AWS platforms to achieve scalable enterprise solutions. This integration promises to enhance enterprise scalability through predictive maintenance, improve data quality through AI-driven data enrichment, and revolutionize customer experiences across industries like healthcare and retail. Furthermore, it emphasizes navigating the balance between centralized and decentralized integration structures and highlights the importance of dismantling data silos to facilitate a more agile and adaptive business environment. Enterprises are encouraged to invest in AI skills and infrastructure to leverage these new capabilities and maintain competitive advantage. Introduction Not long ago, I had one of those "aha" moments while working late at our Woodland Hills office. Picture this: I was elbows-deep in the spaghetti of our MuleSoft integrations, and it hit me — what if we could fuse our conventional setup with AI-driven insights to revolutionize our enterprise scalability? As someone who has spent countless hours with MuleSoft and AWS, toggling between Anypoint Platform and cloud paradigms, I realized we were standing on the precipice of something transformative. The Magic of AI-Augmented Integration Platforms The trend of merging AI with platforms like MuleSoft is becoming a game-changer. Think about it — self-optimizing integration pipelines that don't just react but predict. AI-driven anomaly detection is no longer a futuristic notion but a present-day reality. A critical takeaway here is that enterprises must shift their focus toward building predictive maintenance into their integration solutions. This isn't just about reducing downtime; it's about reliability, a quality all stakeholders crave. Here's a personal aside: in one of my projects at TCS, we faced repeated disruptions due to undetected anomalies in our pipeline.
After integrating an AI-centric approach using AWS’s AI/ML services, we saw a 30% decrease in system alerts. It felt like watching a well-oiled machine where everything just fit. It was hard work getting there, but the reduced manual monitoring was worth every bit of effort. Centralized Control vs. Decentralized Agility Let's face it — a debate that's been brewing is centralized versus decentralized integration. I'm of two minds here. Centralized platforms like MuleSoft offer comprehensive control, yet there's a strong argument for decentralized, microservices-led frameworks powered by AI. These can make autonomous decisions at the edge, thus providing agility. In practice, evaluating trade-offs is crucial. During Farmers Insurance projects, we struggled with balancing centralized governance with the nimbleness of decentralized systems — often a tug-of-war. Through trial and error, we realized that a hybrid approach, leveraging MuleSoft for core integrations while empowering microservices with AI-driven intelligence, struck the right chord. The key was not in choosing sides but in finding harmony between the two. Cross-Industry Applications: Breaking the Mold AI-driven insights aren’t limited to tech giants — they're creeping into retail and healthcare, too. In a recent pilot, we explored using MuleSoft solutions in a healthcare setting, where real-time data processing played a critical role in patient interactions. The challenge was integrating vast datasets, something AI handled adeptly. The result? Improved patient engagement and faster response times. In another example, a retail client used AI integration to enrich customer experiences, from personalized offers to stock predictions. You might say these are exceptions, not the rule, but they demonstrate the potential of cross-industry applications. The lesson here? Look beyond traditional tech spaces for unique use cases and new revenue streams. 
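The monitoring win came from AWS's managed AI/ML services, but the underlying idea is simple enough to sketch. Below is a hand-rolled z-score check on pipeline latencies — a toy stand-in with invented data and threshold, not the approach actually deployed:

```python
import statistics

def flag_anomalies(samples, threshold=2.5):
    """Flag samples more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    # Guard against zero deviation (all samples identical)
    return [x for x in samples if stdev and abs(x - mean) / stdev > threshold]

latencies_ms = [120, 118, 125, 122, 119, 121, 980, 117]
print(flag_anomalies(latencies_ms))  # the 980 ms spike stands out
```

Note that in a window this small, a single outlier caps the achievable z-score (here about 2.65), which is why the threshold is 2.5 rather than the textbook 3.0; production systems use larger windows or learned baselines.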
AI-Driven Data Enrichment: A Technical Deep Dive One of the lesser-known but powerful capabilities of AI is data enrichment. Within MuleSoft and AWS environments, machine learning algorithms are at work to refine and enhance data for superior analytics. It's like having a data wizard on your team. In practical terms, we deployed advanced algorithms to improve data quality at Farmers Insurance. The challenge was ensuring seamless integration without disrupting existing architectures — a frequent pain point. This experience taught us the importance of innovative middleware solutions to streamline AI insights integration. The result? Enhanced data accuracy and business intelligence, empowering informed decision-making. Lessons from the Trenches: Navigating Market Dynamics Market dynamics are shifting rapidly, but the struggle with siloed data persists. Inefficient integration architectures can be a thorn in the side of digital transformation. Here, AI-driven insights can play a crucial role. In a project where data silos were hindering progress, we revamped our strategy. By prioritizing AI integrations, we dismantled these silos, resulting in a more fluid and flexible system. The critical lesson was understanding that breaking down silos is just as important as building new integrations. A balance of both ensures scalable and adaptive solutions. Future Horizons: Preparing for the AI Revolution The enterprise integration landscape is on the cusp of a new era. AI-driven insights will automate decision-making and predictive analytics, fundamentally changing business operations and competitive dynamics. To stay ahead, it's imperative for companies to invest in AI skills and infrastructure. In my own journey, continuous learning and adaptation have been key. Embracing new technologies and methodologies isn't just a requirement — it's an ongoing pursuit of excellence. And yes, I still hit roadblocks. 
There's always more to learn, more to implement, but that's what makes this field so exciting. Conclusion: Embracing the Transformation Integrating AI-driven insights with MuleSoft and AWS opens doors to innovation and competitiveness. As we stand on the verge of this transformation, the opportunities are vast. By focusing on emerging trends, questioning conventions, and exploring new applications, enterprises can unlock unprecedented value. In conclusion, if you're like me, sipping a coffee and wondering how to elevate your integration game, take the leap. Blend AI with your MuleSoft and AWS strategy, embrace imperfections, learn from every hiccup, and watch your enterprise soar to new heights.
If you've been building with AI agents, you've probably hit the same wall I did: your agent needs to do things — query databases, call APIs, check systems — but wiring up each tool is a bespoke integration every time. The Model Context Protocol (MCP) solves this by giving agents a standard way to discover and invoke tools. Think of it as USB-C for AI tooling. The problem? Most MCP tutorials stop at "run it locally with stdio." That's fine for solo dev work, but it falls apart the moment you need multiple clients connecting to the same server; auth, session isolation, and scaling; and a deployment that doesn't die when your laptop sleeps. AWS Bedrock AgentCore Runtime changes the equation. You write an MCP server, hand it over, and AgentCore handles containerization, scaling, IAM auth, and session isolation — each user session runs in a dedicated microVM. No ECS clusters to configure. No load balancers to tune. In this post, we'll build a practical MCP server from scratch, deploy it to AgentCore Runtime, and connect an AI agent to it. The whole thing takes about 30-60 minutes. What We're Building We'll create an MCP server that exposes infrastructure health tools — the kind of thing a DevOps agent would use to check system status, list recent deployments, and surface alerts. It's more interesting than a dice roller but simple enough to follow. Here's the architecture: your agent connects via IAM auth → AgentCore discovers the tools → your server executes them → results stream back. You never manage servers, containers, or networking.
Prerequisites Before we start, make sure you have: Python 3.10+ and uv (or pip — but uv is faster); the AWS CLI configured with credentials that have Bedrock AgentCore permissions; Node.js 18+ (for the AgentCore CLI); and an AWS account with AgentCore access (there's a free tier). Install the AgentCore tooling:

```shell
# AgentCore CLI
npm install -g @aws/agentcore

# AgentCore Python SDK
pip install bedrock-agentcore

# AgentCore Starter Toolkit (handles scaffolding + deployment)
pip install bedrock-agentcore-starter-toolkit
```

Step 1: Build the MCP Server Create your project structure:

```shell
mkdir infra-health-mcp && cd infra-health-mcp
uv init --bare
uv add mcp bedrock-agentcore
```

Now create server.py. We'll use FastMCP, which gives us a decorator-based API for defining tools:

```python
from mcp.server.fastmcp import FastMCP
from datetime import datetime, timedelta
import random

mcp = FastMCP("infra-health")

@mcp.tool()
def get_service_status(service_name: str) -> dict:
    """Check the health status of a deployed service.

    Args:
        service_name: Name of the service to check
            (e.g., 'api-gateway', 'auth-service', 'payments')
    """
    # In production, this would hit your monitoring API
    statuses = ["healthy", "healthy", "healthy", "degraded", "unhealthy"]
    uptime = round(random.uniform(95.0, 99.99), 2)
    return {
        "service": service_name,
        "status": random.choice(statuses),
        "uptime_percent": uptime,
        "last_checked": datetime.utcnow().isoformat(),
        "active_instances": random.randint(2, 10),
        "avg_latency_ms": round(random.uniform(12, 250), 1)
    }

@mcp.tool()
def list_recent_deployments(hours: int = 24) -> list[dict]:
    """List deployments that occurred in the last N hours.

    Args:
        hours: Number of hours to look back (default: 24)
    """
    services = ["api-gateway", "auth-service", "payments",
                "notification-svc", "user-profile"]
    deployers = ["ci-pipeline", "ci-pipeline", "hotfix-manual"]
    deployments = []
    for i in range(random.randint(1, 5)):
        deploy_time = datetime.utcnow() - timedelta(
            hours=random.randint(1, hours)
        )
        deployments.append({
            "service": random.choice(services),
            "version": f"v1.{random.randint(20,45)}.{random.randint(0,9)}",
            "deployed_at": deploy_time.isoformat(),
            "deployed_by": random.choice(deployers),
            "status": random.choice(["success", "success", "rolled_back"])
        })
    return sorted(deployments, key=lambda d: d["deployed_at"], reverse=True)

@mcp.tool()
def get_active_alerts(severity: str = "all") -> list[dict]:
    """Retrieve currently active infrastructure alerts.

    Args:
        severity: Filter by severity level - 'critical', 'warning', 'info', or 'all'
    """
    alerts = [
        {
            "id": "ALT-1024",
            "severity": "warning",
            "message": "auth-service p99 latency above threshold (>500ms)",
            "triggered_at": (
                datetime.utcnow() - timedelta(minutes=23)
            ).isoformat(),
            "service": "auth-service"
        },
        {
            "id": "ALT-1025",
            "severity": "critical",
            "message": "payments service error rate at 2.3% (threshold: 1%)",
            "triggered_at": (
                datetime.utcnow() - timedelta(minutes=8)
            ).isoformat(),
            "service": "payments"
        },
        {
            "id": "ALT-1026",
            "severity": "info",
            "message": "Scheduled maintenance window in 4 hours",
            "triggered_at": (
                datetime.utcnow() - timedelta(hours=2)
            ).isoformat(),
            "service": "all"
        },
    ]
    if severity != "all":
        alerts = [a for a in alerts if a["severity"] == severity]
    return alerts

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
```

Key decisions here: each tool has a clear docstring with typed args — this is what the LLM sees when deciding which tool to call, so be descriptive; we're using streamable-http transport, which is what AgentCore Runtime expects; and in production, you'd replace the mock data with calls to Datadog, CloudWatch, your deployment system,
etc. Step 2: Test Locally Before deploying anything, make sure the server works:

```shell
# Start the server
uv run server.py
```

In another terminal, test it with the MCP inspector or a quick curl:

```shell
# Using the MCP CLI inspector
npx @modelcontextprotocol/inspector http://localhost:8000/mcp
```

You should see your three tools listed. Click through them, pass some args, verify the responses look right. Fix any issues now — it's much faster than debugging after deployment. Step 3: Prepare for AgentCore Runtime AgentCore Runtime needs your server wrapped with the BedrockAgentCoreApp. Update server.py by adding this at the top and modifying the entrypoint:

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp

# ... (keep all your existing tool definitions) ...

# Replace the if __name__ block:
app = BedrockAgentCoreApp()

@app.entrypoint()
def handler(payload):
    return mcp.run(transport="streamable-http")

if __name__ == "__main__":
    app.run()
```

Alternatively, use the AgentCore Starter Toolkit to scaffold the project structure automatically:

```shell
agentcore init --protocol mcp
```

This generates the Dockerfile, IAM role config, and agentcore.json for you. Copy your server.py into the generated project and point the entry point to it. Step 4: Deploy to AWS This is the part that used to take hours of ECS/ECR/IAM wrangling. With the Starter Toolkit, it's two commands:

```shell
# Configure (generates IAM roles, ECR repo, build config)
agentcore configure

# Deploy (builds container via CodeBuild, pushes to ECR,
# deploys to AgentCore Runtime)
agentcore deploy
```

That's it. No Docker installed locally. No Terraform. CodeBuild handles the container image, and AgentCore Runtime manages the rest. The output gives you a Runtime ARN — save this, you'll need it to connect your agent.
Step 5: Invoke Your Deployed Server Test the deployed server using the AWS CLI:

```shell
aws bedrock-agent-runtime invoke-agent-runtime \
  --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \
  --payload '{"jsonrpc":"2.0","method":"tools/list","id":1}' \
  --output text
```

You should see your three tools returned. Now try calling one:

```shell
aws bedrock-agent-runtime invoke-agent-runtime \
  --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \
  --payload '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"get_active_alerts","arguments":{"severity":"critical"}},"id":2}' \
  --output text
```

Step 6: Connect an AI Agent Now the fun part. Let's wire this up to a Strands agent that can use our infrastructure tools conversationally:

```python
from strands import Agent
from strands.tools.mcp import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# Connect to your deployed MCP server via IAM auth
mcp_client = MCPClient(
    lambda: streamablehttp_client(
        url="https://your-agentcore-endpoint/mcp",
        # IAM auth is handled automatically via your AWS credentials
    )
)

with mcp_client:
    agent = Agent(
        model="us.anthropic.claude-sonnet-4-20250514",
        tools=mcp_client.list_tools_sync(),
        system_prompt="""You are a DevOps assistant with access to
        infrastructure health tools. When asked about system status,
        check services, review recent deployments, and surface any
        active alerts. Be concise and flag anything that needs
        immediate attention."""
    )

    response = agent(
        "Give me a quick health check — any services having issues? "
        "And were there any recent deployments that might be related?"
    )
    print(response)
```

The agent will automatically discover the tools, decide which ones to call, and synthesize the results into a coherent answer. You'll see it call get_active_alerts, then get_service_status for the flagged services, then list_recent_deployments to correlate — all without you writing any orchestration logic.
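Before pointing real integrations at this agent, one pattern worth adopting: have each tool catch failures and return a structured error payload rather than letting exceptions surface as stack traces. A minimal sketch of that pattern, where fetch_status_from_monitoring is a hypothetical stand-in for your real monitoring call:

```python
def fetch_status_from_monitoring(service_name: str) -> dict:
    # Hypothetical integration; replace with a real CloudWatch/Datadog call.
    if not service_name:
        raise ValueError("service_name is required")
    return {"service": service_name, "status": "healthy"}

def get_service_status_safe(service_name: str) -> dict:
    """Tool body that reports failures as data the LLM can reason about."""
    try:
        return {"ok": True, "data": fetch_status_from_monitoring(service_name)}
    except Exception as exc:
        return {"ok": False, "error": type(exc).__name__, "message": str(exc)}

print(get_service_status_safe(""))  # structured error, not a traceback
```

The agent can then read the error field and decide to retry, ask for a valid service name, or report the failure, instead of the MCP call simply blowing up.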
What AgentCore Gives You for Free It's worth pausing to appreciate what you didn't have to build:

| Concern | Without AgentCore | With AgentCore |
|---|---|---|
| Container infra | ECR + ECS/EKS + ALB | Handled |
| Session isolation | Custom session management | microVM per session |
| Auth | OAuth setup, token management | IAM SigV4 built in |
| Scaling | Auto-scaling policies, metrics | Automatic |
| Networking | VPC, security groups, NAT | Managed |
| Health checks | Custom implementation | Built in |

You wrote a Python file with tool definitions. Everything else is infrastructure you didn't touch. Production Considerations Before going live with real data, a few things to think about: Replace mock data with real integrations. The tool signatures stay the same — swap random.choice(statuses) with a call to your CloudWatch API, PagerDuty, or whatever you use. Add error handling. MCP tools should return meaningful errors, not stack traces. Wrap your integrations in try/except and return structured error responses. Think about tool granularity. Three focused tools are better than one "do everything" tool. The LLM needs clear, specific tool descriptions to make good decisions about what to call. Stateful vs. stateless. Our server is stateless (the default and recommended mode). If you need multi-turn interactions where the server asks the user for clarification mid-execution, look into AgentCore's stateful MCP support with elicitation and sampling. Connect to AgentCore Gateway. If your agent needs tools from multiple MCP servers, the Gateway acts as a single entry point that discovers and routes to all of them. You can also use the Responses API with a Gateway ARN to get server-side tool execution — Bedrock handles the entire orchestration loop in a single API call. Cleanup When you're done experimenting:

```shell
agentcore destroy
```

This tears down the Runtime, CodeBuild project, IAM roles, and ECR artifacts. You'll be prompted to confirm. What's Next?
A few directions to take this further: add a Gateway to combine your MCP server with AWS's open-source MCP servers (S3, DynamoDB, CloudWatch, etc.) into a single agent toolkit, or try the AG-UI protocol alongside MCP — it standardizes how agents communicate with frontends, enabling streaming progress updates and interactive UIs. References

- https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html
- https://github.com/strands-agents/sdk-python
- https://aws.amazon.com/solutions/guidance/deploying-model-context-protocol-servers-on-aws
A global case management system depends on a telephony surface to bind a live call to a customer record. When a call arrives, an external CTI frame loads inside Lightning, identifies the caller, resolves the account, and anchors the interaction to an open case. That binding is logged, audited, and later referenced by downstream analytics and compliance reviews. The desk assumes that if the page renders and the integration was validated during implementation, the identity chain will hold for the life of the system. That assumption rests on a boundary contract most teams never model explicitly. The CTI frame is not native to Lightning. It is served from an external origin, evaluated by the browser, and permitted or rejected according to the Content Security Policy. When engineers add a trusted site and confirm the frame loads in a sandbox, they implicitly conclude that the integration is stable. In reality, the embed is evaluated every time the page renders, under the CSP enforcement rules active at that moment. The trust chain between call, identity resolution, and case binding depends on a security boundary that can evolve independently of application logic. This is not a configuration detail. It is a lifecycle risk. Lightning integrations frequently extend beyond telephony. Scheduling platforms, analytics dashboards, payment processors, document renderers, and knowledge widgets all introduce external surfaces. Each embed assumes that the configured origin remains both valid and resolvable under the current enforcement regime. The system behaves correctly only if that assumption continues to hold. The more embedded surfaces an application contains, the more its operational integrity depends on the stability of its CSP boundary. Modeling the Boundary as a Contract Rather than describing CSP as a setting, it is more accurate to model it as a runtime contract between Lightning and external systems. 
At a minimum, that contract includes the Lightning origin; the set of explicitly allow-listed external origins; the redirect-resolution behavior that determines the final origin; and the enforcement engine (browser-level CSP evaluation). A simplified representation illustrates the structure:

```python
boundary_contract = {
    "lightning_origin": "https://org.lightning.force.com",
    "allowed_frame_origins": ["https://cti.vendor.com"],
    "enforcement": "browser_csp_evaluation",
    "redirect_resolution": True,
}
```

In this model, the system assumes that https://cti.vendor.com remains the effective origin evaluated by the browser. However, most production services do not serve content from a single static host. Vendors introduce CDN layers, regional routing, or edge services that alter the resolved origin without changing the configured entry point. If the external service resolves differently at runtime, the boundary contract is reinterpreted.

```python
def resolve_origin(configured_origin):
    # Simulate infrastructure migration
    return "https://edge.cdn-telephony.net"

resolved_origin = resolve_origin("https://cti.vendor.com")
print("Configured:", boundary_contract["allowed_frame_origins"][0])
print("Resolved:", resolved_origin)
```

CSP enforcement evaluates the resolved origin. If edge.cdn-telephony.net is not explicitly allowed, the frame is rejected. The application logic has not changed. The integration design has not changed. The enforcement environment has. The fragility lies in assuming that the configured origin and the evaluated origin are identical over time. Temporal Drift: Embed Now, Enforce Later Embedded integrations are validated at a point in time. Enforcement, however, is continuous. Browser vendors harden CSP evaluation. Salesforce seasonal releases refine isolation rules in Lightning Experience. Redirect handling behavior evolves. Security contexts tighten. An embed that renders successfully under one release is not grandfathered. It is re-evaluated under the current policy at each load.
This creates a temporal exposure window similar to long-lived cryptographic assumptions in other domains. The application's operational correctness depends on a boundary that is aging under policy pressure. To explore this drift, consider a minimal mutation harness that simulates domain variance:

JavaScript
function mutateDomain(base) {
  const prefixes = ["cdn", "edge", "assets", "us-east", "regional"];
  const prefix = prefixes[Math.floor(Math.random() * prefixes.length)];
  return `https://${prefix}.${base}`;
}

function evaluateAgainstPolicy(allowedOrigins, candidateOrigin) {
  return allowedOrigins.includes(candidateOrigin);
}

const allowed = ["https://cti.vendor.com"];
for (let i = 0; i < 5; i++) {
  const mutated = mutateDomain("vendor.com");
  console.log(mutated, evaluateAgainstPolicy(allowed, mutated) ? "ALLOWED" : "BLOCKED");
}

The output demonstrates a structural truth: slight infrastructure variations produce deterministic rejection under static policy definitions. In practice, these variations emerge not from malicious actors but from routine vendor maintenance, CDN optimization, or regional scaling. What appears as an intermittent integration issue is often policy drift expressed at the boundary layer.

From Configuration to Deterministic Policy Engineering

If CSP is a runtime contract, then it must be versioned and audited like any other contract. Trusted sites should not exist solely as UI configuration; they should be represented as deployable metadata and tracked in source control. Environmental parity becomes measurable rather than assumed.
A minimal CspTrustedSite artifact might look like:

XML
<?xml version="1.0" encoding="UTF-8"?>
<CspTrustedSite xmlns="http://soap.sforce.com/2006/04/metadata">
    <endpointUrl>https://cti.vendor.com</endpointUrl>
    <isActive>true</isActive>
    <description>Primary CTI integration surface</description>
    <context>LightningComponent</context>
</CspTrustedSite>

Retrieving and diffing these artifacts during CI transforms policy state into an auditable signal:

sfdx force:mdapi:retrieve -m CspTrustedSite -u staging
diff retrieved/unpackaged/CspTrustedSite repo/CspTrustedSite

This step does not prevent drift in external infrastructure, but it eliminates configuration divergence between environments. The boundary contract becomes explicit and reviewable.

Stress Testing the Boundary Under Real Enforcement

Metadata validation alone cannot capture browser-level enforcement changes. Because CSP evaluation occurs in the client, regression testing must execute under actual browser conditions. A headless harness can detect enforcement violations during staging deployments:

JavaScript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let violationDetected = false;

  page.on('console', msg => {
    if (msg.text().includes('Content Security Policy')) {
      violationDetected = true;
      console.error("CSP violation:", msg.text());
    }
  });

  await page.goto("https://org.lightning.force.com/lightning/page");
  await page.waitForTimeout(5000);
  await browser.close();

  if (violationDetected) {
    process.exit(1);
  }
})();

This converts enforcement behavior into a release gate. If a Lightning page attempts to load a resource outside the current policy, the build fails before users encounter the rejection. The system is no longer surprised by boundary evaluation; it anticipates it.

Logging Boundary State as First-Class Metadata

Long-lived enterprise systems increasingly require auditability at every layer.
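The retrieve-and-diff step can also run as code in the pipeline rather than as a raw file diff. The following sketch, with illustrative helper names, parses CspTrustedSite XML in the shape shown above and reports endpoint drift between two environments; it is a minimal example, not a complete CI integration.

```python
import xml.etree.ElementTree as ET

NS = "{http://soap.sforce.com/2006/04/metadata}"

def trusted_endpoints(xml_text):
    """Extract endpointUrl values from CspTrustedSite metadata XML."""
    root = ET.fromstring(xml_text)
    # A retrieve typically yields one site per file; also tolerate a
    # hypothetical wrapper element listing several sites.
    sites = [root] if root.tag == f"{NS}CspTrustedSite" else root.findall(f"{NS}CspTrustedSite")
    return {s.findtext(f"{NS}endpointUrl") for s in sites}

def policy_drift(staging_xml, repo_xml):
    """Report origins present in one environment but missing from the other."""
    staging, repo = trusted_endpoints(staging_xml), trusted_endpoints(repo_xml)
    return {"only_in_staging": staging - repo, "only_in_repo": repo - staging}

staging = """<?xml version="1.0" encoding="UTF-8"?>
<CspTrustedSite xmlns="http://soap.sforce.com/2006/04/metadata">
    <endpointUrl>https://cti.vendor.com</endpointUrl>
    <isActive>true</isActive>
</CspTrustedSite>"""

# Simulate a repo copy that was updated after a vendor infrastructure change.
repo = staging.replace("cti.vendor.com", "edge.cdn-telephony.net")
print(policy_drift(staging, repo))
```

A non-empty drift report can fail the build, turning environmental parity into an enforced property instead of an assumption.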
If embedded integrations form part of the operational chain, their boundary state should be logged alongside functional metadata. A lightweight integration descriptor could include:

Python
integration_metadata = {
    "integration": "CTI",
    "configured_origin": "https://cti.vendor.com",
    "resolved_origin": "https://edge.cdn-telephony.net",
    "csp_compliant": False,
    "last_validated_release": "Spring '26"
}

Persisting such metadata enables teams to correlate enforcement failures with release transitions or vendor infrastructure changes. The boundary is no longer an invisible dependency; it becomes an observable state. In mature systems, this concept extends further. Just as risk engines track exposure metrics over time, integration layers can track policy compliance status across releases. A failure is then not an unexpected anomaly but a measurable transition.

Engineering for Aging Boundaries

Embedded systems age. They age not only in business logic but in their surrounding enforcement ecosystem. CSP boundaries tighten. Infrastructure shifts. Redirect paths mutate. The question is not whether enforcement will evolve but whether the integration model accounts for that evolution. Treating CSP as a static setup step assumes that the external world remains constant. Treating it as governed infrastructure acknowledges that the boundary is active and subject to change. Enumerating external origins, versioning policy artifacts, stress-testing enforcement under real browsers, and logging boundary state convert a fragile embed into a managed contract.

Lightning architectures that rely on embedded surfaces cannot afford to ignore this layer. The operational integrity of identity binding, scheduling coordination, analytics visualization, and payment capture depends on a security boundary evaluated continuously at runtime. When that boundary drifts, the integration does not degrade gracefully; it is either permitted or rejected.
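Once such descriptors are persisted per release, detecting a compliance flip is a simple scan over the history. A minimal sketch, assuming snapshots shaped like the descriptor above are stored in chronological order (field names are illustrative):

```python
def compliance_transitions(history):
    """Given chronological boundary snapshots, report each release where
    CSP compliance flipped relative to the previous snapshot."""
    transitions = []
    for prev, curr in zip(history, history[1:]):
        if prev["csp_compliant"] != curr["csp_compliant"]:
            transitions.append({
                "release": curr["release"],
                "from": prev["csp_compliant"],
                "to": curr["csp_compliant"],
                "resolved_origin": curr["resolved_origin"],
            })
    return transitions

history = [
    {"release": "Winter '26", "resolved_origin": "https://cti.vendor.com",
     "csp_compliant": True},
    {"release": "Spring '26", "resolved_origin": "https://edge.cdn-telephony.net",
     "csp_compliant": False},
]
print(compliance_transitions(history))
```

The output pinpoints the release at which the boundary was reinterpreted, which is exactly the "measurable transition" framing the article argues for.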
Systems can tolerate latency fluctuations and transient service failures. They are less tolerant of broken trust at the boundary layer. Engineering discipline applied to CSP transforms that boundary from an intermittent hazard into a predictable surface.
The article explores the journey of multi-cloud integration through the lens of personal experience, focusing on integrating MuleSoft and AWS using SAFe 5.0 principles. It begins by outlining the necessity of multi-cloud solutions in today's digitally connected world, highlighting challenges such as security and vendor lock-in. The author discusses overcoming these challenges by employing SAFe 5.0's modular designs and integrating AI services like AWS SageMaker with MuleSoft for real-time decision-making. The article also emphasizes the importance of comprehensive training and cross-functional collaboration to bridge skills gaps. A real-world case study illustrates the approach's success in reducing latency for an e-commerce giant. The conclusion stresses continuous learning and aligning technical initiatives with business objectives as key to leveraging multi-cloud environments.

Introduction

I still remember the first time I heard the term "multi-cloud integration." It was during a client meeting at Tata Consultancy Services in 2014. Fresh-faced and eager, I couldn't fathom the complexities that lay ahead. Fast forward to today, and I find myself at the heart of pioneering integrations leveraging SAFe 5.0 principles with MuleSoft and AWS — a journey full of insights, occasional blunders, and numerous successes. Let's dive into this strategic blueprint, which modern enterprises can adopt to optimize their multi-cloud strategies.

Embracing the Multi-Cloud Revolution

In today's digitally connected world, multi-cloud solutions are more of a necessity than an option. From banking to retail, industries are transitioning to multi-cloud environments to harness flexibility, scalability, and redundancy. But with great power comes great responsibility, especially when it comes to security and governance.
Emerging Trends: Security and Governance at the Forefront

The financial sector, often risk-averse, has been a significant adopter of MuleSoft and AWS for real-time data processing. I recall a project where we integrated real-time transaction data across several cloud environments for a leading bank. We utilized AWS Lambda for automated validations, ensuring compliance across different jurisdictions — a crucial step in maintaining data integrity and security.

Personal Insight: During our deployment, we found that while AWS and MuleSoft offer robust frameworks for security, the challenge lay in integrating these seamlessly. Detailed planning and understanding of each platform's native capabilities were vital. My advice? Never underestimate the power of thorough documentation and the importance of a well-documented API architecture.

The Contrarian View: The Vendor Lock-in Debate

Many advocate that multi-cloud strategies eliminate vendor lock-in. Yet, as someone who's navigated these waters, I challenge this notion. The intricacies of integration can often weave a web of dependencies, especially when working with MuleSoft and AWS.

Solving the Dependency Puzzle with SAFe 5.0

One strategy we've employed is designing modular and agnostic solutions. Utilizing SAFe 5.0's modular design principles, we ensure our integrations are flexible and can pivot with changing vendor landscapes. In a recent project at a healthcare firm, we leveraged MuleSoft's Anypoint Platform to create a loosely coupled architecture, enabling easy transitions between cloud providers.

Lesson Learned: Over-engineering for flexibility can be a pitfall, adding unnecessary complexity. It's about striking a balance — focusing on critical services that need agility while ensuring core systems remain stable and robust.

Surviving the Technical Trenches: AWS AI and MuleSoft

Integrating AI services like AWS SageMaker with MuleSoft has been a game-changer, enabling real-time intelligent decision-making.
For instance, in a retail analytics project, we created custom connectors in MuleSoft for seamless data flow into SageMaker, enhancing predictive analytics and improving customer personalization.

Technical Deep-Dive: Crafting Custom Connectors

Creating these connectors isn't just about linking systems; it's about understanding the data lifecycle and business objectives. We encountered challenges with data latency and consistency, but by iterating our API definitions and leveraging AWS's data pipeline services, we achieved near-instantaneous data processing — a key success metric in that project.

Behind the Scenes: Engaging with MuleSoft's C4E team was instrumental in overcoming integration roadblocks. If there's one thing I've learned, it's that community collaboration often yields the most innovative solutions.

Bridging the Skill Gap with SAFe 5.0

Despite its many benefits, the learning curve for integrating MuleSoft and AWS using SAFe 5.0 principles is steep. Here's what worked for us:

Comprehensive Training Programs: We developed focused training sessions highlighting SAFe 5.0 frameworks and contextualizing them within our projects. This approach demystified complex topics and empowered our teams to innovate confidently.
Cross-Functional Collaboration: By facilitating dialogue across departments — from developers to QA teams — we fostered a culture of shared knowledge and innovation. This collaborative ethos became a bedrock for overcoming integration hurdles.

Real-World Implementation: A Case Study

Last year, we spearheaded an integration initiative for an e-commerce giant aiming to reduce latency in order processing. Utilizing AWS Outposts and Local Zones, paired with MuleSoft's capabilities, we achieved remarkable results.

Concrete Example: We reduced latency by 40%, improving customer satisfaction scores by a significant margin. The key was aligning technical prowess with business goals — something SAFe 5.0 principles advocate strongly.
Actionable Takeaway: Always align technical initiatives with overarching business objectives. It's not just about the technology; it's about driving tangible business outcomes.

Conclusion: The Road Ahead

The integration of MuleSoft with AWS, underpinned by SAFe 5.0 principles, offers a robust framework for tackling modern multi-cloud challenges. As we look to the future, the demand for hybrid solutions with integrated AI capabilities will only grow.

Final Thought: If there's one piece of advice I'd impart, it is to never stop learning. The technology landscape is ever-evolving, and staying curious ensures we remain at the forefront of innovation. As I share these hard-won insights over a metaphorical cup of coffee, I hope they serve as a guide for your own multi-cloud journey. Let's embrace the complexities with enthusiasm and turn challenges into opportunities for growth.
There are days when I want an agent to work on a project, run commands, install packages, and poke around a repo without getting anywhere near the rest of my machine. That is exactly why Docker Sandboxes clicked for me. The nice part is that the setup is not complicated. You install the CLI, sign in once, choose a network policy, and launch a sandbox from your project folder. After that, you can list it, stop it, reconnect to it, or remove it when you are done. In this post, I am keeping the focus narrow on purpose: set up Docker Sandboxes, run one against a local project, understand the few commands that matter, and avoid the mistakes that usually slow people down on day one.

What Are Docker Sandboxes?

Docker Sandboxes give you an isolated environment for coding agents. Each sandbox runs inside its own microVM and gets its own filesystem, network, and Docker daemon. The simple way to think about it is this: the agent gets a workspace to do real work, but it does not get free access to your whole laptop. That is the reason this feature is interesting. You can let an agent install packages, edit files, run builds, and even run Docker commands inside the sandbox without turning your host machine into the experiment.

Before You Start

You do not need a big lab setup to try this, but you do need:

A macOS or Windows machine
The Windows "Hypervisor Platform" feature enabled (Windows only)
The Docker sbx CLI installed
An API key or authentication for the agent you want to use

If you start with the built-in shell agent, Docker sign-in is enough for your first walkthrough. If you want to start with claude, copilot, codex, gemini, or another coding agent, make sure you also have that agent's authentication ready. If you are on Windows, make sure Windows Hypervisor Platform is enabled first.

PowerShell
Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -All

If Windows asks for a restart, do that before moving on. Note: Docker documents the getting-started flow with the sbx CLI.
There is also a docker sandbox command family, but sbx is the cleanest way to get started, so that is what I am using in this walkthrough.

Step 1: Install the Docker Sandboxes CLI

On Windows:

PowerShell
winget install -h Docker.sbx

On macOS:

Shell
brew install docker/tap/sbx

That is it for installation. If sbx is not recognized immediately after install, open a new terminal window and try again. I hit that once on Windows after installation, and a fresh terminal fixed it. Note: Docker Desktop is not required for sbx.

Step 2: Sign In

Now sign in once:

PowerShell
sbx login

This opens the Docker sign-in flow in your browser. During login, Docker asks you to choose a default network policy for your sandboxes:

Open – Everything is allowed
Balanced – Common development traffic is allowed, but it is more controlled
Locked down – Everything is blocked unless you explicitly allow it

If you are just getting started, pick Balanced. That is the easiest choice for a first run because it usually works without making the sandbox too open.

Step 3: Pick a Small Project Folder

You can use an existing project folder, or create a tiny test folder just for this walkthrough. For example:

PowerShell
mkdir hello-sandbox
cd hello-sandbox

If you want, drop a file into it so you have something visible inside the sandbox:

PowerShell
echo "# hello-sandbox" > README.md

Nothing fancy is needed here. The goal is just to have a folder you are comfortable letting the agent work in.

Step 4: Run Your First Sandbox

Here is the command that matters most:

PowerShell
sbx run shell .

Figure 1.1: Shows how to create a new sandbox using the sbx command

What this does:

Starts a sandbox for the shell agent
Mounts your current folder into the sandbox
Opens an isolated environment where the agent can work on that folder

If you prefer naming your sandbox from the start, use:

PowerShell
sbx run --name my-first-sandbox shell .

On the first run, Docker may take a little longer because it needs to pull the agent image.
That is normal. Later runs are much faster. I like starting with shell because it is the easiest way to prove the sandbox is working before you bring an actual coding agent into the mix. Once that works, replace shell with the agent you actually want to use, such as claude, copilot, codex, gemini, or another supported agent from the Docker docs.

Step 5: See What Is Running

To check your active sandboxes, run:

PowerShell
sbx ls

You should see output with a name, status, and uptime. This is a handy command because once you start using sandboxes regularly, it becomes the quickest way to see what is still running and what needs cleanup.

Figure 1.2: Shows how to verify the list of all active sandboxes running on the machine

Step 6: Switch to a Real Coding Agent

Once you have proved the sandbox works with shell, move to the coding agent you actually want to use. For example:

PowerShell
sbx run copilot

Figure 1.3: Shows how to run the Copilot agent in a Docker sandbox

or

PowerShell
sbx run gemini

Figure 1.4: Shows how to run the Gemini agent in a Docker sandbox

The workflow is the same as shell. The only thing that changes is the agent inside the sandbox. If the agent needs its own provider login or API key, complete that setup and then continue. The important point is that the agent is still running inside the sandbox, not directly on your host machine.

Step 7: Stop the Sandbox When You Are Done

When you are finished using the sandbox, you can stop it by running the command below:

PowerShell
sbx stop copilot-dockersandboxtest

If you don't remember the name, run sbx ls first to see all the active sandboxes. Stopping is useful when you want to pause work without removing the sandbox immediately.
Step 8: Remove the Sandbox When You No Longer Need It

When you are done for good, you can remove it by running the command below:

PowerShell
sbx rm copilot-dockersandboxtest

Or remove all sandboxes by simply passing the --all flag, as shown below:

PowerShell
sbx rm --all

Figure 1.5: Removing all sandboxes using the sbx rm --all command

Step 9: Use YOLO Mode Safely

Now for the newer idea Docker has just announced: YOLO mode. If you want to read more about it, refer to Docker's recent blog post, which is worth bookmarking: Docker Sandboxes: Run Agents in YOLO Mode, Safely. In simple terms, YOLO mode means letting a coding agent work with fewer interruptions and fewer approval prompts. That can save time, but it only makes sense when the agent is already inside a sandbox. Note: I would not start with YOLO mode on day one. I would start with a normal sandbox run, get comfortable with the lifecycle first, and only then try YOLO mode.

Conclusion

This article explains Docker Sandboxes and provides step-by-step instructions for getting started. What I like about Docker Sandboxes is that they remove a lot of friction from a very real problem. Sometimes you want an agent to have freedom, but not too much freedom. You want it to run commands, inspect files, and do useful work, but you also want a clear boundary around that work. That is the sweet spot Docker Sandboxes are aiming for. If you are curious about them, my advice is simple: do not start with a giant repo or a complicated setup. Pick one small folder, use the Balanced policy first, run a single sandbox, and get comfortable with the basic lifecycle. Once that clicks, the rest, including YOLO mode, feels much easier.
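If you end up scripting the stop/remove lifecycle, it helps to build the command lines in one place before running them. This is a speculative convenience sketch, not part of the sbx CLI itself; it only assembles the invocations used in the steps above, and run_sbx assumes sbx is installed and on your PATH.

```python
import subprocess

def sbx_args(action, name=None, remove_all=False):
    """Assemble an sbx lifecycle invocation (stop/rm) without executing it,
    so the command can be logged or reviewed first."""
    cmd = ["sbx", action]
    if remove_all:
        cmd.append("--all")   # e.g., sbx rm --all
    elif name:
        cmd.append(name)      # e.g., sbx stop copilot-dockersandboxtest
    return cmd

def run_sbx(action, **kwargs):
    """Execute the assembled command; requires the sbx CLI on PATH."""
    return subprocess.run(sbx_args(action, **kwargs), check=True)

print(sbx_args("stop", name="copilot-dockersandboxtest"))
print(sbx_args("rm", remove_all=True))
```

Separating command assembly from execution keeps cleanup scripts auditable, which matters once sandboxes hold work you might not want removed by accident.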
John Vester
Senior Staff Engineer,
Marqeta
Raghava Dittakavi
Manager, Release Engineering & DevOps,
TraceLink