After working with Oracle databases for more than 15 years, one thing I have learned is that patching is not just a maintenance task, it's a critical security and stability practice. Many production issues I have seen in enterprise environments could have been avoided simply by keeping databases updated with the latest Release Updates (RUs).

In this article, I will walk through how I typically apply an Oracle 19c Release Update, upgrading a standalone database from 19.3 to 19.20, using a structured approach that I have followed in multiple production environments. The entire patching process can be divided into three phases:

1. Pre-Patch Preparation
2. Patching Execution
3. Post-Patch Validation

Following this structured method significantly reduces risk during patching.

1. Pre-Patch Preparation

Before touching a production Oracle home, preparation is everything. Skipping preparation steps is where most patching failures happen.

Step 1: Verify Current Database Version

The first step I always take is confirming the current database version.

SQL
SELECT banner_full FROM v$version;

-- Example output:
-- Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Version 19.3.0.0.0

This confirms the starting patch level.

Step 2: Check Current Patch Level

Next, I verify which patches are already installed.

SQL
SELECT TO_CHAR(action_time,'YYYY-MM-DD') action_time,
       action, status, description, patch_id
FROM   dba_registry_sqlpatch
ORDER  BY action_time;

-- Example output shows the current RU applied:
-- Database Release Update : 19.3.0.0.190416

This helps me understand the patch history of the database.

Step 3: Check Patches from the OS Level

I also confirm patch details from the operating system using OPatch.

Shell
$ORACLE_HOME/OPatch/opatch lspatches

## Example:
## 29517242;Database Release Update : 19.3.0.0

This cross-verification avoids confusion between binary patching and SQL patching.

Step 4: Verify OPatch Version

Oracle frequently requires newer OPatch versions for newer patches.
Shell
./opatch version

Example output: OPatch Version: 12.2.0.1.17

If necessary, I download the latest OPatch (patch 6880880) from Oracle Support.

Step 5: Download Required Patch

From Oracle Support → Patches & Updates, I download:

- The latest OPatch
- The target Release Update

Example RU: Patch 35320081 → Oracle 19.20 RU

Step 6: Backup Oracle Home and Inventory

One of the golden rules I always follow: never patch without backups.

Backup the Oracle home:

Shell
tar -pcvf /u02/backups/backup.tar dbhome_1

Backup the inventory:

Shell
cp -R inventory /u01/backup

Step 7: Backup the Database

Before patching production databases, I always take a full RMAN backup.

RMAN
RMAN> BACKUP AS COMPRESSED BACKUPSET DATABASE INCLUDE CURRENT CONTROLFILE PLUS ARCHIVELOG;

This ensures we can recover quickly if anything goes wrong.

Step 8: Check for Invalid Objects

Invalid objects sometimes cause datapatch failures.

SQL
SELECT * FROM dba_objects WHERE status='INVALID';

If invalid objects exist, I usually fix them before proceeding.

Step 9: Update the OPatch Utility

Rename the old OPatch directory:

Shell
mv OPatch OPatch_backup

Then unzip the latest OPatch:

Shell
unzip p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

Verify the version again:

Shell
./opatch version

Step 10: Run a Patch Conflict Check

Before applying the patch, I always check for conflicts.

Shell
opatch prereq CheckConflictAgainstOHWithDetail -ph ./

This step confirms the patch can safely be applied to the Oracle home.

2. Applying the Patch

Once all preparation steps are complete, I proceed with patching.

Step 1: Shut Down the Database and Listener

SQL
sqlplus / as sysdba
shutdown immediate

Then stop the listener.

Step 2: Set the OPatch Path

Shell
export PATH=$ORACLE_HOME/OPatch:$PATH

Verify:

Shell
which opatch

Step 3: Apply the Patch

Navigate to the patch directory and execute:

Shell
opatch apply

The utility prompts for confirmation:

Do you want to proceed? [y|n]

Enter y to continue.
During execution, OPatch updates multiple Oracle components, such as:

- JDK
- ODBC
- Oracle middleware components
- Precompiler libraries

Once completed, you should see:

Shell
OPatch succeeded

This confirms the binary patch has been successfully applied.

3. Post-Patch Activities

The patching process isn't finished until post-patch validation is complete.

Step 1: Verify Installed Patches

Shell
opatch lspatches

## Example output:
## 35320081;Database Release Update : 19.20.0.0

Step 2: Start the Database and Listener

SQL
startup

Shell
lsnrctl start

Step 3: Verify the Database Version

SQL
SELECT banner_full FROM v$version;

-- Example output:
-- Oracle Database 19c Enterprise Edition Version 19.20.0.0.0

Step 4: Run Datapatch

This is a critical step many DBAs forget.

Shell
datapatch -verbose

Datapatch applies the SQL changes associated with the binary patch.

Step 5: Validate the SQL Patch Registry

SQL
SELECT TO_CHAR(action_time,'YYYY-MM-DD') action_time,
       action, status, description, patch_id
FROM   dba_registry_sqlpatch
ORDER  BY action_time;

You should see the newly applied patch.

Step 6: Verify Invalid Objects

Finally, check for invalid objects again:

SQL
SELECT count(*) FROM dba_objects WHERE status='INVALID';

If any exist, compile them using:

SQL
@?/rdbms/admin/utlrp.sql

Lessons I've Learned from Patching Production Systems

Over the years, I've learned a few practical lessons:

✔ Always test patches in staging environments first
✔ Never skip Oracle Home backups
✔ Always run datapatch after binary patching
✔ Check invalid objects before and after patching
✔ Schedule patching during planned maintenance windows

A disciplined patching strategy keeps Oracle environments secure, stable, and supported.

Final Thoughts

Oracle continuously releases Release Updates (RUs) to address:

- Security vulnerabilities
- Performance improvements
- Bug fixes

From my experience managing enterprise databases, staying current with patching is one of the simplest and most effective ways to prevent major production incidents.
Over the years, I’ve seen several situations where systems ran into avoidable issues simply because critical patches were delayed or skipped. Small vulnerabilities or known bugs can become larger problems when databases are running on older patch levels. A well-planned patching strategy ensures your databases remain secure, performant, and fully supported by Oracle. Security patches address newly discovered vulnerabilities that could otherwise expose sensitive data or allow unauthorized access. In today’s environment, where databases often store critical business information, staying on top of these updates is essential. Patching also improves overall stability and performance. Oracle regularly fixes bugs, improves internal components, and resolves issues reported by customers in real-world environments. Applying these fixes helps avoid unexpected behavior, performance slowdowns, or system crashes that could affect business operations. Another important factor is supportability. When a production issue occurs, Oracle Support typically expects the environment to be on a reasonably current patch level. If the system is far behind, troubleshooting can become more complicated, and in some cases, support may first recommend applying the latest patches before deeper investigation. The key is to approach patching in a controlled and predictable way. Maintain a regular patching schedule, test patches in lower environments, and coordinate maintenance windows with application teams. This approach minimizes risk while keeping the database environment stable and secure. In the long run, consistent patching saves far more time and effort than dealing with preventable production problems.
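As a small automation aid, the lspatches cross-check from the walkthrough above can be scripted so the expected RU is verified before and after patching. A minimal sketch in Python; the helper names are my own, not part of OPatch, and it only parses output of the `id;description` shape shown in the article:

```python
def parse_lspatches(output: str) -> dict[int, str]:
    """Parse `opatch lspatches` lines like
    `35320081;Database Release Update : 19.20.0.0`
    into {patch_id: description}."""
    patches = {}
    for line in output.splitlines():
        line = line.strip()
        if ";" not in line:
            continue  # skip blank lines and any OPatch banner text
        patch_id, description = line.split(";", 1)
        if patch_id.isdigit():
            patches[int(patch_id)] = description.strip()
    return patches

def ru_applied(output: str, expected_patch_id: int) -> bool:
    """True if the expected Release Update appears in the inventory listing."""
    return expected_patch_id in parse_lspatches(output)

# With the example output from the article:
sample = "35320081;Database Release Update : 19.20.0.0"
print(ru_applied(sample, 35320081))  # → True
```

In practice you would feed it the captured stdout of `$ORACLE_HOME/OPatch/opatch lspatches` and fail the patching run loudly if the check returns False.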
Docling Studio is a visual layer on top of Docling, the document extraction engine. The idea is simple: give users a way to see extraction results for debugging, quality analysis, and understanding how the pipeline actually works.

Document extraction is a critical building block for AI projects, especially in RAG contexts. But when something goes wrong in the extraction, you need to see it, not just read JSON. You could script a quick viewer with pdf.js, but you'd be re-implementing coordinate transforms, element filtering, and result navigation from scratch, and you'd lose the native connection to DoclingDocument structures that makes inspection actually useful. That's why I built Docling Studio: a tool that works directly with Docling's output format and supports both local debugging and production-ready setups out of the box.

Section 1: Two Modes, One Interface

The first architectural decision was about how Docling Studio connects to Docling itself. There are two very different use cases: a developer inspecting a single PDF on their laptop, and a team running extraction at scale through an API. Building two separate tools for this made no sense. So Docling Studio supports both modes behind a single interface.

Local mode embeds Docling directly in the backend. You install Docling Studio, drop in a PDF, and the extraction runs in-process. No external service to configure, no network dependency. This is the fastest way to debug and explore what Docling produces for a specific document.

Serve mode, not yet implemented but architecturally ready, will connect to a running Docling Serve instance. Instead of running the extraction locally, Docling Studio will send the document to Docling Serve over HTTP and retrieve the results. This is the production path: your team runs Docling Serve on a dedicated server or cluster, and Docling Studio becomes a visual frontend for it.
The key design constraint was that both modes had to be invisible to the rest of the application. The frontend doesn't know (and doesn't care) which engine is running behind the scenes. The extraction results come back in the same format either way. This is achieved through a simple engine adapter abstraction: a common interface that takes a document and pipeline configuration in, and returns a DoclingDocument out. Today, the local implementation is in place. The remote connector for Docling Serve will plug into the same interface; the rest of the codebase won't need to change. This pattern keeps the codebase clean and makes it easy to add new engines later: connecting to a custom extraction pipeline, for instance, or a different version of Docling.

Section 2: Frontend: Rendering Bounding Boxes at Scale

The core value of Docling Studio lives in the frontend: overlaying colored bounding boxes on top of the original PDF pages and letting users click through each detected element. Sounds straightforward. It wasn't.

The Coordinate Problem

Docling outputs bounding box coordinates in PDF points, a unit system that has nothing to do with screen pixels. A PDF point is 1/72 of an inch, and the coordinate origin sits at the bottom-left of the page. Browsers render with a top-left origin, in pixels, at whatever zoom level the user picks. Every single box needs to be transformed: flip the Y axis, scale to the current zoom factor, and offset to the page position on screen. Get one step wrong, and your boxes float next to the content instead of on top of it.

Why Vue 3 Composition API

The rendering layer needs to be reactive at a fine-grained level. When a user changes zoom, every box recalculates. When they click a box, the side panel updates with the element details. When they filter by element type (show only tables, hide text blocks), dozens of boxes appear or disappear instantly.
Vue 3's Composition API handles this naturally: each concern (zoom state, selection state, filter state) lives in its own composable, and the reactivity system takes care of re-rendering only what changed. No manual DOM manipulation, no performance hacks.

The Performance Question

A dense academic paper can produce 50+ detected elements on a single page. Rendering 50 semi-transparent overlays with hover states and click handlers could easily lag. The solution was surprisingly simple: the boxes are plain CSS-positioned divs layered on top of a rendered page image, not canvas elements. The browser's layout engine handles the compositing natively. No virtual DOM tricks needed, no canvas API complexity. CSS transforms for positioning, CSS transitions for hover effects. It just works.

The frontend is organized by feature: analysis, document, history, and settings, each as a self-contained module with its own components, composables, and stores. Shared utilities like the coordinate transform logic live in a common layer.

frontend/src/
├── app/                      # App shell, router, global styles
├── pages/                    # Route-level pages
│   ├── HomePage.vue
│   ├── StudioPage.vue        # PDF viewer + config + results
│   ├── DocumentsPage.vue
│   ├── HistoryPage.vue
│   └── SettingsPage.vue
│
├── features/                 # Feature modules
│   ├── analysis/             # Analysis store, API, bbox scaling, UI
│   │   ├── store.ts
│   │   ├── api.ts
│   │   ├── bboxScaling.ts    # Pure math: page coords → pixel coords
│   │   └── ui/
│   │       ├── BboxOverlay.vue
│   │       ├── AnalysisPanel.vue
│   │       ├── StructureViewer.vue
│   │       └── ...
│   ├── document/             # Document store, API, upload
│   ├── history/              # History store, navigation
│   └── settings/             # Theme, locale, API URL
│
└── shared/                   # Cross-feature utilities
    ├── types.ts              # All shared TypeScript interfaces
    ├── i18n.ts               # FR/EN translations
    ├── format.ts             # Date/size formatters
    └── api/http.ts           # HTTP client (fetch wrapper)

Section 3: Backend: FastAPI as a Thin Orchestration Layer

The backend has one job: sit between the frontend and Docling, and make the extraction pipeline easy to drive through a REST API. It shouldn't do more than that.

Why FastAPI Over Django or Flask

Docling is a Python library. Whatever framework wraps it needs to be Python-native. Django was overkill: Docling Studio doesn't need an admin panel, an ORM migration framework, or a template engine. Flask was an option, but FastAPI wins on two things: native async support and automatic OpenAPI documentation. The async part matters because document extraction can take minutes on large files. You don't want the server to block while Docling processes a 40-page report. FastAPI handles this cleanly with background tasks.

The Orchestration Pattern

The backend follows a layered structure that keeps concerns separated. The API layer defines the routes and handles HTTP; it knows nothing about Docling. The service layer contains the business logic: start an analysis, check its status, and retrieve results. The domain layer holds the data models. And the infrastructure layer is where the engine adapters live, the local and Serve mode implementations described earlier.

This separation isn't over-engineering for the sake of it. When Docling Serve support lands, it will only touch the infrastructure layer. None of the API routes, services, or domain models needs to change.

No WebSocket, On Purpose

An extraction job on a large document can take time, and the natural instinct is to push real-time progress updates via WebSocket. I chose polling instead.
The frontend checks the job status every few seconds. It's less elegant but drastically simpler: no connection management, no reconnection logic, no state sync issues. For an inspection tool where you run a few documents at a time, the UX difference is negligible. Simplicity won.

document-parser/
├── main.py                    # FastAPI app, CORS, lifespan
│
├── domain/                    # Pure domain: no HTTP, no DB
│   ├── models.py              # Document, AnalysisJob dataclasses
│   ├── parsing.py             # Docling conversion & page extraction
│   └── bbox.py                # Bounding box coordinate normalization
│
├── api/                       # HTTP layer (FastAPI routers)
│   ├── schemas.py             # Pydantic DTOs (camelCase serialization)
│   ├── documents.py           # /api/documents endpoints
│   └── analyses.py            # /api/analyses endpoints
│
├── persistence/               # Data layer (SQLite via aiosqlite)
│   ├── database.py            # Connection management, schema init
│   ├── document_repo.py       # Document CRUD
│   └── analysis_repo.py       # AnalysisJob CRUD
│
├── services/                  # Use case orchestration
│   ├── document_service.py    # Upload, delete, preview
│   └── analysis_service.py    # Async Docling processing
│
└── tests/                     # pytest

Section 4: Data Layer: Why SQLite Over PostgreSQL

This is the question every developer asks when they see the stack. The answer is simple: Docling Studio is a tool, not a platform.

The Zero-Dependency Argument

Docling Studio ships as a single Docker image. You pull it, you run it, you start inspecting documents. Adding PostgreSQL to the equation means a second container, a docker-compose file, connection strings, volume mounts for data persistence, and a migration step. For an inspection tool that a developer wants to try in five minutes, that's a dealbreaker. SQLite lives inside the container as a single file. Nothing to configure, nothing to connect.

The Use Case Doesn't Need More

Docling Studio stores analysis metadata, job history, and pipeline configurations. Not millions of rows; a few hundred at most, even for heavy users.
There are no concurrent writes from multiple users, no complex joins across large datasets, and no need for replication. SQLite handles this effortlessly.

The Abstraction Is There

The persistence layer uses aiosqlite for direct async access to SQLite; there is no ORM overhead for what amounts to simple CRUD operations on a few tables. All database access is abstracted behind a repository pattern: the rest of the codebase calls document_repo.save() or analysis_repo.get_by_id(), never raw SQL. If a use case ever requires PostgreSQL (say, a shared team instance with multiple concurrent users), the migration path is clear: swap the repository implementations, keep the interfaces. The choice is deliberate, not a limitation.

Section 5: Packaging: One Docker Image to Rule Them All

Docling Studio bundles everything (frontend, backend, and Nginx) into a single Docker image. One pull, one run, done.

Why a Single Image

The target user is a developer who wants to try Docling Studio now, not spend twenty minutes wiring containers together. A multi-container setup with separate frontend, backend, and reverse proxy images is the textbook answer. It's also the wrong answer for an open-source tool that lives or dies on first impressions. If the README says docker run and it works, people try it. If it says docker-compose up with a config file to edit first, half of them leave.

How It Works

A multi-stage Dockerfile builds the Vue 3 frontend into static assets, installs the Python backend with its dependencies, and configures Nginx to serve the frontend and proxy API calls to FastAPI. The final image contains everything needed to run.

Multi-Arch for Real-World Adoption

The image builds for both AMD64 and ARM64. This isn't optional; a growing share of developers run Apple Silicon machines, and ignoring ARM means locking out a significant part of your potential users. CI/CD builds and pushes both architectures automatically.

The Trade-Off

A single image doesn't scale horizontally.
You can't independently scale the frontend and backend, and you can't run multiple backend instances behind a load balancer. That's fine. Docling Studio is an inspection tool, not a SaaS platform. If someone needs to scale extraction, they run Docling Serve separately and point Docling Studio at it; that's exactly what Serve mode is for.

Section 6: What's Next

Docling Studio today is a debugging tool. You feed it a document, you see what Docling extracted, you spot what went wrong. That's V0, and it's useful, but it's only the first step of a longer arc. The trajectory follows three stages that each build on the previous one: see what the extraction produces, audit what your pipeline does with it, then improve the models that power it.

See - V0: Extraction Inspection

This is the current state. Docling Studio renders bounding boxes over original pages, lets you click through detected elements, and gives you a visual ground truth of what the pipeline actually captured. The immediate next step is full Docling Serve integration, so teams running extraction in production can inspect results without re-processing documents locally.

Audit - V1: RAG Chunking Visualization

Docling is increasingly used as the ingestion layer for RAG pipelines. But between extraction and retrieval, there's a critical step nobody can see: chunking. Where do the splits happen? Does a table end up in one chunk or scattered across three? Is the section heading grouped with its content or orphaned in the previous chunk? Today, answering these questions means staring at JSON. V1 makes it visual, overlaying chunk boundaries on the original document and showing exactly what context is preserved or lost. This turns Docling Studio from an extraction debugger into a vector store audit tool. The developer who debugs extraction in V0 is exactly the one who needs to audit chunks in V1.

Improve - V2: Dataset Annotation and Evaluation

This is the strategic step.
When you can see extraction errors and chunking failures, the natural next question is: can I fix the model? V2 adds a native annotation layer: users correct extraction results directly on top of the document, building training datasets without leaving the tool. The key differentiator against established annotation tools like Label Studio or Prodigy is that Docling Studio operates natively on DoclingDocument structures, not on generic formats. You annotate what Docling actually produces, not a re-imported approximation.

The connection point that closes the loop is docling-eval, the evaluation framework from the Docling project. By feeding corrected annotations back into docling-eval, Docling Studio becomes the missing piece of a complete cycle: extract → visualize → annotate → evaluate → retrain. This positions it not as a standalone tool, but as the visual and human-in-the-loop layer of the Docling ecosystem.

Each stage widens the user base. V0 attracts developers debugging single documents. V1 brings in teams building RAG pipelines. V2 pulls in ML engineers training extraction models. The architecture is designed to support this progression without rewrites; the engine abstraction, the layered backend, and the modular frontend all serve this longer-term trajectory.

Contributions and feedback are welcome: github.com/scub-france/Docling-Studio
I've watched this failure mode enough times that I can smell it coming during architecture reviews. Someone draws a box labeled "queue" between two overwhelmed services and everyone nods like the problem is solved. It isn't. What they've actually built is a time bomb with a progress bar.

Queues smooth spikes — this part is true. When your API gets hammered for thirty seconds because someone's cron job misfired, a queue absorbs that burst and lets your consumers work through the backlog at a sustainable pace. This is the happy path, the scenario in all the diagrams. Short-duration load, finite work, queue drains, everyone goes home. But sustained overload? That's different physics entirely.

The Bankruptcy Metaphor Is Precise

When the incoming message rate exceeds consumer throughput for longer than a few minutes, the queue doesn't solve anything — it just moves the failure downstream and masks the symptoms while the situation metastasizes. Picture it: every second, twenty messages arrive. Your consumers handle fifteen. The math is brutal and linear. After one minute you're 300 messages behind. After ten minutes, 3,000. The queue grows. Eventually you hit a wall. Not a soft limit — a wall.

RabbitMQ runs out of memory and starts paging to disk, which destroys throughput. SQS approaches its 120,000 in-flight message limit for standard queues. Kafka partitions fill their retention window. Or — and this is the common case — you never breach the queue's own limits because something else breaks first: the downstream database runs out of connections, disk I/O saturates, or the consumer instances thrash on context switching because they're drowning in work they can't complete.

This is what Christant means by an eventual crash. The queue doesn't fail gracefully; it fails by creating conditions that topple multiple dominoes. When the dam breaks, you lose messages. Clients time out. The errors cascade backward through your call stack until customer-facing requests start failing.
You've transformed a capacity problem into a reliability catastrophe.

What Actually Lives in Production

I've debugged this at 2 AM more than once. The pattern is always similar: someone implemented a queue months ago to handle normal traffic, never added monitoring for consumer lag, and never asked the hard question — what happens when this queue fills? The answer is usually that nothing happens. Nothing deliberate, anyway. The system lurches between states: queue growing, CPU climbing, memory pressure increasing, garbage collection pauses lengthening. Then something tips. An instance OOMs. A connection pool exhausts. A health check times out and the orchestrator kills a pod that was actually fine, just slow, which makes everything worse.

You need backpressure. Not someday — on Monday morning, in the first iteration. Backpressure means the queue has agency. When it reaches 70% capacity, it stops accepting new work and signals upstream: slow down. HTTP 429. TCP flow control. gRPC RESOURCE_EXHAUSTED. The mechanism varies but the principle doesn't — apply counterpressure before failure becomes inevitable.

This sounds simple until you implement it. Who decides the threshold? How do clients react to rejection — do they retry, wait, or drop the request? If they retry without backoff, you've just created a retry storm that makes the overload worse. If they drop requests, what are you losing? Are these payment confirmations or newsletter clicks? These are product decisions dressed up as infrastructure problems.

Bounded Queues Force Honesty

I prefer bounded queues for this reason — they make the trade-off explicit. When the queue is full, you must choose: reject the message immediately, or block the producer until space opens. Both hurt, but they hurt in different ways. Rejecting is fast and visible. The producer gets an error code, can increment a metric, maybe log it. You know you're shedding load. This is honest.
The alternative — accepting the message into an "infinite" queue — is lying. You're pretending to have capacity you don't, delaying the pain until the queue fills (it will) or the downstream system collapses under the weight of the backlog (it will).

Blocking the producer is sometimes correct. If you're processing financial transactions, you can't drop them. You'd rather make the client wait — visibly, with a timeout — than silently lose a payment. This creates organic backpressure: slow consumers make producers slow, which bubbles up to load balancers, which eventually refuse connections at the edge. The system regulates itself through congestion.

But blocking requires every layer to handle timeouts correctly, which in my experience about 40% of them don't. You'll find places where a blocked write turns into a hung thread because someone set an infinite timeout, or a client that retries indefinitely because the retry logic doesn't distinguish between "server overloaded" and "network cable unplugged."

The Circuit Breaker Isn't Optional

Here's where most designs fall apart: they add the queue but not the escape hatch. A circuit breaker wraps the queue consumer — or ideally, the entire call path — and monitors failure rates. When errors exceed a threshold (say, 50% of the last hundred operations), the breaker opens. New requests fail immediately with a service-unavailable error. You stop trying to push work through a system that's clearly unable to handle it.

This seems harsh until you've lived through the alternative. Without a breaker, the system keeps attempting doomed work. Database connections time out after thirty seconds each. Each timeout consumes a thread, a file descriptor, some RAM. The slower the system gets, the more requests pile up, and the more resources get locked waiting for operations that will never complete. It's a death spiral. An open circuit breaker stops the spiral.
Yes, you're rejecting requests — but you were going to reject them anyway, just slowly and expensively. Better to fail fast, preserve resources, and give the overwhelmed system room to recover.

The tricky part is tuning the breaker's sensitivity. Too aggressive and you'll open the circuit during transient blips, rejecting work you could've handled. Too lenient and you won't open fast enough to prevent the cascade. I've settled on tracking both error rate and latency percentiles — if P99 latency suddenly triples and the error rate climbs above 20%, something is wrong. Open the circuit.

Consumer Lag Is the Metric That Matters

Queue depth tells you how much work is waiting. Consumer lag tells you whether you're gaining or losing ground. Lag is the delta between the newest message ID and the message ID your consumers are currently processing. If lag grows monotonically, you're falling behind. The queue might not be "full" yet, but you're on the path to failure. This is the metric that should wake you up at night. Monitor it continuously. When lag exceeds some threshold — say, five minutes of backlog — you need to act:

- Scale consumers horizontally if possible (Kubernetes HPA based on queue metrics, Lambda concurrency increases)
- Rate-limit producers if consumers can't scale fast enough (API Gateway throttles, token bucket algorithms)
- Start shedding load selectively if neither scaling nor rate-limiting is sufficient

That last option is controversial but pragmatic. In a system processing both critical payments and optional analytics events, maybe you drop the analytics when under pressure. This requires tagging messages by priority and implementing tiered processing — more complexity, but honest complexity that acknowledges the system's real constraints.

The SQS Trap

AWS SQS makes this worse in a specific way: the visibility timeout. When a consumer pulls a message from SQS, the message becomes invisible to other consumers for a configurable period — typically 30 to 300 seconds.
If the consumer doesn't delete the message within that window, it becomes visible again for retry. This prevents duplicate processing and handles consumer crashes elegantly. But under overload, the visibility timeout becomes a weapon. Consumers pull messages they can't process in time. The messages time out, return to the queue, get pulled again, and time out again. You're churning through the same work repeatedly, burning CPU and network bandwidth to accomplish nothing. Meanwhile, new messages keep arriving. The visible message count stays low — it looks fine in the dashboard — while the invisible message count climbs into the tens of thousands. I've debugged systems where 80% of the processing effort was going into re-processing messages that never got deleted.

The fix is usually multi-part: increase the visibility timeout to match realistic processing time under load, reduce the number of concurrent consumers to match actual capacity, implement exponential backoff for retries using dead-letter queues, and — critically — add monitoring for retry counts. If a message has been delivered three times, something is wrong structurally, not transiently.

Kafka's Different Pathology

Kafka doesn't have visibility timeouts — it has consumer group offsets. Your consumer tracks which offset (message position) it has processed in each partition. This is simpler in some ways: no invisible messages, no timeout-based retries. But it creates a different failure mode.

If one consumer in a group falls behind, it blocks the entire partition. Kafka guarantees ordering within partitions, so message N+1 can't be processed until message N is committed. A single slow message — say, one that triggers a database deadlock or a network timeout — stalls everything behind it in that partition. The standard fix is more partitions. Spread the load across many partitions so a stall in one doesn't block unrelated work. But this breaks ordering guarantees across the topic, which might violate your requirements.
If you're processing bank transactions for user accounts, you need strict ordering per account but not across accounts. So you partition by account ID — clever, until you realize that user 12345 just generated 10,000 events and your partition for that account is now backed up while others sit idle. Rebalancing helps but adds latency. When you scale consumers, Kafka redistributes partitions across the group. During rebalancing, no messages are processed — a "stop-the-world" pause that can last seconds. If you're auto-scaling based on lag, you might trigger rebalances frequently, which creates pauses, which increase lag, which triggers more scaling. Another death spiral, different mechanism. The Monday Morning Checklist If you're designing a queue-based system or inheriting one, here's what I'd verify first: Monitoring: Consumer lag by queue or topic. Alert when lag exceeds two minutes. Graph it over time so you can see trends. Backpressure: Explicit mechanism to signal upstream when queues approach capacity. If you're using HTTP, return 503 or 429 with Retry-After headers. If you're using gRPC, return RESOURCE_EXHAUSTED with details. Make the producer's retry logic respect these signals. Bounded queues: Set a maximum depth and decide what happens when you hit it. Reject new messages? Block producers? Drop low-priority work? Circuit breakers: Wrap consumers in failure-detection logic. Open the circuit when error rates or latencies spike, fail fast instead of thrashing. Dead-letter queues: Route messages that fail repeatedly (after three attempts, say) into a separate queue for manual inspection. Don't let poison messages clog the primary path. Consumer scaling: Autoscale based on lag, not just queue depth. If lag is growing, you need more consumers now, not when the queue reaches some arbitrary size threshold. Capacity planning: Know your consumer throughput per instance. 
If each consumer handles 100 messages/second and you're receiving 2,000 messages/second, you need at least 20 consumers — probably more for headroom. Do the math. Graceful degradation: Identify which work is critical and which is optional. Tag messages by priority if possible. When under pressure, shed the optional work first. This isn't glamorous. It's plumbing. But plumbing failures flood buildings. What You're Actually Selling If you're building this as a service or consultancy, the pitch isn't "we'll set up RabbitMQ for you" — everyone can do that. The pitch is "we'll design a queue-based system that fails gracefully and tells you why." Customers don't buy queues. They buy resilience. They buy systems that stay up during traffic spikes, that degrade gracefully under overload, that emit actionable metrics before failures cascade. They buy confidence that their payments won't get lost when the marketing team launches a campaign without warning engineering. The productizable pieces:

- Monitoring dashboards pre-configured for queue lag, consumer health, backpressure events
- Auto-scaling policies tuned for their traffic patterns
- Circuit breaker libraries integrated with their observability stack
- Runbooks for common failure scenarios: queue full, consumer crash, downstream service timeout
- Load testing harnesses that simulate realistic failure modes — not just "send a million requests" but "send sustained 2x load for thirty minutes and see what breaks"

You can also train teams. Most engineers understand queues conceptually but haven't lived through a queue-driven outage. Teach them to think in failure modes: what happens when this fills? When consumers crash? When the downstream database goes read-only? Run fire drills. Break things deliberately in staging and make them fix it under time pressure. The niche is narrow but valuable: people who understand distributed systems failure modes deeply enough to design around them, not just implement the happy path.
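The circuit-breaker item from the checklist above can be sketched in a few lines. This is a minimal, illustrative Python version under stated assumptions: the class name and thresholds are hypothetical, and a production breaker would also track latency percentiles, as discussed earlier.

```python
import time
from collections import deque

class CircuitBreaker:
    """Minimal error-rate circuit breaker: opens when the recent error rate
    exceeds a threshold, fails fast while open, then allows a probe request
    after a cooldown (the classic half-open state)."""

    def __init__(self, error_threshold=0.2, window_size=50, cooldown_seconds=30):
        self.error_threshold = error_threshold    # e.g. open above 20% errors
        self.results = deque(maxlen=window_size)  # sliding window of True/False
        self.cooldown_seconds = cooldown_seconds
        self.opened_at = None                     # None means the circuit is closed

    def allow(self):
        """Return True if a request may proceed, False to fail fast."""
        if self.opened_at is None:
            return True
        # Half-open: permit one probe once the cooldown has expired
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record(self, success):
        """Record a call's outcome and open/close the circuit accordingly."""
        self.results.append(success)
        errors = self.results.count(False)
        # Require a minimum sample size so one early failure can't trip it
        if len(self.results) >= 10 and errors / len(self.results) > self.error_threshold:
            self.opened_at = time.monotonic()     # open: start failing fast
        elif success and self.opened_at is not None:
            self.opened_at = None                 # probe succeeded: close again
            self.results.clear()
```

Wrapping each consumer's downstream call in `allow()`/`record()` gives you the "fail fast instead of thrashing" behavior the checklist calls for, without any external dependency.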
Why This Keeps Happening The fundamental mistake is treating queues as magical absorbers rather than finite buffers. Part of this is the marketing — cloud vendors tout "unlimited scalability" for their queue services, which is true in a narrow technical sense (the queue itself won't refuse your message) but misleading in the systemic sense (your consumers will still drown). Part of it is developmental complexity. Building a system with proper backpressure, monitoring, and failure handling is harder than just putting a queue in the middle. It requires thinking through scenarios that haven't happened yet. Engineers under deadline pressure defer that thinking, ship the basic version, and promise to "harden it later." Later arrives at 2 AM during an incident. And part of it is the seductive linearity of the solution. "We're getting too many requests → let's buffer them" sounds logical. It is logical for bounded, transient load. But load in production is rarely bounded or transient. It's fractal — spiky at every timescale, with long tails and sudden cliffs. A queue without capacity governance just shifts the cliff from "right now" to "fifteen minutes from now," which arguably makes the failure worse because you've lost situational awareness. The Honest Version Queues are tools. Good ones. They decouple systems, enable async processing, smooth traffic spikes. But they don't create capacity. They delay the moment when insufficient capacity becomes visible. If you're consuming 100 messages/second and receiving 150, you need to handle 150 or reduce to 100. The queue doesn't change that arithmetic — it just hides it until the backlog grows large enough to collapse something else. Design accordingly. Monitor lag. Implement backpressure. Scale consumers. Shed load when necessary. Fail fast when overloaded. These aren't optimizations — they're requirements for anything you expect to survive production traffic. And test the failure modes. 
Don't wait for the queue to fill in production to discover that your consumers panic when they can't keep up. Fill it deliberately in staging. Break your downstream database. Introduce artificial latency. Watch what happens. Fix what breaks. This is the work. The rest is just configuration files.
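The arithmetic above — consuming 100 messages/second while receiving 150 — is worth making concrete. A toy calculation (hypothetical rates, not tied to any particular broker) shows how quickly a modest imbalance becomes a backlog no queue can hide:

```python
def backlog_after(arrival_rate, service_rate, seconds):
    """Messages waiting after `seconds` of sustained load, starting from an
    empty queue. Backlog only grows when arrivals outpace service capacity."""
    return max(0, (arrival_rate - service_rate) * seconds)

def added_latency(backlog, service_rate):
    """How long a newly enqueued message waits behind the existing backlog."""
    return backlog / service_rate

# Receiving 150 msg/s but consuming only 100 msg/s, for ten minutes:
backlog = backlog_after(150, 100, seconds=600)
wait = added_latency(backlog, 100)
print(backlog, wait)  # 30000 messages queued, 300.0 seconds of added delay
```

Ten minutes of a 50-message/second shortfall means every new message now sits in the queue for five minutes before anyone sees it — exactly the "shifted cliff" described above.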
This article explores the transformative potential of integrating artificial intelligence (AI)-driven insights with MuleSoft and AWS platforms to achieve scalable enterprise solutions. This integration promises to enhance enterprise scalability through predictive maintenance, improve data quality through AI-driven data enrichment, and revolutionize customer experiences across industries like healthcare and retail. Furthermore, it emphasises navigating the balance between centralized and decentralized integration structures and highlights the importance of dismantling data silos to facilitate a more agile and adaptive business environment. Enterprises are encouraged to invest in AI skills and infrastructure to leverage these new capabilities and maintain competitive advantage. Introduction Not long ago, I had one of those "aha" moments while working late at our Woodland Hills office. Picture this: I was elbows-deep in the spaghetti of our MuleSoft integrations, and it hit me — what if we could fuse our conventional setup with AI-driven insights to revolutionize our enterprise scalability? As someone who has spent countless hours with MuleSoft and AWS, toggling between Anypoint Platform and cloud paradigms, I realized we were standing on the precipice of something transformative. The Magic of AI-Augmented Integration Platforms The trend of merging AI with platforms like MuleSoft is becoming a game-changer. Think about it — self-optimizing integration pipelines that don't just react but predict. AI-driven anomaly detection is no longer a futuristic notion but a present-day reality. A critical takeaway here is that enterprises must shift their focus toward building predictive maintenance into their integration solutions. This isn't just about reducing downtime; it's about reliability, a quality all stakeholders crave. Here's a personal aside: in one of my projects at TCS, we faced repeated disruptions due to undetected anomalies in our pipeline. 
After integrating an AI-centric approach using AWS’s AI/ML services, we saw a 30% decrease in system alerts. It felt like watching a well-oiled machine where everything just fit. It was hard work getting there, but the reduced manual monitoring was worth every bit of effort. Centralized Control vs. Decentralized Agility Let's face it — a debate that's been brewing is centralized versus decentralized integration. I'm of two minds here. Centralized platforms like MuleSoft offer comprehensive control, yet there's a strong argument for decentralized, microservices-led frameworks powered by AI. These can make autonomous decisions at the edge, thus providing agility. In practice, evaluating trade-offs is crucial. During Farmers Insurance projects, we struggled with balancing centralized governance with the nimbleness of decentralized systems — often a tug-of-war. Through trial and error, we realized that a hybrid approach, leveraging MuleSoft for core integrations while empowering microservices with AI-driven intelligence, struck the right chord. The key was not in choosing sides but in finding harmony between the two. Cross-Industry Applications: Breaking the Mold AI-driven insights aren’t limited to tech giants — they're creeping into retail and healthcare, too. In a recent pilot, we explored using MuleSoft solutions in a healthcare setting, where real-time data processing played a critical role in patient interactions. The challenge was integrating vast datasets, something AI handled adeptly. The result? Improved patient engagement and faster response times. In another example, a retail client used AI integration to enrich customer experiences, from personalized offers to stock predictions. You might say these are exceptions, not the rule, but they demonstrate the potential of cross-industry applications. The lesson here? Look beyond traditional tech spaces for unique use cases and new revenue streams. 
AI-Driven Data Enrichment: A Technical Deep Dive One of the lesser-known but powerful capabilities of AI is data enrichment. Within MuleSoft and AWS environments, machine learning algorithms are at work to refine and enhance data for superior analytics. It's like having a data wizard on your team. In practical terms, we deployed advanced algorithms to improve data quality at Farmers Insurance. The challenge was ensuring seamless integration without disrupting existing architectures — a frequent pain point. This experience taught us the importance of innovative middleware solutions to streamline AI insights integration. The result? Enhanced data accuracy and business intelligence, empowering informed decision-making. Lessons from the Trenches: Navigating Market Dynamics Market dynamics are shifting rapidly, but the struggle with siloed data persists. Inefficient integration architectures can be a thorn in the side of digital transformation. Here, AI-driven insights can play a crucial role. In a project where data silos were hindering progress, we revamped our strategy. By prioritizing AI integrations, we dismantled these silos, resulting in a more fluid and flexible system. The critical lesson was understanding that breaking down silos is just as important as building new integrations. A balance of both ensures scalable and adaptive solutions. Future Horizons: Preparing for the AI Revolution The enterprise integration landscape is on the cusp of a new era. AI-driven insights will automate decision-making and predictive analytics, fundamentally changing business operations and competitive dynamics. To stay ahead, it's imperative for companies to invest in AI skills and infrastructure. In my own journey, continuous learning and adaptation have been key. Embracing new technologies and methodologies isn't just a requirement — it's an ongoing pursuit of excellence. And yes, I still hit roadblocks. 
There's always more to learn, more to implement, but that's what makes this field so exciting. Conclusion: Embracing the Transformation Integrating AI-driven insights with MuleSoft and AWS opens doors to innovation and competitiveness. As we stand on the verge of this transformation, the opportunities are vast. By focusing on emerging trends, questioning conventions, and exploring new applications, enterprises can unlock unprecedented value. In conclusion, if you're like me, sipping a coffee and wondering how to elevate your integration game, take the leap. Blend AI with your MuleSoft and AWS strategy, embrace imperfections, learn from every hiccup, and watch your enterprise soar to new heights.
If you've been building with AI agents, you've probably hit the same wall I did: your agent needs to do things — query databases, call APIs, check systems — but wiring up each tool is a bespoke integration every time. The Model Context Protocol (MCP) solves this by giving agents a standard way to discover and invoke tools. Think of it as USB-C for AI tooling. The problem? Most MCP tutorials stop at "run it locally with stdio." That's fine for solo dev work, but it falls apart the moment you need:

- Multiple clients connecting to the same server
- Auth, session isolation, and scaling
- A deployment that doesn't die when your laptop sleeps

AWS Bedrock AgentCore Runtime changes the equation. You write an MCP server, hand it over, and AgentCore handles containerization, scaling, IAM auth, and session isolation — each user session runs in a dedicated microVM. No ECS clusters to configure. No load balancers to tune. In this post, we'll build a practical MCP server from scratch, deploy it to AgentCore Runtime, and connect an AI agent to it. The whole thing takes about 30-60 minutes. What We're Building We'll create an MCP server that exposes infrastructure health tools — the kind of thing a DevOps agent would use to check system status, list recent deployments, and surface alerts. It's more interesting than a dice roller but simple enough to follow. Here's the architecture: Your agent connects via IAM auth → AgentCore discovers the tools → your server executes them → results stream back. You never manage servers, containers, or networking.
Prerequisites Before we start, make sure you have:

- Python 3.10+ and uv (or pip — but uv is faster)
- AWS CLI configured with credentials that have Bedrock AgentCore permissions
- Node.js 18+ (for the AgentCore CLI)
- An AWS account with AgentCore access (there's a free tier)

Install the AgentCore tooling: Shell # AgentCore CLI npm install -g @aws/agentcore # AgentCore Python SDK pip install bedrock-agentcore # AgentCore Starter Toolkit (handles scaffolding + deployment) pip install bedrock-agentcore-starter-toolkit Step 1: Build the MCP Server Create your project structure: Shell mkdir infra-health-mcp && cd infra-health-mcp uv init --bare uv add mcp bedrock-agentcore Now create server.py. We'll use FastMCP, which gives us a decorator-based API for defining tools: Python from mcp.server.fastmcp import FastMCP from datetime import datetime, timedelta import random mcp = FastMCP("infra-health") @mcp.tool() def get_service_status(service_name: str) -> dict: """Check the health status of a deployed service. Args: service_name: Name of the service to check (e.g., 'api-gateway', 'auth-service', 'payments') """ # In production, this would hit your monitoring API statuses = ["healthy", "healthy", "healthy", "degraded", "unhealthy"] uptime = round(random.uniform(95.0, 99.99), 2) return { "service": service_name, "status": random.choice(statuses), "uptime_percent": uptime, "last_checked": datetime.utcnow().isoformat(), "active_instances": random.randint(2, 10), "avg_latency_ms": round(random.uniform(12, 250), 1) } @mcp.tool() def list_recent_deployments(hours: int = 24) -> list[dict]: """List deployments that occurred in the last N hours.
Args: hours: Number of hours to look back (default: 24) """ services = ["api-gateway", "auth-service", "payments", "notification-svc", "user-profile"] deployers = ["ci-pipeline", "ci-pipeline", "hotfix-manual"] deployments = [] for i in range(random.randint(1, 5)): deploy_time = datetime.utcnow() - timedelta( hours=random.randint(1, hours) ) deployments.append({ "service": random.choice(services), "version": f"v1.{random.randint(20,45)}.{random.randint(0,9)}", "deployed_at": deploy_time.isoformat(), "deployed_by": random.choice(deployers), "status": random.choice(["success", "success", "rolled_back"]) }) return sorted(deployments, key=lambda d: d["deployed_at"], reverse=True) @mcp.tool() def get_active_alerts(severity: str = "all") -> list[dict]: """Retrieve currently active infrastructure alerts. Args: severity: Filter by severity level - 'critical', 'warning', 'info', or 'all' """ alerts = [ { "id": "ALT-1024", "severity": "warning", "message": "auth-service p99 latency above threshold (>500ms)", "triggered_at": ( datetime.utcnow() - timedelta(minutes=23) ).isoformat(), "service": "auth-service" }, { "id": "ALT-1025", "severity": "critical", "message": "payments service error rate at 2.3% (threshold: 1%)", "triggered_at": ( datetime.utcnow() - timedelta(minutes=8) ).isoformat(), "service": "payments" }, { "id": "ALT-1026", "severity": "info", "message": "Scheduled maintenance window in 4 hours", "triggered_at": ( datetime.utcnow() - timedelta(hours=2) ).isoformat(), "service": "all" }, ] if severity != "all": alerts = [a for a in alerts if a["severity"] == severity] return alerts if __name__ == "__main__": mcp.run(transport="streamable-http") Key decisions here:

- Each tool has a clear docstring with typed args — this is what the LLM sees when deciding which tool to call, so be descriptive
- We're using streamable-http transport, which is what AgentCore Runtime expects
- In production, you'd replace the mock data with calls to Datadog, CloudWatch, your deployment system,
etc. Step 2: Test Locally Before deploying anything, make sure the server works: Shell # Start the server uv run server.py In another terminal, test it with the MCP inspector or a quick curl: Shell # Using the MCP CLI inspector npx @modelcontextprotocol/inspector http://localhost:8000/mcp You should see your three tools listed. Click through them, pass some args, verify the responses look right. Fix any issues now — it's much faster than debugging after deployment. Step 3: Prepare for AgentCore Runtime AgentCore Runtime needs your server wrapped with the BedrockAgentCoreApp. Update server.py by adding this at the top and modifying the entrypoint: Python from bedrock_agentcore.runtime import BedrockAgentCoreApp # ... (keep all your existing tool definitions) ... # Replace the if __name__ block: app = BedrockAgentCoreApp() @app.entrypoint() def handler(payload): return mcp.run(transport="streamable-http") if __name__ == "__main__": app.run() Alternatively, use the AgentCore Starter Toolkit to scaffold the project structure automatically: Shell agentcore init --protocol mcp This generates the Dockerfile, IAM role config, and agentcore.json for you. Copy your server.py into the generated project and point the entry point to it. Step 4: Deploy to AWS This is the part that used to take hours of ECS/ECR/IAM wrangling. With the Starter Toolkit, it's two commands: Shell # Configure (generates IAM roles, ECR repo, build config) agentcore configure # Deploy (builds container via CodeBuild, pushes to ECR, # deploys to AgentCore Runtime) agentcore deploy That's it. No Docker installed locally. No Terraform. CodeBuild handles the container image, and AgentCore Runtime manages the rest. The output gives you a Runtime ARN — save this, you'll need it to connect your agent.
Step 5: Invoke Your Deployed Server Test the deployed server using the AWS CLI: Shell aws bedrock-agent-runtime invoke-agent-runtime \ --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \ --payload '{"jsonrpc":"2.0","method":"tools/list","id":1}' \ --output text You should see your three tools returned. Now try calling one: Shell aws bedrock-agent-runtime invoke-agent-runtime \ --agent-runtime-arn "arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id" \ --payload '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"get_active_alerts","arguments":{"severity":"critical"}},"id":2}' \ --output text Step 6: Connect an AI Agent Now the fun part. Let's wire this up to a Strands agent that can use our infrastructure tools conversationally: Python from strands import Agent from strands.tools.mcp import MCPClient from mcp.client.streamable_http import streamablehttp_client # Connect to your deployed MCP server via IAM auth mcp_client = MCPClient( lambda: streamablehttp_client( url="https://your-agentcore-endpoint/mcp", # IAM auth is handled automatically via your AWS credentials ) ) with mcp_client: agent = Agent( model="us.anthropic.claude-sonnet-4-20250514", tools=mcp_client.list_tools_sync(), system_prompt="""You are a DevOps assistant with access to infrastructure health tools. When asked about system status, check services, review recent deployments, and surface any active alerts. Be concise and flag anything that needs immediate attention.""" ) response = agent( "Give me a quick health check — any services having issues? " "And were there any recent deployments that might be related?" ) print(response) The agent will automatically discover the tools, decide which ones to call, and synthesize the results into a coherent answer. You'll see it call get_active_alerts, then get_service_status for the flagged services, then list_recent_deployments to correlate — all without you writing any orchestration logic.
What AgentCore Gives You for Free It's worth pausing to appreciate what you didn't have to build:

Concern | Without AgentCore | With AgentCore
Container infra | ECR + ECS/EKS + ALB | Handled
Session isolation | Custom session management | microVM per session
Auth | OAuth setup, token management | IAM SigV4 built in
Scaling | Auto-scaling policies, metrics | Automatic
Networking | VPC, security groups, NAT | Managed
Health checks | Custom implementation | Built in

You wrote a Python file with tool definitions. Everything else is infrastructure you didn't touch. Production Considerations Before going live with real data, a few things to think about: Replace mock data with real integrations. The tool signatures stay the same — swap random.choice(statuses) with a call to your CloudWatch API, PagerDuty, or whatever you use. Add error handling. MCP tools should return meaningful errors, not stack traces. Wrap your integrations in try/except and return structured error responses. Think about tool granularity. Three focused tools are better than one "do everything" tool. The LLM needs clear, specific tool descriptions to make good decisions about what to call. Stateful vs. stateless. Our server is stateless (the default and recommended mode). If you need multi-turn interactions where the server asks the user for clarification mid-execution, look into AgentCore's stateful MCP support with elicitation and sampling. Connect to AgentCore Gateway. If your agent needs tools from multiple MCP servers, the Gateway acts as a single entry point that discovers and routes to all of them. You can also use the Responses API with a Gateway ARN to get server-side tool execution — Bedrock handles the entire orchestration loop in a single API call. Cleanup When you're done experimenting: Shell agentcore destroy This tears down the Runtime, CodeBuild project, IAM roles, and ECR artifacts. You'll be prompted to confirm. What's Next?
A few directions to take this further:

- Add a Gateway to combine your MCP server with AWS's open-source MCP servers (S3, DynamoDB, CloudWatch, etc.) into a single agent toolkit.
- Try the AG-UI protocol alongside MCP — it standardizes how agents communicate with frontends, enabling streaming progress updates and interactive UIs.

References

- https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html
- https://github.com/strands-agents/sdk-python
- https://aws.amazon.com/solutions/guidance/deploying-model-context-protocol-servers-on-aws
The article explores the journey of multi-cloud integration through the lens of personal experience, focusing on integrating MuleSoft and AWS using SAFe 5.0 principles. It begins by outlining the necessity of multi-cloud solutions in today's digitally connected world, highlighting challenges such as security and vendor lock-ins. The author discusses overcoming these challenges by employing SAFe 5.0's modular designs and integrating AI services like AWS SageMaker with MuleSoft for real-time decision-making. The article also emphasizes the importance of comprehensive training and cross-functional collaboration to bridge skills gaps. A real-world case study illustrates the approach’s success in reducing latency for an e-commerce giant. The conclusion stresses continuous learning and aligning technical initiatives with business objectives as key to leveraging multi-cloud environments. Introduction I still remember the first time I heard the term "multi-cloud integration." It was during a client meeting at Tata Consultancy Services in 2014. Fresh-faced and eager, I couldn't fathom the complexities that lay ahead. Fast forward to today, I find myself at the heart of pioneering integrations leveraging SAFe 5.0 principles with MuleSoft and AWS — a journey full of insights, occasional blunders, and numerous successes. Let's dive into this strategic blueprint which modern enterprises can adopt for optimizing their multi-cloud strategies. Embracing the Multi-Cloud Revolution In today's digitally connected world, multi-cloud solutions are more of a necessity than an option. From banking to retail, industries are transitioning to multi-cloud environments to harness flexibility, scalability, and redundancy. But with great power comes great responsibility, especially when it comes to security and governance. 
Emerging Trends: Security and Governance at the Forefront The financial sector, often risk-averse, has been a significant adopter of MuleSoft and AWS for real-time data processing. I recall a project where we integrated real-time transaction data across several cloud environments for a leading bank. We utilized AWS's Lambda for automated validations, ensuring compliance across different jurisdictions — a crucial step in maintaining data integrity and security. Personal Insight: During our deployment, we found that while AWS and MuleSoft offer robust frameworks for security, the challenge lay in integrating these seamlessly. Detailed planning and understanding of each platform's native capabilities were vital. My advice? Never underestimate the power of thorough documentation and the importance of a well-documented API architecture. The Contrarian View: The Vendor Lock-in Debate Many advocate that multi-cloud strategies eliminate vendor lock-in. Yet, as someone who's navigated these waters, I challenge this notion. The intricacies of integration can often weave a web of dependencies, especially when working with MuleSoft and AWS. Solving the Dependency Puzzle with SAFe 5.0 One strategy we've employed is designing modular and agnostic solutions. Utilizing SAFe 5.0's modular design principles, we ensure our integrations are flexible and can pivot with changing vendor landscapes. In a recent project at a healthcare firm, we leveraged MuleSoft's Anypoint Platform to create a loosely coupled architecture, enabling easy transitions between cloud providers. Lesson Learned: Over-engineering for flexibility can be a pitfall, adding unnecessary complexity. It's about striking a balance — focusing on critical services that need agility while ensuring core systems remain stable and robust. Surviving the Technical Trenches: AWS AI and MuleSoft Integrating AI services like AWS SageMaker with MuleSoft has been a game-changer, enabling real-time intelligent decision-making. 
For instance, in a retail analytics project, we created custom connectors in MuleSoft for seamless data flow into SageMaker, enhancing predictive analytics and improving customer personalization. Technical Deep-Dive: Crafting Custom Connectors Creating these connectors isn't just about linking systems; it’s about understanding the data lifecycle and business objectives. We encountered challenges with data latency and consistency, but by iterating our API definitions and leveraging AWS's data pipeline services, we achieved near-instantaneous data processing — a key success metric in that project. Behind the Scenes: Engaging with MuleSoft's C4E team was instrumental in overcoming integration roadblocks. If there's one thing I’ve learned, it’s that community collaboration often yields the most innovative solutions. Bridging the Skill Gap with SAFe 5.0 Despite its many benefits, the learning curve for integrating MuleSoft and AWS using SAFe 5.0 principles is steep. Here's what worked for us: Comprehensive Training Programs: We developed focused training sessions highlighting SAFe 5.0 frameworks and contextualizing them within our projects. This approach demystified complex topics and empowered our teams to innovate confidently. Cross-Functional Collaboration: By facilitating dialogue across departments — from developers to QA teams — we fostered a culture of shared knowledge and innovation. This collaborative ethos became a bedrock for overcoming integration hurdles. Real-World Implementation: A Case Study Last year, we spearheaded an integration initiative for an e-commerce giant aiming to reduce latency in order processing. Utilizing AWS's Outposts and Local Zones, paired with MuleSoft's capabilities, we achieved remarkable results. Concrete Example: We reduced latency by 40%, improving customer satisfaction scores by a significant margin. The key was aligning technical prowess with business goals—something SAFe 5.0 principles advocate strongly. 
Actionable Takeaway: Always align technical initiatives with overarching business objectives. It's not just about the technology; it's about driving tangible business outcomes. Conclusion: The Road Ahead The integration of MuleSoft with AWS, underpinned by SAFe 5.0 principles, offers a robust framework for tackling modern multi-cloud challenges. As we look to the future, the demand for hybrid solutions with integrated AI capabilities will only grow. Final Thought: If there's one piece of advice I'd impart — never stop learning. The technology landscape is ever-evolving, and staying curious ensures we remain at the forefront of innovation. As I share these hard-won insights over a metaphorical cup of coffee, I hope they serve as a guide for your own multi-cloud journey. Let's embrace the complexities with enthusiasm and turn challenges into opportunities for growth.
There are days when I want an agent to work on a project, run commands, install packages, and poke around a repo without getting anywhere near the rest of my machine. That is exactly why Docker Sandboxes clicked for me. The nice part is that the setup is not complicated. You install the CLI, sign in once, choose a network policy, and launch a sandbox from your project folder. After that, you can list it, stop it, reconnect to it, or remove it when you are done. In this post, I am keeping the focus narrow on purpose: Set up Docker Sandboxes, run one against a local project, understand the few commands that matter, and avoid the mistakes that usually slow people down on day one. What Are Docker Sandboxes? Docker Sandboxes give you an isolated environment for coding agents. Each sandbox runs inside its own microVM and gets its own filesystem, network, and Docker daemon. The simple way to think about it is this: the agent gets a workspace to do real work, but it does not get free access to your whole laptop. That is the reason this feature is interesting. You can let an agent install packages, edit files, run builds, and even run Docker commands inside the sandbox without turning your host machine into the experiment. Before You Start You do not need a big lab setup to try this, but you do need:

- A macOS or Windows machine
- On Windows, the "Windows Hypervisor Platform" feature enabled
- The Docker sbx CLI installed
- An API key or authentication for the agent you want to use

If you start with the built-in shell agent, Docker sign-in is enough for your first walkthrough. If you want to start with claude, copilot, codex, gemini, or another coding agent, make sure you also have that agent's authentication ready. If you are on Windows, make sure Windows Hypervisor Platform is enabled first. PowerShell Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -All If Windows asks for a restart, do that before moving on. Note: Docker documents the getting-started flow with the sbx CLI.
There is also a docker sandbox command family, but sbx is the cleanest way to get started, so that is what I am using in this walkthrough.

Step 1: Install the Docker Sandboxes CLI

On Windows:

PowerShell

winget install -h Docker.sbx

On macOS:

Shell

brew install docker/tap/sbx

That is it for installation. If sbx is not recognized immediately after install, open a new terminal window and try again. I hit that once on Windows after installation, and a fresh terminal fixed it.

Note: Docker Desktop is not required for sbx.

Step 2: Sign In

Now sign in once:

PowerShell

sbx login

This opens the Docker sign-in flow in your browser. During login, Docker asks you to choose a default network policy for your sandboxes:

- Open – Everything is allowed
- Balanced – Common development traffic is allowed, but it is more controlled
- Locked down – Everything is blocked unless you explicitly allow it

If you are just getting started, pick Balanced. That is the easiest choice for a first run because it usually works without making the sandbox too open.

Step 3: Pick a Small Project Folder

You can use an existing project folder, or create a tiny test folder just for this walkthrough. For example:

PowerShell

mkdir hello-sandbox
cd hello-sandbox

If you want, drop a file into it so you have something visible inside the sandbox:

PowerShell

echo "# hello-sandbox" > README.md

Nothing fancy is needed here. The goal is just to have a folder you are comfortable letting the agent work in.

Step 4: Run Your First Sandbox

Here is the command that matters most:

PowerShell

sbx run shell .

Figure 1.1: Creating a new sandbox with the sbx command

What this does:

- Starts a sandbox for the shell agent
- Mounts your current folder into the sandbox
- Opens an isolated environment where the agent can work on that folder

If you prefer naming your sandbox from the start, use:

PowerShell

sbx run --name my-first-sandbox shell .

On the first run, Docker may take a little longer because it needs to pull the agent image.
That is normal. Later runs are much faster. I like starting with shell because it is the easiest way to prove the sandbox is working before you bring an actual coding agent into the mix. Once that works, replace shell with the agent you actually want to use, such as claude, copilot, codex, gemini, or another supported agent from the Docker docs.

Step 5: See What Is Running

To check your active sandboxes, run:

PowerShell

sbx ls

You should see output with a name, status, and uptime. This is a handy command because once you start using sandboxes regularly, it becomes the quickest way to see what is still running and what needs cleanup.

Figure 1.2: Listing all active sandboxes on the machine

Step 6: Switch to a Real Coding Agent

Once you have proved the sandbox works with shell, move to the coding agent you actually want to use. For example:

PowerShell

sbx run copilot

Figure 1.3: Running the Copilot agent in a Docker sandbox

or

PowerShell

sbx run gemini

Figure 1.4: Running the Gemini agent in a Docker sandbox

The workflow is the same as shell. The only thing that changes is the agent inside the sandbox. If the agent needs its own provider login or API key, complete that setup and then continue. The important point is that the agent is still running inside the sandbox, not directly on your host machine.

Step 7: Stop the Sandbox When You Are Done

When you are finished using a sandbox, you can stop it:

PowerShell

sbx stop copilot-dockersandboxtest

If you don't remember the name, run sbx ls first to see all the active sandboxes. Stopping is useful when you want to pause work without removing the sandbox immediately.
Step 8: Remove the Sandbox When You No Longer Need It

When you are done for good, remove it:

PowerShell

sbx rm copilot-dockersandboxtest

Or remove all sandboxes at once by passing the --all flag:

PowerShell

sbx rm --all

Figure 1.5: Removing all sandboxes with sbx rm --all

Step 9: Use YOLO Mode Safely

Now for the newer idea Docker has just announced: YOLO mode. If you want to read more about it, refer to Docker's recent blog post, which is worth bookmarking: Docker Sandboxes: Run Agents in YOLO Mode, Safely. In simple terms, YOLO mode means letting a coding agent work with fewer interruptions and fewer approval prompts. That can save time, but it only makes sense when the agent is already inside a sandbox.

Note: I would not start with YOLO mode on day one. I would start with a normal sandbox run, get comfortable with the lifecycle first, and only then try YOLO mode.

Conclusion

This article explained Docker Sandboxes and provided step-by-step instructions for getting started. What I like about Docker Sandboxes is that they remove a lot of friction from a very real problem. Sometimes you want an agent to have freedom, but not too much freedom. You want it to run commands, inspect files, and do useful work, but you also want a clear boundary around that work. That is the sweet spot Docker Sandboxes are aiming for. If you are curious about them, my advice is simple: do not start with a giant repo or a complicated setup. Pick one small folder, use the Balanced policy, run a single sandbox, and get comfortable with the basic lifecycle. Once that clicks, moving on to YOLO mode feels much easier.
The base Linux distribution we choose for building our container images affects the whole container stack: image size, performance, CVE exposure, patch cadence, debugging, maintainability. This is why going for some random base that ‘just works’ is not an option. Luckily, there are multiple good options on the market for various use cases and business needs.

This guide summarizes five lightweight Linux distributions chosen for their production relevance: small, container-focused, actively maintained, and popular with developers. The summary is based on criteria that matter in production, such as footprint, libc variant, licensing, security features, and support. Note that this article is not a best-to-worst ranking; the distros are listed alphabetically, which happens to put the most popular one first. These distributions are built for different goals, teams, and risk profiles. Our goal here is to provide a data-based comparison using information available from vendor documentation, official websites, and container registries. The point is to help you make an informed decision for your own use case, not to crown a universal winner.

Alpine Linux

Alpine Linux is the first distribution that comes to mind when one says ‘a lightweight base for containers’. It is minimalistic, clean, simple, and very common in Dockerfiles. It doesn’t include any unnecessary packages and uses:

- musl libc instead of the glibc found in most other distributions. Unlike glibc, musl was developed with a minimalistic design in mind, so it has smaller static and dynamic overhead.
- BusyBox instead of the GNU core utilities. BusyBox is a set of command-line Unix utilities about 1 MB in size, which means that BusyBox-based distributions consume much less memory.
- The small and modular OpenRC init system instead of systemd.
- Alpine Package Keeper (apk) as the package manager, which is smaller than yum/rpm or deb/apt.
All of that contributes to Alpine’s miniature size — the compressed image size of Alpine on Docker Hub is less than 4 megabytes. At the same time, if you need extra packages, you can add them from the repo.

As far as security is concerned, Alpine was designed with security in mind. The lack of extra packages reduces the attack surface. Plus, there are additional security features such as compiling userland binaries as Position Independent Executables (PIE) with stack smashing protection.

Alpine is 100% free and community-based. There’s no single distro-wide EULA, and the package licenses vary and must be checked per package. The Alpine team does not provide enterprise support for Alpine, but it is available from third-party vendors as part of their commercial offerings.

As for releases, Alpine has a predictable rhythm. The stable branches are released twice a year, in May and November. There’s no vendor “LTS” program in the enterprise sense, but the main repository is generally supported for about two years.

Ironically, its drawbacks come from its strong sides. The musl libc may have inferior performance compared to glibc for some workloads, especially Java-based ones. Some teams may experience compatibility issues when migrating their container images to a musl-based distribution. In addition, the lack of dedicated support from the project team may be unsuitable for enterprises looking for strict SLAs for patches and fixes.

Alpaquita Linux

Alpaquita Linux is developed and supported by BellSoft. Like Alpine, it was designed to be minimalistic, efficient, and secure. At the same time, its goal is to close the gap between open-source lightweight images and enterprise expectations. Alpaquita also includes only essential packages and uses BusyBox, OpenRC, and apk. As for libc, however, it offers two flavors — glibc and musl perf, with performance equal or superior to glibc depending on the workload.
The choice enables teams to leverage musl efficiency without impacting performance, or to stay on glibc and still get the reduced footprint. The Alpaquita musl images on Docker Hub are less than four megabytes; the glibc ones are about nine megabytes.

Although Alpaquita Linux is compatible with various runtimes and offers ready images for Java, Python, and C++, its main strength is in the Java realm. Alpaquita integrates seamlessly with BellSoft’s other products for Java development, Liberica JDK and Liberica Native Image Kit, and helps reduce the RAM consumption of Java applications by up to 30%. Alpaquita-based buildpacks for Java are also available.

As for security, Alpaquita has additional features such as kernel hardening. There is also a set of hardened images with a minimized attack surface, provenance data, and an SLA for patches for both the OS and the runtime from one team.

From a maintenance perspective, Alpaquita comes in Stream, which is a rolling, continuously updated release, and LTS with four years of support. The distribution is open source, free to use, and covered by a EULA. Commercial support is also available from the BellSoft team. The drawback might be the limited choice of packages in the repository.

Chiseled Ubuntu

Chiseled Ubuntu is Canonical’s way to take the best of two worlds. It is almost a distroless base image stripped down to the essentials, but still the well-known and beloved Ubuntu distribution with a broad ecosystem, release roadmap, and LTS. With the tool called chisel, one can cut out a custom base image with only those packages required for the application to run. Canonical’s documentation and official images emphasize that chiseled images often include no shell and no package manager in the final runtime image, which contributes to a minimized attack surface. The final images can be about 5–6 megabytes in size, depending on the runtime stack you target.
Because it is Ubuntu-based, the distribution uses glibc and enjoys Ubuntu’s broad compatibility. Chiseled Ubuntu is open source; the images are built from Ubuntu packages, so the contents are mostly open source packages under their respective licenses. Commercial support is available from Canonical, which might be appealing to teams that want a familiar ecosystem, a minimal image, and enterprise support.

As with Alpine, Chiseled Ubuntu’s drawback comes from its strong side. To get a custom image, you need to cut out the OS yourself using the dedicated tool, as there are no ready-to-use images. If the application changes, you may need to repeat the process.

RHEL UBI Micro

RHEL UBI Micro is Red Hat’s base image with a compressed size of about 10 MB. The image is part of the RHEL UBI family, so it is RHEL as you know it: glibc-based and seamlessly compatible with Red Hat’s infrastructure. But like Chiseled Ubuntu, UBI “micro” images are stripped down and contain only the packages essential for running the application in a container.

The images are updated regularly; LTS releases are based on the RHEL lifecycle model. Licensing might be an important nuance here. UBI images are described as freely redistributable, but under the UBI EULA, and support is part of Red Hat’s subscription ecosystem. In practice, teams may want to pick UBI Micro when they want the Red Hat supply chain and vendor alignment.

Wolfi

Wolfi is maintained by Chainguard. It is a container-first Linux “un-distro,” as the vendor calls it, designed around modern supply-chain security needs with a focus on provenance, SBOMs, and signing. A typical compressed image size for Wolfi is around 5 to 7 MB, depending on the architecture. It uses apk like Alpine, but unlike Alpine, it is based on glibc. That makes Wolfi a good option when you want minimal images without the surprises of the default musl implementation.
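A quick way to see which side of the musl-vs-glibc divide a given environment falls on is to ask Python, which ships in many images. This is a convenience sketch, not an official detection mechanism: platform.libc_ver() reports a glibc version on glibc systems but typically returns an empty name on musl-based images such as Alpine.

```python
import platform

def detect_libc() -> str:
    # platform.libc_ver() inspects the libc the running interpreter links.
    # On glibc systems it returns e.g. ("glibc", "2.35"); on musl-based
    # images it usually returns ("", "").
    name, version = platform.libc_ver()
    if name:
        return f"{name} {version}"
    return "non-glibc (possibly musl)"

print(detect_libc())
```

Running this inside a candidate base image (for example via docker run) gives a fast sanity check before committing to a migration.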
Wolfi is the base on which Chainguard OS is built and is used in Chainguard Containers — distroless images that are rebuilt daily and come with comprehensive provenance data. Wolfi’s releases are rolling. The emphasis is on fast package updates rather than versioned distribution releases. Chainguard documentation states that the images are rebuilt on a frequent schedule, commonly daily/nightly. On the other hand, there isn’t an LTS concept the way you’d see with a vendor enterprise distro.

Wolfi is open source and freely available under the Apache License 2.0. Commercially, Chainguard has a paid offering around hardened “production” images with support commitments and patch SLAs. The caveat is the trade-off that comes with rolling updates: you get fresh images, but you should invest in reproducibility and pinning if you want stable deployments.

Conclusion: Factors to Consider When Selecting a Linux Base Image

To sum up, there is no single best Linux distribution for container images, only various options for different teams, workloads, and constraints. Some prioritize small size and simplicity. Others need compatibility with their existing infrastructure. For some, enterprise support matters the most. So, comparing Linux distributions by size only does not cover the broader picture of business requirements. When choosing a base Linux distro for containers, teams should pay attention to the following factors:

- The libc implementation. Choosing between musl and glibc is a big decision point that may improve or degrade performance, or cause compatibility problems.
- Update model and release cadence. Rolling vs. stable vs. LTS influences the way teams patch, test, and update images. Here, you need to decide whether you need maximum freshness or a more predictable lifecycle.
- Security posture. Look at attack surface reduction, patch cadence, hardened versions, and supply-chain features such as provenance, signing, and SBOMs.
- Licensing.
Some options are community distributions, while others are vendor-distributed images under EULAs. That may matter for compliance and internal policy reviews.
- Support. Decide whether you need vendor-backed support, can do with third-party support, or require no commercial support at all. This is often determined by organizational requirements.
- Ecosystem fit. The best base image is usually the one that fits your CI/CD, scanning tools, and compliance requirements.

In short, choosing a base Linux distro is a platform decision. The right choice is the one that aligns with your application's compatibility needs, your team's operational model, and your organization's security and compliance requirements.
Most Docker tutorials show secrets passed as environment variables. It's convenient, works everywhere, and feels simple. It's also fundamentally insecure.

Environment variables are visible to any process running inside the container. They appear in docker inspect output accessible to anyone with Docker socket access. Debugging tools log them. Child processes inherit them. And in many logging frameworks, they get written to log files where they persist indefinitely. Consider this common pattern:

Shell

docker run -e DATABASE_PASSWORD=SuperSecret123 myapp

That password is now:

- Visible in docker inspect myapp
- Readable by any process in the container via /proc/1/environ
- Inherited by every subprocess spawned by the application
- Potentially logged by the application's error handling
- Available to anyone with read access to the Docker socket

Screenshot of docker inspect showing environment variables with secrets visible

This is not theoretical. In production pharmaceutical environments managing patient data under HIPAA, environment variable leakage through log aggregation systems has triggered compliance violations.

Docker Swarm Secrets: The Native Solution

Docker Swarm includes built-in secret management that addresses the environment variable problem through encryption and in-memory delivery.

How Swarm Secrets Work

When you create a secret in Swarm, the secret value is encrypted and stored in Swarm's distributed state (backed by Raft consensus). The secret is only decrypted on nodes running services that explicitly declare they need it. On those nodes, secrets are mounted as files in an in-memory tmpfs filesystem at /run/secrets/.
This means:

- Encrypted at rest: Secrets are encrypted in Swarm's internal database
- Encrypted in transit: Secrets are transmitted over TLS between Swarm nodes
- Never written to disk: Secrets exist only in memory via tmpfs
- Scoped access: Only containers declaring the secret can read it
- No inspect visibility: docker inspect shows secret names, not values

Important security note: While Swarm secrets are encrypted at rest, the encryption keys are managed by the Swarm itself and reside in manager node memory. This means an attacker with privileged access to a manager node could theoretically access them. However, this is still a massive improvement over environment variables, which are exposed at the filesystem and process level on every worker node.

Example usage:

Shell

# Create a secret
echo "SuperSecret123" | docker secret create db_password -

# Deploy a service using the secret
docker service create \
  --name api \
  --secret db_password \
  myapp:latest

# Inside the container
cat /run/secrets/db_password
# SuperSecret123

# From the host
docker inspect api

Terminal screenshot showing secret mounted at /run/secrets/ with permissions 400

File permissions: The secret file is mounted with 400 permissions (read-only, owner-only) and owned by root. This means only the container's root user — or a process that has dropped privileges after reading — can access it. If your application runs as a non-root user (best practice), you'll need to read the secret during initialization while still running as root, then drop privileges.

Screenshot of docker inspect output showing SecretName but no SecretValue

Production reality: In pharmaceutical cluster environments, Swarm secrets enable compliance with data protection requirements by ensuring database credentials are never written to disk and are only accessible to explicitly authorized services.
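In application code, the file-based pattern above usually ends up as a small helper that reads from the tmpfs mount instead of os.environ. A minimal Python sketch, assuming the secret was declared on the service as db_password; the SECRETS_DIR override exists only to make the helper easy to test outside Swarm, since the real mount point is always /run/secrets:

```python
import os
from pathlib import Path

def read_secret(name, default=None):
    # Swarm mounts each declared secret at /run/secrets/<name> on tmpfs.
    secrets_dir = Path(os.environ.get("SECRETS_DIR", "/run/secrets"))
    path = secrets_dir / name
    if path.is_file():
        # Strip the trailing newline that `echo` adds at creation time.
        return path.read_text().rstrip("\n")
    if default is not None:
        return default
    raise RuntimeError(f"secret {name!r} is not mounted")

# In a service that declared --secret db_password:
# DB_PASSWORD = read_secret("db_password")
```

Reading the file at startup (rather than exporting its value into the environment) keeps the credential out of docker inspect and child-process environments.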
When Swarm Secrets Are Enough

Swarm secrets work well for:

- Single-platform Docker deployments (not mixing VMs and containers)
- Static secrets that change infrequently (manual rotation is acceptable)
- Environments where Vault's operational complexity isn't justified
- Simple microservice architectures where each service needs 2-5 secrets

Swarm secrets are Docker-native, require no external dependencies, and work on single-node "Swarms" (you can run docker swarm init on a single host to get secret management without clustering).

HashiCorp Vault: When You Need More

Vault is an external secret manager that adds capabilities Swarm secrets don't have: dynamic secret generation, automatic rotation, fine-grained access policies, and audit logging.

Dynamic Secrets: The Key Differentiator

The most powerful Vault feature is dynamic secrets. Instead of storing a static database password, Vault generates temporary credentials on-demand that expire automatically.

Traditional approach - static password stored in Vault:

Shell

vault kv put secret/db password=SuperSecret123

Dynamic approach - Vault generates temporary credentials:

Shell

vault read database/creds/app-role
# Returns:
# username: v-token-app-role-8h3k2j
# password: A1Bb2Cc3Dd4Ee5Ff (auto-generated)
# lease_duration: 3600 (expires in 1 hour)

Terminal output showing Vault returning temporary username/password with lease_duration

When the application requests database credentials from Vault, Vault connects to the database and creates a temporary user with the exact permissions the application needs. That user exists for a limited time (configurable, typically 1-24 hours), then Vault automatically revokes it. This solves two problems:

- Credential sprawl: No static password shared across environments
- Blast radius: Compromised credentials expire automatically

Audit Logging for Compliance

Vault logs every secret access.
This is required for SOC 2 Type II and PCI DSS compliance, where auditors need proof of who accessed which secrets when. Example Vault audit log entry:

JSON

{
  "time": "2026-03-30T19:45:12Z",
  "type": "response",
  "auth": {
    "token_type": "service",
    "entity_id": "api-service"
  },
  "request": {
    "path": "database/creds/app-role"
  },
  "response": {
    "secret": true
  }
}

Vault audit log showing timestamp, entity_id, request path, and response metadata

Every access is logged with timestamps, the requesting identity, and the secret path. This log is write-only (even Vault admins can't modify it) and can be exported to SIEM systems.

When Vault Is Justified

Use Vault when:

- You need dynamic database credentials (most important use case)
- Compliance requires audit trails (SOC 2, PCI DSS, HIPAA)
- You're managing secrets across multiple platforms (Docker + VMs + Kubernetes)
- Automated secret rotation is required
- You have dedicated operations staff to maintain Vault infrastructure

Vault's operational complexity is real. It requires:

- High-availability deployment (3+ nodes)
- Secure initialization and unsealing procedures
- TLS certificate management
- Backup and disaster recovery planning
- Access policy maintenance

For a 5-person startup, this overhead usually isn't justified. For Fortune 500 pharmaceutical operations managing hundreds of microservices accessing regulated data stores, it's mandatory infrastructure.

BuildKit Secret Mounts: Build-Time Security

Build-time secrets are different. You need credentials during docker build to access private npm registries, clone private git repos, or download proprietary dependencies. These secrets should never persist in the final image. BuildKit secret mounts solve this. BuildKit has been the default builder since Docker Engine 23.0, so if you're on a modern Docker version, you already have this capability — no special flags or setup required.
Dockerfile:

Dockerfile

FROM node:18.20.5-alpine3.20
WORKDIR /app
COPY package.json* ./
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install --only=production && \
    npm cache clean --force
COPY app.js ./
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 && \
    chown -R nodejs:nodejs /app
USER nodejs
CMD ["node", "app.js"]

Build the image with the secret:

Shell

docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .

The .npmrc file is available inside the container during npm install, but it's not written to any image layer. It's not in the final image. It's not in docker history. It existed only for the duration of that one RUN instruction.

Diagram showing BuildKit secret mount lifecycle - secret available during RUN, then immediately discarded

Why BuildKit Secrets Matter: Before BuildKit secrets, developers used ARG or multi-stage builds with complex cleanup scripts. Both leaked secrets into intermediate layers visible in docker history. BuildKit secrets are ephemeral by design — they can't leak because they never persist.

Common Build-Time Secret Patterns

Private npm/pip registries:

Dockerfile

RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install

SSH keys for private git repos:

Dockerfile

RUN --mount=type=secret,id=ssh_key,target=/tmp/key \
    cp /tmp/key /root/.ssh/id_rsa && \
    chmod 600 /root/.ssh/id_rsa && \
    git clone [email protected]:company/private-repo.git && \
    rm /root/.ssh/id_rsa

API tokens for downloading artifacts:

Dockerfile

RUN --mount=type=secret,id=api_token \
    TOKEN=$(cat /run/secrets/api_token) && \
    curl -H "Authorization: Bearer $TOKEN" \
      https://api.company.com/artifact.tar.gz -o /tmp/artifact.tar.gz

Secret Scanning: Prevention Layer

Despite proper secret management, developers still accidentally commit secrets. GitLeaks and similar tools scan repositories for patterns matching credentials.
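One signal such scanners combine with regex rules is string entropy: random keys score far higher than ordinary identifiers. A toy Python illustration of the idea, not GitLeaks' actual algorithm or thresholds:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_like_secret(token, threshold=4.0):
    # Long, high-entropy tokens (random API keys) score well above
    # ordinary English words; the 4.0 threshold is illustrative only.
    return len(token) >= 20 and shannon_entropy(token) > threshold

print(looks_like_secret("ghp_x7Kq9Lm2Np4Rt8Vw3Yz6Bc1Df5Gh0Jk"))  # True
print(looks_like_secret("configuration_setting"))                 # False
```

Real scanners tune these thresholds per character set and pair them with pattern rules to keep false positives manageable.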
Shell

# Scan current repository
docker run -v $(pwd):/path zricethezav/gitleaks:latest \
  detect --source /path --verbose

GitLeaks terminal output showing detected AWS key and GitHub token with file paths and line numbers

GitLeaks detects:

- AWS keys (AKIA...)
- GitHub tokens (ghp_...)
- Stripe keys (sk_live_...)
- Private keys (-----BEGIN PRIVATE KEY-----)
- Database connection strings
- High-entropy strings (potential secrets)

Prevention via Pre-Commit Hooks

The most effective scanning happens before commit:

.pre-commit-config.yaml:

YAML

repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

Install the hook:

Shell

pre-commit install
# Now every git commit runs GitLeaks first
git commit -m "Add config"
# GitLeaks scan...
# ERROR: Secret detected in config.yml

Terminal showing GitLeaks blocking a commit with "ERROR: Secret detected in config.yml"

Pre-commit hooks prevent secrets from entering git history. CI/CD scanning catches what pre-commit missed. Together, they create defense in depth.

Critical: Secrets in Git Are Permanent

Even after deleting a file containing secrets, those secrets remain in git history indefinitely. The only remediation is to rotate the secret (assume it's compromised) and optionally rewrite history with git filter-branch or BFG Repo-Cleaner.

Layered Approach for Production

Production environments don't choose one solution.
They layer multiple approaches:

| Secret type | Solution | Why |
| --- | --- | --- |
| Build-time (npm, SSH) | BuildKit Mounts | Ephemeral, can't leak into image |
| Simple service secrets | Docker Swarm Secrets | Native, encrypted, no external deps |
| Database credentials | Vault Dynamic Secrets | Auto-expiring, audit trail |
| Compliance-regulated | Vault + Audit Logs | SOC 2, PCI DSS requirements |
| Detection | GitLeaks + Pre-commit | Prevent accidents |

Architecture diagram showing layered secrets approach - BuildKit for builds, Swarm for simple secrets, Vault for DB, GitLeaks for prevention

Example Architecture for a Pharmaceutical Application:

- CI/CD pipeline: BuildKit mounts for private npm registry access
- API service: Swarm secret for JWT signing key (static, rotated quarterly)
- Database access: Vault dynamic credentials (expire every 4 hours, audit logged)
- Pre-commit hooks: GitLeaks scanning on every developer commit
- CI/CD gates: Automated GitLeaks scan on every pull request

Key Takeaways

Environment variables are not secrets. They're visible to any process, appear in docker inspect, and get logged. Use them for configuration, not credentials.

Swarm secrets are underutilized. Most teams don't realize Docker has native secret management that works on single nodes. No Vault complexity required for simple use cases.

Vault's value is dynamic secrets. Static secret storage is a nice feature. Dynamic database credentials that auto-expire are transformative for security posture.

BuildKit secrets prevent build leakage. Before BuildKit, build-time secrets inevitably leaked into image layers. BuildKit mounts are ephemeral by design.

Secrets in git are forever. File deletion doesn't remove secrets from history. Rotate immediately if detected. Pre-commit hooks prevent the problem.

Layer your approach. Production systems use BuildKit for builds, Swarm for simple secrets, Vault for dynamic credentials, and GitLeaks for prevention. Each solves a different problem.

Hands-On Practice

Want to practice these concepts?
Lab 10 in the Docker Security Practical Guide covers all five scenarios:

- Anti-patterns (environment variables, docker history leaks)
- Swarm secrets (encrypted, tmpfs-mounted)
- Vault integration (dynamic credentials, audit logging)
- BuildKit secret mounts (ephemeral build-time secrets)
- Secret scanning with GitLeaks (pre-commit hooks, CI/CD)

All labs are executable on Docker Desktop (macOS/Windows/Linux).

Note: Lab 10 covers Vault in development mode to demonstrate core concepts. For production Vault deployment with high availability, TLS, dynamic database credentials, and audit logging integration, see the upcoming Lab 11 (Tier 2 Deep-Dive) in the same repository.

GitHub: https://github.com/opscart/docker-security-practical-guide/tree/master/labs/10-secrets-management
Complete guide: https://opscart.com/docker-security-guide/docker-secrets-management/
Introduction

Arm technology now powers a broad spectrum of on-premises and cloud server workloads. Building on Ampere Computing's previous reference architecture, which demonstrated that Apache Spark on Ampere Altra – 128C (Ampere Altra 128 Cores) processors delivers superior performance per rack, lower power consumption, and optimized CapEx and OpEx, this paper evaluates and extends that analysis to showcase Spark performance on the latest generation of AmpereOne® M processors.

Scope and Audience

This document describes the process of setting up, tuning, and evaluating Spark performance using a testbed powered by AmpereOne® M processors. It includes a comparative analysis of the performance benefits of the 12-channel AmpereOne® M processors relative to their predecessors, specifically Ampere Altra – 128C processors. Additionally, the paper examines the Spark performance improvements achieved by using a 64KB page-size kernel over standard 4KB page-size kernels. We outline the installation and tuning procedures for deploying Spark on both single-node and multi-node clusters. These recommendations are intended as general guidelines, and configuration parameters can be further optimized based on specific workloads and use cases.

This document is intended for sales engineers, IT and cloud architects, IT and cloud managers, and customers seeking to leverage the performance and power efficiency advantages of Ampere Arm servers across their IT infrastructure. It provides practical guidance and technical insights for professionals interested in deploying and optimizing Arm-based Spark solutions.

AmpereOne® M Processors

AmpereOne® M is part of the AmpereOne® family of high-performance server-class processors, designed to deliver exceptional performance for AI compute and a wide range of mainstream data center workloads.
Data-intensive applications such as Hadoop and Apache Spark benefit directly from the 12 DDR5 memory channels, which provide the high memory bandwidth required for large-scale data processing. AmpereOne® M processors introduce a new platform architecture with a higher core count and additional memory channels, differentiating them from earlier Ampere platforms while preserving Ampere's Cloud Native processing principles. Designed from the ground up for cloud efficiency and predictable scaling, AmpereOne® M employs a one-to-one mapping between vCPUs and physical cores, ensuring consistent performance without resource contention. With up to 192 single-threaded cores and twelve DDR5 channels running at 5600 MT/s, AmpereOne® M delivers the sustained throughput required for demanding workloads such as Spark, as well as modern AI inference based on large language models (LLMs). AmpereOne® M also emphasizes exceptional performance-per-watt, helping reduce operational costs, energy consumption, and cooling requirements in modern data centers.

Apache Spark

Apache Spark is a unified data processing and analytics framework used for data engineering, data science, and machine learning workloads. It can operate on a single node or scale across large clusters, making it suitable for processing large and complex datasets. By leveraging distributed computing, Spark efficiently parallelizes data processing tasks across multiple nodes, either independently or in combination with other distributed computing systems. Spark utilizes in-memory caching, which allows for quick access to data and optimized query execution, enabling fast analytic queries on datasets of any size. The framework provides APIs in popular programming languages such as Java, Scala, Python, and R, making it accessible to the broad developer community.
Spark supports various workloads, including real-time analytics, batch processing, interactive queries, and machine learning, offering a comprehensive solution for modern data processing needs. Spark supports multiple deployment models. It can run as a standalone cluster or integrate with cluster management and orchestration platforms such as Hadoop YARN, Kubernetes, and Docker. This flexibility allows Spark to adapt to diverse infrastructure environments and workload requirements. Spark Architecture and Components Figure 1 Spark Driver The Spark Driver serves as the central controller of the Spark execution engine and is responsible for managing the overall state of the Spark cluster. It interacts with the cluster manager to acquire the necessary resources, such as virtual CPUs (vCPUs) and memory. Once the resources are obtained, the Driver launches the executors, which are responsible for executing the actual tasks of the Spark application. Additionally, the Spark Driver plays a crucial role in maintaining the state of the application running on the cluster. It keeps track of various important information, such as the execution plan, task scheduling, and the data transformations and actions to be performed. The Driver coordinates the execution of tasks across the available executors, ensuring efficient data processing and computation. Spark Driver, hence, acts as a control unit orchestrating the execution of the Spark application on the cluster and maintaining the necessary states and communication with the cluster manager and executors. Spark Executors Spark Executors are responsible for executing the tasks assigned to them by the Spark Driver. Once the Driver distributes the tasks across the available Executors, each Executor independently processes its assigned tasks. The Executors run these tasks in parallel, leveraging the resources allocated to them, such as CPU and memory. 
They perform the necessary computations, transformations, and actions specified in the Spark application code. This includes operations like data transformations, filtering, aggregations, and machine learning algorithms, depending on the nature of the tasks. During the execution of the tasks, the Executors communicate with the Driver, providing updates on their progress and reporting the results of each task. Cluster Manager The Cluster Manager is responsible for maintaining the cluster of machines on which the Spark applications run. It handles resource allocation, scheduling, and management of the Spark Driver and Executors, ensuring efficient execution of Spark applications on the available cluster resources. When a Spark application is submitted, the Driver communicates with the Custer Manager to request the necessary resources, such as CPU, memory, and storage, to run the application. It ensures that the resources are distributed effectively to meet the requirements of the Spark application. This includes tasks such as assigning containers or worker nodes to execute the Spark Executors and ensuring that the required dependencies and configurations are in place. Spark RDD Spark uses a concept called Resilient Distributed Dataset (RDD), an abstraction that represents an immutable collection of objects that can be split across a cluster. RDDs can be created from various data sources, including SQL databases and NoSQL stores. Spark Core, which is built upon the RDD model, provides essential functionalities such as mapping and reducing operations. It also offers built-in support for joining data sets, filtering, sampling, and aggregation, making it a powerful tool for data processing. When executing tasks, Spark splits them into smaller subtasks and distributes them across multiple executor processes running on the cluster. This enables the parallel execution of tasks across the available computational resources, resulting in improved performance and scalability. 
Spark Core Spark Core serves as the underlying execution engine for the Spark platform, forming the basis for all other Spark functionality. It offers powerful capabilities such as in-memory computing and the ability to reference datasets stored on external storage systems. One of the key components of Spark Core is the resilient distributed dataset (RDD), which serves as the primary programming abstraction in Spark. RDDs enable fault-tolerant and distributed data processing across a cluster. Spark Core provides a wide range of APIs for creating, manipulating, and transforming RDDs. These APIs are available in multiple programming languages, including Java, Python, Scala, and R. This flexibility allows developers to work with Spark Core using their preferred language and leverages the rich ecosystem of libraries and tools available in those languages. Spark Scheduler The Spark Scheduler is a vital component responsible for task scheduling and execution. It uses a Directed Acyclic Graph (DAG) and employs a task-oriented approach for scheduling tasks. The Scheduler analyzes the dependencies between different stages and tasks of a Spark application, represented by the DAG. It determines the optimal order in which tasks should be executed to achieve efficient computation and minimize data movement across the cluster. By understanding the dependencies and requirements of each task, the Scheduler assigns resources, such as CPU and memory, to the tasks. It considers factors like data locality, where possible, to reduce network overhead and improve performance. The task-oriented approach of the Spark Scheduler allows it to break down the application into smaller, manageable tasks and distribute them across the available resources. This enables parallel execution and efficient utilization of the cluster's computing power. Spark SQL Spark SQL is a widely used component of Apache Spark that facilitates the creation of applications for processing structured data. 
It adopts a data frame approach and allows efficient and flexible data manipulation. One of the key features of Spark SQL is its ability to interface with various data storage systems. It provides built-in support for reading and writing data from and to different datastores, including JSON, HDFS, JDBC, and Parquet. This makes it easy to work with structured data residing in different formats and storage systems. Additionally, Spark SQL extends its connectivity beyond the built-in datastores. It offers connectors that enable integration with other popular data stores such as MongoDB, Cassandra, and HBase. These connectors allow users to seamlessly interact with and process data stored in these systems using Spark SQL's powerful querying and processing capabilities. Spark MLlib In addition to its core functionalities, Apache Spark includes bundled libraries for machine learning and graph analysis techniques. One such library is MLlib, which provides a comprehensive framework for developing machine learning pipelines. MLlib simplifies the implementation of machine learning workflows by offering a wide range of tools and algorithms. It simplifies the implementation of feature extraction and transformations on structured datasets and offers a wide range of machine learning algorithms. MLlib empowers developers to build scalable and efficient machine learning workflows, enabling them to leverage the power of Spark for advanced analytics and data-driven applications. Distributed Storage Spark does not provide its own distributed file system. However, it can effectively utilize existing distributed file systems to store and access large datasets across multiple servers. One commonly used distributed file system with Spark is the Hadoop Distributed File System (HDFS). HDFS allows for the distribution of files across a cluster of machines, organizing data into consistent sets of blocks stored on each node. 
Spark can leverage HDFS to efficiently read and write data during its processing tasks. When Spark processes data, it typically copies the required data from the distributed file system into its memory. By doing so, Spark reduces the need for frequent interactions with the underlying file system, resulting in faster processing compared to traditional Hadoop MapReduce jobs. As the dataset size increases, additional servers with local disks can be added to the distributed file system, allowing for horizontal scalability and improved performance. Spark Jobs, Stages, and Tasks In a Spark application, the execution flow is organized into a hierarchical structure consisting of Jobs, Stages, and Tasks. A Job represents a high-level unit of work within a Spark application. It can be seen as a complete computation that needs to be performed, involving multiple Stages and transformations on the input data. A Stage is a logical division of tasks that share the same shuffle dependencies, meaning they need to exchange data with each other during execution. Stages are created when there is a shuffle operation, such as a groupBy or a join, that requires data to be redistributed across the cluster. Within each Stage, there are multiple Tasks. A Task represents the smallest unit of work in Spark, representing a single operation that can be executed on a partition of the data. Tasks are typically executed in parallel across multiple nodes in the cluster, with each node responsible for processing a subset of the data. Spark intelligently partitions the data and schedules Tasks across the cluster to maximize parallelism and optimize performance. It automatically determines the optimal number of Tasks and assigns them to available resources, considering factors such as data locality to minimize data shuffling between nodes. 
Spark handles the management and coordination of Tasks within each stage, ensuring that they are executed efficiently and leveraging the parallel processing capabilities of the cluster. Figure 2 Shuffle boundaries introduce a barrier where Stages/Tasks must wait for the previous stage to finish before they fetch map outputs. In the above diagram, Stage 0 and Stage 1 are executed in parallel, while Stage 2 and Stage 3 are executed sequentially. Hence, Stage 2 has to wait until both Stage 0 and Stage 1 are complete. This execution plan is evaluated by Spark. Spark Test Bed The Spark cluster was set up for performance benchmarking. Equipment Under Test Cluster nodes: 3CPU: AmpereOne® MSockets/node: 1Cores/socket: 192Threads/socket: 192CPU speed: 3200 MHzMemory channels: 12Memory/node: 768 GB (12 x 64GB DDR5-5600, 1DPC)Network card/node: 1 x Mellanox ConnectX-6OS storage/node: 1 x Samsung 960GB M.2Data storage/mode: 4 x Micron 7450 Gen 4 NVME, 3.84 TBKernel version: 6.8.0-85Operating system: Ubuntu 24.04.3YARN version: 3.3.6Spark version: 3.5.7JDK version: JDK 17 Spark Installation and Cluster Setup We set up the cluster with an HDFS file system. Hence, we installed Spark as a Hadoop user and configured the disks for HDFS. OS Install The majority of modern open-source and enterprise-supported Linux distributions offer full support for the AArch64 architecture. To install your chosen operating system, use the server Kernel-based Virtual Machine (KVM) console to map or attach the OS installation media, and then follow the standard installation procedure. Networking Setup Set up a public network on one of the available interfaces for client communication. This can be used to log in to any of the servers where client communication is needed. Set up a private network for communication between the cluster nodes. Storage Setup Choose a drive of your choice for the OS to install, clear any old partitions, reformat, and choose the disk to install the OS. 
Here, a Samsung 960 GB drive (M.2) was chosen for the OS installation on each server. Add additional high-speed NVMe drives to support the HDFS file system. Create Hadoop User Create a user named “hadoop” as part of the OS Install. This user was used for both Hadoop and Spark daemons on the test bed. Post-Install Steps Perform the following post-install steps on all the nodes on OS after the install. yum or apt update on the nodes.Install packages like dstat, net-tools, lm-sensors, linux-tools-generic, python, sysstat for your monitoring needs.Set up ssh trust between all the nodes.Update /etc/sudoers file for nopasswd for hadoop user.Update /etc/security/ limits.conf per Appendix.Update /etc/sysctl.conf per Appendix.Update scaling governor and hugepages per Appendix.If necessary, make changes to /etc/rc.d to keep the above changes permanent after every reboot.Set up NVMe disks as an XFS file system for HDFS. Create a single partition on each of the NVMe disks with fdisk or parted.Create a file system on each of the created partitions using mkfs.xfs -f /dev/nvme[0-n]n1p1.Create directories for mounting as mkdir -p /root/nvme[0-n]1p1. d. Update /etc/fstab with entries and mount the file system. The UUID of each partition in fstab can be extracted from the blkid command.Change ownership of these directories to the ‘hadoop’ user created earlier. Spark Install Download Hadoop 3.3.6 from the Apache website, Spark 3.5.7 from Apache Spark, and JDK11 and JDK17 for Arm64/Aarch64. We will use JDK11 for Hadoop and JDK17 for Spark installs. Extract the tarball files under the Hadoop user home directory. Update Spark and Hadoop configuration files in ~/hadoop/spark/conf and ~/hadoop/etc/hadoop/ and environment parameters in .bashrc per Appendix. Depending on the hardware specifications of cores, memory, and disk capacities, these may have to be altered. Update the Workers’ files to include the set of data nodes. 
Run the following commands: Shell hdfs namdenode -format scp -r ~/hadoop <datanodes>:~/hadoop ~/hadoop/sbin/start-all.sh ~/spark/sbin/start-all.sh This should start Spark Master, Worker, and other Hadoop daemons. Performance Tuning Spark is a complex system where many components interact across various layers. To achieve optimal performance, several factors must be considered, including BIOS and operating system settings, the network and disk infrastructure, and the specific software stack configuration. Experience with Hadoop and Spark significantly helps in fine-tuning these settings. Keep in mind that performance tuning is an ongoing, iterative process. The parameters in the Appendix are provided as starting reference points, gathered from just a few initial tuning cycles. Linux Occasionally, there can be conflicts between the subcomponents of a Linux system, such as the network and disk, which can impact overall performance. The objective is to optimize the system to achieve optimal disk and network throughput and identify and resolve any bottlenecks that may arise. Network To evaluate the network infrastructure, the iperf utility can be utilized to conduct stress tests. Adjusting the TX/RX ring buffers and the number of interrupt queues to align with the cores on the NUMA node where the NIC is located can help optimize performance. However, if the BIOS setting is already configured as chipset-ANC in a monolithic manner, these modifications may not be necessary. Disks Aligned partitions: Partitions should be aligned with the storage's physical block boundaries to maximize I/O efficiency. 
Utilities like parted can be used to create aligned partitions.I/O queue settings: Parameters such as the queue depth and nr_requests (number of requests) can be fine-tuned via the /sys/block//queue/ directory paths to control how many I/O operations the kernel schedules for a storage device.Filesystem mount options: Utilizing the noatime option in the /etc/fstab file is critical for Hadoop and HDFS, as it prevents unnecessary disk writes by disabling the recording of file access timestamps. The fio (flexible I/O tester) tool is highly effective for benchmarking and validating the performance of the disk subsystem after these changes are implemented. Spark Configuration Parameters There are several tunables on Spark. Only a few of them are addressed here. Tune your parameters by observing the resource usage from http://:4040. Using Data Frames Over RDD It is preferred to use Datasets or Data Frames over RDD, which include several optimizations to improve the performance of Spark workloads. Spark data frames can handle the data better by storing and managing it efficiently, as they maintain the structure of the data and column types. Using Serialized Data Formats In Spark jobs, a common scenario involves writing data to a file, which is then read by another job and written to another file for subsequent Spark processing. To optimize this data flow, it is recommended to write the intermediate data into a serialized file format such as Parquet. Using Parquet as the intermediate file format can yield improved performance compared to formats like CSV or JSON. Parquet is a columnar file format designed to accelerate query processing. It organizes data in a columnar manner, allowing for more efficient compression and encoding techniques. This columnar storage format enables faster data access and processing, particularly for operations that involve selecting specific columns or performing aggregations. 
By leveraging Parquet as the intermediate file format, Spark jobs can benefit from faster transformation operations. The columnar storage and optimized encoding techniques offered by Parquet, as well as its compatibility with processing frameworks like Hadoop, contribute to improved query performance and reduced data processing time. Reducing Shuffle Operations Shuffling is a fundamental Spark operation that reorders data among different executors and nodes. This is necessary for distributed tasks such as joins, grouping, and reductions. This data redistribution is expensive in terms of resources, as it requires considerable disk IO, data packaging, and movement across the network. This is crucial to how Spark works, but can severely reduce performance if not understood and tuned properly. The spark.sql.shuffle.partitions configuration parameter is key to managing shuffle behavior. Found in spark-defaults.conf, this setting dictates the number of partitions created during shuffle operations. The optimal value varies significantly, depending on data volume, available CPU cores, and the cluster's memory capacity. Setting too many partitions results in a large number of smaller output files, potentially increasing overhead. Conversely, too few partitions can lead to individual partitions becoming excessively large, risking out-of-memory errors on executors. Optimizing shuffle performance involves an iterative process, carefully adjusting spark.sql.shuffle.partitions to strike the right balance between partition count and size for your specific workload. Spark Executor Cores The number of cores allocated to each Spark Executor is an important consideration for optimal performance. In general, allocating around 5 cores per Executor tends to be a fair allocation when using the Hadoop Distributed File System (HDFS). When running Spark alongside Hadoop daemons, it is vital to reserve a portion of the available cores for these daemons. 
This ensures that the Hadoop infrastructure functions smoothly alongside Spark. The remaining cores can then be distributed among the Spark Executors for executing data processing tasks. By striking a balance between allocating cores to Hadoop daemons and Spark executors, you can ensure that both systems coexist effectively, enabling efficient and parallel processing of data. It is important to adjust the allocation based on the specific requirements of your cluster and workload to achieve optimal performance. Spark Executor Instances The number of Spark executor instances represents the total count of executor instances that can be spawned across all worker nodes for data processing. To calculate the total number of cores consumed by a Spark application, you can multiply the number of executors by the cores allocated per executor. The Spark UI provides information on the actual utilization of cores during task execution, indicating the extent to which the available cores are being utilized. It is recommended to maximize this utilization based on the availability of system resources. By effectively using the available cores, you can boost your Spark application's processing power and make its overall performance better. It is crucial to look at the resources in your cluster and change the amount of executor instances and cores given to each executor to match. This ensures resources are used effectively and gets the most computational power out of your Spark application. Executor and Driver Memory The memory configuration for Spark's Driver and Executors plays a critical role in determining the available memory for these components. It is important to tune these values based on the memory requirements of your Spark application and the memory availability within your YARN scheduler and NodeManager resource allocation parameters. 
The Executor's memory refers to the memory allocated for each executor, while the Driver's memory represents the memory allocated for the Spark Driver. These values should be adjusted carefully to ensure optimal performance and avoid memory-related issues. When tuning the memory configuration, it is essential to consider the overall memory availability in your environment and consider any memory constraints imposed by the YARN scheduler and NodeManager settings. By aligning the memory allocation with the available resources, you can optimize the memory utilization and prevent potential out-of-memory errors or performance degradation (swapping or disk spills). It is recommended to monitor the memory usage with Spark UI and adjust the configuration iteratively to achieve the best performance for your Spark workload. Benchmark Tools We used both Intel HiBench and TPC-DS benchmarking tools to measure the performance of the clusters. TeraSort We used the HiBench benchmarking tool to measure the TeraSort performance. HiBench is a popular benchmarking suite specifically designed for evaluating the performance of Big Data frameworks, such as Apache Hadoop and Apache Spark. It consists of a set of workload-specific benchmarks that simulate real-world Big Data processing scenarios. For additional information, you can refer to this link. By running HiBench on the cluster, you can assess and compare its performance in handling various Big Data workloads. The benchmark results can provide insights into factors such as data processing speed, scalability, and resource utilization for each cluster. Update hibench.conf file, like scale, profile, parallelism parameters, and a list of master and slave nodes.Run ~HiBench/bin/workloads/micro/terasort/prepare/prepare.sh.Run ~HiBench/bin/workloads/micro/terasort/spark/run.sh. After executing the above, a file named hibench.report will be generated within the report directory. 
Additionally, a file named bench.log will contain comprehensive information regarding the execution. The cluster was using a data set of 3 TB. We measured the total power consumed, CPU power, CPU utilization, and other parameters like disk and network utilization using Grafana and IPMI tools. Throughput from the HiBench run was calculated for TeraSort in the following scenarios: Spark running on a single AmpereOne® M node compared with a single node Ampere Altra – 128C (prior generation)Spark running on a single AmpereOne® M node compared with a 3-node AmpereOne® M cluster to measure the scalabilitySpark running on a 3-node AmpereOne® M cluster with 64k page size vs 4k page size TPC-DS TPC-DS is an industry-standard decision-support benchmark that models various aspects of a decision-support system, including data maintenance and query processing. Its purpose is to assist organizations in making informed decisions regarding their technology choices for decision support systems. TPC benchmarks aim to provide objective performance data that is relevant to industry users. For more in-depth information, you can refer to this tpc.org/tpcds/. Similar to TeraSort testing, we conducted TPC-DS benchmark on AmpereOne® M processors using both single-node and 3-node cluster configurations to compare performance with the prior generation Ampere Altra – 128C processors and to assess scalability. Additional performance evaluations on the AmpereOne® M processor compared to Linux kernels configured with 64KB and 4KB page sizes. This test also used a 3 TB dataset across the cluster. To gain deeper insights into system performance, we monitored key performance metrics including total system power consumption, CPU power, CPU utilization, and network utilization. Performance Tests on 3 Node Clusters Figures 3 and 4 We evaluated Spark TeraSort performance using the HiBench tool. 
The tests were run on one, two, and three nodes with AmpereOne® M processors, and the earlier values obtained on Ampere Altra – 128C were compared. From Figure 3, it is evident that there is a 30% benefit of AmpereOne® M over Ampere Altra – 128C while running Spark TeraSort. This increase in performance can be attributed to a newer microarchitecture design, an increase in core count (from 128 to 192), and the 12-channel DDR5 design on AmpereOne® M (versus 8-channel DDR4 on Ampere Altra – 128C). The output for the 3x nodes configuration, as shown in Figure 4, was found to be close to three times the output of a single node. 64k Page Size Figure 5 We observed a significant performance increase, approximately 40%, with 64k page size on Arm64 architecture while running Spark TeraSort benchmark. Most modern Linux distributions support largemem kernels natively. We have not observed any issues while running Spark TeraSort benchmarks on largemem kernels. Performance Per Watt on AmpereOne® M Figure 6 To evaluate the energy efficiency of the cluster, we computed the Performance-per-Watt (Perf/Watt) ratio. This metric is derived by dividing the cluster's measured throughput (megabytes per second) by its total power consumption (watts) during the benchmarking interval. In these assessments, we observed AmpereOne® M performing 35% better over its predecessor on the Spark TeraSort benchmark. OS Metrics While Running TeraSort Benchmark Figure 7 The above image is a snapshot from the Grafana dashboard captured while running the TeraSort benchmark. During the HiBench test, the systems achieved maximum CPU utilization up to 90% while running the TeraSort benchmark. We observed disk read/write activity of approximately 15 GB/s and network throughput of 20 GB/s. Since both observed I/O and network throughput were significantly below the cluster's scalable limits, the results confirm that the benchmark successfully pushed the CPU to its maximum capacity. 
We observed from the above graphs that AmpereOne® M not only drove disk and network I/O higher than Ampere Altra – 128C, but it also completed tasks considerably faster. Power Consumption Figure 8 The graph illustrates the power consumption of cluster nodes, the platform, and the CPU. The power was measured using the IPMI tool during the benchmark run. We observe that the AmpereOne® M clusters consumed more power than the Ampere Altra – 128C cluster. This is not surprising in that the latest generation AmpereOne® M systems have 50% more compute cores and support 50% more memory channels. Additionally, as shown earlier, this increased power usage also delivered notably higher TeraSort throughput as well as better power efficiency (perf/watt) on AmpereOne® M (Figure 6). TPC-DS Performance Figures 9 and 10 The TPC-DS benchmarking tool was used to execute the TPC-DS workload on the clusters. The performance evaluation was based on the total time required to execute all 99 SQL queries on the cluster. Queries on AmpereOne® M completed in 50% less time than those run on Ampere Altra – 128C. The TPC-DS scalability improvement observed between 1 and 3 nodes was less compared to the scalability seen with TeraSort. 64k Page Size Figure 11 TPC-DS queries got a 9% boost by moving to a 64k page size kernel. Conclusion This paper presents a reference architecture for deploying Spark on a multi-node cluster powered by AmpereOne® M processors and compares the results with an earlier deployment based on Ampere Altra 128C processors. The latest TeraSort benchmark results reinforce the conclusions of earlier studies, demonstrating that Arm64-based data center processors provide a compelling, high-performance alternative to traditional x86 systems for Big Data workloads. Extending this analysis, the evaluation of the 12‑channel DDR5 AmpereOne® M platform shows measurable improvements in both raw throughput and performance-per-watt compared to previous-generation processors. 
These gains confirm that the AmpereOne® M is a strong platform for data centers and enterprises that prioritize performance, efficiency, and sustainability. Big Data workloads demand substantial computational resources and persistent storage; by deploying these applications on Ampere processors, organizations benefit from both scale-up and scale-out architectures, enabling efficient growth while maintaining consistent throughput.

For more information, visit our website at https://www.amperecomputing.com. If you’re interested in additional workload performance briefs, tuning guides, and more, please visit our Solutions Center at https://amperecomputing.com/solutions

Appendix

/etc/sysctl.conf

```shell
kernel.pid_max = 4194303
fs.aio-max-nr = 1048576
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 25000
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.core.rmem_default = 33554431
net.core.wmem_default = 33554432
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 8192 33554432 2147483647
net.ipv4.tcp_wmem = 8192 33554432 2147483647
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv4.conf.all.arp_filter = 1
net.ipv4.tcp_retries2 = 5
net.ipv6.conf.lo.disable_ipv6 = 1
net.core.somaxconn = 65535
# memory cache settings
vm.swappiness = 1
vm.overcommit_memory = 0
vm.dirty_background_ratio = 2
```

/etc/security/limits.conf

```shell
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
```

Miscellaneous kernel changes

```shell
# Disable Transparent Huge Page defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# MTU 9000 for the 100Gb private interface, and CPU governor in performance mode
ifconfig enP6p1s0np0 mtu 9000 up
cpupower frequency-set --governor performance
```

.bashrc file

```shell
export JAVA_HOME=/home/hadoop/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export SPARK_HOME=/home/hadoop/spark
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
```

core-site.xml

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<server1>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/data1/hadoop,/data/data2/hadoop,/data/data3/hadoop,/data/data4/hadoop</value>
  </property>
  <property>
    <name>io.native.lib.available</name>
    <value>true</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.DefaultCodec,
           org.apache.hadoop.io.compress.BZip2Codec,
           com.hadoop.compression.lzo.LzoCodec,
           com.hadoop.compression.lzo.LzopCodec,
           org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>io.compression.codec.snappy.class</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
```

hdfs-site.xml

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>536870912</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/data1/hadoop,/data/data2/hadoop,/data/data3/hadoop,/data/data4/hadoop</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>
```

yarn-site.xml

```xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><server1></value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>186</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>737280</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>186</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
```

mapred-site.xml

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME,LD_LIBRARY_PATH=$LD_LIBRARY_PATH</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
           $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib-examples/*,
           $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/sources/*,
           $HADOOP_MAPRED_HOME/share/hadoop/common/*,
           $HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,
           $HADOOP_MAPRED_HOME/share/hadoop/yarn/*,
           $HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,
           $HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,
           $HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value><server1>:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value><server1>:19888</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx2g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value>
  </property>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>32</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>32</value>
  </property>
</configuration>
```

spark-defaults.conf

```shell
# Driver memory was raised to 64g for TPC-DS
spark.driver.memory 32g
spark.dynamicAllocation.enabled=false
spark.executor.cores 5
spark.executor.extraJavaOptions=-Djava.net.preferIPv4Stack=true -XX:+UseParallelGC -XX:ParallelGCThreads=32
spark.executor.instances 108
spark.executor.memory 18g
spark.executorEnv.MKL_NUM_THREADS=1
spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark.files.maxPartitionBytes 128m
spark.history.fs.logDirectory hdfs://<Master Server>:9000/logs
spark.history.fs.update.interval 10s
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
spark.io.compression.snappy.blockSize=512k
spark.kryoserializer.buffer 1024m
spark.master yarn
spark.master.ui.port 8080
spark.network.crypto.enabled=false
spark.shuffle.compress true
spark.shuffle.spill.compress true
spark.sql.shuffle.partitions 12000
spark.ui.port 8080
spark.worker.ui.port 8081
spark.yarn.archive hdfs://<Master Server>:9000/spark-libs.jar
spark.yarn.jars=/home/hadoop/spark/jars/*,/home/hadoop/spark/yarn/*
```

hibench.conf

```shell
# 3-node cluster
hibench.default.map.parallelism 12000
hibench.default.shuffle.parallelism 12000
hibench.scale.profile bigdata
# The bigdata size is configured as hibench.terasort.bigdata.datasize 30000000000
# in ~/HiBench/conf/workloads/micro/terasort.conf
```

Check out the full Ampere article collection here.
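As a quick sanity check on the executor sizing in spark-defaults.conf, the snippet below verifies that the configured executor fleet fits within the aggregate YARN resources from yarn-site.xml. This is a minimal sketch: the 3-node count is an assumption taken from the "3-node cluster" note in hibench.conf, and it ignores `spark.executor.memoryOverhead`, so treat the memory figure as a lower bound.

```shell
#!/usr/bin/env bash
# Assumed cluster size (per the hibench.conf comment); adjust for your environment.
nodes=3
node_vcores=186              # yarn.nodemanager.resource.cpu-vcores
node_mem_mb=737280           # yarn.nodemanager.resource.memory-mb
exec_cores=5                 # spark.executor.cores
exec_mem_mb=$((18 * 1024))   # spark.executor.memory (18g), excluding overhead
instances=108                # spark.executor.instances

used_cores=$((instances * exec_cores))
used_mem_mb=$((instances * exec_mem_mb))
total_cores=$((nodes * node_vcores))
total_mem_mb=$((nodes * node_mem_mb))

echo "vcores: ${used_cores} of ${total_cores}"
echo "memory: ${used_mem_mb} of ${total_mem_mb} MB (before executor overhead)"
```

The 540 requested vcores leave 18 spare across the cluster for the ApplicationMaster, and the memory headroom absorbs the default executor overhead that YARN adds on top of each 18g heap.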