Coding

Also known as the build stage of the SDLC, coding focuses on the writing and programming of a system. The Zones in this category take a hands-on approach to equip developers with knowledge about frameworks, tools, and languages that they can tailor to their own build needs.

Functions of Coding

Frameworks

A framework is a collection of code that is leveraged in the development process by providing ready-made components. Through the use of frameworks, architectural patterns and structures are created, which help speed up the development process. This Zone contains helpful resources for developers to learn about and further explore popular frameworks such as the Spring framework, Drupal, Angular, Eclipse, and more.

Java

Java is an object-oriented programming language that allows engineers to produce software for multiple platforms. Our resources in this Zone are designed to help engineers with Java program development, Java SDKs, compilers, interpreters, documentation generators, and other tools used to produce a complete application.

JavaScript

JavaScript (JS) is an object-oriented programming language that allows engineers to produce and implement complex features within web browsers. JavaScript is popular because of its versatility and is preferred as the primary choice unless a specific function is needed. In this Zone, we provide resources that cover popular JS frameworks, server applications, supported data types, and other useful topics for a front-end engineer.

Languages

Programming languages allow us to communicate with computers, and they operate like sets of instructions. There are numerous types of languages, including procedural, functional, object-oriented, and more. Whether you’re looking to learn a new language or trying to find some tips or tricks, the resources in the Languages Zone will give you all the information you need and more.

Tools

Development and programming tools are used to build frameworks, and they can be used for creating, debugging, and maintaining programs — and much more. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools and can help ensure engineers are writing clean code.

Latest Premium Content
Trend Report: Developer Experience
Trend Report: Low-Code Development
Refcard #216: Java Caching Essentials
Refcard #400: Java Application Containerization and Deployment

DZone's Featured Coding Resources

Multithreading in Modern Java: Advanced Benefits and Best Practices

By Muhammed Harris Kodavath
Multithreading has always been one of the core strengths of Java over the years. From the early days of the JVM, Java was designed with built-in support for concurrent programming. But for many years, writing scalable multithreaded applications required careful tuning, thread pool management, and constant attention to synchronization. In the latest Java versions, the concurrency model has evolved significantly. Modern Java introduces improvements such as virtual threads, better executors, improved fork-join performance, and more structured concurrency approaches. These features allow developers to build highly concurrent applications with simpler code and fewer scalability limitations. In this article, I summarize some of the best practices for multithreading so that developers can use it as a reference guide.

Why Multithreading Matters

Modern applications rarely operate in isolation. Web servers handle thousands of requests, microservices communicate with multiple downstream systems, and data pipelines process large streams of information simultaneously. Multithreading enables applications to handle many tasks concurrently, improve throughput and responsiveness, utilize multi-core CPUs efficiently, and avoid blocking entire systems during slow operations like I/O. However, poorly implemented multithreading can lead to race conditions, deadlocks, and unpredictable performance. That is why understanding modern concurrency tools is more essential than ever.

1. Traditional Threads in Java

Before the latest concurrency improvements, Java developers typically relied on platform threads created through the Thread class or managed using executors. A simple example looks like this:

Java
class Worker implements Runnable {
    @Override
    public void run() {
        System.out.println("Running task in thread : " + Thread.currentThread().getName());
    }
}

public class ThreadExample {
    public static void main(String[] args) {
        Thread t = new Thread(new Worker());
        t.start();
    }
}

This approach works, but creating large numbers of platform threads can consume significant system resources. Each thread requires stack memory and OS-level scheduling. For high-scale systems, managing thousands of threads becomes expensive.

2. Thread Pools with ExecutorService

To improve efficiency, Java introduced ExecutorService, which allows tasks to be executed using a pool of reusable threads.

Java
ExecutorService executor = Executors.newFixedThreadPool(5);

for (int i = 0; i < 10; i++) {
    int taskId = i;
    executor.submit(() -> {
        System.out.println("Processing task : " + taskId + " in " + Thread.currentThread().getName());
    });
}

executor.shutdown();

Benefits of using thread pools include reduced thread creation overhead, controlled concurrency, and better resource management. However, even with thread pools, large systems can still struggle when workloads involve blocking operations like database calls or external APIs.

3. Virtual Threads: A Major Leap Forward

One of the most important improvements in modern Java concurrency is virtual threads, introduced as part of Project Loom. Virtual threads are lightweight threads managed by the JVM rather than the operating system. Instead of mapping one thread per OS thread, the JVM can schedule millions of virtual threads efficiently. This dramatically improves scalability for I/O-heavy applications.
Example using virtual threads:

Java
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 1000; i++) {
        int task = i;
        executor.submit(() -> {
            try {
                Thread.sleep(100); // simulate a blocking call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("Task " + task + " executed by " + Thread.currentThread());
        });
    }
}

Key benefits of virtual threads include the ability to handle large numbers of concurrent tasks, simpler asynchronous programming, reduced complexity compared to reactive frameworks, and improved resource efficiency. For applications that previously required complex asynchronous pipelines, virtual threads often allow returning to simpler blocking code while maintaining scalability.

4. Parallel Processing with the ForkJoin Framework

For CPU-intensive workloads, Java provides the ForkJoin framework, which divides tasks into smaller subtasks and executes them in parallel. Example:

Java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Integer> {
    private final int[] array;
    private final int start;
    private final int end;

    SumTask(int[] array, int start, int end) {
        this.array = array;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Integer compute() {
        if (end - start <= 10) {
            int sum = 0;
            for (int i = start; i < end; i++) sum += array[i];
            return sum;
        }
        int mid = (start + end) / 2;
        SumTask left = new SumTask(array, start, mid);
        SumTask right = new SumTask(array, mid, end);
        left.fork();
        return right.compute() + left.join();
    }
}

public class ForkJoinExample {
    public static void main(String[] args) {
        int[] numbers = new int[100];
        Arrays.fill(numbers, 1);
        ForkJoinPool pool = new ForkJoinPool();
        int result = pool.invoke(new SumTask(numbers, 0, numbers.length));
        System.out.println("Sum = " + result);
    }
}

The ForkJoin framework works particularly well for CPU-bound algorithms like sorting, numerical computations, and large data transformations.

5. Avoiding Common Multithreading Pitfalls

While concurrency improves performance, it introduces complexity. Here are some common pitfalls developers encounter:

Race Conditions

Race conditions occur when multiple threads modify shared data without synchronization. Example issue:

Java
counter++;

This operation is not atomic. Use AtomicInteger instead:

Java
AtomicInteger counter = new AtomicInteger();
counter.incrementAndGet();

Deadlocks

Deadlocks occur when two threads wait indefinitely for each other to release resources. Best practice is to avoid nested locks and use consistent lock ordering, as shown in the sketch below.
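To make the lock-ordering advice concrete, here is a minimal illustrative sketch (not from the original article): two operations that each need the same pair of locks, and because both always acquire lockA before lockB, they can never deadlock each other.

Java
import java.util.concurrent.locks.ReentrantLock;

public class ConsistentLockOrdering {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    // Both methods acquire lockA first, then lockB, so two threads can never
    // hold one lock each while waiting for the other.
    public void transfer() {
        lockA.lock();
        try {
            lockB.lock();
            try {
                // work that needs both resources
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }

    public void audit() {
        lockA.lock(); // same order as transfer(), never lockB first
        try {
            lockB.lock();
            try {
                // read-only work that needs both resources
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }
}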
Excessive Synchronization

Overusing synchronized blocks can severely reduce throughput. Prefer concurrent utilities such as ConcurrentHashMap, AtomicInteger, and ReadWriteLock. Example:

Java
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
map.put("users", 100);

Best Practices for Modern Java Multithreading

Based on real-world production systems, the following practices help build reliable concurrent applications.

Prefer virtual threads for I/O workloads: For services handling large numbers of requests, virtual threads simplify concurrency without complex async frameworks.
Use executors instead of manually managing threads: Thread lifecycle management becomes simpler and safer when handled through executor frameworks.
Avoid shared mutable state: Immutable objects reduce the need for synchronization and simplify concurrent logic.
Use concurrent collections: Classes like ConcurrentHashMap and BlockingQueue are optimized for multi-threaded access.
Monitor thread behavior: Use tools such as Java Flight Recorder (JFR), thread dumps, and JVM monitoring tools. These help identify deadlocks, blocked threads, and thread contention issues.

Final Thoughts

Features like virtual threads, better executor frameworks, and advanced concurrency utilities allow developers to build scalable systems with simpler and more maintainable code. Modern Java gives developers powerful concurrency capabilities. When applied with the right best practices, these tools enable applications to handle massive workloads efficiently while keeping the codebase readable and maintainable.
How to Transfer Domains via API: Automate Domain Migrations Programmatically

By Jakkie Koekemoer
Ten minutes per domain times 50 domains is roughly 8 hours of manual work, and that assumes nothing goes wrong. Stale auth codes, missed confirmation emails, forgotten unlock steps, and zero visibility into in-flight transfer status mean something frequently goes wrong. For platform engineers managing domain portfolios, the manual transfer workflow isn't just slow. It's a liability with no audit trail and no retry logic.

Every step of the transfer lifecycle maps directly to an API call. Scripting the workflow makes it idempotent, auditable, and repeatable. This tutorial walks through a complete implementation using the name.com API, from HTTP Basic Auth setup through bulk migration with status polling and error handling. You'll leave with working curl commands and a Python skeleton you can ship today.

Why Manual Domain Transfers Break at Scale

The 5-to-7-day ICANN transfer window is fixed. You can't script around it. But the human steps surrounding it are entirely the problem. A typical manual transfer cycle looks like this: log into the losing registrar's UI to disable WHOIS privacy, unlock the domain, generate an auth code, copy it somewhere safe, initiate the transfer at the gaining registrar, wait for a confirmation email, click through an approval link, then check back daily until the transfer completes or times out.

Each domain takes 8–12 minutes when everything works. At 50 domains, you're looking at 8+ hours spread across multiple sessions, with state tracked in a spreadsheet that has no retry logic, no idempotency, and no audit trail. The failure modes compound: auth codes expire (typically within 7 days, depending on TLD), unlock steps get skipped, confirmation emails land in spam, and you have no programmatic way to detect a stalled transfer until it has already failed.

The fix isn't faster clicking. Every one of those steps (auth code retrieval, transfer initiation, status polling, cancellation) is available through a registrar API. Script them once, run them forever.

The Domain Transfer Lifecycle: What the Script Needs to Drive

Your script only needs to interact with the REST layer of the name.com API. The underlying EPP (Extensible Provisioning Protocol) standard is what registrars use to talk to each other. You don't need to understand it to automate the workflow.

Before initiating any transfer, validate four preconditions:

The domain is unlocked (client transfer prohibited flag cleared at the losing registrar).
WHOIS privacy is disabled on TLDs that require it for transfers (varies by registry).
The domain is more than 60 days old since registration or last transfer (ICANN policy).
Zone records are backed up, since DNS configuration doesn't travel with the domain.

Once those pass, the transfer moves through a deterministic state machine. The name.com API can play either role: as the losing registrar (where you retrieve auth codes for outbound transfers) or as the gaining registrar (where you initiate transfers in). This tutorial covers both.

Setting Up API Authentication for Domain Transfers

The name.com API uses HTTP Basic Auth over HTTPS. Every request requires an Authorization header containing your username and API token, Base64-encoded as username:api_token. Generate your token at https://www.name.com/account/settings/api. It's self-serve: no approval queue, no sales call. The token is available immediately.
The same API powers domain operations for Vercel, Replit, and Netlify at production scale, so the endpoints in this tutorial are the same ones running in real infrastructure. The base URL for all endpoints in this tutorial is https://api.name.com/core/v1. You can also use their sandbox environment at https://api.dev.name.com/core/v1.

With curl, use the -u flag:

Shell
curl -u "yourusername:your_api_token" \
  https://api.name.com/core/v1/domains

The -u flag Base64-encodes the credentials and sets the Authorization header automatically. In Python, set up a requests.Session once and reuse it across all calls. This avoids re-encoding credentials on every request and gives you a single place to update auth when your token rotates:

Python
import requests

session = requests.Session()
session.auth = ("yourusername", "your_api_token")
BASE_URL = "https://api.name.com/core/v1"

Retrieving the Domain Auth Code Programmatically

You'll need this endpoint in two scenarios: scripting outbound transfers for domains you own at name.com, or building platform tooling that surfaces auth codes to your users programmatically. One detail worth getting right: auth codes are time-sensitive. Retrieve them immediately before calling the transfer initiation endpoint, not as a pre-batch step hours earlier. For most TLDs, the codes are valid for up to 7 days, but if you pre-fetch codes for a 50-domain batch and hit unexpected errors midway through, the first codes in the set may expire before you reach them.

Endpoint: GET /domains/{domainName}/auth-code (docs)

Shell
curl -u "yourusername:your_api_token" \
  https://api.name.com/core/v1/domains/example.com:getAuthCode

Response:

JSON
{
  "authCode": "Xk9#mP2qL8wR"
}

In Python, extract the auth code and store it in a dict keyed by domain name. You'll pass this directly into the transfer initiation call:

Python
def get_auth_code(session, domain):
    resp = session.get(f"{BASE_URL}/domains/{domain}:getAuthCode")
    resp.raise_for_status()
    return resp.json()["authCode"]

auth_codes = {}
domains_to_transfer = []
domains_to_transfer.append('example.com')  # or add more domains if you want to run with more

for domain in domains_to_transfer:
    auth_codes[domain] = get_auth_code(session, domain)

print(auth_codes)

With that in place, you can retrieve auth codes for any domains you own.

Initiating the Domain Transfer via API

Endpoint: POST /transfers (docs)

The request body requires two fields: domainName and authCode. Optionally, you can set whether you want privacy enabled by default, and the purchase/renewal price for the domain in question.

Shell
curl -u "yourusername:your_api_token" --request POST \
  --url https://api.name.com/core/v1/transfers \
  --header 'Content-Type: application/json' \
  --data '
{
  "authCode": "Xk9#mP2qL8wR",
  "domainName": "example.com",
  "privacyEnabled": true
}
'

A successful response returns HTTP 200 with the transfer status:

JSON
{
  "order": 12345,
  "totalPaid": 12.99,
  "transfer": {
    "domainName": "example.com",
    "status": "pending_transfer",
    "email": "[email protected]"
  }
}

Log that status field. Three error responses to handle explicitly:

409 Conflict: the domain is already in a transfer, and you cannot initiate another one.
422 Unprocessable Entity: domain pricing is unavailable for that TLD, which typically means the TLD isn't supported for transfer-in at this time.
400 Bad Request: malformed request body. Check your command for missing required fields.

Here's a Python function that wraps the POST.
It returns the response data on success and raises with the full error body on failure, which becomes the core of the bulk migration loop in the next section:

Python
def initiate_transfer(session, domain, auth_code):
    payload = {
        "domainName": domain,
        "authCode": auth_code
    }
    resp = session.post(f"{BASE_URL}/transfers", json=payload)
    if resp.status_code == 200:
        return resp.json()
    raise RuntimeError(f"Transfer failed for {domain}: {resp.status_code} {resp.text}")

Polling Transfer Status and Handling State Changes

Two endpoints handle status checks. For a single transfer by domain name: GET /transfers/{domainname} (docs). For all in-flight transfers at once, useful for a status dashboard: GET /transfers (docs).

The total transfer window is up to 7 days, so your polling loop needs to be patient. Use exponential backoff starting at 5-minute intervals, doubling each pass, capped at 60 minutes:

Python
import time

def poll_transfer(session, domain, max_hours=168):  # 7 days
    interval = 300       # start at 5 minutes
    max_interval = 3600  # cap at 60 minutes
    elapsed = 0
    while elapsed < max_hours * 3600:
        resp = session.get(f"{BASE_URL}/transfers/{domain}")
        data = resp.json()
        status = data.get("status")
        if status == "complete":
            print(f"{domain}: transfer complete")
            return status
        elif status in ("cancelled", "failed"):
            print(f"{domain}: terminal state {status} - {data}")
            return status
        elif status == "pendingApproval":
            # Flag for manual review; expediting requires registrar dashboard action
            print(f"{domain}: pending approval - check registrar dashboard")
        elif status is None:
            # The domain is not listed for transfer
            print(f"{domain}: no transfers listed for this domain")
        # pendingTransfer is the normal in-progress state; keep polling
        time.sleep(interval)
        elapsed += interval
        interval = min(interval * 2, max_interval)
    raise TimeoutError(f"Transfer polling timed out for {domain}")

poll_transfer(session, 'example.com')

When you hit pendingApproval, some registrars allow expediting through their dashboard. A failed status deserves immediate investigation: the most common cause is a stale or incorrect auth code.

Cancellation is available via POST /transfers/{domainName}:cancel. Use it when the auth code is wrong and you need to restart with a fresh one. Cancellation is only possible within the first 5 days of a transfer.

Shell
curl -u "yourusername:your_api_token" --request POST \
  --url https://api.name.com/core/v1/transfers/example.com:cancel

For production systems, subscribe to transfer status webhooks rather than running a long-lived polling loop. The name.com API supports webhooks that fire on state changes, which lets you react immediately without keeping a process alive for days.
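What the receiving end of such a webhook looks like is not shown in the original tutorial, so here is a minimal, framework-free sketch of a receiver using the JDK's built-in HTTP server. The path, port, and the idea of simply logging the payload are illustrative assumptions; the actual webhook payload schema comes from name.com's webhook documentation rather than anything shown above.

Java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal webhook receiver sketch: accept the POSTed status-change event,
// log the raw payload, and acknowledge with HTTP 200 so the sender stops retrying.
public class TransferWebhookReceiver {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/webhooks/transfers", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            String payload = new String(body, StandardCharsets.UTF_8);
            // In a real system, parse the payload and update the transfer log or alerting here.
            System.out.println("Transfer status change received: " + payload);
            byte[] response = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, response.length);
            exchange.getResponseBody().write(response);
            exchange.close();
        });
        server.start();
        System.out.println("Listening for transfer webhooks on :8080");
    }
}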
Scripting a Bulk Domain Migration Workflow

The full bulk migration script reads from a CSV with two columns: domain and auth_code. Leave auth_code blank for domains already registered at name.com; the script retrieves it programmatically.

Python
import csv
import os
import time
import requests
from datetime import datetime

session = requests.Session()
session.auth = ("yourusername", "your_api_token")
BASE_URL = "https://api.name.com/core/v1"

def load_domains(input_csv):
    with open(input_csv) as f:
        return list(csv.DictReader(f))

def load_existing_transfers(output_csv):
    """Return domains that already have a transfer ID logged."""
    existing = {}
    try:
        with open(output_csv) as f:
            for row in csv.DictReader(f):
                if row.get("transfer_id"):
                    existing[row["domain"]] = row["transfer_id"]
    except FileNotFoundError:
        pass
    return existing

def log_result(output_csv, domain, transfer_id, status):
    # Write a header row the first time so DictReader can find the columns on re-runs
    new_file = not os.path.exists(output_csv)
    with open(output_csv, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["domain", "transfer_id", "initiated_at", "status"])
        writer.writerow([domain, transfer_id, datetime.utcnow().isoformat(), status])

def run_bulk_transfer(input_csv, output_csv):
    domains = load_domains(input_csv)
    existing = load_existing_transfers(output_csv)

    for row in domains:
        domain = row["domain"]
        auth_code = row.get("auth_code", "").strip()

        # Idempotency: skip initiation if already logged
        if domain in existing:
            print(f"{domain}: already initiated, transfer ID {existing[domain]}")
            continue

        # Retrieve auth code if not in the CSV
        if not auth_code:
            try:
                resp = session.get(f"{BASE_URL}/domains/{domain}/auth-code")
                resp.raise_for_status()
                auth_code = resp.json()["authCode"]
            except Exception as e:
                print(f"{domain}: auth code retrieval failed - {e}")
                log_result(output_csv, domain, "", "auth_code_failed")
                continue

        # Initiate transfer
        try:
            payload = {
                "domainName": domain,
                "authCode": auth_code
            }
            resp = session.post(f"{BASE_URL}/transfers", json=payload)
            resp.raise_for_status()
            data = resp.json()
            log_result(output_csv, domain, domain, data.get("status"))
            print(f"{domain}: initiated, status={data.get('status')}")
        except Exception as e:
            print(f"{domain}: initiation failed - {e}")
            log_result(output_csv, domain, "", "initiation_failed")
            continue

        time.sleep(2)  # throttle between requests

if __name__ == "__main__":
    run_bulk_transfer("domains.csv", "transfer_log.csv")

Three implementation decisions in this script are worth understanding. The idempotency check at the top of the loop prevents duplicate transfers on re-runs. If the script fails at domain 23 of 50, re-running it skips the first 22 already logged in the output CSV and picks up where it left off. Sequential processing with a 2-second sleep is conservative by design. If you parallelize this script, handle 429 Too Many Requests with exponential backoff. The name.com API doesn't publish a specific rate limit ceiling, so sequential processing is the safer default for batch operations. The audit log captures domain, transfer_id, initiated_at, and status per row, with updates appended on each polling pass. Run the polling loop as a separate script pass against the same CSV rather than blocking the initiation loop for up to 7 days per domain.

Run Your First API Domain Transfer in 15 Minutes

Step 1: Generate your API token at https://www.name.com/account/settings/api. Self-serve, takes under 2 minutes.

Step 2: Run the auth code retrieval command against a domain you own at name.com:

Shell
curl -u "yourusername:your_api_token" \
  https://api.name.com/core/v1/domains/yourdomain.com:getAuthCode

Confirm you get a JSON response with an authCode field.

Step 3: Substitute the auth code into the POST /transfers curl command from the section above and fire the request. Note the status field in the response.
Step 4: Poll the status endpoint with that domain name:

Shell
curl -u "yourusername:your_api_token" \
  https://api.name.com/core/v1/transfers/yourdomain.com

Confirm the status returns pendingTransfer. From there, drop the curl commands into the Python skeleton, and you have a working bulk migration script.

For platform teams integrating domain operations into a product, the same name.com API spec covers domain search, registration, DNS management, and renewals. The transfer endpoints you've just used are part of a broader, consistent interface you don't need to re-learn for each operation.

If you've run bulk domain migrations before, whether through a registrar API or a more manual process, what failure modes actually bit you? Auth code timing, rate limits, something else entirely? Drop it in the comments.
Training a Neural Network Model With Java and TensorFlow
By George Pod
Advanced Auto Loader Patterns for Large-Scale JSON and Semi-Structured Data
By Seshendranath Balla Venkata
Optimizing Java Back-End Performance Profiling and Best Practices
By Ramya vani Rayala
When Kubernetes Breaks Session Consistency: Using Cosmos DB and Redis Together

Distributed systems rarely struggle because of storage engines. They struggle because of coordination. We were operating a high-throughput microservice on Kubernetes backed by Azure Cosmos DB. The service required durability, global availability, and predictable read behavior under horizontal scaling. Cosmos DB was configured with SESSION consistency because it offers a practical balance between correctness and performance. It guarantees read-your-own-writes without incurring the latency and throughput penalties associated with strong consistency.

Architecturally, everything appeared sound. Yet under real production traffic, an intermittent pattern began emerging. Occasionally, a read request issued immediately after a write would return slightly stale data. There was no corruption and no failure — just subtle inconsistencies that were difficult to reproduce but impossible to ignore. The issue was not rooted in Cosmos DB alone, nor in Kubernetes alone. It lived in the interaction between the two.

The Assumption Behind Session Consistency

Cosmos DB's session consistency model relies on a session token. Every write operation returns a token representing the latest version of the document within that session. If that token is passed back during a subsequent read, Cosmos guarantees that the client will see its own write. In a single-instance application, this is straightforward. The same process that performs the write retains the session token in memory and uses it for subsequent reads.

Kubernetes changes that assumption entirely. In a horizontally scaled deployment, a write request may land on Pod A. Cosmos returns a session token to Pod A. The next read request for the same document may land on Pod B. Pod B has no awareness of Pod A's session token. Without that token, Cosmos may return a slightly older replica version consistent with session guarantees — but not necessarily reflecting the most recent write handled by another pod.

The database is honoring its consistency contract. The application simply is not sharing the required metadata across instances. This is a classic distributed systems nuance: guarantees often depend on contextual state that stateless infrastructure does not preserve.

Why Strong Consistency Was Not the Right Fix

Switching Cosmos DB to strong consistency would have eliminated the problem entirely. However, that solution carried significant tradeoffs. Strong consistency increases latency because replicas must coordinate synchronously. It reduces overall throughput and increases RU consumption. It also introduces constraints in multi-region deployments where low-latency global reads are required. The problem was not that the database guarantees were insufficient. The problem was that session context was not shared across pods. Rather than strengthening storage semantics, we focused on improving coordination.

Introducing Redis as a Coordination Layer

The solution was conceptually simple. After every write to Cosmos DB, we extracted the returned session token and stored it in Redis, keyed by the document ID. Before every read from Cosmos, we retrieved the session token from Redis and supplied it with the read request. Redis became a lightweight session token broker between Kubernetes pods.

It is important to emphasize what Redis was not used for. It did not store business data. It did not act as a second database. It did not cache full documents. It stored only the small piece of metadata required to preserve cross-pod session guarantees.
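The token flow is small enough to sketch. A minimal illustration, assuming the Azure Cosmos DB Java SDK v4 and the Jedis client; the OrderDocument type, the session-token key prefix, and using the document ID as the partition key are illustrative assumptions, not details from the production system described here.

Java
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;
import redis.clients.jedis.Jedis;

// Sketch of the session-token broker pattern: Cosmos DB stays the system of
// record; Redis only carries the session token so any pod can read another pod's write.
public class SessionTokenBroker {
    private final CosmosContainer container;
    private final Jedis redis;

    public SessionTokenBroker(CosmosContainer container, Jedis redis) {
        this.container = container;
        this.redis = redis;
    }

    public void write(OrderDocument doc) {
        // 1. Persist to Cosmos first; durability never depends on Redis.
        CosmosItemResponse<OrderDocument> response =
                container.createItem(doc, new PartitionKey(doc.getId()), new CosmosItemRequestOptions());
        try {
            // 2. Publish the returned session token, keyed by document ID (a TTL would be added in practice).
            redis.set("session-token:" + doc.getId(), response.getSessionToken());
        } catch (RuntimeException redisDown) {
            // Token publication is best-effort; a failed Redis write is logged, not fatal.
        }
    }

    public OrderDocument read(String id) {
        CosmosItemRequestOptions options = new CosmosItemRequestOptions();
        String token = null;
        try {
            // 3. If another pod wrote this document, its session token is in Redis;
            //    passing it back preserves read-your-own-writes across pods.
            token = redis.get("session-token:" + id);
        } catch (RuntimeException redisDown) {
            // Redis unavailable: fall back to a normal session-consistent read rather than failing.
        }
        if (token != null) {
            options.setSessionToken(token);
        }
        return container.readItem(id, new PartitionKey(id), options, OrderDocument.class).getItem();
    }

    // Illustrative document type; in the real service this is the business entity.
    public static class OrderDocument {
        public String id;      // Cosmos DB document id
        public String payload; // stand-in for the real fields
        public String getId() { return id; }
    }
}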
Cosmos remained the durable system of record. Redis handled coordination. By limiting Redis to this narrow responsibility, the architecture avoided unnecessary complexity and eliminated the risk of data divergence between systems.

Designing for Failure, Not Just Success

Adding Redis introduced a new dependency, which required careful design consideration. We made a deliberate decision that Redis would never become mandatory for availability. In the read path, the service first attempts to retrieve the session token from Redis. If a token exists, it is passed to Cosmos, ensuring read-your-own-writes. If Redis is unavailable or the token is missing, the system proceeds with a standard Cosmos read without the token.

The result is a graceful degradation model. Redis enhances consistency but does not control system availability. If Redis fails completely, the application continues operating with normal session semantics, potentially returning slightly stale reads but never failing outright.

On the write path, the order of operations is equally important. The document is first persisted to Cosmos. Only after a successful write is the session token stored in Redis. This ensures that durability is never dependent on coordination infrastructure. To further strengthen resilience, Redis was deployed in a dual configuration consisting of a primary and fallback instance. Writes are performed against both, with the fallback update executed asynchronously to avoid increasing request latency. If Redis writes fail, the errors are logged, but the core transaction succeeds. This ordering ensures the system bends under failure rather than breaks.

Cost and Throughput Optimization

While addressing consistency, we also examined write efficiency. In high-throughput systems, replacing entire documents for minor state changes can significantly increase RU consumption. Instead of issuing full document replacements, we adopted Cosmos PATCH operations for partial updates. Only modified attributes were updated, reducing request charge and improving overall efficiency. This adjustment produced measurable cost savings and reinforced a broader lesson: architectural improvements often reveal opportunities for operational optimization.

Evaluating Alternative Approaches

Before settling on Redis-backed session sharing, several alternatives were considered. Sticky sessions at the load balancer layer could have preserved session affinity, ensuring that reads followed writes to the same pod. However, this approach reduces horizontal scaling flexibility and can create uneven traffic distribution. In-memory distributed caching strategies were also evaluated but introduce replication complexity and failure coordination challenges. Enabling strong consistency at the database layer, while technically simpler, imposed unacceptable performance and cost penalties.

Redis provided the right balance. It is fast, operationally mature, and purpose-built for ephemeral coordination data. Most importantly, it allowed us to solve a coordination problem without modifying database guarantees.

Extending Redis Carefully

Once Redis became part of the architecture, it was tempting to broaden its use. Discipline was critical. Redis was later used to cache selected reference metadata retrieved from downstream services. Instead of invoking dependent systems on every request, a scheduled refresher populated Redis entries with defined TTLs. This reduced latency and protected downstream systems during peak load.
Redis was also used to maintain shared operational counters across pods. In a horizontally scaled environment, in-memory metrics fragment across instances. Storing certain counters in Redis provided consistent observability across all running pods. In both cases, Redis remained coordination infrastructure rather than primary storage.

The Architectural Pattern

Cosmos DB and Redis are often described simply as database and cache. In this design, Redis is not a cache of business objects. It is a coordination layer that enables predictable behavior in a stateless, horizontally scaled environment. By separating durable state from coordination state, the system maintains scalability, controls cost, and preserves session guarantees without relying on strong consistency or sticky sessions.

Kubernetes encourages statelessness. Databases provide consistency guarantees within defined boundaries. Bridging the two requires explicit coordination. Distributed systems are rarely about choosing the strongest guarantee available. They are about understanding the guarantees you already have and ensuring they are applied correctly across infrastructure boundaries. Sometimes the most effective solution is not increasing consistency but ensuring that the consistency you already depend on is shared intelligently.

[Architecture diagram omitted.]

By Vikas Mittal
How to Reliably Implement Post-Commit Actions in Spring

Sometimes, in modern backend systems, you need to perform one or more actions after database inserts or updates. You may need to publish to a message broker, send an email, or trigger a workflow. If you perform those actions inside a database transaction that later rolls back, it is too late: they have already started, cannot be cancelled, and may create inconsistencies. In Spring Boot, you can solve this problem with @TransactionalEventListener, which marks a listener that is triggered by the ApplicationEventPublisher's publish method. In this article, we will explain how this works.

The Problem

In this article, we are discussing the side effects associated with transactional code. By side effects, we mean deliberate actions towards external services and resources. As an example, we can consider an application component that creates an order, saves it in the database, and then sends an email:

Java
@Transactional
public void createOrder(Order order) {
    orderRepository.save(order);
    emailService.sendOrderConfirmation(order.getId());
}

In normal conditions, this works as expected: the order is persisted in the database, and the email is sent to the recipient. However, suppose the transaction fails for some reason. In this case, you still send the email, but the order does not exist in the database. In the next section, you will see an approach to solving this problem by using the @TransactionalEventListener annotation.

The Solution: @TransactionalEventListener

Spring provides the ApplicationEventPublisher interface to decouple logic through application events using an implementation of the Observer pattern. With an ApplicationEventPublisher object, you can send an event in one part of the application:

Java
private final ApplicationEventPublisher publisher;

publisher.publishEvent(new OrderCreatedEvent(order.getId()));

And catch it with a listener in another part:

Java
@EventListener
public void handle(OrderCreatedEvent event) {
    emailService.sendOrderConfirmation(event.getOrderId());
}

This improves things from the architectural standpoint, but @EventListener executes immediately, even if the transaction rolls back. This way, your system will still suffer inconsistencies. If you replace @EventListener with the @TransactionalEventListener annotation, though, the event listener runs only after a configured phase of the transaction. By default, it runs after a successful commit.

Java
@TransactionalEventListener
public void handle(OrderCreatedEvent event) {
    emailService.sendOrderConfirmation(event.getOrderId());
}

In the above code, you have the following execution steps in case of a successful commit:

The order is saved.
The event is published.
The transaction commits.
The listener executes.
The email is sent.

If the transaction fails, the listener never runs, and no inconsistency is left.

Implementation

To complete the example shown in the previous section, you should first define a simple event class. Events should be lightweight, carrying only identifiers rather than full entities. In the following code snippet, you have an event related to the creation of an order, and its definition class contains only the order ID.
Java
public class OrderCreatedEvent {

    private final Long orderId;

    public OrderCreatedEvent(Long orderId) {
        this.orderId = orderId;
    }

    public Long getOrderId() {
        return orderId;
    }
}

Then, you should publish the event inside a transactional method of a service class:

Java
@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final ApplicationEventPublisher publisher;

    public OrderService(OrderRepository orderRepository, ApplicationEventPublisher publisher) {
        this.orderRepository = orderRepository;
        this.publisher = publisher;
    }

    @Transactional
    public void createOrder(Order order) {
        orderRepository.save(order);
        publisher.publishEvent(new OrderCreatedEvent(order.getId()));
    }
}

Note that the event is published inside the transaction. Then, you should handle the event after the commit phase by creating a listener:

Java
@Component
public class OrderEventListener {

    @TransactionalEventListener
    public void handle(OrderCreatedEvent event) {
        emailService.sendOrderConfirmation(event.getOrderId());
    }
}

This listener will execute only after the transaction commits successfully. Note that @TransactionalEventListener, without any additional configuration, runs after a successful commit. In the following section, you will see that you can modify the transaction phase in which the listener is triggered by using a specific parameter.

Transaction Phases

You can configure the listener to specify the transaction phase in which it should run. The default phase is AFTER_COMMIT:

BEFORE_COMMIT: Runs just before the transaction commits. It can be configured with @TransactionalEventListener(phase = TransactionPhase.BEFORE_COMMIT). It's useful when you need to finalize data before persistence completes.
AFTER_COMMIT: Configured with @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT). It runs after a successful commit. It is best suited for sending emails, publishing events, updating caches, and triggering workflows.
AFTER_ROLLBACK: Configured with @TransactionalEventListener(phase = TransactionPhase.AFTER_ROLLBACK). It runs only if the transaction fails and rolls back. It can be used for clean-up operations or for executing compensating actions.
AFTER_COMPLETION: Configured with @TransactionalEventListener(phase = TransactionPhase.AFTER_COMPLETION). It runs after the transaction regardless of its outcome. It can be used for cleaning up or executing compensating actions. It could also be useful for monitoring or auditing.

Asynchronous Execution

The actions performed after the transaction completion are usually external calls, like sending emails, calling external APIs, sending Kafka messages, or running analytics. They can be slow and may cause the program to become unresponsive. To avoid that, you can combine @TransactionalEventListener with @Async, like in the following example:

Java
@Async
@TransactionalEventListener
public void handle(OrderCreatedEvent event) {
    analyticsService.trackOrder(event.getOrderId());
}

This way, the transaction completes without delay, while the heavy work is handled in the background.
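One detail the snippet above leaves implicit: @Async only takes effect if asynchronous method execution is enabled in your Spring configuration; otherwise the annotation is silently ignored and the listener runs on the calling thread. A minimal configuration sketch (the class name is illustrative):

Java
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;

// Enables processing of @Async so the listener work moves off the caller's thread.
@Configuration
@EnableAsync
public class AsyncConfig {
}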
How It Works

Internally, Spring registers a transaction synchronization mechanism with the transaction manager. During the transaction execution, Spring performs the following steps:

Stores the event.
Waits until the transaction completes.
Executes the listener in the configured phase.

Best Practices

A couple of best practices in using this feature:

Publish events from the service layer: Events should usually be triggered from transactional services, not controllers.
Use for side effects only: @TransactionalEventListener is ideal for external communication, background processing, and integration with other systems. Avoid using it for core business logic.

Conclusion

@TransactionalEventListener is a useful feature available in Spring Boot applications. It allows you to link actions to specific database transaction phases, like the after-commit phase. It helps in keeping business logic clean and avoiding inconsistent states. You can find the example code on GitHub.

By Mario Casari
Runtime FinOps: Making Cloud Cost Observable

There's a particular kind of learned helplessness that settles into engineering organizations after a few years of rapid cloud growth. You ship a feature. The feature works. Latency looks fine, error rates stay quiet, on-call doesn't page. Then three weeks later someone from finance drops a Slack message — a screenshot of the AWS Cost Explorer with a jagged upward spike, annotated with a red arrow and a question mark. By then, the deployment that caused it has been buried under six more deploys. The engineer who wrote the change is mentally two features ahead. Nobody remembers. You run a postmortem on nothing. This is the default state for most shops. Not negligence, exactly. More like a structural information deficit: the feedback loop between code change and cost impact is measured in billing cycles, not seconds. Runtime FinOps is the attempt to collapse that latency. The core mechanical insight is embarrassingly simple once you see it. Cloud spend is ultimately a function of resource consumption, which is itself a function of workload behavior, which is directly caused by deployed code. The causal chain is unbroken. What's broken is the observability of that chain — the instrumentation stops at runtime metrics and never continues downstream into the dollar layer. Prometheus scrapes CPU and memory. Datadog tracks p99 latency. Nobody is emitting cost_per_request_dollars into the same time-series store. That gap isn't accidental. It reflects organizational archaeology — engineering tools were built by engineers who didn't own the bill, and finance tools were built by accountants who didn't understand deployment pipelines. The FinOps movement as a discipline has largely tried to paper over this by creating shared dashboards and monthly reviews. That's better than nothing. It is not remotely sufficient. What sufficient looks like: a Grafana panel, sitting next to your latency and throughput charts, showing dollars-per-minute in something close to real time. Not aggregated monthly, not delayed by the 24-to-48-hour lag that AWS billing data typically carries, but live. Or close to live. And critically, annotated — vertical lines at every deploy, tagged by Git SHA, so when the cost curve flexes upward you can see which change correlated with when. Tools like Kubecost and CloudZero attempt this for containerized workloads, mapping cluster resource consumption to workloads and namespaces with reasonable accuracy. The attribution model involves some approximation — particularly around shared infrastructure, node-level overhead, and storage that doesn't decompose cleanly to individual pods — and practitioners would be dishonest if they called it precise. It's directionally accurate. In FinOps, directionally accurate and fast beats precisely accurate and three weeks late every single time. The tagging problem deserves its own meditation, because this is where ambition usually fractures against operational reality. The idea is clean: every cloud resource carries tags — service, team, environment, git-sha, pr-number — and those tags flow through billing, letting you attribute cost to the unit of work that caused it. In theory, you can then answer "what did this pull request cost us in production over its first 72 hours of traffic?" 
In practice, tagging compliance in most organizations sits somewhere between 40% and 70% on a good day, because tags are set at resource creation and then drift, or get set inconsistently across Terraform modules, or simply aren't applied to resources provisioned through the console in a hurry. Data transfer costs — often a substantial portion of a distributed system's bill — aren't taggable in any meaningful way. RDS instance costs don't decompose to the query or calling service. The tag taxonomy you design in January will be partially obsolete by June when someone creates a new microservice and doesn't know the convention. None of this means tagging is futile. It means the feedback loop you build on top of tags is only as trustworthy as your tagging governance, and tagging governance requires someone to actually own it, which requires organizational will that frequently isn't there.

The more robust pattern I've seen in practice: tag at the workload level (not the resource level), enforce it via CI/CD gate rather than relying on humans to remember, and accept that you'll have a residual "unattributed" bucket that you manage down over time rather than eliminating entirely. Tools like AWS Tag Editor and custom OPA policies for Terraform can close the loop on net-new resources. The legacy tail requires a different, less glamorous approach: manually audit, assign, iterate.

The CI/CD integration story is where things get genuinely exciting, and also where practitioners should calibrate their expectations carefully. Infracost is the canonical example: it parses Terraform plan output, estimates the monthly cost delta of the proposed infrastructure change, and posts that estimate as a comment on the pull request. This is legitimately useful. A PR that adds three RDS read replicas and a NAT gateway should trigger a cost conversation before it merges, not after the bill lands. Engineers who see "this change will add ~$340/month" in their PR review interface learn, over time, a working intuition about what infrastructure costs. That intuition is rarer than it should be.

The limitation is that Infracost and its peers estimate infrastructure cost — the static resource footprint — rather than operational cost, which includes data transfer, API calls, Lambda invocations, storage I/O, and everything else that scales with traffic and behavior rather than existence. A change that looks cost-neutral at the infrastructure level might double your CloudFront egress if it changes response payload sizes. It might triple your DynamoDB read units if it introduces a hot key. The tools don't know this. They can't, without runtime data.

The more sophisticated version of this loop, which fewer teams have built, uses predictive cost modeling against actual traffic. You have a deployment. You have the last N days of traffic patterns. You can project forward: "given current traffic, this new resource configuration will consume approximately $X over the next 30 days." AWS Cost Explorer has a forecast API. Combining it with deployment annotation is not a huge engineering lift, but it requires someone to actually build and maintain the plumbing. Most teams haven't made that investment.

Consider what an SRE-inflected cost culture actually demands. From SRE, borrow two concepts that apply almost without modification: error budgets and anomaly alerting.
An error budget for cost would look like this: the service owns a monthly cost envelope, approved and visible, and the team tracks burn rate against it the way they track error budget burn against their SLO. When burn rate exceeds a threshold — say, the monthly budget will be exhausted in 20 days at current trajectory — that's an alert, the same severity as a latency SLO violation. Not a finance report. A PagerDuty ticket if you want to be maximalist about it, or at minimum a Slack alert that reaches the on-call engineer, not the VP of Engineering. AWS Cost Anomaly Detection does a serviceable version of this out of the box, using ML to detect spend patterns that deviate from the expected baseline and sending SNS notifications. It's underused. I suspect this is partly because the notification goes to whoever set up the billing alert (often a platform team, sometimes a finance person) rather than to the team that owns the service. The alert finds the wrong inbox and dies there. The organizational fix is unglamorous: route cost anomaly notifications to the same escalation paths as operational incidents. The same service catalog that maps an alert to an on-call rotation should map a cost anomaly to the team that owns the relevant tagged resource. This requires the tagging to work. Everything requires the tagging to work. There's an architectural pattern worth naming explicitly: cost as a flow control signal. In a well-instrumented system, you might have a service that responds to demand by scaling out — adding pods, provisioning more compute, whatever the autoscaling policy dictates. This is good. Autoscaling is good. But autoscaling policies are typically expressed in terms of CPU utilization or queue depth or request rate, never in terms of "we have now spent $X in the last hour and this is abnormal." A traffic spike from a misbehaving client, a scraper, an accidental infinite loop in a partner's integration — these can drive spend through the ceiling before any CPU-based autoscaler would even notice a problem. Dollar-rate alerting fills a different detection envelope than performance alerting. A pathological client that sends low-volume but expensive requests — each one triggering a chain of downstream API calls, S3 reads, expensive ML inference — might not move your CPU metrics at all. It will move your bill. If you're watching dollars-per-minute in Prometheus and the rate doubles, that signal is available to you immediately. Whether you act on it programmatically (rate limiting, circuit breaking, graceful degradation) or operationally (alert, investigate, remediate) is a choice, but you can't make it if you can't see it. The blameless postmortem for cost incidents is a concept that sounds slightly ridiculous the first time you hear it and becomes obviously correct about sixty seconds later. When a cost spike happens, the natural instinct in most organizations is either to ignore it (it's just money, nobody died) or to hunt for the responsible party and make an example of them. Both responses are bad. Ignoring it means the behavior repeats. Making an example of someone means engineers become risk-averse about infrastructure changes in ways that slow down the whole organization. The SRE approach to operational incidents — reconstruct the timeline, identify contributing factors, generate mitigations, share the learning broadly — transfers completely. What was the change that caused the spike? Was it a code change, a configuration change, an unexpected shift in traffic? 
Was it even caused by a change, or is it an emergent behavior of a system that was always going to fail this way under sufficient load? What could have caught it earlier? What will catch it next time? The output of that process is institutional knowledge and, eventually, changed defaults. The team that burns their cost budget on an accidentally O(n²) database query and runs a postmortem on it will write better queries afterward, not out of fear but because they now have a concrete understanding of what "better" means in dollar terms. Honestly, the biggest obstacle isn't technical. The tools exist. Kubecost, CloudZero, Infracost, CloudHealth, AWS-native cost tooling — the ecosystem is mature enough that you can build a meaningful runtime FinOps practice without writing much novel infrastructure. The pipeline from resource consumption to tagged cost attribution to developer-facing dashboard is navigable. What isn't navigable without organizational agreement is the question of who owns this. Finance owns the bill but not the code. Engineering owns the code but not the budget. Platform teams own the tooling but not the individual services. FinOps functions, where they exist, often sit in a liminal space that has advisory authority but not operational authority. None of these entities, alone, can close the feedback loop. The teams that actually do this well tend to have one thing in common: a clear owner at the service level. Not "the platform team will build cost dashboards for everyone" but "this service team owns a cost SLO, reviews it in their weekly ops meeting, and is the first call when a cost anomaly fires." That's a cultural stance, not a technical one. If you wanted to change something by Monday morning, the smallest high-signal move is this: find your last three significant cost spikes, look at the deployment timeline, and see whether you can identify the correlating change. Do this manually, in AWS Cost Explorer, cross-referenced against your deployment log. If you can correlate them — if the mechanism is visible in retrospect — you now have a concrete example to show your team of what a runtime cost signal would have caught in real time. That example is worth more than any amount of abstract advocacy for FinOps practices. Then ask yourself: what's the minimum instrumentation that would have surfaced this signal at deploy time? Maybe it's a CloudWatch alarm on spend rate. Maybe it's a Kubecost dashboard with a deployment annotation. Maybe it's just a Slack alert from Cost Anomaly Detection routed to the right channel. Start there. The elaborate CI/CD cost gates and per-Git-SHA bill-of-materials and predictive spend forecasting are all real and all worthwhile, but they're downstream of a simpler belief: that cloud spend is a system metric, not a finance report, and your observability stack should treat it that way. The rest follows.
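The dollars-per-minute signal the essay keeps returning to is mostly plumbing once you decide to emit it. A minimal sketch using Micrometer, assuming you can attach an estimated unit cost to each request; the metric name, the CostModel and HandledRequest types, and the service tag are illustrative, not a standard:

Java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Emits an ever-increasing dollar counter; a rate() over it in Prometheus gives
// dollars-per-minute on the same dashboard as latency and throughput.
public class RequestCostRecorder {
    private final Counter costDollars;
    private final CostModel costModel; // illustrative: maps a request to an estimated dollar cost

    public RequestCostRecorder(MeterRegistry registry, CostModel costModel) {
        this.costDollars = Counter.builder("cloud.cost.dollars")
                .description("Estimated cloud spend attributed to handled requests")
                .tag("service", "checkout") // example tag; align with your cost-attribution taxonomy
                .register(registry);
        this.costModel = costModel;
    }

    public void record(HandledRequest request) {
        // Increment by the estimated cost of this request (compute, I/O, downstream calls).
        costDollars.increment(costModel.estimateDollars(request));
    }

    // Illustrative types standing in for whatever your service already has.
    public interface CostModel {
        double estimateDollars(HandledRequest request);
    }

    public record HandledRequest(String route, long payloadBytes, int downstreamCalls) {
    }
}

With a Prometheus-backed registry, a query along the lines of rate(cloud_cost_dollars_total[5m]) * 60 then gives the dollars-per-minute panel, and deploy annotations tagged by Git SHA sit next to it exactly as described above.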

By David Iyanu Jonathan
NeMo Agent Toolkit With Docker Model Runner

The year 2025 has been widely recognized as the year of AI agents. With the launch of frameworks like Docker Cagent, Microsoft Agent Framework (MAF), and Google's Agent Development Kit (ADK), organizations rapidly embraced agentic systems. However, one critical area received far less attention: agent observability. While teams moved quickly to build and deploy agent-based solutions, a fundamental question remained largely unanswered. How do we know these agents are actually working as intended?

Are multiple agents coordinating effectively?
Are their outputs reliable and of high quality?
Can we diagnose failures or unexpected behaviors in complex, multi-agent workflows?

These challenges sit at the core of agent observability. This is where Nvidia's open-source toolkit, NeMo, comes into the picture. NeMo brings much-needed, enterprise-grade observability to LLM-powered systems, enabling teams to monitor, evaluate, and trust their agent infrastructure at scale. At the same time, Docker Model Runner is emerging as the de facto standard for local inference from the desktop. It provides a unified, "single pane of glass" experience for experimenting with a wide range of open-source models available through the Docker Models Hub. As part of this tutorial, we will look at how to add observability to your AI agents when inferencing through Docker Model Runner.

Docker Model Runner Setup

First, let's set up Docker Model Runner using a small language model. In this tutorial, we will use ai/smollm2. The setup instructions for Docker Model Runner are available in the official documentation. Follow those steps to get your environment ready. Make sure to enable TCP access in Docker Desktop. This step is essential; without it, your prototype will not be able to communicate with the model runner over localhost.

Run the following command to pull the small language model we will use for inference:

Shell
docker model run ai/smollm2

NeMo Agent Toolkit Setup

The first step is installing the Nvidia NAT Python package. I recommend installing uv and installing all the nat dependencies through uv, because going down the plain pip route can cause timeouts.

Shell
uv pip install nvidia-nat

NeMo's agentic setup is done through YAML, so declare a YAML configuration, for example agent-run.yml:

YAML
functions:
  # Add a tool to search wikipedia
  wikipedia_search:
    _type: wiki_search
    max_results: 2

llms:
  # Tell NeMo Agent Toolkit which LLM to use for the agent
  openai_llm:
    _type: openai
    model_name: ai/smollm2
    base_url: http://localhost:12434/engines/v1  # Docker Model Runner endpoint
    api_key: "empty"  # because we are using local inference, this can be empty
    temperature: 0.7
    max_tokens: 1000
    timeout: 30

general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # The endpoint where you have deployed the OTel collector
        endpoint: http://0.0.0.0:5216/v1/traces
        project: nemo_project

workflow:
  # Use an agent that 'reasons' and 'acts'
  _type: react_agent
  # Give it access to our wikipedia search tool
  tool_names: [wikipedia_search]
  # Tell it which LLM to use (now using OpenAI with Docker endpoint)
  llm_name: openai_llm
  # Make it verbose
  verbose: true
  # Retry up to 3 times
  parse_agent_response_max_retries: 3

There are four important sections in the YAML file:
Currently, OpenAI, Anthropic, Azure OpenAI, Bedrock, and Hugging Face are the supported providers. Since Docker Model Runner supports both the OpenAI and Anthropic API formats, we can leverage it for either provider type.
Telemetry: This is where observability comes into the picture. In this example, we have added OTel-based tracing, so spans are logged to the configured OpenTelemetry destination.
Workflow: This is the final piece of the puzzle, where we wire the functions, LLMs, and tools together into a workflow. For the current workflow, we configure a reasoning-and-acting (ReAct) agent along with the Wikipedia search tool and the Docker Model Runner inference endpoint.

Before we run the workflow, we will configure the OpenTelemetry Collector to export spans to the otel_logs folder. Create a file named otel_config.yml:

YAML
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:5216

processors:
  batch:
    send_batch_size: 100
    timeout: 10s

exporters:
  file:
    path: /otel_logs/spans.json
    format: json

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [file]

Run the following commands in the terminal to start the collector:

Plain Text
mkdir otel_logs
chmod 777 otel_logs
docker run -v $(pwd)/otel_config.yml:/etc/otelcol-contrib/config.yaml \
  -p 5216:5216 \
  -v $(pwd)/otel_logs:/otel_logs/ \
  otel/opentelemetry-collector-contrib:0.128.0

Finally, run the NeMo workflow using the following command:

Plain Text
nat run --config_file ./agent-run.yml --input "What is the capital of Washington"

Output:

Plain Text
[AGENT] Agent input: What is the capital of Washington
Agent's thoughts: WikiSearch: {'annotation': 'Washington State', 'required': False}
Thought: You should always think about what to do.
Action: Wikipedia Search: {'annotation': 'Washington State', 'required': False}
------------------------------
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:357 - [AGENT] Retrying ReAct Agent, including output parsing Observation
2026-03-22 21:55:18 - INFO - httpx:1740 - HTTP Request: POST http://localhost:12434/engines/v1/chat/completions "HTTP/1.1 200 OK"
2026-03-22 21:55:18 - INFO - nat.plugins.langchain.agent.react_agent.agent:270 - ------------------------------
[AGENT] Agent input: What is the capital of Washington State
Agent's thoughts: The capital of Washington State is Olympia.

After running the above command, you will see a spans.json file in the otel_logs folder, containing the full spans along with inputs and outputs. In addition to what we discussed, it is also possible to set up logging and evaluations on model responses that check for coherence, relevance, and groundedness.

References

Docker Model Runner: https://docs.docker.com/ai/model-runner/
Nvidia NeMo Agent Toolkit: https://docs.nvidia.com/nemo/agent-toolkit/latest/get-started/installation.html
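A quick aside before moving on: if you want to confirm that the Docker Model Runner endpoint configured in agent-run.yml is reachable before invoking nat run, the following minimal sketch does the job. It is not part of the toolkit and assumes the openai Python package is installed; it simply sends one chat completion to the same base_url and model used in the workflow configuration.

Python
# Sanity check for the Docker Model Runner endpoint used in agent-run.yml.
# Assumes `pip install openai` and that TCP access is enabled in Docker Desktop.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # Docker Model Runner endpoint
    api_key="empty",  # local inference, so the key is not validated
)

response = client.chat.completions.create(
    model="ai/smollm2",
    messages=[{"role": "user", "content": "What is the capital of Washington?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

If this prints a model response, the openai_llm section of the workflow is pointing at a live endpoint and the NAT run should be able to reach it as well.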

By Siri Varma Vegiraju
C/C++ Is Where Vulnerability Programs Go to Guess

Walk into most AppSec reviews, and you'll find a familiar pattern. Python dependencies: fully inventoried. npm packages: tracked and patched. C and C++ code powering the operating system, the embedded firmware, or the performance-critical core of the product? A blank space where the risk assessment should be. This is not a tooling gap that's easy to paper over. C and C++ do have package managers, but adoption is still ramping, and they are dependent on the operating system and build environment. Libraries get vendored directly into repositories. Static linking buries third-party components inside compiled binaries with no labels and often no version information left to read. Build logic lives across Conan, CMake files, Bazel configs, Makefiles, Yocto recipes, and BitBake layers, and no two projects use the same way. There is a compounding problem that rarely gets named directly: which libraries are present is often not determined until build time, or until the container or environment is assembled. There is no static manifest to read. The dependency graph is only fully real once the software is built, and by then, most tools have already finished their analysis and moved on. Most tools handle this by doing their best with whatever they can find and returning results that look complete. The incompleteness tends to be silent. You get a list of components with no indication of how many were missed, and in many cases, a generous helping of components you don't actually have. When tools can't determine dependencies precisely, they guess, and those guesses show up in your inventory as findings that engineers spend time investigating before discovering the library in question was never there. The Problem Isn't Complexity. It's Assumptions. Most software composition analysis tools were designed around a reasonable assumption: that dependencies are declared somewhere machine-readable. In Python, that's a requirements file. In JavaScript, a package.json. In C and C++, that assumption fails immediately. Build systems in the C and C++ ecosystems are diverse and project-specific. CMake, Bazel, Make, Yocto, BitBake, and custom shell scripts all encode dependency logic differently, and there is no common interface for tooling to parse across them. Static linking buries third-party components inside compiled binaries, stripping the version information that scanners rely on. Libraries get copied into codebases directly, with no record of their origin and no metadata attached. Enterprise package managers like Conan exist and are improving, but they aren't a shortcut to solving this. Retrofitting an existing C/C++ project to use Conan is not a weekend task. It is an architectural undertaking that touches build infrastructure, CI pipelines, and dependency resolution logic that may have accumulated over the years. The migration cost is often far higher than the security team proposing it realizes, and the security case alone rarely wins the budget argument. Any security program that assumes package manager adoption is around the corner is building on a foundation that isn't there yet. The practical consequence is that security teams making risk decisions about their most critical software, the code that runs the kernel, the device, or the real-time system, are doing so with fundamental gaps in their data. A CVE drops for a library shipped in three products. Without accurate visibility into which builds include which version of that library, triage becomes guesswork. 
And because dependencies are often resolved at build time or inherited from a base container or environment, the answer to "do we use this library?" is not in a manifest. It requires someone to trace back through build logs, environment configurations, and image layers by hand. That work falls on engineers, not tooling. Start With What Shipped, Not What Was Declared The correct approach to this problem does not start with a package manager. It starts with a different question: what is actually present in this build, this artifact, this container image? This reframing matters because it accepts the reality of how C/C++ projects are actually built. Dependencies are resolved at build time. Libraries are pulled from the environment. Components are embedded in base images. The dependency graph that matters is the one that shipped, not the one that was planned, and the only way to recover it is to work backwards from the output rather than forward from what was declared. In practice, this means parsing build system outputs across all major toolchains rather than expecting a common format. It means analyzing binaries and system images alongside source trees, not instead of them. For the cases where a dependency is genuinely obscured, such as a library vendored without documentation or a component embedded deep in a third-party SDK, it means applying language-aware inference to surface what rule-based tools miss. None of this is simple. But the complexity is an argument for investing in it, not for treating C/C++ as a known unknown and moving on. What Changes When You Can Actually See The outcome of getting this right is not just a more accurate inventory. It is the CVE response measured in hours rather than days. It is compliance artifacts that reflect what is actually in production rather than what the tooling happened to find. It is AppSec teams that can answer "are we affected?" with confidence instead of a best guess followed by a week of manual investigation. C and C++ power a disproportionate share of the software that runs the world: operating systems, embedded devices, automotive systems, industrial controls, and the performance-critical cores of applications that can't afford to be wrong. Security programs that treat this code as too hard to analyze are not avoiding complexity. They are accepting risk they cannot quantify, in the software they can least afford to get wrong.
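To make the "start with what shipped" idea concrete, here is a deliberately small sketch of inspecting a built artifact rather than a declared manifest. It is an assumption-laden starting point, not an SCA tool: it only lists dynamically linked libraries via ldd on Linux, so statically linked and vendored components, the hard cases discussed above, will not appear in its output.

Python
# Sketch: enumerate the dynamically linked libraries of a built binary using ldd (Linux).
# Static linking and vendored code remain invisible here; this only covers the easy layer.
import subprocess
import sys

def linked_libraries(binary_path: str) -> list[str]:
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    libs = []
    for line in result.stdout.splitlines():
        line = line.strip()
        # Typical line: "libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x...)"
        if "=>" in line:
            libs.append(line.split("=>")[0].strip())
        elif line:
            libs.append(line.split()[0])
    return libs

if __name__ == "__main__":
    for lib in linked_libraries(sys.argv[1]):
        print(lib)

Even this trivial pass usually surfaces components that never appear in any requirements file, which is the point: the dependency graph that matters is recoverable only from the built output.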

By Lexi Selldorff
The Hidden Engineering Cost of XML in Enterprise Development Workflows

While JSON dominates modern APIs, XML continues to power a significant portion of enterprise integrations, financial systems, telecom services, configuration pipelines, and SOAP-based APIs. Many developers assume XML is “solved,” but in practice, generating structured, well-formed XML repeatedly remains a surprisingly inefficient task. In regulated industries such as banking, healthcare infrastructure, and enterprise SaaS platforms, XML is not optional — it is mandated by legacy systems, compliance frameworks, and long-standing integration contracts. This makes XML proficiency essential, even for teams primarily working in modern stacks. This article explores the real-world problems developers face when working with XML in professional environments and outlines practical strategies to eliminate repetitive friction from XML-heavy workflows. Problem 1: Manual XML Creation Is Error-Prone In test environments, staging systems, or internal tools, developers often need to manually craft XML payloads. This usually starts simple: XML <user> <name>John</name> <email>[email protected]</email> </user> But real systems rarely stay this small. Enterprise schemas introduce: Deeply nested elementsNamespaces and prefix bindingsStrict ordering requirementsOptional vs. required nodesSchema validation constraints (XSD) One missing closing tag, misplaced namespace declaration, or incorrectly nested element can break entire test pipelines. Developers then waste time debugging formatting issues rather than business logic. Unlike syntax errors in code editors, XML structural issues often surface only during integration validation. Problem 2: Repetitive Test Data Creation Automated test suites often require multiple variations of XML inputs: Valid payloadsBoundary-condition payloadsMalformed payloadsLarge dataset payloads Creating these variations manually introduces duplication and inconsistency. Developers frequently copy-paste existing XML and modify values, which increases the risk of: Outdated sample structuresIncorrect tag reuseSchema drift between environments Over time, test data becomes unreliable and difficult to maintain. A slight change in schema can require editing dozens of static XML files across repositories. Problem 3: Schema Evolution Breaks Examples When XML schemas evolve, documentation and example payloads often lag behind. API documentation might show an older structure while backend services enforce updated rules. This leads to: Integration confusionClient-side validation failuresOnboarding delays for new developersUnexpected production incidents Maintaining synchronized XML examples across documentation, test cases, and staging systems becomes a recurring operational burden. Without structured generation workflows, keeping everything aligned requires constant manual updates. Problem 4: Boilerplate Fatigue in SOAP and Legacy Systems In SOAP-based integrations, developers frequently work with verbose envelopes like: XML <soapenv:Envelope> <soapenv:Header/> <soapenv:Body> ... </soapenv:Body> </soapenv:Envelope> Even minor changes require editing large structured blocks. This repetitive boilerplate slows iteration speed, especially during debugging or rapid prototyping sessions. When multiple namespaces are involved, envelope headers must remain precise. A small prefix mismatch can invalidate an entire request, causing hours of troubleshooting in distributed environments. 
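One way to avoid hand-editing envelopes, and the prefix mismatches that come with them, is to generate the boilerplate programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the "urn:example:orders" namespace and the GetOrder payload are hypothetical placeholders, not part of any real contract.

Python
# Sketch: build a SOAP envelope programmatically so namespace prefixes stay consistent.
# The service namespace and payload elements below are illustrative placeholders.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SVC_NS = "urn:example:orders"

ET.register_namespace("soapenv", SOAP_NS)
ET.register_namespace("ord", SVC_NS)

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")

get_order = ET.SubElement(body, f"{{{SVC_NS}}}GetOrder")
ET.SubElement(get_order, f"{{{SVC_NS}}}OrderId").text = "12345"

print(ET.tostring(envelope, encoding="unicode"))

Because every element is created against a registered namespace, the serializer emits the soapenv and ord prefixes consistently, which removes exactly the class of mistake that hand-edited envelopes invite.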
Problem 5: XML Validation and Debugging Overhead Developers often discover structural issues only after runtime validation errors occur. Common XML-related debugging frustrations include: Unexpected whitespace handlingEncoding mismatches (UTF-8 vs UTF-16)Invalid special characters (&, <, >)Namespace prefix conflicts Unlike strongly typed programming languages, XML validation errors can be verbose and difficult to interpret. Error messages often reference line numbers in large payloads, requiring manual tracing. Instead of focusing on core functionality, developers spend valuable time identifying syntax issues in data representation layers. A Practical Workflow to Reduce XML Friction To address these recurring issues, teams should adopt structured XML generation practices rather than relying on manual editing. Browser-based utilities can help standardize the creation of structured XML payloads during development and testing. Instead of hand-writing nested elements repeatedly, developers can: Define root structures consistently.Generate multiple variations quickly.Copy structured output directly into test suites.Regenerate examples whenever schema updates occur. This approach reduces human formatting errors and accelerates iterative testing cycles while preserving structural consistency. Best Practices for XML-Heavy Projects 1. Centralize Sample Payloads Maintain a single source of truth, for example, XML structures. Regenerate them when schemas change to avoid inconsistencies. 2. Validate Early Use schema validators during development rather than waiting for runtime failures in staging or production. 3. Automate Where Possible Integrate generated XML samples into CI pipelines for regression testing of parsers and transformation logic. 4. Separate Structure from Business Logic Avoid mixing XML formatting code directly inside business logic layers. Use templates or generators to keep responsibilities clean. 5. Monitor Schema Changes Proactively When working with third-party integrations, track schema updates carefully. Establish a review process to evaluate how structural changes affect internal systems before deployment. Real-World Benefits Beyond Error Reduction Teams that adopt structured XML workflows often notice improvements beyond just fewer bugs. For example, onboarding new developers becomes faster because they can rely on standardized XML templates rather than deciphering inconsistent examples. QA engineers can generate realistic test cases without spending hours editing XML by hand, which improves test coverage and reduces missed edge cases. In addition, having a reliable generation process makes it easier to document API responses accurately, helping technical writers produce up-to-date reference material. Over time, these practices create a more maintainable codebase and reduce the risk of hidden errors in production. Conclusion XML may not be the trendiest format, but it remains deeply embedded in professional software systems. The friction developers experience is rarely about XML itself — it’s about repetitive structure management, schema drift, and manual formatting errors. By standardizing XML generation, validating structures early, and eliminating manual boilerplate editing, teams can significantly reduce debugging time, improve integration reliability, and accelerate development cycles in XML-dependent environments. 
Additionally, this approach fosters better collaboration between developers and QA teams, as consistent XML structures reduce misunderstandings and integration errors. It also allows new team members to onboard faster, since clear and standardized XML examples serve as reliable references. Over time, adopting these practices contributes to more maintainable systems, less technical debt, and greater confidence in automated workflows that rely heavily on XML data.

By Moeez Ayub
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

Python 3.11+ installed on your machine
AWS credentials configured (aws configure or an active IAM role)
Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
kubectl and helm v3 installed (only required if you plan to run live remediations; dry-run mode works without them)

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

Shell
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

Plain Text
sre-incident-response-agent/
├── sre_agent.py        # Main agent: 4 agents + 8 tools
├── test_sre_agent.py   # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

Shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt pins the core dependencies:

Plain Text
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

Shell
cp .env.example .env

Open .env and set the following:

Shell
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=

Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

JSON
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

Shell
python sre_agent.py

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

Shell
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"

Example Output

Running the targeted trigger above produces output similar to the following:

Shell
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
Metric stats: avg 91.3%, max 97.8% over last 30 min
Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
Root cause: Memory leak causing CPU spike as GC thrashes
Severity: P2 - single service, <5% of users affected
Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
[DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened: CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause: Memory leak in application heap leading to aggressive GC,
causing CPU saturation. Likely introduced in the last deployment.

Remediation: Rolling restart of deployment/my-api in namespace prod
initiated (dry-run). All pods will be replaced with fresh instances.

Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

Shell
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed

Enabling Live Remediation

Once you have validated the agent’s behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

Shell
DRY_RUN=false

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning. From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away. I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.
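As a closing aside, the read-only access granted in Step 4 can be verified directly with boto3 before the agent is ever run. The following is a standalone sketch for permission checking, not code from the sample; it exercises the same describe/read actions the policy grants and modifies nothing.

Python
# Sketch: confirm the IAM principal can read alarms and log groups (the Step 4 permissions).
import boto3

def check_readonly_access(region: str = "us-east-1") -> None:
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    logs = boto3.client("logs", region_name=region)

    # Same read-only action the agent relies on for alarm discovery
    alarms = cloudwatch.describe_alarms(StateValue="ALARM", MaxRecords=10)
    print(f"Active alarms visible: {len(alarms['MetricAlarms'])}")
    for alarm in alarms["MetricAlarms"]:
        print(f"  - {alarm['AlarmName']} ({alarm['MetricName']})")

    # Same read-only action the agent uses before filtering log events
    groups = logs.describe_log_groups(limit=5)
    print(f"Sample log groups visible: {len(groups['logGroups'])}")

if __name__ == "__main__":
    check_readonly_access()

If this completes without an AccessDenied error, the agent's alarm discovery and log inspection will work under the same credentials.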

By Ayush Raj Jha
Faster Releases With DevOps: Java Microservices and Angular UI in CI/CD

In modern DevOps workflows, automating the build-test-deploy cycle is key to accelerating releases for both Java-based microservices and an Angular front end. Tools like Jenkins can detect changes to source code and run pipelines that compile code, execute tests, build artifacts, and deploy them to environments on AWS. A fully automated CI/CD pipeline drastically cuts down manual steps and errors. As one practitioner notes, Jenkins is a powerful CI/CD tool that significantly reduces manual effort and enables faster, more reliable deployments. By treating the entire delivery pipeline as code, teams get repeatable, versioned workflows that kick off on every Git commit via webhooks or polling. Jenkins Pipelines as Code Jenkins pipelines allow defining build, test, and deploy stages in a Jenkinsfile so that CI/CD is truly “pipeline-as-code.” When developers push changes to Git, Jenkins can automatically start the pipeline. A typical Declarative Pipeline might look like: Groovy pipeline { agent any stages { stage('Build') { steps { /* build steps here */ } } stage('Test') { steps { /* test steps here */ } } stage('Deliver'){ steps { /* deploy steps here */ } } } } This approach version controls the CI/CD logic along with the application code. Each stage appears in the Jenkins UI, showing real-time status. Plugins extend Jenkins in many ways: NodeJS plugin lets a pipeline use a named Node installation to run npm or ng commands, and the Amazon ECR plugin provides steps to authenticate and push Docker images to AWS ECR. Building Java Microservices For Java microservices, a common pipeline starts with a Maven or Gradle build. For instance, a Build stage might run: Shell mvn -B -DskipTests clean package This compiles the code and packages it into a JAR without running tests. Immediately following is a Test stage, running unit tests, and archiving results. In Jenkins, one can even use the JUnit plugin to publish test reports. For example: Groovy stage('Test') { steps { sh 'mvn test' } post { always { junit 'target/surefire-reports/*.xml' } } } This ensures test failures are reported in Jenkins and can stop the pipeline if needed. Static analysis or security scans can be added as additional stages before packaging. In practice, pushing code triggers the pipeline: as one blog describes, When the user pushes code, it triggers [Jenkins]. The Jenkins pipeline builds the code using Maven, runs unit tests, and performs static code analysis. If the code passes, Jenkins builds a Docker image and pushes the image as the artifact. By automating these steps, developers get fast feedback on their changes without manual intervention. Containerizing and Deploying Java Services Microservices are often deployed in containers on AWS. The Jenkins pipeline can build and push Docker images automatically. For example, one might include in the Jenkinsfile: Groovy stage('Build & Tag Docker Image') { steps { sh 'docker build -t myrepo/myservice:latest .' } } stage('Push Docker Image') { steps { sh 'docker push myrepo/myservice:latest' } } Here, each push builds the image and tags it. These commands can use Jenkins credentials or tools like docker.withRegistry to authenticate. In fact, using Jenkins’s Amazon ECR plugin simplifies this for AWS, a pipeline example shows setting an environment { registry = "...amazonaws.com/myRepo"; registryCredential = "ecr-creds" }, then running docker.build() and docker.withRegistry(...) { dockerImage.push() }. 
Alternatively, one could invoke the AWS CLI, first authenticate (aws ecr get-login-password | docker login ...), then docker push. AWS documentation notes that You can push your container images to an Amazon ECR repository with the docker push command once authentication is done. The CI/CD pipeline can automate creating the ECR repo if needed, tagging the image with the account’s registry URI, and pushing it. A successful pipeline run will result in updated Docker images in ECR ready for deployment. After pushing images, a final Deploy/Deliver stage can use AWS APIs or tools to launch the containers. For example, Jenkins could use kubectl to update an EKS deployment or use AWS CodeDeploy/CodePipeline to roll out new versions. Even simply SSH’ing into an EC2 and running docker run can be automated in a Jenkins pipeline. The key is that committing code automatically packages and publishes the service so teams ship faster with confidence. Building and Deploying the Angular UI The frontend Angular app is typically a static site that runs in the browser. The Jenkins pipeline for Angular is similar but uses NodeJS/NPM. First, configure Jenkins with a NodeJS installation. A pipeline stage might then look like: Groovy stage('Build Angular') { steps { sh 'npm install' sh 'ng build --prod' } } This installs dependencies and runs ng build --prod, creating a production-ready bundle in the dist/ folder. If tests or linting are required, they can be added before the build step. Once built, the static files need to be hosted. A common approach on AWS is to use S3 and CloudFront. In Jenkins, a Deploy stage could use the AWS CLI to sync the dist/ contents to an S3 bucket. For example: Shell aws s3 sync dist/my-app/ s3://my-angular-bucket/ --acl public-read or as shown in a Jenkins pipeline example simply: Shell aws s3 cp ./dist/ --recursive s3://my-bucket/ --acl public-read This command copies the built site to S3, making it publicly accessible. Using CloudFront in front of the bucket delivers the files globally with caching, and Route 53 can point a custom domain to the distribution. In short, Jenkins fully automates the publish step, so every commit to the Angular repo triggers a build and S3 upload. By hosting the Angular app on S3 and CloudFront, the CI/CD pipeline keeps the frontend delivery serverless and scalable. The build scripts are as simple as it gets: just copy the dist folder to S3 on each update. This release-ready static deploy ensures the front end is updated in lockstep with backend services. End-to-End CI/CD on AWS In practice, one Jenkins pipeline can orchestrate both the Java and Angular builds. A multibranch pipeline could build the microservices repositories, push each to Docker/ECR, and also build and deploy the Angular UI repository in parallel. The general flow is: Commit and trigger: A Git push to any service or UI repository triggers Jenkins via webhook or polling.Build stages: Jenkins runs the defined stages. Java repos run Maven/CODE analysis and Docker build; Angular repo runs npm/ng build.Publish artifacts: Backend images are pushed to Amazon ECR (or Docker Hub). The Angular build is pushed to an S3 bucket.Deploy stages: Finally, Jenkins can use AWS CLI, CloudFormation, or deployment scripts to update running services. Even without containers, Jenkins could SSH and deploy JARs to EC2.Verification: Automated tests or smoke tests can run post-deploy to validate the release. Key DevOps practices here include pipeline-as-code, consistent tooling, and immutable artifacts. 
Because the pipeline is triggered on each change, feedback is immediate: broken builds or failing tests stop the job early, preventing flawed code from reaching production, while successful runs deliver a full release-ready bundle. As one summary points out, adopting CI/CD ensures faster, more reliable deployments by cutting manual steps.

Summary

Using Jenkins for CI/CD of Java microservices and an Angular UI greatly accelerates release cycles. Engineers define build and deploy steps in code, so every commit runs through the same automated process. Java services are built with Maven, tested, and packaged into container images that are pushed to AWS ECR and deployed on EC2/ECS/EKS. The Angular app is built with the Angular CLI and deployed as a static site to S3. Throughout this, Jenkins provides visibility and control over the build, test, and deploy stages, showing real-time status, and any failure halts the pipeline. By integrating with AWS, the pipeline taps into scalable cloud resources. For example, AWS’s ECR supports secure Docker registry workflows, and S3/CloudFront provides effortless frontend hosting. With everything automated, teams achieve the goal of continuous integration and continuous delivery, making each release faster and more reliable. In short, a well-designed Jenkins CI/CD pipeline for Java microservices and Angular ensures that code changes flow swiftly from commit to production with minimal manual overhead.

By Kavitha Thiyagarajan
How to Test a GET API Request Using REST-Assured Java

Testing GET requests is a fundamental part of API automation, ensuring that endpoints return the expected data and status codes. With REST Assured in Java, sending GET requests with query and path parameters, extracting data, verifying the status code, and validating the response body is quite simple. This tutorial walks through practical approaches to efficiently test GET APIs and build reliable automated checks, including: Basic GET Request (Simplest)Using Query ParametersUsing Map for Query ParamsUsing Path ParametersUsing Headers (Auth, Content-Type, etc.)Extracting ResponseUsing Validations with GETUsing Authentication (Basic Auth Example) In earlier tutorials, topics such as API automation for POST requests, response verification, data-driven testing, and more were covered. Application Under Test We will be using the following GET APIs from the RESTful e-commerce demo application to write the GET API requests test. GET /getAllOrders The GET /getAllOrders API returns the list of all the available orders in the system. The following is the response body of this API: Java { "message": "Orders fetched successfully!", "orders": [ { "user_id": "string", "product_id": "string", "product_name": "string", "product_amount": 0, "qty": 0, "tax_amt": 0, "total_amt": 0 } ] } GET /getOrder The GET /getOrder API returns the single order for the optional query param supplied for “order id,” “user id,” or “product id.” The following response is returned: Java { "message": "Order found!!", "orders": [ { "user_id": "string", "product_id": "string", "product_name": "string", "product_amount": 0, "qty": 0, "tax_amt": 0, "total_amt": 0 } ] } Sending a GET Request Using REST-Assured Java The following is the simplest code that could be written to test a GET /getAllOrders endpoint with REST-Assured Java: Java @Test public void testGetAllOrders () { given ().when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200); } This test method demonstrates a basic GET request using REST Assured to verify an API endpoint. given() is the starting point where request specifications (like headers, params, and auth) can be defined. In this case, it’s empty since no additional setup is needed.when() specifies the action to be performed, here, sending the request.get("http://localhost:3004/getAllOrders") sends a GET request to the specified endpoint to retrieve all orders.then() is used to validate the response.statusCode(200) asserts that the API responds with HTTP status code 200 (OK), confirming a successful request. In simple terms, this test checks if the Get All Orders API is reachable and returns a successful response. Sending a GET Request With Query Parameters The GET request can be sent using query parameters, which play an important role in filtering, sorting, and customizing the data returned by an API. They allow clients to request only the specific information needed, making API interactions more efficient and flexible. Java @Test public void testGetOrderWithQueryParam () { given ().when () .log () .all () .queryParam ("id", 1) .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } The testGetOrderWithQueryParam() test method sends a GET order request to the /getOrder API endpoint using a query parameter and validates the response. 
queryParam("id", 1) adds a query parameter to the request, making the final URL: http://localhost:3004/getOrder?id=1get("http://localhost:3004/getOrder") sends the GET request to fetch the order with id = 1.statusCode(200) verifies that the request was successful.and().body("orders[0].id", equalTo(1)) validates that the first item in the orders array has an id of 1. This confirms that the order requested via the query parameter is correctly fetched in the response. This test not only sends a GET request with a query parameter but also ensures that the correct data is returned in the response. Multiple Query Params The queryParams() method in REST Assured allows adding multiple parameters. For example, if we need to filter the records using order_id, user_id, and product_id, we can supply the query parameters as shown below: Java @Test public void testGetOrderWithMultipleQueryParam () { given ().when () .log () .all () .queryParams ("id", 1, "user_id", "1", "product_id", "1") .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } Similarly, we can also add the different query parameters by calling the queryParam() method multiple times, as shown in the test below: Java @Test public void testGetOrderWithMultipleQueryParameters () { given ().when () .log () .all () .queryParam ("id", 1) .queryParam ("user_id", "1") .queryParam ("product_id", "1") .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } Both approaches are correct; however, as a best practice, we can use Java Map to handle multiple query parameters. This approach is especially useful when dealing with dynamic or large sets of parameters, as all key pairs can be stored in a Map and passed in a single step using queryParams(map) as shown in the code below: Java @Test public void testGetOrderWithMultipleQueryParamWithMap () { Map<String, Object> queryParams = new HashMap<> (); queryParams.put ("id", 1); queryParams.put ("user_id", "1"); queryParams.put ("product_id", "1"); given ().when () .log () .all () .queryParams (queryParams) .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } A Map<String, Object> queryParams is used to store multiple query parameters.queryParams(queryParams) automatically appends all key-value pairs to the URL.The final request URL would look like: http://localhost:3004/getOrder?id=1&user&id=1&product_id=1 Calling the log().all() method before the queryParams() method is super helpful in logging the request in the console, which helps in understanding how the query parameters are passed in the request. Sending a GET Request With Path Parameters The GET request can be sent using path parameters, which are essential for accessing specific resources directly within the API endpoint. They are typically used to uniquely identify a resource, such as an order ID or user ID, making the request more intuitive and RESTful. Let’s take an example of the GET — GetBooking API from the RESTful-Booker demo application. It fetches the booking details directly using the Path Param. 
The following curl can be used to import the GET /booking API in Postman: Plain Text curl -i https://restful-booker.herokuapp.com/booking/1 The following test script is used for fetching the booking record using Path Params: Java @Test public void testGetBookingWithPathParam () { given ().when () .log () .all () .pathParam ("id", 3) .get ("https://restful-booker.herokuapp.com/booking/{id}") .then () .log () .all () .statusCode (200); } The testGetBookingWithPathParam() test method demonstrates how to use a path parameter in a GET request with REST Assured. pathParam("id", 3) defines a path parameter named id with the value 3. get("https://restful-booker.herokuapp.com/booking/{id}") sends the GET request. Here, {id} is a placeholder in the URL, and REST Assured replaces it with the value 3, making the final request: https://restful-booker.herokuapp.com/booking/3. Finally, an assertion is performed to verify that the GET request was successfully sent and that it returned a successful response with a 200 OK status. Using the Path param, specific resources can be dynamically accessed and validated by passing values directly within the endpoint URL. Using Headers in the GET Requests Headers in a GET request are used to pass additional information, such as authentication tokens, content type, and client details, to the server. They play an important role in securing APIs (e.g., Authorization headers) and ensuring the server understands how to process the request. Authorization Token in the Header Java @Test public void testAuthHeader () { given ().header ("Authorization", "Bearer my-token-123") .when () .get ("https://httpbin.org/bearer") .then () .log() .all() .statusCode (200); } The testAuthHeader() test method demonstrates how to send a GET request with an Authorization header using REST Assured. header("Authorization", "Bearer my-token-123") adds an Authorization header with a Bearer token, which is commonly used for securing APIs. This tells the server that the request is authenticated. The "Bearer my-token-123" is a valid token for this request. If it is not supplied or an invalid token is supplied, the test will fail, throwing a 401 Unauthorized status. get("https://httpbin.org/bearer") sends the GET request to the endpoint that validates Bearer token authentication. statusCode(200) verifies that the request was successful, meaning the token was accepted. Similarly, a negative test can be written for the GET request, supplying an invalid bearer token and verifying that a 401 status is returned in response. Adding Multiple Headers The "Content-Type" or "Accept" headers can also be supplied to specify the format of the request and response, such as JSON or XML, ensuring proper communication between the client and server. We can use a Java Map to add multiple headers and pass it to the test as shown in the test script below: Java @Test public void testGetAllOrdersWithHeaders () { Map<String, String> headers = new HashMap<> (); headers.put ("Content-Type", "application/json"); headers.put ("Accept", "application/json"); given ().headers (headers) .when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200); } The testGetAllOrdersWithHeaders() test demonstrates how to send a GET request with multiple headers using a Java Map in REST Assured. The headers (Content-Type and Accept) are stored in a Map and passed using .headers(headers). The test then sends a request to fetch all orders and verifies that the API responds with a 200 OK status. 
Extracting Response Body and Values Extracting the response body from a GET request allows capturing and reusing API data for further validations or chaining requests. Using REST Assured, the values can be extracted using methods like .extract().response() or directly fetch specific fields using JSON path. This is especially useful for validating dynamic data and passing values between API calls in end-to-end test scenarios. Extracting the Response Body Java @Test public void testExtractResponseBody () { String responseBody = given ().when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200) .extract () .response () .asString (); System.out.println (responseBody); } The testExtractResponseBody() method demonstrates how to extract the full response body from a GET request in REST Assured. extract().response().asString() extracts the complete response body, converts it into a string, and stores it in the responseBody variable for further use.System.out.println(responseBody); prints the response to the console. Extracting a Specific Field Value From the Response Body While working with test automation, there are scenarios where we need to extract a specific field value from the response for further use in the test. A classic example is end-to-end testing, where we need the order ID to update or delete an order. Java @Test public void testExtractFiedValueFromResponse () { int orderId = given ().when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200) .extract () .response () .path ("orders[0].id"); System.out.println ("Order id is: " + orderId); } The testExtractFieldValueFromResponse() test method extracts the Order ID of the first order from the orders array in the response. extract().response().path("orders[0].id") extracts a specific value from the response body using a JSON path. In this case, the ID of the first order in the orders array.The extracted value is stored in the orderId variable for further use in the test.System.out.println(...) prints the extracted order ID to the console. Once the value is extracted into a variable, it can be further reused anywhere in the test. The variable can also be declared as a global variable to reuse the value in multiple tests within the same class. Using Validations With GET Requests Using validations on the responses from GET requests ensures the API returns the correct, expected data. Status codes, response headers, response time, and the content of the response body can be validated. These validations help confirm both the functional correctness and performance of the API. Validating the Response Headers Validating a response header ensures that the API returns the expected metadata, such as Content-Type, confirming the response format is correct. Java @Test public void testVerifyResponseHeader () { given ().when () .get ("http://localhost:3004/getAllOrders") .then () .headers ("Content-Type", "application/json; charset=utf-8") .statusCode (200); } The testVerifyResponseHeader() test validates both the response header and the status code to ensure the API behaves as expected. The .headers("Content-Type", "application/json; charset=utf-8") verifies that the response contains the expected Content-Type header value. Validating the Response Time The response time can also be validated using REST Assured to monitor and ensure optimal API performance. 
Java @Test public void testVerifyResponseTime () { given ().when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200) .time (lessThan (500L), TimeUnit.MILLISECONDS); } The testVerifyResponseTime() test method ensures the API responds successfully and within an acceptable time (500 MilliSeconds). This test sends an API GET request to the /getAllOrders endpoint. This test performs two assertions: the first one verifies that the 200 OK status is returned, and the other verifies the response time. The statement.time(lessThan(500L), TimeUnit.MILLISECONDS)validates the response time and ensures the response is received in less than 500 milliseconds. The measured time includes: Request transmissionResponse receptionAssertion/validation overhead If we need to measure just the time for sending the request and receiving the response, the following test script can be used: Java public void testResponseTime () { Response response = given ().get ("http://localhost:3004/getAllOrders"); System.out.println (response.getTimeIn (TimeUnit.MILLISECONDS)); } The testResponseTime() method sends the GET /getAllOrders request and prints the raw response time. Validating the Response Size Validating the response size ensures the API returns the expected size of data and helps detect issues like incomplete or excessively large payloads that may impact performance. Java @Test public void testVerifyResponseSize () { given ().when () .get ("http://localhost:3004/getAllOrders") .then () .statusCode (200) .body ("orders.size()", greaterThan (0)); } The testVerifyResponseSize() method verifies both the success of the API call and that the correct size of data is returned in the response. given().when().get(...) sends a GET request to fetch all orders from the API..then().statusCode(200) verifies that the response is successful..body("orders.size()", greaterThan(0)) validates that the orders array in the response contains at least one item, ensuring the response is not empty. The greaterThan() is a static method used from the Hamcrest library. Similarly, response size verification can be performed by using equalTo(), lessThan(), and other such validations from the Hamcrest library, depending on the use case. Validating the Response Body Validating the response body ensures that the API returns the correct data and values as expected. It is a core part of functional testing that ensures the API behaves as expected and returns accurate and reliable data. Java @Test public void testResponseBody () { given ().when () .queryParam ("id", 1) .get ("http://localhost:3004/getOrder") .then () .statusCode (200) .and () .body ("message", equalTo ("Order found!!")) .body ("message", notNullValue ()) .body ("orders[0].id", equalTo (1)); } The testResponseBody() test method sends the GET /getAllOrders request by adding a query parameter to fetch a specific order, making the request URL: http://localhost:3004/getOrder?id=1. This method performs an assertion for the response body using the following three statements one by one: .body("message", equalTo("Order found!!")) validates that the response contains the expected message..body("orders[0].id", notNullValue()) ensures that the ID in the first-order object is not null..body("orders[0].id", equalTo(1)) checks that the returned order has the ID 1. In this test, verification for the response body is done step by step as per the code statements, so if the first verification fails, the remaining assertions will not be executed, causing the test to fail immediately. 
REST Assured also allows adding multiple verification statements in a single statement, as shown in the code below: Java public void testResponseBodyMultipleAssertions () { given ().when () .queryParam ("id", 1) .get ("http://localhost:3004/getOrder") .then () .statusCode (200) .and () .body ("message", equalTo ("Order found!!"), "orders[0].id", notNullValue (), "orders[0].id", equalTo (1)); } The testResponseBodyMultipleAssertions() method performs multiple validations using a single .body() statement. "message", equalTo("Order found!!") validates the response message."orders[0].id", notNullValue() ensures the order ID is present."orders[0].id", equalTo(1) verifies the correct order is returned. Using multiple assertions within a single .body() keeps related validations grouped, and reduces repetitive code. It also makes the test more concise while ensuring that multiple aspects of the response are validated in one place. Using Authentication With GET Requests Using authentication with GET requests ensures that only authorized users can access protected API resources. Common methods include Basic Auth, Bearer tokens, and API keys, which are typically passed through headers. Incorporating authentication in tests helps validate both security and access control mechanisms of the API. Java @Test public void testBasicAuthWithGetRequest () { given ().auth () .basic ("user", "passwd") .when () .get ("https://httpbin.org/basic-auth/user/passwd") .then () .statusCode (200); } The testBasicAuthWithGetRequest() method demonstrates how to send a GET request with Basic Authentication and verify successful access to a secured API endpoint. given().auth().basic("user", "passwd") sets up Basic Authentication by sending the username and password with the request. Here, the username is “user,” and the password is “passwd”. It sends the credentials in the request.when().get("https://httpbin.org/basic-auth/user/passwd") sends a GET request to an endpoint that requires Basic Auth credentials. The URL includes the username and password only because this specific API is designed to validate them from the path.then().statusCode(200) verifies that the request was successful, meaning the provided credentials were valid. In short, this test checks whether an API protected by Basic Authentication can be accessed using the correct username and password. Summary Testing GET API requests with REST Assured is an efficient way to validate API functionality. By covering scenarios such as query parameters, path parameters, headers, authentication, and validations, it ensures that the API returns accurate and expected responses. In my experience, while testing a GET API request, it is important to consider negative scenarios, such as validating different status codes when no record is available, using invalid query and path parameters, and adding appropriate assertions. Verifying the response body is a core part of functional testing, ensuring that the API returns accurate and expected data. Additionally, validations for response size, response time, and headers should be included to ensure thorough verification of GET requests. Happy testing!!

By Faisal Khatri
Pushdown-First Modernization: Engineering Execution-Plan Stability in SAP HANA Migrations

Most SAP HANA migration failures are not correctness failures. They are plan stability failures that surface only under concurrency. A query that executes in 900 milliseconds in isolation begins to oscillate between 800 milliseconds and 14 seconds under load, with no code change and no data skew obvious enough to blame. The root cause is rarely hardware or memory configuration. In most cases, PlanViz shows large intermediate row counts forming before reduction, with estimated cardinality significantly below actual. The instability originates from translating legacy EDW logic into SAP HANA artifacts without redesigning execution boundaries for a columnar, operator-driven engine. Pushdown-first modernization is often interpreted as "move everything into SQL." That interpretation is incomplete. The actual problem is not about moving logic downward; it is about controlling how the calculation engine constructs and reuses execution graphs under varying runtime conditions. When SQLScript procedures and calculation views are designed without regard to grain stabilization, operator ordering, and cardinality propagation, the resulting plans remain syntactically valid but produce workload-sensitive operator graphs whose memory footprint shifts with parameter selectivity. This article dissects the mechanics behind execution-plan stability in SAP HANA migrations, focusing on SQLScript procedures and Calculation Views as first-class architectural units. The Architectural Shift: From Staged ETL to Operator Graph Execution Traditional EDW pipelines relied on staged transformations. Each step materialized an intermediate state, often writing into persistent tables between transformations. That staging introduced natural grain boundaries. Joins were resolved, aggregations were completed, and the next transformation consumed stable, reduced datasets. In SAP HANA, Calculation Views and SQLScript table functions remove those materialization barriers. Logical transformations are fused into a single operator graph. PlanViz reveals this as a directed acyclic graph of projection, join, aggregation, and calculation nodes. The optimizer is free to reorder joins, push predicates downward, and defer aggregations. That freedom improves latency in well-designed models. It amplifies instability in poorly designed ones. Consider a common migration pattern: SQL SELECT h.MATERIAL_ID, SUM(l.QUANTITY) AS TOTAL_QTY FROM HEADER h JOIN LINE_ITEM l ON h.DOC_ID = l.DOC_ID WHERE h.POSTING_DATE BETWEEN :p_from AND :p_to GROUP BY h.MATERIAL_ID; Translated directly into a Calculation View, the join and aggregation nodes are placed without enforcing a grain reduction before high-cardinality joins. Under small parameter windows, the plan performs adequately. Under wide date ranges, the join produces a large intermediate result before aggregation collapses it. Memory amplification becomes workload-dependent. In PlanViz, the join node frequently shows actual row counts an order of magnitude higher than estimated. For example, a date window spanning a quarter can produce 38 million intermediate rows before aggregation collapses the result to fewer than 300000 grouped records. The aggregation node is inexpensive. The join node is not. Memory allocation occurs before reduction. The legacy system relied on pre-aggregated staging tables to constrain that explosion. The HANA translation removed the staging but did not redesign the grain boundary. 
Why Preserving Batch Semantics Breaks Under Concurrency In staged ETL systems, concurrency was limited. Batch windows were serialized. Execution plans operated with predictable resource envelopes. HANA environments operate with interactive workloads, overlapping parameter combinations, and mixed analytic demands. An SQLScript procedure frequently encapsulates logic like this: SQL lt_filtered = SELECT * FROM SALES WHERE REGION = :p_region; lt_enriched = SELECT f.*, d.CATEGORY FROM :lt_filtered AS f JOIN DIM_PRODUCT d ON f.PRODUCT_ID = d.PRODUCT_ID; lt_aggregated = SELECT CATEGORY, SUM(AMOUNT) AS TOTAL FROM :lt_enriched GROUP BY CATEGORY; SELECT * FROM :lt_aggregated; Syntactically, the intermediate variables imply sequencing. In practice, the optimizer inlines these operations. If REGION is not highly selective, the join with DIM_PRODUCT expands cardinality before aggregation. Under multiple concurrent sessions with varying region selectivity, the same operator graph is reused while actual cardinality diverges across sessions. One session may process 2 million rows, another 40 million. Each constructs its own hash structures while the plan shape remains identical. Plan instability emerges from estimation drift, not code defects. Batch semantics assumed a stable data distribution. Interactive concurrency invalidates that assumption. Grain Stabilization as a First-Class Design Constraint Execution-plan stability in HANA depends on reducing cardinality before high-cost joins. That principle is mechanical, not stylistic. Instead of joining at the transaction grain and aggregating afterward, redesign the model to collapse the grain first: SQL lt_reduced = SELECT PRODUCT_ID, SUM(AMOUNT) AS TOTAL_AMOUNT FROM SALES WHERE REGION = :p_region GROUP BY PRODUCT_ID; SELECT r.PRODUCT_ID, d.CATEGORY, r.TOTAL_AMOUNT FROM :lt_reduced AS r JOIN DIM_PRODUCT d ON r.PRODUCT_ID = d.PRODUCT_ID; This change enforces aggregation before dimensional enrichment. The intermediate dataset shrinks before the join. In PlanViz, the aggregation node now executes before dimensional enrichment, reducing the intermediate row count from tens of millions to low single-digit millions before the join. Hash table size contracts accordingly, and runtime variance narrows under concurrency. Within calculation views, this requires explicit modeling: Aggregation nodes placed before join nodesJoin cardinality correctly annotatedStar-join semantics avoided for high-variance fact tables Without explicit grain control, the optimizer may defer aggregation for cost-based reasons that are correct for one parameter distribution and catastrophic for another. Pushdown-first modernization must include grain-first redesign. Calculation Views: Join Cardinality and Engine Transitions Graphical Calculation Views introduce another source of instability: cardinality metadata and engine transitions. When join cardinality is left as "n..m," the optimizer assumes worst-case explosion. When incorrectly set as "1..1," it may reorder joins aggressively and defer filtering. Both mistakes alter the plan's shape. A frequent migration pattern is to replicate legacy multi-join views into a single Calculation View with multiple projection nodes feeding a central join node. Under load, the join engine allocates hash tables proportional to pre-aggregation cardinality. If aggregation nodes sit above that join, each concurrent session constructs its own large intermediate state before reduction, multiplying memory pressure across sessions. 
Execution-plan stability requires:

  • Accurate cardinality annotation
  • Projection pruning enabled
  • Calculated columns minimized before aggregation
  • Table functions used sparingly and only when logic cannot be expressed declaratively

Table functions introduce optimization boundaries. When overused, they prevent join reordering and predicate pushdown across function boundaries, fragmenting the operator graph.

SQLScript Procedures and Optimization Boundaries

SQLScript introduces imperative constructs that can fragment optimization. For example:

SQL

IF :p_flag = 'Y' THEN
    SELECT ...
ELSE
    SELECT ...
END IF;

Branching logic produces separate subplans. Under concurrency, plan cache fragmentation increases. Each branch may generate a distinct plan variant, multiplying the memory footprint. Similarly, cursor-based loops imported from legacy logic disable set-based optimization. Even when pushdown is nominally achieved, the presence of row-by-row constructs forces materialization.

Execution stability improves when:

  • Set-based transformations replace procedural loops
  • Conditional logic is expressed via predicates rather than branches (a minimal sketch of this rewrite appears at the end of this article)
  • Intermediate variables are minimized to avoid implicit materialization

The goal is a single coherent operator graph with predictable cardinality flow.

Observability: PlanViz as a Stability Instrument

PlanViz is not a tuning tool alone. It is a stability diagnostic instrument.

Stable models show:

  • Early aggregation nodes
  • Reduced intermediate row counts after each operator
  • Limited engine transitions between the OLAP and join engines
  • Consistent estimated vs. actual row counts

Unstable models show:

  • Large intermediate nodes before aggregation
  • High variance between estimated and actual cardinalities
  • Multiple hash join operators with spill risk
  • Repeated plan variants under similar parameter shapes

Stability is observed by running parameter sweeps under controlled concurrency and comparing plan shapes, not just runtimes.

State Amplification Under Concurrent Workloads

When intermediate result sets scale with the input window size, concurrent sessions amplify state multiplicatively. If one session produces 200 million intermediate rows before aggregation and five sessions overlap, each constructs its own intermediate state, and the cumulative memory allocation triggers throttling or spill behavior despite acceptable single-session performance.

Stabilized models collapse grain early, producing intermediate datasets proportional to grouped dimensions rather than raw transaction volume. Each additional session then adds a small, bounded increment of state instead of another raw-volume-sized intermediate. This distinction is architectural. It cannot be solved with indexes, hints, or hardware.

Engineering Stability Instead of Translating Logic

Most unstable migrations are not slow because SAP HANA is inefficient. They are unstable because reduction was deferred. When aggregation happens after cardinality amplification, the intermediate state scales with raw transaction volume. Under concurrency, that decision multiplies memory pressure across sessions. The system behaves exactly as modeled.

Pushdown-first modernization succeeds when reduction precedes enrichment and when the operator graph is engineered for concurrency, not just correctness.
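As a closing illustration, here is what the branch-to-predicate rewrite referenced in the SQLScript section can look like. It is a minimal sketch that reuses the SALES table and :p_region parameter from the earlier examples and adds a hypothetical STATUS column; the point is that a single statement, and therefore a single operator graph, serves both values of :p_flag.

SQL

-- Sketch: fold the :p_flag condition into the predicate instead of branching
-- into two separate SELECTs. STATUS is a hypothetical column used for illustration.
SELECT PRODUCT_ID,
       SUM(AMOUNT) AS TOTAL_AMOUNT
FROM SALES
WHERE REGION = :p_region
  AND ( :p_flag <> 'Y' OR STATUS = 'OPEN' )  -- extra filter applies only when the flag is 'Y'
GROUP BY PRODUCT_ID;

The OR-form keeps one plan in the cache at the cost of a slightly weaker predicate; when the two branches differ structurally rather than by a single filter, this rewrite does not apply and separate statements remain necessary.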

By Rajaganapathi Rangdale Srinivasa Rao
