A large portion of the information we find online does not originate from the websites where we see it. Many platforms function primarily as aggregators: they collect data from multiple public sources, reorganize it, and make it searchable in one place. This model has become extremely common across different industries. Job boards collect listings from employers, travel sites aggregate airline and hotel data, and property platforms consolidate listings from multiple agencies. The same approach appears in many other types of public data as well.

Once a piece of information becomes publicly accessible, aggregation systems can capture it and redistribute it across numerous databases. From an engineering perspective, this process is driven by structured data pipelines designed to collect, normalize, and distribute records at scale.

A Typical Data Aggregation Pipeline

Although implementations vary, most aggregation platforms follow a similar architecture. Data flows through several layers before it becomes searchable on a public website. A simplified pipeline often looks like this:

Plain Text
Primary Data Sources (auctions, marketplaces, public feeds)
        ↓
Collection Layer (APIs, scraping, scheduled crawlers)
        ↓
Normalization Layer (data cleaning, schema mapping)
        ↓
Central Aggregation Database
        ↓
Replication Layer (search indexes, cache, CDN nodes)
        ↓
Public Web Pages (search results and listings)

Each stage introduces new copies of the same underlying record. By the time a user encounters the information on a website, it may already have passed through several systems. This architecture is highly effective for building large searchable datasets. At the same time, it naturally leads to duplication and redistribution of the same information across multiple platforms.

Why Aggregated Records Spread Across the Web

One interesting property of aggregated data is that it rarely stays within a single ecosystem. When a platform publishes structured pages based on its database, those pages become visible to search engines and other data collectors. In many cases, additional aggregation services later capture the same information again. Over time, this creates chains of redistribution. A record that originally appeared on one site may eventually be visible across dozens of unrelated platforms.

From a technical standpoint, this is not necessarily intentional replication. It is simply the result of independent systems collecting publicly available data and organizing it in their own databases.

The Role of Replication and Caching

Large aggregation platforms usually rely on distributed infrastructure. High-traffic services often separate storage, indexing, and delivery layers. To ensure fast response times, records may be replicated into:

- Search indexes
- Caching systems
- Content delivery networks
- Analytics databases

Each layer improves performance, but it also introduces additional persistence. Even when the original source changes, cached or replicated versions of the data may continue to exist for some time. In distributed systems, synchronization is rarely instantaneous. Update cycles vary across services, which means that different platforms may show different versions of the same record.

Vehicle Data as a Case Study

Automotive information is a useful example of how aggregation ecosystems develop. Vehicle records can originate from a wide range of places: auction platforms, dealer inventories, insurance reports, and other public datasets.
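Before such records are published, they typically pass through the normalization layer described earlier. The following is a minimal sketch of what that step might look like; the source names, field names, and canonical schema are illustrative, not taken from any real platform:

Python
# Minimal sketch of a normalization step: map source-specific
# records onto one canonical schema before they reach the
# central aggregation database. All names are illustrative.
CANONICAL_FIELDS = ["vin", "source", "price", "listed_at"]

FIELD_MAP = {
    "auction_feed": {"vehicle_vin": "vin", "hammer_price": "price", "sold_on": "listed_at"},
    "dealer_api": {"VIN": "vin", "list_price": "price", "created": "listed_at"},
}

def normalize(record: dict, source: str) -> dict:
    """Translate one raw record into the canonical schema."""
    mapping = FIELD_MAP[source]
    out = {canonical: record.get(raw) for raw, canonical in mapping.items()}
    out["source"] = source
    # Keep only canonical fields; drop anything source-specific.
    return {field: out.get(field) for field in CANONICAL_FIELDS}

print(normalize({"vehicle_vin": "1HGCM82633A004352", "hammer_price": 8200, "sold_on": "2024-05-01"}, "auction_feed"))

Once every source is flattened into the same shape, deduplication, indexing, and replication can all operate on a single schema, which is exactly what allows the same record to fan out so easily downstream.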
Once these records appear online, aggregation platforms often collect them and build searchable databases around them. Because several services may ingest similar datasets, the same record can eventually appear on multiple websites that have no direct connection to one another.

The Lifecycle of Aggregated Records

Looking at the system from a data-engineering perspective, aggregated records tend to follow a predictable lifecycle.

1. A record appears in a primary source.
2. Aggregation systems collect it.
3. The data is normalized and stored.
4. Replicated copies are distributed across infrastructure layers.
5. Search engines and additional aggregators discover the pages.

At that point, the information has effectively become part of a broader network of datasets. In practice, this means that records may remain visible online long after their original context has changed. For example, people sometimes look for ways to remove VIN history references or remove vehicle records that continue circulating across various platforms. From a systems perspective, however, those records may already exist in several independent databases.

Engineering Challenges in Aggregation Systems

Aggregation platforms provide clear benefits: they help organize fragmented information and make it easier to search and analyze. However, they also introduce several technical challenges:

- Maintaining data freshness
- Managing update propagation
- Preventing uncontrolled duplication
- Defining lifecycle policies for public records

These challenges become more visible as aggregation networks grow and interact with one another. Designing systems that efficiently distribute information is a well-understood problem. Designing systems that gracefully update or retire information across multiple independent platforms is often much harder.

Conclusion

Data aggregation has become a foundational pattern for building large online databases. By collecting information from many sources and organizing it into searchable formats, aggregation systems dramatically improve access to public data. Yet this same architecture also explains why information tends to spread across the web once it becomes public. Replication layers, caching systems, search indexing, and independent aggregation pipelines all contribute to the persistence of records.

For engineers building data-driven platforms, understanding how information propagates through these systems is increasingly important. The lifecycle of aggregated data does not end when a record is first published — in many cases, that is only the beginning of its journey through the web.
The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

- Python 3.11+ installed on your machine
- AWS credentials configured (aws configure or an active IAM role)
- Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
- kubectl and helm v3 installed — only required if you plan to run live remediations. Dry-run mode works without them.

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

Shell
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

Plain Text
sre-incident-response-agent/
├── sre_agent.py          # Main agent: 4 agents + 8 tools
├── test_sre_agent.py     # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

Shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt pins the core dependencies:

Plain Text
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

Shell
cp .env.example .env

Open .env and set the following:

Shell
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=

Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

JSON
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

Shell
python sre_agent.py

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

Shell
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"

Example Output

Running the targeted trigger above produces output similar to the following:

Shell
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
Metric stats: avg 91.3%, max 97.8% over last 30 min
Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
Root cause: Memory leak causing CPU spike as GC thrashes
Severity: P2 - single service, <5% of users affected
Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
[DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened:
CloudWatch alarm my-api-HighCPU fired at 09:18 UTC. CPU reached
97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause:
Memory leak in application heap leading to aggressive GC, causing
CPU saturation. Likely introduced in the last deployment.

Remediation:
Rolling restart of deployment/my-api in namespace prod initiated
(dry-run). All pods will be replaced with fresh instances.

Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

Shell
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed

Enabling Live Remediation

Once you have validated the agent’s behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

Shell
DRY_RUN=false

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning.

From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away. I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.
Testing GET requests is a fundamental part of API automation, ensuring that endpoints return the expected data and status codes. With REST Assured in Java, sending GET requests with query and path parameters, extracting data, verifying the status code, and validating the response body is quite simple. This tutorial walks through practical approaches to efficiently test GET APIs and build reliable automated checks, including:

- Basic GET Request (Simplest)
- Using Query Parameters
- Using Map for Query Params
- Using Path Parameters
- Using Headers (Auth, Content-Type, etc.)
- Extracting Response
- Using Validations with GET
- Using Authentication (Basic Auth Example)

In earlier tutorials, topics such as API automation for POST requests, response verification, data-driven testing, and more were covered.

Application Under Test

We will be using the following GET APIs from the RESTful e-commerce demo application to write the GET API request tests.

GET /getAllOrders

The GET /getAllOrders API returns the list of all the available orders in the system. The following is the response body of this API:

JSON
{
  "message": "Orders fetched successfully!",
  "orders": [
    {
      "user_id": "string",
      "product_id": "string",
      "product_name": "string",
      "product_amount": 0,
      "qty": 0,
      "tax_amt": 0,
      "total_amt": 0
    }
  ]
}

GET /getOrder

The GET /getOrder API returns the single order for the optional query param supplied for “order id,” “user id,” or “product id.” The following response is returned:

JSON
{
  "message": "Order found!!",
  "orders": [
    {
      "user_id": "string",
      "product_id": "string",
      "product_name": "string",
      "product_amount": 0,
      "qty": 0,
      "tax_amt": 0,
      "total_amt": 0
    }
  ]
}

Sending a GET Request Using REST-Assured Java

The following is the simplest code that could be written to test a GET /getAllOrders endpoint with REST-Assured Java:

Java
@Test
public void testGetAllOrders() {
    given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200);
}

This test method demonstrates a basic GET request using REST Assured to verify an API endpoint.

- given() is the starting point where request specifications (like headers, params, and auth) can be defined. In this case, it’s empty since no additional setup is needed.
- when() specifies the action to be performed, here, sending the request.
- get("http://localhost:3004/getAllOrders") sends a GET request to the specified endpoint to retrieve all orders.
- then() is used to validate the response.
- statusCode(200) asserts that the API responds with HTTP status code 200 (OK), confirming a successful request.

In simple terms, this test checks if the Get All Orders API is reachable and returns a successful response.

Sending a GET Request With Query Parameters

The GET request can be sent using query parameters, which play an important role in filtering, sorting, and customizing the data returned by an API. They allow clients to request only the specific information needed, making API interactions more efficient and flexible.

Java
@Test
public void testGetOrderWithQueryParam() {
    given().when()
        .log().all()
        .queryParam("id", 1)
        .get("http://localhost:3004/getOrder")
        .then()
        .log().all()
        .statusCode(200)
        .and()
        .body("orders[0].id", equalTo(1));
}

The testGetOrderWithQueryParam() test method sends a GET request to the /getOrder API endpoint using a query parameter and validates the response.
queryParam("id", 1) adds a query parameter to the request, making the final URL: http://localhost:3004/getOrder?id=1get("http://localhost:3004/getOrder") sends the GET request to fetch the order with id = 1.statusCode(200) verifies that the request was successful.and().body("orders[0].id", equalTo(1)) validates that the first item in the orders array has an id of 1. This confirms that the order requested via the query parameter is correctly fetched in the response. This test not only sends a GET request with a query parameter but also ensures that the correct data is returned in the response. Multiple Query Params The queryParams() method in REST Assured allows adding multiple parameters. For example, if we need to filter the records using order_id, user_id, and product_id, we can supply the query parameters as shown below: Java @Test public void testGetOrderWithMultipleQueryParam () { given ().when () .log () .all () .queryParams ("id", 1, "user_id", "1", "product_id", "1") .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } Similarly, we can also add the different query parameters by calling the queryParam() method multiple times, as shown in the test below: Java @Test public void testGetOrderWithMultipleQueryParameters () { given ().when () .log () .all () .queryParam ("id", 1) .queryParam ("user_id", "1") .queryParam ("product_id", "1") .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } Both approaches are correct; however, as a best practice, we can use Java Map to handle multiple query parameters. This approach is especially useful when dealing with dynamic or large sets of parameters, as all key pairs can be stored in a Map and passed in a single step using queryParams(map) as shown in the code below: Java @Test public void testGetOrderWithMultipleQueryParamWithMap () { Map<String, Object> queryParams = new HashMap<> (); queryParams.put ("id", 1); queryParams.put ("user_id", "1"); queryParams.put ("product_id", "1"); given ().when () .log () .all () .queryParams (queryParams) .get ("http://localhost:3004/getOrder") .then () .log () .all () .statusCode (200) .and () .body ("orders[0].id", equalTo (1)); } A Map<String, Object> queryParams is used to store multiple query parameters.queryParams(queryParams) automatically appends all key-value pairs to the URL.The final request URL would look like: http://localhost:3004/getOrder?id=1&user&id=1&product_id=1 Calling the log().all() method before the queryParams() method is super helpful in logging the request in the console, which helps in understanding how the query parameters are passed in the request. Sending a GET Request With Path Parameters The GET request can be sent using path parameters, which are essential for accessing specific resources directly within the API endpoint. They are typically used to uniquely identify a resource, such as an order ID or user ID, making the request more intuitive and RESTful. Let’s take an example of the GET — GetBooking API from the RESTful-Booker demo application. It fetches the booking details directly using the Path Param. 
The following curl command can be used to import the GET /booking API in Postman:

Plain Text
curl -i https://restful-booker.herokuapp.com/booking/1

The following test script is used for fetching the booking record using path params:

Java
@Test
public void testGetBookingWithPathParam() {
    given().when()
        .log().all()
        .pathParam("id", 3)
        .get("https://restful-booker.herokuapp.com/booking/{id}")
        .then()
        .log().all()
        .statusCode(200);
}

The testGetBookingWithPathParam() test method demonstrates how to use a path parameter in a GET request with REST Assured.

- pathParam("id", 3) defines a path parameter named id with the value 3.
- get("https://restful-booker.herokuapp.com/booking/{id}") sends the GET request. Here, {id} is a placeholder in the URL, and REST Assured replaces it with the value 3, making the final request: https://restful-booker.herokuapp.com/booking/3
- Finally, an assertion is performed to verify that the GET request was successfully sent and that it returned a successful response with a 200 OK status.

Using the path param, specific resources can be dynamically accessed and validated by passing values directly within the endpoint URL.

Using Headers in the GET Requests

Headers in a GET request are used to pass additional information, such as authentication tokens, content type, and client details, to the server. They play an important role in securing APIs (e.g., Authorization headers) and ensuring the server understands how to process the request.

Authorization Token in the Header

Java
@Test
public void testAuthHeader() {
    given().header("Authorization", "Bearer my-token-123")
        .when()
        .get("https://httpbin.org/bearer")
        .then()
        .log().all()
        .statusCode(200);
}

The testAuthHeader() test method demonstrates how to send a GET request with an Authorization header using REST Assured.

- header("Authorization", "Bearer my-token-123") adds an Authorization header with a Bearer token, which is commonly used for securing APIs. This tells the server that the request is authenticated. The httpbin /bearer endpoint accepts any well-formed bearer token; if the header is missing or malformed, it responds with a 401 Unauthorized status.
- get("https://httpbin.org/bearer") sends the GET request to the endpoint that validates Bearer token authentication.
- statusCode(200) verifies that the request was successful, meaning the token was accepted.

Similarly, a negative test can be written for the GET request, omitting the bearer token and verifying that a 401 status is returned in response (see the sketch below, after the multiple-headers example).

Adding Multiple Headers

The “Content-Type” or “Accept” headers can also be supplied to specify the format of the request and response, such as JSON or XML, ensuring proper communication between the client and server. We can use a Java Map to add multiple headers and pass it to the test as shown in the test script below:

Java
@Test
public void testGetAllOrdersWithHeaders() {
    Map<String, String> headers = new HashMap<>();
    headers.put("Content-Type", "application/json");
    headers.put("Accept", "application/json");

    given().headers(headers)
        .when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200);
}

The testGetAllOrdersWithHeaders() test demonstrates how to send a GET request with multiple headers using a Java Map in REST Assured. The headers (Content-Type and Accept) are stored in a Map and passed using .headers(headers). The test then sends a request to fetch all orders and verifies that the API responds with a 200 OK status.
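Picking up the negative scenario mentioned above, the following is a minimal sketch of a test that sends no Authorization header at all to the same httpbin endpoint, which then responds with 401 Unauthorized:

Java
@Test
public void testMissingAuthHeaderReturns401() {
    // No Authorization header is supplied, so the /bearer endpoint
    // rejects the request with 401 Unauthorized.
    given().when()
        .get("https://httpbin.org/bearer")
        .then()
        .log().all()
        .statusCode(401);
}

The same pattern applies to any protected endpoint: assert the failure status explicitly rather than testing only the happy path.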
Extracting Response Body and Values

Extracting the response body from a GET request allows capturing and reusing API data for further validations or chaining requests. Using REST Assured, the values can be extracted using methods like .extract().response() or by directly fetching specific fields using JSON path. This is especially useful for validating dynamic data and passing values between API calls in end-to-end test scenarios.

Extracting the Response Body

Java
@Test
public void testExtractResponseBody() {
    String responseBody = given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200)
        .extract()
        .response()
        .asString();

    System.out.println(responseBody);
}

The testExtractResponseBody() method demonstrates how to extract the full response body from a GET request in REST Assured.

- extract().response().asString() extracts the complete response body, converts it into a string, and stores it in the responseBody variable for further use.
- System.out.println(responseBody); prints the response to the console.

Extracting a Specific Field Value From the Response Body

While working with test automation, there are scenarios where we need to extract a specific field value from the response for further use in the test. A classic example is end-to-end testing, where we need the order ID to update or delete an order.

Java
@Test
public void testExtractFieldValueFromResponse() {
    int orderId = given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200)
        .extract()
        .response()
        .path("orders[0].id");

    System.out.println("Order id is: " + orderId);
}

The testExtractFieldValueFromResponse() test method extracts the order ID of the first order from the orders array in the response.

- extract().response().path("orders[0].id") extracts a specific value from the response body using a JSON path. In this case, the ID of the first order in the orders array.
- The extracted value is stored in the orderId variable for further use in the test.
- System.out.println(...) prints the extracted order ID to the console.

Once the value is extracted into a variable, it can be reused anywhere in the test. The variable can also be declared as a global variable to reuse the value in multiple tests within the same class.

Using Validations With GET Requests

Using validations on the responses from GET requests ensures the API returns the correct, expected data. Status codes, response headers, response time, and the content of the response body can be validated. These validations help confirm both the functional correctness and performance of the API.

Validating the Response Headers

Validating a response header ensures that the API returns the expected metadata, such as Content-Type, confirming the response format is correct.

Java
@Test
public void testVerifyResponseHeader() {
    given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .headers("Content-Type", "application/json; charset=utf-8")
        .statusCode(200);
}

The testVerifyResponseHeader() test validates both the response header and the status code to ensure the API behaves as expected. The .headers("Content-Type", "application/json; charset=utf-8") verifies that the response contains the expected Content-Type header value.

Validating the Response Time

The response time can also be validated using REST Assured to monitor and ensure optimal API performance.
Java
@Test
public void testVerifyResponseTime() {
    given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200)
        .time(lessThan(500L), TimeUnit.MILLISECONDS);
}

The testVerifyResponseTime() test method ensures the API responds successfully and within an acceptable time (500 milliseconds). This test sends a GET request to the /getAllOrders endpoint and performs two assertions: the first one verifies that the 200 OK status is returned, and the other verifies the response time. The statement .time(lessThan(500L), TimeUnit.MILLISECONDS) validates the response time and ensures the response is received in less than 500 milliseconds.

The measured time includes:

- Request transmission
- Response reception
- Assertion/validation overhead

If we need to measure just the time for sending the request and receiving the response, the following test script can be used:

Java
@Test
public void testResponseTime() {
    Response response = given().get("http://localhost:3004/getAllOrders");
    System.out.println(response.getTimeIn(TimeUnit.MILLISECONDS));
}

The testResponseTime() method sends the GET /getAllOrders request and prints the raw response time.

Validating the Response Size

Validating the response size ensures the API returns the expected size of data and helps detect issues like incomplete or excessively large payloads that may impact performance.

Java
@Test
public void testVerifyResponseSize() {
    given().when()
        .get("http://localhost:3004/getAllOrders")
        .then()
        .statusCode(200)
        .body("orders.size()", greaterThan(0));
}

The testVerifyResponseSize() method verifies both the success of the API call and that the correct size of data is returned in the response.

- given().when().get(...) sends a GET request to fetch all orders from the API.
- .then().statusCode(200) verifies that the response is successful.
- .body("orders.size()", greaterThan(0)) validates that the orders array in the response contains at least one item, ensuring the response is not empty.

The greaterThan() is a static method used from the Hamcrest library. Similarly, response size verification can be performed by using equalTo(), lessThan(), and other such validations from the Hamcrest library, depending on the use case.

Validating the Response Body

Validating the response body ensures that the API returns the correct data and values as expected. It is a core part of functional testing that ensures the API behaves as expected and returns accurate and reliable data.

Java
@Test
public void testResponseBody() {
    given().when()
        .queryParam("id", 1)
        .get("http://localhost:3004/getOrder")
        .then()
        .statusCode(200)
        .and()
        .body("message", equalTo("Order found!!"))
        .body("orders[0].id", notNullValue())
        .body("orders[0].id", equalTo(1));
}

The testResponseBody() test method sends the GET /getOrder request by adding a query parameter to fetch a specific order, making the request URL: http://localhost:3004/getOrder?id=1. This method performs assertions on the response body using the following three statements, one by one:

- .body("message", equalTo("Order found!!")) validates that the response contains the expected message.
- .body("orders[0].id", notNullValue()) ensures that the ID in the first order object is not null.
- .body("orders[0].id", equalTo(1)) checks that the returned order has the ID 1.

In this test, verification of the response body is done step by step as per the code statements, so if the first verification fails, the remaining assertions will not be executed, causing the test to fail immediately.
REST Assured also allows adding multiple verification statements in a single statement, as shown in the code below:

Java
@Test
public void testResponseBodyMultipleAssertions() {
    given().when()
        .queryParam("id", 1)
        .get("http://localhost:3004/getOrder")
        .then()
        .statusCode(200)
        .and()
        .body("message", equalTo("Order found!!"),
              "orders[0].id", notNullValue(),
              "orders[0].id", equalTo(1));
}

The testResponseBodyMultipleAssertions() method performs multiple validations using a single .body() statement.

- "message", equalTo("Order found!!") validates the response message.
- "orders[0].id", notNullValue() ensures the order ID is present.
- "orders[0].id", equalTo(1) verifies the correct order is returned.

Using multiple assertions within a single .body() keeps related validations grouped and reduces repetitive code. It also makes the test more concise while ensuring that multiple aspects of the response are validated in one place.

Using Authentication With GET Requests

Using authentication with GET requests ensures that only authorized users can access protected API resources. Common methods include Basic Auth, Bearer tokens, and API keys, which are typically passed through headers. Incorporating authentication in tests helps validate both the security and access control mechanisms of the API.

Java
@Test
public void testBasicAuthWithGetRequest() {
    given().auth()
        .basic("user", "passwd")
        .when()
        .get("https://httpbin.org/basic-auth/user/passwd")
        .then()
        .statusCode(200);
}

The testBasicAuthWithGetRequest() method demonstrates how to send a GET request with Basic Authentication and verify successful access to a secured API endpoint.

- given().auth().basic("user", "passwd") sets up Basic Authentication by sending the username and password with the request. Here, the username is “user” and the password is “passwd”.
- when().get("https://httpbin.org/basic-auth/user/passwd") sends a GET request to an endpoint that requires Basic Auth credentials. The URL includes the username and password only because this specific API is designed to validate them from the path.
- then().statusCode(200) verifies that the request was successful, meaning the provided credentials were valid.

In short, this test checks whether an API protected by Basic Authentication can be accessed using the correct username and password.

Summary

Testing GET API requests with REST Assured is an efficient way to validate API functionality. By covering scenarios such as query parameters, path parameters, headers, authentication, and validations, it ensures that the API returns accurate and expected responses. In my experience, while testing a GET API request, it is important to consider negative scenarios, such as validating different status codes when no record is available, using invalid query and path parameters, and adding appropriate assertions.

Verifying the response body is a core part of functional testing, ensuring that the API returns accurate and expected data. Additionally, validations for response size, response time, and headers should be included to ensure thorough verification of GET requests.

Happy testing!!
Most SAP HANA migration failures are not correctness failures. They are plan stability failures that surface only under concurrency. A query that executes in 900 milliseconds in isolation begins to oscillate between 800 milliseconds and 14 seconds under load, with no code change and no data skew obvious enough to blame. The root cause is rarely hardware or memory configuration. In most cases, PlanViz shows large intermediate row counts forming before reduction, with estimated cardinality significantly below actual. The instability originates from translating legacy EDW logic into SAP HANA artifacts without redesigning execution boundaries for a columnar, operator-driven engine.

Pushdown-first modernization is often interpreted as "move everything into SQL." That interpretation is incomplete. The actual problem is not about moving logic downward; it is about controlling how the calculation engine constructs and reuses execution graphs under varying runtime conditions. When SQLScript procedures and calculation views are designed without regard to grain stabilization, operator ordering, and cardinality propagation, the resulting plans remain syntactically valid but produce workload-sensitive operator graphs whose memory footprint shifts with parameter selectivity.

This article dissects the mechanics behind execution-plan stability in SAP HANA migrations, focusing on SQLScript procedures and Calculation Views as first-class architectural units.

The Architectural Shift: From Staged ETL to Operator Graph Execution

Traditional EDW pipelines relied on staged transformations. Each step materialized an intermediate state, often writing into persistent tables between transformations. That staging introduced natural grain boundaries. Joins were resolved, aggregations were completed, and the next transformation consumed stable, reduced datasets.

In SAP HANA, Calculation Views and SQLScript table functions remove those materialization barriers. Logical transformations are fused into a single operator graph. PlanViz reveals this as a directed acyclic graph of projection, join, aggregation, and calculation nodes. The optimizer is free to reorder joins, push predicates downward, and defer aggregations. That freedom improves latency in well-designed models. It amplifies instability in poorly designed ones.

Consider a common migration pattern:

SQL
SELECT h.MATERIAL_ID, SUM(l.QUANTITY) AS TOTAL_QTY
FROM HEADER h
JOIN LINE_ITEM l ON h.DOC_ID = l.DOC_ID
WHERE h.POSTING_DATE BETWEEN :p_from AND :p_to
GROUP BY h.MATERIAL_ID;

Translated directly into a Calculation View, the join and aggregation nodes are placed without enforcing a grain reduction before high-cardinality joins. Under small parameter windows, the plan performs adequately. Under wide date ranges, the join produces a large intermediate result before aggregation collapses it. Memory amplification becomes workload-dependent.

In PlanViz, the join node frequently shows actual row counts an order of magnitude higher than estimated. For example, a date window spanning a quarter can produce 38 million intermediate rows before aggregation collapses the result to fewer than 300,000 grouped records. The aggregation node is inexpensive. The join node is not. Memory allocation occurs before reduction.

The legacy system relied on pre-aggregated staging tables to constrain that explosion. The HANA translation removed the staging but did not redesign the grain boundary.
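One way to restore that boundary without reintroducing persistent staging is to collapse LINE_ITEM to the document grain before the join. The following is a sketch of that rewrite for the query above, assuming each line item belongs to exactly one document; whether the optimizer actually preserves the boundary still has to be confirmed in PlanViz:

SQL
-- Pre-aggregate line items to the DOC_ID grain, so the join input
-- is bounded by document count rather than line-item count.
WITH line_totals AS (
    SELECT DOC_ID, SUM(QUANTITY) AS DOC_QTY
    FROM LINE_ITEM
    GROUP BY DOC_ID
)
SELECT h.MATERIAL_ID, SUM(lt.DOC_QTY) AS TOTAL_QTY
FROM HEADER h
JOIN line_totals lt ON h.DOC_ID = lt.DOC_ID
WHERE h.POSTING_DATE BETWEEN :p_from AND :p_to
GROUP BY h.MATERIAL_ID;

The result is identical for an inner join on DOC_ID, but the join now consumes one row per document instead of one row per line item.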
Why Preserving Batch Semantics Breaks Under Concurrency

In staged ETL systems, concurrency was limited. Batch windows were serialized. Execution plans operated with predictable resource envelopes. HANA environments operate with interactive workloads, overlapping parameter combinations, and mixed analytic demands.

An SQLScript procedure frequently encapsulates logic like this:

SQL
lt_filtered = SELECT * FROM SALES WHERE REGION = :p_region;

lt_enriched = SELECT f.*, d.CATEGORY
              FROM :lt_filtered AS f
              JOIN DIM_PRODUCT d ON f.PRODUCT_ID = d.PRODUCT_ID;

lt_aggregated = SELECT CATEGORY, SUM(AMOUNT) AS TOTAL
                FROM :lt_enriched
                GROUP BY CATEGORY;

SELECT * FROM :lt_aggregated;

Syntactically, the intermediate variables imply sequencing. In practice, the optimizer inlines these operations. If REGION is not highly selective, the join with DIM_PRODUCT expands cardinality before aggregation. Under multiple concurrent sessions with varying region selectivity, the same operator graph is reused while actual cardinality diverges across sessions. One session may process 2 million rows, another 40 million. Each constructs its own hash structures while the plan shape remains identical.

Plan instability emerges from estimation drift, not code defects. Batch semantics assumed a stable data distribution. Interactive concurrency invalidates that assumption.

Grain Stabilization as a First-Class Design Constraint

Execution-plan stability in HANA depends on reducing cardinality before high-cost joins. That principle is mechanical, not stylistic. Instead of joining at the transaction grain and aggregating afterward, redesign the model to collapse the grain first:

SQL
lt_reduced = SELECT PRODUCT_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
             FROM SALES
             WHERE REGION = :p_region
             GROUP BY PRODUCT_ID;

SELECT r.PRODUCT_ID, d.CATEGORY, r.TOTAL_AMOUNT
FROM :lt_reduced AS r
JOIN DIM_PRODUCT d ON r.PRODUCT_ID = d.PRODUCT_ID;

This change enforces aggregation before dimensional enrichment. The intermediate dataset shrinks before the join. In PlanViz, the aggregation node now executes before dimensional enrichment, reducing the intermediate row count from tens of millions to low single-digit millions before the join. Hash table size contracts accordingly, and runtime variance narrows under concurrency.

Within calculation views, this requires explicit modeling:

- Aggregation nodes placed before join nodes
- Join cardinality correctly annotated
- Star-join semantics avoided for high-variance fact tables

Without explicit grain control, the optimizer may defer aggregation for cost-based reasons that are correct for one parameter distribution and catastrophic for another. Pushdown-first modernization must include grain-first redesign.

Calculation Views: Join Cardinality and Engine Transitions

Graphical Calculation Views introduce another source of instability: cardinality metadata and engine transitions. When join cardinality is left as "n..m," the optimizer assumes worst-case explosion. When incorrectly set as "1..1," it may reorder joins aggressively and defer filtering. Both mistakes alter the plan's shape.

A frequent migration pattern is to replicate legacy multi-join views into a single Calculation View with multiple projection nodes feeding a central join node. Under load, the join engine allocates hash tables proportional to pre-aggregation cardinality. If aggregation nodes sit above that join, each concurrent session constructs its own large intermediate state before reduction, multiplying memory pressure across sessions.
Execution-plan stability requires:

- Accurate cardinality annotation
- Projection pruning enabled
- Calculated columns minimized before aggregation
- Table functions used sparingly, and only when logic cannot be expressed declaratively

Table functions introduce optimization boundaries. When overused, they prevent join reordering and predicate pushdown across function boundaries, fragmenting the operator graph.

SQLScript Procedures and Optimization Boundaries

SQLScript introduces imperative constructs that can fragment optimization. For example:

SQL
IF :p_flag = 'Y' THEN
    SELECT ...
ELSE
    SELECT ...
END IF;

Branching logic produces separate subplans. Under concurrency, plan cache fragmentation increases. Each branch may generate a distinct plan variant, multiplying the memory footprint. Similarly, cursor-based loops imported from legacy logic disable set-based optimization. Even when pushdown is nominally achieved, the presence of row-by-row constructs forces materialization.

Execution stability improves when:

- Set-based transformations replace procedural loops
- Conditional logic is expressed via predicates rather than branches
- Intermediate variables are minimized to avoid implicit materialization

The goal is a single coherent operator graph with predictable cardinality flow.

Observability: PlanViz as a Stability Instrument

PlanViz is not a tuning tool alone. It is a stability diagnostic instrument.

Stable models show:

- Early aggregation nodes
- Reduced intermediate row counts after each operator
- Limited engine transitions between OLAP and join engines
- Consistent estimated vs. actual row counts

Unstable models show:

- Large intermediate nodes before aggregation
- High variance between estimated and actual cardinalities
- Multiple hash join operators with spill risk
- Repeated plan variants under similar parameter shapes

Stability is observed by running parameter sweeps under controlled concurrency and comparing plan shapes, not just runtimes.

State Amplification Under Concurrent Workloads

When intermediate result sets scale with the input window size, concurrent sessions amplify state multiplicatively. If one session produces 200 million intermediate rows before aggregation and five sessions overlap, each constructs its own intermediate state, causing cumulative memory allocation that triggers throttling or spill behavior despite acceptable single-session performance.

Stabilized models collapse grain early, producing intermediate datasets proportional to grouped dimensions rather than raw transaction volume. Concurrency then scales linearly instead of exponentially. This distinction is architectural. It cannot be solved with indexes, hints, or hardware.

Engineering Stability Instead of Translating Logic

Most unstable migrations are not slow because SAP HANA is inefficient. They are unstable because the reduction was deferred. When aggregation happens after cardinality amplification, the intermediate state scales with raw transaction volume. Under concurrency, that decision multiplies memory pressure across sessions. The system behaves exactly as modeled.

Pushdown-first modernization succeeds when reduction precedes enrichment and when the operator graph is engineered for concurrency, not just correctness.
A Practical Guide

In the first part, I covered the two initial signals to diagnose that something is wrong:

- Latency
- Traffic

Those two alone explain a surprising number of production incidents. But they don’t explain everything. Rising latency tells you a problem is developing. Traffic tells you what the system is dealing with. I mentioned two more signals:

- Errors
- Saturation

These two tell you something more important: whether the system is approaching failure. And this is where monitoring becomes truly operational. I will cover those two signals in this blog. Let us start with errors.

Errors - The Most Misunderstood Signal

Many teams think error monitoring is simple. It is about counting failures. Raise an alert when they increase. In practice, error metrics are rarely that straightforward.

The first mistake teams make is treating all errors as equal. They are not. Some errors are expected, and some errors are harmless. Others indicate an outage in progress. Monitoring must differentiate between them. Otherwise alerts become noise. And noisy alerts get ignored, which defeats the entire purpose. I have seen production systems where engineers simply muted error alerts because they fired every few hours.

Error Rate Is More Important Than Error Count

Raw error counts are misleading. Are ten errors per minute catastrophic or irrelevant? It depends on traffic. If you process:

- 100 requests per minute → 10 errors = disaster
- 100,000 requests per minute → 10 errors = background noise

Error rate is what matters. A simple production alert divides failed requests by total requests over a window and fires when the ratio crosses a threshold. It means alert when:

Error rate > 2%

This works far better than static thresholds because it scales automatically with traffic.

4xx vs 5xx - Critical Distinction

One of the most common monitoring mistakes is combining 4xx and 5xx errors. They represent completely different problems. Let me talk through them.

5xx errors

These indicate system failures:

- Exceptions
- Timeouts
- Dependency failures
- Resource exhaustion

5xx errors should almost always trigger alerts. They mean the system is failing users.

4xx errors

These usually indicate client behaviour:

- Invalid input
- Authentication failures
- Missing resources

Most of the time, 4xx errors should not page engineers. But they should still be monitored. Their spikes often reveal integration problems:

- Partner systems misbehaving
- Clients sending unexpected requests
- Sometimes bots discovering your APIs

I once saw a system where 40% of traffic suddenly became 401 responses. Nothing was broken in my service. A client service had deployed a change with an incorrect token configuration. The service was healthy. The integration was not. Without separate 4xx monitoring we would never have noticed.

Error Budget Thinking

Once services mature, error monitoring becomes less about incidents and more about error budgets. Instead of asking “Did we have errors?” you ask “Did we exceed acceptable failure levels?”

Example SLO: 99.9% success rate. That allows 0.1% failure.

Error budgets prevent overreaction to minor fluctuations. Without them, teams end up firefighting dashboards instead of protecting user experience. In most post-mortems, latency and errors are symptoms. Saturation is usually the cause. Let us move to the next indicator: saturation.

Saturation — Where Failures Actually Begin

If latency is the early warning signal, saturation is the root cause signal. Most production outages start with a resource limit somewhere. I am not necessarily talking about CPU or memory.
I am talking about less obvious resources like thread pools, connection pools, queue consumers, file descriptors, and rate limits. These limits quietly fill up until requests start waiting and then timing out. Then they start failing. By the time error rates increase, saturation has usually been happening for a while.

CPU and Memory - Necessary but Not Enough

Infrastructure metrics still matter. They just don’t tell the whole story. Monitor:

- CPU utilization
- Memory usage
- Disk I/O
- Network throughput

Example:

rate(container_cpu_usage_seconds_total[1m])

and:

container_memory_usage_bytes

The Metrics That Break Systems Most Often

As I mentioned in my previous blog, you need effective metrics. In this section I will list a few metrics that can prove useful.

Connection Pool Usage

Monitor connection pool usage. When a connection pool fills up, requests queue internally, latency increases, timeouts appear, and errors follow. In this scenario CPU can still be at 30%. Memory can still be healthy. The service still looks “green.” Except users are waiting seconds for responses.

Example — Monitoring a connection pool

Micrometer automatically exposes Hikari metrics:

- hikaricp_connections_active
- hikaricp_connections_idle
- hikaricp_connections_pending

The critical one is hikaricp_connections_pending. If pending connections increase steadily, saturation is approaching and action is needed.

Kubernetes Saturation Signals

Container platforms introduce new saturation points. An important metric to monitor is:

kube_pod_container_status_restarts_total

Restarts indicate instability. And:

container_cpu_cfs_throttled_seconds_total

CPU throttling causes latency spikes even when CPU usage looks normal. That one surprises a lot of teams.

Dependency Metrics — The Missing Visibility Layer

Most services are only as reliable as their dependencies: databases, caches, APIs, queues, and third-party integrations. When dependencies slow down, your service slows down. But if you only monitor your service, you won’t see the cause. You only see the symptoms. Dependency metrics close that gap. Without them, incident investigations turn into guesswork.

Downstream Latency Metrics

Every external call should have a latency metric. Even if the dependency is “reliable.” Especially then. Simple example:

Java
Timer.Sample sample = Timer.start(registry);
Response response = paymentClient.process(request);
sample.stop(registry.timer("payment.api.latency"));

During incidents, this metric often points directly at the problem.

Dependency Error Metrics

Track dependency failures separately. Example:

payment_api_errors_total

This helps answer: are we failing, or is the dependency failing? That distinction saves time during incidents.

Database Metrics — Where Many Incidents Begin

Databases rarely fail suddenly. They slowly degrade. I have seen these follow a pattern. First queries take slightly longer. Then pools begin filling. Then request latency increases. Then timeouts appear. The progression is almost always the same. Which means the signals are predictable.

Query Latency

Slow queries often trigger cascading failures. Track:

db_query_duration_seconds

Watch percentiles and not averages. The same rule applies as for service latency.

Connection Pool Usage

Database pools deserve dedicated dashboards. Track:

- db_connections_active
- db_connections_idle

Pool exhaustion is a classic outage pattern.

Lock Contention

Lock waits produce unpredictable latency spikes, especially under load.
Important metrics include:

- Lock wait time
- Deadlocks
- Blocked queries

These metrics explain incidents that otherwise look random.

Queue Metrics — The Early Warning

Event-driven systems fail differently and have a different pattern. Instead of request latency increasing, queues begin filling. Messages accumulate silently. Until delays become visible. Queue metrics often detect issues earlier than service metrics.

Queue Depth

Example metric:

messages_available

If depth increases steadily, it means something is wrong. Either:

- Producers too fast
- Consumers too slow
- Dependencies degraded

Queue depth is one of the most reliable early warning signals in distributed systems.

Consumer Lag

For streaming systems, lag is critical. Example:

kafka_consumer_lag

Lag increasing means consumers cannot keep up. Eventually processing delays impact users.

Pattern Worth Recognizing

After enough incidents you start recognizing patterns. One of the most common looks like this:

1. Dependency latency increases
2. Connection pools fill
3. Request latency increases
4. Queues grow
5. Errors appear

When you see that progression on dashboards, you already know the story before investigation begins. Good monitoring turns incidents into recognizable shapes. And recognizable shapes reduce stress during outages. Experienced engineers eventually learn that most outages are not mysterious. They follow patterns. That matters, because uncertainty is what makes incidents difficult, not complexity.

I hope you find these useful. I will continue the discussion in the final blog of this series.
Swift concurrency has fundamentally changed how we write asynchronous code, making it more readable and safer. However, the real world is still full of legacy APIs and SDKs that rely on completion handlers and delegates. You cannot simply rewrite every library overnight. This is where Continuations come in. They act as a powerful bridge, allowing us to wrap older asynchronous patterns into modern async functions, ensuring that our codebases remain clean and consistent even when dealing with legacy code.

The Challenge of Traditional Async Patterns

For years, iOS developers relied on two fundamental approaches for asynchronous operations: completion closures and delegate callbacks. Consider a typical network request using completion handlers:

Swift
func fetchUserData(completion: @escaping (User?, Error?) -> Void) {
    URLSession.shared.dataTask(with: url) { data, response, error in
        // Handle response in a different scope
        if let error = error {
            completion(nil, error)
            return
        }
        // Process data...
        completion(user, nil)
    }.resume()
}

Similarly, delegate patterns scatter logic across multiple methods:

Swift
class LocationManager: NSObject, CLLocationManagerDelegate {
    func locationManager(_ manager: CLLocationManager, didUpdateLocations locations: [CLLocation]) {
        // Handle success in one method
    }

    func locationManager(_ manager: CLLocationManager, didFailWithError error: Error) {
        // Handle failure in another method
    }
}

Both approaches share a critical weakness: they fragment your program’s control flow. Instead of reading code from top to bottom, developers must mentally jump between closures, delegate methods, and completion callbacks. This cognitive overhead breeds subtle bugs: forgetting to invoke a completion handler, calling it multiple times, or losing track of error paths through nested callbacks.

Bridging the Gap With Async/Await

Continuations transform these fragmented patterns into linear, readable code. They provide the missing link between callback-based APIs and Swift’s structured concurrency model. By wrapping legacy asynchronous operations, you can write code that suspends at natural points and resumes when results arrive, without modifying the underlying implementation. Here’s the transformation in action. Our callback-based network function becomes:

Swift
func fetchUserData() async throws -> User {
    try await withCheckedThrowingContinuation { continuation in
        URLSession.shared.dataTask(with: url) { data, response, error in
            if let error = error {
                continuation.resume(throwing: error)
                return
            }
            // Process and resume with result
            continuation.resume(returning: user)
        }.resume()
    }
}

Now calling code flows naturally:

Swift
do {
    let user = try await fetchUserData()
    let profile = try await fetchProfile(for: user)
    updateUI(with: profile)
} catch {
    showError(error)
}

Understanding Continuation Mechanics

A continuation represents a frozen moment in your program’s execution. When you mark a suspension point with await, Swift doesn’t simply pause and wait; it captures the entire execution context into a lightweight continuation object. This includes local variables, the program counter, and the call stack state.

This design enables Swift’s runtime to operate efficiently. Rather than dedicating one thread per asynchronous operation (the traditional approach that leads to thread explosion), the concurrency system maintains a thread pool sized to match your CPU cores. When a task suspends, its thread becomes available for other work.
When the task is ready to resume, the runtime uses any available thread to reconstruct the execution state from the continuation. Consider what happens during a network call:

Swift
func processData() async throws -> Result {
    let config = loadConfiguration()            // Runs immediately
    let data = try await downloadData()         // Suspends here
    let result = transform(data, with: config)  // Resumes here
    return result
}

At the await point, Swift creates a continuation capturing config and the program location. The current thread is freed for other tasks. When downloadData() completes, the runtime schedules resumption — but not necessarily on the same thread. The continuation ensures all local state travels with the execution, making thread switching transparent.

Manual Continuation Creation

Swift provides two continuation variants, each addressing different needs.

CheckedContinuation performs runtime validation, detecting common errors like resuming twice or forgetting to resume. This safety net makes it the default choice during development:

Swift
func getCurrentLocation() async throws -> CLLocation {
    try await withCheckedThrowingContinuation { continuation in
        let manager = CLLocationManager()
        manager.requestLocation()
        manager.locationHandler = { locations in
            if let location = locations.first {
                continuation.resume(returning: location)
            }
        }
        manager.errorHandler = { error in
            continuation.resume(throwing: error)
        }
    }
}

If you accidentally resume twice, you’ll see a runtime warning: SWIFT TASK CONTINUATION MISUSE: continuation resumed multiple times.

UnsafeContinuation removes these checks for maximum performance. Use it only in hot paths where profiling confirms the overhead matters, and you’ve thoroughly verified correctness:

Swift
func criticalOperation() async -> Result {
    await withUnsafeContinuation { continuation in
        performHighFrequencyCallback { result in
            continuation.resume(returning: result)
        }
    }
}

Working With Continuation Resume Methods

The continuation API enforces a strict contract: resume exactly once. This guarantee prevents resource leaks and ensures predictable execution.
Swift provides four resume methods to cover different scenarios: resume() for operations without return values: Swift func waitForAnimation() async { await withCheckedContinuation { continuation in UIView.animate(withDuration: 0.3, animations: { self.view.alpha = 0 }) { _ in continuation.resume() } } } resume(returning:) to provide a result: Swift func promptUser(message: String) async -> Bool { await withCheckedContinuation { continuation in let alert = UIAlertController(title: message, message: nil, preferredStyle: .alert) alert.addAction(UIAlertAction(title: "Yes", style: .default) { _ in continuation.resume(returning: true) }) alert.addAction(UIAlertAction(title: "No", style: .cancel) { _ in continuation.resume(returning: false) }) present(alert, animated: true) } } resume(throwing:) for error propagation: Swift func authenticateUser() async throws -> User { try await withCheckedThrowingContinuation { continuation in authService.login { result in switch result { case .success(let user): continuation.resume(returning: user) case .failure(let error): continuation.resume(throwing: error) } } } } resume(with:) as a convenient shorthand for Result types: Swift func loadImage(from url: URL) async throws -> UIImage { try await withCheckedThrowingContinuation { continuation in imageLoader.fetch(url) { result in continuation.resume(with: result) } } } Practical Integration Patterns When migrating real-world code, certain patterns emerge repeatedly. Here’s how to handle a delegate-based API with multiple possible outcomes: Swift class NotificationPermissionManager: NSObject, UNUserNotificationCenterDelegate { func requestPermission() async throws -> Bool { try await withCheckedThrowingContinuation { continuation in UNUserNotificationCenter.current().requestAuthorization(options: [.alert, .sound]) { granted, error in if let error = error { continuation.resume(throwing: error) } else { continuation.resume(returning: granted) } } } } } For callbacks that might never fire (like user cancellation), ensure you handle all paths: Swift func selectPhoto() async -> UIImage? { await withCheckedContinuation { continuation in let picker = UIImagePickerController() picker.didSelect = { image in continuation.resume(returning: image) } picker.didCancel = { continuation.resume(returning: nil) } present(picker, animated: true) } } Conclusion Continuations represent more than a compatibility layer; they embody Swift’s pragmatic approach to evolution. By providing clean integration between legacy and modern patterns, they enable gradual migration rather than forcing disruptive rewrites. As you encounter older APIs in your codebase, continuations offer a path forward that maintains both backward compatibility and forward-looking code quality. The safety guarantees of CheckedContinuation make experimentation low-risk, while UnsafeContinuation provides an escape hatch for proven, performance-critical code. Master these tools, and you’ll find that even the most callback-laden legacy code can integrate seamlessly into modern async workflows.
Apache Spark 4.0 represents a major evolutionary leap in the big data processing ecosystem. Released in 2025, this version introduces significant enhancements across SQL capabilities, Python integration, connectivity features, and overall performance. However, with great power comes great responsibility — migrating from Spark 3.x to Spark 4.0 requires careful planning due to several breaking changes that can impact your existing workloads. This comprehensive guide walks you through everything you need to know about the Spark 3 to Spark 4 migration journey. We'll cover what breaks in your existing code, what improvements you can leverage, and what changes are mandatory for a successful transition. Whether you're a data engineer, platform architect, or data scientist, this article provides practical insights to ensure a smooth migration path. Understanding the Spark 4.0 Release Timeline Before diving into the technical details, let's understand the release cadence: Apache Spark 4.0: Initial release in early 2025Spark 4.0.1: Scheduled for September 2025Spark 4.1.1: Planned for January 2026 This timeline is important because some features and breaking changes are being introduced progressively. For instance, the Log4j upgrade from 1.x to 2.x is being implemented in Spark 4.1, giving organizations additional time to prepare their logging configurations. What Breaks: Critical Breaking Changes Understanding breaking changes is crucial for migration planning. Here are the most impactful changes that will break your existing Spark 3.x workloads: 1. ANSI SQL Mode Enabled by Default This is arguably the most significant breaking change in Spark 4.0. The ANSI SQL compliance mode is now enabled by default, fundamentally changing how Spark handles errors and edge cases. What this means for your code: Division by zero: Previously returned NULL, now throws ArithmeticExceptionInvalid type casts: Previously returned NULL, now throws runtime exceptionsNumeric overflows: Previously wrapped around silently, now throws exceptionsInvalid date/timestamp operations: Now produce errors instead of NULL values Example of Breaking Behavior: Plain Text -- Spark 3.x behavior SELECT 10 / 0; -- Returns NULL -- Spark 4.0 behavior (ANSI mode default) SELECT 10 / 0; -- Throws ArithmeticException: Division by zero Migration Strategy: Plain Text # Temporary workaround (not recommended for long-term) spark.conf.set("spark.sql.ansi.enabled", "false") # Recommended: Update your code to handle edge cases SELECT CASE WHEN divisor = 0 THEN NULL ELSE numerator / divisor END as result Best Practice: Enable ANSI mode in your Spark 3.x environment before migration to identify problematic queries early. This proactive approach helps you address data quality issues before they become runtime exceptions in production. 2. Java 17 as Default Runtime Spark 4.0 requires Java 17 as the default runtime, with support for Java 21 also added. This is a mandatory change that affects your entire deployment infrastructure. 
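A quick way to confirm which JVM an existing PySpark session is actually running on is to ask the JVM itself through the py4j gateway. This is a diagnostic sketch only: the _jvm attribute is internal rather than a supported API, and it is unavailable when you connect through Spark Connect.
Python
# Diagnostic sketch: report the Spark version and the JVM the driver runs on.
# sparkContext._jvm is an internal py4j handle, not a stable API, and does not
# exist for Spark Connect sessions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-check").getOrCreate()
jvm = spark.sparkContext._jvm
print("Spark version:", spark.version)
print("Java version :", jvm.java.lang.System.getProperty("java.version"))
print("Java vendor  :", jvm.java.lang.System.getProperty("java.vendor"))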
Impact Areas: All Spark driver and executor processes must run on Java 17+Dependencies compiled for older Java versions may have compatibility issuesSome reflection-based code patterns may fail due to JDK module system changesGC tuning parameters may need adjustment for optimal performance Migration Checklist: Plain Text # Verify Java version on all cluster nodes java -version # Should show 17.x or higher # Update JAVA_HOME environment variable export JAVA_HOME=/path/to/java17 # Test all custom JARs and UDFs for Java 17 compatibility # Update build configurations (Maven/Gradle) to target Java 17 3. Apache Mesos Support Removed If your organization runs Spark on Apache Mesos, this is a mandatory migration. Spark 4.0 completely removes Mesos support. Migration Options: Kubernetes: The recommended path forward, especially for cloud-native deploymentsYARN: Suitable for Hadoop-centric environmentsStandalone Mode: For simpler deployments or development environments 4. CREATE TABLE Behavior Change The default behavior for CREATE TABLE statements without explicit format specification has changed: Plain Text -- Spark 3.x: Defaults to Hive format CREATE TABLE my_table (id INT, name STRING); -- Spark 4.0: Uses spark.sql.sources.default (typically Parquet) CREATE TABLE my_table (id INT, name STRING); Impact: Existing DDL scripts that rely on implicit Hive format may create tables in a different format, potentially breaking downstream consumers expecting Hive tables. Migration Fix: Plain Text -- Explicitly specify the format CREATE TABLE my_table (id INT, name STRING) USING HIVE; -- Or set the configuration to maintain old behavior spark.conf.set("spark.sql.sources.default", "hive") 5. Structured Streaming Trigger.Once Deprecation The Trigger.Once trigger in Structured Streaming is deprecated and will be removed in future versions. Plain Text # Deprecated approach query = df.writeStream \ .trigger(once=True) \ .start() # Recommended migration query = df.writeStream \ .trigger(availableNow=True) \ .start() Why this matters: Trigger.AvailableNow provides more predictable behavior for incremental batch processing, better checkpoint management, and improved reliability for exactly-once semantics. 6. Log4j 2.x Migration (Spark 4.1+) Starting from Spark 4.1, the logging framework migrates from Log4j 1.x to Log4j 2.x. This requires rewriting your log4j.properties files. Plain Text # Old log4j.properties format (Log4j 1.x) log4j.rootLogger=INFO, console log4j.appender.console=org.apache.log4j.ConsoleAppender # New log4j2.properties format (Log4j 2.x) rootLogger.level = INFO rootLogger.appenderRef.console.ref = Console appender.console.type = Console appender.console.name = Console What Improves: New Features and Enhancements Spark 4.0 brings exciting improvements that can significantly enhance your data engineering workflows. Here's what you can leverage after migration: 1. 
SQL Enhancements PIPE Syntax for Intuitive Transformations The new PIPE syntax (|>) allows chaining SQL transformations in a more readable, pipeline-like manner: Plain Text -- Traditional nested approach SELECT name, total_sales FROM ( SELECT name, SUM(amount) as total_sales FROM ( SELECT * FROM orders WHERE status = 'COMPLETED' ) filtered GROUP BY name ) aggregated WHERE total_sales > 1000; -- New PIPE syntax FROM orders |> WHERE status = 'COMPLETED' |> AGGREGATE SUM(amount) as total_sales GROUP BY name |> WHERE total_sales > 1000 |> SELECT name, total_sales; VARIANT Data Type for Semi-Structured Data The new VARIANT data type provides native support for semi-structured data like JSON, offering up to 8x performance improvement compared to string-based JSON handling: Plain Text -- Create table with VARIANT column CREATE TABLE events ( event_id BIGINT, event_data VARIANT ); -- Insert JSON data directly INSERT INTO events VALUES (1, '{"user": "john", "action": "click", "metadata": {"page": "home"}}'); -- Query with native path access (much faster than JSON functions) SELECT event_data:user::STRING as username, event_data:metadata:page::STRING as page FROM events; SQL Scripting with Control Flow Spark 4.0 introduces procedural SQL capabilities including variables, loops, and exception handling: Plain Text DECLARE total_count INT DEFAULT 0; DECLARE batch_size INT DEFAULT 1000; WHILE total_count < 10000 DO INSERT INTO target_table SELECT * FROM source_table LIMIT batch_size; SET total_count = total_count + batch_size; END WHILE; Parameterized Queries Enhanced security with named and unnamed parameter markers: Plain Text # Named parameters spark.sql("SELECT * FROM users WHERE id = :user_id AND status = :status", args={"user_id": 123, "status": "active"}) # Unnamed parameters spark.sql("SELECT * FROM users WHERE id = ? AND status = ?", args=[123, "active"]) String Collation Support Control string comparison behavior for locale-specific sorting and case sensitivity: Plain Text -- Case-insensitive comparison SELECT * FROM products WHERE name COLLATE 'UNICODE_CI' = 'iPhone'; 2.
Python (PySpark) Improvements Native Python Data Source API Create custom data sources entirely in Python without Scala/Java: Plain Text from pyspark.sql.datasource import DataSource, DataSourceReader class MyCustomDataSource(DataSource): @classmethod def name(cls): return "my_custom_source" def reader(self, schema): return MyCustomReader(schema) class MyCustomReader(DataSourceReader): def read(self, partition): # Your custom read logic yield {"id": 1, "value": "data"} # Register and use spark.dataSource.register(MyCustomDataSource) df = spark.read.format("my_custom_source").load() Polymorphic Python UDTFs Create table-valued functions that accept varying input schemas: Plain Text from pyspark.sql.functions import udtf @udtf(returnType="id: int, value: string, multiplied: int") class MultiplyAndExplode: def eval(self, id: int, value: str, factor: int): for i in range(factor): yield id, f"{value}_{i}", id * (i + 1) # Use in SQL spark.udtf.register("multiply_and_explode", MultiplyAndExplode) spark.sql("SELECT * FROM multiply_and_explode(1, 'test', 3)") Native Plotting with Plotly Visualize DataFrames directly without converting to pandas: Plain Text df = spark.sql("SELECT category, SUM(sales) as total FROM orders GROUP BY category") df.plot.bar(x="category", y="total") Lightweight PySpark Client A new 1.5 MB pyspark-client package for remote connectivity: Plain Text pip install pyspark-client from pyspark.sql import SparkSession spark = SparkSession.builder.remote("sc://my-spark-cluster:15002").getOrCreate() 3. Spark Connect Enhancements Spark Connect reaches near feature parity with Spark Classic, offering: Improved Python and Scala API compatibilityNew community clients for Go, Swift, and RustBetter error handling and debugging capabilitiesReduced deployment complexity 4. Structured Logging Framework Logs are now output as structured JSON for better observability: Plain Text { "ts": "2025-01-15T10:30:45.123Z", "level": "INFO", "msg": "Query completed", "context": { "queryId": "abc123", "duration_ms": 1234, "rows_processed": 1000000 } } This structured format enables: Easy integration with ELK Stack, Splunk, and DatadogAutomated alerting based on specific log fieldsBetter troubleshooting with rich metadata 5. Performance Optimizations Spark 4.0 delivers up to 30% performance improvements through: Enhanced Catalyst Optimizer: Better query plan generationImproved AQE: Smarter runtime adaptationsColumnar Execution: Better vectorized processingMemory Management: Reduced overhead and better cache utilizationShuffle Optimization: Smarter data movement across nodes 6. Arbitrary Stateful Processing V2 Enhanced state management for Structured Streaming: Plain Text def update_state(key, input_rows, state): current_sum = state.get() or 0 new_sum = current_sum + sum(row.value for row in input_rows) state.update(new_sum) return [(key, new_sum)] result = df.groupByKey(lambda x: x.key) \ .applyInPandasWithState( update_state, output_schema="key string, sum long", state_schema="sum_value long", mode="update" ) What's Mandatory: Required Changes for Migration Some changes in Spark 4.0 are not optional — they must be addressed for your applications to run correctly: 1. Java Runtime Upgrade Mandatory Action: Upgrade all cluster nodes to Java 17 or higher Plain Text # Verification steps echo $JAVA_HOME java -version # Should show 17.x or higher # Cluster-wide update (example for CDH/CDP) sudo update-alternatives --config java 2.
Mesos Migration (if applicable) Mandatory Action: Migrate to Kubernetes, YARN, or Standalone mode Plain Text # Example Kubernetes migration spark-submit \ --master k8s://https://kubernetes-master:6443 \ --deploy-mode cluster \ --conf spark.kubernetes.container.image=my-spark:4.0 \ my-application.py 3. Error Handling Updates Mandatory Action: Update code to handle new runtime exceptions from ANSI mode Plain Text # Python example with proper error handling try: result = spark.sql("SELECT 1/0").collect() except Exception as e: if "ArithmeticException" in str(e): # Handle division by zero gracefully result = None 4. Dependency Compatibility Verification Mandatory Action: Verify all third-party libraries work with Java 17 and Spark 4.0 APIs Plain Text # Create a compatibility test suite def test_dependencies(): # Test Delta Lake spark.read.format("delta").load("/path/to/delta") # Test custom UDFs from my_lib import custom_udf df.select(custom_udf("column")).show() # Test serialization df.rdd.map(lambda x: x).collect() Step-by-Step Migration Playbook Follow this structured approach for a successful migration: Phase 1: Assessment (Weeks 1-2) Inventory Current State: Document Spark versions, configurations, and deployment environmentsCatalog Dependencies: List all libraries, custom UDFs, and integrationsIdentify Workload Types: Categorize batch vs. streaming, SQL vs. DataFrame, etc.Review Breaking Changes: Map each breaking change to affected applications Phase 2: Preparation (Weeks 3-4) Enable ANSI Mode in Spark 3.x: Proactively identify problematic queriesUpgrade Java in Non-Production: Test Java 17 compatibilityUpdate Build Pipelines: Configure Maven/Gradle for Java 17Create Compatibility Test Suite: Automated tests for regression detection Phase 3: Testing (Weeks 5-8) Set Up Spark 4.0 Test Environment: Isolated cluster or Databricks Runtime 17.0+Port Critical Workloads: Start with non-critical pipelinesPerformance Benchmarking: Compare execution times and resource usageStreaming Job Validation: Test state recovery and checkpoint compatibility Phase 4: Deployment (Weeks 9-10) Blue-Green Deployment: Run Spark 3.x and 4.0 in parallelGradual Traffic Migration: Move workloads incrementallyMonitoring and Rollback Plan: Have clear criteria for rollback if neededDocumentation Update: Update runbooks and operational procedures Phase 5: Optimization (Ongoing) Adopt New Features: Gradually implement VARIANT, PIPE syntax, etc.Performance Tuning: Leverage new optimizationsRemove Workarounds: Phase out temporary compatibility configurations Common Migration Pitfalls and Solutions Pitfall 1: Silent Data Quality Issues Problem: ANSI mode reveals previously hidden data quality issues Solution: Use data profiling tools before migration to identify NULL-returning operations Pitfall 2: Checkpoint Incompatibility Problem: Streaming checkpoints from Spark 3.x may not work in Spark 4.0 Solution: Plan for checkpoint recreation or use stateless processing where possible Pitfall 3: UDF Performance Regression Problem: Some UDFs may perform differently on Java 17 Solution: Benchmark critical UDFs and consider rewriting with Arrow optimizations Pitfall 4: Third-Party Library Conflicts Problem: Libraries may have transitive dependencies on older Java versions Solution: Use dependency:tree analysis and shade conflicting dependencies Conclusion Migrating from Apache Spark 3.x to Spark 4.0 is a significant undertaking, but the benefits far outweigh the challenges. 
The new features—including VARIANT data type, PIPE syntax, native Python data sources, and substantial performance improvements—position Spark 4.0 as a compelling upgrade for modern data engineering workflows. The key to success lies in thorough preparation: understand the breaking changes, especially the ANSI mode default; verify Java 17 compatibility across your ecosystem; and plan for any infrastructure changes like Mesos migration. By following the phased migration approach outlined in this guide, you can minimize risk while maximizing the benefits of Spark 4.0. Remember that this migration is not just a version upgrade—it's an opportunity to modernize your data platform, improve data quality enforcement, and leverage state-of-the-art features that will drive efficiency for years to come. References
Apache Spark 4.0 Release Notes: https://spark.apache.org/releases/spark-release-4-0-0.html
Spark ANSI Mode Documentation: https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html
Databricks Apache Spark 4.0 Preview: https://www.databricks.com/blog/announcing-apache-spark-4
Apache Spark Migration Guide: https://spark.apache.org/docs/latest/migration-guide.html
Java 17 for Spark Users: https://docs.oracle.com/en/java/javase/17/migrate/getting-started.html
In the rapidly evolving field of algorithmic trading, I have observed that access to sophisticated strategies is typically limited to professional traders and large institutions. In my experience, most traditional systems demand deep market knowledge, continuous monitoring, and significant technical expertise, creating barriers that prevent everyday individuals from participating with confidence. Through this article, I share my practical experience designing and implementing a fully automated, AI-driven trading system intended to remove these constraints and allow users, regardless of trading experience or geographic location, to benefit from advanced trading strategies. The core innovation of the system I built is a hybrid signal fusion engine that combines the Relative Strength Index (RSI), time-series neural networks, and large language models such as Google Gemini and OpenAI GPT to generate, validate, and explain high-confidence trading signals. I implemented the platform on a robust Oracle Database backend with native multi-user support, enabling global users to securely connect their broker accounts and operate autonomously. In my implementation, the system performs real-time market analysis across 1, 5, and 15-minute timeframes, manages signal generation and trade execution, and maintains complete historical records without requiring manual intervention. In this article, I present a practical technical approach to implementing hybrid AI signal fusion within a secure, scalable, and multi-tenant architecture, based on my hands-on experience building and operating the platform, and contribute to broader efforts aimed at making advanced fintech solutions more accessible and reliable. The Challenge: Making Advanced Trading Truly Accessible Most retail traders lack the time or expertise to monitor markets, interpret indicators, or manage risk effectively. Conventional bots are either too simplistic (rule-based only) or too complex (requiring coding and constant tuning). Meanwhile, powerful AI techniques remain locked in institutional silos. The objective is to enable a system where users in any country — whether in London, Dubai, Singapore, or New York — can connect once and operate hands-off. No charts to monitor, no decisions to make. The platform runs 24/7 on a central server, analyzing multiple currency pairs and executing trades directly in each user’s broker account. This technical approach introduces innovation not only in signal generation but also in user experience and global scalability. System Overview and Multi-User Architecture The platform is built for scale and simplicity, using Oracle Database as the secure, high-performance core. Each user gets their own dedicated schema within the same database instance, ensuring complete data isolation while allowing efficient centralized processing. Key components include: Component Description Oracle Database Central analytical backend with multi-tenant schemas for secure user isolation. Asset Tables One set per user schema, organized by asset and timeframe (1m, 5m, 15m). Database Triggers Automatically calculate RSI on new data and flag potential signals for each user. DBMS Scheduler Jobs Manage bulk historical loads and refreshes across all user schemas. Neural Network Model Shared LSTM model analyzes recent price windows for predictive scoring. LLM Integration Gemini and GPT using python to evaluate chart data from 1-, 5-, and 15-minute timeframes. 
AI_SIGNALS Table Per-user table storing final buy/sell/hold decisions with full explanations. Auto-Execution Service Java-based service monitors each user’s AI_SIGNALS table and executes trades via their broker API. Trade History Tables Complete audit trail stored securely in each user’s schema. This multi-tenant design allows one powerful AI engine to serve thousands of users simultaneously without performance degradation, while maintaining strict privacy and compliance. The system comprises two primary interfaces supported by an Oracle analytical backend: 1. Real-Time Data Feed Interface Streams continuous live market data from multiple assets. Performs initial bulk data load for the last 24 hours for each timeframe (1, 5, and 15 minutes). Handles incremental updates per minute post-initialization. 2. Automated Trade Execution Interface Monitors generated signals from the Oracle database. Executes trades immediately via broker API using authenticated account credentials. Together, these interfaces enable a seamless pipeline from data acquisition to execution without manual intervention. 3. Data Flow Process Initialization Phase System loads 24-hour historical data via feeder interface. SQL verification ensures dataset completeness before enabling triggers. Streaming Phase Incremental updates arrive each minute; database triggers compute RSI on every new row and flag candidate signals for the downstream AI layers. 4. AI Signal Generation Layer Invocation of the Neural Network Layer A custom TensorFlow neural network model is called. Input features include RSI (already calculated by database triggers), moving averages, price volatility, and trend vectors derived from the latest market data across 1-, 5-, and 15-minute timeframes. The model outputs a structured prediction: a confidence score for “Buy,” “Sell,” or “Hold.” This provides a quantitative, data-driven probability based purely on historical patterns learned during training and periodic retraining (handled by Oracle Scheduler jobs). Invocation of the Generative AI (LLM) Layer The system queries large language models (Google Gemini and/or OpenAI GPT/ChatGPT) via their APIs. A carefully crafted prompt is sent that includes recent market summaries, current RSI values, trend information, and multi-timeframe data (e.g., the sample prompt for EUR/USD that describes RSI on 1-min, 5-min, and 15-min charts, support levels, and momentum). The LLM is instructed to act as an expert Forex analyst and respond in a strict structured format: Confidence: 9/10 Reasoning: 1. “The RSI is oversold on all three timeframes with clear bounce from support at 1.0850. Upward momentum is building on the 5-minute and 15-minute charts. No major news events are expected soon.” 2. “Price is holding above daily support and RSI shows oversold conditions across 1-min, 5-min and 15-min charts. Higher timeframes are aligned with bullish momentum. Risk appears low for a moderate-risk entry.” 3. Veto: YES/NO. This response adds qualitative, contextual reasoning and simulated market sentiment analysis that the pure neural network cannot capture.
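To make the LLM layer concrete, here is a minimal sketch of how such a prompt-and-parse step might look from Python. The google-generativeai package, the model name, and the RSI values in the prompt are illustrative assumptions (the article does not publish its exact client code), and the regular expressions simply mirror the structured reply format shown above.
Python
# Hypothetical sketch of the LLM validation step described above.
# Assumes the google-generativeai package and an API key in the environment;
# the prompt fields are placeholders, not production values.
import os
import re
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

prompt = (
    "Act as an expert Forex analyst for EUR/USD.\n"
    "RSI 1-min: 27, RSI 5-min: 29, RSI 15-min: 31. Support: 1.0850.\n"
    "Reply strictly as:\nConfidence: <n>/10\nReasoning: <text>\nVeto: YES/NO"
)

reply = model.generate_content(prompt).text

# Assumes the model followed the requested format; add fallbacks in real code.
confidence = int(re.search(r"Confidence:\s*(\d+)/10", reply).group(1))
veto = re.search(r"Veto:\s*(YES|NO)", reply, re.IGNORECASE).group(1).upper()
print(confidence, veto)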
Signal Fusion (Hybrid Decision Making) The system now has three independent signal sources: the primary technical signal (RSI-based, from database triggers), the neural network output (a quantitative confidence score), and the generative AI recommendation (structured confidence, reasoning, and veto flag). These are combined using a predefined weighted fusion logic with fixed weight ratios for each source. The fusion process resolves disagreements, reduces false positives, and produces a single consensus “final_signal” (BUY, SELL, or HOLD) with higher overall accuracy than any individual component; a simplified sketch of this fusion step appears at the end of this section. Storage of the Final Decision The resulting hybrid signal is inserted into the central AI_SIGNALS table in the Oracle database. This table serves as the authoritative source that the Automated Trade Execution Interface constantly polls. Once a valid high-confidence signal is detected in AI_SIGNALS, the execution module immediately places the trade via the broker API. In summary, the Signal Computation Phase transforms basic RSI triggers into a robust, multi-layered decision by sequentially invoking the neural network for pattern-based prediction, the LLM for contextual expert reasoning, fusing all three sources, and then persisting the final consensus signal in the AI_SIGNALS table for instant execution. This hybrid approach is the key innovation that achieves the documented 92% signal accuracy and 60% reduction in false signals compared to RSI-only methods. Key Innovations Category Innovation AI-Augmented Trading Combines rule-based RSI logic, predictive modeling, and contextual AI. Hybrid Signal Intelligence Multi-source fusion increases reliability. Database-Centric Design Core analytics handled within Oracle through triggers and schedulers. Continuous Learning Loop Post-trade feedback improves model accuracy. Generative Insight Integration Leverages LLMs for qualitative assessment of quantitative data. Results and Impact Preliminary backtesting shows: 92% signal accuracy compared to RSI-only models. Execution latency under one minute from data receipt to trade. Self-adapting AI behavior, automatically retraining from new market data. The system architecture can be scaled for institutional deployment or integrated into broader AI-based financial intelligence platforms. This seamless, hands-off operation across multiple countries and time zones addresses real user pain points while leveraging cutting-edge AI responsibly. Implementation Highlights Multi-Tenancy Expertise: Designed dynamic schema creation and row-level security to support users across jurisdictions while meeting data protection standards. Performance Optimization: Oracle triggers and scheduler ensure sub-second signal generation even with hundreds of active users. Backtesting Results: Across six months of historical data, the hybrid approach reduced false signals by 60% compared to RSI-only, improving overall profitability in simulated multi-user environments. Transparent and Explainable Trade Decisions Every trade includes LLM-generated plain-English notes (e.g., “Strong buy due to oversold conditions across all timeframes and supportive macro sentiment”), helping users learn over time. This hands-off platform addresses an underserved market of non-expert retail traders worldwide with a novel hybrid AI solution in a scalable, secure package.
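As referenced above, here is a simplified sketch of what a weighted fusion step could look like. The weights, the decision threshold, and the field names are illustrative assumptions; the article states that fusion uses predefined weight ratios but does not publish them.
Python
# Illustrative fusion of the three signal sources described above.
# The weights and the 0.6 threshold are assumptions, not the platform's values.
def fuse_signals(rsi_signal: str, nn_score: float, llm_confidence: int,
                 llm_veto: bool, weights=(0.3, 0.4, 0.3), threshold=0.6) -> str:
    """rsi_signal: 'BUY'/'SELL'/'HOLD'; nn_score: 0..1 probability of a buy;
    llm_confidence: 0..10; llm_veto: True blocks the trade outright."""
    if llm_veto:
        return "HOLD"
    rsi_component = {"BUY": 1.0, "HOLD": 0.5, "SELL": 0.0}[rsi_signal]
    llm_component = llm_confidence / 10.0
    score = (weights[0] * rsi_component
             + weights[1] * nn_score
             + weights[2] * llm_component)
    if score >= threshold:
        return "BUY"
    if score <= 1 - threshold:
        return "SELL"
    return "HOLD"

print(fuse_signals("BUY", nn_score=0.82, llm_confidence=9, llm_veto=False))  # BUY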
Benefits and Broader Impact True Accessibility: Anyone with a broker account can now use institutional-grade strategies — no knowledge required. Global Reach: Time-zone agnostic, supports users in 100+ countries with localized risk settings. Scalability: Capable of handling 500+ concurrent users with minimal resource increase. Conclusion By pioneering a hybrid fusion of RSI, neural networks, and large language models within a secure, multi-tenant Oracle platform, I have developed an AI-driven trading system that represents a meaningful step forward in making advanced fintech solutions more accessible. Based on my experience, the platform converts complex, expert-level trading strategies into a practical set-and-forget experience for users worldwide, while still preserving full auditability, transparency, and performance. Through this work, I aim to advance system design and innovation by demonstrating how multiple AI techniques can be integrated effectively within financial technology. By sharing this architecture openly, I contribute to the digital technology sector by offering a practical and replicable blueprint for building inclusive, intelligent automation platforms. I encourage fellow architects and developers to build upon this concept, whether by enhancing the fusion logic, adding new assets, or adapting it to stocks or crypto. Code Snippets 1. Neural Network Description: A deep learning model built with TensorFlow processes historical EUR/USD data to predict buy/sell signals. It uses normalized inputs, multiple dense layers, and early stopping for accuracy. The trained model evaluates performance via confusion matrix and predicts real-time trade direction on new data. 2. Gemini API Integration Description: Combines human-style reasoning via the Gemini API with technical indicators to validate AI-generated trade signals and reduce false positives. 3. Database Trading_History Table Description: The trading_history table logs every executed trade, capturing details such as asset pair, execution mode, timestamps, entry price, trade direction (HIGHER/LOWER), outcome (WIN/LOSE), trade amount, and profit or loss. It serves as the performance audit and analytics source for the automated trading system. 4. Database Asset Table Description: Defines the base structure for storing real-time market data at 1-, 5-, and 15-minute intervals for each of the 22 assets separately. Each record is used as input for RSI calculation and subsequent signal generation. 5. Database Trigger Description: Automatically computes the RSI for each new tick of market data and updates the results in the same table or a related RSI_MAIN_[Asset Name] table for each asset separately. This enables near real-time analytics directly in the database layer.
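The snippets themselves are not reproduced here, so as a stand-in, this is a minimal Python sketch of the RSI calculation that the database trigger described in snippet 5 performs. It uses the simple-average variant rather than Wilder's smoothing, and the 14-period window is the conventional default, not a value stated in the article.
Python
# Minimal RSI sketch (simple-average variant, 14-period window assumed).
# A stand-in for the in-database trigger logic described above.
def rsi(closes: list[float], period: int = 14) -> float:
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    # Price changes over the last `period` intervals
    deltas = [b - a for a, b in zip(closes[-period - 1:-1], closes[-period:])]
    avg_gain = sum(d for d in deltas if d > 0) / period
    avg_loss = sum(-d for d in deltas if d < 0) / period
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)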
The bill lands in your inbox. $37,000 this month. Was $29,000 last month. Someone in Finance cc's half the engineering org asking what happened. Engineering doesn't know. Nobody knows. The thread dies with "we'll investigate" and everyone goes back to fighting fires. Month later, same thing. This is how most companies run cloud infrastructure. Cost is something Finance worries about quarterly while Engineering optimizes for uptime and latency. The feedback loop is measured in weeks. By the time anyone notices the spend anomaly, you've already burned through the overage and the root cause is buried under three deployments. What if your cloud spend behaved like request latency? Spiked in Grafana when something broke. Triggered the same on-call rotation as a degraded service. Lived in the same mental space where you reason about capacity and performance. Not as a finance exercise. As an operational metric that engineers own. That's FinOps. Cloud Financial Operations. The idea that cost is telemetry — another dimension of system health you instrument, monitor, and optimize in real time. Your AWS bill stops being a monthly surprise from Finance and starts being a dashboard that updates hourly, tagged by service and team, graphed alongside request rates and error budgets. Every Workload Has a Cost Signature Start here: cloud resources cost money in specific, measurable ways. A Lambda invocation costs $0.0000002 per request at 128MB memory. Sounds trivial until you're handling 50 million requests daily and the bill is $10k monthly. An RDS db.r5.2xlarge burns $0.504/hour whether it's serving 10 queries or 10,000. You pay for provisioned capacity, not utilization. An S3 GET request costs $0.0004 per thousand. An S3 LIST operation over a bucket with 10 million objects can cost $50 if you're iterating stupidly. These aren't abstract numbers. They're the unit economics of running code. New Relic's engineering team did something that sounds obvious in retrospect but almost nobody does: they instrumented every operational metric with its marginal cost. Cost per API call. Cost per trace ingested. Cost per metric scraped. When a service starts hammering an endpoint, two graphs spike simultaneously — request volume and dollars per minute. You see the correlation immediately. The cost becomes visceral, not theoretical. This matters because cloud infrastructure obscures its economics by design. When you bought physical servers, the constraints were obvious. You ordered a rack, waited six weeks for delivery, racked and cabled it, and then you squeezed every milliwatt out of that hardware because the capital was spent. You knew exactly what you had. Cloud abstracts that away. Auto-scaling groups spin up instances when CPU crosses 70%. Reasonable behavior. Also capable of burning $8,000 on a Saturday because someone pushed a bad regex that triggers catastrophic backtracking in a log parser and every request starts taking 4 seconds instead of 40ms. The auto-scaler sees high CPU, adds instances. More instances, same bug, more CPU, more instances. By the time someone notices and rolls back, you've scaled to 80 instances serving the same traffic that normally runs on 6. The cloud bill arrives two weeks later. Nobody connects it to that Saturday incident because the feedback loop is broken. Making cost legible means fixing that loop. Instrumenting Spend the Same Way You Instrument Latency You already know how to do this for performance metrics. Prometheus scrapes endpoints every 15 seconds. 
Grafana renders time-series graphs. Alerts fire when error rates cross thresholds. Runbooks trigger. On-call gets paged. Extend that model to cost. AWS publishes Cost and Usage Reports — massive gzipped CSV files dumped to S3 with line-item billing detail. Every EC2 instance-hour, every GB-month of S3 storage, every Lambda invocation, tagged with resource IDs, availability zones, usage types. The files are enormous. Last month's CUR for a medium-sized infrastructure might be 4GB compressed, 40GB uncompressed, millions of rows. Parse it. Azure has equivalent exports. GCP pushes billing data to BigQuery. The mechanics differ but the pattern is identical: get granular billing data, tag it with the same metadata you use for observability, aggregate it, and shove it into your metrics pipeline. Here's what that looks like in practice. You write a script — Python with boto3 and pandas, or Go with the AWS SDK, doesn't matter. Runs every hour via cron. Pulls the latest CUR data from S3, parses the CSV, groups by resource tags (team, environment, service, feature), computes deltas since last run, exports to Prometheus. Now cost is time-series data. You graph it next to your operational metrics. Dual-axis chart: requests per second on the left Y-axis, dollars per hour on the right. Watch what happens. Traffic doubles during a product launch. Cost doubles. That's healthy — linear scaling, expected behavior. But then three days later: traffic flat, cost up 40%. That's the signal. Something changed and it's not traffic. You investigate. New deployment went out Tuesday. Changelog shows a "minor optimization" to caching logic. You dig deeper. The optimization broke cache key generation. Cache hit rate dropped from 85% to 12%. Every request that should hit cache now hits the database. RDS connection count spiked. Auto-scaling added read replicas. Cost follows. Without cost telemetry, this surfaces as a vague sense that "the database seems slower lately" and maybe someone investigates next sprint. With cost telemetry, it's a P2 incident Tuesday afternoon and you revert the deployment before dinner. Kubernetes complicates this. Workloads are ephemeral. Pods get scheduled across nodes. A single node might run workloads from six different teams. Cloud billing shows you the EC2 instance cost, but how do you allocate that to teams? Kubecost solves this by querying the Kubernetes API for pod resource requests and limits, correlating that with node pricing, and exporting per-pod, per-namespace, per-label cost metrics. You tag your Deployments and StatefulSets the same way you tag everything else. Kubecost tells you the data-pipeline namespace in the prod cluster burned $340 last Tuesday. You trace it back. A CronJob that should run nightly and terminate ran 16 times because of a misconfigured schedule. Each run spawned 20 pods requesting 4 cores each. Most of the work was waiting on I/O but Kubernetes saw the resource requests and provisioned accordingly. The pods sat there, allocated but mostly idle, burning money. Without namespace-level cost visibility, that's invisible. With it, it's a line item you investigate Wednesday morning. Granularity is everything. Cluster-wide cost is useless — it's just a big number. Per-team cost is better but still vague. Per-service cost is actionable. Per-customer cost lets you calculate unit economics and answer whether your pricing model actually covers infrastructure. Cost as a Service Level Indicator If cost is telemetry, it deserves the same rigor as uptime. 
Define budget burn rate as an SLI. Set an SLO: "Monthly spend shall not exceed projected budget by more than 15% for three consecutive days." Alert on violations the same way you alert on error rate thresholds. This sounds straightforward until you try to implement it and realize your budget projections are wildly wrong. They're based on last quarter's usage, extrapolated linearly, ignoring seasonality and feature launches and customer growth patterns. Your projections say you'll spend $45k this month. You're on track for $62k by day 10. Is that a problem? Maybe. Maybe you launched a feature that's more popular than expected and the increased cost maps to increased revenue. Or maybe someone left a data pipeline running in dev that's scanning the entire production database every hour for no reason. The projection being wrong isn't the problem. The problem is not knowing about the divergence until the bill closes. Start with bad projections. Iterate. Build a feedback loop where actual spend informs next month's forecast. The goal isn't perfect forecasting — it's timely detection of unexpected changes. Netflix uses anomaly detection for this. Not because ML is magic, but because their scale makes manual thresholding impossible. When you're spending millions monthly across thousands of microservices, you can't manually review every cost trend. Anomaly detection flags outliers — services whose cost trajectory deviates from historical patterns adjusted for traffic and seasonality. An engineer investigates. Often it's legitimate: new feature shipped, traffic grew, cost followed proportionally. Sometimes it's pathological. A retry loop that exponentially backs off but never terminates. A memory leak that causes pods to restart every 20 minutes, and Kubernetes keeps scheduling replacements. An auto-scaler that scales up aggressively but down conservatively, ratcheting instance count higher over days. These are all real incidents I've debugged. None of them showed up in traditional monitoring because the services technically worked. Requests succeeded. Latency was acceptable. But cost was hemorrhaging and nobody noticed until the monthly bill. The anti-pattern here is the financial silo. Cost analysts in Finance who don't understand the workload architecture. Engineers who never see the bill. The gap between them guarantees dysfunction. Finance sees numbers without context — "EC2 spend up 35%" — but can't trace it to a service or deployment. Engineering makes architectural decisions without feedback on cost implications. Showback bridges this gap. Allocate cost to teams based on tagged resources. Publish monthly dashboards showing each team's spend broken down by service. No penalties, no hard budget enforcement — just visibility. Teams start asking questions they've never asked before. "Why did we spend $4,200 on NAT Gateway last month?" Someone investigates. Turns out half the VPC subnets are misconfigured, routing all egress traffic through a single NAT Gateway instead of using VPC endpoints for S3 and DynamoDB. They fix the routing. Next month NAT Gateway cost drops to $600. Chargeback goes further — actually billing teams internally for their infrastructure spend. This creates budget accountability but also introduces perverse incentives. Teams might under-provision to save budget, degrading reliability. They might game the allocation system. Politics emerge. Showback delivers most of the value — awareness, attribution, cultural shift toward cost consciousness — without the hazards. 
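To make the burn-rate idea concrete, here is a minimal sketch of the "cost up, traffic flat" check described above. It assumes you already aggregate daily cost and request counts per service (for example, from the hourly billing parser discussed earlier); the 50% and 10% thresholds are placeholders, not recommendations.
Python
# Minimal sketch of the "cost grew, traffic didn't" heuristic.
# `daily` maps service -> [(cost_last_week, requests_last_week),
#                          (cost_this_week, requests_this_week)].
def cost_anomalies(daily, cost_jump=0.5, traffic_jump=0.1):
    flagged = []
    for service, ((prev_cost, prev_req), (cur_cost, cur_req)) in daily.items():
        if prev_cost == 0 or prev_req == 0:
            continue  # new service; nothing to compare against
        cost_growth = (cur_cost - prev_cost) / prev_cost
        traffic_growth = (cur_req - prev_req) / prev_req
        if cost_growth > cost_jump and traffic_growth < traffic_jump:
            flagged.append((service, round(cost_growth, 2), round(traffic_growth, 2)))
    return flagged

weekly = {"checkout-api": [(1200.0, 9.1e6), (1980.0, 9.3e6)]}
print(cost_anomalies(weekly))  # [('checkout-api', 0.65, 0.02)]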
What You Actually Do Monday Morning You're convinced. Cost observability makes sense. Now what? Tag everything. This is tedious, unglamorous infrastructure work. It's also foundational. Without tags, attribution is impossible. Define a standard schema. team, environment (prod, staging, dev), service, feature, cost-center. Enforce it with policy-as-code. Terraform modules that reject resource creation without required tags. Kubernetes admission controllers that reject pod specs missing labels. OPA policies. Sentinel. Whatever your infrastructure-as-code stack supports. Legacy resources will violate the schema. That's fine. Tag them retroactively. Write a script that queries the AWS API for untagged resources and bulk-applies tags based on naming conventions or VPC associations. It won't be perfect. You'll have orphaned resources you can't attribute. Tag what you can, document what you can't, accept that you'll be chasing this forever. Ingest billing data. Set up automated CUR exports to S3. Write the parser — runs hourly, aggregates by tag, computes deltas, pushes to Prometheus or your metrics backend. If you're on Azure, use the Billing API. GCP exports to BigQuery, so you write SQL queries instead of parsing CSVs. The mechanics differ; the pattern doesn't. Build dashboards. Grafana is usually the right answer because you're already using it for everything else. Add cost panels. Create a "FinOps Overview" showing total spend, top services, week-over-week trends, cost per customer if you track that granularly. Create team-specific dashboards showing their allocated spend. Make cost visible in the places engineers already look — not in a separate finance tool they'll never open. Define alerts. Start simple: "Daily spend exceeded $X." That'll fire false positives. Refine it: "Service Y's cost increased 50% week-over-week while request volume increased 10%." These are heuristics, not perfect detectors. They'll still fire false positives. That's acceptable. The goal is building the muscle memory of investigating cost anomalies the same way you investigate latency regressions. Integrate cost into CI/CD. This is harder. More speculative. But powerful when it works. Imagine a GitHub Action that runs on pull requests. It parses the Terraform diff, estimates the cost impact of proposed changes (new instance types, additional replicas, modified auto-scaling bounds), and posts a comment: "This change will increase monthly spend by approximately $430." Engineers see it during code review. They weigh cost against benefit. Sometimes they proceed — the feature justifies the expense. Sometimes they rethink the approach — maybe there's a cheaper architecture that accomplishes the same goal. Infracost does this. It's imperfect. Cloud pricing is Byzantine. Usage varies. Reserved Instances and Savings Plans complicate the math. Spot pricing fluctuates. But even a rough estimate is infinitely better than no estimate. It makes cost a first-class consideration during design instead of a surprise discovered three weeks later. Where This Falls Apart Theory is clean. Practice is messy, full of edge cases and incomplete data and tooling that almost works. Billing data lags. AWS CUR updates hourly at best, often with delays. Azure and GCP have similar latencies. You're trying to build real-time observability on top of data that's 60 to 90 minutes old. For long-running workloads — databases, cache clusters — this is fine. 
For burst workloads — Lambda functions, Fargate tasks, spot instances — you're often diagnosing yesterday's problem. Tagging is perpetually incomplete. Legacy resources predate your schema. External teams don't follow the standard because they don't know it exists or don't care. Someone spins up an instance manually during an incident and forgets to tag it. Your dashboards show "unallocated spend" growing every month. You chase it down, tag what you can find, but there's always more. It's Sisyphean. Attribution gets philosophical fast. A shared RDS instance serves three services owned by different teams. How do you allocate the cost? By query count? By table size? By connection time? By team ownership percentage? There's no obviously correct answer. You pick a heuristic, document it clearly, communicate it to stakeholders, and accept that someone will complain it's unfair. The team that runs heavy analytics queries will argue they shouldn't pay the same as the team doing lightweight lookups. The team that owns the largest tables will argue query count is a better metric than storage. You can't make everyone happy. Pick something reasonable, be transparent about the methodology, and move on. Cost optimization competes with reliability. Every optimization is a trade-off you have to think through. Spot instances are 70% cheaper than on-demand but can be interrupted with two minutes notice. Right-sizing instances saves money but reduces headroom for traffic spikes. Aggressive auto-scaling-down minimizes waste but introduces cold-start latency when you need to scale back up. Switching from RDS to Aurora might save money but requires refactoring connection pooling logic. FinOps doesn't resolve these tensions. It makes them explicit so you can make informed trade-offs instead of optimizing blindly. Cultural friction is real. Developers already juggle latency, error rates, saturation, on-call rotation, tech debt, feature delivery. Now you're asking them to care about cost too? It feels like scope creep. Like Finance trying to colonize engineering decisions with spreadsheets and budget restrictions. The pushback is legitimate. You mitigate it through framing. Cost visibility isn't about policing spending or denying resource requests. It's about enablement — giving engineers the information they need to make good decisions. When someone proposes a costly architecture, you don't say "no, that's too expensive." You say "here's what it will cost; here are three cheaper alternatives; here are the trade-offs; your call." Engineers appreciate having the data. What they resent is having decisions made for them by people who don't understand the constraints. What Actually Changes When You Get This Right FinOps shifts the conversation from reactive to proactive. Finance stops discovering overruns at month-end and demanding retroactive cuts. Engineering sees trends early, investigates, optimizes continuously. Real example: a team notices their CloudWatch Logs bill tripled month-over-month. They investigate, discover they're logging full request bodies at DEBUG level in production — something someone enabled during an incident six weeks ago and forgot to revert. They change the log level to WARN, keep detailed logs only in staging. Spend drops 70%. Simple. Obvious in hindsight. Completely invisible without the cost signal. Another team runs ETL batch jobs on on-demand instances. They check utilization: jobs run overnight, instances sit idle 16 hours daily. 
They switch to Spot instances with a Spot Fleet configuration that tolerates interruptions and falls back to on-demand only when Spot capacity is unavailable. Cost drops 60%. Jobs take 10% longer sometimes when Spot gets interrupted, but they're not user-facing, so the latency doesn't matter. A database sized for peak load two years ago. Traffic declined since then — customer churn, product pivot, whatever. Nobody ever resized it. Monitoring shows consistent 15% CPU utilization. They downsize from db.r5.8xlarge to db.r5.2xlarge. Performance metrics stay healthy. Monthly cost drops $2,000. None of these are heroic optimizations. They're hygiene. Basic operational discipline. But hygiene compounds. Ten optimizations saving $200 each is $2,000 monthly, $24,000 annually. At scale, it's hundreds of thousands. More importantly, the culture changes. Teams start asking "what's this going to cost?" during design, not as an afterthought. Cost becomes part of the conversation alongside performance and reliability. The Tools You'll End Up Evaluating Cloud cost observability is now a legitimate product category with venture-backed companies and competitive positioning. CloudZero, Vantage, Yotascale, Apptio Cloudability — they ingest billing data, correlate it with resource tags and business metrics, render dashboards showing unit economics. The pitch is visibility and optimization insights. Pricing varies wildly. Some charge a percentage of your cloud spend. Some charge per seat. The ROI calculation depends on your scale. Datadog, New Relic, Dynatrace — observability platforms adding cost modules as a feature. They already instrument your infrastructure for performance. Adding cost is a natural extension. The value proposition is consolidation: one tool for operations and economics instead of separate platforms. Kubecost focuses specifically on Kubernetes. Open-source core, commercial tier with extra features. For Kubernetes-heavy organizations, it's nearly essential — native cloud billing has no visibility into namespace or pod-level costs. The FinOps Foundation publishes frameworks, maturity models, case studies. They run certifications — FinOps Certified Practitioner. Whether that certification has value depends on your organization, but the community and knowledge sharing are real. Consulting follows the tooling. Organizations hire people to embed FinOps culture: how to structure teams, run cost reviews, build accountability mechanisms. There's a certification ecosystem emerging. The quality varies. CostOps — integrating cost intelligence directly into DevOps pipelines — is still nascent. Terraform modules that estimate cost before apply. CI runners that block deployments exceeding budget without approval. GitOps workflows where cost is a merge check. The tooling isn't mature, but the concept makes sense if you're already doing everything else as code. The Skeptical Counterargument Is any of this actually necessary? Can't Finance just handle cost management like they always have? Only if you're comfortable with month-long feedback loops and blunt instruments. Finance can identify that EC2 spend increased 40%, but they can't trace it to a specific service, deployment, or bug. They can't fix it. They can escalate to engineering, who then spend days investigating with incomplete data because nobody instrumented cost in the first place. FinOps collapses that loop. Engineers see cost in real time, correlate it with their changes, optimize autonomously. 
It's not replacing Finance — it's shifting left, handling problems at the source before they become quarterly budget disasters. But it requires discipline. Instrumentation, tagging, alerting, dashboards — these don't build themselves. They require ongoing time, maintenance, evolution as your infrastructure changes. If your team is barely keeping production running, adding FinOps infrastructure might legitimately be a luxury you can't afford right now. Fair. Prioritize. Maybe you start minimal: allocate costs monthly, publish basic reports, build awareness. Low investment, high value. Once teams start caring, they'll demand better data. Then you invest in real-time telemetry. Or maybe your spend is low enough it genuinely doesn't matter. If you're burning $500 monthly, optimizing down to $300 saves trivial money relative to engineering time. FinOps scales with spend. When you're spending tens of thousands monthly, the ROI is obvious. The Actual Goal Netflix's stated goal is "nearly complete cost insight coverage." Every service, every workload, every feature instrumented with cost telemetry. It's aspirational. Probably impossible to fully achieve. But the direction is right. You won't fix everything Monday. You'll tag some resources, build a basic dashboard, maybe set up one alert. That's sufficient. The value accumulates through repetition — making cost visible in the daily flow of work, asking "what does this cost?" as routinely as "how fast is this?" or "will this scale?" The cloud hides its economics behind layers of abstraction. Auto-scaling, serverless, pay-per-request — all brilliant innovations that make infrastructure invisible until the bill arrives. FinOps makes those economics legible again. That's the entire game.
SQL Server performance issues are a common pain point for database administrators. One of my most challenging scenarios occurred after deploying a financial analytics database update. Reports that previously ran in less than 3 minutes suddenly ballooned to over 20 minutes. Key stored procedures started underperforming, and CPU usage spiked to critical levels during peak workloads. Through careful investigation, I identified query regressions caused by outdated execution plans and parameter sniffing. Instead of applying temporary fixes, I turned to Query Store and Intelligent Query Processing (IQP) to develop a sustainable, long-term solution. This article provides step-by-step instructions for using these tools, including practical examples, my exact investigation process, configuration changes, benchmark results before and after optimizations, and how these changes improved overall performance and stabilized the production environment. Performance Issue Investigation: Observing Query Regressions The performance degradation stemmed from new internal processes introduced into the application, which altered data patterns. Parameter sniffing, a common issue in which SQL Server caches an execution plan optimized for one set of parameter values and then reuses it for parameters with drastically different data distributions, caused previously fast queries to slow down. To pinpoint the bottleneck, I queried the sys.dm_exec_requests and sys.dm_exec_query_stats views, which revealed certain stored procedures with much higher CPU and runtime durations than they had before. For example, running the following query helped me confirm which plans were underperforming: MS SQL SELECT TOP 5 qs.sql_handle, qs.creation_time, qs.total_worker_time / qs.execution_count AS average_cpu_time, qs.execution_count, qp.query_plan FROM sys.dm_exec_query_stats qs OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) qp ORDER BY average_cpu_time DESC; From this, I identified two impacted stored procedures, usp_generate_financial_report and usp_calculate_daily_totals, each of which showed sudden spikes in execution time. Enabling Query Store for Plan Analysis To resolve the regressions effectively, I enabled Query Store to monitor all query plans and runtime statistics. Query Store maintains a history of plan performance, making it possible to diagnose and compare regressed plans to their optimal counterparts. I enabled Query Store with the following command: MS SQL ALTER DATABASE [FinancialAnalyticsDB] SET QUERY_STORE = ON; ALTER DATABASE [FinancialAnalyticsDB] SET QUERY_STORE ( OPERATION_MODE = READ_WRITE, CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), DATA_FLUSH_INTERVAL_SECONDS = 900, QUERY_CAPTURE_MODE = AUTO ); This configuration automatically captured query plans and runtime metrics while limiting unnecessary data retention to 30 days. I immediately noticed that usp_generate_financial_report was generating multiple inefficient plans based on the cached parameters. Query Store also provided insights into how the queries performed under those plans.
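The DMV spot check shown above can also be scripted to run on a schedule rather than by hand. This is a minimal sketch only; it assumes the pyodbc package, a trusted connection to the FinancialAnalyticsDB instance, and an arbitrary 5-second CPU threshold.
Python
# Minimal sketch: poll the same DMVs used above and warn on queries whose
# average CPU time crosses a threshold. pyodbc, the connection string, and the
# 5,000,000-microsecond threshold are assumptions for illustration.
import pyodbc

QUERY = """
SELECT TOP 5
       qs.total_worker_time / qs.execution_count AS average_cpu_time,
       qs.execution_count,
       st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY average_cpu_time DESC;
"""

def check_top_cpu(conn_str: str, threshold_us: int = 5_000_000):
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(QUERY).fetchall()
    for avg_cpu, executions, text in rows:
        if avg_cpu > threshold_us:
            print(f"WARN avg_cpu={avg_cpu}us executions={executions}: {text[:120]}")

check_top_cpu("Driver={ODBC Driver 18 for SQL Server};"
              "Server=localhost;Database=FinancialAnalyticsDB;Trusted_Connection=yes;")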
Query Store Analysis Before the Fix

I used the following query to identify the regressed query. Note that runtime statistics in Query Store are recorded per plan, so the runtime stats view is joined through sys.query_store_plan:

MS SQL
SELECT
    q.query_id,
    q.object_id,
    MAX(rs.avg_duration) AS max_duration,
    MIN(rs.avg_duration) AS min_duration
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt
    ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p
    ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs
    ON p.plan_id = rs.plan_id
GROUP BY q.object_id, q.query_id
ORDER BY max_duration DESC;

Results revealed the following for usp_generate_financial_report:

Metric            Value Before Fix
Max Duration      19,789 ms
Min Duration      2,345 ms
Memory Usage      410 MB
CPU Utilization   75% (peaking at 90%)

Parameter sniffing caused the query to use an index seek for one execution and a full table scan for another, leading to an average of roughly 20 seconds per execution during peak hours.

Parameter sniffing occurs when SQL Server compiles and caches an execution plan using the parameter values supplied during the query's first execution. While this can improve performance for similar subsequent executions, it causes problems when the initial parameter values do not represent the typical data distribution or usage patterns: the cached plan is then suboptimal for executions with different parameters. For example, a plan optimized for a small dataset might perform poorly when run against a much larger dataset with different parameter values.

Fixing Query Regressions by Forcing Plans

Using Query Store, I located the plan that performed optimally and forced SQL Server to reuse it for subsequent executions.

Forced plan implementation:

MS SQL
-- Force the well-performing plan for the regressed query
-- (the query_id and plan_id come from the Query Store views above)
EXEC sp_query_store_force_plan @query_id = 1203, @plan_id = 3456;

This ensures the query always runs with the best-performing execution plan. I tested the change in a development environment to confirm its impact before implementing it in production.

Benchmark Results After Forcing Plans

After forcing the optimal plan, the following improvements were observed:

Metric            Before Fix         After Plan Forcing
Max Duration      19,789 ms          3,455 ms
Min Duration      2,345 ms           2,900 ms
Memory Usage      410 MB             140 MB
CPU Utilization   75% (peak 90%)     25% (peak 35%)

Execution times dropped to less than 4 seconds, and resource usage normalized during peak traffic. The slight increase in minimum duration is the expected trade-off of forcing a single plan: the forced plan is not the absolute fastest for every parameter value, but it eliminates the worst-case regressions.

Leveraging Intelligent Query Processing for Scalability

To prevent similar regressions in the future, I enabled Intelligent Query Processing (available in SQL Server 2019 and later). This suite of features dynamically resolves common query problems without manual DBA intervention. For this workload, the most impactful IQP features were Scalar UDF Inlining and Adaptive Joins. Scalar UDF Inlining automatically translates scalar user-defined functions into inline relational expressions, eliminating their row-by-row execution; this was critical for usp_calculate_daily_totals, which relied heavily on scalar UDFs. Adaptive Joins defer the choice between a Nested Loops and a Hash Join until runtime, based on actual row counts, adding further resilience when query workloads vary.
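As a small addition to the IQP discussion (not part of the original investigation), SQL Server 2019 exposes whether a given scalar UDF is even eligible for inlining. One way to check, sketched below, is the is_inlineable flag in sys.sql_modules; UDFs reported as non-inlineable keep their row-by-row behavior regardless of the database configuration.

MS SQL
-- Sketch: list scalar UDFs and whether the engine can inline them
-- (requires SQL Server 2019+ and database compatibility level 150).
SELECT
    OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
    OBJECT_NAME(m.object_id)        AS udf_name,
    m.is_inlineable                 -- 1 = eligible for Scalar UDF Inlining
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE o.type = 'FN';                -- scalar user-defined functions only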
After enabling IQP for the database:

MS SQL
-- Scalar UDF Inlining requires database compatibility level 150 (SQL Server 2019)
ALTER DATABASE [FinancialAnalyticsDB] SET COMPATIBILITY_LEVEL = 150;

-- Scalar UDF Inlining is on by default at this level; the database-scoped
-- configuration below states it explicitly
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = ON;

Benchmark Results After Enabling IQP

The following table compares metrics before and after enabling IQP for usp_calculate_daily_totals:

Metric               Before IQP    After IQP
UDF Execution Time   20,134 ms     3,289 ms
Logical Reads        15,000        4,000
CPU Utilization      60%           20%

Enabling Scalar UDF Inlining improved query execution time by up to 85%, while Adaptive Joins reduced variability across parameterized query runs.

Monitoring and Stabilization

After resolving the performance issues, I configured proactive monitoring to guard against future regressions. Query Store provided continuous insight, while Extended Events helped trace any unusual query behavior (a minimal example of such a session is sketched after the conclusion). Automating these tasks with scheduled jobs kept the environment stable even as workloads evolved.

Conclusion

By combining Query Store and Intelligent Query Processing, I was able to diagnose and resolve query regressions quickly and effectively. Query Store helped me identify problematic plans and enforce optimal execution through forced plans, while IQP dynamically addressed inefficiencies in both existing and future queries. In this specific case, the financial analytics database saw execution times drop by over 80%, CPU utilization fall by 50%, and user complaints stop entirely. For any DBA seeking long-term, scalable solutions to performance challenges, these tools are well worth adopting. Start using Query Store and IQP today, and take control of your SQL Server performance issues for good.
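To make the monitoring step above more concrete, here is a minimal sketch of an Extended Events session for catching slow statements. It is not taken from the original setup: the session name, the 5-second duration threshold, and the output file name are illustrative assumptions.

MS SQL
-- Illustrative Extended Events session (hypothetical name and threshold):
-- capture statements that run longer than 5 seconds, with their text.
CREATE EVENT SESSION [LongRunningStatements] ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE (duration > 5000000)      -- duration is reported in microseconds
)
ADD TARGET package0.event_file (SET filename = N'LongRunningStatements.xel');

ALTER EVENT SESSION [LongRunningStatements] ON SERVER STATE = START;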