Performance

Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.

Latest Premium Content

Trend Report: Observability and Performance
Refcard #385: Observability Maturity Model
Refcard #368: Getting Started With OpenTelemetry

DZone's Featured Performance Resources

Why Angular Performance Problems Are Often Backend Problems

By Bhanu Sekhar Guttikonda
Angular developers often get the blame when an app feels slow. We instinctively reach for frontend fixes: optimizing components, change detection, bundle sizes, and so on. However, a significant portion of perceived Angular slowness comes not from the framework or the UI at all, but from the backend. One seasoned Angular engineer noted that most sluggish apps feel slow because of chatty APIs and oversized responses rather than the UI layer itself. In other words, you can fine-tune Angular's performance features all you want, but if your API calls are slow or inefficient, the user will still be waiting on data.

The Common Misconception: The Angular App Is Slow

When performance metrics are poor, teams often assume the Angular frontend is to blame. Common first reactions include:

- Tuning the change detection strategy
- Adding more lazy-loaded modules or components
- Reducing DOM elements and re-renders
- Refactoring or memoizing expensive components

These optimizations can indeed make Angular UIs more efficient. In practice, however, they often yield only minor improvements in real user-centric metrics like Largest Contentful Paint (LCP) or Time to Interactive, because LCP is mostly influenced by network delays, not JavaScript execution. If the browser is sitting idle waiting for an API response or an image to load, shaving 50ms off a component's render time has virtually no effect on the overall load time. Angular's own rendering performance is rarely the true bottleneck behind multi-second delays.

API Waterfalls: The Silent Performance Killer

One of the most notorious backend-related issues is the API waterfall. An API waterfall occurs when the frontend has to make multiple HTTP calls in sequence because each response is needed to initiate the next request. The pattern looks like this:

```text
Frontend Component -> API A -> (wait) -> API B -> (wait) -> API C -> ... -> Render UI
```

Each dependent call adds stacked network latency and additional server processing time.
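To see what stacked latency costs, here is a small, language-agnostic simulation, sketched in plain Java for concreteness. The endpoint names and the 100 ms latency are hypothetical; the point is that a three-step waterfall takes roughly the sum of the call times, while independent calls issued together take roughly the time of the slowest one.

```java
import java.util.concurrent.CompletableFuture;

public class WaterfallDemo {

    // Simulates one backend call with ~100 ms of network latency
    // (endpoint names and latency are hypothetical).
    static CompletableFuture<String> call(String endpoint) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return endpoint + "-data";
        });
    }

    // Waterfall: each call waits for the previous response,
    // so the latencies add up (~300 ms here).
    public static long waterfallMillis() {
        long start = System.nanoTime();
        call("profile")
                .thenCompose(p -> call("orders"))
                .thenCompose(o -> call("detail"))
                .join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Independent calls issued together: total time is roughly
    // the slowest single call (~100 ms), not the sum.
    public static long parallelMillis() {
        long start = System.nanoTime();
        CompletableFuture<String> a = call("profile");
        CompletableFuture<String> b = call("orders");
        CompletableFuture<String> c = call("detail");
        CompletableFuture.allOf(a, b, c).join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("waterfall: ~" + waterfallMillis() + " ms");
        System.out.println("parallel:  ~" + parallelMillis() + " ms");
    }
}
```

Of course, truly dependent calls (as in the example below) cannot simply be parallelized on the client; that is exactly where server-side orchestration, discussed later, pays off.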
In Angular, you might see code like this in a service or component:

```typescript
// Sequential API calls (waterfall)
this.http.get<Profile>('/api/profile/123').pipe(
  switchMap(profile => this.http.get<Order[]>(`/api/users/${profile.id}/orders`)),
  switchMap(orders => {
    // Assume we need details of the first order
    const firstOrderId = orders[0]?.id;
    return firstOrderId
      ? this.http.get<OrderDetail>(`/api/orders/${firstOrderId}/detail`)
      : of(null);
  })
).subscribe(detail => {
  this.orderDetail = detail;
});
```

In this code, the component cannot display the final data until three sequential requests have all completed. This waterfall means multiple round trips and an accumulating delay at each step. The browser's network timeline would show idle gaps while waiting for each response.

Why Angular Optimizations Alone Don't Fix Load Times

It's important to understand that frontend optimization has limits. Imagine a scenario where an Angular component takes 100ms to render once data is ready. You refactor and use the OnPush change detection strategy, cutting rendering down to 50ms, a nice 2x improvement. But if the API call that provides the data takes 3,000ms, the user won't notice the difference between 100ms and 50ms of rendering; they're still stuck waiting 3 seconds for content to appear. This is why teams can spend weeks tweaking Angular code for marginal gains, only to find the real-world metrics barely improve. Some examples:

- Change detection tweaks: Angular's default change detection is fast. Using ChangeDetectionStrategy.OnPush or Angular signals can reduce unnecessary checks, but they won't make data arrive sooner. If data is late, the UI stays blank regardless.
- Lazy loading modules: Splitting the app and loading parts on demand helps initial bundle size. Yet if your main screen still waits on multiple API calls, lazy loading doesn't solve the wait. All required data must be fetched before meaningful content is shown.
- Client-side caching and state: Client-side caching can help on subsequent navigations, but for a first load or cache miss, you're back to waiting on the server.

Angular is very performant at rendering, and its recent features further reduce framework overhead. But none of that can compensate for a slow or chatty backend. Frontend fixes address milliseconds; backend fixes can eliminate seconds of wait time.

Key Backend Decisions That Influence Angular Performance

If speeding up Angular's own execution isn't solving your issues, it's time to look at the backend. Several backend design choices directly impact frontend performance for an Angular app.

API Granularity and Data Shaping

Backend APIs often reflect internal microservices or database models, not the needs of the UI. This mismatch can result in:

- Over-fetching: Endpoints that return far more data than the frontend actually needs. The Angular app then wastes time parsing and filtering data.
- Under-fetching: Endpoints that are too fine-grained, forcing the client to make multiple calls to gather related data for one screen.
- Excessive data size: Lack of server-side pagination or filtering, returning 5,000 records in one response and making the Angular client sort or slice them. This not only delays initial load but also puts a processing burden on the browser.
- Inconsistent formats: Data not shaped for direct use, requiring the Angular code to transform it. Such processing on the client can be slow if the data volume is large, and it complicates the frontend code.

Consider a simple example of over-fetching: say the UI needs to display a list of product names and prices. A poorly designed API might return an entire product object with dozens of fields.
An Angular component might then filter or map that data:

```typescript
// Inefficient data handling due to over-fetching
this.http.get<Product[]>('/api/products').subscribe(products => {
  // UI only needs name and price; drop the rest
  this.products = products.map(p => ({ name: p.name, price: p.price }));
});
```

Here, the browser had to download all product fields only to ignore most of them. The extra data makes the response larger and slower. A better approach would be for the backend to offer an endpoint that returns only the needed fields, or perhaps a specialized summary endpoint. APIs that are designed around UI use cases can dramatically reduce round trips and client-side work. When the backend sends exactly what the UI needs, the Angular app can render content much faster.

Workflow APIs and Server-Side Orchestration

Instead of making the Angular client orchestrate multiple calls, the backend can provide workflow APIs that aggregate data from multiple sources. Let the server handle the sequence and combine the results, returning one payload tailored for the screen. This approach can turn the earlier waterfall example into a single request:

```java
// Spring Boot example: Orchestrating multiple calls in one API
@RestController
public class AggregateController {

  @Autowired UserService userService;
  @Autowired OrderService orderService;

  @GetMapping("/api/userOrders/{userId}")
  public UserOrdersResponse getUserWithOrders(@PathVariable String userId) {
    User profile = userService.getUserProfile(userId);
    List<Order> orders = orderService.getOrdersForUser(userId);
    return new UserOrdersResponse(profile, orders); // aggregate data
  }
}
```

Server-Side Caching and Third-Party Isolation

Sometimes the data itself comes from slow or unreliable sources. If such data is needed on the Angular app's critical path, it will drag down performance. Backend solutions like caching can drastically improve this.
Caching frequently used data on the server ensures the frontend isn't stuck waiting on a slow external call or repeating the same heavy computation. Similarly, isolating third-party API calls behind backend strategies can prevent those services from affecting the app's perceived performance. The Angular frontend then interacts with your faster proxy or cache rather than directly with a slow third party. In effect, the backend shields the frontend from unpredictable latency.

Minimizing Round Trips and Duplicated Calls

Every HTTP call has overhead, so reducing the number of calls is crucial. We have discussed combining calls via orchestration, but also beware of duplicate calls. It's surprisingly easy to inadvertently call the same API multiple times in Angular: perhaps two components both request the same data, or a user triggers an action repeatedly. This can bog down both the app and the server. One solution on the frontend is to use shared observables or caching in services so that data is fetched once and reused. Angular's reactive architecture with RxJS makes this straightforward. Use a BehaviorSubject or the shareReplay operator to cache a value (note the cache is keyed per id, so different customers don't overwrite each other):

```typescript
@Injectable({ providedIn: 'root' })
export class CustomerService {
  private cache = new Map<string, Observable<Customer>>();

  constructor(private http: HttpClient) {}

  getCustomer(id: string): Observable<Customer> {
    if (!this.cache.has(id)) {
      // Fetch once per id, then share the result with all subscribers
      this.cache.set(
        id,
        this.http.get<Customer>(`/api/customers/${id}`).pipe(shareReplay(1))
      );
    }
    return this.cache.get(id)!;
  }
}
```

However, while frontend caching and smarter subscription management can alleviate unnecessary calls, they are fundamentally workarounds.

Conclusion: Fast Apps Need Strong Frontend-Backend Contracts

Frontend performance may manifest in the browser, but it is often determined by the server. A fast Angular app isn't just about Angular; it's about the contract between frontend and backend.
If that contract is efficient, delivering the right data at the right time with minimal overhead, Angular will shine and users will enjoy a fast experience. The quickest way to improve a slow Angular app is frequently to look behind the scenes: optimize your APIs, reduce network trips, cache expensive operations, and remove work from the critical rendering path. By fixing backend bottlenecks and designing with frontend needs in mind, you empower Angular to deliver true high performance. In summary, when the frontend and backend are designed together, not in isolation, web apps can be both rich and fast. The next time someone says Angular is slow, remember to check the server side before refactoring that component yet again.
Fine-Tuning of Spring Cache

By Constantin Kwiatkowski
Caching is one of the most effective techniques for improving the performance of modern Spring applications. Especially in microservice architectures or high-traffic APIs, a well-configured cache can significantly reduce database load and greatly improve response times. By storing frequently accessed data in memory, applications can avoid repeated expensive operations such as database queries or external API calls. This article provides a compact yet comprehensive overview of Spring's caching capabilities.

For example, without caching, each request follows this path:

```text
Client → Web Server → Application → Database
```

When 10,000 users request the same data, without caching, this results in 10,000 database queries, high latency, and significant server load. With caching:

```text
Client → Web Server → Application → Cache → Database
```

The first request hits the database, and the result is stored in the cache. Subsequent requests are served directly from the cache. This reduces database load, speeds up responses, and enhances scalability.

Caching can be implemented at multiple levels within a web application:

- Browser cache: The browser stores static files such as images, CSS, and JavaScript. This prevents these files from being downloaded on every request, reducing load time and network usage.
- CDN cache: A Content Delivery Network (CDN) caches content across multiple geographically distributed servers. Examples include Cloudflare and Akamai. This reduces latency for users worldwide and improves page load speed.
- Server / application cache: Data is stored within the application or server to speed up access to frequently requested information. Typical examples include database query results, API responses, or computational results. Common technologies at this level include Redis, Ehcache, Caffeine, and Apache Ignite.

Spring Cache

The caching module in the Spring Framework implements precisely this application-level cache.
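Before looking at Spring's annotations, it helps to see the underlying cache-aside pattern that Spring automates. Here is a minimal, dependency-free Java sketch of what a `@Cacheable` proxy effectively does; the `getUser` method and the loader function are hypothetical stand-ins for a cached service method and its database access.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CacheAside {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int databaseHits = 0; // counts how often the slow "database" lookup ran

    // The cache-aside flow a @Cacheable proxy performs: return the cached
    // value on a hit; on a miss, invoke the real lookup and store its result.
    public String getUser(String id, Function<String, String> databaseLookup) {
        return cache.computeIfAbsent(id, key -> {
            databaseHits++;
            return databaseLookup.apply(key);
        });
    }

    public int databaseHits() {
        return databaseHits;
    }
}
```

Calling `getUser("42", ...)` twice performs only one database lookup; the second call is served from the map. Spring Cache provides exactly this behavior declaratively, plus eviction, TTL, and pluggable providers.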
Spring provides an abstraction layer that allows developers to define caching very easily using annotations. Spring handles cache access, storage, and invalidation, so the developer only needs to specify which methods should be cached.

Architecture

A typical architecture:

```text
Client
  ↓
Controller
  ↓
Service (@Cacheable)
  ↓
Cache (Redis / Caffeine / Ignite)
  ↓
Database
```

Internally, a cached call flows through Spring's proxy infrastructure:

```text
Controller
  │
  ▼
Spring Proxy
  │
  ▼
CacheInterceptor
  │
  ▼
CacheManager → Cache Lookup
  │
  ├── Cache Hit → Return Value
  │
  └── Cache Miss
        │
        ▼
      Service Method
        │
        ▼
      Cache Put
        │
        ▼
      Return Value
```

Spring sits between the service and the underlying cache system, providing a caching API that enables significant performance improvements. Spring defines the caching logic, while a cache provider handles data storage. The Spring caching abstraction offers several benefits: simple implementation via annotations, pluggable cache providers, reduced database load, and improved performance. Supported providers include Redis, Ehcache, Caffeine, and Apache Ignite.

Spring caching is typically built on proxy-based Aspect-Oriented Programming (AOP). A proxy object is placed between the caller and the service method. The simplified flow looks like this:

```text
Controller → Proxy → Cache Check → Service Method
```

If a cache hit occurs, the actual service method is not executed; instead, the proxy returns the cached value immediately.

Remarks: Spring supports two types of proxies: JDK dynamic proxies, used for interfaces, and CGLIB proxies, used for classes without interfaces. However, proxy-based AOP has some limitations: internal method calls bypass the proxy, private methods cannot be intercepted, and final methods cannot be proxied. For more complex scenarios, AspectJ weaving can be used as an alternative; it modifies bytecode directly and intercepts calls without relying on proxies.

Tuning of Spring Cache

Fine-tuning of the Spring cache system is primarily done through the choice of cache provider.
The following table gives an overview of different cache providers, their typical use cases, and their advantages and disadvantages.

| Cache Provider | Type | Use Case | Advantages | Disadvantages |
|---|---|---|---|---|
| ConcurrentMapCache | Local | Small applications / testing | Simple, no external dependency | No TTL, no eviction |
| Caffeine | Local | High-performance APIs | Very fast, modern eviction algorithms, async loading | Local only |
| EhCache | Local + persistence | Applications with long-lived cache data | Disk persistence possible | More complex configuration |
| Redis | Distributed | Web apps / microservices | Very fast, cluster-capable | Network latency |
| Hazelcast | Distributed | Cluster applications | In-memory grid, automatic replication | Higher resource consumption |
| Infinispan | Distributed | Enterprise systems | Scalable, supports transactions | Complex operation |

The next table lists tuning parameters, usable in conjunction with a cache provider, and their effects on the cache system.

| Parameter | Meaning | Typical Value / Recommendation | Effect |
|---|---|---|---|
| maximumSize / maxEntries | Maximum number of cache entries | 500–5,000 local, 10k+ distributed | Prevents memory overflow |
| expireAfterWrite / TTL | Time-to-live after write | 5–60 minutes | Prevents stale data |
| expireAfterAccess / TTI | Time-to-idle since last access | 5–60 minutes | Removes rarely used data |
| Eviction policy | Strategy for removing entries | LRU standard, LFU for hot keys | Optimizes memory usage |
| refreshAfterWrite | Background refresh | Optional | Prevents cache misses |
| initialCapacity | Initial size of the cache | Set high for large caches | Less rehashing / locking |
| backupCount | Replication in cluster | 1–2 | Higher fault tolerance |
| overflowToDisk | Persistence outside of heap | For large data | Reduces heap usage |

Based on these insights, various strategies can be applied to improve caching performance. The following table explains some of them.
| Technique | Description | Impact |
|---|---|---|
| sync=true | Prevents cache stampede | Avoids DB storms |
| Local cache before Redis | Multi-level caching | 10–50× faster responses |
| Increase initialCapacity | Reduces resizing and rehashing | Less lock contention |
| Key optimization | Use simple keys | 10–20% performance improvement |
| Negative caching | Cache null results | Up to 80% DB load reduction |
| Async cache loading | Non-blocking data retrieval | Better scalability |
| Cache warmup | Preload cache at startup | Stable latency after deployment |

Let's break down some of these concepts and look at a few examples. Many caches start with a small initial capacity, which causes internal data structures to resize frequently. An example using Caffeine:

```java
Caffeine.newBuilder()
    .initialCapacity(100_000)
    .maximumSize(1_000_000)
    .build();
```

A larger initial capacity reduces rehash operations and improves scalability.

Another important concept is negative caching, where even negative results are stored in the cache. For example, if a user ID does not exist and every request queries the database again, this creates unnecessary load. Caching the null result avoids this effect.

Modern caching libraries like Caffeine also support asynchronous data loading, allowing non-blocking computation and improved parallelism for high-traffic applications:

```java
Caffeine.newBuilder().buildAsync(key -> loadUser(key));
```

This reduces thread blocking, which is especially beneficial in highly parallelized systems.

After a deployment, many caches are initially empty. This can cause performance spikes if a large number of requests suddenly hit the database. A simple solution is to perform a cache warm-up when the application starts:

```java
@PostConstruct
public void warmCache() {
    service.getUser(1);
    service.getUser(2);
}
```

This pre-populates the cache with frequently used data at startup.

In addition to the performance strategies mentioned so far, the choice of proxy type for the caching system (proxy-based AOP) can also affect performance.
In Spring, there are two proxy types: JDK dynamic proxies and CGLIB proxies. The following table summarizes the differences between them.

| Proxy Type | Implementation | Performance | Typical Use Case |
|---|---|---|---|
| JDK dynamic proxy | Java reflection + interface proxy | Minimally slower on method calls | Services with interfaces |
| CGLIB proxy | Bytecode-generated subclass | Minimally faster on calls | Classes without interfaces |

CGLIB generates a subclass of your class and invokes methods directly via bytecode. Internal example:

```text
UserService$$SpringCGLIBProxy extends UserService
```

Method call:

```text
proxy.getUser()
  ↓ interceptor
  ↓ super.getUser()
```

No reflection is needed. JDK proxies work via an InvocationHandler. Internally:

```text
Proxy.invoke()
  ↓ InvocationHandler
  ↓ Method.invoke()
```

This uses reflection, which has historically been slower. The following table shows rough figures illustrating how the choice of proxy type affects call overhead.

| Operation | Time |
|---|---|
| Regular call | ~5 ns |
| CGLIB proxy call | ~20–40 ns |
| JDK proxy call | ~40–80 ns |

When should each proxy type be used? The following table gives an overview.

| Situation | Advice |
|---|---|
| Microservices | CGLIB |
| Spring Boot standard | CGLIB |
| Interface-heavy architecture | JDK proxy |
| Performance | It doesn't matter |

Conclusion

Spring Cache is a powerful mechanism for improving application performance. Caching frequently accessed data significantly reduces the number of expensive operations, such as database queries or external API calls. As a result, applications can achieve lower latency, reduced backend load, and better scalability under high traffic. However, while Spring provides a variety of configuration and tuning options, not all of them have the same impact on performance. A good example of marginal impact is the selection of the proxy type used by Spring's AOP infrastructure.
Whether the framework uses JDK dynamic proxies or CGLIB proxies typically results in only negligible performance differences, since the overhead of proxy invocation is extremely small compared to the cost of database queries, network calls, or complex business logic. Performance optimization in Spring Cache should therefore focus primarily on architectural and data-access-related aspects rather than low-level framework details.
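To make the proxy mechanics discussed above concrete, here is a minimal, self-contained sketch of a JDK dynamic proxy that caches method results by argument, roughly what Spring's CacheInterceptor does for a `@Cacheable` method. The `UserService` interface and the call counter are illustrative, not part of Spring.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CachingProxyDemo {

    public interface UserService {
        String getUser(String id);
    }

    // Counts invocations that actually reach the target service.
    public static final AtomicInteger targetCalls = new AtomicInteger();

    // Wraps a UserService in a JDK dynamic proxy that caches results by
    // argument: a cache hit returns immediately; a miss invokes the target
    // reflectively and stores the result.
    public static UserService cached(UserService target) {
        Map<Object, Object> cache = new ConcurrentHashMap<>();
        InvocationHandler handler = (proxy, method, args) ->
                cache.computeIfAbsent(args[0], key -> {
                    try {
                        return method.invoke(target, args); // reflective call on a miss
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(e);
                    }
                });
        return (UserService) Proxy.newProxyInstance(
                UserService.class.getClassLoader(),
                new Class<?>[]{UserService.class},
                handler);
    }
}
```

Calling `getUser("7")` twice on the proxy reaches the real service only once; the `Method.invoke` step is exactly the reflective hop the timing table above attributes the extra nanoseconds to.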
Advanced Auto Loader Patterns for Large-Scale JSON and Semi-Structured Data
By Seshendranath Balla Venkata
Seeing the Whole System: Why OpenTelemetry Is Ending the Era of Fragmented Visibility
By Igboanugo David Ugochukwu
Stop Burning Money on AI Inference: A Cloud-Agnostic Guide to Serverless Cost Optimization
By Rajesh Kumar Pandey
Spark on AmpereOne® M Arm Processors Reference Architecture

Introduction

Arm technology now powers a broad spectrum of on-premises and cloud server workloads. Building on Ampere Computing's previous reference architecture, which demonstrated that Apache Spark on Ampere Altra 128C (Ampere Altra 128-core) processors delivers superior performance per rack, lower power consumption, and optimized CapEx and OpEx, this paper evaluates and extends that analysis to showcase Spark performance on the latest generation of AmpereOne® M processors.

Scope and Audience

This document describes the process of setting up, tuning, and evaluating Spark performance on a testbed powered by AmpereOne® M processors. It includes a comparative analysis of the performance benefits of the 12-channel AmpereOne® M processors relative to their predecessors, specifically Ampere Altra 128C processors. Additionally, the paper examines the Spark performance improvements achieved by using a 64KB page-size kernel over a standard 4KB page-size kernel. We outline the installation and tuning procedures for deploying Spark on both single-node and multi-node clusters. These recommendations are intended as general guidelines, and configuration parameters can be further optimized for specific workloads and use cases.

This document is intended for sales engineers, IT and cloud architects, IT and cloud managers, and customers seeking to leverage the performance and power-efficiency advantages of Ampere Arm servers across their IT infrastructure. It provides practical guidance and technical insights for professionals interested in deploying and optimizing Arm-based Spark solutions.

AmpereOne® M Processors

AmpereOne® M is part of the AmpereOne® family of high-performance server-class processors, designed to deliver exceptional performance for AI compute and a wide range of mainstream data center workloads.
Data-intensive applications such as Hadoop and Apache Spark benefit directly from the 12 DDR5 memory channels, which provide the high memory bandwidth required for large-scale data processing. AmpereOne® M processors introduce a new platform architecture with a higher core count and additional memory channels, differentiating them from earlier Ampere platforms while preserving Ampere's cloud-native processing principles. Designed from the ground up for cloud efficiency and predictable scaling, AmpereOne® M employs a one-to-one mapping between vCPUs and physical cores, ensuring consistent performance without resource contention. With up to 192 single-threaded cores and twelve DDR5 channels running at 5600 MT/s, AmpereOne® M delivers the sustained throughput required for demanding workloads such as Spark, as well as modern AI inference based on large language models (LLMs). AmpereOne® M also emphasizes exceptional performance per watt, helping reduce operational costs, energy consumption, and cooling requirements in modern data centers.

Apache Spark

Apache Spark is a unified data processing and analytics framework used for data engineering, data science, and machine learning workloads. It can operate on a single node or scale across large clusters, making it suitable for processing large and complex datasets. By leveraging distributed computing, Spark efficiently parallelizes data processing tasks across multiple nodes, either independently or in combination with other distributed computing systems. Spark utilizes in-memory caching, which allows quick access to data and optimized query execution, enabling fast analytic queries on datasets of any size. The framework provides APIs in popular programming languages such as Java, Scala, Python, and R, making it accessible to the broad developer community.
Spark supports various workloads, including real-time analytics, batch processing, interactive queries, and machine learning, offering a comprehensive solution for modern data processing needs. Spark supports multiple deployment models: it can run as a standalone cluster or integrate with cluster management and orchestration platforms such as Hadoop YARN, Kubernetes, and Docker. This flexibility allows Spark to adapt to diverse infrastructure environments and workload requirements.

Spark Architecture and Components

Figure 1: Spark architecture and components

Spark Driver

The Spark Driver serves as the central controller of the Spark execution engine and is responsible for managing the overall state of the Spark cluster. It interacts with the cluster manager to acquire the necessary resources, such as virtual CPUs (vCPUs) and memory. Once the resources are obtained, the Driver launches the executors, which are responsible for executing the actual tasks of the Spark application. Additionally, the Spark Driver maintains the state of the application running on the cluster. It keeps track of important information such as the execution plan, task scheduling, and the data transformations and actions to be performed. The Driver coordinates the execution of tasks across the available executors, ensuring efficient data processing and computation. The Spark Driver thus acts as a control unit, orchestrating the execution of the Spark application on the cluster and maintaining the necessary state and communication with the cluster manager and executors.

Spark Executors

Spark Executors are responsible for executing the tasks assigned to them by the Spark Driver. Once the Driver distributes the tasks across the available Executors, each Executor independently processes its assigned tasks. The Executors run these tasks in parallel, leveraging the resources allocated to them, such as CPU and memory.
They perform the necessary computations, transformations, and actions specified in the Spark application code. This includes operations like data transformations, filtering, aggregations, and machine learning algorithms, depending on the nature of the tasks. During execution, the Executors communicate with the Driver, providing updates on their progress and reporting the results of each task.

Cluster Manager

The Cluster Manager is responsible for maintaining the cluster of machines on which the Spark applications run. It handles resource allocation, scheduling, and management of the Spark Driver and Executors, ensuring efficient execution of Spark applications on the available cluster resources. When a Spark application is submitted, the Driver communicates with the Cluster Manager to request the necessary resources, such as CPU, memory, and storage, to run the application. The Cluster Manager ensures that resources are distributed effectively to meet the requirements of the Spark application. This includes tasks such as assigning containers or worker nodes to execute the Spark Executors and ensuring that the required dependencies and configurations are in place.

Spark RDD

Spark uses a concept called the Resilient Distributed Dataset (RDD), an abstraction that represents an immutable collection of objects that can be split across a cluster. RDDs can be created from various data sources, including SQL databases and NoSQL stores. Spark Core, which is built upon the RDD model, provides essential functionality such as mapping and reducing operations. It also offers built-in support for joining datasets, filtering, sampling, and aggregation, making it a powerful tool for data processing. When executing tasks, Spark splits them into smaller subtasks and distributes them across multiple executor processes running on the cluster. This enables parallel execution of tasks across the available computational resources, resulting in improved performance and scalability.
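The map/filter/reduce programming model that RDDs expose can be sketched without a cluster at all. The following plain-Java example uses a local `java.util.stream` as an analogy (it is not the Spark API): Spark's `JavaRDD` offers the same pipeline shape, but distributes each stage across executor processes.

```java
import java.util.List;

public class RddStyleExample {

    // The same map -> filter -> reduce shape an RDD pipeline uses, sketched
    // on a local stream. On Spark, each stage would run in parallel across
    // partitions on the cluster's executors.
    public static int sumOfEvenSquares(List<Integer> numbers) {
        return numbers.stream()
                .map(n -> n * n)            // transformation: map
                .filter(n -> n % 2 == 0)    // transformation: filter
                .reduce(0, Integer::sum);   // action: reduce
    }
}
```

For the input `[1, 2, 3, 4]`, the pipeline squares each element, keeps the even squares 4 and 16, and reduces them to 20; in Spark, only the final action would trigger execution of the lazily recorded transformations.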
Spark Core

Spark Core serves as the underlying execution engine for the Spark platform, forming the basis for all other Spark functionality. It offers powerful capabilities such as in-memory computing and the ability to reference datasets stored on external storage systems. One of the key components of Spark Core is the resilient distributed dataset (RDD), which serves as the primary programming abstraction in Spark. RDDs enable fault-tolerant, distributed data processing across a cluster. Spark Core provides a wide range of APIs for creating, manipulating, and transforming RDDs. These APIs are available in multiple programming languages, including Java, Python, Scala, and R. This flexibility allows developers to work with Spark Core in their preferred language and leverage the rich ecosystem of libraries and tools available in those languages.

Spark Scheduler

The Spark Scheduler is a vital component responsible for task scheduling and execution. It uses a Directed Acyclic Graph (DAG) and employs a task-oriented approach to scheduling. The Scheduler analyzes the dependencies between the different stages and tasks of a Spark application, represented by the DAG. It determines the optimal order in which tasks should be executed to achieve efficient computation and minimize data movement across the cluster. By understanding the dependencies and requirements of each task, the Scheduler assigns resources, such as CPU and memory, to the tasks. It considers factors like data locality, where possible, to reduce network overhead and improve performance. The task-oriented approach allows the Scheduler to break the application down into smaller, manageable tasks and distribute them across the available resources. This enables parallel execution and efficient utilization of the cluster's computing power.

Spark SQL

Spark SQL is a widely used component of Apache Spark that facilitates the creation of applications for processing structured data.
It adopts a data frame approach and allows efficient and flexible data manipulation. One of the key features of Spark SQL is its ability to interface with various data storage systems. It provides built-in support for reading and writing data from and to different datastores, including JSON, HDFS, JDBC, and Parquet. This makes it easy to work with structured data residing in different formats and storage systems. Additionally, Spark SQL extends its connectivity beyond the built-in datastores. It offers connectors that enable integration with other popular data stores such as MongoDB, Cassandra, and HBase. These connectors allow users to seamlessly interact with and process data stored in these systems using Spark SQL's powerful querying and processing capabilities.

Spark MLlib

In addition to its core functionalities, Apache Spark includes bundled libraries for machine learning and graph analysis techniques. One such library is MLlib, which provides a comprehensive framework for developing machine learning pipelines. MLlib simplifies the implementation of machine learning workflows, from feature extraction and transformations on structured datasets to a wide range of built-in algorithms. It empowers developers to build scalable and efficient machine learning workflows, enabling them to leverage the power of Spark for advanced analytics and data-driven applications.

Distributed Storage

Spark does not provide its own distributed file system. However, it can effectively utilize existing distributed file systems to store and access large datasets across multiple servers. One commonly used distributed file system with Spark is the Hadoop Distributed File System (HDFS). HDFS allows for the distribution of files across a cluster of machines, organizing data into consistent sets of blocks stored on each node.
Spark can leverage HDFS to efficiently read and write data during its processing tasks. When Spark processes data, it typically copies the required data from the distributed file system into its memory. By doing so, Spark reduces the need for frequent interactions with the underlying file system, resulting in faster processing compared to traditional Hadoop MapReduce jobs. As the dataset size increases, additional servers with local disks can be added to the distributed file system, allowing for horizontal scalability and improved performance.

Spark Jobs, Stages, and Tasks

In a Spark application, the execution flow is organized into a hierarchical structure consisting of Jobs, Stages, and Tasks. A Job represents a high-level unit of work within a Spark application. It can be seen as a complete computation that needs to be performed, involving multiple Stages and transformations on the input data. A Stage is a logical division of tasks that share the same shuffle dependencies, meaning they need to exchange data with each other during execution. Stages are created when there is a shuffle operation, such as a groupBy or a join, that requires data to be redistributed across the cluster. Within each Stage, there are multiple Tasks. A Task is the smallest unit of work in Spark: a single operation that can be executed on a partition of the data. Tasks are typically executed in parallel across multiple nodes in the cluster, with each node responsible for processing a subset of the data. Spark intelligently partitions the data and schedules Tasks across the cluster to maximize parallelism and optimize performance. It automatically determines the optimal number of Tasks and assigns them to available resources, considering factors such as data locality to minimize data shuffling between nodes.
Spark handles the management and coordination of Tasks within each Stage, ensuring that they are executed efficiently while leveraging the parallel processing capabilities of the cluster.

Figure 2: Shuffle boundaries introduce a barrier where Stages/Tasks must wait for the previous stage to finish before they fetch map outputs.

In the above diagram, Stage 0 and Stage 1 are executed in parallel, while Stage 2 and Stage 3 are executed sequentially. Hence, Stage 2 has to wait until both Stage 0 and Stage 1 are complete. This execution plan is evaluated by Spark.

Spark Test Bed

The Spark cluster was set up for performance benchmarking.

Equipment Under Test

Cluster nodes: 3
CPU: AmpereOne® M
Sockets/node: 1
Cores/socket: 192
Threads/socket: 192
CPU speed: 3200 MHz
Memory channels: 12
Memory/node: 768 GB (12 x 64GB DDR5-5600, 1DPC)
Network card/node: 1 x Mellanox ConnectX-6
OS storage/node: 1 x Samsung 960GB M.2
Data storage/node: 4 x Micron 7450 Gen 4 NVMe, 3.84 TB
Kernel version: 6.8.0-85
Operating system: Ubuntu 24.04.3
YARN version: 3.3.6
Spark version: 3.5.7
JDK version: JDK 17

Spark Installation and Cluster Setup

We set up the cluster with an HDFS file system. Hence, we installed Spark as a Hadoop user and configured the disks for HDFS.

OS Install

The majority of modern open-source and enterprise-supported Linux distributions offer full support for the AArch64 architecture. To install your chosen operating system, use the server KVM console to map or attach the OS installation media, and then follow the standard installation procedure.

Networking Setup

Set up a public network on one of the available interfaces for client communication. This can be used to log in to any of the servers where client communication is needed. Set up a private network for communication between the cluster nodes.

Storage Setup

Choose a drive for the OS installation, clear any old partitions, and reformat it.
Here, a Samsung 960 GB M.2 drive was chosen for the OS installation on each server. Add additional high-speed NVMe drives to support the HDFS file system.

Create Hadoop User

Create a user named “hadoop” as part of the OS install. This user was used for both Hadoop and Spark daemons on the test bed.

Post-Install Steps

Perform the following post-install steps on all the nodes after the OS install:

1. Run yum or apt update on the nodes.
2. Install packages like dstat, net-tools, lm-sensors, linux-tools-generic, python, and sysstat for your monitoring needs.
3. Set up ssh trust between all the nodes.
4. Update the /etc/sudoers file for nopasswd for the hadoop user.
5. Update /etc/security/limits.conf per the Appendix.
6. Update /etc/sysctl.conf per the Appendix.
7. Update the scaling governor and hugepages per the Appendix.
8. If necessary, make changes to /etc/rc.d to keep the above changes permanent after every reboot.
9. Set up the NVMe disks as an XFS file system for HDFS:
   a. Create a single partition on each of the NVMe disks with fdisk or parted.
   b. Create a file system on each of the created partitions using mkfs.xfs -f /dev/nvme[0-n]n1p1.
   c. Create directories for mounting with mkdir -p /root/nvme[0-n]1p1.
   d. Update /etc/fstab with entries and mount the file systems. The UUID of each partition in fstab can be extracted with the blkid command.
   e. Change ownership of these directories to the ‘hadoop’ user created earlier.

Spark Install

Download Hadoop 3.3.6 from the Apache website, Spark 3.5.7 from Apache Spark, and JDK11 and JDK17 for Arm64/AArch64. We will use JDK11 for Hadoop and JDK17 for Spark. Extract the tarball files under the hadoop user's home directory. Update the Spark and Hadoop configuration files in ~/hadoop/spark/conf and ~/hadoop/etc/hadoop/ and environment parameters in .bashrc per the Appendix. Depending on the hardware specifications of cores, memory, and disk capacities, these may have to be altered. Update the workers files to include the set of data nodes.
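The NVMe preparation steps above can be condensed into a single loop. This is a sketch only: the device names (nvme0n1 through nvme3n1) and mount points are examples and must match your hardware.

```shell
# Run as root on each data node. DESTRUCTIVE: wipes the listed drives.
for d in nvme0n1 nvme1n1 nvme2n1 nvme3n1; do
    parted -s /dev/$d mklabel gpt mkpart primary xfs 0% 100%
    mkfs.xfs -f /dev/${d}p1
    mkdir -p /root/${d}p1
    # noatime avoids access-time writes, as recommended for HDFS data disks
    echo "UUID=$(blkid -s UUID -o value /dev/${d}p1) /root/${d}p1 xfs defaults,noatime 0 0" >> /etc/fstab
    mount /root/${d}p1
    chown hadoop:hadoop /root/${d}p1
done
```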
Run the following commands:

Shell
hdfs namenode -format
scp -r ~/hadoop <datanodes>:~/hadoop
~/hadoop/sbin/start-all.sh
~/spark/sbin/start-all.sh

This should start the Spark Master, Workers, and other Hadoop daemons.

Performance Tuning

Spark is a complex system where many components interact across various layers. To achieve optimal performance, several factors must be considered, including BIOS and operating system settings, the network and disk infrastructure, and the specific software stack configuration. Experience with Hadoop and Spark significantly helps in fine-tuning these settings. Keep in mind that performance tuning is an ongoing, iterative process. The parameters in the Appendix are provided as starting reference points, gathered from just a few initial tuning cycles.

Linux

Occasionally, there can be conflicts between the subcomponents of a Linux system, such as the network and disk, which can impact overall performance. The objective is to optimize the system to achieve optimal disk and network throughput and identify and resolve any bottlenecks that may arise.

Network

To evaluate the network infrastructure, the iperf utility can be utilized to conduct stress tests. Adjusting the TX/RX ring buffers and the number of interrupt queues to align with the cores on the NUMA node where the NIC is located can help optimize performance. However, if the BIOS setting is already configured as chipset-ANC in a monolithic manner, these modifications may not be necessary.

Disks

Aligned partitions: Partitions should be aligned with the storage's physical block boundaries to maximize I/O efficiency.
Utilities like parted can be used to create aligned partitions.
I/O queue settings: Parameters such as the queue depth and nr_requests (number of requests) can be fine-tuned via the /sys/block/<device>/queue/ directory paths to control how many I/O operations the kernel schedules for a storage device.
Filesystem mount options: Utilizing the noatime option in the /etc/fstab file is critical for Hadoop and HDFS, as it prevents unnecessary disk writes by disabling the recording of file access timestamps.

The fio (flexible I/O tester) tool is highly effective for benchmarking and validating the performance of the disk subsystem after these changes are implemented.

Spark Configuration Parameters

There are several tunables in Spark; only a few of them are addressed here. Tune your parameters by observing the resource usage in the Spark application UI at http://<driver-node>:4040.

Using Data Frames Over RDDs

It is preferable to use Datasets or Data Frames over RDDs; these include several optimizations that improve the performance of Spark workloads. Spark data frames can handle data more efficiently because they maintain the structure of the data and its column types.

Using Serialized Data Formats

In Spark jobs, a common scenario involves writing data to a file, which is then read by another job and written to another file for subsequent Spark processing. To optimize this data flow, it is recommended to write the intermediate data into a serialized file format such as Parquet. Using Parquet as the intermediate file format can yield improved performance compared to formats like CSV or JSON. Parquet is a columnar file format designed to accelerate query processing. It organizes data in a columnar manner, allowing for more efficient compression and encoding techniques. This columnar storage format enables faster data access and processing, particularly for operations that involve selecting specific columns or performing aggregations.
By leveraging Parquet as the intermediate file format, Spark jobs can benefit from faster transformation operations. The columnar storage and optimized encoding techniques offered by Parquet, as well as its compatibility with processing frameworks like Hadoop, contribute to improved query performance and reduced data processing time.

Reducing Shuffle Operations

Shuffling is a fundamental Spark operation that reorders data among different executors and nodes. This is necessary for distributed tasks such as joins, grouping, and reductions. This data redistribution is expensive in terms of resources, as it requires considerable disk I/O, data packaging, and movement across the network. Shuffling is central to how Spark works, but it can severely reduce performance if not understood and tuned properly. The spark.sql.shuffle.partitions configuration parameter is key to managing shuffle behavior. Found in spark-defaults.conf, this setting dictates the number of partitions created during shuffle operations. The optimal value varies significantly, depending on data volume, available CPU cores, and the cluster's memory capacity. Setting too many partitions results in a large number of smaller output files, potentially increasing overhead. Conversely, too few partitions can lead to individual partitions becoming excessively large, risking out-of-memory errors on executors. Optimizing shuffle performance is an iterative process of adjusting spark.sql.shuffle.partitions to strike the right balance between partition count and size for your specific workload.

Spark Executor Cores

The number of cores allocated to each Spark Executor is an important consideration for optimal performance. In general, allocating around 5 cores per Executor tends to be a fair allocation when using the Hadoop Distributed File System (HDFS). When running Spark alongside Hadoop daemons, it is vital to reserve a portion of the available cores for these daemons.
This ensures that the Hadoop infrastructure functions smoothly alongside Spark. The remaining cores can then be distributed among the Spark Executors for executing data processing tasks. By striking a balance between allocating cores to Hadoop daemons and Spark Executors, you can ensure that both systems coexist effectively, enabling efficient and parallel processing of data. It is important to adjust the allocation based on the specific requirements of your cluster and workload to achieve optimal performance.

Spark Executor Instances

The number of Spark executor instances represents the total count of executor instances that can be spawned across all worker nodes for data processing. To calculate the total number of cores consumed by a Spark application, multiply the number of executors by the cores allocated per executor. The Spark UI shows the actual utilization of cores during task execution, indicating the extent to which the available cores are being used. It is recommended to maximize this utilization based on the availability of system resources. By effectively using the available cores, you can boost your Spark application's processing power and improve its overall performance. It is crucial to review the resources in your cluster and adjust the number of executor instances and the cores given to each executor to match. This ensures resources are used effectively and extracts the most computational power from your Spark application.

Executor and Driver Memory

The memory configuration for Spark's Driver and Executors plays a critical role in determining the available memory for these components. It is important to tune these values based on the memory requirements of your Spark application and the memory availability within your YARN scheduler and NodeManager resource allocation parameters.
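As a sanity check on these settings, the sizing arithmetic above can be sketched in a few lines. The node figures mirror this article's test bed (192 cores and 768 GB per node, 3 nodes); the reserved-core count and overhead fraction are illustrative assumptions, not values stated in the article.

```java
// Back-of-the-envelope Spark executor sizing; not a Spark API, just arithmetic.
public class ExecutorSizing {
    public static int executorsPerNode(int coresPerNode, int reservedCores, int coresPerExecutor) {
        // Reserve cores for Hadoop/YARN daemons, then pack executors.
        return (coresPerNode - reservedCores) / coresPerExecutor;
    }

    public static int executorMemoryGb(int usableMemoryGb, int executorsPerNode, double overheadFraction) {
        // Leave headroom for off-heap executor memory overhead (often ~10%).
        return (int) (usableMemoryGb / (double) executorsPerNode * (1.0 - overheadFraction));
    }

    public static void main(String[] args) {
        int perNode = executorsPerNode(192, 6, 5);            // 37 executors per node
        int total = perNode * 3;                              // 111 across a 3-node cluster
        int memGb = executorMemoryGb(720, perNode, 0.10);     // 17 GB per executor
        System.out.println(perNode + " executors/node, " + total + " total, " + memGb + " GB each");
    }
}
```

With 5 cores per executor, this lands near the values used in the Appendix (spark.executor.instances 108, spark.executor.memory 18g).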
The Executor's memory refers to the memory allocated for each Executor, while the Driver's memory is the memory allocated for the Spark Driver. These values should be adjusted carefully to ensure optimal performance and avoid memory-related issues. When tuning the memory configuration, it is essential to consider the overall memory availability in your environment as well as any memory constraints imposed by the YARN scheduler and NodeManager settings. By aligning the memory allocation with the available resources, you can optimize memory utilization and prevent out-of-memory errors or performance degradation (swapping or disk spills). It is recommended to monitor memory usage with the Spark UI and adjust the configuration iteratively to achieve the best performance for your Spark workload.

Benchmark Tools

We used both Intel HiBench and TPC-DS benchmarking tools to measure the performance of the clusters.

TeraSort

We used the HiBench benchmarking tool to measure TeraSort performance. HiBench is a popular benchmarking suite specifically designed for evaluating the performance of Big Data frameworks, such as Apache Hadoop and Apache Spark. It consists of a set of workload-specific benchmarks that simulate real-world Big Data processing scenarios. For additional information, refer to the HiBench documentation. By running HiBench on the cluster, you can assess and compare its performance in handling various Big Data workloads. The benchmark results provide insights into factors such as data processing speed, scalability, and resource utilization for each cluster.

1. Update the hibench.conf file, including the scale, profile, and parallelism parameters and the list of master and slave nodes.
2. Run ~/HiBench/bin/workloads/micro/terasort/prepare/prepare.sh.
3. Run ~/HiBench/bin/workloads/micro/terasort/spark/run.sh.

After executing the above, a file named hibench.report will be generated within the report directory.
Additionally, a file named bench.log will contain comprehensive information regarding the execution. The cluster used a data set of 3 TB. We measured the total power consumed, CPU power, CPU utilization, and other parameters like disk and network utilization using Grafana and IPMI tools. Throughput from the HiBench run was calculated for TeraSort in the following scenarios:

1. Spark running on a single AmpereOne® M node compared with a single-node Ampere Altra – 128C (prior generation)
2. Spark running on a single AmpereOne® M node compared with a 3-node AmpereOne® M cluster, to measure scalability
3. Spark running on a 3-node AmpereOne® M cluster with a 64k page size vs. a 4k page size

TPC-DS

TPC-DS is an industry-standard decision-support benchmark that models various aspects of a decision-support system, including data maintenance and query processing. Its purpose is to assist organizations in making informed decisions regarding their technology choices for decision support systems. TPC benchmarks aim to provide objective performance data that is relevant to industry users. For more in-depth information, refer to tpc.org/tpcds. Similar to the TeraSort testing, we ran the TPC-DS benchmark on AmpereOne® M processors using both single-node and 3-node cluster configurations to compare performance with the prior-generation Ampere Altra – 128C processors and to assess scalability. Additional performance evaluations on the AmpereOne® M processor compared Linux kernels configured with 64KB and 4KB page sizes. This test also used a 3 TB dataset across the cluster. To gain deeper insights into system performance, we monitored key metrics including total system power consumption, CPU power, CPU utilization, and network utilization.

Performance Tests on 3-Node Clusters

Figures 3 and 4

We evaluated Spark TeraSort performance using the HiBench tool.
The tests were run on one, two, and three nodes with AmpereOne® M processors and compared with the earlier values obtained on Ampere Altra – 128C. From Figure 3, it is evident that there is a 30% benefit of AmpereOne® M over Ampere Altra – 128C while running Spark TeraSort. This increase in performance can be attributed to a newer microarchitecture design, an increase in core count (from 128 to 192), and the 12-channel DDR5 design on AmpereOne® M (versus 8-channel DDR4 on Ampere Altra – 128C). The output for the 3-node configuration, as shown in Figure 4, was found to be close to three times the output of a single node.

64k Page Size

Figure 5

We observed a significant performance increase, approximately 40%, with a 64k page size on the Arm64 architecture while running the Spark TeraSort benchmark. Most modern Linux distributions support largemem kernels natively. We have not observed any issues while running Spark TeraSort benchmarks on largemem kernels.

Performance Per Watt on AmpereOne® M

Figure 6

To evaluate the energy efficiency of the cluster, we computed the performance-per-watt (perf/watt) ratio. This metric is derived by dividing the cluster's measured throughput (megabytes per second) by its total power consumption (watts) during the benchmarking interval. In these assessments, we observed AmpereOne® M performing 35% better than its predecessor on the Spark TeraSort benchmark.

OS Metrics While Running TeraSort Benchmark

Figure 7

The above image is a snapshot from the Grafana dashboard captured while running the TeraSort benchmark. During the HiBench test, the systems reached CPU utilization of up to 90%. We observed disk read/write activity of approximately 15 GB/s and network throughput of 20 GB/s. Since both the observed I/O and network throughput were significantly below the cluster's scalable limits, the results confirm that the benchmark successfully pushed the CPU to its maximum capacity.
We observed from the above graphs that AmpereOne® M not only drove disk and network I/O higher than Ampere Altra – 128C, but also completed tasks considerably faster.

Power Consumption

Figure 8

The graph illustrates the power consumption of the cluster nodes, the platform, and the CPU. The power was measured using the IPMI tool during the benchmark run. We observe that the AmpereOne® M cluster consumed more power than the Ampere Altra – 128C cluster. This is not surprising, given that the latest-generation AmpereOne® M systems have 50% more compute cores and support 50% more memory channels. Additionally, as shown earlier, this increased power usage also delivered notably higher TeraSort throughput as well as better power efficiency (perf/watt) on AmpereOne® M (Figure 6).

TPC-DS Performance

Figures 9 and 10

The TPC-DS benchmarking tool was used to execute the TPC-DS workload on the clusters. The performance evaluation was based on the total time required to execute all 99 SQL queries on the cluster. Queries on AmpereOne® M completed in 50% less time than those run on Ampere Altra – 128C. The TPC-DS scalability improvement observed between 1 and 3 nodes was smaller than the scalability seen with TeraSort.

64k Page Size

Figure 11

TPC-DS queries got a 9% boost by moving to a 64k page size kernel.

Conclusion

This paper presents a reference architecture for deploying Spark on a multi-node cluster powered by AmpereOne® M processors and compares the results with an earlier deployment based on Ampere Altra – 128C processors. The latest TeraSort benchmark results reinforce the conclusions of earlier studies, demonstrating that Arm64-based data center processors provide a compelling, high-performance alternative to traditional x86 systems for Big Data workloads. Extending this analysis, the evaluation of the 12-channel DDR5 AmpereOne® M platform shows measurable improvements in both raw throughput and performance-per-watt compared to previous-generation processors.
These gains confirm that the AmpereOne® M is a groundbreaking platform designed for data centers and enterprises that prioritize performance, efficiency, and sustainability. Big Data workloads demand substantial computational resources and persistent storage, and by deploying these applications on Ampere processors, organizations benefit from both scale-up and scale-out architectures, enabling efficient growth while maintaining consistent throughput.

For more information, visit our website at https://www.amperecomputing.com. If you’re interested in additional workload performance briefs, tuning guides, and more, please visit our Solutions Center at https://amperecomputing.com/solutions.

Appendix

/etc/sysctl.conf

Shell
kernel.pid_max = 4194303
fs.aio-max-nr = 1048576
net.ipv4.conf.default.rp_filter=1
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 25000
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.core.rmem_default = 33554431
net.core.wmem_default = 33554432
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 8192 33554432 2147483647
net.ipv4.tcp_wmem = 8192 33554432 2147483647
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_adv_win_scale=1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv4.conf.all.arp_filter=1
net.ipv4.tcp_retries2=5
net.ipv6.conf.lo.disable_ipv6 = 1
net.core.somaxconn = 65535
#memory cache settings
vm.swappiness=1
vm.overcommit_memory=0
vm.dirty_background_ratio=2

/etc/security/limits.conf

Shell
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536

Miscellaneous Kernel changes

Shell
#Disable Transparent Huge Page defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
#MTU 9000 for 100Gb private interface and CPU governor on performance mode
ifconfig enP6p1s0np0 mtu 9000 up
cpupower frequency-set --governor performance

.bashrc file

Shell
export JAVA_HOME=/home/hadoop/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export SPARK_HOME=/home/hadoop/spark
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

core-site.xml

XML
<configuration> <property> <name>fs.defaultFS</name> <value>hdfs://<server1>:9000</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> <name>io.native.lib.available</name> <value>true</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>io.compression.codec.snappy.class</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> </configuration>

hdfs-site.xml

XML
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.blocksize</name> <value>536870912</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file:/home/hadoop/hadoop_store/hdfs/namenode</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/lib/hadoop-hdfs/dn_socket</value> </property> </configuration>

yarn-site.xml

XML
<configuration> <!-- Site specific YARN configuration properties --> <property> <name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value><server1></value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>1024</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>81920</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>186</value> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>4</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>737280</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>186</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> </configuration>

mapred-site.xml

XML
<configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>yarn.app.mapreduce.am.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.map.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME, LD_LIBRARY_PATH=$LD_LIBRARY_PATH </value> </property> <property> <name>mapreduce.reduce.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.application.classpath</name> <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib-examples/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/sources/*, $HADOOP_MAPRED_HOME/share/hadoop/common/*, $HADOOP_MAPRED_HOME/share/hadoop/common/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value><server1>:10020</value> </property> <property>
<name>mapreduce.jobhistory.webapp.address</name> <value><server1>:19888</value> </property> <property> <name>mapreduce.map.memory.mb</name> <value>2048</value> </property> <property> <name>mapreduce.map.cpu.vcore</name> <value>1</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>4096</value> </property> <property> <name>mapreduce.reduce.cpu.vcore</name> <value>1</value> </property> <property> <name>mapreduce.map.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx2g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.task.timeout</name> <value>6000000</value> </property> <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property> <property> <name>mapreduce.map.output.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress</name> <value>true</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.type</name> <value>BLOCK</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.reduce.shuffle.parallelcopies</name> <value>32</value> </property> <property> <name>mapred.reduce.parallel.copies</name> <value>32</value> </property> </configuration>

spark-defaults.conf

Shell
spark.driver.memory 32g # used driver memory as 64g for TPC-DS
spark.dynamicAllocation.enabled=false
spark.executor.cores 5
spark.executor.extraJavaOptions=-Djava.net.preferIPv4Stack=true -XX:+UseParallelGC -XX:ParallelGCThreads=32
spark.executor.instances 108
spark.executor.memory 18g
spark.executorEnv.MKL_NUM_THREADS=1
spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark.files.maxPartitionBytes 128m
spark.history.fs.logDirectory hdfs://<Master Server>:9000/logs
spark.history.fs.update.interval 10s
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
spark.io.compression.snappy.blockSize=512k
spark.kryoserializer.buffer 1024m
spark.master yarn
spark.master.ui.port 8080
spark.network.crypto.enabled=false
spark.shuffle.compress true
spark.shuffle.spill.compress true
spark.sql.shuffle.partitions 12000
spark.ui.port 8080
spark.worker.ui.port 8081
spark.yarn.archive hdfs://<Master Server>:9000/spark-libs.jar
spark.yarn.jars=/home/hadoop/spark/jars/*,/home/hadoop/spark/yarn/*

hibench.conf

Shell
hibench.default.map/shuffle.parallelism 12000 # 3-node cluster
hibench.scale.profile bigdata
# the bigdata size is configured as hibench.terasort.bigdata.datasize 30000000000 in ~/HiBench/conf/workloads/micro/terasort.conf

Check out the full Ampere article collection here.

By RamaKrishna Nishtala
Optimizing Java Back-End Performance: Profiling and Best Practices

The dashboard turned red on a weekday. Our order processing API latency jumped from fifty milliseconds to five seconds. Customer support tickets flooded in. Users reported timeouts during checkout. The infrastructure team scaled up the Kubernetes pods, but the issue persisted. CPU usage sat at 100 percent across all nodes. We were throwing hardware at a software problem. This approach failed miserably. In this article, I will share how we diagnosed the bottleneck. I will explain the profiling tools we used. I will detail the code changes that restored performance. This is not a theoretical guide. It is a record of a real production incident and the steps we took to resolve it.

The Incident: Silent Degradation

Our backend was built on Spring Boot with Hibernate. It handled complex transaction logic for a retail platform. The system worked fine during development. Load testing showed acceptable results. However, production traffic patterns differed significantly from our tests. Users browsed catalogs for hours before purchasing. This created long-lived sessions and large heaps. The garbage collector struggled to keep up. We noticed frequent stop-the-world pauses in the logs. These pauses coincided with the latency spikes. We initially suspected database locks. We checked slow query logs. The queries looked efficient. Indexes were in place. Connection pool usage was normal. This ruled out the database as the primary culprit. We then looked at the application layer. Thread dumps showed many threads in a WAITING state. They were waiting for locks or IO. This pointed towards contention or resource exhaustion within the JVM.

Profiling Strategy: Stop Guessing

We stopped guessing and started profiling. Guessing leads to wasted time. Profiling provides data. We enabled Java Flight Recorder (JFR) in production. JFR has low overhead and captures detailed runtime events. We recorded a twenty-minute session during peak traffic.
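Enabling JFR as described can be done on a running JVM with jcmd, or via a startup flag; the PID and file paths below are examples.

```shell
# Start a 20-minute JFR recording on a running JVM
jcmd <pid> JFR.start duration=20m filename=/tmp/peak-traffic.jfr

# Or enable a recording at startup
java -XX:StartFlightRecording=duration=20m,filename=/tmp/peak-traffic.jfr -jar app.jar
```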
We then analyzed the recording with Java Mission Control, and the flame graphs revealed the truth: a significant portion of CPU time was spent in garbage collection. The G1 collector was working overtime, trying to reclaim memory, but the allocation rate was simply too high. We were creating too many short-lived objects, a pattern known as high object churn. Every request created thousands of temporary objects, which filled the Eden space rapidly and triggered frequent minor GC cycles.

Root Cause: Hidden Allocations

We traced the allocations back to specific code paths. The biggest offender was a utility method used for data transformation. It converted entity objects to DTOs, created new ArrayList instances for every field, and used string concatenation inside loops. These patterns seem harmless in small doses; at scale, they become disastrous. The method allocated memory for every list and every string concatenation, and with thousands of orders per minute, the heap filled up instantly.

We refactored this code to reduce allocations. We reused collections where possible and switched to StringBuilder for string operations. This change reduced object creation by eighty percent. The GC pressure dropped immediately, and latency returned to normal levels.

Database Interaction Issues

Profiling also revealed issues with Hibernate. We noticed many small queries executing sequentially: the classic N+1 select problem. The parent entity loaded correctly, but accessing child collections triggered a new query for every item in the list, so network latency was paid once per query. We fixed this using Entity Graphs, which told Hibernate to fetch the related data in a single JOIN. We defined the graph in the repository layer. This reduced database round-trip times significantly, and CPU usage on the database server also dropped. It showed how backend Java code impacts downstream systems.
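As a sketch of the allocation-heavy pattern described under "Root Cause" and its refactored form (the original code is not reproduced here; the class, methods, and data are invented stand-ins for the real entity-to-DTO mapper):

```java
import java.util.ArrayList;
import java.util.List;

class OrderMapper {

    // Problematic pattern: a throwaway ArrayList per iteration and String
    // concatenation in a loop, so every pass allocates fresh objects.
    static String describeNaive(List<String> items) {
        String summary = "";
        for (String item : items) {
            List<String> tmp = new ArrayList<>(); // temporary list, discarded immediately
            tmp.add(item.trim());
            summary += tmp.get(0) + ";";          // allocates a new String every pass
        }
        return summary;
    }

    // Refactored pattern: one pre-sized StringBuilder, no per-iteration collections.
    static String describeOptimized(List<String> items) {
        StringBuilder sb = new StringBuilder(items.size() * 16);
        for (String item : items) {
            sb.append(item.trim()).append(';');
        }
        return sb.toString();
    }
}
```

The two methods produce identical output, but the refactored version allocates a single StringBuilder instead of a fresh list and string per loop iteration, which is what cut our object churn.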
Optimization is not isolated to one layer.

JVM Tuning and Configuration

Code changes were not enough; we also tuned the JVM flags. The default heap size was too small for our workload, so we increased the maximum heap to match the container limits. We also adjusted the G1GC parameters. We set the initiating heap occupancy percent, which controls when concurrent marking starts. The default is 45 percent; we lowered it to 30 percent, which started GC earlier and prevented full GC pauses. We also enabled logging for GC events so we could monitor heap health continuously. These flags ensured the JVM behaved predictably under load. We tested these settings in staging before applying them to production. Never tune JVM flags without testing: incorrect settings can worsen performance.

Concurrency and Lock Contention

Thread dumps showed contention on shared resources. We used synchronized blocks for caching, and this serialized access created bottlenecks: multiple threads waited on a single lock. We replaced the synchronized blocks with ConcurrentHashMap, which allows concurrent reads and writes. We also reviewed our thread pool settings. The default pool size was insufficient, so we increased the core pool size based on the number of CPU cores and bounded the queue size to prevent unbounded growth. Unbounded queues can lead to OutOfMemoryErrors.

Best Practices for Sustainable Performance

We learned several lessons during this incident and incorporated them into our development process.

- Profile early: Do not wait for production issues. Profile during development and use JFR locally to catch high-allocation patterns.
- Monitor GC metrics: Track GC pause times and frequency, and set alerts for long pauses. This provides early warning of memory issues.
- Avoid N+1 queries: Always check the Hibernate SQL logs. Use JOIN FETCH or Entity Graphs for relationships.
- Minimize allocations: Reuse objects where safe, avoid creating temporary collections in loops, and use StringBuilder for strings.
- Tune for containers: Ensure JVM flags respect container limits. Use MaxRAMPercentage instead of fixed Xmx values.
- Load test realistically: Simulate production traffic patterns. Long sessions and high concurrency reveal different bugs than simple unit tests.
- Use async processing: Offload heavy tasks to background queues. Do not block HTTP threads for long operations.

Conclusion

Performance optimization is a continuous journey. Our incident taught us that hardware scaling is not a silver bullet; we had to dig into the code and the JVM. Profiling tools gave us the visibility we needed. Code refactoring reduced the load on the garbage collector, database optimization reduced I/O wait times, and JVM tuning ensured stable runtime behavior. The system is now stable, and latency is consistent even during peak traffic. We continue to monitor performance metrics closely and treat performance as a feature, not an afterthought. Java provides powerful tools for building high-performance backends; we must use them wisely. Happy profiling, and keep your systems fast.

By Ramya Vani Rayala
Build High-Performance Web Systems Using Adaptive Edge-Native Performance Governance Framework

Today’s websites are no longer simple; they function as enterprise applications and distributed systems composed of multiple layers: microservices APIs (application programming interfaces that let software components communicate), edge delivery networks (systems that serve content to users from the nearest location), JavaScript-heavy frontends, analytics integrations, personalization engines, and third-party marketing scripts. As these systems grow in complexity, performance problems inevitably appear. Most organizations try to fix performance by running optimization initiatives, but the real problem is that performance is usually treated as a one-time optimization project instead of a continuous engineering discipline. This is the idea behind the Adaptive Edge-Native Performance Governance Framework (AEPGF): the framework helps teams prevent regressions before they reach production.

Why Performance Problems Keep Coming Back

Many teams focus on improving performance using isolated techniques such as:

- Enabling CDN caching
- Compressing images
- Bundling JavaScript
- Optimizing backend queries

While these steps help, they rarely solve the underlying issue. Performance problems typically arise from system-wide complexity, not just one slow component. Some of the most common causes include:

- JavaScript bundles gradually growing larger
- New third-party scripts being added
- Increasing media payload sizes
- Lack of performance checks in CI/CD pipelines

Even small latency variations can accumulate across complex systems and eventually impact user experience. To prevent this, performance needs to be governed continuously rather than optimized occasionally, which means building regular performance checks and monitoring into the CI/CD pipeline.

The Adaptive Edge-Native Performance Governance Framework (AEPGF)

The AEPGF addresses these challenges by embedding performance management into architecture, development, and operations.
Instead of focusing on isolated optimizations, the framework integrates six key layers:

- Edge-first delivery
- Adaptive asset optimization
- JavaScript governance
- DevOps performance budgets
- Observability feedback loops
- Sustainability-aware delivery

Together, these layers create a continuous performance governance pipeline.

Edge-First Delivery

One of the biggest improvements in modern web architecture is the shift toward edge computing. Edge delivery networks reduce latency by moving content closer to users. Instead of every request reaching a centralized server, much of the processing happens at the network edge. Typical edge strategies include:

- CDN caching
- Edge compute functions
- Dynamic compression
- Request coalescing
- Edge-side rendering

For globally distributed applications, edge delivery improves both speed and consistency across regions.

Adaptive Asset Optimization

Images and media assets often account for the largest portion of a webpage’s payload. Rather than delivering the same asset to every user, adaptive delivery dynamically selects optimized versions based on factors such as:

- Device type
- Screen resolution
- Browser capabilities
- Network speed

Common techniques include:

- Modern image formats like AVIF or WebP
- Responsive image variants
- Lazy loading
- CDN-based image transformation

These optimizations significantly reduce network transfer time and improve rendering performance.

JavaScript Governance

JavaScript execution is one of the most common causes of slow user interactions. Over time, JavaScript bundles often grow due to framework dependencies, analytics integrations, and feature additions.
The framework introduces governance practices such as:

- Route-level code splitting
- Tree shaking and dead-code removal
- Web Worker offloading
- Strict bundle ownership policies
- Server-side tagging for analytics

Server-side tagging is particularly helpful because it moves analytics processing away from the browser, reducing main-thread blocking.

Performance Budgets in CI/CD

One of the most effective ways to prevent performance regressions is to enforce performance budgets during development. Example targets might include:

- Page weight < 1.5 MB
- Largest Contentful Paint < 2.5 seconds
- Interaction to Next Paint < 200 ms
- Cumulative Layout Shift < 0.1

These limits can be enforced automatically using CI/CD tools such as Lighthouse CI. Example configuration:

```json
{
  "ci": {
    "collect": {
      "numberOfRuns": 3,
      "url": ["https://example.com"]
    },
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["warn", {"maxNumericValue": 2500}],
        "cumulative-layout-shift": ["error", {"maxNumericValue": 0.1}]
      }
    }
  }
}
```

If performance metrics exceed the defined thresholds, the build fails automatically. This ensures performance regressions are caught before deployment.

Observability and Real User Monitoring

You can only improve what you can measure. The framework integrates telemetry from multiple sources, including:

- Real user monitoring (RUM)
- Synthetic performance tests
- Distributed tracing
- Infrastructure metrics

Example RUM instrumentation:

```javascript
import { onCLS, onINP, onLCP } from "web-vitals";

function report(metric) {
  navigator.sendBeacon("/rum", JSON.stringify(metric));
}

onCLS(report);
onINP(report);
onLCP(report);
```

Real user monitoring captures performance data from actual devices and networks, which often reveals latency and user-experience issues that synthetic testing misses.
Real-World Results

Organizations that implemented performance governance across large platforms reported measurable improvements:

| Metric | Improvement |
|---|---|
| Page weight | 50–65% reduction |
| Largest Contentful Paint | ~40% improvement |
| JavaScript execution time | ~45% reduction |
| Transaction completion | ~12% increase |
| Mobile conversion | 15–20% uplift |
| Infrastructure cost | ~15% reduction |

These results show that performance optimization can deliver both technical and business impact.

Lessons Learned

Teams adopting performance governance often discover several important lessons:

- Performance regressions are often caused by process changes rather than individual code issues
- Third-party scripts frequently introduce hidden latency
- Edge delivery improves consistency across geographic regions
- Real user monitoring is essential for accurate performance insights
- Enforcing performance budgets encourages developers to consider performance earlier in development

Final Thoughts

With recent advances in agentic AI, this process can be made easier: agents can continuously check telemetry, spot performance issues, enforce budgets during CI/CD validation, and suggest or apply improvements before problems affect users. In this model, autonomous performance agents serve as smart overseers that keep performance steady in changing systems, allowing organizations to provide fast and dependable digital experiences while maintaining engineering speed and growth.

By Jerald Selvaraj
Hadoop on AmpereOne Reference Architecture

Ampere processors built on the Arm architecture deliver superior power efficiency and cost advantages compared to traditional x86 architectures. Hadoop, with its core components and broader ecosystem, is fully compatible with Arm-based platforms. Ampere Computing has previously published a comprehensive reference architecture demonstrating Hadoop deployments on Ampere® Altra® processors. This paper builds on that foundation and extends the analysis by highlighting Hadoop performance on the next-generation AmpereOne® M processor.

Scope and Audience

The scope of this document includes setting up, tuning, and evaluating the performance of Hadoop on a testbed with AmpereOne® M processors. It also compares the performance of the 12-channel AmpereOne® M processor against the previous-generation Ampere Altra 128-core processor (Ampere Altra – 128C). In addition, the document evaluates the use of 64k page-size kernels and highlights the resulting performance improvements over traditional 4k page-size kernels. The document provides step-by-step guidance for installing and tuning Hadoop on single- and multi-node clusters. These recommendations serve as general guidelines for cluster configuration, and parameters can be further optimized for particular workloads and use cases.

This document is intended for a diverse audience, including sales engineers, IT and cloud architects, IT and cloud managers, and customers seeking to leverage the performance and power-efficiency benefits of Ampere Arm servers in their data centers. It aims to provide technical guidance to professionals interested in implementing Arm-based Hadoop solutions and optimizing their infrastructure.

AmpereOne® M Processors

The AmpereOne® M processor is part of the AmpereOne® family of high-performance server-class processors, engineered to deliver exceptional performance for AI compute and a broad spectrum of mainstream data center workloads.
Data-intensive applications such as Hadoop and Apache Spark benefit directly from the processor’s 12 DDR5 memory channels, which provide the bandwidth required for large-scale data processing. AmpereOne® M processors introduce a new platform architecture featuring higher core counts and additional memory channels, distinguishing them from Ampere’s previous platforms while preserving Ampere’s Cloud Native design principles. AmpereOne® M was designed from the ground up for cloud efficiency and predictable scaling. Each vCPU maps one-to-one to a physical core, ensuring consistent performance without resource contention. With up to 192 single-threaded cores and twelve DDR5 channels running at 5600 MT/s, AmpereOne® M sustains the throughput required for demanding workloads, ranging from large language model (LLM) inference to real-time analytics. In addition, AmpereOne® M delivers exceptional performance per watt, reducing operational costs, energy consumption, and cooling requirements, making it well suited for sustainable, high-density data center deployments.

Hadoop on Ampere Processors

There has been a significant shift towards Arm-based processors in data centers over the past several years. Arm-based processors are increasingly used for distributed computing and offer compelling advantages for Hadoop deployments, a few of which are discussed in this paper. The Hadoop ecosystem is written in Java and runs seamlessly on Arm processors. Most Linux distributions, file systems, and open-source tools commonly used with Hadoop provide native Arm support. As a result, migrating existing Hadoop clusters (brownfield deployments) or deploying new clusters (greenfield deployments) on Arm-based infrastructure can be accomplished with little to no disruption. Running Hadoop’s distributed processing framework on energy-efficient Ampere processors represents an important evolution in big data infrastructure.
This approach enables more sustainable, power-efficient, and cost-effective Hadoop deployments while maintaining performance and scalability.

Big Data Architecture

The scale, complexity, and unstructured nature of modern data generation exceed the capabilities of traditional software systems. Big data applications are purpose-built to manage and analyze these complex datasets. Big data is defined not only by Volume but also by the Velocity at which data is generated and processed, the Variety of formats it spans (from structured numerical data to unstructured text, images, and video), its Veracity (the quality and accuracy of the data), and the Value it delivers. Together, these characteristics create both significant challenges and unprecedented opportunities for insight and innovation.

Big data includes structured, semi-structured, and unstructured data that is analyzed using advanced analytics. Typical big data deployments operate at terabyte and petabyte scale, with data continuously created and collected over time. The big data domain covers the ingestion, processing, and analysis of datasets that are too large, fast-moving, or complex for traditional data processing systems. Sources of big data are limitless and include Internet of Things (IoT) sensors, social media activity, e-commerce transactions, satellite imagery, scientific instruments, web logs, and more. The true power of big data is realized by extracting meaningful insights from this diverse and often unstructured information. By applying advanced analytics such as artificial intelligence (AI) and machine learning (ML), organizations can predict trends, gain a deeper understanding of customer behavior and market dynamics, and identify operational inefficiencies at scale.
Big data solutions involve the following types of workloads:

- Batch processing of big data sources at rest
- Real-time processing of big data in motion
- Interactive exploration of big data
- Predictive analytics and machine learning

Hadoop Ecosystem

The Apache Hadoop software library facilitates scalable, fault-tolerant, distributed computing by providing a framework for processing large volumes of data across commodity hardware clusters. Designed to scale from single-node deployments to thousands of machines, Hadoop distributes both storage, through the Hadoop Distributed File System (HDFS), and computation, via MapReduce and YARN. Hadoop incorporates built-in fault tolerance to handle the node failures common in large clusters. Through resilient software techniques such as data replication, the platform maintains high availability and ensures continuous data processing, even during infrastructure failures.

By leveraging distributed computing and a resilient data management framework, Hadoop enables efficient processing and analysis of massive datasets. The platform supports a wide spectrum of data-intensive workloads, including data analytics, data mining, and machine learning, providing organizations with the scalability, reliability, and performance required for complex data processing at scale. The four main elements of the ecosystem are the Hadoop Distributed File System (HDFS), MapReduce, Yet Another Resource Negotiator (YARN), and Hadoop Common.

Hadoop Distributed File System (HDFS)

As the primary storage layer of Hadoop, HDFS manages datasets across distributed nodes. Its architecture ensures high scalability and fault tolerance through data replication and redundancy. HDFS divides data into fixed-size blocks and distributes them across the cluster, optimizing the system for parallel processing and high-throughput data access.

MapReduce

MapReduce is a programming model and processing framework for distributed data processing within the Hadoop ecosystem.
It enables parallel execution by dividing workloads into smaller tasks that are distributed across cluster nodes. The Map phase processes data in parallel, and the Reduce phase aggregates and summarizes the results. MapReduce is commonly used for batch processing and large-scale data analytics workloads.

Yet Another Resource Negotiator (YARN)

YARN is the cluster resource management layer of the Hadoop ecosystem. It is responsible for resource allocation, scheduling, and workload coordination across the cluster. YARN enables multiple processing frameworks, such as MapReduce, Apache Spark, and Apache Flink, to run concurrently on the same infrastructure, allowing diverse workloads to share cluster resources efficiently.

Hadoop Common

Hadoop Common is a foundational component of the Hadoop ecosystem, providing the shared libraries and utilities that all Hadoop modules need to operate. It delivers core services including authentication, security protocols, and file system interfaces, ensuring consistency and interoperability across the ecosystem’s components. Hadoop Common has officially supported Arm-based platforms since version 3.3.0, including native libraries optimized for the Arm architecture. This support enables seamless deployment and operation of Hadoop on modern Arm-based infrastructure.

Figure 1

Hadoop Test Bed

A 3-node cluster with AmpereOne® M processors was set up for performance benchmarking.

Equipment Under Test

- Cluster Nodes: 3
- CPU: AmpereOne® M
- Sockets/Node: 1
- Cores/Socket: 192
- Threads/Socket: 192
- CPU Speed: 3200 MHz
- Memory Channels: 12
- Memory/Node: 768 GB
- Network Card/Node: 1 x Mellanox ConnectX-6
- Storage/Node: 4 x Micron 7450 Gen 4 NVMe
- Kernel Version: 6.8.0-85 (Ubuntu 24.04.3)
- Hadoop Version: 3.3.6
- JDK Version: JDK 11

Hadoop Installation and Cluster Setup

OS Install. The majority of modern open-source and enterprise-supported Linux distributions offer full support for the AArch64 architecture.
To install your chosen operating system, use the server’s KVM (keyboard, video, mouse) console to map or attach the OS installation media, and then follow the standard installation procedure.

Networking Setup. Set up a public network on one of the available interfaces for client communication; this can be used to log in to any of the servers where client access is needed. Set up a private network for communication between the cluster nodes.

Storage Setup. Choose a drive for the OS installation, clear any old partitions, reformat it, and install the OS on it. A Samsung 960 GB M.2 drive was chosen for the OS installation in this setup. Add additional high-speed NVMe drives for the HDFS file system.

Create Hadoop User. Create a user named “hadoop” as part of the OS installation and give the user the necessary sudo privileges.

Post-Install Steps. Perform the following post-installation steps on all the nodes after the OS installation:

- Run yum or apt update on the nodes.
- Install packages such as dstat, net-tools, nvme-cli, lm-sensors, linux-tools-generic, python, and sysstat for your monitoring needs.
- Set up ssh trust between all the nodes.
- Update the /etc/sudoers file for nopasswd for the hadoop user.
- Update /etc/security/limits.conf per the Appendix.
- Update /etc/sysctl.conf per the Appendix.
- Set the scaling governor to performance and disable transparent hugepages per the Appendix.
- If necessary, make changes to /etc/rc.d to keep the above changes permanent across reboots.
- Set up the NVMe disks as an XFS file system for HDFS:
  - Zap and format the NVMe disks.
  - Create a single partition on each NVMe disk with fdisk or parted.
  - Create a file system on each of the created partitions: mkfs.xfs -f /dev/nvme[0-n]n1p1.
  - Create directories for mounting under root: mkdir -p /root/nvme[0-n]1p1.
  - Update /etc/fstab with entries to mount the file system.
The UUID of each partition, needed for the fstab update, can be extracted with the blkid command. Change ownership of these directories to the ‘hadoop’ user created earlier.

Hadoop Install

Download Hadoop 3.3.6 from the Apache website and JDK 11 for Arm/AArch64. Extract the tarballs under the Hadoop home directory. Update the Hadoop configuration files in ~/hadoop/etc/hadoop/ and the environment parameters in .bashrc per the Appendix. Depending on the hardware specifications of cores, memory, and disk capacities, these parameters may have to be altered. Update the workers file to include the set of data nodes. Run the following commands:

```shell
hdfs namenode -format
scp -r ~/hadoop <datanodes>:~/hadoop
~/hadoop/sbin/start-all.sh
```

This should start the NodeManager, ResourceManager, NameNode, and DataNode processes on the nodes. Note that the NameNode and ResourceManager are started only on the master node.

Verification of the setup:

- Run the jps command on each node to check the status of the Hadoop daemons.
- Verify that -ls, -put, -du, and -mkdir commands can be run against the cluster.

Performance Tuning

Hadoop is a complex framework in which many components interact across multiple systems. Overall performance is influenced by several distinct factors:

- Platform settings: Configurations at the hardware and operating system levels, such as BIOS settings, specific OS parameters, and the performance of the network and disk subsystems.
- Hadoop configuration: The configuration of the Hadoop software stack itself also plays a critical role in efficiency. Optimizing these settings typically requires prior experience with Hadoop.

Performance tuning is an iterative process; the parameters provided in the Appendix are reference values obtained through a few iterations.
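As a sketch of the OS-level settings referenced in the post-install steps (the sysfs paths are standard on Linux; persist the settings via your distribution’s preferred mechanism, such as /etc/rc.d as noted above):

```shell
# Set the CPU frequency scaling governor to performance on all cores
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable transparent hugepages
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Example fstab entry mounting an NVMe partition with noatime for HDFS data
# UUID=<uuid-from-blkid>  /root/nvme0n1p1  xfs  defaults,noatime  0 0
```

These settings reduce CPU frequency transitions and memory-management jitter under sustained Hadoop load.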
Linux. Occasionally, conflicts between different subcomponents of a Linux system, such as the networking and disk subsystems, can arise and negatively impact overall performance. The primary objective is to optimize the entire system for maximum disk and network throughput by identifying and resolving any bottlenecks that emerge during operation.

Network. To evaluate the underlying network infrastructure, the iperf utility can be used to conduct stress tests. Performance optimization involves adjusting specific driver parameters, such as the Transmit (TX) and Receive (RX) ring buffers and the number of interrupt queues, to align them with the CPU cores on the Non-Uniform Memory Access (NUMA) node where the Network Interface Card (NIC) resides. However, if the system’s BIOS is already configured in monolithic mode, these NUMA-alignment modifications may not be necessary.

Disks. When optimizing performance in a Hadoop environment, administrators should focus on specific disk subsystem parameters:

- Aligned partitions: Partitions should be aligned with the storage’s physical block boundaries to maximize I/O efficiency. Utilities like parted can be used to create aligned partitions.
- I/O queue settings: Parameters such as the queue depth and nr_requests (number of requests) can be fine-tuned via the /sys/block//queue/ directory paths to control how many I/O operations the kernel schedules for a storage device.
- Filesystem mount options: Using the noatime option in /etc/fstab is critical for Hadoop, as it prevents unnecessary disk writes by disabling the recording of file access timestamps.

The fio (flexible I/O tester) tool is highly effective for benchmarking and validating disk subsystem performance after these changes are implemented.

HDFS, YARN, and MapReduce

HDFS

In HDFS, the primary parameters to consider for data management and resilience are the block size and the replication factor.
By default, the HDFS block size is 128 MB. Files are divided into chunks of this size, which are then distributed across the data nodes. In certain high-performance environments or test beds, a larger block size, such as 512 MB, may be used to optimize throughput for large files. The AmpereOne® M test bed was also set up with a 512 MB block size.

The replication factor (default 3) determines data redundancy. When an application writes data once, HDFS replicates those blocks across the cluster based on this factor, ensuring three identical copies are available for high availability and fault tolerance. Consequently, the total storage space required is directly proportional to the replication factor (a factor of 3 means you need 3x the raw data size in storage capacity). HDFS 3.x introduced Erasure Coding (EC) as an alternative to traditional replication. EC significantly reduces storage overhead; for example, a 6+3 EC configuration provides data redundancy comparable to a 3x replication factor while using substantially less physical storage. Note, however, that while EC saves storage, it introduces additional computational and network load compared to simple replication. In the described test bed environment, a replication factor of 1 was employed.

YARN

YARN (Yet Another Resource Negotiator) is the resource management framework within the Hadoop ecosystem. It offers two main scheduler options: the Fair scheduler and the Capacity scheduler. The Fair scheduler distributes available cluster resources evenly and dynamically among all running applications or jobs over time. The Capacity scheduler allocates a guaranteed, fixed capacity to each queue, user, or job.
By default, if a queue does not fully utilize its reserved capacity, the excess may remain unused or may be conditionally shared, depending on specific configuration parameters. Key configuration settings for either scheduler involve defining the limits for resource allocation: the minimum allocation, the maximum allocation, and the incremental “stepping” values for both memory and virtual CPU cores (vcores). We used the default scheduler configuration in the testing environment.

MapReduce

In the MapReduce framework, a job is broken down into numerous smaller tasks, each designed to have a small memory footprint and use one or a few virtual cores (vcores). Resource allocation within YARN is determined by these task requirements, considering the total memory available to the YARN NodeManager and the total number of vcores it manages. These settings can be adjusted directly in the yarn-site.xml file. The reference parameters used in this test bed are provided in the Appendix.

Benchmark Tools

We used the HiBench benchmarking tool. HiBench is a popular benchmarking suite specifically designed for evaluating the performance of big data frameworks such as Apache Hadoop and Apache Spark. It consists of a set of workload-specific benchmarks that simulate real-world big data processing scenarios. For additional information, refer to the HiBench project documentation. By running HiBench on a cluster, you can assess and compare its performance across various big data workloads. The benchmark results provide insights into factors such as data processing speed, scalability, and resource utilization.
Steps to run HiBench on the cluster:

- Download the HiBench software.
- Update the hibench.conf file: the scale profile, parallelism parameters, and the list of master and slave nodes.
- Run ~HiBench/bin/workloads/micro/terasort/prepare/prepare.sh.
- Run ~HiBench/bin/workloads/micro/terasort/hadoop/run.sh.

The run generates a hibench.report file under the report directory, and a bench.log file provides details of the run. The cluster used a 3 TB dataset. We measured total power consumed, CPU power, CPU utilization, and other parameters such as disk and network utilization using Grafana and IPMI tools. Throughput from the HiBench run was calculated for TeraSort in the following scenarios:

- Hadoop running on a single AmpereOne® M node, to compare with the previous-generation Ampere Altra – 128C.
- Hadoop running on a single AmpereOne® M node versus a 3-node AmpereOne® M cluster, to measure scalability.
- Hadoop running on a 3-node AmpereOne® M cluster with a 64k page size, to compare against a 4k page size on the same processor.

Performance Tests on the AmpereOne® M Cluster

TeraSort Performance

Figures 2 and 3

Using the HiBench tool as described above, we ran Hadoop TeraSort tests on one, two, and three nodes with AmpereOne® M processors and compared the results with values measured earlier on Ampere Altra – 128C. As Figure 2 shows, AmpereOne® M delivers a 40% benefit over Ampere Altra – 128C when running Hadoop TeraSort. This increase in performance can be attributed to a newer microarchitecture, an increase in core count (from 128 to 192), and the 12-channel DDR5 design of AmpereOne® M. Near-linear scalability was observed when running TeraSort: the throughput of the 3-node configuration was very close to three times that of a single node.
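For reference, the HiBench TeraSort steps listed above condense into a short shell sequence (assuming HiBench is checked out under the hadoop user’s home directory; configuration values depend on your cluster):

```shell
cd ~/HiBench
# Set the scale profile, parallelism, and master/slave node lists in conf/hibench.conf first
bin/workloads/micro/terasort/prepare/prepare.sh   # generate the TeraSort input dataset
bin/workloads/micro/terasort/hadoop/run.sh        # run TeraSort via MapReduce
cat report/hibench.report                         # throughput summary of the run
```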
64k Page Size (Figure 4)

We observed a significant performance increase, approximately 30%, with a 64k page size on the Arm architecture while running the Hadoop TeraSort benchmark. Most modern Linux distributions support largemem kernels natively. For other systems, building a custom 64k page size kernel is a straightforward procedure that requires only a standard reboot to apply. We have not observed any issues while running Hadoop TeraSort benchmarks on largemem kernels.

Performance per Watt on AmpereOne® M (Figure 5)

To evaluate the energy efficiency of the cluster, we computed the performance-per-watt (Perf/Watt) ratio. This metric is derived by dividing the cluster's measured throughput (megabytes per second) by its total power consumption (watts) during the benchmarking interval. In these assessments, AmpereOne® M performed 30% better than its predecessor on the Hadoop TeraSort benchmark.

OS Metrics While Running the Benchmark (Figure 6)

Figure 6 is a snapshot from the Grafana dashboard captured while running the benchmark. The systems reached maximum CPU utilization while running the TeraSort benchmark under HiBench. We observed disk read/write activity of approximately 10 GB/s and network throughput of 30 GB/s. Since both were well below the cluster's limits, the results confirm that the benchmark pushed the CPUs to their maximum capacity. The graphs also show that AmpereOne® M not only drove disk and network I/O higher than Ampere Altra – 128C but also completed tasks considerably faster.

Power Consumption (Figure 7)

Figure 7 illustrates the power consumption of the cluster nodes, the platform, and the CPU, measured with the IPMI tool during the benchmark run. The data shows that the AmpereOne® M cluster consumed more absolute power than the Ampere Altra – 128C.
However, this increased power usage was accompanied by higher TeraSort throughput, so the AmpereOne® M cluster delivers better performance per watt (Figure 5).

Conclusions

This paper presents a reference architecture for deploying Hadoop on a multi-node cluster powered by AmpereOne® M processors and compares the results against a prior deployment on Ampere Altra – 128C processors. The latest TeraSort benchmark results validate the findings of earlier studies, demonstrating that Arm-based processors provide a compelling, high-performance alternative to traditional x86 systems for Big Data workloads. Building on this foundation, the evaluation of the 12-channel DDR5 AmpereOne® M platform shows measurable improvements not only in raw throughput but also in performance per watt compared to previous-generation processors. These improvements confirm that AmpereOne® M is a purpose-built platform for modern data centers and enterprises that prioritize both performance and energy efficiency.

AmpereOne® M addresses the core requirements of today's organizations: performance, efficiency, and scalability. Big Data workloads demand significant compute capacity and persistent storage, and by deploying these applications on Ampere processors, organizations benefit from both scale-up and scale-out architectures. This approach enables higher density per rack, reduces power consumption, and delivers consistent throughput at scale.

To learn more about our developer efforts and find best practices, visit Ampere's Developer Center and join the conversation in the Ampere Developer Community.
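The Perf/Watt computation described above is straightforward to reproduce. The sketch below uses illustrative numbers chosen only to mirror the roughly 30% gain, not the measured results.

```python
def perf_per_watt(throughput_mb_s: float, power_w: float) -> float:
    """Cluster throughput (MB/s) divided by total power draw (W)."""
    return throughput_mb_s / power_w

def relative_gain_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0

# Hypothetical figures for illustration only:
altra = perf_per_watt(2100.0, 1400.0)   # 1.5 MB/s per watt
one_m = perf_per_watt(3510.0, 1800.0)   # 1.95 MB/s per watt
print(round(relative_gain_pct(one_m, altra)))  # → 30
```

The same two functions apply to any throughput/power pair captured from the hibench.report and IPMI readings.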
Appendix

/etc/sysctl.conf

Shell
kernel.pid_max = 4194303
fs.aio-max-nr = 1048576
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 25000
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.core.rmem_default = 33554431
net.core.wmem_default = 33554432
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 8192 33554432 2147483647
net.ipv4.tcp_wmem = 8192 33554432 2147483647
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv4.conf.all.arp_filter = 1
net.ipv4.tcp_retries2 = 5
net.ipv6.conf.lo.disable_ipv6 = 1
net.core.somaxconn = 65535
# memory cache settings
vm.swappiness = 1
vm.overcommit_memory = 0
vm.dirty_background_ratio = 2

/etc/security/limits.conf

Shell
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536

Miscellaneous Kernel Changes

Shell
# Disable Transparent Huge Page defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# MTU 9000 for 100Gb private interface and CPU governor on performance mode
ifconfig enP6p1s0np0 mtu 9000 up
cpupower frequency-set --governor performance

.bashrc file

Shell
export JAVA_HOME=/home/hadoop/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export PATH=$PATH:/home/hadoop/.local/bin

core-site.xml

XML
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<server1>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop</value>
  </property>
  <property>
    <name>io.native.lib.available</name>
    <value>true</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>io.compression.codec.snappy.class</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>

hdfs-site.xml

XML
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>536870912</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>

yarn-site.xml

XML
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><server1></value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>186</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>737280</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>186</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>

mapred-site.xml

XML
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME, LD_LIBRARY_PATH=$LD_LIBRARY_PATH</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib-examples/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/sources/*, $HADOOP_MAPRED_HOME/share/hadoop/common/*, $HADOOP_MAPRED_HOME/share/hadoop/common/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value><server1>:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value><server1>:19888</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx2g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value>
  </property>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>32</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>32</value>
  </property>
</configuration>

Check out the full Ampere article collection here.

By RamaKrishna Nishtala
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

- Python 3.11+ installed on your machine
- AWS credentials configured (aws configure or an active IAM role)
- Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
- kubectl and helm v3 installed (only required if you plan to run live remediations; dry-run mode works without them)

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

Shell
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

Plain Text
sre-incident-response-agent/
├── sre_agent.py        # Main agent: 4 agents + 8 tools
├── test_sre_agent.py   # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

Shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt pins the core dependencies:

Shell
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

Shell
cp .env.example .env

Open .env and set the following:

Shell
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=

Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

JSON
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

Shell
python sre_agent.py

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

Shell
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"

Example Output

Running the targeted trigger above produces output similar to the following:

Shell
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
  Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
  Metric stats: avg 91.3%, max 97.8% over last 30 min
  Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
  Root cause: Memory leak causing CPU spike as GC thrashes
  Severity: P2 - single service, <5% of users affected
  Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
  [DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened:
CloudWatch alarm my-api-HighCPU fired at 09:18 UTC. CPU reached
97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause:
Memory leak in application heap leading to aggressive GC, causing
CPU saturation. Likely introduced in the last deployment.

Remediation:
Rolling restart of deployment/my-api in namespace prod initiated
(dry-run). All pods will be replaced with fresh instances.

Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

Shell
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed

Enabling Live Remediation

Once you have validated the agent's behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

Shell
DRY_RUN=false

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning.

From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.
I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.

By Ayush Raj Jha
Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance

What Is Photon

Databricks' Photon engine isn't just a minor update: it's a complete rewrite of how big data is processed. While standard Spark relies on Java, Photon is built from the ground up in C++ to squeeze every drop of power out of modern hardware. By using vectorized execution, it processes data in batches rather than one row at a time, drastically cutting down on CPU bottlenecks. In plain English? Your heaviest workloads run significantly faster and cost less to execute. This shift means less time waiting for queries to finish and more time actually using your data to drive decisions.

Motivation

Databricks built Photon to solve a classic headache: the trade-off between the speed of a data warehouse and the flexibility of a data lake. For years, if you wanted high-speed analytics, you had to move your data into an expensive, proprietary warehouse. If you kept it in a data lake, it stayed flexible but ran painfully slow. Databricks' Lakehouse architecture aims to give you the best of both worlds: warehouse performance directly on top of your open data lake.

While the Delta Lake layer fixed the storage side (adding things like time travel and transactions to raw files), it wasn't enough. Even with organized data, the actual processing often hit a wall because the engine couldn't use the CPU efficiently. That's where Photon comes in. It's a high-speed engine designed to:

- Supercharge Delta Lake: make organized data run at warehouse speeds.
- Handle the mess: stay fast even when dealing with the raw, uncurated data found in typical lakes.
- Stay simple: work with the Spark APIs you already know, so you don't have to rewrite your code.

Essentially, Photon is the high-performance motor that finally makes the Lakehouse dream a reality.

Spark vs. Photon (Architectural Comparison)

Key Differences

As Databricks optimized its storage with tools like NVMe caching, they hit a frustrating wall: the CPU became the new bottleneck. The culprit? The Java Virtual Machine (JVM). While Java is great for many things, it's notoriously hard to optimize for high-performance hardware. It hides the bare metal from developers, making it nearly impossible to use specialized CPU features like SIMD instructions. To fix this, the team made a bold move: they rewrote the engine from scratch in C++.

They also had to choose a processing style. Instead of Spark's traditional code generation, they chose vectorization. This approach processes data in large batches rather than one row at a time. It's not just faster; it's easier to debug and allows the engine to adapt to the data at runtime. By switching to a columnar format, they ensured the data sits in memory exactly how the CPU likes it.

Finally, they made sure Photon plays well with others. It doesn't require an all-or-nothing switch. It integrates into existing Spark plans as a shared library, handling the heavy lifting where it can and passing the rest back to Spark. This hybrid approach ensures your workloads stay safe while getting a massive speed boost.

To get Photon running, Databricks uses Spark's "brain," the Catalyst optimizer, to swap out standard tasks for high-speed Photon versions. Think of it like an automatic upgrade: as the optimizer walks your query plan, it replaces as many steps as possible with Photon's C++ equivalents. However, it's not an all-or-nothing switch. The engine works from the bottom up, upgrading parts of the plan until it hits a task Photon can't handle yet. Because Photon uses a modern columnar format and Spark uses an older row format, switching between them requires a "pivot" that takes time. To keep things efficient, the engine tries to avoid bouncing back and forth too much.

For this to work smoothly, Photon and Spark have to be perfect roommates. They share the same memory pool, meaning Photon asks Spark for permission before grabbing more RAM to prevent crashes.
They even share spill rules: if memory runs low, they both know exactly how to move data to the disk. Finally, Photon is built to be a team player. It reports its performance stats just like Spark does, so your monitoring dashboards still work. Most importantly, every Photon "twin" of a Spark operator is put through a massive testing suite to ensure it gives exactly the same results as the original Spark code, just much faster.

Native Execution vs. JVM

Photon talks directly to your computer's hardware, so the clunky GC pauses are gone. It's like switching from a heavy, generic rental car to a precision-tuned sports car built exactly for the track. The result: processing latency drops by 40–60%, and unpredictable GC spikes disappear from your performance logs entirely.

Python
# Traditional Spark Execution
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Traditional") \
    .config("spark.memory.fraction", "0.8") \
    .getOrCreate()

# Incurs JVM overhead and GC pauses
df = spark.read.parquet("data.parquet") \
    .filter("revenue > 1000") \
    .groupBy("category") \
    .agg({"revenue": "sum"})

# Photon's Native Execution (conceptual)
class PhotonExecutor:
    def process_query(self, data_path: str):
        # Direct CPU instruction execution, zero JVM overhead
        with self.native_reader(data_path) as reader:
            return self.vectorized_aggregate(
                reader.filter_columns(["revenue > 1000"])
            )

Vectorized Processing

Outcome: 3–7x improvement in scan-heavy operations and 2–4x faster joins.

Python
import numpy as np

class VectorizedProcessor:
    """
    A low-level processor designed to maximize CPU throughput by
    leveraging SIMD (Single Instruction, Multiple Data) principles.
    """

    def process_columnar_batch(self, data: np.ndarray, vector_width: int = 8):
        """
        Executes operations on data aligned to CPU cache lines to
        prevent cache misses and minimize instruction overhead.
        """
        # 1. Ensure memory alignment for optimal CPU fetching.
        #    Aligned data allows the memory controller to saturate the bus.
        aligned_data = self._ensure_alignment(data)

        # 2. Vectorized execution loop: processes 8 elements
        #    (for 512-bit registers) per step.
        output = np.zeros_like(aligned_data)
        for i in range(0, len(aligned_data), vector_width):
            # Slicing in NumPy leverages underlying C-based SIMD kernels
            batch = aligned_data[i : i + vector_width]
            output[i : i + vector_width] = np.add(batch, 100)  # Example SIMD op
        return output

    def _ensure_alignment(self, data):
        # Logic to align the buffer to 64-byte boundaries (standard cache line)
        return np.require(data, requirements='A')

Memory Management Revolution

Photon implements zero-copy memory management and cache-conscious data layouts. Outcome: 30–50% reduction in memory usage and improved cache hit rates.

Python
class PhotonMemoryManager:
    def __init__(self):
        self.page_size = 2 * 1024 * 1024  # 2MB huge pages

    def optimize_data_layout(self, data: np.ndarray) -> np.ndarray:
        """
        Implement a cache-conscious data layout
        """
        # Align to CPU cache lines
        aligned_size = (len(data) + 7) & ~7  # Round up to a multiple of 8
        aligned_data = np.zeros(aligned_size, dtype=data.dtype)
        aligned_data[:len(data)] = data
        return self._arrange_for_simd(aligned_data)

A Real-World Example

Python
import numpy as np

class LogAnalyzer:
    """
    A high-performance log analysis utility leveraging the Photon
    engine for vectorized processing.
    """

    def __init__(self):
        self.executor = PhotonExecutor()
        self.memory_mgr = PhotonMemoryManager()

    def run_analysis(self, logs_data: np.ndarray, threshold: int = 400):
        """
        Executes vectorized aggregation on server logs.
        """
        try:
            # Step 1: Optimize memory layout for columnar access
            optimized_layout = self.memory_mgr.optimize_data_layout(logs_data)

            # Step 2: Vectorized batch processing.
            # Photon processes data in small batches to fit into the CPU cache.
            analysis_results = self.executor.process_batch(
                data=optimized_layout,
                aggregations=["count", "avg", "percentile"],
                filters={"status_code": f"> {threshold}"}
            )
            return analysis_results
        except Exception as e:
            print(f"Error during Photon execution: {e}")
            return None
        finally:
            # Ensure memory resources are released
            self.memory_mgr.clear_cache()

# Usage
analyzer = LogAnalyzer()
results = analyzer.run_analysis(logs_data)

Why These Metrics Matter

| Metric | Improvement | Technical Reason |
| Query Latency | 3–7x faster | Native execution, SIMD operations |
| Memory Usage | 40% less | Zero-copy, columnar layout |
| CPU Utilization | 85% vs. 45% | Vectorized processing |
| Cache Hit Rate | 92% vs. 65% | Cache-conscious data layout |

- SIMD operations: allow the CPU to process multiple data points with a single instruction, drastically reducing the clock cycles needed for aggregations.
- Zero-copy and columnar layout: by using a columnar format, the system avoids moving or duplicating data unnecessarily, which lowers memory overhead.
- Cache consciousness: by organizing data to fit the CPU's L1/L2/L3 caches, the engine avoids the "memory wall," where the CPU sits idle waiting for data from RAM.

When deciding between the Databricks Photon engine and the standard Spark runtime, the following implementation considerations should guide your architecture.
Implementation Considerations

| Scenario | Recommendation | Key Drivers |
| Analytical Queries | Use Photon | Best for CPU-intensive tasks and large-scale data aggregations. |
| Interactive Analysis | Use Photon | Ideal for environments requiring low latency and rapid iteration. |
| Production Efficiency | Use Photon | Highly effective for cost-sensitive workloads due to higher CPU throughput. |
| Extensive Custom Logic | Stick with Spark | Necessary when there is heavy use of custom UDFs (user-defined functions) that bypass native execution. |
| Legacy Integration | Stick with Spark | Required for projects with specific Spark ecosystem dependencies not yet supported by Photon. |

Conclusion

Photon's architectural innovations drive substantial performance gains by reimagining how data is processed at the hardware level.

Performance Impact

The following improvements are characteristic of Photon's optimized execution layer:

- Query execution: delivers 3–7x faster performance through native, C++-based processing.
- Resource utilization: reduces overhead by 40–60%, allowing for leaner cluster configurations.
- Operational costs: leads to significantly lower total cost of ownership (TCO) by completing jobs faster.
- Concurrency: provides better scalability and stability for high-volume, concurrent workloads.

Core Architectural Pillars

The transition from traditional Spark to Photon is defined by three fundamental shifts in data engineering:

1. Elimination of JVM overhead: by moving execution out of the Java Virtual Machine (JVM) and into a native C++ environment, Photon removes the performance bottlenecks associated with garbage collection and just-in-time (JIT) compilation.
2. Vectorized processing with SIMD: Photon processes data in batches (vectors) rather than row by row. This allows the engine to leverage SIMD (Single Instruction, Multiple Data), where a single CPU instruction operates on multiple data points simultaneously.
3. Zero-copy memory management: this minimizes the need to move or transform data between different memory layers. By using a columnar format and a cache-conscious layout, Photon ensures the CPU spends less time waiting for data to arrive from RAM.

Strategic Outlook

For organizations managing data-intensive workloads, Photon represents a paradigm shift in processing efficiency. While it offers a clear path to faster insights and lower costs, a workload compatibility evaluation is essential to identify areas where custom UDFs or specific legacy dependencies might still require the standard Spark runtime.
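The batch-versus-row contrast that runs through this article can be felt even in plain NumPy. The toy comparison below is illustrative only: it exercises NumPy's compiled (often SIMD-accelerated) kernels, not Photon itself.

```python
import time
import numpy as np

def scalar_sum(values) -> float:
    # Row-at-a-time processing: one element per interpreter iteration.
    total = 0.0
    for v in values:
        total += v
    return total

def vectorized_sum(values) -> float:
    # Batch processing: one call over the whole column, executed by
    # a compiled kernel instead of the Python interpreter.
    return float(np.sum(values))

data = np.random.rand(1_000_000)
t0 = time.perf_counter(); slow = scalar_sum(data); t1 = time.perf_counter()
fast = vectorized_sum(data); t2 = time.perf_counter()
# Same result either way; the vectorized path is typically orders of
# magnitude faster on a column this size.
```

The gap comes from exactly the mechanisms discussed above: per-element interpreter overhead versus a tight compiled loop over contiguous, cache-friendly memory.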

By Seshendranath Balla Venkata
AI in SRE: What's Actually Coming in 2026

It's 3:14 AM. Your phone buzzes. PagerDuty. Again. You groggily open your laptop and stare at a wall of red in your dashboards. Latency spike. Error rate climbing. Somewhere, something broke.

You start the ritual: check the deploy log, correlate timestamps, grep through metrics, ping the on-call from the upstream team, open six tabs of Splunk queries. Forty-five minutes later, you find it. A config change from Tuesday interacted badly with a traffic pattern that only shows up on Thursday nights. The fix takes three minutes. The investigation took forty-five.

This is the tax we pay. Every single incident.

Now imagine an AI that could have done that forty-five minutes of stitching in under a minute. Not replacing you, just doing the grunt work so you can focus on the actual fix. That's not science fiction anymore. It's the real opportunity in AI SRE. But there's a lot of noise in this space right now, and most of the "2026 predictions" I've seen read like vendor press releases. Let me cut through it.

The Hype vs. The Reality

Here's the secret about AI in operations: most of what's being sold as "AI SRE" is just better search with a chatbot slapped on top. RAG (retrieval-augmented generation) over your runbooks? That's a fancy ctrl+F. Summarizing alerts? Useful, but not transformative. "AI-powered dashboards"? Usually just a natural language query box somewhere.

The actual breakthrough, the thing that changes how we work, is when AI can reason across systems, correlate events across time, and surface the "why" without a human manually connecting the dots. That's starting to happen. And 2026 is when it gets real for most organizations.

What's Actually Changing

1. Root Cause Analysis That Actually Works

This is the proof point: the first use case where AI in SRE delivers undeniable value. Not "here are 47 potentially related events"; that's just noise with extra steps.
I'm talking about AI that can look at your metrics, logs, traces, and recent changes, then tell you: "The latency spike started 3 minutes after deploy #4521 hit production. That deploy changed the connection pool size. Here's the specific service and the specific config."

We've been promising "automated root cause analysis" for a decade. The difference now is that LLMs can actually parse unstructured data (log messages, Slack threads, Confluence pages) and reason about it in context. Is it perfect? No. Will it hallucinate sometimes? Yes. But even 70% accuracy on first-pass RCA is a game-changer when your current process is "senior engineer spends an hour doing archaeology."

2. Pre-Change Impact Analysis (The Real Win)

Here's what I'm most excited about, and almost nobody's talking about it: what if, before you deploy, an AI could tell you "this change is similar to something that caused an incident six months ago in Service X; here's what went wrong and what to watch for"?

This flips the model. Instead of AI helping you clean up faster, it helps you avoid the mess in the first place. The ingredients exist: historical incident data, change logs, system topology, and models that can reason about similarity and causation. Stitching them together into a usable "pre-flight check" is the engineering challenge of 2026. Organizations that figure this out will see step-function improvements in reliability. Not 10% better MTTR, but fundamentally fewer incidents.

3. The Death of "Swivel-Chair" Operations

Every SRE I know has this workflow burned into their muscle memory:

1. Alert fires
2. Open Datadog
3. Open Splunk
4. Open the deploy dashboard
5. Open PagerDuty history
6. Open the Slack channel
7. Start correlating timestamps manually

We call this "swivel-chair operations" or "click ops." It's soul-crushing toil. AI SRE tools are genuinely good at this now. They can pull data from multiple sources, correlate it, and present a unified view.
It's not glamorous, but it's the kind of drudgery reduction that compounds. The 2026 shift: these tools become the default interface for incident response. Not another dashboard to check, but the place where you start.

4. Junior Engineers Get Superpowers

This is underrated. Right now, incident response is heavily skewed toward senior engineers. They're the ones with the mental model of the system, the tribal knowledge of past incidents, the intuition for where to look first.

AI SRE tools can externalize some of that knowledge. A junior engineer with a good AI copilot can perform at a level that used to require years of system-specific experience. This doesn't eliminate the need for senior engineers. It multiplies their impact by letting them focus on the hard problems while AI-assisted juniors handle the routine ones. Organizations that embrace this will have a massive talent leverage advantage.

What's Still Hard (And Overhyped)

Fully Autonomous Remediation

Every vendor wants to sell you "auto-remediation": AI detects the problem, AI fixes it, humans sleep through the night. I'm skeptical. Not because the technology can't get there eventually, but because the failure modes are terrifying. An AI that confidently executes the wrong fix at 3 AM can turn a minor incident into a major outage.

The 2026 reality: AI will suggest remediations, and humans will approve them. Fully autonomous action will be limited to narrow, well-defined scenarios (restart this pod, roll back this specific deploy) with tight guardrails. Anyone selling you "set it and forget it" autonomous operations is selling you a future that's further out than they're admitting.

"AI Replaces SREs"

Not happening. Not in 2026, probably not in 2030. What is happening: the nature of the work changes. Less time on investigation and toil, more time on system design, AI oversight, and strategic reliability work. The SRE role evolves. It doesn't disappear.
If anything, the shortage of good SREs gets worse in the short term, because now you need people who understand both systems and how to work effectively with AI tools. That's a rarer skillset than either alone.

What To Actually Do in 2026

If you're running production systems, here's my practical advice:

1. Pick one AI SRE tool and actually try it. Not a six-month evaluation. Not a proof-of-concept that never leaves staging. Actually put it in front of your on-call rotation for real incidents. You'll learn more in two weeks of real use than in six months of vendor demos.

2. Start with RCA, not remediation. Root cause analysis is the use case where AI delivers value today. Autonomous remediation is where AI might deliver value eventually. Don't get seduced by the sexier demo. Start where the technology actually works.

3. Invest in your data. AI SRE tools are only as good as the data they can access. If your logs are garbage, your metrics are inconsistent, and your deploy history lives in someone's head, no amount of AI magic will save you. The unsexy work of improving observability data quality has never had higher ROI.

4. Train your team on AI collaboration. This is the part everyone skips. Working effectively with AI tools is a skill. Knowing when to trust the AI's suggestion, when to dig deeper, how to prompt effectively, how to validate outputs: this stuff matters. Budget time for it. The teams that treat AI as "another tool to figure out" will underperform teams that deliberately build AI collaboration skills.

The Bottom Line

AI in SRE is real. The hype is real too. Your job is to separate them.

The transformation isn't "AI replaces your team." It's "AI handles the drudgery so your team can focus on what humans are actually good at." That's not a revolution. It's an evolution. But it's an evolution that compounds, and the teams that start now will have a significant advantage over those that wait for the technology to be "ready."

The technology is ready enough.
The question is whether you are.

What's your team's experience with AI in operations? War stories welcome, especially the ones where it didn't work. Those are the ones we learn from.

By Ashly Joseph
Pushdown-First Modernization: Engineering Execution-Plan Stability in SAP HANA Migrations

Most SAP HANA migration failures are not correctness failures. They are plan stability failures that surface only under concurrency. A query that executes in 900 milliseconds in isolation begins to oscillate between 800 milliseconds and 14 seconds under load, with no code change and no data skew obvious enough to blame. The root cause is rarely hardware or memory configuration. In most cases, PlanViz shows large intermediate row counts forming before reduction, with estimated cardinality significantly below actual. The instability originates from translating legacy EDW logic into SAP HANA artifacts without redesigning execution boundaries for a columnar, operator-driven engine.

Pushdown-first modernization is often interpreted as "move everything into SQL." That interpretation is incomplete. The actual problem is not about moving logic downward; it is about controlling how the calculation engine constructs and reuses execution graphs under varying runtime conditions. When SQLScript procedures and calculation views are designed without regard to grain stabilization, operator ordering, and cardinality propagation, the resulting plans remain syntactically valid but produce workload-sensitive operator graphs whose memory footprint shifts with parameter selectivity.

This article dissects the mechanics behind execution-plan stability in SAP HANA migrations, focusing on SQLScript procedures and Calculation Views as first-class architectural units.

The Architectural Shift: From Staged ETL to Operator Graph Execution

Traditional EDW pipelines relied on staged transformations. Each step materialized an intermediate state, often writing into persistent tables between transformations. That staging introduced natural grain boundaries. Joins were resolved, aggregations were completed, and the next transformation consumed stable, reduced datasets. In SAP HANA, Calculation Views and SQLScript table functions remove those materialization barriers.
Logical transformations are fused into a single operator graph. PlanViz reveals this as a directed acyclic graph of projection, join, aggregation, and calculation nodes. The optimizer is free to reorder joins, push predicates downward, and defer aggregations. That freedom improves latency in well-designed models. It amplifies instability in poorly designed ones.

Consider a common migration pattern:

```sql
SELECT h.MATERIAL_ID,
       SUM(l.QUANTITY) AS TOTAL_QTY
FROM HEADER h
JOIN LINE_ITEM l
  ON h.DOC_ID = l.DOC_ID
WHERE h.POSTING_DATE BETWEEN :p_from AND :p_to
GROUP BY h.MATERIAL_ID;
```

Translated directly into a Calculation View, the join and aggregation nodes are placed without enforcing a grain reduction before high-cardinality joins. Under small parameter windows, the plan performs adequately. Under wide date ranges, the join produces a large intermediate result before aggregation collapses it. Memory amplification becomes workload-dependent.

In PlanViz, the join node frequently shows actual row counts an order of magnitude higher than estimated. For example, a date window spanning a quarter can produce 38 million intermediate rows before aggregation collapses the result to fewer than 300,000 grouped records. The aggregation node is inexpensive. The join node is not. Memory allocation occurs before reduction.

The legacy system relied on pre-aggregated staging tables to constrain that explosion. The HANA translation removed the staging but did not redesign the grain boundary.

Why Preserving Batch Semantics Breaks Under Concurrency

In staged ETL systems, concurrency was limited. Batch windows were serialized. Execution plans operated with predictable resource envelopes. HANA environments operate with interactive workloads, overlapping parameter combinations, and mixed analytic demands.
An SQLScript procedure frequently encapsulates logic like this:

```sql
lt_filtered = SELECT * FROM SALES WHERE REGION = :p_region;

lt_enriched = SELECT f.*, d.CATEGORY
              FROM :lt_filtered AS f
              JOIN DIM_PRODUCT d
                ON f.PRODUCT_ID = d.PRODUCT_ID;

lt_aggregated = SELECT CATEGORY, SUM(AMOUNT) AS TOTAL
                FROM :lt_enriched
                GROUP BY CATEGORY;

SELECT * FROM :lt_aggregated;
```

Syntactically, the intermediate variables imply sequencing. In practice, the optimizer inlines these operations. If REGION is not highly selective, the join with DIM_PRODUCT expands cardinality before aggregation. Under multiple concurrent sessions with varying region selectivity, the same operator graph is reused while actual cardinality diverges across sessions. One session may process 2 million rows, another 40 million. Each constructs its own hash structures while the plan shape remains identical.

Plan instability emerges from estimation drift, not code defects. Batch semantics assumed a stable data distribution. Interactive concurrency invalidates that assumption.

Grain Stabilization as a First-Class Design Constraint

Execution-plan stability in HANA depends on reducing cardinality before high-cost joins. That principle is mechanical, not stylistic. Instead of joining at the transaction grain and aggregating afterward, redesign the model to collapse the grain first:

```sql
lt_reduced = SELECT PRODUCT_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
             FROM SALES
             WHERE REGION = :p_region
             GROUP BY PRODUCT_ID;

SELECT r.PRODUCT_ID, d.CATEGORY, r.TOTAL_AMOUNT
FROM :lt_reduced AS r
JOIN DIM_PRODUCT d
  ON r.PRODUCT_ID = d.PRODUCT_ID;
```

This change enforces aggregation before dimensional enrichment. The intermediate dataset shrinks before the join. In PlanViz, the aggregation node now executes before dimensional enrichment, reducing the intermediate row count from tens of millions to low single-digit millions before the join. Hash table size contracts accordingly, and runtime variance narrows under concurrency.
Within calculation views, this requires explicit modeling:

- Aggregation nodes placed before join nodes
- Join cardinality correctly annotated
- Star-join semantics avoided for high-variance fact tables

Without explicit grain control, the optimizer may defer aggregation for cost-based reasons that are correct for one parameter distribution and catastrophic for another. Pushdown-first modernization must include grain-first redesign.

Calculation Views: Join Cardinality and Engine Transitions

Graphical Calculation Views introduce another source of instability: cardinality metadata and engine transitions. When join cardinality is left as "n..m," the optimizer assumes worst-case explosion. When incorrectly set as "1..1," it may reorder joins aggressively and defer filtering. Both mistakes alter the plan's shape.

A frequent migration pattern is to replicate legacy multi-join views into a single Calculation View with multiple projection nodes feeding a central join node. Under load, the join engine allocates hash tables proportional to pre-aggregation cardinality. If aggregation nodes sit above that join, each concurrent session constructs its own large intermediate state before reduction, multiplying memory pressure across sessions.

Execution-plan stability requires:

- Accurate cardinality annotation
- Projection pruning enabled
- Calculated columns minimized before aggregation
- Table functions used sparingly, and only when logic cannot be expressed declaratively

Table functions introduce optimization boundaries. When overused, they prevent join reordering and predicate pushdown across function boundaries, fragmenting the operator graph.

SQLScript Procedures and Optimization Boundaries

SQLScript introduces imperative constructs that can fragment optimization. For example:

```sql
IF :p_flag = 'Y' THEN
    SELECT ...
ELSE
    SELECT ...
END IF;
```

Branching logic produces separate subplans. Under concurrency, plan cache fragmentation increases.
Each branch may generate a distinct plan variant, multiplying the memory footprint. Similarly, cursor-based loops imported from legacy logic disable set-based optimization. Even when pushdown is nominally achieved, the presence of row-by-row constructs forces materialization.

Execution stability improves when:

- Set-based transformations replace procedural loops
- Conditional logic is expressed via predicates rather than branches
- Intermediate variables are minimized to avoid implicit materialization

The goal is a single coherent operator graph with predictable cardinality flow.

Observability: PlanViz as a Stability Instrument

PlanViz is not a tuning tool alone. It is a stability diagnostic instrument.

Stable models show:

- Early aggregation nodes
- Reduced intermediate row counts after each operator
- Limited engine transitions between OLAP and join engines
- Consistent estimated vs. actual row counts

Unstable models show:

- Large intermediate nodes before aggregation
- High variance between estimated and actual cardinalities
- Multiple hash join operators with spill risk
- Repeated plan variants under similar parameter shapes

Stability is observed by running parameter sweeps under controlled concurrency and comparing plan shapes, not just runtimes.

State Amplification Under Concurrent Workloads

When intermediate result sets scale with the input window size, concurrent sessions amplify state multiplicatively. If one session produces 200 million intermediate rows before aggregation and five sessions overlap, each constructs its own intermediate state, causing cumulative memory allocation that triggers throttling or spill behavior despite acceptable single-session performance.

Stabilized models collapse grain early, producing intermediate datasets proportional to grouped dimensions rather than raw transaction volume. Concurrency then scales linearly instead of exponentially. This distinction is architectural. It cannot be solved with indexes, hints, or hardware.
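A back-of-envelope sketch makes the amplification arithmetic concrete (the 64-byte intermediate row width is an illustrative assumption, not a HANA measurement):

```python
def intermediate_gib(rows_per_session, sessions, bytes_per_row=64):
    """Total transient state when each session builds its own intermediate set."""
    return rows_per_session * sessions * bytes_per_row / 2**30

# Join-then-aggregate: 200M intermediate rows per session, 5 overlapping sessions.
deferred = intermediate_gib(200_000_000, 5)

# Grain-first: intermediate size tracks grouped dimensions (say, 300k groups).
grain_first = intermediate_gib(300_000, 5)

print(f"deferred reduction: {deferred:,.1f} GiB transient state")
print(f"grain-first:        {grain_first:,.4f} GiB transient state")
```

The absolute numbers are hypothetical; the point is the ratio. Deferred reduction scales with raw transaction volume times concurrency, while grain-first state scales with group count, which is why the fix is architectural rather than a matter of hints or hardware.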
Engineering Stability Instead of Translating Logic

Most unstable migrations are not unstable because SAP HANA is inefficient. They are unstable because reduction was deferred. When aggregation happens after cardinality amplification, the intermediate state scales with raw transaction volume. Under concurrency, that decision multiplies memory pressure across sessions. The system behaves exactly as modeled.

Pushdown-first modernization succeeds when reduction precedes enrichment and when the operator graph is engineered for concurrency, not just correctness.

By Rajaganapathi Rangdale Srinivasa Rao
Enhancing SQL Server Performance with Query Store and Intelligent Query Processing

SQL Server performance issues are a common pain point for database administrators. One of my most challenging scenarios occurred after deploying a financial analytics database update. Reports that previously ran in less than 3 minutes suddenly ballooned to over 20 minutes. Key stored procedures started underperforming, and CPU usage spiked to critical levels during peak workloads.

Through careful investigation, I identified query regressions caused by outdated execution plans and parameter sniffing. Instead of applying temporary fixes, I turned to Query Store and Intelligent Query Processing (IQP) to develop a sustainable, long-term solution. This article provides step-by-step instructions for using these tools, including practical examples, my exact investigation process, configuration changes, benchmark results before and after optimizations, and how these changes improved overall performance and stabilized the production environment.

Performance Issue Investigation: Observing Query Regressions

The performance degradation stemmed from new internal processes introduced into the application, which altered data patterns. Parameter sniffing (a common issue where SQL Server caches an execution plan optimized for specific parameters but reuses it for parameters with drastically different data distributions) caused previously fast queries to slow down. To pinpoint the bottleneck, I queried the sys.dm_exec_requests and sys.dm_exec_query_stats views, which revealed certain stored procedures with much higher CPU and runtime durations than they had before.
For example, running the following query helped me confirm which plans were underperforming:

```sql
SELECT TOP 5
    qs.sql_handle,
    qs.creation_time,
    qs.total_worker_time / qs.execution_count AS average_cpu_time,
    qs.execution_count,
    qp.query_plan
FROM sys.dm_exec_query_stats qs
OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY average_cpu_time DESC;
```

From this, I identified the two impacted stored procedures, usp_generate_financial_report and usp_calculate_daily_totals, which each had sudden spikes in execution times.

Enabling Query Store for Plan Analysis

To resolve the regressions effectively, I enabled Query Store to monitor all query plans and runtime statistics. Query Store maintains a history of plan performance, making it possible to diagnose and compare regressed plans to their optimal counterparts. I enabled Query Store with the following command:

```sql
ALTER DATABASE [FinancialAnalyticsDB] SET QUERY_STORE = ON;

ALTER DATABASE [FinancialAnalyticsDB] SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30),
    DATA_FLUSH_INTERVAL_SECONDS = 900,
    QUERY_CAPTURE_MODE = AUTO
);
```

This configuration automatically captured query plans and runtime metrics while limiting unnecessary data retention to 30 days. I immediately noticed that usp_generate_financial_report was generating multiple inefficient plans based on the cached parameters. Query Store also provided insights into how the queries performed under those plans.
Query Store Analysis Before the Fix

I used the following query to identify the regressed query:

```sql
SELECT
    q.query_id,
    q.object_id,
    MAX(rs.avg_duration) AS max_duration,
    MIN(rs.avg_duration) AS min_duration
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt
    ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_runtime_stats rs
    ON q.query_id = rs.query_id
GROUP BY q.object_id, q.query_id
ORDER BY max_duration DESC;
```

Results revealed the following for usp_generate_financial_report:

| Metric | Value Before Fix |
|---|---|
| Max Duration | 19,789 ms |
| Min Duration | 2,345 ms |
| Memory Usage | 410 MB |
| CPU Utilization | 75% (peaking at 90%) |

Parameter sniffing caused the query to use an index seek for one execution and a full table scan for another, leading to an average of 20 seconds per execution during peak hours.

Parameter sniffing in SQL Server occurs when the database engine compiles and caches an execution plan using the parameter values provided during the query's first execution. While this can improve performance for similar subsequent executions, it may cause issues if the initial parameter values do not represent the typical data distribution or usage patterns. This leads to suboptimal plans for subsequent executions with different parameters, resulting in poor performance. For example, a plan optimized for a smaller dataset might perform poorly when run against a much larger dataset with different parameter values.

Fixing Query Regressions by Forcing Plans

Using Query Store, I located the plan that performed optimally and forced SQL Server to reuse it for subsequent executions.

Forced plan implementation:

```sql
-- Identify the Query ID and Plan ID
EXEC sp_query_store_force_plan @query_id = 1203, @plan_id = 3456;
```

This ensures the application always runs the query using the best-performing execution plan. I tested this change in a development environment to confirm its impact before implementing it in production.
Benchmark Results After Forcing Plans

After forcing the optimal plan, the following improvements were observed:

| Metric | Before Fix | After Plan Forcing |
|---|---|---|
| Max Duration | 19,789 ms | 3,455 ms |
| Min Duration | 2,345 ms | 2,900 ms |
| Memory Usage | 410 MB | 140 MB |
| CPU Utilization | 75% (peak 90%) | 25% (peak 35%) |

Execution times decreased significantly to less than 4 seconds, and resource usage normalized during peak traffic.

Leveraging Intelligent Query Processing for Scalability

To prevent similar regressions in the future, I enabled Intelligent Query Processing (available in SQL Server 2019 and later). This suite of features dynamically resolves common query problems without manual DBA intervention. For this workload, the most impactful IQP features were Scalar UDF Inlining and Adaptive Joins.

Scalar UDF Inlining automatically translates user-defined functions into inline relational operations, eliminating their row-by-row execution. This was critical for usp_calculate_daily_totals, which relied heavily on scalar UDFs. Adaptive Joins convert fixed strategies like nested loops or hash joins into dynamic choices based on runtime statistics, adding further efficiency when handling varying query workloads.

After enabling IQP for the database:

```sql
ALTER DATABASE SCOPED CONFIGURATION SET TSQL_SCALAR_UDF_INLINING = ON;
```

Benchmark Results After Enabling IQP

The following table compares metrics before and after enabling IQP for usp_calculate_daily_totals:

| Metric | Before IQP | After IQP |
|---|---|---|
| UDF Execution Time | 20,134 ms | 3,289 ms |
| Logical Reads | 15,000 | 4,000 |
| CPU Utilization | 60% | 20% |

Enabling Scalar UDF Inlining improved query execution by up to 85%, while Adaptive Joins reduced variability across parameterized query runs.

Monitoring and Stabilization

After resolving the performance issues, I configured a proactive monitoring system to guard against future regressions. Query Store provided continuous insights, while Extended Events helped trace any unusual query behavior.
Automating these tasks with scheduled jobs ensured the environment remained stable even as workloads evolved.

Conclusion

By combining the power of Query Store and Intelligent Query Processing, I was able to diagnose and resolve query regressions quickly and effectively. Query Store helped me identify problematic plans and ensure optimal execution using forced plans, while IQP dynamically addressed inefficiencies in both existing and future queries. In this specific case, the financial analytics database saw execution times drop by over 80%, CPU utilization fall by 50 points, and user complaints cease entirely.

For any DBA seeking long-term, scalable solutions to performance challenges, leveraging these tools is a must. Start using Query Store and IQP today, and take control of your SQL Server performance issues for good.
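The headline figures in the conclusion follow directly from the benchmark tables above; a quick arithmetic sketch:

```python
def pct_drop(before, after):
    """Relative reduction between two measurements, in percent."""
    return (before - after) / before * 100

# From the plan-forcing benchmark: max duration 19,789 ms -> 3,455 ms.
duration_drop = pct_drop(19_789, 3_455)  # roughly 82%, i.e. "over 80%"

# CPU utilization 75% -> 25%: a 50-percentage-point reduction.
cpu_point_drop = 75 - 25

print(f"duration reduced by {duration_drop:.1f}%")
print(f"CPU reduced by {cpu_point_drop} percentage points")
```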

By arvind toorpu DZone Core CORE
Why Image Optimization in Modern Applications Matters More Than You Think

In modern applications, images are a vital ingredient. No longer are they just decorative elements: from product thumbnails, hero banners, marketing assets, and user-generated content to dashboards and even core data visualizations, they're everywhere. Be it an e-commerce platform, fintech dashboard, healthcare portal, or AI-powered SaaS application, all modern web applications use images to some extent.

Amidst rigorous discussions around architecture, microservices, performance tuning, and frontend frameworks, image optimization often takes a back seat or is considered a post-launch cleanup task. The consequences:

- Slower page loads, critical for SEO and user engagement
- High bounce rates
- Increased CDN costs
- Poor Core Web Vitals, specifically Largest Contentful Paint (LCP)
- Degraded mobile experience

The Hidden Cost of Unoptimized Images

Do you know what happens when you give images the cold shoulder?

1. Performance Degrades

Images typically account for 50–70% of the total page weight on modern websites. Serving images at a higher resolution than required, skipping compression, or using an inappropriate format leads to bloated content that significantly increases your:

- Time to First Byte (TTFB)
- Largest Contentful Paint (LCP)
- Time to Interactive (TTI)

Users spend more time waiting, pushing them to look for alternatives. Search engine rankings also suffer from these high loading times.

2. Poor Mobile Experience

Mobile users are disproportionately affected by heavy images. Some common scenarios where a bloated image hits you back:

- A 3MB image on office Wi-Fi vs. 3G or a constrained 4G connection
- Loads fine in cities, slower in remote areas

While a 3MB image loads in under a second on 5G, it takes more than 5 seconds on 3G (and up to 8 seconds on constrained 4G). That's enough to lose a potential customer!

3. Increased Infrastructure Costs

At scale, serving unoptimized or oversized images leads to increased costs in:

- CDN transfer and storage
- Server processing overhead

This becomes a significant operational expense.

Why This Matters More in Modern Applications

Modern applications are:

- Image heavy
- Mobile first
- Globally distributed
- SEO driven
- Performance monitored

Google's announcement that Largest Contentful Paint (LCP) would become a ranking factor in mid-2021 should make you want to re-evaluate your images, because the hero image often sets the LCP score. In growth-driven teams, a lot of products are shipped based on A/B outcomes; a single page score distorted by images can lead to the wrong features getting accepted. New e-commerce startup? A user's first impression can easily be tarnished by a site lagging because of one bloated image.

Key Image Optimization Strategies

Now let's discuss the strategies we can apply to avoid this.

1. Use Modern Image Formats

Use image formats like WebP and AVIF, which provide better quality at smaller sizes than JPEG and PNG. Fall back to JPEG and PNG for unsupported browsers.

2. Resize Images Appropriately

Never serve a 4000px-wide image into a 400px container. Use responsive techniques, which ensure devices download only what they need:

```html
<img
  src="image-800.webp"
  srcset="image-400.webp 400w, image-800.webp 800w, image-1600.webp 1600w"
  sizes="(max-width: 600px) 400px, 800px"
  alt="Product image" />
```

3. Implement Lazy Loading

Load images only when they enter the viewport, which reduces the initial payload and improves first-paint performance:

```html
<img src="image.webp" loading="lazy" alt="Dashboard preview" />
```

4. Use a CDN with Image Transformation

Use CDNs like Imgix, Cloudinary, or Fastly that provide on-the-fly image transformation and format conversion, and deliver images based on the user's device.

5. Optimize for Core Web Vitals

For hero and other initially rendered images:

- Preload critical assets using the rel attribute on the link tag
- Set explicit width and height
- Ensure images don't shift content
- Compress without visible quality degradation

6. Automate in Your CI/CD Pipeline

Add automated CI/CD tooling for:

- Compression during the build
- Format conversion
- Image size validation thresholds

Real-World Impact

Teams that are disciplined practitioners of image optimization often see:

- 20–50% reduction in page size
- LCP improvements
- Improved SEO rankings
- Lower infrastructure costs
- Reduced CDN costs
- Better experiment results

If you are building an application for a performance-conscious domain like fintech, or an AI dashboard where users expect lightning-fast responses, image optimization should be a high priority.

Common Mistakes to Avoid

- Using images directly from design tools like Figma
- Serving higher-resolution images on smaller screen sizes
- Ignoring compression quality settings
- Relying too heavily on browser caching

Performance Is a Crucial Product Feature

Modern users do not consciously measure load time. They feel it! A fast application:

- Feels premium
- Improves engagement
- Reduces user churn
- Improves revenue

Final Thoughts

Optimizing images is often looked down upon. After all, it sounds so trivial to "compress an image." However, at scale, it's one of the most impactful things you can do for your application. Image optimization often becomes an afterthought for fast-growing companies, but today's modern architectures make it even more essential for your:

- SEO
- Conversion
- Users' trust

As applications grow richer and more visual, image optimization must be treated as a first-class architectural concern, not a post-launch cleanup task.
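The mobile-experience numbers above are just transfer-time arithmetic. A rough sketch (the bandwidth figures are illustrative effective throughputs, not spec maximums):

```python
def transfer_seconds(size_mb, mbps):
    """Seconds to move a payload at a given effective throughput (megabits/s)."""
    return size_mb * 8 / mbps

image_mb = 3.0
# Illustrative real-world throughputs, not theoretical peaks.
networks = {"5G": 100.0, "3G": 4.0, "4G (constrained)": 3.0}

for name, mbps in networks.items():
    print(f"{name:>17}: {transfer_seconds(image_mb, mbps):.1f} s")
```

Resizing the same image to a roughly 200 KB responsive variant brings even the 3G case well under a second, which is the whole argument for srcset and format conversion.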

By Satyam Nikhra

Top Performance Experts


Filipp Shcherbanich

Senior Backend Engineer

IT expert with over 13 years of experience as a developer, team lead, and engineering manager. Currently a Senior Backend Engineer at a major international company. Active mentor and expert in tech communities.

Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director Community & Developer. He's renowned in the development community as a speaker, lecturer, author, baseball expert, maintainer and CNCF Ambassador. His current role allows him to help the world understand the challenges they are facing with observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise of open source technologies and organizations. More on https://www.schabell.org.

The Latest Performance Topics

Observability on the Edge With OTel and FluentBit
Master edge observability with OTel and Fluent Bit, leveraging tail sampling, persistent queues, and footprint optimization.
April 23, 2026
by Graziano Casto DZone Core CORE
· 271 Views
The Pod Prometheus Never Saw: Kubernetes' Sampling Blind Spot
Prometheus sampling gaps are irreducible — reducing the scrape interval just moves the threshold. The Kubernetes watch API eliminates it entirely.
April 23, 2026
by Shamsher Khan DZone Core CORE
· 352 Views
What AI Systems Taught Us About the Limits of Chaos Engineering
AI-driven infrastructure is non-deterministic. Chaos testing ensures systems maintain intended behavior under stress, improving reliability and safety.
April 22, 2026
by Sayali Patil
· 3,825 Views · 5 Likes
Algorithmic Circuit Breakers: Engineering Hard Stop Safety Into Autonomous Agent Workflows
Autonomous agents fail by persisting: they retry, replan, and chain tools, increasing risk, cost, and potential blast radius without strict safety controls.
April 22, 2026
by Williams Ugbomeh
· 694 Views · 1 Like
Why Angular Performance Problems Are Often Backend Problems
Your Angular app isn't slow; your API is. Fix backend bottlenecks like request waterfalls, overfetching, and slow queries before touching a single Angular component.
April 17, 2026
by Bhanu Sekhar Guttikonda
· 1,491 Views
Fine-Tuning of Spring Cache
Caching is a fundamental concept for making web applications faster and more scalable. In the following, I explain how to configure and optimize Spring caching.
April 17, 2026
by Constantin Kwiatkowski
· 1,254 Views · 1 Like
Advanced Auto Loader Patterns for Large-Scale JSON and Semi-Structured Data
Databricks Auto Loader efficiently ingests JSON and semi-structured files into Delta Lake, handling schema evolution and large-scale streaming.
April 16, 2026
by Seshendranath Balla Venkata
· 1,416 Views · 1 Like
Seeing the Whole System: Why OpenTelemetry Is Ending the Era of Fragmented Visibility
By a technology correspondent who has sat through enough war rooms to know that the data you need is almost always in a system nobody thought to connect.
April 16, 2026
by Igboanugo David Ugochukwu DZone Core CORE
· 1,523 Views
Stop Burning Money on AI Inference: A Cloud-Agnostic Guide to Serverless Cost Optimization
Most teams waste money on AI inference. Five cloud-agnostic tactics—model routing, prompt trimming, response caching, smart batching, GPU offloading—can cut costs 40‑80%.
April 16, 2026
by Rajesh Kumar Pandey
· 1,491 Views · 1 Like
Optimizing Java Back-End Performance Profiling and Best Practices
Java performance profiling helps identify bottlenecks and apply best practices to improve backend speed, efficiency, and scalability.
April 16, 2026
by Ramya vani Rayala
· 1,576 Views
Build High-Performance Web Systems Using Adaptive Edge-Native Performance Governance Framework
Learn how to build high-performance web systems using edge-native architecture, CI/CD performance budgets, and real user monitoring to prevent regressions.
April 14, 2026
by Jerald Selvaraj
· 1,175 Views · 1 Like
article thumbnail
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents
Learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.
April 14, 2026
by Ayush Raj Jha
· 1,408 Views · 1 Like
article thumbnail
Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
Photon is Databricks’ native C++ engine that bypasses JVM bottlenecks by processing data in vectorized, SIMD-accelerated batches instead of row by row.
April 13, 2026
by Seshendranath Balla Venkata
· 1,544 Views · 1 Like
article thumbnail
AI in SRE: What's Actually Coming in 2026
A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.
April 13, 2026
by Ashly Joseph
· 1,504 Views · 1 Like
article thumbnail
Pushdown-First Modernization: Engineering Execution-Plan Stability in SAP HANA Migrations
In this article, you'll see behind the scenes how HANA SQL and views behave at high and low record volumes.
April 13, 2026
by Rajaganapathi Rangdale Srinivasa Rao
· 1,083 Views
article thumbnail
Enhancing SQL Server Performance with Query Store and Intelligent Query Processing
This article explores how to boost SQL Server performance using Query Store and Intelligent Query Processing to effectively resolve query regressions.
April 10, 2026
by Arvind Toorpu
· 2,017 Views · 2 Likes
article thumbnail
Why Image Optimization in Modern Applications Matters More Than You Think
Modern websites rely heavily on images, yet improper image handling often hurts performance, driving users away.
April 9, 2026
by Satyam Nikhra
· 1,721 Views
article thumbnail
Tracking Dependencies Beyond the Build Stage
Many developers are familiar with dependency scanning at build time, but can we go further? And why is it worth doing so?
April 9, 2026
by Rumen Dimov
· 2,095 Views
article thumbnail
Why Queues Don’t Fix Scaling Problems
Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.
April 8, 2026
by David Iyanu Jonathan
· 1,800 Views · 2 Likes
article thumbnail
AI-Based Multi-Cloud Cost and Resource Optimization
Multi-cloud costs rise due to poor visibility, idle resources, and reactive scaling. AI-driven FinOps automates optimization to cut waste and control spend.
April 7, 2026
by Venkatesan Thirumalai
· 2,633 Views