<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>Amazon Science homepage</title>
    <link>https://www.amazon.science/</link>
    <description>Learn about Amazon's scientific research, science community, and career opportunities in artificial intelligence (AI), machine learning (ML), computer vision, robotics, quantum, economics and more.</description>
    <language>en-US</language>
    <lastBuildDate>Fri, 17 Apr 2026 13:00:00 GMT</lastBuildDate>
    <atom:link href="https://www.amazon.science/index.rss" type="application/rss+xml" rel="self" />
    <item>
      <title>Isabelle/HOL: The proof assistant behind the Nitro Isolation Engine</title>
      <link>https://www.amazon.science/blog/isabelle-hol-the-proof-assistant-behind-the-nitro-isolation-engine</link>
      <description>Isabelle/HOL&apos;s balance of expressiveness, automation, and scalability enabled the world&apos;s first formally verified cloud hypervisor.</description>
      <content:encoded>At Amazon’s 2025 re:Invent conference, Amazon Web Services (AWS) announced the Nitro Isolation Engine (NIE), a software module tasked with providing resources to AWS clients while ensuring the security of customer data. AWS also announced the formal verification of the isolation engine’s correctness and security guarantees, using a proof assistant called Isabelle/HOL. As the first formally verified cloud hypervisor, NIE sets a new standard for cloud security. A proof assistant is an automated tool that can help human users develop formal proofs — of mathematical theorems, of the validity of hardware or software systems, or of anything in between. Several proof assistants are in common use, and we chose Isabelle/HOL because it struck the right balance among expressiveness, automation, proof readability, and scalability. So what do I mean by that?

Logical reasoning by computer

There is no fixed language of mathematics, but we can create languages for expressing mathematical reasoning, just as programming languages express computational tasks. And just as programming languages involve trade-offs between expressiveness and performance, mathematical languages involve trade-offs between expressiveness and ease of automation. Automation is vital because the construction of a formal proof is both time-consuming and extremely tedious, analogous to constructing a ship in a bottle. The most elementary mathematical language is Boolean logic, the world of the binary operators AND, OR, and NOT. Because this language is so simple, powerful automatic solvers exist for it. In 2016, Carnegie Mellon professor Marijn Heule — now an Amazon Scholar — and his colleagues encoded an unsolved mathematical question, the Boolean Pythagorean Triples Problem, into Boolean logic and used automatic solvers to help create the largest proof ever: 200 terabytes long.
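To give a concrete flavor of such Boolean encodings, here is a toy sketch in Python (my own illustration, not the encoding Heule's team actually used). Each integer up to n gets one Boolean variable for its color, and each Pythagorean triple yields two clauses forbidding a monochromatic triple; a SAT solver applied to the resulting clauses decides whether a valid 2-coloring exists.

```python
# Toy encoding of the Boolean Pythagorean Triples Problem as CNF clauses.
# Variable i is true if integer i is colored "red", false if "blue".
# A valid 2-coloring requires that no triple a^2 + b^2 = c^2 is monochromatic.

def triples(n):
    """Enumerate Pythagorean triples (a, b, c) with a < b < c <= n."""
    return [(a, b, c)
            for a in range(1, n + 1)
            for b in range(a + 1, n + 1)
            for c in range(b + 1, n + 1)
            if a * a + b * b == c * c]

def cnf_clauses(n):
    """For each triple, forbid all-red (-a, -b, -c) and all-blue (a, b, c)."""
    clauses = []
    for a, b, c in triples(n):
        clauses.append([-a, -b, -c])  # not all three red
        clauses.append([a, b, c])     # not all three blue
    return clauses

print(triples(15))          # [(3, 4, 5), (5, 12, 13), (6, 8, 10), (9, 12, 15)]
print(len(cnf_clauses(15))) # 8
```

At toy sizes this instance is trivially satisfiable; the actual theorem concerns n = 7825, where no valid 2-coloring exists, and proving that required specialized solvers and massive parallelism.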
A richer mathematical language called first-order logic allows us to talk about some domain of interest — the integers, say — and to define functions over that domain. And we can go beyond Boolean logic by including the quantifiers “for all” and “there exists” in assertions. In this sort of language, we can express statements such as “every prime number greater than two is odd”. We can also prove the following theorem, due to Lewis Carroll: “No ducks waltz; no officers ever decline to waltz; all my poultry are ducks. Hence, none of my poultry are officers.” However, most people prefer a still stronger mathematical language, where they can define types, as they do in programming. In higher-order logic, there are even function types, as found in functional programming languages such as Haskell. Higher-order logic is much richer than first-order logic, able to express statements such as “Every set containing the number 1 and closed under addition contains all the positive integers.” It appears to be rich enough to express most of mathematics. The richest mathematical languages — called dependent-type theories — even allow types to take arbitrary values as parameters, e.g., T(i), where i is an integer. The best-known such languages are Lean and Rocq. Powerful automatic theorem provers exist for first-order logic, but for higher-order logic and beyond, full automation is not available. This is the price of expressiveness. A proof assistant allows users to build proofs interactively, supported by partial automation and the possibility of coding their own proof searches. A proof assistant enforces strict compliance with the laws of logic, typically through a kernel architecture that gives only a limited portion of the code the right to create a theorem. A proof assistant also supports the interactive development of a possibly huge formal-specification hierarchy.
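The Lewis Carroll syllogism quoted above makes a nice first exercise in formalization. As a rough illustration (written here in Lean 4, one of the dependent-type-theory languages mentioned above, rather than in Isabelle's own syntax; the predicate names are my own choices):

```lean
-- Lewis Carroll's syllogism, formalized over an arbitrary domain α.
example {α : Type} (Duck Officer Poultry Waltzes : α → Prop)
    (h1 : ∀ x, Duck x → ¬ Waltzes x)     -- no ducks waltz
    (h2 : ∀ x, Officer x → Waltzes x)    -- no officers ever decline to waltz
    (h3 : ∀ x, Poultry x → Duck x)       -- all my poultry are ducks
    : ∀ x, Poultry x → ¬ Officer x :=    -- hence, none of my poultry are officers
  fun x hPoultry hOfficer => h1 x (h3 x hPoultry) (h2 x hOfficer)
```

The proof term simply chains the hypotheses: my poultry are ducks, so they do not waltz; but officers do waltz, so none of my poultry can be officers.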
For example, verification of the Nitro Isolation Engine (NIE) rests on specifications of the architecture of the Graviton-5 processor, the Rust code of the hypercalls and their functional correctness, and the security properties that are to be proved. These take up much of the quarter of a million lines constituting the formal proof. Higher-order logic is supported by HOL and HOL Light, two closely related proof assistants, and has been used to verify hardware designs, floating-point algorithms, and pure mathematics since the 1990s. AWS senior principal applied scientist John Harrison developed HOL Light, and he has used it to improve the performance of digital signatures on Amazon’s Graviton2 chip by up to 94%, by verifying an optimized version of the cryptographic algorithms. The code was delicate, and exhaustive testing is not feasible; only a formal verification of full functional correctness would do before the deployment of such critical software. But today we are interested in Isabelle/HOL.

Overview of Isabelle/HOL

The most visible difference between Isabelle/HOL and the other HOL systems — which are all based on higher-order logic — is its specification and proof language. With most proof assistants, users state what they want to prove, then issue lists of commands that replace the original goals with series of subgoals, in a kind of whack-a-mole game. In Isabelle, and to some extent in Lean, the proof language allows desired intermediate goals to be written out explicitly, allowing a better-controlled proof process and a more legible proof document. There are plenty of examples online.
Other notable features are as follows:

- a user-configurable parser, which allowed us to embed a significant fragment of the Rust language into our specifications;
- type classes for principled overloading, so that, say, + can be given its natural meaning not just for a variety of numeric types but for machine words and in other appropriate contexts;
- locales, a lightweight module system allowing a hierarchy of specifications to be defined and interpreted in various ways, even within a proof;
- powerful built-in automation through simplification and backchaining proof search;
- sledgehammer, one-click access to even more powerful external automation;
- counterexample-finding tools, for identifying claims that are actually false;
- code generation from executable higher-order specifications, which we used to test conformance.

For the verification of NIE, we began by implementing a specialized language called separation logic on top of Isabelle/HOL. Separation logic is designed for verifying program code operating on shared resources. We coded our own proof automation and also used what was built in. We could therefore use separation logic where it helped and plain higher-order logic when we wanted to. Isabelle turned out to be resilient enough and efficient enough to cope with the truly gigantic subgoals. It could run that quarter-million-line proof in half an hour on an off-the-shelf laptop.

Some applications of Isabelle/HOL

The single most impressive application of Isabelle prior to NIE is probably the verification of seL4, a widely used microkernel. This proof was also about a quarter of a million lines when first announced, although it is now much longer. The seL4 developers proved that the microkernel’s C implementation refined the abstract specification, yielding full functional correctness of the core operations.
And they have observed no bugs in the verified parts of the code, although testing still plays a role in covering unverified parts and certain assumptions that cannot be formalized. Isabelle was also used in the following projects:

- to formalize the semantics of the WebAssembly language, to identify errors, and, in particular, to prove the soundness of its type system;
- to create a verification framework for the Cogent programming language;
- to prove the correctness of algorithms for conflict-free replicated data types, which are used for distributed editing;
- to formalize numerous results in pure mathematics;
- to verify cryptographic protocols at an abstract level.

Isabelle is free, open source, and available to download. It runs on all the main operating systems on any machine that has enough memory.</content:encoded>
      <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/isabelle-hol-the-proof-assistant-behind-the-nitro-isolation-engine</guid>
    </item>
    <item>
      <title>Customized Amazon Nova models improve molecular-property prediction in drug discovery</title>
      <link>https://www.amazon.science/blog/customized-amazon-nova-models-improve-molecular-property-prediction-in-drug-discovery</link>
      <description>A single, optimized LLM unifies what previously required multiple models and can serve as a reasoning partner for medicinal chemists.</description>
      <content:encoded>In recent years, large language models (LLMs) have become indispensable assistants for software engineers and knowledge workers. Nimbus Therapeutics enlisted us at Amazon’s Generative AI Innovation Center and Artificial General Intelligence (AGI) organization to investigate whether it’s possible to make equally capable assistants for medicinal chemists discovering new drugs. Such an agent could significantly speed up drug discovery, potentially saving lives. AI in drug discovery has traditionally involved models called graph neural networks, or GNNs. GNNs are the workhorses of molecular-property prediction across pharmaceutical R&amp;D, and for good reason: they deliver strong accuracy on well-defined tasks. Typically, multiple GNNs, specialized for different molecular properties, have to be built and maintained in-house — an expensive, operationally complex process. More recently, the success of LLMs in a variety of research domains has caught the eye of biotech firms, but for drug discovery, general, off-the-shelf LLMs have proven to be less accurate than GNNs or other computational methods. We have adopted a new approach that combines the accuracy of GNNs with the generalizability and reasoning ability of LLMs. Using supervised fine tuning (SFT) and reinforcement fine tuning (RFT) to customize a general-purpose LLM, we were able to achieve results comparable to those of using multiple GNNs, at a fraction of the time and labor. Fine-tuned LLMs offer a significantly simplified workflow. In the traditional setting, each GNN has a separate interface, with its own quirks, data formats, and failure modes. Results come back as disconnected numbers that the chemist must manually integrate. When a new property needs to be predicted, someone must construct a multitask dataset and train and validate an entirely new model, a process that can take weeks.
In contrast, a single, fine-tuned LLM allows a chemist to submit one query and receive predictions on all molecular properties of interest. Adding a new property requires incremental fine tuning rather than building a new model from scratch. Moreover, a language model opens the door to a qualitatively different capability: conversation. With a fine-tuned LLM, it’s now possible to ask for the reasoning behind the model outputs or to suggest molecular modifications that might yield the desired properties. This points toward an assistant that unifies molecular-property prediction and generation in one interactive experience, which we see as the ideal next step for AI-assisted drug design. Customized LLMs unlock domain-specific scientific assistants, giving lean biotech teams a practical way to collaborate with AI systems that speak their scientific language. Today, bringing a single drug to market takes 10 to 15 years and costs on average over $2 billion, with only about 8 percent of drug candidates that enter clinical trials receiving FDA approval. We believe that AI assistants could particularly improve productivity in the early stages of this pipeline, where chemists design molecules with druglike properties. Increasing the speed of development and the number of viable candidates would maximize the chances of delivering a safe and efficacious drug to the clinic.

What we looked at

Our work with Nimbus Therapeutics focused on properties spanning three categories critical to drug development:

- Lipophilicity (one associated property) determines whether a molecule can cross biological membranes. It is fundamental to drug absorption and distribution and affects all other characteristics of a drug.
- Permeability (four associated properties) measures how easily a drug enters the body via the bloodstream.
- Clearance (six associated properties) determines how quickly the body eliminates a drug.
A drug that takes too long to be cleared could become toxic; one that is cleared too quickly won’t be effective. These properties span different value ranges and exhibit complex interdependencies — in practice requiring separate multitask GNN models. We tested the general-purpose LLMs Claude Sonnet 4 and Nova 2 Lite on the task of predicting all three sets of properties for particular molecules. Despite their impressive capabilities elsewhere, the models significantly underperformed specialized GNNs, with root-mean-squared errors (RMSE) from 40% to over 200% higher, depending on the property. However, we discovered that Nova 2 Lite with supervised fine tuning (SFT), followed by reinforcement fine tuning (RFT), could close that gap. Our single, fine-tuned LLM predicted 11 different molecular properties with accuracy similar to that of multiple separately trained multitask GNN models.

How we did it

Our approach to fine-tuning the LLM follows a principle common to both human-expertise development and machine learning: foundational knowledge must precede performance optimization. During SFT, the model learned core concepts such as molecular structure and property relationships. Then, during RFT, training shifted to the development of predictive judgment through practice and feedback. During SFT, we exposed Nova 2 Lite to more than 55,000 molecules labeled with experimental measurements across 11 properties. SFT was essential because the domain-specific tasks we asked the model to perform fall far outside Nova 2 Lite’s generalized pretraining data. For example, we use a notation called SMILES (simplified molecular-input line entry system) to represent chemical structures. Without SFT, the LLM wouldn’t have been able to perform a task like “predict a chemical property from a SMILES string, in structured JSON format”.
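To illustrate what such a fine-tuning record might look like, here is a minimal sketch: the field names and the property value are hypothetical, not the actual Nimbus or Amazon training schema. A SMILES string in the prompt is paired with a structured JSON completion.

```python
import json

# Hypothetical SFT training record: the model sees a SMILES string in the
# prompt and must emit its property prediction as structured JSON.
# "CCO" is the SMILES notation for ethanol; the logD value is illustrative.
record = {
    "prompt": (
        "Predict the lipophilicity (logD) of the molecule with "
        "SMILES: CCO. Answer in JSON."
    ),
    "completion": json.dumps({"smiles": "CCO", "logD": -0.3}),
}

# A fine-tuning dataset is typically one such JSON object per line (JSONL).
line = json.dumps(record)
parsed = json.loads(line)
print(json.loads(parsed["completion"])["smiles"])  # CCO
```

Because the completion is machine-parseable JSON, the same record format supports automated scoring during the later RFT stage.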
The second training stage, reinforcement fine tuning (RFT), is especially critical for properties with limited experimental data, where SFT alone struggles to generalize. RFT also enables the intramodel transfer of learning across properties. For instance, lipophilicity affects permeability, and both can inform metabolism predictions. Further, RFT shifts the learning objective from pattern matching ("given molecule X, output value Y based on similar examples") to quality optimization ("minimize prediction error across all properties"). We tested the SFT and RFT models on 15,000 molecules unseen during training. We also built a system prompt that encompassed knowledge of both core chemistry and our 11 chemical properties of interest, including their definitions and expected value ranges. During the RFT stage, we experimented with three strategies for generating the rewards that guided the learning process. Molecular-property prediction is particularly amenable to reward engineering for RFT since the output is a single number, which allows us to measure exactly how far off each prediction is. Our first strategy was to use an exponential decay function, so predictions closer to the true value received exponentially higher rewards. But at high error, improving from “terrible” to merely “bad” yielded almost no reward difference, keeping the model from learning from its worst predictions, while at low error, small changes resulted in large reward differences, which made the reward signal noisy and ultimately unhelpful. Our second strategy, binary pass/fail rewards, created the opposite problem. The model received zero reinforcement for gradual improvement: it either crossed an arbitrary threshold (in our case, correct within 10 percent) or learned nothing. Rewards based on the Huber loss — a metric proposed in 1964 by the Swiss statistician Peter Huber, which limits the influence of outliers — solved both issues.
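The three reward shapes described above can be sketched as simple functions of the absolute prediction error. This is an illustrative sketch only; the decay scale, pass threshold, and Huber delta below are assumed values, not the ones used in training.

```python
import math

def exp_reward(error: float, scale: float = 1.0) -> float:
    """Exponential decay: reward is nearly flat between 'terrible' and 'bad'."""
    return math.exp(-abs(error) / scale)

def binary_reward(error: float, threshold: float = 0.1) -> float:
    """Pass/fail: no credit at all for gradual improvement above the threshold."""
    return 1.0 if abs(error) <= threshold else 0.0

def huber_reward(error: float, delta: float = 1.0) -> float:
    """Negative Huber loss: quadratic near zero, linear (not vanishing) for
    large errors, so every improvement shifts the reward meaningfully."""
    e = abs(error)
    loss = 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return -loss

# Improving a terrible prediction (error 10 -> 9) barely changes the
# exponential reward but shifts the Huber reward by a full unit.
print(exp_reward(10.0) - exp_reward(9.0))      # ~ -7.9e-05 (negligible)
print(huber_reward(9.0) - huber_reward(10.0))  # 1.0
```

The comparison makes the failure modes concrete: the exponential signal vanishes exactly where the worst predictions need correcting, the binary signal is zero almost everywhere, and the Huber-based reward stays informative across the whole error range.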
Unlike exponential decay, Huber rewards don't become negligible on large errors — the model always receives a meaningful signal to improve — yet they remain stable near the correct answer, refining predictions without overreacting to small fluctuations. This yielded our best result, a 4.9% R² improvement over baseline, and we used the Huber reward as the default for multiproperty training. Carrying this forward, we fine-tuned a single model to predict all 11 properties simultaneously. Our best-performing model was Nova 2 Lite with RFT on top of full-rank SFT, meaning that all the model parameters were updated. It outperforms Claude Sonnet 4 by 39% and base Nova 2 Lite by 37% on average RMSE. While averaging 5% behind the baseline GNN, it matches or outperforms the GNN on 7 of 11 properties — a striking result given that a single LLM is going toe-to-toe with multiple independently trained multitask GNN models, reducing not just model count but the entire infrastructure footprint around training, deployment, and maintenance. It’s important to note that Nova Forge — a service that allows Amazon Web Services customers to use proprietary data during both pretraining and SFT — supports both SFT and RFT on SageMaker, enabling extensive model customization. Since SageMaker handles the training framework and infrastructure maintenance internally, organizations avoid the cost of building and maintaining custom training pipelines from scratch.

What’s next?

Based on these initial experiments and results, Nimbus Therapeutics recently deployed its Novus model on Amazon Bedrock. Novus is the company’s custom-built LLM, created through Nova Forge. In its current form, Novus handles molecular-property prediction with an accuracy that is competitive with purpose-built GNNs.
The next milestone is extending those capabilities toward molecular design, enabling the model to propose structural modifications, predict their downstream properties, and explain its reasoning, all in a single conversation.

Acknowledgements

Leela Dodda (Nimbus), Aarush Garg (Nimbus), Matthew Medina (Nimbus), Md Tamzeed Islam, Elyse Zhang, Clement Perrot, Rohit Thekkanal, Shiv Vitaladevuni</content:encoded>
      <pubDate>Wed, 15 Apr 2026 16:10:55 GMT</pubDate>
      <guid>https://www.amazon.science/blog/customized-amazon-nova-models-improve-molecular-property-prediction-in-drug-discovery</guid>
    </item>
    <item>
      <title>AWS and Hopkins Engineering announce groundbreaking database for AI/ML antibody design</title>
      <link>https://www.amazon.science/news/aws-gray-lab-johns-hopkins-announce-groundbreaking-database-for-ai-ml-antibody-design</link>
      <description>Built in collaboration with the Gray Lab at Johns Hopkins Whiting School of Engineering, the Antibody Developability Benchmark is powered by one of the most diverse antibody datasets in public literature, enabling transparent performance evaluation for AI-guided antibody design.</description>
      <content:encoded>In 1986 the US Food and Drug Administration issued its first approval for human use of a therapeutic antibody. Despite steady advances in methodology, genetic sequencing, and biomedical science, 40 years later the process of discovering and optimizing therapeutic antibodies often remains prohibitively expensive in both money and time. Recent experiences with pandemic-style infectious-disease outbreaks lend an even greater urgency to the need to more quickly and efficiently identify and develop these antibodies. Artificial-intelligence- and machine-learning-guided approaches to antibody design, in the form of biological foundation models (BioFMs), represent a significant opportunity to address these challenges. Models built using protein language models (pLMs) and structure-based deep-learning frameworks have significant potential to predict antibody developability properties — the characteristics that determine whether a molecule is manufacturable, stable, and safe as a therapeutic. The development of those tools could drastically shorten discovery timelines while also reducing experimental costs. That potential, however, has been hindered by the lack of a public dataset that would allow researchers to benchmark those tools, a crucial step in the development of trustworthy in-silico tools for drug discovery. Existing public antibody datasets are too often limited by a focus on a single antibody format or target. Others are composed of naturally occurring or clinically advanced antibodies, a bias that severely limits their utility for training or evaluating predictive models. “Trust in the predictions made by these models must be grounded in evaluations against experimental data that is sufficiently large and diverse,” explained Luca Giancardo, an applied scientist with Amazon Web Services (AWS) who works on the Amazon Bio Discovery team.
“That data must be representative of the real sequence space encountered during antibody engineering and balanced in terms of developability outcomes.” Jeffrey Gray is a professor in the Chemical and Biomolecular Engineering Department at the Johns Hopkins Whiting School of Engineering, where he leads the Gray Lab, which focuses on the computational prediction and design of protein structures. He is also the original developer of RosettaDock, a tool for the prediction of the structure of protein complexes from their constituent proteins. Gray noted that while AI has made tremendous progress in the prediction and design of antibody properties, his own lab’s benchmarks have shown that current models do not yet reliably predict critical developability features, such as solubility and specificity, needed for efficient design of therapeutics. He cited the lack of diverse data in standardized conditions as a primary limitation for training models. That, coupled with the absence of a comprehensive, heterogeneous, large-scale database, has acted as a significant drag on the development of AI tools for antibody design.

Antibody developability benchmark

To that end, AWS, in collaboration with the Gray Lab and Johns Hopkins Engineering, is announcing the launch of the Antibody Developability Benchmark, powered by the largest and most diverse antibody dataset in public literature. This is the first large-scale benchmark of antibody biophysical and biochemical properties designed to support the development and rigorous evaluation of in-silico antibody property predictors. The Antibody Developability Benchmark is 20 times as diverse — in terms of antibody formats, targets, and developability profiles — as benchmarks currently available in the scientific literature. While other datasets may contain more individual antibody designs, they typically explore a single target or antibody framework with limited property coverage.
The Antibody Developability Benchmark is unique in its combination of scale and heterogeneity, encompassing 50 seed antibodies, four structural formats, and 42 antigens. It also includes both favorable and unfavorable developability outcomes. Gray lauded the opportunity to work with AWS experts, noting that the collaboration has enabled the creation of a dataset larger and more diverse than any of the publicly available datasets. He called the project an important next step toward fulfilling the promise of AI to improve human health. The Antibody Developability Benchmark includes the first heterogeneous antibody-property dataset explicitly designed to capture favorable and unfavorable developability profiles across multiple antigens and mutation strategies. Crucially, all data was affirmed via wet-lab experiments, providing ground truth validation that existing public benchmarks lack. “This dataset will allow researchers to confidently be able to answer ‘Which model is better suited for our purposes?’,” noted Giancardo, whose Bio Discovery team led the development of the dataset. “Today there are many computational models coming out that are mostly evaluated on either proprietary data or public datasets, which are not representative of antibody heterogeneity. That means deciding what is better or worse is very, very hard — if not impossible.” The unmatched diversity and deliberate heterogeneity of the Antibody Developability Benchmark will help make those determinations possible. Michael Chungyoun, a PhD researcher at JHU who worked on the project, observed that the benchmark covers a wide space of antibodies, particularly in terms of their properties. He noted that allowing researchers to check against a very diverse benchmark can save time and labor by helping them compare models and choose the best approach. 
The antibody dataset

The dataset consists of 50 clinically and scientifically relevant seed antibodies spanning four structural formats — IgG, VHH, NearGermline-IgG, and scFv — targeting 42 distinct antigens. It measures expression, purity, thermostability, aggregation, polyreactivity, and hydrophobicity — six traits that are essential in the development of viable therapeutic antibodies. “The composition is a deliberate design choice,” Giancardo noted. “We strove to find a balance between heterogeneity of antibody classes, therapeutic targets, and mutation types, with the aim of creating benchmarks that would be generalizable across the structural diversity of the modern therapeutic-antibody landscape.” Researchers at the Gray Lab, assisted by a sponsored research grant from AWS, helped select the seed antibodies for inclusion in the dataset. They were intentional about the seeds they chose, Chungyoun noted, opting in some cases for existing clinical-stage antibodies or FDA-approved antibodies. The team also selected antibodies more akin to those that circulate in the human body but aren't approved therapeutics. Those are called germline antibodies. Chungyoun explained that germline antibodies have important biophysical characteristics. While some of those characteristics are shared with therapeutic antibodies, there are also differences between the two. The extent of those differences, and how to bridge that gap, is a vital and unanswered question. Traditional antibody-based drug discovery begins with antibodies that come from animals or humans. Chungyoun explained that germline antibodies occasionally need to be modified to look more like therapeutics. That process is one researchers are still exploring.

Mutation strategy

The dataset also includes engineered variants of each seed antibody, generated by applying systematic mutation strategies to each seed.
“Initially, the hardest thing was essentially coming up with example sequences that would cover the broad spectrum of properties and the ways of mutating these sequences,” Giancardo explained. “It's challenging because you have to do it a priori until you do it, and then you don't know what will come out.” Working with Johns Hopkins Engineering, Giancardo and his team systematically engineered variants employing a variety of approaches, including protein-language-model-guided (pLM-guided) versus non-pLM-guided mutation selection and amino acid substitutions versus insertions/deletions. “Protein language models are essentially the equivalent of large language models [LLMs] for the protein world,” Giancardo said. “There are multiple ways of looking at proteins. A common way is expressing them as a string of amino acids, which are essentially letters.” When some of the letters in the amino acid chains are masked, the models can be trained to fill in the gaps — the same "self-supervised" approach used to train LLMs. The models can also be trained to predict what changes inserting a different letter or letters — i.e., mutation — will yield. That approach resulted in a wide variety of mutations — up to 99 engineered variants per seed. The breadth and depth of those mutations contribute to another distinguishing feature of the Antibody Developability Benchmark: its deliberate heterogeneity. The inclusion of both favorable, or developable, and unfavorable, or poorly developable, examples sets it apart from existing datasets. “This range is essential for training and evaluating machine learning models, which require balanced label distributions and exposure to the failure modes they are intended to predict and avoid,” Giancardo explained. He also clarified that those failures still fall within a range of viability. “These are not examples that are obviously wrong but rather bad examples that have a fighting chance," he added. 
“These all still meet some baseline quality assessment, meaning researchers could reasonably send them to a wet-lab partner to test.”

Zero-shot learning

Gray and his team at Hopkins Engineering also collaborated with their AWS counterparts by selecting and running existing open-source antibody design and prediction models on their own. They then shared their findings with the Bio Discovery team, who compared the results those models generated against the benchmarking dataset without exposing those models to the information in that dataset. “This is essentially zero-shot inference,” Giancardo said. That siloed approach allowed both sides to have greater confidence in the results the Antibody Developability Benchmark generated. “The fact that we operated separately gave us confidence that we were not introducing errors. There is no data leakage of any sort, even from an external perspective.” The teams compared their data and used those results to further fine-tune the Antibody Developability Benchmark. That iterative process means researchers who utilize the benchmark can have greater confidence about the viability of their models before the necessary, and costly, step of working with a wet-lab partner. That can also shorten the overall experimentation timeline. “When you are confident enough to do a screen, then you can turn to the wet lab, get new metrics, and further train on those results, which will be much, much, much more meaningful,” Giancardo explained.

The future

Researchers at both AWS and Hopkins Engineering emphasized the importance of sharing model benchmarks based on the Antibody Developability Benchmark Dataset with the larger scientific community. The benchmark results are now available as part of Amazon Bio Discovery; additional benchmarks will be added over time and released in a paper later this year.
The sharp uptick in proposed protein AI models has researchers excited, but the expense and time commitment of wet-lab work have thus far kept them from comparing those models head to head, Chungyoun observed. He noted that the launch of this dataset means those researchers now have an opportunity to learn which model properties improve performance. That can serve to illuminate the connection between what models learn and how those models can be improved to better predict those properties. The dataset won’t remain static either: more models and properties will be added in the future. "The database has the potential to surface models and tools that may have previously gone unrecognized — research published in lesser-known venues or work that simply didn't receive the attention it deserved," said Nina Cheng, a senior science manager in the AWS Life Sciences organization. "This database can play a key role in bringing that kind of overlooked work to light."

Acknowledgements

Amazon Bio Discovery Science and product team: Luca Giancardo, Yue Zhao, Melih Yilmaz, Kemal Sonmez, Lan Guo, Gordon Trang, Edward Lee, Chuanyui Teh, Fangda Xu, Nina Cheng, Jiwon Kim.</content:encoded>
      <pubDate>Tue, 14 Apr 2026 14:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/news/aws-gray-lab-johns-hopkins-announce-groundbreaking-database-for-ai-ml-antibody-design</guid>
    </item>
    <item>
      <title>How Amazon uses agentic AI for vulnerability detection at global scale</title>
      <link>https://www.amazon.science/blog/how-amazon-uses-agentic-ai-for-vulnerability-detection-at-global-scale</link>
      <description>Amazon&amp;#8217;s RuleForge system uses agentic AI to generate production-ready detection rules 336% faster than traditional methods.</description>
      <content:encoded>In 2025, the National Vulnerability Database published more than 48,000 new common vulnerabilities and exposures (CVEs), reflecting the impact of automated and AI-powered tools on vulnerability discovery. For security teams, however, knowing about new vulnerabilities isn’t enough; they must translate each disclosure into robust detection logic fast enough to protect large, complex systems. At AWS, we built RuleForge, an agentic-AI system that generates detection rules directly from examples of vulnerability-exploiting code, achieving a 336% productivity advantage over manual rule creation while maintaining the precision required for production security systems and enhancing customer security. Closing the gap between disclosure and defense At Amazon, detection rules are written in JSON and applied to data such as requests to MadPot, a global “honeypot” system that uses digital decoys to capture the behavior of malicious hackers, and likely exploit attempts flagged by our internal detection system, Sonaris. We expect the number of high-severity vulnerabilities published to the NVD to continue to grow, which means that AI-powered automation is essential for security at scale. By automating rule generation, we’re closing the gap between disclosure and defense while expanding our coverage. Our teams can now turn high-severity CVEs into validated detection rules at a pace and scale that would be impossible with traditional methods, providing more comprehensive protection for customers. The manual detection-rule workflow Before RuleForge, creating a detection rule for a new CVE was a multistep, analyst-driven process: Download and analyze. A security analyst located publicly available proof-of-concept exploit code — code that demonstrates how to trigger a vulnerability — and studied it to understand the attack mechanism, inputs, and expected behavior. Write detection logic. 
The analyst authored a rule to catch malicious traffic targeting the vulnerability, then wrote queries to measure the rule's accuracy against traffic logs. Validate and iterate. The analyst ran those queries, reviewed the results, tuned the rule to reduce false positives, and repeated until the rule performed well enough for production. Peer review and deploy. Finally, the analyst submitted the rule for code review by another security engineer before deployment. This workflow produced high-quality rules, but the time investment meant the team had to carefully prioritize which vulnerabilities to cover first. Reframing rule creation as an agentic-AI pipeline RuleForge reimagines this workflow as an agentic-AI system — a set of specialized AI agents that collaborate to generate, evaluate, and refine detection rules, with humans remaining in the loop for final approval. Rather than attempting to solve the end-to-end problem with a single model, RuleForge decomposes the task into stages that mirror how human experts work: Automated ingestion and prioritization. RuleForge downloads publicly available exploit proof-of-concept code demonstrating how to target a specific vulnerability. It scores each exploit using content analysis and threat intelligence sources. This ensures that rule generation focuses on the threats that matter most. Parallel rule generation. For each prioritized CVE, a generation agent running on AWS Fargate with Amazon Bedrock proposes multiple candidate detection rules in parallel. Each candidate can be refined across several iterations based on feedback from later stages, enabling the system to explore different detection strategies before selecting the most promising ones. Instead of relying on one expert working rule by rule, RuleForge treats detection engineering as a pipeline where AI proposes options and humans decide what ships. AI-powered evaluation. A separate evaluation agent reviews each candidate. 
This is one of RuleForge's key innovations: rather than having the generation model judge its own work, RuleForge uses a dedicated "judge" model to score each rule on two dimensions that human experts use to assess detection rules: Sensitivity: What is the probability that this rule will fail to flag malicious requests described in the CVE? Specificity: What is the probability that this rule targets a feature that correlates with the vulnerability rather than the vulnerability itself? Multistage validation. Rules that pass the judge move through a pipeline of increasingly rigorous tests. Synthetic testing generates both malicious and benign test cases to verify basic detection accuracy. Rules are then validated against traffic logs, such as those from MadPot, to confirm they perform as expected. Rules that fail at any stage get sent back to the generation agent with specific feedback explaining why, creating a closed loop of improvement. Human review and deployment. The best-performing rule enters code review, just as before. A security engineer reviews it, and any feedback goes back to the generation agent for revision. Human judgment remains the final gate before production deployment. Why a separate judge model matters When we asked the rule generation model to report its confidence in its own candidate rules, it thought almost everything it produced was good. This aligns with research showing poor LLM calibration on security topics. The solution was separating generation from evaluation. Using a dedicated judge model reduced false positives by 67% while maintaining the same number of true positive detections. Two main design choices made the judge effective: Negative phrasing improves accuracy. Asking "what is the probability that the rule fails to flag malicious requests?" produces better calibration than asking "what is the probability that the rule correctly flags all malicious requests?" 
Given that LLMs tend toward affirmation, framing the evaluation as a search for problems yields more honest assessments. Domain-specific prompts outperform generic ones. Simply asking the model to rate its overall confidence in a rule produced poor calibration. The questions that worked encoded what security engineers actually look for: whether the rule targets the vulnerability mechanism itself versus a correlated surface feature and whether the rule covers the full range of exploit variations. The system also generates reasoning chains explaining its scores. We evaluated those reasoning chains against human assessments and found that the AI judge's reasoning matched expert human reasoning for six out of nine rules. For example, when a human evaluator noted, "That SQL injection regex is too loose," the judge had independently determined that "the regex pattern will catch any query parameter with a single quote, which is broader than just the specific vulnerability." Results and what’s next We deployed the confidence scoring system in August 2025, accelerating how quickly our analysts can deploy new detection rules. Over the final four months of the year, RuleForge enabled our team to produce and validate rules 336% faster than it could manually, while maintaining the high accuracy required for production security systems. By shifting analyst focus from authoring to review, we’ve multiplied overall throughput without compromising quality. We’re closing the gap between vulnerability disclosure and defense more effectively than ever before and ensuring that the managed protections that help safeguard customer workloads on AWS are updated faster and cover more high-severity CVEs. RuleForge demonstrates that agentic AI can augment human security expertise at production scale while meeting precision requirements. 
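The judge’s two failure-probability dimensions described above can be illustrated as a simple acceptance gate. This is a hedged sketch only: the struct, function name, and threshold values are hypothetical stand-ins, not RuleForge’s actual scoring code.

```c
#include <stdbool.h>

/* Hypothetical judge scores, phrased negatively as in the text:
 * each is an estimated probability of a failure mode. */
typedef struct {
    double p_miss;    /* probability the rule fails to flag malicious requests */
    double p_surface; /* probability the rule targets a correlated surface
                         feature rather than the vulnerability mechanism */
} judge_scores;

/* A candidate rule advances to synthetic testing only if both
 * failure probabilities fall below illustrative thresholds. */
bool judge_accepts(judge_scores s) {
    const double MISS_THRESHOLD = 0.10;    /* hypothetical value */
    const double SURFACE_THRESHOLD = 0.20; /* hypothetical value */
    return s.p_miss < MISS_THRESHOLD && s.p_surface < SURFACE_THRESHOLD;
}
```

Framing both questions as probabilities of failure, rather than of success, mirrors the negative phrasing that produced better calibration in practice.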
The key innovations are architectural: separating rule generation from rule evaluation, using multiple specialized agents rather than a single model, and keeping humans in the loop for final approval. As the rate of vulnerability disclosures continues to accelerate, these design principles will help us keep defenses current. For a deeper look at the technical details behind RuleForge, including the evaluation methodology and experimental results, see our paper on arXiv.</content:encoded>
      <pubDate>Wed, 08 Apr 2026 16:17:20 GMT</pubDate>
      <guid>https://www.amazon.science/blog/how-amazon-uses-agentic-ai-for-vulnerability-detection-at-global-scale</guid>
    </item>
    <item>
      <title>Verifying and optimizing post-quantum cryptography at Amazon</title>
      <link>https://www.amazon.science/blog/verifying-and-optimizing-post-quantum-cryptography-at-amazon</link>
      <description>How automated reasoning reconciles the demands of security, performance, and maintainability.</description>
      <content:encoded>Today, secure online communication is enabled by public-key cryptography, primarily RSA and elliptic-curve cryptography (ECC), whose security depends on the assumption that certain computational problems are intractable. However, while believed to be intractable for conventional computers, the problems underlying RSA and ECC may be tractable for sufficiently large quantum computers. “Store now, decrypt later” attacks — which intercept encrypted information and hold onto it until quantum computers can decrypt it — mean that data needs protection long before such attacks become technically feasible. Post-quantum cryptography (PQC) is cryptography running on classical computers but secure in the face of quantum computing. In 2024, following an eight-year standardization effort, the US National Institute of Standards and Technology (NIST) published standard FIPS-203, which specifies the Module-Lattice-Based Key Encapsulation Mechanism, or ML-KEM, as a mechanism for key agreement believed to be secure against attacks from quantum computers. In this post, we describe how Amazon’s Automated Reasoning Group, AWS Cryptography, and the open-source community have collaborated to create an open-source, formally verified, and optimized implementation of ML-KEM, protecting customers against store-now-decrypt-later attacks with the highest assurance and minimal cost. What is good cryptographic engineering? In keeping with Amazon’s customer obsession, we prioritize three goals when working on cryptographic solutions: The security of the customer’s data: Cryptography is notoriously hard to implement securely, and any flaw can endanger the customer’s privacy; The customer experience: Cryptography is a computational tax that we minimize to ensure the lowest cost and best experience for our customers; Our ability to maintain the solution going forward: The less time we need to spend on maintenance, the more we can innovate on behalf of our customers. 
There are, however, tensions between these goals: Simple code is easiest to maintain and write securely but tends to be slow. Fast code tends to be more difficult to audit and prone to errors. Automated reasoning allows us to resolve these tensions and provide our customers with cryptographic solutions that are secure, fast, and maintainable, all at once. Yet another implementation of ML-KEM? ML-KEM — formerly known as Kyber — is well studied from an implementation perspective: On the one hand, the Kyber reference code provides a clean C implementation that has been scrutinized for years. On the other hand, numerous research papers describe how to optimize ML-KEM for various metrics and platforms. The challenge faced by AWS Cryptography and the Automated Reasoning Group in 2024 was to combine the simplicity of the reference implementation and the optimization potential revealed in the research works in a single production-ready implementation. Around the same time, AWS became a founding member of the Linux Foundation’s Post-Quantum Cryptography Alliance (PQCA), which created the Post-Quantum Cryptography Package (PQCP), “a collection of open-source projects aiming to build high-assurance software implementations of standards-track post-quantum cryptography algorithms”. Therefore, rather than brewing our own code, members of our team joined the PQCP and soon after launched mlkem-native, a high-assurance, high-performance C implementation of ML-KEM aiming to combine the ML-KEM reference implementation with research on optimization and formal verification. Coding, fast and slow Mlkem-native’s modular design combines a frontend covering the high-level logic of ML-KEM with a backend responsible for all performance-critical subroutines. Each subroutine — including the Keccak permutation underlying SHA3 and the number-theoretic transform (NTT) underlying fast polynomial arithmetic — has multiple, highly efficient implementations written natively for specific hardware. 
In addition to the default C implementation, mlkem-native provides assembly/intrinsics backends for AArch64, x86_64, and RISC-V64. Importantly for maintainability, the interface between frontend and backend is fixed: a developer adding optimizations for a new target architecture implements select backend functionality against the backend specification, while the frontend stays the same. The development of the backend specification turned out to be less straightforward than it sounds, as we explain below. Knowing your limits Memory safety A well-known challenge with the C programming language is the risk of buffer overflows: writing past the designated limits of a memory region can corrupt data structures and, when maliciously exploited, lead to unauthorized code execution. The umbrella term for such issues is memory safety. Memory-safe languages such as Rust can limit the impact of out-of-bounds accesses — by, for example, panicking instead of exhibiting undefined behavior — but they don’t prevent the mistake itself. Type safety Another well-known challenge, this time with implementing ML-KEM, is the risk of integer overflows — an aspect of type safety. Like RSA and ECC, ML-KEM relies on modular arithmetic, in which the results of operations are divided by a particular number — in ML-KEM’s case, the prime 3,329, designated MLKEM_Q or just q — and only the remainder is carried forward. The modulo operator is represented by the percentage symbol, %. Logically, if two numbers x and y need adding or multiplying in ML-KEM, one needs to compute (x + y) % q and (x * y) % q; for example, (294 * 38) % q = 11,172 % q = 1,185. Such “eager” arithmetic modulo q, which constantly applies modular reduction to represent data in the “canonical” range {0, 1, 2, … , q-1}, is prohibitively slow. 
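The eager style can be sketched in a few lines of C. This is a toy illustration of the arithmetic only, not mlkem-native code:

```c
#include <stdint.h>

#define MLKEM_Q 3329 /* the ML-KEM prime modulus */

/* Eager modular arithmetic: reduce after every single operation, so
 * results always stay in the canonical range {0, 1, ..., q-1}. */
uint16_t eager_add(uint16_t x, uint16_t y) {
    return (uint16_t)((x + y) % MLKEM_Q);
}

uint16_t eager_mul(uint16_t x, uint16_t y) {
    /* widen before multiplying to avoid 16-bit overflow */
    return (uint16_t)(((uint32_t)x * y) % MLKEM_Q);
}
```

Here eager_mul(294, 38) computes 11,172 % 3,329 = 1,185, matching the worked example; the cost is a division-like reduction on every operation.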
Efficient ML-KEM implementations instead use “lazy” arithmetic modulo q: data is operated on without modular reduction for as long as possible, and only once there is a worst-case risk of overflow does reduction happen. Further, this allows the use of imperfect reduction algorithms such as Montgomery reduction, which are fast but don’t always give fully reduced outputs. In the case of ML-KEM, data modulo q = 3,329 is typically stored in signed 16-bit integers. When dealing with lazy arithmetic across the numerous arithmetic routines in ML-KEM, it is therefore essential to track the worst-case bounds of the data and insert modular reductions where those bounds would exceed the limits of 16-bit integers. Small mistakes in this domain can evade testing — because average bounds tend to be much smaller than worst-case bounds — and then randomly surface in production. Tracking buffer bounds and especially arithmetic bounds is time consuming and error prone: for example, weakening the output bounds of a low-level arithmetic function might lead to a rare arithmetic overflow in an entirely different function. Checking this by hand not only requires meticulous documentation and skilled auditors but also slows down development. In mlkem-native, we use a tool called the C Bounded Model Checker (CBMC) to automatically verify memory safety and type safety at the C level: for every function, we add machine- and human-readable contracts to the source code to specify the bounds of buffers and arithmetic data, and we have CBMC automatically verify that, with respect to those bounds, no buffer overflow or arithmetic overflow can happen. Let’s look at a simple example of modular reduction, focusing on the relevant parts one at a time. First, note the __contract__( ... ). Slightly simplified, the memory_no_alias and memory_slice lines specify which memory the code can read and write; this relates to memory safety. 
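To make the discussion concrete, here is a hedged sketch of the kind of annotated code being described, with CBMC-style contracts shown as comments. The function names follow the text, but the bodies and contract wording are an illustrative reconstruction, not the actual mlkem-native source:

```c
#include <stdint.h>

#define MLKEM_Q 3329

/* Illustrative Barrett-style reduction: maps any int16_t input to a
 * representative congruent mod q in a symmetric range around zero.
 * A CBMC contract for it might read, roughly:
 *   __contract__(ensures(-MLKEM_Q/2 <= return_value &&
 *                         return_value <= MLKEM_Q/2))
 */
int16_t mlk_barrett_reduce(int16_t a) {
    /* 20159 = round(2^26 / q); one multiply-and-shift estimates a / q */
    const int32_t v = 20159;
    int16_t t = (int16_t)(((int32_t)v * a + (1 << 25)) >> 26);
    return (int16_t)(a - t * MLKEM_Q);
}

/* Maps a symmetric representative to the canonical range [0, q).
 * Contract sketch: requires(-MLKEM_Q/2 <= c && c <= MLKEM_Q/2),
 *                  ensures(0 <= return_value && return_value < MLKEM_Q) */
uint16_t mlk_scalar_signed_to_unsigned_q(int16_t c) {
    /* branch-free: add q exactly when c is negative */
    return (uint16_t)(c + ((c >> 15) & MLKEM_Q));
}
```

Composing the two takes an arbitrary 16-bit value first into the symmetric interval and then into [0, q), which is the shape of the contract composition the text walks through.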
The ensures(array_bound(...)) clause relates to type safety: it specifies that the function will guarantee that upon return, the data is within the interval [0, 1, …, q). In the proof, you see the __loop__(invariant(...)), specifying how the loop gradually establishes this bound: in the ith iteration, it holds up to the ith coefficient. Finally, the implementation effectively composes mlk_barrett_reduce and mlk_scalar_signed_to_unsigned_q. CBMC does not look inside these but replaces them with their contracts: You can see that mlk_barrett_reduce first establishes a symmetric output interval (-q/2, …, q/2), and then mlk_scalar_signed_to_unsigned_q maps it to [0,1, …, q). In this instance, it is easy to confirm by eye that the specifications line up in the desired way, but for more complex examples, this is less obvious. Either way, CBMC checks it for us automatically. Going fast, staying safe The CBMC proofs described above establish memory safety and type safety for mlkem-native's C code. However, the most performance-critical parts of mlkem-native — the Keccak permutation and number theoretic transform — are implemented in hand-optimized assembly for AArch64 and x86_64. To gain assurance for the assembly implementations in mlkem-native while maintaining high performance, we use three components: SLOTHY, an assembly superoptimizer; HOL Light, a theorem prover; and s2n-bignum, a verification infrastructure for assembly built on HOL Light. Together, they enable a workflow where developers write clean, maintainable assembly, while deployed code achieves peak performance with formal guarantees of correctness. Writing high-performance assembly by hand creates a fundamental tension: clean, auditable code that clearly expresses the computation is slow, while fast code is dense, microarchitecture specific, and difficult to maintain. 
SLOTHY resolves this tension by automating microarchitecture-specific optimizations: it converts an assembly program into a constraint satisfaction problem, finds optimal instruction schedules and register allocations using a constraint solver, and outputs optimized assembly. Developers write clean code emphasizing the logic of the computation, and SLOTHY generates the fast code. We prove functional correctness for all AArch64 and x86_64 assembly routines using HOL Light and s2n-bignum. Where SLOTHY is used, the proofs are written to be agnostic to the specific instruction ordering and register allocation; we can therefore reoptimize the code for a specific microarchitecture without having to adjust the proofs. This “post-hoc” verification approach establishes the mathematical correctness of the computation represented by the assembly regardless of how it came about; in particular, SLOTHY is removed from the trusted computing base. Keeping it honest Formal verification is never absolute. Every proof links formal objects — specifications and models — to informal, real-world requirements and systems, and these links introduce gaps. Does the formal specification capture what we actually need? Does the formal model faithfully reflect the real system? Is the proof infrastructure itself sound? Earning and maintaining customer trust requires being transparent about these limits. We therefore developed and published a document titled SOUNDNESS.md, where we map out what is proved in mlkem-native, what is assumed, and where the residual risks lie — from the fidelity of the hardware models used in HOL Light proofs, to the larger trusted computing base of CBMC, to the manual bridge between the two verification stacks. For each gap, we describe mitigations in place and outline future work. Our goal is not to claim perfection but to earn trust through transparency. 
We encourage the community to read SOUNDNESS.md critically, challenge our assumptions, and help us close the remaining gaps. Getting on the road Mlkem-native is integrated into AWS-LC, Amazon's open-source cryptographic library, which underpins secure communication across AWS services. The integration uses an automated importer that pulls mlkem-native source code directly from the upstream repository, ensuring that AWS-LC stays synchronized with the latest verified implementation. The integration is designed for minimal friction: mlkem-native's modular architecture allows AWS-LC to import the core ML-KEM logic while providing its own implementations of platform-specific components. For example, AWS-LC maps mlkem-native's cryptographic primitives to its existing FIPS-202 (SHA-3) implementation, uses AWS-LC's random-number generation and memory zeroization functions, and enables FIPS-mode features like pairwise consistency tests when required. Enabling this is a thin compatibility layer that bridges mlkem-native's API to AWS-LC's infrastructure without modifying the verified code. Critically, the CBMC contracts that prove memory safety and type safety are preserved in the imported source code. While the preprocessor removes them from compiled binaries, they remain in the source as machine-checkable documentation of the code's guarantees — a form of "living proof" that travels with the implementation. Moreover, because both mlkem-native and AWS-LC are open source and permissively licensed, their benefits extend beyond AWS. Anyone can integrate mlkem-native into their systems and gain the same combination of performance and assurance. The formal verification artifacts — CBMC contracts and HOL Light proofs — are part of the repository, all tools involved are open source, and scripts are provided for setup and proof checking, inviting an independent validation of our security claims. 
Impact The development of mlkem-native demonstrates that the three goals of cryptographic engineering — security, performance, and maintainability — are not in conflict when automated reasoning is applied systematically. CBMC freed us from manually tracking bounds through complex arithmetic, catching errors that would evade testing and surface randomly in production. The annotations stay in the source code as machine-checkable documentation, making the code simultaneously more maintainable and more secure. HOL Light and s2n-bignum allowed us to deploy aggressive assembly optimizations with mathematical certainty of correctness. SLOTHY let us write clean, auditable code while achieving peak performance for specific microarchitectures. And because the proofs are written to be optimization agnostic, we can retarget the code without redoing the verification. The result is an implementation that is simultaneously more secure, faster, and easier to maintain than what traditional development could achieve. We didn't compromise between customer security, customer experience, and our ability to innovate: automated reasoning delivered all three.

Platform | Operation | AWS-LC-FIPS 3.1 (ops/s) | AWS-LC-FIPS 4 (ops/s) | Ratio
c7i | Keygen | 30,899 | 65,146 | 2.1
c7i | Encaps | 30,623 | 61,233 | 2.0
c7i | Decaps | 25,141 | 51,545 | 2.0
c7g | Keygen | 29,617 | 71,134 | 2.4
c7g | Encaps | 28,482 | 66,874 | 2.3
c7g | Decaps | 23,919 | 64,765 | 2.3

Performance impact of switching from the ML-KEM reference implementation to mlkem-native in Amazon’s cryptography library AWS-LC. ML-KEM-768 performance is measured on c7i and c7g EC2 instances. The numbers represent operations per second (higher is better). The baseline is an AWS-LC-FIPS 3.1 release that contains the ML-KEM C reference implementation. The AWS-LC-FIPS 4 release is built with mlkem-native. The platforms are c7i with Intel(R) Xeon(R) Platinum 8488C and c7g with Graviton 3 processor. 
Acknowledgments We thank our colleague John Harrison, senior principal applied scientist at the Automated Reasoning Group, for providing the bulk of the AArch64 assembly proofs in HOL Light and for maintaining the HOL Light interactive theorem prover and the s2n-bignum verification infrastructure. Mlkem-native is a collaborative effort involving not only AWS but many members of the open-source community. Foremost, we thank our co-maintainer Matthias Kannwischer from zeroRISC, who started mlkem-native with us and has since been instrumental in the success of the project.</content:encoded>
      <pubDate>Tue, 07 Apr 2026 15:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/verifying-and-optimizing-post-quantum-cryptography-at-amazon</guid>
    </item>
    <item>
      <title>Improving quality and robustness in LLM-based text-to-speech systems</title>
      <link>https://www.amazon.science/blog/improving-quality-and-robustness-in-llm-based-text-to-speech-systems</link>
      <description>Low-rank adaptation, data augmentation, and chain-of-thought reasoning are among the techniques enabling accent-free polyglot outputs, improved expressiveness, and reliable synthesis.</description>
      <content:encoded>Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But some problems with these models still persist. One is accent leakage in polyglot text to speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker’s voice. Expressiveness is another challenge, including the laughs, sighs, hesitations, and other indications of emotion that make speech engaging. And then there’s reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation. At Amazon, we're working to address all these issues. Mitigating accent leakage in polyglot TTS We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but without loss of speaker identity. Improving expressiveness We use classifier-free guidance (CFG) to generate synthetic reference audio samples with enhanced expressiveness. Using these as conditioning during inference pushes the model toward more expressive prosodic styles. Originally developed for diffusion modeling, CFG controls how strongly generation follows conditioning. 
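The CFG idea can be sketched in a few lines: run the model with and without conditioning, then extrapolate past the conditional prediction by a guidance weight. This is a generic illustration of the technique, not our production code, and the guidance weight w is a tunable assumption:

```c
#include <stddef.h>

/* Classifier-free guidance: combine unconditional and conditional
 * model outputs (e.g., logits over speech tokens).
 * w = 1 reproduces the conditional output; w > 1 pushes generation
 * further in the direction the conditioning indicates. */
void cfg_combine(const double *uncond, const double *cond,
                 double *out, size_t n, double w) {
    for (size_t i = 0; i < n; i++) {
        out[i] = uncond[i] + w * (cond[i] - uncond[i]);
    }
}
```

With w above 1, the combined output exaggerates whatever the conditioning contributes, which is what makes CFG useful for amplifying expressive prosodic styles in reference audio.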
CFG-based reference samples decouple speaker identity from accent, teaching the model to preserve voice characteristics while adopting native pronunciation in the target language. This allows us to scale a small number of recorded voices to many new locales and languages, while increasing expressiveness. Scored according to MUSHRA (multiple stimuli with hidden reference and anchor) listening tests, the quality of our models’ polyglot outputs across nine locales spanning English, French, Italian, German, and Spanish improved 5% to 20% over those of our previous model family.

Locale | Improvement over baseline
US-English | +12.43%
Southern US-English | +20.05%
Great Britain-English | +5.97%
Australia-English | +5.50%
US-Spanish | +11.78%
Spain-Spanish | +13.23%
France-French | +8.44%
Germany-German | +14.12%
Italy-Italian | +9.80%

Robustness Traditional TTS had failure modes, but hallucination and random truncation weren't chief among them. LLM-based TTS can generate confident-sounding speech that doesn't match the input, and it will sometimes stop mid-sentence. Chain-of-thought for autoregressive TTS Traditional TTS pipelines have explicit stages: grapheme-to-phoneme conversion, duration prediction, and acoustic generation. More recent, non-autoregressive end-to-end models like FastSpeech predict durations explicitly before speech generation. LLM-based TTS takes an alternate approach. Duration emerges implicitly from autoregressive generation. There's no explicit plan for how long the utterance should be or how long each phoneme should take. This is why these models hallucinate (keep generating past the intended content) or truncate (stop too early). To address this problem, we add chain-of-thought reasoning to the model: before generating speech tokens, the model predicts phoneme sequences and estimates duration (total length and per-phoneme timing). This isn't the same as traditional TTS pipelines. 
Bolting duration prediction onto an autoregressive architecture is a different problem than building it into a non-autoregressive one, and it has its own challenges. Phoneme prediction enables the model to handle heteronyms ("read," "lead") and unusual names more reliably. Duration prediction gives the model a timing plan, which reduces both hallucination and truncation. These predictions are also useful for debugging, as you can see what the model "thought" it was going to generate before it started generating. Guardrails Our guardrails use the chain-of-thought predictions as checkpoints. We know the expected phoneme count and approximate speech duration before generation starts. After generation, we do a pair of checks: does the output duration match the prediction, and is the output length reasonable given the phoneme count? Large deviations flag likely hallucinations or truncations. When an agent detects problems, it can prompt the TTS system to regenerate with different sampling parameters or fall back to alternative approaches. Data filtering To filter the text data passing to the TTS model, we combine speech-recognition-based metrics with metrics based on the LLM’s attention mechanism. Automatic speech recognition (ASR) catches actual transcription errors. Taken together, the metrics keep data that's genuinely well aligned while preserving expressiveness that ASR-only filtering would discard. On generic long-form text, our full array of techniques reduces critical errors to an average of less than one second per hour, where “critical errors” include hallucinations, cutoffs beyond one word, and mismatches between input text and output speech. Conclusion LLM-based TTS models sound noticeably more natural than traditional systems. However, in our experience, they introduce new failure modes that need to be addressed before they can be deployed reliably in production. 
We have found that LoRA-based fine tuning addresses the heavy accent leakage observed in polyglot TTS, while classifier-free guidance is a useful tool for improving expressiveness. As for reliability, we find that smart data filtering and chain-of-thought reasoning coupled with guardrails and agentic regeneration can significantly reduce hallucination.</content:encoded>
      <pubDate>Wed, 01 Apr 2026 18:13:19 GMT</pubDate>
      <guid>https://www.amazon.science/blog/improving-quality-and-robustness-in-llm-based-text-to-speech-systems</guid>
    </item>
    <item>
      <title>Formally verified AES-XTS: The first AES algorithm to join s2n-bignum</title>
      <link>https://www.amazon.science/blog/formally-verified-aes-xts-the-first-aes-algorithm-to-join-s2n-bignum</link>
      <description>Simplifying and clarifying the assembly code for core operations enabled automated optimization and verification.</description>
      <content:encoded>Cryptographic encryption algorithms are mathematical procedures that transform readable data into ciphertext that looks like a stream of random bits. The ciphertext can be decrypted only with the corresponding decryption algorithm and the correct key. For data at rest — information stored on disks or in databases — algorithms like AES-XTS encrypt each block before it’s written to storage, protecting against physical theft or unauthorized access to storage systems. For data in transit — information traveling across networks — protocols like TLS combine multiple algorithms: asymmetric encryption algorithms (RSA or elliptic curves) establish secure connections, while fast symmetric encryption algorithms (like AES-GCM) protect the actual data stream and verify that it hasn't been tampered with. At Amazon Web Services (AWS), we use AES-XTS to protect customer data in services like EBS, Nitro cards, and DynamoDB, while TLS with AES-GCM secures all network communication between services and to customers. We took on the challenge of formally verifying an optimized Arm64 assembly implementation of AES-XTS decryption, where “formal verification” is the process of proving mathematically that an engineered system meets a particular specification. Our work follows the IEEE Standard 1619 for cryptographic protection of block-oriented storage devices and focuses on the AES-256-XTS variant of AES-XTS. The “256” specifies the size of the encryption key. Unlike algorithms that process fixed-size blocks, AES-XTS handles variable-length data from 16 bytes up to 16 megabytes, with special logic for incomplete blocks. The assembly code verified was a 5x-unrolled version, meaning that its loops were executed in parallel across five registers (each containing an input block), and it had been optimized for modern CPU pipelines. 
It was complex enough that manual review couldn't guarantee correctness, yet critical enough that errors could compromise customer data security. As part of Amazon Web Services’ s2n-bignum library of formally verified big-number operations, we contributed an improved Arm64 assembly implementation of AES-XTS encryption and decryption, as well as specification and formal verification using the HOL Light interactive theorem prover, which was developed by a member of our team (John Harrison). This was an experiment in the proof-driven development of a large function with multiple paths based on the input length. It resulted in the largest proof so far in the s2n-bignum library. For the typical input size of 512 bytes, the performance of the algorithm either stayed close to that of the original code or improved slightly on highly optimized Arm cores. By adding this algorithm and its proof to the s2n-bignum library, we pave the way for more AES-based algorithms to be added.

Description of the algorithm

AES is a block cipher that implements a keyed permutation. This means that it processes plaintext in blocks (in this case, blocks of 128 bits), and for any given key, it defines a bijective (one-to-one and invertible) function mapping each plaintext block to a unique ciphertext block. This mathematical property ensures that decryption can uniquely recover the original plaintext. AES-XTS is the mode specifically designed for storage encryption. It uses AES as its underlying building block but adds position-dependent tweaks and ciphertext stealing — a method for handling partial blocks — to address the unique requirements of disk encryption, where you need random access to any sector and must preserve the exact data size. AES-XTS encrypts storage data using a two-key approach where each 128-bit block and its position-dependent tweak are subjected to an exclusive-OR operation (XOR), a binary operation that outputs a one only if the input values differ. 
The result of the operation is encrypted with AES, then XORed with the tweak again, ensuring that identical data at different disk locations produces different ciphertext. The tweak is generated by encrypting the sector number with a second key, then multiplying by powers of α in a Galois field, creating unique values for each block position. When the final block isn't a full 128 bits, ciphertext stealing kicks in. Ciphertext stealing borrows bytes from the previous block, allowing encryption of data of any length without padding or wasted space. This lets you read or write any sector independently — critical for disk encryption — while basing each block's encryption on its position. That is a desired feature, since the security model of disk encryption allows the adversary to access sectors other than those in question, modify them, and request their decryption. It also ensures that the size of the ciphertext is exactly the same as that of the plaintext, so it fits in its place.

Control flow of the assembly implementation

We started from an existing implementation of AES-XTS in Amazon’s AWS-LC cryptographic library. AES-XTS loops through the plaintext in 128-bit blocks, and encryption of each block requires 15 steps, each with its own “round key” derived from the encryption key. The existing implementation is 5x unrolled, meaning it processes blocks in parallel, five at a time. If the final block is less than 128 bits in length, there’s a risk of “buffer overread”, or reading beyond the limits of the input buffer. To avoid overread, the existing implementation performs complex manipulation of the pointer to the current location in the input buffer. This requires a sophisticated control flow that can be hard to follow: the loop counter is incremented and decremented multiple times before and during the loop, and the loop has two additional exit points other than the final loop-back branch. 
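The XOR-encrypt-XOR construction with position-dependent tweaks, described above, can be modeled in a few lines of Python. This is an illustrative sketch of the full-block path only (no ciphertext stealing), with the two AES keyed permutations passed in as callables; the function names are ours, not AWS-LC's or the verified assembly's.

```python
def xts_double(t: bytes) -> bytes:
    """Multiply a 128-bit tweak by alpha in GF(2^128), per IEEE 1619.
    The tweak is read as a little-endian integer; when the top bit shifts
    out, reduce by the polynomial x^128 + x^7 + x^2 + x + 1 (constant 0x87)."""
    n = int.from_bytes(t, "little")
    carry = n >> 127
    n = (n << 1) & ((1 << 128) - 1)
    if carry:
        n ^= 0x87
    return n.to_bytes(16, "little")


def xts_encrypt_full_blocks(enc_data, enc_tweak, sector: bytes, plaintext: bytes) -> bytes:
    """XOR-encrypt-XOR over full 16-byte blocks.
    enc_data / enc_tweak stand in for AES under key1 / key2."""
    assert len(plaintext) % 16 == 0 and len(sector) == 16
    t = enc_tweak(sector)  # initial tweak: encrypt the sector number under key2
    out = bytearray()
    for i in range(0, len(plaintext), 16):
        block = plaintext[i:i + 16]
        x = bytes(a ^ b for a, b in zip(block, t))  # first XOR with the tweak
        y = enc_data(x)                             # AES-encrypt the masked block
        out += bytes(a ^ b for a, b in zip(y, t))   # second XOR with the tweak
        t = xts_double(t)  # next block position: multiply the tweak by alpha
    return bytes(out)
```

With any fixed block cipher plugged in, two identical plaintext blocks at different positions see different tweaks and therefore produce different ciphertexts, which is exactly the position dependence the mode is designed to provide.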
One exit point is for the case when four blocks remain during the final iteration of the loop; the other exit point is for the case of one to three blocks remaining. The flow of the loop interleaves the load/store instructions with the AES and XOR instructions in an effort to maximize pipeline usage. After the loop exit, the processing of the remaining blocks is intertwined for the lengths of four down to one; if there’s a partial block at the end, the algorithm performs the ciphertext-stealing procedure. Additionally, only seven of the 15 round keys were kept in registers; the other eight were repeatedly loaded from memory, both in the loop and outside it. We first investigated whether we could improve the performance of the main loop by letting SLOTHY, a superoptimizer for Arm code, rearrange the instructions to maximize pipeline usage. SLOTHY contains simplified models of various Arm microarchitectures. It uses a constraint solver to provide an optimal instruction schedule, register renaming, and periodic loop interleaving. However, for SLOTHY to identify and optimize a loop, the loop has to exhibit typical loop behavior, decreasing the counter at the end of each iteration and then jumping back to the beginning. SLOTHY also cannot handle the nested loop created by loading the eight round keys. This gave us a reason to start “cleaning up” the loop. First, we freed up registers to permanently hold all round keys; this was possible because the optimized order of instructions required fewer temporary registers than the original code. Second, we removed the multiple exit points and the manipulation of the loop counter used to handle the remaining blocks. The value of the counter always indicates the number of five-block chunks remaining in the buffer, conforming to SLOTHY’s requirement; the loop ends before the handling of the remaining blocks. 
Once the loop ends, we have a separate processing branch to handle each possible number of remaining blocks, from one to four; all four branches end in ciphertext stealing. This can be seen in the control flow graphs of the encrypt and decrypt algorithms (see below). Throughout the code, we maintained a constant-time design mindset; that is, branching and special processing are based not on secret data but only on public values, the input byte lengths.

Performance

With our modifications to the code, we were able to use SLOTHY to optimize the encrypt algorithm. This resulted in slight performance gains on the AWS Graviton family of Arm processors, although the gains were smaller on the more advanced chips, which have an optimized out-of-order pipeline. Keeping round keys in registers throughout the algorithm’s execution, rather than repeatedly loading them from memory as in the original implementation, allowed us to offset the effects of no longer interleaving the AES instructions with loads and stores. Having a cleaner flow of instructions in the loop and modular exit processing allowed us to experiment with various unrolling factors for the loop iterations. We experimented with 3x, 4x, and 6x factors and concluded that 5x is still the best choice across various microarchitectures.

Ensuring correctness through formal verification

To deploy optimized cryptographic code in production, we need mathematical certainty that it works correctly. While random testing quickly checks simple cases, we rely on formal verification to deliver the highest level of assurance for our AES-XTS implementation.

Why HOL Light for AES-XTS?

To prove that our implementation matches the IEEE 1619 specification, we use HOL Light, an interactive theorem prover developed by our colleague John Harrison. HOL Light is a particularly simple implementation of the "correct by construction" approach to software development, in which code is verified as it’s written. 
HOL Light’s trusted kernel is just a few hundred lines of code, which implements basic logical inference rules. This means that even if there's a bug in our proof tactics or automation, it cannot cause HOL Light to accept an incorrect proof. At worst, a bug prevents us from completing a proof, but it cannot make a false statement provable. We chose HOL Light for several reasons specific to AES-XTS verification:

Assembly-level verification: We write our implementations directly in assembly rather than relying on compiled code. While more challenging, this makes our proofs independent of any compiler. HOL Light reasons directly about machine code bytes using CPU instruction specifications, providing assurance at the lowest level of the software stack.

Existing cryptographic infrastructure: S2n-bignum already provides extensive support for cryptographic verification, including symbolic simulation that strips away execution artifacts and leaves purely mathematical problems, specialized tactics for word operations, and byte list handling. We add proven lemmas about AES operations that we can reuse for the proofs of other AES modes.

Complex control flow handling: Unlike fully automated provers that might fail on complex proofs without enough explanation, HOL Light's interactive approach lets us guide proofs through the intricate invariants required by our 5x-unrolled loops, the processing of arbitrarily long blocks of data, and the complex memory reasoning required by variable-length inputs and partial blocks.

The s2n-bignum framework

The s2n-bignum library serves two purposes: it's both a framework for formally verifying assembly code on the x86-64 and Arm architectures and a collection of fast, verified assembly functions for cryptography. 
The library already contains verified implementations of numerous cryptographic algorithms, especially those pertaining to big-number mathematical operations (hence the name), which are the foundation of public-key cryptographic primitives. For details on how HOL Light was used to prove public-key algorithms as part of s2n-bignum, please refer to the previous Amazon Science blog posts “Formal verification makes RSA faster — and faster to deploy” and “Better-performing ‘25519’ elliptic-curve cryptography”. As we mentioned, AES-XTS is one of the modes of the AES block cipher. AES is based on a substitution-permutation network (SPN) structure, which combines substitution operations (SubBytes, using the S-box), permutation operations (ShiftRows, MixColumns), and key mixing. By expanding s2n-bignum to include the AES instruction set architecture (ISA) extensions found in Arm64 and x86_64 processors, specifications for the AES block cipher, and additional specifications for AES-XTS, we're paving the way for the same rigorous verification of more AES-based algorithms.

Developing and testing the specification

The SPN nature of AES and the modes that are based on it cannot be expressed using simple mathematical formulae — such as modular multiplication, which is fundamental to public-key cryptography — that can be innately understood by a theorem prover. They require writing descriptions of the steps for processing the data. This is why, before verifying the assembly, we needed confidence that our HOL Light specification accurately captured the IEEE standard. We wrote the specification to mirror the standard's structure, using byte lists for input/output and 128-bit words for internal block operations. Then we developed conversions: HOL Light functions that we used to evaluate specifications with concrete inputs while generating proofs that the evaluations are mathematically correct. 
We validated our specification by conducting unit tests that cover different AES-XTS encryption/decryption scenarios, exercising the processing of all blocks (using recursion) and ciphertext stealing. These tests confirmed that our specification matched the IEEE standard before we tackled the more complex assembly verification. This two-phase approach — first ensuring that the specification is correct through testing, then formally verifying that the implementation matches the specification — gave us confidence that we were proving the right thing.

The proof strategy

Our proofs are compositional, meaning they break the overall problem into subproblems that can be proved separately. Depending on the subproblem, the subproofs can be bounded — true only for a range of inputs — or unbounded. For inputs with fewer than five (or six, in the case of decrypt) blocks, we wrote bounded proofs that exhaustively verify each case. For inputs with five (six, in the case of decrypt) or more blocks, we developed loop invariants — mathematical statements that remain true throughout loop execution — to prove correctness for arbitrarily long inputs. The loop invariants track three critical factors until the loop exit condition is met: register states at each iteration, the evolution of "tweaks" (which make each block's encryption unique), and memory contents as blocks are processed. For partial-block (tail) handling, we proved a separate theorem for ciphertext stealing that could be reused across all cases. The top-level correctness theorem composes all of these subproofs into a single statement of functional correctness.

Memory safety and constant-time proofs

Most recently, s2n-bignum was equipped with new functions and tactics for formally defining the constant-time and memory safety properties of assembly functions. 
With these resources, many assembly subroutines in s2n-bignum were verified to be constant time and memory safe, including top-level scalar-multiplication functions in elliptic curves, big-integer arithmetic for RSA, and the Arm implementation of the ML-KEM cryptography standard (the subject of a forthcoming blog post on Amazon Science). All assembly subroutines identified for use in AWS-LC as of October 2025 were formally verified to be constant time and memory safe. We are exploring whether the new tactics can easily be used to verify assembly subroutines that have subsequently been added, such as AES-XTS. As we mentioned, AES-XTS has a remarkably complex control flow, which resulted in a long and involved functional-correctness proof. That complexity is also a challenge for safety proofs. The process is ongoing, but we have already proved safety properties for the ciphertext-stealing subroutines of the decryption and encryption algorithms. These first proofs focused on crucial memory access procedures that are prone to buffer overread. Proofs for the remaining parts of the decryption and encryption algorithms can use the same methodology, where the constant-time and memory-safety proofs follow the same structure as the functional-correctness proofs but are simpler, since their proof goal is more focused.

Continuous assurance of correctness

We've integrated formal verification into s2n-bignum's continuous-integration (CI) workflow. This provides assurance that no changes to our AES-XTS implementation can be committed without successfully passing a formal proof of correctness. As part of CI, the CPU instruction modeling is validated through randomized testing against real hardware, "fuzzing out" inaccuracies to ensure our specifications are correct and the proofs hold in practice. Furthermore, the proof guarantees correctness for all possible inputs, since they’re represented in the proof as symbols. 
This overcomes the typical shortcoming of coverage testing, which may cover all paths of the code but may not be able to cover all input values. For example, constant-time code, like that used here, is written without branching on secret values. Typically, then, secret values are incorporated into the operation through the use of masks derived from them. The same instructions are executed irrespective of the secret value. Hence, achieving line coverage is usually within the reach of a developer, but achieving value coverage is left to the formal verification of correctness. This same methodology has enabled AWS to deploy optimized cryptographic implementations with mathematical guarantees of correctness while achieving significant performance improvements. It also allows developers and tools to optimize the code freely without worrying about introducing bugs, since any bugs will be caught automatically by the proof. Our experience with AES-XTS shows that proof-driven development of assembly code yields a control flow that is easier to understand, review, maintain, and optimize while never ceasing to be provably correct.</content:encoded>
      <pubDate>Fri, 20 Mar 2026 16:38:44 GMT</pubDate>
      <guid>https://www.amazon.science/blog/formally-verified-aes-xts-the-first-aes-algorithm-to-join-s2n-bignum</guid>
    </item>
    <item>
      <title>Optimizing LoRA target module selection for efficient fine tuning</title>
      <link>https://www.amazon.science/blog/optimizing-lora-target-module-selection-for-efficient-fine-tuning</link>
      <description>Ablation study clarifies trade-offs between accuracy and efficiency when using low-rank adaptation (LoRA) to fine-tune AI models.</description>
<content:encoded>Fine-tuning a large language model (LLM) on a specific task requires updates to billions of parameters across trillions of tokens, with the attendant costs in GPU resources and time. Low-rank adaptation (LoRA) is a more efficient alternative that freezes the original model weights but introduces lightweight matrices into specific model sublayers, or “modules”. These matrices (commonly referred to as “adapters”) modify the modules’ weights, enabling not only efficient fine tuning but also on-demand model serving, which dramatically lowers inference costs; base-model sharing across GPUs, which cuts memory requirements; lower download overhead; and parallel inference across multiple adapters. The question is where to insert these adapters across the model. Empirically, targeting more and larger modules tends to boost performance, because it allows more flexibility in customization; but it also increases training and inference costs. Using a smaller, well-chosen subset preserves most of the gains with significantly better efficiency. Using Amazon’s Nova 2.0 Lite multimodal reasoning LLM as our base model, we set ourselves the goal of identifying a subset of standardized target-module configurations that works effectively across the vast majority of customer use cases. Through an ablation study, we identified a module known as o_proj as the single module where adding an adapter achieves the best trade-off between efficiency and accuracy (o_proj is a linear transformation that mixes representations across attention heads into a single, cohesive form for the rest of the model to understand).

The Transformer architecture

Transformer models — the models responsible for all of AI’s remarkable recent gains — consist largely of blocks that are repeated multiple times. 
Each block in turn has two main components: an attention mechanism, which determines the relevance of previously seen tokens to the token currently being processed, and a feed-forward network, a conventional neural network that does additional processing on the outputs of the attention mechanism. The attention mechanism involves three different matrices, which take their names from database design: the query matrix represents how relevant the current token is to the other tokens in the input sequence; the key matrix represents how relevant other tokens are to one another; and the value matrix represents the raw content of those other tokens. Multiplying the three matrices together creates, essentially, a recipe for the Transformer's next output. To reduce computational complexity, these multiplications take place in a space with reduced dimensions. The matrices themselves and the results of their multiplication then have to be projected back up to the original dimensions of the input. LoRA approximates weight updates using a product of two smaller matrices, drastically reducing the number of trainable parameters. The technique is typically applied to attention projection layers and feed-forward network layers. These modules are ideal candidates because they constitute the bulk of Transformer parameters, directly govern representation learning, and exhibit natural alignment with low-rank approximations. Empirical evidence shows weight changes in these layers often lie within a low-dimensional subspace during fine tuning.

Target module selection

Selecting the right target modules directly affects accuracy, latency, and computational efficiency. The optimal choice of target modules is primarily a function of (a) the base model being fine-tuned (i.e., its architecture, pre- and post-training data distributions, etc.) and (b) customization domain/modality. 
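To make the adapter mechanics concrete, here is a minimal NumPy sketch of a LoRA update applied to a single linear module, using o_proj as the example. The dimensions, rank, and scaling below are illustrative values of our own choosing, not Nova's actual configuration.

```python
import numpy as np

d, r, alpha = 64, 4, 8               # hidden size, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen o_proj weight: never updated during fine tuning
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (d -> r)
B = np.zeros((d, r))                 # trainable up-projection (r -> d), zero-initialized

def o_proj_forward(x: np.ndarray) -> np.ndarray:
    # Frozen path plus the low-rank update: the update B @ A has rank at most r
    # and is scaled by alpha / r, as in the standard LoRA formulation.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Only A and B are trained: 2 * d * r adapter parameters instead of d * d.
print(2 * d * r, "trainable parameters vs", d * d, "frozen")
```

Because B starts at zero, the adapted module initially computes exactly the frozen module's output; fine tuning then moves only the 2·d·r adapter parameters, which is what makes per-customer adapters cheap to train, store, and swap at inference time.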
When fine-tuning Nova 2.0 Lite, we balanced two competing objectives: maximizing accuracy across diverse tasks and modalities, and minimizing latency to preserve LoRA's efficiency benefits. We investigated the application of LoRA to four different modules in each Transformer block: the query, key, and value projection layers (qkv); the o_proj layer; and two different fully connected layers in the feed-forward network, gate_up_proj and gate_down_proj (referred to as fc1 and fc2). Below are the trade-offs for these modules, both singly and in combination, based on results published in the literature and on empirical studies.

Combination | Expected accuracy | Expected latency | Use case
qkv only | Good (baseline) | Lowest | Resource-constrained environments; tasks where attention mechanisms are critical (e.g., classification, lightweight generation); prioritizes speed over maximum accuracy
o_proj only | Moderate | Lowest | Ultralow-latency scenarios; tasks where refining attention outputs is sufficient (e.g., simple sentiment analysis); plays an important role in reasoning; less effective than qkv, but very efficient
qkv + o_proj | High | Low to moderate (+5–10%) | Attention-focused tasks (e.g., machine translation, summarization); balances refinement of both attention context (o_proj) and query/key/value projections (qkv); best accuracy-to-latency ratio for most NLP tasks
qkv + fc1/fc2 | Very high (close to full fine tuning) | Moderate (+10–15%) | Complex generation tasks (e.g., translation, long-form summarization); when feed-forward layers (fc1/fc2) significantly influence output quality, as they store and retrieve factual knowledge; prioritizes accuracy over speed
o_proj + fc1/fc2 | Good to high | Moderate (+5–10%) | Tasks requiring adaptation of both attention output (o_proj) and feed-forward layers (e.g., text classification, sentiment analysis); suitable when qkv adaptation is unnecessary
qkv + o_proj + fc1/fc2 | Highest (near-full fine tuning) | High (+15–20%) | Maximum accuracy for critical tasks (e.g., research benchmarks, high-stakes generation); when all components of the Transformer block need adaptation; avoid for production if latency matters
All modules (qkv, o_proj, fc1, fc2) | Maximum | Highest (+20–25%) | Prototyping/research with no latency constraints; rarely justified in practice; marginal gains over qkv + o_proj + fc1/fc2

Trade-offs of accuracy and latency across target modules, based on literature review and empirical evidence.

Experimental methodology

We conducted a comprehensive ablation study, training multiple supervised-fine-tuning (SFT) LoRA variants on seven datasets spanning both text and visual data, across reasoning (i.e., the training datasets themselves include reasoning content) and non-reasoning tasks. The datasets covered diverse challenges, from simple question answering to long-context summarization and structured JSON extraction.

Dataset: FinCOT. Modality: Txt. Reasoning traces: Yes. Domain: Finance. Tasks: Financial-reasoning dataset; samples consist of complex financial queries, along with reasoning traces obtained from GPT-4o, and predictions are typically complex tables or calculations based on the input. Training size: 7436. Eval size: 1147. Eval metric: Accuracy. Source: https://huggingface.co/datasets/TheFinAI/FinCoT
Dataset: GovReport. Modality: Txt. Reasoning traces: No. Domain: Government Doc. Tasks: Large-context (30-40K tokens) summarization. Training size: 17457. Eval size: 837. Eval metric: RougeLsum. Source: https://gov-report-data.github.io/
Dataset: MedMCQA. Modality: Txt. Reasoning traces: No. Domain: Medical. Tasks: Dataset for multiple-choice QA — also used in Nova 1.0. Training size: 20k. Eval size: 3683. Eval metric: Accuracy. Source: https://huggingface.co/datasets/openlifescienceai/medmcqa
Dataset: MedReason. Modality: Txt. Reasoning traces: Yes. Domain: Medical. Tasks: Medical-reasoning dataset that consists of questions and answers compiled from various medical benchmarks (MedQA, MedMCQA, etc.), along with synthetic, high-quality reasoning traces. (This uses the same eval set as MedMCQA.) Training size: 31682. Eval size: 3683. Eval metric: Accuracy. Source: https://huggingface.co/datasets/UCSC-VLAA/MedReason
Dataset: CoCoHD. Modality: Txt. Reasoning traces: No. Domain: Political Doc. Tasks: A complex benchmark consisting of large-context (&gt;20K tokens) transcripts of congressional hearings; the output is expected to be a summary in a specific JSON format, consisting of the members present, topic discussed, outcomes, etc. Training size: 732. Eval size: 1053. Eval metric: Averaged key and value match rate. Source: https://github.com/gtfintechlab/CoCoHD
Dataset: Llava-COT. Modality: Image. Reasoning traces: Yes. Domain: Image understanding, General/Science. Tasks: Multimodal image benchmark consisting of Q&amp;A reasoning questions; the dataset includes high-quality reasoning traces. Training size: 10k. Eval size: 270. Eval metric: Exact match rate. Source: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
Dataset: Invoice OCR. Modality: Image. Reasoning traces: No. Domain: Image understanding. Tasks: OCR benchmark that takes an input image and produces a JSON file with fields from the image. Training size: 1400. Eval size: 447. Eval metric: Accuracy.

Summary of the experiment datasets

All experiments used the Nova 2.0 Lite general-availability checkpoint with consistent hyperparameters across target modules, including learning-rate ratio and alpha values. 
Target dataset | SFT LoRA performance by target-module setting | Nova 2.0 Lite performance
Fin-COT | qkv 67.09%; o_proj 68.30%; fc1 75.35%; fc2 60.24%; o_proj + fc1 61.38%; qkv + fc2 60.31%; o_proj + fc2 62.79%; qkv + fc1 68.37%; all target modules 66.15% | 72.12%
CoCoHD | qkv 19.64%; o_proj 65.88%; fc1 41.96%; fc2 17.62%; o_proj + fc1 76.83%; qkv + fc2 66.47%; o_proj + fc2 79.14%; qkv + fc1 45.45%; all target modules 82.75% | 45.14%
GovReport | o_proj 41.25%; fc1 39.69%; o_proj + fc1 41.74%; o_proj + fc2 42.16%; qkv + fc1 41.66%; qkv + fc2 39.02%; all target modules 41.95% | 38.90%
Llava-COT | qkv 64.26%; o_proj 64.26%; fc1 65.92%; fc2 65.02%; o_proj + fc1 63.21%; qkv + fc2 62.76%; o_proj + fc2 66.37%; qkv + fc1 66.52%; all target modules 63.96% | 16.22%
Invoice OCR | o_proj 89.07%; o_proj + fc1 90.03%; qkv + fc2 87.84%; o_proj + fc2 89.47%; qkv + fc1 88.55%; all target modules 90.11% | 14.10%
MedReason | o_proj 24.55%; o_proj + fc1 20.88%; qkv + fc2 8.39%; o_proj + fc2 20.36%; qkv + fc1 4.32%; all target modules 26.72% | 1.68%
MedMCQA | qkv 62.18%; o_proj 63.10%; fc1 12.90%; fc2 59.98%; o_proj + fc1 61.39%; qkv + fc2 65.63%; o_proj + fc2 64.95%; qkv + fc1 57.21%; all target modules 66.11% | 1.68%

Ablation study for target module selection. Some benchmarks have fewer variations, to save on computation and time. MedMCQA and MedReason use the MedMCQA test set for evaluation. On this task, Nova 2.0 Lite fails mainly due to formatting inconsistencies, even though it produces the right answer. For consistency’s sake, we use the same strict parser for SFT models.

Key findings

1. O_proj is the most robust single target
The o_proj-only configuration demonstrated remarkable consistency, never failing outright on any task and typically performing within a few percentage points of the best configuration (i.e., using all target modules). On MedMCQA, CoCoHD, GovReport, LLaVA-CoT, and Invoice OCR, o_proj-only either matched or came very close to optimal performance, making it an attractive default choice that balances performance and simplicity. There is emerging evidence that this module plays a key role in reasoning, which may explain its effectiveness here.

2. Qkv-only shows instability
While qkv-only performed well on MedMCQA, it exhibited extreme variability, performing below baseline on CoCoHD and showing unremarkable results elsewhere. This aligns with the hypothesis that attention-only LoRA can underfit on tasks requiring richer features from the feed-forward network, rather than relying on modified token routing.

3. Module combinations provide modest gains
Combinations like o_proj + fc2 or "all target modules" often achieved the highest per-dataset scores (particularly on CoCoHD, MedReason, and Invoice OCR). However, improvements over the best single module were typically modest, usually 1-3 percentage points.

4. Task difficulty amplifies configuration impact
On challenging benchmarks where the base model performed poorly, the choice of target modules had greater impact. For example, on CoCoHD (long-context, complex JSON generation), o_proj + fc2 achieved a +15% absolute improvement over the base model, compared to only +3% with o_proj alone.

5. LoRA consistently outperforms base models
Across nearly all datasets, any reasonable LoRA configuration dramatically outperformed the base model. For instance, MedReason, MedMCQA, LLaVA-CoT, and Invoice OCR showed improvements from a baseline accuracy of ~1-16% to 60-90%+ with LoRA. The notable exception was Fin-COT, where only certain configurations (notably fc1) exceeded baseline performance, suggesting task-specific sensitivity to adaptation strategy.

Recommendations

For accuracy-prioritized scenarios, we recommend o_proj + fc2 as the optimal configuration for both text and multimodal tasks, showing 2-12% improvements over o_proj alone across benchmarks. For balanced efficiency and performance, o_proj-only provides an excellent default, offering robust performance with minimal latency overhead — particularly valuable when serving multiple adapters or operating under resource constraints. 
For challenging tasks, such as benchmarks with long context or complex generation requirements or other tasks where base models struggle, the additional accuracy from o_proj + fc2 justifies the modest latency increase.

Future directions

Our research opens several promising avenues for further optimization:

Modality- and task-specific configurations: Segmenting target module selection by modality and task difficulty (e.g., long-context scenarios) could yield specialized configurations with better accuracy-latency trade-offs.

Per-module hyperparameter optimization: Extensive hyperparameter optimization for each target module configuration could unlock additional performance gains, though computational costs remain a consideration.

Two-stage LoRA for early candidate identification: Leveraging two-stage LoRA approaches that use training dynamics, gradients, etc., to determine the importance of different modules/layers could help identify promising configurations early in training, reducing the cost of comprehensive hyperparameter searches.

Layer pruning for latency reduction: Using two-stage training to identify and prune unused layers could further reduce inference latency while maintaining accuracy.

Conclusion

Our comprehensive study demonstrates that thoughtful target module selection in LoRA fine tuning can improve accuracy while preserving the efficiency advantages that make LoRA attractive for production deployments. The o_proj layer emerges as a remarkably robust single target, while o_proj + fc2 combinations offer the best accuracy for challenging tasks. On average, o_proj LoRA is within 2% of o_proj + fc2 in terms of accuracy but has 22.6% lower latency (TPOT p95 decreases from 10.085ms → 7.803ms). These findings provide a principled foundation for standardizing LoRA configurations across diverse customer use cases, balancing the competing demands of model performance and computational efficiency. 
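For teams using an off-the-shelf LoRA stack, the single-module default described above is straightforward to express. Below is a minimal sketch assuming the Hugging Face peft library, which is not the tooling used in this study; the rank and alpha values are illustrative, and module names vary by model architecture.

```python
from peft import LoraConfig

# Illustrative settings only; r and lora_alpha are not the study's values.
config = LoraConfig(
    r=8,                        # LoRA rank
    lora_alpha=8,               # scaling factor (the update is scaled by lora_alpha / r)
    target_modules=["o_proj"],  # the robust single-module default identified above
    task_type="CAUSAL_LM",
)
# For an accuracy-prioritized setup, target_modules could instead name both the
# attention output projection and a feed-forward projection; the exact module
# names (e.g., "down_proj" in Llama-style models) depend on the base model.
```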
Acknowledgements: Kevin Rondinone, Kevin Chen, Nicole Ding, Sebastian Massella, Andy Li</content:encoded>
      <pubDate>Thu, 19 Mar 2026 14:39:23 GMT</pubDate>
      <guid>https://www.amazon.science/blog/optimizing-lora-target-module-selection-for-efficient-fine-tuning</guid>
    </item>
    <item>
      <title>How agentic AI helps heal the systems we can&amp;#8217;t replace</title>
      <link>https://www.amazon.science/blog/how-agentic-ai-helps-heal-the-systems-we-cant-replace</link>
      <description>By learning the idiosyncrasies of accumulated layers of legacy systems, AI agents can preserve institutional knowledge and provide a unified interface to a range of services.</description>
      <content:encoded>Many of the world’s most important systems — the ones that move money, book flights, issue licenses, and process claims — are slow, brittle, and deeply outdated. Built decades ago and extended repeatedly, they now sit at the center of workflows too vital to pause, take offline, rebuild, or replace. Inside Amazon’s Artificial General Intelligence (AGI) Lab, teams train agents not on idealized interfaces but on high-fidelity simulations of such legacy systems. Learning the real behaviors of these systems — the quirks, delays, error states, and invisible dependencies — makes possible a different kind of innovation, one that grows from the systems we have instead of requiring their replacement. And by managing the idiosyncrasies of legacy systems behind the scenes, the agent effectively becomes a universal API — a single interface that the customer can use to perform a wide range of special-purpose tasks. The legacy systems that power everyday life Step behind the scenes of any large institution — a bank, an insurer, a hospital, a state agency — and you’ll find the same thing: an invisible layer of human labor holding software together. People know which buttons must be clicked in which order, which warnings can be ignored, which fields must be entered twice, and which screens must never be refreshed. The institutional knowledge required to navigate these eccentricities is passed down like the oral traditions of legacy systems. Much of the infrastructure beneath these workflows is older than the people managing it. The software backbone of modern finance, insurance, travel, scientific research, and public services took shape in the 1960s and ’70s, built on mainframe architectures and written in languages like COBOL and FORTRAN — designed for stability rather than adaptability. When the web arrived, institutions didn’t rebuild. They wrapped. 
Web forms fed mainframe jobs, middleware translated modern inputs into decades-old formats, and enterprise portals accumulated atop business rules that were never rewritten. Over time, modernization settled into layers: a mainframe instruction set at the bottom; a 1990s database above it; a 2000s portal above that; and a modern web interface masking everything beneath. A single transaction today might pass through all these layers — scripts, connectors, and integrations holding them together in ways no one fully understands. Attempts to replace these systems routinely stall. Dependencies surface no one knew existed, migrations fail, budgets spiral, and public-sector modernization efforts collapse under their own complexity. These systems cannot be taken offline, which means institutions must keep operating them no matter how brittle they become. For Amazon, this is one of the most compelling applications of agentic AI — navigating not the polished surfaces of web-era consumer apps but the deep, slow-moving architectures that keep institutions running. Learning the bad to heal the bad The hardest part of training an AI agent is not teaching it what a successful workflow looks like; it’s teaching it why workflows fail. The logic behind legacy systems reveals itself most clearly through friction: the modal (mandatory) window that appears late because it encodes a sequencing rule; the field that refuses input until another value is saved; the form that resets because a backend job restarted midflow. These behaviors aren’t glitches. They are the real semantics of the system. Researchers at Amazon’s AGI Lab seek this friction out. To surface failure modes safely and repeatedly, Amazon trains agents inside reinforcement learning (RL) gyms — synthetic environments designed to reproduce the quirks, delays, and ordering rules embedded in real workflows. 
These include synthetic web environments that simulate systems like state agencies, airline bookings, and specialized tax- and benefits-processing, among hundreds of others. Jason Laster, an AGI software engineer who works on agent training and replay systems, puts it plainly: “I want to push our RL training gyms to have all of the warts, all of the issues.” This is what “learning the bad to heal the bad” means: training an agent on the full spectrum of a system’s true behavior, including flaws, inconsistencies, delays, and all the embedded histories humans have quietly adapted to. By exposing agents to the same brokenness people navigate every day, Amazon trains them to move beyond surface correctness and understand the deeper logic beneath the interface. Agents as a new interface layer Once an agent can reliably navigate the idiosyncrasies of legacy interfaces, something more interesting begins to happen. Researchers have observed agents inferring not just what to click next but why — the latent workflow the interface expresses. In one simulated benefits application environment, an agent that realized it had added only one dependent was able to navigate back, correct the omission, and resume the flow without restarting — an early sign of understanding the nature of the system. For lab members, this marks an architectural turning point. Many institutional systems simply don’t expose APIs that reflect how real workflows behave; the only faithful expression of the logic is the interface itself. As Laster puts it, “the UI was designed to be discoverable, learnable — even if it’s bad.” When agents learn that layer deeply enough to predict outcomes and recover from failures, they begin to function as a kind of synthetic API — a stable, programmatic surface over infrastructure that can’t be changed. 
That shift enables new architectural possibilities: Stable semantics over unstable UIs: Agents turn inconsistent behaviors — delays, re-renders, partial saves — into predictable patterns. Cross-system abstraction: Because the agent reasons about the workflow rather than the application, it can bridge systems never designed to interoperate. Incremental modernization: Institutions can update components gradually without breaking workflows; the agent absorbs transitional fragility. Preservation of institutional logic: Agents retain operational knowledge otherwise stored only in human memory — rules, sequences, dependencies no one has documented. This is not workflow automation. It is a new interface layer for the world’s oldest systems — an upgrade path that doesn’t require turning anything off. The work ahead Agentic AI will not replace the administrative tasks that structure daily life — booking vacations, renewing licenses, scheduling medical appointments — but it can help make them more efficient by allowing the evolution of systems once too fragile to change. That fragility is becoming more acute. The programmers who built the institutional backbone of the 1960s and ’70s — COBOL batch jobs, FORTRAN routines, mainframe integrations — are retiring. Few younger developers learn these languages, and the knowledge embedded in those systems grows harder to access each year. Critical workflows now run atop software whose inner workings fewer and fewer people understand. Agents offer a different form of continuity. By learning how these systems behave — not from lost documentation but from the systems themselves — they can preserve operational logic that would otherwise disappear. They can stabilize workflows sitting atop code no one can safely rewrite and carry forward institutional knowledge that would otherwise age out of the workforce. In that sense, “the work ahead” is twofold. 
There is the technical work of building agents that can meet the reliability these environments demand. And there is the human work that becomes newly possible when people are no longer trapped inside brittle interfaces — work grounded in judgment, coordination, empathy, and design rather than memorizing which field must be entered twice. Agents will not rebuild the foundations of our digital world. But they may rebuild something else: our notion that innovation comes only from replacement. By turning brittle systems into stable platforms, agents offer a new model of progress — one that grows from what already works.</content:encoded>
      <pubDate>Mon, 16 Mar 2026 13:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/how-agentic-ai-helps-heal-the-systems-we-cant-replace</guid>
    </item>
    <item>
      <title>Designing AI agents that know when to step back</title>
      <link>https://www.amazon.science/blog/designing-ai-agents-that-know-when-to-step-back</link>
      <description>As AI agents become more autonomous, the key challenge isn&amp;apos;t what they can do; it&amp;apos;s how to design the human side of the equation.</description>
      <content:encoded>Agentic AI is taking off, and for good reason. AI agents can now write code, conduct research, plan travel, handle customer service, and more. Yet amid the excitement about what AI agents can do, a key question has been neglected: how do we design the human side of the equation? That question is critical, because agentic AI isn’t just another feature to bolt onto existing products. It’s a fundamentally different kind of software that demands fresh thinking. Unlike traditional software, agentic AI can be proactive and conversational, sometimes even anthropomorphic. It doesn’t just respond to commands; it initiates actions and makes decisions autonomously. This capability is what makes agentic AI so useful, but it’s also what makes effective interactions hard to design. A central user-experience (UX) challenge is coordination: the interplay between what users do, what they experience, and what the AI is doing, both visibly and behind the scenes. Trust, control, and transparency are essential to the agentic-AI user experience, and they all depend on getting this coordination right. Here, we introduce a framework for thinking about human-AI coordination. We also offer a vocabulary for characterizing agentic experiences, including when the AI feels too absent, too intrusive, or appropriately calibrated. A framework for human-AI coordination One of the most critical decisions in AI UX design is how visible and interactive AI capabilities should be. Should users direct the agent step by step, let it act autonomously, or work somewhere in-between? And how should this change based on the task, the user’s expertise, and the current context? 
You can think of coordination along these three dimensions: Human involvement: how much effort and attention the user invests in directing or monitoring the AI; AI salience: how prominent the AI feels in the experience (for example, a conversational chatbot with a name and persona has high salience, autocomplete suggestions have lower salience, and AI-generated navigation menus and backend optimizations have little or none); AI activity: what the AI is doing, whether or not the user sees it. Coordination is about aligning these dimensions. When human involvement and AI salience are both low, coordination is light-touch. When they are high, coordination is more hands-on. The right balance is often somewhere in-between, with an awareness of what the AI is doing in the background. Three zones of coordination Rather than treating agent autonomy as a binary choice — a fully autonomous system or one with a human in the loop — it is practical to consider three “zones” of coordination. Done with me (mutually collaborative): User and AI work closely together across multiple phases — initiation, monitoring, updating, and completion. Imagine collaborating with an AI assistant on a complex document or research project, with frequent back-and-forth. AI salience and human involvement are both high. The user is fully in the loop. Done for me (heavily automated): Tasks are handled by AI with minimal user input and oversight. The user initiates the task and reviews the output; most of the work happens out of view. An example is an agent that researches competitors and delivers a summary report. The user is barely in the loop. Done under me (discreetly assisted): AI works in the background without announcing itself. The user may not even notice the assistance. Smart sorting, predictive text, and intelligently personalized content and navigation menus fall into this category. The AI quickly delivers outcomes users can assess and act on. The user is implicitly in the loop. 
These aren’t rigid categories but calibration points for designing and delivering the right level of coordination to users. The goal is to match coordination intensity to the specific user, task, and context, rather than defaulting to a single mode everywhere or assuming that an autonomous agentic system eliminates the need for thoughtful coordination. The rhythm of human-AI coordination Because both agents and users can work independently, coordination cannot be static. Workflows often move through multiple zones: high involvement during initiation, perhaps defining goals and constraints; lower involvement during execution; and then a spike at review and next steps. We visualize these shifts as “coordination curves” — a variation of user-journey mapping that shows how human involvement and AI salience rise and fall across a workflow. High-level curves reveal the overall shape of an experience. Looking beneath the surface exposes specific AI touchpoints, handoffs, and decision points, helping UX design teams collaborate on bringing adaptive agentic systems to life. As multiagent applications become more sophisticated, they enable longer, computationally intensive work such as research projects, complex analyses, and multistep workflows. These create valleys in the coordination curve: stretches where the AI operates independently and the user is minimally involved. These valleys require thoughtful design around notification, approval, monitoring, and auditing. More broadly, the UX layer must provide the transparency and controls needed to build trust, support adaptation and course correction, and ultimately deliver value. Case study: Adaptive coordination in practice We developed an approach called “responsive salience”, whereby an AI agent automatically adjusts its visibility and interaction intensity to match the context. The core insight is simple: in traditional software, most of the interface is static or deterministic. 
With agentic AI, behavior is nondeterministic, so a user’s needs for oversight can change moment to moment. A user who trusts an agent on a familiar task may prefer to be largely hands-off. In unfamiliar or high-stakes work, that same user may want more transparency, checkpoints, and tighter control. Rather than forcing users to toggle settings, responsive salience lets the system adapt automatically. In our prototype, a monitoring agent continuously evaluates signals including task complexity, perceived risk, and user comfort level. When trust appears low — for example, when the user is a beginner, or the workflow involves sensitive data — the system increases salience. It could do this by providing richer explanations, additional approval gates, and expanded transparency features. The user may then be notified of the change and, if needed, can override the agent’s choice. Once confidence recovers or the task ends, the salience settings quietly revert. Over time, the system can learn from user behavior through user feedback loops, refining how quickly salience adapts and how far it goes. The result is autonomy that stays aligned with context. Early testing with users validated the idea while revealing some clear tradeoffs. Preferences diverged sharply: some found high-salience modes exhausting (“I felt visually fatigued by the large amount of communication”), while others appreciated the guidance (“It gave me options for what I might want to ask next”). One participant expressed the desire for a middle ground: “I want some oversight on what the agent is planning before execution. … The high setting was too annoying because I had to approve everything.” These results underline that user preferences for autonomy versus control can vary substantially, even in similar tasks. Responsive salience offers a solution by dynamically adjusting whether a given task is done-with-me, done-for-me, or done-under-me. 
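A minimal sketch of how such a monitoring agent might map its signals to a coordination zone follows; the signal names, scales, and thresholds here are our assumptions for illustration, not details of the actual prototype:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    # Hypothetical monitoring signals, normalized to [0, 1].
    task_complexity: float  # 0.0 (routine) .. 1.0 (complex)
    perceived_risk: float   # 0.0 (low stakes) .. 1.0 (sensitive data)
    user_comfort: float     # 0.0 (beginner) .. 1.0 (trusts the agent)

def choose_salience(s: Signals) -> str:
    """Pick a coordination zone from the current trust gap.

    When complexity or risk outpaces user comfort, salience rises
    (more explanations, approval gates); otherwise it recedes.
    """
    trust_gap = max(s.task_complexity, s.perceived_risk) - s.user_comfort
    if trust_gap > 0.3:
        return "done-with-me"   # high salience, hands-on coordination
    if trust_gap > -0.3:
        return "done-for-me"    # initiate, then review the output
    return "done-under-me"      # quiet background assistance
```

A real system would reevaluate these signals continuously and let the user override the chosen zone, as the article describes.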
Tellingly, several participants did not notice responsive salience until we pointed it out. That suggests that when the system is well calibrated, dynamic coordination can feel seamless rather than intrusive. Coevolution with agentic AI Agentic AI represents a genuine shift in what software can do, but realizing its potential depends just as much on what humans do alongside it. The frameworks, protocols, and infrastructures for building agents are maturing fast. The UX layer needs to catch up. Coordination isn’t a one-and-done problem but a moving target. As users gain expertise, tasks change, and AI capabilities evolve, the optimal balance of user involvement and AI salience will change too. So the goal isn’t to find the perfect static design, as it might have been before generative AI, but to build systems and a shared vocabulary that evolve as we learn what works in practice. Agentic AI makes this both necessary and possible: its behavior can be unpredictable, so users and designers must adjust, yet the technology itself can also learn, adapt, and course-correct proactively. Teams that get this right won’t simply build more capable agents. They will build agents that people trust, adopt, and even enjoy collaborating with.</content:encoded>
      <pubDate>Wed, 11 Mar 2026 16:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/designing-ai-agents-that-know-when-to-step-back</guid>
    </item>
    <item>
      <title>How AI is changing the nature of mathematical research</title>
      <link>https://www.amazon.science/blog/how-ai-is-changing-the-nature-of-mathematical-research</link>
      <description>What machine learning theorists learned using AI agents to generate proofs &amp;#8212; and what comes next.</description>
      <content:encoded>Modern AI coding tools have revolutionized software engineering, with developers now using AI assistants to write a substantial fraction of their code across a range of applications. As scientists studying the theory of machine learning, we’re already seeing a similar transformation in basic scientific methodology, especially for research of a mathematical nature. More precisely, AI tools are now able to develop and write rigorous mathematical proofs only from prompts providing high-level proof sketches. These proofs are written in longstanding “languages” for detailing mathematical arguments, in the same way that code is written in formal programming languages like Python. AI seems to have become proficient in both kinds of languages and their underlying logics. We came to this realization during a three-week period last summer, when we used agentic AI tools to write a mathematical paper that normally would have taken months. The 50-page paper describes and solves an optimization problem based on concepts from graph theory and machine learning. A typical prompt we would give the AI to set up the general framework for our paper looked like this: “Imagine a directed acyclic network of linear least-squares learning agents, each of which shares a common dataset but each of which sees only a different subset of the features.” A typical prompt for a theorem statement and proof went “We believe that if the network contains a sufficiently long chain of agents whose features cover the entire dataset, some agent in the chain should rapidly converge to the globally optimal linear model. 
The proof should use the fact that error monotonically decreases in the chain, which forces long sequences of agents to be multi-accurate with respect to each other’s features.” While incantations like these might be opaque to the casual reader, they all have precise, standard mathematical interpretations that the AI was aware of, due to its training, and it proceeded to translate informal intuitions into precise definitions and statements. This translation was imperfect (as discussed below) but resulted in a great first draft that could then be corrected and smoothed. To be clear, for this specific paper, we already knew the general outline of the proofs we had in mind. What AI did was to automate and dramatically speed up the process of filling in the missing details and writing them with formal precision. But more recently, we’ve written papers that we believe are substantially different and better than what we would have produced without AI assistance — in which the AI contributed key ideas that were crucial to the final results. It’s important to note that AI tools are advancing quickly, which makes the future difficult to predict. While their use has shown potential to produce faster and better research, it has also generated serious questions for those who care about the future of science and its relationship to the broader world. AI is changing research norms and workflows. This raises concerns about how to train future generations of scientists. Specifically, how can intuition and “good taste” in scientific research be developed when AI automates many of the steps that have historically been used to train young researchers? Peer review is another challenge: AI-generated research papers, quickly churned out at scale, highlight the limitations of peer review and modern-day publishing structures and also exacerbate already emerging challenges to incentives for scientific success. 
Without claiming to have answers or solutions to these concerns, we are personally living through them and will discuss each in turn. New ways of doing research One of our major takeaways from our summer research project was that working with proof-based AI tools is akin to collaborating with a smart, broadly educated but occasionally error-prone colleague. One can verbally sketch a mathematical argument to an AI agent, as one might to a human collaborator, and the agent can turn that sketch into a formally written lemma or theorem along with its proof. Increasingly, AI agents can find proofs themselves without a sketch, especially when those proofs are “standard” in some areas of mathematics. This is more useful than it sounds: many kinds of arguments are “standard” in some field, but often one in which you, the human author, are not an expert. An advantage of AI tools is that they are conversant in an enormous number of areas of mathematics and other scientific disciplines. For example, in our case, along the way to proving one of our main results from the sketch we provided incrementally, the AI spontaneously proved a simple but useful lemma we were not aware of, which meaningfully simplified the argument we had in mind. The implications of this sort of creativity are exciting, especially for lowering the barrier to discovery: scientists without access to a diverse community of collaborators could also participate in cutting-edge research in ways that were previously impossible. Using these tools still requires caution and expertise, however. The proofs they generate are correct perhaps only three-quarters of the time. But when they’re wrong, if you can identify the errors, it is often possible to iterate to correctness and then continue along a promising path. If the errors remain uncorrected, trying to continue often takes you down a dead end. 
A 25% error rate is low enough to make the tools extremely useful to experts but high enough to sometimes devolve into “AI research slop” — polished-looking but ultimately flawed or uninteresting work — when used without care or discernment. The models, after all, still don’t know what is “interesting” or “useful.” We also noticed some recurring failure modes or “rabbit holes” that come from using the AI tools. While writing our paper, we asked the AI to generate a small, self-contained result, which it did perfectly in a matter of minutes, at which point we told it this subproject was completed. Nevertheless, over the following days, the AI would spontaneously take the initiative to suggest returning to the topic, despite being repeatedly told not to do so unless asked. This was an irritating reminder that generative AI does not have perfect recall but only an incomplete summary or embedding of the context. While working on the code for the experiments to illustrate our theoretical findings, we found that the AI could alternate between writing large amounts of rather complex working code very rapidly and getting lost for hours on something trivial, like simply printing out which iteration of a loop was being executed. Training the next generation Historically, people earn expertise in the mathematical sciences through struggle as junior researchers. PhD students spend years working through the details of technical arguments to gain hard-won intuitions about when a proof approach is promising, when they are being led astray by a problem, or what constitutes a novel and interesting research direction. But these aspects of being a researcher are exactly what AI tools are “giving away”. If doctoral students can simply ask AI for proofs — which is extremely tempting, especially when it is in service of advancing research — how do they develop the experience and skill that, for now at least, are required to use AI tools productively in the first place? 
We may need to be more intentional about teaching these foundational skills to young researchers, perhaps adopting an advanced version of teaching arithmetic in grade school without the use of calculators. The straightforward recommendation is to require junior researchers to write papers “the old-fashioned way”, even when their work could be sped up by AI. Perhaps in a separate track, students would be trained to understand and work with emerging AI tools. This is an area of increasing importance that will likely require creative solutions. While we are strong believers that AI tools will do astounding things for science, it may be important to deliberately moderate their use in order to build researchers up to the point at which they can use them wisely and tastefully, not simply as shortcuts to second-rate (or worse) research. These next-generation training challenges aren’t unique to scientists using AI. We see them across myriad fields, including engineering, customer service, law, writing, and design — really, any industry in which entry-level tasks, previously used to introduce young workers to a field, are now done using AI. To find creative solutions to this skills-training challenge, or to just better anticipate the changes at hand, it might be helpful to look at analogies across fields or over time. After high-level programming languages and compilers were widely introduced in the early 1960s, most software engineers no longer wrote machine code or assembly language, which provided direct instructions to the underlying hardware but were tedious to program. But the best programmers still understood enough about how compilers translated high-level languages into machine code to reason about correctness and performance. We hope that making it easier to construct and check technical arguments will let all researchers operate at a higher level of abstraction and “think bigger thoughts”. 
The culture we envision would emphasize taste, problem selection, and modeling skills and devalue technical wizardry for its own sake. Breaking and remaking peer review From our perspective, peer review is not only, or even primarily, a process to verify the correctness and quality of research. Rather, its purpose is to focus a scarce resource — the attention of the research community — in the right places. Science progresses as researchers build on each other’s work, but there is already too much work out there for anyone to keep up with. The publication process should help identify the most interesting and promising directions, so they can be more efficiently and thoroughly developed. How does AI influence this focusing of communal attention? AI tools make it much easier to produce work that looks polished and correct, dramatically lowering the barrier to generating “papers” that can be submitted to journals and conferences. Many of these papers are neither interesting nor actually correct — but discovering this requires significant effort from reviewers. This is straining an already overburdened machine learning publishing ecosystem struggling with tens of thousands of submissions per venue. We have seen that reducing the time and effort needed to produce “a paper” — not necessarily a good paper — is beginning to destabilize our existing institutions for peer review. The most recent iterations of AI and ML conferences have seen the number of submissions growing by large multiples, with a significant number of papers polished by AI, but ultimately of low quality, making it surprisingly far through the review process before being noticed and called out. This is a problem across research fields, partially because it’s creating a market for AI-generated papers. 
This has in turn engendered a countermarket for AI-assisted detection of AI-generated papers — much like the familiar technological arms races around things like spam and its detection, but with the integrity of scientific publication at stake, not just the filtration of annoying or fraudulent e-mails. As a short-term fix, AI-driven automated correctness checks (e.g., formal verification of mathematical proofs), tools for which are already being deployed in major conferences, could be valuable. Think of this as a form of unit testing for math instead of code. The aim is to filter out papers that have nontrivial errors, while focusing the job of the human reviewer on the important parts of science that they are best suited to evaluate: determining what we learn about the world from a new result, and how useful and interesting it is, rather than being drowned in the monotony of checking countless papers for technical correctness. Without a serious, community-wide re-evaluation of peer review, AI threatens to arrest scientific progress at the community level even as it accelerates it at the level of individual researchers. Looking ahead We think AI is bringing a sea change in scientific research methodology, training, and peer review; there is no hiding from what is coming. But there are opportunities to proactively adapt and ensure that AI-assisted research fulfills its promise. What will research look like at the end of next year? The year after that? We’ve seen more change in the past year than in the previous decade, so all we can confidently predict is “different”. Our scientific institutions — peer review, publishing, graduate education — evolved over decades to match the constraints of human cognition and effort. Those constraints are shifting rapidly, and our institutions will need to shift with them. 
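The “unit testing for math” idea can be made concrete with a proof assistant such as Lean: a verifier mechanically checks each formalized claim and either accepts the proof or flags it. A toy example of such a machine-checkable statement (our illustration, not drawn from any particular paper):

```lean
-- A trivially machine-checkable lemma: the proof assistant verifies
-- this proof automatically, with no human reviewer effort required.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A correctness filter of this kind would reject submissions whose formalized proofs fail to check, before human review begins.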
Our goal should be to steer toward a world where AI amplifies human creativity and insight, accelerates discovery, and expands who can participate in the research enterprise — while preserving the joy and rigor that make science worthwhile.</content:encoded>
      <pubDate>Mon, 09 Mar 2026 17:55:47 GMT</pubDate>
      <guid>https://www.amazon.science/blog/how-ai-is-changing-the-nature-of-mathematical-research</guid>
    </item>
    <item>
      <title>Intelligence isn&amp;#8217;t about parameter count. It&amp;#8217;s about time.</title>
      <link>https://www.amazon.science/blog/intelligence-isnt-about-parameter-count-its-about-time</link>
      <description>As AI models grow larger, they become less insightful, not more. To ensure that they continue to learn, we need to reduce their inference time.</description>
      <content:encoded>When we prompt a large language model (LLM) to solve a complex polynomial equation, it does not just return an answer but uses its “chain of thought” to work through a solution. In a sense, the LLM behaves like a computer, a machine that computes the solution. But this machine is quite unlike what Alan Turing described as a universal model of computation almost 90 years ago. In what sense can an LLM be thought of as a computer? Can it be universal, that is, able to solve any computable task, as a Turing machine does? If so, how does it learn this ability from finite data? Current theories of machine learning are of little help in answering these questions, so we need new tools. In an earlier Amazon Science post, we argued that AI agents and the LLMs that power them are transductive-inference engines, despite being trained inductively in the mold of classical machine learning theory. Induction seeks generalization, or the ability to behave on future data as one did on past data. To achieve generalization, one must avoid memorization, i.e., overfitting the training data. This works in theory, under the condition that both past and future data are drawn from the same distribution. In practice, however, such a condition cannot be verified, and in general, it doesn’t apply to high-value data in business, finance, climate science, and even language. That leaves us with no handle to explain how an LLM might learn how to verifiably solve a general computable task. With transduction, by contrast, one seeks to reason through past data to craft solutions to new problems. Transduction is not about applying past solutions in the hope that they generalize; rather, it is about being able to retrieve portions of memory that matter when reasoning through new solutions. In transduction, memorization is not a stigma but a value. 
Using the test data, along with memory, to craft a solution during transductive inference is not overfitting but adaptive, query-specific computation — i.e., reasoning. Inductive generalization is the kind of behavior one is forced to adopt when pressed for time. Such automatic, reactive behavior is sometimes referred to as “system-1” in cognitive psychology. Transduction instead requires looking at all data and performing query-specific variable-length inference-time computation — chain-of-thought reasoning in an LLM, whose length depends on the complexity of the query. Such deliberative behavior is often referred to as “system-2” and is what we wish to foster through learning. In this sense, transductive learning is a particular form of meta-learning, or learning to reason. In 1964, Ray Solomonoff described a universally optimal algorithm for solving any problem through transductive inference, if we assume that memory and time are unbounded: execute all programs through a Turing machine, then average the outcome of those that reproduce the observed data. That will give the universally optimal answer — but it will generally take forever. What if we want not just a universally optimal but a universally fast algorithm? In 1973 — in the same paper where he introduced the notion of NP completeness — Leonid Levin derived such an algorithm. Unfortunately, Levin’s so-called universal search is not viable in practice, nor does it help us understand LLMs; for one thing, it involves no learning. Nonetheless, Levin pointed to the critical importance of time when solving computational tasks. Later, in 1986, Solomonoff hinted at how learning can help reduce time. In a new paper, we expand on these ideas and show how reducing inference time induces a trained model to operate transductively — i.e., to reason. In striving to reduce inference time, the model learns not just the statistical structure of the training data but also its algorithmic structure. 
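Levin's universal search, described above, is concrete enough to sketch in a few lines. The toy below is illustrative only: the "program space" is an invented set of three integer primitives rather than a universal machine, and every candidate program halts by construction, so the per-program time cap of real universal search is noted in a comment but not enforced.

```python
from itertools import product

# Toy "program space": each program is a string of primitive ops.
PRIMS = {"i": lambda x: x + 1,   # increment
         "d": lambda x: x * 2,   # double
         "s": lambda x: x * x}   # square

def run(program, x):
    for op in program:
        x = PRIMS[op](x)
    return x

def levin_search(examples, max_phase=12):
    """Phase k tries every program of length up to k. Real universal
    search would also cap program p's running time at roughly
    2**(k - len(p)) steps, so short programs get exponentially more
    compute; here every candidate halts on its own, so the cap is
    omitted."""
    for k in range(1, max_phase + 1):
        for length in range(1, k + 1):
            for program in product(PRIMS, repeat=length):
                if all(run(program, x) == y for x, y in examples):
                    return "".join(program)
    return None

# Recover f(x) = (x + 1) ** 2 from two observations.
print(levin_search([(2, 9), (3, 16)]))  # -> "is" (increment, then square)
```

Because each phase revisits short programs before admitting longer ones, the search returns the shortest program consistent with the data, mirroring the preference for short explanations that runs through the rest of the argument.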
The model can then recombine algorithmic methods it’s learned in an infinite number of ways to address arbitrary new problems. This insight has implications for how AI models are designed and trained. In particular, they should be designed to predict the marginal value of additional costs at inference time, and their training targets should include complexity costs, to force them to minimize time during inference. This approach to learning turns classical statistical learning theory on its head. In classical statistical learning theory, the great danger is overfitting, so the goal is to regularize the solution, i.e., to minimize the information that the trained model retains from past data (beyond what matters for reducing the training loss). With transductive inference, on the other hand, the goal is to maximize the information retained, as it may come in handy for solving future problems. The inversion of scaling laws LLMs’ performance gains in the past few years have come mostly from scaling: increasing the number of model parameters has improved accuracy on benchmark datasets. This has led many to speculate that further increasing the models’ parameter counts could usher in an age of “superintelligence”, where the cognitive capacities of AI models exceed those of their human creators. In our paper, we argue the opposite: beyond a certain complexity, AI models enter what we call the savant regime, where learning becomes unnecessary, and better performance on the benchmarks comes with decreased “insight”. At the limit is the algorithm Solomonoff described in 1964, where any task can be solved by brute force. If scale does not lead to intelligence, what does? We argue that the answer is time. It’s an answer with some intuitive appeal. The concept of intelligence is fundamentally subjective and environment dependent. But while intelligence is hard to characterize, its absence is less so. 
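The earlier prescription that training targets include complexity costs can be made concrete with a toy scoring rule. Nothing below comes from the paper: the candidate traces and the penalty form (lambda_ times the number of steps) are invented, and the point is only that once each inference step carries a price, the shortest correct reasoning trace wins.

```python
# Hedged sketch: reward correctness, but charge for every inference step.
def score(trace, correct, lambda_=0.1):
    return (1.0 if correct else 0.0) - lambda_ * len(trace)

candidates = [
    (["guess"], False),                             # fast but wrong
    (["expand", "solve"], True),                    # short and right
    (["expand", "check", "redo", "solve"], True),   # right but slower
]

best_trace, _ = max(candidates, key=lambda c: score(*c))
print(best_trace)  # -> ['expand', 'solve']
```

With lambda_ set to zero the two correct traces tie, which is the savant failure mode in miniature: the scoring rule no longer cares how long the computation took.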
Being unable to adapt to the speed of the environment is one among many behaviors that we call traits of non-intelligence (TONIs). TONIs are behaviors whose presence negates intelligence, however one wishes to define it. Many TONIs are timebound. Taking the same amount of (non-minimal) time and energy to solve repeated instances of the same task, to no better outcome, is a TONI. So is the inability to allocate resources commensurate to the goal, thus spending the same effort for a trivial task as for a complex one. Starting a task that is known to take longer than the lifetime of the universe to render any usable answer would be another TONI. Given this intuition, how do we quantify the relationship between intelligence and time in AI models? The first step is to assess the amount of information contained in the models’ parameters; then we can see how it’s affected by the imposition of time constraints. Algorithmic information The standard way to measure information was proposed by Claude Shannon in a landmark 1948 paper that essentially created the field of information theory. Shannon defined the information content of a random variable as the entropy of its distribution. The more uncertainty about its value, the higher the information content. On this definition, however, a given data sample’s information content is not a property of the sample itself; it’s a property of the distribution it was drawn from. Yet for any given sample, there are infinitely many distributions from which it could have been drawn. If all you have is a sample — say, a string of ones and zeroes — how do you compute its information content? In the 1960s, Solomonoff and, independently, Andrey Kolmogorov addressed this problem with an alternative notion of information, algorithmic information, which can be used to characterize the information content of arbitrary binary strings. For a given string, one can write a program that, when run through some computer, outputs that string. 
In fact, one can write infinitely many such programs and run each through many computers. The shortest possible program that, run through a universal Turing machine, outputs the specific datum is a property of that datum. That program is the algorithmic minimal sufficient statistic, and its length is the algorithmic information (Kolmogorov-Solomonoff complexity) of that datum. In his 1948 paper, Shannon also defined a metric called mutual information, which quantifies the information that can be inferred about the value of one variable by observing a correlated variable. This concept, too, can be extended to algorithmic information theory: the algorithmic mutual information between two data strings measures how much shorter the program for generating one string will be if you have access to the other. Time is information If we don’t know the distribution from which a model’s training data was drawn, and we don’t know whether the model’s future inputs will be drawn from the same distribution, how can we quantify the model’s future performance? In our paper, we assume that most tasks can be solved by combining and transforming — in infinitely many possible ways — some ultimately finite, but a priori unknown, collection of methods. In that case, we can show that optimizing performance is a matter of maximizing the algorithmic mutual information between the model’s training data and future tasks. Finding the shortest possible algorithm for generating a particular binary string is, however, an intractable problem (for all but the shortest strings). So computing the algorithmic mutual information between a model’s training data and future tasks is also intractable. Nonetheless, in our paper, we prove that there is a fundamental relation between the speed with which a model can find a solution to a new task and the algorithmic mutual information between the solution and the training data. 
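Computing algorithmic mutual information exactly is intractable, as noted above, but a standard workaround, borrowed here purely for intuition and not used in the paper, substitutes an off-the-shelf compressor for the uncomputable shortest program. On that rough assumption, the mutual information between two strings is simply how much better they compress together than apart:

```python
import zlib

def K(data: bytes) -> int:
    # Crude upper bound on algorithmic information: compressed length.
    return len(zlib.compress(data, 9))

def mutual_info(x: bytes, y: bytes) -> int:
    # I(x : y) estimated as K(x) + K(y) - K(xy): shared structure makes
    # the concatenation compress better than the two parts separately.
    return K(x) + K(y) - K(x + y)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox naps under the shady log " * 20
c = bytes(range(256)) * 4   # unrelated filler with different structure

print(mutual_info(a, b) > mutual_info(a, c))  # -> True
```

The compressor stands in for the universal Turing machine of the formal definition; it systematically overestimates complexity, so the estimate is only an upper-bound heuristic, but it captures the idea that access to one string shortens the description of a related one.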
Specifically, we show that the time a model needs to produce the solution h to a new task decreases as I(h : D) grows, where D is the dataset the model was trained on and I(h : D) is the algorithmic mutual information between the data and the solution. This means that, during training, minimizing the time the model takes to perform an inference task will maximize the algorithmic information encoded in its weights. Reducing inference time ensures that, even as models’ parameter counts increase, they won’t descend into the savant regime, where they solve problems through brute force, without any insight or learning. The value of time You may have noticed that the relation between inference time and algorithmic information doesn’t specify any units of measure. That’s because even the value of “time” is subjective. A zebra drinking from a pond does not know a priori how long it will take to be spotted by a predator. If it lingers too long, it ends up prey; if it panics and leaves, it ends up dehydrated. Similarly, for an AI model, there is no single cost of time to train for and correspondingly no unique scale beyond which LLMs enter the savant regime. For some tasks, such as scientific discovery, the time constant is centuries, while for others, such as algorithmic trading, it’s milliseconds. We expect agents to be able to adapt to their environment, in some cases spawning smaller specialized models for specific classes of tasks, and even then, to provide users (who are part of an agent’s environment) with controls to adjust the cost of time depending on the context and domain of application. The cost of time is already (partially and implicitly) factored into the process of training LLMs. During pretraining, the cost of time is effectively set to a minimum value, as the model is scored on the output of a single forward pass through the training data. Fine-tuning the model for chain-of-thought reasoning requires annotated data, whose high cost imposes a bias toward shorter “ground truth” reasoning traces. 
Thus, LLMs already reflect the subjective cost of time to the annotators who assemble the training sets. However, to enable the user to modulate resources at inference time, depending on the cost of the environment, models should be trained to predict the marginal value of one more step of computation relative to the expected final return. Furthermore, they need to be trained to condition on a target complexity, in order to learn how to provide an answer within a customer-specified cost or bound. There are growing efforts to teach models the value of time, so they can adapt to the tasks at hand (with or without human supervision). These are certain to yield a better bang-to-buck ratio, but the theory predicts that, at some point, factoring in the cost of time will actually improve absolute performance in new tasks. For verifiable tasks, learning to reason comes from seeking the shortest chain of thought that yields a correct (verified) answer. Ultimately, imposing a cost on time should not impair reasoning performance. A new paradigm for AI coding Connecting these ideas to modern AI requires rethinking what computation means. LLMs are stochastic dynamical systems whose computational elements (context, weights, activations, chain of thought) do not resemble the “programs” in classical, minimalistic models of computation, such as universal Turing machines. Yet LLMs are models of computation — maximalist models. They’re universal, like Turing machines, but in many ways, they’re antithetical, and they operate through entirely different mechanisms. It’s possible to “program” such stochastic dynamical systems using a two-level control strategy: high-level, open-loop, global planning and low-level, closed-loop feedback control. That strategy can be realized with AI Functions, an open-source library released this week as part of Amazon’s Strands Labs, a GitHub repository for building AI agents. 
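The two-level control strategy just described can be caricatured in ordinary Python. To be clear, everything below is hypothetical and is not the Strands Labs AI Functions API: a trivial stub stands in for the model call, and the pre- and post-conditions are plain callables.

```python
def ai_function(body_nl, pre, post, generate, max_attempts=3):
    """Wrap a natural-language "body" (body_nl) with executable pre- and
    post-conditions; retry generation until the result verifies."""
    def wrapper(x):
        assert pre(x), "precondition violated"          # open-loop check
        for attempt in range(max_attempts):
            impl = generate(body_nl, attempt)           # stand-in LLM call
            result = impl(x)
            if post(x, result):                         # closed-loop check
                return result
        raise RuntimeError("no attempt satisfied the post-condition")
    return wrapper

# Stub "model": the first attempt is buggy, the second is correct.
def fake_llm(body_nl, attempt):
    return (lambda x: x) if attempt == 0 else (lambda x: sorted(x))

sort_asc = ai_function(
    body_nl="return the input list in ascending order",
    pre=lambda x: isinstance(x, list),
    post=lambda x, out: out == sorted(x),
    generate=fake_llm,
)
print(sort_asc([3, 1, 2]))  # -> [1, 2, 3]
```

The precondition is the open-loop, up-front plan check; the post-condition plus retry is the local feedback loop that catches the buggy first generation before it ever reaches the caller.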
An existing programming language can be augmented with functions from the library. These are ordinary functions, in the syntax of the language, but their bodies are written in natural language instead of code, and they’re governed by pre- and post-conditions. These enable high-level, open-loop planning and verification, before a single line of code is written by AI, and they engender an automatic local feedback loop if the AI-generated code fails to clear all conditions. Minimizing time, which translates into cost, is at the core of the design and evaluation of the resulting agents.</content:encoded>
      <pubDate>Wed, 25 Feb 2026 13:59:12 GMT</pubDate>
      <guid>https://www.amazon.science/blog/intelligence-isnt-about-parameter-count-its-about-time</guid>
    </item>
    <item>
      <title>Why a 12-year-old forecasting paper has stood the test of time</title>
      <link>https://www.amazon.science/blog/why-a-12-year-old-forecasting-paper-has-stood-the-test-of-time</link>
      <description>Amazon Scholar Aravind Srinivasan coauthored a 2014 paper about forecasting civil unrest in Latin America, which won a test-of-time award at KDD 2025.</description>
      <pubDate>Tue, 17 Feb 2026 14:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/why-a-12-year-old-forecasting-paper-has-stood-the-test-of-time</guid>
    </item>
    <item>
      <title>How academic collaboration delivers real-world security to Amazon customers</title>
      <link>https://www.amazon.science/news/how-academic-collaboration-delivers-real-world-security-to-amazon-customers</link>
      <description>An early meeting between Amazon scientists and Stanford researchers led to cvc5, an open-source tool now powering approximately one billion automated-reasoning checks across AWS every day.</description>
      <content:encoded>On July 16, 2018, Amazon distinguished scientist Byron Cook was giving a keynote at the Federated Logic Conference (FloC) at the University of Oxford, a computer logic gathering held every four years since 1996. In the keynote, Cook described how his team was using an open-source software tool called cvc (cooperating validity checker) to identify logic problems in code and fix them. Sitting in the audience was Stanford University professor Clark Barrett, who had been working on cvc for almost 20 years. Cvc had been developed to analyze verification problems encoded as satisfiability modulo theory (SMT) problems. SMT is a mainstay of formal methods — the use of automated reasoning to prove that a program or system will behave as intended. By applying SMT at scale, cvc can detect logical errors in code and in systems such as those used for authentication and access management. “I was kind of stunned. It was really exciting,” Barrett says. “And this really started with this exciting moment of realizing, Hey, our work is being used by Amazon.” The encounter between Cook and Barrett ultimately led to a years-long research collaboration that culminated in Barrett’s becoming an Amazon Scholar in 2023. Initially, Amazon provided small grants to Barrett’s lab at Stanford’s School of Engineering through the Amazon Research Awards program; those grew into larger funding commitments as the research progressed. This funding supported foundational research that — together with deep technical collaboration between the two teams — enabled the development of cvc5, the latest version of the open-source software. Cvc5 has delivered significant value for both Amazon customers and the broader industry, while simultaneously advancing academic research. As one example, cvc5 is used in Automated Reasoning checks, a new Amazon Bedrock feature that verifies natural-language content against organizational policies. 
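For a flavor of what such an SMT problem looks like, here is a toy in the access-analysis spirit: we ask the solver whether policy P ever allows a request that policy Q denies. The policies and predicates are invented for illustration; only the SMT-LIB format is standard. With cvc5 installed, the script could be checked by saving it and running cvc5 on the file.

```python
# Build a standard SMT-LIB script asking: is there a request (here just a
# port number) that policy P allows but policy Q denies? The policies are
# invented; any port from 1000 to 1999 is a witness, so a solver such as
# cvc5 would answer "sat", i.e., P is strictly more permissive than Q.
query = """
(set-logic QF_LIA)
(declare-const port Int)
; P allows any port numbered 1000 or higher.
(define-fun P-allows () Bool (>= port 1000))
; Q allows only ports numbered 2000 or higher.
(define-fun Q-allows () Bool (>= port 2000))
; Ask for a request P allows but Q denies.
(assert (and P-allows (not Q-allows)))
(check-sat)
"""
print(query)
```

Real access-policy encodings are far richer, covering principals, actions, resources, and condition keys, but the shape is the same: encode both policies, assert a difference between them, and ask the solver whether any request realizes it. An "unsat" answer is a proof that no such request exists.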
Cvc5 powers access-policy analysis tools, including Identity and Access Management (IAM) Access Analyzer, a service that helps customers securely manage access to AWS resources. More recently, Amazon has begun deploying cvc5 for specification analysis and test generation in Kiro, a new agentic development environment. Across these applications, cvc5 now processes approximately one billion solver calls every day, enhancing security, reliability, and durability for AWS customers. A meeting of minds Working with Barrett on the project is Robert Jones, a senior principal applied scientist at AWS who shared an advisor with Barrett when both were Stanford PhD students. Also involved in the project over the years were many students and postdocs keen to test their skills. More than a few have since joined Amazon to develop new implementations and applications, extending work that began when they were student researchers. “What's really fun about it is that people who have just finished their PhD, for example, often bring fresh insight to long-standing research challenges because they're thinking about them in a different way,” Jones says. “And I find that the best part of collaboration is that different people tend to build different mental models for the same problem. When those come together, you often have new insight into how to think about the problem or how to map it to a different problem you already know how to solve.” A successful coupling of academic research and commercial funding can have great impact, but as Barrett points out, there needs to be a focus on achievable goals. It’s easy to get caught up in an interesting project idea that leads to a practical dead end, Barrett says. “If you're in your ivory tower, building your tools, and you don't have access to the real problems, it's very easy to build the wrong tool. And I've actually made this mistake,” he says. 
“You build a hammer, and then you go around looking for a nail, and you can't quite find anything that fits. You get excited about a particular approach but don't think about what that approach could be good for. So I actually now much prefer the opposite, where I go find a real problem, and then I take a step back and say, ‘What approach can we actually use to solve that?’” When you change code, he says, “Eighty percent of the time it does better, and 20 percent of the time it does worse. This is actually not so great in some contexts.” Sorting the wheat from the chaff is essential to producing robust and scalable code, he adds, and large-scale testing is needed to find and fix issues that can be inadvertently introduced as the code changes. Analyzing interactions at that level requires multiple minds, and the more the merrier, Jones says. The old adage “many hands make light work” is particularly useful when mixing public research and practical applications. “I really like to work on hard problems that require multiple people to solve. I enjoy the collaboration involved in science,” he says. “I've always found that more minds working on the same problem together are better than one.” Barrett and Jones agree that what makes this work is a willingness to see from both points of view — the scholastic and the commercial. Sometimes a pure research goal can have very beneficial results, sometimes not, but melding these two approaches together to address serious issues can deliver huge benefits. And communication is key, both agree. “One of the hard things about academia is knowing which problems are the most important to work on and how those problems might impact the real-world problems that are being encountered in industry,” Jones says. “Having the ability to be much more open about the kinds of problems that we're struggling with and Clark telling us about his research agenda helps both of us. 
It enables Amazon to indicate areas of interest, and it helps Clark understand concrete problems that we encounter day to day as we try to apply these tools and techniques in practice.”</content:encoded>
      <pubDate>Wed, 04 Feb 2026 14:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/news/how-academic-collaboration-delivers-real-world-security-to-amazon-customers</guid>
    </item>
    <item>
      <title>Amazon Nova AI Challenge returns with Nova Forge access for competing teams</title>
      <link>https://www.amazon.science/nova-ai-challenge/amazon-nova-ai-challenge-returns-with-nova-forge-access-for-competing-teams</link>
      <description>For the first time in an academic competition, students can customize frontier AI models to build trusted software agents</description>
      <content:encoded>The Amazon Nova AI Challenge is back for its second year. As ten selected university teams from around the world gather in Seattle this week for bootcamp (February 2-4), they're preparing to tackle a real-world challenge in software development: building AI agents that can handle complex coding tasks while maintaining security and reliability. For the first time in an academic competition, participating teams will use Amazon Nova Forge to customize Nova models with access to tools, models, and computational resources that have historically been out of reach for university research programs. From single tasks to multi-step projects Last year's competition focused on secure AI-assisted software development, with teams working to identify and address vulnerabilities in code-generating models. Teams published research papers on their approaches, and the results demonstrated practical methods for improving security in AI coding tools. This year's challenge reflects how AI coding technology has progressed. "Generative AI for software development has rapidly moved from code generation to agents that plan, build, and test changes across entire codebases and user-facing applications," said Imre Kiss, Director, Amazon Nova Software Engineering Skills. "The focus of this year's Nova Challenge reflects that shift." The 2026 challenge centers on AI agents that can work through multi-step software development tasks, including planning changes, writing code, and validating results across complex projects. Unlike generating code from a single prompt, these systems must understand context across entire codebases and make decisions that affect product quality and system security. Teams must demonstrate progress on two measures: utility (can the agent handle increasingly complex software tasks?) and safety (does it maintain appropriate safeguards?). This dual focus addresses a practical reality: as AI agents become more capable, new security challenges emerge. 
Each team's approach will be different: some may focus on adding secure coding patterns to their training data, others on creating training environments that teach their agents to recognize security issues, and others on building smaller, faster models with strong agentic security reasoning. The red teams will develop methods to test the applications built by these AI coding agents for weaknesses, attempting to identify potential vulnerabilities and exploits. "The competition format creates an interesting dynamic," explains Rahul Gupta, Senior Applied Science Manager, Amazon Nova Responsible AI. "As red teams discover new vulnerabilities, developer teams must adapt their agents through retraining or additional safety controls. And as developer teams strengthen their systems, red teams must develop more sophisticated testing methods." Nova Forge: Access to model customization The most significant change for this year's competition is the integration of Nova Forge, Amazon's service for building customized AI models. Academic institutions have historically had limited access to the models, training data, and computational resources needed for large-scale AI research. Nova Forge changes that. "What researchers (and entrepreneurs) have traditionally faced is a set of difficult trade-offs," explains Michael Johnston, an applied science leader at Amazon overseeing the challenge. "You could fine-tune an existing closed model, but only in limited ways. You could work with open-source models, but risk losing core capabilities. Or you could build from scratch, but only if you had very substantial funding." Nova Forge offers another approach. The service gives teams access to Nova model checkpoints at different training stages, allowing them to add their own data throughout the training process. The result is a customized model — what Amazon calls a "Novella" — that combines Nova's capabilities with the team's specific approach to secure software development. 
"This changes what's possible for academic research," says Professor Ismini Lourentzou, University of Illinois Urbana-Champaign. "We're participating in the training process itself, adding our research and security methods into the model's foundation." Nova Forge provides three capabilities that competing teams will use:
Custom training environments: Teams can create simulated environments where models learn from scenarios that reflect real-world secure coding workflows.
Model compression: Teams can create smaller, faster models that maintain performance at lower cost by training them on examples from larger models.
Safety controls: Built-in tools allow teams to implement security measures and evaluate model behavior against their safety criteria.
The competing teams Ten universities were selected to compete this year from a pool of applicants spanning five countries: the United States, Portugal, the Czech Republic, South Korea, and Taiwan. The lineup includes two returning champions and eight new teams:
Model Developer teams:
PurpCorn, University of Illinois Urbana-Champaign - Year 1 champions
BlueTWIZ, NOVA School of Science and Technology, Lisbon, Portugal
AlquistCoder, Czech Technical University, Prague, Czech Republic
BruinWeb, University of California, Los Angeles
Slugs and Roses, University of California, Santa Cruz
Red teams:
PurCL, Purdue University - Year 1 champions
Jay'lBreak, Johns Hopkins University
TeamSecLab, Ohio State University
Pr1smCode, Carnegie Mellon University
Lion-x0a, Penn State University
Each team receives $250,000 in sponsorship, monthly AWS credits, and the chance to compete for prizes. The winning model developer and red teams will each receive $250,000 (split among students), with second-place teams earning $100,000. Practical research For participating students, the challenge is an opportunity to work on problems with direct application. 
"Academic research often focuses on theoretical problems," notes Xiangzhe Xu, PhD Student, Purdue University. "Here we are working with large-scale models. That changes how we approach the science." Amazon researchers working with the teams emphasize solutions that are straightforward to implement, easy to troubleshoot, and economically viable at scale. "We want innovations that engineers can actually use," says Johnston. The challenge also gives students experience with infrastructure not typically available in academic settings, including Forge and the computational resources to train and evaluate large-scale models. "For a university team, this level of access is significant," says Professor Xiangyu Zhang, Purdue University. "We're able to run experiments that would be difficult on our academic budget, and students are gaining experience with the same tools used by top-tier AI companies." What's next The first evaluation will begin after bootcamp concludes, with additional rounds scheduled through August 2026. The finals will be in September 2026, and winners will be announced in October 2026 at the Amazon Nova AI Challenge Summit, where teams will gather to present their research and celebrate the winners. All participating teams will publish research papers on their methods and findings. These publications will contribute to the field of responsible AI development, with a particular focus on secure AI systems, yielding insights that will benefit software development and other applications where AI interacts with complex systems. As AI systems become more capable of software development work, the research these teams are conducting becomes increasingly relevant. As AI coding systems take on increasingly complex and impactful tasks, the key question is how to ensure the resulting applications are secure, reliable, and trustworthy at scale. Stay tuned for updates on the teams' progress and coverage of the September 2026 finals.</content:encoded>
      <pubDate>Mon, 02 Feb 2026 19:53:06 GMT</pubDate>
      <guid>https://www.amazon.science/nova-ai-challenge/amazon-nova-ai-challenge-returns-with-nova-forge-access-for-competing-teams</guid>
    </item>
    <item>
      <title>Engaging the AI community through building, research, and shared learning</title>
      <link>https://www.amazon.science/nova-ai-challenge/engaging-the-ai-community-through-building-research-and-shared-learning</link>
      <description>Advancing AI requires more than breakthrough models. It depends on communities of builders and researchers who experiment, test assumptions, and share what they learn. That belief is guiding how Amazon engages developers and academics around Amazon Nova, Amazon&amp;#8217;s portfolio of AI offerings including the Nova models, Nova Forge and Nova Act.</description>
      <content:encoded>Advancing AI requires more than breakthrough models. It depends on communities of builders and researchers who experiment, test assumptions, and share what they learn. That belief is guiding how Amazon engages developers and academics around Amazon Nova, Amazon’s portfolio of AI offerings including the Nova models, Nova Forge and Nova Act. Today, two Nova initiatives launch in parallel, each designed for a distinct audience but connected by a common purpose: to help people innovate, build skills, and tackle challenging real-world problems with AI. One program invites developers everywhere to learn by building. The other brings together university teams to advance research on secure and trustworthy AI agents. Together, they reflect a multi-layered approach to community engagement, spanning hands-on experimentation and long-term scientific inquiry. Learning by building: The Amazon Nova AI Hackathon The Amazon Nova AI Hackathon is a six-week open innovation event for developers around the world: professionals, students, and hobbyists alike. Participants are invited to build generative AI applications using Amazon Nova foundation models and services, including Nova Act. This is an opportunity to innovate with the latest AI capabilities, tackle challenging problems, showcase your skills, and compete for cash prizes. The hackathon is intentionally broad in scope. Developers can submit projects across five categories:
Agentic AI
Multimodal understanding
UI automation
Voice AI
Freestyle experimentation
Participants are encouraged to use the tools, frameworks, and workflows they prefer, but each solution should use a Nova foundation model and/or the Nova Act service, with the goal of learning through experimentation. Submissions focus not only on technical implementation, but also on creativity and potential enterprise or community impact. 
Hackathons have long been a way to surface new ideas, strengthen developer relationships, and gather practical feedback. For Nova, this event builds on momentum from recent launches—including Nova 2 models and Nova Act—while creating space for developers to explore what these capabilities enable in practice. Over the six-week submission period, participants will share demos, code, and write-ups with the broader developer community. The emphasis is on hands-on learning, skill development, and community exchange, rather than polished products. Learn more about the Hackathon: https://amazon-nova.devpost.com/ Advancing research: The Amazon Nova AI Challenge Kicking off alongside the hackathon is the Amazon Nova AI Challenge, an eight-month academic research competition focused on trusted AI agents. Now in its second year, the challenge brings together ten university teams from five countries, including returning champions and new participants. As generative AI systems evolve from single-prompt tools to agents that plan, execute, and validate multi-step tasks, questions of reliability and security become increasingly important. The 2026 challenge is centered on this shift. Teams are tasked with building AI agents that can handle complex software development workflows while maintaining appropriate safeguards. Progress is evaluated along two dimensions: utility and safety. Red teams work in parallel to test systems for vulnerabilities, creating an environment where approaches are continuously evaluated and improved. A defining feature of this year’s challenge is access to Amazon Nova Forge, which allows teams to customize Nova models by integrating their own data and techniques throughout the training process. This level of access, historically difficult for academic programs to obtain, enables research that more closely reflects real-world AI development constraints. Beyond competition outcomes, the challenge emphasizes practical research. 
All teams publish papers documenting their methods and findings, contributing insights that extend beyond the challenge itself and inform broader discussions around responsible AI development. Engaging the community, end to end. AI progress depends on those who build, test, question, and iterate. By supporting both open developer experimentation and structured academic research, Amazon aims to engage the AI community at multiple levels. Over the coming months, projects, research findings, and lessons from both programs will be shared with the wider community. The goal is not only to showcase what Nova can do, but to highlight how people learn and innovate when given the opportunity to build and explore together. Whether you’re a developer curious about agentic AI or a student researching trusted AI systems, these initiatives offer different ways to participate in shaping the next generation of AI.</content:encoded>
      <pubDate>Mon, 02 Feb 2026 19:51:40 GMT</pubDate>
      <guid>https://www.amazon.science/nova-ai-challenge/engaging-the-ai-community-through-building-research-and-shared-learning</guid>
    </item>
    <item>
      <title>A decade of NFL Next Gen Stats innovation</title>
      <link>https://www.amazon.science/blog/a-decade-of-nfl-next-gen-stats-innovation</link>
      <description>Every NFL game generates millions of tracking data points from 22 RFID-equipped players. Seventy-five machine learning models running on AWS process that data in under a second, transforming football into a sport where every movement is measured, modeled, and instantly analyzed.</description>
      <content:encoded>Every snap in the NFL triggers a deluge of physical data. Twenty-two players accelerate, collide, and change direction in fractions of a second, while the ball traces a path through the controlled chaos. Yet for most of the sport’s history, much of that complexity went unmeasured. “Football, for 100-plus years, has been a box score game: you've got yards, you've got touchdowns, you've got tackles … ,” says Mike Band, senior manager of research and analytics with NFL’s Next Gen Stats. Those numbers could capture only a sliver of what actually unfolded on the field. Coaches pored over game recordings and made educated guesses. Fans argued from the stands and the sofa. Officials occasionally made judgment calls based on partial, often obstructed views. “Looking at box score stats, you didn’t even know which 22 players were on the field for a given play,” says Mike Lopez, senior director of NFL Football Data and Analytics. In 2015, the NFL decided to expand beyond box scores by launching Next Gen Stats (NGS). RFID chips were placed in every set of shoulder pads and inside the football, and more than 20 ultrawideband receivers were mounted around each stadium. The system began streaming the coordinates of all 22 players (10 times a second) and the ball (25 times per second). For the first time, the league was capturing comprehensive player location data, accurate to a few inches, for every moment of every play. At first, each club could access only its own tracking data. That shifted in 2018, when teams gained league-wide access, putting coaches, scouts, and analysts on common analytic footing. Also that year, the league formalized and deepened its partnership with AWS, marking the start of the gradual transformation of NGS from a tracking experiment into critical NFL infrastructure, with live broadcasts only its most visible expression. 
Today, NGS underpins decision making across the league, from how clubs evaluate players and design game plans to how the NFL studies officiating, player safety, and rule changes. Every team, and much of the league itself, now works from the same continuously expanding data backbone. But it started simply, says Band. “Our early metrics were low-hanging fruit — player separation, speed, and time to throw — easily derivable from the data we had. Modeling more-complex game metrics takes much more effort, and that’s where AWS came in.” The first complex stat the partnership delivered, in 2018, was completion probability. It was built to answer a simple question: can the difficulty of a pass be quantified? The answer came, in part, courtesy of an XGBoost machine learning (ML) model hosted on Amazon’s SageMaker platform. It blended the factors that shape a throw’s outcome, from quarterback pressure to throw depth, receiver separation, and sideline proximity. The model returned a single percentage that captured both likelihood and difficulty. “That became our entry point into machine learning,” Band says. Beyond SageMaker, the NFL’s analytics work has expanded into a broad suite of AWS tools, including Amazon Quick, which the league uses to deliver real-time, interactive visualizations and answers to fans, analysts, and broadcast partners. Lopez says the members of the league’s football data analytics group “call ourselves an AWS shop.” By 2018, with league-wide access in place and AWS’s ML pipelines running, NGS began to illuminate deeper questions across the sport. Every NFL game generates millions of raw tracking data points, yet the raw feed is only the substrate. The real data growth comes from the models that convert coordinates into usable football insight. Pressure probability, for example, estimates how likely a defender is to affect the quarterback at each moment of a pass rush and produces more than a dozen secondary metrics. 
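To make the idea concrete, here is a toy sketch of how a completion-probability metric can combine throw features into a single percentage. The features, weights, and logistic form below are invented for illustration only; the NFL's actual model is a gradient-boosted-tree (XGBoost) model whose feature weights are learned from historical tracking data.

```python
import math

def completion_probability(separation_yds, pressure_dist_yds,
                           throw_depth_yds, sideline_dist_yds):
    """Toy logistic scorer: blend pass-difficulty features into one probability.

    All weights are hypothetical stand-ins; a production model would learn
    the feature interactions from years of play-by-play tracking data.
    """
    score = (
        0.45 * separation_yds       # more receiver separation -> easier throw
        + 0.25 * pressure_dist_yds  # more room from the pass rush -> easier
        - 0.08 * throw_depth_yds    # deeper throws are harder
        + 0.05 * sideline_dist_yds  # tight sideline throws are harder
        - 1.0                       # bias term
    )
    return 1.0 / (1.0 + math.exp(-score))  # squash into (0, 1)

# A wide-open short throw vs. a contested deep sideline shot
easy = completion_probability(5.0, 4.0, 5.0, 10.0)
hard = completion_probability(0.5, 1.0, 30.0, 1.0)
```

The single output percentage is what makes the stat broadcast-friendly: one number summarizes likelihood and difficulty at once.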
Band estimates that NGS now produces between 500 and 1,000 stats — per play. Keeping the system responsive depends on AWS infrastructure to ingest the feed, run the models, return results within seconds for teams and broadcasters, and store the wider data trove for deeper analysis. Big Data Bowl The roots of that deeper analysis extend back to 2018, with the inaugural Big Data Bowl. Led by Lopez, it became the league’s first large-scale effort to open player-tracking data to external researchers, inviting them to tackle questions such as which defenders close space most effectively or how to predict post-throw player movement. Structured as a months-long hackathon, the annual competition challenges participants to train ML models on historical tracking data and test their ability to generalize to unseen plays. The emphasis is increasingly on prediction — models that can anticipate what would happen next. An early success was the 2020 development of rush yards over expectation (RYOE). The metric measures the difference between actual yards gained and expected rushing yards, or what a league-average player would be predicted to gain on the same carry, considering the location, speed, and direction of blockers and defenders. It helps contextualize how strong a given run was and, when aggregated, how well a back performed over a game or season. RYOE moved from the Big Data Bowl to national broadcasts quickly. Lopez recalls the moment he first saw it appear, during the 2021 NFC Championship Game between the Buccaneers and Packers: “Leonard Fournette had a good run, and immediately a graphic popped up with his rush yards over expectation. That was less than 10 months after we got the winning solution.” He adds: “I took a photo of my TV screen, and colleagues were sending me theirs. It was a proud moment.” That pipeline has turned the Big Data Bowl into a proving ground for both ideas and data science talent. 
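The RYOE arithmetic itself is simple once a model supplies the expected yards for each carry. In this sketch the per-carry expected values are made-up stand-ins for a model's output:

```python
# Rush yards over expectation (RYOE) = actual yards - model-expected yards.
# The "expected" numbers below are hypothetical stand-ins for what a trained
# model would predict from blocker/defender locations, speeds, and directions.
carries = [
    {"actual": 7.0, "expected": 4.2},   # beat the expectation
    {"actual": 2.0, "expected": 3.1},   # fell short of it
    {"actual": 15.0, "expected": 6.5},  # big over-expectation run
]

per_carry_ryoe = [c["actual"] - c["expected"] for c in carries]
season_ryoe = sum(per_carry_ryoe)      # aggregate over a game or season
avg_ryoe = season_ryoe / len(carries)  # per-carry average
```

Aggregating the per-carry differences is what turns a single-play graphic into a season-long measure of a back's performance.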
In its first decade, the Big Data Bowl has become a central part of the league’s analytics ecosystem. As then-New Orleans Saints coach Sean Payton quipped in 2015 about the rise of real-time data on the sidelines, “I think it means there are going to be more MIT grads coaching.” Key metrics Over the past decade, NGS has grown into a portfolio of more than 75 ML models, spanning offense, defense, special teams, and game strategy. Among those, tackle probability and defensive alerts perhaps best illustrate how raw tracking data can be converted into clearer insights for teams, broadcasters, and fans. Tackle probability estimates the likelihood of a defender completing a tackle at the moment of contact, factoring in speed, angle, distance, leverage, and pursuit. That data allows NGS to identify true tackle opportunities, quantify missed tackles, and calculate the yards a defender saves or concedes. Defensive alerts assess defensive alignment and movement before the snap to predict which players are likely to rush. The model uses acceleration patterns and presnap shifts, combines them with situational context such as down, distance, and game state, and then applies generative AI to predict likely rushers, who are highlighted with red circles for viewers. “Defensive alerts had a big impact, from a broadcast perspective,” says Dashiell Flynn, AWS’s principal sports consultant. He highlights how the model exposes deliberate misdirection: “Sometimes the prediction is wrong because the defense itself is using misdirection, trying to trick the offense into thinking a blitz is coming.” Those moments give game commentators a natural way to discuss disguised defensive pressure and the intent behind it. Together, these metrics show how NGS models can turn fast, ambiguous moments into clear visual and tactical explanations. Player safety and rule changes The same tracking foundation that fuels performance analysis also gives the league clearer visibility into player safety. 
By capturing every player’s speed, spacing, and movement, it gives the league a concrete understanding of the dynamics behind plays long considered risky. The new dynamic kickoff, introduced for the 2024 season, is a clear example. Kickoffs were producing too many dangerous, high-speed collisions. NGS helped quantify and ultimately change that. “The season before, we were showing Next Gen Stats animations of the space and relative speeds of the players, and that analysis became a critical part of the rules change,” says Lopez. The NFL Competition Committee tested alternative formations and identified a design that reduced high-speed contact without removing the competitive element. Two seasons of data show the dynamic kickoff is working: the 2025 return rate jumped to 75% (from 32% in 2024), and even with 1,157 more plays, lower-extremity injuries dropped 35% while concussion rates remain below those of the old kickoff format. The change is delivering both more action and fewer injuries. Pose tracking The infrastructure for the next major advance — optical tracking — is already embedded in every NFL venue. Rather than recording only a player’s two-dimensional location, the system uses 4K cameras to capture the full three-dimensional position of key joints such as shoulders, elbows, knees, hips, and hands. The result is pose estimation, a digital skeletal model for every player on every play. This season marks the first year the league has had what Band calls “full installation, full capture” across every game, although the data remains internal while it is validated, structured, and stored for future use. For the NGS team, pose estimation arrives at the right moment. A decade of two-dimensional tracking has deepened understanding of the game, Band says, “but this new skeletal data is going to unlock the next level. It’s an inflection point.” The scale of the data capture is worth pausing over. 
Standard location tracking collects a single x,y coordinate for each player 10 times per second. Optical tracking captures high-resolution video from 16 angles to derive x,y,z coordinates for 29 body parts per player, 60 times a second. “The explosion in the volume of data can be daunting,” says Flynn. “But once folks wrap their heads around it, the ideas start flowing very quickly.” The pipeline behind optical tracking runs in three stages: local capture, on-site processing, and cloud analysis. High-bandwidth video from 4K cameras cannot be sent to the cloud fast enough, so each stadium hosts AWS servers that process the data within about 700 milliseconds. The processed, simplified data is then sent to the cloud, where ML models run in under 100 milliseconds and return analysis to the production team. This keeps the full capture-to-analysis pipeline under a second. And because broadcasts such as Thursday Night Football operate with a roughly two-second delay, Next Gen Stats derived from this new data can be delivered effectively in real time as plays develop on screen. The promise of pose data lies in the detail it adds to football’s geometry. It also resolves ambiguities that two-dimensional data cannot, says Lopez. “On a pass play now, we can see the ball pass a player using RFID data, but we don’t know if it rolled between their legs or flew 20 yards over their head.” The ultimate goal is a hybrid system that uses RFID to identify each player’s center of mass and combines it with full skeletal data, with algorithms filling in gaps when players obscure one another from camera view. Pose tracking will also unlock a new kind of training environment. Quarterbacks could use VR headsets to face a virtual pass rush that unfolds exactly as it did on the field. “You’re seeing those linemen coming at you and learning to keep your eye level down the field for that extra half second,” says Flynn. 
This realism makes it possible to both train safely and correct habits that get young quarterbacks into trouble, while also helping them make quicker decisions in the pocket. “Josh Allen took a couple of seasons to become Josh Allen. Perhaps that could happen in half a year instead of three,” Flynn says. Each stage in the evolution of NGS has pushed the league closer to modeling the game’s underlying mechanics rather than just its outcomes. As these capabilities come together, the wider transformation becomes clearer. Ten years after expanding box scores, the NFL’s partnership with AWS has evolved from a tracking experiment into something closer to the sport’s nervous system. By combining football expertise with scalable cloud infrastructure, Next Gen Stats continues to shape how the game is played, coached, and understood. But in the end, it’s the subtle depth of football that hooks people. “It’s like quantum physics,” says Band. “You can zoom in as much as you want, and every shift in scale reveals something new. There are games within the game, happening all over the field.” It turns out that illuminating the intricate mechanics of the sport doesn't spoil the magic but only deepens the awe.</content:encoded>
      <pubDate>Mon, 02 Feb 2026 14:00:00 GMT</pubDate>
      <guid>https://www.amazon.science/blog/a-decade-of-nfl-next-gen-stats-innovation</guid>
    </item>
    <item>
      <title>Customizing multiturn AI agents with reinforcement learning</title>
      <link>https://www.amazon.science/blog/customizing-multiturn-ai-agents-with-reinforcement-learning</link>
      <description>Leveraging existing environment simulators and reward functions based on verifiable ground truth boosts task success rate, even with small models and small training datasets.</description>
      <content:encoded>In today's rapidly evolving AI landscape, organizations increasingly need AI agents that excel in specific domains and business environments. While general-purpose AI systems demonstrate impressive capabilities across broad tasks, they often fall short when deployed in specialized contexts that require deep understanding of particular workflows, tools, and organizational needs. In recent work, scientists with Amazon Web Services’ AI Labs have been investigating how to efficiently adapt general-purpose agents to specific domains without requiring extensive expertise in machine learning or prohibitive computational resources. Through systematic experimentation across two distinct use cases — personal-assistant agents and agentic retrieval-augmented generation (RAG) — we've demonstrated that reinforcement-learning-based customization can significantly boost task success rates across diverse use cases, even with relatively small amounts of training data. Experimental framework and assumptions Consider a customer service agent that needs to navigate complex internal systems, understand company-specific policies, and maintain consistent brand voice across thousands of interactions. Or imagine a coding assistant that must adapt to a particular organization's coding standards, architectural patterns, and development workflows. These scenarios demand more than off-the-shelf AI solutions: they require agents that can be systematically customized and optimized for their intended environments. Our work explores the use of reinforcement learning (RL) to customize such agents. To establish a practical foundation for our experiments, we made several simplifying assumptions. We focused primarily on asynchronous multiturn agents that can autonomously complete tasks using tools, with results verifiable against ground truth. This approach reduces our dependency on simulated users while maintaining a framework applicable to many scenarios. 
Additionally, we leveraged existing environment and tool simulators from public benchmark datasets and agents, allowing us to focus on the core RL methodology rather than building simulation infrastructure from scratch. For reward signals, we rely on verifiable feedback available directly from the environment, such as task completion rates, code execution success, or information retrieval accuracy. These constraints provide the minimal conditions needed to begin our experimentation while keeping our scenarios realistic. Experimental design For our experiments involving a personal-assistant agent, we used the AppWorld benchmark, which involves the completion of day-to-day activities through phone app interactions. For the agentic-RAG experiments, we implemented a DeepSearch Agent for intelligent information retrieval and synthesis, using two different datasets. For the reward functions, we relied on verifiable environment-based feedback for AppWorld and exact match and semantic accuracy for RAG tasks. Our RL training framework has two main components: an online simulator and an online RL trainer. The online simulator takes a batch of tasks and produces a batch of rollout trajectories — sequences of interactions between the agent and its environment, often involving dozens of API calls. It also produces a reward for each trajectory by running checks against ground truth. The online RL trainer takes the rollout trajectories and the reward from the online simulator to update the actor policy. Internally, the online RL trainer has components such as actor, critic (for proximal policy optimization, or PPO, which approximates the optimal weight that any one training example should be given during policy updates), and reference model. After the actor policy is updated in the online RL trainer, the weights of the actor model are synced to the agent in the online simulator. 
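The simulator-trainer loop can be sketched with a deliberately tiny stand-in: a toy environment whose "verifiable check" grants a sparse reward only at the last turn, and a plain policy-gradient (REINFORCE) update in place of the full PPO trainer with critic and reference model. Nothing here is the actual AppWorld setup; it only shows the rollout-reward-update-sync cycle.

```python
import math
import random

random.seed(0)

def rollout(theta, horizon=3):
    """Toy 'online simulator': a multiturn episode where action 1 is always
    correct. The reward is granted only at the last turn, by a verifiable
    check on the whole trajectory, mirroring the sparse reward collection
    described above."""
    p1 = 1.0 / (1.0 + math.exp(-theta))  # prob. of choosing action 1
    actions = [1 if random.random() < p1 else 0 for _ in range(horizon)]
    reward = 1.0 if all(a == 1 for a in actions) else 0.0
    return actions, reward

def train(steps=400, lr=0.5):
    """Toy 'online RL trainer': REINFORCE updates on rollout trajectories
    (a stand-in for the PPO actor/critic machinery)."""
    theta = 0.0
    for _ in range(steps):
        actions, reward = rollout(theta)
        p1 = 1.0 / (1.0 + math.exp(-theta))
        # d log pi(a) / d theta, summed over the trajectory's turns
        grad = sum(a - p1 for a in actions)
        theta += lr * reward * grad  # updated weights then sync to simulator
    return theta

theta = train()
success_rate = sum(rollout(theta)[1] for _ in range(200)) / 200
```

Even in this toy, the terminal-only reward is enough for the policy to learn the correct action sequence, which is the property the framework relies on.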
RL-based-training pipeline Let’s take a closer look at the RL pipeline, using the AppWorld experiments as an example. First, the simulator does a parallel simulation of interactions between agents and the AppWorld environment based on the provided task IDs and produces a batch of rollout trajectories. We’ll consider one such trajectory, which demonstrates how an agent systematically decomposes a high-level instruction — "add date prefixes to files and move non-current year files to recycle bin" — into a sequence of 32 discrete API calls across multiple applications and reasoning steps. The agent begins by authenticating with the file system using supervisor-provided credentials, then methodically explores available APIs through introspection calls. Each step involves explicit reasoning about the next action, error handling when APIs don't conform to expectations (as when the agent finds no "rename_file" function and adapts, using "move_file" instead), and maintaining state across multiple file operations. The trajectory showcases the agent's ability to handle complex parsing of dates and times, iterate through file collections, and coordinate operations across different directory structures while maintaining data integrity. Critically, the environment provides verifiable information about whether the task execution is successful, enabling the reinforcement learning framework to learn through concrete, measurable outcomes, rather than requiring human evaluation at every step. Moreover, rewards are collected only at the last turn, and this sparse reward collection provides a significant performance advantage over similar methods. Results and insights The consolidated table below shows that reinforcement learning can significantly boost agent performance across diverse use cases, even when relatively small training datasets are applied to relatively small models. 
Use case | Dataset | Base model | Base model performance | RL-trained performance | Metric
Personal-assistant agent | AppWorld | Qwen2.5-32B-Instruct | 39.20% | 72% (vs. Sonnet 3.7/4.0 ~69%) | Task goal completion
Agentic RAG | NQ | Qwen2.5-3b-Base | 0.106 | 0.406 | Exact match
Agentic RAG | Musique | Llama-3.2-3B-inst | 0.04 | 0.1 | Exact match

Here are a few of our experimental findings:
- Larger base models demonstrate greater gains from RL training in absolute performance. This likely stems from their generating higher-quality rollouts during training, creating a positive feedback loop that enhances the RL process. Applying online RL customization to increasingly capable base models may unlock performance exceeding the benchmarks established by current proprietary models, which are often several times as large or complex as the base models.
- Achieving near-proprietary-model performance with small-scale RL training (72 examples in AppWorld) at 1% to 2% the cost demonstrates a fundamental shift in the economics of model customization.
- In some cases, online RL shows immediate effectiveness from the first training step, with rapid progression to competitive performance within 30 steps.
- RL training also induces specific behavioral improvements that may be useful, such as always checking API documentation before writing code, which leads to reduced code errors. Models also maintain robust semantic understanding across prompt variations even when exact-match scores decline, indicating genuine comprehension rather than pattern matching.
- In our experiments, smaller models face fundamental reasoning limitations (inability to recognize unanswerable questions or extract answers from relevant context) that RL alone cannot overcome. For constrained models, targeted distillation from more capable models may be more effective than scaling RL training.

Based on these findings, we recommend investing in online RL as a method for agent customization across assistant agents and other use cases such as coding agents. 
However, several critical factors emerged that warrant careful attention in deployment: data quality and format correctness proved essential at every stage of the pipeline; larger base models demonstrated disproportionate benefits from RL training; and strategic task selection — prioritizing harder problems during training — enabled more efficient learning through asymmetric transfer to simpler tasks. Looking ahead, our research roadmap focuses on two primary directions. The first is expanding the applicability of our approach through synthetic-data generation and adaptive data filtering to improve training efficiency. The second is deepening our understanding of RL algorithms through more thorough comparisons across model families, reward signal exploration beyond outcome-based metrics, and pipeline optimizations. These investigations aim to make RL-based agent customization more accessible, efficient, and effective for organizations seeking to deploy AI agents that truly excel in their specific operational contexts. Our latest research papers — “SALT: Step-level advantage assignment for long-horizon agents via trajectory graph” and “Reinforcement learning for self-improving agent with skill library” — demonstrate further advances in agent RL algorithms, via fine-grained advantage assignment and reward shaping for agent skill learning, and point to the significant potential of this area. Acknowledgments: Lin Lee Cheong</content:encoded>
      <pubDate>Tue, 13 Jan 2026 21:50:01 GMT</pubDate>
      <guid>https://www.amazon.science/blog/customizing-multiturn-ai-agents-with-reinforcement-learning</guid>
    </item>
    <item>
      <title>Fine-tuning vision-language models on memory-constrained devices</title>
      <link>https://www.amazon.science/blog/fine-tuning-vision-language-models-on-memory-constrained-devices</link>
      <description>A new hybrid optimization approach allows edge devices to fine-tune vision-language models using only forward passes, achieving up to 7% higher accuracy than existing techniques.</description>
      <content:encoded>Fine-tuned vision-language models (VLMs) have shown remarkable performance across many computer vision tasks. However, backpropagation — the standard method for adjusting model weights during fine tuning, which works backward from output error — is computationally expensive and thus impractical on resource-constrained edge devices. An alternative is fine-tuning strategies that rely solely on forward passes, significantly lowering the computational requirements. Zeroth-order (ZO) estimation is one such method, but existing ZO-based VLM fine-tuning methods remain substantially inferior to backpropagation-based training in terms of accuracy and convergence. One major challenge is ZO’s high variance, which can make estimated gradients — the directions of weight adjustment resulting from a batch of training data — inconsistent and noisy. This can lead to unstable training dynamics and make it difficult for the model to converge to an optimal solution. Additionally, ZO estimation tends to have local search dynamics, meaning that it may get stuck in locally optimal but globally suboptimal regions of the loss landscape. In a paper we presented at this year’s Conference on Neural Information Processing Systems (NeurIPS 2025), we propose SharpZO, a hybrid sharpness-aware zeroth-order optimization approach for fine-tuning VLMs using only forward passes. SharpZO has a two-stage optimization process: (1) a global exploration stage that uses evolutionary strategies to smooth the loss landscape, constructing a strong initialization, and (2) a local-search stage that uses ZO to suppress outlier gradient estimates. In experiments, SharpZO improved on the accuracy of forward-only methods such as ZIP and BlackVIP by an average of up to 7%, and on several tasks, its performance approached that of CoOP, a first-order method requiring backpropagation of gradients. 
The loss landscape Given a model and a set of training data, every possible setting of the model’s parameters (weights and biases) can be mapped to the corresponding loss, or error, on the training data, each setting yielding a single point in a very high-dimensional space. The graph of parameter settings against loss can be envisioned as a landscape with peaks (high-loss regions) and valleys (low-loss regions). The goal of training is to steer the parameter settings toward the bottom of the lowest valley in the landscape. Computing the complete landscape is intractable, but given a particular location (set of parameter settings), it’s possible to calculate the local direction of the slope — the gradient — and nudge the loss downhill. That’s how backpropagation works. ZO is a method for estimating, rather than calculating, the local gradient, by sampling the loss at various nearby points in the landscape. But the high variance of ZO’s estimates makes the landscape look more jagged — or sharper — than it really is, with more and higher peaks. This increases the chances that the optimization algorithm will get stuck in a local minimum, a local valley where the loss is actually significantly greater than at the global minimum. Our approach is to use an evolutionary algorithm — specifically, a sharpness-aware covariance-matrix adaptation evolution strategy (CMA-ES) — to smooth out the sharpness of the loss landscape. Then we use a slightly modified ZO algorithm to find the global minimum. SharpZO CMA-ES estimates not just the local gradient but the distribution of the loss over the whole set of possible parameter values. It also estimates the distribution’s covariance matrix, a matrix that describes the correlations between parameter values. Both the mean of the distribution and the values of the covariance matrix are updated after every round of training. 
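A generic two-point zeroth-order estimator illustrates the idea of probing the loss at nearby points instead of backpropagating. The quadratic loss below is a stand-in landscape, and this sketch is a textbook ZO estimator, not the SharpZO algorithm itself:

```python
import random

random.seed(1)

def loss(params):
    # Stand-in loss landscape: a simple quadratic with its minimum at (3, -2).
    return (params[0] - 3.0) ** 2 + (params[1] + 2.0) ** 2

def zo_gradient(f, params, eps=1e-3, samples=32):
    """Two-point zeroth-order estimate: evaluate the loss at symmetric random
    perturbations and average. More samples reduce the variance that makes
    the landscape look 'sharper' than it really is."""
    dim = len(params)
    grad = [0.0] * dim
    for _ in range(samples):
        u = [random.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [p + eps * ui for p, ui in zip(params, u)]
        minus = [p - eps * ui for p, ui in zip(params, u)]
        scale = (f(plus) - f(minus)) / (2.0 * eps)  # directional slope along u
        for i in range(dim):
            grad[i] += scale * u[i] / samples
    return grad

# Forward-only 'training': gradient descent using only the ZO estimate.
params = [0.0, 0.0]
for _ in range(200):
    g = zo_gradient(loss, params)
    params = [p - 0.05 * gi for p, gi in zip(params, g)]
```

On this smooth convex stand-in the estimator converges near the true minimum; on a real, jagged VLM loss landscape the same noisy estimates can strand the search in a local valley, which is the problem SharpZO's CMA-ES warmup stage addresses.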
We modify the ordinary CMA-ES approach by including an extra term in the loss function, which accounts for the worst possible loss that the model could incur, given the current estimate of the distribution and covariance matrix. Minimizing this term helps smooth out the estimated loss landscape. After applying CMA-ES, we use a modified sparse ZO algorithm to do more refined local searches. Traditional sparse ZO reduces the dimensionality of the gradient estimate by tossing out low-magnitude terms. We modify this procedure by normalizing the gradient vector according to its mean and standard deviation, which again helps smooth out the loss landscape. Evaluation We evaluated SharpZO on 11 diverse downstream tasks using CLIP models with various backbones. In addition to the average accuracy improvement of 7% over forward-only methods such as ZIP and BlackVIP, and the performance competitive with CoOP, our method achieves significantly faster convergence. For example, on the ImageNet dataset, SharpZO reached target accuracy in 15.3 minutes, compared to 19 minutes for ZIP and 170 minutes for BlackVIP. SharpZO not only reduces the memory footprint by avoiding gradient storage but also ensures that this efficiency does not come at the cost of accuracy. We also found that our method is robust to distribution shifts, performing better than baselines on out-of-distribution tasks, such as recognizing sketches (ImageNet-Sketch) or adversarial examples of images (ImageNet-A). Currently, SharpZO is optimized for prompt tuning, where the number of trainable parameters is relatively small, and scaling to full-model fine tuning remains a future challenge. Furthermore, the sharpness-aware CMA-ES warmup stage requires coordinate-wise gradient estimation (CGE), which may be computationally expensive for high-dimensional settings. This makes SharpZO a suitable candidate for parameter-efficient fine tuning (PEFT). 
Acknowledgements: This work was done as part of the Amazon-UCSB collaboration. We want to thank Zheng Zhang, Jimmy Kunzmann, and Denis Filimonov for their inputs and valuable discussions.</content:encoded>
      <pubDate>Thu, 08 Jan 2026 16:41:08 GMT</pubDate>
      <guid>https://www.amazon.science/blog/fine-tuning-vision-language-models-on-memory-constrained-devices</guid>
    </item>
    <item>
      <title>The unseen work of building reliable AI agents</title>
      <link>https://www.amazon.science/blog/the-unseen-work-of-building-reliable-ai-agents</link>
      <description>&amp;quot;Reinforcement learning gyms&amp;quot; train agents on the many low-level tasks that they must chain together to execute customer requests.</description>
      <content:encoded>Ask an AI developer what an agent might do for you, and the answer often sounds like a travel brochure: book your flights, find you a hotel, plan your summer vacation. It's a charming image — an invisible concierge effortlessly stitching together an itinerary while you sip a coffee. But inside Amazon, researchers know that a million small things must work before big things can happen. One example: before an AI can plan a vacation, it must learn to scroll. Literally. It must learn how to scroll … and click … and tab … and select a date that's hidden behind a pop-up … and recover when a form silently resets … and distinguish a calendar widget from a drop-down … and re-enter a field exactly once without overwriting another … and navigate a loyalty portal that hasn't been redesigned since 2004. A single "book my summer vacation" command sets off hundreds of micro-interactions across travel services: airline reservation systems still running decades-old interfaces; hotel inventory tools with inconsistent use patterns; credit card verification layers; loyalty programs; payment rails; mobile confirmations; and compliance checks buried behind browser-based forms. Every tiny action has to succeed — reliably, deterministically, every time — before the magical consumer moment is possible. This is the gap between the narrative of AI agents and the reality of building one. At Amazon, the mundane details aren't an afterthought; they're the foundation. To work successfully in the real world, an agent must first master a set of atomic behaviors. Internally, we sometimes describe this as building "normcore agents": systems trained to be exceptionally good at the very simple, very boring interactions that underpin the reliable operation of real software. 
Mastering those atomic behaviors requires a lot of practice, which is why Amazon's Artificial General Intelligence (AGI) Lab is building an ecosystem of high-fidelity reinforcement learning (RL) "gyms" where agents can hone their skills. Just as an athlete builds core stability by repeating fundamental movements under controlled conditions, an agent develops reliability by practicing the smallest units of interaction in repeatable, instrumented scenarios. Designed to reflect the messiness of real web systems, a gym isolates a skill, varies it, stresses it, and measures it. The end result is an agentic substrate — a shared foundation of competence from which a fleet of agents can build domain-specific efficiencies in real-world applications: form completions that make an address usable for a delivery or reservation; drop-down selections that indicate whether a fare, benefit, or option applies; and multistep workflows that guarantee that a transaction reaches a valid, verifiable end state. Today, the Amazon AGI Lab has built and trained agents in gyms spanning dozens of application domains and thousands of individual tasks, with more in development. These gyms don't just teach an agent how to book a vacation; they teach it how to survive the unpredictable terrain beneath the task. How to reason about web interfaces. How to detect and recover from errors. How to interact with legacy systems that humans tolerate but machines often misinterpret. To build an agent that can do anything humans do on a computer, our team has to teach it to handle the ambiguity humans navigate instinctively. Reliability If an agent's path to booking a summer vacation runs through hundreds of tiny, failure-prone steps, the autonomous cars that get us to the airport face an environment that's even less forgiving. So it's no accident that some of the engineers and researchers inside Amazon's AGI Lab come from the world of self-driving cars. 
They spent years in environments where "almost right" is indistinguishable from "unsafe," where a system that performs flawlessly one moment and fails silently the next is unfit for deployment. In autonomous vehicles, correctness isn't probabilistic; the system must be right every single time. That mindset now shapes how our lab approaches agentic AI. Agents don't just produce outputs; they take actions inside live systems. They touch databases, initiate transactions, and modify system states. And when the output of a model is a real change in the world, reliability becomes non-negotiable. To meet that standard, an agent must do something language models cannot: determine whether the system responded correctly to its action. That doesn't mean the agent inherently knows correctness; it means the training environment exposes enough ground truth — document object model (DOM) structure, UI timing, network behavior, backend state transitions — for the agent to compare what it attempted with what actually happened and escalate or defer to a human when the outcome is ambiguous or requires approval. This is where formal verifiers come in. Each task inside a gym is anchored by a specification that defines exactly what successful completion looks like. It describes the required end state, the backend changes that are allowed to produce it, and the changes that must never occur. A workflow like "send an e-mail," for example, isn't declared successful just because a button appears to have been clicked; it's declared successful because exactly one new e-mail record exists in the database, and no unrelated records have been created, modified, or deleted. In our RL gyms, these verifiers are the basis of a scoring function. The agent receives a reward only when the environment reflects the precise changes permitted and none of the forbidden ones, providing a signal about what "right" means. 
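A verifier of the kind described, for the "send an e-mail" example, can be sketched as a before/after comparison of backend state (the record schema here is hypothetical, not a real gym's API):

```python
def verify_send_email(before, after):
    """Return True only if exactly one new 'email' record exists and no
    pre-existing record was modified or deleted.

    `before` and `after` are backend snapshots: dicts mapping record id
    to record contents. (Illustrative schema, not a real gym's API.)
    """
    # No pre-existing record may disappear or change.
    for rid, rec in before.items():
        if rid not in after or after[rid] != rec:
            return False
    new = [rec for rid, rec in after.items() if rid not in before]
    # Exactly one new record, and it must be an e-mail.
    return len(new) == 1 and new[0].get("type") == "email"

before = {1: {"type": "email", "to": "a@example.com"}}
after = {**before, 2: {"type": "email", "to": "b@example.com"}}
print(verify_send_email(before, after))  # True: one new e-mail, nothing else touched
```

A reward computed from a check like this is what converts "the button appeared to be clicked" into "the backend actually did the right thing."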
Agents must satisfy these verifiers not once but thousands of times, under shifting timing, network, and UI conditions. This repeated exposure — within precisely engineered RL gyms that isolate skills, vary conditions, and enforce verifiable outcomes — converts isolated successes into durable competence. Only when an agent meets that standard of near-perfect reliability can it be trusted to run real workflows. And only then can it operate safely in production, where every action has consequences. Normcore workouts Look closely at any real-world workflow and you'll find a scattering of tiny tasks that have to be executed perfectly. These are the normcore workouts inside our RL gyms: concentrated practice routines where agents learn the small things that make the big things happen. Here are a few examples: Workout 1: The calendar stability test Building robustness against inconsistent UI components In calendar applications, even selecting a date requires surprising coordination. Across the web, calendars behave in subtly different ways: elements shift under zoom, and widgets hide behind other UI layers or re-render mid-click. In RL gyms, these variations appear intentionally, teaching the agent to recognize a widget's current state, recover when it drifts, and commit the correct date exactly once — then verify that the resulting backend state is correct. This foundational skill applies to workflows everywhere, from travel bookings to scheduling tools to compliance applications. Workout 2: The dropdown discipline drill Learning to distinguish UI appearance from system state A dropdown menu might appear to have been updated before the backend has actually processed the change. This mismatch appears in enterprise applications, consumer portals, and government systems alike. Agents must confirm that the system — not just the UI — has registered the action. The drill builds discipline: trust the system state, not the surface. 
Workout 3: The async endurance run Maintaining coherence across long, timing-sensitive flows Many workflows involve long chains of asynchronous steps — searching, filtering, validating, refreshing — each with different timing and failure modes. RL gyms break these flows into atomic segments: text fields that compete with autosuggest lists, modal windows that load out of order, backends that intermittently return errors, and pages that scaffold before they populate. The agent learns endurance — staying aligned with the true state of the system across dozens or hundreds of steps. Acknowledgments: We thank Deniz Birlikci, Gary Lim, and Annika Huston for their contributions.</content:encoded>
      <pubDate>Wed, 07 Jan 2026 17:04:36 GMT</pubDate>
      <guid>https://www.amazon.science/blog/the-unseen-work-of-building-reliable-ai-agents</guid>
    </item>
    <item>
      <title>The 10 most viewed publications of 2025</title>
      <link>https://www.amazon.science/blog/the-10-most-viewed-publications-of-2025</link>
      <description>From foundation model safety frameworks and formal verification at cloud scale to advanced robotics and multimodal AI reasoning, these are the most viewed publications from Amazon scientists and collaborators in 2025.</description>
      <pubDate>Mon, 29 Dec 2025 16:28:15 GMT</pubDate>
      <guid>https://www.amazon.science/blog/the-10-most-viewed-publications-of-2025</guid>
    </item>
    <item>
      <title>The 10 most viewed blog posts of 2025</title>
      <link>https://www.amazon.science/blog/the-10-most-viewed-blog-posts-of-2025</link>
      <description>From quantum computing breakthroughs and foundation models for robotics to the evolution of Amazon Aurora and advances in agentic AI, these are the posts that captured readers&amp;apos; attention in 2025.</description>
      <pubDate>Mon, 29 Dec 2025 16:27:55 GMT</pubDate>
      <guid>https://www.amazon.science/blog/the-10-most-viewed-blog-posts-of-2025</guid>
    </item>
    <item>
      <title>Dialogue Boost: How Amazon is using AI to enhance TV and movie dialogue</title>
      <link>https://www.amazon.science/blog/dialogue-boost-how-amazon-is-using-ai-to-enhance-tv-and-movie-dialogue</link>
      <description>New audio-processing technology is making entertainment more accessible for millions of viewers.</description>
      <content:encoded>At Amazon, we’re excited to introduce the new AI-powered Dialogue Boost technology available on select Echo smart speakers and Fire TV devices. Dialogue Boost enhances the clarity of movie and TV dialogue while adaptively suppressing background music and sound effects. Thanks to machine learning and advanced audio separation techniques, Dialogue Boost helps people hear conversations in their favorite TV shows, movies, and podcasts without having to blast the volume. Dialogue Boost can improve the viewing experience for all our customers, but it’s especially useful for the nearly 20% of the global population with hearing loss. Originally launched on Prime Video in 2022, the new Dialogue Boost leverages breakthroughs in deep-neural-network compression to run directly on-device, making it available to all media, including Netflix, YouTube, and Disney+. Clearer dialogue for movie nights For people with hearing loss, increasing the overall volume of a movie or TV show doesn’t make dialogue clearer, since music and other background sounds are also amplified. Most people solve this problem by using closed captions, but that isn’t the preferred viewing style for every customer. The problem of hard-to-hear dialogue in movies has been getting worse over the last decade. This is due in part to the increased complexity and variety of modern theater and home sound systems, which means there isn’t a single mix that works well on all playback configurations. For example, Hollywood sound editors may target a theater system with dozens of channels, including separate dialogue channels coming from the front of the theater and sound effects emanating from the sides. In the TV version, however, sound effects, music, and dialogue are all “down-mixed” to the same channel, making it even harder to understand what’s being said. 
Sound source separation We realized that, to improve our customers’ experience, we needed a way to suppress the music and sound effects while boosting the dialogue. We achieve this using a sound source separation system that processes audio in several stages. The first stage is analysis, where the incoming audio stream is transformed into a time-frequency representation, which maps energy in different frequency bands against time. The next stage involves a neural network trained on thousands of hours of speech spanning various languages, accents, and recording conditions, mixed with diverse sound effects and background noises. This model analyzes the time-frequency representation in real time to distinguish speech from other sounds. Two key innovations allowed the team to bring Dialogue Boost to Fire TV Sticks and Echo smart speakers: a more efficient separation architecture that processes audio in frequency sub-bands and a training methodology that relies on pseudo-labeling, where a model is fine-tuned on data that it has labeled itself. Sub-band processing Many existing networks process all frequency content together through temporal sequence modeling, which is similar to token sequence modeling in LLMs — a computationally intensive approach. Dividing the audio spectrum into frequency sub-bands enables inference to be parallelized, and each sub-band needs to be processed only along the time axis, a much simpler computational task. We also implemented a lightweight bridging module to merge sub-bands, improving cross-band consistency. This architecture enables our model to match or surpass previous state-of-the-art performance, competing with much larger models while using less than 1% as many operations and requiring about 2% as many model parameters. Pseudo-labeling In most prior work, training relied heavily on synthetic mixtures of speech, background sound, and effects. 
But this synthetic data didn't cover all real-world conditions, such as live broadcasts and music events. Inspired by recent work on training multimodal LLMs, where state-of-the-art models benefit from pseudo-labeling pipelines, we created a system that generates training targets for real media content, better handling these rare scenarios. First, we train a large, powerful model on synthetic data and use it to extract speech signals from real data. Then we combine the pseudo-labeled real data with synthetic data and retrain the model. This process continues until further training epochs no longer improve the model’s accuracy. At this point, in a process known as knowledge distillation, we use the fully trained large model to generate training targets for a model that’s small and efficient enough to process audio signals in real time. The final stage is intelligent mixing, which goes beyond simple volume adjustment. The system combines multiple techniques to enhance dialogue while preserving the artistic intent of the original mix: it identifies speech-dominant audio channels, applies source separation to isolate dialogue, emphasizes frequency bands critical for speech intelligibility, and remixes these elements with the original audio. Viewers can adjust dialogue prominence while the system maintains overall sound quality and artistic balance. When Amazon Prime Video first introduced Dialogue Boost, it relied on cloud-based processing to pre-enhance audio tracks. Knowledge distillation helped us compress the original AI models to less than 1% of their size. Our models are now able to run in real time, within device constraints, while maintaining nearly identical performance to cloud-based techniques. 
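The stages described above can be sketched in miniature: a complex spectrogram is split into frequency sub-bands, a stand-in mask isolates speech per band, and the isolated speech is remixed with the original at a higher gain (a toy, not the production model):

```python
import numpy as np

def boost_dialogue(spec, band_mask_fn, n_bands=4, gain=2.0):
    """Toy sketch of the pipeline described above (not the production model).

    `spec`: complex time-frequency representation, shape (freq_bins, frames).
    `band_mask_fn`: stands in for the separation network; given one
    sub-band's magnitudes, it returns a speech mask in [0, 1].
    Sub-bands are processed independently (the parallelizable part), and
    the isolated speech is remixed with the original at a higher gain,
    rather than the overall volume simply being turned up.
    """
    bands = np.array_split(spec, n_bands, axis=0)    # split along frequency
    speech = np.concatenate(
        [band * band_mask_fn(np.abs(band)) for band in bands], axis=0)
    return spec + (gain - 1.0) * speech              # boost dialogue only

# With an all-ones mask, everything is treated as speech and gains 2x.
spec = np.ones((8, 4)) + 0j
out = boost_dialogue(spec, lambda mag: np.ones_like(mag))
print(np.allclose(out, 2.0 * spec))  # True
```

Because non-speech regions get a mask near zero, they pass through unchanged, which is how the remix preserves the original balance of music and effects.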
The listening experience Our research shows that in discriminative listening tests, over 86% of participants preferred the clarity of Dialogue-Boost-enhanced audio to that of unprocessed audio, particularly during scenes with complex soundscapes, such as action sequences. For users with hearing loss, our research shows 100% feature approval, with users reporting significantly reduced listening effort during movie watching. Customers have reported that Dialogue Boost also helps them understand whispered conversations, content with varied accents or dialects, and dialogue during action-heavy scenes, and it lets them enjoy movies without subtitle distraction. Additionally, for late-night viewers, or people who watch TV while others are sleeping, the technology has proven particularly valuable. Rather than constantly adjusting volume or relying on subtitles, viewers can maintain a comfortable listening level while ensuring that dialogue remains clear and understandable. Acknowledgements Dialogue Boost is the result of collaboration across Amazon Lab126 and Prime Video teams. We would like to thank Gordon Han, Berkant Tacer, Phil Hilmes, Peter Korn, Rui Wang, Ali Milani, Scott Isabelle, Vimal Bhat, Linda Liu, Mohamed Omar, Lakshmi Ziskin, Rohith Mysore, and Vijaya Kumar.</content:encoded>
      <pubDate>Wed, 10 Dec 2025 15:22:15 GMT</pubDate>
      <guid>https://www.amazon.science/blog/dialogue-boost-how-amazon-is-using-ai-to-enhance-tv-and-movie-dialogue</guid>
    </item>
    <item>
      <title>Amazon Nova Forge: &amp;quot;Open training&amp;quot; paradigm that empowers everyone to build their own frontier AI</title>
      <link>https://www.amazon.science/blog/amazon-nova-forge-open-training-paradigm-that-empowers-everyone-to-build-their-own-frontier-ai</link>
      <description>New service lets customers mix their own data with the data used to train Amazon Nova at each major stage of model development, enabling deep domain understanding while preventing &amp;quot;catastrophic forgetting&amp;quot;.</description>
      <content:encoded>As foundation models (FMs) — such as Transformer-based large language models (LLMs) — have grown in popularity, there is a pattern we have seen repeatedly: a new model launches with stunning benchmark scores, teams get excited and start testing, and then they hit production reality. The model that aced the public benchmarks struggles with specific use cases organizations want to enable. This is because the public benchmarks are “in the probability distribution” of the data used to train the models, whereas the use cases the organizations are interested in are “out of distribution." This distribution mismatch happens for two main reasons: The application depends on data, knowledge, and tools secured within an organization; these assets are not part of the public datasets used to train LLMs. Customer behavior and the application context keep evolving, so the new model is obsolete on the day it is deployed. A few months back, we asked how we could meet these fundamental challenges. Our front-row seats to diverse, large-scale application-development efforts within Amazon helped us invent a whole new service called Amazon Nova Forge that empowers organizations to build their own expert foundation models using Amazon Nova. In essence, Nova Forge gives you the training tools and recipes to make your differentiated use cases become “in distribution,” so your application can meet the highest standards of accuracy, reliability, cost effectiveness, and control. The result is a model that knows your organization and use cases as an expert in your domain. We call this model a “Novella” — a variant of Nova that is optimized for your organization. Before Amazon Nova Forge Historically, organizations have had three suboptimal choices for mitigating the challenges I described above. First, they could fine-tune closed-weights LLMs using APIs that are typically based on low-rank adapters (LoRA). 
But such limited adaptation cannot give the customized model a deep understanding of proprietary domain knowledge and complex workflows. Second, they could continue pretraining a base open-weights model or continue post-training one that is already instruction tuned for a set of use cases. But open-weights models do not come with the data used to train them or with their exact training recipes — e.g., how many training epochs, on which datasets, and at what learning rates. Consequently, it is extremely difficult to steer them to particular use cases without regressing on the core properties of the base model, a phenomenon known as “catastrophic forgetting.” Third, they could build a frontier-scale model from scratch, but that requires massive computational resources, expert developers, and time. Nova Forge, by contrast, is built on an entirely new paradigm of “open training” and has two main pillars: access to checkpoints from each major stage of model development and the ability to mix proprietary data with the data curated for training Amazon Nova. Access to checkpoints from each major stage of model development Most state-of-the-art foundation models are trained in three stages. First is pretraining, where the model is trained to predict the next token (i.e., unit of the LLM’s vocabulary, such as a word or a word part) in a sequence of tokens using large quantities of unlabeled data. Second is mid-training, where real-world and synthetic user-system interactions (traces) help improve the model’s performance on a prioritized set of applications and tasks while increasing (or at least preserving) generalizability to previously unseen tasks. Mid-training is like pretraining, except that the data is specific to a set of tasks that the model provider wants the model to excel at, and the learning rate (i.e., how much a given training example modifies the model) is different. 
Third is post-training, including supervised fine-tuning (SFT), where the model learns to complete tasks from curated demonstrations and instructions (e.g., from software engineering), and reinforcement learning (RL), which helps improve accuracy on these tasks and align the model’s outputs to specific policies. Depending on the complexity of the target application and the relative importance of historical data and ongoing usage, organizations need the ability to infuse their data and knowledge into one or all of these stages. This is why Nova Forge provides three model checkpoints — pretrained, mid-trained, and post-trained — and the recipes and code to continue training from any of them. If you are working in a novel domain that is not represented in the pretraining data at all (e.g., geospatial or radiology images) and have many trillions of tokens, you can continue pretraining from the pretrained checkpoint. If you have a few billion to a few trillion tokens of historical data or can synthesize interactions, you can continue from mid-trained checkpoints. You can also perform SFT and RL on the mid-trained checkpoints. Lastly, the most common use case is to continually update the model using RL from real-world feedback or synthetic data. Mixing proprietary and frontier data Foundation models with frontier capabilities come from frontier-scale data. While techniques such as regularization and carefully crafted learning rates can help mitigate the challenges of catastrophic forgetting, the best way to infuse new knowledge into a model without losing existing capabilities is to mix frontier-scale data with your own proprietary data. This is why, for all stages of training, Nova Forge provides API-based mixing of the high-quality curated data used to train our frontier models with your proprietary data. 
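The data-mixing idea can be sketched as weighted batch construction from two corpora (the function, corpora, and 30% ratio below are illustrative; Nova Forge exposes mixing through an API, not code like this):

```python
import random

def mixed_batches(frontier_data, proprietary_data, prop_fraction=0.3,
                  batch_size=8, n_batches=100, seed=0):
    """Yield training batches that blend two corpora (illustrative only).

    Drawing a fixed fraction of each batch from the proprietary corpus
    keeps the frontier-scale distribution dominant, which is the mechanism
    that guards against catastrophic forgetting while still making the
    proprietary domain "in distribution".
    """
    rng = random.Random(seed)
    n_prop = int(batch_size * prop_fraction)
    for _ in range(n_batches):
        batch = rng.sample(proprietary_data, n_prop) + rng.sample(
            frontier_data, batch_size - n_prop)
        rng.shuffle(batch)
        yield batch

frontier = [f"pub_{i}" for i in range(1000)]       # stands in for curated data
proprietary = [f"priv_{i}" for i in range(100)]    # stands in for customer data
first = next(mixed_batches(frontier, proprietary))
print(sum(s.startswith("priv") for s in first))  # 2 of 8 examples are proprietary
```

In practice the mixing ratio is itself a tuning knob: too much proprietary data risks forgetting, too little leaves the target domain underrepresented.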
To the best of our knowledge, no proprietary FM provider — or even open-weights-model developer — has provided the ability to mix frontier-scale data with proprietary data during pretraining, mid-training, and post-training. When organizations blend their proprietary data with high-quality curated data at early stages, they achieve something fundamentally different from customization choices that were available before Nova Forge: they build models where expertise in their domain is the core capability of the model, not an afterthought. The model learns to reason about domain-specific concepts as fluently as it reasons about the general knowledge available in public sources. Consider the experience of Nimbus Therapeutics, a clinical-stage drug discovery company, when building an AI system to accelerate molecular design. Drug discovery requires finding the right balance of many properties within a single molecule. It is an exponentially complex task that cannot be solved by manual exploration of candidate combinations. The goal was to build a model that could generate molecular designs, reason through complex problems, and predict which molecules are worth testing in the lab, where each experiment can cost thousands of dollars. Off-the-shelf LLMs lacked the deep understanding of chemistry required for such specialized work. While Nimbus had already built a suite of specialized machine learning models to address this gap, these models still lacked true chemical-reasoning capabilities, and maintaining a collection of separate models had become increasingly complex and resource intensive. The team began by testing Nova 2 Lite on pharmaceutical-patent analysis, where it achieved 95% accuracy without any customization. This impressive result gave them confidence to use Nova Forge for a more ambitious goal: creating one unified molecular-intelligence system. 
For instance, a model needs to understand not just how to connect atoms to make a realistic molecule but how specific structural features in each molecule map to physico-chemical properties, biological activities, and toxicophores. A grasp of these complex relationships is difficult to bolt on after a model's knowledge of structures has solidified. Nova Forge enabled the team to bring in its own proprietary chemistry datasets and drive performance improvement using supervised fine-tuning and reinforcement learning. Early results show that the custom model built using Nova Forge already outperforms other leading LLMs on molecular-property prediction tasks by significant margins, with the promise of expanding into molecular generation — a cutting-edge technology that will help bring better medicines to patients more quickly than ever before. The next frontier We released Amazon Nova Forge as the first service that enables organizations to build their own frontier models with Nova, through this “open training” approach. The capabilities we recently launched with Nova 2 Lite and three other Nova 1 models address the two challenges I outlined earlier. We are now working to meet an emerging challenge — reducing the time and effort required to transfer knowledge from an existing, customized Nova model to a newly released Nova model. To that end, we are offering Forge customers early access to a more capable model, Nova 2 Pro, at the same time that we are providing it to our internal teams. Forge customers can use Nova 2 Pro in Amazon Bedrock right away to build their applications. In a few weeks, we will provide recipes for training from multiple checkpoints of Nova 2 Pro. Such early access to even more powerful models in Forge makes it easy for organizations to plan ahead for the transfer of knowledge to newer, more capable Nova models. 
Our open-training approach also makes it easy for the broader research community to explore fundamental research questions — and it is another reason I am excited by the potential of Nova Forge. Just as open-source software enabled the modern Internet, open training may enable a future where every organization can build its own frontier AI. The so what I gave Nova 2 Lite a description of Nova Forge and asked for a one-sentence summary for our customers. Nova 2 Lite came back with “Nova Forge: Your AI, your rules—built faster, smarter, and on your terms.” I could not have done a better job of summarizing the spirit of what we are trying to accomplish here, helping organizations of all sizes and expertise excel in their domains and deliver value with AI.</content:encoded>
      <pubDate>Mon, 08 Dec 2025 19:41:50 GMT</pubDate>
      <guid>https://www.amazon.science/blog/amazon-nova-forge-open-training-paradigm-that-empowers-everyone-to-build-their-own-frontier-ai</guid>
    </item>
    <item>
      <title>AutoGluon assistant: Zero-code AutoML through multiagent collaboration</title>
      <link>https://www.amazon.science/blog/autogluon-assistant-zero-code-automl-through-multiagent-collaboration</link>
      <description>A multiagent architecture separates data perception, tool knowledge, execution history, and code generation, enabling ML automation that works with messy, real-world inputs.</description>
      <content:encoded>At the 2024 Kaggle AutoML Grand Prix — a $75,000 competition featuring hundreds of teams including top AutoML practitioners and Kaggle grandmasters — our fully automated framework placed 10th, making it the only automated agent to score points in the competition. This achievement validated our answer to a question we'd been pursuing: could we eliminate not just the model selection and hyperparameter tuning typically associated with AutoML, but the coding itself? The promise of automated machine learning has always been democratization. Yet most AutoML tools still require users to write code, prepare data structures, and understand ML workflows. For domain experts without programming backgrounds — scientists analyzing experimental data, analysts building forecasting models, or researchers working with image collections — this coding requirement creates an unnecessary barrier. We designed AutoGluon Assistant to remove this barrier. Built on MLZero, a novel multiagent system powered by large language models, AutoGluon Assistant transforms natural-language descriptions into trained machine learning models across tabular, image, text, and time series data. The system achieved a 92% success rate on our Multimodal AutoML Agent Benchmark and 86% on the external MLE-bench Lite, with leading performance in both success rate and solution quality. A multiagent architecture for true automation Traditional AutoML tools assume clean, structured inputs and users capable of invoking APIs correctly. Real-world ML problems begin with messier realities: ambiguous data files, unclear task definitions, and users who may not know whether they need classification or regression. MLZero addresses this through a multiagent architecture where specialized components powered by large language models from Amazon Bedrock collaborate to transform raw inputs into working solutions. 
For example, consider a medical researcher who uploads chest x-ray images with segmentation masks, describing the goal as "locate disease regions in x-rays." The perception module identifies pixel-level segmentation as the task, semantic memory selects AutoGluon's MultiModalPredictor for semantic segmentation, and the iterative coding module generates and refines code. When the initial attempt encounters mask format incompatibilities, episodic memory provides debugging context to adjust preprocessing and postprocessing, successfully training a segmentation model — all without the researcher writing any code. The system comprises four core modules: perception, semantic memory, episodic memory, and iterative coding. The perception module interprets arbitrary data inputs, parsing file structures and content to build structured understanding regardless of format inconsistencies or ambiguous naming. Where users might provide CSV files without clear indication of target variables, perception analyzes column distributions and semantics to infer task structure. The semantic-memory module enriches the system with knowledge of ML libraries, maintaining structured information about AutoGluon's capabilities, API patterns, and best practices. Rather than requiring users to know that semantic-segmentation tasks require the SAM model in AutoGluon Multimodal, semantic memory enables the system to select appropriate tools based on task characteristics. Episodic memory maintains chronological execution records, tracking what the system has attempted, what succeeded, and what failed. When code execution produces errors, this module provides debugging context by surfacing relevant previous attempts and their outcomes. This addresses the iterative nature of ML development, where solutions emerge through refinement rather than appearing fully formed. The iterative-coding module implements a refinement process with feedback loops and augmented memory. 
Generated code executes, produces results or errors, and informs subsequent attempts. This continues until either successful execution or a maximum iteration limit, with optional per-iteration user input for guidance when needed. The architecture maintains high automation while preserving flexibility for human oversight. Through this comprehensive system, MLZero bridges the gap between noisy raw data and sophisticated ML solutions. The multiagent collaboration pattern proves effective across modalities because the architecture separates concerns — understanding data, knowing capabilities, tracking history, and generating code — that traditionally intertwine in single-agent systems. Breaking down results To validate our system against an established, external standard, we first evaluated it on MLE-bench Lite. This benchmark consists of 21 diverse challenges from previous Kaggle competitions, allowing us to directly compare our model's performance to that of other leading automated systems. Our model achieved the highest success rate, 86%, meaning it successfully completed and submitted valid solutions for 18 of the 21 challenges. It secured the top position in overall solution quality, with an average rank of 1.43 in the standings, compared to the next-best agent's 2.36. Our agent won six gold medals and outperformed all competitors in total medal count across the benchmark's challenges. After proving our model's capabilities on an existing benchmark, we further tested it on our own Multimodal AutoML Agent Benchmark, a more challenging suite comprising 25 diverse tasks with less-processed datasets, where data is closer to its raw form with more noise, format inconsistencies, and ambiguities. This benchmark features multiple data modalities (tabular, image, text, document), problem types (classification, regression, retrieval, semantic segmentation), and challenging data structures (multilingual, multitable, and large-scale datasets). 
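The generate-execute-refine loop described above for the iterative-coding module can be sketched in a few dozen lines of Python. This is a minimal, illustrative sketch rather than the actual MLZero implementation: every name here (ExecutionResult, EpisodicMemory, generate_code, execute, iterative_coding) is an assumption, and the LLM-backed code generator and sandboxed executor are replaced by trivial stand-ins.

```python
# Minimal sketch of an MLZero-style iterative-coding loop (all names are
# illustrative assumptions, not the actual MLZero API).
import contextlib
import io
from dataclasses import dataclass, field

@dataclass
class ExecutionResult:
    ok: bool
    output: str

@dataclass
class EpisodicMemory:
    """Chronological record of attempts, surfaced as debugging context."""
    attempts: list = field(default_factory=list)

    def record(self, code, result):
        self.attempts.append((code, result))

    def context(self):
        # Prior outputs and errors inform the next generation step.
        return "\n".join(result.output for _, result in self.attempts)

def generate_code(task, context):
    # Stand-in for the LLM-backed coder; a real system would prompt an LLM
    # with the task description plus the episodic-memory context.
    return f"print('solving: {task}')"

def execute(code):
    # Stand-in for sandboxed execution of the generated script.
    try:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return ExecutionResult(True, buf.getvalue())
    except Exception as err:
        return ExecutionResult(False, str(err))

def iterative_coding(task, max_iterations=5):
    # Generate, execute, and refine until success or the iteration limit.
    memory = EpisodicMemory()
    for _ in range(max_iterations):
        code = generate_code(task, memory.context())
        result = execute(code)
        memory.record(code, result)
        if result.ok:
            break
    return result
```

The key design point the sketch captures is that episodic memory sits inside the loop: each failed attempt's output is fed back into the next generation step, which is what lets the system debug its own mask-format or preprocessing errors rather than repeating them.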
AutoGluon Assistant (as MLZero) achieved a 92% success rate across all tasks. When implemented with a compact, eight-billion-parameter LLM, the system still achieved a 45.3% success rate, proving more effective than many larger, more resource-intensive agents. Accessible interfaces for diverse workflows AutoGluon Assistant supports multiple interaction modes to fit different user preferences and workflows. Users can invoke the system through a command-line interface for quick automation tasks, a Python API for integration into existing data pipelines, or a Web UI for visual interaction and monitoring; they can also use the Model Context Protocol (MCP) to integrate it with other agentic tools. This flexibility ensures that whether users prefer scripting, graphical interfaces, or programmatic control, they can access the same underlying automation capabilities. The system also supports optional per-iteration user input, allowing domain experts to inject specialized knowledge during iterative refinement while maintaining automation for routine use. When working with medical imaging data, for instance, experts might guide the system toward custom normalizations specific to their scanning protocols. Episodic memory tracks these interventions alongside system-generated attempts, creating a collaborative dynamic where automation handles mechanical complexity while users contribute strategic direction when they possess relevant insights. The system is open source and available on GitHub, with technical details published in our NeurIPS 2025 paper.</content:encoded>
      <pubDate>Fri, 05 Dec 2025 15:24:42 GMT</pubDate>
      <guid>https://www.amazon.science/blog/autogluon-assistant-zero-code-automl-through-multiagent-collaboration</guid>
    </item>
  </channel>
</rss>
