Chris Down (u/chrisdown)

There are some numbers in the article, although of course I'm happy to hear which others you'd like to see presented.

  • A counterintuitive 25% reduction in disk writes at Instagram after enabling zswap

  • Eventual ~5:1 compression ratio on Django workloads with zswap + zstd

  • 20-30 minute OOM stalls at Cloudflare with the OOM killer never once firing under zram

The LRU inversion argument follows directly from the code. That is, it's a logical consequence of the architecture rather than an empirical question, so I'm not sure a benchmark would really add much there.
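If it helps to see the shape of the argument, here's a toy sketch (mine, written for this comment, and nothing like the real reclaim code, so treat it purely as illustration): with zram sitting in front of a disk swap device, the compressed pool fills once and then stays put, so later (warmer) evictions spill straight to disk, whereas zswap writes its oldest pool entries back to disk to make room, so the newest evictions stay in compressed RAM.

    # Toy model of where evicted anonymous pages end up under sustained pressure.
    # Purely illustrative: "zram + disk swap" freezes its pool once full, so later
    # (warmer) evictions spill to disk; zswap writes its oldest entries back to
    # disk to make room, so the most recent evictions stay in compressed RAM.

    POOL_SLOTS = 4  # compressed pool capacity, in pages

    def zram_plus_disk(evictions):
        pool, disk = [], []
        for page in evictions:            # pages arrive coldest-first over time
            if len(pool) < POOL_SLOTS:
                pool.append(page)         # zram fills first (higher swap priority)
            else:
                disk.append(page)         # pool full: warmer pages hit the disk
        return pool, disk

    def zswap_style(evictions):
        pool, disk = [], []
        for page in evictions:
            if len(pool) == POOL_SLOTS:
                disk.append(pool.pop(0))  # write back the oldest (coldest) entry
            pool.append(page)             # newest eviction stays in compressed RAM
        return pool, disk

    evictions = [f"page{i}" for i in range(8)]      # page0 coldest ... page7 warmest
    print("zram+disk:", zram_plus_disk(evictions))  # warm pages end up on disk
    print("zswap:    ", zswap_style(evictions))     # cold pages end up on disk

That's the inversion: the slow device ends up holding exactly the pages you're most likely to want back.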


It's a good question and it's a bit nuanced. Reads don't wear flash meaningfully, indeed. But the meme people often cite, that zram means "0 extra reads or writes", only holds when memory pressure is low enough that zram never fills; in that regime zswap's pool also never fills and never writes to disk either, so they're equivalent.

The divergence only happens under pressure, and that's exactly where zram forces file cache churn onto the same device instead. So in that sense it's not zram with zero writes and zswap with unbounded writes; it's a question of which writes happen, and whether the kernel got to choose cold pages from both anonymous and file pages or was forced to evict hotter ones from the smaller set of file pages only (and is thus more likely to thrash and cause more disk I/O).


Thanks for reading! And if it helps you feel less disappointed, zram actually often ends up with more disk writes than zswap in testing (it's just that the writes don't come from swap :-)).

Now, this might sound like abject nonsense, but hear me out. With zram-only, once zram is full, there is nowhere for anonymous pages to go. The kernel can't evict them to disk because there is no disk swap. So when it needs to free memory, it has no choice but to reclaim file cache instead.

In such situations we are tying the kernel's hands quite significantly. We don't allow the kernel to choose which page is colder across both anonymous and file-backed memory, and instead force it to only reclaim file caches, so it is inevitable that you will eventually reclaim file caches that are actually hotter and that you actually needed to keep resident. Those thrashing reads and writes hit the SSD just as much, or even more, given that we're making more limited decisions.
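To put toy numbers on that hand-tying point (a sketch I'm making up for this comment, not a benchmark from the article): reclaim that may only take file pages ends up evicting file pages that are re-read constantly, while reclaim that can see both kinds takes the genuinely cold anonymous pages instead.

    # Toy model of constrained reclaim. access_rate is how often a page is
    # touched per unit of time; every touch of an evicted *file* page becomes a
    # disk read (a refault), while a cold anonymous page parked in compressed
    # RAM costs nothing. Illustration only, not how the kernel scores pages.
    pages = [
        {"kind": "anon", "access_rate": 0},   # long-idle heap pages
        {"kind": "anon", "access_rate": 0},
        {"kind": "file", "access_rate": 50},  # hot shared library text
        {"kind": "file", "access_rate": 5},   # warm cache
    ]

    def refault_io(pages, need, file_only):
        candidates = [p for p in pages if p["kind"] == "file"] if file_only else list(pages)
        candidates.sort(key=lambda p: p["access_rate"])  # evict coldest candidates first
        victims = candidates[:need]
        return sum(p["access_rate"] for p in victims if p["kind"] == "file")

    # zram-only with a full zram: only file pages are eligible for reclaim.
    print("file-only reclaim:", refault_io(pages, need=2, file_only=True))   # 55 reads/unit time
    # zswap (or any disk swap): the coldest pages overall get picked.
    print("global reclaim:   ", refault_io(pages, need=2, file_only=False))  # 0 reads/unit time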

As I mentioned in the article, there are cases where enabling zswap reduced disk writes by up to 25% compared to having no swap at all, because the kernel can now choose to park cold anonymous pages in compressed RAM rather than churning through file cache. Now of course the exact numbers vary across workloads, but directionally this holds for most systems that accumulate cold anonymous pages over time, and we've seen it everywhere from small systems like BMCs, to desktops, to servers, to consumer devices like VR headsets.

So you may find the switch actually goes easier on your SSD than you'd expect. But if that's not the case, we'd definitely love to hear about it on linux-mm so we can make zswap more robust.
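(And if anyone wants to try that switch, this is roughly what flipping zswap on at runtime looks like. It's a sketch rather than a recipe: these are the stock /sys/module/zswap parameters, but check your kernel for the available compressors and defaults, remember zswap still wants a regular swap device behind it, and you can equally just boot with zswap.enabled=1 zswap.compressor=zstd.)

    # Rough sketch: enable zswap with zstd on a running system (run as root).
    # Parameter names are the stock /sys/module/zswap knobs; your kernel and
    # distro may differ, so verify before relying on this.
    from pathlib import Path

    ZSWAP = Path("/sys/module/zswap/parameters")

    def set_param(name: str, value: str) -> None:
        (ZSWAP / name).write_text(value)

    set_param("compressor", "zstd")      # the compressor behind the ~5:1 Django numbers
    set_param("max_pool_percent", "20")  # cap the compressed pool at 20% of RAM (the default)
    set_param("enabled", "Y")            # turn zswap on last

    for p in sorted(ZSWAP.iterdir()):    # sanity check what we ended up with
        print(p.name, "=", p.read_text().strip())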


Hmm, reads aren't free on slow eMMC either: they still consume I/O bandwidth and add latency, which on the kind of low-end hardware you're describing can be very noticeable for responsiveness, depending on what you want to read.

(Though I'd note that's a somewhat different concern from the original comment, which was about writes and storage wear.)


Browsers are certainly one major consumer, but they're not the only consumer of file cache. For example, shared libraries, fonts, executables and many other things benefit from being resident too, and those do cause re-reads if evicted. The ecosystem is also getting more and more sensitive to this with increased binary sizes, full containerisation, appimages/flatpaks, etc. Funnily enough, Chrome is one of the examples I used for the executable case in my talk on lies programmers believe about memory.

That said, you're right that browser-heavy workloads are probably among the more favourable cases for zram-only, since a meaningful portion of their working set is self-managed. I'd still expect zswap to come out ahead on total I/O, but the margin would likely be smaller than the best case. If you find otherwise, we'd love to hear about it on linux-mm for sure.

(Also, a small nit: the Chromium quote is saying that the backend is robust to a poor OS cache, not that the OS cache provides no benefit; those are different claims.)


To clarify, "losing support upstream" doesn't mean we are removing it from the kernel tomorrow :-) What it means is that the kernel developers who maintain the surrounding subsystems are increasingly unwilling to take new work that depends on zram's current architecture, and are actively steering toward zswap as the single compressed swap implementation.

You can see this pretty directly in the quotes in the article from Christoph Hellwig (who works on the block layer) and Johannes Weiner (who is one of the MM maintainers). Christoph's position, for example, is essentially that zram is an abuse of the block layer for something that belongs in MM and is causing an increasing amount of maintenance burden, and thus he's not interested in taking patches that extend it further. The section showing how to do idle page reclaim on zram shows a number of those hacks in action, and I think neatly illustrates why so many are opposed to adding even more. So to that extent, improving zram is pretty much dead in the water, whereas zswap is being actively developed.
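For anyone who hasn't read that section, the flavour of it is roughly the below: to get "write the cold stuff in zram out to disk" you have to drive it yourself through sysfs. This is a hedged sketch of the standard zram writeback interface rather than the article's exact steps; it needs CONFIG_ZRAM_WRITEBACK, a backing_dev configured before the device is sized, and the details vary by kernel version.

    # Sketch of manual idle-page writeback on zram0. The kernel won't do this
    # for you as part of normal reclaim; userspace has to drive it:
    #   1. mark every page in zram as idle,
    #   2. wait; pages that get touched again lose the idle flag,
    #   3. ask zram to write whatever is still idle out to the backing device.
    import time
    from pathlib import Path

    ZRAM = Path("/sys/block/zram0")

    def writeback_idle_pages(age_seconds: int = 3600) -> None:
        (ZRAM / "idle").write_text("all")        # step 1: mark everything idle
        time.sleep(age_seconds)                  # step 2: let hot pages clear the flag
        (ZRAM / "writeback").write_text("idle")  # step 3: push still-idle pages to disk

    writeback_idle_pages()

With zswap, the equivalent "cold compressed pages end up on disk" decision falls out of ordinary reclaim without any of this babysitting.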

Popularity in distros and upstream development direction are pretty independent things: distros ship what works today, and zram does work today for some setups. But the upstream direction matters for the long term, because it determines what gets fixed, what gets optimised, and what gets left to rot. A lot of the complexity around operating zram correctly exists precisely because nobody is going to deeply integrate zram into the reclaim path, because people see better options emerging.


Thanks for the comment!

So, the SSD wear argument is something I address in the article. The short version is that refusing to swap anonymous pages just shifts pressure onto the page cache and can actually increase I/O in many workloads, even on eMMC, because now the kernel has far fewer pages to select for reclaim. As I mentioned in the article, in some cases enabling zswap reduced disk writes by up to 25% compared to having no swap at all. Obviously the exact numbers will vary, but the direction holds across most workloads that accumulate cold anonymous pages over time, and we've seen it apply in many domains (Quest headsets, BMCs, servers, and desktops).

This may seem counterintuitive, but it makes sense: if you don't allow the kernel to choose which page is colder, and instead limit it to only reclaiming file caches, it is inevitable that you will eventually reclaim file caches that actually needed to be resident to avoid disk activity.

As for your comment about diskless setups in general, we are addressing the diskless case with zswap directly. Nhat (Pham), the block maintainers, and a bunch of us from the MM side have been increasingly pushing on making zswap work without a backing device at all, removing zram's main remaining use case. We are mostly doing this because zram is extremely fragile on the kernel side, and relies strongly on hacks in the block subsystem to expose memory management internals (like the manual reclaim that I mention in the article). Once that's landed, you will be able to use zswap without backing swap and get semantics similar to how zram works now, but with much tighter integration with the rest of the MM subsystem and significantly better decisions when memory pressure hits.