A request for comments (RFC) patch series sent out this week for the Linux kernel implements the notion of Virtual Swap Space support. The idea of a virtual swap space has been discussed for years, and efforts to redesign the kernel's swap cache along similar lines date back to at least 2011.
This latest Virtual Swap Space work for Linux was posted by open-source developer Nhat Pham and is currently deemed a prototype implementation. The focus of the new code is on decoupling Zswap from the backing swap file as well as simplifying and optimizing swapoff behavior. Additionally, this code could help in scenarios involving multi-tier swapping, swapfile compaction, and other functionality.
Nhat Pham explains of this work on Virtual Swap Space for Linux:
“This RFC implements the virtual swap space idea, based on Yosry’s proposals at LSFMMBPF 2023, as well as valuable inputs from Johannes Weiner. The same idea (with different implementation details) has been floated by Rik van Riel since at least 2011.

The code attached to this RFC is purely a prototype. It is not 100% merge-ready. I do, however, want to show people this prototype/RFC, including all the bells and whistles and a couple of actual use cases, so that folks can see what the end results will look like, and give me early feedback :)
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap device is allocated and stored in the page table entries that refer to the original page. This slot is also used as the “key” to find the swapped out content, as well as the index to swap data structures, such as the swap cache, or the swap cgroup mapping. Tying a swap entry to its backing slot in this way is performant and efficient when swap is purely just disk space, and swapoff is rare.
However, the advent of many swap optimizations has exposed major drawbacks of this design. The first problem is that we occupy a physical slot in the swap space, even for pages that are NEVER expected to hit the disk: pages compressed and stored in the zswap pool, zero-filled pages, or pages rejected by both of these optimizations when zswap writeback is disabled. This is arguably the central shortcoming of zswap:
* In deployments when no disk space can be afforded for swap (such as mobile and embedded devices), users cannot adopt zswap, and are forced to use zram. This is confusing for users, and creates extra burdens for developers, having to develop and maintain similar features for two separate swap backends (writeback, cgroup charging, THP support, etc.).
* Resource-wise, it is hugely wasteful in terms of disk usage, and limits the memory saving potentials of these optimizations by the static size of the swapfile, especially in high memory systems that can have up to terabytes worth of memory. It also creates significant challenges for users who rely on swap utilization as an early OOM signal.
Another motivation for a swap redesign is to simplify swapoff, which is complicated and expensive in the current design. Tight coupling between a swap entry and its backing storage means that it requires a whole page table walk to update all the page table entries that refer to this swap entry, as well as updating all the associated swap data structures (swap cache, etc.).
…
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile: simply associate the virtual swap slot with one of the supported backends: a zswap entry, a zero-filled swap page, a slot on the swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and have the virtual swap slot point to the page instead of the on-disk physical swap slot. No need to perform any page table walking.
…
Other than decoupling swap backends and optimizing swapoff, this new design allows us to implement the following more easily and efficiently:
* Multi-tier swapping, with transparent transferring (promotion/demotion) of pages across tiers. Similar to swapoff, with the old design we would need to perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying Huang).
* Mixed backing THP swapin: Once you have pinned down the backing store of THPs, then you can dispatch each range of subpages to the appropriate swapin handler.
* Swapping a folio out with discontiguous physical swap slots”
Those wanting to learn more about this prototype work on Virtual Swap Space for Linux can see this RFC patch series for all the details.