On top of the x86/x86_64 Linux kernel crypto improvements made recently by Google engineer Eric Biggers to better leverage AVX-512 and other modern x86 ISA features, a new patch-set posted today by Biggers would help make all x86/x86_64 kernel encryption/decryption at least slightly faster.
The new patch series isn’t about making use of new CPU ISA features or anything wild like that. Rather, it cleans up existing code so that the x86 kernel-mode FPU always works reliably in softirq context, which allows some old “crypto SIMD helper” fallback code to be removed. That fallback code was “really bad for performance” and hurt performance even for those who never needed to rely on it.
In turn, removing the no-SIMD encryption/decryption fallbacks helps performance by a few percent in general: IPsec throughput improved by 2% in Biggers’ testing, while AES-XTS saw gains of 7% or as much as 23%.
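For those wanting a rough picture of what changes for callers, below is a minimal userspace sketch in C of the before-and-after pattern. It is only a toy model and not kernel code: every name in it (`model_simd_usable`, `model_fpu_begin()`, `toy_crypt_simd()`, and so on) is hypothetical, the "cipher" is a plain XOR, and the slow path merely stands in for whatever detour the old helper took when kernel-mode FPU was not usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model state; none of these names exist in the kernel. */
bool model_simd_usable = true;  /* before the fix this could be false in a softirq */
bool model_fpu_in_use = false;  /* tracks a begin/end section in this toy model */

enum crypt_path { PATH_SIMD, PATH_FALLBACK };

void model_fpu_begin(void)
{
    model_fpu_in_use = true;
}

void model_fpu_end(void)
{
    assert(model_fpu_in_use);   /* begin/end must be paired */
    model_fpu_in_use = false;
}

/* Toy "cipher" standing in for the fast SIMD routine: XOR each byte with key. */
void toy_crypt_simd(uint8_t *buf, size_t len, uint8_t key)
{
    assert(model_fpu_in_use);   /* only valid inside a begin/end section */
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Same result as above, but stands in for the slower non-SIMD route. */
void toy_crypt_fallback(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Old pattern: every call pays for the check and must carry a fallback. */
enum crypt_path encrypt_with_fallback(uint8_t *buf, size_t len, uint8_t key)
{
    if (model_simd_usable) {
        model_fpu_begin();
        toy_crypt_simd(buf, len, key);
        model_fpu_end();
        return PATH_SIMD;
    }
    toy_crypt_fallback(buf, len, key);
    return PATH_FALLBACK;
}

/* New pattern: SIMD is assumed to be usable, so the wrapper disappears. */
enum crypt_path encrypt_direct(uint8_t *buf, size_t len, uint8_t key)
{
    model_fpu_begin();
    toy_crypt_simd(buf, len, key);
    model_fpu_end();
    return PATH_SIMD;
}
```

Setting `model_simd_usable` to `false` and calling `encrypt_with_fallback()` exercises the slow path, while `encrypt_direct()` never consults the flag at all. The point of the sketch is simply that the second form has no branch and no second implementation to maintain, which is the kind of simplification the patch series is after.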
Eric Biggers explained with this RFC patch series:
“This patchset fixes a longstanding issue where kernel-mode FPU (i.e., SIMD) was not reliably usable in softirqs in x86, which was creating the need for a fallback. The fallback was really bad for performance, and it even hurt performance for users that never encountered the edge case where kernel-mode FPU was not usable.
This patchset aligns x86 with other architectures such as arm, arm64, and riscv by making kernel-mode FPU work in softirqs reliably. There are a few possible ways to achieve that, and for now I just went with the simplest way; see patch 1 for details.
Patch 2 eliminates all uses of the “crypto SIMD helper” from x86, as patch 1 makes it unnecessary. For the RFC it is just one big patch; I’ll probably split patch 2 up if this progresses past RFC status.
Performance results have been positive. All en/decryption is now slightly faster on x86, as it no longer takes a detour through crypto/simd.c. I get a 7% or 23% improvement for AES-XTS, for example.
I also benchmarked bidirectional IPsec, which has been claimed to often hit the edge case where kernel-mode FPU was previously not usable in softirq context. Ultimately, I was not actually able to reproduce that edge case being reached unless I reduced the number of CPUs to 1, in which case it then started being occasionally reached. Regardless, even without that case being reached, IPsec throughput still improved by 2%. In situations where that case was being reached, or where users required a synchronous algorithm, a much larger improvement should be seen.”
Great work, and beyond the performance benefits, cleaning up this old fallback/helper code adds just 100 lines of code while dropping 360 lines of existing code.