Message ID | 20221207103936.2198407-3-ardb@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | ARM: allow kernel mode NEON in softirq context |
On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:

> We currently only permit kernel mode NEON in process context, to avoid
> the need to preserve/restore the NEON register file when taking an
> exception while running in the kernel.
>
> Like we did on arm64, we can relax this restriction substantially, by
> permitting kernel mode NEON from softirq context, while ensuring that
> softirq processing is disabled when the NEON is being used in task
> context. This guarantees that only NEON context belonging to user space
> needs to be preserved and restored, which is already taken care of.
>
> This is especially relevant for network encryption, where incoming
> frames are typically handled in softirq context, and deferring software
> decryption to a kernel thread or falling back to C code are both
> undesirable from a performance PoV.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

So boosting WireGuard as primary SW network encryption user?

This is really neat, BTW:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij

On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > We currently only permit kernel mode NEON in process context, to avoid
> > the need to preserve/restore the NEON register file when taking an
> > exception while running in the kernel.
> >
> > Like we did on arm64, we can relax this restriction substantially, by
> > permitting kernel mode NEON from softirq context, while ensuring that
> > softirq processing is disabled when the NEON is being used in task
> > context. This guarantees that only NEON context belonging to user space
> > needs to be preserved and restored, which is already taken care of.
> >
> > This is especially relevant for network encryption, where incoming
> > frames are typically handled in softirq context, and deferring software
> > decryption to a kernel thread or falling back to C code are both
> > undesirable from a performance PoV.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> So boosting WireGuard as primary SW network encryption user?

Essentially, although the use case that inspired this work is related
to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
~3x faster than WG's chacha20poly1305, which makes the performance
overhead of asynchronous completion even more significant. (Note that
GCM needs the AES and PMULL instructions which are usually only
available when running the 32-bit kernel on a 64-bit core, whereas
chacha20poly1305 uses ordinary NEON instructions.)

But Martin responded with a Tested-by regarding chacha20poly1305 on
IPsec (not WG) where there is also a noticeable speedup, so WG on
ARM32 should definitely benefit from this as well.

> This is really neat, BTW:
> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
>

Thanks!

On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> >
> > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > We currently only permit kernel mode NEON in process context, to avoid
> > > the need to preserve/restore the NEON register file when taking an
> > > exception while running in the kernel.
> > >
> > > Like we did on arm64, we can relax this restriction substantially, by
> > > permitting kernel mode NEON from softirq context, while ensuring that
> > > softirq processing is disabled when the NEON is being used in task
> > > context. This guarantees that only NEON context belonging to user space
> > > needs to be preserved and restored, which is already taken care of.
> > >
> > > This is especially relevant for network encryption, where incoming
> > > frames are typically handled in softirq context, and deferring software
> > > decryption to a kernel thread or falling back to C code are both
> > > undesirable from a performance PoV.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >
> > So boosting WireGuard as primary SW network encryption user?
>
> Essentially, although the use case that inspired this work is related
> to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> ~3x faster than WG's chacha20poly1305, which makes the performance
> overhead of asynchronous completion even more significant. (Note that
> GCM needs the AES and PMULL instructions which are usually only
> available when running the 32-bit kernel on a 64-bit core, whereas
> chacha20poly1305 uses ordinary NEON instructions.)
>
> But Martin responded with a Tested-by regarding chacha20poly1305 on
> IPsec (not WG) where there is also a noticeable speedup, so WG on
> ARM32 should definitely benefit from this as well.

It'll be interesting to see whether there is any noticeable difference
with my WG VPN.

On Thu, 15 Dec 2022 at 11:51, Russell King (Oracle) <linux@armlinux.org.uk> wrote:
>
> On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> > On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> > >
> > > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > > We currently only permit kernel mode NEON in process context, to avoid
> > > > the need to preserve/restore the NEON register file when taking an
> > > > exception while running in the kernel.
> > > >
> > > > Like we did on arm64, we can relax this restriction substantially, by
> > > > permitting kernel mode NEON from softirq context, while ensuring that
> > > > softirq processing is disabled when the NEON is being used in task
> > > > context. This guarantees that only NEON context belonging to user space
> > > > needs to be preserved and restored, which is already taken care of.
> > > >
> > > > This is especially relevant for network encryption, where incoming
> > > > frames are typically handled in softirq context, and deferring software
> > > > decryption to a kernel thread or falling back to C code are both
> > > > undesirable from a performance PoV.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > So boosting WireGuard as primary SW network encryption user?
> >
> > Essentially, although the use case that inspired this work is related
> > to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> > ~3x faster than WG's chacha20poly1305, which makes the performance
> > overhead of asynchronous completion even more significant. (Note that
> > GCM needs the AES and PMULL instructions which are usually only
> > available when running the 32-bit kernel on a 64-bit core, whereas
> > chacha20poly1305 uses ordinary NEON instructions.)
> >
> > But Martin responded with a Tested-by regarding chacha20poly1305 on
> > IPsec (not WG) where there is also a noticeable speedup, so WG on
> > ARM32 should definitely benefit from this as well.
>
> It'll be interesting to see whether there is any noticeable difference
> with my WG VPN.
>

Using WireGuard with the same 32-bit KVM guest communicating with its
64-bit host using virtio-net, I get a 44% speedup in the host->guest
direction. The other direction performs exactly the same, which is
unsurprising as it doesn't involve NEON crypto in softirq context at
all.

BEFORE
======

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 40144 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  25.8 MBytes   216 Mbits/sec    0    397 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    397 KBytes
[  5]   2.00-3.00   sec  27.0 MBytes   226 Mbits/sec    0    397 KBytes
[  5]   3.00-4.00   sec  26.5 MBytes   222 Mbits/sec    0    397 KBytes
[  5]   4.00-5.00   sec  26.2 MBytes   220 Mbits/sec    0    397 KBytes
[  5]   5.00-6.00   sec  26.1 MBytes   219 Mbits/sec    0    436 KBytes
[  5]   6.00-7.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   7.00-8.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   8.00-9.00   sec  26.5 MBytes   222 Mbits/sec    0    480 KBytes
[  5]   9.00-10.00  sec  26.9 MBytes   225 Mbits/sec    0    480 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   263 MBytes   221 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec                  receiver

ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 46340 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  47.5 MBytes   398 Mbits/sec    0   1.75 MBytes
[  5]   1.00-2.00   sec  45.0 MBytes   377 Mbits/sec   18   1.35 MBytes
[  5]   2.00-3.00   sec  43.8 MBytes   367 Mbits/sec    0   1.47 MBytes
[  5]   3.00-4.00   sec  45.0 MBytes   377 Mbits/sec    0   1.56 MBytes
[  5]   4.00-5.00   sec  45.0 MBytes   377 Mbits/sec    0   1.63 MBytes
[  5]   5.00-6.00   sec  42.5 MBytes   357 Mbits/sec    0   1.68 MBytes
[  5]   6.00-7.00   sec  43.8 MBytes   367 Mbits/sec    0   1.71 MBytes
[  5]   7.00-8.00   sec  43.8 MBytes   367 Mbits/sec    0   1.73 MBytes
[  5]   8.00-9.00   sec  45.0 MBytes   377 Mbits/sec    0   1.74 MBytes
[  5]   9.00-10.00  sec  43.8 MBytes   367 Mbits/sec    0   1.75 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   445 MBytes   373 Mbits/sec   18             sender
[  5]   0.00-10.04  sec   444 MBytes   371 Mbits/sec                  receiver

iperf Done.

AFTER
=====

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 44004 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  26.2 MBytes   220 Mbits/sec    0    399 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    399 KBytes
[  5]   2.00-3.00   sec  26.0 MBytes   218 Mbits/sec    0    444 KBytes
[  5]   3.00-4.00   sec  26.8 MBytes   225 Mbits/sec    0    485 KBytes
[  5]   4.00-5.00   sec  26.4 MBytes   222 Mbits/sec    0    542 KBytes
[  5]   5.00-6.00   sec  26.6 MBytes   223 Mbits/sec    0    568 KBytes
[  5]   6.00-7.00   sec  25.4 MBytes   213 Mbits/sec    0    568 KBytes
[  5]   7.00-8.00   sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
[  5]   8.00-9.00   sec  26.7 MBytes   224 Mbits/sec    0    568 KBytes
[  5]   9.00-10.00  sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec    0             sender
[  5]   0.00-9.99   sec   261 MBytes   219 Mbits/sec                  receiver

iperf Done.

ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 49838 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  61.2 MBytes   514 Mbits/sec    0   1.59 MBytes
[  5]   1.00-2.00   sec  66.2 MBytes   555 Mbits/sec    0   1.67 MBytes
[  5]   2.00-3.00   sec  65.0 MBytes   545 Mbits/sec   79   1.24 MBytes
[  5]   3.00-4.00   sec  63.8 MBytes   535 Mbits/sec    0   1.36 MBytes
[  5]   4.00-5.00   sec  63.8 MBytes   535 Mbits/sec    0   1.46 MBytes
[  5]   5.00-6.00   sec  63.8 MBytes   535 Mbits/sec    0   1.53 MBytes
[  5]   6.00-7.00   sec  62.5 MBytes   524 Mbits/sec    0   1.59 MBytes
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec   99   1.18 MBytes
[  5]   8.00-9.00   sec  65.0 MBytes   545 Mbits/sec    0   1.25 MBytes
[  5]   9.00-10.00  sec  65.0 MBytes   545 Mbits/sec    0   1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   641 MBytes   538 Mbits/sec  178             sender
[  5]   0.00-10.02  sec   638 MBytes   535 Mbits/sec                  receiver

iperf Done.

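For reference, the quoted 44% figure can be reproduced from the sender bitrates in the host->guest summaries above (373 Mbits/sec before, 538 Mbits/sec after). The helper below is only an illustrative sketch; the name speedup_percent() is hypothetical and not part of the patch or of iperf3.

```c
/*
 * Illustrative helper (hypothetical name, not part of the patch):
 * relative throughput change in percent between two bitrate readings.
 */
static double speedup_percent(double before_mbps, double after_mbps)
{
	return (after_mbps - before_mbps) / before_mbps * 100.0;
}
```

Calling speedup_percent(373.0, 538.0) gives roughly 44, while the guest->host summaries (221 vs. 220 Mbits/sec) differ by less than one percent, consistent with the statement that this direction is unaffected.
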
diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 0000000000000000..82191dbd7e78a036
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/hardirq.h>
+
+static __must_check inline bool may_use_simd(void)
+{
+	return IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && !in_hardirq();
+}
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 8f5bc672b4aac04a..4e1a786df76df157 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -723,12 +723,12 @@ void kernel_neon_begin(void)
 	local_bh_disable();
 
 	/*
-	 * Kernel mode NEON is only allowed outside of interrupt context
-	 * with preemption disabled. This will make sure that the kernel
-	 * mode NEON register contents never need to be preserved.
+	 * Kernel mode NEON is only allowed outside of hardirq context with
+	 * preemption and softirq processing disabled. This will make sure that
+	 * the kernel mode NEON register contents never need to be preserved.
 	 */
-	BUG_ON(in_interrupt());
-	cpu = get_cpu();
+	BUG_ON(in_hardirq());
+	cpu = __smp_processor_id();
 
 	fpexc = fmrx(FPEXC) | FPEXC_EN;
 	fmxr(FPEXC, fpexc);
@@ -744,7 +744,6 @@ void kernel_neon_begin(void)
 		vfp_save_state(vfp_current_hw_state[cpu], fpexc);
 #endif
 	vfp_current_hw_state[cpu] = NULL;
-	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_begin);
 
@@ -752,7 +751,7 @@ void kernel_neon_end(void)
 {
 	/* Disable the NEON/VFP unit. */
 	fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
-	put_cpu();
+	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_end);

We currently only permit kernel mode NEON in process context, to avoid
the need to preserve/restore the NEON register file when taking an
exception while running in the kernel.

Like we did on arm64, we can relax this restriction substantially, by
permitting kernel mode NEON from softirq context, while ensuring that
softirq processing is disabled when the NEON is being used in task
context. This guarantees that only NEON context belonging to user space
needs to be preserved and restored, which is already taken care of.

This is especially relevant for network encryption, where incoming
frames are typically handled in softirq context, and deferring software
decryption to a kernel thread or falling back to C code are both
undesirable from a performance PoV.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm/include/asm/simd.h |  8 ++++++++
 arch/arm/vfp/vfpmodule.c    | 13 ++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)
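
The scheme described in this patch can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: the model_* variables and the xor_buffer() caller are hypothetical, BUG_ON() is mapped to assert(), CONFIG_KERNEL_MODE_NEON is assumed to be enabled, and the functions named after kernel APIs are simplified stand-ins that only track state. It shows the two properties the patch relies on: a caller checks may_use_simd() and falls back to scalar code in hardirq context, and softirq processing stays disabled for the whole span between kernel_neon_begin() and kernel_neon_end().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins (assumptions for illustration, not kernel code). */
static bool model_in_hardirq;	/* pretend we run in a hardirq handler */
static int model_bh_depth;	/* nesting depth of softirq disabling */
static bool model_neon_on;	/* models the FPEXC enable bit */

#define BUG_ON(cond) assert(!(cond))

static bool in_hardirq(void) { return model_in_hardirq; }
static void local_bh_disable(void) { model_bh_depth++; }
static void local_bh_enable(void)
{
	BUG_ON(model_bh_depth <= 0);
	model_bh_depth--;
}

/* Mirrors the new check: allowed anywhere except hardirq context. */
static bool may_use_simd(void) { return !in_hardirq(); }

static void kernel_neon_begin(void)
{
	local_bh_disable();	/* stays disabled until kernel_neon_end() */
	BUG_ON(in_hardirq());
	model_neon_on = true;
}

static void kernel_neon_end(void)
{
	model_neon_on = false;
	local_bh_enable();	/* replaces put_cpu() from before the patch */
}

/* Hypothetical caller; returns true if the "NEON" path was taken. */
static bool xor_buffer(unsigned char *buf, size_t len, unsigned char key)
{
	size_t i;

	if (!may_use_simd()) {
		for (i = 0; i < len; i++)	/* scalar fallback */
			buf[i] ^= key;
		return false;
	}

	kernel_neon_begin();
	BUG_ON(!model_neon_on || model_bh_depth == 0);
	for (i = 0; i < len; i++)	/* stand-in for NEON instructions */
		buf[i] ^= key;
	kernel_neon_end();
	return true;
}
```

In this model, a call made with model_in_hardirq left false takes the accelerated path and leaves model_bh_depth balanced at zero afterwards; setting model_in_hardirq to true makes the same call take the scalar fallback instead.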