
[v2,3/6] crypto: x86/chacha20 - limit the preemption-disabled section

Message ID 20181129230217.158038-4-ebiggers@kernel.org (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Series crypto: x86_64 optimized XChaCha and NHPoly1305 (for Adiantum)

Commit Message

Eric Biggers Nov. 29, 2018, 11:02 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

To improve responsiveness, disable preemption for each step of the walk
(which is at most PAGE_SIZE) rather than for the entire
encryption/decryption operation.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/chacha20_glue.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Martin Willi Dec. 2, 2018, 10:47 a.m. UTC | #1
> To improve responsiveness, disable preemption for each step of the
> walk (which is at most PAGE_SIZE) rather than for the entire
> encryption/decryption operation.

It seems that it is not that uncommon for IPsec to get small inputs
scattered over multiple blocks. Doing FPU context saving for each walk
step then can slow down things.

An alternative approach could be to re-enable preemption not based on
the walk steps, but on the amount of bytes processed. This would
satisfy both users, I guess.

In the long run we probably need a better approach for FPU context
saving, as this really hurts performance-wise. For IPsec we should find
a way to avoid the (multiple) per-packet FPU save/restores in softirq
context, but I guess this requires support from process context
switching.

Best regards
Martin

Ard Biesheuvel Dec. 3, 2018, 2:13 p.m. UTC | #2
On Sun, 2 Dec 2018 at 11:47, Martin Willi <martin@strongswan.org> wrote:
>
>
> > To improve responsiveness, disable preemption for each step of the
> > walk (which is at most PAGE_SIZE) rather than for the entire
> > encryption/decryption operation.
>
> It seems that it is not that uncommon for IPsec to get small inputs
> scattered over multiple blocks. Doing FPU context saving for each walk
> step then can slow down things.
>
> An alternative approach could be to re-enable preemption not based on
> the walk steps, but on the amount of bytes processed. This would
> satisfy both users, I guess.
>
> In the long run we probably need a better approach for FPU context
> saving, as this really hurts performance-wise. For IPsec we should find
> a way to avoid the (multiple) per-packet FPU save/restores in softirq
> context, but I guess this requires support from process context
> switching.
>

At Jason's Zinc talk at plumbers, this came up, and apparently someone
is working on this, i.e., to ensure that on x86, the FPU restore only
occurs lazily, when returning to userland rather than every time you
call kernel_fpu_end() [like we do on arm64 as well]

Not sure what the ETA for that work is, though, nor did I get the name
of the guy working on it.

Eric Biggers Dec. 5, 2018, 6:15 a.m. UTC | #3
On Mon, Dec 03, 2018 at 03:13:37PM +0100, Ard Biesheuvel wrote:
> On Sun, 2 Dec 2018 at 11:47, Martin Willi <martin@strongswan.org> wrote:
> >
> >
> > > To improve responsiveness, disable preemption for each step of the
> > > walk (which is at most PAGE_SIZE) rather than for the entire
> > > encryption/decryption operation.
> >
> > It seems that it is not that uncommon for IPsec to get small inputs
> > scattered over multiple blocks. Doing FPU context saving for each walk
> > step then can slow down things.
> >
> > An alternative approach could be to re-enable preemption not based on
> > the walk steps, but on the amount of bytes processed. This would
> > satisfy both users, I guess.
> >
> > In the long run we probably need a better approach for FPU context
> > saving, as this really hurts performance-wise. For IPsec we should find
> > a way to avoid the (multiple) per-packet FPU save/restores in softirq
> > context, but I guess this requires support from process context
> > switching.
> >
> 
> At Jason's Zinc talk at plumbers, this came up, and apparently someone
> is working on this, i.e., to ensure that on x86, the FPU restore only
> occurs lazily, when returning to userland rather than every time you
> call kernel_fpu_end() [like we do on arm64 as well]
> 
> Not sure what the ETA for that work is, though, nor did I get the name
> of the guy working on it.

Thanks for the suggestion; I'll replace this with a patch that re-enables
preemption every 4 KiB encrypted.  That also avoids having to do a
kernel_fpu_begin(), kernel_fpu_end() pair just for hchacha_block_ssse3().  But
yes, I'd definitely like repeated kernel_fpu_begin(), kernel_fpu_end() to not be
incredibly slow.  That would help in a lot of other places too.

- Eric
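
The trade-off discussed in this thread (one FPU section per walk step
versus re-enabling preemption after a fixed number of bytes) can be
illustrated with a small userspace simulation. This is only a sketch
under stated assumptions: sim_fpu_begin(), sim_fpu_end(),
sections_per_step(), sections_per_budget() and the byte budget parameter
are hypothetical stand-ins that merely count preemption-disabled
sections. They are not kernel APIs, and this is not the follow-up patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for kernel_fpu_begin()/kernel_fpu_end(). */
static unsigned int fpu_sections;

static void sim_fpu_begin(void) { fpu_sections++; }
static void sim_fpu_end(void) { }

/*
 * Per-step strategy (as in this patch): open and close one
 * preemption-disabled section for every non-empty walk step.
 */
static unsigned int sections_per_step(const size_t *steps, size_t nsteps)
{
	fpu_sections = 0;
	for (size_t i = 0; i < nsteps; i++) {
		if (steps[i] == 0)
			continue;
		sim_fpu_begin();
		/* chacha20_dosimd() would process steps[i] bytes here */
		sim_fpu_end();
	}
	return fpu_sections;
}

/*
 * Byte-budget strategy (as suggested in review): keep the section open
 * across walk steps and close it only once at least `budget` bytes have
 * been processed since it was opened.
 */
static unsigned int sections_per_budget(const size_t *steps, size_t nsteps,
					size_t budget)
{
	size_t used = 0;
	int in_section = 0;

	assert(budget > 0);
	fpu_sections = 0;
	for (size_t i = 0; i < nsteps; i++) {
		if (steps[i] == 0)
			continue;
		if (!in_section) {
			sim_fpu_begin();
			in_section = 1;
		}
		/* chacha20_dosimd() would process steps[i] bytes here */
		used += steps[i];
		if (used >= budget) {
			/* budget consumed: allow preemption again */
			sim_fpu_end();
			in_section = 0;
			used = 0;
		}
	}
	if (in_section)
		sim_fpu_end();
	return fpu_sections;
}
```

In this model, a request split into eight 64-byte steps needs eight
sections with the per-step strategy but only one with a 4096-byte
budget, while a request made of page-sized steps needs the same number
of sections either way.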

Patch

diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 773d075a1483..036de144aab6 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -135,26 +135,24 @@  static int chacha20_simd(struct skcipher_request *req)
 	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
 		return crypto_chacha_crypt(req);
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
 	crypto_chacha_init(state, ctx, walk.iv);
 
-	kernel_fpu_begin();
-
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
 
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, walk.stride);
 
+		kernel_fpu_begin();
 		chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
 				nbytes);
+		kernel_fpu_end();
 
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
-	kernel_fpu_end();
-
 	return err;
 }