Message ID | 20220111134934.324663-2-Jason@zx2c4.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | smaller blake2s code size on m68k and other small platforms | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
Hi Jason, On Tue, Jan 11, 2022 at 2:49 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > Re-wind the loops entirely on kernels optimized for code size. This is > really not good at all performance-wise. But on m68k, it shaves off 4k > of code size, which is apparently important. On arm32: add/remove: 1/0 grow/shrink: 0/1 up/down: 160/-4212 (-4052) Function old new delta blake2s_sigma - 160 +160 blake2s_compress_generic 4872 660 -4212 Total: Before=9846148, After=9842096, chg -0.04% On arm64: add/remove: 1/2 grow/shrink: 0/1 up/down: 160/-4584 (-4424) Function old new delta blake2s_sigma - 160 +160 e843419@0710_00007634_e8a0 8 - -8 e843419@0441_0000423a_178c 8 - -8 blake2s_compress_generic 5088 520 -4568 Total: Before=32800278, After=32795854, chg -0.01% > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> For the size reduction: Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hi Geert, Thanks for testing this. However, I've *abandoned* this patch, due to unacceptable performance hits, and figuring out that we can accomplish basically the same thing without as large of a hit by modifying the obsolete sha1 implementation. Herbert - please do not apply this patch. Instead, later versions of this patchset (e.g. v3 [1] and potentially later if it comes to that) are what should be applied. Jason [1] https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/
On Tue, Jan 11, 2022 at 02:49:33PM +0100, Jason A. Donenfeld wrote: > Re-wind the loops entirely on kernels optimized for code size. This is > really not good at all performance-wise. But on m68k, it shaves off 4k > of code size, which is apparently important. > > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > Cc: Herbert Xu <herbert@gondor.apana.org.au> > Cc: Ard Biesheuvel <ardb@kernel.org> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > --- > lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------ > 1 file changed, 18 insertions(+), 12 deletions(-) > > diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c > index 75ccb3e633e6..990f000e22ee 100644 > --- a/lib/crypto/blake2s-generic.c > +++ b/lib/crypto/blake2s-generic.c > @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block, > { > u32 m[16]; > u32 v[16]; > - int i; > + int i, j; > > WARN_ON(IS_ENABLED(DEBUG) && > (nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE)); > @@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block, > G(r, 6, v[2], v[ 7], v[ 8], v[13]); \ > G(r, 7, v[3], v[ 4], v[ 9], v[14]); \ > } while (0) > - ROUND(0); > - ROUND(1); > - ROUND(2); > - ROUND(3); > - ROUND(4); > - ROUND(5); > - ROUND(6); > - ROUND(7); > - ROUND(8); > - ROUND(9); > - > + if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) { > + for (i = 0; i < 10; ++i) { > + for (j = 0; j < 8; ++j) > + G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]); > + } How about unrolling the inner loop but not the outer one? Wouldn't that give most of the benefit, without hurting performance as much? If you stay with this approach and don't unroll either loop, can you use 'r' and 'i' instead of 'i' and 'j', to match the naming in G()? Also, please wrap lines at 80 columns. - Eric
On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote: > How about unrolling the inner loop but not the outer one? Wouldn't that give > most of the benefit, without hurting performance as much? > > If you stay with this approach and don't unroll either loop, can you use 'r' and > 'i' instead of 'i' and 'j', to match the naming in G()? All this might work, sure. But as mentioned earlier, I've abandoned this entirely, as I don't think this patch is necessary. See the v3 patchset instead: https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/
From: Jason A. Donenfeld > Sent: 12 January 2022 18:51 > > On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@kernel.org> wrote: > > How about unrolling the inner loop but not the outer one? Wouldn't that give > > most of the benefit, without hurting performance as much? > > > > If you stay with this approach and don't unroll either loop, can you use 'r' and > > 'i' instead of 'i' and 'j', to match the naming in G()? > > All this might work, sure. But as mentioned earlier, I've abandoned > this entirely, as I don't think this patch is necessary. See the v3 > patchset instead: > > https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/ I think you mentioned in another thread that the buffers (eg for IPv6 addresses) are actually often quite short. For short buffers the 'rolled-up' loop may be of similar performance to the unrolled one because of the time taken to read all the instructions into the I-cache and decode them. If the loop ends up small enough it will fit into the 'decoded loop buffer' of modern Intel x86 cpu and won't even need decoding on each iteration. I really suspect that the heavily unrolled loop is only really fast for big buffers and/or when it is already in the I-cache. In real life I wonder how often that actually happens? Especially for the uses the kernel is making of the code. You need to benchmark single executions of the function (doable on x86 with the performance monitor cycle counter) to get typical/best clocks/byte figures rather than a big average for repeated operation on a long buffer. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Hi David, On 1/12/22, David Laight <David.Laight@aculab.com> wrote: > I think you mentioned in another thread that the buffers (eg for IPv6 > addresses) are actually often quite short. > > For short buffers the 'rolled-up' loop may be of similar performance > to the unrolled one because of the time taken to read all the instructions > into the I-cache and decode them. > If the loop ends up small enough it will fit into the 'decoded loop > buffer' of modern Intel x86 cpu and won't even need decoding on > each iteration. > > I really suspect that the heavily unrolled loop is only really fast > for big buffers and/or when it is already in the I-cache. > In real life I wonder how often that actually happens? > Especially for the uses the kernel is making of the code. > > You need to benchmark single executions of the function > (doable on x86 with the performance monitor cycle counter) > to get typical/best clocks/byte figures rather than a > big average for repeated operation on a long buffer. > > David This patch has been dropped entirely from future revisions. The latest as of writing is at: https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/ If you'd like to do something with blake2s, by all means submit a patch and include various rationale and metrics and benchmarks. I do not intend to do that myself and do not think my particular patch here should be merged. But if you'd like to do something, feel free to CC me for a review. However, as mentioned, I don't think much needs to be done here. Again, v3 is here: https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/ Thanks, Jason
diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c index 75ccb3e633e6..990f000e22ee 100644 --- a/lib/crypto/blake2s-generic.c +++ b/lib/crypto/blake2s-generic.c @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block, { u32 m[16]; u32 v[16]; - int i; + int i, j; WARN_ON(IS_ENABLED(DEBUG) && (nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE)); @@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block, G(r, 6, v[2], v[ 7], v[ 8], v[13]); \ G(r, 7, v[3], v[ 4], v[ 9], v[14]); \ } while (0) - ROUND(0); - ROUND(1); - ROUND(2); - ROUND(3); - ROUND(4); - ROUND(5); - ROUND(6); - ROUND(7); - ROUND(8); - ROUND(9); - + if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) { + for (i = 0; i < 10; ++i) { + for (j = 0; j < 8; ++j) + G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]); + } + } else { + ROUND(0); + ROUND(1); + ROUND(2); + ROUND(3); + ROUND(4); + ROUND(5); + ROUND(6); + ROUND(7); + ROUND(8); + ROUND(9); + } #undef G #undef ROUND
Re-wind the loops entirely on kernels optimized for code size. This is really not good at all performance-wise. But on m68k, it shaves off 4k of code size, which is apparently important. Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-)