Message ID | 20201215234708.105527-1-ebiggers@kernel.org (mailing list archive) |
---|---|
Series | crypto: add NEON-optimized BLAKE2b |
On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
>
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
>
>     Algorithm            Cycles per byte (on 4096-byte messages)
>     ===================  =======================================
>     blake2b-256-neon     14.1
>     sha1-neon            16.4
>     sha1-asm             20.8
>     blake2s-256-generic  26.1
>     sha256-neon          28.9
>     sha256-asm           32.1
>     blake2b-256-generic  39.9
>
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
>
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
>
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

    Algorithm            Cycles per byte (on 4096-byte messages)
    ===================  =======================================
    blake2b-256-neon     14.1
    sha1-neon            16.4
    blake2s-256-arm      19.0
    sha1-asm             20.8
    blake2s-256-generic  26.1
    sha256-neon          28.9
    sha256-asm           32.1
    blake2b-256-generic  39.9
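
The relative speeds quoted in the cover letter ("over twice as fast as SHA-256", "slightly faster than SHA-1", "almost three times as fast as the generic implementation") can be checked against the table. The following is only a minimal sketch of that arithmetic: the cycles-per-byte figures are copied from the table, while the names `cpb` and `speedup` are hypothetical helpers introduced here for illustration.

```python
# Cycles per byte on 4096-byte messages (Cortex-A7), copied from the table.
# The dictionary and function names are illustrative, not part of the patchset.
cpb = {
    "blake2b-256-neon": 14.1,
    "sha1-neon": 16.4,
    "blake2s-256-arm": 19.0,
    "sha1-asm": 20.8,
    "blake2s-256-generic": 26.1,
    "sha256-neon": 28.9,
    "sha256-asm": 32.1,
    "blake2b-256-generic": 39.9,
}


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`.

    Fewer cycles per byte means faster, so the ratio is slow / fast.
    """
    return cpb[slow] / cpb[fast]


if __name__ == "__main__":
    for other in ("sha256-neon", "sha1-neon", "blake2b-256-generic"):
        ratio = speedup(other, "blake2b-256-neon")
        print(f"blake2b-256-neon vs {other}: {ratio:.2f}x")
```

With these figures the ratios come out to roughly 2.05x against sha256-neon, 1.16x against sha1-neon, and 2.83x against blake2b-256-generic, which matches the wording of the cover letter.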
Hi Eric,

On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> By the way, if people are interested in having my ARM scalar implementation of
> BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

I'd certainly be interested in this. Any rough idea how it performs
for pretty small messages compared to the generic implementation?
100-140 byte ranges? Is the speedup about the same as for longer
messages because this doesn't parallelize across multiple blocks?

Jason
On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> Hi Eric,
>
> On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > By the way, if people are interested in having my ARM scalar implementation of
> > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:
>
> I'd certainly be interested in this. Any rough idea how it performs
> for pretty small messages compared to the generic implementation?
> 100-140 byte ranges? Is the speedup about the same as for longer
> messages because this doesn't parallelize across multiple blocks?
>

It does one block at a time, and there isn't much overhead, so yes the speedup
on short messages should be about the same as on long messages.

I did a couple quick userspace benchmarks and got (still on Cortex-A7):

    100-byte messages:
        BLAKE2s ARM:     28.9 cpb
        BLAKE2s generic: 42.4 cpb

    140-byte messages:
        BLAKE2s ARM:     29.5 cpb
        BLAKE2s generic: 44.0 cpb

The results in the kernel may differ a bit, but probably not by much.

- Eric
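
The claim that the short-message speedup is "about the same" as on long messages can be sanity-checked from the numbers in this thread. This is a rough sketch only: the short-message figures are the userspace results above, the 4096-byte row is taken from the earlier table, and the thread does not say the two were measured under identical conditions. The names `blake2s_cpb` and `arm_speedup` are hypothetical.

```python
# Message length in bytes -> (ARM scalar cpb, generic cpb) on Cortex-A7.
# 100 and 140 bytes: userspace figures from this reply.
# 4096 bytes: blake2s-256-arm and blake2s-256-generic rows of the earlier table.
blake2s_cpb = {
    100: (28.9, 42.4),
    140: (29.5, 44.0),
    4096: (19.0, 26.1),
}


def arm_speedup(length):
    """Ratio of generic to ARM scalar cycles per byte at a given message length."""
    arm, generic = blake2s_cpb[length]
    return generic / arm


if __name__ == "__main__":
    for length in sorted(blake2s_cpb):
        print(f"{length:>4}-byte messages: {arm_speedup(length):.2f}x")
```

On these figures the ARM scalar code is about 1.47x and 1.49x faster than the generic code at 100 and 140 bytes, against about 1.37x at 4096 bytes, so the speedup on short messages is at least as large as on long ones.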
On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > By the way, if people are interested in having my ARM scalar implementation of
> > > BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
> > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> > > mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:
> >
> > I'd certainly be interested in this. Any rough idea how it performs
> > for pretty small messages compared to the generic implementation?
> > 100-140 byte ranges? Is the speedup about the same as for longer
> > messages because this doesn't parallelize across multiple blocks?
> >
>
> It does one block at a time, and there isn't much overhead, so yes the speedup
> on short messages should be about the same as on long messages.
>
> I did a couple quick userspace benchmarks and got (still on Cortex-A7):
>
>     100-byte messages:
>         BLAKE2s ARM:     28.9 cpb
>         BLAKE2s generic: 42.4 cpb
>
>     140-byte messages:
>         BLAKE2s ARM:     29.5 cpb
>         BLAKE2s generic: 44.0 cpb
>
> The results in the kernel may differ a bit, but probably not by much.

That's certainly a nice improvement though, and I'd very much welcome
the faster implementation.

Jason