Message ID | 20240422203544.195390-2-ebiggers@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimize dm-verity and fsverity using multibuffer hashing | expand |
Eric Biggers <ebiggers@kernel.org> wrote: > > For now the API only supports 2-way interleaving, as the usefulness and > practicality seems to drop off dramatically after 2. The arm64 CPUs I > tested don't support more than 2 concurrent SHA-256 hashes. On x86_64, > AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a > microbenchmark of the sha256rnds2 instruction), and it's been reported > that the highest SHA-256 throughput on Intel processors comes from using > AVX512 to compute 16 hashes in parallel. However, higher interleaving > factors would involve tradeoffs such as no longer being able to cache > the round constants in registers, further increasing the code size (both > source and binary), further increasing the amount of state that users > need to keep track of, and causing there to be more "leftover" hashes. I think the lack of extensibility is the biggest problem with this API. Now I confess I too have used the magic number 2 in the lskcipher patch-set, but there I think at least it was more justifiable based on the set of algorithms we currently support. Here I think the evidence for limiting this to 2 is weak. And the amount of work to extend this beyond 2 would mean ripping this API out again. So let's get this right from the start. Rather than shoehorning this into shash, how about we add this to ahash instead where an async return is a natural part of the API? In fact, if we do it there we don't need to make any major changes to the API. You could simply add an optional flag that to the request flags to indicate that more requests will be forthcoming immediately. The algorithm could then either delay the current request if it is supported, or process it immediately as is the case now. Cheers,
On Fri, May 03, 2024 at 06:18:32PM +0800, Herbert Xu wrote: > Eric Biggers <ebiggers@kernel.org> wrote: > > > > For now the API only supports 2-way interleaving, as the usefulness and > > practicality seems to drop off dramatically after 2. The arm64 CPUs I > > tested don't support more than 2 concurrent SHA-256 hashes. On x86_64, > > AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a > > microbenchmark of the sha256rnds2 instruction), and it's been reported > > that the highest SHA-256 throughput on Intel processors comes from using > > AVX512 to compute 16 hashes in parallel. However, higher interleaving > > factors would involve tradeoffs such as no longer being able to cache > > the round constants in registers, further increasing the code size (both > > source and binary), further increasing the amount of state that users > > need to keep track of, and causing there to be more "leftover" hashes. > > I think the lack of extensibility is the biggest problem with this > API. Now I confess I too have used the magic number 2 in the > lskcipher patch-set, but there I think at least it was more > justifiable based on the set of algorithms we currently support. > > Here I think the evidence for limiting this to 2 is weak. And the > amount of work to extend this beyond 2 would mean ripping this API > out again. > > So let's get this right from the start. Rather than shoehorning > this into shash, how about we add this to ahash instead where an > async return is a natural part of the API? > > In fact, if we do it there we don't need to make any major changes > to the API. You could simply add an optional flag that to the > request flags to indicate that more requests will be forthcoming > immediately. > > The algorithm could then either delay the current request if it > is supported, or process it immediately as is the case now. > The kernel already had ahash-based multibuffer hashing years ago. It failed spectacularly, as it was extremely complex, buggy, slow, and potentially insecure as it mixed requests from different contexts. Sure, it could have been improved slightly by adding flush support, but most issues would have remained. Synchronous hashing really is the right model here. One of the main performance issues we are having with dm-verity and fs-verity is the scheduling hops associated with the workqueues on which the dm-verity and fs-verity work runs. If there was another scheduling hop from the worker task to another task to do the actual hashing, that would be even worse and would defeat the point of doing multibuffer hashing. And with the ahash based API this would be difficult to avoid, as when an individual request gets submitted and put on a queue somewhere it would lose the information about the original submitter, so when it finally gets hashed it might be by another task (which the original task would then have to wait for). I guess the submitter could provide some sort of tag that makes the request be put on a dedicated queue that would eventually get processed only by the same task (which might also be needed for security reasons anyway, due to all the CPU side channels), but that would add lots of complexity to add tag support to the API and support an arbitrary number of queues. And then there's the issue of request lengths. With one at a time submission via 'ahash_request', each request would have its own length. Having to support multibuffer hashing of different length requests would add a massive amount of complexity and edge cases that are difficult to get correct, as was shown by the old ahash based code. This suggests that either the API needs to enforce that all the lengths are the same, or it needs to provide a clean API (my patch) where the caller just provides a single length that applies to all messages. So the synchronous API really seems like the right approach, whereas shoehorning it into the asynchronous hash API would result in something much more complex and not actually useful for the intended use cases. If you're concerned about the hardcoding to 2x specifically, how about the following API instead: int crypto_shash_finup_mb(struct shash_desc *desc, const u8 *datas[], unsigned int len, u8 *outs[], int num_msgs) This would allow extension to higher interleaving factors. I do suspect that anything higher than 2x isn't going to be very practical for in-kernel use cases, where code size, latency, and per-request memory usage tend to be very important. Regardless, this would make the API able to support higher interleaving factors. - Eric
On Fri, May 03, 2024 at 08:28:10AM -0700, Eric Biggers wrote: > On Fri, May 03, 2024 at 06:18:32PM +0800, Herbert Xu wrote: > > Eric Biggers <ebiggers@kernel.org> wrote: > > > > > > For now the API only supports 2-way interleaving, as the usefulness and > > > practicality seems to drop off dramatically after 2. The arm64 CPUs I > > > tested don't support more than 2 concurrent SHA-256 hashes. On x86_64, > > > AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a > > > microbenchmark of the sha256rnds2 instruction), and it's been reported > > > that the highest SHA-256 throughput on Intel processors comes from using > > > AVX512 to compute 16 hashes in parallel. However, higher interleaving > > > factors would involve tradeoffs such as no longer being able to cache > > > the round constants in registers, further increasing the code size (both > > > source and binary), further increasing the amount of state that users > > > need to keep track of, and causing there to be more "leftover" hashes. > > > > I think the lack of extensibility is the biggest problem with this > > API. Now I confess I too have used the magic number 2 in the > > lskcipher patch-set, but there I think at least it was more > > justifiable based on the set of algorithms we currently support. > > > > Here I think the evidence for limiting this to 2 is weak. And the > > amount of work to extend this beyond 2 would mean ripping this API > > out again. > > > > So let's get this right from the start. Rather than shoehorning > > this into shash, how about we add this to ahash instead where an > > async return is a natural part of the API? > > > > In fact, if we do it there we don't need to make any major changes > > to the API. You could simply add an optional flag that to the > > request flags to indicate that more requests will be forthcoming > > immediately. > > > > The algorithm could then either delay the current request if it > > is supported, or process it immediately as is the case now. > > > > The kernel already had ahash-based multibuffer hashing years ago. It failed > spectacularly, as it was extremely complex, buggy, slow, and potentially > insecure as it mixed requests from different contexts. Sure, it could have been > improved slightly by adding flush support, but most issues would have remained. > > Synchronous hashing really is the right model here. One of the main performance > issues we are having with dm-verity and fs-verity is the scheduling hops > associated with the workqueues on which the dm-verity and fs-verity work runs. > If there was another scheduling hop from the worker task to another task to do > the actual hashing, that would be even worse and would defeat the point of doing > multibuffer hashing. And with the ahash based API this would be difficult to > avoid, as when an individual request gets submitted and put on a queue somewhere > it would lose the information about the original submitter, so when it finally > gets hashed it might be by another task (which the original task would then have > to wait for). I guess the submitter could provide some sort of tag that makes > the request be put on a dedicated queue that would eventually get processed only > by the same task (which might also be needed for security reasons anyway, due to > all the CPU side channels), but that would add lots of complexity to add tag > support to the API and support an arbitrary number of queues. > > And then there's the issue of request lengths. With one at a time submission > via 'ahash_request', each request would have its own length. Having to support > multibuffer hashing of different length requests would add a massive amount of > complexity and edge cases that are difficult to get correct, as was shown by the > old ahash based code. This suggests that either the API needs to enforce that > all the lengths are the same, or it needs to provide a clean API (my patch) > where the caller just provides a single length that applies to all messages. > > So the synchronous API really seems like the right approach, whereas shoehorning > it into the asynchronous hash API would result in something much more complex > and not actually useful for the intended use cases. > > If you're concerned about the hardcoding to 2x specifically, how about the > following API instead: > > int crypto_shash_finup_mb(struct shash_desc *desc, > const u8 *datas[], unsigned int len, > u8 *outs[], int num_msgs) > > This would allow extension to higher interleaving factors. > > I do suspect that anything higher than 2x isn't going to be very practical for > in-kernel use cases, where code size, latency, and per-request memory usage tend > to be very important. Regardless, this would make the API able to support > higher interleaving factors. I've sent out a new version that makes the change to crypto_shash_finup_mb(). - Eric
diff --git a/include/crypto/hash.h b/include/crypto/hash.h index 0014bdd81ab7..66d93c940861 100644 --- a/include/crypto/hash.h +++ b/include/crypto/hash.h @@ -177,10 +177,13 @@ struct shash_desc { * @finup: see struct ahash_alg * @digest: see struct ahash_alg * @export: see struct ahash_alg * @import: see struct ahash_alg * @setkey: see struct ahash_alg + * @finup2x: **[optional]** Finish calculating the digests of two equal-length + * messages, interleaving the instructions to potentially achieve + * better performance than hashing each message individually. * @init_tfm: Initialize the cryptographic transformation object. * This function is called only once at the instantiation * time, right after the transformation context was * allocated. In case the cryptographic hardware has * some special requirements which need to be handled @@ -208,10 +211,12 @@ struct shash_alg { unsigned int len, u8 *out); int (*export)(struct shash_desc *desc, void *out); int (*import)(struct shash_desc *desc, const void *in); int (*setkey)(struct crypto_shash *tfm, const u8 *key, unsigned int keylen); + int (*finup2x)(struct shash_desc *desc, const u8 *data1, + const u8 *data2, unsigned int len, u8 *out1, u8 *out2); int (*init_tfm)(struct crypto_shash *tfm); void (*exit_tfm)(struct crypto_shash *tfm); int (*clone_tfm)(struct crypto_shash *dst, struct crypto_shash *src); unsigned int descsize; @@ -749,10 +754,15 @@ static inline unsigned int crypto_shash_digestsize(struct crypto_shash *tfm) static inline unsigned int crypto_shash_statesize(struct crypto_shash *tfm) { return crypto_shash_alg(tfm)->statesize; } +static inline bool crypto_shash_supports_finup2x(struct crypto_shash *tfm) +{ + return crypto_shash_alg(tfm)->finup2x != NULL; +} + static inline u32 crypto_shash_get_flags(struct crypto_shash *tfm) { return crypto_tfm_get_flags(crypto_shash_tfm(tfm)); } @@ -842,10 +852,34 @@ int crypto_shash_digest(struct shash_desc *desc, const u8 *data, * Return: 0 on success; < 0 if an error occurred. */ int crypto_shash_tfm_digest(struct crypto_shash *tfm, const u8 *data, unsigned int len, u8 *out); +/** + * crypto_shash_finup2x() - finish hashing two equal-length messages + * @desc: the hash state that will be forked for the two messages. This + * contains the state after hashing a (possibly-empty) common prefix of + * the two messages. + * @data1: the first message (not including any common prefix from @desc) + * @data2: the second message (not including any common prefix from @desc) + * @len: length of @data1 and @data2 in bytes + * @out1: output buffer for first message digest + * @out2: output buffer for second message digest + * + * Users must check crypto_shash_supports_finup2x(tfm) before calling this. + * + * Context: Any context. + * Return: 0 on success; a negative errno value on failure. + */ +static inline int crypto_shash_finup2x(struct shash_desc *desc, + const u8 *data1, const u8 *data2, + unsigned int len, u8 *out1, u8 *out2) +{ + return crypto_shash_alg(desc->tfm)->finup2x(desc, data1, data2, len, + out1, out2); +} + /** * crypto_shash_export() - extract operational state for message digest * @desc: reference to the operational state handle whose state is exported * @out: output buffer of sufficient size that can hold the hash state *