[v8,0/7] Optimize dm-verity and fsverity using multibuffer hashing

Message ID 20250212154718.44255-1-ebiggers@kernel.org

Message

Eric Biggers Feb. 12, 2025, 3:47 p.m. UTC
[ This patchset keeps getting rejected by Herbert, who prefers a
  complex, buggy, and slow alternative that shoehorns CPU-based hashing
  into the asynchronous hash API which is designed for off-CPU offload:
  https://lore.kernel.org/linux-crypto/cover.1730021644.git.herbert@gondor.apana.org.au/
  This patchset is a much better way to do it though, and I've already
  been maintaining it downstream as it would not be reasonable to go the
  asynchronous hash route instead.  Let me know if there are any
  objections to me taking this patchset through the fsverity tree, or at
  least patches 1-5 as the dm-verity patches could go in separately. ]

[ This patchset applies to v6.14-rc2 ]

On many modern CPUs, it is possible to compute the SHA-256 hashes of two
equal-length messages in about the same time as a single message, if the
instructions for the two computations are interleaved.  This is because
SHA-256 (like most other cryptographic hash functions) is inherently
serial, so a single computation can't always take advantage of the CPU's
full throughput.
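
As a rough illustration of the interleaving (not the actual kernel code;
sha256_round() here is a hypothetical helper that applies one round of
the compression function to one state, with the message schedule and
final feed-forward omitted for brevity):

	static void sha256_compress_2x(u32 state_a[8], const u8 block_a[64],
				       u32 state_b[8], const u8 block_b[64])
	{
		int i;

		for (i = 0; i < 64; i++) {
			/*
			 * Alternating the rounds of two independent
			 * messages lets the CPU overlap the latencies of
			 * the two dependency chains.
			 */
			sha256_round(state_a, block_a, i); /* hypothetical helper */
			sha256_round(state_b, block_b, i); /* hypothetical helper */
		}
	}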

An earlier attempt to support multibuffer hashing in Linux was based
around the ahash API.  That approach had some major issues, as does the
alternative ahash-based approach proposed by Herbert (e.g. see my
responses at
https://lore.kernel.org/linux-crypto/20240610164258.GA3269@sol.localdomain/
and
https://lore.kernel.org/linux-crypto/20241028190045.GA1408@sol.localdomain/,
and the other discussion on v4
https://lore.kernel.org/linux-crypto/20240603183731.108986-1-ebiggers@kernel.org/T/#t).
This patchset instead takes a much simpler approach of just adding a
synchronous API for hashing equal-length messages.

This works well for dm-verity and fsverity, which use Merkle trees and
therefore hash large numbers of equal-length messages.
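
For illustration, here is a minimal usage sketch for hashing two
equal-length Merkle tree blocks in one call (assuming a signature along
the lines of the one added in patch 1; see that patch for the
authoritative interface):

	/*
	 * Sketch only; assumes crypto_shash_finup_mb(desc, data, len,
	 * outs, num_msgs) as introduced in patch 1.  desc holds the
	 * common starting state, e.g. after crypto_shash_init() and
	 * hashing any per-file salt.
	 */
	const u8 *data[2] = { block0, block1 };
	u8 *outs[2] = { digest0, digest1 };
	int err;

	err = crypto_shash_finup_mb(desc, data, block_size, outs, 2);
	if (err)
		return err;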

This patchset is organized as follows:

- Patches 1-2 add crypto_shash_finup_mb() and tests for it.
- Patches 3-4 implement finup_mb on x86_64 and arm64, using an
  interleaving factor of 2.
- Patch 5 adds multibuffer hashing support to fsverity.
- Patches 6-7 add multibuffer hashing support to dm-verity.

This patchset increases raw SHA-256 hashing throughput by up to 98%,
depending on the CPU (see patches for per-CPU results).  The throughput
of cold-cache reads from dm-verity and fsverity increases by around 35%.

Changed in v8:
  - Rebased onto v6.14-rc2 and updated cover letter.

Changed in v7:
  - Rebased onto v6.12-rc1 and dropped patches that were upstreamed.
  - Added performance results for more CPUs.

Changed in v6:
  - All patches: added Reviewed-by and Acked-by tags
  - "crypto: testmgr - add tests for finup_mb": Whitespace fix
  - "crypto: testmgr - generate power-of-2 lengths more often":
    Fixed undefined behavior
  - "fsverity: improve performance by using multibuffer hashing":
    Simplified a comment
  - "dm-verity: reduce scope of real and wanted digests":
    Fixed mention of nonexistent function in commit message
  - "dm-verity: improve performance by using multibuffer hashing":
    Two small optimizations, and simplified a comment

Changed in v5:
  - Reworked the dm-verity patches again.  Split the preparation work
    into separate patches, fixed two bugs, and added some new cleanups.
  - Other small cleanups

Changed in v4:
  - Reorganized the fsverity and dm-verity code to have a unified code
    path for single-block vs. multi-block processing.  For data blocks
    they now use only crypto_shash_finup_mb().

Changed in v3:
  - Change API from finup2x to finup_mb.  It now takes arrays of data
    buffers and output buffers, avoiding hardcoding 2x in the API.

Changed in v2:
  - Rebase onto cryptodev/master
  - Add more comments to assembly
  - Reorganize some of the assembly slightly
  - Fix the claimed throughput improvement on arm64
  - Fix incorrect kunmap order in fs/verity/verify.c
  - Adjust testmgr generation logic slightly
  - Explicitly check for INT_MAX before casting unsigned int to int
  - Mention SHA3 based parallel hashes
  - Mention AVX512-based approach

Eric Biggers (7):
  crypto: shash - add support for finup_mb
  crypto: testmgr - add tests for finup_mb
  crypto: x86/sha256-ni - add support for finup_mb
  crypto: arm64/sha256-ce - add support for finup_mb
  fsverity: improve performance by using multibuffer hashing
  dm-verity: reduce scope of real and wanted digests
  dm-verity: improve performance by using multibuffer hashing

 arch/arm64/crypto/sha2-ce-core.S    | 281 ++++++++++++++++++++-
 arch/arm64/crypto/sha2-ce-glue.c    |  40 +++
 arch/x86/crypto/sha256_ni_asm.S     | 368 ++++++++++++++++++++++++++++
 arch/x86/crypto/sha256_ssse3_glue.c |  39 +++
 crypto/shash.c                      |  58 +++++
 crypto/testmgr.c                    |  73 +++++-
 drivers/md/dm-verity-fec.c          |  19 +-
 drivers/md/dm-verity-fec.h          |   5 +-
 drivers/md/dm-verity-target.c       | 192 +++++++++++----
 drivers/md/dm-verity.h              |  34 +--
 fs/verity/fsverity_private.h        |   7 +
 fs/verity/hash_algs.c               |   8 +-
 fs/verity/verify.c                  | 169 ++++++++++---
 include/crypto/hash.h               |  52 +++-
 14 files changed, 1224 insertions(+), 121 deletions(-)


base-commit: a64dcfb451e254085a7daee5fe51bf22959d52d3

Comments

Herbert Xu Feb. 13, 2025, 4:17 a.m. UTC | #1
On Wed, Feb 12, 2025 at 07:47:11AM -0800, Eric Biggers wrote:
> [ This patchset keeps getting rejected by Herbert, who prefers a
>   complex, buggy, and slow alternative that shoehorns CPU-based hashing
>   into the asynchronous hash API which is designed for off-CPU offload:
>   https://lore.kernel.org/linux-crypto/cover.1730021644.git.herbert@gondor.apana.org.au/
>   This patchset is a much better way to do it though, and I've already
>   been maintaining it downstream as it would not be reasonable to go the
>   asynchronous hash route instead.  Let me know if there are any
>   objections to me taking this patchset through the fsverity tree, or at
>   least patches 1-5 as the dm-verity patches could go in separately. ]

Yes I object.  While I very much like this idea of parallel hashing
that you're introducing, shoehorning it into shash is restricting
this to storage-based users.

Networking is equally able to benefit from parallel hashing, and
parallel crypto (in particular, AEAD) in general.  In fact, both
TLS and IPsec can benefit directly from bulk submission instead
of the current scheme where a single packet is processed at a time.

But thanks for the reminder and I will be posting my patches
soon.

Cheers,
Eric Biggers Feb. 13, 2025, 6:33 a.m. UTC | #2
On Thu, Feb 13, 2025 at 12:17:42PM +0800, Herbert Xu wrote:
> On Wed, Feb 12, 2025 at 07:47:11AM -0800, Eric Biggers wrote:
> > [ This patchset keeps getting rejected by Herbert, who prefers a
> >   complex, buggy, and slow alternative that shoehorns CPU-based hashing
> >   into the asynchronous hash API which is designed for off-CPU offload:
> >   https://lore.kernel.org/linux-crypto/cover.1730021644.git.herbert@gondor.apana.org.au/
> >   This patchset is a much better way to do it though, and I've already
> >   been maintaining it downstream as it would not be reasonable to go the
> >   asynchronous hash route instead.  Let me know if there are any
> >   objections to me taking this patchset through the fsverity tree, or at
> >   least patches 1-5 as the dm-verity patches could go in separately. ]
> 
> Yes I object.  While I very much like this idea of parallel hashing
> that you're introducing, shoehorning it into shash is restricting
> this to storage-based users.
> 
> Networking is equally able to benefit from parallel hashing, and
> parallel crypto (in particular, AEAD) in general.  In fact, both
> TLS and IPsec can benefit directly from bulk submission instead
> of the current scheme where a single packet is processed at a time.

I've already covered this extensively, but here we go again.  First there are
more users of shash than ahash in the kernel, since shash is much easier to use
and also a bit faster.  There is nothing storage specific about it.  You've
claimed that shash is deprecated, but that reflects a misunderstanding of what
users actually want and need.  Users want simple, fast, easy-to-use APIs.  Not
APIs that are optimized for an obsolete form of hardware offload and have
CPU-based crypto support bolted on as an afterthought.

Second, these days TLS and IPsec usually use AES-GCM, which is inherently
parallelizable so does not benefit from multibuffer crypto.  This is a major
difference between the AEADs and message digest algorithms in common use.  And
it happens that I recently did a lot of work to optimize AES-GCM on x86_64; see
my commits in v6.11 that made AES-GCM 2-3x as fast on VAES-capable CPUs.

So anyone who cares about TLS or IPsec performance should of course be using
AES-GCM, as it's the fastest by far, and it has no need for multibuffer.  But
even for the rare case where someone is still using a legacy algorithm like
"authenc(hmac(sha256),cbc(aes))" for some reason, as I've explained before there
is much lower-hanging fruit for optimizing it.  For example, x86_64 still has no
implementation of the authenc template, let alone one that interleaves the
encryption with the MAC.  Both could be done today with the current crypto API.
Meanwhile multibuffer crypto would be very hard to apply to that use case (much
harder than the cases where I've applied it) and would not be very effective,
for reasons such as the complexity of that combination of algorithms vs. just
SHA-256.  Again, see
https://lore.kernel.org/linux-crypto/20240605191410.GB1222@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240606052801.GA324380@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240610164258.GA3269@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240611203209.GB128642@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240611201858.GA128642@sol.localdomain/
where I've already explained this in detail.

You've drawn an analogy to TSO and claimed that submitting multiple requests to
the crypto API at once would significantly improve performance even without
support from the underlying algorithm.  But that is incorrect.  TSO saves a
significant amount of time due to how the networking stack works.  In contrast,
the equivalent in the crypto API would do very little.  It would at best save
one indirect call per message, at the cost of adding the overhead of
multi-request support.  Even assuming it was beneficial at all, it would be a very
minor optimization, and not worth it over other optimization opportunities that
would not require making complex and error-prone extensions to the crypto API.

I remain quite puzzled by your position here, as it makes no sense.  TBH, I
think your opinions would be more informed if you had more experience with
actually implementing and optimizing the various crypto algorithms.

> But thanks for the reminder and I will be posting my patches soon.

I am not really interested in using your patches, sorry.  They just seem like
really poor engineering and a continuation of many of the worst practices of the
kernel crypto API that we *know* are not working.  Especially for cryptography
code, we need to do better.  (And I've even already done it!)

- Eric
Ard Biesheuvel Feb. 13, 2025, 10:10 a.m. UTC | #3
On Thu, 13 Feb 2025 at 05:17, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Feb 12, 2025 at 07:47:11AM -0800, Eric Biggers wrote:
> > [ This patchset keeps getting rejected by Herbert, who prefers a
> >   complex, buggy, and slow alternative that shoehorns CPU-based hashing
> >   into the asynchronous hash API which is designed for off-CPU offload:
> >   https://lore.kernel.org/linux-crypto/cover.1730021644.git.herbert@gondor.apana.org.au/
> >   This patchset is a much better way to do it though, and I've already
> >   been maintaining it downstream as it would not be reasonable to go the
> >   asynchronous hash route instead.  Let me know if there are any
> >   objections to me taking this patchset through the fsverity tree, or at
> >   least patches 1-5 as the dm-verity patches could go in separately. ]
>
> Yes I object.  While I very much like this idea of parallel hashing
> that you're introducing, shoehorning it into shash is restricting
> this to storage-based users.
>
> Networking is equally able to benefit from parallel hashing, and
> parallel crypto (in particular, AEAD) in general.  In fact, both
> TLS and IPsec can benefit directly from bulk submission instead
> of the current scheme where a single packet is processed at a time.
>
> But thanks for the reminder and I will be posting my patches
> soon.
>

I have to second Eric here, simply because his work has been ready to
go for a year now, while you keep rejecting it on the basis that
you're creating something better, and the only thing you have managed
to produce in the meantime didn't even work.

I strongly urge you to accept Eric's work, and if your approach is
really superior, it should be fairly easy to make that point with
working code once you get around to producing it, and we can switch
the users over then.

The increased flexibility you claim your approach will have does not
mesh with my understanding of where the opportunities for improvement
are: CPU-based SHA can be tightly interleaved at the instruction level
to have a performance gain of almost 2x. Designing a more flexible
ahash-based multibuffer API that can still take advantage of this to
the same extent is not straightforward, and you going off and cooking
up something by yourself for months at a time does not inspire
confidence that this will converge any time soon, if at all.

Also, your network use case is fairly theoretical, whereas the
fsverity and dm-verity code runs on 100s of millions of mobile phones
in the field, so sacrificing any performance of the latter to serve
the former seems misguided to me.

So could you please remove yourself from the critical path here, and
merge this while we wait for your better alternative to materialize?

Thanks,
Ard.
Herbert Xu Feb. 14, 2025, 2:44 a.m. UTC | #4
On Wed, Feb 12, 2025 at 10:33:04PM -0800, Eric Biggers wrote:
> 
> I've already covered this extensively, but here we go again.  First there are
> more users of shash than ahash in the kernel, since shash is much easier to use
> and also a bit faster.  There is nothing storage specific about it.  You've
> claimed that shash is deprecated, but that reflects a misunderstanding of what
> users actually want and need.  Users want simple, fast, easy-to-use APIs.  Not
> APIs that are optimized for an obsolete form of hardware offload and have
> CPU-based crypto support bolted on as an afterthought.

The ahash interface was not designed for hardware offload; it's
exactly the same as the skcipher interface, which caters for all
users.  The shash interface was a mistake, one which I've only
come to realise after adding the corresponding lskcipher interface.


> Second, these days TLS and IPsec usually use AES-GCM, which is inherently
> parallelizable so does not benefit from multibuffer crypto.  This is a major
> difference between the AEADs and message digest algorithms in common use.  And
> it happens that I recently did a lot of work to optimize AES-GCM on x86_64; see
> my commits in v6.11 that made AES-GCM 2-3x as fast on VAES-capable CPUs.

Bravo to your efforts on improving GCM.  But that does not mean that
GCM is not amenable to parallel processing.  While CTR itself is
obviously already parallel, the GHASH algorithm can indeed benefit
from parallel processing like any other hashing algorithm.

Cheers,
Eric Biggers Feb. 14, 2025, 3:35 a.m. UTC | #5
On Fri, Feb 14, 2025 at 10:44:47AM +0800, Herbert Xu wrote:
> On Wed, Feb 12, 2025 at 10:33:04PM -0800, Eric Biggers wrote:
> > 
> > I've already covered this extensively, but here we go again.  First there are
> > more users of shash than ahash in the kernel, since shash is much easier to use
> > and also a bit faster.  There is nothing storage specific about it.  You've
> > claimed that shash is deprecated, but that reflects a misunderstanding of what
> > users actually want and need.  Users want simple, fast, easy-to-use APIs.  Not
> > APIs that are optimized for an obsolete form of hardware offload and have
> > CPU-based crypto support bolted on as an afterthought.
> 
> The ahash interface was not designed for hardware offload, it's
> exactly the same as the skcipher interface which caters for all
> users.  The shash interface was a mistake, one which I've only
> come to realise after adding the corresponding lskcipher interface.

It absolutely is designed for an obsolete form of hardware offload.  Have you
ever tried actually using it?  Here's how to hash a buffer of data with shash:

	return crypto_shash_tfm_digest(tfm, data, size, out)

... and here's how to do it with the SHA-256 library, for what it's worth:

	sha256(data, size, out)

and here's how to do it with ahash:

	struct ahash_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	req = ahash_request_alloc(alg, GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	sg_init_one(&sg, data, size);
	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
					CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	ahash_request_set_crypt(req, &sg, out, size);

	err = crypto_wait_req(crypto_ahash_digest(req), &wait);

	ahash_request_free(req);
	return err;

Hmm, I wonder which API users would rather use?

The extra complexity is from supporting an obsolete form of hardware offload.

Yes, skcipher and aead have the same problem, but that doesn't mean it is right.

> > Second, these days TLS and IPsec usually use AES-GCM, which is inherently
> > parallelizable so does not benefit from multibuffer crypto.  This is a major
> > difference between the AEADs and message digest algorithms in common use.  And
> > it happens that I recently did a lot of work to optimize AES-GCM on x86_64; see
> > my commits in v6.11 that made AES-GCM 2-3x as fast on VAES-capable CPUs.
> 
> Bravo to your efforts on improving GCM.  But that does not mean that
> GCM is not amenable to parallel processing.  While CTR itself is
> obviously already parallel, the GHASH algorithm can indeed benefit
> from parallel processing like any other hashing algorithm.

What?  GHASH is a polynomial hash function, so it is easily parallelizable.  If
you precompute N powers of the hash key then you can process N blocks in
parallel.  Check how the AES-GCM assembly code works; that's exactly what it
does.  This is fundamentally different from message digests like SHA-* where the
blocks have to be processed serially.
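
Concretely, for four blocks X1..X4 and hash key H, with multiplication
in GF(2^128) and "+" denoting XOR, the GHASH recurrence unrolls to

	Y = (((X1*H + X2)*H + X3)*H + X4)*H
	  = X1*H^4 + X2*H^3 + X3*H^2 + X4*H

so once H^2, H^3, and H^4 are precomputed, the four products are
independent and can be computed in parallel, then XORed together.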

- Eric
Herbert Xu Feb. 14, 2025, 3:50 a.m. UTC | #6
On Thu, Feb 13, 2025 at 07:35:18PM -0800, Eric Biggers wrote:
>
> It absolutely is designed for an obsolete form of hardware offload.  Have you
> ever tried actually using it?  Here's how to hash a buffer of data with shash:
> 
> 	return crypto_shash_tfm_digest(tfm, data, size, out)
> 
> ... and here's how to do it with the SHA-256 library, for what it's worth:
> 
> 	sha256(data, size, out)
> 
> and here's how to do it with ahash:

Try the new virt ahash interface, and we could easily put the
request object on the stack for sync algorithms:

	SYNC_AHASH_REQUEST_ON_STACK(req, alg);

	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
	ahash_request_set_virt(req, data, out, size);

	return crypto_ahash_digest(req);
 
> Hmm, I wonder which API users would rather use?

You're conflating the SG API problem with the interface itself.
It's a separate issue, and quite easily solved.

> What?  GHASH is a polynomial hash function, so it is easily parallelizable.  If
> you precompute N powers of the hash key then you can process N blocks in
> parallel.  Check how the AES-GCM assembly code works; that's exactly what it
> does.  This is fundamentally different from message digests like SHA-* where the
> blocks have to be processed serially.

Fair enough.

But there are plenty of other users who want batching, such as the
zcomp with iaa, and I don't want everybody to invent their own API
for the same thing.

Cheers,
Eric Biggers Feb. 14, 2025, 4:29 a.m. UTC | #7
On Fri, Feb 14, 2025 at 11:50:51AM +0800, Herbert Xu wrote:
> On Thu, Feb 13, 2025 at 07:35:18PM -0800, Eric Biggers wrote:
> >
> > It absolutely is designed for an obsolete form of hardware offload.  Have you
> > ever tried actually using it?  Here's how to hash a buffer of data with shash:
> > 
> > 	return crypto_shash_tfm_digest(tfm, data, size, out)
> > 
> > ... and here's how to do it with the SHA-256 library, for what it's worth:
> > 
> > 	sha256(data, size, out)
> > 
> > and here's how to do it with ahash:
> 
> Try the new virt ahash interface, and we could easily put the
> request object on the stack for sync algorithms:
> 
> 	SYNC_AHASH_REQUEST_ON_STACK(req, alg);
> 
> 	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
> 	ahash_request_set_virt(req, data, out, size);
> 
> 	return crypto_ahash_digest(req);

That doesn't actually exist, and your code snippet is also buggy (undefined
behavior) because it never sets the tfm pointer in the ahash_request.  So this
just shows that you still can't use your own proposed APIs correctly because
they're still too complex.  Yes the virt address support would be an improvement
on current ahash, but it would still be bolted onto an interface that wasn't
designed for it.  There would still be the weirdness of having to initialize so
many unnecessary fields in the request, and having "synchronous asynchronous
hashes" which is always a fun one to try to explain to people.  The shash and
lib/crypto/ interfaces are much better as they do not have these problems.

> > What?  GHASH is a polynomial hash function, so it is easily parallelizable.  If
> > you precompute N powers of the hash key then you can process N blocks in
> > parallel.  Check how the AES-GCM assembly code works; that's exactly what it
> > does.  This is fundamentally different from message digests like SHA-* where the
> > blocks have to be processed serially.
> 
> Fair enough.
> 
> But there are plenty of other users who want batching, such as the
> zcomp with iaa, and I don't want everybody to invent their own API
> for the same thing.

Well, the IAA and zswap people want a batch_compress method that takes an array
of pages
(https://lore.kernel.org/linux-crypto/20250206072102.29045-3-kanchana.p.sridhar@intel.com/).
They do not seem to want the weird request chaining thing that you are trying to
make them use.  batch_compress is actually quite similar to what I'm proposing,
just for compression instead of hashing.   So there is no conflict with my
proposal, and in fact they complement each other as they arrived at a similar
conclusion.  Hash and compression are different algorithm types, so they can
never use exactly the same API anyway, just similar ones.  And FWIW, zswap is
synchronous, so yet again all the weird async stuff just gets in the way.

- Eric
Herbert Xu Feb. 14, 2025, 4:56 a.m. UTC | #8
On Thu, Feb 13, 2025 at 08:29:51PM -0800, Eric Biggers wrote:
>
> That doesn't actually exist, and your code snippet is also buggy (undefined
> behavior) because it never sets the tfm pointer in the ahash_request.  So this

Well thanks for pointing out that deficiency.  It would be good
to be able to set the tfm in the macro, something like:

#define SYNC_AHASH_REQUEST_ON_STACK(name, _tfm) \
	char __##name##_desc[sizeof(struct ahash_request) + \
			     MAX_SYNC_AHASH_REQSIZE \
			    ] CRYPTO_MINALIGN_ATTR; \
	struct ahash_request *name = (((struct ahash_request *)__##name##_desc)->base.tfm = crypto_sync_ahash_tfm((_tfm)), (void *)__##name##_desc)

> just shows that you still can't use your own proposed APIs correctly because
> they're still too complex.  Yes the virt address support would be an improvement
> on current ahash, but it would still be bolted onto an interface that wasn't
> designed for it.  There would still be the weirdness of having to initialize so
> many unnecessary fields in the request, and having "synchronous asynchronous
> hashes" which is always a fun one to try to explain to people.  The shash and
> lib/crypto/ interfaces are much better as they do not have these problems.

I'm more than happy to rename ahash to hash.  The only reason
it was called ahash is to distinguish it from shash, which will
no longer be necessary.

> never use exactly the same API anyway, just similar ones.  And FWIW, zswap is
> synchronous, so yet again all the weird async stuff just gets in the way.

I think you're again conflating two different concepts.  Yes
zswap/iaa are sleepable, but they're not synchronous.  This comes
into play because iaa is also useable by IPsec, which is most
certainly not sleepable.

Cheers,
Eric Biggers Feb. 14, 2025, 6:11 a.m. UTC | #9
On Fri, Feb 14, 2025 at 12:56:10PM +0800, Herbert Xu wrote:
> On Thu, Feb 13, 2025 at 08:29:51PM -0800, Eric Biggers wrote:
> >
> > That doesn't actually exist, and your code snippet is also buggy (undefined
> > behavior) because it never sets the tfm pointer in the ahash_request.  So this
> 
> Well thanks for pointing out that deficiency.  It would be good
> to be able to set the tfm in the macro, something like:
> 
> #define SYNC_AHASH_REQUEST_ON_STACK(name, _tfm) \
> 	char __##name##_desc[sizeof(struct ahash_request) + \
> 			     MAX_SYNC_AHASH_REQSIZE \
> 			    ] CRYPTO_MINALIGN_ATTR; \
> 	struct ahash_request *name = (((struct ahash_request *)__##name##_desc)->base.tfm = crypto_sync_ahash_tfm((_tfm)), (void *)__##name##_desc)

I'm not sure what you intended with the second line, which looks like it won't
compile.  The first line also shows that ahash_request has a large alignment for
DMA, which is irrelevant for CPU-based crypto.  And I'm not sure
__attribute__((aligned)) with that alignment on the stack even works reliably on
all architectures; we've had issues with that in the past.  So again you're
still proposing APIs with weird quirks caused by bolting CPU-based crypto
support onto an API designed for an obsolete style of hardware offload.

> > just shows that you still can't use your own proposed APIs correctly because
> > they're still too complex.  Yes the virt address support would be an improvement
> > on current ahash, but it would still be bolted onto an interface that wasn't
> > designed for it.  There would still be the weirdness of having to initialize so
> > many unnecessary fields in the request, and having "synchronous asynchronous
> > hashes" which is always a fun one to try to explain to people.  The shash and
> > lib/crypto/ interfaces are much better as they do not have these problems.
> 
> I'm more than happy to rename ahash to hash.  The only reason
> it was called ahash is to distinguish it from shash, which will
> no longer be necessary.
> 
> > never use exactly the same API anyway, just similar ones.  And FWIW, zswap is
> > synchronous, so yet again all the weird async stuff just gets in the way.
> 
> I think you're again conflating two different concepts.  Yes
> zswap/iaa are sleepable, but they're not synchronous.  

Here's the compression in zswap:

    comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);

And here's the decompression:

    BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));

It waits synchronously for each request to complete.

It doesn't want an asynchronous API.

> iaa is also useable by IPsec, which is most certainly not sleepable.

The IAA driver doesn't actually support encryption, so that's a very bad start.
But even assuming it was added, the premise of IAA being helpful for IPsec seems
questionable.  AES-GCM is already accelerated via the VAES and VPCLMULQDQ
instructions, performing at 10-30 GB/s per thread on recent processors.  It's
hard to see how legacy-style offload can beat that in practice, when accounting
for all the driver overhead and the fact that memory often ends up as the
bottleneck these days.  But of course for optimal IPsec performance you actually
need adapter-level offload (inline crypto) which does not use the crypto API at
all, so again the legacy-style offload support in the crypto API is irrelevant.

But, this is tangential to this discussion, since we can still keep the legacy
style hardware offload APIs around for the few users that think they want them.
The point is that we shouldn't let them drag down everyone else.

- Eric
Ard Biesheuvel Feb. 14, 2025, 10:50 a.m. UTC | #10
On Fri, 14 Feb 2025 at 04:51, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Feb 13, 2025 at 07:35:18PM -0800, Eric Biggers wrote:
> >
> > It absolutely is designed for an obsolete form of hardware offload.  Have you
> > ever tried actually using it?  Here's how to hash a buffer of data with shash:
> >
> >       return crypto_shash_tfm_digest(tfm, data, size, out)
> >
> > ... and here's how to do it with the SHA-256 library, for what it's worth:
> >
> >       sha256(data, size, out)
> >
> > and here's how to do it with ahash:
>
> Try the new virt ahash interface, and we could easily put the
> request object on the stack for sync algorithms:
>
>         SYNC_AHASH_REQUEST_ON_STACK(req, alg);
>
>         ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
>         ahash_request_set_virt(req, data, out, size);
>
>         return crypto_ahash_digest(req);
>

Whatever happened to not adding infrastructure to the kernel without a user?

You keep saying how great this will all work for hypothetical cases,
and from any other contributor, we would expect to see working code
that demonstrates the advantages of the approach.

But it seems you have no interest in actually writing this networking
code, nor has anybody else, as far as I can tell, which makes your
claims rather dubious.

IOW, even if all your claims are correct, it really makes no
difference when nobody can be bothered to take advantage of it, and we
should just go with Eric's working code.
Jakub Kicinski Feb. 15, 2025, 5:04 p.m. UTC | #11
On Wed, 12 Feb 2025 22:33:04 -0800 Eric Biggers wrote:
> So anyone who cares about TLS or IPsec performance should of course be using
> AES-GCM, as it's the fastest by far, and it has no need for multibuffer.

Can confirm, FWIW. I don't know as much about IPsec, but for TLS
lightweight SW-only crypto would be ideal.
Herbert Xu Feb. 16, 2025, 2:27 a.m. UTC | #12
On Sat, Feb 15, 2025 at 09:04:12AM -0800, Jakub Kicinski wrote:
>
> Can confirm, FWIW. I don't know as much about IPsec, but for TLS
> lightweight SW-only crypto would be ideal.

Please note that while CPU-only crypto is the best for networking,
it actually operates in asynchronous mode on x86.  This is because
RX occurs in softirq context, which may not be able to use SIMD on
x86.

Cheers,
Eric Biggers Feb. 16, 2025, 3:26 a.m. UTC | #13
On Sun, Feb 16, 2025 at 10:27:02AM +0800, Herbert Xu wrote:
> On Sat, Feb 15, 2025 at 09:04:12AM -0800, Jakub Kicinski wrote:
> >
> > Can confirm, FWIW. I don't know as much about IPsec, but for TLS
> > lightweight SW-only crypto would be ideal.
> 
> Please note that while CPU-only crypto is the best for networking,
> it actually operates in asynchronous mode on x86.  This is because
> RX occurs in softirq context, which may not be able to use SIMD on
> x86.

Well, the async fallback (using cryptd) occurs only when a kernel-mode FPU
section in process context is interrupted by a hardirq and at the end of it a
softirq also tries to use kernel-mode FPU.  It's generally a rare case but also
a terrible implementation that is really bad for performance; this should never
have been implemented this way.  I am planning to fix it so that softirqs on x86
will always be able to use the FPU, like they can on some of the other arches
like arm64 and riscv.

- Eric
Herbert Xu Feb. 16, 2025, 3:29 a.m. UTC | #14
On Sat, Feb 15, 2025 at 07:26:16PM -0800, Eric Biggers wrote:
>
> Well, the async fallback (using cryptd) occurs only when a kernel-mode FPU
> section in process context is interrupted by a hardirq and at the end of it a
> softirq also tries to use kernel-mode FPU.  It's generally a rare case but also
> a terrible implementation that is really bad for performance; this should never

It may not be rare if the kernel is busy doing bidirectional
TX/RX with crypto.  The process context will be doing the TX-side
encryption while the softirq context does the RX-side decryption.

> have been implemented this way.  I am planning to fix it so that softirqs on x86

Sure but it's way better than falling back to the C implementation
on the RX-side.

> will always be able to use the FPU, like they can on some of the other arches
> like arm64 and riscv.

That's great news.  Thanks!
Jakub Kicinski Feb. 17, 2025, 5:40 p.m. UTC | #15
On Sun, 16 Feb 2025 10:27:02 +0800 Herbert Xu wrote:
> On Sat, Feb 15, 2025 at 09:04:12AM -0800, Jakub Kicinski wrote:
> > Can confirm, FWIW. I don't know as much about IPsec, but for TLS
> > lightweight SW-only crypto would be ideal.  
> 
> Please note that while CPU-only crypto is the best for networking,
> it actually operates in asynchronous mode on x86.  This is because
> RX occurs in softirq context, which may not be able to use SIMD on
> x86.

Yes, that's true for tunnels and for IPsec.
TLS does crypto in sendmsg/recvmsg, process context.
Herbert Xu Feb. 18, 2025, 3:42 a.m. UTC | #16
On Mon, Feb 17, 2025 at 09:40:32AM -0800, Jakub Kicinski wrote:
>
> Yes, that's true for tunnels and for IPsec.
> TLS does crypto in sendmsg/recvmsg, process context.

OK that's good to know.  So whether SIMD is always allowed or
not won't impact TLS at least.

Cheers,
Ard Biesheuvel Feb. 18, 2025, 7:41 a.m. UTC | #17
On Tue, 18 Feb 2025 at 04:43, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Mon, Feb 17, 2025 at 09:40:32AM -0800, Jakub Kicinski wrote:
> >
> > Yes, that's true for tunnels and for IPsec.
> > TLS does crypto in sendmsg/recvmsg, process context.
>
> OK that's good to know.  So whether SIMD is always allowed or
> not won't impact TLS at least.
>

And for IPsec, I'd assume that the cryptd fallback is only needed when
TX and RX are competing for the same CPU.

So for modern systems, I don't think the SIMD helper does anything
useful, and we should just remove it, especially if we can relax the
softirq/preemption rules for kernel SIMD on x86 like I did for arm64.
Herbert Xu Feb. 18, 2025, 8:02 a.m. UTC | #18
On Tue, Feb 18, 2025 at 08:41:57AM +0100, Ard Biesheuvel wrote:
>
> And for IPsec, I'd assume that the cryptd fallback is only needed when
> TX and RX are competing for the same CPU.

Which can happen if the system is handling encrypted data in both
directions.  Sometimes you do want to have the hardware steer a
given flow to the same CPU in both directions.

> So for modern systems, I don't think the SIMD helper does anything
> useful, and we should just remove it, especially if we can relax the
> softirq/preemption rules for kernel SIMD on x86 like I did for arm64.

Sure, if we can ensure that SIMD is always useable even in softirq
context then we should definitely remove the simd wrapper.  But until
that happens, suddenly switching from AESNI to the generic C
implementation because a system is loaded is not good.

Cheers,