
[07/10] crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN

Message ID 20220405135758.774016-8-catalin.marinas@arm.com (mailing list archive)
State New, archived
Headers show
Series mm, arm64: Reduce ARCH_KMALLOC_MINALIGN below the cache line size

Commit Message

Catalin Marinas April 5, 2022, 1:57 p.m. UTC
ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
alignment.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
---
 include/linux/crypto.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Herbert Xu April 5, 2022, 10:57 p.m. UTC | #1
On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> alignment.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> ---
>  include/linux/crypto.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> index 2324ab6f1846..654b9c355575 100644
> --- a/include/linux/crypto.h
> +++ b/include/linux/crypto.h
> @@ -167,7 +167,7 @@
>   * maintenance for non-coherent DMA (cache invalidation in particular) does not
>   * affect data that may be accessed by the CPU concurrently.
>   */
> -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN

I think this should remain as ARCH_KMALLOC_MINALIGN with the
comment above modified.  The reason is that we assume memory
returned by kmalloc is already aligned to this value.

Ard, you added the comment regarding the DMA requirement, so
does anything actually rely on this? If they do, they now need
to do their own alignment.

Thanks,
Ard Biesheuvel April 6, 2022, 6:53 a.m. UTC | #2
On Wed, 6 Apr 2022 at 00:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > alignment.
> >
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > ---
> >  include/linux/crypto.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > index 2324ab6f1846..654b9c355575 100644
> > --- a/include/linux/crypto.h
> > +++ b/include/linux/crypto.h
> > @@ -167,7 +167,7 @@
> >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> >   * affect data that may be accessed by the CPU concurrently.
> >   */
> > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
>
> I think this should remain as ARCH_KMALLOC_MINALIGN with the
> comment above modified.  The reason is that we assume memory
> returned by kmalloc is already aligned to this value.
>
> Ard, you added the comment regarding the DMA requirement, so
> does anything actually rely on this? If they do, they now need
> to do their own alignment.
>

This patch looks incorrect to me, as ARCH_DMA_MINALIGN is not
#define'd on all architectures.

But I am fine with the intent: ARCH_DMA_MINALIGN will be >=
ARCH_KMALLOC_MINALIGN, and so the compile time layout of structs will
take the worst case minimum DMA alignment into account, whereas their
placement in memory when they are allocated dynamically may be aligned to
ARCH_KMALLOC_MINALIGN only. Since the latter will be based on the
actual cache geometry, this should be fine.

Apart from the 'shash desc on stack' issue solved by the patch that
also introduced the above comment (660d2062190d), I've never looked
into the actual memory footprint of the crypto related data structures
resulting from this alignment, but it seems to me that /if/ this is
significant, we should be able to punt this to the drivers that
actually need this, rather than impose it for the whole system. (This
would involve over-allocating the context struct, and aligning up the
pointer in the various xxx_ctx() getters iff needed by the driver in
question)
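
As a very rough sketch of that idea (the foo_* names below are made up
for illustration, not taken from any driver):

#include <linux/crypto.h>
#include <linux/kernel.h>

#define FOO_CTX_ALIGN	64	/* what this particular device actually needs */

struct foo_ctx {
	u32 key[8];
};

/* over-allocate: .cra_ctxsize = sizeof(struct foo_ctx) + FOO_CTX_ALIGN - 1 */

static void *foo_tfm_ctx(struct crypto_tfm *tfm)
{
	/* crypto_tfm_ctx() may only be ARCH_KMALLOC_MINALIGN aligned */
	return PTR_ALIGN(crypto_tfm_ctx(tfm), FOO_CTX_ALIGN);
}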

I'll put this on my list of things to look at.
Catalin Marinas April 6, 2022, 8:49 a.m. UTC | #3
On Wed, Apr 06, 2022 at 08:53:33AM +0200, Ard Biesheuvel wrote:
> On Wed, 6 Apr 2022 at 00:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > > alignment.
> > >
> > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > ---
> > >  include/linux/crypto.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > > index 2324ab6f1846..654b9c355575 100644
> > > --- a/include/linux/crypto.h
> > > +++ b/include/linux/crypto.h
> > > @@ -167,7 +167,7 @@
> > >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> > >   * affect data that may be accessed by the CPU concurrently.
> > >   */
> > > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
> >
> > I think this should remain as ARCH_KMALLOC_MINALIGN with the
> > comment above modified.  The reason is that we assume memory
> > returned by kmalloc is already aligned to this value.
> >
> > Ard, you added the comment regarding the DMA requirement, so
> > does anything actually rely on this? If they do, they now need
> > to do their own alignment.
> 
> This patch looks incorrect to me, as ARCH_DMA_MINALIGN is not
> #define'd on all architectures.

It is after the first patch:

https://lore.kernel.org/all/20220405135758.774016-2-catalin.marinas@arm.com/

The series makes both ARCH_*_MINALIGN available irrespective of what an
arch defines. If one needs guaranteed static alignment for DMA, use the
DMA macro. If the minimum kmalloc() alignment is needed (e.g. to store
some flags in the lower pointer bits), use the KMALLOC macro. I grep'ed
through drivers/ and I've seen both cases (e.g.
drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c for the latter use-case).
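
A quick illustration of the two use-cases (made-up names; this assumes
the first patch so that both macros are always defined):

#include <linux/cache.h>
#include <linux/slab.h>
#include <linux/build_bug.h>
#include <linux/types.h>

/* guaranteed static alignment for a field written by non-coherent DMA */
struct bar_cmd {
	u32 flags;
	u8 rsp[64] __aligned(ARCH_DMA_MINALIGN);
};

/* minimum kmalloc() alignment, e.g. to stash a flag in the low pointer bits */
#define BAR_PTR_BUSY	0x1UL

static unsigned long bar_pack(void *obj)
{
	BUILD_BUG_ON(ARCH_KMALLOC_MINALIGN < 2);	/* need bit 0 free */
	return (unsigned long)obj | BAR_PTR_BUSY;
}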

> But I am fine with the intent: ARCH_DMA_MINALIGN will be >=
> ARCH_KMALLOC_MINALIGN, and so the compile time layout of structs will
> take the worst case minimum DMA alignment into account, whereas their
> placement in memory when they are allocated dynamically may be aligned to
> ARCH_KMALLOC_MINALIGN only. Since the latter will be based on the
> actual cache geometry, this should be fine.

That's the idea.

> Apart from the 'shash desc on stack' issue solved by the patch that
> also introduced the above comment(660d2062190d), I've never looked
> into the actual memory footprint of the crypto related data structures
> resulting from this alignment, but it seems to me that /if/ this is
> significant, we should be able to punt this to the drivers that
> actually need this, rather than impose it for the whole system. (This
> would involve over-allocating the context struct, and aligning up the
> pointer in the various xxx_ctx() getters iff needed by the driver in
> question)

Since ARCH_KMALLOC_MINALIGN on arm64 prior to this series is 128, there
isn't any change to the crypto code.
Ard Biesheuvel April 6, 2022, 9:41 a.m. UTC | #4
On Wed, 6 Apr 2022 at 10:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Apr 06, 2022 at 08:53:33AM +0200, Ard Biesheuvel wrote:
> > On Wed, 6 Apr 2022 at 00:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > > > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > > > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > > > alignment.
> > > >
> > > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Cc: "David S. Miller" <davem@davemloft.net>
> > > > ---
> > > >  include/linux/crypto.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > > > index 2324ab6f1846..654b9c355575 100644
> > > > --- a/include/linux/crypto.h
> > > > +++ b/include/linux/crypto.h
> > > > @@ -167,7 +167,7 @@
> > > >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> > > >   * affect data that may be accessed by the CPU concurrently.
> > > >   */
> > > > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > > > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
> > >
> > > I think this should remain as ARCH_KMALLOC_MINALIGN with the
> > > comment above modified.  The reason is that we assume memory
> > > returned by kmalloc is already aligned to this value.
> > >
> > > Ard, you added the comment regarding the DMA requirement, so
> > > does anything actually rely on this? If they do, they now need
> > > to do their own alignment.
> >
> > This patch looks incorrect to me, as ARCH_DMA_MINALIGN is not
> > #define'd on all architectures.
>
> It is after the first patch:
>
> https://lore.kernel.org/all/20220405135758.774016-2-catalin.marinas@arm.com/
>

I wasn't cc'ed on that :-)

> The series makes both ARCH_*_MINALIGN available irrespective of what an
> arch defines. If one needs guaranteed static alignment for DMA, use the
> DMA macro. If the minimum kmalloc() alignment is needed (e.g. to store
> some flags in the lower pointer bits), use the KMALLOC macro. I grep'ed
> through drivers/ and I've seen both cases (e.g.
> drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c for the latter use-case).
>
> > But I am fine with the intent: ARCH_DMA_MINALIGN will be >=
> > ARCH_KMALLOC_MINALIGN, and so the compile time layout of structs will
> > take the worst case minimum DMA alignment into account, whereas their
> > placement in memory when they are allocated dynamically may be aligned to
> > ARCH_KMALLOC_MINALIGN only. Since the latter will be based on the
> > actual cache geometry, this should be fine.
>
> That's the idea.
>
> > Apart from the 'shash desc on stack' issue solved by the patch that
> > also introduced the above comment(660d2062190d), I've never looked
> > into the actual memory footprint of the crypto related data structures
> > resulting from this alignment, but it seems to me that /if/ this is
> > significant, we should be able to punt this to the drivers that
> > actually need this, rather than impose it for the whole system. (This
> > would involve over-allocating the context struct, and aligning up the
> > pointer in the various xxx_ctx() getters iff needed by the driver in
> > question)
>
> Since ARCH_KMALLOC_MINALIGN on arm64 prior to this series is 128, there
> isn't any change to the crypto code.
>

No, not currently. But what I started looking into today is avoiding
the need to impose DMA alignment on every single data structure
allocated by the crypto API, even if it is never used for DMA or
touched by a device. That seems rather wasteful as well.
Herbert Xu April 7, 2022, 4:30 a.m. UTC | #5
On Wed, Apr 06, 2022 at 09:49:42AM +0100, Catalin Marinas wrote:
>
> Since ARCH_KMALLOC_MINALIGN on arm64 prior to this series is 128, there
> isn't any change to the crypto code.

But the crypto API assumes that memory returned by kmalloc is
automatically aligned to CRYPTO_MINALIGN, would this still be
the case if you change it to ARCH_DMA_MINALIGN?

Thanks,
Muchun Song April 7, 2022, 6:14 a.m. UTC | #6
On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> alignment.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> ---
>  include/linux/crypto.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> index 2324ab6f1846..654b9c355575 100644
> --- a/include/linux/crypto.h
> +++ b/include/linux/crypto.h
> @@ -167,7 +167,7 @@
>   * maintenance for non-coherent DMA (cache invalidation in particular) does not
>   * affect data that may be accessed by the CPU concurrently.
>   */
> -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
>

I don't think this should be changed since ARCH_KMALLOC_MINALIGN is
already aligned to the size you need.

Thanks.
Catalin Marinas April 7, 2022, 9:25 a.m. UTC | #7
On Thu, Apr 07, 2022 at 02:14:15PM +0800, Muchun Song wrote:
> On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > alignment.
> > 
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > ---
> >  include/linux/crypto.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > index 2324ab6f1846..654b9c355575 100644
> > --- a/include/linux/crypto.h
> > +++ b/include/linux/crypto.h
> > @@ -167,7 +167,7 @@
> >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> >   * affect data that may be accessed by the CPU concurrently.
> >   */
> > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
> 
> I don't think this should be changed since ARCH_KMALLOC_MINALIGN is
> already aligned with the size what you need.

With this series, ARCH_KMALLOC_MINALIGN is no longer safe for
non-coherent DMA on all arm64 SoCs, that's what ARCH_DMA_MINALIGN will
cover.

Now, looking at the comment for CRYPTO_MINALIGN, one aspect it covers is
the minimum alignment required by C for the crypto_tfm structure access.
So a smaller ARCH_KMALLOC_MINALIGN would do. But the other part of the
comment mentions in-structure alignment for non-coherent DMA. Here we'd
need the upper bound alignment, ARCH_DMA_MINALIGN.

I'll follow up on Herbert's email as I think he has a good point on
structure vs kmalloc() alignment.
Muchun Song April 7, 2022, 10 a.m. UTC | #8
On Thu, Apr 7, 2022 at 5:25 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Thu, Apr 07, 2022 at 02:14:15PM +0800, Muchun Song wrote:
> > On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > > alignment.
> > >
> > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > > Cc: "David S. Miller" <davem@davemloft.net>
> > > ---
> > >  include/linux/crypto.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > > index 2324ab6f1846..654b9c355575 100644
> > > --- a/include/linux/crypto.h
> > > +++ b/include/linux/crypto.h
> > > @@ -167,7 +167,7 @@
> > >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> > >   * affect data that may be accessed by the CPU concurrently.
> > >   */
> > > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
> >
> > I don't think this should be changed since ARCH_KMALLOC_MINALIGN is
> > already aligned with the size what you need.
>
> With this series, ARCH_KMALLOC_MINALIGN is no longer safe for
> non-coherent DMA on all arm64 SoCs, that's what ARCH_DMA_MINALIGN will
> cover.
>
> Now, looking at the comment for CRYPTO_MINALIGN, one aspect it covers is
> the minimum alignment required by C for the crypto_tfm structure access.
> So a smaller ARCH_KMALLOC_MINALIGN would do. But the other part of the
> comment mentions in-structure alignment for non-coherent DMA. Here we'd
> need the upper bound alignment, ARCH_DMA_MINALIGN.
>
> I'll follow up on Herbert's email as I think he has a good point on
> structure vs kmalloc() alignment.

Got it. Now I know what you want to do. You want to set
ARCH_KMALLOC_MINALIGN to 64, however, the smallest
size of kmem_cache depends on the cache line size at
runtime.  But we have to know the safe alignment at build time.  So
we have to align those with ARCH_DMA_MINALIGN.
Right?  I think you are on the right road since most CPUs have
a 64-byte cache line.

Thanks.
Catalin Marinas April 7, 2022, 11:01 a.m. UTC | #9
On Thu, Apr 07, 2022 at 02:30:54PM +1000, Herbert Xu wrote:
> On Wed, Apr 06, 2022 at 09:49:42AM +0100, Catalin Marinas wrote:
> > isn't any change to the crypto code.
> 
> But the crypto API assumes that memory returned by kmalloc is
> automatically aligned to CRYPTO_MINALIGN, would this still be
> the case if you change it to ARCH_DMA_MINALIGN?

No but I think that's a valid point. Taking the crypto_tfm structure as
an example with ARCH_DMA_MINALIGN of 128:

#define CRYPTO_MINALIGN 128
#define CRYPTO_MINALIGN_ATTR __attribute__ ((__aligned__(CRYPTO_MINALIGN)))

struct crypto_tfm {
	u32 crt_flags;
	int node;
	void (*exit)(struct crypto_tfm *tfm);
	struct crypto_alg *__crt_alg;
	void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
};

The alignof(struct crypto_tfm) is 128. However, a kmalloc() would only
guarantee the smaller ARCH_KMALLOC_MINALIGN which, after this series,
would be 64 for arm64. From the DMA perspective there's no issue with
the smaller kmalloc() alignment since, if a crypto_tfm pointer is
DMA-aligned for the hardware it is running on, so would __crt_ctx[] at
an offset multiple of the dynamic DMA alignment. If we used
ARCH_KMALLOC_MINALIGN instead and the hardware alignment requirement was
larger, then we would have a potential problem with non-coherent DMA.

The only issue is whether the compiler gets confused by a pointer to a
structure with a smaller alignment than alignof(struct ...). I don't see
a performance or correctness issue on arm64 here. It would be a problem
if instead of 16 we went down to 8 or 4 due to unaligned accesses but
from 128 to 64 (or even 16), I don't think it matters.
Catalin Marinas April 7, 2022, 11:06 a.m. UTC | #10
On Thu, Apr 07, 2022 at 06:00:10PM +0800, Muchun Song wrote:
> On Thu, Apr 7, 2022 at 5:25 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Apr 07, 2022 at 02:14:15PM +0800, Muchun Song wrote:
> > > On Tue, Apr 05, 2022 at 02:57:55PM +0100, Catalin Marinas wrote:
> > > > ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
> > > > operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
> > > > alignment.
> > > >
> > > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Cc: "David S. Miller" <davem@davemloft.net>
> > > > ---
> > > >  include/linux/crypto.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > > > index 2324ab6f1846..654b9c355575 100644
> > > > --- a/include/linux/crypto.h
> > > > +++ b/include/linux/crypto.h
> > > > @@ -167,7 +167,7 @@
> > > >   * maintenance for non-coherent DMA (cache invalidation in particular) does not
> > > >   * affect data that may be accessed by the CPU concurrently.
> > > >   */
> > > > -#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
> > > > +#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
> > >
> > > I don't think this should be changed since ARCH_KMALLOC_MINALIGN is
> > > already aligned with the size what you need.
> >
> > With this series, ARCH_KMALLOC_MINALIGN is no longer safe for
> > non-coherent DMA on all arm64 SoCs, that's what ARCH_DMA_MINALIGN will
> > cover.
> >
> > Now, looking at the comment for CRYPTO_MINALIGN, one aspect it covers is
> > the minimum alignment required by C for the crypto_tfm structure access.
> > So a smaller ARCH_KMALLOC_MINALIGN would do. But the other part of the
> > comment mentions in-structure alignment for non-coherent DMA. Here we'd
> > need the upper bound alignment, ARCH_DMA_MINALIGN.
> >
> > I'll follow up on Herbert's email as I think he has a good point on
> > structure vs kmalloc() alignment.
> 
> Got it. Now I know what you want to do. You want to set
> ARCH_KMALLOC_MINALIGN to 64, however, the smallest
> size of kmem_cache depends on the cache line size at
> runtime.  But we have to know the safe alignment at building
> time.  So we have to make those align with ARCH_DMA_MINALIGN.
> Right?  

Right.

> I think you are on the right road since most CPUs have a 64-byte cache
> line.

Yeah, apart from about three SoCs with 128, the rest use 64-byte cache
lines.

Longer term we should try to reduce such alignment below the cache line
size *if* the SoC is fully DMA-coherent (like x86). In that case cache
maintenance for DMA doesn't exist, so there's no risk to smaller
allocations. Of course, one can use a kmem_cache_create() with
SLAB_HWCACHE_ALIGN and get the cacheline alignment for performance
reasons.
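
For example (purely illustrative):

#include <linux/slab.h>
#include <linux/init.h>

struct baz_obj {
	u64 counters[8];
};

static struct kmem_cache *baz_cache;

static int __init baz_init(void)
{
	/* cache-line aligned for performance only, not for DMA safety */
	baz_cache = kmem_cache_create("baz_obj", sizeof(struct baz_obj), 0,
				      SLAB_HWCACHE_ALIGN, NULL);
	return baz_cache ? 0 : -ENOMEM;
}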
Herbert Xu April 7, 2022, 11:40 a.m. UTC | #11
On Thu, Apr 07, 2022 at 12:01:02PM +0100, Catalin Marinas wrote:
>
> The only issue is whether the compiler gets confused by a pointer to a
> structure with a smaller alignment than alignof(struct ...). I don't see
> a performance or correctness issue on arm64 here. It would be a problem
> if instead of 16 we went down to 8 or 4 due to unaligned accesses but
> from 128 to 64 (or even 16), I don't think it matters.

The issue is that there's code in the Crypto API which assumes
that all pointers returned by kmalloc are aligned to CRYPTO_MINALIGN;
if you break that, then all that code would need to be modified.

However, I think it's better to change the code that assumes
CRYPTO_MINALIGN guarantees DMA alignment.

Cheers,
Catalin Marinas April 7, 2022, 4:28 p.m. UTC | #12
On Thu, Apr 07, 2022 at 07:40:59PM +0800, Herbert Xu wrote:
> On Thu, Apr 07, 2022 at 12:01:02PM +0100, Catalin Marinas wrote:
> > The only issue is whether the compiler gets confused by a pointer to a
> > structure with a smaller alignment than alignof(struct ...). I don't see
> > a performance or correctness issue on arm64 here. It would be a problem
> > if instead of 16 we went down to 8 or 4 due to unaligned accesses but
> > from 128 to 64 (or even 16), I don't think it matters.
> 
> The issue is that there's code in the Crypto API which assumes
> that all pointers returned by kmalloc are aligned to CRYPTO_MINALIGN,
> if you break that then all that code would need to be modified.

I'm not familiar with the crypto API, trying to make sense of it now ;).

I can see in many cases that the kmalloc() caller aligns the requested
size to something like crypto_tfm_ctx_alignment(). So this would
guarantee a kmalloc() object aligned to CRYPTO_MINALIGN.

> However, I think it's better to change the code that assumes
> CRYPTO_MINALIGN guarantees DMA alignment.

I saw Ard already started to refactor some of these. But in the meantime
are there cases where the crypto code does a kmalloc() of less than
CRYPTO_MINALIGN and expects it to be CRYPTO_MINALIGN aligned?
Herbert Xu April 8, 2022, 3:25 a.m. UTC | #13
On Thu, Apr 07, 2022 at 05:28:10PM +0100, Catalin Marinas wrote:
>
> I can see in many cases that the kmalloc() caller aligns the requested
> size to something like crypto_tfm_ctx_alignment(). So this would
> guarantee a kmalloc() object aligned to CRYPTO_MINALIGN.

crypto_tfm_ctx_alignment is basically the same as CRYPTO_MINALIGN.
We assume any kmalloced pointers to be aligned to that.

Specific algorithms may ask for an alignment greater than that
and we will use the knowledge that kmalloc is aligned to
CRYPTO_MINALIGN to derive the extra memory we need to get.

So if kmalloc no longer returns memory aligned to MINALIGN then
we'll get memory overruns.
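
Roughly, the derivation in question (a simplified sketch of what
crypto_ctxsize() in crypto/api.c does):

#include <linux/crypto.h>

static unsigned int example_ctxsize(struct crypto_alg *alg)
{
	unsigned int extra = 0;

	/* only alignment beyond what kmalloc already guarantees costs bytes */
	if (alg->cra_alignmask + 1 > CRYPTO_MINALIGN)
		extra = alg->cra_alignmask + 1 - CRYPTO_MINALIGN;

	return alg->cra_ctxsize + extra;
}

If kmalloc() only guarantees a smaller alignment, 'extra' ends up
under-sized, hence the overrun.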

> I saw Ard already started to refactor some of these. But in the meantime
> are there cases where the crypto code does a kmalloc() of less than
> CRYPTO_MINALIGN and expects it to be CRYPTO_MINALIGN aligned?

It's a fundamental assumption of the API.

Cheers,
Catalin Marinas April 8, 2022, 9:04 a.m. UTC | #14
On Fri, Apr 08, 2022 at 11:25:29AM +0800, Herbert Xu wrote:
> On Thu, Apr 07, 2022 at 05:28:10PM +0100, Catalin Marinas wrote:
> > I can see in many cases that the kmalloc() caller aligns the requested
> > size to something like crypto_tfm_ctx_alignment(). So this would
> > guarantee a kmalloc() object aligned to CRYPTO_MINALIGN.
> 
> crypto_tfm_ctx_alignment is basically the same as CRYPTO_MINALIGN.
> We assume any kmalloced pointers to be aligned to that.
> 
> Specific algorithms may ask for an alignment greater than that
> and we will use the knowledge that kmalloc is aligned to
> CRYPTO_MINALIGN to derive the extra memory we need to get.
> 
> So if kmalloc no longer returns memory aligned to MINALIGN then
> we'll get memory overruns.

My point is that if the crypto code kmallocs a size aligned to
crypto_tfm_ctx_alignment() (and CRYPTO_MINALIGN), the slab allocator
will return memory aligned to CRYPTO_MINALIGN even if
ARCH_KMALLOC_MINALIGN is smaller.

Would the crypto code, say, do a kmalloc(64) and expect a 128 byte
alignment (when CRYPTO_MINALIGN == 128)? Or does it align the size to
CRYPTO_MINALIGN and do a kmalloc(128) directly? If it's the latter, I
don't think there's a problem.
Herbert Xu April 8, 2022, 9:11 a.m. UTC | #15
On Fri, Apr 08, 2022 at 10:04:54AM +0100, Catalin Marinas wrote:
>
> My point is that if the crypto code kmallocs a size aligned to
> crypto_tfm_ctx_alignment() (and CRYPTO_MINALIGN), the slab allocator
> will return memory aligned to CRYPTO_MINALIGN even if
> ARCH_KMALLOC_MINALIGN is smaller.

No, we don't align the size to CRYPTO_MINALIGN at all.  We simply
assume that this is the alignment returned by kmalloc.

> Would the crypto code, say, do a kmalloc(64) and expect a 128 byte
> alignment (when CRYPTO_MINALIGN == 128)? Or does it align the size to
> CRYPTO_MINALIGN and do a kmalloc(128) directly? If it's the latter, I
> don't think there's a problem.

It's the former.

I think you can still make the change you want, but first you need
to modify the affected drivers to specify their actual alignment
requirement explicitly through cra_alignmask and then use the
correct methods to access the context pointer.

Basically these drivers have been broken from day one, but their
brokenness has been hidden by the extra-large KMALLOC_MINALIGN
value on arm.  So to reduce the KMALLOC_MINALIGN value, you have
to modify the drivers and set the cra_alignmask value.
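
A rough sketch of what such a driver change could look like
(hypothetical qux driver, only the relevant fields shown):

#include <linux/crypto.h>
#include <linux/module.h>
#include <crypto/algapi.h>

struct qux_ctx {
	u8 key[32];
};

static struct crypto_alg qux_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-qux",
	.cra_ctxsize		= sizeof(struct qux_ctx),
	.cra_blocksize		= 16,
	/* state the device's real DMA alignment need explicitly */
	.cra_alignmask		= 63,
	.cra_module		= THIS_MODULE,
};

static void *qux_ctx(struct crypto_tfm *tfm)
{
	/* realigned to cra_alignmask + 1, whatever kmalloc() returned */
	return crypto_tfm_ctx_aligned(tfm);
}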

Cheers,
Catalin Marinas April 12, 2022, 9:32 a.m. UTC | #16
On Fri, Apr 08, 2022 at 05:11:28PM +0800, Herbert Xu wrote:
> On Fri, Apr 08, 2022 at 10:04:54AM +0100, Catalin Marinas wrote:
> > My point is that if the crypto code kmallocs a size aligned to
> > crypto_tfm_ctx_alignment() (and CRYPTO_MINALIGN), the slab allocator
> > will return memory aligned to CRYPTO_MINALIGN even if
> > ARCH_KMALLOC_MINALIGN is smaller.
> 
> No we don't align the size to CRYPTO_MINALIGN at all.  We simply
> assume that this is the alignment returned by kmalloc.
> 
> > Would the crypto code, say, do a kmalloc(64) and expect a 128 byte
> > alignment (when CRYPTO_MINALIGN == 128)? Or does it align the size to
> > CRYPTO_MINALIGN and do a kmalloc(128) directly? If it's the latter, I
> > don't think there's a problem.
> 
> It's the former.

Does this only matter for DMA? If there are other unrelated alignment
requirements, I think those drivers should be fixed for their own
cra_alignmask independent of the CPU cache line and DMA needs (e.g.
some 1024-bit vectors that would benefit from an aligned load).

With this series, the dynamic arch_kmalloc_minalign() still provides the
DMA-safe alignment even if it would be smaller than the default
CRYPTO_MINALIGN of 128. Let's say we have a structure:

struct crypto_something {
	u8	some_field;
	u8	data[] CRYPTO_MINALIGN_ATTR;
};

The sizeof(struct crypto_something) is automatically a multiple of
CRYPTO_MINALIGN (128 bytes for arm64). While kmalloc() could return a
smaller alignment, arch_kmalloc_minalign(), the data pointer above is
still aligned to arch_kmalloc_minalign() and DMA-safe since
CRYPTO_MINALIGN is a multiple of this (similar to the struct devres
case, https://lore.kernel.org/all/YlRn2Wal4ezjvomZ@arm.com/).

Even if such struct is included in another struct, the alignment and
sizeof is inherited by the containing object.

So let's assume the driver needs 64 bytes of data in addition to the
struct, it would allocate:

	kmalloc(sizeof(struct crypto_something) + 64);

Prior to this series, kmalloc() would return a 256-byte aligned pointer.
With this series, if the cache line size on a SoC is 64-byte, the
allocation falls under the kmalloc-192 cache, so 'data' would be 64-byte
aligned which is safe for DMA.

> I think you can still make the change you want, but first you need
> to modify the affected drivers to specify their actual alignment
> requirement explicitly through cra_alignmask and then use the
> correct methods to access the context pointer.

At a quick grep, most cra_alignmask values are currently 15 or smaller.
I'm not convinced the driver needs to know about the CPU cache
alignment. We could set cra_alignmask to CRYPTO_MINALIGN but that would
incur unnecessary overhead via functions like setkey_unaligned() when
arch_kmalloc_minalign() is already sufficient for DMA safety.

Maybe I miss some use-cases or I focus too much on DMA safety.

Thanks.
Herbert Xu April 12, 2022, 9:40 a.m. UTC | #17
On Tue, Apr 12, 2022 at 10:32:58AM +0100, Catalin Marinas wrote:
>
> At a quick grep, most cra_alignmask values are currently 15 or smaller.
> I'm not convinced the driver needs to know about the CPU cache
> alignment. We could set cra_alignmask to CRYPTO_MINALIGN but that would
> incur unnecessary overhead via function like setkey_unaligned() when the
> arch_kmalloc_minalign() was already sufficient for DMA safety.
> 
> Maybe I miss some use-cases or I focus too much on DMA safety.

The alignment was never designed for DMA.  It's mainly for CPUs
that provide accelerated instructions that require a certain
amount of alignment, most commonly 16 bytes.

Therefore CRYPTO_MINALIGN was never meant to be a guarantee for
DMA alignment.  Any drivers relying on this is simply broken.

I understand that on ARM for historical reasons you have had a
situation that CRYPTO_MINALIGN happened to be large enough to
guarantee DMA alignment.  I totally agree with your patch to
try to fix this but it should not penalise other architectures
and raise the CRYPTO_MINALIGN unnecessarily.

I think if CRYPTO_MINALIGN makes those drivers work then so
should cra_alignmask.  And that would be a relatively easy
patch to do.

Cheers,
Catalin Marinas April 12, 2022, 10:02 a.m. UTC | #18
On Tue, Apr 12, 2022 at 05:40:58PM +0800, Herbert Xu wrote:
> On Tue, Apr 12, 2022 at 10:32:58AM +0100, Catalin Marinas wrote:
> > At a quick grep, most cra_alignmask values are currently 15 or smaller.
> > I'm not convinced the driver needs to know about the CPU cache
> > alignment. We could set cra_alignmask to CRYPTO_MINALIGN but that would
> > incur unnecessary overhead via function like setkey_unaligned() when the
> > arch_kmalloc_minalign() was already sufficient for DMA safety.
> > 
> > Maybe I miss some use-cases or I focus too much on DMA safety.
> 
> The alignment was never designed for DMA.  It's mainly for CPUs
> that provide accelerated instructions that require a certain
> amount of alignment, most commonly 16 bytes.
> 
> Therefore CRYPTO_MINALIGN was never meant to be a guarantee for
> DMA alignment.  Any drivers relying on this is simply broken.

Thanks for the clarification (the comment in crypto.h implies that it
covers this aspect as well).

> I understand that on ARM for historical reasons you have had a
> situation that CRYPTO_MINALIGN happened to be large enough to
> guarantee DMA alignment.  I totally agree with your patch to
> try to fix this but it should not penalise other architectures
> and raise the CRYPTO_MINALIGN unnecessarily.

This series does not penalise any architecture. It doesn't even make
arm64 any worse than it currently is.

For arm64, before the series:

  ARCH_DMA_MINALIGN == 128
  ARCH_KMALLOC_MINALIGN == 128
  CRYPTO_MINALIGN == 128

After the series, on arm64:

  ARCH_DMA_MINALIGN == 128
  ARCH_KMALLOC_MINALIGN == 64
  CRYPTO_MINALIGN == 128

For x86, before the series:

  ARCH_DMA_MINALIGN undefined
  ARCH_KMALLOC_MINALIGN == 8
  CRYPTO_MINALIGN == 8

After the series, on x86 (or any architecture that doesn't define
ARCH_DMA_MINALIGN):

  ARCH_DMA_MINALIGN == 8
  ARCH_KMALLOC_MINALIGN == 8
  CRYPTO_MINALIGN == 8

For other architectures that define ARCH_DMA_MINALIGN to 'N' but do not
define ARCH_KMALLOC_MINALIGN, before and after this series:

  ARCH_DMA_MINALIGN == N
  ARCH_KMALLOC_MINALIGN == N
  CRYPTO_MINALIGN == N

> I think if CRYPTO_MINALIGN makes those drivers work then so
> should cra_alignmask.  And that would be a relatively easy
> patch to do.

Yes, the patch would be simple, subject to figuring out which drivers
and what alignment they actually need (any help appreciated).

There are already arm64 vendor kernels that change ARCH_DMA_MINALIGN to
64, hence ARCH_KMALLOC_MINALIGN and CRYPTO_MINALIGN become 64. We also
discussed a Kconfig option for this in the past. Would that have broken
any crypto drivers that rely on a strict 128 byte alignment?

Thanks.
Herbert Xu April 12, 2022, 10:18 a.m. UTC | #19
On Tue, Apr 12, 2022 at 11:02:54AM +0100, Catalin Marinas wrote:
>
> This series does not penalise any architecture. It doesn't even make
> arm64 any worse than it currently is.

Right, the patch as it stands doesn't change anything.  However,
it is also broken as it stands.  As I said before, CRYPTO_MINALIGN
is not something that is guaranteed by the Crypto API, it is simply
a statement of whatever kmalloc returns.

So if kmalloc is no longer returning CRYPTO_MINALIGN-aligned
memory, then those drivers that need this alignment for DMA
will break anyway.

If you want the Crypto API to guarantee alignment over and above
that returned by kmalloc, the correct way is to use cra_alignmask.

Cheers,
Catalin Marinas April 12, 2022, 10:20 a.m. UTC | #20
On Tue, Apr 12, 2022 at 11:02:54AM +0100, Catalin Marinas wrote:
> On Tue, Apr 12, 2022 at 05:40:58PM +0800, Herbert Xu wrote:
> > I think if CRYPTO_MINALIGN makes those drivers work then so
> > should cra_alignmask.  And that would be a relatively easy
> > patch to do.
> 
> Yes, the patch would be simple, subject to figuring out which drivers
> and what alignment they actually need (any help appreciated).
> 
> There are already arm64 vendor kernels that change ARCH_DMA_MINALIGN to
> 64, hence ARCH_KMALLOC_MINALIGN and CRYPTO_MINALIGN become 64. We also
> discussed a Kconfig option for this in the past. Would that have broken
> any crypto drivers that rely on a strict 128 byte alignment?

Actually, I think with a cra_alignmask of 127 (the arm64
CRYPTO_MINALIGN-1), crypto_check_alg() will fail since
MAX_ALGAPI_ALIGNMASK is 63.
Catalin Marinas April 12, 2022, 12:31 p.m. UTC | #21
On Tue, Apr 12, 2022 at 06:18:46PM +0800, Herbert Xu wrote:
> On Tue, Apr 12, 2022 at 11:02:54AM +0100, Catalin Marinas wrote:
> > This series does not penalise any architecture. It doesn't even make
> > arm64 any worse than it currently is.
> 
> Right, the patch as it stands doesn't change anything.  However,
> it is also broken as it stands.  As I said before, CRYPTO_MINALIGN
> is not something that is guaranteed by the Crypto API, it is simply
> a statement of whatever kmalloc returns.

I agree that CRYPTO_MINALIGN is not guaranteed by the Crypto API. What
I'm debating is the intended use for CRYPTO_MINALIGN in some (most?) of
the drivers. It's not just about kmalloc() but also a build-time offset
of buffers within structures to guarantee DMA safety. This can't be
fixed by cra_alignmask.

We could leave CRYPTO_MINALIGN as ARCH_KMALLOC_MINALIGN and that matches
it being just a statement of the kmalloc() minimum alignment. But since
it is also overloaded with the DMA in-structure offset alignment, we'd
need a new CRYPTO_DMA_MINALIGN (and _ATTR) to annotate those structures.
I have a suspicion there'll be fewer of the original CRYPTO_MINALIGN
uses left, hence my approach to making this bigger from the start.

There's also Ard's series introducing CRYPTO_REQ_MINALIGN while leaving
CRYPTO_MINALIGN for DMA-safe offsets (IIUC):

https://lore.kernel.org/r/20220406142715.2270256-1-ardb@kernel.org

> So if kmalloc is no longer returning CRYPTO_MINALIGN-aligned
> memory, then those drivers that need this alignment for DMA
> will break anyway.

No. As per one of my previous emails, kmalloc() will preserve the DMA
alignment for an SoC even if smaller than CRYPTO_MINALIGN (or a new
CRYPTO_DMA_MINALIGN). Since kmalloc() returns DMA-safe pointers and
CRYPTO_MINALIGN (or a new CRYPTO_DMA_MINALIGN) is DMA-safe, so would an
offset from a pointer returned by kmalloc().

> If you want the Crypto API to guarantee alignment over and above
> that returned by kmalloc, the correct way is to use cra_alignmask.

For kmalloc(), this would work, but for the current CRYPTO_MINALIGN_ATTR
uses it won't.

Thanks.
Ard Biesheuvel April 12, 2022, 10:01 p.m. UTC | #22
On Tue, 12 Apr 2022 at 14:31, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, Apr 12, 2022 at 06:18:46PM +0800, Herbert Xu wrote:
> > On Tue, Apr 12, 2022 at 11:02:54AM +0100, Catalin Marinas wrote:
> > > This series does not penalise any architecture. It doesn't even make
> > > arm64 any worse than it currently is.
> >
> > Right, the patch as it stands doesn't change anything.  However,
> > it is also broken as it stands.  As I said before, CRYPTO_MINALIGN
> > is not something that is guaranteed by the Crypto API, it is simply
> > a statement of whatever kmalloc returns.
>
> I agree that CRYPTO_MINALIGN is not guaranteed by the Crypto API. What
> I'm debating is the intended use for CRYPTO_MINALIGN in some (most?) of
> the drivers. It's not just about kmalloc() but also a build-time offset
> of buffers within structures to guarantee DMA safety. This can't be
> fixed by cra_alignmask.
>
> We could leave CRYPTO_MINALIGN as ARCH_KMALLOC_MINALIGN and that matches
> it being just a statement of the kmalloc() minimum alignment. But since
> it is also overloaded with the DMA in-structure offset alignment, we'd
> need a new CRYPTO_DMA_MINALIGN (and _ATTR) to annotate those structures.
> I have a suspicion there'll be fewer of the original CRYPTO_MINALIGN
> uses left, hence my approach to making this bigger from the start.
>
> There's also Ard's series introducing CRYPTO_REQ_MINALIGN while leaving
> CRYPT_MINALIGN for DMA-safe offsets (IIUC):
>
> https://lore.kernel.org/r/20220406142715.2270256-1-ardb@kernel.org
>
> > So if kmalloc is no longer returning CRYPTO_MINALIGN-aligned
> > memory, then those drivers that need this alignment for DMA
> > will break anyway.
>

One thing to note here is that minimum DMA *alignment* is not the same
as the padding to cache writeback granule (CWG) that is needed when
dealing with non-cache coherent inbound DMA.

The former is typically a property of the peripheral IP, and is
something that the driver needs to deal with, potentially by setting
cra_alignmask to ensure that the input and output buffers are placed
such that they can accessed via DMA by the peripheral.

The latter is a property of the CPU's cache hierarchy, not only the
size of the CWG, but also whether or not DMA is cache coherent to
begin with. This is not something the driver should usually have to
care about if it uses the DMA API correctly.

The reason why CRYPTO_MINALIGN matters for DMA in spite of this is
that some drivers not only use DMA for processing the bulk of the data
(typically presented using scatterlists) but sometimes also use DMA to
map parts of the request and TFM structures, which carry control data
used by the CPU to manage the crypto requests. Doing a non-coherent
DMA write into such a structure may blow away 64 or 128 bytes of data,
even if the write itself is much smaller, due to the fact that we need
to perform cache invalidation in order for the CPU to be able to
observe what the device wrote to that memory, and the invalidated
cache lines may be shared with other data items, and may become dirty
while the DMA mapping is active.

This is what I am addressing with my patch series, i.e., padding out
the driver owned parts of the struct to the CWG size so that cache
invalidation does not affect data owned by other layers in the crypto
cake, but only at runtime. By doing this consistently for TFM and
request structures, we should be able to disregard ARCH_DMA_MINALIGN
entirely when it comes to defining CRYPTO_MINALIGN, as it is highly
unlikely that any of these peripherals would require more than 8 or 16
bytes of alignment for the DMA operations themselves.
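
Roughly the kind of layout that leads to (illustrative only, not the
actual patches):

#include <linux/cache.h>
#include <linux/types.h>

/*
 * Keep the words the device writes via non-coherent DMA in their own
 * ARCH_DMA_MINALIGN (>= CWG) sized and aligned region, last in the
 * struct, so that cache invalidation for the device write cannot
 * clobber the CPU-owned fields.
 */
struct sample_req_ctx {
	unsigned int cryptlen;		/* CPU-owned bookkeeping */
	bool encrypt;

	struct {
		__le32 status;		/* written by the device */
		__le32 outlen;
	} hw __aligned(ARCH_DMA_MINALIGN);
};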




> No. As per one of my previous emails, kmalloc() will preserve the DMA
> alignment for an SoC even if smaller than CRYPTO_MINALIGN (or a new
> CRYPTO_DMA_MINALIGN). Since kmalloc() returns DMA-safe pointers and
> CRYPTO_MINALIGN (or a new CRYPTO_DMA_MINALIGN) is DMA-safe, so would an
> offset from a pointer returned by kmalloc().
>
> > If you want the Crypto API to guarantee alignment over and above
> > that returned by kmalloc, the correct way is to use cra_alignmask.
>
> For kmalloc(), this would work, but for the current CRYPTO_MINALIGN_ATTR
> uses it won't.
>
> Thanks.
>
> --
> Catalin
Catalin Marinas April 13, 2022, 8:47 a.m. UTC | #23
On Wed, Apr 13, 2022 at 12:01:25AM +0200, Ard Biesheuvel wrote:
> On Tue, 12 Apr 2022 at 14:31, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, Apr 12, 2022 at 06:18:46PM +0800, Herbert Xu wrote:
> > > On Tue, Apr 12, 2022 at 11:02:54AM +0100, Catalin Marinas wrote:
> > > > This series does not penalise any architecture. It doesn't even make
> > > > arm64 any worse than it currently is.
> > >
> > > Right, the patch as it stands doesn't change anything.  However,
> > > it is also broken as it stands.  As I said before, CRYPTO_MINALIGN
> > > is not something that is guaranteed by the Crypto API, it is simply
> > > a statement of whatever kmalloc returns.
> >
> > I agree that CRYPTO_MINALIGN is not guaranteed by the Crypto API. What
> > I'm debating is the intended use for CRYPTO_MINALIGN in some (most?) of
> > the drivers. It's not just about kmalloc() but also a build-time offset
> > of buffers within structures to guarantee DMA safety. This can't be
> > fixed by cra_alignmask.
> >
> > We could leave CRYPTO_MINALIGN as ARCH_KMALLOC_MINALIGN and that matches
> > it being just a statement of the kmalloc() minimum alignment. But since
> > it is also overloaded with the DMA in-structure offset alignment, we'd
> > need a new CRYPTO_DMA_MINALIGN (and _ATTR) to annotate those structures.
> > I have a suspicion there'll be fewer of the original CRYPTO_MINALIGN
> > uses left, hence my approach to making this bigger from the start.
> >
> > There's also Ard's series introducing CRYPTO_REQ_MINALIGN while leaving
> > CRYPT_MINALIGN for DMA-safe offsets (IIUC):
> >
> > https://lore.kernel.org/r/20220406142715.2270256-1-ardb@kernel.org
> >
> > > So if kmalloc is no longer returning CRYPTO_MINALIGN-aligned
> > > memory, then those drivers that need this alignment for DMA
> > > will break anyway.
> 
> One thing to note here is that minimum DMA *alignment* is not the same
> as the padding to cache writeback granule (CWG) that is needed when
> dealing with non-cache coherent inbound DMA.
> 
> The former is typically a property of the peripheral IP, and is
> something that the driver needs to deal with, potentially by setting
> cra_alignmask to ensure that the input and output buffers are placed
> such that they can accessed via DMA by the peripheral.

Good point, thanks for making this clear. AFAICT, the requirement for
bus master access in the crypto code doesn't go above 16 (a
cra_alignmask of 15).

> The latter is a property of the CPU's cache hierarchy, not only the
> size of the CWG, but also whether or not DMA is cache coherent to
> begin with. This is not something the driver should usually have to
> care about if it uses the DMA API correctly.

I agree. There is also an implicit expectation that the DMA API works on
kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
dynamic arch_kmalloc_minalign() in this series). But the key point is
that the driver doesn't need to know the CPU cache topology, coherency,
the DMA API and kmalloc() take care of these.

> The reason why CRYPTO_MINALIGN matters for DMA in spite of this is
> that some drivers not only use DMA for processing the bulk of the data
> (typically presented using scatterlists) but sometimes also use DMA to
> map parts of the request and TFM structures, which carry control data
> used by the CPU to manage the crypto requests. Doing a non-coherent
> DMA write into such a structure may blow away 64 or 128 bytes of data,
> even if the write itself is much smaller, due to the fact that we need
> to perform cache invalidation in order for the CPU to be able to
> observe what the device wrote to that memory, and the invalidated
> cache lines may be shared with other data items, and may become dirty
> while the DMA mapping is active.

Indeed.

> This is what I am addressing with my patch series, i.e., padding out
> the driver owned parts of the struct to the CWG size so that cache
> invalidation does not affect data owned by other layers in the crypto
> cake, but only at runtime. By doing this consistently for TFM and
> request structures, we should be able to disregard ARCH_DMA_MINALIGN
> entirely when it comes to defining CRYPTO_MINALIGN, as it is highly
> unlikely that any of these peripherals would require more than 8 or 16
> bytes of alignment for the DMA operations themselves.

Your series would be a nice improvement and eventually get
CRYPTO_MINALIGN to ARCH_SLAB_MINALIGN. In the meantime, with my series,
we need CRYPTO_MINALIGN to be ARCH_DMA_MINALIGN on arm64, unless we add
a new CRYPTO_DMA_MINALIGN but I don't see much point, especially if you
plan to reduce this alignment anyway.


Let's go back to restating the crypto code alignment requirements, as I
understand them (please correct):

1. Bus master accesses (DMA, CPUs that can't do efficient unaligned
   accesses). That's what cra_alignmask is for. If there's a driver that
   relies on an implicit kmalloc() alignment higher than cra_alignmask,
   it is already broken on x86 and a few other architectures that don't
   define ARCH_DMA_MINALIGN. But it won't be any worse with this series
   since it only brings the arm64 kmalloc() alignment down from 128 to
   64.

2. Non-coherent DMA and cache invalidation. With my series, kmalloc'ed
   buffers are DMA-safe for the CPU/SoC the kernel is running on.

3. DMA into buffers embedded in TFM structures. Since the offset of
   those buffers is decided at compile-time, we need a
   CRYPTO_MINALIGN_ATTR that covers both bus master alignment
   requirements and the highest cache line size (CWG) for a non-coherent
   DMA SoC. Ard's series would simplify the latter but, before then,
   CRYPTO_MINALIGN needs to stay as the larger ARCH_DMA_MINALIGN.

With my series, there is no change to the value of CRYPTO_MINALIGN for
arm64 or any other architecture, so point 3 is unaffected. The series
does change the kmalloc() alignment and that may be smaller than
CRYPTO_MINALIGN but neither of points 1 or 2 above are affected since
(a) we still have a sufficiently large ARCH_KMALLOC_MINALIGN of 64 and
(b) the kmalloc'ed buffers are safe for non-coherent DMA.

Herbert, Ard, if I missed anything please let me know but based on my
understanding, this series is safe for the crypto code.

Thanks.
Linus Torvalds April 13, 2022, 7:53 p.m. UTC | #24
On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> I agree. There is also an implicit expectation that the DMA API works on
> kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> dynamic arch_kmalloc_minalign() in this series). But the key point is
> that the driver doesn't need to know the CPU cache topology, coherency,
> the DMA API and kmalloc() take care of these.

Honestly, I think it would probably be worth discussing the "kmalloc
DMA alignment" issues.

99.9% of kmalloc users don't want to do DMA.

And there's actually a fair amount of small kmalloc for random stuff.
Right now on my laptop, I have

    kmalloc-8          16907  18432      8  512    1 : ...

according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.

It's all kinds of sad if those allocations need to be 64 bytes in size
just because of some silly DMA alignment issue, when none of them want
it.

Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
days. Which is crazy. It's literally memory that could have been used
for something much more useful than just pure and utter waste.

I think we could and should just say "people who actually require DMA
accesses should say so at kmalloc time". We literally have that
GFP_DMA and ZONE_DMA for various historical reasons, so we've been
able to do that before.

No, that historical GFP_DMA isn't what arm64 wants - it's the old
crazy "legacy 16MB DMA" thing that ISA DMA used to have.

But the basic issue was true then, and is true now - DMA allocations
are fairly special, and should not be that hard to just mark as such.

We could add a trivial wrapper function like

     static void *dma_kmalloc(size_t size, gfp_t flags)
     { return kmalloc(size | (ARCH_DMA_MINALIGN - 1), flags); }

which now means that the size argument is guaranteed to be big enough
(and not overflow to zero) that you get that aligned memory
allocation.

We could perhaps even have other special rules. Including really
specific ones, like saying

 - allocations smaller than 32 bytes are not DMA coherent, because we pack them

which would allow those small allocations to not pointlessly waste memory.

I dunno. But it's ridiculous that arm64 wastes so much memory when
it's approximately never needed.

                 Linus
Greg KH April 14, 2022, 5:38 a.m. UTC | #25
On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > I agree. There is also an implicit expectation that the DMA API works on
> > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > that the driver doesn't need to know the CPU cache topology, coherency,
> > the DMA API and kmalloc() take care of these.
> 
> Honestly, I think it would probably be worth discussing the "kmalloc
> DMA alignment" issues.
> 
> 99.9% of kmalloc users don't want to do DMA.
> 
> And there's actually a fair amount of small kmalloc for random stuff.
> Right now on my laptop, I have
> 
>     kmalloc-8          16907  18432      8  512    1 : ...
> 
> according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> 
> It's all kinds of sad if those allocations need to be 64 bytes in size
> just because of some silly DMA alignment issue, when none of them want
> it.
> 
> Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
> days. Which is crazy. It's literally memory that could have been used
> for something much more useful than just pure and utter waste.
> 
> I think we could and should just say "people who actually require DMA
> accesses should say so at kmalloc time". We literally have that
> GFP_DMA and ZONE_DMA for various historical reasons, so we've been
> able to do that before.
> 
> No, that historical GFP_DMA isn't what arm64 wants - it's the old
> crazy "legacy 16MB DMA" thing that ISA DMA used to have.
> 
> But the basic issue was true then, and is true now - DMA allocations
> are fairly special, and should not be that hard to just mark as such.

"fairly special" == "all USB transactions", so it will take a lot of
auditing here.  I think also many SPI controllers require this and maybe
I2C?  Perhaps other bus types do as well.

So please don't make this change without some way of figuring out just
what drivers need to be fixed up, as it's going to be a lot...

thanks,

greg k-h
Ard Biesheuvel April 14, 2022, 1:52 p.m. UTC | #26
On Thu, 14 Apr 2022 at 07:38, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> > On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > >
> > > I agree. There is also an implicit expectation that the DMA API works on
> > > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > > that the driver doesn't need to know the CPU cache topology, coherency,
> > > the DMA API and kmalloc() take care of these.
> >
> > Honestly, I think it would probably be worth discussing the "kmalloc
> > DMA alignment" issues.
> >
> > 99.9% of kmalloc users don't want to do DMA.
> >
> > And there's actually a fair amount of small kmalloc for random stuff.
> > Right now on my laptop, I have
> >
> >     kmalloc-8          16907  18432      8  512    1 : ...
> >
> > according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> >
> > It's all kinds of sad if those allocations need to be 64 bytes in size
> > just because of some silly DMA alignment issue, when none of them want
> > it.
> >

Actually, the alignment for non-cache coherent DMA is 128 bytes on
arm64, not 64 bytes.

> > Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
> > days. Which is crazy. It's literally memory that could have been used
> > for something much more useful than just pure and utter waste.
> >
> > I think we could and should just say "people who actually require DMA
> > accesses should say so at kmalloc time". We literally have that
> > GFP_DMA and ZONE_DMA for various historical reasons, so we've been
> > able to do that before.
> >
> > No, that historical GFP_DMA isn't what arm64 wants - it's the old
> > crazy "legacy 16MB DMA" thing that ISA DMA used to have.
> >
> > But the basic issue was true then, and is true now - DMA allocations
> > are fairly special, and should not be that hard to just mark as such.
>
> "fairly special" == "all USB transactions", so it will take a lot of
> auditing here.  I think also many SPI controllers require this and maybe
> I2C?  Perhaps other bus types do as well.
>
> So please don't make this change without some way of figuring out just
> what drivers need to be fixed up, as it's going to be a lot...
>

Yeah, the current de facto contract of being able to DMA map anything
that was allocated via the linear map makes it quite hard to enforce
the use of dma_kmalloc() for this.

What we might do, given the fact that only inbound non-cache coherent
DMA is problematic, is dropping the kmalloc alignment to 8 like on
x86, and falling back to bounce buffering when a misaligned, non-cache
coherent inbound DMA mapping is created, using the SWIOTLB bounce
buffering code that we already have, and is already in use on most
affected systems for other reasons (i.e., DMA addressing limits)

This will cause some performance regressions, but in a way that seems
fixable to me: taking network drivers as an example, the RX buffers
that are filled using inbound DMA are typically owned by the driver
itself, which could be updated to round up its allocations and DMA
mappings. Block devices typically operate on quantities that are
aligned sufficiently already. In other cases, we will likely notice
if/when this fallback is taken on a hot path, but if we don't, at
least we know a bounce buffer is being used whenever we cannot perform
the DMA safely in-place.
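
Very roughly, the kind of check this would add on the dma-direct
mapping path (an illustrative sketch only; dma_needs_bounce() is a
made-up name):

#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>
#include <linux/align.h>

static bool dma_needs_bounce(struct device *dev, phys_addr_t paddr,
			     size_t size, enum dma_data_direction dir)
{
	if (dev_is_dma_coherent(dev))
		return false;
	if (dir == DMA_TO_DEVICE)
		return false;		/* clean-only maintenance is safe */
	/* misaligned, non-coherent inbound DMA: bounce via SWIOTLB */
	return !IS_ALIGNED(paddr | size, dma_get_cache_alignment());
}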
Greg KH April 14, 2022, 2:27 p.m. UTC | #27
On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
> On Thu, 14 Apr 2022 at 07:38, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> > > On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > > >
> > > > I agree. There is also an implicit expectation that the DMA API works on
> > > > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > > > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > > > that the driver doesn't need to know the CPU cache topology, coherency,
> > > > the DMA API and kmalloc() take care of these.
> > >
> > > Honestly, I think it would probably be worth discussing the "kmalloc
> > > DMA alignment" issues.
> > >
> > > 99.9% of kmalloc users don't want to do DMA.
> > >
> > > And there's actually a fair amount of small kmalloc for random stuff.
> > > Right now on my laptop, I have
> > >
> > >     kmalloc-8          16907  18432      8  512    1 : ...
> > >
> > > according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> > >
> > > It's all kinds of sad if those allocations need to be 64 bytes in size
> > > just because of some silly DMA alignment issue, when none of them want
> > > it.
> > >
> 
> Actually, the alignment for non-cache coherent DMA is 128 bytes on
> arm64, not 64 bytes.
> 
> > > Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
> > > days. Which is crazy. It's literally memory that could have been used
> > > for something much more useful than just pure and utter waste.
> > >
> > > I think we could and should just say "people who actually require DMA
> > > accesses should say so at kmalloc time". We literally have that
> > > GFP_DMA and ZONE_DMA for various historical reasons, so we've been
> > > able to do that before.
> > >
> > > No, that historical GFP_DMA isn't what arm64 wants - it's the old
> > > crazy "legacy 16MB DMA" thing that ISA DMA used to have.
> > >
> > > But the basic issue was true then, and is true now - DMA allocations
> > > are fairly special, and should not be that hard to just mark as such.
> >
> > "fairly special" == "all USB transactions", so it will take a lot of
> > auditing here.  I think also many SPI controllers require this and maybe
> > I2C?  Perhaps other bus types do as well.
> >
> > So please don't make this change without some way of figuring out just
> > what drivers need to be fixed up, as it's going to be a lot...
> >
> 
> Yeah, the current de facto contract of being able to DMA map anything
> that was allocated via the linear map makes it quite hard to enforce
> the use of dma_kmalloc() for this.
> 
> What we might do, given the fact that only inbound non-cache coherent
> DMA is problematic, is dropping the kmalloc alignment to 8 like on
> x86, and falling back to bounce buffering when a misaligned, non-cache
> coherent inbound DMA mapping is created, using the SWIOTLB bounce
> buffering code that we already have, and is already in use on most
> affected systems for other reasons (i.e., DMA addressing limits)

Ick, that's a mess.

> This will cause some performance regressions, but in a way that seems
> fixable to me: taking network drivers as an example, the RX buffers
> that are filled using inbound DMA are typically owned by the driver
> itself, which could be updated to round up its allocations and DMA
> mappings. Block devices typically operate on quantities that are
> aligned sufficiently already. In other cases, we will likely notice
> if/when this fallback is taken on a hot path, but if we don't, at
> least we know a bounce buffer is being used whenever we cannot perform
> the DMA safely in-place.

We can move to having an "allocator-per-bus" for memory like this to
allow the bus to know if this is a DMA requirement or not.

So for all USB drivers, we would have:
	usb_kmalloc(size, flags);
and then it might even be easier to verify with static tools that the
USB drivers are sending only properly allocated data.  Same for SPI and
other busses.
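
For illustration, such a wrapper might look like the sketch below;
usb_kmalloc() and the __dma_safe tag are hypothetical, not existing
kernel APIs.

#include <linux/slab.h>

/* Hypothetical static-analysis tag, in the spirit of __user / __iomem. */
#define __dma_safe

static inline void __dma_safe *usb_kmalloc(size_t size, gfp_t flags)
{
        /*
         * Plain kmalloc() today; the point is that static tools could then
         * check that only __dma_safe buffers are passed to usb_submit_urb().
         */
        return kmalloc(size, flags);
}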

https://lore.kernel.org/r/230a9486fc68ea0182df46255e42a51099403642.1648032613.git.christophe.leroy@csgroup.eu
is an example of a SPI driver that has been there "for forever" yet
always got it wrong.  If we could have had some way to know "only memory
allocated with this function is allowed on the bus" that would have
fixed the issue a long time ago.

Anyway, just an idea, it's up to you all if this is worth it or not.
Adding performance regressions at the expense of memory size feels like
a rough trade-off to go through until things are fixed up.

greg k-h
Ard Biesheuvel April 14, 2022, 2:30 p.m. UTC | #28
On Wed, 13 Apr 2022 at 10:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
..
>
> Let's go back to restating the crypto code alignment requirements, as I
> understand them (please correct):
>
> 1. Bus master accesses (DMA, CPUs that can't do efficient unaligned
>    accesses). That's what cra_alignmask is for. If there's a driver that
>    relies on an implicit kmalloc() alignment higher than cra_alignmask,
>    it is already broken on x86 and a few other architectures that don't
>    define ARCH_DMA_MINALIGN. But it won't be any worse with this series
>    since it only brings the arm64 kmalloc() alignment down from 128 to
>    64.
>
> 2. Non-coherent DMA and cache invalidation. With my series, kmalloc'ed
>    buffers are DMA-safe for the CPU/SoC the kernel is running on.
>
> 3. DMA into buffers embedded in TFM structures. Since the offset of
>    those buffers is decided at compile-time, we need a
>    CRYPTO_MINALIGN_ATTR that covers both bus master alignment
>    requirements and the highest cache line size (CWG) for a non-coherent
>    DMA SoC. Ard's series would simplify the latter but, before then,
>    CRYPTO_MINALIGN needs to stay as the larger ARCH_DMA_MINALIGN.
>
> With my series, there is no change to the value of CRYPTO_MINALIGN for
> arm64 or any other architecture, so point 3 is unaffected. The series
> does change the kmalloc() alignment and that may be smaller than
> CRYPTO_MINALIGN but neither of points 1 or 2 above are affected since
> (a) we still have a sufficiently large ARCH_KMALLOC_MINALIGN of 64 and
> (b) the kmalloc'ed buffers are safe for non-coherent DMA.
>
> Herbert, Ard, if I missed anything please let me know but based on my
> understanding, this series is safe for the crypto code.
>

Agreed.
Ard Biesheuvel April 14, 2022, 2:36 p.m. UTC | #29
On Thu, 14 Apr 2022 at 16:27, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
> > On Thu, 14 Apr 2022 at 07:38, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
...
> > > > Honestly, I think it would probably be worth discussing the "kmalloc
> > > > DMA alignment" issues.
> > > >
> > > > 99.9% of kmalloc users don't want to do DMA.
> > > >
> > > > And there's actually a fair amount of small kmalloc for random stuff.
> > > > Right now on my laptop, I have
> > > >
> > > >     kmalloc-8          16907  18432      8  512    1 : ...
> > > >
> > > > according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> > > >
> > > > It's all kinds of sad if those allocations need to be 64 bytes in size
> > > > just because of some silly DMA alignment issue, when none of them want
> > > > it.
> > > >
> >
> > Actually, the alignment for non-cache coherent DMA is 128 bytes on
> > arm64, not 64 bytes.
> >
> > > > Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
> > > > days. Which is crazy. It's literally memory that could have been used
> > > > for something much more useful than just pure and utter waste.
> > > >
> > > > I think we could and should just say "people who actually require DMA
> > > > accesses should say so at kmalloc time". We literally have that
> > > > GFP_DMA and ZONE_DMA for various historical reasons, so we've been
> > > > able to do that before.
> > > >
> > > > No, that historical GFP_DMA isn't what arm64 wants - it's the old
> > > > crazy "legacy 16MB DMA" thing that ISA DMA used to have.
> > > >
> > > > But the basic issue was true then, and is true now - DMA allocations
> > > > are fairly special, and should not be that hard to just mark as such.
> > >
> > > "fairly special" == "all USB transactions", so it will take a lot of
> > > auditing here.  I think also many SPI controllers require this and maybe
> > > I2C?  Perhaps other bus types do as well.
> > >
> > > So please don't make this change without some way of figuring out just
> > > what drivers need to be fixed up, as it's going to be a lot...
> > >
> >
> > Yeah, the current de facto contract of being able to DMA map anything
> > that was allocated via the linear map makes it quite hard to enforce
> > the use of dma_kmalloc() for this.
> >
> > What we might do, given the fact that only inbound non-cache coherent
> > DMA is problematic, is dropping the kmalloc alignment to 8 like on
> > x86, and falling back to bounce buffering when a misaligned, non-cache
> > coherent inbound DMA mapping is created, using the SWIOTLB bounce
> > buffering code that we already have, and is already in use on most
> > affected systems for other reasons (i.e., DMA addressing limits)
>
> Ick, that's a mess.
>
> > This will cause some performance regressions, but in a way that seems
> > fixable to me: taking network drivers as an example, the RX buffers
> > that are filled using inbound DMA are typically owned by the driver
> > itself, which could be updated to round up its allocations and DMA
> > mappings. Block devices typically operate on quantities that are
> > aligned sufficiently already. In other cases, we will likely notice
> > if/when this fallback is taken on a hot path, but if we don't, at
> > least we know a bounce buffer is being used whenever we cannot perform
> > the DMA safely in-place.
>
> We can move to having an "allocator-per-bus" for memory like this to
> allow the bus to know if this is a DMA requirement or not.
>
> So for all USB drivers, we would have:
>         usb_kmalloc(size, flags);
> and then it might even be easier to verify with static tools that the
> USB drivers are sending only properly allocated data.  Same for SPI and
> other busses.
>

As I pointed out earlier in the thread, alignment/padding requirements
for non-coherent DMA are a property of the CPU's cache hierarchy, not
of the device. So I'm not sure I follow how a per-subsystem
distinction would help here. In the case of USB especially, would that
mean that block, media and networking subsystems would need to be
aware of the USB-ness of the underlying transport?

> https://lore.kernel.org/r/230a9486fc68ea0182df46255e42a51099403642.1648032613.git.christophe.leroy@csgroup.eu
> is an example of a SPI driver that has been there "for forever" yet
> always got it wrong.  If we could have had some way to know "only memory
> allocated with this function is allowed on the bus" that would have
> fixed the issue a long time ago.
>
> Anyway, just an idea, it's up to you all if this is worth it or not.
> Adding performance regressions at the expense of memory size feels like
> a rough trade-off to go through until things are fixed up.
>

Yeah, I hear you. But we already have distinct classes of memory,
i.e., vmalloc vs kmalloc, where only the latter is permitted for DMA
(which is why VMAP stack broke that SPI driver), and I'm not sure
adding more types is going to make this tractable going forward.
Greg KH April 14, 2022, 2:52 p.m. UTC | #30
On Thu, Apr 14, 2022 at 04:36:46PM +0200, Ard Biesheuvel wrote:
> On Thu, 14 Apr 2022 at 16:27, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
> > > On Thu, 14 Apr 2022 at 07:38, Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> ...
> > > > > Honestly, I think it would probably be worth discussing the "kmalloc
> > > > > DMA alignment" issues.
> > > > >
> > > > > 99.9% of kmalloc users don't want to do DMA.
> > > > >
> > > > > And there's actually a fair amount of small kmalloc for random stuff.
> > > > > Right now on my laptop, I have
> > > > >
> > > > >     kmalloc-8          16907  18432      8  512    1 : ...
> > > > >
> > > > > according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> > > > >
> > > > > It's all kinds of sad if those allocations need to be 64 bytes in size
> > > > > just because of some silly DMA alignment issue, when none of them want
> > > > > it.
> > > > >
> > >
> > > Actually, the alignment for non-cache coherent DMA is 128 bytes on
> > > arm64, not 64 bytes.
> > >
> > > > > Yeah, yeah, wasting a megabyte of memory is "just a megabyte" these
> > > > > days. Which is crazy. It's literally memory that could have been used
> > > > > for something much more useful than just pure and utter waste.
> > > > >
> > > > > I think we could and should just say "people who actually require DMA
> > > > > accesses should say so at kmalloc time". We literally have that
> > > > > GFP_DMA and ZONE_DMA for various historical reasons, so we've been
> > > > > able to do that before.
> > > > >
> > > > > No, that historical GFP_DMA isn't what arm64 wants - it's the old
> > > > > crazy "legacy 16MB DMA" thing that ISA DMA used to have.
> > > > >
> > > > > But the basic issue was true then, and is true now - DMA allocations
> > > > > are fairly special, and should not be that hard to just mark as such.
> > > >
> > > > "fairly special" == "all USB transactions", so it will take a lot of
> > > > auditing here.  I think also many SPI controllers require this and maybe
> > > > I2C?  Perhaps other bus types do as well.
> > > >
> > > > So please don't make this change without some way of figuring out just
> > > > what drivers need to be fixed up, as it's going to be a lot...
> > > >
> > >
> > > Yeah, the current de facto contract of being able to DMA map anything
> > > that was allocated via the linear map makes it quite hard to enforce
> > > the use of dma_kmalloc() for this.
> > >
> > > What we might do, given the fact that only inbound non-cache coherent
> > > DMA is problematic, is dropping the kmalloc alignment to 8 like on
> > > x86, and falling back to bounce buffering when a misaligned, non-cache
> > > coherent inbound DMA mapping is created, using the SWIOTLB bounce
> > > buffering code that we already have, and is already in use on most
> > > affected systems for other reasons (i.e., DMA addressing limits)
> >
> > Ick, that's a mess.
> >
> > > This will cause some performance regressions, but in a way that seems
> > > fixable to me: taking network drivers as an example, the RX buffers
> > > that are filled using inbound DMA are typically owned by the driver
> > > itself, which could be updated to round up its allocations and DMA
> > > mappings. Block devices typically operate on quantities that are
> > > aligned sufficiently already. In other cases, we will likely notice
> > > if/when this fallback is taken on a hot path, but if we don't, at
> > > least we know a bounce buffer is being used whenever we cannot perform
> > > the DMA safely in-place.
> >
> > We can move to having an "allocator-per-bus" for memory like this to
> > allow the bus to know if this is a DMA requirement or not.
> >
> > So for all USB drivers, we would have:
> >         usb_kmalloc(size, flags);
> > and then it might even be easier to verify with static tools that the
> > USB drivers are sending only properly allocated data.  Same for SPI and
> > other busses.
> >
> 
> As I pointed out earlier in the thread, alignment/padding requirements
> for non-coherent DMA are a property of the CPU's cache hierarchy, not
> of the device. So I'm not sure I follow how a per-subsystem
> distinction would help here. In the case of USB especially, would that
> mean that block, media and networking subsystems would need to be
> aware of the USB-ness of the underlying transport?

That's what we have required today, yes.  That's only because we knew
that for some USB controllers, that was a requirement and we had no way
of passing that information back up the stack so we just made it a
requirement.

But I do agree this is messy.  It's even messier for things like USB
where it's not the USB device itself that matters, it's the USB
controller that the USB device is attached to.  And that can be _way_ up
the device hierarchy.  Attach something like a NFS mount over a PPP
network connection on a USB to serial device and ugh, where do you
begin?  :)

And is this always just an issue of the CPU cache hierarchy?  And not the
specific bridge that a device is connected to that CPU on?  Or am I
saying the same thing here?

I mean take a USB controller for example.  We could have a system where
one USB controller is on a PCI bus, while another is on a "platform"
bus.  Both of those are connected to the CPU in different ways and so
could have different DMA rules.  Do we downgrade everything in the
system for the worst connection possible?

Again, consider a USB driver allocating memory to transfer stuff, should
it somehow know the cache hierarchy that it is connected to?  Right now
we punt and do not do that at the expense of a bit of potentially
wasted memory for small allocations.


> > https://lore.kernel.org/r/230a9486fc68ea0182df46255e42a51099403642.1648032613.git.christophe.leroy@csgroup.eu
> > is an example of a SPI driver that has been there "for forever" yet
> > always got it wrong.  If we could have had some way to know "only memory
> > allocated with this function is allowed on the bus" that would have
> > fixed the issue a long time ago.
> >
> > Anyway, just an idea, it's up to you all if this is worth it or not.
> > Adding performance regressions at the expense of memory size feels like
> > a rough trade-off to go through until things are fixed up.
> >
> 
> Yeah, I hear you. But we already have distinct classes of memory,
> i.e., vmalloc vs kmalloc, where only the latter is permitted for DMA
> (which is why VMAP stack broke that SPI driver), and I'm not sure
> adding more types is going to make this tractable going forward.

Sorry, it would not be a "real" type of memory at all, just something
with a static tag on it at the compiler level (like we do for endian
stuff with sparse today).  The real memory would be allocated with the
normal kmalloc() and friends; it's just that static tools can be run to
verify "yes, all of the calls to usb_submit_urb() were done with memory
allocated with usb_kmalloc(), and not from the stack"

No runtime changes needed for different types, that would be a mess.
Let's lean on static tools when we can.

thanks,

greg k-h
Ard Biesheuvel April 14, 2022, 3:01 p.m. UTC | #31
On Thu, 14 Apr 2022 at 16:53, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Apr 14, 2022 at 04:36:46PM +0200, Ard Biesheuvel wrote:
> > On Thu, 14 Apr 2022 at 16:27, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
...
> > > > What we might do, given the fact that only inbound non-cache coherent
> > > > DMA is problematic, is dropping the kmalloc alignment to 8 like on
> > > > x86, and falling back to bounce buffering when a misaligned, non-cache
> > > > coherent inbound DMA mapping is created, using the SWIOTLB bounce
> > > > buffering code that we already have, and is already in use on most
> > > > affected systems for other reasons (i.e., DMA addressing limits)
> > >
> > > Ick, that's a mess.
> > >
> > > > This will cause some performance regressions, but in a way that seems
> > > > fixable to me: taking network drivers as an example, the RX buffers
> > > > that are filled using inbound DMA are typically owned by the driver
> > > > itself, which could be updated to round up its allocations and DMA
> > > > mappings. Block devices typically operate on quantities that are
> > > > aligned sufficiently already. In other cases, we will likely notice
> > > > if/when this fallback is taken on a hot path, but if we don't, at
> > > > least we know a bounce buffer is being used whenever we cannot perform
> > > > the DMA safely in-place.
> > >
> > > We can move to having an "allocator-per-bus" for memory like this to
> > > allow the bus to know if this is a DMA requirement or not.
> > >
> > > So for all USB drivers, we would have:
> > >         usb_kmalloc(size, flags);
> > > and then it might even be easier to verify with static tools that the
> > > USB drivers are sending only properly allocated data.  Same for SPI and
> > > other busses.
> > >
> >
> > As I pointed out earlier in the thread, alignment/padding requirements
> > for non-coherent DMA are a property of the CPU's cache hierarchy, not
> > of the device. So I'm not sure I follow how a per-subsystem
> > distinction would help here. In the case of USB especially, would that
> > mean that block, media and networking subsystems would need to be
> > aware of the USB-ness of the underlying transport?
>
> That's what we have required today, yes.  That's only because we knew
> that for some USB controllers, that was a requirement and we had no way
> of passing that information back up the stack so we just made it a
> requirement.
>
> But I do agree this is messy.  It's even messier for things like USB
> where it's not the USB device itself that matters, it's the USB
> controller that the USB device is attached to.  And that can be _way_ up
> the device hierarchy.  Attach something like a NFS mount over a PPP
> network connection on a USB to serial device and ugh, where do you
> begin?  :)
>

Exactly.

> And is this always just an issue of the CPU cache hierarchy?  And not the
> specific bridge that a device is connected to that CPU on?  Or am I
> saying the same thing here?
>

Yes, this is a system property not a device property, and the driver
typically doesn't have any knowledge of this. For example, if a PCI
host bridge happens to be integrated in a non-cache coherent way, any
PCI device plugged into it becomes non-coherent, and the associated
driver needs to do the right thing. This is why we rely on the DMA
layer to take care of this.

> I mean take a USB controller for example.  We could have a system where
> one USB controller is on a PCI bus, while another is on a "platform"
> bus.  Both of those are connected to the CPU in different ways and so
> could have different DMA rules.  Do we downgrade everything in the
> system for the worst connection possible?
>

No, we currently support a mix of coherent and non-coherent just fine,
and this shouldn't change. It's just that the mere fact that
non-coherent devices might exist is increasing the memory footprint of
all kmalloc allocations.

> Again, consider a USB driver allocating memory to transfer stuff, should
> it somehow know the cache hierarchy that it is connected to?  Right now
> we punt and do not do that at the expense of a bit of potentially
> wasted memory for small allocations.
>

This whole discussion is based on the premise that this is an expense
we would prefer to avoid. Currently, every kmalloc allocation is
rounded up to 128 bytes on arm64, while x86 uses only 8.
Ard Biesheuvel April 14, 2022, 3:10 p.m. UTC | #32
On Thu, 14 Apr 2022 at 17:01, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 14 Apr 2022 at 16:53, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Apr 14, 2022 at 04:36:46PM +0200, Ard Biesheuvel wrote:
> > > On Thu, 14 Apr 2022 at 16:27, Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
> ...
> > > > > What we might do, given the fact that only inbound non-cache coherent
> > > > > DMA is problematic, is dropping the kmalloc alignment to 8 like on
> > > > > x86, and falling back to bounce buffering when a misaligned, non-cache
> > > > > coherent inbound DMA mapping is created, using the SWIOTLB bounce
> > > > > buffering code that we already have, and is already in use on most
> > > > > affected systems for other reasons (i.e., DMA addressing limits)
> > > >
> > > > Ick, that's a mess.
> > > >
> > > > > This will cause some performance regressions, but in a way that seems
> > > > > fixable to me: taking network drivers as an example, the RX buffers
> > > > > that are filled using inbound DMA are typically owned by the driver
> > > > > itself, which could be updated to round up its allocations and DMA
> > > > > mappings. Block devices typically operate on quantities that are
> > > > > aligned sufficiently already. In other cases, we will likely notice
> > > > > if/when this fallback is taken on a hot path, but if we don't, at
> > > > > least we know a bounce buffer is being used whenever we cannot perform
> > > > > the DMA safely in-place.
> > > >
> > > > We can move to having an "allocator-per-bus" for memory like this to
> > > > allow the bus to know if this is a DMA requirement or not.
> > > >
> > > > So for all USB drivers, we would have:
> > > >         usb_kmalloc(size, flags);
> > > > and then it might even be easier to verify with static tools that the
> > > > USB drivers are sending only properly allocated data.  Same for SPI and
> > > > other busses.
> > > >
> > >
> > > As I pointed out earlier in the thread, alignment/padding requirements
> > > for non-coherent DMA are a property of the CPU's cache hierarchy, not
> > > of the device. So I'm not sure I follow how a per-subsystem
> > > distinction would help here. In the case of USB especially, would that
> > > mean that block, media and networking subsystems would need to be
> > > aware of the USB-ness of the underlying transport?
> >
> > That's what we have required today, yes.  That's only because we knew
> > that for some USB controllers, that was a requirement and we had no way
> > of passing that information back up the stack so we just made it a
> > requirement.
> >
> > But I do agree this is messy.  It's even messier for things like USB
> > where it's not the USB device itself that matters, it's the USB
> > controller that the USB device is attached to.  And that can be _way_ up
> > the device hierarchy.  Attach something like a NFS mount over a PPP
> > network connection on a USB to serial device and ugh, where do you
> > begin?  :)
> >
>
> Exactly.
>
> > And is this always just an issue of the CPU cache hierarchy?  And not the
> > specific bridge that a device is connected to that CPU on?  Or am I
> > saying the same thing here?
> >
>
> Yes, this is a system property not a device property, and the driver
> typically doesn't have any knowledge of this. For example, if a PCI
> host bridge happens to be integrated in a non-cache coherent way, any
> PCI device plugged into it becomes non-coherent, and the associated
> driver needs to do the right thing. This is why we rely on the DMA
> layer to take care of this.
>
> > I mean take a USB controller for example.  We could have a system where
> > one USB controller is on a PCI bus, while another is on a "platform"
> > bus.  Both of those are connected to the CPU in different ways and so
> > could have different DMA rules.  Do we downgrade everything in the
> > system for the worst connection possible?
> >
>
> No, we currently support a mix of coherent and non-coherent just fine,
> and this shouldn't change. It's just that the mere fact that
> non-coherent devices might exist is increasing the memory footprint of
> all kmalloc allocations.
>
> > Again, consider a USB driver allocating memory to transfer stuff, should
> > it somehow know the cache hierarchy that it is connected to?  Right now
> > we punt and do not do that at the expense of a bit of potentially
> > wasted memory for small allocations.
> >
>
> This whole discussion is based on the premise that this is an expense
> we would prefer to avoid. Currently, every kmalloc allocation is
> rounded up to 128 bytes on arm64, while x86 uses only 8.

I guess I didn't answer that last question. Yes, I guess dma_kmalloc()
should be used in such cases. Combined with my bounce buffering hack,
the penalty for using plain kmalloc() instead would be a potential
performance hit when used for inbound DMA, instead of data corruption
(if we'd reduce the kmalloc() alignment when introducing
dma_kmalloc()).
Catalin Marinas April 14, 2022, 7:49 p.m. UTC | #33
On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > I agree. There is also an implicit expectation that the DMA API works on
> > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > that the driver doesn't need to know the CPU cache topology, coherency,
> > the DMA API and kmalloc() take care of these.
> 
> Honestly, I think it would probably be worth discussing the "kmalloc
> DMA alignment" issues.
> 
> 99.9% of kmalloc users don't want to do DMA.
> 
> And there's actually a fair amount of small kmalloc for random stuff.
> Right now on my laptop, I have
> 
>     kmalloc-8          16907  18432      8  512    1 : ...
> 
> according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
> 
> It's all kinds of sad if those allocations need to be 64 bytes in size
> just because of some silly DMA alignment issue, when none of them want
> it.

It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
I want to at least get it down to 64 with this series while preserving
the current kmalloc() semantics.

If we know the SoC is fully coherent (a bit tricky with late probed
devices), we could get the alignment down to 8. In the mobile space,
unfortunately, most DMA is non-coherent.

I think it's worth investigating the __dma annotations that Greg
suggested, though I have a suspicion it either is too difficult to track
or we just end up with this annotation everywhere. There are cases where
the memory is allocated outside the driver that knows the DMA needs,
though I guess these are either full page allocations or
kmem_cache_alloc() (e.g. page cache pages, skb).

There's also Ard's suggestion to bounce the (inbound DMA) buffer if not
aligned. That's doable but dma_map_single(), for example, only gets the
size of some random structure/buffer. If the size is below
ARCH_DMA_MINALIGN (or cache_line_size()), the DMA API implementation
would have to retrieve the slab cache, check the real allocation size
and then bounce if necessary.
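
Roughly, such a check might look like the sketch below, assuming the mapped
CPU address is the start of the kmalloc() object so that ksize() can report
the real allocation size; the helper name is made up.

#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Sketch only: would this mapping need bouncing for non-coherent inbound DMA? */
static bool kmalloc_mapping_needs_bounce(const void *cpu_addr, size_t size)
{
        unsigned long align = dma_get_cache_alignment();

        /* The mapping itself covers whole cache lines: in-place is safe. */
        if (IS_ALIGNED((unsigned long)cpu_addr | size, align))
                return false;

        /* Otherwise, safe only if the real slab object is padded out. */
        return !(IS_ALIGNED((unsigned long)cpu_addr, align) &&
                 IS_ALIGNED(ksize(cpu_addr), align));
}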

Irrespective of which option we go for, I think at least part of this
series decoupling ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN is still
needed since currently the minalign is used in some compile time
attributes. Even getting the kmalloc() size down to 64 is a significant
improvement over 128.

Subsequently I'd attempt Ard's bouncing idea as a quick workaround and
assess the bouncing overhead on some real platforms. This may be needed
before we track down all places to use dma_kmalloc().

I need to think some more on Greg's __dma annotation, as I said the
allocation may be decoupled from the driver in some cases.
Linus Torvalds April 14, 2022, 10:25 p.m. UTC | #34
On Thu, Apr 14, 2022 at 12:49 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
> I want to at least get it down to 64 with this series while preserving
> the current kmalloc() semantics.

So here's a thought - maybe we could do the reverse of GFP_DMA, and
add a flag to the places that want small allocations and know they
don't need DMA?

Because even 64 bytes is _huge_ for those small allocations. And 128
bytes is just insane.

Maybe we don't have a ton of them, but I just checked my desktop, and
if my laptop had 17k 8-byte allocation, on my desktop it's currently
sitting at 43k of them. And I've got a lot of 16- and 32-byte ones
too:

  kmalloc-32         51529  51712     32  128    1 :
  kmalloc-16         45056  45056     16  256    1 :
  kmalloc-8          43008  43008      8  512    1 :

Don't ask me what they are. It's probably fairly easy to find out, and
it's probably something silly like sysfs private pointer data or
whatever.

If I did my math right, with a 128-byte minimum allocation, that is
16MB of wasted memory.

Yeah, yeah, I've got 64GB of RAM in this thing, and there are things
that take a lot more memory than the above (mem_map etc), and 64MB is
peanuts at just 0.1% of RAM.

Maybe nobody cares. But I really feel bad about that kind of egregious
waste. The mem_map[] array may be big, it may use 1.5% of the memory I
have, but at least it *does* something.

And it could be that if I have 150k of those smallish allocations, a
server with lots of active users might have millions. Not having
looked at where they come from, maybe that isn't the case, but it
*might* be.

Maybe adding something like a

        static int warn_every_1k = 0;
        WARN_ON(size < 32 && (1023 & ++warn_every_1k));

to kmalloc() would give us a statistical view of "lots of these small
allocations" thing, and we could add GFP_NODMA to them. There probably
aren't that many places that have those small allocations, and it's
certainly safer to annotate "this is not for DMA" than have the
requirement that all DMA allocations must be marked.

Then just teach kmalloc() something like

        align = (gfp_flags & __GFP_NODMA) ? 8 : 128;

on arm.

Hmm?

              Linus
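
As a rough sketch of that idea (assumptions: __GFP_NODMA does not exist in
mainline and the bit value below is only a placeholder, and the sampling
test is written so it fires once per 1024 small allocations):

#include <linux/bug.h>
#include <linux/gfp.h>

/* Hypothetical flag: a real version would claim an unused ___GFP_* bit. */
#define __GFP_NODMA     ((__force gfp_t)0x08000000u)

static unsigned int small_kmalloc_count;

static inline unsigned int kmalloc_min_align(size_t size, gfp_t flags)
{
        /* Statistical view of small allocations: warn once per 1024 of them. */
        if (size < 32)
                WARN_ON((++small_kmalloc_count & 1023) == 0);

        /* Callers that declare "not for DMA" can take the small alignment. */
        return (flags & __GFP_NODMA) ? 8 : 128; /* 128 == ARCH_DMA_MINALIGN on arm64 */
}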
Ard Biesheuvel April 15, 2022, 6:03 a.m. UTC | #35
On Fri, 15 Apr 2022 at 00:26, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 14, 2022 at 12:49 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
> > I want to at least get it down to 64 with this series while preserving
> > the current kmalloc() semantics.
>
> So here's a thought - maybe we could do the reverse of GFP_DMA, and
> add a flag to the places that want small allocations and know they
> don't need DMA?
>
> Because even 64 bytes is _huge_ for those small allocations. And 128
> bytes is just insane.
>
> Maybe we don't have a ton of them, but I just checked my desktop, and
> if my laptop had 17k 8-byte allocation, on my desktop it's currently
> sitting at 43k of them. And I've got a lot of 16- and 32-byte ones
> too:
>
>   kmalloc-32         51529  51712     32  128    1 :
>   kmalloc-16         45056  45056     16  256    1 :
>   kmalloc-8          43008  43008      8  512    1 :
>
> Don't ask me what they are. It's probably fairly easy to find out, and
> it's probably something silly like sysfs private pointer data or
> whatever.
>
> If I did my math right, with a 128-byte minimum allocation, that is
> 16MB of wasted memory.
>
> Yeah, yeah, I've got 64GB of RAM in this thing, and there are things
> that take a lot more memory than the above (mem_map etc), and 64MB is
> peanuts at just 0.1% of RAM.
>

Actually, I think the impact on D-cache efficiency is a much bigger
concern, although I don't have any data to back that up. On arm64,
every 8 byte allocation takes up an entire cacheline, pushing out
other data that we could meaningfully keep there. And the doubling to
128 that is unnecessary on most arm64 systems probably has an
additional negative effect, as it means those allocations are heavily
skewed to use even indexes into the cache sets, causing extra
contention.

> Maybe nobody cares. But I really feel bad about that kind of egregious
> waste. The mem_map[] array may be big, it may use 1.5% of the memory I
> have, but at least it *does* something.
>
> And it could be that if I have 150k of those smallish allocations, a
> server with lots of active users might have millions. Not having
> looked at where they come from, maybe that isn't the case, but it
> *might* be.
>
> Maybe adding something like a
>
>         static int warn_every_1k = 0;
>         WARN_ON(size < 32 && (1023 & ++warn_every_1k));
>
> to kmalloc() would give us a statistical view of "lots of these small
> allocations" thing, and we could add GFP_NODMA to them. There probably
> aren't that many places that have those small allocations, and it's
> certainly safer to annotate "this is not for DMA" than have the
> requirement that all DMA allocations must be marked.
>
> Then just teach kmalloc() something like
>
>         align = (gfp_flags & __GFP_NODMA) ? 8 : 128;
>
> on arm.
>

Sounds like a lot of churn as well, tbh. But perhaps there are a few
hot spots that we can fix that would alleviate the bulk of the issue,
who knows?
Herbert Xu April 15, 2022, 6:51 a.m. UTC | #36
On Wed, Apr 13, 2022 at 09:47:29AM +0100, Catalin Marinas wrote:
>
> With my series, there is no change to the value of CRYPTO_MINALIGN for
> arm64 or any other architecture, so point 3 is unaffected. The series
> does change the kmalloc() alignment and that may be smaller than
> CRYPTO_MINALIGN but neither of points 1 or 2 above are affected since
> (a) we still have a sufficiently large ARCH_KMALLOC_MINALIGN of 64 and
> (b) the kmalloc'ed buffers are safe for non-coherent DMA.
> 
> Herbert, Ard, if I missed anything please let me know but based on my
> understanding, this series is safe for the crypto code.

Sorry, but you can't change CRYPTO_MINALIGN to a value greater
than the minimum alignment returned by kmalloc.  That simply
doesn't work.  There is no magic in the Crypto API that makes
this work.

Cheers,
Ard Biesheuvel April 15, 2022, 7:49 a.m. UTC | #37
On Fri, 15 Apr 2022 at 08:51, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Apr 13, 2022 at 09:47:29AM +0100, Catalin Marinas wrote:
> >
> > With my series, there is no change to the value of CRYPTO_MINALIGN for
> > arm64 or any other architecture, so point 3 is unaffected. The series
> > does change the kmalloc() alignment and that may be smaller than
> > CRYPTO_MINALIGN but neither of points 1 or 2 above are affected since
> > (a) we still have a sufficiently large ARCH_KMALLOC_MINALIGN of 64 and
> > (b) the kmalloc'ed buffers are safe for non-coherent DMA.
> >
> > Herbert, Ard, if I missed anything please let me know but based on my
> > understanding, this series is safe for the crypto code.
>
> Sorry, but you can't change CRYPTO_MINALIGN to a value greater
> than the minimum alignment returned by kmalloc.  That simply
> doesn't work.  There is no magic in the Crypto API that makes
> this work.
>

I'm not sure I understand what would go wrong if that assumption no
longer holds, but if CRYPTO_MINALIGN needs to remain equal to
ARCH_KMALLOC_MINALIGN, let's at least decouple it from
ARCH_DMA_MINALIGN, as I do in my series. As I pointed out before,
ARCH_DMA_MINALIGN has nothing to do with DMA addressing capabilities
of individual masters, it is simply a worst case cacheline size that
needs to be taken into account to avoid corruption when doing cache
invalidation for non-cache coherent inbound DMA.

I'll rename the flag I proposed from CRYPTO_ALG_NEED_DMA_ALIGNMENT to
CRYPTO_ALG_NEED_DMA_PADDING to make this clearer, and given that only
a few drivers should be relying on DMA to write into request/TFM
context structures, hopefully we can fix those to stop doing that, and
get rid of this flag again entirely.
Herbert Xu April 15, 2022, 7:51 a.m. UTC | #38
On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
>
> I'm not sure I understand what would go wrong if that assumption no
> longer holds.

It's very simple, we don't do anything to the pointer returned
by kmalloc before returning it as a tfm or other object with
an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
pointers that are not aligned to CRYPTO_MINALIGN then we'd be
lying to the compiler.

Cheers,
Ard Biesheuvel April 15, 2022, 8:05 a.m. UTC | #39
On Fri, 15 Apr 2022 at 09:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
> >
> > I'm not sure I understand what would go wrong if that assumption no
> > longer holds.
>
> It's very simple, we don't do anything to the pointer returned
> by kmalloc before returning it as a tfm or other object with
> an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
> pointers that are not aligned to CRYPTO_MINALIGN then we'd be
> lying to the compiler.
>

I guess that should be fixable. Given that this is about padding
rather than alignment, we could do something like

struct crypto_request {
  union {
      struct {
        ... fields ...
      };
      u8 __padding[ARCH_DMA_MINALIGN];
   };
    void __ctx[]  __align(CRYPTO_MINALIGN);
};

And then hopefully, we can get rid of the padding once we fix drivers
doing non-cache coherent inbound DMA into those structures.
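
Spelled out with the kernel's attribute helpers, the layout being suggested
would look roughly like this (a sketch: the struct and field names are
placeholders, and it assumes ARCH_DMA_MINALIGN is defined for the
architecture; the snippet above writes __align() informally where the
kernel spelling is __aligned()):

#include <linux/cache.h>        /* ARCH_DMA_MINALIGN, where the arch provides it */
#include <linux/crypto.h>       /* CRYPTO_MINALIGN_ATTR */

struct crypto_request_sketch {
        union {
                struct {
                        struct crypto_tfm *tfm;         /* placeholder fields */
                        u32 flags;
                };
                /* Pad the fields out so __ctx never shares their cache lines. */
                u8 __pad[ARCH_DMA_MINALIGN];
        };
        void *__ctx[] CRYPTO_MINALIGN_ATTR;
};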
Herbert Xu April 15, 2022, 8:12 a.m. UTC | #40
On Fri, Apr 15, 2022 at 10:05:21AM +0200, Ard Biesheuvel wrote:
>
> I guess that should be fixable. Given that this is about padding
> rather than alignment, we could do something like
> 
> struct crypto_request {
>   union {
>       struct {
>         ... fields ...
>       };
>       u8 __padding[ARCH_DMA_MINALIGN];
>    };
>     void __ctx[]  __align(CRYPTO_MINALIGN);
> };
> 
> And then hopefully, we can get rid of the padding once we fix drivers
> doing non-cache coherent inbound DMA into those structures.

Sorry, I don't think this works.  kmalloc can still return something
that's not ARCH_DMA_MINALIGN-aligned, and therefore __ctx won't be
aligned correctly.

Cheers,
Catalin Marinas April 15, 2022, 9:51 a.m. UTC | #41
On Fri, Apr 15, 2022 at 03:51:54PM +0800, Herbert Xu wrote:
> On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
> > I'm not sure I understand what would go wrong if that assumption no
> > longer holds.
> 
> It's very simple, we don't do anything to the pointer returned
> by kmalloc before returning it as a tfm or other object with
> an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
> pointers that are not aligned to CRYPTO_MINALIGN then we'd be
> lying to the compiler.

I agree that it would be lying to the compiler, but I don't think this
matters for arm64 where the CPU can do unaligned accesses just fine. We
don't even end up with unaligned accesses here. Let's say we have:

struct x {
	...
} __attribute__ ((__aligned__ (128)));

and the kmalloc(sizeof(struct x)) returns a 64-byte aligned pointer. The
compiler-generated code won't have any problem on arm64 accessing the
struct x members. As I said a few times, it's not affecting any other
architecture and not breaking arm64 either.

Anyway, let's agree to disagree. I'll look into keeping CRYPTO_MINALIGN
as ARCH_KMALLOC_MINALIGN and introduce a CRYPTO_DMA_MINALIGN (or just
use ARCH_DMA_MINALIGN directly) together with something like Linus'
dma_kmalloc() in places where an object aligned to ARCH_DMA_MINALIGN is
needed in the crypto code.
Ard Biesheuvel April 15, 2022, 9:51 a.m. UTC | #42
On Fri, 15 Apr 2022 at 10:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Apr 15, 2022 at 10:05:21AM +0200, Ard Biesheuvel wrote:
> >
> > I guess that should be fixable. Given that this is about padding
> > rather than alignment, we could do something like
> >
> > struct crypto_request {
> >   union {
> >       struct {
> >         ... fields ...
> >       };
> >       u8 __padding[ARCH_DMA_MINALIGN];
> >    };
> >     void __ctx[]  __align(CRYPTO_MINALIGN);
> > };
> >
> > And then hopefully, we can get rid of the padding once we fix drivers
> > doing non-cache coherent inbound DMA into those structures.
>
> Sorry, I don't think this works.  kmalloc can still return something
> that's not ARCH_DMA_MINALIGN-aligned, and therefore __ctx won't be
> aligned correctly.
>

That is the whole point, really: ARCH_DMA_MINALIGN==128 does not mean
__ctx needs to be aligned to 128 bytes, it only means that it should
not share a 128 byte cacheline with the preceding fields. So if
kmalloc() returns buffers that are aligned to whatever alignment the
platform requires (which will be 64 in most cases), the above
arrangement ensures that, without requiring that CRYPTO_MINALIGN ==
ARCH_DMA_MINALIGN.
Ard Biesheuvel April 15, 2022, 10:04 a.m. UTC | #43
On Fri, 15 Apr 2022 at 11:51, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Fri, 15 Apr 2022 at 10:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Fri, Apr 15, 2022 at 10:05:21AM +0200, Ard Biesheuvel wrote:
> > >
> > > I guess that should be fixable. Given that this is about padding
> > > rather than alignment, we could do something like
> > >
> > > struct crypto_request {
> > >   union {
> > >       struct {
> > >         ... fields ...
> > >       };
> > >       u8 __padding[ARCH_DMA_MINALIGN];
> > >    };
> > >     void __ctx[]  __align(CRYPTO_MINALIGN);
> > > };
> > >
> > > And then hopefully, we can get rid of the padding once we fix drivers
> > > doing non-cache coherent inbound DMA into those structures.
> >
> > Sorry, I don't think this works.  kmalloc can still return something
> > that's not ARCH_DMA_MINALIGN-aligned, and therefore __ctx won't be
> > aligned correctly.
> >
>
> That is the whole point, really: ARCH_DMA_MINALIGN==128 does not mean
> __ctx needs to be aligned to 128 bytes, it only means that it should
> not share a 128 byte cacheline with the preceding fields.

Let's rephrase that as 'must not share a cacheline with the preceding
fields, and the worst case we expect to have to deal with is a
cacheline size of 128 bytes'
Herbert Xu April 15, 2022, 10:12 a.m. UTC | #44
On Fri, Apr 15, 2022 at 11:51:49AM +0200, Ard Biesheuvel wrote:
>
> That is the whole point, really: ARCH_DMA_MINALIGN==128 does not mean
> __ctx needs to be aligned to 128 bytes, it only means that it should
> not share a 128 byte cacheline with the preceding fields. So if
> kmalloc() returns buffers that are aligned to whatever alignment the
> platform requires (which will be 64 in most cases), the above
> arrangement ensures that, without requiring that CRYPTO_MINALIGN ==
> ARCH_DMA_MINALIGN.

What if they started sharing a cacheline with the subsequent object?
I guess you could add some more padding at the end though.

I could accept this as a temporary solution, if you volunteer to
modify all the affected drivers so we can get rid of this bandaid.

Thanks,
Ard Biesheuvel April 15, 2022, 10:22 a.m. UTC | #45
On Fri, 15 Apr 2022 at 12:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Apr 15, 2022 at 11:51:49AM +0200, Ard Biesheuvel wrote:
> >
> > That is the whole point, really: ARCH_DMA_MINALIGN==128 does not mean
> > __ctx needs to be aligned to 128 bytes, it only means that it should
> > not share a 128 byte cacheline with the preceding fields. So if
> > kmalloc() returns buffers that are aligned to whatever alignment the
> > platform requires (which will be 64 in most cases), the above
> > arrangement ensures that, without requiring that CRYPTO_MINALIGN ==
> > ARCH_DMA_MINALIGN.
>
> What if they started sharing a cacheline with the subsequent object?
> I guess you could add some more padding at the end though.
>

Subsequent objects are owned by the driver, and it is the
responsibility of the driver not to modify the fields while it is also
mapped for DMA (and we have had issues in the past where drivers
violated this rule). So as long as ARCH_KMALLOC_MINALIGN guarantees
actual DMA minimum alignment for both the start and the end, we
shouldn't need any explicit padding at the end.

> I could accept this as a temporary solution, if you volunteer to
> modify all the affected drivers so we can get rid of this bandaid.
>

I'l do a scan of drivers/crypto to figure out how much we are relying
on this to begin with.
Herbert Xu April 15, 2022, 10:45 a.m. UTC | #46
On Fri, Apr 15, 2022 at 12:22:27PM +0200, Ard Biesheuvel wrote:
>
> Subsequent objects are owned by the driver, and it is the
> responsibility of the driver not to modify the fields while it is also
> mapped for DMA (and we have had issues in the past where drivers
> violated this rule). So as long as ARCH_KMALLOC_MINALIGN guarantees
> actual DMA minimum alignment for both the start and the end, we
> shouldn't need any explicit padding at the end.

I don't understand why this is guaranteed.  The driver context
size is arbitrary so it could end in the middle of a cacheline.
The slab allocator could well lay it out so that the next kmalloc
object starts right after the end of the context, in which case
they would share a cache-line.

The next kmalloc object could be (and in fact is likely to be)
of the same type.

Previously this wasn't possible because kmalloc guaranteed
alignment.

> I'l do a scan of drivers/crypto to figure out how much we are relying
> on this to begin with.

Thanks,
Arnd Bergmann April 15, 2022, 11:09 a.m. UTC | #47
On Fri, Apr 15, 2022 at 12:25 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> >
> And it could be that if I have 150k of those smallish allocations, a
> server with lots of active users might have millions. Not having
> looked at where they come from, maybe that isn't the case, but it
> *might* be.
>
> Maybe adding something like a
>
>         static int warn_every_1k = 0;
>         WARN_ON(size < 32 && (1023 & ++warn_every_1k));
>
> to kmalloc() would give us a statistical view of "lots of these small
> allocations" thing, and we could add GFP_NODMA to them. There probably
> aren't that many places that have those small allocations, and it's
> certainly safer to annotate "this is not for DMA" than have the
> requirement that all DMA allocations must be marked.

I think finding out the allocations is one of the most common examples
for ftrace. I followed the instructions from
https://www.kernel.org/doc/Documentation/trace/events.txt to
show me a histogram of all allocations under 256 bytes, which
(one kernel compile later) gives me something like

$ echo 'hist:key=call_site.sym-offset:val=bytes_req:sort=bytes_req.descending if bytes_req<256' > \
    /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
$ make -skj30
...
$ head  /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
{ call_site: [ffffffffc04e457f] btrfs_delete_delayed_dir_index+0xbf/0x1e0 [btrfs] } hitcount: 146914  bytes_req: 16454368
{ call_site: [ffffffffbbe601a3] generic_file_buffered_read+0x463/0x4a0 } hitcount:  98187  bytes_req: 14906232
{ call_site: [ffffffffc0497b81] btrfs_buffered_write+0x131/0x7e0 [btrfs] } hitcount: 156513  bytes_req: 10038544
{ call_site: [ffffffffc05125c9] btrfs_alloc_block_rsv+0x29/0x60 [btrfs] } hitcount: 155044  bytes_req:  8682464
{ call_site: [ffffffffbbfe7272] kernfs_fop_open+0xc2/0x290 } hitcount:  38764  bytes_req:  5892128
{ call_site: [ffffffffbbfb6ea2] load_elf_binary+0x242/0xed0 } hitcount:  58276  bytes_req:  3729664
{ call_site: [ffffffffc04b52d0] __btrfs_map_block+0x1f0/0xb60 [btrfs] } hitcount:  29289  bytes_req:  3521656
{ call_site: [ffffffffbbf7ac7e] inotify_handle_inode_event+0x7e/0x210 } hitcount:  61688  bytes_req:  2986992
{ call_site: [ffffffffbbf2fa35] alloc_pipe_info+0x65/0x230 } hitcount:  13139  bytes_req:  2312464
{ call_site: [ffffffffbc0cd3ec] security_task_alloc+0x9c/0x100 } hitcount:  60475  bytes_req:  2177100
{ call_site: [ffffffffbc0cd5f6] security_prepare_creds+0x76/0xa0 } hitcount: 266124  bytes_req:  2128992
{ call_site: [ffffffffbbfe710e] kernfs_get_open_node+0x7e/0x120 } hitcount:  38764  bytes_req:  1860672
{ call_site: [ffffffffc04e1fbd] btrfs_alloc_delayed_item+0x1d/0x50 [btrfs] } hitcount:  11859  bytes_req:  1833383
{ call_site: [ffffffffc046595d] split_item+0x8d/0x2e0 [btrfs] } hitcount:  14049  bytes_req:  1716288
{ call_site: [ffffffffbbfb6dbc] load_elf_binary+0x15c/0xed0 } hitcount:  58276  bytes_req:  1631728
{ call_site: [ffffffffbbf40e79] __d_alloc+0x179/0x1f0 } hitcount:  24814  bytes_req:  1280649
{ call_site: [ffffffffbbf5203f] single_open+0x2f/0xa0 } hitcount:  34541  bytes_req:  1105312
{ call_site: [ffffffffc047ad0a] btrfs_wq_submit_bio+0x4a/0xe0 [btrfs] } hitcount:   7746  bytes_req:  1053456
{ call_site: [ffffffffbc519e95] xhci_urb_enqueue+0xf5/0x3c0 } hitcount:   5511  bytes_req:   484968
{ call_site: [ffffffffc0482935] btrfs_opendir+0x25/0x70 [btrfs] } hitcount:  60245  bytes_req:   481960
{ call_site: [ffffffffc04c44ff] overwrite_item+0x1cf/0x5c0 [btrfs] } hitcount:   7378  bytes_req:   364305
{ call_site: [ffffffffc04c4514] overwrite_item+0x1e4/0x5c0 [btrfs] } hitcount:   7378  bytes_req:   364305
{ call_site: [ffffffffc04e207f] btrfs_wq_run_delayed_node+0x2f/0x80 [btrfs] } hitcount:   3427  bytes_req:   356408
{ call_site: [ffffffffbbe7e96d] shmem_symlink+0xbd/0x250 } hitcount:   5169  bytes_req:   242943
{ call_site: [ffffffffc03e0526] hid_input_field+0x56/0x290 [hid] } hitcount:  11004  bytes_req:   175760

I think these are all safe for the GFP_NODMA approach you suggest, maybe
not the xhci_urb_enqueue one.

          Arnd
Ard Biesheuvel April 15, 2022, 11:38 a.m. UTC | #48
On Fri, 15 Apr 2022 at 12:45, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Apr 15, 2022 at 12:22:27PM +0200, Ard Biesheuvel wrote:
> >
> > Subsequent objects are owned by the driver, and it is the
> > responsibility of the driver not to modify the fields while it is also
> > mapped for DMA (and we have had issues in the past where drivers
> > violated this rule). So as long as ARCH_KMALLOC_MINALIGN guarantees
> > actual DMA minimum alignment for both the start and the end, we
> > shouldn't need any explicit padding at the end.
>
> I don't understand why this is guaranteed.  The driver context
> size is arbitrary so it could end in the middle of a cacheline.
> The slab allocator could well lay it out so that the next kmalloc
> object starts right after the end of the context, in which case
> they would share a cache-line.
>

If this is the case, things are already broken today. We never take
ARCH_DMA_MINALIGN into account when adding the driver ctx size to the
overall allocation size.

> The next kmalloc object could be (and in fact is likely to be)
> of the same type.
>
> Previously this wasn't possible because kmalloc guaranteed
> alignment.
>

Either it does or it doesn't. If kmalloc() guarantees the actual DMA
alignment at both ends, the situation you describe cannot occur, given
that the driver's slice of the request/TFM structure would be padded
up to actual DMA alignment, regardless of whether or not
ARCH_DMA_MINALIGN exceeds that. So it would never share a cacheline in
practice, even though they might live in the same 128 byte aligned
region on a system that has a minimum DMA alignment that is lower than
that.

If kmalloc() does not guarantee that the end of the buffer is aligned
to actual DMA alignment, things are already broken today.
Catalin Marinas April 15, 2022, 12:18 p.m. UTC | #49
On Fri, Apr 15, 2022 at 10:05:21AM +0200, Ard Biesheuvel wrote:
> On Fri, 15 Apr 2022 at 09:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
> > > I'm not sure I understand what would go wrong if that assumption no
> > > longer holds.
> >
> > It's very simple, we don't do anything to the pointer returned
> > by kmalloc before returning it as a tfm or other object with
> > an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
> > pointers that are not aligned to CRYPTO_MINALIGN then we'd be
> > lying to the compiler.
> 
> I guess that should be fixable. Given that this is about padding
> rather than alignment, we could do something like
> 
> struct crypto_request {
>   union {
>       struct {
>         ... fields ...
>       };
>       u8 __padding[ARCH_DMA_MINALIGN];
>    };
>     void __ctx[]  __align(CRYPTO_MINALIGN);
> };
> 
> And then hopefully, we can get rid of the padding once we fix drivers
> doing non-cache coherent inbound DMA into those structures.

But if we keep CRYPTO_MINALIGN as 128, don't we get the padding
automatically?

struct crypto_request {
	...
	void *__ctx[] CRYPTO_MINALIGN_ATTR;
};

__alignof__(struct crypto_request) == 128;
sizeof(struct crypto_request) == N * 128

The same alignment and size is true for a structure like:

struct crypto_alg {
	...
} CRYPTO_MINALIGN_ATTR;

Any kmalloc() of sizeof(the above structures) will return a pointer
aligned to 128, irrespective of what ARCH_KMALLOC_MINALIGN is.

The problem is if you have a structure without any alignment attribute
(just ABI default), making its sizeof() smaller than ARCH_DMA_MINALIGN.
In this case kmalloc() could return a pointer aligned to something
smaller. Is this the case in the crypto code today? I can see it uses
the right alignment annotations already, no need for kmalloc() hacks.
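
That size/alignment relationship can be checked in a standalone sketch
(using 128 directly in place of the macros):

struct req_like {
	unsigned long flags;
	void *__ctx[] __attribute__((__aligned__(128)));
};

/*
 * The aligned flexible-array member raises the alignment of the whole
 * struct to 128, and its size is rounded up to a multiple of that.
 */
_Static_assert(__alignof__(struct req_like) == 128, "alignment");
_Static_assert(sizeof(struct req_like) % 128 == 0, "size");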
Ard Biesheuvel April 15, 2022, 12:25 p.m. UTC | #50
On Fri, 15 Apr 2022 at 14:19, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Fri, Apr 15, 2022 at 10:05:21AM +0200, Ard Biesheuvel wrote:
> > On Fri, 15 Apr 2022 at 09:52, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
> > > > I'm not sure I understand what would go wrong if that assumption no
> > > > longer holds.
> > >
> > > It's very simple, we don't do anything to the pointer returned
> > > by kmalloc before returning it as a tfm or other object with
> > > an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
> > > pointers that are not aligned to CRYPTO_MINALIGN then we'd be
> > > lying to the compiler.
> >
> > I guess that should be fixable. Given that this is about padding
> > rather than alignment, we could do something like
> >
> > struct crypto_request {
> >   union {
> >       struct {
> >         ... fields ...
> >       };
> >       u8 __padding[ARCH_DMA_MINALIGN];
> >    };
> >     void __ctx[]  __aligned(CRYPTO_MINALIGN);
> > };
> >
> > And then hopefully, we can get rid of the padding once we fix drivers
> > doing non-cache coherent inbound DMA into those structures.
>
> But if we keep CRYPTO_MINALIGN as 128, don't we get the padding
> automatically?
>

I suppose, yes.

> struct crypto_request {
>         ...
>         void *__ctx[] CRYPTO_MINALIGN_ATTR;
> };
>
> __alignof__(struct crypto_request) == 128;
> sizeof(struct crypto_request) == N * 128
>
> The same alignment and size is true for a structure like:
>
> struct crypto_alg {
>         ...
> } CRYPTO_MINALIGN_ATTR;
>
> Any kmalloc() of sizeof(the above structures) will return a pointer
> aligned to 128, irrespective of what ARCH_KMALLOC_MINALIGN is.
>
> The problem is if you have a structure without any alignment attribute
> (just ABI default), making its sizeof() smaller than ARCH_DMA_MINALIGN.
> In this case kmalloc() could return a pointer aligned to something
> smaller. Is this the case in the crypto code today? I can see it uses
> the right alignment annotations already, no need for kmalloc() hacks.
>

As long as CRYPTO_MINALIGN >= ARCH_KMALLOC_MINALIGN, we won't be lying
to the compiler when casting kmalloc buffers to these struct types.

I'd still like to fix the bad DMA behavior but I suppose it is a separate issue.
Catalin Marinas April 15, 2022, 12:31 p.m. UTC | #51
On Fri, Apr 15, 2022 at 10:51:40AM +0100, Catalin Marinas wrote:
> On Fri, Apr 15, 2022 at 03:51:54PM +0800, Herbert Xu wrote:
> > On Fri, Apr 15, 2022 at 09:49:12AM +0200, Ard Biesheuvel wrote:
> > > I'm not sure I understand what would go wrong if that assumption no
> > > longer holds.
> > 
> > It's very simple, we don't do anything to the pointer returned
> > by kmalloc before returning it as a tfm or other object with
> > an alignment of CRYPTO_MINALIGN.  IOW if kmalloc starts returning
> > pointers that are not aligned to CRYPTO_MINALIGN then we'd be
> > lying to the compiler.
> 
> I agree that it would be lying to the compiler, but I don't think this
> matters for arm64 where the CPU can do unaligned accesses just fine. We
> don't even end up with unaligned accesses here. Let's say we have:
> 
> struct x {
> 	...
> } __attribute__ ((__aligned__ (128)));
> 
> and the kmalloc(sizeof(struct x)) returns a 64-byte aligned pointer.

This needs a clarification. For the above structure, kmalloc() will
return a 128-byte aligned pointer since sizeof(x) is a multiple of 128.
The potential problem is if you have something like:

	kmalloc(sizeof(struct x) + 64);

The above could end up as a kmalloc(192) which is available with an
ARCH_KMALLOC_MINALIGN of 64. If that's a real use-case, I can change the
slab patch to not create the 192 (or 48 if we go for an even smaller
ARCH_KMALLOC_MINALIGN) caches and we'd always have ARCH_DMA_MINALIGN
guarantee if the structure itself is correctly aligned. No lying to the
compiler.
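
As an illustration of that case (sketch only):

#include <linux/slab.h>
#include <linux/types.h>

struct x {
	u8 buf[128];
} __aligned(128);			/* sizeof(struct x) == 128 */

static void *alloc_x_plus_trailer(void)
{
	/*
	 * Asks for 192 bytes, which the kmalloc-192 cache can satisfy
	 * with only ARCH_KMALLOC_MINALIGN (e.g. 64-byte) alignment.
	 */
	return kmalloc(sizeof(struct x) + 64, GFP_KERNEL);
}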
Catalin Marinas April 16, 2022, 9:42 a.m. UTC | #52
On Thu, Apr 14, 2022 at 03:25:59PM -0700, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 12:49 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
> > I want to at least get it down to 64 with this series while preserving
> > the current kmalloc() semantics.
> 
> So here's a thought - maybe we could do the reverse of GFP_DMA, and
> add a flag to the places that want small allocations and know they
> don't need DMA?

I wonder whether that's a lot more churn than trying to identify places
where a small kmalloc()'ed buffer is passed to the DMA API. DMA into
kmalloc() buffers should be a small fraction of the total kmalloc()
uses.

For kmem_cache we have the SLAB_HWCACHE_ALIGN flag. We can add a similar
GFP_ flag as that's what we care about for DMA safety. It doesn't even
need to force the alignment to ARCH_DMA_MINALIGN but just
cache_line_size() (typically 64 on arm64 while ARCH_DMA_MINALIGN is 128
for about three platforms that have this requirement).

Functions like dma_map_single() can be made to track down the origin of
the buffer when size < cache_line_size() and warn if the slab is not
correctly aligned.
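
A sketch of such a check (the helper name is made up; the ksize() test
is only a rough way of asking "does this object occupy whole cache
lines?"):

static void dma_check_kmalloc_align(struct device *dev, const void *ptr,
				    size_t size)
{
	if (dev_is_dma_coherent(dev) || size >= cache_line_size())
		return;

	WARN_ONCE(!IS_ALIGNED((unsigned long)ptr, cache_line_size()) ||
		  !IS_ALIGNED(ksize(ptr), cache_line_size()),
		  "%s: small DMA buffer %p (%zu bytes) may share a cache line\n",
		  dev_name(dev), ptr, size);
}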
Herbert Xu April 17, 2022, 8:08 a.m. UTC | #53
On Fri, Apr 15, 2022 at 01:38:15PM +0200, Ard Biesheuvel wrote:
>
> If this is the case, things are already broken today. We never take
> ARCH_DMA_MINALIGN into account when adding the driver ctx size to the
> overall allocation size.

No it's not broken because kmalloc guarantees alignment.  For
example, if ARCH_DMA_MINALIGN is 128 bytes, then kmalloc will
always return a pointer that's 128-byte aligned.  That guarantees
this object and the next object are on different cache-lines.

If you reduce the kmalloc minimum alignment to 64 bytes, then
the two neighbouring objects can share cache-lines, even if
each object is bigger than 128 bytes (e.g., if they were 192
bytes each).
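
Spelled out with 128-byte cache lines, 192-byte objects and a 64-byte
minimum alignment:

	object A: bytes   0..191
	object B: bytes 192..383
	cache line covering bytes 128..255: tail of A + head of B

so cache maintenance for DMA into A also hits the start of B.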

Cheers,
Herbert Xu April 17, 2022, 8:11 a.m. UTC | #54
On Fri, Apr 15, 2022 at 01:31:32PM +0100, Catalin Marinas wrote:
>
> This needs a clarification. For the above structure, kmalloc() will
> return a 128-byte aligned pointer since sizeof(x) is a multiple of 128.
> The potential problem is if you have something like:
> 
> 	kmalloc(sizeof(struct x) + 64);
> 
> The above could end up as a kmalloc(192) which is available with an
> ARCH_KMALLOC_MINALIGN of 64. If that's a real use-case, I can change the
> slab patch to not create the 192 (or 48 if we go for an even smaller
> ARCH_KMALLOC_MINALIGN) caches and we'd always have ARCH_DMA_MINALIGN
> guarantee if the structure itself is correctly aligned. No lying to the
> compiler.

Yes I suppose that should work:

1) Enlarge each crypto API object so that they're >= 128 bytes;
2) Modify kmalloc so that for sizes >= 128 bytes they're padded
to multiples of 128.

But it really feels like a hack.

Cheers,
Catalin Marinas April 17, 2022, 8:31 a.m. UTC | #55
On Sun, Apr 17, 2022 at 04:08:22PM +0800, Herbert Xu wrote:
> On Fri, Apr 15, 2022 at 01:38:15PM +0200, Ard Biesheuvel wrote:
> > If this is the case, things are already broken today. We never take
> > ARCH_DMA_MINALIGN into account when adding the driver ctx size to the
> > overall allocation size.
> 
> No it's not broken because kmalloc guarantees alignment.  For
> example, if ARCH_DMA_MINALIGN is 128 bytes, then kmalloc will
> always return a pointer that's 128-byte aligned.  That guarantees
> this object and the next object are on different cache-lines.
> 
> If you reduce the kmalloc minimum alignment to 64 bytes, then
> the two neighbouring objects can share cache-lines, even if
> each object is bigger than 128 bytes (e.g., if they were 192
> bytes each).

Not with my series, the non-sharing of cache lines is preserved.
kmalloc() still returns objects aligned to a cache-line.
ARCH_DMA_MINALIGN was chosen as the cover-all value for all SoCs
supported but I want to reduce the kmalloc() alignment to a cache line
size if a platform has a cache line smaller than ARCH_DMA_MINALIGN (most
arm64 SoCs have a cache line of 64 bytes rather than 128).
Herbert Xu April 17, 2022, 8:35 a.m. UTC | #56
On Sun, Apr 17, 2022 at 09:31:09AM +0100, Catalin Marinas wrote:
>
> Not with my series, the non-sharing of cache lines is preserved.
> kmalloc() still returns objects aligned to a cache-line.
> ARCH_DMA_MINALIGN was chosen as the cover-all value for all SoCs
> supported but I want to reduce the kmalloc() alignment to a cache line
> size if a platform has a cache line smaller than ARCH_DMA_MINALIGN (most
> arm64 SoCs have a cache line of 64 bytes rather than 128).

OK, but then you don't need to play with CRYPTO_MINALIGN at all,
right? All you need to do is add the padding between the Crypto
API fields and the context structure, right?

Cheers,
Catalin Marinas April 17, 2022, 8:38 a.m. UTC | #57
On Sun, Apr 17, 2022 at 04:11:22PM +0800, Herbert Xu wrote:
> On Fri, Apr 15, 2022 at 01:31:32PM +0100, Catalin Marinas wrote:
> >
> > This needs a clarification. For the above structure, kmalloc() will
> > return a 128-byte aligned pointer since sizeof(x) is a multiple of 128.
> > The potential problem is if you have something like:
> > 
> > 	kmalloc(sizeof(struct x) + 64);
> > 
> > The above could end up as a kmalloc(192) which is available with an
> > ARCH_KMALLOC_MINALIGN of 64. If that's a real use-case, I can change the
> > slab patch to not create the 192 (or 48 if we go for an even smaller
> > ARCH_KMALLOC_MINALIGN) caches and we'd always have ARCH_DMA_MINALIGN
> > guarantee if the structure itself is correctly aligned. No lying to the
> > compiler.
> 
> Yes I suppose that should work:
> 
> 1) Enlarge each crypto API object so that they're >= 128 bytes;

I don't think we need to do anything here. A structure like:

struct x {
	char y;
	char z[] CRYPTO_MINALIGN_ATTR;
};

is already of size 128. Without CRYPTO_MINALIGN_ATTR, its size would be
1 but otherwise the whole structure inherits the alignment of its
member and this translates into an aligned size.

> 2) Modify kmalloc so that for sizes >= 128 bytes they're padded
> to multiples of 128.

This doesn't look like a hack, we want to honour the power of 2
alignments.
Herbert Xu April 17, 2022, 8:43 a.m. UTC | #58
On Sun, Apr 17, 2022 at 09:38:40AM +0100, Catalin Marinas wrote:
> 
> I don't think we need to do anything here. A structure like:
> 
> struct x {
> 	char y;
> 	char z[] CRYPTO_MINALIGN_ATTR;
> };
> 
> is already of size 128. Without CRYPTO_MINALIGN_ATTR, its size would be
> 1 but otherwise the whole structure inherits the alignment of its
> member and this translates into an aligned size.

No, we should not lie to the compiler; we have code elsewhere
that uses the alignment to compute the amount of extra padding
needed to achieve a greater alignment.  If CRYPTO_MINALIGN is misleading
then that calculation will fall apart.

So keep CRYPTO_MINALIGN at whatever alignment you lower kmalloc
to, and then add the padding you need to separate the Crypto API
bits from the context.  In fact, that is exactly what cra_alignmask
is supposed to do.
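
For reference, the cra_alignmask-based padding is applied via helpers
roughly like this one:

static inline void *crypto_tfm_ctx_aligned(struct crypto_tfm *tfm)
{
	return PTR_ALIGN(crypto_tfm_ctx(tfm),
			 crypto_tfm_alg_alignmask(tfm) + 1);
}

i.e. the context pointer is aligned up inside the allocation rather
than relying on kmalloc() itself providing that alignment.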

Sure we currently limit the maximum alignment to 64 bytes because
of stack usage but we can certainly look into increasing it as
that's what you're doing here anyway.

Cheers,
Catalin Marinas April 17, 2022, 8:50 a.m. UTC | #59
On Sun, Apr 17, 2022 at 04:35:58PM +0800, Herbert Xu wrote:
> On Sun, Apr 17, 2022 at 09:31:09AM +0100, Catalin Marinas wrote:
> >
> > Not with my series, the non-sharing of cache lines is preserved.
> > kmalloc() still returns objects aligned to a cache-line.
> > ARCH_DMA_MINALIGN was chosen as the cover-all value for all SoCs
> > supported but I want to reduce the kmalloc() alignment to a cache line
> > size if a platform has a cache line smaller than ARCH_DMA_MINALIGN (most
> > arm64 SoCs have a cache line of 64 bytes rather than 128).
> 
> OK, but then you don't need to play with CRYPTO_MINALIGN at all,
> right? All you need to do is add the padding between the Crypto
> API fields and the context structure, right?

Right, if that's what you prefer. Something like:

diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index 2324ab6f1846..bb645b2f2718 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -645,7 +645,7 @@ struct crypto_tfm {
 	
 	struct crypto_alg *__crt_alg;
 
-	void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
+	void *__crt_ctx[] __aligned(ARCH_DMA_MINALIGN);
 };

But once we do that, are there any other CRYPTO_MINALIGN left around?
Herbert Xu April 17, 2022, 8:58 a.m. UTC | #60
On Sun, Apr 17, 2022 at 09:50:50AM +0100, Catalin Marinas wrote:
>
> Right, if that's what you prefer. Something like:
> 
> diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> index 2324ab6f1846..bb645b2f2718 100644
> --- a/include/linux/crypto.h
> +++ b/include/linux/crypto.h
> @@ -645,7 +645,7 @@ struct crypto_tfm {
>  	
>  	struct crypto_alg *__crt_alg;
>  
> -	void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
> +	void *__crt_ctx[] __aligned(ARCH_DMA_MINALIGN);
>  };
> 
> But once we do that, are there any other CRYPTO_MINALIGN left around?

This is still implying the whole structure is aligned to the given
value, which it is not.

Please just add the padding as needed.

Cheers,
Catalin Marinas April 17, 2022, 4:29 p.m. UTC | #61
On Sun, Apr 17, 2022 at 04:43:33PM +0800, Herbert Xu wrote:
> On Sun, Apr 17, 2022 at 09:38:40AM +0100, Catalin Marinas wrote:
> > I don't think we need to do anything here. A structure like:
> > 
> > struct x {
> > 	char y;
> > 	char z[] CRYPTO_MINALIGN_ATTR;
> > };
> > 
> > is already of size 128. Without CRYPTO_MINALIGN_ATTR, its size would be
> > 1 but otherwise the whole structure inherits the alignment of its
> > member and this translates into an aligned size.
> 
> No we should not lie to the compiler,

We won't if we ensure that a structure with sizeof() >= 128 is aligned
to 128.

> we have code elsewhere
> that uses the alignment to compute the amount of extra padding
> needed to create greater padding.  If CRYPTO_MINALIGN is misleading
> then that calculation will fall apart.

There is no direct CRYPTO_MINALIGN use for any extra padding AFAICT.
There is an indirect use via __alignof__(ctx) like in
crypto_tfm_ctx_alignment() but IIUC in that case CRYPTO_MINALIGN is a
statement of what you want rather than what you get from kmalloc(). So
if you want 128 alignment of tfm->__crt_ctx, you should say so by either
changing the attribute to __aligned(ARCH_DMA_MINALIGN) or keeping
CRYPTO_MINALIGN as 128.
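
(For reference, crypto_tfm_ctx_alignment() is currently just:

static inline unsigned int crypto_tfm_ctx_alignment(void)
{
	struct crypto_tfm *tfm;
	return __alignof__(tfm->__crt_ctx);
}

so it reports whatever __alignof__ the __crt_ctx member claims, not what
kmalloc() actually returned.)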

There is the union padding that Ard suggested but I don't think it buys
us much: the __aligned(ARCH_DMA_MINALIGN) gives you the padding and the
kmalloc() rules determine the alignment (subject to removing kmalloc-192). The
code that allocates these would need to place the structure aligned
anyway, irrespective of whether we use the padding or the
__aligned(ARCH_DMA_MINALIGN).

> So keep CRYPTO_MINALIGN at whatever alignment you lower kmalloc
> to, and then add the padding you need to separate the Crypto API
> bits from the context.  In fact, that is exactly what cra_alignmask
> is supposed to do.

I disagree on the cra_alignmask intention here, though I may be wrong as
I did not write the code. Yes, you could make it ARCH_DMA_MINALIGN
everywhere but IMHO that's not what it is supposed to do. The driver
only knows about the bus master alignment requirements (typically
smaller than a cache line) while the architecture-defined
ARCH_DMA_MINALIGN cares about the non-coherent DMA requirements.

> Sure we currently limit the maximum alignment to 64 bytes because
> of stack usage but we can certainly look into increasing it as
> that's what you're doing here anyway.

I'm not actually increasing CRYPTO_MINALIGN, just trying to keep it as
the current value of 128 for arm64.
Catalin Marinas April 17, 2022, 4:30 p.m. UTC | #62
On Sun, Apr 17, 2022 at 04:58:29PM +0800, Herbert Xu wrote:
> On Sun, Apr 17, 2022 at 09:50:50AM +0100, Catalin Marinas wrote:
> >
> > Right, if that's what you prefer. Something like:
> > 
> > diff --git a/include/linux/crypto.h b/include/linux/crypto.h
> > index 2324ab6f1846..bb645b2f2718 100644
> > --- a/include/linux/crypto.h
> > +++ b/include/linux/crypto.h
> > @@ -645,7 +645,7 @@ struct crypto_tfm {
> >  	
> >  	struct crypto_alg *__crt_alg;
> >  
> > -	void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
> > +	void *__crt_ctx[] __aligned(ARCH_DMA_MINALIGN);
> >  };
> > 
> > But once we do that, are there any other CRYPTO_MINALIGN left around?
> 
> This is still implying the whole structure is aligned to the given
> value, which it is not.
> 
> Please just add the padding as needed.

Do you mean as per Ard's proposal here:

https://lore.kernel.org/r/CAMj1kXH0x5Va7Wgs+mU1ONDwwsazOBuN4z4ihVzO2uG-n41Kbg@mail.gmail.com

struct crypto_request {
	union {
		struct {
			... fields ...
		};
		u8 __padding[ARCH_DMA_MINALIGN];
	};
	void __ctx[]  __aligned(CRYPTO_MINALIGN);
};

If CRYPTO_MINALIGN is lowered to, say, 8 (to be the same as lowest
ARCH_KMALLOC_MINALIGN), the __alignof__(req->__ctx) would be 8.
Functions like crypto_tfm_ctx_alignment() will return 8 when what you
need is 128. We can change those functions to return ARCH_DMA_MINALIGN
instead or always bump cra_alignmask to ARCH_DMA_MINALIGN-1.
Herbert Xu April 18, 2022, 8:37 a.m. UTC | #63
On Sun, Apr 17, 2022 at 05:30:27PM +0100, Catalin Marinas wrote:
>
> Do you mean as per Ard's proposal here:
> 
> https://lore.kernel.org/r/CAMj1kXH0x5Va7Wgs+mU1ONDwwsazOBuN4z4ihVzO2uG-n41Kbg@mail.gmail.com
> 
> struct crypto_request {
> 	union {
> 		struct {
> 			... fields ...
> 		};
> 		u8 __padding[ARCH_DMA_MINALIGN];
> 	};
> 	void __ctx[]  __aligned(CRYPTO_MINALIGN);
> };
> 
> If CRYPTO_MINALIGN is lowered to, say, 8 (to be the same as lowest
> ARCH_KMALLOC_MINALIGN), the __alignof__(req->__ctx) would be 8.
> Functions like crypto_tfm_ctx_alignment() will return 8 when what you
> need is 128. We can change those functions to return ARCH_DMA_MINALIGN
> instead or always bump cra_alignmask to ARCH_DMA_MINALIGN-1.

OK, at this point I think we need to let the code do the talking :)

I've seen Ard's patches already and I think I understand what your
needs are.  So let me whip up some code to show you guys what I
think needs to be done.

Please bear with me for a few days.

Thanks,
Catalin Marinas April 18, 2022, 9:19 a.m. UTC | #64
On Mon, Apr 18, 2022 at 04:37:17PM +0800, Herbert Xu wrote:
> On Sun, Apr 17, 2022 at 05:30:27PM +0100, Catalin Marinas wrote:
> >
> > Do you mean as per Ard's proposal here:
> > 
> > https://lore.kernel.org/r/CAMj1kXH0x5Va7Wgs+mU1ONDwwsazOBuN4z4ihVzO2uG-n41Kbg@mail.gmail.com
> > 
> > struct crypto_request {
> > 	union {
> > 		struct {
> > 			... fields ...
> > 		};
> > 		u8 __padding[ARCH_DMA_MINALIGN];
> > 	};
> > 	void __ctx[]  __aligned(CRYPTO_MINALIGN);
> > };
> > 
> > If CRYPTO_MINALIGN is lowered to, say, 8 (to be the same as lowest
> > ARCH_KMALLOC_MINALIGN), the __alignof__(req->__ctx) would be 8.
> > Functions like crypto_tfm_ctx_alignment() will return 8 when what you
> > need is 128. We can change those functions to return ARCH_DMA_MINALIGN
> > instead or always bump cra_alignmask to ARCH_DMA_MINALIGN-1.
> 
> OK, at this point I think we need to let the code do the talking :)
> 
> I've seen Ard's patches already and I think I understand what your
> needs are.  So let me whip up some code to show you guys what I
> think needs to be done.
> 
> Please bear with me for a few days.

Thanks Herbert, that's great. Whenever you have time, I'll be busy this
week with collecting arm64 patches for the upcoming merging window
anyway.
Catalin Marinas April 18, 2022, 4:44 p.m. UTC | #65
On Mon, Apr 18, 2022 at 04:37:17PM +0800, Herbert Xu wrote:
> On Sun, Apr 17, 2022 at 05:30:27PM +0100, Catalin Marinas wrote:
> > Do you mean as per Ard's proposal here:
> > 
> > https://lore.kernel.org/r/CAMj1kXH0x5Va7Wgs+mU1ONDwwsazOBuN4z4ihVzO2uG-n41Kbg@mail.gmail.com
> > 
> > struct crypto_request {
> > 	union {
> > 		struct {
> > 			... fields ...
> > 		};
> > 		u8 __padding[ARCH_DMA_MINALIGN];
> > 	};
> > 	void __ctx[]  __aligned(CRYPTO_MINALIGN);
> > };
> > 
> > If CRYPTO_MINALIGN is lowered to, say, 8 (to be the same as lowest
> > ARCH_KMALLOC_MINALIGN), the __alignof__(req->__ctx) would be 8.
> > Functions like crypto_tfm_ctx_alignment() will return 8 when what you
> > need is 128. We can change those functions to return ARCH_DMA_MINALIGN
> > instead or always bump cra_alignmask to ARCH_DMA_MINALIGN-1.
> 
> OK, at this point I think we need to let the code do the talking :)
> 
> I've seen Ard's patches already and I think I understand what your
> needs are.  So let me whip up some code to show you guys what I
> think needs to be done.

BTW before you have a go at this, there's also Linus' idea that does not
change the crypto code (at least not functionally). Of course, you and
Ard can still try to figure out how to reduce the padding but if we go
with Linus' idea of a new GFP_NODMA flag, there won't be any changes to
the crypto code as long as it doesn't pass such flag. So, the options:

1. Change ARCH_KMALLOC_MINALIGN to 8 (or ARCH_SLAB_MINALIGN if higher)
   while keeping ARCH_DMA_MINALIGN to 128. By default kmalloc() will
   honour the 128-byte alignment, unless GFP_NODMA is passed. This still
   requires changing CRYPTO_MINALIGN to ARCH_DMA_MINALIGN but there is
   no functional change, kmalloc() without the new flag will return
   CRYPTO_MINALIGN-aligned pointers.

2. Leave ARCH_KMALLOC_MINALIGN as ARCH_DMA_MINALIGN (128) and introduce
   a new GFP_PACKED (I think it fits better than 'NODMA') flag that
   reduces the minimum kmalloc() below ARCH_KMALLOC_MINALIGN (and
   probably at least ARCH_SLAB_MINALIGN). It's equivalent to (1) but
   does not touch the crypto code at all.

(1) and (2) are the same, just minor naming difference. Happy to go with
any of them. They still have the downside that we need to add the new
GFP_ flag to those hotspots that allocate small objects (Arnd provided
an idea on how to find them with ftrace) but at least we know it won't
inadvertently break anything.
Ard Biesheuvel April 19, 2022, 9:50 p.m. UTC | #66
On Mon, 18 Apr 2022 at 18:44, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Mon, Apr 18, 2022 at 04:37:17PM +0800, Herbert Xu wrote:
> > On Sun, Apr 17, 2022 at 05:30:27PM +0100, Catalin Marinas wrote:
> > > Do you mean as per Ard's proposal here:
> > >
> > > https://lore.kernel.org/r/CAMj1kXH0x5Va7Wgs+mU1ONDwwsazOBuN4z4ihVzO2uG-n41Kbg@mail.gmail.com
> > >
> > > struct crypto_request {
> > >     union {
> > >             struct {
> > >                     ... fields ...
> > >             };
> > >             u8 __padding[ARCH_DMA_MINALIGN];
> > >     };
> > >     void __ctx[]  __aligned(CRYPTO_MINALIGN);
> > > };
> > >
> > > If CRYPTO_MINALIGN is lowered to, say, 8 (to be the same as lowest
> > > ARCH_KMALLOC_MINALIGN), the __alignof__(req->__ctx) would be 8.
> > > Functions like crypto_tfm_ctx_alignment() will return 8 when what you
> > > need is 128. We can change those functions to return ARCH_DMA_MINALIGN
> > > instead or always bump cra_alignmask to ARCH_DMA_MINALIGN-1.
> >
> > OK, at this point I think we need to let the code do the talking :)
> >
> > I've seen Ard's patches already and I think I understand what your
> > needs are.  So let me whip up some code to show you guys what I
> > think needs to be done.
>
> BTW before you have a go at this, there's also Linus' idea that does not
> change the crypto code (at least not functionally). Of course, you and
> Ard can still try to figure out how to reduce the padding but if we go
> with Linus' idea of a new GFP_NODMA flag, there won't be any changes to
> the crypto code as long as it doesn't pass such flag. So, the options:
>
> 1. Change ARCH_KMALLOC_MINALIGN to 8 (or ARCH_SLAB_MINALIGN if higher)
>    while keeping ARCH_DMA_MINALIGN to 128. By default kmalloc() will
>    honour the 128-byte alignment, unless GDP_NODMA is passed. This still
>    requires changing CRYPTO_MINALIGN to ARCH_DMA_MINALIGN but there is
>    no functional change, kmalloc() without the new flag will return
>    CRYPTO_MINALIGN-aligned pointers.
>
> 2. Leave ARCH_KMALLOC_MINALIGN as ARCH_DMA_MINALIGN (128) and introduce
>    a new GFP_PACKED (I think it fits better than 'NODMA') flag that
>    reduces the minimum kmalloc() below ARCH_KMALLOC_MINALIGN (and
>    probably at least ARCH_SLAB_MINALIGN). It's equivalent to (1) but
>    does not touch the crypto code at all.
>
> (1) and (2) are the same, just minor naming difference. Happy to go with
> any of them. They still have the downside that we need to add the new
> GFP_ flag to those hotspots that allocate small objects (Arnd provided
> an idea on how to find them with ftrace) but at least we know it won't
> inadvertently break anything.
>

I'm not sure GFP_NODMA adds much here.

The way I see it, the issue in the crypto code is that we are relying
on an ARCH_KMALLOC_MINALIGN-aligned zero-length __ctx[] array for three
different things:
- ensuring/informing the compiler that top-level request/TFM
structures are aligned to ARCH_KMALLOC_MINALIGN,
- adding padding to ensure that driver context structures that are
embedded in those top-level request/TFM structures are sufficiently
aligned so that any member C types appear at the expected alignment
(but those structures are not usually defined as being aligned to
ARCH_KMALLOC_MINALIGN)
- adding padding to ensure that these driver context structures do not
share cache lines with the preceding top-level struct.

One thing to note here is that the padding is only necessary when the
driver context size > 0, and has nothing to do with the alignment of
the top-level struct.

Using a single aligned ctx member was a nice way to accomplish all of
these when it was introduced, but I think it might be better to get
rid of it, and move the padding logic to the static inline helpers
instead.

So something like

struct skcipher_request {
  ...
} CRYPTO_MINALIGN_ATTR;

which states/ensures the alignment of the struct, and

void *skcipher_request_ctx(struct skcipher_request *req) {
  return (void *)PTR_ALIGN(req + 1, ARCH_DMA_MINALIGN);
}

to get at the context struct, instead of using a struct field.

Then, we could update skcipher_request_alloc() to only round up
sizeof(struct skcipher_request) to ARCH_DMA_MINALIGN if the reqsize is
>0 to begin with, and if it is, to also round reqsize up to
ARCH_DMA_MINALIGN when accessed. That way, we spell out the DMA
padding requirements without relying on aligned struct members.
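
A sketch of what that could look like (illustrative only, not a real
patch; it assumes ARCH_DMA_MINALIGN is defined here):

static inline struct skcipher_request *skcipher_request_alloc(
	struct crypto_skcipher *tfm, gfp_t gfp)
{
	unsigned int reqsize = crypto_skcipher_reqsize(tfm);
	size_t len = sizeof(struct skcipher_request);
	struct skcipher_request *req;

	/* Only pay for the DMA padding when there is a driver context. */
	if (reqsize)
		len = ALIGN(len, ARCH_DMA_MINALIGN) +
		      ALIGN(reqsize, ARCH_DMA_MINALIGN);

	req = kmalloc(len, gfp);
	if (likely(req))
		skcipher_request_set_tfm(req, tfm);

	return req;
}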

If we do it this way, we have a clear distinction between expectations
about what kmalloc returns in terms of alignment, and adding padding
to influence the placement of the context struct. It also makes it
easier to either apply the changes I proposed in the series I sent out
a couple of weeks ago, or get rid of DMA alignment for request/TFM
structures altogether, if we manage to identify and fix the drivers
that are relying on this. In any case, it decouples these two things
in a way that allows Catalin to make his kmalloc changes without
having to redefine CRYPTO_MINALIGN to ARCH_DMA_MINALIGN.
Catalin Marinas April 20, 2022, 10:36 a.m. UTC | #67
On Tue, Apr 19, 2022 at 11:50:11PM +0200, Ard Biesheuvel wrote:
> On Mon, 18 Apr 2022 at 18:44, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Apr 18, 2022 at 04:37:17PM +0800, Herbert Xu wrote:
> > > I've seen Ard's patches already and I think I understand what your
> > > needs are.  So let me whip up some code to show you guys what I
> > > think needs to be done.
> >
> > BTW before you have a go at this, there's also Linus' idea that does not
> > change the crypto code (at least not functionally). Of course, you and
> > Ard can still try to figure out how to reduce the padding but if we go
> > with Linus' idea of a new GFP_NODMA flag, there won't be any changes to
> > the crypto code as long as it doesn't pass such flag. So, the options:
> >
> > 1. Change ARCH_KMALLOC_MINALIGN to 8 (or ARCH_SLAB_MINALIGN if higher)
> >    while keeping ARCH_DMA_MINALIGN to 128. By default kmalloc() will
> >    honour the 128-byte alignment, unless GDP_NODMA is passed. This still
> >    requires changing CRYPTO_MINALIGN to ARCH_DMA_MINALIGN but there is
> >    no functional change, kmalloc() without the new flag will return
> >    CRYPTO_MINALIGN-aligned pointers.
> >
> > 2. Leave ARCH_KMALLOC_MINALIGN as ARCH_DMA_MINALIGN (128) and introduce
> >    a new GFP_PACKED (I think it fits better than 'NODMA') flag that
> >    reduces the minimum kmalloc() below ARCH_KMALLOC_MINALIGN (and
> >    probably at least ARCH_SLAB_MINALIGN). It's equivalent to (1) but
> >    does not touch the crypto code at all.
> >
> > (1) and (2) are the same, just minor naming difference. Happy to go with
> > any of them. They still have the downside that we need to add the new
> > GFP_ flag to those hotspots that allocate small objects (Arnd provided
> > an idea on how to find them with ftrace) but at least we know it won't
> > inadvertently break anything.
> 
> I'm not sure GFP_NODMA adds much here.

What it buys is that the crypto code won't need to be changed
immediately, so we can go ahead with the kmalloc() optimisation while
you and Herbert figure out how to refactor the crypto code.

Another option is to make the change but pass a new GFP_DMA_MINALIGN
flag to all kmalloc() calls in the crypto code (or a new dma_kmalloc()
that Linus suggested). This way the allocations in the crypto code are
guaranteed to be ARCH_DMA_MINALIGN (we'd still change CRYPTO_MINALIGN to
this).

> The way I see it, the issue in the crypto code is that we are relying
> on a ARCH_KMALLOC_ALIGN aligned zero length __ctx[] array for three
> different things:
> - ensuring/informing the compiler that top-level request/TFM
> structures are aligned to ARCH_KMALLOC_ALIGN,
> - adding padding to ensure that driver context structures that are
> embedded in those top-level request/TFM structures are sufficiently
> aligned so that any member C types appear at the expected alignment
> (but those structures are not usually defined as being aligned to
> ARCH_KMALLOC_ALIGN)

In this case, a void * array type would cover most structures that don't
have special alignment needs.

In both the above cases, we can get ARCH_KMALLOC_MINALIGN down to 8.

> - adding padding to ensure that these driver context structures do not
> share cache lines with the preceding top-level struct.
> 
> One thing to note here is that the padding is only necessary when the
> driver context size > 0, and has nothing to do with the alignment of
> the top-level struct.
> 
> Using a single aligned ctx member was a nice way to accomplish all of
> these when it was introduced, but I think it might be better to get
> rid of it, and move the padding logic to the static inline helpers
> instead.
> 
> So something like
> 
> struct skcipher_request {
>   ...
> } CRYPTO_MINALIGN_ATTR;
> 
> which states/ensures the alignment of the struct, and
> 
> void *skcipher_request_ctx(struct skcipher_request *req) {
>   return (void *)PTR_ALIGN(req + 1, ARCH_DMA_MINALIGN);
> }
> 
> to get at the context struct, instead of using a struct field.
> 
> Then, we could update skcipher_request_alloc() to only round up
> sizeof(struct skipher_request) to ARCH_DMA_MINALIGN if the reqsize is
> >0 to begin with, and if it is, to also round reqsize up to
> ARCH_DMA_MINALIGN when accessed. That way, we spell out the DMA
> padding requirements with relying on aligned struct members.

I think this should work and CRYPTO_MINALIGN can go down to whatever
ARCH_KMALLOC_MINALIGN without breaking the non-coherent DMA.
Arnd Bergmann April 20, 2022, 11:29 a.m. UTC | #68
On Tue, Apr 19, 2022 at 11:50 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> On Mon, 18 Apr 2022 at 18:44, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Apr 18, 2022 at 04:37:17PM +0800, Herbert Xu wrote:

> > BTW before you have a go at this, there's also Linus' idea that does not
> > change the crypto code (at least not functionally). Of course, you and
> > Ard can still try to figure out how to reduce the padding but if we go
> > with Linus' idea of a new GFP_NODMA flag, there won't be any changes to
> > the crypto code as long as it doesn't pass such flag. So, the options:
> >
> > 1. Change ARCH_KMALLOC_MINALIGN to 8 (or ARCH_SLAB_MINALIGN if higher)
> >    while keeping ARCH_DMA_MINALIGN to 128. By default kmalloc() will
> >    honour the 128-byte alignment, unless GDP_NODMA is passed. This still
> >    requires changing CRYPTO_MINALIGN to ARCH_DMA_MINALIGN but there is
> >    no functional change, kmalloc() without the new flag will return
> >    CRYPTO_MINALIGN-aligned pointers.
> >
> > 2. Leave ARCH_KMALLOC_MINALIGN as ARCH_DMA_MINALIGN (128) and introduce
> >    a new GFP_PACKED (I think it fits better than 'NODMA') flag that
> >    reduces the minimum kmalloc() below ARCH_KMALLOC_MINALIGN (and
> >    probably at least ARCH_SLAB_MINALIGN). It's equivalent to (1) but
> >    does not touch the crypto code at all.
> >
> > (1) and (2) are the same, just minor naming difference. Happy to go with
> > any of them. They still have the downside that we need to add the new
> > GFP_ flag to those hotspots that allocate small objects (Arnd provided
> > an idea on how to find them with ftrace) but at least we know it won't
> > inadvertently break anything.

Right, both of these seem reasonable to me.

> I'm not sure GFP_NODMA adds much here.
>
> The way I see it, the issue in the crypto code is that we are relying
> on a ARCH_KMALLOC_ALIGN aligned zero length __ctx[] array for three
> different things:
...

Right. So as long as the crypto subsystem has additional alignment
requirements, it won't benefit from GFP_NODMA. For everything else,
GFP_NODMA would however have a very direct and measurable
impact on memory consumption.

Your proposed changes to the crypto subsystem all seem helpful
as well, just mostly orthogonal to the savings elsewhere. I don't know
how much memory is permanently tied up in overaligned crypto
data structures, but my guess is that it's not a lot on most systems.

Improving the alignment for crypto would however likely help
with stack usage on on-stack structures, and with performance
when the amount of ctx memory to clear for each operation
becomes smaller.

       Arnd
Catalin Marinas April 20, 2022, 7:07 p.m. UTC | #69
On Thu, Apr 14, 2022 at 03:25:59PM -0700, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 12:49 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
> > I want to at least get it down to 64 with this series while preserving
> > the current kmalloc() semantics.
> 
> So here's a thought - maybe we could do the reverse of GFP_DMA, and
> add a flag to the places that want small allocations and know they
> don't need DMA?

Quick diff below. I'll test it some more with all sl*b and post a proper
patch with description tomorrow. But the basic idea is that
ARCH_KMALLOC_MINALIGN remains the same as ARCH_DMA_MINALIGN so that I
don't have to change existing users. KMALLOC_MIN_SIZE is decoupled from
ARCH_KMALLOC_MINALIGN and now we have caches all the way to kmalloc-8
(with slub). Callers would have to pass __GFP_PACKED to get an object
with alignment below ARCH_KMALLOC_MINALIGN.

Without any kmalloc() callers modified, the kmalloc caches look like
(only booted in a VM, not much activity):

kmalloc-128        12055  12096    128   32
kmalloc-96             0      0     96   42
kmalloc-64             0      0     64   64
kmalloc-32             0      0     32  128
kmalloc-16             0      0     16  256
kmalloc-8              0      0      8  512

With kstrdup() modified to pass __GFP_PACKED (as per the last hunk in
the diff below), I get just after boot:

kmalloc-128         8966   9056    128   32
kmalloc-96             0      0     96   42
kmalloc-64           192    192     64   64
kmalloc-32           768    768     32  128
kmalloc-16          2048   2048     16  256
kmalloc-8           2560   2560      8  512

So that's probably the simplest approach and using the ftrace histogram
we can add the flag to more places.

--------------------8<-----------------------------------------------
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 761f8f1885c7..7c9f47ef3a53 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -63,8 +63,9 @@ struct vm_area_struct;
 #define ___GFP_SKIP_KASAN_UNPOISON	0
 #define ___GFP_SKIP_KASAN_POISON	0
 #endif
+#define ___GFP_PACKED		0x8000000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x8000000u
+#define ___GFP_NOLOCKDEP	0x10000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -251,6 +252,10 @@ struct vm_area_struct;
  *
  * %__GFP_SKIP_KASAN_POISON makes KASAN skip poisoning on page deallocation.
  * Typically, used for userspace pages. Only effective in HW_TAGS mode.
+ *
+ * %__GFP_PACKED returns a pointer aligned to the smaller KMALLOC_MIN_SIZE
+ * rather than ARCH_KMALLOC_MINALIGN. Beneficial for small object allocation
+ * on architectures that define ARCH_DMA_MINALIGN.
  */
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
@@ -259,12 +264,13 @@ struct vm_area_struct;
 #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
 #define __GFP_SKIP_KASAN_UNPOISON ((__force gfp_t)___GFP_SKIP_KASAN_UNPOISON)
 #define __GFP_SKIP_KASAN_POISON   ((__force gfp_t)___GFP_SKIP_KASAN_POISON)
+#define __GFP_PACKED	((__force gfp_t)___GFP_PACKED)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (28 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 373b3ef99f4e..7bd3a33cdb9d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -194,8 +194,6 @@ void kmem_dump_obj(void *object);
  */
 #if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
 #define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
-#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
-#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
 #else
 #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
 #endif
@@ -364,12 +362,14 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
  * Callers where !size_is_constant should only be test modules, where runtime
  * overheads of __kmalloc_index() can be tolerated.  Also see kmalloc_slab().
  */
-static __always_inline unsigned int __kmalloc_index(size_t size,
+static __always_inline unsigned int __kmalloc_index(size_t size, gfp_t flags,
 						    bool size_is_constant)
 {
 	if (!size)
 		return 0;
 
+	if (ARCH_KMALLOC_MINALIGN > KMALLOC_MIN_SIZE && !(flags & __GFP_PACKED))
+		size = ALIGN(size, ARCH_KMALLOC_MINALIGN);
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
@@ -409,7 +409,7 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
 	/* Will never be reached. Needed because the compiler may complain */
 	return -1;
 }
-#define kmalloc_index(s) __kmalloc_index(s, true)
+#define kmalloc_index(s, f) __kmalloc_index(s, f, true)
 #endif /* !CONFIG_SLOB */
 
 void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
@@ -573,7 +573,7 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
 		if (size > KMALLOC_MAX_CACHE_SIZE)
 			return kmalloc_large(size, flags);
 #ifndef CONFIG_SLOB
-		index = kmalloc_index(size);
+		index = kmalloc_index(size, flags);
 
 		if (!index)
 			return ZERO_SIZE_PTR;
@@ -591,7 +591,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
 		size <= KMALLOC_MAX_CACHE_SIZE) {
-		unsigned int i = kmalloc_index(size);
+		unsigned int i = kmalloc_index(size, flags);
 
 		if (!i)
 			return ZERO_SIZE_PTR;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6ee64d6208b3..d5da402c8aae 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -630,7 +630,7 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
 		unsigned int useroffset, unsigned int usersize)
 {
 	int err;
-	unsigned int align = ARCH_KMALLOC_MINALIGN;
+	unsigned int align = KMALLOC_MIN_SIZE;
 
 	s->name = name;
 	s->size = s->object_size = size;
@@ -722,6 +722,8 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 {
 	unsigned int index;
 
+	if (ARCH_KMALLOC_MINALIGN > KMALLOC_MIN_SIZE && !(flags & __GFP_PACKED))
+		size = ALIGN(size, ARCH_KMALLOC_MINALIGN);
 	if (size <= 192) {
 		if (!size)
 			return ZERO_SIZE_PTR;
diff --git a/mm/util.c b/mm/util.c
index 54e5e761a9a9..27eb1731de27 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -57,7 +57,7 @@ char *kstrdup(const char *s, gfp_t gfp)
 		return NULL;
 
 	len = strlen(s) + 1;
-	buf = kmalloc_track_caller(len, gfp);
+	buf = kmalloc_track_caller(len, gfp | __GFP_PACKED);
 	if (buf)
 		memcpy(buf, s, len);
 	return buf;
Linus Torvalds April 20, 2022, 7:33 p.m. UTC | #70
On Wed, Apr 20, 2022 at 12:08 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> With kstrdup() modified to pass __GFP_PACKED (as per the last hunk in
> the diff below), I get just after boot:
>
> kmalloc-128         8966   9056    128   32
> kmalloc-96             0      0     96   42
> kmalloc-64           192    192     64   64
> kmalloc-32           768    768     32  128
> kmalloc-16          2048   2048     16  256
> kmalloc-8           2560   2560      8  512
>
> So that's probably the simplest approach and using the ftrace histogram
> we can add the flag to more places.

I agree that this seems to be the safest thing to do, and maybe
__GFP_PACKED is a better flag name than __GFP_NODMA.

That said, it worries me a bit in that to me "PACKED" implies "no
alignment at all". And I could _easily_ see people still wanting to do
8-byte allocations that have 8-byte alignment because it's some kind
of private pointer thing or whatever.

For "kstrdup()", a flag like __GFP_PACKED makes 100% sense, since it
literally wants byte alignment.

But what about those "random small structures" cases?

Anyway, I'm perfectly happy calling it __GFP_PACKED, but at a minimum
document that "packed" in this case still means "__alignof__(unsigned
long long)" or something like that?

                      Linus
Christoph Hellwig April 21, 2022, 7:20 a.m. UTC | #71
Btw, there is another option:  Most real systems already require having
swiotlb to bounce buffer in some cases.  We could simply force bounce
buffering in the dma mapping code for too small or not properly aligned
transfers and just decrease the dma alignment.
Arnd Bergmann April 21, 2022, 7:36 a.m. UTC | #72
On Thu, Apr 21, 2022 at 9:20 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Btw, there is another option:  Most real systems already require having
> swiotlb to bounce buffer in some cases.  We could simply force bounce
> buffering in the dma mapping code for too small or not properly aligned
> transfers and just decrease the dma alignment.

I like the idea because these days we already rely on bounce buffering
for sub-page buffers in many iommu based cases for strict isolation
purposes, as well as most 64-bit machines that lack an iommu.

Does this work on all 32-bit architectures as well? I see that you added
swiotlb for ARM LPAE systems in 2019, but I don't know if that has any
additional requirements for the other 32-bit architectures that don't
select SWIOTLB today.

      Arnd
Christoph Hellwig April 21, 2022, 7:44 a.m. UTC | #73
On Thu, Apr 21, 2022 at 09:36:46AM +0200, Arnd Bergmann wrote:
> Does this work on all 32-bit architectures as well? I see that you added
> swiotlb for ARM LPASE systems in 2019, but I don't know if that has any
> additional requirements for the other 32-bit architectures that don't
> select SWIOTLB today.

We'll need to call swiotlb_init for all these cases, but there is no other
fundamental requirement.  We can probably do with a way smaller buffer
if bouncing is only needed for misaligned allocations.
Ard Biesheuvel April 21, 2022, 8:05 a.m. UTC | #74
On Thu, 21 Apr 2022 at 09:20, Christoph Hellwig <hch@infradead.org> wrote:
>
> Btw, there is another option:  Most real systems already require having
> swiotlb to bounce buffer in some cases.  We could simply force bounce
> buffering in the dma mapping code for too small or not properly aligned
> transfers and just decrease the dma alignment.

Strongly agree. As I pointed out before, we'd only need to do this for
misaligned, non-cache coherent inbound DMA, and we'd only have to
worry about performance regressions, not data corruption issues. And
given the natural alignment of block I/O, and the fact that network
drivers typically allocate and map their own RX buffers (which means
they could reasonably be fixed if a performance bottleneck pops up), I
think the risk for showstopper performance regressions is likely to be
acceptable.
Catalin Marinas April 21, 2022, 11:06 a.m. UTC | #75
On Thu, Apr 21, 2022 at 12:20:22AM -0700, Christoph Hellwig wrote:
> Btw, there is another option:  Most real systems already require having
> swiotlb to bounce buffer in some cases.  We could simply force bounce
> buffering in the dma mapping code for too small or not properly aligned
> transfers and just decrease the dma alignment.

We can force bounce if size is small but checking the alignment is
trickier. Normally the beginning of the buffer is aligned but the end is
at some sizeof() distance. We need to know whether the end is in a
kmalloc-128 cache and that requires reaching out to the slab internals.
That's doable and not expensive but it needs to be done for every small
size getting to the DMA API, something like (for mm/slub.c):

	folio = virt_to_folio(x);
	slab = folio_slab(folio);
	if (slab->slab_cache->align < ARCH_DMA_MINALIGN)
		... bounce ...

(and a bit different for mm/slab.c)
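
Fleshed out slightly (sketch only; dma_kmalloc_needs_bounce() is a
made-up name, the folio/slab helpers are mm-internal, and the kmalloc-192
case discussed below is ignored):

static bool dma_kmalloc_needs_bounce(const void *ptr, size_t size)
{
	struct folio *folio;
	struct slab *slab;

	if (size >= ARCH_DMA_MINALIGN)
		return false;

	folio = virt_to_folio(ptr);
	if (!folio_test_slab(folio))
		return false;

	slab = folio_slab(folio);
	return slab->slab_cache->align < ARCH_DMA_MINALIGN;
}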

If we scrap ARCH_DMA_MINALIGN altogether from arm64, we can check the
alignment against cache_line_size(), though I'd rather keep it for code
that wants to avoid bouncing and goes for this compile-time alignment.

I think we are down to four options (1 and 2 can be combined):

1. ARCH_DMA_MINALIGN == 128, dynamic arch_kmalloc_minalign() to reduce
   kmalloc() alignment to 64 on most arm64 SoC - this series.

2. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 128, add explicit
   __GFP_PACKED for small allocations. It can be combined with (1) so
   that allocations without __GFP_PACKED can still get 64-byte
   alignment.

3. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.

4. undef ARCH_DMA_MINALIGN, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.

(3) and (4) don't require histogram analysis. Between them, I have a
preference for (3) as it gives drivers a chance to avoid the bounce.

If (2) is feasible, we don't need to bother with any bouncing or
structure alignments, it's an opt-in by the driver/subsystem. However,
it may be tedious to analyse the hot spots. While there are a few
obvious places (kstrdup), I don't have access to a multitude of devices
that may exercise the drivers and subsystems.

With (3) the risk is someone complaining about performance or even
running out of swiotlb space on some SoCs (I guess the fall-back can be
another kmalloc() with an appropriate size).

I guess we can limit the choice to either (2) or (3). I have (2) already
(needs some more testing). I can attempt (3) and try to run it on some
real hardware to see the perf impact.
Arnd Bergmann April 21, 2022, 12:28 p.m. UTC | #76
On Thu, Apr 21, 2022 at 1:06 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Thu, Apr 21, 2022 at 12:20:22AM -0700, Christoph Hellwig wrote:
> > Btw, there is another option:  Most real systems already require having
> > swiotlb to bounce buffer in some cases.  We could simply force bounce
> > buffering in the dma mapping code for too small or not properly aligned
> > transfers and just decrease the dma alignment.
>
> We can force bounce if size is small but checking the alignment is
> trickier. Normally the beginning of the buffer is aligned but the end is
> at some sizeof() distance. We need to know whether the end is in a
> kmalloc-128 cache and that requires reaching out to the slab internals.
> That's doable and not expensive but it needs to be done for every small
> size getting to the DMA API, something like (for mm/slub.c):
>
>         folio = virt_to_folio(x);
>         slab = folio_slab(folio);
>         if (slab->slab_cache->align < ARCH_DMA_MINALIGN)
>                 ... bounce ...
>
> (and a bit different for mm/slab.c)

I think the decision to bounce or not can be based on the actual
cache line size at runtime, so most commonly 64 bytes on arm64,
even though the compile-time limit is 128 bytes.

We also know that larger slabs are all cacheline aligned, so simply
comparing the transfer size is enough to rule out most, in this case
any transfer larger than 96 bytes must come from the kmalloc-128
or larger cache, so that works like before.

For transfers <=96 bytes, the possibilities are:

1. kmalloc-32 or smaller, always needs to bounce
2. kmalloc-96, but at least one byte in a partial cache line,
    needs to bounce
3. kmalloc-64, may skip the bounce.
4. kmalloc-128 or larger, or not a slab cache but a partial
    transfer, may skip the bounce.

I would guess that the first case is the most common here,
so unless bouncing one or two cache lines is extremely
expensive, I don't expect it to be worth optimizing for the latter
two cases.

       Arnd
Catalin Marinas April 21, 2022, 1:25 p.m. UTC | #77
On Thu, Apr 21, 2022 at 02:28:45PM +0200, Arnd Bergmann wrote:
> On Thu, Apr 21, 2022 at 1:06 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Apr 21, 2022 at 12:20:22AM -0700, Christoph Hellwig wrote:
> > > Btw, there is another option:  Most real systems already require having
> > > swiotlb to bounce buffer in some cases.  We could simply force bounce
> > > buffering in the dma mapping code for too small or not properly aligned
> > > transfers and just decrease the dma alignment.
> >
> > We can force bounce if size is small but checking the alignment is
> > trickier. Normally the beginning of the buffer is aligned but the end is
> > at some sizeof() distance. We need to know whether the end is in a
> > kmalloc-128 cache and that requires reaching out to the slab internals.
> > That's doable and not expensive but it needs to be done for every small
> > size getting to the DMA API, something like (for mm/slub.c):
> >
> >         folio = virt_to_folio(x);
> >         slab = folio_slab(folio);
> >         if (slab->slab_cache->align < ARCH_DMA_MINALIGN)
> >                 ... bounce ...
> >
> > (and a bit different for mm/slab.c)
> 
> I think the decision to bounce or not can be based on the actual
> cache line size at runtime, so most commonly 64 bytes on arm64,
> even though the compile-time limit is 128 bytes.
> 
> We also know that larger slabs are all cacheline aligned, so simply
> comparing the transfer size is enough to rule out most, in this case
> any transfer larger than 96 bytes must come from the kmalloc-128
> or larger cache, so that works like before.

There's also the case with 128-byte cache lines and kmalloc-192.

> For transfers <=96 bytes, the possibilities are:
> 
> 1.kmalloc-32 or smaller, always needs to bounce
> 2. kmalloc-96, but at least one byte in partial cache line,
>     need to bounce
> 3. kmalloc-64, may skip the bounce.
> 4. kmalloc-128 or larger, or not a slab cache but a partial
>     transfer, may skip the bounce.
> 
> I would guess that the first case is the most common here,
> so unless bouncing one or two cache lines is extremely
> expensive, I don't expect it to be worth optimizing for the latter
> two cases.

I think so. If someone complains of a performance regression, we can
look at optimising the bounce. I have a suspicion the cost of copying
two cache lines is small compared to swiotlb_find_slots() etc.
Arnd Bergmann April 21, 2022, 1:47 p.m. UTC | #78
On Thu, Apr 21, 2022 at 3:25 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Apr 21, 2022 at 02:28:45PM +0200, Arnd Bergmann wrote:
> > We also know that larger slabs are all cacheline aligned, so simply
> > comparing the transfer size is enough to rule out most, in this case
> > any transfer larger than 96 bytes must come from the kmalloc-128
> > or larger cache, so that works like before.
>
> There's also the case with 128-byte cache lines and kmalloc-192.

Sure, but that's much less common, as the few machines with 128 byte
cache lines tend to also have cache coherent devices IIRC, so we'd
skip the bounce buffer entirely.

> > For transfers <=96 bytes, the possibilities are:
> >
> > 1.kmalloc-32 or smaller, always needs to bounce
> > 2. kmalloc-96, but at least one byte in partial cache line,
> >     need to bounce
> > 3. kmalloc-64, may skip the bounce.
> > 4. kmalloc-128 or larger, or not a slab cache but a partial
> >     transfer, may skip the bounce.
> >
> > I would guess that the first case is the most common here,
> > so unless bouncing one or two cache lines is extremely
> > expensive, I don't expect it to be worth optimizing for the latter
> > two cases.
>
> I think so. If someone complains of a performance regression, we can
> look at optimising the bounce. I have a suspicion the cost of copying
> two cache lines is small compared to swiotlb_find_slots() etc.

That is possible, and we'd definitely have to watch out for
performance regressions, I'm just skeptical that the cases that
suffer from the extra bounce buffering on 33..64 byte allocations
benefit much from having a special case if the 1..32 and 65..96
byte allocations are still slow.

Another simpler way to do this might be to just not create the
kmalloc-96 (or kmalloc-192) caches, and assuming that any
transfer >=33 (or 65) bytes is safe.

       Arnd
Catalin Marinas April 21, 2022, 2:44 p.m. UTC | #79
On Thu, Apr 21, 2022 at 03:47:30PM +0200, Arnd Bergmann wrote:
> On Thu, Apr 21, 2022 at 3:25 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Apr 21, 2022 at 02:28:45PM +0200, Arnd Bergmann wrote:
> > > We also know that larger slabs are all cacheline aligned, so simply
> > > comparing the transfer size is enough to rule out most, in this case
> > > any transfer larger than 96 bytes must come from the kmalloc-128
> > > or larger cache, so that works like before.
> >
> > There's also the case with 128-byte cache lines and kmalloc-192.
> 
> Sure, but that's much less common, as the few machines with 128 byte
> cache lines tend to also have cache coherent devices IIRC, so we'd
> skip the bounce buffer entirely.

Do you know which machines still have 128-byte cache lines _and_
non-coherent DMA? If there isn't any that matters, I'd reduce
ARCH_DMA_MINALIGN to 64 now (while trying to get to even smaller kmalloc
caches).

> > > For transfers <=96 bytes, the possibilities are:
> > >
> > > 1.kmalloc-32 or smaller, always needs to bounce
> > > 2. kmalloc-96, but at least one byte in partial cache line,
> > >     need to bounce
> > > 3. kmalloc-64, may skip the bounce.
> > > 4. kmalloc-128 or larger, or not a slab cache but a partial
> > >     transfer, may skip the bounce.
> > >
> > > I would guess that the first case is the most common here,
> > > so unless bouncing one or two cache lines is extremely
> > > expensive, I don't expect it to be worth optimizing for the latter
> > > two cases.
> >
> > I think so. If someone complains of a performance regression, we can
> > look at optimising the bounce. I have a suspicion the cost of copying
> > two cache lines is small compared to swiotlb_find_slots() etc.
> 
> That is possible, and we'd definitely have to watch out for
> performance regressions, I'm just skeptical that the cases that
> suffer from the extra bouncer buffering on 33..64 byte allocations
> benefit much from having a special case if the 1...32 and 65..96
> byte allocations are still slow.
> 
> Another simpler way to do this might be to just not create the
> kmalloc-96 (or kmalloc-192) caches, and assuming that any
> transfer >=33 (or 65) bytes is safe.

I'll give the dma bounce idea a go next week, see how it looks.
Arnd Bergmann April 21, 2022, 2:47 p.m. UTC | #80
On Thu, Apr 21, 2022 at 4:44 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Apr 21, 2022 at 03:47:30PM +0200, Arnd Bergmann wrote:
> > On Thu, Apr 21, 2022 at 3:25 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Thu, Apr 21, 2022 at 02:28:45PM +0200, Arnd Bergmann wrote:
> > > > We also know that larger slabs are all cacheline aligned, so simply
> > > > comparing the transfer size is enough to rule out most, in this case
> > > > any transfer larger than 96 bytes must come from the kmalloc-128
> > > > or larger cache, so that works like before.
> > >
> > > There's also the case with 128-byte cache lines and kmalloc-192.
> >
> > Sure, but that's much less common, as the few machines with 128 byte
> > cache lines tend to also have cache coherent devices IIRC, so we'd
> > skip the bounce buffer entirely.
>
> Do you know which machines still have 128-byte cache lines _and_
> non-coherent DMA? If there isn't any that matters, I'd reduce
> ARCH_DMA_MINALIGN to 64 now (while trying to get to even smaller kmalloc
> caches).

I think the last time this came up, someone pointed out one of the
Qualcomm Snapdragon phone chips with their custom cores.

        Arnd
Herbert Xu May 10, 2022, 11:03 a.m. UTC | #81
This patch series adds helpers to allow drivers to explicitly
request ARCH_DMA_MINALIGN when allocating memory through the
Crypto API.

Note that I've only converted one file in one driver as this
is only meant to show how it's done and find out what else we
may need.
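
For illustration, a rough sketch (hypothetical names, not necessarily
the helpers from the series) of what such an explicit opt-in could look
like for a driver that does non-coherent DMA into its transform context:

#ifdef ARCH_DMA_MINALIGN
#define CRYPTO_DMA_ALIGN	ARCH_DMA_MINALIGN
#else
#define CRYPTO_DMA_ALIGN	ARCH_KMALLOC_MINALIGN
#endif

/* Hypothetical helper: the tfm context aligned for non-coherent DMA. */
static inline void *crypto_tfm_ctx_dma(struct crypto_tfm *tfm)
{
	return PTR_ALIGN(crypto_tfm_ctx(tfm), CRYPTO_DMA_ALIGN);
}

A driver using something like this would also have to reserve extra
space (roughly CRYPTO_DMA_ALIGN - 1 bytes) in its cra_ctxsize so that
the aligned pointer still falls within the allocation.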

Thanks,
Isaac Manjarres July 15, 2022, 10:23 p.m. UTC | #82
On Sun, Apr 17, 2022 at 05:29:01PM +0100, Catalin Marinas wrote:
> On Sun, Apr 17, 2022 at 04:43:33PM +0800, Herbert Xu wrote:
> > On Sun, Apr 17, 2022 at 09:38:40AM +0100, Catalin Marinas wrote:
> > > I don't think we need to do anything here. A structure like:
> > > 
> > > struct x {
> > > 	char y;
> > > 	char z[] CRYPTO_MINALIGN_ATTR;
> > > };
> > > 
> > > is already of size 128. Without CRYPTO_MINALIGN_ATTR, its size would be
> > > 1 but otherwise the whole structure inherits the alignment of its
> > > member and this translates into an aligned size.
> > 
> > No we should not lie to the compiler,
> 
> We won't if we ensure that a structure with sizeof() >= 128 is aligned
> to 128.
> 
Right. kmalloc() should return a 128 byte aligned pointer as long as
the size of the allocation is >= 128 bytes, and the kmalloc-192 cache
isn't present. So, the current behavior that crypto is relying on
wouldn't change, so I agree with Catalin that we wouldn't be lying to
the compiler if we move forward with getting rid of kmalloc-192.

FWIW, I did a comparison on my machine with and without kmalloc-192, and
the increase in memory usage from allocations being redirected to
kmalloc-256 was about 0.4-0.5 MB, which doesn't seem too bad.

> > we have code elsewhere
> > that uses the alignment to compute the amount of extra padding
> > needed to create greater padding.  If CRYPTO_MINALIGN is misleading
> > then that calculation will fall apart.
I don't think it would be misleading. If all of your allocations
are >= CRYPTO_MINALIGN == ARCH_DMA_MINALIGN in size, and
kmalloc()--with kmalloc-192 removed--returns buffers that are aligned to a
power of 2, and are big enough to accommodate your allocation, then wouldn't
they always be CRYPTO_MINALIGN'ed, so your calculation would still be fine?

--Isaac
Herbert Xu July 16, 2022, 3:25 a.m. UTC | #83
On Fri, Jul 15, 2022 at 03:23:25PM -0700, Isaac Manjarres wrote:
>
> isn't present. So, the current behavior that crypto is relying on
> wouldn't change, so I agree with Catalin that we wouldn't be lying to
> the compiler if we move forward with getting rid of kmalloc-192.

There is no guarantee that crypto will always be allocating
structures > 128 bytes.

But thanks for the reminder, I do need to push the patches along.

Cheers,
Catalin Marinas July 18, 2022, 5:53 p.m. UTC | #84
On Sat, Jul 16, 2022 at 11:25:35AM +0800, Herbert Xu wrote:
> On Fri, Jul 15, 2022 at 03:23:25PM -0700, Isaac Manjarres wrote:
> > isn't present. So, the current behavior that crypto is relying on
> > wouldn't change, so I agree with Catalin that we wouldn't be lying to
> > the compiler if we move forward with getting rid of kmalloc-192.
> 
> There is no guarantee that crypto will always be allocating
> structures > 128 bytes.
> 
> But thanks for the reminder, I do need to push the patches along.

So do I but holidays get in the way ;). I plan to refresh my kmalloc
minalign series at the end of August.

One significant change I have though is that ARCH_KMALLOC_MINALIGN now
goes down all the way to 8, using swiotlb bounce buffering if the
DMA mapping size is small.
Isaac Manjarres Sept. 21, 2022, 12:47 a.m. UTC | #85
On Mon, Jul 18, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> So do I but holidays get in the way ;). I plan to refresh my kmalloc
> minalign series at the end of August.
> 
> One significant change I have though is that now ARCH_KMALLOC_MINALIGN
> now goes down all the way to 8 and using swiotlb bounce buffering if the
> DMA mapping size is small.

Hi Catalin,

This sounds like a good idea for optimizing kmalloc's memory usage
beyond what this series originally achieved. I'm sure a few other things
have come up in the meantime, but I was curious to know if you had a
chance to get back to this?

Thanks,
Isaac
Catalin Marinas Sept. 30, 2022, 6:32 p.m. UTC | #86
On Tue, Sep 20, 2022 at 05:47:33PM -0700, Isaac Manjarres wrote:
> On Mon, Jul 18, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> > So do I but holidays get in the way ;). I plan to refresh my kmalloc
> > minalign series at the end of August.
> > 
> > One significant change I have though is that now ARCH_KMALLOC_MINALIGN
> > now goes down all the way to 8 and using swiotlb bounce buffering if the
> > DMA mapping size is small.
> 
> This sounds like a good idea for optimizing kmalloc's memory usage
> beyond what this series originally achieved. I'm sure a few other things
> have come up in the meantime, but I was curious to know if you had a
> chance to get back to this?

I started refreshing the series but I got stuck on having to do bouncing
for small buffers even when they go through the iommu (and I don't
have the setup to test it yet).

I hope to post something when the merging window closes. In the
meantime, I pushed my work-in-progress patches to:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/kmalloc-minalign-v2-wip

Apart from the iommu thing, I still need to figure out whether with an
ARCH_DMA_MINALIGN of 128 we need to disable the kmalloc-192 cache (so
far I don't think it's needed). There are no additional changes to the
crypto code from the last series, I still set CRYPTO_MINALIGN to
ARCH_DMA_MINALIGN (without any other patches, crypto DMA will break; so
I need to see how it interacts with Herbert's series).

Anyway, I hope for more discussions once 6.1-rc1 is out.
Linus Torvalds Sept. 30, 2022, 7:35 p.m. UTC | #87
On Fri, Sep 30, 2022 at 11:33 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> I started refreshing the series but I got stuck on having to do bouncing
> for small buffers even if when they go through the iommu (and I don't
> have the set up to test it yet).

May I suggest making that "force bouncing" and "change kmalloc to have
an 8-byte minalign" the first two commits?

IOW, if we force bouncing for unaligned DMA, then that *should* mean
that allocation alignment is no longer a correctness issue, it's
purely a performance one due to the bouncing.

So then the rest of the series should be about "ok, this is actually a
hot enough allocation that I want to force alignment", and be purely
about performance, not correctness.

No?

               Linus
Catalin Marinas Oct. 1, 2022, 10:29 p.m. UTC | #88
On Fri, Sep 30, 2022 at 12:35:45PM -0700, Linus Torvalds wrote:
> On Fri, Sep 30, 2022 at 11:33 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > I started refreshing the series but I got stuck on having to do bouncing
> > for small buffers even if when they go through the iommu (and I don't
> > have the set up to test it yet).
> 
> May I suggest doing that "force bouncing" and "change kmalloc to have
> a 8-byte minalign" to be the two first commits?
> 
> IOW, if we force bouncing for unaligned DMA, then that *should* mean
> that allocation alignment is no longer a correctness issue, it's
> purely a performance one due to the bouncing.

I've been thinking about this and even ended up with a CBMC model
(included below; it found a bug in dma_kmalloc_needs_bounce()).

The "force bouncing" in my series currently only checks for small
(potentially kmalloc'ed) sizes under the assumption that intra-object
DMA buffers were properly aligned to 128. So for something like below:

struct devres {
	struct devres_node		node;
	u8 __aligned(ARCH_DMA_MINALIGN) data[];
};

we'd need ARCH_DMA_MINALIGN of 128 even if ARCH_KMALLOC_MINALIGN is 8.
Originally the code has __aligned(ARCH_KMALLOC_MINALIGN), so lowering the
latter to 8 without any changes would be problematic (the sizeof(devres)
may be sufficiently large to look cacheline-aligned).
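
As a quick illustration of the above (a sketch, assuming the struct as
written here and ARCH_DMA_MINALIGN of 128), the aligned flexible array
member propagates its alignment to the whole struct:

static_assert(offsetof(struct devres, data) % ARCH_DMA_MINALIGN == 0);
static_assert(sizeof(struct devres) % ARCH_DMA_MINALIGN == 0);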

If data[] contains a single DMA buffer, dma_kmalloc_needs_bounce() can
get the start of the buffer as another parameter and check that it's a
multiple of cache_line_size().

However, things get more complicated if data[] is used for several
sub-allocations of multiples of ARCH_KMALLOC_MINALIGN. Not much to do
with kmalloc() caches at this point. I haven't got my head around the
crypto code but it looked to me like it needs ARCH_DMA_MINALIGN in some
places if we are to lower ARCH_KMALLOC_MINALIGN. We could attempt to
force bouncing in dma_kmalloc_needs_bounce() by:

	if (ptr % dma_align != 0 || size % dma_align != 0)
		return true;

but that's orthogonal to the kmalloc caches. I tried this some years ago
and IIRC many buffers get bounced even with ARCH_KMALLOC_MINALIGN of 128
because drivers don't necessarily have cacheline-aligned sized
structures shared with devices (but they are allocated from a
cacheline-aligned slab). So this check results in unnecessary bouncing.

So my series attempts to (1) fix the (static) alignment for intra-object
buffers by changing a few ARCH_KMALLOC_MINALIGN uses to
ARCH_DMA_MINALIGN and (2) address the kmalloc() DMA safety by bouncing
non-cacheline-aligned sizes. I don't think we can do (2) first as the
logic for handling (1) in the absence of a large ARCH_DMA_MINALIGN is
different.

And that's the CMBC model:

------------------------------------8<----------------------------
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Check with:
 *   cbmc --trace dma-bounce.c
 */

#define PAGE_SIZE		4096

#define ARCH_KMALLOC_MINALIGN	8
#define ARCH_DMA_MINALIGN	128

#define KMALLOC_MIN_SIZE	ARCH_KMALLOC_MINALIGN
#define KMALLOC_SHIFT_LOW	3
#define KMALLOC_SHIFT_HIGH	25
#define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_HIGH)

#define INIT_KMALLOC_INFO(__size, __short_size)			\
{								\
	.size = __size,						\
}

static unsigned int nondet_uint(void);

struct kmalloc_info_struct {
	unsigned int size;
};

struct kmalloc_slab {
	unsigned int ptr;
	unsigned int size;
};

static const struct kmalloc_info_struct kmalloc_info[] = {
	INIT_KMALLOC_INFO(0, 0),
	INIT_KMALLOC_INFO(96, 96),
	INIT_KMALLOC_INFO(192, 192),
	INIT_KMALLOC_INFO(8, 8),
	INIT_KMALLOC_INFO(16, 16),
	INIT_KMALLOC_INFO(32, 32),
	INIT_KMALLOC_INFO(64, 64),
	INIT_KMALLOC_INFO(128, 128),
	INIT_KMALLOC_INFO(256, 256),
	INIT_KMALLOC_INFO(512, 512),
	INIT_KMALLOC_INFO(1024, 1k),
	INIT_KMALLOC_INFO(2048, 2k),
	INIT_KMALLOC_INFO(4096, 4k),
	INIT_KMALLOC_INFO(8192, 8k),
	INIT_KMALLOC_INFO(16384, 16k),
	INIT_KMALLOC_INFO(32768, 32k),
	INIT_KMALLOC_INFO(65536, 64k),
	INIT_KMALLOC_INFO(131072, 128k),
	INIT_KMALLOC_INFO(262144, 256k),
	INIT_KMALLOC_INFO(524288, 512k),
	INIT_KMALLOC_INFO(1048576, 1M),
	INIT_KMALLOC_INFO(2097152, 2M),
	INIT_KMALLOC_INFO(4194304, 4M),
	INIT_KMALLOC_INFO(8388608, 8M),
	INIT_KMALLOC_INFO(16777216, 16M),
	INIT_KMALLOC_INFO(33554432, 32M)
};

static unsigned int cache_line_size(void)
{
	static const unsigned int cls = nondet_uint();

	__CPROVER_assume(cls == 32 || cls == 64 || cls == 128);

	return cls;
}

static unsigned int kmalloc_index(unsigned int size)
{
	if (!size)
		return 0;

	if (size <= KMALLOC_MIN_SIZE)
		return KMALLOC_SHIFT_LOW;

	if (KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
		return 1;
	if (KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
		return 2;
	if (size <=          8) return 3;
	if (size <=         16) return 4;
	if (size <=         32) return 5;
	if (size <=         64) return 6;
	if (size <=        128) return 7;
	if (size <=        256) return 8;
	if (size <=        512) return 9;
	if (size <=       1024) return 10;
	if (size <=   2 * 1024) return 11;
	if (size <=   4 * 1024) return 12;
	if (size <=   8 * 1024) return 13;
	if (size <=  16 * 1024) return 14;
	if (size <=  32 * 1024) return 15;
	if (size <=  64 * 1024) return 16;
	if (size <= 128 * 1024) return 17;
	if (size <= 256 * 1024) return 18;
	if (size <= 512 * 1024) return 19;
	if (size <= 1024 * 1024) return 20;
	if (size <=  2 * 1024 * 1024) return 21;
	if (size <=  4 * 1024 * 1024) return 22;
	if (size <=  8 * 1024 * 1024) return 23;
	if (size <=  16 * 1024 * 1024) return 24;
	if (size <=  32 * 1024 * 1024) return 25;

	__CPROVER_assert(0, "Invalid kmalloc() size");

	return -1;
}

unsigned int kmalloc(unsigned int size, struct kmalloc_slab *slab)
{
	unsigned int nr = nondet_uint();

	slab->size = kmalloc_info[kmalloc_index(size)].size;
	slab->ptr = nr * slab->size;

	__CPROVER_assume(slab->ptr < PAGE_SIZE);
	__CPROVER_assume(slab->ptr % slab->size == 0);

	return slab->ptr;
}

/*
 * Implemented only for 32, 64 and 128 cache line sizes.
 */
int dma_kmalloc_needs_bounce(unsigned int size)
{
	unsigned int dma_align = cache_line_size();

	/*
	 * Less than half dma_align, there's definitely a smaller kmalloc()
	 * cache.
	 */
	if (size <= dma_align / 2)
		return 1;

	/*
	 * From this point, any kmalloc cache size is 32-byte aligned.
	 */
	if (dma_align == 32)
		return 0;

	/*
	 * dma_align == 64 => 96 needs bouncing.
	 * dma_align == 128 => 96 and 192 need bouncing.
	 */
	if (size > 64 && size <= 96)
		return 1;
	if (dma_align == 128 && size > 128 && size <= 192)
		return 1;

	return 0;
}

/*
 * Simulate DMA cache maintenance. The 'slab' object is only used for
 * verification.
 */
void dma_map_single(unsigned int ptr, unsigned int size,
		    struct kmalloc_slab *slab)
{
	unsigned int mask = cache_line_size() - 1;

	if (dma_kmalloc_needs_bounce(size)) {
		/* was the bounce really necessary? */
		__CPROVER_assert((ptr & mask) != 0 || (size & mask) != 0,
				 "Bouncing aligned DMA buffer");
		return;
	}

	/*
	 * Check for cache maintenance outside the kmalloc'ed object. We don't
	 * care about intra-object overlap, it's the caller's responsibility
	 * to ensure alignment.
	 */
	__CPROVER_assert((ptr & ~mask) >= slab->ptr, "DMA cache maintenance underflow");
	__CPROVER_assert(((ptr + size + mask) & ~mask) <= slab->ptr + slab->size,
			 "DMA cache maintenance overflow");
}

int main(void)
{
	struct kmalloc_slab slab;
	unsigned int size = nondet_uint();
	unsigned int offset = nondet_uint();
	unsigned int ptr;

	__CPROVER_assume(size <= KMALLOC_MAX_SIZE);
	__CPROVER_assume(offset < size);
	__CPROVER_assume(offset % ARCH_DMA_MINALIGN == 0);

	ptr = kmalloc(size, &slab);
	dma_map_single(ptr + offset, size - offset, &slab);

	return 0;
}
Linus Torvalds Oct. 2, 2022, 5 p.m. UTC | #89
On Sat, Oct 1, 2022 at 3:30 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> The "force bouncing" in my series currently only checks for small
> (potentially kmalloc'ed) sizes under the assumption that intra-object
> DMA buffers were properly aligned to 128. So for something like below:

Ahh, so your forced bouncing isn't actually safe.

I would have hoped (but obviously never checked) that the force
bouncing be made really safe and look at the actual alignment of the
DMA (comparing it to the hardware coherency requirements), so that
alignment at allocation time simply wouldn't matter.

At that point, places like the ones you found would still work, they'd
just cause bouncing.

At which point you'd then have a choice of

 (a) just let it bounce

 (b) marking the allocations that led to them

and (a) might actually be perfectly fine in a lot of situations.
That's particularly true for the "random drivers" situation that may
not be all that relevant in real life, which is a *big* deal. Not
because of any performance issues, but simply because of kernel
developers not having to worry their pretty little heads about stuff
that doesn't really matter.

In fact, (a) might be perfectly ok even for drivers that *do* matter,
if they just aren't all that performance-critical and the situation
doesn't come up a lot (maybe it's a special management ioctl or
similar that just causes the possibility to come up, and it's
important that it *works*, but having a few bounces occasionally
doesn't actually matter, and all the regular IO goes the normal path).

And (b) would be triggered by actual data. Which could be fairly easy
to gather with a statistical model. For example, just making
dma_map_xyz() have a debug mode where it prints out the stack trace of
these bounces once every minute or so - statistically the call trace
will be one of the hot ones. Or, better yet, just use tracing to do
it.
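
A minimal sketch of such a debug mode (hypothetical helper, assuming it
is called from wherever the DMA mapping code decides to bounce):

static void dma_bounce_report(struct device *dev, size_t size)
{
	static DEFINE_RATELIMIT_STATE(rs, 60 * HZ, 1);

	/* At most one report per minute; the backtrace points at the hot caller. */
	if (__ratelimit(&rs)) {
		dev_info(dev, "bouncing unaligned DMA buffer of %zu bytes\n", size);
		dump_stack();
	}
}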

That would allow us to say "DMA is immaterial for _correct_ alignment,
because we always fix it up if required", but then also find
situations where we might want to give it a gentle helper nudge.

But hey, if you're comfortable with your approach, that's fine too.
Anything that gets rid of the absolutely insane "you can't do small
allocations" is an improvement.

                   Linus
Ard Biesheuvel Oct. 2, 2022, 10:08 p.m. UTC | #90
On Sun, 2 Oct 2022 at 19:00, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Oct 1, 2022 at 3:30 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > The "force bouncing" in my series currently only checks for small
> > (potentially kmalloc'ed) sizes under the assumption that intra-object
> > DMA buffers were properly aligned to 128. So for something like below:
>
> Ahh, so your forced bouncing isn't actually safe.
>
> I would have hoped (but obviously never checked) that the force
> bouncing be made really safe and look at the actual alignment of the
> DMA (comparing it to the hardware coherency requirements), so that
> alignment at allocation time simply wouldn't matter.
>
> At that point, places like the ones you found would still work, they'd
> just cause bouncing.
>
> At which point you'd then have a choice of
>
>  (a) just let it bounce
>
>  (b) marking the allocations that led to them
>
> and (a) might actually be perfectly fine in a lot of situations.

Non-coherent DMA for networking is going to be fun, though.
Fortunately, we'll only need to bounce for inbound DMA (where the
cache invalidation might otherwise corrupt adjacent unrelated data),
and for inbound networking, the driver typically owns and maps the
buffers, and so in most cases, it can hopefully be easily modified to
round up the allocations and the length values passed to
dma_map_xxx(). And for outbound, we can just clean the SKB from the
caches without the risk of corruption.

It does mean that there is likely going to be a long tail of network
drivers that will non-negligibly regress in performance and in memory
footprint (as the entire RX ring will be double buffered all the time)
until they get updated.
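
For illustration, a rough sketch of the RX-side change described above,
with made-up names (example_rx_alloc, the 1536-byte buffer size): both
the skb allocation and the mapped length are rounded up to the DMA
alignment so that cache invalidation on unmap cannot touch unrelated
data.

static int example_rx_alloc(struct net_device *ndev, struct device *dma_dev,
			    struct sk_buff **skbp, dma_addr_t *dmap)
{
	size_t len = ALIGN(1536, dma_get_cache_alignment());
	struct sk_buff *skb = netdev_alloc_skb(ndev, len);

	if (!skb)
		return -ENOMEM;

	*dmap = dma_map_single(dma_dev, skb->data, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dma_dev, *dmap)) {
		dev_kfree_skb(skb);
		return -ENOMEM;
	}
	*skbp = skb;
	return 0;
}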
Linus Torvalds Oct. 2, 2022, 10:24 p.m. UTC | #91
On Sun, Oct 2, 2022 at 3:09 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Non-coherent DMA for networking is going to be fun, though.

I agree that networking is likely the main performance issue, but I
suspect 99% of the cases would come from __alloc_skb().

You might want to have help from the network drivers for the "allocate
for RX vs TX", since it ends up having very different DMA coherence
issues, as you point out.

The code actually already has a SKB_ALLOC_RX flag, but despite the
name it doesn't really mean what you'd think it means.

Similarly, that code already has magic stuff to try to be
cacheline-aligned for accesses, but it's not really for DMA coherency
reasons, just purely for performance reasons (trying to make sure that
the header accesses stay in one cacheline etc).

And to be honest, it's been years and years since I did any networking, so...

                  Linus
Catalin Marinas Oct. 3, 2022, 5:39 p.m. UTC | #92
On Sun, Oct 02, 2022 at 03:24:57PM -0700, Linus Torvalds wrote:
> On Sun, Oct 2, 2022 at 3:09 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > Non-coherent DMA for networking is going to be fun, though.
> 
> I agree that networking is likely the main performance issue, but I
> suspect 99% of the cases would come from __alloc_skb().

The problem is not the allocation but rather having a generic enough
dma_needs_bounce() check. It won't be able to tell whether some 1500
byte range is for network or for crypto code that uses a small
ARCH_KMALLOC_MINALIGN. Getting the actual object size (e.g. with
ksize()) doesn't tell the full story on how safe the DMA is.

> Similarly, that code already has magic stuff to try to be
> cacheline-aligned for accesses, but it's not really for DMA coherency
> reasons, just purely for performance reasons (trying to make sure that
> the header accesses stay in one cacheline etc).

Yeah, __alloc_skb() ends up using SMP_CACHE_BYTES for data alignment
(via SKB_DATA_ALIGN). I have a suspicion this may break on SoCs with a
128-byte cache line but I haven't seen any report yet (there aren't many
such systems).
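
For reference, SKB_DATA_ALIGN is (roughly) the following, i.e. purely
SMP_CACHE_BYTES based, with nothing tying it to the non-coherent DMA
alignment:

#define SKB_DATA_ALIGN(X)	ALIGN(X, SMP_CACHE_BYTES)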
Isaac Manjarres Oct. 12, 2022, 5:45 p.m. UTC | #93
On Fri, Sep 30, 2022 at 07:32:50PM +0100, Catalin Marinas wrote:
> I started refreshing the series but I got stuck on having to do bouncing
> for small buffers even if when they go through the iommu (and I don't
> have the set up to test it yet).

For devices that go through the IOMMU, are you planning on adding
similar logic as you did in the direct-DMA path to bounce the buffer
prior to calling into whatever DMA ops are registered for the device?

Also, there are devices with ARM64 CPUs that disable SWIOTLB usage because
none of the peripherals that they engage in DMA with need bounce buffering,
and also to reclaim the default 64 MB of memory that SWIOTLB uses. With
this approach, SWIOTLB usage will become mandatory if those devices need
to perform non-coherent DMA transactions that may not necessarily be DMA
aligned (e.g. small buffers), correct?

If so, would there be concerns that the memory savings we get back from
reducing the memory footprint of kmalloc might be defeated by how much
memory is needed for bounce buffering? I understand that we can use the
"swiotlb=num_slabs" command line parameter to minimize the amount of
memory allocated for bounce buffering. If this is the only way to
minimize this impact, how much memory would you recommend to allocate
for bounce buffering on a system that will only use bounce buffers for
non-DMA-aligned buffers?

Thanks,
Isaac
Catalin Marinas Oct. 13, 2022, 4:57 p.m. UTC | #94
On Wed, Oct 12, 2022 at 10:45:45AM -0700, Isaac Manjarres wrote:
> On Fri, Sep 30, 2022 at 07:32:50PM +0100, Catalin Marinas wrote:
> > I started refreshing the series but I got stuck on having to do bouncing
> > for small buffers even if when they go through the iommu (and I don't
> > have the set up to test it yet).
> 
> For devices that go through the IOMMU, are you planning on adding
> similar logic as you did in the direct-DMA path to bounce the buffer
> prior to calling into whatever DMA ops are registered for the device?

Yes.

> Also, there are devices with ARM64 CPUs that disable SWIOTLB usage because
> none of the peripherals that they engage in DMA with need bounce buffering,
> and also to reclaim the default 64 MB of memory that SWIOTLB uses. With
> this approach, SWIOTLB usage will become mandatory if those devices need
> to perform non-coherent DMA transactions that may not necessarily be DMA
> aligned (e.g. small buffers), correct?

Correct. I've been thinking about this and a way around is to combine
the original series (dynamic kmalloc_minalign) with the new one so that
the arch code can lower the minimum alignment either to 8 if swiotlb is
available (usually in server space with more RAM) or the cache line size
if there is no bounce buffer.
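
A rough sketch of that combination, using hypothetical helpers (the
actual hook and the swiotlb test would need deciding):

void __init arch_setup_kmalloc_minalign(void)
{
	/*
	 * Only drop below the cache line size if unaligned non-coherent
	 * DMA can be bounced through swiotlb.
	 */
	if (system_has_swiotlb())			/* hypothetical */
		set_kmalloc_minalign(8);		/* hypothetical */
	else
		set_kmalloc_minalign(cache_line_size());
}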

> If so, would there be concerns that the memory savings we get back from
> reducing the memory footprint of kmalloc might be defeated by how much
> memory is needed for bounce buffering?

It's not necessarily about the saved memory but also locality of the
small buffer allocations, less cache and TLB pressure.

> I understand that we can use the
> "swiotlb=num_slabs" command line parameter to minimize the amount of
> memory allocated for bounce buffering. If this is the only way to
> minimize this impact, how much memory would you recommend to allocate
> for bounce buffering on a system that will only use bounce buffers for
> non-DMA-aligned buffers?

It's hard to tell, it would need to be guessed by trial and error on
specific hardware if you want to lower it. Another issue is that IIRC
the swiotlb is allocated in 2K slots, so you may need a lot more bounce
buffer memory than the actual data being bounced.

I wonder whether swiotlb is actually the best option for bouncing
unaligned buffers. We could use something like mempool_alloc() instead
if we stick to small buffers rather than any (even large) buffer that's
not aligned to a cache line. Or just go for kmem_cache_alloc() directly.
A downside is that we may need GFP_ATOMIC for such allocations, so
higher risk of failure.
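
A very rough sketch of that alternative, with hypothetical names (note
the GFP_ATOMIC concern above and the unmap-side bookkeeping problem that
comes up later in the thread):

static struct kmem_cache *dma_bounce_cache;	/* hypothetical, cache-line-aligned objects */

static void *dma_bounce_alloc(size_t size)
{
	/* Mapping may happen in atomic context, so no sleeping allocation. */
	return size <= DMA_BOUNCE_MAX ?		/* hypothetical limit */
	       kmem_cache_alloc(dma_bounce_cache, GFP_ATOMIC) : NULL;
}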
Saravana Kannan Oct. 13, 2022, 6:58 p.m. UTC | #95
On Thu, Oct 13, 2022 at 9:57 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Oct 12, 2022 at 10:45:45AM -0700, Isaac Manjarres wrote:
> > On Fri, Sep 30, 2022 at 07:32:50PM +0100, Catalin Marinas wrote:
> > > I started refreshing the series but I got stuck on having to do bouncing
> > > for small buffers even if when they go through the iommu (and I don't
> > > have the set up to test it yet).
> >
> > For devices that go through the IOMMU, are you planning on adding
> > similar logic as you did in the direct-DMA path to bounce the buffer
> > prior to calling into whatever DMA ops are registered for the device?
>
> Yes.
>
> > Also, there are devices with ARM64 CPUs that disable SWIOTLB usage because
> > none of the peripherals that they engage in DMA with need bounce buffering,
> > and also to reclaim the default 64 MB of memory that SWIOTLB uses. With
> > this approach, SWIOTLB usage will become mandatory if those devices need
> > to perform non-coherent DMA transactions that may not necessarily be DMA
> > aligned (e.g. small buffers), correct?
>
> Correct. I've been thinking about this and a way around is to combine
> the original series (dynamic kmalloc_minalign) with the new one so that
> the arch code can lower the minimum alignment either to 8 if swiotlb is
> available (usually in server space with more RAM) or the cache line size
> if there is no bounce buffer.
>
> > If so, would there be concerns that the memory savings we get back from
> > reducing the memory footprint of kmalloc might be defeated by how much
> > memory is needed for bounce buffering?
>
> It's not necessarily about the saved memory but also locality of the
> small buffer allocations, less cache and TLB pressure.

Part of the pushback we get when we try to move some of the Android
ecosystem from 32-bit to 64-bit is the memory usage increase. So,
while the main goal might not be memory savings, it'll be good to keep
that in mind too. I'd definitely not want this patch series to make
things worse. Ideally, it'd make things better. 10MB is considered a
lot on some of the super low speced devices.

> > I understand that we can use the
> > "swiotlb=num_slabs" command line parameter to minimize the amount of
> > memory allocated for bounce buffering. If this is the only way to
> > minimize this impact, how much memory would you recommend to allocate
> > for bounce buffering on a system that will only use bounce buffers for
> > non-DMA-aligned buffers?
>
> It's hard to tell, it would need to be guessed by trial and error on
> specific hardware if you want to lower it. Another issue is that IIRC
> the swiotlb is allocated in 2K slots, so you may need a lot more bounce
> buffers than the actual memory allocated.
>
> I wonder whether swiotlb is actually the best option for bouncing
> unaligned buffers. We could use something like mempool_alloc() instead
> if we stick to small buffers rather than any (even large) buffer that's
> not aligned to a cache line. Or just go for kmem_cache_alloc() directly.
> A downside is that we may need GFP_ATOMIC for such allocations, so
> higher risk of failure.

Yeah, a temporary kmem_cache_alloc() to bounce buffers off of feels
like a better idea than swiotlb. Especially for small allocations (say
8 byte allocations) that might have gone into the kmalloc-64 cache if we
hadn't dropped KMALLOC_MIN_ALIGN to 8.

-Saravana
Catalin Marinas Oct. 14, 2022, 4:25 p.m. UTC | #96
On Thu, Oct 13, 2022 at 11:58:22AM -0700, Saravana Kannan wrote:
> On Thu, Oct 13, 2022 at 9:57 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > If so, would there be concerns that the memory savings we get back from
> > > reducing the memory footprint of kmalloc might be defeated by how much
> > > memory is needed for bounce buffering?
> >
> > It's not necessarily about the saved memory but also locality of the
> > small buffer allocations, less cache and TLB pressure.
> 
> Part of the pushback we get when we try to move some of the Android
> ecosystem from 32-bit to 64-bit is the memory usage increase. So,
> while the main goal might not be memory savings, it'll be good to keep
> that in mind too. I'd definitely not want this patch series to make
> things worse. Ideally, it'd make things better. 10MB is considered a
> lot on some of the super low speced devices.

Well, we can still add the option to skip allocating from the small
kmalloc caches if there's no swiotlb available.

> > I wonder whether swiotlb is actually the best option for bouncing
> > unaligned buffers. We could use something like mempool_alloc() instead
> > if we stick to small buffers rather than any (even large) buffer that's
> > not aligned to a cache line. Or just go for kmem_cache_alloc() directly.
> > A downside is that we may need GFP_ATOMIC for such allocations, so
> > higher risk of failure.
> 
> Yeah, a temporary kmem_cache_alloc() to bounce buffers off of feels
> like a better idea than swiotlb. Especially for small allocations (say
> 8 byte allocations) that might have gone into the kmem-cache-64 if we
> hadn't dropped KMALLOC_MIN_ALIGN to 8.

I now remembered why this isn't trivial. On the dma_unmap_*() side, we
can't easily tell whether the buffer was bounced and it needs freeing.
The swiotlb solves this by checking whether the address is within the
pre-allocated swiotlb range. With kmem_caches, we could dig into the
slab internals and check whether the pointer is part of a slab page,
then check with kmem_cache that was. It looks too complicated and I'm
rather tempted to just go for swiotlb if available or prevent the
creation of small kmalloc caches if no swiotlb (TBH, even the initial
series was an improvement dropping KMALLOC_MIN_ALIGN from 128 to 64).
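
For comparison, this is roughly how the dma-direct unmap path detects a
bounced buffer today (simplified): the bounce lives in a known physical
range, so an address check is all that is needed.

static void example_unmap(struct device *dev, dma_addr_t addr, size_t size,
			  enum dma_data_direction dir, unsigned long attrs)
{
	phys_addr_t phys = dma_to_phys(dev, addr);

	if (unlikely(is_swiotlb_buffer(dev, phys)))
		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
}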
Saravana Kannan Oct. 14, 2022, 8:23 p.m. UTC | #97
On Fri, Oct 14, 2022 at 9:25 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Thu, Oct 13, 2022 at 11:58:22AM -0700, Saravana Kannan wrote:
> > On Thu, Oct 13, 2022 at 9:57 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > If so, would there be concerns that the memory savings we get back from
> > > > reducing the memory footprint of kmalloc might be defeated by how much
> > > > memory is needed for bounce buffering?
> > >
> > > It's not necessarily about the saved memory but also locality of the
> > > small buffer allocations, less cache and TLB pressure.
> >
> > Part of the pushback we get when we try to move some of the Android
> > ecosystem from 32-bit to 64-bit is the memory usage increase. So,
> > while the main goal might not be memory savings, it'll be good to keep
> > that in mind too. I'd definitely not want this patch series to make
> > things worse. Ideally, it'd make things better. 10MB is considered a
> > lot on some of the super low speced devices.
>
> Well, we can still add the option to skip allocating from the small
> kmalloc caches if there's no swiotlb available.

This seems like a good compromise.

> > > I wonder whether swiotlb is actually the best option for bouncing
> > > unaligned buffers. We could use something like mempool_alloc() instead
> > > if we stick to small buffers rather than any (even large) buffer that's
> > > not aligned to a cache line. Or just go for kmem_cache_alloc() directly.
> > > A downside is that we may need GFP_ATOMIC for such allocations, so
> > > higher risk of failure.
> >
> > Yeah, a temporary kmem_cache_alloc() to bounce buffers off of feels
> > like a better idea than swiotlb. Especially for small allocations (say
> > 8 byte allocations) that might have gone into the kmem-cache-64 if we
> > hadn't dropped KMALLOC_MIN_ALIGN to 8.
>
> I now remembered why this isn't trivial. On the dma_unmap_*() side, we
> can't easily tell whether the buffer was bounced and it needs freeing.
> The swiotlb solves this by checking whether the address is within the
> pre-allocated swiotlb range. With kmem_caches, we could dig into the
> slab internals and check whether the pointer is part of a slab page,
> then check with kmem_cache that was. It looks too complicated and I'm
> rather tempted to just go for swiotlb if available or prevent the
> creation of small kmalloc caches if no swiotlb (TBH, even the initial
> series was an improvement dropping KMALLOC_MIN_ALIGN from 128 to 64).

Agreed. Even allowing a 64-byte kmalloc cache on a system with a
64-byte cacheline size saves quite a bit of memory.

-Saravana
Linus Torvalds Oct. 14, 2022, 8:44 p.m. UTC | #98
On Fri, Oct 14, 2022 at 1:24 PM Saravana Kannan <saravanak@google.com> wrote:
>
> Agreed. Even allowing a 64-byte kmalloc cache on a system with a
> 64-byte cacheline size saves quite a bit of memory.

Well, the *really* trivial thing to do is to just say "if the platform
is DMA coherent, just allow any size kmalloc cache". And just
consciously leave the broken garbage behind.

Because it's not just 64-byte kmalloc. It's things like 'avtab_node'
that is 24 bytes, and that on my system currently uses about 3MB of
memory simply because there's a _lot_ of them.

I've got 1.8MB in "kmalloc-32" too, and about 1MB in "kmalloc-16", fwiw.

Yeah, yeah, this is on a 64GB machine and so none of that matters (and
some of these things very much "scale with memory"), but these
small allocations aren't actually all that unusual.

And yes, the above is obviously on my x86-64 machine.

My arm64 laptop doesn't have the small kmallocs, and as a result the
"kmalloc-128" has 633 _thousand_ entries, and takes up 77MB of RAM on
my 16GB laptop. I'm assuming (but have no proof) more than 50% of that
is just wasted memory.

I suspect that DMA is cache coherent on this thing, and that wasted
space is just *stupid* wasted space, but hey, I don't actually know. I
just use the Asahi people's patches.

               Linus
Catalin Marinas Oct. 16, 2022, 9:37 p.m. UTC | #99
On Fri, Oct 14, 2022 at 01:44:25PM -0700, Linus Torvalds wrote:
> On Fri, Oct 14, 2022 at 1:24 PM Saravana Kannan <saravanak@google.com> wrote:
> > Agreed. Even allowing a 64-byte kmalloc cache on a system with a
> > 64-byte cacheline size saves quite a bit of memory.
> 
> Well, the *really* trivial thing to do is to just say "if the platform
> is DMA coherent, just allow any size kmalloc cache". And just
> consciously leave the broken garbage behind.

The problem is we don't have a reliable way to tell whether the platform
is DMA-coherent. The CPU IDs don't really say much and in most cases
it's a property of the interconnect/bus and device. We describe the DMA
coherency in DT or ACPI and the latter is somewhat better as it assumes
coherent by default. But for DT, not having a 'dma-coherent' property
means non-coherent DMA (or no DMA at all). We can't even tell whether
the device is going to do any DMA, arch_setup_dma_ops() is called even
for devices like the PMU. We could look into defining new properties
(e.g. "no-dma") and adjust the DTs but we may also have late probed
devices, long after the slab allocator was initialised. A big
'dma-coherent' property on the top node may work but most Android
systems won't benefit from this (your laptop may, I haven't checked).

I think the best bet is still either (a) bounce for small sizes or (b) a
new GFP_NODMA/PACKED/etc. flag for the hot small allocations. (a) is
somewhat more universal but lots (most) Android devices are deployed with
no swiotlb buffer as the vendor knows the device needs and don't need
extra buffer waste. Not sure how reliable it would be to trigger another
slab allocation on the dma_map_*() calls for the bounce (it may need to
be GFP_ATOMIC). Option (b) looks more appealing on such systems, though
a lot more churn throughout the kernel.
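
For illustration, option (b) could look roughly like this at a call
site, with a hypothetical flag name:

static struct avtab_node *example_alloc(void)
{
	/*
	 * Hypothetical flag: the caller promises the object is never used
	 * for DMA, so it may be served from a small (8-byte aligned) cache.
	 */
	return kmalloc(sizeof(struct avtab_node), GFP_KERNEL | __GFP_PACKED);
}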
diff mbox series

Patch

diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index 2324ab6f1846..654b9c355575 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -167,7 +167,7 @@ 
  * maintenance for non-coherent DMA (cache invalidation in particular) does not
  * affect data that may be accessed by the CPU concurrently.
  */
-#define CRYPTO_MINALIGN ARCH_KMALLOC_MINALIGN
+#define CRYPTO_MINALIGN ARCH_DMA_MINALIGN
 
 #define CRYPTO_MINALIGN_ATTR __attribute__ ((__aligned__(CRYPTO_MINALIGN)))