[RFC,v1] devres: align devres.data strictly only for devm_kmalloc()

Message ID 74ae22cd-08c1-d846-3e1d-cbc38db87442@free.fr (mailing list archive)
State New, archived

Commit Message

Marc Gonzalez Dec. 17, 2019, 3:30 p.m. UTC
Commit a66d972465d15 ("devres: Align data[] to ARCH_KMALLOC_MINALIGN")
increased the alignment of devres.data unconditionally.

Some platforms have very strict alignment requirements for DMA-safe
addresses, e.g. 128 bytes on arm64. There, struct devres amounts to:
	3 pointers + pad_to_128 + data + pad_to_256
i.e. ~220 bytes of padding.

Let's enforce the alignment only for devm_kmalloc().

Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marc Gonzalez <marc.w.gonzalez@free.fr>
---
I had not been aware that dynamic allocation granularity on arm64 was
128 bytes. This means there's a lot of waste on small allocations.
I suppose there's no easy solution, though.
---
 drivers/base/devres.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

Comments

Greg KH Dec. 17, 2019, 3:45 p.m. UTC | #1
On Tue, Dec 17, 2019 at 04:30:54PM +0100, Marc Gonzalez wrote:
[snip]

You need to get Alexey to agree with this; he's the one who hit this on
ARC :)

thanks,

greg k-h
Marc Gonzalez Dec. 17, 2019, 4:17 p.m. UTC | #2
On 17/12/2019 16:30, Marc Gonzalez wrote:

> +#define DEVM_KMALLOC_PADDING_SIZE \
> +	(ARCH_KMALLOC_MINALIGN - sizeof(struct devres) % ARCH_KMALLOC_MINALIGN)

Doh!

#define DEVM_KMALLOC_PADDING_SIZE \
  ((ARCH_KMALLOC_MINALIGN - sizeof(struct devres)) % ARCH_KMALLOC_MINALIGN)
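
FWIW, the two expressions only disagree when sizeof(struct devres) lands
exactly on a multiple of ARCH_KMALLOC_MINALIGN. A quick userspace sketch
(128 stands in for ARCH_KMALLOC_MINALIGN; the sizes are made up):

#include <stdio.h>

#define A 128UL	/* stand-in for ARCH_KMALLOC_MINALIGN */

/* v1: as first posted -- '%' binds tighter than '-' */
static unsigned long pad_v1(unsigned long s) { return A - s % A; }
/* v2: the corrected version */
static unsigned long pad_v2(unsigned long s) { return (A - s) % A; }

int main(void)
{
	unsigned long sizes[] = { 24, 128, 130 };

	for (int i = 0; i < 3; i++)
		printf("size=%lu v1=%lu v2=%lu\n",
		       sizes[i], pad_v1(sizes[i]), pad_v2(sizes[i]));
	/*
	 * size=24  -> v1=104 v2=104 (identical)
	 * size=128 -> v1=128 v2=0   (v1 wastes a whole extra granule)
	 * size=130 -> v1=126 v2=126 (unsigned wrap-around keeps v2 correct)
	 */
	return 0;
}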
Alexey Brodkin Dec. 18, 2019, 2:20 p.m. UTC | #3
Hi Marc,

We sort of expected something like that to happen at some point.
Funny enough, it's been a year since my change was accepted in v4.20,
and only now has somebody noticed :)

Though I have quite a few questions below.

> Commit a66d972465d15 ("devres: Align data[] to ARCH_KMALLOC_MINALIGN")
> increased the alignment of devres.data unconditionally.
> 
> Some platforms have very strict alignment requirements for DMA-safe
> addresses, e.g. 128 bytes on arm64. There, struct devres amounts to:
> 	3 pointers + pad_to_128 + data + pad_to_256
> i.e. ~220 bytes of padding.

Could you please elaborate a bit on the paddings you mention?
I can understand the first one, to 128 bytes, but where does the
second one, to 256 bytes, come from?

> Let's enforce the alignment only for devm_kmalloc().

OK, so for devm_kmalloc() we don't change anything, right?
We still add the same padding before the real data array.

> ---
> I had not been aware that dynamic allocation granularity on arm64 was
> 128 bytes. This means there's a lot of waste on small allocations.

Now I'm probably missing something, but where do you expect to save anything?
If those smaller allocations are done with devm_kmalloc(), you aren't
saving anything.

> I suppose there's no easy solution, though.

Right! It took a while until I was able to propose something
people [almost silently] agreed with.

> ---
>  drivers/base/devres.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> index 0bbb328bd17f..bf39188613d9 100644
> --- a/drivers/base/devres.c
> +++ b/drivers/base/devres.c
> @@ -26,14 +26,7 @@ struct devres_node {
> 
>  struct devres {
>  	struct devres_node		node;
> -	/*
> -	 * Some archs want to perform DMA into kmalloc caches
> -	 * and need a guaranteed alignment larger than
> -	 * the alignment of a 64-bit integer.
> -	 * Thus we use ARCH_KMALLOC_MINALIGN here and get exactly the same
> -	 * buffer alignment as if it was allocated by plain kmalloc().
> -	 */
> -	u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];
> +	u8				data[];
>  };
> 
>  struct devres_group {
> @@ -789,9 +782,16 @@ static void devm_kmalloc_release(struct device *dev, void *res)
>  	/* noop */
>  }
> 
> +#define DEVM_KMALLOC_PADDING_SIZE \
> +	(ARCH_KMALLOC_MINALIGN - sizeof(struct devres) % ARCH_KMALLOC_MINALIGN)

Even given your update with:
------------------------------->8--------------------------------
#define DEVM_KMALLOC_PADDING_SIZE \
  ((ARCH_KMALLOC_MINALIGN - sizeof(struct devres)) % ARCH_KMALLOC_MINALIGN)
------------------------------->8--------------------------------
I don't think I understand why you need that "% ARCH_KMALLOC_MINALIGN" part?

>  static int devm_kmalloc_match(struct device *dev, void *res, void *data)
>  {
> -	return res == data;
> +	/*
> +	 * 'res' is dr->data (not DMA-safe)
> +	 * 'data' is the hand-aligned address from devm_kmalloc
> +	 */
> +	return res + DEVM_KMALLOC_PADDING_SIZE == data;
>  }
> 
>  /**
> @@ -811,6 +811,9 @@ void * devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
>  {
>  	struct devres *dr;
> 
> +	/* Add enough padding to provide a DMA-safe address */
> +	size += DEVM_KMALLOC_PADDING_SIZE;

This implementation gets ugly and will potentially lead to problems later
when people start changing the code here. Compared to that, having dr->data
aligned by the compiler looks much more foolproof.

>  	/* use raw alloc_dr for kmalloc caller tracing */
>  	dr = alloc_dr(devm_kmalloc_release, size, gfp, dev_to_node(dev));
>  	if (unlikely(!dr))
> @@ -822,7 +825,7 @@ void * devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
>  	 */
>  	set_node_dbginfo(&dr->node, "devm_kzalloc_release", size);
>  	devres_add(dev, dr->data);
> -	return dr->data;
> +	return dr->data + DEVM_KMALLOC_PADDING_SIZE;

Ditto. But first I'd like to understand what you are really trying to do
with your change, and then we'll see if there could be a better implementation.

-Alexey
Marc Gonzalez Dec. 18, 2019, 3:40 p.m. UTC | #4
On 18/12/2019 15:20, Alexey Brodkin wrote:

> On 17/12/2019 16:30, Marc Gonzalez wrote:
> 
>> Commit a66d972465d15 ("devres: Align data[] to ARCH_KMALLOC_MINALIGN")
>> increased the alignment of devres.data unconditionally.
>>
>> Some platforms have very strict alignment requirements for DMA-safe
>> addresses, e.g. 128 bytes on arm64. There, struct devres amounts to:
>> 	3 pointers + pad_to_128 + data + pad_to_256
>> i.e. ~220 bytes of padding.
> 
> Could you please elaborate a bit on the paddings you mention?
> I can understand the first one, to 128 bytes, but where does the
> second one, to 256 bytes, come from?

Sure thing.

struct devres {
	struct devres_node node;
	u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];
};

struct devres_node = 3 pointers
kmalloc dishes out memory in multiples of ARCH_KMALLOC_MINALIGN bytes.
On arm64, ARCH_KMALLOC_MINALIGN = 128
(Everything written below assumes ARCH_KMALLOC_MINALIGN = 128)

In alloc_dr() we request sizeof(struct devres) + sizeof(data) from kmalloc.
sizeof(struct devres) = 128 because of the alignment directive.
I.e. the 'data' field is automatically padded to 128 by the compiler.

For most devm allocs (non-devm_kmalloc allocs), data is just 1 or 2 pointers.
So kmalloc(128 + 16) allocates 256 bytes.
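
Spelled out, a sketch of the numbers (assuming 64-bit pointers and no
CONFIG_DEBUG_DEVRES fields):

struct devres_node {			/* 3 pointers = 24 bytes */
	struct list_head	entry;	/* 16 */
	dr_release_t		release;/* 8 */
};

struct devres {
	struct devres_node node;	/* bytes 0-23 */
	/* 104 bytes of compiler padding */
	u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];	/* offset 128 */
};

/*
 * sizeof(struct devres) == 128, so for a 16-byte payload alloc_dr()
 * asks kmalloc() for 144 bytes, which lands in the kmalloc-256 cache
 * (with a 128-byte minimum alignment there is no kmalloc-192):
 * 256 - 24 - 16 = 216 bytes of padding, i.e. the ~220 from the
 * commit message.
 */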

>> Let's enforce the alignment only for devm_kmalloc().
> 
> OK, so for devm_kmalloc() we don't change anything, right?
> We still add the same padding before the real data array.

(My commit message probably requires improvement & refinement.)

Yes, the objective of my patch is to keep the same behavior for
devm_kmalloc() while reverting to the old behavior for all other
uses of struct devres.


>> I had not been aware that dynamic allocation granularity on arm64 was
>> 128 bytes. This means there's a lot of waste on small allocations.
> 
> Now I'm probably missing something, but where do you expect to save anything?
> If those smaller allocations are done with devm_kmalloc(), you aren't
> saving anything.

With my patch, a "non-kmalloc" struct devres would take 128 bytes, instead
of 256.

>> I suppose there's no easy solution, though.
> 
> Right! It took a while until I was able to propose something
> people [almost silently] agreed with.

I meant the wider subject of dynamic allocation granularity.

The 128-byte requirement is only for DMA. Some (most?) uses of kmalloc
are not for DMA. If the user could provide a flag ("this is to be used
for DMA") we could save lots of memory for small non-DMA allocs.


>> +#define DEVM_KMALLOC_PADDING_SIZE \
>> +	(ARCH_KMALLOC_MINALIGN - sizeof(struct devres) % ARCH_KMALLOC_MINALIGN)
> 
> Even given your update with:
> ------------------------------->8--------------------------------
> #define DEVM_KMALLOC_PADDING_SIZE \
>   ((ARCH_KMALLOC_MINALIGN - sizeof(struct devres)) % ARCH_KMALLOC_MINALIGN)
> ------------------------------->8--------------------------------
> I don't think I understand why you need that "% ARCH_KMALLOC_MINALIGN" part?

To handle the case where sizeof(struct devres) > ARCH_KMALLOC_MINALIGN,
e.g. ARCH_KMALLOC_MINALIGN = 8 and sizeof(struct devres) = 12: sizeof()
yields a size_t, so the subtraction wraps around (unsigned), and the modulo
brings the result back to 4 -- exactly the padding needed to reach the next
multiple, 16.


>> +	/* Add enough padding to provide a DMA-safe address */
>> +	size += DEVM_KMALLOC_PADDING_SIZE;
> 
> This implementation gets ugly and will potentially lead to problems later
> when people start changing the code here. Compared to that, having dr->data
> aligned by the compiler looks much more foolproof.

Yes, it's better to let the compiler handle the padding... but we don't
want any padding in the non-devm_kmalloc use-case.

We could turn the data field into a pointer, but arches with a small
ARCH_KMALLOC_MINALIGN would then have to pay the size increase, which
doesn't seem fair to them (x86, amd64).


>> @@ -822,7 +825,7 @@ void * devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
>>  	 */
>>  	set_node_dbginfo(&dr->node, "devm_kzalloc_release", size);
>>  	devres_add(dev, dr->data);
>> -	return dr->data;
>> +	return dr->data + DEVM_KMALLOC_PADDING_SIZE;
> 
> Ditto. But first I'd like to understand what you are really trying to do
> with your change, and then we'll see if there could be a better implementation.

Basically, every call to devres_alloc() or devm_add_action() allocates
256 bytes instead of 128. A typical arm64 system will call these
thousands of times during driver probe.

Regards.
Marc Gonzalez Dec. 20, 2019, 10:19 a.m. UTC | #5
On 17/12/2019 16:30, Marc Gonzalez wrote:

[snip]

Would anyone else have any suggestions, comments, insights, recommendations,
improvements, guidance, or wisdom? :-)

I keep thinking about the memory waste caused by the strict alignment requirement
on arm64. Is there a way to inspect how much memory has been requested vs how much
has been allocated? (Turning on SLAB DEBUG perhaps?)

Couldn't there be a kmalloc flag saying "this alloc will not require strict
alignment, so just give me something 8-byte aligned" ?

Regards.
Greg KH Dec. 20, 2019, 10:22 a.m. UTC | #6
On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
> On 17/12/2019 16:30, Marc Gonzalez wrote:
> 
> [snip]
> 
> Would anyone else have any suggestions, comments, insights, recommendations,
> improvements, guidance, or wisdom? :-)
> 
> I keep thinking about the memory waste caused by the strict alignment requirement
> on arm64. Is there a way to inspect how much memory has been requested vs how much
> has been allocated? (Turning on SLAB DEBUG perhaps?)
> 
> Couldn't there be a kmalloc flag saying "this alloc will not require strict
> alignment, so just give me something 8-byte aligned" ?

Or you can not use the devm interface for lots of tiny allocations :)

thanks,

greg k-h
Greg KH Dec. 20, 2019, 10:22 a.m. UTC | #7
On Fri, Dec 20, 2019 at 11:22:18AM +0100, Greg Kroah-Hartman wrote:
> On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
> > On 17/12/2019 16:30, Marc Gonzalez wrote:
> > 
> > [snip]
> > 
> > Would anyone else have any suggestions, comments, insights, recommendations,
> > improvements, guidance, or wisdom? :-)
> > 
> > I keep thinking about the memory waste caused by the strict alignment requirement
> > on arm64. Is there a way to inspect how much memory has been requested vs how much
> > has been allocated? (Turning on SLAB DEBUG perhaps?)
> > 
> > Couldn't there be a kmalloc flag saying "this alloc will not require strict
> > alignment, so just give me something 8-byte aligned" ?
> 
> Or you can not use the devm interface for lots of tiny allocations :)

Oh, never mind: "normal" kmalloc allocations are all aligned that way
anyway, so that's not going to solve anything, sorry.

greg k-h
Marc Gonzalez Dec. 20, 2019, 12:05 p.m. UTC | #8
On 20/12/2019 11:22, Greg Kroah-Hartman wrote:

> On Fri, Dec 20, 2019 at 11:22:18AM +0100, Greg Kroah-Hartman wrote:
>
>> On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
>>
>>> I keep thinking about the memory waste caused by the strict alignment requirement
>>> on arm64. Is there a way to inspect how much memory has been requested vs how much
>>> has been allocated? (Turning on SLAB DEBUG perhaps?)
>>>
>>> Couldn't there be a kmalloc flag saying "this alloc will not require strict
>>> alignment, so just give me something 8-byte aligned" ?
>>
>> Or you can not use the devm interface for lots of tiny allocations :)
> 
> Oh, never mind: "normal" kmalloc allocations are all aligned that way
> anyway, so that's not going to solve anything, sorry.

(For some context, and for what it's worth, my opinion is that device-managed
deallocation is the best thing since sliced bread.)

Typical devm use-case is:
1) user allocates a resource
2) user registers release_func+resource_context to devm

So typically, only 2 pointers (which is no issue when the alignment
requirement is 8 bytes). By nature, these are "small" allocations.
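
For instance, devm_clk_get() boils down to roughly this (a sketch from
memory, error handling abbreviated):

static void devm_clk_release(struct device *dev, void *res)
{
	clk_put(*(struct clk **)res);
}

struct clk *devm_clk_get(struct device *dev, const char *id)
{
	struct clk **ptr, *clk;

	/* 1) the devres "data" is a single pointer */
	ptr = devres_alloc(devm_clk_release, sizeof(*ptr), GFP_KERNEL);
	if (!ptr)
		return ERR_PTR(-ENOMEM);

	clk = clk_get(dev, id);
	if (IS_ERR(clk)) {
		devres_free(ptr);
		return clk;
	}

	/* 2) register release callback + context */
	*ptr = clk;
	devres_add(dev, ptr);
	return clk;
}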

devm_kmalloc does not follow this pattern; it is a kind of optimization.
1) user does not allocate the resource (RAM)...
2) ...because the devm framework "merges" the user's memory request with
its own memory request for storing metadata -- as a memory allocator does
when it stores metadata for the request "in front of" the memory block.
(this is the reason why devm_kmalloc_release() is a noop)


(The following is just random thinking out loud)

If "fixing" the kmalloc strict alignment requirement on arm64 is too
hard, maybe it would be possible to shave some of the devm memory
waste by working with (chained) arrays of devm nodes, instead
of a straight-up linked list. (Akin to a C++ vector) Removal would
be expensive, but that's supposed to be a rare operation, right?

Regards.
Peter Zijlstra Dec. 20, 2019, 2:06 p.m. UTC | #9
On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
> Would anyone else have any suggestions, comments, insights, recommendations,
> improvements, guidance, or wisdom? :-)

Flip devres upside down!

**WARNING, wear protective glasses when reading the below**


struct devres {
	struct devres_node	node;
	void			*data;
};

/*
 * We place struct devres at the tail of the memory allocation
 * such that data retains the ARCH_KMALLOC_MINALIGN alignment.
 * struct devres itself is just 4 pointers and should therefore
 * only require trivial alignment.
 */
static inline struct devres *data2devres(void *data)
{
	return (struct devres *)(data + ksize(data) - sizeof(struct devres));
}

void *alloc_dr(...)
{
	struct devres *dr;
	void *data;

	data = kmalloc(size + sizeof(struct devres), GFP_KERNEL);
	dr = data2devres(data);
	WARN_ON((unsigned long)dr & __alignof(*dr)-1);
	INIT_LIST_HEAD(&dr->node.entry);
	dr->node.release = release;
	dr->data = data;

	return dr;
}

void devres_free(void *data)
{
	if (data) {
		struct devres *dr = data2devres(data);
		BUG_ON(!list_empty(&dr->node.entry));
		kfree(data);
	}
}

static int release_nodes(...)
{
	...
	list_for_each_entry_safe_reverse(dr, ...) {
		...
		kfree(dr->data);
	}
}
Greg KH Dec. 20, 2019, 2:16 p.m. UTC | #10
On Fri, Dec 20, 2019 at 03:06:55PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
> > Would anyone else have any suggestions, comments, insights, recommendations,
> > improvements, guidance, or wisdom? :-)
> 
> Flip devres upside down!
> 
> **WARNING, wear protective glasses when reading the below**
> 
> [snip]
> 

Ok, that's my cue to walk away from the keyboard and start drinking; I
think the holiday season has now officially started.

ugh.  crazy.

but brilliant :)

greg k-h
Robin Murphy Dec. 20, 2019, 3:01 p.m. UTC | #11
/me rouses from holiday mode...

On 2019-12-20 2:06 pm, Peter Zijlstra wrote:
> On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
>> Would anyone else have any suggestions, comments, insights, recommendations,
>> improvements, guidance, or wisdom? :-)
> 
> Flip devres upside down!

Which doesn't really help :(

> **WARNING, wear protective glasses when reading the below**
> 
> 
> struct devres {
> 	struct devres_node	node;
> 	void			*data;
> };
> 
> /*
>   * We place struct devres at the tail of the memory allocation
>   * such that data retains the ARCH_KMALLOC_MINALIGN alignment.
>   * struct devres itself is just 4 pointers and should therefore
>   * only require trivial alignment.
>   */
> static inline struct devres *data2devres(void *data)
> {
> 	return (struct devres *)(data + ksize(data) - sizeof(struct devres));
> }
> 
> void *alloc_dr(...)
> {
> 	struct devres *dr;
> 	void *data;
> 
> 	data = kmalloc(size + sizeof(struct devres), GFP_KERNEL);

At this point, you'd still need to special-case devm_kmalloc() to ensure 
size is rounded up to the next ARCH_KMALLOC_MINALIGN granule, or you'd 
go back to the original problem of the struct devres fields potentially 
sharing a cache line with the data buffer. That needs to be avoided, 
because if the devres list is modified while the buffer is mapped for 
noncoherent DMA (which could legitimately happen as they are nominally 
distinct allocations with different owners) there's liable to be data 
corruption one way or the other.

No matter which way round you allocate devres and data, by necessity 
they're always going to consume the same total amount of memory.

Robin.

> 	dr = data2devres(data);
> 	WARN_ON((unsigned long)dr & __alignof(*dr)-1);
> 	INIT_LIST_HEAD(&dr->node.entry);
> 	dr->node.release = release;
> 	dr->data = data;
> 
> 	return dr;
> }
> 
> void devres_free(void *data)
> {
> 	if (data) {
> 		struct devres *dr = data2devres(data);
> 		BUG_ON(!list_empty(&dr->node.entry));
> 		kfree(data);
> 	}
> }
> 
> static int release_nodes(...)
> {
> 	...
> 	list_for_each_entry_safe_reverse(dr, ...) {
> 		...
> 		kfree(dr->data);
> 	}
> }
>
Peter Zijlstra Dec. 20, 2019, 5:13 p.m. UTC | #12
On Fri, Dec 20, 2019 at 03:01:03PM +0000, Robin Murphy wrote:
> On 2019-12-20 2:06 pm, Peter Zijlstra wrote:

> > 	data = kmalloc(size + sizeof(struct devres), GFP_KERNEL);
> 
> At this point, you'd still need to special-case devm_kmalloc() to ensure
> size is rounded up to the next ARCH_KMALLOC_MINALIGN granule, or you'd go
> back to the original problem of the struct devres fields potentially sharing
> a cache line with the data buffer. That needs to be avoided, because if the
> devres list is modified while the buffer is mapped for noncoherent DMA
> (which could legitimately happen as they are nominally distinct allocations
> with different owners) there's liable to be data corruption one way or the
> other.

Wait up, why are you allowing non-coherent DMA at less than page size
granularity? Is that really sane? Is this really supported behaviour for
devm?
Peter Zijlstra Dec. 20, 2019, 5:19 p.m. UTC | #13
On Fri, Dec 20, 2019 at 01:05:43PM +0100, Marc Gonzalez wrote:
> If "fixing" the kmalloc strict alignment requirement on arm64 is too
> hard, maybe it would be possible to shave some of the devm memory
> waste by working with (chained) arrays of devm nodes, instead
> of a straight-up linked list. (Akin to a C++ vector) Removal would
> be expensive, but that's supposed to be a rare operation, right?

xarray might be what you're looking for.
Alexey Brodkin Dec. 20, 2019, 7:32 p.m. UTC | #14
Hi Robin, Peter, all,

[snip]
 
> On 2019-12-20 2:06 pm, Peter Zijlstra wrote:
> > On Fri, Dec 20, 2019 at 11:19:27AM +0100, Marc Gonzalez wrote:
> >> Would anyone else have any suggestions, comments, insights, recommendations,
> >> improvements, guidance, or wisdom? :-)
> >
> > Flip devres upside down!
> 
> Which doesn't really help :(
> 
> > [snip]
> > 	data = kmalloc(size + sizeof(struct devres), GFP_KERNEL);
> 
> At this point, you'd still need to special-case devm_kmalloc() to ensure
> size is rounded up to the next ARCH_KMALLOC_MINALIGN granule, or you'd
> go back to the original problem of the struct devres fields potentially
> sharing a cache line with the data buffer. That needs to be avoided,
> because if the devres list is modified while the buffer is mapped for
> noncoherent DMA (which could legitimately happen as they are nominally
> distinct allocations with different owners) there's liable to be data
> corruption one way or the other.

Well, it somehow used to work for quite some time now, with the data buffer
being allocated at a 4-word offset (which is 16 bytes on 32-bit platforms
and 32 on 64-bit, still much less than the mentioned 128 bytes).
Or did we just never manage to identify those rare cases when data corruption
really happened?

> No matter which way round you allocate devres and data, by necessity
> they're always going to consume the same total amount of memory.

So then the next option I guess is to separate meta-data from data buffers
completely. Are there any reasons to not do that other than the hack we're
discussing here (meta-data in the beginning of the buffer) used to work OK-ish?

-Alexey
Peter Zijlstra Dec. 20, 2019, 8:23 p.m. UTC | #15
On Fri, Dec 20, 2019 at 07:32:16PM +0000, Alexey Brodkin wrote:

> Well, it somehow used to work for quite some time now, with the data buffer
> being allocated at a 4-word offset (which is 16 bytes on 32-bit platforms

3 words, devres_node is 3 words.

Which is exactly why we had to change it; the odd alignment caused ARC
to explode.

> and 32 on 64-bit, still much less than the mentioned 128 bytes).
> Or did we just never manage to identify those rare cases when data corruption
> really happened?

The races are rather rare methinks; you'd have to get a list-op
concurrently with a DMA.

If you get the list corrupted, I'm thinking the crash is fairly likely,
albeit really difficult to debug.
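
Something like this, with the devres node sharing a cache line with the
mapped buffer (sketch):

	CPU                              device
	---                              ------
	dma_map_single(buf)              (cache line cleaned/invalidated)
	list_del(&dr->node.entry)        (dirties the shared cache line)
	                                 DMA writes into buf
	dma_unmap_single(buf)            (the invalidate throws away the
	                                  list update, or an eviction in
	                                  between overwrites the freshly
	                                  DMA'd data)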

> > No matter which way round you allocate devres and data, by necessity
> > they're always going to consume the same total amount of memory.
> 
> So then the next option I guess is to separate meta-data from data buffers
> completely. Are there any reasons to not do that

Dunno, should work just fine I think.

> other than the hack we're
> discussing here (meta-data in the beginning of the buffer) used to work OK-ish?

If meta-data at the beginning used to work, I don't see why meta-data at
the end wouldn't work equally well. They'd be equally broken.

But I'm still flabbergasted at the fact that they're doing non-coherent
DMA to kmalloc memory; I thought we had a DMA API for that, with a
special allocator and everything (but what do I know, I've never used
that).
Alexey Brodkin Dec. 20, 2019, 9:02 p.m. UTC | #16
Hi Peter,

> > Well, it somehow used to work for quite some time now, with the data buffer
> > being allocated at a 4-word offset (which is 16 bytes on 32-bit platforms
> 
> 3 words, devres_node is 3 words.

Correct, but a 4th word was implicitly there due to the fact that
on most 32-bit arches "long long" is aligned to 2 words.
 
> Which is exactly why we had to change it; the odd alignment caused ARC
> to explode.

I know that better than anybody else as it was my pain & grief :)
 
> > and 32 on 64-bit, still much less than the mentioned 128 bytes).
> > Or did we just never manage to identify those rare cases when data corruption
> > really happened?
> 
> The races are rather rare methinks; you'd have to get a list-op
> concurrently with a DMA.
> 
> If you get the list corrupted, I'm thinking the crash is fairly likely,
> albeit really difficult to debug.

So that alone IMHO is a good reason to not allow that thing to happen even
in theory.

> > > No matter which way round you allocate devres and data, by necessity
> > > they're always going to consume the same total amount of memory.
> >
> > So then the next option I guess is to separate meta-data from data buffers
> > completely. Are there any reasons to not do that
> 
> Dunno, should work just fine I think.
> 
> > other than the hack we're
> > discussing here (meta-data in the beginning of the buffer) used to work OK-ish?
> 
> If meta-data at the beginning used to work, I don't see why meta-data at
> the end wouldn't work equally well. They'd be equally broken.

Agreed. But if we imagine devm allocations are not used for DMA
(which is yet another case of an interface being used in a way it was never
designed for, but alas this happens left and right), then moving the meta-data
to the end of the buffers solves [mostly my] problem... but given that the DMA
case we discuss exists, I'm not sure this move is actually worth spending time on.

-Alexey
Dmitry Torokhov Dec. 20, 2019, 9:47 p.m. UTC | #17
On Fri, Dec 20, 2019 at 09:02:24PM +0000, Alexey Brodkin wrote:
> Hi Peter,
> 
> > > Well, it somehow used to work for quite some time now, with the data buffer
> > > being allocated at a 4-word offset (which is 16 bytes on 32-bit platforms
> > 
> > 3 words, devres_node is 3 words.
> 
> Correct, but a 4th word was implicitly there due to the fact that
> on most 32-bit arches "long long" is aligned to 2 words.
>  
> > Which is exactly why we had to change it; the odd alignment caused ARC
> > to explode.
> 
> I know that better than anybody else as it was my pain & grief :)
>  
> > > and 32 on 64-bit, still much less than the mentioned 128 bytes).
> > > Or did we just never manage to identify those rare cases when data corruption
> > > really happened?
> > 
> > The races are rather rare methinks; you'd have to get a list-op
> > concurrently with a DMA.
> > 
> > If you get the list corrupted, I'm thinking the crash is fairly likely,
> > albeit really difficult to debug.
> 
> So that alone IMHO is a good reason to not allow that thing to happen even
> in theory.
> 
> > > > No matter which way round you allocate devres and data, by necessity
> > > > they're always going to consume the same total amount of memory.
> > >
> > > So then the next option I guess is to separate meta-data from data buffers
> > > completely. Are there any reasons to not do that
> > 
> > Dunno, should work just fine I think.
> > 
> > > other than the hack we're
> > > discussing here (meta-data in the beginning of the buffer) used to work OK-ish?
> > 
> > If meta-data at the beginning used to work, I don't see why meta-data at
> > the end wouldn't work equally well. They'd be equally broken.

No, not really. With data being aligned to ARCH_KMALLOC_MINALIGN and coming
after the devres private stuff, given that the next allocation will also be
aligned to ARCH_KMALLOC_MINALIGN (because that's what k*alloc will give
us), we are guaranteed that DMA will not stomp on any unrelated data.
With the devres private data coming after data and not having any alignment
constraints, we may very well clobber it when doing DMA.

BTW, I am not sure where the page-size restriction you mentioned earlier
is coming from. We have been using kmalloc()ed memory as buffers
suitable for DMA since forever; we only need to make sure such data
is isolated from other data the CPU might be accessing, by ARCH_DMA_MINALIGN,
which is usually the L1 cache line size.

From Documentation/DMA-API-HOWTO.txt:

2) ARCH_DMA_MINALIGN

   Architectures must ensure that kmalloc'ed buffer is
   DMA-safe. Drivers and subsystems depend on it. If an architecture
   isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in
   the CPU cache is identical to data in main memory),
   ARCH_DMA_MINALIGN must be set so that the memory allocator
   makes sure that kmalloc'ed buffer doesn't share a cache line with
   the others. See arch/arm/include/asm/cache.h as an example.

   Note that ARCH_DMA_MINALIGN is about DMA memory alignment
   constraints. You don't need to worry about the architecture data
   alignment constraints (e.g. the alignment constraints about 64-bit
   objects).

> 
> Agreed. But if we imagine devm allocations are not used for DMA
> (which is yet another case of an interface being used in a way it was never
> designed for, but alas this happens left and right), then moving the meta-data
> to the end of the buffers solves [mostly my] problem... but given that the DMA
> case we discuss exists, I'm not sure this move is actually worth spending time on.

Well, there is a metric ton of devm users that allocate not memory
buffers but other objects, for which we do not need to worry about
alignment.

Thanks.
Robin Murphy Dec. 20, 2019, 10:02 p.m. UTC | #18
On 2019-12-20 5:13 pm, Peter Zijlstra wrote:
> On Fri, Dec 20, 2019 at 03:01:03PM +0000, Robin Murphy wrote:
>> On 2019-12-20 2:06 pm, Peter Zijlstra wrote:
> 
>>> 	data = kmalloc(size + sizeof(struct devres), GFP_KERNEL);

[ afterthought: size could also legitimately be smaller than a pointer 
or some odd length such that the necessary alignment of struct devres 
itself isn't met ]

>> At this point, you'd still need to special-case devm_kmalloc() to ensure
>> size is rounded up to the next ARCH_KMALLOC_MINALIGN granule, or you'd go
>> back to the original problem of the struct devres fields potentially sharing
>> a cache line with the data buffer. That needs to be avoided, because if the
>> devres list is modified while the buffer is mapped for noncoherent DMA
>> (which could legitimately happen as they are nominally distinct allocations
>> with different owners) there's liable to be data corruption one way or the
>> other.
> 
> Wait up, why are you allowing non-coherent DMA at less than page size
> granularity? Is that really sane? Is this really supported behaviour for
> devm?

There are two DMA APIs - the coherent (or "consistent") API for 
allocating buffers which are guaranteed safe for random access at any 
time *is* page-based, and on non-coherent architectures typically 
involves a non-cacheable remap. There is also the streaming API for 
one-off transfers of data already existing at a given kernel address 
(think network packets, USB URBs, etc), which on non-coherent 
architectures is achieved with explicit cache maintenance plus an API 
contract that buffers must not be explicitly accessed by CPUs for the 
duration of the mapping. Addresses from kmalloc() are explicitly valid 
for dma_map_single() (and indeed are about the only thing you'd ever 
reasonably feed it), which is the primary reason why 
ARCH_KMALLOC_MINALIGN gets so big on architectures which can be 
non-coherent and also suffer from creative cache designs.

See DMA-API.txt and DMA-API-HOWTO.txt in Documentation/ for more details 
if you like.
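
Roughly, as a sketch:

	struct device *dev;	/* some DMA-capable device */
	dma_addr_t dma_handle, dma;
	void *buf, *pkt;

	/* coherent API: page-granular, safe for concurrent CPU/device access */
	buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);

	/* streaming API: takes an existing (e.g. kmalloc'ed) address; the
	 * CPU must keep its hands off between map and unmap */
	pkt = kmalloc(size, GFP_KERNEL);
	dma = dma_map_single(dev, pkt, size, DMA_FROM_DEVICE);
	/* ... device DMAs into pkt ... */
	dma_unmap_single(dev, dma, size, DMA_FROM_DEVICE);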

Robin.
Peter Zijlstra Jan. 6, 2020, 10:05 a.m. UTC | #19
On Fri, Dec 20, 2019 at 10:02:13PM +0000, Robin Murphy wrote:
> There is also the streaming API for one-off transfers
> of data already existing at a given kernel address (think network packets,
> USB URBs, etc), which on non-coherent architectures is achieved with
> explicit cache maintenance plus an API contract that buffers must not be
> explicitly accessed by CPUs for the duration of the mapping. Addresses from
> kmalloc() are explicitly valid for dma_map_single() (and indeed are about
> the only thing you'd ever reasonably feed it), which is the primary reason
> why ARCH_KMALLOC_MINALIGN gets so big on architectures which can be
> non-coherent and also suffer from creative cache designs.

Would it make sense to extend KASAN (or something) to detect violations
of this 'promise'? Because quite obviously this was broken for the longest
time and was only accidentally fixed due to the ARC alignment thingy.
Who knows how many other sites are subtly broken too.

Have the dma_{,un}map_single() things mark the memory as
uninitialized/inaccessible such that any concurrent access will trigger
a splat.

Patch

diff --git a/drivers/base/devres.c b/drivers/base/devres.c
index 0bbb328bd17f..bf39188613d9 100644
--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -26,14 +26,7 @@  struct devres_node {
 
 struct devres {
 	struct devres_node		node;
-	/*
-	 * Some archs want to perform DMA into kmalloc caches
-	 * and need a guaranteed alignment larger than
-	 * the alignment of a 64-bit integer.
-	 * Thus we use ARCH_KMALLOC_MINALIGN here and get exactly the same
-	 * buffer alignment as if it was allocated by plain kmalloc().
-	 */
-	u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];
+	u8				data[];
 };
 
 struct devres_group {
@@ -789,9 +782,16 @@  static void devm_kmalloc_release(struct device *dev, void *res)
 	/* noop */
 }
 
+#define DEVM_KMALLOC_PADDING_SIZE \
+	(ARCH_KMALLOC_MINALIGN - sizeof(struct devres) % ARCH_KMALLOC_MINALIGN)
+
 static int devm_kmalloc_match(struct device *dev, void *res, void *data)
 {
-	return res == data;
+	/*
+	 * 'res' is dr->data (not DMA-safe)
+	 * 'data' is the hand-aligned address from devm_kmalloc
+	 */
+	return res + DEVM_KMALLOC_PADDING_SIZE == data;
 }
 
 /**
@@ -811,6 +811,9 @@  void * devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
 {
 	struct devres *dr;
 
+	/* Add enough padding to provide a DMA-safe address */
+	size += DEVM_KMALLOC_PADDING_SIZE;
+
 	/* use raw alloc_dr for kmalloc caller tracing */
 	dr = alloc_dr(devm_kmalloc_release, size, gfp, dev_to_node(dev));
 	if (unlikely(!dr))
@@ -822,7 +825,7 @@  void * devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
 	 */
 	set_node_dbginfo(&dr->node, "devm_kzalloc_release", size);
 	devres_add(dev, dr->data);
-	return dr->data;
+	return dr->data + DEVM_KMALLOC_PADDING_SIZE;
 }
 EXPORT_SYMBOL_GPL(devm_kmalloc);