[RFC,0/2] guarantee natural alignment for kmalloc()

Message ID 20190319211108.15495-1-vbabka@suse.cz (mailing list archive)

Message

Vlastimil Babka March 19, 2019, 9:11 p.m. UTC
The recent thread [1] inspired me to look into guaranteeing alignment for
kmalloc() for power-of-two sizes. Turns out it's not difficult and in most
configuration nothing really changes as it happens implicitly. More details in
the first patch. If we agree we want to do this, I will see where to update
documentation and perhaps if there are any workarounds in the tree that can be
converted to plain kmalloc() afterwards.

The second patch is quick and dirty selftest for the alignment. Suggestions
welcome whether and how to include this kind of selftest that has to be
in-kernel.
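
For illustration, the kind of check it does could look roughly like this (a
simplified sketch only, not the actual patch):

#include <linux/slab.h>
#include <linux/kernel.h>

/* Simplified sketch only -- not the actual selftest. */
static void kmalloc_alignment_check(void)
{
	unsigned int size;

	for (size = 8; size <= PAGE_SIZE; size *= 2) {
		void *obj = kmalloc(size, GFP_KERNEL);

		if (!obj)
			continue;
		/* a power-of-two sized kmalloc() should be size-aligned */
		WARN((unsigned long)obj & (size - 1),
		     "kmalloc(%u) not naturally aligned: %px\n", size, obj);
		kfree(obj);
	}
}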

[1] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#u

Vlastimil Babka (2):
  mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
  mm, sl[aou]b: test whether kmalloc() alignment works as expected

 mm/slab_common.c | 30 +++++++++++++++++++++++++++++-
 mm/slob.c        | 42 +++++++++++++++++++++++++++++++-----------
 2 files changed, 60 insertions(+), 12 deletions(-)

Comments

Christoph Lameter (Ampere) March 20, 2019, 12:43 a.m. UTC | #1
On Tue, 19 Mar 2019, Vlastimil Babka wrote:

> The recent thread [1] inspired me to look into guaranteeing alignment for
> kmalloc() for power-of-two sizes. Turns out it's not difficult and in most
> configuration nothing really changes as it happens implicitly. More details in
> the first patch. If we agree we want to do this, I will see where to update
> documentation and perhaps if there are any workarounds in the tree that can be
> converted to plain kmalloc() afterwards.

This means that the alignments are no longer uniform for all kmalloc
caches and we get back to code making all sorts of assumptions about
kmalloc alignments.

Currently all kmalloc objects are aligned to KMALLOC_MIN_ALIGN. That will
no longer be the case and alignments will become inconsistent.

I think it's valuable that alignment requirements need to be explicitly
requested.

Let's add an array of power-of-two aligned kmalloc caches if that is really
necessary. Add some GFP_XXX flag to kmalloc to make it ^2 aligned maybe?
David Rientjes March 20, 2019, 12:53 a.m. UTC | #2
On Wed, 20 Mar 2019, Christopher Lameter wrote:

> > The recent thread [1] inspired me to look into guaranteeing alignment for
> > kmalloc() for power-of-two sizes. Turns out it's not difficult and in most
> > configuration nothing really changes as it happens implicitly. More details in
> > the first patch. If we agree we want to do this, I will see where to update
> > documentation and perhaps if there are any workarounds in the tree that can be
> > converted to plain kmalloc() afterwards.
> 
> This means that the alignments are no longer uniform for all kmalloc
> caches and we get back to code making all sorts of assumptions about
> kmalloc alignments.
> 
> Currently all kmalloc objects are aligned to KMALLOC_MIN_ALIGN. That will
> no longer be the case and alignments will become inconsistent.
> 
> I think it's valuable that alignment requirements need to be explicitly
> requested.
> 
> Let's add an array of power-of-two aligned kmalloc caches if that is really
> necessary. Add some GFP_XXX flag to kmalloc to make it ^2 aligned maybe?
> 

No objection, but I think the GFP flags should remain what they are for: 
to Get Free Pages.  If we are to add additional flags to specify 
characteristics of slab objects, can we add a kmalloc_flags() variant that 
will take a new set of flags?  SLAB_OBJ_ALIGN_POW2?
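
Purely as a sketch of the idea (hypothetical -- neither kmalloc_flags() nor
SLAB_OBJ_ALIGN_POW2 exists today), it could look something like:

/* hypothetical interface, sketch only */
void *kmalloc_flags(size_t size, gfp_t gfp, slab_flags_t slab_flags);

/* a caller explicitly asking for power-of-two (natural) alignment */
void *buf = kmalloc_flags(512, GFP_KERNEL, SLAB_OBJ_ALIGN_POW2);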
Vlastimil Babka March 20, 2019, 8:48 a.m. UTC | #3
On 3/20/19 1:43 AM, Christopher Lameter wrote:
> On Tue, 19 Mar 2019, Vlastimil Babka wrote:
> 
>> The recent thread [1] inspired me to look into guaranteeing alignment for
>> kmalloc() for power-of-two sizes. Turns out it's not difficult and in most
>> configuration nothing really changes as it happens implicitly. More details in
>> the first patch. If we agree we want to do this, I will see where to update
>> documentation and perhaps if there are any workarounds in the tree that can be
>> converted to plain kmalloc() afterwards.
> 
> This means that the alignments are no longer uniform for all kmalloc
> caches and we get back to code making all sorts of assumptions about
> kmalloc alignments.

Natural alignment to size is rather well defined, no? Would anyone ever
assume a larger one, and for what reason?
It's now that some make assumptions (even unknowingly) of natural
alignment, and then break with SLOB, or SLUB with debug options.
There are two 'odd' sizes, 96 and 192, which will keep cacheline size
alignment; would anyone really expect more than 64 bytes?

> Currently all kmalloc objects are aligned to KMALLOC_MIN_ALIGN. That will
> no longer be the case and alignments will become inconsistent.

KMALLOC_MIN_ALIGN is still the minimum, but in practice it's larger
which is not a problem.

Also let me stress again that nothing really changes except for SLOB,
and SLUB with debug options. The natural alignment for power-of-two
sizes already happens as SLAB and SLUB both allocate objects starting on
the page boundary. So people make assumptions based on that, and then
break with SLOB, or SLUB with debug. This patch just prevents that
breakage by guaranteeing those natural assumptions at all times.
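
For example (illustrative only), think of a caller along these lines:

#include <linux/slab.h>

static void *alloc_sector_buf(void)
{
	/*
	 * Illustrative only: with SLAB and SLUB (no debugging) this happens
	 * to come back 512-byte aligned, and callers doing 512-byte sector
	 * I/O may rely on that, even unknowingly; with SLOB, or SLUB with
	 * debug options, that implicit guarantee disappears today.
	 */
	return kmalloc(512, GFP_KERNEL);
}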

> I think it's valuable that alignment requirements need to be explicitly
> requested.

That's still possible for named caches created by kmem_cache_create().
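
For example (illustrative only):

#include <linux/slab.h>

/* a named cache with an explicitly requested alignment */
static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
	foo_cache = kmem_cache_create("foo_cache", 512 /* size */,
				      512 /* align */, 0 /* flags */, NULL);
	return foo_cache ? 0 : -ENOMEM;
}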

> Let's add an array of power-of-two aligned kmalloc caches if that is really
> necessary. Add some GFP_XXX flag to kmalloc to make it ^2 aligned maybe?

That's unnecessary and wasteful, as the existing caches are already
aligned in the common configurations. Requiring a flag doesn't help with
the implicit assumptions going wrong. I really don't think it needs to
get more complicated than adjusting the uncommon configuration, as this
patch does.
Christoph Lameter (Ampere) March 20, 2019, 6:20 p.m. UTC | #4
On Wed, 20 Mar 2019, Vlastimil Babka wrote:

> > This means that the alignments are no longer uniform for all kmalloc
> > caches and we get back to code making all sorts of assumptions about
> > kmalloc alignments.
>
> Natural alignment to size is rather well defined, no? Would anyone ever
> assume a larger one, and for what reason?
> It's now that some make assumptions (even unknowingly) of natural
> alignment, and then break with SLOB, or SLUB with debug options.
> There are two 'odd' sizes, 96 and 192, which will keep cacheline size
> alignment; would anyone really expect more than 64 bytes?

I think one would expect one set of alignments for any kmalloc object.

> > Currently all kmalloc objects are aligned to KMALLOC_MIN_ALIGN. That will
> > no longer be the case and alignments will become inconsistent.
>
> KMALLOC_MIN_ALIGN is still the minimum, but in practice it's larger
> which is not a problem.

"In practice" refers to the current way that slab allocators arrange
objects within the page. They are free to do otherwise if new ideas come
up for object arrangements etc.

The slab allocators may already have to store data in addition to the
user-accessible part (e.g. for RCU or a ctor). The "natural alignment" of a
power-of-2 cache is no longer what you expect in those cases. Debugging is
not the only case where we extend the object.

> Also let me stress again that nothing really changes except for SLOB,
> and SLUB with debug options. The natural alignment for power-of-two
> sizes already happens as SLAB and SLUB both allocate objects starting on
> the page boundary. So people make assumptions based on that, and then
> break with SLOB, or SLUB with debug. This patch just prevents that
> breakage by guaranteeing those natural assumptions at all times.

As explained before, there is nothing "natural" here. Doing so restricts
future features and creates a mess of exceptions within the allocator for
debugging etc. (see what happened to SLAB). "Natural" is just a
simplistic notion of how a user would arrange power-of-2 objects.
These assumptions should not be made but specified explicitly.

> > I think it's valuable that alignment requirements need to be explicitly
> > requested.
>
> That's still possible for named caches created by kmem_cache_create().

So lets leave it as it is now then.
Matthew Wilcox March 20, 2019, 6:53 p.m. UTC | #5
On Wed, Mar 20, 2019 at 09:48:47AM +0100, Vlastimil Babka wrote:
> Natural alignment to size is rather well defined, no? Would anyone ever
> assume a larger one, and for what reason?
> It's now that some make assumptions (even unknowingly) of natural
> alignment, and then break with SLOB, or SLUB with debug options.
> There are two 'odd' sizes, 96 and 192, which will keep cacheline size
> alignment; would anyone really expect more than 64 bytes?

Presumably 96 will keep being aligned to 32 bytes, as aligning 96 to 64
just results in 128-byte allocations.
Vlastimil Babka March 20, 2019, 9:48 p.m. UTC | #6
On 3/20/2019 7:53 PM, Matthew Wilcox wrote:
> On Wed, Mar 20, 2019 at 09:48:47AM +0100, Vlastimil Babka wrote:
>> Natural alignment to size is rather well defined, no? Would anyone ever
>> assume a larger one, and for what reason?
>> It's now that some make assumptions (even unknowingly) of natural
>> alignment, and then break with SLOB, or SLUB with debug options.
>> There are two 'odd' sizes, 96 and 192, which will keep cacheline size
>> alignment; would anyone really expect more than 64 bytes?
> 
> Presumably 96 will keep being aligned to 32 bytes, as aligning 96 to 64
> just results in 128-byte allocations.

Well, looks like that's what happens. This is with SLAB, but the alignment
calculations should be common: 

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-96          2611   4896    128   32    1 : tunables  120   60    8 : slabdata    153    153      0
kmalloc-128         4798   5536    128   32    1 : tunables  120   60    8 : slabdata    173    173      0
Matthew Wilcox March 21, 2019, 2:23 a.m. UTC | #7
On Wed, Mar 20, 2019 at 10:48:03PM +0100, Vlastimil Babka wrote:
> On 3/20/2019 7:53 PM, Matthew Wilcox wrote:
> > On Wed, Mar 20, 2019 at 09:48:47AM +0100, Vlastimil Babka wrote:
> >> Natural alignment to size is rather well defined, no? Would anyone ever
> >> assume a larger one, and for what reason?
> >> It's now that some make assumptions (even unknowingly) of natural
> >> alignment, and then break with SLOB, or SLUB with debug options.
> >> There are two 'odd' sizes, 96 and 192, which will keep cacheline size
> >> alignment; would anyone really expect more than 64 bytes?
> > 
> > Presumably 96 will keep being aligned to 32 bytes, as aligning 96 to 64
> > just results in 128-byte allocations.
> 
> Well, looks like that's what happens. This is with SLAB, but the alignment
> calculations should be common: 
> 
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc-96          2611   4896    128   32    1 : tunables  120   60    8 : slabdata    153    153      0
> kmalloc-128         4798   5536    128   32    1 : tunables  120   60    8 : slabdata    173    173      0

Hmm.  On my laptop, I see:

kmalloc-96         28050  35364     96   42    1 : tunables    0    0    0 : slabdata    842    842      0

That'd take me from 842 * 4k pages to 1105 4k pages -- an extra megabyte of
memory.
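
(The arithmetic, assuming 4 KiB slabs: at 96 bytes that's 4096/96 = 42 objects
per slab, so 35364 objects need ~842 slabs; padded to 128 bytes it's 32 per
slab, i.e. ~1105 slabs; the ~263 extra pages come to roughly 1 MiB.)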

This is running Debian's 4.19 kernel:

# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
CONFIG_SLAB_FREELIST_RANDOM=y
CONFIG_SLAB_FREELIST_HARDENED=y
CONFIG_SLUB_CPU_PARTIAL=y
Vlastimil Babka March 21, 2019, 7:02 a.m. UTC | #8
On 3/21/19 3:23 AM, Matthew Wilcox wrote:
> On Wed, Mar 20, 2019 at 10:48:03PM +0100, Vlastimil Babka wrote:
>>
>> Well, looks like that's what happens. This is with SLAB, but the alignment
>> calculations should be common: 
>>
>> slabinfo - version: 2.1
>> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>> kmalloc-96          2611   4896    128   32    1 : tunables  120   60    8 : slabdata    153    153      0
>> kmalloc-128         4798   5536    128   32    1 : tunables  120   60    8 : slabdata    173    173      0
> 
> Hmm.  On my laptop, I see:
> 
> kmalloc-96         28050  35364     96   42    1 : tunables    0    0    0 : slabdata    842    842      0
> 
> That'd take me from 842 * 4k pages to 1105 4k pages -- an extra megabyte of
> memory.
> 
> This is running Debian's 4.19 kernel:
> 
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y

Ah, you're right. SLAB creates kmalloc caches with:

#ifndef ARCH_KMALLOC_FLAGS
#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
#endif

create_kmalloc_caches(ARCH_KMALLOC_FLAGS);

While SLUB just:

create_kmalloc_caches(0);

even though it uses SLAB_HWCACHE_ALIGN for kmem_cache_node and
kmem_cache caches.
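
For reference, the difference comes from how the common code computes the
alignment when SLAB_HWCACHE_ALIGN is set; roughly (a paraphrase of
calculate_alignment() in mm/slab_common.c, exact code varies by version):

	if (flags & SLAB_HWCACHE_ALIGN) {
		unsigned int ralign = cache_line_size();

		/* don't inflate alignment of objects much smaller than a line */
		while (size <= ralign / 2)
			ralign /= 2;
		align = max(align, ralign);
	}
	/*
	 * With a 64-byte cacheline a 96-byte object thus gets 64-byte
	 * alignment and is padded to 128 bytes, matching the SLAB slabinfo
	 * output quoted above.
	 */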
Vlastimil Babka March 21, 2019, 7:42 a.m. UTC | #9
On 3/20/19 7:20 PM, Christopher Lameter wrote:
>>> Currently all kmalloc objects are aligned to KMALLOC_MIN_ALIGN. That will
>>> no longer be the case and alignments will become inconsistent.
>>
>> KMALLOC_MIN_ALIGN is still the minimum, but in practice it's larger
>> which is not a problem.
> 
> "In practice" refers to the current way that slab allocators arrange
> objects within the page. They are free to do otherwise if new ideas come
> up for object arrangements etc.
> 
> The slab allocators may already have to store data in addition to the
> user-accessible part (e.g. for RCU or a ctor). The "natural alignment" of a
> power-of-2 cache is no longer what you expect in those cases. Debugging is
> not the only case where we extend the object.

For plain kmalloc() caches, RCU and ctors don't apply, right.

>> Also let me stress again that nothing really changes except for SLOB,
>> and SLUB with debug options. The natural alignment for power-of-two
>> sizes already happens as SLAB and SLUB both allocate objects starting on
>> the page boundary. So people make assumptions based on that, and then
>> break with SLOB, or SLUB with debug. This patch just prevents that
>> breakage by guaranteeing those natural assumptions at all times.
> 
> As explained before there is nothing "natural" here. Doing so restricts
> future features

Well, future features will have to deal with the existing named caches
created with specific alignment.

> and creates a mess of exceptions within the allocator for
> debugging etc. (see what happened to SLAB).

SLAB could be fixed, just nobody cares enough I guess. If I want to
debug wrong SL*B usage I'll use SLUB.

> "Natural" is just a
> simplistic thought of a user how he would arrange power of 2 objects.
> These assumption should not be made but specified explicitly.

Patch 1 does this explicitly for plain kmalloc(). It's unrealistic to
add an 'align' parameter to plain kmalloc(), as that would have to create
caches on demand for 'new' values of the align parameter.

>>> I think it's valuable that alignment requirements need to be explicitly
>>> requested.
>>
>> That's still possible for named caches created by kmem_cache_create().
> 
> So lets leave it as it is now then.

That however doesn't work well for the xfs/IO case where block sizes are
not known in advance:

https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f
Christoph Lameter (Ampere) March 22, 2019, 5:52 p.m. UTC | #10
On Thu, 21 Mar 2019, Vlastimil Babka wrote:

> That however doesn't work well for the xfs/IO case where block sizes are
> not known in advance:
>
> https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f

I thought we agreed to use custom slab caches for that?
Vlastimil Babka April 5, 2019, 5:11 p.m. UTC | #11
On 3/22/19 6:52 PM, Christopher Lameter wrote:
> On Thu, 21 Mar 2019, Vlastimil Babka wrote:
> 
>> That however doesn't work well for the xfs/IO case where block sizes are
>> not known in advance:
>>
>> https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f
> 
> I thought we agreed to use custom slab caches for that?

Hm maybe I missed something but my impression was that xfs/IO folks would have
to create lots of them for various sizes not known in advance, and that it
wasn't practical and would welcome if kmalloc just guaranteed the alignment.
But so far they haven't chimed in here in this thread, so I guess I'm wrong.
Christoph Hellwig April 7, 2019, 8 a.m. UTC | #12
On Fri, Apr 05, 2019 at 07:11:17PM +0200, Vlastimil Babka wrote:
> On 3/22/19 6:52 PM, Christopher Lameter wrote:
> > On Thu, 21 Mar 2019, Vlastimil Babka wrote:
> > 
> >> That however doesn't work well for the xfs/IO case where block sizes are
> >> not known in advance:
> >>
> >> https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f
> > 
> > I thought we agreed to use custom slab caches for that?
> 
> Hm maybe I missed something but my impression was that xfs/IO folks would have
> to create lots of them for various sizes not known in advance, and that it
> wasn't practical and would welcome if kmalloc just guaranteed the alignment.
> But so far they haven't chimed in here in this thread, so I guess I'm wrong.

Yes, in XFS we might have quite a few.  Never mind all the other
block level consumers that might have similar reasonable expectations
but haven't triggered the problematic drivers yet.
Vlastimil Babka April 9, 2019, 8:07 a.m. UTC | #13
On 4/7/19 10:00 AM, Christoph Hellwig wrote:
> On Fri, Apr 05, 2019 at 07:11:17PM +0200, Vlastimil Babka wrote:
>> On 3/22/19 6:52 PM, Christopher Lameter wrote:
>> > On Thu, 21 Mar 2019, Vlastimil Babka wrote:
>> > 
>> >> That however doesn't work well for the xfs/IO case where block sizes are
>> >> not known in advance:
>> >>
>> >> https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f
>> > 
>> > I thought we agreed to use custom slab caches for that?
>> 
>> Hm maybe I missed something but my impression was that xfs/IO folks would have
>> to create lots of them for various sizes not known in advance, and that it
>> wasn't practical and would welcome if kmalloc just guaranteed the alignment.
>> But so far they haven't chimed in here in this thread, so I guess I'm wrong.
> 
> Yes, in XFS we might have quite a few.  Never mind all the other
> block level consumers that might have similar reasonable expectations
> but haven't triggered the problematic drivers yet.

What about a LSF session/BoF to sort this out, then? Would need to have people
from all three MM+FS+IO groups, I suppose.
Michal Hocko April 9, 2019, 9:20 a.m. UTC | #14
On Tue 09-04-19 10:07:42, Vlastimil Babka wrote:
> On 4/7/19 10:00 AM, Christoph Hellwig wrote:
> > On Fri, Apr 05, 2019 at 07:11:17PM +0200, Vlastimil Babka wrote:
> >> On 3/22/19 6:52 PM, Christopher Lameter wrote:
> >> > On Thu, 21 Mar 2019, Vlastimil Babka wrote:
> >> > 
> >> >> That however doesn't work well for the xfs/IO case where block sizes are
> >> >> not known in advance:
> >> >>
> >> >> https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/T/#ec3a292c358d05a6b29cc4a9ce3ae6b2faf31a23f
> >> > 
> >> > I thought we agreed to use custom slab caches for that?
> >> 
> >> Hm maybe I missed something but my impression was that xfs/IO folks would have
> >> to create lots of them for various sizes not known in advance, and that it
> >> wasn't practical and would welcome if kmalloc just guaranteed the alignment.
> >> But so far they haven't chimed in here in this thread, so I guess I'm wrong.
> > 
> > Yes, in XFS we might have quite a few.  Never mind all the other
> > block level consumers that might have similar reasonable expectations
> > but haven't triggered the problematic drivers yet.
> 
> What about a LSF session/BoF to sort this out, then? Would need to have people
> from all three MM+FS+IO groups, I suppose.

Sounds like a good plan. Care to send an email to the lsf-pc mailing list so
that it doesn't fall through the cracks, please?