Message ID | 20190826111627.7505-3-vbabka@suse.cz (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | guarantee natural alignment for kmalloc() | expand |
On Mon, 26 Aug 2019, Vlastimil Babka wrote: > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' > variant would not help with code unknowingly relying on the implicit alignment. > For slab implementations it would either require creating more kmalloc caches, > or allocate a larger size and only give back part of it. That would be > wasteful, especially with a generic alignment parameter (in contrast with a > fixed alignment to size). The additional caches will be detected if similar to existing ones and merged into one. So the overhead is not that significant. > Ideally we should provide to mm users what they need without difficult > workarounds or own reimplementations, so let's make the kmalloc() alignment to > size explicitly guaranteed for power-of-two sizes under all configurations. The objection remains that this will create exceptions for the general notion that all kmalloc caches are aligned to KMALLOC_MINALIGN which may be suprising and it limits the optimizations that slab allocators may use for optimizing data use. The SLOB allocator was designed in such a way that data wastage is limited. The changes here sabotage that goal and show that future slab allocators may be similarly constrained with the exceptional alignents implemented. Additional debugging features etc etc must all support the exceptional alignment requirements. > * SLUB layout is also unchanged unless redzoning is enabled through > CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With > this patch, explicit alignment is guaranteed with redzoning as well. This > will result in more memory being wasted, but that should be acceptable in a > debugging scenario. Well ok. That sounds fine (apart from breaking the rules for slab object alignment). > * SLOB has no implicit alignment so this patch adds it explicitly for > kmalloc(). The potential downside is increased fragmentation. While > pathological allocation scenarios are certainly possible, in my testing, > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > was consumed by slab pages both before and after the patch, with difference > in the noise. This change to slob will cause a significant additional use of memory. The advertised advantage of SLOB is that *minimal* memory will be used since it is targeted for embedded systems. Different types of slab objects of varying sizes can be allocated in the same memory page to reduce allocation overhead. Having these exceptional rules for aligning power of two sizes caches will significantly increase the memory wastage in SLOB. The result of this patch is just to use more memory to be safe from certain pathologies where one subsystem was relying on an alignment that was not specified. That is why this approach should not be called �natural" but "implicit alignment". The one using the slab cache is not aware that the slab allocator provides objects aligned in a special way (which is in general not needed. There seems to be a single pathological case that needs to be addressed and I thought that was due to some brokenness in the hardware?). It is better to ensure that subsystems that require special alignment explicitly tell the allocator about this. I still think implicit exceptions to alignments are a bad idea. Those need to be explicity specified and that is possible using kmem_cache_create().
On Wed, Aug 28, 2019 at 06:45:07PM +0000, Christopher Lameter wrote: > > Ideally we should provide to mm users what they need without difficult > > workarounds or own reimplementations, so let's make the kmalloc() alignment to > > size explicitly guaranteed for power-of-two sizes under all configurations. > > The objection remains that this will create exceptions for the general > notion that all kmalloc caches are aligned to KMALLOC_MINALIGN which may Hmm? kmalloc caches will be aligned to both KMALLOC_MINALIGN and the natural alignment of the object. > be suprising and it limits the optimizations that slab allocators may use > for optimizing data use. The SLOB allocator was designed in such a way > that data wastage is limited. The changes here sabotage that goal and show > that future slab allocators may be similarly constrained with the > exceptional alignents implemented. Additional debugging features etc etc > must all support the exceptional alignment requirements. While I sympathise with the poor programmer who has to write the fourth implementation of the sl*b interface, it's more for the pain of picking a new letter than the pain of needing to honour the alignment of allocations. There are many places in the kernel which assume alignment. They break when it's not supplied. I believe we have a better overall system if the MM developers provide stronger guarantees than the MM consumers have to work around only weak guarantees. > > * SLOB has no implicit alignment so this patch adds it explicitly for > > kmalloc(). The potential downside is increased fragmentation. While > > pathological allocation scenarios are certainly possible, in my testing, > > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > > was consumed by slab pages both before and after the patch, with difference > > in the noise. > > This change to slob will cause a significant additional use of memory. The > advertised advantage of SLOB is that *minimal* memory will be used since > it is targeted for embedded systems. Different types of slab objects of > varying sizes can be allocated in the same memory page to reduce > allocation overhead. Did you not read the part where he said the difference was in the noise? > The result of this patch is just to use more memory to be safe from > certain pathologies where one subsystem was relying on an alignment that > was not specified. That is why this approach should not be called > �natural" but "implicit alignment". The one using the slab cache is not > aware that the slab allocator provides objects aligned in a special way > (which is in general not needed. There seems to be a single pathological > case that needs to be addressed and I thought that was due to some > brokenness in the hardware?). It turns out there are lots of places which assume this, including the pmem driver, the ramdisk driver and a few other similar drivers. > It is better to ensure that subsystems that require special alignment > explicitly tell the allocator about this. But it's not the subsystems which have this limitation which do the allocation; it's the subsystems who allocate the memory that they then pass to the subsystems. So you're forcing communication of these limits up & down the stack. > I still think implicit exceptions to alignments are a bad idea. Those need > to be explicity specified and that is possible using kmem_cache_create(). I swear we covered this last time the topic came up, but XFS would need to create special slab caches for each size between 512 and PAGE_SIZE. Potentially larger, depending on whether the MM developers are willing to guarantee that kmalloc(PAGE_SIZE * 2, GFP_KERNEL) will return a PAGE_SIZE aligned block of memory indefinitely.
On Wed, Aug 28, 2019 at 12:46:08PM -0700, Matthew Wilcox wrote: > On Wed, Aug 28, 2019 at 06:45:07PM +0000, Christopher Lameter wrote: > > I still think implicit exceptions to alignments are a bad idea. Those need > > to be explicity specified and that is possible using kmem_cache_create(). > > I swear we covered this last time the topic came up, but XFS would need > to create special slab caches for each size between 512 and PAGE_SIZE. > Potentially larger, depending on whether the MM developers are willing to > guarantee that kmalloc(PAGE_SIZE * 2, GFP_KERNEL) will return a PAGE_SIZE > aligned block of memory indefinitely. Page size alignment of multi-page heap allocations is ncessary. The current behaviour w/ KASAN is to offset so a 8KB allocation spans 3 pages and is not page aligned. That causes just as much in way of alignment problems as unaligned objects in multi-object-per-page slabs. As I said in the lastest discussion of this problem on XFS (pmem devices w/ KASAN enabled), all we -need- is a GFP flag that tells the slab allocator to give us naturally aligned object or fail if it can't. I don't care how that gets implemented (e.g. another set of heap slabs like the -rcl slabs), I just don't want every high level subsystem that allocates heap memory for IO buffers to have to implement their own aligned slab caches. Cheers, Dave.
On Wed 28-08-19 12:46:08, Matthew Wilcox wrote: > On Wed, Aug 28, 2019 at 06:45:07PM +0000, Christopher Lameter wrote: [...] > > be suprising and it limits the optimizations that slab allocators may use > > for optimizing data use. The SLOB allocator was designed in such a way > > that data wastage is limited. The changes here sabotage that goal and show > > that future slab allocators may be similarly constrained with the > > exceptional alignents implemented. Additional debugging features etc etc > > must all support the exceptional alignment requirements. > > While I sympathise with the poor programmer who has to write the > fourth implementation of the sl*b interface, it's more for the pain of > picking a new letter than the pain of needing to honour the alignment > of allocations. > > There are many places in the kernel which assume alignment. They break > when it's not supplied. I believe we have a better overall system if > the MM developers provide stronger guarantees than the MM consumers have > to work around only weak guarantees. I absolutely agree. A hypothetical benefit of a new implementation doesn't outweigh the complexity the existing code has to jump over or worse is not aware of and it is broken silently. My general experience is that the later is more likely with a large variety of drivers we have in the tree and odd things they do in general.
On 8/29/19 12:24 AM, Dave Chinner wrote: > On Wed, Aug 28, 2019 at 12:46:08PM -0700, Matthew Wilcox wrote: >> On Wed, Aug 28, 2019 at 06:45:07PM +0000, Christopher Lameter wrote: >>> I still think implicit exceptions to alignments are a bad idea. Those need >>> to be explicity specified and that is possible using kmem_cache_create(). >> >> I swear we covered this last time the topic came up, but XFS would need >> to create special slab caches for each size between 512 and PAGE_SIZE. >> Potentially larger, depending on whether the MM developers are willing to >> guarantee that kmalloc(PAGE_SIZE * 2, GFP_KERNEL) will return a PAGE_SIZE >> aligned block of memory indefinitely. > > Page size alignment of multi-page heap allocations is ncessary. The > current behaviour w/ KASAN is to offset so a 8KB allocation spans 3 > pages and is not page aligned. That causes just as much in way > of alignment problems as unaligned objects in multi-object-per-page > slabs. Ugh, multi-page (power of two) allocations *at the page allocator level* simply have to be aligned, as that's how the buddy allocator has always worked, and it would be madness to try relax that guarantee and require an explicit flag at this point. The kmalloc wrapper with SLUB will pass everything above 8KB directly to the page allocator, so that's fine too. 4k and 8k are the only (multi-)page sizes still managed as SLUB objects. I would say that these sizes are the most striking example that it's wrong not to align them without extra flags or special API variant. > As I said in the lastest discussion of this problem on XFS (pmem > devices w/ KASAN enabled), all we -need- is a GFP flag that tells the > slab allocator to give us naturally aligned object or fail if it > can't. I don't care how that gets implemented (e.g. another set of > heap slabs like the -rcl slabs), I just don't want every high level Given alignment is orthogonal to -rcl and dma-, would that be another three sets? Or we assume that dma- would want it always, and complicate the rules further? Funilly enough, SLOB would be the simplest case here. > subsystem that allocates heap memory for IO buffers to have to > implement their own aligned slab caches. Definitely agree. I still hope we can do that even without a new flag/API. > Cheers, > > Dave. >
On Thu, Aug 29, 2019 at 09:56:13AM +0200, Vlastimil Babka wrote: > On 8/29/19 12:24 AM, Dave Chinner wrote: > > On Wed, Aug 28, 2019 at 12:46:08PM -0700, Matthew Wilcox wrote: > >> On Wed, Aug 28, 2019 at 06:45:07PM +0000, Christopher Lameter wrote: > >>> I still think implicit exceptions to alignments are a bad idea. Those need > >>> to be explicity specified and that is possible using kmem_cache_create(). > >> > >> I swear we covered this last time the topic came up, but XFS would need > >> to create special slab caches for each size between 512 and PAGE_SIZE. > >> Potentially larger, depending on whether the MM developers are willing to > >> guarantee that kmalloc(PAGE_SIZE * 2, GFP_KERNEL) will return a PAGE_SIZE > >> aligned block of memory indefinitely. > > > > Page size alignment of multi-page heap allocations is ncessary. The > > current behaviour w/ KASAN is to offset so a 8KB allocation spans 3 > > pages and is not page aligned. That causes just as much in way > > of alignment problems as unaligned objects in multi-object-per-page > > slabs. > > Ugh, multi-page (power of two) allocations *at the page allocator level* > simply have to be aligned, as that's how the buddy allocator has always > worked, and it would be madness to try relax that guarantee and require > an explicit flag at this point. The kmalloc wrapper with SLUB will pass > everything above 8KB directly to the page allocator, so that's fine too. > 4k and 8k are the only (multi-)page sizes still managed as SLUB objects. On a 4kB page size box, yes. On a 64kB page size system, 4/8kB allocations are still sub-page objects and will have alignment issues. Hence right now we can't assume a 4/8/16/32kB allocation will be page size aligned anywhere, because they are heap allocations on 64kB page sized machines. > I would say that these sizes are the most striking example that it's > wrong not to align them without extra flags or special API variant. Yup, just pointing out that they aren't guaranteed alignment right now on x86-64. > > As I said in the lastest discussion of this problem on XFS (pmem > > devices w/ KASAN enabled), all we -need- is a GFP flag that tells the > > slab allocator to give us naturally aligned object or fail if it > > can't. I don't care how that gets implemented (e.g. another set of > > heap slabs like the -rcl slabs), I just don't want every high level > > Given alignment is orthogonal to -rcl and dma-, would that be another > three sets? Or we assume that dma- would want it always, and complicate > the rules further? Funilly enough, SLOB would be the simplest case here. Not my problem. :) All I'm pointing out is that the minimum functionality we require is specifying individual allocations as needing alignment. I've just implemented that API in XFS, so whatever happens in the allocation infrastructure from this point onwards is really just implementation optimisation for us now.... Cheers, Dave.
On Thu, 29 Aug 2019, Michal Hocko wrote: > > There are many places in the kernel which assume alignment. They break > > when it's not supplied. I believe we have a better overall system if > > the MM developers provide stronger guarantees than the MM consumers have > > to work around only weak guarantees. > > I absolutely agree. A hypothetical benefit of a new implementation > doesn't outweigh the complexity the existing code has to jump over or > worse is not aware of and it is broken silently. My general experience > is that the later is more likely with a large variety of drivers we have > in the tree and odd things they do in general. The current behavior without special alignment for these caches has been in the wild for over a decade. And this is now coming up? There is one case now it seems with a broken hardware that has issues and we now move to an alignment requirement from the slabs with exceptions and this and that? If there is an exceptional alignment requirement then that needs to be communicated to the allocator. A special flag or create a special kmem_cache or something.
On Fri, Aug 30, 2019 at 05:41:46PM +0000, Christopher Lameter wrote: > On Thu, 29 Aug 2019, Michal Hocko wrote: > > > There are many places in the kernel which assume alignment. They break > > > when it's not supplied. I believe we have a better overall system if > > > the MM developers provide stronger guarantees than the MM consumers have > > > to work around only weak guarantees. > > > > I absolutely agree. A hypothetical benefit of a new implementation > > doesn't outweigh the complexity the existing code has to jump over or > > worse is not aware of and it is broken silently. My general experience > > is that the later is more likely with a large variety of drivers we have > > in the tree and odd things they do in general. > > The current behavior without special alignment for these caches has been > in the wild for over a decade. And this is now coming up? In the wild ... and rarely enabled. When it is enabled, it may or may not be noticed as data corruption, or tripping other debugging asserts. Users then turn off the rare debugging option. > There is one case now it seems with a broken hardware that has issues and > we now move to an alignment requirement from the slabs with exceptions and > this and that? Perhaps you could try reading what hasa been written instead of sticking to a strawman of your own invention? > If there is an exceptional alignment requirement then that needs to be > communicated to the allocator. A special flag or create a special > kmem_cache or something. The only way I'd agree to that is if we deliberately misalign every allocation that doesn't have this special flag set. Because right now, breakage happens everywhere when these debug options are enabled, and the very people who need to be helped are being hurt by the debugging.
On Sat, 31 Aug 2019, Matthew Wilcox wrote: > > The current behavior without special alignment for these caches has been > > in the wild for over a decade. And this is now coming up? > > In the wild ... and rarely enabled. When it is enabled, it may or may > not be noticed as data corruption, or tripping other debugging asserts. > Users then turn off the rare debugging option. Its enabled in all full debug session as far as I know. Fedora for example has been running this for ages to find breakage in device drivers etc etc. > > If there is an exceptional alignment requirement then that needs to be > > communicated to the allocator. A special flag or create a special > > kmem_cache or something. > > The only way I'd agree to that is if we deliberately misalign every > allocation that doesn't have this special flag set. Because right now, > breakage happens everywhere when these debug options are enabled, and > the very people who need to be helped are being hurt by the debugging. That is customarily occurring for testing by adding "slub_debug" to the kernel commandline (or adding debug kernel options) and since my information is that this is done frequently (and has been for over a decade now) I am having a hard time believing the stories of great breakage here. These drivers were not tested with debugging on before? Never ran with a debug kernel?
On Tue, Sep 03, 2019 at 08:13:45PM +0000, Christopher Lameter wrote: > On Sat, 31 Aug 2019, Matthew Wilcox wrote: > > > > The current behavior without special alignment for these caches has been > > > in the wild for over a decade. And this is now coming up? > > > > In the wild ... and rarely enabled. When it is enabled, it may or may > > not be noticed as data corruption, or tripping other debugging asserts. > > Users then turn off the rare debugging option. > > Its enabled in all full debug session as far as I know. Fedora for > example has been running this for ages to find breakage in device drivers > etc etc. Are you telling me nobody uses the ramdisk driver on fedora? Because that's one of the affected drivers. > > > If there is an exceptional alignment requirement then that needs to be > > > communicated to the allocator. A special flag or create a special > > > kmem_cache or something. > > > > The only way I'd agree to that is if we deliberately misalign every > > allocation that doesn't have this special flag set. Because right now, > > breakage happens everywhere when these debug options are enabled, and > > the very people who need to be helped are being hurt by the debugging. > > That is customarily occurring for testing by adding "slub_debug" to the > kernel commandline (or adding debug kernel options) and since my > information is that this is done frequently (and has been for over a > decade now) I am having a hard time believing the stories of great > breakage here. These drivers were not tested with debugging on before? > Never ran with a debug kernel? Whatever is being done is clearly not enough to trigger the bug. So how about it? Create an option to slab/slub to always return misaligned memory.
On Tue, Sep 03, 2019 at 01:53:12PM -0700, Matthew Wilcox wrote: > > Its enabled in all full debug session as far as I know. Fedora for > > example has been running this for ages to find breakage in device drivers > > etc etc. > > Are you telling me nobody uses the ramdisk driver on fedora? Because > that's one of the affected drivers. For pmem/brd misaligned memory alone doesn't seem to be the problem. Misaligned memory that cross a page barrier is. And at least XFS before my log recovery changes only used kmalloc for smaller than page size allocation, so this case probably didn't hit. But other cases where alignment and not just not crossing a page boundary occurred and we had problems with those before. It just too a long time for people to root cause them.
On Wed, Sep 04, 2019 at 07:19:33AM +0200, Christoph Hellwig wrote: > On Tue, Sep 03, 2019 at 01:53:12PM -0700, Matthew Wilcox wrote: > > > Its enabled in all full debug session as far as I know. Fedora for > > > example has been running this for ages to find breakage in device drivers > > > etc etc. > > > > Are you telling me nobody uses the ramdisk driver on fedora? Because > > that's one of the affected drivers. > > For pmem/brd misaligned memory alone doesn't seem to be the problem. > Misaligned memory that cross a page barrier is. And at least XFS > before my log recovery changes only used kmalloc for smaller than > page size allocation, so this case probably didn't hit. BTW, does sl[aou]b guarantee that smaller than page size allocation via kmalloc() won't cross page boundary any time? Thanks, Ming
On 9/4/19 8:40 AM, Ming Lei wrote: > On Wed, Sep 04, 2019 at 07:19:33AM +0200, Christoph Hellwig wrote: >> On Tue, Sep 03, 2019 at 01:53:12PM -0700, Matthew Wilcox wrote: >>>> Its enabled in all full debug session as far as I know. Fedora for >>>> example has been running this for ages to find breakage in device drivers >>>> etc etc. >>> >>> Are you telling me nobody uses the ramdisk driver on fedora? Because >>> that's one of the affected drivers. >> >> For pmem/brd misaligned memory alone doesn't seem to be the problem. >> Misaligned memory that cross a page barrier is. And at least XFS >> before my log recovery changes only used kmalloc for smaller than >> page size allocation, so this case probably didn't hit. > > BTW, does sl[aou]b guarantee that smaller than page size allocation via kmalloc() > won't cross page boundary any time? AFAIK no, it's the same as with the alignment. After this patch, that guarantee would follow from the alignment one for power of two sizes, but sizes 96 and 192 could still cross page boundary. > Thanks, > Ming >
On Tue, 3 Sep 2019, Matthew Wilcox wrote > > Its enabled in all full debug session as far as I know. Fedora for > > example has been running this for ages to find breakage in device drivers > > etc etc. > > Are you telling me nobody uses the ramdisk driver on fedora? Because > that's one of the affected drivers. How do I know? I dont run these tests. > > decade now) I am having a hard time believing the stories of great > > breakage here. These drivers were not tested with debugging on before? > > Never ran with a debug kernel? > > Whatever is being done is clearly not enough to trigger the bug. So how > about it? Create an option to slab/slub to always return misaligned > memory. It already exists Add "slub_debug" on the kernel command line.
On 8/26/19 1:16 PM, Vlastimil Babka wrote: > In most configurations, kmalloc() happens to return naturally aligned (i.e. > aligned to the block size itself) blocks for power of two sizes. That means > some kmalloc() users might unknowingly rely on that alignment, until stuff > breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, > and blocks stop being aligned. Then developers have to devise workaround such > as own kmem caches with specified alignment [1], which is not always practical, > as recently evidenced in [2]. > > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' > variant would not help with code unknowingly relying on the implicit alignment. > For slab implementations it would either require creating more kmalloc caches, > or allocate a larger size and only give back part of it. That would be > wasteful, especially with a generic alignment parameter (in contrast with a > fixed alignment to size). > > Ideally we should provide to mm users what they need without difficult > workarounds or own reimplementations, so let's make the kmalloc() alignment to > size explicitly guaranteed for power-of-two sizes under all configurations. > What this means for the three available allocators? > > * SLAB object layout happens to be mostly unchanged by the patch. The > implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due > to redzoning, however SLAB disables redzoning for caches with alignment > larger than unsigned long long. Practically on at least x86 this includes > kmalloc caches as they use cache line alignment, which is larger than that. > Still, this patch ensures alignment on all arches and cache sizes. > > * SLUB layout is also unchanged unless redzoning is enabled through > CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With > this patch, explicit alignment is guaranteed with redzoning as well. This > will result in more memory being wasted, but that should be acceptable in a > debugging scenario. > > * SLOB has no implicit alignment so this patch adds it explicitly for > kmalloc(). The potential downside is increased fragmentation. While > pathological allocation scenarios are certainly possible, in my testing, > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > was consumed by slab pages both before and after the patch, with difference > in the noise. > > [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ > [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ > [3] https://lwn.net/Articles/787740/ > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> So if anyone thinks this is a good idea, please express it (preferably in a formal way such as Acked-by), otherwise it seems the patch will be dropped (due to a private NACK, apparently). Otherwise I don't think there can be an objective conclusion. On the one hand we avoid further problems and workarounds due to misalignment (or objects allocated beyond page boundary, which was only recently mentioned), on the other hand we potentially make future changes to SLAB/SLUB or hypotetical new implementation either more complicated, or less effective due to extra fragmentation. Different people can have different opinions on what's more important. Let me however explain why I think we don't have to fear the future implementation complications that much. There was an argument IIRC that extra non-debug metadata could start to be prepended/appended to an object in the future (i.e. RCU freeing head?). 1) Caches can be already created with explicit alignment, so a naive pre/appending implementation would already waste memory on such caches. 2) Even without explicit alignment, a single slab cache for 512k objects with few bytes added to each object would waste almost 512k as the objects wouldn't fit precisely in an (order-X) page. The percentage wasted depends on X. 3) Roman recently posted a patchset [1] that basically adds a cgroup pointer to each object. The implementation doesn't append it to objects naively however, but adds a separately allocated array. Alignment is thus unchanged. [1] https://lore.kernel.org/linux-mm/20190905214553.1643060-1-guro@fb.com/
On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > So if anyone thinks this is a good idea, please express it (preferably > in a formal way such as Acked-by), otherwise it seems the patch will be > dropped (due to a private NACK, apparently). As a user of the allocator interface in filesystem, I'd like to see a more generic way to address the alignment guarantees so we don't have to apply workarounds like 3acd48507dc43eeeb each time we find that we missed something. (Where 'missed' might be another sort of weird memory corruption hard to trigger.) The workaround got applied because I was not sure about the timeframe of merge of this patch, also to remove pressure for merge in case there are more private acks and nacks to be sent. In the end I'd be fine with reverting the workaround in order to use the generic code again. Thanks.
On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: > On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > > So if anyone thinks this is a good idea, please express it (preferably > > in a formal way such as Acked-by), otherwise it seems the patch will be > > dropped (due to a private NACK, apparently). Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the privilege of gutting a patch via private NAK without any of that open development discussion incovenience. <grumble> As far as XFS is concerned I merged Dave's series that checks the alignment of io memory allocations and falls back to vmalloc if the alignment won't work, because I got tired of scrolling past the endless discussion and bug reports and inaction spanning months. Now this private NAK stuff helps me feel vindicated for merging it despite my misgivings because now I can declare that "XFS will just work around all the stupid broken sh*t it finds in the rest of the kernel". --D > As a user of the allocator interface in filesystem, I'd like to see a > more generic way to address the alignment guarantees so we don't have to > apply workarounds like 3acd48507dc43eeeb each time we find that we > missed something. (Where 'missed' might be another sort of weird memory > corruption hard to trigger.) > > The workaround got applied because I was not sure about the timeframe of > merge of this patch, also to remove pressure for merge in case there are > more private acks and nacks to be sent. In the end I'd be fine with > reverting the workaround in order to use the generic code again. > > Thanks.
On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > On 8/26/19 1:16 PM, Vlastimil Babka wrote: > > In most configurations, kmalloc() happens to return naturally aligned (i.e. > > aligned to the block size itself) blocks for power of two sizes. That means > > some kmalloc() users might unknowingly rely on that alignment, until stuff > > breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, > > and blocks stop being aligned. Then developers have to devise workaround such > > as own kmem caches with specified alignment [1], which is not always practical, > > as recently evidenced in [2]. > > > > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' > > variant would not help with code unknowingly relying on the implicit alignment. > > For slab implementations it would either require creating more kmalloc caches, > > or allocate a larger size and only give back part of it. That would be > > wasteful, especially with a generic alignment parameter (in contrast with a > > fixed alignment to size). > > > > Ideally we should provide to mm users what they need without difficult > > workarounds or own reimplementations, so let's make the kmalloc() alignment to > > size explicitly guaranteed for power-of-two sizes under all configurations. > > What this means for the three available allocators? > > > > * SLAB object layout happens to be mostly unchanged by the patch. The > > implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due > > to redzoning, however SLAB disables redzoning for caches with alignment > > larger than unsigned long long. Practically on at least x86 this includes > > kmalloc caches as they use cache line alignment, which is larger than that. > > Still, this patch ensures alignment on all arches and cache sizes. > > > > * SLUB layout is also unchanged unless redzoning is enabled through > > CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With > > this patch, explicit alignment is guaranteed with redzoning as well. This > > will result in more memory being wasted, but that should be acceptable in a > > debugging scenario. > > > > * SLOB has no implicit alignment so this patch adds it explicitly for > > kmalloc(). The potential downside is increased fragmentation. While > > pathological allocation scenarios are certainly possible, in my testing, > > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > > was consumed by slab pages both before and after the patch, with difference > > in the noise. > > > > [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ > > [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ > > [3] https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_787740_&d=DwICaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=UUX1YoPTOOycNowHuRP2ZnqwSwZFjAFrkQFrqstidZ0&s=Kt_XTKlh2qxbC_7ME44MV3_QWFVRHlI1p2EQZYP0uqY&e= > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > So if anyone thinks this is a good idea, please express it (preferably > in a formal way such as Acked-by), otherwise it seems the patch will be > dropped (due to a private NACK, apparently). > > Otherwise I don't think there can be an objective conclusion. On the one > hand we avoid further problems and workarounds due to misalignment (or > objects allocated beyond page boundary, which was only recently > mentioned), on the other hand we potentially make future changes to > SLAB/SLUB or hypotetical new implementation either more complicated, or > less effective due to extra fragmentation. Different people can have > different opinions on what's more important. > > Let me however explain why I think we don't have to fear the future > implementation complications that much. There was an argument IIRC that > extra non-debug metadata could start to be prepended/appended to an > object in the future (i.e. RCU freeing head?). > > 1) Caches can be already created with explicit alignment, so a naive > pre/appending implementation would already waste memory on such caches. > 2) Even without explicit alignment, a single slab cache for 512k objects > with few bytes added to each object would waste almost 512k as the > objects wouldn't fit precisely in an (order-X) page. The percentage > wasted depends on X. > 3) Roman recently posted a patchset [1] that basically adds a cgroup > pointer to each object. The implementation doesn't append it to objects > naively however, but adds a separately allocated array. Alignment is > thus unchanged. To be fair here, we *might* want to put this pointer just after/before the object to reduce the number of cache misses. I've put it into a separate place mostly for simplicity reasons. It's not an objection though, just a note. Thanks! > > [1] https://lore.kernel.org/linux-mm/20190905214553.1643060-1-guro@fb.com/ >
On Mon, Aug 26, 2019 at 01:16:27PM +0200, Vlastimil Babka wrote: > @@ -98,6 +98,10 @@ limited. The actual limit depends on the hardware and the kernel > configuration, but it is a good practice to use `kmalloc` for objects > smaller than page size. > > +The address of a chunk allocated with `kmalloc` is aligned to at least > +ARCH_KMALLOC_MINALIGN bytes. For sizes of power of two bytes, the "For sizes which are a power of two, the" > +alignment is also guaranteed to be at least to the respective size. s/to // Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
On Mon, 23 Sep 2019, Darrick J. Wong wrote: > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: > > On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > > > So if anyone thinks this is a good idea, please express it (preferably > > > in a formal way such as Acked-by), otherwise it seems the patch will be > > > dropped (due to a private NACK, apparently). > > Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the > privilege of gutting a patch via private NAK without any of that open > development discussion incovenience. <grumble> There was a public discussion about this issue and from what I can tell the outcome was that the allocator already provides what you want. Which was a mechanism to misalign objects and detect these issues. This mechanism has been in use for over a decade.
On Tue, Sep 24, 2019 at 08:47:52PM +0000, cl@linux.com wrote: > On Mon, 23 Sep 2019, Darrick J. Wong wrote: > > > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: > > > On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > > > > So if anyone thinks this is a good idea, please express it (preferably > > > > in a formal way such as Acked-by), otherwise it seems the patch will be > > > > dropped (due to a private NACK, apparently). > > > > Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the > > privilege of gutting a patch via private NAK without any of that open > > development discussion incovenience. <grumble> > > There was a public discussion about this issue and from what I can tell > the outcome was that the allocator already provides what you want. Which > was a mechanism to misalign objects and detect these issues. This > mechanism has been in use for over a decade. You missed the important part, which was *ENABLED BY DEFAULT*. People who are enabling a debugging option to debug their issues, should not have to first debug all the other issues that enabling that debugging option uncovers!
On Mon, 23 Sep 2019, David Sterba wrote: > As a user of the allocator interface in filesystem, I'd like to see a > more generic way to address the alignment guarantees so we don't have to > apply workarounds like 3acd48507dc43eeeb each time we find that we > missed something. (Where 'missed' might be another sort of weird memory > corruption hard to trigger.) The alignment guarantees are clearly documented and objects are misaligned in debugging kernels. Looking at 3acd48507dc43eeeb:Looks like no one tested that patch with a debug kernel or full debugging on until it hit mainline. Not good. The consequence for the lack of proper testing is to make the production kernel contain the debug measures?
n Tue, 24 Sep 2019, Matthew Wilcox wrote: > > There was a public discussion about this issue and from what I can tell > > the outcome was that the allocator already provides what you want. Which > > was a mechanism to misalign objects and detect these issues. This > > mechanism has been in use for over a decade. > > You missed the important part, which was *ENABLED BY DEFAULT*. People > who are enabling a debugging option to debug their issues, should not > have to first debug all the other issues that enabling that debugging > option uncovers! Why would you have to debug all other issues? You could put your patch on top of the latest stable or distro kernel for testing. And I thought the rc phase was there for everyone to work on the bugs of each other?
On 9/23/19 7:51 PM, Darrick J. Wong wrote: > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: >> On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: >>> So if anyone thinks this is a good idea, please express it (preferably >>> in a formal way such as Acked-by), otherwise it seems the patch will be >>> dropped (due to a private NACK, apparently). > > Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the > privilege of gutting a patch via private NAK without any of that open > development discussion incovenience. <grumble> > > As far as XFS is concerned I merged Dave's series that checks the > alignment of io memory allocations and falls back to vmalloc if the > alignment won't work, because I got tired of scrolling past the endless > discussion and bug reports and inaction spanning months. I think it's a big fail of kmalloc API that you have to do that, and especially with vmalloc, which has the overhead of setting up page tables, and it's a waste for allocation requests smaller than page size. I wish we could have nice things.
On Tue, Sep 24, 2019 at 11:19:29PM +0200, Vlastimil Babka wrote: > On 9/23/19 7:51 PM, Darrick J. Wong wrote: > > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: > >> On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > >>> So if anyone thinks this is a good idea, please express it (preferably > >>> in a formal way such as Acked-by), otherwise it seems the patch will be > >>> dropped (due to a private NACK, apparently). > > > > Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the > > privilege of gutting a patch via private NAK without any of that open > > development discussion incovenience. <grumble> > > > > As far as XFS is concerned I merged Dave's series that checks the > > alignment of io memory allocations and falls back to vmalloc if the > > alignment won't work, because I got tired of scrolling past the endless > > discussion and bug reports and inaction spanning months. > > I think it's a big fail of kmalloc API that you have to do that, and > especially with vmalloc, which has the overhead of setting up page > tables, and it's a waste for allocation requests smaller than page size. > I wish we could have nice things. I don't think the problem here is the code. The problem here is that we have a dysfunctional development community and there are no processes we can follow to ensure architectural problems in core subsystems are addressed in a timely manner... And this criticism isn't just of the mm/ here - this alignment problem is exacerbated by exactly the same issue on the block layer side. i.e. the block layer and drivers have -zero- bounds checking to catch these sorts of things and the block layer maintainer will not accept patches for runtime checks that would catch these issues and make them instantly visible to us. These are not code problems: we can fix the problems with code (and I have done so to demonstrate "this is how we do what you say is impossible"). The problem here is people in positions of control/power are repeatedly demonstrating an inability to compromise to reach a solution that works for everyone. It's far better for us just to work around bullshit like this in XFS now, then when the core subsystems get they act together years down the track we can remove the workaround from XFS. Users don't care how we fix the problem, they just want it fixed. If that means we have to route around dysfunctional developer groups, then we'll just have to do that.... Cheers, Dave.
On Wed, Sep 25, 2019 at 07:53:53AM +1000, Dave Chinner wrote: > On Tue, Sep 24, 2019 at 11:19:29PM +0200, Vlastimil Babka wrote: > > On 9/23/19 7:51 PM, Darrick J. Wong wrote: > > > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote: > > >> On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > > >>> So if anyone thinks this is a good idea, please express it (preferably > > >>> in a formal way such as Acked-by), otherwise it seems the patch will be > > >>> dropped (due to a private NACK, apparently). > > > > > > Oh, I didn't realize ^^^^^^^^^^^^ that *some* of us are allowed the > > > privilege of gutting a patch via private NAK without any of that open > > > development discussion incovenience. <grumble> > > > > > > As far as XFS is concerned I merged Dave's series that checks the > > > alignment of io memory allocations and falls back to vmalloc if the > > > alignment won't work, because I got tired of scrolling past the endless > > > discussion and bug reports and inaction spanning months. > > > > I think it's a big fail of kmalloc API that you have to do that, and > > especially with vmalloc, which has the overhead of setting up page > > tables, and it's a waste for allocation requests smaller than page size. > > I wish we could have nice things. > > I don't think the problem here is the code. The problem here is that > we have a dysfunctional development community and there are no > processes we can follow to ensure architectural problems in core > subsystems are addressed in a timely manner... > > And this criticism isn't just of the mm/ here - this alignment > problem is exacerbated by exactly the same issue on the block layer > side. i.e. the block layer and drivers have -zero- bounds checking > to catch these sorts of things and the block layer maintainer will > not accept patches for runtime checks that would catch these issues > and make them instantly visible to us. > > These are not code problems: we can fix the problems with code (and > I have done so to demonstrate "this is how we do what you say is > impossible"). The problem here is people in positions of > control/power are repeatedly demonstrating an inability to > compromise to reach a solution that works for everyone. > > It's far better for us just to work around bullshit like this in XFS > now, then when the core subsystems get they act together years down > the track we can remove the workaround from XFS. Users don't care > how we fix the problem, they just want it fixed. If that means we > have to route around dysfunctional developer groups, then we'll just > have to do that.... Seconded. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, 24 Sep 2019 20:52:52 +0000 (UTC) cl@linux.com wrote: > On Mon, 23 Sep 2019, David Sterba wrote: > > > As a user of the allocator interface in filesystem, I'd like to see a > > more generic way to address the alignment guarantees so we don't have to > > apply workarounds like 3acd48507dc43eeeb each time we find that we > > missed something. (Where 'missed' might be another sort of weird memory > > corruption hard to trigger.) > > The alignment guarantees are clearly documented and objects are misaligned > in debugging kernels. > > Looking at 3acd48507dc43eeeb:Looks like no one tested that patch with a > debug kernel or full debugging on until it hit mainline. Not good. > > The consequence for the lack of proper testing is to make the production > kernel contain the debug measures? This isn't a debug measure - it's making the interface do that which people evidently expect it to do. Minor point. I agree it's a bit regrettable to do this but it does appear that the change will make the kernel overall a better place given the reality of kernel development. Given this, have you reviewed the patch for overall implementation correctness? I'm wondering if we can avoid at least some of the patch's overhead if slab debugging is disabled - the allocators are already returning suitably aligned memory, so why add the new code in that case?
On 9/25/19 1:54 AM, Andrew Morton wrote: > On Tue, 24 Sep 2019 20:52:52 +0000 (UTC) cl@linux.com wrote: > >> On Mon, 23 Sep 2019, David Sterba wrote: >> >>> As a user of the allocator interface in filesystem, I'd like to see a >>> more generic way to address the alignment guarantees so we don't have to >>> apply workarounds like 3acd48507dc43eeeb each time we find that we >>> missed something. (Where 'missed' might be another sort of weird memory >>> corruption hard to trigger.) >> >> The alignment guarantees are clearly documented and objects are misaligned >> in debugging kernels. >> >> Looking at 3acd48507dc43eeeb:Looks like no one tested that patch with a >> debug kernel or full debugging on until it hit mainline. Not good. >> >> The consequence for the lack of proper testing is to make the production >> kernel contain the debug measures? > > This isn't a debug measure - it's making the interface do that which > people evidently expect it to do. Minor point. Yes, detecting issues due to misalignment is one thing, but then there are the workarounds necessary to achieve it (for multiple sizes, so no single kmem_cache_create(..., alignment)), as XFS folks demonstrated. > I agree it's a bit regrettable to do this but it does appear that the > change will make the kernel overall a better place given the reality of > kernel development. Thanks. > Given this, have you reviewed the patch for overall implementation > correctness? > > I'm wondering if we can avoid at least some of the patch's overhead if > slab debugging is disabled - the allocators are already returning > suitably aligned memory, so why add the new code in that case? Most of the new code is for SLOB, which has no debugging and yet misaligns. For SLUB and SLAB, it's just passing alignment argument to kmem_cache_create() for kmalloc caches, which means just extra few instructions during boot, and no extra code during kmalloc/kfree itself.
On Tue, 24 Sep 2019, Andrew Morton wrote: > I agree it's a bit regrettable to do this but it does appear that the > change will make the kernel overall a better place given the reality of > kernel development. No it wont. - It will only work for special cases like the kmalloc array without extras like metadata at the end of objects. - It will be an inconsistency in the alignments provided by the allocator. - It will cause us in the future to constantly consider these exceptional alignments in the maintenance of the allocators. - These alignments are only needed in exceptional cases but with the patch we will provide the alignment by default even if the allocating subsystem does not need it. - We have mechanisms to detect alignment problems using debug kernels and debug options that have been available for years. These were not used for testing in these cases it seems before the patches hit mainline. Once in mainly someone ran a debug kernel and found the issue. > Given this, have you reviewed the patch for overall implementation > correctness? Yes, the patch is fine. > I'm wondering if we can avoid at least some of the patch's overhead if > slab debugging is disabled - the allocators are already returning > suitably aligned memory, so why add the new code in that case? As far as I know this patch is not needed given that we have had the standards for alignments for a long time now. Why would the allocators provide specially aligned memory just based on the size of an object? This is weird and unexpected behavior.
On Wed, 25 Sep 2019, Vlastimil Babka wrote: > Most of the new code is for SLOB, which has no debugging and yet > misaligns. For SLUB and SLAB, it's just passing alignment argument to > kmem_cache_create() for kmalloc caches, which means just extra few > instructions during boot, and no extra code during kmalloc/kfree itself. SLOB follows the standards for alignments in slab allocators and will correctly align if you ask the allocator for a properly aligned object.
On 9/26/19 2:14 AM, Christopher Lameter wrote: > On Tue, 24 Sep 2019, Andrew Morton wrote: > >> I agree it's a bit regrettable to do this but it does appear that the >> change will make the kernel overall a better place given the reality of >> kernel development. > > No it wont. > > - It will only work for special cases like the kmalloc array > without extras like metadata at the end of objects. I don't understand what you mean here? The kmalloc caches are special because they don't have metadata at the end of objects? Others do? > - It will be an inconsistency in the alignments provided by the allocator. I don't see a scenario where this will cause a kmalloc user problems. Can you describe a scenario where a kmalloc users would have some assumptions about alignment, but due to this change, those assumptions will be incorrect, and how exactly would it break their code? > - It will cause us in the future to constantly consider these exceptional > alignments in the maintenance of the allocators. Caches can be already created with explicit alignment. This patch just means there are more of them. > - These alignments are only needed in exceptional cases but with the patch > we will provide the alignment by default even if the allocating subsystem > does not need it. True. This is where we have to make the decision whether to make things simpler for those that don't realize they need the alignment, and whether that's worth the cost. We have evidence of those cases, and the cost is currently zero in the common cases (SLAB, SLUB without debug runtime-enabled). > - We have mechanisms to detect alignment problems using debug kernels and > debug options that have been available for years. These were not used for > testing in these cases it seems before the patches hit mainline. Once in > mainly someone ran a debug kernel and found the issue. Debugging options are useful if you know there's a bug and you want to find it. AFAIK the various bots/CIs that do e.g. randconfig, or enable debug options explicitly, run those kernels in a VM, so I guess that's why potential breakage due to alignment can lurk in a hw-specific driver. >> Given this, have you reviewed the patch for overall implementation >> correctness? > > Yes, the patch is fine. > >> I'm wondering if we can avoid at least some of the patch's overhead if >> slab debugging is disabled - the allocators are already returning >> suitably aligned memory, so why add the new code in that case? > > As far as I know this patch is not needed given that we have had the > standards for alignments for a long time now. > > Why would the allocators provide specially aligned memory just based on > the size of an object? This is weird and unexpected behavior. For some, it's expected.
On Tue, Sep 24, 2019 at 08:55:02PM +0000, cl@linux.com wrote: > n Tue, 24 Sep 2019, Matthew Wilcox wrote: > > > > There was a public discussion about this issue and from what I can tell > > > the outcome was that the allocator already provides what you want. Which > > > was a mechanism to misalign objects and detect these issues. This > > > mechanism has been in use for over a decade. > > > > You missed the important part, which was *ENABLED BY DEFAULT*. People > > who are enabling a debugging option to debug their issues, should not > > have to first debug all the other issues that enabling that debugging > > option uncovers! > > Why would you have to debug all other issues? You could put your patch on > top of the latest stable or distro kernel for testing. This does not work in development branches. They're based on some stable point but otherwise there's a lot of new code that usually has bugs and it's quite important be able to understand where the bug comes from. And the debugging instrumentation is there to add more sanity checks and canaries to catch overflows, assertions etc. If it's unreliable, then there's no point using it during development, IOW fix all bugs first and then see if there are more left after turning the debugging.
On Thu, 26 Sep 2019, Vlastimil Babka wrote: > > - It will only work for special cases like the kmalloc array > > without extras like metadata at the end of objects. > > I don't understand what you mean here? The kmalloc caches are special > because they don't have metadata at the end of objects? Others do? Yes. > > - These alignments are only needed in exceptional cases but with the patch > > we will provide the alignment by default even if the allocating subsystem > > does not need it. > > True. This is where we have to make the decision whether to make things > simpler for those that don't realize they need the alignment, and > whether that's worth the cost. We have evidence of those cases, and the > cost is currently zero in the common cases (SLAB, SLUB without debug > runtime-enabled). The cost is zero for a particular layout of the objects in a page using a particular allocator and hardware configuration. However, the layout may be different due to another allocator that prefers to arrange things differently (SLOB puts multiple objects of different types in the same page to save memory), if we need to add data to these objects (debugging info, new metadata about the object, maybe the memcg pointer, maybe other things that may come up), or other innovative approaches (such as putting data of different kmem caches that are commonly used together in the same page to improve locality). The cost is an unnecessary petrification of the data layout of the memory allocators. > > - We have mechanisms to detect alignment problems using debug kernels and > > debug options that have been available for years. These were not used for > > testing in these cases it seems before the patches hit mainline. Once in > > mainly someone ran a debug kernel and found the issue. > > Debugging options are useful if you know there's a bug and you want to > find it. AFAIK the various bots/CIs that do e.g. randconfig, or enable > debug options explicitly, run those kernels in a VM, so I guess that's > why potential breakage due to alignment can lurk in a hw-specific driver. That is not my experience. You need to run debugging to verify that a patch does not cause locking problems, memory corruption etc etc. And upstream code is tested by various people with debugging kernels so they will locate the bugs that others introduce. This is usually not because there was a focus on a particular bug. If you have a hw specific thing that is not generally tested and skip the debugging tests well yes then we have a problem. What I have seen with developers is that they feel the debugging steps are unnecessary for conveniences sake. I have seen build environments that had proper steps for verification with a debug kernel. However, someone disabled them "some months ago" and "nothing happened". Then strange failures in production systems occur.
On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote: > So if anyone thinks this is a good idea, please express it (preferably > in a formal way such as Acked-by), otherwise it seems the patch will be > dropped (due to a private NACK, apparently). I think we absolutely need something like this, and I'm sick and tired of the people just claiming there is no problem. From the user POV I don't care if aligned allocations need a new GFP_ALIGNED flag or not, but as far as I can tell the latter will probably cause more overhead in practice than not having it. So unless someone comes up with a better counter proposal to provide aligned kmalloc of some form that doesn't require a giant amount of boilerplate code in the users: Acked^2-by: Christoph Hellwig <hch@lst.de>
On Mon 23-09-19 18:36:32, Vlastimil Babka wrote: > On 8/26/19 1:16 PM, Vlastimil Babka wrote: > > In most configurations, kmalloc() happens to return naturally aligned (i.e. > > aligned to the block size itself) blocks for power of two sizes. That means > > some kmalloc() users might unknowingly rely on that alignment, until stuff > > breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, > > and blocks stop being aligned. Then developers have to devise workaround such > > as own kmem caches with specified alignment [1], which is not always practical, > > as recently evidenced in [2]. > > > > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' > > variant would not help with code unknowingly relying on the implicit alignment. > > For slab implementations it would either require creating more kmalloc caches, > > or allocate a larger size and only give back part of it. That would be > > wasteful, especially with a generic alignment parameter (in contrast with a > > fixed alignment to size). > > > > Ideally we should provide to mm users what they need without difficult > > workarounds or own reimplementations, so let's make the kmalloc() alignment to > > size explicitly guaranteed for power-of-two sizes under all configurations. > > What this means for the three available allocators? > > > > * SLAB object layout happens to be mostly unchanged by the patch. The > > implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due > > to redzoning, however SLAB disables redzoning for caches with alignment > > larger than unsigned long long. Practically on at least x86 this includes > > kmalloc caches as they use cache line alignment, which is larger than that. > > Still, this patch ensures alignment on all arches and cache sizes. > > > > * SLUB layout is also unchanged unless redzoning is enabled through > > CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With > > this patch, explicit alignment is guaranteed with redzoning as well. This > > will result in more memory being wasted, but that should be acceptable in a > > debugging scenario. > > > > * SLOB has no implicit alignment so this patch adds it explicitly for > > kmalloc(). The potential downside is increased fragmentation. While > > pathological allocation scenarios are certainly possible, in my testing, > > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > > was consumed by slab pages both before and after the patch, with difference > > in the noise. > > > > [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ > > [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ > > [3] https://lwn.net/Articles/787740/ > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > So if anyone thinks this is a good idea, please express it (preferably > in a formal way such as Acked-by), otherwise it seems the patch will be > dropped (due to a private NACK, apparently). Sigh. An existing code to workaround the lack of alignment guarantee just show that this is necessary. And there wasn't any real technical argument against except for a highly theoretical optimizations/new allocator that would be tight by the guarantee. Therefore Acked-by: Michal Hocko <mhocko@suse.com>
On Mon, Sep 30, 2019 at 11:23:34AM +0200, Michal Hocko wrote: > On Mon 23-09-19 18:36:32, Vlastimil Babka wrote: > > On 8/26/19 1:16 PM, Vlastimil Babka wrote: > > > In most configurations, kmalloc() happens to return naturally aligned (i.e. > > > aligned to the block size itself) blocks for power of two sizes. That means > > > some kmalloc() users might unknowingly rely on that alignment, until stuff > > > breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, > > > and blocks stop being aligned. Then developers have to devise workaround such > > > as own kmem caches with specified alignment [1], which is not always practical, > > > as recently evidenced in [2]. > > > > > > The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' > > > variant would not help with code unknowingly relying on the implicit alignment. > > > For slab implementations it would either require creating more kmalloc caches, > > > or allocate a larger size and only give back part of it. That would be > > > wasteful, especially with a generic alignment parameter (in contrast with a > > > fixed alignment to size). > > > > > > Ideally we should provide to mm users what they need without difficult > > > workarounds or own reimplementations, so let's make the kmalloc() alignment to > > > size explicitly guaranteed for power-of-two sizes under all configurations. > > > What this means for the three available allocators? > > > > > > * SLAB object layout happens to be mostly unchanged by the patch. The > > > implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due > > > to redzoning, however SLAB disables redzoning for caches with alignment > > > larger than unsigned long long. Practically on at least x86 this includes > > > kmalloc caches as they use cache line alignment, which is larger than that. > > > Still, this patch ensures alignment on all arches and cache sizes. > > > > > > * SLUB layout is also unchanged unless redzoning is enabled through > > > CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With > > > this patch, explicit alignment is guaranteed with redzoning as well. This > > > will result in more memory being wasted, but that should be acceptable in a > > > debugging scenario. > > > > > > * SLOB has no implicit alignment so this patch adds it explicitly for > > > kmalloc(). The potential downside is increased fragmentation. While > > > pathological allocation scenarios are certainly possible, in my testing, > > > after booting a x86_64 kernel+userspace with virtme, around 16MB memory > > > was consumed by slab pages both before and after the patch, with difference > > > in the noise. > > > > > > [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ > > > [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ > > > [3] https://lwn.net/Articles/787740/ > > > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > > > So if anyone thinks this is a good idea, please express it (preferably > > in a formal way such as Acked-by), otherwise it seems the patch will be > > dropped (due to a private NACK, apparently). > > Sigh. > > An existing code to workaround the lack of alignment guarantee just show > that this is necessary. And there wasn't any real technical argument > against except for a highly theoretical optimizations/new allocator that > would be tight by the guarantee. > > Therefore > Acked-by: Michal Hocko <mhocko@suse.com> Agreed. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
On Sat, Sep 28, 2019 at 01:12:49AM +0000, Christopher Lameter wrote: > However, the layout may be different due to another allocator that prefers > to arrange things differently (SLOB puts multiple objects of different > types in the same page to save memory), if we need to add data to these > objects (debugging info, new metadata about the object, maybe the memcg > pointer, maybe other things that may come up), or other innovative > approaches (such as putting data of different kmem caches that are > commonly used together in the same page to improve locality). If we ever do start putting objects of different sizes that are commonly allocated together in the same page (eg inodes & dentries), then those aren't going to be random kmalloc() allocation; they're going to be special kmem caches that can specify "I don't care about alignment". Also, we haven't done that. We've had a slab allocator for twenty years, and nobody's tried to do that. Maybe the co-allocation would be a net loss (I suspect). Or the gain is too small for the added complexity. Whatever way, this is a strawman. > The cost is an unnecessary petrification of the data layout of the memory > allocators. Yes, it is. And it's a cost I'm willing to pay in order to get the guarantee of alignment.
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 7744aa3bf2e0..27c54854b508 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -98,6 +98,10 @@ limited. The actual limit depends on the hardware and the kernel configuration, but it is a good practice to use `kmalloc` for objects smaller than page size. +The address of a chunk allocated with `kmalloc` is aligned to at least +ARCH_KMALLOC_MINALIGN bytes. For sizes of power of two bytes, the +alignment is also guaranteed to be at least to the respective size. + For large allocations you can use :c:func:`vmalloc` and :c:func:`vzalloc`, or directly request pages from the page allocator. The memory allocated by `vmalloc` and related functions is diff --git a/include/linux/slab.h b/include/linux/slab.h index 56c9c7eed34e..0d4c26395785 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -493,6 +493,10 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags) * kmalloc is the normal method of allocating memory * for objects smaller than page size in the kernel. * + * The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN + * bytes. For @size of power of two bytes, the alignment is also guaranteed + * to be at least to the size. + * * The @flags argument may be one of the GFP flags defined at * include/linux/gfp.h and described at * :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>` diff --git a/mm/slab_common.c b/mm/slab_common.c index 929c02a90fba..b9ba93ad5c7f 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -993,10 +993,19 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, unsigned int useroffset, unsigned int usersize) { int err; + unsigned int align = ARCH_KMALLOC_MINALIGN; s->name = name; s->size = s->object_size = size; - s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size); + + /* + * For power of two sizes, guarantee natural alignment for kmalloc + * caches, regardless of SL*B debugging options. + */ + if (is_power_of_2(size)) + align = max(align, size); + s->align = calculate_alignment(flags, align, size); + s->useroffset = useroffset; s->usersize = usersize; diff --git a/mm/slob.c b/mm/slob.c index 3dcde9cf2b17..07a39047aa54 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -224,6 +224,7 @@ static void slob_free_pages(void *b, int order) * @sp: Page to look in. * @size: Size of the allocation. * @align: Allocation alignment. + * @align_offset: Offset in the allocated block that will be aligned. * @page_removed_from_list: Return parameter. * * Tries to find a chunk of memory at least @size bytes big within @page. @@ -234,7 +235,7 @@ static void slob_free_pages(void *b, int order) * true (set to false otherwise). */ static void *slob_page_alloc(struct page *sp, size_t size, int align, - bool *page_removed_from_list) + int align_offset, bool *page_removed_from_list) { slob_t *prev, *cur, *aligned = NULL; int delta = 0, units = SLOB_UNITS(size); @@ -243,8 +244,17 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align, for (prev = NULL, cur = sp->freelist; ; prev = cur, cur = slob_next(cur)) { slobidx_t avail = slob_units(cur); + /* + * 'aligned' will hold the address of the slob block so that the + * address 'aligned'+'align_offset' is aligned according to the + * 'align' parameter. This is for kmalloc() which prepends the + * allocated block with its size, so that the block itself is + * aligned when needed. + */ if (align) { - aligned = (slob_t *)ALIGN((unsigned long)cur, align); + aligned = (slob_t *) + (ALIGN((unsigned long)cur + align_offset, align) + - align_offset); delta = aligned - cur; } if (avail >= units + delta) { /* room enough? */ @@ -288,7 +298,8 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align, /* * slob_alloc: entry point into the slob allocator. */ -static void *slob_alloc(size_t size, gfp_t gfp, int align, int node) +static void *slob_alloc(size_t size, gfp_t gfp, int align, int node, + int align_offset) { struct page *sp; struct list_head *slob_list; @@ -319,7 +330,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node) if (sp->units < SLOB_UNITS(size)) continue; - b = slob_page_alloc(sp, size, align, &page_removed_from_list); + b = slob_page_alloc(sp, size, align, align_offset, &page_removed_from_list); if (!b) continue; @@ -356,7 +367,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node) INIT_LIST_HEAD(&sp->slab_list); set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE)); set_slob_page_free(sp, slob_list); - b = slob_page_alloc(sp, size, align, &_unused); + b = slob_page_alloc(sp, size, align, align_offset, &_unused); BUG_ON(!b); spin_unlock_irqrestore(&slob_lock, flags); } @@ -458,7 +469,7 @@ static __always_inline void * __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) { unsigned int *m; - int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); + int minalign = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); void *ret; gfp &= gfp_allowed_mask; @@ -466,19 +477,28 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) fs_reclaim_acquire(gfp); fs_reclaim_release(gfp); - if (size < PAGE_SIZE - align) { + if (size < PAGE_SIZE - minalign) { + int align = minalign; + + /* + * For power of two sizes, guarantee natural alignment for + * kmalloc()'d objects. + */ + if (is_power_of_2(size)) + align = max(minalign, (int) size); + if (!size) return ZERO_SIZE_PTR; - m = slob_alloc(size + align, gfp, align, node); + m = slob_alloc(size + minalign, gfp, align, node, minalign); if (!m) return NULL; *m = size; - ret = (void *)m + align; + ret = (void *)m + minalign; trace_kmalloc_node(caller, ret, - size, size + align, gfp, node); + size, size + minalign, gfp, node); } else { unsigned int order = get_order(size); @@ -579,7 +599,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node) fs_reclaim_release(flags); if (c->size < PAGE_SIZE) { - b = slob_alloc(c->size, flags, c->align, node); + b = slob_alloc(c->size, flags, c->align, node, 0); trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size, SLOB_UNITS(c->size) * SLOB_UNIT, flags, node);
In most configurations, kmalloc() happens to return naturally aligned (i.e. aligned to the block size itself) blocks for power of two sizes. That means some kmalloc() users might unknowingly rely on that alignment, until stuff breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then developers have to devise workaround such as own kmem caches with specified alignment [1], which is not always practical, as recently evidenced in [2]. The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' variant would not help with code unknowingly relying on the implicit alignment. For slab implementations it would either require creating more kmalloc caches, or allocate a larger size and only give back part of it. That would be wasteful, especially with a generic alignment parameter (in contrast with a fixed alignment to size). Ideally we should provide to mm users what they need without difficult workarounds or own reimplementations, so let's make the kmalloc() alignment to size explicitly guaranteed for power-of-two sizes under all configurations. What this means for the three available allocators? * SLAB object layout happens to be mostly unchanged by the patch. The implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for caches with alignment larger than unsigned long long. Practically on at least x86 this includes kmalloc caches as they use cache line alignment, which is larger than that. Still, this patch ensures alignment on all arches and cache sizes. * SLUB layout is also unchanged unless redzoning is enabled through CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With this patch, explicit alignment is guaranteed with redzoning as well. This will result in more memory being wasted, but that should be acceptable in a debugging scenario. * SLOB has no implicit alignment so this patch adds it explicitly for kmalloc(). The potential downside is increased fragmentation. While pathological allocation scenarios are certainly possible, in my testing, after booting a x86_64 kernel+userspace with virtme, around 16MB memory was consumed by slab pages both before and after the patch, with difference in the noise. [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ [3] https://lwn.net/Articles/787740/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- Documentation/core-api/memory-allocation.rst | 4 ++ include/linux/slab.h | 4 ++ mm/slab_common.c | 11 ++++- mm/slob.c | 42 +++++++++++++++----- 4 files changed, 49 insertions(+), 12 deletions(-)