mbox series

[v2,0/2] mm: Allow kmalloc() allocations below ARCH_KMALLOC_MINALIGN

Message ID 20221025205247.3264568-1-catalin.marinas@arm.com (mailing list archive)
Headers show
Series mm: Allow kmalloc() allocations below ARCH_KMALLOC_MINALIGN | expand

Message

Catalin Marinas Oct. 25, 2022, 8:52 p.m. UTC
Hi,

I called this 'v2' but it's a different incarnation of the previous
attempt to reduce ARCH_KMALLOC_MINALIGN for arm64 (and please treat it
as an RFC):

https://lore.kernel.org/all/20220405135758.774016-1-catalin.marinas@arm.com/

The long discussion in the thread above seemed to converge towards
bouncing of the DMA buffers that came from small kmalloc() caches.
However, there are a few problems with such approach:

- We still need ARCH_KMALLOC_MINALIGN to match the smaller kmalloc()
  size but this breaks crypto without changing the structure alignments
  to the larger ARCH_DMA_MINALIGN (not agreed upon yet).

- Not all systems, especially mobile devices, have a swiotlb buffer
  (most boot with 'noswiotlb'). Even of we allow a small bounce buffer,
  it's hard to come up with a size that would please the mobile vendors.

- An alternative to the default bounce buffer is to use another
  kmalloc() for the bouncing, though it adds an overhead, potential
  allocation failures and DMA unmapping cannot be easily tracked back to
  the bounce kmalloc'ed buffer (no simple is_swiotlb_buffer()).

- IOMMU needs bouncing as well when cache maintenance is required. This
  seems to get pretty complicated in the presence of scatterlists
  (though probably doable with enough dedication but it would add an
  overhead with the additional scatterlist scanning).

- Custom dma_ops - not too many but they'd also need the bouncing logic.

- Undecided on whether we should only bounce small sizes or any
  unaligned size. The latter would end up bouncing large-ish buffers
  (network) but it has the advantage that we don't need to change the
  crypto code if the ARCH_KMALLOC_MINALIGN becomes 8 (well, it will
  bounce all the DMA until the code is changed).

So, with this patchset, I've given up on trying to sort out the bouncing
and went for an explicit opt-in to smaller kmalloc() sizes via a newly
introduced __GFP_PACKED flag which would return objects aligned to
KMALLOC_PACKED_ALIGN (8 bytes with slub, 32 with slab). The second patch
sprinkles the kernel with several gfp flag updates where some hot
allocations seem to be on my hardware (the commit log has instructions
on how to identify them).

On a Cavium TX2, after boot, I get:

kmalloc-128        24608  24608    128   32    1 : tunables    0    0    0 : slabdata    769    769      0
kmalloc-96          9408   9660     96   42    1 : tunables    0    0    0 : slabdata    230    230      0
kmalloc-64          9594   9728     64   64    1 : tunables    0    0    0 : slabdata    152    152      0
kmalloc-32         14052  14464     32  128    1 : tunables    0    0    0 : slabdata    113    113      0
kmalloc-16         26368  26368     16  256    1 : tunables    0    0    0 : slabdata    103    103      0
kmalloc-8          36352  36352      8  512    1 : tunables    0    0    0 : slabdata     71     71      0

It's still not ideal and I suspect there may be some early allocations
even prior to ftrace identifying them but there is a significant
improvement without breaking any of the existing cases.

To me, the ideal approach would be __dma annotations on pointers aimed
at DMA and using kernel tools like sparse to identify them. A
dma_kmalloc() would return such pointers. However, things get muddier
with scatterlists since functions like sg_set_page() don't have any such
pointer information (can we mark the offset as __dma instead?).

Comments/suggestions are welcome but, as per above, I'd like to stay
away from bouncing if possible.

Thanks.

Catalin Marinas (2):
  mm: slab: Introduce __GFP_PACKED for smaller kmalloc() alignments
  treewide: Add the __GFP_PACKED flag to several non-DMA kmalloc()
    allocations

 drivers/usb/core/message.c           |  3 ++-
 fs/binfmt_elf.c                      |  6 ++++--
 fs/dcache.c                          |  3 ++-
 fs/ext4/dir.c                        |  4 ++--
 fs/ext4/extents.c                    |  4 ++--
 fs/file.c                            |  2 +-
 fs/kernfs/file.c                     |  8 ++++----
 fs/nfs/dir.c                         |  7 ++++---
 fs/nfs/inode.c                       |  2 +-
 fs/nfs/nfs4state.c                   |  2 +-
 fs/nfs/write.c                       |  3 ++-
 fs/notify/inotify/inotify_fsnotify.c |  3 ++-
 fs/proc/self.c                       |  2 +-
 fs/seq_file.c                        |  5 +++--
 include/acpi/platform/aclinuxex.h    |  6 ++++--
 include/linux/gfp_types.h            | 10 ++++++++--
 include/linux/slab.h                 | 22 ++++++++++++++++++----
 kernel/trace/ftrace.c                |  2 +-
 kernel/trace/tracing_map.c           |  2 +-
 lib/kasprintf.c                      |  2 +-
 lib/kobject.c                        |  4 ++--
 lib/mpi/mpiutil.c                    |  2 +-
 mm/list_lru.c                        |  6 ++++--
 mm/memcontrol.c                      |  4 ++--
 mm/slab_common.c                     |  3 ++-
 mm/util.c                            |  6 +++---
 mm/vmalloc.c                         |  6 ++++--
 net/sunrpc/auth_unix.c               |  2 +-
 net/sunrpc/xdr.c                     |  2 +-
 security/apparmor/lsm.c              |  2 +-
 security/security.c                  |  4 ++--
 security/tomoyo/realpath.c           |  2 +-
 32 files changed, 88 insertions(+), 53 deletions(-)

Comments

Greg KH Oct. 26, 2022, 6:54 a.m. UTC | #1
On Tue, Oct 25, 2022 at 09:52:45PM +0100, Catalin Marinas wrote:
> To me, the ideal approach would be __dma annotations on pointers aimed
> at DMA and using kernel tools like sparse to identify them.

As a reviewer of drivers, and a subsystem maintainer, yes, I too would
like this instead.  It would make it much more obvious what is
happening.

> dma_kmalloc() would return such pointers. However, things get muddier
> with scatterlists since functions like sg_set_page() don't have any such
> pointer information (can we mark the offset as __dma instead?).

Drivers don't always call dma_kmalloc() for pointers to pass to dma
controllers.  Heck, I doubt the majority do that at all today, that's
the main problem overall that you are having.

We can take the time and move the kernel to do this, perhaps that is the
real solution that will work best over time here?

thanks,

greg k-h