diff mbox

mm/dmapool: localize page allocations

Message ID 1526578581-7658-1-git-send-email-okaya@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Sinan Kaya May 17, 2018, 5:36 p.m. UTC
Try to keep the pool closer to the device's NUMA node by changing kmalloc()
to kmalloc_node() and devres_alloc() to devres_alloc_node().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 mm/dmapool.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox May 17, 2018, 6:18 p.m. UTC | #1
On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote:
> Try to keep the pool closer to the device's NUMA node by changing kmalloc()
> to kmalloc_node() and devres_alloc() to devres_alloc_node().

Have you measured any performance gains by doing this?  The thing is that
these allocations are for the metadata about the page, and the page is
going to be used by CPUs in every node.  So it's not clear to me that
allocating it on the node nearest to the device is going to be any sort
of a win.

> @@ -504,7 +504,8 @@ struct dma_pool *dmam_pool_create(const char *name, struct device *dev,
>  {
>  	struct dma_pool **ptr, *pool;
>  
> -	ptr = devres_alloc(dmam_pool_release, sizeof(*ptr), GFP_KERNEL);
> +	ptr = devres_alloc_node(dmam_pool_release, sizeof(*ptr), GFP_KERNEL,
> +				dev_to_node(dev));
>  	if (!ptr)
>  		return NULL;

... are we really calling devres_alloc() for sizeof(void *)?  That's sad.
Sinan Kaya May 17, 2018, 7:37 p.m. UTC | #2
On 5/17/2018 2:18 PM, Matthew Wilcox wrote:
> On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote:
>> Try to keep the pool closer to the device's NUMA node by changing kmalloc()
>> to kmalloc_node() and devres_alloc() to devres_alloc_node().
> Have you measured any performance gains by doing this?  The thing is that
> these allocations are for the metadata about the page, and the page is
> going to be used by CPUs in every node.  So it's not clear to me that
> allocating it on the node nearest to the device is going to be any sort
> of a win.
> 

It is true that this is metadata but it is one of the things that is most
frequently used in spite of its small size.

I don't think it makes any sense to cross a chip boundary for accessing a
pointer location on every single pool allocation. 

Remember that the CPU core that is running this driver is most probably on
the same NUMA node as the device itself.

Also, if it was a one time init kind of thing, I'd say "yeah, leave it alone". 
DMA pool is used by a wide range of drivers and it is used to allocate
fixed size buffers at runtime. 

Performance impact changes depending on the driver in use. This particular
code is in use by network adapters as well as the NVMe driver. It does
have a wide range of impact.
Matthew Wilcox May 17, 2018, 7:46 p.m. UTC | #3
On Thu, May 17, 2018 at 03:37:21PM -0400, Sinan Kaya wrote:
> On 5/17/2018 2:18 PM, Matthew Wilcox wrote:
> > On Thu, May 17, 2018 at 01:36:19PM -0400, Sinan Kaya wrote:
> >> Try to keep the pool closer to the device's NUMA node by changing kmalloc()
> >> to kmalloc_node() and devres_alloc() to devres_alloc_node().
> > Have you measured any performance gains by doing this?  The thing is that
> > these allocations are for the metadata about the page, and the page is
> > going to be used by CPUs in every node.  So it's not clear to me that
> > allocating it on the node nearest to the device is going to be any sort
> > of a win.
> > 
> 
> It is true that this is metadata but it is one of the things that is most
> frequently used in spite of its small size.
> 
> I don't think it makes any sense to cross a chip boundary for accessing a
> pointer location on every single pool allocation. 
> 
> Remember that the CPU core that is running this driver is most probably on
> the same NUMA node as the device itself.

Umm ... says who?  If my process is running on NUMA node 5 and I submit
an I/O, it should be allocating from a pool on node 5, not from a pool
on whichever node the device is attached to.

If it actually makes a performance difference, then NVMe should allocate
one pool per queue, rather than one pool per device like it currently
does.

> Also, if it was a one time init kind of thing, I'd say "yeah, leave it alone". 
> DMA pool is used by a wide range of drivers and it is used to allocate
> fixed size buffers at runtime. 

 * DMA Pool allocator
 *
 * Copyright 2001 David Brownell
 * Copyright 2007 Intel Corporation
 *   Author: Matthew Wilcox <willy@linux.intel.com>

I know what it's used for.
Sinan Kaya May 17, 2018, 8:05 p.m. UTC | #4
On 5/17/2018 3:46 PM, Matthew Wilcox wrote:
>> Remember that the CPU core that is running this driver is most probably on
>> the same NUMA node as the device itself.
> Umm ... says who?  If my process is running on NUMA node 5 and I submit
> an I/O, it should be allocating from a pool on node 5, not from a pool
> on whichever node the device is attached to.

OK, let's do an exercise. Maybe, I'm missing something in the big picture.

If a user process is running at node 5, it submits some work to the hardware
via block layer that is eventually invoked by syscall. 

Whatever buffer process is using, it gets copied into the kernel space as
it is crossing a userspace/kernel space boundary.

Block layer packages a block request with the kernel pointers and makes a
request to the NVMe driver for consumption.

Last time I checked, dma_alloc_coherent() API uses the locality information
from the device not from the CPU for allocation.

While the metadata for dma_pool is pointing to the currently running CPU core,
the DMA buffer itself is created using the device node itself today without
my patch.

I would think that you actually want to run the process at the same NUMA node
as the CPU and device itself for performance reasons. Otherwise, performance
expectations should be low. 

Even if user says please keep my process to a particular NUMA node,
we keep pointing to the memory on the other node today. 

I don't know what is so special about memory on the default node. IMO, all memory
allocations used by a driver need to follow the device. 

I wish I could do this in kmalloc(). devm_kmalloc() follows the device as another
example not CPU.

With these assumptions, even though user said please use the NUMA node from the
device, we still keep pointing to the default domain for pointers.

Isn't this wrong?

> 
> If it actually makes a performance difference, then NVMe should allocate
> one pool per queue, rather than one pool per device like it currently
> does.
> 
>> Also, if it was a one time init kind of thing, I'd say "yeah, leave it alone". 
>> DMA pool is used by a wide range of drivers and it is used to allocate
>> fixed size buffers at runtime. 
>  * DMA Pool allocator
>  *
>  * Copyright 2001 David Brownell
>  * Copyright 2007 Intel Corporation
>  *   Author: Matthew Wilcox <willy@linux.intel.com>
> 
> I know what it's used for.
> 

cool, good to know.
Matthew Wilcox May 17, 2018, 8:41 p.m. UTC | #5
On Thu, May 17, 2018 at 04:05:45PM -0400, Sinan Kaya wrote:
> On 5/17/2018 3:46 PM, Matthew Wilcox wrote:
> >> Remember that the CPU core that is running this driver is most probably on
> >> the same NUMA node as the device itself.
> > Umm ... says who?  If my process is running on NUMA node 5 and I submit
> > an I/O, it should be allocating from a pool on node 5, not from a pool
> > on whichever node the device is attached to.
> 
> OK, let's do an exercise. Maybe, I'm missing something in the big picture.

Sure.

> If a user process is running at node 5, it submits some work to the hardware
> via block layer that is eventually invoked by syscall. 
> 
> Whatever buffer process is using, it gets copied into the kernel space as
> it is crossing a userspace/kernel space boundary.
> 
> Block layer packages a block request with the kernel pointers and makes a
> request to the NVMe driver for consumption.
> 
> Last time I checked, dma_alloc_coherent() API uses the locality information
> from the device not from the CPU for allocation.

Yes, it does.  I wonder why that is; it doesn't actually make any sense.
It'd be far more sensible to allocate it on memory local to the user
than memory local to the device.

> While the metadata for dma_pool is pointing to the currently running CPU core,
> the DMA buffer itself is created using the device node itself today without
> my patch.

Umm ... dma_alloc_coherent memory is for metadata about the transfer, not
for the memory used for the transaction.

> I would think that you actually want to run the process at the same NUMA node
> as the CPU and device itself for performance reasons. Otherwise, performance
> expectations should be low. 

That's foolish.  Consider a database appliance with four sockets, each
with its own memory and I/O devices attached.  You can't tell the user
to shard the database into four pieces and have each socket only work on
the quarter of the database that's available to each socket.  They may
as well buy four smaller machines.  The point of buying a large NUMA
machine is to use all of it.

Let's try a different example.  I have a four-socket system with one
NVMe device with lots of hardware queues.  Each CPU has its own queue
assigned to it.  If I allocate all the PRP metadata on the socket with
the NVMe device attached to it, I'm sending a lot of coherency traffic
in the direction of that socket, in addition to the actual data.  If the
PRP lists are allocated randomly on the various sockets, the traffic
is heading all over the fabric.  If the PRP lists are allocated on the
local socket, the only time those lists move off this node is when the
device requests them.
Sinan Kaya May 17, 2018, 9:05 p.m. UTC | #6
On 5/17/2018 4:41 PM, Matthew Wilcox wrote:
> Let's try a different example.  I have a four-socket system with one
> NVMe device with lots of hardware queues.  Each CPU has its own queue
> assigned to it.  If I allocate all the PRP metadata on the socket with
> the NVMe device attached to it, I'm sending a lot of coherency traffic
> in the direction of that socket, in addition to the actual data.  If the
> PRP lists are allocated randomly on the various sockets, the traffic
> is heading all over the fabric.  If the PRP lists are allocated on the
> local socket, the only time those lists move off this node is when the
> device requests them.

So.., your reasoning is that you actually want to keep the memory as close
as possible to the CPU rather than the device itself. CPU would do
frequent updates the buffer until the point where it hands off the buffer
to the hardware. Device would fetch the memory via coherency when it needs
to consume the data but this would be a one time penalty.

It sounds logical to me. I was always told that you want to keep buffers
as close as possible to the device.

Maybe, it makes sense for things that device needs frequent access like
receive buffers.

If the majority user is CPU, then the buffer needs to be kept closer to
the CPU. 

dma_alloc_coherent() is generally used for receiver buffer allocation in
network adapters in general. People allocate a chunk and then create a
queue that hardware owns for dumping events and data.

Since DMA pool is a generic API, we should maybe request where we want
to keep the buffers closer to and allocate buffers from the appropriate
NUMA node based on that.
diff mbox

Patch

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 4d90a64..023f3d9 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -223,7 +223,7 @@  static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 {
 	struct dma_page *page;
 
-	page = kmalloc(sizeof(*page), mem_flags);
+	page = kmalloc_node(sizeof(*page), mem_flags, dev_to_node(pool->dev));
 	if (!page)
 		return NULL;
 	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
@@ -504,7 +504,8 @@  struct dma_pool *dmam_pool_create(const char *name, struct device *dev,
 {
 	struct dma_pool **ptr, *pool;
 
-	ptr = devres_alloc(dmam_pool_release, sizeof(*ptr), GFP_KERNEL);
+	ptr = devres_alloc_node(dmam_pool_release, sizeof(*ptr), GFP_KERNEL,
+				dev_to_node(dev));
 	if (!ptr)
 		return NULL;