mbox series

[v7,0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

Message ID 20200821113355.6140-1-song.bao.hua@hisilicon.com (mailing list archive)
Headers show
Series make dma_alloc_coherent NUMA-aware by per-NUMA CMA | expand

Message

Song Bao Hua (Barry Song) Aug. 21, 2020, 11:33 a.m. UTC
Ganapatrao Kulkarni has put some effort on making arm-smmu-v3 use local
memory to save command queues[1]. I also did similar job in patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2] while not realizing Ganapatrao has done that before.

But it seems it is much better to make dma_alloc_coherent() to be
inherently NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3 etc.
Saving queues and tables remotely will increase the latency of ARM SMMU
significantly. For example, when SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the command CMD_SYNC.
However, if we save them locally, we only need to wait for 240ns.

with per-numa CMA, smmu will get memory from local numa node to save command
queues and page tables. that means dma_unmap latency will be shrunk much.

Meanwhile, when iommu.passthrough is on, device drivers which call dma_
alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html


-v7:
 * add Will's acked-by for the change in arch/arm64
 * some cleanup with respect to Will's comments
 * add patch 3/3 to remove the hardcode of defining the size of cma name.
   this patch requires some header file change in include/linux

-v6:
 * rebase on top of 5.9-rc1
 * doc cleanup

-v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

-v4:
 * rebase on top of Christoph Hellwig's patch:
 [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
 https://lore.kernel.org/linux-iommu/20200723120133.94105-1-hch@lst.de/
 * cleanup according to Christoph's comment
 * rebase on top of linux-next to avoid arch/arm64 conflicts
 * reserve cma by checking N_MEMORY rather than N_ONLINE

-v3:
  * move to use page_to_nid() while freeing cma with respect to Robin's
  comment, but this will only work after applying my below patch:
  "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
  https://marc.info/?l=linux-mm&m=159333034726647&w=2

  * handle the case count <= 1 more properly according to Robin's
  comment;

  * add pernuma_cma parameter to support dynamic setting of per-numa
  cma size;
  ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
  "cma=" kernel parameter and avoid a new paramter separately for per-
  numa cma. Practically, it is really too complicated considering the
  below problems:
  (1) if we leverage the size of default numa for per-numa, we have to
  avoid creating two cma with same size in node0 since default cma is
  probably on node0.
  (2) default cma can consider the address limitation for old devices
  while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
  allocations with limitation flags will fallback to default one.
  (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
  decide if the percentage should apply to the whole memory size
  or only apply to the memory size of a specific numa node.
  (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
  makes things even more complicated to per-numa cma.

  I haven't figured out a good way to leverage the size of default cma
  for per-numa cma. it seems a separate parameter for per-numa could
  make life easier.

  * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
  reuse the comment before hugetlb_cma_reserve() with respect to
  Robin's comment

-v2: 
  * fix some issues reported by kernel test robot
  * fallback to default cma while allocation fails in per-numa cma
     free memory properly

Barry Song (3):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers
  mm: cma: use CMA_MAX_NAME to define the length of cma name array

 .../admin-guide/kernel-parameters.txt         |  11 ++
 arch/arm64/mm/init.c                          |   2 +
 include/linux/cma.h                           |   2 +
 include/linux/dma-contiguous.h                |   6 ++
 kernel/dma/Kconfig                            |  11 ++
 kernel/dma/contiguous.c                       | 100 ++++++++++++++++--
 mm/cma.h                                      |   2 -
 mm/hugetlb.c                                  |   4 +-
 8 files changed, 124 insertions(+), 14 deletions(-)

Comments

Mike Kravetz Aug. 21, 2020, 5:52 p.m. UTC | #1
Hi Barry,
Sorry for jumping in so late.

On 8/21/20 4:33 AM, Barry Song wrote:
> 
> with per-numa CMA, smmu will get memory from local numa node to save command
> queues and page tables. that means dma_unmap latency will be shrunk much.

Since per-node CMA areas for hugetlb was introduced, I have been thinking
about the limited number of CMA areas.  In most configurations, I believe
it is limited to 7.  And, IIRC it is not something that can be changed at
runtime, you need to reconfig and rebuild to increase the number.  In contrast
some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
the limited hugetlb use case.  However, this series is adding another user
of per-node CMA areas.

With more users, should try to sync up number of CMA areas and number of
nodes?  Or, perhaps I am worrying about nothing?
Song Bao Hua (Barry Song) Aug. 21, 2020, 7:29 p.m. UTC | #2
> -----Original Message-----
> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
> Sent: Saturday, August 22, 2020 5:53 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de;
> m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org;
> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com;
> akpm@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
> huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Hi Barry,
> Sorry for jumping in so late.
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> >
> > with per-numa CMA, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> 
> Since per-node CMA areas for hugetlb was introduced, I have been thinking
> about the limited number of CMA areas.  In most configurations, I believe
> it is limited to 7.  And, IIRC it is not something that can be changed at
> runtime, you need to reconfig and rebuild to increase the number.  In contrast
> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> the limited hugetlb use case.  However, this series is adding another user
> of per-node CMA areas.
> 
> With more users, should try to sync up number of CMA areas and number of
> nodes?  Or, perhaps I am worrying about nothing?

Hi Mike,
The current limitation is 8. If the server has 4 nodes and we enable both pernuma
CMA and hugetlb, the last node will fail to get one cma area as the default
global cma area will take 1 of 8. So users need to change menuconfig.
If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one node
will fail to get cma.

We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb enabled) +
MAX_NODES(if pernuma cma enabled) if we don't expect users to change config, but
right now hugetlb has not an option in Kconfig to enable or disable like pernuma cma
has DMA_PERNUMA_CMA.

> --
> Mike Kravetz

Thanks
Barry
Song Bao Hua (Barry Song) Aug. 21, 2020, 8:47 p.m. UTC | #3
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Saturday, August 22, 2020 7:27 AM
> To: 'Mike Kravetz' <mike.kravetz@oracle.com>; hch@lst.de;
> m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org;
> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com;
> akpm@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
> huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> 
> 
> > -----Original Message-----
> > From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
> > Sent: Saturday, August 22, 2020 5:53 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de;
> > m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org;
> > ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com;
> > akpm@linux-foundation.org
> > Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> > linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
> > huangdaode <huangdaode@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> > per-NUMA CMA
> >
> > Hi Barry,
> > Sorry for jumping in so late.
> >
> > On 8/21/20 4:33 AM, Barry Song wrote:
> > >
> > > with per-numa CMA, smmu will get memory from local numa node to save
> > command
> > > queues and page tables. that means dma_unmap latency will be shrunk
> > much.
> >
> > Since per-node CMA areas for hugetlb was introduced, I have been thinking
> > about the limited number of CMA areas.  In most configurations, I believe
> > it is limited to 7.  And, IIRC it is not something that can be changed at
> > runtime, you need to reconfig and rebuild to increase the number.  In
> contrast
> > some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> > the limited hugetlb use case.  However, this series is adding another user
> > of per-node CMA areas.
> >
> > With more users, should try to sync up number of CMA areas and number of
> > nodes?  Or, perhaps I am worrying about nothing?
> 
> Hi Mike,
> The current limitation is 8. If the server has 4 nodes and we enable both
> pernuma
> CMA and hugetlb, the last node will fail to get one cma area as the default
> global cma area will take 1 of 8. So users need to change menuconfig.
> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
> node
> will fail to get cma.
> 
> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
> enabled) +
> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
> config, but
> right now hugetlb has not an option in Kconfig to enable or disable like
> pernuma cma
> has DMA_PERNUMA_CMA.

I would prefer we make some changes like:

config CMA_AREAS
	int "Maximum count of the CMA areas"
	depends on CMA
+	default 19 if NUMA
	default 7
	help
	  CMA allows to create CMA areas for particular purpose, mainly,
	  used as device private area. This parameter sets the maximum
	  number of CMA area in the system.

-	  If unsure, leave the default value "7".
+	  If unsure, leave the default value "7" or "19" if NUMA is used.

1+ CONFIG_CMA_AREAS should be quite enough for almost all servers in the markets.

If 2 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*2 + 1 = 5
If 4 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*4 + 1 = 9    -> default ARM64 config.
If 8 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*8 + 1 = 17

The default value is supporting the most common case and is not going to support those servers
with NODES_SHIFT=10, they can make their own config just like users need to increase CMA_AREAS
if they add many cma areas in device tree in a system even without NUMA.

How do you think, mike?

Thanks
Barry
Mike Kravetz Aug. 21, 2020, 9:10 p.m. UTC | #4
On 8/21/20 1:47 PM, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -----Original Message-----
>> From: Song Bao Hua (Barry Song)
>> Sent: Saturday, August 22, 2020 7:27 AM
>> To: 'Mike Kravetz' <mike.kravetz@oracle.com>; hch@lst.de;
>> m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org;
>> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com;
>> akpm@linux-foundation.org
>> Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
>> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
>> huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>
>> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
>> per-NUMA CMA
>>
>>
>>
>>> -----Original Message-----
>>> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
>>> Sent: Saturday, August 22, 2020 5:53 AM
>>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de;
>>> m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org;
>>> ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com;
>>> akpm@linux-foundation.org
>>> Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
>>> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
>>> huangdaode <huangdaode@huawei.com>; Linuxarm
>> <linuxarm@huawei.com>
>>> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
>>> per-NUMA CMA
>>>
>>> Hi Barry,
>>> Sorry for jumping in so late.
>>>
>>> On 8/21/20 4:33 AM, Barry Song wrote:
>>>>
>>>> with per-numa CMA, smmu will get memory from local numa node to save
>>> command
>>>> queues and page tables. that means dma_unmap latency will be shrunk
>>> much.
>>>
>>> Since per-node CMA areas for hugetlb was introduced, I have been thinking
>>> about the limited number of CMA areas.  In most configurations, I believe
>>> it is limited to 7.  And, IIRC it is not something that can be changed at
>>> runtime, you need to reconfig and rebuild to increase the number.  In
>> contrast
>>> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
>>> the limited hugetlb use case.  However, this series is adding another user
>>> of per-node CMA areas.
>>>
>>> With more users, should try to sync up number of CMA areas and number of
>>> nodes?  Or, perhaps I am worrying about nothing?
>>
>> Hi Mike,
>> The current limitation is 8. If the server has 4 nodes and we enable both
>> pernuma
>> CMA and hugetlb, the last node will fail to get one cma area as the default
>> global cma area will take 1 of 8. So users need to change menuconfig.
>> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
>> node
>> will fail to get cma.
>>
>> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
>> enabled) +
>> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
>> config, but
>> right now hugetlb has not an option in Kconfig to enable or disable like
>> pernuma cma
>> has DMA_PERNUMA_CMA.
> 
> I would prefer we make some changes like:
> 
> config CMA_AREAS
> 	int "Maximum count of the CMA areas"
> 	depends on CMA
> +	default 19 if NUMA
> 	default 7
> 	help
> 	  CMA allows to create CMA areas for particular purpose, mainly,
> 	  used as device private area. This parameter sets the maximum
> 	  number of CMA area in the system.
> 
> -	  If unsure, leave the default value "7".
> +	  If unsure, leave the default value "7" or "19" if NUMA is used.
> 
> 1+ CONFIG_CMA_AREAS should be quite enough for almost all servers in the markets.
> 
> If 2 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*2 + 1 = 5
> If 4 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*4 + 1 = 9    -> default ARM64 config.
> If 8 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*8 + 1 = 17
> 
> The default value is supporting the most common case and is not going to support those servers
> with NODES_SHIFT=10, they can make their own config just like users need to increase CMA_AREAS
> if they add many cma areas in device tree in a system even without NUMA.
> 
> How do you think, mike?

I'm OK with that.  I really did not want to sidetrach this series.  It is
just something I thought about when looking at the hugetlb code.  My 'to do'
list includes looking at a way to make the number of CMA areas dynamic.