Message ID | 20200821113355.6140-2-song.bao.hua@hisilicon.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | make dma_alloc_coherent NUMA-aware by per-NUMA CMA | expand |
On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote: > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get > coherent DMA buffers to save their command queues and page tables. As > there is only one default CMA in the whole system, SMMUs on nodes other > than node0 will get remote memory. This leads to significant latency. > > This patch provides per-numa CMA so that drivers like SMMU can get local > memory. Tests show localizing CMA can decrease dma_unmap latency much. > For instance, before this patch, SMMU on node2 has to wait for more than > 560ns for the completion of CMD_SYNC in an empty command queue; with this > patch, it needs 240ns only. > > A positive side effect of this patch would be improving performance even > further for those users who are worried about performance more than DMA > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all > drivers can get local coherent DMA buffers. > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Cc: Will Deacon <will@kernel.org> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> > Cc: Steve Capper <steve.capper@arm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mike Rapoport <rppt@linux.ibm.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > --- > -v7: with respect to Will's comments > * move to use for_each_online_node > * add description if users don't specify pernuma_cma > * provide default value for CONFIG_DMA_PERNUMA_CMA > > .../admin-guide/kernel-parameters.txt | 11 ++ > include/linux/dma-contiguous.h | 6 ++ > kernel/dma/Kconfig | 11 ++ > kernel/dma/contiguous.c | 100 ++++++++++++++++-- > 4 files changed, 118 insertions(+), 10 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index bdc1f33fd3d1..c609527fc35a 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -599,6 +599,17 @@ > altogether. For more information, see > include/linux/dma-contiguous.h > > + pernuma_cma=nn[MG] Maybe cma_pernuma or cma_pernode? > + [ARM64,KNL] > + Sets the size of kernel per-numa memory area for > + contiguous memory allocations. A value of 0 disables > + per-numa CMA altogether. And If this option is not > + specificed, the default value is 0. > + With per-numa CMA enabled, DMA users on node nid will > + first try to allocate buffer from the pernuma area > + which is located in node nid, if the allocation fails, > + they will fallback to the global default memory area. > + > cmo_free_hint= [PPC] Format: { yes | no } > Specify whether pages are marked as being inactive > when they are freed. This is used in CMO environments > diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h > index 03f8e98e3bcc..fe55e004f1f4 100644 > --- a/include/linux/dma-contiguous.h > +++ b/include/linux/dma-contiguous.h > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, > > #endif > > +#ifdef CONFIG_DMA_PERNUMA_CMA > +void dma_pernuma_cma_reserve(void); > +#else > +static inline void dma_pernuma_cma_reserve(void) { } > +#endif > + > #endif > > #endif > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > index 847a9d1fa634..c38979d45b13 100644 > --- a/kernel/dma/Kconfig > +++ b/kernel/dma/Kconfig > @@ -118,6 +118,17 @@ config DMA_CMA > If unsure, say "n". > > if DMA_CMA > + > +config DMA_PERNUMA_CMA > + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node" > + default NUMA && ARM64 > + help > + Enable this option to get pernuma CMA areas so that devices like > + ARM64 SMMU can get local memory by DMA coherent APIs. > + > + You can set the size of pernuma CMA by specifying "pernuma_cma=size" > + on the kernel's command line. > + > comment "Default contiguous memory area size:" > > config CMA_SIZE_MBYTES > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > index cff7e60968b9..0383c9b86715 100644 > --- a/kernel/dma/contiguous.c > +++ b/kernel/dma/contiguous.c > @@ -69,6 +69,19 @@ static int __init early_cma(char *p) > } > early_param("cma", early_cma); > > +#ifdef CONFIG_DMA_PERNUMA_CMA > + > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; > +static phys_addr_t pernuma_size_bytes __initdata; > + > +static int __init early_pernuma_cma(char *p) > +{ > + pernuma_size_bytes = memparse(p, &p); > + return 0; > +} > +early_param("pernuma_cma", early_pernuma_cma); > +#endif > + > #ifdef CONFIG_CMA_SIZE_PERCENTAGE > > static phys_addr_t __init __maybe_unused cma_early_percent_memory(void) > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void) > > #endif > > +#ifdef CONFIG_DMA_PERNUMA_CMA > +void __init dma_pernuma_cma_reserve(void) > +{ > + int nid; > + > + if (!pernuma_size_bytes) > + return; > + > + for_each_online_node(nid) { > + int ret; > + char name[20]; > + struct cma **cma = &dma_contiguous_pernuma_area[nid]; > + > + snprintf(name, sizeof(name), "pernuma%d", nid); > + ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0, > + 0, false, name, cma, nid); > + if (ret) { > + pr_warn("%s: reservation failed: err %d, node %d", __func__, > + ret, nid); > + continue; > + } > + > + pr_debug("%s: reserved %llu MiB on node %d\n", __func__, > + (unsigned long long)pernuma_size_bytes / SZ_1M, nid); > + } > +} > +#endif > + > /** > * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling > * @limit: End address of the reserved memory (optional, 0 for any). > @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp) > * @size: Requested allocation size. > * @gfp: Allocation flags. > * > - * This function allocates contiguous memory buffer for specified device. It > - * tries to use device specific contiguous memory area if available, or the > - * default global one. > + * tries to use device specific contiguous memory area if available, or it > + * tries to use per-numa cma, if the allocation fails, it will fallback to > + * try default global one. > * > - * Note that it byapss one-page size of allocations from the global area as > - * the addresses within one page are always contiguous, so there is no need > - * to waste CMA pages for that kind; it also helps reduce fragmentations. > + * Note that it bypass one-page size of allocations from the per-numa and > + * global area as the addresses within one page are always contiguous, so > + * there is no need to waste CMA pages for that kind; it also helps reduce > + * fragmentations. > */ > struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) > { > +#ifdef CONFIG_DMA_PERNUMA_CMA > + int nid = dev_to_node(dev); > +#endif > + > /* CMA can be used only in the context which permits sleeping */ > if (!gfpflags_allow_blocking(gfp)) > return NULL; > if (dev->cma_area) > return cma_alloc_aligned(dev->cma_area, size, gfp); > - if (size <= PAGE_SIZE || !dma_contiguous_default_area) > + if (size <= PAGE_SIZE) > + return NULL; > + > +#ifdef CONFIG_DMA_PERNUMA_CMA > + if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) { > + struct cma *cma = dma_contiguous_pernuma_area[nid]; It could be that for some node reservation failedm than dma_contiguous_pernuma_area[nid] would be NULL. I'd add a fallback to another node here. > + struct page *page; > + > + if (cma) { > + page = cma_alloc_aligned(cma, size, gfp); > + if (page) > + return page; > + } > + } > +#endif I think the selection of the area can be put in a helper funciton and then here we just try to allocate from the selected area. E.g. static struct cma* dma_get_cma_area(struct device *dev) { #ifdef CONFIG_DMA_PERNUMA_CMA int nid = dev_to_node(dev); struct cma *cma = dma_contiguous_pernuma_area[nid]; if (!cma) /* select cma from another node */ ; return cma; #else return dma_contiguous_default_area; #endif } struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) { struct cma *cma; ... cma = dma_get_cma_area(dev); if (!cma) return NULL; return cma_alloc_aligned(cma, size, gfp); } > + if (!dma_contiguous_default_area) > return NULL; > + > return cma_alloc_aligned(dma_contiguous_default_area, size, gfp); > } > > @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) > */ > void dma_free_contiguous(struct device *dev, struct page *page, size_t size) > { > - if (!cma_release(dev_get_cma_area(dev), page, > - PAGE_ALIGN(size) >> PAGE_SHIFT)) > - __free_pages(page, get_order(size)); Here as well, dev_get_cma_area() can be replaced with, say dma_get_dev_cma_area(dev, page) that will hide the below logic. > + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; > + > + /* if dev has its own cma, free page from there */ > + if (dev->cma_area) { > + if (cma_release(dev->cma_area, page, count)) > + return; > + } else { > + /* > + * otherwise, page is from either per-numa cma or default cma > + */ > +#ifdef CONFIG_DMA_PERNUMA_CMA > + if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)], > + page, count)) > + return; > +#endif > + if (cma_release(dma_contiguous_default_area, page, count)) > + return; > + } > + > + /* not in any cma, free from buddy */ > + __free_pages(page, get_order(size)); > } > > /* > -- > 2.27.0 > >
On 8/21/20 4:33 AM, Barry Song wrote: > --- > -v7: with respect to Will's comments > * move to use for_each_online_node > * add description if users don't specify pernuma_cma > * provide default value for CONFIG_DMA_PERNUMA_CMA > > .../admin-guide/kernel-parameters.txt | 11 ++ > include/linux/dma-contiguous.h | 6 ++ > kernel/dma/Kconfig | 11 ++ > kernel/dma/contiguous.c | 100 ++++++++++++++++-- > 4 files changed, 118 insertions(+), 10 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index bdc1f33fd3d1..c609527fc35a 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -599,6 +599,17 @@ > altogether. For more information, see > include/linux/dma-contiguous.h > > + pernuma_cma=nn[MG] > + [ARM64,KNL] > + Sets the size of kernel per-numa memory area for > + contiguous memory allocations. A value of 0 disables > + per-numa CMA altogether. And If this option is not > + specificed, the default value is 0. > + With per-numa CMA enabled, DMA users on node nid will > + first try to allocate buffer from the pernuma area > + which is located in node nid, if the allocation fails, > + they will fallback to the global default memory area. > + Entries in kernel-parameters.txt are supposed to be in alphabetical order but this one is not. If you want to keep it near the cma= entry, you can rename it like Mike suggested. Otherwise it needs to be moved. > cmo_free_hint= [PPC] Format: { yes | no } > Specify whether pages are marked as being inactive > when they are freed. This is used in CMO environments
> -----Original Message----- > From: Mike Rapoport [mailto:rppt@linux.ibm.com] > Sent: Saturday, August 22, 2020 2:28 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> > Cc: hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com; > will@kernel.org; ganapatrao.kulkarni@cavium.com; > catalin.marinas@arm.com; akpm@linux-foundation.org; > iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org; > linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>; > huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; Nicolas Saenz Julienne > <nsaenzjulienne@suse.de>; Steve Capper <steve.capper@arm.com> > Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve > per-numa CMA > > On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote: > > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get > > coherent DMA buffers to save their command queues and page tables. As > > there is only one default CMA in the whole system, SMMUs on nodes other > > than node0 will get remote memory. This leads to significant latency. > > > > This patch provides per-numa CMA so that drivers like SMMU can get local > > memory. Tests show localizing CMA can decrease dma_unmap latency much. > > For instance, before this patch, SMMU on node2 has to wait for more than > > 560ns for the completion of CMD_SYNC in an empty command queue; with > this > > patch, it needs 240ns only. > > > > A positive side effect of this patch would be improving performance even > > further for those users who are worried about performance more than DMA > > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all > > drivers can get local coherent DMA buffers. > > > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Robin Murphy <robin.murphy@arm.com> > > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> > > Cc: Steve Capper <steve.capper@arm.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Mike Rapoport <rppt@linux.ibm.com> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > > --- > > -v7: with respect to Will's comments > > * move to use for_each_online_node > > * add description if users don't specify pernuma_cma > > * provide default value for CONFIG_DMA_PERNUMA_CMA > > > > .../admin-guide/kernel-parameters.txt | 11 ++ > > include/linux/dma-contiguous.h | 6 ++ > > kernel/dma/Kconfig | 11 ++ > > kernel/dma/contiguous.c | 100 > ++++++++++++++++-- > > 4 files changed, 118 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..c609527fc35a 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,17 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > Maybe cma_pernuma or cma_pernode? Sounds good. > > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. And If this option is not > > + specificed, the default value is 0. > > + With per-numa CMA enabled, DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. > > + > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. This is used in CMO environments > > diff --git a/include/linux/dma-contiguous.h > b/include/linux/dma-contiguous.h > > index 03f8e98e3bcc..fe55e004f1f4 100644 > > --- a/include/linux/dma-contiguous.h > > +++ b/include/linux/dma-contiguous.h > > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct > device *dev, struct page *page, > > > > #endif > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > +void dma_pernuma_cma_reserve(void); > > +#else > > +static inline void dma_pernuma_cma_reserve(void) { } > > +#endif > > + > > #endif > > > > #endif > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index 847a9d1fa634..c38979d45b13 100644 > > --- a/kernel/dma/Kconfig > > +++ b/kernel/dma/Kconfig > > @@ -118,6 +118,17 @@ config DMA_CMA > > If unsure, say "n". > > > > if DMA_CMA > > + > > +config DMA_PERNUMA_CMA > > + bool "Enable separate DMA Contiguous Memory Area for each NUMA > Node" > > + default NUMA && ARM64 > > + help > > + Enable this option to get pernuma CMA areas so that devices like > > + ARM64 SMMU can get local memory by DMA coherent APIs. > > + > > + You can set the size of pernuma CMA by specifying > "pernuma_cma=size" > > + on the kernel's command line. > > + > > comment "Default contiguous memory area size:" > > > > config CMA_SIZE_MBYTES > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > > index cff7e60968b9..0383c9b86715 100644 > > --- a/kernel/dma/contiguous.c > > +++ b/kernel/dma/contiguous.c > > @@ -69,6 +69,19 @@ static int __init early_cma(char *p) > > } > > early_param("cma", early_cma); > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + > > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; > > +static phys_addr_t pernuma_size_bytes __initdata; > > + > > +static int __init early_pernuma_cma(char *p) > > +{ > > + pernuma_size_bytes = memparse(p, &p); > > + return 0; > > +} > > +early_param("pernuma_cma", early_pernuma_cma); > > +#endif > > + > > #ifdef CONFIG_CMA_SIZE_PERCENTAGE > > > > static phys_addr_t __init __maybe_unused > cma_early_percent_memory(void) > > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t > cma_early_percent_memory(void) > > > > #endif > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > +void __init dma_pernuma_cma_reserve(void) > > +{ > > + int nid; > > + > > + if (!pernuma_size_bytes) > > + return; > > + > > + for_each_online_node(nid) { > > + int ret; > > + char name[20]; > > + struct cma **cma = &dma_contiguous_pernuma_area[nid]; > > + > > + snprintf(name, sizeof(name), "pernuma%d", nid); > > + ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0, > > + 0, false, name, cma, nid); > > + if (ret) { > > + pr_warn("%s: reservation failed: err %d, node %d", __func__, > > + ret, nid); > > + continue; > > + } > > + > > + pr_debug("%s: reserved %llu MiB on node %d\n", __func__, > > + (unsigned long long)pernuma_size_bytes / SZ_1M, nid); > > + } > > +} > > +#endif > > + > > /** > > * dma_contiguous_reserve() - reserve area(s) for contiguous memory > handling > > * @limit: End address of the reserved memory (optional, 0 for any). > > @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma > *cma, size_t size, gfp_t gfp) > > * @size: Requested allocation size. > > * @gfp: Allocation flags. > > * > > - * This function allocates contiguous memory buffer for specified device. It > > - * tries to use device specific contiguous memory area if available, or the > > - * default global one. > > + * tries to use device specific contiguous memory area if available, or it > > + * tries to use per-numa cma, if the allocation fails, it will fallback to > > + * try default global one. > > * > > - * Note that it byapss one-page size of allocations from the global area as > > - * the addresses within one page are always contiguous, so there is no need > > - * to waste CMA pages for that kind; it also helps reduce fragmentations. > > + * Note that it bypass one-page size of allocations from the per-numa and > > + * global area as the addresses within one page are always contiguous, so > > + * there is no need to waste CMA pages for that kind; it also helps reduce > > + * fragmentations. > > */ > > struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t > gfp) > > { > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + int nid = dev_to_node(dev); > > +#endif > > + > > /* CMA can be used only in the context which permits sleeping */ > > if (!gfpflags_allow_blocking(gfp)) > > return NULL; > > if (dev->cma_area) > > return cma_alloc_aligned(dev->cma_area, size, gfp); > > - if (size <= PAGE_SIZE || !dma_contiguous_default_area) > > + if (size <= PAGE_SIZE) > > + return NULL; > > + > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) { > > + struct cma *cma = dma_contiguous_pernuma_area[nid]; > > It could be that for some node reservation failedm than > dma_contiguous_pernuma_area[nid] would be NULL. > I'd add a fallback to another node here. This has been done. If dma_contiguous_pernuma_area[nid] is null, it will fallback to the default global cma. > > > + struct page *page; > > + > > + if (cma) { > > + page = cma_alloc_aligned(cma, size, gfp); > > + if (page) > > + return page; > > + } > > + } > > +#endif > > I think the selection of the area can be put in a helper funciton and > then here we just try to allocate from the selected area. E.g. > > static struct cma* dma_get_cma_area(struct device *dev) > { > #ifdef CONFIG_DMA_PERNUMA_CMA > int nid = dev_to_node(dev); > struct cma *cma = dma_contiguous_pernuma_area[nid]; > > if (!cma) > /* select cma from another node */ ; > > return cma; > #else > return dma_contiguous_default_area; > #endif > } > It is possible dma_contiguous_pernuma_area[nid] is not null, but we fail to get memory from it due to it is either full or has no GFP_DMA(32) support. In this case, we still need to fallback to the default global cma. So the code is trying pernuma_cma, then trying default global cma. It is not picking one from the two areas. It is trying both. > struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) > { > struct cma *cma; > ... > > > cma = dma_get_cma_area(dev); > if (!cma) > return NULL; > > return cma_alloc_aligned(cma, size, gfp); > } > > > + if (!dma_contiguous_default_area) > > return NULL; > > + > > return cma_alloc_aligned(dma_contiguous_default_area, size, gfp); > > } > > > > @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device > *dev, size_t size, gfp_t gfp) > > */ > > void dma_free_contiguous(struct device *dev, struct page *page, size_t > size) > > { > > - if (!cma_release(dev_get_cma_area(dev), page, > > - PAGE_ALIGN(size) >> PAGE_SHIFT)) > > - __free_pages(page, get_order(size)); > > Here as well, dev_get_cma_area() can be replaced with, say > dma_get_dev_cma_area(dev, page) that will hide the below logic. As explained above, this won't work. > > > + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; > > + > > + /* if dev has its own cma, free page from there */ > > + if (dev->cma_area) { > > + if (cma_release(dev->cma_area, page, count)) > > + return; > > + } else { > > + /* > > + * otherwise, page is from either per-numa cma or default cma > > + */ > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)], > > + page, count)) > > + return; > > +#endif > > + if (cma_release(dma_contiguous_default_area, page, count)) > > + return; > > + } > > + > > + /* not in any cma, free from buddy */ > > + __free_pages(page, get_order(size)); > > } > > > > /* > > -- > > 2.27.0 > > > > Thanks Barry
> -----Original Message----- > From: Randy Dunlap [mailto:rdunlap@infradead.org] > Sent: Saturday, August 22, 2020 4:08 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; hch@lst.de; > m.szyprowski@samsung.com; robin.murphy@arm.com; will@kernel.org; > ganapatrao.kulkarni@cavium.com; catalin.marinas@arm.com; > akpm@linux-foundation.org > Cc: iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org; > linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>; > huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; Nicolas Saenz Julienne > <nsaenzjulienne@suse.de>; Steve Capper <steve.capper@arm.com>; Mike > Rapoport <rppt@linux.ibm.com> > Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve > per-numa CMA > > On 8/21/20 4:33 AM, Barry Song wrote: > > --- > > -v7: with respect to Will's comments > > * move to use for_each_online_node > > * add description if users don't specify pernuma_cma > > * provide default value for CONFIG_DMA_PERNUMA_CMA > > > > .../admin-guide/kernel-parameters.txt | 11 ++ > > include/linux/dma-contiguous.h | 6 ++ > > kernel/dma/Kconfig | 11 ++ > > kernel/dma/contiguous.c | 100 > ++++++++++++++++-- > > 4 files changed, 118 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..c609527fc35a 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,17 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. And If this option is not > > + specificed, the default value is 0. > > + With per-numa CMA enabled, DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. > > + > > Entries in kernel-parameters.txt are supposed to be in alphabetical order > but this one is not. If you want to keep it near the cma= entry, you can > rename it like Mike suggested. Otherwise it needs to be moved. As I've replied in Mike's comment, I'd like to rename it to cma_per... > > > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. This is used in CMO environments > > > > -- > ~Randy Thanks Barry
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..c609527fc35a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -599,6 +599,17 @@ altogether. For more information, see include/linux/dma-contiguous.h + pernuma_cma=nn[MG] + [ARM64,KNL] + Sets the size of kernel per-numa memory area for + contiguous memory allocations. A value of 0 disables + per-numa CMA altogether. And If this option is not + specificed, the default value is 0. + With per-numa CMA enabled, DMA users on node nid will + first try to allocate buffer from the pernuma area + which is located in node nid, if the allocation fails, + they will fallback to the global default memory area. + cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h index 03f8e98e3bcc..fe55e004f1f4 100644 --- a/include/linux/dma-contiguous.h +++ b/include/linux/dma-contiguous.h @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void dma_pernuma_cma_reserve(void); +#else +static inline void dma_pernuma_cma_reserve(void) { } +#endif + #endif #endif diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 847a9d1fa634..c38979d45b13 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -118,6 +118,17 @@ config DMA_CMA If unsure, say "n". if DMA_CMA + +config DMA_PERNUMA_CMA + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node" + default NUMA && ARM64 + help + Enable this option to get pernuma CMA areas so that devices like + ARM64 SMMU can get local memory by DMA coherent APIs. + + You can set the size of pernuma CMA by specifying "pernuma_cma=size" + on the kernel's command line. + comment "Default contiguous memory area size:" config CMA_SIZE_MBYTES diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index cff7e60968b9..0383c9b86715 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -69,6 +69,19 @@ static int __init early_cma(char *p) } early_param("cma", early_cma); +#ifdef CONFIG_DMA_PERNUMA_CMA + +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; +static phys_addr_t pernuma_size_bytes __initdata; + +static int __init early_pernuma_cma(char *p) +{ + pernuma_size_bytes = memparse(p, &p); + return 0; +} +early_param("pernuma_cma", early_pernuma_cma); +#endif + #ifdef CONFIG_CMA_SIZE_PERCENTAGE static phys_addr_t __init __maybe_unused cma_early_percent_memory(void) @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void) #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void __init dma_pernuma_cma_reserve(void) +{ + int nid; + + if (!pernuma_size_bytes) + return; + + for_each_online_node(nid) { + int ret; + char name[20]; + struct cma **cma = &dma_contiguous_pernuma_area[nid]; + + snprintf(name, sizeof(name), "pernuma%d", nid); + ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0, + 0, false, name, cma, nid); + if (ret) { + pr_warn("%s: reservation failed: err %d, node %d", __func__, + ret, nid); + continue; + } + + pr_debug("%s: reserved %llu MiB on node %d\n", __func__, + (unsigned long long)pernuma_size_bytes / SZ_1M, nid); + } +} +#endif + /** * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling * @limit: End address of the reserved memory (optional, 0 for any). @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp) * @size: Requested allocation size. * @gfp: Allocation flags. * - * This function allocates contiguous memory buffer for specified device. It - * tries to use device specific contiguous memory area if available, or the - * default global one. + * tries to use device specific contiguous memory area if available, or it + * tries to use per-numa cma, if the allocation fails, it will fallback to + * try default global one. * - * Note that it byapss one-page size of allocations from the global area as - * the addresses within one page are always contiguous, so there is no need - * to waste CMA pages for that kind; it also helps reduce fragmentations. + * Note that it bypass one-page size of allocations from the per-numa and + * global area as the addresses within one page are always contiguous, so + * there is no need to waste CMA pages for that kind; it also helps reduce + * fragmentations. */ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) { +#ifdef CONFIG_DMA_PERNUMA_CMA + int nid = dev_to_node(dev); +#endif + /* CMA can be used only in the context which permits sleeping */ if (!gfpflags_allow_blocking(gfp)) return NULL; if (dev->cma_area) return cma_alloc_aligned(dev->cma_area, size, gfp); - if (size <= PAGE_SIZE || !dma_contiguous_default_area) + if (size <= PAGE_SIZE) + return NULL; + +#ifdef CONFIG_DMA_PERNUMA_CMA + if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) { + struct cma *cma = dma_contiguous_pernuma_area[nid]; + struct page *page; + + if (cma) { + page = cma_alloc_aligned(cma, size, gfp); + if (page) + return page; + } + } +#endif + if (!dma_contiguous_default_area) return NULL; + return cma_alloc_aligned(dma_contiguous_default_area, size, gfp); } @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp) */ void dma_free_contiguous(struct device *dev, struct page *page, size_t size) { - if (!cma_release(dev_get_cma_area(dev), page, - PAGE_ALIGN(size) >> PAGE_SHIFT)) - __free_pages(page, get_order(size)); + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; + + /* if dev has its own cma, free page from there */ + if (dev->cma_area) { + if (cma_release(dev->cma_area, page, count)) + return; + } else { + /* + * otherwise, page is from either per-numa cma or default cma + */ +#ifdef CONFIG_DMA_PERNUMA_CMA + if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)], + page, count)) + return; +#endif + if (cma_release(dma_contiguous_default_area, page, count)) + return; + } + + /* not in any cma, free from buddy */ + __free_pages(page, get_order(size)); } /*
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get coherent DMA buffers to save their command queues and page tables. As there is only one default CMA in the whole system, SMMUs on nodes other than node0 will get remote memory. This leads to significant latency. This patch provides per-numa CMA so that drivers like SMMU can get local memory. Tests show localizing CMA can decrease dma_unmap latency much. For instance, before this patch, SMMU on node2 has to wait for more than 560ns for the completion of CMD_SYNC in an empty command queue; with this patch, it needs 240ns only. A positive side effect of this patch would be improving performance even further for those users who are worried about performance more than DMA security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all drivers can get local coherent DMA buffers. Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Will Deacon <will@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Cc: Steve Capper <steve.capper@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> --- -v7: with respect to Will's comments * move to use for_each_online_node * add description if users don't specify pernuma_cma * provide default value for CONFIG_DMA_PERNUMA_CMA .../admin-guide/kernel-parameters.txt | 11 ++ include/linux/dma-contiguous.h | 6 ++ kernel/dma/Kconfig | 11 ++ kernel/dma/contiguous.c | 100 ++++++++++++++++-- 4 files changed, 118 insertions(+), 10 deletions(-)