Message ID | 1652778276-2986-6-git-send-email-longli@linuxonhyperv.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Jason Gunthorpe |
Series | Introduce Microsoft Azure Network Adapter (MANA) RDMA driver |
On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
>
> The system chooses the default 64K page size if the device does not specify
> the max page size the device can handle for DMA. This does not work well
> when the device is registering a large chunk of memory, in that a large
> page size is more efficient.
>
> Set it to the maximum hardware supported page size.

For RDMA devices this should be set to the largest segment size an ib_sge
can take in when posting work. It should not be the page size of the MR.
2M is a weird number for that, are you sure it is right?

Jason
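For context on the 64K default the commit message refers to: if a driver
never calls dma_set_max_seg_size(), the DMA layer falls back to a 64K
per-segment limit. A paraphrased sketch of dma_get_max_seg_size() from
include/linux/dma-mapping.h (kernels of this era; check your tree for the
exact form):

	static inline unsigned int dma_get_max_seg_size(struct device *dev)
	{
		/* Use the driver-provided limit if one was set... */
		if (dev->dma_parms && dev->dma_parms->max_segment_size)
			return dev->dma_parms->max_segment_size;
		/* ...otherwise fall back to the 64K default. */
		return SZ_64K;
	}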
> Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
>
> On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com wrote:
> > From: Long Li <longli@microsoft.com>
> > [...]
> > Set it to the maximum hardware supported page size.
>
> For RDMA devices this should be set to the largest segment size an ib_sge
> can take in when posting work. It should not be the page size of the MR.
> 2M is a weird number for that, are you sure it is right?

Yes, this is the maximum page size used in hardware page tables.

Long

> Jason
On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > For RDMA devices this should be set to the largest segment size an
> > ib_sge can take in when posting work. It should not be the page size
> > of the MR. 2M is a weird number for that, are you sure it is right?
>
> Yes, this is the maximum page size used in hardware page tables.

As I said, it should be the size of the sge in the WQE, not the "hardware
page tables".

Jason
> Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
>
> On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > Yes, this is the maximum page size used in hardware page tables.
>
> As I said, it should be the size of the sge in the WQE, not the "hardware
> page tables"

This driver uses the following code to figure out the largest page size for
memory registration with hardware:

	page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);

In this function, mr->umem is created with ib_dma_max_seg_size() as its max
segment size when creating its sgtable.

The purpose of setting the DMA page size to 2M is to make sure this function
returns the largest possible MR page size that the hardware can take.
Otherwise, this function will return 64K: the default DMA size.

Long

> Jason
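For readers following the thread: ib_umem_find_best_pgsz() returns the
largest page size from the supplied bitmap that can describe the umem's DMA
layout, and rdma_umem_for_each_dma_block() then walks the umem in blocks of
that size. A hypothetical sketch of such a registration path (write_pte()
is an illustrative stand-in, not the actual mana_ib code):

	unsigned long page_sz;
	struct ib_block_iter biter;

	/* Largest HW-supported page size that can describe this umem. */
	page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
	if (!page_sz)
		return -EINVAL;	/* no supported page size fits this layout */

	/* Walk the umem in page_sz-aligned DMA blocks and program each
	 * block's address into the HW translation table.
	 */
	rdma_umem_for_each_dma_block(mr->umem, &biter, page_sz)
		write_pte(mr, rdma_block_iter_dma_address(&biter));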
On Tue, May 17, 2022 at 08:04:58PM +0000, Long Li wrote:
> This driver uses the following code to figure out the largest page size
> for memory registration with hardware:
>
> 	page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
>
> In this function, mr->umem is created with ib_dma_max_seg_size() as its
> max segment size when creating its sgtable.
>
> The purpose of setting the DMA page size to 2M is to make sure this
> function returns the largest possible MR page size that the hardware can
> take. Otherwise, this function will return 64K: the default DMA size.

As I've already said, you are supposed to set the value that limits the
ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz.

It is usually 2G because the ib_sge's typically work on a 32 bit length.

Jason
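The ib_sge Jason refers to is the scatter/gather element posted in work
requests, defined in include/rdma/ib_verbs.h; its length field is 32 bits,
which is why the natural ceiling for dma_set_max_seg_size() is on the order
of 2G/4G rather than a hardware page size:

	struct ib_sge {
		u64	addr;	/* DMA address of the buffer */
		u32	length;	/* segment length: a 32-bit field */
		u32	lkey;	/* local protection key */
	};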
Thanks Long.

Hello Jason,

I am the author of the patch. To your comment below:

"As I've already said, you are supposed to set the value that limits the
ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is
usually 2G because the ib_sge's typically work on a 32 bit length."

The ib_sge is limited by __sg_alloc_table_from_pages(), which uses
ib_dma_max_seg_size(), which is what is set by the eth driver using
dma_set_max_seg_size(). Currently our hw does not support PTEs larger
than 2M.

So ib_umem_find_best_pgsz() takes PAGE_SZ_BM as an input. The bitmap has
all the bits set for the page sizes supported by the HW:

#define PAGE_SZ_BM (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K \
		    | SZ_256K | SZ_512K | SZ_1M | SZ_2M)

Are you suggesting we are too restrictive in the bitmap we are passing, or
that we should not set this bitmap and let the function choose the default?

Regards,
Ajay
On Wed, May 18, 2022 at 05:59:00AM +0000, Ajay Sharma wrote:
> The ib_sge is limited by __sg_alloc_table_from_pages(), which uses
> ib_dma_max_seg_size(), which is what is set by the eth driver using
> dma_set_max_seg_size(). Currently our hw does not support PTEs larger
> than 2M.

*sigh* again it has nothing to do with *PTEs* in the HW.

Jason
Sorry, I am not able to follow. Below is the reference efa driver
implementation:

static int efa_device_init(struct efa_com_dev *edev, struct pci_dev *pdev)
{
	int dma_width;
	int err;

	err = efa_com_dev_reset(edev, EFA_REGS_RESET_NORMAL);
	if (err)
		return err;

	err = efa_com_validate_version(edev);
	if (err)
		return err;

	dma_width = efa_com_get_dma_width(edev);
	if (dma_width < 0) {
		err = dma_width;
		return err;
	}

	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(dma_width));
	if (err) {
		dev_err(&pdev->dev, "dma_set_mask_and_coherent failed %d\n", err);
		return err;
	}

	dma_set_max_seg_size(&pdev->dev, UINT_MAX);

	return 0;
}

static int efa_register_mr(struct ib_pd *ibpd, struct efa_mr *mr, u64 start,
			   u64 length, u64 virt_addr, int access_flags)
{
	struct efa_dev *dev = to_edev(ibpd->device);
	struct efa_com_reg_mr_params params = {};
	struct efa_com_reg_mr_result result = {};
	struct pbl_context pbl;
	unsigned int pg_sz;
	int inline_size;
	int err;

	params.pd = to_epd(ibpd)->pdn;
	params.iova = virt_addr;
	params.mr_length_in_bytes = length;
	params.permissions = access_flags;

	pg_sz = ib_umem_find_best_pgsz(mr->umem, dev->dev_attr.page_size_cap,
				       virt_addr);
	....
}

Ideally we would like to read it from HW, but currently we are hardcoding
the bitmap. I can change the commit message if you feel that is misleading.
Something along these lines:

    RDMA/mana: Use API to get contiguous memory blocks aligned to device
    supported page size

    Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
    registering an MR instead of coding it in the driver.

    ib_umem_find_best_pgsz() is used to find the best suitable page size
    which replaces the existing efa_cont_pages() implementation.
    rdma_for_each_block() is used to iterate the umem in aligned contiguous
    memory blocks.

Ajay
On Wed, May 18, 2022 at 09:05:22PM +0000, Ajay Sharma wrote:
> Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
> registering an MR instead of coding it in the driver.

The dma_set_max_seg_size() has *nothing* to do with
ib_umem_find_best_pgsz() other than its value should be larger than the
largest set bit.

Again, it is supposed to be the maximum value the HW can support in an
ib_sge length field, which is usually 2G.

Jason
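Why "larger than the largest set bit" is the only connection: inside
ib_umem_find_best_pgsz(), the length of every interior scatterlist entry
feeds the alignment mask, so if the DMA layer splits segments at 64K no
page size above 64K can survive the scan. A paraphrased fragment of the
loop (see drivers/infiniband/core/umem.c of this era for the exact code):

	for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) {
		/* VA/PA bit differences reduce the usable page size. */
		mask |= (sg_dma_address(sg) + pgoff) ^ va;
		/* Interior entries must be a multiple of the page size,
		 * so a 64K segment cap would force page_sz <= 64K here.
		 */
		if (i != (umem->sgt_append.sgt.nents - 1))
			mask |= sg_dma_len(sg);
		va += sg_dma_len(sg) - pgoff;
		pgoff = 0;
	}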
> Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device
> max page size
>
> The dma_set_max_seg_size() has *nothing* to do with
> ib_umem_find_best_pgsz() other than its value should be larger than the
> largest set bit.
>
> Again, it is supposed to be the maximum value the HW can support in an
> ib_sge length field, which is usually 2G.

Will fix this in v2.

Long
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 86ffe0e39df0..426087688480 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1385,6 +1385,13 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto release_region;
 
+	// The max GDMA HW supported page size is 2M
+	err = dma_set_max_seg_size(&pdev->dev, SZ_2M);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to set dma device segment size\n");
+		goto release_region;
+	}
+
 	err = -ENOMEM;
 	gc = vzalloc(sizeof(*gc));
 	if (!gc)
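For contrast with the hunk above (which uses the 2M hardware page-table
limit), a minimal sketch of the direction Jason asked for and Long agreed
to take in v2, assuming it mirrors the efa approach quoted earlier in the
thread; the actual v2 patch may differ:

	/* Sketch only: cap DMA segments at what a 32-bit ib_sge length
	 * can describe, rather than at the 2M HW page-table limit.
	 */
	dma_set_max_seg_size(&pdev->dev, UINT_MAX);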