
[05/12] net: mana: Set the DMA device max page size

Message ID 1652778276-2986-6-git-send-email-longli@linuxonhyperv.com (mailing list archive)
State Superseded
Delegated to: Jason Gunthorpe
Series: Introduce Microsoft Azure Network Adapter (MANA) RDMA driver

Commit Message

Long Li May 17, 2022, 9:04 a.m. UTC
From: Long Li <longli@microsoft.com>

The system chooses a default 64K page size if the device does not specify
the maximum page size it can handle for DMA. This does not work well when
the device is registering a large chunk of memory, where a larger page
size is more efficient.

Set it to the maximum hardware supported page size.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 +++++++
 1 file changed, 7 insertions(+)
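
For background, the 64K default that the commit message refers to comes from the DMA API; paraphrasing dma_get_max_seg_size() from include/linux/dma-mapping.h of this era:

	/* Returns the segment size cap a driver has set via
	 * dma_set_max_seg_size(), or a 64K default when it never did.
	 */
	static inline unsigned int dma_get_max_seg_size(struct device *dev)
	{
		if (dev->dma_parms && dev->dma_parms->max_segment_size)
			return dev->dma_parms->max_segment_size;
		return SZ_64K;
	}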

Comments

Jason Gunthorpe May 17, 2022, 2:59 p.m. UTC | #1
On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> The system chooses a default 64K page size if the device does not specify
> the maximum page size it can handle for DMA. This does not work well when
> the device is registering a large chunk of memory, where a larger page
> size is more efficient.
> 
> Set it to the maximum hardware supported page size.

For RDMA devices this should be set to the largest segment size an
ib_sge can take in when posting work. It should not be the page size
of MR. 2M is a weird number for that, are you sure it is right?

Jason
Long Li May 17, 2022, 7:32 p.m. UTC | #2
> Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
> 
> On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > The system chooses a default 64K page size if the device does not
> > specify the maximum page size it can handle for DMA. This does not
> > work well when the device is registering a large chunk of memory,
> > where a larger page size is more efficient.
> >
> > Set it to the maximum hardware supported page size.
> 
> For RDMA devices this should be set to the largest segment size an ib_sge can
> take in when posting work. It should not be the page size of MR. 2M is a weird
> number for that, are you sure it is right?

Yes, this is the maximum page size used in hardware page tables.

Long

> 
> Jason
Jason Gunthorpe May 17, 2022, 7:35 p.m. UTC | #3
On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
> > 
> > On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com wrote:
> > > From: Long Li <longli@microsoft.com>
> > >
> > > The system chooses a default 64K page size if the device does not
> > > specify the maximum page size it can handle for DMA. This does not
> > > work well when the device is registering a large chunk of memory,
> > > where a larger page size is more efficient.
> > >
> > > Set it to the maximum hardware supported page size.
> > 
> > For RDMA devices this should be set to the largest segment size an ib_sge can
> > take in when posting work. It should not be the page size of MR. 2M is a weird
> > number for that, are you sure it is right?
> 
> Yes, this is the maximum page size used in hardware page tables.

As I said, it should be the size of the sge in the WQE, not the
"hardware page tables"

Jason
Long Li May 17, 2022, 8:04 p.m. UTC | #4
> Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
> 
> On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page
> > > size
> > >
> > > On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com
> wrote:
> > > > From: Long Li <longli@microsoft.com>
> > > >
> > > > The system chooses a default 64K page size if the device does
> > > > not specify the maximum page size it can handle for DMA. This
> > > > does not work well when the device is registering a large chunk
> > > > of memory, where a larger page size is more efficient.
> > > >
> > > > Set it to the maximum hardware supported page size.
> > >
> > > For RDMA devices this should be set to the largest segment size an
> > > ib_sge can take in when posting work. It should not be the page size
> > > of MR. 2M is a weird number for that, are you sure it is right?
> >
> > Yes, this is the maximum page size used in hardware page tables.
> 
> As I said, it should be the size of the sge in the WQE, not the "hardware page
> tables"

This driver uses the following code to figure out the largest page size for memory registration with hardware:

page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);

In this function, mr->umem is created with ib_dma_max_seg_size() as the maximum segment size when building its sg-table.

The purpose of setting the DMA max segment size to 2M is to make sure this function returns the largest page size the hardware can accept. Otherwise, it returns at most 64K, the default DMA segment size.
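
A minimal sketch of that dependency (placeholder arguments, error handling elided):

	struct ib_umem *umem;
	unsigned long page_sz;

	/* The umem sg-table is built with segments capped at the device's
	 * DMA max segment size (64K unless the driver raises it).
	 */
	umem = ib_umem_get(ibdev, start, length, access_flags);

	/* Interior sg-entry boundaries must be aligned to a candidate page
	 * size, so the segment cap also caps the page size returned here.
	 */
	page_sz = ib_umem_find_best_pgsz(umem, PAGE_SZ_BM, iova);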

Long

> 
> Jason
Jason Gunthorpe May 18, 2022, 12:03 a.m. UTC | #5
On Tue, May 17, 2022 at 08:04:58PM +0000, Long Li wrote:
> > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
> > 
> > On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > > > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page
> > > > size
> > > >
> > > > On Tue, May 17, 2022 at 02:04:29AM -0700, longli@linuxonhyperv.com
> > wrote:
> > > > > From: Long Li <longli@microsoft.com>
> > > > >
> > > > > The system chooses a default 64K page size if the device does
> > > > > not specify the maximum page size it can handle for DMA. This
> > > > > does not work well when the device is registering a large chunk
> > > > > of memory, where a larger page size is more efficient.
> > > > >
> > > > > Set it to the maximum hardware supported page size.
> > > >
> > > > For RDMA devices this should be set to the largest segment size an
> > > > ib_sge can take in when posting work. It should not be the page size
> > > > of MR. 2M is a weird number for that, are you sure it is right?
> > >
> > > Yes, this is the maximum page size used in hardware page tables.
> > 
> > As I said, it should be the size of the sge in the WQE, not the "hardware page
> > tables"
> 
> This driver uses the following code to figure out the largest page
> size for memory registration with hardware:
> 
> page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
> 
> In this function, mr->umem is created with ib_dma_max_seg_size() as
> the maximum segment size when building its sg-table.
>
> The purpose of setting the DMA max segment size to 2M is to make sure
> this function returns the largest page size the hardware can accept.
> Otherwise, it returns at most 64K, the default DMA segment size.

As I've already said, you are supposed to set the value that limits to
ib_sge and *NOT* the value that is related to
ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's
typically work on a 32 bit length.
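
For reference, this is the structure in question, from include/rdma/ib_verbs.h; its 32-bit length field is why the cap is typically on the order of 2G-4G rather than any page-table page size:

	struct ib_sge {
		u64	addr;
		u32	length;	/* one segment is at most 4G-1 bytes */
		u32	lkey;
	};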

Jason
Ajay Sharma May 18, 2022, 5:59 a.m. UTC | #6
Thanks Long. 
Hello Jason,
I am the author of the patch.
To your comment below:
" As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length."

The ib_sge is limited by __sg_alloc_table_from_pages(), which uses ib_dma_max_seg_size(), which in turn is set by the eth driver via dma_set_max_seg_size(). Currently our HW does not support PTEs larger than 2M.

So ib_umem_find_best_pgsz() takes PAGE_SZ_BM as an input. The bitmap has all the bits set for the page sizes supported by the HW.

#define PAGE_SZ_BM (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K \
		    | SZ_256K | SZ_512K | SZ_1M | SZ_2M)

Are you suggesting we are too restrictive in the bitmap we are passing, or that we should not set this bitmap and let the function choose the default?

Regards,
Ajay

Jason Gunthorpe May 18, 2022, 4:05 p.m. UTC | #7
On Wed, May 18, 2022 at 05:59:00AM +0000, Ajay Sharma wrote:
> Thanks Long. 
> Hello Jason,
> I am the author of the patch.
> To your comment below:
> " As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length."
> 
> The ib_sge is limited by __sg_alloc_table_from_pages(), which uses
> ib_dma_max_seg_size(), which in turn is set by the eth driver via
> dma_set_max_seg_size(). Currently our HW does not support PTEs
> larger than 2M.

*sigh* again it has nothing to do with *PTEs* in the HW.

Jason
Ajay Sharma May 18, 2022, 9:05 p.m. UTC | #8
Sorry, I am not able to follow. Below is the reference efa driver implementation:

static int efa_device_init(struct efa_com_dev *edev, struct pci_dev *pdev)
{
	int dma_width;
	int err;

	err = efa_com_dev_reset(edev, EFA_REGS_RESET_NORMAL);
	if (err)
		return err;

	err = efa_com_validate_version(edev);
	if (err)
		return err;

	dma_width = efa_com_get_dma_width(edev);
	if (dma_width < 0) {
		err = dma_width;
		return err;
	}

	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(dma_width));
	if (err) {
		dev_err(&pdev->dev, "dma_set_mask_and_coherent failed %d\n", err);
		return err;
	}

	dma_set_max_seg_size(&pdev->dev, UINT_MAX);
	return 0;
}

static int efa_register_mr(struct ib_pd *ibpd, struct efa_mr *mr, u64 start,
			   u64 length, u64 virt_addr, int access_flags)
{
	struct efa_dev *dev = to_edev(ibpd->device);
	struct efa_com_reg_mr_params params = {};
	struct efa_com_reg_mr_result result = {};
	struct pbl_context pbl;
	unsigned int pg_sz;
	int inline_size;
	int err;

	params.pd = to_epd(ibpd)->pdn;
	params.iova = virt_addr;
	params.mr_length_in_bytes = length;
	params.permissions = access_flags;

	pg_sz = ib_umem_find_best_pgsz(mr->umem,
				       dev->dev_attr.page_size_cap,
				       virt_addr);
....
}

Ideally we would like to read it from HW, but currently we are hardcoding the bitmap. I can change the commit message if you feel that is misleading.
Something along these lines:
RDMA/mana: Use API to get contiguous memory blocks aligned to device supported page size

Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
registering an MR instead of coding it in the driver.

ib_umem_find_best_pgsz() is used to find the best suitable page size
which replaces the existing efa_cont_pages() implementation.
rdma_for_each_block() is used to iterate the umem in aligned contiguous memory blocks.
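
A rough sketch of the registration loop this would give (helper and variable names here are placeholders, not actual driver code):

	unsigned long pg_sz;
	struct ib_block_iter biter;
	u64 *page_table;	/* HW page-table buffer, allocated elsewhere */
	int i = 0;

	pg_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, virt_addr);
	if (!pg_sz)
		return -EINVAL;

	/* Walk the umem in pg_sz-aligned blocks, recording each block's
	 * DMA address as one HW PTE.
	 */
	rdma_umem_for_each_dma_block(mr->umem, &biter, pg_sz)
		page_table[i++] = rdma_block_iter_dma_address(&biter);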


Ajay 


Jason Gunthorpe May 18, 2022, 11:11 p.m. UTC | #9
On Wed, May 18, 2022 at 09:05:22PM +0000, Ajay Sharma wrote:

> Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
> registering an MR instead of coding it in the driver.

The dma_set_max_seg_size() has *nothing* to do with
ib_umem_find_best_pgsz() other than its value should be larger than
the largest set bit.

Again, it is supposed to be the maximum value the HW can support in a
ib_sge length field, which is usually 2G.

Jason
Long Li May 19, 2022, 12:37 a.m. UTC | #10
> Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max
> page size
> 
> On Wed, May 18, 2022 at 09:05:22PM +0000, Ajay Sharma wrote:
> 
> > Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
> > registering an MR instead of coding it in the driver.
> 
> The dma_set_max_seg_size() has *nothing* to do with
> ib_umem_find_best_pgsz() other than its value should be larger than the largest
> set bit.
> 
> Again, it is supposed to be the maximum value the HW can support in a ib_sge
> length field, which is usually 2G.

Will fix this in v2.

Long
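
For context, the direction settled on above would look roughly like this in mana_gd_probe() (a sketch of the suggested change, not the actual v2 patch):

	/* Cap DMA segments at the ib_sge length limit rather than at the
	 * 2M HW page-table page size; the efa driver quoted earlier uses
	 * UINT_MAX for the same reason.
	 */
	err = dma_set_max_seg_size(&pdev->dev, UINT_MAX);
	if (err) {
		dev_err(&pdev->dev, "Failed to set DMA segment size\n");
		goto release_region;
	}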

Patch

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 86ffe0e39df0..426087688480 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1385,6 +1385,13 @@  static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto release_region;
 
+	/* The max GDMA HW supported page size is 2M */
+	err = dma_set_max_seg_size(&pdev->dev, SZ_2M);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to set dma device segment size\n");
+		goto release_region;
+	}
+
 	err = -ENOMEM;
 	gc = vzalloc(sizeof(*gc));
 	if (!gc)