
[rdma-next,v1,2/6] RDMA/umem: Add API to find best driver supported page size in an MR

Message ID: 20190219145745.13476-3-shiraz.saleem@intel.com (mailing list archive)
State: Changes Requested
Delegated to: Jason Gunthorpe
Series: Add APIs to get contiguous memory blocks aligned to a HW supported page size

Commit Message

Shiraz Saleem Feb. 19, 2019, 2:57 p.m. UTC
This helper iterates through the SG list to find the best page size to use
from a bitmap of HW supported page sizes. Drivers that support multiple
page sizes, but not mixed pages in an MR can use this API.
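
For illustration, a driver that supports 4K, 2M and 1G pages might call the
helper roughly like this (a minimal sketch; the mr->umem/page_shift fields,
the virt_addr variable and the error handling are assumptions, not part of
this patch):

	unsigned long pg_sz;

	pg_sz = ib_umem_find_single_pg_size(mr->umem,
					    SZ_4K | SZ_2M | SZ_1G,
					    virt_addr);
	if (!pg_sz)
		return -EOPNOTSUPP; /* umem needs an unsupported page size */
	mr->page_shift = ilog2(pg_sz);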

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
 drivers/infiniband/core/umem.c | 90 ++++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_umem.h         |  9 +++++
 2 files changed, 99 insertions(+)

Comments

Gal Pressman April 4, 2019, 8:36 a.m. UTC | #1
On 19-Feb-19 16:57, Shiraz Saleem wrote:
> This helper iterates through the SG list to find the best page size to use
> from a bitmap of HW supported page sizes. Drivers that support multiple
> page sizes, but not mixed pages in an MR can use this API.
> 
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

I've tested this patch comparing our existing efa_cont_pages() implementation vs
ib_umem_find_single_pg_size() running different test suites:

Supported page sizes are anything between 4k and 2G.
I'm using the page shift to compare the results of both functions (ilog2 of the
return value of ib_umem_find_single_pg_size).
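
(For reference, the comparison described here amounts to something like the
following sketch; the pgsz_bitmap value is an assumption derived from the
4k-2G range above:)

	/* every page size from 4k (bit 12) up to 2G (bit 31) */
	unsigned long pgsz_bitmap = GENMASK(31, 12);
	unsigned int new_shift;

	new_shift = ilog2(ib_umem_find_single_pg_size(umem, pgsz_bitmap,
						      virt_addr));
	/* new_shift is then compared against the page shift computed by
	 * the existing efa_cont_pages() implementation
	 */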

When huge pages are disabled, in many cases efa_cont_pages() returns a page shift
of 13 whereas ib_umem_find_single_pg_size() returns 12. I didn't see a test where
ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).

When huge pages are enabled, most of the time both functions return the same
value (page shift 21/22), but in some cases efa_cont_pages() returns 21 and
ib_umem_find_single_pg_size() returns 22, which causes the register MR verb to
fail due to an invalid page address.

I simply ran the tests and did not have a chance to debug the issue, but it looks
like there's an issue with either ib_umem_find_single_pg_size() or the way I
used it for EFA.
Shiraz Saleem April 4, 2019, 3:25 p.m. UTC | #2
>Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On 19-Feb-19 16:57, Shiraz Saleem wrote:
>> This helper iterates through the SG list to find the best page size to
>> use from a bitmap of HW supported page sizes. Drivers that support
>> multiple page sizes, but not mixed pages in an MR can use this API.
>>
>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>
>I've tested this patch comparing our existing efa_cont_pages() implementation vs
>ib_umem_find_single_pg_size() running different test suites:

Thanks for testing!

>
>Supported page sizes are anything between 4k and 2G.
>I'm using the page shift to compare the results of both functions (ilog2 of the
>return value of ib_umem_find_single_pg_size).
>
>When huge pages are disabled, in many cases efa_cont_pages() returns a page shift
>of 13 whereas ib_umem_find_single_pg_size() returns 12. I didn't see a test where
>ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).

I wonder if the checks in place to guarantee that the offset into a large page
lines up for the user-space virt buf and the phys buf are downgrading the page
bit, and whether EFA might not need them.
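
(The check being referred to appears to be the first-SGE clamp in the patch
below:

	uvirt_pg_bit = ib_umem_find_pg_bit(uvirt_addr + sg_dma_len(sg),
					   supported_pgsz);
	pg_bit_sg = min_t(unsigned int, uvirt_pg_bit, pg_bit_sg);

which downgrades the page bit whenever the user VA is not aligned to the
candidate large page in the same way the physical buffer is.)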

>
>When huge pages are enabled, most of the time both functions return the same
>value (page shift 21/22), but in some cases efa_cont_pages() returns 21 and
>ib_umem_find_single_pg_size() returns 22, which causes the register MR verb to
>fail due to an invalid page address.

I see. Seems like there's a bug here. I can provide you a debug-hooked patch that
should narrow it down. If you can run it for the mismatch cases, it would be great.

Shiraz
Jason Gunthorpe April 4, 2019, 3:47 p.m. UTC | #3
On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
> >Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
> >supported page size in an MR
> >
> >On 19-Feb-19 16:57, Shiraz Saleem wrote:
> >> This helper iterates through the SG list to find the best page size to
> >> use from a bitmap of HW supported page sizes. Drivers that support
> >> multiple page sizes, but not mixed pages in an MR can use this API.
> >>
> >> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> >> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> >
> >I've tested this patch comparing our existing efa_cont_pages() implementation vs
> >ib_umem_find_single_pg_size() running different test suites:
> 
> Thanks for testing!
> 
> >
> >Supported page sizes are anything between 4k and 2G.
> >I'm using the page shift to compare the results of both functions (ilog2 of the
> >return value of ib_umem_find_single_pg_size).
> >
> >When huge pages are disabled, in many cases efa_cont_pages() returns a page shift
> >of 13 whereas ib_umem_find_single_pg_size() returns 12. I didn't see a test where
> >ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
> 
> I wonder if the checks in place to guarantee that the offset into a large page
> lines up for the user-space virt buf and the phys buf are downgrading the page
> bit, and whether EFA might not need them.

EFA doesn't have actual MRs with virtual addresses, so it probably
doesn't care.

Jason
Gal Pressman April 4, 2019, 4:01 p.m. UTC | #4
On 04-Apr-19 18:47, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
>>> Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
>>> supported page size in an MR
>>>
>>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
>>>> This helper iterates through the SG list to find the best page size to
>>>> use from a bitmap of HW supported page sizes. Drivers that support
>>>> multiple page sizes, but not mixed pages in an MR can use this API.
>>>>
>>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>>>
>>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
>>> ib_umem_find_single_pg_size() running different test suites:
>>
>> Thanks for testing!
>>
>>>
>>> Supported page sizes are anything between 4k and 2G.
>>> I'm using the page shift to compare the results of both functions (ilog2 of the
>>> return value of ib_umem_find_single_pg_size).
>>>
>>> When huge pages are disabled, in many cases efa_cont_pages() returns a page shift
>>> of 13 whereas ib_umem_find_single_pg_size() returns 12. I didn't see a test where
>>> ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
>>
>> I wonder if the checks in place to guarantee that the offset into a large page
>> lines up for the user-space virt buf and the phys buf are downgrading the page
>> bit, and whether EFA might not need them.
> 
> EFA doesn't have actual MRs with virtual addresses, so it probably
> doesn't care.

Can you please explain this statement?
Gal Pressman April 4, 2019, 4:02 p.m. UTC | #5
On 04-Apr-19 18:25, Saleem, Shiraz wrote:
>> When huge pages are enabled, most of the time both functions return the same
>> value (page shift 21/22), but in some cases efa_cont_pages() returns 21 and
>> ib_umem_find_single_pg_size() returns 22, which causes the register MR verb to
>> fail due to an invalid page address.
> 
> I see. Seems like there's a bug here. I can provide you a debug-hooked patch that
> should narrow it down. If you can run it for the mismatch cases, it would be great.

Sure, send it and I'll test it next week.
Jason Gunthorpe April 4, 2019, 4:35 p.m. UTC | #6
On Thu, Apr 04, 2019 at 07:01:43PM +0300, Gal Pressman wrote:
> On 04-Apr-19 18:47, Jason Gunthorpe wrote:
> > On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
> >>> Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
> >>> supported page size in an MR
> >>>
> >>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
> >>>> This helper iterates through the SG list to find the best page size to
> >>>> use from a bitmap of HW supported page sizes. Drivers that support
> >>>> multiple page sizes, but not mixed pages in an MR can use this API.
> >>>>
> >>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> >>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> >>>
> >>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
> >>> ib_umem_find_single_pg_size() running different test suites:
> >>
> >> Thanks for testing!
> >>
> >>>
> >>> Supported page sizes are anything between 4k and 2G.
> >>> I'm using the page shift to compare the results of both functions (ilog2 of the
> >>> return value of ib_umem_find_single_pg_size).
> >>>
> >>> When huge pages are disabled, in many cases efa_cont_pages() returns a page shift
> >>> of 13 whereas ib_umem_find_single_pg_size() returns 12. I didn't see a test where
> >>> ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
> >>
> >> I wonder if the checks in place to guarantee that the offset into a large page
> >> lines up for the user-space virt buf and the phys buf are downgrading the page
> >> bit, and whether EFA might not need them.
> > 
> > EFA doesn't have actual MRs with virtual addresses, so it probably
> > doesn't care.
> 
> Can you please explain this statement?

The usual need for the virtual address alignment stems from having HW
that can process RDMA READ and RDMA WRITE operations that contain a
virtual address on the wire.

Since EFA doesn't have rkeys, it only needs to sort things out for the
lkey, which is perhaps simpler.

Otherwise, the HW needs a 64-bit addr on the DMA path to fix up the
misalignment of the VA to the page size.
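
(A sketch of that constraint, not taken from any driver: a page size above
PAGE_SIZE is only usable for wire-VA processing when the VA and the DMA
address carry the same offset within the candidate page:

	if ((virt_addr & (pg_sz - 1)) != (dma_addr & (pg_sz - 1)))
		pg_sz = PAGE_SIZE; /* can't resolve wire VA -> page + offset */

otherwise the HW needs the 64-bit fixup described above.)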

Jason
Jason Gunthorpe May 8, 2019, 6:50 p.m. UTC | #7
On Thu, Apr 04, 2019 at 11:36:35AM +0300, Gal Pressman wrote:
> On 19-Feb-19 16:57, Shiraz Saleem wrote:
> > This helper iterates through the SG list to find the best page size to use
> > from a bitmap of HW supported page sizes. Drivers that support multiple
> > page sizes, but not mixed pages in an MR can use this API.
> > 
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> 
> I've tested this patch comparing our existing efa_cont_pages() implementation vs
> ib_umem_find_single_pg_size() running different test suites:

This stuff has been merged, so please send a patch for EFA to use it
now..

Thanks,
Jason
Gal Pressman May 9, 2019, 8:05 a.m. UTC | #8
On 08-May-19 21:50, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2019 at 11:36:35AM +0300, Gal Pressman wrote:
>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
>>> This helper iterates through the SG list to find the best page size to use
>>> from a bitmap of HW supported page sizes. Drivers that support multiple
>>> page sizes, but not mixed pages in an MR can use this API.
>>>
>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>>
>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
>> ib_umem_find_single_pg_size() running different test suites:
> 
> This stuff has been merged, so please send a patch for EFA to use it
> now..

Sure, the patch will be a part of my next submission.

Patch

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e8611b7..d3b572e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -125,6 +125,96 @@  struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
 }
 
 /**
+ * ib_umem_find_pg_bit - Find the page bit to use for phyaddr
+ *
+ * @phyaddr: Physical address after DMA translation
+ * @supported_pgsz: bitmask of HW supported page sizes
+ */
+static unsigned int ib_umem_find_pg_bit(unsigned long phyaddr,
+					unsigned long supported_pgsz)
+{
+	unsigned long num_zeroes;
+	unsigned long pgsz;
+
+	/* Trailing zero bits in the address */
+	num_zeroes = __ffs(phyaddr);
+
+	/* Find page bit such that phyaddr is aligned to the highest supported
+	 * HW page size
+	 */
+	pgsz = supported_pgsz & (BIT_ULL(num_zeroes + 1) - 1);
+	if (!pgsz)
+		return __ffs(supported_pgsz);
+
+	return fls64(pgsz) - 1;
+}
+
+/**
+ * ib_umem_find_single_pg_size - Find best HW page size to use for this MR
+ * @umem: umem struct
+ * @supported_pgsz: bitmask of HW supported page sizes
+ * @uvirt_addr: user-space virtual MR base address
+ *
+ * This helper is intended for HW that support multiple page
+ * sizes but can do only a single page size in an MR.
+ *
+ * Returns 0 if mapping the umem requires page sizes that are not
+ * supported by the driver. Drivers that always support PAGE_SIZE
+ * or smaller will never see a 0 result.
+ */
+unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
+					  unsigned long supported_pgsz,
+					  unsigned long uvirt_addr)
+{
+	struct scatterlist *sg;
+	unsigned int pg_bit_sg, min_pg_bit, best_pg_bit;
+	int i;
+
+	if (WARN_ON(!(supported_pgsz & GENMASK(PAGE_SHIFT, 0))))
+		return 0;
+
+	min_pg_bit = __ffs(supported_pgsz);
+	best_pg_bit = fls64(supported_pgsz) - 1;
+
+	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+		unsigned long dma_addr_start, dma_addr_end;
+
+		dma_addr_start = sg_dma_address(sg);
+		dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
+		if (!i) {
+			pg_bit_sg = ib_umem_find_pg_bit(dma_addr_end, supported_pgsz);
+
+			/* The start offset of the MR into a first _large_ page
+			 * should line up exactly for the user-space virtual buf
+			 * and physical buffer, in order to upgrade the page bit
+			 */
+			if (pg_bit_sg > PAGE_SHIFT) {
+				unsigned int uvirt_pg_bit;
+
+				uvirt_pg_bit = ib_umem_find_pg_bit(uvirt_addr + sg_dma_len(sg),
+								   supported_pgsz);
+				pg_bit_sg = min_t(unsigned int, uvirt_pg_bit, pg_bit_sg);
+			}
+		} else if (i == (umem->nmap - 1)) {
+			/* last SGE: Does not matter if MR ends at an
+			 * unaligned offset.
+			 */
+			pg_bit_sg = ib_umem_find_pg_bit(dma_addr_start, supported_pgsz);
+		} else {
+			pg_bit_sg = ib_umem_find_pg_bit(dma_addr_start,
+				supported_pgsz & (BIT_ULL(__ffs(sg_dma_len(sg)) + 1) - 1));
+		}
+
+		best_pg_bit = min_t(unsigned int, best_pg_bit, pg_bit_sg);
+		if (best_pg_bit == min_pg_bit)
+			break;
+	}
+
+	return BIT_ULL(best_pg_bit);
+}
+EXPORT_SYMBOL(ib_umem_find_single_pg_size);
+
+/**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
  * If access flags indicate ODP memory, avoid pinning. Instead, stores
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 18407a4..4e186a3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -88,6 +88,9 @@  struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 int ib_umem_page_count(struct ib_umem *umem);
 int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		      size_t length);
+unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
+					  unsigned long supported_pgsz,
+					  unsigned long uvirt_addr);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
@@ -105,6 +108,12 @@  static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
 		      		    size_t length) {
 	return -EINVAL;
 }
+static inline unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
+							 unsigned long supported_pgsz,
+							 unsigned long uvirt_addr) {
+	return 0;
+}
+
 #endif /* CONFIG_INFINIBAND_USER_MEM */
 
 #endif /* IB_UMEM_H */
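
(To make the bit arithmetic in ib_umem_find_pg_bit() concrete, a worked
example with assumed numbers, not part of the patch:

	unsigned long phyaddr = 0x1200000; /* 21 trailing zeros: __ffs() = 21 */
	unsigned long supported_pgsz = SZ_4K | SZ_2M | SZ_1G; /* bits 12, 21, 30 */
	unsigned long pgsz;

	/* keep only the page sizes the address is aligned to: bits 0..21 */
	pgsz = supported_pgsz & (BIT_ULL(__ffs(phyaddr) + 1) - 1);
	/* bits 12 and 21 survive; fls64(pgsz) - 1 = 21, i.e. a 2M page */

The intermediate-SGE case applies the same mask with __ffs(sg_dma_len(sg)),
so an SGE whose DMA length is, say, 0x400000 (4M) can never contribute a page
size larger than 4M.)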