Message ID | 20190219145745.13476-3-shiraz.saleem@intel.com (mailing list archive)
---|---
State | Changes Requested
Delegated to | Jason Gunthorpe
Series | Add APIs to get contiguous memory blocks aligned to a HW supported page size
On 19-Feb-19 16:57, Shiraz Saleem wrote:
> This helper iterates through the SG list to find the best page size to use
> from a bitmap of HW supported page sizes. Drivers that support multiple
> page sizes, but not mixed pages in an MR can use this API.
>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

I've tested this patch comparing our existing efa_cont_pages() implementation vs
ib_umem_find_single_pg_size() running different test suites:

Supported page sizes are anything between 4k to 2G.
I'm using page shift to compare results of both functions (ilog2 the return
value of ib_umem_find_single_pg_size).

When huge pages are disabled, in many cases efa_cont_pages() returns page shift
of 13 where ib_umem_find_single_pg_size() returns 12. Didn't see a test where
ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).

When huge pages are enabled, most of the time both functions return the same
value (page shift 21/22) but in some cases efa_cont_pages() returns 21 and
ib_umem_find_single_pg_size() returns 22 which causes register MR verb to fail
due to invalid page address.

I simply ran the tests, did not have a chance to debug the issue, but it looks
like there's an issue with either ib_umem_find_single_pg_size() or the way I
used it for EFA.
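(For reference, a minimal user-space sketch of the comparison methodology described above, assuming the "4k to 2G" range is expressed as a bitmask with bits 12 through 31 set, and using a __builtin_clzll()-based stand-in for the kernel's ilog2(). The values and helper are illustrative only, not the EFA test code.)

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-in for ilog2() on the power-of-two page size returned by the helper */
static unsigned int page_shift(uint64_t pgsz)
{
        return 63 - __builtin_clzll(pgsz);
}

int main(void)
{
        /* "anything between 4k to 2G" as a page-size bitmask: bits 12..31 set */
        uint64_t supported_pgsz = ((1ULL << 32) - 1) & ~((1ULL << 12) - 1);

        printf("supported_pgsz = 0x%llx\n", (unsigned long long)supported_pgsz);
        /* returned page sizes of 4K, 2M and 4M map to shifts 12, 21 and 22 */
        printf("%u %u %u\n", page_shift(1ULL << 12), page_shift(1ULL << 21),
               page_shift(1ULL << 22));
        return 0;
}
```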
>Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
>supported page size in an MR
>
>On 19-Feb-19 16:57, Shiraz Saleem wrote:
>> This helper iterates through the SG list to find the best page size to
>> use from a bitmap of HW supported page sizes. Drivers that support
>> multiple page sizes, but not mixed pages in an MR can use this API.
>>
>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>
>I've tested this patch comparing our existing efa_cont_pages() implementation vs
>ib_umem_find_single_pg_size() running different test suites:

Thanks for testing!

>
>Supported page sizes are anything between 4k to 2G.
>I'm using page shift to compare results of both functions (ilog2 the return value of
>ib_umem_find_single_pg_size).
>
>When huge pages are disabled, in many cases efa_cont_pages() returns page shift
>of 13 where ib_umem_find_single_pg_size() returns 12. Didn't see a test where
>ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).

I wonder if the checks in place to guarantee that the offset into a large page
aligns for the user-space virt buf and the phy buf are downgrading the page bit,
and whether EFA might not need them.

>
>When huge pages are enabled, most of the time both functions return the same
>value (page shift 21/22) but in some cases efa_cont_pages() returns 21 and
>ib_umem_find_single_pg_size() returns 22 which causes register MR verb to fail due
>to invalid page address.

I see. Seems like there's a bug here. I can provide you a debug-hooked patch that
should narrow it down. If you can run it for the mismatch cases, it would be great.

Shiraz
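(To make the hypothesis above concrete, the following user-space sketch mirrors the trailing-zeros logic of ib_umem_find_pg_bit() from this patch; the addresses, the SGE length, and the 4k-2G bitmask are made-up example values, not kernel code. It shows how a first SGE whose physical end is 8K-aligned would allow page shift 13 on its own, while a user VA that is only 4K-aligned caps the result at 12.)

```c
#include <stdio.h>
#include <stdint.h>

/* User-space mirror of ib_umem_find_pg_bit(): highest supported page bit
 * such that addr is aligned to that page size.
 */
static unsigned int find_pg_bit(uint64_t addr, uint64_t supported_pgsz)
{
        unsigned int num_zeroes = __builtin_ctzll(addr); /* trailing zero bits */
        uint64_t pgsz = supported_pgsz & ((1ULL << (num_zeroes + 1)) - 1);

        if (!pgsz)
                return __builtin_ctzll(supported_pgsz);
        return 63 - __builtin_clzll(pgsz);               /* fls64(pgsz) - 1 */
}

int main(void)
{
        uint64_t supported = 0xfffff000ULL;         /* 4K..2G: bits 12-31 */
        uint64_t dma_len   = 0x2000;                /* first SGE is 8K long */
        uint64_t dma_end   = 0x40000000 + dma_len;  /* physically 8K-aligned end */
        uint64_t uvirt     = 0x7f0000001000ULL;     /* user VA only 4K-aligned */

        unsigned int phys_bit  = find_pg_bit(dma_end, supported);
        unsigned int uvirt_bit = find_pg_bit(uvirt + dma_len, supported);

        /* phys alone allows shift 13, the VA check caps it at 12 */
        printf("phys=%u uvirt=%u chosen=%u\n", phys_bit, uvirt_bit,
               phys_bit < uvirt_bit ? phys_bit : uvirt_bit);
        return 0;
}
```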
On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
> >Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
> >supported page size in an MR
> >
> >On 19-Feb-19 16:57, Shiraz Saleem wrote:
> >> This helper iterates through the SG list to find the best page size to
> >> use from a bitmap of HW supported page sizes. Drivers that support
> >> multiple page sizes, but not mixed pages in an MR can use this API.
> >>
> >> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> >> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> >
> >I've tested this patch comparing our existing efa_cont_pages() implementation vs
> >ib_umem_find_single_pg_size() running different test suites:
>
> Thanks for testing!
>
> >
> >Supported page sizes are anything between 4k to 2G.
> >I'm using page shift to compare results of both functions (ilog2 the return value of
> >ib_umem_find_single_pg_size).
> >
> >When huge pages are disabled, in many cases efa_cont_pages() returns page shift
> >of 13 where ib_umem_find_single_pg_size() returns 12. Didn't see a test where
> >ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
>
> I wonder if the checks in place to guarantee that the offset into a large page
> aligns for the user-space virt buf and the phy buf are downgrading the page bit,
> and whether EFA might not need them.

EFA doesn't have actual MRs with virtual addresses, so it probably
doesn't care.

Jason
On 04-Apr-19 18:47, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
>>> Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
>>> supported page size in an MR
>>>
>>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
>>>> This helper iterates through the SG list to find the best page size to
>>>> use from a bitmap of HW supported page sizes. Drivers that support
>>>> multiple page sizes, but not mixed pages in an MR can use this API.
>>>>
>>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>>>
>>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
>>> ib_umem_find_single_pg_size() running different test suites:
>>
>> Thanks for testing!
>>
>>>
>>> Supported page sizes are anything between 4k to 2G.
>>> I'm using page shift to compare results of both functions (ilog2 the return value of
>>> ib_umem_find_single_pg_size).
>>>
>>> When huge pages are disabled, in many cases efa_cont_pages() returns page shift
>>> of 13 where ib_umem_find_single_pg_size() returns 12. Didn't see a test where
>>> ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
>>
>> I wonder if the checks in place to guarantee that the offset into a large page
>> aligns for the user-space virt buf and the phy buf are downgrading the page bit,
>> and whether EFA might not need them.
>
> EFA doesn't have actual MRs with virtual addresses, so it probably
> doesn't care.

Can you please explain this statement?
On 04-Apr-19 18:25, Saleem, Shiraz wrote:
>> When huge pages are enabled, most of the time both functions return the same
>> value (page shift 21/22) but in some cases efa_cont_pages() returns 21 and
>> ib_umem_find_single_pg_size() returns 22 which causes register MR verb to fail due
>> to invalid page address.
>
> I see. Seems like there's a bug here. I can provide you a debug-hooked patch that should
> narrow it down.

Sure, send it and I'll test it next week.
On Thu, Apr 04, 2019 at 07:01:43PM +0300, Gal Pressman wrote:
> On 04-Apr-19 18:47, Jason Gunthorpe wrote:
> > On Thu, Apr 04, 2019 at 03:25:26PM +0000, Saleem, Shiraz wrote:
> >>> Subject: Re: [PATCH rdma-next v1 2/6] RDMA/umem: Add API to find best driver
> >>> supported page size in an MR
> >>>
> >>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
> >>>> This helper iterates through the SG list to find the best page size to
> >>>> use from a bitmap of HW supported page sizes. Drivers that support
> >>>> multiple page sizes, but not mixed pages in an MR can use this API.
> >>>>
> >>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> >>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> >>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> >>>
> >>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
> >>> ib_umem_find_single_pg_size() running different test suites:
> >>
> >> Thanks for testing!
> >>
> >>>
> >>> Supported page sizes are anything between 4k to 2G.
> >>> I'm using page shift to compare results of both functions (ilog2 the return value of
> >>> ib_umem_find_single_pg_size).
> >>>
> >>> When huge pages are disabled, in many cases efa_cont_pages() returns page shift
> >>> of 13 where ib_umem_find_single_pg_size() returns 12. Didn't see a test where
> >>> ib_umem_find_single_pg_size() returns anything other than 12 (PAGE_SIZE).
> >>
> >> I wonder if the checks in place to guarantee that the offset into a large page
> >> aligns for the user-space virt buf and the phy buf are downgrading the page bit,
> >> and whether EFA might not need them.
> >
> > EFA doesn't have actual MRs with virtual addresses, so it probably
> > doesn't care.
>
> Can you please explain this statement?

The usual need for the virtual address alignment stems from having HW
that can process RDMA READ and RDMA WRITE operations that contain a
virtual address on the wire.

Since EFA doesn't have rkeys, it only needs to sort things out for the
lkey, which is perhaps simpler.

Otherwise the HW needs a 64 bit addr on the DMA path to fix the
misalignment of VA to page size.

Jason
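(The requirement described here can be illustrated with a small sketch: for a candidate page size to work with on-the-wire virtual addresses, the MR's start VA and its first physical address must share the same offset within that page size. This is the general idea rather than the exact check in the patch, and the addresses are made up.)

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* A page size pgsz is usable for wire VAs only if the VA and the DMA address
 * have the same offset within pgsz; otherwise the HW would need an extra
 * 64-bit adjustment on the DMA path.
 */
static bool va_offset_matches(uint64_t va, uint64_t dma_addr, uint64_t pgsz)
{
        return ((va ^ dma_addr) & (pgsz - 1)) == 0;
}

int main(void)
{
        uint64_t dma    = 0x40000000;        /* 2M-aligned physical start */
        uint64_t va_ok  = 0x7f4000200000;    /* 2M-aligned VA: offsets match */
        uint64_t va_bad = 0x7f4000201000;    /* VA off by 4K within the 2M page */

        printf("2M with aligned VA: %d\n", va_offset_matches(va_ok, dma, 1ULL << 21));
        printf("2M with misaligned VA: %d\n", va_offset_matches(va_bad, dma, 1ULL << 21));
        printf("fallback to 4K: %d\n", va_offset_matches(va_bad, dma, 1ULL << 12));
        return 0;
}
```

A misaligned VA therefore forces the MR down to a smaller page size even when the physical layout itself is contiguous.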
On Thu, Apr 04, 2019 at 11:36:35AM +0300, Gal Pressman wrote:
> On 19-Feb-19 16:57, Shiraz Saleem wrote:
> > This helper iterates through the SG list to find the best page size to use
> > from a bitmap of HW supported page sizes. Drivers that support multiple
> > page sizes, but not mixed pages in an MR can use this API.
> >
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>
> I've tested this patch comparing our existing efa_cont_pages() implementation vs
> ib_umem_find_single_pg_size() running different test suites:

This stuff has been merged, so please send a patch for EFA to use it
now..

Thanks,
Jason
On 08-May-19 21:50, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2019 at 11:36:35AM +0300, Gal Pressman wrote:
>> On 19-Feb-19 16:57, Shiraz Saleem wrote:
>>> This helper iterates through the SG list to find the best page size to use
>>> from a bitmap of HW supported page sizes. Drivers that support multiple
>>> page sizes, but not mixed pages in an MR can use this API.
>>>
>>> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
>>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>>> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
>>
>> I've tested this patch comparing our existing efa_cont_pages() implementation vs
>> ib_umem_find_single_pg_size() running different test suites:
>
> This stuff has been merged, so please send a patch for EFA to use it
> now..

Sure, the patch will be a part of my next submission.
```diff
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e8611b7..d3b572e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -125,6 +125,96 @@ struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
 }
 
 /**
+ * ib_umem_find_pg_bit - Find the page bit to use for phyaddr
+ *
+ * @phyaddr: Physical address after DMA translation
+ * @supported_pgsz: bitmask of HW supported page sizes
+ */
+static unsigned int ib_umem_find_pg_bit(unsigned long phyaddr,
+                                        unsigned long supported_pgsz)
+{
+        unsigned long num_zeroes;
+        unsigned long pgsz;
+
+        /* Trailing zero bits in the address */
+        num_zeroes = __ffs(phyaddr);
+
+        /* Find page bit such that phyaddr is aligned to the highest supported
+         * HW page size
+         */
+        pgsz = supported_pgsz & (BIT_ULL(num_zeroes + 1) - 1);
+        if (!pgsz)
+                return __ffs(supported_pgsz);
+
+        return fls64(pgsz) - 1;
+}
+
+/**
+ * ib_umem_find_single_pg_size - Find best HW page size to use for this MR
+ * @umem: umem struct
+ * @supported_pgsz: bitmask of HW supported page sizes
+ * @uvirt_addr: user-space virtual MR base address
+ *
+ * This helper is intended for HW that support multiple page
+ * sizes but can do only a single page size in an MR.
+ *
+ * Returns 0 if the umem requires page sizes not supported by
+ * the driver to be mapped. Drivers always supporting PAGE_SIZE
+ * or smaller will never see a 0 result.
+ */
+unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
+                                          unsigned long supported_pgsz,
+                                          unsigned long uvirt_addr)
+{
+        struct scatterlist *sg;
+        unsigned int pg_bit_sg, min_pg_bit, best_pg_bit;
+        int i;
+
+        if (WARN_ON(!(supported_pgsz & GENMASK(PAGE_SHIFT, 0))))
+                return 0;
+
+        min_pg_bit = __ffs(supported_pgsz);
+        best_pg_bit = fls64(supported_pgsz) - 1;
+
+        for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+                unsigned long dma_addr_start, dma_addr_end;
+
+                dma_addr_start = sg_dma_address(sg);
+                dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
+                if (!i) {
+                        pg_bit_sg = ib_umem_find_pg_bit(dma_addr_end, supported_pgsz);
+
+                        /* The start offset of the MR into a first _large_ page
+                         * should line up exactly for the user-space virtual buf
+                         * and physical buffer, in order to upgrade the page bit
+                         */
+                        if (pg_bit_sg > PAGE_SHIFT) {
+                                unsigned int uvirt_pg_bit;
+
+                                uvirt_pg_bit = ib_umem_find_pg_bit(uvirt_addr + sg_dma_len(sg),
+                                                                   supported_pgsz);
+                                pg_bit_sg = min_t(unsigned int, uvirt_pg_bit, pg_bit_sg);
+                        }
+                } else if (i == (umem->nmap - 1)) {
+                        /* last SGE: Does not matter if MR ends at an
+                         * unaligned offset.
+                         */
+                        pg_bit_sg = ib_umem_find_pg_bit(dma_addr_start, supported_pgsz);
+                } else {
+                        pg_bit_sg = ib_umem_find_pg_bit(dma_addr_start,
+                                                        supported_pgsz & (BIT_ULL(__ffs(sg_dma_len(sg)) + 1) - 1));
+                }
+
+                best_pg_bit = min_t(unsigned int, best_pg_bit, pg_bit_sg);
+                if (best_pg_bit == min_pg_bit)
+                        break;
+        }
+
+        return BIT_ULL(best_pg_bit);
+}
+EXPORT_SYMBOL(ib_umem_find_single_pg_size);
+
+/**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
  * If access flags indicate ODP memory, avoid pinning. Instead, stores
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 18407a4..4e186a3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -88,6 +88,9 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 int ib_umem_page_count(struct ib_umem *umem);
 int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
                       size_t length);
+unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
+                                          unsigned long supported_pgsz,
+                                          unsigned long uvirt_addr);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
@@ -105,6 +108,12 @@ static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
                                     size_t length) {
         return -EINVAL;
 }
+static inline int ib_umem_find_single_pg_size(struct ib_umem *umem,
+                                              unsigned long supported_pgsz,
+                                              unsigned long uvirt_addr) {
+        return -EINVAL;
+}
+
 #endif /* CONFIG_INFINIBAND_USER_MEM */
 
 #endif /* IB_UMEM_H */
```
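(Below is a hedged sketch of how a driver's reg_mr path might eventually consume the new helper. The function name, the GENMASK_ULL()-based capability mask, and the error handling are illustrative assumptions, not the actual EFA patch.)

```c
/* Kernel-side fragment, illustrative only */
static int example_mr_page_shift(struct ib_umem *umem, u64 virt_addr)
{
        /* e.g. HW that supports every power-of-two page size from 4K to 2G */
        unsigned long supported_pgsz = GENMASK_ULL(31, 12);
        unsigned long pg_sz;

        pg_sz = ib_umem_find_single_pg_size(umem, supported_pgsz, virt_addr);
        if (!pg_sz)
                return -EINVAL;

        return ilog2(pg_sz); /* page shift to program into the HW */
}
```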