[02/14] RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()

Message ID 2-v1-00f59ce24f1f+19f50-umem_1_jgg@nvidia.com (mailing list archive)
State Superseded
Series RDMA: Improve use of umem in DMA drivers

Commit Message

Jason Gunthorpe Sept. 2, 2020, 12:43 a.m. UTC
rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to build
the SGL.

The rules for umem SGL construction require that the SGs all be PAGE_SIZE
aligned, and the actual byte offset of the VA range is not encoded in the
SGL using offset and length. So rdma_for_each_block() has no idea where
the actual starting/ending point is, and cannot compute the first/last
block boundary when the starting address falls inside an SG entry.

Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now, block smaller page sizes.

Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 6 ++++++
 1 file changed, 6 insertions(+)
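
To illustrate the effect of the added mask (this sketch is not part of the
patch; the standalone helper example_filter_pgsz() is invented purely for
illustration), the filtering amounts to:

#include <linux/bitops.h>	/* GENMASK, BITS_PER_LONG */
#include <linux/mm.h>		/* PAGE_SHIFT */
#include <linux/sizes.h>	/* SZ_4K, SZ_2M */

/* Illustration only: the same mask the patch applies to pgsz_bitmap */
static unsigned long example_filter_pgsz(unsigned long pgsz_bitmap)
{
	/* Drop every page-size bit smaller than the system PAGE_SIZE */
	return pgsz_bitmap & GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
}

/*
 * On a 4K-page kernel (PAGE_SHIFT == 12) the bitmap passes through
 * unchanged. On a 64K-page kernel (PAGE_SHIFT == 16):
 *
 *   example_filter_pgsz(SZ_4K | SZ_2M) == SZ_2M   (4K bit cleared)
 *   example_filter_pgsz(SZ_4K)         == 0       (nothing usable left)
 */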

Comments

Leon Romanovsky Sept. 2, 2020, 11:51 a.m. UTC | #1
On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> rdma_for_each_block() makes assumptions about how the SGL is constructed
> that don't work if the block size is below the page size used to build
> the SGL.
>
> The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> aligned and we don't encode the actual byte offset of the VA range inside
> the SGL using offset and length. So rdma_for_each_block() has no idea
> where the actual starting/ending point is to compute the first/last block
> boundary if the starting address should be within a SGL.
>
> Fixing the SGL construction turns out to be really hard, and will be the
> subject of other patches. For now block smaller pages.
>
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 120e98403c345d..7b5bc969e55630 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  	dma_addr_t mask;
>  	int i;
>
> +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> +	 * page size used to build the umem. For now prevent smaller page sizes
> +	 * from being returned.
> +	 */
> +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> +

Why do we care about such case? Why can't we leave this check forever?

Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Jason Gunthorpe Sept. 2, 2020, 11:59 a.m. UTC | #2
On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > that don't work if the block size is below the page size used to build
> > the SGL.
> >
> > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > aligned and we don't encode the actual byte offset of the VA range inside
> > the SGL using offset and length. So rdma_for_each_block() has no idea
> > where the actual starting/ending point is to compute the first/last block
> > boundary if the starting address should be within a SGL.
> >
> > Fixing the SGL construction turns out to be really hard, and will be the
> > subject of other patches. For now block smaller pages.
> >
> > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  drivers/infiniband/core/umem.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > index 120e98403c345d..7b5bc969e55630 100644
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> >  	dma_addr_t mask;
> >  	int i;
> >
> > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > +	 * page size used to build the umem. For now prevent smaller page sizes
> > +	 * from being returned.
> > +	 */
> > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > +
> 
> Why do we care about such case? Why can't we leave this check forever?

If HW supports only, say, a 4k page size, and runs on a 64k page size
architecture, it should be able to fragment into the native HW page
size.

The whole point of these APIs is to decouple the system and HW page
sizes.

Jason
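
As a hedged sketch of that 4k-HW-on-64k-kernel case - the variable names
and error handling below are invented for illustration, not taken from any
driver - the registration path would look roughly like:

	/* HW that only supports 4K pages, on a kernel built with 64K PAGE_SIZE */
	struct ib_umem *umem;
	unsigned long pgsz;

	umem = ib_umem_get(ibdev, start, length, access_flags);
	if (IS_ERR(umem))
		return PTR_ERR(umem);

	/* Only the 4K bit is offered; on a 64K-page kernel it is below PAGE_SIZE */
	pgsz = ib_umem_find_best_pgsz(umem, SZ_4K, virt_addr);
	if (!pgsz) {
		/*
		 * With this patch the 4K bit is masked off and 0 is returned,
		 * so registration has to fail until the SGL construction is
		 * reworked to support blocks smaller than PAGE_SIZE.
		 */
		ib_umem_release(umem);
		return -EINVAL;
	}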
Leon Romanovsky Sept. 2, 2020, 12:05 p.m. UTC | #3
On Wed, Sep 02, 2020 at 08:59:12AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> > On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > > that don't work if the block size is below the page size used to build
> > > the SGL.
> > >
> > > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > > aligned and we don't encode the actual byte offset of the VA range inside
> > > the SGL using offset and length. So rdma_for_each_block() has no idea
> > > where the actual starting/ending point is to compute the first/last block
> > > boundary if the starting address should be within a SGL.
> > >
> > > Fixing the SGL construction turns out to be really hard, and will be the
> > > subject of other patches. For now block smaller pages.
> > >
> > > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > >  drivers/infiniband/core/umem.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > index 120e98403c345d..7b5bc969e55630 100644
> > > +++ b/drivers/infiniband/core/umem.c
> > > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > >  	dma_addr_t mask;
> > >  	int i;
> > >
> > > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > > +	 * page size used to build the umem. For now prevent smaller page sizes
> > > +	 * from being returned.
> > > +	 */
> > > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > > +
> >
> > Why do we care about such case? Why can't we leave this check forever?
>
> If HW supports only, say 4k page size, and runs on a 64k page size
> architecture it should be able to fragment into the native HW page
> size.
>
> The whole point of these APIs is to decouple the system and HW page
> sizes.

Right now you are preventing such combinations, but is this a real concern
for existing drivers?

Thanks

>
> Jason
Jason Gunthorpe Sept. 2, 2020, 4:34 p.m. UTC | #4
On Wed, Sep 02, 2020 at 03:05:40PM +0300, Leon Romanovsky wrote:
> On Wed, Sep 02, 2020 at 08:59:12AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 02, 2020 at 02:51:19PM +0300, Leon Romanovsky wrote:
> > > On Tue, Sep 01, 2020 at 09:43:30PM -0300, Jason Gunthorpe wrote:
> > > > rdma_for_each_block() makes assumptions about how the SGL is constructed
> > > > that don't work if the block size is below the page size used to build
> > > > the SGL.
> > > >
> > > > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > > > aligned and we don't encode the actual byte offset of the VA range inside
> > > > the SGL using offset and length. So rdma_for_each_block() has no idea
> > > > where the actual starting/ending point is to compute the first/last block
> > > > boundary if the starting address should be within a SGL.
> > > >
> > > > Fixing the SGL construction turns out to be really hard, and will be the
> > > > subject of other patches. For now block smaller pages.
> > > >
> > > > Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > >  drivers/infiniband/core/umem.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > > index 120e98403c345d..7b5bc969e55630 100644
> > > > +++ b/drivers/infiniband/core/umem.c
> > > > @@ -151,6 +151,12 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> > > >  	dma_addr_t mask;
> > > >  	int i;
> > > >
> > > > +	/* rdma_for_each_block() has a bug if the page size is smaller than the
> > > > +	 * page size used to build the umem. For now prevent smaller page sizes
> > > > +	 * from being returned.
> > > > +	 */
> > > > +	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > > > +
> > >
> > > Why do we care about such case? Why can't we leave this check forever?
> >
> > If HW supports only, say 4k page size, and runs on a 64k page size
> > architecture it should be able to fragment into the native HW page
> > size.
> >
> > The whole point of these APIs is to decouple the system and HW page
> > sizes.
> 
> Right now you are preventing such combinations, but is this real concern
> for existing drivers?

No, I didn't prevent anything, I've left those drivers just hardwired
to use PAGE_SHIFT/PAGE_SIZE.

Maybe they are broken and malfunction on 64k page size systems, maybe
the HW supports other page sizes and they should call
ib_umem_find_best_pgsz(), I don't really know.

The fix is fairly trivial, but it can't be done until the drivers stop
touching umem->sgl - as it requires changing how the sgl is
constructed to match standard kernel expectations, which also breaks
all the drivers.

Jason
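
For context, a rough sketch of the block-iterator pattern being referred to
(local names are placeholders and set_hw_page() is a made-up stand-in for
whatever a driver does with each block address), which is what drivers use
instead of hand-walking the SGL at block granularity:

	struct ib_block_iter biter;

	/* Walk the umem in pgsz-sized blocks via the core helper */
	rdma_for_each_block(umem->sg_head.sgl, &biter, umem->nmap, pgsz)
		set_hw_page(mr, rdma_block_iter_dma_address(&biter));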
Shiraz Saleem Sept. 3, 2020, 2:11 p.m. UTC | #5
> Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by
> ib_umem_find_best_pgsz()
> 
> rdma_for_each_block() makes assumptions about how the SGL is constructed that
> don't work if the block size is below the page size used to build the SGL.
> 
> The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> aligned and we don't encode the actual byte offset of the VA range inside the SGL
> using offset and length. So rdma_for_each_block() has no idea where the actual
> starting/ending point is to compute the first/last block boundary if the starting
> address should be within a SGL.

Not sure if we are exposed today, i.e. RDMA drivers working with block sizes smaller than the system page size?

Nevertheless it's a good find and looks like the right thing to do for now.

> 
> Fixing the SGL construction turns out to be really hard, and will be the subject of
> other patches. For now block smaller pages.
> 
> Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page
> size in an MR")
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Jason Gunthorpe Sept. 3, 2020, 2:17 p.m. UTC | #6
On Thu, Sep 03, 2020 at 02:11:37PM +0000, Saleem, Shiraz wrote:
> > Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being returned by
> > ib_umem_find_best_pgsz()
> > 
> > rdma_for_each_block() makes assumptions about how the SGL is constructed that
> > don't work if the block size is below the page size used to build the SGL.
> > 
> > The rules for umem SGL construction require that the SG's all be PAGE_SIZE
> > aligned and we don't encode the actual byte offset of the VA range inside the SGL
> > using offset and length. So rdma_for_each_block() has no idea where the actual
> > starting/ending point is to compute the first/last block boundary if the starting
> > address should be within a SGL.
> 
> Not sure if we were exposed today. i.e. rdma drivers working with
> block sizes smaller than system page size?

Yes, it could happen, here are some examples:

drivers/infiniband/hw/i40iw/i40iw_verbs.c:
              iwmr->page_size = ib_umem_find_best_pgsz(region, SZ_4K | SZ_2M,

drivers/infiniband/hw/bnxt_re/ib_verbs.c:
        page_shift = __ffs(ib_umem_find_best_pgsz(umem,
                                BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M,
                                virt_addr));

E.g. that breaks on ARM with 16k or 64k page sizes.

Jason
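
A worked example (illustrative, not lifted from either driver) of how such a
call behaves on a 64K-page kernel once this patch is applied:

	/* Mirrors the bnxt_re-style call quoted above */
	unsigned long pgsz = ib_umem_find_best_pgsz(umem, SZ_4K | SZ_2M, virt_addr);

	/*
	 * With PAGE_SHIFT == 16 (64K pages) the new mask clears the SZ_4K bit,
	 * so pgsz is SZ_2M when the VA range allows it, or 0 otherwise. Before
	 * the patch it could be SZ_4K, which then hit the rdma_for_each_block()
	 * boundary bug described in the commit message.
	 */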
Shiraz Saleem Sept. 3, 2020, 2:18 p.m. UTC | #7
> Subject: Re: [PATCH 02/14] RDMA/umem: Prevent small pages from being
> returned by ib_umem_find_best_pgsz()
> 
> On Thu, Sep 03, 2020 at 02:11:37PM +0000, Saleem, Shiraz wrote:
> > > Subject: [PATCH 02/14] RDMA/umem: Prevent small pages from being
> > > returned by
> > > ib_umem_find_best_pgsz()
> > >
> > > rdma_for_each_block() makes assumptions about how the SGL is
> > > constructed that don't work if the block size is below the page size used to
> build the SGL.
> > >
> > > The rules for umem SGL construction require that the SG's all be
> > > PAGE_SIZE aligned and we don't encode the actual byte offset of the
> > > VA range inside the SGL using offset and length. So
> > > rdma_for_each_block() has no idea where the actual starting/ending
> > > point is to compute the first/last block boundary if the starting address should
> be within a SGL.
> >
> > Not sure if we were exposed today. i.e. rdma drivers working with
> > block sizes smaller than system page size?
> 
> Yes, it could happen, here are some examples:
> 
> drivers/infiniband/hw/i40iw/i40iw_verbs.c:
>               iwmr->page_size = ib_umem_find_best_pgsz(region, SZ_4K | SZ_2M,
> 
> drivers/infiniband/hw/bnxt_re/ib_verbs.c:
>         page_shift = __ffs(ib_umem_find_best_pgsz(umem,
>                                 BNXT_RE_PAGE_SIZE_4K | BNXT_RE_PAGE_SIZE_2M,
>                                 virt_addr));
> 
> Eg that breaks on a ARM with 16k or 64k page sizes.
> 

Yes. Makes sense. Thanks for the patch!

Patch

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 120e98403c345d..7b5bc969e55630 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -151,6 +151,12 @@  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 	dma_addr_t mask;
 	int i;
 
+	/* rdma_for_each_block() has a bug if the page size is smaller than the
+	 * page size used to build the umem. For now prevent smaller page sizes
+	 * from being returned.
+	 */
+	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
+
 	/* At minimum, drivers must support PAGE_SIZE or smaller */
 	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
 		return 0;