Message ID | 20231115191752.266-2-shiraz.saleem@intel.com (mailing list archive) |
---|---|
State | Superseded |
Series | Fixes for 64K page size support |
On Wed, Nov 15, 2023 at 01:17:50PM -0600, Shiraz Saleem wrote:
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index f9ab671c8eda..07c571c7b699 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  		return page_size;
>  	}
>
> -	/* rdma_for_each_block() has a bug if the page size is smaller than the
> -	 * page size used to build the umem. For now prevent smaller page sizes
> -	 * from being returned.
> -	 */
> -	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> -
>  	/* The best result is the smallest page size that results in the minimum
>  	 * number of required pages. Compute the largest page size that could
>  	 * work based on VA address bits that don't change.
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 95896472a82b..e775d1b4910c 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
>  {
>  	__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
>  				umem->sgt_append.sgt.nents, pgsz);
> +	biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> +	biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
>  }
>
>  /**
> @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
>   */
>  #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
>  	for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> -	     __rdma_block_iter_next(biter);)
> +	     __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)

This sg_numblocks should be in the __rdma_block_iter_next()?

It makes sense to me.

Leon, we should be sure to check this on mlx5 also.

Thanks,
Jason
On 2023/11/16 3:17, Shiraz Saleem wrote:
> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>
> 64k pages introduce the situation in this diagram when the HCA

Only ARM64 architecture supports 64K page size?
Is it possible that x86_64 also supports 64K page size?

Zhu Yanjun

> 4k page size is being used:
>
> +-------------------------------------------+ <--- 64k aligned VA
> |                                           |
> |                HCA 4k page                |
> |                                           |
> +-------------------------------------------+
> |                     o                     |
> |                                           |
> |                     o                     |
> |                                           |
> |                     o                     |
> +-------------------------------------------+
> |                                           |
> |                HCA 4k page                |
> |                                           |
> +-------------------------------------------+ <--- Live HCA page
> |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
> |                                           | <--- VA
> |                  MR data                  |
> +-------------------------------------------+
> |                                           |
> |                HCA 4k page                |
> |                                           |
> +-------------------------------------------+
> |                     o                     |
> |                                           |
> |                     o                     |
> |                                           |
> |                     o                     |
> +-------------------------------------------+
> |                                           |
> |                HCA 4k page                |
> |                                           |
> +-------------------------------------------+
>
> The VA addresses coming from rdma-core in this diagram can
> be arbitrary, but for 64k pages, the VA may be offset by some
> number of HCA 4k pages and followed by some number of HCA 4k
> pages.
>
> The current iterator doesn't account for either the preceding
> 4k pages or the following 4k pages.
>
> Fix the issue by extending the ib_block_iter to contain
> the number of DMA pages like comment [1] says and
> by augmenting the macro limit test to downcount that value.
>
> This prevents the extra pages following the user MR data.
>
> Fix the preceding pages by using the __sg_advance field to start
> at the first 4k page containing MR data.
>
> This fix allows for the elimination of the small page crutch noted
> in the Fixes.
>
> Fixes: 10c75ccb54e4 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/rdma/ib_umem.h#n91 [1]
> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> ---
>  drivers/infiniband/core/umem.c | 6 ------
>  include/rdma/ib_umem.h         | 4 +++-
>  include/rdma/ib_verbs.h        | 1 +
>  3 files changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index f9ab671c8eda..07c571c7b699 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  		return page_size;
>  	}
>
> -	/* rdma_for_each_block() has a bug if the page size is smaller than the
> -	 * page size used to build the umem. For now prevent smaller page sizes
> -	 * from being returned.
> -	 */
> -	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> -
>  	/* The best result is the smallest page size that results in the minimum
>  	 * number of required pages. Compute the largest page size that could
>  	 * work based on VA address bits that don't change.
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 95896472a82b..e775d1b4910c 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
>  {
>  	__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
>  				umem->sgt_append.sgt.nents, pgsz);
> +	biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> +	biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
>  }
>
>  /**
> @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
>   */
>  #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
>  	for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> -	     __rdma_block_iter_next(biter);)
> +	     __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
>
>  #ifdef CONFIG_INFINIBAND_USER_MEM
>
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index fb1a2d6b1969..b7b6b58dd348 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2850,6 +2850,7 @@ struct ib_block_iter {
>  	/* internal states */
>  	struct scatterlist *__sg;	/* sg holding the current aligned block */
>  	dma_addr_t __dma_addr;		/* unaligned DMA address of this block */
> +	size_t __sg_numblocks;		/* ib_umem_num_dma_blocks() */
>  	unsigned int __sg_nents;	/* number of SG entries */
>  	unsigned int __sg_advance;	/* number of bytes to advance in sg in next step */
>  	unsigned int __pg_bit;		/* alignment of current block */
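To make the two new iterator fields concrete, here is a small worked example. It is an illustrative standalone sketch with made-up numbers (64K kernel page size, 4K HCA page size, MR starting 0x3100 bytes into the first 64K umem page and spanning 0x2000 bytes); it only mirrors the arithmetic behind ib_umem_offset() and ib_umem_num_dma_blocks(), it is not kernel code:

    #include <stdio.h>

    int main(void)
    {
            unsigned long pgsz = 4096;          /* HCA page size (4K) */
            unsigned long umem_offset = 0x3100; /* MR start within the first 64K umem page */
            unsigned long length = 0x2000;      /* MR length in bytes */

            /* __sg_advance: round the MR offset down to the HCA page size, so the
             * iterator starts at the first 4K block that actually holds MR data. */
            unsigned long sg_advance = umem_offset & ~(pgsz - 1);

            /* __sg_numblocks: number of 4K DMA blocks covering [offset, offset + length),
             * i.e. what ib_umem_num_dma_blocks() would report for this MR. */
            unsigned long numblocks =
                    ((umem_offset + length + pgsz - 1) / pgsz) - (umem_offset / pgsz);

            printf("__sg_advance   = 0x%lx\n", sg_advance); /* prints 0x3000 */
            printf("__sg_numblocks = %lu\n", numblocks);    /* prints 3 */
            return 0;
    }

In this example the unfixed iterator would have walked all sixteen 4K blocks of the single 64K umem page; the aligned starting offset plus the downcount restrict it to the three blocks that contain MR data.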
> > From: Mike Marciniszyn <mike.marciniszyn@intel.com>
> >
> > 64k pages introduce the situation in this diagram when the HCA
>
> Only ARM64 architecture supports 64K page size?

Arm supports multiple page_sizes. The problematic combination is when
the HCA needs a SMALLER page size than the PAGE_SIZE.

The kernel configuration can select from

> Is it possible that x86_64 also supports 64K page size?
>

x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.

Mike
>
> The kernel configuration can select from
>

... multiple page sizes.

Mike
On 2023/11/18 22:54, Marciniszyn, Mike wrote:
>>> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>>>
>>> 64k pages introduce the situation in this diagram when the HCA
>> Only ARM64 architecture supports 64K page size?
> Arm supports multiple page_sizes. The problematic combination is when
> the HCA needs a SMALLER page size than the PAGE_SIZE.
>
> The kernel configuration can select from

Got it. Thanks a lot. On the ARM architecture, kernel configuration options can be selected to enable multiple page sizes.

>
>> Is it possible that x86_64 also supports 64K page size?
>>
> x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.

On the x86_64 architecture, how can a non-4K page size (for example, 16K or 64K) be enabled?

Thanks a lot.

Zhu Yanjun

>
> Mike
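As an illustration of that kernel configuration choice: on arm64 the base page size is selected at build time, roughly as in the fragment below, while x86_64 has no equivalent option and always uses a 4K base page. The values shown are only one possible selection.

    # Illustrative arm64 .config fragment: build with a 64K base page size
    # (the other arm64 choices are 4K and 16K).
    CONFIG_ARM64_64K_PAGES=y
    # CONFIG_ARM64_4K_PAGES is not set
    # CONFIG_ARM64_16K_PAGES is not set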
On 2023/11/18 22:54, Marciniszyn, Mike wrote:
>>> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>>>
>>> 64k pages introduce the situation in this diagram when the HCA
>>
>> Only ARM64 architecture supports 64K page size?
>
> Arm supports multiple page_sizes. The problematic combination is when
> the HCA needs a SMALLER page size than the PAGE_SIZE.

Thanks a lot. Perhaps RXE also needs to handle this case, where "the HCA needs a SMALLER page size than the PAGE_SIZE", but I do not have such a test environment at hand. If a test environment could be set up on the x86_64 architecture, it would be very convenient for me to test and develop against.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

>
> The kernel configuration can select from
>
>> Is it possible that x86_64 also supports 64K page size?
>>
>
> x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.
>
> Mike
> Subject: Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE
> is greater then HCA pgsz
>
> On Wed, Nov 15, 2023 at 01:17:50PM -0600, Shiraz Saleem wrote:
> > diff --git a/drivers/infiniband/core/umem.c
> > b/drivers/infiniband/core/umem.c index f9ab671c8eda..07c571c7b699
> > 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct
> ib_umem *umem,
> > return page_size;
> > }
> >
> > - /* rdma_for_each_block() has a bug if the page size is smaller than the
> > - * page size used to build the umem. For now prevent smaller page sizes
> > - * from being returned.
> > - */
> > - pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > -
> > /* The best result is the smallest page size that results in the minimum
> > * number of required pages. Compute the largest page size that could
> > * work based on VA address bits that don't change.
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index
> > 95896472a82b..e775d1b4910c 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -77,6 +77,8 @@ static inline void
> > __rdma_umem_block_iter_start(struct ib_block_iter *biter, {
> > __rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
> > umem->sgt_append.sgt.nents, pgsz);
> > + biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> > + biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
> > }
> >
> > /**
> > @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct
> ib_block_iter *biter,
> > */
> > #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
> > for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> > - __rdma_block_iter_next(biter);)
> > + __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
>
> This sg_numblocks should be in the __rdma_block_iter_next()?
>
> It makes sense to me
>

The __rdma_block_iter_next() is common to two iterators: rdma_umem_for_each_dma_block() and rdma_for_each_block().

The patch makes adjustments to protect users of rdma_for_each_block(). We are working on a v2 to add a umem-specific next function that will implement the downcount.
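A rough sketch of that direction (hypothetical code, not the actual v2, assuming the v2 keeps the fields added by this patch): the downcount moves into a umem-only next helper so that the shared __rdma_block_iter_next() and rdma_for_each_block() are left untouched.

    /* Hypothetical sketch only -- the real v2 helper may differ. */
    static inline bool __rdma_umem_block_iter_next(struct ib_block_iter *biter)
    {
            return __rdma_block_iter_next(biter) && biter->__sg_numblocks--;
    }

    #define rdma_umem_for_each_dma_block(umem, biter, pgsz)                 \
            for (__rdma_umem_block_iter_start(biter, umem, pgsz);           \
                 __rdma_umem_block_iter_next(biter);)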
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f9ab671c8eda..07c571c7b699 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 		return page_size;
 	}
 
-	/* rdma_for_each_block() has a bug if the page size is smaller than the
-	 * page size used to build the umem. For now prevent smaller page sizes
-	 * from being returned.
-	 */
-	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
-
 	/* The best result is the smallest page size that results in the minimum
 	 * number of required pages. Compute the largest page size that could
 	 * work based on VA address bits that don't change.
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 95896472a82b..e775d1b4910c 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
 {
 	__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
 				umem->sgt_append.sgt.nents, pgsz);
+	biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
+	biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
 }
 
 /**
@@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
  */
 #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
 	for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
-	     __rdma_block_iter_next(biter);)
+	     __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
 
 #ifdef CONFIG_INFINIBAND_USER_MEM
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fb1a2d6b1969..b7b6b58dd348 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2850,6 +2850,7 @@ struct ib_block_iter {
 	/* internal states */
 	struct scatterlist *__sg;	/* sg holding the current aligned block */
 	dma_addr_t __dma_addr;		/* unaligned DMA address of this block */
+	size_t __sg_numblocks;		/* ib_umem_num_dma_blocks() */
 	unsigned int __sg_nents;	/* number of SG entries */
 	unsigned int __sg_advance;	/* number of bytes to advance in sg in next step */
 	unsigned int __pg_bit;		/* alignment of current block */
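For reference, a minimal hypothetical driver-side sketch (the function name and parameters are invented for illustration, not taken from this series) of how the fixed iterator is typically consumed. With this patch the loop starts at the first HCA-sized block containing MR data and stops after ib_umem_num_dma_blocks() blocks:

    #include <linux/errno.h>
    #include <rdma/ib_umem.h>
    #include <rdma/ib_verbs.h>

    /* Hypothetical sketch: fill a page-address array for the HCA. Assumes
     * "pas" has room for ib_umem_num_dma_blocks(umem, pgsz) entries, and that
     * "hca_pgsz_bitmap" and "virt" come from the device capabilities and the
     * MR's virtual address.
     */
    static int fill_hca_pages(struct ib_umem *umem, unsigned long hca_pgsz_bitmap,
                              u64 virt, u64 *pas)
    {
            struct ib_block_iter biter;
            unsigned long pgsz;

            pgsz = ib_umem_find_best_pgsz(umem, hca_pgsz_bitmap, virt);
            if (!pgsz)
                    return -EINVAL;

            /* With the fix, this yields exactly ib_umem_num_dma_blocks(umem, pgsz)
             * addresses, starting at the first pgsz block that holds MR data. */
            rdma_umem_for_each_dma_block(umem, &biter, pgsz)
                    *pas++ = rdma_block_iter_dma_address(&biter);

            return 0;
    }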