
lib/scatterlist: Provide a DMA page iterator

Message ID 20190104223531.GA1705@ziepe.ca (mailing list archive)
State New, archived
Series lib/scatterlist: Provide a DMA page iterator

Commit Message

Jason Gunthorpe Jan. 4, 2019, 10:35 p.m. UTC
Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
backing pages") introduced the sg_page_iter_dma_address() function without
providing a way to use it in the general case. If the sg_dma_len is not
equal to the dma_length callers cannot safely use the
for_each_sg_page/sg_page_iter_dma_address combination.

Resolve this API mistake by providing a DMA specific iterator,
for_each_sg_dma_page(), that uses the right length so
sg_page_iter_dma_address() works as expected with all sglists. A new
iterator type is introduced to provide compile-time safety against wrongly
mixing accessors and iterators.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h        | 26 ++++++++++--
 drivers/gpu/drm/vmwgfx/vmwgfx_mob.c        | 26 +++++++-----
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 42 +++++++++++++------
 drivers/media/pci/intel/ipu3/ipu3-cio2.c   |  4 +-
 include/linux/scatterlist.h                | 49 ++++++++++++++++++----
 lib/scatterlist.c                          | 26 ++++++++++++
 6 files changed, 134 insertions(+), 39 deletions(-)

I'd like to run this patch through the RDMA tree as we have another
series in the works that wants to use the for_each_sg_dma_page() API.

The changes to vmwgfx make me nervous, it would be great if someone
could test and ack them?

Changes since the RFC:
- Rework vmwgfx too [CH]
- Use a distinct type for the DMA page iterator [CH]
- Do not have a #ifdef [CH]

Thanks,
Jason
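
For reference, a minimal usage sketch of the new iterator (the function name
and the sg_table variable are illustrative only; sgt->nents is assumed to hold
the count returned by dma_map_sg()):

#include <linux/scatterlist.h>

/* Sketch only: walk the bus addresses of an already DMA-mapped sgl. */
static void demo_program_dma_pages(struct sg_table *sgt)
{
        struct sg_dma_page_iter dma_iter;

        for_each_sg_dma_page(sgt->sgl, &dma_iter, sgt->nents, 0) {
                dma_addr_t addr = sg_page_iter_dma_address(&dma_iter);

                /* hand "addr" to the device, e.g. write a page table entry */
        }
}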

Comments

Jason Gunthorpe Jan. 5, 2019, 3:21 a.m. UTC | #1
On Sat, Jan 05, 2019 at 10:37:17AM +0800, kbuild test robot wrote:
> Hi Jason,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.20 next-20190103]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Jason-Gunthorpe/lib-scatterlist-Provide-a-DMA-page-iterator/20190105-081739
> config: x86_64-randconfig-x017-201900 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All error/warnings (new ones prefixed by >>):
> 
>    In file included from lib/scatterlist.c:9:0:
> >> include/linux/export.h:81:20: error: redefinition of '__kstrtab___sg_page_iter_next'
>      static const char __kstrtab_##sym[]    \
>                        ^
>    include/linux/export.h:120:25: note: in expansion of macro '___EXPORT_SYMBOL'
>     #define __EXPORT_SYMBOL ___EXPORT_SYMBOL
>                             ^~~~~~~~~~~~~~~~
>    include/linux/export.h:124:2: note: in expansion of macro '__EXPORT_SYMBOL'
>      __EXPORT_SYMBOL(sym, "")
>      ^~~~~~~~~~~~~~~
> >> lib/scatterlist.c:652:1: note: in expansion of macro 'EXPORT_SYMBOL'
>     EXPORT_SYMBOL(__sg_page_iter_next);
>     ^~~~~~~~~~~~~

Whoops, that should be __sg_page_dma_iter_next. Will resend after getting
some feedback.

Jason
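
In other words, the duplicated export at the bottom of lib/scatterlist.c would
presumably end up as (name taken from the note above; the final spelling may
differ):

        EXPORT_SYMBOL(__sg_page_dma_iter_next);
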
Jason Gunthorpe Jan. 10, 2019, 11:42 p.m. UTC | #2
On Fri, Jan 04, 2019 at 03:35:31PM -0700, Jason Gunthorpe wrote:
> Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> backing pages") introduced the sg_page_iter_dma_address() function without
> providing a way to use it in the general case. If the sg_dma_len is not
> equal to the dma_length callers cannot safely use the
> for_each_sg_page/sg_page_iter_dma_address combination.
> 
> Resolve this API mistake by providing a DMA specific iterator,
> for_each_sg_dma_page(), that uses the right length so
> sg_page_iter_dma_address() works as expected with all sglists. A new
> iterator type is introduced to provide compile-time safety against wrongly
> mixing accessors and iterators.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>  drivers/gpu/drm/vmwgfx/vmwgfx_drv.h        | 26 ++++++++++--
>  drivers/gpu/drm/vmwgfx/vmwgfx_mob.c        | 26 +++++++-----
>  drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c | 42 +++++++++++++------
>  drivers/media/pci/intel/ipu3/ipu3-cio2.c   |  4 +-
>  include/linux/scatterlist.h                | 49 ++++++++++++++++++----
>  lib/scatterlist.c                          | 26 ++++++++++++
>  6 files changed, 134 insertions(+), 39 deletions(-)
> 
> I'd like to run this patch through the RDMA tree as we have another
> series in the works that wants to use the for_each_sg_dma_page() API.
> 
> The changes to vmwgfx make me nervous, it would be great if someone
> could test and ack them?
> 
> Changes since the RFC:
> - Rework vmwgfx too [CH]
> - Use a distinct type for the DMA page iterator [CH]
> - Do not have a #ifdef [CH]

ChristophH: Will you ack?

Are you still OK with the vmwgfx reworking, or should we go back to
the original version that didn't have the type safety so this driver
can be left broken?

Thanks,
Jason
Shiraz Saleem Jan. 12, 2019, 6:27 p.m. UTC | #3
On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> backing pages") introduced the sg_page_iter_dma_address() function without
> providing a way to use it in the general case. If the sg_dma_len is not
> equal to the dma_length callers cannot safely use the
> for_each_sg_page/sg_page_iter_dma_address combination.
> 
> Resolve this API mistake by providing a DMA specific iterator,
> for_each_sg_dma_page(), that uses the right length so
> sg_page_iter_dma_address() works as expected with all sglists. A new
> iterator type is introduced to provide compile-time safety against wrongly
> mixing accessors and iterators.
[..]

>  
> +/*
> + * sg page iterator for DMA addresses
> + *
> + * This is the same as sg_page_iter however you can call
> + * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
> + * address. sg_page_iter_page() cannot be called on this iterator.
> + */
Does it make sense to have a variant of sg_page_iter_page() to get the
page descriptor with this dma_iter? This can be used when walking DMA-mapped
SG lists with for_each_sg_dma_page.

> +struct sg_dma_page_iter {
> +	struct sg_page_iter base;
> +};
> +
>
Jason Gunthorpe Jan. 12, 2019, 6:37 p.m. UTC | #4
On Sat, Jan 12, 2019 at 12:27:05PM -0600, Shiraz Saleem wrote:
> On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> > Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> > backing pages") introduced the sg_page_iter_dma_address() function without
> > providing a way to use it in the general case. If the sg_dma_len is not
> > equal to the dma_length callers cannot safely use the
> > for_each_sg_page/sg_page_iter_dma_address combination.
> > 
> > Resolve this API mistake by providing a DMA specific iterator,
> > for_each_sg_dma_page(), that uses the right length so
> > sg_page_iter_dma_address() works as expected with all sglists. A new
> > iterator type is introduced to provide compile-time safety against wrongly
> > mixing accessors and iterators.
> [..]
> 
> >  
> > +/*
> > + * sg page iterator for DMA addresses
> > + *
> > + * This is the same as sg_page_iter however you can call
> > + * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
> > + * address. sg_page_iter_page() cannot be called on this iterator.
> > + */
> Does it make sense to have a variant of sg_page_iter_page() to get the
> page descriptor with this dma_iter? This can be used when walking DMA-mapped
> SG lists with for_each_sg_dma_page.

I think that would be a complicated calculation to find the right
offset into the page sg list to get the page pointer back. We can't
just naively use the existing iterator location.

Places that need this are probably better off running with two
iterators; that is less computationally expensive.

Did you find a need for this? 

Jason
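
A rough sketch of the two-iterator approach (assuming the
__sg_page_iter_start()/__sg_page_iter_next() helpers from scatterlist.h and
the __sg_page_dma_iter_next name mentioned in #1; the exact names may differ
in the final patch):

/* Sketch only: walk CPU pages and DMA addresses of the same mapped sgl
 * in lockstep, one sg_page_iter plus one sg_dma_page_iter.
 */
static void demo_walk_pages_and_dma(struct sg_table *sgt)
{
        struct sg_page_iter cpu_iter;
        struct sg_dma_page_iter dma_iter;

        __sg_page_iter_start(&cpu_iter, sgt->sgl, sgt->orig_nents, 0);
        __sg_page_iter_start(&dma_iter.base, sgt->sgl, sgt->nents, 0);

        while (__sg_page_iter_next(&cpu_iter) &&
               __sg_page_dma_iter_next(&dma_iter)) {
                struct page *page = sg_page_iter_page(&cpu_iter);
                dma_addr_t addr = sg_page_iter_dma_address(&dma_iter);

                /* use both the struct page and its bus address here */
        }
}
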
Shiraz Saleem Jan. 12, 2019, 7:03 p.m. UTC | #5
On Sat, Jan 12, 2019 at 06:37:58PM +0000, Jason Gunthorpe wrote:
> On Sat, Jan 12, 2019 at 12:27:05PM -0600, Shiraz Saleem wrote:
> > On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> > > Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> > > backing pages") introduced the sg_page_iter_dma_address() function without
> > > providing a way to use it in the general case. If the sg_dma_len is not
> > > equal to the dma_length callers cannot safely use the
> > > for_each_sg_page/sg_page_iter_dma_address combination.
> > > 
> > > Resolve this API mistake by providing a DMA specific iterator,
> > > for_each_sg_dma_page(), that uses the right length so
> > > sg_page_iter_dma_address() works as expected with all sglists. A new
> > > iterator type is introduced to provide compile-time safety against wrongly
> > > mixing accessors and iterators.
> > [..]
> > 
> > >  
> > > +/*
> > > + * sg page iterator for DMA addresses
> > > + *
> > > + * This is the same as sg_page_iter however you can call
> > > + * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
> > > + * address. sg_page_iter_page() cannot be called on this iterator.
> > > + */
> > Does it make sense to have a variant of sg_page_iter_page() to get the
> > page descriptor with this dma_iter? This can be used when walking DMA-mapped
> > SG lists with for_each_sg_dma_page.
> 
> I think that would be a complicated cacluation to find the right
> offset into the page sg list to get the page pointer back. We can't
> just naively use the existing iterator location.
> 
> Probably places that need this are better to run with two iterators,
> less computationally expensive.
> 
> Did you find a need for this? 
> 

Well, I was trying to convert the RDMA drivers to use your new iterator variant
and saw the need for it in locations where we need the virtual address of the
pages contained in the SGEs.

diff --git a/drivers/infiniband/hw/bnxt_re/qplib_res.c b/drivers/infiniband/hw/bnxt_re/qplib_res.c
index 59eeac5..7d26903 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_res.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_res.c
@@ -85,7 +85,7 @@ static void __free_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
 static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
                       struct scatterlist *sghead, u32 pages, u32 pg_size)
 {
-       struct scatterlist *sg;
+       struct sg_dma_page_iter sg_iter;
        bool is_umem = false;
        int i;

@@ -116,12 +116,13 @@ static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
        } else {
                i = 0;
                is_umem = true;
-               for_each_sg(sghead, sg, pages, i) {
-                       pbl->pg_map_arr[i] = sg_dma_address(sg);
-                       pbl->pg_arr[i] = sg_virt(sg);
+               for_each_sg_dma_page(sghead, &sg_iter, pages, 0) {
+                       pbl->pg_map_arr[i] = sg_page_iter_dma_address(&sg_iter);
+                       pbl->pg_arr[i] = page_address(sg_page_iter_page(&sg_iter.base)); ???
					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

                        if (!pbl->pg_arr[i])
                                goto fail;

+                       i++;
                        pbl->pg_count++;
                }
        }
--

@@ -191,16 +190,16 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
                goto err1;
        }

-       mem->page_shift         = umem->page_shift;
-       mem->page_mask          = BIT(umem->page_shift) - 1;
+       mem->page_shift         = PAGE_SHIFT;
+       mem->page_mask          = PAGE_SIZE - 1;

        num_buf                 = 0;
        map                     = mem->map;
        if (length > 0) {
                buf = map[0]->buf;

-               for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
-                       vaddr = page_address(sg_page(sg));
+               for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
+                       vaddr = page_address(sg_page_iter_page(&sg_iter.base));   ?????
				^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


                        if (!vaddr) {
                                pr_warn("null vaddr\n");
                                err = -ENOMEM;
@@ -208,7 +207,7 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
                        }

                        buf->addr = (uintptr_t)vaddr;
-                       buf->size = BIT(umem->page_shift);
+                       buf->size = PAGE_SIZE;

1.8.3.1
Christoph Hellwig Jan. 14, 2019, 9:46 a.m. UTC | #6
On Sat, Jan 12, 2019 at 01:03:05PM -0600, Shiraz Saleem wrote:
> Well I was trying convert the RDMA drivers to use your new iterator variant
> and saw the need for it in locations where we need virtual address of the pages
> contained in the SGEs.

As far as I can tell that pg_arr[i] value is only ever used for
the case where we do an explicit dma coherent allocation, so you
should not have to fill it out at all.
Christoph Hellwig Jan. 14, 2019, 9:48 a.m. UTC | #7
On Thu, Jan 10, 2019 at 04:42:18PM -0700, Jason Gunthorpe wrote:
> > Changes since the RFC:
> > - Rework vmwgfx too [CH]
> > - Use a distinct type for the DMA page iterator [CH]
> > - Do not have a #ifdef [CH]
> 
> ChristophH: Will you ack?

This looks generally fine.

> 
> Are you still OK with the vmwgfx reworking, or should we go back to
> the original version that didn't have the type safety so this driver
> can be left broken?

I think the map method in vmwgfx that just does virt_to_phys is
pretty broken.  Thomas, can you check if you see any performance
difference with just doing the proper dma mapping, as that gets the
driver out of interface abuse land?

While we're at it I think we need to merge my series in this area
for 5.0, because without that the driver is already broken.  Where
should we merge it?
Jason Gunthorpe Jan. 14, 2019, 10:16 p.m. UTC | #8
On Sat, Jan 12, 2019 at 01:03:05PM -0600, Shiraz Saleem wrote:
> On Sat, Jan 12, 2019 at 06:37:58PM +0000, Jason Gunthorpe wrote:
> > On Sat, Jan 12, 2019 at 12:27:05PM -0600, Shiraz Saleem wrote:
> > > On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> > > > Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> > > > backing pages") introduced the sg_page_iter_dma_address() function without
> > > > providing a way to use it in the general case. If the sg_dma_len is not
> > > > equal to the dma_length callers cannot safely use the
> > > > for_each_sg_page/sg_page_iter_dma_address combination.
> > > > 
> > > > Resolve this API mistake by providing a DMA specific iterator,
> > > > for_each_sg_dma_page(), that uses the right length so
> > > > sg_page_iter_dma_address() works as expected with all sglists. A new
> > > > iterator type is introduced to provide compile-time safety against wrongly
> > > > mixing accessors and iterators.
> > > [..]
> > > 
> > > >  
> > > > +/*
> > > > + * sg page iterator for DMA addresses
> > > > + *
> > > > + * This is the same as sg_page_iter however you can call
> > > > + * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
> > > > + * address. sg_page_iter_page() cannot be called on this iterator.
> > > > + */
> > > Does it make sense to have a variant of sg_page_iter_page() to get the
> > > page descriptor with this dma_iter? This can be used when walking DMA-mapped
> > > SG lists with for_each_sg_dma_page.
> > 
> > I think that would be a complicated cacluation to find the right
> > offset into the page sg list to get the page pointer back. We can't
> > just naively use the existing iterator location.
> > 
> > Probably places that need this are better to run with two iterators,
> > less computationally expensive.
> > 
> > Did you find a need for this? 
> > 
> 
> Well I was trying convert the RDMA drivers to use your new iterator variant
> and saw the need for it in locations where we need virtual address of the pages
> contained in the SGEs.
> 
> diff --git a/drivers/infiniband/hw/bnxt_re/qplib_res.c b/drivers/infiniband/hw/bnxt_re/qplib_res.c
> index 59eeac5..7d26903 100644
> --- a/drivers/infiniband/hw/bnxt_re/qplib_res.c
> +++ b/drivers/infiniband/hw/bnxt_re/qplib_res.c
> @@ -85,7 +85,7 @@ static void __free_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
>  static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
>                        struct scatterlist *sghead, u32 pages, u32 pg_size)
>  {
> -       struct scatterlist *sg;
> +       struct sg_dma_page_iter sg_iter;
>         bool is_umem = false;
>         int i;
> 
> @@ -116,12 +116,13 @@ static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
>         } else {
>                 i = 0;
>                 is_umem = true;
> -               for_each_sg(sghead, sg, pages, i) {
> -                       pbl->pg_map_arr[i] = sg_dma_address(sg);
> -                       pbl->pg_arr[i] = sg_virt(sg);
> +               for_each_sg_dma_page(sghead, &sg_iter, pages, 0) {
> +                       pbl->pg_map_arr[i] = sg_page_iter_dma_address(&sg_iter);
> +                       pbl->pg_arr[i] = page_address(sg_page_iter_page(&sg_iter.base)); ???
> 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I concur with CH; pg_arr only looks to be used in the !umem case, so set
it to NULL here (see the sketch at the end of this message). Check with
Selvin & Devesh?

> @@ -191,16 +190,16 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
>                 goto err1;
>         }
> 
> -       mem->page_shift         = umem->page_shift;
> -       mem->page_mask          = BIT(umem->page_shift) - 1;
> +       mem->page_shift         = PAGE_SHIFT;
> +       mem->page_mask          = PAGE_SIZE - 1;
> 
>         num_buf                 = 0;
>         map                     = mem->map;
>         if (length > 0) {
>                 buf = map[0]->buf;
> 
> -               for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> -                       vaddr = page_address(sg_page(sg));
> +               for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
> +                       vaddr = page_address(sg_page_iter_page(&sg_iter.base));   ?????
> 				^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

rxe doesn't use DMA addresses, so just leave it as for_each_sg_page

Jason
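
Putting CH's and Jason's suggestions together, the bnxt_re loop above would
presumably end up looking roughly like this (untested sketch; the
!pbl->pg_arr[i] check in the existing code would also need to go):

        struct sg_dma_page_iter sg_iter;
        int i = 0;

        for_each_sg_dma_page(sghead, &sg_iter, pages, 0) {
                pbl->pg_map_arr[i] = sg_page_iter_dma_address(&sg_iter);
                /* pg_arr is only consumed in the !umem case, leave it unset */
                pbl->pg_arr[i] = NULL;
                pbl->pg_count++;
                i++;
        }
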
Thomas Hellstrom Jan. 15, 2019, 2:17 p.m. UTC | #9
Hi, Christoph,

On Mon, 2019-01-14 at 10:48 +0100, Christoph Hellwig wrote:
> On Thu, Jan 10, 2019 at 04:42:18PM -0700, Jason Gunthorpe wrote:
> > > Changes since the RFC:
> > > - Rework vmwgfx too [CH]
> > > - Use a distinct type for the DMA page iterator [CH]
> > > - Do not have a #ifdef [CH]
> > 
> > ChristophH: Will you ack?
> 
> This looks generally fine.
> 
> > Are you still OK with the vmwgfx reworking, or should we go back to
> > the original version that didn't have the type safety so this
> > driver
> > can be left broken?
> 
> I think the map method in vmgfx that just does virt_to_phys is
> pretty broken.  Thomas, can you check if you see any performance
> difference with just doing the proper dma mapping, as that gets the
> driver out of interface abuse land?

The performance difference is not really the main problem here. The
problem is that even though we utilize the streaming DMA interface, we
use it only since we have to for DMA-Remapping and assume that the
memory is coherent. To be able to be as compliant as possible and ditch
the virt-to-phys mode, we *need* a DMA interface flag that tells us
when the dma_sync_for_xxx are no-ops. If they aren't we'll refuse to
load for now. I'm not sure, but I think also nouveau and radeon suffer
from the same issue.

> 
> While we're at it I think we need to merge my series in this area
> for 5.0, because without that the driver is already broken.  Where
> should we merge it?

I can merge it through vmwgfx/drm-fixes. There is an outstanding issue
with patch 3. Do you want me to fix that up?

Thanks,
Thomas
Christian König Jan. 15, 2019, 2:24 p.m. UTC | #10
Am 15.01.19 um 15:17 schrieb Thomas Hellstrom:
> Hi, Christoph,
>
> On Mon, 2019-01-14 at 10:48 +0100, Christoph Hellwig wrote:
>> On Thu, Jan 10, 2019 at 04:42:18PM -0700, Jason Gunthorpe wrote:
>>>> Changes since the RFC:
>>>> - Rework vmwgfx too [CH]
>>>> - Use a distinct type for the DMA page iterator [CH]
>>>> - Do not have a #ifdef [CH]
>>> ChristophH: Will you ack?
>> This looks generally fine.
>>
>>> Are you still OK with the vmwgfx reworking, or should we go back to
>>> the original version that didn't have the type safety so this
>>> driver
>>> can be left broken?
>> I think the map method in vmgfx that just does virt_to_phys is
>> pretty broken.  Thomas, can you check if you see any performance
>> difference with just doing the proper dma mapping, as that gets the
>> driver out of interface abuse land?
> The performance difference is not really the main problem here. The
> problem is that even though we utilize the streaming DMA interface, we
> use it only since we have to for DMA-Remapping and assume that the
> memory is coherent. To be able to be as compliant as possible and ditch
> the virt-to-phys mode, we *need* a DMA interface flag that tells us
> when the dma_sync_for_xxx are no-ops. If they aren't we'll refuse to
> load for now. I'm not sure, but I think also nouveau and radeon suffer
> from the same issue.

Yeah, indeed. Bounce buffers are an absolute no-go for GPUs.

If the DMA API finds that a piece of memory is not directly accessible 
by the GPU we need to return an error and not try to use bounce buffers 
behind the surface.

That is something which always annoyed me with the DMA API, which is 
otherwise rather cleanly defined.

Christian.

>
>> While we're at it I think we need to merge my series in this area
>> for 5.0, because without that the driver is already broken.  Where
>> should we merge it?
> I can merge it through vmwgfx/drm-fixes. There is an outstanding issue
> with patch 3. Do you want me to fix that up?
>
> Thanks,
> Thomas
Christoph Hellwig Jan. 15, 2019, 3:20 p.m. UTC | #11
On Tue, Jan 15, 2019 at 03:24:55PM +0100, Christian König wrote:
> Yeah, indeed. Bounce buffers are an absolute no-go for GPUs.
>
> If the DMA API finds that a piece of memory is not directly accessible by 
> the GPU we need to return an error and not try to use bounce buffers behind 
> the surface.
>
> That is something which always annoyed me with the DMA API, which is 
> otherwise rather cleanly defined.

That is exactly what I want to fix with my series to make
DMA_ATTR_NON_CONSISTENT more useful and always available:

https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031985.html

With that you allocate the memory using dma_alloc_attrs with
DMA_ATTR_NON_CONSISTENT, and use dma_sync_single_* to transfer
ownership to the cpu and back to the device, with a guarantee that
there won't be any bouncing.  So far the interest by the parties that
requested the feature has been rather lacklustre, though.
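
As a sketch, the flow described above would look roughly like this (dev, size
and the bidirectional direction are placeholders; error handling omitted):

        dma_addr_t dma_handle;
        void *cpu_addr;

        cpu_addr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL,
                                   DMA_ATTR_NON_CONSISTENT);

        /* CPU fills the buffer, then hands ownership to the device ... */
        dma_sync_single_for_device(dev, dma_handle, size, DMA_BIDIRECTIONAL);

        /* ... device DMA completes, take ownership back before reading. */
        dma_sync_single_for_cpu(dev, dma_handle, size, DMA_BIDIRECTIONAL);

        dma_free_attrs(dev, size, cpu_addr, dma_handle,
                       DMA_ATTR_NON_CONSISTENT);
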
Thomas Hellstrom Jan. 15, 2019, 6:03 p.m. UTC | #12
On Tue, 2019-01-15 at 16:20 +0100, hch@lst.de wrote:
> On Tue, Jan 15, 2019 at 03:24:55PM +0100, Christian König wrote:
> > Yeah, indeed. Bounce buffers are an absolute no-go for GPUs.
> > 
> > If the DMA API finds that a piece of memory is not directly
> > accessible by 
> > the GPU we need to return an error and not try to use bounce
> > buffers behind 
> > the surface.
> > 
> > That is something which always annoyed me with the DMA API, which
> > is 
> > otherwise rather cleanly defined.
> 
> That is exactly what I want to fix with my series to make
> DMA_ATTR_NON_CONSISTENT more useful and always available:
> 
> https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031985.html
> 
> With that you allocate the memory using dma_alloc_attrs with
> DMA_ATTR_NON_CONSISTENT, and use dma_sync_single_* to transfer
> ownership to the cpu and back to the device, with a gurantee that
> there won't be any bouncing.  So far the interest by the parties that
> requested the feature has been rather lacklustre, though.

In the graphics case, it's probably because it doesn't fit the graphics
use-cases:

1) Memory typically needs to be mappable by another device. (the "dma-
buf" interface)
2) DMA buffers are exported to user-space and are sub-allocated by it.
Mostly there are no GPU user-space kernel interfaces to sync / flush
subregions and these syncs may happen on a smaller-than-cache-line
granularity.

So to help the graphics driver, that coherent flag would be much more
useful.

/Thomas
Christoph Hellwig Jan. 15, 2019, 6:31 p.m. UTC | #13
On Tue, Jan 15, 2019 at 06:03:39PM +0000, Thomas Hellstrom wrote:
> In the graphics case, it's probably because it doesn't fit the graphics
> use-cases:
> 
> 1) Memory typically needs to be mappable by another device. (the "dma-
> buf" interface)

And there is nothing preventing dma-buf sharing of these buffers.
Unlike the get_sgtable mess it can actually work reliably on
architectures that have virtually tagged caches and/or don't
guarantee cache coherency with mixed attribute mappings.

> 2) DMA buffers are exported to user-space and is sub-allocated by it.
> Mostly there are no GPU user-space kernel interfaces to sync / flush
> subregions and these syncs may happen on a smaller-than-cache-line
> granularity.

I know of no architectures that can do cache maintenance on a less
than cache line basis.  Either the instructions require you to
specify cache lines, or they do sometimes more, sometimes less
intelligent rounding up.

Note that as long as dma non-coherent buffers are device owned it
is up to the device and the user space driver to take care of
coherency; the kernel very much is out of the picture.
Christian König Jan. 15, 2019, 7:13 p.m. UTC | #14
Am 15.01.19 um 19:31 schrieb hch@lst.de:
> On Tue, Jan 15, 2019 at 06:03:39PM +0000, Thomas Hellstrom wrote:
>> In the graphics case, it's probably because it doesn't fit the graphics
>> use-cases:
>>
>> 1) Memory typically needs to be mappable by another device. (the "dma-
>> buf" interface)
> And there is nothing preventing dma-buf sharing of these buffers.
> Unlike the get_sgtable mess it can actually work reliably on
> architectures that have virtually tagged caches and/or don't
> guarantee cache coherency with mixed attribute mappings.
>
>> 2) DMA buffers are exported to user-space and is sub-allocated by it.
>> Mostly there are no GPU user-space kernel interfaces to sync / flush
>> subregions and these syncs may happen on a smaller-than-cache-line
>> granularity.
> I know of no architectures that can do cache maintainance on a less
> than cache line basis.  Either the instructions require you to
> specifcy cache lines, or they do sometimes more, sometimes less
> intelligent rounding up.
>
> Note that as long dma non-coherent buffers are devices owned it
> is up to the device and the user space driver to take care of
> coherency, the kernel very much is out of the picture.

Thomas is correct that the interface you propose here doesn't work at 
all for GPUs.

The kernel driver is not informed of flush/sync, but rather just sets up
coherent mappings between system memory and devices.

In other words you have an array of struct pages and need to map that to 
a specific device and so create dma_addresses for the mappings.

Regards,
Christian.
Christoph Hellwig Jan. 15, 2019, 8:58 p.m. UTC | #15
On Tue, Jan 15, 2019 at 07:13:11PM +0000, Koenig, Christian wrote:
> Thomas is correct that the interface you propose here doesn't work at 
> all for GPUs.
> 
> The kernel driver is not informed of flush/sync, but rather just setups 
> coherent mappings between system memory and devices.
> 
> In other words you have an array of struct pages and need to map that to 
> a specific device and so create dma_addresses for the mappings.

If you want a coherent mapping you need to use dma_alloc_coherent
and dma_mmap_coherent and you are done, that is not the problem.
That actually is one of the vmwgfx modes, so I don't understand what
problem we are trying to solve if you don't actually want a non-coherent
mapping.  Although last time I had that discussion with Daniel Vetter
I was under the impression that GPUs really wanted non-coherent
mappings.

But if you want a coherent mapping you can't go to a struct page,
because on many systems you can't just map arbitrary memory as
uncachable.  It might either come from very special limited pools,
or might need other magic applied to it so that it is not visible
in the normal direct mapping, or at least not access through it.
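
For reference, the coherent path referred to here is roughly the following
(struct demo_buffer and the function names are made up for illustration):

struct demo_buffer {
        struct device *dev;
        void *cpu_addr;
        dma_addr_t dma_handle;
        size_t size;
};

/* Allocate memory that is coherent for both CPU and device. */
static int demo_buffer_alloc(struct demo_buffer *buf)
{
        buf->cpu_addr = dma_alloc_coherent(buf->dev, buf->size,
                                           &buf->dma_handle, GFP_KERNEL);
        return buf->cpu_addr ? 0 : -ENOMEM;
}

/* mmap handler body: expose the same coherent buffer to userspace. */
static int demo_buffer_mmap(struct demo_buffer *buf,
                            struct vm_area_struct *vma)
{
        return dma_mmap_coherent(buf->dev, vma, buf->cpu_addr,
                                 buf->dma_handle, buf->size);
}
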
Jason Gunthorpe Jan. 15, 2019, 9:25 p.m. UTC | #16
On Tue, Jan 15, 2019 at 02:17:26PM +0000, Thomas Hellstrom wrote:
> Hi, Christoph,
> 
> On Mon, 2019-01-14 at 10:48 +0100, Christoph Hellwig wrote:
> > On Thu, Jan 10, 2019 at 04:42:18PM -0700, Jason Gunthorpe wrote:
> > > > Changes since the RFC:
> > > > - Rework vmwgfx too [CH]
> > > > - Use a distinct type for the DMA page iterator [CH]
> > > > - Do not have a #ifdef [CH]
> > > 
> > > ChristophH: Will you ack?
> > 
> > This looks generally fine.
> > 
> > > Are you still OK with the vmwgfx reworking, or should we go back to
> > > the original version that didn't have the type safety so this
> > > driver
> > > can be left broken?
> > 
> > I think the map method in vmgfx that just does virt_to_phys is
> > pretty broken.  Thomas, can you check if you see any performance
> > difference with just doing the proper dma mapping, as that gets the
> > driver out of interface abuse land?
> 
> The performance difference is not really the main problem here. The
> problem is that even though we utilize the streaming DMA interface, we
> use it only since we have to for DMA-Remapping and assume that the
> memory is coherent. To be able to be as compliant as possible and ditch
> the virt-to-phys mode, we *need* a DMA interface flag that tells us
> when the dma_sync_for_xxx are no-ops. If they aren't we'll refuse to
> load for now. I'm not sure, but I think also nouveau and radeon suffer
> from the same issue.

RDMA needs something similar as well; in this case drivers take a
struct page * from get_user_pages() and need to have the DMA map fail
if the platform can't DMA map in a way that does not require any
additional DMA API calls to ensure coherence. (think Userspace RDMA
MR's)

Today we just do the normal DMA map, and when it randomly doesn't work
and corrupts data we tell those people their platforms don't support RDMA
- it would be nice to have a safer API-based solution.

Jason
Thomas Hellstrom Jan. 16, 2019, 7:09 a.m. UTC | #17
On Tue, 2019-01-15 at 21:58 +0100, hch@lst.de wrote:
> On Tue, Jan 15, 2019 at 07:13:11PM +0000, Koenig, Christian wrote:
> > Thomas is correct that the interface you propose here doesn't work
> > at 
> > all for GPUs.
> > 
> > The kernel driver is not informed of flush/sync, but rather just
> > setups 
> > coherent mappings between system memory and devices.
> > 
> > In other words you have an array of struct pages and need to map
> > that to 
> > a specific device and so create dma_addresses for the mappings.
> 
> If you want a coherent mapping you need to use dma_alloc_coherent
> and dma_mmap_coherent and you are done, that is not the problem.
> That actually is one of the vmgfx modes, so I don't understand what
> problem we are trying to solve if you don't actually want a non-
> coherent mapping. 

For vmwgfx, not making dma_alloc_coherent default has a couple of
reasons:
1) Memory is associated with a struct device. It has not been clear
that it is exportable to other devices.
2) There seem to be restrictions on the system pages allowable. GPUs
generally prefer highmem pages but dma_alloc_coherent returns a virtual
address implying GFP_KERNEL? While not used by vmwgfx, TTM typically
prefers HIGHMEM pages to facilitate caching mode switching without
having to touch the kernel map. 
3) Historically we had APIs to allow coherent access to user-space
defined pages. That has gone away now, but the infrastructure was built
around it.

dma_mmap_coherent isn't used because, as the data moves between system
memory, swap and VRAM, PTEs of user-space mappings are adjusted
accordingly, meaning user-space doesn't have to unmap when an operation
is initiated that might mean the data is moved.


> Although last time I had that discussion with Daniel Vetter
> I was under the impressions that GPUs really wanted non-coherent
> mappings.

Intel historically has done things a bit differently. And it's also
possible that embedded platforms and ARM prefer this mode of operation,
but I haven't caught up on that discussion.

> 
> But if you want a coherent mapping you can't go to a struct page,
> because on many systems you can't just map arbitrary memory as
> uncachable.  It might either come from very special limited pools,
> or might need other magic applied to it so that it is not visible
> in the normal direct mapping, or at least not access through it.


The TTM subsystem has been relied on to provide coherent memory with
the option to switch caching mode of pages. But only on selected and
well tested platforms. On other platforms we simply do not load, and
that's fine for now.

But as mentioned multiple times, to make GPU drivers more compliant,
we'd really want that

bool dma_streaming_is_coherent(const struct device *)

API to help us decide when to load or not.

Thanks,
Thomas
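
As a sketch, such a check would sit in driver probe/load, something like
this (dma_streaming_is_coherent() is the API being proposed here and does
not exist today):

        if (!dma_streaming_is_coherent(dev)) {
                dev_err(dev, "streaming DMA is not coherent, not loading\n");
                return -EINVAL;
        }
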
Christian König Jan. 16, 2019, 7:28 a.m. UTC | #18
Am 16.01.19 um 08:09 schrieb Thomas Hellstrom:
> On Tue, 2019-01-15 at 21:58 +0100, hch@lst.de wrote:
>> On Tue, Jan 15, 2019 at 07:13:11PM +0000, Koenig, Christian wrote:
>>> Thomas is correct that the interface you propose here doesn't work
>>> at
>>> all for GPUs.
>>>
>>> The kernel driver is not informed of flush/sync, but rather just
>>> setups
>>> coherent mappings between system memory and devices.
>>>
>>> In other words you have an array of struct pages and need to map
>>> that to
>>> a specific device and so create dma_addresses for the mappings.
>> If you want a coherent mapping you need to use dma_alloc_coherent
>> and dma_mmap_coherent and you are done, that is not the problem.
>> That actually is one of the vmgfx modes, so I don't understand what
>> problem we are trying to solve if you don't actually want a non-
>> coherent mapping.
> For vmwgfx, not making dma_alloc_coherent default has a couple of
> reasons:
> 1) Memory is associated with a struct device. It has not been clear
> that it is exportable to other devices.
> 2) There seems to be restrictions in the system pages allowable. GPUs
> generally prefer highmem pages but dma_alloc_coherent returns a virtual
> address implying GFP_KERNEL? While not used by vmwgfx, TTM typically
> prefers HIGHMEM pages to facilitate caching mode switching without
> having to touch the kernel map.
> 3) Historically we had APIs to allow coherent access to user-space
> defined pages. That has gone away not but the infrastructure was built
> around it.
>
> dma_mmap_coherent isn't use because as the data moves between system
> memory, swap and VRAM, PTEs of user-space mappings are adjusted
> accordingly, meaning user-space doesn't have to unmap when an operation
> is initiated that might mean the data is moved.

To summarize once more: We have an array of struct pages and want to 
coherently map that to a device.

If that is not possible because of whatever reason we want to get an 
error code or even not load the driver from the beginning.

>
>
>> Although last time I had that discussion with Daniel Vetter
>> I was under the impressions that GPUs really wanted non-coherent
>> mappings.
> Intel historically has done things a bit differently. And it's also
> possible that embedded platforms and ARM prefer this mode of operation,
> but I haven't caught up on that discussion.
>
>> But if you want a coherent mapping you can't go to a struct page,
>> because on many systems you can't just map arbitrary memory as
>> uncachable.  It might either come from very special limited pools,
>> or might need other magic applied to it so that it is not visible
>> in the normal direct mapping, or at least not access through it.
>
> The TTM subsystem has been relied on to provide coherent memory with
> the option to switch caching mode of pages. But only on selected and
> well tested platforms. On other platforms we simply do not load, and
> that's fine for now.
>
> But as mentioned multiple times, to make GPU drivers more compliant,
> we'd really want that
>
> bool dma_streaming_is_coherent(const struct device *)
>
> API to help us decide when to load or not.

Yes, please.

Christian.

>
> Thanks,
> Thomas
On Wed, Jan 16, 2019 at 07:28:13AM +0000, Koenig, Christian wrote:
> Am 16.01.19 um 08:09 schrieb Thomas Hellstrom:
> > On Tue, 2019-01-15 at 21:58 +0100, hch@lst.de wrote:
> >> On Tue, Jan 15, 2019 at 07:13:11PM +0000, Koenig, Christian wrote:
> >>> Thomas is correct that the interface you propose here doesn't work
> >>> at
> >>> all for GPUs.
> >>>
> >>> The kernel driver is not informed of flush/sync, but rather just
> >>> setups
> >>> coherent mappings between system memory and devices.
> >>>
> >>> In other words you have an array of struct pages and need to map
> >>> that to
> >>> a specific device and so create dma_addresses for the mappings.
> >> If you want a coherent mapping you need to use dma_alloc_coherent
> >> and dma_mmap_coherent and you are done, that is not the problem.
> >> That actually is one of the vmgfx modes, so I don't understand what
> >> problem we are trying to solve if you don't actually want a non-
> >> coherent mapping.
> > For vmwgfx, not making dma_alloc_coherent default has a couple of
> > reasons:
> > 1) Memory is associated with a struct device. It has not been clear
> > that it is exportable to other devices.
> > 2) There seems to be restrictions in the system pages allowable. GPUs
> > generally prefer highmem pages but dma_alloc_coherent returns a virtual
> > address implying GFP_KERNEL? While not used by vmwgfx, TTM typically
> > prefers HIGHMEM pages to facilitate caching mode switching without
> > having to touch the kernel map.
> > 3) Historically we had APIs to allow coherent access to user-space
> > defined pages. That has gone away not but the infrastructure was built
> > around it.
> >
> > dma_mmap_coherent isn't use because as the data moves between system
> > memory, swap and VRAM, PTEs of user-space mappings are adjusted
> > accordingly, meaning user-space doesn't have to unmap when an operation
> > is initiated that might mean the data is moved.
> 
> To summarize once more: We have an array of struct pages and want to 
> coherently map that to a device.
> 
> If that is not possible because of whatever reason we want to get an 
> error code or even not load the driver from the beginning.

I guess to make this work we'd also need information about how we're
allowed to mmap this on the cpu side, both from the kernel (kmap or vmap)
and for userspace. At least for i915 we use all kinds of combinations,
e.g. cpu mmap ptes as cached w/ coherent device transactions, or
cached+clflush on the cpu side, and non-coherent device transactions (the
no-snoop thing), or wc mode in the cpu ptes and non-coherent device
transactions.

Plus some debug mode so we catch abuse, because the reality is that most of
the gpu driver work happens on x86, where all of this just works. Even if
you do some really serious layering violations (which is why this isn't
that high a priority for gpu folks).
-Daniel

> 
> >
> >
> >> Although last time I had that discussion with Daniel Vetter
> >> I was under the impressions that GPUs really wanted non-coherent
> >> mappings.
> > Intel historically has done things a bit differently. And it's also
> > possible that embedded platforms and ARM prefer this mode of operation,
> > but I haven't caught up on that discussion.
> >
> >> But if you want a coherent mapping you can't go to a struct page,
> >> because on many systems you can't just map arbitrary memory as
> >> uncachable.  It might either come from very special limited pools,
> >> or might need other magic applied to it so that it is not visible
> >> in the normal direct mapping, or at least not access through it.
> >
> > The TTM subsystem has been relied on to provide coherent memory with
> > the option to switch caching mode of pages. But only on selected and
> > well tested platforms. On other platforms we simply do not load, and
> > that's fine for now.
> >
> > But as mentioned multiple times, to make GPU drivers more compliant,
> > we'd really want that
> >
> > bool dma_streaming_is_coherent(const struct device *)
> >
> > API to help us decide when to load or not.
> 
> Yes, please.
> 
> Christian.
> 
> >
> > Thanks,
> > Thomas
Christian König Jan. 16, 2019, 10:40 a.m. UTC | #20
Am 15.01.19 um 22:25 schrieb Jason Gunthorpe:
> On Tue, Jan 15, 2019 at 02:17:26PM +0000, Thomas Hellstrom wrote:
>> Hi, Christoph,
>>
>> On Mon, 2019-01-14 at 10:48 +0100, Christoph Hellwig wrote:
>>> On Thu, Jan 10, 2019 at 04:42:18PM -0700, Jason Gunthorpe wrote:
>>>>> Changes since the RFC:
>>>>> - Rework vmwgfx too [CH]
>>>>> - Use a distinct type for the DMA page iterator [CH]
>>>>> - Do not have a #ifdef [CH]
>>>> ChristophH: Will you ack?
>>> This looks generally fine.
>>>
>>>> Are you still OK with the vmwgfx reworking, or should we go back to
>>>> the original version that didn't have the type safety so this
>>>> driver
>>>> can be left broken?
>>> I think the map method in vmgfx that just does virt_to_phys is
>>> pretty broken.  Thomas, can you check if you see any performance
>>> difference with just doing the proper dma mapping, as that gets the
>>> driver out of interface abuse land?
>> The performance difference is not really the main problem here. The
>> problem is that even though we utilize the streaming DMA interface, we
>> use it only since we have to for DMA-Remapping and assume that the
>> memory is coherent. To be able to be as compliant as possible and ditch
>> the virt-to-phys mode, we *need* a DMA interface flag that tells us
>> when the dma_sync_for_xxx are no-ops. If they aren't we'll refuse to
>> load for now. I'm not sure, but I think also nouveau and radeon suffer
>> from the same issue.
> RDMA needs something similar as well, in this case drivers take a
> struct page * from get_user_pages() and need to have the DMA map fail
> if the platform can't DMA map in a way that does not require any
> additional DMA API calls to ensure coherence. (think Userspace RDMA
> MR's)
>
> Today we just do the normal DMA map and when it randomly doesn't work
> and corrupts data tell those people their platforms don't support RDMA
> - it would be nice to have a safer API base solution..

Oh, yes really good point. We have to support get_user_pages (or HMM) in 
a similar manner on GPUs as well.

Regards,
Christian.

>
> Jason
Christoph Hellwig Jan. 16, 2019, 4:06 p.m. UTC | #21
On Wed, Jan 16, 2019 at 07:28:13AM +0000, Koenig, Christian wrote:
> To summarize once more: We have an array of struct pages and want to 
> coherently map that to a device.

And the answer to that is very simple: you can't.  What is so hard
to understand about that?  If you want to map arbitrary memory it simply
can't be done in a coherent way on about half of our platforms.

> If that is not possible because of whatever reason we want to get an 
> error code or even not load the driver from the beginning.

That is a bullshit attitude.  Just like everyone else makes their
drivers work you should not be lazy.

> > bool dma_streaming_is_coherent(const struct device *)
> >
> > API to help us decide when to load or not.
> 
> Yes, please.

Hell no.
Christoph Hellwig Jan. 16, 2019, 4:11 p.m. UTC | #22
On Tue, Jan 15, 2019 at 02:25:01PM -0700, Jason Gunthorpe wrote:
> RDMA needs something similar as well, in this case drivers take a
> struct page * from get_user_pages() and need to have the DMA map fail
> if the platform can't DMA map in a way that does not require any
> additional DMA API calls to ensure coherence. (think Userspace RDMA
> MR's)

Any time you dma map pages you need to do further DMA API calls to
ensure coherence; that is the way it is implemented.  These calls
just happen to be no-ops sometimes.

> Today we just do the normal DMA map and when it randomly doesn't work
> and corrupts data tell those people their platforms don't support RDMA
> - it would be nice to have a safer API base solution..

Now that all these drivers are consolidated in rdma-core you can fix
the code to actually do the right thing.  It isn't that userspace DMA
coherence is any harder than in-kernel DMA coherence.  It just is
that no one bothered to do it properly.
Daniel Stone Jan. 16, 2019, 4:36 p.m. UTC | #23
On Wed, 16 Jan 2019 at 16:06, hch@lst.de <hch@lst.de> wrote:
> On Wed, Jan 16, 2019 at 07:28:13AM +0000, Koenig, Christian wrote:
> > To summarize once more: We have an array of struct pages and want to
> > coherently map that to a device.
>
> And the answer to that is very simple: you can't.  What is so hard
> to understand about?  If you want to map arbitrary memory it simply
> can't be done in a coherent way on about half of our platforms.
>
> > If that is not possible because of whatever reason we want to get an
> > error code or even not load the driver from the beginning.
>
> That is a bullshit attitude.  Just like everyone else makes their
> drivers work you should not be lazy.

Can you not talk to people like that? Even if you think that is an OK
way to treat anyone - which it isn't, certainly not on dri-devel@ with
the fd.o Code of Conduct, and not according to the kernel's either - I
have absolutely no idea how you can look at the work the AMD people
have put in over many years and conclude that they're 'lazy'.

If this makes you so angry, step back from the keyboard for a few
minutes, and if you still can't participate in reasonable discussion
like an adult, maybe step out of the thread entirely.
Jason Gunthorpe Jan. 16, 2019, 5:24 p.m. UTC | #24
On Wed, Jan 16, 2019 at 05:11:34PM +0100, hch@lst.de wrote:
> On Tue, Jan 15, 2019 at 02:25:01PM -0700, Jason Gunthorpe wrote:
> > RDMA needs something similar as well, in this case drivers take a
> > struct page * from get_user_pages() and need to have the DMA map fail
> > if the platform can't DMA map in a way that does not require any
> > additional DMA API calls to ensure coherence. (think Userspace RDMA
> > MR's)
> 
> Any time you dma map pages you need to do further DMA API calls to
> ensure coherent, that is the way it is implemented.  These calls
> just happen to be no-ops sometimes.
> 
> > Today we just do the normal DMA map and when it randomly doesn't work
> > and corrupts data tell those people their platforms don't support RDMA
> > - it would be nice to have a safer API base solution..
> 
> Now that all these drivers are consolidated in rdma-core you can fix
> the code to actually do the right thing.  It isn't that userspace DMA
> coherent is any harder than in-kernel DMA coherenence.  It just is
> that no one bothered to do it properly.

If I recall we actually can't.. libverbs presents an API to the user
that does not consider this possibility. 

ie consider post_recv - the driver has no idea what user buffers
received data and can't possibly flush them transparently. The user
would have to call some special DMA syncing API, which we don't have.

It is the same reason the kernel API makes the ULP handle dma sync,
not the driver.

The fact is there is 0 industry interest in using RDMA on platforms
that can't do HW DMA cache coherency - the kernel syscalls required to
do the cache flushing on the IO path would just destroy performance to
the point of making RDMA pointless. Better to use netdev on those
platforms.

VFIO is in a similar boat. Their user API can't handle cache syncing
either, so they would use the same API too.

.. and the GPU-compute systems (ie OpenCL/CUDA) are like verbs, they
were never designed with incoherent DMA in mind, and don't have the
API design to support it.

The reality is that *all* the subsytems doing DMA kernel bypass are
ignoring the DMA mapping rules, I think we should support this better,
and just accept that user space DMA will not be using syncing. Block
access in cases when this is required, otherwise let it work as is
today.

Jason
Shiraz Saleem Jan. 16, 2019, 5:32 p.m. UTC | #25
On Mon, Jan 14, 2019 at 03:16:54PM -0700, Jason Gunthorpe wrote:
> On Sat, Jan 12, 2019 at 01:03:05PM -0600, Shiraz Saleem wrote:
> > On Sat, Jan 12, 2019 at 06:37:58PM +0000, Jason Gunthorpe wrote:
> > > On Sat, Jan 12, 2019 at 12:27:05PM -0600, Shiraz Saleem wrote:
> > > > On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> > > > > Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> > > > > backing pages") introduced the sg_page_iter_dma_address() function without
> > > > > providing a way to use it in the general case. If the sg_dma_len is not
> > > > > equal to the dma_length callers cannot safely use the
> > > > > for_each_sg_page/sg_page_iter_dma_address combination.
> > > > > 
> > > > > Resolve this API mistake by providing a DMA specific iterator,
> > > > > for_each_sg_dma_page(), that uses the right length so
> > > > > sg_page_iter_dma_address() works as expected with all sglists. A new
> > > > > iterator type is introduced to provide compile-time safety against wrongly
> > > > > mixing accessors and iterators.
> > > > [..]
> > > > 
> > > > >  
> > > > > +/*
> > > > > + * sg page iterator for DMA addresses
> > > > > + *
> > > > > + * This is the same as sg_page_iter however you can call
> > > > > + * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
> > > > > + * address. sg_page_iter_page() cannot be called on this iterator.
> > > > > + */
> > > > Does it make sense to have a variant of sg_page_iter_page() to get the
> > > > page descriptor with this dma_iter? This can be used when walking DMA-mapped
> > > > SG lists with for_each_sg_dma_page.
> > > 
> > > I think that would be a complicated cacluation to find the right
> > > offset into the page sg list to get the page pointer back. We can't
> > > just naively use the existing iterator location.
> > > 
> > > Probably places that need this are better to run with two iterators,
> > > less computationally expensive.
> > > 
> > > Did you find a need for this? 
> > > 
> > 
> > Well I was trying convert the RDMA drivers to use your new iterator variant
> > and saw the need for it in locations where we need virtual address of the pages
> > contained in the SGEs.
> > 
> > diff --git a/drivers/infiniband/hw/bnxt_re/qplib_res.c b/drivers/infiniband/hw/bnxt_re/qplib_res.c
> > index 59eeac5..7d26903 100644
> > --- a/drivers/infiniband/hw/bnxt_re/qplib_res.c
> > +++ b/drivers/infiniband/hw/bnxt_re/qplib_res.c
> > @@ -85,7 +85,7 @@ static void __free_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
> >  static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
> >                        struct scatterlist *sghead, u32 pages, u32 pg_size)
> >  {
> > -       struct scatterlist *sg;
> > +       struct sg_dma_page_iter sg_iter;
> >         bool is_umem = false;
> >         int i;
> > 
> > @@ -116,12 +116,13 @@ static int __alloc_pbl(struct pci_dev *pdev, struct bnxt_qplib_pbl *pbl,
> >         } else {
> >                 i = 0;
> >                 is_umem = true;
> > -               for_each_sg(sghead, sg, pages, i) {
> > -                       pbl->pg_map_arr[i] = sg_dma_address(sg);
> > -                       pbl->pg_arr[i] = sg_virt(sg);
> > +               for_each_sg_dma_page(sghead, &sg_iter, pages, 0) {
> > +                       pbl->pg_map_arr[i] = sg_page_iter_dma_address(&sg_iter);
> > +                       pbl->pg_arr[i] = page_address(sg_page_iter_page(&sg_iter.base)); ???
> > 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I concur with CH, pg_arr only looks used in the !umem case, so set to
> NULL here. Check with Selvin & Devesh ?

That makes sense to me. Any concerns with that Devesh?
 
> > @@ -191,16 +190,16 @@ int rxe_mem_init_user(struct rxe_pd *pd, u64 start,
> >                 goto err1;
> >         }
> > 
> > -       mem->page_shift         = umem->page_shift;
> > -       mem->page_mask          = BIT(umem->page_shift) - 1;
> > +       mem->page_shift         = PAGE_SHIFT;
> > +       mem->page_mask          = PAGE_SIZE - 1;
> > 
> >         num_buf                 = 0;
> >         map                     = mem->map;
> >         if (length > 0) {
> >                 buf = map[0]->buf;
> > 
> > -               for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> > -                       vaddr = page_address(sg_page(sg));
> > +               for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
> > +                       vaddr = page_address(sg_page_iter_page(&sg_iter.base));   ?????
> > 				^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> rxe doesn't use DMA addreses, so just leave as for_each_sg_page
> 

OK.
Thomas Hellstrom Jan. 17, 2019, 10:47 a.m. UTC | #26
On Thu, 2019-01-17 at 10:30 +0100, hch@lst.de wrote:
> On Wed, Jan 16, 2019 at 10:24:36AM -0700, Jason Gunthorpe wrote:
> > The fact is there is 0 industry interest in using RDMA on platforms
> > that can't do HW DMA cache coherency - the kernel syscalls required
> > to
> > do the cache flushing on the IO path would just destroy performance
> > to
> > the point of making RDMA pointless. Better to use netdev on those
> > platforms.
> 
> In general there is no syscall required for doing cache flushing, you
> just issue the proper instructions directly from userspace.

But what if there are other coherence issues? Like bounce-buffers?
I'd like to +1 on what Jason says about industry interest: FWIW, vmwgfx
is probably one of the graphics drivers that would lend itself best to
do a fully-dma-interface compliant graphics stack experiment. But being
a paravirtual driver, all platforms we can ever run on are fully
coherent unless someone introduces a fake incoherency by forcing
swiotlb. Putting many man-months of effort into supporting systems we
would never run on and can never test on can never make more than
academic sense.

>  
> 
> > The reality is that *all* the subsytems doing DMA kernel bypass are
> > ignoring the DMA mapping rules, I think we should support this
> > better,
> > and just accept that user space DMA will not be using syncing.
> > Block
> > access in cases when this is required, otherwise let it work as is
> > today.
> 
> In that case we just need to block userspace DMA access entirely.
> Which given the amount of problems it creates sounds like a pretty
> good idea anyway.

I'm not sure I'm following you here. Are you suggesting scrapping
support for all (GP)GPU- and RDMA drivers?

Thanks,
Thomas
Jason Gunthorpe Jan. 17, 2019, 3:54 p.m. UTC | #27
On Thu, Jan 17, 2019 at 10:30:01AM +0100, hch@lst.de wrote:
> On Wed, Jan 16, 2019 at 10:24:36AM -0700, Jason Gunthorpe wrote:
> > The fact is there is 0 industry interest in using RDMA on platforms
> > that can't do HW DMA cache coherency - the kernel syscalls required to
> > do the cache flushing on the IO path would just destroy performance to
> > the point of making RDMA pointless. Better to use netdev on those
> > platforms.
> 
> In general there is no syscall required for doing cache flushing, you
> just issue the proper instructions directly from userspace. 

At least on the ARM/MIPS systems like this that I've worked with, the cache
manipulation instructions are privileged and cannot be executed by
userspace. So the general case requires a syscall.

> In that case we just need to block userspace DMA access entirely.
> Which given the amount of problems it creates sounds like a pretty
> good idea anyway.

I doubt there is any support for that idea...

Jason
Sakari Ailus Feb. 7, 2019, 11:23 p.m. UTC | #28
On Fri, Jan 04, 2019 at 10:35:43PM +0000, Jason Gunthorpe wrote:
> Commit 2db76d7c3c6d ("lib/scatterlist: sg_page_iter: support sg lists w/o
> backing pages") introduced the sg_page_iter_dma_address() function without
> providing a way to use it in the general case. If the sg_dma_len is not
> equal to the dma_length callers cannot safely use the
> for_each_sg_page/sg_page_iter_dma_address combination.
> 
> Resolve this API mistake by providing a DMA specific iterator,
> for_each_sg_dma_page(), that uses the right length so
> sg_page_iter_dma_address() works as expected with all sglists. A new
> iterator type is introduced to provide compile-time safety against wrongly
> mixing accessors and iterators.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> (ipu3-cio2)
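
As a quick orientation before the patch body, a hedged sketch of the
pattern the driver conversions below follow (fill_device_page_table,
dev_pt and max_entries are invented names, not part of the patch): walk
the DMA-mapped pages with the new iterator and record each page's bus
address in a device-visible table.

#include <linux/scatterlist.h>
#include <linux/errno.h>

static int fill_device_page_table(struct sg_table *sgt, u64 *dev_pt,
				  unsigned int max_entries)
{
	struct sg_dma_page_iter diter;
	unsigned int i = 0;

	/*
	 * nents is the count returned by dma_map_sg(); it may be smaller
	 * than orig_nents when the IOMMU coalesced entries.
	 */
	for_each_sg_dma_page(sgt->sgl, &diter, sgt->nents, 0) {
		if (i >= max_entries)
			return -ENOSPC;
		dev_pt[i++] = sg_page_iter_dma_address(&diter) >> PAGE_SHIFT;
	}

	return i;
}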
diff mbox series

Patch

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 59f614225bcd72..3c6d71e13a9342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -297,7 +297,10 @@  struct vmw_sg_table {
 struct vmw_piter {
 	struct page **pages;
 	const dma_addr_t *addrs;
-	struct sg_page_iter iter;
+	union {
+		struct sg_page_iter iter;
+		struct sg_dma_page_iter dma_iter;
+	};
 	unsigned long i;
 	unsigned long num_pages;
 	bool (*next)(struct vmw_piter *);
@@ -869,9 +872,24 @@  extern int vmw_bo_map_dma(struct ttm_buffer_object *bo);
 extern void vmw_bo_unmap_dma(struct ttm_buffer_object *bo);
 extern const struct vmw_sg_table *
 vmw_bo_sg_table(struct ttm_buffer_object *bo);
-extern void vmw_piter_start(struct vmw_piter *viter,
-			    const struct vmw_sg_table *vsgt,
-			    unsigned long p_offs);
+void _vmw_piter_start(struct vmw_piter *viter, const struct vmw_sg_table *vsgt,
+		      unsigned long p_offs, bool for_dma);
+
+/* Create a piter that can call vmw_piter_dma_addr() */
+static inline void vmw_piter_start(struct vmw_piter *viter,
+				   const struct vmw_sg_table *vsgt,
+				   unsigned long p_offs)
+{
+	_vmw_piter_start(viter, vsgt, p_offs, true);
+}
+
+/* Create a piter that can call vmw_piter_page() */
+static inline void vmw_piter_cpu_start(struct vmw_piter *viter,
+				   const struct vmw_sg_table *vsgt,
+				   unsigned long p_offs)
+{
+	_vmw_piter_start(viter, vsgt, p_offs, false);
+}
 
 /**
  * vmw_piter_next - Advance the iterator one page.
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c b/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c
index 7ed179d30ec51f..a13788017ad608 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_mob.c
@@ -503,7 +503,8 @@  static void vmw_mob_assign_ppn(u32 **addr, dma_addr_t val)
  */
 static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter,
 				      unsigned long num_data_pages,
-				      struct vmw_piter *pt_iter)
+				      struct vmw_piter *pt_iter_cpu,
+				      struct vmw_piter *pt_iter_dma)
 {
 	unsigned long pt_size = num_data_pages * VMW_PPN_SIZE;
 	unsigned long num_pt_pages = DIV_ROUND_UP(pt_size, PAGE_SIZE);
@@ -513,7 +514,7 @@  static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter,
 	struct page *page;
 
 	for (pt_page = 0; pt_page < num_pt_pages; ++pt_page) {
-		page = vmw_piter_page(pt_iter);
+		page = vmw_piter_page(pt_iter_cpu);
 
 		save_addr = addr = kmap_atomic(page);
 
@@ -525,7 +526,8 @@  static unsigned long vmw_mob_build_pt(struct vmw_piter *data_iter,
 			WARN_ON(!vmw_piter_next(data_iter));
 		}
 		kunmap_atomic(save_addr);
-		vmw_piter_next(pt_iter);
+		vmw_piter_next(pt_iter_cpu);
+		vmw_piter_next(pt_iter_dma);
 	}
 
 	return num_pt_pages;
@@ -547,29 +549,31 @@  static void vmw_mob_pt_setup(struct vmw_mob *mob,
 {
 	unsigned long num_pt_pages = 0;
 	struct ttm_buffer_object *bo = mob->pt_bo;
-	struct vmw_piter save_pt_iter;
-	struct vmw_piter pt_iter;
+	struct vmw_piter pt_iter_cpu, pt_iter_dma;
 	const struct vmw_sg_table *vsgt;
+	dma_addr_t root_page = 0;
 	int ret;
 
 	ret = ttm_bo_reserve(bo, false, true, NULL);
 	BUG_ON(ret != 0);
 
 	vsgt = vmw_bo_sg_table(bo);
-	vmw_piter_start(&pt_iter, vsgt, 0);
-	BUG_ON(!vmw_piter_next(&pt_iter));
+	vmw_piter_start(&pt_iter_dma, vsgt, 0);
+	vmw_piter_cpu_start(&pt_iter_cpu, vsgt, 0);
+	BUG_ON(!vmw_piter_next(&pt_iter_cpu));
+	BUG_ON(!vmw_piter_next(&pt_iter_dma));
 	mob->pt_level = 0;
 	while (likely(num_data_pages > 1)) {
 		++mob->pt_level;
 		BUG_ON(mob->pt_level > 2);
-		save_pt_iter = pt_iter;
+		root_page = vmw_piter_dma_addr(&pt_iter_dma);
 		num_pt_pages = vmw_mob_build_pt(&data_iter, num_data_pages,
-						&pt_iter);
-		data_iter = save_pt_iter;
+						&pt_iter_cpu, &pt_iter_dma);
+		vmw_piter_start(&data_iter, vsgt, 0);
 		num_data_pages = num_pt_pages;
 	}
 
-	mob->pt_root_page = vmw_piter_dma_addr(&save_pt_iter);
+	mob->pt_root_page = root_page;
 	ttm_bo_unreserve(bo);
 }
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index 31786b200afc47..db8f3e40a4facb 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -269,6 +269,11 @@  static bool __vmw_piter_sg_next(struct vmw_piter *viter)
 	return __sg_page_iter_next(&viter->iter);
 }
 
+static bool __vmw_piter_sg_dma_next(struct vmw_piter *viter)
+{
+	return __sg_page_iter_dma_next(&viter->dma_iter);
+}
+
 
 /**
  * Helper functions to return a pointer to the current page.
@@ -309,9 +314,9 @@  static dma_addr_t __vmw_piter_dma_addr(struct vmw_piter *viter)
 	return viter->addrs[viter->i];
 }
 
-static dma_addr_t __vmw_piter_sg_addr(struct vmw_piter *viter)
+static dma_addr_t __vmw_piter_sg_dma_addr(struct vmw_piter *viter)
 {
-	return sg_page_iter_dma_address(&viter->iter);
+	return sg_page_iter_dma_address(&viter->dma_iter);
 }
 
 
@@ -325,32 +330,43 @@  static dma_addr_t __vmw_piter_sg_addr(struct vmw_piter *viter)
  * the iterator doesn't point to a valid page after initialization; it has
  * to be advanced one step first.
  */
-void vmw_piter_start(struct vmw_piter *viter, const struct vmw_sg_table *vsgt,
-		     unsigned long p_offset)
+void _vmw_piter_start(struct vmw_piter *viter, const struct vmw_sg_table *vsgt,
+		      unsigned long p_offset, bool for_dma)
 {
 	viter->i = p_offset - 1;
 	viter->num_pages = vsgt->num_pages;
 	switch (vsgt->mode) {
 	case vmw_dma_phys:
 		viter->next = &__vmw_piter_non_sg_next;
-		viter->dma_address = &__vmw_piter_phys_addr;
-		viter->page = &__vmw_piter_non_sg_page;
+		if (for_dma)
+			viter->dma_address = &__vmw_piter_phys_addr;
+		else
+			viter->page = &__vmw_piter_non_sg_page;
 		viter->pages = vsgt->pages;
 		break;
 	case vmw_dma_alloc_coherent:
 		viter->next = &__vmw_piter_non_sg_next;
-		viter->dma_address = &__vmw_piter_dma_addr;
-		viter->page = &__vmw_piter_non_sg_page;
+		if (for_dma)
+			viter->dma_address = &__vmw_piter_dma_addr;
+		else
+			viter->page = &__vmw_piter_non_sg_page;
 		viter->addrs = vsgt->addrs;
 		viter->pages = vsgt->pages;
 		break;
 	case vmw_dma_map_populate:
 	case vmw_dma_map_bind:
-		viter->next = &__vmw_piter_sg_next;
-		viter->dma_address = &__vmw_piter_sg_addr;
-		viter->page = &__vmw_piter_sg_page;
-		__sg_page_iter_start(&viter->iter, vsgt->sgt->sgl,
-				     vsgt->sgt->orig_nents, p_offset);
+		if (for_dma) {
+			viter->next = &__vmw_piter_sg_dma_next;
+			viter->dma_address = &__vmw_piter_sg_dma_addr;
+			__sg_page_iter_start(&viter->dma_iter.base,
+					     vsgt->sgt->sgl,
+					     vsgt->sgt->orig_nents, p_offset);
+		} else {
+			viter->next = &__vmw_piter_sg_next;
+			viter->page = &__vmw_piter_sg_page;
+			__sg_page_iter_start(&viter->iter, vsgt->sgt->sgl,
+					     vsgt->sgt->orig_nents, p_offset);
+		}
 		break;
 	default:
 		BUG();
diff --git a/drivers/media/pci/intel/ipu3/ipu3-cio2.c b/drivers/media/pci/intel/ipu3/ipu3-cio2.c
index 447baaebca4486..32b6c6c217a46c 100644
--- a/drivers/media/pci/intel/ipu3/ipu3-cio2.c
+++ b/drivers/media/pci/intel/ipu3/ipu3-cio2.c
@@ -846,7 +846,7 @@  static int cio2_vb2_buf_init(struct vb2_buffer *vb)
 	unsigned int pages = DIV_ROUND_UP(vb->planes[0].length, CIO2_PAGE_SIZE);
 	unsigned int lops = DIV_ROUND_UP(pages + 1, entries_per_page);
 	struct sg_table *sg;
-	struct sg_page_iter sg_iter;
+	struct sg_dma_page_iter sg_iter;
 	int i, j;
 
 	if (lops <= 0 || lops > CIO2_MAX_LOPS) {
@@ -873,7 +873,7 @@  static int cio2_vb2_buf_init(struct vb2_buffer *vb)
 		b->offset = sg->sgl->offset;
 
 	i = j = 0;
-	for_each_sg_page(sg->sgl, &sg_iter, sg->nents, 0) {
+	for_each_sg_dma_page(sg->sgl, &sg_iter, sg->nents, 0) {
 		if (!pages--)
 			break;
 		b->lop[i][j] = sg_page_iter_dma_address(&sg_iter) >> PAGE_SHIFT;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 093aa57120b0cf..c0592284e18b97 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -339,12 +339,12 @@  int sg_alloc_table_chained(struct sg_table *table, int nents,
 /*
  * sg page iterator
  *
- * Iterates over sg entries page-by-page.  On each successful iteration,
- * you can call sg_page_iter_page(@piter) and sg_page_iter_dma_address(@piter)
- * to get the current page and its dma address. @piter->sg will point to the
- * sg holding this page and @piter->sg_pgoffset to the page's page offset
- * within the sg. The iteration will stop either when a maximum number of sg
- * entries was reached or a terminating sg (sg_last(sg) == true) was reached.
+ * Iterates over sg entries page-by-page.  On each successful iteration, you
+ * can call sg_page_iter_page(@piter) to get the current page. @piter->sg
+ * will point to the sg holding this page and @piter->sg_pgoffset to the
+ * page's page offset within the sg. The iteration will stop either when a
+ * maximum number of sg entries was reached or a terminating sg
+ * (sg_last(sg) == true) was reached.
  */
 struct sg_page_iter {
 	struct scatterlist	*sg;		/* sg holding the page */
@@ -356,7 +356,19 @@  struct sg_page_iter {
 						 * next step */
 };
 
+/*
+ * sg page iterator for DMA addresses
+ *
+ * This is the same as sg_page_iter, except that you can call
+ * sg_page_iter_dma_address(@dma_iter) to get the page's DMA
+ * address. sg_page_iter_page() cannot be called on this iterator.
+ */
+struct sg_dma_page_iter {
+	struct sg_page_iter base;
+};
+
 bool __sg_page_iter_next(struct sg_page_iter *piter);
+bool __sg_page_iter_dma_next(struct sg_dma_page_iter *dma_iter);
 void __sg_page_iter_start(struct sg_page_iter *piter,
 			  struct scatterlist *sglist, unsigned int nents,
 			  unsigned long pgoffset);
@@ -372,11 +384,13 @@  static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
 /**
  * sg_page_iter_dma_address - get the dma address of the current page held by
  * the page iterator.
- * @piter:	page iterator holding the page
+ * @dma_iter:	page iterator holding the page
  */
-static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *piter)
+static inline dma_addr_t
+sg_page_iter_dma_address(struct sg_dma_page_iter *dma_iter)
 {
-	return sg_dma_address(piter->sg) + (piter->sg_pgoffset << PAGE_SHIFT);
+	return sg_dma_address(dma_iter->base.sg) +
+	       (dma_iter->base.sg_pgoffset << PAGE_SHIFT);
 }
 
 /**
@@ -385,11 +399,28 @@  static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *piter)
  * @piter:	page iterator to hold current page, sg, sg_pgoffset
  * @nents:	maximum number of sg entries to iterate over
  * @pgoffset:	starting page offset
+ *
+ * Callers may use sg_page_iter_page() to get each page pointer.
  */
 #define for_each_sg_page(sglist, piter, nents, pgoffset)		   \
 	for (__sg_page_iter_start((piter), (sglist), (nents), (pgoffset)); \
 	     __sg_page_iter_next(piter);)
 
+/**
+ * for_each_sg_dma_page - iterate over the pages of the given sg list
+ * @sglist:	sglist to iterate over
+ * @dma_iter:	page iterator to hold current page
+ * @dma_nents:	maximum number of sg entries to iterate over; this is the
+ *              value returned from dma_map_sg
+ * @pgoffset:	starting page offset
+ *
+ * Callers may use sg_page_iter_dma_address() to get each page's DMA address.
+ */
+#define for_each_sg_dma_page(sglist, dma_iter, dma_nents, pgoffset)            \
+	for (__sg_page_iter_start(&(dma_iter)->base, sglist, dma_nents,        \
+				  pgoffset);                                   \
+	     __sg_page_iter_dma_next(dma_iter);)
+
 /*
  * Mapping sg iterator
  *
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 7c6096a7170486..716a751be67357 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -625,6 +625,32 @@  bool __sg_page_iter_next(struct sg_page_iter *piter)
 }
 EXPORT_SYMBOL(__sg_page_iter_next);
 
+static int sg_dma_page_count(struct scatterlist *sg)
+{
+	return PAGE_ALIGN(sg->offset + sg_dma_len(sg)) >> PAGE_SHIFT;
+}
+
+bool __sg_page_iter_dma_next(struct sg_dma_page_iter *dma_iter)
+{
+	struct sg_page_iter *piter = &dma_iter->base;
+
+	if (!piter->__nents || !piter->sg)
+		return false;
+
+	piter->sg_pgoffset += piter->__pg_advance;
+	piter->__pg_advance = 1;
+
+	while (piter->sg_pgoffset >= sg_dma_page_count(piter->sg)) {
+		piter->sg_pgoffset -= sg_dma_page_count(piter->sg);
+		piter->sg = sg_next(piter->sg);
+		if (!--piter->__nents || !piter->sg)
+			return false;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(__sg_page_iter_dma_next);
+
 /**
  * sg_miter_start - start mapping iteration over a sg list
  * @miter: sg mapping iter to be started