
[v3,1/1] drm: msm: Replace dma_map_sg with dma_sync_sg*

Message ID 20181129140315.28476-1-vivek.gautam@codeaurora.org (mailing list archive)
State Not Applicable, archived
Series [v3,1/1] drm: msm: Replace dma_map_sg with dma_sync_sg*

Commit Message

Vivek Gautam Nov. 29, 2018, 2:03 p.m. UTC
dma_map_sg() expects a DMA domain. However, the drm devices
have been traditionally using unmanaged iommu domain which
is non-dma type. Using dma mapping APIs with that domain is bad.

Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
to do the cache maintenance.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
Suggested-by: Tomasz Figa <tfiga@chromium.org>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jordan Crouse <jcrouse@codeaurora.org>
Cc: Sean Paul <seanpaul@chromium.org>
---

Changes since v2:
 - Addressed Tomasz's comment to keep DMA_BIDIRECTIONAL dma direction
   flag intact.
 - Updated comment for sg's dma-address assignment as per Tomasz'
   suggestion.

 drivers/gpu/drm/msm/msm_gem.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)
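
A minimal sketch of the kind of change the diffstat above describes, with hypothetical helper names rather than the actual msm_gem.c hunk. The sg table is assumed to be built by the driver and mapped into the GPU's own IOMMU domain through the iommu API, so the DMA calls are used only for CPU cache maintenance, which is exactly what the thread below takes issue with.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void example_get_pages(struct device *dev, struct sg_table *sgt)
{
	/* Before: dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	 * abused purely for its cache-flushing side effect.
	 */

	/* After: only perform the CPU cache maintenance. */
	dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}

static void example_put_pages(struct device *dev, struct sg_table *sgt)
{
	/* Before: dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL); */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}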

Comments

Christoph Hellwig Nov. 29, 2018, 2:14 p.m. UTC | #1
On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote:
> dma_map_sg() expects a DMA domain. However, the drm devices
> have been traditionally using unmanaged iommu domain which
> is non-dma type. Using dma mapping APIs with that domain is bad.
> 
> Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
> to do the cache maintenance.

As I told you before:  hell no.  If you spent the slightest amount of time
actually trying to understand what you are doing here you'd know this
can't work.  Just turn on dma debugging and this will blow up in your
face.

Either you use the DMA API properly, that is you use it to map and
to sync, or you don't use it at all.  Mix and match between iommu
APIs and DMA APIs is simply not possible.
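
For contrast, a sketch of the "map and sync" pairing Christoph insists on: the buffer is mapped through the DMA API, and the sync calls operate on that same mapping. Helper names are hypothetical and error handling is trimmed.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int example_map_buffer(struct device *dev, struct sg_table *sgt)
{
	/* The DMA API creates the mapping (and does the initial flush). */
	if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL))
		return -ENOMEM;
	return 0;
}

static void example_cpu_access(struct device *dev, struct sg_table *sgt)
{
	/* Hand the buffer to the CPU, touch it, hand it back to the device. */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	/* ... CPU reads/writes ... */
	dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}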
Rob Clark Nov. 29, 2018, 2:25 p.m. UTC | #2
On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote:
> > dma_map_sg() expects a DMA domain. However, the drm devices
> > have been traditionally using unmanaged iommu domain which
> > is non-dma type. Using dma mapping APIs with that domain is bad.
> >
> > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
> > to do the cache maintenance.
>
> As I told you before:  hell no.  If you spent the slightest amount of
> actually trying to understand what you are doing here you'd know this
> can't work.  Just turn on dma debugging and this will blow up in your
> face.

you can tone it down..  we weren't the ones who created the dma/iommu
mess, we are just trying to find a way to work around it

> Either you use the DMA API properly, that is you use it to map and
> to sync, or you don't use it at all.  Mix and match between iommu
> APIs and DMA APIs is simply not possible.

I'd *love* nothing more than to not use the dma api.. but on arm there is
no other way to do cache maint.

BR,
-R
Rob Clark Nov. 29, 2018, 2:42 p.m. UTC | #3
On Thu, Nov 29, 2018 at 9:25 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote:
> > > dma_map_sg() expects a DMA domain. However, the drm devices
> > > have been traditionally using unmanaged iommu domain which
> > > is non-dma type. Using dma mapping APIs with that domain is bad.
> > >
> > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
> > > to do the cache maintenance.
> >
> > As I told you before:  hell no.  If you spent the slightest amount of
> > actually trying to understand what you are doing here you'd know this
> > can't work.  Just turn on dma debugging and this will blow up in your
> > face.
>
> you can tone it down..  we weren't the ones who created the dma/iommu
> mess, we are just trying to find a way to work around it
>
> > Either you use the DMA API properly, that is you use it to map and
> > to sync, or you don't use it at all.  Mix and match between iommu
> > APIs and DMA APIs is simply not possible.
>
> I'd *love* nothing more to not use the dma api.. but on arm there is
> no other way to do cache maint.

btw, one of us will try this w/ dma debugging enabled..

Maybe the thing we need to do is just implement a blacklist of
compatible strings for devices which should skip the automatic
iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
preventing us from enabling per-process pagetables for a5xx (where we
need to control the domain/context-bank that is allocated by the dma
api).

Like I've said before, I'm open to other suggestions.  But the
automagic behind-the-scenes dma+iommu hookup really screwed us, and we
need to find some sort of solution instead of downstream hacks.

BR,
-R
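
A hedged sketch of the compatible-string blacklist Rob floats above; nothing like this exists in mainline of_dma_configure() at the time of the thread, and the entry shown is made up.

#include <linux/kernel.h>
#include <linux/of.h>

static const char * const dma_setup_blacklist[] = {
	"qcom,adreno",		/* hypothetical entry */
};

/* Would be called from of_dma_configure() before arch_setup_dma_ops(). */
static bool of_device_skips_dma_setup(struct device_node *np)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(dma_setup_blacklist); i++)
		if (of_device_is_compatible(np, dma_setup_blacklist[i]))
			return true;
	return false;
}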
Daniel Vetter Nov. 29, 2018, 2:43 p.m. UTC | #4
On Thu, Nov 29, 2018 at 3:26 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote:
> > > dma_map_sg() expects a DMA domain. However, the drm devices
> > > have been traditionally using unmanaged iommu domain which
> > > is non-dma type. Using dma mapping APIs with that domain is bad.
> > >
> > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
> > > to do the cache maintenance.
> >
> > As I told you before:  hell no.  If you spent the slightest amount of
> > actually trying to understand what you are doing here you'd know this
> > can't work.  Just turn on dma debugging and this will blow up in your
> > face.
>
> you can tone it down..  we weren't the ones who created the dma/iommu
> mess, we are just trying to find a way to work around it
>
> > Either you use the DMA API properly, that is you use it to map and
> > to sync, or you don't use it at all.  Mix and match between iommu
> > APIs and DMA APIs is simply not possible.
>
> I'd *love* nothing more to not use the dma api.. but on arm there is
> no other way to do cache maint.

Yeah we had patches to add manual cache management code to drm, so we
don't have to abuse the dma streaming api anymore. Got shouted down.
Abusing the dma streaming api also gets shouted down. It's a gpu, any
idea of these drivers actually being platform independent is out of
the window from the start anyway, so we're ok with tying this to
platforms.
-Daniel
Christoph Hellwig Nov. 29, 2018, 3:53 p.m. UTC | #5
On Thu, Nov 29, 2018 at 09:25:43AM -0500, Rob Clark wrote:
> > As I told you before:  hell no.  If you spent the slightest amount of
> > actually trying to understand what you are doing here you'd know this
> > can't work.  Just turn on dma debugging and this will blow up in your
> > face.
> 
> you can tone it down..  we weren't the ones who created the dma/iommu
> mess, we are just trying to find a way to work around it

But you don't listen and take someone being busy for a day as approval,
which is absolutely not the case.  Please wait for a day or two for
people to respond instead of forcing already NACKed patches again in
an act of passive aggression.
Christoph Hellwig Nov. 29, 2018, 3:54 p.m. UTC | #6
On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> Maybe the thing we need to do is just implement a blacklist of
> compatible strings for devices which should skip the automatic
> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> preventing us from enabling per-process pagetables for a5xx (where we
> need to control the domain/context-bank that is allocated by the dma
> api).

You can detach from the dma map attachment using arm_iommu_detach_device,
which a few drm drivers do, but I don't think this is the problem.
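
For reference, a sketch of the arm_iommu_detach_device() route (32-bit ARM only), roughly what the drm drivers Christoph mentions do in their probe/bind path; the helper name here is made up.

#include <linux/device.h>
#include <asm/dma-iommu.h>

static void example_release_default_mapping(struct device *dev)
{
	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);

	if (!mapping)
		return;

	/* Detach from the mapping set up automatically at probe time ... */
	arm_iommu_detach_device(dev);
	/* ... and drop it, so the driver can attach its own iommu domain. */
	arm_iommu_release_mapping(mapping);
}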
Christoph Hellwig Nov. 29, 2018, 3:57 p.m. UTC | #7
On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote:
> Yeah we had patches to add manual cache management code to drm, so we
> don't have to abuse the dma streaming api anymore. Got shouted down.
> Abusing the dma streaming api also gets shouted down. It's a gpu, any
> idea of these drivers actually being platform independent is out of
> the window from the start anyway, so we're ok with tying this to
> platforms.

Manual or not, the iommu API is missing APIs for cache management,
which makes it kinda surprising it actually ever worked for non-coherent
devices.

And fortunately while some people spent the last year or two bickering
about the situation others actually did work, and we now have a
generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal
API.  This is only used for DMA API internals so far, and explicitly
not intended for direct driver use, but it would be perfect as the
backend for iommu API cache maintenance functions.  It exists on all
but two architectures on mainline.  Out of those, powerpc is in the works;
only arm32 will need some major help.
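
A purely hypothetical sketch of what such an iommu-level cache maintenance helper could look like on top of the kernel-internal hooks Christoph mentions; no such iommu API exists at the time of this thread, and the 2018-era arch_sync_dma_* signatures (which still take a struct device) are assumed.

#include <linux/dma-noncoherent.h>	/* kernel-internal arch_sync_dma_* hooks */
#include <linux/scatterlist.h>

static void example_iommu_sync_sg_for_device(struct device *dev,
					     struct scatterlist *sgl,
					     int nents,
					     enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	/* Flush each physical segment; the IOVA side is left untouched. */
	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device(dev, sg_phys(sg), sg->length, dir);
}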
Daniel Vetter Nov. 29, 2018, 4:28 p.m. UTC | #8
On Thu, Nov 29, 2018 at 04:57:58PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote:
> > Yeah we had patches to add manual cache management code to drm, so we
> > don't have to abuse the dma streaming api anymore. Got shouted down.
> > Abusing the dma streaming api also gets shouted down. It's a gpu, any
> > idea of these drivers actually being platform independent is out of
> > the window from the start anyway, so we're ok with tying this to
> > platforms.
> 
> Manual or not the iommu API is missing APIs for cache management,
> which makes it kinda surprising it actually ever worked for non-coherent
> devices.
> 
> And fortunately while some people spent the last year ot two bickering
> about the situation others actually did work, and we now have a
> generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal
> API.  This is only used for DMA API internals so far, and explicitly
> not intended for direct driver use, but it would be perfect as the
> backend for iommu API cache maintainance functions.  It exists on all
> but two architectures on mainline.  Out of those powerpc is in the works,
> only arm32 will need some major help.

Oh, this sounds neat. At least some massive progress.

Just spent a bit of time reading through the implementations already
merged. Is the struct device *dev parameter actually needed anywhere?
dma-api definitely needs it, because we need that to pick the right iommu.
But for cache management from what I've seen the target device doesn't
matter, all the target specific stuff will be handled by the iommu.

Dropping the dev parameter would make this a perfect fit for coherency
management of buffers used by multiple devices. Right now there's all
kinds of nasty tricks for that use case needed to avoid redundant
flushes.
-Daniel
Christoph Hellwig Nov. 29, 2018, 4:57 p.m. UTC | #9
On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> Just spend a bit of time reading through the implementations already
> merged. Is the struct device *dev parameter actually needed anywhere?
> dma-api definitely needs it, because we need that to pick the right iommu.
> But for cache management from what I've seen the target device doesn't
> matter, all the target specific stuff will be handled by the iommu.

It looks like only mips ever uses the dev argument, and even there
the function it passes dev to ignores it.  So it could probably be removed.

> 
> Dropping the dev parameter would make this a perfect fit for coherency
> management of buffers used by multiple devices. Right now there's all
> kinds of nasty tricks for that use cases needed to avoid redundant
> flushes.

Note that one thing I'd like to avoid is exposing these functions directly
to drivers, as that will get us into all kinds of abuses.

So I'd much prefer if we could have iommu APIs wrapping these that are
specific to actual use cases that we understand well.

As for the buffer sharing: at least for the DMA API side I want to
move the current buffer sharing users away from dma_alloc_coherent
(and coherent dma_alloc_attrs users) and the remapping done in there
required for non-coherent architectures.  Instead I'd like to allocate
plain old pages, and then just dma map them for each device separately,
with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
or last user to unmap.  On the iommu side it could probably work
similar.

I have done some preliminary work on this, and want to get it into this
merge window, but there is a few other bits I need to sort out first.

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
---end quoted text---
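
A sketch of the buffer-sharing scheme Christoph describes above: the same pages are mapped for each device, with the CPU cache maintenance done only by the first map (and, symmetrically, the last unmap). The helper is hypothetical and assumes both devices use the same DMA direction.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int example_map_for_two_devices(struct device *first,
				       struct device *second,
				       struct sg_table *sgt)
{
	/* First user: regular map, including the CPU cache maintenance. */
	if (!dma_map_sg_attrs(first, sgt->sgl, sgt->nents,
			      DMA_BIDIRECTIONAL, 0))
		return -ENOMEM;

	/* Every further user: create the mapping, skip the redundant flush. */
	if (!dma_map_sg_attrs(second, sgt->sgl, sgt->nents,
			      DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC)) {
		/* Back out; this unmap is now the last one, so let it sync. */
		dma_unmap_sg(first, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
		return -ENOMEM;
	}
	return 0;
}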
Daniel Vetter Nov. 29, 2018, 5:09 p.m. UTC | #10
On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> > Just spend a bit of time reading through the implementations already
> > merged. Is the struct device *dev parameter actually needed anywhere?
> > dma-api definitely needs it, because we need that to pick the right iommu.
> > But for cache management from what I've seen the target device doesn't
> > matter, all the target specific stuff will be handled by the iommu.
>
> It looks like only mips every uses the dev argument, and even there
> the function it passes dev to ignores it.  So it could probably be removed.
>
> >
> > Dropping the dev parameter would make this a perfect fit for coherency
> > management of buffers used by multiple devices. Right now there's all
> > kinds of nasty tricks for that use cases needed to avoid redundant
> > flushes.
>
> Note that one thing I'd like to avoid is exposing these funtions directly
> to drivers, as that will get us into all kinds of abuses.

What kind of abuse do you expect? It could very well be that gpu folks
call that "standard use case" ... At least on x86 with the i915 driver
we pretty much rely on architectural guarantees for how cache flushes
work very much. Down to userspace doing the cache flushing for
mappings the kernel has set up.

> So I'd much prefer if we could have iommu APIs wrapping these that are
> specific to actual uses cases that we understand well.
>
> As for the buffer sharing: at least for the DMA API side I want to
> move the current buffer sharing users away from dma_alloc_coherent
> (and coherent dma_alloc_attrs users) and the remapping done in there
> required for non-coherent architectures.  Instead I'd like to allocate
> plain old pages, and then just dma map them for each device separately,
> with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> or last user to unmap.  On the iommu side it could probably work
> similar.

I think this is what's often done. Except then there's also the issue
of how to get at the cma allocator if your device needs something
contiguous. There's a lot of that still going on in graphics/media.
-Daniel

> I have done some preliminary work on this, and want to get it into this
> merge window, but there is a few other bits I need to sort out first.
>
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> ---end quoted text---
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Tomasz Figa Nov. 29, 2018, 5:24 p.m. UTC | #11
[CC Marek]

On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> > > Just spend a bit of time reading through the implementations already
> > > merged. Is the struct device *dev parameter actually needed anywhere?
> > > dma-api definitely needs it, because we need that to pick the right iommu.
> > > But for cache management from what I've seen the target device doesn't
> > > matter, all the target specific stuff will be handled by the iommu.
> >
> > It looks like only mips every uses the dev argument, and even there
> > the function it passes dev to ignores it.  So it could probably be removed.
> >

Whether the cache maintenance operation needs to actually do anything
or not is a function of `dev`. We can have some devices that are
coherent with CPU caches, and some that are not, on the same system.

> > >
> > > Dropping the dev parameter would make this a perfect fit for coherency
> > > management of buffers used by multiple devices. Right now there's all
> > > kinds of nasty tricks for that use cases needed to avoid redundant
> > > flushes.
> >
> > Note that one thing I'd like to avoid is exposing these funtions directly
> > to drivers, as that will get us into all kinds of abuses.
>
> What kind of abuse do you expect? It could very well be that gpu folks
> call that "standard use case" ... At least on x86 with the i915 driver
> we pretty much rely on architectural guarantees for how cache flushes
> work very much. Down to userspace doing the cache flushing for
> mappings the kernel has set up.

i915 is a very specific case of a fully contained,
architecture-specific hardware subsystem, where you can just hardcode
all integration details inside the driver, because nobody else would
care.

In ARM world, you can have the same IP blocks licensed by multiple SoC
vendors with different integration details and that often includes the
option of coherency.

>
> > So I'd much prefer if we could have iommu APIs wrapping these that are
> > specific to actual uses cases that we understand well.
> >
> > As for the buffer sharing: at least for the DMA API side I want to
> > move the current buffer sharing users away from dma_alloc_coherent
> > (and coherent dma_alloc_attrs users) and the remapping done in there
> > required for non-coherent architectures.  Instead I'd like to allocate
> > plain old pages, and then just dma map them for each device separately,
> > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > or last user to unmap.  On the iommu side it could probably work
> > similar.
>
> I think this is what's often done. Except then there's also the issue
> of how to get at the cma allocator if your device needs something
> contiguous. There's a lot of that still going on in graphics/media.

I suppose one could just expose CMA with the default pool directly.
It's just an allocator, so not sure why it would need any
device-specific information.

There is also the use case of using CMA with device-specific pools of
memory reusable by the system when not used by the device and those
would have to somehow get the pool to allocate from, but I wonder if
struct device is the right way to pass such information. I'd see the
pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
and perhaps a wrapper cma_alloc_default(size, flags) that is just a
simple macro calling cma_alloc(&cma_pool_default, size, flags).

Best regards,
Tomasz
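
The shape of the cma_alloc() interface Tomasz sketches in the message above, spelled out for illustration only; this is not the existing cma_alloc() (which takes a struct cma pointer, a page count, an alignment and a no_warn flag), and the return type is an assumption.

#include <linux/types.h>

struct page;
struct cma_pool;
extern struct cma_pool cma_pool_default;

/* Proposed: the allocation is explicit about which pool it draws from. */
struct page *cma_alloc(struct cma_pool *pool, size_t size, gfp_t flags);

/* Convenience wrapper for the system-wide default pool. */
#define cma_alloc_default(size, flags) \
	cma_alloc(&cma_pool_default, (size), (flags))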
Brian Starkey Nov. 29, 2018, 5:33 p.m. UTC | #12
Hi Christoph,

On Thu, Nov 29, 2018 at 05:57:15PM +0100, Christoph Hellwig wrote:
> 
> As for the buffer sharing: at least for the DMA API side I want to
> move the current buffer sharing users away from dma_alloc_coherent
> (and coherent dma_alloc_attrs users) and the remapping done in there
> required for non-coherent architectures.  Instead I'd like to allocate
> plain old pages, and then just dma map them for each device separately,
> with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> or last user to unmap.  On the iommu side it could probably work
> similar.
> 
> I have done some preliminary work on this, and want to get it into this
> merge window, but there is a few other bits I need to sort out first.
> 

This sounds very useful for ion, to avoid CPU cache maintenance as
long as the buffer stays in device-land.

One question though: How would you determine "the last user to unmap"
to know when to do the final "make visible to CPU" step?

Thanks,
-Brian
Christoph Hellwig Nov. 29, 2018, 6:33 p.m. UTC | #13
On Thu, Nov 29, 2018 at 06:09:05PM +0100, Daniel Vetter wrote:
> What kind of abuse do you expect? It could very well be that gpu folks
> call that "standard use case" ... At least on x86 with the i915 driver
> we pretty much rely on architectural guarantees for how cache flushes
> work very much. Down to userspace doing the cache flushing for
> mappings the kernel has set up.

Mostly the usual bypasses of the DMA API because people know better
(and with that I don't mean low-level IOMMU API users, but "creative"
direct mappings).

> > As for the buffer sharing: at least for the DMA API side I want to
> > move the current buffer sharing users away from dma_alloc_coherent
> > (and coherent dma_alloc_attrs users) and the remapping done in there
> > required for non-coherent architectures.  Instead I'd like to allocate
> > plain old pages, and then just dma map them for each device separately,
> > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > or last user to unmap.  On the iommu side it could probably work
> > similar.
> 
> I think this is what's often done. Except then there's also the issue
> of how to get at the cma allocator if your device needs something
> contiguous. There's a lot of that still going on in graphics/media.

Being able to dip into CMA and maybe iommu coalescing if we want to
get fancy is indeed the only reason for this API.  If we just wanted
to map pages we could already do that now with just a little bit
of boilerplate code (and quite a few drivers do - just adding this
new API will remove tons of code).
Christoph Hellwig Nov. 29, 2018, 6:35 p.m. UTC | #14
On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote:
> Whether the cache maintenance operation needs to actually do anything
> or not is a function of `dev`. We can have some devices that are
> coherent with CPU caches, and some that are not, on the same system.

Yes, but that part is not decided by these low-level helpers but their
callers in the DMA code (or maybe IOMMU code as well in the future).

> There is also the use case of using CMA with device-specific pools of
> memory reusable by the system when not used by the device and those
> would have to somehow get the pool to allocate from, but I wonder if
> struct device is the right way to pass such information. I'd see the
> pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
> and perhaps a wrapper cma_alloc_default(size, flags) that is just a
> simple macro calling cma_alloc(&cma_pool_default, size, flags).

Yes, the cma APIs have quite a few warts that need addressing.
I have a few of those things on my todo list, but help is always
welcome.
Christoph Hellwig Nov. 29, 2018, 6:35 p.m. UTC | #15
On Thu, Nov 29, 2018 at 05:33:03PM +0000, Brian Starkey wrote:
> This sounds very useful for ion, to avoid CPU cache maintenance as
> long as the buffer stays in device-land.
> 
> One question though: How would you determine "the last user to unmap"
> to know when to do the final "make visible to CPU" step?

I'd assume the user of the DMA API keeps track of it.  E.g. dmabuf
would be a pretty natural layer to implement common logic for this.
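
One way an exporter (dma-buf being the natural place, per the reply above) might track the "first map / last unmap" condition; the structure, locking and helper names here are all hypothetical.

#include <linux/dma-mapping.h>
#include <linux/mutex.h>

struct example_shared_buffer {
	struct mutex lock;
	int map_count;		/* how many devices currently have a mapping */
};

static unsigned long example_map_attrs(struct example_shared_buffer *buf)
{
	unsigned long attrs = 0;

	mutex_lock(&buf->lock);
	if (buf->map_count++ > 0)	/* not the first mapper */
		attrs = DMA_ATTR_SKIP_CPU_SYNC;
	mutex_unlock(&buf->lock);
	return attrs;
}

static unsigned long example_unmap_attrs(struct example_shared_buffer *buf)
{
	unsigned long attrs = DMA_ATTR_SKIP_CPU_SYNC;

	mutex_lock(&buf->lock);
	if (--buf->map_count == 0)	/* last user: make visible to CPU */
		attrs = 0;
	mutex_unlock(&buf->lock);
	return attrs;
}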
Rob Clark Nov. 29, 2018, 6:44 p.m. UTC | #16
On Thu, Nov 29, 2018 at 10:53 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Nov 29, 2018 at 09:25:43AM -0500, Rob Clark wrote:
> > > As I told you before:  hell no.  If you spent the slightest amount of
> > > actually trying to understand what you are doing here you'd know this
> > > can't work.  Just turn on dma debugging and this will blow up in your
> > > face.
> >
> > you can tone it down..  we weren't the ones who created the dma/iommu
> > mess, we are just trying to find a way to work around it
>
> But you don't listen and take someone busy for a day as approval,
> which is absolutely not the case.  Please wait for a day or two for
> people to respond instead of forcing already NACKed patches again in
> an act of passive agression.

I'm not sending my pull-req to Dave tomorrow (or even the day after
;-)), so there is ofc still time for people to respond.

BR,
-R
Rob Clark Nov. 29, 2018, 6:48 p.m. UTC | #17
On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > Maybe the thing we need to do is just implement a blacklist of
> > compatible strings for devices which should skip the automatic
> > iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > preventing us from enabling per-process pagetables for a5xx (where we
> > need to control the domain/context-bank that is allocated by the dma
> > api).
>
> You can detach from the dma map attachment using arm_iommu_detach_device,
> which a few drm drivers do, but I don't think this is the problem.

I think even with detach, we wouldn't end up with the context-bank
that the gpu firmware was hard-coded to expect, and so it would
overwrite the incorrect page table address register.  (I could be
mis-remembering that, Jordan spent more time looking at that.  But it
was something along those lines.)

BR,
-R
Rob Clark Nov. 29, 2018, 6:57 p.m. UTC | #18
On Thu, Nov 29, 2018 at 12:24 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> [CC Marek]
>
> On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > to drivers, as that will get us into all kinds of abuses.
> >
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
>
> i915 is a very specific case of a fully contained,
> architecture-specific hardware subsystem, where you can just hardcode
> all integration details inside the driver, because nobody else would
> care.
>
> In ARM world, you can have the same IP blocks licensed by multiple SoC
> vendors with different integration details and that often includes the
> option of coherency.

fwiw, I believe all the GPU IP blocks that are used across multiple
SoCs have their own GPU MMU (potentially in addition to an iommu?).
So the dma-api is a much better fit for them..  drm/msm is a lot
closer to drm/i915 scenario, so I don't so much care if the solution
to our unique problem isn't something that would work for other
drivers ;-)

BR,
-R
Jordan Crouse Nov. 29, 2018, 7:40 p.m. UTC | #19
On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > Maybe the thing we need to do is just implement a blacklist of
> > > compatible strings for devices which should skip the automatic
> > > iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > preventing us from enabling per-process pagetables for a5xx (where we
> > > need to control the domain/context-bank that is allocated by the dma
> > > api).
> >
> > You can detach from the dma map attachment using arm_iommu_detach_device,
> > which a few drm drivers do, but I don't think this is the problem.
> 
> I think even with detach, we wouldn't end up with the context-bank
> that the gpu firmware was hard-coded to expect, and so it would
> overwrite the incorrect page table address register.  (I could be
> mis-remembering that, Jordan spent more time looking at that.  But it
> was something along those lines.)

Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
to use that context bank for pagetable switching.

I believe the Tegra guys also had a similar problem with a hard coded context
bank.

This is a discussion we do need to have but not at the risk of derailing the
caching discussion which is arguably more important and has much wider ranging
implications for multimedia and Ion and such.

Jordan
Tomasz Figa Nov. 29, 2018, 7:57 p.m. UTC | #20
On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
>
> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > > Maybe the thing we need to do is just implement a blacklist of
> > > > compatible strings for devices which should skip the automatic
> > > > iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > > preventing us from enabling per-process pagetables for a5xx (where we
> > > > need to control the domain/context-bank that is allocated by the dma
> > > > api).
> > >
> > > You can detach from the dma map attachment using arm_iommu_detach_device,
> > > which a few drm drivers do, but I don't think this is the problem.
> >
> > I think even with detach, we wouldn't end up with the context-bank
> > that the gpu firmware was hard-coded to expect, and so it would
> > overwrite the incorrect page table address register.  (I could be
> > mis-remembering that, Jordan spent more time looking at that.  But it
> > was something along those lines.)
>
> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> to use that context bank for pagetable switching.
>
> I believe the Tegra guys also had a similar problem with a hard coded context
> bank.

Wait, if we detach the GPU/display struct device from the default
domain and attach it to a newly allocated domain, wouldn't the newly
allocated domain use the context bank we need? Note that we're already
doing that, except that we're doing it behind the back of the DMA
mapping subsystem, so that it keeps using the IOMMU version of the DMA
ops for the device and doing any mapping operations on the default
domain. If we ask the DMA mapping to detach, wouldn't it essentially
solve the problem?

Best regards,
Tomasz
Robin Murphy Nov. 29, 2018, 8:03 p.m. UTC | #21
On 29/11/2018 19:57, Tomasz Figa wrote:
> On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
>>
>> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
>>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
>>>>> Maybe the thing we need to do is just implement a blacklist of
>>>>> compatible strings for devices which should skip the automatic
>>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
>>>>> preventing us from enabling per-process pagetables for a5xx (where we
>>>>> need to control the domain/context-bank that is allocated by the dma
>>>>> api).
>>>>
>>>> You can detach from the dma map attachment using arm_iommu_detach_device,
>>>> which a few drm drivers do, but I don't think this is the problem.
>>>
>>> I think even with detach, we wouldn't end up with the context-bank
>>> that the gpu firmware was hard-coded to expect, and so it would
>>> overwrite the incorrect page table address register.  (I could be
>>> mis-remembering that, Jordan spent more time looking at that.  But it
>>> was something along those lines.)
>>
>> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
>> to use that context bank for pagetable switching.
>>
>> I believe the Tegra guys also had a similar problem with a hard coded context
>> bank.

AIUI, they don't need a specific hardware context, they just need to 
know which one they're actually using, which the domain abstraction hides.

> Wait, if we detach the GPU/display struct device from the default
> domain and attach it to a newly allocated domain, wouldn't the newly
> allocated domain use the context bank we need? Note that we're already

The arm-smmu driver doesn't, but there's no fundamental reason it 
couldn't. That should just need code to refcount domain users and 
release hardware contexts for domains with no devices currently attached.

Robin.

> doing that, except that we're doing it behind the back of the DMA
> mapping subsystem, so that it keeps using the IOMMU version of the DMA
> ops for the device and doing any mapping operations on the default
> domain. If we ask the DMA mapping to detach, wouldn't it essentially
> solve the problem?
> 
> Best regards,
> Tomasz
>
Tomasz Figa Nov. 30, 2018, 12:23 a.m. UTC | #22
On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 29/11/2018 19:57, Tomasz Figa wrote:
> > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> >>
> >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> >>>>
> >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> >>>>> Maybe the thing we need to do is just implement a blacklist of
> >>>>> compatible strings for devices which should skip the automatic
> >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> >>>>> need to control the domain/context-bank that is allocated by the dma
> >>>>> api).
> >>>>
> >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> >>>> which a few drm drivers do, but I don't think this is the problem.
> >>>
> >>> I think even with detach, we wouldn't end up with the context-bank
> >>> that the gpu firmware was hard-coded to expect, and so it would
> >>> overwrite the incorrect page table address register.  (I could be
> >>> mis-remembering that, Jordan spent more time looking at that.  But it
> >>> was something along those lines.)
> >>
> >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> >> to use that context bank for pagetable switching.
> >>
> >> I believe the Tegra guys also had a similar problem with a hard coded context
> >> bank.
>
> AIUI, they don't need a specific hardware context, they just need to
> know which one they're actually using, which the domain abstraction hides.
>
> > Wait, if we detach the GPU/display struct device from the default
> > domain and attach it to a newly allocated domain, wouldn't the newly
> > allocated domain use the context bank we need? Note that we're already
>
> The arm-smmu driver doesn't, but there's no fundamental reason it
> couldn't. That should just need code to refcount domain users and
> release hardware contexts for domains with no devices currently attached.
>
> Robin.
>
> > doing that, except that we're doing it behind the back of the DMA
> > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > ops for the device and doing any mapping operations on the default
> > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > solve the problem?

Thanks Robin.

Still, my point is that the MSM DRM driver attaches the GPU struct
device to a new domain it allocates using iommu_domain_alloc() and it
seems to work fine, so I believe it's not the problem we're looking
into with this patch.

Best regards,
Tomasz
Rob Clark Nov. 30, 2018, 1:15 a.m. UTC | #23
On Thu, Nov 29, 2018 at 10:57 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote:
> > Yeah we had patches to add manual cache management code to drm, so we
> > don't have to abuse the dma streaming api anymore. Got shouted down.
> > Abusing the dma streaming api also gets shouted down. It's a gpu, any
> > idea of these drivers actually being platform independent is out of
> > the window from the start anyway, so we're ok with tying this to
> > platforms.
>
> Manual or not the iommu API is missing APIs for cache management,
> which makes it kinda surprising it actually ever worked for non-coherent
> devices.
>
> And fortunately while some people spent the last year ot two bickering
> about the situation others actually did work, and we now have a
> generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal
> API.  This is only used for DMA API internals so far, and explicitly
> not intended for direct driver use, but it would be perfect as the
> backend for iommu API cache maintainance functions.  It exists on all
> but two architectures on mainline.  Out of those powerpc is in the works,
> only arm32 will need some major help.

oh, hmm, I realized I'd still been looking at the armv7 dma-mapping, I
hadn't noticed arch_sync_* yet.. that does look like a step in the
right direction.

As far as hiding cache ops behind iommu layer, I guess I'd been
thinking more of just letting the drivers that want to bypass dma
layer take things into their own hands.. tbh I think/assume/hope
drm/msm is more the exception than the rule as far as needing to
bypass the dma layer.  Or at least I guess the # of drivers having
problems w/ the dma layer is less than the # of iommu drivers.

(Not sure if that changes my thoughts on this patch, it isn't like
what this patch replaces isn't also a problematic hack around the
inability to bypass the dma layer.  In the short term I just want
*something* that works, I'm totally happy to refactor later when there
are better options.)

BR,
-R
Daniel Vetter Nov. 30, 2018, 9:35 a.m. UTC | #24
On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote:
> [CC Marek]
> 
> On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> > > > Just spend a bit of time reading through the implementations already
> > > > merged. Is the struct device *dev parameter actually needed anywhere?
> > > > dma-api definitely needs it, because we need that to pick the right iommu.
> > > > But for cache management from what I've seen the target device doesn't
> > > > matter, all the target specific stuff will be handled by the iommu.
> > >
> > > It looks like only mips every uses the dev argument, and even there
> > > the function it passes dev to ignores it.  So it could probably be removed.
> > >
> 
> Whether the cache maintenance operation needs to actually do anything
> or not is a function of `dev`. We can have some devices that are
> coherent with CPU caches, and some that are not, on the same system.

So the thing is that the gpu driver knows this too. It fairly often can
even decide at runtime (through surface state bits or gpu pte bits)
whether to use coherent or non-coherent transactions. dma-api assuming
that a given device is always coherent or non-coherent is one of the
fundamental mismatches we have.

If you otoh need dev because there are some strange bridge caches you need
to flush (I've never seen that, but who knows), that would be a different
thing. All the bridge flushing I've seen is attached to the iommu though,
so it would really be a surprise if the cache management needs that too.
 
> > > >
> > > > Dropping the dev parameter would make this a perfect fit for coherency
> > > > management of buffers used by multiple devices. Right now there's all
> > > > kinds of nasty tricks for that use cases needed to avoid redundant
> > > > flushes.
> > >
> > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > to drivers, as that will get us into all kinds of abuses.
> >
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
> 
> i915 is a very specific case of a fully contained,
> architecture-specific hardware subsystem, where you can just hardcode
> all integration details inside the driver, because nobody else would
> care.

Nah, it's not fully contained, it's just with your arm eyewear on you
can't see how badly it leaks all over the place. The problem is that GPUs
(in many cases at least) want to decide whether and when to do cache
maintenance. We need a combo of iommu and cache maintenance apis that
allows this, across multiple devices, because the dma-api doesn't cut it.

Aside: The cache maintenance api _must_ do the flush when asked to, even
if it believes no cache maintenance is necessary. Even if the default mode
is coherent for a platform/dev combo, the gpu driver might want to do a
non-coherent transaction.

> In ARM world, you can have the same IP blocks licensed by multiple SoC
> vendors with different integration details and that often includes the
> option of coherency.

My experience is that for soc gpus, you need to retune up/download code
for every soc. Or at least every family of socs.

This is painful. I guess thus far the arm soc gpu drivers we have merged
aren't the ones that needed widely different tuning on different soc
families/ranges. What's worse, it's userspace who will decide whether to
use the coherent or non-coherent paths in many cases (that's at least how
it has worked for decades on the big gpus, small gpus just lag a few years
usually). The kernel only provides the tools for the userspace
opengl/vulkan driver to do all the things.

Trying to hide coherent vs. non-coherent like it's done for everyone else
is probably not going to work for gpus. In fact, hasn't ever worked for
gpus thus far, and unlikely to change I think.
-Daniel

> > > So I'd much prefer if we could have iommu APIs wrapping these that are
> > > specific to actual uses cases that we understand well.
> > >
> > > As for the buffer sharing: at least for the DMA API side I want to
> > > move the current buffer sharing users away from dma_alloc_coherent
> > > (and coherent dma_alloc_attrs users) and the remapping done in there
> > > required for non-coherent architectures.  Instead I'd like to allocate
> > > plain old pages, and then just dma map them for each device separately,
> > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > > or last user to unmap.  On the iommu side it could probably work
> > > similar.
> >
> > I think this is what's often done. Except then there's also the issue
> > of how to get at the cma allocator if your device needs something
> > contiguous. There's a lot of that still going on in graphics/media.
> 
> I suppose one could just expose CMA with the default pool directly.
> It's just an allocator, so not sure why it would need any
> device-specific information.
> 
> There is also the use case of using CMA with device-specific pools of
> memory reusable by the system when not used by the device and those
> would have to somehow get the pool to allocate from, but I wonder if
> struct device is the right way to pass such information. I'd see the
> pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
> and perhaps a wrapper cma_alloc_default(size, flags) that is just a
> simple macro calling cma_alloc(&cma_pool_default, size, flags).
> 
> Best regards,
> Tomasz
Christoph Hellwig Nov. 30, 2018, 9:35 a.m. UTC | #25
On Thu, Nov 29, 2018 at 08:15:23PM -0500, Rob Clark wrote:
> As far as hiding cache ops behind iommu layer, I guess I'd been
> thinking more of just letting the drivers that want to bypass dma
> layer take things into their own hands.. tbh I think/assume/hope
> drm/msm is more the exception than the rule as far as needing to
> bypass the dma layer.  Or at least I guess the # of drivers having
> problems w/ the dma layer is less than the # of iommu drivers.

So the whole bypass thing has already been a contentious discussion
in the past.  One thing that the API already enforces is that we pass
a DMA direction, which I want to keep.  The other is that we need
a struct device (or something similar) that is used to check if the
device is cache coherent or not.  In your MSM case you might know that,
but in general it really is a platform issue that I don't want every
driver to know about.

> (Not sure if that changes my thoughts on this patch, it isn't like
> what this patch replaces isn't also a problematic hack around the
> inability to bypass the dma layer.  In the short term I just want
> *something* that works, I'm totally happy to refactor later when there
> are better options.)

The current patch is simply broken.  You can't just construct your
own S/G list and pass it to the DMA API, and we enforce that very
strictly with dma debug enabled.

So your only options are: a) actually use the DMA API for creating
the mapping, by e.g. using dma_direct_ops on top of an actual IOMMU
managed in the background, or b) figure out a way to do cache
maintenance for raw IOMMU API drivers.
Daniel Vetter Nov. 30, 2018, 9:40 a.m. UTC | #26
On Thu, Nov 29, 2018 at 01:57:38PM -0500, Rob Clark wrote:
> On Thu, Nov 29, 2018 at 12:24 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > [CC Marek]
> >
> > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > > to drivers, as that will get us into all kinds of abuses.
> > >
> > > What kind of abuse do you expect? It could very well be that gpu folks
> > > call that "standard use case" ... At least on x86 with the i915 driver
> > > we pretty much rely on architectural guarantees for how cache flushes
> > > work very much. Down to userspace doing the cache flushing for
> > > mappings the kernel has set up.
> >
> > i915 is a very specific case of a fully contained,
> > architecture-specific hardware subsystem, where you can just hardcode
> > all integration details inside the driver, because nobody else would
> > care.
> >
> > In ARM world, you can have the same IP blocks licensed by multiple SoC
> > vendors with different integration details and that often includes the
> > option of coherency.
> 
> fwiw, I believe all the GPU IP blocks that are used across multiple
> SoCs have their own GPU MMU (potentially in addition to an iommu?).
> So the dma-api is a much better fit for them..  drm/msm is a lot
> closer to drm/i915 scenario, so I don't so much care if the solution
> to our unique problem isn't something that would work for other
> drivers ;-)

Right now maybe, but I fully expect the entire coherent vs. non-coherent
transactions hilarity that we have had on the bigger intel socs for a few
years already to trickle down into smaller (arm based) socs eventually. I
think Apple has already been there for a few generations.

So maybe we won't have to fight the iommu side of the dma-api anymore
on these, but we'll still have to fight the cache maintenance side of
dma-api. You can tell the dma-api to not flush, but then you don't have
any other way to actually flush that's acceptable for arch/arm (on x86 we
just run clflush in userspace and call it a day).
-Daniel
Christoph Hellwig Nov. 30, 2018, 9:44 a.m. UTC | #27
On Fri, Nov 30, 2018 at 10:35:27AM +0100, Daniel Vetter wrote:
> > Whether the cache maintenance operation needs to actually do anything
> > or not is a function of `dev`. We can have some devices that are
> > coherent with CPU caches, and some that are not, on the same system.
> 
> So the thing is that the gpu driver knows this too. It fairly often can
> even decide at runtime (through surface state bits or gpu pte bits)
> whether to use coherent or non-coherent transactions. dma-api assuming
> that a given device is always coherent or non-coherent is one of the
> fundamental mismatches we have.
> 
> If you otoh need dev because there's some strange bridge caches you need
> to flush (I've never seen that, but who knows), that would be a diffeernt
> thing. All the bridge flushing I've seen is attached to the iommu though,
> so would be really a surprise if the cache management needs that too.

Strange bridge caches aren't the problem.  Outside of magic components
like SoC-integrated GPUs, the issue is that a platform can wire up a
PCIe/AXI/etc bus either so that it is cache coherent, or not cache
coherent (does no snooping).  Drivers need to use the full DMA API,
including dma_sync_*, to cater for the non-coherent case, which will turn
into no-ops if DMA is coherent.

PCIe now has unsnooped transactions, which can be non-coherent even
if the bus would otherwise be coherent.  We have so far very much
ignored those in Linux (at least Linux in general, I know you guys have
some driver-local hacks), but if that use case, or similar ones for GPUs
on SOCs become common we need to find a solution.

One of the major issues here is that architectures that always are
DMA coherent might not even have the cache flushing instructions, or even
if they do we have not wired them up in the DMA code as we didn't need
them.

So what we'd need to support this properly is to do something like:

 - add a new arch hook that allows an architecture to say it supports
   optional non-coherent DMA at the arch level, and for a given device
 - wire up arch_sync_dma_for_{device,cpu} for those architectures that
   define it if they don't currently have it (e.g. x86)
 - add a new DMA_ATTR_* flag to opt into cache flushing even if the
   device declares it is otherwise coherent

And I'd really like to see that work driven by an actual user.
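
To make the third bullet concrete, a sketch of how a GPU driver might opt into flushing on an otherwise coherent device; DMA_ATTR_FORCE_NON_COHERENT is a made-up name for a flag that does not exist at the time of this thread.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical attribute, value chosen arbitrarily for the sketch. */
#define DMA_ATTR_FORCE_NON_COHERENT	(1UL << 16)

static int example_map_noncoherent(struct device *dev, struct sg_table *sgt)
{
	/* The device is declared coherent, but this transaction will not
	 * snoop, so ask the DMA API to do the cache maintenance anyway.
	 */
	if (!dma_map_sg_attrs(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL,
			      DMA_ATTR_FORCE_NON_COHERENT))
		return -ENOMEM;
	return 0;
}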
Daniel Vetter Nov. 30, 2018, 9:46 a.m. UTC | #28
On Thu, Nov 29, 2018 at 07:33:34PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 06:09:05PM +0100, Daniel Vetter wrote:
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
> 
> Mostly the usual bypasses of the DMA API because people know better
> (and with that I don't mean low-level IOMMU API users, but "creative"
> direct mappings).

Ah yes. I think that even gpu folks get behind :-) Best if someone
bothered to explicitly cast to dma_addr_t to shut up the tools, but not
fix the bug ...

> > > As for the buffer sharing: at least for the DMA API side I want to
> > > move the current buffer sharing users away from dma_alloc_coherent
> > > (and coherent dma_alloc_attrs users) and the remapping done in there
> > > required for non-coherent architectures.  Instead I'd like to allocate
> > > plain old pages, and then just dma map them for each device separately,
> > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > > or last user to unmap.  On the iommu side it could probably work
> > > similar.
> > 
> > I think this is what's often done. Except then there's also the issue
> > of how to get at the cma allocator if your device needs something
> > contiguous. There's a lot of that still going on in graphics/media.
> 
> Being able to dip into CMA and mayb iommu coalescing if we want to
> get fancy is indeed the only reason for this API.  If we just wanted
> to map pages we could already do that now with just a little bit
> of boilerplate code (and quite a few drivers do - just adding this
> new API will remove tons of code).

Sounds like the future is very bright indeed \o/
-Daniel
Tomasz Figa Dec. 1, 2018, 2:05 a.m. UTC | #29
On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > >>
> > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > >>>>
> > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > >>>>> compatible strings for devices which should skip the automatic
> > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > >>>>> need to control the domain/context-bank that is allocated by the dma
> > >>>>> api).
> > >>>>
> > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > >>>> which a few drm drivers do, but I don't think this is the problem.
> > >>>
> > >>> I think even with detach, we wouldn't end up with the context-bank
> > >>> that the gpu firmware was hard-coded to expect, and so it would
> > >>> overwrite the incorrect page table address register.  (I could be
> > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > >>> was something along those lines.)
> > >>
> > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > >> to use that context bank for pagetable switching.
> > >>
> > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > >> bank.
> >
> > AIUI, they don't need a specific hardware context, they just need to
> > know which one they're actually using, which the domain abstraction hides.
> >
> > > Wait, if we detach the GPU/display struct device from the default
> > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > allocated domain use the context bank we need? Note that we're already
> >
> > The arm-smmu driver doesn't, but there's no fundamental reason it
> > couldn't. That should just need code to refcount domain users and
> > release hardware contexts for domains with no devices currently attached.
> >
> > Robin.
> >
> > > doing that, except that we're doing it behind the back of the DMA
> > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > ops for the device and doing any mapping operations on the default
> > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > solve the problem?
>
> Thanks Robin.
>
> Still, my point is that the MSM DRM driver attaches the GPU struct
> device to a new domain it allocates using iommu_domain_alloc() and it
> seems to work fine, so I believe it's not the problem we're looking
> into with this patch.

Could we just make the MSM DRM call arch_teardown_dma_ops() and then
arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
with it?

Best regards,
Tomasz
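
A sketch of the call Tomasz suggests, with the 4.20-era signatures; as Rob points out in the next message, these helpers aren't exported to modules, so a modular driver can't actually do this as-is. The DMA window and coherency flag below are arbitrary assumptions.

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

static void example_drop_iommu_dma_ops(struct device *dev)
{
	/* Forget the IOMMU-backed DMA ops installed at probe time ... */
	arch_teardown_dma_ops(dev);

	/* ... and set up plain ops: iommu == NULL means no IOMMU translation,
	 * leaving the DMA API to do only cache maintenance (if needed).
	 */
	arch_setup_dma_ops(dev, 0, SZ_4G, NULL, false /* coherent? */);
}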
Rob Clark Dec. 1, 2018, 11:46 a.m. UTC | #30
On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > > >>
> > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > > >>>>
> > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > > >>>>> compatible strings for devices which should skip the automatic
> > > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > > >>>>> need to control the domain/context-bank that is allocated by the dma
> > > >>>>> api).
> > > >>>>
> > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > > >>>> which a few drm drivers do, but I don't think this is the problem.
> > > >>>
> > > >>> I think even with detach, we wouldn't end up with the context-bank
> > > >>> that the gpu firmware was hard-coded to expect, and so it would
> > > >>> overwrite the incorrect page table address register.  (I could be
> > > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > > >>> was something along those lines.)
> > > >>
> > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > > >> to use that context bank for pagetable switching.
> > > >>
> > > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > > >> bank.
> > >
> > > AIUI, they don't need a specific hardware context, they just need to
> > > know which one they're actually using, which the domain abstraction hides.
> > >
> > > > Wait, if we detach the GPU/display struct device from the default
> > > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > > allocated domain use the context bank we need? Note that we're already
> > >
> > > The arm-smmu driver doesn't, but there's no fundamental reason it
> > > couldn't. That should just need code to refcount domain users and
> > > release hardware contexts for domains with no devices currently attached.
> > >
> > > Robin.
> > >
> > > > doing that, except that we're doing it behind the back of the DMA
> > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > > ops for the device and doing any mapping operations on the default
> > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > > solve the problem?
> >
> > Thanks Robin.
> >
> > Still, my point is that the MSM DRM driver attaches the GPU struct
> > device to a new domain it allocates using iommu_domain_alloc() and it
> > seems to work fine, so I believe it's not the problem we're looking
> > into with this patch.
>
> Could we just make the MSM DRM driver call arch_teardown_dma_ops() and then
> arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
> with it?

I don't think those are exported to modules?

I actually have a simpler patch that adds a small blacklist to check
in of_dma_configure() before calling arch_setup_dma_ops(), which can
replace this patch.  It also solves the problem of the dma api
allocating the context bank that the gpu wants to use for
context-switching, and should be a simple thing to backport to stable
branches.
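
Roughly along these lines -- an illustrative sketch only, where the
compatible strings, the helper name and the exact hook point in
of_dma_configure() are placeholders rather than the final patch:

/* drivers/of/device.c -- illustrative sketch, not the actual patch. */
static const char * const of_iommu_dma_blacklist[] = {
	"qcom,adreno",		/* placeholder compatible strings */
	"qcom,mdss",
};

static bool of_dma_iommu_blacklisted(struct device_node *np)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(of_iommu_dma_blacklist); i++)
		if (of_device_is_compatible(np, of_iommu_dma_blacklist[i]))
			return true;

	return false;
}

/* ...and in of_dma_configure(), before calling arch_setup_dma_ops(): */
	if (of_dma_iommu_blacklisted(dev->of_node))
		iommu = NULL;	/* skip the automatic IOMMU/DMA hookup */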

I was just spending some time trying to figure out what changed
recently to start causing dma_map_sg() to oops on boot for us, so I
could write a more detailed commit msg.

BR,
-R
Tomasz Figa Dec. 3, 2018, 12:12 a.m. UTC | #31
On Sat, Dec 1, 2018 at 3:47 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote:
> > > >
> > > > On 29/11/2018 19:57, Tomasz Figa wrote:
> > > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > > > >>
> > > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote:
> > > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > >>>>
> > > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote:
> > > > >>>>> Maybe the thing we need to do is just implement a blacklist of
> > > > >>>>> compatible strings for devices which should skip the automatic
> > > > >>>>> iommu/dma hookup.  Maybe a bit ugly, but it would also solve a problem
> > > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we
> > > > >>>>> need to control the domain/context-bank that is allocated by the dma
> > > > >>>>> api).
> > > > >>>>
> > > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device,
> > > > >>>> which a few drm drivers do, but I don't think this is the problem.
> > > > >>>
> > > > >>> I think even with detach, we wouldn't end up with the context-bank
> > > > >>> that the gpu firmware was hard-coded to expect, and so it would
> > > > >>> overwrite the incorrect page table address register.  (I could be
> > > > >>> mis-remembering that, Jordan spent more time looking at that.  But it
> > > > >>> was something along those lines.)
> > > > >>
> > > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded
> > > > >> to use that context bank for pagetable switching.
> > > > >>
> > > > >> I believe the Tegra guys also had a similar problem with a hard coded context
> > > > >> bank.
> > > >
> > > > AIUI, they don't need a specific hardware context, they just need to
> > > > know which one they're actually using, which the domain abstraction hides.
> > > >
> > > > > Wait, if we detach the GPU/display struct device from the default
> > > > > domain and attach it to a newly allocated domain, wouldn't the newly
> > > > > allocated domain use the context bank we need? Note that we're already
> > > >
> > > > The arm-smmu driver doesn't, but there's no fundamental reason it
> > > > couldn't. That should just need code to refcount domain users and
> > > > release hardware contexts for domains with no devices currently attached.
> > > >
> > > > Robin.
> > > >
> > > > > doing that, except that we're doing it behind the back of the DMA
> > > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA
> > > > > ops for the device and doing any mapping operations on the default
> > > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially
> > > > > solve the problem?
> > >
> > > Thanks Robin.
> > >
> > > Still, my point is that the MSM DRM driver attaches the GPU struct
> > > device to a new domain it allocates using iommu_domain_alloc() and it
> > > seems to work fine, so I believe it's not the problem we're looking
> > > into with this patch.
> >
> > Could we just make the MSM DRM driver call arch_teardown_dma_ops() and then
> > arch_setup_dma_ops() with the `iommu` argument set to NULL and be done
> > with it?
>
> I don't think those are exported to modules?
>

Indeed, if we compile MSM DRM as a module, it wouldn't work...

> I actually have a simpler patch that adds a small blacklist to check
> in of_dma_configure() before calling arch_setup_dma_ops(), which can
> replace this patch.  It also solves the problem of the dma api
> allocating the context bank that the gpu wants to use for
> context-switching, and should be a simple thing to backport to stable
> branches.
>
> I was just spending some time trying to figure out what changed
> recently to start causing dma_map_sg() to oops on boot for us, so I
> could write a more detailed commit msg.

Yeah, that sounds much better, thanks. Reviewed that patch.

Best regards,
Tomasz
Christoph Hellwig Dec. 7, 2018, 1:38 a.m. UTC | #32
On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote:
> > Being able to dip into CMA and maybe iommu coalescing if we want to
> > get fancy is indeed the only reason for this API.  If we just wanted
> > to map pages we could already do that now with just a little bit
> > of boilerplate code (and quite a few drivers do - just adding this
> > new API will remove tons of code).
> 
> Sounds like the future is very bright indeed \o/

So, I spent some time with this, and instead of a new API I think it
makes more sense to ensure that DMA_ATTR_NON_CONSISTENT is
consistently available and has well-defined semantics: that is,
virt_to_page on the return value works, the allocation is contiguous,
and we can use dma_sync_single_for_cpu and dma_sync_single_for_device
for ownership transfers.
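
From a driver's point of view the contract would look roughly like
this -- an illustrative sketch of the intended usage, not code taken
from that branch:

/* Sketch of the intended DMA_ATTR_NON_CONSISTENT contract. */
static int example_noncoherent_alloc(struct device *dev, size_t size)
{
	dma_addr_t dma_handle;
	struct page *page;
	void *cpu_addr;

	cpu_addr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL,
				   DMA_ATTR_NON_CONSISTENT);
	if (!cpu_addr)
		return -ENOMEM;

	/* The allocation is contiguous and virt_to_page() works on it: */
	page = virt_to_page(cpu_addr);
	pr_debug("backing pfn: %lx\n", page_to_pfn(page));

	/* Ownership transfers are explicit sync calls: */
	dma_sync_single_for_device(dev, dma_handle, size, DMA_BIDIRECTIONAL);
	/* ... device accesses the buffer ... */
	dma_sync_single_for_cpu(dev, dma_handle, size, DMA_BIDIRECTIONAL);

	dma_free_attrs(dev, size, cpu_addr, dma_handle,
		       DMA_ATTR_NON_CONSISTENT);
	return 0;
}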

Can you graphics folks check if this:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator

is something to work with?  Especially to get rid of the horrible
dma_get_sgtable hacks?
Rob Clark Dec. 7, 2018, 2:29 p.m. UTC | #33
On Thu, Dec 6, 2018 at 8:38 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote:
> > > Being able to dip into CMA and maybe iommu coalescing if we want to
> > > get fancy is indeed the only reason for this API.  If we just wanted
> > > to map pages we could already do that now with just a little bit
> > > of boilerplate code (and quite a few drivers do - just adding this
> > > new API will remove tons of code).
> >
> > Sounds like the future is very bright indeed \o/
>
> So, I spent some time with this, and instead of a new API I think it
> makes more sense to ensure that DMA_ATTR_NON_CONSISTENT is
> consistently available and has well-defined semantics: that is,
> virt_to_page on the return value works, the allocation is contiguous,
> and we can use dma_sync_single_for_cpu and dma_sync_single_for_device
> for ownership transfers.
>
> Can you graphics folks check if this:
>
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator
>
> is something to work with?  Especially to get rid of the horrible
> dma_get_sgtable hacks?

I'm not sure I really have a strong opinion about this.  For some of
the display-only SoC drm drivers which use the CMA helpers, it might
be useful.

For the drm drivers for real GPUs, we aren't really using the dma api
to allocate in the first place.  (And the direction is towards more
support for userptr allocations.)

BR,
-R
diff mbox series

Patch

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 00c795ced02c..7048e9fe00c6 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -81,6 +81,8 @@  static struct page **get_pages(struct drm_gem_object *obj)
 		struct drm_device *dev = obj->dev;
 		struct page **p;
 		int npages = obj->size >> PAGE_SHIFT;
+		struct scatterlist *s;
+		int i;
 
 		if (use_pages(obj))
 			p = drm_gem_get_pages(obj);
@@ -104,12 +106,23 @@  static struct page **get_pages(struct drm_gem_object *obj)
 			return ptr;
 		}
 
-		/* For non-cached buffers, ensure the new pages are clean
+		/*
+		 * Some implementations of the DMA mapping ops expect
+		 * physical addresses of the pages to be stored as DMA
+		 * addresses of the sglist entries. To work around it,
+		 * set them here explicitly.
+		 */
+		for_each_sg(msm_obj->sgt->sgl, s, msm_obj->sgt->nents, i)
+			sg_dma_address(s) = sg_phys(s);
+
+		/*
+		 * For non-cached buffers, ensure the new pages are clean
 		 * because display controller, GPU, etc. are not coherent:
 		 */
-		if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-			dma_map_sg(dev->dev, msm_obj->sgt->sgl,
-					msm_obj->sgt->nents, DMA_BIDIRECTIONAL);
+		if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+			dma_sync_sg_for_device(dev->dev, msm_obj->sgt->sgl,
+					       msm_obj->sgt->nents,
+					       DMA_BIDIRECTIONAL);
 	}
 
 	return msm_obj->pages;
@@ -133,14 +146,16 @@  static void put_pages(struct drm_gem_object *obj)
 
 	if (msm_obj->pages) {
 		if (msm_obj->sgt) {
-			/* For non-cached buffers, ensure the new
+			/*
+			 * For non-cached buffers, ensure the new
 			 * pages are clean because display controller,
 			 * GPU, etc. are not coherent:
 			 */
-			if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-				dma_unmap_sg(obj->dev->dev, msm_obj->sgt->sgl,
-					     msm_obj->sgt->nents,
-					     DMA_BIDIRECTIONAL);
+			if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+				dma_sync_sg_for_cpu(obj->dev->dev,
+						    msm_obj->sgt->sgl,
+						    msm_obj->sgt->nents,
+						    DMA_BIDIRECTIONAL);
 
 			sg_free_table(msm_obj->sgt);
 			kfree(msm_obj->sgt);