| Message ID | 20181129140315.28476-1-vivek.gautam@codeaurora.org (mailing list archive) |
| --- | --- |
| State | Not Applicable, archived |
| Series | [v3,1/1] drm: msm: Replace dma_map_sg with dma_sync_sg* |
On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote: > dma_map_sg() expects a DMA domain. However, the drm devices > have been traditionally using unmanaged iommu domain which > is non-dma type. Using dma mapping APIs with that domain is bad. > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}() > to do the cache maintenance. As I told you before: hell no. If you spent the slightest amount of time actually trying to understand what you are doing here you'd know this can't work. Just turn on dma debugging and this will blow up in your face. Either you use the DMA API properly, that is you use it to map and to sync, or you don't use it at all. Mix and match between iommu APIs and DMA APIs is simply not possible.
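The usage Christoph is describing, where the DMA API both creates the mapping and owns the cache maintenance, would look roughly like the sketch below; the dev and sgt names are placeholders rather than code from the patch under discussion.

/*
 * Rough sketch of the "use the DMA API properly" pattern: the same API
 * that creates the mapping also does the ownership transfers.
 */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int map_and_use(struct device *dev, struct sg_table *sgt)
{
	int nents;

	/* Creates the IOVA mapping and does the initial CPU cache clean. */
	nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	if (!nents)
		return -ENOMEM;

	/* ... device DMA happens here ... */

	/* Hand the buffer back to the CPU before touching it from the kernel. */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	/* ... CPU access ... */
	dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);

	/* Tears down the mapping and does the final sync. */
	dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	return 0;
}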
On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote: > > dma_map_sg() expects a DMA domain. However, the drm devices > > have been traditionally using unmanaged iommu domain which > > is non-dma type. Using dma mapping APIs with that domain is bad. > > > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}() > > to do the cache maintenance. > > As I told you before: hell no. If you spent the slightest amount of > actually trying to understand what you are doing here you'd know this > can't work. Just turn on dma debugging and this will blow up in your > face. you can tone it down.. we weren't the ones who created the dma/iommu mess, we are just trying to find a way to work around it > Either you use the DMA API properly, that is you use it to map and > to sync, or you don't use it at all. Mix and match between iommu > APIs and DMA APIs is simply not possible. I'd *love* nothing more to not use the dma api.. but on arm there is no other way to do cache maint. BR, -R
On Thu, Nov 29, 2018 at 9:25 AM Rob Clark <robdclark@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote: > > > dma_map_sg() expects a DMA domain. However, the drm devices > > > have been traditionally using unmanaged iommu domain which > > > is non-dma type. Using dma mapping APIs with that domain is bad. > > > > > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}() > > > to do the cache maintenance. > > > > As I told you before: hell no. If you spent the slightest amount of > > actually trying to understand what you are doing here you'd know this > > can't work. Just turn on dma debugging and this will blow up in your > > face. > > you can tone it down.. we weren't the ones who created the dma/iommu > mess, we are just trying to find a way to work around it > > > Either you use the DMA API properly, that is you use it to map and > > to sync, or you don't use it at all. Mix and match between iommu > > APIs and DMA APIs is simply not possible. > > I'd *love* nothing more to not use the dma api.. but on arm there is > no other way to do cache maint. btw, one of us will try this w/ dma debugging enabled.. Maybe the thing we need to do is just implement a blacklist of compatible strings for devices which should skip the automatic iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem preventing us from enabling per-process pagetables for a5xx (where we need to control the domain/context-bank that is allocated by the dma api). Like I've said before, I'm open to other suggestions. But the automagic behind-the-scenes dma+iommu hookup really screwed us, and we need to find some sort of solution instead of downstream hacks. BR, -R
On Thu, Nov 29, 2018 at 3:26 PM Rob Clark <robdclark@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 9:14 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Thu, Nov 29, 2018 at 07:33:15PM +0530, Vivek Gautam wrote: > > > dma_map_sg() expects a DMA domain. However, the drm devices > > > have been traditionally using unmanaged iommu domain which > > > is non-dma type. Using dma mapping APIs with that domain is bad. > > > > > > Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}() > > > to do the cache maintenance. > > > > As I told you before: hell no. If you spent the slightest amount of > > actually trying to understand what you are doing here you'd know this > > can't work. Just turn on dma debugging and this will blow up in your > > face. > > you can tone it down.. we weren't the ones who created the dma/iommu > mess, we are just trying to find a way to work around it > > > Either you use the DMA API properly, that is you use it to map and > > to sync, or you don't use it at all. Mix and match between iommu > > APIs and DMA APIs is simply not possible. > > I'd *love* nothing more to not use the dma api.. but on arm there is > no other way to do cache maint. Yeah we had patches to add manual cache management code to drm, so we don't have to abuse the dma streaming api anymore. Got shouted down. Abusing the dma streaming api also gets shouted down. It's a gpu, any idea of these drivers actually being platform independent is out of the window from the start anyway, so we're ok with tying this to platforms. -Daniel
On Thu, Nov 29, 2018 at 09:25:43AM -0500, Rob Clark wrote: > > As I told you before: hell no. If you spent the slightest amount of > > actually trying to understand what you are doing here you'd know this > > can't work. Just turn on dma debugging and this will blow up in your > > face. > > you can tone it down.. we weren't the ones who created the dma/iommu > mess, we are just trying to find a way to work around it But you don't listen and take someone being busy for a day as approval, which is absolutely not the case. Please wait for a day or two for people to respond instead of forcing already NACKed patches again in an act of passive aggression.
On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > Maybe the thing we need to do is just implement a blacklist of > compatible strings for devices which should skip the automatic > iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > preventing us from enabling per-process pagetables for a5xx (where we > need to control the domain/context-bank that is allocated by the dma > api). You can detach from the dma map attachment using arm_iommu_detach_device, which a few drm drivers do, but I don't think this is the problem.
On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote: > Yeah we had patches to add manual cache management code to drm, so we > don't have to abuse the dma streaming api anymore. Got shouted down. > Abusing the dma streaming api also gets shouted down. It's a gpu, any > idea of these drivers actually being platform independent is out of > the window from the start anyway, so we're ok with tying this to > platforms. Manual or not the iommu API is missing APIs for cache management, which makes it kinda surprising it actually ever worked for non-coherent devices. And fortunately while some people spent the last year or two bickering about the situation others actually did work, and we now have a generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal API. This is only used for DMA API internals so far, and explicitly not intended for direct driver use, but it would be perfect as the backend for iommu API cache maintenance functions. It exists on all but two architectures on mainline. Out of those powerpc is in the works, only arm32 will need some major help.
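As a rough illustration of what an iommu-level cache maintenance wrapper on top of these hooks could look like, here is a hypothetical sketch; iommu_sync_sg_for_device() does not exist, and the arch_sync_dma_for_device() prototype assumed is the kernel-internal one from this era (device pointer, physical address, size, direction).

/*
 * Hypothetical sketch only: an iommu-API-level cache maintenance helper
 * layered on the kernel-internal arch_sync_dma_* hooks mentioned above.
 * Neither this function nor direct driver use of those hooks exists.
 */
#include <linux/dma-noncoherent.h>
#include <linux/scatterlist.h>

static void iommu_sync_sg_for_device(struct device *dev,
				     struct scatterlist *sgl, int nents,
				     enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i;

	for_each_sg(sgl, s, nents, i)
		arch_sync_dma_for_device(dev, sg_phys(s), s->length, dir);
}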
On Thu, Nov 29, 2018 at 04:57:58PM +0100, Christoph Hellwig wrote: > On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote: > > Yeah we had patches to add manual cache management code to drm, so we > > don't have to abuse the dma streaming api anymore. Got shouted down. > > Abusing the dma streaming api also gets shouted down. It's a gpu, any > > idea of these drivers actually being platform independent is out of > > the window from the start anyway, so we're ok with tying this to > > platforms. > > Manual or not the iommu API is missing APIs for cache management, > which makes it kinda surprising it actually ever worked for non-coherent > devices. > > And fortunately while some people spent the last year ot two bickering > about the situation others actually did work, and we now have a > generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal > API. This is only used for DMA API internals so far, and explicitly > not intended for direct driver use, but it would be perfect as the > backend for iommu API cache maintainance functions. It exists on all > but two architectures on mainline. Out of those powerpc is in the works, > only arm32 will need some major help. Oh, this sounds neat. At least some massive progress. Just spent a bit of time reading through the implementations already merged. Is the struct device *dev parameter actually needed anywhere? dma-api definitely needs it, because we need that to pick the right iommu. But for cache management from what I've seen the target device doesn't matter, all the target specific stuff will be handled by the iommu. Dropping the dev parameter would make this a perfect fit for coherency management of buffers used by multiple devices. Right now there's all kinds of nasty tricks for those use cases needed to avoid redundant flushes. -Daniel
On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote: > Just spend a bit of time reading through the implementations already > merged. Is the struct device *dev parameter actually needed anywhere? > dma-api definitely needs it, because we need that to pick the right iommu. > But for cache management from what I've seen the target device doesn't > matter, all the target specific stuff will be handled by the iommu. It looks like only mips ever uses the dev argument, and even there the function it passes dev to ignores it. So it could probably be removed. > > Dropping the dev parameter would make this a perfect fit for coherency > management of buffers used by multiple devices. Right now there's all > kinds of nasty tricks for that use cases needed to avoid redundant > flushes. Note that one thing I'd like to avoid is exposing these functions directly to drivers, as that will get us into all kinds of abuses. So I'd much prefer if we could have iommu APIs wrapping these that are specific to actual use cases that we understand well. As for the buffer sharing: at least for the DMA API side I want to move the current buffer sharing users away from dma_alloc_coherent (and coherent dma_alloc_attrs users) and the remapping done in there required for non-coherent architectures. Instead I'd like to allocate plain old pages, and then just dma map them for each device separately, with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map or last user to unmap. On the iommu side it could probably work similarly. I have done some preliminary work on this, and want to get it into this merge window, but there are a few other bits I need to sort out first. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch ---end quoted text---
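A minimal sketch of the buffer-sharing scheme described above, with illustrative names; DMA_ATTR_SKIP_CPU_SYNC and dma_map_sg_attrs() are real, the first_user bookkeeping is made up.

/*
 * Sketch: map the same pages for several devices, paying for the CPU
 * cache clean only once.  Each importer keeps its own sg_table over the
 * same pages, since dma_map_sg() writes per-device DMA addresses into
 * the entries.
 */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

static int share_map(struct device *dev, struct sg_table *sgt, bool first_user)
{
	unsigned long attrs = first_user ? 0 : DMA_ATTR_SKIP_CPU_SYNC;

	if (!dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents,
			      DMA_BIDIRECTIONAL, attrs))
		return -ENOMEM;

	return 0;
}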
On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote: > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote: > > Just spend a bit of time reading through the implementations already > > merged. Is the struct device *dev parameter actually needed anywhere? > > dma-api definitely needs it, because we need that to pick the right iommu. > > But for cache management from what I've seen the target device doesn't > > matter, all the target specific stuff will be handled by the iommu. > > It looks like only mips every uses the dev argument, and even there > the function it passes dev to ignores it. So it could probably be removed. > > > > > Dropping the dev parameter would make this a perfect fit for coherency > > management of buffers used by multiple devices. Right now there's all > > kinds of nasty tricks for that use cases needed to avoid redundant > > flushes. > > Note that one thing I'd like to avoid is exposing these funtions directly > to drivers, as that will get us into all kinds of abuses. What kind of abuse do you expect? It could very well be that gpu folks call that "standard use case" ... At least on x86 with the i915 driver we pretty much rely on architectural guarantees for how cache flushes work very much. Down to userspace doing the cache flushing for mappings the kernel has set up. > So I'd much prefer if we could have iommu APIs wrapping these that are > specific to actual uses cases that we understand well. > > As for the buffer sharing: at least for the DMA API side I want to > move the current buffer sharing users away from dma_alloc_coherent > (and coherent dma_alloc_attrs users) and the remapping done in there > required for non-coherent architectures. Instead I'd like to allocate > plain old pages, and then just dma map them for each device separately, > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > or last user to unmap. On the iommu side it could probably work > similar. I think this is what's often done. Except then there's also the issue of how to get at the cma allocator if your device needs something contiguous. There's a lot of that still going on in graphics/media. -Daniel > I have done some preliminary work on this, and want to get it into this > merge window, but there is a few other bits I need to sort out first. > > > -Daniel > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch > ---end quoted text--- > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
[CC Marek] On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote: > > > Just spend a bit of time reading through the implementations already > > > merged. Is the struct device *dev parameter actually needed anywhere? > > > dma-api definitely needs it, because we need that to pick the right iommu. > > > But for cache management from what I've seen the target device doesn't > > > matter, all the target specific stuff will be handled by the iommu. > > > > It looks like only mips every uses the dev argument, and even there > > the function it passes dev to ignores it. So it could probably be removed. > > Whether the cache maintenance operation needs to actually do anything or not is a function of `dev`. We can have some devices that are coherent with CPU caches, and some that are not, on the same system. > > > > > > Dropping the dev parameter would make this a perfect fit for coherency > > > management of buffers used by multiple devices. Right now there's all > > > kinds of nasty tricks for that use cases needed to avoid redundant > > > flushes. > > > > Note that one thing I'd like to avoid is exposing these funtions directly > > to drivers, as that will get us into all kinds of abuses. > > What kind of abuse do you expect? It could very well be that gpu folks > call that "standard use case" ... At least on x86 with the i915 driver > we pretty much rely on architectural guarantees for how cache flushes > work very much. Down to userspace doing the cache flushing for > mappings the kernel has set up. i915 is a very specific case of a fully contained, architecture-specific hardware subsystem, where you can just hardcode all integration details inside the driver, because nobody else would care. In ARM world, you can have the same IP blocks licensed by multiple SoC vendors with different integration details and that often includes the option of coherency. > > > So I'd much prefer if we could have iommu APIs wrapping these that are > > specific to actual uses cases that we understand well. > > > > As for the buffer sharing: at least for the DMA API side I want to > > move the current buffer sharing users away from dma_alloc_coherent > > (and coherent dma_alloc_attrs users) and the remapping done in there > > required for non-coherent architectures. Instead I'd like to allocate > > plain old pages, and then just dma map them for each device separately, > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > > or last user to unmap. On the iommu side it could probably work > > similar. > > I think this is what's often done. Except then there's also the issue > of how to get at the cma allocator if your device needs something > contiguous. There's a lot of that still going on in graphics/media. I suppose one could just expose CMA with the default pool directly. It's just an allocator, so not sure why it would need any device-specific information. There is also the use case of using CMA with device-specific pools of memory reusable by the system when not used by the device and those would have to somehow get the pool to allocate from, but I wonder if struct device is the right way to pass such information. I'd see the pool given explicitly like cma_alloc(struct cma_pool *, size, flags) and perhaps a wrapper cma_alloc_default(size, flags) that is just a simple macro calling cma_alloc(&cma_pool_default, size, flags). 
Best regards, Tomasz
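Spelled out as prototypes, the proposal above reads something like the sketch below. None of it exists: struct cma_pool, the explicit pool argument and cma_alloc_default() are all hypothetical, and the current cma_alloc() in mm/cma.c instead takes a struct cma pointer, a page count and an alignment.

/* Hypothetical interface only -- nothing here exists in the kernel. */
struct cma_pool;

extern struct cma_pool cma_pool_default;

struct page *cma_alloc(struct cma_pool *pool, size_t size, gfp_t flags);

#define cma_alloc_default(size, flags) \
	cma_alloc(&cma_pool_default, (size), (flags))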
Hi Christoph, On Thu, Nov 29, 2018 at 05:57:15PM +0100, Christoph Hellwig wrote: > > As for the buffer sharing: at least for the DMA API side I want to > move the current buffer sharing users away from dma_alloc_coherent > (and coherent dma_alloc_attrs users) and the remapping done in there > required for non-coherent architectures. Instead I'd like to allocate > plain old pages, and then just dma map them for each device separately, > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > or last user to unmap. On the iommu side it could probably work > similar. > > I have done some preliminary work on this, and want to get it into this > merge window, but there is a few other bits I need to sort out first. > This sounds very useful for ion, to avoid CPU cache maintenance as long as the buffer stays in device-land. One question though: How would you determine "the last user to unmap" to know when to do the final "make visible to CPU" step? Thanks, -Brian
On Thu, Nov 29, 2018 at 06:09:05PM +0100, Daniel Vetter wrote: > What kind of abuse do you expect? It could very well be that gpu folks > call that "standard use case" ... At least on x86 with the i915 driver > we pretty much rely on architectural guarantees for how cache flushes > work very much. Down to userspace doing the cache flushing for > mappings the kernel has set up. Mostly the usual bypasses of the DMA API because people know better (and with that I don't mean low-level IOMMU API users, but "creative" direct mappings). > > As for the buffer sharing: at least for the DMA API side I want to > > move the current buffer sharing users away from dma_alloc_coherent > > (and coherent dma_alloc_attrs users) and the remapping done in there > > required for non-coherent architectures. Instead I'd like to allocate > > plain old pages, and then just dma map them for each device separately, > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > > or last user to unmap. On the iommu side it could probably work > > similar. > > I think this is what's often done. Except then there's also the issue > of how to get at the cma allocator if your device needs something > contiguous. There's a lot of that still going on in graphics/media. Being able to dip into CMA and maybe iommu coalescing if we want to get fancy is indeed the only reason for this API. If we just wanted to map pages we could already do that now with just a little bit of boilerplate code (and quite a few drivers do - just adding this new API will remove tons of code).
On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote: > Whether the cache maintenance operation needs to actually do anything > or not is a function of `dev`. We can have some devices that are > coherent with CPU caches, and some that are not, on the same system. Yes, but that part is not decided by these low-level helpers but their callers in the DMA code (or maybe IOMMU code as well in the future). > There is also the use case of using CMA with device-specific pools of > memory reusable by the system when not used by the device and those > would have to somehow get the pool to allocate from, but I wonder if > struct device is the right way to pass such information. I'd see the > pool given explicitly like cma_alloc(struct cma_pool *, size, flags) > and perhaps a wrapper cma_alloc_default(size, flags) that is just a > simple macro calling cma_alloc(&cma_pool_default, size, flags). Yes, the cma APIs have quite a few warts that need addressing. I have a few of those things on my todo list, but help is always welcome.
On Thu, Nov 29, 2018 at 05:33:03PM +0000, Brian Starkey wrote: > This sounds very useful for ion, to avoid CPU cache maintenance as > long as the buffer stays in device-land. > > One question though: How would you determine "the last user to unmap" > to know when to do the final "make visible to CPU" step? I'd assume the user of the DMA API keeps track of it. E.g. dmabuf would be a pretty natural layer to implement common logic for this.
On Thu, Nov 29, 2018 at 10:53 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, Nov 29, 2018 at 09:25:43AM -0500, Rob Clark wrote: > > > As I told you before: hell no. If you spent the slightest amount of > > > actually trying to understand what you are doing here you'd know this > > > can't work. Just turn on dma debugging and this will blow up in your > > > face. > > > > you can tone it down.. we weren't the ones who created the dma/iommu > > mess, we are just trying to find a way to work around it > > But you don't listen and take someone busy for a day as approval, > which is absolutely not the case. Please wait for a day or two for > people to respond instead of forcing already NACKed patches again in > an act of passive agression. I'm not sending my pull-req to Dave tomorrow (or even the day after ;-)), so there is ofc still time for people to respond. BR, -R
On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > Maybe the thing we need to do is just implement a blacklist of > > compatible strings for devices which should skip the automatic > > iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > preventing us from enabling per-process pagetables for a5xx (where we > > need to control the domain/context-bank that is allocated by the dma > > api). > > You can detach from the dma map attachment using arm_iommu_detach_device, > which a few drm drivers do, but I don't think this is the problem. I think even with detach, we wouldn't end up with the context-bank that the gpu firmware was hard-coded to expect, and so it would overwrite the incorrect page table address register. (I could be mis-remembering that, Jordan spent more time looking at that. But it was something along those lines.) BR, -R
On Thu, Nov 29, 2018 at 12:24 PM Tomasz Figa <tfiga@chromium.org> wrote: > > [CC Marek] > > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote: > > > > > > Note that one thing I'd like to avoid is exposing these funtions directly > > > to drivers, as that will get us into all kinds of abuses. > > > > What kind of abuse do you expect? It could very well be that gpu folks > > call that "standard use case" ... At least on x86 with the i915 driver > > we pretty much rely on architectural guarantees for how cache flushes > > work very much. Down to userspace doing the cache flushing for > > mappings the kernel has set up. > > i915 is a very specific case of a fully contained, > architecture-specific hardware subsystem, where you can just hardcode > all integration details inside the driver, because nobody else would > care. > > In ARM world, you can have the same IP blocks licensed by multiple SoC > vendors with different integration details and that often includes the > option of coherency. fwiw, I believe all the GPU IP blocks that are used across multiple SoCs have their own GPU MMU (potentially in addition to an iommu?). So the dma-api is a much better fit for them.. drm/msm is a lot closer to drm/i915 scenario, so I don't so much care if the solution to our unique problem isn't something that would work for other drivers ;-) BR, -R
On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > > Maybe the thing we need to do is just implement a blacklist of > > > compatible strings for devices which should skip the automatic > > > iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > > preventing us from enabling per-process pagetables for a5xx (where we > > > need to control the domain/context-bank that is allocated by the dma > > > api). > > > > You can detach from the dma map attachment using arm_iommu_detach_device, > > which a few drm drivers do, but I don't think this is the problem. > > I think even with detach, we wouldn't end up with the context-bank > that the gpu firmware was hard-coded to expect, and so it would > overwrite the incorrect page table address register. (I could be > mis-remembering that, Jordan spent more time looking at that. But it > was something along those lines.) Right - basically the DMA domain steals context bank 0 and the GPU is hard coded to use that context bank for pagetable switching. I believe the Tegra guys also had a similar problem with a hard coded context bank. This is a discussion we do need to have but not at the risk of derailing the caching discussion which is arguably more important and has much wider ranging implications for multimedia and Ion and such. Jordan
On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: > > On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > > On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > > > Maybe the thing we need to do is just implement a blacklist of > > > > compatible strings for devices which should skip the automatic > > > > iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > > > preventing us from enabling per-process pagetables for a5xx (where we > > > > need to control the domain/context-bank that is allocated by the dma > > > > api). > > > > > > You can detach from the dma map attachment using arm_iommu_detach_device, > > > which a few drm drivers do, but I don't think this is the problem. > > > > I think even with detach, we wouldn't end up with the context-bank > > that the gpu firmware was hard-coded to expect, and so it would > > overwrite the incorrect page table address register. (I could be > > mis-remembering that, Jordan spent more time looking at that. But it > > was something along those lines.) > > Right - basically the DMA domain steals context bank 0 and the GPU is hard coded > to use that context bank for pagetable switching. > > I believe the Tegra guys also had a similar problem with a hard coded context > bank. Wait, if we detach the GPU/display struct device from the default domain and attach it to a newly allocated domain, wouldn't the newly allocated domain use the context bank we need? Note that we're already doing that, except that we're doing it behind the back of the DMA mapping subsystem, so that it keeps using the IOMMU version of the DMA ops for the device and doing any mapping operations on the default domain. If we ask the DMA mapping to detach, wouldn't it essentially solve the problem? Best regards, Tomasz
On 29/11/2018 19:57, Tomasz Figa wrote: > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: >> >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: >>>> >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: >>>>> Maybe the thing we need to do is just implement a blacklist of >>>>> compatible strings for devices which should skip the automatic >>>>> iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem >>>>> preventing us from enabling per-process pagetables for a5xx (where we >>>>> need to control the domain/context-bank that is allocated by the dma >>>>> api). >>>> >>>> You can detach from the dma map attachment using arm_iommu_detach_device, >>>> which a few drm drivers do, but I don't think this is the problem. >>> >>> I think even with detach, we wouldn't end up with the context-bank >>> that the gpu firmware was hard-coded to expect, and so it would >>> overwrite the incorrect page table address register. (I could be >>> mis-remembering that, Jordan spent more time looking at that. But it >>> was something along those lines.) >> >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded >> to use that context bank for pagetable switching. >> >> I believe the Tegra guys also had a similar problem with a hard coded context >> bank. AIUI, they don't need a specific hardware context, they just need to know which one they're actually using, which the domain abstraction hides. > Wait, if we detach the GPU/display struct device from the default > domain and attach it to a newly allocated domain, wouldn't the newly > allocated domain use the context bank we need? Note that we're already The arm-smmu driver doesn't, but there's no fundamental reason it couldn't. That should just need code to refcount domain users and release hardware contexts for domains with no devices currently attached. Robin. > doing that, except that we're doing it behind the back of the DMA > mapping subsystem, so that it keeps using the IOMMU version of the DMA > ops for the device and doing any mapping operations on the default > domain. If we ask the DMA mapping to detach, wouldn't it essentially > solve the problem? > > Best regards, > Tomasz >
On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote: > > On 29/11/2018 19:57, Tomasz Figa wrote: > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: > >> > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > >>>> > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > >>>>> Maybe the thing we need to do is just implement a blacklist of > >>>>> compatible strings for devices which should skip the automatic > >>>>> iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > >>>>> preventing us from enabling per-process pagetables for a5xx (where we > >>>>> need to control the domain/context-bank that is allocated by the dma > >>>>> api). > >>>> > >>>> You can detach from the dma map attachment using arm_iommu_detach_device, > >>>> which a few drm drivers do, but I don't think this is the problem. > >>> > >>> I think even with detach, we wouldn't end up with the context-bank > >>> that the gpu firmware was hard-coded to expect, and so it would > >>> overwrite the incorrect page table address register. (I could be > >>> mis-remembering that, Jordan spent more time looking at that. But it > >>> was something along those lines.) > >> > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded > >> to use that context bank for pagetable switching. > >> > >> I believe the Tegra guys also had a similar problem with a hard coded context > >> bank. > > AIUI, they don't need a specific hardware context, they just need to > know which one they're actually using, which the domain abstraction hides. > > > Wait, if we detach the GPU/display struct device from the default > > domain and attach it to a newly allocated domain, wouldn't the newly > > allocated domain use the context bank we need? Note that we're already > > The arm-smmu driver doesn't, but there's no fundamental reason it > couldn't. That should just need code to refcount domain users and > release hardware contexts for domains with no devices currently attached. > > Robin. > > > doing that, except that we're doing it behind the back of the DMA > > mapping subsystem, so that it keeps using the IOMMU version of the DMA > > ops for the device and doing any mapping operations on the default > > domain. If we ask the DMA mapping to detach, wouldn't it essentially > > solve the problem? Thanks Robin. Still, my point is that the MSM DRM driver attaches the GPU struct device to a new domain it allocates using iommu_domain_alloc() and it seems to work fine, so I believe it's not the problem we're looking into with this patch. Best regards, Tomasz
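For context, the arrangement Tomasz refers to, where the driver allocates its own unmanaged domain instead of relying on the default DMA domain, looks roughly like the simplified sketch below (error codes and names are illustrative, and this is not the actual msm_iommu code).

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/platform_device.h>

static struct iommu_domain *gpu_attach_own_domain(struct device *dev)
{
	struct iommu_domain *domain;

	/* Unmanaged domain, fully controlled by the driver. */
	domain = iommu_domain_alloc(&platform_bus_type);
	if (!domain)
		return ERR_PTR(-ENODEV);

	if (iommu_attach_device(domain, dev)) {
		iommu_domain_free(domain);
		return ERR_PTR(-EBUSY);
	}

	return domain;
}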
On Thu, Nov 29, 2018 at 10:57 AM Christoph Hellwig <hch@lst.de> wrote: > > On Thu, Nov 29, 2018 at 03:43:50PM +0100, Daniel Vetter wrote: > > Yeah we had patches to add manual cache management code to drm, so we > > don't have to abuse the dma streaming api anymore. Got shouted down. > > Abusing the dma streaming api also gets shouted down. It's a gpu, any > > idea of these drivers actually being platform independent is out of > > the window from the start anyway, so we're ok with tying this to > > platforms. > > Manual or not the iommu API is missing APIs for cache management, > which makes it kinda surprising it actually ever worked for non-coherent > devices. > > And fortunately while some people spent the last year ot two bickering > about the situation others actually did work, and we now have a > generic arch_sync_dma_for_device/arch_sync_dma_for_cpu kernel-internal > API. This is only used for DMA API internals so far, and explicitly > not intended for direct driver use, but it would be perfect as the > backend for iommu API cache maintainance functions. It exists on all > but two architectures on mainline. Out of those powerpc is in the works, > only arm32 will need some major help. oh, hmm, I realized I'd been looking still at the armv7 dma-mapping, I hadn't noticed arch_sync_* yet.. that does look like a step in the right direction. As far as hiding cache ops behind iommu layer, I guess I'd been thinking more of just letting the drivers that want to bypass dma layer take things into their own hands.. tbh I think/assume/hope drm/msm is more the exception than the rule as far as needing to bypass the dma layer. Or at least I guess the # of drivers having problems w/ the dma layer is less than the # of iommu drivers. (Not sure if that changes my thoughts on this patch, it isn't like what this patch replaces isn't also a problematic hack around the inability to bypass the dma layer. In the short term I just want *something* that works, I'm totally happy to refactor later when there are better options.) BR, -R
On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote: > [CC Marek] > > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote: > > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote: > > > > Just spend a bit of time reading through the implementations already > > > > merged. Is the struct device *dev parameter actually needed anywhere? > > > > dma-api definitely needs it, because we need that to pick the right iommu. > > > > But for cache management from what I've seen the target device doesn't > > > > matter, all the target specific stuff will be handled by the iommu. > > > > > > It looks like only mips every uses the dev argument, and even there > > > the function it passes dev to ignores it. So it could probably be removed. > > > > > Whether the cache maintenance operation needs to actually do anything > or not is a function of `dev`. We can have some devices that are > coherent with CPU caches, and some that are not, on the same system. So the thing is that the gpu driver knows this too. It fairly often can even decide at runtime (through surface state bits or gpu pte bits) whether to use coherent or non-coherent transactions. dma-api assuming that a given device is always coherent or non-coherent is one of the fundamental mismatches we have. If you otoh need dev because there's some strange bridge caches you need to flush (I've never seen that, but who knows), that would be a different thing. All the bridge flushing I've seen is attached to the iommu though, so would be really a surprise if the cache management needs that too. > > > > > > > > Dropping the dev parameter would make this a perfect fit for coherency > > > > management of buffers used by multiple devices. Right now there's all > > > > kinds of nasty tricks for that use cases needed to avoid redundant > > > > flushes. > > > > > > Note that one thing I'd like to avoid is exposing these funtions directly > > > to drivers, as that will get us into all kinds of abuses. > > > > What kind of abuse do you expect? It could very well be that gpu folks > > call that "standard use case" ... At least on x86 with the i915 driver > > we pretty much rely on architectural guarantees for how cache flushes > > work very much. Down to userspace doing the cache flushing for > > mappings the kernel has set up. > > i915 is a very specific case of a fully contained, > architecture-specific hardware subsystem, where you can just hardcode > all integration details inside the driver, because nobody else would > care. Nah, it's not fully contained, it's just with your arm eyewear on you can't see how badly it leaks all over the place. The problem is that GPUs (in many cases at least) want to decide whether and when to do cache maintenance. We need a combo of iommu and cache maintenance apis that allows this, across multiple devices, because the dma-api doesn't cut it. Aside: The cache maintenance api _must_ do the flush when asked to, even if it believes no cache maintenance is necessary. Even if the default mode is coherent for a platform/dev combo, the gpu driver might want to do a non-coherent transaction. > In ARM world, you can have the same IP blocks licensed by multiple SoC > vendors with different integration details and that often includes the > option of coherency. My experience is that for soc gpus, you need to retune up/download code for every soc. Or at least every family of socs. This is painful.
I guess thus far the arm soc gpu drivers we have merged aren't the ones that needed widely different tuning on different soc families/ranges. What's worse, it's userspace who will decide whether to use the coherent or non-coherent paths in many cases (that's at least how it has worked for decades on the big gpus, small gpus just lag a few years usually). The kernel only provides the tools for the userspace opengl/vulkan driver to do all the things. Trying to hide coherent vs. non-coherent like it's done for everyone else is probably not going to work for gpus. In fact, hasn't ever worked for gpus thus far, and unlikely to change I think. -Daniel > > > So I'd much prefer if we could have iommu APIs wrapping these that are > > > specific to actual uses cases that we understand well. > > > > > > As for the buffer sharing: at least for the DMA API side I want to > > > move the current buffer sharing users away from dma_alloc_coherent > > > (and coherent dma_alloc_attrs users) and the remapping done in there > > > required for non-coherent architectures. Instead I'd like to allocate > > > plain old pages, and then just dma map them for each device separately, > > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > > > or last user to unmap. On the iommu side it could probably work > > > similar. > > > > I think this is what's often done. Except then there's also the issue > > of how to get at the cma allocator if your device needs something > > contiguous. There's a lot of that still going on in graphics/media. > > I suppose one could just expose CMA with the default pool directly. > It's just an allocator, so not sure why it would need any > device-specific information. > > There is also the use case of using CMA with device-specific pools of > memory reusable by the system when not used by the device and those > would have to somehow get the pool to allocate from, but I wonder if > struct device is the right way to pass such information. I'd see the > pool given explicitly like cma_alloc(struct cma_pool *, size, flags) > and perhaps a wrapper cma_alloc_default(size, flags) that is just a > simple macro calling cma_alloc(&cma_pool_default, size, flags). > > Best regards, > Tomasz
On Thu, Nov 29, 2018 at 08:15:23PM -0500, Rob Clark wrote: > As far as hiding cache ops behind iommu layer, I guess I'd been > thinking more of just letting the drivers that want to bypass dma > layer take things into their own hands.. tbh I think/assume/hope > drm/msm is more the exception than the rule as far as needing to > bypass the dma layer. Or at least I guess the # of drivers having > problems w/ the dma layer is less than the # of iommu drivers. So the whole bypass thing has already been a contentious discussion in the past. One thing that the API already enforces is that we pass a DMA direction, which I want to keep. The other is that we need a struct device (or something similar) that is used to check if the device is cache coherent or not. In your MSM case you might know that, but in general it really is a platform issue that I don't want every driver to know about. > (Not sure if that changes my thoughts on this patch, it isn't like > what this patch replaces isn't also a problematic hack around the > inability to bypass the dma layer. In the short term I just want > *something* that works, I'm totally happy to refactor later when there > are better options.) The current patch is simply broken. You can't just construct your own S/G list and pass it to the DMA API, and we enforce that very strictly with dma debug enabled. So your only options are: a) actually use the DMA API for creating the mapping, by e.g. using dma_direct_ops on top of an actual IOMMU managed in the background, or b) figure out a way to do cache maintenance for raw IOMMU API drivers.
On Thu, Nov 29, 2018 at 01:57:38PM -0500, Rob Clark wrote: > On Thu, Nov 29, 2018 at 12:24 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > [CC Marek] > > > > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > > > > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > Note that one thing I'd like to avoid is exposing these funtions directly > > > > to drivers, as that will get us into all kinds of abuses. > > > > > > What kind of abuse do you expect? It could very well be that gpu folks > > > call that "standard use case" ... At least on x86 with the i915 driver > > > we pretty much rely on architectural guarantees for how cache flushes > > > work very much. Down to userspace doing the cache flushing for > > > mappings the kernel has set up. > > > > i915 is a very specific case of a fully contained, > > architecture-specific hardware subsystem, where you can just hardcode > > all integration details inside the driver, because nobody else would > > care. > > > > In ARM world, you can have the same IP blocks licensed by multiple SoC > > vendors with different integration details and that often includes the > > option of coherency. > > fwiw, I believe all the GPU IP blocks that are used across multiple > SoCs have their own GPU MMU (potentially in addition to an iommu?). > So the dma-api is a much better fit for them.. drm/msm is a lot > closer to drm/i915 scenario, so I don't so much care if the solution > to our unique problem isn't something that would work for other > drivers ;-) Right now maybe, but I fully expect the entire coherent vs. non-coherent transactions hilarity that we have on the bigger intel socs for a few years already to trickle down into smaller (arm based) socs eventually. I think Apple is already there for a few generations. So maybe we won't have to fight the iommu side of the dma-api anymore on these, but we'll still have to fight the cache maintenance side of dma-api. You can tell the dma-api to not flush, but then you don't have any other way to actually flush that's acceptable for arch/arm (on x86 we just run clflush in userspace and call it a day). -Daniel
On Fri, Nov 30, 2018 at 10:35:27AM +0100, Daniel Vetter wrote: > > Whether the cache maintenance operation needs to actually do anything > > or not is a function of `dev`. We can have some devices that are > > coherent with CPU caches, and some that are not, on the same system. > > So the thing is that the gpu driver knows this too. It fairly often can > even decide at runtime (through surface state bits or gpu pte bits) > whether to use coherent or non-coherent transactions. dma-api assuming > that a given device is always coherent or non-coherent is one of the > fundamental mismatches we have. > > If you otoh need dev because there's some strange bridge caches you need > to flush (I've never seen that, but who knows), that would be a diffeernt > thing. All the bridge flushing I've seen is attached to the iommu though, > so would be really a surprise if the cache management needs that too. Strange bridge caches aren't the problem. Outside of magic components like SoC-integrated GPUs the issue is that a platform can wire up a PCIe/AXI/etc bus either so that it is cache coherent, or not cache coherent (no snooping). Drivers need to use the full DMA API including dma_sync_* to cater for the non-coherent case, which will turn into no-ops if DMA is coherent. PCIe now has unsnooped transactions, which can be non-coherent even if the bus would otherwise be coherent. We have so far very much ignored those in Linux (at least Linux in general, I know you guys have some driver-local hacks), but if that use case, or similar ones for GPUs on SOCs become common we need to find a solution. One of the major issues here is that architectures that are always DMA coherent might not even have the cache flushing instructions, or even if they do we have not wired them up in the DMA code as we didn't need them. So what we'd need to support this properly is to do something like: - add a new arch hook that allows an architecture to say it supports optional non-coherent at the arch level, and for a given device - wire up arch_sync_dma_for_{device,cpu} for those architectures that define it if they don't currently have it (e.g. x86) - add a new DMA_ATTR_* flag to opt into cache flushing even if the device declares it is otherwise coherent And I'd really like to see that work driven by an actual user.
On Thu, Nov 29, 2018 at 07:33:34PM +0100, Christoph Hellwig wrote: > On Thu, Nov 29, 2018 at 06:09:05PM +0100, Daniel Vetter wrote: > > What kind of abuse do you expect? It could very well be that gpu folks > > call that "standard use case" ... At least on x86 with the i915 driver > > we pretty much rely on architectural guarantees for how cache flushes > > work very much. Down to userspace doing the cache flushing for > > mappings the kernel has set up. > > Mostly the usual bypasses of the DMA API because people know better > (and with that I don't mean low-level IOMMU API users, but "creative" > direct mappings). Ah yes. I think that even gpu folks get behind :-) Best if someone bothered to explicitly cast to dma_addr_t to shut up the tools, but not fix the bug ... > > > As for the buffer sharing: at least for the DMA API side I want to > > > move the current buffer sharing users away from dma_alloc_coherent > > > (and coherent dma_alloc_attrs users) and the remapping done in there > > > required for non-coherent architectures. Instead I'd like to allocate > > > plain old pages, and then just dma map them for each device separately, > > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > > > or last user to unmap. On the iommu side it could probably work > > > similar. > > > > I think this is what's often done. Except then there's also the issue > > of how to get at the cma allocator if your device needs something > > contiguous. There's a lot of that still going on in graphics/media. > > Being able to dip into CMA and mayb iommu coalescing if we want to > get fancy is indeed the only reason for this API. If we just wanted > to map pages we could already do that now with just a little bit > of boilerplate code (and quite a few drivers do - just adding this > new API will remove tons of code). Sounds like the future is very bright indeed \o/ -Daniel
On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote: > > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote: > > > > On 29/11/2018 19:57, Tomasz Figa wrote: > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: > > >> > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > >>>> > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > >>>>> Maybe the thing we need to do is just implement a blacklist of > > >>>>> compatible strings for devices which should skip the automatic > > >>>>> iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we > > >>>>> need to control the domain/context-bank that is allocated by the dma > > >>>>> api). > > >>>> > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device, > > >>>> which a few drm drivers do, but I don't think this is the problem. > > >>> > > >>> I think even with detach, we wouldn't end up with the context-bank > > >>> that the gpu firmware was hard-coded to expect, and so it would > > >>> overwrite the incorrect page table address register. (I could be > > >>> mis-remembering that, Jordan spent more time looking at that. But it > > >>> was something along those lines.) > > >> > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded > > >> to use that context bank for pagetable switching. > > >> > > >> I believe the Tegra guys also had a similar problem with a hard coded context > > >> bank. > > > > AIUI, they don't need a specific hardware context, they just need to > > know which one they're actually using, which the domain abstraction hides. > > > > > Wait, if we detach the GPU/display struct device from the default > > > domain and attach it to a newly allocated domain, wouldn't the newly > > > allocated domain use the context bank we need? Note that we're already > > > > The arm-smmu driver doesn't, but there's no fundamental reason it > > couldn't. That should just need code to refcount domain users and > > release hardware contexts for domains with no devices currently attached. > > > > Robin. > > > > > doing that, except that we're doing it behind the back of the DMA > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA > > > ops for the device and doing any mapping operations on the default > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially > > > solve the problem? > > Thanks Robin. > > Still, my point is that the MSM DRM driver attaches the GPU struct > device to a new domain it allocates using iommu_domain_alloc() and it > seems to work fine, so I believe it's not the problem we're looking > into with this patch. Could we just make the MSM DRM call arch_teardown_dma_ops() and then arch_setup_dma_ops() with the `iommu` argument set to NULL and be done with it? Best regards, Tomasz
On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote: > > On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote: > > > > > > On 29/11/2018 19:57, Tomasz Figa wrote: > > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: > > > >> > > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > > >>>> > > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > > >>>>> Maybe the thing we need to do is just implement a blacklist of > > > >>>>> compatible strings for devices which should skip the automatic > > > >>>>> iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we > > > >>>>> need to control the domain/context-bank that is allocated by the dma > > > >>>>> api). > > > >>>> > > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device, > > > >>>> which a few drm drivers do, but I don't think this is the problem. > > > >>> > > > >>> I think even with detach, we wouldn't end up with the context-bank > > > >>> that the gpu firmware was hard-coded to expect, and so it would > > > >>> overwrite the incorrect page table address register. (I could be > > > >>> mis-remembering that, Jordan spent more time looking at that. But it > > > >>> was something along those lines.) > > > >> > > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded > > > >> to use that context bank for pagetable switching. > > > >> > > > >> I believe the Tegra guys also had a similar problem with a hard coded context > > > >> bank. > > > > > > AIUI, they don't need a specific hardware context, they just need to > > > know which one they're actually using, which the domain abstraction hides. > > > > > > > Wait, if we detach the GPU/display struct device from the default > > > > domain and attach it to a newly allocated domain, wouldn't the newly > > > > allocated domain use the context bank we need? Note that we're already > > > > > > The arm-smmu driver doesn't, but there's no fundamental reason it > > > couldn't. That should just need code to refcount domain users and > > > release hardware contexts for domains with no devices currently attached. > > > > > > Robin. > > > > > > > doing that, except that we're doing it behind the back of the DMA > > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA > > > > ops for the device and doing any mapping operations on the default > > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially > > > > solve the problem? > > > > Thanks Robin. > > > > Still, my point is that the MSM DRM driver attaches the GPU struct > > device to a new domain it allocates using iommu_domain_alloc() and it > > seems to work fine, so I believe it's not the problem we're looking > > into with this patch. > > Could we just make the MSM DRM call arch_teardown_dma_ops() and then > arch_setup_dma_ops() with the `iommu` argument set to NULL and be done > with it? I don't think those are exported to modules? I have actually a simpler patch, that adds a small blacklist to check in of_dma_configure() before calling arch_setup_dma_ops(), which can replace this patch. 
It also solves the problem of the dma api allocating the context bank that the gpu wants to use for context-switching, and should be a simple thing to backport to stable branches. I was just spending some time trying to figure out what changed recently to start causing dma_map_sg() to oops on boot for us, so I could write a more detailed commit msg. BR, -R
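The patch Rob mentions is not part of this thread; as a rough idea of the shape such a blacklist check could take, a sketch follows, with the helper name and the compatible strings chosen purely for illustration.

#include <linux/kernel.h>
#include <linux/of.h>

/* Illustrative list: devices whose drivers manage the IOMMU themselves. */
static const char * const iommu_dma_blacklist[] = {
	"qcom,adreno",
	"qcom,mdss",
};

static bool of_dma_iommu_blacklisted(struct device_node *np)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(iommu_dma_blacklist); i++)
		if (of_device_is_compatible(np, iommu_dma_blacklist[i]))
			return true;

	return false;
}

/*
 * of_dma_configure() would then pass a NULL iommu to arch_setup_dma_ops()
 * for matching devices, leaving the IOMMU entirely to the driver.
 */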
On Sat, Dec 1, 2018 at 3:47 AM Rob Clark <robdclark@gmail.com> wrote: > > On Fri, Nov 30, 2018 at 9:05 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > On Thu, Nov 29, 2018 at 4:23 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > > > On Thu, Nov 29, 2018 at 12:03 PM Robin Murphy <robin.murphy@arm.com> wrote: > > > > > > > > On 29/11/2018 19:57, Tomasz Figa wrote: > > > > > On Thu, Nov 29, 2018 at 11:40 AM Jordan Crouse <jcrouse@codeaurora.org> wrote: > > > > >> > > > > >> On Thu, Nov 29, 2018 at 01:48:15PM -0500, Rob Clark wrote: > > > > >>> On Thu, Nov 29, 2018 at 10:54 AM Christoph Hellwig <hch@lst.de> wrote: > > > > >>>> > > > > >>>> On Thu, Nov 29, 2018 at 09:42:50AM -0500, Rob Clark wrote: > > > > >>>>> Maybe the thing we need to do is just implement a blacklist of > > > > >>>>> compatible strings for devices which should skip the automatic > > > > >>>>> iommu/dma hookup. Maybe a bit ugly, but it would also solve a problem > > > > >>>>> preventing us from enabling per-process pagetables for a5xx (where we > > > > >>>>> need to control the domain/context-bank that is allocated by the dma > > > > >>>>> api). > > > > >>>> > > > > >>>> You can detach from the dma map attachment using arm_iommu_detach_device, > > > > >>>> which a few drm drivers do, but I don't think this is the problem. > > > > >>> > > > > >>> I think even with detach, we wouldn't end up with the context-bank > > > > >>> that the gpu firmware was hard-coded to expect, and so it would > > > > >>> overwrite the incorrect page table address register. (I could be > > > > >>> mis-remembering that, Jordan spent more time looking at that. But it > > > > >>> was something along those lines.) > > > > >> > > > > >> Right - basically the DMA domain steals context bank 0 and the GPU is hard coded > > > > >> to use that context bank for pagetable switching. > > > > >> > > > > >> I believe the Tegra guys also had a similar problem with a hard coded context > > > > >> bank. > > > > > > > > AIUI, they don't need a specific hardware context, they just need to > > > > know which one they're actually using, which the domain abstraction hides. > > > > > > > > > Wait, if we detach the GPU/display struct device from the default > > > > > domain and attach it to a newly allocated domain, wouldn't the newly > > > > > allocated domain use the context bank we need? Note that we're already > > > > > > > > The arm-smmu driver doesn't, but there's no fundamental reason it > > > > couldn't. That should just need code to refcount domain users and > > > > release hardware contexts for domains with no devices currently attached. > > > > > > > > Robin. > > > > > > > > > doing that, except that we're doing it behind the back of the DMA > > > > > mapping subsystem, so that it keeps using the IOMMU version of the DMA > > > > > ops for the device and doing any mapping operations on the default > > > > > domain. If we ask the DMA mapping to detach, wouldn't it essentially > > > > > solve the problem? > > > > > > Thanks Robin. > > > > > > Still, my point is that the MSM DRM driver attaches the GPU struct > > > device to a new domain it allocates using iommu_domain_alloc() and it > > > seems to work fine, so I believe it's not the problem we're looking > > > into with this patch. > > > > Could we just make the MSM DRM call arch_teardown_dma_ops() and then > > arch_setup_dma_ops() with the `iommu` argument set to NULL and be done > > with it? > > I don't think those are exported to modules? > Indeed, if we compile MSM DRM as modules, it wouldn't work... 
> I have actually a simpler patch, that adds a small blacklist to check > in of_dma_configure() before calling arch_setup_dma_ops(), which can > replace this patch. It also solves the problem of the dma api allocating > the context bank that the gpu wants to use for context-switching, and > should be a simple thing to backport to stable branches. > > I was just spending some time trying to figure out what changed > recently to start causing dma_map_sg() to oops on boot for us, so I > could write a more detailed commit msg. Yeah, that sounds much better, thanks. Reviewed that patch. Best regards, Tomasz
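For illustration only, the blacklist idea described above could look something like the sketch below inside drivers/of/device.c. The helper name and compatible strings are made-up examples, not taken from the actual patch (which is not part of this thread):

#include <linux/kernel.h>
#include <linux/of.h>

static const char * const iommu_dma_skip_list[] = {
	"qcom,adreno",	/* hypothetical example entry */
	"qcom,mdss",	/* hypothetical example entry */
};

static bool of_dma_skip_iommu(struct device_node *np)
{
	int i;

	/* skip the automatic IOMMU/DMA hookup for blacklisted devices */
	for (i = 0; i < ARRAY_SIZE(iommu_dma_skip_list); i++)
		if (of_device_is_compatible(np, iommu_dma_skip_list[i]))
			return true;

	return false;
}

/*
 * ...and in of_dma_configure(), just before the DMA ops are set up
 * (argument names here are assumptions about that era's code):
 *
 *	if (of_dma_skip_iommu(np))
 *		iommu = NULL;
 *
 *	arch_setup_dma_ops(dev, dma_addr, size, iommu, coherent);
 */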
On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote: > > Being able to dip into CMA and maybe iommu coalescing if we want to > > get fancy is indeed the only reason for this API. If we just wanted > > to map pages we could already do that now with just a little bit > > of boilerplate code (and quite a few drivers do - just adding this > > new API will remove tons of code). > > Sounds like the future is very bright indeed \o/ So, I spent some time with this, and instead of a new API I think it makes more sense to ensure we have DMA_ATTR_NON_CONSISTENT consistently available and with well-defined semantics: that is, virt_to_page on the return value works, the allocation is contiguous, and we can use dma_sync_single_for_cpu and dma_sync_single_for_device for ownership transfers. Can you graphics folks check if this: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator is something to work with? Especially to get rid of the horrible dma_get_sgtable hacks?
On Thu, Dec 6, 2018 at 8:38 PM Christoph Hellwig <hch@lst.de> wrote: > > On Fri, Nov 30, 2018 at 10:46:04AM +0100, Daniel Vetter wrote: > > > Being able to dip into CMA and maybe iommu coalescing if we want to > > > get fancy is indeed the only reason for this API. If we just wanted > > > to map pages we could already do that now with just a little bit > > > of boilerplate code (and quite a few drivers do - just adding this > > > new API will remove tons of code). > > > > Sounds like the future is very bright indeed \o/ > > So, I spent some time with this, and instead of a new API I think > it makes more sense to ensure we have DMA_ATTR_NON_CONSISTENT consistently > available and with well-defined semantics: that is, virt_to_page on the return > value works, the allocation is contiguous, and we can use dma_sync_single_for_cpu > and dma_sync_single_for_device for ownership transfers. > > Can you graphics folks check if this: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-noncoherent-allocator > > is something to work with? Especially to get rid of the horrible > dma_get_sgtable hacks? I'm not sure I really have a strong opinion about this. For some of the display-only SoC drm drivers which use the CMA helpers, it might be useful. For the drm drivers for real GPUs, we aren't really using the dma api to allocate in the first place. (And the direction is towards more support for userptr allocations.) BR, -R
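As a hedged sketch of the usage model being proposed for DMA_ATTR_NON_CONSISTENT (the exact guarantees depend on the dma-noncoherent-allocator branch above, not on what mainline promised at the time; the function name and flow below are hypothetical), a driver-side buffer could be handled like this:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/string.h>

static int example_noncoherent_buffer(struct device *dev, size_t size)
{
	dma_addr_t dma_handle;
	struct page *page;
	void *vaddr;

	vaddr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL,
				DMA_ATTR_NON_CONSISTENT);
	if (!vaddr)
		return -ENOMEM;

	/* proposed semantics: the buffer is contiguous and page-backed */
	page = virt_to_page(vaddr);	/* e.g. usable to build an sg_table */

	/* CPU fills the buffer, then hands ownership to the device */
	memset(vaddr, 0, size);
	dma_sync_single_for_device(dev, dma_handle, size, DMA_BIDIRECTIONAL);

	/* ... device DMA happens here ... */

	/* take ownership back before the CPU touches the buffer again */
	dma_sync_single_for_cpu(dev, dma_handle, size, DMA_BIDIRECTIONAL);

	dma_free_attrs(dev, size, vaddr, dma_handle, DMA_ATTR_NON_CONSISTENT);
	return 0;
}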
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 00c795ced02c..7048e9fe00c6 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -81,6 +81,8 @@ static struct page **get_pages(struct drm_gem_object *obj)
 		struct drm_device *dev = obj->dev;
 		struct page **p;
 		int npages = obj->size >> PAGE_SHIFT;
+		struct scatterlist *s;
+		int i;
 
 		if (use_pages(obj))
 			p = drm_gem_get_pages(obj);
@@ -104,12 +106,23 @@ static struct page **get_pages(struct drm_gem_object *obj)
 			return ptr;
 		}
 
-		/* For non-cached buffers, ensure the new pages are clean
+		/*
+		 * Some implementations of the DMA mapping ops expect
+		 * physical addresses of the pages to be stored as DMA
+		 * addresses of the sglist entries. To work around it,
+		 * set them here explicitly.
+		 */
+		for_each_sg(msm_obj->sgt->sgl, s, msm_obj->sgt->nents, i)
+			sg_dma_address(s) = sg_phys(s);
+
+		/*
+		 * For non-cached buffers, ensure the new pages are clean
 		 * because display controller, GPU, etc. are not coherent:
 		 */
-		if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-			dma_map_sg(dev->dev, msm_obj->sgt->sgl,
-					msm_obj->sgt->nents, DMA_BIDIRECTIONAL);
+		if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+			dma_sync_sg_for_device(dev->dev, msm_obj->sgt->sgl,
+					       msm_obj->sgt->nents,
+					       DMA_BIDIRECTIONAL);
 	}
 
 	return msm_obj->pages;
@@ -133,14 +146,16 @@ static void put_pages(struct drm_gem_object *obj)
 	if (msm_obj->pages) {
 		if (msm_obj->sgt) {
-			/* For non-cached buffers, ensure the new
+			/*
+			 * For non-cached buffers, ensure the new
 			 * pages are clean because display controller,
 			 * GPU, etc. are not coherent:
 			 */
-			if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-				dma_unmap_sg(obj->dev->dev, msm_obj->sgt->sgl,
-						msm_obj->sgt->nents,
-						DMA_BIDIRECTIONAL);
+			if (msm_obj->flags & (MSM_BO_WC | MSM_BO_UNCACHED))
+				dma_sync_sg_for_cpu(obj->dev->dev,
+						    msm_obj->sgt->sgl,
+						    msm_obj->sgt->nents,
+						    DMA_BIDIRECTIONAL);
 			sg_free_table(msm_obj->sgt);
 			kfree(msm_obj->sgt);
dma_map_sg() expects a DMA domain. However, the drm devices
have been traditionally using unmanaged iommu domain which
is non-dma type. Using dma mapping APIs with that domain is bad.

Replace dma_map_sg() calls with dma_sync_sg_for_device{|cpu}()
to do the cache maintenance.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
Suggested-by: Tomasz Figa <tfiga@chromium.org>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jordan Crouse <jcrouse@codeaurora.org>
Cc: Sean Paul <seanpaul@chromium.org>
---
Changes since v2:
 - Addressed Tomasz's comment to keep DMA_BIDIRECTIONAL dma direction
   flag intact.
 - Updated comment for sg's dma-address assignment as per Tomasz'
   suggestion.

 drivers/gpu/drm/msm/msm_gem.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)