Message ID: 20190814220011.26934-1-robdclark@gmail.com (mailing list archive)
Series: drm+dma: cache support for arm, etc
As said before I don't think these low-level helpers are the right API
to export, but even if they were you'd just cover a tiny subset of the
architectures.

Also, to distil the previous thread: if you remap memory to uncached,
the helper to use is arch_dma_prep_coherent, which does a writeback +
invalidate everywhere, and there is no need to clean up after a
long-term uncached mapping. We might still get speculation into that
area if we don't remap the direct mapping, but invalidating it just
before freeing the memory isn't going to help anyone.

Also, it seems like patches 5 and 6 are missing from my inbox.
On Wed, Aug 14, 2019 at 11:51 PM Christoph Hellwig <hch@lst.de> wrote:
>
> As said before I don't think these low-level helpers are the right API
> to export, but even if they were you'd just cover a tiny subset of the
> architectures.

Are you thinking instead something like:

void dma_sync_sg_for_{cpu,device}(struct device *dev, struct scatterlist *sgl,
		int nents, enum dma_data_direction dir)
{
	for_each_sg(sgl, sg, nents, i) {
		arch_sync_dma_for_..(dev, sg_phys(sg), sg->length, dir);
	}
}
EXPORT_SYMBOL_GPL(dma_sync_sg_for_..)

or did you have something else in mind? I guess something like this
would avoid figuring out *which* archs actually build drm..

> Also, to distil the previous thread: if you remap memory to uncached,
> the helper to use is arch_dma_prep_coherent, which does a writeback +
> invalidate everywhere, and there is no need to clean up after a
> long-term uncached mapping. We might still get speculation into that
> area if we don't remap the direct mapping, but invalidating it just
> before freeing the memory isn't going to help anyone.

hmm, IIUC the aarch64 cache instructions, what I'm doing now is equiv
to what I would get with dma_map_sg(DMA_BIDIRECTIONAL), and
arch_dma_prep_coherent() is equiv to what I'd get w/ DMA_FROM_DEVICE
(but a single pass instead of separate inv+clean passes).. but I can
respin this with a single dma_prep_coherent_sg() which uses
arch_dma_prep_coherent()..

> Also, it seems like patches 5 and 6 are missing from my inbox.

Hmm, not entirely sure why.. you should be on the cc list for each
individual patch. But here is the patchwork link if you want to see
them:

https://patchwork.freedesktop.org/series/65211/

BR,
-R
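A hedged sketch of the dma_prep_coherent_sg() idea floated above (the
helper name and placement are hypothetical, taken from the message;
arch_dma_prep_coherent() is the existing arch hook):

#include <linux/dma-noncoherent.h>
#include <linux/export.h>
#include <linux/scatterlist.h>

/* Hypothetical helper, per the suggestion above: a single
 * writeback+invalidate pass over an sg list, in preparation for a
 * long-term uncached mapping.  Nothing needs to be undone at free
 * time, per the earlier point in this thread. */
void dma_prep_coherent_sg(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i)
		arch_dma_prep_coherent(sg_page(sg), sg->length);
}
EXPORT_SYMBOL_GPL(dma_prep_coherent_sg);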
On Thu, Aug 15, 2019 at 06:54:39AM -0700, Rob Clark wrote:
> On Wed, Aug 14, 2019 at 11:51 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > As said before I don't think these low-level helpers are the right API
> > to export, but even if they were you'd just cover a tiny subset of the
> > architectures.
>
> Are you thinking instead something like:
>
> void dma_sync_sg_for_{cpu,device}(struct device *dev, struct scatterlist *sgl,
> 		int nents, enum dma_data_direction dir)
> {
> 	for_each_sg(sgl, sg, nents, i) {
> 		arch_sync_dma_for_..(dev, sg_phys(sg), sg->length, dir);
> 	}
> }
> EXPORT_SYMBOL_GPL(dma_sync_sg_for_..)
>
> or did you have something else in mind?

No. We really need an interface that says "please give me uncached
memory" (for some definition of uncached that includes what graphics
drivers call write combine), and then let the architecture do the right
thing. Basically dma_alloc_coherent with DMA_ATTR_NO_KERNEL_MAPPING
is superficially close to what you want, except that the way the drm
drivers work you can't actually use it.

The reason for that is that, if we can, we really need to not create
another uncachable alias, but instead change the page attributes in
place. On x86 we can and must do that, for example, and based on the
conversation with Will arm64 could do that fairly easily. arm32 can
right now only do that for CMA, though.

The big question is what API we want. I had a pretty similar
discussion with Christian on doing such an allocation for amdgpu,
where the device normally is cache coherent, but they actually want
to turn it into non-coherent by using PCIe unsnooped transactions.

Here is my high level plan, which still has a few loose ends:

 (1) provide a new API:

	struct page *dma_alloc_pages(struct device *dev, unsigned nr_pages,
			gfp_t gfp, unsigned long flags);
	void dma_free_pages(struct device *dev, unsigned nr_pages,
			unsigned long flags);

     These give you back page backed memory that is guaranteed to be
     addressable by the device (no swiotlb or similar). The memory can
     then be mapped using dma_map*, including unmap and dma_sync to
     bounce ownership around. This is the replacement for the current
     dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT API, which is rather
     badly defined.

 (2) Add support for DMA_ATTR_NO_KERNEL_MAPPING to this new API instead
     of dma_alloc_attrs. The initial difference with that flag is just
     that we allow highmem, but in the future we could also unmap this
     memory from the kernel linear mapping entirely on architectures
     where we can easily do that.

 (3) Add a dma_pages_map/dma_pages_unmap or similar API that allows you
     to get a kernel mapping for parts or all of a
     DMA_ATTR_NO_KERNEL_MAPPING allocation. This is to replace things
     like your open-coded vmap in msm (or similarly elsewhere in dma-buf
     providers).

 (4) Add support for a DMA_ATTR_UNCACHABLE flag (or similar) to the new
     API, that maps the pages as uncachable iff they have a kernel
     mapping, including invalidating the caches at the time of this page
     attribute change (or creation of a new mapping). This API will fail
     if the architecture does not allow in-place remapping. Note that for
     arm32 we could always dip into the CMA pool if one is present to not
     fail. We'll also need some helper to map from the DMA_ATTR_* flags
     to a pgprot for mapping the page to userspace. There are also a few
     other weird bits here, e.g. on architectures like mips that use an
     uncached segment we'll have to fail requests with the plain
     DMA_ATTR_UNCACHABLE flag, but it could be supported with
     DMA_ATTR_UNCACHABLE | DMA_ATTR_NO_KERNEL_MAPPING.
I was hoping to get most of this done for this merge window, but I'll
probably be lucky if I get even parts done. Too much distraction.

> Hmm, not entirely sure why.. you should be on the cc list for each
> individual patch.

They finally made it, although even with the delay they only ended up
in the spam mailbox. I still can't see them on the various mailing
lists.
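To illustrate how a driver might consume the proposed API, here is a
minimal sketch against the dma_alloc_pages()/dma_free_pages() prototypes
from the plan above (those two calls are only proposed, not in any
released kernel; the surrounding function is illustrative, and error
handling is abbreviated):

#include <linux/dma-mapping.h>

/* Sketch: allocate device-addressable pages per point (1) of the plan,
 * then bounce ownership around with the existing streaming-DMA API. */
static int dma_alloc_pages_sketch(struct device *dev, unsigned nr_pages)
{
	size_t size = nr_pages * PAGE_SIZE;
	struct page *page;
	dma_addr_t addr;

	page = dma_alloc_pages(dev, nr_pages, GFP_KERNEL, 0);
	if (!page)
		return -ENOMEM;

	addr = dma_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, addr)) {
		dma_free_pages(dev, nr_pages, 0);
		return -ENOMEM;
	}

	/* ... device DMA here; use dma_sync_single_for_cpu() and
	 * dma_sync_single_for_device() to pass ownership back and
	 * forth between CPU and device ... */

	dma_unmap_page(dev, addr, size, DMA_BIDIRECTIONAL);
	dma_free_pages(dev, nr_pages, 0);
	return 0;
}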
On 15.08.19 at 19:53, Christoph Hellwig wrote:
> On Thu, Aug 15, 2019 at 06:54:39AM -0700, Rob Clark wrote:
>> On Wed, Aug 14, 2019 at 11:51 PM Christoph Hellwig <hch@lst.de> wrote:
>>> As said before I don't think these low-level helpers are the right API
>>> to export, but even if they were you'd just cover a tiny subset of the
>>> architectures.
>> Are you thinking instead something like:
>>
>> void dma_sync_sg_for_{cpu,device}(struct device *dev, struct scatterlist *sgl,
>> 		int nents, enum dma_data_direction dir)
>> {
>> 	for_each_sg(sgl, sg, nents, i) {
>> 		arch_sync_dma_for_..(dev, sg_phys(sg), sg->length, dir);
>> 	}
>> }
>> EXPORT_SYMBOL_GPL(dma_sync_sg_for_..)
>>
>> or did you have something else in mind?
> No. We really need an interface that says "please give me uncached
> memory" (for some definition of uncached that includes what graphics
> drivers call write combine), and then let the architecture do the right
> thing. Basically dma_alloc_coherent with DMA_ATTR_NO_KERNEL_MAPPING
> is superficially close to what you want, except that the way the drm
> drivers work you can't actually use it.

The terminology graphics drivers use is USWC (Uncached Speculative
Write Combine). Essentially it is a leftover from the AGP days, when
graphics devices couldn't snoop the CPU caches.

Nowadays they either don't want to snoop the CPU caches for performance
reasons, or certain areas of system memory must not be cached because of
the special requirements of certain functionality.

For example, the "VRAM" on AMD GPUs which are integrated into the CPU is
just stolen system memory. You can then scan out your picture to the
display from this memory or from "normal" system memory, but only if the
system memory is mapped as USWC.

> The reason for that is that, if we can, we really need to not create
> another uncachable alias, but instead change the page attributes in
> place. On x86 we can and must do that, for example, and based on the
> conversation with Will arm64 could do that fairly easily. arm32 can
> right now only do that for CMA, though.
>
> The big question is what API we want. I had a pretty similar
> discussion with Christian on doing such an allocation for amdgpu,
> where the device normally is cache coherent, but they actually want
> to turn it into non-coherent by using PCIe unsnooped transactions.
>
> Here is my high level plan, which still has a few loose ends:
>
>  (1) provide a new API:
>
> 	struct page *dma_alloc_pages(struct device *dev, unsigned nr_pages,
> 			gfp_t gfp, unsigned long flags);
> 	void dma_free_pages(struct device *dev, unsigned nr_pages,
> 			unsigned long flags);
>
>      These give you back page backed memory that is guaranteed to be
>      addressable by the device (no swiotlb or similar). The memory can
>      then be mapped using dma_map*, including unmap and dma_sync to
>      bounce ownership around. This is the replacement for the current
>      dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT API, which is rather
>      badly defined.
>
>  (2) Add support for DMA_ATTR_NO_KERNEL_MAPPING to this new API instead
>      of dma_alloc_attrs. The initial difference with that flag is just
>      that we allow highmem, but in the future we could also unmap this
>      memory from the kernel linear mapping entirely on architectures
>      where we can easily do that.

Mhm, why would we want to do this?

>  (3) Add a dma_pages_map/dma_pages_unmap or similar API that allows you
>      to get a kernel mapping for parts or all of a
>      DMA_ATTR_NO_KERNEL_MAPPING allocation. This is to replace things
>      like your open-coded vmap in msm (or similarly elsewhere in dma-buf
>      providers).
>
>  (4) Add support for a DMA_ATTR_UNCACHABLE flag (or similar) to the new
>      API, that maps the pages as uncachable iff they have a kernel
>      mapping, including invalidating the caches at the time of this page
>      attribute change (or creation of a new mapping). This API will fail
>      if the architecture does not allow in-place remapping. Note that for
>      arm32 we could always dip into the CMA pool if one is present to not
>      fail. We'll also need some helper to map from the DMA_ATTR_* flags
>      to a pgprot for mapping the page to userspace. There are also a few
>      other weird bits here, e.g. on architectures like mips that use an
>      uncached segment we'll have to fail requests with the plain
>      DMA_ATTR_UNCACHABLE flag, but it could be supported with
>      DMA_ATTR_UNCACHABLE | DMA_ATTR_NO_KERNEL_MAPPING.
>
> I was hoping to get most of this done for this merge window, but I'll
> probably be lucky if I get even parts done. Too much distraction.

Thanks again for taking care of this,
Christian.

>> Hmm, not entirely sure why.. you should be on the cc list for each
>> individual patch.
> They finally made it, although even with the delay they only ended up
> in the spam mailbox. I still can't see them on the various mailing
> lists.
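For reference, a minimal sketch of how a drm driver typically gives
userspace the USWC view Christian describes (pgprot_writecombine() and
remap_pfn_range() are the existing kernel interfaces; the surrounding
function and its parameters are illustrative):

#include <linux/mm.h>

/* Sketch: mark the userspace mapping write-combined instead of cached,
 * so scanout from "normal" system memory works as described above. */
static int mmap_uswc_sketch(struct vm_area_struct *vma, unsigned long pfn,
			    unsigned long size)
{
	vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}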
On Thu, Aug 15, 2019 at 06:21:00PM +0000, Koenig, Christian wrote:
> >  (2) Add support for DMA_ATTR_NO_KERNEL_MAPPING to this new API instead
> >      of dma_alloc_attrs. The initial difference with that flag is just
> >      that we allow highmem, but in the future we could also unmap this
> >      memory from the kernel linear mapping entirely on architectures
> >      where we can easily do that.
>
> Mhm, why would we want to do this?

To avoid the CPU misspeculating into this memory. For example NVMe SSDs
have a feature called host memory buffer that is a lot like your stolen
main ram for the GPU case. We currently hand the SSD a
DMA_ATTR_NO_KERNEL_MAPPING allocation if it requests such a buffer. If
possible we'd really like to make sure no speculative execution bug (or
intentional attacker with a kernel exploit, for that matter) can easily
access that memory.
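A sketch of the host-memory-buffer style allocation described above
(dma_alloc_attrs() and DMA_ATTR_NO_KERNEL_MAPPING are the existing API;
the surrounding function is illustrative, not NVMe's actual code):

#include <linux/dma-mapping.h>

/* Sketch: with DMA_ATTR_NO_KERNEL_MAPPING the return value is an
 * opaque cookie for dma_free_attrs(), not a CPU-usable address --
 * the device gets the memory, the kernel never maps it. */
static int hmb_alloc_sketch(struct device *dev, size_t size)
{
	dma_addr_t dma_addr;
	void *cookie;

	cookie = dma_alloc_attrs(dev, size, &dma_addr, GFP_KERNEL,
				 DMA_ATTR_NO_KERNEL_MAPPING);
	if (!cookie)
		return -ENOMEM;

	/* ... hand dma_addr to the device ... */

	dma_free_attrs(dev, size, cookie, dma_addr,
		       DMA_ATTR_NO_KERNEL_MAPPING);
	return 0;
}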
On Thu, Aug 15, 2019 at 10:53 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Aug 15, 2019 at 06:54:39AM -0700, Rob Clark wrote:
> > On Wed, Aug 14, 2019 at 11:51 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > As said before I don't think these low-level helpers are the right API
> > > to export, but even if they were you'd just cover a tiny subset of the
> > > architectures.
> >
> > Are you thinking instead something like:
> >
> > void dma_sync_sg_for_{cpu,device}(struct device *dev, struct scatterlist *sgl,
> > 		int nents, enum dma_data_direction dir)
> > {
> > 	for_each_sg(sgl, sg, nents, i) {
> > 		arch_sync_dma_for_..(dev, sg_phys(sg), sg->length, dir);
> > 	}
> > }
> > EXPORT_SYMBOL_GPL(dma_sync_sg_for_..)
> >
> > or did you have something else in mind?
>
> No. We really need an interface that says "please give me uncached
> memory" (for some definition of uncached that includes what graphics
> drivers call write combine), and then let the architecture do the right
> thing. Basically dma_alloc_coherent with DMA_ATTR_NO_KERNEL_MAPPING
> is superficially close to what you want, except that the way the drm
> drivers work you can't actually use it.

I don't disagree about needing an API to get uncached memory (or
ideally just something outside of the linear map). But I think this
is a separate problem.

What I was hoping for, for v5.4, is a way to stop abusing dma_map/sync
for cache ops, to get rid of the hack I had to make for v5.3, and also
to fix vgem on non-x86. (Unfortunately changing vgem to use cached
mappings breaks x86 CI, but fixes CI on arm/arm64..) We can do that
without any changes in allocation. There is still the possibility of
problems due to a cached alias, but that has been a problem this whole
time; it isn't something new.

BR,
-R

> The reason for that is that, if we can, we really need to not create
> another uncachable alias, but instead change the page attributes in
> place. On x86 we can and must do that, for example, and based on the
> conversation with Will arm64 could do that fairly easily. arm32 can
> right now only do that for CMA, though.
>
> The big question is what API we want. I had a pretty similar
> discussion with Christian on doing such an allocation for amdgpu,
> where the device normally is cache coherent, but they actually want
> to turn it into non-coherent by using PCIe unsnooped transactions.
>
> Here is my high level plan, which still has a few loose ends:
>
>  (1) provide a new API:
>
> 	struct page *dma_alloc_pages(struct device *dev, unsigned nr_pages,
> 			gfp_t gfp, unsigned long flags);
> 	void dma_free_pages(struct device *dev, unsigned nr_pages,
> 			unsigned long flags);
>
>      These give you back page backed memory that is guaranteed to be
>      addressable by the device (no swiotlb or similar). The memory can
>      then be mapped using dma_map*, including unmap and dma_sync to
>      bounce ownership around. This is the replacement for the current
>      dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT API, which is rather
>      badly defined.
>
>  (2) Add support for DMA_ATTR_NO_KERNEL_MAPPING to this new API instead
>      of dma_alloc_attrs. The initial difference with that flag is just
>      that we allow highmem, but in the future we could also unmap this
>      memory from the kernel linear mapping entirely on architectures
>      where we can easily do that.
>
>  (3) Add a dma_pages_map/dma_pages_unmap or similar API that allows you
>      to get a kernel mapping for parts or all of a
>      DMA_ATTR_NO_KERNEL_MAPPING allocation. This is to replace things
>      like your open-coded vmap in msm (or similarly elsewhere in dma-buf
>      providers).
>
>  (4) Add support for a DMA_ATTR_UNCACHABLE flag (or similar) to the new
>      API, that maps the pages as uncachable iff they have a kernel
>      mapping, including invalidating the caches at the time of this page
>      attribute change (or creation of a new mapping). This API will fail
>      if the architecture does not allow in-place remapping. Note that for
>      arm32 we could always dip into the CMA pool if one is present to not
>      fail. We'll also need some helper to map from the DMA_ATTR_* flags
>      to a pgprot for mapping the page to userspace. There are also a few
>      other weird bits here, e.g. on architectures like mips that use an
>      uncached segment we'll have to fail requests with the plain
>      DMA_ATTR_UNCACHABLE flag, but it could be supported with
>      DMA_ATTR_UNCACHABLE | DMA_ATTR_NO_KERNEL_MAPPING.
>
> I was hoping to get most of this done for this merge window, but I'll
> probably be lucky if I get even parts done. Too much distraction.
>
> > Hmm, not entirely sure why.. you should be on the cc list for each
> > individual patch.
>
> They finally made it, although even with the delay they only ended up
> in the spam mailbox. I still can't see them on the various mailing
> lists.
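For context, a hedged illustration of the kind of dma_map/sync abuse
Rob refers to (this is not msm's exact code, just the general shape of
the hack): calling the streaming-map API on a driver-managed buffer
purely for its cache-maintenance side effects, and discarding the
returned addresses.

#include <linux/dma-mapping.h>

/* Illustration only: the map/unmap pair is used just for the cache
 * writeback/invalidate it performs as a side effect; the dma addresses
 * it produces are deliberately ignored.  This is the misuse the series
 * replaces with explicit cache-sync calls. */
static void flush_for_gpu_hack(struct device *dev, struct sg_table *sgt)
{
	(void)dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}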
On Fri, Aug 16, 2019 at 02:04:35PM -0700, Rob Clark wrote:
> I don't disagree about needing an API to get uncached memory (or
> ideally just something outside of the linear map). But I think this
> is a separate problem.
>
> What I was hoping for, for v5.4, is a way to stop abusing dma_map/sync
> for cache ops, to get rid of the hack I had to make for v5.3, and also
> to fix vgem on non-x86. (Unfortunately changing vgem to use cached
> mappings breaks x86 CI, but fixes CI on arm/arm64..) We can do that
> without any changes in allocation. There is still the possibility of
> problems due to a cached alias, but that has been a problem this whole
> time; it isn't something new.

But that just means we start exposing random low-level APIs that people
will quickly abuse.. In fact even your simple plan already is, to some
extent, an abuse of the intent of these functions, and it also requires
a lot of knowledge in the driver that drivers normally can't have
(e.g. whether the device is dma coherent or not).
On Sun, Aug 18, 2019 at 10:23 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Aug 16, 2019 at 02:04:35PM -0700, Rob Clark wrote:
> > I don't disagree about needing an API to get uncached memory (or
> > ideally just something outside of the linear map). But I think this
> > is a separate problem.
> >
> > What I was hoping for, for v5.4, is a way to stop abusing dma_map/sync
> > for cache ops, to get rid of the hack I had to make for v5.3, and also
> > to fix vgem on non-x86. (Unfortunately changing vgem to use cached
> > mappings breaks x86 CI, but fixes CI on arm/arm64..) We can do that
> > without any changes in allocation. There is still the possibility of
> > problems due to a cached alias, but that has been a problem this whole
> > time; it isn't something new.
>
> But that just means we start exposing random low-level APIs that people
> will quickly abuse.. In fact even your simple plan already is, to some
> extent, an abuse of the intent of these functions, and it also requires
> a lot of knowledge in the driver that drivers normally can't have
> (e.g. whether the device is dma coherent or not).

I can agree that most drivers should use the higher level APIs.. but
not that we must prevent *all* drivers from using them. Most of what
the DMA API is trying to solve doesn't apply to a driver like drm/msm,
which is how we ended up with hacks to try and misuse the high level
API to accomplish what we need.

Perhaps we can protect the prototypes with an #ifdef LOWLEVEL_DMA_API /
#endif type thing, to make it more obvious to other drivers that it
probably isn't the API they should use?

BR,
-R
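A sketch of the guard Rob suggests (LOWLEVEL_DMA_API is his own
hypothetical define; the prototypes match the arch_sync_dma_for_*()
helpers discussed in this thread):

/* Only drivers that really know what they are doing should define
 * LOWLEVEL_DMA_API before including this header. */
#ifdef LOWLEVEL_DMA_API
void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
			      size_t size, enum dma_data_direction dir);
void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
			   size_t size, enum dma_data_direction dir);
#endif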
From: Rob Clark <robdclark@chromium.org>

This is a replacement for previous patches[1] that added arm64 support
for drm_clflush. I've also added a patch to solve a similar cache issue
in vgem.

The first few patches just export arch_sync_dma_for_*(). Possibly the
EXPORT_SYMBOL_GPL() should instead live somewhere central, rather than
per-arch (but where would make sense?). The fourth adds (and exports)
these ops for arch/arm. (Arnd Bergmann mentioned on IRC that Christoph
Hellwig was already working on this for arch/arm, which could replace
the fourth patch.)

The last two patches actually fix things.

[1] https://patchwork.freedesktop.org/series/64732/

Rob Clark (6):
  arm64: export arch_sync_dma_for_*()
  mips: export arch_sync_dma_for_*()
  powerpc: export arch_sync_dma_for_*()
  arm: add arch_sync_dma_for_*()
  drm/msm: stop abusing DMA API
  drm/vgem: fix cache synchronization on arm/arm64 (take two)

 arch/arm/Kconfig                  |   2 +
 arch/arm/mm/dma-mapping-nommu.c   |  14 +++
 arch/arm/mm/dma-mapping.c         |  28 ++++++
 arch/arm64/mm/dma-mapping.c       |   2 +
 arch/arm64/mm/flush.c             |   2 +
 arch/mips/mm/dma-noncoherent.c    |   2 +
 arch/powerpc/mm/dma-noncoherent.c |   2 +
 drivers/gpu/drm/drm_cache.c       |  20 ++++-
 drivers/gpu/drm/msm/msm_gem.c     |  37 +++-----
 drivers/gpu/drm/vgem/vgem_drv.c   | 145 ++++++++++++++++++++----------
 include/drm/drm_cache.h           |   4 +
 11 files changed, 182 insertions(+), 76 deletions(-)
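For a sense of scale, the per-arch export patches are essentially
one-liners. A sketch of what the arm64 change looks like (the function
body shown is the existing arm64 implementation of that era; the
EXPORT_SYMBOL_GPL() line is what the series adds -- see the series link
above for the actual patches):

/* arch/arm64/mm/dma-mapping.c */
void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
			      size_t size, enum dma_data_direction dir)
{
	__dma_map_area(phys_to_virt(paddr), size, dir);
}
EXPORT_SYMBOL_GPL(arch_sync_dma_for_device);	/* added by this series */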