Message ID | 20190805211451.20176-1-robdclark@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] drm: add cache support for arm64 |
This goes in the wrong direction. drm_cflush_* are a bad API we need to get rid of, not add uses of. The reason for that is two-fold:

a) it doesn't address how cache maintenance actually works on most platforms. When talking about a cache we have three fundamental operations:

1) write back - this writes the content of the cache back to the backing memory
2) invalidate - this removes the content of the cache
3) write back + invalidate - do both of the above

b) which of the above operations you use, and when, depends on a couple of factors, namely what you want to do with the range you perform the cache maintenance operations on.

Take a look at the comment in arch/arc/mm/dma.c around line 30 that explains how this applies to buffer ownership management. Note that "for device" applies to "for userspace" in the same way, just that userspace then also needs to follow this protocol. So the whole idea that random driver code calls random low-level cache maintenance operations (and uses the non-specific term flush to make it all more confusing) is a bad idea. Fortunately we have really good arch helpers for all non-coherent architectures (this excludes the magic i915, which won't be covered by them, but that is a separate issue to be addressed later; also, arm32 only grew them very recently and doesn't expose them for all configs, which is easily fixable if needed): arch_sync_dma_for_device and arch_sync_dma_for_cpu. So what we need is to figure out where we have valid cases for buffer ownership transfer outside the DMA API, and build proper wrappers around the above functions for that. My guess is it should probably be built to go with the iommu API, as that is the only other way to map memory for DMA access, but if you have a better idea I'd be open to discussion.
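For readers following along, the three operations map onto the arch helpers mentioned here roughly as in the sketch below. This is an illustration only, not existing driver code; at the time of this thread the helpers still took a struct device * argument that no implementation used, and exactly how each DMA direction decomposes into write back vs. invalidate is architecture-specific (see the table in arch/arc/mm/dma.c).

#include <linux/dma-direction.h>
#include <linux/dma-noncoherent.h>

/* CPU hands a buffer to a non-coherent device: write back dirty lines */
static void buffer_to_device(struct device *dev, phys_addr_t paddr, size_t size)
{
        arch_sync_dma_for_device(dev, paddr, size, DMA_TO_DEVICE);
}

/* CPU takes the buffer back: invalidate stale lines before reading */
static void buffer_to_cpu(struct device *dev, phys_addr_t paddr, size_t size)
{
        arch_sync_dma_for_cpu(dev, paddr, size, DMA_FROM_DEVICE);
}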
On Tue, Aug 06, 2019 at 10:48:21AM +0200, Christoph Hellwig wrote: > This goes in the wrong direction. drm_cflush_* are a bad API we need to > get rid of, not add use of it. The reason for that is two-fold: > > a) it doesn't address how cache maintaince actually works in most > platforms. When talking about a cache we three fundamental operations: > > 1) write back - this writes the content of the cache back to the > backing memory > 2) invalidate - this remove the content of the cache > 3) write back + invalidate - do both of the above > > b) which of the above operation you use when depends on a couple of > factors of what you want to do with the range you do the cache > maintainance operations > > Take a look at the comment in arch/arc/mm/dma.c around line 30 that > explains how this applies to buffer ownership management. Note that > "for device" applies to "for userspace" in the same way, just that > userspace then also needs to follow this protocol. So the whole idea > that random driver code calls random low-level cache maintainance > operations (and use the non-specific term flush to make it all more > confusing) is a bad idea. Fortunately enough we have really good > arch helpers for all non-coherent architectures (this excludes the > magic i915 won't be covered by that, but that is a separate issue > to be addressed later, and the fact that while arm32 did grew them > very recently and doesn't expose them for all configs, which is easily > fixable if needed) with arch_sync_dma_for_device and > arch_sync_dma_for_cpu. So what we need is to figure out where we > have valid cases for buffer ownership transfer outside the DMA > API, and build proper wrappers around the above function for that. > My guess is it should probably be build to go with the iommu API > as that is the only other way to map memory for DMA access, but > if you have a better idea I'd be open to discussion. I just read through all the arch_sync_dma_for_device/cpu functions and none seem to use the struct *dev argument. Iirc you've said that's on the way out? That dev parameter is another holdup for the places where we do not yet know what the new device will be (e.g. generic dma-buf exporters like vgem). And sprinkling a fake dev or passing NULL is a bit silly. Add a HAVE_ARCH_SYNC_DMA and the above refactor (assuming it's ok to roll out everywhere) and we should indeed be able to use this. We still need to have all the others for x86 and all that. Plus I guess we should roll out the split into invalidate and flush. The other bit is phys vs. virt addr confusion, but looks like standard if phys_addr and you kmap underneath (except from drm_clflush_virt_range, only used by i915). -Daniel
On Tue, Aug 06, 2019 at 11:38:16AM +0200, Daniel Vetter wrote: > I just read through all the arch_sync_dma_for_device/cpu functions and > none seem to use the struct *dev argument. Iirc you've said that's on the > way out? Not actively on the way out yet, but now that we support all architectures it becomes pretty clear we don't need it. So yes, we could eventually remove it. But.. > That dev parameter is another holdup for the places where we do not yet > know what the new device will be (e.g. generic dma-buf exporters like > vgem). And sprinkling a fake dev or passing NULL is a bit silly. This is where it becomes interesting. Because while we don't need the dev argument for the low-level arch API, we do need it at the driver API level, given that in some systems both dma coherent and non-coherent devices exist, and we need to decide based on that. > Add a HAVE_ARCH_SYNC_DMA and the above refactor (assuming it's ok to roll > out everywhere) and we should indeed be able to use this. We still need to > have all the others for x86 and all that. Plus I guess we should roll out > the split into invalidate and flush. Well, we'll still need a defined API for buffer ownership vs device. Just exporting arch_sync_dma_for_{device,cpu} is not the right abstraction level. Note that these two functions are always present, just stubbed out for architectures that do not support non-cache coherent devices (plus the arm false negative that is easily fixable).
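A minimal sketch of the kind of driver-facing wrapper being asked for here, and of why the struct device matters at that level: the wrapper can simply do nothing for coherent devices. The helper name is made up; only dev_is_dma_coherent() and the arch_sync_dma_* calls are real.

#include <linux/dma-direction.h>
#include <linux/dma-noncoherent.h>

/* hypothetical ownership-transfer helper, not an existing kernel API */
static void drm_buffer_transfer_to_device(struct device *dev, phys_addr_t paddr,
                                          size_t size)
{
        if (dev_is_dma_coherent(dev))
                return;         /* coherent device: no cache maintenance needed */

        arch_sync_dma_for_device(dev, paddr, size, DMA_BIDIRECTIONAL);
}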
On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > get rid of, not add use of it. The reason for that is two-fold: > > a) it doesn't address how cache maintaince actually works in most > platforms. When talking about a cache we three fundamental operations: > > 1) write back - this writes the content of the cache back to the > backing memory > 2) invalidate - this remove the content of the cache > 3) write back + invalidate - do both of the above Agreed that drm_cflush_* isn't a great API. In this particular case (IIUC), I need wb+inv so that there aren't dirty cache lines that drop out to memory later, and so that I don't get a cache hit on uncached/wc mmap'ing. > b) which of the above operation you use when depends on a couple of > factors of what you want to do with the range you do the cache > maintainance operations > > Take a look at the comment in arch/arc/mm/dma.c around line 30 that > explains how this applies to buffer ownership management. Note that > "for device" applies to "for userspace" in the same way, just that > userspace then also needs to follow this protocol. So the whole idea > that random driver code calls random low-level cache maintainance > operations (and use the non-specific term flush to make it all more > confusing) is a bad idea. Fortunately enough we have really good > arch helpers for all non-coherent architectures (this excludes the > magic i915 won't be covered by that, but that is a separate issue > to be addressed later, and the fact that while arm32 did grew them > very recently and doesn't expose them for all configs, which is easily > fixable if needed) with arch_sync_dma_for_device and > arch_sync_dma_for_cpu. So what we need is to figure out where we > have valid cases for buffer ownership transfer outside the DMA > API, and build proper wrappers around the above function for that. > My guess is it should probably be build to go with the iommu API > as that is the only other way to map memory for DMA access, but > if you have a better idea I'd be open to discussion. Tying it in w/ iommu seems a bit weird to me.. but maybe that is just me, I'm certainly willing to consider proposals or to try things and see how they work out. Exposing the arch_sync_* API and using that directly (bypassing drm_cflush_*) actually seems pretty reasonable and pragmatic. I did have one doubt, as phys_to_virt() is only valid for kernel direct mapped memory (AFAIU), what happens for pages that are not in kernel linear map? Maybe it is ok to ignore those pages, since they won't have an aliased mapping? BR, -R
On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > get rid of, not add use of it. The reason for that is two-fold: > > > > a) it doesn't address how cache maintaince actually works in most > > platforms. When talking about a cache we three fundamental operations: > > > > 1) write back - this writes the content of the cache back to the > > backing memory > > 2) invalidate - this remove the content of the cache > > 3) write back + invalidate - do both of the above > > Agreed that drm_cflush_* isn't a great API. In this particular case > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > out to memory later, and so that I don't get a cache hit on > uncached/wc mmap'ing. Is there a cacheable alias lying around (e.g. the linear map), or are these addresses only mapped uncached/wc? If there's a cacheable alias, performing an invalidate isn't sufficient, since a CPU can allocate a new (clean) entry at any point in time (e.g. as a result of prefetching or arbitrary speculation). Thanks, Mark.
On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > Agreed that drm_cflush_* isn't a great API. In this particular case > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > out to memory later, and so that I don't get a cache hit on > uncached/wc mmap'ing. So what is the use case here? Allocate pages using the page allocator (or CMA for that matter), and then mmapping them to userspace and never touching them again from the kernel? > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just > me, I'm certainly willing to consider proposals or to try things and > see how they work out. This was just my thought as the fit seems easy. But maybe you'll need to explain your use case(s) a bit more so that we can figure out what a good high level API is. > Exposing the arch_sync_* API and using that directly (bypassing > drm_cflush_*) actually seems pretty reasonable and pragmatic. I did > have one doubt, as phys_to_virt() is only valid for kernel direct > mapped memory (AFAIU), what happens for pages that are not in kernel > linear map? Maybe it is ok to ignore those pages, since they won't > have an aliased mapping? They could have an aliased mapping in vmalloc/vmap space for example, if you created one. We have the flush_kernel_vmap_range / invalidate_kernel_vmap_range APIs for those, that are implemented on architectures with VIVT caches.
On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > Agreed that drm_cflush_* isn't a great API. In this particular case > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > out to memory later, and so that I don't get a cache hit on > > uncached/wc mmap'ing. > > So what is the use case here? Allocate pages using the page allocator > (or CMA for that matter), and then mmaping them to userspace and never > touching them again from the kernel? Currently, it is pages coming from tmpfs. Ideally we want pages that are swappable when unpinned. CPU mappings are *mostly* just mapping to userspace. There are a few exceptions that are vmap'd (fbcon, and ringbuffer). (Eventually I'd like to support pages passed in from userspace.. but that is down the road.) > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just > > me, I'm certainly willing to consider proposals or to try things and > > see how they work out. > > This was just my through as the fit seems easy. But maybe you'll > need to explain your use case(s) a bit more so that we can figure out > what a good high level API is. Tying it to iommu_map/unmap would be awkward, as we could need to setup cpu mmap before it ends up mapped to iommu. And the plan to support per-process pagetables involved creating an iommu_domain per userspace gl context.. some buffers would end up mapped into multiple contexts/iommu_domains. If the cache operation was detached from iommu_map/unmap, then it would seem weird to be part of the iommu API. I guess I'm not entirely sure what you had in mind, but this is why iommu seemed to me like a bad fit. > > Exposing the arch_sync_* API and using that directly (bypassing > > drm_cflush_*) actually seems pretty reasonable and pragmatic. I did > > have one doubt, as phys_to_virt() is only valid for kernel direct > > mapped memory (AFAIU), what happens for pages that are not in kernel > > linear map? Maybe it is ok to ignore those pages, since they won't > > have an aliased mapping? > > They could have an aliased mapping in vmalloc/vmap space for example, > if you created one. We have the flush_kernel_vmap_range / > invalidate_kernel_vmap_range APIs for those, that are implement > on architectures with VIVT caches. If I only have to worry about aliased mappings that I create myself, that doesn't seem too bad.. I could track that myself easily enough. BR, -R
On Tue, Aug 6, 2019 at 9:23 AM Rob Clark <robdclark@chromium.org> wrote: > > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > out to memory later, and so that I don't get a cache hit on > > > uncached/wc mmap'ing. > > > > So what is the use case here? Allocate pages using the page allocator > > (or CMA for that matter), and then mmaping them to userspace and never > > touching them again from the kernel? > > Currently, it is pages coming from tmpfs. Ideally we want pages that > are swappable when unpinned. to be more specific, pages come from shmem_file_setup()/shmem_read_mapping_page() BR, -R
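The typical pattern behind that (simplified sketch, not the actual msm code; error unwinding omitted): pull each page of a shmem object in through the page cache and build a scatterlist from the result.

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/shmem_fs.h>
#include <linux/slab.h>

static struct sg_table *pin_shmem_pages(struct file *shmem_file,
                                        struct page **pages,
                                        unsigned int npages)
{
        struct address_space *mapping = file_inode(shmem_file)->i_mapping;
        struct sg_table *sgt;
        unsigned int i;
        int ret;

        for (i = 0; i < npages; i++) {
                /* brings the page back in from swap if needed, and pins it */
                pages[i] = shmem_read_mapping_page(mapping, i);
                if (IS_ERR(pages[i]))
                        return ERR_CAST(pages[i]);
        }

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt)
                return ERR_PTR(-ENOMEM);

        ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
                                        (unsigned long)npages << PAGE_SHIFT,
                                        GFP_KERNEL);
        if (ret) {
                kfree(sgt);
                return ERR_PTR(ret);
        }
        return sgt;
}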
On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > > get rid of, not add use of it. The reason for that is two-fold: > > > > > > a) it doesn't address how cache maintaince actually works in most > > > platforms. When talking about a cache we three fundamental operations: > > > > > > 1) write back - this writes the content of the cache back to the > > > backing memory > > > 2) invalidate - this remove the content of the cache > > > 3) write back + invalidate - do both of the above > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > out to memory later, and so that I don't get a cache hit on > > uncached/wc mmap'ing. > > Is there a cacheable alias lying around (e.g. the linear map), or are > these addresses only mapped uncached/wc? > > If there's a cacheable alias, performing an invalidate isn't sufficient, > since a CPU can allocate a new (clean) entry at any point in time (e.g. > as a result of prefetching or arbitrary speculation). I *believe* that there are not alias mappings (that I don't control myself) for pages coming from shmem_file_setup()/shmem_read_mapping_page().. digging around at what dma_sync_sg_* does under the hood, it looks like it is just arch_sync_dma_for_cpu/device(), so I guess that should be sufficient for what I need. There are a few buffers that I vmap for use on kernel side, but like the userspace mmap's, the vmaps are also writecombine. BR, -R
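Concretely, the approach Rob describes could look roughly like the sketch below (illustrative only; the helper name is made up). Going through the DMA API means the appropriate arch_sync_dma_for_device()/arch_sync_dma_for_cpu() calls happen behind the scenes, subject to Mark's caveat about the cacheable alias in the linear map.

#include <linux/dma-mapping.h>

/* hypothetical: clean the CPU caches for a pinned, shmem-backed sg_table
 * before handing the pages to the GPU and to an uncached/wc userspace mmap */
static int sync_pages_for_gpu(struct device *dev, struct sg_table *sgt)
{
        if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL))
                return -ENOMEM;

        /* later CPU access would bracket itself with
         * dma_sync_sg_for_cpu() / dma_sync_sg_for_device() */
        return 0;
}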
On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote: > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > out to memory later, and so that I don't get a cache hit on > > > uncached/wc mmap'ing. > > > > So what is the use case here? Allocate pages using the page allocator > > (or CMA for that matter), and then mmaping them to userspace and never > > touching them again from the kernel? > > Currently, it is pages coming from tmpfs. Ideally we want pages that > are swappable when unpinned. tmpfs is basically a (complicated) frontend for alloc pages as far as page allocation is concerned. > CPU mappings are *mostly* just mapping to userspace. There are a few > exceptions that are vmap'd (fbcon, and ringbuffer). And those use the same backend? > (Eventually I'd like to support pages passed in from userspace.. but > that is down the road.) Eww. Please talk to the iommu list before starting on that. > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just > > > me, I'm certainly willing to consider proposals or to try things and > > > see how they work out. > > > > This was just my through as the fit seems easy. But maybe you'll > > need to explain your use case(s) a bit more so that we can figure out > > what a good high level API is. > > Tying it to iommu_map/unmap would be awkward, as we could need to > setup cpu mmap before it ends up mapped to iommu. And the plan to > support per-process pagetables involved creating an iommu_domain per > userspace gl context.. some buffers would end up mapped into multiple > contexts/iommu_domains. > > If the cache operation was detached from iommu_map/unmap, then it > would seem weird to be part of the iommu API. > > I guess I'm not entirely sure what you had in mind, but this is why > iommu seemed to me like a bad fit. So back to the question, I'd like to understand your use case (and maybe hear from the other drm folks if that is common): - you allocate pages from shmem (why shmem, btw? if this is done by other drm drivers how do they guarantee addressability without an iommu?) - then the memory is either mapped to userspace or vmapped (or even both, althrough the lack of aliasing you mentioned would speak against it) as writecombine (aka arm v6+ normal uncached). Does the mapping live on until the memory is freed? - as you mention swapping - how do you guarantee there are no aliases in the kernel direct mapping after the page has been swapped in? - then the memory is potentially mapped to the iommu. Is it using a long-living mapping, or does get unmapped/remapped repeatedly?
On Wed, Aug 7, 2019 at 8:25 AM Christoph Hellwig <hch@lst.de> wrote: > On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote: > > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > > out to memory later, and so that I don't get a cache hit on > > > > uncached/wc mmap'ing. > > > > > > So what is the use case here? Allocate pages using the page allocator > > > (or CMA for that matter), and then mmaping them to userspace and never > > > touching them again from the kernel? > > > > Currently, it is pages coming from tmpfs. Ideally we want pages that > > are swappable when unpinned. > > tmpfs is basically a (complicated) frontend for alloc pages as far > as page allocation is concerned. > > > CPU mappings are *mostly* just mapping to userspace. There are a few > > exceptions that are vmap'd (fbcon, and ringbuffer). > > And those use the same backend? > > > (Eventually I'd like to support pages passed in from userspace.. but > > that is down the road.) > > Eww. Please talk to the iommu list before starting on that. > > > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just > > > > me, I'm certainly willing to consider proposals or to try things and > > > > see how they work out. > > > > > > This was just my through as the fit seems easy. But maybe you'll > > > need to explain your use case(s) a bit more so that we can figure out > > > what a good high level API is. > > > > Tying it to iommu_map/unmap would be awkward, as we could need to > > setup cpu mmap before it ends up mapped to iommu. And the plan to > > support per-process pagetables involved creating an iommu_domain per > > userspace gl context.. some buffers would end up mapped into multiple > > contexts/iommu_domains. > > > > If the cache operation was detached from iommu_map/unmap, then it > > would seem weird to be part of the iommu API. > > > > I guess I'm not entirely sure what you had in mind, but this is why > > iommu seemed to me like a bad fit. > > So back to the question, I'd like to understand your use case (and > maybe hear from the other drm folks if that is common): Filling in a bunch more of the use-cases we have in drm. Don't need to solve them all right away ofc, but whatever direction we're aiming for should keep these in mind I think. > - you allocate pages from shmem (why shmem, btw? if this is done by > other drm drivers how do they guarantee addressability without an > iommu?) We use shmem to get at swappable pages. We generally just assume that the gpu can get at those pages, but things fall apart in fun ways: - some setups somehow inject bounce buffers. Some drivers just give up, others try to allocate a pool of pages with dma_alloc_coherent. - some devices are misdesigned and can't access as much as the cpu. We allocate using GFP_DMA32 to fix that. Also modern gpu apis pretty much assume you can malloc() and then use that directly with the gpu. > - then the memory is either mapped to userspace or vmapped (or even > both, althrough the lack of aliasing you mentioned would speak > against it) as writecombine (aka arm v6+ normal uncached). Does > the mapping live on until the memory is freed? Generally we cache mappings forever. 
Some exceptions for 32bit userspace excluded, where people expect to be able to use more than 4GB of textures somehow, so we have to recycle old mappings. Obviously applies less to gpus on socs. Also, at least i915 also has writeback userspace mmaps, and userspace doing the cflushing. Also not sure I ever mentioned this, but at least i915 userspace controls the coherency mode of the device dma directly, on a per-op granularity. For buffers shared with other processes it defers to the gpu pagetables, which the kernel controls. > - as you mention swapping - how do you guarantee there are no > aliases in the kernel direct mapping after the page has been swapped > in? No idea personally, since we can ignore this on x86. I think atm there's not a huge overlap of gpu drivers doing swapping and running on something like arm where incompatible aliased mappings are bad. > - then the memory is potentially mapped to the iommu. Is it using > a long-living mapping, or does get unmapped/remapped repeatedly? Again, generally cached for as long as possible, until we run out of space/memory somewhere. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote: > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > > > get rid of, not add use of it. The reason for that is two-fold: > > > > > > > > a) it doesn't address how cache maintaince actually works in most > > > > platforms. When talking about a cache we three fundamental operations: > > > > > > > > 1) write back - this writes the content of the cache back to the > > > > backing memory > > > > 2) invalidate - this remove the content of the cache > > > > 3) write back + invalidate - do both of the above > > > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > out to memory later, and so that I don't get a cache hit on > > > uncached/wc mmap'ing. > > > > Is there a cacheable alias lying around (e.g. the linear map), or are > > these addresses only mapped uncached/wc? > > > > If there's a cacheable alias, performing an invalidate isn't sufficient, > > since a CPU can allocate a new (clean) entry at any point in time (e.g. > > as a result of prefetching or arbitrary speculation). > > I *believe* that there are not alias mappings (that I don't control > myself) for pages coming from > shmem_file_setup()/shmem_read_mapping_page().. AFAICT, that's regular anonymous memory, so there will be a cacheable alias in the linear/direct map. > digging around at what dma_sync_sg_* does under the hood, it looks > like it is just arch_sync_dma_for_cpu/device(), so I guess that should > be sufficient for what I need. I don't think that's the case, per the example I gave above. Thanks, Mark.
On Tue, Aug 6, 2019 at 11:25 PM Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote: > > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > > out to memory later, and so that I don't get a cache hit on > > > > uncached/wc mmap'ing. > > > > > > So what is the use case here? Allocate pages using the page allocator > > > (or CMA for that matter), and then mmaping them to userspace and never > > > touching them again from the kernel? > > > > Currently, it is pages coming from tmpfs. Ideally we want pages that > > are swappable when unpinned. > > tmpfs is basically a (complicated) frontend for alloc pages as far > as page allocation is concerned. > > > CPU mappings are *mostly* just mapping to userspace. There are a few > > exceptions that are vmap'd (fbcon, and ringbuffer). > > And those use the same backend? yes > > (Eventually I'd like to support pages passed in from userspace.. but > > that is down the road.) > > Eww. Please talk to the iommu list before starting on that. This is more of a long term goal, we can't do it until we have per-context/process pagetables, ofc. Getting a bit off topic, but I'm curious about what problems you are concerned about. Userspace can shoot it's own foot, but if it is not sharing GPU pagetables with other processes, it can't shoot other's feet. (I'm guessing you are concerned about non-page-aligned mappings?) > > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just > > > > me, I'm certainly willing to consider proposals or to try things and > > > > see how they work out. > > > > > > This was just my through as the fit seems easy. But maybe you'll > > > need to explain your use case(s) a bit more so that we can figure out > > > what a good high level API is. > > > > Tying it to iommu_map/unmap would be awkward, as we could need to > > setup cpu mmap before it ends up mapped to iommu. And the plan to > > support per-process pagetables involved creating an iommu_domain per > > userspace gl context.. some buffers would end up mapped into multiple > > contexts/iommu_domains. > > > > If the cache operation was detached from iommu_map/unmap, then it > > would seem weird to be part of the iommu API. > > > > I guess I'm not entirely sure what you had in mind, but this is why > > iommu seemed to me like a bad fit. > > So back to the question, I'd like to understand your use case (and > maybe hear from the other drm folks if that is common): > > - you allocate pages from shmem (why shmem, btw? if this is done by > other drm drivers how do they guarantee addressability without an > iommu?) shmem for swappable pages. I don't unpin and let things get swapped out yet, but I'm told it starts to become important when you have 50 browser tabs open ;-) There are some display-only drm drivers with no IOMMU, which use CMA rather than shmem. Basically every real GPU has some form of MMU or IOMMU for memory protection. (The couple exceptions do expensive kernel side cmdstream validation.) > - then the memory is either mapped to userspace or vmapped (or even > both, althrough the lack of aliasing you mentioned would speak > against it) as writecombine (aka arm v6+ normal uncached). Does > the mapping live on until the memory is freed? 
(side note, *most* of the drm/msm supported devices are armv8, the exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm will ever have to deal w/ armv6) Userspace keeps the userspace mmap around opportunistically (once it is mmap'd, not all buffers will be accessed from CPU). In fact there is a userspace buffer cache, where we try to re-use buffers that are already allocated and possibly mmap'd, since allocation and setting up mmap is expensive. (There is an MADVISE ioctl so userspace can tell the kernel about buffers in the cache, which are available to be purged by the shrinker.. if a buffer is purged, its userspace mmap is torn down... along with its IOMMU map, ofc) > - as you mention swapping - how do you guarantee there are no > aliases in the kernel direct mapping after the page has been swapped > in? Before unpinning and allowing pages to be swapped out, CPU and IOMMU maps would be torn down. (I don't think we could unpin buffers w/ a vmap, but those are just a drop in the bucket.) Currently, the kernel always knows the buffers associated w/ a submit (job) queued to the GPU, so it could bring pages back in and restore the iommu map.. the fault handler can be used to bring things back in for CPU access. (For more free-form HMM style access to userspace memory, we'd need to be able to sleep in the IOMMU fault handler before the IOMMU resumes translation.) As far as cache is concerned, it would be basically the same as newly allocated pages, ie. we need to wb+inv the new pages. > - then the memory is potentially mapped to the iommu. Is it using > a long-living mapping, or does get unmapped/remapped repeatedly? Similar to CPU maps, we keep the IOMMU map around as long as possible. (Side note, I was thinking batching unmaps could be useful to reduce IOMMU TLB inv overhead.. usually when we are freeing a buffer, we are freeing multiple, so we could unmap them all, then TLB inv, then free pages.) BR, -R
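The batched-unmap idea at the end could be sketched with the deferred-flush IOMMU interface that existed at the time (iommu_unmap_fast()/iommu_tlb_sync()); the helper below is hypothetical and just illustrates paying for one TLB invalidation per batch.

#include <linux/iommu.h>

static void unmap_buffers_batched(struct iommu_domain *domain,
                                  const unsigned long *iovas,
                                  const size_t *sizes, unsigned int count)
{
        unsigned int i;

        /* unmap everything first, deferring the IOTLB flush */
        for (i = 0; i < count; i++) {
                iommu_unmap_fast(domain, iovas[i], sizes[i]);
                iommu_tlb_range_add(domain, iovas[i], sizes[i]);
        }

        /* a single TLB invalidation for the whole batch */
        iommu_tlb_sync(domain);

        /* ... now the pages behind these mappings can be freed */
}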
On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote: > > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > > > > get rid of, not add use of it. The reason for that is two-fold: > > > > > > > > > > a) it doesn't address how cache maintaince actually works in most > > > > > platforms. When talking about a cache we three fundamental operations: > > > > > > > > > > 1) write back - this writes the content of the cache back to the > > > > > backing memory > > > > > 2) invalidate - this remove the content of the cache > > > > > 3) write back + invalidate - do both of the above > > > > > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > > out to memory later, and so that I don't get a cache hit on > > > > uncached/wc mmap'ing. > > > > > > Is there a cacheable alias lying around (e.g. the linear map), or are > > > these addresses only mapped uncached/wc? > > > > > > If there's a cacheable alias, performing an invalidate isn't sufficient, > > > since a CPU can allocate a new (clean) entry at any point in time (e.g. > > > as a result of prefetching or arbitrary speculation). > > > > I *believe* that there are not alias mappings (that I don't control > > myself) for pages coming from > > shmem_file_setup()/shmem_read_mapping_page().. > > AFAICT, that's regular anonymous memory, so there will be a cacheable > alias in the linear/direct map. tbh, I'm not 100% sure whether there is a cacheable alias, or whether any potential linear map is torn down. My understanding is that a cacheable alias is "ok", with some caveats.. ie. that the cacheable alias is not accessed. I'm not entirely sure about pre-fetch from access to adjacent pages. We have been using shmem as a source for pages since the beginning, and I haven't seen it cause any problems in the last 6 years. (This is limited to armv7 and armv8, I'm not really sure what would happen on armv6, but that is a combo I don't have to care about.) BR, -R > > digging around at what dma_sync_sg_* does under the hood, it looks > > like it is just arch_sync_dma_for_cpu/device(), so I guess that should > > be sufficient for what I need. > > I don't think that's the case, per the example I gave above. > > Thanks, > Mark.
On Wed, Aug 07, 2019 at 09:15:54AM -0700, Rob Clark wrote: > On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote: > > > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > > > > > get rid of, not add use of it. The reason for that is two-fold: > > > > > > > > > > > > a) it doesn't address how cache maintaince actually works in most > > > > > > platforms. When talking about a cache we three fundamental operations: > > > > > > > > > > > > 1) write back - this writes the content of the cache back to the > > > > > > backing memory > > > > > > 2) invalidate - this remove the content of the cache > > > > > > 3) write back + invalidate - do both of the above > > > > > > > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > > > out to memory later, and so that I don't get a cache hit on > > > > > uncached/wc mmap'ing. > > > > > > > > Is there a cacheable alias lying around (e.g. the linear map), or are > > > > these addresses only mapped uncached/wc? > > > > > > > > If there's a cacheable alias, performing an invalidate isn't sufficient, > > > > since a CPU can allocate a new (clean) entry at any point in time (e.g. > > > > as a result of prefetching or arbitrary speculation). > > > > > > I *believe* that there are not alias mappings (that I don't control > > > myself) for pages coming from > > > shmem_file_setup()/shmem_read_mapping_page().. > > > > AFAICT, that's regular anonymous memory, so there will be a cacheable > > alias in the linear/direct map. > > tbh, I'm not 100% sure whether there is a cacheable alias, or whether > any potential linear map is torn down. I'm fairly confident that the linear/direct map cacheable alias is not torn down when pages are allocated. The gneeric page allocation code doesn't do so, and I see nothing the shmem code to do so. For arm64, we can tear down portions of the linear map, but that has to be done explicitly, and this is only possible when using rodata_full. If not using rodata_full, it is not possible to dynamically tear down the cacheable alias. > My understanding is that a cacheable alias is "ok", with some > caveats.. ie. that the cacheable alias is not accessed. Unfortunately, that is not true. You'll often get away with it in practice, but that's a matter of probability rather than a guarantee. You cannot prevent a CPU from accessing a VA arbitrarily (e.g. as the result of wild speculation). The ARM ARM (ARM DDI 0487E.a) points this out explicitly: | Entries for addresses that are Normal Cacheable can be allocated to | the cache at any time, and so the cache invalidate instruction cannot | ensure that the address is not present in a cache. ... along with: | Caches introduce a number of potential problems, mainly because: | | * Memory accesses can occur at times other than when the programmer | would expect them. ;) > I'm not entirely sure about pre-fetch from access to adjacent pages. > We have been using shmem as a source for pages since the beginning, > and I haven't seen it cause any problems in the last 6 years. 
(This > is limited to armv7 and armv8, I'm not really sure what would happen > on armv6, but that is a combo I don't have to care about.) Over time, CPUs get more aggressive w.r.t. prefetching and speculation, so having not seen such issues in the past does not imply we won't in future. Anecdotally, we had an issue a few years ago that we were only able to reproduce under heavy stress testing. A CPU would make speculative instruction fetches from a read-destructive MMIO register, despite the PC never going near the corresponding VA, and despite that code having (apparently) caused no issue for years before that. See commit: b6ccb9803e90c16b ("ARM: 7954/1: mm: remove remaining domain support from ARMv6") ... which unfortunately lacks the full war story. Thanks, Mark.
On Wed, Aug 7, 2019 at 9:50 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, Aug 07, 2019 at 09:15:54AM -0700, Rob Clark wrote: > > On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote: > > > > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote: > > > > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > > > > > > > This goes in the wrong direction. drm_cflush_* are a bad API we need to > > > > > > > get rid of, not add use of it. The reason for that is two-fold: > > > > > > > > > > > > > > a) it doesn't address how cache maintaince actually works in most > > > > > > > platforms. When talking about a cache we three fundamental operations: > > > > > > > > > > > > > > 1) write back - this writes the content of the cache back to the > > > > > > > backing memory > > > > > > > 2) invalidate - this remove the content of the cache > > > > > > > 3) write back + invalidate - do both of the above > > > > > > > > > > > > Agreed that drm_cflush_* isn't a great API. In this particular case > > > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop > > > > > > out to memory later, and so that I don't get a cache hit on > > > > > > uncached/wc mmap'ing. > > > > > > > > > > Is there a cacheable alias lying around (e.g. the linear map), or are > > > > > these addresses only mapped uncached/wc? > > > > > > > > > > If there's a cacheable alias, performing an invalidate isn't sufficient, > > > > > since a CPU can allocate a new (clean) entry at any point in time (e.g. > > > > > as a result of prefetching or arbitrary speculation). > > > > > > > > I *believe* that there are not alias mappings (that I don't control > > > > myself) for pages coming from > > > > shmem_file_setup()/shmem_read_mapping_page().. > > > > > > AFAICT, that's regular anonymous memory, so there will be a cacheable > > > alias in the linear/direct map. > > > > tbh, I'm not 100% sure whether there is a cacheable alias, or whether > > any potential linear map is torn down. > > I'm fairly confident that the linear/direct map cacheable alias is not > torn down when pages are allocated. The gneeric page allocation code > doesn't do so, and I see nothing the shmem code to do so. > > For arm64, we can tear down portions of the linear map, but that has to > be done explicitly, and this is only possible when using rodata_full. If > not using rodata_full, it is not possible to dynamically tear down the > cacheable alias. So, we do end up using GFP_HIGHUSER, which appears to get passed thru when shmem gets to the point of actually allocating pages.. not sure if that just ends up being a hint, or if it guarantees that we don't get something in the linear map. (Bear with me while I "page" this all back in.. last time I dug thru the shmem code was probably pre-armv8, or at least before I had any armv8 hw) BR, -R
On Wed, Aug 07, 2019 at 01:38:08PM +0100, Mark Rutland wrote: > > I *believe* that there are not alias mappings (that I don't control > > myself) for pages coming from > > shmem_file_setup()/shmem_read_mapping_page().. > > AFAICT, that's regular anonymous memory, so there will be a cacheable > alias in the linear/direct map. Yes. Although shmem is in no way special in that regard. Even with the normal dma_alloc_coherent implementation on arm and arm64 we keep the cacheable alias in the direct mapping and just create a new non-cacheable one. The only exception are CMA allocations on 32-bit arm, which do get remapped to uncachable in place.
On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote: > I'm fairly confident that the linear/direct map cacheable alias is not > torn down when pages are allocated. The gneeric page allocation code > doesn't do so, and I see nothing the shmem code to do so. It is not torn down anywhere. > For arm64, we can tear down portions of the linear map, but that has to > be done explicitly, and this is only possible when using rodata_full. If > not using rodata_full, it is not possible to dynamically tear down the > cacheable alias. Interesting. For this or next merge window I plan to add support to the generic DMA code to remap pages as uncachable in place based on the openrisc code. As far as I can tell the requirement for that is basically just that the kernel direct mapping doesn't use PMD or bigger mapping so that it supports changing protection bits on a per-PTE basis. Is that the case with arm64 + rodata_full? > > My understanding is that a cacheable alias is "ok", with some > > caveats.. ie. that the cacheable alias is not accessed. > > Unfortunately, that is not true. You'll often get away with it in > practice, but that's a matter of probability rather than a guarantee. > > You cannot prevent a CPU from accessing a VA arbitrarily (e.g. as the > result of wild speculation). The ARM ARM (ARM DDI 0487E.a) points this > out explicitly: Well, if we want to fix this properly we'll have to remap in place for dma_alloc_coherent and friends.
On Wed, Aug 07, 2019 at 10:30:04AM -0700, Rob Clark wrote: > So, we do end up using GFP_HIGHUSER, which appears to get passed thru > when shmem gets to the point of actually allocating pages.. not sure > if that just ends up being a hint, or if it guarantees that we don't > get something in the linear map. > > (Bear with me while I "page" this all back in.. last time I dug thru > the shmem code was probably pre-armv8, or at least before I had any > armv8 hw) GFP_HIGHUSER basically just means that this is an allocation that could dip into highmem, in which case it would not have a kernel mapping. This can happen on arm + LPAE, but not on arm64.
On Wed, Aug 07, 2019 at 10:48:56AM +0200, Daniel Vetter wrote: > > other drm drivers how do they guarantee addressability without an > > iommu?) > > We use shmem to get at swappable pages. We generally just assume that > the gpu can get at those pages, but things fall apart in fun ways: > - some setups somehow inject bounce buffers. Some drivers just give > up, others try to allocate a pool of pages with dma_alloc_coherent. > - some devices are misdesigned and can't access as much as the cpu. We > allocate using GFP_DMA32 to fix that. Well, for shmem you can't really call allocators directly, right? One thing I have in my pipeline is a dma_alloc_pages API that allocates pages that are guaranteed to be addressable by the device or otherwise fail. But that doesn't really help with the shmem fs. > Also modern gpu apis pretty much assume you can malloc() and then use > that directly with the gpu. Which is fine as long as the GPU itself supports full 64-bit addressing (or always sits behind an iommu), and the platform doesn't impose an addressing limit, which unfortunately some platforms that are shipped right now still do :( But userspace malloc really means dma_map_* anyway, so not really relevant for memory allocations.
On Wed, Aug 07, 2019 at 09:09:53AM -0700, Rob Clark wrote: > > > (Eventually I'd like to support pages passed in from userspace.. but > > > that is down the road.) > > > > Eww. Please talk to the iommu list before starting on that. > > This is more of a long term goal, we can't do it until we have > per-context/process pagetables, ofc. > > Getting a bit off topic, but I'm curious about what problems you are > concerned about. Userspace can shoot it's own foot, but if it is not > sharing GPU pagetables with other processes, it can't shoot other's > feet. (I'm guessing you are concerned about non-page-aligned > mappings?) Maybe I misunderstood what you meant above; I thought you meant messing with page cacheability attributes for userspace pages. If what you are looking into is just "standard" SVM, I only hope that our APIs for that (which currently are a mess) are in shape by then, as all users currently have their own crufty and at least slightly buggy versions of that. But at least it is an issue that is being worked on. > > So back to the question, I'd like to understand your use case (and > > maybe hear from the other drm folks if that is common): > > > > - you allocate pages from shmem (why shmem, btw? if this is done by > > other drm drivers how do they guarantee addressability without an > > iommu?) > > shmem for swappable pages. I don't unpin and let things get swapped > out yet, but I'm told it starts to become important when you have 50 > browser tabs open ;-) Yes, but at that point the swapping can use the kernel linear mapping and we are going into aliasing problems that can disturb the cache. So as-is this is going to be problematic without new hooks into shmemfs. > > - then the memory is either mapped to userspace or vmapped (or even > > both, althrough the lack of aliasing you mentioned would speak > > against it) as writecombine (aka arm v6+ normal uncached). Does > > the mapping live on until the memory is freed? > > (side note, *most* of the drm/msm supported devices are armv8, the > exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm > will ever have to deal w/ armv6) Well, the point was that starting from v6 the kernel's dma uncached really is write combine. So that applies to v7 and v8 as well.
On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote: > On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote: > > For arm64, we can tear down portions of the linear map, but that has to > > be done explicitly, and this is only possible when using rodata_full. If > > not using rodata_full, it is not possible to dynamically tear down the > > cacheable alias. > > Interesting. For this or next merge window I plan to add support to the > generic DMA code to remap pages as uncachable in place based on the > openrisc code. Aѕ far as I can tell the requirement for that is > basically just that the kernel direct mapping doesn't use PMD or bigger > mapping so that it supports changing protection bits on a per-PTE basis. > Is that the case with arm64 + rodata_full? Yes, with the added case that on arm64 we can also have contiguous entries at the PTE level, which we also have to disable. Our kernel page table creation code does that for rodata_full or DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code. Thanks, Mark.
On Thu, Aug 08, 2019 at 11:20:53AM +0100, Mark Rutland wrote: > On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote: > > On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote: > > > For arm64, we can tear down portions of the linear map, but that has to > > > be done explicitly, and this is only possible when using rodata_full. If > > > not using rodata_full, it is not possible to dynamically tear down the > > > cacheable alias. > > > > Interesting. For this or next merge window I plan to add support to the > > generic DMA code to remap pages as uncachable in place based on the > > openrisc code. Aѕ far as I can tell the requirement for that is > > basically just that the kernel direct mapping doesn't use PMD or bigger > > mapping so that it supports changing protection bits on a per-PTE basis. > > Is that the case with arm64 + rodata_full? > > Yes, with the added case that on arm64 we can also have contiguous > entries at the PTE level, which we also have to disable. > > Our kernel page table creation code does that for rodata_full or > DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass > NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code. Whoops, that should be: arch/arm64/mm/mmu.c. Mark.
On Thu, Aug 08, 2019 at 11:20:53AM +0100, Mark Rutland wrote: > On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote: > > On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote: > > > For arm64, we can tear down portions of the linear map, but that has to > > > be done explicitly, and this is only possible when using rodata_full. If > > > not using rodata_full, it is not possible to dynamically tear down the > > > cacheable alias. > > > > Interesting. For this or next merge window I plan to add support to the > > generic DMA code to remap pages as uncachable in place based on the > > openrisc code. Aѕ far as I can tell the requirement for that is > > basically just that the kernel direct mapping doesn't use PMD or bigger > > mapping so that it supports changing protection bits on a per-PTE basis. > > Is that the case with arm64 + rodata_full? > > Yes, with the added case that on arm64 we can also have contiguous > entries at the PTE level, which we also have to disable. > > Our kernel page table creation code does that for rodata_full or > DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass > NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code. FWIW, we made rodata_full the default a couple of releases ago, so if solving the cacheable alias for non-cacheable DMA buffers requires this to be present, then we could probably just refuse to probe non-coherent DMA-capable devices on systems where rodata_full has been disabled. Will
On Thu, Aug 08, 2019 at 11:55:06AM +0200, Christoph Hellwig wrote: > On Wed, Aug 07, 2019 at 10:48:56AM +0200, Daniel Vetter wrote: > > > other drm drivers how do they guarantee addressability without an > > > iommu?) > > > > We use shmem to get at swappable pages. We generally just assume that > > the gpu can get at those pages, but things fall apart in fun ways: > > - some setups somehow inject bounce buffers. Some drivers just give > > up, others try to allocate a pool of pages with dma_alloc_coherent. > > - some devices are misdesigned and can't access as much as the cpu. We > > allocate using GFP_DMA32 to fix that. > > Well, for shmem you can't really call allocators directly, right? We can pass gfp flags to shmem_read_mapping_page_gfp, which is just about enough for the 2 cases on intel platforms where the gpu can only access 4G, but the cpu has way more. > One thing I have in my pipeline is a dma_alloc_pages API that allocates > pages that are guaranteed to be addressably by the device or otherwise > fail. But that doesn't really help with the shmem fs. Yeah, the other drivers where the shmem gfp trick doesn't work copy back&forth between the dma-able pages and the shmem swappable pages as needed in their shrinker/allocation code. I guess ideal would be if we could fuse the custom allocator somehow directly into shmem. Otoh once you start thrashing beyond system memory for gfx workloads it's pretty hopeless anyway, and speed doesn't really matter anymore. > > Also modern gpu apis pretty much assume you can malloc() and then use > > that directly with the gpu. > > Which is fine as long as the GPU itself supports full 64-bit addressing > (or always sits behind an iommu), and the platform doesn't impose > addressing limit, which unfortunately some that are shipped right now > still do :( Yes, the userspace api people in khronos are occasionally a bit optimistic :-) > But userspace malloc really means dma_map_* anyway, so not really > relevant for memory allocations. It does tie in, since we'll want a dma_map which fails if a direct mapping isn't possible. It also helps the driver code a lot if we could use the same low-level flushing functions between our own memory (whatever that is) and anon pages from malloc. And in all the cases if it's not possible, we want a failure, not elaborate attempts at hiding the differences between all possible architectures out there. -Daniel
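For reference, the shmem gfp trick mentioned above boils down to something like the sketch below (simplified, loosely modelled on what i915 does, not copied from it): constrain the object's mapping so shmem allocates its pages below 4G.

#include <linux/gfp.h>
#include <linux/pagemap.h>
#include <linux/shmem_fs.h>

/* restrict a shmem object's backing pages to the low 4G for a device
 * with a 32-bit addressing limit */
static void constrain_backing_to_4g(struct address_space *mapping)
{
        gfp_t mask = GFP_HIGHUSER;

        mask &= ~__GFP_HIGHMEM;  /* highmem pages may live above 4G */
        mask |= __GFP_DMA32;     /* allocate from ZONE_DMA32 instead */

        mapping_set_gfp_mask(mapping, mask);
}

/* individual pages can also be read with an explicit mask */
static struct page *read_page_below_4g(struct address_space *mapping,
                                       pgoff_t index)
{
        return shmem_read_mapping_page_gfp(mapping, index,
                                           GFP_USER | __GFP_DMA32);
}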
On Thu, Aug 8, 2019 at 3:00 AM Christoph Hellwig <hch@lst.de> wrote: > > On Wed, Aug 07, 2019 at 09:09:53AM -0700, Rob Clark wrote: > > > > (Eventually I'd like to support pages passed in from userspace.. but > > > > that is down the road.) > > > > > > Eww. Please talk to the iommu list before starting on that. > > > > This is more of a long term goal, we can't do it until we have > > per-context/process pagetables, ofc. > > > > Getting a bit off topic, but I'm curious about what problems you are > > concerned about. Userspace can shoot it's own foot, but if it is not > > sharing GPU pagetables with other processes, it can't shoot other's > > feet. (I'm guessing you are concerned about non-page-aligned > > mappings?) > > Maybe I misunderstood what you mean above, I though you mean messing > with page cachability attributes for userspace pages. If what you are > looking into is just "standard" SVM I only hope that our APIs for that > which currently are a mess are in shape by then, as all users currently > have their own crufty and at least slightly buggy versions of that. But > at least it is an issue that is being worked on. ahh, ok.. and no, we wouldn't be remapping 'malloc' memory as writecombine. We'd have to wire up better support for cached buffers. Currently we use WC for basically all GEM buffers allocated from kernel, since that is a good choice for most GPU workloads.. ie. CPU isn't reading back from GPU buffers in most cases. I'm told cached buffers are useful for compute workloads where there is more back and forth between GPU and CPU, but we haven't really crossed that bridge yet. Compute workloads is also were the SVM interest is. > > > So back to the question, I'd like to understand your use case (and > > > maybe hear from the other drm folks if that is common): > > > > > > - you allocate pages from shmem (why shmem, btw? if this is done by > > > other drm drivers how do they guarantee addressability without an > > > iommu?) > > > > shmem for swappable pages. I don't unpin and let things get swapped > > out yet, but I'm told it starts to become important when you have 50 > > browser tabs open ;-) > > Yes, but at that point the swapping can use the kernel linear mapping > and we are going into aliasing problems that can disturb the cache. So > as-is this is going to problematic without new hooks into shmemfs. > My expectation is that we'd treat the pages as cached when handing them back to shmemfs, and wb+inv when we take them back again from shmemfs and re-pin. I think this works out to be basically the same scenario as having to wb+inv when we first get the pages from shmemfs. > > > - then the memory is either mapped to userspace or vmapped (or even > > > both, althrough the lack of aliasing you mentioned would speak > > > against it) as writecombine (aka arm v6+ normal uncached). Does > > > the mapping live on until the memory is freed? > > > > (side note, *most* of the drm/msm supported devices are armv8, the > > exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm > > will ever have to deal w/ armv6) > > Well, the point was that starting from v6 the kernels dma uncached > really is write combine. So that applied to v7 and v8 as well. ahh, gotcha BR, -R
On Thu, Aug 8, 2019 at 12:59 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 07, 2019 at 10:30:04AM -0700, Rob Clark wrote:
> > So, we do end up using GFP_HIGHUSER, which appears to get passed thru
> > when shmem gets to the point of actually allocating pages.. not sure
> > if that just ends up being a hint, or if it guarantees that we don't
> > get something in the linear map.
> >
> > (Bear with me while I "page" this all back in.. last time I dug thru
> > the shmem code was probably pre-armv8, or at least before I had any
> > armv8 hw)
>
> GFP_HIGHUSER basically just means that this is an allocation that could
> dip into highmem, in which case it would not have a kernel mapping.
> This can happen on arm + LPAE, but not on arm64.

Just a dumb question, but why is *all* memory in the linear map on
arm64?  It would seem useful to have a source of pages that is not in
the linear map.  I guess it is mapped as huge pages (or something larger
than 4k pages)?

Any recommended reading to understand how/why the kernel address space
is setup the way it is (so I can ask fewer dumb questions)?

BR,
-R
On Thu, Aug 08, 2019 at 01:58:08PM +0200, Daniel Vetter wrote:
> > > We use shmem to get at swappable pages. We generally just assume that
> > > the gpu can get at those pages, but things fall apart in fun ways:
> > > - some setups somehow inject bounce buffers. Some drivers just give
> > >   up, others try to allocate a pool of pages with dma_alloc_coherent.
> > > - some devices are misdesigned and can't access as much as the cpu. We
> > >   allocate using GFP_DMA32 to fix that.
> >
> > Well, for shmem you can't really call allocators directly, right?
>
> We can pass gfp flags to shmem_read_mapping_page_gfp, which is just about
> enough for the 2 cases on intel platforms where the gpu can only access
> 4G, but the cpu has way more.

Right.  And that works for architectures without weird DMA offsets and
devices that have exactly a 32-bit DMA limit.  It falls flat for all the
more complex ones unfortunately.

> > But userspace malloc really means dma_map_* anyway, so not really
> > relevant for memory allocations.
>
> It does tie in, since we'll want a dma_map which fails if a direct
> mapping isn't possible. It also helps the driver code a lot if we could
> use the same low-level flushing functions between our own memory
> (whatever that is) and anon pages from malloc. And in all the cases if
> it's not possible, we want a failure, not elaborate attempts at hiding
> the differences between all possible architectures out there.

At the very lowest level it all comes down to the same three primitives
we talked about anyway, but there are different ways in which they are
combined.  For the streaming mappings, look at the table in
arch/arc/mm/dma.c I mentioned earlier.  For memory that is prepared for
just mmaping to userspace without a kernel user we'll always do a
wb+inv.  But as the other subthread shows we'll need to eventually look
into unmapping (or remapping with the same attributes) of that memory in
kernel space to avoid speculation bugs (or just invalid combinations on
x86, where we check for that), so the API will be a little more complex.

Btw, are all DRM drivers using vmf_insert_* to pre-populate the mapping
like the MSM case, or are some doing dynamic faulting from
vm_ops->fault?
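[A generic sketch of the "dynamic faulting" alternative asked about
above: a vm_ops->fault handler that inserts one page per fault instead
of pre-populating the whole mapping. struct my_bo and the function
names are hypothetical, not taken from any particular driver.]

#include <linux/mm.h>

struct my_bo {				/* hypothetical GEM-like object */
	struct page **pages;
	unsigned long num_pages;
};

static vm_fault_t my_bo_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct my_bo *bo = vma->vm_private_data;

	if (vmf->pgoff >= bo->num_pages)
		return VM_FAULT_SIGBUS;

	/* Requires the vma to have been set up with VM_PFNMAP at mmap time. */
	return vmf_insert_pfn(vma, vmf->address,
			      page_to_pfn(bo->pages[vmf->pgoff]));
}

static const struct vm_operations_struct my_bo_vm_ops = {
	.fault = my_bo_fault,
};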
On Thu, Aug 08, 2019 at 09:44:32AM -0700, Rob Clark wrote:
> > GFP_HIGHUSER basically just means that this is an allocation that could
> > dip into highmem, in which case it would not have a kernel mapping.
> > This can happen on arm + LPAE, but not on arm64.
>
> Just a dumb question, but why is *all* memory in the linear map on
> arm64?  It would seem useful to have a source of pages that is not in
> the linear map.
>
> I guess it is mapped as huge pages (or something larger than 4k pages)?

In general that is just how the Linux kernel has always worked, on all
architectures - we always had a linear mapping for all memory in the
kernel to make accessing it simple.  That started to break down a bit
with the 32-bit x86 PAE mode that supported more physical addressing
than virtual, which required the "high" memory to not be mapped into the
kernel direct mapping.  Similar schemes later showed up on various other
32-bit architectures.

There is a patchset called XPFO that ensures memory is either in the
kernel direct map or in userspace, but not both, to work around
speculation related vulnerabilities, but it has a pretty big performance
impact.

> Any recommended reading to understand how/why the kernel address space
> is setup the way it is (so I can ask fewer dumb questions)?

I don't really have a good pointer.  But usually there are just dumb
answers, not dumb questions.
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index dc19300309d2..f0eb6320c979 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -93,3 +93,5 @@ void arch_invalidate_pmem(void *addr, size_t size)
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
+
+EXPORT_SYMBOL_GPL(__flush_dcache_area);
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index 3bd76e918b5d..90105c637797 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -69,6 +69,14 @@ static void drm_cache_flush_clflush(struct page *pages[],
 }
 #endif
 
+#if defined(__powerpc__)
+static void __flush_dcache_area(void *addr, size_t len)
+{
+	flush_dcache_range((unsigned long)addr,
+			   (unsigned long)addr + len);
+}
+#endif
+
 /**
  * drm_clflush_pages - Flush dcache lines of a set of pages.
  * @pages: List of pages to be flushed.
@@ -90,7 +98,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
 	if (wbinvd_on_all_cpus())
 		pr_err("Timed out waiting for cache flush\n");
 
-#elif defined(__powerpc__)
+#elif defined(__powerpc__) || defined(CONFIG_ARM64)
 	unsigned long i;
 	for (i = 0; i < num_pages; i++) {
 		struct page *page = pages[i];
@@ -100,8 +108,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
 			continue;
 
 		page_virtual = kmap_atomic(page);
-		flush_dcache_range((unsigned long)page_virtual,
-				   (unsigned long)page_virtual + PAGE_SIZE);
+		__flush_dcache_area(page_virtual, PAGE_SIZE);
 		kunmap_atomic(page_virtual);
 	}
 #else
@@ -135,6 +142,13 @@ drm_clflush_sg(struct sg_table *st)
 	if (wbinvd_on_all_cpus())
 		pr_err("Timed out waiting for cache flush\n");
 
+#elif defined(CONFIG_ARM64)
+	struct sg_page_iter sg_iter;
+
+	for_each_sg_page(st->sgl, &sg_iter, st->nents, 0) {
+		struct page *p = sg_page_iter_page(&sg_iter);
+		drm_clflush_pages(&p, 1);
+	}
 #else
 	pr_err("Architecture has no drm_cache.c support\n");
 	WARN_ON_ONCE(1);
diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
index 987ff16b9420..f94e7bd3eca4 100644
--- a/include/drm/drm_cache.h
+++ b/include/drm/drm_cache.h
@@ -40,6 +40,10 @@ void drm_clflush_sg(struct sg_table *st);
 void drm_clflush_virt_range(void *addr, unsigned long length);
 bool drm_need_swiotlb(int dma_bits);
 
+#if defined(CONFIG_X86) || defined(__powerpc__) || defined(CONFIG_ARM64)
+#define HAS_DRM_CACHE 1
+#endif
+
 static inline bool drm_arch_can_wc_memory(void)
 {
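[For context, roughly how a driver would consume the helpers the patch
above touches: flushing the CPU caches over a GEM object's pages before
the GPU first reads them through a non-coherent path. A sketch only;
the wrapper name is made up, and the thread argues against spreading
this API rather than for it.]

#include <drm/drm_cache.h>
#include <linux/bug.h>
#include <linux/scatterlist.h>

static void my_gem_flush_for_device(struct sg_table *sgt)
{
#ifdef HAS_DRM_CACHE
	/* x86, powerpc and (with this patch) arm64: write back the
	 * cache lines covering every page in the sg_table. */
	drm_clflush_sg(sgt);
#else
	WARN_ON_ONCE(1);	/* no cache maintenance available here */
#endif
}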