Message ID: 1424810703-20382-1-git-send-email-drake@endlessm.com (mailing list archive)
State: New, archived
On Tuesday 24 February 2015 14:45:03 Daniel Drake wrote:
> Currently when arm_dma_mmap() is called, vm_page_prot is reset
> according to DMA attributes. Currently, writecombine is the
> only attribute supported; any other configuration will map uncached
> memory.
>
> Add support for the non-consistent attribute so that cachable memory
> can be mapped into userspace via the dma_mmap_attrs() API.
>
> Signed-off-by: Daniel Drake <drake@endlessm.com>
>

How does the user process enforce consistency then?

	Arnd
On Tue, Feb 24, 2015 at 02:45:03PM -0600, Daniel Drake wrote:
> Currently when arm_dma_mmap() is called, vm_page_prot is reset
> according to DMA attributes. Currently, writecombine is the
> only attribute supported; any other configuration will map uncached
> memory.
>
> Add support for the non-consistent attribute so that cachable memory
> can be mapped into userspace via the dma_mmap_attrs() API.

What is your use-case for this?
On Wed, Feb 25, 2015 at 4:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 24 February 2015 14:45:03 Daniel Drake wrote:
>> Currently when arm_dma_mmap() is called, vm_page_prot is reset
>> according to DMA attributes. Currently, writecombine is the
>> only attribute supported; any other configuration will map uncached
>> memory.
>>
>> Add support for the non-consistent attribute so that cachable memory
>> can be mapped into userspace via the dma_mmap_attrs() API.
>>
>> Signed-off-by: Daniel Drake <drake@endlessm.com>
>>
>
> How does the user process enforce consistency then?

Hmm, good point.

The use case here is that we're mapping certain memory shared with the
Mali GPU into userspace, where we expect the CPU to be doing rendering
to it, and we've measured performance improvements from having the CPU
cache enabled.

But Mali also comes alongside UMP (comparable to dma-buf) which provides
an ioctl-based API for userspace to invalidate/flush caches, which we
use heavily in this pipeline.

UMP's not going upstream, but I was wondering if this patch had any
value in the more general case. Thinking more, perhaps not, at least
until dma-buf grows some form of userspace cache control API.

Daniel
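For reference, a minimal sketch of how a driver might exercise the attribute being proposed here, using the struct dma_attrs based API of this era. The device pointer, the buffer size and the example_cached_alloc_mmap() handler are hypothetical, and error/cleanup paths are omitted.

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hypothetical mmap path: allocate a non-consistent (cacheable) buffer
 * and map it into userspace.  With the patch above, dma_mmap_attrs()
 * would leave vm_page_prot cacheable instead of forcing it to
 * uncached/writecombine.  Freeing the buffer is omitted.
 */
static int example_cached_alloc_mmap(struct device *dev,
				     struct vm_area_struct *vma, size_t size)
{
	DEFINE_DMA_ATTRS(attrs);
	dma_addr_t dma_handle;
	void *cpu_addr;

	dma_set_attr(DMA_ATTR_NON_CONSISTENT, &attrs);

	cpu_addr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, &attrs);
	if (!cpu_addr)
		return -ENOMEM;

	return dma_mmap_attrs(dev, vma, cpu_addr, dma_handle, size, &attrs);
}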
On Wed, Feb 25, 2015 at 07:58:00AM -0600, Daniel Drake wrote:
> UMP's not going upstream, but I was wondering if this patch had any
> value in the more general case. Thinking more, perhaps not, at least
> until dma-buf grows some form of userspace cache control API.

Which I would oppose.

The DMA API itself provides a very simple model:

sg = cpu owned buffer(s)
/* cpu owns sg buffers and can read/write it */
dma_addr = dma_map_sg(dev, sg, ...);
/* dev owns sg buffers, it is invalid for the CPU to write to the buffer
 * CPU writes to the buffer may very well be lost */
dma_sync_sg_for_cpu(dev, sg, ...);
/* cpu owns sg buffers, CPU can read/write buffer */
dma_sync_sg_for_device(dev, sg, ...);
/* dev owns sg buffers, it is invalid for the CPU to write to the buffer
 * CPU writes to the buffer may very well be lost */
dma_unmap_sg(dev, sg, ...);
/* cpu owns sg buffers again */

If we're wanting to do something like this in userspace, there's too much
chance userspace will get this ownership business wrong (we know this,
userspace programmers basically suck - they abuse anything we give them.)

It's far better to have a proper infrastructure present for things like
GPUs. The good news is that we have such an infrastructure. It's called
DRM, and with DRM, we track things like buffers associated with the GPU,
and we know which buffers the GPU should be accessing, and when the GPU
has finished with them.

There are even implementations on x86 (such as the Intel drivers) which
track which domain each buffer is in, whether it's in the GPU's domain or
the CPU's domain, and the buffer has to be in the correct domain for it
to be operated upon.

So, we already have everything necessary. Except for MALI, we don't,
because of course we're better off having a closed source, inherently
insecure GPU driver which needs loads of crap exported to userspace to
work efficiently.

No, sorry, we're not going to make this easy by providing interfaces for
userspace to crap on: GPU drivers need to be implemented properly.
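Spelled out as a compilable fragment, the ownership cycle described above looks roughly like this; the device, scatterlist and DMA direction are placeholders, and example_ownership_cycle() is not a real kernel function.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void example_ownership_cycle(struct device *dev,
				    struct scatterlist *sg, int nents)
{
	/* CPU owns the buffers: it may read and write them freely. */

	if (!dma_map_sg(dev, sg, nents, DMA_BIDIRECTIONAL))
		return;
	/* Device owns the buffers: CPU writes may be lost. */

	dma_sync_sg_for_cpu(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* CPU owns the buffers again and may read/write them. */

	dma_sync_sg_for_device(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* Device owns the buffers once more. */

	dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* CPU owns the buffers again. */
}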
On Wed, Feb 25, 2015 at 8:18 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> sg = cpu owned buffer(s)
> /* cpu owns sg buffers and can read/write it */
> dma_addr = dma_map_sg(dev, sg, ...);
> /* dev owns sg buffers, it is invalid for the CPU to write to the buffer
>  * CPU writes to the buffer may very well be lost */
> dma_sync_sg_for_cpu(dev, sg, ...);
> /* cpu owns sg buffers, CPU can read/write buffer */
> dma_sync_sg_for_device(dev, sg, ...);
> /* dev owns sg buffers, it is invalid for the CPU to write to the buffer
>  * CPU writes to the buffer may very well be lost */
> dma_unmap_sg(dev, sg, ...);
> /* cpu owns sg buffers again */
>
> If we're wanting to do something like this in userspace, there's too much
> chance userspace will get this ownership business wrong (we know this,
> userspace programmers basically suck - they abuse anything we give them.)
>
> It's far better to have a proper infrastructure present for things like
> GPUs. The good news is that we have such an infrastructure. It's called
> DRM, and with DRM, we track things like buffers associated with the GPU,
> and we know which buffers the GPU should be accessing, and when the GPU
> has finished with them.

Fair enough, what you're describing does sound like a better model.
Thanks for explaining.

I'm still a little unclear on how DRM solves this particular problem
though. At the point when the buffer is CPU-owned, it can be mapped
into userspace with CPU caches enabled, right? But how can DRM do that
if the dma_mmap_* API is always going to map it as writecombine or
uncached?

Or should DRM drivers avoid that API and use remap_pfn_range() directly,
retaining full control of pgprot flags?

Thanks
Daniel
On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
> Fair enough, what you're describing does sound like a better model.
> Thanks for explaining.
>
> I'm still a little unclear on how DRM solves this particular problem
> though. At the point when the buffer is CPU-owned, it can be mapped
> into userspace with CPU caches enabled, right?

Whether a buffer is mapped or not is an entirely separate issue.
We have many cases where the kernel has the buffer mapped into its
lowmem region while the device is doing DMA. Having a buffer mapped
into userspace is no different.

What DRM can do is track the state of the buffer: the DRM model is that
you talk to the GPU through DRM, which means that you submit a command
stream, along with identifiers for the buffers you want the command
stream to operate on.

DRM can then scan the state of those buffers, and perform the appropriate
DMA API operation on the buffers to flip them over to device ownership.

When userspace wants to access the buffer later, it needs to ask DRM
whether the buffer is safe to access - this causes DRM to check whether
the buffer is still being used for a render operation, and can then
flip the buffer back to CPU ownership.

The idea that a buffer needs to be constantly mapped and unmapped in
userspace would create its own problems: there is a cost to setting up
and tearing down the mappings.

As with anything performance related, the less work you can do, the faster
you will appear to be: that applies very much here. If you can avoid
having to set up and tear down mappings, if you can avoid having to do
cache maintenance all the time, you will gain extra performance quite
simply because you're not wasting CPU cycles doing stuff which is not
absolutely necessary.

I would put some of this into practice with etnaviv-drm, but I've decided
to walk away from that project and just look after the work which I once
did on it as a fork.
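A rough sketch of the kind of ownership tracking described above; struct example_bo and the two helpers are hypothetical and not an existing DRM interface, and fencing/locking around GPU completion is omitted.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical per-buffer state: which side currently owns the pages. */
struct example_bo {
	struct device *dev;
	struct scatterlist *sgl;
	int nents;
	bool device_owned;
};

/* On command-stream submission: flip the buffer to device ownership. */
static void example_bo_to_device(struct example_bo *bo)
{
	if (!bo->device_owned) {
		dma_sync_sg_for_device(bo->dev, bo->sgl, bo->nents,
				       DMA_BIDIRECTIONAL);
		bo->device_owned = true;
	}
}

/* When userspace asks for access and rendering has finished: flip the
 * buffer back to CPU ownership so cached CPU accesses are safe.
 */
static void example_bo_to_cpu(struct example_bo *bo)
{
	if (bo->device_owned) {
		dma_sync_sg_for_cpu(bo->dev, bo->sgl, bo->nents,
				    DMA_BIDIRECTIONAL);
		bo->device_owned = false;
	}
}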
On Wed, Feb 25, 2015 at 8:42 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> When userspace wants to access the buffer later, it needs to ask DRM
> whether the buffer is safe to access - this causes DRM to check whether
> the buffer is still being used for a render operation, and can then
> flip the buffer back to CPU ownership.
>
> The idea that a buffer needs to be constantly mapped and unmapped in
> userspace would create its own problems: there is a cost to setting up
> and tearing down the mappings.

OK, so let's say the userspace mapping is created just once, and there
is a DRM-based access control mechanism as you describe, mandating when
it can/can't be accessed.

To ask the question a different way then: how is the one-time userspace
mapping created? As far as I can see, if the dma_mmap_* API is used to
do that, the mapping will always be uncached or writecombine. There's
no way that it could currently be mapped with CPU caches enabled,
unless dma_mmap_* is avoided.

Daniel
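One way a driver can create such a one-time cached mapping itself is to bypass dma_mmap_*() and install the pages with remap_pfn_range(), leaving vm_page_prot untouched. A hedged sketch, with the pfn assumed to come from the driver's own allocation and example_mmap_pfn_cached() being a hypothetical handler:

#include <linux/mm.h>

/* Hypothetical mmap handler for an already-allocated, physically
 * contiguous buffer.  No pgprot_writecombine()/pgprot_noncached()
 * conversion is applied, so the mapping keeps the default cacheable
 * attributes.
 */
static int example_mmap_pfn_cached(struct vm_area_struct *vma,
				   unsigned long pfn)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}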
On Wed, Feb 25, 2015 at 6:42 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
>> Fair enough, what you're describing does sound like a better model.
>> Thanks for explaining.
>>
>> I'm still a little unclear on how DRM solves this particular problem
>> though. At the point when the buffer is CPU-owned, it can be mapped
>> into userspace with CPU caches enabled, right?
>
> Whether a buffer is mapped or not is an entirely separate issue.
> We have many cases where the kernel has the buffer mapped into its
> lowmem region while the device is doing DMA. Having a buffer mapped
> into userspace is no different.
>
> What DRM can do is track the state of the buffer: the DRM model is that
> you talk to the GPU through DRM, which means that you submit a command
> stream, along with identifiers for the buffers you want the command
> stream to operate on.
>
> DRM can then scan the state of those buffers, and perform the appropriate
> DMA API operation on the buffers to flip them over to device ownership.
>
> When userspace wants to access the buffer later, it needs to ask DRM
> whether the buffer is safe to access - this causes DRM to check whether
> the buffer is still being used for a render operation, and can then
> flip the buffer back to CPU ownership.
>
> The idea that a buffer needs to be constantly mapped and unmapped in
> userspace would create its own problems: there is a cost to setting up
> and tearing down the mappings.
>
> As with anything performance related, the less work you can do, the faster
> you will appear to be: that applies very much here. If you can avoid
> having to set up and tear down mappings, if you can avoid having to do
> cache maintenance all the time, you will gain extra performance quite
> simply because you're not wasting CPU cycles doing stuff which is not
> absolutely necessary.
>
> I would put some of this into practice with etnaviv-drm, but I've decided
> to walk away from that project and just look after the work which I once
> did on it as a fork.

We are using DRM. The DRM CMA helpers use the DMA APIs to allocate
memory from the CMA region, and we wanted to speed it up by using
cached buffers:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_gem_cma_helper.c#n85

We tried dma_alloc_attrs(), but found that setting
DMA_ATTR_NON_CONSISTENT didn't work correctly. Hence, this patch.

Should the DRM CMA helpers not be using the DMA APIs to allocate
memory from the CMA region?

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
On Wed, Feb 25, 2015 at 08:31:30AM -0800, Jasper St. Pierre wrote:
> On Wed, Feb 25, 2015 at 6:42 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
> >> Fair enough, what you're describing does sound like a better model.
> >> Thanks for explaining.
> >>
> >> I'm still a little unclear on how DRM solves this particular problem
> >> though. At the point when the buffer is CPU-owned, it can be mapped
> >> into userspace with CPU caches enabled, right?
> >
> > Whether a buffer is mapped or not is an entirely separate issue.
> > We have many cases where the kernel has the buffer mapped into its
> > lowmem region while the device is doing DMA. Having a buffer mapped
> > into userspace is no different.
> >
> > What DRM can do is track the state of the buffer: the DRM model is that
> > you talk to the GPU through DRM, which means that you submit a command
> > stream, along with identifiers for the buffers you want the command
> > stream to operate on.
> >
> > DRM can then scan the state of those buffers, and perform the appropriate
> > DMA API operation on the buffers to flip them over to device ownership.
> >
> > When userspace wants to access the buffer later, it needs to ask DRM
> > whether the buffer is safe to access - this causes DRM to check whether
> > the buffer is still being used for a render operation, and can then
> > flip the buffer back to CPU ownership.
> >
> > The idea that a buffer needs to be constantly mapped and unmapped in
> > userspace would create its own problems: there is a cost to setting up
> > and tearing down the mappings.
> >
> > As with anything performance related, the less work you can do, the faster
> > you will appear to be: that applies very much here. If you can avoid
> > having to set up and tear down mappings, if you can avoid having to do
> > cache maintenance all the time, you will gain extra performance quite
> > simply because you're not wasting CPU cycles doing stuff which is not
> > absolutely necessary.
> >
> > I would put some of this into practice with etnaviv-drm, but I've decided
> > to walk away from that project and just look after the work which I once
> > did on it as a fork.
>
> We are using DRM. The DRM CMA helpers use the DMA APIs to allocate
> memory from the CMA region, and we wanted to speed it up by using
> cached buffers:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_gem_cma_helper.c#n85
>
> We tried dma_alloc_attrs(), but found that setting
> DMA_ATTR_NON_CONSISTENT didn't work correctly. Hence, this patch.
>
> Should the DRM CMA helpers not be using the DMA APIs to allocate
> memory from the CMA region?

It seems to be a reasonable thing to do.

However, what I would raise is whether you /really/ want to be using
CMA for this.

CMA gets you contiguous memory. Great, but that means you must be able
to defragment the CMA memory region enough to get your large buffer.
Like any memory allocator, it will suffer from fragmentation, and
eventually it won't be able to allocate large buffers. That will then
cause you to have to fall back to CPU rendering instead of GPU
rendering.

There's another problem though - you have to have enough VM space for
all your pixmaps, since you can't swap them out once allocated (they
aren't treated as page cache pages.)

If your GPU has an MMU, you really ought to look at the possibility of
using shmem buffers, which are page-based allocations, using the page
cache. This means they are swappable as well, and don't suffer from the
fragmentation issue.

dma-buf doesn't work particularly well with that though; the assumption
is that once imported, the buffer doesn't change (and hence can't be
swapped out) so the pages end up being pinned. That really needs
fixing... there's a lot which needs fixing in this area because it's
been designed around people's particular use cases instead of a more
high-level approach.

CMA is useful for cases where you need a contiguous buffer, but where
you don't have a requirement for it, it's best to avoid it.
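A loose sketch of the shmem-backed, page-based allocation being suggested, under the assumption that the GPU has an MMU. The example_alloc_shmem_pages() helper is hypothetical (real drivers typically go through the drm_gem helpers); error handling, page release and keeping the shmem file around for later fput() are omitted.

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>
#include <linux/slab.h>

static struct page **example_alloc_shmem_pages(size_t size)
{
	struct address_space *mapping;
	struct page **pages;
	struct file *filp;
	int i, npages = DIV_ROUND_UP(size, PAGE_SIZE);

	/* Backing storage comes from shmem, i.e. the page cache, so the
	 * pages stay swappable until something pins them (for example a
	 * dma-buf import).
	 */
	filp = shmem_file_setup("example-gem", size, VM_NORESERVE);
	if (IS_ERR(filp))
		return ERR_CAST(filp);

	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	mapping = file_inode(filp)->i_mapping;
	for (i = 0; i < npages; i++)
		pages[i] = shmem_read_mapping_page(mapping, i);

	return pages;
}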
On Wed, Feb 25, 2015 at 9:25 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> It seems to be a reasonable thing to do.
>
> However, what I would raise is whether you /really/ want to be using
> CMA for this.

Our scanout hardware / display controller does not have an iommu, and
requires contiguous buffers.

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 170a116..cfb7d4d 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -562,10 +562,12 @@ static void __free_from_contiguous(struct device *dev, struct page *page,
 
 static inline pgprot_t __get_dma_pgprot(struct dma_attrs *attrs, pgprot_t prot)
 {
-	prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ?
-			pgprot_writecombine(prot) :
-			pgprot_dmacoherent(prot);
-	return prot;
+	if (dma_get_attr(DMA_ATTR_NON_CONSISTENT, attrs))
+		return prot;
+	else if (dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs))
+		return pgprot_writecombine(prot);
+	else
+		return pgprot_dmacoherent(prot);
 }
 
 #define nommu()	0
Currently when arm_dma_mmap() is called, vm_page_prot is reset
according to DMA attributes. Currently, writecombine is the
only attribute supported; any other configuration will map uncached
memory.

Add support for the non-consistent attribute so that cachable memory
can be mapped into userspace via the dma_mmap_attrs() API.

Signed-off-by: Daniel Drake <drake@endlessm.com>
---
 arch/arm/mm/dma-mapping.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)