Message ID: 1424810703-20382-1-git-send-email-drake@endlessm.com (mailing list archive)
State: New, archived
On Tuesday 24 February 2015 14:45:03 Daniel Drake wrote:
> Currently when arm_dma_mmap() is called, vm_page_prot is reset
> according to DMA attributes. Currently, writecombine is the
> only attribute supported; any other configuration will map uncached
> memory.
>
> Add support for the non-consistent attribute so that cachable memory
> can be mapped into userspace via the dma_mmap_attrs() API.
>
> Signed-off-by: Daniel Drake <drake@endlessm.com>
>

How does the user process enforce consistency then?

	Arnd
On Tue, Feb 24, 2015 at 02:45:03PM -0600, Daniel Drake wrote:
> Currently when arm_dma_mmap() is called, vm_page_prot is reset
> according to DMA attributes. Currently, writecombine is the
> only attribute supported; any other configuration will map uncached
> memory.
>
> Add support for the non-consistent attribute so that cachable memory
> can be mapped into userspace via the dma_mmap_attrs() API.

What is your use-case for this?
On Wed, Feb 25, 2015 at 4:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 24 February 2015 14:45:03 Daniel Drake wrote:
>> Currently when arm_dma_mmap() is called, vm_page_prot is reset
>> according to DMA attributes. Currently, writecombine is the
>> only attribute supported; any other configuration will map uncached
>> memory.
>>
>> Add support for the non-consistent attribute so that cachable memory
>> can be mapped into userspace via the dma_mmap_attrs() API.
>>
>> Signed-off-by: Daniel Drake <drake@endlessm.com>
>>
>
> How does the user process enforce consistency then?

Hmm, good point.

The use case here is that we're mapping certain memory shared with the
Mali GPU into userspace, where we expect the CPU to be doing rendering
to it, and we've measured performance improvements from having the CPU
cache enabled.

But Mali also comes alongside UMP (comparable to dma-buf) which provides
an ioctl-based API for userspace to invalidate/flush caches, which we
use heavily in this pipeline.

UMP's not going upstream, but I was wondering if this patch had any
value in the more general case. Thinking more, perhaps not, at least
until dma-buf grows some form of userspace cache control API.

Daniel
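For reference, a minimal sketch of how a driver might exercise the attribute being proposed here, using the struct dma_attrs based API of this era. The device pointer, the buffer size and the example_cached_alloc_mmap() handler are hypothetical, and error/cleanup paths are omitted.

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hypothetical mmap path: allocate a non-consistent (cacheable) buffer
 * and map it into userspace.  With the patch above, dma_mmap_attrs()
 * would leave vm_page_prot cacheable instead of forcing it to
 * uncached/writecombine.  Freeing the buffer is omitted.
 */
static int example_cached_alloc_mmap(struct device *dev,
				     struct vm_area_struct *vma, size_t size)
{
	DEFINE_DMA_ATTRS(attrs);
	dma_addr_t dma_handle;
	void *cpu_addr;

	dma_set_attr(DMA_ATTR_NON_CONSISTENT, &attrs);

	cpu_addr = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, &attrs);
	if (!cpu_addr)
		return -ENOMEM;

	return dma_mmap_attrs(dev, vma, cpu_addr, dma_handle, size, &attrs);
}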
On Wed, Feb 25, 2015 at 07:58:00AM -0600, Daniel Drake wrote:
> UMP's not going upstream, but I was wondering if this patch had any
> value in the more general case. Thinking more, perhaps not, at least
> until dma-buf grows some form of userspace cache control API.

Which I would oppose.

The DMA API itself provides a very simple model:

sg = cpu owned buffer(s)
/* cpu owns sg buffers and can read/write it */
dma_addr = dma_map_sg(dev, sg, ...);
/* dev owns sg buffers, it is invalid for the CPU to write to the buffer
 * CPU writes to the buffer may very well be lost */
dma_sync_sg_for_cpu(dev, sg, ...);
/* cpu owns sg buffers, CPU can read/write buffer */
dma_sync_sg_for_device(dev, sg, ...);
/* dev owns sg buffers, it is invalid for the CPU to write to the buffer
 * CPU writes to the buffer may very well be lost */
dma_unmap_sg(dev, sg, ...);
/* cpu owns sg buffers again */

If we're wanting to do something like this in userspace, there's too much
chance userspace will get this ownership business wrong (we know this,
userspace programmers basically suck - they abuse anything we give them.)

It's far better to have a proper infrastructure present for things like
GPUs. The good news is that we have such an infrastructure. It's called
DRM, and with DRM, we track things like buffers associated with the GPU,
and we know which buffers the GPU should be accessing, and when the GPU
has finished with them.

There are even implementations on x86 (such as the Intel drivers) which
track which domain each buffer is in, whether it's in the GPU's domain or
the CPU's domain, and the buffer has to be in the correct domain for it
to be operated upon.

So, we already have everything necessary. Except for MALI, we don't,
because of course we're better off having a closed source, inherently
insecure GPU driver which needs loads of crap exported to userspace to
work efficiently.

No, sorry, we're not going to make this easy by providing interfaces for
userspace to crap on: GPU drivers need to be implemented properly.
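Spelled out as a compilable fragment, the ownership cycle described above looks roughly like this; the device, scatterlist and DMA direction are placeholders, and example_ownership_cycle() is not a real kernel function.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void example_ownership_cycle(struct device *dev,
				    struct scatterlist *sg, int nents)
{
	/* CPU owns the buffers: it may read and write them freely. */

	if (!dma_map_sg(dev, sg, nents, DMA_BIDIRECTIONAL))
		return;
	/* Device owns the buffers: CPU writes may be lost. */

	dma_sync_sg_for_cpu(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* CPU owns the buffers again and may read/write them. */

	dma_sync_sg_for_device(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* Device owns the buffers once more. */

	dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
	/* CPU owns the buffers again. */
}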
On Wed, Feb 25, 2015 at 8:18 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> sg = cpu owned buffer(s)
> /* cpu owns sg buffers and can read/write it */
> dma_addr = dma_map_sg(dev, sg, ...);
> /* dev owns sg buffers, it is invalid for the CPU to write to the buffer
>  * CPU writes to the buffer may very well be lost */
> dma_sync_sg_for_cpu(dev, sg, ...);
> /* cpu owns sg buffers, CPU can read/write buffer */
> dma_sync_sg_for_device(dev, sg, ...);
> /* dev owns sg buffers, it is invalid for the CPU to write to the buffer
>  * CPU writes to the buffer may very well be lost */
> dma_unmap_sg(dev, sg, ...);
> /* cpu owns sg buffers again */
>
> If we're wanting to do something like this in userspace, there's too much
> chance userspace will get this ownership business wrong (we know this,
> userspace programmers basically suck - they abuse anything we give them.)
>
> It's far better to have a proper infrastructure present for things like
> GPUs. The good news is that we have such an infrastructure. It's called
> DRM, and with DRM, we track things like buffers associated with the GPU,
> and we know which buffers the GPU should be accessing, and when the GPU
> has finished with them.

Fair enough, what you're describing does sound like a better model.
Thanks for explaining.

I'm still a little unclear on how DRM solves this particular problem
though. At the point when the buffer is CPU-owned, it can be mapped
into userspace with CPU caches enabled, right? But how can DRM do that
if the dma_mmap_* API is always going to map it as writecombine or
uncached?

Or should DRM drivers avoid that API and use remap_pfn_range() directly,
retaining full control of pgprot flags?

Thanks
Daniel
On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
> Fair enough, what you're describing does sound like a better model.
> Thanks for explaining.
>
> I'm still a little unclear on how DRM solves this particular problem
> though. At the point when the buffer is CPU-owned, it can be mapped
> into userspace with CPU caches enabled, right?

Whether a buffer is mapped or not is an entirely separate issue.
We have many cases where the kernel has the buffer mapped into its
lowmem region while the device is doing DMA. Having a buffer mapped
into userspace is no different.

What DRM can do is track the state of the buffer: the DRM model is that
you talk to the GPU through DRM, which means that you submit a command
stream, along with identifiers for the buffers you want the command
stream to operate on.

DRM can then scan the state of those buffers, and perform the appropriate
DMA API operation on the buffers to flip them over to device ownership.

When userspace wants to access the buffer later, it needs to ask DRM
whether the buffer is safe to access - this causes DRM to check whether
the buffer is still being used for a render operation, and can then
flip the buffer back to CPU ownership.

The idea that a buffer needs to be constantly mapped and unmapped in
userspace would create its own problems: there is a cost to setting up
and tearing down the mappings.

As with anything performance related, the less work you can do, the faster
you will appear to be: that applies very much here. If you can avoid
having to set up and tear down mappings, if you can avoid having to do
cache maintenance all the time, you will gain extra performance quite
simply because you're not wasting CPU cycles doing stuff which is not
absolutely necessary.

I would put some of this into practice with etnaviv-drm, but I've decided
to walk away from that project and just look after the work which I once
did on it as a fork.
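A rough sketch of the kind of ownership tracking described above; struct example_bo and the two helpers are hypothetical and not an existing DRM interface, and fencing/locking around GPU completion is omitted.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical per-buffer state: which side currently owns the pages. */
struct example_bo {
	struct device *dev;
	struct scatterlist *sgl;
	int nents;
	bool device_owned;
};

/* On command-stream submission: flip the buffer to device ownership. */
static void example_bo_to_device(struct example_bo *bo)
{
	if (!bo->device_owned) {
		dma_sync_sg_for_device(bo->dev, bo->sgl, bo->nents,
				       DMA_BIDIRECTIONAL);
		bo->device_owned = true;
	}
}

/* When userspace asks for access and rendering has finished: flip the
 * buffer back to CPU ownership so cached CPU accesses are safe.
 */
static void example_bo_to_cpu(struct example_bo *bo)
{
	if (bo->device_owned) {
		dma_sync_sg_for_cpu(bo->dev, bo->sgl, bo->nents,
				    DMA_BIDIRECTIONAL);
		bo->device_owned = false;
	}
}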
On Wed, Feb 25, 2015 at 8:42 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> When userspace wants to access the buffer later, it needs to ask DRM
> whether the buffer is safe to access - this causes DRM to check whether
> the buffer is still being used for a render operation, and can then
> flip the buffer back to CPU ownership.
>
> The idea that a buffer needs to be constantly mapped and unmapped in
> userspace would create its own problems: there is a cost to setting up
> and tearing down the mappings.

OK, so let's say the userspace mapping is created just once, and there
is a DRM-based access control mechanism as you describe, mandating when
it can/can't be accessed.

To ask the question a different way then: how is the one-time userspace
mapping created? As far as I can see, if the dma_mmap_* API is used to
do that, the mapping will always be uncached or writecombine. There's
no way that it could currently be mapped with CPU caches enabled,
unless dma_mmap_* is avoided.

Daniel
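One way a driver can create such a one-time cached mapping itself is to bypass dma_mmap_*() and install the pages with remap_pfn_range(), leaving vm_page_prot untouched. A hedged sketch, with the pfn assumed to come from the driver's own allocation and example_mmap_pfn_cached() being a hypothetical handler:

#include <linux/mm.h>

/* Hypothetical mmap handler for an already-allocated, physically
 * contiguous buffer.  No pgprot_writecombine()/pgprot_noncached()
 * conversion is applied, so the mapping keeps the default cacheable
 * attributes.
 */
static int example_mmap_pfn_cached(struct vm_area_struct *vma,
				   unsigned long pfn)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}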
On Wed, Feb 25, 2015 at 6:42 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
>> Fair enough, what you're describing does sound like a better model.
>> Thanks for explaining.
>>
>> I'm still a little unclear on how DRM solves this particular problem
>> though. At the point when the buffer is CPU-owned, it can be mapped
>> into userspace with CPU caches enabled, right?
>
> Whether a buffer is mapped or not is an entirely separate issue.
> We have many cases where the kernel has the buffer mapped into its
> lowmem region while the device is doing DMA. Having a buffer mapped
> into userspace is no different.
>
> What DRM can do is track the state of the buffer: the DRM model is that
> you talk to the GPU through DRM, which means that you submit a command
> stream, along with identifiers for the buffers you want the command
> stream to operate on.
>
> DRM can then scan the state of those buffers, and perform the appropriate
> DMA API operation on the buffers to flip them over to device ownership.
>
> When userspace wants to access the buffer later, it needs to ask DRM
> whether the buffer is safe to access - this causes DRM to check whether
> the buffer is still being used for a render operation, and can then
> flip the buffer back to CPU ownership.
>
> The idea that a buffer needs to be constantly mapped and unmapped in
> userspace would create its own problems: there is a cost to setting up
> and tearing down the mappings.
>
> As with anything performance related, the less work you can do, the faster
> you will appear to be: that applies very much here. If you can avoid
> having to set up and tear down mappings, if you can avoid having to do
> cache maintenance all the time, you will gain extra performance quite
> simply because you're not wasting CPU cycles doing stuff which is not
> absolutely necessary.
>
> I would put some of this into practice with etnaviv-drm, but I've decided
> to walk away from that project and just look after the work which I once
> did on it as a fork.

We are using DRM. The DRM CMA helpers use the DMA APIs to allocate
memory from the CMA region, and we wanted to speed it up by using
cached buffers:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_gem_cma_helper.c#n85

We tried dma_alloc_attrs(), but found that setting
DMA_ATTR_NON_CONSISTENT didn't work correctly. Hence, this patch.

Should the DRM CMA helpers not be using the DMA APIs to allocate
memory from the CMA region?

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
On Wed, Feb 25, 2015 at 08:31:30AM -0800, Jasper St. Pierre wrote:
> On Wed, Feb 25, 2015 at 6:42 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Wed, Feb 25, 2015 at 08:30:38AM -0600, Daniel Drake wrote:
> >> Fair enough, what you're describing does sound like a better model.
> >> Thanks for explaining.
> >>
> >> I'm still a little unclear on how DRM solves this particular problem
> >> though. At the point when the buffer is CPU-owned, it can be mapped
> >> into userspace with CPU caches enabled, right?
> >
> > Whether a buffer is mapped or not is an entirely separate issue.
> > We have many cases where the kernel has the buffer mapped into its
> > lowmem region while the device is doing DMA. Having a buffer mapped
> > into userspace is no different.
> >
> > What DRM can do is track the state of the buffer: the DRM model is that
> > you talk to the GPU through DRM, which means that you submit a command
> > stream, along with identifiers for the buffers you want the command
> > stream to operate on.
> >
> > DRM can then scan the state of those buffers, and perform the appropriate
> > DMA API operation on the buffers to flip them over to device ownership.
> >
> > When userspace wants to access the buffer later, it needs to ask DRM
> > whether the buffer is safe to access - this causes DRM to check whether
> > the buffer is still being used for a render operation, and can then
> > flip the buffer back to CPU ownership.
> >
> > The idea that a buffer needs to be constantly mapped and unmapped in
> > userspace would create its own problems: there is a cost to setting up
> > and tearing down the mappings.
> >
> > As with anything performance related, the less work you can do, the faster
> > you will appear to be: that applies very much here. If you can avoid
> > having to set up and tear down mappings, if you can avoid having to do
> > cache maintenance all the time, you will gain extra performance quite
> > simply because you're not wasting CPU cycles doing stuff which is not
> > absolutely necessary.
> >
> > I would put some of this into practice with etnaviv-drm, but I've decided
> > to walk away from that project and just look after the work which I once
> > did on it as a fork.
>
> We are using DRM. The DRM CMA helpers use the DMA APIs to allocate
> memory from the CMA region, and we wanted to speed it up by using
> cached buffers:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_gem_cma_helper.c#n85
>
> We tried dma_alloc_attrs(), but found that setting
> DMA_ATTR_NON_CONSISTENT didn't work correctly. Hence, this patch.
>
> Should the DRM CMA helpers not be using the DMA APIs to allocate
> memory from the CMA region?

It seems to be a reasonable thing to do.

However, what I would raise is whether you /really/ want to be using
CMA for this.

CMA gets you contiguous memory. Great, but that means you must be able
to defragment the CMA memory region enough to get your large buffer.
Like any memory allocator, it will suffer from fragmentation, and
eventually it won't be able to allocate large buffers. That will then
cause you to have to fall back to CPU rendering instead of GPU
rendering.

There's another problem though - you have to have enough VM space for
all your pixmaps, since you can't swap them out once allocated (they
aren't treated as page cache pages.)

If your GPU has an MMU, you really ought to look at the possibility of
using shmem buffers, which are page-based allocations, using the page
cache. This means they are swappable as well, and don't suffer from the
fragmentation issue.

dma-buf doesn't work particularly well with that though; the assumption
is that once imported, the buffer doesn't change (and hence can't be
swapped out) so the pages end up being pinned. That really needs
fixing... there's a lot which needs fixing in this area because it's
been designed around people's particular use cases instead of a more
high-level approach.

CMA is useful for cases where you need a contiguous buffer, but where
you don't have a requirement for it, it's best to avoid it.
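A loose sketch of the shmem-backed, page-based allocation being suggested, under the assumption that the GPU has an MMU. The example_alloc_shmem_pages() helper is hypothetical (real drivers typically go through the drm_gem helpers); error handling, page release and keeping the shmem file around for later fput() are omitted.

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>
#include <linux/slab.h>

static struct page **example_alloc_shmem_pages(size_t size)
{
	struct address_space *mapping;
	struct page **pages;
	struct file *filp;
	int i, npages = DIV_ROUND_UP(size, PAGE_SIZE);

	/* Backing storage comes from shmem, i.e. the page cache, so the
	 * pages stay swappable until something pins them (for example a
	 * dma-buf import).
	 */
	filp = shmem_file_setup("example-gem", size, VM_NORESERVE);
	if (IS_ERR(filp))
		return ERR_CAST(filp);

	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	mapping = file_inode(filp)->i_mapping;
	for (i = 0; i < npages; i++)
		pages[i] = shmem_read_mapping_page(mapping, i);

	return pages;
}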
On Wed, Feb 25, 2015 at 9:25 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> It seems to be a reasonable thing to do.
>
> However, what I would raise is whether you /really/ want to be using
> CMA for this.

Our scanout hardware / display controller does not have an iommu, and
requires contiguous buffers.

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 170a116..cfb7d4d 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -562,10 +562,12 @@ static void __free_from_contiguous(struct device *dev, struct page *page,
 
 static inline pgprot_t __get_dma_pgprot(struct dma_attrs *attrs, pgprot_t prot)
 {
-	prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ?
-			pgprot_writecombine(prot) :
-			pgprot_dmacoherent(prot);
-	return prot;
+	if (dma_get_attr(DMA_ATTR_NON_CONSISTENT, attrs))
+		return prot;
+	else if (dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs))
+		return pgprot_writecombine(prot);
+	else
+		return pgprot_dmacoherent(prot);
 }
 
 #define nommu()	0
Currently when arm_dma_mmap() is called, vm_page_prot is reset
according to DMA attributes. Currently, writecombine is the
only attribute supported; any other configuration will map uncached
memory.

Add support for the non-consistent attribute so that cachable memory
can be mapped into userspace via the dma_mmap_attrs() API.

Signed-off-by: Daniel Drake <drake@endlessm.com>
---
 arch/arm/mm/dma-mapping.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)