diff mbox

ARM: dma-mapping: Just allocate one chunk at a time

Message ID 1450384253-1067-1-git-send-email-dianders@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Doug Anderson Dec. 17, 2015, 8:30 p.m. UTC
The __iommu_alloc_buffer() is expected to be called to allocate pretty
sizeable buffers.  Upon simple tests of video I saw it trying to
allocate 4,194,304 bytes.  The function tries to be efficient about this
by starting out allocating large chunks and then moving to smaller and
smaller chunk sizes until it succeeds.

The current function is very, very slow.

One problem is the way it keeps trying and trying to allocate big
chunks.  Imagine a very fragmented memory that has 4M free but no
contiguous pages at all.  Further imagine allocating 4M (1024 pages).
We'll do the following memory allocations:
- For page 1:
  - Try to allocate order 10 (no retry)
  - Try to allocate order 9 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- For page 2:
  - Try to allocate order 9 (no retry)
  - Try to allocate order 8 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- ...
- ...

Total number of calls to alloc() calls for this case is:
  sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
  => 9228

The above is obviously worse case, but given how slow alloc can be we
really want to try to avoid even somewhat bad cases.  I timed the old
code with a device under memory pressure and it wasn't hard to see it
take more than 24 seconds to allocate 4 megs of memory (!!).

A second problem (and maybe even more important) is that allocating big
chunks when we don't need them is just not a good idea anyway.  The
first thing we do with these big chunks is break them into smaller
chunks!  If we allocate small chunks:
- The memory manager doesn't need to work so hard to give us big chunks.
- We can save the big chunks for those that really need them and this
  code can make great use of all the small chunks sitting around.

Let's simplify by just allocating one page at a time.  We may make more
total allocate calls but it works way better.  In real world tests that
used to sometimes see a 24 second allocation call I can now see at most
250 ms.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
 arch/arm/mm/dma-mapping.c | 38 ++++++--------------------------------
 1 file changed, 6 insertions(+), 32 deletions(-)

Comments

Doug Anderson Dec. 17, 2015, 10:31 p.m. UTC | #1
Hi,

On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
<dianders@chromium.org> wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to be efficient about this
> by starting out allocating large chunks and then moving to smaller and
> smaller chunk sizes until it succeeds.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>   - Try to allocate order 10 (no retry)
>   - Try to allocate order 9 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>   - Try to allocate order 9 (no retry)
>   - Try to allocate order 8 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>   => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 24 seconds to allocate 4 megs of memory (!!).
>
> A second problem (and maybe even more important) is that allocating big
> chunks when we don't need them is just not a good idea anyway.  The
> first thing we do with these big chunks is break them into smaller
> chunks!  If we allocate small chunks:
> - The memory manager doesn't need to work so hard to give us big chunks.
> - We can save the big chunks for those that really need them and this
>   code can make great use of all the small chunks sitting around.
>
> Let's simplify by just allocating one page at a time.  We may make more
> total allocate calls but it works way better.  In real world tests that
> used to sometimes see a 24 second allocation call I can now see at most
> 250 ms.

Off-list I talked to Dmitry about this a little bit and he pointed out
that contiguous chunks actually give a benefit to the IOMMU.  I don't
think the benefit outweighs the cost in this case, but I'm happy to
hear what others have to say.  I did some quick printouts and it turns
out that even when requesting page at a time the memory manager
(unsurprisingly) can in many cases still give us pages that are
contiguous.

Also I'm happy to post up
<https://chromium-review.googlesource.com/#/c/319210/> which sorts the
array and could possibly give us larger chunks of contiguous memory.


-Doug
Tomasz Figa Dec. 18, 2015, 6:05 a.m. UTC | #2
On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
> <dianders@chromium.org> wrote:
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>> by starting out allocating large chunks and then moving to smaller and
>> smaller chunk sizes until it succeeds.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>   - Try to allocate order 10 (no retry)
>>   - Try to allocate order 9 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>   - Try to allocate order 9 (no retry)
>>   - Try to allocate order 8 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>   => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>
>> A second problem (and maybe even more important) is that allocating big
>> chunks when we don't need them is just not a good idea anyway.  The
>> first thing we do with these big chunks is break them into smaller
>> chunks!  If we allocate small chunks:
>> - The memory manager doesn't need to work so hard to give us big chunks.
>> - We can save the big chunks for those that really need them and this
>>   code can make great use of all the small chunks sitting around.
>>
>> Let's simplify by just allocating one page at a time.  We may make more
>> total allocate calls but it works way better.  In real world tests that
>> used to sometimes see a 24 second allocation call I can now see at most
>> 250 ms.
>
> Off-list I talked to Dmitry about this a little bit and he pointed out
> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> think the benefit outweighs the cost in this case, but I'm happy to
> hear what others have to say.

Yeah, I'd like to see some discussion about the effect of allocating
bigger chunks on IOMMU performance. Dmitry (on CC), could you
elaborate a bit on what Doug mentioned?

As for my own understanding, some IOMMUs can map memory using big
pages, which should improve TLB efficiency and so look-up speed.
However AFAICT current implementation of allocating function doesn't
allocate the chunks properly, because there is no guarantee that
particular chunks are aligned on big page boundary. For example, it
might happen that we allocate first chunk of order 0, then second
chunk of order 4 (64KiB - typical big page), then we won't be able to
map the second chunk using a big page, because the IOVA at that point
will not be aligned properly.

Is there any other case when bigger physically contiguous chunks can
help the IOMMU?

Best regards,
Tomasz
Laurent Pinchart Dec. 21, 2015, 1:12 a.m. UTC | #3
Hi Tomasz,

On Friday 18 December 2015 15:05:45 Tomasz Figa wrote:
> On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson wrote:
> > On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> >> sizeable buffers.  Upon simple tests of video I saw it trying to
> >> allocate 4,194,304 bytes.  The function tries to be efficient about this
> >> by starting out allocating large chunks and then moving to smaller and
> >> smaller chunk sizes until it succeeds.
> >> 
> >> The current function is very, very slow.
> >> 
> >> One problem is the way it keeps trying and trying to allocate big
> >> chunks.  Imagine a very fragmented memory that has 4M free but no
> >> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> >> We'll do the following memory allocations:
> >> 
> >> - For page 1:
> >>   - Try to allocate order 10 (no retry)
> >>   - Try to allocate order 9 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - For page 2:
> >>   - Try to allocate order 9 (no retry)
> >>   - Try to allocate order 8 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - ...
> >> - ...
> >> 
> >> Total number of calls to alloc() calls for this case is:
> >>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
> >>   => 9228
> >> 
> >> The above is obviously worse case, but given how slow alloc can be we
> >> really want to try to avoid even somewhat bad cases.  I timed the old
> >> code with a device under memory pressure and it wasn't hard to see it
> >> take more than 24 seconds to allocate 4 megs of memory (!!).
> >> 
> >> A second problem (and maybe even more important) is that allocating big
> >> chunks when we don't need them is just not a good idea anyway.  The
> >> first thing we do with these big chunks is break them into smaller
> >> chunks!  If we allocate small chunks:
> >> - The memory manager doesn't need to work so hard to give us big chunks.
> >> - We can save the big chunks for those that really need them and this
> >> 
> >>   code can make great use of all the small chunks sitting around.
> >> 
> >> Let's simplify by just allocating one page at a time.  We may make more
> >> total allocate calls but it works way better.  In real world tests that
> >> used to sometimes see a 24 second allocation call I can now see at most
> >> 250 ms.
> > 
> > Off-list I talked to Dmitry about this a little bit and he pointed out
> > that contiguous chunks actually give a benefit to the IOMMU.  I don't
> > think the benefit outweighs the cost in this case, but I'm happy to
> > hear what others have to say.
> 
> Yeah, I'd like to see some discussion about the effect of allocating
> bigger chunks on IOMMU performance. Dmitry (on CC), could you
> elaborate a bit on what Doug mentioned?
> 
> As for my own understanding, some IOMMUs can map memory using big
> pages, which should improve TLB efficiency and so look-up speed.
> However AFAICT current implementation of allocating function doesn't
> allocate the chunks properly, because there is no guarantee that
> particular chunks are aligned on big page boundary. For example, it
> might happen that we allocate first chunk of order 0, then second
> chunk of order 4 (64KiB - typical big page), then we won't be able to
> map the second chunk using a big page, because the IOVA at that point
> will not be aligned properly.

That might be true of the current implementations, but there's nothing that 
would stop an IOMMU driver to map the start of the buffer at an IOVA address 
aligned to 64kB minus 4kB in the example you mentioned. This would move to a 
trade-off between allocation complexity and runtime performances.

> Is there any other case when bigger physically contiguous chunks can
> help the IOMMU?
diff mbox

Patch

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 492bf3efffab..7efeb2d4801b 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1160,39 +1160,13 @@  static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
 
 	while (count) {
-		int j, order;
-
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
-		}
-
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
-			pages[i] = alloc_pages(gfp, 0);
-			if (!pages[i])
-				goto error;
-		}
-
-		if (order) {
-			split_page(pages[i], order);
-			j = 1 << order;
-			while (--j)
-				pages[i + j] = pages[i] + j;
-		}
+		pages[i] = alloc_pages(gfp, 0);
+		if (!pages[i])
+			goto error;
 
-		__dma_clear_buffer(pages[i], PAGE_SIZE << order);
-		i += 1 << order;
-		count -= 1 << order;
+		__dma_clear_buffer(pages[i], PAGE_SIZE);
+		i += 1;
+		count -= 1;
 	}
 
 	return pages;