Message ID | 1452294332-23415-2-git-send-email-dianders@chromium.org (mailing list archive)
---|---
State | New, archived
Hi Doug,

On 08/01/16 23:05, Douglas Anderson wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to allocate large chunks
> in order to optimize IOMMU TLB usage.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>   - Try to allocate order 10 (no retry)
>   - Try to allocate order 9 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>   - Try to allocate order 9 (no retry)
>   - Try to allocate order 8 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() for this case is:
>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>   => 9228
>
> The above is obviously the worst case, but given how slow alloc can
> be, we really want to try to avoid even somewhat bad cases.  I timed
> the old code with a device under memory pressure and it wasn't hard to
> see it take more than 120 seconds to allocate 4 megs of memory!
> (NOTE: testing was done on kernel 3.14, so possibly mainline would
> behave differently).
>
> A second problem is that allocating big chunks under memory pressure
> is just not a great idea unless we really need them.  We can make do
> pretty well with smaller chunks, so it's probably wise to leave the
> bigger chunks for other users once memory pressure is on.
>
> Let's adjust the allocation like this:
>
> 1. If a big chunk fails, stop trying so hard and bump down to lower
>    order allocations.
> 2. Don't try useless orders.  The whole point of big chunks is to
>    optimize the TLB, and it can really only make use of 2M, 1M, 64K
>    and 4K sizes.
>
> We'll still tend to eat up a bunch of big chunks, but that might be
> the right answer for some users.  A future patch could possibly add a
> new DMA_ATTR that would let the caller decide that TLB optimization
> isn't important and that we should use smaller chunks.  Presumably
> this would be a sane strategy for some callers.

Now that I've had time to think about it properly:

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

I just had an absolutely disgusting idea of how to get the same
progression with just a single variable and no static array, but I'll
keep that firmly to myself as it's almost IOCCC-grade WTF :D

Thanks,
Robin.

> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
> Changes in v5: None
> Changes in v4:
> - Added Marek's ack
>
> Changes in v3: None
> Changes in v2:
> - No longer just 1 page at a time, but gives up higher order quickly.
> - Only tries important higher order allocations that might help us.
>
>  arch/arm/mm/dma-mapping.c | 34 ++++++++++++++++++++--------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 0eca3812527e..bc9cebfa0891 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
>  	spin_unlock_irqrestore(&mapping->lock, flags);
>  }
>
> +/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
> +static const int iommu_order_array[] = { 9, 8, 4, 0 };
> +
>  static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>  					  gfp_t gfp, struct dma_attrs *attrs)
>  {
> @@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>  	int count = size >> PAGE_SHIFT;
>  	int array_size = count * sizeof(struct page *);
>  	int i = 0;
> +	int order_idx = 0;
>
>  	if (array_size <= PAGE_SIZE)
>  		pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
>  	while (count) {
>  		int j, order;
>
> -		for (order = __fls(count); order > 0; --order) {
> -			/*
> -			 * We do not want OOM killer to be invoked as long
> -			 * as we can fall back to single pages, so we force
> -			 * __GFP_NORETRY for orders higher than zero.
> -			 */
> -			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> -			if (pages[i])
> -				break;
> +		order = iommu_order_array[order_idx];
> +
> +		/* Drop down when we get small */
> +		if (__fls(count) < order) {
> +			order_idx++;
> +			continue;
>  		}
>
> -		if (!pages[i]) {
> -			/*
> -			 * Fall back to single page allocation.
> -			 * Might invoke OOM killer as last resort.
> -			 */
> +		if (order) {
> +			/* See if it's easy to allocate a high-order chunk */
> +			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
> +
> +			/* Go down a notch at first sign of pressure */
> +			if (!pages[i]) {
> +				order_idx++;
> +				continue;
> +			}
> +		} else {
>  			pages[i] = alloc_pages(gfp, 0);
>  			if (!pages[i])
>  				goto error;
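A quick sanity check of the numbers in the commit message. This Python
sketch is illustrative only (the function names are invented): it
counts alloc_pages() calls in the worst case, where every order > 0
allocation fails and only single pages succeed.

def old_strategy_calls(pages=1024):
    # Old loop: for each page, try order __fls(count) down to 1 (all
    # failing under fragmentation), then fall back to one order-0 page.
    calls = 0
    count = pages
    while count:
        calls += (count.bit_length() - 1) + 1  # failed tries + order 0
        count -= 1
    return calls

def new_strategy_calls(pages=1024, orders=(9, 8, 4, 0)):
    # Patched loop: walk down the fixed order array, never retrying an
    # order once it has failed or grown too big for the remaining count.
    calls = 0
    idx = 0
    count = pages
    while count:
        order = orders[idx]
        if count.bit_length() - 1 < order:  # "drop down when we get small"
            idx += 1
            continue
        calls += 1
        if order:        # high-order attempt fails in this worst case
            idx += 1
        else:            # order-0 allocation succeeds
            count -= 1
    return calls

print(old_strategy_calls())  # 9228, matching the commit message
print(new_strategy_calls())  # 1027: 3 failed attempts + 1024 pages

The patched strategy pays for at most three failed high-order attempts
in total, instead of a failing ladder of attempts for every single page.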
On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Doug,
>
> On 08/01/16 23:05, Douglas Anderson wrote:
>>
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to allocate large chunks
>> in order to optimize IOMMU TLB usage.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>   - Try to allocate order 10 (no retry)
>>   - Try to allocate order 9 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>   - Try to allocate order 9 (no retry)
>>   - Try to allocate order 8 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() for this case is:
>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>   => 9228
>>
>> The above is obviously the worst case, but given how slow alloc can
>> be, we really want to try to avoid even somewhat bad cases.  I timed
>> the old code with a device under memory pressure and it wasn't hard to
>> see it take more than 120 seconds to allocate 4 megs of memory!
>> (NOTE: testing was done on kernel 3.14, so possibly mainline would
>> behave differently).
>>
>> A second problem is that allocating big chunks under memory pressure
>> is just not a great idea unless we really need them.  We can make do
>> pretty well with smaller chunks, so it's probably wise to leave the
>> bigger chunks for other users once memory pressure is on.
>>
>> Let's adjust the allocation like this:
>>
>> 1. If a big chunk fails, stop trying so hard and bump down to lower
>>    order allocations.
>> 2. Don't try useless orders.  The whole point of big chunks is to
>>    optimize the TLB, and it can really only make use of 2M, 1M, 64K
>>    and 4K sizes.
>>
>> We'll still tend to eat up a bunch of big chunks, but that might be
>> the right answer for some users.  A future patch could possibly add a
>> new DMA_ATTR that would let the caller decide that TLB optimization
>> isn't important and that we should use smaller chunks.  Presumably
>> this would be a sane strategy for some callers.
>
> Now that I've had time to think about it properly:
>
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>
> I just had an absolutely disgusting idea of how to get the same
> progression with just a single variable and no static array, but I'll
> keep that firmly to myself as it's almost IOCCC-grade WTF :D

Just out of curiosity: a bitmap and loop with fls(), clearing the bit
on failure, or something more freaky? :)

Anyway:

Reviewed-by: Tomasz Figa <tfiga@chromium.org>

Best regards,
Tomasz
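For what it's worth, one possible reading of Tomasz's guess, sketched
in Python purely as speculation (Robin's actual trick turns out to be
different, as the next reply shows): encode the allowed orders as set
bits in a mask, pick the current order with an fls()-style bit scan,
and clear that bit whenever an allocation at that order fails.

mask = (1 << 9) | (1 << 8) | (1 << 4) | (1 << 0)  # orders {9, 8, 4, 0}

while mask:
    order = mask.bit_length() - 1  # fls(mask): highest remaining order
    print(order)                   # prints 9, 8, 4, 0
    mask &= ~(1 << order)          # "clearing the bit on failure"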
On 13/01/16 17:33, Tomasz Figa wrote:
> On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Doug,
>>
>> On 08/01/16 23:05, Douglas Anderson wrote:
>>>
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to allocate large chunks
>>> in order to optimize IOMMU TLB usage.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>   - Try to allocate order 10 (no retry)
>>>   - Try to allocate order 9 (no retry)
>>>   - ...
>>>   - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>   - Try to allocate order 9 (no retry)
>>>   - Try to allocate order 8 (no retry)
>>>   - ...
>>>   - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() for this case is:
>>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>   => 9228
>>>
>>> The above is obviously the worst case, but given how slow alloc can
>>> be, we really want to try to avoid even somewhat bad cases.  I timed
>>> the old code with a device under memory pressure and it wasn't hard to
>>> see it take more than 120 seconds to allocate 4 megs of memory!
>>> (NOTE: testing was done on kernel 3.14, so possibly mainline would
>>> behave differently).
>>>
>>> A second problem is that allocating big chunks under memory pressure
>>> is just not a great idea unless we really need them.  We can make do
>>> pretty well with smaller chunks, so it's probably wise to leave the
>>> bigger chunks for other users once memory pressure is on.
>>>
>>> Let's adjust the allocation like this:
>>>
>>> 1. If a big chunk fails, stop trying so hard and bump down to lower
>>>    order allocations.
>>> 2. Don't try useless orders.  The whole point of big chunks is to
>>>    optimize the TLB, and it can really only make use of 2M, 1M, 64K
>>>    and 4K sizes.
>>>
>>> We'll still tend to eat up a bunch of big chunks, but that might be
>>> the right answer for some users.  A future patch could possibly add a
>>> new DMA_ATTR that would let the caller decide that TLB optimization
>>> isn't important and that we should use smaller chunks.  Presumably
>>> this would be a sane strategy for some callers.
>>
>> Now that I've had time to think about it properly:
>>
>> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>>
>> I just had an absolutely disgusting idea of how to get the same
>> progression with just a single variable and no static array, but I'll
>> keep that firmly to myself as it's almost IOCCC-grade WTF :D
>
> Just out of curiosity: a bitmap and loop with fls(), clearing the bit
> on failure, or something more freaky? :)

Got a Python interpreter handy?

order = 9
for i in range(4):
    print order
    order = (order - 1) & 0xc

Like I said, disgusting :D

Robin.

> Anyway:
>
> Reviewed-by: Tomasz Figa <tfiga@chromium.org>
>
> Best regards,
> Tomasz
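To spell out why that works: 0xc is 0b1100, so each decrement-and-mask
step clears the trailing bits: (9 - 1) & 0xc = 8, (8 - 1) & 0xc = 4,
and (4 - 1) & 0xc = 0, reproducing iommu_order_array's { 9, 8, 4, 0 }
progression exactly. The same snippet in Python 3, annotated:

order = 9
for _ in range(4):
    print(order)               # 9, 8, 4, 0
    order = (order - 1) & 0xc  # 0b1000 -> 0b0100 -> 0b0000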
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 0eca3812527e..bc9cebfa0891 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1122,6 +1122,9 @@ static inline void __free_iova(struct dma_iommu_mapping *mapping,
 	spin_unlock_irqrestore(&mapping->lock, flags);
 }
 
+/* We'll try 2M, 1M, 64K, and finally 4K; array must end with 0! */
+static const int iommu_order_array[] = { 9, 8, 4, 0 };
+
 static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 					  gfp_t gfp, struct dma_attrs *attrs)
 {
@@ -1129,6 +1132,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	int count = size >> PAGE_SHIFT;
 	int array_size = count * sizeof(struct page *);
 	int i = 0;
+	int order_idx = 0;
 
 	if (array_size <= PAGE_SIZE)
 		pages = kzalloc(array_size, GFP_KERNEL);
@@ -1162,22 +1166,24 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	while (count) {
 		int j, order;
 
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
+		order = iommu_order_array[order_idx];
+
+		/* Drop down when we get small */
+		if (__fls(count) < order) {
+			order_idx++;
+			continue;
 		}
 
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
+		if (order) {
+			/* See if it's easy to allocate a high-order chunk */
+			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
+
+			/* Go down a notch at first sign of pressure */
+			if (!pages[i]) {
+				order_idx++;
+				continue;
+			}
+		} else {
 			pages[i] = alloc_pages(gfp, 0);
 			if (!pages[i])
 				goto error;
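To see what the patched loop requests when allocations succeed, here is
a hypothetical helper (not part of the patch) that lists the chunk
orders chosen for a given page count:

def chunk_orders(count, orders=(9, 8, 4, 0)):
    # Mirrors the patched loop: drop to a smaller order when the
    # remaining count gets too small; never move back up the array.
    idx = 0
    chosen = []
    while count:
        order = orders[idx]
        if count.bit_length() - 1 < order:  # __fls(count) < order
            idx += 1
            continue
        chosen.append(order)
        count -= 1 << order
    return chosen

print(chunk_orders(1024))  # [9, 9]: a 4 MB buffer as two 2 MB chunks
print(chunk_orders(300))   # [8, 4, 4] followed by twelve 0s

Note that the loop never climbs back to a larger order once it has
dropped down, matching the "go down a notch at first sign of pressure"
comment in the code.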