
[2/3] dmapool: improve scalability of dma_pool_free

Message ID: 1288e597-a67a-25b3-b7c6-db883ca67a25@cybernetics.com
State: Changes Requested
Series: mpt3sas and dmapool scalability

Commit Message

Tony Battersby July 26, 2018, 6:54 p.m. UTC
dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages.  Improve its
scalability by replacing the linear scan with a red-black tree lookup. 
In big O notation, this improves the algorithm from O(n^2) to O(n * log n).

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---

I moved some code from dma_pool_destroy() into pool_free_page() to avoid code
repetition.

Comments

Matthew Wilcox (Oracle) July 26, 2018, 7:42 p.m. UTC | #1
On Thu, Jul 26, 2018 at 02:54:56PM -0400, Tony Battersby wrote:
> dma_pool_free() scales poorly when the pool contains many pages because
> pool_find_page() does a linear scan of all allocated pages.  Improve its
> scalability by replacing the linear scan with a red-black tree lookup. 
> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).

This is a lot of code to get us to O(n * log(n)) when we can use less
code to go to O(n).  dma_pool_free() is passed the virtual address.
We can go from virtual address to struct page with virt_to_page().
In struct page, we have 5 words available (20/40 bytes), and it's trivial
to use one of them to point to the struct dma_page.
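
A minimal sketch of that suggestion (the dma_page back-pointer is a
hypothetical field, not something struct page has today):

    /* dma_pool_alloc(): record the owner when a fresh page is added */
    virt_to_page(page->vaddr)->dma_page = page;    /* hypothetical field */

    /* dma_pool_free(): replaces the linear pool_find_page() scan */
    struct dma_page *page = virt_to_page(vaddr)->dma_page;    /* O(1) */
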
Andy Shevchenko July 26, 2018, 7:45 p.m. UTC | #2
On Thu, Jul 26, 2018 at 9:54 PM, Tony Battersby <tonyb@cybernetics.com> wrote:
> dma_pool_free() scales poorly when the pool contains many pages because
> pool_find_page() does a linear scan of all allocated pages.  Improve its
> scalability by replacing the linear scan with a red-black tree lookup.
> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).

A few style-related comments.

> I moved some code from dma_pool_destroy() into pool_free_page() to avoid code
> repetition.

I would rather split that part as a separate preparatory change which
doesn't change the behaviour.

>  #include <linux/string.h>
>  #include <linux/types.h>
>  #include <linux/wait.h>

> +#include <linux/rbtree.h>

It looks misordered.

> +               struct dma_page *this_page =
> +                       container_of(*node, struct dma_page, page_node);

#define to_dma_page() container_of() ?
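
i.e. something along these lines (a sketch of the suggested helper):

    #define to_dma_page(node) container_of(node, struct dma_page, page_node)

    struct dma_page *this_page = to_dma_page(*node);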

> +                       WARN(1,
> +                            "%s: %s: DMA address overlap: old 0x%llx new 0x%llx len %zu\n",
> +                            pool->dev ? dev_name(pool->dev) : "(nodev)",
> +                            pool->name, (u64) this_page->dma, (u64) dma,

Use proper %p extensions for the DMA addresses:
https://elixir.bootlin.com/linux/latest/source/Documentation/core-api/printk-formats.rst#L150
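
For dma_addr_t that is %pad, which takes the address of the variable, e.g.
(a sketch of the same WARN() using it):

    WARN(1, "%s: %s: DMA address overlap: old %pad new %pad len %zu\n",
         pool->dev ? dev_name(pool->dev) : "(nodev)",
         pool->name, &this_page->dma, &dma, pool->allocation);
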
Tony Battersby July 26, 2018, 8:06 p.m. UTC | #3
On 07/26/2018 03:42 PM, Matthew Wilcox wrote:
> On Thu, Jul 26, 2018 at 02:54:56PM -0400, Tony Battersby wrote:
>> dma_pool_free() scales poorly when the pool contains many pages because
>> pool_find_page() does a linear scan of all allocated pages.  Improve its
>> scalability by replacing the linear scan with a red-black tree lookup. 
>> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
> This is a lot of code to get us to O(n * log(n)) when we can use less
> code to go to O(n).  dma_pool_free() is passed the virtual address.
> We can go from virtual address to struct page with virt_to_page().
> In struct page, we have 5 words available (20/40 bytes), and it's trivial
> to use one of them to point to the struct dma_page.
>
Thanks for the tip.  I will give that a try.
Matthew Wilcox (Oracle) July 27, 2018, 12:07 a.m. UTC | #4
On Thu, Jul 26, 2018 at 04:06:05PM -0400, Tony Battersby wrote:
> On 07/26/2018 03:42 PM, Matthew Wilcox wrote:
> > On Thu, Jul 26, 2018 at 02:54:56PM -0400, Tony Battersby wrote:
> >> dma_pool_free() scales poorly when the pool contains many pages because
> >> pool_find_page() does a linear scan of all allocated pages.  Improve its
> >> scalability by replacing the linear scan with a red-black tree lookup. 
> >> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
> > This is a lot of code to get us to O(n * log(n)) when we can use less
> > code to go to O(n).  dma_pool_free() is passed the virtual address.
> > We can go from virtual address to struct page with virt_to_page().
> > In struct page, we have 5 words available (20/40 bytes), and it's trivial
> > to use one of them to point to the struct dma_page.
> >
> Thanks for the tip.  I will give that a try.

If you're up for more major surgery, then I think we can put all the
information currently stored in dma_page into struct page.  Something
like this:

+++ b/include/linux/mm_types.h
@@ -152,6 +152,12 @@ struct page {
                        unsigned long hmm_data;
                        unsigned long _zd_pad_1;        /* uses mapping */
                };
+               struct {        /* dma_pool pages */
+                       struct list_head dma_list;
+                       unsigned short in_use;
+                       unsigned short offset;
+                       dma_addr_t dma;
+               };
 
                /** @rcu_head: You can use this to free a page by RCU. */
                struct rcu_head rcu_head;

page_list -> dma_list
vaddr goes away (page_to_virt() exists)
dma -> dma
in_use and offset shrink from 4 bytes to 2.

Some 32-bit systems have a 64-bit dma_addr_t, and on those systems,
this will be 8 + 2 + 2 + 8 = 20 bytes.  On 64-bit systems, it'll be
16 + 2 + 2 + 4 bytes of padding + 8 = 32 bytes (we have 40 available).
Tony Battersby July 27, 2018, 1:23 p.m. UTC | #5
On 07/26/2018 08:07 PM, Matthew Wilcox wrote:
> On Thu, Jul 26, 2018 at 04:06:05PM -0400, Tony Battersby wrote:
>> On 07/26/2018 03:42 PM, Matthew Wilcox wrote:
>>> On Thu, Jul 26, 2018 at 02:54:56PM -0400, Tony Battersby wrote:
>>>> dma_pool_free() scales poorly when the pool contains many pages because
>>>> pool_find_page() does a linear scan of all allocated pages.  Improve its
>>>> scalability by replacing the linear scan with a red-black tree lookup. 
>>>> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
>>> This is a lot of code to get us to O(n * log(n)) when we can use less
>>> code to go to O(n).  dma_pool_free() is passed the virtual address.
>>> We can go from virtual address to struct page with virt_to_page().
>>> In struct page, we have 5 words available (20/40 bytes), and it's trivial
>>> to use one of them to point to the struct dma_page.
>>>
>> Thanks for the tip.  I will give that a try.
> If you're up for more major surgery, then I think we can put all the
> information currently stored in dma_page into struct page.  Something
> like this:
>
> +++ b/include/linux/mm_types.h
> @@ -152,6 +152,12 @@ struct page {
>                         unsigned long hmm_data;
>                         unsigned long _zd_pad_1;        /* uses mapping */
>                 };
> +               struct {        /* dma_pool pages */
> +                       struct list_head dma_list;
> +                       unsigned short in_use;
> +                       unsigned short offset;
> +                       dma_addr_t dma;
> +               };
>  
>                 /** @rcu_head: You can use this to free a page by RCU. */
>                 struct rcu_head rcu_head;
>
> page_list -> dma_list
> vaddr goes away (page_to_virt() exists)
> dma -> dma
> in_use and offset shrink from 4 bytes to 2.
>
> Some 32-bit systems have a 64-bit dma_addr_t, and on those systems,
> this will be 8 + 2 + 2 + 8 = 20 bytes.  On 64-bit systems, it'll be
> 16 + 2 + 2 + 4 bytes of padding + 8 = 32 bytes (we have 40 available).
>
>
offset at least needs more bits, since allocations can be multi-page. 
See the following from mpt3sas:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool  1      21     192     1
reply_free pool             1      1      41728   1
reply pool                  1      1      1335296 1
sense pool                  1      1      970272  1
chain pool                  373959 386048 128     12064
reply_post_free pool        12     12     166528  12
                                          ^size^

In my initial implementation I also added a pointer from struct page to
the dma_pool so that dma_pool_free() could sanity-check that the page
really belongs to the pool, as in:

pg = virt_to_page(vaddr);
if (unlikely(pg->dma_pool != pool)) {
	handle error...
}
page = pg->dma_page;

The linked-list search previously implemented that check as a
by-product, and I didn't want to lose it.  It all seems to be working so
far.
Matthew Wilcox (Oracle) July 27, 2018, 3:23 p.m. UTC | #6
On Fri, Jul 27, 2018 at 09:23:30AM -0400, Tony Battersby wrote:
> On 07/26/2018 08:07 PM, Matthew Wilcox wrote:
> > If you're up for more major surgery, then I think we can put all the
> > information currently stored in dma_page into struct page.  Something
> > like this:
> >
> > +++ b/include/linux/mm_types.h
> > @@ -152,6 +152,12 @@ struct page {
> >                         unsigned long hmm_data;
> >                         unsigned long _zd_pad_1;        /* uses mapping */
> >                 };
> > +               struct {        /* dma_pool pages */
> > +                       struct list_head dma_list;
> > +                       unsigned short in_use;
> > +                       unsigned short offset;
> > +                       dma_addr_t dma;
> > +               };
> >  
> >                 /** @rcu_head: You can use this to free a page by RCU. */
> >                 struct rcu_head rcu_head;
> >
> > page_list -> dma_list
> > vaddr goes away (page_to_virt() exists)
> > dma -> dma
> > in_use and offset shrink from 4 bytes to 2.
> >
> > Some 32-bit systems have a 64-bit dma_addr_t, and on those systems,
> > this will be 8 + 2 + 2 + 8 = 20 bytes.  On 64-bit systems, it'll be
> > 16 + 2 + 2 + 4 bytes of padding + 8 = 32 bytes (we have 40 available).
> >
> >
> offset at least needs more bits, since allocations can be multi-page. 

Ah, rats.  That means we have to use the mapcount union too:

+++ b/include/linux/mm_types.h
@@ -152,6 +152,11 @@ struct page {
                        unsigned long hmm_data;
                        unsigned long _zd_pad_1;        /* uses mapping */
                };
+               struct {        /* dma_pool pages */
+                       struct list_head dma_list;
+                       unsigned int dma_in_use;
+                       dma_addr_t dma;
+               };
 
                /** @rcu_head: You can use this to free a page by RCU. */
                struct rcu_head rcu_head;
@@ -174,6 +179,7 @@ struct page {
 
                unsigned int active;            /* SLAB */
                int units;                      /* SLOB */
+               unsigned int dma_offset;        /* dma_pool */
        };
 
        /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */


> See the following from mpt3sas:
> 
> cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
> (manually cleaned up column alignment)
> poolinfo - 0.1
> reply_post_free_array pool  1      21     192     1
> reply_free pool             1      1      41728   1
> reply pool                  1      1      1335296 1
> sense pool                  1      1      970272  1
> chain pool                  373959 386048 128     12064
> reply_post_free pool        12     12     166528  12
>                                           ^size^

Wow, that's a pretty weird way to use the dmapool.  It'd be more efficient
to just call dma_alloc_coherent() directly.
Tony Battersby July 27, 2018, 7:38 p.m. UTC | #7
On 07/27/2018 11:23 AM, Matthew Wilcox wrote:
> On Fri, Jul 27, 2018 at 09:23:30AM -0400, Tony Battersby wrote:
>> On 07/26/2018 08:07 PM, Matthew Wilcox wrote:
>>> If you're up for more major surgery, then I think we can put all the
>>> information currently stored in dma_page into struct page.  Something
>>> like this:
>>>
>>> +++ b/include/linux/mm_types.h
>>> @@ -152,6 +152,12 @@ struct page {
>>>                         unsigned long hmm_data;
>>>                         unsigned long _zd_pad_1;        /* uses mapping */
>>>                 };
>>> +               struct {        /* dma_pool pages */
>>> +                       struct list_head dma_list;
>>> +                       unsigned short in_use;
>>> +                       unsigned short offset;
>>> +                       dma_addr_t dma;
>>> +               };
>>>  
>>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>>                 struct rcu_head rcu_head;
>>>
>>> page_list -> dma_list
>>> vaddr goes away (page_to_virt() exists)
>>> dma -> dma
>>> in_use and offset shrink from 4 bytes to 2.
>>>
>>> Some 32-bit systems have a 64-bit dma_addr_t, and on those systems,
>>> this will be 8 + 2 + 2 + 8 = 20 bytes.  On 64-bit systems, it'll be
>>> 16 + 2 + 2 + 4 bytes of padding + 8 = 32 bytes (we have 40 available).
>>>
>>>
>> offset at least needs more bits, since allocations can be multi-page. 
> Ah, rats.  That means we have to use the mapcount union too:
>

Actually, on second thought, if I understand it correctly, a multi-page
allocation will never be split up and returned as multiple
sub-allocations, so the offset shouldn't be needed for that case.  The
offset should only be needed when splitting a PAGE_SIZE-allocation into
smaller sub-allocations.  The current code uses the offset
unconditionally though, so it would need major changes to remove the
dependence.  So a 16-bit offset might work.

As for sanity checking, I suppose having the dma address in the page
could provide something for dma_pool_free() to check against (in fact it
is already there under DMAPOOL_DEBUG).

But the bigger problem is that my first patch adds another list_head to
the dma_page for the avail_page_link to make allocations faster.  I
suppose we could make the lists singly-linked instead of doubly-linked
to save space.

Wouldn't using the mapcount union make it problematic for userspace to
mmap() the returned DMA buffers?  I am not sure if any drivers allow
that to be done or not.  I have heard of drivers in userspace, drivers
with DMA ring buffers, etc.  I don't want to audit the whole kernel tree
to know if it would be safe.  As you have seen, at least mpt3sas is
doing unexpected things with dma pools.

So maybe it could be done, but you are right, it would involve major
surgery.  My current in-development patch to implement your initial
suggestion is pretty small and it works.  So I'm not sure if I want to
take it further or not.  Lots of other things to do...
Tony Battersby July 27, 2018, 9:27 p.m. UTC | #8
On 07/27/2018 03:38 PM, Tony Battersby wrote:
> But the bigger problem is that my first patch adds another list_head to
> the dma_page for the avail_page_link to make allocations faster.  I
> suppose we could make the lists singly-linked instead of doubly-linked
> to save space.
>

I managed to redo my dma_pool_alloc() patch to make avail_page_list
singly-linked instead of doubly-linked.  But the problem with making
either list singly-linked is that it would no longer be possible to call
pool_free_page() any time other than dma_pool_destroy() without scanning
the lists to remove the page from them, which would make pruning
arbitrary free pages slower (adding back an O(n^2)).  But the current
code doesn't do that anyway, and in fact it has a comment in
dma_pool_free() to "resist the temptation" to prune free pages.  And yet
it seems like it might be reasonable for someone to add such code in the
future if there are a whole lot of free pages, so I am hesitant to make
it more difficult.

So my question is: when I post v2 of the patchset, should I send the
doubly-linked version or the singly-linked version, in anticipation that
someone else might want to take it further and move everything into
struct page as you suggest?
Andy Shevchenko July 27, 2018, 9:35 p.m. UTC | #9
On Sat, Jul 28, 2018 at 12:27 AM, Tony Battersby <tonyb@cybernetics.com> wrote:
> On 07/27/2018 03:38 PM, Tony Battersby wrote:
>> But the bigger problem is that my first patch adds another list_head to
>> the dma_page for the avail_page_link to make allocations faster.  I
>> suppose we could make the lists singly-linked instead of doubly-linked
>> to save space.
>>
>
> I managed to redo my dma_pool_alloc() patch to make avail_page_list
> singly-linked instead of doubly-linked.

Are you relying on llist.h implementation?

Btw, did you see quicklist.h?
Tony Battersby July 27, 2018, 10:07 p.m. UTC | #10
On 07/27/2018 05:35 PM, Andy Shevchenko wrote:
> On Sat, Jul 28, 2018 at 12:27 AM, Tony Battersby <tonyb@cybernetics.com> wrote:
>> On 07/27/2018 03:38 PM, Tony Battersby wrote:
>>> But the bigger problem is that my first patch adds another list_head to
>>> the dma_page for the avail_page_link to make allocations faster.  I
>>> suppose we could make the lists singly-linked instead of doubly-linked
>>> to save space.
>>>
>> I managed to redo my dma_pool_alloc() patch to make avail_page_list
>> singly-linked instead of doubly-linked.
> Are you relying on llist.h implementation?
>
> Btw, did you see quicklist.h?
>
>
I looked over include/linux/*list* to see if there was a suitable
implementation I could use.  llist.h makes a big deal about having a
lock-less implementation with atomic instructions, which seemed like
overkill.  I didn't see anything else suitable, so I just went with my
own implementation.  Singly-linked lists are simple enough.  And a quick
"grep -i singly include/linux/*" shows precedent in bi_next, fl_next,
fa_next, etc.
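
A minimal sketch of such a hand-rolled singly-linked list of available pages
(field names hypothetical, not taken from the posted patches):

    struct dma_page *next_avail;    /* in struct dma_page */
    struct dma_page *avail_pages;   /* head pointer in struct dma_pool */

    /* push: page has (or regains) free blocks */
    page->next_avail = pool->avail_pages;
    pool->avail_pages = page;

    /* pop: take the first page with free blocks */
    page = pool->avail_pages;
    pool->avail_pages = page->next_avail;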

Thanks for pointing out quicklist.h.  At first I thought you were
confused since you were talking about linked list implementations and
quicklist.h sounds like a linked list implementation but isn't.  But now
I see that it is doing simple memory allocation/free, so that is the
relevance to dmapool.  Incidentally it looks like it is also using a
singly-linked list to store the list of free pages, but it is much
simpler because it doesn't try to sub-divide the pages into smaller
allocations.
Tony Battersby July 30, 2018, 2:05 p.m. UTC | #11
On 07/27/2018 05:27 PM, Tony Battersby wrote:
> On 07/27/2018 03:38 PM, Tony Battersby wrote:
>> But the bigger problem is that my first patch adds another list_head to
>> the dma_page for the avail_page_link to make allocations faster.  I
>> suppose we could make the lists singly-linked instead of doubly-linked
>> to save space.
>>
> I managed to redo my dma_pool_alloc() patch to make avail_page_list
> singly-linked instead of doubly-linked.  But the problem with making
> either list singly-linked is that it would no longer be possible to call
> pool_free_page() any time other than dma_pool_destroy() without scanning
> the lists to remove the page from them, which would make pruning
> arbitrary free pages slower (adding back an O(n^2)).  But the current
> code doesn't do that anyway, and in fact it has a comment in
> dma_pool_free() to "resist the temptation" to prune free pages.  And yet
> it seems like it might be reasonable for someone to add such code in the
> future if there are a whole lot of free pages, so I am hesitant to make
> it more difficult.
>
> So my question is: when I post v2 of the patchset, should I send the
> doubly-linked version or the singly-linked version, in anticipation that
> someone else might want to take it further and move everything into
> struct page as you suggest?
>
Over the weekend I came up with a better solution.  Instead of having
the page listed in two singly-linked lists at the same time, move the
page between two doubly-linked lists.  One list is dedicated for pages
that have all blocks allocated, and one list is for pages that have some
blocks free.  Since the page is only in one list at a time, it only
needs one set of list pointers.
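
A rough sketch of that two-list arrangement (full_page_list is a hypothetical
name; avail_page_list and avail_page_link are from patch 1/3):

    struct dma_pool {
        struct list_head avail_page_list;  /* pages with at least one free block */
        struct list_head full_page_list;   /* pages with every block in use */
        /* ... */
    };

    /* alloc path: the block just handed out was the page's last free one */
    list_move(&page->avail_page_link, &pool->full_page_list);

    /* free path: a block just came back to a page that had none free */
    list_move(&page->avail_page_link, &pool->avail_page_list);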

I also implemented the code to make the offset 16-bit, while ignoring
the offset for cases where it is not needed (where it would overflow
anyway).

So now I have an implementation that eliminates struct dma_page.  I will
post it once it is ready for review.

Patch

--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -15,11 +15,12 @@ 
  * Many older drivers still have their own code to do this.
  *
  * The current design of this allocator is fairly simple.  The pool is
- * represented by the 'struct dma_pool' which keeps a doubly-linked list of
- * allocated pages.  Each page in the page_list is split into blocks of at
- * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
- * list of free blocks within the page.  Used blocks aren't tracked, but we
- * keep a count of how many are currently allocated from each page.
+ * represented by the 'struct dma_pool' which keeps a red-black tree of all
+ * allocated pages, keyed by DMA address for fast lookup when freeing.
+ * Each page in the page_tree is split into blocks of at least 'size' bytes.
+ * Free blocks are tracked in an unsorted singly-linked list of free blocks
+ * within the page.  Used blocks aren't tracked, but we keep a count of how
+ * many are currently allocated from each page.
  *
  * The avail_page_list keeps track of pages that have one or more free blocks
  * available to (re)allocate.  Pages are moved in and out of avail_page_list
@@ -41,13 +42,14 @@ 
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/wait.h>
+#include <linux/rbtree.h>
 
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON)
 #define DMAPOOL_DEBUG 1
 #endif
 
 struct dma_pool {		/* the pool */
-	struct list_head page_list;
+	struct rb_root page_tree;
 	struct list_head avail_page_list;
 	spinlock_t lock;
 	size_t size;
@@ -59,7 +61,7 @@  struct dma_pool {		/* the pool */
 };
 
 struct dma_page {		/* cacheable header for 'allocation' bytes */
-	struct list_head page_list;
+	struct rb_node page_node;
 	struct list_head avail_page_link;
 	void *vaddr;
 	dma_addr_t dma;
@@ -78,6 +80,7 @@  show_pools(struct device *dev, struct de
 	char *next;
 	struct dma_page *page;
 	struct dma_pool *pool;
+	struct rb_node *node;
 
 	next = buf;
 	size = PAGE_SIZE;
@@ -92,7 +95,10 @@  show_pools(struct device *dev, struct de
 		unsigned blocks = 0;
 
 		spin_lock_irq(&pool->lock);
-		list_for_each_entry(page, &pool->page_list, page_list) {
+		for (node = rb_first(&pool->page_tree);
+		     node;
+		     node = rb_next(node)) {
+			page = rb_entry(node, struct dma_page, page_node);
 			pages++;
 			blocks += page->in_use;
 		}
@@ -169,7 +175,7 @@  struct dma_pool *dma_pool_create(const c
 
 	retval->dev = dev;
 
-	INIT_LIST_HEAD(&retval->page_list);
+	retval->page_tree = RB_ROOT;
 	INIT_LIST_HEAD(&retval->avail_page_list);
 	spin_lock_init(&retval->lock);
 	retval->size = size;
@@ -210,6 +216,65 @@  struct dma_pool *dma_pool_create(const c
 }
 EXPORT_SYMBOL(dma_pool_create);
 
+/*
+ * Find the dma_page that manages the given DMA address.
+ */
+static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
+{
+	struct rb_node *node = pool->page_tree.rb_node;
+
+	while (node) {
+		struct dma_page *page =
+			container_of(node, struct dma_page, page_node);
+
+		if (dma < page->dma)
+			node = node->rb_left;
+		else if ((dma - page->dma) >= pool->allocation)
+			node = node->rb_right;
+		else
+			return page;
+	}
+	return NULL;
+}
+
+/*
+ * Insert a dma_page into the page_tree.
+ */
+static int pool_insert_page(struct dma_pool *pool, struct dma_page *new_page)
+{
+	dma_addr_t dma = new_page->dma;
+	struct rb_node **node = &(pool->page_tree.rb_node), *parent = NULL;
+
+	while (*node) {
+		struct dma_page *this_page =
+			container_of(*node, struct dma_page, page_node);
+
+		parent = *node;
+		if (dma < this_page->dma)
+			node = &((*node)->rb_left);
+		else if (likely((dma - this_page->dma) >= pool->allocation))
+			node = &((*node)->rb_right);
+		else {
+			/*
+			 * A page that overlaps the new DMA range is already
+			 * present in the tree.  This should not happen.
+			 */
+			WARN(1,
+			     "%s: %s: DMA address overlap: old 0x%llx new 0x%llx len %zu\n",
+			     pool->dev ? dev_name(pool->dev) : "(nodev)",
+			     pool->name, (u64) this_page->dma, (u64) dma,
+			     pool->allocation);
+			return -1;
+		}
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&new_page->page_node, parent, node);
+	rb_insert_color(&new_page->page_node, &pool->page_tree);
+
+	return 0;
+}
+
 static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 {
 	unsigned int offset = 0;
@@ -254,15 +319,36 @@  static inline bool is_page_busy(struct d
 	return page->in_use != 0;
 }
 
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_free_page(struct dma_pool *pool,
+			   struct dma_page *page,
+			   bool destroying_pool)
 {
-	dma_addr_t dma = page->dma;
-
+	if (destroying_pool && is_page_busy(page)) {
+		if (pool->dev)
+			dev_err(pool->dev,
+				"dma_pool_destroy %s, %p busy\n",
+				pool->name, page->vaddr);
+		else
+			pr_err("dma_pool_destroy %s, %p busy\n",
+			       pool->name, page->vaddr);
+		/* leak the still-in-use consistent memory */
+	} else {
 #ifdef	DMAPOOL_DEBUG
-	memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+		memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-	dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
-	list_del(&page->page_list);
+		dma_free_coherent(pool->dev,
+				  pool->allocation,
+				  page->vaddr,
+				  page->dma);
+	}
+
+	/*
+	 * If the pool is being destroyed, it is not safe to modify the
+	 * page_tree while iterating over it, and it is also unnecessary since
+	 * the whole tree will be discarded anyway.
+	 */
+	if (!destroying_pool)
+		rb_erase(&page->page_node, &pool->page_tree);
 	list_del(&page->avail_page_link);
 	kfree(page);
 }
@@ -277,6 +363,7 @@  static void pool_free_page(struct dma_po
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
+	struct dma_page *page, *tmp_page;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -292,24 +379,11 @@  void dma_pool_destroy(struct dma_pool *p
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	while (!list_empty(&pool->page_list)) {
-		struct dma_page *page;
-		page = list_entry(pool->page_list.next,
-				  struct dma_page, page_list);
-		if (is_page_busy(page)) {
-			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_destroy %s, %p busy\n",
-					pool->name, page->vaddr);
-			else
-				pr_err("dma_pool_destroy %s, %p busy\n",
-				       pool->name, page->vaddr);
-			/* leak the still-in-use consistent memory */
-			list_del(&page->page_list);
-			list_del(&page->avail_page_link);
-			kfree(page);
-		} else
-			pool_free_page(pool, page);
+	rbtree_postorder_for_each_entry_safe(page,
+					     tmp_page,
+					     &pool->page_tree,
+					     page_node) {
+		pool_free_page(pool, page, true);
 	}
 
 	kfree(pool);
@@ -353,7 +427,15 @@  void *dma_pool_alloc(struct dma_pool *po
 
 	spin_lock_irqsave(&pool->lock, flags);
 
-	list_add(&page->page_list, &pool->page_list);
+	if (unlikely(pool_insert_page(pool, page))) {
+		/*
+		 * This should not happen, so something must have gone horribly
+		 * wrong.  Instead of crashing, intentionally leak the memory
+		 * and make for the exit.
+		 */
+		spin_unlock_irqrestore(&pool->lock, flags);
+		return NULL;
+	}
 	list_add(&page->avail_page_link, &pool->avail_page_list);
  ready:
 	page->in_use++;
@@ -400,19 +482,6 @@  void *dma_pool_alloc(struct dma_pool *po
 }
 EXPORT_SYMBOL(dma_pool_alloc);
 
-static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
-{
-	struct dma_page *page;
-
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (dma < page->dma)
-			continue;
-		if ((dma - page->dma) < pool->allocation)
-			return page;
-	}
-	return NULL;
-}
-
 /**
  * dma_pool_free - put block back into dma pool
  * @pool: the dma pool holding the block
@@ -484,7 +553,7 @@  void dma_pool_free(struct dma_pool *pool
 	page->offset = offset;
 	/*
 	 * Resist a temptation to do
-	 *    if (!is_page_busy(page)) pool_free_page(pool, page);
+	 *    if (!is_page_busy(page)) pool_free_page(pool, page, false);
 	 * Better have a few empty pages hang around.
 	 */
 	spin_unlock_irqrestore(&pool->lock, flags);