
brd: use XArray instead of radix-tree to index backing pages

Message ID 20230511121544.111648-1-p.raghav@samsung.com (mailing list archive)
State New, archived
Series brd: use XArray instead of radix-tree to index backing pages

Commit Message

Pankaj Raghav May 11, 2023, 12:15 p.m. UTC
XArray was introduced to hold a large array of pointers with a simple API.
The XArray API also provides array semantics, which simplifies the way we
store and access the backing pages and makes the code significantly easier
to understand.
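
For illustration, the lookup path condenses from a locked radix-tree walk
to a single xa_load() call, which takes the RCU read lock internally
(simplified from the diff below):

	/* Before: radix-tree lookup under an explicit RCU read lock. */
	rcu_read_lock();
	idx = sector >> PAGE_SECTORS_SHIFT;	/* sector to page index */
	page = radix_tree_lookup(&brd->brd_pages, idx);
	rcu_read_unlock();

	/* After: xa_load() handles the RCU locking itself. */
	idx = sector >> PAGE_SECTORS_SHIFT;
	page = xa_load(&brd->brd_pages, idx);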

No performance difference was noticed between the two implementations
using fio with direct=1 [1].

[1] Performance in KIOPS:

          |  radix-tree |    XArray  |   Diff
          |             |            |
write     |    315      |     313    |   -0.6%
randwrite |    286      |     290    |   +1.3%
read      |    330      |     335    |   +1.5%
randread  |    309      |     312    |   +0.9%

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
@willy: I noticed later that you sent a similar patch in 2019 [1]. Let me
know if you would like your Signed-off-by added to the patch as well.

[1] https://lore.kernel.org/linux-block/20190318194821.3470-9-willy@infradead.org/

 drivers/block/brd.c | 93 ++++++++++++---------------------------------
 1 file changed, 24 insertions(+), 69 deletions(-)

Comments

Hannes Reinecke May 11, 2023, 2:27 p.m. UTC | #1
On 5/11/23 14:15, Pankaj Raghav wrote:
> XArray was introduced to hold a large array of pointers with a simple API.
> The XArray API also provides array semantics, which simplifies the way we
> store and access the backing pages and makes the code significantly easier
> to understand.
> 
> No performance difference was noticed between the two implementations
> using fio with direct=1 [1].
> 
> [1] Performance in KIOPS:
> 
>           |  radix-tree |    XArray  |   Diff
>           |             |            |
> write     |    315      |     313    |   -0.6%
> randwrite |    286      |     290    |   +1.3%
> read      |    330      |     335    |   +1.5%
> randread  |    309      |     312    |   +0.9%
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> @willy: I noticed later that you sent a similar patch in 2019 [1]. Let me
> know if you would like your Signed-off-by added to the patch as well.
> 
> [1] https://lore.kernel.org/linux-block/20190318194821.3470-9-willy@infradead.org/
> 
>   drivers/block/brd.c | 93 ++++++++++++---------------------------------
>   1 file changed, 24 insertions(+), 69 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Chaitanya Kulkarni May 12, 2023, 1:19 a.m. UTC | #2
On 5/11/23 05:15, Pankaj Raghav wrote:
> XArray was introduced to hold a large array of pointers with a simple API.
> The XArray API also provides array semantics, which simplifies the way we
> store and access the backing pages and makes the code significantly easier
> to understand.
>
> No performance difference was noticed between the two implementations
> using fio with direct=1 [1].
>
> [1] Performance in KIOPS:
>
>           |  radix-tree |    XArray  |   Diff
>           |             |            |
> write     |    315      |     313    |   -0.6%
> randwrite |    286      |     290    |   +1.3%
> read      |    330      |     335    |   +1.5%
> randread  |    309      |     312    |   +0.9%
>

I have a few concerns; can you please share the fio jobs you
used to gather this data? I want to test it on my setup in
order to provide a Tested-by tag.

Also, please wait until I finish my testing before applying
this patch.

-ck
Pankaj Raghav May 12, 2023, 8:17 a.m. UTC | #3
>> [1] Performance in KIOPS:
>>
>>           |  radix-tree |    XArray  |   Diff
>>           |             |            |
>> write     |    315      |     313    |   -0.6%
>> randwrite |    286      |     290    |   +1.3%
>> read      |    330      |     335    |   +1.5%
>> randread  |    309      |     312    |   +0.9%
>>
> 
> I have a few concerns; can you please share the fio jobs you
> used to gather this data? I want to test it on my setup in
> order to provide a Tested-by tag.
> 

That would be great. This is my fio job:

$ modprobe brd rd_size=10485760 rd_nr=1
$ fio --name=brd  --rw=<type>  --size=10G --io_size=100G --cpus_allowed=1 --filename=/dev/ram0
--direct=1 --iodepth=64 --ioengine=io_uring --loop=8

I ran the above job four times and averaged the results, as I noticed some variance. I ran the
test in QEMU, as we are exercising a block device whose backing storage lives in RAM.
Chaitanya Kulkarni May 12, 2023, 8:34 a.m. UTC | #4
On 5/12/23 01:17, Pankaj Raghav wrote:
>>> [1] Performance in KIOPS:
>>>
>>>           |  radix-tree |    XArray  |   Diff
>>>           |             |            |
>>> write     |    315      |     313    |   -0.6%
>>> randwrite |    286      |     290    |   +1.3%
>>> read      |    330      |     335    |   +1.5%
>>> randread  |    309      |     312    |   +0.9%
>>>
>> I have a few concerns; can you please share the fio jobs you
>> used to gather this data? I want to test it on my setup in
>> order to provide a Tested-by tag.
>>
> That would be great. This is my fio job:
>
> $ modprobe brd rd_size=10485760 rd_nr=1
> $ fio --name=brd  --rw=<type>  --size=10G --io_size=100G --cpus_allowed=1 --filename=/dev/ram0
> --direct=1 --iodepth=64 --ioengine=io_uring --loop=8
>
> I ran the above job four times and averaged the results, as I noticed some variance. I ran the
> test in QEMU, as we are exercising a block device whose backing storage lives in RAM.
>

Please also share the numbers on a non-virtualized platform, i.e. without
QEMU.

-ck
Pankaj Raghav May 12, 2023, 10:22 a.m. UTC | #5
On 2023-05-12 10:34, Chaitanya Kulkarni wrote:
> On 5/12/23 01:17, Pankaj Raghav wrote:
>>>> [1] Performance in KIOPS:
>>>>
>>>>           |  radix-tree |    XArray  |   Diff
>>>>           |             |            |
>>>> write     |    315      |     313    |   -0.6%
>>>> randwrite |    286      |     290    |   +1.3%
>>>> read      |    330      |     335    |   +1.5%
>>>> randread  |    309      |     312    |   +0.9%
>>>>
>>> I have a few concerns; can you please share the fio jobs you
>>> used to gather this data? I want to test it on my setup in
>>> order to provide a Tested-by tag.
>>>
>> That would be great. This is my fio job:
>>
>> $ modprobe brd rd_size=10485760 rd_nr=1
>> $ fio --name=brd  --rw=<type>  --size=10G --io_size=100G --cpus_allowed=1 --filename=/dev/ram0
>> --direct=1 --iodepth=64 --ioengine=io_uring --loop=8
>>
>> I ran the above job four times and averaged the results, as I noticed some variance. I ran the
>> test in QEMU, as we are exercising a block device whose backing storage lives in RAM.
>>
> 
> Please also share the numbers on a non-virtualized platform, i.e. without
> QEMU.

I am running this on next-20230512 on an Intel(R) Xeon(R) CPU E3-1240 v6 with 32 GB of RAM:

              |  radix-tree |    XArray  |   Diff
              |             |            |
    write     |    532      |     527    |   -0.9%
    randwrite |    456      |     452    |   -0.8%
    read      |    557      |     550    |   -1.2%
    randread  |    493      |     495    |   +0.4%
Pankaj Raghav May 15, 2023, 2:22 p.m. UTC | #6
Hi,

> I have a few concerns; can you please share the fio jobs you
> used to gather this data? I want to test it on my setup in
> order to provide a Tested-by tag.
> 

Did you manage to test it on your setup by any chance?

> Also, please wait until I finish my testing before applying
> this patch.
> 
> -ck
> 
>
Chaitanya Kulkarni May 16, 2023, 12:49 a.m. UTC | #7
On 5/15/23 07:22, Pankaj Raghav wrote:
> Hi,
>
>> I have a few concerns; can you please share the fio jobs you
>> used to gather this data? I want to test it on my setup in
>> order to provide a Tested-by tag.
>>
> Did you manage to test it in your setup by any chance?
>

No, I didn't get access to the machine, and it will take some time.

Feel free to merge this based on your numbers; I'll run the numbers
when I get the machine.

-ck
Jens Axboe May 16, 2023, 3:05 p.m. UTC | #8
On Thu, 11 May 2023 14:15:44 +0200, Pankaj Raghav wrote:
> XArray was introduced to hold a large array of pointers with a simple API.
> The XArray API also provides array semantics, which simplifies the way we
> store and access the backing pages and makes the code significantly easier
> to understand.
> 
> No performance difference was noticed between the two implementations
> using fio with direct=1 [1].
> 
> [...]

Applied, thanks!

[1/1] brd: use XArray instead of radix-tree to index backing pages
      commit: 786bb02458819df7a833361c6c7448a4925a89ce

Best regards,

Patch

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bcad9b926b0c..2f71376afc71 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -19,7 +19,7 @@ 
 #include <linux/highmem.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/backing-dev.h>
@@ -28,7 +28,7 @@ 
 #include <linux/uaccess.h>
 
 /*
- * Each block ramdisk device has a radix_tree brd_pages of pages that stores
+ * Each block ramdisk device has an xarray brd_pages of pages that stores
  * the pages containing the block device's contents. A brd page's ->index is
  * its offset in PAGE_SIZE units. This is similar to, but in no way connected
  * with, the kernel's pagecache or buffer cache (which sit above our block
@@ -40,11 +40,9 @@  struct brd_device {
 	struct list_head	brd_list;
 
 	/*
-	 * Backing store of pages and lock to protect it. This is the contents
-	 * of the block device.
+	 * Backing store of pages. This is the contents of the block device.
 	 */
-	spinlock_t		brd_lock;
-	struct radix_tree_root	brd_pages;
+	struct xarray	        brd_pages;
 	u64			brd_nr_pages;
 };
 
@@ -56,21 +54,8 @@  static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 	pgoff_t idx;
 	struct page *page;
 
-	/*
-	 * The page lifetime is protected by the fact that we have opened the
-	 * device node -- brd pages will never be deleted under us, so we
-	 * don't need any further locking or refcounting.
-	 *
-	 * This is strictly true for the radix-tree nodes as well (ie. we
-	 * don't actually need the rcu_read_lock()), however that is not a
-	 * documented feature of the radix-tree API so it is better to be
-	 * safe here (we don't have total exclusion from radix tree updates
-	 * here, only deletes).
-	 */
-	rcu_read_lock();
 	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
-	page = radix_tree_lookup(&brd->brd_pages, idx);
-	rcu_read_unlock();
+	page = xa_load(&brd->brd_pages, idx);
 
 	BUG_ON(page && page->index != idx);
 
@@ -83,7 +68,7 @@  static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
-	struct page *page;
+	struct page *page, *cur;
 	int ret = 0;
 
 	page = brd_lookup_page(brd, sector);
@@ -94,71 +79,42 @@  static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 	if (!page)
 		return -ENOMEM;
 
-	if (radix_tree_maybe_preload(gfp)) {
-		__free_page(page);
-		return -ENOMEM;
-	}
+	xa_lock(&brd->brd_pages);
 
-	spin_lock(&brd->brd_lock);
 	idx = sector >> PAGE_SECTORS_SHIFT;
 	page->index = idx;
-	if (radix_tree_insert(&brd->brd_pages, idx, page)) {
+
+	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
+
+	if (unlikely(cur)) {
 		__free_page(page);
-		page = radix_tree_lookup(&brd->brd_pages, idx);
-		if (!page)
-			ret = -ENOMEM;
-		else if (page->index != idx)
+		ret = xa_err(cur);
+		if (!ret && (cur->index != idx))
 			ret = -EIO;
 	} else {
 		brd->brd_nr_pages++;
 	}
-	spin_unlock(&brd->brd_lock);
 
-	radix_tree_preload_end();
+	xa_unlock(&brd->brd_pages);
+
 	return ret;
 }
 
 /*
- * Free all backing store pages and radix tree. This must only be called when
+ * Free all backing store pages and xarray. This must only be called when
  * there are no other users of the device.
  */
-#define FREE_BATCH 16
 static void brd_free_pages(struct brd_device *brd)
 {
-	unsigned long pos = 0;
-	struct page *pages[FREE_BATCH];
-	int nr_pages;
+	struct page *page;
+	pgoff_t idx;
 
-	do {
-		int i;
+	xa_for_each(&brd->brd_pages, idx, page) {
+		__free_page(page);
+		cond_resched_rcu();
+	}
 
-		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
-				(void **)pages, pos, FREE_BATCH);
-
-		for (i = 0; i < nr_pages; i++) {
-			void *ret;
-
-			BUG_ON(pages[i]->index < pos);
-			pos = pages[i]->index;
-			ret = radix_tree_delete(&brd->brd_pages, pos);
-			BUG_ON(!ret || ret != pages[i]);
-			__free_page(pages[i]);
-		}
-
-		pos++;
-
-		/*
-		 * It takes 3.4 seconds to remove 80GiB ramdisk.
-		 * So, we need cond_resched to avoid stalling the CPU.
-		 */
-		cond_resched();
-
-		/*
-		 * This assumes radix_tree_gang_lookup always returns as
-		 * many pages as possible. If the radix-tree code changes,
-		 * so will this have to.
-		 */
-	} while (nr_pages == FREE_BATCH);
+	xa_destroy(&brd->brd_pages);
 }
 
 /*
@@ -372,8 +328,7 @@  static int brd_alloc(int i)
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
 
-	spin_lock_init(&brd->brd_lock);
-	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
+	xa_init(&brd->brd_pages);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
 	if (!IS_ERR_OR_NULL(brd_debugfs_dir))
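
A note on the XArray idiom in brd_insert_page() above (a condensed sketch,
not additional patch content): __xa_cmpxchg() is called with the xa_lock
held and returns whatever previously occupied the slot, i.e. NULL on a
successful store, an existing entry if another inserter raced us, or an
xa_err()-encoded pointer if the store failed internally. That is why a
single non-NULL check covers both failure modes:

	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
	if (unlikely(cur)) {
		/* The slot was not empty; our new page is redundant either way. */
		__free_page(page);
		/* xa_err() is non-zero only when cur is an XA_ERROR() pointer. */
		ret = xa_err(cur);
		/* A real page under the wrong index is treated as an I/O error. */
		if (!ret && (cur->index != idx))
			ret = -EIO;
	} else {
		brd->brd_nr_pages++;
	}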