diff mbox

[RFC] Btrfs: improve multi-thread buffer read

Message ID 1341919679-13792-1-git-send-email-liubo2009@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

liubo July 10, 2012, 11:27 a.m. UTC
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.

And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:

     t1          t2          t3          t4
   add Page1
   read Page1  add Page2
     |         read Page2  add Page3
     |            |        read Page3  add Page4
     |            |           |        read Page4
-----|------------|-----------|-----------|--------
     v            v           v           v
    bio          bio         bio         bio

Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio.  Thus, we can end up with far more bios
than we need.

Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.

With that said, we can make each bio hold more pages and reduce the number
of bios we need.

Here is some numbers taken from fio results:
         w/o patch                 w patch
       -------------  --------  ---------------
READ:    745MB/s        +32%       987MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c |   37 +++++++++++++++++++++++++++++++++++--
 1 files changed, 35 insertions(+), 2 deletions(-)

Comments

Josef Bacik July 10, 2012, 6:58 p.m. UTC | #1
On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
> While testing with my buffer read fio jobs[1], I find that btrfs does not
> perform well enough.
> 
> Here is a scenario in fio jobs:
> 
> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
> and all of them will race on add_to_page_cache_lru(), and if one thread
> successfully puts its page into the page cache, it takes the responsibility
> to read the page's data.
> 
> And what's more, reading a page needs a period of time to finish, in which
> other threads can slide in and process rest pages:
> 
>      t1          t2          t3          t4
>    add Page1
>    read Page1  add Page2
>      |         read Page2  add Page3
>      |            |        read Page3  add Page4
>      |            |           |        read Page4
> -----|------------|-----------|-----------|--------
>      v            v           v           v
>     bio          bio         bio         bio
> 
> Now we have four bios, each of which holds only one page since we need to
> maintain consecutive pages in bio.  Thus, we can end up with far more bios
> than we need.
> 
> Here we're going to
> a) delay the real read-page section and
> b) try to put more pages into page cache.
> 
> With that said, we can make each bio hold more pages and reduce the number
> of bios we need.
> 
> Here is some numbers taken from fio results:
>          w/o patch                 w patch
>        -------------  --------  ---------------
> READ:    745MB/s        +32%       987MB/s
> 

Um, I have this in btrfs-next

Btrfs: use large extent range for read and its endio

that seems to do the same thing, did you not want to do that anymore?  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
liubo July 11, 2012, 1:57 a.m. UTC | #2
On 07/11/2012 02:58 AM, Josef Bacik wrote:

> On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
>> While testing with my buffer read fio jobs[1], I find that btrfs does not
>> perform well enough.
>>
>> Here is a scenario in fio jobs:
>>
>> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
>> and all of them will race on add_to_page_cache_lru(), and if one thread
>> successfully puts its page into the page cache, it takes the responsibility
>> to read the page's data.
>>
>> And what's more, reading a page needs a period of time to finish, in which
>> other threads can slide in and process rest pages:
>>
>>      t1          t2          t3          t4
>>    add Page1
>>    read Page1  add Page2
>>      |         read Page2  add Page3
>>      |            |        read Page3  add Page4
>>      |            |           |        read Page4
>> -----|------------|-----------|-----------|--------
>>      v            v           v           v
>>     bio          bio         bio         bio
>>
>> Now we have four bios, each of which holds only one page since we need to
>> maintain consecutive pages in bio.  Thus, we can end up with far more bios
>> than we need.
>>
>> Here we're going to
>> a) delay the real read-page section and
>> b) try to put more pages into page cache.
>>
>> With that said, we can make each bio hold more pages and reduce the number
>> of bios we need.
>>
>> Here is some numbers taken from fio results:
>>          w/o patch                 w patch
>>        -------------  --------  ---------------
>> READ:    745MB/s        +32%       987MB/s
>>
> 
> Um, I have this in btrfs-next
> 
> Btrfs: use large extent range for read and its endio
> 
> that seems to do the same thing, did you not want to do that anymore?  Thanks,
> 



I'm still hard working on that patchset. :)

Although the patchset is well worthy of testing, it is not good enough for btrfs upstream.

While doing some tuning work on it, I realized that I could make this improvement without
the help of rwlock extent state stuff, so I made this smaller and cleaner patch for upstream
so that we could gain some performance here first.

thanks,
liubo


> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik July 11, 2012, 12:31 p.m. UTC | #3
On Tue, Jul 10, 2012 at 07:57:55PM -0600, Liu Bo wrote:
> On 07/11/2012 02:58 AM, Josef Bacik wrote:
> 
> > On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
> >> While testing with my buffer read fio jobs[1], I find that btrfs does not
> >> perform well enough.
> >>
> >> Here is a scenario in fio jobs:
> >>
> >> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
> >> and all of them will race on add_to_page_cache_lru(), and if one thread
> >> successfully puts its page into the page cache, it takes the responsibility
> >> to read the page's data.
> >>
> >> And what's more, reading a page needs a period of time to finish, in which
> >> other threads can slide in and process rest pages:
> >>
> >>      t1          t2          t3          t4
> >>    add Page1
> >>    read Page1  add Page2
> >>      |         read Page2  add Page3
> >>      |            |        read Page3  add Page4
> >>      |            |           |        read Page4
> >> -----|------------|-----------|-----------|--------
> >>      v            v           v           v
> >>     bio          bio         bio         bio
> >>
> >> Now we have four bios, each of which holds only one page since we need to
> >> maintain consecutive pages in bio.  Thus, we can end up with far more bios
> >> than we need.
> >>
> >> Here we're going to
> >> a) delay the real read-page section and
> >> b) try to put more pages into page cache.
> >>
> >> With that said, we can make each bio hold more pages and reduce the number
> >> of bios we need.
> >>
> >> Here is some numbers taken from fio results:
> >>          w/o patch                 w patch
> >>        -------------  --------  ---------------
> >> READ:    745MB/s        +32%       987MB/s
> >>
> > 
> > Um, I have this in btrfs-next
> > 
> > Btrfs: use large extent range for read and its endio
> > 
> > that seems to do the same thing, did you not want to do that anymore?  Thanks,
> > 
> 
> 
> 
> I'm still hard working on that patchset. :)
> 
> Although the patchset is well worthy of testing, it is not good enough for btrfs upstream.
> 
> While doing some tuning work on it, I realized that I could make this improvement without
> the help of rwlock extent state stuff, so I made this smaller and cleaner patch for upstream
> so that we could gain some performance here first.
> 

So do you want me to drop the rwlock stuff and take this instead?  Take a look
at whats in btrfs-next and tell me what I should drop.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
liubo July 11, 2012, 1:04 p.m. UTC | #4
On 07/11/2012 08:31 PM, Josef Bacik wrote:

> On Tue, Jul 10, 2012 at 07:57:55PM -0600, Liu Bo wrote:
>> On 07/11/2012 02:58 AM, Josef Bacik wrote:
>>
>>> On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
>>>> While testing with my buffer read fio jobs[1], I find that btrfs does not
>>>> perform well enough.
>>>>
>>>> Here is a scenario in fio jobs:
>>>>
>>>> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
>>>> and all of them will race on add_to_page_cache_lru(), and if one thread
>>>> successfully puts its page into the page cache, it takes the responsibility
>>>> to read the page's data.
>>>>
>>>> And what's more, reading a page needs a period of time to finish, in which
>>>> other threads can slide in and process rest pages:
>>>>
>>>>      t1          t2          t3          t4
>>>>    add Page1
>>>>    read Page1  add Page2
>>>>      |         read Page2  add Page3
>>>>      |            |        read Page3  add Page4
>>>>      |            |           |        read Page4
>>>> -----|------------|-----------|-----------|--------
>>>>      v            v           v           v
>>>>     bio          bio         bio         bio
>>>>
>>>> Now we have four bios, each of which holds only one page since we need to
>>>> maintain consecutive pages in bio.  Thus, we can end up with far more bios
>>>> than we need.
>>>>
>>>> Here we're going to
>>>> a) delay the real read-page section and
>>>> b) try to put more pages into page cache.
>>>>
>>>> With that said, we can make each bio hold more pages and reduce the number
>>>> of bios we need.
>>>>
>>>> Here is some numbers taken from fio results:
>>>>          w/o patch                 w patch
>>>>        -------------  --------  ---------------
>>>> READ:    745MB/s        +32%       987MB/s
>>>>
>>> Um, I have this in btrfs-next
>>>
>>> Btrfs: use large extent range for read and its endio
>>>
>>> that seems to do the same thing, did you not want to do that anymore?  Thanks,
>>>
>>
>>
>> I'm still hard working on that patchset. :)
>>
>> Although the patchset is well worthy of testing, it is not good enough for btrfs upstream.
>>
>> While doing some tuning work on it, I realized that I could make this improvement without
>> the help of rwlock extent state stuff, so I made this smaller and cleaner patch for upstream
>> so that we could gain some performance here first.
>>
> 
> So do you want me to drop the rwlock stuff and take this instead?  Take a look
> at whats in btrfs-next and tell me what I should drop.  Thanks,
> 


Yes Josef, please take this patch instead and drop the following:

Btrfs: use radix tree for checksum
Btrfs: merge adjacent states as much as possible
Btrfs: use large extent range for read and its endio
Btrfs: apply rwlock for extent state

When I finish fixing all the noticed bugs on hands, I'll send a new rebased version to btrfs-next for you. :)

thanks,
liubo

> Josef
> 




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik July 11, 2012, 5:21 p.m. UTC | #5
On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
> While testing with my buffer read fio jobs[1], I find that btrfs does not
> perform well enough.
> 
> Here is a scenario in fio jobs:
> 
> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
> and all of them will race on add_to_page_cache_lru(), and if one thread
> successfully puts its page into the page cache, it takes the responsibility
> to read the page's data.
> 
> And what's more, reading a page needs a period of time to finish, in which
> other threads can slide in and process rest pages:
> 
>      t1          t2          t3          t4
>    add Page1
>    read Page1  add Page2
>      |         read Page2  add Page3
>      |            |        read Page3  add Page4
>      |            |           |        read Page4
> -----|------------|-----------|-----------|--------
>      v            v           v           v
>     bio          bio         bio         bio
> 
> Now we have four bios, each of which holds only one page since we need to
> maintain consecutive pages in bio.  Thus, we can end up with far more bios
> than we need.
> 
> Here we're going to
> a) delay the real read-page section and
> b) try to put more pages into page cache.
> 
> With that said, we can make each bio hold more pages and reduce the number
> of bios we need.
> 
> Here is some numbers taken from fio results:
>          w/o patch                 w patch
>        -------------  --------  ---------------
> READ:    745MB/s        +32%       987MB/s
> 
> [1]:
> [global]
> group_reporting
> thread
> numjobs=4
> bs=32k
> rw=read
> ioengine=sync
> directory=/mnt/btrfs/
> 
> [READ]
> filename=foobar
> size=2000M
> invalidate=1
> 
> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
> ---
>  fs/btrfs/extent_io.c |   37 +++++++++++++++++++++++++++++++++++--
>  1 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 01c21b6..8f9c18d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3549,6 +3549,11 @@ int extent_writepages(struct extent_io_tree *tree,
>  	return ret;
>  }
>  
> +struct pagelst {
> +	struct page *page;
> +	struct list_head lst;
> +};
> +
>  int extent_readpages(struct extent_io_tree *tree,
>  		     struct address_space *mapping,
>  		     struct list_head *pages, unsigned nr_pages,
> @@ -3557,19 +3562,47 @@ int extent_readpages(struct extent_io_tree *tree,
>  	struct bio *bio = NULL;
>  	unsigned page_idx;
>  	unsigned long bio_flags = 0;
> +	LIST_HEAD(page_pool);
> +	struct pagelst *pagelst = NULL;
>  
>  	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>  		struct page *page = list_entry(pages->prev, struct page, lru);
>  
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
> +
> +		if (!pagelst)
> +			pagelst = kmalloc(sizeof(*pagelst), GFP_NOFS);
> +
> +		if (!pagelst) {
> +			page_cache_release(page);
> +			continue;
> +		}

I'd rather not fail here if we can't make an allocation, since it's just a
optimization, just continue on like we normally would.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
liubo July 12, 2012, 1:14 a.m. UTC | #6
On 07/12/2012 01:21 AM, Josef Bacik wrote:

> On Tue, Jul 10, 2012 at 05:27:59AM -0600, Liu Bo wrote:
>> While testing with my buffer read fio jobs[1], I find that btrfs does not
>> perform well enough.
>>
>> Here is a scenario in fio jobs:
>>
>> We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
>> and all of them will race on add_to_page_cache_lru(), and if one thread
>> successfully puts its page into the page cache, it takes the responsibility
>> to read the page's data.
>>
>> And what's more, reading a page needs a period of time to finish, in which
>> other threads can slide in and process rest pages:
>>
>>      t1          t2          t3          t4
>>    add Page1
>>    read Page1  add Page2
>>      |         read Page2  add Page3
>>      |            |        read Page3  add Page4
>>      |            |           |        read Page4
>> -----|------------|-----------|-----------|--------
>>      v            v           v           v
>>     bio          bio         bio         bio
>>
>> Now we have four bios, each of which holds only one page since we need to
>> maintain consecutive pages in bio.  Thus, we can end up with far more bios
>> than we need.
>>
>> Here we're going to
>> a) delay the real read-page section and
>> b) try to put more pages into page cache.
>>
>> With that said, we can make each bio hold more pages and reduce the number
>> of bios we need.
>>
>> Here is some numbers taken from fio results:
>>          w/o patch                 w patch
>>        -------------  --------  ---------------
>> READ:    745MB/s        +32%       987MB/s
>>
>> [1]:
>> [global]
>> group_reporting
>> thread
>> numjobs=4
>> bs=32k
>> rw=read
>> ioengine=sync
>> directory=/mnt/btrfs/
>>
>> [READ]
>> filename=foobar
>> size=2000M
>> invalidate=1
>>
>> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
>> ---
>>  fs/btrfs/extent_io.c |   37 +++++++++++++++++++++++++++++++++++--
>>  1 files changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 01c21b6..8f9c18d 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3549,6 +3549,11 @@ int extent_writepages(struct extent_io_tree *tree,
>>  	return ret;
>>  }
>>  
>> +struct pagelst {
>> +	struct page *page;
>> +	struct list_head lst;
>> +};
>> +
>>  int extent_readpages(struct extent_io_tree *tree,
>>  		     struct address_space *mapping,
>>  		     struct list_head *pages, unsigned nr_pages,
>> @@ -3557,19 +3562,47 @@ int extent_readpages(struct extent_io_tree *tree,
>>  	struct bio *bio = NULL;
>>  	unsigned page_idx;
>>  	unsigned long bio_flags = 0;
>> +	LIST_HEAD(page_pool);
>> +	struct pagelst *pagelst = NULL;
>>  
>>  	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>>  		struct page *page = list_entry(pages->prev, struct page, lru);
>>  
>>  		prefetchw(&page->flags);
>>  		list_del(&page->lru);
>> +
>> +		if (!pagelst)
>> +			pagelst = kmalloc(sizeof(*pagelst), GFP_NOFS);
>> +
>> +		if (!pagelst) {
>> +			page_cache_release(page);
>> +			continue;
>> +		}
> 
> I'd rather not fail here if we can't make an allocation, since it's just a
> optimization, just continue on like we normally would.  Thanks,
> 


It makes sense, I'll update it.

thanks,
liubo

> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 01c21b6..8f9c18d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3549,6 +3549,11 @@  int extent_writepages(struct extent_io_tree *tree,
 	return ret;
 }
 
+struct pagelst {
+	struct page *page;
+	struct list_head lst;
+};
+
 int extent_readpages(struct extent_io_tree *tree,
 		     struct address_space *mapping,
 		     struct list_head *pages, unsigned nr_pages,
@@ -3557,19 +3562,47 @@  int extent_readpages(struct extent_io_tree *tree,
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	unsigned long bio_flags = 0;
+	LIST_HEAD(page_pool);
+	struct pagelst *pagelst = NULL;
 
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
+
+		if (!pagelst)
+			pagelst = kmalloc(sizeof(*pagelst), GFP_NOFS);
+
+		if (!pagelst) {
+			page_cache_release(page);
+			continue;
+		}
 		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_NOFS)) {
-			__extent_read_full_page(tree, page, get_extent,
-						&bio, 0, &bio_flags);
+			pagelst->page = page;
+			list_add(&pagelst->lst, &page_pool);
+			page_cache_get(page);
+			pagelst = NULL;
 		}
 		page_cache_release(page);
 	}
+
+	while (!list_empty(&page_pool)) {
+		struct page *page;
+
+		pagelst = list_entry(page_pool.prev, struct pagelst, lst);
+		page = pagelst->page;
+
+		prefetchw(&page->flags);
+		__extent_read_full_page(tree, page, get_extent,
+					&bio, 0, &bio_flags);
+
+		page_cache_release(page);
+		list_del(&pagelst->lst);
+		kfree(pagelst);
+	}
+	BUG_ON(!list_empty(&page_pool));
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		return submit_one_bio(READ, bio, 0, bio_flags);