
[V2,2/2] block/loop: allow request merge for directio mode

Message ID cafa738c4f61d9b549b7d4675903e8c61962480b.1503602376.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li Aug. 24, 2017, 7:24 p.m. UTC
From: Shaohua Li <shli@fb.com>

Currently loop disables merging. While that makes sense for buffered I/O mode,
direct I/O mode can benefit from request merging. Without merging, loop can
send small I/Os to the underlying disk and hurt performance.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/loop.c | 66 ++++++++++++++++++++++++++++++++++++++++------------
 drivers/block/loop.h |  1 +
 2 files changed, 52 insertions(+), 15 deletions(-)

Comments

Ming Lei Aug. 29, 2017, 9:56 a.m. UTC | #1
On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Currently loop disables merge. While it makes sense for buffer IO mode,
> directio mode can benefit from request merge. Without merge, loop could
> send small size IO to underlayer disk and harm performance.

Hi Shaohua,

IMO, no matter whether merging is used, loop always sends I/O page by page
to the VFS in both dio and buffered I/O modes.

Also, if merging is enabled on loop, merging runs on both loop and the
low-level block driver, and I am not sure we benefit from that.

So could you provide some performance data for this patch?

> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/block/loop.c | 66 ++++++++++++++++++++++++++++++++++++++++------------
>  drivers/block/loop.h |  1 +
>  2 files changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 428da07..75a8f6e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  	 */
>  	blk_mq_freeze_queue(lo->lo_queue);
>  	lo->use_dio = use_dio;
> -	if (use_dio)
> +	if (use_dio) {
> +		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>  		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> -	else
> +	} else {
> +		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>  		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> +	}
>  	blk_mq_unfreeze_queue(lo->lo_queue);
>  }
>  
> @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
>  {
>  	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>  
> +	kfree(cmd->bvec);
> +	cmd->bvec = NULL;
>  	cmd->ret = ret;
>  	blk_mq_complete_request(cmd->rq);
>  }
> @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  {
>  	struct iov_iter iter;
>  	struct bio_vec *bvec;
> -	struct bio *bio = cmd->rq->bio;
> +	struct request *rq = cmd->rq;
> +	struct bio *bio = rq->bio;
>  	struct file *file = lo->lo_backing_file;
> +	unsigned int offset;
> +	int segments = 0;
>  	int ret;
>  
> -	/* nomerge for loop request queue */
> -	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> +	if (rq->bio != rq->biotail) {
> +		struct req_iterator iter;
> +		struct bio_vec tmp;
> +
> +		__rq_for_each_bio(bio, rq)
> +			segments += bio_segments(bio);
> +		bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);

The allocation should have been GFP_NOIO.
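
For reference, this is the kind of change being suggested: the same allocation
as in the patch, but with GFP_NOIO so the allocation itself cannot recurse into
the I/O path while loop is servicing a request (a minimal sketch, not a tested
replacement):

	/* flatten the request's segments without risking I/O recursion */
	bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_NOIO);
	if (!bvec)
		return -EIO;
	cmd->bvec = bvec;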
Shaohua Li Aug. 29, 2017, 3:13 p.m. UTC | #2
On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Currently loop disables merge. While it makes sense for buffer IO mode,
> > directio mode can benefit from request merge. Without merge, loop could
> > send small size IO to underlayer disk and harm performance.
> 
> Hi Shaohua,
> 
> IMO no matter if merge is used, loop always sends page by page
> to VFS in both dio or buffer I/O.

Why do you think so?
 
> Also if merge is enabled on loop, that means merge is run
> on both loop and low level block driver, and not sure if we
> can benefit from that.

Why would merging still happen in the low-level block driver?

> 
> So Could you provide some performance data about this patch?

In my virtual machine, a workload improves from ~20M/s to ~50M/s, and I can
clearly see that the request sizes become bigger.

> > 
> > Reviewed-by: Omar Sandoval <osandov@fb.com>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/block/loop.c | 66 ++++++++++++++++++++++++++++++++++++++++------------
> >  drivers/block/loop.h |  1 +
> >  2 files changed, 52 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 428da07..75a8f6e 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
> >  	 */
> >  	blk_mq_freeze_queue(lo->lo_queue);
> >  	lo->use_dio = use_dio;
> > -	if (use_dio)
> > +	if (use_dio) {
> > +		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> >  		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> > -	else
> > +	} else {
> > +		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> >  		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> > +	}
> >  	blk_mq_unfreeze_queue(lo->lo_queue);
> >  }
> >  
> > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
> >  {
> >  	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
> >  
> > +	kfree(cmd->bvec);
> > +	cmd->bvec = NULL;
> >  	cmd->ret = ret;
> >  	blk_mq_complete_request(cmd->rq);
> >  }
> > @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
> >  {
> >  	struct iov_iter iter;
> >  	struct bio_vec *bvec;
> > -	struct bio *bio = cmd->rq->bio;
> > +	struct request *rq = cmd->rq;
> > +	struct bio *bio = rq->bio;
> >  	struct file *file = lo->lo_backing_file;
> > +	unsigned int offset;
> > +	int segments = 0;
> >  	int ret;
> >  
> > -	/* nomerge for loop request queue */
> > -	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> > +	if (rq->bio != rq->biotail) {
> > +		struct req_iterator iter;
> > +		struct bio_vec tmp;
> > +
> > +		__rq_for_each_bio(bio, rq)
> > +			segments += bio_segments(bio);
> > +		bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
> 
> The allocation should have been GFP_NOIO.

Sounds good. Making this completely correct isn't easy though, since we are
calling into the underlying filesystem operations, which can allocate memory
themselves.
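
One possible way to also cover allocations made deeper in the filesystem call
chain is the per-task noio context, memalloc_noio_save()/memalloc_noio_restore()
from <linux/sched/mm.h>. The placement below is only illustrative and is not
part of this patch:

	unsigned int noio_flags;

	/*
	 * Every allocation this task makes until the matching restore,
	 * including allocations inside the backing file's ->read_iter or
	 * ->write_iter, is implicitly treated as GFP_NOIO.
	 */
	noio_flags = memalloc_noio_save();
	if (rw == WRITE)
		ret = call_write_iter(file, &cmd->iocb, &iter);
	else
		ret = call_read_iter(file, &cmd->iocb, &iter);
	memalloc_noio_restore(noio_flags);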

Thanks,
Shaohua
Ming Lei Aug. 30, 2017, 2:51 a.m. UTC | #3
On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > From: Shaohua Li <shli@fb.com>
> > > 
> > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > directio mode can benefit from request merge. Without merge, loop could
> > > send small size IO to underlayer disk and harm performance.
> > 
> > Hi Shaohua,
> > 
> > IMO no matter if merge is used, loop always sends page by page
> > to VFS in both dio or buffer I/O.
> 
> Why do you think so?

do_blockdev_direct_IO() still handles pages one at a time from the iov_iter,
and with bigger requests I guess it might be plug merging at work.

>  
> > Also if merge is enabled on loop, that means merge is run
> > on both loop and low level block driver, and not sure if we
> > can benefit from that.
> 
> why does merge still happen in low level block driver?

Because the scheduler is still working on the low-level disk. My question is
why the scheduler on the low-level disk doesn't do this merging today, if the
scheduler on loop can.

> 
> > 
> > So Could you provide some performance data about this patch?
> 
> In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
> see the request size becomes bigger.

Could you share what the low-level disk is?
Shaohua Li Aug. 30, 2017, 4:43 a.m. UTC | #4
On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > From: Shaohua Li <shli@fb.com>
> > > > 
> > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > directio mode can benefit from request merge. Without merge, loop could
> > > > send small size IO to underlayer disk and harm performance.
> > > 
> > > Hi Shaohua,
> > > 
> > > IMO no matter if merge is used, loop always sends page by page
> > > to VFS in both dio or buffer I/O.
> > 
> > Why do you think so?
> 
> do_blockdev_direct_IO() still handles page by page from iov_iter, and
> with bigger request, I guess it might be the plug merge working.

This is not true. Direct I/O submits large bios directly; it is not because of
plug merging. Please at least check the code before you complain.

> >  
> > > Also if merge is enabled on loop, that means merge is run
> > > on both loop and low level block driver, and not sure if we
> > > can benefit from that.
> > 
> > why does merge still happen in low level block driver?
> 
> Because scheduler is still working on low level disk. My question
> is that why the scheduler in low level disk doesn't work now
> if scheduler on loop can merge?

The low-level disk can still do merging, but since this is direct I/O, the
upper layer already dispatches requests that are as big as possible. There is
very little chance the requests can be merged again.

> > 
> > > 
> > > So Could you provide some performance data about this patch?
> > 
> > In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
> > see the request size becomes bigger.
> 
> Could you share us what the low level disk is?

It's a SATA SSD.

Thanks,
Shaohua
Ming Lei Aug. 30, 2017, 6:43 a.m. UTC | #5
On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote:
> On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > > From: Shaohua Li <shli@fb.com>
> > > > > 
> > > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > > directio mode can benefit from request merge. Without merge, loop could
> > > > > send small size IO to underlayer disk and harm performance.
> > > > 
> > > > Hi Shaohua,
> > > > 
> > > > IMO no matter if merge is used, loop always sends page by page
> > > > to VFS in both dio or buffer I/O.
> > > 
> > > Why do you think so?
> > 
> > do_blockdev_direct_IO() still handles page by page from iov_iter, and
> > with bigger request, I guess it might be the plug merge working.
> 
> This is not true. directio sends big size bio directly, not because of plug
> merge. Please at least check the code before you complain.

I'm not complaining, just trying to understand the idea behind it. Never
mind. :-)

> 
> > >  
> > > > Also if merge is enabled on loop, that means merge is run
> > > > on both loop and low level block driver, and not sure if we
> > > > can benefit from that.
> > > 
> > > why does merge still happen in low level block driver?
> > 
> > Because scheduler is still working on low level disk. My question
> > is that why the scheduler in low level disk doesn't work now
> > if scheduler on loop can merge?
> 
> The low level disk can still do merge, but since this is directio, the upper
> layer already dispatches request as big as possible. There is very little
> chance the requests can be merged again.

That is true, but these requests still have to enter the scheduler queue and
be tried for merging again, even though it is unlikely to succeed. Merging
twice may cost extra CPU.

That still doesn't answer my question, though.

Without this patch, the requests dispatched to loop won't be merged, so they
may be small while their sectors are contiguous. My question is why the dio
bios converted from these small loop requests can't be merged in the block
layer when they are queued to the low-level device.


> 
> > > 
> > > > 
> > > > So Could you provide some performance data about this patch?
> > > 
> > > In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
> > > see the request size becomes bigger.
> > 
> > Could you share us what the low level disk is?
> 
> It's a SATA ssd.

For SATA, it is pretty easy to trigger I/O merging.
Shaohua Li Aug. 30, 2017, 10:06 p.m. UTC | #6
On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote:
> On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote:
> > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > > > From: Shaohua Li <shli@fb.com>
> > > > > > 
> > > > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > > > directio mode can benefit from request merge. Without merge, loop could
> > > > > > send small size IO to underlayer disk and harm performance.
> > > > > 
> > > > > Hi Shaohua,
> > > > > 
> > > > > IMO no matter if merge is used, loop always sends page by page
> > > > > to VFS in both dio or buffer I/O.
> > > > 
> > > > Why do you think so?
> > > 
> > > do_blockdev_direct_IO() still handles page by page from iov_iter, and
> > > with bigger request, I guess it might be the plug merge working.
> > 
> > This is not true. directio sends big size bio directly, not because of plug
> > merge. Please at least check the code before you complain.
> 
> I complain nothing, just try to understand the idea behind,
> never mind, :-)
> 
> > 
> > > >  
> > > > > Also if merge is enabled on loop, that means merge is run
> > > > > on both loop and low level block driver, and not sure if we
> > > > > can benefit from that.
> > > > 
> > > > why does merge still happen in low level block driver?
> > > 
> > > Because scheduler is still working on low level disk. My question
> > > is that why the scheduler in low level disk doesn't work now
> > > if scheduler on loop can merge?
> > 
> > The low level disk can still do merge, but since this is directio, the upper
> > layer already dispatches request as big as possible. There is very little
> > chance the requests can be merged again.
> 
> That is true, but these requests need to enter scheduler queue and
> be tried to merge again, even though it is less possible to succeed.
> Double merge may take extra CPU utilization.
> 
> Looks it doesn't answer my question.
> 
> Without this patch, the requests dispatched to loop won't be merged,
> so they may be small and their sectors may be continuous, my question
> is why dio bios converted from these small loop requests can't be
> merged in block layer when queuing these dio bios to low level device?

The loop thread doesn't have a plug there. And even if it did, it would still
be a bad idea to do the merging at the lower layer: if we run direct_IO for
every 4k, the overhead is much, much higher than that of bio merging, since
direct_IO calls into fs code, takes various mutexes, updates metadata for
writes, and so on.
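
For context, the plugging Ming alludes to would look roughly like the sketch
below, wrapped around the existing per-command handling in the loop worker
(purely illustrative; loop_handle_cmd() is the existing handler in loop.c):

	struct blk_plug plug;

	blk_start_plug(&plug);
	/*
	 * bios submitted while the plug is held are kept on a per-task list
	 * and can be merged before they reach the low-level device.
	 */
	loop_handle_cmd(cmd);
	blk_finish_plug(&plug);

Even with such a plug, each small request still pays the full cost of a
separate direct_IO call into the filesystem, which is the overhead that
merging at the loop queue avoids.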
Ming Lei Aug. 31, 2017, 3:25 a.m. UTC | #7
On Wed, Aug 30, 2017 at 03:06:16PM -0700, Shaohua Li wrote:
> On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote:
> > On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote:
> > > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> > > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > > > > From: Shaohua Li <shli@fb.com>
> > > > > > > 
> > > > > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > > > > directio mode can benefit from request merge. Without merge, loop could
> > > > > > > send small size IO to underlayer disk and harm performance.
> > > > > > 
> > > > > > Hi Shaohua,
> > > > > > 
> > > > > > IMO no matter if merge is used, loop always sends page by page
> > > > > > to VFS in both dio or buffer I/O.
> > > > > 
> > > > > Why do you think so?
> > > > 
> > > > do_blockdev_direct_IO() still handles page by page from iov_iter, and
> > > > with bigger request, I guess it might be the plug merge working.
> > > 
> > > This is not true. directio sends big size bio directly, not because of plug
> > > merge. Please at least check the code before you complain.
> > 
> > I complain nothing, just try to understand the idea behind,
> > never mind, :-)
> > 
> > > 
> > > > >  
> > > > > > Also if merge is enabled on loop, that means merge is run
> > > > > > on both loop and low level block driver, and not sure if we
> > > > > > can benefit from that.
> > > > > 
> > > > > why does merge still happen in low level block driver?
> > > > 
> > > > Because scheduler is still working on low level disk. My question
> > > > is that why the scheduler in low level disk doesn't work now
> > > > if scheduler on loop can merge?
> > > 
> > > The low level disk can still do merge, but since this is directio, the upper
> > > layer already dispatches request as big as possible. There is very little
> > > chance the requests can be merged again.
> > 
> > That is true, but these requests need to enter scheduler queue and
> > be tried to merge again, even though it is less possible to succeed.
> > Double merge may take extra CPU utilization.
> > 
> > Looks it doesn't answer my question.
> > 
> > Without this patch, the requests dispatched to loop won't be merged,
> > so they may be small and their sectors may be continuous, my question
> > is why dio bios converted from these small loop requests can't be
> > merged in block layer when queuing these dio bios to low level device?
> 
> loop thread doesn't have plug there. Even we have plug there, it's still a bad
> idea to do the merge in low level layer. If we run direct_IO for every 4k, the
> overhead is much much higher than bio merge. The direct_IO will call into fs
> code, take different mutexes, metadata update for write and so on.

OK, that makes sense now.

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 428da07..75a8f6e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -213,10 +213,13 @@  static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	blk_mq_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
-	if (use_dio)
+	if (use_dio) {
+		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	else
+	} else {
+		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+	}
 	blk_mq_unfreeze_queue(lo->lo_queue);
 }
 
@@ -464,6 +467,8 @@  static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
 	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
+	kfree(cmd->bvec);
+	cmd->bvec = NULL;
 	cmd->ret = ret;
 	blk_mq_complete_request(cmd->rq);
 }
@@ -473,22 +478,50 @@  static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 {
 	struct iov_iter iter;
 	struct bio_vec *bvec;
-	struct bio *bio = cmd->rq->bio;
+	struct request *rq = cmd->rq;
+	struct bio *bio = rq->bio;
 	struct file *file = lo->lo_backing_file;
+	unsigned int offset;
+	int segments = 0;
 	int ret;
 
-	/* nomerge for loop request queue */
-	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
+	if (rq->bio != rq->biotail) {
+		struct req_iterator iter;
+		struct bio_vec tmp;
+
+		__rq_for_each_bio(bio, rq)
+			segments += bio_segments(bio);
+		bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
+		if (!bvec)
+			return -EIO;
+		cmd->bvec = bvec;
+
+		/*
+		 * The bios of the request may be started from the middle of
+		 * the 'bvec' because of bio splitting, so we can't directly
+		 * copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+		 * API will take care of all details for us.
+		 */
+		rq_for_each_segment(tmp, rq, iter) {
+			*bvec = tmp;
+			bvec++;
+		}
+		bvec = cmd->bvec;
+		offset = 0;
+	} else {
+		/*
+		 * Same here, this bio may be started from the middle of the
+		 * 'bvec' because of bio splitting, so offset from the bvec
+		 * must be passed to iov iterator
+		 */
+		offset = bio->bi_iter.bi_bvec_done;
+		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+		segments = bio_segments(bio);
+	}
 
-	bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
 	iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
-		      bio_segments(bio), blk_rq_bytes(cmd->rq));
-	/*
-	 * This bio may be started from the middle of the 'bvec'
-	 * because of bio splitting, so offset from the bvec must
-	 * be passed to iov iterator
-	 */
-	iter.iov_offset = bio->bi_iter.bi_bvec_done;
+		      segments, blk_rq_bytes(rq));
+	iter.iov_offset = offset;
 
 	cmd->iocb.ki_pos = pos;
 	cmd->iocb.ki_filp = file;
@@ -1800,9 +1833,12 @@  static int loop_add(struct loop_device **l, int i)
 	lo->lo_queue->queuedata = lo;
 
 	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
+
 	/*
-	 * It doesn't make sense to enable merge because the I/O
-	 * submitted to backing file is handled page by page.
+	 * By default, we do buffer IO, so it doesn't make sense to enable
+	 * merge because the I/O submitted to backing file is handled page by
+	 * page. For directio mode, merge does help to dispatch bigger request
+	 * to underlayer disk. We will enable merge once directio is enabled.
 	 */
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fecd3f9..bc9cf91 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -72,6 +72,7 @@  struct loop_cmd {
 	bool use_aio;           /* use AIO interface to handle I/O */
 	long ret;
 	struct kiocb iocb;
+	struct bio_vec *bvec;
 };
 
 /* Support for loadable transfer modules */
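
For completeness, the direct I/O mode that this patch keys the merge behaviour
on is toggled from user space, either with losetup --direct-io=on or with the
LOOP_SET_DIRECT_IO ioctl. A minimal sketch (device path and error handling are
illustrative only):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>		/* LOOP_SET_DIRECT_IO */

	int main(void)
	{
		int fd = open("/dev/loop0", O_RDWR);

		if (fd < 0)
			return 1;
		/*
		 * 1 enables direct I/O against the backing file; with this
		 * patch the loop queue then also allows request merging.
		 */
		ioctl(fd, LOOP_SET_DIRECT_IO, 1);
		close(fd);
		return 0;
	}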