diff mbox series

RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

Message ID 20190814103244.92518-1-maco@android.com (mailing list archive)
State New, archived
Headers show
Series RFC: loop: Avoid calling blk_mq_freeze_queue() when possible. | expand

Commit Message

Martijn Coenen Aug. 14, 2019, 10:32 a.m. UTC
Since Android Q, the creation and configuration of loop devices is in
the critical path of device boot. We found that the configuration of
loop devices is pretty slow, because many ioctl()'s involve freezing the
block queue, which in turn needs to wait for an RCU grace period. On
Android devices we've observed up to 60ms for the creation and
configuration of a single loop device; as we anticipate creating many
more in the future, we'd like to avoid this delay.

This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has
been bound; since the block queue is not running at that point, we can
avoid the expensive freezing of the queue.

On a recent x86, this patch yields the following results:

===
Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound
===
~# time ./set_block_size

real 0m0.002s
user 0m0.000s
sys  0m0.002s

===
Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound
===
~# losetup /dev/loop0 fs.img
~# time ./set_block_size

real 0m0.008s
user 0m0.000s
sys  0m0.002s

Over many runs, this is a 4x improvement.

This is RFC because technically it is a change in behavior; before,
calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and
userspace programs that left it in their code despite the returned
error, would now suddenly see the requested value effectuated. I'm not
sure whether this is acceptable.

An alternative might be a CONFIG option to set the default block size to
another value than 512. Another alternative I considered is allowing the
block device to be created with a "frozen" queue, where we can manually
unfreeze the queue when all the configuration is done. This would be a
much larger code change, though.

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/block/loop.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

Comments

Ming Lei Aug. 14, 2019, 11:33 a.m. UTC | #1
On Wed, Aug 14, 2019 at 12:32:44PM +0200, Martijn Coenen wrote:
> Since Android Q, the creation and configuration of loop devices is in
> the critical path of device boot. We found that the configuration of
> loop devices is pretty slow, because many ioctl()'s involve freezing the
> block queue, which in turn needs to wait for an RCU grace period. On
> Android devices we've observed up to 60ms for the creation and
> configuration of a single loop device; as we anticipate creating many
> more in the future, we'd like to avoid this delay.
> 

Another candidate is to not switch to q_usage_counter's percpu mode
until loop becomes Lo_bound, and this way may be more clean.

Something like the following patch:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a7461f482467..8791f9242583 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	 */
 	bdgrab(bdev);
 	mutex_unlock(&loop_ctl_mutex);
+
+	percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
+
 	if (partscan)
 		loop_reread_partitions(lo, bdev);
 	if (claimed_bdev)
@@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&loop_ctl_mutex);
 
+	percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
+
 	/*
 	 * Need not hold loop_ctl_mutex to fput backing file.
 	 * Calling fput holding loop_ctl_mutex triggers a circular
@@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
 	}
 	lo->lo_queue->queuedata = lo;
 
+	/*
+	 * cheat block layer for not switching to q_usage_counter's
+	 * percpu mode before loop becomes Lo_bound
+	 */
+	blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
+
 	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
 
 	/*


thanks,
Ming
Martijn Coenen Aug. 14, 2019, 11:38 a.m. UTC | #2
On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> Another candidate is to not switch to q_usage_counter's percpu mode
> until loop becomes Lo_bound, and this way may be more clean.

Thanks! I had considered this too, but thought it a bit risky to mess
with the init flag from the loop driver. Maybe we could delay
switching q_usage_counter to per-cpu mode in the block layer in
general, until the first request comes in?

This would also address our issues, and potentially be an even smaller change.

Martijn
>
> Something like the following patch:
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index a7461f482467..8791f9242583 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>          */
>         bdgrab(bdev);
>         mutex_unlock(&loop_ctl_mutex);
> +
> +       percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> +
>         if (partscan)
>                 loop_reread_partitions(lo, bdev);
>         if (claimed_bdev)
> @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>         lo->lo_state = Lo_unbound;
>         mutex_unlock(&loop_ctl_mutex);
>
> +       percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> +
>         /*
>          * Need not hold loop_ctl_mutex to fput backing file.
>          * Calling fput holding loop_ctl_mutex triggers a circular
> @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
>         }
>         lo->lo_queue->queuedata = lo;
>
> +       /*
> +        * cheat block layer for not switching to q_usage_counter's
> +        * percpu mode before loop becomes Lo_bound
> +        */
> +       blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> +
>         blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>
>         /*
>
>
> thanks,
> Ming
Ming Lei Aug. 14, 2019, 11:46 a.m. UTC | #3
On Wed, Aug 14, 2019 at 01:38:25PM +0200, Martijn Coenen wrote:
> On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> > Another candidate is to not switch to q_usage_counter's percpu mode
> > until loop becomes Lo_bound, and this way may be more clean.
> 
> Thanks! I had considered this too, but thought it a bit risky to mess
> with the init flag from the loop driver. Maybe we could delay

blk_queue_init_done() is only called in blk_queue_init_done() for
this purpose, so this approach should be fine, IMO.

> switching q_usage_counter to per-cpu mode in the block layer in
> general, until the first request comes in?

This approach may not help your issue on loop, IO request comes
just after disk is added, such as by systemd, or reading partition table.

However, loop only starts to handle IO really after it becomes 'Lo_bound'.

thanks, 
Ming
Bart Van Assche Aug. 14, 2019, 3:29 p.m. UTC | #4
On 8/14/19 3:32 AM, Martijn Coenen wrote:
> Since Android Q, the creation and configuration of loop devices is in
> the critical path of device boot. We found that the configuration of
> loop devices is pretty slow, because many ioctl()'s involve freezing the
> block queue, which in turn needs to wait for an RCU grace period. On
> Android devices we've observed up to 60ms for the creation and
> configuration of a single loop device; as we anticipate creating many
> more in the future, we'd like to avoid this delay.
> 
> This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has
> been bound; since the block queue is not running at that point, we can
> avoid the expensive freezing of the queue.
> 
> On a recent x86, this patch yields the following results:
> 
> ===
> Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound
> ===
> ~# time ./set_block_size
> 
> real 0m0.002s
> user 0m0.000s
> sys  0m0.002s
> 
> ===
> Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound
> ===
> ~# losetup /dev/loop0 fs.img
> ~# time ./set_block_size
> 
> real 0m0.008s
> user 0m0.000s
> sys  0m0.002s
> 
> Over many runs, this is a 4x improvement.
> 
> This is RFC because technically it is a change in behavior; before,
> calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and
> userspace programs that left it in their code despite the returned
> error, would now suddenly see the requested value effectuated. I'm not
> sure whether this is acceptable.
> 
> An alternative might be a CONFIG option to set the default block size to
> another value than 512. Another alternative I considered is allowing the
> block device to be created with a "frozen" queue, where we can manually
> unfreeze the queue when all the configuration is done. This would be a
> much larger code change, though.

Hi Martijn,

Is the loop driver used in Android Q to make a file on a filesystem 
visible as a block device or rather to make a subset of a block device 
visible as a block device? In the latter case, have you considered to 
use the dm-linear driver instead? I expect that the overhead per I/O of 
dm-linear will be lower than that of the loop driver.

Bart.
Martijn Coenen Aug. 15, 2019, 2:49 p.m. UTC | #5
On Wed, Aug 14, 2019 at 4:29 PM Bart Van Assche <bvanassche@acm.org> wrote:
> Hi Martijn,
>
> Is the loop driver used in Android Q to make a file on a filesystem
> visible as a block device or rather to make a subset of a block device
> visible as a block device? In the latter case, have you considered to
> use the dm-linear driver instead? I expect that the overhead per I/O of
> dm-linear will be lower than that of the loop driver.

Hi Bart,

In this case we're using the loop driver to make a file on the
filesystem visible as a block device (in the file is a filesystem we
want to mount), so unfortunately dm-linear is not an option.

Best,
Martijn

>
> Bart.
Martijn Coenen Aug. 15, 2019, 2:57 p.m. UTC | #6
On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> blk_queue_init_done() is only called in blk_queue_init_done() for
> this purpose, so this approach should be fine, IMO.

I was thinking somebody might add more stuff to "init" in the future,
and then that new stuff would now no longer be executed for the loop
driver. The name "init" is pretty generic...but if that's not a
concern I'm happy with your proposal as well. There's one more
"freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64),
and there's a freeze in there because lo->transfer is modified. That
makes sense, but I was hoping we can make that freeze conditional on
whether lo->transfer would actually change value; if it stays the
same, I think freezing is not necessary.

>
> > switching q_usage_counter to per-cpu mode in the block layer in
> > general, until the first request comes in?
>
> This approach may not help your issue on loop, IO request comes
> just after disk is added, such as by systemd, or reading partition table.

That's a good point, I thought we could avoid the partition scan, but
on some Android devices we'll have max_part set, and there will be a
partition scan.

Thanks,
Martijn

>
> However, loop only starts to handle IO really after it becomes 'Lo_bound'.
>
> thanks,
> Ming
Ming Lei Aug. 15, 2019, 4:34 p.m. UTC | #7
On Thu, Aug 15, 2019 at 11:38 PM Martijn Coenen <maco@android.com> wrote:
>
> On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> > blk_queue_init_done() is only called in blk_queue_init_done() for
> > this purpose, so this approach should be fine, IMO.
>
> I was thinking somebody might add more stuff to "init" in the future,
> and then that new stuff would now no longer be executed for the loop
> driver. The name "init" is pretty generic...but if that's not a
> concern I'm happy with your proposal as well. There's one more
> "freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64),
> and there's a freeze in there because lo->transfer is modified. That
> makes sense, but I was hoping we can make that freeze conditional on
> whether lo->transfer would actually change value; if it stays the
> same, I think freezing is not necessary.

The queue freeze in SET_STATUS may not be avoided, not only
.transfer, there are also .lo_offset, .size, filename, dio and others.

If nothing will change, why does the userspace bother to send
SET_STATUS?


Thanks,
Ming Lei
Martijn Coenen Aug. 15, 2019, 7:08 p.m. UTC | #8
On Thu, Aug 15, 2019 at 6:34 PM Ming Lei <tom.leiming@gmail.com> wrote:
> If nothing will change, why does the userspace bother to send
> SET_STATUS?

We don't change transfer, but we do change the offset and sizelimit.
In our specific case, we know there won't be any I/O from userspace at
this point; so from that point of view the freeze wouldn't be
necessary. But I'm not sure how we can make loop aware of that in a
safe way. Ideally we'd just have a way of completely configuring a
loop device before starting the block request queue, but that seems
like a pretty big change.

Martijn

>
>
> Thanks,
> Ming Lei
Martijn Coenen Aug. 19, 2019, 9:06 a.m. UTC | #9
On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> Another candidate is to not switch to q_usage_counter's percpu mode
> until loop becomes Lo_bound, and this way may be more clean.

I was thinking about this more; our current sequence is:

1) LOOP_SET_FD
2) LOOP_SET_STATUS64 // for lo_offset/lo_sizelimit
3) LOOP_SET_BLOCK_SIZE // to make sure block size is correct for DIO
4) LOOP_SET_DIRECT_IO // to enable DIO

I noticed that LOOP_SET_FD already tries to configure DIO if the
underlying file is opened with O_DIRECT; but in this case, it will
still fail, because the logical block size of the loop queue must
exceed the logical I/O size of the backing device. But the default
logical block size of mq is 512.

One idea to fix is to call blk_queue_logical_block_size() as part of
LOOP_SET_FD, to match the block size of the backing fs in case the
backing file is opened with O_DIRECT; you could argue that if the
backing file is opened with O_DIRECT, this is what the user wanted
anyway. This would allow us to get rid of the latter two ioctl's and
already save quite some time.

Thanks,
Martijn

>
> Something like the following patch:
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index a7461f482467..8791f9242583 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>          */
>         bdgrab(bdev);
>         mutex_unlock(&loop_ctl_mutex);
> +
> +       percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> +
>         if (partscan)
>                 loop_reread_partitions(lo, bdev);
>         if (claimed_bdev)
> @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>         lo->lo_state = Lo_unbound;
>         mutex_unlock(&loop_ctl_mutex);
>
> +       percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> +
>         /*
>          * Need not hold loop_ctl_mutex to fput backing file.
>          * Calling fput holding loop_ctl_mutex triggers a circular
> @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
>         }
>         lo->lo_queue->queuedata = lo;
>
> +       /*
> +        * cheat block layer for not switching to q_usage_counter's
> +        * percpu mode before loop becomes Lo_bound
> +        */
> +       blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> +
>         blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>
>         /*
>
>
> thanks,
> Ming
Martijn Coenen Aug. 19, 2019, 9:26 a.m. UTC | #10
On Mon, Aug 19, 2019 at 11:06 AM Martijn Coenen <maco@android.com> wrote:
> One idea to fix is to call blk_queue_logical_block_size() as part of
> LOOP_SET_FD, to match the block size of the backing fs in case the
> backing file is opened with O_DIRECT; you could argue that if the
> backing file is opened with O_DIRECT, this is what the user wanted
> anyway. This would allow us to get rid of the latter two ioctl's and
> already save quite some time.

Basically:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ab7ca5989097a..ad3db72fbd729 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -994,6 +994,12 @@ static int loop_set_fd(struct loop_device *lo,
fmode_t mode,
        if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
                blk_queue_write_cache(lo->lo_queue, true, false);

+       if(io_is_direct(lo->lo_backing_file) && inode->i_sb->s_bdev) {
+               /* In case of direct I/O, match underlying block size */
+               blk_queue_logical_block_size(lo->lo_queue,
+                               bdev_logical_block_size(inode->i_sb->s_bdev));
+       }
+
        loop_update_rotational(lo);
        loop_update_dio(lo);

>
> Thanks,
> Martijn
>
> >
> > Something like the following patch:
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index a7461f482467..8791f9242583 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> >          */
> >         bdgrab(bdev);
> >         mutex_unlock(&loop_ctl_mutex);
> > +
> > +       percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> > +
> >         if (partscan)
> >                 loop_reread_partitions(lo, bdev);
> >         if (claimed_bdev)
> > @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> >         lo->lo_state = Lo_unbound;
> >         mutex_unlock(&loop_ctl_mutex);
> >
> > +       percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> > +
> >         /*
> >          * Need not hold loop_ctl_mutex to fput backing file.
> >          * Calling fput holding loop_ctl_mutex triggers a circular
> > @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
> >         }
> >         lo->lo_queue->queuedata = lo;
> >
> > +       /*
> > +        * cheat block layer for not switching to q_usage_counter's
> > +        * percpu mode before loop becomes Lo_bound
> > +        */
> > +       blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> > +
> >         blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
> >
> >         /*
> >
> >
> > thanks,
> > Ming
diff mbox series

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ab7ca5989097a..d4348a4fdd7a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -214,7 +214,8 @@  static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
 	 * will get updated by ioctl(LOOP_GET_STATUS)
 	 */
-	blk_mq_freeze_queue(lo->lo_queue);
+	if (lo->lo_state == Lo_bound)
+		blk_mq_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
 	if (use_dio) {
 		blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, lo->lo_queue);
@@ -223,7 +224,8 @@  static void __loop_update_dio(struct loop_device *lo, bool dio)
 		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
 	}
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (lo->lo_state == Lo_bound)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 }
 
 static int
@@ -621,6 +623,8 @@  static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 
 static inline void loop_update_dio(struct loop_device *lo)
 {
+	if (lo->lo_state != Lo_bound)
+		return;
 	__loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
 			lo->use_dio);
 }
@@ -1510,27 +1514,26 @@  static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 {
 	int err = 0;
 
-	if (lo->lo_state != Lo_bound)
-		return -ENXIO;
-
 	if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
 		return -EINVAL;
 
-	if (lo->lo_queue->limits.logical_block_size != arg) {
-		sync_blockdev(lo->lo_device);
-		kill_bdev(lo->lo_device);
-	}
+	if (lo->lo_state == Lo_bound) {
+		if (lo->lo_queue->limits.logical_block_size != arg) {
+			sync_blockdev(lo->lo_device);
+			kill_bdev(lo->lo_device);
+		}
 
-	blk_mq_freeze_queue(lo->lo_queue);
+		blk_mq_freeze_queue(lo->lo_queue);
 
-	/* kill_bdev should have truncated all the pages */
-	if (lo->lo_queue->limits.logical_block_size != arg &&
-			lo->lo_device->bd_inode->i_mapping->nrpages) {
-		err = -EAGAIN;
-		pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n",
-			__func__, lo->lo_number, lo->lo_file_name,
-			lo->lo_device->bd_inode->i_mapping->nrpages);
-		goto out_unfreeze;
+		/* kill_bdev should have truncated all the pages */
+		if (lo->lo_queue->limits.logical_block_size != arg &&
+				lo->lo_device->bd_inode->i_mapping->nrpages) {
+			err = -EAGAIN;
+			pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n",
+				__func__, lo->lo_number, lo->lo_file_name,
+				lo->lo_device->bd_inode->i_mapping->nrpages);
+			goto out_unfreeze;
+		}
 	}
 
 	blk_queue_logical_block_size(lo->lo_queue, arg);
@@ -1538,7 +1541,8 @@  static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 	blk_queue_io_min(lo->lo_queue, arg);
 	loop_update_dio(lo);
 out_unfreeze:
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	if (lo->lo_state == Lo_bound)
+		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	return err;
 }