Message ID | 20190814103244.92518-1-maco@android.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | RFC: loop: Avoid calling blk_mq_freeze_queue() when possible. | expand |
On Wed, Aug 14, 2019 at 12:32:44PM +0200, Martijn Coenen wrote: > Since Android Q, the creation and configuration of loop devices is in > the critical path of device boot. We found that the configuration of > loop devices is pretty slow, because many ioctl()'s involve freezing the > block queue, which in turn needs to wait for an RCU grace period. On > Android devices we've observed up to 60ms for the creation and > configuration of a single loop device; as we anticipate creating many > more in the future, we'd like to avoid this delay. > Another candidate is to not switch to q_usage_counter's percpu mode until loop becomes Lo_bound, and this way may be more clean. Something like the following patch: diff --git a/drivers/block/loop.c b/drivers/block/loop.c index a7461f482467..8791f9242583 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, */ bdgrab(bdev); mutex_unlock(&loop_ctl_mutex); + + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter); + if (partscan) loop_reread_partitions(lo, bdev); if (claimed_bdev) @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) lo->lo_state = Lo_unbound; mutex_unlock(&loop_ctl_mutex); + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL); + /* * Need not hold loop_ctl_mutex to fput backing file. * Calling fput holding loop_ctl_mutex triggers a circular @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i) } lo->lo_queue->queuedata = lo; + /* + * cheat block layer for not switching to q_usage_counter's + * percpu mode before loop becomes Lo_bound + */ + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue); + blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS); /* thanks, Ming
On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote: > Another candidate is to not switch to q_usage_counter's percpu mode > until loop becomes Lo_bound, and this way may be more clean. Thanks! I had considered this too, but thought it a bit risky to mess with the init flag from the loop driver. Maybe we could delay switching q_usage_counter to per-cpu mode in the block layer in general, until the first request comes in? This would also address our issues, and potentially be an even smaller change. Martijn > > Something like the following patch: > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index a7461f482467..8791f9242583 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, > */ > bdgrab(bdev); > mutex_unlock(&loop_ctl_mutex); > + > + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter); > + > if (partscan) > loop_reread_partitions(lo, bdev); > if (claimed_bdev) > @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > lo->lo_state = Lo_unbound; > mutex_unlock(&loop_ctl_mutex); > > + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL); > + > /* > * Need not hold loop_ctl_mutex to fput backing file. > * Calling fput holding loop_ctl_mutex triggers a circular > @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i) > } > lo->lo_queue->queuedata = lo; > > + /* > + * cheat block layer for not switching to q_usage_counter's > + * percpu mode before loop becomes Lo_bound > + */ > + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue); > + > blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS); > > /* > > > thanks, > Ming
On Wed, Aug 14, 2019 at 01:38:25PM +0200, Martijn Coenen wrote: > On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote: > > Another candidate is to not switch to q_usage_counter's percpu mode > > until loop becomes Lo_bound, and this way may be more clean. > > Thanks! I had considered this too, but thought it a bit risky to mess > with the init flag from the loop driver. Maybe we could delay blk_queue_init_done() is only called in blk_queue_init_done() for this purpose, so this approach should be fine, IMO. > switching q_usage_counter to per-cpu mode in the block layer in > general, until the first request comes in? This approach may not help your issue on loop, IO request comes just after disk is added, such as by systemd, or reading partition table. However, loop only starts to handle IO really after it becomes 'Lo_bound'. thanks, Ming
On 8/14/19 3:32 AM, Martijn Coenen wrote: > Since Android Q, the creation and configuration of loop devices is in > the critical path of device boot. We found that the configuration of > loop devices is pretty slow, because many ioctl()'s involve freezing the > block queue, which in turn needs to wait for an RCU grace period. On > Android devices we've observed up to 60ms for the creation and > configuration of a single loop device; as we anticipate creating many > more in the future, we'd like to avoid this delay. > > This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has > been bound; since the block queue is not running at that point, we can > avoid the expensive freezing of the queue. > > On a recent x86, this patch yields the following results: > > === > Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound > === > ~# time ./set_block_size > > real 0m0.002s > user 0m0.000s > sys 0m0.002s > > === > Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound > === > ~# losetup /dev/loop0 fs.img > ~# time ./set_block_size > > real 0m0.008s > user 0m0.000s > sys 0m0.002s > > Over many runs, this is a 4x improvement. > > This is RFC because technically it is a change in behavior; before, > calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and > userspace programs that left it in their code despite the returned > error, would now suddenly see the requested value effectuated. I'm not > sure whether this is acceptable. > > An alternative might be a CONFIG option to set the default block size to > another value than 512. Another alternative I considered is allowing the > block device to be created with a "frozen" queue, where we can manually > unfreeze the queue when all the configuration is done. This would be a > much larger code change, though. Hi Martijn, Is the loop driver used in Android Q to make a file on a filesystem visible as a block device or rather to make a subset of a block device visible as a block device? In the latter case, have you considered to use the dm-linear driver instead? I expect that the overhead per I/O of dm-linear will be lower than that of the loop driver. Bart.
On Wed, Aug 14, 2019 at 4:29 PM Bart Van Assche <bvanassche@acm.org> wrote: > Hi Martijn, > > Is the loop driver used in Android Q to make a file on a filesystem > visible as a block device or rather to make a subset of a block device > visible as a block device? In the latter case, have you considered to > use the dm-linear driver instead? I expect that the overhead per I/O of > dm-linear will be lower than that of the loop driver. Hi Bart, In this case we're using the loop driver to make a file on the filesystem visible as a block device (in the file is a filesystem we want to mount), so unfortunately dm-linear is not an option. Best, Martijn > > Bart.
On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <ming.lei@redhat.com> wrote: > blk_queue_init_done() is only called in blk_queue_init_done() for > this purpose, so this approach should be fine, IMO. I was thinking somebody might add more stuff to "init" in the future, and then that new stuff would now no longer be executed for the loop driver. The name "init" is pretty generic...but if that's not a concern I'm happy with your proposal as well. There's one more "freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64), and there's a freeze in there because lo->transfer is modified. That makes sense, but I was hoping we can make that freeze conditional on whether lo->transfer would actually change value; if it stays the same, I think freezing is not necessary. > > > switching q_usage_counter to per-cpu mode in the block layer in > > general, until the first request comes in? > > This approach may not help your issue on loop, IO request comes > just after disk is added, such as by systemd, or reading partition table. That's a good point, I thought we could avoid the partition scan, but on some Android devices we'll have max_part set, and there will be a partition scan. Thanks, Martijn > > However, loop only starts to handle IO really after it becomes 'Lo_bound'. > > thanks, > Ming
On Thu, Aug 15, 2019 at 11:38 PM Martijn Coenen <maco@android.com> wrote: > > On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <ming.lei@redhat.com> wrote: > > blk_queue_init_done() is only called in blk_queue_init_done() for > > this purpose, so this approach should be fine, IMO. > > I was thinking somebody might add more stuff to "init" in the future, > and then that new stuff would now no longer be executed for the loop > driver. The name "init" is pretty generic...but if that's not a > concern I'm happy with your proposal as well. There's one more > "freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64), > and there's a freeze in there because lo->transfer is modified. That > makes sense, but I was hoping we can make that freeze conditional on > whether lo->transfer would actually change value; if it stays the > same, I think freezing is not necessary. The queue freeze in SET_STATUS may not be avoided, not only .transfer, there are also .lo_offset, .size, filename, dio and others. If nothing will change, why does the userspace bother to send SET_STATUS? Thanks, Ming Lei
On Thu, Aug 15, 2019 at 6:34 PM Ming Lei <tom.leiming@gmail.com> wrote: > If nothing will change, why does the userspace bother to send > SET_STATUS? We don't change transfer, but we do change the offset and sizelimit. In our specific case, we know there won't be any I/O from userspace at this point; so from that point of view the freeze wouldn't be necessary. But I'm not sure how we can make loop aware of that in a safe way. Ideally we'd just have a way of completely configuring a loop device before starting the block request queue, but that seems like a pretty big change. Martijn > > > Thanks, > Ming Lei
On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <ming.lei@redhat.com> wrote: > Another candidate is to not switch to q_usage_counter's percpu mode > until loop becomes Lo_bound, and this way may be more clean. I was thinking about this more; our current sequence is: 1) LOOP_SET_FD 2) LOOP_SET_STATUS64 // for lo_offset/lo_sizelimit 3) LOOP_SET_BLOCK_SIZE // to make sure block size is correct for DIO 4) LOOP_SET_DIRECT_IO // to enable DIO I noticed that LOOP_SET_FD already tries to configure DIO if the underlying file is opened with O_DIRECT; but in this case, it will still fail, because the logical block size of the loop queue must exceed the logical I/O size of the backing device. But the default logical block size of mq is 512. One idea to fix is to call blk_queue_logical_block_size() as part of LOOP_SET_FD, to match the block size of the backing fs in case the backing file is opened with O_DIRECT; you could argue that if the backing file is opened with O_DIRECT, this is what the user wanted anyway. This would allow us to get rid of the latter two ioctl's and already save quite some time. Thanks, Martijn > > Something like the following patch: > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index a7461f482467..8791f9242583 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, > */ > bdgrab(bdev); > mutex_unlock(&loop_ctl_mutex); > + > + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter); > + > if (partscan) > loop_reread_partitions(lo, bdev); > if (claimed_bdev) > @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > lo->lo_state = Lo_unbound; > mutex_unlock(&loop_ctl_mutex); > > + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL); > + > /* > * Need not hold loop_ctl_mutex to fput backing file. > * Calling fput holding loop_ctl_mutex triggers a circular > @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i) > } > lo->lo_queue->queuedata = lo; > > + /* > + * cheat block layer for not switching to q_usage_counter's > + * percpu mode before loop becomes Lo_bound > + */ > + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue); > + > blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS); > > /* > > > thanks, > Ming
On Mon, Aug 19, 2019 at 11:06 AM Martijn Coenen <maco@android.com> wrote: > One idea to fix is to call blk_queue_logical_block_size() as part of > LOOP_SET_FD, to match the block size of the backing fs in case the > backing file is opened with O_DIRECT; you could argue that if the > backing file is opened with O_DIRECT, this is what the user wanted > anyway. This would allow us to get rid of the latter two ioctl's and > already save quite some time. Basically: diff --git a/drivers/block/loop.c b/drivers/block/loop.c index ab7ca5989097a..ad3db72fbd729 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -994,6 +994,12 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) blk_queue_write_cache(lo->lo_queue, true, false); + if(io_is_direct(lo->lo_backing_file) && inode->i_sb->s_bdev) { + /* In case of direct I/O, match underlying block size */ + blk_queue_logical_block_size(lo->lo_queue, + bdev_logical_block_size(inode->i_sb->s_bdev)); + } + loop_update_rotational(lo); loop_update_dio(lo); > > Thanks, > Martijn > > > > > Something like the following patch: > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > > index a7461f482467..8791f9242583 100644 > > --- a/drivers/block/loop.c > > +++ b/drivers/block/loop.c > > @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, > > */ > > bdgrab(bdev); > > mutex_unlock(&loop_ctl_mutex); > > + > > + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter); > > + > > if (partscan) > > loop_reread_partitions(lo, bdev); > > if (claimed_bdev) > > @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > > lo->lo_state = Lo_unbound; > > mutex_unlock(&loop_ctl_mutex); > > > > + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL); > > + > > /* > > * Need not hold loop_ctl_mutex to fput backing file. > > * Calling fput holding loop_ctl_mutex triggers a circular > > @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i) > > } > > lo->lo_queue->queuedata = lo; > > > > + /* > > + * cheat block layer for not switching to q_usage_counter's > > + * percpu mode before loop becomes Lo_bound > > + */ > > + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue); > > + > > blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS); > > > > /* > > > > > > thanks, > > Ming
diff --git a/drivers/block/loop.c b/drivers/block/loop.c index ab7ca5989097a..d4348a4fdd7a6 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -214,7 +214,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio) * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup * will get updated by ioctl(LOOP_GET_STATUS) */ - blk_mq_freeze_queue(lo->lo_queue); + if (lo->lo_state == Lo_bound) + blk_mq_freeze_queue(lo->lo_queue); lo->use_dio = use_dio; if (use_dio) { blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, lo->lo_queue); @@ -223,7 +224,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio) blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue); lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; } - blk_mq_unfreeze_queue(lo->lo_queue); + if (lo->lo_state == Lo_bound) + blk_mq_unfreeze_queue(lo->lo_queue); } static int @@ -621,6 +623,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) static inline void loop_update_dio(struct loop_device *lo) { + if (lo->lo_state != Lo_bound) + return; __loop_update_dio(lo, io_is_direct(lo->lo_backing_file) | lo->use_dio); } @@ -1510,27 +1514,26 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg) { int err = 0; - if (lo->lo_state != Lo_bound) - return -ENXIO; - if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg)) return -EINVAL; - if (lo->lo_queue->limits.logical_block_size != arg) { - sync_blockdev(lo->lo_device); - kill_bdev(lo->lo_device); - } + if (lo->lo_state == Lo_bound) { + if (lo->lo_queue->limits.logical_block_size != arg) { + sync_blockdev(lo->lo_device); + kill_bdev(lo->lo_device); + } - blk_mq_freeze_queue(lo->lo_queue); + blk_mq_freeze_queue(lo->lo_queue); - /* kill_bdev should have truncated all the pages */ - if (lo->lo_queue->limits.logical_block_size != arg && - lo->lo_device->bd_inode->i_mapping->nrpages) { - err = -EAGAIN; - pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n", - __func__, lo->lo_number, lo->lo_file_name, - lo->lo_device->bd_inode->i_mapping->nrpages); - goto out_unfreeze; + /* kill_bdev should have truncated all the pages */ + if (lo->lo_queue->limits.logical_block_size != arg && + lo->lo_device->bd_inode->i_mapping->nrpages) { + err = -EAGAIN; + pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n", + __func__, lo->lo_number, lo->lo_file_name, + lo->lo_device->bd_inode->i_mapping->nrpages); + goto out_unfreeze; + } } blk_queue_logical_block_size(lo->lo_queue, arg); @@ -1538,7 +1541,8 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg) blk_queue_io_min(lo->lo_queue, arg); loop_update_dio(lo); out_unfreeze: - blk_mq_unfreeze_queue(lo->lo_queue); + if (lo->lo_state == Lo_bound) + blk_mq_unfreeze_queue(lo->lo_queue); return err; }
Since Android Q, the creation and configuration of loop devices is in the critical path of device boot. We found that the configuration of loop devices is pretty slow, because many ioctl()'s involve freezing the block queue, which in turn needs to wait for an RCU grace period. On Android devices we've observed up to 60ms for the creation and configuration of a single loop device; as we anticipate creating many more in the future, we'd like to avoid this delay. This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has been bound; since the block queue is not running at that point, we can avoid the expensive freezing of the queue. On a recent x86, this patch yields the following results: === Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound === ~# time ./set_block_size real 0m0.002s user 0m0.000s sys 0m0.002s === Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound === ~# losetup /dev/loop0 fs.img ~# time ./set_block_size real 0m0.008s user 0m0.000s sys 0m0.002s Over many runs, this is a 4x improvement. This is RFC because technically it is a change in behavior; before, calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and userspace programs that left it in their code despite the returned error, would now suddenly see the requested value effectuated. I'm not sure whether this is acceptable. An alternative might be a CONFIG option to set the default block size to another value than 512. Another alternative I considered is allowing the block device to be created with a "frozen" queue, where we can manually unfreeze the queue when all the configuration is done. This would be a much larger code change, though. Signed-off-by: Martijn Coenen <maco@android.com> --- drivers/block/loop.c | 42 +++++++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 19 deletions(-)