[00/13] block: remove aio_disable_external() API

Message ID 20230403183004.347205-1-stefanha@redhat.com (mailing list archive)

Message

Stefan Hajnoczi April 3, 2023, 6:29 p.m. UTC
The aio_disable_external() API temporarily suspends file descriptor monitoring
in the event loop. The block layer uses this to prevent the guest and other
sources from submitting new I/O requests between bdrv_drained_begin() and
bdrv_drained_end().
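
For readers unfamiliar with the mechanism, here is a minimal sketch of how
is_external ties into drain today. The example() function and its read_cb
handler are placeholders; aio_set_fd_handler(), aio_disable_external(), and
aio_enable_external() are the real event loop APIs, assuming the pre-series
aio_set_fd_handler() signature that still takes is_external:

#include "block/aio.h"

/* Hypothetical example; read_cb/opaque stand in for a real fd handler. */
static void example(AioContext *ctx, int fd, IOHandler *read_cb, void *opaque)
{
    /* Register fd as "external", i.e. driven by guest/client activity */
    aio_set_fd_handler(ctx, fd, true /* is_external */,
                       read_cb, NULL, NULL, NULL, opaque);

    /* bdrv_drained_begin() eventually calls: */
    aio_disable_external(ctx);
    /* ...drained section: the fd above is no longer monitored... */
    aio_enable_external(ctx);  /* bdrv_drained_end() resumes monitoring */
}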

While the block layer still needs to prevent new I/O requests in drained
sections, the aio_disable_external() API can be replaced with
.drained_begin/end/poll() callbacks that have been added to BdrvChildClass and
BlockDevOps.
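
Concretely, a device or export opts in by registering a BlockDevOps with
these callbacks. A minimal sketch (the my_* names are hypothetical;
BlockDevOps and blk_set_dev_ops() are the existing QEMU APIs):

#include "sysemu/block-backend.h"

/* Called when a drained section begins: stop producing new requests */
static void my_drained_begin(void *opaque)
{
    /* e.g. unregister fds or pause request submission */
}

/* Polled by drain; return true while requests are still in flight */
static bool my_drained_poll(void *opaque)
{
    return false; /* nothing pending in this sketch */
}

/* Called when the drained section ends: resume producing requests */
static void my_drained_end(void *opaque)
{
    /* e.g. re-register fds */
}

static const BlockDevOps my_dev_ops = {
    .drained_begin = my_drained_begin,
    .drained_poll  = my_drained_poll,
    .drained_end   = my_drained_end,
};

/* during setup: blk_set_dev_ops(blk, &my_dev_ops, opaque); */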

This newer .drained_begin/end/poll() approach is attractive because it is not
tied to a specific AioContext. The block layer is moving towards
multi-queue and that means multiple AioContexts may be processing I/O
simultaneously.

The aio_disable_external() API was always somewhat hacky. It suspends all file
descriptors that were registered with is_external=true, even if they have
nothing to do with the BlockDriverState graph nodes that are being drained.
It's better to solve a block layer problem in the block layer than to have an
odd event loop API solution.

That covers the motivation for this change; now on to the specifics of this
series:

While it would be nice if a single conceptual approach could be applied to all
is_external=true file descriptors, I ended up looking at callers on a
case-by-case basis. There are two general ways I migrated code away from
is_external=true (sketches of both follow the list):

1. Block exports are typically best off unregistering fds in .drained_begin()
   and registering them again in .drained_end(). The .drained_poll() function
   waits for in-flight requests to finish using a reference counter.

2. Emulated storage controllers like virtio-blk and virtio-scsi are a little
   simpler. They can rely on BlockBackend's request queuing feature during
   drain. Guest I/O request coroutines are suspended in a drained section and
   resume upon the end of the drained section.
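
To make pattern 1 concrete, here is a sketch of the callbacks for a
hypothetical export (FooExport, foo_kick_cb, etc. are illustrative names,
not the actual patch code; the aio_set_fd_handler() calls assume the
pre-series signature and pass is_external=false, since the drain callbacks
now provide the protection):

typedef struct FooExport {
    AioContext *ctx;
    int kick_fd;            /* activity on this fd produces new requests */
    unsigned int in_flight; /* incremented per request, decremented on completion */
} FooExport;

static void foo_kick_cb(void *opaque); /* reads kick_fd, submits requests */

static void foo_drained_begin(void *opaque)
{
    FooExport *exp = opaque;
    /* Unregister the fd so no new requests arrive during the drained section */
    aio_set_fd_handler(exp->ctx, exp->kick_fd, false,
                       NULL, NULL, NULL, NULL, NULL);
}

static bool foo_drained_poll(void *opaque)
{
    FooExport *exp = opaque;
    /* Drain polls until every in-flight request has completed */
    return exp->in_flight > 0;
}

static void foo_drained_end(void *opaque)
{
    FooExport *exp = opaque;
    /* Re-register the fd so request processing resumes */
    aio_set_fd_handler(exp->ctx, exp->kick_fd, false,
                       foo_kick_cb, NULL, NULL, NULL, exp);
}

Pattern 2 requires no per-device drain code because BlockBackend parks
request coroutines itself. Conceptually (heavily simplified from
blk_wait_while_drained() in block/block-backend.c):

/* Called at the start of each guest I/O request coroutine */
static void coroutine_fn wait_while_drained(BlockBackend *blk)
{
    if (blk->quiesce_counter) {
        /* Suspend this coroutine; bdrv_drained_end() restarts the queue */
        qemu_co_queue_wait(&blk->queued_requests, NULL);
    }
}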

The first two virtio-scsi patches were already sent as a separate series. I
included them because they are necessary to fully remove
aio_disable_external().

Based-on: 087bc644b7634436ca9d52fe58ba9234e2bef026 (kevin/block-next)

Stefan Hajnoczi (13):
  virtio-scsi: avoid race between unplug and transport event
  virtio-scsi: stop using aio_disable_external() during unplug
  block/export: only acquire AioContext once for
    vhost_user_server_stop()
  util/vhost-user-server: rename refcount to in_flight counter
  block/export: wait for vhost-user-blk requests when draining
  block/export: stop using is_external in vhost-user-blk server
  virtio: do not set is_external=true on host notifiers
  hw/xen: do not use aio_set_fd_handler(is_external=true) in
    xen_xenstore
  hw/xen: do not set is_external=true on evtchn fds
  block/export: rewrite vduse-blk drain code
  block/fuse: take AioContext lock around blk_exp_ref/unref()
  block/fuse: do not set is_external=true on FUSE fd
  aio: remove aio_disable_external() API

 include/block/aio.h                  |  55 -----------
 include/qemu/vhost-user-server.h     |   8 +-
 util/aio-posix.h                     |   1 -
 block.c                              |   7 --
 block/blkio.c                        |  15 +--
 block/curl.c                         |  10 +-
 block/export/fuse.c                  |  62 ++++++++++++-
 block/export/vduse-blk.c             | 132 +++++++++++++++++++--------
 block/export/vhost-user-blk-server.c |  73 +++++++++------
 block/io.c                           |   2 -
 block/io_uring.c                     |   4 +-
 block/iscsi.c                        |   3 +-
 block/linux-aio.c                    |   4 +-
 block/nfs.c                          |   5 +-
 block/nvme.c                         |   8 +-
 block/ssh.c                          |   4 +-
 block/win32-aio.c                    |   6 +-
 hw/i386/kvm/xen_xenstore.c           |   2 +-
 hw/scsi/scsi-bus.c                   |   3 +-
 hw/scsi/scsi-disk.c                  |   1 +
 hw/scsi/virtio-scsi.c                |  21 ++---
 hw/virtio/virtio.c                   |   6 +-
 hw/xen/xen-bus.c                     |   6 +-
 io/channel-command.c                 |   6 +-
 io/channel-file.c                    |   3 +-
 io/channel-socket.c                  |   3 +-
 migration/rdma.c                     |  16 ++--
 tests/unit/test-aio.c                |  27 +-----
 tests/unit/test-fdmon-epoll.c        |  73 ---------------
 util/aio-posix.c                     |  20 +---
 util/aio-win32.c                     |   8 +-
 util/async.c                         |   3 +-
 util/fdmon-epoll.c                   |  10 --
 util/fdmon-io_uring.c                |   8 +-
 util/fdmon-poll.c                    |   3 +-
 util/main-loop.c                     |   7 +-
 util/qemu-coroutine-io.c             |   7 +-
 util/vhost-user-server.c             |  38 ++++----
 tests/unit/meson.build               |   3 -
 39 files changed, 298 insertions(+), 375 deletions(-)
 delete mode 100644 tests/unit/test-fdmon-epoll.c

Comments

Paolo Bonzini April 4, 2023, 1:43 p.m. UTC | #1
On 4/3/23 20:29, Stefan Hajnoczi wrote:
> [...]
> 
> 2. Emulated storage controllers like virtio-blk and virtio-scsi are a little
>     simpler. They can rely on BlockBackend's request queuing feature during
>     drain. Guest I/O request coroutines are suspended in a drained section and
>     resume upon the end of the drained section.

Sorry, I disagree with this.

Request queuing was shown to cause deadlocks; Hanna's latest patch is
piling another hack upon it. In my opinion we should instead go in the
direction of relying _less_ (or not at all) on request queuing.

I am strongly convinced that request queuing must apply only after
bdrv_drained_begin() has returned, which would also fix the IDE TRIM bug
reported by Fiona Ebner.  The possible livelock scenario is generally
not a problem because 1) outside an iothread you anyway have the BQL,
which prevents a vCPU from issuing more I/O operations during
bdrv_drained_begin(), and 2) in iothreads you have aio_disable_external()
instead of .drained_begin().

It is also less tidy to start a request during the drained_begin phase, 
because a request that has been submitted has to be completed (cancel 
doesn't really work).

So in an ideal world, request queuing would not only apply only after
bdrv_drained_begin() has returned, it would also log a warning, and
.drained_begin() should set things up so that no such warnings occur.

Thanks,

Paolo
Stefan Hajnoczi April 4, 2023, 9:04 p.m. UTC | #2
On Tue, Apr 04, 2023 at 03:43:20PM +0200, Paolo Bonzini wrote:
> On 4/3/23 20:29, Stefan Hajnoczi wrote:
> > [...]
> > 
> > 2. Emulated storage controllers like virtio-blk and virtio-scsi are a little
> >     simpler. They can rely on BlockBackend's request queuing feature during
> >     drain. Guest I/O request coroutines are suspended in a drained section and
> >     resume upon the end of the drained section.
> 
> Sorry, I disagree with this.
> 
> Request queuing was shown to cause deadlocks; Hanna's latest patch is piling
> another hack upon it. In my opinion we should instead go in the direction of
> relying _less_ (or not at all) on request queuing.
> 
> I am strongly convinced that request queuing must apply only after
> bdrv_drained_begin() has returned, which would also fix the IDE TRIM bug
> reported by Fiona Ebner.  The possible livelock scenario is generally not a
> problem because 1) outside an iothread you anyway have the BQL, which prevents
> a vCPU from issuing more I/O operations during bdrv_drained_begin(), and 2) in
> iothreads you have aio_disable_external() instead of .drained_begin().
> 
> It is also less tidy to start a request during the drained_begin phase,
> because a request that has been submitted has to be completed (cancel
> doesn't really work).
> 
> So in an ideal world, request queuing would not only apply only after
> bdrv_drained_begin() has returned, it would also log a warning, and
> .drained_begin() should set things up so that no such warnings occur.

That's fine; I will give .drained_begin/end/poll() a try with virtio-blk
and virtio-scsi in the next revision.

Stefan