[0/2] virtio: Keep notifications disabled during drain

Message ID: 20240124173834.66320-1-hreitz@redhat.com

Message

Hanna Czenczek Jan. 24, 2024, 5:38 p.m. UTC
Hi,

When registering callbacks via aio_set_event_notifier_poll(), the
io_poll_end() callback is only invoked when polling actually ends.  If
the notifiers are removed while still in a polling section, it is not
called.  Therefore, io_poll_begin() is not necessarily followed by
io_poll_end().
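
For context, the virtio code registers these callbacks roughly as
follows; the begin/end handlers disable virtqueue notifications while
polling and re-enable them afterwards.  (This is a sketch paraphrased
from hw/virtio/virtio.c, so details may differ from the tree.)

static void virtio_queue_host_notifier_aio_poll_begin(EventNotifier *n)
{
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    /* We poll the ring ourselves now, so guest notifications are unnecessary */
    virtio_queue_set_notification(vq, 0);
}

static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
{
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    /* Polling has ended, so we need the guest to notify us again */
    virtio_queue_set_notification(vq, 1);
}

void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
{
    aio_set_event_notifier(ctx, &vq->host_notifier,
                           virtio_queue_host_notifier_read,
                           virtio_queue_host_notifier_aio_poll,
                           virtio_queue_host_notifier_aio_poll_ready);
    aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                virtio_queue_host_notifier_aio_poll_begin,
                                virtio_queue_host_notifier_aio_poll_end);
}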

It is not entirely clear whether this is good or bad behavior.  On one
hand, it may be unexpected to callers.  On the other, it may be
counterproductive to call io_poll_end() when the polling section has not
ended yet.

Right now, the only user of aio_set_event_notifier_poll() is
virtio_queue_aio_attach_host_notifier().  It does not expect this
behavior, which leads to virtqueue notifications remaining disabled if
virtio_queue_aio_detach_host_notifier() is called while polling.  That
can happen e.g. through virtio_scsi_drained_begin() or
virtio_blk_drained_begin() (via virtio_blk_data_plane_detach()).
In such a case, the virtqueue may not be processed for a while, causing
the guest driver to hang.  This can be reproduced by repeatedly
hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk:
the guest will try to enumerate the virtio-scsi device while we’re
attaching the scsi-hd disk, which causes a drain, which in turn can
stall the virtio-scsi virtqueue as described.
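
To illustrate: detaching merely unregisters the handlers, so if
io_poll_begin() has already disabled notifications, nothing ever
re-enables them, and the virtqueue is then neither polled nor notified
about.  (Again a sketch of the current code, not a verbatim quote.)

void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
{
    /*
     * This removes the handlers (including the poll handlers), but does
     * not call io_poll_end().  If we were inside a polling section,
     * virtqueue notifications remain disabled from here on.
     */
    aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
}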

Stefan has suggested ensuring that io_poll_begin() is always followed
by io_poll_end():

https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html
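
That would mean having the AIO core itself invoke the old io_poll_end()
whenever a handler is removed during a polling section, along these
lines.  (A hypothetical sketch against util/aio-posix.c, not an actual
patch.)

/* Hypothetical: run wherever the AioHandler is removed or replaced */
if (ctx->poll_started && node->io_poll_end) {
    /*
     * Pretend the polling section ended for this handler, so that
     * io_poll_begin() and io_poll_end() always pair up.
     */
    node->io_poll_end(node->opaque);
}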

I prefer changing the caller instead, because I don’t think we actually
want virtqueue notifications to be force-enabled when removing our AIO
notifiers.  Instead, we only need to force-enable them when we re-attach
the AIO notifiers, and to kick virtqueue processing once, in case we
missed any events while the AIO notifiers were not attached.
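
In code, the approach looks roughly like this (a sketch of the idea
behind patch 2, not the literal patch):

void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
{
    /*
     * A previous detach during polling may have left notifications
     * disabled; force-enable them before attaching.
     */
    virtio_queue_set_notification(vq, 1);

    aio_set_event_notifier(ctx, &vq->host_notifier,
                           virtio_queue_host_notifier_read,
                           virtio_queue_host_notifier_aio_poll,
                           virtio_queue_host_notifier_aio_poll_ready);
    aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                virtio_queue_host_notifier_aio_poll_begin,
                                virtio_queue_host_notifier_aio_poll_end);

    /*
     * Kick the virtqueue once in case we missed guest notifications
     * while detached; virtio_queue_notify() is thread-safe.
     */
    virtio_queue_notify(vq->vdev, vq->queue_index);
}

With this, even if a drain began and ended while we were inside a
polling section, re-attaching restores a consistent state.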

That is done by patch 2.  We have already discussed a prior version of
it here:

https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00001.html

Compared to that version, based on the discussion, there are some
changes:
1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as
   suggested by Paolo, because it is thread-safe.
2. Moved virtio_queue_notify() into
   virtio_queue_aio_attach_host_notifier*(), because we always want it
   there.
3. Dropped virtio_queue_set_notification(vq, 0) from
   virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether
   that was safe to do from any context, and we don’t really need to
   call it anyway, so I just dropped it.
4. Added patch 1:

Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling
notifiers for the event virtqueue.  That had not been an issue so far,
but with patch 2, Fiona saw polling of the event virtqueue spin in a
busy loop, as described in commit
38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU
polling the event virtqueue").
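
The fix amounts to picking the no_poll attach variant for the event
virtqueue when re-attaching after a drain, roughly as in this sketch of
the loop in virtio_scsi_drained_end():

uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED + vs->conf.num_queues;

for (uint32_t i = 0; i < total_queues; i++) {
    VirtQueue *vq = virtio_get_queue(vdev, i);

    if (vq == vs->event_vq) {
        /* The event vq is mostly idle; polling it just burns CPU */
        virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
    } else {
        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
    }
}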


Note that as of commit eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi:
only access SCSIDevice->requests from one thread") there is a different
problem when trying to reproduce the bug via hot-plugging and
-unplugging a virtio-scsi device: specifically, when unplugging, qemu
may crash with an assertion failure [1].  I don’t have a full fix for
that yet, but in case you need a work-around for the specific case of
virtio-scsi hot-plugging and -unplugging, you can use this patch:

https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch


[1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html


Hanna Czenczek (2):
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Keep notifications disabled during drain

 include/block/aio.h   |  7 ++++++-
 hw/scsi/virtio-scsi.c |  7 ++++++-
 hw/virtio/virtio.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 2 deletions(-)

Comments

Stefan Hajnoczi Jan. 25, 2024, 6:05 p.m. UTC | #1
On Wed, Jan 24, 2024 at 06:38:28PM +0100, Hanna Czenczek wrote:
> Hi,
>
> [...]

This patch series also fixes RHEL-7356.

Buglink: https://issues.redhat.com/browse/RHEL-7356
Tested-by: Stefan Hajnoczi <stefanha@redhat.com>