Message ID | 20240124173834.66320-1-hreitz@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | virtio: Keep notifications disabled during drain | expand |
On Wed, Jan 24, 2024 at 06:38:28PM +0100, Hanna Czenczek wrote: > Hi, > > When registering callbacks via aio_set_event_notifier_poll(), the > io_poll_end() callback is only invoked when polling actually ends. If > the notifiers are removed while in a polling section, it is not called. > Therefore, io_poll_start() is not necessarily followed up by > io_poll_end(). > > It is not entirely clear whether this is good or bad behavior. On one > hand, it may be unexpected to callers. On the other, it may be > counterproductive to call io_poll_end() when the polling section has not > ended yet. > > Right now, there is only one user of aio_set_event_notifier(), which is > virtio_queue_aio_attach_host_notifier(). It does not expect this > behavior, which leads to virtqueue notifiers remaining disabled if > virtio_queue_aio_detach_host_notifier() is called while polling. That > can happen e.g. through virtio_scsi_drained_begin() or > virtio_blk_drained_begin() (through virtio_blk_data_plane_detach()). > In such a case, the virtqueue may not be processed for a while, letting > the guest driver hang. This can be reproduced by repeatedly > hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk, > because the guest will try to enumerate the virtio-scsi device while > we’re attaching the scsi-hd disk, which causes a drain, which can cause > the virtio-scsi virtqueue to stall as described. > > Stefan has suggested ensuring we always follow up io_poll_start() by > io_poll_end(): > > https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html > > I prefer changing the caller instead, because I don’t think we actually > want the virtqueue notifier to be force-enabled when removing our AIO > notifiers. So I believe we actually only want to take care to > force-enable it when we re-attach the AIO notifiers, and to kick > virtqueue processing once, in case we missed any events while the AIO > notifiers were not attached. > > That is done by patch 2. We have already discussed a prior version of > it here: > > https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00001.html > > And compared to that, based on the discussion, there are some changes: > 1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as > suggested by Paolo, because it’s thread-safe > 2. Moved virtio_queue_notify() into > virtio_queue_aio_attach_host_notifier*(), because we always want it > 3. Dropped virtio_queue_set_notification(vq, 0) from > virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether > that was safe to do from any context. We don’t really need to call > it anyway, so I just dropped it. > 4. Added patch 1: > > Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling > notifiers for the event virtqueue. That didn’t turn out to be an issue > so far, but with patch 2, Fiona saw the virtqueue processing queue > spinning in a loop as described in > 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU > polling the event virtqueue"). > > > Note that as of eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi: only > access SCSIDevice->requests from one thread") there’s a different > problem when trying to reproduce the bug via hot-plugging and > -unplugging a virtio-scsi device, specifically, when unplugging, qemu > may crash with an assertion failure[1]. I don’t have a full fix for > that yet, but in case you need a work-around for the specific case of > virtio-scsi hot-plugging and -unplugging, you can use this patch: > > https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch > > > [1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html > > > Hanna Czenczek (2): > virtio-scsi: Attach event vq notifier with no_poll > virtio: Keep notifications disabled during drain > > include/block/aio.h | 7 ++++++- > hw/scsi/virtio-scsi.c | 7 ++++++- > hw/virtio/virtio.c | 42 ++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 54 insertions(+), 2 deletions(-) This patch series also fixes RHEL-7356. Buglink: https://issues.redhat.com/browse/RHEL-7356. Tested-by: Stefan Hajnoczi <stefanha@redhat.com>