[RFC,0/3] aio-posix: call ->poll_end() when removing AioHandler

Message ID: 20231213211544.1601971-1-stefanha@redhat.com

Message

Stefan Hajnoczi Dec. 13, 2023, 9:15 p.m. UTC
Hanna and Fiona encountered a bug in aio_set_fd_handler(): there is no matching
io_poll_end() call upon removing an AioHandler when io_poll_begin() was
previously called. The missing io_poll_end() call leaves virtqueue
notifications disabled, so the virtqueue's ioeventfd never becomes
readable again.

The details of how virtio-scsi devices using IOThreads can hang after
hotplug/unplug are covered here:
https://issues.redhat.com/browse/RHEL-3934

Hanna is currently away over the December holidays. I'm sending these RFC
patches in the meantime. They demonstrate running aio_set_fd_handler() in the
AioContext home thread and adding the missing io_poll_end() call.

The downside to my approach is that aio_set_fd_handler() becomes a
synchronization point that waits for the remote AioContext thread to finish
running a BH. Synchronization points are prone to deadlocks if the caller
invokes them while holding a lock that the remote AioContext needs to make
progress or if the remote AioContext cannot make progress before we make
progress in our own event loop. To minimize these concerns I have based this
patch series on my AioContext lock removal series and only allow the main loop
thread to call aio_set_fd_handler() on other threads (which I think is already
the convention today).
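
For illustration, the dispatch pattern looks roughly like the sketch below.
This is only a sketch of the idea, not the patch itself: SetFdHandlerRequest
and aio_set_fd_handler_local() are made-up names, and the poll-state
bookkeeping is omitted. aio_wait_bh_oneshot() is the existing helper that
schedules a BH in the target context and spins the caller's event loop until
it has completed.

    typedef struct {
        AioContext *ctx;
        int fd;
        IOHandler *io_read;
        IOHandler *io_write;
        AioPollFn *io_poll;
        IOHandler *io_poll_ready;
        void *opaque;
    } SetFdHandlerRequest;

    /* Runs in req->ctx's home thread, so it can call the removed handler's
     * ->io_poll_end() without racing against that thread's event loop. */
    static void set_fd_handler_bh(void *opaque)
    {
        SetFdHandlerRequest *req = opaque;

        aio_set_fd_handler_local(req->ctx, req->fd, req->io_read,
                                 req->io_write, req->io_poll,
                                 req->io_poll_ready, req->opaque);
    }

    void aio_set_fd_handler(AioContext *ctx, int fd, IOHandler *io_read,
                            IOHandler *io_write, AioPollFn *io_poll,
                            IOHandler *io_poll_ready, void *opaque)
    {
        SetFdHandlerRequest req = {
            .ctx = ctx, .fd = fd, .io_read = io_read, .io_write = io_write,
            .io_poll = io_poll, .io_poll_ready = io_poll_ready,
            .opaque = opaque,
        };

        if (ctx == qemu_get_current_aio_context()) {
            set_fd_handler_bh(&req);      /* fast path: already at home */
            return;
        }

        /* Synchronization point: wait for ctx's home thread to run the BH */
        aio_wait_bh_oneshot(ctx, set_fd_handler_bh, &req);
    }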

Another concern is that aio_set_fd_handler() now invokes user-provided
io_poll_end(), io_poll(), and io_poll_ready() functions. The io_poll_ready()
callback might contain a nested aio_poll() call, so there is a new place where
nested event loops can occur and hence a new re-entrant code path that I
haven't thought about yet.

But there you have it. Please let me know what you think and try your
reproducers to see if this fixes the missing io_poll_end() issue. Thanks!

Alternatives welcome! (A cleaner version of this approach might be to forbid
cross-thread aio_set_fd_handler() calls and to refactor all
aio_set_fd_handler() callers so they come from the AioContext's home thread.
I'm starting to think that only the aio_notify() and aio_schedule_bh() APIs
should be thread-safe.)

Stefan Hajnoczi (3):
  aio-posix: run aio_set_fd_handler() in target AioContext
  aio: use counter instead of ctx->list_lock
  aio-posix: call ->poll_end() when removing AioHandler

 include/block/aio.h |  22 ++---
 util/aio-posix.c    | 197 ++++++++++++++++++++++++++++++++------------
 util/async.c        |   2 -
 util/fdmon-epoll.c  |   6 +-
 4 files changed, 152 insertions(+), 75 deletions(-)

Comments

Stefan Hajnoczi Dec. 13, 2023, 9:52 p.m. UTC | #1
Based-on: 20231205182011.1976568-1-stefanha@redhat.com

(https://lore.kernel.org/qemu-devel/20231205182011.1976568-1-stefanha@redhat.com/)
Paolo Bonzini Dec. 13, 2023, 11:10 p.m. UTC | #2
On Wed, Dec 13, 2023 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> Alternatives welcome! (A cleaner version of this approach might be to forbid
> cross-thread aio_set_fd_handler() calls and to refactor all
> aio_set_fd_handler() callers so they come from the AioContext's home thread.
> I'm starting to think that only the aio_notify() and aio_schedule_bh() APIs
> should be thread-safe.)

I think that's pretty hard because aio_set_fd_handler() is a pretty
important part of the handoff from one AioContext to another and also
of drained_begin()/end(), and both of these things run in the main
thread.

Regarding how to solve this issue, there is a lot of
"underdocumenting" of the locking policy in aio-posix.c, and indeed it
makes running aio_set_fd_handler() in the target AioContext tempting;
but it is also scary to rely on the iothread being able to react
quickly. I'm also worried that we're changing the logic just because
we don't understand the old one, but then we add technical debt.

So, as a first step, I would take inspiration from the block layer
locking work, and add assertions to functions like poll_set_started()
or find_aio_handler(). Is the list_lock elevated (nonzero)? Or locked?
Are we in the iothread? And likewise, for each list, does insertion
happen from the iothread or with the list_lock taken (and possibly
elevated)? Does removal happen from the iothread or with list_lock
zero+taken?
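
(Just to illustrate the kind of assertion meant here; in_aio_context_home_thread()
and qemu_lockcnt_count() already exist, but where exactly to put this is an open
question, and "locked" is not visible in the counter, so that case would need a
separate annotation:)

    /* e.g. at the top of find_aio_handler() or poll_set_started() */
    assert(in_aio_context_home_thread(ctx) ||
           qemu_lockcnt_count(&ctx->list_lock) > 0);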

After this step,  we should have a clearer idea of the possible states
of the node (based on the lists, the state is a subset of
{poll_started, deleted, ready}) and draw a nice graph of the
transitions. We should also understand if any calls to
QLIST_IS_INSERTED() have correctness issues.

Good news, I don't think any memory barriers are needed here. One
thing that we already do correctly is that, once a node is deleted, we
try to skip work; see for example poll_set_started(). This also
provides a good place to do cleanup work for deleted nodes, including
calling poll_end(): aio_free_deleted_handlers(), because it runs with
list_lock zero and taken, just like the tail of
aio_remove_fd_handler(). It's the safest possible place to do cleanup
and to take a lock. Therefore we have:

- a fast path in the iothread that runs without any concurrence with
stuff happening in the main thread

- a slow path in the iothread that runs with list_lock zero and taken.
The slow path shares logic with the main thread, meaning that
aio_free_deleted_handlers() and aio_remove_fd_handler() should share
some functions called by both.

If the code is organized this way, any wrong bits should jump out more
easily. For example, these two lines in aio_remove_fd_handler() are
clearly misplaced

    node->pfd.revents = 0;
    node->poll_ready = false;

because they run in the main thread but they touch iothread data! They
should be after qemu_lockcnt_count() is checked to be zero.

Regarding the call to io_poll_ready(), I would hope that it is
unnecessary; in other words, that after drained_end() the virtqueue
notification would be raised. Yes, virtio_queue_set_notification is
edge triggered rather than level triggered, so it would be necessary
to add a check with virtio_queue_host_notifier_aio_poll() and
virtio_queue_host_notifier_aio_poll_ready() in
virtio_queue_aio_attach_host_notifier, but that does not seem too bad
because virtio is the only user of the io_poll_begin and io_poll_end
callbacks. It would have to be documented though.
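
Sketching it (illustrative only; the poll helpers are local to
hw/virtio/virtio.c today, so the real change would live there, and the
existing function body is just the aio_set_event_notifier() call):

    void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
    {
        aio_set_event_notifier(ctx, &vq->host_notifier,
                               virtio_queue_host_notifier_read,
                               virtio_queue_host_notifier_aio_poll,
                               virtio_queue_host_notifier_aio_poll_ready);
        /*
         * Notifications are edge triggered, so explicitly check for requests
         * that may have arrived while the notifier was not attached.
         */
        if (virtio_queue_host_notifier_aio_poll(&vq->host_notifier)) {
            virtio_queue_host_notifier_aio_poll_ready(&vq->host_notifier);
        }
    }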

Paolo


Paolo

>
> Stefan Hajnoczi (3):
>   aio-posix: run aio_set_fd_handler() in target AioContext
>   aio: use counter instead of ctx->list_lock
>   aio-posix: call ->poll_end() when removing AioHandler
>
>  include/block/aio.h |  22 ++---
>  util/aio-posix.c    | 197 ++++++++++++++++++++++++++++++++------------
>  util/async.c        |   2 -
>  util/fdmon-epoll.c  |   6 +-
>  4 files changed, 152 insertions(+), 75 deletions(-)
>
> --
> 2.43.0
>
Fiona Ebner Dec. 14, 2023, 1:38 p.m. UTC | #3
Am 13.12.23 um 22:15 schrieb Stefan Hajnoczi:
> But there you have it. Please let me know what you think and try your
> reproducers to see if this fixes the missing io_poll_end() issue. Thanks!
> 

Thanks to you! I applied the RFC (and the series it depends on) on top
of 8.2.0-rc4 and this fixes my reproducer which drains VirtIO SCSI or
VirtIO block devices in a loop. Also didn't encounter any other issues
while playing around a bit with backup and mirror jobs.

The changes look fine to me, but this issue is also the first time I
came in close contact with this code, so that unfortunately does not say
much.

Best Regards,
Fiona
Stefan Hajnoczi Dec. 14, 2023, 7:52 p.m. UTC | #4
On Thu, Dec 14, 2023 at 12:10:32AM +0100, Paolo Bonzini wrote:
> On Wed, Dec 13, 2023 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > Alternatives welcome! (A cleaner version of this approach might be to forbid
> > cross-thread aio_set_fd_handler() calls and to refactor all
> > aio_set_fd_handler() callers so they come from the AioContext's home thread.
> > I'm starting to think that only the aio_notify() and aio_schedule_bh() APIs
> > should be thread-safe.)
> 
> I think that's pretty hard because aio_set_fd_handler() is a pretty
> important part of the handoff from one AioContext to another and also
> of drained_begin()/end(), and both of these things run in the main
> thread.
> 
> Regarding how to solve this issue, there is a lot of
> "underdocumenting" of the locking policy in aio-posix.c, and indeed it
> makes running aio_set_fd_handler() in the target AioContext tempting;
> but it is also scary to rely on the iothread being able to react
> quickly. I'm also worried that we're changing the logic just because
> we don't understand the old one, but then we add technical debt.
> 
> So, as a first step, I would take inspiration from the block layer
> locking work, and add assertions to functions like poll_set_started()
> or find_aio_handler(). Is the list_lock elevated (nonzero)? Or locked?
> Are we in the iothread? And likewise, for each list, does insertion
> happen from the iothread or with the list_lock taken (and possibly
> elevated)? Does removal happen from the iothread or with list_lock
> zero+taken?
> 
> After this step,  we should have a clearer idea of the possible states
> of the node (based on the lists, the state is a subset of
> {poll_started, deleted, ready}) and draw a nice graph of the
> transitions. We should also understand if any calls to
> QLIST_IS_INSERTED() have correctness issues.
> 
> Good news, I don't think any memory barriers are needed here. One
> thing that we already do correctly is that, once a node is deleted, we
> try to skip work; see for example poll_set_started(). This also
> provides a good place to do cleanup work for deleted nodes, including
> calling poll_end(): aio_free_deleted_handlers(), because it runs with
> list_lock zero and taken, just like the tail of
> aio_remove_fd_handler(). It's the safest possible place to do cleanup
> and to take a lock. Therefore we have:
> 
> - a fast path in the iothread that runs without any concurrence with
> stuff happening in the main thread
> 
> - a slow path in the iothread that runs with list_lock zero and taken.
> The slow path shares logic with the main thread, meaning that
> aio_free_deleted_handlers() and aio_remove_fd_handler() should share
> some functions called by both.
> 
> If the code is organized this way, any wrong bits should jump out more
> easily. For example, these two lines in aio_remove_fd_handler() are
> clearly misplaced
> 
>     node->pfd.revents = 0;
>     node->poll_ready = false;
> 
> because they run in the main thread but they touch iothread data! They
> should be after qemu_lockcnt_count() is checked to be zero.
> 
> Regarding the call to io_poll_ready(), I would hope that it is
> unnecessary; in other words, that after drained_end() the virtqueue
> notification would be raised. Yes, virtio_queue_set_notification is
> edge triggered rather than level triggered, so it would be necessary
> to add a check with virtio_queue_host_notifier_aio_poll() and
> virtio_queue_host_notifier_aio_poll_ready() in
> virtio_queue_aio_attach_host_notifier, but that does not seem too bad
> because virtio is the only user of the io_poll_begin and io_poll_end
> callbacks. It would have to be documented though.

I think Hanna had the same idea: document that ->io_poll_end() isn't
called by aio_set_fd_handler() and shift the responsibility onto the
caller to get back into a state where notifications are enabled before
they add the fd with aio_set_fd_handler() again.

In a little more detail, the caller needs to do the following before
adding the fd back with aio_set_fd_handler() again:
1. Call ->io_poll_end().
2. Poll one more time in case an event slipped in, and if so, either write
   to the eventfd so the fd is immediately readable or call
   ->io_poll_ready() directly.
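
Roughly, from the caller's perspective (sketch only, assuming the current
seven-argument aio_set_fd_handler() signature; io_poll_end/io_poll/
io_poll_ready are the caller's own callbacks, so it can invoke them directly):

    /* removal: aio_set_fd_handler() will not call ->io_poll_end() for us */
    aio_set_fd_handler(ctx, fd, NULL, NULL, NULL, NULL, NULL);

    /* ... later, before adding the fd back ... */
    io_poll_end(opaque);               /* 1. re-enable notifications     */
    if (io_poll(opaque)) {             /* 2. did an event slip through?  */
        io_poll_ready(opaque);         /*    then handle it right away   */
    }
    aio_set_fd_handler(ctx, fd, io_read, NULL, io_poll, io_poll_ready, opaque);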

I think this is more or less what you described above.

I don't like pushing this responsibility onto the caller, but adding a
synchronization point in aio_set_fd_handler() is problematic, so let's
give it a try. I'll try that approach and send a v2.

Stefan

> 
> Paolo
> 
> 
> Paolo
> 
> >
> > Stefan Hajnoczi (3):
> >   aio-posix: run aio_set_fd_handler() in target AioContext
> >   aio: use counter instead of ctx->list_lock
> >   aio-posix: call ->poll_end() when removing AioHandler
> >
> >  include/block/aio.h |  22 ++---
> >  util/aio-posix.c    | 197 ++++++++++++++++++++++++++++++++------------
> >  util/async.c        |   2 -
> >  util/fdmon-epoll.c  |   6 +-
> >  4 files changed, 152 insertions(+), 75 deletions(-)
> >
> > --
> > 2.43.0
> >
>
Stefan Hajnoczi Dec. 14, 2023, 7:53 p.m. UTC | #5
On Thu, Dec 14, 2023 at 02:38:47PM +0100, Fiona Ebner wrote:
> Am 13.12.23 um 22:15 schrieb Stefan Hajnoczi:
> > But there you have it. Please let me know what you think and try your
> > reproducers to see if this fixes the missing io_poll_end() issue. Thanks!
> > 
> 
> Thanks to you! I applied the RFC (and the series it depends on) on top
> of 8.2.0-rc4 and this fixes my reproducer which drains VirtIO SCSI or
> VirtIO block devices in a loop. Also didn't encounter any other issues
> while playing around a bit with backup and mirror jobs.
> 
> The changes look fine to me, but this issue is also the first time I
> came in close contact with this code, so that unfortunately does not say
> much.

Great.

I will still try the other approach that Hanna and Paolo have suggested.
It seems more palatable. I will send a v2.

Stefan
Fiona Ebner Dec. 18, 2023, 12:41 p.m. UTC | #6
Am 14.12.23 um 20:53 schrieb Stefan Hajnoczi:
> 
> I will still try the other approach that Hanna and Paolo have suggested.
> It seems more palatable. I will send a v2.
> 

FYI, what I already tried downstream (for VirtIO SCSI):

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 9c751bf296..a6449b04d0 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1166,6 +1166,8 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>  
>      for (uint32_t i = 0; i < total_queues; i++) {
>          VirtQueue *vq = virtio_get_queue(vdev, i);
> +        virtio_queue_set_notification(vq, 1);
> +        virtio_queue_notify(vdev, i);
>          virtio_queue_aio_attach_host_notifier(vq, s->ctx);
>      }
>  }

But this introduces an issue where e.g. a 'backup' QMP command would put
the iothread into a bad state. After the command, whenever the guest
issues IO, the thread will temporarily spike to using 100% CPU. Using
QMP stop+cont is a way to make it go back to normal.

I think it's because of nested drains, because when additionally
checking that the drain count is zero and only executing the loop then,
that issue doesn't seem to manifest, i.e.:

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 9c751bf296..d22c586b38 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1164,9 +1164,13 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>          return;
>      }
>  
> -    for (uint32_t i = 0; i < total_queues; i++) {
> -        VirtQueue *vq = virtio_get_queue(vdev, i);
> -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +    if (s->bus.drain_count == 0) {
> +        for (uint32_t i = 0; i < total_queues; i++) {
> +            VirtQueue *vq = virtio_get_queue(vdev, i);
> +            virtio_queue_set_notification(vq, 1);
> +            virtio_queue_notify(vdev, i);
> +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +        }
>      }
>  }
>  

Best Regards,
Fiona
Stefan Hajnoczi Dec. 18, 2023, 2:25 p.m. UTC | #7
On Mon, Dec 18, 2023 at 01:41:38PM +0100, Fiona Ebner wrote:
> Am 14.12.23 um 20:53 schrieb Stefan Hajnoczi:
> > 
> > I will still try the other approach that Hanna and Paolo have suggested.
> > It seems more palatable. I will send a v2.
> > 
> 
> FYI, what I already tried downstream (for VirtIO SCSI):
> 
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 9c751bf296..a6449b04d0 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -1166,6 +1166,8 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
> >  
> >      for (uint32_t i = 0; i < total_queues; i++) {
> >          VirtQueue *vq = virtio_get_queue(vdev, i);
> > +        virtio_queue_set_notification(vq, 1);
> > +        virtio_queue_notify(vdev, i);
> >          virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> >      }
> >  }
> 
> But this introduces an issue where e.g. a 'backup' QMP command would put
> the iothread into a bad state. After the command, whenever the guest
> issues IO, the thread will temporarily spike to using 100% CPU. Using
> QMP stop+cont is a way to make it go back to normal.
> 
> I think it's because of nested drains, because when additionally
> checking that the drain count is zero and only executing the loop then,
> that issue doesn't seem to manifest, i.e.:

Thanks for letting me know about the issue. I'll keep an eye out for it
when playing with the code.

Stefan

> 
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 9c751bf296..d22c586b38 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -1164,9 +1164,13 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
> >          return;
> >      }
> >  
> > -    for (uint32_t i = 0; i < total_queues; i++) {
> > -        VirtQueue *vq = virtio_get_queue(vdev, i);
> > -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +    if (s->bus.drain_count == 0) {
> > +        for (uint32_t i = 0; i < total_queues; i++) {
> > +            VirtQueue *vq = virtio_get_queue(vdev, i);
> > +            virtio_queue_set_notification(vq, 1);
> > +            virtio_queue_notify(vdev, i);
> > +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +        }
> >      }
> >  }
> >  
> 
> Best Regards,
> Fiona
>
Paolo Bonzini Dec. 18, 2023, 2:49 p.m. UTC | #8
On Mon, Dec 18, 2023 at 1:41 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
> I think it's because of nested drains, because when additionally
> checking that the drain count is zero and only executing the loop then,
> that issue doesn't seem to manifest

But isn't virtio_scsi_drained_end only run if bus->drain_count == 0?

    if (bus->drain_count-- == 1) {
        trace_scsi_bus_drained_end(bus, sdev);
        if (bus->info->drained_end) {
            bus->info->drained_end(bus);
        }
    }

Paolo

>
> > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> > index 9c751bf296..d22c586b38 100644
> > --- a/hw/scsi/virtio-scsi.c
> > +++ b/hw/scsi/virtio-scsi.c
> > @@ -1164,9 +1164,13 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
> >          return;
> >      }
> >
> > -    for (uint32_t i = 0; i < total_queues; i++) {
> > -        VirtQueue *vq = virtio_get_queue(vdev, i);
> > -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +    if (s->bus.drain_count == 0) {
> > +        for (uint32_t i = 0; i < total_queues; i++) {
> > +            VirtQueue *vq = virtio_get_queue(vdev, i);
> > +            virtio_queue_set_notification(vq, 1);
> > +            virtio_queue_notify(vdev, i);
> > +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> > +        }
> >      }
> >  }
> >
>
> Best Regards,
> Fiona
>
Fiona Ebner Dec. 19, 2023, 8:40 a.m. UTC | #9
Am 18.12.23 um 15:49 schrieb Paolo Bonzini:
> On Mon, Dec 18, 2023 at 1:41 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>> I think it's because of nested drains, because when additionally
>> checking that the drain count is zero and only executing the loop then,
>> that issue doesn't seem to manifest
> 
> But isn't virtio_scsi_drained_end only run if bus->drain_count == 0?
> 
>     if (bus->drain_count-- == 1) {
>         trace_scsi_bus_drained_end(bus, sdev);
>         if (bus->info->drained_end) {
>             bus->info->drained_end(bus);
>         }
>     }
> 

You're right. Sorry, I must've messed up my testing yesterday :(
Sometimes the CPU spikes are very short-lived. Now I see the same issue
with both variants.

Unfortunately, I haven't been able to figure out why it happens yet.

Best Regards,
Fiona
Hanna Czenczek Jan. 2, 2024, 3:24 p.m. UTC | #10
On 13.12.23 22:15, Stefan Hajnoczi wrote:
> Hanna and Fiona encountered a bug in aio_set_fd_handler(): there is no matching
> io_poll_end() call upon removing an AioHandler when io_poll_begin() was
> previously called. The missing io_poll_end() call leaves virtqueue
> notifications disabled, so the virtqueue's ioeventfd never becomes
> readable again.
>
> The details of how virtio-scsi devices using IOThreads can hang after
> hotplug/unplug are covered here:
> https://issues.redhat.com/browse/RHEL-3934
>
> Hanna is currently away over the December holidays. I'm sending these RFC
> patches in the meantime. They demonstrate running aio_set_fd_handler() in the
> AioContext home thread and adding the missing io_poll_end() call.

I agree with Paolo that if the handlers are removed, the caller probably 
isn’t interested in notifications: In our specific case, we remove the 
handlers because the device is to be drained, so we won’t poll the 
virtqueue anyway.  Therefore, if io_poll_end() is to be called to 
complete the start-end pair, it shouldn’t be done when the handlers are 
removed, but instead when they are reinstated.

I believe that’s quite infeasible to do in generic code: We’d have to 
basically remember that we haven’t called a specific io_poll_end 
callback yet, and so once it is reinstated while we’re not in a polling 
phase, we would need to call it then.  That in itself is already hairy, 
but in addition, the callback consists of a function pointer and an 
opaque pointer, and it seems impossible to reliably establish identity 
between two opaque pointers to know whether a newly instated io_poll_end 
callback is the same one as one that had been removed before (pointer 
identity may work, but also may not).

That’s why I think the responsibility should fall on the caller.  I 
believe both virtio_queue_aio_attach_host_notifier*() functions should 
enable vq notifications before instating the handler – if we’re in 
polling mode, io_poll_begin() will then immediately be called and 
disable vq notifications again.  Having them enabled briefly shouldn’t 
hurt anything but performance (which I don’t think is terrible in the 
drain case).  For completeness’ sake, we may even have 
virtio_queue_aio_detach_host_notifier() disable vq notifications, 
because otherwise it’s unknown whether notifications are enabled or 
disabled after removing the notifiers.  That isn’t terrible, but I think 
(A) if any of the two, we want them to be disabled, because we won’t 
check for requests anyway, and (B) this gives us a clearer state 
machine, where removing the notifiers will always leave vq notifications 
disabled, so when they are reinstated (i.e. when calling 
virtio_queue_aio_attach_host_notifier*()), it’s clear that we must poll 
once to check for new requests.
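
In code, the attach/detach side would look roughly like this (sketch only,
not the actual patch attached below; names match the existing functions,
bodies are simplified):

    void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
    {
        /*
         * Re-enable notifications before instating the handler.  If the
         * context is currently polling, io_poll_begin() will disable them
         * again right away, so they are only briefly enabled.
         */
        virtio_queue_set_notification(vq, 1);
        aio_set_event_notifier(ctx, &vq->host_notifier,
                               virtio_queue_host_notifier_read,
                               virtio_queue_host_notifier_aio_poll,
                               virtio_queue_host_notifier_aio_poll_ready);
    }

    void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
    {
        aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
        /* leave a well-defined state: notifications disabled while detached */
        virtio_queue_set_notification(vq, 0);
    }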

I’ve attached the preliminary patch that I didn’t get to send (or test 
much) last year.  Not sure if it has the same CPU-usage-spike issue 
Fiona was seeing, the only functional difference is that I notify the vq 
after attaching the notifiers instead of before.

Hanna
Paolo Bonzini Jan. 2, 2024, 3:53 p.m. UTC | #11
On Tue, Jan 2, 2024 at 4:24 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> I’ve attached the preliminary patch that I didn’t get to send (or test
> much) last year.  Not sure if it has the same CPU-usage-spike issue
> Fiona was seeing, the only functional difference is that I notify the vq
> after attaching the notifiers instead of before.

I think the patch makes sense and cleaning up the logic of aio_poll
(which is one of those functions that grew and grew without much
clarity into who did what) can be done on top.

Just one small thing, the virtio_queue_notify_vq() call is required
because the virtqueue interrupt and eventfd are edge-triggered rather
than level-triggered; so perhaps it should be placed in the
function(s) that establish the handlers,
virtio_queue_aio_attach_host_notifier() and
virtio_queue_aio_attach_host_notifier_no_poll()? Neither
virtio_blk_drained_end() nor virtio_scsi_drained_end() are
particularly special, and the comment applies just as well:

    /*
     * We will have ignored notifications about new requests from the guest
     * while handlers were not attached, so "kick" the virt queue to process
     * those requests now.
     */

Paolo
Hanna Czenczek Jan. 2, 2024, 4:55 p.m. UTC | #12
On 02.01.24 16:53, Paolo Bonzini wrote:
> On Tue, Jan 2, 2024 at 4:24 PM Hanna Czenczek<hreitz@redhat.com>  wrote:
>> I’ve attached the preliminary patch that I didn’t get to send (or test
>> much) last year.  Not sure if it has the same CPU-usage-spike issue
>> Fiona was seeing, the only functional difference is that I notify the vq
>> after attaching the notifiers instead of before.
> I think the patch makes sense and cleaning up the logic of aio_poll
> (which is one of those functions that grew and grew without much
> clarity into who did what) can be done on top.
>
> Just one small thing, the virtio_queue_notify_vq() call is required
> because the virtqueue interrupt and eventfd are edge-triggered rather
> than level-triggered; so perhaps it should be placed in the
> function(s) that establish the handlers,
> virtio_queue_aio_attach_host_notifier() and
> virtio_queue_aio_attach_host_notifier_no_poll()? Neither
> virtio_blk_drained_end() nor virtio_scsi_drained_end() are
> particularly special, and the comment applies just as well:
>
>      /*
>       * We will have ignored notifications about new requests from the guest
>       * while handlers were not attached, so "kick" the virt queue to process
>       * those requests now.
>       */

I wasn’t entirely sure whether we want to call notify_vq() if we have 
instated the handlers for the first time (in which case someone ought to 
look for existing unprocessed requests anyway), so I decided to limit it 
to drained_end.

But considering that it must be safe to call notify_vq() right after 
instating the handlers (virtio_queue_host_notifier_read() may then be 
called after all), we might as well do so every time, yes.

Hanna
Fiona Ebner Jan. 3, 2024, 11:40 a.m. UTC | #13
Am 02.01.24 um 16:24 schrieb Hanna Czenczek:
> 
> I’ve attached the preliminary patch that I didn’t get to send (or test
> much) last year.  Not sure if it has the same CPU-usage-spike issue
> Fiona was seeing, the only functional difference is that I notify the vq
> after attaching the notifiers instead of before.
> 

Applied the patch on top of c12887e1b0 ("block-coroutine-wrapper: use
qemu_get_current_aio_context()") because it conflicts with b6948ab01d
("virtio-blk: add iothread-vq-mapping parameter").

I'm happy to report that I cannot reproduce the CPU-usage-spike issue
with the patch, but I did run into an assertion failure when trying to
verify that it fixes my original stuck-guest-IO issue. See below for the
backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934

> I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context.

I guess this is not true anymore now that the AioContext locking was
removed?

Back to the CPU-usage-spike issue: I experimented around and it doesn't
seem to matter whether I notify the virt queue before or after attaching
the notifiers. But there's another functional difference. My patch
called virtio_queue_notify() which contains this block:

>     if (vq->host_notifier_enabled) {
>         event_notifier_set(&vq->host_notifier);
>     } else if (vq->handle_output) {
>         vq->handle_output(vdev, vq);

In my testing, the first branch was taken, calling event_notifier_set().
Hanna's patch uses virtio_queue_notify_vq() and there,
vq->handle_output() will be called. That seems to be the relevant
difference regarding the CPU-usage-spike issue.

Best Regards,
Fiona

[0]:

> #0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
> #1  0x00007ffff60e3d9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
> #2  0x00007ffff6094f32 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #3  0x00007ffff607f472 in __GI_abort () at ./stdlib/abort.c:79
> #4  0x00007ffff607f395 in __assert_fail_base (fmt=0x7ffff61f3a90 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
>     assertion=assertion@entry=0x555556246bf8 "ctx == qemu_get_current_aio_context()", 
>     file=file@entry=0x555556246baf "../system/dma-helpers.c", line=line@entry=123, 
>     function=function@entry=0x555556246c70 <__PRETTY_FUNCTION__.1> "dma_blk_cb") at ./assert/assert.c:92
> #5  0x00007ffff608de32 in __GI___assert_fail (assertion=0x555556246bf8 "ctx == qemu_get_current_aio_context()", 
>     file=0x555556246baf "../system/dma-helpers.c", line=123, function=0x555556246c70 <__PRETTY_FUNCTION__.1> "dma_blk_cb")
>     at ./assert/assert.c:101
> #6  0x0000555555b83425 in dma_blk_cb (opaque=0x55555804f150, ret=0) at ../system/dma-helpers.c:123
> #7  0x0000555555b839ec in dma_blk_io (ctx=0x555557404310, sg=0x5555588ca6f8, offset=70905856, align=512, 
>     io_func=0x555555a94a87 <scsi_dma_readv>, io_func_opaque=0x55555817ea00, cb=0x555555a8d99f <scsi_dma_complete>, opaque=0x55555817ea00, 
>     dir=DMA_DIRECTION_FROM_DEVICE) at ../system/dma-helpers.c:236
> #8  0x0000555555a8de9a in scsi_do_read (r=0x55555817ea00, ret=0) at ../hw/scsi/scsi-disk.c:431
> #9  0x0000555555a8e249 in scsi_read_data (req=0x55555817ea00) at ../hw/scsi/scsi-disk.c:501
> #10 0x0000555555a897e3 in scsi_req_continue (req=0x55555817ea00) at ../hw/scsi/scsi-bus.c:1478
> #11 0x0000555555d8270e in virtio_scsi_handle_cmd_req_submit (s=0x555558669af0, req=0x5555588ca6b0) at ../hw/scsi/virtio-scsi.c:828
> #12 0x0000555555d82937 in virtio_scsi_handle_cmd_vq (s=0x555558669af0, vq=0x555558672550) at ../hw/scsi/virtio-scsi.c:870
> #13 0x0000555555d829a9 in virtio_scsi_handle_cmd (vdev=0x555558669af0, vq=0x555558672550) at ../hw/scsi/virtio-scsi.c:883
> #14 0x0000555555db3784 in virtio_queue_notify_vq (vq=0x555558672550) at ../hw/virtio/virtio.c:2268
> #15 0x0000555555d8346a in virtio_scsi_drained_end (bus=0x555558669d88) at ../hw/scsi/virtio-scsi.c:1179
> #16 0x0000555555a8a549 in scsi_device_drained_end (sdev=0x555558105000) at ../hw/scsi/scsi-bus.c:1774
> #17 0x0000555555a931db in scsi_disk_drained_end (opaque=0x555558105000) at ../hw/scsi/scsi-disk.c:2369
> #18 0x0000555555ee439c in blk_root_drained_end (child=0x5555574065d0) at ../block/block-backend.c:2829
> #19 0x0000555555ef0ac3 in bdrv_parent_drained_end_single (c=0x5555574065d0) at ../block/io.c:74
> #20 0x0000555555ef0b02 in bdrv_parent_drained_end (bs=0x555557409f80, ignore=0x0) at ../block/io.c:89
> #21 0x0000555555ef1b1b in bdrv_do_drained_end (bs=0x555557409f80, parent=0x0) at ../block/io.c:421
> #22 0x0000555555ef1b5a in bdrv_drained_end (bs=0x555557409f80) at ../block/io.c:428
> #23 0x0000555555efcf64 in mirror_exit_common (job=0x5555588b8220) at ../block/mirror.c:798
> #24 0x0000555555efcfde in mirror_abort (job=0x5555588b8220) at ../block/mirror.c:814
> #25 0x0000555555ec53ea in job_abort (job=0x5555588b8220) at ../job.c:825
> #26 0x0000555555ec54d5 in job_finalize_single_locked (job=0x5555588b8220) at ../job.c:855
> #27 0x0000555555ec57cb in job_completed_txn_abort_locked (job=0x5555588b8220) at ../job.c:958
> #28 0x0000555555ec5c20 in job_completed_locked (job=0x5555588b8220) at ../job.c:1065
> #29 0x0000555555ec5cd5 in job_exit (opaque=0x5555588b8220) at ../job.c:1088
> #30 0x000055555608342e in aio_bh_call (bh=0x7fffe400dfd0) at ../util/async.c:169
> #31 0x0000555556083549 in aio_bh_poll (ctx=0x55555718ade0) at ../util/async.c:216
> #32 0x0000555556065203 in aio_dispatch (ctx=0x55555718ade0) at ../util/aio-posix.c:423
> #33 0x0000555556083988 in aio_ctx_dispatch (source=0x55555718ade0, callback=0x0, user_data=0x0) at ../util/async.c:358
> #34 0x00007ffff753e7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #35 0x00005555560850ae in glib_pollfds_poll () at ../util/main-loop.c:290
> #36 0x000055555608512b in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:313
> #37 0x0000555556085239 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #38 0x0000555555b8d501 in qemu_main_loop () at ../system/runstate.c:782
> #39 0x0000555555e55587 in qemu_default_main () at ../system/main.c:37
> #40 0x0000555555e555c2 in main (argc=68, argv=0x7fffffffd8b8) at ../system/main.c:48
Paolo Bonzini Jan. 3, 2024, 1:35 p.m. UTC | #14
On 1/3/24 12:40, Fiona Ebner wrote:
> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
> with the patch, but I did run into an assertion failure when trying to
> verify that it fixes my original stuck-guest-IO issue. See below for the
> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
> 
>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>> acquire the device’s context, so this should be directly callable from
>> any context.
> 
> I guess this is not true anymore now that the AioContext locking was
> removed?

Good point and, in fact, even before it was much safer to use 
virtio_queue_notify() instead.  Not only does it use the event notifier 
handler, but it also calls it in the right thread/AioContext just by 
doing event_notifier_set().

The call to virtio_queue_set_notification(..., 1) is safe; not sure 
about the call to virtio_queue_set_notification(..., 0) though.  I'd 
rather have that executed synchronously in the AioContext using 
aio_wait_bh_oneshot().  This is consistent with the pattern used by 
virtio_scsi_dataplane_stop() and virtio_blk_data_plane_stop().
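
Something along these lines, perhaps (sketch only; the helper name and the
exact call site are assumptions):

    static void virtio_queue_disable_notification_bh(void *opaque)
    {
        VirtQueue *vq = opaque;

        virtio_queue_set_notification(vq, 0);
    }

    /* at the call site, e.g. where the host notifier is being detached: */
    aio_wait_bh_oneshot(ctx, virtio_queue_disable_notification_bh, vq);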

Paolo
Fiona Ebner Jan. 5, 2024, 1:43 p.m. UTC | #15
Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
> On 1/3/24 12:40, Fiona Ebner wrote:
>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>> with the patch, but I did run into an assertion failure when trying to
>> verify that it fixes my original stuck-guest-IO issue. See below for the
>> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
>>
>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>> acquire the device’s context, so this should be directly callable from
>>> any context.
>>
>> I guess this is not true anymore now that the AioContext locking was
>> removed?
> 
> Good point and, in fact, even before it was much safer to use
> virtio_queue_notify() instead.  Not only does it use the event notifier
> handler, but it also calls it in the right thread/AioContext just by
> doing event_notifier_set().
> 

But with virtio_queue_notify() using the event notifier, the
CPU-usage-spike issue is present:

>> Back to the CPU-usage-spike issue: I experimented around and it doesn't
>> seem to matter whether I notify the virt queue before or after attaching
>> the notifiers. But there's another functional difference. My patch
>> called virtio_queue_notify() which contains this block:
>> 
>>>     if (vq->host_notifier_enabled) {
>>>         event_notifier_set(&vq->host_notifier);
>>>     } else if (vq->handle_output) {
>>>         vq->handle_output(vdev, vq);
>> 
>> In my testing, the first branch was taken, calling event_notifier_set().
>> Hanna's patch uses virtio_queue_notify_vq() and there,
>> vq->handle_output() will be called. That seems to be the relevant
>> difference regarding the CPU-usage-spike issue.

I should mention that this is with a VirtIO SCSI disk. I also attempted
to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
didn't manage yet.

What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
the queues (but only one) will always show as nonempty. And then,
run_poll_handlers_once() will always detect progress which explains the
CPU usage.

The following shows
1. vq address
2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
3. number of times the result of virtio_queue_host_notifier_aio_poll()
was true for the vq

> 0x555fd93f9c80 17162000 0
> 0x555fd93f9e48 17162000 6
> 0x555fd93f9ee0 17162000 0
> 0x555fd93f9d18 17162000 17162000
> 0x555fd93f9db0 17162000 0
> 0x555fd93f9f78 17162000 0

And for the problematic one, the reason it is seen as nonempty is:

> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0

Those values stay like this while the call counts above increase.

So something going wrong with the indices when the event notifier is set
from QEMU side (in the main thread)?

The guest is Debian 12 with a 6.1 kernel.

Best Regards,
Fiona
Fiona Ebner Jan. 5, 2024, 2:30 p.m. UTC | #16
Am 05.01.24 um 14:43 schrieb Fiona Ebner:
> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>> On 1/3/24 12:40, Fiona Ebner wrote:
>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>> with the patch, but I did run into an assertion failure when trying to
>>> verify that it fixes my original stuck-guest-IO issue. See below for the
>>> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
>>>
>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>> acquire the device’s context, so this should be directly callable from
>>>> any context.
>>>
>>> I guess this is not true anymore now that the AioContext locking was
>>> removed?
>>
>> Good point and, in fact, even before it was much safer to use
>> virtio_queue_notify() instead.  Not only does it use the event notifier
>> handler, but it also calls it in the right thread/AioContext just by
>> doing event_notifier_set().
>>
> 
> But with virtio_queue_notify() using the event notifier, the
> CPU-usage-spike issue is present:
> 
>>> Back to the CPU-usage-spike issue: I experimented around and it doesn't
>>> seem to matter whether I notify the virt queue before or after attaching
>>> the notifiers. But there's another functional difference. My patch
>>> called virtio_queue_notify() which contains this block:
>>>
>>>>     if (vq->host_notifier_enabled) {
>>>>         event_notifier_set(&vq->host_notifier);
>>>>     } else if (vq->handle_output) {
>>>>         vq->handle_output(vdev, vq);
>>>
>>> In my testing, the first branch was taken, calling event_notifier_set().
>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>> vq->handle_output() will be called. That seems to be the relevant
>>> difference regarding the CPU-usage-spike issue.
> 
> I should mention that this is with a VirtIO SCSI disk. I also attempted
> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
> didn't manage yet.
> 
> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
> the queues (but only one) will always show as nonempty. And then,
> run_poll_handlers_once() will always detect progress which explains the
> CPU usage.
> 
> The following shows
> 1. vq address
> 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
> was true for the vq
> 
>> 0x555fd93f9c80 17162000 0
>> 0x555fd93f9e48 17162000 6
>> 0x555fd93f9ee0 17162000 0
>> 0x555fd93f9d18 17162000 17162000
>> 0x555fd93f9db0 17162000 0
>> 0x555fd93f9f78 17162000 0
> 
> And for the problematic one, the reason it is seen as nonempty is:
> 
>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
> 

vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
s->events_dropped is false in my testing, so
virtio_scsi_handle_event_vq() doesn't do anything.

> Those values stay like this while the call counts above increase.
> 
> So something going wrong with the indices when the event notifier is set
> from QEMU side (in the main thread)?
> 
> The guest is Debian 12 with a 6.1 kernel.

Best Regards,
Fiona
Hanna Czenczek Jan. 22, 2024, 5:41 p.m. UTC | #17
On 05.01.24 15:30, Fiona Ebner wrote:
> Am 05.01.24 um 14:43 schrieb Fiona Ebner:
>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>>> On 1/3/24 12:40, Fiona Ebner wrote:
>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>>> with the patch, but I did run into an assertion failure when trying to
>>>> verify that it fixes my original stuck-guest-IO issue. See below for the
>>>> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
>>>>
>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>>> acquire the device’s context, so this should be directly callable from
>>>>> any context.
>>>> I guess this is not true anymore now that the AioContext locking was
>>>> removed?
>>> Good point and, in fact, even before it was much safer to use
>>> virtio_queue_notify() instead.  Not only does it use the event notifier
>>> handler, but it also calls it in the right thread/AioContext just by
>>> doing event_notifier_set().
>>>
>> But with virtio_queue_notify() using the event notifier, the
>> CPU-usage-spike issue is present:
>>
>>>> Back to the CPU-usage-spike issue: I experimented around and it doesn't
>>>> seem to matter whether I notify the virt queue before or after attaching
>>>> the notifiers. But there's another functional difference. My patch
>>>> called virtio_queue_notify() which contains this block:
>>>>
>>>>>      if (vq->host_notifier_enabled) {
>>>>>          event_notifier_set(&vq->host_notifier);
>>>>>      } else if (vq->handle_output) {
>>>>>          vq->handle_output(vdev, vq);
>>>> In my testing, the first branch was taken, calling event_notifier_set().
>>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>>> vq->handle_output() will be called. That seems to be the relevant
>>>> difference regarding the CPU-usage-spike issue.
>> I should mention that this is with a VirtIO SCSI disk. I also attempted
>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
>> didn't manage yet.
>>
>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
>> the queues (but only one) will always show as nonempty. And then,
>> run_poll_handlers_once() will always detect progress which explains the
>> CPU usage.
>>
>> The following shows
>> 1. vq address
>> 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
>> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
>> was true for the vq
>>
>>> 0x555fd93f9c80 17162000 0
>>> 0x555fd93f9e48 17162000 6
>>> 0x555fd93f9ee0 17162000 0
>>> 0x555fd93f9d18 17162000 17162000
>>> 0x555fd93f9db0 17162000 0
>>> 0x555fd93f9f78 17162000 0
>> And for the problematic one, the reason it is seen as nonempty is:
>>
>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
> s->events_dropped is false in my testing, so
> virtio_scsi_handle_event_vq() doesn't do anything.
>
>> Those values stay like this while the call counts above increase.
>>
>> So something going wrong with the indices when the event notifier is set
>> from QEMU side (in the main thread)?
>>
>> The guest is Debian 12 with a 6.1 kernel.

So, trying to figure out a new RFC version:

About the stack trace you, Fiona, posted:  As far as I understand, that 
happens because virtio_blk_drained_end() calling 
virtio_queue_notify_vq() wasn’t safe after all, and instead we need to 
use virtio_queue_notify().  Right?

However, you say using virtio_queue_notify() instead causes busy loops 
of doing nothing in virtio-scsi (what you describe above).  I mean, 
better than a crash, but, you know. :)

I don’t have any prior knowledge about the event handling, 
unfortunately.  The fact that 8 buffers are available but we don’t use 
any sounds OK to me; as far as I understand, we only use those buffers 
if we have any events to push into them, so as long as we don’t, we 
won’t.  Question is, should we not have its poll handler return false if 
we don’t have any events (i.e. events_dropped is false)?  Would that 
solve it?
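
I.e., something in this direction (very rough sketch; in reality the poll
callback receives the EventNotifier rather than the VirtIOSCSI pointer, so
the opaque handling is simplified here):

    /* hypothetical poll callback used only for the event vq */
    static bool virtio_scsi_event_vq_aio_poll(void *opaque)
    {
        VirtIOSCSI *s = opaque;

        /*
         * Available buffers alone are not work for us to do; only report
         * progress when there are dropped events waiting to be delivered.
         */
        return s->events_dropped;
    }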

Hanna
Hanna Czenczek Jan. 22, 2024, 5:52 p.m. UTC | #18
On 22.01.24 18:41, Hanna Czenczek wrote:
> On 05.01.24 15:30, Fiona Ebner wrote:
>> Am 05.01.24 um 14:43 schrieb Fiona Ebner:
>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>>>> On 1/3/24 12:40, Fiona Ebner wrote:
>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>>>> with the patch, but I did run into an assertion failure when 
>>>>> trying to
>>>>> verify that it fixes my original stuck-guest-IO issue. See below 
>>>>> for the
>>>>> backtrace [0]. Hanna wrote in 
>>>>> https://issues.redhat.com/browse/RHEL-3934
>>>>>
>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, 
>>>>>> because
>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>>>> acquire the device’s context, so this should be directly callable 
>>>>>> from
>>>>>> any context.
>>>>> I guess this is not true anymore now that the AioContext locking was
>>>>> removed?
>>>> Good point and, in fact, even before it was much safer to use
>>>> virtio_queue_notify() instead.  Not only does it use the event 
>>>> notifier
>>>> handler, but it also calls it in the right thread/AioContext just by
>>>> doing event_notifier_set().
>>>>
>>> But with virtio_queue_notify() using the event notifier, the
>>> CPU-usage-spike issue is present:
>>>
>>>>> Back to the CPU-usage-spike issue: I experimented around and it 
>>>>> doesn't
>>>>> seem to matter whether I notify the virt queue before or after 
>>>>> attaching
>>>>> the notifiers. But there's another functional difference. My patch
>>>>> called virtio_queue_notify() which contains this block:
>>>>>
>>>>>>      if (vq->host_notifier_enabled) {
>>>>>>          event_notifier_set(&vq->host_notifier);
>>>>>>      } else if (vq->handle_output) {
>>>>>>          vq->handle_output(vdev, vq);
>>>>> In my testing, the first branch was taken, calling 
>>>>> event_notifier_set().
>>>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>>>> vq->handle_output() will be called. That seems to be the relevant
>>>>> difference regarding the CPU-usage-spike issue.
>>> I should mention that this is with a VirtIO SCSI disk. I also attempted
>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
>>> didn't manage yet.
>>>
>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
>>> the queues (but only one) will always show as nonempty. And then,
>>> run_poll_handlers_once() will always detect progress which explains the
>>> CPU usage.
>>>
>>> The following shows
>>> 1. vq address
>>> 2. number of times vq was passed to 
>>> virtio_queue_host_notifier_aio_poll()
>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
>>> was true for the vq
>>>
>>>> 0x555fd93f9c80 17162000 0
>>>> 0x555fd93f9e48 17162000 6
>>>> 0x555fd93f9ee0 17162000 0
>>>> 0x555fd93f9d18 17162000 17162000
>>>> 0x555fd93f9db0 17162000 0
>>>> 0x555fd93f9f78 17162000 0
>>> And for the problematic one, the reason it is seen as nonempty is:
>>>
>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
>> s->events_dropped is false in my testing, so
>> virtio_scsi_handle_event_vq() doesn't do anything.
>>
>>> Those values stay like this while the call counts above increase.
>>>
>>> So something going wrong with the indices when the event notifier is 
>>> set
>>> from QEMU side (in the main thread)?
>>>
>>> The guest is Debian 12 with a 6.1 kernel.
>
> So, trying to figure out a new RFC version:
>
> About the stack trace you, Fiona, posted:  As far as I understand, 
> that happens because virtio_blk_drained_end() calling 
> virtio_queue_notify_vq() wasn’t safe after all, and instead we need to 
> use virtio_queue_notify().  Right?
>
> However, you say using virtio_queue_notify() instead causes busy loops 
> of doing nothing in virtio-scsi (what you describe above). I mean, 
> better than a crash, but, you know. :)
>
> I don’t have any prior knowledge about the event handling, 
> unfortunately.  The fact that 8 buffers are available but we don’t use 
> any sounds OK to me; as far as I understand, we only use those buffers 
> if we have any events to push into them, so as long as we don’t, we 
> won’t.  Question is, should we not have its poll handler return false 
> if we don’t have any events (i.e. events_dropped is false)?  Would 
> that solve it?

Or actually, maybe we could just skip the virtio_queue_notify() call for 
the event vq?  I.e. have it be `if (vq != 
VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`?  I 
wouldn’t like that very much, (a) because this would make it slightly 
cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), 
and (b) in case that does fix it, I do kind of feel like the real 
problem is that we use virtio_queue_host_notifier_aio_poll() for the 
event vq, which tells the polling code to poll whenever the vq is 
non-empty, but we (AFAIU) expect the event vq to be non-empty all the time.

Hanna
Fiona Ebner Jan. 23, 2024, 11:12 a.m. UTC | #19
Am 22.01.24 um 18:52 schrieb Hanna Czenczek:
> On 22.01.24 18:41, Hanna Czenczek wrote:
>> On 05.01.24 15:30, Fiona Ebner wrote:
>>> Am 05.01.24 um 14:43 schrieb Fiona Ebner:
>>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>>>>> On 1/3/24 12:40, Fiona Ebner wrote:
>>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>>>>> with the patch, but I did run into an assertion failure when
>>>>>> trying to
>>>>>> verify that it fixes my original stuck-guest-IO issue. See below
>>>>>> for the
>>>>>> backtrace [0]. Hanna wrote in
>>>>>> https://issues.redhat.com/browse/RHEL-3934
>>>>>>
>>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call,
>>>>>>> because
>>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>>>>> acquire the device’s context, so this should be directly callable
>>>>>>> from
>>>>>>> any context.
>>>>>> I guess this is not true anymore now that the AioContext locking was
>>>>>> removed?
>>>>> Good point and, in fact, even before it was much safer to use
>>>>> virtio_queue_notify() instead.  Not only does it use the event
>>>>> notifier
>>>>> handler, but it also calls it in the right thread/AioContext just by
>>>>> doing event_notifier_set().
>>>>>
>>>> But with virtio_queue_notify() using the event notifier, the
>>>> CPU-usage-spike issue is present:
>>>>
>>>>>> Back to the CPU-usage-spike issue: I experimented around and it
>>>>>> doesn't
>>>>>> seem to matter whether I notify the virt queue before or after
>>>>>> attaching
>>>>>> the notifiers. But there's another functional difference. My patch
>>>>>> called virtio_queue_notify() which contains this block:
>>>>>>
>>>>>>>      if (vq->host_notifier_enabled) {
>>>>>>>          event_notifier_set(&vq->host_notifier);
>>>>>>>      } else if (vq->handle_output) {
>>>>>>>          vq->handle_output(vdev, vq);
>>>>>> In my testing, the first branch was taken, calling
>>>>>> event_notifier_set().
>>>>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>>>>> vq->handle_output() will be called. That seems to be the relevant
>>>>>> difference regarding the CPU-usage-spike issue.
>>>> I should mention that this is with a VirtIO SCSI disk. I also attempted
>>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
>>>> didn't manage yet.
>>>>
>>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
>>>> the queues (but only one) will always show as nonempty. And then,
>>>> run_poll_handlers_once() will always detect progress which explains the
>>>> CPU usage.
>>>>
>>>> The following shows
>>>> 1. vq address
>>>> 2. number of times vq was passed to
>>>> virtio_queue_host_notifier_aio_poll()
>>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
>>>> was true for the vq
>>>>
>>>>> 0x555fd93f9c80 17162000 0
>>>>> 0x555fd93f9e48 17162000 6
>>>>> 0x555fd93f9ee0 17162000 0
>>>>> 0x555fd93f9d18 17162000 17162000
>>>>> 0x555fd93f9db0 17162000 0
>>>>> 0x555fd93f9f78 17162000 0
>>>> And for the problematic one, the reason it is seen as nonempty is:
>>>>
>>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
>>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
>>> s->events_dropped is false in my testing, so
>>> virtio_scsi_handle_event_vq() doesn't do anything.
>>>
>>>> Those values stay like this while the call counts above increase.
>>>>
>>>> So something going wrong with the indices when the event notifier is
>>>> set
>>>> from QEMU side (in the main thread)?
>>>>
>>>> The guest is Debian 12 with a 6.1 kernel.
>>
>> So, trying to figure out a new RFC version:
>>
>> About the stack trace you, Fiona, posted:  As far as I understand,
>> that happens because virtio_blk_drained_end() calling
>> virtio_queue_notify_vq() wasn’t safe after all, and instead we need to
>> use virtio_queue_notify().  Right?
>>

AFAICT, yes. In particular, after 4f36b13847 ("scsi: remove AioContext
locking"), the AioContext is not acquired by
virtio_scsi_handle_{cmd,ctrl,event} anymore.

>> However, you say using virtio_queue_notify() instead causes busy loops
>> of doing nothing in virtio-scsi (what you describe above). I mean,
>> better than a crash, but, you know. :)

Yes, that happens for me when using virtio_queue_notify(), i.e.

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 690aceec45..8cdf04ac2d 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1166,7 +1166,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>  
>      for (uint32_t i = 0; i < total_queues; i++) {
>          VirtQueue *vq = virtio_get_queue(vdev, i);
> +        if (!virtio_queue_get_notification(vq)) {
> +            virtio_queue_set_notification(vq, true);
> +        }
>          virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +        virtio_queue_notify(vdev, i);
>      }
>  }
>  


>>
>> I don’t have any prior knowledge about the event handling,
>> unfortunately.  The fact that 8 buffers are available but we don’t use
>> any sounds OK to me; as far as I understand, we only use those buffers
>> if we have any events to push into them, so as long as we don’t, we
>> won’t.  Question is, should we not have its poll handler return false
>> if we don’t have any events (i.e. events_dropped is false)?  Would
>> that solve it?
> 
> Or actually, maybe we could just skip the virtio_queue_notify() call for
> the event vq?  I.e. have it be `if (vq !=
> VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`?

That seems to avoid the CPU-usage-spike issue :)

> I wouldn’t like that very much, (a) because this would make it slightly
> cumbersome to put that into virtio_queue_aio_attach_host_notifier*(),
> and (b) in case that does fix it, I do kind of feel like the real
> problem is that we use virtio_queue_host_notifier_aio_poll() for the
> event vq, which tells the polling code to poll whenever the vq is
> non-empty, but we (AFAIU) expect the event vq to be non-empty all the time.
> 

AFAIU, (at least in my testing) it's only non-empty after it was
notified via virtio_scsi_drained_end() once. But it's hard to tell,
because it seems that the poll callback is only called after the first
drain. I noticed poll_set_started() is not called, because
ctx->fdmon_ops->need_wait(ctx) was true, i.e. ctx->poll_disable_cnt was
positive (I'm using fdmon_poll). I then found that this is because the
notifier for the event vq is attached with

> virtio_queue_aio_attach_host_notifier_no_poll(vs->event_vq, s->ctx);

in virtio_scsi_dataplane_start(). But in virtio_scsi_drained_end() it is
attached with virtio_queue_aio_attach_host_notifier() instead of the
_no_poll() variant. So that might be the actual issue here?

From a quick test, I cannot see the CPU-usage-spike issue with the
following either:

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 690aceec45..ba1ab8e410 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1166,7 +1166,15 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>  
>      for (uint32_t i = 0; i < total_queues; i++) {
>          VirtQueue *vq = virtio_get_queue(vdev, i);
> -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +        if (!virtio_queue_get_notification(vq)) {
> +            virtio_queue_set_notification(vq, true);
> +        }
> +        if (vq == VIRTIO_SCSI_COMMON(s)->event_vq) {
> +            virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
> +        } else {
> +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +        }
> +        virtio_queue_notify(vdev, i);
>      }
>  }
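
That would also explain why the plain attach variant is problematic
specifically for the event vq: the poll callback it installs reports
progress whenever the ring is non-empty, and the event vq has buffers
queued essentially all the time. Roughly (simplified from
hw/virtio/virtio.c, exact shape not guaranteed):

static bool virtio_queue_host_notifier_aio_poll(void *opaque)
{
    EventNotifier *n = opaque;
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    /*
     * "Non-empty" only means the guest made buffers available, not that
     * there is anything to deliver.  For the event vq that is the normal
     * steady state, so run_poll_handlers_once() keeps seeing "progress".
     */
    return vq->vring.desc && !virtio_queue_empty(vq);
}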

Best Regards,
Fiona
Hanna Czenczek Jan. 23, 2024, 11:15 a.m. UTC | #20
On 22.01.24 18:52, Hanna Czenczek wrote:
> On 22.01.24 18:41, Hanna Czenczek wrote:
>> On 05.01.24 15:30, Fiona Ebner wrote:
>>> Am 05.01.24 um 14:43 schrieb Fiona Ebner:
>>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>>>>> On 1/3/24 12:40, Fiona Ebner wrote:
>>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike 
>>>>>> issue
>>>>>> with the patch, but I did run into an assertion failure when 
>>>>>> trying to
>>>>>> verify that it fixes my original stuck-guest-IO issue. See below 
>>>>>> for the
>>>>>> backtrace [0]. Hanna wrote in 
>>>>>> https://issues.redhat.com/browse/RHEL-3934
>>>>>>
>>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, 
>>>>>>> because
>>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() 
>>>>>>> implementations
>>>>>>> acquire the device’s context, so this should be directly 
>>>>>>> callable from
>>>>>>> any context.
>>>>>> I guess this is not true anymore now that the AioContext locking was
>>>>>> removed?
>>>>> Good point and, in fact, even before it was much safer to use
>>>>> virtio_queue_notify() instead.  Not only does it use the event 
>>>>> notifier
>>>>> handler, but it also calls it in the right thread/AioContext just by
>>>>> doing event_notifier_set().
>>>>>
>>>> But with virtio_queue_notify() using the event notifier, the
>>>> CPU-usage-spike issue is present:
>>>>
>>>>>> Back to the CPU-usage-spike issue: I experimented around and it 
>>>>>> doesn't
>>>>>> seem to matter whether I notify the virt queue before or after 
>>>>>> attaching
>>>>>> the notifiers. But there's another functional difference. My patch
>>>>>> called virtio_queue_notify() which contains this block:
>>>>>>
>>>>>>>      if (vq->host_notifier_enabled) {
>>>>>>> event_notifier_set(&vq->host_notifier);
>>>>>>>      } else if (vq->handle_output) {
>>>>>>>          vq->handle_output(vdev, vq);
>>>>>> In my testing, the first branch was taken, calling 
>>>>>> event_notifier_set().
>>>>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>>>>> vq->handle_output() will be called. That seems to be the relevant
>>>>>> difference regarding the CPU-usage-spike issue.
>>>> I should mention that this is with a VirtIO SCSI disk. I also 
>>>> attempted
>>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
>>>> didn't manage yet.
>>>>
>>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), 
>>>> one of
>>>> the queues (but only one) will always show as nonempty. And then,
>>>> run_poll_handlers_once() will always detect progress which explains 
>>>> the
>>>> CPU usage.
>>>>
>>>> The following shows
>>>> 1. vq address
>>>> 2. number of times vq was passed to 
>>>> virtio_queue_host_notifier_aio_poll()
>>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
>>>> was true for the vq
>>>>
>>>>> 0x555fd93f9c80 17162000 0
>>>>> 0x555fd93f9e48 17162000 6
>>>>> 0x555fd93f9ee0 17162000 0
>>>>> 0x555fd93f9d18 17162000 17162000
>>>>> 0x555fd93f9db0 17162000 0
>>>>> 0x555fd93f9f78 17162000 0
>>>> And for the problematic one, the reason it is seen as nonempty is:
>>>>
>>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
>>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
>>> s->events_dropped is false in my testing, so
>>> virtio_scsi_handle_event_vq() doesn't do anything.
>>>
>>>> Those values stay like this while the call counts above increase.
>>>>
>>>> So something going wrong with the indices when the event notifier 
>>>> is set
>>>> from QEMU side (in the main thread)?
>>>>
>>>> The guest is Debian 12 with a 6.1 kernel.
>>
>> So, trying to figure out a new RFC version:
>>
>> About the stack trace you, Fiona, posted:  As far as I understand, 
>> that happens because virtio_blk_drained_end() calling 
>> virtio_queue_notify_vq() wasn’t safe after all, and instead we need 
>> to use virtio_queue_notify().  Right?
>>
>> However, you say using virtio_queue_notify() instead causes busy 
>> loops of doing nothing in virtio-scsi (what you describe above). I 
>> mean, better than a crash, but, you know. :)
>>
>> I don’t have any prior knowledge about the event handling, 
>> unfortunately.  The fact that 8 buffers are available but we don’t 
>> use any sounds OK to me; as far as I understand, we only use those 
>> buffers if we have any events to push into them, so as long as we 
>> don’t, we won’t.  Question is, should we not have its poll handler 
>> return false if we don’t have any events (i.e. events_dropped is 
>> false)?  Would that solve it?
>
> Or actually, maybe we could just skip the virtio_queue_notify() call 
> for the event vq?  I.e. have it be `if (vq != 
> VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`?  
> I wouldn’t like that very much, (a) because this would make it 
> slightly cumbersome to put that into 
> virtio_queue_aio_attach_host_notifier*(), and (b) in case that does 
> fix it, I do kind of feel like the real problem is that we use 
> virtio_queue_host_notifier_aio_poll() for the event vq, which tells 
> the polling code to poll whenever the vq is non-empty, but we (AFAIU) 
> expect the event vq to be non-empty all the time.

Turns out there was commit 38738f7dbbda90fbc161757b7f4be35b52205552 
(“virtio-scsi: don't waste CPU polling the event virtqueue”) by Stefan, 
which specifically intended to not use 
virtio_queue_host_notifier_aio_poll() for the event vq.  I think the 
problem is that virtio_scsi_drained_end() should have taken care to use 
virtio_queue_aio_attach_host_notifier_no_poll() for the event vq.

If we do that, I think running virtio_queue_notify() on the event vq too 
should be reasonable.  We still want to check whether there are new 
buffers available in case we have events_dropped.  I don’t know whether 
it’s really necessary, but without virtio_queue_host_notifier_aio_poll() 
installed, it shouldn’t hurt.
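
(For anyone following along, the difference between the two attach
variants is roughly the following -- simplified from hw/virtio/virtio.c,
so treat the exact signatures as approximate:)

void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
{
    /* Installs poll callbacks, so the AioContext may busy-poll this vq. */
    aio_set_event_notifier(ctx, &vq->host_notifier,
                           virtio_queue_host_notifier_read,
                           virtio_queue_host_notifier_aio_poll,
                           virtio_queue_host_notifier_aio_poll_ready);
    aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                virtio_queue_host_notifier_aio_poll_begin,
                                virtio_queue_host_notifier_aio_poll_end);
}

void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
{
    /* No poll callback: the vq is only serviced on actual notifier kicks. */
    aio_set_event_notifier(ctx, &vq->host_notifier,
                           virtio_queue_host_notifier_read,
                           NULL, NULL);
}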

Hanna
Hanna Czenczek Jan. 23, 2024, 11:25 a.m. UTC | #21
On 23.01.24 12:12, Fiona Ebner wrote:

[...]

> I noticed poll_set_started() is not called, because
> ctx->fdmon_ops->need_wait(ctx) was true, i.e. ctx->poll_disable_cnt was
> positive (I'm using fdmon_poll). I then found this is because the
> notifier for the event vq is attached with
>
>> virtio_queue_aio_attach_host_notifier_no_poll(vs->event_vq, s->ctx);
> in virtio_scsi_dataplane_start(). But in virtio_scsi_drained_end() it is
> attached with virtio_queue_aio_attach_host_notifier() instead of the
> _no_poll() variant. So that might be the actual issue here?
>
>  From a quick test, I cannot see the CPU-usage-spike issue with the
> following either:
>
>> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
>> index 690aceec45..ba1ab8e410 100644
>> --- a/hw/scsi/virtio-scsi.c
>> +++ b/hw/scsi/virtio-scsi.c
>> @@ -1166,7 +1166,15 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>>   
>>       for (uint32_t i = 0; i < total_queues; i++) {
>>           VirtQueue *vq = virtio_get_queue(vdev, i);
>> -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
>> +        if (!virtio_queue_get_notification(vq)) {
>> +            virtio_queue_set_notification(vq, true);
>> +        }
>> +        if (vq == VIRTIO_SCSI_COMMON(s)->event_vq) {
>> +            virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
>> +        } else {
>> +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
>> +        }
>> +        virtio_queue_notify(vdev, i);
>>       }
>>   }

Perfect, so we agree on trying it that way. :)

Hanna
Hanna Czenczek Jan. 23, 2024, 4:28 p.m. UTC | #22
On 02.01.24 16:24, Hanna Czenczek wrote:

[...]

> I’ve attached the preliminary patch that I didn’t get to send (or test 
> much) last year.  Not sure if it has the same CPU-usage-spike issue 
> Fiona was seeing, the only functional difference is that I notify the 
> vq after attaching the notifiers instead of before.

It’ll take me some more time to send a patch mail to that effect, 
because now there’s an assertion failure on hotunplug, which bisects 
back to eaad0fe26050c227dc5dad63205835bac4912a51 (“scsi: only access 
SCSIDevice->requests from one thread”):

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: 
Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x00007f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x00007f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x00007f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x00007f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2417
#7  0x0000556e6b112d87 in scsi_device_for_each_req_async_bh 
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at 
../util/async.c:218
#9  0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x0000556e6b4564b6 in iothread_run 
(opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at 
../util/qemu-thread-posix.c:541
#12 0x00007f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x00007f32a670f7cc in  () at /usr/lib/libc.so.6
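
For context, the assertion that fires is (roughly) the consistency check
in blk_get_aio_context() between the BlockBackend's cached AioContext and
the root node's current one -- approximate shape only, see
block/block-backend.c for the real thing:

AioContext *blk_get_aio_context(BlockBackend *blk)
{
    BlockDriverState *bs = blk_bs(blk);

    if (bs) {
        AioContext *ctx = bdrv_get_aio_context(bs);
        /* blk->ctx is a cache; here it disagrees with the node's context */
        assert(ctx == blk->ctx);
    }

    return blk->ctx;
}

So the BH appears to run at a point where the backend's cached context no
longer matches the node it is attached to.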