Message ID | 20180803154955.25251-1-famz@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] async: Fix aio_notify_accept |
On 03/08/2018 17:49, Fam Zheng wrote:
>  void aio_notify_accept(AioContext *ctx)
>  {
> -    if (atomic_xchg(&ctx->notified, false)) {
> +    /* If ctx->notify_me >= 2, another aio_poll() is waiting which may need
> +     * the ctx->notifier event to wake up, so don't clear it prematurely
> +     * just because "we" are done iterating. */
> +    if (atomic_read(&ctx->notify_me) < 2
> +        && atomic_xchg(&ctx->notified, false)) {
>          event_notifier_test_and_clear(&ctx->notifier);
>      }
>  }

I'm worried that this could cause a busy wait, and I don't understand the
issue.  When aio_poll()s are nested, the outer calls are in the "dispatch"
phase and therefore do not need notification.  In your situation, is
notify_me actually ever > 2?

Paolo
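For anyone following along, the interplay between notify_me, notified, and
the event notifier is easiest to see in a stripped-down model.  Below is a
self-contained C11 sketch of the protocol as I understand it from
util/async.c and util/aio-posix.c of this period; the names mirror QEMU's,
but the bodies are simplified stand-ins rather than the upstream code:

#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the AioContext wakeup protocol, not the QEMU source. */
typedef struct {
    atomic_int  notify_me;  /* raised by 2 around a blocking aio_poll() */
    atomic_bool notified;   /* a wakeup was requested and not yet consumed */
    atomic_bool notifier;   /* stands in for the EventNotifier's fd state */
} AioContext;

void aio_notify(AioContext *ctx)
{
    /* Kick the event only if someone is (about to be) blocked; a
     * non-blocking poller re-scans its sources anyway. */
    if (atomic_load(&ctx->notify_me)) {
        atomic_store(&ctx->notifier, true);
        atomic_store(&ctx->notified, true);
    }
}

void aio_notify_accept(AioContext *ctx)
{
    /* Pre-patch behavior: unconditionally consume the notification. */
    if (atomic_exchange(&ctx->notified, false)) {
        atomic_store(&ctx->notifier, false);  /* test-and-clear */
    }
}

bool aio_poll(AioContext *ctx, bool blocking)
{
    if (blocking) {
        atomic_fetch_add(&ctx->notify_me, 2);  /* announce: about to block */
    }
    /* ... block here until ctx->notifier or another source fires ... */
    if (blocking) {
        atomic_fetch_sub(&ctx->notify_me, 2);
    }
    aio_notify_accept(ctx);  /* the call whose placement is in question */
    /* ... dispatch ready handlers and bottom halves ... */
    return false;            /* progress tracking elided in this model */
}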
On 03/08/2018 17:49, Fam Zheng wrote:
>  void aio_notify_accept(AioContext *ctx)
>  {
> -    if (atomic_xchg(&ctx->notified, false)) {
> +    /* If ctx->notify_me >= 2, another aio_poll() is waiting which may need
> +     * the ctx->notifier event to wake up, so don't clear it prematurely
> +     * just because "we" are done iterating. */
> +    if (atomic_read(&ctx->notify_me) < 2
> +        && atomic_xchg(&ctx->notified, false)) {
>          event_notifier_test_and_clear(&ctx->notifier);
>      }
>  }

Ok, it's somewhat reassuring to see from the BZ that the aio_poll() in the
main thread (in bdrv_set_aio_context) is non-blocking, and that this isn't
about nested aio_poll().

Then a busy wait is not possible there, because sooner or later the bottom
halves will be exhausted and aio_poll() will return false (no progress).

I'm convinced that the idea in your patch---skipping
aio_notify_accept---is correct; it's the ctx->notify_me test that I cannot
understand.  I'm not saying it's wrong, but it's tricky.  So we need to
improve the comments, the commit message, the way we achieve the fix, or
all three.

As to the comments and commit message: the BZ is a very good source of
information.  The comment on the main thread stealing the aio_notify was
very clear.

As to how to fix it, first of all, we should be clear on the invariants.
It would be nice to assert that, if not in_aio_context_home_thread(ctx),
blocking must be false.  Two concurrent blocking aio_polls would steal
aio_notify from one another, so intuitively that assertion should be true,
and using AIO_WAIT_WHILE takes care of it.

Second, if blocking is false, do we need to call aio_notify_accept at all?
If not, and if we combine this with the assertion above, only the I/O
thread will call aio_notify_accept, and the main loop will never steal the
notification.  That should fix the bug.

Thanks,

Paolo
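Concretely, the two suggestions might look something like this, continuing
the toy model sketched earlier in this thread.  in_aio_context_home_thread()
is the real QEMU helper, but the placement below is my guess at the idea,
not a tested patch:

#include <assert.h>

bool aio_poll(AioContext *ctx, bool blocking)
{
    /* Proposed invariant: only the context's home thread may block here;
     * every other thread must pass blocking == false, which going through
     * AIO_WAIT_WHILE already guarantees. */
    assert(in_aio_context_home_thread(ctx) || !blocking);

    if (blocking) {
        atomic_fetch_add(&ctx->notify_me, 2);
    }
    /* ... block until an event source fires ... */
    if (blocking) {
        atomic_fetch_sub(&ctx->notify_me, 2);
        /* Proposed change: only a blocking poll consumes the notification.
         * A non-blocking poll from another thread now leaves ctx->notifier
         * set, so the real waiter still wakes up. */
        aio_notify_accept(ctx);
    }
    /* ... dispatch ready handlers ... */
    return false;
}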
On Fri, 08/03 19:08, Paolo Bonzini wrote:
> On 03/08/2018 17:49, Fam Zheng wrote:
> >  void aio_notify_accept(AioContext *ctx)
> >  {
> > -    if (atomic_xchg(&ctx->notified, false)) {
> > +    /* If ctx->notify_me >= 2, another aio_poll() is waiting which may need
> > +     * the ctx->notifier event to wake up, so don't clear it prematurely
> > +     * just because "we" are done iterating. */
> > +    if (atomic_read(&ctx->notify_me) < 2
> > +        && atomic_xchg(&ctx->notified, false)) {
> >          event_notifier_test_and_clear(&ctx->notifier);
> >      }
> >  }
>
> Ok, it's somewhat reassuring to see from the BZ that the aio_poll() in the
> main thread (in bdrv_set_aio_context) is non-blocking, and that this isn't
> about nested aio_poll().
>
> Then a busy wait is not possible there, because sooner or later the bottom
> halves will be exhausted and aio_poll() will return false (no progress).
>
> I'm convinced that the idea in your patch---skipping
> aio_notify_accept---is correct; it's the ctx->notify_me test that I cannot
> understand.  I'm not saying it's wrong, but it's tricky.  So we need to
> improve the comments, the commit message, the way we achieve the fix, or
> all three.
>
> As to the comments and commit message: the BZ is a very good source of
> information.  The comment on the main thread stealing the aio_notify was
> very clear.

Yes, it was late Friday night and I wanted to send the patch before the
long weekend :)

> As to how to fix it, first of all, we should be clear on the invariants.
> It would be nice to assert that, if not in_aio_context_home_thread(ctx),
> blocking must be false.  Two concurrent blocking aio_polls would steal
> aio_notify from one another, so intuitively that assertion should be
> true, and using AIO_WAIT_WHILE takes care of it.
>
> Second, if blocking is false, do we need to call aio_notify_accept at
> all?  If not, and if we combine this with the assertion above, only the
> I/O thread will call aio_notify_accept, and the main loop will never
> steal the notification.  That should fix the bug.

Yes, I think this is a better idea. I'll try it.

Fam
diff --git a/util/async.c b/util/async.c
index 05979f8014..c6e8aebc3a 100644
--- a/util/async.c
+++ b/util/async.c
@@ -355,7 +355,11 @@ void aio_notify(AioContext *ctx)
 
 void aio_notify_accept(AioContext *ctx)
 {
-    if (atomic_xchg(&ctx->notified, false)) {
+    /* If ctx->notify_me >= 2, another aio_poll() is waiting which may need
+     * the ctx->notifier event to wake up, so don't clear it prematurely
+     * just because "we" are done iterating. */
+    if (atomic_read(&ctx->notify_me) < 2
+        && atomic_xchg(&ctx->notified, false)) {
         event_notifier_test_and_clear(&ctx->notifier);
     }
 }
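One way to read the "< 2" threshold in the hunk above (my interpretation;
it assumes the convention of this era, where the glib prepare path sets
only the low bit of notify_me via atomic_or(&ctx->notify_me, 1) while
aio_poll() adds 2):

/* Reading of the guarded accept (mirrors the hunk above):
 *
 *   notify_me == 0 or 1  -> no aio_poll() is blocked in poll().  The low
 *                           bit, if set, belongs to the caller's own glib
 *                           iteration, which consumes its own notification
 *                           after poll() returns, so clearing here cannot
 *                           strand another thread.
 *   notify_me >= 2       -> some aio_poll() raised it by 2 before blocking
 *                           on ctx->notifier; leave the event set so that
 *                           poller still wakes up.
 */
if (atomic_read(&ctx->notify_me) < 2
    && atomic_xchg(&ctx->notified, false)) {
    event_notifier_test_and_clear(&ctx->notifier);
}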
From the main loop, bdrv_set_aio_context() can call aio_poll() on an
IOThread's AioContext.  That breaks aio_notify(), because the
ctx->notifier event can be cleared too early, which leaves the IOThread
hanging.  See https://bugzilla.redhat.com/show_bug.cgi?id=1562750 for
details.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 util/async.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
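For readers without access to the BZ, the hang can be pictured with an
interleaving along these lines (a reconstruction from the commit message
and the thread above, not a trace taken from the bug report):

/*
 * IOThread                             Main loop (bdrv_set_aio_context)
 * ----------------------------------   --------------------------------
 * aio_poll(ctx, true)
 *   atomic_add(&ctx->notify_me, 2)
 *   (about to enter ppoll())
 *                                      aio_notify(ctx), e.g. for a BH:
 *                                        sees notify_me != 0
 *                                        event_notifier_set(&ctx->notifier)
 *                                      aio_poll(ctx, false)
 *                                        aio_notify_accept(ctx)
 *                                          event_notifier_test_and_clear()
 *                                          -> the wakeup is stolen
 *   ppoll() blocks; the event meant to
 *   wake it was already cleared, so the
 *   IOThread hangs.
 */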