Message ID | 20180615174729.20544-1-naravamudan@digitalocean.com (mailing list archive) |
---|---|
State | New, archived |
On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > laio_init() can fail for a couple of reasons, which will lead to a NULL > pointer dereference in laio_attach_aio_context(). > > To solve this, add a aio_setup_linux_aio() function which is called > before aio_get_linux_aio() where it is called currently, and which > propogates setup errors up. The signature of aio_get_linux_aio() was not s/propogates/propagates/ > modified, because it seems preferable to return the actual errno from > the possible failing initialization calls. > > With respect to the error-handling in the file-posix.c, we properly > bubble any errors up in raw_co_prw and in the case s of > raw_aio_{,un}plug, the result is the same as if s->use_linux_aio was not > set (but there is no bubbling up). In all three cases, if the setup > function fails, we fallback to the thread pool and an error message is > emitted. > > It is trivial to make qemu segfault in my testing. Set > /proc/sys/fs/aio-max-nr to 0 and start a guest with > aio=native,cache=directsync. With this patch, the guest successfully > starts (but obviously isn't using native AIO). Setting aio-max-nr back > up to a reasonable value, AIO contexts are consumed normally. > > Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com> > > --- > > Changes from v1 -> v2: When posting a v2, it's best to post as a new thread, rather than in-reply-to the v1 thread, so that automated tooling knows to check the new patch. More patch submission tips at https://wiki.qemu.org/Contribute/SubmitAPatch > > Rather than affect virtio-scsi/blk at all, make all the changes internal > to file-posix.c. Thanks to Kevin Wolf for the suggested change. > --- > block/file-posix.c | 24 ++++++++++++++++++++++++ > block/linux-aio.c | 15 ++++++++++----- > include/block/aio.h | 3 +++ > include/block/raw-aio.h | 2 +- > stubs/linux-aio.c | 2 +- > util/async.c | 15 ++++++++++++--- > 6 files changed, 51 insertions(+), 10 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 07bb061fe4..2415d09bf1 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1665,6 +1665,14 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, > type |= QEMU_AIO_MISALIGNED; > #ifdef CONFIG_LINUX_AIO > } else if (s->use_linux_aio) { > + int rc; > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > + if (rc != 0) { > + error_report("Unable to use native AIO, falling back to " > + "thread pool."); In general, error_report() should not output a trailing '.'. > + s->use_linux_aio = 0; > + return rc; Wait - the message claims we are falling back, but the non-zero return code sounds like we are returning an error instead of falling back. (My preference - if the user requested something and we can't do it, it's better to error than to fall back to something that does not match the user's request). > + } > LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs)); > assert(qiov->size == bytes); > return laio_co_submit(bs, aio, s->fd, offset, qiov, type); > @@ -1695,6 +1703,14 @@ static void raw_aio_plug(BlockDriverState *bs) > #ifdef CONFIG_LINUX_AIO > BDRVRawState *s = bs->opaque; > if (s->use_linux_aio) { > + int rc; > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > + if (rc != 0) { > + error_report("Unable to use native AIO, falling back to " > + "thread pool."); > + s->use_linux_aio = 0; Should s->use_linux_aio be a bool instead of an int?
On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > laio_init() can fail for a couple of reasons, which will lead to a NULL > > pointer dereference in laio_attach_aio_context(). > > > > To solve this, add a aio_setup_linux_aio() function which is called > > before aio_get_linux_aio() where it is called currently, and which > > propogates setup errors up. The signature of aio_get_linux_aio() was not > > s/propogates/propagates/ Thanks! > > modified, because it seems preferable to return the actual errno from > > the possible failing initialization calls. > > > > With respect to the error-handling in the file-posix.c, we properly > > bubble any errors up in raw_co_prw and in the case s of > > raw_aio_{,un}plug, the result is the same as if s->use_linux_aio was not > > set (but there is no bubbling up). In all three cases, if the setup > > function fails, we fallback to the thread pool and an error message is > > emitted. > > > > It is trivial to make qemu segfault in my testing. Set > > /proc/sys/fs/aio-max-nr to 0 and start a guest with > > aio=native,cache=directsync. With this patch, the guest successfully > > starts (but obviously isn't using native AIO). Setting aio-max-nr back > > up to a reasonable value, AIO contexts are consumed normally. > > > > Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com> > > > > --- > > > > Changes from v1 -> v2: > > When posting a v2, it's best to post as a new thread, rather than > in-reply-to the v1 thread, so that automated tooling knows to check the new > patch. More patch submission tips at > https://wiki.qemu.org/Contribute/SubmitAPatch My apologies! I'll fix this in a (future) v3. > > Rather than affect virtio-scsi/blk at all, make all the changes internal > > to file-posix.c. Thanks to Kevin Wolf for the suggested change. > > --- > > block/file-posix.c | 24 ++++++++++++++++++++++++ > > block/linux-aio.c | 15 ++++++++++----- > > include/block/aio.h | 3 +++ > > include/block/raw-aio.h | 2 +- > > stubs/linux-aio.c | 2 +- > > util/async.c | 15 ++++++++++++--- > > 6 files changed, 51 insertions(+), 10 deletions(-) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 07bb061fe4..2415d09bf1 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -1665,6 +1665,14 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, > > type |= QEMU_AIO_MISALIGNED; > > #ifdef CONFIG_LINUX_AIO > > } else if (s->use_linux_aio) { > > + int rc; > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > + if (rc != 0) { > > + error_report("Unable to use native AIO, falling back to " > > + "thread pool."); > > In general, error_report() should not output a trailing '.'. Will fix. > > + s->use_linux_aio = 0; > > + return rc; > > Wait - the message claims we are falling back, but the non-zero return code > sounds like we are returning an error instead of falling back. (My > preference - if the user requested something and we can't do it, it's better > to error than to fall back to something that does not match the user's > request). I think that makes sense, I hadn't tested this specific case (in my reading of the code, it wasn't clear to me if raw_co_prw() could be called before raw_aio_plug() had been called, but I think returning the error code up should be handled correctly. What about the cases where there is no error handling (the other two changes in the patch)? 
> > + } > > LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs)); > > assert(qiov->size == bytes); > > return laio_co_submit(bs, aio, s->fd, offset, qiov, type); > > @@ -1695,6 +1703,14 @@ static void raw_aio_plug(BlockDriverState *bs) > > #ifdef CONFIG_LINUX_AIO > > BDRVRawState *s = bs->opaque; > > if (s->use_linux_aio) { > > + int rc; > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > + if (rc != 0) { > > + error_report("Unable to use native AIO, falling back to " > > + "thread pool."); > > + s->use_linux_aio = 0; > > Should s->use_linux_aio be a bool instead of an int? It is: bool use_linux_aio:1; would you prefer I did a preparatory patch that converted users to true/false? Thanks, Nish
On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: <snip> > > > } else if (s->use_linux_aio) { > > > + int rc; > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > + if (rc != 0) { > > > + error_report("Unable to use native AIO, falling back to " > > > + "thread pool."); > > > > In general, error_report() should not output a trailing '.'. > > Will fix. > > > > + s->use_linux_aio = 0; > > > + return rc; > > > > Wait - the message claims we are falling back, but the non-zero return code > > sounds like we are returning an error instead of falling back. (My > > preference - if the user requested something and we can't do it, it's better > > to error than to fall back to something that does not match the user's > > request). > > I think that makes sense, I hadn't tested this specific case (in my > reading of the code, it wasn't clear to me if raw_co_prw() could be > called before raw_aio_plug() had been called, but I think returning the > error code up should be handled correctly. What about the cases where > there is no error handling (the other two changes in the patch)? While looking at doing these changes, I realized that I'm not quite sure what the right approach is here. My original rationale for returning non-zero was that AIO was requested but could not be completed. I haven't fully tracked back the calling paths, but I assumed it would get retried at the top level, and since we indicated to not use AIO on subsequent calls, it will succeed and use threads then (note, that I do now realize this means a mismatch between the qemu command-line and the in-use AIO model). In practice, with my v2 patch, where I do return a non-zero error-code from this function, qemu does not exit (nor is any logging other than that I added emitted on the monitor). If I do not fallback, I imagine we would just continuously see this error message and IO might not actually every occur? Reworking all of the callpath to fail on non-zero returns from raw_co_prw() seems like a fair bit of work, but if that is what is being requested, I can try that (it will just take a while). Alternatively, I can produce a v3 quickly that does not bubble the actual errno all the way up (since it does seem like it is ignored anyways?). <snip> > > > + s->use_linux_aio = 0; > > > > Should s->use_linux_aio be a bool instead of an int? > > It is: > > bool use_linux_aio:1; > > would you prefer I did a preparatory patch that converted users to > true/false? Sorry, I misunderstood this -- only my patch does an assignment, so I'll switch to 'false'. Thanks, Nish
On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > <snip> > > > > > } else if (s->use_linux_aio) { > > > > + int rc; > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > + if (rc != 0) { > > > > + error_report("Unable to use native AIO, falling back to " > > > > + "thread pool."); > > > > > > In general, error_report() should not output a trailing '.'. > > > > Will fix. > > > > > > + s->use_linux_aio = 0; > > > > + return rc; > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > sounds like we are returning an error instead of falling back. (My > > > preference - if the user requested something and we can't do it, it's better > > > to error than to fall back to something that does not match the user's > > > request). > > > > I think that makes sense, I hadn't tested this specific case (in my > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > called before raw_aio_plug() had been called, but I think returning the > > error code up should be handled correctly. What about the cases where > > there is no error handling (the other two changes in the patch)? > > While looking at doing these changes, I realized that I'm not quite sure > what the right approach is here. My original rationale for returning > non-zero was that AIO was requested but could not be completed. I > haven't fully tracked back the calling paths, but I assumed it would get > retried at the top level, and since we indicated to not use AIO on > subsequent calls, it will succeed and use threads then (note, that I do > now realize this means a mismatch between the qemu command-line and the > in-use AIO model). > > In practice, with my v2 patch, where I do return a non-zero error-code > from this function, qemu does not exit (nor is any logging other than > that I added emitted on the monitor). If I do not fallback, I imagine we > would just continuously see this error message and IO might not actually > every occur? Reworking all of the callpath to fail on non-zero returns > from raw_co_prw() seems like a fair bit of work, but if that is what is > being requested, I can try that (it will just take a while). > Alternatively, I can produce a v3 quickly that does not bubble the > actual errno all the way up (since it does seem like it is ignored > anyways?). Sorry for the noise, but I had one more thought. Would it be appropriate to push the _setup() call up to when we parse the arguments about aio=native? E.g., we already check there if cache=directsync is specified and error out if not. We could, in theory, also call laio_init() there (via the new function) and error out to the CLI if that fails. Then the runtime paths would simply be able to use the context that was setup earlier? I would need to verify the laio_cleanup() happens correctly still. Thanks, Nish
Am 20.06.2018 um 00:54 hat Nishanth Aravamudan geschrieben: > On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > > > <snip> > > > > > > > } else if (s->use_linux_aio) { > > > > > + int rc; > > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > > + if (rc != 0) { > > > > > + error_report("Unable to use native AIO, falling back to " > > > > > + "thread pool."); > > > > > > > > In general, error_report() should not output a trailing '.'. > > > > > > Will fix. > > > > > > > > + s->use_linux_aio = 0; > > > > > + return rc; > > > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > > sounds like we are returning an error instead of falling back. (My > > > > preference - if the user requested something and we can't do it, it's better > > > > to error than to fall back to something that does not match the user's > > > > request). > > > > > > I think that makes sense, I hadn't tested this specific case (in my > > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > > called before raw_aio_plug() had been called, but I think returning the > > > error code up should be handled correctly. What about the cases where > > > there is no error handling (the other two changes in the patch)? > > > > While looking at doing these changes, I realized that I'm not quite sure > > what the right approach is here. My original rationale for returning > > non-zero was that AIO was requested but could not be completed. I > > haven't fully tracked back the calling paths, but I assumed it would get > > retried at the top level, and since we indicated to not use AIO on > > subsequent calls, it will succeed and use threads then (note, that I do > > now realize this means a mismatch between the qemu command-line and the > > in-use AIO model). > > > > In practice, with my v2 patch, where I do return a non-zero error-code > > from this function, qemu does not exit (nor is any logging other than > > that I added emitted on the monitor). If I do not fallback, I imagine we > > would just continuously see this error message and IO might not actually > > every occur? Reworking all of the callpath to fail on non-zero returns > > from raw_co_prw() seems like a fair bit of work, but if that is what is > > being requested, I can try that (it will just take a while). > > Alternatively, I can produce a v3 quickly that does not bubble the > > actual errno all the way up (since it does seem like it is ignored > > anyways?). > > Sorry for the noise, but I had one more thought. Would it be appropriate > to push the _setup() call up to when we parse the arguments about > aio=native? E.g., we already check there if cache=directsync is > specified and error out if not. We already do this: /* Currently Linux does AIO only for files opened with O_DIRECT */ if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) { error_setg(errp, "aio=native was specified, but it requires " "cache.direct=on, which was not specified."); ret = -EINVAL; goto fail; } laio_init() is about other types of errors. But anyway, yes, calling laio_init() already in .bdrv_open() is possible. Returning errors from .bdrv_open() is nice and easy and we should do it. However, we may also need to call laio_init() again when switching to a different I/O thread after the image is already opened. 
This is what I meant when I commented on v1 that you should do this in the .bdrv_attach_aio_context callback. The problem here is that we can't return an error there and the guest is already using the image. In this case, logging an error and falling back to the thread pool seems to be the best option we have. Kevin
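To make the two halves of that suggestion concrete, a rough sketch of how it could look in file-posix.c follows; the callback name, the error message wording, and the exact open-time placement are assumptions for illustration, not code from the posted patch:

```c
/* Sketch only: assumes aio_setup_linux_aio() returns 0 on success and a
 * negative errno on failure, as in the v2 patch. */

/* In raw_open_common(), a setup failure can still be turned into a hard
 * open error, so aio=native is refused rather than silently downgraded. */
    if (s->use_linux_aio) {
        ret = aio_setup_linux_aio(bdrv_get_aio_context(bs));
        if (ret < 0) {
            error_setg_errno(errp, -ret, "Failed to initialize native AIO");
            goto fail;
        }
    }

/* Hypothetical .bdrv_attach_aio_context implementation: the guest already
 * owns the image at this point, so the only option is to warn and fall
 * back to the thread pool. */
static void raw_aio_attach_aio_context(BlockDriverState *bs,
                                       AioContext *new_context)
{
    BDRVRawState *s = bs->opaque;

    if (s->use_linux_aio && aio_setup_linux_aio(new_context) != 0) {
        error_report("Unable to use native AIO, falling back to thread pool");
        s->use_linux_aio = false;
    }
}
```

The open path can propagate the error to the user, while the attach callback centralises the fallback so the per-request paths in raw_co_prw(), raw_aio_plug() and raw_aio_unplug() would no longer need their own checks.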
On 20.06.2018 [11:57:42 +0200], Kevin Wolf wrote: > Am 20.06.2018 um 00:54 hat Nishanth Aravamudan geschrieben: > > On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > > > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > > > > > <snip> > > > > > > > > > } else if (s->use_linux_aio) { > > > > > > + int rc; > > > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > > > + if (rc != 0) { > > > > > > + error_report("Unable to use native AIO, falling back to " > > > > > > + "thread pool."); > > > > > > > > > > In general, error_report() should not output a trailing '.'. > > > > > > > > Will fix. > > > > > > > > > > + s->use_linux_aio = 0; > > > > > > + return rc; > > > > > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > > > sounds like we are returning an error instead of falling back. (My > > > > > preference - if the user requested something and we can't do it, it's better > > > > > to error than to fall back to something that does not match the user's > > > > > request). > > > > > > > > I think that makes sense, I hadn't tested this specific case (in my > > > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > > > called before raw_aio_plug() had been called, but I think returning the > > > > error code up should be handled correctly. What about the cases where > > > > there is no error handling (the other two changes in the patch)? > > > > > > While looking at doing these changes, I realized that I'm not quite sure > > > what the right approach is here. My original rationale for returning > > > non-zero was that AIO was requested but could not be completed. I > > > haven't fully tracked back the calling paths, but I assumed it would get > > > retried at the top level, and since we indicated to not use AIO on > > > subsequent calls, it will succeed and use threads then (note, that I do > > > now realize this means a mismatch between the qemu command-line and the > > > in-use AIO model). > > > > > > In practice, with my v2 patch, where I do return a non-zero error-code > > > from this function, qemu does not exit (nor is any logging other than > > > that I added emitted on the monitor). If I do not fallback, I imagine we > > > would just continuously see this error message and IO might not actually > > > every occur? Reworking all of the callpath to fail on non-zero returns > > > from raw_co_prw() seems like a fair bit of work, but if that is what is > > > being requested, I can try that (it will just take a while). > > > Alternatively, I can produce a v3 quickly that does not bubble the > > > actual errno all the way up (since it does seem like it is ignored > > > anyways?). > > > > Sorry for the noise, but I had one more thought. Would it be appropriate > > to push the _setup() call up to when we parse the arguments about > > aio=native? E.g., we already check there if cache=directsync is > > specified and error out if not. > > We already do this: Right, I stated above it already is done, I simply meant adding a second check here that we can obtain and setup the AIO context successfully. 
> /* Currently Linux does AIO only for files opened with O_DIRECT */ > if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) { > error_setg(errp, "aio=native was specified, but it requires " > "cache.direct=on, which was not specified."); > ret = -EINVAL; > goto fail; > } > > laio_init() is about other types of errors. But anyway, yes, calling > laio_init() already in .bdrv_open() is possible. Returning errors from > .bdrv_open() is nice and easy and we should do it. Ack. > However, we may also need to call laio_init() again when switching to a > different I/O thread after the image is already opened. This is what I > meant when I commented on v1 that you should do this in the > .bdrv_attach_aio_context callback. The problem here is that we can't > return an error there and the guest is already using the image. In this > case, logging an error and falling back to the thread pool seems to be > the best option we have. Is this is a request for new functionality? Just trying to understand, because aiui, block/file-posix.c does not implement the bdrv_attach_aio_context callback currently. Instead, aio_get_linux_aio() is called from three places, raw_co_prw, raw_aio_plug and raw_aio_unplug, which calls into laio_init() and laio_attach_aio_context(). I can add the callback you suggest with appropriate error handling (I suppose it would point to laio_attach_aio_context, possibly with some modifications) and remove the call from aio_get_linux_aio()? Just trying to understand the request a bit better, as I don't see where exactly iothreads get switched and how that is implemented currently (and thus where laio_init() would get called again in the current code). Thanks in advance! -Nish
On 20.06.2018 [12:34:52 -0700], Nishanth Aravamudan wrote: > On 20.06.2018 [11:57:42 +0200], Kevin Wolf wrote: > > Am 20.06.2018 um 00:54 hat Nishanth Aravamudan geschrieben: > > > On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > > > > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > > > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > > > > > > > <snip> > > > > > > > > > > > } else if (s->use_linux_aio) { > > > > > > > + int rc; > > > > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > > > > + if (rc != 0) { > > > > > > > + error_report("Unable to use native AIO, falling back to " > > > > > > > + "thread pool."); > > > > > > > > > > > > In general, error_report() should not output a trailing '.'. > > > > > > > > > > Will fix. > > > > > > > > > > > > + s->use_linux_aio = 0; > > > > > > > + return rc; > > > > > > > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > > > > sounds like we are returning an error instead of falling back. (My > > > > > > preference - if the user requested something and we can't do it, it's better > > > > > > to error than to fall back to something that does not match the user's > > > > > > request). > > > > > > > > > > I think that makes sense, I hadn't tested this specific case (in my > > > > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > > > > called before raw_aio_plug() had been called, but I think returning the > > > > > error code up should be handled correctly. What about the cases where > > > > > there is no error handling (the other two changes in the patch)? > > > > > > > > While looking at doing these changes, I realized that I'm not quite sure > > > > what the right approach is here. My original rationale for returning > > > > non-zero was that AIO was requested but could not be completed. I > > > > haven't fully tracked back the calling paths, but I assumed it would get > > > > retried at the top level, and since we indicated to not use AIO on > > > > subsequent calls, it will succeed and use threads then (note, that I do > > > > now realize this means a mismatch between the qemu command-line and the > > > > in-use AIO model). > > > > > > > > In practice, with my v2 patch, where I do return a non-zero error-code > > > > from this function, qemu does not exit (nor is any logging other than > > > > that I added emitted on the monitor). If I do not fallback, I imagine we > > > > would just continuously see this error message and IO might not actually > > > > every occur? Reworking all of the callpath to fail on non-zero returns > > > > from raw_co_prw() seems like a fair bit of work, but if that is what is > > > > being requested, I can try that (it will just take a while). > > > > Alternatively, I can produce a v3 quickly that does not bubble the > > > > actual errno all the way up (since it does seem like it is ignored > > > > anyways?). > > > > > > Sorry for the noise, but I had one more thought. Would it be appropriate > > > to push the _setup() call up to when we parse the arguments about > > > aio=native? E.g., we already check there if cache=directsync is > > > specified and error out if not. > > > > We already do this: > > Right, I stated above it already is done, I simply meant adding a second > check here that we can obtain and setup the AIO context successfully. 
> > > /* Currently Linux does AIO only for files opened with O_DIRECT */ > > if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) { > > error_setg(errp, "aio=native was specified, but it requires " > > "cache.direct=on, which was not specified."); > > ret = -EINVAL; > > goto fail; > > } > > > > laio_init() is about other types of errors. But anyway, yes, calling > > laio_init() already in .bdrv_open() is possible. Returning errors from > > .bdrv_open() is nice and easy and we should do it. > > Ack. > > > However, we may also need to call laio_init() again when switching to a > > different I/O thread after the image is already opened. This is what I > > meant when I commented on v1 that you should do this in the > > .bdrv_attach_aio_context callback. The problem here is that we can't > > return an error there and the guest is already using the image. In this > > case, logging an error and falling back to the thread pool seems to be > > the best option we have. > > Is this is a request for new functionality? Just trying to understand, > because aiui, block/file-posix.c does not implement the > bdrv_attach_aio_context callback currently. Instead, aio_get_linux_aio() > is called from three places, raw_co_prw, raw_aio_plug and > raw_aio_unplug, which calls into laio_init() and > laio_attach_aio_context(). I can add the callback you suggest with > appropriate error handling (I suppose it would point to > laio_attach_aio_context, possibly with some modifications) and remove > the call from aio_get_linux_aio()? Just trying to understand the request > a bit better, as I don't see where exactly iothreads get switched and > how that is implemented currently (and thus where laio_init() would get > called again in the current code). While I waited for a reply to this, I started coding on what I think was being asked for and have come to the conclusion that there are actually three bugs here :) Test cases (with one disk attached to the VM): 1) Set /proc/sys/fs/max-aio-nr to 0. Specify aio=native and qemu dies with a SIGSEGV. - This case is understood and pushing the laio_init()-return code check to the bdrv_open() path fixes this (and allows for the failure to be communicated to the user). 2) Set /proc/sys/fs/max-aio-nr to 128. Specify aio=native and some number of IOThreads. Over qmp issue a x-blockdev-set-iothread command to move the block device node to one of the IOThreads. qemu eventually dies with a SIGSEGV. - I am fairly sure this is the case you described above, and is fixed by re-implementing the bdrv_{attach,detach}_aio_context callbacks. I have a patch that does this and successfully tested the SEGV is avoided. 3) Set /proc/sys/fs/max-aio-nr to 512 (I think 256 would be sufficient, though). Specify aio=native and some number of IOThreads. Over qmp issue a x-blockdev-set-iothread command to move the block device node to one of the IOThreads. Shutdown the guest normally. qemu dies with a SIGABRT. - This appears to be because there is a mismatch in aio_context_{acquire,release} calls (this is my hypothesis right now). The abort comes from bdrv_flush -> aio_context_release and an EPERM from qemu_mutex_unlock_impl() which I believe is just reflecting an EPERM from pthread_mutex_unlock? My theory is that the main qemu thread acquired the aio mutex but then the IOThread released it? I will try and trace the mutexes tomorrow, but I still don't have a fix for this case. Thanks, Nish
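For reference, the first case is the same scenario as in the commit message; a minimal reproduction sketch (image path and memory size are placeholders, and the sysctl file is fs.aio-max-nr):

```sh
# Exhaust the system-wide AIO context limit
echo 0 > /proc/sys/fs/aio-max-nr

# aio=native needs O_DIRECT, hence cache=directsync; without the fix,
# laio_init() fails and qemu crashes in laio_attach_aio_context()
qemu-system-x86_64 -m 1024 \
    -drive file=/path/to/guest.img,format=raw,aio=native,cache=directsync
```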
Am 21.06.2018 um 05:26 hat Nishanth Aravamudan geschrieben: > On 20.06.2018 [12:34:52 -0700], Nishanth Aravamudan wrote: > > On 20.06.2018 [11:57:42 +0200], Kevin Wolf wrote: > > > Am 20.06.2018 um 00:54 hat Nishanth Aravamudan geschrieben: > > > > On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > > > > > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > > > > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > > > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > > > > > > > > > <snip> > > > > > > > > > > > > > } else if (s->use_linux_aio) { > > > > > > > > + int rc; > > > > > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > > > > > + if (rc != 0) { > > > > > > > > + error_report("Unable to use native AIO, falling back to " > > > > > > > > + "thread pool."); > > > > > > > > > > > > > > In general, error_report() should not output a trailing '.'. > > > > > > > > > > > > Will fix. > > > > > > > > > > > > > > + s->use_linux_aio = 0; > > > > > > > > + return rc; > > > > > > > > > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > > > > > sounds like we are returning an error instead of falling back. (My > > > > > > > preference - if the user requested something and we can't do it, it's better > > > > > > > to error than to fall back to something that does not match the user's > > > > > > > request). > > > > > > > > > > > > I think that makes sense, I hadn't tested this specific case (in my > > > > > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > > > > > called before raw_aio_plug() had been called, but I think returning the > > > > > > error code up should be handled correctly. What about the cases where > > > > > > there is no error handling (the other two changes in the patch)? > > > > > > > > > > While looking at doing these changes, I realized that I'm not quite sure > > > > > what the right approach is here. My original rationale for returning > > > > > non-zero was that AIO was requested but could not be completed. I > > > > > haven't fully tracked back the calling paths, but I assumed it would get > > > > > retried at the top level, and since we indicated to not use AIO on > > > > > subsequent calls, it will succeed and use threads then (note, that I do > > > > > now realize this means a mismatch between the qemu command-line and the > > > > > in-use AIO model). > > > > > > > > > > In practice, with my v2 patch, where I do return a non-zero error-code > > > > > from this function, qemu does not exit (nor is any logging other than > > > > > that I added emitted on the monitor). If I do not fallback, I imagine we > > > > > would just continuously see this error message and IO might not actually > > > > > every occur? Reworking all of the callpath to fail on non-zero returns > > > > > from raw_co_prw() seems like a fair bit of work, but if that is what is > > > > > being requested, I can try that (it will just take a while). > > > > > Alternatively, I can produce a v3 quickly that does not bubble the > > > > > actual errno all the way up (since it does seem like it is ignored > > > > > anyways?). > > > > > > > > Sorry for the noise, but I had one more thought. Would it be appropriate > > > > to push the _setup() call up to when we parse the arguments about > > > > aio=native? E.g., we already check there if cache=directsync is > > > > specified and error out if not. 
> > > > > > We already do this: > > > > Right, I stated above it already is done, I simply meant adding a second > > check here that we can obtain and setup the AIO context successfully. Yes, sorry, I misread. > > > /* Currently Linux does AIO only for files opened with O_DIRECT */ > > > if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) { > > > error_setg(errp, "aio=native was specified, but it requires " > > > "cache.direct=on, which was not specified."); > > > ret = -EINVAL; > > > goto fail; > > > } > > > > > > laio_init() is about other types of errors. But anyway, yes, calling > > > laio_init() already in .bdrv_open() is possible. Returning errors from > > > .bdrv_open() is nice and easy and we should do it. > > > > Ack. > > > > > However, we may also need to call laio_init() again when switching to a > > > different I/O thread after the image is already opened. This is what I > > > meant when I commented on v1 that you should do this in the > > > .bdrv_attach_aio_context callback. The problem here is that we can't > > > return an error there and the guest is already using the image. In this > > > case, logging an error and falling back to the thread pool seems to be > > > the best option we have. > > > > Is this is a request for new functionality? Just trying to understand, > > because aiui, block/file-posix.c does not implement the > > bdrv_attach_aio_context callback currently. Instead, aio_get_linux_aio() > > is called from three places, raw_co_prw, raw_aio_plug and > > raw_aio_unplug, which calls into laio_init() and > > laio_attach_aio_context(). I can add the callback you suggest with > > appropriate error handling (I suppose it would point to > > laio_attach_aio_context, possibly with some modifications) and remove > > the call from aio_get_linux_aio()? Just trying to understand the request > > a bit better, as I don't see where exactly iothreads get switched and > > how that is implemented currently (and thus where laio_init() would get > > called again in the current code). It's not new functionality, but moving things to a different place. Our goal is that aio_get_linux_aio() never returns NULL in the I/O path so that we don't have to duplicate the error handling everywhere. For each AioContext, aio_get_linux_aio() can return NULL only the first time. Once it successfully returned a LinuxAioState, it will never return NULL again for the same AioContext. You cover the QEMU main loop AioContext when you call aio_get_linux_aio() in raw_open_common(). But whenever the AioContext is changed, we can get NULL again on the next aio_get_linux_aio() call because we don't know whether that AioContext already successfully returned a LinuxAioState before. If you call once it in .bdrv_attach_aio_context() and disable Linux AIO if it fails, you have centralised all the error paths that you currently need in all I/O paths. > While I waited for a reply to this, I started coding on what I think was > being asked for and have come to the conclusion that there are actually > three bugs here :) > > Test cases (with one disk attached to the VM): > > 1) Set /proc/sys/fs/max-aio-nr to 0. Specify aio=native and qemu dies > with a SIGSEGV. > - This case is understood and pushing the laio_init()-return code > check to the bdrv_open() path fixes this (and allows for the > failure to be communicated to the user). Right. > 2) Set /proc/sys/fs/max-aio-nr to 128. Specify aio=native and some > number of IOThreads. 
Over qmp issue a x-blockdev-set-iothread command to > move the block device node to one of the IOThreads. qemu eventually dies > with a SIGSEGV. > - I am fairly sure this is the case you described above, and is > fixed by re-implementing the bdrv_{attach,detach}_aio_context > callbacks. I have a patch that does this and successfully tested > the SEGV is avoided. > > 3) Set /proc/sys/fs/max-aio-nr to 512 (I think 256 would be sufficient, > though). Specify aio=native and some number of IOThreads. Over qmp issue > a x-blockdev-set-iothread command to move the block device node to one > of the IOThreads. Shutdown the guest normally. qemu dies with a SIGABRT. > - This appears to be because there is a mismatch in > aio_context_{acquire,release} calls (this is my hypothesis right > now). The abort comes from bdrv_flush -> aio_context_release and > an EPERM from qemu_mutex_unlock_impl() which I believe is just > reflecting an EPERM from pthread_mutex_unlock? My theory is that > the main qemu thread acquired the aio mutex but then the IOThread > released it? I will try and trace the mutexes tomorrow, but I > still don't have a fix for this case. x-blockdev-set-iothread is not the proper way to achieve what you want. You can only correctly use it with images that are not attached to any guest device. If you move a block node to a different I/O thread (and therefore AioContext) under the feet of the guest device, crashes aren't completely unexpected. The proper way to test this would be -device virtio-blk,iothread=... In this case, the raw BlockDriverState is first created in the main AioContext, but when the virtio-blk device is initialised, it moves it to the specified I/O thread. Kevin
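In concrete terms, the test Kevin describes looks roughly like this (the iothread and drive ids are illustrative):

```sh
qemu-system-x86_64 -m 1024 \
    -object iothread,id=iothread0 \
    -drive file=/path/to/guest.img,format=raw,if=none,id=drive0,aio=native,cache=directsync \
    -device virtio-blk-pci,drive=drive0,iothread=iothread0
```

The raw node is first opened in the main loop AioContext and then handed to iothread0 when the virtio-blk device is realized, which exercises exactly the AioContext switch that needs a second laio_init().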
On 21.06.2018 [15:51:08 +0200], Kevin Wolf wrote: > Am 21.06.2018 um 05:26 hat Nishanth Aravamudan geschrieben: > > On 20.06.2018 [12:34:52 -0700], Nishanth Aravamudan wrote: > > > On 20.06.2018 [11:57:42 +0200], Kevin Wolf wrote: > > > > Am 20.06.2018 um 00:54 hat Nishanth Aravamudan geschrieben: > > > > > On 19.06.2018 [15:35:57 -0700], Nishanth Aravamudan wrote: > > > > > > On 19.06.2018 [13:14:51 -0700], Nishanth Aravamudan wrote: > > > > > > > On 19.06.2018 [14:35:33 -0500], Eric Blake wrote: > > > > > > > > On 06/15/2018 12:47 PM, Nishanth Aravamudan via Qemu-devel wrote: > > > > > > > > > > > > <snip> > > > > > > > > > > > > > > > } else if (s->use_linux_aio) { > > > > > > > > > + int rc; > > > > > > > > > + rc = aio_setup_linux_aio(bdrv_get_aio_context(bs)); > > > > > > > > > + if (rc != 0) { > > > > > > > > > + error_report("Unable to use native AIO, falling back to " > > > > > > > > > + "thread pool."); > > > > > > > > > > > > > > > > In general, error_report() should not output a trailing '.'. > > > > > > > > > > > > > > Will fix. > > > > > > > > > > > > > > > > + s->use_linux_aio = 0; > > > > > > > > > + return rc; > > > > > > > > > > > > > > > > Wait - the message claims we are falling back, but the non-zero return code > > > > > > > > sounds like we are returning an error instead of falling back. (My > > > > > > > > preference - if the user requested something and we can't do it, it's better > > > > > > > > to error than to fall back to something that does not match the user's > > > > > > > > request). > > > > > > > > > > > > > > I think that makes sense, I hadn't tested this specific case (in my > > > > > > > reading of the code, it wasn't clear to me if raw_co_prw() could be > > > > > > > called before raw_aio_plug() had been called, but I think returning the > > > > > > > error code up should be handled correctly. What about the cases where > > > > > > > there is no error handling (the other two changes in the patch)? > > > > > > > > > > > > While looking at doing these changes, I realized that I'm not quite sure > > > > > > what the right approach is here. My original rationale for returning > > > > > > non-zero was that AIO was requested but could not be completed. I > > > > > > haven't fully tracked back the calling paths, but I assumed it would get > > > > > > retried at the top level, and since we indicated to not use AIO on > > > > > > subsequent calls, it will succeed and use threads then (note, that I do > > > > > > now realize this means a mismatch between the qemu command-line and the > > > > > > in-use AIO model). > > > > > > > > > > > > In practice, with my v2 patch, where I do return a non-zero error-code > > > > > > from this function, qemu does not exit (nor is any logging other than > > > > > > that I added emitted on the monitor). If I do not fallback, I imagine we > > > > > > would just continuously see this error message and IO might not actually > > > > > > every occur? Reworking all of the callpath to fail on non-zero returns > > > > > > from raw_co_prw() seems like a fair bit of work, but if that is what is > > > > > > being requested, I can try that (it will just take a while). > > > > > > Alternatively, I can produce a v3 quickly that does not bubble the > > > > > > actual errno all the way up (since it does seem like it is ignored > > > > > > anyways?). > > > > > > > > > > Sorry for the noise, but I had one more thought. Would it be appropriate > > > > > to push the _setup() call up to when we parse the arguments about > > > > > aio=native? 
E.g., we already check there if cache=directsync is > > > > > specified and error out if not. > > > > > > > > We already do this: > > > > > > Right, I stated above it already is done, I simply meant adding a second > > > check here that we can obtain and setup the AIO context successfully. > > Yes, sorry, I misread. > > > > > /* Currently Linux does AIO only for files opened with O_DIRECT */ > > > > if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) { > > > > error_setg(errp, "aio=native was specified, but it requires " > > > > "cache.direct=on, which was not specified."); > > > > ret = -EINVAL; > > > > goto fail; > > > > } > > > > > > > > laio_init() is about other types of errors. But anyway, yes, calling > > > > laio_init() already in .bdrv_open() is possible. Returning errors from > > > > .bdrv_open() is nice and easy and we should do it. > > > > > > Ack. > > > > > > > However, we may also need to call laio_init() again when switching to a > > > > different I/O thread after the image is already opened. This is what I > > > > meant when I commented on v1 that you should do this in the > > > > .bdrv_attach_aio_context callback. The problem here is that we can't > > > > return an error there and the guest is already using the image. In this > > > > case, logging an error and falling back to the thread pool seems to be > > > > the best option we have. > > > > > > Is this is a request for new functionality? Just trying to understand, > > > because aiui, block/file-posix.c does not implement the > > > bdrv_attach_aio_context callback currently. Instead, aio_get_linux_aio() > > > is called from three places, raw_co_prw, raw_aio_plug and > > > raw_aio_unplug, which calls into laio_init() and > > > laio_attach_aio_context(). I can add the callback you suggest with > > > appropriate error handling (I suppose it would point to > > > laio_attach_aio_context, possibly with some modifications) and remove > > > the call from aio_get_linux_aio()? Just trying to understand the request > > > a bit better, as I don't see where exactly iothreads get switched and > > > how that is implemented currently (and thus where laio_init() would get > > > called again in the current code). > > It's not new functionality, but moving things to a different place. > > Our goal is that aio_get_linux_aio() never returns NULL in the I/O path > so that we don't have to duplicate the error handling everywhere. For > each AioContext, aio_get_linux_aio() can return NULL only the first > time. Once it successfully returned a LinuxAioState, it will never > return NULL again for the same AioContext. > > You cover the QEMU main loop AioContext when you call > aio_get_linux_aio() in raw_open_common(). But whenever the AioContext is > changed, we can get NULL again on the next aio_get_linux_aio() call > because we don't know whether that AioContext already successfully > returned a LinuxAioState before. > > If you call once it in .bdrv_attach_aio_context() and disable Linux AIO > if it fails, you have centralised all the error paths that you currently > need in all I/O paths. Thanks for the clarification! > > While I waited for a reply to this, I started coding on what I think was > > being asked for and have come to the conclusion that there are actually > > three bugs here :) > > > > Test cases (with one disk attached to the VM): > > > > 1) Set /proc/sys/fs/max-aio-nr to 0. Specify aio=native and qemu dies > > with a SIGSEGV. 
> > - This case is understood and pushing the laio_init()-return code > > check to the bdrv_open() path fixes this (and allows for the > > failure to be communicated to the user). > > Right. > > > 2) Set /proc/sys/fs/max-aio-nr to 128. Specify aio=native and some > > number of IOThreads. Over qmp issue a x-blockdev-set-iothread command to > > move the block device node to one of the IOThreads. qemu eventually dies > > with a SIGSEGV. > > - I am fairly sure this is the case you described above, and is > > fixed by re-implementing the bdrv_{attach,detach}_aio_context > > callbacks. I have a patch that does this and successfully tested > > the SEGV is avoided. > > > > 3) Set /proc/sys/fs/max-aio-nr to 512 (I think 256 would be sufficient, > > though). Specify aio=native and some number of IOThreads. Over qmp issue > > a x-blockdev-set-iothread command to move the block device node to one > > of the IOThreads. Shutdown the guest normally. qemu dies with a SIGABRT. > > - This appears to be because there is a mismatch in > > aio_context_{acquire,release} calls (this is my hypothesis right > > now). The abort comes from bdrv_flush -> aio_context_release and > > an EPERM from qemu_mutex_unlock_impl() which I believe is just > > reflecting an EPERM from pthread_mutex_unlock? My theory is that > > the main qemu thread acquired the aio mutex but then the IOThread > > released it? I will try and trace the mutexes tomorrow, but I > > still don't have a fix for this case. > > x-blockdev-set-iothread is not the proper way to achieve what you want. > You can only correctly use it with images that are not attached to any > guest device. If you move a block node to a different I/O thread (and > therefore AioContext) under the feet of the guest device, crashes aren't > completely unexpected. > > The proper way to test this would be -device virtio-blk,iothread=... > In this case, the raw BlockDriverState is first created in the main > AioContext, but when the virtio-blk device is initialised, it moves it > to the specified I/O thread. Thanks again! Turns out 2) is definitely present in master. 3) does not occur. So I will work on making my patch into 2 patches and repost a v3 later today. Thanks, Nish
diff --git a/block/file-posix.c b/block/file-posix.c
index 07bb061fe4..2415d09bf1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1665,6 +1665,14 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
         type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_AIO
     } else if (s->use_linux_aio) {
+        int rc;
+        rc = aio_setup_linux_aio(bdrv_get_aio_context(bs));
+        if (rc != 0) {
+            error_report("Unable to use native AIO, falling back to "
+                         "thread pool.");
+            s->use_linux_aio = 0;
+            return rc;
+        }
         LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
         assert(qiov->size == bytes);
         return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
@@ -1695,6 +1703,14 @@ static void raw_aio_plug(BlockDriverState *bs)
 #ifdef CONFIG_LINUX_AIO
     BDRVRawState *s = bs->opaque;
     if (s->use_linux_aio) {
+        int rc;
+        rc = aio_setup_linux_aio(bdrv_get_aio_context(bs));
+        if (rc != 0) {
+            error_report("Unable to use native AIO, falling back to "
+                         "thread pool.");
+            s->use_linux_aio = 0;
+            return;
+        }
         LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
         laio_io_plug(bs, aio);
     }
@@ -1706,6 +1722,14 @@ static void raw_aio_unplug(BlockDriverState *bs)
 #ifdef CONFIG_LINUX_AIO
     BDRVRawState *s = bs->opaque;
     if (s->use_linux_aio) {
+        int rc;
+        rc = aio_setup_linux_aio(bdrv_get_aio_context(bs));
+        if (rc != 0) {
+            error_report("Unable to use native AIO, falling back to "
+                         "thread pool.");
+            s->use_linux_aio = 0;
+            return;
+        }
         LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
         laio_io_unplug(bs, aio);
     }
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 88b8d55ec7..4d799f85fe 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -470,28 +470,33 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
                            qemu_laio_poll_cb);
 }
 
-LinuxAioState *laio_init(void)
+int laio_init(LinuxAioState **linux_aio)
 {
+    int rc;
     LinuxAioState *s;
 
     s = g_malloc0(sizeof(*s));
-    if (event_notifier_init(&s->e, false) < 0) {
+    rc = event_notifier_init(&s->e, false);
+    if (rc < 0) {
         goto out_free_state;
     }
 
-    if (io_setup(MAX_EVENTS, &s->ctx) != 0) {
+    rc = io_setup(MAX_EVENTS, &s->ctx);
+    if (rc != 0) {
         goto out_close_efd;
     }
 
     ioq_init(&s->io_q);
 
-    return s;
+    *linux_aio = s;
+    return 0;
 
 out_close_efd:
     event_notifier_cleanup(&s->e);
 out_free_state:
     g_free(s);
-    return NULL;
+    *linux_aio = NULL;
+    return rc;
 }
 
 void laio_cleanup(LinuxAioState *s)
diff --git a/include/block/aio.h b/include/block/aio.h
index ae6f354e6c..8900516ac5 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -381,6 +381,9 @@ GSource *aio_get_g_source(AioContext *ctx);
 /* Return the ThreadPool bound to this AioContext */
 struct ThreadPool *aio_get_thread_pool(AioContext *ctx);
 
+/* Setup the LinuxAioState bound to this AioContext */
+int aio_setup_linux_aio(AioContext *ctx);
+
 /* Return the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
 
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0e717fd475..81b90e5fc6 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -43,7 +43,7 @@
 /* linux-aio.c - Linux native implementation */
 #ifdef CONFIG_LINUX_AIO
 typedef struct LinuxAioState LinuxAioState;
-LinuxAioState *laio_init(void);
+int laio_init(LinuxAioState **linux_aio);
 void laio_cleanup(LinuxAioState *s);
 int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
                                 uint64_t offset, QEMUIOVector *qiov, int type);
diff --git a/stubs/linux-aio.c b/stubs/linux-aio.c
index ed47bd443c..88ab927e35 100644
--- a/stubs/linux-aio.c
+++ b/stubs/linux-aio.c
@@ -21,7 +21,7 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context)
     abort();
 }
 
-LinuxAioState *laio_init(void)
+int laio_init(LinuxAioState **linux_aio)
 {
     abort();
 }
diff --git a/util/async.c b/util/async.c
index 03f62787f2..b3fb67af3c 100644
--- a/util/async.c
+++ b/util/async.c
@@ -323,12 +323,21 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
 }
 
 #ifdef CONFIG_LINUX_AIO
-LinuxAioState *aio_get_linux_aio(AioContext *ctx)
+int aio_setup_linux_aio(AioContext *ctx)
 {
+    int rc;
+    rc = 0;
     if (!ctx->linux_aio) {
-        ctx->linux_aio = laio_init();
-        laio_attach_aio_context(ctx->linux_aio, ctx);
+        rc = laio_init(&ctx->linux_aio);
+        if (rc == 0) {
+            laio_attach_aio_context(ctx->linux_aio, ctx);
+        }
     }
+    return rc;
+}
+
+LinuxAioState *aio_get_linux_aio(AioContext *ctx)
+{
     return ctx->linux_aio;
 }
 #endif
laio_init() can fail for a couple of reasons, which will lead to a NULL pointer dereference in laio_attach_aio_context().

To solve this, add a aio_setup_linux_aio() function which is called before aio_get_linux_aio() where it is called currently, and which propogates setup errors up. The signature of aio_get_linux_aio() was not modified, because it seems preferable to return the actual errno from the possible failing initialization calls.

With respect to the error-handling in the file-posix.c, we properly bubble any errors up in raw_co_prw and in the case s of raw_aio_{,un}plug, the result is the same as if s->use_linux_aio was not set (but there is no bubbling up). In all three cases, if the setup function fails, we fallback to the thread pool and an error message is emitted.

It is trivial to make qemu segfault in my testing. Set /proc/sys/fs/aio-max-nr to 0 and start a guest with aio=native,cache=directsync. With this patch, the guest successfully starts (but obviously isn't using native AIO). Setting aio-max-nr back up to a reasonable value, AIO contexts are consumed normally.

Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>

---

Changes from v1 -> v2:

Rather than affect virtio-scsi/blk at all, make all the changes internal to file-posix.c. Thanks to Kevin Wolf for the suggested change.
---
 block/file-posix.c      | 24 ++++++++++++++++++++++++
 block/linux-aio.c       | 15 ++++++++++-----
 include/block/aio.h     |  3 +++
 include/block/raw-aio.h |  2 +-
 stubs/linux-aio.c       |  2 +-
 util/async.c            | 15 ++++++++++++---
 6 files changed, 51 insertions(+), 10 deletions(-)