Message ID | 20230728133928.256898-1-f.ebner@proxmox.com
---|---
State | New, archived
Series | [RFC] migration/block-dirty-bitmap: make loading bitmap for device with iothread future-proof
Fiona Ebner <f.ebner@proxmox.com> wrote:
> The bdrv_create_dirty_bitmap() function (which is also called by
> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
> a wrapper around a coroutine, and when not called in coroutine context
> would use bdrv_poll_co(). Such a call would trigger an assert() if the
> correct AioContext hasn't been acquired before, because polling would
> try to release the AioContext.

The ingenuous in me thinks:

If the problem is that bdrv_poll_co() releases an AioContext that it
hasn't acquired, perhaps we should fix bdrv_poll_co().

Ha!!!

$ find . -type f -exec grep --color=auto -nH --null -e bdrv_poll_co \{\} +
./scripts/block-coroutine-wrapper.py\0173:        bdrv_poll_co(&s.poll_state);
./scripts/block-coroutine-wrapper.py\0198:        bdrv_poll_co(&s.poll_state);
./block/block-gen.h\038:static inline void bdrv_poll_co(BdrvPollCo *s)
$

/me retreats

> The issue does not happen for migration, because the call happens
> from process_incoming_migration_co(), i.e. in coroutine context. So
> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
> directly without polling.

The ingenuous in me wonders why bdrv_getlength() needs to use coroutines
at all, but as I have been burned by the previous paragraph, I have
learned not to even try.

Ok, I never learn, so I did a grep and I see two appearances of
bdrv_getlength in include files, but grep only shows uses of the
function, not a real definition. I even saw co_wrapper_mixed_bdrv_rdlock
and decided to stop here.

> The issue would happen for snapshots, but won't in practice, because
> saving a snapshot with a block dirty bitmap is currently not possible.
> The reason is that dirty_bitmap_save_iterate() returns whether it has
> completed the bulk phase, which only happens in postcopy, so
> qemu_savevm_state_iterate() will always return 0, meaning the call
> to iterate will be repeated over and over again without ever reaching
> the completion phase.
>
> Still, this would make the code more robust for the future.

What I wonder is whether we should annotate these calls somehow to say
that they need to be called from coroutine context, because I would
never have found this.

And just thinking out loud:

I still wonder if we should make incoming migration its own thread. We
keep getting more and more problems because it is a coroutine that, in
the non-multifd case, can consume a whole CPU alone, so it makes no
sense for it to be a coroutine. On the other hand, with multifd, it
uses almost no resources, so ....

> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> We ran into this issue downstream, because we have a custom snapshot
> mechanism which does support dirty bitmaps and does not run in
> coroutine context during load.
>
>  migration/block-dirty-bitmap.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 032fc5f405..e1ae3b7316 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>                       "destination", bdrv_dirty_bitmap_name(s->bitmap));
>          return -EINVAL;
>      } else {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
>                                               s->bitmap_name, &local_err);
> +        aio_context_release(ctx);
>          if (!s->bitmap) {
>              error_report_err(local_err);
>              return -EINVAL;

I was going to suggest putting that code inside bdrv_create_dirty_bitmap,
because migration code is already complex enough without having to know
about AioContext, but it appears that this pattern is already used in
several places.

> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>
>      bdrv_disable_dirty_bitmap(s->bitmap);
>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
> +        aio_context_release(ctx);
>          if (local_err) {
>              error_report_err(local_err);
>              return -EINVAL;

for (i = 1; i < 1000; i++)
    printf("I will not grep what a successor of a dirty bitmap is\n");

Later, Juan.
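[Editorial note on the annotation question: QEMU already has a marker for
functions that must only run in coroutine context, the coroutine_fn
attribute (available via qemu/coroutine.h); by convention such functions
also carry a _co_ infix. A minimal illustration follows; the helper name
is made up, and this only shows the marker, not any real API:

#include "qemu/osdep.h"
#include "qemu/coroutine.h"

/*
 * coroutine_fn documents that this function may only be called from
 * coroutine context (where mixed wrappers such as bdrv_getlength()
 * skip the polling path). The helper name is invented for illustration.
 */
static int coroutine_fn example_co_helper(void)
{
    /* It is safe to call other coroutine_fn functions directly here. */
    return 0;
}
]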
On 31.07.23 09:35, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> The bdrv_create_dirty_bitmap() function (which is also called by
>> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
>> a wrapper around a coroutine, and when not called in coroutine context
>> would use bdrv_poll_co(). Such a call would trigger an assert() if the
>> correct AioContext hasn't been acquired before, because polling would
>> try to release the AioContext.
>
> The ingenuous in me thinks:
>
> If the problem is that bdrv_poll_co() releases an AioContext that it
> hasn't acquired, perhaps we should fix bdrv_poll_co().

AFAIU, it's necessary to release the lock there, so somebody else can
make progress while we poll.

>> The issue would happen for snapshots,

Sorry, what I wrote is not true, because bdrv_co_load_vmstate() is a
coroutine, so I think it would be the same situation as for migration
and bdrv_co_getlength() would be called directly by the wrapper.

And the patch itself also has an issue. AFAICT, when
qcow2_co_load_vmstate() is called, we have already acquired the context
for the drive we load the snapshot from, and since
polling/AIO_WAIT_WHILE requires that the caller has acquired the
context exactly once, we'd need to distinguish whether the dirty bitmap
is for that drive (we can't acquire the context a second time) or for a
different drive (we need to acquire the context for the first time).

>> but won't in practice, because
>> saving a snapshot with a block dirty bitmap is currently not possible.
>> The reason is that dirty_bitmap_save_iterate() returns whether it has
>> completed the bulk phase, which only happens in postcopy, so
>> qemu_savevm_state_iterate() will always return 0, meaning the call
>> to iterate will be repeated over and over again without ever reaching
>> the completion phase.
>>
>> Still, this would make the code more robust for the future.
>
> What I wonder is whether we should annotate these calls somehow to say
> that they need to be called from coroutine context, because I would
> never have found this.
>
> And just thinking out loud:
>
> I still wonder if we should make incoming migration its own thread. We
> keep getting more and more problems because it is a coroutine that, in
> the non-multifd case, can consume a whole CPU alone, so it makes no
> sense for it to be a coroutine.
>

That would then be a situation where the issue would pop up (assuming
the new thread doesn't also use a coroutine to load the state).

> On the other hand, with multifd, it uses almost no resources, so ....
On 31.07.23 10:35, Juan Quintela wrote:
> Fiona Ebner <f.ebner@proxmox.com> wrote:
>> The bdrv_create_dirty_bitmap() function (which is also called by
>> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
>> a wrapper around a coroutine, and when not called in coroutine context
>> would use bdrv_poll_co(). Such a call would trigger an assert() if the
>> correct AioContext hasn't been acquired before, because polling would
>> try to release the AioContext.
>
> The ingenuous in me thinks:
>
> If the problem is that bdrv_poll_co() releases an AioContext that it
> hasn't acquired, perhaps we should fix bdrv_poll_co().
>
> Ha!!!
>
> $ find . -type f -exec grep --color=auto -nH --null -e bdrv_poll_co \{\} +
> ./scripts/block-coroutine-wrapper.py\0173:        bdrv_poll_co(&s.poll_state);
> ./scripts/block-coroutine-wrapper.py\0198:        bdrv_poll_co(&s.poll_state);
> ./block/block-gen.h\038:static inline void bdrv_poll_co(BdrvPollCo *s)
> $
>
> /me retreats

The function is used in generated code. There are a lot of calls in
build/block/block-gen.c if you grep after make.

>
>> The issue does not happen for migration, because the call happens
>> from process_incoming_migration_co(), i.e. in coroutine context. So
>> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
>> directly without polling.
>
> The ingenuous in me wonders why bdrv_getlength() needs to use coroutines
> at all, but as I have been burned by the previous paragraph, I have
> learned not to even try.
>
> Ok, I never learn, so I did a grep and I see two appearances of
> bdrv_getlength in include files, but grep only shows uses of the
> function, not a real definition.

The function is generated, and after building, its definition is in
build/block/block-gen.c.

The information is in a comment in include/block/block-common.h. The
link that leads from the function declaration to that comment is
"co_wrapper", but that's not obvious when you just grep for the
function name.
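[Editorial note, to make the mechanism more concrete: below is a
hand-written approximation of the shape of such a generated mixed
wrapper. This is not the literal output of
scripts/block-coroutine-wrapper.py; the struct, its fields and the entry
point are invented for illustration, while bdrv_getlength(),
bdrv_co_getlength(), qemu_in_coroutine(), qemu_coroutine_create() and
bdrv_poll_co() are the real players:

/* Rough shape of a co_wrapper_mixed function as emitted into block-gen.c. */
int64_t bdrv_getlength(BlockDriverState *bs)
{
    if (qemu_in_coroutine()) {
        /* Already in coroutine context: call the coroutine version directly. */
        return bdrv_co_getlength(bs);
    } else {
        /*
         * Outside coroutine context: spawn a coroutine and poll until it
         * finishes. This polling path is the one that expects the caller
         * to hold the node's AioContext and asserts otherwise.
         */
        BdrvGetlengthState s = { .bs = bs };                 /* invented struct */
        s.poll_state.co = qemu_coroutine_create(bdrv_getlength_entry, &s); /* invented entry point */
        s.poll_state.in_progress = true;
        bdrv_poll_co(&s.poll_state);
        return s.ret;
    }
}
]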
add Kevin, Paolo, Emanuele, pls take a look

On 28.07.23 16:39, Fiona Ebner wrote:
> The bdrv_create_dirty_bitmap() function (which is also called by
> bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
> a wrapper around a coroutine, and when not called in coroutine context
> would use bdrv_poll_co(). Such a call would trigger an assert() if the
> correct AioContext hasn't been acquired before, because polling would
> try to release the AioContext.
>
> The issue does not happen for migration, because the call happens
> from process_incoming_migration_co(), i.e. in coroutine context. So
> the bdrv_getlength() wrapper will just call bdrv_co_getlength()
> directly without polling.
>
> The issue would happen for snapshots, but won't in practice, because
> saving a snapshot with a block dirty bitmap is currently not possible.
> The reason is that dirty_bitmap_save_iterate() returns whether it has
> completed the bulk phase, which only happens in postcopy, so
> qemu_savevm_state_iterate() will always return 0, meaning the call
> to iterate will be repeated over and over again without ever reaching
> the completion phase.
>
> Still, this would make the code more robust for the future.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> We ran into this issue downstream, because we have a custom snapshot
> mechanism which does support dirty bitmaps and does not run in
> coroutine context during load.
>
>  migration/block-dirty-bitmap.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 032fc5f405..e1ae3b7316 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>                       "destination", bdrv_dirty_bitmap_name(s->bitmap));
>          return -EINVAL;
>      } else {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
>                                               s->bitmap_name, &local_err);
> +        aio_context_release(ctx);
>          if (!s->bitmap) {
>              error_report_err(local_err);
>              return -EINVAL;
> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>
>      bdrv_disable_dirty_bitmap(s->bitmap);
>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
> +        aio_context_acquire(ctx);
>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
> +        aio_context_release(ctx);

Wouldn't this deadlock with the current code, when we have only one aio
context and are therefore already in it?

If, as Juan said, we rework the incoming migration coroutine to be a
separate thread, this patch becomes more correct, I think.

If we keep the coroutine, I think we should check whether we are
already in that aio context, and if so not acquire it.

>          if (local_err) {
>              error_report_err(local_err);
>              return -EINVAL;
On 04.10.23 18:13, Vladimir Sementsov-Ogievskiy wrote:
>> @@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
>>
>>      bdrv_disable_dirty_bitmap(s->bitmap);
>>      if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
>> +        AioContext *ctx = bdrv_get_aio_context(s->bs);
>> +        aio_context_acquire(ctx);
>>          bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
>> +        aio_context_release(ctx);
>
> Wouldn't this deadlock with the current code, when we have only one aio
> context and are therefore already in it?
>

Yes, I noticed that myself a bit later ;)

On 01.08.23 09:57, Fiona Ebner wrote:
> And the patch itself also has an issue. AFAICT, when
> qcow2_co_load_vmstate() is called, we have already acquired the context
> for the drive we load the snapshot from, and since
> polling/AIO_WAIT_WHILE requires that the caller has acquired the
> context exactly once, we'd need to distinguish whether the dirty bitmap
> is for that drive (we can't acquire the context a second time) or for a
> different drive (we need to acquire the context for the first time).

Quoted from a reply in this thread:
https://lists.nongnu.org/archive/html/qemu-devel/2023-08/msg00007.html

On 04.10.23 18:13, Vladimir Sementsov-Ogievskiy wrote:
> If, as Juan said, we rework the incoming migration coroutine to be a
> separate thread, this patch becomes more correct, I think.

Yes, it would become an issue when the function is called from a
non-coroutine context.

> If we keep the coroutine, I think we should check whether we are
> already in that aio context, and if so not acquire it.

In coroutine context, we don't need to acquire it, but it shouldn't
hurt either, and this approach should work for non-coroutine context
too. The question is whether such conditional lock-taking is acceptable
(do we already have something similar somewhere?) or whether it can be
avoided somehow, as was preferred in another one of my patches:

On 05.05.23 16:59, Peter Xu wrote:
> On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
>> To fix it, ensure that the BQL is held during setup. To avoid changing
>> the behavior for migration too, introduce conditionals for the setup
>> callbacks that need the BQL and only take the lock if it's not already
>> held.
>
> The major complexity of this patch is the "conditionally taking" part.

Quoted from:
https://lists.nongnu.org/archive/html/qemu-devel/2023-05/msg01514.html
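[Editorial note: for illustration, the conditional acquisition discussed
above could look roughly like this inside dirty_bitmap_load_start(). This
is an untested sketch, and whether such conditional lock-taking is
acceptable is exactly the open question in the thread:

AioContext *ctx = bdrv_get_aio_context(s->bs);
/* Only take the lock if we are not already running in the bitmap's context. */
bool in_ctx = qemu_get_current_aio_context() == ctx;

if (!in_ctx) {
    aio_context_acquire(ctx);
}
s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
                                     s->bitmap_name, &local_err);
if (!in_ctx) {
    aio_context_release(ctx);
}
]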
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 032fc5f405..e1ae3b7316 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -805,8 +805,11 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
                      "destination", bdrv_dirty_bitmap_name(s->bitmap));
         return -EINVAL;
     } else {
+        AioContext *ctx = bdrv_get_aio_context(s->bs);
+        aio_context_acquire(ctx);
         s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity,
                                              s->bitmap_name, &local_err);
+        aio_context_release(ctx);
         if (!s->bitmap) {
             error_report_err(local_err);
             return -EINVAL;
@@ -833,7 +836,10 @@ static int dirty_bitmap_load_start(QEMUFile *f, DBMLoadState *s)
 
     bdrv_disable_dirty_bitmap(s->bitmap);
     if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) {
+        AioContext *ctx = bdrv_get_aio_context(s->bs);
+        aio_context_acquire(ctx);
         bdrv_dirty_bitmap_create_successor(s->bitmap, &local_err);
+        aio_context_release(ctx);
         if (local_err) {
             error_report_err(local_err);
             return -EINVAL;
The bdrv_create_dirty_bitmap() function (which is also called by
bdrv_dirty_bitmap_create_successor()) uses bdrv_getlength(bs). This is
a wrapper around a coroutine, and when not called in coroutine context
would use bdrv_poll_co(). Such a call would trigger an assert() if the
correct AioContext hasn't been acquired before, because polling would
try to release the AioContext.

The issue does not happen for migration, because the call happens
from process_incoming_migration_co(), i.e. in coroutine context. So
the bdrv_getlength() wrapper will just call bdrv_co_getlength()
directly without polling.

The issue would happen for snapshots, but won't in practice, because
saving a snapshot with a block dirty bitmap is currently not possible.
The reason is that dirty_bitmap_save_iterate() returns whether it has
completed the bulk phase, which only happens in postcopy, so
qemu_savevm_state_iterate() will always return 0, meaning the call
to iterate will be repeated over and over again without ever reaching
the completion phase.

Still, this would make the code more robust for the future.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

We ran into this issue downstream, because we have a custom snapshot
mechanism which does support dirty bitmaps and does not run in
coroutine context during load.

 migration/block-dirty-bitmap.c | 6 ++++++
 1 file changed, 6 insertions(+)