Message ID | 20230414153358.1452040-1-eblake@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] migration: Handle block device inactivation failures better
On Fri, 14 Apr 2023 10:33:58 -0500 Eric Blake <eblake@redhat.com> wrote:

> Consider what happens when performing a migration between two host
> machines connected to an NFS server serving multiple block devices to
> the guest, when the NFS server becomes unavailable. The migration
> attempts to inactivate all block devices on the source (a necessary
> step before the destination can take over); but if the NFS server is
> non-responsive, the attempt to inactivate can itself fail. When that
> happens, the destination fails to get the migrated guest (good,
> because the source wasn't able to flush everything properly):
>
>   (qemu) qemu-kvm: load of migration failed: Input/output error
>
> at which point, our only hope for the guest is for the source to take
> back control. With the current code base, the host outputs a message,
> but then appears to resume:
>
>   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
>
>   (src qemu) info status
>   VM status: running
>
> but a second migration attempt now asserts:
>
>   (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
>
> Whether the guest is recoverable on the source after the first failure
> is debatable, but what we do not want is to have qemu itself fail due
> to an assertion. It looks like the problem is as follows:
>
> In migration.c:migration_completion(), the source sets 'inactivate' to
> true (since COLO is not enabled), then tries
> savevm.c:qemu_savevm_state_complete_precopy() with a request to
> inactivate block devices. In turn, this calls
> block.c:bdrv_inactivate_all(), which fails when flushing runs up
> against the non-responsive NFS server.
>
> With savevm failing, we are now left in a state where some, but not
> all, of the block devices have been inactivated; but
> migration_completion() then jumps to 'fail' rather than
> 'fail_invalidate' and skips an attempt to reclaim those disks by
> calling bdrv_activate_all(). Even if we do attempt to reclaim disks,
> we aren't taking note of failure there, either.
>
> Thus, we have reached a state where the migration engine has forgotten
> all state about whether a block device is inactive, because we did not
> set s->block_inactive in enough places; so migration allows the source
> to reach vm_start() and resume execution, violating the block layer
> invariant that the guest CPUs should not be restarted while a device
> is inactive. Note that the code in migration.c:migrate_fd_cancel()
> will also try to reactivate all block devices if s->block_inactive was
> set, but because we failed to set that flag after the first failure,
> the source assumes it has reclaimed all devices, even though it still
> has remaining inactivated devices and does not try again. Normally,
> qmp_cont() will also try to reactivate all disks (or correctly fail if
> the disks are not reclaimable because NFS is not yet back up), but the
> auto-resumption of the source after a migration failure does not go
> through qmp_cont(). And because we have left the block layer in an
> inconsistent state with devices still inactivated, the later migration
> attempt is hitting the assertion failure.
>
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start() until
> it has successfully reactivated all devices.
>
> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Looks good to me from the colo side of things.

Acked-by: Lukas Straub <lukasstraub2@web.de>
Tested-by: Lukas Straub <lukasstraub2@web.de>

Best Regards,
Lukas Straub

> ---
>
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                        MIGRATION_STATUS_DEVICE);
>          }
>          if (ret >= 0) {
> +            s->block_inactive = inactivate;
>              qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>              ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                       inactivate);
>          }
> -        if (inactivate && ret >= 0) {
> -            s->block_inactive = true;
> -        }
>      }
>      qemu_mutex_unlock_iothread();
>
> @@ -3522,6 +3520,7 @@ fail_invalidate:
>      bdrv_activate_all(&local_err);
>      if (local_err) {
>          error_report_err(local_err);
> +        s->block_inactive = true;
>      } else {
>          s->block_inactive = false;
>      }
>
> base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6
--
Eric Blake <eblake@redhat.com> wrote: > Consider what happens when performing a migration between two host > machines connected to an NFS server serving multiple block devices to > the guest, when the NFS server becomes unavailable. The migration > attempts to inactivate all block devices on the source (a necessary > step before the destination can take over); but if the NFS server is > non-responsive, the attempt to inactivate can itself fail. When that > happens, the destination fails to get the migrated guest (good, > because the source wasn't able to flush everything properly): > > (qemu) qemu-kvm: load of migration failed: Input/output error > > at which point, our only hope for the guest is for the source to take > back control. With the current code base, the host outputs a message, but then appears to resume: > > (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1) > > (src qemu)info status > VM status: running > > but a second migration attempt now asserts: > > (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed. > > Whether the guest is recoverable on the source after the first failure > is debatable, but what we do not want is to have qemu itself fail due > to an assertion. It looks like the problem is as follows: > > In migration.c:migration_completion(), the source sets 'inactivate' to > true (since COLO is not enabled), then tries > savevm.c:qemu_savevm_state_complete_precopy() with a request to > inactivate block devices. In turn, this calls > block.c:bdrv_inactivate_all(), which fails when flushing runs up > against the non-responsive NFS server. 
With savevm failing, we are > now left in a state where some, but not all, of the block devices have > been inactivated; but migration_completion() then jumps to 'fail' > rather than 'fail_invalidate' and skips an attempt to reclaim those > those disks by calling bdrv_activate_all(). Even if we do attempt to > reclaim disks, we aren't taking note of failure there, either. > > Thus, we have reached a state where the migration engine has forgotten > all state about whether a block device is inactive, because we did not > set s->block_inactive in enough places; so migration allows the source > to reach vm_start() and resume execution, violating the block layer > invariant that the guest CPUs should not be restarted while a device > is inactive. Note that the code in migration.c:migrate_fd_cancel() > will also try to reactivate all block devices if s->block_inactive was > set, but because we failed to set that flag after the first failure, > the source assumes it has reclaimed all devices, even though it still > has remaining inactivated devices and does not try again. Normally, > qmp_cont() will also try to reactivate all disks (or correctly fail if > the disks are not reclaimable because NFS is not yet back up), but the > auto-resumption of the source after a migration failure does not go > through qmp_cont(). And because we have left the block layer in an > inconsistent state with devices still inactivated, the later migration > attempt is hitting the assertion failure. > > Since it is important to not resume the source with inactive disks, > this patch marks s->block_inactive before attempting inactivation, > rather than after succeeding, in order to prevent any vm_start() until > it has successfully reactivated all devices. > > See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982 > > Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> queued. 
> ---
>
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                        MIGRATION_STATUS_DEVICE);
>          }
>          if (ret >= 0) {
> +            s->block_inactive = inactivate;
>              qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>              ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                       inactivate);
>          }
> -        if (inactivate && ret >= 0) {
> -            s->block_inactive = true;
> -        }
>      }
>      qemu_mutex_unlock_iothread();

And I still have to look at the file to understand this "simple" patch
(simple in size, not in what it means).

I will add this to my queue, but if you are in the "mood", I would like
to remove the declaration of inactivate and change this to something like:

    if (ret >= 0) {
        /* Colo don't stop disks in normal operation */
        s->block_inactive = !migrate_colo_enabled();
        qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
        ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                 s->block_inactive);
    }

Or something along those lines?

> @@ -3522,6 +3520,7 @@ fail_invalidate:
>      bdrv_activate_all(&local_err);
>      if (local_err) {
>          error_report_err(local_err);
> +        s->block_inactive = true;
>      } else {
>          s->block_inactive = false;
>      }
>
> base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6

Just wondering, what git magic creates this line?

Thanks, Juan.
On Thu, Apr 20, 2023 at 12:41:25PM +0200, Juan Quintela wrote: > Eric Blake <eblake@redhat.com> wrote: ...lots of lines... > > --- > > migration/migration.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) ...describing a tiny change ;) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index bda47891933..cb0d42c0610 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s) > > MIGRATION_STATUS_DEVICE); > > } > > if (ret >= 0) { > > + s->block_inactive = inactivate; > > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > > ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, > > inactivate); > > } > > - if (inactivate && ret >= 0) { > > - s->block_inactive = true; > > - } > > } > > qemu_mutex_unlock_iothread(); > > And I still have to look at the file to understand this "simple" patch. > (simple in size, not in what it means). Indeed - hence the long commit message! > > I will add this to my queue, but if you are in the "mood", I would like > to remove the declaration of inactivate and change this to something like: > > if (ret >= 0) { > /* Colo don't stop disks in normal operation */ > s->block_inactive = !migrate_colo_enabled(); > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, > s->block_inactive); > } > > Or something around that lines? Yes, that looks like a trivial refactoring that preserves the same semantics. > > > @@ -3522,6 +3520,7 @@ fail_invalidate: > > bdrv_activate_all(&local_err); > > if (local_err) { > > error_report_err(local_err); > > + s->block_inactive = true; > > } else { > > s->block_inactive = false; > > } > > base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6 > > Just wondering, what git magic creates this line? git send-email --base=COMMIT_ID or even 'git config format.useAutoBase whenAble' to try and automate the use of this. 
(If my own git habits were cleaner, of always sticking patches in fresh branches, --base=auto is handy; but in practice, I tend to send one-off patches like this in the middle of 'git rebase' of a larger series, at which point I'm not on a clean branch where --base=auto works, so I end up having to manually specify one at the command line. Either way, including the base-commit: info can be very informative for applying a patch at the branch point then rebasing it locally, when attempting to apply the patch sent through email hits merge conflicts when applying it directly to changes on master in the meantime; I believe 'git am -3' is even able to exploit the comment when present to make smarter decisions about which parent commit it tries for doing 3-way patch resolution)
Eric Blake <eblake@redhat.com> wrote: > On Thu, Apr 20, 2023 at 12:41:25PM +0200, Juan Quintela wrote: >> Eric Blake <eblake@redhat.com> wrote: > > ...lots of lines... > >> > --- >> > migration/migration.c | 5 ++--- >> > 1 file changed, 2 insertions(+), 3 deletions(-) > > ...describing a tiny change ;) > >> > >> > diff --git a/migration/migration.c b/migration/migration.c >> > index bda47891933..cb0d42c0610 100644 >> > --- a/migration/migration.c >> > +++ b/migration/migration.c >> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s) >> > MIGRATION_STATUS_DEVICE); >> > } >> > if (ret >= 0) { >> > + s->block_inactive = inactivate; >> > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); >> > ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, >> > inactivate); >> > } >> > - if (inactivate && ret >= 0) { >> > - s->block_inactive = true; >> > - } >> > } >> > qemu_mutex_unlock_iothread(); >> >> And I still have to look at the file to understand this "simple" patch. >> (simple in size, not in what it means). > > Indeed - hence the long commit message! > >> >> I will add this to my queue, but if you are in the "mood", I would like >> to remove the declaration of inactivate and change this to something like: >> >> if (ret >= 0) { >> /* Colo don't stop disks in normal operation */ >> s->block_inactive = !migrate_colo_enabled(); >> qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); >> ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, >> s->block_inactive); >> } >> >> Or something around that lines? > > Yes, that looks like a trivial refactoring that preserves the same > semantics. > >> >> > @@ -3522,6 +3520,7 @@ fail_invalidate: >> > bdrv_activate_all(&local_err); >> > if (local_err) { >> > error_report_err(local_err); >> > + s->block_inactive = true; >> > } else { >> > s->block_inactive = false; >> > } >> > base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6 >> >> Just wondering, what git magic creates this line? 
> git send-email --base=COMMIT_ID
>
> or even 'git config format.useAutoBase whenAble' to try and automate
> the use of this. (If my own git habits were cleaner, of always
> sticking patches in fresh branches, --base=auto is handy; but in
> practice, I tend to send one-off patches like this in the middle of
> 'git rebase' of a larger series, at which point I'm not on a clean
> branch where --base=auto works, so I end up having to manually specify
> one at the command line. Either way, including the base-commit: info
> can be very informative for applying a patch at the branch point then
> rebasing it locally, when attempting to apply the patch sent through
> email hits merge conflicts when applying it directly to changes on
> master in the meantime; I believe 'git am -3' is even able to exploit
> the comment when present to make smarter decisions about which parent
> commit it tries for doing 3-way patch resolution)

Thanks a lot.

It does the right thing for "trivial" stuff, i.e. when I send a single
patch or a series against qemu/master. I am not completely sure that it
does the right thing when I send a series on top of my previous pull
request.

097387873b (HEAD -> atomic_counters) migration: Make dirty_bytes_last_sync atomic
3f699a13b2 migration: Make dirty_pages_rate atomic
276a275895 (next) migration: Move qmp_migrate_set_parameters() to options.c
ab13b47801 migration: Move migrate_use_tls() to options.c
ecf5c18eac MAINTAINERS: Add Leonardo and Peter as reviewers
6e5dda696c migration: Disable postcopy + multifd migration
ac5f7bf8e2 (qemu/staging, qemu/master, qemu/HEAD) Merge tag 'migration-20230424-pull-request' of https://gitlab.com/juan.quintela/qemu into staging

where the branches are:

qemu/master: obvious

next: whatever is going to be on the next migration PULL request. I
will rename this to migration-$date and send this series to the list
first, i.e. assume they are on the list but still not on master.
HEAD/atomic_counters: This is the series that I am sending.

I have done first:

    git branch --set-upstream-to=qemu/master

Because that is what the "real" master is; migration-$date is just an
intermediate "state".

    git format-patch --cover-letter next -o /tmp/kk

In this case both useAutoBase=whenAble and useAutoBase=true do the same
"right" thing:

From 097387873b2ef1894d5713fdfda8a7b2492476e5 Mon Sep 17 00:00:00 2001
...
Juan Quintela (2):
  migration: Make dirty_pages_rate atomic
  migration: Make dirty_bytes_last_sync atomic

 migration/migration.c | 10 +++++++---
 migration/ram.c       |  8 +++++---
 migration/ram.h       |  4 ++--
 3 files changed, 14 insertions(+), 8 deletions(-)

base-commit: ac5f7bf8e208cd7893dbb1a9520559e569a4677c
prerequisite-patch-id: 08dd37c2ffd8463398e00cade80765b017200b68
prerequisite-patch-id: 3a1d25d5e4f1f615b6e2c6749dcf891959ca48b5
prerequisite-patch-id: 5607c75cc228370df8953987c390682de3093b65
prerequisite-patch-id: ccb4d94973bd111380e4b50f781eeb6cfa7ce5ff

In obvious cases I do the rebase on top of qemu/master, but that is when
there are no dependencies with the PULL request, and that is not the
"interesting" case.

Thanks again, Juan.
Hi Eric, you asked me for a review downstream, but since you would have to bring back any problem to upstream anyway, let's discuss it here. For the start, let me state that (a) I don't fully understand why this patch fixes things and (b) I hate this function. More below. Am 14.04.2023 um 17:33 hat Eric Blake geschrieben: > Consider what happens when performing a migration between two host > machines connected to an NFS server serving multiple block devices to > the guest, when the NFS server becomes unavailable. The migration > attempts to inactivate all block devices on the source (a necessary > step before the destination can take over); but if the NFS server is > non-responsive, the attempt to inactivate can itself fail. When that > happens, the destination fails to get the migrated guest (good, > because the source wasn't able to flush everything properly): > > (qemu) qemu-kvm: load of migration failed: Input/output error > > at which point, our only hope for the guest is for the source to take > back control. With the current code base, the host outputs a message, but then appears to resume: > > (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1) > > (src qemu)info status > VM status: running > > but a second migration attempt now asserts: > > (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed. > > Whether the guest is recoverable on the source after the first failure > is debatable, but what we do not want is to have qemu itself fail due > to an assertion. It looks like the problem is as follows: > > In migration.c:migration_completion(), the source sets 'inactivate' to > true (since COLO is not enabled), then tries > savevm.c:qemu_savevm_state_complete_precopy() with a request to > inactivate block devices. 
In turn, this calls > block.c:bdrv_inactivate_all(), which fails when flushing runs up > against the non-responsive NFS server. With savevm failing, we are > now left in a state where some, but not all, of the block devices have > been inactivated; but migration_completion() then jumps to 'fail' > rather than 'fail_invalidate' and skips an attempt to reclaim those > those disks by calling bdrv_activate_all(). Even if we do attempt to > reclaim disks, we aren't taking note of failure there, either. Why do we even jump to 'fail'? In other words, should 'fail_inactivate' really be called 'fail' and everything should jump there? Greg added the 'fail_inactivate' label in fe904ea8242, but the commit message doesn't seem to tell why he left one goto. I see no reason why we wouldn't want to reactivate in this case, too. Maybe it's just for the colo case? > Thus, we have reached a state where the migration engine has forgotten > all state about whether a block device is inactive, because we did not > set s->block_inactive in enough places; so migration allows the source > to reach vm_start() and resume execution, violating the block layer > invariant that the guest CPUs should not be restarted while a device > is inactive. Note that the code in migration.c:migrate_fd_cancel() > will also try to reactivate all block devices if s->block_inactive was > set, but because we failed to set that flag after the first failure, > the source assumes it has reclaimed all devices, even though it still > has remaining inactivated devices and does not try again. Normally, > qmp_cont() will also try to reactivate all disks (or correctly fail if > the disks are not reclaimable because NFS is not yet back up), but the > auto-resumption of the source after a migration failure does not go > through qmp_cont(). And because we have left the block layer in an > inconsistent state with devices still inactivated, the later migration > attempt is hitting the assertion failure. 
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start() until
> it has successfully reactivated all devices.

Here's the part that I don't understand: Even if you set
s->block_inactive, where do we actually use this value and reactivate
the image?

The only reader of the field is migrate_fd_cancel(), which is only
called by migration_cancel() (a very small wrapper; it's a mystery why
this exists when it's the only caller). migration_cancel() in turn is
called in very few places:

* qmp_migrate_cancel: In our case, migration fails by itself, it's not
  cancelled from QMP. So this is not where we're coming from.

* ram_mig_ram_block_resized: This one is an internal error during
  migration, but what we're seeing is not related to RAM at all. So this
  isn't where we're coming from either.

* migration_shutdown: Only called while shutting down QEMU. Doesn't look
  like our case either.

So while this patch fixes some state inconsistencies, how is it fixing
anything for the reported bug when this state is never used in the
relevant places?

(That I don't understand the fix is what blocks my downstream review.
The rest of my points are really only for upstream anyway.)
> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
>
> ---
>
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                        MIGRATION_STATUS_DEVICE);
>          }
>          if (ret >= 0) {
> +            s->block_inactive = inactivate;
>              qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>              ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                       inactivate);
>          }
> -        if (inactivate && ret >= 0) {
> -            s->block_inactive = true;
> -        }

This part of the code has now really become unintuitive. After commit
f07fa4cbf0b we had perfectly intuitive code:

    ret = bdrv_inactivate_all();
    if (ret >= 0) {
        s->block_inactive = true;
    }

Since then, the bdrv_inactivate_all() call has been moved to
qemu_savevm_state_complete_precopy_non_iterable(), and now you changed
the order because even on failure, we could end up with some inactivated
nodes. I'm not arguing that either was a bad change, but the assignment
to s->block_inactive looks really random now.

I think this desperately needs a comment.

>      }
>      qemu_mutex_unlock_iothread();
>
> @@ -3522,6 +3520,7 @@ fail_invalidate:
>      bdrv_activate_all(&local_err);
>      if (local_err) {
>          error_report_err(local_err);
> +        s->block_inactive = true;

bdrv_activate_all() never inactivates a node that was active before. So
it seems that this line only ever comes into play if s->block_inactive
was incorrect before.

I feel what we should do here is only try to activate if
s->block_inactive was set above, and then have a single 'fail' label
that always runs the re-activation code.

>      } else {
>          s->block_inactive = false;
>      }

Kevin
On Tue, May 02, 2023 at 11:54:12AM +0200, Kevin Wolf wrote: > Hi Eric, > > you asked me for a review downstream, but since you would have to bring > back any problem to upstream anyway, let's discuss it here. For the > start, let me state that (a) I don't fully understand why this patch > fixes things and (b) I hate this function. More below. Sadly, I also fall into (a) I don't know if this patch fully fixes things, or if we're playing whack-a-mole and another bug is still lurking, and (b) the control flow is indeed horrendous, where I'm also not sure if I'm fully understanding how migration is supposed to be working. > > Am 14.04.2023 um 17:33 hat Eric Blake geschrieben: > > Consider what happens when performing a migration between two host > > machines connected to an NFS server serving multiple block devices to > > the guest, when the NFS server becomes unavailable. The migration > > attempts to inactivate all block devices on the source (a necessary > > step before the destination can take over); but if the NFS server is > > non-responsive, the attempt to inactivate can itself fail. When that > > happens, the destination fails to get the migrated guest (good, > > because the source wasn't able to flush everything properly): > > > > (qemu) qemu-kvm: load of migration failed: Input/output error > > > > at which point, our only hope for the guest is for the source to take > > back control. With the current code base, the host outputs a message, but then appears to resume: > > > > (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1) > > > > (src qemu)info status > > VM status: running > > > > but a second migration attempt now asserts: > > > > (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed. 
> > > > Whether the guest is recoverable on the source after the first failure > > is debatable, but what we do not want is to have qemu itself fail due > > to an assertion. It looks like the problem is as follows: > > > > In migration.c:migration_completion(), the source sets 'inactivate' to > > true (since COLO is not enabled), then tries > > savevm.c:qemu_savevm_state_complete_precopy() with a request to > > inactivate block devices. In turn, this calls > > block.c:bdrv_inactivate_all(), which fails when flushing runs up > > against the non-responsive NFS server. With savevm failing, we are > > now left in a state where some, but not all, of the block devices have > > been inactivated; but migration_completion() then jumps to 'fail' > > rather than 'fail_invalidate' and skips an attempt to reclaim those > > those disks by calling bdrv_activate_all(). Even if we do attempt to > > reclaim disks, we aren't taking note of failure there, either. > > Why do we even jump to 'fail'? In other words, should 'fail_inactivate' > really be called 'fail' and everything should jump there? > > Greg added the 'fail_inactivate' label in fe904ea8242, but the commit > message doesn't seem to tell why he left one goto. I see no reason why > we wouldn't want to reactivate in this case, too. Maybe it's just for > the colo case? 
At the time of fe904ea8, all actions done by fail_invalidate: were guarded by a mere if (s->state == MIGRATION_STATUS_ACTIVE) Later, in 6039dd5b1c the guard was expanded to if (s->state == MIGRATION_STATUS_ACTIVE || s->state == MIGRATION_STATUS_DEVICE) But reading the rest of the function (at either those prior commit points, or at the present), I'm inclined to agree that all remaining 'goto fail' either happened at a point where s->state cannot match in the first place (so consolidating labels is harmless), or is the ONE place after we already checked s->state == MIGRATION_STATUS_ACTIVE or just changed s->state to MIGRATION_STATUS_DEVICE, and our failure could be from one of vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) (nothing to reactivate, as we haven't inactivated yet), migration_maybe_pause() (where we change state, but still haven't inactivated), or from qemu_savevm_state_complete_precopy() (where 'goto fail' does skip out on the reactivation, so consolidating the labels would make sense to me, although at the time I wrote the patch, I was too afraid to change the code that drastically). Are we sure that bdrv_activate_all() is safe to call no matter what? We already know that a request to inactivate when something is already inactive asserts, but a request to activate something that is already active is generally a no-op. If so, I'm tending to agree with you that having a single label where we ALWAYS attempt reactivation after failure makes sense. I'll post another patch along those lines. > > > Thus, we have reached a state where the migration engine has forgotten > > all state about whether a block device is inactive, because we did not > > set s->block_inactive in enough places; so migration allows the source > > to reach vm_start() and resume execution, violating the block layer > > invariant that the guest CPUs should not be restarted while a device > > is inactive. 
> > Note that the code in migration.c:migrate_fd_cancel()
> > will also try to reactivate all block devices if s->block_inactive was
> > set, but because we failed to set that flag after the first failure,
> > the source assumes it has reclaimed all devices, even though it still
> > has remaining inactivated devices and does not try again.  Normally,
> > qmp_cont() will also try to reactivate all disks (or correctly fail if
> > the disks are not reclaimable because NFS is not yet back up), but the
> > auto-resumption of the source after a migration failure does not go
> > through qmp_cont().  And because we have left the block layer in an
> > inconsistent state with devices still inactivated, the later migration
> > attempt is hitting the assertion failure.
> >
> > Since it is important to not resume the source with inactive disks,
> > this patch marks s->block_inactive before attempting inactivation,
> > rather than after succeeding, in order to prevent any vm_start() until
> > it has successfully reactivated all devices.
>
> Here's the part that I don't understand: even if you set
> s->block_inactive, where do we actually use this value and reactivate
> the image?
>
> The only reader of the field is migrate_fd_cancel(), which is only
> called by migration_cancel() (a very small wrapper; it's a mystery why
> this exists when it's the only caller).  migration_cancel() in turn is
> called in very few places:
>
> * qmp_migrate_cancel: In our case, migration fails by itself; it's not
>   cancelled from QMP.  So this is not where we're coming from.
>
> * ram_mig_ram_block_resized: This one is an internal error during
>   migration, but what we're seeing is not related to RAM at all.  So
>   this isn't where we're coming from either.
>
> * migration_shutdown: Only called while shutting down QEMU.  Doesn't
>   look like our case either.
> So while this patch fixes some state inconsistencies, how is it fixing
> anything for the reported bug when this state is never used in the
> relevant places?
>
> (That I don't understand the fix is what blocks my downstream review.
> The rest of my points are really only for upstream anyway.)

I may get to respin the downstream patch based on my upcoming upstream
patch...

> > See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
> >
> > Signed-off-by: Eric Blake <eblake@redhat.com>
> >
> > ---
> >
> > v2: Set s->block_inactive sooner [Juan]
> > ---
> >  migration/migration.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bda47891933..cb0d42c0610 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
> >                                                MIGRATION_STATUS_DEVICE);
> >          }
> >          if (ret >= 0) {
> > +            s->block_inactive = inactivate;
> >              qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> >              ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
> >                                                       inactivate);
> >          }
> > -        if (inactivate && ret >= 0) {
> > -            s->block_inactive = true;
> > -        }
>
> This part of the code has now really become unintuitive.  After commit
> f07fa4cbf0b we had perfectly intuitive code:
>
>     ret = bdrv_inactivate_all();
>     if (ret >= 0) {
>         s->block_inactive = true;
>     }
>
> Since then, the bdrv_inactivate_all() call has been moved to
> qemu_savevm_state_complete_precopy_non_iterable(), and now you changed
> the order because even on failure, we could end up with some inactivated
> nodes.  I'm not arguing that either was a bad change, but the assignment
> to s->block_inactive looks really random now.
>
> I think this desperately needs a comment.

Sure, I can add it.
> >      }
> >      qemu_mutex_unlock_iothread();
> >
> > @@ -3522,6 +3520,7 @@ fail_invalidate:
> >          bdrv_activate_all(&local_err);
> >          if (local_err) {
> >              error_report_err(local_err);
> > +            s->block_inactive = true;
>
> bdrv_activate_all() never inactivates a node that was active before.  So
> it seems that this line only ever comes into play if s->block_inactive
> was incorrect before.
>
> I feel what we should do here is only try to activate if
> s->block_inactive was set above, and then have a single 'fail' label
> that always runs the re-activation code.
>
> >          } else {
> >              s->block_inactive = false;
> >          }
>
> Kevin
diff --git a/migration/migration.c b/migration/migration.c
index bda47891933..cb0d42c0610 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
                                               MIGRATION_STATUS_DEVICE);
         }
         if (ret >= 0) {
+            s->block_inactive = inactivate;
             qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
             ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                      inactivate);
         }
-        if (inactivate && ret >= 0) {
-            s->block_inactive = true;
-        }
     }
     qemu_mutex_unlock_iothread();

@@ -3522,6 +3520,7 @@ fail_invalidate:
         bdrv_activate_all(&local_err);
         if (local_err) {
             error_report_err(local_err);
+            s->block_inactive = true;
         } else {
             s->block_inactive = false;
         }
Consider what happens when performing a migration between two host
machines connected to an NFS server serving multiple block devices to
the guest, when the NFS server becomes unavailable.  The migration
attempts to inactivate all block devices on the source (a necessary
step before the destination can take over); but if the NFS server is
non-responsive, the attempt to inactivate can itself fail.  When that
happens, the destination fails to get the migrated guest (good,
because the source wasn't able to flush everything properly):

(qemu) qemu-kvm: load of migration failed: Input/output error

at which point, our only hope for the guest is for the source to take
back control.  With the current code base, the host outputs a message,
but then appears to resume:

(qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)

(src qemu) info status
VM status: running

but a second migration attempt now asserts:

(src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.

Whether the guest is recoverable on the source after the first failure
is debatable, but what we do not want is to have qemu itself fail due
to an assertion.  It looks like the problem is as follows:

In migration.c:migration_completion(), the source sets 'inactivate' to
true (since COLO is not enabled), then tries
savevm.c:qemu_savevm_state_complete_precopy() with a request to
inactivate block devices.  In turn, this calls
block.c:bdrv_inactivate_all(), which fails when flushing runs up
against the non-responsive NFS server.  With savevm failing, we are
now left in a state where some, but not all, of the block devices have
been inactivated; but migration_completion() then jumps to 'fail'
rather than 'fail_invalidate' and skips an attempt to reclaim those
disks by calling bdrv_activate_all().  Even if we do attempt to
reclaim disks, we aren't taking note of failure there, either.
Thus, we have reached a state where the migration engine has forgotten
all state about whether a block device is inactive, because we did not
set s->block_inactive in enough places; so migration allows the source
to reach vm_start() and resume execution, violating the block layer
invariant that the guest CPUs should not be restarted while a device
is inactive.  Note that the code in migration.c:migrate_fd_cancel()
will also try to reactivate all block devices if s->block_inactive was
set, but because we failed to set that flag after the first failure,
the source assumes it has reclaimed all devices, even though it still
has remaining inactivated devices and does not try again.  Normally,
qmp_cont() will also try to reactivate all disks (or correctly fail if
the disks are not reclaimable because NFS is not yet back up), but the
auto-resumption of the source after a migration failure does not go
through qmp_cont().  And because we have left the block layer in an
inconsistent state with devices still inactivated, the later migration
attempt is hitting the assertion failure.

Since it is important to not resume the source with inactive disks,
this patch marks s->block_inactive before attempting inactivation,
rather than after succeeding, in order to prevent any vm_start() until
it has successfully reactivated all devices.

See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982

Signed-off-by: Eric Blake <eblake@redhat.com>

---

v2: Set s->block_inactive sooner [Juan]
---
 migration/migration.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6