[4/5] migration: Activate block devices if VM is paused when migrating

Message ID: 20241125144612.16194-5-farosas@suse.de
State: New
Series: migration: Fix the BDRV_O_INACTIVE assertion

Commit Message

Fabiano Rosas Nov. 25, 2024, 2:46 p.m. UTC
Currently, a VM that has been the target of a migration using
late-block-activate will crash at the end of a new migration (with it
as the source) when releasing ownership of the disks, because the VM
never took ownership of the disks in the first place.

The issue is that late-block-activate expects a qmp_continue command
to be issued at some point on the destination VM after the migration
finishes. If the user decides never to continue the VM, but instead
issues a new migration, then bdrv_activate_all() will never be called
and the assert will be reached:

bdrv_inactivate_recurse: Assertion `!(bs->open_flags &
BDRV_O_INACTIVE)' failed.

Fix the issue by checking at the start of migration whether the VM is
paused and calling bdrv_activate_all() before migrating. Even if the
late-block-activate capability is not in play, or if the VM has been
paused manually, there is no harm in calling that function again.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
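
To make the failure mode concrete, here is a minimal, self-contained C
sketch of why the assertion trips. The struct and flag value are
stand-ins, not QEMU's actual code: inactivation asserts that the node
is still active, and a node left inactive by late-block-activate was
never re-activated.

    /* Sketch only; QEMU's real bdrv_inactivate_recurse() lives in
     * block.c. The flag value and struct are stand-ins. */
    #include <assert.h>

    #define BDRV_O_INACTIVE 0x0800  /* stand-in flag */

    typedef struct BlockDriverState {
        int open_flags;
    } BlockDriverState;

    static void bdrv_inactivate_recurse(BlockDriverState *bs)
    {
        assert(!(bs->open_flags & BDRV_O_INACTIVE));
        bs->open_flags |= BDRV_O_INACTIVE;
    }

    int main(void)
    {
        /* Incoming migration with late-block-activate that was never
         * continued: the node stays inactive. */
        BlockDriverState bs = { .open_flags = BDRV_O_INACTIVE };

        /* A new outgoing migration now tries to release the disks: */
        bdrv_inactivate_recurse(&bs);  /* assertion failure, as reported */
        return 0;
    }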

Comments

Peter Xu Nov. 25, 2024, 5:07 p.m. UTC | #1
On Mon, Nov 25, 2024 at 11:46:11AM -0300, Fabiano Rosas wrote:
> Currently, a VM that has been the target of a migration using
> late-block-activate will crash at the end of a new migration (with it
> as the source) when releasing ownership of the disks, because the VM
> never took ownership of the disks in the first place.
> 
> The issue is that late-block-activate expects a qmp_continue command
> to be issued at some point on the destination VM after the migration
> finishes. If the user decides never to continue the VM, but instead
> issues a new migration, then bdrv_activate_all() will never be called
> and the assert will be reached:
> 
> bdrv_inactivate_recurse: Assertion `!(bs->open_flags &
> BDRV_O_INACTIVE)' failed.
> 
> Fix the issue by checking at the start of migration whether the VM is
> paused and calling bdrv_activate_all() before migrating. Even if the
> late-block-activate capability is not in play, or if the VM has been
> paused manually, there is no harm in calling that function again.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index aedf7f0751..26af30137b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2029,6 +2029,25 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>          return false;
>      }
>  
> +    /*
> +     * The VM might have been the target of a previous migration. If
> +     * it was in the paused state, then nothing will have required
> +     * the block layer to be activated. Do it now to ensure this QEMU
> +     * instance owns the disk locks.
> +     */
> +    if (!resume && runstate_check(RUN_STATE_PAUSED)) {

I hope this will cover all the cases that QEMU could otherwise
overlook. Alternatively, I wonder whether we could invoke
bdrv_activate_all() unconditionally, as it looks safe to use when all
disks are already active.

I wish we didn't need to bother with disk activation status at all,
because IIUC migration could work just fine even if all disks are
inactive when preparing the migration. Hence such a change always
looks like a workaround for a separate issue.
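
For what it's worth, the property being assumed is sketched below
(stand-in types again, not QEMU's implementation): if activation
returns early for a node that is not inactive, calling it
unconditionally is harmless. Whether every block driver actually
upholds this is exactly the open question.

    /* Sketch of idempotent activation; the real bdrv_activate_all()
     * does much more (drivers, children, permissions). */
    #define BDRV_O_INACTIVE 0x0800  /* stand-in flag */

    typedef struct BlockDriverState { int open_flags; } BlockDriverState;

    static int bdrv_activate_sketch(BlockDriverState *bs)
    {
        if (!(bs->open_flags & BDRV_O_INACTIVE)) {
            return 0;  /* already active: calling again is a no-op */
        }
        bs->open_flags &= ~BDRV_O_INACTIVE;  /* (re)take disk ownership */
        return 0;
    }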

> +        Error *local_err = NULL;
> +
> +        g_assert(bql_locked());
> +
> +        bdrv_activate_all(&local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return false;
> +        }
> +        s->block_inactive = false;

So far this var was only used within one migration iteration, and it
was only set in migration_completion_precopy(). Now we're resetting it
at the start of a migration. I'm not 100% sure whether that's needed,
or whether it should be put somewhere else.
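
For context, the lifecycle being described looks roughly like this
(heavily simplified; the stub types and calls merely stand in for the
real ones, and the failure-path half is an assumption about where the
flag gets consumed):

    #include <stdbool.h>

    typedef struct MigrationState { bool block_inactive; } MigrationState;

    /* Stubs standing in for the real block-layer calls. */
    static void bdrv_inactivate_all(void) {}
    static void bdrv_activate_all(void) {}

    /* Within a single migration attempt: give up the disks at
     * completion time and remember that we did. */
    static void completion_sketch(MigrationState *s)
    {
        bdrv_inactivate_all();
        s->block_inactive = true;  /* previously the only place it was set */
    }

    /* On a failure path: reclaim the disks so the VM can resume. */
    static void failure_sketch(MigrationState *s)
    {
        if (s->block_inactive) {
            bdrv_activate_all();
            s->block_inactive = false;
        }
    }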

In general, I saw mention of other places that may also try to
invalidate disks that were already invalidated. If that's the case, I
wish we didn't need to touch migration code at all; if the block layer
could instead cope with "invalidate on top of invalidated disks", that
would be perfect.
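
In that spirit, the tolerance being wished for might look something
like the following (a hypothetical sketch, not a proposed patch;
stand-in types as above):

    /* Hypothetical variant of the guard in bdrv_inactivate_recurse():
     * return early instead of asserting, so inactivating an
     * already-inactive node becomes a no-op. */
    #define BDRV_O_INACTIVE 0x0800  /* stand-in flag */

    typedef struct BlockDriverState { int open_flags; } BlockDriverState;

    static int bdrv_inactivate_sketch(BlockDriverState *bs)
    {
        if (bs->open_flags & BDRV_O_INACTIVE) {
            return 0;  /* already inactive: nothing to release */
        }
        bs->open_flags |= BDRV_O_INACTIVE;
        /* ...flush, drop file locks, recurse into children... */
        return 0;
    }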

> +    }
> +
>      return true;
>  }
>  
> -- 
> 2.35.3
>

Patch

diff --git a/migration/migration.c b/migration/migration.c
index aedf7f0751..26af30137b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2029,6 +2029,25 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
         return false;
     }
 
+    /*
+     * The VM might have been the target of a previous migration. If
+     * it was in the paused state, then nothing will have required
+     * the block layer to be activated. Do it now to ensure this QEMU
+     * instance owns the disk locks.
+     */
+    if (!resume && runstate_check(RUN_STATE_PAUSED)) {
+        Error *local_err = NULL;
+
+        g_assert(bql_locked());
+
+        bdrv_activate_all(&local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return false;
+        }
+        s->block_inactive = false;
+    }
+
     return true;
 }