diff mbox series

[v2,6/9] migration: Don't set FAILED state when cancelling

Message ID 20250211150136.6781-7-farosas@suse.de (mailing list archive)
State New
Headers show
Series migration: Fix issues during qmp_migrate_cancel | expand

Commit Message

Fabiano Rosas Feb. 11, 2025, 3:01 p.m. UTC
It's possible that the migration is cancelled during
migration_switchover_start(). In that case, don't set the migration
state FAILED in migration_completion().

Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Peter Xu Feb. 11, 2025, 5:46 p.m. UTC | #1
On Tue, Feb 11, 2025 at 12:01:33PM -0300, Fabiano Rosas wrote:
> It's possible that the migration is cancelled during
> migration_switchover_start(). In that case, don't set the migration
> state FAILED in migration_completion().
> 
> Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

I remember I paid some attention on this one when working on the commit,
where it has:

static bool migration_switchover_prepare(MigrationState *s)
{
    /* Concurrent cancellation?  Quit */
    if (s->state == MIGRATION_STATUS_CANCELLING) {   <================= [1]
        return false;
    }
    ...
    bql_unlock();

    qemu_sem_wait(&s->pause_sem);

    bql_lock();
    /*
     * After BQL released and retaken, the state can be CANCELLING if it
     * happend during sem_wait().. Only change the state if it's still
     * pre-switchover.
     */
    migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, <====== [2]
                      MIGRATION_STATUS_DEVICE);

    return s->state == MIGRATION_STATUS_DEVICE;
}

So when holding BQL logically it can't change to CANCELLING, it'll check
first [1] making sure no prior CANCELLING.  Then after release and retake
BQL it'll check again [2] (see the comment above [2], it's done by passing
in explicit old_state to not change it if it's CANCELLING).

Any hint on how this could be triggered?

OTOH, when looking at this.. I seem to have found a bug indeed (which could
be another?), where I may have forgot to touch up the old_state in
migrate_set_state() after switching to always use DEVICE..

diff --git a/migration/migration.c b/migration/migration.c
index 74c50cc72c..513e5955cc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2793,8 +2793,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
 fail_closefb:
     qemu_fclose(fb);
 fail:
-    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                          MIGRATION_STATUS_FAILED);
+    if (ms->state != MIGRATION_STATUS_CANCELLING) {
+        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
+    }
     migration_block_activate(NULL);
     migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
     bql_unlock();

I'm not sure whether it's relevant to what you hit, though.. since you're
looking at this, I'd rely on you help figuring it out before I do.. :)

> ---
>  migration/migration.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 375de6d460..5dc43bcdc0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2986,7 +2986,9 @@ fail:
>          error_free(local_err);
>      }
>  
> -    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +    if (s->state != MIGRATION_STATUS_CANCELLING) {
> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +    }
>  }
>  
>  /**
> -- 
> 2.35.3
>
Fabiano Rosas Feb. 11, 2025, 6:04 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Tue, Feb 11, 2025 at 12:01:33PM -0300, Fabiano Rosas wrote:
>> It's possible that the migration is cancelled during
>> migration_switchover_start(). In that case, don't set the migration
>> state FAILED in migration_completion().
>> 
>> Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> I remember I paid some attention on this one when working on the commit,
> where it has:
>
> static bool migration_switchover_prepare(MigrationState *s)
> {
>     /* Concurrent cancellation?  Quit */
>     if (s->state == MIGRATION_STATUS_CANCELLING) {   <================= [1]
>         return false;
>     }
>     ...
>     bql_unlock();
>
>     qemu_sem_wait(&s->pause_sem);
>
>     bql_lock();
>     /*
>      * After BQL released and retaken, the state can be CANCELLING if it
>      * happend during sem_wait().. Only change the state if it's still
>      * pre-switchover.
>      */
>     migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, <====== [2]
>                       MIGRATION_STATUS_DEVICE);
>
>     return s->state == MIGRATION_STATUS_DEVICE;
> }
>
> So when holding BQL logically it can't change to CANCELLING, it'll check
> first [1] making sure no prior CANCELLING.  Then after release and retake
> BQL it'll check again [2] (see the comment above [2], it's done by passing
> in explicit old_state to not change it if it's CANCELLING).

Right, it doesn't change the state. But the function returns false and
someone else changes to FAILED. That's what both my patch and your
snippet below fix.

>
> Any hint on how this could be triggered?
>
> OTOH, when looking at this.. I seem to have found a bug indeed (which could
> be another?), where I may have forgot to touch up the old_state in
> migrate_set_state() after switching to always use DEVICE..
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 74c50cc72c..513e5955cc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2793,8 +2793,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>  fail_closefb:
>      qemu_fclose(fb);
>  fail:
> -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> -                          MIGRATION_STATUS_FAILED);
> +    if (ms->state != MIGRATION_STATUS_CANCELLING) {
> +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> +    }

Now that I think about it, we should probably just use the skip at
migrate_set_state() always. Isn't this^ the same as:

migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
MIGRATION_STATUS_FAILED);

Better to list the state explicitly, no?

Or... do we want to incorporate this into migrate_set_state()?

void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
                       MigrationStatus new_state)
{
    assert(new_state < MIGRATION_STATUS__MAX);

    if (qatomic_read(state) == CANCELLING && new_state != CANCELLED) {
        /* Once it's cancelling, there's no way back, it must finish cancel */
        return;
    }

    if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
        trace_migrate_set_state(MigrationStatus_str(new_state));
        migrate_generate_event(new_state);
    }
}

>      migration_block_activate(NULL);
>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>      bql_unlock();
>
> I'm not sure whether it's relevant to what you hit, though.. since you're
> looking at this, I'd rely on you help figuring it out before I do.. :)
>
>> ---
>>  migration/migration.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 375de6d460..5dc43bcdc0 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2986,7 +2986,9 @@ fail:
>>          error_free(local_err);
>>      }
>>  
>> -    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> +    if (s->state != MIGRATION_STATUS_CANCELLING) {
>> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> +    }
>>  }
>>  
>>  /**
>> -- 
>> 2.35.3
>>
Peter Xu Feb. 11, 2025, 7:43 p.m. UTC | #3
On Tue, Feb 11, 2025 at 03:04:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Feb 11, 2025 at 12:01:33PM -0300, Fabiano Rosas wrote:
> >> It's possible that the migration is cancelled during
> >> migration_switchover_start(). In that case, don't set the migration
> >> state FAILED in migration_completion().
> >> 
> >> Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > I remember I paid some attention on this one when working on the commit,
> > where it has:
> >
> > static bool migration_switchover_prepare(MigrationState *s)
> > {
> >     /* Concurrent cancellation?  Quit */
> >     if (s->state == MIGRATION_STATUS_CANCELLING) {   <================= [1]
> >         return false;
> >     }
> >     ...
> >     bql_unlock();
> >
> >     qemu_sem_wait(&s->pause_sem);
> >
> >     bql_lock();
> >     /*
> >      * After BQL released and retaken, the state can be CANCELLING if it
> >      * happend during sem_wait().. Only change the state if it's still
> >      * pre-switchover.
> >      */
> >     migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, <====== [2]
> >                       MIGRATION_STATUS_DEVICE);
> >
> >     return s->state == MIGRATION_STATUS_DEVICE;
> > }
> >
> > So when holding BQL logically it can't change to CANCELLING, it'll check
> > first [1] making sure no prior CANCELLING.  Then after release and retake
> > BQL it'll check again [2] (see the comment above [2], it's done by passing
> > in explicit old_state to not change it if it's CANCELLING).
> 
> Right, it doesn't change the state. But the function returns false and
> someone else changes to FAILED. That's what both my patch and your
> snippet below fix.
> 
> >
> > Any hint on how this could be triggered?
> >
> > OTOH, when looking at this.. I seem to have found a bug indeed (which could
> > be another?), where I may have forgot to touch up the old_state in
> > migrate_set_state() after switching to always use DEVICE..
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 74c50cc72c..513e5955cc 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2793,8 +2793,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> >  fail_closefb:
> >      qemu_fclose(fb);
> >  fail:
> > -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > -                          MIGRATION_STATUS_FAILED);
> > +    if (ms->state != MIGRATION_STATUS_CANCELLING) {
> > +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> > +    }
> 
> Now that I think about it, we should probably just use the skip at
> migrate_set_state() always. Isn't this^ the same as:
> 
> migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> MIGRATION_STATUS_FAILED);
> 
> Better to list the state explicitly, no?

There's one case where it can be in ACTIVE rather than DEVICE,
unfortunately:

    ret = migration_stop_vm(ms, RUN_STATE_FINISH_MIGRATE);
    if (ret < 0) {
        error_setg_errno(errp, -ret, "%s: Failed to stop the VM", __func__);
        goto fail;
    }

> 
> Or... do we want to incorporate this into migrate_set_state()?
> 
> void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>                        MigrationStatus new_state)
> {
>     assert(new_state < MIGRATION_STATUS__MAX);
> 
>     if (qatomic_read(state) == CANCELLING && new_state != CANCELLED) {
>         /* Once it's cancelling, there's no way back, it must finish cancel */
>         return;
>     }
> 
>     if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
>         trace_migrate_set_state(MigrationStatus_str(new_state));
>         migrate_generate_event(new_state);
>     }
> }

IMHO we'll need the original migrate_set_state() more or less, e.g. when
setting CANCELLING->CANCELLED in migration_[fd_]cleanup().  So maybe it's
slightly easier we keep it.

Said that, maybe we could have a few helpers for the state transitions,
like:

  migrate_set_state_failure(MigrationState *s)

Which can consider CANCELLING.

Also, we have a portion of such state transitions not caring about current
state, so we could also have some helper for that, like:

  migrate_set_state_always(MigrationState *s, MigrationStatus status)

Or rename old migrate_set_state() into migrate_set_state_atomic(), then
make migrate_set_state() to ignore current state.

> 
> >      migration_block_activate(NULL);
> >      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> >      bql_unlock();
> >
> > I'm not sure whether it's relevant to what you hit, though.. since you're
> > looking at this, I'd rely on you help figuring it out before I do.. :)
> >
> >> ---
> >>  migration/migration.c | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index 375de6d460..5dc43bcdc0 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -2986,7 +2986,9 @@ fail:
> >>          error_free(local_err);
> >>      }
> >>  
> >> -    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> >> +    if (s->state != MIGRATION_STATUS_CANCELLING) {
> >> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> >> +    }
> >>  }
> >>  
> >>  /**
> >> -- 
> >> 2.35.3
> >> 
>
Fabiano Rosas Feb. 11, 2025, 8:22 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Tue, Feb 11, 2025 at 03:04:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Feb 11, 2025 at 12:01:33PM -0300, Fabiano Rosas wrote:
>> >> It's possible that the migration is cancelled during
>> >> migration_switchover_start(). In that case, don't set the migration
>> >> state FAILED in migration_completion().
>> >> 
>> >> Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start")
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > I remember I paid some attention on this one when working on the commit,
>> > where it has:
>> >
>> > static bool migration_switchover_prepare(MigrationState *s)
>> > {
>> >     /* Concurrent cancellation?  Quit */
>> >     if (s->state == MIGRATION_STATUS_CANCELLING) {   <================= [1]
>> >         return false;
>> >     }
>> >     ...
>> >     bql_unlock();
>> >
>> >     qemu_sem_wait(&s->pause_sem);
>> >
>> >     bql_lock();
>> >     /*
>> >      * After BQL released and retaken, the state can be CANCELLING if it
>> >      * happend during sem_wait().. Only change the state if it's still
>> >      * pre-switchover.
>> >      */
>> >     migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, <====== [2]
>> >                       MIGRATION_STATUS_DEVICE);
>> >
>> >     return s->state == MIGRATION_STATUS_DEVICE;
>> > }
>> >
>> > So when holding BQL logically it can't change to CANCELLING, it'll check
>> > first [1] making sure no prior CANCELLING.  Then after release and retake
>> > BQL it'll check again [2] (see the comment above [2], it's done by passing
>> > in explicit old_state to not change it if it's CANCELLING).
>> 
>> Right, it doesn't change the state. But the function returns false and
>> someone else changes to FAILED. That's what both my patch and your
>> snippet below fix.
>> 
>> >
>> > Any hint on how this could be triggered?
>> >
>> > OTOH, when looking at this.. I seem to have found a bug indeed (which could
>> > be another?), where I may have forgot to touch up the old_state in
>> > migrate_set_state() after switching to always use DEVICE..
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 74c50cc72c..513e5955cc 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -2793,8 +2793,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>> >  fail_closefb:
>> >      qemu_fclose(fb);
>> >  fail:
>> > -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> > -                          MIGRATION_STATUS_FAILED);
>> > +    if (ms->state != MIGRATION_STATUS_CANCELLING) {
>> > +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>> > +    }
>> 
>> Now that I think about it, we should probably just use the skip at
>> migrate_set_state() always. Isn't this^ the same as:
>> 
>> migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
>> MIGRATION_STATUS_FAILED);
>> 
>> Better to list the state explicitly, no?
>
> There's one case where it can be in ACTIVE rather than DEVICE,
> unfortunately:
>
>     ret = migration_stop_vm(ms, RUN_STATE_FINISH_MIGRATE);
>     if (ret < 0) {
>         error_setg_errno(errp, -ret, "%s: Failed to stop the VM", __func__);
>         goto fail;
>     }
>
>> 
>> Or... do we want to incorporate this into migrate_set_state()?
>> 
>> void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>>                        MigrationStatus new_state)
>> {
>>     assert(new_state < MIGRATION_STATUS__MAX);
>> 
>>     if (qatomic_read(state) == CANCELLING && new_state != CANCELLED) {
>>         /* Once it's cancelling, there's no way back, it must finish cancel */
>>         return;
>>     }
>> 
>>     if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
>>         trace_migrate_set_state(MigrationStatus_str(new_state));
>>         migrate_generate_event(new_state);
>>     }
>> }
>
> IMHO we'll need the original migrate_set_state() more or less, e.g. when
> setting CANCELLING->CANCELLED in migration_[fd_]cleanup().  So maybe it's
> slightly easier we keep it.
>
> Said that, maybe we could have a few helpers for the state transitions,
> like:
>
>   migrate_set_state_failure(MigrationState *s)
>
> Which can consider CANCELLING.
>
> Also, we have a portion of such state transitions not caring about current
> state, so we could also have some helper for that, like:
>
>   migrate_set_state_always(MigrationState *s, MigrationStatus status)
>
> Or rename old migrate_set_state() into migrate_set_state_atomic(), then
> make migrate_set_state() to ignore current state.
>

Thanks for the input.

I think for this series I'll stick with the

if (s->state != MIGRATION_STATUS_CANCELLING)

and prepare a new series with generic improvements to the migration
state code. I want to also fix the nomenclature of status vs. state.

>> 
>> >      migration_block_activate(NULL);
>> >      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>> >      bql_unlock();
>> >
>> > I'm not sure whether it's relevant to what you hit, though.. since you're
>> > looking at this, I'd rely on you help figuring it out before I do.. :)
>> >
>> >> ---
>> >>  migration/migration.c | 4 +++-
>> >>  1 file changed, 3 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/migration/migration.c b/migration/migration.c
>> >> index 375de6d460..5dc43bcdc0 100644
>> >> --- a/migration/migration.c
>> >> +++ b/migration/migration.c
>> >> @@ -2986,7 +2986,9 @@ fail:
>> >>          error_free(local_err);
>> >>      }
>> >>  
>> >> -    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> >> +    if (s->state != MIGRATION_STATUS_CANCELLING) {
>> >> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> >> +    }
>> >>  }
>> >>  
>> >>  /**
>> >> -- 
>> >> 2.35.3
>> >> 
>>
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 375de6d460..5dc43bcdc0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2986,7 +2986,9 @@  fail:
         error_free(local_err);
     }
 
-    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+    if (s->state != MIGRATION_STATUS_CANCELLING) {
+        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+    }
 }
 
 /**