diff mbox series

[for-9.0,1/2] migration: Set migration error in migration_completion()

Message ID 20240328140252.16756-2-avihaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series migration: Two migration bug fixes | expand

Commit Message

Avihai Horon March 28, 2024, 2:02 p.m. UTC
After commit 9425ef3f990a ("migration: Use migrate_has_error() in
close_return_path_on_source()"), close_return_path_on_source() assumes
that migration error is set if an error occurs during migration.

This may not be true if migration errors in migration_completion(). For
example, if qemu_savevm_state_complete_precopy() errors, migration error
will not be set.

This in turn, will cause a migration hang bug, similar to the bug that
was fixed by commit 22b04245f0d5 ("migration: Join the return path
thread before releasing to_dst_file"), as shutdown() will not be issued
for the return-path channel.

Fix it by ensuring migration error is set in case of error in
migration_completion().

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/migration.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Peter Xu March 28, 2024, 3:09 p.m. UTC | #1
On Thu, Mar 28, 2024 at 04:02:51PM +0200, Avihai Horon wrote:
> After commit 9425ef3f990a ("migration: Use migrate_has_error() in
> close_return_path_on_source()"), close_return_path_on_source() assumes
> that migration error is set if an error occurs during migration.
> 
> This may not be true if migration errors in migration_completion(). For
> example, if qemu_savevm_state_complete_precopy() errors, migration error
> will not be set.
> 
> This in turn, will cause a migration hang bug, similar to the bug that
> was fixed by commit 22b04245f0d5 ("migration: Join the return path
> thread before releasing to_dst_file"), as shutdown() will not be issued
> for the return-path channel.
> 
> Fix it by ensuring migration error is set in case of error in
> migration_completion().
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

I'll attach this if it looks all right to you:

Fixes: 9425ef3f990a ("migration: Use migrate_has_error() in close_return_path_on_source()")

Thanks,

> ---
>  migration/migration.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 9fe8fd2afd7..b73ae3a72c4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2799,6 +2799,7 @@ static void migration_completion(MigrationState *s)
>  {
>      int ret = 0;
>      int current_active_state = s->state;
> +    Error *local_err = NULL;
>  
>      if (s->state == MIGRATION_STATUS_ACTIVE) {
>          ret = migration_completion_precopy(s, &current_active_state);
> @@ -2832,6 +2833,15 @@ static void migration_completion(MigrationState *s)
>      return;
>  
>  fail:
> +    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
> +        migrate_set_error(s, local_err);
> +        error_free(local_err);
> +    } else if (ret) {
> +        error_setg_errno(&local_err, -ret, "Error in migration completion");
> +        migrate_set_error(s, local_err);
> +        error_free(local_err);
> +    }
> +
>      migration_completion_failed(s, current_active_state);
>  }
>  
> -- 
> 2.26.3
> 
>
Cédric Le Goater March 28, 2024, 3:21 p.m. UTC | #2
Hello Avihai,

On 3/28/24 15:02, Avihai Horon wrote:
> After commit 9425ef3f990a ("migration: Use migrate_has_error() in
> close_return_path_on_source()"), close_return_path_on_source() assumes
> that migration error is set if an error occurs during migration.
> 
> This may not be true if migration errors in migration_completion(). For
> example, if qemu_savevm_state_complete_precopy() errors, migration error
> will not be set

Out of curiosity, could you describe a bit more the context ? Did
vfio_save_complete_precopy() fail ? why ?

We should propagate errors of .save_live_complete_precopy() handlers as
it was done .save_setup handlers(). For 9.1.

> This in turn, will cause a migration hang bug, similar to the bug that
> was fixed by commit 22b04245f0d5 ("migration: Join the return path
> thread before releasing to_dst_file"), as shutdown() will not be issued
> for the return-path channel.

yes, but this test :

     if (ret < 0) {
         goto fail;
     }

will skip the close_return_path_on_source() call. Won't it ? So I don't
understand how it can be an issue. Am I missing something ?

> Fix it by ensuring migration error is set in case of error in
> migration_completion().

Why didn't you add a reference to commit 9425ef3f990a ?


> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   migration/migration.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 9fe8fd2afd7..b73ae3a72c4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2799,6 +2799,7 @@ static void migration_completion(MigrationState *s)
>   {
>       int ret = 0;
>       int current_active_state = s->state;
> +    Error *local_err = NULL;
>   
>       if (s->state == MIGRATION_STATUS_ACTIVE) {
>           ret = migration_completion_precopy(s, &current_active_state);
> @@ -2832,6 +2833,15 @@ static void migration_completion(MigrationState *s)
>       return;
>   
>   fail:
> +    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
> +        migrate_set_error(s, local_err);
> +        error_free(local_err);
> +    } else if (ret) {
> +        error_setg_errno(&local_err, -ret, "Error in migration completion");

The 'ret = -1' case could be improved with error_setg(). As a followup.

Thanks,

C.




> +        migrate_set_error(s, local_err);
> +        error_free(local_err);
> +    }
> +
>       migration_completion_failed(s, current_active_state);
>   }
>
Avihai Horon March 28, 2024, 3:50 p.m. UTC | #3
On 28/03/2024 17:21, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> Hello Avihai,
>
> On 3/28/24 15:02, Avihai Horon wrote:
>> After commit 9425ef3f990a ("migration: Use migrate_has_error() in
>> close_return_path_on_source()"), close_return_path_on_source() assumes
>> that migration error is set if an error occurs during migration.
>>
>> This may not be true if migration errors in migration_completion(). For
>> example, if qemu_savevm_state_complete_precopy() errors, migration error
>> will not be set
>
> Out of curiosity, could you describe a bit more the context ? Did
> vfio_save_complete_precopy() fail ? why ?

Yep, vfio_save_complete_precopy() failed (but it failed while I was 
experimenting with an unofficial debug FW).

>
> We should propagate errors of .save_live_complete_precopy() handlers as
> it was done .save_setup handlers(). For 9.1.

Agreed.

>
>> This in turn, will cause a migration hang bug, similar to the bug that
>> was fixed by commit 22b04245f0d5 ("migration: Join the return path
>> thread before releasing to_dst_file"), as shutdown() will not be issued
>> for the return-path channel.
>
> yes, but this test :
>
>     if (ret < 0) {
>         goto fail;
>     }
>
> will skip the close_return_path_on_source() call. Won't it ? So I don't
> understand how it can be an issue. Am I missing something ?

It will skip the close_return_path_on_source() call in 
migration_completion(), but there is another 
close_return_path_on_source() call in migrate_fd_cleanup().

>
>> Fix it by ensuring migration error is set in case of error in
>> migration_completion().
>
> Why didn't you add a reference to commit 9425ef3f990a ?

I thought this commit didn't introduce this bug, but looking again in 
the mailing list [1], it kinda did:
The hang bug was fully fixed by commit 22b04245f0d ("migration: Join the 
return path thread before releasing to_dst_file") and then 9425ef3f990a 
re-introduced the bug, but only for migration_completion() case.
So, you are right, a fixes line with 9425ef3f990a should be added.

Thanks.

[1] https://lore.kernel.org/all/20240226203122.22894-1-farosas@suse.de/

>
>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/migration.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 9fe8fd2afd7..b73ae3a72c4 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2799,6 +2799,7 @@ static void migration_completion(MigrationState 
>> *s)
>>   {
>>       int ret = 0;
>>       int current_active_state = s->state;
>> +    Error *local_err = NULL;
>>
>>       if (s->state == MIGRATION_STATUS_ACTIVE) {
>>           ret = migration_completion_precopy(s, &current_active_state);
>> @@ -2832,6 +2833,15 @@ static void 
>> migration_completion(MigrationState *s)
>>       return;
>>
>>   fail:
>> +    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>> +        migrate_set_error(s, local_err);
>> +        error_free(local_err);
>> +    } else if (ret) {
>> +        error_setg_errno(&local_err, -ret, "Error in migration 
>> completion");
>
> The 'ret = -1' case could be improved with error_setg(). As a followup.
>
> Thanks,
>
> C.
>
>
>
>
>> +        migrate_set_error(s, local_err);
>> +        error_free(local_err);
>> +    }
>> +
>>       migration_completion_failed(s, current_active_state);
>>   }
>>
>
Avihai Horon March 28, 2024, 3:54 p.m. UTC | #4
On 28/03/2024 17:09, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, Mar 28, 2024 at 04:02:51PM +0200, Avihai Horon wrote:
>> After commit 9425ef3f990a ("migration: Use migrate_has_error() in
>> close_return_path_on_source()"), close_return_path_on_source() assumes
>> that migration error is set if an error occurs during migration.
>>
>> This may not be true if migration errors in migration_completion(). For
>> example, if qemu_savevm_state_complete_precopy() errors, migration error
>> will not be set.
>>
>> This in turn, will cause a migration hang bug, similar to the bug that
>> was fixed by commit 22b04245f0d5 ("migration: Join the return path
>> thread before releasing to_dst_file"), as shutdown() will not be issued
>> for the return-path channel.
>>
>> Fix it by ensuring migration error is set in case of error in
>> migration_completion().
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> I'll attach this if it looks all right to you:
>
> Fixes: 9425ef3f990a ("migration: Use migrate_has_error() in close_return_path_on_source()")

Yes, sure, go ahead.

Thanks.

>
> Thanks,
>
>> ---
>>   migration/migration.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 9fe8fd2afd7..b73ae3a72c4 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2799,6 +2799,7 @@ static void migration_completion(MigrationState *s)
>>   {
>>       int ret = 0;
>>       int current_active_state = s->state;
>> +    Error *local_err = NULL;
>>
>>       if (s->state == MIGRATION_STATUS_ACTIVE) {
>>           ret = migration_completion_precopy(s, &current_active_state);
>> @@ -2832,6 +2833,15 @@ static void migration_completion(MigrationState *s)
>>       return;
>>
>>   fail:
>> +    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>> +        migrate_set_error(s, local_err);
>> +        error_free(local_err);
>> +    } else if (ret) {
>> +        error_setg_errno(&local_err, -ret, "Error in migration completion");
>> +        migrate_set_error(s, local_err);
>> +        error_free(local_err);
>> +    }
>> +
>>       migration_completion_failed(s, current_active_state);
>>   }
>>
>> --
>> 2.26.3
>>
>>
> --
> Peter Xu
>
Cédric Le Goater March 28, 2024, 4:29 p.m. UTC | #5
On 3/28/24 16:50, Avihai Horon wrote:
> 
> On 28/03/2024 17:21, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Hello Avihai,
>>
>> On 3/28/24 15:02, Avihai Horon wrote:
>>> After commit 9425ef3f990a ("migration: Use migrate_has_error() in
>>> close_return_path_on_source()"), close_return_path_on_source() assumes
>>> that migration error is set if an error occurs during migration.
>>>
>>> This may not be true if migration errors in migration_completion(). For
>>> example, if qemu_savevm_state_complete_precopy() errors, migration error
>>> will not be set
>>
>> Out of curiosity, could you describe a bit more the context ? Did
>> vfio_save_complete_precopy() fail ? why ?
> 
> Yep, vfio_save_complete_precopy() failed (but it failed while I was experimenting with an unofficial debug FW).
> 
>>
>> We should propagate errors of .save_live_complete_precopy() handlers as
>> it was done .save_setup handlers(). For 9.1.
> 
> Agreed.
> 
>>
>>> This in turn, will cause a migration hang bug, similar to the bug that
>>> was fixed by commit 22b04245f0d5 ("migration: Join the return path
>>> thread before releasing to_dst_file"), as shutdown() will not be issued
>>> for the return-path channel.
>>
>> yes, but this test :
>>
>>     if (ret < 0) {
>>         goto fail;
>>     }
>>
>> will skip the close_return_path_on_source() call. Won't it ? So I don't
>> understand how it can be an issue. Am I missing something ?
> 
> It will skip the close_return_path_on_source() call in migration_completion(), but there is another close_return_path_on_source() call in migrate_fd_cleanup().

OK. Found it. This is a code path I hadn't explored yet.

Acked-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> 
>>
>>> Fix it by ensuring migration error is set in case of error in
>>> migration_completion().
>>
>> Why didn't you add a reference to commit 9425ef3f990a ?
> 
> I thought this commit didn't introduce this bug, but looking again in the mailing list [1], it kinda did:
> The hang bug was fully fixed by commit 22b04245f0d ("migration: Join the return path thread before releasing to_dst_file") and then 9425ef3f990a re-introduced the bug, but only for migration_completion() case.
> So, you are right, a fixes line with 9425ef3f990a should be added.
> 
> Thanks.
> 
> [1] https://lore.kernel.org/all/20240226203122.22894-1-farosas@suse.de/
> 
>>
>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   migration/migration.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 9fe8fd2afd7..b73ae3a72c4 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2799,6 +2799,7 @@ static void migration_completion(MigrationState *s)
>>>   {
>>>       int ret = 0;
>>>       int current_active_state = s->state;
>>> +    Error *local_err = NULL;
>>>
>>>       if (s->state == MIGRATION_STATUS_ACTIVE) {
>>>           ret = migration_completion_precopy(s, &current_active_state);
>>> @@ -2832,6 +2833,15 @@ static void migration_completion(MigrationState *s)
>>>       return;
>>>
>>>   fail:
>>> +    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>>> +        migrate_set_error(s, local_err);
>>> +        error_free(local_err);
>>> +    } else if (ret) {
>>> +        error_setg_errno(&local_err, -ret, "Error in migration completion");
>>
>> The 'ret = -1' case could be improved with error_setg(). As a followup.
>>
>> Thanks,
>>
>> C.
>>
>>
>>
>>
>>> +        migrate_set_error(s, local_err);
>>> +        error_free(local_err);
>>> +    }
>>> +
>>>       migration_completion_failed(s, current_active_state);
>>>   }
>>>
>>
>
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 9fe8fd2afd7..b73ae3a72c4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2799,6 +2799,7 @@  static void migration_completion(MigrationState *s)
 {
     int ret = 0;
     int current_active_state = s->state;
+    Error *local_err = NULL;
 
     if (s->state == MIGRATION_STATUS_ACTIVE) {
         ret = migration_completion_precopy(s, &current_active_state);
@@ -2832,6 +2833,15 @@  static void migration_completion(MigrationState *s)
     return;
 
 fail:
+    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
+        migrate_set_error(s, local_err);
+        error_free(local_err);
+    } else if (ret) {
+        error_setg_errno(&local_err, -ret, "Error in migration completion");
+        migrate_set_error(s, local_err);
+        error_free(local_err);
+    }
+
     migration_completion_failed(s, current_active_state);
 }