diff mbox series

[05/17] migration/multifd: Wait for multifd channels creation before proceeding

Message ID 20240125162528.7552-6-avihaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series migration: Add new migration channel connect and TLS upgrade APIs | expand

Commit Message

Avihai Horon Jan. 25, 2024, 4:25 p.m. UTC
Currently, multifd channels are created asynchronously without waiting
for their creation -- migration simply proceeds and may wait in
multifd_send_sync_main(), which is called by ram_save_setup(). This
hides in it some race conditions which can cause an unexpected behavior
if some channels creation fail.

For example, the following scenario of multifd migration with two
channels, where the first channel creation fails, will end in a
segmentation fault (time advances from top to bottom):

Thread           | Code execution
------------------------------------------------------------------------
Multifd 1        |
                 | multifd_new_send_channel_async (errors and quits)
                 |   multifd_new_send_channel_cleanup
                 |
Migration thread |
                 | qemu_savevm_state_setup
                 |   ram_save_setup
                 |     multifd_send_sync_main
                 |     (detects Multifd 1 error and quits)
                 | [...]
                 | migration_iteration_finish
                 |   migrate_fd_cleanup_schedule
                 |
Main thread      |
                 | migrate_fd_cleanup
                 |   multifd_save_cleanup (destroys Multifd 2 resources)
                 |
Multifd 2        |
                 | multifd_new_send_channel_async
                 | (accesses destroyed resources, segfault)

In another scenario, migration can hang indefinitely:
1. Main migration thread reaches multifd_send_sync_main() and waits on
   the semaphores.
2. Then, all multifd channels creation fails, so they post the
   semaphores and quit.
3. Main migration channel will not identify the error, proceed to send
   pages and will hang.

Fix it by waiting for all multifd channels to be created before
proceeding with migration.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/multifd.h   |  3 +++
 migration/migration.c |  1 +
 migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
 migration/ram.c       |  7 +++++++
 4 files changed, 42 insertions(+), 3 deletions(-)

Comments

Fabiano Rosas Jan. 29, 2024, 2:34 p.m. UTC | #1
Avihai Horon <avihaih@nvidia.com> writes:

> Currently, multifd channels are created asynchronously without waiting
> for their creation -- migration simply proceeds and may wait in
> multifd_send_sync_main(), which is called by ram_save_setup(). This
> hides in it some race conditions which can cause an unexpected behavior
> if some channels creation fail.
>
> For example, the following scenario of multifd migration with two
> channels, where the first channel creation fails, will end in a
> segmentation fault (time advances from top to bottom):

Is this reproducible? Or just observable at least.

I acknowledge the situation you describe, but with multifd there's
usually an issue in cleanup paths. Let's make sure we flushed those out
before adding this new semaphore.

This is similar to an issue Peter was addressing where we missed calling
multifd_send_termiante_threads() in the multifd_channel_connect() path:

patch 4 in this
https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com

>
> Thread           | Code execution
> ------------------------------------------------------------------------
> Multifd 1        |
>                  | multifd_new_send_channel_async (errors and quits)
>                  |   multifd_new_send_channel_cleanup
>                  |
> Migration thread |
>                  | qemu_savevm_state_setup
>                  |   ram_save_setup
>                  |     multifd_send_sync_main
>                  |     (detects Multifd 1 error and quits)
>                  | [...]
>                  | migration_iteration_finish
>                  |   migrate_fd_cleanup_schedule
>                  |
> Main thread      |
>                  | migrate_fd_cleanup
>                  |   multifd_save_cleanup (destroys Multifd 2 resources)
>                  |
> Multifd 2        |
>                  | multifd_new_send_channel_async
>                  | (accesses destroyed resources, segfault)
>
> In another scenario, migration can hang indefinitely:
> 1. Main migration thread reaches multifd_send_sync_main() and waits on
>    the semaphores.
> 2. Then, all multifd channels creation fails, so they post the
>    semaphores and quit.
> 3. Main migration channel will not identify the error, proceed to send
>    pages and will hang.
>
> Fix it by waiting for all multifd channels to be created before
> proceeding with migration.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  migration/multifd.h   |  3 +++
>  migration/migration.c |  1 +
>  migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>  migration/ram.c       |  7 +++++++
>  4 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 35d11f103c..87a64e0a87 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>  void multifd_recv_sync_main(void);
>  int multifd_send_sync_main(void);
>  int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> +int multifd_send_channels_created(void);
>  
>  /* Multifd Compression flags */
>  #define MULTIFD_FLAG_SYNC (1 << 0)
> @@ -86,6 +87,8 @@ typedef struct {
>      /* multifd flags for sending ram */
>      int write_flags;
>  
> +    /* Syncs channel creation and migration thread */
> +    QemuSemaphore create_sem;
>      /* sem where to wait for more work */
>      QemuSemaphore sem;
>      /* syncs main thread and channels */
> diff --git a/migration/migration.c b/migration/migration.c
> index 9c769a1ecd..d81d96eaa5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>          error_report_err(local_err);
>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                            MIGRATION_STATUS_FAILED);
> +        multifd_send_channels_created();
>          migrate_fd_cleanup(s);
>          return;
>      }
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 564e911b6c..f0c216f4f9 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>          multifd_send_channel_destroy(p->c);
>          p->c = NULL;
>          qemu_mutex_destroy(&p->mutex);
> +        qemu_sem_destroy(&p->create_sem);
>          qemu_sem_destroy(&p->sem);
>          qemu_sem_destroy(&p->sem_sync);
>          g_free(p->name);
> @@ -766,6 +767,29 @@ out:
>      return NULL;
>  }
>  
> +int multifd_send_channels_created(void)
> +{
> +    int ret = 0;
> +
> +    if (!migrate_multifd()) {
> +        return 0;
> +    }
> +
> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
> +        MultiFDSendParams *p = &multifd_send_state->params[i];
> +
> +        qemu_sem_wait(&p->create_sem);
> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
> +            if (p->quit) {
> +                error_report("%s: channel %d has already quit", __func__, i);
> +                ret = -1;
> +            }
> +        }

There are races here when a channel fails at
multifd_send_initial_packet(). If p->quit can be set after post to
create_sem, then this function could always return true and we'd run
into a broken channel. Possibly even the same bug you're trying to fix.

I think that's one of the reasons we have channels_ready.

> +    }
> +
> +    return ret;
> +}
> +
>  static bool multifd_channel_connect(MultiFDSendParams *p,
>                                      QIOChannel *ioc,
>                                      Error **errp);
> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>      p->quit = true;
>      qemu_sem_post(&multifd_send_state->channels_ready);
>      qemu_sem_post(&p->sem_sync);
> +    qemu_sem_post(&p->create_sem);
>      error_free(err);
>  }
>  
> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>      qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>                         QEMU_THREAD_JOINABLE);
>      p->running = true;
> +    qemu_sem_post(&p->create_sem);
>      return true;
>  }
>  
> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>                                               QIOChannel *ioc, Error *err)
>  {
>       migrate_set_error(migrate_get_current(), err);
> -     /* Error happen, we need to tell who pay attention to me */
> -     qemu_sem_post(&multifd_send_state->channels_ready);
> -     qemu_sem_post(&p->sem_sync);
>       /*
> +      * Error happen, we need to tell who pay attention to me.
>        * Although multifd_send_thread is not created, but main migration
>        * thread need to judge whether it is running, so we need to mark
>        * its status.
>        */
>       p->quit = true;
> +     qemu_sem_post(&multifd_send_state->channels_ready);
> +     qemu_sem_post(&p->sem_sync);
> +     qemu_sem_post(&p->create_sem);

Do we still need channels_ready and sem_sync here? The migration thread
shouldn't have gone past create_sem at this point.

>       object_unref(OBJECT(ioc));
>       error_free(err);
>  }
> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          qemu_mutex_init(&p->mutex);
> +        qemu_sem_init(&p->create_sem, 0);
>          qemu_sem_init(&p->sem, 0);
>          qemu_sem_init(&p->sem_sync, 0);
>          p->quit = false;
> diff --git a/migration/ram.c b/migration/ram.c
> index c0cdcccb75..b3e864a22b 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      RAMBlock *block;
>      int ret;
>  
> +    bql_unlock();
> +    ret = multifd_send_channels_created();
> +    bql_lock();
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
>      if (compress_threads_save_setup()) {
>          return -1;
>      }
Avihai Horon Jan. 30, 2024, 6:32 p.m. UTC | #2
On 29/01/2024 16:34, Fabiano Rosas wrote:
> External email: Use caution opening links or attachments
>
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> Currently, multifd channels are created asynchronously without waiting
>> for their creation -- migration simply proceeds and may wait in
>> multifd_send_sync_main(), which is called by ram_save_setup(). This
>> hides in it some race conditions which can cause an unexpected behavior
>> if some channels creation fail.
>>
>> For example, the following scenario of multifd migration with two
>> channels, where the first channel creation fails, will end in a
>> segmentation fault (time advances from top to bottom):
> Is this reproducible? Or just observable at least.

Yes, though I had to engineer it a bit:
1. Run migration with two multifd channels and fail creation of the two 
channels (e.g., by changing the address they are connecting to).
2. Add sleep(3) in multifd_send_sync_main() before we loop through the 
channels and check p->quit.
3. Add sleep(5) only for the second multifd channel connect thread so 
its connection is delayed and runs last.

> I acknowledge the situation you describe, but with multifd there's
> usually an issue in cleanup paths. Let's make sure we flushed those out
> before adding this new semaphore.

Indeed, I was not keen on adding yet another semaphore either.
I think there are multiple bugs here, some of them overlap and some don't.
There is also your and Peter's previous work that I was not aware of to 
fix those and to clean up the code.

Maybe we can take it one step at a time, pushing your series first, 
cleaning the code and fixing some bugs.
Then we can see what bugs are left (if any) and fix them. It might even 
be easier to fix after the cleanups.

> This is similar to an issue Peter was addressing where we missed calling
> multifd_send_termiante_threads() in the multifd_channel_connect() path:
>
> patch 4 in this
> https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com

What issue are you referring here? Can you elaborate?

The main issue I am trying to fix in my patch is that we don't wait for 
all multifd channels to be created/error out before tearing down
multifd resources in mulitfd_save_cleanup().

>> Thread           | Code execution
>> ------------------------------------------------------------------------
>> Multifd 1        |
>>                   | multifd_new_send_channel_async (errors and quits)
>>                   |   multifd_new_send_channel_cleanup
>>                   |
>> Migration thread |
>>                   | qemu_savevm_state_setup
>>                   |   ram_save_setup
>>                   |     multifd_send_sync_main
>>                   |     (detects Multifd 1 error and quits)
>>                   | [...]
>>                   | migration_iteration_finish
>>                   |   migrate_fd_cleanup_schedule
>>                   |
>> Main thread      |
>>                   | migrate_fd_cleanup
>>                   |   multifd_save_cleanup (destroys Multifd 2 resources)
>>                   |
>> Multifd 2        |
>>                   | multifd_new_send_channel_async
>>                   | (accesses destroyed resources, segfault)
>>
>> In another scenario, migration can hang indefinitely:
>> 1. Main migration thread reaches multifd_send_sync_main() and waits on
>>     the semaphores.
>> 2. Then, all multifd channels creation fails, so they post the
>>     semaphores and quit.
>> 3. Main migration channel will not identify the error, proceed to send
>>     pages and will hang.
>>
>> Fix it by waiting for all multifd channels to be created before
>> proceeding with migration.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/multifd.h   |  3 +++
>>   migration/migration.c |  1 +
>>   migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>>   migration/ram.c       |  7 +++++++
>>   4 files changed, 42 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index 35d11f103c..87a64e0a87 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>>   void multifd_recv_sync_main(void);
>>   int multifd_send_sync_main(void);
>>   int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>> +int multifd_send_channels_created(void);
>>
>>   /* Multifd Compression flags */
>>   #define MULTIFD_FLAG_SYNC (1 << 0)
>> @@ -86,6 +87,8 @@ typedef struct {
>>       /* multifd flags for sending ram */
>>       int write_flags;
>>
>> +    /* Syncs channel creation and migration thread */
>> +    QemuSemaphore create_sem;
>>       /* sem where to wait for more work */
>>       QemuSemaphore sem;
>>       /* syncs main thread and channels */
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 9c769a1ecd..d81d96eaa5 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>           error_report_err(local_err);
>>           migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>                             MIGRATION_STATUS_FAILED);
>> +        multifd_send_channels_created();
>>           migrate_fd_cleanup(s);
>>           return;
>>       }
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 564e911b6c..f0c216f4f9 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>           multifd_send_channel_destroy(p->c);
>>           p->c = NULL;
>>           qemu_mutex_destroy(&p->mutex);
>> +        qemu_sem_destroy(&p->create_sem);
>>           qemu_sem_destroy(&p->sem);
>>           qemu_sem_destroy(&p->sem_sync);
>>           g_free(p->name);
>> @@ -766,6 +767,29 @@ out:
>>       return NULL;
>>   }
>>
>> +int multifd_send_channels_created(void)
>> +{
>> +    int ret = 0;
>> +
>> +    if (!migrate_multifd()) {
>> +        return 0;
>> +    }
>> +
>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
>> +
>> +        qemu_sem_wait(&p->create_sem);
>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
>> +            if (p->quit) {
>> +                error_report("%s: channel %d has already quit", __func__, i);
>> +                ret = -1;
>> +            }
>> +        }
> There are races here when a channel fails at
> multifd_send_initial_packet(). If p->quit can be set after post to
> create_sem, then this function could always return true and we'd run
> into a broken channel. Possibly even the same bug you're trying to fix.
>
> I think that's one of the reasons we have channels_ready.

I am not sure exactly what bug you are describing here, can you elaborate?

>
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static bool multifd_channel_connect(MultiFDSendParams *p,
>>                                       QIOChannel *ioc,
>>                                       Error **errp);
>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>       p->quit = true;
>>       qemu_sem_post(&multifd_send_state->channels_ready);
>>       qemu_sem_post(&p->sem_sync);
>> +    qemu_sem_post(&p->create_sem);
>>       error_free(err);
>>   }
>>
>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>>       qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>>                          QEMU_THREAD_JOINABLE);
>>       p->running = true;
>> +    qemu_sem_post(&p->create_sem);
>>       return true;
>>   }
>>
>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>                                                QIOChannel *ioc, Error *err)
>>   {
>>        migrate_set_error(migrate_get_current(), err);
>> -     /* Error happen, we need to tell who pay attention to me */
>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>> -     qemu_sem_post(&p->sem_sync);
>>        /*
>> +      * Error happen, we need to tell who pay attention to me.
>>         * Although multifd_send_thread is not created, but main migration
>>         * thread need to judge whether it is running, so we need to mark
>>         * its status.
>>         */
>>        p->quit = true;
>> +     qemu_sem_post(&multifd_send_state->channels_ready);
>> +     qemu_sem_post(&p->sem_sync);
>> +     qemu_sem_post(&p->create_sem);
> Do we still need channels_ready and sem_sync here? The migration thread
> shouldn't have gone past create_sem at this point.

I think you are right, we can drop channels_ready and sem_sync here.

>
>>        object_unref(OBJECT(ioc));
>>        error_free(err);
>>   }
>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>>           MultiFDSendParams *p = &multifd_send_state->params[i];
>>
>>           qemu_mutex_init(&p->mutex);
>> +        qemu_sem_init(&p->create_sem, 0);
>>           qemu_sem_init(&p->sem, 0);
>>           qemu_sem_init(&p->sem_sync, 0);
>>           p->quit = false;
>> diff --git a/migration/ram.c b/migration/ram.c
>> index c0cdcccb75..b3e864a22b 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>       RAMBlock *block;
>>       int ret;
>>
>> +    bql_unlock();
>> +    ret = multifd_send_channels_created();
>> +    bql_lock();
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>>       if (compress_threads_save_setup()) {
>>           return -1;
>>       }
Fabiano Rosas Jan. 30, 2024, 9:32 p.m. UTC | #3
Avihai Horon <avihaih@nvidia.com> writes:

> On 29/01/2024 16:34, Fabiano Rosas wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Avihai Horon <avihaih@nvidia.com> writes:
>>
>>> Currently, multifd channels are created asynchronously without waiting
>>> for their creation -- migration simply proceeds and may wait in
>>> multifd_send_sync_main(), which is called by ram_save_setup(). This
>>> hides in it some race conditions which can cause an unexpected behavior
>>> if some channels creation fail.
>>>
>>> For example, the following scenario of multifd migration with two
>>> channels, where the first channel creation fails, will end in a
>>> segmentation fault (time advances from top to bottom):
>> Is this reproducible? Or just observable at least.
>
> Yes, though I had to engineer it a bit:
> 1. Run migration with two multifd channels and fail creation of the two 
> channels (e.g., by changing the address they are connecting to).
> 2. Add sleep(3) in multifd_send_sync_main() before we loop through the 
> channels and check p->quit.
> 3. Add sleep(5) only for the second multifd channel connect thread so 
> its connection is delayed and runs last.

Ok, well, that's something at least. I'll try to reproduce it so we can
keep track of it.

>> I acknowledge the situation you describe, but with multifd there's
>> usually an issue in cleanup paths. Let's make sure we flushed those out
>> before adding this new semaphore.
>
> Indeed, I was not keen on adding yet another semaphore either.
> I think there are multiple bugs here, some of them overlap and some don't.
> There is also your and Peter's previous work that I was not aware of to 
> fix those and to clean up the code.
>
> Maybe we can take it one step at a time, pushing your series first, 
> cleaning the code and fixing some bugs.
> Then we can see what bugs are left (if any) and fix them. It might even 
> be easier to fix after the cleanups.
>
>> This is similar to an issue Peter was addressing where we missed calling
>> multifd_send_termiante_threads() in the multifd_channel_connect() path:
>>
>> patch 4 in this
>> https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
>
> What issue are you referring here? Can you elaborate?

Oh, I just realised that series doesn't address any particular bug. But
my point is that including a call to multifd_send_terminate_threads() at
new_send_channel_cleanup might be all that's needed because that has
code to cause the channels and the migration thread to end.

> The main issue I am trying to fix in my patch is that we don't wait for 
> all multifd channels to be created/error out before tearing down
> multifd resources in mulitfd_save_cleanup().

Ok, let me take a step back and ask why is this not solved by
multifd_save_cleanup() -> qemu_thread_join()? I see you moved
p->running=true to *after* the thread creation in patch 4. That will
always leave a gap where p->running == false but the thread is already
running.

>
>>> Thread           | Code execution
>>> ------------------------------------------------------------------------
>>> Multifd 1        |
>>>                   | multifd_new_send_channel_async (errors and quits)
>>>                   |   multifd_new_send_channel_cleanup
>>>                   |
>>> Migration thread |
>>>                   | qemu_savevm_state_setup
>>>                   |   ram_save_setup
>>>                   |     multifd_send_sync_main
>>>                   |     (detects Multifd 1 error and quits)
>>>                   | [...]
>>>                   | migration_iteration_finish
>>>                   |   migrate_fd_cleanup_schedule
>>>                   |
>>> Main thread      |
>>>                   | migrate_fd_cleanup
>>>                   |   multifd_save_cleanup (destroys Multifd 2 resources)
>>>                   |
>>> Multifd 2        |
>>>                   | multifd_new_send_channel_async
>>>                   | (accesses destroyed resources, segfault)
>>>
>>> In another scenario, migration can hang indefinitely:
>>> 1. Main migration thread reaches multifd_send_sync_main() and waits on
>>>     the semaphores.
>>> 2. Then, all multifd channels creation fails, so they post the
>>>     semaphores and quit.
>>> 3. Main migration channel will not identify the error, proceed to send
>>>     pages and will hang.
>>>
>>> Fix it by waiting for all multifd channels to be created before
>>> proceeding with migration.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   migration/multifd.h   |  3 +++
>>>   migration/migration.c |  1 +
>>>   migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>>>   migration/ram.c       |  7 +++++++
>>>   4 files changed, 42 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>> index 35d11f103c..87a64e0a87 100644
>>> --- a/migration/multifd.h
>>> +++ b/migration/multifd.h
>>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>>>   void multifd_recv_sync_main(void);
>>>   int multifd_send_sync_main(void);
>>>   int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>>> +int multifd_send_channels_created(void);
>>>
>>>   /* Multifd Compression flags */
>>>   #define MULTIFD_FLAG_SYNC (1 << 0)
>>> @@ -86,6 +87,8 @@ typedef struct {
>>>       /* multifd flags for sending ram */
>>>       int write_flags;
>>>
>>> +    /* Syncs channel creation and migration thread */
>>> +    QemuSemaphore create_sem;
>>>       /* sem where to wait for more work */
>>>       QemuSemaphore sem;
>>>       /* syncs main thread and channels */
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 9c769a1ecd..d81d96eaa5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>           error_report_err(local_err);
>>>           migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>>                             MIGRATION_STATUS_FAILED);
>>> +        multifd_send_channels_created();
>>>           migrate_fd_cleanup(s);
>>>           return;
>>>       }
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index 564e911b6c..f0c216f4f9 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>>           multifd_send_channel_destroy(p->c);
>>>           p->c = NULL;
>>>           qemu_mutex_destroy(&p->mutex);
>>> +        qemu_sem_destroy(&p->create_sem);
>>>           qemu_sem_destroy(&p->sem);
>>>           qemu_sem_destroy(&p->sem_sync);
>>>           g_free(p->name);
>>> @@ -766,6 +767,29 @@ out:
>>>       return NULL;
>>>   }
>>>
>>> +int multifd_send_channels_created(void)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    if (!migrate_multifd()) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
>>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
>>> +
>>> +        qemu_sem_wait(&p->create_sem);
>>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
>>> +            if (p->quit) {
>>> +                error_report("%s: channel %d has already quit", __func__, i);
>>> +                ret = -1;
>>> +            }
>>> +        }
>> There are races here when a channel fails at
>> multifd_send_initial_packet(). If p->quit can be set after post to
>> create_sem, then this function could always return true and we'd run
>> into a broken channel. Possibly even the same bug you're trying to fix.
>>
>> I think that's one of the reasons we have channels_ready.
>
> I am not sure exactly what bug you are describing here, can you elaborate?
>

This looks like it could be a preexisting issue actually, but in the
current code, what stops the multifd channel from running ahead is
p->sem. Except that the channel does some work at
multifd_send_initial_packet() before checking p->sem and that work could
fail.

This means that right after checking for p->quit above, the multifd
thread could already be exiting due to an error and
multifd_send_channels_created() would miss it. So there's still a chance
that this function effectively behaves just like the previous code.

Thread           | Code execution
------------------------------------------------------------------------
Migration        |
                 | ram_save_setup()
                 |   multifd_send_channels_created()
                 |     qemu_sem_wait(&p->create_sem);
Main             |
                 | multifd_channel_connect()
                 |   qemu_thread_create()
                 |   qemu_sem_post(&p->create_sem)
Multifd 1        |
                 | multifd_send_initial_packet() *errors*
                 | goto out
                 | multifd_send_terminate_threads()
Migration        |
                 | still at multifd_send_channels_created
                 |     qemu_mutex_lock(&p->mutex);
                 |     p->quit == false      <--- !!!
                 |     qemu_mutex_unlock(&p->mutex);
                 | return true from multifd_send_channels_created()

From here onwards, it's the same as not having checked
multifd_send_channels_created() at all. One way this could go is:

                 | runs unimpeded until multifd_send_sync_main()
                 | multifd_send_pages()
                 | *misses the exiting flag*
                 | qemu_sem_wait(&multifd_send_state->channels_ready);
Multifd 1        |
                 | still at multifd_send_terminate_threads
                 |   qemu_mutex_lock(&p->mutex);
                 |   p->quit = true;
                 |   qemu_mutex_unlock(&p->mutex);
                 | qemu_sem_post(&p->sem_sync);
                 | qemu_sem_post(&multifd_send_state->channels_ready);
Migration        |
                 | qemu_mutex_lock(&p->mutex);
                 | p->quit == true;  <--- correct now
                 | qemu_mutex_unlock(&p->mutex);
                 | return -1;
                 | *all good again*

It seems that in order to avoid this kind of race we'd need the
synchronization point to be at the multifd thread instead. But then,
that's what channels_ready does.

>>
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   static bool multifd_channel_connect(MultiFDSendParams *p,
>>>                                       QIOChannel *ioc,
>>>                                       Error **errp);
>>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>       p->quit = true;
>>>       qemu_sem_post(&multifd_send_state->channels_ready);
>>>       qemu_sem_post(&p->sem_sync);
>>> +    qemu_sem_post(&p->create_sem);
>>>       error_free(err);
>>>   }
>>>
>>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>>>       qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>>>                          QEMU_THREAD_JOINABLE);
>>>       p->running = true;
>>> +    qemu_sem_post(&p->create_sem);
>>>       return true;
>>>   }
>>>
>>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>                                                QIOChannel *ioc, Error *err)
>>>   {
>>>        migrate_set_error(migrate_get_current(), err);
>>> -     /* Error happen, we need to tell who pay attention to me */
>>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>>> -     qemu_sem_post(&p->sem_sync);
>>>        /*
>>> +      * Error happen, we need to tell who pay attention to me.
>>>         * Although multifd_send_thread is not created, but main migration
>>>         * thread need to judge whether it is running, so we need to mark
>>>         * its status.
>>>         */
>>>        p->quit = true;
>>> +     qemu_sem_post(&multifd_send_state->channels_ready);
>>> +     qemu_sem_post(&p->sem_sync);
>>> +     qemu_sem_post(&p->create_sem);
>> Do we still need channels_ready and sem_sync here? The migration thread
>> shouldn't have gone past create_sem at this point.
>
> I think you are right, we can drop channels_ready and sem_sync here.
>
>>
>>>        object_unref(OBJECT(ioc));
>>>        error_free(err);
>>>   }
>>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>>>           MultiFDSendParams *p = &multifd_send_state->params[i];
>>>
>>>           qemu_mutex_init(&p->mutex);
>>> +        qemu_sem_init(&p->create_sem, 0);
>>>           qemu_sem_init(&p->sem, 0);
>>>           qemu_sem_init(&p->sem_sync, 0);
>>>           p->quit = false;
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index c0cdcccb75..b3e864a22b 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>       RAMBlock *block;
>>>       int ret;
>>>
>>> +    bql_unlock();
>>> +    ret = multifd_send_channels_created();
>>> +    bql_lock();
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>>       if (compress_threads_save_setup()) {
>>>           return -1;
>>>       }
Peter Xu Jan. 31, 2024, 4:49 a.m. UTC | #4
On Tue, Jan 30, 2024 at 06:32:21PM -0300, Fabiano Rosas wrote:
> Avihai Horon <avihaih@nvidia.com> writes:
> 
> > On 29/01/2024 16:34, Fabiano Rosas wrote:
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> Avihai Horon <avihaih@nvidia.com> writes:
> >>
> >>> Currently, multifd channels are created asynchronously without waiting
> >>> for their creation -- migration simply proceeds and may wait in
> >>> multifd_send_sync_main(), which is called by ram_save_setup(). This
> >>> hides in it some race conditions which can cause an unexpected behavior
> >>> if some channels creation fail.
> >>>
> >>> For example, the following scenario of multifd migration with two
> >>> channels, where the first channel creation fails, will end in a
> >>> segmentation fault (time advances from top to bottom):
> >> Is this reproducible? Or just observable at least.
> >
> > Yes, though I had to engineer it a bit:
> > 1. Run migration with two multifd channels and fail creation of the two 
> > channels (e.g., by changing the address they are connecting to).
> > 2. Add sleep(3) in multifd_send_sync_main() before we loop through the 
> > channels and check p->quit.
> > 3. Add sleep(5) only for the second multifd channel connect thread so 
> > its connection is delayed and runs last.
> 
> Ok, well, that's something at least. I'll try to reproduce it so we can
> keep track of it.
> 
> >> I acknowledge the situation you describe, but with multifd there's
> >> usually an issue in cleanup paths. Let's make sure we flushed those out
> >> before adding this new semaphore.
> >
> > Indeed, I was not keen on adding yet another semaphore either.
> > I think there are multiple bugs here, some of them overlap and some don't.
> > There is also your and Peter's previous work that I was not aware of to 
> > fix those and to clean up the code.
> >
> > Maybe we can take it one step at a time, pushing your series first, 
> > cleaning the code and fixing some bugs.
> > Then we can see what bugs are left (if any) and fix them. It might even 
> > be easier to fix after the cleanups.
> >
> >> This is similar to an issue Peter was addressing where we missed calling
> >> multifd_send_termiante_threads() in the multifd_channel_connect() path:
> >>
> >> patch 4 in this
> >> https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
> >
> > What issue are you referring here? Can you elaborate?
> 
> Oh, I just realised that series doesn't address any particular bug. But
> my point is that including a call to multifd_send_terminate_threads() at
> new_send_channel_cleanup might be all that's needed because that has
> code to cause the channels and the migration thread to end.

It seems so to me.

One other issue is I hope we can get rid of p->quit before adding more code
to operate on it.

I'll see whether I can respin that series this week soon.  I'll consider
dropping the last ones, but pick the initial ones that may already help.  I
just noticed patch 2 is already merged with Avihai's similar patch;
obviously I completely forgot that series..

> 
> > The main issue I am trying to fix in my patch is that we don't wait for 
> > all multifd channels to be created/error out before tearing down
> > multifd resources in mulitfd_save_cleanup().
> 
> Ok, let me take a step back and ask why is this not solved by
> multifd_save_cleanup() -> qemu_thread_join()? I see you moved
> p->running=true to *after* the thread creation in patch 4. That will
> always leave a gap where p->running == false but the thread is already
> running.

The whole threading in multifd currently is just IMHO a mess.  We keep
creating threads but never cared on how that goes, and how to sync with the
threads.

I do have plan to track every TID that migration creates (including the
ones of qio tasks, maybe?), then we can always manage the threads, and
making sure all the threads will be freed when migrate_fd_cleanup()
finishes, by join()ing each of them.  I suppose we may also need things
like pthread_cancel(), consider when any thread got blocked somewhere but
the admin invoked a "migrate-cancel" request.

With any dangling thread being there, we always face weird risks: either
some migration code will be scheduled even after migration failed (like
this one), or it could be worse if that thread only got scheduled until the
2nd migration started after the 1st one cancelled - we need to be prepared
to see some extra threads running, having no idea where did they come from,
and those bugs will be hard to debug.

I haven't yet started looking into that, it'll be good if anyone would like
to explore that direction for a full resolution on multifd threadings.

> 
> >
> >>> Thread           | Code execution
> >>> ------------------------------------------------------------------------
> >>> Multifd 1        |
> >>>                   | multifd_new_send_channel_async (errors and quits)
> >>>                   |   multifd_new_send_channel_cleanup
> >>>                   |
> >>> Migration thread |
> >>>                   | qemu_savevm_state_setup
> >>>                   |   ram_save_setup
> >>>                   |     multifd_send_sync_main
> >>>                   |     (detects Multifd 1 error and quits)
> >>>                   | [...]
> >>>                   | migration_iteration_finish
> >>>                   |   migrate_fd_cleanup_schedule
> >>>                   |
> >>> Main thread      |
> >>>                   | migrate_fd_cleanup
> >>>                   |   multifd_save_cleanup (destroys Multifd 2 resources)
> >>>                   |
> >>> Multifd 2        |
> >>>                   | multifd_new_send_channel_async
> >>>                   | (accesses destroyed resources, segfault)
> >>>
> >>> In another scenario, migration can hang indefinitely:
> >>> 1. Main migration thread reaches multifd_send_sync_main() and waits on
> >>>     the semaphores.
> >>> 2. Then, all multifd channels creation fails, so they post the
> >>>     semaphores and quit.
> >>> 3. Main migration channel will not identify the error, proceed to send
> >>>     pages and will hang.
> >>>
> >>> Fix it by waiting for all multifd channels to be created before
> >>> proceeding with migration.
> >>>
> >>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >>> ---
> >>>   migration/multifd.h   |  3 +++
> >>>   migration/migration.c |  1 +
> >>>   migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
> >>>   migration/ram.c       |  7 +++++++
> >>>   4 files changed, 42 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/migration/multifd.h b/migration/multifd.h
> >>> index 35d11f103c..87a64e0a87 100644
> >>> --- a/migration/multifd.h
> >>> +++ b/migration/multifd.h
> >>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> >>>   void multifd_recv_sync_main(void);
> >>>   int multifd_send_sync_main(void);
> >>>   int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> >>> +int multifd_send_channels_created(void);
> >>>
> >>>   /* Multifd Compression flags */
> >>>   #define MULTIFD_FLAG_SYNC (1 << 0)
> >>> @@ -86,6 +87,8 @@ typedef struct {
> >>>       /* multifd flags for sending ram */
> >>>       int write_flags;
> >>>
> >>> +    /* Syncs channel creation and migration thread */
> >>> +    QemuSemaphore create_sem;
> >>>       /* sem where to wait for more work */
> >>>       QemuSemaphore sem;
> >>>       /* syncs main thread and channels */
> >>> diff --git a/migration/migration.c b/migration/migration.c
> >>> index 9c769a1ecd..d81d96eaa5 100644
> >>> --- a/migration/migration.c
> >>> +++ b/migration/migration.c
> >>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
> >>>           error_report_err(local_err);
> >>>           migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> >>>                             MIGRATION_STATUS_FAILED);
> >>> +        multifd_send_channels_created();
> >>>           migrate_fd_cleanup(s);
> >>>           return;
> >>>       }
> >>> diff --git a/migration/multifd.c b/migration/multifd.c
> >>> index 564e911b6c..f0c216f4f9 100644
> >>> --- a/migration/multifd.c
> >>> +++ b/migration/multifd.c
> >>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
> >>>           multifd_send_channel_destroy(p->c);
> >>>           p->c = NULL;
> >>>           qemu_mutex_destroy(&p->mutex);
> >>> +        qemu_sem_destroy(&p->create_sem);
> >>>           qemu_sem_destroy(&p->sem);
> >>>           qemu_sem_destroy(&p->sem_sync);
> >>>           g_free(p->name);
> >>> @@ -766,6 +767,29 @@ out:
> >>>       return NULL;
> >>>   }
> >>>
> >>> +int multifd_send_channels_created(void)
> >>> +{
> >>> +    int ret = 0;
> >>> +
> >>> +    if (!migrate_multifd()) {
> >>> +        return 0;
> >>> +    }
> >>> +
> >>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
> >>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
> >>> +
> >>> +        qemu_sem_wait(&p->create_sem);
> >>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
> >>> +            if (p->quit) {
> >>> +                error_report("%s: channel %d has already quit", __func__, i);
> >>> +                ret = -1;
> >>> +            }
> >>> +        }
> >> There are races here when a channel fails at
> >> multifd_send_initial_packet(). If p->quit can be set after post to
> >> create_sem, then this function could always return true and we'd run
> >> into a broken channel. Possibly even the same bug you're trying to fix.
> >>
> >> I think that's one of the reasons we have channels_ready.
> >
> > I am not sure exactly what bug you are describing here, can you elaborate?
> >
> 
> This looks like it could be a preexisting issue actually, but in the
> current code, what stops the multifd channel from running ahead is
> p->sem. Except that the channel does some work at
> multifd_send_initial_packet() before checking p->sem and that work could
> fail.
> 
> This means that right after checking for p->quit above, the multifd
> thread could already be exiting due to an error and
> multifd_send_channels_created() would miss it. So there's still a chance
> that this function effectively behaves just like the previous code.
> 
> Thread           | Code execution
> ------------------------------------------------------------------------
> Migration        |
>                  | ram_save_setup()
>                  |   multifd_send_channels_created()
>                  |     qemu_sem_wait(&p->create_sem);
> Main             |
>                  | multifd_channel_connect()
>                  |   qemu_thread_create()
>                  |   qemu_sem_post(&p->create_sem)
> Multifd 1        |
>                  | multifd_send_initial_packet() *errors*
>                  | goto out
>                  | multifd_send_terminate_threads()
> Migration        |
>                  | still at multifd_send_channels_created
>                  |     qemu_mutex_lock(&p->mutex);
>                  |     p->quit == false      <--- !!!
>                  |     qemu_mutex_unlock(&p->mutex);
>                  | return true from multifd_send_channels_created()
> 
> From here onwards, it's the same as not having checked
> multifd_send_channels_created() at all. One way this could go is:
> 
>                  | runs unimpeded until multifd_send_sync_main()
>                  | multifd_send_pages()
>                  | *misses the exiting flag*
>                  | qemu_sem_wait(&multifd_send_state->channels_ready);
> Multifd 1        |
>                  | still at multifd_send_terminate_threads
>                  |   qemu_mutex_lock(&p->mutex);
>                  |   p->quit = true;
>                  |   qemu_mutex_unlock(&p->mutex);
>                  | qemu_sem_post(&p->sem_sync);
>                  | qemu_sem_post(&multifd_send_state->channels_ready);
> Migration        |
>                  | qemu_mutex_lock(&p->mutex);
>                  | p->quit == true;  <--- correct now
>                  | qemu_mutex_unlock(&p->mutex);
>                  | return -1;
>                  | *all good again*
> 
> It seems that in order to avoid this kind of race we'd need the
> synchronization point to be at the multifd thread instead. But then,
> that's what channels_ready does.
> 
> >>
> >>> +    }
> >>> +
> >>> +    return ret;
> >>> +}
> >>> +
> >>>   static bool multifd_channel_connect(MultiFDSendParams *p,
> >>>                                       QIOChannel *ioc,
> >>>                                       Error **errp);
> >>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
> >>>       p->quit = true;
> >>>       qemu_sem_post(&multifd_send_state->channels_ready);
> >>>       qemu_sem_post(&p->sem_sync);
> >>> +    qemu_sem_post(&p->create_sem);
> >>>       error_free(err);
> >>>   }
> >>>
> >>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
> >>>       qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
> >>>                          QEMU_THREAD_JOINABLE);
> >>>       p->running = true;
> >>> +    qemu_sem_post(&p->create_sem);
> >>>       return true;
> >>>   }
> >>>
> >>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
> >>>                                                QIOChannel *ioc, Error *err)
> >>>   {
> >>>        migrate_set_error(migrate_get_current(), err);
> >>> -     /* Error happen, we need to tell who pay attention to me */
> >>> -     qemu_sem_post(&multifd_send_state->channels_ready);
> >>> -     qemu_sem_post(&p->sem_sync);
> >>>        /*
> >>> +      * Error happen, we need to tell who pay attention to me.
> >>>         * Although multifd_send_thread is not created, but main migration
> >>>         * thread need to judge whether it is running, so we need to mark
> >>>         * its status.
> >>>         */
> >>>        p->quit = true;
> >>> +     qemu_sem_post(&multifd_send_state->channels_ready);
> >>> +     qemu_sem_post(&p->sem_sync);
> >>> +     qemu_sem_post(&p->create_sem);
> >> Do we still need channels_ready and sem_sync here? The migration thread
> >> shouldn't have gone past create_sem at this point.
> >
> > I think you are right, we can drop channels_ready and sem_sync here.
> >
> >>
> >>>        object_unref(OBJECT(ioc));
> >>>        error_free(err);
> >>>   }
> >>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
> >>>           MultiFDSendParams *p = &multifd_send_state->params[i];
> >>>
> >>>           qemu_mutex_init(&p->mutex);
> >>> +        qemu_sem_init(&p->create_sem, 0);
> >>>           qemu_sem_init(&p->sem, 0);
> >>>           qemu_sem_init(&p->sem_sync, 0);
> >>>           p->quit = false;
> >>> diff --git a/migration/ram.c b/migration/ram.c
> >>> index c0cdcccb75..b3e864a22b 100644
> >>> --- a/migration/ram.c
> >>> +++ b/migration/ram.c
> >>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>>       RAMBlock *block;
> >>>       int ret;
> >>>
> >>> +    bql_unlock();
> >>> +    ret = multifd_send_channels_created();
> >>> +    bql_lock();
> >>> +    if (ret < 0) {
> >>> +        return ret;
> >>> +    }
> >>> +
> >>>       if (compress_threads_save_setup()) {
> >>>           return -1;
> >>>       }
>
Avihai Horon Jan. 31, 2024, 10:39 a.m. UTC | #5
On 30/01/2024 23:32, Fabiano Rosas wrote:
> External email: Use caution opening links or attachments
>
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> On 29/01/2024 16:34, Fabiano Rosas wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Avihai Horon <avihaih@nvidia.com> writes:
>>>
>>>> Currently, multifd channels are created asynchronously without waiting
>>>> for their creation -- migration simply proceeds and may wait in
>>>> multifd_send_sync_main(), which is called by ram_save_setup(). This
>>>> hides in it some race conditions which can cause an unexpected behavior
>>>> if some channels creation fail.
>>>>
>>>> For example, the following scenario of multifd migration with two
>>>> channels, where the first channel creation fails, will end in a
>>>> segmentation fault (time advances from top to bottom):
>>> Is this reproducible? Or just observable at least.
>> Yes, though I had to engineer it a bit:
>> 1. Run migration with two multifd channels and fail creation of the two
>> channels (e.g., by changing the address they are connecting to).
>> 2. Add sleep(3) in multifd_send_sync_main() before we loop through the
>> channels and check p->quit.
>> 3. Add sleep(5) only for the second multifd channel connect thread so
>> its connection is delayed and runs last.
> Ok, well, that's something at least. I'll try to reproduce it so we can
> keep track of it.
>
>>> I acknowledge the situation you describe, but with multifd there's
>>> usually an issue in cleanup paths. Let's make sure we flushed those out
>>> before adding this new semaphore.
>> Indeed, I was not keen on adding yet another semaphore either.
>> I think there are multiple bugs here, some of them overlap and some don't.
>> There is also your and Peter's previous work that I was not aware of to
>> fix those and to clean up the code.
>>
>> Maybe we can take it one step at a time, pushing your series first,
>> cleaning the code and fixing some bugs.
>> Then we can see what bugs are left (if any) and fix them. It might even
>> be easier to fix after the cleanups.
>>
>>> This is similar to an issue Peter was addressing where we missed calling
>>> multifd_send_termiante_threads() in the multifd_channel_connect() path:
>>>
>>> patch 4 in this
>>> https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
>> What issue are you referring here? Can you elaborate?
> Oh, I just realised that series doesn't address any particular bug. But
> my point is that including a call to multifd_send_terminate_threads() at
> new_send_channel_cleanup might be all that's needed because that has
> code to cause the channels and the migration thread to end.

Yes, that's good and it solves the other bug I see where migration hangs.

>> The main issue I am trying to fix in my patch is that we don't wait for
>> all multifd channels to be created/error out before tearing down
>> multifd resources in mulitfd_save_cleanup().
> Ok, let me take a step back and ask why is this not solved by
> multifd_save_cleanup() -> qemu_thread_join()?

Because when we get to multifd_save_cleanup() there is no guarantee that 
the other threads have either successfully been established or errored out.
So we can have multifd_1 error out, triggering multifd_save_cleanup() 
which destroys all resources, and only then multifd_2 will get to 
multifd_new_send_channel_async() but by this time resources have already 
been destroyed.
qemu_thread_join() for multifd_2 in this case is simply "skipped".

Actually, while testing Peter's patch I encountered such error with less 
"engineering", I only caused the multifd threads to fail by changing 
their connect address.
Maybe now the error flow takes more time so this bug is more observable. 
It doesn't repro constantly though.

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000555555bbc6c2 in multifd_send_terminate_threads 
(err=0x7ffea8001310) at ../migration/multifd.c:485
485         if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
(gdb) bt
#0  0x0000555555bbc6c2 in multifd_send_terminate_threads 
(err=0x7ffea8001310) at ../migration/multifd.c:485
#1  0x0000555555bbd6d0 in multifd_new_send_channel_async 
(task=0x555557f19730, opaque=0x5555571e0a38) at ../migration/multifd.c:862
#2  0x0000555555e6f010 in qio_task_complete (task=0x555557f19730) at 
../io/task.c:197
#3  0x0000555555e6eca3 in qio_task_thread_result (opaque=0x555557f19730) 
at ../io/task.c:112
#4  0x00007ffff780145b in g_idle_dispatch () at /lib64/libglib-2.0.so.0
#5  0x00007ffff780578f in g_main_context_dispatch () at 
/lib64/libglib-2.0.so.0
#6  0x0000555556138d30 in glib_pollfds_poll () at ../util/main-loop.c:287
#7  0x0000555556138daa in os_host_main_loop_wait (timeout=0) at 
../util/main-loop.c:310
#8  0x0000555556138eaf in main_loop_wait (nonblocking=0) at 
../util/main-loop.c:589
#9  0x0000555555b88129 in qemu_main_loop () at ../system/runstate.c:783
#10 0x0000555555e48e97 in qemu_default_main () at ../system/main.c:37
#11 0x0000555555e48ed2 in main (argc=20, argv=0x7fffffffe218) at 
../system/main.c:48
(gdb) print multifd_send_state
$1 = (struct {...} *) 0x0

> I see you moved
> p->running=true to *after* the thread creation in patch 4. That will
> always leave a gap where p->running == false but the thread is already
> running.

Yes, Patch #4 addressed a specific bug with TLS where we can get a segfault.
It didn't try to solve this gap.

>
>>>> Thread           | Code execution
>>>> ------------------------------------------------------------------------
>>>> Multifd 1        |
>>>>                    | multifd_new_send_channel_async (errors and quits)
>>>>                    |   multifd_new_send_channel_cleanup
>>>>                    |
>>>> Migration thread |
>>>>                    | qemu_savevm_state_setup
>>>>                    |   ram_save_setup
>>>>                    |     multifd_send_sync_main
>>>>                    |     (detects Multifd 1 error and quits)
>>>>                    | [...]
>>>>                    | migration_iteration_finish
>>>>                    |   migrate_fd_cleanup_schedule
>>>>                    |
>>>> Main thread      |
>>>>                    | migrate_fd_cleanup
>>>>                    |   multifd_save_cleanup (destroys Multifd 2 resources)
>>>>                    |
>>>> Multifd 2        |
>>>>                    | multifd_new_send_channel_async
>>>>                    | (accesses destroyed resources, segfault)
>>>>
>>>> In another scenario, migration can hang indefinitely:
>>>> 1. Main migration thread reaches multifd_send_sync_main() and waits on
>>>>      the semaphores.
>>>> 2. Then, all multifd channels creation fails, so they post the
>>>>      semaphores and quit.
>>>> 3. Main migration channel will not identify the error, proceed to send
>>>>      pages and will hang.
>>>>
>>>> Fix it by waiting for all multifd channels to be created before
>>>> proceeding with migration.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>    migration/multifd.h   |  3 +++
>>>>    migration/migration.c |  1 +
>>>>    migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
>>>>    migration/ram.c       |  7 +++++++
>>>>    4 files changed, 42 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.h b/migration/multifd.h
>>>> index 35d11f103c..87a64e0a87 100644
>>>> --- a/migration/multifd.h
>>>> +++ b/migration/multifd.h
>>>> @@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>>>>    void multifd_recv_sync_main(void);
>>>>    int multifd_send_sync_main(void);
>>>>    int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>>>> +int multifd_send_channels_created(void);
>>>>
>>>>    /* Multifd Compression flags */
>>>>    #define MULTIFD_FLAG_SYNC (1 << 0)
>>>> @@ -86,6 +87,8 @@ typedef struct {
>>>>        /* multifd flags for sending ram */
>>>>        int write_flags;
>>>>
>>>> +    /* Syncs channel creation and migration thread */
>>>> +    QemuSemaphore create_sem;
>>>>        /* sem where to wait for more work */
>>>>        QemuSemaphore sem;
>>>>        /* syncs main thread and channels */
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 9c769a1ecd..d81d96eaa5 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>>            error_report_err(local_err);
>>>>            migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>>>                              MIGRATION_STATUS_FAILED);
>>>> +        multifd_send_channels_created();
>>>>            migrate_fd_cleanup(s);
>>>>            return;
>>>>        }
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index 564e911b6c..f0c216f4f9 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
>>>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>>>            multifd_send_channel_destroy(p->c);
>>>>            p->c = NULL;
>>>>            qemu_mutex_destroy(&p->mutex);
>>>> +        qemu_sem_destroy(&p->create_sem);
>>>>            qemu_sem_destroy(&p->sem);
>>>>            qemu_sem_destroy(&p->sem_sync);
>>>>            g_free(p->name);
>>>> @@ -766,6 +767,29 @@ out:
>>>>        return NULL;
>>>>    }
>>>>
>>>> +int multifd_send_channels_created(void)
>>>> +{
>>>> +    int ret = 0;
>>>> +
>>>> +    if (!migrate_multifd()) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    for (int i = 0; i < migrate_multifd_channels(); i++) {
>>>> +        MultiFDSendParams *p = &multifd_send_state->params[i];
>>>> +
>>>> +        qemu_sem_wait(&p->create_sem);
>>>> +        WITH_QEMU_LOCK_GUARD(&p->mutex) {
>>>> +            if (p->quit) {
>>>> +                error_report("%s: channel %d has already quit", __func__, i);
>>>> +                ret = -1;
>>>> +            }
>>>> +        }
>>> There are races here when a channel fails at
>>> multifd_send_initial_packet(). If p->quit can be set after post to
>>> create_sem, then this function could always return true and we'd run
>>> into a broken channel. Possibly even the same bug you're trying to fix.
>>>
>>> I think that's one of the reasons we have channels_ready.
>> I am not sure exactly what bug you are describing here, can you elaborate?
>>
> This looks like it could be a preexisting issue actually, but in the
> current code, what stops the multifd channel from running ahead is
> p->sem. Except that the channel does some work at
> multifd_send_initial_packet() before checking p->sem and that work could
> fail.
>
> This means that right after checking for p->quit above, the multifd
> thread could already be exiting due to an error and
> multifd_send_channels_created() would miss it.

And it's fine, because the purpose of multifd_send_channels_created() is 
only to make sure that when it returns, all multifd channels have either 
been established or errored out (so we don't have left over threads 
running after multifd_save_cleanup()).
Maybe adding this p->quit check is confusing and redundant.

Thinking about it again, maybe the location of 
multifd_send_channels_created() inside ram_save_setup() is not the best 
-- if a fast migrate_cancel happens or some other .save_setup() handler 
errors before ram_save_setup(), multifd_send_channels_created() will be 
skipped and we could end up in the same situation.

Putting multifd_send_channels_created() at the beginning of 
multifd_save_cleanup() will also cover the above cases.

> So there's still a chance
> that this function effectively behaves just like the previous code.
>
> Thread           | Code execution
> ------------------------------------------------------------------------
> Migration        |
>                   | ram_save_setup()
>                   |   multifd_send_channels_created()
>                   |     qemu_sem_wait(&p->create_sem);
> Main             |
>                   | multifd_channel_connect()
>                   |   qemu_thread_create()
>                   |   qemu_sem_post(&p->create_sem)
> Multifd 1        |
>                   | multifd_send_initial_packet() *errors*
>                   | goto out
>                   | multifd_send_terminate_threads()
> Migration        |
>                   | still at multifd_send_channels_created
>                   |     qemu_mutex_lock(&p->mutex);
>                   |     p->quit == false      <--- !!!
>                   |     qemu_mutex_unlock(&p->mutex);
>                   | return true from multifd_send_channels_created()
>
>  From here onwards, it's the same as not having checked
> multifd_send_channels_created() at all. One way this could go is:
>
>                   | runs unimpeded until multifd_send_sync_main()
>                   | multifd_send_pages()
>                   | *misses the exiting flag*
>                   | qemu_sem_wait(&multifd_send_state->channels_ready);
> Multifd 1        |
>                   | still at multifd_send_terminate_threads
>                   |   qemu_mutex_lock(&p->mutex);
>                   |   p->quit = true;
>                   |   qemu_mutex_unlock(&p->mutex);
>                   | qemu_sem_post(&p->sem_sync);
>                   | qemu_sem_post(&multifd_send_state->channels_ready);
> Migration        |
>                   | qemu_mutex_lock(&p->mutex);
>                   | p->quit == true;  <--- correct now
>                   | qemu_mutex_unlock(&p->mutex);
>                   | return -1;
>                   | *all good again*
>
> It seems that in order to avoid this kind of race we'd need the
> synchronization point to be at the multifd thread instead. But then,
> that's what channels_ready does.
>
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>    static bool multifd_channel_connect(MultiFDSendParams *p,
>>>>                                        QIOChannel *ioc,
>>>>                                        Error **errp);
>>>> @@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>>>>        p->quit = true;
>>>>        qemu_sem_post(&multifd_send_state->channels_ready);
>>>>        qemu_sem_post(&p->sem_sync);
>>>> +    qemu_sem_post(&p->create_sem);
>>>>        error_free(err);
>>>>    }
>>>>
>>>> @@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>>>>        qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
>>>>                           QEMU_THREAD_JOINABLE);
>>>>        p->running = true;
>>>> +    qemu_sem_post(&p->create_sem);
>>>>        return true;
>>>>    }
>>>>
>>>> @@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
>>>>                                                 QIOChannel *ioc, Error *err)
>>>>    {
>>>>         migrate_set_error(migrate_get_current(), err);
>>>> -     /* Error happen, we need to tell who pay attention to me */
>>>> -     qemu_sem_post(&multifd_send_state->channels_ready);
>>>> -     qemu_sem_post(&p->sem_sync);
>>>>         /*
>>>> +      * Error happen, we need to tell who pay attention to me.
>>>>          * Although multifd_send_thread is not created, but main migration
>>>>          * thread need to judge whether it is running, so we need to mark
>>>>          * its status.
>>>>          */
>>>>         p->quit = true;
>>>> +     qemu_sem_post(&multifd_send_state->channels_ready);
>>>> +     qemu_sem_post(&p->sem_sync);
>>>> +     qemu_sem_post(&p->create_sem);
>>> Do we still need channels_ready and sem_sync here? The migration thread
>>> shouldn't have gone past create_sem at this point.
>> I think you are right, we can drop channels_ready and sem_sync here.
>>
>>>>         object_unref(OBJECT(ioc));
>>>>         error_free(err);
>>>>    }
>>>> @@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
>>>>            MultiFDSendParams *p = &multifd_send_state->params[i];
>>>>
>>>>            qemu_mutex_init(&p->mutex);
>>>> +        qemu_sem_init(&p->create_sem, 0);
>>>>            qemu_sem_init(&p->sem, 0);
>>>>            qemu_sem_init(&p->sem_sync, 0);
>>>>            p->quit = false;
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index c0cdcccb75..b3e864a22b 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>>        RAMBlock *block;
>>>>        int ret;
>>>>
>>>> +    bql_unlock();
>>>> +    ret = multifd_send_channels_created();
>>>> +    bql_lock();
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>>        if (compress_threads_save_setup()) {
>>>>            return -1;
>>>>        }
diff mbox series

Patch

diff --git a/migration/multifd.h b/migration/multifd.h
index 35d11f103c..87a64e0a87 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -23,6 +23,7 @@  void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(void);
 int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
+int multifd_send_channels_created(void);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
@@ -86,6 +87,8 @@  typedef struct {
     /* multifd flags for sending ram */
     int write_flags;
 
+    /* Syncs channel creation and migration thread */
+    QemuSemaphore create_sem;
     /* sem where to wait for more work */
     QemuSemaphore sem;
     /* syncs main thread and channels */
diff --git a/migration/migration.c b/migration/migration.c
index 9c769a1ecd..d81d96eaa5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3621,6 +3621,7 @@  void migrate_fd_connect(MigrationState *s, Error *error_in)
         error_report_err(local_err);
         migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                           MIGRATION_STATUS_FAILED);
+        multifd_send_channels_created();
         migrate_fd_cleanup(s);
         return;
     }
diff --git a/migration/multifd.c b/migration/multifd.c
index 564e911b6c..f0c216f4f9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -538,6 +538,7 @@  void multifd_save_cleanup(void)
         multifd_send_channel_destroy(p->c);
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
+        qemu_sem_destroy(&p->create_sem);
         qemu_sem_destroy(&p->sem);
         qemu_sem_destroy(&p->sem_sync);
         g_free(p->name);
@@ -766,6 +767,29 @@  out:
     return NULL;
 }
 
+int multifd_send_channels_created(void)
+{
+    int ret = 0;
+
+    if (!migrate_multifd()) {
+        return 0;
+    }
+
+    for (int i = 0; i < migrate_multifd_channels(); i++) {
+        MultiFDSendParams *p = &multifd_send_state->params[i];
+
+        qemu_sem_wait(&p->create_sem);
+        WITH_QEMU_LOCK_GUARD(&p->mutex) {
+            if (p->quit) {
+                error_report("%s: channel %d has already quit", __func__, i);
+                ret = -1;
+            }
+        }
+    }
+
+    return ret;
+}
+
 static bool multifd_channel_connect(MultiFDSendParams *p,
                                     QIOChannel *ioc,
                                     Error **errp);
@@ -794,6 +818,7 @@  static void multifd_tls_outgoing_handshake(QIOTask *task,
     p->quit = true;
     qemu_sem_post(&multifd_send_state->channels_ready);
     qemu_sem_post(&p->sem_sync);
+    qemu_sem_post(&p->create_sem);
     error_free(err);
 }
 
@@ -857,6 +882,7 @@  static bool multifd_channel_connect(MultiFDSendParams *p,
     qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                        QEMU_THREAD_JOINABLE);
     p->running = true;
+    qemu_sem_post(&p->create_sem);
     return true;
 }
 
@@ -864,15 +890,16 @@  static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
                                              QIOChannel *ioc, Error *err)
 {
      migrate_set_error(migrate_get_current(), err);
-     /* Error happen, we need to tell who pay attention to me */
-     qemu_sem_post(&multifd_send_state->channels_ready);
-     qemu_sem_post(&p->sem_sync);
      /*
+      * Error happen, we need to tell who pay attention to me.
       * Although multifd_send_thread is not created, but main migration
       * thread need to judge whether it is running, so we need to mark
       * its status.
       */
      p->quit = true;
+     qemu_sem_post(&multifd_send_state->channels_ready);
+     qemu_sem_post(&p->sem_sync);
+     qemu_sem_post(&p->create_sem);
      object_unref(OBJECT(ioc));
      error_free(err);
 }
@@ -921,6 +948,7 @@  int multifd_save_setup(Error **errp)
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
         qemu_mutex_init(&p->mutex);
+        qemu_sem_init(&p->create_sem, 0);
         qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->sem_sync, 0);
         p->quit = false;
diff --git a/migration/ram.c b/migration/ram.c
index c0cdcccb75..b3e864a22b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2937,6 +2937,13 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
     RAMBlock *block;
     int ret;
 
+    bql_unlock();
+    ret = multifd_send_channels_created();
+    bql_lock();
+    if (ret < 0) {
+        return ret;
+    }
+
     if (compress_threads_save_setup()) {
         return -1;
     }