Message ID | 20240329033205.26087-1-lei4.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN |
On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> When using the post-copy preemption feature to perform post-copy live
> migration, the below scenario could lead to a deadlock and the migration will
> never finish:
>
> - Source connect() the preemption channel in postcopy_start().
> - Source and the destination side TCP stack finished the 3-way handshake,
>   thus the connection is successful.
> - The destination side main thread is handling the loading of the bulk RAM
>   pages, thus it doesn't start to handle the pending connection event in the
>   event loop, and doesn't post the semaphore postcopy_qemufile_dst_done for
>   the preemption thread.
> - The source side sends non-iterative device states, such as the virtio
>   states.
> - The destination main thread starts to receive the virtio states; this
>   process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
>   may trigger a page fault since the avail ring page may not be received
>   yet).
> - The page request is sent back to the source side. Source sends the page
>   content to the destination side preemption thread.
> - Since the event has not arrived and the semaphore
>   postcopy_qemufile_dst_done is not posted, the preemption thread on the
>   destination side is blocked and cannot handle receiving the page.
> - The QEMU main load thread on the destination side is stuck at the page
>   fault, and cannot yield and handle the connect() event for the
>   preemption channel to unblock the preemption thread.
> - The postcopy migration will be stuck there forever since this is a deadlock.
>
> The key point to reproduce this bug is that the source side is sending pages
> at a rate faster than the destination can handle them; otherwise, the
> qemu_get_be64() in ram_load_precopy() will have a chance to yield since at
> that time there is no pending data in the buffer to get, which makes this bug
> harder to reproduce.
>
> Fix this by yielding the load coroutine when receiving
> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> connection event before loading the non-iterative device state, to avoid the
> deadlock condition.
>
> Signed-off-by: Lei Wang <lei4.wang@intel.com>

This seems to be a regression caused by this commit:
737840e2c6ea (migration: Use the number of transferred bytes directly)

Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
seems to work (might not be a good fix though).

> ---
>  migration/savevm.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e386c5267f..8fd4dc92f2 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
>          return loadvm_postcopy_handle_advise(mis, len);
>
>      case MIG_CMD_POSTCOPY_LISTEN:
> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> +            aio_co_schedule(qemu_get_current_aio_context(),
> +                            qemu_coroutine_self());
> +            qemu_coroutine_yield();
> +        }

The above could be moved to loadvm_postcopy_handle_listen().

Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup
in migrate_fd_connect. This can save the above overhead of switching to the
main thread during the downtime. Seems Peter's previous patch already solved the
channel disordering issue. Let's see Peter and others' opinions.

>          return loadvm_postcopy_handle_listen(mis);
>
>      case MIG_CMD_POSTCOPY_RUN:
> --
> 2.39.3
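To see the circular wait in isolation, here is a minimal standalone model of the scenario above, written with plain POSIX threads. Everything in it (the semaphore names, the thread bodies) is invented for illustration and is not QEMU code: the "loader" blocks on a page that only the preempt thread can deliver, while the preempt thread blocks on an accept notification that only the loader's (never-running) event loop would post. The program hangs by design; build with gcc -pthread.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* Posted by the "main loop" once the preempt channel is accepted --
 * but the main loop never runs because the loader below is blocked. */
static sem_t accept_done;
/* Posted by the preempt thread once it has delivered the faulted page. */
static sem_t page_ready;

static void *preempt_thread(void *arg)
{
    sem_wait(&accept_done);   /* stands in for postcopy_qemufile_dst_done */
    sem_post(&page_ready);    /* would resolve the loader's page fault */
    return NULL;
}

int main(void)
{
    pthread_t t;

    sem_init(&accept_done, 0, 0);
    sem_init(&page_ready, 0, 0);
    pthread_create(&t, NULL, preempt_thread, NULL);

    /*
     * The destination "main load thread": while loading virtio state it
     * faults on a not-yet-received page and blocks here.  Because it never
     * returns to the event loop, accept_done is never posted, so page_ready
     * is never posted either -- a circular wait.
     */
    fprintf(stderr, "loader: page fault, waiting for the page...\n");
    sem_wait(&page_ready);
    fprintf(stderr, "loader: unreachable\n");
    pthread_join(t, NULL);
    return 0;
}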
On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: > On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: > > When using the post-copy preemption feature to perform post-copy live > > migration, the below scenario could lead to a deadlock and the migration will > > never finish: > > > > - Source connect() the preemption channel in postcopy_start(). > > - Source and the destination side TCP stack finished the 3-way handshake > > thus the connection is successful. > > - The destination side main thread is handling the loading of the bulk RAM > > pages thus it doesn't start to handle the pending connection event in the > > event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for > > the preemption thread. > > - The source side sends non-iterative device states, such as the virtio > > states. > > - The destination main thread starts to receive the virtio states, this > > process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() > > may trigger a page fault since the avail ring page may not be received > > yet). Ouch. Yeah I think this part got overlooked when working on the preempt channel. > > - The page request is sent back to the source side. Source sends the page > > content to the destination side preemption thread. > > - Since the event is not arrived and the semaphore > > postcopy_qemufile_dst_done is not posted, the preemption thread in > > destination side is blocked, and cannot handle receiving the page. > > - The QEMU main load thread on the destination side is stuck at the page > > fault, and cannot yield and handle the connect() event for the > > preemption channel to unblock the preemption thread. > > - The postcopy will stuck there forever since this is a deadlock. > > > > The key point to reproduce this bug is that the source side is sending pages at a > > rate faster than the destination handling, otherwise, the qemu_get_be64() in > > ram_load_precopy() will have a chance to yield since at that time there are no > > pending data in the buffer to get. This will make this bug harder to be > > reproduced. How hard would this reproduce? I'm thinking whether this should be 9.0 material or 9.1. It's pretty late for 9.0 though, but we can still discuss. > > > > Fix this by yielding the load coroutine when receiving > > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the > > connection event before loading the non-iterative devices state to avoid the > > deadlock condition. > > > > Signed-off-by: Lei Wang <lei4.wang@intel.com> > > This seems to be a regression issue caused by this commit: > 737840e2c6ea (migration: Use the number of transferred bytes directly) > > Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate > seems to work (might not be a good fix though). > > > --- > > migration/savevm.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/migration/savevm.c b/migration/savevm.c index > > e386c5267f..8fd4dc92f2 100644 > > --- a/migration/savevm.c > > +++ b/migration/savevm.c > > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f) > > return loadvm_postcopy_handle_advise(mis, len); > > > > case MIG_CMD_POSTCOPY_LISTEN: > > + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { > > + aio_co_schedule(qemu_get_current_aio_context(), > > + qemu_coroutine_self()); > > + qemu_coroutine_yield(); > > + } > > The above could be moved to loadvm_postcopy_handle_listen(). I'm not 100% sure such thing (no matter here or moved into it, which does look cleaner) would work for us. 
The problem is I still don't yet see an ordering restriction on top of (1)
accept() happens, and (2) receiving the LISTEN cmd here.  What happens if the
accept() request is not yet received when reaching LISTEN?  Or is it always
guaranteed the accept(fd) will always be polled here?

For example, the source QEMU (no matter pre-7.2 or later) will always set up
the preempt channel asynchronously, then IIUC it can connect() after sending
the whole chunk of packed data which should include this LISTEN.  I think it
means it's not guaranteed this will 100% work; it may only further reduce the
possibility of the race.

One right fix that I can think of is moving the sem_wait(&done) into the main
thread too, so we wait for the sem _before_ reading the packed data, so
there's no chance of fault.  However I don't think sem_wait() will be smart
enough to yield when in a coroutine..  In the long run I think we should
really make migration loadvm do its work in a thread rather than the main
thread.  I think it means we have one more example to be listed in this todo,
so that's preferred..

https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination

I attached such a draft patch below, but I'm not sure it'll work.  Let me know
what both of you think about it.

>
> Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup
> in migrate_fd_connect. This can save the above overhead of switching to the
> main thread during the downtime. Seems Peter's previous patch already solved the
> channel disordering issue. Let's see Peter and others' opinions.

IIUC we still need that pre_7_2 stuff and keep the postponed connect() to
make sure the ordering is done properly.  Wei, could you elaborate the patch
you mentioned?  Maybe I missed some spots.

You raised a good point that this may introduce higher downtime.  Did you or
Lei try to measure how large it is?  If that is too high, we may need to
think of another solution, e.g., wait for the channel connection before the
VM stop happens.

Thanks,

>
> >          return loadvm_postcopy_handle_listen(mis);
> >
>
> >      case MIG_CMD_POSTCOPY_RUN:
> > --
> > 2.39.3
>

===8<===
diff --git a/migration/migration.c b/migration/migration.c
index 696762bc64..bacd1328cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     /*
      * Make sure the receiver can get incoming pages before we send the rest
      * of the state
+     *
+     * When preempt mode enabled, this must be done after we initiate the
+     * preempt channel, as destination QEMU will wait for the channel when
+     * processing the LISTEN request.  Currently it may not matter a huge
+     * deal if we always create the channel asynchrously with a qio task,
+     * but we need to keep this in mind.
      */
     qemu_savevm_send_postcopy_listen(fb);
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index eccff499cb..4f26a89ac9 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1254,6 +1254,26 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     }
 
     if (migrate_postcopy_preempt()) {
+        /*
+         * The preempt channel is established in asynchronous way.  Wait
+         * for its completion.
+         */
+        while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
+            /*
+             * Note that to make sure the main thread can still schedule an
+             * accept() request we need to proactively yield for the main
+             * loop to run for some duration (100ms in this case), which is
+             * pretty ugly.
+             *
+             * TODO: we should do this in a separate thread to load the VM
+             * rather than in the main thread, just like the source side.
+             */
+            if (qemu_in_coroutine()) {
+                aio_co_schedule(qemu_get_current_aio_context(),
+                                qemu_coroutine_self());
+                qemu_coroutine_yield();
+            }
+        }
         /*
          * This thread needs to be created after the temp pages because
          * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
@@ -1743,12 +1763,6 @@ void *postcopy_preempt_thread(void *opaque)
 
     qemu_sem_post(&mis->thread_sync_sem);
 
-    /*
-     * The preempt channel is established in asynchronous way.  Wait
-     * for its completion.
-     */
-    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
-
     /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
     qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
     while (preempt_thread_should_run(mis)) {
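Both Lei's fix at the top of the thread and the draft above use the same two calls to hand control back to the main loop from inside the load coroutine. Purely as an illustration, that idiom could be factored into a small helper inside the QEMU tree, roughly as below; the helper name is invented and the header paths are from memory, only the calls themselves are taken from the patches in this thread (this is not a standalone program and only builds in-tree).

#include "qemu/osdep.h"
#include "qemu/coroutine.h"
#include "block/aio.h"

/*
 * Reschedule the current coroutine onto its own AioContext and yield once,
 * giving the main loop one iteration to dispatch pending events (such as
 * the preempt channel's accept()) before the caller continues loading.
 */
static void yield_to_main_loop_once(void)
{
    if (!qemu_in_coroutine()) {
        return;
    }
    aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
    qemu_coroutine_yield();
}

With such a helper, both the MIG_CMD_POSTCOPY_LISTEN hunk and the loop body in the draft above would reduce to a single call at the yield point.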
Peter Xu <peterx@redhat.com> writes: > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: >> > When using the post-copy preemption feature to perform post-copy live >> > migration, the below scenario could lead to a deadlock and the migration will >> > never finish: >> > >> > - Source connect() the preemption channel in postcopy_start(). >> > - Source and the destination side TCP stack finished the 3-way handshake >> > thus the connection is successful. >> > - The destination side main thread is handling the loading of the bulk RAM >> > pages thus it doesn't start to handle the pending connection event in the >> > event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for >> > the preemption thread. >> > - The source side sends non-iterative device states, such as the virtio >> > states. >> > - The destination main thread starts to receive the virtio states, this >> > process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() >> > may trigger a page fault since the avail ring page may not be received >> > yet). > > Ouch. Yeah I think this part got overlooked when working on the preempt > channel. > >> > - The page request is sent back to the source side. Source sends the page >> > content to the destination side preemption thread. >> > - Since the event is not arrived and the semaphore >> > postcopy_qemufile_dst_done is not posted, the preemption thread in >> > destination side is blocked, and cannot handle receiving the page. >> > - The QEMU main load thread on the destination side is stuck at the page >> > fault, and cannot yield and handle the connect() event for the >> > preemption channel to unblock the preemption thread. >> > - The postcopy will stuck there forever since this is a deadlock. >> > >> > The key point to reproduce this bug is that the source side is sending pages at a >> > rate faster than the destination handling, otherwise, the qemu_get_be64() in >> > ram_load_precopy() will have a chance to yield since at that time there are no >> > pending data in the buffer to get. This will make this bug harder to be >> > reproduced. > > How hard would this reproduce? > > I'm thinking whether this should be 9.0 material or 9.1. It's pretty late > for 9.0 though, but we can still discuss. > >> > >> > Fix this by yielding the load coroutine when receiving >> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the >> > connection event before loading the non-iterative devices state to avoid the >> > deadlock condition. >> > >> > Signed-off-by: Lei Wang <lei4.wang@intel.com> >> >> This seems to be a regression issue caused by this commit: >> 737840e2c6ea (migration: Use the number of transferred bytes directly) >> >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate >> seems to work (might not be a good fix though). 
>> >> > --- >> > migration/savevm.c | 5 +++++ >> > 1 file changed, 5 insertions(+) >> > >> > diff --git a/migration/savevm.c b/migration/savevm.c index >> > e386c5267f..8fd4dc92f2 100644 >> > --- a/migration/savevm.c >> > +++ b/migration/savevm.c >> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f) >> > return loadvm_postcopy_handle_advise(mis, len); >> > >> > case MIG_CMD_POSTCOPY_LISTEN: >> > + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { >> > + aio_co_schedule(qemu_get_current_aio_context(), >> > + qemu_coroutine_self()); >> > + qemu_coroutine_yield(); >> > + } >> >> The above could be moved to loadvm_postcopy_handle_listen(). > > I'm not 100% sure such thing (no matter here or moved into it, which does > look cleaner) would work for us. > > The problem is I still don't yet see an ordering restricted on top of (1) > accept() happens, and (2) receive LISTEN cmd here. What happens if the > accept() request is not yet received when reaching LISTEN? Or is it always > guaranteed the accept(fd) will always be polled here? > > For example, the source QEMU (no matter pre-7.2 or later) will always setup > the preempt channel asynchrounously, then IIUC it can connect() after > sending the whole chunk of packed data which should include this LISTEN. I > think it means it's not guaranteed this will 100% work, but maybe further > reduce the possibility of the race. > > One right fix that I can think of is moving the sem_wait(&done) into the > main thread too, so we wait for the sem _before_ reading the packed data, > so there's no chance of fault. However I don't think sem_wait() will be > smart enough to yield when in a coroutine.. In the long term run I think > we should really make migration loadvm to do work in the thread rather than > the main thread. I think it means we have one more example to be listed in > this todo so that's preferred.. > > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination > > I attached such draft patch below, but I'm not sure it'll work. Let me > know how both of you think about it. > >> >> Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup >> in migrate_fd_connect. This can save the above overhead of switching to the >> main thread during the downtime. Seems Peter's previous patch already solved the >> channel disordering issue. Let's see Peter and others' opinions. > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to > make sure the ordering is done properly. Wei, could you elaborate the > patch you mentioned? Maybe I missed some spots. > > You raised a good point that this may introduce higher downtime. Did you > or Lei tried to measure how large it is? If that is too high, we may need > to think another solution, e.g., wait the channel connection before vm stop > happens. > > Thanks, > >> >> > return loadvm_postcopy_handle_listen(mis); >> > >> >> > case MIG_CMD_POSTCOPY_RUN: >> > -- >> > 2.39.3 >> > > ===8<=== > diff --git a/migration/migration.c b/migration/migration.c > index 696762bc64..bacd1328cf 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error **errp) > /* > * Make sure the receiver can get incoming pages before we send the rest > * of the state > + * > + * When preempt mode enabled, this must be done after we initiate the > + * preempt channel, as destination QEMU will wait for the channel when > + * processing the LISTEN request. 
Currently it may not matter a huge > + * deal if we always create the channel asynchrously with a qio task, > + * but we need to keep this in mind. > */ > qemu_savevm_send_postcopy_listen(fb); > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c > index eccff499cb..4f26a89ac9 100644 > --- a/migration/postcopy-ram.c > +++ b/migration/postcopy-ram.c > @@ -1254,6 +1254,26 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis) > } > > if (migrate_postcopy_preempt()) { > + /* > + * The preempt channel is established in asynchronous way. Wait > + * for its completion. > + */ > + while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { > + /* > + * Note that to make sure the main thread can still schedule an > + * accept() request we need to proactively yield for the main > + * loop to run for some duration (100ms in this case), which is > + * pretty ugly. > + * > + * TODO: we should do this in a separate thread to load the VM > + * rather than in the main thread, just like the source side. > + */ > + if (qemu_in_coroutine()) { > + aio_co_schedule(qemu_get_current_aio_context(), > + qemu_coroutine_self()); > + qemu_coroutine_yield(); I think the correct way to do this these days is aio_co_reschedule_self(). Anyway, what we are yielding to here? I see qemu_loadvm_state_main() called from a bunch of places, it's not clear to me where will the execution resume after yielding. Is that end up going to be migration_incoming_process()? I don't know much about the postcopy parts, excuse my ignorance. > + } > + } > /* > * This thread needs to be created after the temp pages because > * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately. > @@ -1743,12 +1763,6 @@ void *postcopy_preempt_thread(void *opaque) > > qemu_sem_post(&mis->thread_sync_sem); > > - /* > - * The preempt channel is established in asynchronous way. Wait > - * for its completion. > - */ > - qemu_sem_wait(&mis->postcopy_qemufile_dst_done); > - > /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */ > qemu_mutex_lock(&mis->postcopy_prio_thread_mutex); > while (preempt_thread_should_run(mis)) { > -- > 2.44.0
On Mon, Apr 01, 2024 at 02:17:28PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: > >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: > >> > When using the post-copy preemption feature to perform post-copy live > >> > migration, the below scenario could lead to a deadlock and the migration will > >> > never finish: > >> > > >> > - Source connect() the preemption channel in postcopy_start(). > >> > - Source and the destination side TCP stack finished the 3-way handshake > >> > thus the connection is successful. > >> > - The destination side main thread is handling the loading of the bulk RAM > >> > pages thus it doesn't start to handle the pending connection event in the > >> > event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for > >> > the preemption thread. > >> > - The source side sends non-iterative device states, such as the virtio > >> > states. > >> > - The destination main thread starts to receive the virtio states, this > >> > process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() > >> > may trigger a page fault since the avail ring page may not be received > >> > yet). > > > > Ouch. Yeah I think this part got overlooked when working on the preempt > > channel. > > > >> > - The page request is sent back to the source side. Source sends the page > >> > content to the destination side preemption thread. > >> > - Since the event is not arrived and the semaphore > >> > postcopy_qemufile_dst_done is not posted, the preemption thread in > >> > destination side is blocked, and cannot handle receiving the page. > >> > - The QEMU main load thread on the destination side is stuck at the page > >> > fault, and cannot yield and handle the connect() event for the > >> > preemption channel to unblock the preemption thread. > >> > - The postcopy will stuck there forever since this is a deadlock. > >> > > >> > The key point to reproduce this bug is that the source side is sending pages at a > >> > rate faster than the destination handling, otherwise, the qemu_get_be64() in > >> > ram_load_precopy() will have a chance to yield since at that time there are no > >> > pending data in the buffer to get. This will make this bug harder to be > >> > reproduced. > > > > How hard would this reproduce? > > > > I'm thinking whether this should be 9.0 material or 9.1. It's pretty late > > for 9.0 though, but we can still discuss. > > > >> > > >> > Fix this by yielding the load coroutine when receiving > >> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the > >> > connection event before loading the non-iterative devices state to avoid the > >> > deadlock condition. > >> > > >> > Signed-off-by: Lei Wang <lei4.wang@intel.com> > >> > >> This seems to be a regression issue caused by this commit: > >> 737840e2c6ea (migration: Use the number of transferred bytes directly) > >> > >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate > >> seems to work (might not be a good fix though). 
> >> > >> > --- > >> > migration/savevm.c | 5 +++++ > >> > 1 file changed, 5 insertions(+) > >> > > >> > diff --git a/migration/savevm.c b/migration/savevm.c index > >> > e386c5267f..8fd4dc92f2 100644 > >> > --- a/migration/savevm.c > >> > +++ b/migration/savevm.c > >> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f) > >> > return loadvm_postcopy_handle_advise(mis, len); > >> > > >> > case MIG_CMD_POSTCOPY_LISTEN: > >> > + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { > >> > + aio_co_schedule(qemu_get_current_aio_context(), > >> > + qemu_coroutine_self()); > >> > + qemu_coroutine_yield(); > >> > + } > >> > >> The above could be moved to loadvm_postcopy_handle_listen(). > > > > I'm not 100% sure such thing (no matter here or moved into it, which does > > look cleaner) would work for us. > > > > The problem is I still don't yet see an ordering restricted on top of (1) > > accept() happens, and (2) receive LISTEN cmd here. What happens if the > > accept() request is not yet received when reaching LISTEN? Or is it always > > guaranteed the accept(fd) will always be polled here? > > > > For example, the source QEMU (no matter pre-7.2 or later) will always setup > > the preempt channel asynchrounously, then IIUC it can connect() after > > sending the whole chunk of packed data which should include this LISTEN. I > > think it means it's not guaranteed this will 100% work, but maybe further > > reduce the possibility of the race. > > > > One right fix that I can think of is moving the sem_wait(&done) into the > > main thread too, so we wait for the sem _before_ reading the packed data, > > so there's no chance of fault. However I don't think sem_wait() will be > > smart enough to yield when in a coroutine.. In the long term run I think > > we should really make migration loadvm to do work in the thread rather than > > the main thread. I think it means we have one more example to be listed in > > this todo so that's preferred.. > > > > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination > > > > I attached such draft patch below, but I'm not sure it'll work. Let me > > know how both of you think about it. > > > >> > >> Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup > >> in migrate_fd_connect. This can save the above overhead of switching to the > >> main thread during the downtime. Seems Peter's previous patch already solved the > >> channel disordering issue. Let's see Peter and others' opinions. > > > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to > > make sure the ordering is done properly. Wei, could you elaborate the > > patch you mentioned? Maybe I missed some spots. > > > > You raised a good point that this may introduce higher downtime. Did you > > or Lei tried to measure how large it is? If that is too high, we may need > > to think another solution, e.g., wait the channel connection before vm stop > > happens. 
> > > > Thanks, > > > >> > >> > return loadvm_postcopy_handle_listen(mis); > >> > > >> > >> > case MIG_CMD_POSTCOPY_RUN: > >> > -- > >> > 2.39.3 > >> > > > > ===8<=== > > diff --git a/migration/migration.c b/migration/migration.c > > index 696762bc64..bacd1328cf 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error **errp) > > /* > > * Make sure the receiver can get incoming pages before we send the rest > > * of the state > > + * > > + * When preempt mode enabled, this must be done after we initiate the > > + * preempt channel, as destination QEMU will wait for the channel when > > + * processing the LISTEN request. Currently it may not matter a huge > > + * deal if we always create the channel asynchrously with a qio task, > > + * but we need to keep this in mind. > > */ > > qemu_savevm_send_postcopy_listen(fb); > > > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c > > index eccff499cb..4f26a89ac9 100644 > > --- a/migration/postcopy-ram.c > > +++ b/migration/postcopy-ram.c > > @@ -1254,6 +1254,26 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis) > > } > > > > if (migrate_postcopy_preempt()) { > > + /* > > + * The preempt channel is established in asynchronous way. Wait > > + * for its completion. > > + */ > > + while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { > > + /* > > + * Note that to make sure the main thread can still schedule an > > + * accept() request we need to proactively yield for the main > > + * loop to run for some duration (100ms in this case), which is > > + * pretty ugly. > > + * > > + * TODO: we should do this in a separate thread to load the VM > > + * rather than in the main thread, just like the source side. > > + */ > > + if (qemu_in_coroutine()) { > > + aio_co_schedule(qemu_get_current_aio_context(), > > + qemu_coroutine_self()); > > + qemu_coroutine_yield(); > > I think the correct way to do this these days is > aio_co_reschedule_self(). The helper checks old v.s. new contexts, where here we want to pass in the current context. Would that be a no-op then? > > Anyway, what we are yielding to here? I see qemu_loadvm_state_main() > called from a bunch of places, it's not clear to me where will the > execution resume after yielding. Is that end up going to be > migration_incoming_process()? In this specific case it should try to yield to the port listener that is waiting for the preempt channel, aka, socket_accept_incoming_migration(), and ultimately it'll kick off this sem, by: socket_accept_incoming_migration -> migration_ioc_process_incoming -> postcopy_preempt_new_channel > > I don't know much about the postcopy parts, excuse my ignorance. Not a problem at all, please shoot if there's any questions either here or elsewhere. You're going to maintain it anyway as part of the migration code base. :-D Thanks,
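For readers following along: going only by Peter's description here (the helper compares the old and new context; the actual implementation in util/async.c may differ, so treat the shape below as an assumption), aio_co_reschedule_self() behaves roughly like this sketch, which is why handing it the current context would degenerate into a no-op rather than the unconditional yield the fix needs.

/* Rough sketch only -- the real aio_co_reschedule_self() may differ. */
static void reschedule_self_sketch(AioContext *new_ctx)
{
    AioContext *old_ctx = qemu_get_current_aio_context();

    if (old_ctx != new_ctx) {
        /* Hop the current coroutine over to the other context. */
        aio_co_schedule(new_ctx, qemu_coroutine_self());
        qemu_coroutine_yield();
    }
    /*
     * When new_ctx is already the current context, nothing happens, so the
     * main loop never gets the iteration it needs to dispatch the pending
     * accept().
     */
}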
Peter Xu <peterx@redhat.com> writes: > On Mon, Apr 01, 2024 at 02:17:28PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: >> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: >> >> > When using the post-copy preemption feature to perform post-copy live >> >> > migration, the below scenario could lead to a deadlock and the migration will >> >> > never finish: >> >> > >> >> > - Source connect() the preemption channel in postcopy_start(). >> >> > - Source and the destination side TCP stack finished the 3-way handshake >> >> > thus the connection is successful. >> >> > - The destination side main thread is handling the loading of the bulk RAM >> >> > pages thus it doesn't start to handle the pending connection event in the >> >> > event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for >> >> > the preemption thread. >> >> > - The source side sends non-iterative device states, such as the virtio >> >> > states. >> >> > - The destination main thread starts to receive the virtio states, this >> >> > process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() >> >> > may trigger a page fault since the avail ring page may not be received >> >> > yet). >> > >> > Ouch. Yeah I think this part got overlooked when working on the preempt >> > channel. >> > >> >> > - The page request is sent back to the source side. Source sends the page >> >> > content to the destination side preemption thread. >> >> > - Since the event is not arrived and the semaphore >> >> > postcopy_qemufile_dst_done is not posted, the preemption thread in >> >> > destination side is blocked, and cannot handle receiving the page. >> >> > - The QEMU main load thread on the destination side is stuck at the page >> >> > fault, and cannot yield and handle the connect() event for the >> >> > preemption channel to unblock the preemption thread. >> >> > - The postcopy will stuck there forever since this is a deadlock. >> >> > >> >> > The key point to reproduce this bug is that the source side is sending pages at a >> >> > rate faster than the destination handling, otherwise, the qemu_get_be64() in >> >> > ram_load_precopy() will have a chance to yield since at that time there are no >> >> > pending data in the buffer to get. This will make this bug harder to be >> >> > reproduced. >> > >> > How hard would this reproduce? >> > >> > I'm thinking whether this should be 9.0 material or 9.1. It's pretty late >> > for 9.0 though, but we can still discuss. >> > >> >> > >> >> > Fix this by yielding the load coroutine when receiving >> >> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the >> >> > connection event before loading the non-iterative devices state to avoid the >> >> > deadlock condition. >> >> > >> >> > Signed-off-by: Lei Wang <lei4.wang@intel.com> >> >> >> >> This seems to be a regression issue caused by this commit: >> >> 737840e2c6ea (migration: Use the number of transferred bytes directly) >> >> >> >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate >> >> seems to work (might not be a good fix though). 
>> >> >> >> > --- >> >> > migration/savevm.c | 5 +++++ >> >> > 1 file changed, 5 insertions(+) >> >> > >> >> > diff --git a/migration/savevm.c b/migration/savevm.c index >> >> > e386c5267f..8fd4dc92f2 100644 >> >> > --- a/migration/savevm.c >> >> > +++ b/migration/savevm.c >> >> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f) >> >> > return loadvm_postcopy_handle_advise(mis, len); >> >> > >> >> > case MIG_CMD_POSTCOPY_LISTEN: >> >> > + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { >> >> > + aio_co_schedule(qemu_get_current_aio_context(), >> >> > + qemu_coroutine_self()); >> >> > + qemu_coroutine_yield(); >> >> > + } >> >> >> >> The above could be moved to loadvm_postcopy_handle_listen(). >> > >> > I'm not 100% sure such thing (no matter here or moved into it, which does >> > look cleaner) would work for us. >> > >> > The problem is I still don't yet see an ordering restricted on top of (1) >> > accept() happens, and (2) receive LISTEN cmd here. What happens if the >> > accept() request is not yet received when reaching LISTEN? Or is it always >> > guaranteed the accept(fd) will always be polled here? >> > >> > For example, the source QEMU (no matter pre-7.2 or later) will always setup >> > the preempt channel asynchrounously, then IIUC it can connect() after >> > sending the whole chunk of packed data which should include this LISTEN. I >> > think it means it's not guaranteed this will 100% work, but maybe further >> > reduce the possibility of the race. >> > >> > One right fix that I can think of is moving the sem_wait(&done) into the >> > main thread too, so we wait for the sem _before_ reading the packed data, >> > so there's no chance of fault. However I don't think sem_wait() will be >> > smart enough to yield when in a coroutine.. In the long term run I think >> > we should really make migration loadvm to do work in the thread rather than >> > the main thread. I think it means we have one more example to be listed in >> > this todo so that's preferred.. >> > >> > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination >> > >> > I attached such draft patch below, but I'm not sure it'll work. Let me >> > know how both of you think about it. >> > >> >> >> >> Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup >> >> in migrate_fd_connect. This can save the above overhead of switching to the >> >> main thread during the downtime. Seems Peter's previous patch already solved the >> >> channel disordering issue. Let's see Peter and others' opinions. >> > >> > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to >> > make sure the ordering is done properly. Wei, could you elaborate the >> > patch you mentioned? Maybe I missed some spots. >> > >> > You raised a good point that this may introduce higher downtime. Did you >> > or Lei tried to measure how large it is? If that is too high, we may need >> > to think another solution, e.g., wait the channel connection before vm stop >> > happens. 
>> > >> > Thanks, >> > >> >> >> >> > return loadvm_postcopy_handle_listen(mis); >> >> > >> >> >> >> > case MIG_CMD_POSTCOPY_RUN: >> >> > -- >> >> > 2.39.3 >> >> >> > >> > ===8<=== >> > diff --git a/migration/migration.c b/migration/migration.c >> > index 696762bc64..bacd1328cf 100644 >> > --- a/migration/migration.c >> > +++ b/migration/migration.c >> > @@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error **errp) >> > /* >> > * Make sure the receiver can get incoming pages before we send the rest >> > * of the state >> > + * >> > + * When preempt mode enabled, this must be done after we initiate the >> > + * preempt channel, as destination QEMU will wait for the channel when >> > + * processing the LISTEN request. Currently it may not matter a huge >> > + * deal if we always create the channel asynchrously with a qio task, >> > + * but we need to keep this in mind. >> > */ >> > qemu_savevm_send_postcopy_listen(fb); >> > >> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c >> > index eccff499cb..4f26a89ac9 100644 >> > --- a/migration/postcopy-ram.c >> > +++ b/migration/postcopy-ram.c >> > @@ -1254,6 +1254,26 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis) >> > } >> > >> > if (migrate_postcopy_preempt()) { >> > + /* >> > + * The preempt channel is established in asynchronous way. Wait >> > + * for its completion. >> > + */ >> > + while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { >> > + /* >> > + * Note that to make sure the main thread can still schedule an >> > + * accept() request we need to proactively yield for the main >> > + * loop to run for some duration (100ms in this case), which is >> > + * pretty ugly. >> > + * >> > + * TODO: we should do this in a separate thread to load the VM >> > + * rather than in the main thread, just like the source side. >> > + */ >> > + if (qemu_in_coroutine()) { >> > + aio_co_schedule(qemu_get_current_aio_context(), >> > + qemu_coroutine_self()); >> > + qemu_coroutine_yield(); >> >> I think the correct way to do this these days is >> aio_co_reschedule_self(). > > The helper checks old v.s. new contexts, where here we want to pass in the > current context. Would that be a no-op then? > >> >> Anyway, what we are yielding to here? I see qemu_loadvm_state_main() >> called from a bunch of places, it's not clear to me where will the >> execution resume after yielding. Is that end up going to be >> migration_incoming_process()? > > In this specific case it should try to yield to the port listener that is > waiting for the preempt channel, aka, socket_accept_incoming_migration(), > and ultimately it'll kick off this sem, by: > > socket_accept_incoming_migration -> > migration_ioc_process_incoming -> > postcopy_preempt_new_channel Ok, I think I get it. So the issue is just a plain old "blocking the main loop" kind of bug. We have in ram_load_precopy: /* * Yield periodically to let main loop run, but an iteration of * the main loop is expensive, so do it each some iterations */ if ((i & 32767) == 0 && qemu_in_coroutine()) { aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self()); qemu_coroutine_yield(); } That's similar to why I had to move multifd_send_setup() to the migration thread, we need to allow glib_pollfds_poll() to run so it dispatches the listener callbacks. > >> >> I don't know much about the postcopy parts, excuse my ignorance. > > Not a problem at all, please shoot if there's any questions either here or > elsewhere. 
You're going to maintain it anyway as part of the migration code > base. :-D /me runs But yeah, I didn't spend enough time looking at this code yet to form a good mental picture. I only looked at the super-specific recovery cases. > > Thanks,
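Fabiano's "blocking the main loop" framing can also be demonstrated outside QEMU with GLib, which is what the main loop (glib_pollfds_poll()) is built on. The snippet below is an invented, self-contained model, not QEMU code: a pending source (standing in for the accept() handler) is only dispatched once the context is allowed to iterate, which is exactly the effect the coroutine yield is buying. Build with gcc and pkg-config --cflags --libs glib-2.0.

#include <glib.h>
#include <unistd.h>

/* Stands in for socket_accept_incoming_migration(): it can only run if the
 * main context gets a chance to iterate. */
static gboolean accept_cb(gpointer data)
{
    gboolean *accepted = data;

    *accepted = TRUE;
    g_print("event loop: connection handler dispatched\n");
    return G_SOURCE_REMOVE;
}

int main(void)
{
    gboolean accepted = FALSE;

    g_idle_add(accept_cb, &accepted);

    /* Phase 1: the "busy loader" never iterates the context, so the pending
     * source is never dispatched -- accepted stays FALSE. */
    sleep(1);
    g_print("after blocking work: accepted=%d\n", accepted);

    /* Phase 2: the equivalent of the periodic coroutine yield -- let the
     * default context run non-blocking iterations until the handler fires. */
    while (!accepted) {
        g_main_context_iteration(NULL, FALSE);
    }
    g_print("after yielding: accepted=%d\n", accepted);
    return 0;
}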
On 4/2/2024 0:13, Peter Xu wrote:> On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: >>> When using the post-copy preemption feature to perform post-copy live >>> migration, the below scenario could lead to a deadlock and the migration will >>> never finish: >>> >>> - Source connect() the preemption channel in postcopy_start(). >>> - Source and the destination side TCP stack finished the 3-way handshake >>> thus the connection is successful. >>> - The destination side main thread is handling the loading of the bulk RAM >>> pages thus it doesn't start to handle the pending connection event in the >>> event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for >>> the preemption thread. >>> - The source side sends non-iterative device states, such as the virtio >>> states. >>> - The destination main thread starts to receive the virtio states, this >>> process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() >>> may trigger a page fault since the avail ring page may not be received >>> yet). > > Ouch. Yeah I think this part got overlooked when working on the preempt > channel. > >>> - The page request is sent back to the source side. Source sends the page >>> content to the destination side preemption thread. >>> - Since the event is not arrived and the semaphore >>> postcopy_qemufile_dst_done is not posted, the preemption thread in >>> destination side is blocked, and cannot handle receiving the page. >>> - The QEMU main load thread on the destination side is stuck at the page >>> fault, and cannot yield and handle the connect() event for the >>> preemption channel to unblock the preemption thread. >>> - The postcopy will stuck there forever since this is a deadlock. >>> >>> The key point to reproduce this bug is that the source side is sending pages at a >>> rate faster than the destination handling, otherwise, the qemu_get_be64() in >>> ram_load_precopy() will have a chance to yield since at that time there are no >>> pending data in the buffer to get. This will make this bug harder to be >>> reproduced. > > How hard would this reproduce? We can manually make this easier to reproduce by adding the following code to make the destination busier to load the pages: diff --git a/migration/ram.c b/migration/ram.c index 0ad9fbba48..0b42877e1f 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f) { MigrationIncomingState *mis = migration_incoming_get_current(); int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0; + volatile unsigned long long a; if (!migrate_compress()) { invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE; @@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f) break; case RAM_SAVE_FLAG_PAGE: + for (a = 0; a < 100000000; a++); qemu_get_buffer(f, host, TARGET_PAGE_SIZE); break; > > I'm thinking whether this should be 9.0 material or 9.1. It's pretty late > for 9.0 though, but we can still discuss. > >>> >>> Fix this by yielding the load coroutine when receiving >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the >>> connection event before loading the non-iterative devices state to avoid the >>> deadlock condition. 
>>> >>> Signed-off-by: Lei Wang <lei4.wang@intel.com> >> >> This seems to be a regression issue caused by this commit: >> 737840e2c6ea (migration: Use the number of transferred bytes directly) >> >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate >> seems to work (might not be a good fix though). >> >>> --- >>> migration/savevm.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/migration/savevm.c b/migration/savevm.c index >>> e386c5267f..8fd4dc92f2 100644 >>> --- a/migration/savevm.c >>> +++ b/migration/savevm.c >>> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f) >>> return loadvm_postcopy_handle_advise(mis, len); >>> >>> case MIG_CMD_POSTCOPY_LISTEN: >>> + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { >>> + aio_co_schedule(qemu_get_current_aio_context(), >>> + qemu_coroutine_self()); >>> + qemu_coroutine_yield(); >>> + } >> >> The above could be moved to loadvm_postcopy_handle_listen(). > > I'm not 100% sure such thing (no matter here or moved into it, which does > look cleaner) would work for us. > > The problem is I still don't yet see an ordering restricted on top of (1) > accept() happens, and (2) receive LISTEN cmd here. What happens if the > accept() request is not yet received when reaching LISTEN? Or is it always > guaranteed the accept(fd) will always be polled here? > > For example, the source QEMU (no matter pre-7.2 or later) will always setup > the preempt channel asynchrounously, then IIUC it can connect() after > sending the whole chunk of packed data which should include this LISTEN. I > think it means it's not guaranteed this will 100% work, but maybe further > reduce the possibility of the race. I think the following code: postcopy_start() -> postcopy_preempt_establish_channel() -> qemu_sem_wait(&s->postcopy_qemufile_src_sem); can guarantee that the connect() syscall is successful so the destination side receives the connect() request before it loads the LISTEN command, otherwise it won't post the sem: postcopy_preempt_send_channel_new() -> postcopy_preempt_send_channel_done() -> qemu_sem_post(&s->postcopy_qemufile_src_sem); > > One right fix that I can think of is moving the sem_wait(&done) into the > main thread too, so we wait for the sem _before_ reading the packed data, > so there's no chance of fault. However I don't think sem_wait() will be > smart enough to yield when in a coroutine.. In the long term run I think > we should really make migration loadvm to do work in the thread rather than > the main thread. I think it means we have one more example to be listed in > this todo so that's preferred.. > > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination > > I attached such draft patch below, but I'm not sure it'll work. Let me > know how both of you think about it. Sadly it doesn't work, there is an unknown segfault. > >> >> Another option is to follow the old way (i.e. pre_7_2) to do postcopy_preempt_setup >> in migrate_fd_connect. This can save the above overhead of switching to the >> main thread during the downtime. Seems Peter's previous patch already solved the >> channel disordering issue. Let's see Peter and others' opinions. > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to > make sure the ordering is done properly. Wei, could you elaborate the > patch you mentioned? Maybe I missed some spots. > > You raised a good point that this may introduce higher downtime. Did you > or Lei tried to measure how large it is? 
If that is too high, we may need > to think another solution, e.g., wait the channel connection before vm stop > happens. Per my very simple test, using post-copy preemption to live migrate an 8G VM: w/o this patch: 121ms in avg in 5 tries w/ this patch: 115ms in avg in 5 tries So it seems the overhead introduced is not too high (maybe ignorable?). > > Thanks, > >> >>> return loadvm_postcopy_handle_listen(mis); >>> >> >>> case MIG_CMD_POSTCOPY_RUN: >>> -- >>> 2.39.3 >> > > ===8<=== > diff --git a/migration/migration.c b/migration/migration.c > index 696762bc64..bacd1328cf 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error **errp) > /* > * Make sure the receiver can get incoming pages before we send the rest > * of the state > + * > + * When preempt mode enabled, this must be done after we initiate the > + * preempt channel, as destination QEMU will wait for the channel when > + * processing the LISTEN request. Currently it may not matter a huge > + * deal if we always create the channel asynchrously with a qio task, > + * but we need to keep this in mind. > */ > qemu_savevm_send_postcopy_listen(fb); > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c > index eccff499cb..4f26a89ac9 100644 > --- a/migration/postcopy-ram.c > +++ b/migration/postcopy-ram.c > @@ -1254,6 +1254,26 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis) > } > > if (migrate_postcopy_preempt()) { > + /* > + * The preempt channel is established in asynchronous way. Wait > + * for its completion. > + */ > + while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { > + /* > + * Note that to make sure the main thread can still schedule an > + * accept() request we need to proactively yield for the main > + * loop to run for some duration (100ms in this case), which is > + * pretty ugly. > + * > + * TODO: we should do this in a separate thread to load the VM > + * rather than in the main thread, just like the source side. > + */ > + if (qemu_in_coroutine()) { > + aio_co_schedule(qemu_get_current_aio_context(), > + qemu_coroutine_self()); > + qemu_coroutine_yield(); > + } > + } > /* > * This thread needs to be created after the temp pages because > * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately. > @@ -1743,12 +1763,6 @@ void *postcopy_preempt_thread(void *opaque) > > qemu_sem_post(&mis->thread_sync_sem); > > - /* > - * The preempt channel is established in asynchronous way. Wait > - * for its completion. > - */ > - qemu_sem_wait(&mis->postcopy_qemufile_dst_done); > - > /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */ > qemu_mutex_lock(&mis->postcopy_prio_thread_mutex); > while (preempt_thread_should_run(mis)) {
On Tuesday, April 2, 2024 12:13 AM, Peter Xu wrote: > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote: > > On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: > > > When using the post-copy preemption feature to perform post-copy > > > live migration, the below scenario could lead to a deadlock and the > > > migration will never finish: > > > > > > - Source connect() the preemption channel in postcopy_start(). > > > - Source and the destination side TCP stack finished the 3-way handshake > > > thus the connection is successful. > > > - The destination side main thread is handling the loading of the bulk RAM > > > pages thus it doesn't start to handle the pending connection event in the > > > event loop. and doesn't post the semaphore > postcopy_qemufile_dst_done for > > > the preemption thread. > > > - The source side sends non-iterative device states, such as the virtio > > > states. > > > - The destination main thread starts to receive the virtio states, this > > > process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() > > > may trigger a page fault since the avail ring page may not be received > > > yet). > > Ouch. Yeah I think this part got overlooked when working on the preempt > channel. > > > > - The page request is sent back to the source side. Source sends the page > > > content to the destination side preemption thread. > > > - Since the event is not arrived and the semaphore > > > postcopy_qemufile_dst_done is not posted, the preemption thread in > > > destination side is blocked, and cannot handle receiving the page. > > > - The QEMU main load thread on the destination side is stuck at the page > > > fault, and cannot yield and handle the connect() event for the > > > preemption channel to unblock the preemption thread. > > > - The postcopy will stuck there forever since this is a deadlock. > > > > > > The key point to reproduce this bug is that the source side is > > > sending pages at a rate faster than the destination handling, > > > otherwise, the qemu_get_be64() in > > > ram_load_precopy() will have a chance to yield since at that time > > > there are no pending data in the buffer to get. This will make this > > > bug harder to be reproduced. > > How hard would this reproduce? It seems 100% reproducible on my machine (with migration src and dst on the same physical machine). > > I'm thinking whether this should be 9.0 material or 9.1. It's pretty late for 9.0 > though, but we can still discuss. > > > > > > > Fix this by yielding the load coroutine when receiving > > > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the > > > connection event before loading the non-iterative devices state to > > > avoid the deadlock condition. > > > > > > Signed-off-by: Lei Wang <lei4.wang@intel.com> > > > > This seems to be a regression issue caused by this commit: > > 737840e2c6ea (migration: Use the number of transferred bytes directly) > > > > Adding qemu_fflush back to migration_rate_exceeded() or > > ram_save_iterate seems to work (might not be a good fix though). 
> > > > > --- > > > migration/savevm.c | 5 +++++ > > > 1 file changed, 5 insertions(+) > > > > > > diff --git a/migration/savevm.c b/migration/savevm.c index > > > e386c5267f..8fd4dc92f2 100644 > > > --- a/migration/savevm.c > > > +++ b/migration/savevm.c > > > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile > *f) > > > return loadvm_postcopy_handle_advise(mis, len); > > > > > > case MIG_CMD_POSTCOPY_LISTEN: > > > + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { > > > + aio_co_schedule(qemu_get_current_aio_context(), > > > + qemu_coroutine_self()); > > > + qemu_coroutine_yield(); > > > + } > > > > The above could be moved to loadvm_postcopy_handle_listen(). > > I'm not 100% sure such thing (no matter here or moved into it, which does look > cleaner) would work for us. > > The problem is I still don't yet see an ordering restricted on top of (1) > accept() happens, and (2) receive LISTEN cmd here. What happens if the > accept() request is not yet received when reaching LISTEN? Or is it always > guaranteed the accept(fd) will always be polled here? > > For example, the source QEMU (no matter pre-7.2 or later) will always setup > the preempt channel asynchrounously, then IIUC it can connect() after sending > the whole chunk of packed data which should include this LISTEN. I think it > means it's not guaranteed this will 100% work, but maybe further reduce the > possibility of the race. When handling LISTEN, it seems guaranteed that connect() on the source has returned (synchronized by postcopy_qemufile_src_sem) , but destination side accept() might not be invoked yet (src connect() and dest accept() are async). For this approach, we would need to improve the fix with a loop to check if accept() has been called to establish the new channel (yield if not): if (migrate_postcopy_preempt() && qemu_in_coroutine()) { while (!mis->postcopy_qemufile_dst) { aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self()); qemu_coroutine_yield(); } } > > One right fix that I can think of is moving the sem_wait(&done) into the main > thread too, so we wait for the sem _before_ reading the packed data, so > there's no chance of fault. However I don't think sem_wait() will be smart > enough to yield when in a coroutine.. In the long term run I think we should > really make migration loadvm to do work in the thread rather than the main > thread. I think it means we have one more example to be listed in this todo so > that's preferred.. > > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_ > destination > > I attached such draft patch below, but I'm not sure it'll work. Let me know how > both of you think about it. > > > > > Another option is to follow the old way (i.e. pre_7_2) to do > > postcopy_preempt_setup in migrate_fd_connect. This can save the above > > overhead of switching to the main thread during the downtime. Seems > > Peter's previous patch already solved the channel disordering issue. Let's see > Peter and others' opinions. > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to > make sure the ordering is done properly. Wei, could you elaborate the patch > you mentioned? Maybe I missed some spots. Sure. I saw your patch (5655aab079) ensures that the preempt channel to be created when PONG is received from the destination, which indicates that the main channel has been ready on the destination side. Why was it decided to postpone the preempt channel's connect() for newer QEMU only? 
The old QEMU (before 7.2) still performs preempt channel connect() in migrate_fd_connect(), doesn't it have the disordering issue? > > You raised a good point that this may introduce higher downtime. Did you or > Lei tried to measure how large it is? If that is too high, we may need to think > another solution, e.g., wait the channel connection before vm stop happens. > > Thanks, > > > > > > return loadvm_postcopy_handle_listen(mis); > > > > > > > > case MIG_CMD_POSTCOPY_RUN: > > > -- > > > 2.39.3 > > > > ===8<=== > diff --git a/migration/migration.c b/migration/migration.c index > 696762bc64..bacd1328cf 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2593,6 +2593,12 @@ static int postcopy_start(MigrationState *ms, Error > **errp) > /* > * Make sure the receiver can get incoming pages before we send the rest > * of the state > + * > + * When preempt mode enabled, this must be done after we initiate the > + * preempt channel, as destination QEMU will wait for the channel when > + * processing the LISTEN request. Currently it may not matter a huge > + * deal if we always create the channel asynchrously with a qio task, > + * but we need to keep this in mind. > */ > qemu_savevm_send_postcopy_listen(fb); > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index > eccff499cb..4f26a89ac9 100644 > --- a/migration/postcopy-ram.c > +++ b/migration/postcopy-ram.c > @@ -1254,6 +1254,26 @@ int > postcopy_ram_incoming_setup(MigrationIncomingState *mis) > } > > if (migrate_postcopy_preempt()) { > + /* > + * The preempt channel is established in asynchronous way. Wait > + * for its completion. > + */ > + while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) Why blocking the main thread on the semaphore for some amount of time? I think we need to yield to the main thread as soon as possible (the accept() would be right there waiting for a dispatch). Maybe replace the sem with mis->postcopy_qemufile_dst? > { > + /* > + * Note that to make sure the main thread can still schedule an > + * accept() request we need to proactively yield for the main > + * loop to run for some duration (100ms in this case), which is > + * pretty ugly. > + * > + * TODO: we should do this in a separate thread to load the VM > + * rather than in the main thread, just like the source side. > + */ > + if (qemu_in_coroutine()) { > + aio_co_schedule(qemu_get_current_aio_context(), > + qemu_coroutine_self()); > + qemu_coroutine_yield(); > + } > + } > /* > * This thread needs to be created after the temp pages because > * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately. > @@ -1743,12 +1763,6 @@ void *postcopy_preempt_thread(void *opaque) > > qemu_sem_post(&mis->thread_sync_sem); > > - /* > - * The preempt channel is established in asynchronous way. Wait > - * for its completion. > - */ > - qemu_sem_wait(&mis->postcopy_qemufile_dst_done); > - > /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */ > qemu_mutex_lock(&mis->postcopy_prio_thread_mutex); > while (preempt_thread_should_run(mis)) { > -- > 2.44.0 > > > -- > Peter Xu
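Putting Wei's two comments together (loop on the channel pointer rather than timed-waiting on the semaphore, and yield right away), the wait in postcopy_ram_incoming_setup() from Peter's draft might be reshaped roughly as below. This is an untested sketch of the idea under discussion, not a proposed patch; it assumes, as the call chain Peter quoted earlier suggests, that postcopy_preempt_new_channel() records mis->postcopy_qemufile_dst and then posts postcopy_qemufile_dst_done.

    if (migrate_postcopy_preempt()) {
        /*
         * Sketch of Wei's suggestion on top of Peter's draft: while in the
         * load coroutine, yield to the main loop until the preempt channel
         * has actually been accepted and recorded, instead of blocking the
         * main thread in qemu_sem_timedwait() for up to 100ms per round.
         * Each yield lets the main loop run one iteration, which is what
         * dispatches socket_accept_incoming_migration().
         */
        while (qemu_in_coroutine() && !mis->postcopy_qemufile_dst) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }
        /*
         * Assumption: the channel pointer is set before the semaphore is
         * posted, so this wait is now either instant or very short.
         */
        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
        ...
    }

Whether dropping the timed wait this way also avoids the segfault Lei hit with the draft is exactly the open question in this sub-thread.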
On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote: > On 4/2/2024 0:13, Peter Xu wrote:> On Fri, Mar 29, 2024 at 08:54:07AM +0000, > Wang, Wei W wrote: > >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: > >>> When using the post-copy preemption feature to perform post-copy > >>> live migration, the below scenario could lead to a deadlock and the > >>> migration will never finish: > >>> > >>> - Source connect() the preemption channel in postcopy_start(). > >>> - Source and the destination side TCP stack finished the 3-way handshake > >>> thus the connection is successful. > >>> - The destination side main thread is handling the loading of the bulk > RAM > >>> pages thus it doesn't start to handle the pending connection event in the > >>> event loop. and doesn't post the semaphore > postcopy_qemufile_dst_done for > >>> the preemption thread. > >>> - The source side sends non-iterative device states, such as the virtio > >>> states. > >>> - The destination main thread starts to receive the virtio states, this > >>> process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() > >>> may trigger a page fault since the avail ring page may not be received > >>> yet). > > > > Ouch. Yeah I think this part got overlooked when working on the > > preempt channel. > > > >>> - The page request is sent back to the source side. Source sends the page > >>> content to the destination side preemption thread. > >>> - Since the event is not arrived and the semaphore > >>> postcopy_qemufile_dst_done is not posted, the preemption thread in > >>> destination side is blocked, and cannot handle receiving the page. > >>> - The QEMU main load thread on the destination side is stuck at the page > >>> fault, and cannot yield and handle the connect() event for the > >>> preemption channel to unblock the preemption thread. > >>> - The postcopy will stuck there forever since this is a deadlock. > >>> > >>> The key point to reproduce this bug is that the source side is > >>> sending pages at a rate faster than the destination handling, > >>> otherwise, the qemu_get_be64() in > >>> ram_load_precopy() will have a chance to yield since at that time > >>> there are no pending data in the buffer to get. This will make this > >>> bug harder to be reproduced. > > > > How hard would this reproduce? > > We can manually make this easier to reproduce by adding the following code > to make the destination busier to load the pages: > > diff --git a/migration/ram.c b/migration/ram.c index 0ad9fbba48..0b42877e1f > 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f) { > MigrationIncomingState *mis = migration_incoming_get_current(); > int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0; > + volatile unsigned long long a; > > if (!migrate_compress()) { > invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE; @@ -4347,6 > +4348,7 @@ static int ram_load_precopy(QEMUFile *f) > break; > > case RAM_SAVE_FLAG_PAGE: > + for (a = 0; a < 100000000; a++); > qemu_get_buffer(f, host, TARGET_PAGE_SIZE); > break; > Which version of QEMU are you using? I tried with the latest upstream QEMU (e.g. v8.2.0 release, 1600b9f46b1bd), it's always reproducible without any changes (with local migration tests). > > > > I'm thinking whether this should be 9.0 material or 9.1. It's pretty > > late for 9.0 though, but we can still discuss. 
> > > >>> > >>> Fix this by yielding the load coroutine when receiving > >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the > >>> connection event before loading the non-iterative devices state to > >>> avoid the deadlock condition. > >>> > >>> Signed-off-by: Lei Wang <lei4.wang@intel.com> > >> > >> This seems to be a regression issue caused by this commit: > >> 737840e2c6ea (migration: Use the number of transferred bytes > >> directly) > >> > >> Adding qemu_fflush back to migration_rate_exceeded() or > >> ram_save_iterate seems to work (might not be a good fix though). > >> > >>> --- > >>> migration/savevm.c | 5 +++++ > >>> 1 file changed, 5 insertions(+) > >>> > >>> diff --git a/migration/savevm.c b/migration/savevm.c index > >>> e386c5267f..8fd4dc92f2 100644 > >>> --- a/migration/savevm.c > >>> +++ b/migration/savevm.c > >>> @@ -2445,6 +2445,11 @@ static int > loadvm_process_command(QEMUFile *f) > >>> return loadvm_postcopy_handle_advise(mis, len); > >>> > >>> case MIG_CMD_POSTCOPY_LISTEN: > >>> + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { > >>> + aio_co_schedule(qemu_get_current_aio_context(), > >>> + qemu_coroutine_self()); > >>> + qemu_coroutine_yield(); > >>> + } > >> > >> The above could be moved to loadvm_postcopy_handle_listen(). > > > > I'm not 100% sure such thing (no matter here or moved into it, which > > does look cleaner) would work for us. > > > > The problem is I still don't yet see an ordering restricted on top of > > (1) > > accept() happens, and (2) receive LISTEN cmd here. What happens if > > the > > accept() request is not yet received when reaching LISTEN? Or is it > > always guaranteed the accept(fd) will always be polled here? > > > > For example, the source QEMU (no matter pre-7.2 or later) will always > > setup the preempt channel asynchrounously, then IIUC it can connect() > > after sending the whole chunk of packed data which should include this > > LISTEN. I think it means it's not guaranteed this will 100% work, but > > maybe further reduce the possibility of the race. > > I think the following code: > > postcopy_start() -> > postcopy_preempt_establish_channel() -> > qemu_sem_wait(&s->postcopy_qemufile_src_sem); > > can guarantee that the connect() syscall is successful so the destination side > receives the connect() request before it loads the LISTEN command, otherwise > it won't post the sem: > > postcopy_preempt_send_channel_new() -> > postcopy_preempt_send_channel_done() -> > qemu_sem_post(&s->postcopy_qemufile_src_sem); > Yes. But as mentioned in another thread, connect() and accept() are async. So in theory accept() could still come later after the LISTEN command. > > > > One right fix that I can think of is moving the sem_wait(&done) into > > the main thread too, so we wait for the sem _before_ reading the > > packed data, so there's no chance of fault. However I don't think > > sem_wait() will be smart enough to yield when in a coroutine.. In the > > long term run I think we should really make migration loadvm to do > > work in the thread rather than the main thread. I think it means we > > have one more example to be listed in this todo so that's preferred.. > > > > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration > > _destination > > > > I attached such draft patch below, but I'm not sure it'll work. Let > > me know how both of you think about it. > > Sadly it doesn't work, there is an unknown segfault. > > > > >> > >> Another option is to follow the old way (i.e. 
pre_7_2) to do > >> postcopy_preempt_setup in migrate_fd_connect. This can save the above > >> overhead of switching to the main thread during the downtime. Seems > >> Peter's previous patch already solved the channel disordering issue. Let's > see Peter and others' opinions. > > > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() > > to make sure the ordering is done properly. Wei, could you elaborate > > the patch you mentioned? Maybe I missed some spots. > > > > You raised a good point that this may introduce higher downtime. Did > > you or Lei tried to measure how large it is? If that is too high, we > > may need to think another solution, e.g., wait the channel connection > > before vm stop happens. > > Per my very simple test, using post-copy preemption to live migrate an 8G VM: > > w/o this patch: 121ms in avg in 5 tries > w/ this patch: 115ms in avg in 5 tries > > So it seems the overhead introduced is not too high (maybe ignorable?). You could just measure the time for the added qemu_coroutine_yield() part. The time will depend on how many events happen to be there waiting for a dispatch.
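If it helps, that measurement can be as simple as bracketing the added yield with a monotonic clock read. A throwaway snippet along these lines (g_get_monotonic_time() is glib's microsecond clock, which QEMU already links against; the fprintf() is only for ad-hoc debugging and is not meant as a real trace point):

    /* Ad-hoc timing of the added yield (illustration only). */
    gint64 start_us = g_get_monotonic_time();

    aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
    qemu_coroutine_yield();

    fprintf(stderr, "LISTEN yield took %lld us\n",
            (long long)(g_get_monotonic_time() - start_us));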
On 4/2/2024 15:25, Wang, Wei W wrote:> On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote: >> On 4/2/2024 0:13, Peter Xu wrote:> On Fri, Mar 29, 2024 at 08:54:07AM +0000, >> Wang, Wei W wrote: >>>> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: >>>>> When using the post-copy preemption feature to perform post-copy >>>>> live migration, the below scenario could lead to a deadlock and the >>>>> migration will never finish: >>>>> >>>>> - Source connect() the preemption channel in postcopy_start(). >>>>> - Source and the destination side TCP stack finished the 3-way handshake >>>>> thus the connection is successful. >>>>> - The destination side main thread is handling the loading of the bulk >> RAM >>>>> pages thus it doesn't start to handle the pending connection event in the >>>>> event loop. and doesn't post the semaphore >> postcopy_qemufile_dst_done for >>>>> the preemption thread. >>>>> - The source side sends non-iterative device states, such as the virtio >>>>> states. >>>>> - The destination main thread starts to receive the virtio states, this >>>>> process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() >>>>> may trigger a page fault since the avail ring page may not be received >>>>> yet). >>> >>> Ouch. Yeah I think this part got overlooked when working on the >>> preempt channel. >>> >>>>> - The page request is sent back to the source side. Source sends the page >>>>> content to the destination side preemption thread. >>>>> - Since the event is not arrived and the semaphore >>>>> postcopy_qemufile_dst_done is not posted, the preemption thread in >>>>> destination side is blocked, and cannot handle receiving the page. >>>>> - The QEMU main load thread on the destination side is stuck at the page >>>>> fault, and cannot yield and handle the connect() event for the >>>>> preemption channel to unblock the preemption thread. >>>>> - The postcopy will stuck there forever since this is a deadlock. >>>>> >>>>> The key point to reproduce this bug is that the source side is >>>>> sending pages at a rate faster than the destination handling, >>>>> otherwise, the qemu_get_be64() in >>>>> ram_load_precopy() will have a chance to yield since at that time >>>>> there are no pending data in the buffer to get. This will make this >>>>> bug harder to be reproduced. >>> >>> How hard would this reproduce? >> >> We can manually make this easier to reproduce by adding the following code >> to make the destination busier to load the pages: >> >> diff --git a/migration/ram.c b/migration/ram.c index 0ad9fbba48..0b42877e1f >> 100644 >> --- a/migration/ram.c >> +++ b/migration/ram.c >> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f) { >> MigrationIncomingState *mis = migration_incoming_get_current(); >> int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0; >> + volatile unsigned long long a; >> >> if (!migrate_compress()) { >> invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE; @@ -4347,6 >> +4348,7 @@ static int ram_load_precopy(QEMUFile *f) >> break; >> >> case RAM_SAVE_FLAG_PAGE: >> + for (a = 0; a < 100000000; a++); >> qemu_get_buffer(f, host, TARGET_PAGE_SIZE); >> break; >> > > Which version of QEMU are you using? > I tried with the latest upstream QEMU (e.g. v8.2.0 release, 1600b9f46b1bd), it's > always reproducible without any changes (with local migration tests). I'm using the latest tip: 6af9d12c88b9720f209912f6e4b01fefe5906d59 and it cannot be reproduced without the modification. > > >>> >>> I'm thinking whether this should be 9.0 material or 9.1. 
It's pretty >>> late for 9.0 though, but we can still discuss. >>> >>>>> >>>>> Fix this by yielding the load coroutine when receiving >>>>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the >>>>> connection event before loading the non-iterative devices state to >>>>> avoid the deadlock condition. >>>>> >>>>> Signed-off-by: Lei Wang <lei4.wang@intel.com> >>>> >>>> This seems to be a regression issue caused by this commit: >>>> 737840e2c6ea (migration: Use the number of transferred bytes >>>> directly) >>>> >>>> Adding qemu_fflush back to migration_rate_exceeded() or >>>> ram_save_iterate seems to work (might not be a good fix though). >>>> >>>>> --- >>>>> migration/savevm.c | 5 +++++ >>>>> 1 file changed, 5 insertions(+) >>>>> >>>>> diff --git a/migration/savevm.c b/migration/savevm.c index >>>>> e386c5267f..8fd4dc92f2 100644 >>>>> --- a/migration/savevm.c >>>>> +++ b/migration/savevm.c >>>>> @@ -2445,6 +2445,11 @@ static int >> loadvm_process_command(QEMUFile *f) >>>>> return loadvm_postcopy_handle_advise(mis, len); >>>>> >>>>> case MIG_CMD_POSTCOPY_LISTEN: >>>>> + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { >>>>> + aio_co_schedule(qemu_get_current_aio_context(), >>>>> + qemu_coroutine_self()); >>>>> + qemu_coroutine_yield(); >>>>> + } >>>> >>>> The above could be moved to loadvm_postcopy_handle_listen(). >>> >>> I'm not 100% sure such thing (no matter here or moved into it, which >>> does look cleaner) would work for us. >>> >>> The problem is I still don't yet see an ordering restricted on top of >>> (1) >>> accept() happens, and (2) receive LISTEN cmd here. What happens if >>> the >>> accept() request is not yet received when reaching LISTEN? Or is it >>> always guaranteed the accept(fd) will always be polled here? >>> >>> For example, the source QEMU (no matter pre-7.2 or later) will always >>> setup the preempt channel asynchrounously, then IIUC it can connect() >>> after sending the whole chunk of packed data which should include this >>> LISTEN. I think it means it's not guaranteed this will 100% work, but >>> maybe further reduce the possibility of the race. >> >> I think the following code: >> >> postcopy_start() -> >> postcopy_preempt_establish_channel() -> >> qemu_sem_wait(&s->postcopy_qemufile_src_sem); >> >> can guarantee that the connect() syscall is successful so the destination side >> receives the connect() request before it loads the LISTEN command, otherwise >> it won't post the sem: >> >> postcopy_preempt_send_channel_new() -> >> postcopy_preempt_send_channel_done() -> >> qemu_sem_post(&s->postcopy_qemufile_src_sem); >> > > Yes. But as mentioned in another thread, connect() and accept() are async. > So in theory accept() could still come later after the LISTEN command. IIUC accept() is the callback and will be triggered by the connect() event. The reason accept() is not called in the destination is the main loop doesn't get a chance to handle other events (connect()), so if we can guarantee connect() is before LISTEN, then when handling the LISTEN cmd, we yield to the main loop and the connect() event is there, then we can handle it by calling the accept(): qio_net_listener_channel_func qio_channel_socket_accept qemu_accept accept so it seems the case accept() comes alter after LISTEN is in our expectation? > >>> >>> One right fix that I can think of is moving the sem_wait(&done) into >>> the main thread too, so we wait for the sem _before_ reading the >>> packed data, so there's no chance of fault. 
However I don't think >>> sem_wait() will be smart enough to yield when in a coroutine.. In the >>> long term run I think we should really make migration loadvm to do >>> work in the thread rather than the main thread. I think it means we >>> have one more example to be listed in this todo so that's preferred.. >>> >>> https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration >>> _destination >>> >>> I attached such draft patch below, but I'm not sure it'll work. Let >>> me know how both of you think about it. >> >> Sadly it doesn't work, there is an unknown segfault. >> >>> >>>> >>>> Another option is to follow the old way (i.e. pre_7_2) to do >>>> postcopy_preempt_setup in migrate_fd_connect. This can save the above >>>> overhead of switching to the main thread during the downtime. Seems >>>> Peter's previous patch already solved the channel disordering issue. Let's >> see Peter and others' opinions. >>> >>> IIUC we still need that pre_7_2 stuff and keep the postponed connect() >>> to make sure the ordering is done properly. Wei, could you elaborate >>> the patch you mentioned? Maybe I missed some spots. >>> >>> You raised a good point that this may introduce higher downtime. Did >>> you or Lei tried to measure how large it is? If that is too high, we >>> may need to think another solution, e.g., wait the channel connection >>> before vm stop happens. >> >> Per my very simple test, using post-copy preemption to live migrate an 8G VM: >> >> w/o this patch: 121ms in avg in 5 tries >> w/ this patch: 115ms in avg in 5 tries >> >> So it seems the overhead introduced is not too high (maybe ignorable?). > > You could just measure the time for the added qemu_coroutine_yield() part. > The time will depend on how many events happen to be there waiting for a dispatch. Still less than 1ms (0.05ms). Indeed, that may depends on how many events needs to be handled.
On Tue, Apr 02, 2024 at 05:28:36PM +0800, Wang, Lei wrote: > On 4/2/2024 15:25, Wang, Wei W wrote:> On Tuesday, April 2, 2024 2:56 PM, Wang, > Lei4 wrote: > >> On 4/2/2024 0:13, Peter Xu wrote:> On Fri, Mar 29, 2024 at 08:54:07AM +0000, > >> Wang, Wei W wrote: > >>>> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote: > >>>>> When using the post-copy preemption feature to perform post-copy > >>>>> live migration, the below scenario could lead to a deadlock and the > >>>>> migration will never finish: > >>>>> > >>>>> - Source connect() the preemption channel in postcopy_start(). > >>>>> - Source and the destination side TCP stack finished the 3-way handshake > >>>>> thus the connection is successful. > >>>>> - The destination side main thread is handling the loading of the bulk > >> RAM > >>>>> pages thus it doesn't start to handle the pending connection event in the > >>>>> event loop. and doesn't post the semaphore > >> postcopy_qemufile_dst_done for > >>>>> the preemption thread. > >>>>> - The source side sends non-iterative device states, such as the virtio > >>>>> states. > >>>>> - The destination main thread starts to receive the virtio states, this > >>>>> process may lead to a page fault (e.g., virtio_load()->vring_avail_idx() > >>>>> may trigger a page fault since the avail ring page may not be received > >>>>> yet). > >>> > >>> Ouch. Yeah I think this part got overlooked when working on the > >>> preempt channel. > >>> > >>>>> - The page request is sent back to the source side. Source sends the page > >>>>> content to the destination side preemption thread. > >>>>> - Since the event is not arrived and the semaphore > >>>>> postcopy_qemufile_dst_done is not posted, the preemption thread in > >>>>> destination side is blocked, and cannot handle receiving the page. > >>>>> - The QEMU main load thread on the destination side is stuck at the page > >>>>> fault, and cannot yield and handle the connect() event for the > >>>>> preemption channel to unblock the preemption thread. > >>>>> - The postcopy will stuck there forever since this is a deadlock. > >>>>> > >>>>> The key point to reproduce this bug is that the source side is > >>>>> sending pages at a rate faster than the destination handling, > >>>>> otherwise, the qemu_get_be64() in > >>>>> ram_load_precopy() will have a chance to yield since at that time > >>>>> there are no pending data in the buffer to get. This will make this > >>>>> bug harder to be reproduced. > >>> > >>> How hard would this reproduce? > >> > >> We can manually make this easier to reproduce by adding the following code > >> to make the destination busier to load the pages: > >> > >> diff --git a/migration/ram.c b/migration/ram.c index 0ad9fbba48..0b42877e1f > >> 100644 > >> --- a/migration/ram.c > >> +++ b/migration/ram.c > >> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f) { > >> MigrationIncomingState *mis = migration_incoming_get_current(); > >> int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0; > >> + volatile unsigned long long a; > >> > >> if (!migrate_compress()) { > >> invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE; @@ -4347,6 > >> +4348,7 @@ static int ram_load_precopy(QEMUFile *f) > >> break; > >> > >> case RAM_SAVE_FLAG_PAGE: > >> + for (a = 0; a < 100000000; a++); > >> qemu_get_buffer(f, host, TARGET_PAGE_SIZE); > >> break; > >> > > > > Which version of QEMU are you using? > > I tried with the latest upstream QEMU (e.g. 
v8.2.0 release, 1600b9f46b1bd), it's > > always reproducible without any changes (with local migration tests). > > I'm using the latest tip: > > 6af9d12c88b9720f209912f6e4b01fefe5906d59 > > and it cannot be reproduced without the modification. It does look like there's some mistery in reproducability. I can't reproduce it locally with v8.2.0 tag, neither can I reproduce it with the code above. I've requested our QE team to involve, but before that I may need help on any verifications. > > > > > > >>> > >>> I'm thinking whether this should be 9.0 material or 9.1. It's pretty > >>> late for 9.0 though, but we can still discuss. > >>> > >>>>> > >>>>> Fix this by yielding the load coroutine when receiving > >>>>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the > >>>>> connection event before loading the non-iterative devices state to > >>>>> avoid the deadlock condition. > >>>>> > >>>>> Signed-off-by: Lei Wang <lei4.wang@intel.com> > >>>> > >>>> This seems to be a regression issue caused by this commit: > >>>> 737840e2c6ea (migration: Use the number of transferred bytes > >>>> directly) > >>>> > >>>> Adding qemu_fflush back to migration_rate_exceeded() or > >>>> ram_save_iterate seems to work (might not be a good fix though). > >>>> > >>>>> --- > >>>>> migration/savevm.c | 5 +++++ > >>>>> 1 file changed, 5 insertions(+) > >>>>> > >>>>> diff --git a/migration/savevm.c b/migration/savevm.c index > >>>>> e386c5267f..8fd4dc92f2 100644 > >>>>> --- a/migration/savevm.c > >>>>> +++ b/migration/savevm.c > >>>>> @@ -2445,6 +2445,11 @@ static int > >> loadvm_process_command(QEMUFile *f) > >>>>> return loadvm_postcopy_handle_advise(mis, len); > >>>>> > >>>>> case MIG_CMD_POSTCOPY_LISTEN: > >>>>> + if (migrate_postcopy_preempt() && qemu_in_coroutine()) { > >>>>> + aio_co_schedule(qemu_get_current_aio_context(), > >>>>> + qemu_coroutine_self()); > >>>>> + qemu_coroutine_yield(); > >>>>> + } > >>>> > >>>> The above could be moved to loadvm_postcopy_handle_listen(). > >>> > >>> I'm not 100% sure such thing (no matter here or moved into it, which > >>> does look cleaner) would work for us. > >>> > >>> The problem is I still don't yet see an ordering restricted on top of > >>> (1) > >>> accept() happens, and (2) receive LISTEN cmd here. What happens if > >>> the > >>> accept() request is not yet received when reaching LISTEN? Or is it > >>> always guaranteed the accept(fd) will always be polled here? > >>> > >>> For example, the source QEMU (no matter pre-7.2 or later) will always > >>> setup the preempt channel asynchrounously, then IIUC it can connect() > >>> after sending the whole chunk of packed data which should include this > >>> LISTEN. I think it means it's not guaranteed this will 100% work, but > >>> maybe further reduce the possibility of the race. > >> > >> I think the following code: > >> > >> postcopy_start() -> > >> postcopy_preempt_establish_channel() -> > >> qemu_sem_wait(&s->postcopy_qemufile_src_sem); > >> > >> can guarantee that the connect() syscall is successful so the destination side > >> receives the connect() request before it loads the LISTEN command, otherwise > >> it won't post the sem: > >> > >> postcopy_preempt_send_channel_new() -> > >> postcopy_preempt_send_channel_done() -> > >> qemu_sem_post(&s->postcopy_qemufile_src_sem); > >> > > > > Yes. But as mentioned in another thread, connect() and accept() are async. > > So in theory accept() could still come later after the LISTEN command. > > IIUC accept() is the callback and will be triggered by the connect() event. 
> > The reason accept() is not called in the destination is the main loop doesn't > get a chance to handle other events (connect()), so if we can guarantee > connect() is before LISTEN, then when handling the LISTEN cmd, we yield to the > main loop and the connect() event is there, then we can handle it by calling the > accept(): > > qio_net_listener_channel_func > qio_channel_socket_accept > qemu_accept > accept > > so it seems the case accept() comes alter after LISTEN is in our expectation? The major thing uncertain to me is "accept() will return with a valid fd" on dest host is not guaranteed to order against anything. For example, I won't be surprised if a kernel implementation provides an async model of "accept()" syscall, so that even if the other side returned with "connect()", the "accept()" can still fetch nothing if the async model will need a delay for the new channel to be finally delivered to the "accept()" thread context. It just sounds like tricky to rely on such thing. What I proposed below shouldn't rely on any kernel details on how accept() could be implemented, it simply waits for the fd to be created before doing anything else (including creating the preempt thread and process packed data). > > > > >>> > >>> One right fix that I can think of is moving the sem_wait(&done) into > >>> the main thread too, so we wait for the sem _before_ reading the > >>> packed data, so there's no chance of fault. However I don't think > >>> sem_wait() will be smart enough to yield when in a coroutine.. In the > >>> long term run I think we should really make migration loadvm to do > >>> work in the thread rather than the main thread. I think it means we > >>> have one more example to be listed in this todo so that's preferred.. > >>> > >>> https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration > >>> _destination > >>> > >>> I attached such draft patch below, but I'm not sure it'll work. Let > >>> me know how both of you think about it. > >> > >> Sadly it doesn't work, there is an unknown segfault. Could you paste the stack of the segfault ("(gdb) thread apply all bt")? Or help to figure out what is wrong? Since I cannot reproduce myself, I may need your help debugging this. If you agree with what I said above and agree on such fix, please also feel free to go ahead and fix the segfault. > >> > >>> > >>>> > >>>> Another option is to follow the old way (i.e. pre_7_2) to do > >>>> postcopy_preempt_setup in migrate_fd_connect. This can save the above > >>>> overhead of switching to the main thread during the downtime. Seems > >>>> Peter's previous patch already solved the channel disordering issue. Let's > >> see Peter and others' opinions. > >>> > >>> IIUC we still need that pre_7_2 stuff and keep the postponed connect() > >>> to make sure the ordering is done properly. Wei, could you elaborate > >>> the patch you mentioned? Maybe I missed some spots. > >>> > >>> You raised a good point that this may introduce higher downtime. Did > >>> you or Lei tried to measure how large it is? If that is too high, we > >>> may need to think another solution, e.g., wait the channel connection > >>> before vm stop happens. > >> > >> Per my very simple test, using post-copy preemption to live migrate an 8G VM: > >> > >> w/o this patch: 121ms in avg in 5 tries > >> w/ this patch: 115ms in avg in 5 tries > >> > >> So it seems the overhead introduced is not too high (maybe ignorable?). > > > > You could just measure the time for the added qemu_coroutine_yield() part. 
> > The time will depend on how many events happen to be there waiting for a dispatch. > > Still less than 1ms (0.05ms). > > Indeed, that may depends on how many events needs to be handled. That sounds fine as of now. I think it means that if we can come up with a solid plan this week (without intrusive changes), we still have a chance to land it in 9.0. Thanks,
On Tue, Apr 02, 2024 at 07:20:01AM +0000, Wang, Wei W wrote: > > IIUC we still need that pre_7_2 stuff and keep the postponed connect() to > > make sure the ordering is done properly. Wei, could you elaborate the patch > > you mentioned? Maybe I missed some spots. > > Sure. I saw your patch (5655aab079) ensures that the preempt channel to be > created when PONG is received from the destination, which indicates that > the main channel has been ready on the destination side. > > Why was it decided to postpone the preempt channel's connect() for newer QEMU > only? The old QEMU (before 7.2) still performs preempt channel connect() in > migrate_fd_connect(), doesn't it have the disordering issue? Yes it did, but the solution was not compatible with older binaries, so migration was broken back then. See: https://lore.kernel.org/qemu-devel/20230326172540.2571110-1-peterx@redhat.com/ My memory was that we didn't have a good solution to fix it without breaking the old protocol, so it could only be fixed on newer binaries. Thanks,
On 4/3/2024 5:39, Peter Xu wrote:>>>>> >>>>> I'm not 100% sure such thing (no matter here or moved into it, which >>>>> does look cleaner) would work for us. >>>>> >>>>> The problem is I still don't yet see an ordering restricted on top of >>>>> (1) >>>>> accept() happens, and (2) receive LISTEN cmd here. What happens if >>>>> the >>>>> accept() request is not yet received when reaching LISTEN? Or is it >>>>> always guaranteed the accept(fd) will always be polled here? >>>>> >>>>> For example, the source QEMU (no matter pre-7.2 or later) will always >>>>> setup the preempt channel asynchrounously, then IIUC it can connect() >>>>> after sending the whole chunk of packed data which should include this >>>>> LISTEN. I think it means it's not guaranteed this will 100% work, but >>>>> maybe further reduce the possibility of the race. >>>> >>>> I think the following code: >>>> >>>> postcopy_start() -> >>>> postcopy_preempt_establish_channel() -> >>>> qemu_sem_wait(&s->postcopy_qemufile_src_sem); >>>> >>>> can guarantee that the connect() syscall is successful so the destination side >>>> receives the connect() request before it loads the LISTEN command, otherwise >>>> it won't post the sem: >>>> >>>> postcopy_preempt_send_channel_new() -> >>>> postcopy_preempt_send_channel_done() -> >>>> qemu_sem_post(&s->postcopy_qemufile_src_sem); >>>> >>> >>> Yes. But as mentioned in another thread, connect() and accept() are async. >>> So in theory accept() could still come later after the LISTEN command. >> >> IIUC accept() is the callback and will be triggered by the connect() event. >> >> The reason accept() is not called in the destination is the main loop doesn't >> get a chance to handle other events (connect()), so if we can guarantee >> connect() is before LISTEN, then when handling the LISTEN cmd, we yield to the >> main loop and the connect() event is there, then we can handle it by calling the >> accept(): >> >> qio_net_listener_channel_func >> qio_channel_socket_accept >> qemu_accept >> accept >> >> so it seems the case accept() comes alter after LISTEN is in our expectation? > > The major thing uncertain to me is "accept() will return with a valid fd" > on dest host is not guaranteed to order against anything. > > For example, I won't be surprised if a kernel implementation provides an > async model of "accept()" syscall, so that even if the other side returned > with "connect()", the "accept()" can still fetch nothing if the async model > will need a delay for the new channel to be finally delivered to the > "accept()" thread context. It just sounds like tricky to rely on such > thing. > > What I proposed below shouldn't rely on any kernel details on how accept() > could be implemented, it simply waits for the fd to be created before doing > anything else (including creating the preempt thread and process packed > data). Thanks for the detailed explanation! > >> >>> >>>>> >>>>> One right fix that I can think of is moving the sem_wait(&done) into >>>>> the main thread too, so we wait for the sem _before_ reading the >>>>> packed data, so there's no chance of fault. However I don't think >>>>> sem_wait() will be smart enough to yield when in a coroutine.. In the >>>>> long term run I think we should really make migration loadvm to do >>>>> work in the thread rather than the main thread. I think it means we >>>>> have one more example to be listed in this todo so that's preferred.. 
>>>>> >>>>> https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration >>>>> _destination >>>>> >>>>> I attached such draft patch below, but I'm not sure it'll work. Let >>>>> me know how both of you think about it. >>>> >>>> Sadly it doesn't work, there is an unknown segfault. > > Could you paste the stack of the segfault ("(gdb) thread apply all bt")? > Or help to figure out what is wrong? > > Since I cannot reproduce myself, I may need your help debugging this. If > you agree with what I said above and agree on such fix, please also feel > free to go ahead and fix the segfault. We should change the following line from while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { to while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { After that fix, the test passes with no segfault. Given that the test shows a yield to the main loop won't introduce much overhead (<1ms), how about first yielding unconditionally, and then entering the while loop to wait for several ms and yield periodically?
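With that sign flipped, the wait loop from the draft patch would read roughly as below. This is only an illustrative sketch assembled from the snippets in this thread; the surrounding function (postcopy_ram_incoming_setup()) and the 100ms timeout come from the draft patch above, and the follow-up discussion is about shrinking or avoiding that timeout:

    /* In postcopy_ram_incoming_setup(), before creating the preempt thread
     * (sketch only, not the patch as posted). */
    if (migrate_postcopy_preempt()) {
        /*
         * The preempt channel is established asynchronously; wait for it
         * here so that neither the preempt thread nor the packed device
         * state is processed before the channel exists.
         */
        while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
            /*
             * Timed out: the semaphore has not been posted yet, so yield
             * back to the main loop so it can dispatch the pending
             * accept() for the preempt channel.
             */
            if (qemu_in_coroutine()) {
                aio_co_schedule(qemu_get_current_aio_context(),
                                qemu_coroutine_self());
                qemu_coroutine_yield();
            }
        }
    }

qemu_sem_timedwait() returns 0 once the semaphore is posted and non-zero on timeout, which is why the original '!' inverted the condition and broke the loop.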
On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote: > We should change the following line from > > while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { > > to > > while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { Stupid me.. :( Thanks for figuring this out. > > After that fix, test passed and no segfault. > > Given that the test shows a yield to the main loop won't introduce much overhead > (<1ms), how about first yield unconditionally, then we enter the while loop to > wait for several ms and yield periodically? Shouldn't the expectation be that this should return immediately without a wait? We're already processing LISTEN command, and on the source as you said it was much after the connect(). It won't guarantee the ordering but IIUC the majority should still have a direct hit? What we can do though is reducing the 100ms timeout if you see that's perhaps a risk of having too large a downtime when by accident. We can even do it in a tight loop here considering downtime is important, but to provide an intermediate ground: how about 100ms -> 1ms poll? If you agree (and also to Wei; please review this and comment if there's any!), would you write up the commit log, fully test it in whatever way you could, and resend as a formal patch (please do this before Friday if possible)? You can keep a "Suggested-by:" for me. I want to queue it for rc3 if it can catch it. It seems important if Wei can always reproduce it. Thanks,
On Wednesday, April 3, 2024 10:42 PM, Peter Xu wrote: > On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote: > > We should change the following line from > > > > while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > 100)) { > > > > to > > > > while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > 100)) { > > Stupid me.. :( Thanks for figuring this out. > > > > > After that fix, test passed and no segfault. > > > > Given that the test shows a yield to the main loop won't introduce > > much overhead (<1ms), how about first yield unconditionally, then we > > enter the while loop to wait for several ms and yield periodically? > > Shouldn't the expectation be that this should return immediately without a > wait? We're already processing LISTEN command, and on the source as you > said it was much after the connect(). It won't guarantee the ordering but IIUC > the majority should still have a direct hit? > > What we can do though is reducing the 100ms timeout if you see that's > perhaps a risk of having too large a downtime when by accident. We can even > do it in a tight loop here considering downtime is important, but to provide an > intermediate ground: how about 100ms -> 1ms poll? Would it be better to use a busy wait here, instead of blocking for even 1ms? It's likely that the preempt channel is waiting for the main thread to dispatch the accept(), but we are calling qemu_sem_timedwait here to block the main thread for 1 more ms. > > If you agree (and also to Wei; please review this and comment if there's any!), > would you write up the commit log, fully test it in whatever way you could, > and resend as a formal patch (please do this before Friday if possible)? You > can keep a "Suggested-by:" for me. I want to queue it for > rc3 if it can catch it. It seems important if Wei can always reproduce it. Not sure if Lei will be able to be online, as the following two days are a Chinese holiday. If not, I could help take over and send it late tomorrow. Let's see.
On Wed, Apr 03, 2024 at 04:04:21PM +0000, Wang, Wei W wrote: > On Wednesday, April 3, 2024 10:42 PM, Peter Xu wrote: > > On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote: > > > We should change the following line from > > > > > > while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > > 100)) { > > > > > > to > > > > > > while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > > 100)) { > > > > Stupid me.. :( Thanks for figuring this out. > > > > > > > > After that fix, test passed and no segfault. > > > > > > Given that the test shows a yield to the main loop won't introduce > > > much overhead (<1ms), how about first yield unconditionally, then we > > > enter the while loop to wait for several ms and yield periodically? > > > > Shouldn't the expectation be that this should return immediately without a > > wait? We're already processing LISTEN command, and on the source as you > > said it was much after the connect(). It won't guarantee the ordering but IIUC > > the majority should still have a direct hit? > > > > What we can do though is reducing the 100ms timeout if you see that's > > perhaps a risk of having too large a downtime when by accident. We can even > > do it in a tight loop here considering downtime is important, but to provide an > > intermediate ground: how about 100ms -> 1ms poll? > > Would it be better to use busy wait here, instead of blocking for even 1ms here? > It's likely that the preempt channel is waiting for the main thread to dispatch for accept(), > but we are calling qemu_sem_timedwait here to block the main thread for 1 more ms. I think it's about the expectation of whether we should already received that sem post. My understanding is in most cases we should directly return and avoid such wait. Per my previous experience, 1ms is not a major issue to be added on top of downtime in corner cases like this. We do have a lot of othre potential optimizations to reduce downtime, or I should say in the other way, that.. there can be a lot of cases where we can hit much larger downtime than expected. Consider when we don't even account downtime for device states for now, either load_state or save_state, we only count RAM but that's far from accurate.. and we do have more chances to optimize. Some are listed here, but some may not: https://wiki.qemu.org/ToDo/LiveMigration#Optimizations If you agree with my above "expectation" statement, I'd say we should avoid using a busy loop whenever possible in QEMU unless extremely necessary. > > > > > > If you agree (and also to Wei; please review this and comment if there's any!), > > would you write up the commit log, fully test it in whatever way you could, > > and resend as a formal patch (please do this before Friday if possible)? You > > can keep a "Suggested-by:" for me. I want to queue it for > > rc3 if it can catch it. It seems important if Wei can always reproduce it. > > Not sure if Lei would be able to online as the following two days are Chinese holiday. > If not, I could help take over to send late tomorrow. Let's see. Oops, I forgot that even if I was aware.. Please do so if you can do this. Thank you, Wei! (I hope you can switch some working hours later on!) Let me know if that doesn't work; it'll be all fine. Thanks,
On Thursday, April 4, 2024 12:34 AM, Peter Xu wrote: > On Wed, Apr 03, 2024 at 04:04:21PM +0000, Wang, Wei W wrote: > > On Wednesday, April 3, 2024 10:42 PM, Peter Xu wrote: > > > On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote: > > > > We should change the following line from > > > > > > > > while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > > > 100)) { > > > > > > > > to > > > > > > > > while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, > > > 100)) { > > > > > > Stupid me.. :( Thanks for figuring this out. > > > > > > > > > > > After that fix, test passed and no segfault. > > > > > > > > Given that the test shows a yield to the main loop won't introduce > > > > much overhead (<1ms), how about first yield unconditionally, then > > > > we enter the while loop to wait for several ms and yield periodically? > > > > > > Shouldn't the expectation be that this should return immediately > > > without a wait? We're already processing LISTEN command, and on the > > > source as you said it was much after the connect(). It won't > > > guarantee the ordering but IIUC the majority should still have a direct hit? > > > > > > What we can do though is reducing the 100ms timeout if you see > > > that's perhaps a risk of having too large a downtime when by > > > accident. We can even do it in a tight loop here considering > > > downtime is important, but to provide an intermediate ground: how about > 100ms -> 1ms poll? > > > > Would it be better to use busy wait here, instead of blocking for even 1ms > here? > > It's likely that the preempt channel is waiting for the main thread to > > dispatch for accept(), but we are calling qemu_sem_timedwait here to block > the main thread for 1 more ms. > > I think it's about the expectation of whether we should already received that > sem post. My understanding is in most cases we should directly return and > avoid such wait. The assumption of "should already have received it" might not be true. On my machine, it always seems to be "not received". > > Per my previous experience, 1ms is not a major issue to be added on top of > downtime in corner cases like this. Yeah, in most cases, probably yes. There are some usages requiring relatively low downtime. Remember that NFV usages in the early days expected the downtime to be close to 10ms. In that case, adding 1ms would mean a ~10% performance regression ☹ > > We do have a lot of othre potential optimizations to reduce downtime, or I > should say in the other way, that.. there can be a lot of cases where we can hit > much larger downtime than expected. Consider when we don't even account > downtime for device states for now, either load_state or save_state, we only > count RAM but that's far from accurate.. and we do have more chances to > optimize. Some are listed here, but some may not: > > https://wiki.qemu.org/ToDo/LiveMigration#Optimizations > > If you agree with my above "expectation" statement, I'd say we should avoid > using a busy loop whenever possible in QEMU unless extremely necessary. I just posted a version with a few changes from your suggestion: it still blocks on the sem in the loop, but before that it checks whether the channel has been created. If not, it yields to the main loop. Then, when it reaches the sem, it likely doesn't need to wait. Another change is that I moved it to the place before we load the states (in loadvm_handle_cmd_packaged), because I think that's logically the place where we should ensure the channel is ready (I don't have a good reason to add it when handling the LISTEN cmd).
Please let me know if you disagree with anything.
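For readers following along, a rough sketch of the approach described here (check the channel first, then block on the semaphore, before the packaged device state is loaded in loadvm_handle_cmd_packaged()). This only illustrates the idea; the patch actually posted may differ in its details:

    /* Sketch: run early in loadvm_handle_cmd_packaged(), before loading the
     * packaged (non-iterative) device state. */
    if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
        /*
         * Make sure the main loop has had a chance to accept() the preempt
         * channel; until then, keep yielding the load coroutine.
         */
        while (!mis->postcopy_qemufile_dst) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }
        /*
         * With the channel already created, this wait is expected to return
         * (almost) immediately in the common case, so it should add little
         * or nothing to the downtime.
         */
        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
    }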