Message ID | 20231012103307.371092-1-f.ebner@proxmox.com (mailing list archive)
-----------|---------------------------------------------------------------------
State      | New, archived
Series     | [v4] migration: hold the BQL during setup
Fiona Ebner <f.ebner@proxmox.com> writes:

> This is intended to be a semantic revert of commit 9b09503752
> ("migration: run setup callbacks out of big lock"). There have been so
> many changes since that commit (e.g. a new setup callback
> dirty_bitmap_save_setup() that also needs to be adapted now) that it's
> easier to do the revert manually.
>
> For snapshots, the bdrv_writev_vmstate() function is used during setup
> (in QIOChannelBlock backing the QEMUFile), but not holding the BQL
> while calling it could lead to an assertion failure. To understand
> how, first note the following:

Would it make sense to add a GLOBAL_STATE_CODE() annotation to
qio_channel_block_writev?

> 1. Generated coroutine wrappers for block layer functions spawn the
> coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
> 2. If the host OS switches threads at an inconvenient time, it can
> happen that a bottom half scheduled for the main thread's AioContext
> is executed as part of a vCPU thread's aio_poll().
>
> An example leading to the assertion failure is as follows:
>
> main thread:
> 1. A snapshot-save QMP command gets issued.
> 2. snapshot_save_job_bh() is scheduled.
>
> vCPU thread:
> 3. aio_poll() for the main thread's AioContext is called (e.g. when
> the guest writes to a pflash device, as part of blk_pwrite which is a
> generated coroutine wrapper).
> 4. snapshot_save_job_bh() is executed as part of aio_poll().
> 5. qemu_savevm_state() is called.
> 6. qemu_mutex_unlock_iothread() is called. Now
> qemu_get_current_aio_context() returns 0x0.
> 7. bdrv_writev_vmstate() is executed during the usual savevm setup
> via qemu_fflush(). But this function is a generated coroutine wrapper,
> so it uses AIO_WAIT_WHILE. There, the assertion
> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> will fail.
>
> To fix it, ensure that the BQL is held during setup. While it would
> only be needed for snapshots, adapting migration too avoids additional
> logic for conditional locking/unlocking in the setup callbacks.
> Writing the header could (in theory) also trigger qemu_fflush() and
> thus bdrv_writev_vmstate(), so the locked section also covers the
> qemu_savevm_state_header() call, even for migration, for consistency.
>
> The section around multifd_send_sync_main() needs to be unlocked to
> avoid a deadlock. In particular, the function calls

... the multifd_save_setup() function calls ... otherwise this
paragraph makes no sense.

> socket_send_channel_create() using multifd_new_send_channel_async() as
> a callback and then waits for the callback to signal via the
> channels_ready semaphore. The connection happens via
> qio_task_run_in_thread(), but the callback is only executed via
> qio_task_thread_result(), which is scheduled for the main event loop.
> Without unlocking the section, the main thread would never get to
> process the task result and the callback, meaning there would be no
> signal via the channels_ready semaphore.
>
> The comment in ram_init_bitmaps() was introduced by 4987783400
> ("migration: fix incorrect memory_global_dirty_log_start outside BQL")
> and is removed, because it referred to the qemu_mutex_lock_iothread()
> call.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>

Thanks for taking the time to explain stuff in the commit message.
I dislike having unnecessary dependencies on the BQL throughout the
migration code, but I see people preferred that over conditional locking
in the previous versions, so in the name of consensus:

Reviewed-by: Fabiano Rosas <farosas@suse.de>
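For readers following the design discussion: the v2 changelog further down notes that earlier revisions conditionally acquired/released the lock inside the setup callbacks instead of holding the BQL around the whole setup phase. A rough sketch of that rejected shape, with invented callback and helper names (this is not the actual v1 code), looks like this:

/*
 * Illustration of the "conditional locking" alternative, not code from
 * this series: a setup callback that takes the BQL only if the caller
 * does not already hold it. example_save_setup() and do_setup_work()
 * are made-up names; qemu_mutex_iothread_locked() and the
 * lock/unlock_iothread() helpers are the BQL API of this era.
 */
static int example_save_setup(QEMUFile *f, void *opaque)
{
    bool bql_held = qemu_mutex_iothread_locked();
    int ret;

    if (!bql_held) {
        qemu_mutex_lock_iothread();
    }

    ret = do_setup_work(f, opaque); /* hypothetical setup body */

    if (!bql_held) {
        qemu_mutex_unlock_iothread();
    }
    return ret;
}

Holding the BQL in the generic migration code instead keeps this boilerplate out of every handler, which is the trade-off being accepted here.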
Fiona Ebner <f.ebner@proxmox.com> wrote:

> This is intended to be a semantic revert of commit 9b09503752
> ("migration: run setup callbacks out of big lock"). There have been so
> many changes since that commit (e.g. a new setup callback
> dirty_bitmap_save_setup() that also needs to be adapted now) that it's
> easier to do the revert manually.
>
> For snapshots, the bdrv_writev_vmstate() function is used during setup
> (in QIOChannelBlock backing the QEMUFile), but not holding the BQL
> while calling it could lead to an assertion failure. To understand
> how, first note the following:
>
> 1. Generated coroutine wrappers for block layer functions spawn the
> coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
> 2. If the host OS switches threads at an inconvenient time, it can
> happen that a bottom half scheduled for the main thread's AioContext
> is executed as part of a vCPU thread's aio_poll().
>
> An example leading to the assertion failure is as follows:
>
> main thread:
> 1. A snapshot-save QMP command gets issued.
> 2. snapshot_save_job_bh() is scheduled.
>
> vCPU thread:
> 3. aio_poll() for the main thread's AioContext is called (e.g. when
> the guest writes to a pflash device, as part of blk_pwrite which is a
> generated coroutine wrapper).
> 4. snapshot_save_job_bh() is executed as part of aio_poll().
> 5. qemu_savevm_state() is called.
> 6. qemu_mutex_unlock_iothread() is called. Now
> qemu_get_current_aio_context() returns 0x0.
> 7. bdrv_writev_vmstate() is executed during the usual savevm setup
> via qemu_fflush(). But this function is a generated coroutine wrapper,
> so it uses AIO_WAIT_WHILE. There, the assertion
> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> will fail.
>
> To fix it, ensure that the BQL is held during setup. While it would
> only be needed for snapshots, adapting migration too avoids additional
> logic for conditional locking/unlocking in the setup callbacks.
> Writing the header could (in theory) also trigger qemu_fflush() and
> thus bdrv_writev_vmstate(), so the locked section also covers the
> qemu_savevm_state_header() call, even for migration, for consistency.
>
> The section around multifd_send_sync_main() needs to be unlocked to
> avoid a deadlock. In particular, the function calls
> socket_send_channel_create() using multifd_new_send_channel_async() as
> a callback and then waits for the callback to signal via the
> channels_ready semaphore. The connection happens via
> qio_task_run_in_thread(), but the callback is only executed via
> qio_task_thread_result(), which is scheduled for the main event loop.
> Without unlocking the section, the main thread would never get to
> process the task result and the callback, meaning there would be no
> signal via the channels_ready semaphore.
>
> The comment in ram_init_bitmaps() was introduced by 4987783400
> ("migration: fix incorrect memory_global_dirty_log_start outside BQL")
> and is removed, because it referred to the qemu_mutex_lock_iothread()
> call.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

I don't have a better solution for this problem; if nobody complains in
the next couple of days, I will pull this change.

Thanks for the effort.
On 12.10.23 at 22:40, Fabiano Rosas wrote:
> Fiona Ebner <f.ebner@proxmox.com> writes:
>
>> This is intended to be a semantic revert of commit 9b09503752
>> ("migration: run setup callbacks out of big lock"). There have been so
>> many changes since that commit (e.g. a new setup callback
>> dirty_bitmap_save_setup() that also needs to be adapted now) that it's
>> easier to do the revert manually.
>>
>> For snapshots, the bdrv_writev_vmstate() function is used during setup
>> (in QIOChannelBlock backing the QEMUFile), but not holding the BQL
>> while calling it could lead to an assertion failure. To understand
>> how, first note the following:
>
> Would it make sense to add a GLOBAL_STATE_CODE() annotation to
> qio_channel_block_writev?
>

Since bdrv_writev_vmstate() is IO_OR_GS_CODE(), would using that be
better?

And I guess if we add an annotation for qio_channel_block_writev(), we
should go ahead and also do it for other functions in the file? E.g.
qio_channel_block_new() would have to be GLOBAL_STATE_CODE(), because
it uses bdrv_ref().

Best Regards,
Fiona
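To make the two options in this sub-thread concrete, the sketch below shows where such an annotation would sit in migration/channel-block.c. It is illustrative only and not part of the posted patch; the handler signature follows the QIOChannelClass io_writev callback, the real body is elided, and the choice between GLOBAL_STATE_CODE() and IO_OR_GS_CODE() is exactly the open question above.

/*
 * Sketch only, not from this series: asserting the threading
 * requirement at the top of the vmstate write handler. The real body
 * (building a QEMUIOVector and calling bdrv_writev_vmstate()) is
 * elided and replaced by a placeholder return.
 */
static ssize_t qio_channel_block_writev(QIOChannel *ioc,
                                        const struct iovec *iov,
                                        size_t niov,
                                        int *fds,
                                        size_t nfds,
                                        int flags,
                                        Error **errp)
{
    GLOBAL_STATE_CODE(); /* or IO_OR_GS_CODE(), per the discussion above */

    /* ... existing implementation unchanged ... */
    return -1; /* placeholder, real body elided */
}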
diff --git a/include/migration/register.h b/include/migration/register.h
index 2b12c6adec..fed1d04a3c 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -25,6 +25,7 @@ typedef struct SaveVMHandlers {
      * used to perform early checks.
      */
     int (*save_prepare)(void *opaque, Error **errp);
+    int (*save_setup)(QEMUFile *f, void *opaque);
     void (*save_cleanup)(void *opaque);
     int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque);
     int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
@@ -50,7 +51,6 @@ typedef struct SaveVMHandlers {
     int (*save_live_iterate)(QEMUFile *f, void *opaque);
 
     /* This runs outside the iothread lock! */
-    int (*save_setup)(QEMUFile *f, void *opaque);
     /* Note for save_live_pending:
      * must_precopy:
      * - must be migrated in precopy or in stopped state
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 032fc5f405..03cb2e72ee 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1214,9 +1214,7 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms = NULL;
 
-    qemu_mutex_lock_iothread();
     if (init_dirty_bitmap_migration(s) < 0) {
-        qemu_mutex_unlock_iothread();
         return -1;
     }
 
@@ -1224,7 +1222,6 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
         send_bitmap_start(f, s, dbms);
     }
     qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-    qemu_mutex_unlock_iothread();
 
     return 0;
 }
diff --git a/migration/block.c b/migration/block.c
index 5f930870a5..7cf70c1066 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -729,18 +729,13 @@ static int block_save_setup(QEMUFile *f, void *opaque)
     trace_migration_block_save("setup", block_mig_state.submitted,
                                block_mig_state.transferred);
 
-    qemu_mutex_lock_iothread();
     ret = init_blk_migration(f);
     if (ret < 0) {
-        qemu_mutex_unlock_iothread();
         return ret;
     }
 
     /* start track dirty blocks */
     ret = set_dirty_tracking();
-
-    qemu_mutex_unlock_iothread();
-
     if (ret) {
         return ret;
     }
diff --git a/migration/migration.c b/migration/migration.c
index 1c6c81ad49..9c6faa6367 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2980,7 +2980,9 @@ static void *migration_thread(void *opaque)
     object_ref(OBJECT(s));
     update_iteration_initial_status(s);
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_header(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     /*
      * If we opened the return path, we need to make sure dst has it
@@ -3008,7 +3010,9 @@
         qemu_savevm_send_colo_enable(s->to_dst_file);
     }
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_setup(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                             MIGRATION_STATUS_ACTIVE);
@@ -3119,8 +3123,10 @@ static void *bg_migration_thread(void *opaque)
     ram_write_tracking_prepare();
 #endif
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_header(s->to_dst_file);
     qemu_savevm_state_setup(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                             MIGRATION_STATUS_ACTIVE);
diff --git a/migration/ram.c b/migration/ram.c
index 2f5ce4d60b..5f7680ba4f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2891,8 +2891,6 @@ static void migration_bitmap_clear_discarded_pages(RAMState *rs)
 
 static void ram_init_bitmaps(RAMState *rs)
 {
-    /* For memory_global_dirty_log_start below. */
-    qemu_mutex_lock_iothread();
     qemu_mutex_lock_ramlist();
 
     WITH_RCU_READ_LOCK_GUARD() {
@@ -2904,7 +2902,6 @@ static void ram_init_bitmaps(RAMState *rs)
         }
     }
     qemu_mutex_unlock_ramlist();
-    qemu_mutex_unlock_iothread();
 
     /*
      * After an eventual first bitmap sync, fixup the initial bitmap
@@ -3067,7 +3064,10 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 
     migration_ops = g_malloc0(sizeof(MigrationOps));
     migration_ops->ram_save_target_page = ram_save_target_page_legacy;
+
+    qemu_mutex_unlock_iothread();
     ret = multifd_send_sync_main(f);
+    qemu_mutex_lock_iothread();
     if (ret < 0) {
         return ret;
     }
diff --git a/migration/savevm.c b/migration/savevm.c
index 497ce02bd7..e192f84a65 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1660,10 +1660,8 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     }
     ms->to_dst_file = f;
 
-    qemu_mutex_unlock_iothread();
     qemu_savevm_state_header(f);
     qemu_savevm_state_setup(f);
-    qemu_mutex_lock_iothread();
 
     while (qemu_file_get_error(f) == 0) {
         if (qemu_savevm_state_iterate(f, false) > 0) {
This is intended to be a semantic revert of commit 9b09503752
("migration: run setup callbacks out of big lock"). There have been so
many changes since that commit (e.g. a new setup callback
dirty_bitmap_save_setup() that also needs to be adapted now) that it's
easier to do the revert manually.

For snapshots, the bdrv_writev_vmstate() function is used during setup
(in QIOChannelBlock backing the QEMUFile), but not holding the BQL
while calling it could lead to an assertion failure. To understand
how, first note the following:

1. Generated coroutine wrappers for block layer functions spawn the
coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
2. If the host OS switches threads at an inconvenient time, it can
happen that a bottom half scheduled for the main thread's AioContext
is executed as part of a vCPU thread's aio_poll().

An example leading to the assertion failure is as follows:

main thread:
1. A snapshot-save QMP command gets issued.
2. snapshot_save_job_bh() is scheduled.

vCPU thread:
3. aio_poll() for the main thread's AioContext is called (e.g. when
the guest writes to a pflash device, as part of blk_pwrite which is a
generated coroutine wrapper).
4. snapshot_save_job_bh() is executed as part of aio_poll().
5. qemu_savevm_state() is called.
6. qemu_mutex_unlock_iothread() is called. Now
qemu_get_current_aio_context() returns 0x0.
7. bdrv_writev_vmstate() is executed during the usual savevm setup
via qemu_fflush(). But this function is a generated coroutine wrapper,
so it uses AIO_WAIT_WHILE. There, the assertion
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
will fail.

To fix it, ensure that the BQL is held during setup. While it would
only be needed for snapshots, adapting migration too avoids additional
logic for conditional locking/unlocking in the setup callbacks.
Writing the header could (in theory) also trigger qemu_fflush() and
thus bdrv_writev_vmstate(), so the locked section also covers the
qemu_savevm_state_header() call, even for migration, for consistency.

The section around multifd_send_sync_main() needs to be unlocked to
avoid a deadlock. In particular, the function calls
socket_send_channel_create() using multifd_new_send_channel_async() as
a callback and then waits for the callback to signal via the
channels_ready semaphore. The connection happens via
qio_task_run_in_thread(), but the callback is only executed via
qio_task_thread_result(), which is scheduled for the main event loop.
Without unlocking the section, the main thread would never get to
process the task result and the callback, meaning there would be no
signal via the channels_ready semaphore.

The comment in ram_init_bitmaps() was introduced by 4987783400
("migration: fix incorrect memory_global_dirty_log_start outside BQL")
and is removed, because it referred to the qemu_mutex_lock_iothread()
call.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v4:
    * Rebase on current master (save_prepare handler got added).

Changes in v3:
    * Add unlocked section around multifd_send_sync_main().

Changes in v2:
    * Also hold the BQL for migration, rather than conditionally
      acquiring/releasing the lock inside the setup callbacks.

 include/migration/register.h   | 2 +-
 migration/block-dirty-bitmap.c | 3 ---
 migration/block.c              | 5 -----
 migration/migration.c          | 6 ++++++
 migration/ram.c                | 6 +++---
 migration/savevm.c             | 2 --
 6 files changed, 10 insertions(+), 14 deletions(-)
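To see concretely why ram_save_setup() above drops the lock around multifd_send_sync_main(): the commit message describes a thread that holds the BQL and waits on a semaphore which can only be posted by a callback dispatched to the main loop, while the main loop itself needs the BQL to run that callback. The standalone sketch below models that shape with plain POSIX threads; all names (big_lock, channels_ready, the toy main loop) are invented for illustration and none of it is QEMU code. As written it completes; removing the unlock/relock pair around sem_wait() reproduces the deadlock.

/*
 * Toy model of the multifd setup deadlock described above; not QEMU
 * code. big_lock stands in for the BQL, channels_ready for the multifd
 * semaphore, and main_loop() for the main event loop, which only runs
 * dispatched callbacks while it can take the big lock.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t channels_ready;
static atomic_bool callback_pending;
static atomic_bool done;

/* Models the connect callback that signals the setup code. */
static void channel_connected_callback(void)
{
    sem_post(&channels_ready);
}

/* Toy main loop: busy-polls for brevity; needs big_lock to run callbacks. */
static void *main_loop(void *arg)
{
    (void)arg;
    while (!atomic_load(&done)) {
        pthread_mutex_lock(&big_lock);
        if (atomic_exchange(&callback_pending, false)) {
            channel_connected_callback();
        }
        pthread_mutex_unlock(&big_lock);
    }
    return NULL;
}

/* Models the setup path: schedules the connection, then waits for it. */
static void *setup_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);          /* setup runs with the lock held */

    atomic_store(&callback_pending, true);  /* "schedule" the callback */

    pthread_mutex_unlock(&big_lock);        /* without this unlock ...       */
    sem_wait(&channels_ready);              /* ... this wait never finishes  */
    pthread_mutex_lock(&big_lock);          /* retake the lock afterwards    */

    pthread_mutex_unlock(&big_lock);
    atomic_store(&done, true);
    return NULL;
}

int main(void)
{
    pthread_t loop, setup;

    sem_init(&channels_ready, 0, 0);
    pthread_create(&loop, NULL, main_loop, NULL);
    pthread_create(&setup, NULL, setup_thread, NULL);
    pthread_join(setup, NULL);
    pthread_join(loop, NULL);
    printf("setup completed without deadlocking\n");
    return 0;
}

Built with e.g. cc -pthread, it prints one line and exits; keeping big_lock held across the sem_wait() instead makes both threads block forever, which is the scenario the unlocked section avoids.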