Message ID | 20240417025634.1014582-1-lizhijian@fujitsu.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] migration/colo: Fix bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed. | expand |
On 17/4/24 04:56, Li Zhijian via wrote: > bdrv_activate_all() should not be called from the coroutine context, move > it to the QEMU thread colo_process_incoming_thread() with the bql_lock > protected. > > The backtrace is as follows: > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > CC: Fabiano Rosas <farosas@suse.de> Cc: qemu-stable@nongnu.org > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> > --- > V2: fix missing bql_unlock() in error path. > --- > migration/colo.c | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/migration/colo.c b/migration/colo.c > index 84632a603e..5600a43d78 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) > return NULL; > } > > + /* Make sure all file formats throw away their mutable metadata */ > + bql_lock(); Note there is also the convenient BQL_LOCK_GUARD() macro. > + bdrv_activate_all(&local_err); > + if (local_err) { > + bql_unlock(); > + error_report_err(local_err); > + return NULL; > + } > + bql_unlock(); > + > failover_init_state();
> -----Original Message----- > From: Philippe Mathieu-Daudé <philmd@linaro.org> > Sent: Wednesday, April 17, 2024 2:14 PM > To: Li Zhijian <lizhijian@fujitsu.com>; Zhang, Hailiang > <zhanghailiang@xfusion.com>; peterx@redhat.com; farosas@suse.de > Cc: qemu-devel@nongnu.org; Zhang, Chen <chen.zhang@intel.com>; Wen > Congyang <wencongyang2@huawei.com>; Xie Changlong > <xiechanglong.d@gmail.com> > Subject: Re: [PATCH v2] migration/colo: Fix bdrv_graph_rdlock_main_loop: > Assertion `!qemu_in_coroutine()' failed. > > On 17/4/24 04:56, Li Zhijian via wrote: > > bdrv_activate_all() should not be called from the coroutine context, > > move it to the QEMU thread colo_process_incoming_thread() with the > > bql_lock protected. > > > > The backtrace is as follows: > > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () > at ../block/graph-lock.c:260 > > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop > (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) > at ../block.c:6906 > > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) > at ../migration/migration.c:793 > > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, > i1=22042) at ../util/coroutine-ucontext.c:175 > > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > > > CC: Fabiano Rosas <farosas@suse.de> > > Cc: qemu-stable@nongnu.org > > > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and > > bdrv_is_root_node() GRAPH_RDLOCK") > > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> It looks good to me. And already verified this patch in my environment. After address Phillippe's comments please add: Reviewed-by: Zhang Chen <chen.zhang@intel.com> Tested-by: Zhang Chen <chen.zhang@intel.com> Thanks Chen > > --- > > V2: fix missing bql_unlock() in error path. > > --- > > migration/colo.c | 18 ++++++++++-------- > > 1 file changed, 10 insertions(+), 8 deletions(-) > > > > diff --git a/migration/colo.c b/migration/colo.c index > > 84632a603e..5600a43d78 100644 > > --- a/migration/colo.c > > +++ b/migration/colo.c > > @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void > *opaque) > > return NULL; > > } > > > > + /* Make sure all file formats throw away their mutable metadata */ > > + bql_lock(); > > Note there is also the convenient BQL_LOCK_GUARD() macro. > > > + bdrv_activate_all(&local_err); > > + if (local_err) { > > + bql_unlock(); > > + error_report_err(local_err); > > + return NULL; > > + } > > + bql_unlock(); > > + > > failover_init_state();
On 17/4/24 08:47, Zhang, Chen wrote: > > >> -----Original Message----- >> From: Philippe Mathieu-Daudé <philmd@linaro.org> >> Sent: Wednesday, April 17, 2024 2:14 PM >> To: Li Zhijian <lizhijian@fujitsu.com>; Zhang, Hailiang >> <zhanghailiang@xfusion.com>; peterx@redhat.com; farosas@suse.de >> Cc: qemu-devel@nongnu.org; Zhang, Chen <chen.zhang@intel.com>; Wen >> Congyang <wencongyang2@huawei.com>; Xie Changlong >> <xiechanglong.d@gmail.com> >> Subject: Re: [PATCH v2] migration/colo: Fix bdrv_graph_rdlock_main_loop: >> Assertion `!qemu_in_coroutine()' failed. >> >> On 17/4/24 04:56, Li Zhijian via wrote: >>> bdrv_activate_all() should not be called from the coroutine context, >>> move it to the QEMU thread colo_process_incoming_thread() with the >>> bql_lock protected. >>> >>> The backtrace is as follows: >>> #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () >> at ../block/graph-lock.c:260 >>> #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop >> (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 >>> #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) >> at ../block.c:6906 >>> #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 >>> #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) >> at ../migration/migration.c:793 >>> #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, >> i1=22042) at ../util/coroutine-ucontext.c:175 >>> #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 >>> >>> CC: Fabiano Rosas <farosas@suse.de> >> >> Cc: qemu-stable@nongnu.org >> >>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 >>> Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and >>> bdrv_is_root_node() GRAPH_RDLOCK") >>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> > > It looks good to me. And already verified this patch in my environment. > After address Phillippe's comments (I'm only suggesting, in case using the macro results in code more easily maintainable) > please add: > > Reviewed-by: Zhang Chen <chen.zhang@intel.com> > Tested-by: Zhang Chen <chen.zhang@intel.com> > > Thanks > Chen > >>> --- >>> V2: fix missing bql_unlock() in error path. >>> --- >>> migration/colo.c | 18 ++++++++++-------- >>> 1 file changed, 10 insertions(+), 8 deletions(-) >>> >>> diff --git a/migration/colo.c b/migration/colo.c index >>> 84632a603e..5600a43d78 100644 >>> --- a/migration/colo.c >>> +++ b/migration/colo.c >>> @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void >> *opaque) >>> return NULL; >>> } >>> >>> + /* Make sure all file formats throw away their mutable metadata */ >>> + bql_lock(); >> >> Note there is also the convenient BQL_LOCK_GUARD() macro. >> >>> + bdrv_activate_all(&local_err); >>> + if (local_err) { >>> + bql_unlock(); >>> + error_report_err(local_err); >>> + return NULL; >>> + } >>> + bql_unlock(); >>> + >>> failover_init_state(); >
On 17/04/2024 14:13, Philippe Mathieu-Daudé wrote: > On 17/4/24 04:56, Li Zhijian via wrote: >> bdrv_activate_all() should not be called from the coroutine context, move >> it to the QEMU thread colo_process_incoming_thread() with the bql_lock >> protected. >> >> The backtrace is as follows: >> #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 >> #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 >> #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 >> #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 >> #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 >> #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 >> #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 >> >> CC: Fabiano Rosas <farosas@suse.de> > > Cc: qemu-stable@nongnu.org > >> Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 >> Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> >> --- >> V2: fix missing bql_unlock() in error path. >> --- >> migration/colo.c | 18 ++++++++++-------- >> 1 file changed, 10 insertions(+), 8 deletions(-) >> >> diff --git a/migration/colo.c b/migration/colo.c >> index 84632a603e..5600a43d78 100644 >> --- a/migration/colo.c >> +++ b/migration/colo.c >> @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) >> return NULL; >> } >> + /* Make sure all file formats throw away their mutable metadata */ >> + bql_lock(); > > Note there is also the convenient BQL_LOCK_GUARD() macro. Thanks for your reminder, Philippe. IIUC, BQL_LOCK_GUARD() will lock all the rest code till the subroutine ends. So it's not suitable for this case where we only need a partial critical section. Thanks Zhijian > >> + bdrv_activate_all(&local_err); >> + if (local_err) { >> + bql_unlock(); >> + error_report_err(local_err); >> + return NULL; >> + } >> + bql_unlock(); >> + >> failover_init_state(); >
Li Zhijian via <qemu-devel@nongnu.org> writes: > bdrv_activate_all() should not be called from the coroutine context, move > it to the QEMU thread colo_process_incoming_thread() with the bql_lock > protected. > > The backtrace is as follows: > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > CC: Fabiano Rosas <farosas@suse.de> > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Fabiano Rosas <farosas@suse.de>
On Wed, Apr 17, 2024 at 10:56:34AM +0800, Li Zhijian via wrote: > bdrv_activate_all() should not be called from the coroutine context, move > it to the QEMU thread colo_process_incoming_thread() with the bql_lock > protected. > > The backtrace is as follows: > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > CC: Fabiano Rosas <farosas@suse.de> > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> queued (and cc stable), thanks.
17.04.2024 05:56, Li Zhijian via wrote: > bdrv_activate_all() should not be called from the coroutine context, move > it to the QEMU thread colo_process_incoming_thread() with the bql_lock > protected. > > The backtrace is as follows: > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > CC: Fabiano Rosas <farosas@suse.de> > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Commit 2b3912f135 is in 8.2 (v8.1.0-1575-g2b3912f135). Is this fix supposed to go to stable-8.2 series? The prob here is that in 8.2, there's no bql_lock/unlock. I guess it should use qemu_mutex_lock_iothread() instead, for before 195801d700c008 "system/cpus: rename qemu_mutex_lock_iothread() to bql_lock()". Thanks, /mjt > diff --git a/migration/colo.c b/migration/colo.c > index 84632a603e..5600a43d78 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) > return NULL; > } > > + /* Make sure all file formats throw away their mutable metadata */ > + bql_lock(); > + bdrv_activate_all(&local_err); > + if (local_err) { > + bql_unlock(); > + error_report_err(local_err); > + return NULL; > + } > + bql_unlock(); > + > failover_init_state(); > > mis->to_src_file = qemu_file_get_return_path(mis->from_src_file); > @@ -922,7 +932,6 @@ out: > int coroutine_fn colo_incoming_co(void) > { > MigrationIncomingState *mis = migration_incoming_get_current(); > - Error *local_err = NULL; > QemuThread th; > > assert(bql_locked()); > @@ -931,13 +940,6 @@ int coroutine_fn colo_incoming_co(void) > return 0; > } > > - /* Make sure all file formats throw away their mutable metadata */ > - bdrv_activate_all(&local_err); > - if (local_err) { > - error_report_err(local_err); > - return -EINVAL; > - } > - > qemu_thread_create(&th, "COLO incoming", colo_process_incoming_thread, > mis, QEMU_THREAD_JOINABLE); >
17.04.2024 05:56, Li Zhijian via wrote: > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) > return NULL; > } > > + /* Make sure all file formats throw away their mutable metadata */ > + bql_lock(); > + bdrv_activate_all(&local_err); > + if (local_err) { > + bql_unlock(); > + error_report_err(local_err); > + return NULL; > + } > + bql_unlock(); FWIW, this can be simplified as follows: bql_lock(); bdrv_activate_all(&local_err); bql_unlock(); if (local_err) { error_report_err(local_err); return NULL; } (I know it is already too late) /mjt
On Sun, Apr 28, 2024 at 03:39:09PM +0300, Michael Tokarev wrote: > 17.04.2024 05:56, Li Zhijian via wrote: > > bdrv_activate_all() should not be called from the coroutine context, move > > it to the QEMU thread colo_process_incoming_thread() with the bql_lock > > protected. > > > > The backtrace is as follows: > > #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 > > #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 > > #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 > > #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 > > #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 > > #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 > > #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 > > > > CC: Fabiano Rosas <farosas@suse.de> > > Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 > > Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") > > Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> > > Commit 2b3912f135 is in 8.2 (v8.1.0-1575-g2b3912f135). Is this fix supposed to go > to stable-8.2 series? Yes we probably should. I've copied qemu-stable when queuing this patch for that: https://lore.kernel.org/r/20240423223813.3237060-27-peterx@redhat.com > The prob here is that in 8.2, there's no bql_lock/unlock. > I guess it should use qemu_mutex_lock_iothread() instead, for before > 195801d700c008 "system/cpus: rename qemu_mutex_lock_iothread() to bql_lock()". Right, some further care needs to be taken on this one indeed. > 17.04.2024 05:56, Li Zhijian via wrote: > > --- a/migration/colo.c > > +++ b/migration/colo.c > > @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) > > return NULL; > > } > > + /* Make sure all file formats throw away their mutable metadata */ > > + bql_lock(); > > + bdrv_activate_all(&local_err); > > + if (local_err) { > > + bql_unlock(); > > + error_report_err(local_err); > > + return NULL; > > + } > > + bql_unlock(); > > FWIW, this can be simplified as follows: > > bql_lock(); > bdrv_activate_all(&local_err); > bql_unlock(); > if (local_err) { > error_report_err(local_err); > return NULL; > } > > (I know it is already too late) Agree. :) Thanks,
diff --git a/migration/colo.c b/migration/colo.c index 84632a603e..5600a43d78 100644 --- a/migration/colo.c +++ b/migration/colo.c @@ -835,6 +835,16 @@ static void *colo_process_incoming_thread(void *opaque) return NULL; } + /* Make sure all file formats throw away their mutable metadata */ + bql_lock(); + bdrv_activate_all(&local_err); + if (local_err) { + bql_unlock(); + error_report_err(local_err); + return NULL; + } + bql_unlock(); + failover_init_state(); mis->to_src_file = qemu_file_get_return_path(mis->from_src_file); @@ -922,7 +932,6 @@ out: int coroutine_fn colo_incoming_co(void) { MigrationIncomingState *mis = migration_incoming_get_current(); - Error *local_err = NULL; QemuThread th; assert(bql_locked()); @@ -931,13 +940,6 @@ int coroutine_fn colo_incoming_co(void) return 0; } - /* Make sure all file formats throw away their mutable metadata */ - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - return -EINVAL; - } - qemu_thread_create(&th, "COLO incoming", colo_process_incoming_thread, mis, QEMU_THREAD_JOINABLE);
bdrv_activate_all() should not be called from the coroutine context, move it to the QEMU thread colo_process_incoming_thread() with the bql_lock protected. The backtrace is as follows: #4 0x0000561af7948362 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 #5 0x0000561af7907a68 in graph_lockable_auto_lock_mainloop (x=0x7fd29810be7b) at /patch/to/qemu/include/block/graph-lock.h:259 #6 0x0000561af79167d1 in bdrv_activate_all (errp=0x7fd29810bed0) at ../block.c:6906 #7 0x0000561af762b4af in colo_incoming_co () at ../migration/colo.c:935 #8 0x0000561af7607e57 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:793 #9 0x0000561af7adbeeb in coroutine_trampoline (i0=-106876144, i1=22042) at ../util/coroutine-ucontext.c:175 #10 0x00007fd2a5cf21c0 in () at /lib64/libc.so.6 CC: Fabiano Rosas <farosas@suse.de> Closes: https://gitlab.com/qemu-project/qemu/-/issues/2277 Fixes: 2b3912f135 ("block: Mark bdrv_first_blk() and bdrv_is_root_node() GRAPH_RDLOCK") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> --- V2: fix missing bql_unlock() in error path. --- migration/colo.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-)