Message ID | 20210806093859.706464-2-mreitz@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mirror: Handle errors after READY cancel | expand |
On Fri, Aug 06, 2021 at 11:38:48AM +0200, Max Reitz wrote: > Finalizing the job may cause its AioContext to change. This is noted by > job_exit(), which points at job_txn_apply() to take this fact into > account. > > However, job_completed() does not necessarily invoke job_txn_apply() > (through job_completed_txn_success()), but potentially also > job_completed_txn_abort(). The latter stores the context in a local > variable, and so always acquires the same context at its end that it has > released in the beginning -- which may be a different context from the > one that job_exit() releases at its end. If it is different, qemu > aborts ("qemu_mutex_unlock_impl: Operation not permitted"). Is this a bug fix that needs to make it into 6.1? > > Drop the local @outer_ctx variable from job_completed_txn_abort(), and > instead re-acquire the actual job's context at the end of the function, > so job_exit() will release the same. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > job.c | 23 ++++++++++++++++++----- > 1 file changed, 18 insertions(+), 5 deletions(-) The commit message makes sense, and does a good job at explaining the change. I'm still a bit fuzzy on how jobs are supposed to play nice with contexts, but since your patch matches the commit message, I'm happy to give: Reviewed-by: Eric Blake <eblake@redhat.com>
On 06.08.21 21:16, Eric Blake wrote: > On Fri, Aug 06, 2021 at 11:38:48AM +0200, Max Reitz wrote: >> Finalizing the job may cause its AioContext to change. This is noted by >> job_exit(), which points at job_txn_apply() to take this fact into >> account. >> >> However, job_completed() does not necessarily invoke job_txn_apply() >> (through job_completed_txn_success()), but potentially also >> job_completed_txn_abort(). The latter stores the context in a local >> variable, and so always acquires the same context at its end that it has >> released in the beginning -- which may be a different context from the >> one that job_exit() releases at its end. If it is different, qemu >> aborts ("qemu_mutex_unlock_impl: Operation not permitted"). > Is this a bug fix that needs to make it into 6.1? Well, I only encountered it as part of this series (which I really don’t think is 6.2 material at this point), and so I don’t know. Can’t hurt, I suppose, but if we wanted this to be in 6.1, we’d better have a specific test for it, I think. >> Drop the local @outer_ctx variable from job_completed_txn_abort(), and >> instead re-acquire the actual job's context at the end of the function, >> so job_exit() will release the same. >> >> Signed-off-by: Max Reitz <mreitz@redhat.com> >> --- >> job.c | 23 ++++++++++++++++++----- >> 1 file changed, 18 insertions(+), 5 deletions(-) > The commit message makes sense, and does a good job at explaining the > change. I'm still a bit fuzzy on how jobs are supposed to play nice > with contexts, I can relate :) > but since your patch matches the commit message, I'm > happy to give: > > Reviewed-by: Eric Blake <eblake@redhat.com> Thanks!
06.08.2021 12:38, Max Reitz wrote: > Finalizing the job may cause its AioContext to change. This is noted by > job_exit(), which points at job_txn_apply() to take this fact into > account. > > However, job_completed() does not necessarily invoke job_txn_apply() > (through job_completed_txn_success()), but potentially also > job_completed_txn_abort(). The latter stores the context in a local > variable, and so always acquires the same context at its end that it has > released in the beginning -- which may be a different context from the > one that job_exit() releases at its end. If it is different, qemu > aborts ("qemu_mutex_unlock_impl: Operation not permitted"). > > Drop the local @outer_ctx variable from job_completed_txn_abort(), and > instead re-acquire the actual job's context at the end of the function, > so job_exit() will release the same. > > Signed-off-by: Max Reitz <mreitz@redhat.com> > --- > job.c | 23 ++++++++++++++++++----- > 1 file changed, 18 insertions(+), 5 deletions(-) > > diff --git a/job.c b/job.c > index e7a5d28854..3fe23bb77e 100644 > --- a/job.c > +++ b/job.c > @@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force) > > static void job_completed_txn_abort(Job *job) > { > - AioContext *outer_ctx = job->aio_context; > AioContext *ctx; > JobTxn *txn = job->txn; > Job *other_job; > @@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job) > txn->aborting = true; > job_txn_ref(txn); > > - /* We can only hold the single job's AioContext lock while calling > + /* > + * We can only hold the single job's AioContext lock while calling > * job_finalize_single() because the finalization callbacks can involve > - * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */ > - aio_context_release(outer_ctx); > + * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. > + * Note that the job's AioContext may change when it is finalized. > + */ > + job_ref(job); > + aio_context_release(job->aio_context); > > /* Other jobs are effectively cancelled by us, set the status for > * them; this job, however, may or may not be cancelled, depending > @@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job) > } > while (!QLIST_EMPTY(&txn->jobs)) { > other_job = QLIST_FIRST(&txn->jobs); > + /* > + * The job's AioContext may change, so store it in @ctx so we > + * release the same context that we have acquired before. > + */ > ctx = other_job->aio_context; > aio_context_acquire(ctx); > if (!job_is_completed(other_job)) { > @@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job) > aio_context_release(ctx); > } > > - aio_context_acquire(outer_ctx); > + /* > + * Use job_ref()/job_unref() so we can read the AioContext here > + * even if the job went away during job_finalize_single(). > + */ > + ctx = job->aio_context; > + job_unref(job); > + aio_context_acquire(ctx); why to use ctx variable and not do it exactly same as in job_txn_apply() : aio_context_acquire(job->aio_context); job_unref(job); ? anyway: Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
On 01.09.21 12:05, Vladimir Sementsov-Ogievskiy wrote: > 06.08.2021 12:38, Max Reitz wrote: >> Finalizing the job may cause its AioContext to change. This is noted by >> job_exit(), which points at job_txn_apply() to take this fact into >> account. >> >> However, job_completed() does not necessarily invoke job_txn_apply() >> (through job_completed_txn_success()), but potentially also >> job_completed_txn_abort(). The latter stores the context in a local >> variable, and so always acquires the same context at its end that it has >> released in the beginning -- which may be a different context from the >> one that job_exit() releases at its end. If it is different, qemu >> aborts ("qemu_mutex_unlock_impl: Operation not permitted"). >> >> Drop the local @outer_ctx variable from job_completed_txn_abort(), and >> instead re-acquire the actual job's context at the end of the function, >> so job_exit() will release the same. >> >> Signed-off-by: Max Reitz <mreitz@redhat.com> >> --- >> job.c | 23 ++++++++++++++++++----- >> 1 file changed, 18 insertions(+), 5 deletions(-) >> >> diff --git a/job.c b/job.c >> index e7a5d28854..3fe23bb77e 100644 >> --- a/job.c >> +++ b/job.c >> @@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force) >> static void job_completed_txn_abort(Job *job) >> { >> - AioContext *outer_ctx = job->aio_context; >> AioContext *ctx; >> JobTxn *txn = job->txn; >> Job *other_job; >> @@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job) >> txn->aborting = true; >> job_txn_ref(txn); >> - /* We can only hold the single job's AioContext lock while >> calling >> + /* >> + * We can only hold the single job's AioContext lock while calling >> * job_finalize_single() because the finalization callbacks can >> involve >> - * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */ >> - aio_context_release(outer_ctx); >> + * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. >> + * Note that the job's AioContext may change when it is finalized. >> + */ >> + job_ref(job); >> + aio_context_release(job->aio_context); >> /* Other jobs are effectively cancelled by us, set the status >> for >> * them; this job, however, may or may not be cancelled, depending >> @@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job) >> } >> while (!QLIST_EMPTY(&txn->jobs)) { >> other_job = QLIST_FIRST(&txn->jobs); >> + /* >> + * The job's AioContext may change, so store it in @ctx so we >> + * release the same context that we have acquired before. >> + */ >> ctx = other_job->aio_context; >> aio_context_acquire(ctx); >> if (!job_is_completed(other_job)) { >> @@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job) >> aio_context_release(ctx); >> } >> - aio_context_acquire(outer_ctx); >> + /* >> + * Use job_ref()/job_unref() so we can read the AioContext here >> + * even if the job went away during job_finalize_single(). >> + */ >> + ctx = job->aio_context; >> + job_unref(job); >> + aio_context_acquire(ctx); > > > why to use ctx variable and not do it exactly same as in > job_txn_apply() : > > aio_context_acquire(job->aio_context); > job_unref(job); > > ? Oh, I just didn’t think of that. Sounds good, thanks! Hanna > anyway: > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> >
diff --git a/job.c b/job.c index e7a5d28854..3fe23bb77e 100644 --- a/job.c +++ b/job.c @@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force) static void job_completed_txn_abort(Job *job) { - AioContext *outer_ctx = job->aio_context; AioContext *ctx; JobTxn *txn = job->txn; Job *other_job; @@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job) txn->aborting = true; job_txn_ref(txn); - /* We can only hold the single job's AioContext lock while calling + /* + * We can only hold the single job's AioContext lock while calling * job_finalize_single() because the finalization callbacks can involve - * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */ - aio_context_release(outer_ctx); + * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. + * Note that the job's AioContext may change when it is finalized. + */ + job_ref(job); + aio_context_release(job->aio_context); /* Other jobs are effectively cancelled by us, set the status for * them; this job, however, may or may not be cancelled, depending @@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job) } while (!QLIST_EMPTY(&txn->jobs)) { other_job = QLIST_FIRST(&txn->jobs); + /* + * The job's AioContext may change, so store it in @ctx so we + * release the same context that we have acquired before. + */ ctx = other_job->aio_context; aio_context_acquire(ctx); if (!job_is_completed(other_job)) { @@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job) aio_context_release(ctx); } - aio_context_acquire(outer_ctx); + /* + * Use job_ref()/job_unref() so we can read the AioContext here + * even if the job went away during job_finalize_single(). + */ + ctx = job->aio_context; + job_unref(job); + aio_context_acquire(ctx); job_txn_unref(txn); }
Finalizing the job may cause its AioContext to change. This is noted by job_exit(), which points at job_txn_apply() to take this fact into account. However, job_completed() does not necessarily invoke job_txn_apply() (through job_completed_txn_success()), but potentially also job_completed_txn_abort(). The latter stores the context in a local variable, and so always acquires the same context at its end that it has released in the beginning -- which may be a different context from the one that job_exit() releases at its end. If it is different, qemu aborts ("qemu_mutex_unlock_impl: Operation not permitted"). Drop the local @outer_ctx variable from job_completed_txn_abort(), and instead re-acquire the actual job's context at the end of the function, so job_exit() will release the same. Signed-off-by: Max Reitz <mreitz@redhat.com> --- job.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-)