Message ID | 20180821064538.19750-1-famz@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | job: Fix nested aio_poll() hanging in job_txn_apply | expand |
On 08/21/2018 01:45 AM, Fam Zheng wrote: > All callers have acquired ctx already. Doing that again results in > aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the > callback cannot make progress because ctx is recursively locked, for > example, when drive-backup finishes. > > Cc: qemu-stable@nongnu.org > Reported-by: Gu Nini <ngu@redhat.com> > Signed-off-by: Fam Zheng <famz@redhat.com> > --- > job.c | 18 +++++------------- > 1 file changed, 5 insertions(+), 13 deletions(-) > > diff --git a/job.c b/job.c > index fa671b431a..b07c5d0ffc 100644 > --- a/job.c > +++ b/job.c > @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job) > } > } > > -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock) > +static int job_txn_apply(JobTxn *txn, int fn(Job *)) Is it worth adding a comment here that callers must have already acquired the job context? However, I was unable to quickly audit whether all callers really did have the lock (it balloons into whether all callers of job_finalize() have the lock), so I'm reluctant to give R-b. > @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job) > assert(other_job->ret == 0); > } > > - job_txn_apply(txn, job_transition_to_pending, false); > + job_txn_apply(txn, job_transition_to_pending); > > /* If no jobs need manual finalization, automatically do so */ > - if (job_txn_apply(txn, job_needs_finalize, false) == 0) { > + if (job_txn_apply(txn, job_needs_finalize) == 0) { > job_do_finalize(job); That said, the change makes sense here: since the first direct call to job_txn_apply() did not need to lock, why should the second indirect call through job_do_finalize() need it? Or, is the fix to have job_do_finalize() gain a bool parameter, where job_completed_txn_success() would pass false to that parameter, while job_finalize() would pass true (again, going back to auditing whether all callers of job_finalize() have already acquired the context).
On Tue, 08/21 11:15, Eric Blake wrote: > On 08/21/2018 01:45 AM, Fam Zheng wrote: > > All callers have acquired ctx already. Doing that again results in > > aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the > > callback cannot make progress because ctx is recursively locked, for > > example, when drive-backup finishes. > > > > Cc: qemu-stable@nongnu.org > > Reported-by: Gu Nini <ngu@redhat.com> > > Signed-off-by: Fam Zheng <famz@redhat.com> > > --- > > job.c | 18 +++++------------- > > 1 file changed, 5 insertions(+), 13 deletions(-) > > > > diff --git a/job.c b/job.c > > index fa671b431a..b07c5d0ffc 100644 > > --- a/job.c > > +++ b/job.c > > @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job) > > } > > } > > -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock) > > +static int job_txn_apply(JobTxn *txn, int fn(Job *)) > > Is it worth adding a comment here that callers must have already acquired > the job context? I think most job accessing functions require it. If any, the comment should go to the public functions like job_finalize(). > > However, I was unable to quickly audit whether all callers really did have > the lock (it balloons into whether all callers of job_finalize() have the > lock), so I'm reluctant to give R-b. > > > @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job) > > assert(other_job->ret == 0); > > } > > - job_txn_apply(txn, job_transition_to_pending, false); > > + job_txn_apply(txn, job_transition_to_pending); > > /* If no jobs need manual finalization, automatically do so */ > > - if (job_txn_apply(txn, job_needs_finalize, false) == 0) { > > + if (job_txn_apply(txn, job_needs_finalize) == 0) { > > job_do_finalize(job); > > That said, the change makes sense here: since the first direct call to > job_txn_apply() did not need to lock, why should the second indirect call > through job_do_finalize() need it? > > Or, is the fix to have job_do_finalize() gain a bool parameter, where > job_completed_txn_success() would pass false to that parameter, while > job_finalize() would pass true (again, going back to auditing whether all > callers of job_finalize() have already acquired the context). There are two callers of job_finalize(): fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize' blockdev.c: job_finalize(&job->job, errp); blockdev.c- aio_context_release(aio_context); -- job-qmp.c: job_finalize(job, errp); job-qmp.c- aio_context_release(aio_context); -- tests/test-blockjob.c: job_finalize(&job->job, &error_abort); tests/test-blockjob.c- assert(job->job.status == JOB_STATUS_CONCLUDED); Ignoring the test, it's very easy to see both callers have acquired the context. Fam
On 08/22/2018 12:48 AM, Fam Zheng wrote: >> However, I was unable to quickly audit whether all callers really did have >> the lock (it balloons into whether all callers of job_finalize() have the >> lock), so I'm reluctant to give R-b. >> >>> @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job) >>> assert(other_job->ret == 0); >>> } >>> - job_txn_apply(txn, job_transition_to_pending, false); >>> + job_txn_apply(txn, job_transition_to_pending); >>> /* If no jobs need manual finalization, automatically do so */ >>> - if (job_txn_apply(txn, job_needs_finalize, false) == 0) { >>> + if (job_txn_apply(txn, job_needs_finalize) == 0) { >>> job_do_finalize(job); >> >> That said, the change makes sense here: since the first direct call to >> job_txn_apply() did not need to lock, why should the second indirect call >> through job_do_finalize() need it? >> >> Or, is the fix to have job_do_finalize() gain a bool parameter, where >> job_completed_txn_success() would pass false to that parameter, while >> job_finalize() would pass true (again, going back to auditing whether all >> callers of job_finalize() have already acquired the context). > > There are two callers of job_finalize(): Okay, so the audit would have been easier if I had actually tried grepping for it. Still, it can't hurt to make the commit message give more details on what you checked, to make it easier for the reviewers. > > fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize' > blockdev.c: job_finalize(&job->job, errp); > blockdev.c- aio_context_release(aio_context); > -- > job-qmp.c: job_finalize(job, errp); > job-qmp.c- aio_context_release(aio_context); Yep, that's pretty conclusive that the context is already held by all callers. Reviewed-by: Eric Blake <eblake@redhat.com>
diff --git a/job.c b/job.c index fa671b431a..b07c5d0ffc 100644 --- a/job.c +++ b/job.c @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job) } } -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock) +static int job_txn_apply(JobTxn *txn, int fn(Job *)) { - AioContext *ctx; Job *job, *next; int rc = 0; QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) { - if (lock) { - ctx = job->aio_context; - aio_context_acquire(ctx); - } rc = fn(job); - if (lock) { - aio_context_release(ctx); - } if (rc) { break; } @@ -807,11 +799,11 @@ static void job_do_finalize(Job *job) assert(job && job->txn); /* prepare the transaction to complete */ - rc = job_txn_apply(job->txn, job_prepare, true); + rc = job_txn_apply(job->txn, job_prepare); if (rc) { job_completed_txn_abort(job); } else { - job_txn_apply(job->txn, job_finalize_single, true); + job_txn_apply(job->txn, job_finalize_single); } } @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job) assert(other_job->ret == 0); } - job_txn_apply(txn, job_transition_to_pending, false); + job_txn_apply(txn, job_transition_to_pending); /* If no jobs need manual finalization, automatically do so */ - if (job_txn_apply(txn, job_needs_finalize, false) == 0) { + if (job_txn_apply(txn, job_needs_finalize) == 0) { job_do_finalize(job); } }
All callers have acquired ctx already. Doing that again results in aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the callback cannot make progress because ctx is recursively locked, for example, when drive-backup finishes. Cc: qemu-stable@nongnu.org Reported-by: Gu Nini <ngu@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> --- job.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-)