mbox

[PULL,00/12] jobs: mirror: Handle errors after READY cancel

Message ID 20210921102017.273679-1-vsementsov@virtuozzo.com (mailing list archive)
State New, archived
Headers show

Pull-request

https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-jobs-2021-09-21

Message

Vladimir Sementsov-Ogievskiy Sept. 21, 2021, 10:20 a.m. UTC
The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:

  Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 16:17:05 +0100)

are available in the Git repository at:

  https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-jobs-2021-09-21

for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:

  iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)

----------------------------------------------------------------
mirror: Handle errors after READY cancel

----------------------------------------------------------------
Hanna Reitz (12):
      job: Context changes in job_completed_txn_abort()
      mirror: Keep s->synced on error
      mirror: Drop s->synced
      job: Force-cancel jobs in a failed transaction
      job: @force parameter for job_cancel_sync()
      jobs: Give Job.force_cancel more meaning
      job: Add job_cancel_requested()
      mirror: Use job_is_cancelled()
      mirror: Check job_is_cancelled() earlier
      mirror: Stop active mirroring after force-cancel
      mirror: Do not clear .cancelled
      iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h                                     |  29 ++++++---
 block/backup.c                                         |   3 +-
 block/mirror.c                                         |  56 +++++++++--------
 block/replication.c                                    |   4 +-
 blockdev.c                                             |   4 +-
 job.c                                                  |  64 +++++++++++++++----
 tests/unit/test-blockjob.c                             |   2 +-
 tests/qemu-iotests/109.out                             |  60 ++++++++----------
 tests/qemu-iotests/tests/mirror-ready-cancel-error     | 143 +++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/tests/mirror-ready-cancel-error.out |   5 ++
 tests/qemu-iotests/tests/qsd-jobs.out                  |   2 +-
 11 files changed, 286 insertions(+), 86 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

Comments

Richard Henderson Sept. 22, 2021, 4:05 p.m. UTC | #1
On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:
> 
>    Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 16:17:05 +0100)
> 
> are available in the Git repository at:
> 
>    https://src.openvz.org/scm/~vsementsov/qemu.git  tags/pull-jobs-2021-09-21
> 
> for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:
> 
>    iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)
> 
> ----------------------------------------------------------------
> mirror: Handle errors after READY cancel
> 
> ----------------------------------------------------------------
> Hanna Reitz (12):
>        job: Context changes in job_completed_txn_abort()
>        mirror: Keep s->synced on error
>        mirror: Drop s->synced
>        job: Force-cancel jobs in a failed transaction
>        job: @force parameter for job_cancel_sync()
>        jobs: Give Job.force_cancel more meaning
>        job: Add job_cancel_requested()
>        mirror: Use job_is_cancelled()
>        mirror: Check job_is_cancelled() earlier
>        mirror: Stop active mirroring after force-cancel
>        mirror: Do not clear .cancelled
>        iotests: Add mirror-ready-cancel-error test

This fails testing with errors like so:

Running test test-replication
test-replication: ../job.c:186: job_state_transition: Assertion `JobSTT[s0][s1]' failed.
ERROR test-replication - too few tests run (expected 13, got 8)
make: *** [Makefile.mtest:816: run-test-100] Error 1
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures


r~
Vladimir Sementsov-Ogievskiy Sept. 22, 2021, 7:19 p.m. UTC | #2
22.09.2021 19:05, Richard Henderson wrote:
> On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>> The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:
>>
>>    Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 16:17:05 +0100)
>>
>> are available in the Git repository at:
>>
>>    https://src.openvz.org/scm/~vsementsov/qemu.git  tags/pull-jobs-2021-09-21
>>
>> for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:
>>
>>    iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)
>>
>> ----------------------------------------------------------------
>> mirror: Handle errors after READY cancel
>>
>> ----------------------------------------------------------------
>> Hanna Reitz (12):
>>        job: Context changes in job_completed_txn_abort()
>>        mirror: Keep s->synced on error
>>        mirror: Drop s->synced
>>        job: Force-cancel jobs in a failed transaction
>>        job: @force parameter for job_cancel_sync()
>>        jobs: Give Job.force_cancel more meaning
>>        job: Add job_cancel_requested()
>>        mirror: Use job_is_cancelled()
>>        mirror: Check job_is_cancelled() earlier
>>        mirror: Stop active mirroring after force-cancel
>>        mirror: Do not clear .cancelled
>>        iotests: Add mirror-ready-cancel-error test
> 
> This fails testing with errors like so:
> 
> Running test test-replication
> test-replication: ../job.c:186: job_state_transition: Assertion `JobSTT[s0][s1]' failed.
> ERROR test-replication - too few tests run (expected 13, got 8)
> make: *** [Makefile.mtest:816: run-test-100] Error 1
> Cleaning up project directory and file based variables
> ERROR: Job failed: exit code 1
> 
> https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
> 


Interesting :(

I've reproduced, starting test-replication in several parallel loops. (it doesn't reproduce for me if just start in one loop). So, that's some racy bug..

Hmm, and seems it doesn't reproduce so simple on master. I'll try to bisect the series tomorrow.

====

(gdb) bt
#0  0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
#1  0x00007f034a3b9954 in abort () from /lib64/libc.so.6
#2  0x00007f034a3b9789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
#4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
#5  0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at ../job.c:652
#6  0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) at ../job.c:722
#7  0x000055d3b503ecd1 in job_completed_txn_abort (job=0x55d3b5e67020) at ../job.c:801
#8  0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, force=false) at ../job.c:973
#9  0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020, errp=0x7fffcc997a80) at ../job.c:992
#10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
#11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, force=false) at ../job.c:1008
#12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at ../block/replication.c:152
#13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at ../block.c:4677
#14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at ../block.c:5100
#15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at ../block.c:6495
#16 0x000055d3b5023a38 in bdrv_root_unref_child (child=0x55d3b5e4c690) at ../block.c:3010
#17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at ../block/block-backend.c:845
#18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at ../block/block-backend.c:461
#19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at ../block/block-backend.c:516
#20 0x000055d3b4fdb20a in teardown_secondary () at ../tests/unit/test-replication.c:367
#21 0x000055d3b4fdb632 in test_secondary_continuous_replication () at ../tests/unit/test-replication.c:504
#22 0x00007f034b26979e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#23 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#24 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#25 0x00007f034b269c8a in g_test_run_suite () from /lib64/libglib-2.0.so.0
#26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
#27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at ../tests/unit/test-replication.c:613
(gdb) fr 4
#4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
186         assert(JobSTT[s0][s1]);
(gdb) list
181         JobStatus s0 = job->status;
182         assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
183         trace_job_state_transition(job, job->ret,
184                                    JobSTT[s0][s1] ? "allowed" : "disallowed",
185                                    JobStatus_str(s0), JobStatus_str(s1));
186         assert(JobSTT[s0][s1]);
187         job->status = s1;
188
189         if (!job_is_internal(job) && s1 != s0) {
190             qapi_event_send_job_status_change(job->id, job->status);
(gdb) p s0
$1 = JOB_STATUS_NULL
(gdb) p s1
$2 = JOB_STATUS_CONCLUDED
Vladimir Sementsov-Ogievskiy Sept. 23, 2021, 10:01 p.m. UTC | #3
22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:
> 22.09.2021 19:05, Richard Henderson wrote:
>> On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:
>>>
>>>    Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 16:17:05 +0100)
>>>
>>> are available in the Git repository at:
>>>
>>>    https://src.openvz.org/scm/~vsementsov/qemu.git  tags/pull-jobs-2021-09-21
>>>
>>> for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:
>>>
>>>    iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)
>>>
>>> ----------------------------------------------------------------
>>> mirror: Handle errors after READY cancel
>>>
>>> ----------------------------------------------------------------
>>> Hanna Reitz (12):
>>>        job: Context changes in job_completed_txn_abort()
>>>        mirror: Keep s->synced on error
>>>        mirror: Drop s->synced
>>>        job: Force-cancel jobs in a failed transaction
>>>        job: @force parameter for job_cancel_sync()
>>>        jobs: Give Job.force_cancel more meaning
>>>        job: Add job_cancel_requested()
>>>        mirror: Use job_is_cancelled()
>>>        mirror: Check job_is_cancelled() earlier
>>>        mirror: Stop active mirroring after force-cancel
>>>        mirror: Do not clear .cancelled
>>>        iotests: Add mirror-ready-cancel-error test
>>
>> This fails testing with errors like so:
>>
>> Running test test-replication
>> test-replication: ../job.c:186: job_state_transition: Assertion `JobSTT[s0][s1]' failed.
>> ERROR test-replication - too few tests run (expected 13, got 8)
>> make: *** [Makefile.mtest:816: run-test-100] Error 1
>> Cleaning up project directory and file based variables
>> ERROR: Job failed: exit code 1
>>
>> https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
>>
> 
> 
> Interesting :(
> 
> I've reproduced, starting test-replication in several parallel loops. (it doesn't reproduce for me if just start in one loop). So, that's some racy bug..
> 
> Hmm, and seems it doesn't reproduce so simple on master. I'll try to bisect the series tomorrow.
> 
> ====
> 
> (gdb) bt
> #0  0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
> #1  0x00007f034a3b9954 in abort () from /lib64/libc.so.6
> #2  0x00007f034a3b9789 in __assert_fail_base.cold () from /lib64/libc.so.6
> #3  0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
> #5  0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at ../job.c:652
> #6  0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) at ../job.c:722
> #7  0x000055d3b503ecd1 in job_completed_txn_abort (job=0x55d3b5e67020) at ../job.c:801
> #8  0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, force=false) at ../job.c:973
> #9  0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020, errp=0x7fffcc997a80) at ../job.c:992
> #10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
> #11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, force=false) at ../job.c:1008
> #12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at ../block/replication.c:152
> #13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at ../block.c:4677
> #14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at ../block.c:5100
> #15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at ../block.c:6495
> #16 0x000055d3b5023a38 in bdrv_root_unref_child (child=0x55d3b5e4c690) at ../block.c:3010
> #17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at ../block/block-backend.c:845
> #18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at ../block/block-backend.c:461
> #19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at ../block/block-backend.c:516
> #20 0x000055d3b4fdb20a in teardown_secondary () at ../tests/unit/test-replication.c:367
> #21 0x000055d3b4fdb632 in test_secondary_continuous_replication () at ../tests/unit/test-replication.c:504
> #22 0x00007f034b26979e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #23 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #24 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #25 0x00007f034b269c8a in g_test_run_suite () from /lib64/libglib-2.0.so.0
> #26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
> #27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at ../tests/unit/test-replication.c:613
> (gdb) fr 4
> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
> 186         assert(JobSTT[s0][s1]);
> (gdb) list
> 181         JobStatus s0 = job->status;
> 182         assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
> 183         trace_job_state_transition(job, job->ret,
> 184                                    JobSTT[s0][s1] ? "allowed" : "disallowed",
> 185                                    JobStatus_str(s0), JobStatus_str(s1));
> 186         assert(JobSTT[s0][s1]);
> 187         job->status = s1;
> 188
> 189         if (!job_is_internal(job) && s1 != s0) {
> 190             qapi_event_send_job_status_change(job->id, job->status);
> (gdb) p s0
> $1 = JOB_STATUS_NULL
> (gdb) p s1
> $2 = JOB_STATUS_CONCLUDED
> 
> 
> 


bisect points to "job: Add job_cancel_requested()"

And "bisecting" within this commit shows that the following helps:

diff --git a/job.c b/job.c
index be878ca5fc..bb52a1b58f 100644
--- a/job.c
+++ b/job.c
@@ -655,7 +655,7 @@ static void job_conclude(Job *job)
  
  static void job_update_rc(Job *job)
  {
-    if (!job->ret && job_is_cancelled(job)) {
+    if (!job->ret && job_cancel_requested(job)) {
          job->ret = -ECANCELED;
      }
      if (job->ret) {


- this returns job_update_rc to pre-patch behavior.

But why, I don't know:) More investigation is needed. probably replication code is doing something wrong..
Hanna Czenczek Oct. 4, 2021, 4:47 p.m. UTC | #4
On 24.09.21 00:01, Vladimir Sementsov-Ogievskiy wrote:
> 22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:
>> 22.09.2021 19:05, Richard Henderson wrote:
>>> On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>> The following changes since commit 
>>>> 326ff8dd09556fc2e257196c49f35009700794ac:
>>>>
>>>>    Merge remote-tracking branch 
>>>> 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 
>>>> 16:17:05 +0100)
>>>>
>>>> are available in the Git repository at:
>>>>
>>>>    https://src.openvz.org/scm/~vsementsov/qemu.git 
>>>> tags/pull-jobs-2021-09-21
>>>>
>>>> for you to fetch changes up to 
>>>> c9489c04319cac75c76af8fc27c254f46e10214c:
>>>>
>>>>    iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 
>>>> +0300)
>>>>
>>>> ----------------------------------------------------------------
>>>> mirror: Handle errors after READY cancel
>>>>
>>>> ----------------------------------------------------------------
>>>> Hanna Reitz (12):
>>>>        job: Context changes in job_completed_txn_abort()
>>>>        mirror: Keep s->synced on error
>>>>        mirror: Drop s->synced
>>>>        job: Force-cancel jobs in a failed transaction
>>>>        job: @force parameter for job_cancel_sync()
>>>>        jobs: Give Job.force_cancel more meaning
>>>>        job: Add job_cancel_requested()
>>>>        mirror: Use job_is_cancelled()
>>>>        mirror: Check job_is_cancelled() earlier
>>>>        mirror: Stop active mirroring after force-cancel
>>>>        mirror: Do not clear .cancelled
>>>>        iotests: Add mirror-ready-cancel-error test
>>>
>>> This fails testing with errors like so:
>>>
>>> Running test test-replication
>>> test-replication: ../job.c:186: job_state_transition: Assertion 
>>> `JobSTT[s0][s1]' failed.
>>> ERROR test-replication - too few tests run (expected 13, got 8)
>>> make: *** [Makefile.mtest:816: run-test-100] Error 1
>>> Cleaning up project directory and file based variables
>>> ERROR: Job failed: exit code 1
>>>
>>> https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
>>>
>>
>>
>> Interesting :(
>>
>> I've reproduced, starting test-replication in several parallel loops. 
>> (it doesn't reproduce for me if just start in one loop). So, that's 
>> some racy bug..
>>
>> Hmm, and seems it doesn't reproduce so simple on master. I'll try to 
>> bisect the series tomorrow.
>>
>> ====
>>
>> (gdb) bt
>> #0  0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
>> #1  0x00007f034a3b9954 in abort () from /lib64/libc.so.6
>> #2  0x00007f034a3b9789 in __assert_fail_base.cold () from 
>> /lib64/libc.so.6
>> #3  0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
>> s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>> #5  0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at 
>> ../job.c:652
>> #6  0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) at 
>> ../job.c:722
>> #7  0x000055d3b503ecd1 in job_completed_txn_abort 
>> (job=0x55d3b5e67020) at ../job.c:801
>> #8  0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, 
>> force=false) at ../job.c:973
>> #9  0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020, 
>> errp=0x7fffcc997a80) at ../job.c:992
>> #10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, 
>> finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
>> #11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, 
>> force=false) at ../job.c:1008
>> #12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at 
>> ../block/replication.c:152
>> #13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at 
>> ../block.c:4677
>> #14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at 
>> ../block.c:5100
>> #15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at 
>> ../block.c:6495
>> #16 0x000055d3b5023a38 in bdrv_root_unref_child 
>> (child=0x55d3b5e4c690) at ../block.c:3010
>> #17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at 
>> ../block/block-backend.c:845
>> #18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at 
>> ../block/block-backend.c:461
>> #19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at 
>> ../block/block-backend.c:516
>> #20 0x000055d3b4fdb20a in teardown_secondary () at 
>> ../tests/unit/test-replication.c:367
>> #21 0x000055d3b4fdb632 in test_secondary_continuous_replication () at 
>> ../tests/unit/test-replication.c:504
>> #22 0x00007f034b26979e in g_test_run_suite_internal () from 
>> /lib64/libglib-2.0.so.0
>> #23 0x00007f034b26959b in g_test_run_suite_internal () from 
>> /lib64/libglib-2.0.so.0
>> #24 0x00007f034b26959b in g_test_run_suite_internal () from 
>> /lib64/libglib-2.0.so.0
>> #25 0x00007f034b269c8a in g_test_run_suite () from 
>> /lib64/libglib-2.0.so.0
>> #26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
>> #27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at 
>> ../tests/unit/test-replication.c:613
>> (gdb) fr 4
>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
>> s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>> 186         assert(JobSTT[s0][s1]);
>> (gdb) list
>> 181         JobStatus s0 = job->status;
>> 182         assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
>> 183         trace_job_state_transition(job, job->ret,
>> 184                                    JobSTT[s0][s1] ? "allowed" : 
>> "disallowed",
>> 185                                    JobStatus_str(s0), 
>> JobStatus_str(s1));
>> 186         assert(JobSTT[s0][s1]);
>> 187         job->status = s1;
>> 188
>> 189         if (!job_is_internal(job) && s1 != s0) {
>> 190             qapi_event_send_job_status_change(job->id, job->status);
>> (gdb) p s0
>> $1 = JOB_STATUS_NULL
>> (gdb) p s1
>> $2 = JOB_STATUS_CONCLUDED
>>
>>
>>
>
>
> bisect points to "job: Add job_cancel_requested()"
>
> And "bisecting" within this commit shows that the following helps:
>
> diff --git a/job.c b/job.c
> index be878ca5fc..bb52a1b58f 100644
> --- a/job.c
> +++ b/job.c
> @@ -655,7 +655,7 @@ static void job_conclude(Job *job)
>
>  static void job_update_rc(Job *job)
>  {
> -    if (!job->ret && job_is_cancelled(job)) {
> +    if (!job->ret && job_cancel_requested(job)) {
>          job->ret = -ECANCELED;
>      }
>      if (job->ret) {
>
>
> - this returns job_update_rc to pre-patch behavior.
>
> But why, I don't know:) More investigation is needed. probably 
> replication code is doing something wrong..

 From what I can tell, this is what happens:

(1) The mirror job completes, we go to job_co_entry(), and schedule 
job_exit().  It doesn’t run yet, though.
(2) replication_close() cancels the job.
(3) We get to job_completed_txn_abort().
(4) The job isn’t completed yet, so we invoke job_finish_sync().
(5) Now job_exit() finally gets to run, and this is how we end up in a 
situation where .cancelled is true, but .force_cancel is false: Yes, 
mirror clears .cancelled before exiting its main loop, but if the job is 
cancelled between it having been deferred to the main loop and 
job_exit() running, it may become true again.
(6) job_exit() leads to job_completed(), which invokes job_update_rc(), 
which however leaves job->ret == 0.
(7) job_completed() also calls job_completed_txn_success(), which is 
weird, because we still have job_completed_txn_abort() running 
concurrently...
(8) job_completed_txn_success() invokes job_do_finalize(), which goes to 
job_finalize_single(), which leaves the job in status null.
(9) job_finish_sync() is done, so we land back in 
job_completed_txn_abort(): We call job_finalize_single(), which tries to 
conclude the job, and that gives us the failed assertion (attempted 
transition from null to concluded).

(When everything works, it seems like the job is completed before 
replication_close() can cancel it.  Cancelling is then a no-op and 
nothing breaks.)

So now we could say the problem is that once a job completes and is 
deferred to the main loop, non-force cancel should do nothing. 
job_cancel_async() should not set job->cancelled to true if `!force && 
job->deferred_to_main_loop`.  job_cancel() should invoke 
job_completed_txn_abort() not if `job->deferred_to_main_loop`, but if 
`job->deferred_to_main_loop && job_is_cancelled(job)`. (Doing this seems 
to fix the bug for me.)

That I think would conform to the reasoning laid out in patch 7’s commit 
message, namely that some functions are called after the job has been 
deferred to the main loop, and because mirror clears .cancelled when it 
has been soft-cancelled, it’d be impossible to observe 
`.deferred_to_main_loop == true && .cancelled == true && 
.force_cancelled == false`.


Or we continue having soft-cancelled jobs still be -ECANCELED, which 
seems like the safe choice?  But it goes against what we’ve decided for 
patch 7, namely that soft-cancelled jobs should be treated like they’d 
complete as normal.

Hanna
Vladimir Sementsov-Ogievskiy Oct. 4, 2021, 5:59 p.m. UTC | #5
10/4/21 19:47, Hanna Reitz wrote:
> On 24.09.21 00:01, Vladimir Sementsov-Ogievskiy wrote:
>> 22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:
>>> 22.09.2021 19:05, Richard Henderson wrote:
>>>> On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:
>>>>>
>>>>>    Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 16:17:05 +0100)
>>>>>
>>>>> are available in the Git repository at:
>>>>>
>>>>>    https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-jobs-2021-09-21
>>>>>
>>>>> for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:
>>>>>
>>>>>    iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)
>>>>>
>>>>> ----------------------------------------------------------------
>>>>> mirror: Handle errors after READY cancel
>>>>>
>>>>> ----------------------------------------------------------------
>>>>> Hanna Reitz (12):
>>>>>        job: Context changes in job_completed_txn_abort()
>>>>>        mirror: Keep s->synced on error
>>>>>        mirror: Drop s->synced
>>>>>        job: Force-cancel jobs in a failed transaction
>>>>>        job: @force parameter for job_cancel_sync()
>>>>>        jobs: Give Job.force_cancel more meaning
>>>>>        job: Add job_cancel_requested()
>>>>>        mirror: Use job_is_cancelled()
>>>>>        mirror: Check job_is_cancelled() earlier
>>>>>        mirror: Stop active mirroring after force-cancel
>>>>>        mirror: Do not clear .cancelled
>>>>>        iotests: Add mirror-ready-cancel-error test
>>>>
>>>> This fails testing with errors like so:
>>>>
>>>> Running test test-replication
>>>> test-replication: ../job.c:186: job_state_transition: Assertion `JobSTT[s0][s1]' failed.
>>>> ERROR test-replication - too few tests run (expected 13, got 8)
>>>> make: *** [Makefile.mtest:816: run-test-100] Error 1
>>>> Cleaning up project directory and file based variables
>>>> ERROR: Job failed: exit code 1
>>>>
>>>> https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
>>>>
>>>
>>>
>>> Interesting :(
>>>
>>> I've reproduced, starting test-replication in several parallel loops. (it doesn't reproduce for me if just start in one loop). So, that's some racy bug..
>>>
>>> Hmm, and seems it doesn't reproduce so simple on master. I'll try to bisect the series tomorrow.
>>>
>>> ====
>>>
>>> (gdb) bt
>>> #0  0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
>>> #1  0x00007f034a3b9954 in abort () from /lib64/libc.so.6
>>> #2  0x00007f034a3b9789 in __assert_fail_base.cold () from /lib64/libc.so.6
>>> #3  0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
>>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>>> #5  0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at ../job.c:652
>>> #6  0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) at ../job.c:722
>>> #7  0x000055d3b503ecd1 in job_completed_txn_abort (job=0x55d3b5e67020) at ../job.c:801
>>> #8  0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, force=false) at ../job.c:973
>>> #9  0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020, errp=0x7fffcc997a80) at ../job.c:992
>>> #10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
>>> #11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, force=false) at ../job.c:1008
>>> #12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at ../block/replication.c:152
>>> #13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at ../block.c:4677
>>> #14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at ../block.c:5100
>>> #15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at ../block.c:6495
>>> #16 0x000055d3b5023a38 in bdrv_root_unref_child (child=0x55d3b5e4c690) at ../block.c:3010
>>> #17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at ../block/block-backend.c:845
>>> #18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at ../block/block-backend.c:461
>>> #19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at ../block/block-backend.c:516
>>> #20 0x000055d3b4fdb20a in teardown_secondary () at ../tests/unit/test-replication.c:367
>>> #21 0x000055d3b4fdb632 in test_secondary_continuous_replication () at ../tests/unit/test-replication.c:504
>>> #22 0x00007f034b26979e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>> #23 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>> #24 0x00007f034b26959b in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>> #25 0x00007f034b269c8a in g_test_run_suite () from /lib64/libglib-2.0.so.0
>>> #26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
>>> #27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at ../tests/unit/test-replication.c:613
>>> (gdb) fr 4
>>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>>> 186         assert(JobSTT[s0][s1]);
>>> (gdb) list
>>> 181         JobStatus s0 = job->status;
>>> 182         assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
>>> 183         trace_job_state_transition(job, job->ret,
>>> 184                                    JobSTT[s0][s1] ? "allowed" : "disallowed",
>>> 185                                    JobStatus_str(s0), JobStatus_str(s1));
>>> 186         assert(JobSTT[s0][s1]);
>>> 187         job->status = s1;
>>> 188
>>> 189         if (!job_is_internal(job) && s1 != s0) {
>>> 190             qapi_event_send_job_status_change(job->id, job->status);
>>> (gdb) p s0
>>> $1 = JOB_STATUS_NULL
>>> (gdb) p s1
>>> $2 = JOB_STATUS_CONCLUDED
>>>
>>>
>>>
>>
>>
>> bisect points to "job: Add job_cancel_requested()"
>>
>> And "bisecting" within this commit shows that the following helps:
>>
>> diff --git a/job.c b/job.c
>> index be878ca5fc..bb52a1b58f 100644
>> --- a/job.c
>> +++ b/job.c
>> @@ -655,7 +655,7 @@ static void job_conclude(Job *job)
>>
>>  static void job_update_rc(Job *job)
>>  {
>> -    if (!job->ret && job_is_cancelled(job)) {
>> +    if (!job->ret && job_cancel_requested(job)) {
>>          job->ret = -ECANCELED;
>>      }
>>      if (job->ret) {
>>
>>
>> - this returns job_update_rc to pre-patch behavior.
>>
>> But why, I don't know:) More investigation is needed. probably replication code is doing something wrong..
> 
>  From what I can tell, this is what happens:
> 
> (1) The mirror job completes, we go to job_co_entry(), and schedule job_exit().  It doesn’t run yet, though.
> (2) replication_close() cancels the job.
> (3) We get to job_completed_txn_abort().
> (4) The job isn’t completed yet, so we invoke job_finish_sync().
> (5) Now job_exit() finally gets to run, and this is how we end up in a situation where .cancelled is true, but .force_cancel is false: Yes, mirror clears .cancelled before exiting its main loop, but if the job is cancelled between it having been deferred to the main loop and job_exit() running, it may become true again.
> (6) job_exit() leads to job_completed(), which invokes job_update_rc(), which however leaves job->ret == 0.
> (7) job_completed() also calls job_completed_txn_success(), which is weird, because we still have job_completed_txn_abort() running concurrently...
> (8) job_completed_txn_success() invokes job_do_finalize(), which goes to job_finalize_single(), which leaves the job in status null.
> (9) job_finish_sync() is done, so we land back in job_completed_txn_abort(): We call job_finalize_single(), which tries to conclude the job, and that gives us the failed assertion (attempted transition from null to concluded).
> 
> (When everything works, it seems like the job is completed before replication_close() can cancel it.  Cancelling is then a no-op and nothing breaks.)
> 
> So now we could say the problem is that once a job completes and is deferred to the main loop, non-force cancel should do nothing. job_cancel_async() should not set job->cancelled to true if `!force && job->deferred_to_main_loop`.  job_cancel() should invoke job_completed_txn_abort() not if `job->deferred_to_main_loop`, but if `job->deferred_to_main_loop && job_is_cancelled(job)`. (Doing this seems to fix the bug for me.)
> 
> That I think would conform to the reasoning laid out in patch 7’s commit message, namely that some functions are called after the job has been deferred to the main loop, and because mirror clears .cancelled when it has been soft-cancelled, it’d be impossible to observe `.deferred_to_main_loop == true && .cancelled == true && .force_cancelled == false`.
> 
> 
> Or we continue having soft-cancelled jobs still be -ECANCELED, which seems like the safe choice?  But it goes against what we’ve decided for patch 7, namely that soft-cancelled jobs should be treated like they’d complete as normal.
> 

I think, I can live with either way:)  I still think that best way is implementing "no graph change" mode for mirror instead of soft-cancelling and deprecate soft-cancel (together with block-job-cancel), but that doesn't work in short-term.
Hanna Czenczek Oct. 5, 2021, 3:03 p.m. UTC | #6
On 04.10.21 19:59, Vladimir Sementsov-Ogievskiy wrote:
> 10/4/21 19:47, Hanna Reitz wrote:
>> On 24.09.21 00:01, Vladimir Sementsov-Ogievskiy wrote:
>>> 22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:
>>>> 22.09.2021 19:05, Richard Henderson wrote:
>>>>> On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> The following changes since commit 
>>>>>> 326ff8dd09556fc2e257196c49f35009700794ac:
>>>>>>
>>>>>>    Merge remote-tracking branch 
>>>>>> 'remotes/jasowang/tags/net-pull-request' into staging (2021-09-20 
>>>>>> 16:17:05 +0100)
>>>>>>
>>>>>> are available in the Git repository at:
>>>>>>
>>>>>>    https://src.openvz.org/scm/~vsementsov/qemu.git 
>>>>>> tags/pull-jobs-2021-09-21
>>>>>>
>>>>>> for you to fetch changes up to 
>>>>>> c9489c04319cac75c76af8fc27c254f46e10214c:
>>>>>>
>>>>>>    iotests: Add mirror-ready-cancel-error test (2021-09-21 
>>>>>> 11:56:11 +0300)
>>>>>>
>>>>>> ----------------------------------------------------------------
>>>>>> mirror: Handle errors after READY cancel
>>>>>>
>>>>>> ----------------------------------------------------------------
>>>>>> Hanna Reitz (12):
>>>>>>        job: Context changes in job_completed_txn_abort()
>>>>>>        mirror: Keep s->synced on error
>>>>>>        mirror: Drop s->synced
>>>>>>        job: Force-cancel jobs in a failed transaction
>>>>>>        job: @force parameter for job_cancel_sync()
>>>>>>        jobs: Give Job.force_cancel more meaning
>>>>>>        job: Add job_cancel_requested()
>>>>>>        mirror: Use job_is_cancelled()
>>>>>>        mirror: Check job_is_cancelled() earlier
>>>>>>        mirror: Stop active mirroring after force-cancel
>>>>>>        mirror: Do not clear .cancelled
>>>>>>        iotests: Add mirror-ready-cancel-error test
>>>>>
>>>>> This fails testing with errors like so:
>>>>>
>>>>> Running test test-replication
>>>>> test-replication: ../job.c:186: job_state_transition: Assertion 
>>>>> `JobSTT[s0][s1]' failed.
>>>>> ERROR test-replication - too few tests run (expected 13, got 8)
>>>>> make: *** [Makefile.mtest:816: run-test-100] Error 1
>>>>> Cleaning up project directory and file based variables
>>>>> ERROR: Job failed: exit code 1
>>>>>
>>>>> https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
>>>>>
>>>>
>>>>
>>>> Interesting :(
>>>>
>>>> I've reproduced, starting test-replication in several parallel 
>>>> loops. (it doesn't reproduce for me if just start in one loop). So, 
>>>> that's some racy bug..
>>>>
>>>> Hmm, and seems it doesn't reproduce so simple on master. I'll try 
>>>> to bisect the series tomorrow.
>>>>
>>>> ====
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
>>>> #1  0x00007f034a3b9954 in abort () from /lib64/libc.so.6
>>>> #2  0x00007f034a3b9789 in __assert_fail_base.cold () from 
>>>> /lib64/libc.so.6
>>>> #3  0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
>>>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
>>>> s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>>>> #5  0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at 
>>>> ../job.c:652
>>>> #6  0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) 
>>>> at ../job.c:722
>>>> #7  0x000055d3b503ecd1 in job_completed_txn_abort 
>>>> (job=0x55d3b5e67020) at ../job.c:801
>>>> #8  0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, 
>>>> force=false) at ../job.c:973
>>>> #9  0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020, 
>>>> errp=0x7fffcc997a80) at ../job.c:992
>>>> #10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, 
>>>> finish=0x55d3b503f33f <job_cancel_err>, errp=0x0) at ../job.c:1054
>>>> #11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, 
>>>> force=false) at ../job.c:1008
>>>> #12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at 
>>>> ../block/replication.c:152
>>>> #13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at 
>>>> ../block.c:4677
>>>> #14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at 
>>>> ../block.c:5100
>>>> #15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at 
>>>> ../block.c:6495
>>>> #16 0x000055d3b5023a38 in bdrv_root_unref_child 
>>>> (child=0x55d3b5e4c690) at ../block.c:3010
>>>> #17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at 
>>>> ../block/block-backend.c:845
>>>> #18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at 
>>>> ../block/block-backend.c:461
>>>> #19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at 
>>>> ../block/block-backend.c:516
>>>> #20 0x000055d3b4fdb20a in teardown_secondary () at 
>>>> ../tests/unit/test-replication.c:367
>>>> #21 0x000055d3b4fdb632 in test_secondary_continuous_replication () 
>>>> at ../tests/unit/test-replication.c:504
>>>> #22 0x00007f034b26979e in g_test_run_suite_internal () from 
>>>> /lib64/libglib-2.0.so.0
>>>> #23 0x00007f034b26959b in g_test_run_suite_internal () from 
>>>> /lib64/libglib-2.0.so.0
>>>> #24 0x00007f034b26959b in g_test_run_suite_internal () from 
>>>> /lib64/libglib-2.0.so.0
>>>> #25 0x00007f034b269c8a in g_test_run_suite () from 
>>>> /lib64/libglib-2.0.so.0
>>>> #26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
>>>> #27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at 
>>>> ../tests/unit/test-replication.c:613
>>>> (gdb) fr 4
>>>> #4  0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020, 
>>>> s1=JOB_STATUS_CONCLUDED) at ../job.c:186
>>>> 186         assert(JobSTT[s0][s1]);
>>>> (gdb) list
>>>> 181         JobStatus s0 = job->status;
>>>> 182         assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
>>>> 183         trace_job_state_transition(job, job->ret,
>>>> 184                                    JobSTT[s0][s1] ? "allowed" : 
>>>> "disallowed",
>>>> 185                                    JobStatus_str(s0), 
>>>> JobStatus_str(s1));
>>>> 186         assert(JobSTT[s0][s1]);
>>>> 187         job->status = s1;
>>>> 188
>>>> 189         if (!job_is_internal(job) && s1 != s0) {
>>>> 190 qapi_event_send_job_status_change(job->id, job->status);
>>>> (gdb) p s0
>>>> $1 = JOB_STATUS_NULL
>>>> (gdb) p s1
>>>> $2 = JOB_STATUS_CONCLUDED
>>>>
>>>>
>>>>
>>>
>>>
>>> bisect points to "job: Add job_cancel_requested()"
>>>
>>> And "bisecting" within this commit shows that the following helps:
>>>
>>> diff --git a/job.c b/job.c
>>> index be878ca5fc..bb52a1b58f 100644
>>> --- a/job.c
>>> +++ b/job.c
>>> @@ -655,7 +655,7 @@ static void job_conclude(Job *job)
>>>
>>>  static void job_update_rc(Job *job)
>>>  {
>>> -    if (!job->ret && job_is_cancelled(job)) {
>>> +    if (!job->ret && job_cancel_requested(job)) {
>>>          job->ret = -ECANCELED;
>>>      }
>>>      if (job->ret) {
>>>
>>>
>>> - this returns job_update_rc to pre-patch behavior.
>>>
>>> But why, I don't know:) More investigation is needed. probably 
>>> replication code is doing something wrong..
>>
>>  From what I can tell, this is what happens:
>>
>> (1) The mirror job completes, we go to job_co_entry(), and schedule 
>> job_exit().  It doesn’t run yet, though.
>> (2) replication_close() cancels the job.
>> (3) We get to job_completed_txn_abort().
>> (4) The job isn’t completed yet, so we invoke job_finish_sync().
>> (5) Now job_exit() finally gets to run, and this is how we end up in 
>> a situation where .cancelled is true, but .force_cancel is false: 
>> Yes, mirror clears .cancelled before exiting its main loop, but if 
>> the job is cancelled between it having been deferred to the main loop 
>> and job_exit() running, it may become true again.
>> (6) job_exit() leads to job_completed(), which invokes 
>> job_update_rc(), which however leaves job->ret == 0.
>> (7) job_completed() also calls job_completed_txn_success(), which is 
>> weird, because we still have job_completed_txn_abort() running 
>> concurrently...
>> (8) job_completed_txn_success() invokes job_do_finalize(), which goes 
>> to job_finalize_single(), which leaves the job in status null.
>> (9) job_finish_sync() is done, so we land back in 
>> job_completed_txn_abort(): We call job_finalize_single(), which tries 
>> to conclude the job, and that gives us the failed assertion 
>> (attempted transition from null to concluded).
>>
>> (When everything works, it seems like the job is completed before 
>> replication_close() can cancel it.  Cancelling is then a no-op and 
>> nothing breaks.)
>>
>> So now we could say the problem is that once a job completes and is 
>> deferred to the main loop, non-force cancel should do nothing. 
>> job_cancel_async() should not set job->cancelled to true if `!force 
>> && job->deferred_to_main_loop`. job_cancel() should invoke 
>> job_completed_txn_abort() not if `job->deferred_to_main_loop`, but if 
>> `job->deferred_to_main_loop && job_is_cancelled(job)`. (Doing this 
>> seems to fix the bug for me.)
>>
>> That I think would conform to the reasoning laid out in patch 7’s 
>> commit message, namely that some functions are called after the job 
>> has been deferred to the main loop, and because mirror clears 
>> .cancelled when it has been soft-cancelled, it’d be impossible to 
>> observe `.deferred_to_main_loop == true && .cancelled == true && 
>> .force_cancelled == false`.
>>
>>
>> Or we continue having soft-cancelled jobs still be -ECANCELED, which 
>> seems like the safe choice?  But it goes against what we’ve decided 
>> for patch 7, namely that soft-cancelled jobs should be treated like 
>> they’d complete as normal.
>>
>
> I think, I can live with either way:)  I still think that best way is 
> implementing "no graph change" mode for mirror instead of 
> soft-cancelling and deprecate soft-cancel (together with 
> block-job-cancel), but that doesn't work in short-term.

Yes, sure.  When it’s just another completion mode, we won’t have this 
problem at all anyway, because then we’d skip job_cancel() altogether 
(and so nothing in the code would get the idea to invoke 
job_completed_txn_abort()).

(Btw, I’ll send a new version of the mirror series tomorrow.)

Hanna