segfault in parallel blockjobs (iotest 30)

Message ID dd350838-9132-e787-10b1-4cb200076ec9@virtuozzo.com (mailing list archive)
State New, archived

Commit Message

Anton Nefedov Nov. 7, 2017, 4:19 p.m. UTC
Hi,

One more drainage-related thing.
We have recently encountered an issue with parallel block jobs, and it's
not quite clear how to fix it properly, so we would appreciate your ideas.

The small attached tweak makes iotest 30 (-qcow2 -nocache) fail 10/10
times on my machine.

And while most of the errors are just a desync between qemu and the
test, TestParallelOps is more worrying: qemu segfaults while draining
before reopening the image for the new block-stream job:

> Program received signal SIGSEGV, Segmentation fault.
> bdrv_next (it=it@entry=0x7fffcea39330) at /mnt/code/us-qemu/block/block-backend.c:453
> 453                 bs = it->blk ? blk_bs(it->blk) : NULL;
> (gdb) bt
> #0  bdrv_next (it=it@entry=0x7fffcea39330) at /mnt/code/us-qemu/block/block-backend.c:453
> #1  0x00007f2797ca1bd7 in bdrv_drain_all_begin () at /mnt/code/us-qemu/block/io.c:347
> #2  0x00007f2797c5470e in bdrv_reopen_multiple (ctx=0x7f279904bd40, bs_queue=0x7f2799b9d8d0, 
>     errp=errp@entry=0x7fffcea393e0) at /mnt/code/us-qemu/block.c:2871
> #3  0x00007f2797c548d4 in bdrv_reopen (bs=bs@entry=0x7f27990a0250, bdrv_flags=bdrv_flags@entry=8226, 
>     errp=errp@entry=0x7fffcea394e8) at /mnt/code/us-qemu/block.c:2916
> #4  0x00007f2797ab49bb in stream_start (job_id=job_id@entry=0x7f2799c2bc10 "stream-node6", 
>     bs=bs@entry=0x7f27990a0250, base=base@entry=0x7f27990de250, 
>     backing_file_str=backing_file_str@entry=0x7f2799a0a350 "/vz/anefedov/qemu-build/us/tests/qemu-iotests/scratch/img-4.img", speed=speed@entry=524288, on_error=on_error@entry=BLOCKDEV_ON_ERROR_REPORT, 
>     errp=errp@entry=0x7fffcea394e8) at /mnt/code/us-qemu/block/stream.c:239
> #5  0x00007f2797a86c96 in qmp_block_stream (has_job_id=<optimized out>, 
> [...]
> (gdb) p it->blk
> $1 = (BlockBackend *) 0x9797979797979797


The BlockBackend gets deleted by another job's stream_complete(), which is
deferred to the main loop, so the fact that the job is put to sleep by
bdrv_drain_all_begin() doesn't really stop it from executing.
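
For reference, the path by which the BB goes away under
bdrv_drain_all_begin()'s feet is roughly the following (reconstructed
from memory, so the intermediate names may be slightly off):

    stream_complete()                   /* BH, runs in the main loop */
      -> block_job_completed()
        -> block_job_completed_single()
          -> block_job_unref()
            -> blk_unref(job->blk)      /* frees the BlockBackend that
                                           bdrv_next() is still iterating
                                           over */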

The problem looks a bit broader though: if I (just as a test) keep the
BlockBackends referenced while iterating in bdrv_drain_all_begin(), qemu
fails a bit further on, when accessing the deleted BlockDriverState from
the apparently obsolete BlockReopenQueue in bdrv_reopen_prepare().


> Program received signal SIGSEGV, Segmentation fault.
> 0x00007f4c66a04420 in find_parent_in_reopen_queue (q=<optimized out>, c=0x7f4c68b365e0)
>     at /mnt/code/us-qemu/block.c:2932
> 2932            QLIST_FOREACH(child, &bs->children, next) {
> (gdb) bt
> #0  0x00007f4c66a04420 in find_parent_in_reopen_queue (q=<optimized out>, c=0x7f4c68b365e0)
>     at /mnt/code/us-qemu/block.c:2932
> #1  bdrv_reopen_perm (shared=0x7f4c69596440, perm=0x7f4c69596438, bs=0x7f4c68abc250, q=0x7f4c696238d0)
>     at /mnt/code/us-qemu/block.c:2951
> #2  bdrv_reopen_prepare (reopen_state=reopen_state@entry=0x7f4c69596428, queue=queue@entry=0x7f4c696238d0, 
>     errp=errp@entry=0x7ffe09fc28d0) at /mnt/code/us-qemu/block.c:3035
> #3  0x00007f4c66a0474f in bdrv_reopen_multiple (ctx=<optimized out>, bs_queue=0x7f4c696238d0, 
>     errp=errp@entry=0x7ffe09fc2920) at /mnt/code/us-qemu/block.c:2875
> #4  0x00007f4c66a048d4 in bdrv_reopen (bs=bs@entry=0x7f4c68abc250, bdrv_flags=bdrv_flags@entry=8226, 
>     errp=errp@entry=0x7ffe09fc2a28) at /mnt/code/us-qemu/block.c:2916
> #5  0x00007f4c668649bb in stream_start (job_id=job_id@entry=0x7f4c68f95fc0 "stream-node6", 
> [...]
> (gdb) p bs
> $1 = (BlockDriverState *) 0x7f4c68b57250
> (gdb) p bs->children
> $2 = {lh_first = 0x6060606060606060}


Does this mean that bdrv_reopen_queue() has to be moved after
bdrv_drain_all_begin(), which would also have to be taught to suspend the
main-loop BHs somehow?
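
To illustrate the question, a minimal sketch (untested, not a proposal;
today bdrv_reopen() builds the queue first and bdrv_reopen_multiple()
does the draining internally):

    int bdrv_reopen(BlockDriverState *bs, int bdrv_flags, Error **errp)
    {
        BlockReopenQueue *queue;
        int ret;

        /* hypothetical ordering: drain (and somehow settle the deferred
         * completion BHs) *before* building the queue, so a completing
         * job can no longer invalidate the queue entries */
        bdrv_drain_all_begin();
        queue = bdrv_reopen_queue(NULL, bs, NULL, bdrv_flags);
        ret = bdrv_reopen_multiple(bdrv_get_aio_context(bs), queue, errp);
        bdrv_drain_all_end();

        return ret;
    }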


thank you
/Anton


Comments

Alberto Garcia Nov. 8, 2017, 1:58 p.m. UTC | #1
On Tue 07 Nov 2017 05:19:41 PM CET, Anton Nefedov wrote:

> One more drainage-related thing.
> We have recently encountered an issue with parallel block jobs and it's
> not quite clear how to fix it properly, so would appreciate your ideas.
>
> The small attached tweak makes iotest 30 (-qcow2 -nocache) fail 10/10
> times on my machine.

I just wanted to quickly confirm that I can reproduce this on my system
too. I also checked QEMU v2.8 (when this was introduced) and it was
affected as well. Somehow we didn't detect it back then :(

I'll take a look.

Berto
Alberto Garcia Nov. 8, 2017, 2:45 p.m. UTC | #2
On Tue 07 Nov 2017 05:19:41 PM CET, Anton Nefedov wrote:
> BlockBackend gets deleted by another job's stream_complete(), deferred
> to the main loop, so the fact that the job is put to sleep by
> bdrv_drain_all_begin() doesn't really stop it from execution.

I was debugging this a bit, and the block_job_defer_to_main_loop() call
happens _after_ all jobs have been paused, so I think that when the BDS
is drained, stream_run() finishes the last iteration without checking
whether it's paused.

Without your patch (i.e. with a smaller STREAM_BUFFER_SIZE), I assume
that the function would have to continue looping and
block_job_sleep_ns() would make the job coroutine yield, effectively
pausing the job and preventing the crash.

I can fix the crash by adding block_job_pause_point(&s->common) at the
end of stream_run() (where the 'out' label is).
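
I.e. something along these lines (untested; the context around the 'out'
label is quoted from memory):

     out:
    +    /* give a paused/drained job a chance to actually stop here
    +     * instead of scheduling its completion BH right away */
    +    block_job_pause_point(&s->common);
    +
         data = g_new(StreamCompleteData, 1);
         data->reached_end = ret == 0;
         data->ret = ret;
         block_job_defer_to_main_loop(&s->common, stream_complete, data);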

I'm thinking that perhaps we should add the pause point directly to
block_job_defer_to_main_loop(), to prevent any block job from running
the exit function when it's paused.

Somehow I had the impression that we discussed this already in the past
(?) because I remember thinking about this very scenario.

Berto
Anton Nefedov Nov. 8, 2017, 3:50 p.m. UTC | #3
On 8/11/2017 5:45 PM, Alberto Garcia wrote:
> On Tue 07 Nov 2017 05:19:41 PM CET, Anton Nefedov wrote:
>> BlockBackend gets deleted by another job's stream_complete(), deferred
>> to the main loop, so the fact that the job is put to sleep by
>> bdrv_drain_all_begin() doesn't really stop it from execution.
> 
> I was debugging this a bit, and the block_job_defer_to_main_loop() call
> happens _after_ all jobs have been paused, so I think that when the BDS
> is drained then stream_run() finishes the last iteration without
> checking if it's paused.
> 
> Without your patch (i.e. with a smaller STREAM_BUFFER_SIZE) then I
> assume that the function would have to continue looping and
> block_job_sleep_ns() would make the job coroutine yield, effectively
> pausing the job and preventing the crash.
> 
> I can fix the crash by adding block_job_pause_point(&s->common) at the
> end of stream_run() (where the 'out' label is).
> 
> I'm thinking that perhaps we should add the pause point directly to
> block_job_defer_to_main_loop(), to prevent any block job from running
> the exit function when it's paused.
> 

Is it possible that the exit function has already been deferred by the
time the jobs are being paused? (even if that is less likely to happen)

Should we then also flush the bottom halves somehow, in addition to
putting the jobs to sleep? And it all probably has to happen before
bdrv_reopen_queue().

/Anton

> Somehow I had the impression that we discussed this already in the past
> (?) because I remember thinking about this very scenario.
> 
> Berto
>
Alberto Garcia Nov. 9, 2017, 4:26 p.m. UTC | #4
On Wed 08 Nov 2017 03:45:38 PM CET, Alberto Garcia wrote:

> I'm thinking that perhaps we should add the pause point directly to
> block_job_defer_to_main_loop(), to prevent any block job from running
> the exit function when it's paused.

I was trying this and unfortunately this breaks the mirror job at least
(reproduced with a simple block-commit on the topmost node, which uses
commit_active_start() -> mirror_start_job()).

So what happens is that mirror_run() always calls bdrv_drained_begin()
before returning, pausing the block job. The closing bdrv_drained_end()
call is at the end of mirror_exit(), already in the main loop.

So the mirror job is always calling block_job_defer_to_main_loop() and
mirror_exit() when the job is paused.

Berto
Fam Zheng Nov. 10, 2017, 3:02 a.m. UTC | #5
On Thu, 11/09 17:26, Alberto Garcia wrote:
> On Wed 08 Nov 2017 03:45:38 PM CET, Alberto Garcia wrote:
> 
> > I'm thinking that perhaps we should add the pause point directly to
> > block_job_defer_to_main_loop(), to prevent any block job from running
> > the exit function when it's paused.
> 
> I was trying this and unfortunately this breaks the mirror job at least
> (reproduced with a simple block-commit on the topmost node, which uses
> commit_active_start() -> mirror_start_job()).
> 
> So what happens is that mirror_run() always calls bdrv_drained_begin()
> before returning, pausing the block job. The closing bdrv_drained_end()
> call is at the end of mirror_exit(), already in the main loop.
> 
> So the mirror job is always calling block_job_defer_to_main_loop() and
> mirror_exit() when the job is paused.
> 

FWIW, I think Max's report on 194 failures is related:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html

so perhaps it's worth testing his patch too:

https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html

Fam
Alberto Garcia Nov. 15, 2017, 3:42 p.m. UTC | #6
On Fri 10 Nov 2017 04:02:23 AM CET, Fam Zheng wrote:
>> > I'm thinking that perhaps we should add the pause point directly to
>> > block_job_defer_to_main_loop(), to prevent any block job from
>> > running the exit function when it's paused.
>> 
>> I was trying this and unfortunately this breaks the mirror job at
>> least (reproduced with a simple block-commit on the topmost node,
>> which uses commit_active_start() -> mirror_start_job()).
>> 
>> So what happens is that mirror_run() always calls
>> bdrv_drained_begin() before returning, pausing the block job. The
>> closing bdrv_drained_end() call is at the end of mirror_exit(),
>> already in the main loop.
>> 
>> So the mirror job is always calling block_job_defer_to_main_loop()
>> and mirror_exit() when the job is paused.
>
> FWIW, I think Max's report on 194 failures is related:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html
>
> so perhaps it's worth testing his patch too:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html

Well, that doesn't solve the original crash with parallel block jobs.
The root of the crash is that the mirror job manipulates the graph
_while being paused_, so the BlockReopenQueue in bdrv_reopen_multiple()
gets messed up, and pausing the jobs (commit 40840e419be31e) won't help.

I have the impression that one major source of headaches is the fact
that the reopen queue contains nodes that don't need to be reopened at
all. Ideally this should be detected early on in bdrv_reopen_queue(), so
there's no chance that the queue contains nodes used by a different
block job. If we had that then op blockers should be enough to prevent
these things. Or am I missing something?
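
Just to illustrate the idea, a hypothetical early-out in
bdrv_reopen_queue() (or its recursive helper), glossing over details
such as option changes and the allocation of the queue itself:

    /* hypothetical: nothing would change for this node, so don't add it
     * (or its children) to the reopen queue at all */
    if (flags == bs->open_flags && (!options || !qdict_size(options))) {
        return bs_queue;
    }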

Berto
Anton Nefedov Nov. 15, 2017, 4:31 p.m. UTC | #7
On 15/11/2017 6:42 PM, Alberto Garcia wrote:
> On Fri 10 Nov 2017 04:02:23 AM CET, Fam Zheng wrote:
>>>> I'm thinking that perhaps we should add the pause point directly to
>>>> block_job_defer_to_main_loop(), to prevent any block job from
>>>> running the exit function when it's paused.
>>>
>>> I was trying this and unfortunately this breaks the mirror job at
>>> least (reproduced with a simple block-commit on the topmost node,
>>> which uses commit_active_start() -> mirror_start_job()).
>>>
>>> So what happens is that mirror_run() always calls
>>> bdrv_drained_begin() before returning, pausing the block job. The
>>> closing bdrv_drained_end() call is at the end of mirror_exit(),
>>> already in the main loop.
>>>
>>> So the mirror job is always calling block_job_defer_to_main_loop()
>>> and mirror_exit() when the job is paused.
>>
>> FWIW, I think Max's report on 194 failures is related:
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01822.html
>>
>> so perhaps it's worth testing his patch too:
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01835.html
> 
> Well, that doesn't solve the original crash with parallel block jobs.
> The root of the crash is that the mirror job manipulates the graph
> _while being paused_, so the BlockReopenQueue in bdrv_reopen_multiple()
> gets messed up, and pausing the jobs (commit 40840e419be31e) won't help.
> 
> I have the impression that one major source of headaches is the fact
> that the reopen queue contains nodes that don't need to be reopened at
> all. Ideally this should be detected early on in bdrv_reopen_queue(), so
> there's no chance that the queue contains nodes used by a different
> block job. If we had that then op blockers should be enough to prevent
> these things. Or am I missing something?
> 
> Berto
> 

After applying Max's patch I tried a similar approach, that is, keeping
BDSes referenced while they are in the reopen queue.
Now I get the stream job hanging: somehow one blk_root_drained_begin()
is not paired with a blk_root_drained_end(), so the job stays paused.
I haven't dug deeper yet, but at first glance the reduced reopen queue
won't help with this, as reopen drains all BDSes anyway (or can we avoid
that too?)

/Anton
Alberto Garcia Nov. 16, 2017, 3:54 p.m. UTC | #8
On Wed 15 Nov 2017 05:31:20 PM CET, Anton Nefedov wrote:
>> I have the impression that one major source of headaches is the fact
>> that the reopen queue contains nodes that don't need to be reopened at
>> all. Ideally this should be detected early on in bdrv_reopen_queue(), so
>> there's no chance that the queue contains nodes used by a different
>> block job. If we had that then op blockers should be enough to prevent
>> these things. Or am I missing something?
>> 
> After applying Max's patch I tried the similar approach; that is keep
> BDSes referenced while they are in the reopen queue.  Now I get the
> stream job hanging. Somehow one blk_root_drained_begin() is not paired
> with blk_root_drained_end(). So the job stays paused.

I can see this if I apply Max's patch and keep refs to BDSs in the
reopen queue:

#0  block_job_pause (...) at blockjob.c:130
#1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
#2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
#3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
#4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177

There's an ops->drained_begin(opaque) call in blk_set_dev_ops() that
doesn't seem to be paired. And when you call block_job_start(), it
yields immediately, waiting for the resume (that never arrives).

John, this change was yours (f4d9cc88ee69a5b04). Any idea?

Berto
Anton Nefedov Nov. 16, 2017, 4:09 p.m. UTC | #9
On 16/11/2017 6:54 PM, Alberto Garcia wrote:
> On Wed 15 Nov 2017 05:31:20 PM CET, Anton Nefedov wrote:
>>> I have the impression that one major source of headaches is the fact
>>> that the reopen queue contains nodes that don't need to be reopened at
>>> all. Ideally this should be detected early on in bdrv_reopen_queue(), so
>>> there's no chance that the queue contains nodes used by a different
>>> block job. If we had that then op blockers should be enough to prevent
>>> these things. Or am I missing something?
>>>
>> After applying Max's patch I tried the similar approach; that is keep
>> BDSes referenced while they are in the reopen queue.  Now I get the
>> stream job hanging. Somehow one blk_root_drained_begin() is not paired
>> with blk_root_drained_end(). So the job stays paused.
> 
> I can see this if I apply Max's patch and keep refs to BDSs in the
> reopen queue:
> 
> #0  block_job_pause (...) at blockjob.c:130
> #1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
> #2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
> #3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
> #4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177
> 
> There's a ops->drained_begin(opaque) call in blk_set_dev_ops() that
> doesn't seem to be paired. 
>

My understanding for now is

   1. in bdrv_drain_all_begin(), BDS gets drained with
      bdrv_parent_drained_begin(), all the parents get
      blk_root_drained_begin(), pause their jobs.
   2. one of the jobs finishes and deletes BB.
   3. in bdrv_drain_all_end(), bdrv_parent_drained_end() is never
      called because even though BDS still exists (referenced in the
      queue), it cannot be accessed as bdrv_next() takes BDS from the
      global BB list (and the corresponding BB is deleted).

Max's patch v1 could have helped:
http://lists.nongnu.org/archive/html/qemu-devel/2017-11/msg01835.html

Or, perhaps another approach: keep the BlockJob referenced while it is
paused (by block_job_pause/resume_all()). That should prevent it from
deleting the BB.
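
Roughly like this (just a sketch, untested, aio_context locking omitted;
the tricky part would be the resume side, where dropping the reference
may free the job):

    void block_job_pause_all(void)
    {
        BlockJob *job = NULL;

        while ((job = block_job_next(job))) {
            block_job_ref(job);   /* pin the job (and thus its BB) while
                                   * it is paused */
            block_job_pause(job);
        }
    }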


> And when you call block_job_start() then it
> yields immediately waiting for the resume (that never arrives).
> 
> John, this change was yours (f4d9cc88ee69a5b04). Any idea?
> 
> Berto
>
John Snow Nov. 16, 2017, 9:56 p.m. UTC | #10
On 11/16/2017 10:54 AM, Alberto Garcia wrote:
> On Wed 15 Nov 2017 05:31:20 PM CET, Anton Nefedov wrote:
>>> I have the impression that one major source of headaches is the fact
>>> that the reopen queue contains nodes that don't need to be reopened at
>>> all. Ideally this should be detected early on in bdrv_reopen_queue(), so
>>> there's no chance that the queue contains nodes used by a different
>>> block job. If we had that then op blockers should be enough to prevent
>>> these things. Or am I missing something?
>>>
>> After applying Max's patch I tried the similar approach; that is keep
>> BDSes referenced while they are in the reopen queue.  Now I get the
>> stream job hanging. Somehow one blk_root_drained_begin() is not paired
>> with blk_root_drained_end(). So the job stays paused.
> 
> I can see this if I apply Max's patch and keep refs to BDSs in the
> reopen queue:
> 
> #0  block_job_pause (...) at blockjob.c:130
> #1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
> #2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
> #3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
> #4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177
> 
> There's a ops->drained_begin(opaque) call in blk_set_dev_ops() that
> doesn't seem to be paired. And when you call block_job_start() then it
> yields immediately waiting for the resume (that never arrives).
> 
> John, this change was yours (f4d9cc88ee69a5b04). Any idea?
> 
> Berto
> 

The idea at the time was that if you tell the BlockBackend to drain and
then attach a job to it, once you go to *end* the drained region you'd
have a mismatched begin/end pair.

To allow for some flexibility and to make sure that you *didn't* have a
mismatched begin/end call, what I did was that if you attach dev_ops to
an already drained backend (i.e. we "missed our chance" to issue the
drained_begin) we play catch-up and issue the drained call.

There's no matching call here, because I anticipated that whoever
initially bumped that quiesce_counter up would issue the drained_end,
which will then be propagated according to the dev_ops structure in
place.
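
For context, the relevant bit of blk_set_dev_ops() is roughly the
following (quoting from memory):

    void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
                         void *opaque)
    {
        blk->dev_ops = ops;
        blk->dev_opaque = opaque;

        /* play catch-up: the backend is already quiesced, so let the new
         * dev_ops know; whoever started the drain is expected to issue
         * the matching drained_end later */
        if (blk->quiesce_counter && ops->drained_begin) {
            ops->drained_begin(opaque);
        }
    }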
Alberto Garcia Nov. 17, 2017, 4:16 p.m. UTC | #11
On Thu 16 Nov 2017 10:56:58 PM CET, John Snow wrote:
>>>> I have the impression that one major source of headaches is the fact
>>>> that the reopen queue contains nodes that don't need to be reopened at
>>>> all. Ideally this should be detected early on in bdrv_reopen_queue(), so
>>>> there's no chance that the queue contains nodes used by a different
>>>> block job. If we had that then op blockers should be enough to prevent
>>>> these things. Or am I missing something?
>>>>
>>> After applying Max's patch I tried the similar approach; that is keep
>>> BDSes referenced while they are in the reopen queue.  Now I get the
>>> stream job hanging. Somehow one blk_root_drained_begin() is not paired
>>> with blk_root_drained_end(). So the job stays paused.
>> 
>> I can see this if I apply Max's patch and keep refs to BDSs in the
>> reopen queue:
>> 
>> #0  block_job_pause (...) at blockjob.c:130
>> #1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
>> #2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
>> #3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
>> #4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177
>> 
>> There's a ops->drained_begin(opaque) call in blk_set_dev_ops() that
>> doesn't seem to be paired. And when you call block_job_start() then it
>> yields immediately waiting for the resume (that never arrives).
>> 
>> John, this change was yours (f4d9cc88ee69a5b04). Any idea?
>
> The idea at the time was that if you tell the BlockBackend to drain and
> then attach a job to it, once you go to *end* the drained region you'd
> have a mismatched begin/end pair.
>
> To allow for some flexibility and to make sure that you *didn't* have a
> mismatched begin/end call, what I did was that if you attach dev_ops to
> an already drained backend (i.e. we "missed our chance" to issue the
> drained_begin) we play catch-up and issue the drained call.
>
> There's no matching call here, because I anticipated whoever initially
> bumped that quiesce_counter up to be issuing the drained_end, which will
> then be propagated according to the dev_ops structure in place.

I see, thanks!

Berto
Alberto Garcia Nov. 21, 2017, 12:51 p.m. UTC | #12
On Thu 16 Nov 2017 05:09:59 PM CET, Anton Nefedov wrote:
>>>> I have the impression that one major source of headaches is the
>>>> fact that the reopen queue contains nodes that don't need to be
>>>> reopened at all. Ideally this should be detected early on in
>>>> bdrv_reopen_queue(), so there's no chance that the queue contains
>>>> nodes used by a different block job. If we had that then op
>>>> blockers should be enough to prevent these things. Or am I missing
>>>> something?
>>>>
>>> After applying Max's patch I tried the similar approach; that is
>>> keep BDSes referenced while they are in the reopen queue.  Now I get
>>> the stream job hanging. Somehow one blk_root_drained_begin() is not
>>> paired with blk_root_drained_end(). So the job stays paused.
>> 
>> I can see this if I apply Max's patch and keep refs to BDSs in the
>> reopen queue:
>> 
>> #0  block_job_pause (...) at blockjob.c:130
>> #1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
>> #2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
>> #3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
>> #4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177
>> 
>> There's a ops->drained_begin(opaque) call in blk_set_dev_ops() that
>> doesn't seem to be paired. 
>
> My understanding for now is
>
>    1. in bdrv_drain_all_begin(), BDS gets drained with
>       bdrv_parent_drained_begin(), all the parents get
>       blk_root_drained_begin(), pause their jobs.
>    2. one of the jobs finishes and deletes BB.
>    3. in bdrv_drain_all_end(), bdrv_parent_drained_end() is never
>       called because even though BDS still exists (referenced in the
>       queue), it cannot be accessed as bdrv_next() takes BDS from the
>       global BB list (and the corresponding BB is deleted).

Yes, I was debugging this and I got to a similar conclusion. With
test_stream_commit from iotest 030 I can see that

   1. the block-stream job is created first, then stream_run begins and
      starts copying the data.
   2. block-commit starts and tries to reopen the top image in
      read-write mode. This pauses the stream block job and drains all
      BDSs.
   3. The draining causes the stream job to finish, it is deferred to
      the main loop, then stream_complete finishes and unrefs the block
      job, deleting it. At the point of deletion the pause count was
      still > 0 (because of step (2))

> Max's patch v1 could have helped:
> http://lists.nongnu.org/archive/html/qemu-devel/2017-11/msg01835.html

This doesn't solve the problem.

> Or, perhaps another approach, keep BlockJob referenced while it is
> paused (by block_job_pause/resume_all()). That should prevent it from
> deleting the BB.

Yes, I tried this and it actually solves the issue. But I still think
that the problem is that block jobs are allowed to finish when they are
paused.

Adding block_job_pause_point(&s->common) at the end of stream_run()
fixes the problem too.

Berto
Anton Nefedov Nov. 21, 2017, 3:18 p.m. UTC | #13
On 21/11/2017 3:51 PM, Alberto Garcia wrote:
> On Thu 16 Nov 2017 05:09:59 PM CET, Anton Nefedov wrote:
>>>>> I have the impression that one major source of headaches is the
>>>>> fact that the reopen queue contains nodes that don't need to be
>>>>> reopened at all. Ideally this should be detected early on in
>>>>> bdrv_reopen_queue(), so there's no chance that the queue contains
>>>>> nodes used by a different block job. If we had that then op
>>>>> blockers should be enough to prevent these things. Or am I missing
>>>>> something?
>>>>>
>>>> After applying Max's patch I tried the similar approach; that is
>>>> keep BDSes referenced while they are in the reopen queue.  Now I get
>>>> the stream job hanging. Somehow one blk_root_drained_begin() is not
>>>> paired with blk_root_drained_end(). So the job stays paused.
>>>
>>> I can see this if I apply Max's patch and keep refs to BDSs in the
>>> reopen queue:
>>>
>>> #0  block_job_pause (...) at blockjob.c:130
>>> #1  0x000055c143cb586d in block_job_drained_begin (...) at blockjob.c:227
>>> #2  0x000055c143d08067 in blk_set_dev_ops (...) at block/block-backend.c:887
>>> #3  0x000055c143cb69db in block_job_create (...) at blockjob.c:678
>>> #4  0x000055c143d17c0c in mirror_start_job (...) at block/mirror.c:1177
>>>
>>> There's a ops->drained_begin(opaque) call in blk_set_dev_ops() that
>>> doesn't seem to be paired.
>>
>> My understanding for now is
>>
>>     1. in bdrv_drain_all_begin(), BDS gets drained with
>>        bdrv_parent_drained_begin(), all the parents get
>>        blk_root_drained_begin(), pause their jobs.
>>     2. one of the jobs finishes and deletes BB.
>>     3. in bdrv_drain_all_end(), bdrv_parent_drained_end() is never
>>        called because even though BDS still exists (referenced in the
>>        queue), it cannot be accessed as bdrv_next() takes BDS from the
>>        global BB list (and the corresponding BB is deleted).
> 
> Yes, I was debugging this and I got to a similar conclusion. With
> test_stream_commit from iotest 030 I can see that
> 
>     1. the block-stream job is created first, then stream_run begins and
>        starts copying the data.
>     2. block-commit starts and tries to reopen the top image in
>        read-write mode. This pauses the stream block job and drains all
>        BDSs.
>     3. The draining causes the stream job to finish, it is deferred to
>        the main loop, then stream_complete finishes and unrefs the block
>        job, deleting it. At the point of deletion the pause count was
>        still > 0 (because of step (2))
> 
>> Max's patch v1 could have helped:
>> http://lists.nongnu.org/archive/html/qemu-devel/2017-11/msg01835.html
> 
> This doesn't solve the problem.
> 
>> Or, perhaps another approach, keep BlockJob referenced while it is
>> paused (by block_job_pause/resume_all()). That should prevent it from
>> deleting the BB.
> 
> Yes, I tried this and it actually solves the issue. But I still think
> that the problem is that block jobs are allowed to finish when they are
> paused.
> 

Agree, but

> Adding block_job_pause_point(&s->common) at the end of stream_run()
> fixes the problem too.
> 

would be a nice fix, but it only works if the job is not already
deferred, right? And the BH can't simply be held off while job->paused
is set, because with the mirror job that is always the case.
So we cover most of the cases but also lose the stable reproduction.

This:

 >> keep BlockJob referenced while it is
 >> paused (by block_job_pause/resume_all()). That should prevent it from
 >> deleting the BB.

looks kind of hacky; maybe taking the reference in block_job_pause()
(and not just in pause_all) would be more correct? It didn't work for me
right away though, but I can look into it more.

/Anton

> Berto
>
Alberto Garcia Nov. 21, 2017, 3:31 p.m. UTC | #14
On Tue 21 Nov 2017 04:18:13 PM CET, Anton Nefedov wrote:

>>> Or, perhaps another approach, keep BlockJob referenced while it is
>>> paused (by block_job_pause/resume_all()). That should prevent it
>>> from deleting the BB.
>> 
>> Yes, I tried this and it actually solves the issue. But I still think
>> that the problem is that block jobs are allowed to finish when they
>> are paused.
>
> Agree, but
>
>> Adding block_job_pause_point(&s->common) at the end of stream_run()
>> fixes the problem too.
>
> would be a nice fix, but it only works unless the job is already
> deferred, right?

Right, I didn't mean to propose it as the proper solution (it would
still leave the mirror job vulnerable, because it's already paused by
the time it calls defer_to_main_loop()).

> This:
>
>  >> keep BlockJob referenced while it is
>  >> paused (by block_job_pause/resume_all()). That should prevent it from
>  >> deleting the BB.
>
> looks kind of hacky; maybe referencing in block_job_pause() (and not
> just pause_all) seems more correct? I think it didn't work for me
> right away though. But I can look more.

You have to be careful when you unref the block job because you may
destroy it, and therefore block_job_next() in block_job_resume_all()
would be using freed memory.
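
E.g. the resume side would have to fetch the next job before dropping
the reference, something like (untested sketch, aio_context locking
omitted):

    void block_job_resume_all(void)
    {
        BlockJob *job, *next;

        for (job = block_job_next(NULL); job; job = next) {
            /* grab the next element first: the unref below may free 'job' */
            next = block_job_next(job);
            block_job_resume(job);
            block_job_unref(job);
        }
    }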

Berto
John Snow Nov. 22, 2017, 12:13 a.m. UTC | #15
CC Jeff Cody

... who may or may not be preoccupied with Thanksgiving travel now.

Convenient URL for reading past replies:
https://lists.nongnu.org/archive/html/qemu-devel/2017-11/msg03844.html

On 11/21/2017 10:31 AM, Alberto Garcia wrote:
> On Tue 21 Nov 2017 04:18:13 PM CET, Anton Nefedov wrote:
> 
>>>> Or, perhaps another approach, keep BlockJob referenced while it is
>>>> paused (by block_job_pause/resume_all()). That should prevent it
>>>> from deleting the BB.
>>>
>>> Yes, I tried this and it actually solves the issue. But I still think
>>> that the problem is that block jobs are allowed to finish when they
>>> are paused.
>>
>> Agree, but
>>
>>> Adding block_job_pause_point(&s->common) at the end of stream_run()
>>> fixes the problem too.
>>
>> would be a nice fix, but it only works unless the job is already
>> deferred, right?
> 
> Right, I didn't mean to propose it as the proper solution (it would
> still leave mirror job vulnerable because it's already paused by the
> time it calls defer_to_main_loop()).
> 
>> This:
>>
>>  >> keep BlockJob referenced while it is
>>  >> paused (by block_job_pause/resume_all()). That should prevent it from
>>  >> deleting the BB.
>>
>> looks kind of hacky; maybe referencing in block_job_pause() (and not
>> just pause_all) seems more correct? I think it didn't work for me
>> right away though. But I can look more.
> 
> You have to be careful when you unref the block job because you may
> destroy it, and therefore block_job_next() in block_job_resume_all()
> would be using freed memory.
> 
> Berto
>
Alberto Garcia Nov. 22, 2017, 3:05 p.m. UTC | #16
On Tue 07 Nov 2017 05:19:41 PM CET, Anton Nefedov wrote:

> The small attached tweak makes iotest 30 (-qcow2 -nocache) fail 10/10
> times on my machine.

I can reproduce the crash with v2.11.0-rc2, without having to modify QEMU
at all, using the attached test case (it's based on one of the existing
tests from 030). It doesn't fail 100% of the time, but it fails often
enough to reproduce the problem easily.

Berto

----

class TestStreamCommit(iotests.QMPTestCase):
    num_imgs = 6
    imgs = []

    def setUp(self):
        opts = []
        self.imgs = []

        # Initialize file names and command-line options
        for i in range(self.num_imgs):
            img_depth = self.num_imgs - i - 1
            opts.append("backing." * img_depth + "node-name=node%d" % i)
            self.imgs.append(os.path.join(iotests.test_dir, 'img-%d.img' % i))

        # Create all images
        qemu_img('create', '-f', iotests.imgfmt, self.imgs[0], '1M')
        for i in range(1, self.num_imgs):
            qemu_img('create', '-f', iotests.imgfmt,
                     '-o', 'backing_file=%s' % self.imgs[i-1], self.imgs[i])

        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0x1 256k 512k', self.imgs[1])
        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0x2 768k 256k', self.imgs[1])
        qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0x3 0 1M', self.imgs[5])

        # Attach the drive to the VM
        self.vm = iotests.VM()
        self.vm.add_drive(self.imgs[-1], ','.join(opts))
        self.vm.launch()

    def tearDown(self):
        self.vm.shutdown()
        for img in self.imgs:
            os.remove(img)

    # Test a block-stream and a block-commit job in parallel
    def test_stream_commit(self):
        self.assert_no_active_block_jobs()

        # Stream from node0 into node2
        result = self.vm.qmp('block-stream', device='node2', job_id='node2')
        self.assert_qmp(result, 'return', {})

        # Commit from the active layer into node3
        result = self.vm.qmp('block-commit', device='drive0', base=self.imgs[3])
        self.assert_qmp(result, 'return', {})

        # Wait for all jobs to be finished.
        pending_jobs = ['node2', 'drive0']
        while len(pending_jobs) > 0:
            for event in self.vm.get_qmp_events(wait=True):
                if event['event'] == 'BLOCK_JOB_COMPLETED':
                    node_name = self.dictpath(event, 'data/device')
                    self.assertTrue(node_name in pending_jobs)
                    self.assert_qmp_absent(event, 'data/error')
                    pending_jobs.remove(node_name)
                if event['event'] == 'BLOCK_JOB_READY':
                    self.assert_qmp(event, 'data/device', 'drive0')
                    self.assert_qmp(event, 'data/type', 'commit')
                    self.assert_qmp_absent(event, 'data/error')
                    self.assertTrue('drive0' in pending_jobs)
                    self.vm.qmp('block-job-complete', device='drive0')

        self.assert_no_active_block_jobs()

Patch

diff --git a/block/stream.c b/block/stream.c
index 52d329f..74e980c 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -26,7 +26,7 @@  enum {
      * enough to process multiple clusters in a single call, so that populating
       * contiguous regions of the image is efficient.
       */
-    STREAM_BUFFER_SIZE = 512 * 1024, /* in bytes */
+    STREAM_BUFFER_SIZE = 8 * 1024 * 1024, /* in bytes */
  };