
[v2] block/stream: Drain subtree around graph change

Message ID 20220324140907.17192-1-hreitz@redhat.com (mailing list archive)
State New, archived

Commit Message

Hanna Czenczek March 24, 2022, 2:09 p.m. UTC
When the stream block job cuts out the nodes between top and base in
stream_prepare(), it does not drain the subtree manually; it fetches the
base node, and tries to insert it as the top node's backing node with
bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.

This has two implications:

First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.

Second, even with such a strong reference, the problem remains that the
base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.

Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
case, which has five nodes, and simultaneously streams from the middle
node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.

Taking a strong reference to the base node, we no longer get a crash,
but the resulting block graph is less than ideal: The expected result is
obviously that all middle nodes are cut out and the base node is the
immediate backing child of the top node.  However, if stream_prepare()
takes a strong reference to its base node (the middle node), and then
the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base
node when bdrv_set_backing_hd() sets it as the top node's backing node.

Verify this by asserting in said 030's test case that the base node is
always the top node's immediate backing child when both jobs are done.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
v2:
- Oops, the base can be NULL.  Would have noticed if I had run all test
  cases from 030, and not just test_overlapping_5()...
  That means that keeping a strong reference to the base node must be
  conditional, based on whether there even is a base node or not.
  (I mean, technically we do not even need to keep a strong reference to
  that node, given that we are in a drained section, but I believe it is
  better style to do it anyway.)
---
 block/stream.c         | 15 ++++++++++++++-
 tests/qemu-iotests/030 |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

Comments

John Snow March 24, 2022, 6:49 p.m. UTC | #1
On Thu, Mar 24, 2022 at 10:09 AM Hanna Reitz <hreitz@redhat.com> wrote:
>
> When the stream block job cuts out the nodes between top and base in
> stream_prepare(), it does not drain the subtree manually; it fetches the
> base node, and tries to insert it as the top node's backing node with
> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
> the actual base node might change (because the base node is actually not
> part of the stream job) before the old base node passed to
> bdrv_set_backing_hd() is installed.
>
> This has two implications:
>
> First, the stream job does not keep a strong reference to the base node.
> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
> because some other block job is drained to finish), we will get a
> use-after-free.  We should keep a strong reference to that node.

Does that introduce any possibility of deadlock if we're now keeping a
strong reference? I guess not, the job can just delete its own
reference and everything's ... fine, right?

>
> Second, even with such a strong reference, the problem remains that the
> base node might change before bdrv_set_backing_hd() actually runs and as
> a result the wrong base node is installed.

ow

>
> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
> case, which has five nodes, and simultaneously streams from the middle
> node to the top node, and commits the middle node down to the base node.
> As it is, this will sometimes crash, namely when we encounter the
> above-described use-after-free.


Is there a BZ# or is this an occasional failure in 030 you saw? What
does failure look like? Does it require anything special to trigger?

>
> Taking a strong reference to the base node, we no longer get a crash,
> but the resulting block graph is less than ideal: The expected result is
> obviously that all middle nodes are cut out and the base node is the
> immediate backing child of the top node.  However, if stream_prepare()
> takes a strong reference to its base node (the middle node), and then
> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
> that middle node, the stream job will just reinstall it again.
>
> Therefore, we need to keep the whole subtree drained in
> stream_prepare(), so that the graph modification it performs is
> effectively atomic, i.e. that the base node it fetches is still the base
> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>
> Verify this by asserting in said 030's test case that the base node is
> always the top node's immediate backing child when both jobs are done.
>

(Off-topic: Sometimes I dream about having a block graph rendering
tool that can update step-by-step, so I can visualize what these block
operations look like. My "working memory" is kind of limited and I get
sidetracked too easily tracing code. That we have the ability to
render at a single point is pretty nice, but it's still hard to get
images from intermediate steps when things happen in tight sequence
without the chance for intervention.)

> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
> v2:
> - Oops, the base can be NULL.  Would have noticed if I had run all test
>   cases from 030, and not just test_overlapping_5()...
>   That means that keeping a strong reference to the base node must be
>   conditional, based on whether there even is a base node or not.
>   (I mean, technically we do not even need to keep a strong reference to
>   that node, given that we are in a drained section, but I believe it is
>   better style to do it anyway.)
> ---
>  block/stream.c         | 15 ++++++++++++++-
>  tests/qemu-iotests/030 |  5 +++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/block/stream.c b/block/stream.c
> index 3acb59fe6a..694709bd25 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -64,7 +64,13 @@ static int stream_prepare(Job *job)
>      bdrv_cor_filter_drop(s->cor_filter_bs);
>      s->cor_filter_bs = NULL;
>
> +    bdrv_subtree_drained_begin(s->above_base);
> +
>      base = bdrv_filter_or_cow_bs(s->above_base);
> +    if (base) {
> +        bdrv_ref(base);
> +    }
> +
>      unfiltered_base = bdrv_skip_filters(base);
>
>      if (bdrv_cow_child(unfiltered_bs)) {
> @@ -75,14 +81,21 @@ static int stream_prepare(Job *job)
>                  base_fmt = unfiltered_base->drv->format_name;
>              }
>          }
> +
>          bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
>          ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
>          if (local_err) {
>              error_report_err(local_err);
> -            return -EPERM;
> +            ret = -EPERM;
> +            goto out;
>          }
>      }
>
> +out:
> +    if (base) {
> +        bdrv_unref(base);
> +    }
> +    bdrv_subtree_drained_end(s->above_base);
>      return ret;
>  }
>
> diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
> index 567bf1da67..14112835ed 100755
> --- a/tests/qemu-iotests/030
> +++ b/tests/qemu-iotests/030
> @@ -436,6 +436,11 @@ class TestParallelOps(iotests.QMPTestCase):
>          self.vm.run_job(job='node4', auto_dismiss=True)
>          self.assert_no_active_block_jobs()
>
> +        # Assert that node0 is now the backing node of node4
> +        result = self.vm.qmp('query-named-block-nodes')
> +        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
> +        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
> +
>      # Test a block-stream and a block-commit job in parallel
>      # Here the stream job is supposed to finish quickly in order to reproduce
>      # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
> --
> 2.35.1
>

Seems reasonable, but the best I can give right now is an ACK because
I'm a little rusty with block graph ops ...

--js
Hanna Czenczek March 25, 2022, 8:50 a.m. UTC | #2
On 24.03.22 19:49, John Snow wrote:
> On Thu, Mar 24, 2022 at 10:09 AM Hanna Reitz <hreitz@redhat.com> wrote:
>> When the stream block job cuts out the nodes between top and base in
>> stream_prepare(), it does not drain the subtree manually; it fetches the
>> base node, and tries to insert it as the top node's backing node with
>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>> the actual base node might change (because the base node is actually not
>> part of the stream job) before the old base node passed to
>> bdrv_set_backing_hd() is installed.
>>
>> This has two implications:
>>
>> First, the stream job does not keep a strong reference to the base node.
>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>> because some other block job is drained to finish), we will get a
>> use-after-free.  We should keep a strong reference to that node.
> Does that introduce any possibility of deadlock if we're now keeping a
> strong reference? I guess not, the job can just delete its own
> reference and everything's ... fine, right?

Handling of strong references doesn’t block: bdrv_ref() just increases 
the refcount, and bdrv_unref() decreases it (deleting the node should 
the refcount reach 0).

>> Second, even with such a strong reference, the problem remains that the
>> base node might change before bdrv_set_backing_hd() actually runs and as
>> a result the wrong base node is installed.
> ow

Well, not horrible.  Just means that a node that was supposed to be 
removed is still there.  (So you’ll need to commit again.)

If users pass auto_dismiss=false and dismiss the jobs manually in order 
(which I think is a good idea anyway), this won’t happen.

>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>> case, which has five nodes, and simultaneously streams from the middle
>> node to the top node, and commits the middle node down to the base node.
>> As it is, this will sometimes crash, namely when we encounter the
>> above-described use-after-free.
>
> Is there a BZ# or is this an occasional failure in 030 you saw?

Yep, just occasional failure in 030.

> What
> does failure look like? Does it require anything special to trigger?

Running 030 quite a bunch of times, preferably in parallel as many times 
as you have cores.  (Perhaps disabling all test cases except 
test_overlapping_5() to make it a bit quicker.)

Failure for the use-after-free is that inside of bdrv_set_backing_hd(), 
the base node will just contain garbage and some pointer dereference fails:

#0  bdrv_inherits_from_recursive (parent=parent@entry=0x5614406438e0, child=0x8c8c8c8c8c8c8c8c, child@entry=0x5614412d0a70) at ../block.c:3328
#1  bdrv_set_file_or_backing_noperm (parent_bs=parent_bs@entry=0x5614406438e0, child_bs=child_bs@entry=0x5614412d0a70, is_backing=is_backing@entry=true, tran=tran@entry=0x5614414f5f40, errp=errp@entry=0x7ffdf44eb810) at ../block.c:3361
#2  0x000056143da61550 in bdrv_set_backing_noperm (errp=0x7ffdf44eb810, tran=0x5614414f5f40, backing_hd=0x5614412d0a70, bs=0x5614406438e0) at ../block.c:3447
#3  bdrv_set_backing_hd (bs=bs@entry=0x5614406438e0, backing_hd=backing_hd@entry=0x5614412d0a70, errp=errp@entry=0x7ffdf44eb810) at ../block.c:3459
#4  0x000056143dae7778 in stream_prepare (job=0x56144160b6c0) at ../block/stream.c:78
#5  0x000056143da6a91e in job_prepare (job=0x56144160b6c0) at ../job.c:837
#6  job_txn_apply (fn=<optimized out>, job=0x56144160b6c0) at ../job.c:158
#7  job_do_finalize (job=0x56144160b6c0) at ../job.c:854
#8  0x000056143da6ae02 in job_exit (opaque=0x56144160b6c0) at ../job.c:941

Hanna

>> Taking a strong reference to the base node, we no longer get a crash,
>> but the resulting block graph is less than ideal: The expected result is
>> obviously that all middle nodes are cut out and the base node is the
>> immediate backing child of the top node.  However, if stream_prepare()
>> takes a strong reference to its base node (the middle node), and then
>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>> that middle node, the stream job will just reinstall it again.
>>
>> Therefore, we need to keep the whole subtree drained in
>> stream_prepare(), so that the graph modification it performs is
>> effectively atomic, i.e. that the base node it fetches is still the base
>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>>
>> Verify this by asserting in said 030's test case that the base node is
>> always the top node's immediate backing child when both jobs are done.
>>
> (Off-topic: Sometimes I dream about having a block graph rendering
> tool that can update step-by-step, so I can visualize what these block
> operations look like. My "working memory" is kind of limited and I get
> sidetracked too easily tracing code. That we have the ability to
> render at a single point is pretty nice, but it's still hard to get
> images from intermediate steps when things happen in tight sequence
> without the chance for intervention.)
>
>> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
>> ---
>> v2:
>> - Oops, the base can be NULL.  Would have noticed if I had run all test
>>    cases from 030, and not just test_overlapping_5()...
>>    That means that keeping a strong reference to the base node must be
>>    conditional, based on whether there even is a base node or not.
>>    (I mean, technically we do not even need to keep a strong reference to
>>    that node, given that we are in a drained section, but I believe it is
>>    better style to do it anyway.)
>> ---
>>   block/stream.c         | 15 ++++++++++++++-
>>   tests/qemu-iotests/030 |  5 +++++
>>   2 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/stream.c b/block/stream.c
>> index 3acb59fe6a..694709bd25 100644
>> --- a/block/stream.c
>> +++ b/block/stream.c
>> @@ -64,7 +64,13 @@ static int stream_prepare(Job *job)
>>       bdrv_cor_filter_drop(s->cor_filter_bs);
>>       s->cor_filter_bs = NULL;
>>
>> +    bdrv_subtree_drained_begin(s->above_base);
>> +
>>       base = bdrv_filter_or_cow_bs(s->above_base);
>> +    if (base) {
>> +        bdrv_ref(base);
>> +    }
>> +
>>       unfiltered_base = bdrv_skip_filters(base);
>>
>>       if (bdrv_cow_child(unfiltered_bs)) {
>> @@ -75,14 +81,21 @@ static int stream_prepare(Job *job)
>>                   base_fmt = unfiltered_base->drv->format_name;
>>               }
>>           }
>> +
>>           bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
>>           ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
>>           if (local_err) {
>>               error_report_err(local_err);
>> -            return -EPERM;
>> +            ret = -EPERM;
>> +            goto out;
>>           }
>>       }
>>
>> +out:
>> +    if (base) {
>> +        bdrv_unref(base);
>> +    }
>> +    bdrv_subtree_drained_end(s->above_base);
>>       return ret;
>>   }
>>
>> diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
>> index 567bf1da67..14112835ed 100755
>> --- a/tests/qemu-iotests/030
>> +++ b/tests/qemu-iotests/030
>> @@ -436,6 +436,11 @@ class TestParallelOps(iotests.QMPTestCase):
>>           self.vm.run_job(job='node4', auto_dismiss=True)
>>           self.assert_no_active_block_jobs()
>>
>> +        # Assert that node0 is now the backing node of node4
>> +        result = self.vm.qmp('query-named-block-nodes')
>> +        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
>> +        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
>> +
>>       # Test a block-stream and a block-commit job in parallel
>>       # Here the stream job is supposed to finish quickly in order to reproduce
>>       # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
>> --
>> 2.35.1
>>
> Seems reasonable, but the best I can give right now is an ACK because
> I'm a little rusty with block graph ops ...
>
> --js
>
Eric Blake March 25, 2022, 2:45 p.m. UTC | #3
On Thu, Mar 24, 2022 at 03:09:07PM +0100, Hanna Reitz wrote:
> When the stream block job cuts out the nodes between top and base in
> stream_prepare(), it does not drain the subtree manually; it fetches the
> base node, and tries to insert it as the top node's backing node with
> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
> the actual base node might change (because the base node is actually not
> part of the stream job) before the old base node passed to
> bdrv_set_backing_hd() is installed.
> 
...
> 
> Therefore, we need to keep the whole subtree drained in
> stream_prepare(), so that the graph modification it performs is
> effectively atomic, i.e. that the base node it fetches is still the base
> node when bdrv_set_backing_hd() sets it as the top node's backing node.
> 
> Verify this by asserting in said 030's test case that the base node is
> always the top node's immediate backing child when both jobs are done.
> 
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
> v2:
> - Oops, the base can be NULL.  Would have noticed if I had run all test
>   cases from 030, and not just test_overlapping_5()...
>   That means that keeping a strong reference to the base node must be
>   conditional, based on whether there even is a base node or not.
>   (I mean, technically we do not even need to keep a strong reference to
>   that node, given that we are in a drained section, but I believe it is
>   better style to do it anyway.)

I agree with the conclusion that we don't need a strong reference once
you fix the bigger problem of the lock-by-drain, but that it is better
style to include it anyway.

> ---
>  block/stream.c         | 15 ++++++++++++++-
>  tests/qemu-iotests/030 |  5 +++++
>  2 files changed, 19 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>
Vladimir Sementsov-Ogievskiy March 25, 2022, 4:37 p.m. UTC | #4
24.03.2022 17:09, Hanna Reitz wrote:
> When the stream block job cuts out the nodes between top and base in
> stream_prepare(), it does not drain the subtree manually; it fetches the
> base node, and tries to insert it as the top node's backing node with
> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
> the actual base node might change (because the base node is actually not
> part of the stream job) before the old base node passed to
> bdrv_set_backing_hd() is installed.
> 
> This has two implications:
> 
> First, the stream job does not keep a strong reference to the base node.
> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
> because some other block job is drained to finish), we will get a
> use-after-free.  We should keep a strong reference to that node.
> 
> Second, even with such a strong reference, the problem remains that the
> base node might change before bdrv_set_backing_hd() actually runs and as
> a result the wrong base node is installed.

Hmm.

So, we don't really need a strong reference: if it helps to avoid some use-after-free, it only means that we'll end up with a wrong block graph instead.

Graph modifying operations must be somehow isolated from each other.

> 
> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
> case, which has five nodes, and simultaneously streams from the middle
> node to the top node, and commits the middle node down to the base node.
> As it is, this will sometimes crash, namely when we encounter the
> above-described use-after-free.
> 
> Taking a strong reference to the base node, we no longer get a crash,
> but the resulting block graph is less than ideal: The expected result is
> obviously that all middle nodes are cut out and the base node is the
> immediate backing child of the top node.  However, if stream_prepare()
> takes a strong reference to its base node (the middle node), and then
> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
> that middle node, the stream job will just reinstall it again.
> 
> Therefore, we need to keep the whole subtree drained in
> stream_prepare(), so that the graph modification it performs is
> effectively atomic, i.e. that the base node it fetches is still the base
> node when bdrv_set_backing_hd() sets it as the top node's backing node.

Emanuele has similar idea of isolating graph changes from each other by subtree-drain.

If I understand correctly, the idea is that we'll drain all other block jobs, so they wouldn't do their block-graph modifications during the drained section. Then we can safely modify the graph.

I don't like this idea:

1. drained section = stop IO. But we don't need to stop IO in the whole subtree to do a needed block-graph modification.

2. A drained section is not a lock; several clients may drain the same set of nodes. So we exploit the fact that concurrent clients will be paused by the drained section and won't proceed to graph-modification code. But are we sure that block jobs are (and will be?) the only concurrent block-graph-modifying clients? Can QMP commands interleave somehow? Can some job from another subtree start a block-graph modification that touches our subtree? If we go this way, it would be safer to drain the whole block graph on any block-graph modification.

I think we'd better have a separate global mechanism for isolating graph modifications. Something like a global co-mutex or queue, where clients wait for their turn in block-graph modifications.

Here is my old proposal on that topic: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/

> 
> Verify this by asserting in said 030's test case that the base node is
> always the top node's immediate backing child when both jobs are done.
> 
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
> v2:
> - Oops, the base can be NULL.  Would have noticed if I had run all test
>    cases from 030, and not just test_overlapping_5()...
>    That means that keeping a strong reference to the base node must be
>    conditional, based on whether there even is a base node or not.
>    (I mean, technically we do not even need to keep a strong reference to
>    that node, given that we are in a drained section, but I believe it is
>    better style to do it anyway.)
> ---
>   block/stream.c         | 15 ++++++++++++++-
>   tests/qemu-iotests/030 |  5 +++++
>   2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/block/stream.c b/block/stream.c
> index 3acb59fe6a..694709bd25 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -64,7 +64,13 @@ static int stream_prepare(Job *job)
>       bdrv_cor_filter_drop(s->cor_filter_bs);
>       s->cor_filter_bs = NULL;
>   
> +    bdrv_subtree_drained_begin(s->above_base);
> +
>       base = bdrv_filter_or_cow_bs(s->above_base);
> +    if (base) {
> +        bdrv_ref(base);
> +    }
> +
>       unfiltered_base = bdrv_skip_filters(base);
>   
>       if (bdrv_cow_child(unfiltered_bs)) {
> @@ -75,14 +81,21 @@ static int stream_prepare(Job *job)
>                   base_fmt = unfiltered_base->drv->format_name;
>               }
>           }
> +
>           bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
>           ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
>           if (local_err) {
>               error_report_err(local_err);
> -            return -EPERM;
> +            ret = -EPERM;
> +            goto out;
>           }
>       }
>   
> +out:
> +    if (base) {
> +        bdrv_unref(base);
> +    }
> +    bdrv_subtree_drained_end(s->above_base);
>       return ret;
>   }
>   
> diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
> index 567bf1da67..14112835ed 100755
> --- a/tests/qemu-iotests/030
> +++ b/tests/qemu-iotests/030
> @@ -436,6 +436,11 @@ class TestParallelOps(iotests.QMPTestCase):
>           self.vm.run_job(job='node4', auto_dismiss=True)
>           self.assert_no_active_block_jobs()
>   
> +        # Assert that node0 is now the backing node of node4
> +        result = self.vm.qmp('query-named-block-nodes')
> +        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
> +        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
> +
>       # Test a block-stream and a block-commit job in parallel
>       # Here the stream job is supposed to finish quickly in order to reproduce
>       # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
Hanna Czenczek March 28, 2022, 7:44 a.m. UTC | #5
On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
> 24.03.2022 17:09, Hanna Reitz wrote:
>> When the stream block job cuts out the nodes between top and base in
>> stream_prepare(), it does not drain the subtree manually; it fetches the
>> base node, and tries to insert it as the top node's backing node with
>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>> the actual base node might change (because the base node is actually not
>> part of the stream job) before the old base node passed to
>> bdrv_set_backing_hd() is installed.
>>
>> This has two implications:
>>
>> First, the stream job does not keep a strong reference to the base node.
>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>> because some other block job is drained to finish), we will get a
>> use-after-free.  We should keep a strong reference to that node.
>>
>> Second, even with such a strong reference, the problem remains that the
>> base node might change before bdrv_set_backing_hd() actually runs and as
>> a result the wrong base node is installed.
>
> Hmm.
>
> So, we don't really need a strong reference: if it helps to avoid 
> some use-after-free, it only means that we'll end up with a wrong block 
> graph instead.

Sure.  But I found it better style to strongly reference a node while 
it’s used.  I’d rather have an outdated block graph (as in: A node that 
was supposed to disappear would still be in use) than a use-after-free.

> Graph modifying operations must be somehow isolated from each other.
>
>>
>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>> case, which has five nodes, and simultaneously streams from the middle
>> node to the top node, and commits the middle node down to the base node.
>> As it is, this will sometimes crash, namely when we encounter the
>> above-described use-after-free.
>>
>> Taking a strong reference to the base node, we no longer get a crash,
>> but the resulting block graph is less than ideal: The expected result is
>> obviously that all middle nodes are cut out and the base node is the
>> immediate backing child of the top node.  However, if stream_prepare()
>> takes a strong reference to its base node (the middle node), and then
>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>> that middle node, the stream job will just reinstall it again.
>>
>> Therefore, we need to keep the whole subtree drained in
>> stream_prepare(), so that the graph modification it performs is
>> effectively atomic, i.e. that the base node it fetches is still the base
>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>
> Emanuele has similar idea of isolating graph changes from each other 
> by subtree-drain.
>
> If I understand correctly, the idea is that we'll drain all other block 
> jobs, so they wouldn't do their block-graph modifications during the 
> drained section. Then we can safely modify the graph.
>
> I don't like this idea:
>
> 1. drained section = stop IO. But we don't need to stop IO in the 
> whole subtree to do a needed block-graph modification.

If you mean to say that draining just the single node should be 
sufficient, I’ll be happy to change it.

Not sure which node, though, because I’d think it would be `base`, but 
to safely fetch it I’d need to drain it, which seems to bite itself in 
the tail.  That’s why I went for a subtree drain from `above_base`.

> 2. A drained section is not a lock; several clients may drain the same 
> set of nodes. So we exploit the fact that concurrent clients will be 
> paused by the drained section and won't proceed to graph-modification 
> code. But are we sure that block jobs are (and will be?) the only 
> concurrent block-graph-modifying clients? Can QMP commands interleave 
> somehow?

They can under very specific circumstances and that’s a bug.  See 
https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .

> Can some jobs from other subtree start a block-graph modification that 
> touches our subtree?

That would be wrong.  A block job shouldn’t change nodes it doesn’t own; 
stream doesn’t own the base, but it also doesn’t change it, it only 
needs to have the top node point to it.

> If we go this way, it would be safer to drain the whole block graph 
> on any block-graph modification.
>
> I think we'd better have a separate global mechanism for isolating 
> graph modifications. Something like a global co-mutex or queue, where 
> clients wait for their turn in block-graph modifications.
>
> Here is my old proposal on that topic: 
> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/

That would only solve the very specific issue in 030, right?  The stream 
job isn’t protected from any graph modifications but those coming from 
mirror.  Might be a solution going forward (I didn’t look closer at it 
at the time, given I saw you had a discussion with Kevin), if we lock 
every graph change operation (though a global lock honestly doesn’t 
sound strictly better than draining subsections of the graph, both have 
their drawbacks), but that doesn’t look like it’d be something for 7.1.
Hanna Czenczek March 28, 2022, 8:09 a.m. UTC | #6
On 28.03.22 09:44, Hanna Reitz wrote:
> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>> 24.03.2022 17:09, Hanna Reitz wrote:
>>> When the stream block job cuts out the nodes between top and base in
>>> stream_prepare(), it does not drain the subtree manually; it fetches 
>>> the
>>> base node, and tries to insert it as the top node's backing node with
>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, 
>>> and so
>>> the actual base node might change (because the base node is actually 
>>> not
>>> part of the stream job) before the old base node passed to
>>> bdrv_set_backing_hd() is installed.
>>>
>>> This has two implications:
>>>
>>> First, the stream job does not keep a strong reference to the base 
>>> node.
>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>> because some other block job is drained to finish), we will get a
>>> use-after-free.  We should keep a strong reference to that node.
>>>
>>> Second, even with such a strong reference, the problem remains that the
>>> base node might change before bdrv_set_backing_hd() actually runs 
>>> and as
>>> a result the wrong base node is installed.
>>
>> Hmm.
>>
>> So, we don't really need a strong reference: if it helps to avoid 
>> some use-after-free, it only means that we'll end up with a wrong block 
>> graph instead.
>
> Sure.  But I found it better style to strongly reference a node while 
> it’s used.  I’d rather have an outdated block graph (as in: A node 
> that was supposed to disappear would still be in use) than a 
> use-after-free.
>
>> Graph modifying operations must be somehow isolated from each other.
>>
>>>
>>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>>> case, which has five nodes, and simultaneously streams from the middle
>>> node to the top node, and commits the middle node down to the base 
>>> node.
>>> As it is, this will sometimes crash, namely when we encounter the
>>> above-described use-after-free.
>>>
>>> Taking a strong reference to the base node, we no longer get a crash,
>>> but the resulting block graph is less than ideal: The expected result is
>>> obviously that all middle nodes are cut out and the base node is the
>>> immediate backing child of the top node.  However, if stream_prepare()
>>> takes a strong reference to its base node (the middle node), and then
>>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>>> that middle node, the stream job will just reinstall it again.
>>>
>>> Therefore, we need to keep the whole subtree drained in
>>> stream_prepare(), so that the graph modification it performs is
>>> effectively atomic, i.e. that the base node it fetches is still the 
>>> base
>>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>>
>> Emanuele has a similar idea of isolating graph changes from each 
>> other by subtree-drain.
>>
>> If I understand correctly the idea is that we'll drain all other 
>> block jobs, so they wouldn't do their block-graph modification during 
>> drained section. So, we can safely modify the graph.
>>
>> I don't like this idea:
>>
>> 1. drained section = stop IO. But we don't need to stop IO in the 
>> whole subtree to do a needed block-graph modification.
>
> If you mean to say that draining just the single node should be 
> sufficient, I’ll be happy to change it.
>
> Not sure which node, though, because I’d think it would be `base`, but 
> to safely fetch it I’d need to drain it, which seems to bite itself in 
> the tail.  That’s why I went for a subtree drain from `above_base`.
>
>> 2. Drained section is not a lock, several clients may drain same set 
>> of nodes.. So we exploit the fact that concurrent clients will be 
>> paused by drained section and don't proceed to graph-modification 
>> code.. But are we sure that block-jobs are (and will be?) the only 
>> concurrent block-graph modifying clients? Can qmp commands interleave 
>> somehow?
>
> They can under very specific circumstances and that’s a bug.  See 
> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .
>
>> Can some jobs from other subtree start a block-graph modification 
>> that touches our subtree?
>
> That would be wrong.  A block job shouldn’t change nodes it doesn’t 
> own; stream doesn’t own the base, but it also doesn’t change it, it 
> only needs to have the top node point to it.
>
>>> If we go this way, it would be safer to drain the whole 
>>> block-graph on any block-graph modification..
>>
>> I think we'd better have a separate global mechanism for isolating 
>> graph modifications. Something like a global co-mutex or queue, where 
>> clients wait for their turn in block graph modifications.
>>
>> Here is my old proposal on that topic: 
>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>
> That would only solve the very specific issue in 030, right?  The 
> stream job isn’t protected from any graph modifications but those 
> coming from mirror.  Might be a solution going forward (I didn’t look 
> closer at it at the time, given I saw you had a discussion with 
> Kevin), if we lock every graph change operation (though a global lock 
> honestly doesn’t sound strictly better than draining subsections of 
> the graph, both have their drawbacks), but that doesn’t look like it’d 
> be something for 7.1.

I wonder whether we could have a short-term version of 
`BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, you 
just can’t change the graph, and you also can’t wait, so that’s just an 
error.  But if `.frozen_lock` is set, you can wait on it. Here, we’d 
keep `.frozen` set for all links between top and above_base, and then in 
prepare() take `.frozen_lock` on the link between above_base and base.
Vladimir Sementsov-Ogievskiy March 28, 2022, 10:24 a.m. UTC | #7
28.03.2022 11:09, Hanna Reitz wrote:
> On 28.03.22 09:44, Hanna Reitz wrote:
>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>> When the stream block job cuts out the nodes between top and base in
>>>> stream_prepare(), it does not drain the subtree manually; it fetches the
>>>> base node, and tries to insert it as the top node's backing node with
>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>>>> the actual base node might change (because the base node is actually not
>>>> part of the stream job) before the old base node passed to
>>>> bdrv_set_backing_hd() is installed.
>>>>
>>>> This has two implications:
>>>>
>>>> First, the stream job does not keep a strong reference to the base node.
>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>> because some other block job is drained to finish), we will get a
>>>> use-after-free.  We should keep a strong reference to that node.
>>>>
>>>> Second, even with such a strong reference, the problem remains that the
>>>> base node might change before bdrv_set_backing_hd() actually runs and as
>>>> a result the wrong base node is installed.
>>>
>>> Hmm.
>>>
>>> So, we don't really need a strong reference, as if it helps to avoid some use-after-free, it means that we'll finish up with wrong block graph..
>>
>> Sure.  But I found it better style to strongly reference a node while it’s used.  I’d rather have an outdated block graph (as in: A node that was supposed to disappear would still be in use) than a use-after-free.
>>
>>> Graph modifying operations must be somehow isolated from each other.
>>>
>>>>
>>>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>>>> case, which has five nodes, and simultaneously streams from the middle
>>>> node to the top node, and commits the middle node down to the base node.
>>>> As it is, this will sometimes crash, namely when we encounter the
>>>> above-described use-after-free.
>>>>
>>>> Taking a strong reference to the base node, we no longer get a crash,
>>>> but the resulting block graph is less than ideal: The expected result is
>>>> obviously that all middle nodes are cut out and the base node is the
>>>> immediate backing child of the top node.  However, if stream_prepare()
>>>> takes a strong reference to its base node (the middle node), and then
>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>>>> that middle node, the stream job will just reinstall it again.
>>>>
>>>> Therefore, we need to keep the whole subtree drained in
>>>> stream_prepare(), so that the graph modification it performs is
>>>> effectively atomic, i.e. that the base node it fetches is still the base
>>>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>>>
>>> Emanuele has a similar idea of isolating graph changes from each other by subtree-drain.
>>>
>>> If I understand correctly the idea is that we'll drain all other block jobs, so they wouldn't do their block-graph modification during drained section. So, we can safely modify the graph.
>>>
>>> I don't like this idea:
>>>
>>> 1. drained section = stop IO. But we don't need to stop IO in the whole subtree to do a needed block-graph modification.
>>
>> If you mean to say that draining just the single node should be sufficient, I’ll be happy to change it.
>>
>> Not sure which node, though, because I’d think it would be `base`, but to safely fetch it I’d need to drain it, which seems to bite itself in the tail.  That’s why I went for a subtree drain from `above_base`.
>>
>>> 2. Drained section is not a lock, several clients may drain same set of nodes.. So we exploit the fact that concurrent clients will be paused by drained section and don't proceed to graph-modification code.. But are we sure that block-jobs are (and will be?) the only concurrent block-graph modifying clients? Can qmp commands interleave somehow?
>>
>> They can under very specific circumstances and that’s a bug.  See https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .
>>
>>> Can some jobs from other subtree start a block-graph modification that touches our subtree?
>>
>> That would be wrong.  A block job shouldn’t change nodes it doesn’t own; stream doesn’t own the base, but it also doesn’t change it, it only needs to have the top node point to it.
>>
>>> If we go this way, it would be safer to drain the whole block-graph on any block-graph modification..
>>>
>>> I think we'd better have a separate global mechanism for isolating graph modifications. Something like a global co-mutex or queue, where clients wait for their turn in block graph modifications.
>>>
>>> Here is my old proposal on that topic: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>>
>> That would only solve the very specific issue in 030, right?

It should solve the same issue as your patch. You don't add subtree_drain around every graph modification.. Or do we already have it?

>>  The stream job isn’t protected from any graph modifications but those coming from mirror.  Might be a solution going forward (I didn’t look closer at it at the time, given I saw you had a discussion with Kevin), if we lock every graph change operation (though a global lock honestly doesn’t sound strictly better than draining subsections of the graph, both have their drawbacks), but that doesn’t look like it’d be something for 7.1.

In the same way, with the draining solution you would have to do a subtree-drain for every graph change operation.

> 
> I wonder whether we could have a short-term version of `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, you just can’t change the graph, and you also can’t wait, so that’s just an error.  But if `.frozen_lock` is set, you can wait on it. Here, we’d keep `.frozen` set for all links between top and above_base, and then in prepare() take `.frozen_lock` on the link between above_base and base.
> 

Yes, that seems to be an alternative to a global lock that doesn't block the whole graph. Still, I don't think it is bad to lock the whole graph for graph modification, as modifications should be rare and fast.


Another thought: does subtree-drain really drain the whole connectivity component of the graph?

imagine something like this:

[A]  [   C  ]
  |    |    |
  v    v    v
[ B    ]  [ D ]


If we do subtree drain at node A, this will drain B and C, but not D..

Imagine some other job is attached to node D, and it starts a drained section too. Then, for example, both jobs will share a drained section on node C. That doesn't seem safe, and draining is not a lock.

So, if we are going to rely on a drained section as a lock that isolates graph modifications from each other, we should drain the whole connectivity component of the graph.


Next, I'm not really sure that two jobs can simultaneously enter a drained section and do graph modifications. What prevents this? Assume two block-stream jobs reach their finish simultaneously and go to subtree-drain. That just means that job_pause will be called for both jobs.. But what does that mean for the block-stream job that is in the bdrv_subtree_drained_begin() call in stream_prepare()? Seems nothing? Then both jobs will start the graph modification process simultaneously and can interleave at any yield point (for example, rewriting backing_file in qcow2 metadata).


Another reason why I think that subtree drain is the wrong tool is, as I said, the extra IO stop.

Imagine the following graph:

[A]
  |
  v
[B] [C]
  |   |
  v   v
[base]

If we want to rebase A onto base, we actually only need to stop IO requests in A and B. Why should C suffer from this graph modification? IO requests produced by C, which live in C and in base, don't intersect with the process of rebasing A onto base in any way.

====

Actually, I'm not strictly against your patch, and I believe that it fixes the problem in most cases. And it's probably OK in the short term. The only real doubt about including it now is that drained sections sometimes lead to deadlocks: is it possible that we now fix a bug that happens only in iotest 030 (or is it reported somewhere else?) while risking introducing some deadlock? It seems that if it's safe to call drained_begin() in some code, it should be safe to call subtree_drained_begin() there, and if that triggers some deadlock, it just exposes another bug.. Is it worth fixing now, close to the 7.0 release? We have lived with this bug for years.. Or has something changed that I missed?
Hanna Czenczek March 29, 2022, 8:54 a.m. UTC | #8
On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
> 28.03.2022 11:09, Hanna Reitz wrote:
>> On 28.03.22 09:44, Hanna Reitz wrote:
>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>> When the stream block job cuts out the nodes between top and base in
>>>>> stream_prepare(), it does not drain the subtree manually; it 
>>>>> fetches the
>>>>> base node, and tries to insert it as the top node's backing node with
>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, 
>>>>> and so
>>>>> the actual base node might change (because the base node is 
>>>>> actually not
>>>>> part of the stream job) before the old base node passed to
>>>>> bdrv_set_backing_hd() is installed.
>>>>>
>>>>> This has two implications:
>>>>>
>>>>> First, the stream job does not keep a strong reference to the base 
>>>>> node.
>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>> because some other block job is drained to finish), we will get a
>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>
>>>>> Second, even with such a strong reference, the problem remains 
>>>>> that the
>>>>> base node might change before bdrv_set_backing_hd() actually runs 
>>>>> and as
>>>>> a result the wrong base node is installed.
>>>>
>>>> Hmm.
>>>>
>>>> So, we don't really need a strong reference, as if it helps to 
>>>> avoid some use-after-free, it means that we'll finish up with wrong 
>>>> block graph..
>>>
>>> Sure.  But I found it better style to strongly reference a node 
>>> while it’s used.  I’d rather have an outdated block graph (as in: A 
>>> node that was supposed to disappear would still be in use) than a 
>>> use-after-free.
>>>
>>>> Graph modifying operations must be somehow isolated from each other.
>>>>
>>>>>
>>>>> Both effects can be seen in 030's 
>>>>> TestParallelOps.test_overlapping_5()
>>>>> case, which has five nodes, and simultaneously streams from the 
>>>>> middle
>>>>> node to the top node, and commits the middle node down to the base 
>>>>> node.
>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>> above-described use-after-free.
>>>>>
>>>>> Taking a strong reference to the base node, we no longer get a crash,
>>>>> but the resulting block graph is less than ideal: The expected 
>>>>> result is
>>>>> obviously that all middle nodes are cut out and the base node is the
>>>>> immediate backing child of the top node.  However, if 
>>>>> stream_prepare()
>>>>> takes a strong reference to its base node (the middle node), and then
>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>>>>> that middle node, the stream job will just reinstall it again.
>>>>>
>>>>> Therefore, we need to keep the whole subtree drained in
>>>>> stream_prepare(), so that the graph modification it performs is
>>>>> effectively atomic, i.e. that the base node it fetches is still 
>>>>> the base
>>>>> node when bdrv_set_backing_hd() sets it as the top node's backing 
>>>>> node.
>>>>
>>>> Emanuele has a similar idea of isolating graph changes from each 
>>>> other by subtree-drain.
>>>>
>>>> If I understand correctly the idea is that we'll drain all other 
>>>> block jobs, so they wouldn't do their block-graph modification 
>>>> during drained section. So, we can safely modify the graph.
>>>>
>>>> I don't like this idea:
>>>>
>>>> 1. drained section = stop IO. But we don't need to stop IO in the 
>>>> whole subtree to do a needed block-graph modification.
>>>
>>> If you mean to say that draining just the single node should be 
>>> sufficient, I’ll be happy to change it.
>>>
>>> Not sure which node, though, because I’d think it would be `base`, 
>>> but to safely fetch it I’d need to drain it, which seems to bite 
>>> itself in the tail.  That’s why I went for a subtree drain from 
>>> `above_base`.
>>>
>>>> 2. Drained section is not a lock, several clients may drain same 
>>>> set of nodes.. So we exploit the fact that concurrent clients will 
>>>> be paused by drained section and don't proceed to 
>>>> graph-modification code.. But are we sure that block-jobs are (and 
>>>> will be?) the only concurrent block-graph modifying clients? Can 
>>>> qmp commands interleave somehow?
>>>
>>> They can under very specific circumstances and that’s a bug. See 
>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html 
>>> .
>>>
>>>> Can some jobs from other subtree start a block-graph modification 
>>>> that touches our subtree?
>>>
>>> That would be wrong.  A block job shouldn’t change nodes it doesn’t 
>>> own; stream doesn’t own the base, but it also doesn’t change it, it 
>>> only needs to have the top node point to it.
>>>
>>>> If we go this way, it would be safer to drain the whole 
>>>> block-graph on any block-graph modification..
>>>>
>>>> I think we'd better have a separate global mechanism for isolating 
>>>> graph modifications. Something like a global co-mutex or queue, 
>>>> where clients wait for their turn in block graph modifications.
>>>>
>>>> Here is my old proposal on that topic: 
>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/ 
>>>>
>>>
>>> That would only solve the very specific issue in 030, right?
>
> It should solve the same issue as your patch. You don't add 
> subtree_drain around every graph modification.. Or do we already have it?

Well, I’m not saying it will solve every single bug, but draining in 
stream_prepare() will at least mean that that is safe from basically 
anything else, because it will prevent concurrent automatic graph 
changes (e.g. because of jobs finishing), and QMP commands shouldn’t be 
executed in drained sections either (when they are, it’s wrong, but that 
seems to occur only extremely rarely).  Draining alone should make a 
place safe, it isn’t a lock that all sides need to take.

>>>   The stream job isn’t protected from any graph modifications but 
>>> those coming from mirror.  Might be a solution going forward (I 
>>> didn’t look closer at it at the time, given I saw you had a 
>>> discussion with Kevin), if we lock every graph change operation 
>>> (though a global lock honestly doesn’t sound strictly better than 
>>> draining subsections of the graph, both have their drawbacks), but 
>>> that doesn’t look like it’d be something for 7.1.
>
> In the same way, with the draining solution you would have to do a 
> subtree-drain for every graph change operation.

Since we don’t have any lock yet, draining is the de-facto way we use to 
forbid concurrent graph modifications.  I’m not saying we use it 
correctly and thoroughly, but it is what we do right now.

>
>>
>> I wonder whether we could have a short-term version of 
>> `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, 
>> you just can’t change the graph, and you also can’t wait, so that’s 
>> just an error.  But if `.frozen_lock` is set, you can wait on it. 
>> Here, we’d keep `.frozen` set for all links between top and 
>> above_base, and then in prepare() take `.frozen_lock` on the link 
>> between above_base and base.
>>
>
> Yes, that seems to be an alternative to a global lock that doesn't 
> block the whole graph. Still, I don't think it is bad to lock the whole 
> graph for graph modification, as modifications should be rare and fast.

Fair enough.

> Another thought: does subtree-drain really drain the whole 
> connectivity component of the graph?
>
> imagine something like this:
>
> [A]  [   C  ]
>  |    |    |
>  v    v    v
> [ B    ]  [ D ]
>
>
> If we do subtree drain at node A, this will drain B and C, but not D..
>
> Imagine some other job is attached to node D, and it starts a drained 
> section too. Then, for example, both jobs will share a drained section 
> on node C. That doesn't seem safe, and draining is not a lock.
>
> So, if we are going to rely on a drained section as a lock that 
> isolates graph modifications from each other, we should drain the 
> whole connectivity component of the graph.

The drained section is not a lock, but if the other job is only attached 
to node D, it won’t change node C.  It might change the link from C to 
D, but that doesn’t concern the job that is concerned about A and B.  
Overlapping drains are fine.

> Next, I'm not really sure that two jobs can simultaneously enter a 
> drained section and do graph modifications. What prevents this? Assume 
> two block-stream jobs reach their finish simultaneously and go to 
> subtree-drain. That just means that job_pause will be called for both 
> jobs.. But what does that mean for the block-stream job that is in the 
> bdrv_subtree_drained_begin() call in stream_prepare()? Seems nothing? 
> Then both jobs will start the graph modification process simultaneously 
> and can interleave at any yield point (for example, rewriting 
> backing_file in qcow2 metadata).

So I don’t think that scenario can really happen, because the stream job 
freezes the chain between above_base and top, so you can’t really have 
two simultaneous stream jobs that cause problems between each other.

Furthermore, the prepare() functions are run in the main thread, so the 
only real danger is actually that draining around the actual graph 
modification (bdrv_set_backing_hd()) causes another block job to finish, 
modifying the block graph.  But then that job will also actually finish, 
because it’s all in the main thread.

It is true that child_job_drained_poll() says that jobs that are about to 
prepare() are quiesced, but I don’t think that’s a problem, given that 
all jobs in that state run in the main thread.

>
> Another reason why I think that subtree drain is the wrong tool is, 
> as I said, the extra IO stop.

I know and agree, but that’s an optimization question.

> Imagine the following graph:
>
> [A]
>  |
>  v
> [B] [C]
>  |   |
>  v   v
> [base]
>
> If we want to rebase A onto base, we actually only need to stop IO 
> requests in A and B. Why should C suffer from this graph modification? 
> IO requests produced by C, which live in C and in base, don't 
> intersect with the process of rebasing A onto base in any way.
>
> ====
>
> Actually, I'm not strictly against your patch, and I believe that it 
> fixes the problem in most cases. And it's probably OK in the short 
> term. The only real doubt about including it now is that drained 
> sections sometimes lead to deadlocks: is it possible that we now fix a 
> bug that happens only in iotest 030 (or is it reported somewhere else?) 
> while risking introducing some deadlock?

Saying that the example in 030 is contrived would mean we could/should 
re-include the base into the list of nodes that belong to the stream 
job, which would simply disallow the case in 030 that’s being tested and 
fails.

Then we don’t need a subtree drain, and the plain drain in 
bdrv_set_backing_hd() would be fine.

> It seems that if it's safe to call drained_begin() in some code, it 
> should be safe to call subtree_drained_begin() there. And if that 
> triggers some deadlock, it just exposes another bug.. Is it worth 
> fixing now, close to the 7.0 release? We have lived with this bug for 
> years.. Or has something changed that I missed?

I mean...  I can understand your concern that adding a subtree drain has 
performance implications (when a stream job ends, which shouldn’t be 
often).  But I’m not sure whether I should share the deadlock concern.  
Sounds like a sad state of affairs if I can’t just drain something when 
I need it to be drained.

I wasn’t aware of this bug, actually.  Now I am, and I feel rather 
uncomfortable living with a use-after-free bug, because that’s on the 
worse end of the bug spectrum.
Vladimir Sementsov-Ogievskiy March 29, 2022, 9:55 a.m. UTC | #9
29.03.2022 11:54, Hanna Reitz wrote:
> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>> 28.03.2022 11:09, Hanna Reitz wrote:
>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>> When the stream block job cuts out the nodes between top and base in
>>>>>> stream_prepare(), it does not drain the subtree manually; it fetches the
>>>>>> base node, and tries to insert it as the top node's backing node with
>>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>>>>>> the actual base node might change (because the base node is actually not
>>>>>> part of the stream job) before the old base node passed to
>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>
>>>>>> This has two implications:
>>>>>>
>>>>>> First, the stream job does not keep a strong reference to the base node.
>>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>>> because some other block job is drained to finish), we will get a
>>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>>
>>>>>> Second, even with such a strong reference, the problem remains that the
>>>>>> base node might change before bdrv_set_backing_hd() actually runs and as
>>>>>> a result the wrong base node is installed.
>>>>>
>>>>> Hmm.
>>>>>
>>>>> So, we don't really need a strong reference, as if it helps to avoid some use-after-free, it means that we'll finish up with wrong block graph..
>>>>
>>>> Sure.  But I found it better style to strongly reference a node while it’s used.  I’d rather have an outdated block graph (as in: A node that was supposed to disappear would still be in use) than a use-after-free.
>>>>
>>>>> Graph modifying operations must be somehow isolated from each other.
>>>>>
>>>>>>
>>>>>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>>>>>> case, which has five nodes, and simultaneously streams from the middle
>>>>>> node to the top node, and commits the middle node down to the base node.
>>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>>> above-described use-after-free.
>>>>>>
>>>>>> Taking a strong reference to the base node, we no longer get a crash,
>>>>>> but the resulting block graph is less than ideal: The expected result is
>>>>>> obviously that all middle nodes are cut out and the base node is the
>>>>>> immediate backing child of the top node.  However, if stream_prepare()
>>>>>> takes a strong reference to its base node (the middle node), and then
>>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>>>>>> that middle node, the stream job will just reinstall it again.
>>>>>>
>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>> stream_prepare(), so that the graph modification it performs is
>>>>>> effectively atomic, i.e. that the base node it fetches is still the base
>>>>>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>>>>>
>>>>> Emanuele has a similar idea of isolating graph changes from each other by subtree-drain.
>>>>>
>>>>> If I understand correctly the idea is that we'll drain all other block jobs, so they wouldn't do their block-graph modification during drained section. So, we can safely modify the graph.
>>>>>
>>>>> I don't like this idea:
>>>>>
>>>>> 1. drained section = stop IO. But we don't need to stop IO in the whole subtree to do a needed block-graph modification.
>>>>
>>>> If you mean to say that draining just the single node should be sufficient, I’ll be happy to change it.
>>>>
>>>> Not sure which node, though, because I’d think it would be `base`, but to safely fetch it I’d need to drain it, which seems to bite itself in the tail.  That’s why I went for a subtree drain from `above_base`.
>>>>
>>>>> 2. Drained section is not a lock, several clients may drain same set of nodes.. So we exploit the fact that concurrent clients will be paused by drained section and don't proceed to graph-modification code.. But are we sure that block-jobs are (and will be?) the only concurrent block-graph modifying clients? Can qmp commands interleave somehow?
>>>>
>>>> They can under very specific circumstances and that’s a bug. See https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .
>>>>
>>>>> Can some jobs from other subtree start a block-graph modification that touches our subtree?
>>>>
>>>> That would be wrong.  A block job shouldn’t change nodes it doesn’t own; stream doesn’t own the base, but it also doesn’t change it, it only needs to have the top node point to it.
>>>>
>>>>> If we go this way, it would be safer to drain the whole block-graph on any block-graph modification..
>>>>>
>>>>> I think we'd better have a separate global mechanism for isolating graph modifications. Something like a global co-mutex or queue, where clients wait for their turn in block graph modifications.
>>>>>
>>>>> Here is my old proposal on that topic: https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>>>>
>>>> That would only solve the very specific issue in 030, right?
>>
>> It should solve the same issue as your patch. You don't add subtree_drain around every graph modification.. Or do we already have it?
> 
> Well, I’m not saying it will solve every single bug, but draining in stream_prepare() will at least mean that that is safe from basically anything else, because it will prevent concurrent automatic graph changes (e.g. because of jobs finishing), and QMP commands shouldn’t be executed in drained sections either (when they are, it’s wrong, but that seems to occur only extremely rarely).  Draining alone should make a place safe, it isn’t a lock that all sides need to take.
> 
>>>>   The stream job isn’t protected from any graph modifications but those coming from mirror.  Might be a solution going forward (I didn’t look closer at it at the time, given I saw you had a discussion with Kevin), if we lock every graph change operation (though a global lock honestly doesn’t sound strictly better than draining subsections of the graph, both have their drawbacks), but that doesn’t look like it’d be something for 7.1.
>>
>> In the same way, with the draining solution you would have to do a subtree-drain for every graph change operation.
> 
> Since we don’t have any lock yet, draining is the de-facto way we use to forbid concurrent graph modifications.  I’m not saying we use it correctly and thoroughly, but it is what we do right now.
> 
>>
>>>
>>> I wonder whether we could have a short-term version of `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, you just can’t change the graph, and you also can’t wait, so that’s just an error.  But if `.frozen_lock` is set, you can wait on it. Here, we’d keep `.frozen` set for all links between top and above_base, and then in prepare() take `.frozen_lock` on the link between above_base and base.
>>>
>>
>> Yes, that seems to be an alternative to a global lock that doesn't block the whole graph. Still, I don't think it is bad to lock the whole graph for graph modification, as modifications should be rare and fast.
> 
> Fair enough.
> 
>> Another thought: does subtree-drain really drain the whole connectivity component of the graph?
>>
>> imagine something like this:
>>
>> [A]  [   C  ]
>>  |    |    |
>>  v    v    v
>> [ B    ]  [ D ]
>>
>>
>> If we do subtree drain at node A, this will drain B and C, but not D..
>>
>> Imagine some other job is attached to node D, and it will start a drained section too. So, for example, both jobs will share a drained section on node C. That doesn't seem safe, and draining is not a lock.
>>
>> So, if we are going to rely on a drained section as a lock that isolates graph modifications from each other, we should drain the whole connectivity component of the graph.
> 
> The drained section is not a lock, but if the other job is only attached to node D, it won’t change node C.  It might change the link from C to D, but that doesn’t concern the job that is concerned about A and B. Overlapping drains are fine.

OK. Maybe it works. It's just not obvious to me that subtree_drain works well in all cases. And a global graph-modification lock should obviously work.

> 
>> Next, I'm not really sure that two jobs can simultaneously enter a drained section and do graph modifications. What prevents this? Assume two block-stream jobs reach their finish simultaneously and go to subtree-drain. That just means that job_pause will be called for both jobs.. But what does that mean for a block-stream job that is in the bdrv_subtree_drained_begin() call in stream_prepare()? Seems nothing? Then both jobs will start the graph modification process simultaneously and can interleave at any yield point (for example rewriting backing_file in qcow2 metadata).
> 
> So I don’t think that scenario can really happen, because the stream job freezes the chain between above_base and top, so you can’t really have two simultaneous stream jobs that cause problems between each other.

They cause problems on the boundary, as the base of one stream job may be the top of another, and we also have a filter that should be inserted/removed at some moment. As I remember, that's the problematic case in 030..

> 
> Furthermore, the prepare() functions are run in the main thread, so the only real danger is actually that draining around the actual graph modification (bdrv_set_backing_hd()) causes another block job to finish, modifying the block graph.  But then that job will also actually finish, because it’s all in the main thread.
> 
> It is true that child_job_drained_poll() says that jobs that are about to prepare() are quiesced, but I don’t think that’s a problem, given that all jobs in that state run in the main thread.
> 
>>
>> Another reason why I think that subtree drain is the wrong tool, as I said, is the extra IO stop.
> 
> I know and agree, but that’s an optimization question.
> 
>> Imagine the following graph:
>>
>> [A]
>>  |
>>  v
>> [B] [C]
>>  |   |
>>  v   v
>> [base]
>>
>> If we want to rebase A onto base, we actually only need to stop IO requests in A and B. Why should C suffer from this graph modification? IO requests produced by C, living in C and in base, don't intersect with the process of rebasing A onto base in any way.
>>
>> ====
>>
>> Actually, I'm not strictly against your patch, and I believe that it fixes the problem in most cases. And it's probably OK in the short term. The only real doubt about including it now is that drained sections sometimes lead to deadlocks; is it possible that we now fix a bug that happens only in iotest 030 (or is it reported somewhere?) while risking introducing some deadlock?
> 
> Saying that the example in 030 is contrived would mean we could/should re-include the base into the list of nodes that belong to the stream job, which would simply disallow the case in 030 that’s being tested and fails.
> 
> Then we don’t need a subtree drain, and the plain drain in bdrv_set_backing_hd() would be fine.
> 
>> It seems that if in some code it's safe to call drained_begin(), it should be safe to call subtree_drained_begin(). And if it triggers some deadlock, that just shows another bug.. Is it worth fixing now, close to the 7.0 release? We have lived with this bug for years.. Or has something changed that I missed?
> 
> I mean...  I can understand your concern that adding a subtree drain has performance implications (when a stream job ends, which shouldn’t be often).  But I’m not sure whether I should share the deadlock concern. Sounds like a sad state of affairs if I can’t just drain something when I need it to be drained.
> 
> I wasn’t aware of this bug, actually.  Now I am, and I feel rather uncomfortable living with a use-after-free bug, because that’s on the worse end of the bug spectrum.
> 

OK, I don't know :) And others are silent. I agree that the global-lock solution is not for 7.0 anyway. And this one is simple enough. So, take my

Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Hanna Czenczek March 29, 2022, 12:15 p.m. UTC | #10
On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2022 11:54, Hanna Reitz wrote:
>> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.03.2022 11:09, Hanna Reitz wrote:
>>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>>> When the stream block job cuts out the nodes between top and 
>>>>>>> base in
>>>>>>> stream_prepare(), it does not drain the subtree manually; it 
>>>>>>> fetches the
>>>>>>> base node, and tries to insert it as the top node's backing node 
>>>>>>> with
>>>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will 
>>>>>>> drain, and so
>>>>>>> the actual base node might change (because the base node is 
>>>>>>> actually not
>>>>>>> part of the stream job) before the old base node passed to
>>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>>
>>>>>>> This has two implications:
>>>>>>>
>>>>>>> First, the stream job does not keep a strong reference to the 
>>>>>>> base node.
>>>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>>>> because some other block job is drained to finish), we will get a
>>>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>>>
>>>>>>> Second, even with such a strong reference, the problem remains 
>>>>>>> that the
>>>>>>> base node might change before bdrv_set_backing_hd() actually 
>>>>>>> runs and as
>>>>>>> a result the wrong base node is installed.
>>>>>>
>>>>>> Hmm.
>>>>>>
>>>>>> So, we don't really need a strong reference, as if it helps to 
>>>>>> avoid some use-after-free, it means that we'll end up with a 
>>>>>> wrong block graph..
>>>>>
>>>>> Sure.  But I found it better style to strongly reference a node 
>>>>> while it’s used.  I’d rather have an outdated block graph (as in: 
>>>>> A node that was supposed to disappear would still be in use) than 
>>>>> a use-after-free.
>>>>>
>>>>>> Graph modifying operations must be somehow isolated from each other.
>>>>>>
>>>>>>>
>>>>>>> Both effects can be seen in 030's 
>>>>>>> TestParallelOps.test_overlapping_5()
>>>>>>> case, which has five nodes, and simultaneously streams from the 
>>>>>>> middle
>>>>>>> node to the top node, and commits the middle node down to the 
>>>>>>> base node.
>>>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>>>> above-described use-after-free.
>>>>>>>
>>>>>>> Taking a strong reference to the base node, we no longer get a 
>>>>>>> crash,
>>>>>>> but the resulting block graph is less than ideal: The expected 
>>>>>>> result is
>>>>>>> obviously that all middle nodes are cut out and the base node is 
>>>>>>> the
>>>>>>> immediate backing child of the top node.  However, if 
>>>>>>> stream_prepare()
>>>>>>> takes a strong reference to its base node (the middle node), and 
>>>>>>> then
>>>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly 
>>>>>>> dropping
>>>>>>> that middle node, the stream job will just reinstall it again.
>>>>>>>
>>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>>> stream_prepare(), so that the graph modification it performs is
>>>>>>> effectively atomic, i.e. that the base node it fetches is still 
>>>>>>> the base
>>>>>>> node when bdrv_set_backing_hd() sets it as the top node's 
>>>>>>> backing node.
>>>>>>
>>>>>> Emanuele has a similar idea of isolating graph changes from each 
>>>>>> other by subtree-drain.
>>>>>>
>>>>>> If I understand correctly, the idea is that we'll drain all other 
>>>>>> block jobs, so they wouldn't do their block-graph modification 
>>>>>> during the drained section. So, we can safely modify the graph.
>>>>>>
>>>>>> I don't like this idea:
>>>>>>
>>>>>> 1. drained section = stop IO. But we don't need to stop IO in the 
>>>>>> whole subtree to do a needed block-graph modification.
>>>>>
>>>>> If you mean to say that draining just the single node should be 
>>>>> sufficient, I’ll be happy to change it.
>>>>>
>>>>> Not sure which node, though, because I’d think it would be `base`, 
>>>>> but to safely fetch it I’d need to drain it, which seems to bite 
>>>>> itself in the tail.  That’s why I went for a subtree drain from 
>>>>> `above_base`.
>>>>>
>>>>>> 2. A drained section is not a lock; several clients may drain the 
>>>>>> same set of nodes.. So we exploit the fact that concurrent clients 
>>>>>> will be paused by the drained section and won't proceed to 
>>>>>> graph-modification code.. But are we sure that block jobs are 
>>>>>> (and will be?) the only concurrent block-graph modifying clients? 
>>>>>> Can QMP commands interleave somehow?
>>>>>
>>>>> They can under very specific circumstances and that’s a bug. See 
>>>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html 
>>>>> .
>>>>>
>>>>>> Can some jobs from another subtree start a block-graph 
>>>>>> modification that touches our subtree?
>>>>>
>>>>> That would be wrong.  A block job shouldn’t change nodes it 
>>>>> doesn’t own; stream doesn’t own the base, but it also doesn’t 
>>>>> change it, it only needs to have the top node point to it.
>>>>>
>>>>>> If we go this way, it would be safer to drain the whole 
>>>>>> block graph on any block-graph modification..
>>>>>>
>>>>>> I think we'd better have a separate global mechanism for 
>>>>>> isolating graph modifications. Something like a global co-mutex 
>>>>>> or queue, where clients wait for their turn in block graph 
>>>>>> modifications.
>>>>>>
>>>>>> Here is my old proposal on that topic: 
>>>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/ 
>>>>>>
>>>>>
>>>>> That would only solve the very specific issue in 030, right?
>>>
>>> It should solve the same issue as your patch. You don't add 
>>> subtree_drain around every graph modification.. Or do we already have it?
>>
>> Well, I’m not saying it will solve every single bug, but draining in 
>> stream_prepare() will at least mean that that is safe from basically 
>> anything else, because it will prevent concurrent automatic graph 
>> changes (e.g. because of jobs finishing), and QMP commands shouldn’t 
>> be executed in drained sections either (when they do, it’s wrong, but 
>> that seems to occur only extremely rarely).  Draining alone should 
>> make a place safe, it isn’t a lock that all sides need to take.
>>
>>>>>   The stream job isn’t protected from any graph modifications but 
>>>>> those coming from mirror.  Might be a solution going forward (I 
>>>>> didn’t look closer at it at the time, given I saw you had a 
>>>>> discussion with Kevin), if we lock every graph change operation 
>>>>> (though a global lock honestly doesn’t sound strictly better than 
>>>>> draining subsections of the graph, both have their drawbacks), but 
>>>>> that doesn’t look like it’d be something for 7.1.
>>>
>>> In the same way, with the draining solution you should make a 
>>> subtree-drain for every graph change operation.
>>
>> Since we don’t have any lock yet, draining is the de-facto way we use 
>> to forbid concurrent graph modifications.  I’m not saying we use it 
>> correctly and thoroughly, but it is what we do right now.
>>
>>>
>>>>
>>>> I wonder whether we could have a short-term version of 
>>>> `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, 
>>>> you just can’t change the graph, and you also can’t wait, so that’s 
>>>> just an error.  But if `.frozen_lock` is set, you can wait on it. 
>>>> Here, we’d keep `.frozen` set for all links between top and 
>>>> above_base, and then in prepare() take `.frozen_lock` on the link 
>>>> between above_base and base.
>>>>
>>>
>>> Yes, that seems an alternative to a global lock that doesn't block 
>>> the whole graph. Still, I don't think it is bad to lock the whole 
>>> graph for graph modification, as modifications should be rare and fast.
>>
>> Fair enough.
>>
>>> Another thought: does subtree-drain really drain the whole 
>>> connectivity component of the graph?
>>>
>>> imagine something like this:
>>>
>>> [A]  [   C  ]
>>>  |    |    |
>>>  v    v    v
>>> [ B    ]  [ D ]
>>>
>>>
>>> If we do subtree drain at node A, this will drain B and C, but not D..
>>>
>>> Imagine some other job is attached to node D, and it will start a 
>>> drained section too. So, for example, both jobs will share a drained 
>>> section on node C. That doesn't seem safe, and draining is not a lock.
>>>
>>> So, if we are going to rely on a drained section as a lock that 
>>> isolates graph modifications from each other, we should drain the 
>>> whole connectivity component of the graph.
>>
>> The drained section is not a lock, but if the other job is only 
>> attached to node D, it won’t change node C.  It might change the link 
>> from C to D, but that doesn’t concern the job that is concerned about 
>> A and B. Overlapping drains are fine.
>
> OK. Maybe it works. It's just not obvious to me that subtree_drain 
> works well in all cases. And a global graph-modification lock should 
> obviously work.
>
>>
>>> Next, I'm not really sure that two jobs can simultaneously enter a 
>>> drained section and do graph modifications. What prevents this? 
>>> Assume two block-stream jobs reach their finish simultaneously and 
>>> go to subtree-drain. That just means that job_pause will be called 
>>> for both jobs.. But what does that mean for a block-stream job that 
>>> is in the bdrv_subtree_drained_begin() call in stream_prepare()? Seems 
>>> nothing? Then both jobs will start the graph modification process 
>>> simultaneously and can interleave at any yield point (for example 
>>> rewriting backing_file in qcow2 metadata).
>>
>> So I don’t think that scenario can really happen, because the stream 
>> job freezes the chain between above_base and top, so you can’t really 
>> have two simultaneous stream jobs that cause problems between each 
>> other.
>
> They cause problems on the boundary, as the base of one stream job may 
> be the top of another, and we also have a filter that should be 
> inserted/removed at some moment. As I remember, that's the problematic 
> case in 030..
>
>>
>> Furthermore, the prepare() functions are run in the main thread, so 
>> the only real danger is actually that draining around the actual 
>> graph modification (bdrv_set_backing_hd()) causes another block job 
>> to finish, modifying the block graph.  But then that job will also 
>> actually finish, because it’s all in the main thread.
>>
>> It is true that child_job_drained_poll() says that jobs that are about 
>> to prepare() are quiesced, but I don’t think that’s a problem, given 
>> that all jobs in that state run in the main thread.
>>
>>>
>>> Another reason why I think that subtree drain is the wrong tool, as 
>>> I said, is the extra IO stop.
>>
>> I know and agree, but that’s an optimization question.
>>
>>> Imagine the following graph:
>>>
>>> [A]
>>>  |
>>>  v
>>> [B] [C]
>>>  |   |
>>>  v   v
>>> [base]
>>>
>>> If we want to rebase A onto base, we actually only need to stop IO 
>>> requests in A and B. Why should C suffer from this graph 
>>> modification? IO requests produced by C, living in C and in base, 
>>> don't intersect with the process of rebasing A onto base in any way.
>>>
>>> ====
>>>
>>> Actually, I'm not strictly against your patch, and I believe that it 
>>> fixes the problem in most cases. And it's probably OK in the short 
>>> term. The only real doubt about including it now is that drained 
>>> sections sometimes lead to deadlocks; is it possible that we now fix 
>>> a bug that happens only in iotest 030 (or is it reported somewhere?) 
>>> while risking introducing some deadlock?
>>
>> Saying that the example in 030 is contrived would mean we 
>> could/should re-include the base into the list of nodes that belong 
>> to the stream job, which would simply disallow the case in 030 that’s 
>> being tested and fails.
>>
>> Then we don’t need a subtree drain, and the plain drain in 
>> bdrv_set_backing_hd() would be fine.
>>
>>> It seems that if in some code it's safe to call drained_begin(), it 
>>> should be safe to call subtree_drained_begin(). And if it triggers 
>>> some deadlock, that just shows another bug.. Is it worth fixing 
>>> now, close to the 7.0 release? We have lived with this bug for 
>>> years.. Or has something changed that I missed?
>>
>> I mean...  I can understand your concern that adding a subtree drain 
>> has performance implications (when a stream job ends, which shouldn’t 
>> be often).  But I’m not sure whether I should share the deadlock 
>> concern. Sounds like a sad state of affairs if I can’t just drain 
>> something when I need it to be drained.
>>
>> I wasn’t aware of this bug, actually.  Now I am, and I feel rather 
>> uncomfortable living with a use-after-free bug, because that’s on the 
>> worse end of the bug spectrum.
>>
>
> OK, I don't know :) And others are silent. I agree that the global-lock 
> solution is not for 7.0 anyway. And this one is simple enough. So, 
> take my
>
> Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>

Thanks!

A global lock solution sounds good to me for 7.1+, and it even provides 
a way to solve the problem of QMP commands being executed while 
other main-thread code is running.  (I mean, the QMP command would still 
be executed, but at least concurrent graph changes would be impossible.)
Emanuele Giuseppe Esposito March 30, 2022, 9:40 a.m. UTC | #11
Am 29/03/2022 um 14:15 schrieb Hanna Reitz:
> On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2022 11:54, Hanna Reitz wrote:
>>> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.03.2022 11:09, Hanna Reitz wrote:
>>>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>>>> When the stream block job cuts out the nodes between top and
>>>>>>>> base in
>>>>>>>> stream_prepare(), it does not drain the subtree manually; it
>>>>>>>> fetches the
>>>>>>>> base node, and tries to insert it as the top node's backing node
>>>>>>>> with
>>>>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will
>>>>>>>> drain, and so
>>>>>>>> the actual base node might change (because the base node is
>>>>>>>> actually not
>>>>>>>> part of the stream job) before the old base node passed to
>>>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>>>
>>>>>>>> This has two implications:
>>>>>>>>
>>>>>>>> First, the stream job does not keep a strong reference to the
>>>>>>>> base node.
>>>>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>>>>> because some other block job is drained to finish), we will get a
>>>>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>>>>
>>>>>>>> Second, even with such a strong reference, the problem remains
>>>>>>>> that the
>>>>>>>> base node might change before bdrv_set_backing_hd() actually
>>>>>>>> runs and as
>>>>>>>> a result the wrong base node is installed.
>>>>>>>
>>>>>>> Hmm.
>>>>>>>
>>>>>>> So, we don't really need a strong reference, as if it helps to
>>>>>>> avoid some use-after-free, it means that we'll end up with a
>>>>>>> wrong block graph..
>>>>>>
>>>>>> Sure.  But I found it better style to strongly reference a node
>>>>>> while it’s used.  I’d rather have an outdated block graph (as in:
>>>>>> A node that was supposed to disappear would still be in use) than
>>>>>> a use-after-free.
>>>>>>
>>>>>>> Graph modifying operations must be somehow isolated from each other.
>>>>>>>
>>>>>>>>
>>>>>>>> Both effects can be seen in 030's
>>>>>>>> TestParallelOps.test_overlapping_5()
>>>>>>>> case, which has five nodes, and simultaneously streams from the
>>>>>>>> middle
>>>>>>>> node to the top node, and commits the middle node down to the
>>>>>>>> base node.
>>>>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>>>>> above-described use-after-free.


Interesting. This test was badly crashing when I was applying the patch
"[PATCH 6/6] jobs: ensure sleep in job_sleep_ns is fully performed"

https://patchew.org/QEMU/20220208153655.1251658-1-eesposit@redhat.com
/20220208153655.1251658-7-eesposit@redhat.com/

I could not find the root cause of this, but I remember that
test_overlapping_5 and test_stream_parallel were sometimes failing.
I even tried running the tests on the main branch and could not
reproduce the random failure you describe above.


>>>>>>>>
>>>>>>>> Taking a strong reference to the base node, we no longer get a
>>>>>>>> crash,
>>>>>>>> but the resulting block graph is less than ideal: The expected
>>>>>>>> result is
>>>>>>>> obviously that all middle nodes are cut out and the base node is
>>>>>>>> the
>>>>>>>> immediate backing child of the top node.  However, if
>>>>>>>> stream_prepare()
>>>>>>>> takes a strong reference to its base node (the middle node), and
>>>>>>>> then
>>>>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly
>>>>>>>> dropping
>>>>>>>> that middle node, the stream job will just reinstall it again.
>>>>>>>>
>>>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>>>> stream_prepare(), so that the graph modification it performs is
>>>>>>>> effectively atomic, i.e. that the base node it fetches is still
>>>>>>>> the base
>>>>>>>> node when bdrv_set_backing_hd() sets it as the top node's
>>>>>>>> backing node.
>>>>>>>
>>>>>>> Emanuele has a similar idea of isolating graph changes from each
>>>>>>> other by subtree-drain.
>>>>>>>
>>>>>>> If I understand correctly, the idea is that we'll drain all other
>>>>>>> block jobs, so they wouldn't do their block-graph modification
>>>>>>> during the drained section. So, we can safely modify the graph.
>>>>>>>
>>>>>>> I don't like this idea:
>>>>>>>
>>>>>>> 1. drained section = stop IO. But we don't need to stop IO in the
>>>>>>> whole subtree to do a needed block-graph modification.
>>>>>>
>>>>>> If you mean to say that draining just the single node should be
>>>>>> sufficient, I’ll be happy to change it.
>>>>>>
>>>>>> Not sure which node, though, because I’d think it would be `base`,
>>>>>> but to safely fetch it I’d need to drain it, which seems to bite
>>>>>> itself in the tail.  That’s why I went for a subtree drain from
>>>>>> `above_base`.
>>>>>>
>>>>>>> 2. A drained section is not a lock; several clients may drain the
>>>>>>> same set of nodes.. So we exploit the fact that concurrent clients
>>>>>>> will be paused by the drained section and won't proceed to
>>>>>>> graph-modification code.. But are we sure that block jobs are
>>>>>>> (and will be?) the only concurrent block-graph modifying clients?
>>>>>>> Can QMP commands interleave somehow?
>>>>>>
>>>>>> They can under very specific circumstances and that’s a bug. See
>>>>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html
>>>>>> .
>>>>>>
>>>>>>> Can some jobs from another subtree start a block-graph
>>>>>>> modification that touches our subtree?
>>>>>>
>>>>>> That would be wrong.  A block job shouldn’t change nodes it
>>>>>> doesn’t own; stream doesn’t own the base, but it also doesn’t
>>>>>> change it, it only needs to have the top node point to it.
>>>>>>
>>>>>>> If we go this way, it would be safer to drain the whole
>>>>>>> block graph on any block-graph modification..
>>>>>>>
>>>>>>> I think we'd better have a separate global mechanism for
>>>>>>> isolating graph modifications. Something like a global co-mutex
>>>>>>> or queue, where clients wait for their turn in block graph
>>>>>>> modifications.
>>>>>>>
>>>>>>> Here is my old proposal on that topic:
>>>>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>>>>>>>
>>>>>>
>>>>>> That would only solve the very specific issue in 030, right?
>>>>
>>>> It should solve the same issue as your patch. You don't add
>>>> subtree_drain around every graph modification.. Or do we already have it?
>>>
>>> Well, I’m not saying it will solve every single bug, but draining in
>>> stream_prepare() will at least mean that that is safe from basically
>>> anything else, because it will prevent concurrent automatic graph
>>> changes (e.g. because of jobs finishing), and QMP commands shouldn’t
>>> be executed in drained sections either (when they do, it’s wrong, but
>>> that seems to occur only extremely rarely).  Draining alone should
>>> make a place safe, it isn’t a lock that all sides need to take.
>>>
>>>>>>   The stream job isn’t protected from any graph modifications but
>>>>>> those coming from mirror.  Might be a solution going forward (I
>>>>>> didn’t look closer at it at the time, given I saw you had a
>>>>>> discussion with Kevin), if we lock every graph change operation
>>>>>> (though a global lock honestly doesn’t sound strictly better than
>>>>>> draining subsections of the graph, both have their drawbacks), but
>>>>>> that doesn’t look like it’d be something for 7.1.
>>>>
>>>> In the same way, with the draining solution you should make a
>>>> subtree-drain for every graph change operation.
>>>
>>> Since we don’t have any lock yet, draining is the de-facto way we use
>>> to forbid concurrent graph modifications.  I’m not saying we use it
>>> correctly and thoroughly, but it is what we do right now.
>>>
>>>>
>>>>>
>>>>> I wonder whether we could have a short-term version of
>>>>> `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set,
>>>>> you just can’t change the graph, and you also can’t wait, so that’s
>>>>> just an error.  But if `.frozen_lock` is set, you can wait on it.
>>>>> Here, we’d keep `.frozen` set for all links between top and
>>>>> above_base, and then in prepare() take `.frozen_lock` on the link
>>>>> between above_base and base.
>>>>>
>>>>
>>>> Yes, that seems an alternative to a global lock that doesn't block
>>>> the whole graph. Still, I don't think it is bad to lock the whole
>>>> graph for graph modification, as modifications should be rare and fast.
>>>
>>> Fair enough.
>>>
>>>> Another thought: does subtree-drain really drain the whole
>>>> connectivity component of the graph?
>>>>
>>>> imagine something like this:
>>>>
>>>> [A]  [   C  ]
>>>>  |    |    |
>>>>  v    v    v
>>>> [ B    ]  [ D ]
>>>>
>>>>
>>>> If we do subtree drain at node A, this will drain B and C, but not D..
>>>>
>>>> Imagine some other job is attached to node D, and it will start a
>>>> drained section too. So, for example, both jobs will share a drained
>>>> section on node C. That doesn't seem safe, and draining is not a lock.
>>>>
>>>> So, if we are going to rely on a drained section as a lock that
>>>> isolates graph modifications from each other, we should drain the
>>>> whole connectivity component of the graph.
>>>
>>> The drained section is not a lock, but if the other job is only
>>> attached to node D, it won’t change node C.  It might change the link
>>> from C to D, but that doesn’t concern the job that is concerned about
>>> A and B. Overlapping drains are fine.
>>
>> OK. Maybe it works. It's just not obvious to me that subtree_drain
>> works well in all cases. And a global graph-modification lock should
>> obviously work.
>>
>>>
>>>> Next, I'm not really sure that two jobs can simultaneously enter a
>>>> drained section and do graph modifications. What prevents this?
>>>> Assume two block-stream jobs reach their finish simultaneously and
>>>> go to subtree-drain. That just means that job_pause will be called
>>>> for both jobs.. But what does that mean for a block-stream job that
>>>> is in the bdrv_subtree_drained_begin() call in stream_prepare()? Seems
>>>> nothing? Then both jobs will start the graph modification process
>>>> simultaneously and can interleave at any yield point (for example
>>>> rewriting backing_file in qcow2 metadata).
>>>
>>> So I don’t think that scenario can really happen, because the stream
>>> job freezes the chain between above_base and top, so you can’t really
>>> have two simultaneous stream jobs that cause problems between each
>>> other.
>>
>> They cause problems on the boundary, as the base of one stream job may
>> be the top of another, and we also have a filter that should be
>> inserted/removed at some moment. As I remember, that's the problematic
>> case in 030..
>>
>>>
>>> Furthermore, the prepare() functions are run in the main thread, so
>>> the only real danger is actually that draining around the actual
>>> graph modification (bdrv_set_backing_hd()) causes another block job
>>> to finish, modifying the block graph.  But then that job will also
>>> actually finish, because it’s all in the main thread.
>>>
>>> It is true that child_job_drained_poll() says that jobs that are about
>>> to prepare() are quiesced, but I don’t think that’s a problem, given
>>> that all jobs in that state run in the main thread.
>>>
>>>>
>>>> Another reason why I think that subtree drain is the wrong tool, as
>>>> I said, is the extra IO stop.
>>>
>>> I know and agree, but that’s an optimization question.
>>>
>>>> Imagine the following graph:
>>>>
>>>> [A]
>>>>  |
>>>>  v
>>>> [B] [C]
>>>>  |   |
>>>>  v   v
>>>> [base]
>>>>
>>>> If we want to rebase A onto base, we actually only need to stop IO
>>>> requests in A and B. Why should C suffer from this graph
>>>> modification? IO requests produced by C, living in C and in base,
>>>> don't intersect with the process of rebasing A onto base in any way.
>>>>
>>>> ====
>>>>
>>>> Actually, I'm not strictly against your patch, and I believe that it
>>>> fixes the problem in most cases. And it's probably OK in the short
>>>> term. The only real doubt about including it now is that drained
>>>> sections sometimes lead to deadlocks; is it possible that we now fix
>>>> a bug that happens only in iotest 030 (or is it reported somewhere?)
>>>> while risking introducing some deadlock?
>>>
>>> Saying that the example in 030 is contrived would mean we
>>> could/should re-include the base into the list of nodes that belong
>>> to the stream job, which would simply disallow the case in 030 that’s
>>> being tested and fails.
>>>
>>> Then we don’t need a subtree drain, and the plain drain in
>>> bdrv_set_backing_hd() would be fine.
>>>
>>>> It seems that if in some code it's safe to call drained_begin(), it
>>>> should be safe to call subtree_drained_begin(). And if it triggers
>>>> some deadlock, that just shows another bug.. Is it worth fixing
>>>> now, close to the 7.0 release? We have lived with this bug for
>>>> years.. Or has something changed that I missed?
>>>
>>> I mean...  I can understand your concern that adding a subtree drain
>>> has performance implications (when a stream job ends, which shouldn’t
>>> be often).  But I’m not sure whether I should share the deadlock
>>> concern. Sounds like a sad state of affairs if I can’t just drain
>>> something when I need it to be drained.
>>>
>>> I wasn’t aware of this bug, actually.  Now I am, and I feel rather
>>> uncomfortable living with a use-after-free bug, because that’s on the
>>> worse end of the bug spectrum.
>>>
>>
>> OK, I don't know :) And others are silent. I agree that the global-lock
>> solution is not for 7.0 anyway. And this one is simple enough. So,
>> take my
>>
>> Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
> 
> Thanks!
> 
> A global lock solution sounds good to me for 7.1+, and it even provides
> a solution to the problem of QMP commands being executed while other
> main-thread code is running.  (I mean, the QMP command would still be
> executed, but at least concurrent graph changes would be impossible.)
> 

Discussion about subtree and this "global lock" is here:
https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

@Hanna if you can give it a look and jump into the party as well, that
would be nice.
Maybe what Kevin proposed is what we need instead of subtree drains.

"Global lock" sounds like a slightly wrong name, by the way, as it seems
we are just re-implementing the AioContext lock.

Emanuele
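
The reference-counting issue the patch addresses can be sketched with a
toy model (hypothetical Python, not QEMU's actual API): a drain inside
bdrv_set_backing_hd() may let a parallel commit job drop the graph's only
reference to the base node, so stream_prepare() must pin it first, as the
bdrv_ref()/bdrv_unref() pair in the patch below does.

```python
class Node:
    """Toy refcounted node; stands in for a BlockDriverState."""
    def __init__(self):
        self.refcnt = 1          # the block graph's own reference
        self.deleted = False

    def ref(self):
        self.refcnt += 1

    def unref(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.deleted = True  # stands in for free() in the C code

def install_backing(base, drain):
    """Model bdrv_set_backing_hd(): the drain may run other jobs'
    completions before base is actually installed."""
    drain()
    return not base.deleted      # True iff installing base is still safe

# Buggy variant: no strong reference.  The drain drops the graph's only
# reference (e.g. a parallel commit job finishes), so base is gone by
# the time we install it -- the use-after-free seen in iotest 030.
base = Node()
unsafe_install_ok = install_backing(base, base.unref)

# Fixed variant, as in the patch: take a reference before the drain and
# release it afterwards.
base = Node()
base.ref()                       # bdrv_ref(base)
safe_install_ok = install_backing(base, base.unref)
base.unref()                     # bdrv_unref(base)

print(unsafe_install_ok, safe_install_ok)  # False True
```

Note that this only models the use-after-free; the second problem from
the commit message (the wrong base being installed) is what the subtree
drain itself addresses.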
Patch

diff --git a/block/stream.c b/block/stream.c
index 3acb59fe6a..694709bd25 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -64,7 +64,13 @@  static int stream_prepare(Job *job)
     bdrv_cor_filter_drop(s->cor_filter_bs);
     s->cor_filter_bs = NULL;
 
+    bdrv_subtree_drained_begin(s->above_base);
+
     base = bdrv_filter_or_cow_bs(s->above_base);
+    if (base) {
+        bdrv_ref(base);
+    }
+
     unfiltered_base = bdrv_skip_filters(base);
 
     if (bdrv_cow_child(unfiltered_bs)) {
@@ -75,14 +81,21 @@  static int stream_prepare(Job *job)
                 base_fmt = unfiltered_base->drv->format_name;
             }
         }
+
         bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
         ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
         if (local_err) {
             error_report_err(local_err);
-            return -EPERM;
+            ret = -EPERM;
+            goto out;
         }
     }
 
+out:
+    if (base) {
+        bdrv_unref(base);
+    }
+    bdrv_subtree_drained_end(s->above_base);
     return ret;
 }
 
diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 567bf1da67..14112835ed 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -436,6 +436,11 @@  class TestParallelOps(iotests.QMPTestCase):
         self.vm.run_job(job='node4', auto_dismiss=True)
         self.assert_no_active_block_jobs()
 
+        # Assert that node0 is now the backing node of node4
+        result = self.vm.qmp('query-named-block-nodes')
+        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
+        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
+
     # Test a block-stream and a block-commit job in parallel
     # Here the stream job is supposed to finish quickly in order to reproduce
     # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507