Message ID | 1548244464-633186-1-git-send-email-andrey.shinkevich@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,1/1] Stream block job involves copy-on-read filter | expand |
23.01.2019 14:54, Andrey Shinkevich wrote: > The copy-on-read filter driver is applied to block-stream operations. > The 'test_stream_parallel' in the file tests/qemu-iotests/030 runs > jobs that use nodes for streaming in parallel through the backing chain. > We've got filters being inserted to and removed from the backing chain > while jobs are running. As a result, a filter node may be passed as the > 'base' parameter to the stream_start() function when the base node name > is not specified (the base node is identified by its file name which is > the same to the related filter node). > Another issue is that a function keeps the pointer to the filter BDS > object that can be replaced and deleted after the co-routine switch. > For example, the function backing_bs() returns the pointer to the > backing BDS and the BDS reference counter is not incremented. > A solution (or workaround) made with the given patch for block-stream > job helps to pass all the iotests in the file tests/qemu-iotests/030. > Any piece of advice how to amend the solution will be appreciated. > I am looking forward to hearing from you. So, in short, when filters comes to node-graph, we have two problems with stream job: 1. Searching node by filename works bad. It may (and I think, should) be workarounded by using node-names and only node-names to select top and base node.. But should we, and how to support old scenarios with selecting nodes by filenames? 2. "base" works bad. Actually, job don't own base node, so it's illegal to keep pointer to it. Base may change during the job. So the best option, would be to use "bottom-node" concept instead of "base". If we don't want to change qmp interface, we should calculate bottom-node from base at block-job creation time, before any context switch, and keep pointer to bottom-node, instead of base. Also, we should rewrite bdrv_block_status to support bottom_node instead of base, as again, if caller owns only top and intermediate nodes, base may change during block_status calculation.
PINGing Please help! On 23/01/2019 14:54, Andrey Shinkevich wrote: > The copy-on-read filter driver is applied to block-stream operations. > The 'test_stream_parallel' in the file tests/qemu-iotests/030 runs > jobs that use nodes for streaming in parallel through the backing chain. > We've got filters being inserted to and removed from the backing chain > while jobs are running. As a result, a filter node may be passed as the > 'base' parameter to the stream_start() function when the base node name > is not specified (the base node is identified by its file name which is > the same to the related filter node). > Another issue is that a function keeps the pointer to the filter BDS > object that can be replaced and deleted after the co-routine switch. > For example, the function backing_bs() returns the pointer to the > backing BDS and the BDS reference counter is not incremented. > A solution (or workaround) made with the given patch for block-stream > job helps to pass all the iotests in the file tests/qemu-iotests/030. > Any piece of advice how to amend the solution will be appreciated. > I am looking forward to hearing from you. > > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> > --- > block/stream.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 143 insertions(+), 11 deletions(-) > > diff --git a/block/stream.c b/block/stream.c > index 7a49ac0..18e51b3 100644 > --- a/block/stream.c > +++ b/block/stream.c > @@ -16,6 +16,7 @@ > #include "block/block_int.h" > #include "block/blockjob_int.h" > #include "qapi/error.h" > +#include "qapi/qmp/qdict.h" > #include "qapi/qmp/qerror.h" > #include "qemu/ratelimit.h" > #include "sysemu/block-backend.h" > @@ -35,8 +36,26 @@ typedef struct StreamBlockJob { > BlockdevOnError on_error; > char *backing_file_str; > bool bs_read_only; > + BlockDriverState *cor_filter_bs; > + BlockDriverState *target_bs; > } StreamBlockJob; > > +static BlockDriverState *child_file_bs(BlockDriverState *bs) > +{ > + return bs->file ? bs->file->bs : NULL; > +} > + > +static BlockDriverState *skip_filter(BlockDriverState *bs) > +{ > + BlockDriverState *ret_bs = bs; > + > + while (ret_bs && ret_bs->drv && ret_bs->drv->is_filter) { > + ret_bs = child_file_bs(ret_bs); > + } > + > + return ret_bs; > +} > + > static int coroutine_fn stream_populate(BlockBackend *blk, > int64_t offset, uint64_t bytes, > void *buf) > @@ -51,14 +70,12 @@ static int coroutine_fn stream_populate(BlockBackend *blk, > qemu_iovec_init_external(&qiov, &iov, 1); > > /* Copy-on-read the unallocated clusters */ > - return blk_co_preadv(blk, offset, qiov.size, &qiov, BDRV_REQ_COPY_ON_READ); > + return blk_co_preadv(blk, offset, qiov.size, &qiov, 0); > } > > -static int stream_prepare(Job *job) > +static int stream_change_backing_file(StreamBlockJob *s) > { > - StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); > - BlockJob *bjob = &s->common; > - BlockDriverState *bs = blk_bs(bjob->blk); > + BlockDriverState *bs = s->target_bs; > BlockDriverState *base = s->base; > Error *local_err = NULL; > int ret = 0; > @@ -82,11 +99,53 @@ static int stream_prepare(Job *job) > return ret; > } > > +static int stream_exit(Job *job, bool abort) > +{ > + StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); > + BlockJob *bjob = &s->common; > + BlockDriverState *target_bs = s->target_bs; > + int ret = 0; > + > + /* Retain the BDS until we complete the graph change. */ > + bdrv_ref(target_bs); > + /* Hold a guest back from writing while permissions are being reset. */ > + bdrv_drained_begin(target_bs); > + /* Drop permissions before the graph change. */ > + bdrv_child_try_set_perm(s->cor_filter_bs->file, 0, BLK_PERM_ALL, > + &error_abort); > + if (!abort) { > + ret = stream_change_backing_file(s); > + } > + > + bdrv_replace_node(s->cor_filter_bs, target_bs, &error_abort); > + /* Switch the BB back to the filter so that job terminated properly. */ > + blk_remove_bs(bjob->blk); > + blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort); > + blk_insert_bs(bjob->blk, s->cor_filter_bs, &error_abort); > + > + bdrv_drained_end(target_bs); > + bdrv_unref(target_bs); > + /* Submit control over filter to the job instance. */ > + bdrv_unref(s->cor_filter_bs); > + > + return ret; > +} > + > +static int stream_prepare(Job *job) > +{ > + return stream_exit(job, false); > +} > + > +static void stream_abort(Job *job) > +{ > + stream_exit(job, job->ret < 0); > +} > + > static void stream_clean(Job *job) > { > StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); > BlockJob *bjob = &s->common; > - BlockDriverState *bs = blk_bs(bjob->blk); > + BlockDriverState *bs = s->target_bs; > > /* Reopen the image back in read-only mode if necessary */ > if (s->bs_read_only) { > @@ -102,7 +161,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp) > { > StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); > BlockBackend *blk = s->common.blk; > - BlockDriverState *bs = blk_bs(blk); > + BlockDriverState *bs = s->target_bs; > BlockDriverState *base = s->base; > int64_t len; > int64_t offset = 0; > @@ -213,12 +272,64 @@ static const BlockJobDriver stream_job_driver = { > .free = block_job_free, > .run = stream_run, > .prepare = stream_prepare, > + .abort = stream_abort, > .clean = stream_clean, > .user_resume = block_job_user_resume, > .drain = block_job_drain, > }, > }; > > +static BlockDriverState *create_filter_node(BlockDriverState *bs, > + Error **errp) > +{ > + QDict *opts = qdict_new(); > + > + qdict_put_str(opts, "driver", "copy-on-read"); > + qdict_put_str(opts, "file", bdrv_get_node_name(bs)); > + > + return bdrv_open(NULL, NULL, opts, BDRV_O_RDWR, errp); > +} > + > +static void remove_filter(BlockDriverState *cor_filter_bs) > +{ > + BlockDriverState *bs = child_file_bs(cor_filter_bs); > + > + /* Hold a guest back from writing until we remove the filter */ > + bdrv_drained_begin(bs); > + bdrv_child_try_set_perm(cor_filter_bs->file, 0, BLK_PERM_ALL, > + &error_abort); > + bdrv_replace_node(cor_filter_bs, bs, &error_abort); > + bdrv_drained_end(bs); > + > + bdrv_unref(cor_filter_bs); > +} > + > +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) > +{ > + BlockDriverState *cor_filter_bs; > + Error *local_err = NULL; > + > + cor_filter_bs = create_filter_node(bs, errp); > + if (cor_filter_bs == NULL) { > + error_prepend(errp, "Could not create filter node: "); > + return NULL; > + } > + > + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); > + > + bdrv_drained_begin(bs); > + bdrv_replace_node(bs, cor_filter_bs, &local_err); > + bdrv_drained_end(bs); > + > + if (local_err) { > + bdrv_unref(cor_filter_bs); > + error_propagate(errp, local_err); > + return NULL; > + } > + > + return cor_filter_bs; > +} > + > void stream_start(const char *job_id, BlockDriverState *bs, > BlockDriverState *base, const char *backing_file_str, > int creation_flags, int64_t speed, > @@ -227,6 +338,14 @@ void stream_start(const char *job_id, BlockDriverState *bs, > StreamBlockJob *s; > BlockDriverState *iter; > bool bs_read_only; > + BlockDriverState *cor_filter_bs; > + > + /* > + * The base node might be identified by its file name rather than > + * by its node name. In that case, we can encounter a filter node > + * instead which has other BS pointer and the same file name. > + */ > + base = skip_filter(base); > > /* Make sure that the image is opened in read-write mode */ > bs_read_only = bdrv_is_read_only(bs); > @@ -236,10 +355,15 @@ void stream_start(const char *job_id, BlockDriverState *bs, > } > } > > + cor_filter_bs = insert_filter(bs, errp); > + if (cor_filter_bs == NULL) { > + goto fail; > + } > + > /* Prevent concurrent jobs trying to modify the graph structure here, we > * already have our own plans. Also don't allow resize as the image size is > * queried only at the job start and then cached. */ > - s = block_job_create(job_id, &stream_job_driver, NULL, bs, > + s = block_job_create(job_id, &stream_job_driver, NULL, cor_filter_bs, > BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | > BLK_PERM_GRAPH_MOD, > BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | > @@ -249,16 +373,21 @@ void stream_start(const char *job_id, BlockDriverState *bs, > goto fail; > } > > - /* Block all intermediate nodes between bs and base, because they will > + /* > + * Block all intermediate nodes between bs and base, because they will > * disappear from the chain after this operation. The streaming job reads > * every block only once, assuming that it doesn't change, so block writes > - * and resizes. */ > - for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) { > + * and resizes. We account a filter node may be a part of the graph. > + */ > + for (iter = skip_filter(backing_bs(bs)); iter && iter != base; > + iter = skip_filter(backing_bs(iter))) { > block_job_add_bdrv(&s->common, "intermediate node", iter, 0, > BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, > &error_abort); > } > > + s->cor_filter_bs = cor_filter_bs; > + s->target_bs = bs; > s->base = base; > s->backing_file_str = g_strdup(backing_file_str); > s->bs_read_only = bs_read_only; > @@ -269,6 +398,9 @@ void stream_start(const char *job_id, BlockDriverState *bs, > return; > > fail: > + if (cor_filter_bs) { > + remove_filter(cor_filter_bs); > + } > if (bs_read_only) { > bdrv_reopen_set_read_only(bs, true, NULL); > } >
On Wed 23 Jan 2019 12:54:24 PM CET, Andrey Shinkevich wrote: > +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) > +{ > + BlockDriverState *cor_filter_bs; > + Error *local_err = NULL; > + > + cor_filter_bs = create_filter_node(bs, errp); > + if (cor_filter_bs == NULL) { > + error_prepend(errp, "Could not create filter node: "); > + return NULL; > + } > + > + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); > + > + bdrv_drained_begin(bs); > + bdrv_replace_node(bs, cor_filter_bs, &local_err); > + bdrv_drained_end(bs); I think this was already discussed in the previous version of this patch: if you insert a copy-on-read filter here then all guest reads will copy the data from the backing chain, but you don't want to copy anything below the 'base' node, so the copy-on-read filter needs a 'base' parameter. Berto
On 08/02/2019 16:13, Alberto Garcia wrote: > On Wed 23 Jan 2019 12:54:24 PM CET, Andrey Shinkevich wrote: >> +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) >> +{ >> + BlockDriverState *cor_filter_bs; >> + Error *local_err = NULL; >> + >> + cor_filter_bs = create_filter_node(bs, errp); >> + if (cor_filter_bs == NULL) { >> + error_prepend(errp, "Could not create filter node: "); >> + return NULL; >> + } >> + >> + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); >> + >> + bdrv_drained_begin(bs); >> + bdrv_replace_node(bs, cor_filter_bs, &local_err); >> + bdrv_drained_end(bs); > > I think this was already discussed in the previous version of this > patch: if you insert a copy-on-read filter here then all guest reads > will copy the data from the backing chain, but you don't want to copy > anything below the 'base' node, so the copy-on-read filter needs a > 'base' parameter. > > Berto > Thank you, Alberto. We will sort it out for sure. Before that, we still have the unresolved issue with getting the filter as the 'base' input parameter. That happens sometimes because the base is being searched by the file name, which is the same to the one of the filter, rather than by the node name (!). The given series is a some sort of workaround. We are looking forward to hearing from the maintainers how to improve the solution. Is there any idea?
On Fri 08 Feb 2019 04:29:48 PM CET, Andrey Shinkevich wrote: > On 08/02/2019 16:13, Alberto Garcia wrote: >> On Wed 23 Jan 2019 12:54:24 PM CET, Andrey Shinkevich wrote: >>> +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) >>> +{ >>> + BlockDriverState *cor_filter_bs; >>> + Error *local_err = NULL; >>> + >>> + cor_filter_bs = create_filter_node(bs, errp); >>> + if (cor_filter_bs == NULL) { >>> + error_prepend(errp, "Could not create filter node: "); >>> + return NULL; >>> + } >>> + >>> + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); >>> + >>> + bdrv_drained_begin(bs); >>> + bdrv_replace_node(bs, cor_filter_bs, &local_err); >>> + bdrv_drained_end(bs); >> >> I think this was already discussed in the previous version of this >> patch: if you insert a copy-on-read filter here then all guest reads >> will copy the data from the backing chain, but you don't want to copy >> anything below the 'base' node, so the copy-on-read filter needs a >> 'base' parameter. >> > Before that, we still have the unresolved issue with getting the > filter as the 'base' input parameter. That happens sometimes because > the base is being searched by the file name, which is the same to the > one of the filter, rather than by the node name (!) I'm not sure if I understand. block-stream has 'base' and 'base-node' parameters, and in the first case you would convert the file name to a node name. stream_start() gets the BlockDriverState, not the file name, so why can't you get the node name from there and pass it to the copy-on-read filter? Berto
11.02.2019 17:07, Alberto Garcia wrote: > On Fri 08 Feb 2019 04:29:48 PM CET, Andrey Shinkevich wrote: >> On 08/02/2019 16:13, Alberto Garcia wrote: >>> On Wed 23 Jan 2019 12:54:24 PM CET, Andrey Shinkevich wrote: >>>> +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) >>>> +{ >>>> + BlockDriverState *cor_filter_bs; >>>> + Error *local_err = NULL; >>>> + >>>> + cor_filter_bs = create_filter_node(bs, errp); >>>> + if (cor_filter_bs == NULL) { >>>> + error_prepend(errp, "Could not create filter node: "); >>>> + return NULL; >>>> + } >>>> + >>>> + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); >>>> + >>>> + bdrv_drained_begin(bs); >>>> + bdrv_replace_node(bs, cor_filter_bs, &local_err); >>>> + bdrv_drained_end(bs); >>> >>> I think this was already discussed in the previous version of this >>> patch: if you insert a copy-on-read filter here then all guest reads >>> will copy the data from the backing chain, but you don't want to copy >>> anything below the 'base' node, so the copy-on-read filter needs a >>> 'base' parameter. >>> >> Before that, we still have the unresolved issue with getting the >> filter as the 'base' input parameter. That happens sometimes because >> the base is being searched by the file name, which is the same to the >> one of the filter, rather than by the node name (!) > > I'm not sure if I understand. block-stream has 'base' and 'base-node' > parameters, and in the first case you would convert the file name to a > node name. > > stream_start() gets the BlockDriverState, not the file name, so why > can't you get the node name from there and pass it to the copy-on-read > filter? > > Berto > The problem is in the concept of "base" node. The code written in manner that base is not changed during block job. However, job don't own base and there is no guarantee that it will not change during the job. Moreover, when we create a filter above top node, starting several stream jobs to nearby regions of nodes in common backing chain it leads exatly to this situation: during the job an unrelated filter may be inserted between our bottom-intermediate-node and base-node. Or, contra-wise, we can start with base on some filter, which is removed during our job. So, my suggestion is: 1. calculate bottome-intermediate-node as soon as possible, at the very beginning of qmp-command handling, before any context switch. 2. don't keep link to base-node in job, and operate relatively to bottom-intermediate-node 3. and this leads to implementation of block_status() function which will take bottom-node parameter instead of base, and may be other similar things.
On 11/02/2019 17:07, Alberto Garcia wrote: > On Fri 08 Feb 2019 04:29:48 PM CET, Andrey Shinkevich wrote: >> On 08/02/2019 16:13, Alberto Garcia wrote: >>> On Wed 23 Jan 2019 12:54:24 PM CET, Andrey Shinkevich wrote: >>>> +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) >>>> +{ >>>> + BlockDriverState *cor_filter_bs; >>>> + Error *local_err = NULL; >>>> + >>>> + cor_filter_bs = create_filter_node(bs, errp); >>>> + if (cor_filter_bs == NULL) { >>>> + error_prepend(errp, "Could not create filter node: "); >>>> + return NULL; >>>> + } >>>> + >>>> + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); >>>> + >>>> + bdrv_drained_begin(bs); >>>> + bdrv_replace_node(bs, cor_filter_bs, &local_err); >>>> + bdrv_drained_end(bs); >>> >>> I think this was already discussed in the previous version of this >>> patch: if you insert a copy-on-read filter here then all guest reads >>> will copy the data from the backing chain, but you don't want to copy >>> anything below the 'base' node, so the copy-on-read filter needs a >>> 'base' parameter. >>> >> Before that, we still have the unresolved issue with getting the >> filter as the 'base' input parameter. That happens sometimes because >> the base is being searched by the file name, which is the same to the >> one of the filter, rather than by the node name (!) > > I'm not sure if I understand. block-stream has 'base' and 'base-node' > parameters, and in the first case you would convert the file name to a > node name. > > stream_start() gets the BlockDriverState, not the file name, so why > can't you get the node name from there and pass it to the copy-on-read > filter? > > Berto > When the block-stream QMP command is run with a base file path name, qmp_block_stream() invokes bdrv_lookup_bs() to get the BlockDriverState by the file name and to pass it further to stream_start(), while has_base_node == false. It works the same way for the copy-on-read filter inserted above the QCOW2 node because they both have the same and only file path name. So, the BlockDriverState of the filter is passed to the stream_start() as the parameter instead of the one of the QCOW2 base node. I emailed the workaround patch that skips the filter BS passed to the stream_start() as the parameter, if any. There is an idea of Vladimir to pass an intermediate node that has the base node as its backing one instead. If you can suggest another idea, it will be appreciated.
On Mon 11 Feb 2019 03:51:33 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>>> I think this was already discussed in the previous version of this >>>> patch: if you insert a copy-on-read filter here then all guest >>>> reads will copy the data from the backing chain, but you don't want >>>> to copy anything below the 'base' node, so the copy-on-read filter >>>> needs a 'base' parameter. >>>> >>> Before that, we still have the unresolved issue with getting the >>> filter as the 'base' input parameter. That happens sometimes because >>> the base is being searched by the file name, which is the same to >>> the one of the filter, rather than by the node name (!) >> >> I'm not sure if I understand. block-stream has 'base' and 'base-node' >> parameters, and in the first case you would convert the file name to >> a node name. >> >> stream_start() gets the BlockDriverState, not the file name, so why >> can't you get the node name from there and pass it to the >> copy-on-read filter? > > The problem is in the concept of "base" node. The code written in > manner that base is not changed during block job. However, job don't > own base and there is no guarantee that it will not change during the > job. But if that's the case then we have a problem already, because 'base' is a member of StreamBlockJob and is used in the existing stream_run() code. So if there's a way to make 'base' disappear during the job (how?) then we could protect it with block_job_add_bdrv(). Berto
11.02.2019 18:52, Alberto Garcia wrote: > On Mon 11 Feb 2019 03:51:33 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>>>> I think this was already discussed in the previous version of this >>>>> patch: if you insert a copy-on-read filter here then all guest >>>>> reads will copy the data from the backing chain, but you don't want >>>>> to copy anything below the 'base' node, so the copy-on-read filter >>>>> needs a 'base' parameter. >>>>> >>>> Before that, we still have the unresolved issue with getting the >>>> filter as the 'base' input parameter. That happens sometimes because >>>> the base is being searched by the file name, which is the same to >>>> the one of the filter, rather than by the node name (!) >>> >>> I'm not sure if I understand. block-stream has 'base' and 'base-node' >>> parameters, and in the first case you would convert the file name to >>> a node name. >>> >>> stream_start() gets the BlockDriverState, not the file name, so why >>> can't you get the node name from there and pass it to the >>> copy-on-read filter? >> >> The problem is in the concept of "base" node. The code written in >> manner that base is not changed during block job. However, job don't >> own base and there is no guarantee that it will not change during the >> job. > > But if that's the case then we have a problem already, because 'base' is > a member of StreamBlockJob and is used in the existing stream_run() > code. I think it should be possible to reproduce, using block-commit (which already has filter) with block-stream in parallel, we'll try. > > So if there's a way to make 'base' disappear during the job (how?) then > we could protect it with block_job_add_bdrv(). I'm not sure this is correct. What is the reason for stream to own base? It's not really interested in it. At least it'll break iotest test_stream_parallel which creates parallel stream jobs, which shares border-nodes (ones top is others base). However this scenario looks not real-life. So, it's questionable, should stream block base or not. Not blocking is anyway more flexible, and my suggestion tries to keep base not blocked. Hm, I think it would be better to continue discussion over working reproducer, I hope we'll make it soon.
On Mon 11 Feb 2019 05:58:05 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>> The problem is in the concept of "base" node. The code written in >>> manner that base is not changed during block job. However, job don't >>> own base and there is no guarantee that it will not change during >>> the job. >> >> But if that's the case then we have a problem already, because 'base' >> is a member of StreamBlockJob and is used in the existing >> stream_run() code. > > I think it should be possible to reproduce, using block-commit (which > already has filter) with block-stream in parallel, we'll try. It's not possible to run block-stream and block-commit in parallel on the same nodes. See iotest 030 for a few examples. So unless there's a bug it should be safe. >> So if there's a way to make 'base' disappear during the job (how?) >> then we could protect it with block_job_add_bdrv(). > > I'm not sure this is correct. What is the reason for stream to own > base? It's not really interested in it. stream does not need to write or modify base, but it does need to keep a reference to it in order to now where to stop copying data. As I said earlier base is a member of StreamBlockJob, so it should not disappear during the job. Berto
On 12/02/2019 14:35, Alberto Garcia wrote: > On Mon 11 Feb 2019 05:58:05 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>>> The problem is in the concept of "base" node. The code written in >>>> manner that base is not changed during block job. However, job don't >>>> own base and there is no guarantee that it will not change during >>>> the job. >>> >>> But if that's the case then we have a problem already, because 'base' >>> is a member of StreamBlockJob and is used in the existing >>> stream_run() code. >> >> I think it should be possible to reproduce, using block-commit (which >> already has filter) with block-stream in parallel, we'll try. > > It's not possible to run block-stream and block-commit in parallel on > the same nodes. See iotest 030 for a few examples. > > So unless there's a bug it should be safe. > >>> So if there's a way to make 'base' disappear during the job (how?) >>> then we could protect it with block_job_add_bdrv(). >> >> I'm not sure this is correct. What is the reason for stream to own >> base? It's not really interested in it. > > stream does not need to write or modify base, but it does need to keep a > reference to it in order to now where to stop copying data. > > As I said earlier base is a member of StreamBlockJob, so it should not > disappear during the job. > > Berto > No doubt that a reference to the base node is needed as a limit for the COR work. The base is still in the game. Actually, we encounter an issue when BlockDriverState of the COR-filter comes into play instead of the base. It is inherited from a parallel job. Based on the case test_stream_parallel from the qemu-iotests/030, it works like this: 1. Job#1 is started and inserts a COR-filter above the top node. 2. Context switch. 3. Job#2 is started and inherits the COR-filter as the base. 4. Context switch. 5. Job#1 comes to the end and removes its COR-filter referenced by job#2. 6. Context switch. 7. Job#2 comes to the end and uses the deleted COR-filter node. 8. We are in trouble. Or another scenario: 1-4. As above 5. Job#2 comes to the end first and keeps the COR-filter as the backing node. 6. The assert() fails as the referenced COR-filter was not deleted on exit. 7. The game is over with a poor score. If we just keep the intermediate bottom node instead of the base, we will have a similar issue. On the job exit, we change the backing file. If we call the backing_bs(bottom_node_bs), we will keep the COR-filter node instead, whereas it has not been removed jet by the unfinished job#1 (see the second scenario above). If we include the base node into the job node list with some protecting flags, that's block_job_add_bdrv(&s->common, "base", base, ..), we can end up with a failure of the bdrv_check_update_perm() called by the bdrv_root_attach_child() because the previous job#1 has the root reference to the same node with other permissions. So, the next job will fail to start and the test cases won't pass. It we set the following flags, there will be no failure: block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD, &error_abort); But what margin will we gain? We will have the same issues as above. In many other cases, when the filter has not been identified, we can get a broken chain while calling to the backing_bs(bs=filter) as the backing node of the filter is the zero pointer. That's why I implemented the skip_filter() function in this series as a some sort of solution or workaround. May we proceed with that? Is there any better idea?
14.02.2019 16:43, Andrey Shinkevich wrote: > > > On 12/02/2019 14:35, Alberto Garcia wrote: >> On Mon 11 Feb 2019 05:58:05 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>>>> The problem is in the concept of "base" node. The code written in >>>>> manner that base is not changed during block job. However, job don't >>>>> own base and there is no guarantee that it will not change during >>>>> the job. >>>> >>>> But if that's the case then we have a problem already, because 'base' >>>> is a member of StreamBlockJob and is used in the existing >>>> stream_run() code. >>> >>> I think it should be possible to reproduce, using block-commit (which >>> already has filter) with block-stream in parallel, we'll try. >> >> It's not possible to run block-stream and block-commit in parallel on >> the same nodes. See iotest 030 for a few examples. >> >> So unless there's a bug it should be safe. >> >>>> So if there's a way to make 'base' disappear during the job (how?) >>>> then we could protect it with block_job_add_bdrv(). >>> >>> I'm not sure this is correct. What is the reason for stream to own >>> base? It's not really interested in it. >> >> stream does not need to write or modify base, but it does need to keep a >> reference to it in order to now where to stop copying data. >> >> As I said earlier base is a member of StreamBlockJob, so it should not >> disappear during the job. >> >> Berto >> > > No doubt that a reference to the base node is needed as a limit for the > COR work. The base is still in the game. Actually, we encounter an issue > when BlockDriverState of the COR-filter comes into play instead of the > base. It is inherited from a parallel job. Based on the case > test_stream_parallel from the qemu-iotests/030, it works like this: > 1. Job#1 is started and inserts a COR-filter above the top node. > 2. Context switch. > 3. Job#2 is started and inherits the COR-filter as the base. > 4. Context switch. > 5. Job#1 comes to the end and removes its COR-filter referenced by > job#2. > 6. Context switch. > 7. Job#2 comes to the end and uses the deleted COR-filter node. > 8. We are in trouble. > > Or another scenario: > 1-4. As above > 5. Job#2 comes to the end first and keeps the COR-filter as the > backing node. > 6. The assert() fails as the referenced COR-filter was not deleted > on exit. > 7. The game is over with a poor score. > > If we just keep the intermediate bottom node instead of the base, we > will have a similar issue. On the job exit, we change the backing file. > If we call the backing_bs(bottom_node_bs), we will keep the COR-filter > node instead, whereas it has not been removed jet by the unfinished > job#1 (see the second scenario above). Honestly, don't see any problem with it. > > If we include the base node into the job node list with some protecting > flags, that's block_job_add_bdrv(&s->common, "base", base, ..), we can > end up with a failure of the bdrv_check_update_perm() called by the > bdrv_root_attach_child() because the previous job#1 has the root > reference to the same node with other permissions. So, the next job will > fail to start and the test cases won't pass. You mean test 30 # testcase about parallel stream jobs, which starts streams, sharing one node as base for one and top for another. And if we lock base in stream job the test fails. I think, we just need to update the test, by inserting additional empty qcow2 nodes below shared ones, and make new inserted nodes to be top nodes of stream jobs, so there no more node-sharing between jobs. > > It we set the following flags, there will be no failure: > block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_CONSISTENT_READ > | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD, &error_abort); > But what margin will we gain? We will have the same issues as above. We meed flags ..., BLK_PERM_GRAPH_MOD, BLK_PERM_ALL & ~BLK_PERM_GRAPH_MOD, ... for base node. > > In many other cases, when the filter has not been identified, we can > get a broken chain while calling to the backing_bs(bs=filter) as the > backing node of the filter is the zero pointer. > > That's why I implemented the skip_filter() function in this series as > a some sort of solution or workaround. May we proceed with that? > Is there any better idea? > I think now, that best way is to lock base node. As it is the simplest one. And it corresponds to current code, which actually keeps illegal pointer to base node, without any access rights.
On 18/02/2019 13:08, Vladimir Sementsov-Ogievskiy wrote: > 14.02.2019 16:43, Andrey Shinkevich wrote: >> >> >> On 12/02/2019 14:35, Alberto Garcia wrote: >>> On Mon 11 Feb 2019 05:58:05 PM CET, Vladimir Sementsov-Ogievskiy wrote: >>>>>> The problem is in the concept of "base" node. The code written in >>>>>> manner that base is not changed during block job. However, job don't >>>>>> own base and there is no guarantee that it will not change during >>>>>> the job. >>>>> >>>>> But if that's the case then we have a problem already, because 'base' >>>>> is a member of StreamBlockJob and is used in the existing >>>>> stream_run() code. >>>> >>>> I think it should be possible to reproduce, using block-commit (which >>>> already has filter) with block-stream in parallel, we'll try. >>> >>> It's not possible to run block-stream and block-commit in parallel on >>> the same nodes. See iotest 030 for a few examples. >>> >>> So unless there's a bug it should be safe. >>> >>>>> So if there's a way to make 'base' disappear during the job (how?) >>>>> then we could protect it with block_job_add_bdrv(). >>>> >>>> I'm not sure this is correct. What is the reason for stream to own >>>> base? It's not really interested in it. >>> >>> stream does not need to write or modify base, but it does need to keep a >>> reference to it in order to now where to stop copying data. >>> >>> As I said earlier base is a member of StreamBlockJob, so it should not >>> disappear during the job. >>> >>> Berto >>> >> >> No doubt that a reference to the base node is needed as a limit for the >> COR work. The base is still in the game. Actually, we encounter an issue >> when BlockDriverState of the COR-filter comes into play instead of the >> base. It is inherited from a parallel job. Based on the case >> test_stream_parallel from the qemu-iotests/030, it works like this: >> 1. Job#1 is started and inserts a COR-filter above the top node. >> 2. Context switch. >> 3. Job#2 is started and inherits the COR-filter as the base. >> 4. Context switch. >> 5. Job#1 comes to the end and removes its COR-filter referenced by >> job#2. >> 6. Context switch. >> 7. Job#2 comes to the end and uses the deleted COR-filter node. >> 8. We are in trouble. >> >> Or another scenario: >> 1-4. As above >> 5. Job#2 comes to the end first and keeps the COR-filter as the >> backing node. >> 6. The assert() fails as the referenced COR-filter was not deleted >> on exit. >> 7. The game is over with a poor score. >> >> If we just keep the intermediate bottom node instead of the base, we >> will have a similar issue. On the job exit, we change the backing file. >> If we call the backing_bs(bottom_node_bs), we will keep the COR-filter >> node instead, whereas it has not been removed jet by the unfinished >> job#1 (see the second scenario above). > > Honestly, don't see any problem with it. > >> >> If we include the base node into the job node list with some protecting >> flags, that's block_job_add_bdrv(&s->common, "base", base, ..), we can >> end up with a failure of the bdrv_check_update_perm() called by the >> bdrv_root_attach_child() because the previous job#1 has the root >> reference to the same node with other permissions. So, the next job will >> fail to start and the test cases won't pass. > > You mean test 30 # testcase about parallel stream jobs, which starts streams, > sharing one node as base for one and top for another. And if we lock base in > stream job the test fails. I think, we just need to update the test, by inserting > additional empty qcow2 nodes below shared ones, and make new inserted nodes > to be top nodes of stream jobs, so there no more node-sharing between jobs. > >> >> It we set the following flags, there will be no failure: >> block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_CONSISTENT_READ >> | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD, &error_abort); >> But what margin will we gain? We will have the same issues as above. > > We meed flags ..., BLK_PERM_GRAPH_MOD, BLK_PERM_ALL & ~BLK_PERM_GRAPH_MOD, ... > for base node. > >> >> In many other cases, when the filter has not been identified, we can >> get a broken chain while calling to the backing_bs(bs=filter) as the >> backing node of the filter is the zero pointer. >> >> That's why I implemented the skip_filter() function in this series as >> a some sort of solution or workaround. May we proceed with that? >> Is there any better idea? >> > > I think now, that best way is to lock base node. As it is the simplest one. And it > corresponds to current code, which actually keeps illegal pointer to base node, > without any access rights. > > Well, I am about to implement the next series version with the base BlockDriverState included into the job node list and will insert extra images between those shared by the parallel jobs in the test case TestParallelOps of #030.
diff --git a/block/stream.c b/block/stream.c index 7a49ac0..18e51b3 100644 --- a/block/stream.c +++ b/block/stream.c @@ -16,6 +16,7 @@ #include "block/block_int.h" #include "block/blockjob_int.h" #include "qapi/error.h" +#include "qapi/qmp/qdict.h" #include "qapi/qmp/qerror.h" #include "qemu/ratelimit.h" #include "sysemu/block-backend.h" @@ -35,8 +36,26 @@ typedef struct StreamBlockJob { BlockdevOnError on_error; char *backing_file_str; bool bs_read_only; + BlockDriverState *cor_filter_bs; + BlockDriverState *target_bs; } StreamBlockJob; +static BlockDriverState *child_file_bs(BlockDriverState *bs) +{ + return bs->file ? bs->file->bs : NULL; +} + +static BlockDriverState *skip_filter(BlockDriverState *bs) +{ + BlockDriverState *ret_bs = bs; + + while (ret_bs && ret_bs->drv && ret_bs->drv->is_filter) { + ret_bs = child_file_bs(ret_bs); + } + + return ret_bs; +} + static int coroutine_fn stream_populate(BlockBackend *blk, int64_t offset, uint64_t bytes, void *buf) @@ -51,14 +70,12 @@ static int coroutine_fn stream_populate(BlockBackend *blk, qemu_iovec_init_external(&qiov, &iov, 1); /* Copy-on-read the unallocated clusters */ - return blk_co_preadv(blk, offset, qiov.size, &qiov, BDRV_REQ_COPY_ON_READ); + return blk_co_preadv(blk, offset, qiov.size, &qiov, 0); } -static int stream_prepare(Job *job) +static int stream_change_backing_file(StreamBlockJob *s) { - StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); - BlockJob *bjob = &s->common; - BlockDriverState *bs = blk_bs(bjob->blk); + BlockDriverState *bs = s->target_bs; BlockDriverState *base = s->base; Error *local_err = NULL; int ret = 0; @@ -82,11 +99,53 @@ static int stream_prepare(Job *job) return ret; } +static int stream_exit(Job *job, bool abort) +{ + StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); + BlockJob *bjob = &s->common; + BlockDriverState *target_bs = s->target_bs; + int ret = 0; + + /* Retain the BDS until we complete the graph change. */ + bdrv_ref(target_bs); + /* Hold a guest back from writing while permissions are being reset. */ + bdrv_drained_begin(target_bs); + /* Drop permissions before the graph change. */ + bdrv_child_try_set_perm(s->cor_filter_bs->file, 0, BLK_PERM_ALL, + &error_abort); + if (!abort) { + ret = stream_change_backing_file(s); + } + + bdrv_replace_node(s->cor_filter_bs, target_bs, &error_abort); + /* Switch the BB back to the filter so that job terminated properly. */ + blk_remove_bs(bjob->blk); + blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort); + blk_insert_bs(bjob->blk, s->cor_filter_bs, &error_abort); + + bdrv_drained_end(target_bs); + bdrv_unref(target_bs); + /* Submit control over filter to the job instance. */ + bdrv_unref(s->cor_filter_bs); + + return ret; +} + +static int stream_prepare(Job *job) +{ + return stream_exit(job, false); +} + +static void stream_abort(Job *job) +{ + stream_exit(job, job->ret < 0); +} + static void stream_clean(Job *job) { StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); BlockJob *bjob = &s->common; - BlockDriverState *bs = blk_bs(bjob->blk); + BlockDriverState *bs = s->target_bs; /* Reopen the image back in read-only mode if necessary */ if (s->bs_read_only) { @@ -102,7 +161,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp) { StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); BlockBackend *blk = s->common.blk; - BlockDriverState *bs = blk_bs(blk); + BlockDriverState *bs = s->target_bs; BlockDriverState *base = s->base; int64_t len; int64_t offset = 0; @@ -213,12 +272,64 @@ static const BlockJobDriver stream_job_driver = { .free = block_job_free, .run = stream_run, .prepare = stream_prepare, + .abort = stream_abort, .clean = stream_clean, .user_resume = block_job_user_resume, .drain = block_job_drain, }, }; +static BlockDriverState *create_filter_node(BlockDriverState *bs, + Error **errp) +{ + QDict *opts = qdict_new(); + + qdict_put_str(opts, "driver", "copy-on-read"); + qdict_put_str(opts, "file", bdrv_get_node_name(bs)); + + return bdrv_open(NULL, NULL, opts, BDRV_O_RDWR, errp); +} + +static void remove_filter(BlockDriverState *cor_filter_bs) +{ + BlockDriverState *bs = child_file_bs(cor_filter_bs); + + /* Hold a guest back from writing until we remove the filter */ + bdrv_drained_begin(bs); + bdrv_child_try_set_perm(cor_filter_bs->file, 0, BLK_PERM_ALL, + &error_abort); + bdrv_replace_node(cor_filter_bs, bs, &error_abort); + bdrv_drained_end(bs); + + bdrv_unref(cor_filter_bs); +} + +static BlockDriverState *insert_filter(BlockDriverState *bs, Error **errp) +{ + BlockDriverState *cor_filter_bs; + Error *local_err = NULL; + + cor_filter_bs = create_filter_node(bs, errp); + if (cor_filter_bs == NULL) { + error_prepend(errp, "Could not create filter node: "); + return NULL; + } + + bdrv_set_aio_context(cor_filter_bs, bdrv_get_aio_context(bs)); + + bdrv_drained_begin(bs); + bdrv_replace_node(bs, cor_filter_bs, &local_err); + bdrv_drained_end(bs); + + if (local_err) { + bdrv_unref(cor_filter_bs); + error_propagate(errp, local_err); + return NULL; + } + + return cor_filter_bs; +} + void stream_start(const char *job_id, BlockDriverState *bs, BlockDriverState *base, const char *backing_file_str, int creation_flags, int64_t speed, @@ -227,6 +338,14 @@ void stream_start(const char *job_id, BlockDriverState *bs, StreamBlockJob *s; BlockDriverState *iter; bool bs_read_only; + BlockDriverState *cor_filter_bs; + + /* + * The base node might be identified by its file name rather than + * by its node name. In that case, we can encounter a filter node + * instead which has other BS pointer and the same file name. + */ + base = skip_filter(base); /* Make sure that the image is opened in read-write mode */ bs_read_only = bdrv_is_read_only(bs); @@ -236,10 +355,15 @@ void stream_start(const char *job_id, BlockDriverState *bs, } } + cor_filter_bs = insert_filter(bs, errp); + if (cor_filter_bs == NULL) { + goto fail; + } + /* Prevent concurrent jobs trying to modify the graph structure here, we * already have our own plans. Also don't allow resize as the image size is * queried only at the job start and then cached. */ - s = block_job_create(job_id, &stream_job_driver, NULL, bs, + s = block_job_create(job_id, &stream_job_driver, NULL, cor_filter_bs, BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD, BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | @@ -249,16 +373,21 @@ void stream_start(const char *job_id, BlockDriverState *bs, goto fail; } - /* Block all intermediate nodes between bs and base, because they will + /* + * Block all intermediate nodes between bs and base, because they will * disappear from the chain after this operation. The streaming job reads * every block only once, assuming that it doesn't change, so block writes - * and resizes. */ - for (iter = backing_bs(bs); iter && iter != base; iter = backing_bs(iter)) { + * and resizes. We account a filter node may be a part of the graph. + */ + for (iter = skip_filter(backing_bs(bs)); iter && iter != base; + iter = skip_filter(backing_bs(iter))) { block_job_add_bdrv(&s->common, "intermediate node", iter, 0, BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, &error_abort); } + s->cor_filter_bs = cor_filter_bs; + s->target_bs = bs; s->base = base; s->backing_file_str = g_strdup(backing_file_str); s->bs_read_only = bs_read_only; @@ -269,6 +398,9 @@ void stream_start(const char *job_id, BlockDriverState *bs, return; fail: + if (cor_filter_bs) { + remove_filter(cor_filter_bs); + } if (bs_read_only) { bdrv_reopen_set_read_only(bs, true, NULL); }
The copy-on-read filter driver is applied to block-stream operations. The 'test_stream_parallel' in the file tests/qemu-iotests/030 runs jobs that use nodes for streaming in parallel through the backing chain. We've got filters being inserted to and removed from the backing chain while jobs are running. As a result, a filter node may be passed as the 'base' parameter to the stream_start() function when the base node name is not specified (the base node is identified by its file name which is the same to the related filter node). Another issue is that a function keeps the pointer to the filter BDS object that can be replaced and deleted after the co-routine switch. For example, the function backing_bs() returns the pointer to the backing BDS and the BDS reference counter is not incremented. A solution (or workaround) made with the given patch for block-stream job helps to pass all the iotests in the file tests/qemu-iotests/030. Any piece of advice how to amend the solution will be appreciated. I am looking forward to hearing from you. Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> --- block/stream.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 143 insertions(+), 11 deletions(-)