Message ID | 20190705093258.47856-1-sgarzare@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] block/rbd: implement .bdrv_get_allocated_file_size callback | expand |
On 05.07.19 11:32, Stefano Garzarella wrote: > This patch allows 'qemu-img info' to show the 'disk size' for > the RBD images that have the fast-diff feature enabled. > > If this feature is enabled, we use the rbd_diff_iterate2() API > to calculate the allocated size for the image. > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > --- > v3: > - return -ENOTSUP instead of -1 when fast-diff is not available > [John, Jason] > v2: > - calculate the actual usage only if the fast-diff feature is > enabled [Jason] > --- > block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 54 insertions(+) Well, the librbd documentation is non-existing as always, but while googling, I at least found that libvirt has exactly the same code. So I suppose it must be quite correct, then. > diff --git a/block/rbd.c b/block/rbd.c > index 59757b3120..b6bed683e5 100644 > --- a/block/rbd.c > +++ b/block/rbd.c > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > return info.size; > } > > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > + void *arg) > +{ > + int64_t *alloc_size = (int64_t *) arg; > + > + if (exists) { > + (*alloc_size) += len; > + } > + > + return 0; > +} > + > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > +{ > + BDRVRBDState *s = bs->opaque; > + uint64_t flags, features; > + int64_t alloc_size = 0; > + int r; > + > + r = rbd_get_flags(s->image, &flags); > + if (r < 0) { > + return r; > + } > + > + r = rbd_get_features(s->image, &features); > + if (r < 0) { > + return r; > + } > + > + /* > + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > + * very slow on a big image. > + */ > + if (!(features & RBD_FEATURE_FAST_DIFF) || > + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > + return -ENOTSUP; > + } > + > + /* > + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > + * the callback on all allocated regions of the image. > + */ > + r = rbd_diff_iterate2(s->image, NULL, 0, > + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > + &rbd_allocated_size_cb, &alloc_size); But I have a question. This is basically block_status, right? So it gives us information on which areas are allocated and which are not. The result thus gives us a lower bound on the allocation size, but is it really exactly the allocation size? There are two things I’m concerned about: 1. What about metadata? 2. If you have multiple snapshots, this will only report the overall allocation information, right? So say there is something like this: (“A” means an allocated MB, “-” is an unallocated MB) Snapshot 1: AAAA--- Snapshot 2: --AAAAA Snapshot 3: -AAAA-- I think the allocated data size is the number of As in total (13 MB). But I suppose this API will just return 7 MB, because it looks on everything an it sees the whole image range (7 MB) to be allocated. It doesn’t report in how many snapshots some region is allocated. Max > + if (r < 0) { > + return r; > + } > + > + return alloc_size; > +} > + > static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs, > int64_t offset, > PreallocMode prealloc, > @@ -1291,6 +1344,7 @@ static BlockDriver bdrv_rbd = { > .bdrv_get_info = qemu_rbd_getinfo, > .create_opts = &qemu_rbd_create_opts, > .bdrv_getlength = qemu_rbd_getlength, > + .bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size, > .bdrv_co_truncate = qemu_rbd_co_truncate, > .protocol_name = "rbd", > >
On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: > On 05.07.19 11:32, Stefano Garzarella wrote: > > This patch allows 'qemu-img info' to show the 'disk size' for > > the RBD images that have the fast-diff feature enabled. > > > > If this feature is enabled, we use the rbd_diff_iterate2() API > > to calculate the allocated size for the image. > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > > --- > > v3: > > - return -ENOTSUP instead of -1 when fast-diff is not available > > [John, Jason] > > v2: > > - calculate the actual usage only if the fast-diff feature is > > enabled [Jason] > > --- > > block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 54 insertions(+) > > Well, the librbd documentation is non-existing as always, but while > googling, I at least found that libvirt has exactly the same code. So I > suppose it must be quite correct, then. > While I wrote this code I took a look at libvirt implementation and also at the "rbd" tool in the ceph repository: compute_image_disk_usage() in src/tools/rbd/action/DiskUsage.cc > > diff --git a/block/rbd.c b/block/rbd.c > > index 59757b3120..b6bed683e5 100644 > > --- a/block/rbd.c > > +++ b/block/rbd.c > > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > > return info.size; > > } > > > > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > > + void *arg) > > +{ > > + int64_t *alloc_size = (int64_t *) arg; > > + > > + if (exists) { > > + (*alloc_size) += len; > > + } > > + > > + return 0; > > +} > > + > > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > > +{ > > + BDRVRBDState *s = bs->opaque; > > + uint64_t flags, features; > > + int64_t alloc_size = 0; > > + int r; > > + > > + r = rbd_get_flags(s->image, &flags); > > + if (r < 0) { > > + return r; > > + } > > + > > + r = rbd_get_features(s->image, &features); > > + if (r < 0) { > > + return r; > > + } > > + > > + /* > > + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > > + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > > + * very slow on a big image. > > + */ > > + if (!(features & RBD_FEATURE_FAST_DIFF) || > > + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > > + return -ENOTSUP; > > + } > > + > > + /* > > + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > > + * the callback on all allocated regions of the image. > > + */ > > + r = rbd_diff_iterate2(s->image, NULL, 0, > > + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > > + &rbd_allocated_size_cb, &alloc_size); > > But I have a question. This is basically block_status, right? So it > gives us information on which areas are allocated and which are not. > The result thus gives us a lower bound on the allocation size, but is it > really exactly the allocation size? > > There are two things I’m concerned about: > > 1. What about metadata? Good question, I don't think it includes the size used by metadata and I don't know if there is a way to know it. I'll check better. > > 2. If you have multiple snapshots, this will only report the overall > allocation information, right? So say there is something like this: > > (“A” means an allocated MB, “-” is an unallocated MB) > > Snapshot 1: AAAA--- > Snapshot 2: --AAAAA > Snapshot 3: -AAAA-- > > I think the allocated data size is the number of As in total (13 MB). > But I suppose this API will just return 7 MB, because it looks on > everything an it sees the whole image range (7 MB) to be allocated. It > doesn’t report in how many snapshots some region is allocated. Looking at the documentation of rbd_diff_iterate2() [1] they says: * If the source snapshot name is NULL, we * interpret that as the beginning of time and return all allocated * regions of the image. But I don't know the answer of your question (maybe Jason can help here). I should check better the implementation to understand if I can cycle on all snapshot to get the exact allocated data size. https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925 I'll back when I have more details on the rbd implementation to better answer your questions. Thanks, Stefano
On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: > > On 05.07.19 11:32, Stefano Garzarella wrote: > > > This patch allows 'qemu-img info' to show the 'disk size' for > > > the RBD images that have the fast-diff feature enabled. > > > > > > If this feature is enabled, we use the rbd_diff_iterate2() API > > > to calculate the allocated size for the image. > > > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > > > --- > > > v3: > > > - return -ENOTSUP instead of -1 when fast-diff is not available > > > [John, Jason] > > > v2: > > > - calculate the actual usage only if the fast-diff feature is > > > enabled [Jason] > > > --- > > > block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 54 insertions(+) > > > > Well, the librbd documentation is non-existing as always, but while > > googling, I at least found that libvirt has exactly the same code. So I > > suppose it must be quite correct, then. > > > > While I wrote this code I took a look at libvirt implementation and also > at the "rbd" tool in the ceph repository: compute_image_disk_usage() in > src/tools/rbd/action/DiskUsage.cc > > > > diff --git a/block/rbd.c b/block/rbd.c > > > index 59757b3120..b6bed683e5 100644 > > > --- a/block/rbd.c > > > +++ b/block/rbd.c > > > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > > > return info.size; > > > } > > > > > > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > > > + void *arg) > > > +{ > > > + int64_t *alloc_size = (int64_t *) arg; > > > + > > > + if (exists) { > > > + (*alloc_size) += len; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > > > +{ > > > + BDRVRBDState *s = bs->opaque; > > > + uint64_t flags, features; > > > + int64_t alloc_size = 0; > > > + int r; > > > + > > > + r = rbd_get_flags(s->image, &flags); > > > + if (r < 0) { > > > + return r; > > > + } > > > + > > > + r = rbd_get_features(s->image, &features); > > > + if (r < 0) { > > > + return r; > > > + } > > > + > > > + /* > > > + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > > > + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > > > + * very slow on a big image. > > > + */ > > > + if (!(features & RBD_FEATURE_FAST_DIFF) || > > > + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > > > + return -ENOTSUP; > > > + } > > > + > > > + /* > > > + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > > > + * the callback on all allocated regions of the image. > > > + */ > > > + r = rbd_diff_iterate2(s->image, NULL, 0, > > > + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > > > + &rbd_allocated_size_cb, &alloc_size); > > > > But I have a question. This is basically block_status, right? So it > > gives us information on which areas are allocated and which are not. > > The result thus gives us a lower bound on the allocation size, but is it > > really exactly the allocation size? > > > > There are two things I’m concerned about: > > > > 1. What about metadata? > > Good question, I don't think it includes the size used by metadata and I > don't know if there is a way to know it. I'll check better. It does not include the size of metadata, the "rbd_diff_iterate2" function is literally just looking for touched data blocks within the RBD image. > > > > 2. If you have multiple snapshots, this will only report the overall > > allocation information, right? So say there is something like this: > > > > (“A” means an allocated MB, “-” is an unallocated MB) > > > > Snapshot 1: AAAA--- > > Snapshot 2: --AAAAA > > Snapshot 3: -AAAA-- > > > > I think the allocated data size is the number of As in total (13 MB). > > But I suppose this API will just return 7 MB, because it looks on > > everything an it sees the whole image range (7 MB) to be allocated. It > > doesn’t report in how many snapshots some region is allocated. It should return 13 dirty data blocks (multipled by the size of the data block) since when you don't provide a "from snapshot" name, it will iterate from the first snapshot to the HEAD revision. > Looking at the documentation of rbd_diff_iterate2() [1] they says: > > * If the source snapshot name is NULL, we > * interpret that as the beginning of time and return all allocated > * regions of the image. > > But I don't know the answer of your question (maybe Jason can help > here). > I should check better the implementation to understand if I can cycle > on all snapshot to get the exact allocated data size. > > https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925 > > I'll back when I have more details on the rbd implementation to better > answer your questions. > > Thanks, > Stefano
On 09.07.19 05:08, Jason Dillaman wrote: > On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >> >> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: >>> On 05.07.19 11:32, Stefano Garzarella wrote: >>>> This patch allows 'qemu-img info' to show the 'disk size' for >>>> the RBD images that have the fast-diff feature enabled. >>>> >>>> If this feature is enabled, we use the rbd_diff_iterate2() API >>>> to calculate the allocated size for the image. >>>> >>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> >>>> --- >>>> v3: >>>> - return -ENOTSUP instead of -1 when fast-diff is not available >>>> [John, Jason] >>>> v2: >>>> - calculate the actual usage only if the fast-diff feature is >>>> enabled [Jason] >>>> --- >>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 54 insertions(+) >>> >>> Well, the librbd documentation is non-existing as always, but while >>> googling, I at least found that libvirt has exactly the same code. So I >>> suppose it must be quite correct, then. >>> >> >> While I wrote this code I took a look at libvirt implementation and also >> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in >> src/tools/rbd/action/DiskUsage.cc >> >>>> diff --git a/block/rbd.c b/block/rbd.c >>>> index 59757b3120..b6bed683e5 100644 >>>> --- a/block/rbd.c >>>> +++ b/block/rbd.c >>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) >>>> return info.size; >>>> } >>>> >>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, >>>> + void *arg) >>>> +{ >>>> + int64_t *alloc_size = (int64_t *) arg; >>>> + >>>> + if (exists) { >>>> + (*alloc_size) += len; >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) >>>> +{ >>>> + BDRVRBDState *s = bs->opaque; >>>> + uint64_t flags, features; >>>> + int64_t alloc_size = 0; >>>> + int r; >>>> + >>>> + r = rbd_get_flags(s->image, &flags); >>>> + if (r < 0) { >>>> + return r; >>>> + } >>>> + >>>> + r = rbd_get_features(s->image, &features); >>>> + if (r < 0) { >>>> + return r; >>>> + } >>>> + >>>> + /* >>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff >>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be >>>> + * very slow on a big image. >>>> + */ >>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || >>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { >>>> + return -ENOTSUP; >>>> + } >>>> + >>>> + /* >>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes >>>> + * the callback on all allocated regions of the image. >>>> + */ >>>> + r = rbd_diff_iterate2(s->image, NULL, 0, >>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, >>>> + &rbd_allocated_size_cb, &alloc_size); >>> >>> But I have a question. This is basically block_status, right? So it >>> gives us information on which areas are allocated and which are not. >>> The result thus gives us a lower bound on the allocation size, but is it >>> really exactly the allocation size? >>> >>> There are two things I’m concerned about: >>> >>> 1. What about metadata? >> >> Good question, I don't think it includes the size used by metadata and I >> don't know if there is a way to know it. I'll check better. > > It does not include the size of metadata, the "rbd_diff_iterate2" > function is literally just looking for touched data blocks within the > RBD image. > >>> >>> 2. If you have multiple snapshots, this will only report the overall >>> allocation information, right? So say there is something like this: >>> >>> (“A” means an allocated MB, “-” is an unallocated MB) >>> >>> Snapshot 1: AAAA--- >>> Snapshot 2: --AAAAA >>> Snapshot 3: -AAAA-- >>> >>> I think the allocated data size is the number of As in total (13 MB). >>> But I suppose this API will just return 7 MB, because it looks on >>> everything an it sees the whole image range (7 MB) to be allocated. It >>> doesn’t report in how many snapshots some region is allocated. > > It should return 13 dirty data blocks (multipled by the size of the > data block) since when you don't provide a "from snapshot" name, it > will iterate from the first snapshot to the HEAD revision. Have you tested that? I‘m so skeptical because the callback function interface has no way of distinguishing between different layers of snapshots. And also because we have the bdrv_block_status_above() function which just looks strikingly similar (with the difference that it does not invoke a callback but just returns the next allocated range). If you pass base=NULL to it, it will also “interpret that as the beginning of time and return all allocated regions of the image” (or rather image chain, in our case). But it would just return 7 MB as allocated. (Even though it does in fact return layer information, i.e. where a given continuous chunk of data is allocated.) Sure, there is no good reason for why our interface should by chance be the same as librbd’s interface. But without having tested it, the fact that the callback cannot detect which layer a chunk is allocated on just makes me wary. Max
On 09.07.19 10:55, Max Reitz wrote: > On 09.07.19 05:08, Jason Dillaman wrote: >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >>> >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: >>>> On 05.07.19 11:32, Stefano Garzarella wrote: >>>>> This patch allows 'qemu-img info' to show the 'disk size' for >>>>> the RBD images that have the fast-diff feature enabled. >>>>> >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API >>>>> to calculate the allocated size for the image. >>>>> >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> >>>>> --- >>>>> v3: >>>>> - return -ENOTSUP instead of -1 when fast-diff is not available >>>>> [John, Jason] >>>>> v2: >>>>> - calculate the actual usage only if the fast-diff feature is >>>>> enabled [Jason] >>>>> --- >>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 54 insertions(+) >>>> >>>> Well, the librbd documentation is non-existing as always, but while >>>> googling, I at least found that libvirt has exactly the same code. So I >>>> suppose it must be quite correct, then. >>>> >>> >>> While I wrote this code I took a look at libvirt implementation and also >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in >>> src/tools/rbd/action/DiskUsage.cc >>> >>>>> diff --git a/block/rbd.c b/block/rbd.c >>>>> index 59757b3120..b6bed683e5 100644 >>>>> --- a/block/rbd.c >>>>> +++ b/block/rbd.c >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) >>>>> return info.size; >>>>> } >>>>> >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, >>>>> + void *arg) >>>>> +{ >>>>> + int64_t *alloc_size = (int64_t *) arg; >>>>> + >>>>> + if (exists) { >>>>> + (*alloc_size) += len; >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) >>>>> +{ >>>>> + BDRVRBDState *s = bs->opaque; >>>>> + uint64_t flags, features; >>>>> + int64_t alloc_size = 0; >>>>> + int r; >>>>> + >>>>> + r = rbd_get_flags(s->image, &flags); >>>>> + if (r < 0) { >>>>> + return r; >>>>> + } >>>>> + >>>>> + r = rbd_get_features(s->image, &features); >>>>> + if (r < 0) { >>>>> + return r; >>>>> + } >>>>> + >>>>> + /* >>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff >>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be >>>>> + * very slow on a big image. >>>>> + */ >>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || >>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { >>>>> + return -ENOTSUP; >>>>> + } >>>>> + >>>>> + /* >>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes >>>>> + * the callback on all allocated regions of the image. >>>>> + */ >>>>> + r = rbd_diff_iterate2(s->image, NULL, 0, >>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, >>>>> + &rbd_allocated_size_cb, &alloc_size); >>>> >>>> But I have a question. This is basically block_status, right? So it >>>> gives us information on which areas are allocated and which are not. >>>> The result thus gives us a lower bound on the allocation size, but is it >>>> really exactly the allocation size? >>>> >>>> There are two things I’m concerned about: >>>> >>>> 1. What about metadata? >>> >>> Good question, I don't think it includes the size used by metadata and I >>> don't know if there is a way to know it. I'll check better. >> >> It does not include the size of metadata, the "rbd_diff_iterate2" >> function is literally just looking for touched data blocks within the >> RBD image. >> >>>> >>>> 2. If you have multiple snapshots, this will only report the overall >>>> allocation information, right? So say there is something like this: >>>> >>>> (“A” means an allocated MB, “-” is an unallocated MB) >>>> >>>> Snapshot 1: AAAA--- >>>> Snapshot 2: --AAAAA >>>> Snapshot 3: -AAAA-- >>>> >>>> I think the allocated data size is the number of As in total (13 MB). >>>> But I suppose this API will just return 7 MB, because it looks on >>>> everything an it sees the whole image range (7 MB) to be allocated. It >>>> doesn’t report in how many snapshots some region is allocated. >> >> It should return 13 dirty data blocks (multipled by the size of the >> data block) since when you don't provide a "from snapshot" name, it >> will iterate from the first snapshot to the HEAD revision. > > Have you tested that? > > I‘m so skeptical because the callback function interface has no way of > distinguishing between different layers of snapshots. > > And also because we have the bdrv_block_status_above() function which > just looks strikingly similar (with the difference that it does not > invoke a callback but just returns the next allocated range). If you > pass base=NULL to it, it will also “interpret that as the beginning of > time and return all allocated regions of the image” (or rather image > chain, in our case). But it would just return 7 MB as allocated. (Even > though it does in fact return layer information, i.e. where a given > continuous chunk of data is allocated.) > > Sure, there is no good reason for why our interface should by chance be > the same as librbd’s interface. But without having tested it, the fact > that the callback cannot detect which layer a chunk is allocated on just > makes me wary. And the librbd code doesn’t alleviate my concerns. From what I can see, it first creates a bitmap (two bits per entry) that covers the whole image and says which objects are allocated and which aren‘t. Through the whole chain, that is. So in the above example, the bitmap would report everything as allocated. (Or rather “updated“ in librbd‘s terms.) Then it has this piece: > uint64_t off = m_offset; > uint64_t left = m_length; > > while (left > 0) { > uint64_t period_off = off - (off % period); > uint64_t read_len = min(period_off + period - off, left); > > // map to extents > map<object_t,vector<ObjectExtent> > object_extents; > Striper::file_to_extents(cct, m_image_ctx.format_string, > &m_image_ctx.layout, off, read_len, 0, > object_extents, 0); > > // get snap info for each object > for (map<object_t,vector<ObjectExtent> >::iterator p = > object_extents.begin(); > p != object_extents.end(); ++p) [...] > for (std::vector<ObjectExtent>::iterator q = p->second.begin(); > q != p->second.end(); ++q) { > r = m_callback(off + q->offset, q->length, updated, m_callback_arg); [...] > } [...] > } >> left -= read_len; > off += read_len; > } So that iterates over the whole image (in our case, because m_offset is 0 and m_length is the image length), then picks out a chunk of read_len length, converts that to objects, and then iterates over all of those objects and all of their extents. file_to_extents looks like it’s just an arithmetic operation. So it doesn‘t look like that function returns extents in multiple snapshots. It just converts a range into subranges called “objects” and “extents” (at least that’s the way it looks to me). So overall, this looks awfully like it simply iterates over the whole image and then returns allocation information gathered as a top-down view through all of the snapshots, but not for each snapshot individually. Max
On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote: > > On 09.07.19 10:55, Max Reitz wrote: > > On 09.07.19 05:08, Jason Dillaman wrote: > >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > >>> > >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: > >>>> On 05.07.19 11:32, Stefano Garzarella wrote: > >>>>> This patch allows 'qemu-img info' to show the 'disk size' for > >>>>> the RBD images that have the fast-diff feature enabled. > >>>>> > >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API > >>>>> to calculate the allocated size for the image. > >>>>> > >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > >>>>> --- > >>>>> v3: > >>>>> - return -ENOTSUP instead of -1 when fast-diff is not available > >>>>> [John, Jason] > >>>>> v2: > >>>>> - calculate the actual usage only if the fast-diff feature is > >>>>> enabled [Jason] > >>>>> --- > >>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>> 1 file changed, 54 insertions(+) > >>>> > >>>> Well, the librbd documentation is non-existing as always, but while > >>>> googling, I at least found that libvirt has exactly the same code. So I > >>>> suppose it must be quite correct, then. > >>>> > >>> > >>> While I wrote this code I took a look at libvirt implementation and also > >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in > >>> src/tools/rbd/action/DiskUsage.cc > >>> > >>>>> diff --git a/block/rbd.c b/block/rbd.c > >>>>> index 59757b3120..b6bed683e5 100644 > >>>>> --- a/block/rbd.c > >>>>> +++ b/block/rbd.c > >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > >>>>> return info.size; > >>>>> } > >>>>> > >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > >>>>> + void *arg) > >>>>> +{ > >>>>> + int64_t *alloc_size = (int64_t *) arg; > >>>>> + > >>>>> + if (exists) { > >>>>> + (*alloc_size) += len; > >>>>> + } > >>>>> + > >>>>> + return 0; > >>>>> +} > >>>>> + > >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > >>>>> +{ > >>>>> + BDRVRBDState *s = bs->opaque; > >>>>> + uint64_t flags, features; > >>>>> + int64_t alloc_size = 0; > >>>>> + int r; > >>>>> + > >>>>> + r = rbd_get_flags(s->image, &flags); > >>>>> + if (r < 0) { > >>>>> + return r; > >>>>> + } > >>>>> + > >>>>> + r = rbd_get_features(s->image, &features); > >>>>> + if (r < 0) { > >>>>> + return r; > >>>>> + } > >>>>> + > >>>>> + /* > >>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > >>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > >>>>> + * very slow on a big image. > >>>>> + */ > >>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || > >>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > >>>>> + return -ENOTSUP; > >>>>> + } > >>>>> + > >>>>> + /* > >>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > >>>>> + * the callback on all allocated regions of the image. > >>>>> + */ > >>>>> + r = rbd_diff_iterate2(s->image, NULL, 0, > >>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > >>>>> + &rbd_allocated_size_cb, &alloc_size); > >>>> > >>>> But I have a question. This is basically block_status, right? So it > >>>> gives us information on which areas are allocated and which are not. > >>>> The result thus gives us a lower bound on the allocation size, but is it > >>>> really exactly the allocation size? > >>>> > >>>> There are two things I’m concerned about: > >>>> > >>>> 1. What about metadata? > >>> > >>> Good question, I don't think it includes the size used by metadata and I > >>> don't know if there is a way to know it. I'll check better. > >> > >> It does not include the size of metadata, the "rbd_diff_iterate2" > >> function is literally just looking for touched data blocks within the > >> RBD image. > >> > >>>> > >>>> 2. If you have multiple snapshots, this will only report the overall > >>>> allocation information, right? So say there is something like this: > >>>> > >>>> (“A” means an allocated MB, “-” is an unallocated MB) > >>>> > >>>> Snapshot 1: AAAA--- > >>>> Snapshot 2: --AAAAA > >>>> Snapshot 3: -AAAA-- > >>>> > >>>> I think the allocated data size is the number of As in total (13 MB). > >>>> But I suppose this API will just return 7 MB, because it looks on > >>>> everything an it sees the whole image range (7 MB) to be allocated. It > >>>> doesn’t report in how many snapshots some region is allocated. > >> > >> It should return 13 dirty data blocks (multipled by the size of the > >> data block) since when you don't provide a "from snapshot" name, it > >> will iterate from the first snapshot to the HEAD revision. > > > > Have you tested that? > > > > I‘m so skeptical because the callback function interface has no way of > > distinguishing between different layers of snapshots. > > > > And also because we have the bdrv_block_status_above() function which > > just looks strikingly similar (with the difference that it does not > > invoke a callback but just returns the next allocated range). If you > > pass base=NULL to it, it will also “interpret that as the beginning of > > time and return all allocated regions of the image” (or rather image > > chain, in our case). But it would just return 7 MB as allocated. (Even > > though it does in fact return layer information, i.e. where a given > > continuous chunk of data is allocated.) > > > > Sure, there is no good reason for why our interface should by chance be > > the same as librbd’s interface. But without having tested it, the fact > > that the callback cannot detect which layer a chunk is allocated on just > > makes me wary. > > And the librbd code doesn’t alleviate my concerns. > > From what I can see, it first creates a bitmap (two bits per entry) that > covers the whole image and says which objects are allocated and which > aren‘t. Through the whole chain, that is. So in the above example, the > bitmap would report everything as allocated. (Or rather “updated“ in > librbd‘s terms.) > > Then it has this piece: > > > uint64_t off = m_offset; > > uint64_t left = m_length; > > > > while (left > 0) { > > uint64_t period_off = off - (off % period); > > uint64_t read_len = min(period_off + period - off, left); > > > > // map to extents > > map<object_t,vector<ObjectExtent> > object_extents; > > Striper::file_to_extents(cct, m_image_ctx.format_string, > > &m_image_ctx.layout, off, read_len, 0, > > object_extents, 0); > > > > // get snap info for each object > > for (map<object_t,vector<ObjectExtent> >::iterator p = > > object_extents.begin(); > > p != object_extents.end(); ++p) > [...] > > for (std::vector<ObjectExtent>::iterator q = p->second.begin(); > > q != p->second.end(); ++q) { > > r = m_callback(off + q->offset, q->length, updated, m_callback_arg); > [...] > > } > [...] > > } > >> left -= read_len; > > off += read_len; > > } > > So that iterates over the whole image (in our case, because m_offset is > 0 and m_length is the image length), then picks out a chunk of read_len > length, converts that to objects, and then iterates over all of those > objects and all of their extents. > > file_to_extents looks like it’s just an arithmetic operation. So it > doesn‘t look like that function returns extents in multiple snapshots. > It just converts a range into subranges called “objects” and “extents” > (at least that’s the way it looks to me). > > So overall, this looks awfully like it simply iterates over the whole > image and then returns allocation information gathered as a top-down > view through all of the snapshots, but not for each snapshot individually. Sorry, you're correct. The API function was originally designed to support delta extents for supporting diff exports, so while it does open each snapshot's object-map, it only returns a unioned set of delta extents. The rbd CLI's "disk-usage" action behaves how I described by counting the actual dirty block usage between snapshots. > Max >
On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote: > On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote: > > > > On 09.07.19 10:55, Max Reitz wrote: > > > On 09.07.19 05:08, Jason Dillaman wrote: > > >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > > >>> > > >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: > > >>>> On 05.07.19 11:32, Stefano Garzarella wrote: > > >>>>> This patch allows 'qemu-img info' to show the 'disk size' for > > >>>>> the RBD images that have the fast-diff feature enabled. > > >>>>> > > >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API > > >>>>> to calculate the allocated size for the image. > > >>>>> > > >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > > >>>>> --- > > >>>>> v3: > > >>>>> - return -ENOTSUP instead of -1 when fast-diff is not available > > >>>>> [John, Jason] > > >>>>> v2: > > >>>>> - calculate the actual usage only if the fast-diff feature is > > >>>>> enabled [Jason] > > >>>>> --- > > >>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > > >>>>> 1 file changed, 54 insertions(+) > > >>>> > > >>>> Well, the librbd documentation is non-existing as always, but while > > >>>> googling, I at least found that libvirt has exactly the same code. So I > > >>>> suppose it must be quite correct, then. > > >>>> > > >>> > > >>> While I wrote this code I took a look at libvirt implementation and also > > >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in > > >>> src/tools/rbd/action/DiskUsage.cc > > >>> > > >>>>> diff --git a/block/rbd.c b/block/rbd.c > > >>>>> index 59757b3120..b6bed683e5 100644 > > >>>>> --- a/block/rbd.c > > >>>>> +++ b/block/rbd.c > > >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > > >>>>> return info.size; > > >>>>> } > > >>>>> > > >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > > >>>>> + void *arg) > > >>>>> +{ > > >>>>> + int64_t *alloc_size = (int64_t *) arg; > > >>>>> + > > >>>>> + if (exists) { > > >>>>> + (*alloc_size) += len; > > >>>>> + } > > >>>>> + > > >>>>> + return 0; > > >>>>> +} > > >>>>> + > > >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > > >>>>> +{ > > >>>>> + BDRVRBDState *s = bs->opaque; > > >>>>> + uint64_t flags, features; > > >>>>> + int64_t alloc_size = 0; > > >>>>> + int r; > > >>>>> + > > >>>>> + r = rbd_get_flags(s->image, &flags); > > >>>>> + if (r < 0) { > > >>>>> + return r; > > >>>>> + } > > >>>>> + > > >>>>> + r = rbd_get_features(s->image, &features); > > >>>>> + if (r < 0) { > > >>>>> + return r; > > >>>>> + } > > >>>>> + > > >>>>> + /* > > >>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > > >>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > > >>>>> + * very slow on a big image. > > >>>>> + */ > > >>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || > > >>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > > >>>>> + return -ENOTSUP; > > >>>>> + } > > >>>>> + > > >>>>> + /* > > >>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > > >>>>> + * the callback on all allocated regions of the image. > > >>>>> + */ > > >>>>> + r = rbd_diff_iterate2(s->image, NULL, 0, > > >>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > > >>>>> + &rbd_allocated_size_cb, &alloc_size); > > >>>> > > >>>> But I have a question. This is basically block_status, right? So it > > >>>> gives us information on which areas are allocated and which are not. > > >>>> The result thus gives us a lower bound on the allocation size, but is it > > >>>> really exactly the allocation size? > > >>>> > > >>>> There are two things I’m concerned about: > > >>>> > > >>>> 1. What about metadata? > > >>> > > >>> Good question, I don't think it includes the size used by metadata and I > > >>> don't know if there is a way to know it. I'll check better. > > >> > > >> It does not include the size of metadata, the "rbd_diff_iterate2" > > >> function is literally just looking for touched data blocks within the > > >> RBD image. > > >> > > >>>> > > >>>> 2. If you have multiple snapshots, this will only report the overall > > >>>> allocation information, right? So say there is something like this: > > >>>> > > >>>> (“A” means an allocated MB, “-” is an unallocated MB) > > >>>> > > >>>> Snapshot 1: AAAA--- > > >>>> Snapshot 2: --AAAAA > > >>>> Snapshot 3: -AAAA-- > > >>>> > > >>>> I think the allocated data size is the number of As in total (13 MB). > > >>>> But I suppose this API will just return 7 MB, because it looks on > > >>>> everything an it sees the whole image range (7 MB) to be allocated. It > > >>>> doesn’t report in how many snapshots some region is allocated. > > >> > > >> It should return 13 dirty data blocks (multipled by the size of the > > >> data block) since when you don't provide a "from snapshot" name, it > > >> will iterate from the first snapshot to the HEAD revision. > > > > > > Have you tested that? > > > > > > I‘m so skeptical because the callback function interface has no way of > > > distinguishing between different layers of snapshots. > > > > > > And also because we have the bdrv_block_status_above() function which > > > just looks strikingly similar (with the difference that it does not > > > invoke a callback but just returns the next allocated range). If you > > > pass base=NULL to it, it will also “interpret that as the beginning of > > > time and return all allocated regions of the image” (or rather image > > > chain, in our case). But it would just return 7 MB as allocated. (Even > > > though it does in fact return layer information, i.e. where a given > > > continuous chunk of data is allocated.) > > > > > > Sure, there is no good reason for why our interface should by chance be > > > the same as librbd’s interface. But without having tested it, the fact > > > that the callback cannot detect which layer a chunk is allocated on just > > > makes me wary. > > > > And the librbd code doesn’t alleviate my concerns. > > > > From what I can see, it first creates a bitmap (two bits per entry) that > > covers the whole image and says which objects are allocated and which > > aren‘t. Through the whole chain, that is. So in the above example, the > > bitmap would report everything as allocated. (Or rather “updated“ in > > librbd‘s terms.) > > > > Then it has this piece: > > > > > uint64_t off = m_offset; > > > uint64_t left = m_length; > > > > > > while (left > 0) { > > > uint64_t period_off = off - (off % period); > > > uint64_t read_len = min(period_off + period - off, left); > > > > > > // map to extents > > > map<object_t,vector<ObjectExtent> > object_extents; > > > Striper::file_to_extents(cct, m_image_ctx.format_string, > > > &m_image_ctx.layout, off, read_len, 0, > > > object_extents, 0); > > > > > > // get snap info for each object > > > for (map<object_t,vector<ObjectExtent> >::iterator p = > > > object_extents.begin(); > > > p != object_extents.end(); ++p) > > [...] > > > for (std::vector<ObjectExtent>::iterator q = p->second.begin(); > > > q != p->second.end(); ++q) { > > > r = m_callback(off + q->offset, q->length, updated, m_callback_arg); > > [...] > > > } > > [...] > > > } > > >> left -= read_len; > > > off += read_len; > > > } > > > > So that iterates over the whole image (in our case, because m_offset is > > 0 and m_length is the image length), then picks out a chunk of read_len > > length, converts that to objects, and then iterates over all of those > > objects and all of their extents. > > > > file_to_extents looks like it’s just an arithmetic operation. So it > > doesn‘t look like that function returns extents in multiple snapshots. > > It just converts a range into subranges called “objects” and “extents” > > (at least that’s the way it looks to me). > > > > So overall, this looks awfully like it simply iterates over the whole > > image and then returns allocation information gathered as a top-down > > view through all of the snapshots, but not for each snapshot individually. > > Sorry, you're correct. The API function was originally designed to > support delta extents for supporting diff exports, so while it does > open each snapshot's object-map, it only returns a unioned set of > delta extents. The rbd CLI's "disk-usage" action behaves how I > described by counting the actual dirty block usage between snapshots. Max, Jason, thanks for the details! If you agree, I'll try to implement something similar, iterating on all snapshots. What about metadata? I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata. An idea could be to iterate over them and sum the keys-values size returned, but I'm not sure they are returned in the same format as they are stored. Thanks, Stefano
On 09.07.19 15:09, Stefano Garzarella wrote: > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote: >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote: >>> >>> On 09.07.19 10:55, Max Reitz wrote: >>>> On 09.07.19 05:08, Jason Dillaman wrote: >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: >>>>>> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote: >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for >>>>>>>> the RBD images that have the fast-diff feature enabled. >>>>>>>> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API >>>>>>>> to calculate the allocated size for the image. >>>>>>>> >>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> >>>>>>>> --- >>>>>>>> v3: >>>>>>>> - return -ENOTSUP instead of -1 when fast-diff is not available >>>>>>>> [John, Jason] >>>>>>>> v2: >>>>>>>> - calculate the actual usage only if the fast-diff feature is >>>>>>>> enabled [Jason] >>>>>>>> --- >>>>>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>>>> 1 file changed, 54 insertions(+) >>>>>>> >>>>>>> Well, the librbd documentation is non-existing as always, but while >>>>>>> googling, I at least found that libvirt has exactly the same code. So I >>>>>>> suppose it must be quite correct, then. >>>>>>> >>>>>> >>>>>> While I wrote this code I took a look at libvirt implementation and also >>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in >>>>>> src/tools/rbd/action/DiskUsage.cc >>>>>> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c >>>>>>>> index 59757b3120..b6bed683e5 100644 >>>>>>>> --- a/block/rbd.c >>>>>>>> +++ b/block/rbd.c >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) >>>>>>>> return info.size; >>>>>>>> } >>>>>>>> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, >>>>>>>> + void *arg) >>>>>>>> +{ >>>>>>>> + int64_t *alloc_size = (int64_t *) arg; >>>>>>>> + >>>>>>>> + if (exists) { >>>>>>>> + (*alloc_size) += len; >>>>>>>> + } >>>>>>>> + >>>>>>>> + return 0; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) >>>>>>>> +{ >>>>>>>> + BDRVRBDState *s = bs->opaque; >>>>>>>> + uint64_t flags, features; >>>>>>>> + int64_t alloc_size = 0; >>>>>>>> + int r; >>>>>>>> + >>>>>>>> + r = rbd_get_flags(s->image, &flags); >>>>>>>> + if (r < 0) { >>>>>>>> + return r; >>>>>>>> + } >>>>>>>> + >>>>>>>> + r = rbd_get_features(s->image, &features); >>>>>>>> + if (r < 0) { >>>>>>>> + return r; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff >>>>>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be >>>>>>>> + * very slow on a big image. >>>>>>>> + */ >>>>>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || >>>>>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { >>>>>>>> + return -ENOTSUP; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes >>>>>>>> + * the callback on all allocated regions of the image. >>>>>>>> + */ >>>>>>>> + r = rbd_diff_iterate2(s->image, NULL, 0, >>>>>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, >>>>>>>> + &rbd_allocated_size_cb, &alloc_size); >>>>>>> >>>>>>> But I have a question. This is basically block_status, right? So it >>>>>>> gives us information on which areas are allocated and which are not. >>>>>>> The result thus gives us a lower bound on the allocation size, but is it >>>>>>> really exactly the allocation size? >>>>>>> >>>>>>> There are two things I’m concerned about: >>>>>>> >>>>>>> 1. What about metadata? >>>>>> >>>>>> Good question, I don't think it includes the size used by metadata and I >>>>>> don't know if there is a way to know it. I'll check better. >>>>> >>>>> It does not include the size of metadata, the "rbd_diff_iterate2" >>>>> function is literally just looking for touched data blocks within the >>>>> RBD image. >>>>> >>>>>>> >>>>>>> 2. If you have multiple snapshots, this will only report the overall >>>>>>> allocation information, right? So say there is something like this: >>>>>>> >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB) >>>>>>> >>>>>>> Snapshot 1: AAAA--- >>>>>>> Snapshot 2: --AAAAA >>>>>>> Snapshot 3: -AAAA-- >>>>>>> >>>>>>> I think the allocated data size is the number of As in total (13 MB). >>>>>>> But I suppose this API will just return 7 MB, because it looks on >>>>>>> everything an it sees the whole image range (7 MB) to be allocated. It >>>>>>> doesn’t report in how many snapshots some region is allocated. >>>>> >>>>> It should return 13 dirty data blocks (multipled by the size of the >>>>> data block) since when you don't provide a "from snapshot" name, it >>>>> will iterate from the first snapshot to the HEAD revision. >>>> >>>> Have you tested that? >>>> >>>> I‘m so skeptical because the callback function interface has no way of >>>> distinguishing between different layers of snapshots. >>>> >>>> And also because we have the bdrv_block_status_above() function which >>>> just looks strikingly similar (with the difference that it does not >>>> invoke a callback but just returns the next allocated range). If you >>>> pass base=NULL to it, it will also “interpret that as the beginning of >>>> time and return all allocated regions of the image” (or rather image >>>> chain, in our case). But it would just return 7 MB as allocated. (Even >>>> though it does in fact return layer information, i.e. where a given >>>> continuous chunk of data is allocated.) >>>> >>>> Sure, there is no good reason for why our interface should by chance be >>>> the same as librbd’s interface. But without having tested it, the fact >>>> that the callback cannot detect which layer a chunk is allocated on just >>>> makes me wary. >>> >>> And the librbd code doesn’t alleviate my concerns. >>> >>> From what I can see, it first creates a bitmap (two bits per entry) that >>> covers the whole image and says which objects are allocated and which >>> aren‘t. Through the whole chain, that is. So in the above example, the >>> bitmap would report everything as allocated. (Or rather “updated“ in >>> librbd‘s terms.) >>> >>> Then it has this piece: >>> >>>> uint64_t off = m_offset; >>>> uint64_t left = m_length; >>>> >>>> while (left > 0) { >>>> uint64_t period_off = off - (off % period); >>>> uint64_t read_len = min(period_off + period - off, left); >>>> >>>> // map to extents >>>> map<object_t,vector<ObjectExtent> > object_extents; >>>> Striper::file_to_extents(cct, m_image_ctx.format_string, >>>> &m_image_ctx.layout, off, read_len, 0, >>>> object_extents, 0); >>>> >>>> // get snap info for each object >>>> for (map<object_t,vector<ObjectExtent> >::iterator p = >>>> object_extents.begin(); >>>> p != object_extents.end(); ++p) >>> [...] >>>> for (std::vector<ObjectExtent>::iterator q = p->second.begin(); >>>> q != p->second.end(); ++q) { >>>> r = m_callback(off + q->offset, q->length, updated, m_callback_arg); >>> [...] >>>> } >>> [...] >>>> } >>>>> left -= read_len; >>>> off += read_len; >>>> } >>> >>> So that iterates over the whole image (in our case, because m_offset is >>> 0 and m_length is the image length), then picks out a chunk of read_len >>> length, converts that to objects, and then iterates over all of those >>> objects and all of their extents. >>> >>> file_to_extents looks like it’s just an arithmetic operation. So it >>> doesn‘t look like that function returns extents in multiple snapshots. >>> It just converts a range into subranges called “objects” and “extents” >>> (at least that’s the way it looks to me). >>> >>> So overall, this looks awfully like it simply iterates over the whole >>> image and then returns allocation information gathered as a top-down >>> view through all of the snapshots, but not for each snapshot individually. >> >> Sorry, you're correct. The API function was originally designed to >> support delta extents for supporting diff exports, so while it does >> open each snapshot's object-map, it only returns a unioned set of >> delta extents. The rbd CLI's "disk-usage" action behaves how I >> described by counting the actual dirty block usage between snapshots. > > Max, Jason, thanks for the details! > > If you agree, I'll try to implement something similar, iterating on all > snapshots. Yes, that should work. > What about metadata? > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata. > An idea could be to iterate over them and sum the keys-values size returned, > but I'm not sure they are returned in the same format as they are stored. I wouldn’t mind ignoring metadata too much. If you can get something to work, even better, but I don’t consider that too important. Max
On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <mreitz@redhat.com> wrote: > > On 09.07.19 15:09, Stefano Garzarella wrote: > > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote: > >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote: > >>> > >>> On 09.07.19 10:55, Max Reitz wrote: > >>>> On 09.07.19 05:08, Jason Dillaman wrote: > >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote: > >>>>>> > >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote: > >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote: > >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for > >>>>>>>> the RBD images that have the fast-diff feature enabled. > >>>>>>>> > >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API > >>>>>>>> to calculate the allocated size for the image. > >>>>>>>> > >>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> > >>>>>>>> --- > >>>>>>>> v3: > >>>>>>>> - return -ENOTSUP instead of -1 when fast-diff is not available > >>>>>>>> [John, Jason] > >>>>>>>> v2: > >>>>>>>> - calculate the actual usage only if the fast-diff feature is > >>>>>>>> enabled [Jason] > >>>>>>>> --- > >>>>>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>>>> 1 file changed, 54 insertions(+) > >>>>>>> > >>>>>>> Well, the librbd documentation is non-existing as always, but while > >>>>>>> googling, I at least found that libvirt has exactly the same code. So I > >>>>>>> suppose it must be quite correct, then. > >>>>>>> > >>>>>> > >>>>>> While I wrote this code I took a look at libvirt implementation and also > >>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in > >>>>>> src/tools/rbd/action/DiskUsage.cc > >>>>>> > >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c > >>>>>>>> index 59757b3120..b6bed683e5 100644 > >>>>>>>> --- a/block/rbd.c > >>>>>>>> +++ b/block/rbd.c > >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) > >>>>>>>> return info.size; > >>>>>>>> } > >>>>>>>> > >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, > >>>>>>>> + void *arg) > >>>>>>>> +{ > >>>>>>>> + int64_t *alloc_size = (int64_t *) arg; > >>>>>>>> + > >>>>>>>> + if (exists) { > >>>>>>>> + (*alloc_size) += len; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + return 0; > >>>>>>>> +} > >>>>>>>> + > >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) > >>>>>>>> +{ > >>>>>>>> + BDRVRBDState *s = bs->opaque; > >>>>>>>> + uint64_t flags, features; > >>>>>>>> + int64_t alloc_size = 0; > >>>>>>>> + int r; > >>>>>>>> + > >>>>>>>> + r = rbd_get_flags(s->image, &flags); > >>>>>>>> + if (r < 0) { > >>>>>>>> + return r; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + r = rbd_get_features(s->image, &features); > >>>>>>>> + if (r < 0) { > >>>>>>>> + return r; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + /* > >>>>>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff > >>>>>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be > >>>>>>>> + * very slow on a big image. > >>>>>>>> + */ > >>>>>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) || > >>>>>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { > >>>>>>>> + return -ENOTSUP; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + /* > >>>>>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes > >>>>>>>> + * the callback on all allocated regions of the image. > >>>>>>>> + */ > >>>>>>>> + r = rbd_diff_iterate2(s->image, NULL, 0, > >>>>>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, > >>>>>>>> + &rbd_allocated_size_cb, &alloc_size); > >>>>>>> > >>>>>>> But I have a question. This is basically block_status, right? So it > >>>>>>> gives us information on which areas are allocated and which are not. > >>>>>>> The result thus gives us a lower bound on the allocation size, but is it > >>>>>>> really exactly the allocation size? > >>>>>>> > >>>>>>> There are two things I’m concerned about: > >>>>>>> > >>>>>>> 1. What about metadata? > >>>>>> > >>>>>> Good question, I don't think it includes the size used by metadata and I > >>>>>> don't know if there is a way to know it. I'll check better. > >>>>> > >>>>> It does not include the size of metadata, the "rbd_diff_iterate2" > >>>>> function is literally just looking for touched data blocks within the > >>>>> RBD image. > >>>>> > >>>>>>> > >>>>>>> 2. If you have multiple snapshots, this will only report the overall > >>>>>>> allocation information, right? So say there is something like this: > >>>>>>> > >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB) > >>>>>>> > >>>>>>> Snapshot 1: AAAA--- > >>>>>>> Snapshot 2: --AAAAA > >>>>>>> Snapshot 3: -AAAA-- > >>>>>>> > >>>>>>> I think the allocated data size is the number of As in total (13 MB). > >>>>>>> But I suppose this API will just return 7 MB, because it looks on > >>>>>>> everything an it sees the whole image range (7 MB) to be allocated. It > >>>>>>> doesn’t report in how many snapshots some region is allocated. > >>>>> > >>>>> It should return 13 dirty data blocks (multipled by the size of the > >>>>> data block) since when you don't provide a "from snapshot" name, it > >>>>> will iterate from the first snapshot to the HEAD revision. > >>>> > >>>> Have you tested that? > >>>> > >>>> I‘m so skeptical because the callback function interface has no way of > >>>> distinguishing between different layers of snapshots. > >>>> > >>>> And also because we have the bdrv_block_status_above() function which > >>>> just looks strikingly similar (with the difference that it does not > >>>> invoke a callback but just returns the next allocated range). If you > >>>> pass base=NULL to it, it will also “interpret that as the beginning of > >>>> time and return all allocated regions of the image” (or rather image > >>>> chain, in our case). But it would just return 7 MB as allocated. (Even > >>>> though it does in fact return layer information, i.e. where a given > >>>> continuous chunk of data is allocated.) > >>>> > >>>> Sure, there is no good reason for why our interface should by chance be > >>>> the same as librbd’s interface. But without having tested it, the fact > >>>> that the callback cannot detect which layer a chunk is allocated on just > >>>> makes me wary. > >>> > >>> And the librbd code doesn’t alleviate my concerns. > >>> > >>> From what I can see, it first creates a bitmap (two bits per entry) that > >>> covers the whole image and says which objects are allocated and which > >>> aren‘t. Through the whole chain, that is. So in the above example, the > >>> bitmap would report everything as allocated. (Or rather “updated“ in > >>> librbd‘s terms.) > >>> > >>> Then it has this piece: > >>> > >>>> uint64_t off = m_offset; > >>>> uint64_t left = m_length; > >>>> > >>>> while (left > 0) { > >>>> uint64_t period_off = off - (off % period); > >>>> uint64_t read_len = min(period_off + period - off, left); > >>>> > >>>> // map to extents > >>>> map<object_t,vector<ObjectExtent> > object_extents; > >>>> Striper::file_to_extents(cct, m_image_ctx.format_string, > >>>> &m_image_ctx.layout, off, read_len, 0, > >>>> object_extents, 0); > >>>> > >>>> // get snap info for each object > >>>> for (map<object_t,vector<ObjectExtent> >::iterator p = > >>>> object_extents.begin(); > >>>> p != object_extents.end(); ++p) > >>> [...] > >>>> for (std::vector<ObjectExtent>::iterator q = p->second.begin(); > >>>> q != p->second.end(); ++q) { > >>>> r = m_callback(off + q->offset, q->length, updated, m_callback_arg); > >>> [...] > >>>> } > >>> [...] > >>>> } > >>>>> left -= read_len; > >>>> off += read_len; > >>>> } > >>> > >>> So that iterates over the whole image (in our case, because m_offset is > >>> 0 and m_length is the image length), then picks out a chunk of read_len > >>> length, converts that to objects, and then iterates over all of those > >>> objects and all of their extents. > >>> > >>> file_to_extents looks like it’s just an arithmetic operation. So it > >>> doesn‘t look like that function returns extents in multiple snapshots. > >>> It just converts a range into subranges called “objects” and “extents” > >>> (at least that’s the way it looks to me). > >>> > >>> So overall, this looks awfully like it simply iterates over the whole > >>> image and then returns allocation information gathered as a top-down > >>> view through all of the snapshots, but not for each snapshot individually. > >> > >> Sorry, you're correct. The API function was originally designed to > >> support delta extents for supporting diff exports, so while it does > >> open each snapshot's object-map, it only returns a unioned set of > >> delta extents. The rbd CLI's "disk-usage" action behaves how I > >> described by counting the actual dirty block usage between snapshots. > > > > Max, Jason, thanks for the details! > > > > If you agree, I'll try to implement something similar, iterating on all > > snapshots. > > Yes, that should work. > > > What about metadata? > > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata. > > An idea could be to iterate over them and sum the keys-values size returned, > > but I'm not sure they are returned in the same format as they are stored. > > I wouldn’t mind ignoring metadata too much. If you can get something to > work, even better, but I don’t consider that too important. I don't think it's a good idea to attempt to estimate it. If there is enough desire, we can always add a supported "disk-usage" API method that takes into account things like the image metadata, object-map, journaling, etc overhead. However, I wouldn't expect it to be anywhere near the actual block usage (unless something has gone terribly wrong w/ journaling and it fails to trim your log). > Max >
On Tue, Jul 09, 2019 at 09:42:14PM -0400, Jason Dillaman wrote: > On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <mreitz@redhat.com> wrote: > > On 09.07.19 15:09, Stefano Garzarella wrote: > > > > > > Max, Jason, thanks for the details! > > > > > > If you agree, I'll try to implement something similar, iterating on all > > > snapshots. > > > > Yes, that should work. > > > > > What about metadata? > > > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata. > > > An idea could be to iterate over them and sum the keys-values size returned, > > > but I'm not sure they are returned in the same format as they are stored. > > > > I wouldn’t mind ignoring metadata too much. If you can get something to > > work, even better, but I don’t consider that too important. > > I don't think it's a good idea to attempt to estimate it. If there is > enough desire, we can always add a supported "disk-usage" API method > that takes into account things like the image metadata, object-map, > journaling, etc overhead. However, I wouldn't expect it to be anywhere > near the actual block usage (unless something has gone terribly wrong > w/ journaling and it fails to trim your log). Okay, so for now, I'll skip the metadata and if you think can be useful to add "disk-usage" API in RBD, I'll use it later. Thanks, Stefano
diff --git a/block/rbd.c b/block/rbd.c index 59757b3120..b6bed683e5 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs) return info.size; } +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists, + void *arg) +{ + int64_t *alloc_size = (int64_t *) arg; + + if (exists) { + (*alloc_size) += len; + } + + return 0; +} + +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs) +{ + BDRVRBDState *s = bs->opaque; + uint64_t flags, features; + int64_t alloc_size = 0; + int r; + + r = rbd_get_flags(s->image, &flags); + if (r < 0) { + return r; + } + + r = rbd_get_features(s->image, &features); + if (r < 0) { + return r; + } + + /* + * We use rbd_diff_iterate2() only if the RBD image have fast-diff + * feature enabled. If it is disabled, rbd_diff_iterate2() could be + * very slow on a big image. + */ + if (!(features & RBD_FEATURE_FAST_DIFF) || + (flags & RBD_FLAG_FAST_DIFF_INVALID)) { + return -ENOTSUP; + } + + /* + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes + * the callback on all allocated regions of the image. + */ + r = rbd_diff_iterate2(s->image, NULL, 0, + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1, + &rbd_allocated_size_cb, &alloc_size); + if (r < 0) { + return r; + } + + return alloc_size; +} + static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs, int64_t offset, PreallocMode prealloc, @@ -1291,6 +1344,7 @@ static BlockDriver bdrv_rbd = { .bdrv_get_info = qemu_rbd_getinfo, .create_opts = &qemu_rbd_create_opts, .bdrv_getlength = qemu_rbd_getlength, + .bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size, .bdrv_co_truncate = qemu_rbd_co_truncate, .protocol_name = "rbd",
This patch allows 'qemu-img info' to show the 'disk size' for the RBD images that have the fast-diff feature enabled. If this feature is enabled, we use the rbd_diff_iterate2() API to calculate the allocated size for the image. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> --- v3: - return -ENOTSUP instead of -1 when fast-diff is not available [John, Jason] v2: - calculate the actual usage only if the fast-diff feature is enabled [Jason] --- block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+)