qemu-img: align is_allocated_sectors to 4k

Message ID 1528375581-29538-1-git-send-email-pl@kamp.de (mailing list archive)
State New, archived

Commit Message

Peter Lieven June 7, 2018, 12:46 p.m. UTC
We currently don't enforce that the sparse segments we detect during convert are
aligned. This leads to unnecessary and costly read-modify-write cycles either
internally in QEMU or in the background on the storage device, as nearly all
modern filesystems and hardware use 4k alignment internally.

As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
that these sparse holes in the file are placed at 4k boundaries.

Converting an example image [1] to a raw device that has a 4k sector size takes
about 4600 additional 4k read requests (RMW cycles) to perform a total of about
15000 write requests. With this patch, the 4600 additional read requests are
eliminated.

[1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 qemu-img.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)
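
To illustrate the read-modify-write cost described in the commit message, here is a minimal sketch that counts how many device blocks must be read back before a write can complete. The helper name `rmw_reads_needed` and its parameters are hypothetical and not part of QEMU; it only models a device that can write whole blocks.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of QEMU): number of device blocks that have
 * to be read back before a write of 'bytes' bytes at byte offset 'offset'
 * can be completed on a device with 'block_size'-byte blocks.
 * Assumes block_size > 0.
 */
static int rmw_reads_needed(uint64_t offset, uint64_t bytes,
                            uint64_t block_size)
{
    uint64_t end, first_block, last_block;
    int partial_head, partial_tail;

    if (bytes == 0) {
        return 0;
    }

    end = offset + bytes;
    first_block = offset / block_size;
    last_block = (end - 1) / block_size;
    partial_head = (offset % block_size) != 0;
    partial_tail = (end % block_size) != 0;

    if (first_block == last_block) {
        /* The whole write sits inside one block: at most one read. */
        return partial_head || partial_tail;
    }
    /* Head and tail live in different blocks; each partial one costs a read. */
    return partial_head + partial_tail;
}
```

Under this model, a 512-byte write at offset 512 on a 4k device needs one extra read, while a 4096-byte write at offset 4096 needs none.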

Comments

Max Reitz June 11, 2018, 1:30 p.m. UTC | #1
On 2018-06-07 14:46, Peter Lieven wrote:
> We currently don't enforce that the sparse segments we detect during convert are
> aligned. This leads to unnecessary and costly read-modify-write cycles either
> internally in QEMU or in the background on the storage device, as nearly all
> modern filesystems and hardware use 4k alignment internally.
> 
> As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
> that these sparse holes in the file are placed at 4k boundaries.
> 
> Converting an example image [1] to a raw device that has a 4k sector size takes
> about 4600 additional 4k read requests (RMW cycles) to perform a total of about
> 15000 write requests. With this patch, the 4600 additional read requests are
> eliminated.
> 
> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  qemu-img.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)

I like the idea, but it doesn't seem guaranteed that
is_allocated_sectors() is called on aligned offsets, so this alignment
work may still leave things unaligned.

Furthermore, we should probably not blindly assume 4k but instead use
some block limit of the target, like pwrite_zeroes_alignment, or
pdiscard_alignment, depending on the case.  (Or probably still
min_sparse, if that's less.)

Since is_allocated_sectors_min() (the only caller of
is_allocated_sectors()) is called from just a single place, taking those
factors into account should be possible.

Max
Peter Lieven June 11, 2018, 1:59 p.m. UTC | #2
On 11.06.2018 at 15:30, Max Reitz wrote:
> On 2018-06-07 14:46, Peter Lieven wrote:
>> We currently don't enforce that the sparse segments we detect during convert are
>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>> internally in QEMU or in the background on the storage device, as nearly all
>> modern filesystems and hardware use 4k alignment internally.
>>
>> As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
>> that these sparse holes in the file are placed at 4k boundaries.
>>
>> Converting an example image [1] to a raw device that has a 4k sector size takes
>> about 4600 additional 4k read requests (RMW cycles) to perform a total of about
>> 15000 write requests. With this patch, the 4600 additional read requests are
>> eliminated.
>>
>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>   qemu-img.c | 21 +++++++++++++++------
>>   1 file changed, 15 insertions(+), 6 deletions(-)
> I like the idea, but it doesn't seem guaranteed that
> is_allocated_sectors() is called on aligned offsets, so this alignment
> work may still leave things unaligned.

I can't imagine why this should happen. As long as the alignment divides the buffer size, we either
write or skip aligned bytes. Maybe get_block_status returns an unaligned number of sectors?

>
> Furthermore, we should probably not blindly assume 4k but instead use
> some block limit of the target, like pwrite_zeroes_alignment, or
> pdiscard_alignment, depending on the case.  (Or probably still
> min_sparse, if that's less.)
>
> Since is_allocated_sectors_min() (the only caller of
> is_allocated_sectors()) is called from just a single place, taking those
> factors into account should be possible.

I also thought of this, but for instance for raw-posix I always get a request_alignment of 1.
But maybe the alignments you proposed produce a better result. I will check that.

Thanks,
Peter
Max Reitz June 11, 2018, 2:04 p.m. UTC | #3
On 2018-06-11 15:59, Peter Lieven wrote:
> On 11.06.2018 at 15:30, Max Reitz wrote:
>> On 2018-06-07 14:46, Peter Lieven wrote:
>>> We currently don't enforce that the sparse segments we detect during convert are
>>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>>> internally in QEMU or in the background on the storage device, as nearly all
>>> modern filesystems and hardware use 4k alignment internally.
>>>
>>> As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
>>> that these sparse holes in the file are placed at 4k boundaries.
>>>
>>> Converting an example image [1] to a raw device that has a 4k sector size takes
>>> about 4600 additional 4k read requests (RMW cycles) to perform a total of about
>>> 15000 write requests. With this patch, the 4600 additional read requests are
>>> eliminated.
>>>
>>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>>   qemu-img.c | 21 +++++++++++++++------
>>>   1 file changed, 15 insertions(+), 6 deletions(-)
>> I like the idea, but it doesn't seem guaranteed that
>> is_allocated_sectors() is called on aligned offsets, so this alignment
>> work may still leave things unaligned.
> 
> I can't imagine why this should happen. As long as the alignment divides
> the buffer size, we either write or skip aligned bytes. Maybe
> get_block_status returns an unaligned number of sectors?

Yes, because the source medium does not need to be the same as the
destination (so the source may have e.g. 512-byte clusters).

>> Furthermore, we should probably not blindly assume 4k but instead use
>> some block limit of the target, like pwrite_zeroes_alignment, or
>> pdiscard_alignment, depending on the case.  (Or probably still
>> min_sparse, if that's less.)
>>
>> Since is_allocated_sectors_min() (the only caller of
>> is_allocated_sectors()) is called from just a single place, taking those
>> factors into account should be possible.
> 
> I also thought of this, but for instance for raw-posix I always get a
> request_alignment of 1.

Yes, because request_alignment is a hard requirement.  With caching, you
can send requests with any alignment, so it's 1.

pwrite_zeroes_alignment and pdiscard_alignment are described as "Optimal
alignment", so those should contain the values we/you want.  If they are
0, then you should probably fall back to opt_transfer instead of
request_alignment.

Max

> But maybe the alignments you proposed produce a better result. I will
> check that.
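
The selection rule suggested in this reply could look roughly like the sketch below. The struct is a hypothetical stand-in that only mirrors the field names mentioned in the thread; it is not QEMU's real block-limits structure. All values are assumed to be in bytes, with 0 meaning "not set" and `min_sparse_bytes` assumed to be greater than 0. Falling back to `min_sparse_bytes` when both limits are 0 is an assumption of this example, as the thread does not settle that case.

```c
#include <stdint.h>

/* Hypothetical stand-in for the target's block limits (values in bytes). */
struct target_limits {
    uint32_t pwrite_zeroes_alignment; /* optimal alignment, 0 if unknown */
    uint32_t opt_transfer;            /* optimal transfer size, 0 if unknown */
};

/*
 * Pick the alignment for sparse detection: prefer the optimal zero-write
 * alignment, fall back to opt_transfer, and never exceed min_sparse.
 */
static uint32_t pick_sparse_alignment(const struct target_limits *bl,
                                      uint32_t min_sparse_bytes)
{
    uint32_t align = bl->pwrite_zeroes_alignment;

    if (align == 0) {
        align = bl->opt_transfer;
    }
    if (align == 0 || align > min_sparse_bytes) {
        /* Assumption: with no usable limit, min_sparse is the best guess. */
        align = min_sparse_bytes;
    }
    return align;
}
```

The same shape would apply to pdiscard_alignment when holes are created by discarding instead of writing zeroes.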
Peter Lieven June 11, 2018, 2:07 p.m. UTC | #4
On 11.06.2018 at 16:04, Max Reitz wrote:
> On 2018-06-11 15:59, Peter Lieven wrote:
>> On 11.06.2018 at 15:30, Max Reitz wrote:
>>> On 2018-06-07 14:46, Peter Lieven wrote:
>>>> We currently don't enforce that the sparse segments we detect during convert are
>>>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>>>> internally in QEMU or in the background on the storage device, as nearly all
>>>> modern filesystems and hardware use 4k alignment internally.
>>>>
>>>> As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
>>>> that these sparse holes in the file are placed at 4k boundaries.
>>>>
>>>> Converting an example image [1] to a raw device that has a 4k sector size takes
>>>> about 4600 additional 4k read requests (RMW cycles) to perform a total of about
>>>> 15000 write requests. With this patch, the 4600 additional read requests are
>>>> eliminated.
>>>>
>>>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>>    qemu-img.c | 21 +++++++++++++++------
>>>>    1 file changed, 15 insertions(+), 6 deletions(-)
>>> I like the idea, but it doesn't seem guaranteed that
>>> is_allocated_sectors() is called on aligned offsets, so this alignment
>>> work may still leave things unaligned.
>> I can't imagine why this should happen. As long as the alignment divides
>> the buffer size, we either write or skip aligned bytes. Maybe
>> get_block_status returns an unaligned number of sectors?
> Yes, because the source medium does not need to be the same as the
> destination (so the source may have e.g. 512-byte clusters).

Okay, I will try to figure out how to cope with it. So the function needs
to get the offset and the alignment to make the right "decision".

>
>>> Furthermore, we should probably not blindly assume 4k but instead use
>>> some block limit of the target, like pwrite_zeroes_alignment, or
>>> pdiscard_alignment, depending on the case.  (Or probably still
>>> min_sparse, if that's less.)
>>>
>>> Since is_allocated_sectors_min() (the only caller of
>>> is_allocated_sectors()) is called from just a single place, taking those
>>> factors into account should be possible.
>> I also thought of this, but for instance for raw-posix I always get a
>> request_alignment of 1.
> Yes, because request_alignment is a hard requirement.  With caching, you
> can send requests with any alignment, so it's 1.
>
> pwrite_zeroes_alignment and pdiscard_alignment are described as "Optimal
> alignment", so those should contain the values we/you want.  If they are
> 0, then you should probably fall back to opt_transfer instead of
> request_alignment.

I will check that for the targets that I can test and send a V2.

Thanks for your feedback,
Peter
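
One way to let the function take the offset and the alignment into account is to make the first chunk it examines end at the next aligned boundary. The sketch below is only an illustration of that idea and not the code of a later patch version; the helper name `sectors_to_boundary` is hypothetical, `sector_num` is assumed to be non-negative, and `alignment` is assumed to be at least 1 (both in sectors).

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the starting sector of a buffer holding 'n'
 * sectors, return how many sectors belong to the first chunk so that the
 * chunk ends on an 'alignment'-sector boundary (or at the end of the
 * buffer, whichever comes first).
 */
static int64_t sectors_to_boundary(int64_t sector_num, int64_t n,
                                   int64_t alignment)
{
    int64_t tail;

    if (n <= 0) {
        return 0;
    }
    /* Distance to the next boundary; a full chunk if already aligned. */
    tail = alignment - (sector_num % alignment);
    return tail < n ? tail : n;
}
```

With 8-sector (4k) alignment, a buffer starting at sector 3 gets a first chunk of 5 sectors, after which all further chunks start on a 4k boundary.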
Peter Lieven June 25, 2018, 8:29 p.m. UTC | #5
On 11.06.2018 at 16:04, Max Reitz wrote:
> On 2018-06-11 15:59, Peter Lieven wrote:
>> On 11.06.2018 at 15:30, Max Reitz wrote:
>>> On 2018-06-07 14:46, Peter Lieven wrote:
>>>> We currently don't enforce that the sparse segments we detect during convert are
>>>> aligned. This leads to unnecessary and costly read-modify-write cycles either
>>>> internally in QEMU or in the background on the storage device, as nearly all
>>>> modern filesystems and hardware use 4k alignment internally.
>>>>
>>>> As we set the min_sparse size to 4k by default, it makes perfect sense to ensure
>>>> that these sparse holes in the file are placed at 4k boundaries.
>>>>
>>>> Converting an example image [1] to a raw device that has a 4k sector size takes
>>>> about 4600 additional 4k read requests (RMW cycles) to perform a total of about
>>>> 15000 write requests. With this patch, the 4600 additional read requests are
>>>> eliminated.
>>>>
>>>> [1] https://cloud-images.ubuntu.com/releases/16.04/release/ubuntu-16.04-server-cloudimg-amd64-disk1.vmdk
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>>   qemu-img.c | 21 +++++++++++++++------
>>>>   1 file changed, 15 insertions(+), 6 deletions(-)
>>> I like the idea, but it doesn't seem guaranteed that
>>> is_allocated_sectors() is called on aligned offsets, so this alignment
>>> work may still leave things unaligned.
>> I can't imagine why this should happen. As long as the alignment divides
>> the buffer size, we either write or skip aligned bytes. Maybe
>> get_block_status returns an unaligned number of sectors?
> Yes, because the source medium does not need to be the same as the
> destination (so the source may have e.g. 512-byte clusters).
>
>>> Furthermore, we should probably not blindly assume 4k but instead use
>>> some block limit of the target, like pwrite_zeroes_alignment, or
>>> pdiscard_alignment, depending on the case.  (Or probably still
>>> min_sparse, if that's less.)
>>>
>>> Since is_allocated_sectors_min() (the only caller of
>>> is_allocated_sectors()) is called from just a single place, taking those
>>> factors into account should be possible.
>> I also thought of this, but for instance for raw-posix I always get a
>> request_alignment of 1.
> Yes, because request_alignment is a hard requirement.  With caching, you
> can send requests with any alignment, so it's 1.
>
> pwrite_zeroes_alignment and pdiscard_alignment are described as "Optimal
> alignment", so those should contain the values we/you want.  If they are
> 0, then you should probably fall back to opt_transfer instead of
> request_alignment.

I am still trying to figure out what the best solution is. If I take the optimal values into
account, I might end up transferring more data than necessary just to create an optimal
request. I just want to avoid unnecessary RMW cycles. And even if modern byte interfaces
advertise a request_alignment of 1, someone has to do the RMW cycle: either the OS or the
hard drive itself.

I am thinking about something like

alignment = MAX(request_alignment, opt_transfer, min_sparse)

as a starting point?

opt_transfer seems to be 0 for every target I was able to test, so maybe the
alignment could even be reduced to MAX(request_alignment, min_sparse).

Peter
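
The starting point proposed above can be written out as a small helper. This is a sketch under the assumption that all three values have already been converted to the same unit (bytes here); the function names are hypothetical.

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/*
 * Hypothetical helper for the proposal
 *     alignment = MAX(request_alignment, opt_transfer, min_sparse)
 * All arguments are assumed to be in bytes.
 */
static uint32_t proposed_alignment(uint32_t request_alignment,
                                   uint32_t opt_transfer,
                                   uint32_t min_sparse)
{
    return max_u32(request_alignment, max_u32(opt_transfer, min_sparse));
}
```

When opt_transfer is 0, the result is the same as MAX(request_alignment, min_sparse), which is why dropping that term changes nothing for the targets tested so far.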

Patch

diff --git a/qemu-img.c b/qemu-img.c
index 75f1610..68eefba 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1096,24 +1096,33 @@  static int64_t find_nonzero(const uint8_t *buf, int64_t n)
  *
  * 'pnum' is set to the number of sectors (including and immediately following
  * the first one) that are known to be in the same allocated/unallocated state.
+ * The function will try to align 'pnum' to 8 sectors (4k) to avoid unnecessary
+ * RMW cycles on modern hardware.
  */
 static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum)
 {
     bool is_zero;
-    int i;
+    int i, alignment = 1;
 
     if (n <= 0) {
         *pnum = 0;
         return 0;
     }
-    is_zero = buffer_is_zero(buf, 512);
-    for(i = 1; i < n; i++) {
-        buf += 512;
-        if (is_zero != buffer_is_zero(buf, 512)) {
+
+    if (!(n & 7)) {
+        /* the buffer size is a multiple of 4k */
+        alignment = 8;
+        n /= 8;
+    }
+
+    is_zero = buffer_is_zero(buf, BDRV_SECTOR_SIZE * alignment);
+    for (i = 1; i < n; i++) {
+        buf += BDRV_SECTOR_SIZE * alignment;
+        if (is_zero != buffer_is_zero(buf, BDRV_SECTOR_SIZE * alignment)) {
             break;
         }
     }
-    *pnum = i;
+    *pnum = i * alignment;
     return !is_zero;
 }
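
The behaviour of the patched function can be tried outside of QEMU with a small standalone copy. The sketch below reproduces the logic of the hunk above; `all_zero()` is a naive stand-in for QEMU's `buffer_is_zero()`, `SECTOR_SIZE` stands in for `BDRV_SECTOR_SIZE`, and the name `is_allocated_sectors_4k` is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512 /* stand-in for BDRV_SECTOR_SIZE */

/* Naive, unoptimized stand-in for QEMU's buffer_is_zero(). */
static bool all_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/* Same logic as the patched is_allocated_sectors() in the hunk above. */
static int is_allocated_sectors_4k(const uint8_t *buf, int n, int *pnum)
{
    bool is_zero;
    int i, alignment = 1;

    if (n <= 0) {
        *pnum = 0;
        return 0;
    }

    if (!(n & 7)) {
        /* n is a multiple of 8 sectors: compare in 4k chunks */
        alignment = 8;
        n /= 8;
    }

    is_zero = all_zero(buf, SECTOR_SIZE * alignment);
    for (i = 1; i < n; i++) {
        buf += SECTOR_SIZE * alignment;
        if (is_zero != all_zero(buf, SECTOR_SIZE * alignment)) {
            break;
        }
    }
    *pnum = i * alignment;
    return !is_zero;
}
```

For a 16-sector buffer in which only sector 3 contains data, the function returns 1 with `*pnum` set to 8, so the whole first 4k chunk is treated as allocated. Note that the function only knows the buffer length, not the offset the buffer was read from, which is the limitation raised in the review.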