
[RFC,V4] qemu-img: make convert async

Message ID 1487347224-8667-1-git-send-email-pl@kamp.de (mailing list archive)
State New, archived

Commit Message

Peter Lieven Feb. 17, 2017, 4 p.m. UTC
This is something I have been thinking about for almost 2 years now.
We heavily rely on the following two use cases of qemu-img convert:

a) reading from NFS and writing to iSCSI for deploying templates
b) reading from iSCSI and writing to NFS for backups

In both processes we use libiscsi and libnfs, so there is no kernel page cache.
As qemu-img convert is implemented with synchronous operations, it reads
one buffer and then writes it: there is no parallelism, and each request
blocks until it has completed.
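For reference, the copy loop in master is essentially the following
(simplified from the code this patch removes; error handling omitted):

  while (sector_num < s->total_sectors) {
      n = convert_iteration_sectors(s, sector_num);  /* get_block_status */
      if (s->status == BLK_DATA) {
          convert_read(s, sector_num, n, buf);       /* blocks until done */
      }
      convert_write(s, sector_num, n, buf);          /* blocks until done */
      sector_num += n;
  }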

This is version 4 of the approach using coroutine worker "threads".
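Distilled, the scheme looks like this (the full code is in the patch below):
each worker coroutine claims the next chunk under a CoMutex and performs the
read and write outside the lock:

  while (1) {
      qemu_co_mutex_lock(&s->lock);
      n = convert_iteration_sectors(s, s->sector_num);
      sector_num = s->sector_num;   /* claim this chunk */
      status = s->status;
      s->sector_num += n;           /* others can read ahead of us */
      qemu_co_mutex_unlock(&s->lock);

      convert_co_read(s, sector_num, n, buf);
      /* in-order mode: yield here until s->wr_offs == sector_num */
      convert_co_write(s, sector_num, n, buf, status);
  }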

So far I have the following runtimes when reading an uncompressed QCOW2 from
NFS and writing it to iSCSI (raw):

qemu-img (master)
 nfs -> iscsi 22.8 secs
 nfs -> ram   11.7 secs
 ram -> iscsi 12.3 secs

qemu-img-async (8 coroutines, in-order write disabled)
 nfs -> iscsi 11.0 secs
 nfs -> ram   10.4 secs
 ram -> iscsi  9.0 secs

The following are the runtimes measured with different settings for V3 and V4.
Each value is the best runtime out of 10 runs when converting from NFS to iSCSI.
Please note that the in-order write scenarios in V4 show very high jitter; I think
this is because get_block_status on the NFS share is delayed by concurrent read
requests.

                       in-order        out-of-order
V3  - 16 coroutines    12.4 seconds    11.1 seconds
    -  8 coroutines    12.2 seconds    11.3 seconds
    -  4 coroutines    12.5 seconds    11.1 seconds
    -  2 coroutines    14.8 seconds    14.9 seconds

V4  - 32 coroutines    15.9 seconds    11.5 seconds
    - 16 coroutines    12.5 seconds    11.0 seconds
    -  8 coroutines    12.9 seconds    11.0 seconds
    -  4 coroutines    14.1 seconds    11.5 seconds
    -  2 coroutines    16.9 seconds    13.2 seconds

Comments appreciated.

Thank you,
Peter

Signed-off-by: Peter Lieven <pl@kamp.de>
---
v3->v4: - avoid preparing a request queue upfront [Kevin]
        - do not ignore the BLK_BACKING_FILE status [Kevin]
        - redesign the interface to the read and write routines [Kevin]

v2->v3: - updated stats in the commit msg from a host with a better network card
        - only wake up the coroutine that is actually waiting for a write to complete.
          this was not only overhead, but also broke at least Linux AIO.
        - fix coding style complaints
        - rename some variables and structs

v1->v2: - using coroutines as worker "threads". [Max]
        - keeping the request queue, as otherwise it can happen
          that we wait on BLK_ZERO chunks while keeping the write order.
          it also avoids redundant calls to get_block_status and helps
          to skip some conditions for fully allocated images (!s->min_sparse)
---
 qemu-img.c | 260 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 187 insertions(+), 73 deletions(-)

Comments

Stefan Hajnoczi Feb. 20, 2017, 2:50 p.m. UTC | #1
On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
> this is something I have been thinking about for almost 2 years now.
> we heavily have the following two use cases when using qemu-img convert.
> 
> a) reading from NFS and writing to iSCSI for deploying templates
> b) reading from iSCSI and writing to NFS for backups
> 
> In both processes we use libiscsi and libnfs so we have no kernel pagecache.
> As qemu-img convert is implemented with sync operations that means we
> read one buffer and then write it. No parallelism and each sync request
> takes as long as it takes until it is completed.
> 
> This is version 4 of the approach using coroutine worker "threads".
> 
> So far I have the following runtimes when reading an uncompressed QCOW2 from
> NFS and writing it to iSCSI (raw):
> 
> qemu-img (master)
>  nfs -> iscsi 22.8 secs
>  nfs -> ram   11.7 secs
>  ram -> iscsi 12.3 secs
> 
> qemu-img-async (8 coroutines, in-order write disabled)
>  nfs -> iscsi 11.0 secs
>  nfs -> ram   10.4 secs
>  ram -> iscsi  9.0 secs
> 
> The following are the runtimes found with different settings between V3 and V4.
> This is always the best runtime out of 10 runs when converting from nfs to iscsi.
> Please note that in V4 in-order write scenarios show a very high jitter. I think
> this is because the get_block_status on the NFS share is delayed by concurrent read
> requests.
> 
>                        in-order        out-of-order
> V3  - 16 coroutines    12.4 seconds    11.1 seconds
>     -  8 coroutines    12.2 seconds    11.3 seconds
>     -  4 coroutines    12.5 seconds    11.1 seconds
>     -  2 coroutines    14.8 seconds    14.9 seconds
> 
> V4  - 32 coroutines    15.9 seconds    11.5 seconds
>     - 16 coroutines    12.5 seconds    11.0 seconds
>     -  8 coroutines    12.9 seconds    11.0 seconds
>     -  4 coroutines    14.1 seconds    11.5 seconds
>     -  2 coroutines    16.9 seconds    13.2 seconds

Does this patch work with compressed images?  Especially the
out-of-order write mode may be problematic with a compressed qcow2 image.

How should a user decide between in-order and out-of-order?

> @@ -1651,12 +1680,117 @@ static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
>      return 0;
>  }
>  
> -static int convert_do_copy(ImgConvertState *s)
> +static void convert_co_do_copy(void *opaque)

Missing coroutine_fn here and for convert_co_read()/convert_co_write().
Functions that must be called from coroutine context (because they
yield, use coroutine mutexes, etc) need to be marked as such.
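I.e., something like this (sketch):

  static void coroutine_fn convert_co_do_copy(void *opaque);
  static int coroutine_fn convert_co_read(ImgConvertState *s,
                                          int64_t sector_num,
                                          int nb_sectors, uint8_t *buf);
  static int coroutine_fn convert_co_write(ImgConvertState *s,
                                           int64_t sector_num,
                                           int nb_sectors, uint8_t *buf,
                                           enum ImgConvertBlockStatus status);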

> +        if (s->wr_in_order) {
> +            /* reenter the coroutine that might have waited
> +             * for this write to complete */
> +            s->wr_offs = sector_num + n;
> +            for (i = 0; i < s->num_coroutines; i++) {
> +                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
> +                    qemu_coroutine_enter(s->co[i]);
> +                    break;

This qemu_coroutine_enter() call relies on the yield pattern between
sibling coroutines having no recursive qemu_coroutine_enter() calls.
QEMU aborts if there is a code path where coroutine A enters B and then
B enters A again before yielding.

Paolo's new aio_co_wake() API solves this issue by deferring the
qemu_coroutine_enter() to the event loop.  It's similar to CoQueue
wakeup.  aio_co_wake() is part of my latest block pull request (should
be merged into qemu.git soon).

I *think* this patch has no A -> B -> A situation thanks to yields in
the code path, but it would be nicer to use aio_co_wake() where this can
never happen.
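For illustration, the wakeup would then become something like (untested
sketch):

  if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
      /* defer the reentry to the event loop, so a recursive
       * A -> B -> A enter can never happen */
      aio_co_wake(s->co[i]);
      break;
  }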

> +        case 'm':
> +            num_coroutines = atoi(optarg);
> +            if (num_coroutines > MAX_COROUTINES) {
> +                error_report("Maximum allowed number of coroutines is %d",
> +                             MAX_COROUTINES);
> +                ret = -1;
> +                goto fail_getopt;
> +            }

Missing input validation for the < 1 case.
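Something along these lines (sketch):

  case 'm':
      num_coroutines = atoi(optarg);
      if (num_coroutines < 1 || num_coroutines > MAX_COROUTINES) {
          error_report("Number of coroutines must be between 1 and %d",
                       MAX_COROUTINES);
          ret = -1;
          goto fail_getopt;
      }
      break;
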
Peter Lieven Feb. 20, 2017, 2:59 p.m. UTC | #2
On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
> On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
>> this is something I have been thinking about for almost 2 years now.
>> we heavily have the following two use cases when using qemu-img convert.
>>
>> a) reading from NFS and writing to iSCSI for deploying templates
>> b) reading from iSCSI and writing to NFS for backups
>>
>> In both processes we use libiscsi and libnfs so we have no kernel pagecache.
>> As qemu-img convert is implemented with sync operations that means we
>> read one buffer and then write it. No parallelism and each sync request
>> takes as long as it takes until it is completed.
>>
>> This is version 4 of the approach using coroutine worker "threads".
>>
>> So far I have the following runtimes when reading an uncompressed QCOW2 from
>> NFS and writing it to iSCSI (raw):
>>
>> qemu-img (master)
>>   nfs -> iscsi 22.8 secs
>>   nfs -> ram   11.7 secs
>>   ram -> iscsi 12.3 secs
>>
>> qemu-img-async (8 coroutines, in-order write disabled)
>>   nfs -> iscsi 11.0 secs
>>   nfs -> ram   10.4 secs
>>   ram -> iscsi  9.0 secs
>>
>> The following are the runtimes found with different settings between V3 and V4.
>> This is always the best runtime out of 10 runs when converting from nfs to iscsi.
>> Please note that in V4 in-order write scenarios show a very high jitter. I think
>> this is because the get_block_status on the NFS share is delayed by concurrent read
>> requests.
>>
>>                         in-order        out-of-order
>> V3  - 16 coroutines    12.4 seconds    11.1 seconds
>>      -  8 coroutines    12.2 seconds    11.3 seconds
>>      -  4 coroutines    12.5 seconds    11.1 seconds
>>      -  2 coroutines    14.8 seconds    14.9 seconds
>>
>> V4  - 32 coroutines    15.9 seconds    11.5 seconds
>>      - 16 coroutines    12.5 seconds    11.0 seconds
>>      -  8 coroutines    12.9 seconds    11.0 seconds
>>      -  4 coroutines    14.1 seconds    11.5 seconds
>>      -  2 coroutines    16.9 seconds    13.2 seconds
> Does this patch work with compressed images?  Especially the
> out-of-order write mode may be problematic with a compressed qcow2 image.

It does, but you are right, out-of-order writes and compression should
be mutually exclusive.

>
> How should a user decide between in-order and out-of-order?

In general, out-of-order is OK when we write to a preallocated device (e.g. all raw
devices or qcow2 with preallocation=full). The default is in-order writes; the user
has to enable out-of-order writes manually.
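For example (the iSCSI URL is only a placeholder):

  # defaults: 8 coroutines, in-order writes
  qemu-img convert -O raw src.qcow2 dst.raw

  # 16 coroutines, out-of-order writes allowed
  qemu-img convert -m 16 -W -O raw src.qcow2 iscsi://host/iqn.2017-02.example:target/0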

>
>> @@ -1651,12 +1680,117 @@ static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
>>       return 0;
>>   }
>>   
>> -static int convert_do_copy(ImgConvertState *s)
>> +static void convert_co_do_copy(void *opaque)
> Missing coroutine_fn here and for convert_co_read()/convert_co_write().
> Functions that must be called from coroutine context (because they
> yield, use coroutine mutexes, etc) need to be marked as such.

okay.

>
>> +        if (s->wr_in_order) {
>> +            /* reenter the coroutine that might have waited
>> +             * for this write to complete */
>> +            s->wr_offs = sector_num + n;
>> +            for (i = 0; i < s->num_coroutines; i++) {
>> +                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
>> +                    qemu_coroutine_enter(s->co[i]);
>> +                    break;
> This qemu_coroutine_enter() call relies on the yield pattern between
> sibling coroutines having no recursive qemu_coroutine_enter() calls.
> QEMU aborts if there is a code path where coroutine A enters B and then
> B enters A again before yielding.
>
> Paolo's new aio_co_wake() API solves this issue by deferring the
> qemu_coroutine_enter() to the event loop.  It's similar to CoQueue
> wakeup.  aio_co_wake() is part of my latest block pull request (should
> be merged into qemu.git soon).
>
> I *think* this patch has no A -> B -> A situation thanks to yields in
> the code path, but it would be nicer to use aio_co_wake() where this can
> never happen.

I also believe this can't happen. Would it also be okay to use
qemu_coroutine_enter_if_inactive() here?

>
>> +        case 'm':
>> +            num_coroutines = atoi(optarg);
>> +            if (num_coroutines > MAX_COROUTINES) {
>> +                error_report("Maximum allowed number of coroutines is %d",
>> +                             MAX_COROUTINES);
>> +                ret = -1;
>> +                goto fail_getopt;
>> +            }
> Missing input validation for the < 1 case.

Right.

Thank you,
Peter
Stefan Hajnoczi Feb. 21, 2017, 11:08 a.m. UTC | #3
On Mon, Feb 20, 2017 at 03:59:42PM +0100, Peter Lieven wrote:
> On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
> > On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
> > > +        if (s->wr_in_order) {
> > > +            /* reenter the coroutine that might have waited
> > > +             * for this write to complete */
> > > +            s->wr_offs = sector_num + n;
> > > +            for (i = 0; i < s->num_coroutines; i++) {
> > > +                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
> > > +                    qemu_coroutine_enter(s->co[i]);
> > > +                    break;
> > This qemu_coroutine_enter() call relies on the yield pattern between
> > sibling coroutines having no recursive qemu_coroutine_enter() calls.
> > QEMU aborts if there is a code path where coroutine A enters B and then
> > B enters A again before yielding.
> > 
> > Paolo's new aio_co_wake() API solves this issue by deferring the
> > qemu_coroutine_enter() to the event loop.  It's similar to CoQueue
> > wakeup.  aio_co_wake() is part of my latest block pull request (should
> > be merged into qemu.git soon).
> > 
> > I *think* this patch has no A -> B -> A situation thanks to yields in
> > the code path, but it would be nicer to use aio_co_wake() where this can
> > never happen.
> 
> I also believe this can't happen. Would it be also okay to use
> qemu_coroutine_enter_if_inactive() here?

Perhaps this comment is enough:

  /*
   * A -> B -> A cannot occur because A has s->wait_sector_num[i] == -1
   * during A -> B.  Therefore B will never enter A during this time
   * window.
   */
  qemu_coroutine_enter(s->co[i]);

Normally a comment isn't necessary because qemu_coroutine_enter() is
used for parent-child relationships.  But here it's a sibling
relationship so we need proof that A -> B -> A doesn't occur.
Vladimir Sementsov-Ogievskiy June 1, 2018, 4:16 p.m. UTC | #4
Hi!

20.02.2017 17:59, Peter Lieven wrote:
> On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
>> On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
>>> this is something I have been thinking about for almost 2 years now.
>>> we heavily have the following two use cases when using qemu-img 
>>> convert.
>>>
>>>

[...]

>> Does this patch work with compressed images?  Especially the
>> out-of-order write mode may be problematic with a compressed qcow2 
>> image.
>
> It does, but you are right, out-of-order writes and compression should
> be mutually exclusive.


Sorry for being late, but can you please explain to me why?
Stefan Hajnoczi June 7, 2018, 10:10 a.m. UTC | #5
On Fri, Jun 01, 2018 at 07:16:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 20.02.2017 17:59, Peter Lieven wrote:
> > On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
> > > On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
> > > > this is something I have been thinking about for almost 2 years now.
> > > > we heavily have the following two use cases when using qemu-img
> > > > convert.
> > > > 
> > > > 
> 
> [...]
> 
> > > Does this patch work with compressed images?  Especially the
> > > out-of-order write mode may be problematic with a compressed qcow2
> > > image.
> > 
> > It does, but you are right, out-of-order writes and compression should
> > be mutually exclusive.
> 
> 
> Sorry for being late, but can you please explain to me, why?

There are image format-specific limitations on compressed writes.  For
some reason I thought they were append-only in qcow2, but I was wrong.

Stefan
Vladimir Sementsov-Ogievskiy June 7, 2018, 10:19 a.m. UTC | #6
07.06.2018 13:10, Stefan Hajnoczi wrote:
> On Fri, Jun 01, 2018 at 07:16:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 20.02.2017 17:59, Peter Lieven wrote:
>>> On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
>>>> On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
>>>>> this is something I have been thinking about for almost 2 years now.
>>>>> we heavily have the following two use cases when using qemu-img
>>>>> convert.
>>>>>
>>>>>
>> [...]
>>
>>>> Does this patch work with compressed images?  Especially the
>>>> out-of-order write mode may be problematic with a compressed qcow2
>>>> image.
>>> It does, but you are right, out-of-order writes and compression should
>>> be mutually exclusive.
>>
>> Sorry for being late, but can you please explain to me, why?
> There are image format-specific limitations on compressed writes.  For
> some reason I thought they were append-only in qcow2, but I was wrong.
>
> Stefan

And what are the limitations for compressed writes in qcow2? We can't write
asynchronously? Why?
Stefan Hajnoczi June 8, 2018, 7:51 a.m. UTC | #7
On Thu, Jun 07, 2018 at 01:19:29PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 07.06.2018 13:10, Stefan Hajnoczi wrote:
> > On Fri, Jun 01, 2018 at 07:16:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > 20.02.2017 17:59, Peter Lieven wrote:
> > > > On 20.02.2017 at 15:50, Stefan Hajnoczi wrote:
> > > > > On Fri, Feb 17, 2017 at 05:00:24PM +0100, Peter Lieven wrote:
> > > > > > this is something I have been thinking about for almost 2 years now.
> > > > > > we heavily have the following two use cases when using qemu-img
> > > > > > convert.
> > > > > > 
> > > > > > 
> > > [...]
> > > 
> > > > > Does this patch work with compressed images?  Especially the
> > > > > out-of-order write mode may be problematic with a compressed qcow2
> > > > > image.
> > > > It does, but you are right, out-of-order writes and compression should
> > > > be mutually exclusive.
> > > 
> > > Sorry for being late, but can you please explain to me, why?
> > There are image format-specific limitations on compressed writes.  For
> > some reason I thought they were append-only in qcow2, but I was wrong.
> > 
> > Stefan
> 
> And what are limitations for compressed writes in qcow2?

Writes must be cluster-aligned and the size must be 1 cluster (except
for the last cluster in an image).

qemu-img convert honors this, so it's not a problem.
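In code terms, the constraint is roughly (sketch; names are illustrative):

  /* qcow2 compressed writes: whole clusters only, except at end of image */
  assert(QEMU_IS_ALIGNED(offset, cluster_size));
  assert(bytes == cluster_size || offset + bytes == image_size);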

> We can't write asynchronously? Why?

Async compressed writes are supported nowadays.

I think my original comment was wrong.  It should be fine to use
out-of-order compressed writes.

Stefan

Patch

diff --git a/qemu-img.c b/qemu-img.c
index cff22e3..6bac980 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1448,6 +1448,8 @@  enum ImgConvertBlockStatus {
     BLK_BACKING_FILE,
 };
 
+#define MAX_COROUTINES 16
+
 typedef struct ImgConvertState {
     BlockBackend **src;
     int64_t *src_sectors;
@@ -1455,15 +1457,25 @@  typedef struct ImgConvertState {
     int64_t src_cur_offset;
     int64_t total_sectors;
     int64_t allocated_sectors;
+    int64_t allocated_done;
+    int64_t sector_num;
+    int64_t wr_offs;
     enum ImgConvertBlockStatus status;
     int64_t sector_next_status;
     BlockBackend *target;
     bool has_zero_init;
     bool compressed;
     bool target_has_backing;
+    bool wr_in_order;
     int min_sparse;
     size_t cluster_sectors;
     size_t buf_sectors;
+    int num_coroutines;
+    int running_coroutines;
+    Coroutine *co[MAX_COROUTINES];
+    int64_t wait_sector_num[MAX_COROUTINES];
+    CoMutex lock;
+    int ret;
 } ImgConvertState;
 
 static void convert_select_part(ImgConvertState *s, int64_t sector_num)
@@ -1544,11 +1556,12 @@  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     return n;
 }
 
-static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
-                        uint8_t *buf)
+static int convert_co_read(ImgConvertState *s, int64_t sector_num,
+                           int nb_sectors, uint8_t *buf)
 {
-    int n;
-    int ret;
+    int n, ret;
+    QEMUIOVector qiov;
+    struct iovec iov;
 
     assert(nb_sectors <= s->buf_sectors);
     while (nb_sectors > 0) {
@@ -1563,9 +1576,13 @@  static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
         bs_sectors = s->src_sectors[s->src_cur];
 
         n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
-        ret = blk_pread(blk,
-                        (sector_num - s->src_cur_offset) << BDRV_SECTOR_BITS,
-                        buf, n << BDRV_SECTOR_BITS);
+        iov.iov_base = buf;
+        iov.iov_len = n << BDRV_SECTOR_BITS;
+        qemu_iovec_init_external(&qiov, &iov, 1);
+
+        ret = blk_co_preadv(
+                blk, (sector_num - s->src_cur_offset) << BDRV_SECTOR_BITS,
+                n << BDRV_SECTOR_BITS, &qiov, 0);
         if (ret < 0) {
             return ret;
         }
@@ -1578,15 +1595,18 @@  static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
     return 0;
 }
 
-static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
-                         const uint8_t *buf)
+
+static int convert_co_write(ImgConvertState *s, int64_t sector_num,
+                            int nb_sectors, uint8_t *buf,
+                            enum ImgConvertBlockStatus status)
 {
     int ret;
+    QEMUIOVector qiov;
+    struct iovec iov;
 
     while (nb_sectors > 0) {
         int n = nb_sectors;
-
-        switch (s->status) {
+        switch (status) {
         case BLK_BACKING_FILE:
             /* If we have a backing file, leave clusters unallocated that are
              * unallocated in the source image, so that the backing file is
@@ -1607,9 +1627,13 @@  static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
                     break;
                 }
 
-                ret = blk_pwrite_compressed(s->target,
-                                            sector_num << BDRV_SECTOR_BITS,
-                                            buf, n << BDRV_SECTOR_BITS);
+                iov.iov_base = buf;
+                iov.iov_len = n << BDRV_SECTOR_BITS;
+                qemu_iovec_init_external(&qiov, &iov, 1);
+
+                ret = blk_co_pwritev(s->target, sector_num << BDRV_SECTOR_BITS,
+                                     n << BDRV_SECTOR_BITS, &qiov,
+                                     BDRV_REQ_WRITE_COMPRESSED);
                 if (ret < 0) {
                     return ret;
                 }
@@ -1622,8 +1646,12 @@  static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
             if (!s->min_sparse ||
                 is_allocated_sectors_min(buf, n, &n, s->min_sparse))
             {
-                ret = blk_pwrite(s->target, sector_num << BDRV_SECTOR_BITS,
-                                 buf, n << BDRV_SECTOR_BITS, 0);
+                iov.iov_base = buf;
+                iov.iov_len = n << BDRV_SECTOR_BITS;
+                qemu_iovec_init_external(&qiov, &iov, 1);
+
+                ret = blk_co_pwritev(s->target, sector_num << BDRV_SECTOR_BITS,
+                                     n << BDRV_SECTOR_BITS, &qiov, 0);
                 if (ret < 0) {
                     return ret;
                 }
@@ -1635,8 +1663,9 @@  static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
             if (s->has_zero_init) {
                 break;
             }
-            ret = blk_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS,
-                                    n << BDRV_SECTOR_BITS, 0);
+            ret = blk_co_pwrite_zeroes(s->target,
+                                       sector_num << BDRV_SECTOR_BITS,
+                                       n << BDRV_SECTOR_BITS, 0);
             if (ret < 0) {
                 return ret;
             }
@@ -1651,12 +1680,117 @@  static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
     return 0;
 }
 
-static int convert_do_copy(ImgConvertState *s)
+static void convert_co_do_copy(void *opaque)
 {
+    ImgConvertState *s = opaque;
     uint8_t *buf = NULL;
-    int64_t sector_num, allocated_done;
-    int ret;
-    int n;
+    int ret, i;
+    int index = -1;
+
+    for (i = 0; i < s->num_coroutines; i++) {
+        if (s->co[i] == qemu_coroutine_self()) {
+            index = i;
+            break;
+        }
+    }
+    assert(index >= 0);
+
+    s->running_coroutines++;
+    buf = blk_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
+
+    while (1) {
+        int n;
+        int64_t sector_num;
+        enum ImgConvertBlockStatus status;
+
+        qemu_co_mutex_lock(&s->lock);
+        if (s->ret != -EINPROGRESS || s->sector_num >= s->total_sectors) {
+            qemu_co_mutex_unlock(&s->lock);
+            goto out;
+        }
+        n = convert_iteration_sectors(s, s->sector_num);
+        if (n < 0) {
+            qemu_co_mutex_unlock(&s->lock);
+            s->ret = n;
+            goto out;
+        }
+        /* save the current sector and allocation status to local variables */
+        sector_num = s->sector_num;
+        status = s->status;
+        if (!s->min_sparse && s->status == BLK_ZERO) {
+            n = MIN(n, s->buf_sectors);
+        }
+        /* increment global sector counter so that other coroutines can
+         * already continue reading beyond this request */
+        s->sector_num += n;
+        qemu_co_mutex_unlock(&s->lock);
+
+        if (status == BLK_DATA || (!s->min_sparse && s->status == BLK_ZERO)) {
+            s->allocated_done += n;
+            qemu_progress_print(100.0 * s->allocated_done /
+                                        s->allocated_sectors, 0);
+        }
+
+        if (status == BLK_DATA) {
+            ret = convert_co_read(s, sector_num, n, buf);
+            if (ret < 0) {
+                error_report("error while reading sector %" PRId64
+                             ": %s", sector_num, strerror(-ret));
+                s->ret = ret;
+                goto out;
+            }
+        } else if (!s->min_sparse && s->status == BLK_ZERO) {
+            status = BLK_DATA;
+            memset(buf, 0x00, n * BDRV_SECTOR_SIZE);
+        }
+
+        if (s->wr_in_order) {
+            /* keep writes in order */
+            while (s->wr_offs != sector_num) {
+                if (s->ret != -EINPROGRESS) {
+                    goto out;
+                }
+                s->wait_sector_num[index] = sector_num;
+                qemu_coroutine_yield();
+            }
+            s->wait_sector_num[index] = -1;
+        }
+
+        ret = convert_co_write(s, sector_num, n, buf, status);
+        if (ret < 0) {
+            error_report("error while writing sector %" PRId64
+                         ": %s", sector_num, strerror(-ret));
+            s->ret = ret;
+            goto out;
+        }
+
+        if (s->wr_in_order) {
+            /* reenter the coroutine that might have waited
+             * for this write to complete */
+            s->wr_offs = sector_num + n;
+            for (i = 0; i < s->num_coroutines; i++) {
+                if (s->co[i] && s->wait_sector_num[i] == s->wr_offs) {
+                    qemu_coroutine_enter(s->co[i]);
+                    break;
+                }
+            }
+        }
+    }
+
+out:
+    qemu_vfree(buf);
+    s->co[index] = NULL;
+    s->running_coroutines--;
+    if (!s->running_coroutines && s->ret == -EINPROGRESS) {
+        /* the convert job finished successfully */
+        s->ret = 0;
+    }
+}
+
+static int convert_do_copy(ImgConvertState *s)
+{
+    int ret, i, n;
+    int64_t sector_num = 0;
 
     /* Check whether we have zero initialisation or can get it efficiently */
     s->has_zero_init = s->min_sparse && !s->target_has_backing
@@ -1677,21 +1811,15 @@  static int convert_do_copy(ImgConvertState *s)
     if (s->compressed) {
         if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
             error_report("invalid cluster size");
-            ret = -EINVAL;
-            goto fail;
+            return -EINVAL;
         }
         s->buf_sectors = s->cluster_sectors;
     }
-    buf = blk_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
 
-    /* Calculate allocated sectors for progress */
-    s->allocated_sectors = 0;
-    sector_num = 0;
     while (sector_num < s->total_sectors) {
         n = convert_iteration_sectors(s, sector_num);
         if (n < 0) {
-            ret = n;
-            goto fail;
+            return n;
         }
         if (s->status == BLK_DATA || (!s->min_sparse && s->status == BLK_ZERO))
         {
@@ -1704,58 +1832,28 @@  static int convert_do_copy(ImgConvertState *s)
     s->src_cur = 0;
     s->src_cur_offset = 0;
     s->sector_next_status = 0;
+    s->ret = -EINPROGRESS;
 
-    sector_num = 0;
-    allocated_done = 0;
-
-    while (sector_num < s->total_sectors) {
-        n = convert_iteration_sectors(s, sector_num);
-        if (n < 0) {
-            ret = n;
-            goto fail;
-        }
-        if (s->status == BLK_DATA || (!s->min_sparse && s->status == BLK_ZERO))
-        {
-            allocated_done += n;
-            qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
-                                0);
-        }
-
-        if (s->status == BLK_DATA) {
-            ret = convert_read(s, sector_num, n, buf);
-            if (ret < 0) {
-                error_report("error while reading sector %" PRId64
-                             ": %s", sector_num, strerror(-ret));
-                goto fail;
-            }
-        } else if (!s->min_sparse && s->status == BLK_ZERO) {
-            n = MIN(n, s->buf_sectors);
-            memset(buf, 0, n * BDRV_SECTOR_SIZE);
-            s->status = BLK_DATA;
-        }
-
-        ret = convert_write(s, sector_num, n, buf);
-        if (ret < 0) {
-            error_report("error while writing sector %" PRId64
-                         ": %s", sector_num, strerror(-ret));
-            goto fail;
-        }
+    qemu_co_mutex_init(&s->lock);
+    for (i = 0; i < s->num_coroutines; i++) {
+        s->co[i] = qemu_coroutine_create(convert_co_do_copy, s);
+        s->wait_sector_num[i] = -1;
+        qemu_coroutine_enter(s->co[i]);
+    }
 
-        sector_num += n;
+    while (s->ret == -EINPROGRESS) {
+        main_loop_wait(false);
     }
 
-    if (s->compressed) {
+    if (s->compressed && !s->ret) {
         /* signal EOF to align */
         ret = blk_pwrite_compressed(s->target, 0, NULL, 0);
         if (ret < 0) {
-            goto fail;
+            return ret;
         }
     }
 
-    ret = 0;
-fail:
-    qemu_vfree(buf);
-    return ret;
+    return s->ret;
 }
 
 static int img_convert(int argc, char **argv)
@@ -1783,6 +1881,8 @@  static int img_convert(int argc, char **argv)
     QemuOpts *sn_opts = NULL;
     ImgConvertState state;
     bool image_opts = false;
+    bool wr_in_order = true;
+    int num_coroutines = 8;
 
     fmt = NULL;
     out_fmt = "raw";
@@ -1798,7 +1898,7 @@  static int img_convert(int argc, char **argv)
             {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
             {0, 0, 0, 0}
         };
-        c = getopt_long(argc, argv, "hf:O:B:ce6o:s:l:S:pt:T:qn",
+        c = getopt_long(argc, argv, "hf:O:B:ce6o:s:l:S:pt:T:qnm:W",
                         long_options, NULL);
         if (c == -1) {
             break;
@@ -1890,6 +1990,18 @@  static int img_convert(int argc, char **argv)
         case 'n':
             skip_create = 1;
             break;
+        case 'm':
+            num_coroutines = atoi(optarg);
+            if (num_coroutines > MAX_COROUTINES) {
+                error_report("Maximum allowed number of coroutines is %d",
+                             MAX_COROUTINES);
+                ret = -1;
+                goto fail_getopt;
+            }
+            break;
+        case 'W':
+            wr_in_order = false;
+            break;
         case OPTION_OBJECT:
             opts = qemu_opts_parse_noisily(&qemu_object_opts,
                                            optarg, true);
@@ -2149,6 +2261,8 @@  static int img_convert(int argc, char **argv)
         .min_sparse         = min_sparse,
         .cluster_sectors    = cluster_sectors,
         .buf_sectors        = bufsectors,
+        .wr_in_order        = wr_in_order,
+        .num_coroutines     = num_coroutines,
     };
     ret = convert_do_copy(&state);