
[Qemu-stable,v2,1/1] qemu-img: wait for convert coroutines to complete

Message ID 59ebc5d3-5921-9d3c-1436-caecd1e23a60@virtuozzo.com (mailing list archive)
State New, archived

Commit Message

Anton Nefedov April 24, 2017, 8:13 p.m. UTC
On 24/04/2017 21:16, Peter Lieven wrote:
>
>
>> On 24.04.2017 at 18:27, Anton Nefedov <anton.nefedov@virtuozzo.com> wrote:
>>
>>> On 04/21/2017 03:37 PM, Peter Lieven wrote:
>>>> On 21.04.2017 at 14:19, Anton Nefedov wrote:
>>>>> On 04/21/2017 01:44 PM, Peter Lieven wrote:
>>>>>> On 21.04.2017 at 12:04, Anton Nefedov wrote:
>>>>>> On the error path (e.g. an i/o error in one of the coroutines), it's required to
>>>>>>  - wait for the coroutines to complete before cleaning up the common structures
>>>>>>  - re-enter dependent coroutines so they can finish
>>>>>>
>>>>>> Introduced in 2d9187bc65.
>>>>>>
>>>>>> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
>>>>>> ---
>>>>>> [..]
>>>>>>
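
(For reference, the two requirements above amount to roughly the shape below. This is a sketch only, not the elided patch; the fields s->ret, s->running_coroutines, s->wait_sector_num[] and s->co[] are assumed from qemu-img.c's ImgConvertState as of QEMU 2.9.)

/* sketch, not the elided patch: unblock waiters, then drain, on the error path */
if (s->ret != -EINPROGRESS) {
    /* re-enter coroutines parked on the in-order write dependency,
     * so they can observe s->ret and bail out */
    for (i = 0; i < s->num_coroutines; i++) {
        if (s->co[i] && s->wait_sector_num[i] != -1) {
            qemu_coroutine_enter(s->co[i]);
        }
    }
}
/* only after every coroutine has completed is it safe to tear down
 * the structures they share */
while (s->running_coroutines) {
    main_loop_wait(false);
}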
>>>>>
>>>>>
>>>>> And what if we error out in the read path? Wouldn't something like this be easier?
>>>>>
>>>>>
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index 22f559a..4ff1085 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1903,6 +1903,16 @@ static int convert_do_copy(ImgConvertState *s)
>>>>>         main_loop_wait(false);
>>>>>     }
>>>>>
>>>>> +    /* on error path we need to enter all coroutines that are still
>>>>> +     * running before cleaning up common structures */
>>>>> +    if (s->ret) {
>>>>> +        for (i = 0; i < s->num_coroutines; i++) {
>>>>> +             if (s->co[i]) {
>>>>> +                 qemu_coroutine_enter(s->co[i]);
>>>>> +             }
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>>     if (s->compressed && !s->ret) {
>>>>>         /* signal EOF to align */
>>>>>         ret = blk_pwrite_compressed(s->target, 0, NULL, 0);
>>>>>
>>>>>
>>>>> Peter
>>>>>
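
(What makes this risky is QEMU's coroutine semantics: qemu_coroutine_enter() resumes a coroutine at its last yield point, and returning from the entry function destroys it. A minimal sketch, assuming the 2.9-era coroutine API; worker() and demo() are hypothetical names:)

#include "qemu/osdep.h"
#include "qemu/coroutine.h"

static void coroutine_fn worker(void *opaque)
{
    int *step = opaque;

    *step = 1;
    qemu_coroutine_yield();   /* parked, e.g. waiting for an I/O callback */
    *step = 2;                /* only reached on the next enter ... */
}                             /* ... and returning here frees the coroutine */

static void demo(void)
{
    int step = 0;
    Coroutine *co = qemu_coroutine_create(worker, &step);

    qemu_coroutine_enter(co); /* runs up to the yield; step == 1 */
    qemu_coroutine_enter(co); /* resumes past the yield; step == 2 and co is
                               * now gone; an I/O completion that still holds
                               * a pointer into it touches freed memory */
}

(Entering a coroutine that yielded for in-flight I/O therefore races with the completion callback, which is what the segfault below turns out to be.)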
>>>>
>>>> It seemed a bit too daring to me to re-enter every coroutine, potentially including the ones that yielded while waiting for I/O completion.
>>>> If that's ok, that is for sure easier :)
>>>
>>> I think we should enter every coroutine that is still running and have it terminate correctly. It was my mistake that I did not
>>> do this in the original patch. Can you check if the above fixes your test cases that triggered the bug?
>>>
>>
>> hi, sorry I'm late with the answer
>>
>> this segfaults in bdrv_close(). Looks like it tries to finish some I/O whose coroutine we have already entered and terminated?
>>
>> (gdb) run
>> Starting program: /vz/anefedov/qemu-build/us/./qemu-img convert -O qcow2 ./harddisk.hdd.c ./harddisk.hdd
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib64/libthread_db.so.1".
>> [New Thread 0x7fffeac2d700 (LWP 436020)]
>> [New Thread 0x7fffe4ed6700 (LWP 436021)]
>> qemu-img: error while reading sector 20480: Input/output error
>> qemu-img: error while writing sector 19712: Operation now in progress
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> aio_co_wake (co=0x0) at /mnt/code/us-qemu/util/async.c:454
>> 454         ctx = atomic_read(&co->ctx);
>> (gdb) bt
>> #0  aio_co_wake (co=0x0) at /mnt/code/us-qemu/util/async.c:454
>> /* [Anton]: thread_pool_co_cb () here */
>> #1  0x0000555555634629 in thread_pool_completion_bh (opaque=0x555555cfe020) at /mnt/code/us-qemu/util/thread-pool.c:189
>> #2  0x0000555555633b31 in aio_bh_call (bh=0x555555cfe0f0) at /mnt/code/us-qemu/util/async.c:90
>> #3  aio_bh_poll (ctx=ctx@entry=0x555555cee6d0) at /mnt/code/us-qemu/util/async.c:118
>> #4  0x0000555555636f14 in aio_poll (ctx=ctx@entry=0x555555cee6d0, blocking=<optimized out>) at /mnt/code/us-qemu/util/aio-posix.c:682
>> #5  0x00005555555c52d4 in bdrv_drain_recurse (bs=bs@entry=0x555555d22560) at /mnt/code/us-qemu/block/io.c:164
>> #6  0x00005555555c5aed in bdrv_drained_begin (bs=bs@entry=0x555555d22560) at /mnt/code/us-qemu/block/io.c:248
>> #7  0x0000555555581443 in bdrv_close (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:2909
>> #8  bdrv_delete (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:3100
>> #9  bdrv_unref (bs=0x555555d22560) at /mnt/code/us-qemu/block.c:4087
>> #10 0x00005555555baf44 in blk_remove_bs (blk=blk@entry=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:552
>> #11 0x00005555555bb173 in blk_delete (blk=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:238
>> #12 blk_unref (blk=blk@entry=0x555555d22380) at /mnt/code/us-qemu/block/block-backend.c:282
>> #13 0x000055555557a22c in img_convert (argc=<optimized out>, argv=<optimized out>) at /mnt/code/us-qemu/qemu-img.c:2359
>> #14 0x0000555555574189 in main (argc=5, argv=0x7fffffffe4a0) at /mnt/code/us-qemu/qemu-img.c:4464
>>
>>
>>> Peter
>>>
>>
>> /Anton
>>
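
(Frame #0's co=0x0 lines up with the "Operation now in progress" write error above: -EINPROGRESS is the sentinel that thread_pool_submit_co() keeps on the submitting coroutine's stack. Paraphrased from the 2.9-era util/thread-pool.c; check the tree for the exact text:)

typedef struct ThreadPoolCo {
    Coroutine *co;
    int ret;
} ThreadPoolCo;

static void thread_pool_co_cb(void *opaque, int ret)
{
    ThreadPoolCo *co = opaque;

    co->ret = ret;
    aio_co_wake(co->co);    /* frame #0 of the backtrace above */
}

int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
                                       void *arg)
{
    /* tpc lives on this coroutine's stack */
    ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };
    assert(qemu_in_coroutine());
    thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
    qemu_coroutine_yield();
    return tpc.ret;         /* still -EINPROGRESS if re-entered early */
}

(If the coroutine is force-entered while the worker is still in flight, it returns -EINPROGRESS (hence the write error above) and terminates, freeing the stack that holds tpc. When the worker finally completes, thread_pool_completion_bh() runs the callback on the stale ThreadPoolCo, and aio_co_wake() reads a junk/NULL coroutine pointer.)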
>
> it seems that this is a bit tricky. Can you share how your test case works?
>
> thanks,
> peter
>

How I tested today was basically with the fault injection in the patch below:

Patch

diff --git a/qemu-img.c b/qemu-img.c
index 4425aaa..3d2d506 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1788,6 +1788,10 @@ static void coroutine_fn convert_co_do_copy(void *opaque)

          if (status == BLK_DATA) {
              ret = convert_co_read(s, sector_num, n, buf);
+            const uint64_t fsector = 10*1024*1024/512;
+            if (sector_num <= fsector && fsector < sector_num+n) {
+                ret = -EIO;
+            }
              if (ret < 0) {
                  error_report("error while reading sector %" PRId64