Message ID: 20161118102452.5779-1-olaf@aepfle.de (mailing list archive)
State: New, archived
On Fri, Nov 18, Olaf Hering wrote:
> @@ -708,12 +743,10 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
> +        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
> +            goto err;

How is error handling supposed to work here?
Initially I forgot the "!", which just hung the mkfs command in domU.
I have not yet tried to force errors in other parts of that function.

Olaf
On 11/18/2016 04:24 AM, Olaf Hering wrote:
> The guest sends discard requests as u64 sector/count pairs, but the
> block layer operates internally with s64/s32 pairs. The conversion
> leads to IO errors in the guest, the discard request is not processed.

Doesn't the block layer already split discard requests into 2^31 byte
chunks?

>
> domU.cfg:
> 'vdev=xvda, format=qcow2, backendtype=qdisk, target=/x.qcow2'
> domU:
> mkfs.ext4 -F /dev/xvda
> Discarding device blocks: failed - Input/output error
>
> Fix this by splitting the request into chunks of BDRV_REQUEST_MAX_SECTORS.
> Add input range checking to avoid overflow.
>
> Signed-off-by: Olaf Hering <olaf@aepfle.de>
> ---
>  hw/block/xen_disk.c | 45 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 39 insertions(+), 6 deletions(-)
>
> @@ -708,12 +743,10 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
>          break;
>      case BLKIF_OP_DISCARD:
>      {
> -        struct blkif_request_discard *discard_req = (void *)&ioreq->req;
> -        ioreq->aio_inflight++;
> -        blk_aio_pdiscard(blkdev->blk,

That is, blk_aio_pdiscard() calls into bdrv_co_pdiscard(), which is
supposed to be fragmenting things as needed. Can you trace what is going
wrong there? You shouldn't have to reimplement fragmentation if the block
layer is doing it correctly.
On 18 November 2016 14:43:18 CET, Eric Blake <eblake@redhat.com> wrote:
> On 11/18/2016 04:24 AM, Olaf Hering wrote:
>> The guest sends discard requests as u64 sector/count pairs, but the
>> block layer operates internally with s64/s32 pairs. The conversion
>> leads to IO errors in the guest, the discard request is not processed.
>
> Doesn't the block layer already split discard requests into 2^31 byte
> chunks?

How would it do that without valid input? It was wrong before the sectors
to bytes conversion, and now it's even worse given that all the world fits
into an int.

Remember that there is no API to let the guest know about the limitations
of the host.

Olaf
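To make the failure mode concrete: the 64-bit byte count derived from the
guest's sector count has to pass through a signed 32-bit 'int' before the
block layer can fragment anything, so a large request is already mangled
before fragmentation ever runs. A minimal standalone sketch of that
truncation (illustrative only, not QEMU code; the 6 GiB figure is an
arbitrary example):

    #include <stdint.h>
    #include <stdio.h>

    #define BDRV_SECTOR_BITS 9

    int main(void)
    {
        /* Example only: a guest discard covering 6 GiB worth of sectors. */
        uint64_t nr_sectors = (6ULL * 1024 * 1024 * 1024) >> BDRV_SECTOR_BITS;

        /* The byte count the backend computes... */
        uint64_t byte_count = nr_sectors << BDRV_SECTOR_BITS;

        /* ...and what survives once it is squeezed into an API taking
         * 'int count' (implementation-defined conversion, typically the
         * low 32 bits, here a negative value). */
        int truncated = (int)byte_count;

        printf("requested: %llu bytes, seen as: %d\n",
               (unsigned long long)byte_count, truncated);
        /* A negative/mangled count gets rejected, so the guest sees -EIO
         * instead of the request being split into valid chunks. */
        return 0;
    }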
On 11/18/2016 08:19 AM, Olaf Hering wrote:
> On 18 November 2016 14:43:18 CET, Eric Blake <eblake@redhat.com> wrote:
>> On 11/18/2016 04:24 AM, Olaf Hering wrote:
>>> The guest sends discard requests as u64 sector/count pairs, but the
>>> block layer operates internally with s64/s32 pairs. The conversion
>>> leads to IO errors in the guest, the discard request is not
>> processed.
>>
>> Doesn't the block layer already split discard requests into 2^31 byte
>> chunks?
>
> How would it do that without valid input? It was wrong before the
> sectors to bytes conversion, and now it's even worse given that all the
> world fits into an int.

Then it sounds like the real bug is that the block layer
bdrv_co_pdiscard() is buggy for taking 'int count' instead of 'uint64_t
count'. Eventually, I think the entire block layer should be fixed to
allow 64-bit count everywhere, and then auto-fragment it back down to 31
bits (or even smaller, like NBD's 32M limit or Linux loopback device 64k
limit) as needed, rather than making all the backends reimplement
fragmentation.

>
> Remember that there is no API to let the guest know about the
> limitations of the host.

Correct. But the goal of the block layer is to hide the quirks, so that
the code handling the guest requests can offload all the common work to
one place.

Kevin, is it too late for 2.8 for patches that change bdrv_co_pdiscard
to take a 64-bit count? Or would that still be under bug-fix category
because of the xen use case?
On 18.11.2016 at 15:35, Eric Blake wrote:
> On 11/18/2016 08:19 AM, Olaf Hering wrote:
> > On 18 November 2016 14:43:18 CET, Eric Blake <eblake@redhat.com> wrote:
> >> On 11/18/2016 04:24 AM, Olaf Hering wrote:
> >>> The guest sends discard requests as u64 sector/count pairs, but the
> >>> block layer operates internally with s64/s32 pairs. The conversion
> >>> leads to IO errors in the guest, the discard request is not
> >> processed.
> >>
> >> Doesn't the block layer already split discard requests into 2^31 byte
> >> chunks?
> >
> > How would it do that without valid input? It was wrong before the
> > sectors to bytes conversion, and now it's even worse given that all
> > the world fits into an int.
>
> Then it sounds like the real bug is that the block layer
> bdrv_co_pdiscard() is buggy for taking 'int count' instead of 'uint64_t
> count'. Eventually, I think the entire block layer should be fixed to
> allow 64-bit count everywhere, and then auto-fragment it back down to 31
> bits (or even smaller, like NBD's 32M limit or Linux loopback device 64k
> limit) as needed, rather than making all the backends reimplement
> fragmentation.
>
> >
> > Remember that there is no API to let the guest know about the
> > limitations of the host.
>
> Correct. But the goal of the block layer is to hide the quirks, so that
> the code handling the guest requests can offload all the common work to
> one place.
>
> Kevin, is it too late for 2.8 for patches that change bdrv_co_pdiscard
> to take a 64-bit count? Or would that still be under bug-fix category
> because of the xen use case?

Given that we're already a few weeks into the freeze, I would very much
prefer an isolated patch for xen_disk for 2.8, and then it can be cleaned
up during the 2.9 cycle.

Kevin
On 11/18/2016 04:24 AM, Olaf Hering wrote:
> The guest sends discard requests as u64 sector/count pairs, but the
> block layer operates internally with s64/s32 pairs. The conversion
> leads to IO errors in the guest, the discard request is not processed.
>
> domU.cfg:
> 'vdev=xvda, format=qcow2, backendtype=qdisk, target=/x.qcow2'
> domU:
> mkfs.ext4 -F /dev/xvda
> Discarding device blocks: failed - Input/output error
>
> Fix this by splitting the request into chunks of BDRV_REQUEST_MAX_SECTORS.
> Add input range checking to avoid overflow.
>
> Signed-off-by: Olaf Hering <olaf@aepfle.de>
> ---
>  hw/block/xen_disk.c | 45 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/hw/block/xen_disk.c b/hw/block/xen_disk.c
> index 3a7dc19..c3f572f 100644
> --- a/hw/block/xen_disk.c
> +++ b/hw/block/xen_disk.c
> @@ -660,6 +660,41 @@ static void qemu_aio_complete(void *opaque, int ret)
>      qemu_bh_schedule(ioreq->blkdev->bh);
>  }
>
> +static bool blk_split_discard(struct ioreq *ioreq, blkif_sector_t sector_number,
> +                              uint64_t nr_sectors)
> +{
> +    struct XenBlkDev *blkdev = ioreq->blkdev;
> +    int64_t byte_offset;
> +    int byte_chunk;
> +    uint64_t sec_start = sector_number;
> +    uint64_t sec_count = nr_sectors;
> +    uint64_t byte_remaining;
> +    uint64_t limit = BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS;

[For reference, this limit is the same as rounding INT32_MAX down to the
nearest 512-byte boundary, or 0x7ffffe00]

> +
> +    /* Wrap around? */
> +    if ((sec_start + sec_count) < sec_count) {
> +        return false;
> +    }
> +    /* Overflowing byte limit? */
> +    if ((sec_start + sec_count) > ((INT64_MAX + INT_MAX) >> BDRV_SECTOR_BITS)) {

This is undefined. INT64_MAX + anything non-negative overflows int64, and
even if you treat overflow as defined by twos-complement representation
(which creates a negative number), shifting a negative number is also
undefined.

If you are trying to detect guests that make a request that would cover
more than INT64_MAX bytes, you can simplify. Besides, for as much storage
as there is out there, I seriously doubt ANYONE will ever have 2^63 bytes
addressable through a single device. Why not just write it as:

  if ((INT64_MAX >> BDRV_SECTOR_BITS) - sec_count < sec_start) {

> +        return false;
> +    }
> +
> +    byte_offset = sec_start << BDRV_SECTOR_BITS;
> +    byte_remaining = sec_count << BDRV_SECTOR_BITS;
> +
> +    do {
> +        byte_chunk = byte_remaining > limit ? limit : byte_remaining;
> +        ioreq->aio_inflight++;
> +        blk_aio_pdiscard(blkdev->blk, byte_offset, byte_chunk,
> +                         qemu_aio_complete, ioreq);
> +        byte_remaining -= byte_chunk;
> +        byte_offset += byte_chunk;
> +    } while (byte_remaining > 0);

This part looks reasonable.

> +
> +    return true;
> +}
> +
>  static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
>  {
>      struct XenBlkDev *blkdev = ioreq->blkdev;
> @@ -708,12 +743,10 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
>          break;
>      case BLKIF_OP_DISCARD:
>      {
> -        struct blkif_request_discard *discard_req = (void *)&ioreq->req;

The old code had it...

> -        ioreq->aio_inflight++;
> -        blk_aio_pdiscard(blkdev->blk,
> -                         discard_req->sector_number << BDRV_SECTOR_BITS,
> -                         discard_req->nr_sectors << BDRV_SECTOR_BITS,
> -                         qemu_aio_complete, ioreq);
> +        struct blkif_request_discard *req = (void *)&ioreq->req;

...but C doesn't require a cast to void*. As long as you are touching
this, you could remove the cast (unless I'm missing something, and the
cast is also there to cast away const).

> +        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
> +            goto err;
> +        }
>          break;
>      }
>      default:
On Fri, Nov 18, Eric Blake wrote:
> On 11/18/2016 04:24 AM, Olaf Hering wrote:
> > +    /* Overflowing byte limit? */
> > +    if ((sec_start + sec_count) > ((INT64_MAX + INT_MAX) >> BDRV_SECTOR_BITS)) {
> This is undefined. INT64_MAX + anything non-negative overflows int64,

The expanded value used to be stored into a uint64_t before it was used
here. A "cleanup" introduced this error. Thanks for spotting.

> If you are trying to detect guests that make a request that would cover
> more than INT64_MAX bytes, you can simplify. Besides, for as much
> storage as there is out there, I seriously doubt ANYONE will ever have
> 2^63 bytes addressable through a single device. Why not just write it as:
>
>   if ((INT64_MAX >> BDRV_SECTOR_BITS) - sec_count < sec_start) {

That would always be false I think. I will resubmit with this:

    if ((sec_start + sec_count) > (INT64_MAX >> BDRV_SECTOR_BITS)) {

Regarding the cast for ->req, it has type blkif_request_t, but the
pointer needs to be assigned to type blkif_request_discard_t.

Olaf
On 11/18/2016 11:41 AM, Olaf Hering wrote:
> On Fri, Nov 18, Eric Blake wrote:
>
>> On 11/18/2016 04:24 AM, Olaf Hering wrote:
>>> +    /* Overflowing byte limit? */
>>> +    if ((sec_start + sec_count) > ((INT64_MAX + INT_MAX) >> BDRV_SECTOR_BITS)) {
>> This is undefined. INT64_MAX + anything non-negative overflows int64,
>
> The expanded value used to be stored into a uint64_t before it was used
> here. A "cleanup" introduced this error. Thanks for spotting.
>
>> If you are trying to detect guests that make a request that would cover
>> more than INT64_MAX bytes, you can simplify. Besides, for as much
>> storage as there is out there, I seriously doubt ANYONE will ever have
>> 2^63 bytes addressable through a single device. Why not just write it as:
>>
>>   if ((INT64_MAX >> BDRV_SECTOR_BITS) - sec_count < sec_start) {
>
> That would always be false I think. I will resubmit with this:
>   if ((sec_start + sec_count) > (INT64_MAX >> BDRV_SECTOR_BITS)) {

You're testing whether something overflows, but you don't want to cause
overflow as part of the test. So use the commutative law to rewrite it so
that sec_start + sec_count is never computed, and you get:

    if (sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count)

but that's exactly the expression I wrote above.

> Regarding the cast for ->req, it has type blkif_request_t, but the
> pointer needs to be assigned to type blkif_request_discard_t.

Then why is the cast to (void*) instead of (blkif_request_discard_t*)?
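A small standalone illustration of that rearranged check may help
(hypothetical helper name, BDRV_SECTOR_BITS hard-coded to 9; not QEMU
code). Note that, as the rest of the thread goes on to show, the
subtraction is only well-behaved when sec_count itself does not already
exceed the limit:

    #include <stdint.h>
    #include <assert.h>

    #define BDRV_SECTOR_BITS 9
    #define MAX_SECTORS ((uint64_t)INT64_MAX >> BDRV_SECTOR_BITS)

    /* True if start + count would exceed MAX_SECTORS, evaluated without
     * ever computing start + count. */
    static int exceeds_limit(uint64_t start, uint64_t count)
    {
        return start > MAX_SECTORS - count;
    }

    int main(void)
    {
        assert(!exceeds_limit(0, MAX_SECTORS));          /* exactly at the limit */
        assert(exceeds_limit(1, MAX_SECTORS));           /* one sector past it */
        assert(exceeds_limit(MAX_SECTORS, MAX_SECTORS)); /* far past it */

        /* The case Olaf raises below: count alone already exceeds the
         * limit, the unsigned subtraction wraps, and the check wrongly
         * lets the request through. */
        assert(!exceeds_limit(0, UINT64_MAX));
        return 0;
    }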
On Fri, Nov 18, Eric Blake wrote:
> if (sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count)
I have looked at this for a while now and can't spot how this would cover
all cases. Are you saying there should be just a single overflow check,
yours? My change has two: one to check for wrap around and one to check
against the upper limit. My check happens to work with 0/UINT64_MAX or
INT64_MAX/INT64_MAX as input, yours apparently not.
Obviously I'm missing something essential.
Olaf
On 11/22/2016 10:12 AM, Olaf Hering wrote:
> On Fri, Nov 18, Eric Blake wrote:
>
>>   if (sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count)
>
> I have looked at this for a while now and can't spot how this would cover
> all cases. Are you saying there should be just a single overflow check,
> yours? My change has two: one to check for wrap around and one to check
> against the upper limit. My check happens to work with 0/UINT64_MAX or
> INT64_MAX/INT64_MAX as input, yours apparently not.
> Obviously I'm missing something essential.

I never suggested eliminating the wraparound check, only simplifying the
overflow check. You could combine the wraparound and overflow into one:

    if (sec_start + sec_count < sec_count ||
        sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count) {
        return false;
    }

Remember, sec_start and sec_count were both typed as unsigned 64-bit
values, so everything in the above computation is well-defined arithmetic,
and you catch all cases of trying to add two numbers into something that
doesn't fit in 64 bits, as well as all cases of the addition fitting in 64
bits but going beyond the maximum possible sector number (since it is not
possible to have a sector number whose corresponding offset would exceed
63 bits).
On Tue, Nov 22, Eric Blake wrote:
>     if (sec_start + sec_count < sec_count ||
>         sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count) {
>         return false;
>     }

My point was: how does this handle sec_start=0,sec_count=UINT64_MAX or
sec_start=INT64_MAX,sec_count=INT64_MAX? For me this gets past the if().

Olaf
On 11/22/2016 11:00 AM, Olaf Hering wrote:
> On Tue, Nov 22, Eric Blake wrote:
>
>>     if (sec_start + sec_count < sec_count ||
>>         sec_start > (INT64_MAX >> BDRV_SECTOR_BITS) - sec_count) {
>>         return false;
>>     }
>
> My point was: how does this handle sec_start=0,sec_count=UINT64_MAX or
> sec_start=INT64_MAX,sec_count=INT64_MAX? For me this gets past the if().

Fair enough. Your test that things didn't overflow means that you can then
safely compare the sum to the limit, so go with:

    if (sec_start + sec_count < sec_count ||
        sec_start + sec_count > INT64_MAX >> BDRV_SECTOR_BITS) {
        return false;
    }
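For clarity, here is that final check wired into a tiny standalone test
(hypothetical helper name, illustrative only), exercising the two inputs
Olaf used as counterexamples plus an ordinary request:

    #include <stdint.h>
    #include <assert.h>

    #define BDRV_SECTOR_BITS 9

    /* Reject the request if the sector sum wraps 64 bits, or if the sum
     * maps to a byte range beyond what a signed 64-bit count can hold. */
    static int discard_range_valid(uint64_t sec_start, uint64_t sec_count)
    {
        if (sec_start + sec_count < sec_count ||
            sec_start + sec_count > (uint64_t)INT64_MAX >> BDRV_SECTOR_BITS) {
            return 0;
        }
        return 1;
    }

    int main(void)
    {
        /* Both problem inputs from the discussion are rejected now: */
        assert(!discard_range_valid(0, UINT64_MAX));
        assert(!discard_range_valid(INT64_MAX, INT64_MAX));
        /* An ordinary request still passes: */
        assert(discard_range_valid(0, 1));
        return 0;
    }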
Ping.

On Fri, Nov 18, Olaf Hering wrote:
> On Fri, Nov 18, Olaf Hering wrote:
>
> > @@ -708,12 +743,10 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
> > +        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
> > +            goto err;
>
> How is error handling supposed to work here?
> Initially I forgot the "!", which just hung the mkfs command in domU.
> I have not yet tried to force errors in other parts of that function.

Olaf
On Wed, Nov 23, Olaf Hering wrote:
> > > +        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
> > > +            goto err;
> > How is error handling supposed to work here?

In the guest the cmd is stuck, instead of getting an IO error:

[ 91.966404] mkfs.ext4 D 0000000000000000 0 2878 2831 0x00000000
[ 91.966406] ffff88002204bc48 ffff880030530480 ffff88002fae5800 ffff88002204c000
[ 91.966407] 0000000000000000 7fffffffffffffff 0000000000008000 00000000024000c0
[ 91.966409] ffff88002204bc60 ffffffff815dd985 ffff880038815c00 ffff88002204bd08
[ 91.966409] Call Trace:
[ 91.966413] [<ffffffff815dd985>] schedule+0x35/0x80
[ 91.966416] [<ffffffff815e02c7>] schedule_timeout+0x237/0x2d0
[ 91.966419] [<ffffffff815dcf46>] io_schedule_timeout+0xa6/0x110
[ 91.966421] [<ffffffff815de2f3>] wait_for_completion_io+0xa3/0x110
[ 91.966425] [<ffffffff812d7b00>] submit_bio_wait+0x50/0x60
[ 91.966430] [<ffffffff812e9168>] blkdev_issue_discard+0x78/0xb0
[ 91.966433] [<ffffffff812eee2b>] blk_ioctl_discard+0x7b/0xa0
[ 91.966436] [<ffffffff812efa20>] blkdev_ioctl+0x730/0x920
[ 91.966440] [<ffffffff812318fd>] block_ioctl+0x3d/0x40
[ 91.966444] [<ffffffff8120cd6d>] do_vfs_ioctl+0x2cd/0x4a0
[ 91.966453] [<ffffffff8120cfb4>] SyS_ioctl+0x74/0x80
[ 91.966456] [<ffffffff815e142e>] entry_SYSCALL_64_fastpath+0x12/0x6d

Olaf
On Wed, 23 Nov 2016, Olaf Hering wrote:
> On Wed, Nov 23, Olaf Hering wrote:
> > > > +        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
> > > > +            goto err;
> > > How is error handling supposed to work here?
>
> In the guest the cmd is stuck, instead of getting an IO error:
>
> [ 91.966404] mkfs.ext4 D 0000000000000000 0 2878 2831 0x00000000
> [ 91.966406] ffff88002204bc48 ffff880030530480 ffff88002fae5800 ffff88002204c000
> [ 91.966407] 0000000000000000 7fffffffffffffff 0000000000008000 00000000024000c0
> [ 91.966409] ffff88002204bc60 ffffffff815dd985 ffff880038815c00 ffff88002204bd08
> [ 91.966409] Call Trace:
> [ 91.966413] [<ffffffff815dd985>] schedule+0x35/0x80
> [ 91.966416] [<ffffffff815e02c7>] schedule_timeout+0x237/0x2d0
> [ 91.966419] [<ffffffff815dcf46>] io_schedule_timeout+0xa6/0x110
> [ 91.966421] [<ffffffff815de2f3>] wait_for_completion_io+0xa3/0x110
> [ 91.966425] [<ffffffff812d7b00>] submit_bio_wait+0x50/0x60
> [ 91.966430] [<ffffffff812e9168>] blkdev_issue_discard+0x78/0xb0
> [ 91.966433] [<ffffffff812eee2b>] blk_ioctl_discard+0x7b/0xa0
> [ 91.966436] [<ffffffff812efa20>] blkdev_ioctl+0x730/0x920
> [ 91.966440] [<ffffffff812318fd>] block_ioctl+0x3d/0x40
> [ 91.966444] [<ffffffff8120cd6d>] do_vfs_ioctl+0x2cd/0x4a0
> [ 91.966453] [<ffffffff8120cfb4>] SyS_ioctl+0x74/0x80
> [ 91.966456] [<ffffffff815e142e>] entry_SYSCALL_64_fastpath+0x12/0x6d

The error should be sent back to the frontend via the status field. Not
sure why blkfront is not handling it correctly.
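To make that suggestion concrete, here is a rough standalone model of what
completing a ring request with an error status means (this is not the
actual xen_disk.c code; the struct is a simplified stand-in for
blkif_response_t, and the helper name is hypothetical). The point is that
even a rejected discard must still produce a response on the ring,
otherwise the guest's submit_bio_wait() never completes and mkfs hangs:

    #include <stdio.h>
    #include <stdint.h>

    /* Status values from the blkif ring protocol (io/blkif.h). */
    #define BLKIF_RSP_OKAY    0
    #define BLKIF_RSP_ERROR (-1)

    /* Simplified stand-in for blkif_response_t. */
    struct blkif_response_model {
        uint64_t id;        /* copied from the request */
        uint8_t  operation; /* copied from the request */
        int16_t  status;    /* BLKIF_RSP_OKAY or BLKIF_RSP_ERROR */
    };

    /* Sketch: whether the discard succeeded or was rejected, the backend
     * fills in a response; only the status differs.  Skipping this step
     * on the error path is what would leave the frontend waiting. */
    static void complete_request(struct blkif_response_model *rsp, int failed)
    {
        rsp->status = failed ? BLKIF_RSP_ERROR : BLKIF_RSP_OKAY;
        /* ...push the response onto the shared ring and notify the
         * event channel here... */
    }

    int main(void)
    {
        struct blkif_response_model rsp = { .id = 42, .operation = 0 };
        complete_request(&rsp, 1 /* discard was rejected */);
        printf("status = %d\n", rsp.status); /* -1: frontend should see -EIO */
        return 0;
    }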
diff --git a/hw/block/xen_disk.c b/hw/block/xen_disk.c
index 3a7dc19..c3f572f 100644
--- a/hw/block/xen_disk.c
+++ b/hw/block/xen_disk.c
@@ -660,6 +660,41 @@ static void qemu_aio_complete(void *opaque, int ret)
     qemu_bh_schedule(ioreq->blkdev->bh);
 }
 
+static bool blk_split_discard(struct ioreq *ioreq, blkif_sector_t sector_number,
+                              uint64_t nr_sectors)
+{
+    struct XenBlkDev *blkdev = ioreq->blkdev;
+    int64_t byte_offset;
+    int byte_chunk;
+    uint64_t sec_start = sector_number;
+    uint64_t sec_count = nr_sectors;
+    uint64_t byte_remaining;
+    uint64_t limit = BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS;
+
+    /* Wrap around? */
+    if ((sec_start + sec_count) < sec_count) {
+        return false;
+    }
+    /* Overflowing byte limit? */
+    if ((sec_start + sec_count) > ((INT64_MAX + INT_MAX) >> BDRV_SECTOR_BITS)) {
+        return false;
+    }
+
+    byte_offset = sec_start << BDRV_SECTOR_BITS;
+    byte_remaining = sec_count << BDRV_SECTOR_BITS;
+
+    do {
+        byte_chunk = byte_remaining > limit ? limit : byte_remaining;
+        ioreq->aio_inflight++;
+        blk_aio_pdiscard(blkdev->blk, byte_offset, byte_chunk,
+                         qemu_aio_complete, ioreq);
+        byte_remaining -= byte_chunk;
+        byte_offset += byte_chunk;
+    } while (byte_remaining > 0);
+
+    return true;
+}
+
 static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
 {
     struct XenBlkDev *blkdev = ioreq->blkdev;
@@ -708,12 +743,10 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
         break;
     case BLKIF_OP_DISCARD:
     {
-        struct blkif_request_discard *discard_req = (void *)&ioreq->req;
-        ioreq->aio_inflight++;
-        blk_aio_pdiscard(blkdev->blk,
-                         discard_req->sector_number << BDRV_SECTOR_BITS,
-                         discard_req->nr_sectors << BDRV_SECTOR_BITS,
-                         qemu_aio_complete, ioreq);
+        struct blkif_request_discard *req = (void *)&ioreq->req;
+        if (!blk_split_discard(ioreq, req->sector_number, req->nr_sectors)) {
+            goto err;
+        }
         break;
     }
     default:
The guest sends discard requests as u64 sector/count pairs, but the
block layer operates internally with s64/s32 pairs. The conversion
leads to IO errors in the guest, the discard request is not processed.

domU.cfg:
 'vdev=xvda, format=qcow2, backendtype=qdisk, target=/x.qcow2'
domU:
 mkfs.ext4 -F /dev/xvda
 Discarding device blocks: failed - Input/output error

Fix this by splitting the request into chunks of BDRV_REQUEST_MAX_SECTORS.
Add input range checking to avoid overflow.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
---
 hw/block/xen_disk.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
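As a closing sanity check on the split loop above, here is a small
standalone calculation (illustrative only, assuming 512-byte sectors and
the 0x7ffffe00 byte limit mentioned in the review; the 5 GiB request size
is an arbitrary example) showing how one large discard becomes several
blk_aio_pdiscard() calls:

    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BDRV_SECTOR_BITS 9
    /* Local stand-in for QEMU's BDRV_REQUEST_MAX_SECTORS (INT_MAX rounded
     * down to whole sectors); the real definition lives in block.h. */
    #define REQUEST_MAX_SECTORS ((uint64_t)INT_MAX >> BDRV_SECTOR_BITS)

    int main(void)
    {
        uint64_t limit = REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS; /* 0x7ffffe00 */
        uint64_t remaining = 5ULL * 1024 * 1024 * 1024; /* 5 GiB discard */
        uint64_t offset = 0;
        int calls = 0;

        while (remaining > 0) {
            uint64_t chunk = remaining > limit ? limit : remaining;
            printf("blk_aio_pdiscard(offset=0x%llx, bytes=0x%llx)\n",
                   (unsigned long long)offset, (unsigned long long)chunk);
            offset += chunk;
            remaining -= chunk;
            calls++;
        }
        printf("%d calls\n", calls); /* 5 GiB at 0x7ffffe00 per call -> 3 calls */
        return 0;
    }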