Message ID | alpine.LRH.2.02.1610261625580.12287@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 10/26/2016 01:26 PM, Mikulas Patocka wrote: > The brd driver refuses misaligned discard requests with an error. However, > this is suboptimal, misaligned requests could be handled by discarding a > part of the request that is aligned on a page boundary. This patch changes > the code so that it handles misaligned requests. Hello Mikulas, We do not want this kind of discard request processing in every block driver. This is why I think that this kind of processing should be added to the block layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() submit aligned discard requests" (http://www.spinics.net/lists/linux-block/msg02360.html). Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 26 Oct 2016, Bart Van Assche wrote: > On 10/26/2016 01:26 PM, Mikulas Patocka wrote: > > The brd driver refuses misaligned discard requests with an error. However, > > this is suboptimal, misaligned requests could be handled by discarding a > > part of the request that is aligned on a page boundary. This patch changes > > the code so that it handles misaligned requests. > > Hello Mikulas, > > We do not want this kind of discard request processing in every block driver. > This is why I think that this kind of processing should be added to the block > layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() submit aligned > discard requests" (http://www.spinics.net/lists/linux-block/msg02360.html). > > Bart. I don't like the idea of complicating the code by turning discards into writes. You can just turn off the flag "discard_zeroes_data" and drop all the splitting code. The flag "discard_zeroes_data" is actually misdesigned, because the storage stack can change dynamically while bios are in progress. You can send a discard bio to a device mapper device that has "discard_zeroes_data" - and while the bio is in progress, the device mapper stack can be reconfigured to redirect the bio to another device that doesn't have "discard_zeroes_data" - and the bio won't zero data and the caller that issued it has no way to find it out. I think the proper thing would be to move "discard_zeroes_data" flag into the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio and the caller is supposed to do zeroing manually. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote: > I think the proper thing would be to move "discard_zeroes_data" flag into > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio > and the caller is supposed to do zeroing manually. Yes, Martin and I have come to a similar conclusion recently. An additional aspect is that NVMe has a Write Zeroes command which is more limited than what REQ_OP_WRITE_SAME does. So I think the right way is to add a REQ_OP_WRITE_ZEROES (or REQ_OP_ZERO) and have modifies if it may discard or not. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/26/2016 02:46 PM, Mikulas Patocka wrote: > I don't like the idea of complicating the code by turning discards into > writes. That's not what my patch series does. The only writes added by my patch series are those for the non-aligned head and tail of the range passed to blkdev_issue_zeroout(). > The flag "discard_zeroes_data" is actually misdesigned, because the > storage stack can change dynamically while bios are in progress. You can > send a discard bio to a device mapper device that has > "discard_zeroes_data" - and while the bio is in progress, the device > mapper stack can be reconfigured to redirect the bio to another device > that doesn't have "discard_zeroes_data" - and the bio won't zero data and > the caller that issued it has no way to find it out. > > I think the proper thing would be to move "discard_zeroes_data" flag into > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio > and the caller is supposed to do zeroing manually. Sorry but I don't like this proposal. I think that a much better solution would be to pause I/O before making any changes to the discard_zeroes_data flag. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 26 Oct 2016, Bart Van Assche wrote: > On 10/26/2016 02:46 PM, Mikulas Patocka wrote: > > I don't like the idea of complicating the code by turning discards into > > writes. > > That's not what my patch series does. The only writes added by my patch series > are those for the non-aligned head and tail of the range passed to > blkdev_issue_zeroout(). The purpose of discard is to improve SSD performance and reduce wear. Generating more write requests for discard does quite the opposite - it reduces performance (discard + two small writes is slower than just discard) and it also causes more wear. > > The flag "discard_zeroes_data" is actually misdesigned, because the > > storage stack can change dynamically while bios are in progress. You can > > send a discard bio to a device mapper device that has > > "discard_zeroes_data" - and while the bio is in progress, the device > > mapper stack can be reconfigured to redirect the bio to another device > > that doesn't have "discard_zeroes_data" - and the bio won't zero data and > > the caller that issued it has no way to find it out. > > > > I think the proper thing would be to move "discard_zeroes_data" flag into > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio > > and the caller is supposed to do zeroing manually. > > Sorry but I don't like this proposal. I think that a much better solution > would be to pause I/O before making any changes to the discard_zeroes_data > flag. The device mapper pauses all bios when it switches the table - but the bio was submitted with assumption that it goes to a device withe "discard_zeroes_data" set, then the bio is paused, device mapper table is switched, the bio is unpaused, and now it can go to a device without "discard_zeroes_data". > Bart. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 26 Oct 2016, Christoph Hellwig wrote: > On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote: > > I think the proper thing would be to move "discard_zeroes_data" flag into > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio > > and the caller is supposed to do zeroing manually. > > Yes, Martin and I have come to a similar conclusion recently. An > additional aspect is that NVMe has a Write Zeroes command which is more > limited than what REQ_OP_WRITE_SAME does. > > So I think the right way is to add a REQ_OP_WRITE_ZEROES (or > REQ_OP_ZERO) and have modifies if it may discard or not. We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and if it does, turn it into "Write Zeroes" or TRIM command (if the device guarantees zeroing on trim). If it doesn't contain all zeroes and the device doesn't support non-zero WRITE SAME, then reject it. Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure which of these two possibilities is better. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[adding Chaitanya to Cc] On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote: > We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and > if it does, turn it into "Write Zeroes" or TRIM command (if the device > guarantees zeroing on trim). If it doesn't contain all zeroes and the > device doesn't support non-zero WRITE SAME, then reject it. I don't like this because it's very inefficient - we have to allocate a payload first and then compare the whole payload for very operation. > Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure > which of these two possibilities is better. Chaitanya actually did an initial prototype implementation of this for NVMe that he shared with me. I liked it a lot and I think he'll be ready to post it in a few days. Now that we have the REQ_OP* values instead of mapping different command types to flags it's actually surprisingly easy to add new block layer operations. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/28/2016 04:39 AM, Mikulas Patocka wrote: > On Wed, 26 Oct 2016, Bart Van Assche wrote: >> On 10/26/2016 02:46 PM, Mikulas Patocka wrote: >>> I think the proper thing would be to move "discard_zeroes_data" flag into >>> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - >>> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio >>> and the caller is supposed to do zeroing manually. >> >> Sorry but I don't like this proposal. I think that a much better solution >> would be to pause I/O before making any changes to the discard_zeroes_data >> flag. > > The device mapper pauses all bios when it switches the table - but the bio > was submitted with assumption that it goes to a device withe > "discard_zeroes_data" set, then the bio is paused, device mapper table is > switched, the bio is unpaused, and now it can go to a device without > "discard_zeroes_data". Hello Mikulas, Sorry if I wasn't clear enough. What I meant is to wait until all outstanding requests have finished before modifying the discard_zeroes_data flag - the kind of operation that is performed by e.g. blk_mq_freeze_queue(). Modifying the discard_zeroes_data flag after a bio has been submitted and before it has completed could lead to several classes of subtle bugs. Functions like __blkdev_issue_discard() assume that the value of the discard_zeroes_data flag does not change after this function has been called and before the submitted requests completed. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 28 Oct 2016, Bart Van Assche wrote: > On 10/28/2016 04:39 AM, Mikulas Patocka wrote: > > On Wed, 26 Oct 2016, Bart Van Assche wrote: > > > On 10/26/2016 02:46 PM, Mikulas Patocka wrote: > > > > I think the proper thing would be to move "discard_zeroes_data" flag > > > > into > > > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - > > > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the > > > > bio > > > > and the caller is supposed to do zeroing manually. > > > > > > Sorry but I don't like this proposal. I think that a much better solution > > > would be to pause I/O before making any changes to the discard_zeroes_data > > > flag. > > > > The device mapper pauses all bios when it switches the table - but the bio > > was submitted with assumption that it goes to a device withe > > "discard_zeroes_data" set, then the bio is paused, device mapper table is > > switched, the bio is unpaused, and now it can go to a device without > > "discard_zeroes_data". > > Hello Mikulas, > > Sorry if I wasn't clear enough. What I meant is to wait until all outstanding > requests have finished It is possible that the process sends never ending stream of bios (for example when reading linear data and using readahead), so waiting until there are no oustanding bios never finishes. > before modifying the discard_zeroes_data flag - the > kind of operation that is performed by e.g. blk_mq_freeze_queue(). blk_mq_freeze_queue() works on request-based drivers, device mapper works with bios, so that function has no effect on device mapper device. Anyway - blk_mq_freeze_queue() won't stop the process that issues the I/O requests - it will just hold the requests in the queue and not forward them to the device. There is no way to stop the process that issues the bios. We can't stop the process that is looping in __blkdev_issue_discard, issuing discard requests. All that we can do is to hold the bios that the process issued. Device mapper can freeze the filesystem with "freeze_bdev", but... - some filesystems don't support freeze - if the filesystem is not directly mounted on the frozen device, but there is a stack of intermediate block devices between the filesystem and the frozen device, then the filesystem will not be frozen - the user can open the block device directly and he won't be affected by the freeze > Modifying the discard_zeroes_data flag after a bio has been submitted > and before it has completed could lead to several classes of subtle > bugs. Functions like __blkdev_issue_discard() assume that the value of > the discard_zeroes_data flag does not change after this function has > been called and before the submitted requests completed. > > Bart. I agree. That's the topic of the discussion - that the discard_zeroes_data flag is unreliable and the flag should be moved to the bio. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 28 Oct 2016, Christoph Hellwig wrote: > [adding Chaitanya to Cc] > > On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote: > > We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and > > if it does, turn it into "Write Zeroes" or TRIM command (if the device > > guarantees zeroing on trim). If it doesn't contain all zeroes and the > > device doesn't support non-zero WRITE SAME, then reject it. > > I don't like this because it's very inefficient - we have to allocate > a payload first and then compare the whole payload for very operation. > > > Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure > > which of these two possibilities is better. > > Chaitanya actually did an initial prototype implementation of this for > NVMe that he shared with me. I liked it a lot and I think he'll be > ready to post it in a few days. Now that we have the REQ_OP* values > instead of mapping different command types to flags it's actually > surprisingly easy to add new block layer operations. OK - when it is in the kernel, let me know, so that I can write device mapper support for that. We should remove the flag "discard_zeroes_data" afterwards, because it is unreliable and impossible to support correctly in the device mapper. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: linux-2.6/drivers/block/brd.c =================================================================== --- linux-2.6.orig/drivers/block/brd.c +++ linux-2.6/drivers/block/brd.c @@ -213,9 +213,14 @@ static int copy_to_brd_setup(struct brd_ } static void discard_from_brd(struct brd_device *brd, - sector_t sector, size_t n) + sector_t sector, unsigned n_sectors) { - while (n >= PAGE_SIZE) { + unsigned boundary = -sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1); + if (unlikely(boundary >= n_sectors)) + return; + sector += boundary; + n_sectors -= boundary; + while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) { /* * Don't want to actually discard pages here because * re-allocating the pages can result in writeback @@ -226,7 +231,7 @@ static void discard_from_brd(struct brd_ else brd_zero_page(brd, sector); sector += PAGE_SIZE >> SECTOR_SHIFT; - n -= PAGE_SIZE; + n_sectors -= PAGE_SIZE >> SECTOR_SHIFT; } } @@ -339,10 +344,7 @@ static blk_qc_t brd_make_request(struct goto io_error; if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { - if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) || - bio->bi_iter.bi_size & ~PAGE_MASK) - goto io_error; - discard_from_brd(brd, sector, bio->bi_iter.bi_size); + discard_from_brd(brd, sector, bio->bi_iter.bi_size >> SECTOR_SHIFT); goto out; }
The brd driver refuses misaligned discard requests with an error. However, this is suboptimal, misaligned requests could be handled by discarding a part of the request that is aligned on a page boundary. This patch changes the code so that it handles misaligned requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/block/brd.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html