Message ID | 1502889581-19483-1-git-send-email-lczerner@redhat.com (mailing list archive) |
---|---|
State | New, archived |
The patch is incorrect - just because we support zeroing using write zero doesn't mean that any discard will zero the data. This subtle difference was one of the main points for the changes.
On Wed, Aug 16, 2017 at 05:18:03PM +0200, Christoph Hellwig wrote:
> The patch is incorrect - just because we support zeroing using
> write zero doesn't mean that any discard will zero the data.
>
> This subtle difference was one of the main points for the changes.

Ok, so the change breaks user space code. Do we have any documentation
of what's the right way to do things and what does what? I can see that
you've made blkdev_fallocate() use discard and zeroout with various
flags, but there is no documentation on what guarantees we have when we
use a certain combination of flags.

The comment for __blkdev_issue_zeroout() says:

	If a device is using logical block provisioning, the underlying
	space will not be released if %flags contains BLKDEV_ZERO_NOUNMAP.

So does it mean that this only applies to SCSI UNMAP, not TRIM with
reliable RZAT? I'd like to be able to recognize when we have a device
that does support write zero with unmap, TRIM with RZAT, or whatever
else provides this.

-Lukas
Lukas,

> I'd like to be able to recognize where we have a device that does
> support write zero with unmap, TRIM with RZAT and whatever else that
> provides this.

The problem was that the original REQ_DISCARD was used both for
de-provisioning block ranges and for zeroing them. On top of that, the
storage standards tweaked the definitions a bit so the semantics became
even more confusing and harder to honor in the drivers.

As a result, we changed things so that discards are only used to
de-provision blocks. And the zeroout call/ioctl is used to zero block
ranges.

Which ATA/SCSI/NVMe command is issued on the back-end depends on what's
supported by the device and is hidden from the caller.

However, zeroout is guaranteed to return a zeroed block range on
subsequent reads. The blocks may be unmapped, anchored, written
explicitly, written with write same, or a combination thereof. But you
are guaranteed predictable results.

Whereas a discarded region may be sliced and diced and rounded off
before it hits the device. Which is then free to ignore all or parts of
the request.

Consequently, discard_zeroes_data is meaningless. Because there is no
guarantee that all of the discarded blocks will be acted upon. It
kinda-sorta sometimes worked (if the device was whitelisted, had a
reported alignment of 0, a granularity of 512 bytes, stacking didn't get
in the way, and you were lucky on the device end). But there were always
conditions.

So taking a step back: What information specifically were you trying to
obtain from querying that flag? And why do you need it?
On Wed, Aug 16, 2017 at 09:49:22PM -0400, Martin K. Petersen wrote:
> On top of that, the storage standards tweaked the definitions a bit so
> the semantics became even more confusing and harder to honor in the
> drivers.
>
> As a result, we changed things so that discards are only used to
> de-provision blocks. And the zeroout call/ioctl is used to zero block
> ranges.
>
> Which ATA/SCSI/NVMe command is issued on the back-end depends on what's
> supported by the device and is hidden from the caller.
>
> However, zeroout is guaranteed to return a zeroed block range on
> subsequent reads. The blocks may be unmapped, anchored, written
> explicitly, written with write same, or a combination thereof. But you
> are guaranteed predictable results.
>
> Whereas a discarded region may be sliced and diced and rounded off
> before it hits the device. Which is then free to ignore all or parts of
> the request.
>
> Consequently, discard_zeroes_data is meaningless. Because there is no
> guarantee that all of the discarded blocks will be acted upon. It
> kinda-sorta sometimes worked (if the device was whitelisted, had a
> reported alignment of 0, a granularity of 512 bytes, stacking didn't get
> in the way, and you were lucky on the device end). But there were always
> conditions.

Thanks for the detailed explanation. That's very useful to know!

> So taking a step back: What information specifically were you trying to
> obtain from querying that flag? And why do you need it?

There are many users that historically benefited from the
"discard_zeroes_data" semantics. For example mkfs, where it's beneficial
to discard the blocks before creating a file system, and if we also get
deterministic zeroes on read, even better, since we do not have to
initialize some portions of the file system manually.

The other example might be virtualization, where "Wipe After Delete" and
"Enable Discard" can be supported efficiently when "discard_zeroes_data"
is set. I am sure there are other examples.
So I understand now that Deterministic Read Zero after TRIM is not
reliable, so we do not want to use that flag because we can't guarantee
it in this case. However, there are other situations where we can, such
as a loop device (might be especially useful for VMs) where the backing
file system supports punch hole, or even SCSI WRITE SAME with UNMAP?

Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_KEEP_SIZE; however, if that succeeds we're only guaranteed
that the range has been zeroed, not unmapped/discarded? (That's not very
clear from the comments.) None of the modes seems to guarantee both
zeroout and unmap in case of success. Still, there seems to be no way to
tell what's actually supported from user space without ending up calling
fallocate, is there? Whereas before we had discard_zeroes_data, which
people learned to rely on in certain situations, even though it might
have been shaky.

I actually like the rewrite Christoph did, even though documentation
seems to be lacking. But I just wonder if it's possible to bring back
the former functionality, at least in some form.

Thanks!
-Lukas

> --
> Martin K. Petersen	Oracle Linux Engineering
On Thu, Aug 17, 2017 at 09:47:44AM +0200, Lukas Czerner wrote:
> There are many users that historically benefited from the
> "discard_zeroes_data" semantics. For example mkfs, where it's beneficial
> to discard the blocks before creating a file system, and if we also get
> deterministic zeroes on read, even better, since we do not have to
> initialize some portions of the file system manually.

But that's not what discard_zeroes_data gives you, unfortunately.

> So I understand now that Deterministic Read Zero after TRIM is not
> reliable, so we do not want to use that flag because we can't guarantee
> it in this case. However, there are other situations where we can, such
> as a loop device (might be especially useful for VMs) where the backing
> file system supports punch hole, or even SCSI WRITE SAME with UNMAP?
>
> Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE; however, if that succeeds we're only guaranteed
> that the range has been zeroed, not unmapped/discarded? (That's not very
> clear from the comments.) None of the modes seems to guarantee both
> zeroout and unmap in case of success. Still, there seems to be no way to
> tell what's actually supported from user space without ending up calling
> fallocate, is there? Whereas before we had discard_zeroes_data, which
> people learned to rely on in certain situations, even though it might
> have been shaky.

You never get (and never got) a guarantee that the blocks were unmapped,
as none of the storage protocols ever requires the device to deallocate.
Because devices have their internal chunk/block sizes, everything else
would be impractical.

But fallocate FALLOC_FL_PUNCH_HOLE on a block device sets the REQ_UNMAP
hint, which asks the driver to unmap if at all possible. Note that unmap
or not is not a binary decision - typical devices will deallocate all
whole blocks inside the range, and zero the rest.
On Thu, Aug 17, 2017 at 10:17:11AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 09:47:44AM +0200, Lukas Czerner wrote:
> > There are many users that historically benefited from the
> > "discard_zeroes_data" semantics. For example mkfs, where it's beneficial
> > to discard the blocks before creating a file system, and if we also get
> > deterministic zeroes on read, even better, since we do not have to
> > initialize some portions of the file system manually.
>
> But that's not what discard_zeroes_data gives you, unfortunately.
>
> > So I understand now that Deterministic Read Zero after TRIM is not
> > reliable, so we do not want to use that flag because we can't guarantee
> > it in this case. However, there are other situations where we can, such
> > as a loop device (might be especially useful for VMs) where the backing
> > file system supports punch hole, or even SCSI WRITE SAME with UNMAP?
> >
> > Currently user space can call fallocate with FALLOC_FL_PUNCH_HOLE |
> > FALLOC_FL_KEEP_SIZE; however, if that succeeds we're only guaranteed
> > that the range has been zeroed, not unmapped/discarded? (That's not very
> > clear from the comments.) None of the modes seems to guarantee both
> > zeroout and unmap in case of success. Still, there seems to be no way to
> > tell what's actually supported from user space without ending up calling
> > fallocate, is there? Whereas before we had discard_zeroes_data, which
> > people learned to rely on in certain situations, even though it might
> > have been shaky.
>
> You never get (and never got) a guarantee that the blocks were unmapped,
> as none of the storage protocols ever requires the device to deallocate.
> Because devices have their internal chunk/block sizes, everything else
> would be impractical.

Ok, "guarantee" in this case is probably too strong an expression. But
you know what I mean - saying "hey, I am not using those blocks, feel
free to release them if you can".
I did not encounter the need to be so strict about discard; after all,
it's for the device's benefit. However, there are devices that do both
(discard and zeroout) in a discard request, and that's what I am after:
to be able to identify that this is the case and take advantage of it.
I think it's possible now with fallocate FALLOC_FL_PUNCH_HOLE, but I do
not know anything in advance, and even if the device did attempt to
unmap as well, I can't tell.

-Lukas
On Thu, Aug 17, 2017 at 10:41:43AM +0200, Lukas Czerner wrote:
> Ok, "guarantee" in this case is probably too strong an expression. But
> you know what I mean, saying "hey, I am not using those blocks, feel
> free to release them if you can". I did not encounter the need to be
> so strict about discard, after all, it's for the device's benefit.

That's what the BLKDISCARD ioctl, or fallocate with the
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE
flags, will do for you.

> However there are devices that do both (discard and zeroout) in a
> discard request and that's what I am after. To be able to identify
> that this is the case and take advantage. I think it's possible now
> with fallocate FALLOC_FL_PUNCH_HOLE, but I do not know anything in
> advance and I can't tell whether the device attempted to unmap as
> well even if it did.

With FALLOC_FL_PUNCH_HOLE you know that it is device offloaded.
We don't know that the device won't zero part or all of it, but
then again that was possible before when using BLKDISCARD as well.

Now after all that theory: is there a practical issue behind all
this?
On Thu, Aug 17, 2017 at 11:52:42AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 10:41:43AM +0200, Lukas Czerner wrote:
> > Ok, "guarantee" in this case is probably too strong an expression. But
> > you know what I mean, saying "hey, I am not using those blocks, feel
> > free to release them if you can". I did not encounter the need to be
> > so strict about discard, after all, it's for the device's benefit.
>
> That's what the BLKDISCARD ioctl, or fallocate with the
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE
> flags, will do for you.

Sure, that's not the problem here.

> > However there are devices that do both (discard and zeroout) in a
> > discard request and that's what I am after. To be able to identify
> > that this is the case and take advantage. I think it's possible now
> > with fallocate FALLOC_FL_PUNCH_HOLE, but I do not know anything in
> > advance and I can't tell whether the device attempted to unmap as
> > well even if it did.
>
> With FALLOC_FL_PUNCH_HOLE you know that it is device offloaded.
> We don't know that the device won't zero part or all of it, but
> then again that was possible before when using BLKDISCARD as well.

Do we also know that the blocks were discarded, as we do with
BLKDISCARD?

> Now after all that theory: is there a practical issue behind all
> this?

As I mentioned before, we relied on discard_zeroes_data in mkfs.ext4 to
make sure that inode tables are zeroed after discard. Also, RHV allows
users to set "Wipe After Delete" and "Enable Discard" on the VM disk if
it supported discard_zeroes_data. I am trying to find a way to replace
this functionality now that discard_zeroes_data always returns 0.

-Lukas
Lukas,

> Do we also know that the blocks were discarded as we do with
> BLKDISCARD ?

There never was a way to know for sure.

ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
bend their semantics towards getting predictable behavior but ultimately
failed. Too many corner cases.

> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
> to make sure that inode tables are zeroed after discard.

The point is that you shouldn't have an if (discard_zeroes_data)
conditional in the first place.

 - If you need to deallocate a block range and you don't care about its
   contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.

 - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.

So the mkfs usage model should essentially be:

 - DISCARD the entire partition/device block range

 - ZEROOUT the inode tables and other areas where you need zeroes on
   future reads

And that should be the approach regardless of whether your device is a
disk drive, an SSD, an NVMe device, a SCSI array, or whatever. DISCARD
old contents, ZEROOUT the pieces you care about. No conditionals or
trying to do things differently based on device capabilities.
On Thu, Aug 17, 2017 at 01:47:13PM -0400, Martin K. Petersen wrote:
> > As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
> > to make sure that inode tables are zeroed after discard.
>
> The point is that you shouldn't have an if (discard_zeroes_data)
> conditional in the first place.
>
> - If you need to deallocate a block range and you don't care about its
>   contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
>
> - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.
>
> So the mkfs usage model should essentially be:
>
> - DISCARD the entire partition/device block range
>
> - ZEROOUT the inode tables and other areas where you need zeroes on
>   future reads

Yeah, that's what needs to be done generally, although as I said, there
are devices where after DISCARD we will read zeroes. The comment in
__blkdev_issue_zeroout() even says it will unmap. In that case the
discard or zeroout will be redundant, and it can take quite some time on
bigger devices.

-Lukas

> And that should be the approach regardless of whether your device is a
> disk drive, an SSD, an NVMe device, a SCSI array, or whatever. DISCARD
> old contents, ZEROOUT the pieces you care about. No conditionals or
> trying to do things differently based on device capabilities.
>
> --
> Martin K. Petersen	Oracle Linux Engineering
On Thu, Aug 17, 2017 at 01:47:13PM -0400, Martin K. Petersen wrote:
>
> Lukas,
>
> > Do we also know that the blocks were discarded as we do with
> > BLKDISCARD ?
>
> There never was a way to know for sure.
>
> ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> bend their semantics towards getting predictable behavior but ultimately
> failed. Too many corner cases.

I believe that eMMC and UFS do give you a way to get stronger
guarantees. It's unfortunate that SATA did not get this right; it's even
more unfortunate that for many drives it works 99% of the time. But yes,
it's not a guarantee.

- Ted
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index dea212d..6ea0d03 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,8 +213,15 @@ What:		/sys/block/<disk>/queue/discard_zeroes_data
 Date:		May 2011
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
 Description:
-		Will always return 0. Don't rely on any specific behavior
-		for discards, and don't read this file.
+		Devices that support discard functionality may return
+		stale or random data when a previously discarded block
+		is read back. This can cause problems if the filesystem
+		expects discarded blocks to be explicitly cleared. If a
+		device reports that it deterministically returns zeroes
+		when a discarded area is read, the discard_zeroes_data
+		parameter will be set to one. Otherwise it will be 0 and
+		the result of reading a discarded area is undefined.
+
 
 What:		/sys/block/<disk>/queue/write_same_max_bytes
 Date:		January 2012
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2c1e670..b7f6bdc 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,6 +43,11 @@
 large discards are issued, setting this value lower will make Linux
 issue smaller discards and potentially help reduce latencies induced
 by large discard operations.
 
+discard_zeroes_data (RO)
+------------------------
+When read, this file will show if the discarded blocks are zeroed by the
+device or not. If its value is '1' the blocks are zeroed, otherwise not.
+
 hw_sector_size (RO)
 -------------------
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 27aceab..5b41ad0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -209,7 +209,10 @@ static ssize_t queue_discard_max_store(struct request_queue *q,
 
 static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
 {
-	return queue_var_show(0, page);
+	if (blk_queue_discard(q) && q->limits.max_write_zeroes_sectors)
+		return queue_var_show(1, page);
+	else
+		return queue_var_show(0, page);
 }
 
 static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
diff --git a/block/ioctl.c b/block/ioctl.c
index 0de02ee..faecd44 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -508,6 +508,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	void __user *argp = (void __user *)arg;
 	loff_t size;
 	unsigned int max_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	switch (cmd) {
 	case BLKFLSBUF:
@@ -547,7 +548,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKALIGNOFF:
 		return put_int(arg, bdev_alignment_offset(bdev));
 	case BLKDISCARDZEROES:
-		return put_uint(arg, 0);
+		if (blk_queue_discard(q) && q->limits.max_write_zeroes_sectors)
+			return put_uint(arg, 1);
+		else
+			return put_uint(arg, 0);
 	case BLKSECTGET:
 		max_sectors = min_t(unsigned int, USHRT_MAX,
 				    queue_max_sectors(bdev_get_queue(bdev)));
Discard and zeroout code has been significantly rewritten recently, and
as a part of the rewrite we got rid of the discard_zeroes_data flag.
With commit 48920ff2a5a9 ("block: remove the discard_zeroes_data flag"),
the discard_zeroes_data sysfs file and the discard_zeroes_data ioctl now
always return zero, regardless of what the device actually supports.
This has broken userspace utilities in a way that they will not take
advantage of this functionality even if the device actually supports it.

Now the only way for a user to figure out whether the device does
support deterministic read zeroes after discard, without actually
running fallocate, is to check for discard support (discard_max_bytes)
and zeroout hw offload (write_zeroes_max_bytes). However, we still have
the discard_zeroes_data sysfs file and the BLKDISCARDZEROES ioctl, so I
do not see any reason not to do this check in the kernel and provide a
convenient and compatible way to continue to export this information to
user space.

With this patch, both the BLKDISCARDZEROES ioctl and discard_zeroes_data
will return 1 in the case that discard and hw offload for write zeroes
are supported. Otherwise they will return 0.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 Documentation/ABI/testing/sysfs-block | 11 +++++++++--
 Documentation/block/queue-sysfs.txt   |  5 +++++
 block/blk-sysfs.c                     |  5 ++++-
 block/ioctl.c                         |  6 +++++-
 4 files changed, 23 insertions(+), 4 deletions(-)