Message ID | 20240328203910.2370087-1-stefanha@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | block: add llseek(SEEK_HOLE/SEEK_DATA) support | expand |
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files. As a minor point of clarity (perhaps as much for my own records for documenting research I've done over the years, and not necessarily something you need to change in the commit message): Userspace apps generally use lseek(2) from glibc or similar (perhaps via its alias lseek64(), depending on whether userspace is using large file offsets), rather than directly calling the _llseek() syscall. But it all boils down to the same notion of seeking information about various special offsets. Also, in past history, coreutils cp(1) and dd(1) did experiment with using FS_IOC_FIEMAP ioctls when SEEK_HOLE was not available, but it proved to cause more problems than it solved, so that is not currently in favor. Yes, we could teach more and more block devices to expose specific ioctls for querying sparseness boundaries, and then teach userspace apps a list of ioctls to try; but as cp(1) already learned, having one common interface is much easier than an ever-growing ioctl ladder to be copied across every client that would benefit from knowing where the unallocated portions are. > This can speed up the process by reducing the amount of data read and it > preserves sparseness when writing to the output file. > > This patch series is an initial attempt at implementing > llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this > approach and suggestions for resolving the open issues. One of your open issues was whether adjusting the offset of the block device itself should also adjust the file offset of the underlying file (at least in the case of loopback and dm-linear). What would the community think of adding two additional constants to the entire family of *seek() functions, that have the effect of returning the same offset as their SEEK_HOLE/SEEK_DATA counterparts but without moving the file offset? Explaining the idea by example, although I'm not stuck on these names: suppose you have an fd visiting a file description of 2MiB in size, with the first 1MiB being a hole and the second being data. #define MiB (1024*1024) lseek64(fd, MiB, SEEK_SET); // returns MiB, file offset changed to MiB lseek64(fd, 0, SEEK_HOLE); // returns 0, file offset changed to 0 lseek64(fd, 0, SEEK_DATA); // returns MiB, file offset changed to MiB lseek64(fd, 0, SEEK_PEEK_HOLE); // returns 0, but file offset left at MiB lseek64(fd, 0, SEEK_SET); // returns 0, file offset changed to MiB lseek64(fd, 0, SEEK_PEEK_DATA); // returns MiB, but file offset left at MiB With semantics like that, it might be easier to implement just SEEK_PEEK* in devices (don't worry about changing offsets, just about reporting where the requested offset is), and then have a common layer do the translation from llseek(...,offs,SEEK_HOLE) into a 2-step llseek(...,llseek(...,offs,SEEK_PEEK_HOLE),SEEK_SET) if that makes life easier under the hood. > > In the block device world there are similar concepts to holes: > - SCSI has Logical Block Provisioning where the "mapped" state would be > considered data and other states would be considered holes. BIG caveat here: the SCSI spec does not necessarily guarantee that unmapped regions read as all zeroes; compare the difference between FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE) on a regular file guarantees that future read() in that hole will see NUL bytes, I'm not sure whether we want to make that guarantee for block devices. This may be yet another case where we might want to add new SEEK_* constants to the *seek() family of functions that lets the caller indicate whether they want offsets that are guaranteed to read as zero, vs. merely offsets that are not allocated but may or may not read as zero. Skipping unallocated portions, even when you don't know if the contents reliably read as zero, is still a useful goal in some userspace programs. > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present. However, utilizing it in nbd.ko would require teaching the kernel to handle structured or extended headers (right now, that is an extension only supported in user-space implementations of the NBD protocol). I can see why you did not tackle that in this RFC series, even though you mention it in the cover letter. > - Linux loop block devices and dm-linear targets can pass through queries to > the backing file. > - dm-thin targets can query metadata to find holes. > - ...and you may be able to think of more examples. > > Therefore it is possible to offer this functionality in block drivers. > > In my use case a QEMU process in userspace copies the contents of a dm-thin > target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets > without this patch series. > > Others have also wished for block device support for SEEK_HOLE. Here is an open > issue from the BorgBackup project: > https://github.com/borgbackup/borg/issues/5609 > > With these patches userspace can identify holes in loop, dm-linear, and dm-thin > devices. This is done by adding a seek_hole_data() callback to struct > block_device_operations. When the callback is NULL the entire device is > considered data. Device-mapper is extended along the same lines so that targets > can provide seek_hole_data() callbacks. > > I'm unfamiliar with much of this code and have probably missed locking > requirements. Since llseek() executes synchronously like ioctl() and is not an > asynchronous I/O request it's possible that my changes to the loop block driver > and dm-thin are broken (e.g. what if the loop device fd is changed during > llseek()?). > > To run the tests: > > # make TARGETS=block_seek_hole -C tools/testing/selftests run_tests > > The code is also available here: > https://gitlab.com/stefanha/linux/-/tree/block-seek-hole > > Please take a look and let me know your thoughts. Thanks! > > Stefan Hajnoczi (9): > block: add llseek(SEEK_HOLE/SEEK_DATA) support > loop: add llseek(SEEK_HOLE/SEEK_DATA) support > selftests: block_seek_hole: add loop block driver tests > dm: add llseek(SEEK_HOLE/SEEK_DATA) support > selftests: block_seek_hole: add dm-zero test > dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support > selftests: block_seek_hole: add dm-linear test > dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support > selftests: block_seek_hole: add dm-thin test > > tools/testing/selftests/Makefile | 1 + > .../selftests/block_seek_hole/Makefile | 17 +++ > include/linux/blkdev.h | 7 ++ > include/linux/device-mapper.h | 5 + > block/fops.c | 43 ++++++- > drivers/block/loop.c | 36 +++++- > drivers/md/dm-linear.c | 25 ++++ > drivers/md/dm-thin.c | 77 ++++++++++++ > drivers/md/dm.c | 68 ++++++++++ > .../testing/selftests/block_seek_hole/config | 3 + > .../selftests/block_seek_hole/dm_thin.sh | 80 ++++++++++++ > .../selftests/block_seek_hole/dm_zero.sh | 31 +++++ > .../selftests/block_seek_hole/map_holes.py | 37 ++++++ > .../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++ > 14 files changed, 540 insertions(+), 7 deletions(-) > create mode 100644 tools/testing/selftests/block_seek_hole/Makefile > create mode 100644 tools/testing/selftests/block_seek_hole/config > create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh > create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh > create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py > create mode 100755 tools/testing/selftests/block_seek_hole/test.py > > -- > 2.44.0 >
Replying to myself, On Thu, Mar 28, 2024 at 05:17:18PM -0500, Eric Blake wrote: > On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > > cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files. > > > > > In the block device world there are similar concepts to holes: > > - SCSI has Logical Block Provisioning where the "mapped" state would be > > considered data and other states would be considered holes. > > BIG caveat here: the SCSI spec does not necessarily guarantee that > unmapped regions read as all zeroes; compare the difference between > FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE) > on a regular file guarantees that future read() in that hole will see > NUL bytes, I'm not sure whether we want to make that guarantee for > block devices. This may be yet another case where we might want to > add new SEEK_* constants to the *seek() family of functions that lets > the caller indicate whether they want offsets that are guaranteed to > read as zero, vs. merely offsets that are not allocated but may or may > not read as zero. Skipping unallocated portions, even when you don't > know if the contents reliably read as zero, is still a useful goal in > some userspace programs. > > > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present. The upstream NBD spec[1] took the time to represent two bits of information per extent, _because_ of the knowledge that not all SCSI devices with TRIM support actually guarantee a read of zeroes after trimming. That is, NBD chose to convey both: NBD_STATE_HOLE: 1<<0 if region is unallocated, 0 if region has not been trimmed NBD_STATE_ZERO: 1<<1 if region reads as zeroes, 0 if region contents might be nonzero it is always safe to describe an extent as value 0 (both bits clear), whether or not lseek(SEEK_DATA) returns the same offset; meanwhile, traditional lseek(SEEK_HOLE) on filesystems generally translates to a status of 3 (both bits set), but as it is (sometimes) possible to determine that allocated data still reads as zero, or that unallocated data may not necessarily read as zero, it is also possible to implement NBD servers that don't report both bits in parallel. If we are going to enhance llseek(2) to expose more information about underlying block devices, possibly by adding more SEEK_ constants for use in the entire family of *seek() API, it may be worth thinking about whether it is worth userspace being able to query this additional distinction between unallocated vs reads-as-zero. [1] https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md#baseallocation-metadata-context
On Thu, Mar 28, 2024 at 05:16:54PM -0500, Eric Blake wrote: > On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > > This can speed up the process by reducing the amount of data read and it > > preserves sparseness when writing to the output file. > > > > This patch series is an initial attempt at implementing > > llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this > > approach and suggestions for resolving the open issues. > > One of your open issues was whether adjusting the offset of the block > device itself should also adjust the file offset of the underlying > file (at least in the case of loopback and dm-linear). What would the Only the loop block driver has this issue. The dm-linear driver uses blkdev_seek_hole_data(), which does not update the file offset because it operates on a struct block_device instead of a struct file. > > > > > In the block device world there are similar concepts to holes: > > - SCSI has Logical Block Provisioning where the "mapped" state would be > > considered data and other states would be considered holes. > > BIG caveat here: the SCSI spec does not necessarily guarantee that > unmapped regions read as all zeroes; compare the difference between > FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE) > on a regular file guarantees that future read() in that hole will see > NUL bytes, I'm not sure whether we want to make that guarantee for > block devices. This may be yet another case where we might want to > add new SEEK_* constants to the *seek() family of functions that lets > the caller indicate whether they want offsets that are guaranteed to > read as zero, vs. merely offsets that are not allocated but may or may > not read as zero. Skipping unallocated portions, even when you don't > know if the contents reliably read as zero, is still a useful goal in > some userspace programs. SCSI initiators can check the Logical Block Provisioning Read Zeroes (LBPRZ) field to determine whether or not zeroes are guaranteed. The sd driver would only rely on the device when LPBRZ indicates that zeroes will be read. Otherwise the driver would report that the device is filled with data. > > > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present. > > However, utilizing it in nbd.ko would require teaching the kernel to > handle structured or extended headers (right now, that is an extension > only supported in user-space implementations of the NBD protocol). I > can see why you did not tackle that in this RFC series, even though > you mention it in the cover letter. Yes, I'm mostly interested in dm-thin. The loop block driver and dm-linear are useful for testing so I modified them. I didn't try SCSI or NBD. Thanks, Stefan
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > In the block device world there are similar concepts to holes: > - SCSI has Logical Block Provisioning where the "mapped" state would be > considered data and other states would be considered holes. But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do not have to return zeroes. They could also return some other initialization pattern pattern. So they are (unfortunately) not a 1:1 mapping to holes in sparse files.
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote: > On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > > In the block device world there are similar concepts to holes: > > - SCSI has Logical Block Provisioning where the "mapped" state would be > > considered data and other states would be considered holes. > > But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do > not have to return zeroes. They could also return some other > initialization pattern pattern. So they are (unfortunately) not a 1:1 > mapping to holes in sparse files. Hi Christoph, There is a 1:1 mapping when when the Logical Block Provisioning Read Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning VPD page. Otherwise SEEK_HOLE/SEEK_DATA has to treat the device as filled with data because it doesn't know where the holes are. Stefan
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote: > On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote: > > In the block device world there are similar concepts to holes: > > - SCSI has Logical Block Provisioning where the "mapped" state would be > > considered data and other states would be considered holes. > > But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do > not have to return zeroes. They could also return some other > initialization pattern pattern. So they are (unfortunately) not a 1:1 > mapping to holes in sparse files. Yes, and Stefan already answered that: https://lore.kernel.org/dm-devel/e2lcp3n5gpf7zmlpyn4nj7wsr36sffn23z5bmzlsghu6oapi5u@sdkcbpimi5is/t/#m58146a45951ec086966497e179a2b2715692712d >> SCSI initiators can check the Logical Block Provisioning Read Zeroes >> (LBPRZ) field to determine whether or not zeroes are guaranteed. The sd >> driver would only rely on the device when LPBRZ indicates that zeroes >> will be read. Otherwise the driver would report that the device is >> filled with data. As well as my question on whether the community would be open to introducing new SEEK_* constants to allow orthogonality between searching for zeroes (known to read as zero, whether or not it was allocated) vs. sparseness (known to be unallocated, whether or not it reads as zero), where the existing SEEK_HOLE seeks for both properties at once.
On Tue, Apr 02, 2024 at 09:04:46AM -0400, Stefan Hajnoczi wrote: > Hi Christoph, > There is a 1:1 mapping when when the Logical Block Provisioning Read > Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning > VPD page. Yes. NVMe also has a similar field, but ATA does not.
On Tue, Apr 02, 2024 at 08:31:09AM -0500, Eric Blake wrote: > As well as my question on whether the community would be open to > introducing new SEEK_* constants to allow orthogonality between > searching for zeroes (known to read as zero, whether or not it was > allocated) vs. sparseness (known to be unallocated, whether or not it > reads as zero), where the existing SEEK_HOLE seeks for both properties > at once. That seems like quite an effort. Is is worth it?