Message ID | 20241113094727.1497722-9-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | enable bs > ps for block devices | expand |
On 11/13/24 10:47, Luis Chamberlain wrote: > You can use lsblk to query for a block device block device block size: > > lsblk -o MIN-IO /dev/nvme0n1 > MIN-IO > 4096 > > The min-io is the minimum IO the block device prefers for optimal > performance. In turn we map this to the block device block size. > The current block size exposed even for block devices with an > LBA format of 16k is 4k. Likewise devices which support 4k LBA format > but have a larger Indirection Unit of 16k have an exposed block size > of 4k. > > This incurs read-modify-writes on direct IO against devices with a > min-io larger than the page size. To fix this, use the block device > min io, which is the minimal optimal IO the device prefers. > > With this we now get: > > lsblk -o MIN-IO /dev/nvme0n1 > MIN-IO > 16384 > > And so userspace gets the appropriate information it needs for optimal > performance. This is verified with blkalgn against mkfs against a > device with LBA format of 4k but an NPWG of 16k (min io size) > > mkfs.xfs -f -b size=16k /dev/nvme3n1 > blkalgn -d nvme3n1 --ops Write > > Block size : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 0 | | > 256 -> 511 : 0 | | > 512 -> 1023 : 0 | | > 1024 -> 2047 : 0 | | > 2048 -> 4095 : 0 | | > 4096 -> 8191 : 0 | | > 8192 -> 16383 : 0 | | > 16384 -> 32767 : 66 |****************************************| > 32768 -> 65535 : 0 | | > 65536 -> 131071 : 0 | | > 131072 -> 262143 : 2 |* | > Block size: 14 - 66 > Block size: 17 - 2 > > Algn size : count distribution > 0 -> 1 : 0 | | > 2 -> 3 : 0 | | > 4 -> 7 : 0 | | > 8 -> 15 : 0 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 0 | | > 256 -> 511 : 0 | | > 512 -> 1023 : 0 | | > 1024 -> 2047 : 0 | | > 2048 -> 4095 : 0 | | > 4096 -> 8191 : 0 | | > 8192 -> 16383 : 0 | | > 16384 -> 32767 : 66 |****************************************| > 32768 -> 65535 : 0 | | > 65536 -> 131071 : 0 | | > 131072 -> 262143 : 2 |* | > Algn size: 14 - 66 > Algn size: 17 - 2 > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > block/bdev.c | 1 + > fs/stat.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > Reviewed-by: Hannes Reinecke <hare@suse.de> Cheers, Hannes
On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote: > The min-io is the minimum IO the block device prefers for optimal > performance. In turn we map this to the block device block size. It's not the block size, but (to quote the man page) 'the "preferred" block size for efficient filesystem I/O'. While the difference might sound minor it actually is important. > > diff --git a/block/bdev.c b/block/bdev.c > index 3a5fd65f6c8e..4dcc501ed953 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -1306,6 +1306,7 @@ void bdev_statx(struct path *path, struct kstat *stat, > queue_atomic_write_unit_max_bytes(bd_queue)); > } > > + stat->blksize = (unsigned int) bdev_io_min(bdev); No need for the cast. > if (S_ISBLK(stat->mode)) > - bdev_statx(path, stat, request_mask); > + bdev_statx(path, stat, request_mask | STATX_DIOALIGN); And this is both unrelated and wrong.
On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote: > On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote: > > if (S_ISBLK(stat->mode)) > > - bdev_statx(path, stat, request_mask); > > + bdev_statx(path, stat, request_mask | STATX_DIOALIGN); > > And this is both unrelated and wrong. I knew this was an eyesore, but was not sure if we really wanted to go through the trouble of adding a new field for blksize alone, but come to think of it, with it at least userspace knows for sure its getting where as befault it was not. If we add it, and since it would be added post LBS support it could also signal that a kernel supports LBS. That may be a useful clue for default mkfs in case it is set and larger than today's 4k default. So how about: diff --git a/block/bdev.c b/block/bdev.c index 3a5fd65f6c8e..f5d7cda97616 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat, struct inode *backing_inode; struct block_device *bdev; - if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC))) + if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC | + STATX_BLKSIZE))) return; backing_inode = d_backing_inode(path->dentry); @@ -1306,6 +1307,11 @@ void bdev_statx(struct path *path, struct kstat *stat, queue_atomic_write_unit_max_bytes(bd_queue)); } + if (request_mask & STATX_BLKSIZE) { + stat->blksize = (unsigned int) bdev_io_min(bdev); + stat->result_mask |= STATX_BLKSIZE; + } + blkdev_put_no_open(bdev); } diff --git a/fs/stat.c b/fs/stat.c index 41e598376d7e..d4cb2296b42d 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -268,7 +268,7 @@ static int vfs_statx_path(struct path *path, int flags, struct kstat *stat, * obtained from the bdev backing inode. */ if (S_ISBLK(stat->mode)) - bdev_statx(path, stat, request_mask); + bdev_statx(path, stat, request_mask | STATX_BLKSIZE); return error; } diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 887a25286441..b7e180bf72b8 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -164,6 +164,7 @@ struct statx { #define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */ #define STATX_SUBVOL 0x00008000U /* Want/got stx_subvol */ #define STATX_WRITE_ATOMIC 0x00010000U /* Want/got atomic_write_* fields */ +#define STATX_BLKSIZE 0x00020000U /* Want/got stx_blksize */ #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
On Mon, Nov 18, 2024 at 01:16:37PM -0800, Luis Chamberlain wrote: > On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote: > > On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote: > > > if (S_ISBLK(stat->mode)) > > > - bdev_statx(path, stat, request_mask); > > > + bdev_statx(path, stat, request_mask | STATX_DIOALIGN); > > > > And this is both unrelated and wrong. > > I knew this was an eyesore, but was not sure if we really wanted to > go through the trouble of adding a new field for blksize alone, but come > to think of it, with it at least userspace knows for sure its > getting where as befault it was not. Huh? The only think this does is forcing to fill out the dio align fields when not requested. It has nothing to do with block sizes. > So how about: > > diff --git a/block/bdev.c b/block/bdev.c > index 3a5fd65f6c8e..f5d7cda97616 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat, > struct inode *backing_inode; > struct block_device *bdev; > > - if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC))) > + if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC | > + STATX_BLKSIZE))) Just drop this conditional entirely and you're fine.
diff --git a/block/bdev.c b/block/bdev.c index 3a5fd65f6c8e..4dcc501ed953 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1306,6 +1306,7 @@ void bdev_statx(struct path *path, struct kstat *stat, queue_atomic_write_unit_max_bytes(bd_queue)); } + stat->blksize = (unsigned int) bdev_io_min(bdev); blkdev_put_no_open(bdev); } diff --git a/fs/stat.c b/fs/stat.c index 41e598376d7e..9b579c0b5153 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -268,7 +268,7 @@ static int vfs_statx_path(struct path *path, int flags, struct kstat *stat, * obtained from the bdev backing inode. */ if (S_ISBLK(stat->mode)) - bdev_statx(path, stat, request_mask); + bdev_statx(path, stat, request_mask | STATX_DIOALIGN); return error; }
You can use lsblk to query for a block device block device block size: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 4096 The min-io is the minimum IO the block device prefers for optimal performance. In turn we map this to the block device block size. The current block size exposed even for block devices with an LBA format of 16k is 4k. Likewise devices which support 4k LBA format but have a larger Indirection Unit of 16k have an exposed block size of 4k. This incurs read-modify-writes on direct IO against devices with a min-io larger than the page size. To fix this, use the block device min io, which is the minimal optimal IO the device prefers. With this we now get: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 16384 And so userspace gets the appropriate information it needs for optimal performance. This is verified with blkalgn against mkfs against a device with LBA format of 4k but an NPWG of 16k (min io size) mkfs.xfs -f -b size=16k /dev/nvme3n1 blkalgn -d nvme3n1 --ops Write Block size : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 0 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 66 |****************************************| 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 2 |* | Block size: 14 - 66 Block size: 17 - 2 Algn size : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 0 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 66 |****************************************| 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 2 |* | Algn size: 14 - 66 Algn size: 17 - 2 Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- block/bdev.c | 1 + fs/stat.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)