diff mbox series

[RFC,8/8] bdev: use bdev_io_min() for statx block size

Message ID 20241113094727.1497722-9-mcgrof@kernel.org (mailing list archive)
State Not Applicable, archived
Headers show
Series enable bs > ps for block devices | expand

Commit Message

Luis Chamberlain Nov. 13, 2024, 9:47 a.m. UTC
You can use lsblk to query for a block device block device block size:

lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
 4096

The min-io is the minimum IO the block device prefers for optimal
performance. In turn we map this to the block device block size.
The current block size exposed even for block devices with an
LBA format of 16k is 4k. Likewise devices which support 4k LBA format
but have a larger Indirection Unit of 16k have an exposed block size
of 4k.

This incurs read-modify-writes on direct IO against devices with a
min-io larger than the page size. To fix this, use the block device
min io, which is the minimal optimal IO the device prefers.

With this we now get:

lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
 16384

And so userspace gets the appropriate information it needs for optimal
performance. This is verified with blkalgn against mkfs against a
device with LBA format of 4k but an NPWG of 16k (min io size)

mkfs.xfs -f -b size=16k  /dev/nvme3n1
blkalgn -d nvme3n1 --ops Write

     Block size          : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 66       |****************************************|
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 2        |*                                       |
Block size: 14 - 66
Block size: 17 - 2

     Algn size           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 66       |****************************************|
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 2        |*                                       |
Algn size: 14 - 66
Algn size: 17 - 2

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 1 +
 fs/stat.c    | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

Comments

Hannes Reinecke Nov. 13, 2024, 9:59 a.m. UTC | #1
On 11/13/24 10:47, Luis Chamberlain wrote:
> You can use lsblk to query for a block device block device block size:
> 
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
>   4096
> 
> The min-io is the minimum IO the block device prefers for optimal
> performance. In turn we map this to the block device block size.
> The current block size exposed even for block devices with an
> LBA format of 16k is 4k. Likewise devices which support 4k LBA format
> but have a larger Indirection Unit of 16k have an exposed block size
> of 4k.
> 
> This incurs read-modify-writes on direct IO against devices with a
> min-io larger than the page size. To fix this, use the block device
> min io, which is the minimal optimal IO the device prefers.
> 
> With this we now get:
> 
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
>   16384
> 
> And so userspace gets the appropriate information it needs for optimal
> performance. This is verified with blkalgn against mkfs against a
> device with LBA format of 4k but an NPWG of 16k (min io size)
> 
> mkfs.xfs -f -b size=16k  /dev/nvme3n1
> blkalgn -d nvme3n1 --ops Write
> 
>       Block size          : count     distribution
>           0 -> 1          : 0        |                                        |
>           2 -> 3          : 0        |                                        |
>           4 -> 7          : 0        |                                        |
>           8 -> 15         : 0        |                                        |
>          16 -> 31         : 0        |                                        |
>          32 -> 63         : 0        |                                        |
>          64 -> 127        : 0        |                                        |
>         128 -> 255        : 0        |                                        |
>         256 -> 511        : 0        |                                        |
>         512 -> 1023       : 0        |                                        |
>        1024 -> 2047       : 0        |                                        |
>        2048 -> 4095       : 0        |                                        |
>        4096 -> 8191       : 0        |                                        |
>        8192 -> 16383      : 0        |                                        |
>       16384 -> 32767      : 66       |****************************************|
>       32768 -> 65535      : 0        |                                        |
>       65536 -> 131071     : 0        |                                        |
>      131072 -> 262143     : 2        |*                                       |
> Block size: 14 - 66
> Block size: 17 - 2
> 
>       Algn size           : count     distribution
>           0 -> 1          : 0        |                                        |
>           2 -> 3          : 0        |                                        |
>           4 -> 7          : 0        |                                        |
>           8 -> 15         : 0        |                                        |
>          16 -> 31         : 0        |                                        |
>          32 -> 63         : 0        |                                        |
>          64 -> 127        : 0        |                                        |
>         128 -> 255        : 0        |                                        |
>         256 -> 511        : 0        |                                        |
>         512 -> 1023       : 0        |                                        |
>        1024 -> 2047       : 0        |                                        |
>        2048 -> 4095       : 0        |                                        |
>        4096 -> 8191       : 0        |                                        |
>        8192 -> 16383      : 0        |                                        |
>       16384 -> 32767      : 66       |****************************************|
>       32768 -> 65535      : 0        |                                        |
>       65536 -> 131071     : 0        |                                        |
>      131072 -> 262143     : 2        |*                                       |
> Algn size: 14 - 66
> Algn size: 17 - 2
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   block/bdev.c | 1 +
>   fs/stat.c    | 2 +-
>   2 files changed, 2 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Christoph Hellwig Nov. 18, 2024, 7:08 a.m. UTC | #2
On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> The min-io is the minimum IO the block device prefers for optimal
> performance. In turn we map this to the block device block size.

It's not the block size, but (to quote the man page) 'the "preferred"
block size for efficient filesystem I/O'.  While the difference might
sound minor it actually is important.

> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 3a5fd65f6c8e..4dcc501ed953 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1306,6 +1306,7 @@ void bdev_statx(struct path *path, struct kstat *stat,
>  			queue_atomic_write_unit_max_bytes(bd_queue));
>  	}
>  
> +	stat->blksize = (unsigned int) bdev_io_min(bdev);

No need for the cast.

>  	if (S_ISBLK(stat->mode))
> -		bdev_statx(path, stat, request_mask);
> +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);

And this is both unrelated and wrong.
Luis Chamberlain Nov. 18, 2024, 9:16 p.m. UTC | #3
On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> >  	if (S_ISBLK(stat->mode))
> > -		bdev_statx(path, stat, request_mask);
> > +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
> 
> And this is both unrelated and wrong.

I knew this was an eyesore, but was not sure if we really wanted to
go through the trouble of adding a new field for blksize alone, but come
to think of it, with it at least userspace knows for sure its
getting where as befault it was not.

If we add it, and since it would be added post LBS support it could also
signal that a kernel supports LBS. That may be a useful clue for default
mkfs in case it is set and larger than today's 4k default.

So how about:

diff --git a/block/bdev.c b/block/bdev.c
index 3a5fd65f6c8e..f5d7cda97616 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
 	struct inode *backing_inode;
 	struct block_device *bdev;
 
-	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
+	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC |
+			      STATX_BLKSIZE)))
 		return;
 
 	backing_inode = d_backing_inode(path->dentry);
@@ -1306,6 +1307,11 @@ void bdev_statx(struct path *path, struct kstat *stat,
 			queue_atomic_write_unit_max_bytes(bd_queue));
 	}
 
+	if (request_mask & STATX_BLKSIZE) {
+		stat->blksize = (unsigned int) bdev_io_min(bdev);
+		stat->result_mask |= STATX_BLKSIZE;
+	}
+
 	blkdev_put_no_open(bdev);
 }
 
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..d4cb2296b42d 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -268,7 +268,7 @@ static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 	 * obtained from the bdev backing inode.
 	 */
 	if (S_ISBLK(stat->mode))
-		bdev_statx(path, stat, request_mask);
+		bdev_statx(path, stat, request_mask | STATX_BLKSIZE);
 
 	return error;
 }
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a25286441..b7e180bf72b8 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -164,6 +164,7 @@ struct statx {
 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
 #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
+#define STATX_BLKSIZE		0x00020000U	/* Want/got stx_blksize */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
Christoph Hellwig Nov. 19, 2024, 6:08 a.m. UTC | #4
On Mon, Nov 18, 2024 at 01:16:37PM -0800, Luis Chamberlain wrote:
> On Mon, Nov 18, 2024 at 08:08:05AM +0100, Christoph Hellwig wrote:
> > On Wed, Nov 13, 2024 at 01:47:27AM -0800, Luis Chamberlain wrote:
> > >  	if (S_ISBLK(stat->mode))
> > > -		bdev_statx(path, stat, request_mask);
> > > +		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
> > 
> > And this is both unrelated and wrong.
> 
> I knew this was an eyesore, but was not sure if we really wanted to
> go through the trouble of adding a new field for blksize alone, but come
> to think of it, with it at least userspace knows for sure its
> getting where as befault it was not.

Huh?  The only think this does is forcing to fill out the dio align
fields when not requested.  It has nothing to do with block sizes.

> So how about:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 3a5fd65f6c8e..f5d7cda97616 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1277,7 +1277,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
>  	struct inode *backing_inode;
>  	struct block_device *bdev;
>  
> -	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
> +	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC |
> +			      STATX_BLKSIZE)))

Just drop this conditional entirely and you're fine.
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index 3a5fd65f6c8e..4dcc501ed953 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1306,6 +1306,7 @@  void bdev_statx(struct path *path, struct kstat *stat,
 			queue_atomic_write_unit_max_bytes(bd_queue));
 	}
 
+	stat->blksize = (unsigned int) bdev_io_min(bdev);
 	blkdev_put_no_open(bdev);
 }
 
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..9b579c0b5153 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -268,7 +268,7 @@  static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 	 * obtained from the bdev backing inode.
 	 */
 	if (S_ISBLK(stat->mode))
-		bdev_statx(path, stat, request_mask);
+		bdev_statx(path, stat, request_mask | STATX_DIOALIGN);
 
 	return error;
 }