[v3,1/8] statx: add direct I/O alignment information

Message ID	20220616201506.124209-2-ebiggers@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Eric Biggers <ebiggers@kernel.org> To: linux-fsdevel@vger.kernel.org Cc: linux-ext4@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-xfs@vger.kernel.org, linux-api@vger.kernel.org, linux-fscrypt@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org> Subject: [PATCH v3 1/8] statx: add direct I/O alignment information Date: Thu, 16 Jun 2022 13:14:59 -0700 Message-Id: <20220616201506.124209-2-ebiggers@kernel.org> In-Reply-To: <20220616201506.124209-1-ebiggers@kernel.org> References: <20220616201506.124209-1-ebiggers@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	make statx() return DIO alignment information
  [v3,0/8] make statx() return DIO alignment information
  [v3,1/8] statx: add direct I/O alignment information
  [v3,2/8] vfs: support STATX_DIOALIGN on block devices
  [v3,3/8] fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
  [v3,4/8] ext4: support STATX_DIOALIGN
  [v3,5/8] f2fs: move f2fs_force_buffered_io() into file.c
  [v3,6/8] f2fs: don't allow DIO reads but not DIO writes
  [v3,7/8] f2fs: simplify f2fs_force_buffered_io()
  [v3,8/8] f2fs: support STATX_DIOALIGN

Eric Biggers June 16, 2022, 8:14 p.m. UTC

From: Eric Biggers <ebiggers@google.com>

Traditionally, the conditions for when DIO (direct I/O) is supported
were fairly simple.  For both block devices and regular files, DIO had
to be aligned to the logical block size of the block device.

However, due to filesystem features that have been added over time (e.g.
multi-device support, data journalling, inline data, encryption, verity,
compression, checkpoint disabling, log-structured mode), the conditions
for when DIO is allowed on a regular file have gotten increasingly
complex.  Whether a particular regular file supports DIO, and with what
alignment, can depend on various file attributes and filesystem mount
options, as well as which block device(s) the file's data is located on.

Moreover, the general rule of DIO needing to be aligned to the block
device's logical block size is being relaxed to allow user buffers (but
not file offsets) aligned to the DMA alignment instead
(https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbusch@fb.com/T/#u).

XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
Uplifting this to the VFS is one possibility.  However, as discussed
(https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u),
this ioctl is rarely used and not known to be used outside of
XFS-specific code.  It was also never intended to indicate when a file
doesn't support DIO at all, nor was it intended for block devices.

Therefore, let's expose this information via statx().  Add the
STATX_DIOALIGN flag and two new statx fields associated with it:

* stx_dio_mem_align: the alignment (in bytes) required for user memory
  buffers for DIO, or 0 if DIO is not supported on the file.

* stx_dio_offset_align: the alignment (in bytes) required for file
  offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
  on the file.  This will only be nonzero if stx_dio_mem_align is
  nonzero, and vice versa.
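
As an illustration of how userspace might apply the two fields described
above, here is a minimal sketch that checks a single I/O segment against
them.  The helper names (is_multiple_of, dio_segment_ok) are hypothetical
and not part of this patch; the check uses a plain modulo so it does not
assume the reported alignments are powers of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if x is a multiple of align.  align must be nonzero. */
static bool is_multiple_of(uint64_t x, uint32_t align)
{
	assert(align != 0);
	return x % align == 0;
}

/*
 * Hypothetical helper (not part of this patch): check one DIO segment
 * against values read from stx_dio_mem_align and stx_dio_offset_align.
 * A zero alignment means DIO is not supported on the file.
 */
static bool dio_segment_ok(uint32_t mem_align, uint32_t offset_align,
			   const void *buf, uint64_t offset, uint64_t len)
{
	if (mem_align == 0 || offset_align == 0)
		return false;
	return is_multiple_of((uintptr_t)buf, mem_align) &&	/* user buffer */
	       is_multiple_of(offset, offset_align) &&		/* file offset */
	       is_multiple_of(len, offset_align);		/* segment length */
}
```

Note that the buffer address is checked against the memory alignment
only, while both the file offset and the segment length are checked
against the offset alignment.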

Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
in the returned statx struct, then these new fields won't be filled in.
This will happen if the file is neither a regular file nor a block
device, or if the file is a regular file and the filesystem doesn't
support STATX_DIOALIGN.  It might also happen if the caller didn't
include STATX_DIOALIGN in the request mask, since statx() isn't required
to return unrequested information.
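
In other words, a caller has three cases to distinguish, not two.  A
sketch of that logic follows; the enum and function names are
hypothetical, and the flag value is copied from this patch under a local
name so the example does not depend on the installed headers.

```c
#include <assert.h>
#include <stdint.h>

/* Value of STATX_DIOALIGN in this patch, under a local example name. */
#define EXAMPLE_STATX_DIOALIGN 0x00002000U

enum dio_info {
	DIO_INFO_UNKNOWN,	/* flag not set in stx_mask: fields not filled in */
	DIO_INFO_UNSUPPORTED,	/* fields valid and zero: no DIO on this file */
	DIO_INFO_SUPPORTED,	/* fields valid and nonzero */
};

/* Hypothetical helper: interpret the returned mask and alignment fields. */
static enum dio_info classify_dio_info(uint32_t stx_mask, uint32_t mem_align,
				       uint32_t offset_align)
{
	if (!(stx_mask & EXAMPLE_STATX_DIOALIGN))
		return DIO_INFO_UNKNOWN;
	/* Per the description, both fields are zero or both are nonzero. */
	assert((mem_align == 0) == (offset_align == 0));
	if (mem_align == 0)
		return DIO_INFO_UNSUPPORTED;
	return DIO_INFO_SUPPORTED;
}
```

The "unknown" case is the one that is easy to miss: zeroed fields
without the flag set in the returned mask say nothing about whether DIO
works.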

This commit only adds the VFS-level plumbing for STATX_DIOALIGN.  For
regular files, individual filesystems will still need to add code to
support it.  For block devices, a separate commit will wire it up too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/stat.c                 | 2 ++
 include/linux/stat.h      | 2 ++
 include/uapi/linux/stat.h | 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

Avi Kivity June 19, 2022, 11:30 a.m. UTC | #1

On 16/06/2022 23.14, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple.  For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
>
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex.  Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
>
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbusch@fb.com/T/#u).
>
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility.  However, as discussed
> (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code.  It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
>
> Therefore, let's expose this information via statx().  Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
>
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
>    buffers for DIO, or 0 if DIO is not supported on the file.
>
> * stx_dio_offset_align: the alignment (in bytes) required for file
>    offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>    on the file.  This will only be nonzero if stx_dio_mem_align is
>    nonzero, and vice versa.


If you consider AIO, this is actually three alignments:

1. offset alignment for reads (sector size in XFS)

2. offset alignment for overwrites (sector size in XFS since 
ed1128c2d0c87e, block size earlier)

3. offset alignment for appending writes (block size)


This is critical for linux-aio since violation of these alignments will 
stall the io_submit system call. Perhaps io_uring handles it better by 
bouncing to a workqueue, but there is a significant performance and 
latency penalty for that.


Small appending writes are important for database commit logs (and so 
it's better to overwrite a pre-formatted file to avoid aligning to block 
size).


It would be good to expose these differences.
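
To make the three cases concrete, here is a rough sketch of what an
application would like to be able to compute.  It only encodes the XFS
behaviour as described in the list above; the names are hypothetical and
other filesystems may differ.

```c
#include <assert.h>
#include <stdint.h>

enum dio_op { DIO_READ, DIO_OVERWRITE, DIO_APPEND };

/*
 * Hypothetical helper: offset alignment needed to avoid stalling
 * submission, following the XFS behaviour described in the list above.
 */
static uint32_t aio_offset_align(enum dio_op op, uint32_t sector_size,
				 uint32_t fs_block_size)
{
	assert(sector_size != 0 && fs_block_size >= sector_size);
	switch (op) {
	case DIO_READ:
	case DIO_OVERWRITE:	/* sector size since ed1128c2d0c87e */
		return sector_size;
	case DIO_APPEND:
		return fs_block_size;
	}
	return fs_block_size;	/* conservative fallback for unknown ops */
}
```

With a 512-byte sector and a 4096-byte filesystem block, reads and
overwrites would get 512 and appends would get 4096.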


>
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN.  It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
>
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN.  For
> regular files, individual filesystems will still need to add code to
> support it.  For block devices, a separate commit will wire it up too.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>   fs/stat.c                 | 2 ++
>   include/linux/stat.h      | 2 ++
>   include/uapi/linux/stat.h | 4 +++-
>   3 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35..a7930d7444830 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>   	tmp.stx_dev_major = MAJOR(stat->dev);
>   	tmp.stx_dev_minor = MINOR(stat->dev);
>   	tmp.stx_mnt_id = stat->mnt_id;
> +	tmp.stx_dio_mem_align = stat->dio_mem_align;
> +	tmp.stx_dio_offset_align = stat->dio_offset_align;
>   
>   	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>   }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..ff277ced50e9f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
>   	struct timespec64 btime;			/* File creation time */
>   	u64		blocks;
>   	u64		mnt_id;
> +	u32		dio_mem_align;
> +	u32		dio_offset_align;
>   };
>   
>   #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041a..7cab2c65d3d7f 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
>   	__u32	stx_dev_minor;
>   	/* 0x90 */
>   	__u64	stx_mnt_id;
> -	__u64	__spare2;
> +	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
> +	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>   	/* 0xa0 */
>   	__u64	__spare3[12];	/* Spare space for future expansion */
>   	/* 0x100 */
> @@ -152,6 +153,7 @@ struct statx {
>   #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
>   #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
>   #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
> +#define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
>   
>   #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
>

Darrick J. Wong June 23, 2022, 3:58 p.m. UTC | #2

On Thu, Jun 16, 2022 at 01:14:59PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple.  For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
> 
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex.  Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
> 
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbusch@fb.com/T/#u).
> 
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility.  However, as discussed
> (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code.  It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
> 
> Therefore, let's expose this information via statx().  Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
> 
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
>   buffers for DIO, or 0 if DIO is not supported on the file.
> 
> * stx_dio_offset_align: the alignment (in bytes) required for file
>   offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>   on the file.  This will only be nonzero if stx_dio_mem_align is
>   nonzero, and vice versa.
> 
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN.  It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
> 
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN.  For
> regular files, individual filesystems will still need to add code to
> support it.  For block devices, a separate commit will wire it up too.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  fs/stat.c                 | 2 ++
>  include/linux/stat.h      | 2 ++
>  include/uapi/linux/stat.h | 4 +++-
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35..a7930d7444830 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  	tmp.stx_dev_major = MAJOR(stat->dev);
>  	tmp.stx_dev_minor = MINOR(stat->dev);
>  	tmp.stx_mnt_id = stat->mnt_id;
> +	tmp.stx_dio_mem_align = stat->dio_mem_align;
> +	tmp.stx_dio_offset_align = stat->dio_offset_align;
>  
>  	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>  }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..ff277ced50e9f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
>  	struct timespec64 btime;			/* File creation time */
>  	u64		blocks;
>  	u64		mnt_id;
> +	u32		dio_mem_align;
> +	u32		dio_offset_align;

Hmm.  Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
this?

	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);

	kstat.dio_mem_align = target->bt_logical_sectorsize;
	kstat.dio_offset_align = target->bt_logical_sectorsize;
	kstat.result_mask |= STATX_DIOALIGN;

And I guess you're tabling the "optimal" IO discussions for now, because
there are too many variants of what that means?

--D

>  };
>  
>  #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041a..7cab2c65d3d7f 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
>  	__u32	stx_dev_minor;
>  	/* 0x90 */
>  	__u64	stx_mnt_id;
> -	__u64	__spare2;
> +	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
> +	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>  	/* 0xa0 */
>  	__u64	__spare3[12];	/* Spare space for future expansion */
>  	/* 0x100 */
> @@ -152,6 +153,7 @@ struct statx {
>  #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
>  #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
>  #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
> +#define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
>  
>  #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
>  
> -- 
> 2.36.1
>

Eric Biggers June 23, 2022, 5:23 p.m. UTC | #3

On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > index 7df06931f25d8..ff277ced50e9f 100644
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> > @@ -50,6 +50,8 @@ struct kstat {
> >  	struct timespec64 btime;			/* File creation time */
> >  	u64		blocks;
> >  	u64		mnt_id;
> > +	u32		dio_mem_align;
> > +	u32		dio_offset_align;
> 
> Hmm.  Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> this?
> 
> 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> 
> 	kstat.dio_mem_align = target->bt_logical_sectorsize;
> 	kstat.dio_offset_align = target->bt_logical_sectorsize;
> 	kstat.result_mask |= STATX_DIOALIGN;

Yes, I think so.

However, if we need more fields as Avi Kivity requested at
https://lore.kernel.org/r/6c06b2d4-2d96-c4a6-7aca-5147a91e7cf2@scylladb.com
that is going to complicate things.  I haven't had a chance to look
into whether those extra fields are really needed.  Your opinion on whether XFS
(and any other filesystem) needs them would be appreciated.

> 
> And I guess you're tabling the "optimal" IO discussions for now, because
> there are too many variants of what that means?
> 

Yes, that's omitted for now due to the apparent redundancy with stx_blksize.

- Eric

Eric Biggers June 23, 2022, 6:58 p.m. UTC | #4

On Thu, Jun 23, 2022 at 10:23:20AM -0700, Eric Biggers wrote:
> On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> > > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > > index 7df06931f25d8..ff277ced50e9f 100644
> > > --- a/include/linux/stat.h
> > > +++ b/include/linux/stat.h
> > > @@ -50,6 +50,8 @@ struct kstat {
> > >  	struct timespec64 btime;			/* File creation time */
> > >  	u64		blocks;
> > >  	u64		mnt_id;
> > > +	u32		dio_mem_align;
> > > +	u32		dio_offset_align;
> > 
> > Hmm.  Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> > this?
> > 
> > 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > 
> > 	kstat.dio_mem_align = target->bt_logical_sectorsize;
> > 	kstat.dio_offset_align = target->bt_logical_sectorsize;
> > 	kstat.result_mask |= STATX_DIOALIGN;
> 
> Yes, I think so.
> 

By the way, the patchset "[PATCHv6 00/11] direct-io dma alignment"
(https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbusch@fb.com/T/#u),
which is currently queued in linux-block/for-next for 5.20, will relax the user
buffer alignment requirement to the dma alignment for all filesystems using the
iomap direct I/O implementation.  If that goes in, the XFS implementation of
STATX_DIOALIGN, as well as the ext4 and f2fs ones, will need to be changed
accordingly.  Also, the existing XFS_IOC_DIOINFO will need to be changed.
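
One consequence for userspace is that the memory alignment should not be
assumed to equal the offset alignment.  A minimal sketch of allocating a
buffer from a reported memory alignment follows.  The helper name is
hypothetical, and it assumes the reported value is a power of two, which
this series does not state explicitly.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a DIO buffer honoring a reported memory
 * alignment.  Returns NULL on failure or if mem_align is 0 (DIO not
 * supported).  Release the buffer with free().
 */
static void *alloc_dio_buffer(uint32_t mem_align, size_t size)
{
	size_t align = mem_align;
	void *buf = NULL;

	if (mem_align == 0)
		return NULL;
	/* Assumption: the reported alignment is a power of two. */
	assert((align & (align - 1)) == 0);
	/* posix_memalign() needs a power-of-two multiple of sizeof(void *). */
	if (align < sizeof(void *))
		align = sizeof(void *);
	if (posix_memalign(&buf, align, size) != 0)
		return NULL;
	return buf;
}
```

The clamp to sizeof(void *) matters if the reported alignment ever drops
below pointer size, since posix_memalign() rejects such values.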

- Eric

Christoph Hellwig June 26, 2022, 7:44 a.m. UTC | #5

On Sun, Jun 19, 2022 at 02:30:47PM +0300, Avi Kivity wrote:
> > * stx_dio_offset_align: the alignment (in bytes) required for file
> >    offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
> >    on the file.  This will only be nonzero if stx_dio_mem_align is
> >    nonzero, and vice versa.
> 
> 
> If you consider AIO, this is actually three alignments:
> 
> 1. offset alignment for reads (sector size in XFS)
> 
> 2. offset alignment for overwrites (sector size in XFS since ed1128c2d0c87e,
> block size earlier)
> 
> 3. offset alignment for appending writes (block size)
> 
> 
> This is critical for linux-aio since violation of these alignments will
> stall the io_submit system call. Perhaps io_uring handles it better by
> bouncing to a workqueue, but there is a significant performance and latency
> penalty for that.

I think you are mixing things up here.  We actually have two limits that
matter:

 a) the hard limit, which if violated will return an error.
    This has been sector size for all common file systems for years,
    but can be bigger than that with fscrypt in the game (which
    triggered this series)
 b) an optimal write size, which can be done asynchronously and
    without exclusive locking.
    This is what your cases 2) and 3) above refer to.

Exposing this additional optimal performance size might be a good idea
in addition to what is proposed here, even if it matters a little less
with io_uring.  But I'm not sure I'd additionally split it into append
vs overwrite vs hole filling but just round up to the maximum of those.
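
A sketch of what "round up to the maximum" would mean in practice is
below.  The function names are hypothetical, and taking the maximum only
satisfies every case if all the alignments are powers of two, which is
assumed here.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: collapse the per-operation alignments into one
 * "optimal" value by taking the largest.  Assumes all inputs are
 * nonzero powers of two, so the largest is a multiple of the others.
 */
static uint32_t combined_optimal_align(uint32_t overwrite_align,
				       uint32_t append_align,
				       uint32_t hole_fill_align)
{
	assert(overwrite_align && append_align && hole_fill_align);
	return max_u32(overwrite_align,
		       max_u32(append_align, hole_fill_align));
}
```

For example, 512 for overwrites and 4096 for appends and hole filling
would be reported as a single value of 4096.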

Christoph Hellwig June 26, 2022, 7:44 a.m. UTC | #6

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Christoph Hellwig June 26, 2022, 8:02 a.m. UTC | #7

On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> Hmm.  Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> this?
> 
> 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> 
> 	kstat.dio_mem_align = target->bt_logical_sectorsize;
> 	kstat.dio_offset_align = target->bt_logical_sectorsize;
> 	kstat.result_mask |= STATX_DIOALIGN;

Yes, I think so.  And it would be very good to include the XFS conversion
with this series, as XFS is the only file system that already supports
reporting alignment constraints.

I also suspect that lifting XFS_IOC_DIOINFO to common code by calling
->getattr would be useful, because all existing software using it
would then also do the right thing on ext4 and f2fs.

Avi Kivity June 26, 2022, 10:20 a.m. UTC | #8

On 26/06/2022 10.44, Christoph Hellwig wrote:
> On Sun, Jun 19, 2022 at 02:30:47PM +0300, Avi Kivity wrote:
>>> * stx_dio_offset_align: the alignment (in bytes) required for file
>>>     offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>>>     on the file.  This will only be nonzero if stx_dio_mem_align is
>>>     nonzero, and vice versa.
>>
>> If you consider AIO, this is actually three alignments:
>>
>> 1. offset alignment for reads (sector size in XFS)
>>
>> 2. offset alignment for overwrites (sector size in XFS since ed1128c2d0c87e,
>> block size earlier)
>>
>> 3. offset alignment for appending writes (block size)
>>
>>
>> This is critical for linux-aio since violation of these alignments will
>> stall the io_submit system call. Perhaps io_uring handles it better by
>> bouncing to a workqueue, but there is a significant performance and latency
>> penalty for that.
> I think you are mixing things up here.


Yes.


> We actually have two limits that
> matter:
>
>   a) the hard limit, which if violated will return an error.
>      This has been sector size for all common file systems for years,
>      but can be bigger than that with fscrypt in the game (which
>      triggered this series)
>   b) an optimal write size, which can be done asynchronously and
>      without exclusive locking.
>      This is what your cases 2) and 3) above refer to.
>
> Exposing this additional optimal performance size might be a good idea
> in addition to what is proposed here, even if it matters a little less
> with io_uring.  But I'm not sure I'd additionally split it into append
> vs overwrite vs hole filling but just round up to the maximum of those.


Rounding up will penalize database workloads, with and without io_uring. 
Database commit logs are characterized by frequent small writes. Telling 
the database to round up to 4k vs 512 bytes means large write 
amplification. The disk probably won't care (or maybe it will - it will 
also have to generate more erase blocks), but the database will run out 
of commitlog space much sooner and will have to compensate in expensive 
ways.
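
To put rough numbers on that, here is a small sketch of the arithmetic.
The helper names and the 200-byte record size used below are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round len up to the next multiple of align.  align must be nonzero. */
static uint64_t round_up_to(uint64_t len, uint32_t align)
{
	assert(align != 0);
	return (len + align - 1) / align * align;
}

/*
 * Hypothetical helper: log space consumed when every record is written
 * separately and padded up to the write alignment.
 */
static uint64_t log_bytes_used(uint64_t n_records, uint64_t record_len,
			       uint32_t align)
{
	return n_records * round_up_to(record_len, align);
}
```

With 1000 records of 200 bytes each, padding to 512 bytes uses 512000
bytes of log, while padding to 4096 bytes uses 4096000 bytes, i.e. eight
times as much.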


Of course, people that care can continue to use internal filesystem 
knowledge, and maybe there are few enough of those that the API can 
choose to ignore them.
