Message ID | 20240813163638.3751939-1-john.g.garry@oracle.com (mailing list archive) |
---|---|
Headers | show |
Series | forcealign for xfs | expand |
John Garry <john.g.garry@oracle.com> writes: > This series is being spun off the block atomic writes for xfs series at > [0]. > > That series got too big. > > The actual forcealign patches are roughly the same in this series. > > Why forcealign? > In some scenarios to may be required to guarantee extent alignment and > granularity. > > For example, for atomic writes, the maximum atomic write unit size would > be limited at the extent alignment and granularity, guaranteeing that an > atomic write would not span data present in multiple extents. > > forcealign may be useful as a performance tuning optimization in other > scenarios. > > I decided not to support forcealign for RT devices here. Initially I > thought that it would be quite simple of implement. However, I discovered > through much testing and subsequent debug that this was not true, so I > decided to defer support to later. > > Early development xfsprogs support is at: > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/ > Hi John, Thanks for your continued work on atomic write. I went over the XFS patch series and this is my understanding + some queries. Could you please help with these. 1. As I understand XFS untorn atomic write support is built on top of FORCEALIGN feature (which this series is adding) which in turn uses extsize hint feature underneath. Now extsize hint mainly controls the alignment of both "physical start" & "logical start" offset and extent length, correct? This is done using args->alignment for start aand args->prod/mode variables for extent length. Correct? - If say we are not able to allocate an aligned physical start? Then since extsize is just a hint we go ahead with whatever best available extent is right? - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct? 2. If say there is an append write i.e. the allocation is needed to be done at EOF. Then we try for an exact bno (from eof block) and aligned extent length, right? i.e. xfs_bmap_btalloc_filestreams() -> xfs_bmap_btalloc_at_eof(ap, args); If it is not available then we try for nearby bno xfs_alloc_vextent_near_bno(args, target) and similar... 3. It is the FORCEALIGN feature which _mandates_ both allocation (by using extsize hint) and de-allocation to happen _only_ in extsize chunks. i.e. forcealign mandates - - the logical and physical start offset should be aligned as per args->alignment - extent length be aligned as per args->prod/mod. If above two cannot be satisfied then return -ENOSPC. - Does the unmapping of extents also only happens in extsize chunks (with forcealign)? If the start or end of the extent which needs unmapping is unaligned then we convert that extent to unwritten and skip, is it? (__xfs_bunmapi()) This is a bit unclear to me. Maybe I need to look more deeper into the __xfs_bunmapi() while loop. My knowledge about this is still limited so please ignore any silly questions. -ritesh
On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: > John Garry <john.g.garry@oracle.com> writes: > > > This series is being spun off the block atomic writes for xfs > > series at [0]. > > > > That series got too big. > > > > The actual forcealign patches are roughly the same in this > > series. > > > > Why forcealign? In some scenarios to may be required to > > guarantee extent alignment and granularity. > > > > For example, for atomic writes, the maximum atomic write unit > > size would be limited at the extent alignment and granularity, > > guaranteeing that an atomic write would not span data present in > > multiple extents. > > > > forcealign may be useful as a performance tuning optimization in > > other scenarios. > > > > I decided not to support forcealign for RT devices here. > > Initially I thought that it would be quite simple of implement. > > However, I discovered through much testing and subsequent debug > > that this was not true, so I decided to defer support to > > later. > > > > Early development xfsprogs support is at: > > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/ > > > > Hi John, > > Thanks for your continued work on atomic write. I went over the > XFS patch series and this is my understanding + some queries. > Could you please help with these. Hi Ritesh - to make it easier for everyone to read and reply to you emails, can you please word wrap the text at 72 columns? > 1. As I understand XFS untorn atomic write support is built on top > of FORCEALIGN feature (which this series is adding) which in turn > uses extsize hint feature underneath. Yes. > Now extsize hint mainly controls the alignment of both > "physical start" & "logical start" offset and extent length, > correct? Yes. > This is done using args->alignment for start aand > args->prod/mode variables for extent length. Correct? Yes. > - If say we are not able to allocate an aligned physical start? > Then since extsize is just a hint we go ahead with whatever > best available extent is right? No. The definition of "forced alignment" is that we guarantee aligned allocation to the extent size hint. i.e the extent size hint is not a hint anymore - it defines the alignment we are guaranteeing allocation will achieve. hence if we can't align the extent to the alignment provided, we fail the alignment. > - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct? No. See the use of xfs_inode_alloc_unitsize() in all the places where we free space. Forced alignment extends this function to return the extent size, not the block size. > 2. If say there is an append write i.e. the allocation is needed > to be done at EOF. Then we try for an exact bno (from eof block) > and aligned extent length, right? Yes. This works because the previous extent is exactly aligned, hence a contiguous allocation will continue to be correctly aligned due to the forced alignment constraints. > i.e. xfs_bmap_btalloc_filestreams() -> > xfs_bmap_btalloc_at_eof(ap, args); If it is not available then > we try for nearby bno xfs_alloc_vextent_near_bno(args, target) > and similar... yes, that's just the normal aligned allocation fallback path when exact allocation fails. > 3. It is the FORCEALIGN feature which _mandates_ both allocation > (by using extsize hint) and de-allocation to happen _only_ in > extsize chunks. > > i.e. forcealign mandates - > - the logical and physical start offset should be aligned as > per args->alignment > - extent length be aligned as per args->prod/mod. > If above two cannot be satisfied then return -ENOSPC. Yes. > > - Does the unmapping of extents also only happens in extsize > chunks (with forcealign)? Yes, via use of xfs_inode_alloc_unitsize() in the high level code aligning the fsbno ranges to be unmapped. Remember, force align requires both logical file offset and physical block number to be correctly aligned, so unmap alignment has to be set up correctly at file offset level before we even know what extents underly the file range we need to unmap.... > If the start or end of the extent which needs unmapping is > unaligned then we convert that extent to unwritten and skip, > is it? (__xfs_bunmapi()) The high level code should be aligning the start and end of the file range to be removed via xfs_inode_alloc_unitsize(). Hence the low level __xfs_bunmapi() code shouldn't ever be encountering unaligned unmaps on force-aligned inodes. -Dave.
Thanks Dave for quick response. Dave Chinner <david@fromorbit.com> writes: > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: >> John Garry <john.g.garry@oracle.com> writes: >> >> > This series is being spun off the block atomic writes for xfs >> > series at [0]. >> > >> > That series got too big. >> > >> > The actual forcealign patches are roughly the same in this >> > series. >> > >> > Why forcealign? In some scenarios to may be required to >> > guarantee extent alignment and granularity. >> > >> > For example, for atomic writes, the maximum atomic write unit >> > size would be limited at the extent alignment and granularity, >> > guaranteeing that an atomic write would not span data present in >> > multiple extents. >> > >> > forcealign may be useful as a performance tuning optimization in >> > other scenarios. >> > >> > I decided not to support forcealign for RT devices here. >> > Initially I thought that it would be quite simple of implement. >> > However, I discovered through much testing and subsequent debug >> > that this was not true, so I decided to defer support to >> > later. >> > >> > Early development xfsprogs support is at: >> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/ >> > >> >> Hi John, >> >> Thanks for your continued work on atomic write. I went over the >> XFS patch series and this is my understanding + some queries. >> Could you please help with these. > > Hi Ritesh - to make it easier for everyone to read and reply to you > emails, can you please word wrap the text at 72 columns? > argh! Sorry about that. I had formed my queries in a separate notes application and copy pasted it here. Hopefully this time it will be ok. >> 1. As I understand XFS untorn atomic write support is built on top >> of FORCEALIGN feature (which this series is adding) which in turn >> uses extsize hint feature underneath. > > Yes. > >> Now extsize hint mainly controls the alignment of both >> "physical start" & "logical start" offset and extent length, >> correct? > > Yes. > >> This is done using args->alignment for start aand >> args->prod/mode variables for extent length. Correct? > > Yes. > >> - If say we are not able to allocate an aligned physical start? >> Then since extsize is just a hint we go ahead with whatever >> best available extent is right? > > No. The definition of "forced alignment" is that we guarantee > aligned allocation to the extent size hint. i.e the extent size hint > is not a hint anymore - it defines the alignment we are guaranteeing > allocation will achieve. > > hence if we can't align the extent to the alignment provided, we > fail the alignment. > >> - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct? > > No. See the use of xfs_inode_alloc_unitsize() in all the places > where we free space. Forced alignment extends this function to > return the extent size, not the block size. > Sorry for not being explicit. For queries in point 1. above, I was referring to extent size hint feature w/o FORCEALIGN. But I got the gist from your response. Thanks! >> 2. If say there is an append write i.e. the allocation is needed >> to be done at EOF. Then we try for an exact bno (from eof block) >> and aligned extent length, right? > > Yes. This works because the previous extent is exactly aligned, > hence a contiguous allocation will continue to be correctly aligned > due to the forced alignment constraints. > >> i.e. xfs_bmap_btalloc_filestreams() -> >> xfs_bmap_btalloc_at_eof(ap, args); If it is not available then >> we try for nearby bno xfs_alloc_vextent_near_bno(args, target) >> and similar... > > yes, that's just the normal aligned allocation fallback path when > exact allocation fails. > >> 3. It is the FORCEALIGN feature which _mandates_ both allocation >> (by using extsize hint) and de-allocation to happen _only_ in >> extsize chunks. >> >> i.e. forcealign mandates - >> - the logical and physical start offset should be aligned as >> per args->alignment >> - extent length be aligned as per args->prod/mod. >> If above two cannot be satisfied then return -ENOSPC. > > Yes. > >> >> - Does the unmapping of extents also only happens in extsize >> chunks (with forcealign)? > > Yes, via use of xfs_inode_alloc_unitsize() in the high level code > aligning the fsbno ranges to be unmapped. > > Remember, force align requires both logical file offset and > physical block number to be correctly aligned, This is where I would like to double confirm it again. Even the extsize hint feature (w/o FORCEALIGN) will try to allocate aligned physical start and logical start file offset and length right? (Or does extsize hint only restricts alignment to logical start file offset + length and not the physical start?) Also it looks like there is no difference with ATOMIC_WRITE AND FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is adding additional natural alignment restrictions on pos and len). So why maintain 2 separate on disk inode flags for FORCEALIGN AND ATOMIC_WRITE? - Do you foresee FORCEALIGN to be also used at other places w/o ATOMIC_WRITE where feature differentiation between the two on an inode is required? - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too? - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata changes within XFS filesystem to support atomic writes, right? Is it something to just prevent users from destroying their own data by not allowing a rw mount from an older kernel where users could do unaligned writes to files marked for atomic writes? Or is there any other reasoning to prevent XFS filesystem from becoming inconsistent if an older kernel does a rw mount here. > so unmap alignment > has to be set up correctly at file offset level before we even know > what extents underly the file range we need to unmap.... > >> If the start or end of the extent which needs unmapping is >> unaligned then we convert that extent to unwritten and skip, >> is it? (__xfs_bunmapi()) > > The high level code should be aligning the start and end of the > file range to be removed via xfs_inode_alloc_unitsize(). Hence > the low level __xfs_bunmapi() code shouldn't ever be encountering > unaligned unmaps on force-aligned inodes. > Yes, but isn't this code snippet trying to handle a case when it finds an unaligned extent during unmap? And what we are essentially trying to do here is leave the unwritten extent as is and if the found extent is written then convert to unwritten and skip it (goto nodelete). This means with forcealign if we encounter an unaligned extent then the file will have this space reserved as is with extent marked unwritten. Is this understanding correct? __xfs_bunmapi(...) { unsigned int alloc_fsb = xfs_inode_alloc_fsbsize(ip); <...> while (end != (xfs_fileoff_t)-1 && end >= start & (nexts == 0 || extno < nexts)) { <...> if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP)) goto delete; mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb, del.br_startblock + del.br_blockcount); if (mod) { /* * Not aligned to allocation unit on the end. * The extent could have been split into written * and unwritten pieces, or we could just be * unmapping part of it. But we can't really * get rid of part of an extent. */ if (del.br_state == XFS_EXT_UNWRITTEN) { /* * This piece is unwritten, or we're not * using unwritten extents. Skip over it. */ ASSERT((flags & XFS_BMAPI_REMAP) || end >= mod); end -= mod > del.br_blockcount ? del.br_blockcount : mod; if (end < got.br_startoff && !xfs_iext_prev_extent(ifp, &icur, &got)) { done = true; break; } continue; } /* * It's written, turn it unwritten. * This is better than zeroing it. */ ASSERT(del.br_state == XFS_EXT_NORM); ASSERT(tp->t_blk_res > 0); /* * If this spans an extent boundary, chop it back to * the start of the one we end at. */ if (del.br_blockcount > mod) { del.br_startoff += del.br_blockcount - mod; del.br_startblock += del.br_blockcount - mod; del.br_blockcount = mod; } del.br_state = XFS_EXT_UNWRITTEN; error = xfs_bmap_add_extent_unwritten_real(tp, ip, whichfork, &icur, &cur, &del, &logflags); if (error) goto error0; goto nodelete; } mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb, del.br_startblock); if (mod) { // handle it for unaligned start block <...> } } } > -Dave. Thanks a lot for answering the queries. -ritesh
On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote: > Dave Chinner <david@fromorbit.com> writes: > > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: > >> 3. It is the FORCEALIGN feature which _mandates_ both allocation > >> (by using extsize hint) and de-allocation to happen _only_ in > >> extsize chunks. > >> > >> i.e. forcealign mandates - > >> - the logical and physical start offset should be aligned as > >> per args->alignment > >> - extent length be aligned as per args->prod/mod. > >> If above two cannot be satisfied then return -ENOSPC. > > > > Yes. > > > >> > >> - Does the unmapping of extents also only happens in extsize > >> chunks (with forcealign)? > > > > Yes, via use of xfs_inode_alloc_unitsize() in the high level code > > aligning the fsbno ranges to be unmapped. > > > > Remember, force align requires both logical file offset and > > physical block number to be correctly aligned, > > This is where I would like to double confirm it again. Even the > extsize hint feature (w/o FORCEALIGN) will try to allocate aligned > physical start and logical start file offset and length right? No. > (Or does extsize hint only restricts alignment to logical start file > offset + length and not the physical start?) Neither. extsize hint by itself (i.e. existing behaviour) has no alignment effect at all. All it affects is -size- of the extent. i.e. once the extent start is chosen, extent size hints will trim the length of the extent to a multiple of the extent size hint. Alignment is not considered at all. > Also it looks like there is no difference with ATOMIC_WRITE AND > FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is > adding additional natural alignment restrictions on pos and len). Atomic write requires additional hardware support, and it restricts the valid sizes of extent size hints that can be set. Only atomic writes can be done on files marked as configured for atomic writes; force alignment can be done on any file... > So why maintain 2 separate on disk inode flags for FORCEALIGN AND > ATOMIC_WRITE? the atomic write flag indicates that a file has been set up correctly for atomic writes to be able to issues reliably. force alignment doesn't guarantee that - it's just a mechanism that tells the allocator to behave a specific way. > - Do you foresee FORCEALIGN to be also used at other places w/o > ATOMIC_WRITE where feature differentiation between the two on an > inode is required? The already exist. For example, reliably allocating huge page mappings on DAX filesystems requires 2MB forced alignment. > - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN > & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too? Same as above. > - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata > changes within XFS filesystem to support atomic writes, right? Because if you downgrade the kernel to something that doesn't support atomic writes, then non-atomic sized/aligned data can be written to the file and/or torn writes can occur. Worse, extent size hints that don't match the underlying hardware support could be set up for inodes, and when the kernel is upgraded again then atomic writes will fail on inodes that have atomic write flags set on them.... > Is it something to just prevent users from destroying their own data > by not allowing a rw mount from an older kernel where users could do > unaligned writes to files marked for atomic writes? > Or is there any other reasoning to prevent XFS filesystem from becoming > inconsistent if an older kernel does a rw mount here. The older kernel does not know what the unknown inode flag means (i.e. atomic writes) and so, by definition, we cannot allow it to modify metadata or file data because it may not modify it in the correct way for that flag being set on the inode. Kernels that don't understand feature flags need to treat the filesystem as read-only, no matter how trivial the feature addition might seem. > > so unmap alignment > > has to be set up correctly at file offset level before we even know > > what extents underly the file range we need to unmap.... > > > >> If the start or end of the extent which needs unmapping is > >> unaligned then we convert that extent to unwritten and skip, > >> is it? (__xfs_bunmapi()) > > > > The high level code should be aligning the start and end of the > > file range to be removed via xfs_inode_alloc_unitsize(). Hence > > the low level __xfs_bunmapi() code shouldn't ever be encountering > > unaligned unmaps on force-aligned inodes. > > > > Yes, but isn't this code snippet trying to handle a case when it finds an > unaligned extent during unmap? It does exactly that. > And what we are essentially trying to > do here is leave the unwritten extent as is and if the found extent is > written then convert to unwritten and skip it (goto nodelete). This > means with forcealign if we encounter an unaligned extent then the file > will have this space reserved as is with extent marked unwritten. > > Is this understanding correct? Yes, except for the fact that this code is not triggered by force alignment. This code is preexisting realtime file functionality. It is used when the rtextent size is larger than a single filesytem block. In these configurations, we do allocation in rtextsize units, but we track written/unwritten extents in the BMBT on filesystem block granularity. Hence we can have unaligned written/unwritten extent boundaries, and if we aren't unmapping a whole rtextent then we simply mark the unused portion of it as unwritten in the BMBT to indicate it contains zeroes. IOWs, existing RT files have different IO alignment and written/unwritten extent conversion behaviour to the new forced alignment feature. The implementation code is shared in many places, but the higher level forced alignment functionality ensures there are never unaligned unwritten extents created or unaligned unmappings asked for. Hence this code does not trigger for the forced alignment cases. i.e. we have multiple seperate allocation alignment behaviours that share an implementation, but they do not all behave exactly the same way or provide the same alignment guarantees.... -Dave.
> >> 1. As I understand XFS untorn atomic write support is built on top >> of FORCEALIGN feature (which this series is adding) which in turn >> uses extsize hint feature underneath. > > Yes. > >> Now extsize hint mainly controls the alignment of both >> "physical start" & "logical start" offset and extent length, >> correct? > > Yes. To be clear, only for atomic writes do we require physical block alignment. > >> This is done using args->alignment for start aand >> args->prod/mode variables for extent length. Correct? > > Yes. > >> - If say we are not able to allocate an aligned physical start? >> Then since extsize is just a hint we go ahead with whatever >> best available extent is right? --- > >> >> - Does the unmapping of extents also only happens in extsize >> chunks (with forcealign)? > > Yes, via use of xfs_inode_alloc_unitsize() in the high level code > aligning the fsbno ranges to be unmapped. > > Remember, force align requires both logical file offset and > physical block number to be correctly aligned, so unmap alignment > has to be set up correctly at file offset level before we even know > what extents underly the file range we need to unmap.... > >> If the start or end of the extent which needs unmapping is >> unaligned then we convert that extent to unwritten and skip, >> is it? (__xfs_bunmapi()) > > The high level code should be aligning the start and end of the > file range to be removed via xfs_inode_alloc_unitsize(). Is that the case for something like truncate? There we just say what is the end block which we want to truncate to in xfs_itruncate_extents_flags(new_size) -> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit aligned. > Hence > the low level __xfs_bunmapi() code shouldn't ever be encountering > unaligned unmaps on force-aligned inodes.
On Thu, Sep 05, 2024 at 11:15:41AM +0100, John Garry wrote: > > > - Does the unmapping of extents also only happens in extsize > > > chunks (with forcealign)? > > > > Yes, via use of xfs_inode_alloc_unitsize() in the high level code > > aligning the fsbno ranges to be unmapped. > > > > Remember, force align requires both logical file offset and > > physical block number to be correctly aligned, so unmap alignment > > has to be set up correctly at file offset level before we even know > > what extents underly the file range we need to unmap.... > > > > > If the start or end of the extent which needs unmapping is > > > unaligned then we convert that extent to unwritten and skip, > > > is it? (__xfs_bunmapi()) > > > > The high level code should be aligning the start and end of the > > file range to be removed via xfs_inode_alloc_unitsize(). > > Is that the case for something like truncate? There we just say what is the > end block which we want to truncate to in > xfs_itruncate_extents_flags(new_size) -> > xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit > aligned. Ah, I thought we had that alignment in xfs_itruncate_extents_flags() already, but if we don't then that's a bug that needs to be fixed. We change the space reservation in xfs-setattr_size() for this case (patch 9) but then don't do any alignment there - it relies on xfs_itruncate_extents_flags() to do the right thing w.r.t. extent removal alignment w.r.t. the new EOF. i.e. The xfs_setattr_size() code takes care of EOF block zeroing and page cache removal so the user doesn't see old data beyond EOF, whilst xfs_itruncate_extents_flags() is supposed to take care of the extent removal and the details of that operation (e.g. alignment). Patch 10 also modifies xfs_can_free_eofblocks() to take alignment into account for the post-eof block removal, but doesn't change xfs_free_eofblocks() at all. i.e it also relies on xfs_itruncate_extents_flags() to do the right thing for force aligned inodes. In this case, we are removing post-eof speculative preallocation that that has been allocated by delalloc conversion during writeback. These post-eof extents will already be unwritten extents because delalloc conversion uses unwritten extents to avoid stale data exposure if we crash between allocation and the data being written to the extents. Hence there should be no extents to convert to unwritten in the majority of cases here. The only case where we might get written extents beyond EOF is if the file has been truncated down, but in that case we don't really care because truncate should have already taken care of post-eof extent alignment for us. xfs_can_free_eofblocks() will see this extent alignment and so we'll skip xfs_free_eofblocks() in this case altogether.... Hence xfs_free_eofblocks() should never need to convert a partial unaligned extent range to unwritten when force-align is enabled because the post-eof extents should already be unwritten. We also want to leave the inode in the most optimal state for future extension, which means we want the post-eof extent to be correctly aligned. Hence there are multiple reasons that xfs_itruncate_extents_flags() should be aligning the post-EOF block it is starting the unmapping at for force aligned allocation contexts. And in doing so, we remove the weird corner case where we can have an unaligned extent state boundary at EOF for atomic writes.... -Dave.
On 05/09/2024 22:47, Dave Chinner wrote: >>>> If the start or end of the extent which needs unmapping is >>>> unaligned then we convert that extent to unwritten and skip, >>>> is it? (__xfs_bunmapi()) >>> The high level code should be aligning the start and end of the >>> file range to be removed via xfs_inode_alloc_unitsize(). >> Is that the case for something like truncate? There we just say what is the >> end block which we want to truncate to in >> xfs_itruncate_extents_flags(new_size) -> >> xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit >> aligned. > Ah, I thought we had that alignment in xfs_itruncate_extents_flags() > already, but if we don't then that's a bug that needs to be fixed. AFAICS, forcealign behaviour is same as RT, so then this would be a mainline bug, right? > > We change the space reservation in xfs-setattr_size() for this case > (patch 9) but then don't do any alignment there - it relies on > xfs_itruncate_extents_flags() to do the right thing w.r.t. extent > removal alignment w.r.t. the new EOF. > > i.e. The xfs_setattr_size() code takes care of EOF block zeroing and > page cache removal so the user doesn't see old data beyond EOF, > whilst xfs_itruncate_extents_flags() is supposed to take care of the > extent removal and the details of that operation (e.g. alignment). So we should roundup the unmap block to the alloc unit, correct? I have my doubts about that, and thought that xfs_bunmapi_range() takes care of any alignment handling. > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment > into account for the post-eof block removal, but doesn't change > xfs_free_eofblocks() at all. i.e it also relies on > xfs_itruncate_extents_flags() to do the right thing for force > aligned inodes. What state should the blocks post-EOF blocks be? A simple example of partially truncating an alloc unit is: $xfs_io -c "extsize" mnt/file [16384] mnt/file $xfs_bmap -vvp mnt/file mnt/file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..20479]: 192..20671 0 (192..20671) 20480 000000 $truncate -s 10461184 mnt/file # 10M - 6FSB $xfs_bmap -vvp mnt/file mnt/file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..20431]: 192..20623 0 (192..20623) 20432 000000 1: [20432..20447]: 20624..20639 0 (20624..20639) 16 010000 FLAG Values: 0010000 Unwritten preallocated extent Is that incorrect state? > > In this case, we are removing post-eof speculative preallocation > that that has been allocated by delalloc conversion during > writeback. These post-eof extents will already be unwritten extents > because delalloc conversion uses unwritten extents to avoid > stale data exposure if we crash between allocation and the data > being written to the extents. Hence there should be no extents to > convert to unwritten in the majority of cases here. > > The only case where we might get written extents beyond EOF is if > the file has been truncated down, but in that case we don't really > care because truncate should have already taken care of post-eof > extent alignment for us. xfs_can_free_eofblocks() will see this > extent alignment and so we'll skip xfs_free_eofblocks() in this case > altogether.... > > Hence xfs_free_eofblocks() should never need to convert a partial > unaligned extent range to unwritten when force-align is enabled > because the post-eof extents should already be unwritten. We also > want to leave the inode in the most optimal state for future > extension, which means we want the post-eof extent to be correctly > aligned. > > Hence there are multiple reasons that xfs_itruncate_extents_flags() > should be aligning the post-EOF block it is starting the unmapping > at for force aligned allocation contexts. And in doing so, we remove > the weird corner case where we can have an unaligned extent state > boundary at EOF for atomic writes.... Yeah, I don't think that sub-alloc unit extent zeroing would help us there, as we not be dealing with a new extent (for zeroing to occur). Thanks, John
On Fri, Sep 06, 2024 at 03:31:43PM +0100, John Garry wrote: > On 05/09/2024 22:47, Dave Chinner wrote: > > > > > If the start or end of the extent which needs unmapping is > > > > > unaligned then we convert that extent to unwritten and skip, > > > > > is it? (__xfs_bunmapi()) > > > > The high level code should be aligning the start and end of the > > > > file range to be removed via xfs_inode_alloc_unitsize(). > > > Is that the case for something like truncate? There we just say what is the > > > end block which we want to truncate to in > > > xfs_itruncate_extents_flags(new_size) -> > > > xfs_bunmapi_range(XFS_B_TO_FSB(new_size)), and that may not be alloc unit > > > aligned. > > Ah, I thought we had that alignment in xfs_itruncate_extents_flags() > > already, but if we don't then that's a bug that needs to be fixed. > > AFAICS, forcealign behaviour is same as RT, so then this would be a mainline > bug, right? No, I don't think so. I think this is a case where forcealign and RT behaviours differ, primarily because RT doesn't actually care about file offset -> physical offset alignment and can do unaligned IO whenever it wants. Hence having an unaligned written->unwritten extent state transition doesnt' affect anything for rt files... > > > > We change the space reservation in xfs-setattr_size() for this case > > (patch 9) but then don't do any alignment there - it relies on > > xfs_itruncate_extents_flags() to do the right thing w.r.t. extent > > removal alignment w.r.t. the new EOF. > > > > i.e. The xfs_setattr_size() code takes care of EOF block zeroing and > > page cache removal so the user doesn't see old data beyond EOF, > > whilst xfs_itruncate_extents_flags() is supposed to take care of the > > extent removal and the details of that operation (e.g. alignment). > > So we should roundup the unmap block to the alloc unit, correct? I have my > doubts about that, and thought that xfs_bunmapi_range() takes care of any > alignment handling. The start should round up, yes. And, no, xfs_bunmapi_range() isn't specifically "taking care" of alignment. It's just marking a partial alloc unit range as unwritten, which means we now have -unaligned extents- in the BMBT. > > > > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment > > into account for the post-eof block removal, but doesn't change > > xfs_free_eofblocks() at all. i.e it also relies on > > xfs_itruncate_extents_flags() to do the right thing for force > > aligned inodes. > > What state should the blocks post-EOF blocks be? A simple example of > partially truncating an alloc unit is: > > $xfs_io -c "extsize" mnt/file > [16384] mnt/file > > > $xfs_bmap -vvp mnt/file > mnt/file: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..20479]: 192..20671 0 (192..20671) 20480 000000 > > > $truncate -s 10461184 mnt/file # 10M - 6FSB > > $xfs_bmap -vvp mnt/file > mnt/file: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..20431]: 192..20623 0 (192..20623) 20432 000000 > 1: [20432..20447]: 20624..20639 0 (20624..20639) 16 010000 > FLAG Values: > 0010000 Unwritten preallocated extent > > Is that incorrect state? Think about it: what happens if you now truncate it back up to 10MB (i.e. aligned length) and then do an aligned atomic write on it. First: What happens when you truncate up? ...... Yes, iomap_zero_range() will see the unwritten extent and skip it. i.e. The unwritten extent stays as an unwritten extent, it's now within EOF. That written->unwritten extent boundary is not on an aligned file offset. Second: What happens when you do a correctly aligned atomic write that spans this range now? ...... Iomap only maps a single extent at a time, so it will only map the written range from the start of the IO (aligned) to the start of the unwritten extent (unaligned). Hence the atomic write will be rejected because we can't do the atomic write to such an unaligned extent. That's not a bug in the atomic write path - this failure occurs because of the truncate behaviour doing post-eof unwritten extent conversion.... Yes, I agree that the entire -physical- extent is still correctly aligned on disk so you could argue that the unwritten conversion that xfs_bunmapi_range is doing is valid forced alignment behaviour. However, the fact is that breaking the aligned physical extent into two unaligned contiguous extents in different states in the BMBT means that they are treated as two seperate unaligned extents, not one contiguous aligned physical extent. This extent state discontiunity is results in breaking physical IO across the extent state boundary. Hence such an unaligned extent state change violates the physical IO alignment guarantees that forced alignment is supposed to provide atomic writes... This is the reason why operations like hole punching round to extent size for forced alignment at the highest level. i.e. We cannot allow xfs_bunmapi_range() to "take care" of alignment handling because managing partial extent unmappings with sub-alloc-unit unwritten extents does not work for forced alignment. Again, this is different to the traditional RT file behaviour - it can use unwritten extents for sub-alloc-unit alignment unmaps because the RT device can align file offset to any physical offset, and issue unaligned sector sized IO without any restrictions. Forced alignment does not have this freedom, and when we extend forced alignment to RT files, it will not have the freedom to use unwritten extents for sub-alloc-unit unmapping, either. -Dave.
>> >> AFAICS, forcealign behaviour is same as RT, so then this would be a mainline >> bug, right? > > No, I don't think so. I think this is a case where forcealign and RT > behaviours differ, primarily because RT doesn't actually care about > file offset -> physical offset alignment and can do unaligned IO > whenever it wants. Hence having an unaligned written->unwritten > extent state transition doesnt' affect anything for rt files... ok > >> >>>> We change the space reservation in xfs-setattr_size() for this case >>> (patch 9) but then don't do any alignment there - it relies on >>> xfs_itruncate_extents_flags() to do the right thing w.r.t. extent >>> removal alignment w.r.t. the new EOF. >>> >>> i.e. The xfs_setattr_size() code takes care of EOF block zeroing and >>> page cache removal so the user doesn't see old data beyond EOF, >>> whilst xfs_itruncate_extents_flags() is supposed to take care of the >>> extent removal and the details of that operation (e.g. alignment). >> >> So we should roundup the unmap block to the alloc unit, correct? I have my >> doubts about that, and thought that xfs_bunmapi_range() takes care of any >> alignment handling. > > The start should round up, yes. And, no, xfs_bunmapi_range() isn't > specifically "taking care" of alignment. It's just marking a partial > alloc unit range as unwritten, which means we now have -unaligned > extents- in the BMBT. Yes, I have noticed this previously. > >> >>> >>> Patch 10 also modifies xfs_can_free_eofblocks() to take alignment >>> into account for the post-eof block removal, but doesn't change >>> xfs_free_eofblocks() at all. i.e it also relies on >>> xfs_itruncate_extents_flags() to do the right thing for force >>> aligned inodes. >> >> What state should the blocks post-EOF blocks be? A simple example of >> partially truncating an alloc unit is: >> >> $xfs_io -c "extsize" mnt/file >> [16384] mnt/file >> >> >> $xfs_bmap -vvp mnt/file >> mnt/file: >> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >> 0: [0..20479]: 192..20671 0 (192..20671) 20480 000000 >> >> >> $truncate -s 10461184 mnt/file # 10M - 6FSB >> >> $xfs_bmap -vvp mnt/file >> mnt/file: >> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >> 0: [0..20431]: 192..20623 0 (192..20623) 20432 000000 >> 1: [20432..20447]: 20624..20639 0 (20624..20639) 16 010000 >> FLAG Values: >> 0010000 Unwritten preallocated extent >> >> Is that incorrect state? > > Think about it: what happens if you now truncate it back up to 10MB > (i.e. aligned length) and then do an aligned atomic write on it. > > First: What happens when you truncate up? > > ...... > > Yes, iomap_zero_range() will see the unwritten extent and skip it. > i.e. The unwritten extent stays as an unwritten extent, it's now > within EOF. That written->unwritten extent boundary is not on an > aligned file offset. Right > > Second: What happens when you do a correctly aligned atomic write > that spans this range now? > > ...... > > Iomap only maps a single extent at a time, so it will only map the > written range from the start of the IO (aligned) to the start of the > unwritten extent (unaligned). Hence the atomic write will be > rejected because we can't do the atomic write to such an unaligned > extent. It was being considered to change this handling for atomic writes. More below at *. > > That's not a bug in the atomic write path - this failure occurs > because of the truncate behaviour doing post-eof unwritten extent > conversion.... > > Yes, I agree that the entire -physical- extent is still correctly > aligned on disk so you could argue that the unwritten conversion > that xfs_bunmapi_range is doing is valid forced alignment behaviour. > However, the fact is that breaking the aligned physical extent into > two unaligned contiguous extents in different states in the BMBT > means that they are treated as two seperate unaligned extents, not > one contiguous aligned physical extent. Right, this is problematic. * I guess that you had not been following the recent discussion on this topic in the latest xfs atomic writes series @ https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ and also mentioned earlier in https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/ There I dropped the sub-alloc unit zeroing. The concept to iter for a single bio seems sane, but as Darrick mentioned, we have issue of non-atomically committing all the extent conversions. FWIW, this is how I think that the change according to Darrick's idea would look: ---->8---- diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 72c981e3dc92..ec76d6a8750d 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -244,7 +244,8 @@ xfs_iomap_write_direct( xfs_fileoff_t count_fsb, unsigned int flags, struct xfs_bmbt_irec *imap, - u64 *seq) + u64 *seq, + bool zeroing) { struct xfs_mount *mp = ip->i_mount; struct xfs_trans *tp; @@ -285,7 +286,7 @@ xfs_iomap_write_direct( * the reserve block pool for bmbt block allocation if there is no space * left but we need to do unwritten extent conversion. */ + if (zeroing) + bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO; if (flags & IOMAP_DAX) { bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO; if (imap->br_state == XFS_EXT_UNWRITTEN) { force = true; @@ -780,6 +781,11 @@ xfs_direct_write_iomap_begin( u16 iomap_flags = 0; unsigned int lockmode; u64 seq; + bool atomic = flags & IOMAP_ATOMIC; + struct xfs_bmbt_irec imap2[XFS_BMAP_MAX_NMAP]; + xfs_fileoff_t _offset_fsb = xfs_inode_rounddown_alloc_unit(ip, offset_fsb); + xfs_fileoff_t _end_fsb = xfs_inode_roundup_alloc_unit(ip, end_fsb); + int loop_index; ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO)); @@ -843,6 +849,9 @@ xfs_direct_write_iomap_begin( if (imap_needs_alloc(inode, flags, &imap, nimaps)) goto allocate_blocks; + if (atomic && !imap_spans_range(&imap, offset_fsb, end_fsb)) + goto out_atomic; + /* * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with * a single map so that we avoid partial IO failures due to the rest of @@ -897,7 +906,7 @@ xfs_direct_write_iomap_begin( xfs_iunlock(ip, lockmode); error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb, - flags, &imap, &seq); + flags, &imap, &seq, false); if (error) return error; @@ -918,6 +927,28 @@ xfs_direct_write_iomap_begin( xfs_iunlock(ip, lockmode); return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq); +out_atomic: + nimaps = XFS_BMAP_MAX_NMAP; + + error = xfs_bmapi_read(ip, _offset_fsb, _end_fsb - _offset_fsb, &imap2[0], + &nimaps, 0); + for (loop_index = 0; loop_index < nimaps; loop_index++) { + struct xfs_bmbt_irec *_imap2 = &imap2[loop_index]; + + if (_imap2->br_state == XFS_EXT_UNWRITTEN) { + + xfs_iunlock(ip, lockmode); + + error = xfs_iomap_write_direct(ip, _imap2->br_startoff, _imap2->br_blockcount, + flags, &imap, &seq, true); + if (error) + return error; + goto relock; + } + } + /* We should not reach this, but TODO better handling */ + error = -EINVAL; + out_unlock: if (lockmode) xfs_iunlock(ip, lockmode); -----8<---- But I have my doubts about using XFS_BMAPI_ZERO, even if it is just for pre-zeroing unwritten parts of the alloc unit for an atomic write. > > This extent state discontiunity is results in breaking physical IO > across the extent state boundary. Hence such an unaligned extent > state change violates the physical IO alignment guarantees that > forced alignment is supposed to provide atomic writes... Yes > > This is the reason why operations like hole punching round to extent > size for forced alignment at the highest level. i.e. We cannot allow > xfs_bunmapi_range() to "take care" of alignment handling because > managing partial extent unmappings with sub-alloc-unit unwritten > extents does not work for forced alignment. > > Again, this is different to the traditional RT file behaviour - it > can use unwritten extents for sub-alloc-unit alignment unmaps > because the RT device can align file offset to any physical offset, > and issue unaligned sector sized IO without any restrictions. Forced > alignment does not have this freedom, and when we extend forced > alignment to RT files, it will not have the freedom to use > unwritten extents for sub-alloc-unit unmapping, either. > So how do you think that we should actually implement xfs_itruncate_extents_flags() properly for forcealign? Would it simply be like: --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); return 0; } + if (xfs_inode_has_forcealign(ip)) + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, first_unmap_block); error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, Thanks, John
Dave Chinner <david@fromorbit.com> writes: > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote: >> Dave Chinner <david@fromorbit.com> writes: >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation >> >> (by using extsize hint) and de-allocation to happen _only_ in >> >> extsize chunks. >> >> >> >> i.e. forcealign mandates - >> >> - the logical and physical start offset should be aligned as >> >> per args->alignment >> >> - extent length be aligned as per args->prod/mod. >> >> If above two cannot be satisfied then return -ENOSPC. >> > >> > Yes. >> > >> >> >> >> - Does the unmapping of extents also only happens in extsize >> >> chunks (with forcealign)? >> > >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code >> > aligning the fsbno ranges to be unmapped. >> > >> > Remember, force align requires both logical file offset and >> > physical block number to be correctly aligned, >> >> This is where I would like to double confirm it again. Even the >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned >> physical start and logical start file offset and length right? > > No. > >> (Or does extsize hint only restricts alignment to logical start file >> offset + length and not the physical start?) > > Neither. > Yes, thanks for the correction. Indeed extsize hint does not take care of the physical start alignment at all. > extsize hint by itself (i.e. existing behaviour) has no alignment > effect at all. All it affects is -size- of the extent. i.e. once > the extent start is chosen, extent size hints will trim the length > of the extent to a multiple of the extent size hint. Alignment is > not considered at all. > Please correct me I wrong here... but XFS considers aligning the logical start and the length of the allocated extent (for extsize) as per below code right? i.e. 1) xfs_direct_write_iomap_begin() { <...> if (offset + length > XFS_ISIZE(ip)) end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb); => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align); return aligned_end_fsb } 2) xfs_bmap_compute_alignments() { <...> else if (ap->datatype & XFS_ALLOC_USERDATA) align = xfs_get_extsz_hint(ap->ip); if (align) { if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0, ap->eof, 0, ap->conv, &ap->offset, &ap->length)) ASSERT(0); ASSERT(ap->length); args->prod = align; div_u64_rem(ap->offset, args->prod, &args->mod); if (args->mod) args->mod = args->prod - args->mod; } <...> } So args->prod and args->mod... aren't they use to align the logical start and the length of the extent? However, I do notice that when the file is closed XFS trims the length allocated beyond EOF boundary (for extsize but not for forcealign from the new forcealign series) i.e. xfs_file_release() -> xfs_release() -> xfs_free_eofblocks() I guess that is because xfs_can_free_eofblocks() does not consider alignment for extsize in this function xfs_can_free_eofblocks() { <...> end_fsb = xfs_inode_roundup_alloc_unit(ip, XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip))); <...> } >> Also it looks like there is no difference with ATOMIC_WRITE AND >> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is >> adding additional natural alignment restrictions on pos and len). > > Atomic write requires additional hardware support, and it restricts > the valid sizes of extent size hints that can be set. Only atomic > writes can be done on files marked as configured for atomic writes; > force alignment can be done on any file... > >> So why maintain 2 separate on disk inode flags for FORCEALIGN AND >> ATOMIC_WRITE? > > the atomic write flag indicates that a file has been set up > correctly for atomic writes to be able to issues reliably. force > alignment doesn't guarantee that - it's just a mechanism that tells > the allocator to behave a specific way. > >> - Do you foresee FORCEALIGN to be also used at other places w/o >> ATOMIC_WRITE where feature differentiation between the two on an >> inode is required? > > The already exist. For example, reliably allocating huge page > mappings on DAX filesystems requires 2MB forced alignment. > >> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN >> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too? > > Same as above. > >> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata >> changes within XFS filesystem to support atomic writes, right? > > Because if you downgrade the kernel to something that doesn't > support atomic writes, then non-atomic sized/aligned data can be > written to the file and/or torn writes can occur. > > Worse, extent size hints that don't match the underlying hardware > support could be set up for inodes, and when the kernel is upgraded > again then atomic writes will fail on inodes that have atomic write > flags set on them.... > >> Is it something to just prevent users from destroying their own data >> by not allowing a rw mount from an older kernel where users could do >> unaligned writes to files marked for atomic writes? >> Or is there any other reasoning to prevent XFS filesystem from becoming >> inconsistent if an older kernel does a rw mount here. > > The older kernel does not know what the unknown inode flag means > (i.e. atomic writes) and so, by definition, we cannot allow it to > modify metadata or file data because it may not modify it in the > correct way for that flag being set on the inode. > > Kernels that don't understand feature flags need to treat the > filesystem as read-only, no matter how trivial the feature addition > might seem. > >> > so unmap alignment >> > has to be set up correctly at file offset level before we even know >> > what extents underly the file range we need to unmap.... >> > >> >> If the start or end of the extent which needs unmapping is >> >> unaligned then we convert that extent to unwritten and skip, >> >> is it? (__xfs_bunmapi()) >> > >> > The high level code should be aligning the start and end of the >> > file range to be removed via xfs_inode_alloc_unitsize(). Hence >> > the low level __xfs_bunmapi() code shouldn't ever be encountering >> > unaligned unmaps on force-aligned inodes. >> > >> >> Yes, but isn't this code snippet trying to handle a case when it finds an >> unaligned extent during unmap? > > It does exactly that. > >> And what we are essentially trying to >> do here is leave the unwritten extent as is and if the found extent is >> written then convert to unwritten and skip it (goto nodelete). This >> means with forcealign if we encounter an unaligned extent then the file >> will have this space reserved as is with extent marked unwritten. >> >> Is this understanding correct? > > Yes, except for the fact that this code is not triggered by force > alignment. > > This code is preexisting realtime file functionality. It is used > when the rtextent size is larger than a single filesytem block. > > In these configurations, we do allocation in rtextsize units, but we > track written/unwritten extents in the BMBT on filesystem block > granularity. Hence we can have unaligned written/unwritten extent > boundaries, and if we aren't unmapping a whole rtextent then we > simply mark the unused portion of it as unwritten in the BMBT to > indicate it contains zeroes. > > IOWs, existing RT files have different IO alignment and > written/unwritten extent conversion behaviour to the new forced > alignment feature. The implementation code is shared in many places, > but the higher level forced alignment functionality ensures there > are never unaligned unwritten extents created or unaligned > unmappings asked for. Hence this code does not trigger for the > forced alignment cases. > > i.e. we have multiple seperate allocation alignment behaviours that > share an implementation, but they do not all behave exactly the same > way or provide the same alignment guarantees.... > Thanks for taking time and explaining this. -ritesh
Dave Chinner <david@fromorbit.com> writes: > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote: >> Dave Chinner <david@fromorbit.com> writes: >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation >> >> (by using extsize hint) and de-allocation to happen _only_ in >> >> extsize chunks. >> >> >> >> i.e. forcealign mandates - >> >> - the logical and physical start offset should be aligned as >> >> per args->alignment >> >> - extent length be aligned as per args->prod/mod. >> >> If above two cannot be satisfied then return -ENOSPC. >> > >> > Yes. >> > >> >> >> >> - Does the unmapping of extents also only happens in extsize >> >> chunks (with forcealign)? >> > >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code >> > aligning the fsbno ranges to be unmapped. >> > >> > Remember, force align requires both logical file offset and >> > physical block number to be correctly aligned, >> >> This is where I would like to double confirm it again. Even the >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned >> physical start and logical start file offset and length right? > > No. > >> (Or does extsize hint only restricts alignment to logical start file >> offset + length and not the physical start?) > > Neither. > > extsize hint by itself (i.e. existing behaviour) has no alignment > effect at all. All it affects is -size- of the extent. i.e. once > the extent start is chosen, extent size hints will trim the length > of the extent to a multiple of the extent size hint. Alignment is > not considered at all. > >> Also it looks like there is no difference with ATOMIC_WRITE AND >> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is >> adding additional natural alignment restrictions on pos and len). > > Atomic write requires additional hardware support, and it restricts > the valid sizes of extent size hints that can be set. Only atomic > writes can be done on files marked as configured for atomic writes; > force alignment can be done on any file... > >> So why maintain 2 separate on disk inode flags for FORCEALIGN AND >> ATOMIC_WRITE? > > the atomic write flag indicates that a file has been set up > correctly for atomic writes to be able to issues reliably. force > alignment doesn't guarantee that - it's just a mechanism that tells > the allocator to behave a specific way. > >> - Do you foresee FORCEALIGN to be also used at other places w/o >> ATOMIC_WRITE where feature differentiation between the two on an >> inode is required? > > The already exist. For example, reliably allocating huge page > mappings on DAX filesystems requires 2MB forced alignment. > >> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN >> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too? > > Same as above. > >> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata >> changes within XFS filesystem to support atomic writes, right? > > Because if you downgrade the kernel to something that doesn't > support atomic writes, then non-atomic sized/aligned data can be > written to the file and/or torn writes can occur. > > Worse, extent size hints that don't match the underlying hardware > support could be set up for inodes, and when the kernel is upgraded > again then atomic writes will fail on inodes that have atomic write > flags set on them.... > >> Is it something to just prevent users from destroying their own data >> by not allowing a rw mount from an older kernel where users could do >> unaligned writes to files marked for atomic writes? >> Or is there any other reasoning to prevent XFS filesystem from becoming >> inconsistent if an older kernel does a rw mount here. > > The older kernel does not know what the unknown inode flag means > (i.e. atomic writes) and so, by definition, we cannot allow it to > modify metadata or file data because it may not modify it in the > correct way for that flag being set on the inode. > > Kernels that don't understand feature flags need to treat the > filesystem as read-only, no matter how trivial the feature addition > might seem. > 1. Will it require a fresh formatting of filesystem with mkfs.xfs for enabling atomic writes (/forcealign) on XFS? a. Is that because reflink is not support with atomic writes (/forcealign) today? As I understand for setting forcealign attr on any inode it checks for whether xfs_has_forcealign(mp). That means forcealign can _only_ be enabled during mkfs time and it also needs reflink to be disabled with -m reflink=0. Right? -ritesh
On Mon, Sep 09, 2024 at 05:18:43PM +0100, John Garry wrote: > > > > Patch 10 also modifies xfs_can_free_eofblocks() to take alignment > > > > into account for the post-eof block removal, but doesn't change > > > > xfs_free_eofblocks() at all. i.e it also relies on > > > > xfs_itruncate_extents_flags() to do the right thing for force > > > > aligned inodes. > > > > > > What state should the blocks post-EOF blocks be? A simple example of > > > partially truncating an alloc unit is: > > > > > > $xfs_io -c "extsize" mnt/file > > > [16384] mnt/file > > > > > > > > > $xfs_bmap -vvp mnt/file > > > mnt/file: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..20479]: 192..20671 0 (192..20671) 20480 000000 > > > > > > > > > $truncate -s 10461184 mnt/file # 10M - 6FSB > > > > > > $xfs_bmap -vvp mnt/file > > > mnt/file: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..20431]: 192..20623 0 (192..20623) 20432 000000 > > > 1: [20432..20447]: 20624..20639 0 (20624..20639) 16 010000 > > > FLAG Values: > > > 0010000 Unwritten preallocated extent > > > > > > Is that incorrect state? > > > > Think about it: what happens if you now truncate it back up to 10MB > > (i.e. aligned length) and then do an aligned atomic write on it. > > > > First: What happens when you truncate up? > > > > ...... > > > > Yes, iomap_zero_range() will see the unwritten extent and skip it. > > i.e. The unwritten extent stays as an unwritten extent, it's now > > within EOF. That written->unwritten extent boundary is not on an > > aligned file offset. > > Right > > > > > Second: What happens when you do a correctly aligned atomic write > > that spans this range now? > > > > ...... > > > > Iomap only maps a single extent at a time, so it will only map the > > written range from the start of the IO (aligned) to the start of the > > unwritten extent (unaligned). Hence the atomic write will be > > rejected because we can't do the atomic write to such an unaligned > > extent. > > It was being considered to change this handling for atomic writes. More > below at *. I don't think that this is something specific to atomic writes - forced alignment means -alignment is guaranteed- regardless of what ends up using it. Yes, we can track unwritten extents on an -unaligned- boundary, but that doesn't mean that we should allow it when we are trying to guarantee logical and physical alignment of the file offset and extent boundaries. i.e. The definition of forced alignment behaviour is that all file offsets and extents in the file are aligned to the same alignment. I don't see an exception that allows for unaligned unwritten extents in that definition. > > That's not a bug in the atomic write path - this failure occurs > > because of the truncate behaviour doing post-eof unwritten extent > > conversion.... > > > > Yes, I agree that the entire -physical- extent is still correctly > > aligned on disk so you could argue that the unwritten conversion > > that xfs_bunmapi_range is doing is valid forced alignment behaviour. > > However, the fact is that breaking the aligned physical extent into > > two unaligned contiguous extents in different states in the BMBT > > means that they are treated as two seperate unaligned extents, not > > one contiguous aligned physical extent. > > Right, this is problematic. > > * I guess that you had not been following the recent discussion on this > topic in the latest xfs atomic writes series @ https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/ > and also mentioned earlier in > https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/ > > There I dropped the sub-alloc unit zeroing. The concept to iter for a single > bio seems sane, but as Darrick mentioned, we have issue of non-atomically > committing all the extent conversions. Yes, I understand these problems exist. My entire point is that the forced alignment implemention should never allow such unaligned extent patterns to be created in the first place. If we avoid creating such situations in the first place, then we never have to care about about unaligned unwritten extent conversion breaking atomic IO. FWIW, I also understand things are different if we are doing 128kB atomic writes on 16kB force aligned files. However, in this situation we are treating the 128kB atomic IO as eight individual 16kB atomic IOs that are physically contiguous. Hence in this situation it doesn't matter if we have a mix of 16kB aligned written/unwritten/hole extents as each 16kB chunks is independent of the others. What matters is that each indivudal 16kB chunk shows either the old data or the new data - we are not guaranteeing that the entire 128kB write is atomic. Hence in this situation we can both submit and process each 16kB shunk as independent IOs with independent IO compeltion transactions. All that matters is that we don't signal completion to userspace until all the IO is complete, and we already do that for fragmented DIO writes... > > Again, this is different to the traditional RT file behaviour - it > > can use unwritten extents for sub-alloc-unit alignment unmaps > > because the RT device can align file offset to any physical offset, > > and issue unaligned sector sized IO without any restrictions. Forced > > alignment does not have this freedom, and when we extend forced > > alignment to RT files, it will not have the freedom to use > > unwritten extents for sub-alloc-unit unmapping, either. > > > So how do you think that we should actually implement > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be > like: > > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( > WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); > return 0; > } > + if (xfs_inode_has_forcealign(ip)) > + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, > first_unmap_block); > error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, Yes, it would be something like that, except it would have to be done before first_unmap_block is verified. -Dave.
On Tue, Sep 10, 2024 at 08:21:55AM +0530, Ritesh Harjani wrote: > Dave Chinner <david@fromorbit.com> writes: > > > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote: > >> Dave Chinner <david@fromorbit.com> writes: > >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: > >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation > >> >> (by using extsize hint) and de-allocation to happen _only_ in > >> >> extsize chunks. > >> >> > >> >> i.e. forcealign mandates - > >> >> - the logical and physical start offset should be aligned as > >> >> per args->alignment > >> >> - extent length be aligned as per args->prod/mod. > >> >> If above two cannot be satisfied then return -ENOSPC. > >> > > >> > Yes. > >> > > >> >> > >> >> - Does the unmapping of extents also only happens in extsize > >> >> chunks (with forcealign)? > >> > > >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code > >> > aligning the fsbno ranges to be unmapped. > >> > > >> > Remember, force align requires both logical file offset and > >> > physical block number to be correctly aligned, > >> > >> This is where I would like to double confirm it again. Even the > >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned > >> physical start and logical start file offset and length right? > > > > No. > > > >> (Or does extsize hint only restricts alignment to logical start file > >> offset + length and not the physical start?) > > > > Neither. > > > > Yes, thanks for the correction. Indeed extsize hint does not take care > of the physical start alignment at all. > > > extsize hint by itself (i.e. existing behaviour) has no alignment > > effect at all. All it affects is -size- of the extent. i.e. once > > the extent start is chosen, extent size hints will trim the length > > of the extent to a multiple of the extent size hint. Alignment is > > not considered at all. > > > > Please correct me I wrong here... but XFS considers aligning the logical > start and the length of the allocated extent (for extsize) as per below > code right? Sorry, I was talking about physical alignment, not logical file offset alignment. The logical file offset alignment that is done for extent size hints is much more convoluted and dependent on certain preconditions existing for it to function as forced alignment/atomic writes require. > > i.e. > 1) xfs_direct_write_iomap_begin() > { > <...> > if (offset + length > XFS_ISIZE(ip)) > end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb); > => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align); > return aligned_end_fsb > } That's calculating the file offset of the end of the extent for an extending write. It's not really an alignment - it's simply calculating the file offset the allocation needs to cover to allow for aligned allocation. This length needs to be fed into the transaction reservation (i.e. ENOSPC checks) before we start the allocation, so we have to have some idea of the extent size we are going to allocate here... > 2) xfs_bmap_compute_alignments() > { > <...> > else if (ap->datatype & XFS_ALLOC_USERDATA) > align = xfs_get_extsz_hint(ap->ip); > > if (align) { > if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0, > ap->eof, 0, ap->conv, &ap->offset, > &ap->length)) > ASSERT(0); > ASSERT(ap->length); > > args->prod = align; > div_u64_rem(ap->offset, args->prod, &args->mod); > if (args->mod) > args->mod = args->prod - args->mod; > } > <...> > } > > So args->prod and args->mod... aren't they use to align the logical > start and the length of the extent? Nope. They are only used way down in xfs_alloc_fix_len(), which trims the length of the selected *physical* extent to the required length. Look further up - ap->offset is the logical file offset the allocation needs to cover. Logical alignment of the offset (i.e. determining where in the file the physical extent will be placed) is done in xfs_bmap_extsize_align(). As i said above, it's not purely an extent size alignment calculation.... > However, I do notice that when the file is closed XFS trims the length > allocated beyond EOF boundary (for extsize but not for forcealign from > the new forcealign series) i.e. > > xfs_file_release() -> xfs_release() -> xfs_free_eofblocks() > > I guess that is because xfs_can_free_eofblocks() does not consider > alignment for extsize in this function Of course - who wants large chunks of space allocated beyond EOF when you are never going to write to the file again? i.e. If you have large extsize hints then the post-eof tail can consume a -lot- of space that won't otherwise get freed. This can lead to rapid, unexpected ENOSPC, and it's not clear to users what the cause is. Hence we don't care if extsz is set on the inode or not when we decide to remove post-eof blocks - reclaiming the unused space is much more important that an occasional unaligned or small extent. Forcealign changes that equation, but if you choose forcealign you are doing it for a specific reason and likely not applying it to the entire filesystem..... -Dave.
On Tue, Sep 10, 2024 at 06:03:12PM +0530, Ritesh Harjani wrote: > >> Is it something to just prevent users from destroying their own data > >> by not allowing a rw mount from an older kernel where users could do > >> unaligned writes to files marked for atomic writes? > >> Or is there any other reasoning to prevent XFS filesystem from becoming > >> inconsistent if an older kernel does a rw mount here. > > > > The older kernel does not know what the unknown inode flag means > > (i.e. atomic writes) and so, by definition, we cannot allow it to > > modify metadata or file data because it may not modify it in the > > correct way for that flag being set on the inode. > > > > Kernels that don't understand feature flags need to treat the > > filesystem as read-only, no matter how trivial the feature addition > > might seem. > > > > 1. Will it require a fresh formatting of filesystem with mkfs.xfs for > enabling atomic writes (/forcealign) on XFS? Initially, yes. > a. Is that because reflink is not support with atomic writes > (/forcealign) today? It's much more complex than that. e.g. How does force-align and COW interact, especially w.r.t. sub-alloc unit overwrites, cowextsz based preallocation and unwritten extents in the COW fork? > As I understand for setting forcealign attr on any inode it checks for > whether xfs_has_forcealign(mp). That means forcealign can _only_ be > enabled during mkfs time and it also needs reflink to be disabled with > -m reflink=0. Right? forcealign doesn't need to be completely turned off when reflink is enabled and/or vice versa. Both can co-exist in the filesytsem at the same time, but the current implementation does not allow forcealign and reflink to be used on the same inode at the same time. It was decided that the best way to handle the lack of reflink support initially was to make the two feature bits incompatible at mount time. Hence we currently have to make a non-reflink filesystem to test forcealign based functionality. However, doing it this way means that when we fix the implementation to support reflink and forcealign together, we just remove the mount time check and all existing reflink filesystems can be immediately upgraded to support forcealign. OTOH, we can't do this with atomic writes. Atomic writes require some mkfs help because they require explicit physical alignment of the filesystem to the underlying storage. Hence we'll eventually end up with atomic writes needing to be enabled at mkfs time, but force align will be an upgradeable feature flag. -Dave.
>> * I guess that you had not been following the recent discussion on this >> topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$ >> and also mentioned earlier in >> https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$ >> >> There I dropped the sub-alloc unit zeroing. The concept to iter for a single >> bio seems sane, but as Darrick mentioned, we have issue of non-atomically >> committing all the extent conversions. > > Yes, I understand these problems exist. My entire point is that the > forced alignment implemention should never allow such unaligned > extent patterns to be created in the first place. If we avoid > creating such situations in the first place, then we never have to > care about about unaligned unwritten extent conversion breaking > atomic IO. OK, but what about this situation with non-EOF unaligned extents: # xfs_io -c "lsattr -v" mnt/file [extsize, has-xattr, force-align] mnt/file # xfs_io -c "extsize" mnt/file [65536] mnt/file # # xfs_io -d -c "pwrite 64k 64k" mnt/file # xfs_io -d -c "pwrite 8k 8k" mnt/file # xfs_bmap -vvp mnt/file mnt/file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15]: 384..399 0 (384..399) 16 010000 1: [16..31]: 400..415 0 (400..415) 16 000000 2: [32..127]: 416..511 0 (416..511) 96 010000 3: [128..255]: 256..383 0 (256..383) 128 000000 FLAG Values: 0010000 Unwritten preallocated extent Here we have unaligned extents wrt extsize. The sub-alloc unit zeroing would solve that - is that what you would still advocate (to solve that issue)? > > FWIW, I also understand things are different if we are doing 128kB > atomic writes on 16kB force aligned files. However, in this > situation we are treating the 128kB atomic IO as eight individual > 16kB atomic IOs that are physically contiguous. Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes. > Hence in this > situation it doesn't matter if we have a mix of 16kB aligned > written/unwritten/hole extents as each 16kB chunks is independent of > the others. Sure > > What matters is that each indivudal 16kB chunk shows either the old > data or the new data - we are not guaranteeing that the entire 128kB > write is atomic. Hence in this situation we can both submit and > process each 16kB shunk as independent IOs with independent IO > compeltion transactions. All that matters is that we don't signal > completion to userspace until all the IO is complete, and we already > do that for fragmented DIO writes... > >>> Again, this is different to the traditional RT file behaviour - it >>> can use unwritten extents for sub-alloc-unit alignment unmaps >>> because the RT device can align file offset to any physical offset, >>> and issue unaligned sector sized IO without any restrictions. Forced >>> alignment does not have this freedom, and when we extend forced >>> alignment to RT files, it will not have the freedom to use >>> unwritten extents for sub-alloc-unit unmapping, either. >>> >> So how do you think that we should actually implement >> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be >> like: >> >> --- a/fs/xfs/xfs_inode.c >> +++ b/fs/xfs/xfs_inode.c >> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( >> WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); >> return 0; >> } >> + if (xfs_inode_has_forcealign(ip)) >> + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, >> first_unmap_block); >> error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, > > Yes, it would be something like that, except it would have to be > done before first_unmap_block is verified. > ok, and are you still of the opinion that this does not apply to rtvol? Thanks, John
On 16/09/2024 08:03, Dave Chinner wrote: > OTOH, we can't do this with atomic writes. Atomic writes require > some mkfs help because they require explicit physical alignment of > the filesystem to the underlying storage. If we are enabling atomic writes at mkfs time, then we can ensure agsize % extsize == 0. That provides the physical alignment guarantee. It also makes sense to ensure extsize is a power-of-2. However, extsize is re-configurble per inode. So, for an inode enabled for atomic writes, we must still ensure agsize % new extsize == 0 (and also new extsize is a power-of-2) > Hence we'll eventually end > up with atomic writes needing to be enabled at mkfs time, but force > align will be an upgradeable feature flag. Could atomic writes also be an upgradeable feature? We just need to ensure that agsize % extsize == 0 for an inode enabled for atomic writes. Valid extsize values may be quite limited, though, depending on the value of agsize. Thanks, John
On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote: > On 16/09/2024 08:03, Dave Chinner wrote: > > OTOH, we can't do this with atomic writes. Atomic writes require > > some mkfs help because they require explicit physical alignment of > > the filesystem to the underlying storage. Forcealign requires agsize%extsize==0, it's atomicwrites that adds the requirement that extsize be a power of 2... > If we are enabling atomic writes at mkfs time, then we can ensure agsize % > extsize == 0. That provides the physical alignment guarantee. It also makes > sense to ensure extsize is a power-of-2. > > However, extsize is re-configurble per inode. So, for an inode enabled for > atomic writes, we must still ensure agsize % new extsize == 0 (and also new > extsize is a power-of-2) ...so yes. > > Hence we'll eventually end > > up with atomic writes needing to be enabled at mkfs time, but force > > align will be an upgradeable feature flag. > > Could atomic writes also be an upgradeable feature? We just need to ensure > that agsize % extsize == 0 for an inode enabled for atomic writes. Valid > extsize values may be quite limited, though, depending on the value of > agsize. I don't see why forcealign and atomicwrites can't be added to the sb featureset after the fact, though you're right that callers of xfs_io chattr might be hard pressed to find an fsx_extsize value that fits. --D > Thanks, > John > >
On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote: > On 16/09/2024 08:03, Dave Chinner wrote: > > OTOH, we can't do this with atomic writes. Atomic writes require > > some mkfs help because they require explicit physical alignment of > > the filesystem to the underlying storage. > > If we are enabling atomic writes at mkfs time, then we can ensure agsize % > extsize == 0. That provides the physical alignment guarantee. It also makes > sense to ensure extsize is a power-of-2. No, mkfs does not want to align anything to "extsize". It needs to align the filesystem geometry to be compatible with the underlying block device atomic write alignment parameters. We just don't care if extsize is not an exact multiple of agsize. As long as extsize is aligned to the atomic write boundaries and the start of the AG is aligned to atomic write boundaries, we can allocate hardware aligned extsize sized extents from the AG. AGs are always going to contain lots of non-aligned, randomly sized extents for other stuff like metadata and unaligned file data. Aligned allocation is all about finding extsized aligned free space within the AG and has nothing to do with the size of the AG itself. > However, extsize is re-configurble per inode. So, for an inode enabled for > atomic writes, we must still ensure agsize % new extsize == 0 (and also new > extsize is a power-of-2) Ensuring that the extsize is aligned to the hardware atomic write limits is a kernel runtime check when enabling atomic writes on an inode. In this case, we do not care what the AG size is - it is completely irrelevant to these per-inode runtime checks because mkfs has already guaranteed that the AG is correctly aligned to the underlying hardware. That means is extsize is also aligned to the underlying hardware, physical extent layout is guaranteed to be compatible with the hardware constraints for atomic writes... > > Hence we'll eventually end > > up with atomic writes needing to be enabled at mkfs time, but force > > align will be an upgradeable feature flag. > > Could atomic writes also be an upgradeable feature? We just need to ensure > that agsize % extsize == 0 for an inode enabled for atomic writes. To turn the superblock feature bit on, we have to check the AGs are correctly aligned to the *underlying hardware*. If they aren't correctly aligned (and there is a good chance they will not be) then we can't enable atomic writes at all. The only way to change this is to physically move AGs around in the block device (i.e. via xfs_expand tool I proposed). i.e. the mkfs dependency on having the AGs aligned to the underlying atomic write capabilities of the block device never goes away, even if we want to make the feature dynamically enabled. IOWs, yes, an existing filesystem -could- be upgradeable, but there is no guarantee that is will be. Quite frankly, we aren't going to see block devices that filesystems already exist on suddenly sprout support for atomic writes mid-life. Hence if mkfs detects atomic write support in the underlying device, it should *always* modify the geometry to be compatible with atomic writes and enable atomic write support. Yes, that means the "incompat with reflink" issue needs to be fixed before we take atomic writes out of experimental (i.e. we consistently apply the same "full support" criteria we applied to DAX). Hence by the time atomic writes are a fully supported feature, we're going to be able to enable them by default at mkfs time for any hardware that supports them... > Valid > extsize values may be quite limited, though, depending on the value of > agsize. No. The only limit agsize puts on extsize is that a single aligned extent can't be larger than half the AG size. Forced alignment and atomic writes don't change that. -Dave.
On Mon, Sep 16, 2024 at 10:44:38AM +0100, John Garry wrote: > > > > * I guess that you had not been following the recent discussion on this > > > topic in the latest xfs atomic writes series @ https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOEV0ciu8$ > > > and also mentioned earlier in > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ!JIzCbkyp3JuPyzBx1n80WAdog5rLxMRB65FYrs1sFf3ei-wOdqrU_DZBE5zwrJXhrj949HSE0TwOiiEnYSk$ > > > > > > There I dropped the sub-alloc unit zeroing. The concept to iter for a single > > > bio seems sane, but as Darrick mentioned, we have issue of non-atomically > > > committing all the extent conversions. > > > > Yes, I understand these problems exist. My entire point is that the > > forced alignment implemention should never allow such unaligned > > extent patterns to be created in the first place. If we avoid > > creating such situations in the first place, then we never have to > > care about about unaligned unwritten extent conversion breaking > > atomic IO. > > OK, but what about this situation with non-EOF unaligned extents: > > # xfs_io -c "lsattr -v" mnt/file > [extsize, has-xattr, force-align] mnt/file > # xfs_io -c "extsize" mnt/file > [65536] mnt/file > # > # xfs_io -d -c "pwrite 64k 64k" mnt/file > # xfs_io -d -c "pwrite 8k 8k" mnt/file > # xfs_bmap -vvp mnt/file > mnt/file: > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..15]: 384..399 0 (384..399) 16 010000 > 1: [16..31]: 400..415 0 (400..415) 16 000000 > 2: [32..127]: 416..511 0 (416..511) 96 010000 > 3: [128..255]: 256..383 0 (256..383) 128 000000 > FLAG Values: > 0010000 Unwritten preallocated extent > > Here we have unaligned extents wrt extsize. > > The sub-alloc unit zeroing would solve that - is that what you would still > advocate (to solve that issue)? Yes, I thought that was already implemented for force-align with the DIO code via the extsize zero-around changes in the iomap code. Why isn't that zero-around code ensuring the correct extent layout here? > > FWIW, I also understand things are different if we are doing 128kB > > atomic writes on 16kB force aligned files. However, in this > > situation we are treating the 128kB atomic IO as eight individual > > 16kB atomic IOs that are physically contiguous. > > Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes. Right, but the eventual goal (given the statx parameters) is to be able to do 8x16kB sequential atomic writes as a single 128kB IO, yes? > > > > Again, this is different to the traditional RT file behaviour - it > > > > can use unwritten extents for sub-alloc-unit alignment unmaps > > > > because the RT device can align file offset to any physical offset, > > > > and issue unaligned sector sized IO without any restrictions. Forced > > > > alignment does not have this freedom, and when we extend forced > > > > alignment to RT files, it will not have the freedom to use > > > > unwritten extents for sub-alloc-unit unmapping, either. > > > > > > > So how do you think that we should actually implement > > > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be > > > like: > > > > > > --- a/fs/xfs/xfs_inode.c > > > +++ b/fs/xfs/xfs_inode.c > > > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( > > > WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); > > > return 0; > > > } > > > + if (xfs_inode_has_forcealign(ip)) > > > + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, > > > first_unmap_block); > > > error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, > > > > Yes, it would be something like that, except it would have to be > > done before first_unmap_block is verified. > > > > ok, and are you still of the opinion that this does not apply to rtvol? The rtvol is *not* force-aligned. It -may- have some aligned allocation requirements that are similar (i.e. sb_rextsize > 1 fsb) but it does *not* force-align extents, written or unwritten. The moment we add force-align support to RT files (as is the plan), then the force-aligned inodes on the rtvol will need to behave as force aligned inodes, not "rtvol" inodes. -Dave.
On Tue, Sep 17, 2024 at 01:54:20PM -0700, Darrick J. Wong wrote: > On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote: > > On 16/09/2024 08:03, Dave Chinner wrote: > > > OTOH, we can't do this with atomic writes. Atomic writes require > > > some mkfs help because they require explicit physical alignment of > > > the filesystem to the underlying storage. > > Forcealign requires agsize%extsize==0, No it doesn't. AG size is irrelevant when aligning extents - all that matters is that we can find a free extent that can be trimmed to the alignment defined by extsize. > it's atomicwrites that adds the > requirement that extsize be a power of 2... Only by explicit implementation constraint. Atomic writes do not require power of two extsize - they only require correctly aligned physical extents. e.g an 8kB atomic write is always guaranteed to succeed if we have an extsize of 24kB for laying out the physical extents because a 24kB physical extent is always 8kB aligned and is an exact multiple of 8kB. This meets the requirements for 8kB atomic writes to always succeed, and hence there is no fundamental requirement for extsize to be a power of 2. We have *chosen* to simplify the implementation by only allowing a single aligned atomic write to be issued at a time. This means the alignment and size of atomic writes is always the minimum size the hardware advertises, and that is (at present) always a power of 2. Hence the "extsize needs to be a power of 2" comes from the constraints exposed from the block device configuration (i.e. minimum atomic write unit), not from a filesystem design or implementation constraint. At the filesystem level, we have further simplified things by only allowing extsize = atomic write size. Hence the initial implementation ends up only support power of 2 extsize values. This is not a hard design or implementation constraint, however. ..... hmmmmm. ..... In writing this I've had further thoughts on force-align and the sub-alloc-unit unwritten extent issue we've been discussing here. i.e. I've stopped and considered the existing design constraints given what I wrote above and considered what is needed for supporting large extsize for small atomic writes. I think we need to support large extsize with small atomic write size for two reasons: 1. extsize is still going to be needed for preventing excessive fragmentation with atomic writes. It's the small DIO write workloads that see lots of fragmentation, and applications using atomic writes are essentially being forced down the path of being small DIO write workloads. 2. we can allow force-align w/o atomic writes behaviour to match the existing rtvol sb_rextsize > 1 fsb behaviour without impacting atomic write behaviour. (i.e. less behavioural differences, more common code, better performance, etc). To do this (and I think we do want to do this), then we have to do two things: 1. force-align needs to add a "unwritten align" inode parameter to allow sub-extsize unwritten extent boundaries to exist in the BMBT. (i.e. similar to how rt files w/ sb_rextsize > 1 fsb currently behave.) This is purely an in-memory value - for pure "force-align" we can set it 1 fsb and then the behaviour will match existing RT behaviour. We can abstract this behaviour by replacing the hard coded 1 block alignment for unwritten conversion with an "unwritten align" value which would initially be set to 1. We can also abstract this code away from being "rt specific" and make it entirely dependent on "alloc-unit" configuration. This means rt, force-align and atomic write will all be running the same code, which makes testing a lot easier.. 2. inodes with atomic write enabled must set the unwritten align value to the atomic write size exposed by the hardware, and the extsize must be an exact integer multiple of the unwritten align size. The initial settings of unwritten align == extsize gives the current behaviour of all allocation and extent conversion being aligned to atomic write constraints. The separation of unwritten conversion from the extsize then allows allows the situation I described above with 8kB atomic writes and 24kB extsize. Because unwritten conversion is aligned to atomic wriet boundaries, we can use sub-alloc-unit unwritten extents without violating atomic write boundaries. This would allow us to use extsize for atomic writes in the same manner we use it for now - enable large contiguous allocations to prevent fragmentation when doing lots of concurrent small "immediate write" operations across many files. I think this can all be added on top of the existing patchset - it's not really a fundamental change to any of it. It's a little bit more abstraction and unification, but it enables a lot more flexibility for optimising atomic write functionality in the future. Thoughts? -Dave.
On 17/09/2024 23:12, Dave Chinner wrote: > On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote: >> On 16/09/2024 08:03, Dave Chinner wrote: >>> OTOH, we can't do this with atomic writes. Atomic writes require >>> some mkfs help because they require explicit physical alignment of >>> the filesystem to the underlying storage. >> >> If we are enabling atomic writes at mkfs time, then we can ensure agsize % >> extsize == 0. That provides the physical alignment guarantee. It also makes >> sense to ensure extsize is a power-of-2. > > No, mkfs does not want to align anything to "extsize". It needs to > align the filesystem geometry to be compatible with the underlying > block device atomic write alignment parameters. > > We just don't care if extsize is not an exact multiple of agsize. > As long as extsize is aligned to the atomic write boundaries and the > start of the AG is aligned to atomic write boundaries, we can > allocate hardware aligned extsize sized extents from the AG. > > AGs are always going to contain lots of non-aligned, randomly sized > extents for other stuff like metadata and unaligned file data. > Aligned allocation is all about finding extsized aligned free space > within the AG and has nothing to do with the size of the AG itself. Fine, we can go the way of aligning the agsize to the atomic write unit max for mkfs. > >> However, extsize is re-configurble per inode. So, for an inode enabled for >> atomic writes, we must still ensure agsize % new extsize == 0 (and also new >> extsize is a power-of-2) > > Ensuring that the extsize is aligned to the hardware atomic write > limits is a kernel runtime check when enabling atomic writes on an > inode. > > In this case, we do not care what the AG size is - it is completely > irrelevant to these per-inode runtime checks because mkfs has > already guaranteed that the AG is correctly aligned to the > underlying hardware. That means is extsize is also aligned to the > underlying hardware, physical extent layout is guaranteed to be > compatible with the hardware constraints for atomic writes... Sure, we would just need to enforce that extsize is a power-of-2 then. > >>> Hence we'll eventually end >>> up with atomic writes needing to be enabled at mkfs time, but force >>> align will be an upgradeable feature flag. >> >> Could atomic writes also be an upgradeable feature? We just need to ensure >> that agsize % extsize == 0 for an inode enabled for atomic writes. > > To turn the superblock feature bit on, we have to check the AGs are > correctly aligned to the *underlying hardware*. If they aren't > correctly aligned (and there is a good chance they will not be) > then we can't enable atomic writes at all. The only way to change > this is to physically move AGs around in the block device (i.e. via > xfs_expand tool I proposed). > > i.e. the mkfs dependency on having the AGs aligned to the underlying > atomic write capabilities of the block device never goes away, even > if we want to make the feature dynamically enabled. > > IOWs, yes, an existing filesystem -could- be upgradeable, but there > is no guarantee that is will be. > > Quite frankly, we aren't going to see block devices that filesystems > already exist on suddenly sprout support for atomic writes mid-life. I would not be so sure. Some SCSI devices used in production which I know implicitly write 32KB atomically. And we would like to use them for atomic writes. 32KB is small and I guess that there is a small chance of pre-existing AGs not being 32KB aligned. I would need to check if there is even a min alignment for AGs... > Hence if mkfs detects atomic write support in the underlying device, > it should *always* modify the geometry to be compatible with atomic > writes and enable atomic write support. The current solution is to enable via commandline. > > Yes, that means the "incompat with reflink" issue needs to be fixed > before we take atomic writes out of experimental (i.e. we consistently > apply the same "full support" criteria we applied to DAX). In the meantime, if mkfs auto-enables atomic writes (when the HW supports), what will it do to reflink feature (in terms of enabling)? > > Hence by the time atomic writes are a fully supported feature, we're > going to be able to enable them by default at mkfs time for any > hardware that supports them... > >> Valid >> extsize values may be quite limited, though, depending on the value of >> agsize. > > No. The only limit agsize puts on extsize is that a single aligned > extent can't be larger than half the AG size. Forced alignment and > atomic writes don't change that. > ok Thanks, John
On 17/09/2024 23:27, Dave Chinner wrote: >> # xfs_bmap -vvp mnt/file >> mnt/file: >> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >> 0: [0..15]: 384..399 0 (384..399) 16 010000 >> 1: [16..31]: 400..415 0 (400..415) 16 000000 >> 2: [32..127]: 416..511 0 (416..511) 96 010000 >> 3: [128..255]: 256..383 0 (256..383) 128 000000 >> FLAG Values: >> 0010000 Unwritten preallocated extent >> >> Here we have unaligned extents wrt extsize. >> >> The sub-alloc unit zeroing would solve that - is that what you would still >> advocate (to solve that issue)? > Yes, I thought that was already implemented for force-align with the > DIO code via the extsize zero-around changes in the iomap code. Why > isn't that zero-around code ensuring the correct extent layout here? I just have not included the extsize zero-around changes here. They were just grouped with the atomic writes support, as they were added specifically for the atomic writes support. Indeed - to me at least - it is strange that the DIO code changes are required for XFS forcealign implementation. And, even if we use extsize zero-around changes for DIO path, what about buffered IO? BTW, I still have concern with this extsize zero-around change which I was making: xfs_iomap_write_unwritten() { unsigned int rounding; /* when converting anything unwritten, we must be spanning an alloc unit, so round up/down */ if (rounding > 1) { offset_fsb = rounddown(rounding); count_fsb = roundup(rounding); } ... do { xfs_bmapi_write(); ... xfs_trans_commit(); } while (); } As mentioned elsewhere, it's a bit of a bodge (to do this rounding). > >>> FWIW, I also understand things are different if we are doing 128kB >>> atomic writes on 16kB force aligned files. However, in this >>> situation we are treating the 128kB atomic IO as eight individual >>> 16kB atomic IOs that are physically contiguous. >> Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes. > Right, but the eventual goal (given the statx parameters) is to be > able to do 8x16kB sequential atomic writes as a single 128kB IO, yes? No, if atomic write unit max is 16KB, then userspace can only issue a single 16KB atomic write. However, some things to consider: a. the block layer may merge those 16KB atomic writes b. userspace may also merge 16KB atomic writes and issue a larger atomic write (if atomic write unit max is > 16KB) I had been wondering if there is any value in a lib for helping with b. > >>>>> Again, this is different to the traditional RT file behaviour - it >>>>> can use unwritten extents for sub-alloc-unit alignment unmaps >>>>> because the RT device can align file offset to any physical offset, >>>>> and issue unaligned sector sized IO without any restrictions. Forced >>>>> alignment does not have this freedom, and when we extend forced >>>>> alignment to RT files, it will not have the freedom to use >>>>> unwritten extents for sub-alloc-unit unmapping, either. >>>>> >>>> So how do you think that we should actually implement >>>> xfs_itruncate_extents_flags() properly for forcealign? Would it simply be >>>> like: >>>> >>>> --- a/fs/xfs/xfs_inode.c >>>> +++ b/fs/xfs/xfs_inode.c >>>> @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( >>>> WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); >>>> return 0; >>>> } >>>> + if (xfs_inode_has_forcealign(ip)) >>>> + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, >>>> first_unmap_block); >>>> error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, >>> Yes, it would be something like that, except it would have to be >>> done before first_unmap_block is verified. >>> >> ok, and are you still of the opinion that this does not apply to rtvol? > The rtvol is*not* force-aligned. It -may- have some aligned > allocation requirements that are similar (i.e. sb_rextsize > 1 fsb) > but it does*not* force-align extents, written or unwritten. > > The moment we add force-align support to RT files (as is the plan), > then the force-aligned inodes on the rtvol will need to behave as > force aligned inodes, not "rtvol" inodes. ok, fine Thanks, John
On Wed, Sep 18, 2024 at 08:59:41AM +0100, John Garry wrote: > On 17/09/2024 23:12, Dave Chinner wrote: > > On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote: > > > > Hence we'll eventually end > > > > up with atomic writes needing to be enabled at mkfs time, but force > > > > align will be an upgradeable feature flag. > > > > > > Could atomic writes also be an upgradeable feature? We just need to ensure > > > that agsize % extsize == 0 for an inode enabled for atomic writes. > > > > To turn the superblock feature bit on, we have to check the AGs are > > correctly aligned to the *underlying hardware*. If they aren't > > correctly aligned (and there is a good chance they will not be) > > then we can't enable atomic writes at all. The only way to change > > this is to physically move AGs around in the block device (i.e. via > > xfs_expand tool I proposed). > > > i.e. the mkfs dependency on having the AGs aligned to the underlying > > atomic write capabilities of the block device never goes away, even > > if we want to make the feature dynamically enabled. > > > > IOWs, yes, an existing filesystem -could- be upgradeable, but there > > is no guarantee that is will be. > > > > Quite frankly, we aren't going to see block devices that filesystems > > already exist on suddenly sprout support for atomic writes mid-life. > > I would not be so sure. Some SCSI devices used in production which I know > implicitly write 32KB atomically. And we would like to use them for atomic > writes. Ok, but that's not going to be widespread. Very little storage hardware out there supports atomic writes - the vast majority of deployments will be new hardware that will have mkfs run on it. A better argument for dynamic upgrade is turning on atomic writes on reflink enabled filesystems once the kernel implementation has been updates to allow the two features to co-exist. > 32KB is small and I guess that there is a small chance of > pre-existing AGs not being 32KB aligned. I would need to check if there is > even a min alignment for AGs... There is no default alignment for AGs unless there is a stripe unit set. Then it will align AGs to the stripe unit. There is also no guarantee that stripe units are aligned to powers of two or atomic write units... > > Hence if mkfs detects atomic write support in the underlying device, > > it should *always* modify the geometry to be compatible with atomic > > writes and enable atomic write support. > > The current solution is to enable via commandline. Yes, that's the current proposal. What I'm saying is that this isn't a future proof solution, nor how we want this functionality to work in the future. We should be looking at the block device capabilities (like we do for stripe unit, etc) and then *do the right thing automatically*. If the block device advertises atomic write support, then we should automatically align the filesystem to atomic write constraints, even if atomic writes can not be immediately enabled (because reflink). I'm trying to describe how we want things to work once atomic write support is ubiquitous. It needs to be simple for users and admins, and it should work (or be reliably upgradeable) out of the box on all new hardware that supports this functionality. > > Yes, that means the "incompat with reflink" issue needs to be fixed > > before we take atomic writes out of experimental (i.e. we consistently > > apply the same "full support" criteria we applied to DAX). > > In the meantime, if mkfs auto-enables atomic writes (when the HW supports), > what will it do to reflink feature (in terms of enabling)? I didn't say we should always "auto-enable atomic writes". I said if the hardware is atomic write capable, then mkfs should always *align the filesystem* to atomic write constraints. A kernel upgrade will eventually allow reflink and atomic writes to co-exist, but only if the filesystem is correctly aligned to the hardware constrains for atomic writes. We need to ensure we leave that upgrade path open.... .... and only once we have full support can we make "mkfs auto-enable atomic writes". -Dave.
On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote: > Ok, but that's not going to be widespread. Very little storage > hardware out there supports atomic writes - the vast majority of > deployments will be new hardware that will have mkfs run on it. Just about every enterprise NVMe SSD supports atomic write size larger than a single LBA, because it is completely natural fallout from FTL deѕign. That beeing said to support those SSDs a block size of 16 or 32k would be a lot more natural than all the forcealign madness.
On 23/09/2024 03:57, Dave Chinner wrote: >> In the meantime, if mkfs auto-enables atomic writes (when the HW supports), >> what will it do to reflink feature (in terms of enabling)? > I didn't say we should always "auto-enable atomic writes". > > I said if the hardware is atomic write capable, then mkfs should > always*align the filesystem* to atomic write constraints. A kernel > upgrade will eventually allow reflink and atomic writes to co-exist, > but only if the filesystem is correctly aligned to the hardware > constrains for atomic writes. We need to ensure we leave that > upgrade path open.... > > .... and only once we have full support can we make "mkfs > auto-enable atomic writes". ok, fine. The current maximum value of atomic write unit max is 512KB (assuming 4K PAGE_SIZE and 512B sector size), so that should not be too needlessly inefficient for laying out the AGs. However, for 16KB+ PAGE_SIZE, that value could naturally be larger. However having HW which supports such large atomics would be very unlikely. Thanks, John
On 23/09/2024 04:33, Christoph Hellwig wrote: > On Mon, Sep 23, 2024 at 12:57:32PM +1000, Dave Chinner wrote: >> Ok, but that's not going to be widespread. Very little storage >> hardware out there supports atomic writes - the vast majority of >> deployments will be new hardware that will have mkfs run on it. > > Just about every enterprise NVMe SSD supports atomic write size > larger than a single LBA, because it is completely natural fallout > from FTL deѕign. That beeing said to support those SSDs a block > size of 16 or 32k would be a lot more natural than all the forcealign > madness. > Outside the block allocator changes, most changes for forcealign are just refactoring the RT big alloc unit checks. So - as you have said previously - this so-called madness is already there. How can the sanity be improved? To me, yes, there are so many "if (RT)" checks and special cases in the code, which makes a maintenance headache.
On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote: > Outside the block allocator changes, most changes for forcealign are just > refactoring the RT big alloc unit checks. So - as you have said previously > - this so-called madness is already there. How can the sanity be improved? As a first step by not making it worse, and that not only means not spreading the rtextent stuff further, but more importantly not introducing additional complexities by requiring to be able to write over the written/unwritten boundaries created by either rtextentsize > 1 or the forcealign stuff if you actually want atomic writes. > To me, yes, there are so many "if (RT)" checks and special cases in the > code, which makes a maintenance headache. Replacing them with a different condition doesn't really make that any better.
On 23/09/2024 13:07, Christoph Hellwig wrote: > On Mon, Sep 23, 2024 at 09:16:22AM +0100, John Garry wrote: >> Outside the block allocator changes, most changes for forcealign are just >> refactoring the RT big alloc unit checks. So - as you have said previously >> - this so-called madness is already there. How can the sanity be improved? > > As a first step by not making it worse, and that not only means not > spreading the rtextent stuff further, I assume that refactoring rtextent into "big alloc unit" is spreading (rtextent stuff), right? If so, what other solution? CoW? > but more importantly not introducing > additional complexities by requiring to be able to write over the > written/unwritten boundaries created by either rtextentsize > 1 or > the forcealign stuff if you actually want atomic writes. The very original solution required a single mapping and in written state for atomic writes. Reverting to that would save a lot of hassle in the kernel. It just means that the user needs to manually pre-zero. > >> To me, yes, there are so many "if (RT)" checks and special cases in the >> code, which makes a maintenance headache. > > Replacing them with a different condition doesn't really make that > any better. I am just saying that the rtextent stuff is not nice, but it is not going away. I suppose a tiny perk is that "big alloc unit" checks are better than "if (rt)" checks, as it makes the condition slightly more obvious.
On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote: >> As a first step by not making it worse, and that not only means not >> spreading the rtextent stuff further, > > I assume that refactoring rtextent into "big alloc unit" is spreading > (rtextent stuff), right? If so, what other solution? CoW? Well, if you look at the force align series you'd agree that it spreads the thing out into the btree allocator. Or do I misread it? > >> but more importantly not introducing >> additional complexities by requiring to be able to write over the >> written/unwritten boundaries created by either rtextentsize > 1 or >> the forcealign stuff if you actually want atomic writes. > > The very original solution required a single mapping and in written state > for atomic writes. Reverting to that would save a lot of hassle in the > kernel. It just means that the user needs to manually pre-zero. What atomic I/O sizes do your users require? Would they fit into a large sector size now supported by XFS (i.e. 32k for now).
On 24/09/2024 07:17, Christoph Hellwig wrote: > On Mon, Sep 23, 2024 at 01:33:12PM +0100, John Garry wrote: >>> As a first step by not making it worse, and that not only means not >>> spreading the rtextent stuff further, >> >> I assume that refactoring rtextent into "big alloc unit" is spreading >> (rtextent stuff), right? If so, what other solution? CoW? > > Well, if you look at the force align series you'd agree that it > spreads the thing out into the btree allocator. Or do I misread it? Yes, there are more changes than just refactoring "big alloc unit" stuff. There are btree allocator changes. About those btree allocator changes, strictly speaking there are just a couple of changes to provide forcealign support - the rest are prep patches. And those forcealign changes build on pre-existing allocator features, like extent alignment and length specifiers. > >> >>> but more importantly not introducing >>> additional complexities by requiring to be able to write over the >>> written/unwritten boundaries created by either rtextentsize > 1 or >>> the forcealign stuff if you actually want atomic writes. >> >> The very original solution required a single mapping and in written state >> for atomic writes. Reverting to that would save a lot of hassle in the >> kernel. It just means that the user needs to manually pre-zero. > > What atomic I/O sizes do your users require? Would they fit into > a large sector size now supported by XFS (i.e. 32k for now). > It could be used, but then we have 16KB filesystem block size, which some just may not want. And we just don't want 16KB sector size, but I don't think that we require that if we use RWF_ATOMIC.
On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote: > On 17/09/2024 23:27, Dave Chinner wrote: > > > # xfs_bmap -vvp mnt/file > > > mnt/file: > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > 0: [0..15]: 384..399 0 (384..399) 16 010000 > > > 1: [16..31]: 400..415 0 (400..415) 16 000000 > > > 2: [32..127]: 416..511 0 (416..511) 96 010000 > > > 3: [128..255]: 256..383 0 (256..383) 128 000000 > > > FLAG Values: > > > 0010000 Unwritten preallocated extent > > > > > > Here we have unaligned extents wrt extsize. > > > > > > The sub-alloc unit zeroing would solve that - is that what you would still > > > advocate (to solve that issue)? > > Yes, I thought that was already implemented for force-align with the > > DIO code via the extsize zero-around changes in the iomap code. Why > > isn't that zero-around code ensuring the correct extent layout here? > > I just have not included the extsize zero-around changes here. They were > just grouped with the atomic writes support, as they were added specifically > for the atomic writes support. Indeed - to me at least - it is strange that > the DIO code changes are required for XFS forcealign implementation. And, > even if we use extsize zero-around changes for DIO path, what about buffered > IO? I've been reviewing and testing the XFS atomic write patch series. Since there haven't been any new responses to the previous discussions on this issue, I'd like to inquire about the buffered IO problem with force-aligned files, which is a scenario we might encounter. Consider a case where the file supports force-alignment with a 64K extent size, and the system page size is 4K. Take the following commands as an example: xfs_io -c "pwrite 64k 64k" mnt/file xfs_io -c "pwrite 8k 8k" mnt/file If unaligned unwritten extents are not permitted, we need to zero out the sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale data. While this can be handled relatively easily in direct I/O scenarios, it presents significant challenges in buffered I/O operations. The main difficulty arises because the extent size (64K) is larger than the page size (4K), and our current code base has substantial limitations in handling such cases. Any thoughts on this? Thanks, Long Li > > BTW, I still have concern with this extsize zero-around change which I was > making: > > xfs_iomap_write_unwritten() > { > unsigned int rounding; > > /* when converting anything unwritten, we must be spanning an alloc unit, > so round up/down */ > if (rounding > 1) { > offset_fsb = rounddown(rounding); > count_fsb = roundup(rounding); > } > > ... > do { > xfs_bmapi_write(); > ... > xfs_trans_commit(); > } while (); > } > > As mentioned elsewhere, it's a bit of a bodge (to do this rounding). > > > > > > > FWIW, I also understand things are different if we are doing 128kB > > > > atomic writes on 16kB force aligned files. However, in this > > > > situation we are treating the 128kB atomic IO as eight individual > > > > 16kB atomic IOs that are physically contiguous. > > > Yes, if 16kB force aligned, userspace can only issue 16KB atomic writes. > > Right, but the eventual goal (given the statx parameters) is to be > > able to do 8x16kB sequential atomic writes as a single 128kB IO, yes? > > No, if atomic write unit max is 16KB, then userspace can only issue a single > 16KB atomic write. > > However, some things to consider: > a. the block layer may merge those 16KB atomic writes > b. userspace may also merge 16KB atomic writes and issue a larger atomic > write (if atomic write unit max is > 16KB) > > I had been wondering if there is any value in a lib for helping with b. > > > > > > > > > Again, this is different to the traditional RT file behaviour - it > > > > > > can use unwritten extents for sub-alloc-unit alignment unmaps > > > > > > because the RT device can align file offset to any physical offset, > > > > > > and issue unaligned sector sized IO without any restrictions. Forced > > > > > > alignment does not have this freedom, and when we extend forced > > > > > > alignment to RT files, it will not have the freedom to use > > > > > > unwritten extents for sub-alloc-unit unmapping, either. > > > > > > > > > > > So how do you think that we should actually implement > > > > > xfs_itruncate_extents_flags() properly for forcealign? Would it simply be > > > > > like: > > > > > > > > > > --- a/fs/xfs/xfs_inode.c > > > > > +++ b/fs/xfs/xfs_inode.c > > > > > @@ -1050,7 +1050,7 @@ xfs_itruncate_extents_flags( > > > > > WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF); > > > > > return 0; > > > > > } > > > > > + if (xfs_inode_has_forcealign(ip)) > > > > > + first_unmap_block = xfs_inode_roundup_alloc_unit(ip, > > > > > first_unmap_block); > > > > > error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block, > > > > Yes, it would be something like that, except it would have to be > > > > done before first_unmap_block is verified. > > > > > > > ok, and are you still of the opinion that this does not apply to rtvol? > > The rtvol is*not* force-aligned. It -may- have some aligned > > allocation requirements that are similar (i.e. sb_rextsize > 1 fsb) > > but it does*not* force-align extents, written or unwritten. > > > > The moment we add force-align support to RT files (as is the plan), > > then the force-aligned inodes on the rtvol will need to behave as > > force aligned inodes, not "rtvol" inodes. > > ok, fine > > Thanks, > John > > >
On 14/11/2024 12:48, Long Li wrote: > On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote: >> On 17/09/2024 23:27, Dave Chinner wrote: >>>> # xfs_bmap -vvp mnt/file >>>> mnt/file: >>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >>>> 0: [0..15]: 384..399 0 (384..399) 16 010000 >>>> 1: [16..31]: 400..415 0 (400..415) 16 000000 >>>> 2: [32..127]: 416..511 0 (416..511) 96 010000 >>>> 3: [128..255]: 256..383 0 (256..383) 128 000000 >>>> FLAG Values: >>>> 0010000 Unwritten preallocated extent >>>> >>>> Here we have unaligned extents wrt extsize. >>>> >>>> The sub-alloc unit zeroing would solve that - is that what you would still >>>> advocate (to solve that issue)? >>> Yes, I thought that was already implemented for force-align with the >>> DIO code via the extsize zero-around changes in the iomap code. Why >>> isn't that zero-around code ensuring the correct extent layout here? >> I just have not included the extsize zero-around changes here. They were >> just grouped with the atomic writes support, as they were added specifically >> for the atomic writes support. Indeed - to me at least - it is strange that >> the DIO code changes are required for XFS forcealign implementation. And, >> even if we use extsize zero-around changes for DIO path, what about buffered >> IO? > > I've been reviewing and testing the XFS atomic write patch series. Since > there haven't been any new responses to the previous discussions on this > issue, I'd like to inquire about the buffered IO problem with force-aligned > files, which is a scenario we might encounter. > > Consider a case where the file supports force-alignment with a 64K extent size, > and the system page size is 4K. Take the following commands as an example: > > xfs_io -c "pwrite 64k 64k" mnt/file > xfs_io -c "pwrite 8k 8k" mnt/file > > If unaligned unwritten extents are not permitted, we need to zero out the > sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale > data. How does this prevent stale data? Just zeroing will ensure aligned extents. Unless iomap is provided a mapping for the fully aligned extent. > While this can be handled relatively easily in direct I/O scenarios, > it presents significant challenges in buffered I/O operations. The main > difficulty arises because the extent size (64K) is larger than the page > size (4K), and our current code base has substantial limitations in handling > such cases. What is the limitation exactly? > > Any thoughts on this? TBH, the buffered IO case has not been considered too much. The sub-extent zeroing was intended for atomic writes > 1x FSB and we only care about DIO there. Thanks, John
On Thu, Nov 14, 2024 at 08:48:21PM +0800, Long Li wrote: > On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote: > > On 17/09/2024 23:27, Dave Chinner wrote: > > > > # xfs_bmap -vvp mnt/file > > > > mnt/file: > > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > > 0: [0..15]: 384..399 0 (384..399) 16 010000 > > > > 1: [16..31]: 400..415 0 (400..415) 16 000000 > > > > 2: [32..127]: 416..511 0 (416..511) 96 010000 > > > > 3: [128..255]: 256..383 0 (256..383) 128 000000 > > > > FLAG Values: > > > > 0010000 Unwritten preallocated extent > > > > > > > > Here we have unaligned extents wrt extsize. > > > > > > > > The sub-alloc unit zeroing would solve that - is that what you would still > > > > advocate (to solve that issue)? > > > Yes, I thought that was already implemented for force-align with the > > > DIO code via the extsize zero-around changes in the iomap code. Why > > > isn't that zero-around code ensuring the correct extent layout here? > > > > I just have not included the extsize zero-around changes here. They were > > just grouped with the atomic writes support, as they were added specifically > > for the atomic writes support. Indeed - to me at least - it is strange that > > the DIO code changes are required for XFS forcealign implementation. And, > > even if we use extsize zero-around changes for DIO path, what about buffered > > IO? > > > I've been reviewing and testing the XFS atomic write patch series. Since > there haven't been any new responses to the previous discussions on this > issue, I'd like to inquire about the buffered IO problem with force-aligned > files, which is a scenario we might encounter. > > Consider a case where the file supports force-alignment with a 64K extent size, > and the system page size is 4K. Take the following commands as an example: > > xfs_io -c "pwrite 64k 64k" mnt/file > xfs_io -c "pwrite 8k 8k" mnt/file > > If unaligned unwritten extents are not permitted, we need to zero out the > sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale > data. While this can be handled relatively easily in direct I/O scenarios, > it presents significant challenges in buffered I/O operations. The main > difficulty arises because the extent size (64K) is larger than the page > size (4K), and our current code base has substantial limitations in handling > such cases. > > Any thoughts on this? Large folios in the page cache solve this problem. i.e. it's the same problem that block size > page size support had to solve. -Dave.
On 14/11/2024 20:07, Dave Chinner wrote: >> xfs_io -c "pwrite 64k 64k" mnt/file >> xfs_io -c "pwrite 8k 8k" mnt/file >> >> If unaligned unwritten extents are not permitted, we need to zero out the >> sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale >> data. While this can be handled relatively easily in direct I/O scenarios, >> it presents significant challenges in buffered I/O operations. The main >> difficulty arises because the extent size (64K) is larger than the page >> size (4K), and our current code base has substantial limitations in handling >> such cases. >> >> Any thoughts on this? > Large folios in the page cache solve this problem. i.e. it's the > same problem that block size > page size support had to solve. Would that work for an extsize which is not a power-of-2?
On Fri, Nov 15, 2024 at 07:07:53AM +1100, Dave Chinner wrote: > On Thu, Nov 14, 2024 at 08:48:21PM +0800, Long Li wrote: > > On Wed, Sep 18, 2024 at 11:12:47AM +0100, John Garry wrote: > > > On 17/09/2024 23:27, Dave Chinner wrote: > > > > > # xfs_bmap -vvp mnt/file > > > > > mnt/file: > > > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > > > > 0: [0..15]: 384..399 0 (384..399) 16 010000 > > > > > 1: [16..31]: 400..415 0 (400..415) 16 000000 > > > > > 2: [32..127]: 416..511 0 (416..511) 96 010000 > > > > > 3: [128..255]: 256..383 0 (256..383) 128 000000 > > > > > FLAG Values: > > > > > 0010000 Unwritten preallocated extent > > > > > > > > > > Here we have unaligned extents wrt extsize. > > > > > > > > > > The sub-alloc unit zeroing would solve that - is that what you would still > > > > > advocate (to solve that issue)? > > > > Yes, I thought that was already implemented for force-align with the > > > > DIO code via the extsize zero-around changes in the iomap code. Why > > > > isn't that zero-around code ensuring the correct extent layout here? > > > > > > I just have not included the extsize zero-around changes here. They were > > > just grouped with the atomic writes support, as they were added specifically > > > for the atomic writes support. Indeed - to me at least - it is strange that > > > the DIO code changes are required for XFS forcealign implementation. And, > > > even if we use extsize zero-around changes for DIO path, what about buffered > > > IO? > > > > > > I've been reviewing and testing the XFS atomic write patch series. Since > > there haven't been any new responses to the previous discussions on this > > issue, I'd like to inquire about the buffered IO problem with force-aligned > > files, which is a scenario we might encounter. > > > > Consider a case where the file supports force-alignment with a 64K extent size, > > and the system page size is 4K. Take the following commands as an example: > > > > xfs_io -c "pwrite 64k 64k" mnt/file > > xfs_io -c "pwrite 8k 8k" mnt/file > > > > If unaligned unwritten extents are not permitted, we need to zero out the > > sub-allocation units for ranges [0, 8K] and [16K, 64K] to prevent stale > > data. While this can be handled relatively easily in direct I/O scenarios, > > it presents significant challenges in buffered I/O operations. The main > > difficulty arises because the extent size (64K) is larger than the page > > size (4K), and our current code base has substantial limitations in handling > > such cases. > > > > Any thoughts on this? > > Large folios in the page cache solve this problem. i.e. it's the > same problem that block size > page size support had to solve. > > Thanks for your reply, it cleared up my confusion. So maybe we need to set a minimum folio order for force-aligned inodes, just like Large block sizes (LBS). Thanks, Long Li