diff mbox series

[GIT,PULL] xfs: new code for 6.7

Message ID 87fs1g1rac.fsf@debian-BULLSEYE-live-builder-AMD64 (mailing list archive)
State New, archived

Pull-request

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-6.7-merge-2

Commit Message

Chandan Babu R Nov. 8, 2023, 9:56 a.m. UTC
Hi Linus,

Please pull this branch with changes for xfs for 6.7-rc1.

The important changes include,
1. CPU usage optimizations for realtime allocator.
2. Allowing read operations to continue while a FICLONE ioctl is being
   serviced.

The remaining changes are limited to bug fixes and code cleanups.

There was a delay in me pushing the changes to XFS' for-next branch and hence
a delay in the code changes reaching the linux-next tree. The XFS changes
reached linux-next on 31st of October. The delay was due to me having to drop
a patch from the XFS tree and having to initiate execution of the test suite
once again on October 26th. A complete test run takes around 4 days.

During a discussion, Darrick told me that in such scenarios he would limit
testing to non-fuzz tests which take around 12 hours to complete.  Hence, in
hindsight, I could have limited the time taken to execute tests after dropping
the patch. From the next release onwards, I will make sure to update XFS'
for-next branch well before the merge window opens.

The changes in the current pull request are contained within XFS, i.e. there
are no patches that straddle other subsystems. I have been
executing fstests on linux-next for more than a week now. There were no new
regressions found in XFS during the test run.

I had performed a test merge with latest contents of torvalds/linux.git. i.e.

305230142ae0637213bf6e04f6d9f10bbcb74af8
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Tue Nov 7 17:16:23 2023 -0800
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Tue Nov 7 17:16:23 2023 -0800
Merge tag 'pm-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

This resulted in merge conflicts. The following diff should resolve the merge
conflicts.

+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@@ -960,19 -931,18 +931,19 @@@ xfs_rtcheck_alloc_range
   * Free an extent in the realtime subvolume.  Length is expressed in
   * realtime extents, as is the block number.
   */
- int					/* error */
+ int
  xfs_rtfree_extent(
- 	xfs_trans_t	*tp,		/* transaction pointer */
- 	xfs_rtblock_t	bno,		/* starting block number to free */
- 	xfs_extlen_t	len)		/* length of extent freed */
+ 	struct xfs_trans	*tp,	/* transaction pointer */
+ 	xfs_rtxnum_t		start,	/* starting rtext number to free */
+ 	xfs_rtxlen_t		len)	/* length of extent freed */
  {
- 	int		error;		/* error value */
- 	xfs_mount_t	*mp;		/* file system mount structure */
- 	xfs_fsblock_t	sb;		/* summary file block number */
- 	struct xfs_buf	*sumbp = NULL;	/* summary file block buffer */
- 	struct timespec64 atime;
- 
- 	mp = tp->t_mountp;
+ 	struct xfs_mount	*mp = tp->t_mountp;
+ 	struct xfs_rtalloc_args	args = {
+ 		.mp		= mp,
+ 		.tp		= tp,
+ 	};
+ 	int			error;
++	struct timespec64	atime;
  
  	ASSERT(mp->m_rbmip->i_itemp != NULL);
  	ASSERT(xfs_isilocked(mp->m_rbmip, XFS_ILOCK_EXCL));
@@@ -1000,13 -970,46 +971,49 @@@
  	    mp->m_sb.sb_rextents) {
  		if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM))
  			mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
 -		*(uint64_t *)&VFS_I(mp->m_rbmip)->i_atime = 0;
 +
 +		atime = inode_get_atime(VFS_I(mp->m_rbmip));
 +		*((uint64_t *)&atime) = 0;
 +		inode_set_atime_to_ts(VFS_I(mp->m_rbmip), atime);
  		xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
  	}
- 	return 0;
+ 	error = 0;
+ out:
+ 	xfs_rtbuf_cache_relse(&args);
+ 	return error;
+ }
+ 
+ /*
+  * Free some blocks in the realtime subvolume.  rtbno and rtlen are in units of
+  * rt blocks, not rt extents; must be aligned to the rt extent size; and rtlen
+  * cannot exceed XFS_MAX_BMBT_EXTLEN.
+  */
+ int
+ xfs_rtfree_blocks(
+ 	struct xfs_trans	*tp,
+ 	xfs_fsblock_t		rtbno,
+ 	xfs_filblks_t		rtlen)
+ {
+ 	struct xfs_mount	*mp = tp->t_mountp;
+ 	xfs_rtxnum_t		start;
+ 	xfs_filblks_t		len;
+ 	xfs_extlen_t		mod;
+ 
+ 	ASSERT(rtlen <= XFS_MAX_BMBT_EXTLEN);
+ 
+ 	len = xfs_rtb_to_rtxrem(mp, rtlen, &mod);
+ 	if (mod) {
+ 		ASSERT(mod == 0);
+ 		return -EIO;
+ 	}
+ 
+ 	start = xfs_rtb_to_rtxrem(mp, rtbno, &mod);
+ 	if (mod) {
+ 		ASSERT(mod == 0);
+ 		return -EIO;
+ 	}
+ 
+ 	return xfs_rtfree_extent(tp, start, len);
  }
  
  /* Find all the free records within a given range. */
+++ b/fs/xfs/xfs_rtalloc.c
@@@ -1420,16 -1414,16 +1414,16 @@@ xfs_rtunmount_inodes
   */
  int					/* error */
  xfs_rtpick_extent(
- 	xfs_mount_t		*mp,		/* file system mount point */
- 	xfs_trans_t		*tp,		/* transaction pointer */
- 	xfs_extlen_t		len,		/* allocation length (rtextents) */
- 	xfs_rtblock_t		*pick)		/* result rt extent */
- 	{
- 	xfs_rtblock_t		b;		/* result block */
- 	int			log2;		/* log of sequence number */
- 	uint64_t		resid;		/* residual after log removed */
- 	uint64_t		seq;		/* sequence number of file creation */
- 	struct timespec64	ts;		/* temporary timespec64 storage */
 -	xfs_mount_t	*mp,		/* file system mount point */
 -	xfs_trans_t	*tp,		/* transaction pointer */
 -	xfs_rtxlen_t	len,		/* allocation length (rtextents) */
 -	xfs_rtxnum_t	*pick)		/* result rt extent */
++	xfs_mount_t		*mp,	/* file system mount point */
++	xfs_trans_t		*tp,	/* transaction pointer */
++	xfs_rtxlen_t		len,	/* allocation length (rtextents) */
++	xfs_rtxnum_t		*pick)	/* result rt extent */
+ {
 -	xfs_rtxnum_t	b;		/* result rtext */
 -	int		log2;		/* log of sequence number */
 -	uint64_t	resid;		/* residual after log removed */
 -	uint64_t	seq;		/* sequence number of file creation */
 -	uint64_t	*seqp;		/* pointer to seqno in inode */
++	xfs_rtxnum_t		b;	/* result rtext */
++	int			log2;	/* log of sequence number */
++	uint64_t		resid;	/* residual after log removed */
++	uint64_t		seq;	/* sequence number of file creation */
++	struct timespec64	ts;	/* temporary timespec64 storage */
  
  	ASSERT(xfs_isilocked(mp->m_rbmip, XFS_ILOCK_EXCL));
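
For anyone reading the resolution above without the tree at hand, the alignment
checks in xfs_rtfree_blocks() can be modelled in userspace roughly as follows.
This is only a sketch under assumptions: rtb_to_rtxrem_model() and
rtfree_blocks_model() are hypothetical stand-ins, and the rt extent size is
passed in directly as rextsize instead of being read from the mount structure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model: convert a value expressed in rt blocks
 * to rt extents and hand back the leftover blocks through *rem.
 */
static uint64_t rtb_to_rtxrem_model(uint32_t rextsize, uint64_t rtb,
				    uint32_t *rem)
{
	assert(rextsize != 0);
	*rem = (uint32_t)(rtb % rextsize);
	return rtb / rextsize;
}

/*
 * Model of the checks in the diff: both the length and the start must be
 * whole multiples of the rt extent size, otherwise the call fails with
 * -EIO before anything is freed.
 */
static int rtfree_blocks_model(uint32_t rextsize, uint64_t rtbno,
			       uint64_t rtlen, uint64_t *start, uint64_t *len)
{
	uint32_t mod;

	*len = rtb_to_rtxrem_model(rextsize, rtlen, &mod);
	if (mod)
		return -EIO;

	*start = rtb_to_rtxrem_model(rextsize, rtbno, &mod);
	if (mod)
		return -EIO;

	return 0;
}
```

With an rt extent size of 4 blocks, freeing 12 blocks starting at block 8 maps
to rt extent 2 with a length of 3 rt extents, while a start of block 9 is
rejected.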

Please let me know if you encounter any problems.

The following changes since commit 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1:

  Linux 6.6-rc7 (2023-10-22 12:11:21 -1000)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-6.7-merge-2

for you to fetch changes up to 14a537983b228cb050ceca3a5b743d01315dc4aa:

  xfs: allow read IO and FICLONE to run concurrently (2023-10-23 12:02:26 +0530)

----------------------------------------------------------------
New code for 6.7:

  * Realtime device subsystem
    - Cleanup usage of xfs_rtblock_t and xfs_fsblock_t data types.
    - Replace open coded conversions between rt blocks and rt extents with
      calls to static inline helpers.
    - Replace open coded realtime geometry computation and macros with helper
      functions.
    - CPU usage optimizations for realtime allocator.
    - Misc. bug fixes associated with the realtime device.
  * Allow read operations to execute while an FICLONE ioctl is being serviced.
  * Misc. bug fixes
    - Alert user when xfs_droplink() encounters an inode with a link count of zero.
    - Handle the case where the allocator could return zero extents when
      servicing an fallocate request.

Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

----------------------------------------------------------------
Catherine Hoang (1):
      xfs: allow read IO and FICLONE to run concurrently

Chandan Babu R (6):
      Merge tag 'realtime-fixes-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA
      Merge tag 'clean-up-realtime-units-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA
      Merge tag 'refactor-rt-unit-conversions-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA
      Merge tag 'refactor-rtbitmap-macros-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA
      Merge tag 'refactor-rtbitmap-accessors-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA
      Merge tag 'rtalloc-speedups-6.7_2023-10-19' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.7-mergeA

Cheng Lin (1):
      xfs: introduce protection for drop nlink

Christoph Hellwig (1):
      xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space

Darrick J. Wong (30):
      xfs: bump max fsgeom struct version
      xfs: hoist freeing of rt data fork extent mappings
      xfs: prevent rt growfs when quota is enabled
      xfs: rt stubs should return negative errnos when rt disabled
      xfs: fix units conversion error in xfs_bmap_del_extent_delay
      xfs: make sure maxlen is still congruent with prod when rounding down
      xfs: move the xfs_rtbitmap.c declarations to xfs_rtbitmap.h
      xfs: convert xfs_extlen_t to xfs_rtxlen_t in the rt allocator
      xfs: create a helper to convert rtextents to rtblocks
      xfs: convert rt bitmap/summary block numbers to xfs_fileoff_t
      xfs: create a helper to compute leftovers of realtime extents
      xfs: convert rt bitmap extent lengths to xfs_rtbxlen_t
      xfs: create a helper to convert extlen to rtextlen
      xfs: rename xfs_verify_rtext to xfs_verify_rtbext
      xfs: create helpers to convert rt block numbers to rt extent numbers
      xfs: convert rt extent numbers to xfs_rtxnum_t
      xfs: convert do_div calls to xfs_rtb_to_rtx helper calls
      xfs: create rt extent rounding helpers for realtime extent blocks
      xfs: convert the rtbitmap block and bit macros to static inline functions
      xfs: use shifting and masking when converting rt extents, if possible
      xfs: remove XFS_BLOCKWSIZE and XFS_BLOCKWMASK macros
      xfs: convert open-coded xfs_rtword_t pointer accesses to helper
      xfs: convert rt summary macros to helpers
      xfs: create a helper to handle logging parts of rt bitmap/summary blocks
      xfs: create helpers for rtbitmap block/wordcount computations
      xfs: use accessor functions for bitmap words
      xfs: create helpers for rtsummary block/wordcount computations
      xfs: use accessor functions for summary info words
      xfs: simplify xfs_rtbuf_get calling conventions
      xfs: simplify rt bitmap/summary block accessor functions

Dave Chinner (1):
      xfs: consolidate realtime allocation arguments

Omar Sandoval (6):
      xfs: cache last bitmap block in realtime allocator
      xfs: invert the realtime summary cache
      xfs: return maximum free size from xfs_rtany_summary()
      xfs: limit maxlen based on available space in xfs_rtallocate_extent_near()
      xfs: don't try redundant allocations in xfs_rtallocate_extent_near()
      xfs: don't look for end of extent further than necessary in xfs_rtallocate_extent_near()

 fs/xfs/libxfs/xfs_bmap.c       |  45 +--
 fs/xfs/libxfs/xfs_format.h     |  34 +-
 fs/xfs/libxfs/xfs_rtbitmap.c   | 803 ++++++++++++++++++++++-------------------
 fs/xfs/libxfs/xfs_rtbitmap.h   | 383 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_sb.c         |   2 +
 fs/xfs/libxfs/xfs_sb.h         |   2 +-
 fs/xfs/libxfs/xfs_trans_resv.c |  10 +-
 fs/xfs/libxfs/xfs_types.c      |   4 +-
 fs/xfs/libxfs/xfs_types.h      |  10 +-
 fs/xfs/scrub/bmap.c            |   2 +-
 fs/xfs/scrub/fscounters.c      |   2 +-
 fs/xfs/scrub/inode.c           |   3 +-
 fs/xfs/scrub/rtbitmap.c        |  28 +-
 fs/xfs/scrub/rtsummary.c       |  72 ++--
 fs/xfs/scrub/trace.c           |   1 +
 fs/xfs/scrub/trace.h           |  15 +-
 fs/xfs/xfs_bmap_util.c         |  74 ++--
 fs/xfs/xfs_file.c              |  63 +++-
 fs/xfs/xfs_fsmap.c             |  15 +-
 fs/xfs/xfs_inode.c             |  24 ++
 fs/xfs/xfs_inode.h             |   9 +
 fs/xfs/xfs_inode_item.c        |   3 +-
 fs/xfs/xfs_ioctl.c             |   5 +-
 fs/xfs/xfs_linux.h             |  12 +
 fs/xfs/xfs_mount.h             |   8 +-
 fs/xfs/xfs_ondisk.h            |   4 +
 fs/xfs/xfs_reflink.c           |   4 +
 fs/xfs/xfs_rtalloc.c           | 626 ++++++++++++++++----------------
 fs/xfs/xfs_rtalloc.h           |  94 +----
 fs/xfs/xfs_super.c             |   3 +-
 fs/xfs/xfs_trans.c             |   7 +-
 31 files changed, 1425 insertions(+), 942 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_rtbitmap.h

Comments

Linus Torvalds Nov. 8, 2023, 9:29 p.m. UTC | #1
On Wed, 8 Nov 2023 at 02:19, Chandan Babu R <chandanbabu@kernel.org> wrote:
>
> I had performed a test merge with latest contents of torvalds/linux.git.
>
> This resulted in merge conflicts. The following diff should resolve the merge
> conflicts.

Well, your merge conflict resolution is the same as my initial
mindless one, but then when I look closer at it, it turns out that
it's wrong.

It's wrong not because the merge itself would be wrong, but because
the conflict made me look at the original, and it turns out that
commit 75d1e312bbbd ("xfs: convert to new timestamp accessors") was
buggy.

I'm actually surprised the compilers don't complain about it, because
the bug means that the new

        struct timespec64 ts;

temporary isn't actually initialized for the !XFS_DIFLAG_NEWRTBM case.

The code does

  xfs_rtpick_extent(..)
  ...
        struct timespec64 ts;
        ..
        if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
                mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
                seq = 0;
        } else {
        ...
        ts.tv_sec = (time64_t)seq + 1;
        inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);

and notice how 'ts.tv_nsec' is never initialized. So we'll set the
nsec part of the atime to random garbage.

Oh, I'm sure it doesn't really *matter*, but it's most certainly wrong.
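
A small userspace model may make the pattern easier to see. Everything in it is
an assumption for illustration: struct timespec64_model and
struct rbm_inode_model are hypothetical stand-ins for the kernel structures,
and the function shows one way to avoid the uninitialized field (copy the
whole timestamp before touching tv_sec), not necessarily what went into the
merge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct timespec64_model {
	int64_t	tv_sec;
	long	tv_nsec;
};

/* Hypothetical stand-in for the rt bitmap inode state used here. */
struct rbm_inode_model {
	struct timespec64_model	atime;
	bool			newrtbm;	/* models XFS_DIFLAG_NEWRTBM */
};

/* Return the current sequence number and store seq + 1 back in atime. */
static uint64_t rtpick_bump_seq_model(struct rbm_inode_model *ip)
{
	struct timespec64_model	ts;
	uint64_t		seq;

	assert(ip != NULL);

	/* Read the whole timestamp up front so tv_nsec is always defined. */
	ts = ip->atime;

	if (!ip->newrtbm) {
		ip->newrtbm = true;
		seq = 0;
	} else {
		seq = (uint64_t)ts.tv_sec;
	}

	ts.tv_sec = (int64_t)(seq + 1);
	ip->atime = ts;		/* tv_nsec is written back unchanged */
	return seq;
}
```

Had ts only been filled in on the else branch, the first call on a fresh inode
would have written whatever happened to be on the stack into tv_nsec.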

I am not very happy about the whole crazy XFS model where people cast
the 'struct timespec64' pointer to an 'uint64_t' pointer, and then say
'now it's a sequence number'. This is not the only place that
happened, ie we have similar disgusting code in at least
xfs_rtfree_extent() too.

That other place in xfs_rtfree_extent() didn't have this bug - it does
inode_get_atime() unconditionally and this keeps the nsec field as-is,
but that other place has the same really ugly code.

Doing that "cast struct timespec64 to an uint64_t" is not only ugly
and wrong, it's _stupid_. The only reason it works in the first place
is that 'struct timespec64' is

  struct timespec64 {
        time64_t        tv_sec;                 /* seconds */
        long            tv_nsec;                /* nanoseconds */
  };

so the first field is 'tv_sec', which is a 64-bit (signed) value.

So the cast is disgusting - and it's pointless. I don't know why it's
done that way. It would have been much cleaner to just use tv_sec, and
have a big comment about it being used as a sequence number here.
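
As a rough sketch of that alternative, assuming a userspace stand-in for the
structure (struct timespec64_model) and two hypothetical accessor names, the
layout assumption the cast relies on can be stated at compile time and the
field can be used by name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct timespec64_model {
	int64_t	tv_sec;
	long	tv_nsec;
};

/*
 * The pointer cast only "works" because tv_sec is the first member and is
 * 64 bits wide. Spelling that out makes the hidden assumption visible.
 */
_Static_assert(offsetof(struct timespec64_model, tv_sec) == 0,
	       "tv_sec must be the first member");
_Static_assert(sizeof(((struct timespec64_model *)0)->tv_sec) ==
	       sizeof(uint64_t), "tv_sec must be 64 bits wide");

/*
 * NOTE: in this model the atime tv_sec field is not a time at all. It is
 * used as a 64-bit sequence number, and tv_nsec is left alone.
 */
static uint64_t rtpick_seq_get_model(const struct timespec64_model *atime)
{
	assert(atime != NULL);
	return (uint64_t)atime->tv_sec;
}

static void rtpick_seq_set_model(struct timespec64_model *atime, uint64_t seq)
{
	assert(atime != NULL);
	atime->tv_sec = (int64_t)seq;
}
```

No cast of the structure pointer is needed, and the nanosecond field is never
touched by accident.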

I _assume_ there's just a simple 32-bit history to this all, where at
one point it was a 32-bit tv_sec, and the cast basically used both
32-bit fields as a 64-bit sequence number.  I get it. But it's most
definitely wrong now.

End result: I ended up fixing that bug and removing the bogus casts in
my merge. I *think* I got it right, but apologies in advance if I
screwed up. I only did visual inspection and build testing, no actual
real testing.

Also, xfs people may obviously have other preferences for how to deal
with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
number" thing, and maybe you prefer to then update my fix to this all.
But those horrid casts certainly weren't the right way to do it.

Put another way: please do give my merge a closer look, and decide
amongst yourselves if you then want to deal with this some other way.

              Linus
pr-tracker-bot@kernel.org Nov. 8, 2023, 9:34 p.m. UTC | #2
The pull request you sent on Wed, 08 Nov 2023 15:26:29 +0530:

> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-6.7-merge-2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/34f763262743aac0847b15711b0460ac6d6943d5

Thank you!
Darrick J. Wong Nov. 8, 2023, 10:52 p.m. UTC | #3
On Wed, Nov 08, 2023 at 01:29:16PM -0800, Linus Torvalds wrote:
> On Wed, 8 Nov 2023 at 02:19, Chandan Babu R <chandanbabu@kernel.org> wrote:
> >
> > I had performed a test merge with latest contents of torvalds/linux.git.
> >
> > This resulted in merge conflicts. The following diff should resolve the merge
> > conflicts.
> 
> Well, your merge conflict resolution is the same as my initial
> mindless one, but then when I look closer at it, it turns out that
> it's wrong.
> 
> It's wrong not because the merge itself would be wrong, but because
> the conflict made me look at the original, and it turns out that
> commit 75d1e312bbbd ("xfs: convert to new timestamp accessors") was
> buggy.
> 
> I'm actually surprised the compilers don't complain about it, because
> the bug means that the new
> 
>         struct timespec64 ts;
> 
> temporary isn't actually initialized for the !XFS_DIFLAG_NEWRTBM case.
> 
> The code does
> 
>   xfs_rtpick_extent(..)

Oh gosh.  Dave might have other things to say, but xfs_rtpick_extent is
the sort of function that I hate with the power of 1,000 suns.

Back in 2.6.x it apparently did this:

	seqp = (__uint64_t *)&mp->m_rbmip->i_d.di_atime;

At the time, xfs_inode.i_d.di_atime was a struct xfs_ictimestamp:

typedef struct xfs_ictimestamp {
	__int32_t	t_sec;		/* timestamp seconds */
	__int32_t	t_nsec;		/* timestamp nanoseconds */
} xfs_ictimestamp_t;

So the rt allocator thinks it's maintaining a u64 new file counter in the
bitmap file's atime.  The lower 32bits ended up in t_sec, and the upper
32bits ended up in t_nsec.

At some point (4.6?) the function started using the VFS i_atime field
instead of the di_atime field.  On 32-bit systems the struct timespec
was still a struct of two int32 values and everything kept working the
way it always had.

On 64-bit systems, tv_sec is 64-bits which means the sequence counter
was only stored (incore, anyway) in tv_sec.  XFS truncates the upper
32-bits when writing the inode to disk because (at the time) it didn't
handle y2038.
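
To illustrate the truncation, here is a userspace model of the round trip. It
is a sketch under assumptions: the two structures are hypothetical stand-ins,
and the ondisk fields are modelled as unsigned 32-bit values so that the C
conversions stay well defined, whereas the real ondisk fields are signed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a pre-y2038 ondisk timestamp: two 32-bit fields. */
struct ondisk_ts_model {
	uint32_t	t_sec;
	uint32_t	t_nsec;
};

/* Hypothetical model of the incore timestamp on a 64-bit kernel. */
struct incore_ts_model {
	int64_t	tv_sec;
	long	tv_nsec;
};

/* Writing the inode out keeps only the low 32 bits of tv_sec. */
static struct ondisk_ts_model ts_to_disk_model(const struct incore_ts_model *ts)
{
	struct ondisk_ts_model d;

	assert(ts->tv_nsec >= 0);	/* the model ignores negative nsec */
	d.t_sec = (uint32_t)ts->tv_sec;	/* upper 32 bits are dropped here */
	d.t_nsec = (uint32_t)ts->tv_nsec;
	return d;
}

/* Reading the inode back cannot recover the bits that were dropped. */
static struct incore_ts_model ts_from_disk_model(const struct ondisk_ts_model *d)
{
	struct incore_ts_model ts;

	ts.tv_sec = d->t_sec;
	ts.tv_nsec = (long)d->t_nsec;
	return ts;
}
```

A counter value of 0x100000005 held in tv_sec comes back as 5 after a
writeout and reread in this model.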

IOWs, we broke the ondisk format in 2016 and nobody noticed.  Because
the allocator calls xfs_highbit64 on the sequence counter, the only
observable behavior change would be the starting location of a free
space search for the first rt file allocation after an upgrade from 4.5
to a newer kernel on a 64-bit machine.

(Or going back, obviously)
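
For readers who don't have the function in front of them, the following is a
simplified userspace model of how a sequence counter can be turned into a
search starting point. It is an assumption based on one reading of the log2
and resid variables in xfs_rtpick_extent (highest set bit as the bisection
level, the remainder as the position within that level), not a quote of the
kernel code, and highbit64_model() and rtpick_start_model() are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, or -1 if no bit is set. */
static int highbit64_model(uint64_t v)
{
	int bit = -1;

	while (v) {
		v >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Map a sequence number to a starting rt extent by repeated bisection:
 * seq 1 lands at 1/2 of the device, 2 and 3 at 1/4 and 3/4, 4..7 at the
 * odd eighths, and so on. The multiplication can overflow for very large
 * devices or counters; this model does not try to handle that.
 */
static uint64_t rtpick_start_model(uint64_t seq, uint64_t rextents,
				   uint64_t len)
{
	uint64_t b, resid;
	int log2;

	assert(rextents > 0 && len <= rextents);

	log2 = highbit64_model(seq);
	if (log2 == -1)
		return 0;
	assert(log2 < 63);	/* keep the shift below well defined */

	resid = seq - (1ULL << log2);
	b = (rextents * ((resid << 1) + 1)) >> (log2 + 1);

	/* Pull the start back if the extent would run off the end. */
	if (b + len > rextents)
		b = rextents - len;
	return b;
}
```

With 1024 rt extents and a length of 1, sequence numbers 0 through 4 give
starting points 0, 512, 256, 768 and 128, which is why a counter that loses
its upper bits only shifts where the search begins.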

But then in 4.18 or so, the VFS inode switched to timespec64, at which
point /all/ of the 32-bit kernels "migrated" to storing the sequence
counter in tv_sec and truncating it when it goes out to disk.

Then in 5.10 we added y2038 support, so post-Covid filesystems truncate
less of the sequence counter.  #winning

>   ...
>         struct timespec64 ts;
>         ..
>         if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
>                 mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
>                 seq = 0;
>         } else {
>         ...
>         ts.tv_sec = (time64_t)seq + 1;
>         inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);

So... according to the pre-4.6 definition of the sequence counter this
is wrong, but OTOH it's not inconsistent with what was there in 6.4.

> and notice how 'ts.tv_nsec' is never initialized. So we'll set the
> nsec part of the atime to random garbage.
> 
> Oh, I'm sure it doesn't really *matter*, but it's most certainly wrong.

tv_nsec isn't explicitly initialized by rtpick_extent, but IIRC mkfs
initializes the ondisk inode's tv_nsec field and the kernel reads that
into the incore inode, so I don't think it's leaking kernel memory
contents.

> I am not very happy about the whole crazy XFS model where people cast
> the 'struct timespec64' pointer to an 'uint64_t' pointer, and then say
> 'now it's a sequence number'. This is not the only place that
> happened, ie we have similar disgusting code in at least
> xfs_rtfree_extent() too.
> 
> That other place in xfs_rtfree_extent() didn't have this bug - it does
> inode_get_atime() unconditionally and this keeps the nsec field as-is,
> but that other place has the same really ugly code.
> 
> Doing that "cast struct timespec64 to an uint64_t" is not only ugly
> and wrong, it's _stupid_. The only reason it works in the first place
> is that 'struct timespec64' is
> 
>   struct timespec64 {
>         time64_t        tv_sec;                 /* seconds */
>         long            tv_nsec;                /* nanoseconds */
>   };
> 
> so the first field is 'tv_sec', which is a 64-bit (signed) value.

(yep)

> So the cast is disgusting - and it's pointless. I don't know why it's
> done that way. It would have been much cleaner to just use tv_sec, and
> have a big comment about it being used as a sequence number here.
> 
> I _assume_ there's just a simple 32-bit history to this all, where at
> one point it was a 32-bit tv_sec, and the cast basically used both
> 32-bit fields as a 64-bit sequence number.  I get it. But it's most
> definitely wrong now.

I don't even think it was good C back whenever it was written, but I was
probably in high school at that point. ;)

> End result: I ended up fixing that bug and removing the bogus casts in
> my merge. I *think* I got it right, but apologies in advance if I
> screwed up. I only did visual inspection and build testing, no actual
> real testing.

My opinion is that you've kept your tree consistent with what the kernel
has been doing for the last 5 years.  No comment about the s**tshow that
went on before that.

> Also, xfs people may obviously have other preferences for how to deal
> with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> number" thing, and maybe you prefer to then update my fix to this all.
> But those horrid casts certainly weren't the right way to do it.

Yeah, I can work on that for the rt modernization patchset.

> Put another way: please do give my merge a closer look, and decide
> amongst yourselves if you then want to deal with this some other way.

Let's see what the other devs say.  Thank you for taking Chandan's pull
request, by the way.

--D

> 
>               Linus
Christoph Hellwig Nov. 9, 2023, 4:51 a.m. UTC | #4
On Wed, Nov 08, 2023 at 02:52:00PM -0800, Darrick J. Wong wrote:
> > Also, xfs people may obviously have other preferences for how to deal
> > with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> > number" thing, and maybe you prefer to then update my fix to this all.
> > But those horrid casts certainly weren't the right way to do it.
> 
> Yeah, I can work on that for the rt modernization patchset.

As someone who has just written some new code stealing this trick I
actually have a todo list item to make this less horrible as the cast
upset my stomach.  But shame on me for not actually noticing that it
is buggy as well (which honestly should be the standard assumption for
casts like this).
Darrick J. Wong Nov. 9, 2023, 7:39 a.m. UTC | #5
On Thu, Nov 09, 2023 at 05:51:50AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 02:52:00PM -0800, Darrick J. Wong wrote:
> > > Also, xfs people may obviously have other preferences for how to deal
> > > with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> > > number" thing, and maybe you prefer to then update my fix to this all.
> > > But those horrid casts certainly weren't the right way to do it.
> > 
> > Yeah, I can work on that for the rt modernization patchset.
> 
> As someone who has just written some new code stealing this trick I
> actually have a todo list item to make this less horrible as the cast
> upset my stomach.  But shame on me for not actually noticing that it
> is buggy as well (which honestly should be the standard assumption for
> casts like this).

Dave and I started looking at this too, and came up with: For rtgroups
filesystems, what if rtpick simply rotored the rtgroups?  And what if we
didn't bother persisting the rotor value, which would make this casting
nightmare go away in the long run.  It's not like we persist the agi
rotors.

--D
Christoph Hellwig Nov. 9, 2023, 2:46 p.m. UTC | #6
On Wed, Nov 08, 2023 at 11:39:45PM -0800, Darrick J. Wong wrote:
> Dave and I started looking at this too, and came up with: For rtgroups
> filesystems, what if rtpick simply rotored the rtgroups?  And what if we
> didn't bother persisting the rotor value, which would make this casting
> nightmare go away in the long run.  It's not like we persist the agi
> rotors.

Yep.  We should still fix the cast and replace it with a proper union
or other means for pre-RTG file systems given that they will be around
for a while.
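
A union along those lines might look like the sketch below. This is an
assumption about what such a replacement could be, with hypothetical names
throughout, and note that which half of the counter lands in which field
depends on the byte order of the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical overlay of the legacy two-field timestamp and the 64-bit
 * counter. Reading a member other than the one last written is permitted
 * for unions in C, unlike the pointer cast between unrelated types.
 */
union rtpick_atime_model {
	struct {
		uint32_t	t_sec;
		uint32_t	t_nsec;
	} ts;
	uint64_t	seq;
};

_Static_assert(sizeof(union rtpick_atime_model) == sizeof(uint64_t),
	       "overlay must be exactly 64 bits");

/* Return the current counter value and advance it by one. */
static uint64_t rtpick_seq_bump_model(union rtpick_atime_model *u)
{
	assert(u != NULL);
	return u->seq++;
}
```

The intent of the overlay is visible in the type itself, so a reader does not
have to reverse-engineer it from a cast.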
Darrick J. Wong Nov. 9, 2023, 4:38 p.m. UTC | #7
On Thu, Nov 09, 2023 at 03:46:14PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 11:39:45PM -0800, Darrick J. Wong wrote:
> > Dave and I started looking at this too, and came up with: For rtgroups
> > filesystems, what if rtpick simply rotored the rtgroups?  And what if we
> > didn't bother persisting the rotor value, which would make this casting
> > nightmare go away in the long run.  It's not like we persist the agi
> > rotors.
> 
> Yep.  We should still fix the cast and replace it with a proper union
> or other means for pre-RTG file systems given that they will be around
> for a while.

<nod> Linus' fixup stuffs the seq value in tv_sec.  That's not great
since the inode writeout code then truncates the upper 32 bits, but
that's what the kernel has been doing for 5+ years now.

Dave suggested that we might restore the pre-4.6 behavior by explicitly
encoding what we used to do:

	inode->i_atime.tv_sec = seq & 0xFFFFFFFF;
	inode->i_atime.tv_nsec = seq >> 32;

(There's a helper in 6.7 for this, apparently.)
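
Spelled out as a pair of functions in a userspace model, the explicit encoding
and its inverse could look like this. The structure and function names are
hypothetical, and the sketch makes no claim about which kernel helper would
be used to store the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct timespec64_model {
	int64_t	tv_sec;
	long	tv_nsec;
};

/* Low 32 bits of the counter go to tv_sec, high 32 bits to tv_nsec. */
static void seq_to_atime_model(uint64_t seq, struct timespec64_model *ts)
{
	assert(ts != NULL);
	ts->tv_sec = (int64_t)(seq & 0xFFFFFFFFu);
	ts->tv_nsec = (long)(seq >> 32);
}

/* Reassemble the counter from the two 32-bit halves. */
static uint64_t atime_to_seq_model(const struct timespec64_model *ts)
{
	assert(ts != NULL);
	return ((uint64_t)(uint32_t)ts->tv_nsec << 32) | (uint32_t)ts->tv_sec;
}
```

Unlike the old cast, this does not depend on byte order or on the width of
tv_sec. One caveat: once the upper half grows past 999,999,999, tv_nsec no
longer holds a valid nanosecond count, and whether everything that handles
the timestamp tolerates that is a separate question.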

But then I pointed out that the entire rtpick sequence counter thing
merely provides a *starting point* for rtbitmap searches.  So it's not
like garbled values result in metadata inconsistency.  IOWs, it's
apparently benign.

IOWs, how much does anyone care about improving on Linus' fixup?

--D
Christoph Hellwig Nov. 9, 2023, 4:50 p.m. UTC | #8
On Thu, Nov 09, 2023 at 08:38:56AM -0800, Darrick J. Wong wrote:
> Dave suggested that we might restore the pre-4.6 behavior by explicitly
> encoding what we used to do:
> 
> 	inode->i_atime.tv_sec = seq & 0xFFFFFFFF;
> 	inode->i_atime.tv_nsec = seq >> 32;
> 
> (There's a helper in 6.7 for this, apparently.)
> 
> But then I pointed out that the entire rtpick sequence counter thing
> merely provides a *starting point* for rtbitmap searches.  So it's not
> like garbled values result in metadata inconsistency.  IOWs, it's
> apparently benign.
> 
> IOWs, how much does anyone care about improving on Linus' fixup?

I'd really like to see the cast of a pointer to a struct type to a
scalar gone, because those tend to hide bugs.

I'm not going to bother you too much with it, promised.
Jeff Layton Nov. 9, 2023, 5:12 p.m. UTC | #9
On Wed, 2023-11-08 at 13:29 -0800, Linus Torvalds wrote:
> On Wed, 8 Nov 2023 at 02:19, Chandan Babu R <chandanbabu@kernel.org> wrote:
> > 
> > I had performed a test merge with latest contents of torvalds/linux.git.
> > 
> > This resulted in merge conflicts. The following diff should resolve the merge
> > conflicts.
> 
> Well, your merge conflict resolution is the same as my initial
> mindless one, but then when I look closer at it, it turns out that
> it's wrong.
> 
> It's wrong not because the merge itself would be wrong, but because
> the conflict made me look at the original, and it turns out that
> commit 75d1e312bbbd ("xfs: convert to new timestamp accessors") was
> buggy.
> 
> I'm actually surprised the compilers don't complain about it, because
> the bug means that the new
> 
>         struct timespec64 ts;
> 
> temporary isn't actually initialized for the !XFS_DIFLAG_NEWRTBM case.
> 
> The code does
> 
>   xfs_rtpick_extent(..)
>   ...
>         struct timespec64 ts;
>         ..
>         if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
>                 mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
>                 seq = 0;
>         } else {
>         ...
>         ts.tv_sec = (time64_t)seq + 1;
>         inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);
> 
> and notice how 'ts.tv_nsec' is never initialized. So we'll set the
> nsec part of the atime to random garbage.
> 
> Oh, I'm sure it doesn't really *matter*, but it's most certainly wrong.
> 
> I am not very happy about the whole crazy XFS model where people cast
> the 'struct timespec64' pointer to an 'uint64_t' pointer, and then say
> 'now it's a sequence number'. This is not the only place that
> happened, ie we have similar disgusting code in at least
> xfs_rtfree_extent() too.
> 
> That other place in xfs_rtfree_extent() didn't have this bug - it does
> inode_get_atime() unconditionally and this keeps the nsec field as-is,
> but that other place has the same really ugly code.
> 
> Doing that "cast struct timespec64 to an uint64_t" is not only ugly
> and wrong, it's _stupid_. The only reason it works in the first place
> is that 'struct timespec64' is
> 
>   struct timespec64 {
>         time64_t        tv_sec;                 /* seconds */
>         long            tv_nsec;                /* nanoseconds */
>   };
> 
> so the first field is 'tv_sec', which is a 64-bit (signed) value.
> 
> So the cast is disgusting - and it's pointless. I don't know why it's
> done that way. It would have been much cleaner to just use tv_sec, and
> have a big comment about it being used as a sequence number here.
> 
> I _assume_ there's just a simple 32-bit history to this all, where at
> one point it was a 32-bit tv_sec, and the cast basically used both
> 32-bit fields as a 64-bit sequence number.  I get it. But it's most
> definitely wrong now.
> 
> End result: I ended up fixing that bug and removing the bogus casts in
> my merge. I *think* I got it right, but apologies in advance if I
> screwed up. I only did visual inspection and build testing, no actual
> real testing.
> 
> Also, xfs people may obviously have other preferences for how to deal
> with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> number" thing, and maybe you prefer to then update my fix to this all.
> But those horrid casts certainly weren't the right way to do it.
> 
> Put another way: please do give my merge a closer look, and decide
> amongst yourselves if you then want to deal with this some other way.
> 
>               Linus

I think when I was looking at that code, I had convinced myself that the
tv_nsec field didn't matter at all, since it wasn't being used, but I
should have done a better job of preserving the existing value. Mea
culpa.

Your fixup looks right to me. Thanks for fixing it.

Cheers,
Dave Chinner Nov. 9, 2023, 10:05 p.m. UTC | #10
On Wed, Nov 08, 2023 at 11:39:45PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 09, 2023 at 05:51:50AM +0100, Christoph Hellwig wrote:
> > On Wed, Nov 08, 2023 at 02:52:00PM -0800, Darrick J. Wong wrote:
> > > > Also, xfs people may obviously have other preferences for how to deal
> > > > with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> > > > number" thing, and maybe you prefer to then update my fix to this all.
> > > > But those horrid casts certainly weren't the right way to do it.
> > > 
> > > Yeah, I can work on that for the rt modernization patchset.
> > 
> > As someone who has just written some new code stealing this trick I
> > actually have a todo list item to make this less horrible as the cast
> > > upset my stomach.  But shame on me for not actually noticing that it
> > is buggy as well (which honestly should be the standard assumption for
> > casts like this).
> 
> Dave and I started looking at this too, and came up with: For rtgroups
> filesystems, what if rtpick simply rotored the rtgroups?  And what if we
> didn't bother persisting the rotor value, which would make this casting
> nightmare go away in the long run.  It's not like we persist the agi
> rotors.

I think we could replace it right now with an in-memory rotor like
the mp->m_agfrotor. It really does not need to be persistent; the
current sequence based algorithm devolves to sequential ascending
block order allocation targets once the sequence number gets large
enough.

Further, the (somewhat) deterministic extent distribution it is
trying to achieve (i.e. even distribution across the rt dev) is
really only achievable in write-once workloads.  The moment we start
freeing space on the rtdev, the free space is no longer uniform and
does not match the pattern the sequence-based target iterates. Hence
the layout the search target attempts to create is unachievable and
largely meaningless.

IOWs, we may as well just use an in-memory sequence number or a
random number to seed the allocation target; they will work just as
well as what we have right now without the need for persistent
sequence numbers.
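
A minimal userspace sketch of the in-memory variant, assuming a hypothetical
per-mount structure and C11 atomics in place of whatever the kernel would
actually use:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mount state; nothing here is ever written to disk. */
struct rt_mount_model {
	_Atomic uint64_t	rt_rotor;
};

/* Called at mount time: the rotor simply restarts from zero. */
static void rt_rotor_init_model(struct rt_mount_model *mp)
{
	assert(mp != NULL);
	atomic_init(&mp->rt_rotor, 0);
}

/*
 * Hand out the next value to seed an allocation target. Relaxed ordering
 * is enough in this model because the value is only a hint.
 */
static uint64_t rt_rotor_next_model(struct rt_mount_model *mp)
{
	return atomic_fetch_add_explicit(&mp->rt_rotor, 1,
					 memory_order_relaxed);
}
```

Each caller gets a distinct seed while the filesystem is mounted, and a
remount starts over from zero, with no inode timestamp involved at all.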

Also, I think that not updating the persistent sequence number is
fine from a backwards compatibility perspective - older kernels will
just use it as it does now and newer kernels will just ignore it...

I say we just kill the whole sequence number in atime thing
completely....

-Dave.

Patch

diff --cc fs/xfs/libxfs/xfs_rtbitmap.c
index 396648acb5be,b332ab490a48..84e27b9987f8
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
diff --cc fs/xfs/xfs_rtalloc.c
index 2e1a4e5cd03d,ba66442910b1..0254c573086a
--- a/fs/xfs/xfs_rtalloc.c