[v2] BTRFS/NFSD: provide more unique inode number for btrfs export

Message ID 162969155423.9892.18322100025025288277@noble.neil.brown.name (mailing list archive)
State New, archived
Series [v2] BTRFS/NFSD: provide more unique inode number for btrfs export

Commit Message

NeilBrown Aug. 23, 2021, 4:05 a.m. UTC
BTRFS does not provide unique inode numbers across a filesystem.
It only provides unique inode numbers within a subvolume and
uses synthetic device numbers for different subvolumes to ensure
uniqueness for device+inode.

nfsd cannot use these varying synthetic device numbers.  If nfsd were to
synthesise different stable filesystem ids to give to the client, that
would cause subvolumes to appear in the mount table on the client, even
though they don't appear in the mount table on the server.  Also, NFSv3
doesn't support changing the filesystem id without a new explicit mount
on the client (this is partially supported in practice, but violates the
protocol specification and has problems in some edge cases).

So currently, the roots of all subvolumes in the same filesystem report
the same inode number to NFS clients, and tools like 'find' notice that
a directory has the same identity as an ancestor and so refuse to
enter that directory.

This patch allows btrfs (or any filesystem) to provide a 64-bit number
that can be xored with the inode number to make the number more unique.
Rather than the client being certain to see duplicates, with this patch
it is possible but extremely rare.

The number that btrfs provides is a swab64() version of the subvolume
identifier.  This has most entropy in the high bits (the low bits of the
subvolume identifier), while the inode has most entropy in the low bits.
The result will always be unique within a subvolume, and will almost
always be unique across the filesystem.
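
As a rough userspace illustration of the scheme described above (not the
kernel code itself), the sketch below byte-swaps a subvolume id and xors
it into an inode number.  The names swab64_sketch(), subvol_uniquifier()
and uniquify_ino() are hypothetical; the guard that skips the xor when
the two values are equal mirrors nfsd_uniquify_ino() in the patch, and
the comment about avoiding a fileid of 0 is an assumption about its
purpose.

```c
#include <stdint.h>

/* Object id of the top-level btrfs tree (BTRFS_FS_TREE_OBJECTID in the kernel). */
#define FS_TREE_OBJECTID_SKETCH 5ULL

/* Portable stand-in for the kernel's swab64(): reverse the eight bytes. */
static uint64_t swab64_sketch(uint64_t x)
{
	x = ((x & 0x00ff00ff00ff00ffULL) << 8) | ((x >> 8) & 0x00ff00ff00ff00ffULL);
	x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
	return (x << 32) | (x >> 32);
}

/* What btrfs would report as ino_uniquifier for a given subvolume id. */
uint64_t subvol_uniquifier(uint64_t subvol_id)
{
	if (subvol_id == FS_TREE_OBJECTID_SKETCH)
		return 0;	/* top-level subvolume keeps its raw inode numbers */
	return swab64_sketch(subvol_id);
}

/* What nfsd would report as the fileid when the option is active. */
uint64_t uniquify_ino(uint64_t ino, uint64_t uniquifier)
{
	if (uniquifier != ino)
		return ino ^ uniquifier;
	return ino;	/* presumably avoids reporting a fileid of 0 */
}
```

With this sketch, inode 256 in subvolumes 256 and 257 maps to
0x0001000000000100 and 0x0101000000000100 respectively, while inode 256
in the top-level subvolume stays 256.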

If an upgrade of the NFS server caused all inode numbers in an exported
BTRFS filesystem to appear to the client to change, the client may not
handle this well.  The Linux client will cause any open files to become
'stale'.  If the mount point changed inode number, the whole mount would
become inaccessible.

To avoid this, an unused byte in the filehandle (fh_auth_type) has been
repurposed as "fh_options".  (The use of #defines makes fh_flags a
problematic choice).  The new behaviour of uniquifying inode numbers is
only activated when the NFSD_FH_OPTION_INO_UNIQUIFY bit is set in this
byte.
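
A minimal userspace sketch of how such an options byte can be checked,
assuming a single defined bit as in the patch.  The enum and function
names here are hypothetical stand-ins; the masking against an "all known
options" value follows the test the patch adds to nfsd_set_fh_dentry().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of enum nfsd_fh_options from the patch. */
enum fh_option_sketch {
	FH_OPTION_INO_UNIQUIFY_SKETCH = 1,	/* xor ino_uniquifier into the fileid */
	FH_OPTION_ALL_SKETCH = 1		/* mask of every option currently known */
};

/* A filehandle carrying any unknown option bit is rejected. */
bool fh_options_valid(uint8_t options)
{
	return (options & ~FH_OPTION_ALL_SKETCH) == 0;
}

/* True when the filehandle asks for uniquified inode numbers. */
bool fh_wants_uniquify(uint8_t options)
{
	return (options & FH_OPTION_INO_UNIQUIFY_SKETCH) != 0;
}
```

An old filehandle has the byte set to 0, so it stays valid and never
triggers the new behaviour.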

NFSD will only set this bit in filehandles it reports if the filehandle
of the parent (provided by the client) contains the bit, or if
 - the filehandle for the parent is not provided or is for a different
   export and
 - the filehandle refers to a BTRFS filesystem.
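
The rule above can be sketched in userspace C as follows.  The struct
and function names are hypothetical simplifications (the real code works
on struct svc_fh and struct svc_export and tests the superblock magic);
only the decision logic is meant to match fh_compose() in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FH_OPTION_INO_UNIQUIFY_SKETCH 1u

/* Hypothetical, stripped-down stand-ins for svc_export and svc_fh. */
struct export_sketch {
	bool is_btrfs;
};

struct fh_sketch {
	const struct export_sketch *export;
	uint8_t options;
};

/* Pick the options byte for a new filehandle in export 'exp'. */
uint8_t choose_fh_options(const struct fh_sketch *ref_fh,
			  const struct export_sketch *exp)
{
	/* Parent handle from the same export: inherit whatever it carries. */
	if (ref_fh && ref_fh->export == exp)
		return ref_fh->options;

	/* No usable parent handle: enable the option only for btrfs. */
	return exp->is_btrfs ? FH_OPTION_INO_UNIQUIFY_SKETCH : 0;
}
```

Inheriting from the parent is what keeps an existing mount on the old
inode numbers: its root filehandle has no option bit, so nothing derived
from it gains one.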

Thus if you have a BTRFS filesystem originally mounted from a server
without this patch, the flag will never be set and the current behaviour
will continue.  Only once you re-mount the filesystem (or the filesystem
is re-auto-mounted) will the inode numbers change.  When that happens,
it is likely that the filesystem st_dev number seen on the client will
change anyway.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/inode.c                |  4 ++++
 fs/nfsd/nfs3xdr.c               | 15 ++++++++++++++-
 fs/nfsd/nfs4xdr.c               |  7 ++++---
 fs/nfsd/nfsfh.c                 | 13 +++++++++++--
 fs/nfsd/nfsfh.h                 | 22 ++++++++++++++++++++++
 fs/nfsd/xdr3.h                  |  2 ++
 include/linux/stat.h            | 18 ++++++++++++++++++
 include/uapi/linux/nfsd/nfsfh.h | 18 ++++++++++++------
 8 files changed, 87 insertions(+), 12 deletions(-)

Comments

kernel test robot Aug. 23, 2021, 8:17 a.m. UTC | #1
Hi NeilBrown,

Thank you for the patch! There is still something to improve:

[auto build test ERROR on nfs/linux-next]
[also build test ERROR on hch-configfs/for-next linus/master v5.14-rc7 next-20210820]
[cannot apply to kdave/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/BTRFS-NFSD-provide-more-unique-inode-number-for-btrfs-export/20210823-120718
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: hexagon-randconfig-r045-20210822 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 79b55e5038324e61a3abf4e6a9a949c473edd858)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e99ff00e4055532e35c592b50809761d82f87595
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/BTRFS-NFSD-provide-more-unique-inode-number-for-btrfs-export/20210823-120718
        git checkout e99ff00e4055532e35c592b50809761d82f87595
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
                   if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
                                                            ^
>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
   3 errors generated.
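
The error suggests that fs/nfsd/nfsfh.c does not see a definition of
BTRFS_SUPER_MAGIC in this configuration.  The constant is defined in
<linux/magic.h>, so one plausible fix (an assumption, not part of the
posted patch) is to include that header in nfsfh.c.  The userspace
sketch below shows the same comparison compiling once the header is
available; sb_is_btrfs() is a hypothetical helper, and the fallback
#define only exists so the sketch builds without kernel headers.

```c
#include <stdbool.h>

/* Use the real header when available; otherwise fall back to its value. */
#if defined(__has_include)
#  if __has_include(<linux/magic.h>)
#    include <linux/magic.h>
#  endif
#endif
#ifndef BTRFS_SUPER_MAGIC
#define BTRFS_SUPER_MAGIC 0x9123683E	/* value from include/uapi/linux/magic.h */
#endif

/* Same test as line 593 of the listing, on a bare magic number. */
bool sb_is_btrfs(unsigned long s_magic)
{
	return s_magic == BTRFS_SUPER_MAGIC;
}
```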


vim +/BTRFS_SUPER_MAGIC +593 fs/nfsd/nfsfh.c

   557	
   558	__be32
   559	fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
   560		   struct svc_fh *ref_fh)
   561	{
   562		/* ref_fh is a reference file handle.
   563		 * if it is non-null and for the same filesystem, then we should compose
   564		 * a filehandle which is of the same version, where possible.
   565		 * Currently, that means that if ref_fh->fh_handle.fh_version == 0xca
   566		 * Then create a 32byte filehandle using nfs_fhbase_old
   567		 *
   568		 */
   569	
   570		struct inode * inode = d_inode(dentry);
   571		dev_t ex_dev = exp_sb(exp)->s_dev;
   572		u8 options = 0;
   573	
   574		dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
   575			MAJOR(ex_dev), MINOR(ex_dev),
   576			(long) d_inode(exp->ex_path.dentry)->i_ino,
   577			dentry,
   578			(inode ? inode->i_ino : 0));
   579	
   580		/* Choose filehandle version and fsid type based on
   581		 * the reference filehandle (if it is in the same export)
   582		 * or the export options.
   583		 */
   584		set_version_and_fsid_type(fhp, exp, ref_fh);
   585	
   586		/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
   587		fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
   588	
   589		if (ref_fh && ref_fh->fh_export == exp) {
   590			options = ref_fh->fh_handle.fh_options;
   591		} else {
   592			/* Set options as needed */
 > 593			if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
   594				options |= NFSD_FH_OPTION_INO_UNIQUIFY;
   595		}
   596	
   597		if (ref_fh == fhp)
   598			fh_put(ref_fh);
   599	
   600		if (fhp->fh_locked || fhp->fh_dentry) {
   601			printk(KERN_ERR "fh_compose: fh %pd2 not initialized!\n",
   602			       dentry);
   603		}
   604		if (fhp->fh_maxsize < NFS_FHSIZE)
   605			printk(KERN_ERR "fh_compose: called with maxsize %d! %pd2\n",
   606			       fhp->fh_maxsize,
   607			       dentry);
   608	
   609		fhp->fh_dentry = dget(dentry); /* our internal copy */
   610		fhp->fh_export = exp_get(exp);
   611	
   612		if (fhp->fh_handle.fh_version == 0xca) {
   613			/* old style filehandle please */
   614			memset(&fhp->fh_handle.fh_base, 0, NFS_FHSIZE);
   615			fhp->fh_handle.fh_size = NFS_FHSIZE;
   616			fhp->fh_handle.ofh_dcookie = 0xfeebbaca;
   617			fhp->fh_handle.ofh_dev =  old_encode_dev(ex_dev);
   618			fhp->fh_handle.ofh_xdev = fhp->fh_handle.ofh_dev;
   619			fhp->fh_handle.ofh_xino =
   620				ino_t_to_u32(d_inode(exp->ex_path.dentry)->i_ino);
   621			fhp->fh_handle.ofh_dirino = ino_t_to_u32(parent_ino(dentry));
   622			if (inode)
   623				_fh_update_old(dentry, exp, &fhp->fh_handle);
   624		} else {
   625			fhp->fh_handle.fh_size =
   626				key_len(fhp->fh_handle.fh_fsid_type) + 4;
   627			fhp->fh_handle.fh_options = options;
   628	
   629			mk_fsid(fhp->fh_handle.fh_fsid_type,
   630				fhp->fh_handle.fh_fsid,
   631				ex_dev,
   632				d_inode(exp->ex_path.dentry)->i_ino,
   633				exp->ex_fsid, exp->ex_uuid);
   634	
   635			if (inode)
   636				_fh_update(fhp, exp, dentry);
   637			if (fhp->fh_handle.fh_fileid_type == FILEID_INVALID) {
   638				fh_put(fhp);
   639				return nfserr_opnotsupp;
   640			}
   641		}
   642	
   643		return 0;
   644	}
   645	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0117d867ecf8..989fdf2032d5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9195,6 +9195,10 @@  static int btrfs_getattr(struct user_namespace *mnt_userns,
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
 
+	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
+		stat->ino_uniquifier =
+			swab64(BTRFS_I(inode)->root->root_key.objectid);
+
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
 	inode_bytes = inode_get_bytes(inode);
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..19d14f11f79a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -340,6 +340,7 @@  svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 {
 	struct user_namespace *userns = nfsd_user_namespace(rqstp);
 	__be32 *p;
+	u64 ino;
 	u64 fsid;
 
 	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
@@ -377,7 +378,8 @@  svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 	p = xdr_encode_hyper(p, fsid);
 
 	/* fileid */
-	p = xdr_encode_hyper(p, stat->ino);
+	ino = nfsd_uniquify_ino(fhp, stat);
+	p = xdr_encode_hyper(p, ino);
 
 	p = encode_nfstime3(p, &stat->atime);
 	p = encode_nfstime3(p, &stat->mtime);
@@ -1151,6 +1153,17 @@  svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
 	if (xdr_stream_encode_item_present(xdr) < 0)
 		return false;
 	/* fileid */
+	if (!resp->dir_have_uniquifier) {
+		struct kstat stat;
+		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
+			resp->dir_ino_uniquifier =
+				nfsd_ino_uniquifier(&resp->fh, &stat);
+		else
+			resp->dir_ino_uniquifier = 0;
+		resp->dir_have_uniquifier = true;
+	}
+	if (resp->dir_ino_uniquifier != ino)
+		ino ^= resp->dir_ino_uniquifier;
 	if (xdr_stream_encode_u64(xdr, ino) < 0)
 		return false;
 	/* name */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5ed894ceebb0 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3114,10 +3114,11 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 					fhp->fh_handle.fh_size);
 	}
 	if (bmval0 & FATTR4_WORD0_FILEID) {
+		u64 ino = nfsd_uniquify_ino(fhp, &stat);
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
 			goto out_resource;
-		p = xdr_encode_hyper(p, stat.ino);
+		p = xdr_encode_hyper(p, ino);
 	}
 	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
 		p = xdr_reserve_space(xdr, 8);
@@ -3274,7 +3275,7 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
-                	goto out_resource;
+			goto out_resource;
 		/*
 		 * Get parent's attributes if not ignoring crossmount
 		 * and this is the root of a cross-mounted filesystem.
@@ -3284,7 +3285,7 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			err = get_parent_attributes(exp, &parent_stat);
 			if (err)
 				goto out_nfserr;
-			ino = parent_stat.ino;
+			ino = nfsd_uniquify_ino(fhp, &parent_stat);
 		}
 		p = xdr_encode_hyper(p, ino);
 	}
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c475d2271f9c..e97ed957a379 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -172,7 +172,7 @@  static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
 
 		if (--data_left < 0)
 			return error;
-		if (fh->fh_auth_type != 0)
+		if ((fh->fh_options & ~NFSD_FH_OPTION_ALL) != 0)
 			return error;
 		len = key_len(fh->fh_fsid_type) / 4;
 		if (len == 0)
@@ -569,6 +569,7 @@  fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 
 	struct inode * inode = d_inode(dentry);
 	dev_t ex_dev = exp_sb(exp)->s_dev;
+	u8 options = 0;
 
 	dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
 		MAJOR(ex_dev), MINOR(ex_dev),
@@ -585,6 +586,14 @@  fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
 	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
 
+	if (ref_fh && ref_fh->fh_export == exp) {
+		options = ref_fh->fh_handle.fh_options;
+	} else {
+		/* Set options as needed */
+		if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
+			options |= NFSD_FH_OPTION_INO_UNIQUIFY;
+	}
+
 	if (ref_fh == fhp)
 		fh_put(ref_fh);
 
@@ -615,7 +624,7 @@  fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	} else {
 		fhp->fh_handle.fh_size =
 			key_len(fhp->fh_handle.fh_fsid_type) + 4;
-		fhp->fh_handle.fh_auth_type = 0;
+		fhp->fh_handle.fh_options = options;
 
 		mk_fsid(fhp->fh_handle.fh_fsid_type,
 			fhp->fh_handle.fh_fsid,
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 6106697adc04..1144a98c2951 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -84,6 +84,28 @@  enum fsid_source {
 };
 extern enum fsid_source fsid_source(const struct svc_fh *fhp);
 
+enum nfsd_fh_options {
+	NFSD_FH_OPTION_INO_UNIQUIFY = 1,	/* BTRFS only */
+
+	NFSD_FH_OPTION_ALL = 1
+};
+
+static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp,
+				      const struct kstat *stat)
+{
+	if (fhp->fh_handle.fh_options & NFSD_FH_OPTION_INO_UNIQUIFY)
+		return stat->ino_uniquifier;
+	return 0;
+}
+
+static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp,
+				    const struct kstat *stat)
+{
+	u64 u = nfsd_ino_uniquifier(fhp, stat);
+	if (u != stat->ino)
+		return stat->ino ^ u;
+	return stat->ino;
+}
 
 /*
  * This might look a little large to "inline" but in all calls except
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 933008382bbe..d9b6c8314bbb 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -179,6 +179,8 @@  struct nfsd3_readdirres {
 	struct xdr_buf		dirlist;
 	struct svc_fh		scratch;
 	struct readdir_cd	common;
+	u64			dir_ino_uniquifier;
+	bool			dir_have_uniquifier;
 	unsigned int		cookie_offset;
 	struct svc_rqst *	rqstp;
 
diff --git a/include/linux/stat.h b/include/linux/stat.h
index fff27e603814..0f3f74d302f8 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,24 @@  struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	/*
+	 * BTRFS does not provide unique inode numbers within a filesystem,
+	 * depending on a synthetic 'dev' to provide uniqueness.
+	 * NFSd cannot make use of this 'dev' number so clients often see
+	 * duplicate inode numbers.
+	 * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem
+	 * has created a great many inodes.
+	 * It puts another number in ino_uniquifier which:
+	 * - has most entropy in the high bits
+	 * - is different precisely when 'dev' is different
+	 * - is stable across unmount/remount
+	 * NFSd can xor this with 'ino' to get a substantially more unique
+	 * number for reporting to the client.
+	 * The ino_uniquifier for a directory can reasonably be applied
+	 * to inode numbers reported by the readdir filldir callback.
+	 * It is NOT currently exported to user-space.
+	 */
+	u64		ino_uniquifier;
 };
 
 #endif
diff --git a/include/uapi/linux/nfsd/nfsfh.h b/include/uapi/linux/nfsd/nfsfh.h
index 427294dd56a1..59311df4b476 100644
--- a/include/uapi/linux/nfsd/nfsfh.h
+++ b/include/uapi/linux/nfsd/nfsfh.h
@@ -38,11 +38,17 @@  struct nfs_fhbase_old {
  * The file handle starts with a sequence of four-byte words.
  * The first word contains a version number (1) and three descriptor bytes
  * that tell how the remaining 3 variable length fields should be handled.
- * These three bytes are auth_type, fsid_type and fileid_type.
+ * These three bytes are options, fsid_type and fileid_type.
  *
  * All four-byte values are in host-byte-order.
  *
- * The auth_type field is deprecated and must be set to 0.
+ * The options field (previously auth_type) can be used when nfsd behaviour
+ * needs to change in a non-compatible way, usually for some specific
+ * filesystem.  Options should only be set in filehandles for filesystems which
+ * need them.
+ * Current values:
+ *   1  -  BTRFS only.  Cause stat->ino_uniquifier to be used to improve inode
+ *         number uniqueness.
  *
  * The fsid_type identifies how the filesystem (or export point) is
  *    encoded.
@@ -67,7 +73,7 @@  struct nfs_fhbase_new {
 	union {
 		struct {
 			__u8		fb_version_aux;	/* == 1, even => nfs_fhbase_old */
-			__u8		fb_auth_type_aux;
+			__u8		fb_options_aux;
 			__u8		fb_fsid_type_aux;
 			__u8		fb_fileid_type_aux;
 			__u32		fb_auth[1];
@@ -76,7 +82,7 @@  struct nfs_fhbase_new {
 		};
 		struct {
 			__u8		fb_version;	/* == 1, even => nfs_fhbase_old */
-			__u8		fb_auth_type;
+			__u8		fb_options;
 			__u8		fb_fsid_type;
 			__u8		fb_fileid_type;
 			__u32		fb_auth_flex[]; /* flexible-array member */
@@ -106,11 +112,11 @@  struct knfsd_fh {
 
 #define	fh_version		fh_base.fh_new.fb_version
 #define	fh_fsid_type		fh_base.fh_new.fb_fsid_type
-#define	fh_auth_type		fh_base.fh_new.fb_auth_type
+#define	fh_options		fh_base.fh_new.fb_options
 #define	fh_fileid_type		fh_base.fh_new.fb_fileid_type
 #define	fh_fsid			fh_base.fh_new.fb_auth_flex
 
 /* Do not use, provided for userspace compatiblity. */
-#define	fh_auth			fh_base.fh_new.fb_auth
+#define	fh_auth			fh_base.fh_new.fb_options
 
 #endif /* _UAPI_LINUX_NFSD_FH_H */