diff mbox series

[PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

Message ID 168168683217.24821.6260957092725278201@noble.neil.brown.name (mailing list archive)
State Mainlined, archived
Headers show
Series [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. | expand

Commit Message

NeilBrown April 16, 2023, 11:13 p.m. UTC
When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
engage with underlying systems at all.  Any mount point MUST be in the
dcache with a complete direct path from the root to the mountpoint.
That should be sufficient to find the mountpoint given a path name.

This becomes an issue when the filesystem changes unexpected, such as
when a NFS server is changed to reject all access.  It then becomes
impossible to unmount anything mounted on the filesystem which has
changed.  We could simply lazy-unmount the changed filesystem and that
will often be sufficient.  However if the target filesystem needs
"umount -f" to complete the unmount properly, then the lazy unmount will
leave it incompletely unmounted.  When "-f" is needed, we really need to
be able to name the target filesystem.

We NEVER want to revalidate anything.  We already avoid the revalidation
of the mountpoint itself, be we won't need to revalidate anything on the
path either as thay might affect the cache, and the cache is what we are
really looking in.

Permission checks are a little less clear.  We currently allow any user
to make the umount syscall and perform the path lookup and only reject
unprivileged users once the target mount point has been found.  If we
completely relax permission checks then an unprivileged user could probe
inaccessible parts of the name space by examining the error returned
from umount().

So we only relax permission check when the user has CAP_SYS_ADMIN
(may_mount() succeeds).

Note that if the path given is not direct and for example uses symlinks
or "..", then dentries or symlink content may not be cached and a remote
server could cause problem.  We can only be certain of a safe unmount if
a direct path is used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

Comments

Christian Brauner April 17, 2023, 11:55 a.m. UTC | #1
On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> 
> When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> engage with underlying systems at all.  Any mount point MUST be in the
> dcache with a complete direct path from the root to the mountpoint.
> That should be sufficient to find the mountpoint given a path name.
> 
> This becomes an issue when the filesystem changes unexpected, such as
> when a NFS server is changed to reject all access.  It then becomes
> impossible to unmount anything mounted on the filesystem which has
> changed.  We could simply lazy-unmount the changed filesystem and that
> will often be sufficient.  However if the target filesystem needs
> "umount -f" to complete the unmount properly, then the lazy unmount will
> leave it incompletely unmounted.  When "-f" is needed, we really need to

I don't understand this yet. All I see is nfs_umount_begin() that's
different for MNT_FORCE to kill remaining io. Why does that preclude
MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
get rid is to use MNT_DETACH, no? So I don't see why that is an
argument.

> be able to name the target filesystem.
> 
> We NEVER want to revalidate anything.  We already avoid the revalidation
> of the mountpoint itself, be we won't need to revalidate anything on the
> path either as thay might affect the cache, and the cache is what we are
> really looking in.
> 
> Permission checks are a little less clear.  We currently allow any user

This is all very brittle.

First case that comes to mind is overlayfs where the permission checking
is performed twice. Once on the overlayfs inode itself based on the
caller's security context and a second time on the underlying inode with
the security context of the mounter of the overlayfs instance.

A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
they'd be able to mount the overlayfs instance but would be restricted
from accessing certain directories or files. The task accessing the
overlayfs instance however could have a completely different security
context including CAP_DAC_READ_SEARCH and such. Both tasks could
reasonably be in different user namespaces and so on.

The LSM hooks are also called twice and would now also be called once.

It also forgets that acl_permission() check may very well call into the
filesystem via check_acl().

So umount could either be used to infer existence of files that the user
wouldn't otherwise know they exist or in the worst case allow to umount
something that they wouldn't have access to.

Aside from that this would break userspace assumptions and as Al and
I've mentioned before in the other thread you'd need a new flag to
umount2() for this. The permission model can't just change behind users
back.

But I dislike it for the now even more special-cased umount path lookup
alone tbh. I'd feel way more comfortable with a non-lookup related
solution that doesn't have subtle implications for permission checking.

> to make the umount syscall and perform the path lookup and only reject
> unprivileged users once the target mount point has been found.  If we
> completely relax permission checks then an unprivileged user could probe
> inaccessible parts of the name space by examining the error returned
> from umount().
> 
> So we only relax permission check when the user has CAP_SYS_ADMIN
> (may_mount() succeeds).
> 
> Note that if the path given is not direct and for example uses symlinks
> or "..", then dentries or symlink content may not be cached and a remote
> server could cause problem.  We can only be certain of a safe unmount if
> a direct path is used.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index edfedfbccaef..f2df1adae7c5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
>   *
>   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
>   */
> -int inode_permission(struct mnt_idmap *idmap,
> -		     struct inode *inode, int mask)
> +int inode_permission_mp(struct mnt_idmap *idmap,
> +			struct inode *inode, int mask, bool mp)
>  {
>  	int retval;
>  
> @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
>  			return -EACCES;
>  	}
>  
> -	retval = do_inode_permission(idmap, inode, mask);
> +	if (mp)
> +		/* We are looking for a mountpoint and so don't
> +		 * want to interact with the filesystem at all, just
> +		 * the dcache and icache.
> +		 */
> +		retval = generic_permission(idmap, inode, mask);
> +	else
> +		retval = do_inode_permission(idmap, inode, mask);
>  	if (retval)
>  		return retval;
>  
> @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
>  
>  	return security_inode_permission(inode, mask);
>  }
> +
> +/**
> + * inode_permission - Check for access rights to a given inode
> + * @idmap:	idmap of the mount the inode was found from
> + * @inode:	Inode to check permission on
> + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> + * this, letting us set arbitrary permissions for filesystem access without
> + * changing the "normal" UIDs which are used for other things.
> + *
> + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> + */
> +int inode_permission(struct mnt_idmap *idmap,
> +		     struct inode *inode, int mask)
> +{
> +	return inode_permission_mp(idmap, inode, mask, false);
> +}
>  EXPORT_SYMBOL(inode_permission);
>  
>  /**
> @@ -589,6 +614,7 @@ struct nameidata {
>  #define ND_ROOT_PRESET 1
>  #define ND_ROOT_GRABBED 2
>  #define ND_JUMPED 4
> +#define ND_SYS_ADMIN 8
>  
>  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>  {
> @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
>  
>  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
>  {
> -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
>  		return dentry->d_op->d_revalidate(dentry, flags);
>  	else
>  		return 1;
> @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
>  static inline int may_lookup(struct mnt_idmap *idmap,
>  			     struct nameidata *nd)
>  {
> +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> +	 * capability, then we will by-pass the filesys for permission checks
> +	 * and just use generic_permission().
> +	 */
> +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
>  	if (nd->flags & LOOKUP_RCU) {
> -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
>  		if (err != -ECHILD || !try_to_unlazy(nd))
>  			return err;
>  	}
> -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
>  }
>  
>  static int reserve_stack(struct nameidata *nd, struct path *link)
> @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
>  	if (IS_ERR(name))
>  		return PTR_ERR(name);
>  	set_nameidata(&nd, dfd, name, root);
> +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> +		nd.state = ND_SYS_ADMIN;
>  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
>  	if (unlikely(retval == -ECHILD))
>  		retval = path_lookupat(&nd, flags, path);
> -- 
> 2.40.0
>
Jeff Layton April 17, 2023, 12:09 p.m. UTC | #2
On Mon, 2023-04-17 at 09:13 +1000, NeilBrown wrote:
> When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> engage with underlying systems at all.  Any mount point MUST be in the
> dcache with a complete direct path from the root to the mountpoint.
> That should be sufficient to find the mountpoint given a path name.
> 
> This becomes an issue when the filesystem changes unexpected, such as
> when a NFS server is changed to reject all access.  It then becomes
> impossible to unmount anything mounted on the filesystem which has
> changed.  We could simply lazy-unmount the changed filesystem and that
> will often be sufficient.  However if the target filesystem needs
> "umount -f" to complete the unmount properly, then the lazy unmount will
> leave it incompletely unmounted.  When "-f" is needed, we really need to
> be able to name the target filesyste
> 
> We NEVER want to revalidate anything.  We already avoid the revalidation
> of the mountpoint itself, be we won't need to revalidate anything on the
> path either as thay might affect the cache, and the cache is what we are
> really looking in.
> 
> Permission checks are a little less clear.  We currently allow any user
> to make the umount syscall and perform the path lookup and only reject
> unprivileged users once the target mount point has been found.  If we
> completely relax permission checks then an unprivileged user could probe
> inaccessible parts of the name space by examining the error returned
> from umount().
> 

That sounds pretty reasonable. Most umounts are done by root in some
fashion anyway.



> So we only relax permission check when the user has CAP_SYS_ADMIN
> (may_mount() succeeds).
> 
> Note that if the path given is not direct and for example uses symlinks
> or "..", then dentries or symlink content may not be cached and a remote
> server could cause problem.  We can only be certain of a safe unmount if
> a direct path is used.
> 

I guess we do still have to allow it to do a lookup due to symlinks. I
think this is still worthwhile though since it'd fix a lot of these
cases.

> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index edfedfbccaef..f2df1adae7c5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
>   *
>   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
>   */
> -int inode_permission(struct mnt_idmap *idmap,
> -		     struct inode *inode, int mask)
> +int inode_permission_mp(struct mnt_idmap *idmap,
> +			struct inode *inode, int mask, bool mp)
>  {
>  	int retval;
>  
> @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
>  			return -EACCES;
>  	}
>  
> -	retval = do_inode_permission(idmap, inode, mask);
> +	if (mp)
> +		/* We are looking for a mountpoint and so don't
> +		 * want to interact with the filesystem at all, just
> +		 * the dcache and icache.
> +		 */
> +		retval = generic_permission(idmap, inode, mask);
> +	else
> +		retval = do_inode_permission(idmap, inode, mask);
>  	if (retval)
>  		return retval;
>  
> @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
>  
>  	return security_inode_permission(inode, mask);
>  }
> +
> +/**
> + * inode_permission - Check for access rights to a given inode
> + * @idmap:	idmap of the mount the inode was found from
> + * @inode:	Inode to check permission on
> + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> + * this, letting us set arbitrary permissions for filesystem access without
> + * changing the "normal" UIDs which are used for other things.
> + *
> + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> + */
> +int inode_permission(struct mnt_idmap *idmap,
> +		     struct inode *inode, int mask)
> +{
> +	return inode_permission_mp(idmap, inode, mask, false);
> +}
>  EXPORT_SYMBOL(inode_permission);
>  
>  /**
> @@ -589,6 +614,7 @@ struct nameidata {
>  #define ND_ROOT_PRESET 1
>  #define ND_ROOT_GRABBED 2
>  #define ND_JUMPED 4
> +#define ND_SYS_ADMIN 8
>  
>  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>  {
> @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
>  
>  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
>  {
> -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
>  		return dentry->d_op->d_revalidate(dentry, flags);
>  	else
>  		return 1;
> @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
>  static inline int may_lookup(struct mnt_idmap *idmap,
>  			     struct nameidata *nd)
>  {
> +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> +	 * capability, then we will by-pass the filesys for permission checks
> +	 * and just use generic_permission().
> +	 */
> +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
>  	if (nd->flags & LOOKUP_RCU) {
> -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
>  		if (err != -ECHILD || !try_to_unlazy(nd))
>  			return err;
>  	}
> -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
>  }
>  
>  static int reserve_stack(struct nameidata *nd, struct path *link)
> @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
>  	if (IS_ERR(name))
>  		return PTR_ERR(name);
>  	set_nameidata(&nd, dfd, name, root);
> +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> +		nd.state = ND_SYS_ADMIN;
>  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
>  	if (unlikely(retval == -ECHILD))
>  		retval = path_lookupat(&nd, flags, path);

This behavior looks right along the lines of what I was thinking.

Just for bisectability, it might be worthwhile to break this up along
conceptual lines: one patch to make it skip d_revalidate, one that
changes the permission checking, etc.

I'll plan to give this a try soon with Dave's reproducer.

Thanks!
Jeff Layton April 17, 2023, 12:25 p.m. UTC | #3
On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > 
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all.  Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> > 
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access.  It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed.  We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient.  However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted.  When "-f" is needed, we really need to
> 
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.
> 
> > be able to name the target filesystem.
> > 
> > We NEVER want to revalidate anything.  We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> > 
> > Permission checks are a little less clear.  We currently allow any user
> 
> This is all very brittle.
> 
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
> 
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
> 
> The LSM hooks are also called twice and would now also be called once.
> 
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().
> 
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
> 
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
> 
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.
> 

These are good points.

One way around the issues you point out might be to pass down a new
MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
filesystem driver to decide whether it wants to avoid potentially
problematic activity when checking permissions. nfs_permission could
check for that and take a more hands-off approach to the permissions
check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
think that might do what we need.

> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found.  If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> > 
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> > 
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem.  We can only be certain of a safe unmount if
> > a direct path is used.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> >   *
> >   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> >   */
> > -int inode_permission(struct mnt_idmap *idmap,
> > -		     struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > +			struct inode *inode, int mask, bool mp)
> >  {
> >  	int retval;
> >  
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> >  			return -EACCES;
> >  	}
> >  
> > -	retval = do_inode_permission(idmap, inode, mask);
> > +	if (mp)
> > +		/* We are looking for a mountpoint and so don't
> > +		 * want to interact with the filesystem at all, just
> > +		 * the dcache and icache.
> > +		 */
> > +		retval = generic_permission(idmap, inode, mask);
> > +	else
> > +		retval = do_inode_permission(idmap, inode, mask);
> >  	if (retval)
> >  		return retval;
> >  
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >  
> >  	return security_inode_permission(inode, mask);
> >  }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @inode:	Inode to check permission on
> > + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > +		     struct inode *inode, int mask)
> > +{
> > +	return inode_permission_mp(idmap, inode, mask, false);
> > +}
> >  EXPORT_SYMBOL(inode_permission);
> >  
> >  /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> >  #define ND_ROOT_PRESET 1
> >  #define ND_ROOT_GRABBED 2
> >  #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >  
> >  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> >  {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >  
> >  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> >  {
> > -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
> >  		return dentry->d_op->d_revalidate(dentry, flags);
> >  	else
> >  		return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> >  static inline int may_lookup(struct mnt_idmap *idmap,
> >  			     struct nameidata *nd)
> >  {
> > +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> > +	 * capability, then we will by-pass the filesys for permission checks
> > +	 * and just use generic_permission().
> > +	 */
> > +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> >  	if (nd->flags & LOOKUP_RCU) {
> > -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> >  		if (err != -ECHILD || !try_to_unlazy(nd))
> >  			return err;
> >  	}
> > -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> > +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> >  }
> >  
> >  static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> >  	if (IS_ERR(name))
> >  		return PTR_ERR(name);
> >  	set_nameidata(&nd, dfd, name, root);
> > +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > +		nd.state = ND_SYS_ADMIN;
> >  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> >  	if (unlikely(retval == -ECHILD))
> >  		retval = path_lookupat(&nd, flags, path);
> > -- 
> > 2.40.0
> >
Christian Brauner April 17, 2023, 2:24 p.m. UTC | #4
On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote:
> On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > > 
> > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > > engage with underlying systems at all.  Any mount point MUST be in the
> > > dcache with a complete direct path from the root to the mountpoint.
> > > That should be sufficient to find the mountpoint given a path name.
> > > 
> > > This becomes an issue when the filesystem changes unexpected, such as
> > > when a NFS server is changed to reject all access.  It then becomes
> > > impossible to unmount anything mounted on the filesystem which has
> > > changed.  We could simply lazy-unmount the changed filesystem and that
> > > will often be sufficient.  However if the target filesystem needs
> > > "umount -f" to complete the unmount properly, then the lazy unmount will
> > > leave it incompletely unmounted.  When "-f" is needed, we really need to
> > 
> > I don't understand this yet. All I see is nfs_umount_begin() that's
> > different for MNT_FORCE to kill remaining io. Why does that preclude
> > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> > get rid is to use MNT_DETACH, no? So I don't see why that is an
> > argument.
> > 
> > > be able to name the target filesystem.
> > > 
> > > We NEVER want to revalidate anything.  We already avoid the revalidation
> > > of the mountpoint itself, be we won't need to revalidate anything on the
> > > path either as thay might affect the cache, and the cache is what we are
> > > really looking in.
> > > 
> > > Permission checks are a little less clear.  We currently allow any user
> > 
> > This is all very brittle.
> > 
> > First case that comes to mind is overlayfs where the permission checking
> > is performed twice. Once on the overlayfs inode itself based on the
> > caller's security context and a second time on the underlying inode with
> > the security context of the mounter of the overlayfs instance.
> > 
> > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> > they'd be able to mount the overlayfs instance but would be restricted
> > from accessing certain directories or files. The task accessing the
> > overlayfs instance however could have a completely different security
> > context including CAP_DAC_READ_SEARCH and such. Both tasks could
> > reasonably be in different user namespaces and so on.
> > 
> > The LSM hooks are also called twice and would now also be called once.
> > 
> > It also forgets that acl_permission() check may very well call into the
> > filesystem via check_acl().
> > 
> > So umount could either be used to infer existence of files that the user
> > wouldn't otherwise know they exist or in the worst case allow to umount
> > something that they wouldn't have access to.
> > 
> > Aside from that this would break userspace assumptions and as Al and
> > I've mentioned before in the other thread you'd need a new flag to
> > umount2() for this. The permission model can't just change behind users
> > back.
> > 
> > But I dislike it for the now even more special-cased umount path lookup
> > alone tbh. I'd feel way more comfortable with a non-lookup related
> > solution that doesn't have subtle implications for permission checking.
> > 
> 
> These are good points.
> 
> One way around the issues you point out might be to pass down a new
> MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
> filesystem driver to decide whether it wants to avoid potentially
> problematic activity when checking permissions. nfs_permission could
> check for that and take a more hands-off approach to the permissions
> check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
> think that might do what we need.

Yes, that's pretty obvious. I considered that, wrote the section and
deleted it again because I still find this pretty ugly. It does leak
very specific lookup information into filesystems that isn't generically
useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with
a server like NFS does and so don't suffer from this issue.

The crucial change in the patchset in its current form is that you're
requesting from the VFS to significantly alter permission checking just
because this is a umount request in a pretty fundamental way for roughly
21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS
can't just skip a filesystem's ->permission() handler with "well, it's
just on a way to a umount so whatever". That's just not going to be
correct and is just a source of subtle security bugs. So NAK on that.

And I'm curious why is it obvious that we don't want to revalidate _any_
path component and not just the last one? Why is that generally safe?
Why can't this be used to access files and directories the caller
wouldn't otherwise be able to access? I would like to have this spelled
out for slow people like me, please.

From my point of view, this would only be somewhat safe _generally_ if
you'd allow circumvention for revalidation and permission checking if
MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
You'd still mess with overlayfs permission model in this case though.

Plus, there are better options of solving this problem. Again, I'd
rather build a separate api for unmounting then playing such potentially
subtle security sensitive games with permission checking during path
lookup.
Jeff Layton April 17, 2023, 3:21 p.m. UTC | #5
On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote:
> > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > > > 
> > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > > > engage with underlying systems at all.  Any mount point MUST be in the
> > > > dcache with a complete direct path from the root to the mountpoint.
> > > > That should be sufficient to find the mountpoint given a path name.
> > > > 
> > > > This becomes an issue when the filesystem changes unexpected, such as
> > > > when a NFS server is changed to reject all access.  It then becomes
> > > > impossible to unmount anything mounted on the filesystem which has
> > > > changed.  We could simply lazy-unmount the changed filesystem and that
> > > > will often be sufficient.  However if the target filesystem needs
> > > > "umount -f" to complete the unmount properly, then the lazy unmount will
> > > > leave it incompletely unmounted.  When "-f" is needed, we really need to
> > > 
> > > I don't understand this yet. All I see is nfs_umount_begin() that's
> > > different for MNT_FORCE to kill remaining io. Why does that preclude
> > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> > > get rid is to use MNT_DETACH, no? So I don't see why that is an
> > > argument.
> > > 
> > > > be able to name the target filesystem.
> > > > 
> > > > We NEVER want to revalidate anything.  We already avoid the revalidation
> > > > of the mountpoint itself, be we won't need to revalidate anything on the
> > > > path either as thay might affect the cache, and the cache is what we are
> > > > really looking in.
> > > > 
> > > > Permission checks are a little less clear.  We currently allow any user
> > > 
> > > This is all very brittle.
> > > 
> > > First case that comes to mind is overlayfs where the permission checking
> > > is performed twice. Once on the overlayfs inode itself based on the
> > > caller's security context and a second time on the underlying inode with
> > > the security context of the mounter of the overlayfs instance.
> > > 
> > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> > > they'd be able to mount the overlayfs instance but would be restricted
> > > from accessing certain directories or files. The task accessing the
> > > overlayfs instance however could have a completely different security
> > > context including CAP_DAC_READ_SEARCH and such. Both tasks could
> > > reasonably be in different user namespaces and so on.
> > > 
> > > The LSM hooks are also called twice and would now also be called once.
> > > 
> > > It also forgets that acl_permission() check may very well call into the
> > > filesystem via check_acl().
> > > 
> > > So umount could either be used to infer existence of files that the user
> > > wouldn't otherwise know they exist or in the worst case allow to umount
> > > something that they wouldn't have access to.
> > > 
> > > Aside from that this would break userspace assumptions and as Al and
> > > I've mentioned before in the other thread you'd need a new flag to
> > > umount2() for this. The permission model can't just change behind users
> > > back.
> > > 
> > > But I dislike it for the now even more special-cased umount path lookup
> > > alone tbh. I'd feel way more comfortable with a non-lookup related
> > > solution that doesn't have subtle implications for permission checking.
> > > 
> > 
> > These are good points.
> > 
> > One way around the issues you point out might be to pass down a new
> > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
> > filesystem driver to decide whether it wants to avoid potentially
> > problematic activity when checking permissions. nfs_permission could
> > check for that and take a more hands-off approach to the permissions
> > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
> > think that might do what we need.
> 
> Yes, that's pretty obvious. I considered that, wrote the section and
> deleted it again because I still find this pretty ugly. It does leak
> very specific lookup information into filesystems that isn't generically
> useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with
> a server like NFS does and so don't suffer from this issue.
> 

Sort of. Most of the MAY flags cover a specific operation. In this case,
we'd just be adding a flag to make it clear that this permission check
is for the specific purpose of chasing down a mountpoint (presumably to
umount).

This field does look less like a "mask" field now though, and more like
a "flags".
 
> The crucial change in the patchset in its current form is that you're
> requesting from the VFS to significantly alter permission checking just
> because this is a umount request in a pretty fundamental way for roughly
> 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS
> can't just skip a filesystem's ->permission() handler with "well, it's
> just on a way to a umount so whatever". That's just not going to be
> correct and is just a source of subtle security bugs. So NAK on that.
>

Fair enough. I too think the permission check ought to be left up to the
filesystem driver. It does need some way to know that the permission
check is in the context of a LOOKUP_MOUNTPOINT pathwalk though. A MAY_*
flag seems like the obvious choice, since we could use that for checking
ACLs too, but maybe there is some better way.

> And I'm curious why is it obvious that we don't want to revalidate _any_
> path component and not just the last one? Why is that generally safe?
> Why can't this be used to access files and directories the caller
> wouldn't otherwise be able to access? I would like to have this spelled
> out for slow people like me, please.
> 
> From my point of view, this would only be somewhat safe _generally_ if
> you'd allow circumvention for revalidation and permission checking if
> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> You'd still mess with overlayfs permission model in this case though.
> 
> Plus, there are better options of solving this problem. Again, I'd
> rather build a separate api for unmounting then playing such potentially
> subtle security sensitive games with permission checking during path
> lookup.

umount(2) is really a special case because the whole intent is to detach
a mount from the local hierarchy and stop using it. The unfortunate bit
is that it is a path-based syscall.

So usually we have to:

- determine the path: Maybe stat() it and to validate that it's the
mountpoint we want to drop
- then call umount with that path

The last thing we want in that case is for the server to decide to
change some intermediate dentry in between the two operations. Best
case, you'll get back ENOENT or something when the pathwalk fails. Worst
case, the server swaps what are two different mountpoints on your client
and you unmount the wrong one.

If we don't revaliate, then we're no worse off, and may be better off if
something hinky happens to the server of an intermediate dentry in the
path.
NeilBrown April 17, 2023, 9:26 p.m. UTC | #6
On Mon, 17 Apr 2023, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > 
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all.  Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> > 
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access.  It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed.  We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient.  However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted.  When "-f" is needed, we really need to
> 
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.

Possibly I could have been more clear.
If /foo is the mount point for a filesystem that is causing problems and
/foo/bar is the mount point of a different filesystem, which might have
other problems, then I was saying that the simple solution of
   umount -l /foo
which would MNT_DETACH both filesystems is not ideal because there might
be pending IO that will never complete, so the mount can never be
cleaned up.  In such cases we really want to be able to do
   umount -f /foo/bar
and ensure that succeeds.
You can see now that it isn't that MNT_FORCE precludes MNT_DETACH, but
they would be done on different filesystems.

MNT_FORCE is, I think, a good idea and a needed functionality that has
never been implemented well.
MNT_FORCE causes nfs_umount_begin to be called as you noted, which
aborts all pending RPCs on that filesystem.
This might be useful if you have an "ls" running, as it is likely to
abort on the first getdents() failure.  But if you have "ls -l' running,
it will likely just go and try another lstat(), and that is just as
likely to hang - thus still preventing the umount.

What we used to do was find anything using the mount and kill it
(sometime "fuser -k" would work).  These processes would likely be stuck
in a D wait, but "umount -f" would abort the RPCs and the processes
would get unstuck and could respond to the signal.  This would sometimes
take a couple of iterations.

These days we have TASK_KILLABLE so this is much less likely to be
needed - we just kill all the processes and they die promptly.  Most of
the time.  There are a few places that can still block, but which no-one
has had the motivation to fix.

It would be much nicer if we didn't need to kill things.  Even with
"fuser -k" it isn't really the sort of thing you want in a shutdown
script (or in systemd-shutdown).

It would be ideal if the shutdown process could call "umount" and if
that fails with EBUSY, call "umount -f" and be confident of ....
something.  Currently "-f" might inspire hope, but not confidence.

We cannot realistically make MNT_FORCE truly force a umount because
processes really need to die before that happens.  But we could ensure
that the filesystem is quiescent and stays that way.

Maybe we could add a flag to 'struct vfsmount' to say that "unmount has
been forced".  Any attempt to use a fd with such a vfsmnt would fail,
expect close() and anything similar.  Maybe nfs_umount_begin could
iterate all open files on that vfsmnt and purge any cached data so no
background writes were needed.  I think this would be very close to what
we needed.

Then systemd could run umount() and if that failed then
umount(MNT_FORCE) and if that fails umount(MNT_DETACH), with confidence
that there would be not more RPCs generates, so it would be safe to turn
off the network.

BTW, that is another reason that just doing "umount -l /foo" is not
ideal.  We sometimes want to wait for the umount to complete, so we can
remove the backing store or the network.  "umount -l /foo" doesn't let
us wait.  "umount /foo/bar" does.

> 
> > be able to name the target filesystem.
> > 
> > We NEVER want to revalidate anything.  We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> > 
> > Permission checks are a little less clear.  We currently allow any user
> 
> This is all very brittle.
> 
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
> 
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
> 
> The LSM hooks are also called twice and would now also be called once.

I'd prefer "called never"

> 
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().

I didn't pay proper attention to acl_permission() - you are right.

> 
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
> 
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
> 
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.

I was really hoping that existing code could be made to just work.

Thanks for the review.

NeilBrown


> 
> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found.  If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> > 
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> > 
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem.  We can only be certain of a safe unmount if
> > a direct path is used.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> >   *
> >   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> >   */
> > -int inode_permission(struct mnt_idmap *idmap,
> > -		     struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > +			struct inode *inode, int mask, bool mp)
> >  {
> >  	int retval;
> >  
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> >  			return -EACCES;
> >  	}
> >  
> > -	retval = do_inode_permission(idmap, inode, mask);
> > +	if (mp)
> > +		/* We are looking for a mountpoint and so don't
> > +		 * want to interact with the filesystem at all, just
> > +		 * the dcache and icache.
> > +		 */
> > +		retval = generic_permission(idmap, inode, mask);
> > +	else
> > +		retval = do_inode_permission(idmap, inode, mask);
> >  	if (retval)
> >  		return retval;
> >  
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >  
> >  	return security_inode_permission(inode, mask);
> >  }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @inode:	Inode to check permission on
> > + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > +		     struct inode *inode, int mask)
> > +{
> > +	return inode_permission_mp(idmap, inode, mask, false);
> > +}
> >  EXPORT_SYMBOL(inode_permission);
> >  
> >  /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> >  #define ND_ROOT_PRESET 1
> >  #define ND_ROOT_GRABBED 2
> >  #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >  
> >  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> >  {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >  
> >  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> >  {
> > -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
> >  		return dentry->d_op->d_revalidate(dentry, flags);
> >  	else
> >  		return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> >  static inline int may_lookup(struct mnt_idmap *idmap,
> >  			     struct nameidata *nd)
> >  {
> > +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> > +	 * capability, then we will by-pass the filesys for permission checks
> > +	 * and just use generic_permission().
> > +	 */
> > +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> >  	if (nd->flags & LOOKUP_RCU) {
> > -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> >  		if (err != -ECHILD || !try_to_unlazy(nd))
> >  			return err;
> >  	}
> > -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> > +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> >  }
> >  
> >  static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> >  	if (IS_ERR(name))
> >  		return PTR_ERR(name);
> >  	set_nameidata(&nd, dfd, name, root);
> > +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > +		nd.state = ND_SYS_ADMIN;
> >  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> >  	if (unlikely(retval == -ECHILD))
> >  		retval = path_lookupat(&nd, flags, path);
> > -- 
> > 2.40.0
> > 
>
NeilBrown April 17, 2023, 9:34 p.m. UTC | #7
On Tue, 18 Apr 2023, Jeff Layton wrote:
> 
> The last thing we want in that case is for the server to decide to
> change some intermediate dentry in between the two operations. Best
> case, you'll get back ENOENT or something when the pathwalk fails. Worst
> case, the server swaps what are two different mountpoints on your client
> and you unmount the wrong one.

Actually, the last think I want is for the server to do something that
causes a directory to be invalidated (d_invalidate()).  Then any mount
points below there get DETACHed and we lose any change to use MNT_FORCE
or to wait for the unmount to complete.  Of course this can also happen
during any other access, not just umount....

> 
> If we don't revaliate, then we're no worse off, and may be better off if
> something hinky happens to the server of an intermediate dentry in the
> path.

Exactly.  If we don't revalidate we might use old data, but it will be
old data that were were allowed to access.  It might be data that we are
not now allowed to access, but it will never be new data that were have
never been allowed to access.

Thanks,
NeilBrown
Andreas Dilger April 18, 2023, 3:25 a.m. UTC | #8
> On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
>> And I'm curious why is it obvious that we don't want to revalidate _any_
>> path component and not just the last one? Why is that generally safe?
>> Why can't this be used to access files and directories the caller
>> wouldn't otherwise be able to access? I would like to have this spelled
>> out for slow people like me, please.
>> 
>> From my point of view, this would only be somewhat safe _generally_ if
>> you'd allow circumvention for revalidation and permission checking if
>> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
>> You'd still mess with overlayfs permission model in this case though.
>> 
>> Plus, there are better options of solving this problem. Again, I'd
>> rather build a separate api for unmounting then playing such potentially
>> subtle security sensitive games with permission checking during path
>> lookup.
> 
> umount(2) is really a special case because the whole intent is to detach
> a mount from the local hierarchy and stop using it. The unfortunate bit
> is that it is a path-based syscall.
> 
> So usually we have to:
> 
> - determine the path: Maybe stat() it and to validate that it's the
>   mountpoint we want to drop

The stat() itself may hang because a remote server, or USB stick is
inaccessible or having media errors.

I've just been having a conversation with Karel Zak to change
umount(1) to use statx() so that it interacts minimally with the fs.

In particular, nfs_getattr() skips revalidate if only minimal attrs
are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
locally-cached attrs are still valid (STATX_MODE), so this will
avoid yet one more place that unmount can hang.

In theory, vfs_getattr() could get all of these attributes directly
from the vfs_inode in the unmount case.

> - then call umount with that path
> 
> The last thing we want in that case is for the server to decide to
> change some intermediate dentry in between the two operations. Best
> case, you'll get back ENOENT or something when the pathwalk fails. Worst
> case, the server swaps what are two different mountpoints on your client
> and you unmount the wrong one.
> 
> If we don't revaliate, then we're no worse off, and may be better off if
> something hinky happens to the server of an intermediate dentry in the
> path.
> --
> Jeff Layton <jlayton@kernel.org>


Cheers, Andreas
Christian Brauner April 18, 2023, 8:04 a.m. UTC | #9
On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
> 
> > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> >> And I'm curious why is it obvious that we don't want to revalidate _any_
> >> path component and not just the last one? Why is that generally safe?
> >> Why can't this be used to access files and directories the caller
> >> wouldn't otherwise be able to access? I would like to have this spelled
> >> out for slow people like me, please.
> >> 
> >> From my point of view, this would only be somewhat safe _generally_ if
> >> you'd allow circumvention for revalidation and permission checking if
> >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> >> You'd still mess with overlayfs permission model in this case though.
> >> 
> >> Plus, there are better options of solving this problem. Again, I'd
> >> rather build a separate api for unmounting then playing such potentially
> >> subtle security sensitive games with permission checking during path
> >> lookup.
> > 
> > umount(2) is really a special case because the whole intent is to detach
> > a mount from the local hierarchy and stop using it. The unfortunate bit
> > is that it is a path-based syscall.
> > 
> > So usually we have to:
> > 
> > - determine the path: Maybe stat() it and to validate that it's the
> >   mountpoint we want to drop
> 
> The stat() itself may hang because a remote server, or USB stick is
> inaccessible or having media errors.
> 
> I've just been having a conversation with Karel Zak to change
> umount(1) to use statx() so that it interacts minimally with the fs.

So we're talking about this commit:
https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
and the patch we discussed here:
https://github.com/util-linux/util-linux/issues/2049

> 
> In particular, nfs_getattr() skips revalidate if only minimal attrs
> are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> locally-cached attrs are still valid (STATX_MODE), so this will
> avoid yet one more place that unmount can hang.
> 
> In theory, vfs_getattr() could get all of these attributes directly
> from the vfs_inode in the unmount case.

We don't really need that. As pointed out in that discussion and as
Karel did we just want to pass AT_STATX_DONT_SYNC.

An api that would allow unmounting by mount id can safely check and
retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
without ever syncing with the server.
Christian Brauner April 18, 2023, 8:10 a.m. UTC | #10
On Tue, Apr 18, 2023 at 07:34:14AM +1000, NeilBrown wrote:
> On Tue, 18 Apr 2023, Jeff Layton wrote:
> > 
> > The last thing we want in that case is for the server to decide to
> > change some intermediate dentry in between the two operations. Best
> > case, you'll get back ENOENT or something when the pathwalk fails. Worst
> > case, the server swaps what are two different mountpoints on your client
> > and you unmount the wrong one.
> 
> Actually, the last think I want is for the server to do something that
> causes a directory to be invalidated (d_invalidate()).  Then any mount
> points below there get DETACHed and we lose any change to use MNT_FORCE
> or to wait for the unmount to complete.  Of course this can also happen
> during any other access, not just umount....

Any rmdir/unlink from another mount namespace where the mountpoint isn't
a local mountpoint would lazily unmount the whole mount tree. You can't
guard against this anyway.
Jeff Layton April 20, 2023, 1:05 p.m. UTC | #11
On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
> > 
> > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> > > > And I'm curious why is it obvious that we don't want to revalidate _any_
> > > > path component and not just the last one? Why is that generally safe?
> > > > Why can't this be used to access files and directories the caller
> > > > wouldn't otherwise be able to access? I would like to have this spelled
> > > > out for slow people like me, please.
> > > > 
> > > > From my point of view, this would only be somewhat safe _generally_ if
> > > > you'd allow circumvention for revalidation and permission checking if
> > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> > > > You'd still mess with overlayfs permission model in this case though.
> > > > 
> > > > Plus, there are better options of solving this problem. Again, I'd
> > > > rather build a separate api for unmounting then playing such potentially
> > > > subtle security sensitive games with permission checking during path
> > > > lookup.
> > > 
> > > umount(2) is really a special case because the whole intent is to detach
> > > a mount from the local hierarchy and stop using it. The unfortunate bit
> > > is that it is a path-based syscall.
> > > 
> > > So usually we have to:
> > > 
> > > - determine the path: Maybe stat() it and to validate that it's the
> > >   mountpoint we want to drop
> > 
> > The stat() itself may hang because a remote server, or USB stick is
> > inaccessible or having media errors.
> > 
> > I've just been having a conversation with Karel Zak to change
> > umount(1) to use statx() so that it interacts minimally with the fs.
> 
> So we're talking about this commit:
> https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
> and the patch we discussed here:
> https://github.com/util-linux/util-linux/issues/2049
> 
> > 
> > In particular, nfs_getattr() skips revalidate if only minimal attrs
> > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> > locally-cached attrs are still valid (STATX_MODE), so this will
> > avoid yet one more place that unmount can hang.
> > 
> > In theory, vfs_getattr() could get all of these attributes directly
> > from the vfs_inode in the unmount case.
> 
> We don't really need that. As pointed out in that discussion and as
> Karel did we just want to pass AT_STATX_DONT_SYNC.
> 
> An api that would allow unmounting by mount id can safely check and
> retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
> without ever syncing with the server.

Unfortunately, I don't think that flag trickles down to the ->permission
checks for the pathwalk down to the point you're stat'ing. That turns
out to be a big part of the problem when trying to umount things that
are stuck down in inaccessible trees.

I'm not opposed at all to new APIs that allow for more reliable
unmounting. I think that's a good idea, and something worth hashing out.

But...we're stuck with umount() in perpetuity. Even if you were to merge
a new API for unmounting today, it'll be a decade or more until the
ecosystem catches up. I think we have a responsibility to make the
umount syscall work as well as we can. In this case, that means giving
it as light a footprint as possible on the pathwalk down.

If we need to gate this behavior behind existing or new flags to
umount2(), then so be it, but lets not dismiss this idea in favor of
some new unmounting interface that may never materialize. Improving
umount has the potential to help a lot of our users.
Christian Brauner April 20, 2023, 3:41 p.m. UTC | #12
On Thu, Apr 20, 2023 at 09:05:47AM -0400, Jeff Layton wrote:
> On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote:
> > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
> > > 
> > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> > > > > And I'm curious why is it obvious that we don't want to revalidate _any_
> > > > > path component and not just the last one? Why is that generally safe?
> > > > > Why can't this be used to access files and directories the caller
> > > > > wouldn't otherwise be able to access? I would like to have this spelled
> > > > > out for slow people like me, please.
> > > > > 
> > > > > From my point of view, this would only be somewhat safe _generally_ if
> > > > > you'd allow circumvention for revalidation and permission checking if
> > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> > > > > You'd still mess with overlayfs permission model in this case though.
> > > > > 
> > > > > Plus, there are better options of solving this problem. Again, I'd
> > > > > rather build a separate api for unmounting then playing such potentially
> > > > > subtle security sensitive games with permission checking during path
> > > > > lookup.
> > > > 
> > > > umount(2) is really a special case because the whole intent is to detach
> > > > a mount from the local hierarchy and stop using it. The unfortunate bit
> > > > is that it is a path-based syscall.
> > > > 
> > > > So usually we have to:
> > > > 
> > > > - determine the path: Maybe stat() it and to validate that it's the
> > > >   mountpoint we want to drop
> > > 
> > > The stat() itself may hang because a remote server, or USB stick is
> > > inaccessible or having media errors.
> > > 
> > > I've just been having a conversation with Karel Zak to change
> > > umount(1) to use statx() so that it interacts minimally with the fs.
> > 
> > So we're talking about this commit:
> > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
> > and the patch we discussed here:
> > https://github.com/util-linux/util-linux/issues/2049
> > 
> > > 
> > > In particular, nfs_getattr() skips revalidate if only minimal attrs
> > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> > > locally-cached attrs are still valid (STATX_MODE), so this will
> > > avoid yet one more place that unmount can hang.
> > > 
> > > In theory, vfs_getattr() could get all of these attributes directly
> > > from the vfs_inode in the unmount case.
> > 
> > We don't really need that. As pointed out in that discussion and as
> > Karel did we just want to pass AT_STATX_DONT_SYNC.
> > 
> > An api that would allow unmounting by mount id can safely check and
> > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
> > without ever syncing with the server.
> 
> Unfortunately, I don't think that flag trickles down to the ->permission
> checks for the pathwalk down to the point you're stat'ing. That turns
> out to be a big part of the problem when trying to umount things that
> are stuck down in inaccessible trees.
> 
> I'm not opposed at all to new APIs that allow for more reliable
> unmounting. I think that's a good idea, and something worth hashing out.
> 
> But...we're stuck with umount() in perpetuity. Even if you were to merge
> a new API for unmounting today, it'll be a decade or more until the
> ecosystem catches up. I think we have a responsibility to make the
> umount syscall work as well as we can. In this case, that means giving
> it as light a footprint as possible on the pathwalk down.
> 
> If we need to gate this behavior behind existing or new flags to
> umount2(), then so be it, but lets not dismiss this idea in favor of
> some new unmounting interface that may never materialize. Improving
> umount has the potential to help a lot of our users.

A new flag for an old system call or a new system call doesn't matter in
practice for userspace. And the users that have that specific problem
that's solved by a new interface will use that interface asap. That's
been true for the pidfd api, that's been true for openat2(), clone3()
and others.

Plus, this is a workload specific problem.

And even with or without a new flag it'd still need to be backported to
old kernels.

And just because an interface already exists doesn't mean stuff should
be shoehorned into it just because it's convenient or faster. That's
just asking for maintenance pain down the road.

And from this discussion there's multiple ways to work around this issue
currently so I especially don't see the need to rush any of this and
fiddle around with permission checking.
Al Viro April 20, 2023, 9:35 p.m. UTC | #13
On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:

> MNT_FORCE is, I think, a good idea and a needed functionality that has
> never been implemented well.
> MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> aborts all pending RPCs on that filesystem.

Suppose it happens to be mounted in another namespace as well.  Or bound
at different mountpoint, for that matter.  What should MNT_FORCE do?
NeilBrown April 20, 2023, 10:01 p.m. UTC | #14
On Fri, 21 Apr 2023, Al Viro wrote:
> On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:
> 
> > MNT_FORCE is, I think, a good idea and a needed functionality that has
> > never been implemented well.
> > MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> > aborts all pending RPCs on that filesystem.
> 
> Suppose it happens to be mounted in another namespace as well.  Or bound
> at different mountpoint, for that matter.  What should MNT_FORCE do?
> 

1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall
   that uses the vfsmount (whether from an fd, or found in a path walk,
   or elsewhere), except for close(), to abort with an error;
2/ call ->umount_begin passing in the vfs_mount.  The fs can abort any
   outstanding transaction on any fd from that vfs_mount.   Possibly
   it might instead abort a wait rather than the whole transaction,
   particularly if requests using some other vfs_mount might also be
   interested in the transaction
3/ ->close() on a force-unmount vfs_mount would clean up without
   blocking indefinitely, discarding dirty data if necessary.

This still depends on the application to close all fds that return
errors (and to chdir out of a problematic directory).  But at least it
*allows* applications to do that without requiring that they be killed.

Thanks,
NeilBrown
Al Viro April 20, 2023, 10:27 p.m. UTC | #15
On Fri, Apr 21, 2023 at 08:01:09AM +1000, NeilBrown wrote:
> On Fri, 21 Apr 2023, Al Viro wrote:
> > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:
> > 
> > > MNT_FORCE is, I think, a good idea and a needed functionality that has
> > > never been implemented well.
> > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> > > aborts all pending RPCs on that filesystem.
> > 
> > Suppose it happens to be mounted in another namespace as well.  Or bound
> > at different mountpoint, for that matter.  What should MNT_FORCE do?
> > 
> 
> 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall
>    that uses the vfsmount (whether from an fd, or found in a path walk,
>    or elsewhere), except for close(), to abort with an error;
> 2/ call ->umount_begin passing in the vfs_mount.  The fs can abort any
>    outstanding transaction on any fd from that vfs_mount.

How the hell would you recognize those?
diff mbox series

Patch

diff --git a/fs/namei.c b/fs/namei.c
index edfedfbccaef..f2df1adae7c5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -498,8 +498,8 @@  static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
  *
  * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
  */
-int inode_permission(struct mnt_idmap *idmap,
-		     struct inode *inode, int mask)
+int inode_permission_mp(struct mnt_idmap *idmap,
+			struct inode *inode, int mask, bool mp)
 {
 	int retval;
 
@@ -523,7 +523,14 @@  int inode_permission(struct mnt_idmap *idmap,
 			return -EACCES;
 	}
 
-	retval = do_inode_permission(idmap, inode, mask);
+	if (mp)
+		/* We are looking for a mountpoint and so don't
+		 * want to interact with the filesystem at all, just
+		 * the dcache and icache.
+		 */
+		retval = generic_permission(idmap, inode, mask);
+	else
+		retval = do_inode_permission(idmap, inode, mask);
 	if (retval)
 		return retval;
 
@@ -533,6 +540,24 @@  int inode_permission(struct mnt_idmap *idmap,
 
 	return security_inode_permission(inode, mask);
 }
+
+/**
+ * inode_permission - Check for access rights to a given inode
+ * @idmap:	idmap of the mount the inode was found from
+ * @inode:	Inode to check permission on
+ * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
+ * this, letting us set arbitrary permissions for filesystem access without
+ * changing the "normal" UIDs which are used for other things.
+ *
+ * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ */
+int inode_permission(struct mnt_idmap *idmap,
+		     struct inode *inode, int mask)
+{
+	return inode_permission_mp(idmap, inode, mask, false);
+}
 EXPORT_SYMBOL(inode_permission);
 
 /**
@@ -589,6 +614,7 @@  struct nameidata {
 #define ND_ROOT_PRESET 1
 #define ND_ROOT_GRABBED 2
 #define ND_JUMPED 4
+#define ND_SYS_ADMIN 8
 
 static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
 {
@@ -853,7 +879,8 @@  static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
 
 static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
 {
-	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
+	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
+	    likely(!(flags & LOOKUP_MOUNTPOINT)))
 		return dentry->d_op->d_revalidate(dentry, flags);
 	else
 		return 1;
@@ -1708,12 +1735,17 @@  static struct dentry *lookup_slow(const struct qstr *name,
 static inline int may_lookup(struct mnt_idmap *idmap,
 			     struct nameidata *nd)
 {
+	/* If we are looking for a mountpoint and we have the SYS_ADMIN
+	 * capability, then we will by-pass the filesys for permission checks
+	 * and just use generic_permission().
+	 */
+	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
 	if (nd->flags & LOOKUP_RCU) {
-		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
+		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
 		if (err != -ECHILD || !try_to_unlazy(nd))
 			return err;
 	}
-	return inode_permission(idmap, nd->inode, MAY_EXEC);
+	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
 }
 
 static int reserve_stack(struct nameidata *nd, struct path *link)
@@ -2501,6 +2533,8 @@  int filename_lookup(int dfd, struct filename *name, unsigned flags,
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 	set_nameidata(&nd, dfd, name, root);
+	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
+		nd.state = ND_SYS_ADMIN;
 	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
 	if (unlikely(retval == -ECHILD))
 		retval = path_lookupat(&nd, flags, path);