Message ID | 168168683217.24821.6260957092725278201@noble.neil.brown.name (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. | expand |
On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > engage with underlying systems at all. Any mount point MUST be in the > dcache with a complete direct path from the root to the mountpoint. > That should be sufficient to find the mountpoint given a path name. > > This becomes an issue when the filesystem changes unexpected, such as > when a NFS server is changed to reject all access. It then becomes > impossible to unmount anything mounted on the filesystem which has > changed. We could simply lazy-unmount the changed filesystem and that > will often be sufficient. However if the target filesystem needs > "umount -f" to complete the unmount properly, then the lazy unmount will > leave it incompletely unmounted. When "-f" is needed, we really need to I don't understand this yet. All I see is nfs_umount_begin() that's different for MNT_FORCE to kill remaining io. Why does that preclude MNT_DETACH? You might very well fail MNT_FORCE and the only way you can get rid is to use MNT_DETACH, no? So I don't see why that is an argument. > be able to name the target filesystem. > > We NEVER want to revalidate anything. We already avoid the revalidation > of the mountpoint itself, be we won't need to revalidate anything on the > path either as thay might affect the cache, and the cache is what we are > really looking in. > > Permission checks are a little less clear. We currently allow any user This is all very brittle. First case that comes to mind is overlayfs where the permission checking is performed twice. Once on the overlayfs inode itself based on the caller's security context and a second time on the underlying inode with the security context of the mounter of the overlayfs instance. A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so they'd be able to mount the overlayfs instance but would be restricted from accessing certain directories or files. The task accessing the overlayfs instance however could have a completely different security context including CAP_DAC_READ_SEARCH and such. Both tasks could reasonably be in different user namespaces and so on. The LSM hooks are also called twice and would now also be called once. It also forgets that acl_permission() check may very well call into the filesystem via check_acl(). So umount could either be used to infer existence of files that the user wouldn't otherwise know they exist or in the worst case allow to umount something that they wouldn't have access to. Aside from that this would break userspace assumptions and as Al and I've mentioned before in the other thread you'd need a new flag to umount2() for this. The permission model can't just change behind users back. But I dislike it for the now even more special-cased umount path lookup alone tbh. I'd feel way more comfortable with a non-lookup related solution that doesn't have subtle implications for permission checking. > to make the umount syscall and perform the path lookup and only reject > unprivileged users once the target mount point has been found. If we > completely relax permission checks then an unprivileged user could probe > inaccessible parts of the name space by examining the error returned > from umount(). > > So we only relax permission check when the user has CAP_SYS_ADMIN > (may_mount() succeeds). > > Note that if the path given is not direct and for example uses symlinks > or "..", then dentries or symlink content may not be cached and a remote > server could cause problem. We can only be certain of a safe unmount if > a direct path is used. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 40 insertions(+), 6 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index edfedfbccaef..f2df1adae7c5 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > * > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > */ > -int inode_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > +int inode_permission_mp(struct mnt_idmap *idmap, > + struct inode *inode, int mask, bool mp) > { > int retval; > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > return -EACCES; > } > > - retval = do_inode_permission(idmap, inode, mask); > + if (mp) > + /* We are looking for a mountpoint and so don't > + * want to interact with the filesystem at all, just > + * the dcache and icache. > + */ > + retval = generic_permission(idmap, inode, mask); > + else > + retval = do_inode_permission(idmap, inode, mask); > if (retval) > return retval; > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > return security_inode_permission(inode, mask); > } > + > +/** > + * inode_permission - Check for access rights to a given inode > + * @idmap: idmap of the mount the inode was found from > + * @inode: Inode to check permission on > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > + * > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > + * this, letting us set arbitrary permissions for filesystem access without > + * changing the "normal" UIDs which are used for other things. > + * > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > + */ > +int inode_permission(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + return inode_permission_mp(idmap, inode, mask, false); > +} > EXPORT_SYMBOL(inode_permission); > > /** > @@ -589,6 +614,7 @@ struct nameidata { > #define ND_ROOT_PRESET 1 > #define ND_ROOT_GRABBED 2 > #define ND_JUMPED 4 > +#define ND_SYS_ADMIN 8 > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > { > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > { > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > + likely(!(flags & LOOKUP_MOUNTPOINT))) > return dentry->d_op->d_revalidate(dentry, flags); > else > return 1; > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > static inline int may_lookup(struct mnt_idmap *idmap, > struct nameidata *nd) > { > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > + * capability, then we will by-pass the filesys for permission checks > + * and just use generic_permission(). > + */ > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > if (nd->flags & LOOKUP_RCU) { > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > if (err != -ECHILD || !try_to_unlazy(nd)) > return err; > } > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > if (IS_ERR(name)) > return PTR_ERR(name); > set_nameidata(&nd, dfd, name, root); > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > + nd.state = ND_SYS_ADMIN; > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > if (unlikely(retval == -ECHILD)) > retval = path_lookupat(&nd, flags, path); > -- > 2.40.0 >
On Mon, 2023-04-17 at 09:13 +1000, NeilBrown wrote: > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > engage with underlying systems at all. Any mount point MUST be in the > dcache with a complete direct path from the root to the mountpoint. > That should be sufficient to find the mountpoint given a path name. > > This becomes an issue when the filesystem changes unexpected, such as > when a NFS server is changed to reject all access. It then becomes > impossible to unmount anything mounted on the filesystem which has > changed. We could simply lazy-unmount the changed filesystem and that > will often be sufficient. However if the target filesystem needs > "umount -f" to complete the unmount properly, then the lazy unmount will > leave it incompletely unmounted. When "-f" is needed, we really need to > be able to name the target filesyste > > We NEVER want to revalidate anything. We already avoid the revalidation > of the mountpoint itself, be we won't need to revalidate anything on the > path either as thay might affect the cache, and the cache is what we are > really looking in. > > Permission checks are a little less clear. We currently allow any user > to make the umount syscall and perform the path lookup and only reject > unprivileged users once the target mount point has been found. If we > completely relax permission checks then an unprivileged user could probe > inaccessible parts of the name space by examining the error returned > from umount(). > That sounds pretty reasonable. Most umounts are done by root in some fashion anyway. > So we only relax permission check when the user has CAP_SYS_ADMIN > (may_mount() succeeds). > > Note that if the path given is not direct and for example uses symlinks > or "..", then dentries or symlink content may not be cached and a remote > server could cause problem. We can only be certain of a safe unmount if > a direct path is used. > I guess we do still have to allow it to do a lookup due to symlinks. I think this is still worthwhile though since it'd fix a lot of these cases. > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 40 insertions(+), 6 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index edfedfbccaef..f2df1adae7c5 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > * > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > */ > -int inode_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > +int inode_permission_mp(struct mnt_idmap *idmap, > + struct inode *inode, int mask, bool mp) > { > int retval; > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > return -EACCES; > } > > - retval = do_inode_permission(idmap, inode, mask); > + if (mp) > + /* We are looking for a mountpoint and so don't > + * want to interact with the filesystem at all, just > + * the dcache and icache. > + */ > + retval = generic_permission(idmap, inode, mask); > + else > + retval = do_inode_permission(idmap, inode, mask); > if (retval) > return retval; > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > return security_inode_permission(inode, mask); > } > + > +/** > + * inode_permission - Check for access rights to a given inode > + * @idmap: idmap of the mount the inode was found from > + * @inode: Inode to check permission on > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > + * > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > + * this, letting us set arbitrary permissions for filesystem access without > + * changing the "normal" UIDs which are used for other things. > + * > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > + */ > +int inode_permission(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + return inode_permission_mp(idmap, inode, mask, false); > +} > EXPORT_SYMBOL(inode_permission); > > /** > @@ -589,6 +614,7 @@ struct nameidata { > #define ND_ROOT_PRESET 1 > #define ND_ROOT_GRABBED 2 > #define ND_JUMPED 4 > +#define ND_SYS_ADMIN 8 > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > { > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > { > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > + likely(!(flags & LOOKUP_MOUNTPOINT))) > return dentry->d_op->d_revalidate(dentry, flags); > else > return 1; > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > static inline int may_lookup(struct mnt_idmap *idmap, > struct nameidata *nd) > { > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > + * capability, then we will by-pass the filesys for permission checks > + * and just use generic_permission(). > + */ > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > if (nd->flags & LOOKUP_RCU) { > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > if (err != -ECHILD || !try_to_unlazy(nd)) > return err; > } > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > if (IS_ERR(name)) > return PTR_ERR(name); > set_nameidata(&nd, dfd, name, root); > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > + nd.state = ND_SYS_ADMIN; > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > if (unlikely(retval == -ECHILD)) > retval = path_lookupat(&nd, flags, path); This behavior looks right along the lines of what I was thinking. Just for bisectability, it might be worthwhile to break this up along conceptual lines: one patch to make it skip d_revalidate, one that changes the permission checking, etc. I'll plan to give this a try soon with Dave's reproducer. Thanks!
On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > engage with underlying systems at all. Any mount point MUST be in the > > dcache with a complete direct path from the root to the mountpoint. > > That should be sufficient to find the mountpoint given a path name. > > > > This becomes an issue when the filesystem changes unexpected, such as > > when a NFS server is changed to reject all access. It then becomes > > impossible to unmount anything mounted on the filesystem which has > > changed. We could simply lazy-unmount the changed filesystem and that > > will often be sufficient. However if the target filesystem needs > > "umount -f" to complete the unmount properly, then the lazy unmount will > > leave it incompletely unmounted. When "-f" is needed, we really need to > > I don't understand this yet. All I see is nfs_umount_begin() that's > different for MNT_FORCE to kill remaining io. Why does that preclude > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > get rid is to use MNT_DETACH, no? So I don't see why that is an > argument. > > > be able to name the target filesystem. > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > of the mountpoint itself, be we won't need to revalidate anything on the > > path either as thay might affect the cache, and the cache is what we are > > really looking in. > > > > Permission checks are a little less clear. We currently allow any user > > This is all very brittle. > > First case that comes to mind is overlayfs where the permission checking > is performed twice. Once on the overlayfs inode itself based on the > caller's security context and a second time on the underlying inode with > the security context of the mounter of the overlayfs instance. > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > they'd be able to mount the overlayfs instance but would be restricted > from accessing certain directories or files. The task accessing the > overlayfs instance however could have a completely different security > context including CAP_DAC_READ_SEARCH and such. Both tasks could > reasonably be in different user namespaces and so on. > > The LSM hooks are also called twice and would now also be called once. > > It also forgets that acl_permission() check may very well call into the > filesystem via check_acl(). > > So umount could either be used to infer existence of files that the user > wouldn't otherwise know they exist or in the worst case allow to umount > something that they wouldn't have access to. > > Aside from that this would break userspace assumptions and as Al and > I've mentioned before in the other thread you'd need a new flag to > umount2() for this. The permission model can't just change behind users > back. > > But I dislike it for the now even more special-cased umount path lookup > alone tbh. I'd feel way more comfortable with a non-lookup related > solution that doesn't have subtle implications for permission checking. > These are good points. One way around the issues you point out might be to pass down a new MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the filesystem driver to decide whether it wants to avoid potentially problematic activity when checking permissions. nfs_permission could check for that and take a more hands-off approach to the permissions check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I think that might do what we need. > > to make the umount syscall and perform the path lookup and only reject > > unprivileged users once the target mount point has been found. If we > > completely relax permission checks then an unprivileged user could probe > > inaccessible parts of the name space by examining the error returned > > from umount(). > > > > So we only relax permission check when the user has CAP_SYS_ADMIN > > (may_mount() succeeds). > > > > Note that if the path given is not direct and for example uses symlinks > > or "..", then dentries or symlink content may not be cached and a remote > > server could cause problem. We can only be certain of a safe unmount if > > a direct path is used. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index edfedfbccaef..f2df1adae7c5 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > > * > > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > */ > > -int inode_permission(struct mnt_idmap *idmap, > > - struct inode *inode, int mask) > > +int inode_permission_mp(struct mnt_idmap *idmap, > > + struct inode *inode, int mask, bool mp) > > { > > int retval; > > > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > > return -EACCES; > > } > > > > - retval = do_inode_permission(idmap, inode, mask); > > + if (mp) > > + /* We are looking for a mountpoint and so don't > > + * want to interact with the filesystem at all, just > > + * the dcache and icache. > > + */ > > + retval = generic_permission(idmap, inode, mask); > > + else > > + retval = do_inode_permission(idmap, inode, mask); > > if (retval) > > return retval; > > > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > > > return security_inode_permission(inode, mask); > > } > > + > > +/** > > + * inode_permission - Check for access rights to a given inode > > + * @idmap: idmap of the mount the inode was found from > > + * @inode: Inode to check permission on > > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > > + * > > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > > + * this, letting us set arbitrary permissions for filesystem access without > > + * changing the "normal" UIDs which are used for other things. > > + * > > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > + */ > > +int inode_permission(struct mnt_idmap *idmap, > > + struct inode *inode, int mask) > > +{ > > + return inode_permission_mp(idmap, inode, mask, false); > > +} > > EXPORT_SYMBOL(inode_permission); > > > > /** > > @@ -589,6 +614,7 @@ struct nameidata { > > #define ND_ROOT_PRESET 1 > > #define ND_ROOT_GRABBED 2 > > #define ND_JUMPED 4 > > +#define ND_SYS_ADMIN 8 > > > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > > { > > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > > { > > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > > + likely(!(flags & LOOKUP_MOUNTPOINT))) > > return dentry->d_op->d_revalidate(dentry, flags); > > else > > return 1; > > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > > static inline int may_lookup(struct mnt_idmap *idmap, > > struct nameidata *nd) > > { > > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > > + * capability, then we will by-pass the filesys for permission checks > > + * and just use generic_permission(). > > + */ > > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > > if (nd->flags & LOOKUP_RCU) { > > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > > if (err != -ECHILD || !try_to_unlazy(nd)) > > return err; > > } > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > > } > > > > static int reserve_stack(struct nameidata *nd, struct path *link) > > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > > if (IS_ERR(name)) > > return PTR_ERR(name); > > set_nameidata(&nd, dfd, name, root); > > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > > + nd.state = ND_SYS_ADMIN; > > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > > if (unlikely(retval == -ECHILD)) > > retval = path_lookupat(&nd, flags, path); > > -- > > 2.40.0 > >
On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote: > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > > engage with underlying systems at all. Any mount point MUST be in the > > > dcache with a complete direct path from the root to the mountpoint. > > > That should be sufficient to find the mountpoint given a path name. > > > > > > This becomes an issue when the filesystem changes unexpected, such as > > > when a NFS server is changed to reject all access. It then becomes > > > impossible to unmount anything mounted on the filesystem which has > > > changed. We could simply lazy-unmount the changed filesystem and that > > > will often be sufficient. However if the target filesystem needs > > > "umount -f" to complete the unmount properly, then the lazy unmount will > > > leave it incompletely unmounted. When "-f" is needed, we really need to > > > > I don't understand this yet. All I see is nfs_umount_begin() that's > > different for MNT_FORCE to kill remaining io. Why does that preclude > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > > get rid is to use MNT_DETACH, no? So I don't see why that is an > > argument. > > > > > be able to name the target filesystem. > > > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > > of the mountpoint itself, be we won't need to revalidate anything on the > > > path either as thay might affect the cache, and the cache is what we are > > > really looking in. > > > > > > Permission checks are a little less clear. We currently allow any user > > > > This is all very brittle. > > > > First case that comes to mind is overlayfs where the permission checking > > is performed twice. Once on the overlayfs inode itself based on the > > caller's security context and a second time on the underlying inode with > > the security context of the mounter of the overlayfs instance. > > > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > > they'd be able to mount the overlayfs instance but would be restricted > > from accessing certain directories or files. The task accessing the > > overlayfs instance however could have a completely different security > > context including CAP_DAC_READ_SEARCH and such. Both tasks could > > reasonably be in different user namespaces and so on. > > > > The LSM hooks are also called twice and would now also be called once. > > > > It also forgets that acl_permission() check may very well call into the > > filesystem via check_acl(). > > > > So umount could either be used to infer existence of files that the user > > wouldn't otherwise know they exist or in the worst case allow to umount > > something that they wouldn't have access to. > > > > Aside from that this would break userspace assumptions and as Al and > > I've mentioned before in the other thread you'd need a new flag to > > umount2() for this. The permission model can't just change behind users > > back. > > > > But I dislike it for the now even more special-cased umount path lookup > > alone tbh. I'd feel way more comfortable with a non-lookup related > > solution that doesn't have subtle implications for permission checking. > > > > These are good points. > > One way around the issues you point out might be to pass down a new > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the > filesystem driver to decide whether it wants to avoid potentially > problematic activity when checking permissions. nfs_permission could > check for that and take a more hands-off approach to the permissions > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I > think that might do what we need. Yes, that's pretty obvious. I considered that, wrote the section and deleted it again because I still find this pretty ugly. It does leak very specific lookup information into filesystems that isn't generically useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with a server like NFS does and so don't suffer from this issue. The crucial change in the patchset in its current form is that you're requesting from the VFS to significantly alter permission checking just because this is a umount request in a pretty fundamental way for roughly 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS can't just skip a filesystem's ->permission() handler with "well, it's just on a way to a umount so whatever". That's just not going to be correct and is just a source of subtle security bugs. So NAK on that. And I'm curious why is it obvious that we don't want to revalidate _any_ path component and not just the last one? Why is that generally safe? Why can't this be used to access files and directories the caller wouldn't otherwise be able to access? I would like to have this spelled out for slow people like me, please. From my point of view, this would only be somewhat safe _generally_ if you'd allow circumvention for revalidation and permission checking if MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). You'd still mess with overlayfs permission model in this case though. Plus, there are better options of solving this problem. Again, I'd rather build a separate api for unmounting then playing such potentially subtle security sensitive games with permission checking during path lookup.
On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote: > > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > > > engage with underlying systems at all. Any mount point MUST be in the > > > > dcache with a complete direct path from the root to the mountpoint. > > > > That should be sufficient to find the mountpoint given a path name. > > > > > > > > This becomes an issue when the filesystem changes unexpected, such as > > > > when a NFS server is changed to reject all access. It then becomes > > > > impossible to unmount anything mounted on the filesystem which has > > > > changed. We could simply lazy-unmount the changed filesystem and that > > > > will often be sufficient. However if the target filesystem needs > > > > "umount -f" to complete the unmount properly, then the lazy unmount will > > > > leave it incompletely unmounted. When "-f" is needed, we really need to > > > > > > I don't understand this yet. All I see is nfs_umount_begin() that's > > > different for MNT_FORCE to kill remaining io. Why does that preclude > > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > > > get rid is to use MNT_DETACH, no? So I don't see why that is an > > > argument. > > > > > > > be able to name the target filesystem. > > > > > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > > > of the mountpoint itself, be we won't need to revalidate anything on the > > > > path either as thay might affect the cache, and the cache is what we are > > > > really looking in. > > > > > > > > Permission checks are a little less clear. We currently allow any user > > > > > > This is all very brittle. > > > > > > First case that comes to mind is overlayfs where the permission checking > > > is performed twice. Once on the overlayfs inode itself based on the > > > caller's security context and a second time on the underlying inode with > > > the security context of the mounter of the overlayfs instance. > > > > > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > > > they'd be able to mount the overlayfs instance but would be restricted > > > from accessing certain directories or files. The task accessing the > > > overlayfs instance however could have a completely different security > > > context including CAP_DAC_READ_SEARCH and such. Both tasks could > > > reasonably be in different user namespaces and so on. > > > > > > The LSM hooks are also called twice and would now also be called once. > > > > > > It also forgets that acl_permission() check may very well call into the > > > filesystem via check_acl(). > > > > > > So umount could either be used to infer existence of files that the user > > > wouldn't otherwise know they exist or in the worst case allow to umount > > > something that they wouldn't have access to. > > > > > > Aside from that this would break userspace assumptions and as Al and > > > I've mentioned before in the other thread you'd need a new flag to > > > umount2() for this. The permission model can't just change behind users > > > back. > > > > > > But I dislike it for the now even more special-cased umount path lookup > > > alone tbh. I'd feel way more comfortable with a non-lookup related > > > solution that doesn't have subtle implications for permission checking. > > > > > > > These are good points. > > > > One way around the issues you point out might be to pass down a new > > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the > > filesystem driver to decide whether it wants to avoid potentially > > problematic activity when checking permissions. nfs_permission could > > check for that and take a more hands-off approach to the permissions > > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I > > think that might do what we need. > > Yes, that's pretty obvious. I considered that, wrote the section and > deleted it again because I still find this pretty ugly. It does leak > very specific lookup information into filesystems that isn't generically > useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with > a server like NFS does and so don't suffer from this issue. > Sort of. Most of the MAY flags cover a specific operation. In this case, we'd just be adding a flag to make it clear that this permission check is for the specific purpose of chasing down a mountpoint (presumably to umount). This field does look less like a "mask" field now though, and more like a "flags". > The crucial change in the patchset in its current form is that you're > requesting from the VFS to significantly alter permission checking just > because this is a umount request in a pretty fundamental way for roughly > 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS > can't just skip a filesystem's ->permission() handler with "well, it's > just on a way to a umount so whatever". That's just not going to be > correct and is just a source of subtle security bugs. So NAK on that. > Fair enough. I too think the permission check ought to be left up to the filesystem driver. It does need some way to know that the permission check is in the context of a LOOKUP_MOUNTPOINT pathwalk though. A MAY_* flag seems like the obvious choice, since we could use that for checking ACLs too, but maybe there is some better way. > And I'm curious why is it obvious that we don't want to revalidate _any_ > path component and not just the last one? Why is that generally safe? > Why can't this be used to access files and directories the caller > wouldn't otherwise be able to access? I would like to have this spelled > out for slow people like me, please. > > From my point of view, this would only be somewhat safe _generally_ if > you'd allow circumvention for revalidation and permission checking if > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > You'd still mess with overlayfs permission model in this case though. > > Plus, there are better options of solving this problem. Again, I'd > rather build a separate api for unmounting then playing such potentially > subtle security sensitive games with permission checking during path > lookup. umount(2) is really a special case because the whole intent is to detach a mount from the local hierarchy and stop using it. The unfortunate bit is that it is a path-based syscall. So usually we have to: - determine the path: Maybe stat() it and to validate that it's the mountpoint we want to drop - then call umount with that path The last thing we want in that case is for the server to decide to change some intermediate dentry in between the two operations. Best case, you'll get back ENOENT or something when the pathwalk fails. Worst case, the server swaps what are two different mountpoints on your client and you unmount the wrong one. If we don't revaliate, then we're no worse off, and may be better off if something hinky happens to the server of an intermediate dentry in the path.
On Mon, 17 Apr 2023, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > engage with underlying systems at all. Any mount point MUST be in the > > dcache with a complete direct path from the root to the mountpoint. > > That should be sufficient to find the mountpoint given a path name. > > > > This becomes an issue when the filesystem changes unexpected, such as > > when a NFS server is changed to reject all access. It then becomes > > impossible to unmount anything mounted on the filesystem which has > > changed. We could simply lazy-unmount the changed filesystem and that > > will often be sufficient. However if the target filesystem needs > > "umount -f" to complete the unmount properly, then the lazy unmount will > > leave it incompletely unmounted. When "-f" is needed, we really need to > > I don't understand this yet. All I see is nfs_umount_begin() that's > different for MNT_FORCE to kill remaining io. Why does that preclude > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > get rid is to use MNT_DETACH, no? So I don't see why that is an > argument. Possibly I could have been more clear. If /foo is the mount point for a filesystem that is causing problems and /foo/bar is the mount point of a different filesystem, which might have other problems, then I was saying that the simple solution of umount -l /foo which would MNT_DETACH both filesystems is not ideal because there might be pending IO that will never complete, so the mount can never be cleaned up. In such cases we really want to be able to do umount -f /foo/bar and ensure that succeeds. You can see now that it isn't that MNT_FORCE precludes MNT_DETACH, but they would be done on different filesystems. MNT_FORCE is, I think, a good idea and a needed functionality that has never been implemented well. MNT_FORCE causes nfs_umount_begin to be called as you noted, which aborts all pending RPCs on that filesystem. This might be useful if you have an "ls" running, as it is likely to abort on the first getdents() failure. But if you have "ls -l' running, it will likely just go and try another lstat(), and that is just as likely to hang - thus still preventing the umount. What we used to do was find anything using the mount and kill it (sometime "fuser -k" would work). These processes would likely be stuck in a D wait, but "umount -f" would abort the RPCs and the processes would get unstuck and could respond to the signal. This would sometimes take a couple of iterations. These days we have TASK_KILLABLE so this is much less likely to be needed - we just kill all the processes and they die promptly. Most of the time. There are a few places that can still block, but which no-one has had the motivation to fix. It would be much nicer if we didn't need to kill things. Even with "fuser -k" it isn't really the sort of thing you want in a shutdown script (or in systemd-shutdown). It would be ideal if the shutdown process could call "umount" and if that fails with EBUSY, call "umount -f" and be confident of .... something. Currently "-f" might inspire hope, but not confidence. We cannot realistically make MNT_FORCE truly force a umount because processes really need to die before that happens. But we could ensure that the filesystem is quiescent and stays that way. Maybe we could add a flag to 'struct vfsmount' to say that "unmount has been forced". Any attempt to use a fd with such a vfsmnt would fail, expect close() and anything similar. Maybe nfs_umount_begin could iterate all open files on that vfsmnt and purge any cached data so no background writes were needed. I think this would be very close to what we needed. Then systemd could run umount() and if that failed then umount(MNT_FORCE) and if that fails umount(MNT_DETACH), with confidence that there would be not more RPCs generates, so it would be safe to turn off the network. BTW, that is another reason that just doing "umount -l /foo" is not ideal. We sometimes want to wait for the umount to complete, so we can remove the backing store or the network. "umount -l /foo" doesn't let us wait. "umount /foo/bar" does. > > > be able to name the target filesystem. > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > of the mountpoint itself, be we won't need to revalidate anything on the > > path either as thay might affect the cache, and the cache is what we are > > really looking in. > > > > Permission checks are a little less clear. We currently allow any user > > This is all very brittle. > > First case that comes to mind is overlayfs where the permission checking > is performed twice. Once on the overlayfs inode itself based on the > caller's security context and a second time on the underlying inode with > the security context of the mounter of the overlayfs instance. > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > they'd be able to mount the overlayfs instance but would be restricted > from accessing certain directories or files. The task accessing the > overlayfs instance however could have a completely different security > context including CAP_DAC_READ_SEARCH and such. Both tasks could > reasonably be in different user namespaces and so on. > > The LSM hooks are also called twice and would now also be called once. I'd prefer "called never" > > It also forgets that acl_permission() check may very well call into the > filesystem via check_acl(). I didn't pay proper attention to acl_permission() - you are right. > > So umount could either be used to infer existence of files that the user > wouldn't otherwise know they exist or in the worst case allow to umount > something that they wouldn't have access to. > > Aside from that this would break userspace assumptions and as Al and > I've mentioned before in the other thread you'd need a new flag to > umount2() for this. The permission model can't just change behind users > back. > > But I dislike it for the now even more special-cased umount path lookup > alone tbh. I'd feel way more comfortable with a non-lookup related > solution that doesn't have subtle implications for permission checking. I was really hoping that existing code could be made to just work. Thanks for the review. NeilBrown > > > to make the umount syscall and perform the path lookup and only reject > > unprivileged users once the target mount point has been found. If we > > completely relax permission checks then an unprivileged user could probe > > inaccessible parts of the name space by examining the error returned > > from umount(). > > > > So we only relax permission check when the user has CAP_SYS_ADMIN > > (may_mount() succeeds). > > > > Note that if the path given is not direct and for example uses symlinks > > or "..", then dentries or symlink content may not be cached and a remote > > server could cause problem. We can only be certain of a safe unmount if > > a direct path is used. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index edfedfbccaef..f2df1adae7c5 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > > * > > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > */ > > -int inode_permission(struct mnt_idmap *idmap, > > - struct inode *inode, int mask) > > +int inode_permission_mp(struct mnt_idmap *idmap, > > + struct inode *inode, int mask, bool mp) > > { > > int retval; > > > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > > return -EACCES; > > } > > > > - retval = do_inode_permission(idmap, inode, mask); > > + if (mp) > > + /* We are looking for a mountpoint and so don't > > + * want to interact with the filesystem at all, just > > + * the dcache and icache. > > + */ > > + retval = generic_permission(idmap, inode, mask); > > + else > > + retval = do_inode_permission(idmap, inode, mask); > > if (retval) > > return retval; > > > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > > > return security_inode_permission(inode, mask); > > } > > + > > +/** > > + * inode_permission - Check for access rights to a given inode > > + * @idmap: idmap of the mount the inode was found from > > + * @inode: Inode to check permission on > > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > > + * > > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > > + * this, letting us set arbitrary permissions for filesystem access without > > + * changing the "normal" UIDs which are used for other things. > > + * > > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > + */ > > +int inode_permission(struct mnt_idmap *idmap, > > + struct inode *inode, int mask) > > +{ > > + return inode_permission_mp(idmap, inode, mask, false); > > +} > > EXPORT_SYMBOL(inode_permission); > > > > /** > > @@ -589,6 +614,7 @@ struct nameidata { > > #define ND_ROOT_PRESET 1 > > #define ND_ROOT_GRABBED 2 > > #define ND_JUMPED 4 > > +#define ND_SYS_ADMIN 8 > > > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > > { > > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > > { > > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > > + likely(!(flags & LOOKUP_MOUNTPOINT))) > > return dentry->d_op->d_revalidate(dentry, flags); > > else > > return 1; > > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > > static inline int may_lookup(struct mnt_idmap *idmap, > > struct nameidata *nd) > > { > > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > > + * capability, then we will by-pass the filesys for permission checks > > + * and just use generic_permission(). > > + */ > > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > > if (nd->flags & LOOKUP_RCU) { > > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > > if (err != -ECHILD || !try_to_unlazy(nd)) > > return err; > > } > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > > } > > > > static int reserve_stack(struct nameidata *nd, struct path *link) > > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > > if (IS_ERR(name)) > > return PTR_ERR(name); > > set_nameidata(&nd, dfd, name, root); > > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > > + nd.state = ND_SYS_ADMIN; > > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > > if (unlikely(retval == -ECHILD)) > > retval = path_lookupat(&nd, flags, path); > > -- > > 2.40.0 > > >
On Tue, 18 Apr 2023, Jeff Layton wrote: > > The last thing we want in that case is for the server to decide to > change some intermediate dentry in between the two operations. Best > case, you'll get back ENOENT or something when the pathwalk fails. Worst > case, the server swaps what are two different mountpoints on your client > and you unmount the wrong one. Actually, the last think I want is for the server to do something that causes a directory to be invalidated (d_invalidate()). Then any mount points below there get DETACHed and we lose any change to use MNT_FORCE or to wait for the unmount to complete. Of course this can also happen during any other access, not just umount.... > > If we don't revaliate, then we're no worse off, and may be better off if > something hinky happens to the server of an intermediate dentry in the > path. Exactly. If we don't revalidate we might use old data, but it will be old data that were were allowed to access. It might be data that we are not now allowed to access, but it will never be new data that were have never been allowed to access. Thanks, NeilBrown
> On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: >> And I'm curious why is it obvious that we don't want to revalidate _any_ >> path component and not just the last one? Why is that generally safe? >> Why can't this be used to access files and directories the caller >> wouldn't otherwise be able to access? I would like to have this spelled >> out for slow people like me, please. >> >> From my point of view, this would only be somewhat safe _generally_ if >> you'd allow circumvention for revalidation and permission checking if >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). >> You'd still mess with overlayfs permission model in this case though. >> >> Plus, there are better options of solving this problem. Again, I'd >> rather build a separate api for unmounting then playing such potentially >> subtle security sensitive games with permission checking during path >> lookup. > > umount(2) is really a special case because the whole intent is to detach > a mount from the local hierarchy and stop using it. The unfortunate bit > is that it is a path-based syscall. > > So usually we have to: > > - determine the path: Maybe stat() it and to validate that it's the > mountpoint we want to drop The stat() itself may hang because a remote server, or USB stick is inaccessible or having media errors. I've just been having a conversation with Karel Zak to change umount(1) to use statx() so that it interacts minimally with the fs. In particular, nfs_getattr() skips revalidate if only minimal attrs are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if locally-cached attrs are still valid (STATX_MODE), so this will avoid yet one more place that unmount can hang. In theory, vfs_getattr() could get all of these attributes directly from the vfs_inode in the unmount case. > - then call umount with that path > > The last thing we want in that case is for the server to decide to > change some intermediate dentry in between the two operations. Best > case, you'll get back ENOENT or something when the pathwalk fails. Worst > case, the server swaps what are two different mountpoints on your client > and you unmount the wrong one. > > If we don't revaliate, then we're no worse off, and may be better off if > something hinky happens to the server of an intermediate dentry in the > path. > -- > Jeff Layton <jlayton@kernel.org> Cheers, Andreas
On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > >> And I'm curious why is it obvious that we don't want to revalidate _any_ > >> path component and not just the last one? Why is that generally safe? > >> Why can't this be used to access files and directories the caller > >> wouldn't otherwise be able to access? I would like to have this spelled > >> out for slow people like me, please. > >> > >> From my point of view, this would only be somewhat safe _generally_ if > >> you'd allow circumvention for revalidation and permission checking if > >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > >> You'd still mess with overlayfs permission model in this case though. > >> > >> Plus, there are better options of solving this problem. Again, I'd > >> rather build a separate api for unmounting then playing such potentially > >> subtle security sensitive games with permission checking during path > >> lookup. > > > > umount(2) is really a special case because the whole intent is to detach > > a mount from the local hierarchy and stop using it. The unfortunate bit > > is that it is a path-based syscall. > > > > So usually we have to: > > > > - determine the path: Maybe stat() it and to validate that it's the > > mountpoint we want to drop > > The stat() itself may hang because a remote server, or USB stick is > inaccessible or having media errors. > > I've just been having a conversation with Karel Zak to change > umount(1) to use statx() so that it interacts minimally with the fs. So we're talking about this commit: https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f and the patch we discussed here: https://github.com/util-linux/util-linux/issues/2049 > > In particular, nfs_getattr() skips revalidate if only minimal attrs > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > locally-cached attrs are still valid (STATX_MODE), so this will > avoid yet one more place that unmount can hang. > > In theory, vfs_getattr() could get all of these attributes directly > from the vfs_inode in the unmount case. We don't really need that. As pointed out in that discussion and as Karel did we just want to pass AT_STATX_DONT_SYNC. An api that would allow unmounting by mount id can safely check and retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID without ever syncing with the server.
On Tue, Apr 18, 2023 at 07:34:14AM +1000, NeilBrown wrote: > On Tue, 18 Apr 2023, Jeff Layton wrote: > > > > The last thing we want in that case is for the server to decide to > > change some intermediate dentry in between the two operations. Best > > case, you'll get back ENOENT or something when the pathwalk fails. Worst > > case, the server swaps what are two different mountpoints on your client > > and you unmount the wrong one. > > Actually, the last think I want is for the server to do something that > causes a directory to be invalidated (d_invalidate()). Then any mount > points below there get DETACHed and we lose any change to use MNT_FORCE > or to wait for the unmount to complete. Of course this can also happen > during any other access, not just umount.... Any rmdir/unlink from another mount namespace where the mountpoint isn't a local mountpoint would lazily unmount the whole mount tree. You can't guard against this anyway.
On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > > > > And I'm curious why is it obvious that we don't want to revalidate _any_ > > > > path component and not just the last one? Why is that generally safe? > > > > Why can't this be used to access files and directories the caller > > > > wouldn't otherwise be able to access? I would like to have this spelled > > > > out for slow people like me, please. > > > > > > > > From my point of view, this would only be somewhat safe _generally_ if > > > > you'd allow circumvention for revalidation and permission checking if > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > > > > You'd still mess with overlayfs permission model in this case though. > > > > > > > > Plus, there are better options of solving this problem. Again, I'd > > > > rather build a separate api for unmounting then playing such potentially > > > > subtle security sensitive games with permission checking during path > > > > lookup. > > > > > > umount(2) is really a special case because the whole intent is to detach > > > a mount from the local hierarchy and stop using it. The unfortunate bit > > > is that it is a path-based syscall. > > > > > > So usually we have to: > > > > > > - determine the path: Maybe stat() it and to validate that it's the > > > mountpoint we want to drop > > > > The stat() itself may hang because a remote server, or USB stick is > > inaccessible or having media errors. > > > > I've just been having a conversation with Karel Zak to change > > umount(1) to use statx() so that it interacts minimally with the fs. > > So we're talking about this commit: > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f > and the patch we discussed here: > https://github.com/util-linux/util-linux/issues/2049 > > > > > In particular, nfs_getattr() skips revalidate if only minimal attrs > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > > locally-cached attrs are still valid (STATX_MODE), so this will > > avoid yet one more place that unmount can hang. > > > > In theory, vfs_getattr() could get all of these attributes directly > > from the vfs_inode in the unmount case. > > We don't really need that. As pointed out in that discussion and as > Karel did we just want to pass AT_STATX_DONT_SYNC. > > An api that would allow unmounting by mount id can safely check and > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID > without ever syncing with the server. Unfortunately, I don't think that flag trickles down to the ->permission checks for the pathwalk down to the point you're stat'ing. That turns out to be a big part of the problem when trying to umount things that are stuck down in inaccessible trees. I'm not opposed at all to new APIs that allow for more reliable unmounting. I think that's a good idea, and something worth hashing out. But...we're stuck with umount() in perpetuity. Even if you were to merge a new API for unmounting today, it'll be a decade or more until the ecosystem catches up. I think we have a responsibility to make the umount syscall work as well as we can. In this case, that means giving it as light a footprint as possible on the pathwalk down. If we need to gate this behavior behind existing or new flags to umount2(), then so be it, but lets not dismiss this idea in favor of some new unmounting interface that may never materialize. Improving umount has the potential to help a lot of our users.
On Thu, Apr 20, 2023 at 09:05:47AM -0400, Jeff Layton wrote: > On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote: > > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > > > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > > > > > And I'm curious why is it obvious that we don't want to revalidate _any_ > > > > > path component and not just the last one? Why is that generally safe? > > > > > Why can't this be used to access files and directories the caller > > > > > wouldn't otherwise be able to access? I would like to have this spelled > > > > > out for slow people like me, please. > > > > > > > > > > From my point of view, this would only be somewhat safe _generally_ if > > > > > you'd allow circumvention for revalidation and permission checking if > > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > > > > > You'd still mess with overlayfs permission model in this case though. > > > > > > > > > > Plus, there are better options of solving this problem. Again, I'd > > > > > rather build a separate api for unmounting then playing such potentially > > > > > subtle security sensitive games with permission checking during path > > > > > lookup. > > > > > > > > umount(2) is really a special case because the whole intent is to detach > > > > a mount from the local hierarchy and stop using it. The unfortunate bit > > > > is that it is a path-based syscall. > > > > > > > > So usually we have to: > > > > > > > > - determine the path: Maybe stat() it and to validate that it's the > > > > mountpoint we want to drop > > > > > > The stat() itself may hang because a remote server, or USB stick is > > > inaccessible or having media errors. > > > > > > I've just been having a conversation with Karel Zak to change > > > umount(1) to use statx() so that it interacts minimally with the fs. > > > > So we're talking about this commit: > > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f > > and the patch we discussed here: > > https://github.com/util-linux/util-linux/issues/2049 > > > > > > > > In particular, nfs_getattr() skips revalidate if only minimal attrs > > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > > > locally-cached attrs are still valid (STATX_MODE), so this will > > > avoid yet one more place that unmount can hang. > > > > > > In theory, vfs_getattr() could get all of these attributes directly > > > from the vfs_inode in the unmount case. > > > > We don't really need that. As pointed out in that discussion and as > > Karel did we just want to pass AT_STATX_DONT_SYNC. > > > > An api that would allow unmounting by mount id can safely check and > > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID > > without ever syncing with the server. > > Unfortunately, I don't think that flag trickles down to the ->permission > checks for the pathwalk down to the point you're stat'ing. That turns > out to be a big part of the problem when trying to umount things that > are stuck down in inaccessible trees. > > I'm not opposed at all to new APIs that allow for more reliable > unmounting. I think that's a good idea, and something worth hashing out. > > But...we're stuck with umount() in perpetuity. Even if you were to merge > a new API for unmounting today, it'll be a decade or more until the > ecosystem catches up. I think we have a responsibility to make the > umount syscall work as well as we can. In this case, that means giving > it as light a footprint as possible on the pathwalk down. > > If we need to gate this behavior behind existing or new flags to > umount2(), then so be it, but lets not dismiss this idea in favor of > some new unmounting interface that may never materialize. Improving > umount has the potential to help a lot of our users. A new flag for an old system call or a new system call doesn't matter in practice for userspace. And the users that have that specific problem that's solved by a new interface will use that interface asap. That's been true for the pidfd api, that's been true for openat2(), clone3() and others. Plus, this is a workload specific problem. And even with or without a new flag it'd still need to be backported to old kernels. And just because an interface already exists doesn't mean stuff should be shoehorned into it just because it's convenient or faster. That's just asking for maintenance pain down the road. And from this discussion there's multiple ways to work around this issue currently so I especially don't see the need to rush any of this and fiddle around with permission checking.
On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > MNT_FORCE is, I think, a good idea and a needed functionality that has > never been implemented well. > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > aborts all pending RPCs on that filesystem. Suppose it happens to be mounted in another namespace as well. Or bound at different mountpoint, for that matter. What should MNT_FORCE do?
On Fri, 21 Apr 2023, Al Viro wrote: > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > > > MNT_FORCE is, I think, a good idea and a needed functionality that has > > never been implemented well. > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > > aborts all pending RPCs on that filesystem. > > Suppose it happens to be mounted in another namespace as well. Or bound > at different mountpoint, for that matter. What should MNT_FORCE do? > 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall that uses the vfsmount (whether from an fd, or found in a path walk, or elsewhere), except for close(), to abort with an error; 2/ call ->umount_begin passing in the vfs_mount. The fs can abort any outstanding transaction on any fd from that vfs_mount. Possibly it might instead abort a wait rather than the whole transaction, particularly if requests using some other vfs_mount might also be interested in the transaction 3/ ->close() on a force-unmount vfs_mount would clean up without blocking indefinitely, discarding dirty data if necessary. This still depends on the application to close all fds that return errors (and to chdir out of a problematic directory). But at least it *allows* applications to do that without requiring that they be killed. Thanks, NeilBrown
On Fri, Apr 21, 2023 at 08:01:09AM +1000, NeilBrown wrote: > On Fri, 21 Apr 2023, Al Viro wrote: > > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > > > > > MNT_FORCE is, I think, a good idea and a needed functionality that has > > > never been implemented well. > > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > > > aborts all pending RPCs on that filesystem. > > > > Suppose it happens to be mounted in another namespace as well. Or bound > > at different mountpoint, for that matter. What should MNT_FORCE do? > > > > 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall > that uses the vfsmount (whether from an fd, or found in a path walk, > or elsewhere), except for close(), to abort with an error; > 2/ call ->umount_begin passing in the vfs_mount. The fs can abort any > outstanding transaction on any fd from that vfs_mount. How the hell would you recognize those?
diff --git a/fs/namei.c b/fs/namei.c index edfedfbccaef..f2df1adae7c5 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) * * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. */ -int inode_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) +int inode_permission_mp(struct mnt_idmap *idmap, + struct inode *inode, int mask, bool mp) { int retval; @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, return -EACCES; } - retval = do_inode_permission(idmap, inode, mask); + if (mp) + /* We are looking for a mountpoint and so don't + * want to interact with the filesystem at all, just + * the dcache and icache. + */ + retval = generic_permission(idmap, inode, mask); + else + retval = do_inode_permission(idmap, inode, mask); if (retval) return retval; @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, return security_inode_permission(inode, mask); } + +/** + * inode_permission - Check for access rights to a given inode + * @idmap: idmap of the mount the inode was found from + * @inode: Inode to check permission on + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) + * + * Check for read/write/execute permissions on an inode. We use fs[ug]id for + * this, letting us set arbitrary permissions for filesystem access without + * changing the "normal" UIDs which are used for other things. + * + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. + */ +int inode_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + return inode_permission_mp(idmap, inode, mask, false); +} EXPORT_SYMBOL(inode_permission); /** @@ -589,6 +614,7 @@ struct nameidata { #define ND_ROOT_PRESET 1 #define ND_ROOT_GRABBED 2 #define ND_JUMPED 4 +#define ND_SYS_ADMIN 8 static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) { @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) static inline int d_revalidate(struct dentry *dentry, unsigned int flags) { - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && + likely(!(flags & LOOKUP_MOUNTPOINT))) return dentry->d_op->d_revalidate(dentry, flags); else return 1; @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, static inline int may_lookup(struct mnt_idmap *idmap, struct nameidata *nd) { + /* If we are looking for a mountpoint and we have the SYS_ADMIN + * capability, then we will by-pass the filesys for permission checks + * and just use generic_permission(). + */ + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); if (nd->flags & LOOKUP_RCU) { - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); if (err != -ECHILD || !try_to_unlazy(nd)) return err; } - return inode_permission(idmap, nd->inode, MAY_EXEC); + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); } static int reserve_stack(struct nameidata *nd, struct path *link) @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, if (IS_ERR(name)) return PTR_ERR(name); set_nameidata(&nd, dfd, name, root); + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) + nd.state = ND_SYS_ADMIN; retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); if (unlikely(retval == -ECHILD)) retval = path_lookupat(&nd, flags, path);
When performing a LOOKUP_MOUNTPOINT lookup we don't really want to engage with underlying systems at all. Any mount point MUST be in the dcache with a complete direct path from the root to the mountpoint. That should be sufficient to find the mountpoint given a path name. This becomes an issue when the filesystem changes unexpected, such as when a NFS server is changed to reject all access. It then becomes impossible to unmount anything mounted on the filesystem which has changed. We could simply lazy-unmount the changed filesystem and that will often be sufficient. However if the target filesystem needs "umount -f" to complete the unmount properly, then the lazy unmount will leave it incompletely unmounted. When "-f" is needed, we really need to be able to name the target filesystem. We NEVER want to revalidate anything. We already avoid the revalidation of the mountpoint itself, be we won't need to revalidate anything on the path either as thay might affect the cache, and the cache is what we are really looking in. Permission checks are a little less clear. We currently allow any user to make the umount syscall and perform the path lookup and only reject unprivileged users once the target mount point has been found. If we completely relax permission checks then an unprivileged user could probe inaccessible parts of the name space by examining the error returned from umount(). So we only relax permission check when the user has CAP_SYS_ADMIN (may_mount() succeeds). Note that if the path given is not direct and for example uses symlinks or "..", then dentries or symlink content may not be cached and a remote server could cause problem. We can only be certain of a safe unmount if a direct path is used. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 6 deletions(-)