virtiofs: Enable SB_NOSEC flag to improve small write performance

Message ID 20200716144032.GC422759@redhat.com (mailing list archive)
State New, archived

Commit Message

Vivek Goyal July 16, 2020, 2:40 p.m. UTC
Ganesh Mahalingam reported that virtiofs is slow with small direct random
writes when virtiofsd is run with cache=always.

https://github.com/kata-containers/runtime/issues/2815

A little debugging showed that file_remove_privs() is called in the cached
write path on every write. Every time, it calls
security_inode_need_killpriv(), which results in a call to
__vfs_getxattr(XATTR_NAME_CAPS). That request goes to the file server to
fetch the xattr. This extra round trip on every write slows writes down a lot.

Normally, to avoid paying this penalty on every write, the VFS caches
this information in the inode (S_NOSEC). The VFS sets S_NOSEC if the
filesystem opted in using the super block flag SB_NOSEC. S_NOSEC is
cleared when a setuid/setgid bit is set or when a security xattr is set
on the inode, so that the next time a write happens, we check the inode
again and clear the setuid/setgid bits as well as any
security.capability xattr.
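
To make the caching concrete, here is a minimal userspace sketch of the
fast path described above. It is a model, not kernel code: the struct
names, the MODEL_* constants and the round-trip counter are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
#define MODEL_S_NOSEC   0x1u   /* inode flag: "nothing to strip on write" */
#define MODEL_SB_NOSEC  0x1u   /* superblock flag: fs opted in to caching */

struct model_sb {
	unsigned int s_flags;
};

struct model_inode {
	struct model_sb *sb;
	unsigned int i_flags;
	bool has_sxid;                  /* setuid/setgid bit set in the mode */
	bool has_caps_xattr;            /* security.capability present on server */
	unsigned int xattr_round_trips; /* counts simulated server queries */
};

/* Simulates the expensive getxattr(security.capability) round trip. */
static bool model_query_caps(struct model_inode *inode)
{
	inode->xattr_round_trips++;
	return inode->has_caps_xattr;
}

/* Called on every write. Returns true if privileges were stripped. */
bool model_remove_privs(struct model_inode *inode)
{
	assert(inode != NULL && inode->sb != NULL);

	if (inode->i_flags & MODEL_S_NOSEC)
		return false; /* fast path: cached answer, no round trip */

	bool need_kill = inode->has_sxid;
	if (model_query_caps(inode))
		need_kill = true;

	if (need_kill) {
		inode->has_sxid = false;
		inode->has_caps_xattr = false;
	}

	/* Cache the now-clean state only if the filesystem opted in. */
	if (inode->sb->s_flags & MODEL_SB_NOSEC)
		inode->i_flags |= MODEL_S_NOSEC;
	return need_kill;
}

/* Setting a security xattr invalidates the cached answer. */
void model_set_caps(struct model_inode *inode)
{
	assert(inode != NULL);
	inode->has_caps_xattr = true;
	inode->i_flags &= ~MODEL_S_NOSEC;
}
```

With MODEL_SB_NOSEC set, repeated calls to model_remove_privs() on a clean
inode cost one simulated round trip in total. Without it, every call pays
for one.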

This works well for local file systems, but for remote file systems
the VFS may not have the full picture: a different client can set a
setuid/setgid bit or the security.capability xattr on a file, and then
the S_NOSEC state cached by the VFS on another client will be stale. So
for remote filesystems SB_NOSEC was disabled by default.

commit 9e1f1de02c2275d7172e18dc4e7c2065777611bf
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Jun 3 18:24:58 2011 -0400

    more conservative S_NOSEC handling

That commit mentioned that these filesystems can still make use of
SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode
attributes from the server.

So this patch enables SB_NOSEC for virtiofs and clears S_NOSEC in the
common fuse code when we are refreshing inode attributes.

We need to clear S_NOSEC when either the inode has a setuid/setgid bit
set or the security.capability xattr has been set. The first piece of
information is available in the FUSE_GETATTR response, but we don't know
whether security.capability has been set on the file. The question is, do
we really need to know about security.capability? file_remove_privs()
always removes security.capability when a file is being written to. That
means that when the server writes to the file, security.capability should
be removed without the guest having to tell the server anything.

That means we don't have to worry about knowing whether security.capability
was set, as long as writes by the client are not cached and always go to
the server, where the write should clear security.capability. Hence, I do
not set SB_NOSEC when writeback cache is enabled.

This change improves random write performance very significantly. I
am running virtiofsd with cache=auto and the following fio command:

fio --ioengine=libaio --direct=1  --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

Before this patch I get around 40MB/s, and after the patch I get around
300MB/s of bandwidth, which is roughly a 7.5x improvement.

Note: We probably could do this change for regular fuse filesystems
      as well. But I don't know all the possible configurations supported
      so I am limiting it to virtiofs.

Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/inode.c     | 7 +++++++
 fs/fuse/virtio_fs.c | 4 ++++
 2 files changed, 11 insertions(+)

Comments

Vivek Goyal July 16, 2020, 6:18 p.m. UTC | #1
On Thu, Jul 16, 2020 at 10:40:33AM -0400, Vivek Goyal wrote:
> Ganesh Mahalingam reported that virtiofs is slow with small direct random
> writes when virtiofsd is run with cache=always.
> 
> https://github.com/kata-containers/runtime/issues/2815
> 
> Little debugging showed that that file_remove_privs() is called in cached
> write path on every write. And everytime it calls
> security_inode_need_killpriv() which results in call to
> __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch
> xattr. This extra round trip for every write slows down writes a lot.
> 
> Normally to avoid paying this penalty on every write, vfs has the
> notion of caching this information in inode (S_NOSEC). So vfs
> sets S_NOSEC, if filesystem opted for it using super block flag
> SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or
> when security xattr is set on inode so that next time a write
> happens, we check inode again for clearing setuid/setgid bits as well
> clear any security.capability xattr.
> 
> This seems to work well for local file systems but for remote file
> systems it is possible that VFS does not have full picture and a
> different client sets setuid/setgid bit or security.capability xattr
> on file and that means VFS information about S_NOSEC on another client
> will be stale. So for remote filesystems SB_NOSEC was disabled by
> default.
> 
> commit 9e1f1de02c2275d7172e18dc4e7c2065777611bf
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Fri Jun 3 18:24:58 2011 -0400
> 
>     more conservative S_NOSEC handling
> 
> That commit mentioned that these filesystems can still make use of
> SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode
> attriutes from server.
> 
> So this patch tries to enable SB_NOSEC on fuse (regular fuse as well
> as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes.
> 
> We need to clear SB_NOSEC either when inode has setuid/setgid bit set
> or security.capability xattr has been set. We have the first piece of
> information available in FUSE_GETATTR response. But we don't know if
> security.capability has been set on file or not. Question is, do we
> really need to know about security.capability. file_remove_privs()
> always removes security.capability if a file is being written to. That
> means when server writes to file, security.capability should be removed
> without guest having to tell anything to it.


I am assuming that the file server will clear security.capability on the
host upon WRITE. Is that a fair assumption for all the filesystems that
passthrough virtiofsd might be running on?

Vivek

> 
> That means we don't have to worry about knowing if security.capability
> was set or not as long as writes by client don't get cached and go to
> server always. And server write should clear security.capability. Hence,
> I clear SB_NOSEC when writeback cache is enabled.
> 
> This change improves random write performance very significantly. I
> am running virtiofsd with cache=auto and following fio command.
> 
> fio --ioengine=libaio --direct=1  --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
> 
> Before this patch I get around 40MB/s and after the patch I get around
> 300MB/s bandwidth. So improvement is very significant.
> 
> Note: We probably could do this change for regular fuse filesystems
>       as well. But I don't know all the possible configurations supported
>       so I am limiting it to virtiofs.
> 
> Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/inode.c     | 7 +++++++
>  fs/fuse/virtio_fs.c | 4 ++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 5b4aebf5821f..5e74c818b2aa 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -185,6 +185,13 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>  		inode->i_mode &= ~S_ISVTX;
>  
>  	fi->orig_ino = attr->ino;
> +
> +	/*
> +	 * File server see setuid/setgid bit set. Maybe another client did
> +	 * it. Reset S_NOSEC.
> +	 */
> +	if (IS_NOSEC(inode) && is_sxid(inode->i_mode))
> +		inode->i_flags &= ~S_NOSEC;
>  }
>  
>  void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4c4ef5d69298..e89628163ec4 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1126,6 +1126,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
>  	/* Previous unmount will stop all queues. Start these again */
>  	virtio_fs_start_all_queues(fs);
>  	fuse_send_init(fc);
> +
> +	if (!fc->writeback_cache)
> +		sb->s_flags |= SB_NOSEC;
> +
>  	mutex_unlock(&virtio_fs_mutex);
>  	return 0;
>  
> -- 
> 2.25.4
>
Miklos Szeredi July 17, 2020, 8:53 a.m. UTC | #2
On Thu, Jul 16, 2020 at 8:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jul 16, 2020 at 10:40:33AM -0400, Vivek Goyal wrote:
> > Ganesh Mahalingam reported that virtiofs is slow with small direct random
> > writes when virtiofsd is run with cache=always.
> >
> > https://github.com/kata-containers/runtime/issues/2815
> >
> > Little debugging showed that that file_remove_privs() is called in cached
> > write path on every write. And everytime it calls
> > security_inode_need_killpriv() which results in call to
> > __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch
> > xattr. This extra round trip for every write slows down writes a lot.
> >
> > Normally to avoid paying this penalty on every write, vfs has the
> > notion of caching this information in inode (S_NOSEC). So vfs
> > sets S_NOSEC, if filesystem opted for it using super block flag
> > SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or
> > when security xattr is set on inode so that next time a write
> > happens, we check inode again for clearing setuid/setgid bits as well
> > clear any security.capability xattr.
> >
> > This seems to work well for local file systems but for remote file
> > systems it is possible that VFS does not have full picture and a
> > different client sets setuid/setgid bit or security.capability xattr
> > on file and that means VFS information about S_NOSEC on another client
> > will be stale. So for remote filesystems SB_NOSEC was disabled by
> > default.
> >
> > commit 9e1f1de02c2275d7172e18dc4e7c2065777611bf
> > Author: Al Viro <viro@zeniv.linux.org.uk>
> > Date:   Fri Jun 3 18:24:58 2011 -0400
> >
> >     more conservative S_NOSEC handling
> >
> > That commit mentioned that these filesystems can still make use of
> > SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode
> > attriutes from server.
> >
> > So this patch tries to enable SB_NOSEC on fuse (regular fuse as well
> > as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes.
> >
> > We need to clear SB_NOSEC either when inode has setuid/setgid bit set
> > or security.capability xattr has been set. We have the first piece of
> > information available in FUSE_GETATTR response. But we don't know if
> > security.capability has been set on file or not. Question is, do we
> > really need to know about security.capability. file_remove_privs()
> > always removes security.capability if a file is being written to. That
> > means when server writes to file, security.capability should be removed
> > without guest having to tell anything to it.
>
>
> I am assuming that file server will clear security.capability on host
> upon WRITE. Is it a fair assumption for all filesystems passthrough
> virtiofsd might be running?

AFAICS this needs to be gated through handle_killpriv, and with that
it can become a generic fuse feature, not just virtiofs:

 * FUSE_HANDLE_KILLPRIV: fs handles killing suid/sgid/cap on write/chown/trunc

Even writeback_cache could be handled by this addition, since we call
fuse_update_attributes() before generic_file_write_iter() :

--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
*inode, struct file *file,

        if (sync) {
                forget_all_cached_acls(inode);
+               inode->i_flags &= ~S_NOSEC;
                err = fuse_do_getattr(inode, stat, file);
        } else if (stat) {
                generic_fillattr(inode, stat);


Thanks,
Miklos
Vivek Goyal July 20, 2020, 3:41 p.m. UTC | #3
On Fri, Jul 17, 2020 at 10:53:07AM +0200, Miklos Szeredi wrote:
> On Thu, Jul 16, 2020 at 8:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 10:40:33AM -0400, Vivek Goyal wrote:
> > > Ganesh Mahalingam reported that virtiofs is slow with small direct random
> > > writes when virtiofsd is run with cache=always.
> > >
> > > https://github.com/kata-containers/runtime/issues/2815
> > >
> > > Little debugging showed that that file_remove_privs() is called in cached
> > > write path on every write. And everytime it calls
> > > security_inode_need_killpriv() which results in call to
> > > __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch
> > > xattr. This extra round trip for every write slows down writes a lot.
> > >
> > > Normally to avoid paying this penalty on every write, vfs has the
> > > notion of caching this information in inode (S_NOSEC). So vfs
> > > sets S_NOSEC, if filesystem opted for it using super block flag
> > > SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or
> > > when security xattr is set on inode so that next time a write
> > > happens, we check inode again for clearing setuid/setgid bits as well
> > > clear any security.capability xattr.
> > >
> > > This seems to work well for local file systems but for remote file
> > > systems it is possible that VFS does not have full picture and a
> > > different client sets setuid/setgid bit or security.capability xattr
> > > on file and that means VFS information about S_NOSEC on another client
> > > will be stale. So for remote filesystems SB_NOSEC was disabled by
> > > default.
> > >
> > > commit 9e1f1de02c2275d7172e18dc4e7c2065777611bf
> > > Author: Al Viro <viro@zeniv.linux.org.uk>
> > > Date:   Fri Jun 3 18:24:58 2011 -0400
> > >
> > >     more conservative S_NOSEC handling
> > >
> > > That commit mentioned that these filesystems can still make use of
> > > SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode
> > > attriutes from server.
> > >
> > > So this patch tries to enable SB_NOSEC on fuse (regular fuse as well
> > > as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes.
> > >
> > > We need to clear SB_NOSEC either when inode has setuid/setgid bit set
> > > or security.capability xattr has been set. We have the first piece of
> > > information available in FUSE_GETATTR response. But we don't know if
> > > security.capability has been set on file or not. Question is, do we
> > > really need to know about security.capability. file_remove_privs()
> > > always removes security.capability if a file is being written to. That
> > > means when server writes to file, security.capability should be removed
> > > without guest having to tell anything to it.
> >
> >
> > I am assuming that file server will clear security.capability on host
> > upon WRITE. Is it a fair assumption for all filesystems passthrough
> > virtiofsd might be running?
> 
> AFAICS this needs to be gated through handle_killpriv, and with that
> it can become a generic fuse feature, not just virtiofs:
> 
>  * FUSE_HANDLE_KILLPRIV: fs handles killing suid/sgid/cap on write/chown/trunc

Hi Miklos,

That sounds interesting. I have a couple of questions though.

I see in VFS that chown() always kills suid/sgid, while truncate() and
write() kill suid/sgid only if the caller does not have CAP_FSETID.

How does this work with FUSE_HANDLE_KILLPRIV? IIUC, the file server does
not know whether the caller has CAP_FSETID. That means the file server will
be forced to kill suid/sgid on every write and truncate, and that will fail
some of the tests.

For WRITE requests now we do have the notion of setting
FUSE_WRITE_KILL_PRIV flag to tell server explicitly to kill suid/sgid.
Probably we could use that in cached write path as well to figure out
whether to kill suid/sgid or not. But truncate() will still continue
to be an issue.
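
The CAP_FSETID rule above can be sketched in userspace as follows. This
is only a model of the decision as I described it, not the kernel
function: the M_* mode constants mirror the usual Linux octal values but
are defined locally, and the MODEL_KILL_* values and the function name
are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the mode bits (same octal values as on Linux). */
#define M_IFMT   0170000u
#define M_IFREG  0100000u
#define M_ISUID  0004000u
#define M_ISGID  0002000u
#define M_IXGRP  0000010u

/* Hypothetical flag values for this sketch. */
#define MODEL_KILL_SUID 0x1
#define MODEL_KILL_SGID 0x2

/*
 * Which privilege bits a write or truncate should strip. chown is not
 * modelled here because, as noted above, it strips them regardless of
 * CAP_FSETID.
 */
int model_should_remove_suid(unsigned int mode, bool caller_has_fsetid)
{
	int kill = 0;

	if (mode & M_ISUID)
		kill |= MODEL_KILL_SUID;
	/* setgid only counts as a privilege bit together with group-execute */
	if ((mode & M_ISGID) && (mode & M_IXGRP))
		kill |= MODEL_KILL_SGID;

	assert((kill & ~(MODEL_KILL_SUID | MODEL_KILL_SGID)) == 0);

	/* Only regular files are affected, and only for callers without
	 * CAP_FSETID. */
	if (kill && !caller_has_fsetid && (mode & M_IFMT) == M_IFREG)
		return kill;
	return 0;
}
```

The second argument is exactly the piece of information the file server
does not have, which is why a per-request flag such as
FUSE_WRITE_KILL_PRIV is needed to carry the decision across.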

> 
> Even writeback_cache could be handled by this addition, since we call
> fuse_update_attributes() before generic_file_write_iter() :
> 
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
> *inode, struct file *file,
> 
>         if (sync) {
>                 forget_all_cached_acls(inode);
> +               inode->i_flags &= ~S_NOSEC;

Ok, so I was clearing S_NOSEC only if the server reports that the file has
the suid/sgid bit set. This change will clear S_NOSEC whenever we fetch
attrs from the host, which will force getxattr() when we call
file_remove_privs() and will increase overhead for non-writeback cache
modes. We could probably keep both: clear it unconditionally in writeback
cache mode, otherwise not.

What I don't understand, though, is how this change will clear
suid/sgid on the host in cache=writeback mode. I see fuse_setattr()
will not set ATTR_MODE and clear S_ISUID and S_ISGID if
fc->handle_killpriv is set. So when the server receives a setattr request
(if it does), how will it know it is supposed to kill the suid/sgid
bits? (It's not chown or truncate, and it's not write.)

What am I missing?

Thanks
Vivek

>                 err = fuse_do_getattr(inode, stat, file);
>         } else if (stat) {
>                 generic_fillattr(inode, stat);
> 
> 
> Thanks,
> Miklos
>
Miklos Szeredi July 21, 2020, 12:33 p.m. UTC | #4
On Mon, Jul 20, 2020 at 5:41 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jul 17, 2020 at 10:53:07AM +0200, Miklos Szeredi wrote:

> I see in VFS that chown() always kills suid/sgid. While truncate() and
> write(), will suid/sgid only if caller does not have CAP_FSETID.
>
> How does this work with FUSE_HANDLE_KILLPRIV. IIUC, file server does not
> know if caller has CAP_FSETID or not. That means file server will be
> forced to kill suid/sgid on every write and truncate. And that will fail
> some of the tests.
>
> For WRITE requests now we do have the notion of setting
> FUSE_WRITE_KILL_PRIV flag to tell server explicitly to kill suid/sgid.
> Probably we could use that in cached write path as well to figure out
> whether to kill suid/sgid or not. But truncate() will still continue
> to be an issue.

Yes, not doing the same for truncate seems to be an oversight.
Unfortunate, since we'll need another INIT flag to enable selective
clearing of suid/sgid on truncate.

>
> >
> > Even writeback_cache could be handled by this addition, since we call
> > fuse_update_attributes() before generic_file_write_iter() :
> >
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
> > *inode, struct file *file,
> >
> >         if (sync) {
> >                 forget_all_cached_acls(inode);
> > +               inode->i_flags &= ~S_NOSEC;
>
> Ok, So I was clearing S_NOSEC only if server reports that file has
> suid/sgid bit set. This change will clear S_NOSEC whenever we fetch
> attrs from host and will force getxattr() when we call file_remove_privs()
> and will increase overhead for non cache writeback mode. We probably
> could keep both. For cache writeback mode, clear it undonditionally
> otherwise not.

We clear S_NOSEC because the attribute timeout has expired.  This
means we need to refresh all metadata, including cached xattr (which
is what S_NOSEC effectively is).

> What I don't understand is though that how this change will clear
> suid/sgid on host in cache=writeback mode. I see fuse_setattr()
> will not set ATTR_MODE and clear S_ISUID and S_ISGID if
> fc->handle_killpriv is set. So when server receives setattr request
> (if it does), then how will it know it is supposed to kill suid/sgid
> bit. (its not chown, truncate and its not write).

Depends.  If the attribute timeout is infinity, then that means the
cache is always up to date.  In that case we only need to clear
suid/sgid if set in i_mode.  Similarly, the security.capability will
only be cleared if it was set in the first place (which would clear
S_NOSEC).

If the timeout is finite, then that means we need to check if the
metadata changed after a timeout.  That's the purpose of the
fuse_update_attributes() call before generic_file_write_iter().
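
A rough userspace sketch of that rule, under assumed names: the struct,
the MODEL_* constants and the time units are hypothetical, and the only
point is that an expired attribute timeout also drops the cached S_NOSEC
answer, while an infinite timeout never does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_S_NOSEC          0x1u        /* hypothetical stand-in for S_NOSEC */
#define MODEL_TIMEOUT_INFINITE UINT64_MAX  /* "cache is always up to date" */

struct model_inode {
	unsigned int i_flags;
	uint64_t attr_valid_until;  /* time at which cached attributes expire */
	unsigned int getattr_calls; /* counts simulated server refreshes */
};

/* Refresh attributes before a cached write if the timeout has expired. */
void model_update_attributes(struct model_inode *inode, uint64_t now,
			     uint64_t timeout)
{
	assert(inode != NULL);

	if (now < inode->attr_valid_until)
		return; /* cache still trusted, S_NOSEC stays as it is */

	/* Timeout expired: drop the cached "no security xattr" answer too. */
	inode->i_flags &= ~MODEL_S_NOSEC;
	inode->getattr_calls++;
	inode->attr_valid_until = (timeout == MODEL_TIMEOUT_INFINITE)
		? MODEL_TIMEOUT_INFINITE : now + timeout;
}
```

With an infinite timeout the early return is always taken, so S_NOSEC is
only ever cleared by a local change to the mode or the xattr.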

Does that make it clear?

Thanks,
Miklos
Vivek Goyal July 21, 2020, 3:16 p.m. UTC | #5
On Tue, Jul 21, 2020 at 02:33:41PM +0200, Miklos Szeredi wrote:
> On Mon, Jul 20, 2020 at 5:41 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Jul 17, 2020 at 10:53:07AM +0200, Miklos Szeredi wrote:
> 
> > I see in VFS that chown() always kills suid/sgid. While truncate() and
> > write(), will suid/sgid only if caller does not have CAP_FSETID.
> >
> > How does this work with FUSE_HANDLE_KILLPRIV. IIUC, file server does not
> > know if caller has CAP_FSETID or not. That means file server will be
> > forced to kill suid/sgid on every write and truncate. And that will fail
> > some of the tests.
> >
> > For WRITE requests now we do have the notion of setting
> > FUSE_WRITE_KILL_PRIV flag to tell server explicitly to kill suid/sgid.
> > Probably we could use that in cached write path as well to figure out
> > whether to kill suid/sgid or not. But truncate() will still continue
> > to be an issue.
> 
> Yes, not doing the same for truncate seems to be an oversight.
> Unfortunate, since we'll need another INIT flag to enable selective
> clearing of suid/sgid on truncate.
> 
> >
> > >
> > > Even writeback_cache could be handled by this addition, since we call
> > > fuse_update_attributes() before generic_file_write_iter() :
> > >
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
> > > *inode, struct file *file,
> > >
> > >         if (sync) {
> > >                 forget_all_cached_acls(inode);
> > > +               inode->i_flags &= ~S_NOSEC;
> >
> > Ok, So I was clearing S_NOSEC only if server reports that file has
> > suid/sgid bit set. This change will clear S_NOSEC whenever we fetch
> > attrs from host and will force getxattr() when we call file_remove_privs()
> > and will increase overhead for non cache writeback mode. We probably
> > could keep both. For cache writeback mode, clear it undonditionally
> > otherwise not.
> 
> We clear S_NOSEC because the attribute timeout has expired.  This
> means we need to refresh all metadata, including cached xattr (which
> is what S_NOSEC effectively is).
> 
> > What I don't understand is though that how this change will clear
> > suid/sgid on host in cache=writeback mode. I see fuse_setattr()
> > will not set ATTR_MODE and clear S_ISUID and S_ISGID if
> > fc->handle_killpriv is set. So when server receives setattr request
> > (if it does), then how will it know it is supposed to kill suid/sgid
> > bit. (its not chown, truncate and its not write).
> 
> Depends.  If the attribute timeout is infinity, then that means the
> cache is always up to date.  In that case we only need to clear
> suid/sgid if set in i_mode.  Similarly, the security.capability will
> only be cleared if it was set in the first place (which would clear
> S_NOSEC).
> 
> If the timeout is finite, then that means we need to check if the
> metadata changed after a timeout.  That's the purpose of the
> fuse_update_attributes() call before generic_file_write_iter().
> 
> Does that make it clear?

I understood it partly, but one thing is still bothering me. What
happens when writeback cache is enabled and fc->handle_killpriv=1?

When handle_killpriv is set, how will suid/sgid be cleared by the
server? Given cache=writeback, the write probably got cached in the
guest and the server probably will not see a WRITE immediately.
(I am assuming we are relying on a WRITE to clear setuid/setgid when
handle_killpriv is set.) That means the server will not clear
setuid/setgid until the inode is written back at some later point in
time.

IOW, cache=writeback and fc->handle_killpriv don't seem to go
together (at least given the current code).

Thanks
Vivek
Miklos Szeredi July 21, 2020, 3:44 p.m. UTC | #6
On Tue, Jul 21, 2020 at 5:17 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jul 21, 2020 at 02:33:41PM +0200, Miklos Szeredi wrote:
> > On Mon, Jul 20, 2020 at 5:41 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Jul 17, 2020 at 10:53:07AM +0200, Miklos Szeredi wrote:
> >
> > > I see in VFS that chown() always kills suid/sgid. While truncate() and
> > > write(), will suid/sgid only if caller does not have CAP_FSETID.
> > >
> > > How does this work with FUSE_HANDLE_KILLPRIV. IIUC, file server does not
> > > know if caller has CAP_FSETID or not. That means file server will be
> > > forced to kill suid/sgid on every write and truncate. And that will fail
> > > some of the tests.
> > >
> > > For WRITE requests now we do have the notion of setting
> > > FUSE_WRITE_KILL_PRIV flag to tell server explicitly to kill suid/sgid.
> > > Probably we could use that in cached write path as well to figure out
> > > whether to kill suid/sgid or not. But truncate() will still continue
> > > to be an issue.
> >
> > Yes, not doing the same for truncate seems to be an oversight.
> > Unfortunate, since we'll need another INIT flag to enable selective
> > clearing of suid/sgid on truncate.
> >
> > >
> > > >
> > > > Even writeback_cache could be handled by this addition, since we call
> > > > fuse_update_attributes() before generic_file_write_iter() :
> > > >
> > > > --- a/fs/fuse/dir.c
> > > > +++ b/fs/fuse/dir.c
> > > > @@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
> > > > *inode, struct file *file,
> > > >
> > > >         if (sync) {
> > > >                 forget_all_cached_acls(inode);
> > > > +               inode->i_flags &= ~S_NOSEC;
> > >
> > > Ok, So I was clearing S_NOSEC only if server reports that file has
> > > suid/sgid bit set. This change will clear S_NOSEC whenever we fetch
> > > attrs from host and will force getxattr() when we call file_remove_privs()
> > > and will increase overhead for non cache writeback mode. We probably
> > > could keep both. For cache writeback mode, clear it undonditionally
> > > otherwise not.
> >
> > We clear S_NOSEC because the attribute timeout has expired.  This
> > means we need to refresh all metadata, including cached xattr (which
> > is what S_NOSEC effectively is).
> >
> > > What I don't understand is though that how this change will clear
> > > suid/sgid on host in cache=writeback mode. I see fuse_setattr()
> > > will not set ATTR_MODE and clear S_ISUID and S_ISGID if
> > > fc->handle_killpriv is set. So when server receives setattr request
> > > (if it does), then how will it know it is supposed to kill suid/sgid
> > > bit. (its not chown, truncate and its not write).
> >
> > Depends.  If the attribute timeout is infinity, then that means the
> > cache is always up to date.  In that case we only need to clear
> > suid/sgid if set in i_mode.  Similarly, the security.capability will
> > only be cleared if it was set in the first place (which would clear
> > S_NOSEC).
> >
> > If the timeout is finite, then that means we need to check if the
> > metadata changed after a timeout.  That's the purpose of the
> > fuse_update_attributes() call before generic_file_write_iter().
> >
> > Does that make it clear?
>
> I understood it partly but one thing is still bothering me. What
> happens when cache writeback is set as well as fc->handle_killpriv=1.
>
> When handle_killpriv is set, how suid/sgid will be cleared by
> server. Given cache=writeback, write probably got cached in
> guest and server probably will not not see a WRITE immideately.
> (I am assuming we are relying on a WRITE to clear setuid/setgid when
>  handle_killpriv is set). And that means server will not clear
>  setuid/setgid till inode is written back at some point of time
>  later.
>
> IOW, cache=writeback and fc->handle_killpriv don't seem to go
> together (atleast given the current code).

fuse_cache_write_iter()
  -> fuse_update_attributes()   * this will refresh i_mode
  -> generic_file_write_iter()
      ->__generic_file_write_iter()
          ->file_remove_privs()    * this will check i_mode
              ->__remove_privs()
                  -> notify_change()
                     -> fuse_setattr()   * this will clear suid/sgid bits

Thanks,
Miklos



>
> Thanks
> Vivek
>
Vivek Goyal July 21, 2020, 3:55 p.m. UTC | #7
On Tue, Jul 21, 2020 at 05:44:14PM +0200, Miklos Szeredi wrote:
> On Tue, Jul 21, 2020 at 5:17 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jul 21, 2020 at 02:33:41PM +0200, Miklos Szeredi wrote:
> > > On Mon, Jul 20, 2020 at 5:41 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Fri, Jul 17, 2020 at 10:53:07AM +0200, Miklos Szeredi wrote:
> > >
> > > > I see in VFS that chown() always kills suid/sgid. While truncate() and
> > > > write(), will suid/sgid only if caller does not have CAP_FSETID.
> > > >
> > > > How does this work with FUSE_HANDLE_KILLPRIV. IIUC, file server does not
> > > > know if caller has CAP_FSETID or not. That means file server will be
> > > > forced to kill suid/sgid on every write and truncate. And that will fail
> > > > some of the tests.
> > > >
> > > > For WRITE requests now we do have the notion of setting
> > > > FUSE_WRITE_KILL_PRIV flag to tell server explicitly to kill suid/sgid.
> > > > Probably we could use that in cached write path as well to figure out
> > > > whether to kill suid/sgid or not. But truncate() will still continue
> > > > to be an issue.
> > >
> > > Yes, not doing the same for truncate seems to be an oversight.
> > > Unfortunate, since we'll need another INIT flag to enable selective
> > > clearing of suid/sgid on truncate.
> > >
> > > >
> > > > >
> > > > > Even writeback_cache could be handled by this addition, since we call
> > > > > fuse_update_attributes() before generic_file_write_iter() :
> > > > >
> > > > > --- a/fs/fuse/dir.c
> > > > > +++ b/fs/fuse/dir.c
> > > > > @@ -985,6 +985,7 @@ static int fuse_update_get_attr(struct inode
> > > > > *inode, struct file *file,
> > > > >
> > > > >         if (sync) {
> > > > >                 forget_all_cached_acls(inode);
> > > > > +               inode->i_flags &= ~S_NOSEC;
> > > >
> > > > Ok, So I was clearing S_NOSEC only if server reports that file has
> > > > suid/sgid bit set. This change will clear S_NOSEC whenever we fetch
> > > > attrs from host and will force getxattr() when we call file_remove_privs()
> > > > and will increase overhead for non cache writeback mode. We probably
> > > > could keep both. For cache writeback mode, clear it undonditionally
> > > > otherwise not.
> > >
> > > We clear S_NOSEC because the attribute timeout has expired.  This
> > > means we need to refresh all metadata, including cached xattr (which
> > > is what S_NOSEC effectively is).
> > >
> > > > What I don't understand is though that how this change will clear
> > > > suid/sgid on host in cache=writeback mode. I see fuse_setattr()
> > > > will not set ATTR_MODE and clear S_ISUID and S_ISGID if
> > > > fc->handle_killpriv is set. So when server receives setattr request
> > > > (if it does), then how will it know it is supposed to kill suid/sgid
> > > > bit. (its not chown, truncate and its not write).
> > >
> > > Depends.  If the attribute timeout is infinity, then that means the
> > > cache is always up to date.  In that case we only need to clear
> > > suid/sgid if set in i_mode.  Similarly, the security.capability will
> > > only be cleared if it was set in the first place (which would clear
> > > S_NOSEC).
> > >
> > > If the timeout is finite, then that means we need to check if the
> > > metadata changed after a timeout.  That's the purpose of the
> > > fuse_update_attributes() call before generic_file_write_iter().
> > >
> > > Does that make it clear?
> >
> > I understood it partly, but one thing is still bothering me: what
> > happens when cache writeback is set as well as fc->handle_killpriv=1?
> >
> > When handle_killpriv is set, how will suid/sgid be cleared by the
> > server? Given cache=writeback, the write probably got cached in the
> > guest and the server probably will not see a WRITE immediately.
> > (I am assuming we are relying on a WRITE to clear setuid/setgid when
> >  handle_killpriv is set.) That means the server will not clear
> >  setuid/setgid until the inode is written back at some later point
> >  in time.
> >
> > IOW, cache=writeback and fc->handle_killpriv don't seem to go
> > together (at least given the current code).
> 
> fuse_cache_write_iter()
>   -> fuse_update_attributes()   * this will refresh i_mode
>   -> generic_file_write_iter()
>       ->__generic_file_write_iter()
>           ->file_remove_privs()    * this will check i_mode
>               ->__remove_privs()
>                   -> notify_change()
>                      -> fuse_setattr()   * this will clear suid/sgid bits

And fuse_setattr() has the following:

                if (!fc->handle_killpriv) {
                        /*
                         * ia_mode calculation may have used stale i_mode.
                         * Refresh and recalculate.
                         */
                        ret = fuse_do_getattr(inode, NULL, file);
                        if (ret)
                                return ret;

                        attr->ia_mode = inode->i_mode;
                        if (inode->i_mode & S_ISUID) {
                                attr->ia_valid |= ATTR_MODE;
                                attr->ia_mode &= ~S_ISUID;
                        }
                        if ((inode->i_mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
                                attr->ia_valid |= ATTR_MODE;
                                attr->ia_mode &= ~S_ISGID;
                        }
                }
        }
        if (!attr->ia_valid)
                return 0;

So if fc->handle_killpriv is set, we might not even send a setattr
request if attr->ia_valid turns out to be zero.

I did some quick instrumentation and noticed that we are sending
setattr with attr->ia_valid=0x200 (ATTR_FORCE) set. So the file
server is not required to kill suid/sgid in this case?

Thanks
Vivek
Vivek Goyal July 21, 2020, 6:16 p.m. UTC | #8
On Tue, Jul 21, 2020 at 11:55:03AM -0400, Vivek Goyal wrote:
> So if fc->handle_killpriv is set, we might not even send a setattr
> request if attr->ia_valid turns out to be zero.
> 
> I did some quick instrumentation and noticed that we are sending
> setattr with attr->ia_valid=0x200 (ATTR_FORCE) set. So the file
> server is not required to kill suid/sgid in this case?

I did a little more instrumentation of fuse and virtiofsd. I modified
virtiofsd to enable FUSE_HANDLE_KILLPRIV and ran virtiofsd with
-o writeback.

On the client, I created a file /mnt/virtiofs/foo.txt and set the setuid
bit. I wrote a program that writes a single character to the file,
dropped CAP_FSETID before executing it, and watched the messages
arriving at virtiofsd.

No WRITE arrived, and lo_setattr() was called with valid=0x0. That
means it will not change any of the attrs; it simply gets the
current attrs and returns them to the client.

A WRITE comes later, either when the file is closed (fuse_flush()) or
when writeback is triggered. So if the file server always clears the
setuid/setgid bits on WRITE, they will ultimately be cleared, but
much later, when the guest page is written back.

Hopefully I am not missing something very basic.

Thanks
Vivek
Miklos Szeredi July 21, 2020, 7:53 p.m. UTC | #9
On Tue, Jul 21, 2020 at 5:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> So if fc->handle_killpriv is set, we might not even send a setattr
> request if attr->ia_valid turns out to be zero.

Ah, right you are.  The writeback_cache case is indeed special.

The way that can be properly solved, I think, is to check if any
security bits need to be removed before calling into
generic_file_write_iter() and if yes, fall back to unbuffered write.

Something like the attached?

Thanks,
Miklos
Vivek Goyal July 21, 2020, 9:30 p.m. UTC | #10
On Tue, Jul 21, 2020 at 09:53:21PM +0200, Miklos Szeredi wrote:
> Ah, right you are.  The writeback_cache case is indeed special.
> 
> The way that can be properly solved, I think, is to check if any
> security bits need to be removed before calling into
> generic_file_write_iter() and if yes, fall back to unbuffered write.
> 
> Something like the attached?
> 
> Thanks,
> Miklos

> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 83d917f7e542..f67c6f46dae9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1245,16 +1245,21 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t written = 0;
>  	ssize_t written_buffered = 0;
>  	struct inode *inode = mapping->host;
> +	struct fuse_conn *fc = get_fuse_conn(inode);
>  	ssize_t err;
>  	loff_t endbyte = 0;
>  
> -	if (get_fuse_conn(inode)->writeback_cache) {
> +	if (fc->writeback_cache) {
>  		/* Update size (EOF optimization) and mode (SUID clearing) */
>  		err = fuse_update_attributes(mapping->host, file);
>  		if (err)
>  			return err;
>  
> -		return generic_file_write_iter(iocb, from);
> +		if (!fc->handle_killpriv ||
> +		    !should_remove_suid(file->f_path.dentry))
> +			return generic_file_write_iter(iocb, from);
> +
> +		/* Fall back to unbuffered write to remove SUID/SGID bits */

This should solve the issue with fc->writeback_cache.

What about the following race? Assume a client has set suid/sgid/caps
and this client is doing a write, but the cached metadata has not expired
yet. That means fuse_update_attributes() will not clear S_NOSEC,
and that means file_remove_privs() will not clear suid/sgid/caps;
the WRITE will also be buffered, so that will not clear
suid/sgid/caps either.

IOW, even after the WRITE has completed, suid/sgid/security.capability
will still be there on the file inode if the inode metadata had not
expired at the time of the WRITE. Is that acceptable from a coherency
requirements point of view?

I have a question. Does general fuse allow this use case, where multiple
clients are distributed and not going through the same VFS? virtiofs wants
to support that at some point, but what about existing fuse
filesystems?

I also have concerns about depending on FUSE_HANDLE_KILLPRIV, because
it clears suid/sgid on WRITE and truncate even if the caller has
CAP_FSETID, breaking Linux behavior (I don't know what POSIX says).

Shall we design FUSE_HANDLE_KILLPRIV2 instead, which always kills
security.capability but kills suid/sgid on WRITE/truncate
only if the caller does not have CAP_FSETID? This also means that
we will probably have to send this information in fuse_setattr()
somehow.

Am I overthinking now? :-)

Thanks
Vivek
Miklos Szeredi July 22, 2020, 10 a.m. UTC | #11
On Tue, Jul 21, 2020 at 11:30 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> What about the following race? Assume a client has set suid/sgid/caps
> and this client is doing a write, but the cached metadata has not expired
> yet. That means fuse_update_attributes() will not clear S_NOSEC,

If the write is done on the same client as the chmod/setxattr, then
that's not a problem, since chmod and setxattr will clear S_NOSEC in
the suid/sgid/caps case.

> and that means file_remove_privs() will not clear suid/sgid/caps;
> the WRITE will also be buffered, so that will not clear
> suid/sgid/caps either.
>
> IOW, even after the WRITE has completed, suid/sgid/security.capability
> will still be there on the file inode if the inode metadata had not
> expired at the time of the WRITE. Is that acceptable from a coherency
> requirements point of view?


If they were on different clients, then yes, we'll have a coherency
issue, but that's not new.

>
> I have a question. Does general fuse allow this use case, where multiple
> clients are distributed and not going through the same VFS? virtiofs wants
> to support that at some point, but what about existing fuse
> filesystems?

Fuse supports distributed filesystems through:

 - dcache invalidation with timeout (can be zero to disable caching completely)
 - dcache invalidation with notification
 - metadata invalidation with timeout (can be zero to disable caching completely)
 - metadata and data range invalidation with notification
 - disabling data caching (FOPEN_DIRECT_IO)
 - auto data invalidation on size/mtime change
 - data invalidation on open (!FOPEN_KEEP_CACHE)

So yes, fuse supports a number of ways to handle the multiple client case.

What we could do additionally with virtiofs is to make clients running
in different guests but on the same host achieve strong coherency with
some sort of shared memory scheme.

> I also have concerns about depending on FUSE_HANDLE_KILLPRIV, because
> it clears suid/sgid on WRITE and truncate even if the caller has
> CAP_FSETID, breaking Linux behavior (I don't know what POSIX says).

FUSE_HANDLE_KILLPRIV should handle writes properly (discovered by one
of the test suites, not a real-life report, as far as I remember).

Truncate is a different matter, currently there's no way to
distinguish between CAP_FSETID being set or clear.

Can't find POSIX saying anything about this.

> Shall we design FUSE_HANDLE_KILLPRIV2 instead, which always kills
> security.capability but kills suid/sgid on WRITE/truncate
> only if the caller does not have CAP_FSETID? This also means that
> we will probably have to send this information in fuse_setattr()
> somehow.

Yes, that's a good plan.

Thanks,
Miklos

Patch

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 5b4aebf5821f..5e74c818b2aa 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -185,6 +185,13 @@  void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 		inode->i_mode &= ~S_ISVTX;
 
 	fi->orig_ino = attr->ino;
+
+	/*
+	 * File server see setuid/setgid bit set. Maybe another client did
+	 * it. Reset S_NOSEC.
+	 */
+	if (IS_NOSEC(inode) && is_sxid(inode->i_mode))
+		inode->i_flags &= ~S_NOSEC;
 }
 
 void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4c4ef5d69298..e89628163ec4 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1126,6 +1126,10 @@  static int virtio_fs_fill_super(struct super_block *sb)
 	/* Previous unmount will stop all queues. Start these again */
 	virtio_fs_start_all_queues(fs);
 	fuse_send_init(fc);
+
+	if (!fc->writeback_cache)
+		sb->s_flags |= SB_NOSEC;
+
 	mutex_unlock(&virtio_fs_mutex);
 	return 0;