Message ID | 20210625191229.1752531-1-vgoyal@redhat.com (mailing list archive) |
---|---|
Series | xattr: Allow user.* xattr on symlink/special files if caller has CAP_SYS_RESOURCE |
> -----Original Message-----
> From: Vivek Goyal <vgoyal@redhat.com>
> Sent: Friday, June 25, 2021 12:12 PM
> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> viro@zeniv.linux.org.uk
> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> berrange@redhat.com; vgoyal@redhat.com

Please include Linux Security Module list <linux-security-module@vger.kernel.org>
and selinux@vger.kernel.org on this topic.

> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> caller has CAP_SYS_RESOURCE
>
> Hi,
>
> In virtiofs, actual file server is virtiofsd daemon running on host.
> There we have a mode where xattrs can be remapped to something else.
> For example security.selinux can be remapped to
> user.virtiofsd.security.selinux on the host.

This would seem to provide a mechanism whereby a user can violate
SELinux policy quite easily.

>
> This remapping is useful when SELinux is enabled in guest and virtiofs
> is being used as rootfs. Guest and host SELinux policy might not match
> and host policy might deny security.selinux xattr setting by guest
> onto host. Or host might have SELinux disabled and in that case to
> be able to set security.selinux xattr, virtiofsd will need to have
> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> guest security.selinux (or other xattrs) on host to something else
> is also better from security point of view.

Can you please provide some rationale for this assertion?
I have been working with security xattrs longer than anyone
and have trouble accepting the statement.

> But when we try this, we noticed that SELinux relabeling in guest
> is failing on some symlinks. When I debugged a little more, I
> came to know that "user.*" xattrs are not allowed on symlinks
> or special files.
>
> "man xattr" seems to suggest that primary reason to disallow is
> that arbitrary users can set unlimited amount of "user.*" xattrs
> on these files and bypass quota check.
>
> If that's the primary reason, I am wondering is it possible to relax
> the restrictions if caller has CAP_SYS_RESOURCE. This capability
> allows caller to bypass quota checks. So it should not be
> a problem at least from quota perspective.
>
> That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> and remap xattrs arbitrarily.

On a Smack system you should require CAP_MAC_ADMIN to remap
security.* xattrs. It sounds like you're in serious danger of running afoul
of LSM attribute policy on a reasonably general level.

>
> Thanks
> Vivek
>
> Vivek Goyal (1):
>   xattr: Allow user.* xattr on symlink/special files with
>     CAP_SYS_RESOURCE
>
>  fs/xattr.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> --
> 2.25.4
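The rule proposed in the cover letter above can be sketched as a small decision function. This is only a toy model of the idea as described in the thread, not the code in fs/xattr.c: the function name and its parameters are hypothetical, and the ownership and permission-bit checks that also apply to user.* xattrs are deliberately ignored.

```python
import stat


def user_xattr_write_allowed(mode: int, has_cap_sys_resource: bool) -> bool:
    """Toy model of the proposed rule for writing user.* xattrs.

    Hypothetical helper for illustration only; it ignores the usual
    ownership and permission-bit checks.
    """
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
        # Regular files and directories may already carry user.* xattrs.
        return True
    # Symlinks and special files: refused today. Under the RFC the write
    # would be allowed when the caller holds CAP_SYS_RESOURCE.
    return has_cap_sys_resource


if __name__ == "__main__":
    symlink_mode = stat.S_IFLNK | 0o777
    print("symlink, no capability:", user_xattr_write_allowed(symlink_mode, False))
    print("symlink, CAP_SYS_RESOURCE:", user_xattr_write_allowed(symlink_mode, True))
```

The capability is passed in as a plain boolean so the sketch stays independent of any particular way of querying process capabilities.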
* Schaufler, Casey (casey.schaufler@intel.com) wrote:
> > In virtiofs, actual file server is virtiofsd daemon running on host.
> > There we have a mode where xattrs can be remapped to something else.
> > For example security.selinux can be remapped to
> > user.virtiofsd.security.selinux on the host.
>
> This would seem to provide a mechanism whereby a user can violate
> SELinux policy quite easily.
>
> > This remapping is useful when SELinux is enabled in guest and virtiofs
> > is being used as rootfs. Guest and host SELinux policy might not match
> > and host policy might deny security.selinux xattr setting by guest
> > onto host. Or host might have SELinux disabled and in that case to
> > be able to set security.selinux xattr, virtiofsd will need to have
> > CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> > guest security.selinux (or other xattrs) on host to something else
> > is also better from security point of view.
>
> Can you please provide some rationale for this assertion?
> I have been working with security xattrs longer than anyone
> and have trouble accepting the statement.

There seem to be a few very different ways of using SELinux in
containers/guests, and many ways of using shared filesystems.
A common request is that we share a host filesystem into the guest (a VM),
and then the guest can do with it whatever it likes, preferably without
making the guest privileged in any way, and with as few privileges as
possible on the daemons running on behalf of the guest ('virtiofsd', which
is a FUSE implementation daemon that runs on the host).

By remapping all guest xattrs to add a "user.virtiofsd." prefix, the guest
can label its filesystem and implement its own SELinux policy, but because
it's using "user." on the host, it can neither bypass nor change the host's
SELinux labelling or policies. (It also means that the guest can set
capabilities and other xattrs, again without confusing the host.)

> > But when we try this, we noticed that SELinux relabeling in guest
> > is failing on some symlinks. When I debugged a little more, I
> > came to know that "user.*" xattrs are not allowed on symlinks
> > or special files.
> >
> > "man xattr" seems to suggest that primary reason to disallow is
> > that arbitrary users can set unlimited amount of "user.*" xattrs
> > on these files and bypass quota check.
> >
> > If that's the primary reason, I am wondering is it possible to relax
> > the restrictions if caller has CAP_SYS_RESOURCE. This capability
> > allows caller to bypass quota checks. So it should not be
> > a problem at least from quota perspective.
> >
> > That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> > and remap xattrs arbitrarily.
>
> On a Smack system you should require CAP_MAC_ADMIN to remap
> security.* xattrs. It sounds like you're in serious danger of running afoul
> of LSM attribute policy on a reasonably general level.

Note that the remapping is done by the userspace daemon running on the host
(and takes parameters saying what remapping is required); as such it's still
bound by whatever LSM policies the host wants; we're just giving the guest
the ability to add its own policies without breaking the host's.
Of course if you want the guest kernel to see the host xattrs then you don't
want the remapping; there are even some cases where you might want to allow
the guest to set those xattrs; but then you really do have to start worrying
about what the guest could do to your filesystem.

The only thing getting in the way of the guest being able to do a full
relabel seems to be the limitation on user.* xattrs on symlinks and
special files.

Dave
On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
> > In virtiofs, actual file server is virtiofsd daemon running on host.
> > There we have a mode where xattrs can be remapped to something else.
> > For example security.selinux can be remapped to
> > user.virtiofsd.security.selinux on the host.
>
> This would seem to provide a mechanism whereby a user can violate
> SELinux policy quite easily.

Hi Casey,

As David already replied, we are not bypassing the host's SELinux policy (if
there is one). We are just trying to provide a mode where host and
guest SELinux policies can co-exist without interfering
with each other.

By remapping the guest's SELinux xattrs (and not the host's SELinux xattrs),
a file will probably have two xattrs:

"security.selinux" and "user.virtiofsd.security.selinux". Host will
enforce SELinux policy based on the security.selinux xattr, and guest
will see the SELinux info stored in "user.virtiofsd.security.selinux"
and guest SELinux policy will enforce rules based on that.
(user.virtiofsd.security.selinux will be remapped to "security.selinux"
when guest does getxattr()).

IOW, this mode is allowing both host and guest SELinux policies to
co-exist and not interfere with each other. (Remapping the guest's
SELinux xattr is not changing the host's SELinux label and is not
bypassing the host's SELinux policy.)
virtiofsd also provides a mode where, if a guest process sets the
SELinux xattr, it shows up as security.selinux on the host. But now we
have multiple issues. There are two SELinux policies (host and guest)
which are operating on the same label. And there is a very good chance
that the two have not been written in such a way that they work with
each other. In fact there does not seem to exist a notion where
two different SELinux policies operate on the same label.

At a high level, this is in a way similar to files created on
virtio-blk devices. Say this device is backed by a foo.img file
on host. Now host SELinux policy will set its own label on
foo.img and provide access control, while labels created by guest
are not seen or controlled by the host's SELinux policy. Only guest
SELinux policy works with those labels.

So this is a similar kind of attempt. Provide isolation between
host and guest SELinux labels so that the two policies can
co-exist and not interfere with each other.

> > This remapping is useful when SELinux is enabled in guest and virtiofs
> > is being used as rootfs. Guest and host SELinux policy might not match
> > and host policy might deny security.selinux xattr setting by guest
> > onto host. Or host might have SELinux disabled and in that case to
> > be able to set security.selinux xattr, virtiofsd will need to have
> > CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> > guest security.selinux (or other xattrs) on host to something else
> > is also better from security point of view.
>
> Can you please provide some rationale for this assertion?
> I have been working with security xattrs longer than anyone
> and have trouble accepting the statement.

If the guest is not able to interfere with or change the host's SELinux
labels directly, it sounded better.

Irrespective of this, my primary concern is to allow a guest
VM to be able to use SELinux seamlessly in diverse host OS
environments (typical of cloud deployments).
And being able to provide a mode where host and guest security labels can
co-exist and policies can work independently should be able
to achieve that goal.

> > But when we try this, we noticed that SELinux relabeling in guest
> > is failing on some symlinks. When I debugged a little more, I
> > came to know that "user.*" xattrs are not allowed on symlinks
> > or special files.
> >
> > "man xattr" seems to suggest that primary reason to disallow is
> > that arbitrary users can set unlimited amount of "user.*" xattrs
> > on these files and bypass quota check.
> >
> > If that's the primary reason, I am wondering is it possible to relax
> > the restrictions if caller has CAP_SYS_RESOURCE. This capability
> > allows caller to bypass quota checks. So it should not be
> > a problem at least from quota perspective.
> >
> > That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> > and remap xattrs arbitrarily.
>
> On a Smack system you should require CAP_MAC_ADMIN to remap
> security.* xattrs. It sounds like you're in serious danger of running afoul
> of LSM attribute policy on a reasonably general level.

I think I did not explain xattr remapping properly and that's why this
confusion is there. Only the guest's xattrs will be remapped and not
the host's xattrs. So one cannot bypass any access control implemented
by any of the LSMs on the host.

Thanks
Vivek
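The two-way mapping described in the message above can be sketched as a pair of helper functions. This is a simplified illustration, not virtiofsd's implementation: the function names are hypothetical, the prefix is the one used as an example in the thread, and hiding unprefixed host xattrs from the guest is an assumption about one possible configuration (the daemon's real mapping is set by an option).

```python
from typing import Optional

# Prefix taken from the example in the thread; the real mapping is configurable.
GUEST_PREFIX = "user.virtiofsd."


def guest_to_host(name: str) -> str:
    """Name under which a guest xattr is stored on the host (hypothetical helper)."""
    return GUEST_PREFIX + name


def host_to_guest(name: str) -> Optional[str]:
    """Name the guest sees for a host xattr, or None if it stays hidden.

    Assumption: host xattrs without the prefix (for example the host's own
    security.selinux) are not exposed to the guest in this mode.
    """
    if name.startswith(GUEST_PREFIX):
        return name[len(GUEST_PREFIX):]
    return None


if __name__ == "__main__":
    stored = guest_to_host("security.selinux")
    print(stored)                             # name written on the host
    print(host_to_guest(stored))              # name returned to the guest
    print(host_to_guest("security.selinux"))  # host label, hidden from guest
```

With this scheme a file can carry both the host's security.selinux and the guest's prefixed copy, and neither side reads or overwrites the other's label.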
On 6/28/21 09:17, Vivek Goyal wrote:
> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
>> On a Smack system you should require CAP_MAC_ADMIN to remap
>> security.* xattrs. It sounds like you're in serious danger of running afoul
>> of LSM attribute policy on a reasonably general level.
> I think I did not explain xattr remapping properly and that's why this
> confusion is there. Only the guest's xattrs will be remapped and not
> the host's xattrs. So one cannot bypass any access control implemented
> by any of the LSMs on the host.
>
> Thanks
> Vivek
>

I want to point out that this solves a couple of other problems also.

Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.

This means if you want to run a container or VM on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands; any label not understood will be blocked. Not only this, but the label that virtiofsd runs under pretty much has to be unconfined, since it could be writing any SELinux label.

If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
On 6/28/2021 6:36 AM, Daniel Walsh wrote:
> I want to point out that this solves a couple of other problems also.

I am not (usually) averse to solving problems. My concern is with
regard to creating new ones.

> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.

Right. Which is as it should be.
Also, s/SELinux/a LSM that uses security xattrs/

> This means if you want to run a container or VM

A container uses the kernel from the host. A VM uses the kernel
from the guest. Unless you're calling a VM a container for
marketing purposes. If this scheme works for non-VM based containers
there's a problem.

> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.

I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
or does it get run as root like most system daemons? If it runs as root the argument
has no legs.

> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands; any label not understood will be blocked. Not only this, but the label that virtiofsd runs under pretty much has to be unconfined, since it could be writing any SELinux label.

You could fix that easily enough by teaching SELinux about the proper
use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
going to happen, and why it would be considered philosophically repugnant
in the SELinux community.

>
> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
>

User xattrs are less protected than security xattrs. You are exposing the
security xattrs on the guest to the possible whims of a malicious, unprivileged
actor on the host. All it needs is the right UID.

We have unused xattr namespaces. Would using the "trusted" namespace
work for your purposes?
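The difference in protection between the namespaces raised above can be summarised in a small lookup. This is an informal paraphrase of the access rules described in xattr(7), written as a sketch; the table and helper names are hypothetical, and the authoritative wording is the man page itself.

```python
# Informal summary of who may write each xattr namespace, paraphrasing
# xattr(7). Hypothetical table for illustration only.
WRITE_REQUIREMENT = {
    "user": "file permission bits (no capability needed)",
    "trusted": "CAP_SYS_ADMIN",
    "security": "LSM policy, or CAP_SYS_ADMIN when no LSM is loaded",
}


def write_requirement(xattr_name: str) -> str:
    """Return the summarised write requirement for an xattr's namespace."""
    namespace = xattr_name.split(".", 1)[0]
    return WRITE_REQUIREMENT.get(namespace, "not covered by this sketch")


if __name__ == "__main__":
    for name in (
        "user.virtiofsd.security.selinux",
        "trusted.virtiofsd.security.selinux",
        "security.selinux",
    ):
        print(f"{name}: {write_requirement(name)}")
```

Read this way, storing guest labels under "user." leaves them writable by any host process with write access to the file, whereas "trusted." would protect them but brings back the CAP_SYS_ADMIN requirement the proposal is trying to avoid.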
* Casey Schaufler (casey@schaufler-ca.com) wrote:
> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
> > I want to point out that this solves a couple of other problems also.
>
> I am not (usually) averse to solving problems. My concern is with
> regard to creating new ones.
>
> > Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
>
> Right. Which is as it should be.
> Also, s/SELinux/a LSM that uses security xattrs/
>
> > This means if you want to run a container or VM
>
> A container uses the kernel from the host. A VM uses the kernel
> from the guest. Unless you're calling a VM a container for
> marketing purposes. If this scheme works for non-VM based containers
> there's a problem.

And 'kata' is its own kernel, but more like a container runtime - would
you like to call this a VM or a container?
There's a whole bunch of variations people are playing around with; I don't
think there's a single answer, or a single way people are trying to use
it.

> > on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
>
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.

It's typically run without CAP_SYS_ADMIN; (although we have other
problems, like wanting to use file handles that make caps tricky).
Some people are trying to run it in user namespaces.
Given that it's pretty complex and playing with lots of file syscalls
under partial control of the guest, giving it as few capabilities
as possible is my preference.

> > If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
>
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community.
>
> >
> > If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, make the container/vm much more secure.
> >
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.

Yep, we realise that; but when you're mainly interested in making sure
the guest can't attack the host, that's less worrying.
It would be lovely if there was something more granular, (e.g. allowing
user.NUMBER. or trusted.NUMBER. to be used by this particular guest).

> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?

For those with CAP_SYS_ADMIN I guess.

Note the virtiofsd takes an option allowing you to set the mapping
however you like, so there's no hard coded user. or trusted. in the
daemon itself.

Dave

>
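To make the remapping discussed above concrete, here is a minimal sketch of the name translation only. It is not virtiofsd's implementation or its option syntax. The prefix `user.virtiofsd.` comes from the example in the original post; the function names and the choice to hide the host's own `security.*` names from the guest are assumptions made for illustration.

```python
# Hypothetical prefix, taken from the example in the original post.
REMAP_PREFIX = "user.virtiofsd."

# Guest-visible namespaces that get remapped (assumed: only "security.").
REMAPPED = ("security.",)


def guest_to_host(name: str) -> str:
    """Translate an xattr name set by the guest into the name stored on the host."""
    if name.startswith(REMAPPED):
        return REMAP_PREFIX + name
    return name


def host_to_guest(name: str):
    """Translate a host xattr name into what the guest sees.

    Returns None for names that belong to the host and are hidden
    from the guest in this sketch (an assumption, not confirmed behaviour).
    """
    if name.startswith(REMAP_PREFIX):
        return name[len(REMAP_PREFIX):]
    if name.startswith(REMAPPED):
        return None  # host's own label, e.g. security.selinux
    return name


if __name__ == "__main__":
    for n in ("security.selinux", "user.comment"):
        print(n, "->", guest_to_host(n))
```

With this shape a file can carry both `security.selinux` (enforced by the host policy) and `user.virtiofsd.security.selinux` (seen by the guest as `security.selinux`), which is the co-existence the thread describes.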
On Mon, Jun 28, 2021 at 09:04:40AM -0700, Casey Schaufler wrote:

[..]
> > on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
>
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.

It runs as root but we give it a set of minimum required capabilities by
default and want to avoid giving it CAP_SYS_ADMIN.

>
> > If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
>
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community.
>
> >
> > If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, make the container/vm much more secure.
> >
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.

One of the security tenets of virtiofs is that this shared directory
should be hidden from unprivileged users. Otherwise the guest can drop
setuid root binaries in the shared directory, and an unprivileged user on
the host can execute them and get control of the host. So an unprivileged
actor on the host having access to the shared directory contents is a
wrong configuration.

>
> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?

That requires giving CAP_SYS_ADMIN to the daemon, and one of the goals is to
give as few capabilities as possible to virtiofsd. In fact people
have been asking for the ability to run virtiofsd unprivileged
as well as to run it inside a user namespace etc.

Anyway, remapping LSM xattrs to "trusted.*" space should work as long
as virtiofsd has CAP_SYS_ADMIN.

Thanks
Vivek
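The capability trade-off running through this exchange can be summarised as a small model. This is a sketch of the rules as stated in the thread, not kernel code: `trusted.*` needs CAP_SYS_ADMIN, `security.*` on a host with no LSM claiming it needs CAP_SYS_ADMIN, and `user.*` is only allowed on regular files and directories unless the RFC's CAP_SYS_RESOURCE relaxation is applied. The function name and the string return values are hypothetical, and ordinary file permission checks are left out.

```python
import stat


def required_capability(name: str, mode: int, proposal: bool = False) -> str:
    """Model which capability a daemon needs to set xattr `name` on a file
    of type `mode`, following the rules described in the thread.

    Returns "none", "CAP_SYS_ADMIN", "CAP_SYS_RESOURCE", or "denied".
    `proposal=True` applies the relaxation suggested in the RFC.
    """
    if name.startswith("trusted."):
        return "CAP_SYS_ADMIN"
    if name.startswith("security."):
        # Assumes a host where no LSM handles this attribute.
        return "CAP_SYS_ADMIN"
    if name.startswith("user."):
        if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
            return "none"
        # Symlinks and special files: refused today, relaxed by the RFC.
        return "CAP_SYS_RESOURCE" if proposal else "denied"
    return "denied"  # other namespaces are outside this sketch


if __name__ == "__main__":
    remapped = "user.virtiofsd.security.selinux"
    print(required_capability(remapped, stat.S_IFLNK))
    print(required_capability(remapped, stat.S_IFLNK, proposal=True))
```

Read this way, the RFC swaps a CAP_SYS_ADMIN requirement for a CAP_SYS_RESOURCE one in the symlink and special-file case, which is the narrower grant the virtiofsd developers are asking for.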
On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote:
> * Casey Schaufler (casey@schaufler-ca.com) wrote:
>> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
[..]
>>> This means if you want to run a container or VM
>> A container uses the kernel from the host. A VM uses the kernel
>> from the guest. Unless you're calling a VM a container for
>> marketing purposes. If this scheme works for non-VM based containers
>> there's a problem.
> And 'kata' is its own kernel, but more like a container runtime - would
> you like to call this a VM or a container?

I would call it a VM.

On the other hand, there has been a concerted effort to ensure that there
is no technical definition of a container. I hope to exploit this for
personal wealth and glory before too long myself. If kata wants to identify
as a container, who am I to say otherwise?

> There's a whole bunch of variations people are playing around with; I don't
> think there's a single answer, or a single way people are trying to use
> it.

Just so.

>>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
>> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
>> or does it get run as root like most system daemons? If it runs as root the argument
>> has no legs.
> It's typically run without CAP_SYS_ADMIN; (although we have other
> problems, like wanting to use file handles that make caps tricky).
> Some people are trying to run it in user namespaces.
> Given that it's pretty complex and playing with lots of file syscalls
> under partial control of the guest, giving it as few capabilities
> as possible is my preference.

It would be mine as well. I expect/fear that many developers find
capabilities too complicated to work with and drop back to good old
fashioned root. The whole rationale for user namespaces seems to be
that it makes running as root in the namespace "safe".

>>> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
>> You could fix that easily enough by teaching SELinux about the proper
>> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
>> going to happen, and why it would be considered philosophically repugnant
>> in the SELinux community.
>>
>>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, make the container/vm much more secure.
>>>
>> User xattrs are less protected than security xattrs. You are exposing the
>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>> actor on the host. All it needs is the right UID.
> Yep, we realise that; but when you're mainly interested in making sure
> the guest can't attack the host, that's less worrying.

That's uncomfortable.

> It would be lovely if there was something more granular, (e.g. allowing
> user.NUMBER. or trusted.NUMBER. to be used by this particular guest).

We can't do that without breaking the "kernels aren't container aware"
mandate. I suppose that if someone wanted to implement xattr namespaces
(like user namespaces, not just the prefix) you could get away with that.
Namespaces for everything. :)

>> We have unused xattr namespaces. Would using the "trusted" namespace
>> work for your purposes?
> For those with CAP_SYS_ADMIN I guess.
>
> Note the virtiofsd takes an option allowing you to set the mapping
> however you like, so there's no hard coded user. or trusted. in the
> daemon itself.
>
> Dave
>
On 6/28/21 12:04, Casey Schaufler wrote:
> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
[..]
>> This means if you want to run a container or VM
> A container uses the kernel from the host. A VM uses the kernel
> from the guest. Unless you're calling a VM a container for
> marketing purposes. If this scheme works for non-VM based containers
> there's a problem.

That is your definition of a container. Our definition includes container workloads within kvm separation along with their own kernels (Kata and libkrun), as opposed to VM workloads which run full operating system workloads including systemd, logging, cron, sshd ...

>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.

I believe it should almost always get run with limited privileges; we are opening a hole from the kvm separated workload into the host. If there is a bug in virtiofsd, it can attack the host.

>> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community.

Sure, but this ignores the more important next comment.

>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, make the container/vm much more secure.
>>
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.
>
> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?
>

No, because they bring their own issues and cannot be used without CAP_SYS_ADMIN.

My number one concern is attacks from the kvm separated workspace against the host, since virtiofsd is opening up the attack vector. Running it with the least privs possible from the MAC and DAC point of view is the goal.
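For readers who want to see the restriction that started this thread, the following is a small userspace probe, offered as a sketch only. It tries to set a `user.*` xattr directly on a symlink using Python's `os.setxattr` with `follow_symlinks=False` and reports the resulting errno name. The helper name and the attribute value are made up for illustration. On Linux the call is expected to be refused, since xattr(7) limits user attributes to regular files and directories, but the exact result depends on the kernel and filesystem, so none is asserted here.

```python
import errno
import os
import tempfile


def probe_user_xattr_on_symlink() -> str:
    """Try to set a user.* xattr on a symlink itself and report what happens.

    Returns "ok", "unsupported" (no os.setxattr on this platform), or the
    symbolic errno name of the failure.
    """
    if not hasattr(os, "setxattr"):
        return "unsupported"
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        open(target, "w").close()
        os.symlink(target, link)
        try:
            # follow_symlinks=False operates on the link, like lsetxattr(2).
            os.setxattr(link, "user.virtiofsd.security.selinux",
                        b"demo", follow_symlinks=False)
        except OSError as e:
            return errno.errorcode.get(e.errno, str(e.errno))
        return "ok"


if __name__ == "__main__":
    print(probe_user_xattr_on_symlink())
```

Running the same probe against the regular file instead of the link is a quick way to confirm that the refusal is specific to the file type rather than to the filesystem.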
* Casey Schaufler (casey@schaufler-ca.com) wrote:
> On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote:
[..]
> >>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE.
> >> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> >> or does it get run as root like most system daemons? If it runs as root the argument
> >> has no legs.
> > It's typically run without CAP_SYS_ADMIN; (although we have other
> > problems, like wanting to use file handles that make caps tricky).
> > Some people are trying to run it in user namespaces.
> > Given that it's pretty complex and playing with lots of file syscalls
> > under partial control of the guest, giving it as few capabilities
> > as possible is my preference.
>
> It would be mine as well. I expect/fear that many developers find
> capabilities too complicated to work with and drop back to good old
> fashioned root. The whole rationale for user namespaces seems to be
> that it makes running as root in the namespace "safe".

We're trying to be good with capabilities, basically locking it down
until we trip over one of them and then think about it and enable it
where appropriate; the difficulty is that capabilities are only a bit
better than root; they're still fairly coarse-grained - like in this case
where you're pushed towards a wide ranging CAP even though you only want
to give the user a trivial extra thing.

(We have a similar problem wanting to allow separate threads to be in
separate directories, but that requires unshare and that requires
another capability)

> >>> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> >> You could fix that easily enough by teaching SELinux about the proper
> >> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> >> going to happen, and why it would be considered philosophically repugnant
> >> in the SELinux community.
> >>
> >>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, make the container/vm much more secure.
> >>>
> >> User xattrs are less protected than security xattrs. You are exposing the
> >> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >> actor on the host. All it needs is the right UID.
> > Yep, we realise that; but when you're mainly interested in making sure
> > the guest can't attack the host, that's less worrying.
>
> That's uncomfortable.

Why exactly?
IMHO the biggest problem is it's badly defined when you want to actually
share filesystems between guests or between guests and the host.

> > It would be lovely if there was something more granular, (e.g. allowing
> > user.NUMBER. or trusted.NUMBER. to be used by this particular guest).
>
> We can't do that without breaking the "kernels aren't container aware"
> mandate. I suppose that if someone wanted to implement xattr namespaces
> (like user namespaces, not just the prefix) you could get away with that.
> Namespaces for everything. :)

Right, it's namespaces that we've used in most places to give ourselves
the isolation. I doubt we're the only case that wants a way to do xattr
separation; you get lots of weird cases where it pops up (e.g. stacked
overlayfs)

Dave

> >> We have unused xattr namespaces. Would using the "trusted" namespace
> >> work for your purposes?
> > For those with CAP_SYS_ADMIN I guess.
> >
> > Note the virtiofsd takes an option allowing you to set the mapping
> > however you like, so there's no hard coded user. or trusted. in the
> > daemon itself.
> >
> > Dave
> >
On 6/29/2021 2:00 AM, Dr. David Alan Gilbert wrote: > * Casey Schaufler (casey@schaufler-ca.com) wrote: >> On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote: >>> * Casey Schaufler (casey@schaufler-ca.com) wrote: >>>> On 6/28/2021 6:36 AM, Daniel Walsh wrote: >>>>> On 6/28/21 09:17, Vivek Goyal wrote: >>>>>> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote: >>>>>>>> -----Original Message----- >>>>>>>> From: Vivek Goyal <vgoyal@redhat.com> >>>>>>>> Sent: Friday, June 25, 2021 12:12 PM >>>>>>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; >>>>>>>> viro@zeniv.linux.org.uk >>>>>>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com; >>>>>>>> berrange@redhat.com; vgoyal@redhat.com >>>>>>> Please include Linux Security Module list <linux-security-module@vger.kernel.org> >>>>>>> and selinux@vger.kernel.org on this topic. >>>>>>> >>>>>>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if >>>>>>>> caller has CAP_SYS_RESOURCE >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> In virtiofs, actual file server is virtiosd daemon running on host. >>>>>>>> There we have a mode where xattrs can be remapped to something else. >>>>>>>> For example security.selinux can be remapped to >>>>>>>> user.virtiofsd.securit.selinux on the host. >>>>>>> This would seem to provide mechanism whereby a user can violate >>>>>>> SELinux policy quite easily. >>>>>> Hi Casey, >>>>>> >>>>>> As david already replied, we are not bypassing host's SELinux policy (if >>>>>> there is one). We are just trying to provide a mode where host and >>>>>> guest's SELinux policies could co-exist without interefering >>>>>> with each other. >>>>>> >>>>>> By remappming guests SELinux xattrs (and not host's SELinux xattrs), >>>>>> a file probably will have two xattrs >>>>>> >>>>>> "security.selinux" and "user.virtiofsd.security.selinux". 
Host will >>>>>> enforce SELinux policy based on security.selinux xattr and guest >>>>>> will see the SELinux info stored in "user.virtiofsd.security.selinux" >>>>>> and guest SELinux policy will enforce rules based on that. >>>>>> (user.virtiofsd.security.selinux will be remapped to "security.selinux" >>>>>> when guest does getxattr()). >>>>>> >>>>>> IOW, this mode is allowing both host and guest SELinux policies to >>>>>> co-exist and not interefere with each other. (Remapping guests's >>>>>> SELinux xattr is not changing hosts's SELinux label and is not >>>>>> bypassing host's SELinux policy). >>>>>> >>>>>> virtiofsd also provides for the mode where if guest process sets >>>>>> SELinux xattr it shows up as security.selinux on host. But now we >>>>>> have multiple issues. There are two SELinux policies (host and guest) >>>>>> which are operating on same lable. And there is a very good chance >>>>>> that two have not been written in such a way that they work with >>>>>> each other. In fact there does not seem to exist a notion where >>>>>> two different SELinux policies are operating on same label. >>>>>> >>>>>> At high level, this is in a way similar to files created on >>>>>> virtio-blk devices. Say this device is backed by a foo.img file >>>>>> on host. Now host selinux policy will set its own label on >>>>>> foo.img and provide access control while labels created by guest >>>>>> are not seen or controlled by host's SELinux policy. Only guest >>>>>> SELinux policy works with those labels. >>>>>> >>>>>> So this is similar kind of attempt. Provide isolation between >>>>>> host and guests's SELinux labels so that two policies can >>>>>> co-exist and not interfere with each other. >>>>>> >>>>>>>> This remapping is useful when SELinux is enabled in guest and virtiofs >>>>>>>> as being used as rootfs. Guest and host SELinux policy might not match >>>>>>>> and host policy might deny security.selinux xattr setting by guest >>>>>>>> onto host. 
Or host might have SELinux disabled and in that case to >>>>>>>> be able to set security.selinux xattr, virtiofsd will need to have >>>>>>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap >>>>>>>> guest security.selinux (or other xattrs) on host to something else >>>>>>>> is also better from security point of view. >>>>>>> Can you please provide some rationale for this assertion? >>>>>>> I have been working with security xattrs longer than anyone >>>>>>> and have trouble accepting the statement. >>>>>> If guest is not able to interfere or change host's SELinux labels >>>>>> directly, it sounded better. >>>>>> >>>>>> Irrespective of this, my primary concern is that to allow guest >>>>>> VM to be able to use SELinux seamlessly in diverse host OS >>>>>> environments (typical of cloud deployments). And being able to >>>>>> provide a mode where host and guest's security labels can >>>>>> co-exist and policies can work independently, should be able >>>>>> to achieve that goal. >>>>>> >>>>>>>> But when we try this, we noticed that SELinux relabeling in guest >>>>>>>> is failing on some symlinks. When I debugged a little more, I >>>>>>>> came to know that "user.*" xattrs are not allowed on symlinks >>>>>>>> or special files. >>>>>>>> >>>>>>>> "man xattr" seems to suggest that primary reason to disallow is >>>>>>>> that arbitrary users can set unlimited amount of "user.*" xattrs >>>>>>>> on these files and bypass quota check. >>>>>>>> >>>>>>>> If that's the primary reason, I am wondering is it possible to relax >>>>>>>> the restrictions if caller has CAP_SYS_RESOURCE. This capability >>>>>>>> allows caller to bypass quota checks. So it should not be >>>>>>>> a problem atleast from quota perpective. >>>>>>>> >>>>>>>> That will allow me to give CAP_SYS_RESOURCE to virtiofs deamon >>>>>>>> and remap xattrs arbitrarily. >>>>>>> On a Smack system you should require CAP_MAC_ADMIN to remap >>>>>>> security. xattrs. 
I sounds like you're in serious danger of running afoul >>>>>>> of LSM attribute policy on a reasonable general level. >>>>>> I think I did not explain xattr remapping properly and that's why this >>>>>> confusion is there. Only guests's xattrs will be remapped and not >>>>>> hosts's xattr. So one can not bypass any access control implemented >>>>>> by any of the LSM on host. >>>>>> >>>>>> Thanks >>>>>> Vivek >>>>>> >>>>> I want to point out that this solves a couple of other problems also. >>>> I am not (usually) adverse to solving problems. My concern is with >>>> regard to creating new ones. >>>> >>>>> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN. >>>> Right. Which is as it should be. >>>> Also, s/SELinux/a LSM that uses security xattrs/ >>>> >>>>> This means if you want to run a container or VM >>>> A container uses the kernel from the host. A VM uses the kernel >>>> from the guest. Unless you're calling a VM a container for >>>> marketing purposes. If this scheme works for non-VM based containers >>>> there's a problem. >>> And 'kata' is it's own kernel, but more like a container runtime - would >>> you like to call this a VM or a container? >> I would call it a VM. >> >> On the other hand, there has been a concerted effort to ensure that there >> is no technical definition of a container. I hope to exploit this for >> personal wealth and glory before too long myself. If kata wants to identify >> as a container, who am I to say otherwise? >> >>> There's whole bunch of variations people are playing around with; I don't >>> think there's a single answer, or a single way people are trying to use >>> it. >> Just so. >> >>>>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only needed CAP_SYS_RESOURCE. >>>> I don't know, so I'm asking. 
Does virtiofsd really get run with limited capabilities, >>>> or does it get run as root like most system daemons? If it runs as root the argument >>>> has no legs. >>> It's typically run without CAP_SYS_ADMIN; (although we have other >>> problems, like wanting to use file handles that make caps tricky). >>> Some people are trying to run it in user namespaces. >>> Given that it's pretty complex and playing with lots of file syscalls >>> under partial control of the guest, giving it as few capabilities >>> as possible is my preference. >> It would be mine as well. I expect/fear that many developers find >> capabilities too complicated to work with and drop back to good old >> fashioned root. The whole rationale for user namespaces seems to be >> that it makes running as root in the namespace "safe". > We're trying to be good with capabilities, basically locking it down > until we trip over one of them and then think about it and enable it > where appropriate; the difficulty is that capabilities are only a bit > better than root; they're still fairly granular - like in this case > where you're pushed towards a wide ranging CAP even though you only > want to give the user a trivial extra thing. > (We have a similar problem wanting to allow separate threads to > be in separate directories, but that requires unshare and that requires > another capability) Thank you for putting in the effort. The primary value of capabilities has always been the disassociation of privilege from the root UID. The granularity has always been contentious. One UNIX system went the fine granularity route and ended up with 330. Last I looked you'd need several hundred to give everyone who wants their own special problem solved. I admit that a solution for the granularity issue would be grand. I think we're looking at something simpler than capabilities to achieve that, but we'll see. 
>>>>> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label. >>>> You could fix that easily enough by teaching SELinux about the proper >>>> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's >>>> going to happen, and why it would be considered philosophically repugnant >>>> in the SELinux community. >>>> >>>>> If virtiofsd is writing Userxattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to sexattr on the content in the designated directory, make the container/vm much more secure. >>>>> >>>> User xattrs are less protected than security xattrs. You are exposing the >>>> security xattrs on the guest to the possible whims of a malicious, unprivileged >>>> actor on the host. All it needs is the right UID. >>> Yep, we realise that; but when you're mainly interested in making sure >>> the guest can't attack the host, that's less worrying. >> That's uncomfortable. > Why exactly? If a mechanism is designed with a known vulnerability you fail your validation/evaluation efforts. Your mechanism is less general because other potential use cases may not be as cavalier about the vulnerability. I think that you can approach this differently, get a solution that does everything you want, and avoid the known problem. > IMHO the biggest problem is it's badly defined when you want to actually > share filesystems between guests or between guests and the host. Right. The filesystem isn't the right layer for mapping xattrs. >>> It would be lovely if there was something more granular, (e.g. allowing >>> user.NUMBER. or trusted.NUMBER. to be used by this particular guest). 
>> We can't do that without breaking the "kernels aren't container aware" >> mandate. I suppose that if someone wanted to implement xattr namespaces >> (like user namespaces, not just the prefix) you could get away with that. >> Namespaces for everything. :) > Right, it's namespaces that we've used in most places to give ourselves > the isolation. > > I doubt we're the only case that wants a way to do xattr separation; you > get lots of weird cases where it pops up (e.g. stacked overlayfs) I can't say that I'm a major fan of namespace proliferation, (time namespaces? really?) but what you've outlined is a filesystem-specific implementation of xattr namespaces. We've looked into similar mechanisms for LSM-specific namespaces. When you see multiple use-case-specific implementations of the same thing it's time to consider a general solution. > > Dave > >>>> We have unused xattr namespaces. Would using the "trusted" namespace >>>> work for your purposes? >>> For those with CAP_SYS_ADMIN I guess. >>> >>> Note the virtiofsd takes an option allowing you to set the mapping >>> however you like, so there's no hard coded user. or trusted. in the >>> daemon itself. >>> >>> Dave >>>
On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:

[..]
> >>>> User xattrs are less protected than security xattrs. You are exposing the
> >>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >>>> actor on the host. All it needs is the right UID.
> >>> Yep, we realise that; but when you're mainly interested in making sure
> >>> the guest can't attack the host, that's less worrying.
> >> That's uncomfortable.
> > Why exactly?
>
> If a mechanism is designed with a known vulnerability you
> fail your validation/evaluation efforts.

We are working with the constraint that the shared directory should not be
accessible to unprivileged users on the host. With that constraint, what
you are referring to is not a vulnerability.

> Your mechanism is
> less general because other potential use cases may not be
> as cavalier about the vulnerability.

Prefixing xattrs with "user.virtiofsd" is just one of the options.
virtiofsd is able to prefix them with "trusted.virtiofsd" as well.
We have not chosen that because we don't want to give it CAP_SYS_ADMIN.

So other use cases which don't like the "user.virtiofsd" prefix can
give virtiofsd CAP_SYS_ADMIN and use "trusted.virtiofsd" instead.

> I think that you can
> approach this differently, get a solution that does everything
> you want, and avoid the known problem.

What's the solution? Are you referring to using "trusted.*" instead? But
that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.

Thanks
Vivek
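The prefix-based remapping discussed in this message can be sketched roughly as follows. This is only an illustration of the naming scheme, not virtiofsd's actual code: the function names `to_host_name` and `to_guest_name`, the constant `HOST_PREFIX`, and the rule that every guest xattr gets the same prefix are assumptions made for the example. The thread only says that the daemon takes a configurable mapping option.

```python
# Hypothetical sketch of prefix-based xattr remapping; not virtiofsd code.
# Assumption: every guest xattr name is stored on the host under one prefix.
HOST_PREFIX = "user.virtiofsd."  # "trusted.virtiofsd." would need CAP_SYS_ADMIN


def to_host_name(guest_name, prefix=HOST_PREFIX):
    """Return the name under which a guest xattr is stored on the host."""
    return prefix + guest_name


def to_guest_name(host_name, prefix=HOST_PREFIX):
    """Return the name the guest sees, or None if the xattr stays hidden."""
    if host_name.startswith(prefix):
        return host_name[len(prefix):]
    # The host's own xattrs (e.g. its real security.selinux) are not
    # exposed to the guest in this sketch.
    return None
```

With this scheme a file on the host can carry both "security.selinux" (enforced by the host policy) and "user.virtiofsd.security.selinux" (what the guest reads back as "security.selinux").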
On 6/29/2021 8:20 AM, Vivek Goyal wrote: > On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote: > > [..] >>>>>> User xattrs are less protected than security xattrs. You are exposing the >>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged >>>>>> actor on the host. All it needs is the right UID. >>>>> Yep, we realise that; but when you're mainly interested in making sure >>>>> the guest can't attack the host, that's less worrying. >>>> That's uncomfortable. >>> Why exactly? >> If a mechanism is designed with a known vulnerability you >> fail your validation/evaluation efforts. > We are working with the constraint that shared directory should not be > accessible to unpriviliged users on host. And with that constraint, what > you are referring to is not a vulnerability. Sure, that's quite reasonable for your use case. It doesn't mean that the vulnerability doesn't exist, it means you've mitigated it. >> Your mechanism is >> less general because other potential use cases may not be >> as cavalier about the vulnerability. > Prefixing xattrs with "user.virtiofsd" is just one of the options. > virtiofsd has the capability to prefix "trusted.virtiofsd" as well. > We have not chosen that because we don't want to give it CAP_SYS_ADMIN. > > So other use cases which don't like prefixing "user.virtiofsd", can > give CAP_SYS_ADMIN and work with it. > >> I think that you can >> approach this differently, get a solution that does everything >> you want, and avoid the known problem. > What's the solution? Are you referring to using "trusted.*" instead? But > that has its own problem of giving CAP_SYS_ADMIN to virtiofsd. I'm coming to the conclusion that xattr namespaces, analogous to user namespaces, are the correct solution. They generalize for multiple filesystem and LSM use cases. The use of namespaces is well understood, especially in the container community. It looks to me as if it would address your use case swimmingly. 
> > Thanks > Vivek >
On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
> > IMHO the biggest problem is it's badly defined when you want to actually
> > share filesystems between guests or between guests and the host.
>
> Right. The filesystem isn't the right layer for mapping xattrs.

Well, let's enumerate the alternatives:

* Some kind of stackable LSM?
* Some kind of FUSE-like scheme?
* Adding an eBPF hook which can perform the mapping

The last may be the best bet, since different use cases can use
different eBPF programs. The eBPF script can handle both the mapping as
well as some kind of specialized access control with respect to which
entities are allowed to set or get xattrs.

> >>> It would be lovely if there was something more granular, (e.g. allowing
> >>> user.NUMBER. or trusted.NUMBER. to be used by this particular guest).
> >> We can't do that without breaking the "kernels aren't container aware"
> >> mandate.

eBPF scripts, since they are supplied by the user, *can* be container
aware. :-)

- Ted
* Casey Schaufler (casey@schaufler-ca.com) wrote: > On 6/29/2021 8:20 AM, Vivek Goyal wrote: > > On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote: > > > > [..] > >>>>>> User xattrs are less protected than security xattrs. You are exposing the > >>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged > >>>>>> actor on the host. All it needs is the right UID. > >>>>> Yep, we realise that; but when you're mainly interested in making sure > >>>>> the guest can't attack the host, that's less worrying. > >>>> That's uncomfortable. > >>> Why exactly? > >> If a mechanism is designed with a known vulnerability you > >> fail your validation/evaluation efforts. > > We are working with the constraint that shared directory should not be > > accessible to unpriviliged users on host. And with that constraint, what > > you are referring to is not a vulnerability. > > Sure, that's quite reasonable for your use case. It doesn't mean > that the vulnerability doesn't exist, it means you've mitigated it. > > > >> Your mechanism is > >> less general because other potential use cases may not be > >> as cavalier about the vulnerability. > > Prefixing xattrs with "user.virtiofsd" is just one of the options. > > virtiofsd has the capability to prefix "trusted.virtiofsd" as well. > > We have not chosen that because we don't want to give it CAP_SYS_ADMIN. > > > > So other use cases which don't like prefixing "user.virtiofsd", can > > give CAP_SYS_ADMIN and work with it. > > > >> I think that you can > >> approach this differently, get a solution that does everything > >> you want, and avoid the known problem. > > What's the solution? Are you referring to using "trusted.*" instead? But > > that has its own problem of giving CAP_SYS_ADMIN to virtiofsd. > > I'm coming to the conclusion that xattr namespaces, analogous > to user namespaces, are the correct solution. They generalize > for multiple filesystem and LSM use cases. 
The use of namespaces > is well understood, especially in the container community. It > looks to me as if it would address your use case swimmingly. Yeh; although getting the details of the semantics right is tricky; in particular, the stuff which clears capabilities/setuid/etc on writes - should it clear xattrs that represent capabilities? If the host performs a write, should it clear mapped xattr capabilities? If the namespace performs a write should it clear just the mapped ones or the host ones as well? Our virtiofsd code performs painful acrobatics to make sure they get cleared on write. Dave > > > > Thanks > > Vivek > > >
On 6/29/2021 9:35 AM, Dr. David Alan Gilbert wrote: > * Casey Schaufler (casey@schaufler-ca.com) wrote: >> On 6/29/2021 8:20 AM, Vivek Goyal wrote: >>> On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote: >>> >>> [..] >>>>>>>> User xattrs are less protected than security xattrs. You are exposing the >>>>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged >>>>>>>> actor on the host. All it needs is the right UID. >>>>>>> Yep, we realise that; but when you're mainly interested in making sure >>>>>>> the guest can't attack the host, that's less worrying. >>>>>> That's uncomfortable. >>>>> Why exactly? >>>> If a mechanism is designed with a known vulnerability you >>>> fail your validation/evaluation efforts. >>> We are working with the constraint that shared directory should not be >>> accessible to unpriviliged users on host. And with that constraint, what >>> you are referring to is not a vulnerability. >> Sure, that's quite reasonable for your use case. It doesn't mean >> that the vulnerability doesn't exist, it means you've mitigated it. >> >> >>>> Your mechanism is >>>> less general because other potential use cases may not be >>>> as cavalier about the vulnerability. >>> Prefixing xattrs with "user.virtiofsd" is just one of the options. >>> virtiofsd has the capability to prefix "trusted.virtiofsd" as well. >>> We have not chosen that because we don't want to give it CAP_SYS_ADMIN. >>> >>> So other use cases which don't like prefixing "user.virtiofsd", can >>> give CAP_SYS_ADMIN and work with it. >>> >>>> I think that you can >>>> approach this differently, get a solution that does everything >>>> you want, and avoid the known problem. >>> What's the solution? Are you referring to using "trusted.*" instead? But >>> that has its own problem of giving CAP_SYS_ADMIN to virtiofsd. >> I'm coming to the conclusion that xattr namespaces, analogous >> to user namespaces, are the correct solution. 
They generalize >> for multiple filesystem and LSM use cases. The use of namespaces >> is well understood, especially in the container community. It >> looks to me as if it would address your use case swimmingly. > Yeh; although the details of getting the semantics right is tricky; > in particular, the stuff which clears capabilitiies/setuid/etc on writes > - should it clear xattrs that represent capabilities? If the host > performs a write, should it clear mapped xattrs capabilities? If the > namespace performs a write should it clear just the mapped ones or the > host ones as well? Our virtiofsd code performs acrobatics to make > sure they get cleared on write that are painful. Dealing with tricky semantics is the difference between a feature and a hack. Doing so in a way that other people can take advantage of the feature is the hallmark of a feature well done. > > Dave > >>> Thanks >>> Vivek >>>
On Tue, Jun 29, 2021 at 09:13:48AM -0700, Casey Schaufler wrote: > On 6/29/2021 8:20 AM, Vivek Goyal wrote: > > On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote: > > > > [..] > >>>>>> User xattrs are less protected than security xattrs. You are exposing the > >>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged > >>>>>> actor on the host. All it needs is the right UID. > >>>>> Yep, we realise that; but when you're mainly interested in making sure > >>>>> the guest can't attack the host, that's less worrying. > >>>> That's uncomfortable. > >>> Why exactly? > >> If a mechanism is designed with a known vulnerability you > >> fail your validation/evaluation efforts. > > We are working with the constraint that shared directory should not be > > accessible to unpriviliged users on host. And with that constraint, what > > you are referring to is not a vulnerability. > > Sure, that's quite reasonable for your use case. It doesn't mean > that the vulnerability doesn't exist, it means you've mitigated it. > > > >> Your mechanism is > >> less general because other potential use cases may not be > >> as cavalier about the vulnerability. > > Prefixing xattrs with "user.virtiofsd" is just one of the options. > > virtiofsd has the capability to prefix "trusted.virtiofsd" as well. > > We have not chosen that because we don't want to give it CAP_SYS_ADMIN. > > > > So other use cases which don't like prefixing "user.virtiofsd", can > > give CAP_SYS_ADMIN and work with it. > > > >> I think that you can > >> approach this differently, get a solution that does everything > >> you want, and avoid the known problem. > > What's the solution? Are you referring to using "trusted.*" instead? But > > that has its own problem of giving CAP_SYS_ADMIN to virtiofsd. > > I'm coming to the conclusion that xattr namespaces, analogous > to user namespaces, are the correct solution. They generalize > for multiple filesystem and LSM use cases. 
The use of namespaces > is well understood, especially in the container community. It > looks to me as if it would address your use case swimmingly. Even if xattrs were namespaced, I am not sure it solves the issue of an unprivileged UID being able to modify the security xattrs of a file. If it happens to be the correct UID, it should be able to spin up a user namespace and modify namespaced xattrs? Anyway, once namespaced xattrs are available, I will gladly make use of them. But that probably should not be a blocker for this patch. Vivek
On 6/29/21 13:35, Vivek Goyal wrote: > On Tue, Jun 29, 2021 at 09:13:48AM -0700, Casey Schaufler wrote: >> On 6/29/2021 8:20 AM, Vivek Goyal wrote: >>> On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote: >>> >>> [..] >>>>>>>> User xattrs are less protected than security xattrs. You are exposing the >>>>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged >>>>>>>> actor on the host. All it needs is the right UID. >>>>>>> Yep, we realise that; but when you're mainly interested in making sure >>>>>>> the guest can't attack the host, that's less worrying. >>>>>> That's uncomfortable. >>>>> Why exactly? >>>> If a mechanism is designed with a known vulnerability you >>>> fail your validation/evaluation efforts. >>> We are working with the constraint that shared directory should not be >>> accessible to unpriviliged users on host. And with that constraint, what >>> you are referring to is not a vulnerability. >> Sure, that's quite reasonable for your use case. It doesn't mean >> that the vulnerability doesn't exist, it means you've mitigated it. >> >> >>>> Your mechanism is >>>> less general because other potential use cases may not be >>>> as cavalier about the vulnerability. >>> Prefixing xattrs with "user.virtiofsd" is just one of the options. >>> virtiofsd has the capability to prefix "trusted.virtiofsd" as well. >>> We have not chosen that because we don't want to give it CAP_SYS_ADMIN. >>> >>> So other use cases which don't like prefixing "user.virtiofsd", can >>> give CAP_SYS_ADMIN and work with it. >>> >>>> I think that you can >>>> approach this differently, get a solution that does everything >>>> you want, and avoid the known problem. >>> What's the solution? Are you referring to using "trusted.*" instead? But >>> that has its own problem of giving CAP_SYS_ADMIN to virtiofsd. >> I'm coming to the conclusion that xattr namespaces, analogous >> to user namespaces, are the correct solution. 
They generalize >> for multiple filesystem and LSM use cases. The use of namespaces >> is well understood, especially in the container community. It >> looks to me as if it would address your use case swimmingly. > Even if xattrs were namespaced, I am not sure it solves the issue > of unpriviliged UID being able to modify security xattrs of file. > If it happens to be correct UID, it should be able to spin up a > user namespace and modify namespaced xattrs? > > Anyway, once namespaced xattrs are available, I will gladly make use > of it. But that probably should not be a blocker for this patch. > > Vivek > All this conversation is great, and I look forward to a better solution, but if we go back to the patch, it was to fix an issue where the kernel is requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other special files. The documented reason for this is to prevent the users from using XATTRS to avoid quota. The CAP_SYS_RESOURCE capability is defined to allow processes with this capability to ignore quota. This patch allows processes with CAP_SYS_RESOURCE to create user Xattrs on those files. To me this makes sense. Is there any argument against this?
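The rule being proposed can be modelled in a few lines. This is a simplified sketch of the decision as described in this thread, not the kernel's actual permission code: the function name `user_xattr_write_allowed` and its parameters are hypothetical, and the model ignores ordinary write permission, ownership, and the sticky-bit check on directories.

```python
import stat


def user_xattr_write_allowed(mode, has_cap_sys_resource, patched=True):
    """Model of whether a "user.*" xattr may be set on a file of this mode.

    Assumption: only the file type and CAP_SYS_RESOURCE matter here.
    """
    # Regular files and directories may always carry user.* xattrs.
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
        return True
    # Symlinks and special files: refused today; with the proposed patch
    # a caller holding CAP_SYS_RESOURCE would be let through.
    return patched and has_cap_sys_resource
```

For example, a symlink mode (`stat.S_IFLNK | 0o777`) is refused without the capability and accepted with it, while passing `patched=False` models the current behaviour where the capability makes no difference.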
On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > All this conversation is great, and I look forward to a better solution, but > if we go back to the patch, it was to fix an issue where the kernel is > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > special files. > > The documented reason for this is to prevent the users from using XATTRS to > avoid quota. Huh? Where is it so documented? How file systems store and account for space used by extended attributes is a file-system specific question, but presumably any way that xattr's on regular files are accounted could also be used for xattr's on special files. Also, xattr's are limited to 32k, so it's not like users can evade _that_ much quota space, at least not without it being pretty painful. (Assuming that quota is even enabled, which most of the time, it isn't.) - Ted P.S. I'll note that if ext4's ea_in_inode is enabled, for large xattr's, if you have 2 million files that all have the same 12k windows SID stored as an xattr, ext4 will store that xattr only once. Those two million files might be owned by different uids, so we made an explicit design choice not to worry about accounting for the quota for said 12k xattr value. After all, if you can save the space and access cost of 2M * 12k if each file had to store its own copy of that xattr, perhaps not including it in the quota calculation isn't that bad. :-) We also don't account for the disk space used by symbolic links (since sometimes they can be stored in the inode as fast symlinks, and sometimes they might consume a data block). But again, that's a file system specific implementation question.
* Theodore Ts'o (tytso@mit.edu) wrote: > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > > All this conversation is great, and I look forward to a better solution, but > > if we go back to the patch, it was to fix an issue where the kernel is > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > > special files. > > > > The documented reason for this is to prevent the users from using XATTRS to > > avoid quota. > > Huh? Where is it so documented? man xattr(7): The file permission bits of regular files and directories are interpreted differently from the file permission bits of special files and symbolic links. For regular files and directories the file permission bits define ac‐ cess to the file's contents, while for device special files they define access to the device described by the special file. The file permissions of symbolic links are not used in access checks. *** These differences would al‐ low users to consume filesystem resources in a way not controllable by disk quotas for group or world writable special files and directories.**** ***For this reason, user extended attributes are allowed only for regular files and directories ***, and access to user extended attributes is restricted to the owner and to users with appropriate capabilities for directories with the sticky bit set (see the chmod(1) manual page for an explanation of the sticky bit). (***'s my addition) Dave > How file systems store and account > for space used by extended attributes is a file-system specific > question, but presumably any way that xattr's on regular files are > accounted could also be used for xattr's on special files. > > Also, xattr's are limited to 32k, so it's not like users can evade > _that_ much quota space, at least not without it being pretty painful. > (Assuming that quota is even enabled, which most of the time, it > isn't.) > > - Ted > > P.S. 
I'll note that if ext4's ea_in_inode is enabled, for large > xattr's, if you have 2 million files that all have the same 12k > windows SID stored as an xattr, ext4 will store that xattr only once. > Those two million files might be owned by different uids, so we made > an explicit design choice not to worry about accounting for the quota > for said 12k xattr value. After all, if you can save the space and > access cost of 2M * 12k if each file had to store its own copy of that > xattr, perhaps not including it in the quota calculation isn't that > bad. :-) > > We also don't account for the disk space used by symbolic links (since > sometimes they can be stored in the inode as fast symlinks, and > sometimes they might consume a data block). But again, that's a file > system specific implementation question. >
On Wed, Jun 30, 2021 at 12:12:28AM -0400, Theodore Ts'o wrote: > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > > All this conversation is great, and I look forward to a better solution, but > > if we go back to the patch, it was to fix an issue where the kernel is > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > > special files. > > > > The documented reason for this is to prevent the users from using XATTRS to > > avoid quota. > > Huh? Where is it so documented? Its in "man xattr". David already copied pasted the relevant section in another email, so I am not doing it. > How file systems store and account > for space used by extended attributes is a file-system specific > question, > but presumably any way that xattr's on regular files are > accounted could also be used for xattr's on special files. That will be nice. I don't know enough about quota, but I am wondering why quota limits can't be enforced (if needed) for symlinks and special file xattrs. Thanks Vivek > > Also, xattr's are limited to 32k, so it's not like users can evade > _that_ much quota space, at least not without it being pretty painful. > (Assuming that quota is even enabled, which most of the time, it > isn't.) > > - Ted > > P.S. I'll note that if ext4's ea_in_inode is enabled, for large > xattr's, if you have 2 million files that all have the same 12k > windows SID stored as an xattr, ext4 will store that xattr only once. > Those two million files might be owned by different uids, so we made > an explicit design choice not to worry about accounting for the quota > for said 12k xattr value. After all, if you can save the space and > access cost of 2M * 12k if each file had to store its own copy of that > xattr, perhaps not including it in the quota calculation isn't that > bad. 
:-) > > We also don't account for the disk space used by symbolic links (since > sometimes they can be stored in the inode as fast symlinks, and > sometimes they might consume a data block). But again, that's a file > system specific implementation question. >
On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote: > * Theodore Ts'o (tytso@mit.edu) wrote: > > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > > > All this conversation is great, and I look forward to a better solution, but > > > if we go back to the patch, it was to fix an issue where the kernel is > > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > > > special files. > > > > > > The documented reason for this is to prevent the users from using XATTRS to > > > avoid quota. > > > > Huh? Where is it so documented? > > man xattr(7): > The file permission bits of regular files and directories are > interpreted differently from the file permission bits of special > files and symbolic links. For regular files and directories the > file permission bits define access to the file's contents, > while for device special files they define access to the device > described by the special file. The file permissions of symbolic > links are not used in access checks. All of this is true... > *** These differences would > allow users to consume filesystem resources in a way not > controllable by disk quotas for group or world writable special > files and directories.**** Anyone with group write access to a regular file can append to the file, and the blocks written will be charged the owner of the file. So it's perfectly "controllable" by the quota system; if you have group write access to a file, you can charge against the user's quota. This is Working As Intended. And the creation of device special files take the umask into account, just like regular files, so if you have a umask that allows newly created files to be group writeable, the same issue would occur for regular files as device files. Given that most users have a umask of 0077 or 0022, this is generally Not A Problem. 
I think I see the issue which drove the above text, though, which is that Linux's symlink(2) is creating symlinks which do not take umask into account; that is, the permissions are always mode S_IFLNK|0777. Hence, it might be that the right answer is to remove this fairly arbitrary restriction entirely, and change symlink(2) so that it creates files which respect the umask. POSIX and SUS don't specify what permissions are used, and historically (before the advent of xattrs) I suspect since it didn't matter, no one cared about whether or not umask was applied. Some people might object to such a change, arguing that with pre-existing file systems where there are symlinks which are world-writeable, this might cause people to be able to charge up to 32k (or whatever the maximum size of the xattr supported by the file system) for each symlink. However, (a) very few people actually use quotas, and this would only be an issue for those users, and (b) the amount of quota "abuse" that could be carried out this way is small enough that I'm not sure it matters. - Ted
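The umask point above can be checked from user space. The following is a minimal sketch, not taken from the thread: the helper name `created_modes` is made up for illustration, and it assumes a POSIX system where `os.symlink` is available. It creates a regular file and a symlink under the same umask and reports the permission bits that `lstat` returns for each.

```python
import os
import stat
import tempfile


def created_modes(umask_value=0o077):
    """Return (regular_file_mode, symlink_mode) for objects created under umask_value.

    Hypothetical helper for illustration only.
    """
    old_umask = os.umask(umask_value)
    try:
        with tempfile.TemporaryDirectory() as scratch:
            regular = os.path.join(scratch, "regular")
            link = os.path.join(scratch, "link")

            # Ask for 0666; the kernel masks this with the umask.
            fd = os.open(regular, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)

            # symlink(2) takes no mode argument at all.
            os.symlink(regular, link)

            # lstat() so that the symlink itself is examined, not its target.
            regular_mode = stat.S_IMODE(os.lstat(regular).st_mode)
            link_mode = stat.S_IMODE(os.lstat(link).st_mode)
            return regular_mode, link_mode
    finally:
        os.umask(old_umask)


if __name__ == "__main__":
    regular_mode, link_mode = created_modes(0o077)
    print(f"regular file: {regular_mode:04o}, symlink: {link_mode:04o}")
```

On Linux the symlink is expected to come back as 0777 whatever the umask, while the regular file is masked down to 0600; other systems may treat the symlink differently.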
* Theodore Ts'o (tytso@mit.edu) wrote: > On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote: > > * Theodore Ts'o (tytso@mit.edu) wrote: > > > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > > > > All this conversation is great, and I look forward to a better solution, but > > > > if we go back to the patch, it was to fix an issue where the kernel is > > > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > > > > special files. > > > > > > > > The documented reason for this is to prevent the users from using XATTRS to > > > > avoid quota. > > > > > > Huh? Where is it so documented? > > > > man xattr(7): > > The file permission bits of regular files and directories are > > interpreted differently from the file permission bits of special > > files and symbolic links. For regular files and directories the > > file permission bits define access to the file's contents, > > while for device special files they define access to the device > > described by the special file. The file permissions of symbolic > > links are not used in access checks. > > All of this is true... > > > *** These differences would > > allow users to consume filesystem resources in a way not > > controllable by disk quotas for group or world writable special > > files and directories.**** > > Anyone with group write access to a regular file can append to the > file, and the blocks written will be charged the owner of the file. > So it's perfectly "controllable" by the quota system; if you have > group write access to a file, you can charge against the user's quota. > This is Working As Intended. > > And the creation of device special files take the umask into account, > just like regular files, so if you have a umask that allows newly > created files to be group writeable, the same issue would occur for > regular files as device files. Given that most users have a umask of > 0077 or 0022, this is generally Not A Problem. 
> > I think I see the issue which drove the above text, though, which is > that Linux's symlink(2) is creating symlinks which do not take umask > into account; that is, the permissions are always mode S_IFLNK|0777. > > Hence, it might be that the right answer is to remove this fairly > arbitrary restriction entirely, and change symlink(2) so that it > creates files which respects the umask. Posix and SUS doesn't specify > what the permissions are that are used, and historically (before the > advent of xattrs) I suspect since it didn't matter, no one cared about > whether or not umask was applied. > > Some people might object to such a change arguing that with > pre-existing file systems where there are symlinks which > world-writeable, this might cause people to be able to charge up to > 32k (or whatever the maximum size of the xattr supported by the file > system) for each symlink. However, (a) very few people actually use > quotas, and this would only be an issue for those users, and (b) the > amount of quota "abuse" that could be carried out this way is small > enough that I'm not sure it matters. Even if you fix symlinks, I don't think it fixes device nodes or anything else where the permissions bitmap isn't purely used as the permissions on the inode. Dave > - Ted >
On Wed, Jun 30, 2021 at 10:47:39AM -0400, Theodore Ts'o wrote: > On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote: > > * Theodore Ts'o (tytso@mit.edu) wrote: > > > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote: > > > > All this conversation is great, and I look forward to a better solution, but > > > > if we go back to the patch, it was to fix an issue where the kernel is > > > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other > > > > special files. > > > > > > > > The documented reason for this is to prevent the users from using XATTRS to > > > > avoid quota. > > > > > > Huh? Where is it so documented? > > > > man xattr(7): > > The file permission bits of regular files and directories are > > interpreted differently from the file permission bits of special > > files and symbolic links. For regular files and directories the > > file permission bits define access to the file's contents, > > while for device special files they define access to the device > > described by the special file. The file permissions of symbolic > > links are not used in access checks. > > All of this is true... > > > *** These differences would > > allow users to consume filesystem resources in a way not > > controllable by disk quotas for group or world writable special > > files and directories.**** > > Anyone with group write access to a regular file can append to the > file, and the blocks written will be charged the owner of the file. > So it's perfectly "controllable" by the quota system; if you have > group write access to a file, you can charge against the user's quota. > This is Working As Intended. > > And the creation of device special files take the umask into account, > just like regular files, so if you have a umask that allows newly > created files to be group writeable, the same issue would occur for > regular files as device files. 
Given that most users have a umask of > 0077 or 0022, this is generally Not A Problem. > > I think I see the issue which drove the above text, though, which is > that Linux's symlink(2) is creating symlinks which do not take umask > into account; that is, the permissions are always mode S_IFLNK|0777. IIUC, the idea is to use the permission bits on the symlink to decide whether the caller can read/write user.* xattrs (as for a regular file). Hence create symlinks while honoring umask (or the default posix acl on the dir) and modify the relevant code for file creation. That will possibly also require changing chmod to allow changing the mode of a symlink. Vivek > > Hence, it might be that the right answer is to remove this fairly > arbitrary restriction entirely, and change symlink(2) so that it > creates files which respects the umask. Posix and SUS doesn't specify > what the permissions are that are used, and historically (before the > advent of xattrs) I suspect since it didn't matter, no one cared about > whether or not umask was applied. > > Some people might object to such a change arguing that with > pre-existing file systems where there are symlinks which > world-writeable, this might cause people to be able to charge up to > 32k (or whatever the maximum size of the xattr supported by the file > system) for each symlink. However, (a) very few people actually use > quotas, and this would only be an issue for those users, and (b) the > amount of quota "abuse" that could be carried out this way is small > enough that I'm not sure it matters. > > - Ted >
On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: > > Even if you fix symlinks, I don't think it fixes device nodes or > anything else where the permissions bitmap isn't purely used as the > permissions on the inode. I think we're making a mountain out of a molehill. Again, very few people are using quota these days. And if you give someone write access to an 8TB disk, do you really care if they can "steal" 32k worth of space (which is the maximum size of an xattr, enforced by the VFS)? OK, but what about character mode devices? First of all, most users don't have access to a huge number of devices, but let's assume something absurd. Let's say that a user has write access to *1024* devices. (My /dev has 233 character mode devices, and I have write access to well under a dozen.) An 8TB disk costs about $200. So how much of the "stolen" quota space are we talking about, assuming the user has access to 1024 devices, and the file system actually supports a 32k xattr? 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents A 2TB SSD is around $180, so even if we calculate the prices based on SSD space, we're still talking about a quarter of a penny. Why are we worrying about this? - Ted
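The estimate above is easy to reproduce. This is only a sketch of the arithmetic, using the prices and sizes quoted in the mail and assuming binary units (1k = 1024 bytes, 1TB = 1024^4 bytes) and a single 32k xattr per device node; the function name is made up for illustration.

```python
KIB = 1024
TIB = 1024 ** 4


def stolen_space_cost(xattr_bytes, node_count, disk_price_dollars, disk_bytes):
    """Dollar value of the disk space taken by one xattr on each of node_count nodes."""
    stolen_bytes = xattr_bytes * node_count
    return stolen_bytes * disk_price_dollars / disk_bytes


# Figures from the mail: 1024 writable devices, one 32k xattr each.
hdd_dollars = stolen_space_cost(32 * KIB, 1024, 200.0, 8 * TIB)
ssd_dollars = stolen_space_cost(32 * KIB, 1024, 180.0, 2 * TIB)

if __name__ == "__main__":
    print(f"8TB disk at $200: {hdd_dollars * 100:.4f} cents")
    print(f"2TB SSD at $180:  {ssd_dollars * 100:.4f} cents")
```

With these inputs the disk case works out to about 0.0763 cents and the SSD case to about 0.27 cents, matching the figures in the mail. The result scales linearly if more than one xattr can be written per node.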
On Wed, Jun 30, 2021 at 03:59:41PM -0400, Theodore Ts'o wrote: > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: > > > > Even if you fix symlinks, I don't think it fixes device nodes or > > anything else where the permissions bitmap isn't purely used as the > > permissions on the inode. > > I think we're making a mountain out of a molehill. Again, very few > people are using quota these days. And if you give someone write > access to a 8TB disk, do you really care if they can "steal" 32k worth > of space (which is the maximum size of an xattr, enforced by the VFS). So that should be N * 32K per inode, where N is number of user xattrs one can write on the inode. (user.1, user.2, user.3, .., user.N)? Vivek > > OK, but what about character mode devices? First of all, most users > don't have access to huge number of devices, but let's assume > something absurd. Let's say that a user has write access to *1024* > devices. (My /dev has 233 character mode devices, and I have write > access to well under a dozen.) > > An 8TB disk costs about $200. So how much of the "stolen" quota space > are we talking about, assuming the user has access to 1024 devices, > and the file system actually supports a 32k xattr. > > 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents > > A 2TB SSD is less around $180, so even if we calculate the prices > based on SSD space, we're still talking about a quarter of a penny. > > Why are we worrying about this? > > - Ted >
* Theodore Ts'o (tytso@mit.edu) wrote: > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: > > > > Even if you fix symlinks, I don't think it fixes device nodes or > > anything else where the permissions bitmap isn't purely used as the > > permissions on the inode. > > I think we're making a mountain out of a molehill. Again, very few > people are using quota these days. And if you give someone write > access to a 8TB disk, do you really care if they can "steal" 32k worth > of space (which is the maximum size of an xattr, enforced by the VFS). > > OK, but what about character mode devices? First of all, most users > don't have access to huge number of devices, but let's assume > something absurd. Let's say that a user has write access to *1024* > devices. (My /dev has 233 character mode devices, and I have write > access to well under a dozen.) > > An 8TB disk costs about $200. So how much of the "stolen" quota space > are we talking about, assuming the user has access to 1024 devices, > and the file system actually supports a 32k xattr. > > 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents > > A 2TB SSD is less around $180, so even if we calculate the prices > based on SSD space, we're still talking about a quarter of a penny. > > Why are we worrying about this? I'm not worrying about storage cost, but we would need to define what the rules are on who can write and change a user.* xattr on a device node. It doesn't feel sane to make it anyone who can write to the device; then everyone can start leaving droppings on /dev/null. The other evilness I can imagine, is if there's a 32k limit on xattrs on a node, an evil user could write almost 32k of junk to the node and then break the next login that tries to add an acl or breaks the next relabel. Dave > - Ted >
On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote: > * Theodore Ts'o (tytso@mit.edu) wrote: > > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: > > > > > > Even if you fix symlinks, I don't think it fixes device nodes or > > > anything else where the permissions bitmap isn't purely used as the > > > permissions on the inode. > > > > I think we're making a mountain out of a molehill. Again, very few > > people are using quota these days. And if you give someone write > > access to a 8TB disk, do you really care if they can "steal" 32k worth > > of space (which is the maximum size of an xattr, enforced by the VFS). > > > > OK, but what about character mode devices? First of all, most users > > don't have access to huge number of devices, but let's assume > > something absurd. Let's say that a user has write access to *1024* > > devices. (My /dev has 233 character mode devices, and I have write > > access to well under a dozen.) > > > > An 8TB disk costs about $200. So how much of the "stolen" quota space > > are we talking about, assuming the user has access to 1024 devices, > > and the file system actually supports a 32k xattr. > > > > 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents > > > > A 2TB SSD is less around $180, so even if we calculate the prices > > based on SSD space, we're still talking about a quarter of a penny. > > > > Why are we worrying about this? > > I'm not worrying about storage cost, but we would need to define what > the rules are on who can write and change a user.* xattr on a device > node. It doesn't feel sane to make it anyone who can write to the > device; then everyone can start leaving droppings on /dev/null. Looks like tmpfs/devtmpfs might not support setting user.* xattrs. So devices nodes there should not be a problem. 

# touch /dev/foo.txt
# setfattr -n "user.foo" -v "bar" /dev/foo.txt
setfattr: /dev/foo.txt: Operation not supported

Vivek

>
> The other evilness I can imagine, is if there's a 32k limit on xattrs on
> a node, an evil user could write almost 32k of junk to the node
> and then break the next login that tries to add an acl or breaks the
> next relabel.
>
> Dave
>
> > - Ted
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
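The same probe can be scripted. The following is a rough sketch, not something posted in the thread: `supports_user_xattr` is a hypothetical helper, and it relies on Python's `os.setxattr`, which is only available on Linux. It creates a scratch file in the given directory, tries to set `user.foo`, and treats "Operation not supported" as "this filesystem does not take user.* xattrs".

```python
import errno
import os
import tempfile


def supports_user_xattr(directory):
    """Return True if a user.* xattr can be set on a scratch file in directory.

    Hypothetical helper; returns False when the filesystem answers
    "Operation not supported" or when os.setxattr is unavailable.
    """
    if not hasattr(os, "setxattr"):
        # os.setxattr exists only on Linux builds of Python.
        return False

    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.setxattr(path, "user.foo", b"bar")
        return True
    except OSError as err:
        if err.errno in (errno.ENOTSUP, errno.EOPNOTSUPP):
            return False
        raise  # EPERM, ENOSPC, etc. are a different story
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for candidate in ("/dev/shm", tempfile.gettempdir()):
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            print(candidate, supports_user_xattr(candidate))
```

Note that this only tests a regular file, like the `touch` example above; it says nothing about what the kernel allows on symlinks or device nodes in the same filesystem.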
On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote: > * Theodore Ts'o (tytso@mit.edu) wrote: > > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: > > > > > > Even if you fix symlinks, I don't think it fixes device nodes or > > > anything else where the permissions bitmap isn't purely used as the > > > permissions on the inode. > > > > I think we're making a mountain out of a molehill. Again, very few > > people are using quota these days. And if you give someone write > > access to a 8TB disk, do you really care if they can "steal" 32k worth > > of space (which is the maximum size of an xattr, enforced by the VFS). > > > > OK, but what about character mode devices? First of all, most users > > don't have access to huge number of devices, but let's assume > > something absurd. Let's say that a user has write access to *1024* > > devices. (My /dev has 233 character mode devices, and I have write > > access to well under a dozen.) > > > > An 8TB disk costs about $200. So how much of the "stolen" quota space > > are we talking about, assuming the user has access to 1024 devices, > > and the file system actually supports a 32k xattr. > > > > 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents > > > > A 2TB SSD is less around $180, so even if we calculate the prices > > based on SSD space, we're still talking about a quarter of a penny. > > > > Why are we worrying about this? > > I'm not worrying about storage cost, but we would need to define what > the rules are on who can write and change a user.* xattr on a device > node. It doesn't feel sane to make it anyone who can write to the > device; then everyone can start leaving droppings on /dev/null. > > The other evilness I can imagine, is if there's a 32k limit on xattrs on > a node, an evil user could write almost 32k of junk to the node > and then break the next login that tries to add an acl or breaks the > next relabel. 
I guess 64k is the per-xattr VFS size limit.

#define XATTR_SIZE_MAX 65536

I just wrote a simple program to write "user.<N>" xattrs of size 1K each and could easily write 1M xattrs. So that's 1G worth of data right there. I did not try to push it further.

So a user can write a lot of data in the form of user.* xattrs on symlinks and device nodes if we were to open it up unconditionally. Hence the permission semantics will probably have to be defined properly.

I am wondering whether it will be alright if the owner of the file (or a caller with CAP_FOWNER) is allowed to write user.* xattrs on symlinks and special files.

Vivek
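The test program mentioned above was not posted. A hypothetical reconstruction of the idea might look like the sketch below; the function names are invented, `os.setxattr` is Linux-only, and how far the loop gets depends entirely on the filesystem (many cap the total xattr space per inode well below this).

```python
import os


def xattr_payloads(count, size=1024):
    """Yield (name, value) pairs: user.0, user.1, ... each carrying size bytes."""
    for n in range(count):
        yield f"user.{n}", b"x" * size


def fill_with_xattrs(path, count, size=1024):
    """Set up to count user.<N> xattrs on path; return the number of value bytes stored.

    Stops at the first error (no space, quota, per-inode limit, not permitted).
    """
    written = 0
    for name, value in xattr_payloads(count, size):
        try:
            os.setxattr(path, name, value)
        except OSError:
            break
        written += len(value)
    return written


if __name__ == "__main__":
    import tempfile

    # Small count on purpose; the mail describes 1M xattrs of 1K each,
    # i.e. 1024 * 1024 * 1024 bytes of values.
    with tempfile.NamedTemporaryFile() as scratch:
        print(fill_with_xattrs(scratch.name, count=16), "bytes of xattr values stored")
```

The names alone also consume space, so the value byte count returned here is a lower bound on what the filesystem actually has to store.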
On 7/1/2021 6:10 AM, Vivek Goyal wrote: > On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote: >> * Theodore Ts'o (tytso@mit.edu) wrote: >>> On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote: >>>> Even if you fix symlinks, I don't think it fixes device nodes or >>>> anything else where the permissions bitmap isn't purely used as the >>>> permissions on the inode. >>> I think we're making a mountain out of a molehill. Again, very few >>> people are using quota these days. And if you give someone write >>> access to a 8TB disk, do you really care if they can "steal" 32k worth >>> of space (which is the maximum size of an xattr, enforced by the VFS). >>> >>> OK, but what about character mode devices? First of all, most users >>> don't have access to huge number of devices, but let's assume >>> something absurd. Let's say that a user has write access to *1024* >>> devices. (My /dev has 233 character mode devices, and I have write >>> access to well under a dozen.) >>> >>> An 8TB disk costs about $200. So how much of the "stolen" quota space >>> are we talking about, assuming the user has access to 1024 devices, >>> and the file system actually supports a 32k xattr. >>> >>> 32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents >>> >>> A 2TB SSD is less around $180, so even if we calculate the prices >>> based on SSD space, we're still talking about a quarter of a penny. >>> >>> Why are we worrying about this? >> I'm not worrying about storage cost, but we would need to define what >> the rules are on who can write and change a user.* xattr on a device >> node. It doesn't feel sane to make it anyone who can write to the >> device; then everyone can start leaving droppings on /dev/null. >> >> The other evilness I can imagine, is if there's a 32k limit on xattrs on >> a node, an evil user could write almost 32k of junk to the node >> and then break the next login that tries to add an acl or breaks the >> next relabel. 
> I guess 64k is per xattr VFS size limit. > > #define XATTR_SIZE_MAX 65536 > > I just wrote a simple program to write "user.<N>" xattrs of size 1K > each and could easily write 1M xattrs. So that 1G worth data right > there. I did not try to push it further. > > So a user can write lot of data in the form of user.* xattrs on > symlinks and device nodes if were to open it unconditionally. Hence > permission semantics will probably will have to defined properly. > > I am wondering will it be alright if owner of the file (or CAP_FOWNER), > is allowed to write user.* xattrs on symlinks and special files. That would be sensible. That's independent of your xattr mapping scheme. > > Vivek >