[RFC,0/1] xattr: Allow user.* xattr on symlink/special files if caller has CAP_SYS_RESOURCE

Message-ID: <20210625191229.1752531-1-vgoyal@redhat.com>
From: Vivek Goyal <vgoyal@redhat.com>
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, viro@zeniv.linux.org.uk
Cc: virtio-fs@redhat.com, dwalsh@redhat.com, dgilbert@redhat.com, berrange@redhat.com, vgoyal@redhat.com
Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if caller has CAP_SYS_RESOURCE
Date: Fri, 25 Jun 2021 15:12:28 -0400
Series: [RFC,0/1] xattr: Allow user.* xattr on symlink/special files if caller has CAP_SYS_RESOURCE; [1/1] xattr: Allow user.* xattr on symlink/special files with CAP_SYS_RESOURCE

Vivek Goyal June 25, 2021, 7:12 p.m. UTC

Hi,

In virtiofs, the actual file server is the virtiofsd daemon running on the host.
There we have a mode where xattrs can be remapped to something else.
For example security.selinux can be remapped to
user.virtiofsd.security.selinux on the host.

This remapping is useful when SELinux is enabled in guest and virtiofs
is being used as rootfs. Guest and host SELinux policy might not match
and host policy might deny security.selinux xattr setting by guest
onto host. Or host might have SELinux disabled and in that case to
be able to set security.selinux xattr, virtiofsd will need to have
CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
guest security.selinux (or other xattrs) on host to something else
is also better from security point of view.

But when we tried this, we noticed that SELinux relabeling in guest
was failing on some symlinks. When I debugged a little more, I
came to know that "user.*" xattrs are not allowed on symlinks
or special files.

"man xattr" seems to suggest that primary reason to disallow is
that arbitrary users can set unlimited amount of "user.*" xattrs
on these files and bypass quota check.

If that's the primary reason, I am wondering whether it is possible to
relax the restrictions if caller has CAP_SYS_RESOURCE. This capability
allows caller to bypass quota checks. So it should not be
a problem at least from quota perspective.
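
As a rough illustration of the rule being discussed, here is a simplified
model in Python. It is not the kernel code in fs/xattr.c: the function name
and the boolean capability flag are made up for this sketch, and it ignores
ordinary file permission checks and the sticky-directory rule.

```python
import stat


def may_write_user_xattr(mode: int, has_cap_sys_resource: bool = False) -> bool:
    """Simplified model: may a caller write a user.* xattr on this inode type?"""
    # Behaviour described above: user.* xattrs are limited to regular
    # files and directories.
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
        return True
    # Assumed shape of the proposed relaxation: a caller holding
    # CAP_SYS_RESOURCE is also allowed on symlinks and special files.
    return has_cap_sys_resource


symlink_mode = stat.S_IFLNK | 0o777
print(may_write_user_xattr(symlink_mode))
print(may_write_user_xattr(symlink_mode, has_cap_sys_resource=True))
```

With this model, a symlink is refused for an ordinary caller and accepted
only when the capability flag is set.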

That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
and remap xattrs arbitrarily.

Thanks
Vivek

Vivek Goyal (1):
  xattr: Allow user.* xattr on symlink/special files with
    CAP_SYS_RESOURCE

 fs/xattr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Schaufler, Casey June 25, 2021, 9:49 p.m. UTC | #1

> -----Original Message-----
> From: Vivek Goyal <vgoyal@redhat.com>
> Sent: Friday, June 25, 2021 12:12 PM
> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> viro@zeniv.linux.org.uk
> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> berrange@redhat.com; vgoyal@redhat.com

Please include Linux Security Module list <linux-security-module@vger.kernel.org>
and selinux@vger.kernel.org on this topic.

> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> caller has CAP_SYS_RESOURCE
> 
> Hi,
> 
> In virtiofs, the actual file server is the virtiofsd daemon running on the host.
> There we have a mode where xattrs can be remapped to something else.
> For example security.selinux can be remapped to
> user.virtiofsd.security.selinux on the host.

This would seem to provide a mechanism whereby a user can violate
SELinux policy quite easily.

> 
> This remapping is useful when SELinux is enabled in guest and virtiofs
> is being used as rootfs. Guest and host SELinux policy might not match
> and host policy might deny security.selinux xattr setting by guest
> onto host. Or host might have SELinux disabled and in that case to
> be able to set security.selinux xattr, virtiofsd will need to have
> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> guest security.selinux (or other xattrs) on host to something else
> is also better from security point of view.

Can you please provide some rationale for this assertion?
I have been working with security xattrs longer than anyone
and have trouble accepting the statement.

> But when we tried this, we noticed that SELinux relabeling in guest
> was failing on some symlinks. When I debugged a little more, I
> came to know that "user.*" xattrs are not allowed on symlinks
> or special files.
> 
> "man xattr" seems to suggest that primary reason to disallow is
> that arbitrary users can set unlimited amount of "user.*" xattrs
> on these files and bypass quota check.
> 
> If that's the primary reason, I am wondering whether it is possible to
> relax the restrictions if caller has CAP_SYS_RESOURCE. This capability
> allows caller to bypass quota checks. So it should not be
> a problem at least from quota perspective.
> 
> That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> and remap xattrs arbitrarily.

On a Smack system you should require CAP_MAC_ADMIN to remap
security. xattrs. It sounds like you're in serious danger of running afoul
of LSM attribute policy on a reasonably general level.

> 
> Thanks
> Vivek
> 
> Vivek Goyal (1):
>   xattr: Allow user.* xattr on symlink/special files with
>     CAP_SYS_RESOURCE
> 
>  fs/xattr.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --
> 2.25.4

Dr. David Alan Gilbert June 28, 2021, 11:58 a.m. UTC | #2

* Schaufler, Casey (casey.schaufler@intel.com) wrote:
> > -----Original Message-----
> > From: Vivek Goyal <vgoyal@redhat.com>
> > Sent: Friday, June 25, 2021 12:12 PM
> > To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> > viro@zeniv.linux.org.uk
> > Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> > berrange@redhat.com; vgoyal@redhat.com
> 
> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
> and selinux@vger.kernel.org on this topic.
> 
> > Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> > caller has CAP_SYS_RESOURCE
> > 
> > Hi,
> > 
> > In virtiofs, the actual file server is the virtiofsd daemon running on the host.
> > There we have a mode where xattrs can be remapped to something else.
> > For example security.selinux can be remapped to
> > user.virtiofsd.security.selinux on the host.
> 
> This would seem to provide a mechanism whereby a user can violate
> SELinux policy quite easily.
> 
> > 
> > This remapping is useful when SELinux is enabled in guest and virtiofs
> > is being used as rootfs. Guest and host SELinux policy might not match
> > and host policy might deny security.selinux xattr setting by guest
> > onto host. Or host might have SELinux disabled and in that case to
> > be able to set security.selinux xattr, virtiofsd will need to have
> > CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> > guest security.selinux (or other xattrs) on host to something else
> > is also better from security point of view.
> 
> Can you please provide some rationale for this assertion?
> I have been working with security xattrs longer than anyone
> and have trouble accepting the statement.

There seem to be a few very different ways of using SELinux in
containers/guests, and many ways of using shared filesystems.

A common request is that we share a host filesystem into the guest (a
VM), and then the guest can do with it whatever it likes, preferably
without making the guest privileged in any way, and with as few
privileges as possible on the daemons running on behalf of the guest
('virtiofsd', which is a FUSE implementation daemon that runs on the host).

By remapping all of the guest's xattrs to add a "user.virtiofsd." prefix,
the guest can label its filesystem and implement its own SELinux
policy, but because it's using "user." on the host, it can neither
bypass nor change the host's SELinux labelling or policies.

(It also means that the guest can set capabilities and other xattrs,
again without confusing the host).
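
To make the prefix idea concrete, here is a minimal sketch in Python of the
name translation in both directions. It is not virtiofsd's implementation:
the function names are invented for illustration, and hiding unprefixed host
xattrs from the guest is an assumption of this sketch (the real behaviour
depends on the remapping parameters the daemon is given).

```python
from typing import Optional

PREFIX = "user.virtiofsd."  # prefix named in this thread


def guest_to_host(name: str) -> str:
    """Name stored on the host for an xattr the guest sets."""
    return PREFIX + name


def host_to_guest(name: str) -> Optional[str]:
    """Name shown to the guest for an xattr found on the host."""
    if name.startswith(PREFIX):
        return name[len(PREFIX):]
    # Assumption of this sketch: the host's own xattrs stay hidden.
    return None


print(guest_to_host("security.selinux"))
print(host_to_guest("user.virtiofsd.security.selinux"))
```

The guest's "security.selinux" never lands in the host's "security."
namespace; it only ever exists on the host under the "user." name.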

> > But when we tried this, we noticed that SELinux relabeling in guest
> > was failing on some symlinks. When I debugged a little more, I
> > came to know that "user.*" xattrs are not allowed on symlinks
> > or special files.
> > 
> > "man xattr" seems to suggest that primary reason to disallow is
> > that arbitrary users can set unlimited amount of "user.*" xattrs
> > on these files and bypass quota check.
> > 
> > If that's the primary reason, I am wondering whether it is possible to
> > relax the restrictions if caller has CAP_SYS_RESOURCE. This capability
> > allows caller to bypass quota checks. So it should not be
> > a problem at least from quota perspective.
> > 
> > That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> > and remap xattrs arbitrarily.
> 
> On a Smack system you should require CAP_MAC_ADMIN to remap
> security. xattrs. It sounds like you're in serious danger of running afoul
> of LSM attribute policy on a reasonably general level.

Note that the remapping is done by the userspace daemon running on the
host (and takes parameters saying what remapping is required); as
such it's still bound by whatever LSM policies the host wants; we're
just giving the guest the ability to add its own policies without
breaking the host's.

Of course if you want the guest kernel to see the host xattrs
then you don't want the remapping; there are even some cases where you
might want to allow the guest to set those xattrs; but then you really
do have to start worrying about what the guest could do to your
filesystem.

The only thing getting in the way of the guest being able to do a full
relabel seems to be the limitation on user.* on non-files.

Dave

> > 
> > Thanks
> > Vivek
> > 
> > Vivek Goyal (1):
> >   xattr: Allow user.* xattr on symlink/special files with
> >     CAP_SYS_RESOURCE
> > 
> >  fs/xattr.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > --
> > 2.25.4
>

Vivek Goyal June 28, 2021, 1:17 p.m. UTC | #3

On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
> > -----Original Message-----
> > From: Vivek Goyal <vgoyal@redhat.com>
> > Sent: Friday, June 25, 2021 12:12 PM
> > To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> > viro@zeniv.linux.org.uk
> > Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> > berrange@redhat.com; vgoyal@redhat.com
> 
> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
> and selinux@vger.kernel.org on this topic.
> 
> > Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> > caller has CAP_SYS_RESOURCE
> > 
> > Hi,
> > 
> > In virtiofs, the actual file server is the virtiofsd daemon running on the host.
> > There we have a mode where xattrs can be remapped to something else.
> > For example security.selinux can be remapped to
> > user.virtiofsd.security.selinux on the host.
> 
> This would seem to provide a mechanism whereby a user can violate
> SELinux policy quite easily.

Hi Casey,

As David already replied, we are not bypassing host's SELinux policy (if
there is one). We are just trying to provide a mode where host and
guest's SELinux policies could co-exist without interfering
with each other.

By remapping guest's SELinux xattrs (and not host's SELinux xattrs),
a file will probably have two xattrs:

"security.selinux" and "user.virtiofsd.security.selinux". Host will
enforce SELinux policy based on security.selinux xattr and guest
will see the SELinux info stored in "user.virtiofsd.security.selinux"
and guest SELinux policy will enforce rules based on that.
(user.virtiofsd.security.selinux will be remapped to "security.selinux"
when guest does getxattr()).

IOW, this mode is allowing both host and guest SELinux policies to
co-exist and not interfere with each other. (Remapping guest's
SELinux xattr is not changing host's SELinux label and is not
bypassing host's SELinux policy).
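
A toy illustration of the two labels living side by side, written in Python
against an in-memory dictionary rather than a real filesystem. The class,
the helper names and both label strings are hypothetical and exist only for
this sketch.

```python
PREFIX = "user.virtiofsd."  # prefix named in this thread


class HostFile:
    """Stand-in for a file on the host, holding its xattrs in a dict."""

    def __init__(self, host_label: bytes):
        self.xattrs = {"security.selinux": host_label}


def guest_setxattr(f: HostFile, name: str, value: bytes) -> None:
    # The guest's name is stored under the remapped host name.
    f.xattrs[PREFIX + name] = value


def guest_getxattr(f: HostFile, name: str):
    # The guest's name is translated back on lookup.
    return f.xattrs.get(PREFIX + name)


f = HostFile(b"system_u:object_r:host_share_t:s0")  # hypothetical host label
guest_setxattr(f, "security.selinux", b"system_u:object_r:etc_t:s0")

print(sorted(f.xattrs))
print(guest_getxattr(f, "security.selinux"))
```

After the guest relabels the file, the host's "security.selinux" entry is
unchanged and the guest reads back only what it wrote.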

virtiofsd also provides for the mode where if guest process sets
SELinux xattr it shows up as security.selinux on host. But now we
have multiple issues. There are two SELinux policies (host and guest)
which are operating on same label. And there is a very good chance
that two have not been written in such a way that they work with
each other. In fact there does not seem to exist a notion where
two different SELinux policies are operating on same label.

At high level, this is in a way similar to files created on
virtio-blk devices. Say this device is backed by a foo.img file
on host. Now host selinux policy will set its own label on
foo.img and provide access control while labels created by guest
are not seen or controlled by host's SELinux policy. Only guest
SELinux policy works with those labels.

So this is similar kind of attempt. Provide isolation between
host and guest's SELinux labels so that two policies can
co-exist and not interfere with each other.

> 
> > 
> > This remapping is useful when SELinux is enabled in guest and virtiofs
> > is being used as rootfs. Guest and host SELinux policy might not match
> > and host policy might deny security.selinux xattr setting by guest
> > onto host. Or host might have SELinux disabled and in that case to
> > be able to set security.selinux xattr, virtiofsd will need to have
> > CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> > guest security.selinux (or other xattrs) on host to something else
> > is also better from security point of view.
> 
> Can you please provide some rationale for this assertion?
> I have been working with security xattrs longer than anyone
> and have trouble accepting the statement.

If guest is not able to interfere or change host's SELinux labels
directly, it sounded better.

Irrespective of this, my primary concern is to allow guest
VM to be able to use SELinux seamlessly in diverse host OS
environments (typical of cloud deployments). And being able to
provide a mode where host and guest's security labels can
co-exist and policies can work independently, should be able
to achieve that goal.

> 
> > But when we tried this, we noticed that SELinux relabeling in guest
> > was failing on some symlinks. When I debugged a little more, I
> > came to know that "user.*" xattrs are not allowed on symlinks
> > or special files.
> > 
> > "man xattr" seems to suggest that primary reason to disallow is
> > that arbitrary users can set unlimited amount of "user.*" xattrs
> > on these files and bypass quota check.
> > 
> > If that's the primary reason, I am wondering whether it is possible to
> > relax the restrictions if caller has CAP_SYS_RESOURCE. This capability
> > allows caller to bypass quota checks. So it should not be
> > a problem at least from quota perspective.
> > 
> > That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> > and remap xattrs arbitrarily.
> 
> On a Smack system you should require CAP_MAC_ADMIN to remap
> security. xattrs. It sounds like you're in serious danger of running afoul
> of LSM attribute policy on a reasonably general level.

I think I did not explain xattr remapping properly and that's why this
confusion is there. Only guest's xattrs will be remapped and not
host's xattrs. So one can not bypass any access control implemented
by any of the LSMs on host.

Thanks
Vivek

Daniel Walsh June 28, 2021, 1:36 p.m. UTC | #4

I want to point out that this solves a couple of other problems also.
Currently virtiofsd attempts to write security attributes on the host,
which is denied by default on systems without SELinux and without
CAP_SYS_ADMIN. This means if you want to run a container or VM on a
host without SELinux support but the VM has SELinux enabled, then
virtiofsd needs CAP_SYS_ADMIN. It would be much more secure if it only
needed CAP_SYS_RESOURCE. If the host has SELinux enabled then it can
run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be
allowed to write labels that the host system understands; any label not
understood will be blocked. Not only this, but the label that is running
virtiofsd pretty much has to run as unconfined, since it could be
writing any SELinux label.

If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can
run with a confined SELinux label only allowing it to setxattr on the
content in the designated directory, making the container/VM much more secure.

Casey Schaufler June 28, 2021, 4:04 p.m. UTC | #5

On 6/28/2021 6:36 AM, Daniel Walsh wrote:
> I want to point out that this solves a couple of other problems also.

I am not (usually) averse to solving problems. My concern is with
regard to creating new ones.

> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and without CAP_SYS_ADMIN.

Right. Which is as it should be.
Also, s/SELinux/a LSM that uses security xattrs/

>   This means if you want to run a container or VM

A container uses the kernel from the host. A VM uses the kernel
from the guest. Unless you're calling a VM a container for
marketing purposes. If this scheme works for non-VM based containers
there's a problem.

> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.

I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
or does it get run as root like most system daemons? If it runs as root the argument
has no legs.

> If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands; any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.

You could fix that easily enough by teaching SELinux about the proper
use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
going to happen, and why it would be considered philosophically repugnant
in the SELinux community. 

>
> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, making the container/VM much more secure.
>
User xattrs are less protected than security xattrs. You are exposing the
security xattrs on the guest to the possible whims of a malicious, unprivileged
actor on the host. All it needs is the right UID.

We have unused xattr namespaces. Would using the "trusted" namespace
work for your purposes?

Dr. David Alan Gilbert June 28, 2021, 4:28 p.m. UTC | #6

* Casey Schaufler (casey@schaufler-ca.com) wrote:
> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
> > On 6/28/21 09:17, Vivek Goyal wrote:
> >> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
> >>>> -----Original Message-----
> >>>> From: Vivek Goyal <vgoyal@redhat.com>
> >>>> Sent: Friday, June 25, 2021 12:12 PM
> >>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> >>>> viro@zeniv.linux.org.uk
> >>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> >>>> berrange@redhat.com; vgoyal@redhat.com
> >>> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
> >>> and selinux@vger.kernel.org on this topic.
> >>>
> >>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> >>>> caller has CAP_SYS_RESOURCE
> >>>>
> >>>> Hi,
> >>>>
> >>>> In virtiofs, the actual file server is the virtiofsd daemon running on the host.
> >>>> There we have a mode where xattrs can be remapped to something else.
> >>>> For example security.selinux can be remapped to
> >>>> user.virtiofsd.security.selinux on the host.
> >>> This would seem to provide a mechanism whereby a user can violate
> >>> SELinux policy quite easily.
> >> Hi Casey,
> >>
> >> As David already replied, we are not bypassing host's SELinux policy (if
> >> there is one). We are just trying to provide a mode where host and
> >> guest's SELinux policies could co-exist without interfering
> >> with each other.
> >>
> >> By remapping guest's SELinux xattrs (and not host's SELinux xattrs),
> >> a file will probably have two xattrs:
> >>
> >> "security.selinux" and "user.virtiofsd.security.selinux". Host will
> >> enforce SELinux policy based on security.selinux xattr and guest
> >> will see the SELinux info stored in "user.virtiofsd.security.selinux"
> >> and guest SELinux policy will enforce rules based on that.
> >> (user.virtiofsd.security.selinux will be remapped to "security.selinux"
> >> when guest does getxattr()).
> >>
> >> IOW, this mode is allowing both host and guest SELinux policies to
> >> co-exist and not interfere with each other. (Remapping guest's
> >> SELinux xattr is not changing host's SELinux label and is not
> >> bypassing host's SELinux policy).
> >>
> >> virtiofsd also provides for the mode where if guest process sets
> >> SELinux xattr it shows up as security.selinux on host. But now we
> >> have multiple issues. There are two SELinux policies (host and guest)
> >> which are operating on same lable. And there is a very good chance
> >> that two have not been written in such a way that they work with
> >> each other. In fact there does not seem to exist a notion where
> >> two different SELinux policies are operating on same label.
> >>
> >> At high level, this is in a way similar to files created on
> >> virtio-blk devices. Say this device is backed by a foo.img file
> >> on host. Now host selinux policy will set its own label on
> >> foo.img and provide access control while labels created by guest
> >> are not seen or controlled by host's SELinux policy. Only guest
> >> SELinux policy works with those labels.
> >>
> >> So this is a similar kind of attempt: provide isolation between
> >> the host's and guest's SELinux labels so that the two policies can
> >> co-exist and not interfere with each other.
> >>
> >>>> This remapping is useful when SELinux is enabled in the guest and virtiofs
> >>>> is being used as rootfs. Guest and host SELinux policy might not match
> >>>> and host policy might deny security.selinux xattr setting by guest
> >>>> onto host. Or host might have SELinux disabled and in that case to
> >>>> be able to set security.selinux xattr, virtiofsd will need to have
> >>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> >>>> guest security.selinux (or other xattrs) on host to something else
> >>>> is also better from security point of view.
> >>> Can you please provide some rationale for this assertion?
> >>> I have been working with security xattrs longer than anyone
> >>> and have trouble accepting the statement.
> >> If guest is not able to interfere or change host's SELinux labels
> >> directly, it sounded better.
> >>
> >> Irrespective of this, my primary concern is that to allow guest
> >> VM to be able to use SELinux seamlessly in diverse host OS
> >> environments (typical of cloud deployments). And being able to
> >> provide a mode where host and guest's security labels can
> >> co-exist and policies can work independently, should be able
> >> to achieve that goal.
> >>
> >>>> But when we try this, we noticed that SELinux relabeling in guest
> >>>> is failing on some symlinks. When I debugged a little more, I
> >>>> came to know that "user.*" xattrs are not allowed on symlinks
> >>>> or special files.
> >>>>
> >>>> "man xattr" seems to suggest that primary reason to disallow is
> >>>> that arbitrary users can set unlimited amount of "user.*" xattrs
> >>>> on these files and bypass quota check.
> >>>>
> >>>> If that's the primary reason, I am wondering whether it is possible to relax
> >>>> the restrictions if the caller has CAP_SYS_RESOURCE. This capability
> >>>> allows the caller to bypass quota checks. So it should not be
> >>>> a problem, at least from a quota perspective.
> >>>>
> >>>> That will allow me to give CAP_SYS_RESOURCE to the virtiofs daemon
> >>>> and remap xattrs arbitrarily.
> >>> On a Smack system you should require CAP_MAC_ADMIN to remap
> >>> security. xattrs. It sounds like you're in serious danger of running afoul
> >>> of LSM attribute policy on a reasonable general level.
> >> I think I did not explain xattr remapping properly and that's why this
> >> confusion is there. Only the guest's xattrs will be remapped and not
> >> the host's xattrs. So one cannot bypass any access control implemented
> >> by any of the LSM on host.
> >>
> >> Thanks
> >> Vivek
> >>
> > I want to point out that this also solves a couple of other problems.
> 
> I am not (usually) averse to solving problems. My concern is with
> regard to creating new ones.
> 
> > Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
> 
> Right. Which is as it should be.
> Also, s/SELinux/a LSM that uses security xattrs/
> 
> >   This means if you want to run a container or VM
> 
> A container uses the kernel from the host. A VM uses the kernel
> from the guest. Unless you're calling a VM a container for
> marketing purposes. If this scheme works for non-VM based containers
> there's a problem.

And 'kata' has its own kernel, but is more like a container runtime - would
you like to call this a VM or a container?
There's a whole bunch of variations people are playing around with; I don't
think there's a single answer, or a single way people are trying to use
it.

> > on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
> 
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.

It's typically run without CAP_SYS_ADMIN; (although we have other
problems, like wanting to use file handles that make caps tricky).
Some people are trying to run it in user namespaces.
Given that it's pretty complex and playing with lots of file syscalls
under partial control of the guest, giving it as few capabilities
as possible is my preference.

> >   If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> 
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community. 
> 
> >
> > If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
> >
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.

Yep, we realise that; but when you're mainly interested in making sure
the guest can't attack the host, that's less worrying.
It would be lovely if there was something more granular, (e.g. allowing
user.NUMBER. or trusted.NUMBER. to be used by this particular guest).

> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?

For those with CAP_SYS_ADMIN I guess.

Note that virtiofsd takes an option allowing you to set the mapping
however you like, so there's no hard coded user. or trusted. in the
daemon itself.
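
As a rough illustration of the remapping idea discussed in this thread (not
virtiofsd's actual mapping syntax or code), the prefix scheme can be sketched
in Python. The function names, the HOST_PREFIX constant, and the choice to
hide unprefixed host xattrs from the guest are assumptions made for this
sketch.

```python
# Hypothetical prefix, taken from the example in this thread where
# security.selinux is stored as user.virtiofsd.security.selinux on the host.
HOST_PREFIX = "user.virtiofsd."


def guest_to_host(name, prefix=HOST_PREFIX):
    """Return the name under which a guest xattr would be stored on the host."""
    return prefix + name


def host_to_guest(name, prefix=HOST_PREFIX):
    """Return the name the guest would see for a host xattr.

    Returns None for xattrs that do not carry the prefix; in this sketch
    the host's own xattrs (e.g. its security.selinux) stay hidden.
    """
    if name.startswith(prefix):
        return name[len(prefix):]
    return None


if __name__ == "__main__":
    print(guest_to_host("security.selinux"))
    print(host_to_guest("user.virtiofsd.security.selinux"))
    print(host_to_guest("security.selinux"))
```

Because the prefix is a parameter, the same sketch covers a "trusted."
based mapping, which matches the point that nothing is hard coded.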

Dave

> 
>

Vivek Goyal June 28, 2021, 5:22 p.m. UTC | #7

On Mon, Jun 28, 2021 at 09:04:40AM -0700, Casey Schaufler wrote:

[..]
> > on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
> 
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.

It runs as root but we give it a set of minimum required capabilities by
default and want to avoid giving it CAP_SYS_ADMIN.

> 
> >   If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> 
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community. 
> 
> >
> > If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
> >
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.

One of the security tenets of virtiofs is that this shared directory should
be hidden from unprivileged users. Otherwise the guest can drop setuid root
binaries in the shared directory, and an unprivileged user on the host can
execute them and gain control of the host.

So an unprivileged actor on the host having access to the contents of this
shared directory is a misconfiguration.

> 
> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?

That requires giving CAP_SYS_ADMIN to the daemon, and one of the goals is
to give as few capabilities as possible to virtiofsd. In fact, people
have been asking for the capability to run virtiofsd unprivileged
as well as to run it inside a user namespace etc.

Anyway, remapping LSM xattrs to "trusted.*" space should work as long
as virtiofsd has CAP_SYS_ADMIN.
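
The namespace constraints discussed above can be summarised in a small model.
This is a simplified sketch of the checks as described in this thread and in
xattr(7), not kernel code: the function name, the capability-set argument, and
the relax_with_sys_resource flag (standing in for the change this RFC
proposes) are hypothetical, and ownership, file permission, and LSM checks are
deliberately left out.

```python
import stat


def may_set_xattr(name, mode, caps, relax_with_sys_resource=False):
    """Model whether setting xattr `name` on a file of type `mode` is allowed.

    caps is a set of capability names held by the caller (assumed input).
    Only the trusted.* and user.* namespaces are modelled here.
    """
    if name.startswith("trusted."):
        # trusted.* needs CAP_SYS_ADMIN, which virtiofsd wants to avoid.
        return "CAP_SYS_ADMIN" in caps
    if name.startswith("user."):
        # user.* is limited to regular files and directories.
        if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
            return True
        # Proposed relaxation: symlinks and special files are allowed
        # when the caller holds CAP_SYS_RESOURCE.
        return relax_with_sys_resource and "CAP_SYS_RESOURCE" in caps
    raise ValueError("namespace not modelled in this sketch: " + name)


if __name__ == "__main__":
    symlink = stat.S_IFLNK | 0o777
    name = "user.virtiofsd.security.selinux"
    print(may_set_xattr(name, symlink, {"CAP_SYS_RESOURCE"}))
    print(may_set_xattr(name, symlink, {"CAP_SYS_RESOURCE"}, True))
```

In this model the symlink case is the only one that changes when the flag is
set, which is the relabeling failure that motivated the RFC.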

Thanks
Vivek

Casey Schaufler June 28, 2021, 5:41 p.m. UTC | #8

On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote:
> * Casey Schaufler (casey@schaufler-ca.com) wrote:
>> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
>>> On 6/28/21 09:17, Vivek Goyal wrote:
>>>> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
>>>>>> -----Original Message-----
>>>>>> From: Vivek Goyal <vgoyal@redhat.com>
>>>>>> Sent: Friday, June 25, 2021 12:12 PM
>>>>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
>>>>>> viro@zeniv.linux.org.uk
>>>>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
>>>>>> berrange@redhat.com; vgoyal@redhat.com
>>>>> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
>>>>> and selinux@vger.kernel.org on this topic.
>>>>>
>>>>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
>>>>>> caller has CAP_SYS_RESOURCE
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In virtiofs, the actual file server is the virtiofsd daemon running on the host.
>>>>>> There we have a mode where xattrs can be remapped to something else.
>>>>>> For example, security.selinux can be remapped to
>>>>>> user.virtiofsd.security.selinux on the host.
>>>>> This would seem to provide mechanism whereby a user can violate
>>>>> SELinux policy quite easily.
>>>> Hi Casey,
>>>>
>>>> As David already replied, we are not bypassing the host's SELinux policy (if
>>>> there is one). We are just trying to provide a mode where the host's and
>>>> guest's SELinux policies can co-exist without interfering
>>>> with each other.
>>>>
>>>> By remapping the guest's SELinux xattrs (and not the host's SELinux xattrs),
>>>> a file will probably have two xattrs:
>>>>
>>>> "security.selinux" and "user.virtiofsd.security.selinux". Host will
>>>> enforce SELinux policy based on security.selinux xattr and guest
>>>> will see the SELinux info stored in "user.virtiofsd.security.selinux"
>>>> and guest SELinux policy will enforce rules based on that.
>>>> (user.virtiofsd.security.selinux will be remapped to "security.selinux"
>>>> when guest does getxattr()).
>>>>
>>>> IOW, this mode allows both host and guest SELinux policies to
>>>> co-exist and not interfere with each other. (Remapping the guest's
>>>> SELinux xattr does not change the host's SELinux label and does not
>>>> bypass the host's SELinux policy.)
>>>>
>>>> virtiofsd also provides a mode where, if a guest process sets an
>>>> SELinux xattr, it shows up as security.selinux on the host. But then we
>>>> have multiple issues. There are two SELinux policies (host and guest)
>>>> which are operating on the same label. And there is a very good chance
>>>> that two have not been written in such a way that they work with
>>>> each other. In fact there does not seem to exist a notion where
>>>> two different SELinux policies are operating on same label.
>>>>
>>>> At high level, this is in a way similar to files created on
>>>> virtio-blk devices. Say this device is backed by a foo.img file
>>>> on host. Now host selinux policy will set its own label on
>>>> foo.img and provide access control while labels created by guest
>>>> are not seen or controlled by host's SELinux policy. Only guest
>>>> SELinux policy works with those labels.
>>>>
>>>> So this is a similar kind of attempt: provide isolation between
>>>> the host's and guest's SELinux labels so that the two policies can
>>>> co-exist and not interfere with each other.
>>>>
>>>>>> This remapping is useful when SELinux is enabled in the guest and virtiofs
>>>>>> is being used as rootfs. Guest and host SELinux policy might not match
>>>>>> and host policy might deny security.selinux xattr setting by guest
>>>>>> onto host. Or host might have SELinux disabled and in that case to
>>>>>> be able to set security.selinux xattr, virtiofsd will need to have
>>>>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
>>>>>> guest security.selinux (or other xattrs) on host to something else
>>>>>> is also better from security point of view.
>>>>> Can you please provide some rationale for this assertion?
>>>>> I have been working with security xattrs longer than anyone
>>>>> and have trouble accepting the statement.
>>>> If guest is not able to interfere or change host's SELinux labels
>>>> directly, it sounded better.
>>>>
>>>> Irrespective of this, my primary concern is that to allow guest
>>>> VM to be able to use SELinux seamlessly in diverse host OS
>>>> environments (typical of cloud deployments). And being able to
>>>> provide a mode where host and guest's security labels can
>>>> co-exist and policies can work independently, should be able
>>>> to achieve that goal.
>>>>
>>>>>> But when we try this, we noticed that SELinux relabeling in guest
>>>>>> is failing on some symlinks. When I debugged a little more, I
>>>>>> came to know that "user.*" xattrs are not allowed on symlinks
>>>>>> or special files.
>>>>>>
>>>>>> "man xattr" seems to suggest that primary reason to disallow is
>>>>>> that arbitrary users can set unlimited amount of "user.*" xattrs
>>>>>> on these files and bypass quota check.
>>>>>>
>>>>>> If that's the primary reason, I am wondering whether it is possible to relax
>>>>>> the restrictions if the caller has CAP_SYS_RESOURCE. This capability
>>>>>> allows the caller to bypass quota checks. So it should not be
>>>>>> a problem, at least from a quota perspective.
>>>>>>
>>>>>> That will allow me to give CAP_SYS_RESOURCE to the virtiofs daemon
>>>>>> and remap xattrs arbitrarily.
>>>>> On a Smack system you should require CAP_MAC_ADMIN to remap
>>>>> security. xattrs. It sounds like you're in serious danger of running afoul
>>>>> of LSM attribute policy on a reasonable general level.
>>>> I think I did not explain xattr remapping properly and that's why this
>>>> confusion is there. Only the guest's xattrs will be remapped and not
>>>> the host's xattrs. So one cannot bypass any access control implemented
>>>> by any of the LSM on host.
>>>>
>>>> Thanks
>>>> Vivek
>>>>
>>> I want to point out that this also solves a couple of other problems.
>> I am not (usually) averse to solving problems. My concern is with
>> regard to creating new ones.
>>
>>> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
>> Right. Which is as it should be.
>> Also, s/SELinux/a LSM that uses security xattrs/
>>
>>>   This means if you want to run a container or VM
>> A container uses the kernel from the host. A VM uses the kernel
>> from the guest. Unless you're calling a VM a container for
>> marketing purposes. If this scheme works for non-VM based containers
>> there's a problem.
> And 'kata' has its own kernel, but is more like a container runtime - would
> you like to call this a VM or a container?

I would call it a VM.

On the other hand, there has been a concerted effort to ensure that there
is no technical definition of a container. I hope to exploit this for
personal wealth and glory before too long myself. If kata wants to identify
as a container, who am I to say otherwise?

> There's a whole bunch of variations people are playing around with; I don't
> think there's a single answer, or a single way people are trying to use
> it.

Just so.

>>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
>> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
>> or does it get run as root like most system daemons? If it runs as root the argument
>> has no legs.
> It's typically run without CAP_SYS_ADMIN; (although we have other
> problems, like wanting to use file handles that make caps tricky).
> Some people are trying to run it in user namespaces.
> Given that it's pretty complex and playing with lots of file syscalls
> under partial control of the guest, giving it as few capabilities
> as possible is my preference.

It would be mine as well. I expect/fear that many developers find
capabilities too complicated to work with and drop back to good old
fashioned root. The whole rationale for user namespaces seems to be
that it makes running as root in the namespace "safe".

>>>   If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
>> You could fix that easily enough by teaching SELinux about the proper
>> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
>> going to happen, and why it would be considered philosophically repugnant
>> in the SELinux community. 
>>
>>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
>>>
>> User xattrs are less protected than security xattrs. You are exposing the
>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>> actor on the host. All it needs is the right UID.
> Yep, we realise that; but when you're mainly interested in making sure
> the guest can't attack the host, that's less worrying.

That's uncomfortable.

> It would be lovely if there was something more granular, (e.g. allowing
> user.NUMBER. or trusted.NUMBER. to be used by this particular guest).

We can't do that without breaking the "kernels aren't container aware"
mandate. I suppose that if someone wanted to implement xattr namespaces
(like user namespaces, not just the prefix) you could get away with that.
Namespaces for everything. :)

>> We have unused xattr namespaces. Would using the "trusted" namespace
>> work for your purposes?
> For those with CAP_SYS_ADMIN I guess.
>
> Note that virtiofsd takes an option allowing you to set the mapping
> however you like, so there's no hard coded user. or trusted. in the
> daemon itself.
>
> Dave
>
>>

Daniel Walsh June 28, 2021, 6:55 p.m. UTC | #9

On 6/28/21 12:04, Casey Schaufler wrote:
> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
>> On 6/28/21 09:17, Vivek Goyal wrote:
>>> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
>>>>> -----Original Message-----
>>>>> From: Vivek Goyal <vgoyal@redhat.com>
>>>>> Sent: Friday, June 25, 2021 12:12 PM
>>>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
>>>>> viro@zeniv.linux.org.uk
>>>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
>>>>> berrange@redhat.com; vgoyal@redhat.com
>>>> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
>>>> and selinux@vger.kernel.org on this topic.
>>>>
>>>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
>>>>> caller has CAP_SYS_RESOURCE
>>>>>
>>>>> Hi,
>>>>>
>>>>> In virtiofs, the actual file server is the virtiofsd daemon running on the host.
>>>>> There we have a mode where xattrs can be remapped to something else.
>>>>> For example, security.selinux can be remapped to
>>>>> user.virtiofsd.security.selinux on the host.
>>>> This would seem to provide mechanism whereby a user can violate
>>>> SELinux policy quite easily.
>>> Hi Casey,
>>>
>>> As David already replied, we are not bypassing the host's SELinux policy (if
>>> there is one). We are just trying to provide a mode where the host's and
>>> guest's SELinux policies can co-exist without interfering
>>> with each other.
>>>
>>> By remapping the guest's SELinux xattrs (and not the host's SELinux xattrs),
>>> a file will probably have two xattrs:
>>>
>>> "security.selinux" and "user.virtiofsd.security.selinux". Host will
>>> enforce SELinux policy based on security.selinux xattr and guest
>>> will see the SELinux info stored in "user.virtiofsd.security.selinux"
>>> and guest SELinux policy will enforce rules based on that.
>>> (user.virtiofsd.security.selinux will be remapped to "security.selinux"
>>> when guest does getxattr()).
>>>
>>> IOW, this mode allows both host and guest SELinux policies to
>>> co-exist and not interfere with each other. (Remapping the guest's
>>> SELinux xattr does not change the host's SELinux label and does not
>>> bypass the host's SELinux policy.)
>>>
>>> virtiofsd also provides a mode where, if a guest process sets an
>>> SELinux xattr, it shows up as security.selinux on the host. But then we
>>> have multiple issues. There are two SELinux policies (host and guest)
>>> which are operating on the same label. And there is a very good chance
>>> that two have not been written in such a way that they work with
>>> each other. In fact there does not seem to exist a notion where
>>> two different SELinux policies are operating on same label.
>>>
>>> At high level, this is in a way similar to files created on
>>> virtio-blk devices. Say this device is backed by a foo.img file
>>> on host. Now host selinux policy will set its own label on
>>> foo.img and provide access control while labels created by guest
>>> are not seen or controlled by host's SELinux policy. Only guest
>>> SELinux policy works with those labels.
>>>
>>> So this is a similar kind of attempt: provide isolation between
>>> the host's and guest's SELinux labels so that the two policies can
>>> co-exist and not interfere with each other.
>>>
>>>>> This remapping is useful when SELinux is enabled in the guest and virtiofs
>>>>> is being used as rootfs. Guest and host SELinux policy might not match
>>>>> and host policy might deny security.selinux xattr setting by guest
>>>>> onto host. Or host might have SELinux disabled and in that case to
>>>>> be able to set security.selinux xattr, virtiofsd will need to have
>>>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
>>>>> guest security.selinux (or other xattrs) on host to something else
>>>>> is also better from security point of view.
>>>> Can you please provide some rationale for this assertion?
>>>> I have been working with security xattrs longer than anyone
>>>> and have trouble accepting the statement.
>>> If guest is not able to interfere or change host's SELinux labels
>>> directly, it sounded better.
>>>
>>> Irrespective of this, my primary concern is that to allow guest
>>> VM to be able to use SELinux seamlessly in diverse host OS
>>> environments (typical of cloud deployments). And being able to
>>> provide a mode where host and guest's security labels can
>>> co-exist and policies can work independently, should be able
>>> to achieve that goal.
>>>
>>>>> But when we try this, we noticed that SELinux relabeling in guest
>>>>> is failing on some symlinks. When I debugged a little more, I
>>>>> came to know that "user.*" xattrs are not allowed on symlinks
>>>>> or special files.
>>>>>
>>>>> "man xattr" seems to suggest that primary reason to disallow is
>>>>> that arbitrary users can set unlimited amount of "user.*" xattrs
>>>>> on these files and bypass quota check.
>>>>>
>>>>> If that's the primary reason, I am wondering whether it is possible to relax
>>>>> the restrictions if the caller has CAP_SYS_RESOURCE. This capability
>>>>> allows the caller to bypass quota checks. So it should not be
>>>>> a problem, at least from a quota perspective.
>>>>>
>>>>> That will allow me to give CAP_SYS_RESOURCE to the virtiofs daemon
>>>>> and remap xattrs arbitrarily.
>>>> On a Smack system you should require CAP_MAC_ADMIN to remap
>>>> security. xattrs. It sounds like you're in serious danger of running afoul
>>>> of LSM attribute policy on a reasonable general level.
>>> I think I did not explain xattr remapping properly and that's why this
>>> confusion is there. Only the guest's xattrs will be remapped and not
>>> the host's xattrs. So one cannot bypass any access control implemented
>>> by any of the LSM on host.
>>>
>>> Thanks
>>> Vivek
>>>
>> I want to point out that this also solves a couple of other problems.
> I am not (usually) averse to solving problems. My concern is with
> regard to creating new ones.
>
>> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
> Right. Which is as it should be.
> Also, s/SELinux/a LSM that uses security xattrs/
>
>>    This means if you want to run a container or VM
> A container uses the kernel from the host. A VM uses the kernel
> from the guest. Unless you're calling a VM a container for
> marketing purposes. If this scheme works for non-VM based containers
> there's a problem.
That is your definition of a container.  Our definition includes
container workloads within kvm separation along with their own kernels
(Kata and libkrun), as opposed to VM workloads, which run full operating
system workloads including systemd, logging, cron, sshd ...
>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> or does it get run as root like most system daemons? If it runs as root the argument
> has no legs.
I believe it should almost always get run with limited privileges; we
are opening a hole from the kvm separated workload into the host.  If
there is a bug in virtiofsd, it can be used to attack the host.
>>    If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> You could fix that easily enough by teaching SELinux about the proper
> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> going to happen, and why it would be considered philosophically repugnant
> in the SELinux community.
Sure, but this ignores the more important next comment.
>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run it with a confined SELinux label that only allows it to setxattr on the content in the designated directory, making the container/VM much more secure.
>>
> User xattrs are less protected than security xattrs. You are exposing the
> security xattrs on the guest to the possible whims of a malicious, unprivileged
> actor on the host. All it needs is the right UID.
>
> We have unused xattr namespaces. Would using the "trusted" namespace
> work for your purposes?
>
No because they bring their own issues, and can not be used without 
CAP_SYS_ADMIN.

My number one concern is attacks from the kvm separated work space 
against the host, since virtiofsd is opening up the attack vector.  
Running it with the least privs possible from the MAC and DAC point of 
view is the goal.

Dr. David Alan Gilbert June 29, 2021, 9 a.m. UTC | #10

* Casey Schaufler (casey@schaufler-ca.com) wrote:
> On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote:
> > * Casey Schaufler (casey@schaufler-ca.com) wrote:
> >> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
> >>> On 6/28/21 09:17, Vivek Goyal wrote:
> >>>> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: Vivek Goyal <vgoyal@redhat.com>
> >>>>>> Sent: Friday, June 25, 2021 12:12 PM
> >>>>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> >>>>>> viro@zeniv.linux.org.uk
> >>>>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
> >>>>>> berrange@redhat.com; vgoyal@redhat.com
> >>>>> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
> >>>>> and selinux@vger.kernel.org on this topic.
> >>>>>
> >>>>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
> >>>>>> caller has CAP_SYS_RESOURCE
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> In virtiofs, the actual file server is the virtiofsd daemon running on the host.
> >>>>>> There we have a mode where xattrs can be remapped to something else.
> >>>>>> For example, security.selinux can be remapped to
> >>>>>> user.virtiofsd.security.selinux on the host.
> >>>>> This would seem to provide mechanism whereby a user can violate
> >>>>> SELinux policy quite easily.
> >>>> Hi Casey,
> >>>>
> >>>> As David already replied, we are not bypassing the host's SELinux policy (if
> >>>> there is one). We are just trying to provide a mode where the host's and
> >>>> guest's SELinux policies can co-exist without interfering
> >>>> with each other.
> >>>>
> >>>> By remapping the guest's SELinux xattrs (and not the host's SELinux xattrs),
> >>>> a file will probably have two xattrs:
> >>>>
> >>>> "security.selinux" and "user.virtiofsd.security.selinux". Host will
> >>>> enforce SELinux policy based on security.selinux xattr and guest
> >>>> will see the SELinux info stored in "user.virtiofsd.security.selinux"
> >>>> and guest SELinux policy will enforce rules based on that.
> >>>> (user.virtiofsd.security.selinux will be remapped to "security.selinux"
> >>>> when guest does getxattr()).
> >>>>
> >>>> IOW, this mode allows both host and guest SELinux policies to
> >>>> co-exist and not interfere with each other. (Remapping the guest's
> >>>> SELinux xattr does not change the host's SELinux label and does not
> >>>> bypass the host's SELinux policy.)
> >>>>
> >>>> virtiofsd also provides a mode where, if a guest process sets an
> >>>> SELinux xattr, it shows up as security.selinux on the host. But then we
> >>>> have multiple issues. There are two SELinux policies (host and guest)
> >>>> which are operating on the same label. And there is a very good chance
> >>>> that two have not been written in such a way that they work with
> >>>> each other. In fact there does not seem to exist a notion where
> >>>> two different SELinux policies are operating on same label.
> >>>>
> >>>> At high level, this is in a way similar to files created on
> >>>> virtio-blk devices. Say this device is backed by a foo.img file
> >>>> on host. Now host selinux policy will set its own label on
> >>>> foo.img and provide access control while labels created by guest
> >>>> are not seen or controlled by host's SELinux policy. Only guest
> >>>> SELinux policy works with those labels.
> >>>>
> >>>> So this is a similar kind of attempt: provide isolation between
> >>>> the host's and guest's SELinux labels so that the two policies can
> >>>> co-exist and not interfere with each other.
> >>>>
> >>>>>> This remapping is useful when SELinux is enabled in the guest and virtiofs
> >>>>>> is being used as rootfs. Guest and host SELinux policy might not match
> >>>>>> and host policy might deny security.selinux xattr setting by guest
> >>>>>> onto host. Or host might have SELinux disabled and in that case to
> >>>>>> be able to set security.selinux xattr, virtiofsd will need to have
> >>>>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
> >>>>>> guest security.selinux (or other xattrs) on host to something else
> >>>>>> is also better from security point of view.
> >>>>> Can you please provide some rationale for this assertion?
> >>>>> I have been working with security xattrs longer than anyone
> >>>>> and have trouble accepting the statement.
> >>>> If guest is not able to interfere or change host's SELinux labels
> >>>> directly, it sounded better.
> >>>>
> >>>> Irrespective of this, my primary concern is that to allow guest
> >>>> VM to be able to use SELinux seamlessly in diverse host OS
> >>>> environments (typical of cloud deployments). And being able to
> >>>> provide a mode where host and guest's security labels can
> >>>> co-exist and policies can work independently, should be able
> >>>> to achieve that goal.
> >>>>
> >>>>>> But when we try this, we noticed that SELinux relabeling in guest
> >>>>>> is failing on some symlinks. When I debugged a little more, I
> >>>>>> came to know that "user.*" xattrs are not allowed on symlinks
> >>>>>> or special files.
> >>>>>>
> >>>>>> "man xattr" seems to suggest that primary reason to disallow is
> >>>>>> that arbitrary users can set unlimited amount of "user.*" xattrs
> >>>>>> on these files and bypass quota check.
> >>>>>>
> >>>>>> If that's the primary reason, I am wondering is it possible to relax
> >>>>>> the restrictions if caller has CAP_SYS_RESOURCE. This capability
> >>>>>> allows caller to bypass quota checks. So it should not be
> >>>>>> a problem at least from quota perspective.
> >>>>>>
> >>>>>> That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
> >>>>>> and remap xattrs arbitrarily.
> >>>>> On a Smack system you should require CAP_MAC_ADMIN to remap
> >>>>> security. xattrs. It sounds like you're in serious danger of running afoul
> >>>>> of LSM attribute policy on a reasonable general level.
> >>>> I think I did not explain xattr remapping properly and that's why this
> >>>> confusion is there. Only guest's xattrs will be remapped and not
> >>>> host's xattr. So one can not bypass any access control implemented
> >>>> by any of the LSM on host.
> >>>>
> >>>> Thanks
> >>>> Vivek
> >>>>
> >>> I want to point out that this solves a  couple of other problems also. 
> >> I am not (usually) adverse to solving problems. My concern is with
> >> regard to creating new ones.
> >>
> >>> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
> >> Right. Which is as it should be.
> >> Also, s/SELinux/a LSM that uses security xattrs/
> >>
> >>>   This means if you want to run a container or VM
> >> A container uses the kernel from the host. A VM uses the kernel
> >> from the guest. Unless you're calling a VM a container for
> >> marketing purposes. If this scheme works for non-VM based containers
> >> there's a problem.
> > And 'kata' is its own kernel, but more like a container runtime - would
> > you like to call this a VM or a container?
> 
> I would call it a VM.
> 
> On the other hand, there has been a concerted effort to ensure that there
> is no technical definition of a container. I hope to exploit this for
> personal wealth and glory before too long myself. If kata wants to identify
> as a container, who am I to say otherwise?
> 
> > There's whole bunch of variations people are playing around with; I don't
> > think there's a single answer, or a single way people are trying to use
> > it.
> 
> Just so.
> 
> >>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
> >> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
> >> or does it get run as root like most system daemons? If it runs as root the argument
> >> has no legs.
> > It's typically run without CAP_SYS_ADMIN; (although we have other
> > problems, like wanting to use file handles that make caps tricky).
> > Some people are trying to run it in user namespaces.
> > Given that it's pretty complex and playing with lots of file syscalls
> > under partial control of the guest, giving it as few capabilities
> > as possible is my preference.
> 
> It would be mine as well. I expect/fear that many developers find
> capabilities too complicated to work with and drop back to good old
> fashioned root. The whole rationale for user namespaces seems to be
> that it makes running as root in the namespace "safe".

We're trying to be good with capabilities, basically locking it down
until we trip over one of them and then think about it and enable it
where appropriate;  the difficulty is that capabilities are only a bit
better than root; they're still fairly coarse-grained - like in this case
where you're pushed towards a wide-ranging CAP even though you only
want to give the user a trivial extra thing.
(We have a similar problem wanting to allow separate threads to
be in separate directories, but that requires unshare and that requires
another capability)

> >>>   If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
> >> You could fix that easily enough by teaching SELinux about the proper
> >> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
> >> going to happen, and why it would be considered philosophically repugnant
> >> in the SELinux community. 
> >>
> >>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, making the container/VM much more secure.
> >>>
> >> User xattrs are less protected than security xattrs. You are exposing the
> >> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >> actor on the host. All it needs is the right UID.
> > Yep, we realise that; but when you're mainly interested in making sure
> > the guest can't attack the host, that's less worrying.
> 
> That's uncomfortable.

Why exactly?
IMHO the biggest problem is it's badly defined when you want to actually
share filesystems between guests or between guests and the host.

> > It would be lovely if there was something more granular, (e.g. allowing
> > user.NUMBER. or trusted.NUMBER. to be used by this particular guest).
> 
> We can't do that without breaking the "kernels aren't container aware"
> mandate. I suppose that if someone wanted to implement xattr namespaces
> (like user namespaces, not just the prefix) you could get away with that.
> Namespaces for everything. :)

Right, it's namespaces that we've used in most places to give ourselves
the isolation.

I doubt we're the only case that wants a way to do xattr separation; you
get lots of weird cases where it pops up (e.g. stacked overlayfs)

Dave

> >> We have unused xattr namespaces. Would using the "trusted" namespace
> >> work for your purposes?
> > For those with CAP_SYS_ADMIN I guess.
> >
> > Note the virtiofsd takes an option allowing you to set the mapping
> > however you like, so there's no hard coded user. or trusted. in the
> > daemon itself.
> >
> > Dave
> >
> >>
>

Casey Schaufler June 29, 2021, 2:38 p.m. UTC | #11

On 6/29/2021 2:00 AM, Dr. David Alan Gilbert wrote:
> * Casey Schaufler (casey@schaufler-ca.com) wrote:
>> On 6/28/2021 9:28 AM, Dr. David Alan Gilbert wrote:
>>> * Casey Schaufler (casey@schaufler-ca.com) wrote:
>>>> On 6/28/2021 6:36 AM, Daniel Walsh wrote:
>>>>> On 6/28/21 09:17, Vivek Goyal wrote:
>>>>>> On Fri, Jun 25, 2021 at 09:49:51PM +0000, Schaufler, Casey wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Vivek Goyal <vgoyal@redhat.com>
>>>>>>>> Sent: Friday, June 25, 2021 12:12 PM
>>>>>>>> To: linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
>>>>>>>> viro@zeniv.linux.org.uk
>>>>>>>> Cc: virtio-fs@redhat.com; dwalsh@redhat.com; dgilbert@redhat.com;
>>>>>>>> berrange@redhat.com; vgoyal@redhat.com
>>>>>>> Please include Linux Security Module list <linux-security-module@vger.kernel.org>
>>>>>>> and selinux@vger.kernel.org on this topic.
>>>>>>>
>>>>>>>> Subject: [RFC PATCH 0/1] xattr: Allow user.* xattr on symlink/special files if
>>>>>>>> caller has CAP_SYS_RESOURCE
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In virtiofs, actual file server is virtiofsd daemon running on host.
>>>>>>>> There we have a mode where xattrs can be remapped to something else.
>>>>>>>> For example security.selinux can be remapped to
>>>>>>>> user.virtiofsd.security.selinux on the host.
>>>>>>> This would seem to provide mechanism whereby a user can violate
>>>>>>> SELinux policy quite easily.
>>>>>> Hi Casey,
>>>>>>
>>>>>> As David already replied, we are not bypassing host's SELinux policy (if
>>>>>> there is one). We are just trying to provide a mode where host and
>>>>>> guest's SELinux policies could co-exist without interfering
>>>>>> with each other.
>>>>>>
>>>>>> By remapping guest's SELinux xattrs (and not host's SELinux xattrs),
>>>>>> a file probably will have two xattrs
>>>>>>
>>>>>> "security.selinux" and "user.virtiofsd.security.selinux". Host will
>>>>>> enforce SELinux policy based on security.selinux xattr and guest
>>>>>> will see the SELinux info stored in "user.virtiofsd.security.selinux"
>>>>>> and guest SELinux policy will enforce rules based on that.
>>>>>> (user.virtiofsd.security.selinux will be remapped to "security.selinux"
>>>>>> when guest does getxattr()).
>>>>>>
>>>>>> IOW, this mode is allowing both host and guest SELinux policies to
>>>>>> co-exist and not interfere with each other. (Remapping guest's
>>>>>> SELinux xattr is not changing host's SELinux label and is not
>>>>>> bypassing host's SELinux policy).
>>>>>>
>>>>>> virtiofsd also provides for the mode where if guest process sets
>>>>>> SELinux xattr it shows up as security.selinux on host. But now we
>>>>>> have multiple issues. There are two SELinux policies (host and guest)
>>>>>> which are operating on same label. And there is a very good chance
>>>>>> that two have not been written in such a way that they work with
>>>>>> each other. In fact there does not seem to exist a notion where
>>>>>> two different SELinux policies are operating on same label.
>>>>>>
>>>>>> At high level, this is in a way similar to files created on
>>>>>> virtio-blk devices. Say this device is backed by a foo.img file
>>>>>> on host. Now host selinux policy will set its own label on
>>>>>> foo.img and provide access control while labels created by guest
>>>>>> are not seen or controlled by host's SELinux policy. Only guest
>>>>>> SELinux policy works with those labels.
>>>>>>
>>>>>> So this is similar kind of attempt. Provide isolation between
>>>>>> host and guest's SELinux labels so that two policies can
>>>>>> co-exist and not interfere with each other.
>>>>>>
>>>>>>>> This remapping is useful when SELinux is enabled in guest and virtiofs
>>>>>>>> is being used as rootfs. Guest and host SELinux policy might not match
>>>>>>>> and host policy might deny security.selinux xattr setting by guest
>>>>>>>> onto host. Or host might have SELinux disabled and in that case to
>>>>>>>> be able to set security.selinux xattr, virtiofsd will need to have
>>>>>>>> CAP_SYS_ADMIN (which we are trying to avoid). Being able to remap
>>>>>>>> guest security.selinux (or other xattrs) on host to something else
>>>>>>>> is also better from security point of view.
>>>>>>> Can you please provide some rationale for this assertion?
>>>>>>> I have been working with security xattrs longer than anyone
>>>>>>> and have trouble accepting the statement.
>>>>>> If guest is not able to interfere or change host's SELinux labels
>>>>>> directly, it sounded better.
>>>>>>
>>>>>> Irrespective of this, my primary concern is that to allow guest
>>>>>> VM to be able to use SELinux seamlessly in diverse host OS
>>>>>> environments (typical of cloud deployments). And being able to
>>>>>> provide a mode where host and guest's security labels can
>>>>>> co-exist and policies can work independently, should be able
>>>>>> to achieve that goal.
>>>>>>
>>>>>>>> But when we try this, we noticed that SELinux relabeling in guest
>>>>>>>> is failing on some symlinks. When I debugged a little more, I
>>>>>>>> came to know that "user.*" xattrs are not allowed on symlinks
>>>>>>>> or special files.
>>>>>>>>
>>>>>>>> "man xattr" seems to suggest that primary reason to disallow is
>>>>>>>> that arbitrary users can set unlimited amount of "user.*" xattrs
>>>>>>>> on these files and bypass quota check.
>>>>>>>>
>>>>>>>> If that's the primary reason, I am wondering is it possible to relax
>>>>>>>> the restrictions if caller has CAP_SYS_RESOURCE. This capability
>>>>>>>> allows caller to bypass quota checks. So it should not be
>>>>>>>> a problem at least from quota perspective.
>>>>>>>>
>>>>>>>> That will allow me to give CAP_SYS_RESOURCE to virtiofs daemon
>>>>>>>> and remap xattrs arbitrarily.
>>>>>>> On a Smack system you should require CAP_MAC_ADMIN to remap
>>>>>>> security. xattrs. It sounds like you're in serious danger of running afoul
>>>>>>> of LSM attribute policy on a reasonable general level.
>>>>>> I think I did not explain xattr remapping properly and that's why this
>>>>>> confusion is there. Only guest's xattrs will be remapped and not
>>>>>> host's xattr. So one can not bypass any access control implemented
>>>>>> by any of the LSM on host.
>>>>>>
>>>>>> Thanks
>>>>>> Vivek
>>>>>>
>>>>> I want to point out that this solves a  couple of other problems also. 
>>>> I am not (usually) adverse to solving problems. My concern is with
>>>> regard to creating new ones.
>>>>
>>>>> Currently virtiofsd attempts to write security attributes on the host, which is denied by default on systems without SELinux and no CAP_SYS_ADMIN.
>>>> Right. Which is as it should be.
>>>> Also, s/SELinux/a LSM that uses security xattrs/
>>>>
>>>>>   This means if you want to run a container or VM
>>>> A container uses the kernel from the host. A VM uses the kernel
>>>> from the guest. Unless you're calling a VM a container for
>>>> marketing purposes. If this scheme works for non-VM based containers
>>>> there's a problem.
>>> And 'kata' is its own kernel, but more like a container runtime - would
>>> you like to call this a VM or a container?
>> I would call it a VM.
>>
>> On the other hand, there has been a concerted effort to ensure that there
>> is no technical definition of a container. I hope to exploit this for
>> personal wealth and glory before too long myself. If kata wants to identify
>> as a container, who am I to say otherwise?
>>
>>> There's whole bunch of variations people are playing around with; I don't
>>> think there's a single answer, or a single way people are trying to use
>>> it.
>> Just so.
>>
>>>>> on a host without SELinux support but the VM has SELinux enabled, then virtiofsd needs CAP_SYS_ADMIN.  It would be much more secure if it only needed CAP_SYS_RESOURCE.
>>>> I don't know, so I'm asking. Does virtiofsd really get run with limited capabilities,
>>>> or does it get run as root like most system daemons? If it runs as root the argument
>>>> has no legs.
>>> It's typically run without CAP_SYS_ADMIN; (although we have other
>>> problems, like wanting to use file handles that make caps tricky).
>>> Some people are trying to run it in user namespaces.
>>> Given that it's pretty complex and playing with lots of file syscalls
>>> under partial control of the guest, giving it as few capabilities
>>> as possible is my preference.
>> It would be mine as well. I expect/fear that many developers find
>> capabilities too complicated to work with and drop back to good old
>> fashioned root. The whole rationale for user namespaces seems to be
>> that it makes running as root in the namespace "safe".
> We're trying to be good with capabilities, basically locking it down
> until we trip over one of them and then think about it and enable it
> where appropriate;  the difficulty is that capabilities are only a bit
> better than root; they're still fairly coarse-grained - like in this case
> where you're pushed towards a wide-ranging CAP even though you only
> want to give the user a trivial extra thing.
> (We have a similar problem wanting to allow separate threads to
> be in separate directories, but that requires unshare and that requires
> another capability)

Thank you for putting in the effort.

The primary value of capabilities has always been the disassociation
of privilege from the root UID. The granularity has always been contentious.
One UNIX system went the fine granularity route and ended up with 330.
Last I looked you'd need several hundred to give everyone who wants their
own special problem solved. I admit that a solution for the granularity
issue would be grand. I think we're looking at something simpler than
capabilities to achieve that, but we'll see.

>>>>>   If the host has SELinux enabled then it can run without CAP_SYS_ADMIN or CAP_SYS_RESOURCE, but it will only be allowed to write labels that the host system understands, any label not understood will be blocked. Not only this, but the label that is running virtiofsd pretty much has to run as unconfined, since it could be writing any SELinux label.
>>>> You could fix that easily enough by teaching SELinux about the proper
>>>> use of CAP_MAC_ADMIN. Alas, I understand that there's no way that's
>>>> going to happen, and why it would be considered philosophically repugnant
>>>> in the SELinux community. 
>>>>
>>>>> If virtiofsd is writing user xattrs with CAP_SYS_RESOURCE, then we can run with a confined SELinux label only allowing it to setxattr on the content in the designated directory, making the container/VM much more secure.
>>>>>
>>>> User xattrs are less protected than security xattrs. You are exposing the
>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>>>> actor on the host. All it needs is the right UID.
>>> Yep, we realise that; but when you're mainly interested in making sure
>>> the guest can't attack the host, that's less worrying.
>> That's uncomfortable.
> Why exactly?

If a mechanism is designed with a known vulnerability you
fail your validation/evaluation efforts. Your mechanism is
less general because other potential use cases may not be
as cavalier about the vulnerability. I think that you can
approach this differently, get a solution that does everything
you want, and avoid the known problem.

> IMHO the biggest problem is it's badly defined when you want to actually
> share filesystems between guests or between guests and the host.

Right. The filesystem isn't the right layer for mapping xattrs.


>>> It would be lovely if there was something more granular, (e.g. allowing
>>> user.NUMBER. or trusted.NUMBER. to be used by this particular guest).
>> We can't do that without breaking the "kernels aren't container aware"
>> mandate. I suppose that if someone wanted to implement xattr namespaces
>> (like user namespaces, not just the prefix) you could get away with that.
>> Namespaces for everything. :)
> Right, it's namespaces that we've used in most places to give ourselves
> the isolation.
>
> I doubt we're the only case that wants a way to do xattr separation; you
> get lots of weird cases where it pops up (e.g. stacked overlayfs)

I can't say that I'm a major fan of namespace proliferation,
(time namespaces? really?) but what you've outlined is a
filesystem specific implementation of xattr namespaces. We've
looked into similar mechanisms for LSM specific namespaces.
When you see multiple use-case specific implementations of
the same thing, it's time to consider a general solution.

>
> Dave
>
>>>> We have unused xattr namespaces. Would using the "trusted" namespace
>>>> work for your purposes?
>>> For those with CAP_SYS_ADMIN I guess.
>>>
>>> Note the virtiofsd takes an option allowing you to set the mapping
>>> however you like, so there's no hard coded user. or trusted. in the
>>> daemon itself.
>>>
>>> Dave
>>>

Vivek Goyal June 29, 2021, 3:20 p.m. UTC | #12

On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:

[..]
> >>>> User xattrs are less protected than security xattrs. You are exposing the
> >>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >>>> actor on the host. All it needs is the right UID.
> >>> Yep, we realise that; but when you're mainly interested in making sure
> >>> the guest can't attack the host, that's less worrying.
> >> That's uncomfortable.
> > Why exactly?
> 
> If a mechanism is designed with a known vulnerability you
> fail your validation/evaluation efforts.

We are working with the constraint that the shared directory should not be
accessible to unprivileged users on the host. And with that constraint, what
you are referring to is not a vulnerability.

> Your mechanism is
> less general because other potential use cases may not be
> as cavalier about the vulnerability.

Prefixing xattrs with "user.virtiofsd" is just one of the options.
virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
We have not chosen that because we don't want to give it CAP_SYS_ADMIN.

So other use cases which don't like prefixing "user.virtiofsd" can
give virtiofsd CAP_SYS_ADMIN and use "trusted.virtiofsd" instead.
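
To make the prefix options concrete, here is a minimal sketch of the name remapping described in this thread. It is not virtiofsd's actual code or option syntax; the function names and the choice to hide un-prefixed host xattrs from the guest are assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not virtiofsd's implementation.
HOST_PREFIX = "user.virtiofsd."  # assumption; "trusted.virtiofsd." is the other option


def guest_to_host(name, prefix=HOST_PREFIX):
    """Name stored on the host for an xattr the guest sets."""
    return prefix + name


def host_to_guest(name, prefix=HOST_PREFIX):
    """Name shown to the guest for a host xattr, or None if it is hidden."""
    if name.startswith(prefix):
        return name[len(prefix):]
    # Un-prefixed host xattrs (e.g. the host's own security.selinux) are
    # assumed to stay invisible to the guest in this sketch.
    return None
```

With this mapping a guest setxattr of security.selinux lands in user.virtiofsd.security.selinux on the host, and the host's own security.selinux is never touched.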

> I think that you can
> approach this differently, get a solution that does everything
> you want, and avoid the known problem.

What's the solution? Are you referring to using "trusted.*" instead? But
that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.
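
The trade-off can be written down as a small model. This is a simplified sketch of the rules discussed here (user.* only on regular files and directories, trusted.* only with CAP_SYS_ADMIN), with the RFC's proposed relaxation behind a flag. It ignores ownership and mode-bit checks, and the function and parameter names are hypothetical.

```python
import stat

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"
CAP_SYS_RESOURCE = "CAP_SYS_RESOURCE"


def may_set_xattr(name, mode, caps, relax_with_sys_resource=False):
    """Simplified model: may a caller holding `caps` set `name` on a file of type `mode`?"""
    if name.startswith("trusted."):
        # trusted.* is reserved for callers with CAP_SYS_ADMIN.
        return CAP_SYS_ADMIN in caps
    if name.startswith("user."):
        if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
            return True
        # Symlinks and special files: refused today; the RFC would allow
        # it for callers holding CAP_SYS_RESOURCE.
        return relax_with_sys_resource and CAP_SYS_RESOURCE in caps
    raise ValueError("namespace not modelled in this sketch: " + name)
```

In this model a daemon holding only CAP_SYS_RESOURCE can label symlinks under user.virtiofsd.* once the relaxation is enabled, but still cannot use trusted.virtiofsd.*.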

Thanks
Vivek

Casey Schaufler June 29, 2021, 4:13 p.m. UTC | #13

On 6/29/2021 8:20 AM, Vivek Goyal wrote:
> On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
>
> [..]
>>>>>> User xattrs are less protected than security xattrs. You are exposing the
>>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>>>>>> actor on the host. All it needs is the right UID.
>>>>> Yep, we realise that; but when you're mainly interested in making sure
>>>>> the guest can't attack the host, that's less worrying.
>>>> That's uncomfortable.
>>> Why exactly?
>> If a mechanism is designed with a known vulnerability you
>> fail your validation/evaluation efforts.
> We are working with the constraint that shared directory should not be
> accessible to unprivileged users on host. And with that constraint, what
> you are referring to is not a vulnerability.

Sure, that's quite reasonable for your use case. It doesn't mean
that the vulnerability doesn't exist, it means you've mitigated it. 


>> Your mechanism is
>> less general because other potential use cases may not be
>> as cavalier about the vulnerability.
> Prefixing xattrs with "user.virtiofsd" is just one of the options.
> virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
> We have not chosen that because we don't want to give it CAP_SYS_ADMIN.
>
> So other use cases which don't like prefixing "user.virtiofsd", can
> give CAP_SYS_ADMIN and work with it.
>
>> I think that you can
>> approach this differently, get a solution that does everything
>> you want, and avoid the known problem.
> What's the solution? Are you referring to using "trusted.*" instead? But
> that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.

I'm coming to the conclusion that xattr namespaces, analogous
to user namespaces, are the correct solution. They generalize
for multiple filesystem and LSM use cases. The use of namespaces
is well understood, especially in the container community. It
looks to me as if it would address your use case swimmingly.

>
> Thanks
> Vivek
>

Theodore Ts'o June 29, 2021, 4:25 p.m. UTC | #14

On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
> > IMHO the biggest problem is it's badly defined when you want to actually
> > share filesystems between guests or between guests and the host.
> 
> Right. The filesystem isn't the right layer for mapping xattrs.

Well, let's enumerate the alternatives:

* Some kind of stackable LSM?
* Some kind of FUSE-like scheme?
* Adding an eBPF hook which can perform the mapping

The last may be the best bet, since different use cases can use
different eBPF programs.  The eBPF script can handle both the mapping
as well as some kind of specialized access control with respect to what
entities are allowed to set or get xattrs.

> >>> It would be lovely if there was something more granular, (e.g. allowing
> >>> user.NUMBER. or trusted.NUMBER. to be used by this particular guest).
> >> We can't do that without breaking the "kernels aren't container aware"
> >> mandate.

eBPF scripts, since they are supplied by the user *can* be container
aware.  :-)

						- Ted

Dr. David Alan Gilbert June 29, 2021, 4:35 p.m. UTC | #15

* Casey Schaufler (casey@schaufler-ca.com) wrote:
> On 6/29/2021 8:20 AM, Vivek Goyal wrote:
> > On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
> >
> > [..]
> >>>>>> User xattrs are less protected than security xattrs. You are exposing the
> >>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >>>>>> actor on the host. All it needs is the right UID.
> >>>>> Yep, we realise that; but when you're mainly interested in making sure
> >>>>> the guest can't attack the host, that's less worrying.
> >>>> That's uncomfortable.
> >>> Why exactly?
> >> If a mechanism is designed with a known vulnerability you
> >> fail your validation/evaluation efforts.
> > We are working with the constraint that shared directory should not be
> > > accessible to unprivileged users on host. And with that constraint, what
> > you are referring to is not a vulnerability.
> 
> Sure, that's quite reasonable for your use case. It doesn't mean
> that the vulnerability doesn't exist, it means you've mitigated it. 
> 
> 
> >> Your mechanism is
> >> less general because other potential use cases may not be
> >> as cavalier about the vulnerability.
> > Prefixing xattrs with "user.virtiofsd" is just one of the options.
> > virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
> > We have not chosen that because we don't want to give it CAP_SYS_ADMIN.
> >
> > So other use cases which don't like prefixing "user.virtiofsd", can
> > give CAP_SYS_ADMIN and work with it.
> >
> >> I think that you can
> >> approach this differently, get a solution that does everything
> >> you want, and avoid the known problem.
> > What's the solution? Are you referring to using "trusted.*" instead? But
> > that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.
> 
> I'm coming to the conclusion that xattr namespaces, analogous
> to user namespaces, are the correct solution. They generalize
> for multiple filesystem and LSM use cases. The use of namespaces
> is well understood, especially in the container community. It
> looks to me as if it would address your use case swimmingly.

Yeh; although getting the details of the semantics right is tricky;
in particular, the stuff which clears capabilities/setuid/etc on writes
- should it clear xattrs that represent capabilities?  If the host
performs a write, should it clear mapped xattr capabilities?  If the
namespace performs a write should it clear just the mapped ones or the
host ones as well?  Our virtiofsd code performs painful acrobatics to
make sure they get cleared on write.
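
The open questions can be laid out as a small sketch. Nothing here settles the semantics; it only enumerates the choices, assuming file capabilities live in security.capability and that the guest's copy is stored under a user.virtiofsd. prefix. The function name and the clear_both knob are hypothetical.

```python
CAPS_XATTR = "security.capability"  # xattr that holds file capabilities


def xattrs_to_clear(host_names, writer, prefix="user.virtiofsd.", clear_both=False):
    """Return the host-side xattr names to drop after a write.

    writer is "host" or "guest". clear_both is a hypothetical policy knob:
    when set, any write clears both the real and the remapped xattr.
    """
    real = CAPS_XATTR
    mapped = prefix + CAPS_XATTR
    if clear_both:
        wanted = {real, mapped}
    elif writer == "guest":
        wanted = {mapped}
    else:
        wanted = {real}
    return [name for name in host_names if name in wanted]
```

Each branch corresponds to one possible answer to the questions above; which one is right is exactly what would need to be decided.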

Dave

> >
> > Thanks
> > Vivek
> >
>

Casey Schaufler June 29, 2021, 4:51 p.m. UTC | #16

On 6/29/2021 9:35 AM, Dr. David Alan Gilbert wrote:
> * Casey Schaufler (casey@schaufler-ca.com) wrote:
>> On 6/29/2021 8:20 AM, Vivek Goyal wrote:
>>> On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
>>>
>>> [..]
>>>>>>>> User xattrs are less protected than security xattrs. You are exposing the
>>>>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>>>>>>>> actor on the host. All it needs is the right UID.
>>>>>>> Yep, we realise that; but when you're mainly interested in making sure
>>>>>>> the guest can't attack the host, that's less worrying.
>>>>>> That's uncomfortable.
>>>>> Why exactly?
>>>> If a mechanism is designed with a known vulnerability you
>>>> fail your validation/evaluation efforts.
>>> We are working with the constraint that shared directory should not be
>>> accessible to unprivileged users on host. And with that constraint, what
>>> you are referring to is not a vulnerability.
>> Sure, that's quite reasonable for your use case. It doesn't mean
>> that the vulnerability doesn't exist, it means you've mitigated it. 
>>
>>
>>>> Your mechanism is
>>>> less general because other potential use cases may not be
>>>> as cavalier about the vulnerability.
>>> Prefixing xattrs with "user.virtiofsd" is just one of the options.
>>> virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
>>> We have not chosen that because we don't want to give it CAP_SYS_ADMIN.
>>>
>>> So other use cases which don't like prefixing "user.virtiofsd", can
>>> give CAP_SYS_ADMIN and work with it.
>>>
>>>> I think that you can
>>>> approach this differently, get a solution that does everything
>>>> you want, and avoid the known problem.
>>> What's the solution? Are you referring to using "trusted.*" instead? But
>>> that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.
>> I'm coming to the conclusion that xattr namespaces, analogous
>> to user namespaces, are the correct solution. They generalize
>> for multiple filesystem and LSM use cases. The use of namespaces
>> is well understood, especially in the container community. It
>> looks to me as if it would address your use case swimmingly.
> Yeh; although getting the details of the semantics right is tricky;
> in particular, the stuff which clears capabilities/setuid/etc on writes
> - should it clear xattrs that represent capabilities?  If the host
> performs a write, should it clear mapped xattr capabilities?  If the
> namespace performs a write should it clear just the mapped ones or the
> host ones as well?  Our virtiofsd code performs painful acrobatics to
> make sure they get cleared on write.

Dealing with tricky semantics is the difference between a feature
and a hack. Doing so in a way that other people can take advantage
of the feature is the hallmark of a feature well done.

>
> Dave
>
>>> Thanks
>>> Vivek
>>>

Vivek Goyal June 29, 2021, 5:35 p.m. UTC | #17

On Tue, Jun 29, 2021 at 09:13:48AM -0700, Casey Schaufler wrote:
> On 6/29/2021 8:20 AM, Vivek Goyal wrote:
> > On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
> >
> > [..]
> >>>>>> User xattrs are less protected than security xattrs. You are exposing the
> >>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
> >>>>>> actor on the host. All it needs is the right UID.
> >>>>> Yep, we realise that; but when you're mainly interested in making sure
> >>>>> the guest can't attack the host, that's less worrying.
> >>>> That's uncomfortable.
> >>> Why exactly?
> >> If a mechanism is designed with a known vulnerability you
> >> fail your validation/evaluation efforts.
> > We are working with the constraint that shared directory should not be
> > > accessible to unprivileged users on host. And with that constraint, what
> > you are referring to is not a vulnerability.
> 
> Sure, that's quite reasonable for your use case. It doesn't mean
> that the vulnerability doesn't exist, it means you've mitigated it. 
> 
> 
> >> Your mechanism is
> >> less general because other potential use cases may not be
> >> as cavalier about the vulnerability.
> > Prefixing xattrs with "user.virtiofsd" is just one of the options.
> > virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
> > We have not chosen that because we don't want to give it CAP_SYS_ADMIN.
> >
> > So other use cases which don't like prefixing "user.virtiofsd", can
> > give CAP_SYS_ADMIN and work with it.
> >
> >> I think that you can
> >> approach this differently, get a solution that does everything
> >> you want, and avoid the known problem.
> > What's the solution? Are you referring to using "trusted.*" instead? But
> > that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.
> 
> I'm coming to the conclusion that xattr namespaces, analogous
> to user namespaces, are the correct solution. They generalize
> for multiple filesystem and LSM use cases. The use of namespaces
> is well understood, especially in the container community. It
> looks to me as if it would address your use case swimmingly.

Even if xattrs were namespaced, I am not sure it solves the issue
of unprivileged UID being able to modify security xattrs of file.
If it happens to be correct UID, it should be able to spin up a
user namespace and modify namespaced xattrs?

Anyway, once namespaced xattrs are available, I will gladly make use
of it. But that probably should not be a blocker for this patch.

Vivek

Daniel Walsh June 29, 2021, 8:28 p.m. UTC | #18

On 6/29/21 13:35, Vivek Goyal wrote:
> On Tue, Jun 29, 2021 at 09:13:48AM -0700, Casey Schaufler wrote:
>> On 6/29/2021 8:20 AM, Vivek Goyal wrote:
>>> On Tue, Jun 29, 2021 at 07:38:15AM -0700, Casey Schaufler wrote:
>>>
>>> [..]
>>>>>>>> User xattrs are less protected than security xattrs. You are exposing the
>>>>>>>> security xattrs on the guest to the possible whims of a malicious, unprivileged
>>>>>>>> actor on the host. All it needs is the right UID.
>>>>>>> Yep, we realise that; but when you're mainly interested in making sure
>>>>>>> the guest can't attack the host, that's less worrying.
>>>>>> That's uncomfortable.
>>>>> Why exactly?
>>>> If a mechanism is designed with a known vulnerability you
>>>> fail your validation/evaluation efforts.
>>> We are working with the constraint that shared directory should not be
>>> accessible to unprivileged users on host. And with that constraint, what
>>> you are referring to is not a vulnerability.
>> Sure, that's quite reasonable for your use case. It doesn't mean
>> that the vulnerability doesn't exist, it means you've mitigated it.
>>
>>
>>>> Your mechanism is
>>>> less general because other potential use cases may not be
>>>> as cavalier about the vulnerability.
>>> Prefixing xattrs with "user.virtiofsd" is just one of the options.
>>> virtiofsd has the capability to prefix "trusted.virtiofsd" as well.
>>> We have not chosen that because we don't want to give it CAP_SYS_ADMIN.
>>>
>>> So other use cases which don't like prefixing "user.virtiofsd", can
>>> give CAP_SYS_ADMIN and work with it.
>>>
>>>> I think that you can
>>>> approach this differently, get a solution that does everything
>>>> you want, and avoid the known problem.
>>> What's the solution? Are you referring to using "trusted.*" instead? But
>>> that has its own problem of giving CAP_SYS_ADMIN to virtiofsd.
>> I'm coming to the conclusion that xattr namespaces, analogous
>> to user namespaces, are the correct solution. They generalize
>> for multiple filesystem and LSM use cases. The use of namespaces
>> is well understood, especially in the container community. It
>> looks to me as if it would address your use case swimmingly.
> Even if xattrs were namespaced, I am not sure it solves the issue
> of unprivileged UID being able to modify security xattrs of file.
> If it happens to be correct UID, it should be able to spin up a
> user namespace and modify namespaced xattrs?
>
> Anyway, once namespaced xattrs are available, I will gladly make use
> of it. But that probably should not be a blocker for this patch.
>
> Vivek
>
All this conversation is great, and I look forward to a better solution, 
but if we go back to the patch, it was to fix an issue where the kernel 
is requiring CAP_SYS_ADMIN for writing user Xattrs on link files and 
other special files.

The documented reason for this is to prevent the users from using XATTRS 
to avoid quota.

The CAP_SYS_RESOURCE capability is defined to allow processes with this
capability to ignore quota.

This PR allows processes with CAP_SYS_RESOURCE to create user Xattrs.

To me this makes sense.

Is there any argument against this?
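
The rule being debated can be written down as a small model. This is only a sketch in Python, not kernel code: the function name and the `rfc_patch` / `has_cap_sys_resource` parameters are hypothetical, and it models just the file-type check described in xattr(7) plus the relaxation this RFC proposes. Ownership, sticky-directory and filesystem-specific checks are deliberately left out.

```python
import errno
import stat


def user_xattr_write_errno(mode, has_cap_sys_resource=False, rfc_patch=False):
    """Return 0 if a user.* xattr write passes the file-type check, else an errno.

    Simplified model (assumption): only the "regular files and directories"
    rule from xattr(7) and the CAP_SYS_RESOURCE exception proposed in this
    thread are represented.
    """
    if stat.S_ISREG(mode) or stat.S_ISDIR(mode):
        return 0  # allowed by this check; other permission checks not modelled
    if rfc_patch and has_cap_sys_resource:
        return 0  # proposed: caller may ignore quota, so allow the write
    return errno.EPERM  # symlinks and special files are refused


if __name__ == "__main__":
    symlink = stat.S_IFLNK | 0o777
    print(user_xattr_write_errno(symlink))                    # refused today
    print(user_xattr_write_errno(symlink, True, rfc_patch=True))  # allowed with the RFC
```

Keeping the capability test separate from the file-type test mirrors the argument above: the restriction exists for quota reasons, and CAP_SYS_RESOURCE is the capability that already overrides quota.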

Theodore Ts'o June 30, 2021, 4:12 a.m. UTC | #19

On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> All this conversation is great, and I look forward to a better solution, but
> if we go back to the patch, it was to fix an issue where the kernel is
> requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> special files.
> 
> The documented reason for this is to prevent the users from using XATTRS to
> avoid quota.

Huh?  Where is it so documented?  How file systems store and account
for space used by extended attributes is a file-system specific
question, but presumably any way that xattr's on regular files are
accounted could also be used for xattr's on special files.

Also, xattr's are limited to 32k, so it's not like users can evade
_that_ much quota space, at least not without it being pretty painful.
(Assuming that quota is even enabled, which most of the time, it
isn't.)

						- Ted

P.S.  I'll note that if ext4's ea_in_inode is enabled, for large
xattr's, if you have 2 million files that all have the same 12k
windows SID stored as an xattr, ext4 will store that xattr only once.
Those two million files might be owned by different uids, so we made
an explicit design choice not to worry about accounting for the quota
for said 12k xattr value.  After all, if you can save the space and
access cost of 2M * 12k if each file had to store its own copy of that
xattr, perhaps not including it in the quota calculation isn't that
bad.  :-)

We also don't account for the disk space used by symbolic links (since
sometimes they can be stored in the inode as fast symlinks, and
sometimes they might consume a data block).  But again, that's a file
system specific implementation question.

Dr. David Alan Gilbert June 30, 2021, 8:07 a.m. UTC | #20

* Theodore Ts'o (tytso@mit.edu) wrote:
> On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> > All this conversation is great, and I look forward to a better solution, but
> > if we go back to the patch, it was to fix an issue where the kernel is
> > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> > special files.
> > 
> > The documented reason for this is to prevent the users from using XATTRS to
> > avoid quota.
> 
> Huh?  Where is it so documented?

man xattr(7):

       The  file permission bits of regular files and directories are
       interpreted differently from the file permission bits of special
       files and symbolic links.  For regular files and directories the
       file permission bits define access to the file's contents,
       while for device special files they define access to the device
       described by the special file.  The file permissions of symbolic
       links are not used in access checks. *** These differences would
       allow users to consume filesystem resources in a way not
       controllable by disk quotas for group or world writable special
       files and directories.****

       ***For  this reason, user extended attributes are allowed only
       for regular files and directories ***, and access to user extended
       attributes is restricted to the owner and to users with appropriate
       capabilities for directories with the sticky bit set (see the
       chmod(1) manual page for an explanation of the sticky bit).

(***'s my addition)


Dave

>  How file systems store and account
> for space used by extended attributes is a file-system specific
> question, but presumably any way that xattr's on regular files are
> accounted could also be used for xattr's on special files.
> 
> Also, xattr's are limited to 32k, so it's not like users can evade
> _that_ much quota space, at least not without it being pretty painful.
> (Assuming that quota is even enabled, which most of the time, it
> isn't.)
> 
> 						- Ted
> 
> P.S.  I'll note that if ext4's ea_in_inode is enabled, for large
> xattr's, if you have 2 million files that all have the same 12k
> windows SID stored as an xattr, ext4 will store that xattr only once.
> Those two million files might be owned by different uids, so we made
> an explicit design choice not to worry about accounting for the quota
> for said 12k xattr value.  After all, if you can save the space and
> access cost of 2M * 12k if each file had to store its own copy of that
> xattr, perhaps not including it in the quota calculation isn't that
> bad.  :-)
> 
> We also don't account for the disk space used by symbolic links (since
> sometimes they can be stored in the inode as fast symlinks, and
> sometimes they might consume a data block).  But again, that's a file
> system specific implementation question.
>

Vivek Goyal June 30, 2021, 2:27 p.m. UTC | #21

On Wed, Jun 30, 2021 at 12:12:28AM -0400, Theodore Ts'o wrote:
> On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> > All this conversation is great, and I look forward to a better solution, but
> > if we go back to the patch, it was to fix an issue where the kernel is
> > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> > special files.
> > 
> > The documented reason for this is to prevent the users from using XATTRS to
> > avoid quota.
> 
> Huh?  Where is it so documented?

It's in "man xattr". David already copied and pasted the relevant section
in another email, so I am not doing it.

> How file systems store and account
> for space used by extended attributes is a file-system specific
> question,

> but presumably any way that xattr's on regular files are
> accounted could also be used for xattr's on special files.

That will be nice. I don't know enough about quota, but I am wondering
why quota limits can't be enforced (if needed) for symlinks and special
file xattrs.

Thanks
Vivek
> 
> Also, xattr's are limited to 32k, so it's not like users can evade
> _that_ much quota space, at least not without it being pretty painful.
> (Assuming that quota is even enabled, which most of the time, it
> isn't.)
> 
> 						- Ted
> 
> P.S.  I'll note that if ext4's ea_in_inode is enabled, for large
> xattr's, if you have 2 million files that all have the same 12k
> windows SID stored as an xattr, ext4 will store that xattr only once.
> Those two million files might be owned by different uids, so we made
> an explicit design choice not to worry about accounting for the quota
> for said 12k xattr value.  After all, if you can save the space and
> access cost of 2M * 12k if each file had to store its own copy of that
> xattr, perhaps not including it in the quota calculation isn't that
> bad.  :-)
> 
> We also don't account for the disk space used by symbolic links (since
> sometimes they can be stored in the inode as fast symlinks, and
> sometimes they might consume a data block).  But again, that's a file
> system specific implementation question.
>

Theodore Ts'o June 30, 2021, 2:47 p.m. UTC | #22

On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote:
> * Theodore Ts'o (tytso@mit.edu) wrote:
> > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> > > All this conversation is great, and I look forward to a better solution, but
> > > if we go back to the patch, it was to fix an issue where the kernel is
> > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> > > special files.
> > > 
> > > The documented reason for this is to prevent the users from using XATTRS to
> > > avoid quota.
> > 
> > Huh?  Where is it so documented?
> 
> man xattr(7):
>        The  file permission bits of regular files and directories are
>        interpreted differently from the file permission bits of special
>        files and symbolic links.  For regular files and directories the
>        file permission bits define access to the file's contents,
>        while for device special files they define access to the device
>        described by the special file.  The file permissions of symbolic
>        links are not used in access checks.

All of this is true...

>         *** These differences would
>        allow users to consume filesystem resources in a way not
>        controllable by disk quotas for group or world writable special
>        files and directories.****

Anyone with group write access to a regular file can append to the
file, and the blocks written will be charged the owner of the file.
So it's perfectly "controllable" by the quota system; if you have
group write access to a file, you can charge against the user's quota.
This is Working As Intended.

And the creation of device special files take the umask into account,
just like regular files, so if you have a umask that allows newly
created files to be group writeable, the same issue would occur for
regular files as device files.  Given that most users have a umask of
0077 or 0022, this is generally Not A Problem.

I think I see the issue which drove the above text, though, which is
that Linux's symlink(2) is creating symlinks which do not take umask
into account; that is, the permissions are always mode S_IFLNK|0777.

Hence, it might be that the right answer is to remove this fairly
arbitrary restriction entirely, and change symlink(2) so that it
creates files which respect the umask.  POSIX and SUS don't specify
what the permissions are that are used, and historically (before the
advent of xattrs) I suspect since it didn't matter, no one cared about
whether or not umask was applied.
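
The umask point can be checked from userspace. The following is a small sketch in Python (the helper name `modes_under_umask` is made up for illustration): it creates a regular file and a symlink under a restrictive umask and returns the permission bits of each. On Linux the regular file should lose its group/other bits while the symlink still reports 0777; other systems may behave differently.

```python
import os
import stat
import tempfile


def modes_under_umask(mask=0o077):
    """Create a file and a symlink under `mask`; return (file_mode, link_mode)."""
    old = os.umask(mask)
    try:
        with tempfile.TemporaryDirectory() as d:
            target = os.path.join(d, "file")
            link = os.path.join(d, "link")
            # Request 0666 so that the umask is the only thing removing bits.
            fd = os.open(target, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            os.symlink(target, link)
            # lstat() so that the symlink itself is inspected, not its target.
            return (stat.S_IMODE(os.lstat(target).st_mode),
                    stat.S_IMODE(os.lstat(link).st_mode))
    finally:
        os.umask(old)  # restore the process-wide umask


if __name__ == "__main__":
    file_mode, link_mode = modes_under_umask()
    print(oct(file_mode), oct(link_mode))
```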

Some people might object to such a change arguing that with
pre-existing file systems where there are symlinks which are
world-writeable, this might cause people to be able to charge up to
32k (or whatever the maximum size of the xattr supported by the file
system) for each symlink.  However, (a) very few people actually use
quotas, and this would only be an issue for those users, and (b) the
amount of quota "abuse" that could be carried out this way is small
enough that I'm not sure it matters.

     	    	  	      	  - Ted

Dr. David Alan Gilbert June 30, 2021, 3:01 p.m. UTC | #23

* Theodore Ts'o (tytso@mit.edu) wrote:
> On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote:
> > * Theodore Ts'o (tytso@mit.edu) wrote:
> > > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> > > > All this conversation is great, and I look forward to a better solution, but
> > > > if we go back to the patch, it was to fix an issue where the kernel is
> > > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> > > > special files.
> > > > 
> > > > The documented reason for this is to prevent the users from using XATTRS to
> > > > avoid quota.
> > > 
> > > Huh?  Where is it so documented?
> > 
> > man xattr(7):
> >        The  file permission bits of regular files and directories are
> >        interpreted differently from the file permission bits of special
> >        files and symbolic links.  For regular files and directories the
> >        file permission bits define access to the file's contents,
> >        while for device special files they define access to the device
> >        described by the special file.  The file permissions of symbolic
> >        links are not used in access checks.
> 
> All of this is true...
> 
> >         *** These differences would
> >        allow users to consume filesystem resources in a way not
> >        controllable by disk quotas for group or world writable special
> >        files and directories.****
> 
> Anyone with group write access to a regular file can append to the
> file, and the blocks written will be charged the owner of the file.
> So it's perfectly "controllable" by the quota system; if you have
> group write access to a file, you can charge against the user's quota.
> This is Working As Intended.
> 
> And the creation of device special files take the umask into account,
> just like regular files, so if you have a umask that allows newly
> created files to be group writeable, the same issue would occur for
> regular files as device files.  Given that most users have a umask of
> 0077 or 0022, this is generally Not A Problem.
> 
> I think I see the issue which drove the above text, though, which is
> that Linux's symlink(2) is creating symlinks which do not take umask
> into account; that is, the permissions are always mode S_IFLNK|0777.
> 
> Hence, it might be that the right answer is to remove this fairly
> arbitrary restriction entirely, and change symlink(2) so that it
> creates files which respects the umask.  Posix and SUS doesn't specify
> what the permissions are that are used, and historically (before the
> advent of xattrs) I suspect since it didn't matter, no one cared about
> whether or not umask was applied.
> 
> Some people might object to such a change arguing that with
> pre-existing file systems where there are symlinks which
> world-writeable, this might cause people to be able to charge up to
> 32k (or whatever the maximum size of the xattr supported by the file
> system) for each symlink.  However, (a) very few people actually use
> quotas, and this would only be an issue for those users, and (b) the
> amount of quota "abuse" that could be carried out this way is small
> enough that I'm not sure it matters.

Even if you fix symlinks, I don't think it fixes device nodes or
anything else where the permissions bitmap isn't purely used as the
permissions on the inode.

Dave

>      	    	  	      	  - Ted
>

Vivek Goyal June 30, 2021, 4:09 p.m. UTC | #24

On Wed, Jun 30, 2021 at 10:47:39AM -0400, Theodore Ts'o wrote:
> On Wed, Jun 30, 2021 at 09:07:56AM +0100, Dr. David Alan Gilbert wrote:
> > * Theodore Ts'o (tytso@mit.edu) wrote:
> > > On Tue, Jun 29, 2021 at 04:28:24PM -0400, Daniel Walsh wrote:
> > > > All this conversation is great, and I look forward to a better solution, but
> > > > if we go back to the patch, it was to fix an issue where the kernel is
> > > > requiring CAP_SYS_ADMIN for writing user Xattrs on link files and other
> > > > special files.
> > > > 
> > > > The documented reason for this is to prevent the users from using XATTRS to
> > > > avoid quota.
> > > 
> > > Huh?  Where is it so documented?
> > 
> > man xattr(7):
> >        The  file permission bits of regular files and directories are
> >        interpreted differently from the file permission bits of special
> >        files and symbolic links.  For regular files and directories the
> >        file permission bits define access to the file's contents,
> >        while for device special files they define access to the device
> >        described by the special file.  The file permissions of symbolic
> >        links are not used in access checks.
> 
> All of this is true...
> 
> >         *** These differences would
> >        allow users to consume filesystem resources in a way not
> >        controllable by disk quotas for group or world writable special
> >        files and directories.****
> 
> Anyone with group write access to a regular file can append to the
> file, and the blocks written will be charged the owner of the file.
> So it's perfectly "controllable" by the quota system; if you have
> group write access to a file, you can charge against the user's quota.
> This is Working As Intended.
> 
> And the creation of device special files take the umask into account,
> just like regular files, so if you have a umask that allows newly
> created files to be group writeable, the same issue would occur for
> regular files as device files.  Given that most users have a umask of
> 0077 or 0022, this is generally Not A Problem.
> 
> I think I see the issue which drove the above text, though, which is
> that Linux's symlink(2) is creating symlinks which do not take umask
> into account; that is, the permissions are always mode S_IFLNK|0777.

IIUC, the idea is to use the permission bits on a symlink to decide whether
the caller can read/write user.* xattrs (like a regular file). Hence create
symlinks while honoring umask (or the default POSIX ACL on the dir) and
modify the relevant code for file creation. That will possibly also require
changing chmod to allow changing the mode on symlinks.

Vivek

> 
> Hence, it might be that the right answer is to remove this fairly
> arbitrary restriction entirely, and change symlink(2) so that it
> creates files which respects the umask.  Posix and SUS doesn't specify
> what the permissions are that are used, and historically (before the
> advent of xattrs) I suspect since it didn't matter, no one cared about
> whether or not umask was applied.
> 
> Some people might object to such a change arguing that with
> pre-existing file systems where there are symlinks which
> world-writeable, this might cause people to be able to charge up to
> 32k (or whatever the maximum size of the xattr supported by the file
> system) for each symlink.  However, (a) very few people actually use
> quotas, and this would only be an issue for those users, and (b) the
> amount of quota "abuse" that could be carried out this way is small
> enough that I'm not sure it matters.
> 
>      	    	  	      	  - Ted
>

Theodore Ts'o June 30, 2021, 7:59 p.m. UTC | #25

On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
> 
> Even if you fix symlinks, I don't think it fixes device nodes or
> anything else where the permissions bitmap isn't purely used as the
> permissions on the inode.

I think we're making a mountain out of a molehill.  Again, very few
people are using quota these days.  And if you give someone write
access to an 8TB disk, do you really care if they can "steal" 32k worth
of space (which is the maximum size of an xattr, enforced by the VFS)?

OK, but what about character mode devices?  First of all, most users
don't have access to huge number of devices, but let's assume
something absurd.  Let's say that a user has write access to *1024*
devices.  (My /dev has 233 character mode devices, and I have write
access to well under a dozen.)

An 8TB disk costs about $200.  So how much of the "stolen" quota space
are we talking about, assuming the user has access to 1024 devices,
and the file system actually supports a 32k xattr.

    32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents

A 2TB SSD is around $180, so even if we calculate the prices
based on SSD space, we're still talking about a quarter of a penny.

Why are we worrying about this?
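
For anyone who wants to redo the arithmetic, here is a short Python sketch. It assumes binary units (32 KiB per xattr, TB meaning 2^40 bytes, as the 1024 factors above imply) and one xattr per device; the function name is just illustrative.

```python
KIB = 1024
TIB = 1024 ** 4


def stolen_cost_dollars(xattr_bytes, devices, disk_price, disk_bytes):
    """Price of the space taken by one xattr on each of `devices` nodes."""
    return xattr_bytes * devices * disk_price / disk_bytes


hdd = stolen_cost_dollars(32 * KIB, 1024, 200, 8 * TIB)
ssd = stolen_cost_dollars(32 * KIB, 1024, 180, 2 * TIB)
print(f"HDD: {hdd * 100:.4f} cents, SSD: {ssd * 100:.4f} cents")
# prints "HDD: 0.0763 cents, SSD: 0.2747 cents"
```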

						- Ted

Vivek Goyal June 30, 2021, 8:32 p.m. UTC | #26

On Wed, Jun 30, 2021 at 03:59:41PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
> > 
> > Even if you fix symlinks, I don't think it fixes device nodes or
> > anything else where the permissions bitmap isn't purely used as the
> > permissions on the inode.
> 
> I think we're making a mountain out of a molehill.  Again, very few
> people are using quota these days.  And if you give someone write
> access to a 8TB disk, do you really care if they can "steal" 32k worth
> of space (which is the maximum size of an xattr, enforced by the VFS).

So that should be N * 32K per inode, where N is number of user xattrs
one can write on the inode. (user.1, user.2, user.3, .., user.N)?

Vivek

> 
> OK, but what about character mode devices?  First of all, most users
> don't have access to huge number of devices, but let's assume
> something absurd.  Let's say that a user has write access to *1024*
> devices.  (My /dev has 233 character mode devices, and I have write
> access to well under a dozen.)
> 
> An 8TB disk costs about $200.  So how much of the "stolen" quota space
> are we talking about, assuming the user has access to 1024 devices,
> and the file system actually supports a 32k xattr.
> 
>     32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents
> 
> A 2TB SSD is less around $180, so even if we calculate the prices
> based on SSD space, we're still talking about a quarter of a penny.
> 
> Why are we worrying about this?
> 
> 						- Ted
>

Dr. David Alan Gilbert July 1, 2021, 8:48 a.m. UTC | #27

* Theodore Ts'o (tytso@mit.edu) wrote:
> On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
> > 
> > Even if you fix symlinks, I don't think it fixes device nodes or
> > anything else where the permissions bitmap isn't purely used as the
> > permissions on the inode.
> 
> I think we're making a mountain out of a molehill.  Again, very few
> people are using quota these days.  And if you give someone write
> access to a 8TB disk, do you really care if they can "steal" 32k worth
> of space (which is the maximum size of an xattr, enforced by the VFS).
> 
> OK, but what about character mode devices?  First of all, most users
> don't have access to huge number of devices, but let's assume
> something absurd.  Let's say that a user has write access to *1024*
> devices.  (My /dev has 233 character mode devices, and I have write
> access to well under a dozen.)
> 
> An 8TB disk costs about $200.  So how much of the "stolen" quota space
> are we talking about, assuming the user has access to 1024 devices,
> and the file system actually supports a 32k xattr.
> 
>     32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents
> 
> A 2TB SSD is less around $180, so even if we calculate the prices
> based on SSD space, we're still talking about a quarter of a penny.
> 
> Why are we worrying about this?

I'm not worrying about storage cost, but we would need to define what
the rules are on who can write and change a user.* xattr on a device
node.  It doesn't feel sane to make it anyone who can write to the
device; then everyone can start leaving droppings on /dev/null.

The other evilness I can imagine, is if there's a 32k limit on xattrs on
a node, an evil user could write almost 32k of junk to the node
and then break the next login that tries to add an acl or break the
next relabel.

Dave

> 						- Ted
>

Vivek Goyal July 1, 2021, 12:21 p.m. UTC | #28

On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote:
> * Theodore Ts'o (tytso@mit.edu) wrote:
> > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
> > > 
> > > Even if you fix symlinks, I don't think it fixes device nodes or
> > > anything else where the permissions bitmap isn't purely used as the
> > > permissions on the inode.
> > 
> > I think we're making a mountain out of a molehill.  Again, very few
> > people are using quota these days.  And if you give someone write
> > access to a 8TB disk, do you really care if they can "steal" 32k worth
> > of space (which is the maximum size of an xattr, enforced by the VFS).
> > 
> > OK, but what about character mode devices?  First of all, most users
> > don't have access to huge number of devices, but let's assume
> > something absurd.  Let's say that a user has write access to *1024*
> > devices.  (My /dev has 233 character mode devices, and I have write
> > access to well under a dozen.)
> > 
> > An 8TB disk costs about $200.  So how much of the "stolen" quota space
> > are we talking about, assuming the user has access to 1024 devices,
> > and the file system actually supports a 32k xattr.
> > 
> >     32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents
> > 
> > A 2TB SSD is less around $180, so even if we calculate the prices
> > based on SSD space, we're still talking about a quarter of a penny.
> > 
> > Why are we worrying about this?
> 
> I'm not worrying about storage cost, but we would need to define what
> the rules are on who can write and change a user.* xattr on a device
> node.  It doesn't feel sane to make it anyone who can write to the
> device; then everyone can start leaving droppings on /dev/null.

Looks like tmpfs/devtmpfs might not support setting user.* xattrs. So
device nodes there should not be a problem.

# touch /dev/foo.txt
# setfattr -n "user.foo" -v "bar" /dev/foo.txt
setfattr: /dev/foo.txt: Operation not supported
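
The same probe can be scripted. This is a rough Python equivalent of the setfattr test above, using `os.setxattr` (Linux only); `probe_user_xattr` is a made-up helper name. It reports the errno name instead of guessing at the reason, so "Operation not supported" shows up as ENOTSUP/EOPNOTSUPP, while a symlink probed with `follow_symlinks=False` on a filesystem that does support user.* xattrs would be expected to show EPERM.

```python
import errno
import os


def probe_user_xattr(path, follow_symlinks=True):
    """Try to set user.foo on `path`; return "ok" or the errno name.

    Meant for scratch files only: the attribute is removed again on success.
    """
    try:
        os.setxattr(path, "user.foo", b"bar", follow_symlinks=follow_symlinks)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    os.removexattr(path, "user.foo", follow_symlinks=follow_symlinks)
    return "ok"


if __name__ == "__main__":
    import sys
    for p in sys.argv[1:]:
        print(p, probe_user_xattr(p, follow_symlinks=False))
```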

Vivek

> 
> The other evilness I can imagine, is if there's a 32k limit on xattrs on
> a node, an evil user could write almost 32k of junk to the node
> and then break the next login that tries to add an acl or breaks the
> next relabel.
> 
> Dave
> 
> > 						- Ted
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

Vivek Goyal July 1, 2021, 1:10 p.m. UTC | #29

On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote:
> * Theodore Ts'o (tytso@mit.edu) wrote:
> > On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
> > > 
> > > Even if you fix symlinks, I don't think it fixes device nodes or
> > > anything else where the permissions bitmap isn't purely used as the
> > > permissions on the inode.
> > 
> > I think we're making a mountain out of a molehill.  Again, very few
> > people are using quota these days.  And if you give someone write
> > access to a 8TB disk, do you really care if they can "steal" 32k worth
> > of space (which is the maximum size of an xattr, enforced by the VFS).
> > 
> > OK, but what about character mode devices?  First of all, most users
> > don't have access to huge number of devices, but let's assume
> > something absurd.  Let's say that a user has write access to *1024*
> > devices.  (My /dev has 233 character mode devices, and I have write
> > access to well under a dozen.)
> > 
> > An 8TB disk costs about $200.  So how much of the "stolen" quota space
> > are we talking about, assuming the user has access to 1024 devices,
> > and the file system actually supports a 32k xattr.
> > 
> >     32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents
> > 
> > A 2TB SSD is less around $180, so even if we calculate the prices
> > based on SSD space, we're still talking about a quarter of a penny.
> > 
> > Why are we worrying about this?
> 
> I'm not worrying about storage cost, but we would need to define what
> the rules are on who can write and change a user.* xattr on a device
> node.  It doesn't feel sane to make it anyone who can write to the
> device; then everyone can start leaving droppings on /dev/null.
> 
> The other evilness I can imagine, is if there's a 32k limit on xattrs on
> a node, an evil user could write almost 32k of junk to the node
> and then break the next login that tries to add an acl or breaks the
> next relabel.

I guess 64k is the per-xattr VFS size limit.

#define XATTR_SIZE_MAX 65536

I just wrote a simple program to write "user.<N>" xattrs of size 1K
each and could easily write 1M xattrs. So that's 1G worth of data right
there. I did not try to push it further.
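
The test program itself is not included here; a hypothetical reconstruction in Python might look like the sketch below. The helper name `fill_user_xattrs` and its stop-on-first-error behaviour are assumptions, and how many attributes actually fit depends entirely on the filesystem.

```python
import os


def fill_user_xattrs(path, count, size=1024):
    """Write user.0 .. user.<count-1>, each `size` bytes, until one fails.

    Returns (number_written, total_value_bytes). Stops at the first error,
    e.g. when the filesystem runs out of xattr space or refuses user.* here.
    """
    value = b"x" * size
    written = 0
    for i in range(count):
        try:
            os.setxattr(path, f"user.{i}", value)
        except OSError:
            break
        written += 1
    return written, written * size


if __name__ == "__main__":
    import sys
    n, total = fill_user_xattrs(sys.argv[1], int(sys.argv[2]))
    print(f"wrote {n} xattrs, {total} bytes of values")
```

With 1M attributes of 1K each, the values alone add up to 1024 * 1024 * 1024 bytes, which is the 1G figure above.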

So a user can write a lot of data in the form of user.* xattrs on
symlinks and device nodes if we were to open it up unconditionally. Hence
the permission semantics will probably have to be defined properly.

I am wondering whether it will be alright if the owner of the file (or a
caller with CAP_FOWNER) is allowed to write user.* xattrs on symlinks and
special files.

Vivek

Casey Schaufler July 1, 2021, 4:58 p.m. UTC | #30

On 7/1/2021 6:10 AM, Vivek Goyal wrote:
> On Thu, Jul 01, 2021 at 09:48:33AM +0100, Dr. David Alan Gilbert wrote:
>> * Theodore Ts'o (tytso@mit.edu) wrote:
>>> On Wed, Jun 30, 2021 at 04:01:42PM +0100, Dr. David Alan Gilbert wrote:
>>>> Even if you fix symlinks, I don't think it fixes device nodes or
>>>> anything else where the permissions bitmap isn't purely used as the
>>>> permissions on the inode.
>>> I think we're making a mountain out of a molehill.  Again, very few
>>> people are using quota these days.  And if you give someone write
>>> access to a 8TB disk, do you really care if they can "steal" 32k worth
>>> of space (which is the maximum size of an xattr, enforced by the VFS).
>>>
>>> OK, but what about character mode devices?  First of all, most users
>>> don't have access to huge number of devices, but let's assume
>>> something absurd.  Let's say that a user has write access to *1024*
>>> devices.  (My /dev has 233 character mode devices, and I have write
>>> access to well under a dozen.)
>>>
>>> An 8TB disk costs about $200.  So how much of the "stolen" quota space
>>> are we talking about, assuming the user has access to 1024 devices,
>>> and the file system actually supports a 32k xattr.
>>>
>>>     32k * 1024 * $200 / 8TB / (1024*1024*1024) = $0.000763 = 0.0763 cents
>>>
>>> A 2TB SSD is less around $180, so even if we calculate the prices
>>> based on SSD space, we're still talking about a quarter of a penny.
>>>
>>> Why are we worrying about this?
>> I'm not worrying about storage cost, but we would need to define what
>> the rules are on who can write and change a user.* xattr on a device
>> node.  It doesn't feel sane to make it anyone who can write to the
>> device; then everyone can start leaving droppings on /dev/null.
>>
>> The other evilness I can imagine, is if there's a 32k limit on xattrs on
>> a node, an evil user could write almost 32k of junk to the node
>> and then break the next login that tries to add an acl or breaks the
>> next relabel.
> I think 64k is the per-xattr size limit in the VFS.
>
> #define XATTR_SIZE_MAX 65536
>
> I just wrote a simple program to write "user.<N>" xattrs of size 1K
> each and could easily write 1M xattrs. So that's 1G worth of data right
> there. I did not try to push it further.
>
> So a user can write a lot of data in the form of user.* xattrs on
> symlinks and device nodes if we were to open it up unconditionally. Hence
> the permission semantics will probably have to be defined properly.
>
> I am wondering whether it would be alright if the owner of the file (or
> a caller with CAP_FOWNER) is allowed to write user.* xattrs on symlinks
> and special files.

That would be sensible.
That's independent of your xattr mapping scheme.

>
> Vivek
>
