
[GIT,PULL] selinux/selinux-pr-20250323

Message ID d0ade43454dee9c00689f03e8d9bd32a@paul-moore.com

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git tags/selinux-pr-20250323

Message

Paul Moore March 23, 2025, 7:39 p.m. UTC
Linus,

Here is the SELinux pull request for the Linux v6.15 merge window, the
highlights are below:

- Add additional SELinux access controls for kernel file reads/loads

  The SELinux kernel file read/load access controls were never updated
  beyond the initial kernel module support; this pull request adds
  support for firmware, kexec, policies, and X.509 certificates.

- Add support for wildcards in network interface names

  There are a number of userspace tools which auto-generate network
  interface names using some pattern of <XXXX>-<NN> where <XXXX> is
  a fixed string, e.g. "podman", and <NN> is an increasing counter.
  Supporting wildcards in the SELinux policy for network interfaces
  simplifies the policy associated with these interfaces.
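  As a toy illustration only (this assumes fnmatch-style glob
  semantics and made-up contexts, not the kernel's actual matcher or
  real policy), wildcard interface matching amounts to something like:

```python
from fnmatch import fnmatchcase

# Hypothetical netif policy entries: (interface-name pattern, context).
# Both the patterns and the contexts here are illustrative only.
netif_policy = [
    ("podman*", "system_u:object_r:container_netif_t:s0"),
    ("veth*", "system_u:object_r:veth_netif_t:s0"),
]

def lookup_netif_context(ifname, policy):
    """Return the context of the first pattern matching ifname, else None."""
    for pattern, context in policy:
        if fnmatchcase(ifname, pattern):
            return context
    return None  # caller falls back to the default netif context

print(lookup_netif_context("podman-3", netif_policy))
```

  One entry then covers every auto-generated "podman-<NN>" interface,
  instead of requiring a policy entry per interface name.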

- Fix a potential problem in the kernel read file SELinux code

  SELinux should always check the file label in the
  security_kernel_read_file() LSM hook, regardless of whether the
  file is being read in chunks.  Unfortunately, the existing code only
  considered the file label on the first chunk; this pull request
  fixes this problem.
  
  There is more detail in the individual commit, but thankfully the
  existing code didn't expose a bug: multi-stage reads only take
  place in one driver, and that driver loads a file type that isn't
  targeted by the SELinux policy.
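  A schematic of the fixed behavior, as a toy model in Python rather
  than the actual LSM hook (the function and label names are
  placeholders): the access check runs for every chunk, not just the
  first.

```python
def kernel_read_file_in_chunks(file_label, chunks, check_access):
    """Toy model of a multi-stage (chunked) kernel file read.

    The pre-fix behaviour checked the file label only for the first
    chunk; the fixed behaviour performs the access check for every
    chunk, since the label or the loaded policy may change mid-read.
    """
    loaded = []
    for chunk in chunks:
        if not check_access(file_label):  # check on EVERY chunk
            raise PermissionError("file read denied by policy")
        loaded.append(chunk)
    return b"".join(loaded)

data = kernel_read_file_in_chunks(
    "firmware_t", [b"ab", b"cd"], lambda label: label == "firmware_t")
```

  Here check_access stands in for the SELinux permission check in
  selinux_kernel_read_file(); it is a stand-in, not the real API.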

- Fix the subshell error handling in the example policy loader

  A minor fix to the SELinux example policy loader in scripts/selinux,
  due to an undesired interaction between subshells and errexit.
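  The gotcha here, sketched in shell (the helper and tool names below
  are made up for illustration, not taken from install_policy.sh):

```shell
# Sketch of the errexit/subshell interaction; illustrative names only.
set -e

resolve() {
    command -v "$1" || return 1
}

# Pitfall: when a command substitution appears in another command's
# argument list, its failure status is discarded, so errexit never
# fires and the script keeps running:
echo "tool is: [$(resolve no-such-tool-xyz)]"

# Chaining the status explicitly surfaces the failure:
TOOL=$(resolve sh) || { echo "sh not found" >&2; exit 1; }
echo "found: $TOOL"
```

  Chaining each resolution with an explicit error path is the general
  shape of the fix.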

Please merge,
-Paul

--
The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
    tags/selinux-pr-20250323

for you to fetch changes up to a3d3043ef24ac750f05a164e48f3d0833ebf0252:

  selinux: get netif_wildcard policycap from policy instead of cache
    (2025-03-17 16:22:04 -0400)

----------------------------------------------------------------
selinux/stable-6.15 PR 20250323
----------------------------------------------------------------

"Kipp N. Davis" (1):
      selinux: add permission checks for loading other kinds of kernel
         files

Christian Göttsche (2):
      selinux: support wildcard network interface names
      selinux: get netif_wildcard policycap from policy instead of cache

Paul Moore (1):
      selinux: always check the file label in selinux_kernel_read_file()

Tanya Agarwal (1):
      selinux: fix spelling error

Tim Schumacher (1):
      selinux: Chain up tool resolving errors in install_policy.sh

 scripts/selinux/install_policy.sh          |   15 ++---
 security/selinux/avc.c                     |    2 
 security/selinux/hooks.c                   |   58 +++++++++++++++++----
 security/selinux/include/classmap.h        |    4 +
 security/selinux/include/policycap.h       |    1 
 security/selinux/include/policycap_names.h |    1 
 security/selinux/include/security.h        |    8 ++
 security/selinux/ss/services.c             |   15 ++++-
 8 files changed, 79 insertions(+), 25 deletions(-)

--
paul-moore.com

Comments

Linus Torvalds March 25, 2025, 11:02 p.m. UTC | #1
On Sun, 23 Mar 2025 at 12:39, Paul Moore <paul@paul-moore.com> wrote:
>
> - Add additional SELinux access controls for kernel file reads/loads
>
>   The SELinux kernel file read/load access controls were never updated
>   beyond the initial kernel module support, this pull request adds
>   support for firmware, kexec, policies, and X.509 certificates.

Honestly, is there a *reason* for this, or is this just some misguided
"for completeness" issue?

Because dammit, adding more complexity to the security rules isn't a
feature, and isn't security. It's just theater.

And it adds completely pointless extra cases making the different
cases use artificially different code.

The commit message doesn't actually make any mention of *why* any of
this would be a good idea.

I've pulled this, but honestly, I think all those new cases should be
removed, and if people object to us having "LOADING_MODULE" for other
kinds of reads, then I think the fix should be to just rename it to
"KERNEL_LOAD" or whatever.

Because dammit, this "make security interfaces more complicated just
because" needs to STOP.

We are literally wasting enormous amounts of time in the security
layers - I regularly see the lsm and selinux layers spending *more*
time on security than the actual operation takes because the security
people have written random code without taking performance into
account - and I need to see *reasons* for adding more crap in this
area, not "let's expand" with no actual reason given.

So I really think that commit needs to be either reverted or explained
very clearly why user policy _needs_ to care for the difference
between a module load and a firmware image load.

Because I think any such explanation is likely complete BS. The
difference between loading modules and loading firmware usually would
boil down to "different hardware does things differently". There's no
obvious reason why one should be considered different from another
from a security standpoint.

               Linus
pr-tracker-bot@kernel.org March 25, 2025, 11:17 p.m. UTC | #2
The pull request you sent on Sun, 23 Mar 2025 15:39:46 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git tags/selinux-pr-20250323

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/59c017ce9ec77953ca5198b41d4101f57dd4af0d

Thank you!
Paul Moore March 26, 2025, 6:36 p.m. UTC | #3
On Tue, Mar 25, 2025 at 7:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, 23 Mar 2025 at 12:39, Paul Moore <paul@paul-moore.com> wrote:
> >
> > - Add additional SELinux access controls for kernel file reads/loads
> >
> >   The SELinux kernel file read/load access controls were never updated
> >   beyond the initial kernel module support, this pull request adds
> >   support for firmware, kexec, policies, and X.509 certificates.
>
> Honestly, is there a *reason* for this, or is this just some misguided
> "for completeness" issue?
>
> Because dammit, adding more complexity to the security rules isn't a
> feature, and isn't security. It's just theater ...

From my perspective this is largely a continuation of our discussion
last April, and considering that you ignored my response then, I'm not
sure typing out a meaningful reply here is a good use of my time.
Anyone who is interested can find that thread on lore, unfortunately
much of my response still applies.
Linus Torvalds March 26, 2025, 7:40 p.m. UTC | #4
On Wed, 26 Mar 2025 at 11:36, Paul Moore <paul@paul-moore.com> wrote:
>
> From my perspective this is largely a continuation of our discussion
> last April, and considering that you ignored my response then, I'm not
> sure typing out a meaningful reply here is a good use of my time.

What you are saying is that I have complained about added overhead
before, and you aren't even willing to explain why new code was added?

> Anyone who is interested can find that thread on lore, unfortunately
> much of my response still applies.

That thread was similar in that yes, it was complaining about extra
overhead of the lsm code. Not just me, btw.

But your response doesn't make sense. I asked for *why* this was added.
You're saying "I am not going to answer because you've complained
about other overhead before".

I actually went and tried to find the discussion on the mailing lists,
and nowhere *there* did I find an explanation for why this was done
either.

In other words: why were new policy entries added? The commit message
and the list explain what the commit *does*, but don't explain
*why* it does it.

I'm cc'ing the other people involved, exactly *because* we've had the
whole discussion before, and because I want to see explanations for
*why* new policy hooks are added to the security layers.

I really think that "policy hooks just because policy hooks" is not
acceptable. And the reason it's not acceptable is exactly the fact
that we have a bad history of various random policies becoming
problematic over time.

There needs to be a *reason* for a policy hook stated. Not "there are
no matching policy hooks".

And I do not see why firmware loading should be a policy issue if the
kernel code that initiated the firmware load (ie the module load that
*was* checked for policy) was already done.

Do I believe this particular case is going to be a performance issue? No.

Do I strongly feel that any additional hooks need *EXPLANATIONS*? Hell yes.

                 Linus
Paul Moore March 26, 2025, 8:48 p.m. UTC | #5
On Wed, Mar 26, 2025 at 3:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, 26 Mar 2025 at 11:36, Paul Moore <paul@paul-moore.com> wrote:
> >
> > From my perspective this is largely a continuation of our discussion
> > last April, and considering that you ignored my response then, I'm not
> > sure typing out a meaningful reply here is a good use of my time.
>
> What you are saying is that I have complained about added overhead
> before, and you aren't even willing to explain why new code was added?

My comment is simply that you have a habit of disrespecting the work
done in the security/ space (e.g. "security theater"), and despite
explaining how that behavior is detrimental you seemingly choose to
ignore these concerns and repeat the same tired language.

If you want to have a discussion about something substantive, I would
suggest refraining from insulting those you want to engage.

> In other words: why were new policy entries added? The commit message
> and the list explain what the commit *does*, but don't explain
> *why* it does it.

Looking at the pre-patched code, one can see that SELinux only
enforced access controls on kernel module loading; other operations
such as kexec images, firmware, policy, etc. were not originally
included.  As this happened back in the v4.x time frame I would have
to go dig through the archives to provide a fully accurate reasoning
as to why SELinux only concerned itself with kernel module loading at
that point in time.  However, speaking solely from memory, I believe
that the initial focus was limited to modules simply because that was
the area of largest concern at the time, and many of the other file
types were not yet gated by the LSM hooks.

Moving forward to the recent pull request, the LSM hooks do have the
ability to gate these additional file types, and for an LSM such as
SELinux that aims to provide comprehensive access controls, adding
support for enforcing controls on these additional file types is a
logical thing to do and consistent with our other work.  The fact
that these changes didn't happen at the same time as the enabling of
support for the other file types is likely due to an ongoing
challenge we face: we sometimes fail to keep current with all facets
of kernel development.
Linus Torvalds March 26, 2025, 9:06 p.m. UTC | #6
On Wed, 26 Mar 2025 at 13:48, Paul Moore <paul@paul-moore.com> wrote:
>
> Looking at the pre-patched code one can see that SELinux only enforced
> access controls on kernel module loading

I *know*.

I've looked at that commit. It's why I'm asking.

>      Moving forward to the recent pull
> request, the LSM hooks do have the ability to gate these additional
> file types, and for a LSM such as SELinux that aims to provide
> comprehensive access controls, adding support for enforcing controls
> on these additional file types is a logical thing to do and consistent
> with our other work.

Again - you're explaining what it *does*.  I see what it does. That
was never the issue. That was the only part that the commit message
talked about.

What I'm asking for is *WHY*.

I realize that to you, the security side is what you do, and you feel
that this is all some internal thing to selinux and may feel that I'm
butting in where I shouldn't.

But to others, these things cause extra overhead, so it's *not* just
internal to selinux. It affects others in bad ways.

I want the _reasons_ for new policy hooks to be *explained*.

>  The fact that these changes didn't happen at the
> same time as the enabling support for the other file types is likely
> due to an ongoing challenge we face where we sometimes fail to keep
> current with all facets of kernel development.

You are still just talking about what the code does.

I repeat: my question is *WHY*.

*WHY* is that policy needed at all?

As you correctly point out, it has NOT EXISTED for a long long time
already, and the code has been the old way since 4.x timeframe.

So my question literally boils down to "WHY SHOULD IT EXIST NOW"?

We've done quite well without it.

And I do not see the point of allowing a driver module load (which
*is* a policy, and has been for a long time), and then disallowing
loading the firmware that the driver needs.

That's insane. So explain to me why you added what looks like
completely insane policy hooks.

See what I'm complaining about? I'm literally looking at that code,
and I understand what it does, but it adds policy for something where
I do not believe policy makes any sense what-so-ever.

The whole "any policy, anywhere, any time, for any reason" model is not valid.

We don't ask for permission for core kernel functionality. We don't
ask permission to do things the kernel needs to do.

And new policy rules need to be explained.

             Linus
Paul Moore March 26, 2025, 11:02 p.m. UTC | #7
On Wed, Mar 26, 2025 at 5:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> *WHY* is that policy needed at all?
>
> As you correctly point out, it has NOT EXISTED for a long long time
> already, and the code has been the old way since 4.x timeframe.
>
> So my question literally boils down to "WHY SHOULD IT EXIST NOW"?

SELinux is different things to different people, but for this
discussion I think one of the more relevant aspects is the fact that
SELinux is designed in such a way that you can analyze the security
policy and reason about the security properties of the system.  For
example, how does data from application A flow through the system?
Can application B ever read from file C either directly or through
some chain of applications/users?  What about writing to file D?  Who
can manage device E, or configure various runtime knobs in the kernel.
I could go on, but the important part is that SELinux allows an admin
to reason about the security properties of a system in ways that
simply aren't possible using traditional Linux access controls (mode
bits, capabilities, etc.).  If you're curious the collection of
analysis tools can be found below:

https://github.com/SELinuxProject/setools

An important prerequisite of this is that any "security relevant"
operation on the system needs to have a SELinux access control point.
If an access control is missing, the quality of the policy analysis
suffers.
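A toy illustration, with entirely hypothetical types and rules, of
the kind of reachability question such analysis answers ("can data
written by application A ever reach a file B can read?"):

```python
from collections import deque

# Hypothetical allow rules as (source domain, target type, permission).
# Write edges move data into a type; read edges move it out into any
# domain allowed to read it, which may in turn write it onward.
rules = {
    ("app_a_t", "tmp_t", "write"),
    ("helper_t", "tmp_t", "read"),
    ("helper_t", "secret_t", "write"),
    ("app_b_t", "secret_t", "read"),
}

def can_flow(src, dst, rules):
    """BFS over the information-flow graph induced by the rules."""
    writes, readers = {}, {}
    for s, t, p in rules:
        if p == "write":
            writes.setdefault(s, set()).add(t)
        else:
            readers.setdefault(t, set()).add(s)
    seen, queue = {src}, deque([src])
    while queue:
        domain = queue.popleft()
        for t in writes.get(domain, ()):
            if t == dst:
                return True
            for r in readers.get(t, ()):
                if r not in seen:
                    seen.add(r)
                    queue.append(r)
    return False

print(can_flow("app_a_t", "secret_t", rules))  # flows via helper_t
```

Real analysis of this kind is what the setools linked above do over
an actual compiled policy; this is only a sketch of the idea, and
every missing access control point is a missing edge in that graph.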

> And I do not see the point of allowing a driver module load (which
> *is* a policy, and has been for a long time), and then disallowing
> loading the firmware that the driver needs.
>
> That's insane. So explain to me why you added what looks like
> completely insane policy hooks.

In the security_kernel_read_file() LSM hook, the SELinux callback
examines if the current task has the ability to perform the designated
read operation on file A.  Loading a kernel module and loading its
associated firmware are two events inside the kernel with the current
task loading data from two different files, each with potentially
different security properties, and we want to perform an access check
on each file to ensure it is permitted under the security policy
defined by the admin.  For example, with the kernel enforcing kernel
module signing one administrator might be inclined to allow a given
process to load a kernel module from a file that is more widely
writable than a firmware image without any form of integrity
protection.  Having two distinct permissions, system/module_load and
system/firmware_load, allows the administrator to distinguish between
the two different file reads in the SELinux policy.
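Sketched in the policy syntax that appears later in this thread, with
hypothetical domain and type names, the split lets an admin grant the
two loads independently:

```
# Hypothetical domains/types; only the module_load/firmware_load
# permission split is the point being illustrated.
allow modprobe_t modules_file_t:system module_load;
allow kernel_t firmware_file_t:system firmware_load;
```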
Linus Torvalds March 27, 2025, 12:43 a.m. UTC | #8
On Wed, 26 Mar 2025 at 16:03, Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Mar 26, 2025 at 5:06 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > That's insane. So explain to me why you added what looks like
> > completely insane policy hooks.
>
> In the security_kernel_read_file() LSM hook, the SELinux callback
> examines if the current task has the ability to perform the designated
> read operation on file A.  Loading a kernel module and loading its
> associated firmware are two events inside the kernel with the current
> task loading data from two different files, each with potentially
> different security properties, and we want to perform an access check
> on each file to ensure it is permitted under the security policy
> defined by the admin.

What this tells me is that there wasn't actually any real ask for
this, and you're trying to make up arguments for a silly patch.

First off, "loading a module" and then "the module loads the firmware"
file are *not* two distinct things with distinct security properties.

The module DOES NOT WORK without the firmware file. So the argument
that they are independent actions is complete nonsense. If you don't
trust the firmware, then don't load the module. It's that simple.

Second, your argument that there are different tasks involved is true,
but no, that doesn't mean that there are "potentially different
security properties".

Why? Because as mentioned, loading the module very much implies having
to be able to load the firmware for it - but yes, the firmware loading
might actually happen _later_ and in a different and completely
independent context.

In particular, drivers can do their firmware loads at various random times.

Yes, one common situation is that they do it during module load, for
example, in which case it would be done in the same context that the
module load itself happened.

But it's *also* common that it is done asynchronously while scanning
for devices.

Or done when the device is opened.

Or the firmware file is reloaded when the system resumes from suspend,
because the device lost all its context, so now it's reloading
something that it loaded earlier in a very *different* context, and
that had better not start randomly failing resulting in basically
impossible-to-debug resume problems.

In other words, the context of actual firmware loading is pretty much
random. It isn't necessarily tied to the module loading, but it *also*
isn't necessarily tied to a particular other actor.

So any argument that depends on the context of the firmware load is
bogus a priori. Since there isn't some well-defined context the load
happens in, any policy based on such a context is garbage.

To put it bluntly, your whole argument for why separate policy makes
sense seems completely made-up and has no actual reality behind it.

It also very much sounds like this change was *not* triggered by
anybody actually needing it, and in fact sounds like just "let's
pattern-match and do the same thing for these things that look similar
but aren't".

This is *EXACTLY* the kind of thing that I think the security layers
should absolutely NOT DO.

You should not add nonsensical policy hooks "just because".

Security policy needs to have some thought behind it, not be some
mindless random thing. And there should have been a documented case of
people needing it.

This is literally why I was asking for the reasoning behind that
patch. The patch looked nonsensical. And I have not at all been
convinced otherwise.

                     Linus
Thiébaud Weksteen March 27, 2025, 1:06 a.m. UTC | #9
On Thu, Mar 27, 2025 at 11:43 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 26 Mar 2025 at 16:03, Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Wed, Mar 26, 2025 at 5:06 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > That's insane. So explain to me why you added what looks like
> > > completely insane policy hooks.
> >
> > In the security_kernel_read_file() LSM hook, the SELinux callback
> > examines if the current task has the ability to perform the designated
> > read operation on file A.  Loading a kernel module and loading its
> > associated firmware are two events inside the kernel with the current
> > task loading data from two different files, each with potentially
> > different security properties, and we want to perform an access check
> > on each file to ensure it is permitted under the security policy
> > defined by the admin.

Paul, Linus,

Thank you for discussing the reasoning behind this patch. We wanted to
share how Android will leverage the new hooks introduced here.
Hopefully, this should help move forward this discussion and show why
these changes have concrete value to projects using SELinux like
Android.

Android, for a long time, has used the module loading hooks to ensure
that kernel modules are loaded from cryptographically signed,
dm-verity protected partitions. Even if an attacker compromised a
userspace process with CAP_SYS_MODULE, they cannot leverage this to
load modules from outside of protected partitions.

The changes presented in this merge request build upon this functionality.

Taking one example from this merge request: kexec image loading.

Currently, any process which has CAP_SYS_BOOT can use kexec to replace
the existing kernel. Android has 5 processes with CAP_SYS_BOOT, only 1
of which needs kexec functionality [1]. By using these new
permissions, we will ensure that this process is able to call kexec,
while prohibiting other processes. SELinux provides us strong, kernel
enforced guarantees which can be checked at policy compile time.
Extending on this, we will use this patchset to guarantee that kernels
and ramdisks executed by kexec come from known, good sources.

The other hooks are of similar value to Android.

Thiebaud and Nick

[1] https://cs.android.com/android/platform/superproject/main/+/main:system/sepolicy/microdroid/system/private/kexec.te;l=12

Linus Torvalds March 27, 2025, 1:20 a.m. UTC | #10
On Wed, 26 Mar 2025 at 18:06, Thiébaud Weksteen <tweek@google.com> wrote:
>
> Taking one example from this merge request: kexec image loading.

So this is the kind of "why" I was looking for.

> Currently, any process which has CAP_SYS_BOOT can use kexec to replace
> the existing kernel. Android has 5 processes with CAP_SYS_BOOT, only 1
> of which needs kexec functionality [1]. By using these new
> permissions, we will ensure that this process is able to call kexec,
> while prohibiting other processes. SELinux provides us strong, kernel
> enforced guarantees which can be checked at policy compile time.
> Extending on this, we will use this patchset to guarantee that kernels
> and ramdisks executed by kexec come from known, good sources.
>
> The other hooks are of similar value to Android.

Now explain to me how the firmware loading hook works, not some
hand-wavy "similar value" thing.

Because it seems entirely bogus. Exactly because the context of
firmware loading is *not* something you can depend on. There is no
"one special process" that has firmware loading capabilities.

I'm looking at selinux_kernel_load_data() in particular, where you
don't even pass it a file at all, so it's not like it could check for
"is this file integrity-protected" or anything like that. It seems to
literally say "can this process load firmware", and as I've explained,
the firmware loading is done by random processes.

               Linus
Thiébaud Weksteen March 27, 2025, 3:27 a.m. UTC | #11
On Thu, Mar 27, 2025 at 12:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [...]
> the firmware loading is done by random processes.

That is not quite right. If you look at commit 581dd6983034 [1], when
a firmware is about to be loaded, the kernel credentials are used. It
is therefore possible to grant this permission to the corresponding
security context (in our policy that would be the "kernel" domain).

To be honest, I don't think this particular distinction applies to
Android, but I can imagine IoT devices with smaller/stricter policies
wishing to enforce this (e.g., device boot without a policy, loads its
drivers and firmware, loads a policy that enforces no more firmware
loading).

Thanks

[1] https://lore.kernel.org/all/20220502004952.3970800-1-tweek@google.com/
Linus Torvalds March 27, 2025, 4:09 a.m. UTC | #12
On Wed, 26 Mar 2025 at 20:28, Thiébaud Weksteen <tweek@google.com> wrote:
>
> That is not quite right. If you look at commit 581dd6983034 [1], when
> a firmware is about to be loaded, the kernel credentials are used.

Ahh, that's good, at least there's no "random state" to check.

But it does still mean that the security check is pointless - there
aren't any other credentials that would validly be used for firmware
loading, so what was the point of checking them again?

In fact, the commit you point to was made exactly *because* the kind
of pointless and wrong security checks that I'm complaining about were
done, wasn't it?

                    Linus
Jeffrey Vander Stoep March 27, 2025, 8:59 a.m. UTC | #13
On Thu, Mar 27, 2025 at 5:10 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 26 Mar 2025 at 20:28, Thiébaud Weksteen <tweek@google.com> wrote:
> >
> > That is not quite right. If you look at commit 581dd6983034 [1], when
> > a firmware is about to be loaded, the kernel credentials are used.
>
> Ahh, that's good, at least there's no "random state" to check.
>
> But it does still mean that the security check is pointless - there
> aren't any other credentials that would validly be used for firmware
> loading, so what was the point of checking them again?

The value here isn't so much about checking the source context
"kernel", but rather about checking the target context and enforcing
that firmware can only come from trusted filesystems. So even a
compromised privileged process that sets firmware_class.path cannot
cause the kernel to load firmware from an arbitrary source.

These restrictions reduce our reliance on (1) individual component
manufacturers (e.g. NFC chips) implementing signature verification
correctly in their firmware loading procedure, or (2) the fallout for
the Android ecosystem if a component manufacturer's private key leaks,
because even firmware signed with the leaked key will not be trusted
if it doesn't come from the trusted filesystem signed by the Android
device manufacturer. Leaked keys are a very real problem. Restrictions
like those added here can significantly reduce the severity of such
incidents.

With this, we can write policies for Android devices that enforce that
firmware only comes from trusted filesystems. For example:

allow kernel vendor_file:system firmware_load;

You could even imagine a more lenient rule that allows any domain to
load firmware, just so long as it comes from a trusted filesystem. For
example:

allow domain vendor_file:system firmware_load;

We can then create tests that prevent rules from being added that
attempt to load firmware from untrusted locations. For example:

neverallow * ~vendor_file:system firmware_load;

Such tests are quite important for preventing device manufacturers
from adding insecure policies that would allow firmware loading from
untrusted sources.

Linus Torvalds March 27, 2025, 3:49 p.m. UTC | #14
On Thu, 27 Mar 2025 at 01:59, Jeffrey Vander Stoep <jeffv@google.com> wrote:
>
> The value here isn't so much about checking the source context
> "kernel", but rather about checking the target context and enforcing
> that firmware can only come from trusted filesystems. So even a
> compromised privileged process that sets firmware_class.path cannot
> cause the kernel to load firmware from an arbitrary source.

Yes, and that's literally why I earlier in the thread pointed out the
new code in selinux_kernel_load_data()

  "I'm looking at selinux_kernel_load_data() in particular, where you
   don't even pass it a file at all, so it's not like it could check for
   "is this file integrity-protected" or anything like that"

because I understand that you might want to verify the *file* the
firmware comes from, but I think verifying the context in which the
firmware is loaded is absolutely insane and incorrect.

And that is literally *all* that the new case in
selinux_kernel_load_data() does. There is no excuse for that craziness
that I can come up with.

And yes, I'm harping on this, because I really *hate* how the security
layer comes up in my performance profiles so much. It's truly
disgusting. So when I see new hooks that don't make sense to me, I
react *very* strongly.

Do I believe this insanity matters for performance? No.

But do I believe that the security code needs to *think* about the
random hooks it adds more? Yes. YES!

Which is why I really hate seeing new random hooks where I then go
"that is complete and utter nonsense".

[ This whole patch triggered me for another reason too - firmware
loading in particular has a history of user space actively and
maliciously screwing the kernel up.

  The reason we load firmware directly from the kernel is because user
space "policy" decisions actively broke our original "let user space
do it" model.

  So if somebody thinks I'm overreacting, they are probably right, but
dammit, this triggers two of my big red flags for "this is horribly
wrong" ]

                Linus
Casey Schaufler March 27, 2025, 4:41 p.m. UTC | #15
On 3/27/2025 1:59 AM, Jeffrey Vander Stoep wrote:
> On Thu, Mar 27, 2025 at 5:10 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, 26 Mar 2025 at 20:28, Thiébaud Weksteen <tweek@google.com> wrote:
>>> That is not quite right. If you look at commit 581dd6983034 [1], when
>>> a firmware is about to be loaded, the kernel credentials are used.
>> Ahh, that's good, at least there's no "random state" to check.
>>
>> But it does still mean that the security check is pointless - there
>> aren't any other credentials that would validly be used for firmware
>> loading, so what was the point of checking them again?
> The value here isn't so much about checking the source context
> "kernel", but rather about checking the target context and enforcing
> that firmware can only come from trusted filesystems. So even a
> compromised privileged process that sets firmware_class.path cannot
> cause the kernel to load firmware from an arbitrary source.
>
> These restrictions reduce our reliance on (1) individual component
> manufacturers (e.g. NFC chips) implementing signature verification
> correctly in their firmware loading procedure, or (2) the fallout for
> the Android ecosystem if a component manufacturer's private key leaks
> because even firmware signed with the leaked key will not be trusted
> if it doesn't come from the trusted filesystem signed by the Android
> device manufacturer. Leaked keys are a very real problem. Restrictions
> like those added here can significantly reduce the severity of such
> incidents.
>
> With this, we can write policies for Android devices that enforce that
> firmware only comes from trusted filesystems. For example:
>
> allow kernel vendor_file:system firmware_load;

Am I missing something, or isn't that what loadpin is for?
Stephen Smalley March 27, 2025, 4:55 p.m. UTC | #16
On Thu, Mar 27, 2025 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 01:59, Jeffrey Vander Stoep <jeffv@google.com> wrote:
> >
> > The value here isn't so much about checking the source context
> > "kernel", but rather about checking the target context and enforcing
> > that firmware can only come from trusted filesystems. So even a
> > compromised privileged process that sets firmware_class.path cannot
> > cause the kernel to load firmware from an arbitrary source.
>
> Yes, and that's literally why I earlier in the thread pointed out the
> new code in selinux_kernel_load_data()
>
>   "I'm looking at selinux_kernel_load_data() in particular, where you
>    don't even pass it a file at all, so it's not like it could check for
>    "is this file integrity-protected" or anything like that"
>
> because I understand that you might want to verify the *file* the
> firmware comes from, but I think verifying the context in which the
> firmware is loaded is absolutely insane and incorrect.

So the only use case I could see for that particular check would be if
we wanted to block loading firmware directly from memory/blobs rather
than from files. If that's not a valid use case, then we can get rid
of that particular check if desired; it just seemed inconsistent
between the two hooks otherwise. What's the purpose of having the
LOADING_FIRMWARE enum or hook call on that code path at all then?

> And that is literally *all* that the new case in
> selinux_kernel_load_data() does. There is no excuse for that craziness
> that I can come up with.
>
> And yes, I'm harping on this, because I really *hate* how the security
> layer comes up in my performance profiles so much. It's truly
> disgusting. So when I see new hooks that don't make sense to me, I
> react *very* strongly.

If you have constructive suggestions (or patches!) to improve
performance of LSM and/or SELinux, we'd be glad to take them. Or even
helpful hints on how to best measure and see the same overheads you
are seeing and where.

>
> Do I believe this insanity matters for performance? No.
>
> But do I believe that the security code needs to *think* about the
> random hooks it adds more? Yes. YES!
>
> Which is why I really hate seeing new random hooks where I then go
> "that is complete and utter nonsense".
>
> [ This whole patch triggered me for another reason too - firmware
> loading in particular has a history of user space actively and
> maliciously screwing the kernel up.
>
>   The reason we load firmware directly from the kernel is because user
> space "policy" decisions actively broke our original "let user space
> do it" model.
>
>   So if somebody thinks I'm overreacting, they are probably right, but
> dammit, this triggers two of my big red flags for "this is horribly
> wrong" ]
>
>                 Linus
Linus Torvalds March 27, 2025, 6:15 p.m. UTC | #17
On Thu, 27 Mar 2025 at 09:55, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> If you have constructive suggestions (or patches!) to improve
> performance of LSM and/or SELinux, we'd be glad to take them. Or even
> helpful hints on how to best measure and see the same overheads you
> are seeing and where.

So measuring it is fairly easy. I have various loads I care about, but
the simplest one that I feel is actually a real load - rather than the
more artificial benchmarks I then use for verification when I make any
changes - is literally "do an allmodconfig kernel rebuild with no
changes".

That empty kernel rebuild approximates what I actually do for most
small pull requests when only a couple of files really get re-built. I
don't want to actually try to profile user space and the actual
compiler overhead.

To see that load, first do this as root:

    echo -1 > /proc/sys/kernel/perf_event_paranoid

so that you as a regular user can then do a kernel build and get good
kernel-level profiles (there are other ways you can do this: you can
obviously also do a full profile as root). Obviously you should *not*
do this on a machine with other users, the above basically says "let
anybody do profiling on kernel".

Then I just do

   make allmodconfig
   make -j64

to prep the tree and cause everything to be built (well - normally I
obviously don't do that, because my kernel tree is always built
anyway, but I'm just trying to make it obvious how to reproduce it).

And then I just do

   perf record -e cycles:pp make -j64 > ../makes
   perf report --sort=symbol,dso

and press 'k' to just get the kernel side (again - there's little I
can do, or care, about the user space profiles).

The "--sort=symbol,dso" is because I don't care _which_ process it is
that does what, so I just want the output binned by kernel function.

Just on a very high level, this is what I get with the v6.14 tree
(cut-off at 0.25%, this is "out of total cost" for the load including
user space, so these top functions account for just over 9% of the
*total* cost of the benchmark):

   1.26%  [k] clear_page_rep
   0.82%  [k] avc_has_perm_noaudit
   0.73%  [k] link_path_walk
   0.73%  [k] terminate_walk
   0.58%  [k] __d_lookup_rcu
   0.56%  [k] step_into
   0.52%  [k] selinux_inode_permission
   0.50%  [k] memset_orig
   0.49%  [k] vfs_statx_path
   0.47%  [k] strncpy_from_user
   0.47%  [k] rep_movs_alternative
   0.37%  [k] vfs_statx
   0.31%  [k] __rcu_read_unlock
   0.30%  [k] btrfs_getattr
   0.28%  [k] inode_permission
   0.26%  [k] kmem_cache_free
   0.26%  [k] generic_permission
   [...]

so the top thing is the page clearing (and you see other memcpy/memset
variations there too), but the #2 hit for the kernel profile is
selinux, which takes more time than the basic path walking.

And selinux_inode_permission() is rather high up there too, as you can
see. Together, those two functions are about 1.3% of the whole load.

Now, the above profile is just from *my* machine, and
microarchitecture will matter a *LOT*. So the details will depend
hugely on your hardware, but I've been doing kernel profiles for
decades, and the basics haven't really changed. memory movement and
clearing is universally the biggest thing, and that's fine. It's
fundamental.

Also, when I do profiles I turn off the CPU mitigations, because again
depending on microarchitecture those can just swamp everything else,
and while they are a real overhead, from a performance standpoint I'm
hoping they are something that long-term is going to be mostly fixed
in hardware (apart from the basic Spectre-v1 branch speculation, which
is *not* turned off in my kernels, and which we've actually worked
fairly hard on making sure is handled efficiently).

Now, looking at instruction level profiles is kind of iffy, and you
have to know your microarchitecture to really make sense of them. The
"cycles:pp" helps make profiles more relevant (and requires PEBS/IBS
or equivalent CPU support to work), but it won't replace "you have to
understand hardware".

You do want to look at instruction profiles at least a bit, partly
because inlining makes _not_ looking at them often kind of misleading.
The real cost may be in a function that was inlined.

Typically, once you look at instruction-level profiles, and understand
them, you'll see one of three issues:

 - cache misses. This is typically the big obvious one.

   And you'll see them both for I$ and D$. People will tell you that
I$ cache misses are in the noise, but people are wrong. It's simply
not true for the kernel or many other real benchmarks, and you'll
often see it as big hits at the beginnings of functions - or at the
return points of calls - where the instructions otherwise look very
benign.

 - serialization. This shows up hugely on modern CPUs, so any memory
barriers etc (ie locked instructions on x86) will stand out.

 - branch misprediction. This will typically show up in the profiles
not on the branch, but on the mispredicted _target_ of the branch, so
it can end up being a bit confusing. The CPU speculation mitigations
typically turn this issue up to 11 and add misprediction noise
absolutely everywhere, which is why turning those off is such a big
deal.

but in an OoO CPU all of the above will basically result in various
instruction profile "patterns", so you in general cannot really look
at individual instructions, and should use the above patterns to try
to figure out *why* the profile looks like it does.

It's not obvious, and the patterns will be different for different
microarchitectures. You can use fancier perf things to try to figure
out exactly what is going on, but you should always _start_ from the
"where are the costs" on a pure basic cycle basis. Only after that
does it make sense to say something like "Oh, this is expensive and
seems to be taking excessive cache misses, let's drill down into why".

Also, typically, code that has already been tweaked to death tends to
show fewer obvious peaks in the profile.

Because the obvious peaks have often been getting some attention. So
the profile ends up not showing a big red flag any more, because the
big issue has been fixed and now it's mostly a "it's called too much"
issue.

For the security layer, at least historically the big cache miss (on
this load) has been the inode->i_security access (not loading the
pointer itself, but the accesses following it), and the hash tables
for that AVC lookup.

And both have been improved upon, and I didn't try to analyze the above
profiles any closer when it comes to exactly what is going on, so take
that with the grain of salt it deserves. The exact details may have
changed, but as you can see, avc_has_perm_noaudit() really is very
much a top offender today.

And yes, the reason is that we have to call it a *lot* for any
filename lookups. Some of those security hooks get called for every
path component, others get called only for the final one.

The best fix would be to be able to cache the "this doesn't have any
extra security rules outside of the regular POSIX ones" and avoid
calling the hook entirely. That's what we've done for the ACL path,
and that has turned ACL costs into almost a non-issue.

                  Linus
Linus Torvalds March 27, 2025, 7:03 p.m. UTC | #18
On Thu, 27 Mar 2025 at 11:15, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The best fix would be to be able to cache the "this doesn't have any
> extra security rules outside of the regular POSIX ones" and avoid
> calling the hook entirely. That's what we've done for the ACL path,
> and that has turned ACL costs into almost a non-issue.

.. just to follow up on that, because I tried to look into it, but
didn't know the lsm and selinux code well enough to actually make much
progress, so if somebody who does is really willing to take a deeper
look, please pester me for details.

But the big high-level case for pathname lookup tends to be the calls
we have to do not just for the final inode, but the "every single
component" cases.

Which is a much *simpler* and more targeted security check than the
"any random permissions". It's the "can I use this directory for
lookups", and if we had an inode flag that said "this inode has no
security policies that change the lookup rules", just that single bit
would likely be a *huge* improvement.

Because then you don't need to try to optimize the security check
itself, because instead the VFS layer can optimize it all by not
calling into the security layer at all.

And from a "this is called too much" standpoint, the "every path
component" cases tend to outnumber the "final file open" case by
something like 5-to-1 (handwavy, but not entirely made up).

I think the main case is

   link_path_walk -> may_lookup -> security_inode_permission

where the special case is that may_lookup() only sets MAY_EXEC and
MAY_NOT_BLOCK (for the RCU lookup case, which is what matters).

So if inode_permission() (in fs/namei.c) could avoid calling
security_inode_permission() because the inode has some flag that says
"none of the security models disallow MAY_EXEC for this directory
walk", I really think that would help.

I think trying to optimize the AVC hash table lookup further or
something like that is a dead end. The cost is "many many many calls",
and each individual call is fairly cheap on its own, but they all walk
different hash chains, and it all adds up to "lots of cost".

Hmm?

                      Linus
Stephen Smalley March 27, 2025, 7:16 p.m. UTC | #19
On Thu, Mar 27, 2025 at 3:03 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 11:15, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The best fix would be to be able to cache the "this doesn't have any
> > extra security rules outside of the regular POSIX ones" and avoid
> > calling the hook entirely. That's what we've done for the ACL path,
> > and that has turned ACL costs into almost a non-issue.
>
> .. just to follow up on that, because I tried to look into it, but
> didn't know the lsm and selinux code well enough to actually make much
> progress, so if somebody who does is really willing to take a deeper
> look, please pester me for details.

Would be interested in those details.

> But the big high-level case for pathname lookup tends to be the calls
> we have to do not just for the final inode, but the "every single
> component" cases.
>
> Which is a much *simpler* and more targeted security check than the
> "any random permissions". It's the "can I use this directory for
> lookups", and if we had an inode flag that said "this inode has no
> security policies that change the lookup rules", just that single bit
> would likely be a *huge* improvement.
>
> Because then you don't need to try to optimize the security check
> itself, because instead the VFS layer can optimize it all by not
> calling into the security layer at all.
>
> And from a "this is called too much" standpoint, the "every path
> component" cases tend to outnumber the "final file open" case by
> something like 5-to-1 (handwavy, but not entirely made up).
>
> I think the main case is
>
>    link_path_walk -> may_lookup -> security_inode_permission
>
> where the special case is that may_lookup() only sets MAY_EXEC and
> MAY_NOT_BLOCK (for the RCU lookup case, which is what matters).
>
> So if inode_permission() (in fs/namei.c) could avoid calling
> security_inode_permission() because the inode has some flag that says
> "none of the security models disallow MAY_EXEC for this directory
> walk", I really think that would help.
>
> I think trying to optimize the AVC hash table lookup further or
> something like that is a dead end. The cost is "many many many calls",
> and each individual call is fairly cheap on its own, but they all walk
> different hash chains, and it all adds up to "lots of cost".
>
> Hmm?

Where could/would we cache that information so that it was accessible
directly by the VFS layer?
Linus Torvalds March 27, 2025, 7:40 p.m. UTC | #20
On Thu, 27 Mar 2025 at 12:16, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> Where could/would we cache that information so that it was accessible
> directly by the VFS layer?

So the VFS layer already does this for various other things. For this
case, the natural thing to do would be to add another IOP_xyzzy flag
in inode->i_opflags.

That's how we already say things like "this inode has no
filesystem-specific i_op->permission function" (IOP_FASTPERM), so that
we don't even have to follow the "inode->i_op->permission" pointer
chain to see a NULL pointer.

Yes, the VFS layer is *heavily* optimized like that. It literally does
that IOP_FASTPERM to avoid chasing two pointers - not even the call,
just the "don't even bother to follow pointers to see if it's NULL".
See do_inode_permission().

And we have 16 bits in that inode->i_opflags, and currently only use 7
of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of
logic (feel free to rename that - just throwing a random name out as a
suggestion) would be a complete no-brainer.

NOTE! The rule for the i_opflags accesses is that *reading* them is
done with no locking at all, but changing them takes the inode
spinlock (and we should technically probably use WRITE_ONCE() and
READ_ONCE(), but we don't).

And notice that the "no locking at all for reading" means that if you
*change* the bit - even though that involves locking - there may be
concurrent lookups in progress that won't see the change, and would go
on as if the lookup still does not need any security layer call. No
serialization to readers at all (although you could wait for an RCU
period after changing if you really need to, and only use the bit in
the RCU lookup).

That should be perfectly fine - I really don't think serialization is
even needed. If somebody is changing the policy rules, any file
lookups *concurrent* to that change might not see the new rules, but
that's the same as if it happened before the change.

I just wanted to point out that the serialization is unbalanced: the
spinlock for changing the flag is literally just to make sure that two
bits being changed at the same time don't stomp on each other (because
it's a 16-bit non-atomic field, and we didn't want to use a "unsigned
long" and atomic bitops because the cache layout of the inode is also
a big issue).

And you can take the IOP_FASTPERM thing as an example of how to do
this: it is left clear initially, and what happens is that during the
permission lookup, if it *isn't* set, we'll follow those
inode->i_op->permission pointers, and notice that we should set it:

        if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
                if (likely(inode->i_op->permission))
                        return inode->i_op->permission(idmap, inode, mask);

                /* This gets set once for the inode lifetime */
                spin_lock(&inode->i_lock);
                inode->i_opflags |= IOP_FASTPERM;
                spin_unlock(&inode->i_lock);
        }

and I think the security layer could take a very similar approach: not
setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a
security_inode_permission() call is made with just MAY_NOT_BLOCK |
MAY_LOOKUP, and the security layer notices that "this inode has no
reason to care", it could set the bit so that *next* time around the
VFS layer won't bother to call into security_inode_permission()
unnecessarily.

Does that clarify?

             Linus
Stephen Smalley March 28, 2025, 1:23 p.m. UTC | #21
On Thu, Mar 27, 2025 at 3:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 12:16, Stephen Smalley
> <stephen.smalley.work@gmail.com> wrote:
> >
> > Where could/would we cache that information so that it was accessible
> > directly by the VFS layer?
>
> So the VFS layer already does this for various other things. For this
> case, the natural thing to do would be to add another IOP_xyzzy flag
> in inode->i_opflags.
>
> That's how we already say things like "this inode has no
> filesystem-specific i_op->permission function" (IOP_FASTPERM), so that
> we don't even have to follow the "inode->i_op->permission" pointer
> chain to see a NULL pointer.
>
> Yes, the VFS layer is *heavily* optimized like that. It literally does
> that IOP_FASTPERM to avoid chasing two pointers - not even the call,
> just the "don't even bother to follow pointers to see if it's NULL".
> See do_inode_permission().
>
> And we have 16 bits in that inode->i_opflags, and currently only use 7
> of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of
> logic (feel free to rename that - just throwing a random name out as a
> suggestion) would be a complete no-brainer.
>
> NOTE! The rule for the i_opflags accesses is that *reading* them is
> done with no locking at all, but changing them takes the inode
> spinlock (and we should technically probably use WRITE_ONCE() and
> READ_ONCE(), but we don't).
>
> And notice that the "no locking at all for reading" means that if you
> *change* the bit - even though that involves locking - there may be
> concurrent lookups in progress that won't see the change, and would go
> on as if the lookup still does not need any security layer call. No
> serialization to readers at all (although you could wait for an RCU
> period after changing if you really need to, and only use the bit in
> the RCU lookup).
>
> That should be perfectly fine - I really don't think serialization is
> even needed. If somebody is changing the policy rules, any file
> lookups *concurrent* to that change might not see the new rules, but
> that's the same as if it happened before the change.
>
> I just wanted to point out that the serialization is unbalanced: the
> spinlock for changing the flag is literally just to make sure that two
> bits being changed at the same time don't stomp on each other (because
> it's a 16-bit non-atomic field, and we didn't want to use a "unsigned
> long" and atomic bitops because the cache layout of the inode is also
> a big issue).
>
> And you can take the IOP_FASTPERM thing as an example of how to do
> this: it is left clear initially, and what happens is that during the
> permission lookup, if it *isn't* set, we'll follow those
> inode->i_op->permission pointers, and notice that we should set it:
>
>         if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
>                 if (likely(inode->i_op->permission))
>                         return inode->i_op->permission(idmap, inode, mask);
>
>                 /* This gets set once for the inode lifetime */
>                 spin_lock(&inode->i_lock);
>                 inode->i_opflags |= IOP_FASTPERM;
>                 spin_unlock(&inode->i_lock);
>         }
>
> and I think the security layer could take a very similar approach: not
> setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a
> security_inode_permission() call is made with just MAY_NOT_BLOCK |
> MAY_LOOKUP, and the security layer notices that "this inode has no
> reason to care", it could set the bit so that *next* time around the
> VFS layer won't bother to call into security_inode_permission()
> unnecessarily.
>
> Does that clarify?

Yes, thank you. I think it would be easy enough to make that change to
selinux_inode_permission() and to clear that inode flag on file
relabels (e.g. in selinux_inode_post_setxattr() and
inode_invalidate_secctx()). Not as sure about handling policy reloads
/ boolean changes at runtime without also caching the policy sequence
number in the inode too so that it can be compared. Further, I'm unclear
on how to implement this in a manner that works with the LSM stacking
support, since some other module might disagree about whether we can
skip these lookups. Normally I'd just add an extra boolean or flag
argument to the underlying ->inode_permission() hook so each module
can indicate its decision and
security/security.c:security_inode_permission() could compose the
result, but I guess that would require switching
security_inode_permission() from using call_int_hook() to manually
doing the lsm_for_each_hook() loop itself which may have an adverse
impact on those calls in general.
Paul Moore March 28, 2025, 3:06 p.m. UTC | #22
On Fri, Mar 28, 2025 at 9:23 AM Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
> On Thu, Mar 27, 2025 at 3:40 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 27 Mar 2025 at 12:16, Stephen Smalley
> > <stephen.smalley.work@gmail.com> wrote:
> > >
> > > Where could/would we cache that information so that it was accessible
> > > directly by the VFS layer?
> >
> > So the VFS layer already does this for various other things. For this
> > case, the natural thing to do would be to add another IOP_xyzzy flag
> > in inode->i_opflags.
> >
> > That's how we already say things like "this inode has no
> > filesystem-specific i_op->permission function" (IOP_FASTPERM), so that
> > we don't even have to follow the "inode->i_op->permission" pointer
> > chain to see a NULL pointer.
> >
> > Yes, the VFS layer is *heavily* optimized like that. It literally does
> > that IOP_FASTPERM to avoid chasing two pointers - not even the call,
> > just the "don't even bother to follow pointers to see if it's NULL".
> > See do_inode_permission().
> >
> > And we have 16 bits in that inode->i_opflags, and currently only use 7
> > of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of
> > logic (feel free to rename that - just throwing a random name out as a
> > suggestion) would be a complete no-brainer.
> >
> > NOTE! The rule for the i_opflags accesses is that *reading* them is
> > done with no locking at all, but changing them takes the inode
> > spinlock (and we should technically probably use WRITE_ONCE() and
> > READ_ONCE(), but we don't).
> >
> > And notice that the "no locking at all for reading" means that if you
> > *change* the bit - even though that involves locking - there may be
> > concurrent lookups in progress that won't see the change, and would go
> > on as if the lookup still does not need any security layer call. No
> > serialization to readers at all (although you could wait for an RCU
> > period after changing if you really need to, and only use the bit in
> > the RCU lookup).
> >
> > That should be perfectly fine - I really don't think serialization is
> > even needed. If somebody is changing the policy rules, any file
> > lookups *concurrent* to that change might not see the new rules, but
> > that's the same as if it happened before the change.
> >
> > I just wanted to point out that the serialization is unbalanced: the
> > spinlock for changing the flag is literally just to make sure that two
> > bits being changed at the same time don't stomp on each other (because
> > it's a 16-bit non-atomic field, and we didn't want to use a "unsigned
> > long" and atomic bitops because the cache layout of the inode is also
> > a big issue).
> >
> > And you can take the IOP_FASTPERM thing as an example of how to do
> > this: it is left clear initially, and what happens is that during the
> > permission lookup, if it *isn't* set, we'll follow those
> > inode->i_op->permission pointers, and notice that we should set it:
> >
> >         if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
> >                 if (likely(inode->i_op->permission))
> >                         return inode->i_op->permission(idmap, inode, mask);
> >
> >                 /* This gets set once for the inode lifetime */
> >                 spin_lock(&inode->i_lock);
> >                 inode->i_opflags |= IOP_FASTPERM;
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > and I think the security layer could take a very similar approach: not
> > setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a
> > security_inode_permission() call is made with just MAY_NOT_BLOCK |
> > MAY_LOOKUP, and the security layer notices that "this inode has no
> > reason to care", it could set the bit so that *next* time around the
> > VFS layer won't bother to call into security_inode_permission()
> > unnecessarily.
> >
> > Does that clarify?
>
> Yes, thank you. I think it would be easy enough to make that change to
> selinux_inode_permission() and to clear that inode flag on file
> relabels (e.g. in selinux_inode_post_setxattr() and
> inode_invalidate_secctx()). Not as sure about handling policy reloads
> / boolean changes at runtime without also caching the policy sequence
> number in the inode too so that can be compared. Further, I'm unclear
> on how to implement this in a manner that works with the LSM stacking
> support, since some other module might disagree about whether we can
> skip these lookups. Normally I'd just add an extra boolean or flag
> argument to the underlying ->inode_permission() hook so each module
> can indicate its decision and
> security/security.c:security_inode_permission() could compose the
> result, but I guess that would require switching
> security_inode_permission() from using call_int_hook() to manually
> doing the lsm_for_each_hook() loop itself which may have an adverse
> impact on those calls in general.

I'm not sure the VFS flag approach is going to end up being practical
due to various constraints, but I do have some other ideas that should
allow us to drop the bulk of the SELinux AVC calls while doing path
walks that I'm going to try today/soon.  I'll obviously report back if
it works out.
Linus Torvalds March 28, 2025, 4:36 p.m. UTC | #23
On Fri, 28 Mar 2025 at 06:23, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> Yes, thank you. I think it would be easy enough to make that change to
> selinux_inode_permission() and to clear that inode flag on file
> relabels (e.g. in selinux_inode_post_setxattr() and
> inode_invalidate_secctx()).

So the thing that *really* made me go "I don't know how to do this in
the security layer" is not so much the relabeling - that should be
easy to handle by just clearing the bit, as you say.

And I wasn't even so much worried about policy changes that would
globally change meaning of existing labels:

> Not as sure about handling policy reloads
> / boolean changes at runtime without also caching the policy sequence
> number in the inode too so that it can be compared.

Yeah, a sequence number seems like an obvious solution, even if it
might be a bit awkward to find a place to store it that doesn't
pollute the cache. The reason it would be _so_ nice to not call the
security hooks at all in this path is that I think we otherwise can do
all the inode security lookups by just looking at the very beginning
of the inode (that's why we do that IOP_FASTPERM thing - just to avoid
touching other cachelines). But if it avoids the
security_inode_permission() call, it would definitely be a win even
with a cache miss.

> Further, I'm unclear
> on how to implement this in a manner that works with the LSM stacking
> support,

So *this* was what made me go "I don't know how to do this AT ALL",
along with the fact that the rule for the bit would have to be that it
would be true for *all* execution contexts.

IOW, it's a very different thing from the usual security hook calls,
in that instead of saying "is this access ok for the current context",
the bit setting would have to say "this lookup is ok for _all_ calling
contexts for this inode for _all_ of the nested security things".

The sequence number approach should take care of any races, so that
part isn't a huge problem: just set the inode sequence number early,
before doing any of the checks. And then only set the bit at the end
if every stacked security layer says "yeah, this inode doesn't have
extra lookup rules as far as I'm concerned". So if any of the rules
changed in the meantime, the sequence number means that the bit won't
take effect. So that part should be fine.
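(A minimal sketch of that race-avoidance scheme, in standalone C11 rather than kernel code — the field names `i_flags`, `i_policy_seq`, and the `I_LOOKUP_OK` bit are all hypothetical: sample the global sequence number before the checks, and only publish the bit against that sampled value.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define I_LOOKUP_OK 0x1u

/* Invented stand-in for the relevant inode fields. */
struct fake_inode {
	atomic_uint i_flags;
	atomic_uint i_policy_seq;
};

/* Bumped on every policy reload / boolean change / relabel. */
static atomic_uint policy_seq;

/* Set the bit against the sequence number sampled *before* the checks
 * ran; a policy change in between bumps policy_seq, so the stale bit
 * is simply ignored by the fast path below. */
static void mark_lookup_ok(struct fake_inode *inode, unsigned int seq_at_start)
{
	atomic_store(&inode->i_policy_seq, seq_at_start);
	atomic_fetch_or(&inode->i_flags, I_LOOKUP_OK);
}

/* The would-be may_lookup() fast path: skip the security hook only if
 * the bit is set *and* the cached sequence number is still current. */
static bool lookup_fastpath_ok(struct fake_inode *inode)
{
	return (atomic_load(&inode->i_flags) & I_LOOKUP_OK) &&
	       atomic_load(&inode->i_policy_seq) == atomic_load(&policy_seq);
}
```

So a reload never has to chase down and clear per-inode bits; bumping the one global counter invalidates them all at once.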

But the "this inode doesn't have extra lookup rules" part is what
needs low-level knowledge about how all the security models work. And
it really would have to be true in all contexts - ie across different
namespaces etc.

(Note the "extra" part: the VFS layer deals with all the *normal* Unix
rules, including ACL, of course. So it's literally just about "are
there security hook rules that then limit things _beyond_ those
standard permission rules")

It might be worth noting that this call site is special for the VFS
anyway: it *looks* like a normal security hook, but "may_lookup()" is
literally *only* used for directories, and *only* used for "can I do
name lookup in this".

So if it helps the security layers, that case could be made into its
own specialized hook entirely, even if it would require basically
duplicating up "inode_permission()" that is currently used for both
the lookup case and for "generic" inode permission checking.

For example, I do know that SELinux turns the VFS layer permission
mask (things like "MAY_EXEC") into its own masks that are different
for files and for directories (so MAY_EXEC becomes DIR__SEARCH for
SELinux for directories, but FILE__EXECUTE when doing 'execve()'
on a file).

And so that's an example of something we could short-circuit here,
because *all* we care about is that DIR__SEARCH thing, and we know
that statically.
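(That translation can be sketched in a few lines. This is loosely modeled on SELinux's file_mask_to_av() in security/selinux/hooks.c, but the constant values below are simplified stand-ins, not the real ones from fs.h or av_permissions.h.)

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the generic VFS permission bits ... */
#define MAY_EXEC  0x01
#define MAY_WRITE 0x02
#define MAY_READ  0x04

/* ... and for the SELinux class-specific permission bits. */
#define FILE__READ    0x01u
#define FILE__WRITE   0x02u
#define FILE__EXECUTE 0x04u
#define DIR__READ     0x10u
#define DIR__WRITE    0x20u
#define DIR__SEARCH   0x40u

/* Map generic MAY_* bits onto class-specific permissions: the same
 * MAY_EXEC request means "search" on a directory but "execute" on a
 * regular file. For may_lookup() the answer is statically known:
 * it is always the directory case, always DIR__SEARCH. */
static unsigned int mask_to_av_sketch(bool is_dir, int mask)
{
	unsigned int av = 0;

	if (mask & MAY_READ)
		av |= is_dir ? DIR__READ : FILE__READ;
	if (mask & MAY_WRITE)
		av |= is_dir ? DIR__WRITE : FILE__WRITE;
	if (mask & MAY_EXEC)
		av |= is_dir ? DIR__SEARCH : FILE__EXECUTE;
	return av;
}
```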

But I know SELinux better than any of the other security models, so I
don't know what *any* of the others do (and honestly, even the SELinux
code I don't know that well - I've looked at it a lot, but I've really
only ever looked at it in the "I'm looking at profiles, is there
anything obvious I can do about this" sense).

                 Linus
Casey Schaufler March 28, 2025, 5:24 p.m. UTC | #24
On 3/28/2025 9:36 AM, Linus Torvalds wrote:
> On Fri, 28 Mar 2025 at 06:23, Stephen Smalley
> <stephen.smalley.work@gmail.com> wrote:
>> Yes, thank you. I think it would be easy enough to make that change to
>> selinux_inode_permission() and to clear that inode flag on file
>> relabels (e.g. in selinux_inode_post_setxattr() and
>> inode_invalidate_secctx()).
> So the thing that *really* made me go "I don't know how to do this in
> the security layer" is not so much the relabeling - that should be
> easy to handle by just clearing the bit, as you say.
>
> And I wasn't even so much worried about policy changes that would
> globally change meaning of existing labels:
>
>> Not as sure about handling policy reloads
>> / boolean changes at runtime without also caching the policy sequence
>> number in the inode too so that it can be compared.
> Yeah, a sequence number seems like an obvious solution, even if it
> might be a bit awkward to find a place to store it that doesn't
> pollute the cache. The reason it would be _so_ nice to not call the
> security hooks at all in this path is that I think we otherwise can do
> all the inode security lookups by just looking at the very beginning
> of the inode (that's why we do that IOP_FASTPERM thing - just to avoid
> touching other cachelines). But if it avoids the
> security_inode_permission() call, it would definitely be a win even
> with a cache miss.
>
>> Further, I'm unclear
>> on how to implement this in a manner that works with the LSM stacking
>> support,
> So *this* was what made me go "I don't know how to do this AT ALL",
> along with the fact that the rule for the bit would have to be that it
> would be true for *all* execution contexts.
>
> IOW, it's a very different thing from the usual security hook calls,
> in that instead of saying "is this access ok for the current context",
> the bit setting would have to say "this lookup is ok for _all_ calling
> contexts for this inode for _all_ of the nested security things".

That is correct. But the LSMs we currently have don't do frequent
attribute changes on objects, and they generally constrain process
attribute changes to exec() and administrative actions. Invalidating
the lookup ok flag whenever any LSM changes attributes is most likely
to happen at a time when all LSMs could be expected to be changing
attributes anyway.

> The sequence number approach should take care of any races, so that
> part isn't a huge problem: just set the inode sequence number early,
> before doing any of the checks. And then only set the bit at the end
> if every stacked security layer says "yeah, this inode doesn't have
> extra lookup rules as far as I'm concerned". So if any of the rules
> changed in the meantime, the sequence number means that the bit won't
> take effect. So that part should be fine.
>
> But the "this inode doesn't have extra lookup rules" part is what
> needs low-level knowledge about how all the security models work. And
> it really would have to be true in all contexts - ie across different
> namespaces etc.

It would be nice if changing the rules for one LSM (e.g. adding a
Smack access rule: # echo Snap Pop rw > /sys/fs/smackfs/load2) didn't
affect the lookup ok flag for other LSMs, but I don't see how the
overhead would be justified. I'm also concerned that detecting and
clearing the flag for a single LSM on a policy change is going to be
anything but brutal.