Message ID: d0ade43454dee9c00689f03e8d9bd32a@paul-moore.com (mailing list archive)
State: New
Series: [GIT,PULL] selinux/selinux-pr-20250323
On Sun, 23 Mar 2025 at 12:39, Paul Moore <paul@paul-moore.com> wrote:
>
> - Add additional SELinux access controls for kernel file reads/loads
>
>   The SELinux kernel file read/load access controls were never updated
>   beyond the initial kernel module support, this pull request adds
>   support for firmware, kexec, policies, and x.509 certificates.

Honestly, is there a *reason* for this, or is this just some misguided
"for completeness" issue?

Because dammit, adding more complexity to the security rules isn't a
feature, and isn't security. It's just theater. And it adds completely
pointless extra cases making the different cases use artificially
different code.

The commit message doesn't actually make any mention of *why* any of
this would be a good idea.

I've pulled this, but honestly, I think all those new cases should be
removed, and if people object to us having "LOADING_MODULE" for other
kinds of reads, then I think the fix should be to just rename it to
"KERNEL_LOAD" or whatever.

Because dammit, this "make security interfaces more complicated just
because" needs to STOP.

We are literally wasting enormous amounts of time in the security
layers - I regularly see the lsm and selinux layers spending *more*
time on security than the actual operation takes because the security
people have written random code without taking performance into
account - and I need to see *reasons* for adding more crap in this
area, not "let's expand" with no actual reason given.

So I really think that commit needs to be either reverted or explained
very clearly why user policy _needs_ to care for the difference
between a module load and a firmware image load.

Because I think any such explanation is likely complete BS. The
difference between loading modules and loading firmware usually would
boil down to "different hardware does things differently". There's no
obvious reason why one should be considered different from another
from a security standpoint.

              Linus
The pull request you sent on Sun, 23 Mar 2025 15:39:46 -0400:
> https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git tags/selinux-pr-20250323
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/59c017ce9ec77953ca5198b41d4101f57dd4af0d
Thank you!
On Tue, Mar 25, 2025 at 7:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, 23 Mar 2025 at 12:39, Paul Moore <paul@paul-moore.com> wrote:
> >
> > - Add additional SELinux access controls for kernel file reads/loads
> >
> >   The SELinux kernel file read/load access controls were never updated
> >   beyond the initial kernel module support, this pull request adds
> >   support for firmware, kexec, policies, and x.509 certificates.
>
> Honestly, is there a *reason* for this, or is this just some misguided
> "for completeness" issue?
>
> Because dammit, adding more complexity to the security rules isn't a
> feature, and isn't security. It's just theater ...

From my perspective this is largely a continuation of our discussion
last April, and considering that you ignored my response then, I'm not
sure typing out a meaningful reply here is a good use of my time.
Anyone who is interested can find that thread on lore; unfortunately
much of my response still applies.
On Wed, 26 Mar 2025 at 11:36, Paul Moore <paul@paul-moore.com> wrote:
>
> From my perspective this is largely a continuation of our discussion
> last April, and considering that you ignored my response then, I'm not
> sure typing out a meaningful reply here is a good use of my time.

What you are saying is that I have complained about added overhead
before, and you aren't even willing to explain why new code was added?

> Anyone who is interested can find that thread on lore, unfortunately
> much of my response still applies.

That thread was similar in that yes, it was complaining about extra
overhead of the lsm code. Not just me, btw.

But your response doesn't make sense. I asked for *why* this was
added. You're saying "I am not going to answer because you've
complained about other overhead before".

I actually went and tried to find the discussion on the mailing lists,
and nowhere *there* did I find an explanation for why this was done
either.

In other words: why were new policy entries added? The commit message
and the list explain what the commit *does*, but don't explain *why*
it does it.

I'm cc'ing the other people involved, exactly *because* we've had the
whole discussion before, and because I want to see explanations for
*why* new policy hooks are added to the security layers.

I really think that "policy hooks just because policy hooks" is not
acceptable. And the reason it's not acceptable is exactly the fact
that we have a bad history of various random policies becoming
problematic over time.

There needs to be a *reason* for a policy hook stated. Not "there are
no matching policy hooks".

And I do not see why firmware loading should be a policy issue if the
kernel code that initiated the firmware load (ie the module load that
*was* checked for policy) was already done.

Do I believe this particular case is going to be a performance issue? No.

Do I strongly feel that any additional hooks need *EXPLANATIONS*? Hell yes.

              Linus
On Wed, Mar 26, 2025 at 3:40 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, 26 Mar 2025 at 11:36, Paul Moore <paul@paul-moore.com> wrote:
> >
> > From my perspective this is largely a continuation of our discussion
> > last April, and considering that you ignored my response then, I'm not
> > sure typing out a meaningful reply here is a good use of my time.
>
> What you are saying is that I have complained about added overhead
> before, and you aren't even willing to explain why new code was added?

My comment is simply that you have a habit of disrespecting the work
done in the security/ space (e.g. "security theater"), and despite
explaining how that behavior is detrimental you seemingly choose to
ignore these concerns and repeat the same tired language. If you want
to have a discussion about something substantive, I would suggest
refraining from insulting those you want to engage.

> In other words: why were new policy entries added? The commit message
> and the list explains what the commit *does*, but doesn't explain
> *why* it does it.

Looking at the pre-patched code one can see that SELinux only enforced
access controls on kernel module loading; other operations such as
kexec images, firmware, policy, etc. were not originally included. As
this happened back in the v4.x time frame I would have to go dig
through the archives to provide a fully accurate reasoning as to why
SELinux only concerned itself with kernel module loading at that point
in time. However, speaking solely from memory, I believe that the
initial focus was limited to modules simply because that was the area
of largest concern at the time, and many of the other file types were
not yet gated by the LSM hooks.

Moving forward to the recent pull request, the LSM hooks do have the
ability to gate these additional file types, and for an LSM such as
SELinux that aims to provide comprehensive access controls, adding
support for enforcing controls on these additional file types is a
logical thing to do and consistent with our other work. The fact that
these changes didn't happen at the same time as the enabling support
for the other file types is likely due to an ongoing challenge we face
where we sometimes fail to keep current with all facets of kernel
development.
On Wed, 26 Mar 2025 at 13:48, Paul Moore <paul@paul-moore.com> wrote:
>
> Looking at the pre-patched code one can see that SELinux only enforced
> access controls on kernel module loading

I *know*. I've looked at that commit. It's why I'm asking.

> Moving forward to the recent pull request, the LSM hooks do have the
> ability to gate these additional file types, and for an LSM such as
> SELinux that aims to provide comprehensive access controls, adding
> support for enforcing controls on these additional file types is a
> logical thing to do and consistent with our other work.

Again - you're explaining what it *does*.

I see what it does. That was never the issue. That was the only part
that the commit message talked about.

What I'm asking for is *WHY*.

I realize that to you, the security side is what you do, and you feel
that this is all some internal thing to selinux and may feel that I'm
butting in where I shouldn't.

But to others, these things cause extra overhead, so it's *not* just
internal to selinux. It affects others in bad ways.

I want the _reasons_ for new policy hooks to be *explained*.

> The fact that these changes didn't happen at the same time as the
> enabling support for the other file types is likely due to an ongoing
> challenge we face where we sometimes fail to keep current with all
> facets of kernel development.

You are still just talking about what the code does.

I repeat: my question is *WHY*.

*WHY* is that policy needed at all?

As you correctly point out, it has NOT EXISTED for a long long time
already, and the code has been the old way since the 4.x timeframe.

So my question literally boils down to "WHY SHOULD IT EXIST NOW"?

We've done quite well without it.

And I do not see the point of allowing a driver module load (which
*is* a policy, and has been for a long time), and then disallowing
loading the firmware that the driver needs.

That's insane. So explain to me why you added what looks like
completely insane policy hooks.

See what I'm complaining about? I'm literally looking at that code,
and I understand what it does, but it adds policy for something where
I do not believe policy makes any sense what-so-ever.

The whole "any policy, anywhere, any time, for any reason" model is
not valid. We don't ask for permission for core kernel functionality.
We don't ask permission to do things the kernel needs to do.

And new policy rules need to be explained.

              Linus
On Wed, Mar 26, 2025 at 5:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> *WHY* is that policy needed at all?
>
> As you correctly point out, it has NOT EXISTED for a long long time
> already, and the code has been the old way since the 4.x timeframe.
>
> So my question literally boils down to "WHY SHOULD IT EXIST NOW"?

SELinux is different things to different people, but for this
discussion I think one of the more relevant aspects is the fact that
SELinux is designed in such a way that you can analyze the security
policy and reason about the security properties of the system. For
example, how does data from application A flow through the system? Can
application B ever read from file C, either directly or through some
chain of applications/users? What about writing to file D? Who can
manage device E, or configure various runtime knobs in the kernel? I
could go on, but the important part is that SELinux allows an admin to
reason about the security properties of a system in ways that simply
aren't possible using traditional Linux access controls (mode bits,
capabilities, etc.). If you're curious, the collection of analysis
tools can be found below:

https://github.com/SELinuxProject/setools

An important prerequisite of this is that any "security relevant"
operation on the system needs to have a SELinux access control point.
If an access control is missing, the quality of the policy analysis
suffers.

> And I do not see the point of allowing a driver module load (which
> *is* a policy, and has been for a long time), and then disallowing
> loading the firmware that the driver needs.
>
> That's insane. So explain to me why you added what looks like
> completely insane policy hooks.

In the security_kernel_read_file() LSM hook, the SELinux callback
examines if the current task has the ability to perform the designated
read operation on file A. Loading a kernel module and loading its
associated firmware are two events inside the kernel with the current
task loading data from two different files, each with potentially
different security properties, and we want to perform an access check
on each file to ensure it is permitted under the security policy
defined by the admin. For example, with the kernel enforcing kernel
module signing, one administrator might be inclined to allow a given
process to load a kernel module from a file that is more widely
writable than a firmware image without any form of integrity
protection. Having two distinct permissions, system/module_load and
system/firmware_load, allows the administrator to distinguish between
the two different file reads in the SELinux policy.
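[ Editor's sketch of what that distinction looks like in policy form.
The system class and the module_load/firmware_load permission names
come from the message above; the drvmgr_t, modules_file_t, and
firmware_file_t labels are hypothetical, made up for illustration: ]

```
# Hypothetical labels for illustration only.
# The driver-management domain may read kernel modules from files
# labeled modules_file_t:
allow drvmgr_t modules_file_t:system module_load;

# Firmware reads are granted separately, so an admin can require a
# different (e.g. integrity-protected) source for firmware images:
allow drvmgr_t firmware_file_t:system firmware_load;
```

With a single combined permission, policy could not express "this
domain may load modules from here but firmware only from there".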
On Wed, 26 Mar 2025 at 16:03, Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Mar 26, 2025 at 5:06 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > That's insane. So explain to me why you added what looks like
> > completely insane policy hooks.
>
> In the security_kernel_read_file() LSM hook, the SELinux callback
> examines if the current task has the ability to perform the designated
> read operation on file A. Loading a kernel module and loading its
> associated firmware are two events inside the kernel with the current
> task loading data from two different files, each with potentially
> different security properties, and we want to perform an access check
> on each file to ensure it is permitted under the security policy
> defined by the admin.

What this tells me is that there wasn't actually any real ask for
this, and you're trying to make up arguments for a silly patch.

First off, "loading a module" and then "the module loads the firmware
file" are *not* two distinct things with distinct security properties.

The module DOES NOT WORK without the firmware file. So the argument
that they are independent actions is complete nonsense. If you don't
trust the firmware, then don't load the module. It's that simple.

Second, your argument that there are different tasks involved is true,
but no, that doesn't mean that there are "potentially different
security properties".

Why? Because as mentioned, loading the module very much implies having
to be able to load the firmware for it - but yes, the firmware loading
might actually happen _later_ and in a different and completely
independent context.

In particular, drivers can do their firmware loads at various random times.

Yes, one common situation is that they do it during module load, for
example, in which case it would be done in the same context that the
module load itself happened.

But it's *also* common that it is done asynchronously while scanning
for devices.

Or done when the device is opened.

Or the firmware file is reloaded when the system resumes from suspend,
because the device lost all its context, so now it's reloading
something that it loaded earlier in a very *different* context, and
that had better not start randomly failing, resulting in basically
impossible-to-debug resume problems.

In other words, the context of actual firmware loading is pretty much
random. It isn't necessarily tied to the module loading, but it *also*
isn't necessarily tied to a particular other actor.

So any argument that depends on the context of the firmware load is
bogus a priori. Since there isn't some well-defined context the load
happens in, any policy based on such a context is garbage.

To put it bluntly, your whole argument for why separate policy makes
sense seems completely made-up and has no actual reality behind it.

It also very much sounds like this change was *not* triggered by
anybody actually needing it, and in fact sounds like just "let's
pattern-match and do the same thing for these things that look similar
but aren't".

This is *EXACTLY* the kind of thing that I think the security layers
should absolutely NOT DO.

You should not add nonsensical policy hooks "just because".

Security policy needs to have some thought behind it, not be some
mindless random thing. And there should have been a documented case of
people needing it.

This is literally why I was asking for the reasoning behind that
patch. The patch looked nonsensical. And I have not at all been
convinced otherwise.

              Linus
On Thu, Mar 27, 2025 at 11:43 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, 26 Mar 2025 at 16:03, Paul Moore <paul@paul-moore.com> wrote:
> >
> > In the security_kernel_read_file() LSM hook, the SELinux callback
> > examines if the current task has the ability to perform the designated
> > read operation on file A. Loading a kernel module and loading its
> > associated firmware are two events inside the kernel with the current
> > task loading data from two different files, each with potentially
> > different security properties, and we want to perform an access check
> > on each file to ensure it is permitted under the security policy
> > defined by the admin.
>
> [...]

Paul, Linus,

Thank you for discussing the reasoning behind this patch. We wanted to
share how Android will leverage the new hooks introduced here.
Hopefully, this should help move this discussion forward and show why
these changes have concrete value to projects using SELinux, like
Android.

Android, for a long time, has used the module loading hooks to ensure
that kernel modules are loaded from cryptographically signed,
dm-verity protected partitions. Even if an attacker compromised a
userspace process with CAP_SYS_MODULE, they cannot leverage this to
load modules from outside of protected partitions.

The changes presented in this merge request build upon this
functionality. Taking one example from this merge request: kexec image
loading. Currently, any process which has CAP_SYS_BOOT can use kexec
to replace the existing kernel. Android has 5 processes with
CAP_SYS_BOOT, only 1 of which needs kexec functionality [1]. By using
these new permissions, we will ensure that this process is able to
call kexec, while prohibiting other processes.

SELinux provides us strong, kernel-enforced guarantees which can be
checked at policy compile time. Extending on this, we will use this
patchset to guarantee that kernels and ramdisks executed by kexec come
from known, good sources.

The other hooks are of similar value to Android.

Thiebaud and Nick

[1] https://cs.android.com/android/platform/superproject/main/+/main:system/sepolicy/microdroid/system/private/kexec.te;l=12
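[ Editor's sketch of how the kexec restriction described above could be
expressed in policy. The permission and label names here are
assumptions modeled on the system/module_load and system/firmware_load
naming used elsewhere in this thread; the actual Android rules live in
the kexec.te file linked above: ]

```
# Hypothetical names for illustration only.
# Only the one domain that actually needs kexec may load kexec images:
allow kexec_loader_t kexec_image_file_t:system kexec_image_load;

# Compile-time guarantee: no other domain can ever be granted this,
# regardless of CAP_SYS_BOOT. Policy build fails if a rule violates it.
neverallow { domain -kexec_loader_t } *:system kexec_image_load;
```

The neverallow line is what makes the guarantee checkable at policy
compile time rather than only at runtime.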
On Wed, 26 Mar 2025 at 18:06, Thiébaud Weksteen <tweek@google.com> wrote:
>
> Taking one example from this merge request: kexec image loading.

So this is the kind of "why" I was looking for.

> Currently, any process which has CAP_SYS_BOOT can use kexec to replace
> the existing kernel. Android has 5 processes with CAP_SYS_BOOT, only 1
> of which needs kexec functionality [1]. By using these new
> permissions, we will ensure that this process is able to call kexec,
> while prohibiting other processes. SELinux provides us strong, kernel
> enforced guarantees which can be checked at policy compile time.
> Extending on this, we will use this patchset to guarantee that kernels
> and ramdisks executed by kexec come from known, good sources.
>
> The other hooks are of similar value to Android.

Now explain to me how the firmware loading hook works, not some
hand-wavy "similar value" thing.

Because it seems entirely bogus. Exactly because the context of
firmware loading is *not* something you can depend on. There is no
"one special process" that has firmware loading capabilities.

I'm looking at selinux_kernel_load_data() in particular, where you
don't even pass it a file at all, so it's not like it could check for
"is this file integrity-protected" or anything like that.

It seems to literally say "can this process load firmware", and as
I've explained, the firmware loading is done by random processes.

              Linus
On Thu, Mar 27, 2025 at 12:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [...]
> the firmware loading is done by random processes.

That is not quite right. If you look at commit 581dd6983034 [1], when
a firmware is about to be loaded, the kernel credentials are used. It
is therefore possible to grant this permission to the corresponding
security context (in our policy that would be the "kernel" domain).

To be honest, I don't think this particular distinction applies to
Android, but I can imagine IoT devices with smaller/stricter policies
wishing to enforce this (e.g., a device boots without a policy, loads
its drivers and firmware, then loads a policy that enforces no more
firmware loading).

Thanks

[1] https://lore.kernel.org/all/20220502004952.3970800-1-tweek@google.com/
On Wed, 26 Mar 2025 at 20:28, Thiébaud Weksteen <tweek@google.com> wrote:
>
> That is not quite right. If you look at commit 581dd6983034 [1], when
> a firmware is about to be loaded, the kernel credentials are used.

Ahh, that's good, at least there's no "random state" to check.

But it does still mean that the security check is pointless - there
aren't any other credentials that would validly be used for firmware
loading, so what was the point of checking them again?

In fact, the commit you point to was made exactly *because* the kind
of pointless and wrong security checks that I'm complaining about were
done, wasn't it?

              Linus
On Thu, Mar 27, 2025 at 5:10 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 26 Mar 2025 at 20:28, Thiébaud Weksteen <tweek@google.com> wrote:
> >
> > That is not quite right. If you look at commit 581dd6983034 [1], when
> > a firmware is about to be loaded, the kernel credentials are used.
>
> Ahh, that's good, at least there's no "random state" to check.
>
> But it does still mean that the security check is pointless - there
> aren't any other credentials that would validly be used for firmware
> loading, so what was the point of checking them again?

The value here isn't so much about checking the source context
"kernel", but rather about checking the target context and enforcing
that firmware can only come from trusted filesystems. So even a
compromised privileged process that sets firmware_class.path cannot
cause the kernel to load firmware from an arbitrary source.

These restrictions reduce our reliance on (1) individual component
manufacturers (e.g. NFC chips) implementing signature verification
correctly in their firmware loading procedure, or (2) the fallout for
the Android ecosystem if a component manufacturer's private key leaks,
because even firmware signed with the leaked key will not be trusted
if it doesn't come from the trusted filesystem signed by the Android
device manufacturer. Leaked keys are a very real problem. Restrictions
like those added here can significantly reduce the severity of such
incidents.

With this, we can write policies for Android devices that enforce that
firmware only comes from trusted filesystems. For example:

    allow kernel vendor_file:system firmware_load;

You could even imagine a more lenient rule that allows any domain to
load firmware, just so long as it comes from a trusted filesystem. For
example:

    allow domain vendor_file:system firmware_load;

We can then create tests that prevent rules from being added that
attempt to load firmware from untrusted locations. For example:

    neverallow * ~vendor_file:system firmware_load;

Such tests are quite important for preventing device manufacturers
from adding insecure policies that would allow firmware loading from
untrusted sources.

> In fact, the commit you point to was made exactly *because* the kind
> of pointless and wrong security checks that I'm complaining about were
> done, wasn't it?
On Thu, 27 Mar 2025 at 01:59, Jeffrey Vander Stoep <jeffv@google.com> wrote:
>
> The value here isn't so much about checking the source context
> "kernel", but rather about checking the target context and enforcing
> that firmware can only come from trusted filesystems. So even a
> compromised privileged process that sets firmware_class.path cannot
> cause the kernel to load firmware from an arbitrary source.

Yes, and that's literally why I earlier in the thread pointed out the
new code in selinux_kernel_load_data():

 "I'm looking at selinux_kernel_load_data() in particular, where you
  don't even pass it a file at all, so it's not like it could check for
  "is this file integrity-protected" or anything like that"

because I understand that you might want to verify the *file* the
firmware comes from, but I think verifying the context in which the
firmware is loaded is absolutely insane and incorrect.

And that is literally *all* that the new case in
selinux_kernel_load_data() does. There is no excuse for that craziness
that I can come up with.

And yes, I'm harping on this, because I really *hate* how the security
layer comes up in my performance profiles so much. It's truly
disgusting. So when I see new hooks that don't make sense to me, I
react *very* strongly.

Do I believe this insanity matters for performance? No.

But do I believe that the security code needs to *think* about the
random hooks it adds more? Yes. YES!

Which is why I really hate seeing new random hooks where I then go
"that is complete and utter nonsense".

[ This whole patch triggered me for another reason too - firmware
  loading in particular has a history of user space actively and
  maliciously screwing the kernel up.

  The reason we load firmware directly from the kernel is because user
  space "policy" decisions actively broke our original "let user space
  do it" model.

  So if somebody thinks I'm overreacting, they are probably right, but
  dammit, this triggers two of my big red flags for "this is horribly
  wrong" ]

              Linus
On 3/27/2025 1:59 AM, Jeffrey Vander Stoep wrote:
> [...]
>
> With this, we can write policies for Android devices that enforce that
> firmware only comes from trusted filesystems. For example:
>
>     allow kernel vendor_file:system firmware_load;

Am I missing something, or isn't that what loadpin is for?
On Thu, Mar 27, 2025 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 01:59, Jeffrey Vander Stoep <jeffv@google.com> wrote:
> >
> > The value here isn't so much about checking the source context
> > "kernel", but rather about checking the target context and enforcing
> > that firmware can only come from trusted filesystems. So even a
> > compromised privileged process that sets firmware_class.path cannot
> > cause the kernel to load firmware from an arbitrary source.
>
> Yes, and that's literally why I earlier in the thread pointed out the
> new code in selinux_kernel_load_data()
>
> "I'm looking at selinux_kernel_load_data() in particular, where you
> don't even pass it a file at all, so it's not like it could check for
> "is this file integrity-protected" or anything like that"
>
> because I understand that you might want to verify the *file* the
> firmware comes from, but I think verifying the context in which the
> firmware is loaded is absolutely insane and incorrect.

So the only use case I could see for that particular check would be if
we wanted to block loading firmware directly from memory/blobs rather
than from files. If that's not a valid use case, then we can get rid
of that particular check if desired; it just seemed inconsistent
between the two hooks otherwise. What's the purpose of having the
LOADING_FIRMWARE enum or hook call on that code path at all then?

> And yes, I'm harping on this, because I really *hate* how the security
> layer comes up in my performance profiles so much. It's truly
> disgusting. So when I see new hooks that don't make sense to me, I
> react *very* strongly.

If you have constructive suggestions (or patches!) to improve
performance of LSM and/or SELinux, we'd be glad to take them. Or even
helpful hints on how to best measure and see the same overheads you
are seeing and where.
On Thu, 27 Mar 2025 at 09:55, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> If you have constructive suggestions (or patches!) to improve
> performance of LSM and/or SELinux, we'd be glad to take them. Or even
> helpful hints on how to best measure and see the same overheads you
> are seeing and where.

So measuring it is fairly easy. I have various loads I care about, but the simplest one that I feel is actually a real load - rather than the more artificial benchmarks I then use for verification when I make any changes - is literally "do an allmodconfig kernel rebuild with no changes".

That empty kernel rebuild approximates what I actually do for most small pull requests when only a couple of files really get re-built. I don't want to actually try to profile user space and the actual compiler overhead.

To see that load, first do this as root:

    echo -1 > /proc/sys/kernel/perf_event_paranoid

so that you as a regular user can then do a kernel build and get good kernel-level profiles (there are other ways you can do this: you can obviously also do a full profile as root). Obviously you should *not* do this on a machine with other users, the above basically says "let anybody do profiling on kernel".

Then I just do

    make allmodconfig
    make -j64

to prep the tree and cause everything to be built (well - normally I obviously don't do that, because my kernel tree is always built anyway, but I'm just trying to make it obvious how to reproduce it). And then I just do

    perf record -e cycles:pp make -j64 > ../makes
    perf report --sort=symbol,dso

and press 'k' to just get the kernel side (again - there's little I can do, or care, about the user space profiles). The "--sort=symbol,dso" is because I don't care _which_ process it is that does what, so I just want the output binned by kernel function.
Just on a very high level, this is what I get with the v6.14 tree (cut-off at 0.25%, this is "out of total cost" for the load including user space, so these top functions account for just over 9% of the *total* cost of the benchmark):

    1.26%  [k] clear_page_rep
    0.82%  [k] avc_has_perm_noaudit
    0.73%  [k] link_path_walk
    0.73%  [k] terminate_walk
    0.58%  [k] __d_lookup_rcu
    0.56%  [k] step_into
    0.52%  [k] selinux_inode_permission
    0.50%  [k] memset_orig
    0.49%  [k] vfs_statx_path
    0.47%  [k] strncpy_from_user
    0.47%  [k] rep_movs_alternative
    0.37%  [k] vfs_statx
    0.31%  [k] __rcu_read_unlock
    0.30%  [k] btrfs_getattr
    0.28%  [k] inode_permission
    0.26%  [k] kmem_cache_free
    0.26%  [k] generic_permission
    [...]

so the top thing is the page clearing (and you see other memcpy/memset variations there too), but the #2 hit for the kernel profile is selinux, which takes more time than the basic path walking. And selinux_inode_permission() is rather high up there too, as you can see. Together, those two functions are about 1.3% of the whole load.

Now, the above profile is just from *my* machine, and microarchitecture will matter a *LOT*. So the details will depend hugely on your hardware, but I've been doing kernel profiles for decades, and the basics haven't really changed. memory movement and clearing is universally the biggest thing, and that's fine. It's fundamental.

Also, when I do profiles I turn off the CPU mitigations, because again depending on microarchitecture those can just swamp everything else, and while they are a real overhead, from a performance standpoint I'm hoping they are something that long-term is going to be mostly fixed in hardware (apart from the basic Spectre-v1 branch speculation, which is *not* turned off in my kernels, and which we've actually worked fairly hard on making sure is handled efficiently).

Now, looking at instruction level profiles is kind of iffy, and you have to know your microarchitecture to really make sense of them.
The "cycles:pp" helps make profiles more relevant (and requires PEBS/IBS or equivalent CPU support to work), but it won't replace "you have to understand hardware". You do want to look at instruction profiles at least a bit, partly because inlining makes _not_ looking at them often kind of misleading. The real cost may be in a function that was inlined. Typically, once you look at instruction-level profiles, and understand them, you'll see one of three issues: - cache misses. This is typically the big obvious one. And you'll see them both for I$ and D$. People will tell you that I$ cache misses are in the noise, but people are wrong. It's simply not true for the kernel or many other real benchmarks, and you'll often see it as big hits at the beginnings of functions - or at the return points of calls - where the instructions otherwise look very benign. - serialization. This shows up hugely on modern CPUs, so any memory barriers etc (ie locked instructions on x86) will stand out. - branch misprediction. This will typically show up in the profiles not on the branch, but on the mispredicted _target_ of the branch, so it can end up being a bit confusing. The CPU speculation mitigations typically turn this issue up to 11 and add misprediction noise absolutely everywhere, which is why turning those off is such a big deal. but in an OoO CPU all of the above will basically result in various instruction profile "patterns", so you in general cannot really look at individual instructions, and should use the above patterns to try to figure out *why* the profile looks like it does. It's not obvious, and the patterns will be different for different microarchitectures. You can use fancier perf things to try to figure out exactly what is going on, but you should always _start_ from the "where are the costs" on a pure basic cycle basis. 
Only after that does it make sense to say something like "Oh, this is expensive and seems to be taking excessive cache misses, let's drill down into why".

Also, typically, code that has already been tweaked to death tends to show fewer obvious peaks in the profile. Because the obvious peaks have often been getting some attention. So the profile ends up not showing a big red flag any more, because the big issue has been fixed and now it's mostly a "it's called too much" issue.

For the security layer, at least historically the big cache miss (on this load) has been the inode->i_security access (not loading the pointer itself, but the accesses following it), and the hash tables for that AVC lookup. Both have been improved upon, and I didn't try to analyze the above profiles any closer when it comes to exactly what is going on, so take that with the grain of salt it deserves.

The exact details may have changed, but as you can see, avc_has_perm_noaudit() really is very much a top offender today. And yes, the reason is that we have to call it a *lot* for any filename lookups. Some of those security hooks get called for every path component, others get called only for the final one.

The best fix would be to be able to cache the "this doesn't have any extra security rules outside of the regular POSIX ones" and avoid calling the hook entirely. That's what we've done for the ACL path, and that has turned ACL costs into almost a non-issue.

Linus
On Thu, 27 Mar 2025 at 11:15, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The best fix would be to be able to cache the "this doesn't have any
> extra security rules outside of the regular POSIX ones" and avoid
> calling the hook entirely. That's what we've done for the ACL path,
> and that has turned ACL costs into almost a non-issue.

.. just to follow up on that, because I tried to look into it, but didn't know the lsm and selinux code well enough to actually make much progress, so if somebody who does is really willing to take a deeper look, please pester me for details.

But the big high-level case for pathname lookup tends to be the calls we have to do not just for the final inode, but the "every single component" cases.

Which is a much *simpler* and more targeted security check than the "any random permissions". It's the "can I use this directory for lookups", and if we had an inode flag that said "this inode has no security policies that change the lookup rules", just that single bit would likely be a *huge* improvement.

Because then you don't need to try to optimize the security check itself, because instead the VFS layer can optimize it all by not calling into the security layer at all.

And from a "this is called too much" standpoint, the "every path component" cases tend to outnumber the "final file open" case by something like 5-to-1 (handwavy, but not entirely made up).

I think the main case is

    link_path_walk -> may_lookup -> security_inode_permission

where the special case is that may_lookup() only sets MAY_EXEC and MAY_NOT_BLOCK (for the RCU lookup case, which is what matters).
The cost is "many many many calls", and each individual call is fairly cheap on its own, but they all walk different hash chains, and it all adds up to "lots of cost". Hmm? Linus
On Thu, Mar 27, 2025 at 3:03 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 27 Mar 2025 at 11:15, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The best fix would be to be able to cache the "this doesn't have any > > extra security rules outside of the regular POSIX ones" and avoid > > calling the hook entirely. That's what we've done for the ACL path, > > and that has turned ACL costs into almost a non-issue. > > .. just to follow up on that, because I tried to look into it, but > didn't know the lsm and selinux code well enough to actually make much > progress, so if somebody who does is really willing to take a deeper > look, please pester me for details. Would be interested in those details. > But the big high-level case for pathname lookup tends to be the calls > we have to do not just for the final inode, but the "every single > component" cases. > > Which is a much *simpler* and more targeted security check than the > "any random permissions". It's the "can I use this directory for > lookups", and if we had an inode flag that said "this inode has no > security policies that change the lookup rules", just that single big > would likely be a *huge* improvement. > > Because then you don't need to try to optimize the security check > itself, because instead the VFS layer can optimize it all by not > calling into the security layer at all. > > And from a "this is called too much" standpoint, the "every path > component" cases tend to outnumber the "final file open" case by > something like 5-to-1 (handwavy, but not entirely made up). > > I think the main case is > > link_path_walk -> may_lookup -> security_inode_permission > > where the special case is that may_lookup() only sets MAY_EXEC and > MAY_NOT_BLOCK (for the RCU lookup case, which is what matters). 
> > So if inode_permission() (in fs/namei.c) could avoid calling > security_inode_permission() because the inode has some flag that says > "none of the security models disallow MAY_EXEC for this directory > walk", I really think that would help. > > I think trying to optimize the AVC hash table lookup further or > something like that is a dead end. The cost is "many many many calls", > and each individual call is fairly cheap on its own, but they all walk > different hash chains, and it all adds up to "lots of cost". > > Hmm? Where could/would we cache that information so that it was accessible directly by the VFS layer?
On Thu, 27 Mar 2025 at 12:16, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> Where could/would we cache that information so that it was accessible
> directly by the VFS layer?

So the VFS layer already does this for various other things. For this case, the natural thing to do would be to add another IOP_xyzzy flag in inode->i_opflags.

That's how we already say things like "this inode has no filesystem-specific i_op->permission function" (IOP_FASTPERM), so that we don't even have to follow the "inode->i_op->permission" pointer chain to see a NULL pointer.

Yes, the VFS layer is *heavily* optimized like that. It literally does that IOP_FASTPERM to avoid chasing two pointers - not even the call, just the "don't even bother to follow pointers to see if it's NULL". See do_inode_permission().

And we have 16 bits in that inode->i_opflags, and currently only use 7 of those bits. Adding one bit for an IOP_NO_SECURITY_LOOKUP kind of logic (feel free to rename that - just throwing a random name out as a suggestion) would be a complete no-brainer.

NOTE! The rule for the i_opflags accesses is that *reading* them is done with no locking at all, but changing them takes the inode spinlock (and we should technically probably use WRITE_ONCE() and READ_ONCE(), but we don't).

And notice that the "no locking at all for reading" means that if you *change* the bit - even though that involves locking - there may be concurrent lookups in process that won't see the change, and would go on as if the lookup still does not need any security layer call. No serialization to readers at all (although you could wait for an RCU period after changing if you really need to, and only use the bit in the RCU lookup).

That should be perfectly fine - I really don't think serialization is even needed. If somebody is changing the policy rules, any file lookups *concurrent* to that change might not see the new rules, but that's the same as if it happened before the change.
I just wanted to point out that the serialization is unbalanced: the spinlock for changing the flag is literally just to make sure that two bits being changed at the same time don't stomp on each other (because it's a 16-bit non-atomic field, and we didn't want to use a "unsigned long" and atomic bitops because the cache layout of the inode is also a big issue).

And you can take the IOP_FASTPERM thing as an example of how to do this: it is left clear initially, and what happens is that during the permission lookup, if it *isn't* set, we'll follow those inode->i_op->permission pointers, and notice that we should set it:

    if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
            if (likely(inode->i_op->permission))
                    return inode->i_op->permission(idmap, inode, mask);

            /* This gets set once for the inode lifetime */
            spin_lock(&inode->i_lock);
            inode->i_opflags |= IOP_FASTPERM;
            spin_unlock(&inode->i_lock);
    }

and I think the security layer could take a very similar approach: not setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a security_inode_permission() call is made with just MAY_NOT_BLOCK | MAY_LOOKUP, and the security layer notices that "this inode has no reason to care", it could set the bit so that *next* time around the VFS layer won't bother to call into security_inode_permission() unnecessarily.

Does that clarify?

Linus
On Thu, Mar 27, 2025 at 3:40 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 27 Mar 2025 at 12:16, Stephen Smalley > <stephen.smalley.work@gmail.com> wrote: > > > > Where could/would we cache that information so that it was accessible > > directly by the VFS layer? > > So the VFS layer already does this for various other things. For this > case, the natural thing to do would be to add another IOP_xyzzy flag > in inode->i_opflags. > > That's how we already say things like "this inode has no > filesystem-specific i_op->permission function" (IOP_FASTPERM), so that > we don't even have to follow the "inode->i_op->permission" pointer > chain to see a NULL pointer. > > Yes, the VFS layer is *heavily* optimized like that. It literally does > that IOP_FASTPERM to avoid chasing two pointers - not even the call, > just the "don't even bother to follow pointers to see if it's NULL". > See do_inode_permission(). > > And we have 16 bits in that inode->i_opflags, and currently only use 7 > of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of > logic (feel free to rename that - just throwing a random name out as a > suggestion) would be a complete no-brainer. > > NOTE! The rule for the i_opflags accesses is that *reading* them is > done with no locking at all, but changing them takes the inode > spinlock (and we should technically probably use WRITE_ONCE() and > READ_ONCE(), but we don't). > > And notice that the "no locking at all for reading" means that if you > *change* the bit - even though that involves locking - there may be > concurrent lookups in process that won't see the change, and would go > on as if the lookup still does not need any security layer call. No > serialization to readers at all (although you could wait for an RCU > period after changing if you really need to, and only use the bit in > the RCU lookup). > > That should be perfectly fine - I really don't think serialization is > even needed. 
If somebody is changing the policy rules, any file > lookups *concurrent* to that change might not see the new rules, but > that's the same as if it happened before the change. > > I just wanted to point out that the serialization is unbalanced: the > spinlock for changing the flag is literally just to make sure that two > bits being changed at the same time don't stomp on each other (because > it's a 16-bit non-atomic field, and we didn't want to use a "unsigned > long" and atomic bitops because the cache layout of the inode is also > a big issue). > > And you can take the IOP_FASTPERM thing as an example of how to do > this: it is left clear initially, and what happens is that during the > permission lookup, if it *isn't* set, we'll follow those > inode->i_io->permission pointers, and notice that we should set it: > > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > if (likely(inode->i_op->permission)) > return inode->i_op->permission(idmap, inode, mask); > > /* This gets set once for the inode lifetime */ > spin_lock(&inode->i_lock); > inode->i_opflags |= IOP_FASTPERM; > spin_unlock(&inode->i_lock); > } > > and I think the security layer could take a very similar approach: not > setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a > security_inode_permission() call is made with just MAY_NOT_BLOCK | > MAY_LOOKUP, and the security layer notices that "this inode has no > reason to care", it could set the bit so that *next* time around the > VFS layer won't bother to call into security_inode_permission() > unnecessarily. > > Does that clarify? Yes, thank you. I think it would be easy enough to make that change to selinux_inode_permission() and to clear that inode flag on file relabels (e.g. in selinux_inode_post_setxattr() and inode_invalidate_secctx()). Not as sure about handling policy reloads / boolean changes at runtime without also caching the policy sequence number in the inode too so that can be compared. 
Further, I'm unclear on how to implement this in a manner that works with the LSM stacking support, since some other module might disagree about whether we can skip these lookups. Normally I'd just add an extra boolean or flag argument to the underlying ->inode_permission() hook so each module can indicate its decision and security/security.c:security_inode_permission() could compose the result, but I guess that would require switching security_inode_permission() from using call_int_hook() to manually doing the lsm_for_each_hook() loop itself which may have an adverse impact on those calls in general.
On Fri, Mar 28, 2025 at 9:23 AM Stephen Smalley <stephen.smalley.work@gmail.com> wrote: > On Thu, Mar 27, 2025 at 3:40 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, 27 Mar 2025 at 12:16, Stephen Smalley > > <stephen.smalley.work@gmail.com> wrote: > > > > > > Where could/would we cache that information so that it was accessible > > > directly by the VFS layer? > > > > So the VFS layer already does this for various other things. For this > > case, the natural thing to do would be to add another IOP_xyzzy flag > > in inode->i_opflags. > > > > That's how we already say things like "this inode has no > > filesystem-specific i_op->permission function" (IOP_FASTPERM), so that > > we don't even have to follow the "inode->i_op->permission" pointer > > chain to see a NULL pointer. > > > > Yes, the VFS layer is *heavily* optimized like that. It literally does > > that IOP_FASTPERM to avoid chasing two pointers - not even the call, > > just the "don't even bother to follow pointers to see if it's NULL". > > See do_inode_permission(). > > > > And we have 16 bits in that inode->i_opflags, and currently only use 7 > > of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of > > logic (feel free to rename that - just throwing a random name out as a > > suggestion) would be a complete no-brainer. > > > > NOTE! The rule for the i_opflags accesses is that *reading* them is > > done with no locking at all, but changing them takes the inode > > spinlock (and we should technically probably use WRITE_ONCE() and > > READ_ONCE(), but we don't). > > > > And notice that the "no locking at all for reading" means that if you > > *change* the bit - even though that involves locking - there may be > > concurrent lookups in process that won't see the change, and would go > > on as if the lookup still does not need any security layer call. 
No > > serialization to readers at all (although you could wait for an RCU > > period after changing if you really need to, and only use the bit in > > the RCU lookup). > > > > That should be perfectly fine - I really don't think serialization is > > even needed. If somebody is changing the policy rules, any file > > lookups *concurrent* to that change might not see the new rules, but > > that's the same as if it happened before the change. > > > > I just wanted to point out that the serialization is unbalanced: the > > spinlock for changing the flag is literally just to make sure that two > > bits being changed at the same time don't stomp on each other (because > > it's a 16-bit non-atomic field, and we didn't want to use a "unsigned > > long" and atomic bitops because the cache layout of the inode is also > > a big issue). > > > > And you can take the IOP_FASTPERM thing as an example of how to do > > this: it is left clear initially, and what happens is that during the > > permission lookup, if it *isn't* set, we'll follow those > > inode->i_io->permission pointers, and notice that we should set it: > > > > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > > if (likely(inode->i_op->permission)) > > return inode->i_op->permission(idmap, inode, mask); > > > > /* This gets set once for the inode lifetime */ > > spin_lock(&inode->i_lock); > > inode->i_opflags |= IOP_FASTPERM; > > spin_unlock(&inode->i_lock); > > } > > > > and I think the security layer could take a very similar approach: not > > setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a > > security_inode_permission() call is made with just MAY_NOT_BLOCK | > > MAY_LOOKUP, and the security layer notices that "this inode has no > > reason to care", it could set the bit so that *next* time around the > > VFS layer won't bother to call into security_inode_permission() > > unnecessarily. > > > > Does that clarify? > > Yes, thank you. 
I think it would be easy enough to make that change to > selinux_inode_permission() and to clear that inode flag on file > relabels (e.g. in selinux_inode_post_setxattr() and > inode_invalidate_secctx()). Not as sure about handling policy reloads > / boolean changes at runtime without also caching the policy sequence > number in the inode too so that can be compared. Further, I'm unclear > on how to implement this in a manner that works with the LSM stacking > support, since some other module might disagree about whether we can > skip these lookups. Normally I'd just add an extra boolean or flag > argument to the underlying ->inode_permission() hook so each module > can indicate its decision and > security/security.c:security_inode_permission() could compose the > result, but I guess that would require switching > security_inode_permission() from using call_int_hook() to manually > doing the lsm_for_each_hook() loop itself which may have an adverse > impact on those calls in general. I'm not sure the VFS flag approach is going to end up being practical due to various constraints, but I do have some other ideas that should allow us to drop the bulk of the SELinux AVC calls while doing path walks that I'm going to try today/soon. I'll obviously report back if it works out.
On Fri, 28 Mar 2025 at 06:23, Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> Yes, thank you. I think it would be easy enough to make that change to
> selinux_inode_permission() and to clear that inode flag on file
> relabels (e.g. in selinux_inode_post_setxattr() and
> inode_invalidate_secctx()).

So the thing that *really* made me go "I don't know how to do this in the security layer" is not so much the relabeling - that should be easy to handle by just clearing the bit, as you say.

And I wasn't even so much worried about policy changes that would globally change meaning of existing labels:

> Not as sure about handling policy reloads
> / boolean changes at runtime without also caching the policy sequence
> number in the inode too so that can be compared.

Yeah, a sequence number seems like an obvious solution, even if it might be a bit awkward to find a place to store it that doesn't pollute the cache. The reason it would be _so_ nice to not call the security hooks at all in this path is that I think we otherwise can do all the inode security lookups by just looking at the very beginning of the inode (that's why we do that IOP_FASTPERM thing - just to avoid touching other cachelines). But if it avoids the security_inode_permission() call, it would definitely be a win even with a cache miss.

> Further, I'm unclear
> on how to implement this in a manner that works with the LSM stacking
> support,

So *this* was what made me go "I don't know how to do this AT ALL", along with the fact that the rule for the bit would have to be that it would be true for *all* execution contexts.

IOW, it's a very different thing from the usual security hook calls, in that instead of saying "is this access ok for the current context", the bit setting would have to say "this lookup is ok for _all_ calling contexts for this inode for _all_ of the nested security things".
The sequence number approach should take care of any races, so that part isn't a huge problem: just set the inode sequence number early, before doing any of the checks. And then only set the bit at the end if every stacked security layer says "yeah, this inode doesn't have extra lookup rules as far as I'm concerned". So if any of the rules changed in the meantime, the sequence number means that the bit won't take effect. So that part should be fine.

But the "this inode doesn't have extra lookup rules" part is what needs low-level knowledge about how all the security models work. And it really would have to be true in all contexts - ie across different namespaces etc.

(Note the "extra" part: the VFS layer deals with all the *normal* Unix rules, including ACL, of course. So it's literally just about "are there security hook rules that then limit things _beyond_ those standard permission rules")

It might be worth noting that this call site is special for the VFS anyway: it *looks* like a normal security hook, but "may_lookup()" is literally *only* used for directories, and *only* used for "can I do name lookup in this". So if it helps the security layers, that case could be made into its own specialized hook entirely, even if it would require basically duplicating up "inode_permission()" that is currently used for both the lookup case and for "generic" inode permission checking.

For example, I do know that SElinux turns the VFS layer permission mask (things like "MAY_EXEC") into its own masks that are different for files and for directories (so MAY_EXEC becomes DIR__SEARCH for SElinux for directories, but FILE__EXECUTE for when doing 'execve()' on a file). And so that's an example of something we could short-circuit here, because *all* we care about is that DIR__SEARCH thing, and we know that statically.
But I know selinux better than any other of the security models, so I don't know what *any* of the others do (and honestly, the selinux code I don't know that well - I've looked at it a lot, but I've really only ever looked at it in the "I'm looking at profiles, is there anything obvious I can do about this" sense). Linus
On 3/28/2025 9:36 AM, Linus Torvalds wrote: > On Fri, 28 Mar 2025 at 06:23, Stephen Smalley > <stephen.smalley.work@gmail.com> wrote: >> Yes, thank you. I think it would be easy enough to make that change to >> selinux_inode_permission() and to clear that inode flag on file >> relabels (e.g. in selinux_inode_post_setxattr() and >> inode_invalidate_secctx()). > So the thing that *really* made me go "I don't know how to do this in > the security layer" is not so much the relabeling - that should be > easy to handle by just clearing the bit, as you say. > > And I wasn't even so much worried about policy changes that would > globally change meaning of existing labels: > >> Not as sure about handling policy reloads >> / boolean changes at runtime without also caching the policy sequence >> number in the inode too so that can be compared. > Yeah, a sequence number seems like an obvious solution, even if it > might be a bit awkward to find a place to store it that doesn't > pollute the cache. The reason it would be _so_ nice to not call the > security hooks at all in this path is that I think we otherwise can do > all the inode security lookups by just looking at the very beginning > of the inode (that's why we do that IOP_FASTPERM thing - just to avoid > touching other cachelines). But if it avoids the > security_inode_permission() call, it would definitely be a win even > with a cache miss. > >> Further, I'm unclear >> on how to implement this in a manner that works with the LSM stacking >> support, > So *this* was what made me go "I don't know how to to this AT ALL", > along with the fact that the rule for the bit would have to be that it > would be true for *all* execution contexts. > > IOW, it's a very different thing from the usual security hook calls, > in that instead of saying "is this access ok for the current context", > the bit setting would have to say "this lookup is ok for _all_ calling > contexts for this inode for _all_ of the nested security things". 
That is correct. But the LSMs we currently have don't do frequent attribute changes on objects and generally constrain process attribute changes to exec() and administrative actions. Invalidating the lookup ok flag when any LSM changes attributes is most likely to occur at a time when all LSMs could be expected to change attributes. > The sequence number approach should take care of any races, so that > part isn't a huge problem: just set the inode sequence number early, > before doing any of the checks. And then only set the bit at the end > if every stacked security layer says "yeah, this inode doesn't have > extra lookup rules as far as I'm concerned". So if any of the rules > changed in the meantime, the sequence number means that the bit won't > take effect. So that part should be fine. > > But the "this inode doesn't have extra lookup rules" part is what > needs low-level knowledge about how all the security models work. And > it really would have to be true in all contexts - ie across different > namespaces etc. It would be nice if changing the rules for one LSM (e.g. adding a Smack access rule: # echo Snap Pop rw > /sys/fs/smackfs/load2) didn't affect the lookup ok flag for other LSMs, but I don't see how the overhead would be justified. I'm also concerned about how detecting and clearing the flag on a single LSM case on policy change is going to be anything but brutal.