Message ID | CAHk-=whR4KSGqfEodXMwOMdQBt+V2HHMyz6+CobiydnZE+Vq9Q@mail.gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Paul Moore |
Headers | show |
Series | Looking at profile data once again - avc lookup | expand |
On Sat, Jan 28, 2023 at 4:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I happened to do my occasional "let's see what a profile for an > empty kernel build is", which usually highlights some silly thing we > do in pathname lookup (because most of the cost is just 'make' doing > various string handling in user space, and doing variations of > 'lstat()' to look at file timestamps). > > And, as always, avc_has_perm() and avc_has_perm_noaudit() are pretty > high up there, together with selinux_inode_permission(). It just gets > called a *lot*. > > Now, avc_has_perm_noaudit() should be inlined, but it turns out that > it isn't because of some bad behavior of the unlikely path. That part > looks fairly easy to massage away. I'm attaching a largely untested > patch that makes the generated code look a bit better, in case anybody > wants to look at it. I'll take a look, although just a heads-up that I don't generally merge patches into selinux/next at this point in the -rc cycle unless they are bug fixes, or some other critical patch; it's likely this will need to wait until after the upcoming merge window closes. > But the real reason for this email is to query about 'selinux_state'. > > We pass that around as a pointer quite a bit, to the point where > several function calls have to use stack space for arguments just > because they have more than six arguments. And from what I can tell, > it is *always* just a pointer to 'selinux_state', and never anything > else. We can, and should, look at those cases where the function parameters start to spill into the stack space. > This was all done five years ago by commit aa8e712cee93 ("selinux: > wrap global selinux state") and I don't see the point. It never seems > to have gone anywhere, there's no other state that I can find, and all > it does is add overhead and complexity. > > So I'd like to undo that, and go back to the good old days when we > didn't waste time and effort on passing a pointer that always has the > same value as an argument. > > Comments? Is there some case I've missed? You're correct in that selinux_state parameters currently always point back to the single global instance, however there was, and still is, a point to that patch ... although I will admit it is a long time coming. The purpose of this change was to help pave the way for namespacing SELinux by easing development and helping to reduce ongoing merge conflicts with the in-development patches. For a variety of reasons the namespacing work has languished a bit, but it hasn't been forgotten and I expect it to gain importance once the LSM stacking patches land (LSM stacking alone isn't going to solve all of the problems that many are expecting, it will need to be combined with LSM-specific namespacing).
On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote: > > I'll take a look, although just a heads-up that I don't generally > merge patches into selinux/next at this point in the -rc cycle unless > they are bug fixes, or some other critical patch; it's likely this > will need to wait until after the upcoming merge window closes. Yeah, that patch was not some kind of "please apply this urgent fix", more of a "I'm looking at path walking again, and the selinux code is more expensive than the *actual* path walk is" heads up. > > Comments? Is there some case I've missed? > > You're correct in that selinux_state parameters currently always point > back to the single global instance, however there was, and still is, a > point to that patch ... although I will admit it is a long time > coming. Honestly, considering that the selinux code is literally more expensive than THE REAL WORKLOAD it is checking, I really want people to take a second look. If some new feature makes that crazy-expensive thing *worse*, we have issues. If it's been that way for five years with no progress, and no clear indication that it's even some high-priority issue that lots of people are asking for, maybe that should be a big hint. Linus
On Sun, Jan 29, 2023 at 2:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote: > > > > I'll take a look, although just a heads-up that I don't generally > > merge patches into selinux/next at this point in the -rc cycle unless > > they are bug fixes, or some other critical patch; it's likely this > > will need to wait until after the upcoming merge window closes. > > Yeah, that patch was not some kind of "please apply this urgent fix", > more of a "I'm looking at path walking again, and the selinux code is > more expensive than the *actual* path walk is" heads up. Yep, just wanted to set expectations so you wouldn't be surprised to not see this during the upcoming merge window. > > > Comments? Is there some case I've missed? > > > > You're correct in that selinux_state parameters currently always point > > back to the single global instance, however there was, and still is, a > > point to that patch ... although I will admit it is a long time > > coming. > > Honestly, considering that the selinux code is literally more > expensive than THE REAL WORKLOAD it is checking, I really want people > to take a second look. WE WILL > If some new feature makes that crazy-expensive thing *worse*, we have issues. > > If it's been that way for five years with no progress, and no clear > indication that it's even some high-priority issue that lots of people > are asking for, maybe that should be a big hint. To be fair, people *are* asking SELinux namespacing, but there are some very thorny problems that remain unsolved. However, after the merge window we should consider moving away from passing the selinux_state as a parameter and just using it as a global resource.
On Mon, Jan 30, 2023 at 12:14 PM Paul Moore <paul@paul-moore.com> wrote: > On Sun, Jan 29, 2023 at 2:37 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Sat, Jan 28, 2023 at 2:33 PM Paul Moore <paul@paul-moore.com> wrote: > > Honestly, considering that the selinux code is literally more > > expensive than THE REAL WORKLOAD it is checking, I really want people > > to take a second look. > > WE WILL I should add, do you have any particular test script you use? If not, that's fine, I just cobble something together, but I figured if you had something already it would save me having to remember the details on the perf tools.
On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@paul-moore.com> wrote: > > I should add, do you have any particular test script you use? If not, > that's fine, I just cobble something together, but I figured if you > had something already it would save me having to remember the details > on the perf tools. So I've done various things over the years, including just writing special tools that do nothing but a recursive 'stat()' over and over again over a big tree, just to pinpoint the path lookup costs (big enough of a tree that you see actual cache effects). Then do that either single-threaded or multi-threaded to see the locking issues. But what I keep coming back to is to just have a fully built "make allmodconfig" tree - which I have _anyway_, and then doing perf record -e cycles:pp make -j64 on it. You'll need to do something like echo 1 > /proc/sys/kernel/perf_event_paranoid before starting your profiling session to make it possible to do that profile as a normal user and get the kernel data. And then look at the end result with just perf report --sort=dso,symbol which avoids sorting by process, because I don't care _which_ process does something, I just want to see the kernel symbol table end results. Press 'k' to zoom into just the kernel profile, and Bob's your uncle. You can play with callchain data ("-g"), but I tend to like the plain flat profile to just see where things are happening. I'll do the call chain if I then start to look into things like "which caller was the main reason for that queued_spin_lock_slowpath cost" kinds of things, but it's not always even necessary. Unless you have some big kernel debugging options on, what you normally see for that load would be - memset and memcpy (including very much our user-space versions of it, like clear_page_rep and copy_user_generic_string). - depending on number of CPU cores, locking (I *despise* folio_memcg_lock, but that's not from the pathname lookup, it's the page fault path, particularly WP faults) - page table setup and clearing - ... and finally, pathname walking, generally with selinux_inode_permission and avc_has_perm_*() fairly high up So it's not like 'make' is dominated by pathname walking - the process related stuff tends to be higher - but I've ended up using that as a kernel profile source because it's a real load for me. Also, most of the time by far is spent in 'make' doing various string things in user space. Our kernel makefiles tend to have a lot of symbol expansion etc. I just ignore all the user space stuff. There are other loads I occasionally look at, but this is basically the one I always tend to return to because it tends to stress the two things I personally end up interested in - the VFS layer and the VM code. I don't really tend to do IO etc. Put another way: there's nothing _special_ about the above, except for the "it's a real load that does actually show a few core kernel areas". Also, the above is just about the least fancy use of prof you'll ever see. No events, no special hardware counters for things like cache misses or anything, just plain old "where does the time go". I do end up looking at the annotated assembly code (press 'a' on the selected symbol), but it's worth noting that even with hardware profiling (Intel: PEBS, AMD: IBS), saying "exactly where did we spend time" is a pretty ambiguous thing on modern OoO cores - you have to interpret the data by just seeing lots of it. But usually you can see "hot loop here", "mispredict there" or "that load is taking cache misses", so the instruction-level profiles do need to be taken with a huge grain of salt and some experience with that microarchitecture to really make sense ot them. Linus
On Mon, Jan 30, 2023 at 1:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Jan 30, 2023 at 9:47 AM Paul Moore <paul@paul-moore.com> wrote: > > > > I should add, do you have any particular test script you use? If not, > > that's fine, I just cobble something together, but I figured if you > > had something already it would save me having to remember the details > > on the perf tools. > > So I've done various things over the years, including just writing > special tools that do nothing but a recursive 'stat()' over and over > again over a big tree, just to pinpoint the path lookup costs (big > enough of a tree that you see actual cache effects). Then do that > either single-threaded or multi-threaded to see the locking issues. > > But what I keep coming back to is to just have a fully built "make > allmodconfig" tree - which I have _anyway_, and then doing > > perf record -e cycles:pp make -j64 > ... > Put another way: there's nothing _special_ about the above, except for > the "it's a real load that does actually show a few core kernel > areas". Thanks.
From 50d32ecb066a820146f5dc441185c0e03c11b3e5 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sat, 28 Jan 2023 12:53:26 -0800 Subject: [PATCH] selinux: uninline unlikely parts of avc_has_perm_noaudit() avc_has_perm_noaudit()is one of those hot functions that end up being used by almost all filesystem operations (through "avc_has_perm()") and it's intended to be cheap enough to inline. However, it turns out that the unlikely parts of it (where it doesn't find an existing avc node) need a fair amount of stack space for the automatic replacement node, so if it were to be inlined (at least clang does not) it would just use stack space unnecessarily. So split the unlikely part out of it, and mark that part noinline. That improves the actual likely part. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- security/selinux/avc.c | 52 +++++++++++++++++++++++++++++++++--------- 1 file changed, 41 insertions(+), 11 deletions(-) diff --git a/security/selinux/avc.c b/security/selinux/avc.c index 9a43af0ebd7d..85e5af7f856e 100644 --- a/security/selinux/avc.c +++ b/security/selinux/avc.c @@ -1112,6 +1112,38 @@ int avc_has_extended_perms(struct selinux_state *state, return rc; } +/* + * This is the "we have no node" part of avc_has_perm_noaudit(), + * which is unlikely and needs extra stack space for the new + * node that we generate. So don't inline it. + * + * NOTE! Called with RCU lock held for reading, and needs to + * drop it. And since the RCU lock was for the node lifetime + * (which we don't have), we could just drop it early. Except + * avc_compute_av() knows it's called under the rcu lock, and + * drops and re-takes it. What a crock. + * + * So we drop it after calling avc_compute_av() (and *inside* + * avc_compute_av()). + */ +static noinline int avc_perm_nonode(struct selinux_state *state, + u32 ssid, u32 tsid, + u16 tclass, u32 requested, + unsigned int flags, + struct av_decision *avd) +{ + u32 denied; + struct avc_xperms_node xp_node; + + avc_compute_av(state, ssid, tsid, tclass, avd, &xp_node); + rcu_read_unlock(); + denied = requested & ~(avd->allowed); + if (unlikely(denied)) + return avc_denied(state, ssid, tsid, tclass, requested, 0, 0, + flags, avd); + return 0; +} + /** * avc_has_perm_noaudit - Check permissions but perform no auditing. * @state: SELinux state @@ -1139,10 +1171,8 @@ inline int avc_has_perm_noaudit(struct selinux_state *state, unsigned int flags, struct av_decision *avd) { - struct avc_node *node; - struct avc_xperms_node xp_node; - int rc = 0; u32 denied; + struct avc_node *node; if (WARN_ON(!requested)) return -EACCES; @@ -1151,17 +1181,17 @@ inline int avc_has_perm_noaudit(struct selinux_state *state, node = avc_lookup(state->avc, ssid, tsid, tclass); if (unlikely(!node)) - avc_compute_av(state, ssid, tsid, tclass, avd, &xp_node); - else - memcpy(avd, &node->ae.avd, sizeof(*avd)); + return avc_perm_nonode(state, ssid, tsid, tclass, requested, flags, avd); + + denied = requested & ~node->ae.avd.allowed; + memcpy(avd, &node->ae.avd, sizeof(*avd)); + rcu_read_unlock(); - denied = requested & ~(avd->allowed); if (unlikely(denied)) - rc = avc_denied(state, ssid, tsid, tclass, requested, 0, 0, - flags, avd); + return avc_denied(state, ssid, tsid, tclass, requested, 0, 0, + flags, avd); - rcu_read_unlock(); - return rc; + return 0; } /** -- 2.39.0.rc2.4.g1435931e43