Message ID | CAHk-=whJgRDtxTudTQ9HV8BFw5-bBsu+c8Ouwd_PrPqPB6_KEQ@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | generic_permission() optimization | expand |
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > +static inline bool no_acl_inode(struct inode *inode) > +{ > +#ifdef CONFIG_FS_POSIX_ACL > + return likely(!READ_ONCE(inode->i_acl)); > +#else > + return true; > +#endif > +} Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there?
On Wed, 30 Oct 2024 at 20:05, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > > > +static inline bool no_acl_inode(struct inode *inode) > > +{ > > +#ifdef CONFIG_FS_POSIX_ACL > > + return likely(!READ_ONCE(inode->i_acl)); > > +#else > > + return true; > > +#endif > > +} > > Hmm... Shouldn't there be || !IS_POSIXACL(inode) in there? I was going for "intentionally minimalistic". IOW, this was meant to be an optimization for a presumed common case, but fall back to the generic code quickly and simply. Put another way: is the !IS_POSIXACL() actually a common situation worth optimizing for? Do real people actually use "noacl"? My gut feel is that it was one of those mistakes that some random odd case is using, but not something worth really optimizing for. But maybe some common situation ends up using it even without "noacl" - like /proc, perhaps? I was kind of hoping that such cases would use 'cache_no_acl()' which makes that inode->i_acl be NULL. Wouldn't that be the right model anyway for !IS_POSIXACL()? Anyway, this is all obviously a "matter of tuning the optimization". And I don't actually know if it's even worth doing in the first place. From just the profiles I looked at, that make_vfsuid() conversion looked like a surprisingly big deal, but obviously this optimistic fast case wouldn't remove all such cases anyway. Linus
On Wed, Oct 30, 2024 at 06:16:22PM -1000, Linus Torvalds wrote: > So call me crazy, but due to entirely unrelated changes (the x86 > barrier_nospec optimization) I was doing some profiles to check that > everything looked fine. > > And just looking at kernel profiles, I noticed that > "generic_permission()" wasn't entirely in the noise. It was right > there along with strncpy_from_user() etc. Which is a bit surprising, > because it should be really cheap to check basic Unix permissions? > > It's all really just "acl_permission_check()" and that code is > actually fairly optimized, except that the whole > > vfsuid = i_uid_into_vfsuid(idmap, inode); > > to check whether we're the owner is *not* cheap. It causes a call to > make_vfsuid(), and it's just messy. I assume you ran your benchmark on baremetal so no containers or idmappings? I find this somewhat suprising. One thing to optimize here independent of your proposal would be to try and __always_inline make_vfsuid(). > > Which made me wonder: we already have code that says "if the Group and > Other permission bits are the same wrt the mask we are checking, don't > bother doing the expensive group checks". > > Why not extend that to "if any of the UGO choices are ok with the > permissions, why bother looking up ownership at all?" > > Now, there is one reason to look up the owner: POSIX ACL's don't > matter to owners, but do matter to others. > > But we could check for the cached case of "no POSIX ACLs" very > cheaply, and only do it for that case. > > IOW, a patch something like the attached. I think this should work (though I find the permission checking model in general to be an unruly mess - that's ignoring the fun hardening bits of it...). Can you send give me a proper SoB patch? Liberal comments in the code would be appreciated. I'd like to put this through LTP and some of the permission tests I've written over the years. > > No, I didn't really check whether it makes any difference at all. But > my gut feel is that a *lot* of permission checks succeed for any user, > with things like system directories are commonly drwxr-xr-x, so if you > want to check read or execute permissions, it really doesn't matter > whether you are the User, the Group, or Other. > > So thus the code does that > > unsigned int all; > all = mode & (mode >> 3); // combine g into o > all &= mode >> 6; // ... and u into o > > so now the low three bits of 'all' are the bits that *every* case has > set. And then > > if (!(mask & ~all & 7)) > return 0; > > basically says "if what we are asking for is not zero in any of those > bits, we're good". > > And it's entirely possible that I'm missing something silly and am > being just stupid. Somebody hit me with the clue bat if so. > > Linus
On Wed, 30 Oct 2024 at 20:42, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I was kind of hoping that such cases would use 'cache_no_acl()' which > makes that inode->i_acl be NULL. Wouldn't that be the right model > anyway for !IS_POSIXACL()? Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like - inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED; + inode->i_acl = inode->i_default_acl = + (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL; or similar? Linus
On Thu, 31 Oct 2024 at 03:07, Christian Brauner <brauner@kernel.org> wrote: > > > It's all really just "acl_permission_check()" and that code is > > actually fairly optimized, except that the whole > > > > vfsuid = i_uid_into_vfsuid(idmap, inode); > > > > to check whether we're the owner is *not* cheap. It causes a call to > > make_vfsuid(), and it's just messy. > > I assume you ran your benchmark on baremetal so no containers or > idmappings? I find this somewhat suprising. Yes, I did too, this is the "simple" case for the whole uid mapping case, and I didn't expect it to be noticeable. That said, that function is just called a *LOT*. It's not just for each file opened, it's called for each path component as part of filename lookup, for that may_lookup_inode_permission -> do_inode_permission -> generic_permission -> acl_permission_check path. > One thing to optimize here > independent of your proposal would be to try and __always_inline > make_vfsuid(). Maybe. Part of the cost seems to be the call, but a bigger part seems to be the memory accesses around it with that whole inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid to compare against the result. Anyway, I'll play around with this a bit more and try to get better profiles. Linus
On Thu, 31 Oct 2024 at 09:04, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Maybe. Part of the cost seems to be the call, but a bigger part seems > to be the memory accesses around it with that whole > inode->i_sb->s_user_ns chain to load it, and then current->cred->fsuid > to compare against the result. > > Anyway, I'll play around with this a bit more and try to get better profiles. Ok, so I've done some more profiles, and yeah, the costs seem to be almost entirely just cache misses. make_vfsuid() shows up on the profile a bit too, but that seems to really be literally just "it's called a lot, and takes some I$ misses". When the profile looks like this: 10.71 │ push %rbx 15.44 │ mov %edx,%eax 7.88 │ mov %rdi,%rbx │ cmp $0xffffffff82532a00,%rdi 9.65 │ ↓ je 3a ... nothing ... 15.00 │ffffffff813493fa: pop %rbx 41.33 │ffffffff813493fb: → jmp ffffffff81bb5460 <__x86_return_thunk> then it really looks like cache misses and some slow indirect branch resolution for 'ret' (that __x86_return_thunk branch is misleading - it is rewritten at runtime, but 'perf report' shows the original object code). So some of it is presumably due to IBRS on this CPU, and that's a big part of make_vfsuid() in this bare-metal non-idmapped case. But the profile does clearly say that in generic_permission(), the problem is just the cache misses on the superblock accesses and all the extra work with the credential check, even when the mapping is the identity mapping. I still get a fair number of calls to make_vfsuid() even with my patch - I guess I'll have to look into why. This is my regular "full build of an already built kernel", which I *expected* to be mainly just a lot of fstat() calls by 'make' which I would have thought almost always hit the fast case. I might have messed something up. Linus
On Thu, Oct 31, 2024 at 08:14:59AM -1000, Linus Torvalds wrote: > On Wed, 30 Oct 2024 at 20:42, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I was kind of hoping that such cases would use 'cache_no_acl()' which > > makes that inode->i_acl be NULL. Wouldn't that be the right model > > anyway for !IS_POSIXACL()? > > Alternatively, just initialize it to NULL in inode_init_always_gfp(), eg like > > - inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED; > + inode->i_acl = inode->i_default_acl = > + (sb->s_flags & SB_POSIXACL) ? ACL_NOT_CACHED : NULL; IIRC, the reason we do not do that was the possibility of mount -o remount,acl Not that it makes much sense, but it's currently supported and POSIX ACLs don't make much sense to start with... Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode) wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl is ACL_NOT_CACHED, but we are going to dereference it shortly after in case we don't take the fast path. OTOH, that probably matters only for fairly specific loads - massive accesses to procfs and sysfs, mostly.
On Thu, 31 Oct 2024 at 12:02, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I still get a fair number of calls to make_vfsuid() even with my patch > - I guess I'll have to look into why. This is my regular "full build > of an already built kernel", which I *expected* to be mainly just a > lot of fstat() calls by 'make' which I would have thought almost > always hit the fast case. > > I might have messed something up. Added some stats, and on my load (reading email in the web browser, some xterms and running an allmodconfig kernel build), I get about a 45% hit-rate for the fast-case: out of 44M calls to generic_permission(), about 20M hit the fast-case path. So it's noticeable, but not *quite* as noticeable as I would have hoped for. I suspect there are a fair number of owner calls for write, and then because permissions are 0644, the code actually has to check the owner. Linus
On Thu, 31 Oct 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Anyway, patch looks sane; I still think that adding || !IS_POSIXACL(inode) > wouldn't hurt (yes, it's a dereference of ->i_sb in case when ->i_acl > is ACL_NOT_CACHED, but we are going to dereference it shortly after > in case we don't take the fast path. OTOH, that probably matters only > for fairly specific loads - massive accesses to procfs and sysfs, mostly. Yeah, so the reason I'd like to avoid it is exactly that the i_sb accesses are the ones that show up in profiles. So I'd rather start with just the cheap inode-only "ACL is clearly not there" check, and later if we find that the ACL_NOT_CACHED case is problematic do we look at that. Linus
On Thu, 31 Oct 2024 at 12:34, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I'd rather start with just the cheap inode-only "ACL is clearly not > there" check, and later if we find that the ACL_NOT_CACHED case is > problematic do we look at that. Actually, if I switch the tests around so that I do the permission bit check first, it becomes very natural to just check IS_POSIXACL() at the end (where we're about to go to the slow case, which will be touching i_sb anyway). Plus I can actually improve code generation by not shifting the mode bits down into the low bits, but instead spreading the requested permission bits around. The "spread bits around" becomes a simple constant multiplication with just three bits set, and the compiler will actually generate much better code (you can do it with two consecutive 'lea' instructions). The expression for this ends up looking a bit like line noise, so a comment explaining each step is a good idea. IOW, here's a rewritten patch that does it that way around, and thus deals with IS_POSIXACL() very naturally and seems to generate quite good code for me. Of course, while I actually tested the previous version after sending it out, this new version is entirely untested. Again. Linus
On Thu, Oct 31, 2024 at 03:17:18PM -1000, Linus Torvalds wrote: > On Thu, 31 Oct 2024 at 12:34, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So I'd rather start with just the cheap inode-only "ACL is clearly not > > there" check, and later if we find that the ACL_NOT_CACHED case is > > problematic do we look at that. > > Actually, if I switch the tests around so that I do the permission bit > check first, it becomes very natural to just check IS_POSIXACL() at > the end (where we're about to go to the slow case, which will be > touching i_sb anyway). > > Plus I can actually improve code generation by not shifting the mode > bits down into the low bits, but instead spreading the requested > permission bits around. > > The "spread bits around" becomes a simple constant multiplication with > just three bits set, and the compiler will actually generate much > better code (you can do it with two consecutive 'lea' instructions). > > The expression for this ends up looking a bit like line noise, so a > comment explaining each step is a good idea. > > IOW, here's a rewritten patch that does it that way around, and thus > deals with IS_POSIXACL() very naturally and seems to generate quite > good code for me. Acked-by: Al Viro <viro@zeniv.linux.org.uk>
On Thu, Oct 31, 2024 at 03:17:18PM -1000, Linus Torvalds wrote: > On Thu, 31 Oct 2024 at 12:34, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So I'd rather start with just the cheap inode-only "ACL is clearly not > > there" check, and later if we find that the ACL_NOT_CACHED case is > > problematic do we look at that. > > Actually, if I switch the tests around so that I do the permission bit > check first, it becomes very natural to just check IS_POSIXACL() at > the end (where we're about to go to the slow case, which will be > touching i_sb anyway). Applied.
So I realize that Christian already applied my patch, but I'm coming back to this because I figured out why it doesn't work as well as I thought it would.. On Thu, 31 Oct 2024 at 12:31, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Added some stats, and on my load (reading email in the web browser, > some xterms and running an allmodconfig kernel build), I get about a > 45% hit-rate for the fast-case: out of 44M calls to > generic_permission(), about 20M hit the fast-case path. So the 45% hit rate really bothered me, because on the load I was testing I really thought it should be 100%. And in fact, sometimes it *was* 100% when I did profiles, and I never saw the slow case at all. So I saw that odd bimodal behavior where sometimes about half the accesses went through the slow path, and sometimes none of them did. It took me way too long to realize why that was the case: the quick "do we have ACL's" test works wonderfully well when the ACL information is cached, but the cached case isn't always filled in. For some unfathomable reason I just mindlessly thought that "if the ACL info isn't filled in, and we will go to the slow case, it now *will* be filled in, so next time around we'll have it in the cache". But that was just silly of me. We may never call "check_acl()" at all, because if we do the lookup as the owner, we never even bother to look up any ACL information: /* Are we the owner? If so, ACL's don't matter */ So next time around, the ACL info *still* won't be filled in, and so we *still* won't take the fastpath. End result: that patch is not nearly as effective as I would have liked. Yes, it actually gets reasonable hit-rates, but the ACL_NOT_CACHED state ends up being a lot stickier than my original mental model incorrectly throught it would be. So the "45% hit rate" was actually on all the successful quick cases for system directory traversal where I was looking at files as a non-owner, but all my *own* files wouldn't hit the fast case. So that old optimization where we don't even bother looking up ACL's if we are the owner is actually hindering the new optimization from being effective. Very annoying. The "don't bother with ACL's if we are the owner" is definitely a good thing, so I don't want to disable one optimization just to enable another one. (Side note: the reason I sometimes saw 100% hit rates on my test was that a "make install" as root on my kernel tree *would* populate the ACL caches because now that tree was looked at by a non-owner). I'll think about this some more. And maybe somebody smarter than me sees a good solution before I do. Linus
On Thu, Nov 07, 2024 at 09:54:36AM -1000, Linus Torvalds wrote: > On Thu, 31 Oct 2024 at 12:31, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Added some stats, and on my load (reading email in the web browser, > > some xterms and running an allmodconfig kernel build), I get about a > > 45% hit-rate for the fast-case: out of 44M calls to > > generic_permission(), about 20M hit the fast-case path. > > So the 45% hit rate really bothered me, because on the load I was > testing I really thought it should be 100%. > > And in fact, sometimes it *was* 100% when I did profiles, and I never > saw the slow case at all. So I saw that odd bimodal behavior where > sometimes about half the accesses went through the slow path, and > sometimes none of them did. > > It took me way too long to realize why that was the case: the quick > "do we have ACL's" test works wonderfully well when the ACL > information is cached, but the cached case isn't always filled in. > > For some unfathomable reason I just mindlessly thought that "if the > ACL info isn't filled in, and we will go to the slow case, it now > *will* be filled in, so next time around we'll have it in the cache". > > But that was just silly of me. We may never call "check_acl()" at all, > because if we do the lookup as the owner, we never even bother to look > up any ACL information: > > /* Are we the owner? If so, ACL's don't matter */ > > So next time around, the ACL info *still* won't be filled in, and so > we *still* won't take the fastpath. > > End result: that patch is not nearly as effective as I would have > liked. Yes, it actually gets reasonable hit-rates, but the > ACL_NOT_CACHED state ends up being a lot stickier than my original > mental model incorrectly throught it would be. > How about filesystems maintaing a flag: IOP_EVERYONECANTRAREVERSE? The name is a keybordfull and not the actual proposal. Rationale: To my reading generic_permission gets called for all path components, where almost all of them just want to check if they can traverse. So happens for vast majority of real path components the x is there for *everyone*. Even in case of /home/$user/crap, while the middle dir has x only for the owner and maybe the group, everything *below* tends to also be all x. I just did a kernel build while poking at the state with bpftrace: bpftrace -e 'kprobe:generic_permission { @[(((struct inode *)arg1)->i_mode & 0x49) == 0x49] = count(); }' result: @[0]: 5623736 @[1]: 64867147 iow in 92% of calls everyone had x. Also note this collects calls for non-traversal, so the real hit ratio is higher so to speak. I don't use acls here so they were of no consequence anyway btw. So if a filesystem cares to be faster, when instatianating an inode or getting setattr called on it it can (re)compute if there is anything blocking x for anyone. If nothing is in the way it can the flag and allow link_path_walk to skip everything, otherwise *unset* the flag (as needed). This is completely transparent to filesystems which don't participate. So that would be my proposal, no interest in coding it.
On Thu, 7 Nov 2024 at 12:22, Mateusz Guzik <mjguzik@gmail.com> wrote: > > How about filesystems maintaing a flag: IOP_EVERYONECANTRAREVERSE? It's actually just easier if a filesystem just does cache_no_acl(inode); in its read-inode function if it knows it has no ACL's. Some filesystems already do that, eg btrfs has /* * try to precache a NULL acl entry for files that don't have * any xattrs or acls */ .... if (!maybe_acls) cache_no_acl(inode); in btrfs_read_locked_inode(). If that 'maybe' is just reliable enough, that's all it takes. I tried to do the same thing for ext4, and failed miserably, but that's probably because my logic for "maybe_acls" was broken since I'm not familiar enough with ext4 at that level, and I made it do just /* Initialize the "no ACL's" state for the simple cases */ if (!ext4_test_inode_state(inode, EXT4_STATE_XATTR) && !ei->i_file_acl) cache_no_acl(inode); which doesn't seem to be a strong enough text. Linus
fs/namei.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/fs/namei.c b/fs/namei.c index 4a4a22a08ac2..6aeabde0ec9f 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -326,6 +326,25 @@ static int check_acl(struct mnt_idmap *idmap, return -EAGAIN; } +/* + * Very quick optimistic "we know we have no ACL's" check. + * + * Note that this is purely for ACL_TYPE_ACCESS, and purely + * for the "we have cached that there are no ACLs" case. + * + * If this returns true, we know there are no ACLs. But if + * it returns false, we might still not have ACLs (it could + * be the is_uncached_acl() case). + */ +static inline bool no_acl_inode(struct inode *inode) +{ +#ifdef CONFIG_FS_POSIX_ACL + return likely(!READ_ONCE(inode->i_acl)); +#else + return true; +#endif +} + /** * acl_permission_check - perform basic UNIX permission checking * @idmap: idmap of the mount the inode was found from @@ -348,6 +367,19 @@ static int acl_permission_check(struct mnt_idmap *idmap, unsigned int mode = inode->i_mode; vfsuid_t vfsuid; + /* + * Common cheap case: everybody has the requested + * rights, and there are no ACLs to check. No need + * to do any owner/group checks. + */ + if (no_acl_inode(inode)) { + unsigned int all; + all = mode & (mode >> 3); // combine g into o + all &= mode >> 6; // ... and u into o + if (!(mask & ~all & 7)) + return 0; + } + /* Are we the owner? If so, ACL's don't matter */ vfsuid = i_uid_into_vfsuid(idmap, inode); if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {